Cheating on Subversion Without Getting Caught

2010-08-12 00:00:00 -0700

This post lightly explores the problem of “automatically” backing up a git repository to subversion. Why would anyone want to do this? Well, if your organization has a policy that all code must make it into subversion, but your team is interested in leveraging git in a deeper way than just by using git-svn as a sexy subversion client, then you’ll find yourself pondering the question of repository synchronization.

For our purposes, we’re guided by the following requirements:

A backup of the code, at least periodic (no more than 24 hours old), must exist in subversion. While it’s not strictly necessary to have a perfect commit history stored in subversion, all code should be present and buildable with minimal fuss.
Code must exist in git with full history.
Branching & tagging must remain usable and should be properly translated between svn and git.
To the extent possible, non-linear git histories should be handled.
It is not required that all features of git be usable (e.g., sub-modules), but there should be viable workarounds for each feature we sacrifice.
It should be possible to have an external, public git mirror of the source code.
There should be a way to merge changes committed to forks of the public mirror back into the system.

Given the requirements, there are at least two flows that make sense:

Subversion Primary is an arrangement where all commits are made into subversion (using the client of your choice, possibly git-svn) and “read only” git mirrors of the subversion repository support the other requirements.
The Dumb Subversion Backup flips the order around and says that there is a single git repo that is the source of truth, which is automatically backed up to subversion (perhaps in a lossy fashion).

The remainder of this document will discuss these two approaches.

The Subversion Primary

In this scenario, all commits funnel first through the subversion repository. There is an automated process that polls or pushes changes from subversion into an internal git staging repository. That staging repository uses git-svn to map the complete subversion commit history into git, and performs some processing to deal with mapping branches and tags among other things.

The internal staging repository can then mirror to an external repository which is visible to all for code review and contributions. External contributors can fork, change, and send pull requests in any form they like. Internal developers can then manage pulling commits into the main svn repository.
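The two hops described above (subversion to staging, staging to public mirror) could be scripted roughly as follows. This is a minimal sketch, not the setup actually used here: the staging path and the remote name public are made-up placeholders, and it assumes the staging repository was created with git svn clone. With DRY_RUN=1 it only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch of a periodic sync job: SVN -> staging repo -> public mirror.
# STAGING and MIRROR are hypothetical defaults; adjust to your own layout.
STAGING=${STAGING:-/srv/git/staging}
MIRROR=${MIRROR:-public}

# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

sync_once() {
  # Pull any new SVN revisions into the git-svn remote refs.
  run git -C "$STAGING" svn fetch || return 1
  # (any post-processing of branches and tags would run here)
  # Publish branches and tags; --all and --tags go in separate pushes.
  run git -C "$STAGING" push "$MIRROR" --all || return 1
  run git -C "$STAGING" push "$MIRROR" --tags
}
```

Calling sync_once from cron gives the polling variant; calling it from a subversion post-commit hook gives the push variant.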

Key Features

The Subversion Primary is a straightforward solution that has the following characteristics:

Internal devs interact only with subversion and can use the client tool of their choice
Downtime that affects the internal git repo or the public git repo does not affect internal developers
All git repositories are ephemeral, and their loss or corruption is Not A Disaster.
External developers are insulated from the fact that SVN is in use, except for the git-svn artifacts attached to commit messages

Challenges

There are a couple of obstacles to surmount in this arrangement, and we’ll briefly explore them here.

Username Mapping

One problem that exists with git-svn is that when commits are pulled from subversion the ‘author’ field is meaningless. Here’s an example:

commit f6c77de3df60340b01abb219cbe7215a93dcdc9c
Author: lloydh <lloydh@3fcf768e-b076-0410-b9b8-e8c34e2d470b>
Date: Thu Jul 29 19:36:48 2010 +0000

    update changelog with the skinny on 2.9.7

    git-svn-id: svn+ssh://svncorp/yahoo/BrowserPlus/public_platform/trunk@793 3fcf768e-b076-0410-b9b8-e8c34e2d470b

In order to map internal svn usernames (e.g. lloydh) to meaningful author entries (e.g. “Lloyd Hilaiel <lloyd@somewhe.re>”) we can use mechanisms built into git-svn. Simply create an authors.txt file that looks something like:

lloydh = Lloyd Hilaiel <lloyd@somewhe.re>
dgrigsby = David Grigsby <dg@wordmaven.org>
... etc

Having created this file, use the -A argument on all git svn operations. From the git-svn manual:

-A<filename>, --authors-file=<filename>
    Syntax is compatible with the file used by git cvsimport:

        loginname = Joe User <user@example.com>

If this option is specified and git svn encounters an SVN committer name that does not exist in the authors-file, git svn will abort operation. The user will then have to add the appropriate entry. Re-running the previous git svn command after the authors-file is modified should continue operation.
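Since git svn aborts on any committer missing from the file, it helps to start from a complete list. One way to get there, sketched under the assumption that svn log -q prints its usual "rNNN | login | date" header lines, is to generate a stub entry for every login and then fill in real names by hand. The function name authors_stub and the example.com addresses are placeholders.

```shell
#!/usr/bin/env bash
# Read `svn log -q` output on stdin and print one authors.txt stub line per
# distinct committer. The name and address parts are placeholders that are
# meant to be edited by hand afterwards.
authors_stub() {
  # Header lines look like: r793 | lloydh | 2010-07-29 19:36:48 +0000 (...)
  awk -F ' [|] ' '/^r[0-9]+ [|] / { print $2 }' | sort -u |
    while read -r login; do
      printf '%s = %s <%s@example.com>\n' "$login" "$login" "$login"
    done
}
```

Typical use would be something like svn log -q on the repository URL piped into authors_stub and redirected to authors.txt, followed by a manual pass to replace the placeholder names and addresses.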

Mapping SVN branches and tags to git

Another problem is how SVN branches and tags appear in git. After a git svn clone you can view the remote branches with git branch -r. You’ll see something like this:

...
sdk_and_installer
service_api_v5
tags/2.10.0
tags/2.10.1
tags/2.10.2
tags/2.5.0
...

Because subversion has no proper notion of tags, you’ll notice that the tags set in subversion are branches in git. If you want your published git repository to look reasonable to an average git user, it’d be nice to turn these into proper git tags (and the remaining remote branches into local branches). To solve this problem I wrote a small shell script designed to be run after fetching changes from SVN.

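#!/usr/bin/env bash
# Run after each fetch from SVN: turn git-svn's remote refs into tags and local branches.
for r in $(git branch -r | grep -v trunk); do
  if [[ "$r" == tags/* ]]; then
    # SVN tags arrive as remote branches under tags/; make them real git tags
    tn=${r#tags/}
    git tag -f "$tn" "refs/remotes/tags/$tn"
  else
    # everything else becomes a local branch tracking the git-svn remote ref
    git branch --track -f "$r" "refs/remotes/$r"
  fi
done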

Accepting contributions

Given the flexibility of git, there are many tactics for merging changes forked off of the public mirror back into subversion’s linear view of the world: the least sophisticated approach would be to manually do the merge using diff and her friend patch and perhaps add a little --fuzz. A much nicer way would be to cherry-pick or rebase changes directly into a git repository that is svn cloned from the original, and then to dcommit back.

This latter approach seems like it would be much faster and less error prone, but there’s at least one gotcha: to simplify the merging of changes, one should ensure that commits are identical between the public repository and the version that’s cloned from SVN. This means that when an internal developer clones a git repository from SVN, they should use the same authors.txt file as is used by the internal git repo (which pulls from subversion and publishes to external). Additionally, the internal developer’s clone should refer to the subversion repository in the same way as the internal staging git repo does, because git-svn embeds the repository URL in every commit message (the git-svn-id line) and the message is part of what git hashes. Specifically, the hostname must be the same in both cases (they might diverge when using ssh tunnels and aliases, DNS shorthands, or IP addresses).

Once commits are bitwise identical in the developer’s svn clone and the public mirror, it’s possible to use the advanced merging and rebasing features of git in a straightforward way to absorb external changes and then dcommit them to SVN. Even if circumstances make it impossible to get bitwise identical commits, it’s still possible to fetch an entire remote branch and cherry-pick individual commits, an approach which is perhaps trivially better than manual diffing and patching.
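As a rough illustration of the cherry-pick route, here is a sketch meant to be run inside a developer’s git-svn clone. The remote name contrib, the function name absorb, and the DRY_RUN switch are all invented for the example; it also assumes the contributor’s branch contains no merge commits, which cherry-pick will not replay without extra options.

```shell
#!/usr/bin/env bash
# Sketch: absorb an external contribution and send it on to SVN.
# "contrib" is a made-up remote name; fork URL and branch come from the caller.

# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

absorb() {
  local fork_url=$1 branch=$2
  run git remote add contrib "$fork_url" || return 1
  run git fetch contrib || return 1
  # When commits are identical on both sides, this range is exactly the
  # contributor's new work; cherry-pick replays it onto the current branch.
  run git cherry-pick "HEAD..contrib/$branch" || return 1
  # Send the replayed commits to subversion, one SVN revision per commit.
  run git svn dcommit
}
```

If the commits are not identical, the HEAD..contrib/branch range drags in the fork’s entire history, which is why matching authors.txt files and repository URLs matter so much.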

NOTE: In order to artificially join histories without bitwise identical commits (messages or authors), one might try the “ours” merge strategy. I’ve no personal experience with this approach.

I think that whatever strategy you end up employing, the key is that the person or people on the team who are most in love with git should be the ones integrating external contributions.

The Dumb Subversion Backup

In this arrangement, rather than all commits flowing through subversion, it’s simply a mechanism for backing up the git repository. The key idea here is that there is a single git repository that is backed up to subversion. While the other requirements call for publishing the repo externally and pulling contributions internally, these tasks can be accomplished in a thousand different ways using the built-in features of git. Restated, the only element of this arrangement worth discussing is the challenge of actually synchronizing changes from git to subversion.

Key Features

All developers interact with git, while subversion is a hidden implementation detail
The internal team is on the hook to set up and maintain a more complex system than with the Subversion Primary arrangement
Inevitably, there will be information stored in the many git clones that cannot be represented in subversion, so restoring from the subversion backup would imply some loss of information.
External developers see no artifacts which suggest that subversion is used as a backing store

Challenges

While this arrangement undeniably requires more systems administration, most of it is straightforward. The setup of the main internal git repo will probably require the installation of some scripts to allow multiple users to push changes over ssh to the same repository, like gitosis or gitolite.

The real challenge in this setup, however, is this: given a git repository, how do you back it up into subversion? The range of solutions includes:

1. periodically take snapshots of the main branch of development (think, literally, a git archive), and overlay them on the current subversion view of the world. commit. prosper.
2. the same as #1, but do so for all branches present in the git repository (automatically bring branches into existence in SVN as they’re created in git)
3. #1 or #2 but on a per commit basis, with log message (and author?) preservation
4. invent a system that will actually use git svn dcommit to perform the commits, squashing non-linear histories into a linear history as it goes.

So, I’ve seen #1 in place on a successful, vibrant, large scale project. Depending on the intent of the policy that requires a backup to SVN (and how much you really care), this may or may not fit the bill.
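The fiddly part of the snapshot approach is that an overlay alone never deletes anything: files removed in git linger in subversion unless something notices. A minimal sketch of that bookkeeping follows, assuming you have already produced two path lists sorted with LC_ALL=C. The function name snapshot_plan and the A/D output format are invented for illustration, and this is not the script used by the project mentioned above.

```shell
#!/usr/bin/env bash
# Sketch: work out what a snapshot commit has to add to and delete from SVN.
# Inputs are two files of paths, one per line, both sorted with LC_ALL=C:
#   $1  files in the git snapshot, e.g. from: git ls-tree -r --name-only master
#   $2  files versioned in SVN,    e.g. from: svn list -R "$WC" | grep -v '/$'
# Output: "A <path>" for files to svn add, "D <path>" for files to svn rm.
# Files present on both sides need no entry; the overlay overwrites them.
snapshot_plan() {
  LC_ALL=C comm -23 "$1" "$2" | sed 's/^/A /'   # only in git: add
  LC_ALL=C comm -13 "$1" "$2" | sed 's/^/D /'   # only in svn: delete
}
```

The overlay itself can then be a git archive of the branch piped into tar -x -C on the subversion working copy, followed by applying the plan and a single svn commit.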

Options 3 and 4 are the really interesting ones. They would seem to be possible with a little bit of well-thought-out scripting on top of the existing facilities of git-svn, and hold the potential to capture a significant subset of the information that’s represented in the git original: preserving commit messages and authorship, and performing admirably when smashing non-linear histories into something subversion can understand.

The path to implement option 3 would be similar to fairly well documented tactics for moving a single git branch with a (mostly) linear history into subversion, but would have to take into account multiple branches. Specifically, some areas of difficulty would seem to be:

1. keeping cruft out of the public git repository: when dcommitting using git-svn, the branch that you dcommit from is changed as meta-data is embedded into the commits. In order to use git-svn you would need to design a mechanism that could merge changes from the branch where a commit lands into a branch which has been associated with a subversion branch, and dcommit from there.
2. incremental commits: An extension of #1. As commits are made into the internal git repository, you would then need to move them into the branch from which you can dcommit. Doing so requires that you establish a merge base (common ancestor) between the pristine actual branch in git and the git branch that’s associated with subversion.
3. non-linearity and git merges: Whenever you have a merge commit in git (a commit with more than one parent), you create an instance of non-linearity where git-svn will preserve changes but lose commit history. There may be cases where git-svn will simply break (!?)
4. sub-modules: git sub-modules (which point to remote git repositories) cannot be mapped onto svn:externals (which point to remote subversion repositories), so there would need to be a mechanism that could either fetch and commit the remote sub-module, or
5. changing history: in git it’s very possible to change the history of a remote repository. Subversion would get really pissed and the whole mechanism would probably go to hell.

Hairy, no?

Non-linearity and git-svn

Above it was mentioned that git-svn will attempt to collapse non-linear histories in git into something linear in subversion. A concrete example follows. This is your commit history in git:

* fef76c5 Merge branch 'newb'
|\
| * d49b6d4 a third change on the newb branch
| * 7a1ecfa a second commit on the newb branch
| * 9197078 commit #1 on the newb branch
* | 9332815 (testing) master branch commit
|/
* d1b9181 Merge branch 'test_branch'
|\
| * 2f74914 a test commit
* | 792cc2c a test commit on the main branch
|/
* a01c1ff fix svn:ignore props

This is your git commit history once dcommitted via git-svn:

* 4090f22 Merge branch 'newb'
* c3a85f5 (testing) master branch commit
* 4ad5b30 Merge branch 'test_branch'
* 792cc2c a test commit on the main branch
* a01c1ff fix svn:ignore props

When looking at the commits, changes 9197078, 7a1ecfa, and d49b6d4 in the former non-linear history are all combined into 4090f22 once they’ve been routed through git-svn.
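In this example the commits that survive are exactly the ones on the first-parent chain of the branch, and everything reachable only through a merge’s second parent gets folded into the merge. Assuming that pattern holds in general (it is inferred from the example here, not checked against git-svn’s source), the commits at risk can be listed before dcommitting. The sketch below works on the text output of git rev-list --parents, and the function name folded_commits is made up.

```shell
#!/usr/bin/env bash
# Sketch: list commits that are NOT on the first-parent chain of a branch.
# stdin:  output of `git rev-list --parents <branch>`, i.e. one line per
#         commit: its hash followed by the hashes of its parents.
# stdout: hashes reachable only through a second (or later) merge parent.
folded_commits() {
  awk '
    NR == 1 { tip = $1 }                 # first line is the branch tip
    { order[NR] = $1; first[$1] = $2 }   # remember each first parent
    END {
      # Walk tip -> first parent -> first parent ... and mark the chain.
      c = tip
      while (c != "") { on_chain[c] = 1; c = first[c] }
      # Everything else would lose its individual commit in SVN.
      for (i = 1; i <= NR; i++)
        if (!(order[i] in on_chain)) print order[i]
    }'
}
```

Fed the history shown above, this lists the three newb commits plus 2f74914 from test_branch. Inside a real repository, git log --first-parent --oneline shows the complementary view: the chain that is kept.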

Now What?

I’ve implemented a Subversion Primary system myself, with great success, and I’ve seen the “periodic snapshot” approach used effectively to implement a Dumb Subversion Backup. While it seems like one could go further and back up more information from git incrementally into subversion, I fear the resultant system might be too brittle and high maintenance to make any sense. What do you think? Have you ever seen an automatic git -> subversion backup that didn’t suck?