GreyCat August 28, 2013 at 12:25

Merging legacy history into a tree: finding the optimal branch point

As part of my job, I inherited a system with about 15 years of history and several dozen installations in different organizations. The system itself is fairly small (~25K lines of code, ~1K commits), but its release management was a problem:

there was a main tree in Subversion (originally in CVS, of course) that carried the "party line": large-scale changes were made there, new features were added, global bugs were fixed, and so on
individual installations were created by:
- at best, svn checkout, later updated via svn update; in almost every installation, local modifications were made "live" (at the very least, configuration files were edited) and never committed anywhere; if a later svn update produced a conflict with upstream changes, the programmer running the update resolved it in place, again without any change tracking
- at worst, svn export, which of course was never updated afterwards and stayed forever (or at least until management changed its mind) at the state of the export date; in especially neglected cases (late 1990s to early 2000s), this was done because a checkout was physically impossible: the organization had no Internet access, so the archive was brought on a floppy disk and deployed on site once

In practice, of course, the grateful customers of this system still want support from time to time: bug fixes, and sometimes even global improvements to the system core.

After a brief discussion, we concluded that continuing to support such a distributed system in svn was impractical and decided to migrate to git.

Problem number one, moving the master tree from svn to git, was solved fairly easily with the standard git-svn tools.
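For readers who have not done this step before, the conversion might look roughly like the sketch below. The post only says that the standard git-svn tools were used, so everything here is an assumption: the function name migrate_svn_to_git, the trunk/branches/tags layout implied by --stdlayout, and the authors file that maps svn user names to git identities.

```shell
#!/bin/sh
# Hypothetical helper, not from the original post: clone an svn repository
# into a new git repository with git-svn.
migrate_svn_to_git() {
	if [ $# -ne 3 ]; then
		echo "Usage: migrate_svn_to_git <svn-url> <authors-file> <target-dir>" >&2
		return 2
	fi
	# --stdlayout assumes trunk/, branches/ and tags/ at the top of the svn
	# repository; drop it if the repository has no such layout.
	# --authors-file maps svn user names to "Name <email>" git identities.
	git svn clone --stdlayout --authors-file="$2" "$1" "$3"
}
```

The function only wraps a single git svn clone call; for a repository of about 1K commits, a one-off conversion like this is usually all that is needed.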

Problem set number two, merging the numerous forks from different installations into this tree, we decided to handle "as they arrive." Whenever another organization woke up, we had to:

1. get their fork
2. figure out where it was originally forked from and which revision it was last updated to (if it is an svn checkout)
3. create a new branch for this fork
4. try to split the accumulated changes into smaller, more or less semantically related pieces and commit them all to this branch
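For an svn checkout, the second step can start from the revision recorded in the working copy. As a minimal sketch (the helper name svn_info_revision is hypothetical), the following extracts the "Revision:" field from svn info output; in a git-svn clone, git svn find-rev can then translate that revision into a commit hash.

```shell
#!/bin/sh
# Hypothetical helper: read `svn info` output on stdin and print the
# working copy revision (the "Revision:" field), or nothing if absent.
svn_info_revision() {
	sed -n 's/^Revision: *\([0-9][0-9]*\) *$/\1/p' | head -n 1
}

# Intended use (needs an svn client that can read the fork's working copy;
# LC_ALL=C keeps the field names in English):
#   REV=$(cd /path/to/fork && LC_ALL=C svn info | svn_info_revision)
#   git svn find-rev "r$REV"    # run inside the git-svn clone
```

This only narrows the search: a working copy can hold mixed revisions (svnversion reports those as a range), and the revision number says nothing about uncommitted local edits.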

The main snag turned up unexpectedly at step 2: figuring out where a given installation was forked from. With an svn checkout you could at least look at the current state of the working copy; with an svn export, guessing was not trivial. After a couple of rounds of semi-manual archaeology on the state of the code, I got tired of it and decided to automate the search. I found no ready-made solution (git bisect, unfortunately, does not fit here), so I ended up with the following script:

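#!/bin/sh -ef
# Rank every commit of a git repository by how little it differs from a fork directory
if [ $# -ne 2 ]; then
	echo "Usage: $0 <git-repo> <candidate-dir>" >&2
	exit 1
fi
GIT_REPO="$1"
# Resolve to an absolute path before changing directory
CANDIDATE_DIR=$(cd "$2" && pwd)
TAB=$(printf '\t')
cd "$GIT_REPO"
COMMITS=$(git log --all --format=format:%H)
# Remember the current branch (or the commit, if HEAD is detached)
CURRENT=$(git symbolic-ref --quiet --short HEAD || git rev-parse HEAD)
for C in $COMMITS; do
	git checkout --quiet "$C"
	# Print "<hash><TAB>", then the diff size in lines (printf is more portable than echo -n)
	printf '%s\t' "$C"
	diff -urN --exclude=.git --exclude=.svn "$CANDIDATE_DIR" . | wc -l
done | sort -t"$TAB" -k2,2n
# Restore the original checkout
git checkout --quiet "$CURRENT"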

The script takes two parameters: (1) the path to the git repository and (2) the path to the fork candidate whose "insertion" point in the project's development tree needs to be found. It simply computes the size of the diff (in lines) between each commit's checkout and the candidate. The commit with the smallest diff is, with high probability, the best place to base the branch on. The output looks something like this:

3810315aaa238e32a7106312f9973f1d1f0ea097 651
19b595d87eecc43933ea60d89882319c7ac3f512 835
989cee69664733b773a4a81cc49e2a1a0cdff38a 872
9026dae1154f98018c808b73c7f1c6cd09310dc7 885
802943edf287ad28d5e71a57510400afacb49176 894
c5bd4050fce754e16664e6e1eeb57a4ff3ed06c6 894
dcb70c4a2e9fc0431ceb6154ecd1688189362622 908
...
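To get a feel for the numbers in the second column, the same measure can be tried on two small directories. The sketch below is only an illustration: the function name diff_volume and the sample files are made up, while the diff invocation is the one the script uses. Because the count includes unified-diff headers and context lines, it is a rough distance rather than an exact number of changed lines.

```shell
#!/bin/sh
# Hypothetical helper mirroring the script's measure: the number of lines
# in a recursive unified diff between two directories.
diff_volume() {
	# diff exits with status 1 when the trees differ; only the line count matters here
	diff -urN --exclude=.git --exclude=.svn "$1" "$2" | wc -l | tr -d ' '
}

# Build two tiny sample trees (made-up content, for illustration only)
WORK=$(mktemp -d)
mkdir "$WORK/upstream" "$WORK/fork"
printf 'host=localhost\nport=80\n' > "$WORK/upstream/app.conf"
cp "$WORK/upstream/app.conf" "$WORK/fork/app.conf"

SAME=$(diff_volume "$WORK/upstream" "$WORK/fork")   # identical trees
printf 'port=8080\n' >> "$WORK/fork/app.conf"       # a local "live" edit
EDITED=$(diff_volume "$WORK/upstream" "$WORK/fork")

echo "identical: $SAME, after local edit: $EDITED"
rm -rf "$WORK"
```

Identical trees give zero and every local edit pushes the count up, so the first line of the ranking above is the commit closest to the fork.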

With a ranking like the one above, the problem will most likely be solved roughly like this:

$ git branch new-organization 3810315aaa238e32a7106312f9973f1d1f0ea097
$ git checkout new-organization
$ cp -r ../new-organization-fork/* .

... after which you can work through the changes, try to split them into parts, and commit them (perhaps even with --date and --author, if you can work them out).
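As one way to record that provenance, git commit accepts --author and --date directly. The sketch below is an illustration under assumptions: the function name commit_as, the author, the date, and the message are placeholders, and the changes are assumed to be staged already (for example with git add -p, which helps split a large diff into pieces).

```shell
#!/bin/sh
# Hypothetical helper: commit the staged changes on behalf of the original
# author, with the date the change is believed to have been made.
commit_as() {
	# $1 = author as "Name <email>", $2 = date, $3 = commit message
	# --date sets only the author date; the committer date stays "now"
	# unless GIT_COMMITTER_DATE is exported as well.
	git commit --quiet --author="$1" --date="$2" -m "$3"
}

# Example call (placeholder values):
#   git add -p
#   commit_as "J. Doe <jdoe@example.org>" "2009-03-14T12:00:00" "Local config tweaks"
```

Keeping the real committer date while backdating only the author date leaves a visible trace that the history was reconstructed after the fact.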

I would be glad if this solution proves useful to someone else. Comments and tips on how to do it better are welcome.
