Changing the git repository encoding

Hello. Because of the specifics of my workplace, we use Linux with a KOI8-R locale, and all commits to our git repository were made in that local encoding. After some time we decided to convert the repository to UTF-8. In this article I describe how to change the encoding of an existing git repository and how to fix some mistakes made in individual commits along the way.

Warning


The procedure effectively creates a new repository. Before starting, you therefore need to suspend development and merge all changes into one nominally central repository, which is the one that will be converted. Once the resulting repository has been checked, it must be cloned again on every machine.
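As an extra precaution that the procedure above does not spell out, an untouched mirror of the central repository can be kept before anything is rewritten. This is only a sketch: the scratch directory, the repository names central and central-backup.git, and the demo identity are all made up for illustration.

```shell
# Scratch stand-in for the central repository (hypothetical paths).
work=$(mktemp -d)
cd "$work"
git init -q central
git -C central config user.name "Demo User"
git -C central config user.email "demo@example.com"
git -C central commit -q --allow-empty -m "initial state"

# Keep an untouched mirror of every branch and tag before rewriting anything.
git clone -q --mirror central central-backup.git

# Both repositories should report the same tip commit at this point.
orig_tip=$(git -C central rev-parse HEAD)
backup_tip=$(git -C central-backup.git rev-parse HEAD)
echo "central: $orig_tip"
echo "backup:  $backup_tip"
```

If the conversion goes wrong, the mirror still holds the original commits and can be cloned from as usual.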

Git and encoding


Git operates on binary data, so it does not care about the encoding of file contents. Commit messages are likewise stored exactly as they were passed to git, but each commit can also carry an encoding header, which is used later when the message is requested. If the encoding header is empty, git treats it as UTF-8.

Two parameters in the [i18n] section control this behavior:

[i18n]
        commitencoding = UTF-8
        logoutputencoding = KOI8-R

The first one sets the value of the encoding header written by the git commit and git commit-tree commands. The second tells the git log, git show, and git blame commands which encoding the message text should be transcoded into before it is shown to the user. If neither parameter is set, git treats logoutputencoding as UTF-8; if only the first parameter is set, git uses its value for the second as well.
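The same two parameters can also be set from the command line with git config instead of editing the file by hand. A minimal sketch, assuming a throwaway repository created only for the demonstration (the scratch directory is not part of the original setup):

```shell
# Create a throwaway repository so a real one is not touched.
repo=$(mktemp -d)
cd "$repo"
git init -q

# Encoding recorded in the header of newly created commits.
git config i18n.commitencoding UTF-8
# Encoding that git log, git show and git blame transcode messages into.
git config i18n.logoutputencoding KOI8-R

# Read the values back from the [i18n] section.
commit_enc=$(git config --get i18n.commitencoding)
log_enc=$(git config --get i18n.logoutputencoding)
echo "commitencoding=$commit_enc logoutputencoding=$log_enc"
```

Without --global the values are written to the repository's own .git/config, so they affect only that repository.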

This can lead to various errors. For example, if the encoding header of a commit does not match the actual encoding of its message but is equal to the value of logoutputencoding, git decides that no transcoding is needed and prints the message as is. On machines whose locale uses the same encoding as the message, the text is displayed correctly, while everyone else sees garbage.

To view the value of a commit's encoding header, use the following command:

git log --pretty="%h - '%e': %s"
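To see what the %e placeholder reports, here is a small sketch that creates one commit in a scratch repository with i18n.commitencoding set to KOI8-R and then prints the header. The repository, the identity, and the message are invented for the demo.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"
git config i18n.commitencoding KOI8-R

# An ASCII message is enough here; only the recorded header matters.
git commit -q --allow-empty -m "first commit"

# %e expands to the value of the commit's encoding header.
header=$(git log -1 --pretty="%e")
echo "encoding header: '$header'"

# The same header is visible in the raw commit object.
git cat-file commit HEAD | grep '^encoding '
```

A commit made while i18n.commitencoding is UTF-8 (or unset) has no such header, and %e prints an empty string for it.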

More about the capabilities of the git log command can be found here.

Git filter-branch


Now we come to the main topic of this article. To "rewrite the history" of an existing repository, the git filter-branch command is used. It replays all commits in sequence, first passing their files or metadata through various filters.

This article uses three filters:
  • --msg-filter - rewrites commit messages;
  • --env-filter - changes the environment in which the commit is made (author name, e-mail address, etc.);
  • --tag-name-filter - rewrites tag names.

Each filter is followed by a command that git filter-branch runs before recording the commit.

To process the entire repository, pass the --all parameter, separated from the filters by an additional --, and specify HEAD as the target. Tags also have to be rewritten so that they point to the new commits; to do this, add the tag-name filter with the cat command:

git filter-branch <filters> --tag-name-filter 'cat' -- --all HEAD

Before changing the encoding of the commit messages, do not forget to set the correct value of i18n.commitencoding: it will be written to the headers of all commits produced by the rewrite.

To convert the encoding of the messages, use the following command:

'iconv -c -s -f KOI8-R -t UTF-8'

  • -s - silent mode (warnings are suppressed);
  • -c - skip characters that cannot be converted.
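As a quick check of what this filter does to a message, the sketch below feeds it the word "тест" encoded in KOI8-R and prints the UTF-8 result. The sample word is an illustration chosen here; its bytes are written as octal escapes so that the script itself stays plain ASCII.

```shell
# "тест" in KOI8-R is the four bytes D4 C5 D3 D4 (octal 324 305 323 324).
converted=$(printf '\324\305\323\324\n' | iconv -c -s -f KOI8-R -t UTF-8)

# Print the converted text; on a UTF-8 terminal it is readable Cyrillic.
printf '%s\n' "$converted"

# Dump the resulting bytes in hex: UTF-8 uses two bytes per Cyrillic letter.
printf '%s' "$converted" | od -An -tx1
```

Running the same pipeline by hand on a real message from the repository is a cheap way to confirm that the source encoding was guessed correctly before rewriting anything.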

The git filter-branch command then takes the following form:

git filter-branch --msg-filter 'iconv -c -s -f KOI8-R -t UTF-8' \
--tag-name-filter 'cat' -- --all HEAD
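The whole procedure can be rehearsed on a scratch repository first. The sketch below is only an illustration under a few assumptions: the repository, identity, tag name v1, and sample message are invented; FILTER_BRANCH_SQUELCH_WARNING=1 merely silences the notice that newer git versions print before running filter-branch; and the trailing HEAD is left out because --all already covers the only branch of this scratch repository.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

# Step 1: make a commit the "old" way, with a KOI8-R message and header.
git config i18n.commitencoding KOI8-R
printf '\324\305\323\324\n' | git commit -q --allow-empty -F -
git tag v1

# Step 2: choose the header value that the rewritten commits will receive.
git config i18n.commitencoding UTF-8

# Step 3: transcode every message and keep tags pointing at the new commits.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch \
    --msg-filter 'iconv -c -s -f KOI8-R -t UTF-8' \
    --tag-name-filter 'cat' -- --all

# Inspect the raw message bytes and the header of the rewritten commit.
new_msg=$(git cat-file commit HEAD | sed '1,/^$/d')
new_header=$(git log -1 --pretty="%e")
printf 'message: %s, encoding header: "%s"\n' "$new_msg" "$new_header"
```

Reading the message with git cat-file rather than git log avoids any transcoding on output, so the bytes shown are the ones actually stored in the commit.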

Since "rewriting history" seriously disrupts the workflow, it makes sense (if you do decide to go through with it) to fix as many mistakes as possible in the same pass. These may be incorrectly set environment parameters, files that should not be in the repository, the encoding or part of the contents of individual files, and so on.

In particular, I found that the author's e-mail was set incorrectly in a couple of commits. Since at that point all commits had been made by me, the problem was solved simply by overwriting this parameter in every commit:

git filter-branch --msg-filter 'iconv -c -s -f KOI8-R -t UTF-8' \
--env-filter 'export GIT_AUTHOR_EMAIL="xxx@gmail.com"; export GIT_COMMITTER_EMAIL="xxx@gmail.com"' \
--tag-name-filter 'cat' -- --all HEAD

Of course, nothing prevents you from using more complex constructs with various conditions.
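One such conditional construct might look like the sketch below, which replaces the address only in commits where it is wrong and leaves other authors alone. The scratch repository and the addresses wrong@example.com, fixed@example.com, and other@example.com are hypothetical; FILTER_BRANCH_SQUELCH_WARNING=1 only silences the notice printed by newer git versions.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "other@example.com"

# One commit with a mistyped address, one with a different, valid address.
GIT_AUTHOR_EMAIL="wrong@example.com" GIT_COMMITTER_EMAIL="wrong@example.com" \
    git commit -q --allow-empty -m "first"
git commit -q --allow-empty -m "second"

# Rewrite the address only where it matches the mistyped one.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch --env-filter '
if [ "$GIT_AUTHOR_EMAIL" = "wrong@example.com" ]; then
    GIT_AUTHOR_EMAIL="fixed@example.com"
fi
if [ "$GIT_COMMITTER_EMAIL" = "wrong@example.com" ]; then
    GIT_COMMITTER_EMAIL="fixed@example.com"
fi
export GIT_AUTHOR_EMAIL GIT_COMMITTER_EMAIL
' --tag-name-filter 'cat' -- --all

# List author and committer addresses, newest commit first.
emails=$(git log --pretty="%ae %ce")
printf '%s\n' "$emails"
```

The same --env-filter argument can be combined with the --msg-filter shown earlier, so the messages and the addresses are fixed in a single pass.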

In general, the git filter-branch command offers very rich functionality for modifying and repairing a git repository. You can read about all of its features here.
