There is a common saying around version control systems stating the following:
Do not rewrite the history.
And it is a pretty solid saying, to be fair, backed up in many threads, for instance FS#45425. You simply have to assume that once you have pushed something to the public, it should stay as it is. But there are times when cleaning up the mess is not just required, but may also pay off in the long term.
Rewriting history will almost certainly not be possible for public projects where others have added commits on top of yours. But for a freshly created repository somewhere in the forgotten parts of the Internet, the leap of faith might be worth taking. For work that has not been pushed yet, cleaning up is almost always safe and strongly encouraged, so knowing efficient tools to get the job done is essential. Even better is to know a tool that gets the job done without putting the user at risk of unrecoverable damage in the form of mangled history.
Getting started
There is one more saying that is especially relevant in this context, which states:
Always keep multiple backups.
This saying cuts even deeper. Nothing is 100% reliable. Before continuing, back up your work. Some software has a proven, battle-tested track record, usually meaning that the edge cases have been polished to the point of being invisible, but you can bet that Murphy will always get you. You have been warned. The tool we will take a look at is newren/git-filter-repo.
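As for the backup itself, two low-effort options that need nothing but plain git are a mirror clone and a bundle. The sketch below is only an illustration: it builds a throwaway repository in a temporary directory, and the names demo, backup.git and backup.bundle are placeholders I made up. For real work you would point the last three commands at your own repository.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway repository standing in for your real project (hypothetical name).
work=$(mktemp -d)
git init -q "$work/demo"
cd "$work/demo"
echo "hello" > file.txt
git add file.txt
git -c user.name=Demo -c user.email=demo@example.com commit -q -m "Initial commit"

# Option 1: a mirror clone keeps every branch, tag and other ref.
git clone -q --mirror "$work/demo" "$work/backup.git"

# Option 2: a single-file bundle that is easy to copy to another machine.
git bundle create "$work/backup.bundle" --all

# Sanity check: git confirms that the bundle is complete and readable.
git bundle verify "$work/backup.bundle"
```

Either copy can later be cloned from like any other remote, which is what makes it a usable safety net if a rewrite goes wrong.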
Beware: Using the tool can lead to catastrophic scenarios if used incorrectly. It is recommended to use it only on fresh clones, so that the work is recoverable in case of a disaster. Try to avoid the --force parameter at all costs to prevent data loss. If unsure, add --dry-run to the actual command, or run the tool with --analyze, to inspect things before changing anything. Now let's look at some of the use cases of the tool.
Replace sensitive string in all files
The most common use case for rewriting git history is probably removing sensitive information such as passwords or access tokens checked in by accident. It is not enough to just replace all occurrences in the current index, because the information might still be present in earlier commits. Doing this manually via interactive rebase is time-consuming and error-prone. Instead, this command can be used:
git filter-repo --replace-text <(echo 'my_password==>xxxxxxxx')
The <( ... ) syntax is process substitution, a feature of shells such as bash and zsh. It is needed because the --replace-text argument expects the name of a file containing one replacement rule per line, as many as needed. With the above syntax, one can skip creating a file altogether, which is useful when only a single replacement is needed.
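When more than one secret has to go, a regular rules file is easier to read than a one-liner. The sketch below only writes such a file and prints it; the actual rewrite is left as a comment so nothing gets modified by accident. The file name replacements.txt and the sample secrets are made up for illustration. As far as I can tell from the filter-repo documentation, a rule may be prefixed with regex: or glob: (plain literal matching is the default), and a line without ==> gets replaced with ***REMOVED***.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical rules file; the name and the sample secrets are arbitrary.
rules="$(mktemp -d)/replacements.txt"
cat > "$rules" <<'EOF'
my_password==>xxxxxxxx
regex:api_key=[0-9a-f]+==>api_key=REDACTED
old_token
EOF

# To apply the rules, run this from inside a fresh clone of the repository:
#   git filter-repo --replace-text "$rules"
echo "Rules written to $rules:"
cat "$rules"
```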
This is usually where the use case for the filter-repo tool ends. The command is also quite hard to remember due to the shell intricacies involved and the uncommon syntax requiring the long double-arrow symbol ==>, so you will probably end up looking it up every time the need arises. But there is much more one can do, so let's look at some less documented features I found scattered around the internet.
Remove a single folder, keeping history
A scenario where a repository has a folder that has to be taken out of it, leaving no traces in history:
git-filter-repo --path path_to_the_folder/ --invert-paths
Now the repository has no trace of the tracked files inside ./path_to_the_folder/. Beware that all untracked files are preserved, while tracked files are completely deleted. If all the files in the folder are tracked, the then-empty folder will be deleted as well.
Extract a single folder, keeping history
The opposite is even simpler, with one less parameter. When you want to extract the commit history of a single folder, omitting every other file:
git-filter-repo --path path_to_the_folder/
The repository now contains only ./path_to_the_folder/ and any other files that are untracked.
Move everything from sub-folder one level up
This goes very well together with the above command. After extraction, sometimes you need to make the contents of the extracted folder the root of the repository, shifting everything one level up in the path:
git-filter-repo --path-rename path_to_the_folder/:
Note the colon character : at the end. The repository no longer contains ./path_to_the_folder/; instead, you will see the contents of that folder directly in the repository root.
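The extraction and the move to the root can apparently also be done in a single pass with the --subdirectory-filter option, which the filter-repo documentation describes as equivalent to combining --path and --path-rename for the same folder. The following is a sketch on a throwaway repository, not a recipe to paste blindly: the file names kept.txt and dropped.txt are made up, and the rewrite is skipped if the tool is not installed. It works on a clone made with --no-local, in line with the advice to operate on fresh clones only.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway repository with one sub-folder and one top-level file.
work=$(mktemp -d)
git init -q "$work/origin"
cd "$work/origin"
mkdir path_to_the_folder
echo "keep me" > path_to_the_folder/kept.txt
echo "drop me" > dropped.txt
git add .
git -c user.name=Demo -c user.email=demo@example.com commit -q -m "Initial commit"

# Work on a fresh clone; --no-local forces a real copy instead of hardlinks.
git clone -q --no-local "$work/origin" "$work/fresh"
cd "$work/fresh"

if git filter-repo -h >/dev/null 2>&1; then
    # Keep only the folder and make its contents the repository root.
    git filter-repo --subdirectory-filter path_to_the_folder
    # List what is left in the repository after the rewrite.
    git ls-files
else
    echo "git-filter-repo is not installed, nothing was rewritten" >&2
fi
```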
Replace email address in commits
This is a little bit different from the above commands, but sometimes you find that you made commits with the wrong email address. This can be fixed by creating a file named .mailmap in the desired repository with the following contents:
<new@email> <current@email>
Note that the angle brackets < and > around both email addresses are mandatory, otherwise the following error occurs:
Unparseable mailmap file: line #1 is bad: ...
With the properly formatted .mailmap file in place, issue the rewrite command:
git-filter-repo --use-mailmap
Even though changing the email address in commits seems like an innocent change, it too changes the commit SHA hashes, as they are computed with the author's email address in mind.
Replace author's name in commits
A variation of the above is to replace the author's name. I have not used this personally, but I can think of a situation where you used a nickname for commits you now want to make public, or the opposite scenario, where you made commits with your true identity but want to publish them under a nickname. All the steps are identical, with a tweak to the .mailmap file:
Name Surname <current@email> <current@email>
And again, run the following:
git-filter-repo --use-mailmap
You can obviously also combine changing both the author and the email in the same step:
Name Surname <new@email> <current@email>
Note that the author's name will only be changed for commits that match current@email, so this is something to keep in mind!
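Because plain git already understands the .mailmap file, the mapping can be previewed before anything is rewritten. The sketch below, on a throwaway repository with a made-up Nickname author, prints the identity stored in each commit next to the one the mailmap resolves it to. A commit where both sides are identical is one the rewrite would leave alone.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway repository with one commit made under the old identity.
work=$(mktemp -d)
git init -q "$work/demo"
cd "$work/demo"
echo "hello" > file.txt
git add file.txt
git -c user.name=Nickname -c user.email=current@email commit -q -m "Initial commit"

# Same format as above: new name and email first, the old email last.
echo 'Name Surname <new@email> <current@email>' > .mailmap

# %an/%ae show what is stored in the commit,
# %aN/%aE show what the mailmap turns it into.
git log --all --format='%h  %an <%ae>  ->  %aN <%aE>'
```

This only changes how the log is displayed. The stored commits stay untouched until git-filter-repo --use-mailmap is run.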
Checking the changes
After you've made your changes, it is always wise to check that everything went right. One way of doing so is to use the Git history GUI to inspect all branches:
gitk --all
If a GUI is not available, this command could serve as a starting point:
git log --graph --all --format='%h %an <%ae>'
Tweak the above if needed.
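For the sensitive-string case, git's pickaxe search is a handy extra check, since it looks through the whole history rather than only the current files. The sketch below runs it on a throwaway repository that deliberately still contains the leak (the file config.ini and its contents are made up), so here it reports hits. After a successful --replace-text run, the same search should come back empty.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway repository where a password was committed and later "removed".
work=$(mktemp -d)
git init -q "$work/demo"
cd "$work/demo"
commit() { git -c user.name=Demo -c user.email=demo@example.com commit -q -m "$1"; }

echo 'password = my_password' > config.ini
git add config.ini && commit "Add config"
echo 'password = changeme' > config.ini
git add config.ini && commit "Remove password from current version"

# Commits on any branch whose diff adds or removes the string.
hits=$(git log --all --oneline -Smy_password)
echo "${hits:-no commit touches the string}"
```

Note that this search covers file contents only. A secret pasted into a commit message would need a separate look with git log --all --grep.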
Conclusion
git-filter-repo is a very versatile tool that can perform many actions with just one line. It is the officially preferred way of rewriting git history: the git filter-branch documentation itself recommends it as a replacement. Most of the time you will find yourself using it for removing sensitive information such as passwords, but most other actions needed for a repository clean-up are possible once you know the right syntax. Remember to keep backups, do not rewrite public repositories unless absolutely necessary, and keep your repositories clean. Enjoy!