Today at work I ran into an interesting problem. We have a git repository that used to be an SVN one. The old repository was home to several big binary files, and now that it is git, all of those files bloat the .git directory to a painful 2GB.
To confirm which files were responsible for the size, I used this short script from Stack Overflow.
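For reference, here is a minimal sketch of that kind of check. It is not the Stack Overflow script itself, just the same idea written in Python: feed every object reachable from any ref through git cat-file and sort the blobs by size. The function names parse_batch_check and largest_blobs are made up for this example, and the sizes reported are uncompressed object sizes, not what the blobs occupy inside the packfile.

```python
#!/usr/bin/env python3
"""List the largest blobs in a repository's history (illustrative sketch)."""
import subprocess


def parse_batch_check(output):
    """Turn `git cat-file --batch-check` lines into sorted (size, path) pairs for blobs."""
    blobs = []
    for line in output.splitlines():
        # Expected line shape: b"<type> <sha> <size> <path>"; the path may contain spaces.
        parts = line.split(b" ", 3)
        if len(parts) == 4 and parts[0] == b"blob":
            blobs.append((int(parts[2]), parts[3].decode(errors="replace")))
    # Largest first.
    return sorted(blobs, reverse=True)


def largest_blobs(repo=".", limit=20):
    """Return the `limit` largest (size, path) pairs across all refs in `repo`."""
    # Every object reachable from any ref, one "<sha> <path>" per line.
    objects = subprocess.run(
        ["git", "-C", repo, "rev-list", "--objects", "--all"],
        check=True, capture_output=True,
    ).stdout
    # Ask git for the type and size of each object; %(rest) echoes the path back.
    sizes = subprocess.run(
        ["git", "-C", repo, "cat-file",
         "--batch-check=%(objecttype) %(objectname) %(objectsize) %(rest)"],
        input=objects, check=True, capture_output=True,
    ).stdout
    return parse_batch_check(sizes)[:limit]
```

Calling largest_blobs() from inside the repository and printing the pairs should put the offending binaries at the top of the list.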
No problem, you would think: just run git-filter-repo and be done with it. But there is a catch. Those big files actually belong to other departments at our company and lived in directories like doc, hw, etc., while what is relevant to us was under sw. At some point it was decided that this repository would be used only for software, and all the contents of sw were moved to the repository root. But sw had a doc directory too, so I can't just filter doc out of the whole history.
I wanted to run a different filter-repo pass before and after the move commit, but couldn't find out how, so I hacked together this script using the filter-repo Python library:
#!/usr/bin/env python3
try:
    import git_filter_repo as fr
except ImportError:
    raise SystemExit("Error: Couldn't find git_filter_repo.py. Did you forget to make a symlink to git-filter-repo named git_filter_repo.py or did you forget to put the latter in your PYTHONPATH?")


# Before the move commit, we ignore everything except the sw directory.
# We take this opportunity to also remove some unwanted things from sw.
def should_ignore_before(filename):
    # Don't allow unwanted directories:
    if filename.startswith(b"sw/delete.me"):
        return True
    if filename.startswith(b"sw/delete.me.2"):
        return True
    # Allow the sw directory. The trailing slash keeps this from also
    # matching unrelated top-level names that merely start with "sw".
    if filename.startswith(b"sw/"):
        return False
    # Delete everything else:
    return True


# After the big move commit, we allow everything except the unwanted directories.
def should_ignore_after(filename):
    # Don't allow unwanted directories:
    if filename.startswith(b"delete.me"):
        return True
    if filename.startswith(b"delete.me2"):
        return True
    # Allow everything else:
    return False


# Flipped to False once the move commit has been seen.
before_move_commit = True


def fixup_commits(commit, metadata):
    global before_move_commit
    # Hacky way to find the move commit: match on the commit message, specifically
    # the part carrying the svn conversion info, which should not repeat in any other commit.
    # Note this probably only works because this part of history is totally linear
    # (it came from SVN), so every commit seen before the marker really is older.
    if b"branches/develop@582" in commit.message:
        before_move_commit = False
    # Apply different filtering depending on whether we are before or after the move commit.
    if before_move_commit:
        commit.file_changes = [x for x in commit.file_changes
                               if not should_ignore_before(x.filename)]
    else:
        commit.file_changes = [x for x in commit.file_changes
                               if not should_ignore_after(x.filename)]


fr_args = fr.FilteringOptions.parse_args(['--force'])
# Named repo_filter so it does not shadow the built-in filter().
repo_filter = fr.RepoFilter(fr_args, commit_callback=fixup_commits)
repo_filter.run()
You will notice it is a very hacky script, but for something that I will probably run only once in my life, it is good enough.
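If the global flag bothers you, the same switch can live in a closure, which also makes the logic easy to dry-run before touching the real repository. This is only a sketch: make_commit_callback, fake_commit and kept_paths are names invented for this example, and the commits are plain stand-in objects rather than real filter-repo ones. Like the script, it assumes commits arrive in linear, oldest-first order.

```python
from types import SimpleNamespace


def make_commit_callback(marker, ignore_before, ignore_after):
    """Build a commit callback that switches filters once `marker` shows up in a message."""
    state = {"before_move": True}

    def callback(commit, metadata=None):
        # The commit carrying the marker is already treated as "after the move".
        if marker in commit.message:
            state["before_move"] = False
        ignore = ignore_before if state["before_move"] else ignore_after
        commit.file_changes = [c for c in commit.file_changes
                               if not ignore(c.filename)]

    return callback


def fake_commit(message, *paths):
    """Stand-in for a filter-repo commit: just a message and a list of file changes."""
    return SimpleNamespace(
        message=message,
        file_changes=[SimpleNamespace(filename=p) for p in paths],
    )


def kept_paths(commits, callback):
    """Run the callback over the commits in order and report which paths survive."""
    for commit in commits:
        callback(commit)
    return [[c.filename for c in commit.file_changes] for commit in commits]
```

Feeding it a few fake commits shows the point of the whole exercise: a path under doc is dropped before the marker commit and kept from that commit onwards. The callback returned by make_commit_callback has the same (commit, metadata) shape as fixup_commits, so it could be handed to RepoFilter as commit_callback in its place.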