Rewriting git history: Dealing with big directory moves

Today at work I ran into an interesting problem. We have a git repository that was formerly an SVN one. The old repository was home to multiple big binary files, and now that it is git, all these files make our .git directory a painful 2 GB in size.

To confirm which files were responsible for the gigantic size, I used a short script from Stack Overflow.
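The Stack Overflow script itself is not reproduced here. As a rough sketch of the same idea in Python, the following feeds `git rev-list --objects --all` into `git cat-file --batch-check` and sorts the blobs by size. The helper names `parse_batch_check` and `largest_blobs` are made up for illustration, and the sizes reported are uncompressed object sizes, not what each file costs on disk.

```python
import subprocess


def parse_batch_check(output):
    """Parse 'type sha size path' lines; return (size, path) for blobs, biggest first."""
    blobs = []
    for line in output.splitlines():
        parts = line.split(" ", 3)
        # Commits and trees are skipped; only blobs carry file contents.
        if len(parts) == 4 and parts[0] == "blob":
            blobs.append((int(parts[2]), parts[3]))
    return sorted(blobs, reverse=True)


def largest_blobs(repo=".", limit=10):
    """Return the `limit` biggest blobs reachable from any ref in `repo`."""
    # List every object reachable from any ref, one "sha path" per line.
    objects = subprocess.run(
        ["git", "-C", repo, "rev-list", "--objects", "--all"],
        check=True, capture_output=True, text=True,
    ).stdout
    # Ask git for type and size of each object; %(rest) echoes the path back.
    sizes = subprocess.run(
        ["git", "-C", repo, "cat-file",
         "--batch-check=%(objecttype) %(objectname) %(objectsize) %(rest)"],
        input=objects, check=True, capture_output=True, text=True,
    ).stdout
    return parse_batch_check(sizes)[:limit]
```

Calling `largest_blobs(".", 10)` from inside the repository should return the ten largest blobs as `(size_in_bytes, path)` tuples, which is enough to see which directories are worth filtering out.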

No problem, you would think: just run git-filter-repo and be done with it. But there is a catch. Those big files actually belong to other departments at our company and lived in directories like doc, hw, etc., while what is relevant to us was under sw. At some point, it was decided that this repository would be used only for software, and all the contents of sw were moved to the repository root. But sw had a doc directory too, so after the move there is a legitimate doc at the root, and I can't just filter out doc across the whole history.

I wanted to apply different filter-repo rules before and after the move commit, but I couldn't find a built-in way to do that, so I hacked together this script using the git-filter-repo Python library:

#!/usr/bin/env python3

import os
import subprocess
try:
    import git_filter_repo as fr
except ImportError:
    raise SystemExit("Error: Couldn't find git_filter_repo.py.  Did you forget to make a symlink to git-filter-repo named git_filter_repo.py or did you forget to put the latter in your PYTHONPATH?")

# Before the commit, we will ignore everything except sw directory.
# We take this opportunity to also remove some unwanted things from sw
def should_ignore_before(filename):
    # Don't allow unwanted directories
    if filename.startswith(b"sw/"):
        return True
    if filename.startswith(b"sw/"):
        return True
    # Allow sw directory:
    if filename.startswith(b"sw"):
        return False
    # Delete everything else:
    return True

# After the big move commit, we will allow everything, except the unwanted directories
def should_ignore_after(filename):
    # Don't allow unwanted directories
    if filename.startswith(b""):
        return True
    if filename.startswith(b"delete.me2"):
        return True
    # Allow everything else:
    return False

before_move_commit = True

def fixup_commits(commit, metadata):
    global before_move_commit

    # Hacky way to do it: detect the move commit by its message, specifically the part
    # carrying the SVN conversion info, which should not repeat in any other commit.
    # Note this probably only works because this part of history is totally linear (it came from SVN).
    if (b"branches/develop@582" in commit.message):
        before_move_commit = False
    # Apply different filtering depending on before or after commit.
    if before_move_commit:
        commit.file_changes = [x for x in commit.file_changes
                                if not should_ignore_before(x.filename)]
    else:
        commit.file_changes = [x for x in commit.file_changes
                                if not should_ignore_after(x.filename)]

fr_args = fr.FilteringOptions.parse_args(['--force'])

# Build the filter and actually run it; without run() nothing is rewritten.
repo_filter = fr.RepoFilter(fr_args, commit_callback=fixup_commits)
repo_filter.run()

You will notice it is a very hacky script, but for something that I will probably run only once in my life, it is good enough.
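If the global flag bothers you, the same two-phase logic could be tucked into a small callable class. This is only a sketch and not what I actually ran: `TwoPhaseFilter`, `keep_before`, `keep_after`, and the `FakeCommit`/`FakeChange` stand-ins are hypothetical names, and `placeholder-dir/` is a placeholder rather than one of the real directories. It relies on the same assumption as the script, namely that commits arrive in linear order, and like the script it treats the move commit itself as part of the "after" phase.

```python
from dataclasses import dataclass, field


@dataclass
class FakeChange:
    # Stand-in for git_filter_repo's FileChange; only the attribute used here.
    filename: bytes


@dataclass
class FakeCommit:
    # Stand-in for git_filter_repo's Commit; only the attributes used here.
    message: bytes
    file_changes: list = field(default_factory=list)


class TwoPhaseFilter:
    """Apply one path predicate before a marker commit and another from it onward."""

    def __init__(self, marker, keep_before, keep_after):
        self.marker = marker            # bytes expected in the move commit's message
        self.keep_before = keep_before  # predicate on filename, used before the move
        self.keep_after = keep_after    # predicate on filename, used from the move on
        self.seen_marker = False

    def __call__(self, commit, metadata=None):
        # Flip the phase first, so the move commit uses the "after" rules.
        if self.marker in commit.message:
            self.seen_marker = True
        keep = self.keep_after if self.seen_marker else self.keep_before
        commit.file_changes = [c for c in commit.file_changes if keep(c.filename)]


# Placeholder rules: keep only sw/ before the move, drop one directory after it.
two_phase = TwoPhaseFilter(
    marker=b"branches/develop@582",
    keep_before=lambda name: name.startswith(b"sw/"),
    keep_after=lambda name: not name.startswith(b"placeholder-dir/"),
)
```

Since the instance is callable with the same `(commit, metadata)` arguments, it should be usable as `commit_callback=two_phase` in place of `fixup_commits`, and the fake classes make it possible to try the rules on a few hand-written commits before touching the real repository.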

