{"id":228,"date":"2023-07-07T17:50:10","date_gmt":"2023-07-07T17:50:10","guid":{"rendered":"https:\/\/mlago.dev\/blog\/?p=228"},"modified":"2024-06-04T16:22:58","modified_gmt":"2024-06-04T16:22:58","slug":"rewriting-git-history-dealing-with-big-directory-moves","status":"publish","type":"post","link":"https:\/\/mlago.dev\/blog\/index.php\/2023\/07\/07\/rewriting-git-history-dealing-with-big-directory-moves\/","title":{"rendered":"Rewriting git history: Dealing with big directory moves"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">Today at work I ran into an interesting problem. We have a git repository that was formerly an SVN one. The SVN repository was home to several big binary files, and since its history was carried over to git, those files bloat the .git directory to a painful 2 GB.<\/p>\n\n<!--more-->\n\n<p class=\"wp-block-paragraph\">To confirm which files were responsible for the size, I used <a href=\"https:\/\/stackoverflow.com\/a\/32506324\" data-type=\"URL\" data-id=\"https:\/\/stackoverflow.com\/a\/32506324\">this short script from Stack Overflow<\/a>.<\/p>\n\n\n<p class=\"wp-block-paragraph\">No problem, you would think, just run <code>git-filter-repo<\/code> and be done with it. But there is a catch. Those big files belong to other departments at our company and lived in directories like <code>doc<\/code>, <code>hw<\/code>, etc., while what is relevant to us was under <code>sw<\/code>. At some point, it was decided that this repository would be used only for software, and all the contents of <code>sw<\/code> were moved to the repository root. 
But <code>sw<\/code> had a <code>doc<\/code> directory too, so I can't just filter <code>doc<\/code> out of the whole history.<\/p>\n\n\n<p class=\"wp-block-paragraph\">I wanted to apply different <code>filter-repo<\/code> rules before and after the move commit, but couldn't find a built-in way to do it, so I hacked together this script using the filter-repo Python library:<\/p>\n\n\n<pre class=\"wp-block-code\"><code>#!\/usr\/bin\/env python3\n\ntry:\n    import git_filter_repo as fr\nexcept ImportError:\n    raise SystemExit(\"Error: Couldn't find git_filter_repo.py.  Did you forget to make a symlink to git-filter-repo named git_filter_repo.py or did you forget to put the latter in your PYTHONPATH?\")\n\n# Before the move commit, we ignore everything except the sw directory.\n# We take this opportunity to also remove some unwanted things from sw.\n# Note: filter-repo passes filenames as bytes, hence the b\"...\" literals.\ndef should_ignore_before(filename):\n    # Don't allow unwanted directories\n    if filename.startswith(b\"sw\/delete.me\"):\n        return True\n    if filename.startswith(b\"sw\/delete.me.2\"):\n        return True\n    # Allow the sw directory (trailing slash so that other top-level\n    # names that merely start with \"sw\" are not kept by accident):\n    if filename.startswith(b\"sw\/\"):\n        return False\n    # Delete everything else:\n    return True\n\n# After the big move commit, we allow everything except the unwanted directories\ndef should_ignore_after(filename):\n    # Don't allow unwanted directories\n    if filename.startswith(b\"delete.me\"):\n        return True\n    if filename.startswith(b\"delete.me2\"):\n        return True\n    # Allow everything else:\n    return False\n\nbefore_move_commit = True\n\ndef fixup_commits(commit, metadata):\n    global before_move_commit\n\n    # Hacky way to detect the move commit: match on its commit message, specifically the\n    # part carrying the SVN conversion info, which should not repeat in any other commit.\n    # Note this probably only works because this part of the history is totally linear (it came from SVN).\n    if b\"branches\/develop@582\" in commit.message:\n        before_move_commit = False\n    
# Apply different filtering depending on whether we are before or after the move commit.\n    if before_move_commit:\n        commit.file_changes = &#91;x for x in commit.file_changes\n                                if not should_ignore_before(x.filename)]\n    else:\n        commit.file_changes = &#91;x for x in commit.file_changes\n                                if not should_ignore_after(x.filename)]\n\nfr_args = fr.FilteringOptions.parse_args(&#91;'--force'])\n\n# Named repo_filter to avoid shadowing the built-in filter()\nrepo_filter = fr.RepoFilter(fr_args, commit_callback=fixup_commits)\nrepo_filter.run()\n<\/code><\/pre>\n\n\n<p class=\"wp-block-paragraph\">You will notice it is a very hacky script, but for something that I will probably run only once in my life, it is good enough.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I explain how I performed a git history rewrite in a not-so-trivial scenario<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"inline_featured_image":false,"footnotes":""},"categories":[1],"tags":[],"class_list":["post-228","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/mlago.dev\/blog\/index.php\/wp-json\/wp\/v2\/posts\/228","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mlago.dev\/blog\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mlago.dev\/blog\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mlago.dev\/blog\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mlago.dev\/blog\/index.php\/wp-json\/wp\/v2\/comments?post=228"}],"version-history":[{"count":2,"href":"https:\/\/mlago.dev\/blog\/index.php\/wp-json\/wp\/v2\/posts\/228\/revisions"}],"predecessor-version":[{"id":254,"href":"https:\/\/mlago.dev\/blog\/index.php\/wp-json\/wp\/v2\/posts\/228\/revisions\/254"}],"wp:attachment":[{"href":"https:\/\/mlago.dev\/
blog\/index.php\/wp-json\/wp\/v2\/media?parent=228"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mlago.dev\/blog\/index.php\/wp-json\/wp\/v2\/categories?post=228"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mlago.dev\/blog\/index.php\/wp-json\/wp\/v2\/tags?post=228"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}