diff --git a/README.md b/README.md index 62e06df..0964d02 100644 --- a/README.md +++ b/README.md @@ -1,11 +1,12 @@ -git filter-repo is a tool for rewriting history, which includes some -capabilities I have not found anywhere else. It is most similar to -[git filter-branch](https://git-scm.com/docs/git-filter-branch), -though it fixes what I perceive to be some glaring deficiencies in -that tool and brings a much different taste in usability. Also, being -based on fast-export/fast-import, it is orders of magnitude faster (it -has speed roughly comparable to BFG repo cleaner, but isn't -multi-threaded). +git filter-repo is a tool for rewriting history, which includes [some +capabilities I have not found anywhere +else](#design-rationale-behind-filter-repo-why-create-a-new-tool). It is +most similar to [git +filter-branch](https://git-scm.com/docs/git-filter-branch), though it fixes +what I perceive to be some glaring deficiencies in that tool and brings a +much different taste in usability. Also, being based on +fast-export/fast-import, it is [orders of magnitude +faster](https://public-inbox.org/git/CABPp-BGOz8nks0+Tdw5GyGqxeYR-3FF6FT5JcgVqZDYVRQ6qog@mail.gmail.com/). filter-repo is a single-file python script, depending only on the python standard library (and execution of git commands), all of which @@ -14,20 +15,113 @@ $PATH. # Table of Contents - * Background - * [Why create another repo filtering tool?](#why-git-filter-repo) - * [Warnings: Not yet ready for external usage]( - #warnings-not-yet-ready-for-external-usage) - * [Why not $FAVORITE_COMPETITOR](#why-not-favorite_competitor) + * [Why filter-repo instead of filter-branch?](#why-filter-repo-instead-of-filter-branch) + * [Example usage, comparing to filter-branch](#example-usage-comparing-to-filter-branch) + * [Design rationale behind filter-repo](#design-rationale-behind-filter-repo-why-create-a-new-tool) * [Usage](#usage) # Background -## Why git-filter-repo? +## Why filter-repo instead of filter-branch? 
-None of the [existing repository filtering -tools](#why-not-favorite_competitor) do what I want. They're all good -in their own way, but come up short for my needs. In no particular order: +filter-branch has a number of problems: + + * filter-branch ranges from extremely slow to unusably slow (multiple orders of + magnitude slower than it should be) for non-trivial repositories. + + * filter-branch made a number of usability choices that are okay for + small repos, but these choices sometimes conflict as more options + are combined, and the overall usability often causes difficulties + for users trying to work with intermediate or larger repos. + + * filter-branch is missing some basic features. + +The first two are intrinsic to filter-branch's design at this point +and cannot be backward-compatibly fixed. + + +## Example usage, comparing to filter-branch + +Let's say that we want to extract a piece of a repository, with the intent +of merging just that piece into some other bigger repo. We also want to know +how much smaller this extracted repo is without the binary-blobs/ directory +in it. For extraction, we want to: + + * extract the history of a single directory, src/. This means that only + paths under src/ remain in the repo, and any commits that only touched + paths outside this directory will be removed. + * rename all files to have a new leading directory, my-module/ (e.g. 
so that + src/foo.c becomes my-module/src/foo.c) + * rename any tags in the extracted repository to have a 'my-module-' + prefix (to avoid any conflicts when we later merge this repo into + something else) + +Doing this with filter-repo is as simple as the following command: +```shell + git filter-repo --path src/ --to-subdirectory-filter my-module --tag-rename '':'my-module-' +``` +(the single quotes are unnecessary, but make it clearer to a human that we +are replacing the empty string as a prefix with `my-module-`) + +By contrast, filter-branch comes with a pile of caveats (more on that +below) even once you figure out the necessary invocation(s): + +```shell + git filter-branch --tree-filter 'mkdir -p my-module && git ls-files | grep -v ^src/ | xargs git rm -f -q && ls -d * | grep -v my-module | xargs -I files mv files my-module/' --tag-name-filter 'echo "my-module-$(cat)"' --prune-empty -- --all + git clone file://$(pwd) newcopy + cd newcopy + git for-each-ref --format="delete %(refname)" refs/tags/ | grep -v refs/tags/my-module- | git update-ref --stdin + git gc --prune=now +``` + +Some might notice that the above filter-branch invocation will be really +slow due to using --tree-filter; you could alternatively use the +--index-filter option of filter-branch, changing the above commands to: + +```shell + git filter-branch --index-filter 'git ls-files | grep -v ^src/ | xargs git rm -q --cached; git ls-files -s | sed "s-$(printf \\t)-&my-module/-" | git update-index --index-info; git ls-files | grep -v ^my-module/ | xargs git rm -q --cached' --tag-name-filter 'echo "my-module-$(cat)"' --prune-empty -- --all + git clone file://$(pwd) newcopy + cd newcopy + git for-each-ref --format="delete %(refname)" refs/tags/ | grep -v refs/tags/my-module- | git update-ref --stdin + git gc --prune=now +``` + +However, for either filter-branch command there are a pile of caveats. +First, some may be wondering why I list five commands here for +filter-branch. 
Despite the use of --all and --tag-name-filter, and +filter-branch's manpage claiming that a clone is enough to get rid of +old objects, the extra steps to delete the other tags and do another +gc are still required to clean out the old objects and avoid mixing +new and old history before pushing somewhere. Other caveats: + * Commit messages are not rewritten; so if some of your commit + messages refer to prior commits by (abbreviated) sha1, after the + rewrite those messages will now refer to commits that are no longer + part of the history. It would be better to rewrite those + (abbreviated) sha1 references to refer to the new commit ids. + * The --prune-empty flag sometimes misses commits that should be + pruned, and it will also prune commits that *started* empty rather + than just ended empty due to filtering. For repositories that + intentionally use empty commits for versioning and publishing + related purposes, this can be detrimental. + * The commands above are OS-specific. GNU vs. BSD issues for sed, + xargs, and other commands often trip up users; I think I failed to + get most folks to use --index-filter since the only example in the + filter-branch manpage that both uses it and shows how to move + everything into a subdirectory is linux-specific, and it is not + obvious to the reader that it has a portability issue since it + silently misbehaves rather than failing loudly. + * The --index-filter version of the filter-branch command may be two to + three times faster than the --tree-filter version, but both + filter-branch commands are going to be multiple orders of magnitude + slower than filter-repo. + + +## Design rationale behind filter-repo (why create a new tool?) + +None of the existing repository filtering tools do what I want. They're +all good in their own way, but come up short for my needs. No tool +provided any of the first seven traits below I wanted, and all failed to +provide at least one of the last three traits as well: 1. 
[Starting report] Provide user an analysis of their repo to help them get started on what to prune or rename, instead of expecting @@ -81,6 +175,17 @@ in their own way, but come up short for my needs. In no particular order: when using the --tag-name-filter option, and sometimes also an issue when only filtering a subset of branches.) + 1. [Versatility] Provide the user the ability to extend the tool or + even write new tools that leverage existing capabilities, and + provide this extensibility in a way that (a) avoids the need to + fork separate processes (which would destroy performance), (b) + avoids making the user specify OS-dependent shell commands (which + would prevent users from sharing commands with each other), (c) + takes advantage of rich data structures (because hashes, dicts, + lists, and arrays are prohibitively difficult in shell) and (d) + provides reasonable string manipulation capabilities (which are + sorely lacking in shell). + 1. [Commit message consistency] If commit messages refer to other commits by ID (e.g. "this reverts commit 01234567890abcdef", "In commit 0013deadbeef9a..."), those commit messages should be @@ -116,109 +221,7 @@ in their own way, but come up short for my needs. In no particular order: 1. [Speed] Filtering should be reasonably fast -## Warnings: Not yet ready for external usage - -This repository is still under heavy construction. Some caveats: - - * It will not work without a specially compiled version of git: - * git clone --branch fast-export-import-improvements https://github.com/newren/git/ - * Build according to normal git.git build instructions. You can find 'em. - * I have a list of known bugs, conveniently mostly tracked in my head. - I'll fix that, but the fact that you're reading this sentence means - I haven't yet. - * Actually, there's a couple exceptions to where bugs are tracked mentioned - above. 
In particular, the following bugs are tracked here: - * Multiple unimplemented placeholder option flags exist. Just because it - shows up in --help doesn't mean it does anything. - * Usage instructions and examples at the end of this document are rather - lacking. - * Random debugging code or extraneous files might be checked in at any - given time; I'll probably rewrite history to remove them...eventually. - * I reserve the right to: - * Rename the tool altogether (filter-repo to be like filter-branch?) - * Rename or redefine any command line options - * Rewrite the history of this repository at any time - * and possibly more...but do you really need any more reasons than - the above? This isn't ready for widespread use. - -## Why not $FAVORITE_COMPETITOR? - -Here are some of the prominent competitors I know of: - * git_fast_filter.py (Original link dead, use google if you care; this repo - is the successor, though.) - * [reposurgeon](http://www.catb.org/esr/reposurgeon/) - * [BFG repo cleaner](https://rtyley.github.io/bfg-repo-cleaner/) - * [git filter-branch](https://mirrors.edge.kernel.org/pub/software/scm/git/docs/git-filter-branch.html) - -Here's why I think these tools don't meet my needs: - - * git_fast_filter.py: - * This was actually the basis for filter-repo, though it required lots of - additional work. - * Was meant as a library more than a tool, and had too high of an - activation energy. - * empty commit pruning was not as thorough as it should have been - * had no provision for commit message rewriting for commit message - consistency. - * missing lots of little conveniences - - * reposurgeon - * focused on converting repositories between version control systems, - and handles all the crazy impedance mismatches inherent in such - conversions. I only care about rewriting history that starts in git - and ends in git. If you care about converting between version control - systems, though, reposurgeon is a much better tool. 
- * might be general enough to use for other uses, but can't find any - documentation or examples on anything other than huge repository - conversions between version control systems. - * way too much effort for many simple repository rewrites that many - users want to perform - - * BFG repo cleaner - * Very focused on just removing crazy big files and sensitive data. - Probably the best tool if that's all you want. But lacks the ability - to handle anything outside this special (but important!) usecase. - * Has useful options for helping you remove the N biggest blobs, but - nothing to help you know how big N should be. - * Doesn't prune commits which become empty due to filtering; if you - just want to extract a directory added 3 months ago and its history, - you'd be stuck with years of commits touching other directories, all - empty. - * The refusal to rewrite HEAD, while it makes sense when trying to - remove a few crazy big files and sensitive data (users tend to - re-add and re-commit bad files if you didn't manually remove it - and have them update), is totally misaligned with more general - rewrite cases (e.g. the desire to turn a subdirectory into the - root of a repository, or move the root of the repository into a - subdirectory for merging into some other bigger repo.) - * Telling the user how to shrink the repo afterwards seems lame since - that was the whole point; just do it for them by default. - - * git filter-branch - - * Fundamental design flaw causing it to be orders of magnitude - slower than it should be for most repo rewriting jobs. So slow - that it becomes a major usability impediment, if not a deal - breaker. However, it is _extremely_ versatile. - * Generally quick for users to invoke (quick one-liners with lots - of examples), just missing some useful capabilities like - selecting wanted paths (as opposed to unwanted paths) and - providing easier path renaming (also, e.g. 
no - --to-subdirectory-filter as the opposite of - --subdirectory-filter) - * Doesn't rewrite commit hashes in commit messages, causing commit messages - to refer to phantom commits instead. - * Mixes old repository information (original tags, unrewritten branches) - with new, risking re-pushing the old stuff - * Lame defaults - * --prune-empty should be default (although only commits which become - empty, not ones which started empty) - * allows user to mess with repos which aren't a clean clone without an - override - * Makes it very difficult to actually get rid of unwanted objects and - shrink repository. Long multi-step instructions in manpage for this, - which are incomplete when --tag-name-filter is in use. # Usage -Run `git filter-repo --help` and figure it out from there. Good luck. +Run `git filter-repo -h`; more detailed docs will be added soon...
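
A note on the `--tag-rename '':'my-module-'` argument in the example usage above: the remark that the single quotes are unnecessary can be checked in any POSIX shell, because quote removal happens before the program receives the argument. A minimal sketch, in which the variable names `quoted` and `unquoted` are illustrative only and not part of filter-repo:

```shell
#!/bin/sh
# The shell strips the quotes before git ever sees the argument, so both
# spellings below yield the same string (an empty old prefix, a colon,
# then the new prefix).
quoted='':'my-module-'
unquoted=:my-module-

if [ "$quoted" = "$unquoted" ]; then
    echo "same argument: $quoted"
fi
```

The quoted form only documents intent for a human reader: an empty old prefix is being replaced with `my-module-`.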