mirror of
https://github.com/newren/git-filter-repo.git
synced 2024-07-06 18:32:14 +02:00
filter-repo: update README.md
Signed-off-by: Elijah Newren <newren@gmail.com>
parent
73e91edecc
commit
911f234e3d
243
README.md
@@ -1,11 +1,12 @@
git filter-repo is a tool for rewriting history, which includes some
capabilities I have not found anywhere else. It is most similar to
[git filter-branch](https://git-scm.com/docs/git-filter-branch),
though it fixes what I perceive to be some glaring deficiencies in
that tool and brings a much different taste in usability. Also, being
based on fast-export/fast-import, it is orders of magnitude faster (it
has speed roughly comparable to BFG repo cleaner, but isn't
multi-threaded).
git filter-repo is a tool for rewriting history, which includes [some
capabilities I have not found anywhere
else](#design-rationale-behind-filter-repo-why-create-a-new-tool). It is
most similar to [git
filter-branch](https://git-scm.com/docs/git-filter-branch), though it fixes
what I perceive to be some glaring deficiencies in that tool and brings a
much different taste in usability. Also, being based on
fast-export/fast-import, it is [orders of magnitude
faster](https://public-inbox.org/git/CABPp-BGOz8nks0+Tdw5GyGqxeYR-3FF6FT5JcgVqZDYVRQ6qog@mail.gmail.com/).

filter-repo is a single-file python script, depending only on the
python standard library (and execution of git commands), all of which
@@ -14,20 +15,113 @@ $PATH.

# Table of Contents

  * Background
    * [Why create another repo filtering tool?](#why-git-filter-repo)
    * [Warnings: Not yet ready for external usage](
      #warnings-not-yet-ready-for-external-usage)
    * [Why not $FAVORITE_COMPETITOR](#why-not-favorite_competitor)
    * [Why filter-repo instead of filter-branch?](#why-filter-repo-instead-of-filter-branch)
    * [Example usage, comparing to filter-branch](#example-usage-comparing-to-filter-branch)
    * [Design rationale behind filter-repo](#design-rationale-behind-filter-repo-why-create-a-new-tool)
  * [Usage](#usage)

# Background

## Why git-filter-repo?
## Why filter-repo instead of filter-branch?

None of the [existing repository filtering
tools](#why-not-favorite_competitor) do what I want. They're all good
in their own way, but come up short for my needs. In no particular order:
filter-branch has a number of problems:

  * filter-branch is extremely slow, often unusably so (multiple orders of
    magnitude slower than it should be), for non-trivial repositories.

  * filter-branch made a number of usability choices that are okay for
    small repos, but these choices sometimes conflict as more options
    are combined, and the overall usability often causes difficulties
    for users trying to work with intermediate or larger repos.

  * filter-branch is missing some basic features.

The first two are intrinsic to filter-branch's design at this point
and cannot be backward-compatibly fixed.


## Example usage, comparing to filter-branch

Let's say that we want to extract a piece of a repository, with the intent
of merging just that piece into some other bigger repo. We also want to know
how much smaller this extracted repo is without the binary-blobs/ directory
in it. For extraction, we want to:

  * extract the history of a single directory, src/. This means that only
    paths under src/ remain in the repo, and any commits that only touched
    paths outside this directory will be removed.
  * rename all files to have a new leading directory, my-module/ (e.g. so that
    src/foo.c becomes my-module/src/foo.c)
  * rename any tags in the extracted repository to have a 'my-module-'
    prefix (to avoid any conflicts when we later merge this repo into
    something else)

Doing this with filter-repo is as simple as the following command:
```shell
git filter-repo --path src/ --to-subdirectory-filter my-module --tag-rename '':'my-module-'
```
(the single quotes are unnecessary, but make it clearer to a human that we
are replacing the empty string as a prefix with `my-module-`)

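
The three rewrites requested above can be modeled in a few lines of Python. This is only a sketch of the intended effect on paths and tag names, assuming the behavior described in the bullet list; it does not run git, and the function names (`rewrite_path`, `rewrite_tag`, `rewrite_commit`) are hypothetical, not part of filter-repo.

```python
# Simplified model of the extraction described above; it does not touch git.
# All names here are hypothetical and exist only for illustration.

def rewrite_path(path, keep_prefix="src/", new_root="my-module/"):
    """Return the rewritten path, or None if the path is filtered out."""
    if not path.startswith(keep_prefix):
        return None  # outside src/: dropped from history
    return new_root + path  # src/foo.c -> my-module/src/foo.c


def rewrite_tag(tag, old_prefix="", new_prefix="my-module-"):
    """Swap old_prefix for new_prefix; tags lacking old_prefix are unchanged."""
    if not tag.startswith(old_prefix):
        return tag
    return new_prefix + tag[len(old_prefix):]


def rewrite_commit(paths):
    """Rewrite the paths a commit touched; None means the commit became empty."""
    kept = [p for p in map(rewrite_path, paths) if p is not None]
    return kept or None


if __name__ == "__main__":
    print(rewrite_commit(["src/foo.c", "binary-blobs/big.bin"]))
    print(rewrite_commit(["docs/README"]))
    print(rewrite_tag("v1.0"))
```

A commit for which `rewrite_commit` returns `None` touched only paths outside src/, which is the case the first bullet says should be removed.
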
By contrast, filter-branch comes with a pile of caveats (more on that
below) even once you figure out the necessary invocation(s):

```shell
git filter-branch --tree-filter 'mkdir -p my-module && git ls-files | grep -v ^src/ | xargs git rm -f -q && ls -d * | grep -v my-module | xargs -I files mv files my-module/' --tag-name-filter 'echo "my-module-$(cat)"' --prune-empty -- --all
git clone file://$(pwd) newcopy
cd newcopy
git for-each-ref --format="delete %(refname)" refs/tags/ | grep -v refs/tags/my-module- | git update-ref --stdin
git gc --prune=now
```

Some might notice that the above filter-branch invocation will be really
slow due to using --tree-filter; you could alternatively use the
--index-filter option of filter-branch, changing the above commands to:

```shell
git filter-branch --index-filter 'git ls-files | grep -v ^src/ | xargs git rm -q --cached; git ls-files -s | sed "s-$(printf \\t)-&my-module/-" | git update-index --index-info; git ls-files | grep -v ^my-module/ | xargs git rm -q --cached' --tag-name-filter 'echo "my-module-$(cat)"' --prune-empty -- --all
git clone file://$(pwd) newcopy
cd newcopy
git for-each-ref --format="delete %(refname)" refs/tags/ | grep -v refs/tags/my-module- | git update-ref --stdin
git gc --prune=now
```

However, for either filter-branch command there is a pile of caveats.
First, some may be wondering why I list five commands here for
filter-branch. Despite the use of --all and --tag-name-filter, and
filter-branch's manpage claiming that a clone is enough to get rid of
old objects, the extra steps to delete the other tags and do another
gc are still required to clean out the old objects and avoid mixing
new and old history before pushing somewhere. Other caveats:
  * Commit messages are not rewritten; so if some of your commit
    messages refer to prior commits by (abbreviated) sha1, after the
    rewrite those messages will now refer to commits that are no longer
    part of the history. It would be better to rewrite those
    (abbreviated) sha1 references to refer to the new commit ids.
  * The --prune-empty flag sometimes misses commits that should be
    pruned, and it will also prune commits that *started* empty rather
    than just ended empty due to filtering. For repositories that
    intentionally use empty commits for versioning and publishing
    related purposes, this can be detrimental.
  * The commands above are OS-specific. GNU vs. BSD issues for sed,
    xargs, and other commands often trip up users; I think I failed to
    get most folks to use --index-filter since the only example in the
    filter-branch manpage that both uses it and shows how to move
    everything into a subdirectory is Linux-specific, and it is not
    obvious to the reader that it has a portability issue since it
    silently misbehaves rather than failing loudly.
  * The --index-filter version of the filter-branch command may be two to
    three times faster than the --tree-filter version, but both
    filter-branch commands are going to be multiple orders of magnitude
    slower than filter-repo.


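
To make the first caveat concrete, here is a minimal Python sketch of rewriting (abbreviated) commit IDs inside a commit message, given a mapping from old IDs to new ones. It illustrates the idea only and is not filter-repo's implementation; the mapping values and the names `OLD_TO_NEW`, `translate`, and `rewrite_message` are made up for this example.

```python
import re

# Hypothetical old-ID -> new-ID mapping; both hashes are invented values.
OLD_TO_NEW = {
    "0123456789abcdef0123456789abcdef01234567":
        "fedcba9876543210fedcba9876543210fedcba98",
}

# Runs of 7 to 40 lowercase hex digits are treated as candidate commit IDs.
HEX_RUN = re.compile(r"\b[0-9a-f]{7,40}\b")


def translate(ref, old_to_new):
    """Map a full or abbreviated old ID to the new ID, cut to the same length."""
    matches = [new for old, new in old_to_new.items() if old.startswith(ref)]
    if len(matches) != 1:
        return ref  # unknown or ambiguous: leave the text untouched
    return matches[0][:len(ref)]


def rewrite_message(message, old_to_new):
    """Replace every recognizable old commit ID in message with its new ID."""
    return HEX_RUN.sub(lambda m: translate(m.group(0), old_to_new), message)


if __name__ == "__main__":
    print(rewrite_message("This reverts commit 0123456789ab.", OLD_TO_NEW))
```

Hex-looking words that match no known commit are left alone, so ordinary prose is not altered by accident.
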
## Design rationale behind filter-repo (why create a new tool?)

None of the existing repository filtering tools do what I want. They're
all good in their own way, but come up short for my needs. No tool
provided any of the first seven traits below I wanted, and all failed to
provide at least one of the last three traits as well:

  1. [Starting report] Provide user an analysis of their repo to help
     them get started on what to prune or rename, instead of expecting
@@ -81,6 +175,17 @@ in their own way, but come up short for my needs. In no particular order:
     when using the --tag-name-filter option, and sometimes also an
     issue when only filtering a subset of branches.)

  1. [Versatility] Provide the user the ability to extend the tool or
     even write new tools that leverage existing capabilities, and
     provide this extensibility in a way that (a) avoids the need to
     fork separate processes (which would destroy performance), (b)
     avoids making the user specify OS-dependent shell commands (which
     would prevent users from sharing commands with each other), (c)
     takes advantage of rich data structures (because hashes, dicts,
     lists, and arrays are prohibitively difficult in shell) and (d)
     provides reasonable string manipulation capabilities (which are
     sorely lacking in shell).

  1. [Commit message consistency] If commit messages refer to other
     commits by ID (e.g. "this reverts commit 01234567890abcdef", "In
     commit 0013deadbeef9a..."), those commit messages should be
@@ -116,109 +221,7 @@ in their own way, but come up short for my needs. In no particular order:

  1. [Speed] Filtering should be reasonably fast

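
The [Versatility] trait above can be illustrated with a small sketch of in-process extensibility: the user supplies an ordinary Python function, and the driver calls it for every commit without forking a shell. Everything here is an assumption made for illustration; the `Commit` record, `run_filter`, `fix_identity`, and `MAILMAP` are hypothetical and do not describe filter-repo's actual interface.

```python
from dataclasses import dataclass, field

# Hypothetical in-memory commit record; a real tool's classes may differ.
@dataclass
class Commit:
    author_email: str
    message: str
    files: dict = field(default_factory=dict)  # path -> blob contents


# Hypothetical identity mapping, kept in a plain dict.
MAILMAP = {"old@example.com": "new@example.com"}


def run_filter(commits, callback):
    """Apply callback to each commit in-process; no subprocess is forked."""
    for commit in commits:
        callback(commit)
    return commits


def fix_identity(commit):
    """Example callback: a dict lookup and a string method, no sed or grep."""
    commit.author_email = MAILMAP.get(commit.author_email, commit.author_email)
    commit.message = commit.message.replace("teh ", "the ")


if __name__ == "__main__":
    history = [Commit("old@example.com", "Fix teh bug")]
    for c in run_filter(history, fix_identity):
        print(c.author_email, "|", c.message)
```

Because the callback is plain Python, the same function behaves identically on any OS, which is the portability point made in (b).
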
## Warnings: Not yet ready for external usage

This repository is still under heavy construction. Some caveats:

  * It will not work without a specially compiled version of git:
    * git clone --branch fast-export-import-improvements https://github.com/newren/git/
    * Build according to normal git.git build instructions. You can find 'em.
  * I have a list of known bugs, conveniently mostly tracked in my head.
    I'll fix that, but the fact that you're reading this sentence means
    I haven't yet.
    * Actually, there are a couple of exceptions to the bug tracking mentioned
      above. In particular, the following bugs are tracked here:
      * Multiple unimplemented placeholder option flags exist. Just because it
        shows up in --help doesn't mean it does anything.
      * Usage instructions and examples at the end of this document are rather
        lacking.
  * Random debugging code or extraneous files might be checked in at any
    given time; I'll probably rewrite history to remove them...eventually.
  * I reserve the right to:
    * Rename the tool altogether (filter-repo to be like filter-branch?)
    * Rename or redefine any command line options
    * Rewrite the history of this repository at any time
    * and possibly more...but do you really need any more reasons than
      the above? This isn't ready for widespread use.

## Why not $FAVORITE_COMPETITOR?

Here are some of the prominent competitors I know of:
  * git_fast_filter.py (Original link dead, use google if you care; this repo
    is the successor, though.)
  * [reposurgeon](http://www.catb.org/esr/reposurgeon/)
  * [BFG repo cleaner](https://rtyley.github.io/bfg-repo-cleaner/)
  * [git filter-branch](https://mirrors.edge.kernel.org/pub/software/scm/git/docs/git-filter-branch.html)

Here's why I think these tools don't meet my needs:

  * git_fast_filter.py:
    * This was actually the basis for filter-repo, though it required lots of
      additional work.
    * Was meant as a library more than a tool, and had too high of an
      activation energy.
    * empty commit pruning was not as thorough as it should have been
    * had no provision for commit message rewriting for commit message
      consistency.
    * missing lots of little conveniences

  * reposurgeon
    * focused on converting repositories between version control systems,
      and handles all the crazy impedance mismatches inherent in such
      conversions. I only care about rewriting history that starts in git
      and ends in git. If you care about converting between version control
      systems, though, reposurgeon is a much better tool.
    * might be general enough to use for other uses, but I can't find any
      documentation or examples on anything other than huge repository
      conversions between version control systems.
    * way too much effort for many simple repository rewrites that many
      users want to perform

  * BFG repo cleaner
    * Very focused on just removing crazy big files and sensitive data.
      Probably the best tool if that's all you want. But lacks the ability
      to handle anything outside this special (but important!) use case.
    * Has useful options for helping you remove the N biggest blobs, but
      nothing to help you know how big N should be.
    * Doesn't prune commits which become empty due to filtering; if you
      just want to extract a directory added 3 months ago and its history,
      you'd be stuck with years of commits touching other directories, all
      empty.
    * The refusal to rewrite HEAD, while it makes sense when trying to
      remove a few crazy big files and sensitive data (users tend to
      re-add and re-commit bad files if you didn't manually remove them
      and have them update), is totally misaligned with more general
      rewrite cases (e.g. the desire to turn a subdirectory into the
      root of a repository, or move the root of the repository into a
      subdirectory for merging into some other bigger repo.)
    * Telling the user how to shrink the repo afterwards seems lame since
      that was the whole point; just do it for them by default.

  * git filter-branch

    * Fundamental design flaw causing it to be orders of magnitude
      slower than it should be for most repo rewriting jobs. So slow
      that it becomes a major usability impediment, if not a deal
      breaker. However, it is _extremely_ versatile.
    * Generally quick for users to invoke (quick one-liners with lots
      of examples), just missing some useful capabilities like
      selecting wanted paths (as opposed to unwanted paths) and
      providing easier path renaming (also, e.g. no
      --to-subdirectory-filter as the opposite of
      --subdirectory-filter)
    * Doesn't rewrite commit hashes in commit messages, causing commit messages
      to refer to phantom commits instead.
    * Mixes old repository information (original tags, unrewritten branches)
      with new, risking re-pushing the old stuff
    * Lame defaults
      * --prune-empty should be default (although only commits which become
        empty, not ones which started empty)
      * allows user to mess with repos which aren't a clean clone without an
        override
    * Makes it very difficult to actually get rid of unwanted objects and
      shrink repository. Long multi-step instructions in manpage for this,
      which are incomplete when --tag-name-filter is in use.

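
The distinction drawn above between commits that become empty and commits that started empty can be written as a one-line rule. The sketch below is a deliberate simplification under stated assumptions: it looks only at the set of paths a commit changes before and after filtering and ignores merge commits and parent rewriting. `should_prune` is a hypothetical name, not a filter-repo function.

```python
def should_prune(paths_before, paths_after):
    """Prune only commits that changed something before filtering and nothing after.

    paths_before: paths the commit changed in the original history
    paths_after:  paths it still changes once filtering has been applied
    """
    started_empty = len(paths_before) == 0
    ended_empty = len(paths_after) == 0
    return ended_empty and not started_empty


if __name__ == "__main__":
    print(should_prune(["docs/a.txt"], []))            # became empty
    print(should_prune([], []))                        # intentionally empty
    print(should_prune(["src/a.c"], ["src/a.c"]))      # still has changes
```

Under this rule an intentionally empty commit, such as one used to mark a release, survives the rewrite.
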
# Usage

Run `git filter-repo --help` and figure it out from there. Good luck.
Run `git filter-repo -h`; more detailed docs will be added soon...
