mirror of
https://github.com/newren/git-filter-repo.git
synced 2024-07-06 18:32:14 +02:00
a238e3b7e6
Flags like --local, --shared, --reference (and --dissociate), and --origin would all mess up the fresh clone checker. Attempting to defend against all of them would not only be costly, but make it harder to draw the line about guesses as to whether a repository is a fresh clone or not. --origin also has problems in that filter-repo has special handling for the 'origin' remote that I don't want to apply to other random remotes.

Flags like --depth, --single-branch, and --no-tags could prevent enough data from being downloaded to do a full rewrite and result in a partially rewritten or possibly even corrupt history (no idea how shallow clones interact; probably badly).

--filter would also make the repo start without enough info though it'd at least be downloaded on demand; it'd still be a really slow way to do it, though, so it's a bad idea.

filter-repo doesn't really provide an easy mechanism to rewrite a repo and its submodule simultaneously, so recursing submodules seems useless and unhelpful. --shallow-submodules would be bad for at least the same reasons --depth is for the parent module, assuming we handled submodules. --remote-submodules just provides a way to make the repo dirty to start, which is counter-productive. --jobs could be useful, if recursing submodules was.

--no-checkout might be safe to use and --sparse might also be okay for as long as it only affects the working tree, but in both cases why not go --bare or --mirror if you're doing that? Likewise, --no-hardlinks is useless given that we're already saying people need to use --no-local.

-b would be okay to use, but why wouldn't you just change the default branch on the server rather than just within this one clone used for rewriting the history? Whether you push back to the original repository or to a new repo, you'd have to take a separate step to change it in that remote repo. And if you really will use this new local repository as the official source, then you can switch branches at the end of the rewrite just as easily.

--separate-git-dir and --template might be okay to use, I haven't tested. If either doesn't work now, or breaks at any point in the future, I feel much better being able to say, "I told you to only use these three flags to git clone."

-u only affects the ability to receive the clone; it's fine to use. Also, -q only affects the console output during the clone operation, so you could use it.

There will probably be more flags added to git-clone over time. Testing against all of them is insanity. Recommend people only use --no-local, --bare, and --mirror, with the first only needed when cloning from a local filesystem, and the other two never needed but allowed for those that prefer.

Signed-off-by: Elijah Newren <newren@gmail.com>
1313 lines
56 KiB
Plaintext
git-filter-repo(1)
|
|
==================
|
|
|
|
NAME
|
|
----
|
|
git-filter-repo - Rewrite repository history
|
|
|
|
SYNOPSIS
|
|
--------
|
|
[verse]
|
|
'git filter-repo' --analyze
|
|
'git filter-repo' [<path_filtering_options>] [<content_filtering_options>]
|
|
[<ref_renaming_options>] [<commit_message_filtering_options>]
|
|
[<name_or_email_filtering_options>] [<parent_rewriting_options>]
|
|
[<generic_callback_options>] [<miscellaneous_options>]
|
|
|
|
DESCRIPTION
|
|
-----------
|
|
|
|
Rapidly rewrite entire repository history using user-specified filters.
|
|
This is a destructive operation which should not be used lightly; it
|
|
writes new commits, trees, tags, and blobs corresponding to (but
|
|
filtered from) the original objects in the repository, then deletes the
|
|
original history and leaves only the new. See <<DISCUSSION>> for more
|
|
details on the ramifications of using this tool. Several different
|
|
types of history rewrites are possible; examples include (but are not
|
|
limited to):
|
|
|
|
* stripping large files (or large directories or large extensions)
|
|
* stripping unwanted files by path
|
|
* extracting wanted paths and their history (stripping everything else)
|
|
* restructuring the file layout (such as moving all files into a
|
|
subdirectory in preparation for merging with another repo, making a
|
|
subdirectory become the new toplevel directory, or merging two
|
|
directories with independent filenames into one directory)
|
|
* renaming tags (also often in preparation for merging with another repo)
|
|
* replacing or removing sensitive text such as passwords
|
|
* making mailmap rewriting of user names or emails permanent
|
|
* making grafts or replacement refs permanent
|
|
* rewriting commit messages
|
|
|
|
Additionally, several concerns are handled automatically (many of these
|
|
can be overridden, but they are all on by default):
|
|
|
|
* rewriting (possibly abbreviated) hashes in commit messages to
|
|
refer to the new post-rewrite commit hashes
|
|
* pruning commits which become empty due to the above filters (also
|
|
handles edge cases like pruning of merge commits which become
|
|
degenerate and empty)
|
|
* creating replace-refs (see linkgit:git-replace[1]) for old commit
|
|
hashes, which if pushed and fetched will allow users to continue to
|
|
refer to new commits using (unabbreviated) old commit IDs
|
|
* stripping of original history to avoid mixing old and new history
|
|
* repacking the repository post-rewrite to shrink the repo for the
|
|
user
|
|
|
|
Also, it's worth noting that there is an important safety mechanism:
|
|
|
|
* abort if run from a repo that is not a fresh clone (to prevent
|
|
accidental data loss from rewriting local history that doesn't
|
|
exist anywhere else). See <<FRESHCLONE>>.
|
|
|
|
For those who know that there is large unwanted stuff in their history
|
|
and want help finding it, this command also
|
|
|
|
* provides an option to analyze a repository and generate reports that
|
|
can be useful in determining what to filter (or in determining
|
|
whether a separate filtering command was successful).
|
|
|
|
See also <<VERSATILITY>>, <<DISCUSSION>>, <<EXAMPLES>>, and
|
|
<<INTERNALS>>.
|
|
|
|
OPTIONS
|
|
-------
|
|
|
|
Analysis Options
|
|
~~~~~~~~~~~~~~~~
|
|
|
|
--analyze::
|
|
Analyze repository history and create a report that may be
|
|
useful in determining what to filter in a subsequent run (or
|
|
in determining if a previous filtering command did what you
|
|
wanted). Will not modify your repo.
|
|
|
|
Filtering based on paths (see also --filename-callback)
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
--invert-paths::
|
|
Invert the selection of files from the specified
|
|
--path-{match,glob,regex} options below, i.e. only select
|
|
files matching none of those options.
|
|
|
|
--path-match <dir_or_file>::
|
|
--path <dir_or_file>::
|
|
Exact paths (files or directories) to include in filtered
|
|
history. Multiple --path options can be specified to get a
|
|
union of paths.
|
|
|
|
--path-glob <glob>::
|
|
Glob of paths to include in filtered history. Multiple
|
|
--path-glob options can be specified to get a union of paths.
|
|
|
|
--path-regex <regex>::
|
|
Regex of paths to include in filtered history. Multiple
|
|
--path-regex options can be specified to get a union of paths.
|
|
|
|
--use-base-name::
|
|
Match on file base name instead of full path from the top of
|
|
the repo. Incompatible with --path-rename, and incompatible
|
|
with matching against directory names.
|
|
|
|
Renaming based on paths (see also --filename-callback)
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
Note: if you combine path filtering with path renaming, be aware that
|
|
a rename directive does not select paths, it only says how to
|
|
rename paths that are selected with the filters.
|
|
|
|
--path-rename <old_name:new_name>::
|
|
--path-rename-match <old_name:new_name>::
|
|
Path to rename; if filename or directory matches <old_name>
|
|
rename to <new_name>. Multiple --path-rename options can be
|
|
specified.
|
|
|
|
Path shortcuts
|
|
~~~~~~~~~~~~~~
|
|
|
|
--paths-from-file <filename>::
|
|
Specify several path filtering and renaming directives, one
|
|
per line. Lines with `==>` in them specify path renames, and
|
|
lines can begin with `literal:` (the default), `glob:`, or
|
|
`regex:` to specify different matching styles.
|
|
|
|
--subdirectory-filter <directory>::
|
|
Only look at history that touches the given subdirectory and
|
|
treat that directory as the project root. Equivalent to using
|
|
`--path <directory>/ --path-rename <directory>/:`
|
|
|
|
--to-subdirectory-filter <directory>::
|
|
Treat the project root as instead being under
|
|
<directory>. Equivalent to using `--path-rename :<directory>/`
|
|
|
|
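To make the `--paths-from-file` format more concrete, here is a minimal Python sketch of how such a file could be read into a list of directives. It is an illustration only, not filter-repo's own parser: the function name `parse_paths_file`, the tuple layout, and the decision to skip blank lines are assumptions made for this example.

```python
def parse_paths_file(text):
    """Turn the text of a paths file into (kind, style, value) directives.

    Hypothetical helper for illustration; not part of filter-repo.
    """
    directives = []
    for line in text.splitlines():
        if not line.strip():
            continue  # assumption: blank lines carry no directive
        style = 'literal'  # the documented default matching style
        for prefix in ('literal:', 'glob:', 'regex:'):
            if line.startswith(prefix):
                style = prefix[:-1]
                line = line[len(prefix):]
                break
        if '==>' in line:
            # Lines containing ==> describe a rename: old name, then new name
            old, new = line.split('==>', 1)
            directives.append(('rename', style, (old, new)))
        else:
            directives.append(('filter', style, line))
    return directives


example = """README.md
glob:*.py
regex:^docs/.*\\.txt$
cmds/==>tools/
"""
directives = parse_paths_file(example)
for directive in directives:
    print(directive)
```

Each output tuple corresponds to one line of the file, so a single file can stand in for a mix of `--path`, `--path-glob`, `--path-regex`, and `--path-rename` options.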
Content editing filters (see also --blob-callback)
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
--replace-text <expressions_file>::
|
|
A file with expressions that, if found, will be replaced. By
|
|
default, each expression is treated as literal text, but
|
|
`regex:` and `glob:` prefixes are supported. You can end the
|
|
line with `==>` and some replacement text to choose a
|
|
replacement choice other than the default of `***REMOVED***`.
|
|
|
|
--strip-blobs-bigger-than <size>::
|
|
Strip blobs (files) bigger than specified size (e.g. `5M`,
|
|
`2G`, etc.).
|
|
|
|
--strip-blobs-with-ids <blob_id_filename>::
|
|
Read git object ids from each line of the given file, and
|
|
strip all of them from history.
|
|
|
|
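As a rough illustration of the `--replace-text` expression format, the following Python sketch parses a few expression lines and applies them to a string. It is a simplified model under stated assumptions, not filter-repo's implementation: the helper names are hypothetical, it works on `str` rather than raw blob bytes, and the glob handling (only `*` and `?`) is a simplification.

```python
import re

DEFAULT_REPLACEMENT = '***REMOVED***'  # the documented default


def parse_expressions(text):
    """Return a list of (compiled_pattern, replacement) pairs."""
    rules = []
    for line in text.splitlines():
        if not line:
            continue
        replacement = DEFAULT_REPLACEMENT
        if '==>' in line:
            # Text after ==> overrides the default replacement
            line, replacement = line.rsplit('==>', 1)
        if line.startswith('regex:'):
            pattern = re.compile(line[len('regex:'):])
        elif line.startswith('glob:'):
            # Simplified glob: escape everything, then re-enable * and ?
            escaped = re.escape(line[len('glob:'):])
            pattern = re.compile(
                escaped.replace(r'\*', '.*').replace(r'\?', '.'))
        else:
            pattern = re.compile(re.escape(line))  # literal text
        rules.append((pattern, replacement))
    return rules


def apply_rules(rules, content):
    """Apply every rule in order to the given content."""
    for pattern, replacement in rules:
        content = pattern.sub(replacement, content)
    return content


expressions = """p455w0rd
regex:token=[0-9a-f]+==>token=<redacted>
glob:key-*-end==>KEY
"""
rules = parse_expressions(expressions)
cleaned = apply_rules(rules, 'pw=p455w0rd token=deadbeef key-abc-end')
print(cleaned)
```

The first line has no `==>`, so its match is replaced with the default `***REMOVED***`; the other two lines supply their own replacement text.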
Renaming of refs (see also --refname-callback)
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
--tag-rename <old:new>::
|
|
Rename tags starting with <old> to start with <new>. For example,
|
|
--tag-rename foo:bar will rename tag foo-1.2.3 to bar-1.2.3;
|
|
either <old> or <new> can be empty.
|
|
|
|
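The prefix semantics of `--tag-rename` can be sketched in a few lines of Python. This is a hedged model of the behavior described above, not the tool's code; `rename_tag` is a hypothetical helper, and the assumption is that tags are handled as full refnames under `refs/tags/`.

```python
def rename_tag(refname, spec):
    """Apply an <old>:<new> tag-rename spec to a full refname."""
    old, new = spec.split(':', 1)
    prefix = 'refs/tags/' + old
    if refname.startswith(prefix):
        # Swap the leading <old> for <new>, keep the rest of the tag name
        return 'refs/tags/' + new + refname[len(prefix):]
    return refname  # branches and non-matching tags are left alone


print(rename_tag('refs/tags/foo-1.2.3', 'foo:bar'))
print(rename_tag('refs/tags/v1.0', ':proj-'))
print(rename_tag('refs/heads/foo', 'foo:bar'))
```

An empty `<old>` matches every tag, which is how a common prefix can be added to all tags before merging two repositories.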
Filtering of commit messages (see also --message-callback)
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
--preserve-commit-hashes::
|
|
By default, since commits are rewritten and thus gain new
|
|
hashes, references to old commit hashes in commit messages are
|
|
replaced with new commit hashes (abbreviated to the same
|
|
length as the old reference). Use this flag to turn off
|
|
updating commit hashes in commit messages.
|
|
|
|
--preserve-commit-encoding::
|
|
Do not reencode commit messages into UTF-8. By default, if the
|
|
commit object specifies an encoding for the commit message,
|
|
the message is re-encoded into UTF-8.
|
|
|
|
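The default hash-updating behavior that `--preserve-commit-hashes` turns off can be pictured with the following Python sketch. It is only a model: the helper name `rewrite_hashes`, the use of `str` instead of bytes, and the rule that a reference is a run of 7 to 40 lowercase hex digits are assumptions for this example.

```python
import re

# Assumption: a commit reference is 7 to 40 lowercase hex digits
HASH_RE = re.compile(r'\b[0-9a-f]{7,40}\b')


def rewrite_hashes(message, old_to_new):
    """Replace references to old commit hashes, keeping their length."""
    def replace(match):
        ref = match.group(0)
        candidates = [new for old, new in old_to_new.items()
                      if old.startswith(ref)]
        if len(candidates) != 1:
            return ref  # unknown or ambiguous reference: leave it alone
        # Abbreviate the new hash to the same length as the old reference
        return candidates[0][:len(ref)]
    return HASH_RE.sub(replace, message)


# Made-up 40-character hashes, purely for demonstration
old_to_new = {'1a2b3c4d' + '0' * 32: '9f8e7d6c' + '1' * 32}
message = 'Revert "Add parser"\n\nThis reverts commit 1a2b3c4d.'
new_message = rewrite_hashes(message, old_to_new)
print(new_message)
```

Hex-looking strings that do not correspond to a rewritten commit are left untouched in this sketch.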
Filtering of names & emails (see also --name-callback and --email-callback)
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
--mailmap <filename>::
|
|
Use specified mailmap file (see linkgit:git-shortlog[1] for details
|
|
on the format) when rewriting author, committer, and tagger names
|
|
and emails. If the specified file is part of git history,
|
|
historical versions of the file will be ignored; only the current
|
|
contents are consulted.
|
|
|
|
--use-mailmap::
|
|
Same as: '--mailmap .mailmap'
|
|
|
|
Parent rewriting
|
|
~~~~~~~~~~~~~~~~
|
|
|
|
--replace-refs {delete-no-add, delete-and-add, update-no-add, update-or-add, update-and-add}::
|
|
Replace refs (see linkgit:git-replace[1]) are used to rewrite
|
|
parents (unless turned off by the usual git mechanism); this
|
|
flag specifies what to do with those refs afterward. Replace
|
|
refs can either be deleted or updated to point at new commit
|
|
hashes. Also, new replace refs can be added for each commit
|
|
rewrite. With 'update-or-add', new replace refs are only
|
|
added for commit rewrites that aren't used to update an
|
|
existing replace ref. The default is 'update-and-add' if
|
|
$GIT_DIR/filter-repo/already_ran does not exist;
|
|
'update-or-add' otherwise.
|
|
|
|
--prune-empty {always, auto, never}::
|
|
Whether to prune empty commits. 'auto' (the default) means
|
|
only prune commits which become empty (not commits which were
|
|
empty in the original repo, unless their parent was
|
|
pruned). When the parent of a commit is pruned, the first
|
|
non-pruned ancestor becomes the new parent.
|
|
|
|
--prune-degenerate {always, auto, never}::
|
|
Since merge commits are needed for history topology, they are
|
|
typically exempt from pruning. However, they can become
|
|
degenerate with the pruning of other commits (having fewer
|
|
than two parents, having one commit serve as both parents, or
|
|
having one parent as the ancestor of the other). If such merge
|
|
commits have no file changes, they can be pruned. The default
|
|
('auto') is to only prune empty merge commits which become
|
|
degenerate (not which started as such).
|
|
|
|
--no-ff::
|
|
Even if the first parent is or becomes an ancestor of another
|
|
parent, do not prune it. This modifies how --prune-degenerate
|
|
behaves, and may be useful in projects that always use merge
|
|
--no-ff.
|
|
|
|
Generic callback code snippets
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
--filename-callback <function_body>::
|
|
Python code body for processing filenames; see <<CALLBACKS>>.
|
|
|
|
--message-callback <function_body>::
|
|
Python code body for processing messages (both commit messages and
|
|
tag messages); see <<CALLBACKS>>.
|
|
|
|
--name-callback <function_body>::
|
|
Python code body for processing names of people; see <<CALLBACKS>>.
|
|
|
|
--email-callback <function_body>::
|
|
Python code body for processing email addresses; see
|
|
<<CALLBACKS>>.
|
|
|
|
--refname-callback <function_body>::
|
|
Python code body for processing refnames; see <<CALLBACKS>>.
|
|
|
|
--blob-callback <function_body>::
|
|
Python code body for processing blob objects; see <<CALLBACKS>>.
|
|
|
|
--commit-callback <function_body>::
|
|
Python code body for processing commit objects; see <<CALLBACKS>>.
|
|
|
|
--tag-callback <function_body>::
|
|
Python code body for processing tag objects; see <<CALLBACKS>>.
|
|
|
|
--reset-callback <function_body>::
|
|
Python code body for processing reset objects; see <<CALLBACKS>>.
|
|
|
|
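Since each callback option takes a function body rather than a whole function, it may help to picture how a body string can become a callable. The sketch below is an assumption-laden illustration, not filter-repo's actual mechanism: `make_callback` is a hypothetical helper, and the argument name `filename` and the use of bytes values are assumed here; <<CALLBACKS>> is the authoritative description.

```python
def make_callback(arg_name, body):
    """Wrap a user-supplied function body in a one-argument function."""
    source = 'def callback({}):\n'.format(arg_name)
    # Indent every line of the body so it sits inside the function
    source += ''.join('    ' + line + '\n' for line in body.splitlines())
    namespace = {}
    exec(source, namespace)
    return namespace['callback']


# The string below plays the role of the <function_body> argument
rename_src = make_callback(
    'filename',
    'return filename.replace(b"src/", b"lib/", 1)')

print(rename_src(b'src/a.c'))
print(rename_src(b'docs/a.md'))
```

Seen this way, the body given on the command line only needs to contain the statements of the function, typically ending in a `return`.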
Location to filter from/to
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
NOTE: Specifying alternate source or target locations implies --partial
|
|
except that the normal default for --replace-refs is used. However, unlike
|
|
normal uses of --partial, this doesn't risk mixing old and new history
|
|
since the old and new histories are in different repositories.
|
|
|
|
--source <source>::
|
|
Git repository to read from
|
|
|
|
--target <target>::
|
|
Git repository to overwrite with filtered history
|
|
|
|
Miscellaneous options
|
|
~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
--help::
|
|
-h::
|
|
Show a help message and exit.
|
|
|
|
--force::
|
|
-f::
|
|
Rewrite history even if the current repo does not look like a
|
|
fresh clone. See <<FRESHCLONE>>. Note that when cloning
|
|
repos on a local filesystem, it is better to pass `--no-local`
|
|
to git clone than passing `--force` to git-filter-repo.
|
|
|
|
--partial::
|
|
Do a partial history rewrite, resulting in a mixture of old and
|
|
new history. This implies a default of update-no-add for
|
|
--replace-refs, disables rewriting refs/remotes/origin/* to
|
|
refs/heads/*, disables removing of the 'origin' remote, disables
|
|
removing unexported refs, disables expiring the reflog, and
|
|
disables the automatic post-filter gc. Also, this modifies
|
|
--tag-rename and --refname-callback options such that instead of
|
|
replacing old refs with new refnames, it will instead create new
|
|
refs and keep the old ones around. Use with caution.
|
|
|
|
--refs <refs+>::
|
|
Limit history rewriting to the specified refs. Implies --partial.
|
|
In addition to the normal caveats of --partial (mixing old and new
|
|
history, no automatic remapping of refs/remotes/origin/* to
|
|
refs/heads/*, etc.), this also may cause problems for pruning of
|
|
degenerate empty merge commits when negative revisions are
|
|
specified.
|
|
|
|
--dry-run::
|
|
Do not change the repository. Run `git fast-export` and filter its
|
|
output, and save both the original and the filtered version for
|
|
comparison. This also disables rewriting commit messages due to
|
|
not knowing new commit IDs and disables filtering of some empty
|
|
commits due to inability to query the fast-import backend.
|
|
|
|
--debug::
|
|
Print additional information about operations being performed and
|
|
commands being run. (If used together with --dry-run, shows
|
|
extra information about what would be run).
|
|
|
|
--stdin::
|
|
Instead of running `git fast-export` and filtering its output,
|
|
filter the fast-export stream from stdin. The stdin must be in
|
|
the expected input format (e.g. it needs to include original-oid
|
|
directives).
|
|
|
|
--quiet::
|
|
Pass --quiet to other git commands called.
|
|
|
|
[[FRESHCLONE]]
|
|
FRESH CLONE SAFETY CHECK AND --FORCE
|
|
------------------------------------
|
|
|
|
Since filter-repo does irreversible rewriting of history, it is
|
|
important to avoid making changes to a repo for which the user doesn't
|
|
have a good backup. The primary defense mechanism is to simply
|
|
educate users and rely on them to be good stewards of their data; thus
|
|
there are several warnings in the documentation about how filter-repo
|
|
rewrites history.
|
|
|
|
However, as a service to users, we would like to provide an additional
|
|
safety check beyond the documentation. There isn't a good way to
|
|
check if the user has a good backup, but we can ask a related question
|
|
that is an imperfect but quite reasonable proxy: "Is this repository a
|
|
fresh clone?" Unfortunately, that is also a question we can't get a
|
|
perfect answer to; git provides no way to answer that question.
|
|
However, there are approximately a dozen things that I found that seem
|
|
to always be true of brand new clones (assuming they are either clones
|
|
of remote repositories or are made with the `--no-local` flag), and I
|
|
check for all of those.
|
|
|
|
These checks can have both false positives and false negatives.
|
|
Someone might have a perfectly good backup of their repo without it
|
|
actually being a fresh clone -- but there's no way for filter-repo to
|
|
know that. Conversely, someone could look at all the things that
|
|
filter-repo checks for in its safety checks and then just tweak their
|
|
non-backed-up repository to satisfy those conditions (though it would
|
|
take a fair amount of effort, and it's astronomically unlikely that a
|
|
repo that isn't a fresh clone randomly happens to match all the
|
|
criteria). In practice, the safety checks filter-repo uses seem to be
|
|
really good at avoiding people accidentally running filter-repo on a
|
|
repository that they shouldn't be running it on. It even caught me
|
|
once when I did mean to run filter-repo but was in a different
|
|
directory than I thought I was.
|
|
|
|
In short, it's perfectly fine to use `--force` to override the safety
|
|
checks as long as you're okay with filter-repo irreversibly rewriting
|
|
the contents of the current repository. It is a really bad idea to
|
|
get in the habit of always specifying `--force`; if you do, one day
|
|
you will run one of your commands in the wrong directory like I did,
|
|
and you won't have the safety check anymore to bail you out. Also, it
|
|
is definitely NOT okay to recommend `--force` on forums, Q&A sites, or
|
|
in emails to other users without first carefully explaining that
|
|
`--force` means putting your repositories' data at risk. I am
|
|
especially bothered by people who suggest the flag when it clearly is
|
|
NOT needed; they are needlessly putting other people's data at risk.
|
|
|
|
[[VERSATILITY]]
|
|
VERSATILITY
|
|
-----------
|
|
|
|
filter-repo has a hierarchy of capabilities on the spectrum from easy to
|
|
use convenience flags that perform pre-defined types of filtering, to
|
|
choices that provide lots of flexibility in controlling how filtering
|
|
occurs. This spectrum includes the following:
|
|
|
|
* Convenience flags making common types of history rewriting simple (e.g.
|
|
--path, --strip-blobs-bigger-than, --replace-text, --mailmap)
|
|
* Options which are shorthand for others or which provide greater control
|
|
than others (e.g. --subdirectory-filter could just be written using
|
|
both a path selection (--path) and a path rename (--path-rename)
|
|
filter; --paths-from-file can handle all other --path* options and more
|
|
such as regex renaming of paths)
|
|
* Generic python callbacks for handling a certain type of data (the
|
|
filename, message, name, email, and refname callbacks)
|
|
* Generic python callbacks for handling fundamental git objects, allowing
|
|
greater control over the combination of data types the object holds
|
|
(the commit, tag, blob, and reset callbacks)
|
|
* The ability to import filter-repo as a module in a python program and
|
|
use its classes and functions for even greater control and flexibility
|
|
while still leveraging lots of basic capabilities. One can even use
|
|
this to write new tools with a completely different interface.
|
|
|
|
For more information about callbacks, see <<CALLBACKS>>. For examples of
|
|
writing python programs that import filter-repo as a module to create new
|
|
history rewriting tools, look at the contrib/filter-repo-demos/ directory.
|
|
That directory includes, among other examples, a reimplementation of
|
|
git-filter-branch which is faster than git-filter-branch, and a
|
|
reimplementation of BFG Repo Cleaner with several bug fixes and new
|
|
features.
|
|
|
|
[[DISCUSSION]]
|
|
DISCUSSION
|
|
----------
|
|
|
|
Using filter-repo is relatively simple, but rewriting history is part of
|
|
a larger discussion in terms of collaboration. When you rewrite
|
|
history, the old and new histories are no longer compatible; if you push
|
|
this history somewhere for others to view, it will look as though you've
|
|
done a rebase of all branches and tags. Make sure you are familiar with
|
|
the "RECOVERING FROM UPSTREAM REBASE" section of linkgit:git-rebase[1]
|
|
(and in particular, "The hard case") before proceeding, in addition to
|
|
this section.
|
|
|
|
Steps to use git-filter-repo as part of the bigger picture of doing a
|
|
history rewrite are roughly as follows:
|
|
|
|
1. Create a clone of your repository (if you created special refs outside
|
|
of refs/heads/ or refs/tags/, make sure to fetch those too). You may
|
|
pass `--bare` or `--mirror` to `git clone`, if you prefer. You should
|
|
pass `--no-local` if the repository you are cloning from is on the local
|
|
filesystem. Avoid other flags; some might confuse the fresh clone
|
|
check, and others could cause parts of the data to be missing that are
|
|
needed for the rewrite.
|
|
|
|
2. (Optional) Run `git filter-repo --analyze`. This will create a
|
|
directory of reports mentioning renames that have occurred in your
|
|
repo and also listing sizes of objects aggregated by
|
|
path/directory/extension/blob-id; this information may be useful in
|
|
choosing how to filter your repo. It can also be useful to re-run
|
|
--analyze after filtering to verify the changes look correct.
|
|
|
|
3. Run filter-repo with your desired filtering options. Many examples
|
|
are given below. For more complex cases, note that doing the
|
|
filtering in multiple steps (by running multiple filter-repo
|
|
invocations in a sequence) is supported. If anything goes wrong here,
|
|
simply delete your clone and restart.
|
|
|
|
4. Push your new repository to its new home (note that
|
|
refs/remotes/origin/* will have been moved to refs/heads/* as the
|
|
first part of filter-repo, so you can just deal with normal branches
|
|
instead of remote tracking branches). While you can force push this
|
|
to the same URL you cloned from, there are good reasons to consider
|
|
pushing to a different location instead:
|
|
|
|
* People who cloned from the original repo will have old history.
|
|
When they fetch the new history you force pushed up, unless they
|
|
do a `git reset --hard @{u}` on their branches or rebase their
|
|
local work, git will think they have hundreds or thousands of
|
|
commits with commit messages very similar to those that exist upstream
|
|
(but which include files you wanted excised from history), and
|
|
allow the user to merge the two histories, resulting in what
|
|
looks like two copies of each commit. If they then push this
|
|
history back up, then everyone now has history with two copies of
|
|
each commit and the bad files have returned. You're more likely
|
|
to succeed in forcing people to get rid of the old history if
|
|
they have to clone a new URL.
|
|
|
|
* Rewriting history will rewrite tags; those who have already
|
|
downloaded tags will not get the updated tags by default (see the
|
|
"On Re-tagging" section of linkgit:git-tag[1]). Every user
|
|
trying to use an existing clone will have to forcibly delete all
|
|
tags and re-fetch them; it may be easier for them to just
|
|
re-clone, which they are more likely to do with a new clone URL.
|
|
|
|
* Rewriting history may delete some refs (e.g. branches that only
|
|
had files that you wanted excised from history); unless you run
|
|
git push with the `--mirror` or `--prune` options, those refs
|
|
will continue to exist on the server. If folks then merge these
|
|
branches into others, then people have started mixing old and new
|
|
history. If users had already cloned these branches, removing
|
|
them from the server isn't enough; you need all users to delete
|
|
any local branches based on these refs and run fetch with the
|
|
`--prune` option as well. Simply re-cloning from a new URL is
|
|
easier.
|
|
|
|
* The server may not allow you to force push over some refs.
|
|
For example, code review systems may have special ref
|
|
namespaces (e.g. refs/changes/, refs/pull/,
|
|
refs/merge-requests/) that they have locked down.
|
|
|
|
5. If you still want to push your rewritten history back to the
|
|
original URL despite my warnings above, you'll have to manage it
|
|
very carefully:
|
|
|
|
* git-filter-repo deletes the "origin" remote to help avoid people
|
|
accidentally repushing to the same repository, so you'll need to
|
|
remind git what origin's URL was. You'll have to look up the
|
|
command for that.
|
|
|
|
* You'll need to carefully synchronize with *everyone* who has
|
|
cloned the repository, and will also need to carefully
|
|
synchronize with *everything* (e.g. CI systems) that has cloned
|
|
it. Every single clone will either need to be thrown away and
|
|
re-cloned, or need to take all the steps outlined in item 4 as
|
|
well as follow the necessary steps from the "RECOVERING FROM UPSTREAM
|
|
REBASE" section of linkgit:git-rebase[1]. If you miss fixing any
|
|
clones, you'll risk mixing old and new history and end up with an
|
|
even worse mess to clean up.
|
|
|
|
* Finally, you'll need to consult any documentation from your
|
|
hosting provider about how to remove any server-side references
|
|
to the old commits (example:
|
|
https://docs.gitlab.com/ee/user/project/repository/reducing_the_repo_size_using_git.html[GitLab's
|
|
docs on reducing repository size]).
|
|
|
|
6. (Optional) Some additional considerations
|
|
|
|
* filter-repo by default creates replace refs (see
|
|
linkgit:git-replace[1]) for each rewritten commit ID, allowing
|
|
you to use old (unabbreviated) commit hashes to refer to the
|
|
newly rewritten commits. If you want to use these replace refs,
|
|
push them to the relevant clone URL and tell users to adjust
|
|
their fetch refspec (e.g. `git config --add remote.origin.fetch
|
|
+refs/replace/*:refs/replace/*`). Sadly, some existing git servers
|
|
(e.g. Gerrit, GitHub) do not yet understand replace refs, and
|
|
thus one can't use old commit hashes within their UI; this may
|
|
change in the future. But replace refs at least help users
|
|
locally within the git CLI.
|
|
|
|
* If you have a central repo, you may want to prevent people
|
|
from pushing old commit IDs, in order to avoid mixing old
|
|
and new history. Every repository manager does this
|
|
differently; some provide specialized commands
|
|
(e.g. https://gerrit-review.googlesource.com/Documentation/cmd-ban-commit.html),
|
|
others require you to write hooks.
|
|
|
|
[[EXAMPLES]]
|
|
EXAMPLES
|
|
--------
|
|
|
|
Path based filtering
|
|
~~~~~~~~~~~~~~~~~~~~
|
|
|
|
To only keep the 'README.md' file plus the directories 'guides' and
|
|
'tools/releases/':
|
|
|
|
--------------------------------------------------
|
|
git filter-repo --path README.md --path guides/ --path tools/releases
|
|
--------------------------------------------------
|
|
|
|
Directory names can be given with or without a trailing slash, and all
|
|
filenames are relative to the toplevel of the repo. To keep all files
|
|
except these paths, just add `--invert-paths`:
|
|
|
|
--------------------------------------------------
|
|
git filter-repo --path README.md --path guides/ --path tools/releases --invert-paths
|
|
--------------------------------------------------
|
|
|
|
If you want to have both an inclusion filter and an exclusion filter, just
|
|
run filter-repo multiple times. For example, to keep the src/main
|
|
subdirectory but exclude files under src/main named 'data', run:
|
|
|
|
--------------------------------------------------
|
|
git filter-repo --path src/main/
|
|
git filter-repo --path-glob 'src/*/data' --invert-paths
|
|
--------------------------------------------------
|
|
|
|
Note that the asterisk (`*`) will match across multiple directories, so the
|
|
second command would remove e.g. src/main/org/whatever/data. Also, the
|
|
second command by itself would also remove e.g. src/not-main/foo/data, but
|
|
since src/not-main/ was removed by the first command, that's not an issue.
|
|
Also, the use of quotes around the asterisk is sometimes important to avoid
|
|
glob expansion by the shell.
|
|
|
|
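The point about the asterisk crossing directory boundaries can be checked with Python's `fnmatch` module. This sketch assumes glob matching behaves like `fnmatch` on whole paths; the path list is made up for illustration.

```python
import fnmatch

pattern = 'src/*/data'
paths = [
    'src/main/data',
    'src/main/org/whatever/data',
    'src/not-main/foo/data',
    'src/main/database.c',
    'data',
]

# fnmatchcase avoids platform-specific case and separator normalization
removed = [p for p in paths if fnmatch.fnmatchcase(p, pattern)]
for path in removed:
    print(path)
```

Here `*` matches `main`, `main/org/whatever`, and `not-main/foo` alike, which is why the glob alone would also touch `src/not-main/`.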
You can also select paths by regular expression (see
|
|
https://docs.python.org/3/library/re.html#regular-expression-syntax).
|
|
For example, to only include files from the repo whose name is in the
|
|
format YYYY-MM-DD.txt and is found at least two subdirectories deep:
|
|
|
|
--------------------------------------------------
|
|
git filter-repo --path-regex '^.*/.*/[0-9]{4}-[0-9]{2}-[0-9]{2}.txt$'
|
|
--------------------------------------------------
|
|
|
|
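To see which paths the regular expression above selects, it can be tried directly with Python's `re` module. The path list is invented for this sketch, and using `search` on the anchored pattern is an assumption about how matching is applied.

```python
import re

# Same pattern as in the command above
pattern = re.compile(r'^.*/.*/[0-9]{4}-[0-9]{2}-[0-9]{2}.txt$')

paths = [
    'logs/2019/2019-03-14.txt',   # two directories deep: kept
    'logs/2019-03-14.txt',        # only one directory deep: dropped
    '2019-03-14.txt',             # toplevel: dropped
    'a/b/c/2020-01-01.txt',       # deeper than two levels: kept
    'a/b/notes.txt',              # wrong name format: dropped
    'a/b/2020-01-01-txt',         # kept: the unescaped '.' matches any character
]

kept = [p for p in paths if pattern.search(p)]
for path in kept:
    print(path)
```

The last entry shows a subtlety of the pattern as written: the `.` before `txt` is not escaped, so it matches any single character; writing `\.txt` would restrict it to a literal dot.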
If you want two directories to be renamed (and maybe merged if both are
|
|
renamed to the same location), use --path-rename; for example, to rename
|
|
both 'cmds/' and 'src/scripts/' to 'tools/':
|
|
|
|
--------------------------------------------------
|
|
git filter-repo --path-rename cmds:tools --path-rename src/scripts/:tools/
|
|
--------------------------------------------------
|
|
|
|
As with `--path`, directories can be specified with or without a
|
|
trailing slash for `--path-rename`.
|
|
|
|
If you do a `--path-rename` to something that was already in use, it will
|
|
be silently overwritten. However, if you try to rename multiple files to
|
|
the same location (e.g. src/scripts/run_release.sh and cmds/run_release.sh
|
|
both existed and had different content with the renames above), then you
|
|
will be given an error. If you have such a case, you may want to add
|
|
another rename command to move one of the paths somewhere else where it
|
|
won't collide:
|
|
|
|
--------------------------------------------------
|
|
git filter-repo --path-rename cmds/run_release.sh:tools/do_release.sh \
|
|
--path-rename cmds/:tools/ \
|
|
--path-rename src/scripts/:tools/
|
|
--------------------------------------------------
|
|
|
|
Also, `--path-rename` brings up ordering issues; all path arguments are
|
|
applied in order. Thus, a command like
|
|
|
|
--------------------------------------------------
|
|
git filter-repo --path-rename sources/:src/main/ --path src/main/
|
|
--------------------------------------------------
|
|
|
|
would make sense but reversing the two arguments would not (src/main/ is
|
|
created by the rename so reversing the two would give you an empty repo).
|
|
Also, note that the rename of cmds/run_release.sh a couple examples ago was
|
|
done before the other renames.
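
The ordering rule can be illustrated with a toy model in Python. The function below is hypothetical and is not filter-repo's own code; it only assumes what the text above states, namely that path arguments are applied in the order given and that a filter is compared against the path as it stands at that point.

```python
def apply_path_args(path, args):
    """Apply ("rename", old, new) and ("filter", prefix) args in order.

    Returns the resulting path, or None if the path is filtered out.
    A toy model of the ordering rule, not filter-repo's implementation.
    """
    has_filter = any(arg[0] == "filter" for arg in args)
    wanted = not has_filter          # with no filters, every path is kept
    for arg in args:
        if arg[0] == "rename":
            _, old, new = arg
            if path.startswith(old):
                path = new + path[len(old):]
        elif path.startswith(arg[1]):
            wanted = True
    return path if wanted else None

rename = ("rename", b"sources/", b"src/main/")
keep = ("filter", b"src/main/")

print(apply_path_args(b"sources/app.c", [rename, keep]))  # b'src/main/app.c'
print(apply_path_args(b"sources/app.c", [keep, rename]))  # None
```

With the rename first, the file has already moved under src/main/ when the filter is checked; with the arguments reversed, nothing matches the filter and the file is dropped.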
|
|
|
|
Note that path renaming does not do path filtering, thus the following
|
|
command
|
|
|
|
--------------------------------------------------
|
|
git filter-repo --path src/main/ --path-rename tools/:scripts/
|
|
--------------------------------------------------
|
|
|
|
would not result in the tools or scripts directories being present, because
|
|
the single filter selected only src/main/. It's likely that you would
|
|
instead want to run:
|
|
|
|
--------------------------------------------------
|
|
git filter-repo --path src/main/ --path tools/ --path-rename tools/:scripts/
|
|
--------------------------------------------------
|
|
|
|
If you prefer to filter based solely on basename, use the `--use-base-name`
|
|
flag (though this is incompatible with `--path-rename`). For example, to
|
|
only include README.md and Makefile files from any directory:
|
|
|
|
--------------------------------------------------
|
|
git filter-repo --use-base-name --path README.md --path Makefile
|
|
--------------------------------------------------
|
|
|
|
If you wanted to delete all .DS_Store files in any directory, you could
|
|
either use:
|
|
|
|
--------------------------------------------------
|
|
git filter-repo --invert-paths --path '.DS_Store' --use-base-name
|
|
--------------------------------------------------
|
|
|
|
or
|
|
|
|
--------------------------------------------------
|
|
git filter-repo --invert-paths --path-glob '*/.DS_Store' --path '.DS_Store'
|
|
--------------------------------------------------
|
|
|
|
(the `--path-glob` isn't sufficient by itself as it might miss a toplevel
|
|
.DS_Store file; further while something like `--path-glob '*.DS_Store'`
|
|
would work around that problem, it would also grab files named `foo.DS_Store`
|
|
or `bar/baz.DS_Store`)
|
|
|
|
Finally, see also the `--filename-callback` from <<CALLBACKS>>.
|
|
|
|
Filtering based on many paths
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
If you have a long list of files, directories, globs, or regular
|
|
expressions to filter on, you can stick them in a file and use
|
|
`--paths-from-file`; for example, with a file named stuff-i-want.txt with
|
|
contents of
|
|
|
|
--------------------------------------------------
|
|
README.md
|
|
guides/
|
|
tools/releases
|
|
glob:*.py
|
|
regex:^.*/.*/[0-9]{4}-[0-9]{2}-[0-9]{2}\.txt$
|
|
tools/==>scripts/
|
|
regex:(.*)/([^/]*)/([^/]*)\.text$==>\2/\1/\3.txt
|
|
--------------------------------------------------
|
|
|
|
then you could run
|
|
|
|
--------------------------------------------------
|
|
git filter-repo --paths-from-file stuff-i-want.txt
|
|
--------------------------------------------------
|
|
|
|
to get a repo containing only the toplevel README.md file, the guides/
|
|
and tools/releases/ directories, all python files, files whose name
|
|
was of the form YYYY-MM-DD.txt at least two subdirectories deep, and
|
|
would rename tools/ to scripts/ and rename files like foo/bar/baz.text
|
|
to bar/foo/baz.txt. Note the special line prefixes of `glob:` and
|
|
`regex:` and the special string `==>` denoting renames.
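
The line format can be made concrete with a small parser. This is a sketch under stated assumptions: `parse_path_line` is a hypothetical helper rather than part of filter-repo, and it assumes the `==>` separator is split from the right.

```python
import re

def parse_path_line(line):
    """Split one line into (match_type, pattern, rename_target_or_None)."""
    target = None
    if b"==>" in line:
        line, target = line.rsplit(b"==>", 1)
    for prefix in (b"glob:", b"regex:"):
        if line.startswith(prefix):
            return prefix[:-1].decode(), line[len(prefix):], target
    return "literal", line, target

print(parse_path_line(b"glob:*.py"))          # ('glob', b'*.py', None)
print(parse_path_line(b"tools/==>scripts/"))  # ('literal', b'tools/', b'scripts/')

# The regex rename from the example file, applied to one sample path.
kind, pattern, target = parse_path_line(
    rb"regex:(.*)/([^/]*)/([^/]*)\.text$==>\2/\1/\3.txt")
print(re.sub(pattern, target, b"foo/bar/baz.text"))  # b'bar/foo/baz.txt'
```

The last call shows how the backreferences `\1`, `\2`, and `\3` on the right of `==>` swap the first two path components and change the extension.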
|
|
|
|
Sometimes you have a way of easily generating all the files you want.
|
|
For example, if you know that none of the currently tracked files have
|
|
any newlines or special characters in them (see core.quotePath from
|
|
`git config --help`) so that `git ls-files` would print all files
|
|
literally one per line, and you knew that you wanted to keep only the
|
|
files that are currently tracked (thus deleting from all commits in
|
|
history any files that only appear on other branches or that only
|
|
appear in older commits), then you could use a pair of commands such
|
|
as
|
|
|
|
--------------------------------------------------
|
|
git ls-files >../paths-i-want.txt
|
|
git filter-repo --paths-from-file ../paths-i-want.txt
|
|
--------------------------------------------------
|
|
|
|
Similarly, you could use --paths-from-file to delete many files. For
|
|
example, you could run `git filter-repo --analyze` to get reports,
|
|
look in one such as .git/filter-repo/analysis/path-deleted-sizes.txt
|
|
and copy all the filenames into a file such as
|
|
/tmp/files-i-dont-want-anymore.txt and then run
|
|
|
|
--------------------------------------------------
|
|
git filter-repo --invert-paths --paths-from-file /tmp/files-i-dont-want-anymore.txt
|
|
--------------------------------------------------
|
|
|
|
to delete them all.
|
|
|
|
Directory based shortcuts
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
Let's say you had a directory structure like the following:
|
|
|
|
   module/
      foo.c
      bar.c
   otherDir/
      blah.config
      stuff.txt
   zebra.jpg
|
|
|
|
If you wanted just the module/ directory and you wanted it to become the
|
|
new root so that your new directory structure looked like
|
|
|
|
   foo.c
   bar.c
|
|
|
|
then you could run:
|
|
|
|
--------------------------------------------------
|
|
git filter-repo --subdirectory-filter module/
|
|
--------------------------------------------------
|
|
|
|
If you wanted all the files from the original repo, but wanted to move
|
|
everything under a subdirectory named my-module/, so that your new
|
|
directory structure looked like
|
|
|
|
   my-module/
      module/
         foo.c
         bar.c
      otherDir/
         blah.config
         stuff.txt
      zebra.jpg
|
|
|
|
then you would instead run
|
|
|
|
--------------------------------------------------
|
|
git filter-repo --to-subdirectory-filter my-module/
|
|
--------------------------------------------------
|
|
|
|
Content based filtering
|
|
~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
If you want to filter out all files bigger than a certain size, you can use
|
|
`--strip-blobs-bigger-than` with some size (K, M, and G suffixes are
|
|
recognized), e.g.:
|
|
|
|
--------------------------------------------------
|
|
git filter-repo --strip-blobs-bigger-than 10M
|
|
--------------------------------------------------
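
As an illustration of how a size argument with a suffix could be turned into a byte count, here is a hypothetical helper. It assumes binary multiples (1K = 1024 bytes), which is an assumption of this sketch and not something stated above.

```python
# Assumed multipliers: binary units (an assumption of this sketch).
MULTIPLIERS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}

def parse_size(text):
    """Convert a string such as "10M" or "4096" into a number of bytes."""
    suffix = text[-1].upper()
    if suffix in MULTIPLIERS:
        return int(text[:-1]) * MULTIPLIERS[suffix]
    return int(text)

print(parse_size("10M"))   # 10485760
print(parse_size("4096"))  # 4096
```

Under that assumption, `10M` in the command above would mean that blobs larger than 10485760 bytes are stripped.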
|
|
|
|
If you want to strip out all files with specified git object ids (hashes),
|
|
list the hashes in a file and run
|
|
|
|
--------------------------------------------------
|
|
git filter-repo --strip-blobs-with-ids FILE_WITH_GIT_BLOB_IDS
|
|
--------------------------------------------------
|
|
|
|
If you want to modify file contents, you can do so based on a list of
|
|
expressions in a file, one per line. For example, with a file named
|
|
expressions.txt containing
|
|
|
|
--------------------------------------------------
|
|
p455w0rd
|
|
foo==>bar
|
|
glob:*666*==>
|
|
regex:\bdriver\b==>pilot
|
|
literal:MM/DD/YYYY==>YYYY-MM-DD
|
|
regex:([0-9]{2})/([0-9]{2})/([0-9]{4})==>\3-\1-\2
|
|
--------------------------------------------------
|
|
|
|
then running
|
|
--------------------------------------------------
|
|
git filter-repo --replace-text expressions.txt
|
|
--------------------------------------------------
|
|
|
|
will go through and replace `p455w0rd` with `***REMOVED***`, `foo` with
|
|
`bar`, any line containing `666` with a blank line, the word `driver` with
|
|
`pilot` (but not if it has letters before or after; e.g. `drivers` will be
|
|
unmodified), replace the exact text `MM/DD/YYYY` with `YYYY-MM-DD` and
|
|
replace date strings of the form MM/DD/YYYY with ones of the form
|
|
YYYY-MM-DD. In the expressions file, there are a few things to note:
|
|
|
|
* Every line has a replacement, given by whatever is on the right of
|
|
`==>`. If `==>` does not appear on the line, the default replacement
|
|
is `***REMOVED***`.
|
|
* Lines can start with `literal:`, `glob:`, or `regex:` to specify
|
|
whether to do literal string matches,
|
|
globs (see https://docs.python.org/3/library/fnmatch.html), or regular
|
|
expressions (see https://docs.python.org/3/library/re.html#regular-expression-syntax).
|
|
If none of these are specified, `literal:` is assumed.
|
|
* globs and regexes are applied to each line of the file; it is not
|
|
possible with --replace-text to match a multi-line string.
|
|
* If multiple matches are found on a line, all are replaced.
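
The rules above can be approximated in a few lines of Python. This is a simplified sketch, not filter-repo's implementation: `apply_expression` and `replace_text` are hypothetical names, `glob:` lines are not handled, and the sample text is invented.

```python
import re

DEFAULT_REPLACEMENT = b"***REMOVED***"

def apply_expression(data, line):
    """Apply one expressions-file line (literal or regex) to data."""
    replacement = DEFAULT_REPLACEMENT
    if b"==>" in line:
        line, replacement = line.rsplit(b"==>", 1)
    if line.startswith(b"regex:"):
        return re.sub(line[len(b"regex:"):], replacement, data)
    if line.startswith(b"literal:"):
        line = line[len(b"literal:"):]
    # No prefix means a literal match; glob: is omitted from this sketch.
    return data.replace(line, replacement)

def replace_text(data, expression_lines):
    for line in expression_lines:
        data = apply_expression(data, line)
    return data

expressions = [
    b"p455w0rd",
    b"foo==>bar",
    rb"regex:\bdriver\b==>pilot",
    b"literal:MM/DD/YYYY==>YYYY-MM-DD",
    rb"regex:([0-9]{2})/([0-9]{2})/([0-9]{4})==>\3-\1-\2",
]
sample = b"p455w0rd foo driver drivers MM/DD/YYYY 12/31/2019\n"
print(replace_text(sample, expressions))
# b'***REMOVED*** bar pilot drivers YYYY-MM-DD 2019-12-31\n'
```

Note that `drivers` is left alone because `\b` requires a word boundary after `driver`.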
|
|
|
|
See also the `--blob-callback` from <<CALLBACKS>>.
|
|
|
|
Refname based filtering
|
|
~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
To rename tags, use `--tag-rename`, e.g.:
|
|
|
|
--------------------------------------------------
|
|
git filter-repo --tag-rename foo:bar
|
|
--------------------------------------------------
|
|
|
|
This will rename any tags starting with `foo` to now start with `bar`.
|
|
Either side of the colon could be blank, e.g.
|
|
|
|
--------------------------------------------------
|
|
git filter-repo --tag-rename '':'my-module-'
|
|
--------------------------------------------------
|
|
|
|
For more general refname modification, see `--refname-callback` from
|
|
<<CALLBACKS>>.
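
The prefix rule described for `--tag-rename` can be written as a small Python function. This is a hypothetical illustration of the rule, not code taken from filter-repo; a function of this shape could also be adapted into the body of a refname callback.

```python
def rename_tag(refname, old_prefix, new_prefix):
    """Rename refs/tags/<old_prefix>* to refs/tags/<new_prefix>*."""
    full_old = b"refs/tags/" + old_prefix
    if refname.startswith(full_old):
        return b"refs/tags/" + new_prefix + refname[len(full_old):]
    return refname  # branches and non-matching tags are left untouched

print(rename_tag(b"refs/tags/foo-1.0", b"foo", b"bar"))   # b'refs/tags/bar-1.0'
print(rename_tag(b"refs/tags/v1.0", b"", b"my-module-"))  # b'refs/tags/my-module-v1.0'
print(rename_tag(b"refs/heads/foo", b"foo", b"bar"))      # b'refs/heads/foo'
```

An empty old prefix matches every tag, which is why `'':'my-module-'` prepends `my-module-` to all tag names.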
|
|
|
|
User and email based filtering
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
To modify username and emails of commits, you can create a mailmap
|
|
file in the format accepted by linkgit:git-shortlog[1]. For example,
|
|
if you have a file named my-mailmap you can run
|
|
|
|
--------------------------------------------------
|
|
git filter-repo --mailmap my-mailmap
|
|
--------------------------------------------------
|
|
|
|
and if the current contents of that file are as follows (if the
|
|
specified mailmap file is version controlled, historical versions of
|
|
the file are ignored):
|
|
|
|
--------------------------------------------------
|
|
Name For User <email@addre.ss>
|
|
<new@ema.il> <old1@ema.il>
|
|
New Name And <new@ema.il> <old2@ema.il>
|
|
New Name And <new@ema.il> Old Name And <old3@ema.il>
|
|
--------------------------------------------------
|
|
|
|
then we can update username and/or emails based on the specified
|
|
mapping.
|
|
|
|
See also the `--name-callback` and `--email-callback` from
|
|
<<CALLBACKS>>.
|
|
|
|
Parent rewriting
|
|
~~~~~~~~~~~~~~~~
|
|
|
|
To replace $commit_A with $commit_B (e.g. make all commits which had
|
|
$commit_A as a parent instead have $commit_B for that parent), and
|
|
rewrite history to make it permanent:
|
|
|
|
--------------------------------------------------
|
|
git replace $commit_A $commit_B
|
|
git filter-repo --force
|
|
--------------------------------------------------
|
|
|
|
To create a new commit with the same contents as $commit_A except with
|
|
different parent(s) and then replace $commit_A with the new commit,
|
|
and rewrite history to make it permanent:
|
|
|
|
--------------------------------------------------
|
|
git replace --graft $commit_A $new_parent_or_parents
|
|
git filter-repo --force
|
|
--------------------------------------------------
|
|
|
|
The reason to specify --force is two-fold: filter-repo will error out
|
|
if no arguments are specified, and the new graft commit would
|
|
otherwise trigger the not-a-fresh-clone check.
|
|
|
|
Partial history rewrites
|
|
~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
To rewrite the history on just one branch (which may cause it to no longer
|
|
share any common history with other branches), use `--refs`. For example,
|
|
to remove a file named 'extraneous.txt' from the 'master' branch:
|
|
|
|
--------------------------------------------------
|
|
git filter-repo --invert-paths --path extraneous.txt --refs master
|
|
--------------------------------------------------
|
|
|
|
To rewrite just some recent commits:
|
|
|
|
--------------------------------------------------
|
|
git filter-repo --invert-paths --path extraneous.txt --refs master~3..master
|
|
--------------------------------------------------
|
|
|
|
[[CALLBACKS]]
|
|
CALLBACKS
|
|
---------
|
|
|
|
For flexibility, filter-repo allows you to specify functions on the
|
|
command line to further filter all changes. Please note that there
|
|
are some API compatibility caveats associated with these callbacks
|
|
that you should be aware of before using them; see the "API BACKWARD
|
|
COMPATIBILITY CAVEAT" comment near the top of git-filter-repo source
|
|
code.
|
|
|
|
All callback functions are of the same general format. For a command line
|
|
argument like
|
|
|
|
--------------------------------------------------
|
|
--foo-callback 'BODY'
|
|
--------------------------------------------------
|
|
|
|
the following code will be compiled and called:
|
|
|
|
--------------------------------------------------
|
|
def foo_callback(foo):
  BODY
|
|
--------------------------------------------------
|
|
|
|
Thus, you just need to make sure your _BODY_ modifies and returns
|
|
_foo_ appropriately. One important thing to note for all callbacks is
|
|
that filter-repo uses bytestrings (see
|
|
https://docs.python.org/3/library/stdtypes.html#bytes) everywhere
|
|
instead of strings.
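
To make the "compiled and called" step concrete, here is one way a body given as a string could be wrapped into a function in Python. This is a sketch of the general idea only; `make_callback` is a hypothetical helper and the actual mechanism inside filter-repo may differ.

```python
import textwrap

def make_callback(kind, body):
    """Wrap BODY in 'def <kind>_callback(<kind>):' and compile it."""
    source = "def {0}_callback({0}):\n{1}\n".format(
        kind, textwrap.indent(body, "  "))
    namespace = {}
    exec(source, namespace)      # defines <kind>_callback inside namespace
    return namespace[kind + "_callback"]

name_callback = make_callback(
    "name", 'return name.replace(b"Wiliam", b"William")')
print(name_callback(b"Wiliam Shakespeare"))  # b'William Shakespeare'
```

Because the body becomes the inside of a function, it can span several lines and must use `return` wherever a value is expected back.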
|
|
|
|
There are four callbacks that allow you to operate directly on raw
|
|
objects that contain data that's easy to write in
|
|
linkgit:fast-import[1] format:
|
|
|
|
--------------------------------------------------
|
|
--blob-callback
|
|
--commit-callback
|
|
--tag-callback
|
|
--reset-callback
|
|
--------------------------------------------------
|
|
|
|
We'll come back to these later because it is often the case that the
|
|
other callbacks are more convenient. The other callbacks operate on a
|
|
small piece of the raw objects or operate on pieces across multiple
|
|
types of raw object (e.g. author names and committer names and tagger
|
|
names across commits and tags, or refnames across commits, tags, and
|
|
resets, or messages across commits and tags). The convenience
|
|
callbacks are:
|
|
|
|
--------------------------------------------------
|
|
--filename-callback
|
|
--message-callback
|
|
--name-callback
|
|
--email-callback
|
|
--refname-callback
|
|
--------------------------------------------------
|
|
|
|
in each you are expected to simply return a new value based on the one
|
|
passed in. For example,
|
|
|
|
--------------------------------------------------
|
|
git-filter-repo --name-callback 'return name.replace(b"Wiliam", b"William")'
|
|
--------------------------------------------------
|
|
|
|
would result in the following function being called:
|
|
|
|
--------------------------------------------------
|
|
def name_callback(name):
  return name.replace(b"Wiliam", b"William")
|
|
--------------------------------------------------
|
|
|
|
The email callback is quite similar:
|
|
|
|
--------------------------------------------------
|
|
git-filter-repo --email-callback 'return email.replace(b".cm", b".com")'
|
|
--------------------------------------------------
|
|
|
|
The refname callback is also similar, but note that the refname passed in
|
|
and returned are expected to be fully qualified (e.g. b"refs/heads/master"
|
|
instead of just b"master" and b"refs/tags/v1.0.7" instead of b"1.0.7"):
|
|
|
|
--------------------------------------------------
|
|
git-filter-repo --refname-callback '
  # Change e.g. refs/heads/master to refs/heads/prefix-master
  rdir,rpath = os.path.split(refname)
  return rdir + b"/prefix-" + rpath'
|
|
--------------------------------------------------
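
A callback body like this can be tried outside of filter-repo by wrapping it in an ordinary function. The sample refnames below are invented for illustration.

```python
import os

def refname_callback(refname):
    # Change e.g. refs/heads/master to refs/heads/prefix-master
    rdir, rpath = os.path.split(refname)
    return rdir + b"/prefix-" + rpath

print(refname_callback(b"refs/heads/master"))     # b'refs/heads/prefix-master'
print(refname_callback(b"refs/tags/v1.0.7"))      # b'refs/tags/prefix-v1.0.7'
print(refname_callback(b"refs/heads/feature/x"))  # b'refs/heads/feature/prefix-x'
```

The last call shows a consequence of using `os.path.split`: for refnames with extra slashes, the prefix is added to the final component only.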
|
|
|
|
The message callback is quite similar to the previous three callbacks,
|
|
though it operates on a bytestring that is likely more than one line:
|
|
|
|
--------------------------------------------------
|
|
git-filter-repo --message-callback '
  if b"Signed-off-by:" not in message:
    message += b"\nSigned-off-by: Me My <self@and.eye>"
  return re.sub(b"[Ee]-?[Mm][Aa][Ii][Ll]", b"email", message)'
|
|
--------------------------------------------------
|
|
|
|
The filename callback is slightly more interesting. Returning None means
|
|
the file should be removed from all commits, returning the filename
|
|
unmodified marks the file to be kept, and returning a different name means
|
|
the file should be renamed. An example:
|
|
|
|
--------------------------------------------------
|
|
git-filter-repo --filename-callback '
  if b"/src/" in filename:
    # Remove all files with a directory named "src" in their path
    # (except when "src" appears at the toplevel).
    return None
  elif filename.startswith(b"tools/"):
    # Rename tools/ -> scripts/misc/
    return b"scripts/misc/" + filename[6:]
  else:
    # Keep the filename and do not rename it
    return filename
  '
|
|
--------------------------------------------------
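
The three possible outcomes (remove, rename, keep) can be checked by running the same body as a plain Python function on a few invented filenames:

```python
def filename_callback(filename):
    if b"/src/" in filename:
        # Remove files with a non-toplevel directory named "src" in their path
        return None
    elif filename.startswith(b"tools/"):
        # Rename tools/ -> scripts/misc/
        return b"scripts/misc/" + filename[6:]
    else:
        # Keep the filename and do not rename it
        return filename

print(filename_callback(b"lib/src/util.c"))  # None (removed)
print(filename_callback(b"src/main.c"))      # b'src/main.c' (toplevel src is kept)
print(filename_callback(b"tools/build.sh"))  # b'scripts/misc/build.sh'
```

The toplevel `src/main.c` survives because its path does not contain `/src/` with a leading slash.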
|
|
|
|
In contrast, the blob, reset, tag, and commit callbacks are not
|
|
expected to return a value, but are instead expected to modify the
|
|
object passed in. Major fields for these objects are (subject to API
|
|
backward compatibility caveats mentioned previously):
|
|
|
|
* Blob: `original_id` (original hash) and `data`
|
|
* Reset: `ref` (name of reference) and `from_ref` (hash or integer mark)
|
|
* Tag: `ref`, `from_ref`, `original_id`, `tagger_name`, `tagger_email`,
|
|
`tagger_date`, `message`
|
|
* Commit: `branch`, `original_id`, `author_name`, `author_email`,
|
|
`author_date`, `committer_name`, `committer_email`,
|
|
`committer_date`, `message`, `file_changes` (list of
|
|
FileChange objects, each containing a `type`, `filename`,
|
|
`mode`, and `blob_id`), `parents` (list of hashes or integer
|
|
marks)
|
|
|
|
An example of each:
|
|
|
|
--------------------------------------------------
|
|
git filter-repo --blob-callback '
  if len(blob.data) > 25:
    # Mark this blob for removal from all commits
    blob.skip()
  else:
    blob.data = blob.data.replace(b"Hello", b"Goodbye")
  '
|
|
--------------------------------------------------
|
|
|
|
--------------------------------------------------
|
|
git filter-repo --reset-callback 'reset.ref = reset.ref.replace(b"master", b"dev")'
|
|
--------------------------------------------------
|
|
|
|
--------------------------------------------------
|
|
git filter-repo --tag-callback '
  if tag.tagger_name == b"Jim Williams":
    # Omit this tag
    tag.skip()
  else:
    tag.message = tag.message + b"\n\nTag of %s by %s on %s" % (tag.ref, tag.tagger_email, tag.tagger_date)'
|
|
--------------------------------------------------
|
|
|
|
--------------------------------------------------
|
|
git filter-repo --commit-callback '
  # Remove executable files with three 6s in their name (including
  # from leading directories).
  # Also, undo deletion of sources/foo/bar.txt (change types are
  # either b"D" (deletion) or b"M" (add or modify); renames are
  # handled by deleting the old file and adding a new one)
  commit.file_changes = [
         change for change in commit.file_changes
         if not (change.mode == b"100755" and
                 change.filename.count(b"6") == 3) and
            not (change.type == b"D" and
                 change.filename == b"sources/foo/bar.txt")]
  # Mark all .sh files as executable; modes in git are always one of
  # 100644 (normal file), 100755 (executable), 120000 (symlink), or
  # 160000 (submodule)
  for change in commit.file_changes:
    if change.filename.endswith(b".sh"):
      change.mode = b"100755"
  '
|
|
--------------------------------------------------
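
The list-filtering logic of the commit callback can be exercised without a repository by using a stand-in for the change objects. `FakeChange` below is a hypothetical class with only the three fields the example touches; it is not filter-repo's FileChange, and the sample changes are invented.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FakeChange:
    """Hypothetical stand-in for a file change; not filter-repo's class."""
    type: bytes
    filename: bytes
    mode: Optional[bytes]

def filter_changes(file_changes):
    # Same conditions as the commit callback above.
    kept = [
        change for change in file_changes
        if not (change.mode == b"100755" and
                change.filename.count(b"6") == 3) and
           not (change.type == b"D" and
                change.filename == b"sources/foo/bar.txt")]
    for change in kept:
        if change.filename.endswith(b".sh"):
            change.mode = b"100755"
    return kept

changes = [
    FakeChange(b"M", b"bin/666-tool", b"100755"),      # dropped: executable, three 6s
    FakeChange(b"D", b"sources/foo/bar.txt", None),    # dropped: deletion is undone
    FakeChange(b"M", b"run.sh", b"100644"),            # kept, made executable
    FakeChange(b"M", b"README.md", b"100644"),         # kept unchanged
]
result = filter_changes(changes)
print([(c.filename, c.mode) for c in result])
# [(b'run.sh', b'100755'), (b'README.md', b'100644')]
```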
|
|
|
|
[[INTERNALS]]
|
|
INTERNALS
|
|
---------
|
|
|
|
You probably don't need to read this section unless you are just very
|
|
curious or you are trying to do a very complex history rewrite.
|
|
|
|
How filter-repo works
|
|
~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
Roughly, filter-repo works by running
|
|
|
|
--------------------------------------------------
|
|
git fast-export <options> | filter | git fast-import <options>
|
|
--------------------------------------------------
|
|
|
|
where filter-repo not only launches the whole pipeline but also serves as
|
|
the _filter_ in the middle. However, filter-repo does a few additional
|
|
things on top in order to make it into a well-rounded filtering tool. A
|
|
sequence that more accurately reflects what filter-repo runs is:
|
|
|
|
1. Verify we're in a fresh clone
|
|
2. `git fetch -u . refs/remotes/origin/*:refs/heads/*`
|
|
3. `git remote rm origin`
|
|
4. `git fast-export --show-original-ids --reference-excluded-parents --fake-missing-tagger --signed-tags=strip --tag-of-filtered-object=rewrite --use-done-feature --no-data --reencode=yes --mark-tags --all | filter | git -c core.ignorecase=false fast-import --force --quiet`
|
|
5. `git update-ref --no-deref --stdin`, fed with a list of refs to nuke, and a list of replace refs to delete, create, or update.
|
|
6. `git reset --hard`
|
|
7. `git reflog expire --expire=now --all`
|
|
8. `git gc --prune=now`
|
|
|
|
Some notes or exceptions on each of the above:
|
|
|
|
1. If we're not in a fresh clone, users will not be able to recover if
|
|
they used the wrong command or ran in the wrong repo. (Though
|
|
`--force` overrides this check, and it's also off if you've already
|
|
run filter-repo once in this repo.)
|
|
2. Technically, we actually use a `git update-ref` command fed with a lot
|
|
of input due to the fact that users can use `--force` when local
|
|
branches might not match remote branches. But this fetch command
|
|
catches the intent rather succinctly.
|
|
3. We don't want users accidentally pushing back to the original repo, as
|
|
discussed in <<DISCUSSION>>. It also reminds users that since history
|
|
has been rewritten, this repo is no longer compatible with the
|
|
original. Finally, another minor benefit is this allows users to push
|
|
with the `--mirror` option to their new home without accidentally
|
|
sending remote tracking branches.
|
|
4. Some of these flags are always used but others are actually
|
|
conditional. For example, filter-repo's `--replace-text` and
|
|
`--blob-callback` options need to work on blobs so `--no-data` cannot
|
|
be passed to fast-export. But when we don't need to work on blobs,
|
|
passing `--no-data` speeds things up. Also, other flags may change
|
|
the structure of the pipeline as well (e.g. `--dry-run` and `--debug`)
|
|
5. We use this step to write replace refs for accessing the newly written
|
|
commit hashes using their previous names. Also, if refs were renamed
|
|
by various steps, we need to delete the old refnames in order to avoid
|
|
mixing old and new history.
|
|
6. Users also have old versions of files in their working tree and index;
|
|
we want those cleaned up to match the rewritten history as well. Note
|
|
that this step is skipped in bare repos.
|
|
7. Reflogs will hold on to old history, so we need to expire them.
|
|
8. We need to gc to avoid mixing new and old history. Also, it shrinks
|
|
the repository for users, so they don't have to do extra work. (Odds
|
|
are that they've only rewritten trees and commits and maybe a few
|
|
blobs, so `--aggressive` isn't needed and would be too slow.)
|
|
|
|
Information about these steps is printed out when `--debug` is passed
|
|
to filter-repo. When doing a `--partial` history rewrite, steps 2, 3,
|
|
7, and 8 are unconditionally skipped, step 5 is skipped if
|
|
`--replace-refs` is `update-no-add`, and just the nuke-unused-refs
|
|
portion of step 5 is skipped if `--replace-refs` is something else.
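
The conditional use of `--no-data` described in note 4 can be sketched as a function that assembles the fast-export command line. This is an illustration only: `fast_export_args` is a hypothetical helper, it uses just the flags listed in step 4, and it treats every flag other than `--no-data` as fixed, which is a simplification.

```python
def fast_export_args(needs_blob_data):
    """Build a fast-export command line from the flags listed in step 4."""
    args = [
        "git", "fast-export",
        "--show-original-ids", "--reference-excluded-parents",
        "--fake-missing-tagger", "--signed-tags=strip",
        "--tag-of-filtered-object=rewrite", "--use-done-feature",
    ]
    if not needs_blob_data:
        # Skipping blob contents is faster when no option inspects them.
        args.append("--no-data")
    args += ["--reencode=yes", "--mark-tags", "--all"]
    return args

# Options such as --replace-text need blob contents, so --no-data is omitted.
print("--no-data" in fast_export_args(needs_blob_data=True))   # False
print("--no-data" in fast_export_args(needs_blob_data=False))  # True
```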
|
|
|
|
Limitations
|
|
~~~~~~~~~~~
|
|
|
|
Inherited limitations
|
|
^^^^^^^^^^^^^^^^^^^^^
|
|
|
|
Since git filter-repo calls fast-export and fast-import to do a lot of the
|
|
heavy lifting, it inherits limitations from those systems:
|
|
|
|
* extended commit headers, if any, are stripped
|
|
* commits get rewritten meaning they will have new hashes; therefore,
|
|
signatures on commits and tags cannot continue to work and instead are
|
|
just removed (thus signed tags become annotated tags)
|
|
* tags of commits are supported. Prior to git-2.24.0, tags of blobs and
|
|
tags of tags are not supported (fast-export would die on such tags).
|
|
Tags of trees are not supported in any git version (since fast-export
|
|
ignores tags of trees with a warning and fast-import provides no way to
|
|
import them).
|
|
* annotated and signed tags outside of the refs/tags/ namespace are not
|
|
supported (their location will be mangled in weird ways)
|
|
* fast-import will die on various forms of invalid input, such as a
|
|
timezone with more than four digits
|
|
* fast-export cannot reencode commit messages into UTF-8 if the commit
|
|
message is not valid in its specified encoding (in such cases, it'll
|
|
leave the commit message and the encoding header alone).
|
|
* commits without an author will be given one matching the committer
|
|
* tags without a tagger will be given a fake tagger
|
|
* references that include commit cycles in their history (which can be
|
|
created with linkgit:git-replace[1]) will not be flagged to the user as
|
|
an error but will be silently deleted by fast-export as though the
|
|
branch or tag contained no interesting files
|
|
|
|
There are also some limitations due to the design of these systems:
|
|
|
|
* Trying to insert additional files into the stream can be tricky; since
|
|
fast-export only lists file changes in a merge relative to its first
|
|
parent, if you insert additional files into a commit that is in the
|
|
second (or third or fourth) parent history of a merge, then you also
|
|
need to add it to the merge manually. (Similarly, if you change which
|
|
parent is the first parent in a merge commit, you need to manually
|
|
update the list of file changes to be relative to the new first
|
|
parent.)
|
|
|
|
* fast-export and fast-import work with exact file contents, not patches.
|
|
(e.g. "Whatever the current contents of this file, update them to now
|
|
have these contents") Because of this, removing the changes made in a
|
|
single commit or inserting additional changes to a file in some commit
|
|
and expecting them to propagate forward is not something that can be
|
|
done with these tools. Use linkgit:git-rebase[1] for that.
|
|
|
|
Intrinsic limitations
|
|
^^^^^^^^^^^^^^^^^^^^^
|
|
|
|
Some types of filtering have limitations that would affect any tool
|
|
attempting to perform them; the most any tool can do is attempt to notify
|
|
the user when it detects an issue:
|
|
|
|
* When rewriting commit hashes in commit messages, there are a variety
|
|
of cases when the hash will not be updated (whenever this happens, a
|
|
note is written to `.git/filter-repo/suboptimal-issues`):
|
|
** if a commit hash does not correspond to a commit in the old repo
|
|
** if a commit hash corresponds to a commit that gets pruned
|
|
** if an abbreviated hash is not unique
|
|
|
|
* Pruning of empty commits can cause a merge commit to lose an entire
|
|
ancestry line and become a non-merge. If the merge commit had no
|
|
changes then it can be pruned too, but if it still has changes it needs
|
|
to be kept. This might cause minor confusion since the commit will
|
|
likely have a commit message that makes it sound like a merge commit
|
|
even though it's not. (Whenever a merge commit becomes a non-merge
|
|
commit, a note is written to `.git/filter-repo/suboptimal-issues`)
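
The three cases in which a hash in a commit message is left alone can be illustrated with a small Python sketch. Everything here is hypothetical: the `rewrite_hashes` helper, the minimum abbreviation length of 7, and the made-up hashes are assumptions for illustration, not filter-repo's implementation.

```python
import re

# Assumed shape of a (possibly abbreviated) hash: 7 to 40 hex digits.
HASH_RE = re.compile(rb"\b[0-9a-f]{7,40}\b")

def rewrite_hashes(message, old_to_new):
    """Translate old commit hashes in message; leave untranslatable ones."""
    def translate(match):
        abbrev = match.group(0)
        candidates = [old for old in old_to_new if old.startswith(abbrev)]
        if len(candidates) != 1:
            return abbrev          # unknown commit or ambiguous abbreviation
        new = old_to_new[candidates[0]]
        if new is None:
            return abbrev          # the commit was pruned
        return new[:len(abbrev)]
    return HASH_RE.sub(translate, message)

# Made-up 40-character hashes; None marks a commit that was pruned.
mapping = {
    b"1111111" + b"a" * 33: b"9999999" + b"b" * 33,
    b"2222222" + b"c" * 33: None,
}
message = b"Revert 1111111 and 2222222; see also deadbeef"
print(rewrite_hashes(message, mapping))
# b'Revert 9999999 and 2222222; see also deadbeef'
```

Only the first hash is rewritten: the second refers to a pruned commit and the third does not correspond to any known commit.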
|
|
|
|
Issues specific to filter-repo
|
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
|
|
* Multiple repositories in the wild have been observed which use a bogus
|
|
timezone (`+051800`); google will find you some reports. The intended
|
|
timezone wasn't clear or wasn't always the same. filter-repo replaces it with
a different bogus timezone that fast-import will accept (`+0261`).
|
|
|
|
* `--path-rename` can result in pathname collisions; to avoid excessive
|
|
memory requirements of tracking which files are in all commits or
|
|
looking up what files exist with either every commit or every usage of
|
|
--path-rename, we just tell the user that they might clobber other
|
|
changes if they aren't careful. We can check if the clobbering comes
|
|
from another --path-rename without much overhead. (Perhaps in the
|
|
future it's worth adding a slow mode to --path-rename that will do the
|
|
more exhaustive checks?)
|
|
|
|
* There is no mechanism for directly controlling which flags are passed
|
|
to fast-export (or fast-import); only pre-defined flags can be turned
|
|
on or off as a side-effect of other options. Direct control would make
|
|
little sense because some options like `--full-tree` would require
|
|
additional code in filter-repo (to parse new directives), and others
|
|
such as `-M` or `-C` would break assumptions used in other places of
|
|
filter-repo.
|
|
|
|
* Partial-repo filtering, while supported, runs counter to filter-repo's
|
|
"avoid mixing old and new history" design. This support has required
|
|
improvements to core git as well (e.g. it depends upon the
|
|
`--reference-excluded-parents` option to fast-export that was added
|
|
specifically for this usage within filter-repo). The `--partial` and
|
|
`--refs` options will continue to be supported since there are people
|
|
with use cases for them; however, I am concerned that this inconsistency
|
|
about mixing old and new history seems likely to lead to user mistakes.
|
|
For now, I just hope that long explanations of caveats in the
|
|
documentation of these options suffice to curtail any such problems.
|
|
|
|
Comments on reversibility
|
|
^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
|
|
Some people are interested in reversibility of a rewrite; e.g. rewrite
|
|
history, possibly add some commits, then unrewrite and get the original
|
|
history back plus a few new "unrewritten" commits. Obviously this is
|
|
impossible if your rewrite involves throwing away information
|
|
(e.g. filtering out files or replacing several different strings with
|
|
`***REMOVED***`), but may be possible with some rewrites. filter-repo is
|
|
likely to be a poor fit for this type of workflow for a few reasons:
|
|
|
|
* most of the limitations inherited from fast-export and fast-import
|
|
are of a type that cause reversibility issues
|
|
* grafts and replace refs, if present, are used in the rewrite and made
|
|
permanent
|
|
* rewriting of commit hashes will probably be reversible, but it is
|
|
possible for rewritten abbreviated hashes to not be unique even if the
|
|
original abbreviated hashes were.
|
|
* filter-repo defaults to several forms of irreversible rewriting that
you may need to turn off (e.g. the last two bullet points above or
reencoding commit messages into UTF-8); it's possible that additional
forms of irreversible rewrites will be added in the future.
|
|
* I assume that people use filter-repo for one-shot conversions, not
|
|
ongoing data transfers. I explicitly reserve the right to change any
|
|
API in filter-repo based on this presumption (and a comment to this
|
|
effect is found in multiple places in the code and examples). You
|
|
have been warned.
|
|
|
|
SEE ALSO
|
|
--------
|
|
linkgit:git-rebase[1], linkgit:git-filter-branch[1]
|
|
|
|
GIT
|
|
---
|
|
Part of the linkgit:git[1] suite
|