diff --git a/Documentation/git-filter-repo.txt b/Documentation/git-filter-repo.txt new file mode 100644 index 0000000..eeee5b4 --- /dev/null +++ b/Documentation/git-filter-repo.txt @@ -0,0 +1,1091 @@ +git-filter-repo(1) +================== + +NAME +---- +git-filter-repo - Rewrite repository history + +SYNOPSIS +-------- +[verse] +'git filter-repo' --analyze +'git filter-repo' [] [] + [] [] + [] [] + [] [] + +DESCRIPTION +----------- + +Rapidly rewrite entire repository history using user-specified filters. +This is a destructive operation which should not be used lightly; it +writes new commits, trees, tags, and blobs corresponding to (but +filtered from) the original objects in the repository, then deletes the +original history and leaves only the new. See <> for more +details on the ramifications of using this tool. Several different +types of history rewrites are possible; examples include (but are not +limited to): + + * stripping large files (or large directories or large extensions) + * stripping unwanted files by path + * extracting wanted paths and their history (stripping everything else) + * restructuring the file layout (such as moving all files into a + subdirectory in preparation for merging with another repo, making a + subdirectory become the new toplevel directory, or merging two + directories with independent filenames into one directory) + * renaming tags (also often in preparation for merging with another repo) + * replacing or removing sensitive text such as passwords + * making mailmap rewriting of user names or emails permanent + * making grafts or replacement refs permanent + * rewriting commit messages + +Additionally, several concerns are handled automatically (many of these +can be overridden, but they are all on by default): + + * rewriting (possibly abbreviated) hashes in commit messages to + refer to the new post-rewrite commit hashes + * pruning commits which become empty due to the above filters (also + handles edge cases like pruning 
of merge commits which become + degenerate and empty) + * creating replace-refs (see linkgit:git-replace[1]) for old commit + hashes, which if pushed and fetched will allow users to continue to + refer to new commits using (unabbreviated) old commit IDs + * stripping of original history to avoid mixing old and new history + * repacking the repository post-rewrite to shrink the repo for the + user + +Also, it's worth noting that there is an important safety mechanism: + + * abort if run from a repo that is not a fresh clone (to prevent + accidental data loss from rewriting local history that doesn't + exist anywhere else) + +For those who know that there is large unwanted stuff in their history +and want help finding it, this command also + + * provides an option to analyze a repository and generate reports that + can be useful in determining what to filter (or in determining + whether a separate filtering command was successful). + +See also <>, <>, <>, and +<>. + +OPTIONS +------- + +Analysis Options +~~~~~~~~~~~~~~~~ + +--analyze:: + Analyze repository history and create a report that may be + useful in determining what to filter in a subsequent run (or + in determining if a previous filtering command did what you + wanted). Will not modify your repo. + +Filtering based on paths (see also --filename-callback) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +--invert-paths:: + Invert the selection of files from the specified + --path-{match,glob,regex} options below, i.e. only select + files matching none of those options. + +--path-match :: +--path :: + Exact paths (files or directories) to include in filtered + history. Multiple --path options can be specified to get a + union of paths. + +--path-glob :: + Glob of paths to include in filtered history. Multiple + --path-glob options can be specified to get a union of paths. + +--path-regex :: + Regex of paths to include in filtered history. 
Multiple + --path-regex options can be specified to get a union of paths. + +--use-base-name:: + Match on file base name instead of full path from the top of + the repo. Incompatible with --path-rename. + +Renaming based on paths (see also --filename-callback) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +--path-rename :: +--path-rename-match :: + Path to rename; if filename or directory matches + rename to . Multiple --path-rename options can be + specified. + +Path shortcuts +~~~~~~~~~~~~~~ + +--paths-from-file :: + Specify several path filtering and renaming directives, one + per line. Lines with `==>` in them specify path renames, and + lines can begin with `literal:` (the default), `glob:`, or + `regex:` to specify different matching styles + +--subdirectory-filter :: + Only look at history that touches the given subdirectory and + treat that directory as the project root. Equivalent to using + `--path / --path-rename /:` + +--to-subdirectory-filter :: + Treat the project root as instead being under + . Equivalent to using `--path-rename :/` + +Content editing filters (see also --blob-callback) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +--replace-text :: + A file with expressions that, if found, will be replaced. By + default, each expression is treated as literal text, but + `regex:` and `glob:` prefixes are supported. You can end the + line with `==>` and some replacement text to choose a + replacement choice other than the default of `***REMOVED***`. + +--strip-blobs-bigger-than :: + Strip blobs (files) bigger than specified size (e.g. `5M`, + `2G`, etc) + +--strip-blobs-with-ids :: + Read git object ids from each line of the given file, and + strip all of them from history + +Renaming of refs (see also --refname-callback) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +--tag-rename :: + Rename tags starting with to start with . For example, + --tag-rename foo:bar will rename tag foo-1.2.3 to bar-1.2.3; + either or can be empty. 
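The prefix semantics of `--tag-rename` described above can be sketched in Python. This is only an illustration under stated assumptions, not filter-repo's actual implementation: the helper name `rename_tag` is hypothetical, and refnames are assumed to be fully qualified bytestrings such as `refs/tags/foo-1.2.3`.

```python
def rename_tag(refname, old, new):
    """Swap the prefix `old` for `new` in a tag's short name.

    Hypothetical helper for illustration; not part of filter-repo.
    """
    tag_prefix = b"refs/tags/"
    if not refname.startswith(tag_prefix):
        return refname  # not a tag: leave it untouched
    short = refname[len(tag_prefix):]
    if not short.startswith(old):
        return refname  # tag does not start with the old prefix
    return tag_prefix + new + short[len(old):]


# Mirrors `--tag-rename foo:bar`
renamed = rename_tag(b"refs/tags/foo-1.2.3", b"foo", b"bar")
# An empty old prefix simply prepends the new one
prefixed = rename_tag(b"refs/tags/v1.0", b"", b"my-module-")
# Branches are not affected
untouched = rename_tag(b"refs/heads/foo", b"foo", b"bar")
```

Under these assumptions, an empty old prefix matches every tag, which is why either side of the colon may be left empty.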
+ +Filtering of commit messages (see also --message-callback) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +--preserve-commit-hashes:: + By default, since commits are rewritten and thus gain new + hashes, references to old commit hashes in commit messages are + replaced with new commit hashes (abbreviated to the same + length as the old reference). Use this flag to turn off + updating commit hashes in commit messages. + +--preserve-commit-encoding:: + Do not re-encode commit messages into UTF-8. By default, if the + commit object specifies an encoding for the commit message, + the message is re-encoded into UTF-8. + +Filtering of names & emails (see also --name-callback and --email-callback) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +--mailmap :: + Use specified mailmap file (see linkgit:git-shortlog[1] for details + on the format) when rewriting author, committer, and tagger names + and emails. If the specified file is part of git history, + historical versions of the file will be ignored; only the current + contents are consulted. + +--use-mailmap:: + Same as: '--mailmap .mailmap' + +Parent rewriting +~~~~~~~~~~~~~~~~ + +--replace-refs {delete-no-add, delete-and-add, update-no-add, update-or-add, update-and-add}:: + Replace refs (see linkgit:git-replace[1]) are used to rewrite + parents (unless turned off by the usual git mechanism); this + flag specifies what to do with those refs afterward. Replace + refs can either be deleted or updated to point at new commit + hashes. Also, new replace refs can be added for each commit + rewrite. With 'update-or-add', new replace refs are only + added for commit rewrites that aren't used to update an + existing replace ref. The default is 'update-and-add' if + $GIT_DIR/filter-repo/already_ran does not exist; + 'update-or-add' otherwise. + +--prune-empty {always, auto, never}:: + Whether to prune empty commits.
'auto' (the default) means + only prune commits which become empty (not commits which were + empty in the original repo, unless their parent was + pruned). When the parent of a commit is pruned, the first + non-pruned ancestor becomes the new parent. + +--prune-degenerate {always, auto, never}:: + Since merge commits are needed for history topology, they are + typically exempt from pruning. However, they can become + degenerate with the pruning of other commits (having fewer + than two parents, having one commit serve as both parents, or + having one parent as the ancestor of the other). If such merge + commits have no file changes, they can be pruned. The default + ('auto') is to only prune empty merge commits which become + degenerate (not which started as such). + +Generic callback code snippets +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +--filename-callback :: + Python code body for processing filenames; see <>. + +--message-callback :: + Python code body for processing messages (both commit messages and + tag messages); see <>. + +--name-callback :: + Python code body for processing names of people; see <>. + +--email-callback :: + Python code body for processing email addresses; see + <>. + +--refname-callback :: + Python code body for processing refnames; see <>. + +--blob-callback :: + Python code body for processing blob objects; see <>. + +--commit-callback :: + Python code body for processing commit objects; see <>. + +--tag-callback :: + Python code body for processing tag objects; see <>. + +--reset-callback :: + Python code body for processing reset objects; see <>. + +Location to filter from/to +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +NOTE: Specifying alternate source or target locations will disable some +auxiliary steps such as disconnecting the origin remote and avoiding +mixing new and old history.
+ +--source :: + Git repository to read from + +--target :: + Git repository to overwrite with filtered history + +Miscellaneous options +~~~~~~~~~~~~~~~~~~~~~ + +--help:: +-h:: + Show a help message and exit. + +--force:: +-f:: + Rewrite history even if the current repo does not look like a fresh + clone. + +--dry-run:: + Do not change the repository. Run `git fast-export` and filter its + output, and save both the original and the filtered version for + comparison. This also disables rewriting commit messages due to + not knowing new commit IDs and disables filtering of some empty + commits due to inability to query the fast-import backend. + +--debug:: + Print additional information about operations being performed and + commands being run. (If used together with --dry-run, shows + extra information about what would be run). + +--stdin:: + Instead of running `git fast-export` and filtering its output, + filter the fast-export stream from stdin. The stdin must be in + the expected input format (e.g. it needs to include original-oid + directives). + +--quiet:: + Pass --quiet to other git commands called. + +[[VERSATILITY]] +VERSATILITY +----------- + +filter-repo has a hierarchy of capabilities on the spectrum from easy to +use convenience flags that perform pre-defined types of filtering, to +choices that provide lots of flexibility in controlling how filtering +occurs. This spectrum includes the following: + + * Convenience flags making common types of history rewriting simple (e.g. + --path, --strip-blobs-bigger-than, --replace-text, --mailmap) + * Options which are shorthand for others or which provide greater control + than others (e.g. 
--subdirectory-filter could just be written using + both a path selection (--path) and a path rename (--path-rename) + filter; --paths-from-file can handle all other --path* options and more + such as regex renaming of paths) + * Generic python callbacks for handling a certain type of data (the + filename, message, name, email, and refname callbacks) + * Generic python callbacks for handling fundamental git objects, allowing + greater control over the combination of data types the object holds + (the commit, tag, blob, and reset callbacks) + * The ability to import filter-repo as a module in a python program and + use its classes and functions for even greater control and flexibility + while still leveraging lots of basic capabilities. One can even use + this to write new tools with a completely different interface. + +For more information about callbacks, see <>. For examples on +writing python programs that import filter-repo as a module to create new +history rewriting tools, look at the contrib/filter-repo-demos/ directory. +That directory includes, among other examples, a reimplementation of +git-filter-branch which is faster than git-filter-branch, and a +reimplementation of BFG Repo Cleaner with several bug fixes and new +features. + +[[DISCUSSION]] +DISCUSSION +---------- + +Using filter-repo is relatively simple, but rewriting history is part of +a larger discussion in terms of collaboration. When you rewrite +history, the old and new histories are no longer compatible; if you push +this history somewhere for others to view, it will look as though you've +done a rebase of all branches and tags. Make sure you are familiar with +the "RECOVERING FROM UPSTREAM REBASE" section of linkgit:git-rebase[1] +(and in particular, "The hard case") before proceeding, in addition to +this section. + +Steps to use git-filter-repo as part of the bigger picture of doing a +history rewrite are roughly as follows: + +1. 
Create a clone of your repository (if you created special refs outside + of refs/heads/ or refs/tags/, make sure to fetch those too). Note + that `--bare` and `--mirror` clones are supported too, if you prefer. + +2. (Optional) Run `git filter-repo --analyze`. This will create a + directory of reports mentioning renames that have occurred in your + repo and also listing sizes of objects aggregated by + path/directory/extension/blob-id; this information may be useful in + choosing how to filter your repo. It can also be useful to re-run + --analyze after filtering to verify the changes look correct. + +3. Run filter-repo with your desired filtering options. Many examples + are given below. For more complex cases, note that doing the + filtering in multiple steps (by running multiple filter-repo + invocations in a sequence) is supported. If anything goes wrong here, + simply delete your clone and restart. + +4. Push your new repository to its new home (note that + refs/remotes/origin/* will have been moved to refs/heads/* as the + first part of filter-repo, so you can just deal with normal branches + instead of remote tracking branches). While you can force push this + to the same URL you cloned from, there are good reasons to consider + pushing to a different location instead: + + * People who cloned from the original repo will have old history. + When they fetch the new history you force pushed up, unless they + do a `git reset --hard @{u}` on their branches or rebase their + local work, git will think they have hundreds or thousands of + commits with very similar commit messages as what exist upstream + (but which include files you wanted excised from history), and + allow the user to merge the two histories, resulting in what + looks like two copies of each commit. If they then push this + history back up, then everyone now has history with two copies of + each commit and the bad files have returned. 
You're more likely + to succeed in forcing people to get rid of the old history if + they have to clone a new URL. + + * Rewriting history will rewrite tags; those who have already + downloaded tags will not get the updated tags by default (see the + "On Re-tagging" section of linkgit:git-tag[1]). Every user + trying to use an existing clone will have to forcibly delete all + tags and re-fetch them; it may be easier for them to just + re-clone, which they are more likely to do with a new clone URL. + + * Rewriting history may delete some refs (e.g. branches that only + had files that you wanted excised from history); unless you run + git push with the `--mirror` or `--prune` options, those refs + will continue to exist on the server. If folks then merge these + branches into others, then people have started mixing old and new + history. If users had already cloned these branches, removing + them from the server isn't enough; you need all users to delete + any local branches based on these refs and run fetch with the + `--prune` option as well. Simply re-cloning from a new URL is + easier. + + * The server may not allow you to force push over some refs. + For example, code review systems may have special ref + namespaces (e.g. refs/changes/, refs/pull/, + refs/merge-requests/) that they have locked down. + +5. (Optional) Some additional considerations + + * filter-repo by default creates replace refs (see + linkgit:git-replace[1]) for each rewritten commit ID, allowing + you to use old (unabbreviated) commit hashes to refer to the + newly rewritten commits. If you want to use these replace refs, + push them to the relevant clone URL and tell users to adjust + their fetch refspec (e.g. `git config --add remote.origin.fetch + +refs/replace/*:refs/replace/*`) Sadly, some existing git servers + (e.g. Gerrit, GitHub) do not yet understand replace refs, and + thus one can't use old commit hashes within their UI; this may + change in the future. 
But replace refs at least help users + locally within the git CLI. + + * If you have a central repo, you may want to prevent people + from pushing old commit IDs, in order to avoid mixing old + and new history. Every repository manager does this + differently; some provide specialized commands + (e.g. https://gerrit-review.googlesource.com/Documentation/cmd-ban-commit.html), + while others require you to write hooks. + +[[EXAMPLES]] +EXAMPLES +-------- + +Path based filtering +~~~~~~~~~~~~~~~~~~~~ + +To only keep the 'README.md' file plus the directories 'guides' and +'tools/releases/': + +-------------------------------------------------- +git filter-repo --path README.md --path guides/ --path tools/releases +-------------------------------------------------- + +Directory names can be given with or without a trailing slash, and all +filenames are relative to the toplevel of the repo. To keep all files +except these paths, just add `--invert-paths`: + +-------------------------------------------------- +git filter-repo --path README.md --path guides/ --path tools/releases --invert-paths +-------------------------------------------------- + +If you want to have both an inclusion filter and an exclusion filter, just +run filter-repo multiple times. For example, to keep the src/main +subdirectory but exclude files under src/main named 'data', run: + +-------------------------------------------------- +git filter-repo --path src/main/ +git filter-repo --path-glob 'src/*/data' --invert-paths +-------------------------------------------------- + +Note that the asterisk (`*`) will match across multiple directories, so the +second command would remove e.g. src/main/org/whatever/data. Run by itself, +the second command would also remove e.g. src/not-main/foo/data, but +since src/not-main/ was removed by the first command, that's not an issue. +Finally, quoting the glob is sometimes important to avoid +expansion of the asterisk by the shell.
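To see why `*` reaches across directory separators, here is a minimal sketch using Python's `fnmatch` module. It assumes that path globs behave like `fnmatch` patterns, which is an assumption made for illustration rather than something this section states; the example paths are hypothetical.

```python
from fnmatch import fnmatchcase

pattern = "src/*/data"

# Hypothetical paths used only to illustrate the matching rules.
paths = [
    "src/main/data",               # one directory between src/ and data
    "src/main/org/whatever/data",  # several directories: "*" spans the "/"
    "src/not-main/foo/data",       # matches too, regardless of the subtree
    "src/data",                    # no directory in between: no match
    "src/main/data.txt",           # the pattern must match the whole path
]

# fnmatchcase() is case-sensitive and, unlike shell globbing, lets "*"
# match any characters, including "/".
matches = {path: fnmatchcase(path, pattern) for path in paths}

for path, matched in matches.items():
    print(f"{path}: {matched}")
```

If the assumption holds, this is the behavior that makes the two-step invocation above necessary: the glob alone cannot restrict itself to src/main.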
+ +You can also select paths by regular expression (see +https://docs.python.org/3/library/re.html#regular-expression-syntax). +For example, to only include files from the repo whose name is in the +format YYYY-MM-DD.txt and is found at least two subdirectories deep: + +-------------------------------------------------- +git filter-repo --path-regex '^.*/.*/[0-9]{4}-[0-9]{2}-[0-9]{2}.txt$' +-------------------------------------------------- + +If you want two directories to be renamed (and maybe merged if both are +renamed to the same location), use --path-rename; for example, to rename +both 'cmds/' and 'src/scripts/' to 'tools/': + +-------------------------------------------------- +git filter-repo --path-rename cmds:tools --path-rename src/scripts/:tools/ +-------------------------------------------------- + +As with `--path`, directories can be specified with or without a +trailing slash for `--path-rename`. + +If you do a `--path-rename` to something that was already in use, it will +be silently overwritten. However, if you try to rename multiple files to +the same location (e.g. src/scripts/run_release.sh and cmds/run_release.sh +both existed and had different content with the renames above), then you +will be given an error. If you have such a case, you may want to add +another rename command to move one of the paths somewhere else where it +won't collide: + +-------------------------------------------------- +git filter-repo --path-rename cmds/run_release.sh:tools/do_release.sh \ + --path-rename cmds/:tools/ \ + --path-rename src/scripts/:tools/ +-------------------------------------------------- + +Also, `--path-rename` brings up ordering issues; all path arguments are +applied in order. 
Thus, a command like + +-------------------------------------------------- +git filter-repo --path-rename sources/:src/main/ --path src/main/ +-------------------------------------------------- + +would make sense but reversing the two arguments would not (src/main/ is +created by the rename so reversing the two would give you an empty repo). +Also, note that the rename of cmds/run_release.sh a couple of examples ago was +done before the other renames. + +If you prefer to filter based solely on basename, use the `--use-base-name` +flag (though this is incompatible with `--path-rename`). For example, to +only include README.md and Makefile files from any directory: + +-------------------------------------------------- +git filter-repo --use-base-name --path README.md --path Makefile +-------------------------------------------------- + +If you wanted to delete all .DS_Store files in any directory, you could +use either: + +-------------------------------------------------- +git filter-repo --invert-paths --path '.DS_Store' --use-base-name +-------------------------------------------------- + +or + +-------------------------------------------------- +git filter-repo --invert-paths --path-glob '*/.DS_Store' --path '.DS_Store' +-------------------------------------------------- + +Note that the `--path-glob` isn't sufficient by itself, as it might miss a toplevel +.DS_Store file; further, while something like `--path-glob '*.DS_Store'` +would work around that problem, it would also grab files named `foo.DS_Store` +or `bar/baz.DS_Store`. + +If you have a long list of files, directories, globs, or regular +expressions to filter on, you can stick them in a file and use +`--paths-from-file`; for example, with a file named stuff-i-want.txt with +contents of + +-------------------------------------------------- +README.md +guides/ +tools/releases +glob:*.py +regex:^.*/.*/[0-9]{4}-[0-9]{2}-[0-9]{2}.txt$ +tools/==>scripts/ +regex:(.*)/([^/]*)/([^/]*)\.text$==>\2/\1/\3.txt
+-------------------------------------------------- + +then you could run + +-------------------------------------------------- +git filter-repo --paths-from-file stuff-i-want.txt +-------------------------------------------------- + +to get a repo containing only the toplevel README.md file, the guides/ and +tools/releases/ directories, all Python files, and files whose names are of the +form YYYY-MM-DD.txt at least two subdirectories deep, while renaming +tools/ to scripts/ and renaming files like foo/bar/baz/bleh.text to +baz/foo/bar/bleh.txt. Note the special line prefixes of `glob:` and +`regex:` and the special string `==>` denoting renames. + +Finally, see also the `--filename-callback` from <>. + +Content based filtering +~~~~~~~~~~~~~~~~~~~~~~~ + +If you want to filter out all files bigger than a certain size, you can use +`--strip-blobs-bigger-than` with some size (K, M, and G suffixes are +recognized), e.g.: + +-------------------------------------------------- +git filter-repo --strip-blobs-bigger-than 10M +-------------------------------------------------- + +If you want to strip out all files with specified git object ids (hashes), +list the hashes in a file and run + +-------------------------------------------------- +git filter-repo --strip-blobs-with-ids FILE_WITH_GIT_BLOB_IDS +-------------------------------------------------- + +If you want to modify file contents, you can do so based on a list of +expressions in a file, one per line.
For example, with a file named +expressions.txt containing + +-------------------------------------------------- +p455w0rd +foo==>bar +glob:*666*==> +regex:\bdriver\b==>pilot +literal:MM/DD/YYYY==>YYYY-MM-DD +regex:([0-9]{2})/([0-9]{2})/([0-9]{4})==>\3-\1-\2 +-------------------------------------------------- + +then running +-------------------------------------------------- +git filter-repo --replace-text expressions.txt +-------------------------------------------------- + +will go through and replace `p455w0rd` with `***REMOVED***`, `foo` with +`bar`, any line containing `666` with a blank line, the word `driver` with +`pilot` (but not if it has letters before or after; e.g. `drivers` will be +unmodified), the exact text `MM/DD/YYYY` with `YYYY-MM-DD`, and +date strings of the form MM/DD/YYYY with ones of the form +YYYY-MM-DD. In the expressions file, there are a few things to note: + + * Every line has a replacement, given by whatever is on the right of + `==>`. If `==>` does not appear on the line, the default replacement + is `***REMOVED***`. + * Lines can start with `literal:`, `glob:`, or `regex:` to specify + whether to do literal string matches, + globs (see https://docs.python.org/3/library/fnmatch.html), or regular + expressions (see https://docs.python.org/3/library/re.html#regular-expression-syntax). + If none of these are specified, `literal:` is assumed. + * Globs and regexes are applied to each line of the file; it is not + possible with --replace-text to match a multi-line string. + * If multiple matches are found on a line, all are replaced. + +See also the `--blob-callback` from <>. + +Refname based filtering +~~~~~~~~~~~~~~~~~~~~~~~ + +To rename tags, use `--tag-rename`, e.g.: + +-------------------------------------------------- +git filter-repo --tag-rename foo:bar +-------------------------------------------------- + +This will rename any tags starting with `foo` to now start with `bar`.
+Either side of the colon could be blank, e.g. + +-------------------------------------------------- +git filter-repo --tag-rename '':'my-module-' +-------------------------------------------------- + +For more general refname modification, see `--refname-callback` from +<>. + +User and email based filtering +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To modify username and emails of commits, you can create a mailmap +file in the format accepted by linkgit:git-shortlog[1]. For example, +if you have a file named my-mailmap you can run + +-------------------------------------------------- +git filter-repo --mailmap my-mailmap +-------------------------------------------------- + +and if the current contents of that file are as follows (if the +specified mailmap file is version controlled, historical versions of +the file are ignored): + +-------------------------------------------------- +Name For User + +New Name And +New Name And Old Name And +-------------------------------------------------- + +then we can update username and/or emails based on the specified +mapping. + +See also the `--name-callback` and `--email-callback` from +<>. + +Parent rewriting +~~~~~~~~~~~~~~~~ + +To replace $commit_A with $commit_B (e.g. 
make all commits which had +$commit_A as a parent instead have $commit_B for that parent), and +rewrite history to make it permanent: + +-------------------------------------------------- +git replace $commit_A $commit_B +git filter-repo --force +-------------------------------------------------- + +To create a new commit with the same contents as $commit_A except with +different parent(s) and then replace $commit_A with the new commit, +and rewrite history to make it permanent: + +-------------------------------------------------- +git replace --graft $commit_A $new_parent_or_parents +git filter-repo --force +-------------------------------------------------- + +The reason to specify --force is two-fold: filter-repo will error out +if no arguments are specified, and the new graft commit would +otherwise trigger the not-a-fresh-clone check. + +[[CALLBACKS]] +CALLBACKS +--------- + +For flexibility, filter-repo allows you to specify functions on the +command line to further filter all changes. Please note that there +are some API compatibility caveats associated with these callbacks +that you should be aware of before using them; see the "API BACKWARD +COMPATIBILITY CAVEAT" comment near the top of git-filter-repo source +code. + +All callback functions are of the same general format. For a command line +argument like + +-------------------------------------------------- +--foo-callback 'BODY' +-------------------------------------------------- + +the following code will be compiled and called: + +-------------------------------------------------- +def foo_callback(foo): + BODY +-------------------------------------------------- + +Thus, you just need to make sure your _BODY_ modifies and returns +_foo_ appropriately. One important thing to note for all callbacks is +that filter-repo uses bytestrings (see +https://docs.python.org/3/library/stdtypes.html#bytes) everywhere +instead of strings. 
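The wrapping of a callback _BODY_ into a function can be sketched as follows. This is a minimal illustration of the idea, not filter-repo's actual implementation: the helper `make_callback` is hypothetical, and building the function with `exec` is an assumption chosen only to keep the sketch short.

```python
import textwrap


def make_callback(kind, body):
    """Wrap BODY as `def <kind>_callback(<kind>): BODY` and return it.

    Hypothetical helper for illustration; not part of filter-repo.
    """
    indented = textwrap.indent(textwrap.dedent(body), "  ")
    source = "def {0}_callback({0}):\n{1}\n".format(kind, indented)
    namespace = {}
    exec(source, namespace)  # compile the generated function definition
    return namespace[kind + "_callback"]


name_callback = make_callback(
    "name", 'return name.replace(b"Wiliam", b"William")')

# Callbacks receive and return bytestrings...
fixed = name_callback(b"Wiliam Smith")

# ...so passing a str to a body written for bytes fails.
try:
    name_callback("Wiliam Smith")
    accepts_str = True
except TypeError:
    accepts_str = False
```

The last few lines show why the bytestring convention matters: a body written with `b"..."` literals raises `TypeError` when handed a `str`.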
+ +There are four callbacks that allow you to operate directly on raw +objects that contain data that's easy to write in +linkgit:fast-import[1] format: + +-------------------------------------------------- +--blob-callback +--commit-callback +--tag-callback +--reset-callback +-------------------------------------------------- + +We'll come back to these later because it is often the case that the +other callbacks are more convenient. The other callbacks operate on a +small piece of the raw objects or operate on pieces across multiple +types of raw object (e.g. author names and committer names and tagger +names across commits and tags, or refnames across commits, tags, and +resets, or messages across commits and tags). The convenience +callbacks are: + +-------------------------------------------------- +--filename-callback +--message-callback +--name-callback +--email-callback +--refname-callback +-------------------------------------------------- + +in each you are expected to simply return a new value based on the one +passed in. For example, + +-------------------------------------------------- +git-filter-repo --name-callback 'return name.replace(b"Wiliam", b"William")' +-------------------------------------------------- + +would result in the following function being called: + +-------------------------------------------------- +def name_callback(name): + return name.replace(b"Wiliam", b"William") +-------------------------------------------------- + +The email callback is quite similar: + +-------------------------------------------------- +git-filter-repo --email-callback 'return email.replace(b".cm", b".com")' +-------------------------------------------------- + +The refname callback is also similar, but note that the refname passed in +and returned are expected to be fully qualified (e.g. 
b"refs/heads/master" +instead of just b"master" and b"refs/tags/v1.0.7" instead of b"1.0.7"): + +-------------------------------------------------- +git-filter-repo --refname-callback ' + # Change e.g. refs/heads/master to refs/heads/prefix-master + rdir,rpath = os.path.split(refname) + return rdir + b"/prefix-" + rpath' +-------------------------------------------------- + +The message callback is quite similar to the previous three callbacks, +though it operates on a bytestring that is likely more than one line: + +-------------------------------------------------- +git-filter-repo --message-callback ' + if b"Signed-off-by:" not in message: + message += b"\nSigned-off-by: Me My " + return re.sub(b"[Ee]-?[Mm][Aa][Ii][Ll]", b"email", message)' +-------------------------------------------------- + +The filename callback is slightly more interesting. Returning None means +the file should be removed from all commits, returning the filename +unmodified marks the file to be kept, and returning a different name means +the file should be renamed. An example: + +-------------------------------------------------- +git-filter-repo --filename-callback ' + if b"/src/" in filename: + # Remove all files with a directory named "src" in their path + # (except when "src" appears at the toplevel). + return None + elif filename.startswith(b"tools/"): + # Rename tools/ -> scripts/misc/ + return b"scripts/misc/" + filename[6:] + else: + # Keep the filename and do not rename it + return filename + ' +-------------------------------------------------- + +In contrast, the blob, reset, tag, and commit callbacks are not +expected to return a value, but are instead expected to modify the +object passed in. 
Major fields for these objects are (subject to API +backward compatibility caveats mentioned previously): + + * Blob: `original_id` (original hash) and `data` + * Reset: `ref` (name of reference) and `from_ref` (hash or integer mark) + * Tag: `ref`, `from_ref`, `original_id`, `tagger_name`, `tagger_email`, + `tagger_date`, `message` + * Commit: `branch`, `original_id`, `author_name`, `author_email`, + `author_date`, `committer_name`, `committer_email`, + `committer_date`, `message`, `file_changes` (list of + FileChange objects, each containing a `type`, `filename`, + `mode`, and `blob_id`), `parents` (list of hashes or integer + marks) + +An example of each: + +-------------------------------------------------- +git filter-repo --blob-callback ' + if len(blob.data) > 25: + # Mark this blob for removal from all commits + blob.skip() + else: + blob.data = blob.data.replace(b"Hello", b"Goodbye") + ' +-------------------------------------------------- + +-------------------------------------------------- +git filter-repo --reset-callback 'reset.ref = reset.ref.replace(b"master", b"dev")' +-------------------------------------------------- + +-------------------------------------------------- +git filter-repo --tag-callback ' + if tag.tagger_name == b"Jim Williams": + # Omit this tag + tag.skip() + else: + tag.message = tag.message + b"\n\nTag of %s by %s on %s" % (tag.ref, tag.tagger_email, tag.tagger_date)' +-------------------------------------------------- + +-------------------------------------------------- +git filter-repo --commit-callback ' + # Remove executable files with three 6s in their name (including + # from leading directories).
+ # Also, undo deletion of sources/foo/bar.txt (change types are + # either b"D" (deletion) or b"M" (add or modify); renames are + # handled by deleting the old file and adding a new one) + commit.file_changes = [ + change for change in commit.file_changes + if not (change.mode == b"100755" and + change.filename.count(b"6") == 3) and + not (change.type == b"D" and + change.filename == b"sources/foo/bar.txt")] + # Mark all .sh files as executable; modes in git are always one of + # 100644 (normal file), 100755 (executable), 120000 (symlink), or + # 160000 (submodule) + for change in commit.file_changes: + if change.filename.endswith(b".sh"): + change.mode = b"100755" + ' +-------------------------------------------------- + +[[INTERNALS]] +INTERNALS +--------- + +You probably don't need to read this section unless you are just very +curious or you are trying to do a very complex history rewrite. + +How filter-repo works +~~~~~~~~~~~~~~~~~~~~~ + +Roughly, filter-repo works by running + +-------------------------------------------------- +git fast-export | filter | git fast-import +-------------------------------------------------- + +where filter-repo not only launches the whole pipeline but also serves as +the _filter_ in the middle. However, filter-repo does a few additional +things on top in order to make it into a well-rounded filtering tool. A +sequence that more accurately reflects what filter-repo runs is: + + 1. Verify we're in a fresh clone + 2. `git fetch -u . refs/remotes/origin/*:refs/heads/*` + 3. `git remote rm origin` + 4. `git fast-export --show-original-ids --reference-excluded-parents --fake-missing-tagger --signed-tags=strip --tag-of-filtered-object=rewrite --use-done-feature --no-data --reencode=yes --all | filter | git fast-import --force --quiet` + 5. `git update-ref --no-deref --stdin`, fed with a list of refs to nuke, and a list of replace refs to delete, create, or update. + 6. `git reset --hard` + 7. 
`git reflog expire --expire=now --all` + 8. `git gc --prune=now` + +Some notes or exceptions on each of the above: + + 1. If we're not in a fresh clone, users will not be able to recover if + they used the wrong command or ran in the wrong repo. (Though + `--force` overrides this check, and it's also off if you've already + run filter-repo once in this repo.) + 2. Technically, we actually use a `git update-ref` command fed with a lot + of input due to the fact that users can use `--force` when local + branches might not match remote branches. But this fetch command + captures the intent rather succinctly. + 3. We don't want users accidentally pushing back to the original repo, as + discussed in <>. It also reminds users that since history + has been rewritten, this repo is no longer compatible with the + original. Finally, another minor benefit is this allows users to push + with the `--mirror` option to their new home without accidentally + sending remote tracking branches. + 4. Some of these flags are always used but others are actually + conditional. For example, filter-repo's `--replace-text` and + `--blob-callback` options need to work on blobs so `--no-data` cannot + be passed to fast-export. But when we don't need to work on blobs, + passing `--no-data` speeds things up. Also, other flags may change + the structure of the pipeline as well (e.g. `--dry-run` and `--debug`). + 5. We use this step to write replace refs for accessing the newly written + commit hashes using their previous names. Also, if refs were renamed + by various steps, we need to delete the old refnames in order to avoid + mixing old and new history. + 6. Users also have old versions of files in their working tree and index; + we want those cleaned up to match the rewritten history as well. Note + that this step is skipped in bare repos. + 7. Reflogs will hold on to old history, so we need to expire them. + 8. We need to gc to avoid mixing new and old history.
Also, it shrinks + the repository for users, so they don't have to do extra work. (Odds + are that they've only rewritten trees and commits and maybe a few + blobs, so `--aggressive` isn't needed and would be too slow.) + +Information about these steps is printed out when `--debug` is passed to +filter-repo. + +Limitations +~~~~~~~~~~~ + +Inherited limitations +^^^^^^^^^^^^^^^^^^^^^ + +Since git filter-repo calls fast-export and fast-import to do a lot of the +heavy lifting, it inherits limitations from those systems: + + * extended commit headers, if any, are stripped + * commits get rewritten meaning they will have new hashes; therefore, + signatures on commits and tags cannot continue to work and instead are + just removed (thus signed tags become annotated tags) + * tags of commits are supported; tags of anything else (blobs, trees, or + tags) are not. (fast-export aborts on tags of blobs and tags of tags, + and simply ignores tags of trees with a warning.) + * annotated and signed tags outside of the refs/tags/ namespace are not + supported (their location will be mangled in weird ways) + * fast-import will die on various forms of invalid input, such as a + timezone with more than four digits + * fast-export cannot reencode commit messages into UTF-8 if the commit + message is not valid in its specified encoding (in such cases, it'll + leave the commit message and the encoding header alone). 
+ * commits without an author will be given one matching the committer + * tags without a tagger will be given a fake tagger + * references that include commit cycles in their history (which can be + created with linkgit:git-replace[1]) will not be flagged to the user as an + error but will be silently deleted by fast-export as though the branch + or tag contained no interesting files + +There are also some limitations due to the design of these systems: + + * Trying to insert additional files into the stream can be tricky; since + fast-export only lists file changes in a merge relative to its first + parent, if you insert additional files into a commit that is in the + second (or third or fourth) parent history of a merge, then you also + need to add it to the merge manually. + + * fast-export and fast-import work with exact file contents, not patches. + (e.g. "Whatever the current contents of this file, update them to now + have these contents") Because of this, removing the changes made in a + single commit or inserting additional changes to a file in some commit + and expecting them to propagate forward is not something that can be + done with these tools. Use linkgit:git-rebase[1] for that. + +Intrinsic limitations +^^^^^^^^^^^^^^^^^^^^^ + +Some types of filtering have limitations that would affect any tool +attempting to perform them; the most any tool can do is attempt to notify +the user when it detects an issue: + + * When rewriting commit hashes in commit messages, there are a variety + of cases when the hash will not be updated (whenever this happens, a + note is written to `.git/filter-repo/suboptimal-issues`): + * if a commit hash does not correspond to a commit in the old repo + * if a commit hash corresponds to a commit that gets pruned + * if an abbreviated hash is not unique + + * Pruning of empty commits can cause a merge commit to lose an entire + ancestry line and become a non-merge. 
If the merge commit had no + changes then it can be pruned too, but if it still has changes it needs + to be kept. This might cause minor confusion since the commit will + likely have a commit message that makes it sound like a merge commit + even though it's not. (Whenever a merge commit becomes a non-merge + commit, a note is written to `.git/filter-repo/suboptimal-issues`.) + +Issues specific to filter-repo +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * Multiple repositories in the wild have been observed which use a bogus + timezone (`+051800`); Google will find you some reports. The intended + timezone wasn't clear or wasn't always the same. filter-repo replaces it with a + different bogus timezone that fast-import will accept (`+0261`). + + * `--path-rename` can result in pathname collisions; to avoid excessive + memory requirements of tracking which files are in all commits or + looking up what files exist with either every commit or every usage of + `--path-rename`, we just tell the user that they might clobber other + changes if they aren't careful. We can check if the clobbering comes + from another `--path-rename` without much overhead. (Perhaps in the + future it's worth adding a slow mode to `--path-rename` that will do the + more exhaustive checks?) + + * There is no mechanism for directly controlling which flags are passed + to fast-export (or fast-import); only pre-defined flags can be turned + on or off as a side-effect of other options. Direct control would make + little sense because some options like `--full-tree` would require + additional code in filter-repo (to parse new directives), and others + such as `-M` or `-C` would break assumptions used in other places of + filter-repo. + + * Partial-repo filtering does not mesh well with filter-repo's "avoid + mixing old and new history" design.
filter-repo has some capability + in this area but it is intentionally underdocumented and mostly left + for use by external scripts which import filter-repo as a module + (some examples in contrib/filter-repo-demos/ do use this). The only + real use cases I've seen for partial-repo filtering, though, are + sidestepping filter-branch's insanely slow execution on commits that + would not be changed by the filters in question anyway (which is + largely irrelevant since filter-repo is multiple orders of magnitude + faster), or to do operations better suited to linkgit:git-rebase[1] + and which rebase grew special options for years ago (e.g. the + `--signoff` option). + +Comments on reversibility +^^^^^^^^^^^^^^^^^^^^^^^^^ + +Some people are interested in reversibility of a rewrite; e.g. rewrite +history, possibly add some commits, then unrewrite and get the original +history back plus a few new "unrewritten" commits. Obviously this is +impossible if your rewrite involves throwing away information +(e.g. filtering out files or replacing several different strings with +`***REMOVED***`), but may be possible with some rewrites. filter-repo is +likely to be a poor fit for this type of workflow for a few reasons: + + * most of the limitations inherited from fast-export and fast-import + are of a type that cause reversibility issues + * grafts and replace refs, if present, are used in the rewrite and made + permanent + * rewriting of commit hashes will probably be reversible, but it is + possible for rewritten abbreviated hashes to not be unique even if the + original abbreviated hashes were. + * filter-repo defaults to several forms of irreversible rewriting that + you may need to turn off (e.g. the last two bullet points above or + reencoding commit messages into UTF-8); it's possible that additional + forms of irreversible rewrites will be added in the future. + * I assume that people use filter-repo for one-shot conversions, not + ongoing data transfers.
I explicitly reserve the right to change any + API in filter-repo based on this presumption (and a comment to this + effect is found in multiple places in the code and examples). You + have been warned. + +SEE ALSO +-------- +linkgit:git-rebase[1], linkgit:git-filter-branch[1] + +GIT +--- +Part of the linkgit:git[1] suite