Note that this isn't a version *number* or even the more generalized
version string that folks are used to seeing, but a version hash (or
leading portion thereof).
A few important points:
* These version hashes are not strictly monotonically increasing
values. Like I said, these aren't version numbers. If that
bothers you, read on...
* This scheme has incredibly nice semantics satisfying a pair of
properties that most version schemes would assume are mutually
incompatible:
This scheme works even if the user doesn't have a clone of
filter-repo and doesn't require any build step to inject the
version into the program; it works even if people just download
git-filter-repo.py off GitHub without any of the other sources.
And:
This scheme means that a user is running precisely version X of
the code, with the version not easily faked or misrepresented
when third parties edit the code.
Given the wonderful semantics provided by satisfying this pair of
properties that all other versioning schemes seem to miss out on, I
think I should name this scheme. How about "Semantic Versioning"?
(Hehe...)
* The version hash is super easy to use; I just go to my own clone of
filter-repo and run either:
git show $VERSION_HASH
or
git describe $VERSION_HASH
* A human-consumable version might suggest to folks that this software
is something they might frequently use and upgrade. This program
should only be used in exceptional cases (because rewriting history
is not for the faint of heart).
* A human-consumable version (i.e. a version number or even the
more relaxed version strings in more common use) might suggest to
folks that they can rely on strict backward compatibility. It's
nice to subtly undercut any such assumption.
* Despite all that, I will make releases (downloadable tarballs with
real version numbers in the tarball name; I'm just going to re-use
whatever version git is released with at the time). But those
version numbers won't be used by the --version option; instead the
version hash will.
Signed-off-by: Elijah Newren <newren@gmail.com>
A few changes:
* Include notes about git-2.24.0 changes
* Make it clearer that messing with the first parent could have
negative side-effects if the file_changes aren't also updated.
* Fix wrapping of a line that was too long.
Also, update the README.md:
* Note the upstream improvements made in (not yet released) git-2.24.0
Signed-off-by: Elijah Newren <newren@gmail.com>
Partial history rewrites were possible before with the (previously
hidden) --refs flag, but the defaults were wrong. That could be worked
around with the --source or --target flags, but that disabled --no-data
for fast-export and thus slowed things down, and also would require
overriding --replace-refs. And the defaults for --source and --target
may diverge further from what is wanted/needed for partial history
rewrites in the future.
So, add --partial as a first-class supported option with scary
documentation about how it permits mixing new and old history. Make
--refs imply that flag. Make the behavioral similarities (with regard to
which steps are skipped) between --source, --target, and --partial more
clear. Add relevant documentation to round it out.
Signed-off-by: Elijah Newren <newren@gmail.com>
In order to build the correct tree for a commit, git-fast-import always
takes a list of file changes for a merge commit relative to the first
parent.
When the entire first-parent history of a merge commit is pruned away
and the merge had paths with no difference relative to the first parent
but which differed relative to later parents, then we really need to
generate a new list of file changes in order to have one of those other
parents become the new first parent. An example might help clarify...
Let's say that there is a merge commit, and:
* it resolved differences in pathA between its two parents by taking
the version of pathA from the first parent.
* pathB was added in the history of the second parent (it is not
present in the first parent) and is NOT included in the merge commit
(either being deleted, or via rename treated as deleted and added as
something else)
For this merge commit, neither pathA nor pathB differ from the first
parent, and thus wouldn't appear in the list of file changes shown by
fast-export. However, when our filtering rules determine that the first
parent (and all its parents) should be pruned away, then the second
parent has to become the new first parent of the merge commit. But to
end up with the right files in the merge commit despite using a
different parent, we need a list of file changes that specifies the
changes for both pathA and pathB.
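The example above can be sketched with a toy tree model (dicts mapping
path to blob id; this illustrates the idea and is not filter-repo's
actual data structures):

```python
# Toy model: a "tree" is a dict of path -> blob id. When a merge's first
# parent changes, the file_changes list must be regenerated as the diff
# between the new first parent's tree and the merge's own tree.
def diff_trees(parent_tree, merge_tree):
    changes = []
    for path, blob in sorted(merge_tree.items()):
        if parent_tree.get(path) != blob:
            changes.append(('M', path, blob))
    for path in sorted(parent_tree):
        if path not in merge_tree:
            changes.append(('D', path))
    return changes

first_parent  = {'pathA': 'blob1'}                    # merge kept this pathA
second_parent = {'pathA': 'blob2', 'pathB': 'blob3'}  # pathB only exists here
merge         = {'pathA': 'blob1'}                    # pathB not in the merge

# Relative to the original first parent there are no changes at all,
# so fast-export lists nothing:
assert diff_trees(first_parent, merge) == []

# If the second parent becomes the new first parent, both paths must
# appear in the regenerated list of file changes:
assert diff_trees(second_parent, merge) == [('M', 'pathA', 'blob1'),
                                            ('D', 'pathB')]
```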
Signed-off-by: Elijah Newren <newren@gmail.com>
git.git wants to move more towards core-only rather than batteries
included, and as such, filter-repo will not be part of the git
distribution. Therefore, due to keeping the projects apart, there will
need to be separate translation files (assuming filter-repo ever gains
any translations) and as such we will need a different textdomain
definition.
Signed-off-by: Elijah Newren <newren@gmail.com>
Allow folks to periodically update the export of a live repo without
re-exporting from the beginning. This is a performance improvement, but
can also be important for collaboration. For example, for sensitivity
reasons, folks might want to export a subset of a repo and update the
export periodically. While this could be done by just re-exporting the
repository anew each time, there is a risk that the paths used to
specify the wanted subset might need to change in the future; making
the user verify that their paths (including globs or regexes) don't
also pick up anything from previously excluded history, so that they
don't end up with a divergent history, is not very user-friendly.
Allowing them to export just what is new since the last export works
much better for them.
Signed-off-by: Elijah Newren <newren@gmail.com>
Since we now have a separate user manual and it does not make sense to
duplicate information in multiple places, restructure the README:
* Refer to the actual manual early on
* Limit the README to mostly be about why I wrote it and why folks
might want to consider it instead of existing tools
* Include a new section on upstream improvements, especially since it
looks like inclusion of git-filter-repo in git.git is unlikely.
Signed-off-by: Elijah Newren <newren@gmail.com>
This is a re-implementation of git-filter-branch that is nearly
perfectly bug-compatible (it can replace git-filter-branch and still
pass the git testsuite). It deviates in one minor way that should not
matter to real world usecases, but allows it to run a few times faster
than filter-branch.
Signed-off-by: Elijah Newren <newren@gmail.com>
Being able to find the new commit hash for either an abbreviated commit
hash or a full commit hash is much more useful than only working for a
full commit hash.
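The idea can be sketched as a prefix lookup over the old-to-new commit
map (illustrative code, not filter-repo's implementation):

```python
# Return the new hash for a full or abbreviated old hash; an ambiguous
# or unknown abbreviation yields None.
def lookup_commit(commit_map, prefix):
    matches = [old for old in commit_map if old.startswith(prefix)]
    return commit_map[matches[0]] if len(matches) == 1 else None

commit_map = {'deadbeef0123': 'new1', 'deadc0de4567': 'new2'}
assert lookup_commit(commit_map, 'deadbeef0123') == 'new1'  # full hash
assert lookup_commit(commit_map, 'deadb') == 'new1'         # abbreviated
assert lookup_commit(commit_map, 'dead') is None            # ambiguous
```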
Signed-off-by: Elijah Newren <newren@gmail.com>
When I wrote git_fast_filter.py, I was unaware of and did not foresee
libgit2. So, although the license said the project could be used under
whatever license git.git was, there was still a potential barrier for
usage by libgit2. I'm not sure if libgit2 will ever want to use
filter-repo, but I don't want the barrier there and I would like to
avoid a repeat of this problem. (Also, since filter-repo is for the
most part a one-shot usage tool, I doubt that the normal copyleft
provisions could provide much value.)
MIT is widely used, compatible with just about everything, and is
preferred by Palantir (my current employer) for open source
contributions. So, I contacted all other contributors (Jim is still at
Sandia) and got permission to relicense.
Signed-off-by: Elijah Newren <newren@gmail.com>
Commit 346f2ba891 (filter-repo: make reencoding of commit messages
togglable, 2019-05-11) made reencoding of commit messages togglable but
forgot to add parsing and outputting of the encoding header itself. Add
such ability now.
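The 'encoding' header sits among the other commit headers in the
fast-export stream; a parsing sketch (simplified relative to the real
parser):

```python
# Pull an optional 'encoding' header out of a commit's header lines.
def parse_encoding(header_lines):
    for line in header_lines:
        if line.startswith('encoding '):
            return line[len('encoding '):]
    return None  # header absent; commit message treated as-is

headers = ['committer C O Mitter <c@example.com> 1234567890 +0000',
           'encoding ISO-8859-1']
assert parse_encoding(headers) == 'ISO-8859-1'
assert parse_encoding(headers[:1]) is None
```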
Signed-off-by: Elijah Newren <newren@gmail.com>
Now that we are tracking exported and imported refs, we no longer need
to rely on _orig_refs and _seen_refs for deletion of "unused" refs at
the end of the run. Verify that we correctly tracked exported and
imported refs by using them instead for the post-run ref deletion. This
removes the last use of _seen_refs, which will be removed in a
subsequent commit.
Signed-off-by: Elijah Newren <newren@gmail.com>
We previously nuked all refs not seen in the import using _seen_refs, by
comparing to a full list of original refs. That works okay when doing a
full repository rewrite, but fails for partial history rewrites.
Further, external rewriting tools that want to implement a tweak of
this behavior would have had to access the internal _seen_refs field,
but might not be able to rely on _orig_refs if they were doing a partial
history rewrite. Fix both by tracking both which refs were exported
from the source repository, and which were ultimately imported into the
target repository (they may differ due to pruned commits, renamed
branches or tags, etc.). Make both available via a new public API,
get_exported_and_imported_refs().
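The post-run deletion this enables can be illustrated with plain sets
(the data model here is an illustration; the real entry point is
get_exported_and_imported_refs()):

```python
# Refs exported from the source but never imported into the target are
# the ones to delete at the end of the run.
exported = {'refs/heads/main', 'refs/heads/topic', 'refs/tags/v1.0'}
imported = {'refs/heads/main', 'refs/tags/v1.0'}   # topic pruned away
to_delete = exported - imported
assert to_delete == {'refs/heads/topic'}
```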
Signed-off-by: Elijah Newren <newren@gmail.com>
External rewrite tools using filter-repo as a library may want to add
additional objects into the stream. Some examples in t/t9391 did this
using an internal _output field and using syntax that did not seem so
clear. Provide an insert() method for doing this, and convert existing
cases over to it.
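A minimal sketch of the shape of such an insert() helper (the class,
the serialization, and the fast-import syntax are all simplified for
illustration):

```python
# Instead of appending pre-serialized text to an internal _output field,
# callers hand an object to insert(), which serializes it into the stream.
class StreamWriter:
    def __init__(self):
        self._output = []         # previously poked at directly by callers

    def insert(self, obj):
        self._output.append(obj.dump())

class Blob:
    def __init__(self, data):
        self.data = data

    def dump(self):               # simplified; real blobs also carry a mark
        return 'blob\ndata {}\n{}\n'.format(len(self.data), self.data)

w = StreamWriter()
w.insert(Blob('hello'))
assert w._output == ['blob\ndata 5\nhello\n']
```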
Signed-off-by: Elijah Newren <newren@gmail.com>
When we prune a commit for being empty, there is no update to the branch
associated with the commit in the fast-import stream. If the parent
commit had been associated with a different branch, then the branch
associated with the pruned commit would not be updated without
additional measures. In the past, we resolved this by recording that
the branch needed an update in _seen_refs. While this works, it is a
bit more complicated than just issuing an immediate Reset. Also, note
that we need to avoid calling callbacks on that Reset because those
could rename branches (again, if the commit-callback already renamed
once) causing us to not update the intended branch.
There was actually one testcase where the old method didn't work: when a
branch was pruned away to nothing. A testcase accidentally encoded the
wrong behavior, hiding this problem. Fix the testcase to check for
correct behavior.
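The "immediate Reset" idea, sketched with an illustrative data model
(not filter-repo's internals):

```python
# When a commit is pruned, emit a reset right away so its branch still
# points at the commit's surviving ancestor; no callbacks run on this
# Reset, so a rename-happy callback can't redirect it.
def prune_commit(branch_tips, branch, surviving_ancestor, emit):
    emit(('reset', branch, surviving_ancestor))
    branch_tips[branch] = surviving_ancestor

stream, tips = [], {}
prune_commit(tips, 'refs/heads/topic', ':12', stream.append)
assert stream == [('reset', 'refs/heads/topic', ':12')]
assert tips == {'refs/heads/topic': ':12'}
```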
Signed-off-by: Elijah Newren <newren@gmail.com>
We previously did this incorrectly, but due to our assumptions of
full-history rewriting and deleting of unseen refs, we got away with it.
Fix this for partial history rewrites.
Signed-off-by: Elijah Newren <newren@gmail.com>
Commit 1f0e57bada ("filter-repo: avoid pruning annotated tags that we
have seen", 2019-03-07) left behind the setting of a variable,
full_ref, that is no longer used. Remove it.
Signed-off-by: Elijah Newren <newren@gmail.com>
Add a flag allowing for specifying a file filled with blob-ids which
will be stripped from the repository.
Signed-off-by: Elijah Newren <newren@gmail.com>
Fix a few issues and add a token testcase for partial repo filtering.
Add a note about how I think this is not a particularly interesting or
core usecase for filter-repo, even if I have put some good effort into
the fast-export side to ensure it worked. If there is a core usecase
that can be addressed without causing usability problems (particularly
the "don't mix old and new history" edict for normal rewrites), then
I'll be happy to add more testcases, document it better, etc.
Signed-off-by: Elijah Newren <newren@gmail.com>
Make several fixes around --source and --target:
* Explain steps we skip when source or target locations are specified
* Only write reports to the target directory, never the source
* Query target git repo for final ref values, not the source
* Make sure --debug messages avoid throwing TypeErrors due to mixing
strings and bytes
* Make sure to include entries in ref-map that weren't in the original
target repo
* Don't:
* worry about mixing old and new history (i.e. nuking refs
that weren't updated, expiring reflogs, gc'ing)
* attempt to map refs/remotes/origin/* -> refs/heads/*
* disconnect origin remote
* Continue (but only in target repo):
* fresh-clone sanity checks
* writing replace refs
* doing a 'git reset --hard'
Signed-off-by: Elijah Newren <newren@gmail.com>
Add a flag for filtering out blob based on their size, and allow the
size to be specified using 'K', 'M', or 'G' suffixes.
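A sketch of such suffix parsing, assuming binary (1024-based)
multipliers (check the flag's documentation for the real semantics):

```python
def parse_size(spec):
    # '5K' -> 5120, '2M' -> 2 MiB, plain digits -> bytes
    multipliers = {'K': 1024, 'M': 1024 ** 2, 'G': 1024 ** 3}
    suffix = spec[-1].upper()
    if suffix in multipliers:
        return int(spec[:-1]) * multipliers[suffix]
    return int(spec)

assert parse_size('5K') == 5120
assert parse_size('2M') == 2 * 1024 ** 2
assert parse_size('100') == 100
```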
Signed-off-by: Elijah Newren <newren@gmail.com>
Create a new function, GitUtils.get_blob_sizes() to hold some logic
that used to be at the beginning of RepoAnalyze.gather_data(). This
will allow reuse of this functionality within RepoFilter.
Signed-off-by: Elijah Newren <newren@gmail.com>
Although fast-import can take file changes in any order, trying to debug
by comparing the original fast-export stream to the filtered version is
difficult if the files are randomly reordered. Sometimes we aren't
comparing the filtered version to the original but just looking at the
stream passed to fast-import, in which case having the files in sorted
order may help.
Our accumulation of file_changes into a dict() in order to check for
collisions when renaming had the unfortunate side effect of sorting
files by internals of dictionary ordering. Although the files started
in sorted order, we don't in general want to use the original order
because renames can cause filenames to become out-of-order. Just apply
a simple sort at the end.
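In sketch form (illustrative, not the real file_changes
representation):

```python
# Accumulate changes in a dict keyed by filename so rename collisions
# are detectable, then emit in sorted order rather than dict order.
changes = {}
for path, kind in [('b.txt', 'M'), ('a.txt', 'D'), ('c.txt', 'M')]:
    if path in changes:
        raise ValueError('rename collision on ' + path)
    changes[path] = kind

ordered = [(path, changes[path]) for path in sorted(changes)]
assert ordered == [('a.txt', 'D'), ('b.txt', 'M'), ('c.txt', 'M')]
```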
Signed-off-by: Elijah Newren <newren@gmail.com>
I suspect at some point someone will try to pass -M or -C to
fast-export; may as well leave a note in the code about another place
that's incompatible while I'm thinking about it.
Signed-off-by: Elijah Newren <newren@gmail.com>
Imperative form sounds better than --empty-pruning and
--degenerate-pruning, and it probably works better with command line
completion.
Signed-off-by: Elijah Newren <newren@gmail.com>
The reset directive can specify a commit hash for the 'from' directive,
which can be used to reset to a specific commit or, if the hash is all
zeros, to delete the ref. Support such operations.
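A sketch of the two behaviors (the data model is illustrative; the
all-zeros convention is the one described above):

```python
ZERO_HASH = '0' * 40

def handle_reset(refs, ref, from_hash):
    if from_hash == ZERO_HASH:
        refs.pop(ref, None)        # all-zeros 'from' deletes the ref
    else:
        refs[ref] = from_hash      # otherwise reset to that commit

refs = {'refs/heads/main': 'abc123'}
handle_reset(refs, 'refs/heads/main', 'def456')
assert refs == {'refs/heads/main': 'def456'}
handle_reset(refs, 'refs/heads/main', ZERO_HASH)
assert refs == {}
```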
Signed-off-by: Elijah Newren <newren@gmail.com>
For other programs importing git-filter-repo as a library and passing a
blob, commit, tag, or reset callback to RepoFilter, pass a second
parameter to these functions with extra metadata they might find useful.
For simplicity of implementation, this technically changes the calling
signature of the --*-callback functions passed on the command line, but
we hide that behind a _do_not_use_this_variable parameter for now, leave
it undocumented, and encourage folks who want to use it to write an
actual python program that imports git-filter-repo. In the future, we
may modify the --*-callback functions to not pass this extra parameter,
or if it is deemed sufficiently useful, then we'll rename the second
parameter and document it.
As already noted in our API compatibility caveat near the top of
git-filter-repo, I am not guaranteeing API backwards compatibility.
That especially applies to this metadata argument, other than the fact
that it'll be a dict mapping strings to some kind of value. I might add
more keys, rename them, change the corresponding value, or even remove
keys that used to be part of metadata.
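The calling convention can be sketched like so (the driver and the
metadata keys are illustrative; only "a dict mapping strings to some
kind of value" is promised):

```python
# Callbacks now receive a second, metadata dict argument.
def run_filter(commits, commit_callback):
    for commit in commits:
        metadata = {'orig_parents': commit.get('orig_parents', [])}
        commit_callback(commit, metadata)

collected = []
def my_callback(commit, metadata):
    collected.append((commit['id'], metadata['orig_parents']))

run_filter([{'id': 1, 'orig_parents': [0]}, {'id': 2}], my_callback)
assert collected == [(1, [0]), (2, [])]
```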
Signed-off-by: Elijah Newren <newren@gmail.com>
The filtering logic was previously split in a confusing fashion
between FastExportFilter and RepoFilter. Move all filtering logic from
FastExportFilter into RepoFilter, and rename the former to
FastExportParser to reflect this change.
One downside of this change is that FastExportParser's _parse_commit
holds two pieces of information (orig_parents and had_file_changes)
which are not part of the commit object but which are now needed by
RepoFilter. Adding those bits of info to the commit object does not
make sense, so for now we pass an auxiliary dict with the
commit_callback that has these two fields. This information is not
passed along to external commit_callbacks passed to RepoFilter, though,
which seems suboptimal. To be fair, though, commit_callbacks to
RepoFilter never had access to this information so this is not a new
shortcoming, it just seems more apparent now.
Signed-off-by: Elijah Newren <newren@gmail.com>
I introduced this over a decade ago thinking it would come in handy in
some special case, and the only place I used it was in a testcase that
existed almost solely to increase code coverage. Modify the testcase to
instead demonstrate how it is trivial to get the effects of the
everything_callback without it being present.
Signed-off-by: Elijah Newren <newren@gmail.com>
The specially constructed callbacks in RepoFilter.run() were
superfluous; we already had special callback functions. Instead of
creating new local functions that call the real callbacks and then do
one extra step, just put the extra wanted code into the real callbacks.
Signed-off-by: Elijah Newren <newren@gmail.com>
Using fast-import's 'feature done' capability, any output sent to it after
the 'done' directive will be ignored. We do not intend to send any such
information, but there have been a couple cases where an accident while
refactoring the code resulted in some information being sent after the
done directive. To avoid having to debug that again, just close the
output stream after sending the 'done' directive to ensure that we get
an immediate and clear error if we ever run into such a situation again.
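The failure mode this buys is the immediate error a closed stream
gives, as a small illustration:

```python
import io

# Stand-in for the pipe to fast-import; real code writes to a subprocess.
out = io.BytesIO()
out.write(b'done\n')
out.close()

# Any write after close fails loudly instead of being silently ignored.
try:
    out.write(b'reset refs/heads/oops\n')
    leaked = True
except ValueError:
    leaked = False
assert not leaked
```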
Signed-off-by: Elijah Newren <newren@gmail.com>