Rewrite Git History for Open Source Projects

I’ve recently been spending a lot of my time on open source projects, and with that I’ve had to learn a couple of things you typically don’t pick up when you only contribute to proprietary work. Specifically, caring for the commits in your Pull Requests.

I’ve always thought my git workflow was fairly decent: I’d cut a branch off master, do my work, commit it as a single commit, and --amend anything further into that first commit. When I was ready, I would put in a Pull Request (PR). This workflow seemed decent: I wasn’t pushing my work-in-progress history into the mainline, and I was making it easy to revert. But this wasn’t good enough.

This all came up while working on a couple of features for the AWS IAM Authenticator. Nick Turner and Matt Landis from the Amazon EKS team have both been extremely helpful and patient with getting the history just right, and after a few code reviews of different pull requests I think I’m starting to get used to the approach.

The Idea

The git history for each pull request should be kept to a minimum number of commits, in an order that makes sense, and structured in a way that makes reviewing the work very easy. To explain this, take for example a pull request that adds one new feature, where the feature has no external dependencies and is 100% “handwritten”. This feature and PR would be represented by a single commit.

commit e678906aadb977a8e9161a3213213f5d62c2b05c (HEAD -> feature/125-version-command, christopherhein/feature/125-version-command)
Author: Christopher Hein <me@christopherhein.com>
Date:   Tue Aug 7 15:30:05 2018 -0700

    Adding `version` subcommand

    **Why:**

    * allows you to echo the version of the build, uses built-in
      `goreleaser` `ldflags` to get the latest release information

    **This change addresses the need by:**

    * closes #125

    Signed-off-by: Christopher Hein <me@christopherhein.com>

And the diff for this commit didn’t add any additional components. Perfect use case for a single commit PR.

Now for an example of a multiple commit PR. Suppose that last PR did pull in additional packages. Because we use Go and vendor our packages, after we have written the code we have a status that looks like this:

## feature/125-version-command
 M .goreleaser.yaml
 M Gopkg.lock
?? cmd/aws-iam-authenticator/version.go
?? vendor/github.com/someproject/

📎 In this example we have added an additional package in vendor and updated the .lock file, which manages installed packages. This is a great use case for multiple commits: I didn’t write the vendored code, and the dep project updated my Gopkg.lock file for me, so the only real review someone should do is to delete the vendored package and verify that dep ensure returns the same files. In this instance you would make two commits, one for the vendored additions and one for the code you wrote. Like so:

commit 36f80d6d1f4e85992de30b288928e6c1bb714b3d (HEAD -> feature/125-version-command, christopherhein/feature/125-version-command)
Author: Christopher Hein <me@christopherhein.com>
Date:   Tue Aug 7 15:30:05 2018 -0700

    Adding `version` subcommand

    **Why:**

    * allows you to echo the version of the build, uses built-in
      `goreleaser` `ldflags` to get the latest release information

    **This change addresses the need by:**

    * closes #125

    Signed-off-by: Christopher Hein <me@christopherhein.com>

commit f19cf04c3eeb770f564660a6c89f54bdfc18e08d
Author: Christopher Hein <me@christopherhein.com>
Date:   Tue Aug 7 15:29:42 2018 -0700

    Adding `go-version` vendored package

    Signed-off-by: Christopher Hein <me@christopherhein.com>

If you had multiple parts that are generated and not originally authored in full by you, they would be treated the same way. Maybe this is a generated client, produced by the Kubernetes code generation libraries from a types.go file. In that case, put the types.go file and anything else you wrote to set up the code gen in one commit, then add a second commit for the generated code.

In Practice

Now we all have our ways of saving our work along the way: some use .bak files, some ignore it, and some use git throughout the development lifecycle, constantly commit-ing locally in case they need to step back. That last one is the most common experience I am seeing. So let’s break this down and show how to rewrite your git history and rebuild it with only the changes that are needed in each commit.

Rebase Is Really Powerful

So in that example imagine you were continuously commit-ing into your local copy and ended up with something that looks like this.

commit 1f231401246e00721f153dfb8ddc12e6407605f7 (HEAD -> feature/25-rewriting-git-history)
Author: Christopher Hein <me@christopherhein.com>
Date:   Sun Sep 23 00:14:28 2018 -0700

    Yay, it works!

commit d8479321716d937e6ff62f9f7e6bc923b1f687c1
Author: Christopher Hein <me@christopherhein.com>
Date:   Sun Sep 23 00:14:07 2018 -0700

    Think I have it

commit d522a8931ff92a4d45b9c8582d84343566653065
Author: Christopher Hein <me@christopherhein.com>
Date:   Sun Sep 23 00:13:56 2018 -0700

    Trying Again

commit b3dd3b8df5efa20179552f796627a4609c588074
Author: Christopher Hein <me@christopherhein.com>
Date:   Sun Sep 23 00:13:42 2018 -0700

    testing

All four of these commits represent writing some new code which includes dependencies. First and foremost, reset and rebase can be your friends; if these aren’t in your development workflow, they should be. Let’s first reset everything so that we keep all the changes but none of these commits, ready to be rebuilt as 2 commits. For this we use reset.

First grab the SHA of the first commit of this feature. For us that is b3dd3b8df5efa20179552f796627a4609c588074. With this we’ll run the following.

git reset --soft b3dd3b8df5efa20179552f796627a4609c588074^

📎 Note the ^, which resets HEAD to the parent of that commit, meaning this commit won’t exist in your history anymore. If you did this without the ^ you would still see the b3dd3b8df5efa20179552f796627a4609c588074 commit in your history but nothing after it.

Once we have our code in this state, git status will show that all our files are still staged, meaning if we try to make a commit all files will be added. So let’s reset again without --soft to unstage.

git reset
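To see what these two resets do without risking real work, here is a minimal sketch that replays them in a throwaway repository. The file names, commit messages, and identity are placeholders invented for the demo, not part of the original project.

```shell
#!/bin/sh
# Sandbox demo: everything happens in a temporary directory.
set -eu
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"

repo=$(mktemp -d)
cd "$repo"
git init -q

# A base commit stands in for the point where the branch was cut.
echo "base" > README
git add README
git commit -q -m "base"

# Four work-in-progress commits, each adding one hypothetical file.
for n in 1 2 3 4; do
  echo "change $n" > "file$n.txt"
  git add "file$n.txt"
  git commit -q -m "wip $n"
done

# SHA of the first work-in-progress commit.
first=$(git rev-parse HEAD~3)

# Step 1: move HEAD to the parent of that commit, keeping all changes staged.
git reset -q --soft "${first}^"
commits_left=$(git rev-list --count HEAD)
staged_after_soft=$(git diff --cached --name-only | wc -l | tr -d ' ')

# Step 2: a plain reset unstages everything but leaves the files on disk.
git reset -q
staged_after_mixed=$(git diff --cached --name-only | wc -l | tr -d ' ')
untracked=$(git ls-files --others --exclude-standard | wc -l | tr -d ' ')

echo "commits left: $commits_left"
echo "staged after --soft: $staged_after_soft"
echo "staged after plain reset: $staged_after_mixed, untracked: $untracked"
```

After the first reset only the base commit remains and all four files are staged; after the second nothing is staged and the four files sit in the working tree, ready to be added selectively.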

Now we can start to reconstruct your history, commit-ing the generated code first and then the “handwritten” code.

git add vendor/
git commit -m "adding vendored code for X"

Then we’ll commit our “handwritten” code, keeping the two separate to make code review much easier.

git add cmd/aws-iam-authenticator/version.go
git commit -m "adding go-version"
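Before pushing, it can help to check that each rebuilt commit contains only the paths you intended. The sketch below does this in a throwaway repository; the vendored file name lib.go and the identity are assumptions for illustration, and git diff-tree is just one of several ways to list a commit’s files.

```shell
#!/bin/sh
# Sandbox demo: rebuild two commits and verify what each one touches.
set -eu
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"

repo=$(mktemp -d)
cd "$repo"
git init -q
echo "base" > README
git add README
git commit -q -m "base"

# Uncommitted work: one vendored file (hypothetical) and one handwritten file.
mkdir -p vendor/github.com/someproject cmd/aws-iam-authenticator
echo "package someproject" > vendor/github.com/someproject/lib.go
echo "package main" > cmd/aws-iam-authenticator/version.go

# Commit the vendored code first, then the handwritten code.
git add vendor/
git commit -q -m "adding vendored code for X"
git add cmd/aws-iam-authenticator/version.go
git commit -q -m "adding go-version"

# List the files touched by each commit.
top_files=$(git diff-tree --no-commit-id --name-only -r HEAD)
prev_files=$(git diff-tree --no-commit-id --name-only -r HEAD~1)

echo "HEAD:   $top_files"
echo "HEAD~1: $prev_files"
```

If a vendored path shows up under HEAD, or a handwritten file under HEAD~1, the split did not go as planned and it is cheap to reset and try again at this point.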

Updating a Previously “Perfect History”

Fantastic, now you have a rewritten history. You go and push it up and the reviewers ask you to change something. Does that mean you have to go through this full workflow again? No! We can continue to --amend our changes into the previous commit. Let’s try it, given a git status of:

## feature/125-version-command
 M cmd/aws-iam-authenticator/version.go

We can use --amend to mutate the previous commit.

git add cmd/aws-iam-authenticator/version.go
git commit --amend

This will open the previous commit message in your editor (a vim session in my case), and you can just save and close the file to apply.
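If the message does not need to change, git’s --no-edit flag skips the editor entirely. A minimal sketch in a throwaway repository follows; the file name and identity are placeholders for the demo.

```shell
#!/bin/sh
# Sandbox demo: amend the top commit without opening an editor.
set -eu
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"

repo=$(mktemp -d)
cd "$repo"
git init -q
echo "base" > README
git add README
git commit -q -m "base"

echo "package main" > version.go
git add version.go
git commit -q -m "adding go-version"
sha_before=$(git rev-parse HEAD)

# Reviewer feedback: change the file, then fold it into the same commit.
echo "// reviewed" >> version.go
git add version.go
git commit -q --amend --no-edit

sha_after=$(git rev-parse HEAD)
count=$(git rev-list --count HEAD)
subject=$(git log -1 --format=%s)

echo "commits: $count, subject: $subject"
echo "before: $sha_before"
echo "after:  $sha_after"
```

The commit count and message stay the same, but the SHA changes, which is why the next push has to be a forced one.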

Now that you have the commit on the top of your stack updated, let’s push it to your remote origin to update the Pull Request. To do this we use push with an additional flag, --force-with-lease. This updates your remote origin and applies the new history, rewriting the existing one. It is a safer alternative to --force: --force-with-lease will force push to your remote branch only if the branch on the remote still points where your local copy last saw it, meaning nobody else has pushed to it in the meantime.

git push --force-with-lease <origin> <branch>
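To illustrate the safety check, here is a sketch that simulates a collaborator pushing to the same branch before your forced push. It uses a local bare repository as the remote; the clone names a and b, the branch name feature, and the file names are all invented for the demo.

```shell
#!/bin/sh
# Sandbox demo: --force-with-lease refuses to overwrite work it has not seen.
set -eu
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"

work=$(mktemp -d)
git init -q --bare "$work/remote.git"

# Clone "a" is you: publish a feature branch.
git init -q "$work/a"
cd "$work/a"
echo "one" > file.txt
git add file.txt
git commit -q -m "first"
git checkout -q -b feature
git remote add origin "$work/remote.git"
git push -q origin feature

# Clone "b" is a collaborator who adds a commit to the same branch.
git clone -q --branch feature "$work/remote.git" "$work/b"
cd "$work/b"
echo "two" > other.txt
git add other.txt
git commit -q -m "collaborator commit"
git push -q origin feature

# Back in "a": rewrite history, then try to force push without fetching.
cd "$work/a"
echo "changed" > file.txt
git add file.txt
git commit -q --amend --no-edit
if git push -q --force-with-lease origin feature 2>/dev/null; then
  result="pushed"
else
  result="rejected"
fi

remote_tip=$(git -C "$work/remote.git" log -1 --format=%s feature)
echo "push was $result; remote tip is still: $remote_tip"
```

A plain --force would have discarded the collaborator’s commit here. One caveat worth knowing: running git fetch updates your local view of the remote branch, so a fetch followed by --force-with-lease would go through.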

This can be a lot to write so I use a git alias.

git config --global alias.pf "push --force-with-lease"

Rebasing in Additional Commits

With those updates in place, the reviewers then give you the feedback that the package you have vendored needs to be updated as well. Does this mean you have to start the whole workflow again? No! You can make a new commit with the updates, then rebase it into the correct commit. To do this we first need to make a commit with the changes to vendor/.

git add vendor/
git commit -m "vendor commit to be squashed"

Once you have made this commit, you can get the SHA of the “adding vendored code for X” commit and run rebase -i to rebase the commits interactively.

git rebase -i c55ed6ab270cb5c724cacf1feacde0c6998d42a3^

📎 Note the ^. Without it, that commit won’t appear in the output of the command and you won’t be able to squash into it.
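One way to preview which commits an interactive rebase will list is to count or log the same range first. The sketch below compares the range with and without the ^ in a throwaway repository; the file names and identity are placeholders.

```shell
#!/bin/sh
# Sandbox demo: the ^ decides whether the named commit is part of the range.
set -eu
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"

repo=$(mktemp -d)
cd "$repo"
git init -q
echo "base" > README
git add README
git commit -q -m "base"

# Three commits, matching the three lines expected in the rebase list.
for msg in "adding vendored code for X" "adding go-version" "vendor commit to be squashed"; do
  echo "$msg" >> notes.txt
  git add notes.txt
  git commit -q -m "$msg"
done

vendor_sha=$(git rev-parse HEAD~2)

# Commits that "git rebase -i <sha>^" would offer versus "git rebase -i <sha>".
with_caret=$(git rev-list --count "${vendor_sha}^..HEAD")
without_caret=$(git rev-list --count "${vendor_sha}..HEAD")

echo "with ^: $with_caret commits, without ^: $without_caret commits"
git log --oneline "${vendor_sha}^..HEAD"
```

With the ^ all three commits are in range; without it the vendored commit itself is excluded, so there would be nothing to squash into.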

The above command should open a vim session with 3 lines.

pick 2e05ad51 adding vendored code for X
pick 53d69ae4 adding go vendor
pick e3d1bd65 vendor commit to be squashed

In this vim session we’re going to reorder the lines into the order we want the commits to be listed in, and then squash the last commit into the first one. To do this our file should look like this.

pick 2e05ad51 adding vendored code for X
squash e3d1bd65 vendor commit to be squashed
pick 53d69ae4 adding go vendor

📎 Note the squash instead of pick for the latest commit.

By reordering the lines we put the most recent vendor additions directly after the earlier vendor additions, and by changing pick to squash we merge them into one commit. When you save and close this file you will be taken to a second vim session with an aggregated view of the commit messages. Here you can remove the “vendor commit to be squashed” message, leaving the original commit message, “adding vendored code for X”.
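As an alternative to reordering the lines by hand, git can arrange the list for you: a commit created with --fixup is moved and marked automatically when the rebase runs with --autosquash. This is a different technique from the manual squash described above, sketched here in a throwaway repository with hypothetical file names. Setting GIT_SEQUENCE_EDITOR=true simply accepts the prepared list without opening an editor.

```shell
#!/bin/sh
# Sandbox demo: fold a later vendor change into the earlier vendor commit.
set -eu
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"

repo=$(mktemp -d)
cd "$repo"
git init -q
echo "base" > README
git add README
git commit -q -m "base"

mkdir -p vendor cmd
echo "package a" > vendor/a.go
git add vendor/
git commit -q -m "adding vendored code for X"
vendor_sha=$(git rev-parse HEAD)

echo "package main" > cmd/version.go
git add cmd/
git commit -q -m "adding go-version"

# Reviewer feedback: the vendored package needs an update.
echo "package a" > vendor/b.go
git add vendor/
git commit -q --fixup "$vendor_sha"

# Accept the auto-arranged todo list as-is; no editor is opened.
GIT_EDITOR=true GIT_SEQUENCE_EDITOR=true \
  git rebase -q -i --autosquash "${vendor_sha}^"

count=$(git rev-list --count HEAD)
top_subject=$(git log -1 --format=%s HEAD)
vendor_subject=$(git log -1 --format=%s HEAD~1)
vendor_files=$(git diff-tree --no-commit-id --name-only -r HEAD~1 | wc -l | tr -d ' ')

echo "commits: $count"
echo "HEAD:   $top_subject"
echo "HEAD~1: $vendor_subject ($vendor_files files)"
```

A fixup keeps the original commit message and discards the new one, so there is no second editor session to clean up.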

Once that processes, you can git push --force-with-lease, or if you are like me you can git pf, and everything will be in sync on the remote origin.

Conclusion

This topic is vast, no matter if you are working on a private project or open source. Following these methods you make it simple for your reviewers to review code, make the review process less error prone, and allow the history to actually make sense.

If you want to learn more git hacks and workflows, reach out to @christopherhein on Twitter.