Have you ever accidentally pushed secret keys or huge files while using Git for version control? Did you know that removing those keys even 20 seconds after exposing them to the public might already be too late?
In this blog post, I would like to highlight the dangers of exposing confidential information and emphasize what can possibly go wrong. I then provide a few nice tricks that I also use, so that you don’t need to worry or be scared while using Git anymore.
We are human after all, so we all make mistakes; but it is also crucial to learn from those mistakes.
One type of unwanted content in a Git repository is very large files. If you accidentally commit a large file, every pull and push will take longer, and GitHub will reject the push with an error if the file is larger than 100 MB.
Second, if you work in software, you have surely heard this many times by now: never push confidential information to a repository. Attackers with minimal resources can compromise many GitHub users by stealing leaked secrets and keys. Yet I see that people are still not careful enough about this, so I’d like to share a few stats.
No. In fact, this is one of the most dangerous things that you can do. People tend to think that once they remove a file from a repository, it is no longer reachable. This is not correct, because keeping history is exactly what Git is for: it tracks every version of your files so that you can go back in time and revert changes.
By making a commit to remove a file in the following way, you are just directing strangers on the Internet to where your secrets live.
$ git commit -m "Remove api key"
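If you want to convince yourself, here is a minimal sketch you can run in a throwaway directory. The file name config.env and the key value are made up for illustration; the only assumption is that git is installed.

```shell
# Work in a throwaway repository so nothing real is touched.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo"
git config user.email "demo@example.com"

# Commit a (fake) key, then "remove" it with a follow-up commit.
echo 'API_KEY=not-a-real-key' > config.env
git add config.env
git commit -q -m "Add config"
git rm -q config.env
git commit -q -m "Remove api key"

# The file is gone from the working tree...
[ ! -e config.env ] && echo "config.env is gone from the working tree"

# ...but searching the history still finds the commit that added it.
leaked=$(git log -p --all -S 'API_KEY' | grep -c '^+API_KEY=')
echo "added lines in history that still expose the key: $leaked"
```

If the count is non-zero, anyone who can clone the repository can read the key, no matter what the latest commit looks like.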
You can see how frequent this is with just a single search [1]. To make this concrete: while I was writing this post on January 5, 2023, GitHub returned 1M+ commits for the search query “remove api key” and 735K+ commits for the query “remove password”.
For example, with ChatGPT getting popular and people trying to write Python scripts to play with it (🤍), I found countless OpenAI API keys living in random corners of GitHub (🥲).
Think about the possibilities!
When we think about removing a commit from Git history, the first thing that comes to mind is to immediately move the tip of the branch back to an older commit. This safely moves us back in time to when the key was not present in the repository.
$ git reset <SHA1>
$ git commit -am "message"
$ git push -f <remote-name> <branch-name>
Well, if the problem you have is related to large file sizes, you can always use git filter-branch to remove past information / files from your history. In addition to this, there is a much better and simpler approach that I like using a lot.
git filter-branch
Meet BFG-Repo-Cleaner! – This is a tool written in Scala that removes large files (like the pre-trained models or large PDFs that you are not able to get rid of) or troublesome blobs (e.g. API keys, passwords, secrets) like git filter-branch does, but faster.
GitHub’s official documentation also recommends [2] using BFG-Repo-Cleaner for purging a file.
Edit: I have recently been informed that git filter-branch is no longer recommended; its own documentation warns about its pitfalls and points to alternatives. You can use git filter-repo, or the aforementioned BFG tool, instead.
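As a rough sketch of what a BFG run looks like, the function below follows the steps from the BFG documentation: mirror-clone, strip the large blobs, then expire and garbage-collect. The function name purge_big_blobs, the directory name cleanup-mirror.git, and both arguments are placeholders of mine, and it assumes Java and the BFG jar are already on your machine. Nothing runs until you call it yourself.

```shell
# Sketch of a BFG run; both arguments are placeholders you supply.
purge_big_blobs() {
    bfg_jar=$1                  # path to the downloaded BFG jar (assumed)
    repo_url=$2                 # URL of the repository to clean (assumed)
    mirror=cleanup-mirror.git   # hypothetical local directory name

    # BFG works on a bare mirror clone of the repository.
    git clone --mirror "$repo_url" "$mirror" || return 1

    # Drop every blob larger than 100 MB from the history.
    java -jar "$bfg_jar" --strip-blobs-bigger-than 100M "$mirror" || return 1

    # Expire old references and garbage-collect the now-unreferenced data.
    (cd "$mirror" && git reflog expire --expire=now --all \
        && git gc --prune=now --aggressive) || return 1

    echo "Review $mirror, then run 'git push' inside it to publish the rewrite."
}
```

For secrets rather than large files, BFG also offers --delete-files and --replace-text in place of --strip-blobs-bigger-than. Keep in mind that BFG protects your latest commit by default, so the unwanted file has to be removed from the current version of the code first.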
No, it is not. Of course, you can use this tool anytime to remove large files, etc. However, you should still be careful not to push credentials to public repositories in the first place. If you have recently exposed a secret on GitHub, you should act really fast to take it back with the aforementioned tools.
A paper titled “How Bad Can It Git? Characterizing Secret Leakage in Public GitHub Repositories” [3] presents the first comprehensive, longitudinal analysis of secret leakage on GitHub. The researchers evaluate two different approaches for mining secrets: one discovers 99% of newly committed files containing secrets in real time, whereas the other leverages a large snapshot covering 13% of all public repositories, some dating back to GitHub’s creation.
GitHub should have much stricter policies or checks for commits that might expose a secret. Or at least a warning for newly registered accounts, directing them to the respective documentation? I believe this is crucial, especially to welcome newcomers who are just starting their programming journey. Developers (especially juniors) should know how to make source code public securely, and what consequences they might have to deal with if they fail to do so.
git add .
git commit -a
git add filename
git diff --cached
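Since git diff --cached shows exactly what is about to be committed, it is also a convenient place for a small automated check. The sketch below is only an illustration: the function name find_secrets and the patterns in secret_pattern are my own assumptions, nowhere near a complete list, and will produce both false positives and misses.

```shell
# Illustrative patterns only; extend them for the credentials you use.
secret_pattern='AKIA[0-9A-Z]{16}|api[_-]?key[[:space:]]*[=:]|password[[:space:]]*[=:]'

# Read a diff on stdin and print added lines that look like secrets.
# Exit status is 0 when something suspicious was found, non-zero otherwise.
find_secrets() {
    # Keep added lines, skip the "+++ b/file" headers, then match patterns.
    grep '^+' | grep -v '^+++ ' | grep -Ei "$secret_pattern"
}
```

Running `git diff --cached -U0 | find_secrets` before each commit (for instance from a pre-commit hook that aborts on a match) gives you one more chance to catch a key before it leaves your machine.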
During the writing of this post, none of the keys I found in public had been bombed 💣 (I mean scraped). Therefore, no owner of these repos with possibly leaked information had been harmed. I would just like to emphasize the dangers of sharing confidential data, so that people can start being more careful about what to share with random strangers and what not, both on software development platforms and on a much larger scale (such as social media platforms).
Let me know in the comments if you have other tips and tricks!
— Coding Woman
[1] https://github.com/search?q=remove+api+key&type=commits
[2] Removing sensitive data from a repository: https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/removing-sensitive-data-from-a-repository
[3] Meli, Michael, Matthew R. McNiece, and Bradley Reaves. “How Bad Can It Git? Characterizing Secret Leakage in Public GitHub Repositories.” NDSS, 2019.