Ubiquitously revered distributed version control and software code repository system Git has been called out over an alleged flaw. The functionality shortcoming in question stems from the way Git calculates changes between different versions of files: in some circumstances it fails to connect different versions of the same file, and the poor compression that follows ultimately creates repository bloat through excess storage requirements.
Microsoft senior engineer Jonathan Creamer has gone public on the code bloat scenario in question, which relates to a large JavaScript Git repository that his engineering team is currently working with. The repository is in fact a monorepo, i.e. it follows the software development paradigm in which code for multiple projects is located in a single repository.
~20 Million Lines of Code!
“We recently crossed the 1,000 monthly active users mark, about 2,500 packages and ~20 million lines of code! The most recent clone I did of the repo clocked in at an astonishing 178GB. [But] I noticed that the versioned branch seemed to be getting harder and harder to clone because it kept getting so huge,” blogged Creamer.
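For readers who want to gauge how big their own clone really is, two standard Git commands give a quick picture of on-disk size (the figures will, of course, be specific to each repository):

```
# Total size of everything under .git, including the object database
du -sh .git

# Breakdown of loose versus packed objects, with human-readable sizes
git count-objects -v -H
```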
Git contributor Derrick Stolee (who also works as a principal software engineer at Microsoft) has explained the story here. When Git selects candidates for delta compression it leans heavily on file names, so with a commonly used file name (which in this case was CHANGELOG.md) it set about comparing files from entirely different packages against one another. This meant that Git then found a significant difference with every code commit, rather than the small change it would have seen by comparing each file against its own previous version.
“I’ve been focused recently on understanding and mitigating the growth of a few internal repositories. Some of these are growing much larger than expected for the number of contributors and there are multiple aspects to why this growth is so large,” wrote Stolee here. “The main issue plaguing these repositories is that deltas are not being computed against objects that appear at the same path. While the size of these files at [the] tip is one aspect of growth that would prevent this issue, the changes to these files are reasonable and should result in good delta compression. However, Git is not discovering the connections across different versions of the same file.”
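One way to see this kind of growth first-hand is to list the objects that consume the most space in a repository's history. The pipeline below is a commonly used recipe built on standard Git plumbing commands; the cut-off of 20 results is arbitrary.

```
# List every object reachable from any ref, annotate each with its type,
# compressed on-disk size and path, then show the 20 largest blobs.
git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize:disk) %(rest)' |
  awk '$1 == "blob"' |
  sort -k3 -n -r |
  head -20
```

If many of the heavy hitters turn out to be historical versions of the same handful of files, that is a hint the cross-version deltas Stolee describes are not being found.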
When this occurrence was put to a number of industry specialists, the reactions ranged from knowing smiles to relatively unsurprised nods.
Repo Bloat
“The plot thickens, but it makes sense. Repository (repo) bloat like this is worth hunting down, especially in more distributed development organizations,” said Jon Collins, VP of engagement and field CTO at GigaOm UK. “Whether or not the team is attempting to repack a repo manually (which appears to be unspecified), any sysadmin hitting these issues (or harboring concerns about them) should just be careful when running pan-repository commands. It’s more prudent to do it at a quiet time and have up-to-date clones/backups and so on! Those less worried can wait for the system-wide fix.”
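For teams that do decide to repack manually in the meantime, a commonly used mitigation is a full repack with a wider delta search window, which gives Git more candidates to delta against. The window and depth values below are illustrative, the repository URL is a placeholder and, as Collins advises, this is best run at a quiet time with a fresh backup in hand.

```
# Take a full backup first: a mirror clone copies every ref in the repository.
git clone --mirror https://git.example.com/big-monorepo.git big-monorepo-backup.git

# Force a complete repack with a larger delta window. This is CPU- and
# memory-hungry on a repository of this size, so schedule it accordingly.
git repack -a -d -f --window=250 --depth=50
```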
Technical engineer at GitGuardian Thomas Segura thinks that the Git protocol’s design creates a special security challenge. He suggests that when secrets are committed, they persist in repository history even after deletion, becoming what his firm likes to call ‘zombie leaks.’ Unlike traditional runtime vulnerabilities that can be patched, these secrets remain a threat until explicitly revoked.
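A minimal way to check whether a known leaked string still lingers anywhere in history is Git’s ‘pickaxe’ search, which scans every branch for commits that added or removed the string. The credential below is AWS’s published example access key, standing in for a real secret.

```
# List commits on any branch that added or removed the string...
git log --all --oneline -S 'AKIAIOSFODNN7EXAMPLE'

# ...and show the actual diffs that touched it.
git log --all -p -S 'AKIAIOSFODNN7EXAMPLE'
```

Even after the offending file is deleted at the tip, these commands will keep finding the secret in older commits, which is exactly why revocation, not deletion, is the real fix.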
“Efficient management of repository size is crucial, especially in monorepos where every gigabyte counts. A bug like this can chip away at the advantages of maintaining best practices that prevent repositories from ballooning to enormous sizes – like not committing trash or expensive media files. This efficient management is not just about storage – it is essential to avoid longer clone times and slower builds, which drag down productivity,” said Roman Khavronenko, co-founder of monitoring tools and time-series database company VictoriaMetrics.
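Where slow clones rather than server-side storage are the immediate pain, partial and shallow clones are standard stop-gaps; the URL below is a placeholder and blobless clones depend on server-side support.

```
# Blobless partial clone: commits and trees come down up front, file contents
# (blobs) are fetched lazily when a checkout actually needs them.
git clone --filter=blob:none https://git.example.com/big-monorepo.git

# Shallow clone: history truncated to the most recent commit.
git clone --depth=1 https://git.example.com/big-monorepo.git
```

Neither addresses the underlying pack bloat, but both take pressure off day-to-day workflows while it is dealt with.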
Regular Repo Review Rationale
Khavronenko reinforces his point by explaining that issues like name-hash collisions remind us that even reliable tools like Git have edge cases that can lead to massive inefficiencies at scale. He insists that regular repo audits and reviews help catch these inefficiencies early, keeping performance on track and preventing frustrating slowdowns in the development workflow.
“We’ve seen firsthand the challenges that come with managing large Git repositories – especially when there are many contributors and a high change rate. While the size of these repositories can consume excessive disk space and slow down operations, there’s a more pressing concern: the increased risk of corruption, merge conflicts, overwrites and force push errors,” said Subbiah Sundaram, SVP Product, Hycu, Inc. “In environments where many people are contributing simultaneously, the chances of something going wrong increase significantly. A single mistake by a contributor or an admin – like an unintended force push — can lead to serious consequences. This could mean the loss of intellectual property or even losing knowledge about how your infrastructure is configured (think AWS Lambda functions or VPC settings).”
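On the server side, Git’s own configuration can take some of the sting out of the force-push scenario Sundaram describes. The settings below are applied to the hosted (bare) repository; managed platforms typically expose equivalent branch-protection switches instead.

```
# Refuse pushes that would rewrite already-published history (force pushes)...
git config receive.denyNonFastForwards true

# ...and refuse pushes that delete branches or tags outright.
git config receive.denyDeletes true
```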
Sundaram and team say that data loss in these environments isn’t just a theoretical problem; it happens more often than most of us would like to admit.
He adds that this isn’t an issue exclusive to large Git repositories, and reminds us that some might suggest simply cloning the repository (the classic “final_final_final_I_promise_its_final” copy) as a safety net. However, with numerous contributors and inconsistent cloning practices, this isn’t foolproof.
To mitigate these risks, Hycu’s Sundaram provides us with three key best practices:
- Increase protective measures and implement Git hooks and access controls to prevent unauthorized or dangerous actions (a minimal hook sketch follows this list).
- Use continuous integration so that automated testing and integration pipelines can catch issues early before they become bigger problems.
- Create frequent and scheduled backups, i.e. keep consistent backups (not clones) of repositories so that teams can recover more easily if data loss occurs (see the backup example after this list).
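To make the first and third of those practices concrete, the sketches below show a server-side pre-receive hook that blocks oversized files from ever entering history, and a scheduled backup built on git bundle. Hook paths, the 10 MiB threshold, backup locations and filenames are all illustrative choices, not prescriptions.

```
#!/bin/sh
# pre-receive hook on the server-side repository (illustrative sketch).
# Rejects any push that introduces a blob larger than the size limit.
limit=$((10 * 1024 * 1024))  # 10 MiB, an arbitrary example threshold
zero=0000000000000000000000000000000000000000

while read old new ref; do
    [ "$new" = "$zero" ] && continue        # branch deletion: nothing to scan
    if [ "$old" = "$zero" ]; then
        range="$new"                        # brand-new branch: scan conservatively
    else
        range="$old..$new"                  # existing branch: scan only new commits
    fi
    git rev-list --objects "$range" |
      git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
      awk -v limit="$limit" '$1 == "blob" && $2 > limit {
          print "Rejected: " $3 " exceeds the size limit" > "/dev/stderr"; bad = 1
      } END { exit bad }' || exit 1
done
exit 0
```

For backups, a bundle is a single file containing refs and objects that can be cloned from later, which makes it convenient for scheduled, point-in-time copies stored away from the Git server.

```
# Create a dated, self-contained backup of every ref in the repository...
git bundle create /backups/monorepo-$(date +%F).bundle --all

# ...and restore by cloning straight from the bundle file.
git clone /backups/monorepo-2025-01-01.bundle restored-monorepo
```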
There is no official word from the Git project as to whether it might offer automation tools to monitor (and indeed remediate) the issues tabled here, but as computing environments get larger and codebases become increasingly complex, we might reasonably expect realignments or augmentations to occur.