Git's simple object storage

Git's powerful abstractions allow us to use it as a version control system. When we look beyond that facade we find a surprisingly simple object storage system.

As software developers we are familiar with the idea that the systems we work with are based on abstractions and that these abstractions make our job easier. After all, the fundamental theorem of software engineering is that we can solve any problem by introducing an extra level of indirection.

But abstractions can leak: sometimes the details of the underlying system show through and have to be dealt with. Let’s take a look at some of Git’s low-level commands for storing and retrieving data, and observe what happens in the .git directory to see how that underlying system works.

The working tree and its Git repository

When you create an empty Git repository using the git init command, a directory named .git is created in the current working directory. This .git directory is the repository, and it is associated with its parent directory, which is the root of what we will refer to as the (main) working tree. The .git directory is typically hidden, so get used to working with hidden files.

Looking inside the Git repository you’ll find the objects directory. This directory is of special interest to us, because we’re going to observe it as we execute one of Git’s low-level commands.
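
If you would like to follow along without touching a real project, one option is a throwaway repository. The sketch below assumes a POSIX shell with git, mktemp and find available; the scratch and object_count variable names are only for illustration.

```shell
# Create a throwaway repository so the experiments cannot touch real work.
scratch=$(mktemp -d)
cd "$scratch" || exit 1
git init --quiet

# A freshly initialised object database holds two empty bookkeeping
# directories, info and pack, and no objects yet.
ls .git/objects
object_count=$(find .git/objects -type f | wc -l | tr -d ' ')
echo "objects stored: $object_count"
```

Keeping this directory listing open in a second terminal makes it easy to see what each of the following commands changes.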

Git, the stupid content tracker

Unlike many databases, Git doesn’t have a schema, support for primitive data types, or a query language. Maybe that’s why Git describes itself as a ‘stupid content tracker’ in its own manual page.

But also unlike many databases, Git automatically deduplicates all information you store in it, and allows you to retrieve information based on its content, rather than, say, a unique ID as is customary in a relational database. Git does this using its content-addressable filesystem. Let’s see how that works!

Storing something in Git

First we need to understand how Git can deduplicate data. Git does this by computing a hash value for the data you want to store, and associating the data with its hash value. Since the same data will result in the same hash value, it’s possible to determine whether or not that data is already in the repository. And we’ll see in a minute how Git’s internal filesystem makes that feasible, and actually quite trivial.

To understand how Git computes a hash value we will use the git hash-object command, and verify its output using the shasum command.

We want to store the text San Francisco in Git’s object database. Let’s see how Git calculates the hash value for this and compare it to an alternative way that doesn’t use Git:

> echo "San Francisco" | git hash-object --stdin
9a63b5a6857cd18567033f8b54463458d67ca3b6

> printf 'blob 14\0San Francisco\n' | shasum --algorithm 1
9a63b5a6857cd18567033f8b54463458d67ca3b6  -

Now you can begin to understand how Git computes the hash value of an object: it is the SHA-1 of a prefix followed by the object’s content, with a single NUL byte separating the two. The prefix consists of the object type (blob in this case), a space, and the length of the content in bytes, written in base 10. Here that length is 14: the 13 characters of San Francisco plus the newline that echo appends.
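
The same recipe can be wrapped in a small helper that works for any file. This is only a sketch under a few assumptions: blob_id and sha1 are hypothetical helper names (not Git commands), and either sha1sum or shasum is installed on the host.

```shell
# Pick whichever SHA-1 tool is installed (an assumption about the host).
if command -v sha1sum >/dev/null 2>&1; then
  sha1() { sha1sum; }
else
  sha1() { shasum -a 1; }
fi

# blob_id FILE: print the name Git would give FILE's content as a blob.
blob_id() {
  size=$(wc -c < "$1" | tr -d ' ')   # length in bytes, base 10
  # Prefix "blob <size>", a NUL byte, then the content itself.
  { printf 'blob %s\0' "$size"; cat "$1"; } | sha1 | cut -d ' ' -f 1
}

sample=$(mktemp)
printf 'San Francisco\n' > "$sample"   # 14 bytes including the newline
blob_id "$sample"
```

Running git hash-object on the same file should print the same name, which is a quick way to check the helper against Git itself.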

To store the object in the object database, simply add the -w option to the command above: echo "San Francisco" | git hash-object --stdin -w.

Did you watch the .git/objects directory? You should see a directory named 9a appear there, containing a file named 63b5a6857cd18567033f8b54463458d67ca3b6. Together, the directory name and the file name make up the hash value of the object we just stored. (Git spreads objects over up to 256 directories, one per possible first byte of the hash, for performance reasons.) Now try to store San Francisco again and notice that nothing changes, because a file named after San Francisco’s hash value is already present in the object database.
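
To make the deduplication visible, the following sketch stores the same content twice in a scratch repository and counts the files in the object database afterwards. It assumes git, mktemp, find and cut are available; the variable names are illustrative.

```shell
scratch=$(mktemp -d)
cd "$scratch" || exit 1
git init --quiet

# Store the same content twice; -w writes the object to the database.
first=$(echo "San Francisco" | git hash-object --stdin -w)
second=$(echo "San Francisco" | git hash-object --stdin -w)

# Split the name the way Git does: the first two characters name the
# directory, the remaining characters name the file.
dir=$(printf '%s' "$first" | cut -c 1-2)
file=$(printf '%s' "$first" | cut -c 3-)
ls -l ".git/objects/$dir/$file"

# Both writes return the same name and leave a single file behind.
stored=$(find .git/objects -type f | wc -l | tr -d ' ')
echo "$first $second files=$stored"
```

Because the file name is derived from the content, checking for a duplicate amounts to checking whether that one path already exists.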

Yes, it’s really that simple.

Retrieving something from Git

Now let’s say we want to check whether San Francisco is stored in the Git repository. If we know its hash value (see the previous section for the calculation), we know its ‘object name’, and retrieving the object is simple:

> git show 9a63
San Francisco

As you can see, we don’t need to pass the full 40-character hexadecimal object name to the git show command. We only need a leading substring, at least four characters long, that is unique within the repository.

An alternative way to retrieve the object is using the git cat-file command:

> git cat-file -p 9a63
San Francisco
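
The cat-file command can also report the two fields from the prefix we hashed earlier, and git rev-parse expands an abbreviated name to the full one. A minimal sketch, assuming a scratch repository as before (the variable names are illustrative):

```shell
scratch=$(mktemp -d)
cd "$scratch" || exit 1
git init --quiet
oid=$(echo "San Francisco" | git hash-object --stdin -w)

short=$(printf '%s' "$oid" | cut -c 1-4)   # four characters is the minimum Git accepts
full=$(git rev-parse "$short")             # expands the abbreviation to the full name
type=$(git cat-file -t "$short")           # object type, as recorded in the prefix
size=$(git cat-file -s "$short")           # content length in bytes, as recorded in the prefix
echo "$full $type $size"
```

In a repository with many objects a four-character abbreviation may be ambiguous, in which case Git asks for a longer one.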

Why did we not just cat the file instead? Because Git compresses the object, prefix included, with zlib before writing it to disk, so the raw file is unreadable binary data:

> cat .git/objects/9a/63b5a6857cd18567033f8b54463458d67ca3b6

Removing something from Git

One does not simply remove something from a Git repository, and that’s considered a feature. In this case, however, we can remove the object we stored because it is a loose object that is unreachable: no ref or other object in the repository refers to it. The git prune command removes such unreachable loose objects from the repository completely.

> git prune --verbose
9a63b5a6857cd18567033f8b54463458d67ca3b6 blob
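
Since pruning is irreversible, it can be worth previewing it first. The sketch below, again in a scratch repository with illustrative variable names, uses the --dry-run option to list what would be removed and then confirms that the object is gone after the real run.

```shell
scratch=$(mktemp -d)
cd "$scratch" || exit 1
git init --quiet
oid=$(echo "San Francisco" | git hash-object --stdin -w)

# --dry-run reports what would be removed without deleting anything.
would_remove=$(git prune --dry-run)
echo "$would_remove"

git prune

# cat-file -e exits non-zero when the object no longer exists.
if git cat-file -e "$oid" 2>/dev/null; then
  status=present
else
  status=gone
fi
echo "$oid is $status"
```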

In summary

We’ve learned how to store an object in Git’s object database, and how to retrieve it, and in the process learned how Git keeps track of unique content using its computed hash value. At first sight this seems to be quite simple to understand, which may explain why Git calls itself a ‘stupid content tracker.’