Git staging demystified

In The Wizard of Oz, Dorothy and her travelling companions learn that the magical wizard is just an old man hiding behind a curtain, operating a control panel. The special effects he controls merely make him seem magical. A simple look behind the curtain can reveal a much simpler truth.

As we saw in the previous post, Git’s object storage is quite straightforward. But we only tried to store content that we passed to Git directly via stdin. When using Git as a version control system, we want to store files in its repository instead, and Git makes us stage files before we can do so. Sometimes this is referred to as adding your file to the staging area, or the index. This sounds like the file is being moved to some magical place, but that’s not really what’s happening.

Where is this magical index?

There is nothing magical about the index. Don’t believe me? Let’s look behind the curtain…

We’ll start by creating a simple file and checking how Git will hash its contents, using the knowledge of Git’s internals that we gained in the previous post.

> echo "San Francisco" > test
> cat test | git hash-object --stdin
9a63b5a6857cd18567033f8b54463458d67ca3b6
> git add test

After we add a file to the index, the file’s contents are stored in the repository as a blob. But the .git directory now also contains a new file named index. This file is the man behind the curtain: the index.
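Both claims are easy to check. The following is a minimal sketch rather than part of the original walkthrough: it assumes a throwaway repository created with mktemp, and the variable names repo and blob are ours.

```shell
# Work in a scratch repository so the experiment cannot touch real work.
repo=$(mktemp -d)
cd "$repo"
git init --quiet

echo "San Francisco" > test
blob=$(git hash-object test)   # the hash Git assigns to the file's contents
git add test

git cat-file -t "$blob"        # type of the stored object
git cat-file -p "$blob"        # contents of the stored object
ls -l .git/index               # the index file that 'git add' created
```

The two cat-file calls should report a blob containing the line we wrote, and ls should show that .git/index exists.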

Unfortunately the index is a binary file, so we cannot look at its contents directly to grok it. So we’ll use Git’s low-level ls-files command to inspect the index.
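Binary does not mean unreadable, though. The index begins with a 12-byte header: the ASCII signature DIRC, a format version and the number of entries, the last two stored as big-endian 32-bit integers. The sketch below peeks at that header with standard tools; the scratch repository and the variable name signature are our own assumptions.

```shell
# Scratch repository with a single staged file.
repo=$(mktemp -d)
cd "$repo"
git init --quiet
echo "San Francisco" > test
git add test

# The first 4 bytes of the index are the ASCII signature.
signature=$(head -c 4 .git/index)
echo "$signature"

# Dump the whole 12-byte header as hex, one byte per column:
# bytes 0-3 signature, bytes 4-7 version, bytes 8-11 entry count.
head -c 12 .git/index | od -A d -t x1
```

Beyond the header the entries are laid out compactly, which is why we let ls-files do the decoding.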

What’s in the index?

The ls-files command lists the files recorded in the index. By default it prints only their paths; with the --stage option it also prints each entry’s mode, object name and stage number:

> git ls-files
test
> git ls-files --stage
100644 9a63b5a6857cd18567033f8b54463458d67ca3b6 0	test

The astute reader recognises the file’s mode (100644) and the hash value from before. There’s also a stage number (0), which we’re going to ignore for now. Not visible in the output of ls-files, but also present in the index, is the file’s last modification time.
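To see the timestamps and the other cached file attributes, ls-files has a --debug option. Git’s documentation warns that the exact format of this output may change, so treat the sketch below as something for manual inspection only. The scratch repository and the variable name debug_output are our own.

```shell
# Scratch repository with a single staged file.
repo=$(mktemp -d)
cd "$repo"
git init --quiet
echo "San Francisco" > test
git add test

# --debug prints extra lines after each path with the data cached
# in the index entry, such as ctime, mtime and file size.
debug_output=$(git ls-files --debug)
echo "$debug_output"
```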

So the index contains metadata about objects stored in the repository, in an optimised binary format. This allows Git to maintain a virtual tree that can be compared efficiently to the working tree. We can therefore change the file in the working tree and compare it to the version we staged earlier.

> echo "New York" > test
> cat test | git hash-object --stdin
825b603258f31bd8639378c5b314f33056dcb20c
> git diff
diff --git a/test b/test
index 9a63b5a..825b603 100644
--- a/test
+++ b/test
@@ -1 +1 @@
-San Francisco
+New York
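The same comparison can be made with ls-files itself. In this sketch (scratch repository and variable names are our own assumptions) the --modified option lists paths whose working tree contents differ from the index, and the object name recorded in the index still leads to the content we staged.

```shell
# Scratch repository: stage one version, then edit the working tree.
repo=$(mktemp -d)
cd "$repo"
git init --quiet
echo "San Francisco" > test
git add test
echo "New York" > test   # changes the working tree, not the index

# Paths whose working tree contents differ from the staged version.
modified=$(git ls-files --modified)
echo "$modified"

# The second space-separated field of --stage output is the object name.
staged=$(git ls-files --stage | cut -d ' ' -f 2)
git cat-file -p "$staged"   # prints the staged contents
```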

Because the file’s content at the time of staging is stored as a blob in the repository, we can discard our changes and restore the staged version:

> git checkout -- test
> cat test
San Francisco

If instead we change the file and ‘re-stage’ it, Git will store a new blob in its repository (825b), and the reference in the index will point to this new blob. The old blob (9a63) is still present in the repository, but nothing refers to it any more, so it can be removed using the git prune command.

> echo "New York" > test
> git add test
> git ls-files --stage
100644 825b603258f31bd8639378c5b314f33056dcb20c 0	test
> git prune --verbose
9a63b5a6857cd18567033f8b54463458d67ca3b6 blob
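If deleting objects feels drastic, git prune also has a --dry-run option that only reports what it would remove. The sketch below replays the steps above in a scratch repository; the variable names old and would_prune are ours.

```shell
# Scratch repository: stage one version, then re-stage another.
repo=$(mktemp -d)
cd "$repo"
git init --quiet
echo "San Francisco" > test
git add test
old=$(git hash-object test)   # object name of the first staged blob
echo "New York" > test
git add test                  # the index now points at the new blob

# Report unreachable objects without deleting anything.
would_prune=$(git prune --dry-run)
echo "$would_prune"

# cat-file -e exits with status 0 when the object exists.
git cat-file -e "$old" && echo "old blob still present"
```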

Removing a file from the index

We can remove the reference to the object from the index using the git rm command with the --cached option. This does not remove the object itself from the repository. Thanks to --cached, it does not remove the file from the working tree either; a plain git rm would delete it.

> git rm --cached test
rm 'test'

Now that the blob is no longer referenced by the index, we can remove it from the repository using git prune.
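Here is the whole sequence as a minimal sketch, again in a scratch repository with variable names of our own choosing: the index entry disappears, the working tree file stays, and the blob survives until we prune.

```shell
# Scratch repository with a single staged file.
repo=$(mktemp -d)
cd "$repo"
git init --quiet
echo "San Francisco" > test
git add test
blob=$(git hash-object test)

git rm --cached --quiet test

staged=$(git ls-files --stage)   # empty: the index no longer lists the file
cat test                         # the working tree file is untouched
git cat-file -e "$blob" && echo "blob still stored"

git prune                        # the blob is unreachable now, so it goes
git cat-file -e "$blob" 2>/dev/null || echo "blob pruned"
```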

Pay no attention to that man behind the curtain.

Except we did, and what we found was the familiar way that Git stores content as blobs in its repository, and a relatively simple binary file that stores references to said blobs along with some metadata. The metadata allows Git to track changes to files in the working tree.
