Written for LIGO.

Please send remarks and suggestions to git-tutorial@suzanne.soy or simply fork this repository on GitHub.

This version of the site matches the tag v1.0.2 on GitHub. Permalinks to snapshots of this site are available via IPFS: v1.0.2 (this version), v1.0.1 (02023-11-21), v1 (02021-06-29). Alternatively, check the latest version via IPNS/IPFS or latest via HTTPS. See the Changelog section for errata, and the sitemap for a list of contents.

Credits and license

This article was written as part of my work for LIGO.

The main reference for this tutorial is the Pro Git book section on GIT internals.

This tutorial uses these libraries:

In order to encourage people to write their own implementation of a version control system, and improve upon the state of the art, the contents of this tutorial (the files index.html, git-tutorial.js, git-tutorial.css, deploy.sh and README, including the GIT implementation contained within) are dedicated to the Public Domain, using the Creative Commons CC0 1.0 Universal (CC0 1.0) Public Domain Dedication.

The intent is to enable everyone to freely reuse and share part or all of this material under any license, including the CC0, the MIT license, other open-source licenses or proprietary licenses, without any limitations. Crediting this original article is appreciated but not required.

This tutorial comes without any warranty. In particular, there are a few incompatibilities (e.g. this implementation cannot read from repositories using pack files, and it is quite possible that some issues would cause it to produce repositories that are not 100% compatible with the official implementation of GIT), a few bugs (e.g. unicode and binary text might not be stored correctly), and some security vulnerabilities (user input is not sanitized when displayed).

Introduction

GIT is based on a simple model, with a lot of shorthands for common use cases. This model is sometimes hard to guess just from the everyday commands. To illustrate how GIT works, we'll implement a stripped-down clone of GIT in a few lines of JavaScript (empty lines and single closing braces excluded; a few more in total).

The Operating System's filesystem

Model of the filesystem

The Operating System's filesystem will be simulated by a very simple key-value store. In this very simple filesystem, directories are entries mapped to null and files are entries mapped to strings. The path to the current directory is stored in a separate variable.
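As a rough illustration, such a store could be sketched as follows. The names filesystem, current_directory, is_directory and is_file, as well as the file contents, are assumptions made for this example and may differ from the code of this tutorial.

```javascript
// Hypothetical model: keys are paths, null marks a directory,
// and a string holds the contents of a file.
const filesystem = {
  'proj': null,
  'proj/README': 'This is an example project.\n',
  'proj/src': null,
  'proj/src/main.scm': "(display 'hello)\n",
};

// The path to the current directory is kept in a separate variable.
let current_directory = 'proj';

// An entry is a directory when mapped to null, a file when mapped to a string.
function is_directory(path) { return filesystem[path] === null; }
function is_file(path) { return typeof filesystem[path] === 'string'; }

console.log(is_directory('proj/src'), is_file('proj/README'));
```

Keeping everything in one flat map makes the whole "disk" easy to display and to copy, at the cost of scanning all keys when listing a directory.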

Filesystem access functions (read, write, mkdir, exists, remove, cd)

The filesystem exposes functions to read an entire file, create or replace an entire file, create a directory, test the existence of a filesystem entry, remove an entry, and change the current directory.
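A minimal sketch of such access functions is given below, assuming the key-value model described earlier. The helper resolve() and the exact behaviour on errors are assumptions of this example, not necessarily those of the tutorial's code; in particular, cd() only understands paths relative to the current directory.

```javascript
// Hypothetical in-memory filesystem: null = directory, string = file contents.
const filesystem = Object.create(null);
let current_directory = '';

// Interpret a path relative to the current directory.
function resolve(path) {
  return current_directory === '' ? path : current_directory + '/' + path;
}

function exists(path) { return resolve(path) in filesystem; }
function mkdir(path) { filesystem[resolve(path)] = null; }
function write(path, contents) { filesystem[resolve(path)] = String(contents); }
function read(path) {
  const contents = filesystem[resolve(path)];
  if (typeof contents !== 'string') { throw new Error('not a file: ' + path); }
  return contents;
}
function remove(path) { delete filesystem[resolve(path)]; }
function cd(path) { current_directory = resolve(path); }

// Example use: create a directory, enter it and write a file.
mkdir('proj');
cd('proj');
write('README', 'hello\n');
console.log(exists('README'), JSON.stringify(read('README')));
```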

Filesystem access functions (listdir)

It will be handy for some operations to list the contents of a directory.
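One possible way to list a directory in a flat key-value store is to filter the keys by prefix. The sketch below assumes the model described earlier; the sample entries and the decision to sort the result are choices made for this example.

```javascript
// Hypothetical filesystem contents: null = directory, string = file.
const filesystem = {
  'proj': null,
  'proj/README': 'readme\n',
  'proj/src': null,
  'proj/src/main.scm': 'main\n',
};

// Return the names of the direct children of a directory,
// leaving out the entries of its subdirectories.
function listdir(dirname) {
  const prefix = dirname + '/';
  return Object.keys(filesystem)
    .filter(path => path.startsWith(prefix))
    .map(path => path.slice(prefix.length))
    .filter(name => name.length > 0 && !name.includes('/'))
    .sort();
}

console.log(listdir('proj'));      // [ 'README', 'src' ]
console.log(listdir('proj/src'));  // [ 'main.scm' ]
```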

Example working tree

Our imaginary user will create a proj directory, and start filling in some files.

A working tree designates the directory (and the subdirectories and files within) in which the user will normally view and edit the files. GIT has commands to save the state of the working tree (git commit), in order to be able to go back in time later on, and view older versions of the files.

The command git worktree allows the user to create multiple working trees using the same local repository. This effectively allows the user to easily have two or more versions of the project side-by-side. GIT commands can be invoked in any of these working trees. It is worth noting that the .git/ directory exists only in the original working tree. While it is safe to remove other worktrees (followed by an invocation of git worktree prune from one of the remaining working trees to let GIT detect the deletion), the removal of the original working tree will discard the .git/ directory, and all versions of the project that have not been published elsewhere (usually via git push) will be lost.

git init (creating .git)

The first thing to do is to initialize the GIT directory. For now, only the .git folder is needed; the rest of the function implementing git init will be written later.

Click on the eval button to see the files and directories that were created so far.

git hash-object (storing a copy of a file in .git)

The most basic element of a GIT repository is an object. Objects have a type, which can be blob (individual files), tree (directories), commit (pointers to a specific version of the root directory, with a description and some metadata) or tag (named pointers to a specific commit, with a description and some metadata). When a file is added to the git repository, a compressed copy is stored in GIT's database, in the .git/objects/ folder. This copy is a blob object.

The compressed copy is given a unique filename, which is obtained by hashing the contents of the original file. Some filesystems have poor performance when a single directory contains a large number of files, and some filesystems have a limit on the number of files that a directory may contain. To circumvent these issues, the first two characters of the hash are used as the name of an intermediate directory: if a file's hash is 0a1bd…, its compressed copy will be stored in .git/objects/0a/1bd…

This function creates a file that looks like this:

The objects stored in the GIT database are compressed with zlib (using the "deflate" compression method). The filesystem view shows the marker deflated: followed by the uncompressed data. Click on the (un)compressed data to toggle between this pretty-printed view and the raw compressed data.
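The round trip between the raw and compressed forms can be sketched as follows. This example assumes a Node.js environment and uses its built-in zlib module in place of the pako library; the sample object is made up for the illustration.

```javascript
const zlib = require('zlib');

// An uncompressed blob object: header, null byte, then the file contents.
const object = Buffer.from('blob 13\0test content\n', 'latin1');

// deflateSync() produces the zlib format, which wraps a deflate stream.
const compressed = zlib.deflateSync(object);
const restored = zlib.inflateSync(compressed);

console.log(compressed[0].toString(16));  // 78, the usual first byte of a zlib stream
console.log(restored.equals(object));     // true
```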

When creating some blob objects, the result could be, for example:

This function reproduces faithfully the behaviour of (a subset of the options of) the git hash-object command which can be called on a real git command-line.

Adding a file to the GIT database

So far, our GIT database does not know about any of the user's files. In order to add the contents of the README file in the database, we use git hash-object -w -t blob README, where -w tells GIT to write the object in its database, and -t blob indicates that we want to create a blob object, i.e. the contents of a file.

Click on the eval button to see the file that was created by this call.

You can notice that the database does not contain the name of the original file, only its content, stored under a unique identifier which is derived by hashing that content. Let's add the second user file to the database.

zlib compression

GIT compresses objects with zlib. The deflate() function used in the script above comes from the pako 2.0.3 library. To view a zlib-compressed object in your *nix terminal, define the following function in your shell.

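# Decompress a zlib-compressed file to standard output.
# sys.stdout.buffer only exists in Python 3, hence python3.
unzlib() {
  python3 -c \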
    "import sys,zlib; \
     sys.stdout.buffer.write(zlib.decompress(open(sys.argv[1], 'rb').read()));" \
    "$1"
}

You can then inspect git objects as follows, using hexdump to view the null bytes and other non-printable bytes.

unzlib .git/objects/95/d318ae78cee607a77c453ead4db344fc1221b7 | hexdump -Cv

Storing trees (list of hashed files and subtrees)

At this point GIT knows about the contents of both of the user's files, but it would be nice to also store the filenames. This is done by creating a tree object.

A tree object can contain files (by associating the blob's hash to its name), or directories (by associating the hash of other subtrees to their name). The mode (100644 for the file and 40000 for the folder) indicates the permissions, and is given in octal using the values used by *nix.

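In the contents of a tree written by this implementation, subdirectories (trees) are listed before files (blobs); within each group the entries are ordered alphabetically. Note that the official implementation instead sorts all the entries together by name, byte by byte, comparing the name of a subdirectory as if it ended with a /.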

This function needs a small utility to convert hashes encoded in hexadecimal to raw bytes.
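The conversion and the layout of a single tree entry could look like the sketch below. The names hex_to_bytes and tree_entry are assumptions of this example, and the hash is only a sample value.

```javascript
// Convert a hexadecimal string (two characters per byte) to raw bytes.
function hex_to_bytes(hex) {
  const bytes = [];
  for (let i = 0; i < hex.length; i += 2) {
    bytes.push(parseInt(hex.slice(i, i + 2), 16));
  }
  return Buffer.from(bytes);
}

// One tree entry: "<mode> <name>", a null byte, then the 20 raw bytes of the hash.
function tree_entry(mode, name, hash) {
  return Buffer.concat([
    Buffer.from(mode + ' ' + name + '\0', 'utf8'),
    hex_to_bytes(hash),
  ]);
}

const entry = tree_entry('100644', 'README', 'd670460b4b4aece5915caf5c68d12f560a9fe3e4');
console.log(entry.length);  // 34 = 6 (mode) + 1 + 6 (name) + 1 + 20 (hash)
```

Storing the hash as 20 raw bytes instead of 40 hexadecimal characters is why the null byte is needed as a separator: the hash itself may contain any byte value.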

Example use of store_tree()

The following code, once uncommented, stores into the GIT database the trees for src and for the root directory of the GIT project.

The store_tree() function needs to be called for the contents of subdirectories first, and that result can be used to store the trees of upper directories. In the next section, we will write a function which takes a list of paths, constructs an internal representation of the hierarchy, and stores the corresponding trees bottom-up.

Storing a tree from a list of paths

Making trees out of the subfolders one by one is cumbersome. The following utility function takes a list of paths, and builds a tree from those.
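The first half of that job, turning a list of paths into a nested structure and finding a bottom-up order, could be sketched as follows. The names paths_to_hierarchy and directories_bottom_up and the use of true to mark files are assumptions of this example; storing the trees themselves is left out.

```javascript
// Turn ['README', 'src/main.scm'] into { README: true, src: { 'main.scm': true } }.
function paths_to_hierarchy(paths) {
  const root = {};
  for (const path of paths) {
    const parts = path.split('/');
    const filename = parts.pop();
    let node = root;
    for (const dirname of parts) {
      if (!Object.prototype.hasOwnProperty.call(node, dirname)) { node[dirname] = {}; }
      node = node[dirname];
    }
    node[filename] = true;  // true marks a file, a nested object marks a directory
  }
  return root;
}

// List directories so that each one comes after all of its subdirectories,
// which is the order in which their trees would need to be stored.
function directories_bottom_up(node, prefix) {
  let result = [];
  for (const [name, child] of Object.entries(node)) {
    if (child !== true) {
      result = result.concat(directories_bottom_up(child, prefix + name + '/'));
    }
  }
  result.push(prefix);
  return result;
}

const hierarchy = paths_to_hierarchy(['README', 'src/main.scm', 'src/lib/util.scm']);
console.log(directories_bottom_up(hierarchy, ''));  // [ 'src/lib/', 'src/', '' ]
```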

Storing a commit in the GIT database

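Now that the GIT database contains the entire tree for the current version, a commit can be created. A commit contains the hash of the tree object for this version, the hashes of its parent commits (none for the first commit), the author and committer information, and a free-form description.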

The author and committer information contain a name, an e-mail address, a timestamp and a timezone offset.

Storing an example commit

It is now possible to store a commit in the database. This saves a copy of the tree along with some metadata about this version. The first commit has no parent, which is represented by passing the empty list.
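The text of a commit object, before the object header is added and the result is compressed, could be assembled as in the sketch below. The function name format_commit is an assumption of this example, and the tree hash used is the well-known hash of an empty tree, chosen only as a sample value.

```javascript
// Build the body of a commit object: header lines, a blank line, the description.
function format_commit(tree, parents, author, committer, message) {
  const lines = ['tree ' + tree];
  for (const parent of parents) { lines.push('parent ' + parent); }
  lines.push('author ' + author);
  lines.push('committer ' + committer);
  return lines.join('\n') + '\n\n' + message;
}

const ada = 'Ada Lovelace <ada@analyti.cal> 1617120803 +0100';
// The first commit has no parent, hence the empty list.
const commit_text = format_commit(
  '4b825dc642cb6eb9a060e54bf8d69288fbee4904', [], ada, ada, 'Initial commit\n');
console.log(commit_text);
```

A merge commit would simply pass two or more hashes in the list of parents, producing one parent line for each.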

Resolving references

The next few subsections will introduce symbolic references and other references like branch names, the special name HEAD or tag names.

Most GIT commands accept as an argument a commit hash or a named reference to a hash. In order to implement those, we need to be able to resolve these references first.

Symbolic references are nothing more than regular files containing a hexadecimal hash or a string of the form ref: path/to/other/symbolic/reference. The HEAD reference is stored in .git/HEAD, and can point directly to a commit hash like 0123456789abcdef0123456789abcdef01234567, or can point to another symbolic reference, in which case the .git/HEAD file will contain e.g. ref: refs/heads/main.

Branches are simple files stored in .git/refs/heads/name-of-the-branch and usually contain a hash like 0123456789abcdef0123456789abcdef01234567.

Tags are identical to branches in terms of representation. It seems that the only difference between tags and branches is the behaviour of git checkout and similar commands. These commands, as explained in the section about git checkout below, normally write ref: refs/heads/name-of-branch in .git/HEAD when checking out a branch, but write the hash of the target commit when checking out a tag or any other non-branch reference.

We'll start with a small utility to remove the newline at the end of a string. GIT references are usually files containing a hexadecimal hash, and following *NIX tradition these files end with a newline byte. When reading these references, we need to get rid of the newline first.

git symbolic-ref

git symbolic-ref is a low-level command which reads (and in the official GIT implementation also writes and updates) symbolic references given a path relative to .git/. For example, git symbolic-ref HEAD will read the contents of the file .git/HEAD, and if that file starts with ref: , the rest of the line will be returned.

The official implementation of GIT follows references recursively and returns the path/to/file of the last file of the form ref: path/to/file. In the example below, git symbolic-ref HEAD would:

  • read the file proj/.git/HEAD which contains ref: refs/heads/main,
  • follow that indirection and read the file proj/.git/refs/heads/main which contains ref: refs/heads/other,
  • follow that indirection and read the file proj/.git/refs/heads/other which contains a hash,
  • return the last file path that contained a ref:, i.e. return the string refs/heads/other.
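The steps above can be sketched as follows, assuming the contents of .git/ are available as a simple map from relative paths to file contents. The names git_dir, trim_newline and symbolic_ref are assumptions of this example, and no protection against cyclic references is included.

```javascript
// Hypothetical contents of proj/.git/, keyed by path relative to .git/.
const git_dir = {
  'HEAD': 'ref: refs/heads/main\n',
  'refs/heads/main': 'ref: refs/heads/other\n',
  'refs/heads/other': '0123456789abcdef0123456789abcdef01234567\n',
};

// Remove the newline at the end of a string, if there is one.
function trim_newline(s) { return s.endsWith('\n') ? s.slice(0, -1) : s; }

// Follow "ref: " indirections and return the last path reached through one.
function symbolic_ref(path) {
  let contents = trim_newline(git_dir[path] || '');
  if (!contents.startsWith('ref: ')) {
    throw new Error(path + ' is not a symbolic reference');
  }
  let target = path;
  while (contents.startsWith('ref: ')) {
    target = contents.slice('ref: '.length);
    // A missing file (e.g. a branch without any commit) ends the chain.
    contents = trim_newline(git_dir[target] || '');
  }
  return target;
}

console.log(symbolic_ref('HEAD'));  // refs/heads/other
```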

git rev-parse

git rev-parse is another low-level command. It takes a symbolic reference or other reference, and returns the hash. The difference with git symbolic-ref is that symbolic-ref follows indirections to other references, and returns the last named reference in the chain of indirections, whereas rev-parse goes one step further and returns the hash pointed to by the last named reference.
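The extra step can be sketched as below, again assuming the contents of .git/ are held in a map. The names git_dir and rev_parse are assumptions of this example, which only accepts full hashes and paths relative to .git/; the official command also understands short names such as main.

```javascript
// Hypothetical contents of proj/.git/, keyed by path relative to .git/.
const git_dir = {
  'HEAD': 'ref: refs/heads/main\n',
  'refs/heads/main': '0123456789abcdef0123456789abcdef01234567\n',
};

function trim_newline(s) { return s.endsWith('\n') ? s.slice(0, -1) : s; }

// Resolve a full hash or a reference path down to a commit hash.
function rev_parse(ref) {
  if (/^[0-9a-f]{40}$/.test(ref)) { return ref; }  // already a hash
  if (typeof git_dir[ref] !== 'string') {
    throw new Error('unknown reference: ' + ref);
  }
  const contents = trim_newline(git_dir[ref]);
  // Either follow the "ref: " indirection, or check the hash that was read.
  return rev_parse(contents.startsWith('ref: ') ? contents.slice('ref: '.length) : contents);
}

console.log(rev_parse('HEAD'));  // 0123456789abcdef0123456789abcdef01234567
```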

git branch

A branch is a pointer to a commit, stored in a file in .git/refs/heads/name_of_the_branch. The branch can be overwritten with git branch -f. Also, as will be explained later, git commit can update the pointer of a branch.

When we call git branch main HEAD or equivalently git branch main 0123456789012345678901234567890123456789, a file containing that hash is created in .git/refs/heads/main. This file acts as a pointer to the commit, and this pointer can be read e.g. by git rev-parse.

After creating the branch, we show how the file .git/refs/heads/main can be overwritten using git branch -f.
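Creating and overwriting a branch amounts to writing one small file, as the sketch below suggests. The names git_dir and branch and the boolean force parameter (standing in for the -f flag) are assumptions of this example, and the hashes are sample values.

```javascript
// Hypothetical contents of .git/, keyed by path relative to .git/.
const git_dir = Object.create(null);

// Write the hash into refs/heads/<name>, refusing to overwrite unless forced.
function branch(name, hash, force) {
  const path = 'refs/heads/' + name;
  if (path in git_dir && !force) {
    throw new Error('a branch named ' + name + ' already exists');
  }
  git_dir[path] = hash + '\n';
}

branch('main', '0123456789abcdef0123456789abcdef01234567');
// Equivalent of git branch -f main <other hash>: the pointer is overwritten.
branch('main', '89abcdef0123456789abcdef0123456789abcdef', true);
console.log(git_dir['refs/heads/main']);
```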

git config

The official implementation of GIT stores the settings in various files (.git/config within a repository, ~/.gitconfig in the user's home folder, and several other places).

These files use a .ini syntax with key = value lines grouped under some [section] headings. The configuration above could be stored in ~/.gitconfig or .git/config using the following syntax:

[user]
name = Ada Lovelace
email = ada@analyti.cal
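Reading such a file can be sketched with a few lines of parsing. The function name parse_config and the flat section.key result are assumptions of this example, which ignores the subsections, quoting and escaping rules supported by the official implementation.

```javascript
// Parse "key = value" lines grouped under "[section]" headings
// into a flat object with keys of the form "section.key".
function parse_config(text) {
  const config = {};
  let section = null;
  for (const raw_line of text.split('\n')) {
    const line = raw_line.trim();
    if (line === '' || line.startsWith('#') || line.startsWith(';')) { continue; }
    const heading = /^\[([^\]]+)\]$/.exec(line);
    if (heading) { section = heading[1]; continue; }
    const equals = line.indexOf('=');
    if (equals === -1 || section === null) { continue; }  // not understood by this sketch
    const key = line.slice(0, equals).trim();
    config[section + '.' + key] = line.slice(equals + 1).trim();
  }
  return config;
}

const config = parse_config('[user]\nname = Ada Lovelace\nemail = ada@analyti.cal\n');
console.log(config['user.name'], config['user.email']);
```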

The $EDITOR variable is a traditional *NIX environment variable, and could e.g. be declared with EDITOR=nano in ~/.profile or ~/.bashrc.

git commit

The git commit command stores a commit (metadata and a pointer to a tree containing the files given on the command-line), and updates the HEAD or current branch to point to the new commit.

If the HEAD points to a commit hash, then git commit updates the HEAD to point to the new commit. Otherwise, when the HEAD points to a branch, then the target branch (represented by a file named .git/refs/heads/the_branch_name) is updated.
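This rule can be sketched as follows, assuming the contents of .git/ are held in a map and that HEAD is at most one indirection away from a branch. The function name update_head is an assumption of this example, and the hashes are sample values.

```javascript
// Point HEAD, or the branch that HEAD refers to, at a newly created commit.
function update_head(git_dir, new_commit_hash) {
  const head = git_dir['HEAD'].replace(/\n$/, '');
  if (head.startsWith('ref: ')) {
    // HEAD points to a branch: update the file of that branch, leave HEAD alone.
    git_dir[head.slice('ref: '.length)] = new_commit_hash + '\n';
  } else {
    // HEAD contains a commit hash (detached HEAD): update HEAD itself.
    git_dir['HEAD'] = new_commit_hash + '\n';
  }
}

const on_branch = {
  'HEAD': 'ref: refs/heads/main\n',
  'refs/heads/main': '0123456789abcdef0123456789abcdef01234567\n',
};
update_head(on_branch, '89abcdef0123456789abcdef0123456789abcdef');

const detached = { 'HEAD': '0123456789abcdef0123456789abcdef01234567\n' };
update_head(detached, '89abcdef0123456789abcdef0123456789abcdef');

console.log(on_branch, detached);
```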

The official implementation of git commit makes use of the index. When a file is scheduled for the next commit using git add path/to/file, it is added to the index. The index is a representation of a collection of copies of files, which can efficiently be compared to the working tree. It uses a different representation, but its role is very similar to that of a tree object along with the subtrees and blob objects of individual files. When git commit is called without specifying any files, it creates a commit containing the version of the files stored in the index.

In this simplified implementation, we only support creating commits by specifying all the files that must be present in the commit (including unchanged files). This contrasts with the official implementation which would create a tree containing the files from the current HEAD, as well as the added, modified or deleted files specified by git add or specified directly on the git commit command-line.

git tag

Tags behave like branches, but are stored in .git/refs/tags/the_tag_name and a tag is not normally modified. Once created, it's supposed to always point to the same version.

GIT does offer a git tag -f existing-tag new-hash command, but using it should be a rare occurrence.

Intuitively, tags differ from branches in the following way: when a branch is checked out and a subsequent commit is made, the branch is updated to point to the new commit's hash, whereas a tag keeps pointing to the same commit. As we've seen in the implementation of git commit, the difference is actually in the contents of the .git/HEAD file. If it is a symbolic reference (generally a pointer to a branch), then the target of that reference is updated every time a new commit is created. If the .git/HEAD file contains the hash of a commit, then the .git/HEAD file itself is updated every time a new commit is created.

Therefore, tags and branches differ only in their usage and in the path under which they are stored (.git/refs/heads/name-of-the-branch vs. .git/refs/tags/name-of-the-tag). The file .git/HEAD is overwritten by git commit and git checkout. It is the latter command which will behave differently for tags and branches; git checkout branch-name turns the HEAD into a symbolic reference, whereas git checkout tag-name resolves the tag name to a commit hash, and writes that hash directly into .git/HEAD.

git checkout

The git checkout commit-hash-or-reference command modifies the HEAD to point to the given commit, and modifies the working tree to match the contents of the tree object pointed to by that commit.

Checkout, branches and other references

The HEAD does not normally point to a tag. Although nothing actually prevents writing ref: refs/tags/v1.0 into .git/HEAD, the GIT commands will not automatically do this. For example, git checkout tag-or-branch-or-hash will put a symbolic ref: in .git/HEAD only if the argument is a branch.
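The way the argument decides what ends up in .git/HEAD could be sketched as follows. The function name checkout_head is an assumption of this example, which only updates HEAD and leaves the working tree aside; the hashes are sample values.

```javascript
// Update HEAD for a checkout of a branch name, a tag name or a full hash.
function checkout_head(git_dir, target) {
  const has = path => Object.prototype.hasOwnProperty.call(git_dir, path);
  if (has('refs/heads/' + target)) {
    // A branch: HEAD becomes a symbolic reference to it.
    git_dir['HEAD'] = 'ref: refs/heads/' + target + '\n';
  } else if (has('refs/tags/' + target)) {
    // A tag: the hash it contains is copied into HEAD.
    git_dir['HEAD'] = git_dir['refs/tags/' + target];
  } else if (/^[0-9a-f]{40}$/.test(target)) {
    // A hash: written directly into HEAD.
    git_dir['HEAD'] = target + '\n';
  } else {
    throw new Error('unknown reference: ' + target);
  }
}

const git_dir = {
  'HEAD': 'ref: refs/heads/main\n',
  'refs/heads/main': '0123456789abcdef0123456789abcdef01234567\n',
  'refs/tags/v1.0': '89abcdef0123456789abcdef0123456789abcdef\n',
};
checkout_head(git_dir, 'v1.0');
console.log(JSON.stringify(git_dir['HEAD']));
```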

Checking out files

In order to replace the contents of the working tree with those of the given commit, we recursively compare the subtrees, deleting from the working tree the files or directories that are not present in the tree object, and overwriting the others.

The official implementation of GIT will record the diff between the current working tree and the current commit, and will re-apply these changes on top of the freshly checked-out commit. The official git checkout command will print warnings and refuse to proceed when these changes cannot be re-applied without conflict, encouraging the user to create a commit containing this updated version or to stash the changes (effectively creating a temporary commit containing this version, pointed to by .git/refs/stash). Our simple implementation will always overwrite the changes.

Assert

The checkout_tree() function needs to read the commit, tree and blob objects from the .git/ folder. The following sections will introduce some parsers for these objects. The parsers will check that their input looks reasonably well-formed, using assert().

Reading compressed objects

The GIT objects which are stored in .git/objects are compressed with zlib, and need to be uncompressed before they can be parsed. The actual implementation of GIT also stores some objects in packs. Packs contain a large number of objects, and use a form of delta compression, which effectively stores objects as the diff with another similar object, in order to optimize the disk space usage.

Our simplified implementation only deals with zlib-compressed objects, and cannot read from pack files. The function below extracts the type and length, which form the header present in all objects, and returns those along with the contents of the object.
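Extracting the header could look like the sketch below. It assumes a Node.js environment and uses its built-in zlib module in place of the pako library; the function name read_object and the sample object are assumptions of this example.

```javascript
const zlib = require('zlib');

// Uncompress an object and split it into type, length and contents.
function read_object(compressed) {
  const data = zlib.inflateSync(compressed);
  const nul = data.indexOf(0);  // the header ends at the first null byte
  if (nul === -1) { throw new Error('missing header terminator'); }
  const [type, length_text] = data.subarray(0, nul).toString('latin1').split(' ');
  const contents = data.subarray(nul + 1);
  if (String(contents.length) !== length_text) {
    throw new Error('the length in the header does not match the contents');
  }
  return { type: type, length: contents.length, contents: contents };
}

const example = zlib.deflateSync(Buffer.from('blob 13\0test content\n', 'latin1'));
const parsed = read_object(example);
console.log(parsed.type, parsed.length);  // blob 13
```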

Parsing tree objects

We will start by parsing tree objects. As a reminder, a tree object has the following form:

After the object header, we have a mode, a filename, a null byte and a hash consisting of 20 bytes. The null byte cannot appear in the mode or filename, so we use this null + hash as a delimiter (the non-greedy match ensures the null byte terminator will not match with a 00 byte in the hash).

The parse_tree function above needs a small utility to convert hashes represented using raw bytes to a hexadecimal representation.
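Both pieces can be sketched as follows. Note that this example scans the bytes by position instead of using the regular expression described above, and that the names bytes_to_hex and parse_tree_body, as well as the sample entries, are assumptions made for the illustration.

```javascript
// Convert raw bytes to their hexadecimal representation, two characters per byte.
function bytes_to_hex(bytes) {
  return Array.from(bytes, b => b.toString(16).padStart(2, '0')).join('');
}

// Parse the contents of a tree object (without the object header).
function parse_tree_body(body) {
  const entries = [];
  let pos = 0;
  while (pos < body.length) {
    const space = body.indexOf(0x20, pos);   // end of the mode
    const nul = body.indexOf(0x00, space);   // end of the filename
    entries.push({
      mode: body.subarray(pos, space).toString('latin1'),
      name: body.subarray(space + 1, nul).toString('utf8'),
      hash: bytes_to_hex(body.subarray(nul + 1, nul + 21)),
    });
    pos = nul + 21;  // skip the null byte and the 20 bytes of the hash
  }
  return entries;
}

// The second hash starts with 00 bytes, which must not be mistaken for a terminator.
const body = Buffer.concat([
  Buffer.from('100644 README\0'), Buffer.from('d670460b4b4aece5915caf5c68d12f560a9fe3e4', 'hex'),
  Buffer.from('40000 src\0'), Buffer.from('0000000000000000000000000000000000000001', 'hex'),
]);
console.log(parse_tree_body(body));
```

Skipping a fixed 20 bytes after each null byte plays the same role as the non-greedy match: a 00 byte inside a hash is never interpreted as the end of a filename.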

Parsing commit objects

The following function is fairly long, but only parses lines of the form header-name header-value (with some restrictions depending on the header), followed by a blank line, and a free-form description.

Parsing author and committer metadata

The author and committer metadata has the form Name <email@domain.tld> timestamp +timezone, for example Ada Lovelace <ada@analyti.cal> 1617120803 +0100.
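A parser for this format could be sketched with a single regular expression. The function name parse_author and the shape of the returned object are assumptions of this example.

```javascript
// Split "Name <email> timestamp +timezone" into its four components.
function parse_author(line) {
  const match = /^(.*) <([^<>]*)> ([0-9]+) ([+-][0-9]{4})$/.exec(line);
  if (!match) { throw new Error('malformed author or committer line: ' + line); }
  return {
    name: match[1],
    email: match[2],
    timestamp: Number(match[3]),  // seconds since the *NIX epoch
    timezone: match[4],           // offset from UTC, e.g. +0100
  };
}

console.log(parse_author('Ada Lovelace <ada@analyti.cal> 1617120803 +0100'));
```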

Example checkout

Now that we can parse blob, tree and commit objects, it is possible to check out a given commit. The following operation will revert the working tree to the state that was copied in the initial commit.

git init

The git init command creates the .git directory and points .git/HEAD to the default branch (a file which does not exist yet, as this branch does not contain any commit at this point).
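A sketch of such a function on the simple key-value filesystem is given below. The function name init, the default branch name main and the creation of the objects and refs folders are assumptions of this example, not necessarily what the tutorial's code does.

```javascript
// Create an empty repository inside the given directory of a
// hypothetical filesystem where null = directory and string = file.
function init(filesystem, dirname) {
  const git = dirname + '/.git';
  filesystem[dirname] = null;
  filesystem[git] = null;
  filesystem[git + '/objects'] = null;
  filesystem[git + '/refs'] = null;
  filesystem[git + '/refs/heads'] = null;
  // HEAD points to a branch whose file does not exist yet,
  // since that branch does not contain any commit at this point.
  filesystem[git + '/HEAD'] = 'ref: refs/heads/main\n';
  return git;
}

const filesystem = {};
init(filesystem, 'proj');
console.log(Object.keys(filesystem));
```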

The index

When adding files with git add, GIT does not immediately create a commit object. Instead, it adds the files to the index, which uses a binary format with lots of metadata. The mock filesystem used here lacks most of these pieces of information, so the value 0 will be used for most fields. See this blog post for a more in-depth study of the index.

Playground

The implementation is now sufficiently complete to create a small repository.

By clicking on "Copy commands to recreate in *nix terminal.", it is possible to copy a series of mkdir … and printf … > … commands that, when executed, will recreate the virtual filesystem on a real system. The resulting folder is bit-compatible with the official git log, git status, git checkout etc. commands.

Conclusion

This article shows that a large part of the core of GIT can be re-implemented in a few source lines of code (empty lines and single closing braces excluded; a few more in total).

Click here to copy all the code.

A few core commands like git diff and git apply are not described in this tutorial. They are little more than improved versions of the classical *nix commands diff and patch.

Most other commands provided by GIT are merely convenience wrappers around these commands. For example, git cherry-pick is simply a combination of git diff between the tree of a commit and the tree of its parent, followed by git apply to apply the patch and git commit to create a new commit whose diff is equivalent to the diff of the original commit. As another example, the command git rebase performs a succession of cherry-pick operations.

By keeping in mind the internal model of GIT, it becomes easier to understand the usual commands and their quirks. By understanding the design philosophy behind the implementation, the day-to-day usage can become, hopefully, less surprising.

Suggested exercises

The reader willing to improve their grasp of GIT's mental model, and reduce their reliance on a few learned recipes, might be interested in the following warm-up exercises:

Inspection using git cat-file

Inspect an existing repository, starting with cat .git/HEAD and using git cat-file -p some-hash to pretty-print an object given its hash.

This will help sink in the points explained in this tutorial, and give a better understanding of the internals of GIT. This knowledge is helpful for day-to-day tasks, as the GIT commands usually perform simple changes to this internal representation. Understanding the representation better can demystify the semantics of the daily GIT commands. Furthermore, equipped with a better understanding of GIT's implementation, the dreamy reader will be tempted to compare this lack of intrinsic complexity with the apparent complexity, and be entitled to expect a better, less arcane user interface for a tool with such a simple implementation.

Inspection of the files in .git/

Inspect a small existing repository, starting with cat .git/HEAD and using the zlib decompression tool from the zlib compression section. Larger repositories will make use of GIT packs, which are compressed archives containing a number of objects. GIT packs only matter as an optimization of the disk space used by large repositories, but other tools would be necessary to inspect those.

This should help understand the internal representation of GIT commits and branches, and should help build an instinctive idea of how the data store is modified by the various commands. This in turn could come in handy in case of apparent data loss (a lost stash or a checkout leaving an unreferenced commit on a detached HEAD), as this would help understand the work done by the various disaster-recovery one-liners that a quick panicked online search provides.

Creating a repository from scratch

Run git init new-directory in a terminal, and create an initial single-file commit from scratch, using only git hash-object, printf and overwriting .git/HEAD and/or .git/refs/heads/name-of-a-branch. This will involve retracing the steps in this tutorial to create a blob object for the file, a tree object to be the directory containing just that file, and a commit object.

This exercise should help sink in the feeling that the internal representation of GIT commits is not very complex, and that many commands with convoluted options have very simple semantics. For example, git reset --soft other-commit is little more than writing that other commit's hash in .git/refs/heads/name-of-the-current-branch or .git/HEAD. Furthermore, equipped with an even better understanding of GIT's implementation, the dreamy reader will be tempted to compare this lack of intrinsic complexity with the sheer complexity of the systems they are working with on a day-to-day basis, and be entitled to expect better features in a versioning tool. After all, writing those few lines of code to reimplement the core of a versioning tool shouldn't take more than a couple of afternoons; surely our community can do better?

Using only basic GIT commands

For a couple of weeks, only use the GIT commands commit, diff, checkout, merge, cherry-pick, log, clone, fetch and push remote hash-of-commit:refs/heads/name-of-the-branch. In particular, don't use rebase which is just a wrapper around a sequence of cherry-pick commands, don't use pull which is just a wrapper around fetch and merge, don't use git push as-is and instead explicitly give the name (origin) or URL of the remote, the hash of the commit to push, and the path that should be updated on the remote (git push while the main branch is checked out locally is equivalent to git push origin HEAD:refs/heads/main, where HEAD can be replaced by the actual hash of the commit).

This should help sink in the feeling that the internals of GIT are very simple (most of these commands are implemented in this tutorial, and the other ones are merely wrappers around enhanced versions of the *NIX commands diff, patch and scp), and that the rest of the GIT toolkit consists mostly of convenience wrappers to help seasoned users perform common tasks more efficiently.

Understanding commits as copies of the root directory

Try not even using git cherry-pick or git diff a few times; instead, make two copies of the git directory, check out a different commit in each copy, and use the traditional *NIX commands diff and patch.

This should help sink in the feeling that commits are not diffs, but are actual (deduplicated) copies of the entire project directory. GIT commits are quite similar to the age-old manual versioning technique of copying the entire directory under a new name at each version, except that the metadata keeps track of which version was the previous one (or which versions were merged together to obtain the new one), and the deduplication avoids excessive space usage, as would be the case with cp --reflink on a filesystem supporting Copy-On-Write (COW).

Branches as pointers: living without branches

For a couple of weeks, don't use any local branch, and stay in detached HEAD state all the time. When checking out a colleague's work, use git fetch && git checkout origin/remote-branch, and use the reflog and a text file outside of the repository to keep track of the latest commit in a current "branch" instead of relying on GIT.

This should help sink in the feeling that branches are not containers in which commits pile up, but are merely pointers to the latest commit that are automatically updated.

Changelog and errata

v1
Initial version.
v1.0.1
Internal changes to provide IPFS links.
v1.0.2
Added a sitemap for download tools.