When you look inside of a code repository you will likely first see just the
contents of a codebase at a particular revision. However, the repository also
contains a compressed version of the contents of every file at every version of
the codebase. For a Mercurial repository, these compressed data are contained in
a hidden .hg/
directory. In this blog post I’m going to try to figure out what
data are contained in this directory and how it’s structured.
To follow along you will need to have Mercurial installed. I’m going to use the
latest version, Mercurial 5.0, which you can install in a number of ways,
perhaps easiest using the pip
associated with a Python 2.7
installation. Mercurial 5.0 has beta support for Python 3.5 or newer, so you can
use that as well if you do not have a Python 2.7 installation set up. To
install, do the following:
$ pip install mercurial --user
If you use pip install --user
like I have here you will also need to ensure
that $HOME/.local/bin
is in your PATH environment variable.
Let’s take a look at the contents of the .hg
directory for a real-world
repository. For this purpose let’s use the repository for Mercurial itself -
Mercurial development is tracked using Mercurial, naturally:
$ hg clone https://mercurial-scm.org/repo/hg
real URL is https://www.mercurial-scm.org/repo/hg
destination directory: hg
requesting all changes
adding changesets
adding manifests
adding file changes
added 42325 changesets with 80197 changes to 3352 files (+1 heads)
179808 new obsolescence markers
new changesets 9117c6561b0b:2338bdea4474
updating to bookmark @
1964 files updated, 0 files merged, 0 files removed, 0 files unresolved
Depending on how fast your internet connection is, this operation might take a while to finish. Mercurial is telling us a lot of information here in its debug output that might be helpful for understanding Mercurial’s internals. First, it tries to figure out if we’ve given it a URL or some other URI it can resolve. We gave it an HTTPS URL so it just uses that to communicate with the Mercurial instance running on Mercurial-scm.org. Second, it prints “Requesting all changes” when it begins to pull the changes from the remote repository. This happens in three steps, first obtaining the changesets, then the manifests, and finally the file changes. Each of these correspond to different kinds of revlog files on-disk that we will be looking at shortly. Briefly, the revlog is the file format that Mercurial uses to store versioned data, let it be metadata, the manifest of files in a repository at any given time, and the contents of each file at each revision in history.
After these steps, Mercurial lets us know how much data it has processed. For
this repository there are more than 40,000 commits to more than 3000 files over
the history of the repository. Next it tells us that this repository contains
almost 200,000 obsolecence markers, this is a data format used by the evolve
extension, which is one of
Mercurial’s coolest features but is also beyond the scope of this post, I will
try to return to it in the future. The next couple of messages let us know which
changes were added (since we’re cloning, we’ve added all of the changes in the
repository, this message is more useful if we are only pulling a subset of the
changes), and lets us know that this repository has something called a bookmark
that is named @
defined. We will talk more about bookmarks later, but if you
are familiar with git, @
is a bit like the 'master'
branch in that new
clones will have a checkout of @
in the working directory of the
repository. Finally, it creates the working directory, which contains almost
2000 files. Note that this count is substantially less than the 3000 files that
have ever been defined in the repository, some files that were present in the
past have since been removed.
OK, now that we’ve cloned the repository, let’s take a look at what’s inside the
.hg
directory:
$ cd hg/.hg
$ ls -lh
total 136K
-rw-rw-r-- 1 goldbaum goldbaum 57 May 21 16:26 00changelog.i
-rw-rw-r-- 1 goldbaum goldbaum 43 May 21 16:30 bookmarks
-rw-rw-r-- 1 goldbaum goldbaum 1 May 21 16:30 bookmarks.current
-rw-rw-r-- 1 goldbaum goldbaum 8 May 21 16:30 branch
drwxrwxr-x 2 goldbaum goldbaum 4.0K May 21 16:30 cache
-rw-rw-r-- 1 goldbaum goldbaum 88K May 21 16:38 dirstate
-rw-rw-r-- 1 goldbaum goldbaum 501 May 21 16:30 hgrc
-rw-rw-r-- 1 goldbaum goldbaum 59 May 21 16:26 requires
drwxrwxr-x 3 goldbaum goldbaum 4.0K May 21 16:30 store
-rw-rw-r-- 1 goldbaum goldbaum 0 May 21 16:27 undo.bookmarks
-rw-rw-r-- 1 goldbaum goldbaum 7 May 21 16:27 undo.branch
-rw-rw-r-- 1 goldbaum goldbaum 41 May 21 16:27 undo.desc
-rw-rw-r-- 1 goldbaum goldbaum 40 May 21 16:27 undo.dirstate
drwxrwxr-x 2 goldbaum goldbaum 4.0K May 21 16:35 wcache
Hmm, this is a lot of stuff. Let’s make this a little simpler by starting with a new repository with a single file and only a couple of commits:
$ cd ../../
$ mkdir test-repository
$ cd test-repository
$ hg init
$ echo "some data" > a_file
$ hg add a_file
$ hg commit -m "adding a_file"
$ echo "some more data >> a_file
$ hg commit -m "adding some more text to a_file"
This creates a repository containing a single file with two revisions:
$ hg log --graph
@ changeset: 1:0e80b49a8edc
| tag: tip
| user: Nathan Goldbaum <nathan12343@gmail.com>
| date: Wed May 22 09:29:18 2019 -0400
| summary: adding more text to a_file
|
o changeset: 0:6f3346b94a1f
user: Nathan Goldbaum <nathan12343@gmail.com>
date: Wed May 22 09:28:44 2019 -0400
summary: adding a_file
Let’s take a look at the contents of the .hg
directory in this new more
trivial repository:
☿ ls -lh .hg
total 44K
-rw-rw-r-- 1 goldbaum goldbaum 57 May 22 09:27 00changelog.i
drwxrwxr-x 2 goldbaum goldbaum 4.0K May 22 09:29 cache
-rw-rw-r-- 1 goldbaum goldbaum 63 May 22 09:28 dirstate
-rw-rw-r-- 1 goldbaum goldbaum 26 May 22 09:29 last-message.txt
-rw-rw-r-- 1 goldbaum goldbaum 59 May 22 09:27 requires
drwxrwxr-x 3 goldbaum goldbaum 4.0K May 22 09:29 store
-rw-rw-r-- 2 goldbaum goldbaum 63 May 22 09:28 undo.backup.dirstate
-rw-rw-r-- 1 goldbaum goldbaum 0 May 22 09:29 undo.bookmarks
-rw-rw-r-- 1 goldbaum goldbaum 7 May 22 09:29 undo.branch
-rw-rw-r-- 1 goldbaum goldbaum 9 May 22 09:29 undo.desc
-rw-rw-r-- 2 goldbaum goldbaum 63 May 22 09:28 undo.dirstate
drwxrwxr-x 2 goldbaum goldbaum 4.0K May 22 09:29 wcache
Still a decent number of files but definitely less complex. There is a very helpful page on the Mercurial wiki that describes Mercurial’s custom file formats, so we can look there to decide which of these files is important.
The first, 00changelog.i
is there to inform older versions of Mercurial that
this repository was created with a newer version and is incompatible with the
old version. Mercurial development proceeds with strict backward
compatibility
guarantees so repositories created by older versions of Mercurial should
continue to work with newer versions forever, however there’s guarantee that an
old Mercurial client should be able to read a repository created by a new
one. Since Mercurial is a distributed system it is important for it to be able
to talk to various versions of itself over the network or when operating on
repositories on disk.
The cache
and wcache
directories contain caches of various kinds used by
Mercurial and some extensions:
$ ls -lh .hg/cache
total 1.4M
-rw-rw-r-- 1 goldbaum goldbaum 148 May 21 16:30 branch2-base
-rw-rw-r-- 1 goldbaum goldbaum 42K May 21 16:30 evoext-obscache-00
-rw-rw-r-- 1 goldbaum goldbaum 992K May 21 16:30 hgtagsfnodes1
-rw-rw-r-- 1 goldbaum goldbaum 14 May 21 16:30 rbc-names-v1
-rw-rw-r-- 1 goldbaum goldbaum 331K May 21 16:30 rbc-revs-v1
These aren’t documented on the wiki (last updated in 2013) and appear to contain opaque binary data. I’m going to ignore these for now.
The dirstate
file contains information about the state of the working
directory (e.g. everything in the repository except for the .hg
directory). Quote the Mercurial wiki:
This file contains information on the current state of the working directory in a binary format. It begins with two 20-byte hashes, for first and second parent, followed by an entry for each file. Each file entry is of the following form:
<1-byte state><4-byte mode><4-byte size><4-byte mtime><4-byte name length>
If the name contains a null character, it is split into two strings, with the second being the copy source for move and copy operations.
In addition there is a wiki page devoted just to this file that contains more information.
Let’s take a look at the contents of the dirstate
file for our repository:
$ xxd .hg/dirstate
00000000: 0e80 b49a 8edc 08c2 d9ff cdcd 7fd7 1b55 ...............U
00000010: de9a 7f7f 0000 0000 0000 0000 0000 0000 ................
00000020: 0000 0000 0000 0000 6e00 0081 b400 0000 ........n.......
00000030: 195c e54e 9600 0000 0661 5f66 696c 65 .\.N.....a_file
If you’re unfamiliar with hexadecimal output, I’m using the xxd
tool to
quickly preview the binary content of the dirstate
file. The first column
tells you how many bytes into the file we are. Each set of 4 hex characters
corresponds to two bytes in the file. If we look above to where we examined the
output of hg log
for this repository, you can see that the first 20 bytes of
this file is the SHA1 nodeid
associated with the most recent change (hg log
only shows the first 12 bytes
of the nodeid for brevity). The nodeid
for a changeset is also sometimes
called a changeset hash. It is a cryptographically unique identifier for a
commit generated by hashing the commit contents along with some metadata for the
commit. The next 20 bytes is filled with zeros. This is a special nodeid called
the nullid that represents a nonexistent commit. These two commits are the
parents of the working directory, these are usually referred to as p1
and
p2
. In this case p1 is the most recent commit, and since the last commit was
not a merge, p2 is set to the nullid. In addition to being p2
for non-merge
commits, an empty repository with no commits will have both p1
and p2
set to
the nullid. An interesting consequence of this choice is that completely
unrelated repositories can be merged with no issues, since ultimately all
repositories histories descend from the “commit” associated with the nullid.
Following the nodeid entries for the parents of the commit is the state entry
for the only file in this repository, a_file
. This consists of a set of binary
encoded metadata for the file, first a one-byte “state”, which for this file is
“n”, corresponding to a “normal” state. Other options include “a” for added, “r”
for removed, and “m” for merged. Following this is 4 bytes containing the “mode”
of the file. This corresponds to the bytes 000081b4
. In this case the first
two bytes are null and the UNIX file permissions are encoded in the last two
bytes. In this case it corresponds to the octal permission code 664
$ stat -c "%a %n" a_file
664 a_file
How this is calculated based on the contents of the dirstate
file is a little
confusing to me, I’d like to come back to this later. Internally Mercurial is
doing something like this python code:
>>> import os
>>> mode = '%3o' % (0x000081b4 & 0o777 & ~os.umask(0))
>>> mode
'664'
The first operation makes some sense, masking with 0o777
ignores the first two
and half bytes. The 8 may indicate that the next 12 bits correspond to three
octal characters, and then the next three characters are the file mask. I’m not
sure why we additionally need to mask with ~os.umask(0)
. Digging into the
history of Mercurial, it looks like this extra masking step was
added to fix issues on
windows and wasn’t in the original implementation, so let’s just ignore it for
now.
The next 4 bytes contain the size of the file in bytes (in this case the entry
is 0x19
, or 25 bytes). As an aside, this makes me wonder what happens if you
add a file bigger than 0xFFFFFFFF
bytes! After this come 4 more bytes for the
modification time, in this case stored as the UNIX timestamp 0x5ce54e96
, about
9:30 AM EST on May 22 2019 when this blog post was being written. This will also
be not-great in 2038 when the UNIX epoch overflows a 32 bit integer. Next we
have 4 bytes for the length of the name of the file, in this case ‘0x6’, or
plain old 6 to you and me, the number of characters in the filename. Finally the
filename itself, which is encoded in UTF-8, but in this case we can get away
with just reading off the ASCII in the hex dump.
Ok, that covers the dirstate
file. There’s still a few more files left, so
let’s quickly go over those.
last-message.txt
This file contains the content of the last commit message, presumably for caching purposes or so people can set up prompts that don’t need to actually start up the Mercurial executable.
requires
A record of repository requirements. This tells Mercurial clients what features must be supported in order to work with the repository. Old clients that do not have support for newer features will refuse to load a repository that lists requirements from newer Mercurial versions.
undo.*
filesFiles used by the deprecated “hg rollback” command to undo the last transaction. I will ignore these since they are only useful for a deprecated feature in Mercurial.
Finally there is one last directory, the store
:
$ ls .hg/store
00changelog.i data phaseroots undo.backupfiles
00manifest.i fncache undo undo.phaseroots
The primary purpose of this directory is to store the bulk of the repository data, in the form of revlog files. This is a special data structure that was invented by Mercurial’s original developer to store versioned data in a compressed manner. We will come back to revlogs and the contents of this directory in the next blog post.