Using UUID in an Atom feed

This is again a blend of my previous post, which announced a feed, and the post before it, which discussed the uuid command. The fact that these two came one after the other is no coincidence, even though they seem to have very little in common. So what is the connection between the Atom feed and Universally Unique IDentifiers (UUID)?

Atom ID specification

In RFC 4287, the IETF standard describing the Atom format, the atom:id element is defined as:

The "atom:id" element conveys a permanent, universally unique identifier for an entry or feed. Its content MUST be an IRI, as defined by RFC3987.

What does IRI stand for? It is an Internationalized Resource Identifier. There is a lot of reading to do to find out the details, but for all practical purposes it boils down to the fact that IRI is a superset of the Uniform Resource Identifier (URI), meaning that every URI is also an IRI, but only some IRIs are also URIs. The difference lies in the character set: an IRI allows an extended one that goes beyond the ASCII characters of a URI.

Now we know that an ID used in an Atom feed can be anything that we would call a URI. For use in an Atom feed, URIs can be further broken down into two groups: the Uniform Resource Locator (URL) and the lesser-known Uniform Resource Name (URN). URI is a superset of both, meaning that every URL is also a URI and, similarly, every URN is also a URI. URNs and URLs are disjoint, however, meaning that no URL is a URN at the same time and vice versa. Put in other words, an Atom ID can be either a URL or a URN.

URL vs URN: which one to choose?

While choosing a URL for an Atom ID complies with the standard, it has some disadvantages. Specifically, URLs sometimes change, and it happens far more often than you may think, although there appears to be no conclusive number yet.

As cited above, the Atom ID element has to be permanent, and for that matter a URN is far better suited. Translated into practice, using a URN would mean that the RSS or Atom feed client keeps the post marked as read and in favorites even when the blog moves to a different domain, changing its URL in the process. In reality, the number of clients ranges between a plethora and superfluity and every one of them does something a little bit differently, but this is what the theory says.

Does a URN need to be registered?

For the majority of URNs, for instance when referring to an International Standard Book Number (ISBN) in the form of urn:isbn:, the URN has to be registered by a central authority. But there is at least one URN group that does not need registration, and that is the UUID group, under the handy urn:uuid: designation. This is where URN and UUID overlap and form a useful partnership for identifying resources long after their original location is gone.
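To make the urn:uuid: form concrete, here is a minimal sketch using Python's standard uuid module. The choice of Python and of a version 4 UUID is my assumption for illustration only; any UUID version can be written as a URN in the same way.

```python
import uuid

# A random (version 4) UUID, used here only as an example value.
entry_uuid = uuid.uuid4()

# The .urn attribute renders the UUID in the urn:uuid: form.
atom_id = entry_uuid.urn
print(atom_id)

# The URN string parses back into the very same UUID.
assert uuid.UUID(atom_id) == entry_uuid
```

No registry is consulted at any point: the identifier is created locally and is still expected to be unique.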

For a moment, imagine a scenario where some future archaeologist finds a USB stick with a file containing the whole Atom feed of your blog. The technology by then could, for instance, use some biological mechanism to store data instead of silicon, but they might still want to examine the data. If the posts were identified by URLs, it could be hard to link them to the data in some global database, as domain names expire every year. Matching would then have to be done by comparing the contents to everything in the database. A defined identifier such as a UUID, on the other hand, could be easily matched against the database records. Even UUID collisions, improbable today but very possible in the future, could be sorted out easily by comparing the contents of the entries against all matched UUIDs. In such a scenario, using URNs would at the very least reduce the search space drastically, from comparing against the whole database down to comparing against the UUID collisions only.

Which UUID version to use?

There are two places where an Atom ID is specified in the feed. One identifies the feed itself, while the other identifies the individual entries. Both IDs benefit from using a UUID URN as their value. We have already learned in the cheatsheet that there are 5 specified UUID versions, of which 3 are available / recommended for new designs. So which one to choose, and where? This is actually where it gets pretty hairy.
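The two places can be shown with a minimal sketch that builds a skeleton feed using Python's xml.etree.ElementTree. The helper name build_feed is made up for this illustration, and other elements a real feed carries, such as title and updated, are left out on purpose to keep the focus on the IDs.

```python
import uuid
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM_NS)  # serialize Atom as the default namespace


def build_feed(feed_id: str, entry_ids: list[str]) -> str:
    """Build a skeleton Atom feed that contains nothing but the IDs."""
    feed = ET.Element(f"{{{ATOM_NS}}}feed")
    # First place: the ID of the feed itself.
    ET.SubElement(feed, f"{{{ATOM_NS}}}id").text = feed_id
    for entry_id in entry_ids:
        entry = ET.SubElement(feed, f"{{{ATOM_NS}}}entry")
        # Second place: one ID per entry.
        ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = entry_id
    return ET.tostring(feed, encoding="unicode")


# Random UUIDs serve as placeholder values here.
xml_text = build_feed(uuid.uuid4().urn, [uuid.uuid4().urn, uuid.uuid4().urn])
print(xml_text)
```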

UUID version 4 for the entries?

The specification mentions UUIDs in only one example, where a UUID version 1 is used for the feed itself and a UUID version 4 for the entries. All the other sources I could find are very vague on this topic, but generally, using UUID version 4, which has the property of being completely random, is very common for the entries. This approach implies that such a UUID is generated when the entry is first stored in the database, is stored along with the entry itself and is not changed afterwards.
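The "generate once, store, never change" idea can be sketched as follows. The dictionary standing in for the database, the store_entry helper and the slug are all hypothetical; a real blog engine would use its own storage.

```python
import uuid

# Stand-in for the blog's database: slug -> stored record (hypothetical).
database: dict[str, dict[str, str]] = {}


def store_entry(slug: str, title: str) -> dict[str, str]:
    """Save an entry, assigning its Atom ID only on the very first save."""
    record = database.setdefault(slug, {})
    record["title"] = title
    if "atom_id" not in record:
        # Random version 4 UUID, generated exactly once per entry.
        record["atom_id"] = uuid.uuid4().urn
    return record


first = store_entry("hello-world", "Hello world")["atom_id"]
# Editing the entry later must not touch the stored ID.
second = store_entry("hello-world", "Hello, world!")["atom_id"]
assert first == second
```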

An identical approach is used elsewhere, for instance for persistent block device naming, which ensures that when you turn on the computer, the operating system starts from the same disk every single time. Generating UUIDs for the block devices in your system once and referring to them by this ID later prevents a so-called race condition, which in this example would happen when some device got loaded sooner than usual and obtained a wrong identifier, resulting in occasional failures during system startup.

UUID version 1 for the feed?

I could not understand why UUID version 1 was used for the feed in the specification. UUID version 1 has specific use cases, and it seems to me that its usage is discouraged as a safety concern unless precisely its property of predictability and, to a lesser extent, sequentiality is required. Then again, wrongly identifying a feed has no security implications I could think of, so a UUID version 1 can work perfectly well here. In the end, even the blog URL can be put there. I have decided on a different approach, however.
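To see where the predictability comes from, here is a small sketch that reads back what a version 1 UUID carries: a timestamp and a node identifier. The node value is made up for the example (by default Python tries to use a hardware address), and the helper v1_timestamp is my own naming.

```python
import uuid
from datetime import datetime, timedelta, timezone

# Start of the UUID version 1 clock: the Gregorian calendar reform.
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)


def v1_timestamp(u: uuid.UUID) -> datetime:
    """Decode the creation time embedded in a version 1 UUID."""
    # u.time counts 100-nanosecond intervals since GREGORIAN_EPOCH.
    return GREGORIAN_EPOCH + timedelta(microseconds=u.time // 10)


# Made-up 48-bit node value used instead of a real hardware address.
feed_uuid = uuid.uuid1(node=0x0123456789AB)

print(feed_uuid.version)          # the version field of the UUID
print(v1_timestamp(feed_uuid))    # when the UUID was generated
print(f"{feed_uuid.node:012x}")   # which node generated it
```

Anyone holding the identifier can recover both values, which matters little for a public feed but explains the general caution around version 1.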

Approaches for feed UUID generation

There are multiple ways I could think of to generate the UUID for the feed that I was considering:

1. Use UUID version 1 and store it
2. Use UUID version 4 and store it
3. Use UUID version 5 with the DNS or URL namespace prefix and the blog's domain as a name
4. Use UUID version 5 with the NIL UUID as namespace prefix and the blog's URL as a name

Since UUID version 5 has the property of reproducibility, approaches 3 and 4 would serve if I were to generate the UUID multiple times: with the same input, UUID version 5 always provides the same output. This would, however, not work if the domain (the input data) changed, completely defeating the purpose of permanent identification. This means that approaches 1 and 2, where the UUID is generated once, stored and used afterwards, are preferable. Having already ruled out approach 1, I was left with generating a UUID version 4, storing it in the code and using it as the feed ID.
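The reproducibility, and the reason it breaks when the domain changes, can be sketched like this. The domains and URL are placeholders, and the two helper functions are hypothetical names for the version 5 approaches from the list above.

```python
import uuid

# The NIL UUID: all 128 bits set to zero.
NIL_UUID = uuid.UUID(int=0)


def feed_id_from_domain(domain: str) -> uuid.UUID:
    """Version 5 UUID with the predefined DNS namespace and the domain as name."""
    return uuid.uuid5(uuid.NAMESPACE_DNS, domain)


def feed_id_from_url(url: str) -> uuid.UUID:
    """Version 5 UUID with the NIL UUID as namespace and the blog URL as name."""
    return uuid.uuid5(NIL_UUID, url)


# Same input, same output: nothing needs to be stored.
assert feed_id_from_domain("blog.example.org") == feed_id_from_domain("blog.example.org")
assert feed_id_from_url("https://blog.example.org/") == feed_id_from_url("https://blog.example.org/")

# A domain change yields a different feed ID, so the identity is lost.
assert feed_id_from_domain("blog.example.org") != feed_id_from_domain("blog.example.net")
```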

UUID version 5 namespaces

The reproducibility property of the version 5 UUID is, however, very useful for statically generated blogs, especially ones that are completely git powered, meaning without a database. I, for instance, retrieve publication and modification dates from the git commit history, even following file renames. I have also found that Hugo, another static site generator popular for blogging, can be configured to do the same, so this approach is probably not too far-fetched.

Since I have no way of storing a generated version 4 UUID in a database, as there is none, I could only store it in the post's markdown file itself, most conveniently in the Front Matter section. I am a lazy person, though: as pointed out above, I do not store dates in the Front Matter manually either. Automating everything is a challenge, but it pays off more and more the more often the automated task is repeated (search also for geeks and repetitive tasks).

UUID version 5 for entries

With the above in mind, I decided to generate a reproducible UUID version 5 for the entry ID. As we have already learned, a version 5 UUID requires two pieces of input data - the UUID prefix (the namespace) and a name. For the prefix I have chosen the version 4 UUID of the feed itself, and for the name I have chosen the hash of the commit the post was introduced with.

Atom entry UUID version 5 = Atom feed constant UUID as a prefix + git commit hash as a name

This way, the entry ID is guaranteed to come out the same every time, with two exceptions. The first is changing the feed UUID, which I have no reason to do and which is also stored in the version history to prevent loss. The second is rewriting the git commit history, which should generally be avoided at all costs anyway.
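The formula above can be sketched in a few lines. Both the feed UUID and the commit hash below are made-up placeholder values, not the ones this blog uses, and entry_id is a hypothetical helper name.

```python
import uuid

# Placeholder for the constant version 4 feed UUID stored in the code.
FEED_UUID = uuid.UUID("3f2b8c1e-7a4d-4e9b-8c6f-1d2e3f4a5b6c")


def entry_id(commit_hash: str) -> str:
    """Atom entry ID: version 5 UUID of the commit hash under the feed UUID."""
    return uuid.uuid5(FEED_UUID, commit_hash).urn


# Placeholder for the hash of the commit that introduced a post.
commit = "0123456789abcdef0123456789abcdef01234567"

# Regenerating the feed gives the same entry ID again.
assert entry_id(commit) == entry_id(commit)
print(entry_id(commit))
```

Changing either input, the feed UUID or the commit hash, produces a different entry ID, which is exactly the pair of exceptions described above.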

That's it. As a side note, I was considering using the post's slug as the name (in my setup that is the post's filename without the .md extension, another thing that I do not store in the Front Matter), but slugs do change, if very rarely, for instance for some SEO modifications. In my setup, as already pointed out, renames are followed, so the dates would not get disrupted. The hash of the commit that introduced the file, even before the renaming, does not change either, as long as the rename gets recognized by git itself.
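One possible way to look up the introducing commit is sketched below. It is an assumption about how this could be done, not necessarily how my generator does it: it asks git log to follow renames and takes the oldest commit in the resulting history. The function names and the example path are made up.

```python
import subprocess


def oldest_commit(log_output: str) -> str:
    """Pick the oldest hash from 'git log' output, which lists newest first."""
    hashes = log_output.split()
    if not hashes:
        raise ValueError("file has no commit history")
    return hashes[-1]


def introducing_commit(path: str, repo_dir: str = ".") -> str:
    """Hash of the commit that introduced a file, following renames."""
    result = subprocess.run(
        ["git", "log", "--follow", "--format=%H", "--", path],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return oldest_commit(result.stdout)


# Usage inside a git repository, with a hypothetical path:
#   introducing_commit("posts/using-uuid-in-an-atom-feed.md")
```

Whether a rename is followed depends on git's own rename detection, so a rename combined with a heavy rewrite of the content may still end the history early.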

This is the 58th post of #100daystooffload.