Efficient QR codes (26 Aug 2021)
QR codes seem to have won the battle for 2D barcodes, but they're not just a bag of bits inside. Their payload is a series of segments, each of which can have a different encoding. Segments are bitstrings, concatenated without any byte-alignment, and terminated with an empty segment of type zero. If you want to squeeze the maximum amount of data into a QR code without it turning into a gray square, understanding segmentation helps.
The most basic segment type is byte mode, which is a series of 8-bit bytes. If you control the QR decoder then this is perfectly efficient for encoding binary data. But you probably need to work with a variety of decoders. In that case, beware: the first edition of the QR standard, ISO/IEC 18004, said that byte mode should be interpreted as JIS X 0201. The 2005 edition changed that to be ISO/IEC 8859-1 (i.e. Latin-1). In practice, some QR decoders will attempt to content-sniff the encoding because, while UTF-8 contents should be indicated with an ECI header, they often aren't and UTF-8 is really common.
So if you put binary data into a QR code, you are probably going to hit these edge cases. The contents are likely going to be passed to a general operating system API for handling URLs — do you think the full pipeline will handle NUL bytes in the string, and UTF-8 non-characters and invalid surrogate pairs when interpreted as UTF-8? Also, you probably want your QR contents to be a printable string: bits of it might be shown in the scanner's UI; users might need to copy and paste them.
So let's assume that you want something printable and consider an obvious answer: base64url. It's very common, printable, and doesn't contain any characters that are special in URLs for maximum compatibility. It'll be encoded in a byte-mode segment: each base64url character contains 6 bits of information and takes 8 bits in the QR code for an efficiency of 75%. That's our baseline.
The next segment type to consider is digit mode. This only encodes the digits, 0–9, by packing triples of digits into 10 bits. If there are two digits left over at the end then they take 7 bits, and a singleton takes 4 bits. Ignoring the potential digits at the end, this lets you store 3×log2(10) = 3×3.322 = 9.966 bits of information in 10 bits of space. That's 99.66% efficient! So you can clearly do better than base64url.
The last segment type for these purposes is alphanumeric mode. These segments can encode A–Z, 0–9, and nine special characters: $, %, *, +, -, ., /, :, and space. Pairs of these characters are encoded in 11 bits. (If there's an odd number then the last takes 6 bits.) If you consider this to be “base-45” encoding then it stores 2×log2(45) = 10.98 bits in 11 bits of space, for 99.85% efficiency. Even better than digit mode, although only just.
So maybe you should base-45 encode your data using that alphabet. But, of the special characters supported in alphanumeric mode, only two (minus and period) are unreserved (i.e. safe) in URLs. So you might be reduced to base-38, which cuts the efficiency to 95.42%. But having textually smaller QR contents might be valuable and worth a few percent efficiency in your context.
If you've picked base-10 (digits), base-38, or even base-45 for your data then you need to get it into that form. Base-64 is easy because that's exactly 6 bits per character; you work on 3 bytes of input at a time and produce exactly 4 characters of output. But 10, 38, and 45 aren't powers of two. You've got three options here. The obvious conversion would be to treat the input as a bigint and repeatedly divmod by 10 (or 38, etc) to generate the encoding. If you have a bigint library to hand then it almost certainly has the functions for that, but it's a rather obnoxious (and quadratic) amount of computation and a significant dependency. So you might be willing to waste a few bits to make things easier.
Next option is an encoding noted by djb that achieves similar efficiency but with less computation and no long-division. I updated this post to include it, so it's covered in a section below.
The third option is to chunk the input and convert each chunk independently. Ideal input chunks would be 8 bytes or fewer, because nearly all environments will support a uint64 type and nearly all hardware can do a divmod on them. If you're using base-10 then there's going to be a function that can “print” a uint64 to digits for you, so let's take digits as an example. With a chunk size of two bytes you would get five digits. Each digit takes 3⅓ bits of space, so 16 input bits takes 16⅔ bits: 96% efficient. Less than the 99.66% we can get with digits for sure. But if you consider all chunk sizes from one to eight bytes, turning 7-byte chunks into 17 digits is 98.82% efficient. That's pretty good for the complexity savings.
For base-38, 7-byte chunks are the best again, producing 11 characters for 92.56% efficiency. For base-45, two-byte chunks produce 3 characters for 96.97% efficiency. (Four- and eight-byte chunks do the same if you want fewer loop iterations.)
(And you should use little-endian chunks because little-endian has won, even if the IETF hasn't caught up to that yet.)
Now you've got your payload encoding sorted … probably. A wrinkle is that it's difficult to know how your QR encoder will segment what you give it. You might have crafted a wonderful base-38 input and it might stuff it all into a byte-mode segment! (68.65% efficient, worse than base64url.) I'm sadly not aware of a good QR “debugger” that shows all the details of a QR code. ZXing's online service will give a hex-dump of the raw contents, but that assumes that you can decode the segment headers. (QR-Logo promises better debugging output but doesn't work for me.) My best advice is to use ZXing on a sample QR code, ignore the 0xec, 0x11 padding pattern at the end, and calculate whether the number of bytes used roughly makes sense.
You probably want to put a URL-like prefix at the front to make your QR codes distinguishable. One thing to consider is that “https://www.example.com/” is a byte-mode segment that takes 204 bits, but “HTTPS://WWW.EXAMPLE.COM/” is alphanumeric and needs only 145 bits. (That's assuming QR versions 1 through 9, the lengths are slightly different otherwise.) DNS names are case insensitive and “an implementation should accept uppercase letters” for the scheme says RFC 3986. Maybe it just looks odd and that's not worth the bits, though?
We'll finish up with a quick look at an example, which is the thing that started me on this path in the first place: SMART Health Cards. (Thank you to the person who pointed me at them, who likely wants to remain anonymous.)
SHC's are trying to squeeze a lot of data into a QR code: they minify their JSON structure and compress it but, even then, they sometimes span multiple QR codes and the user has to scan them all. As such their contents are a) a binary segment containing “shc:/” (and maybe sequence numbers if using multiple QR codes), and then b) a digits segment containing the payload. So they didn't use “SHC:/” to save bits, but the difference is small.
One thing to note is that the QR spec (ISO/IEC 18004:2005) has a whole section on “structured append” mode, where multiple QR codes can be combined into one. But trying that with iOS and Android shows that neither support it, so probably it can be considered dead and that's why SHC is replicating the same feature at a higher level.
Another thing to note is that SHC is using digits for better efficiency, which is great, but the way that they do it is not. They're using JWT, which is bad but not today's topic, so they have three base64-encoded strings. They then take each base64 character, subtract 45, and turn that into two base-10 digits! All that work minifying JSON and compressing it, and then they throw away 10% of their bits on such a trivial thing!
So SHC did pretty well, but missed an easy win. Having read this, you can do better.
The NTRU Prime encoding
Above, I referenced an encoding that gets nearly all the space efficiency of the bigint-and-divmod method, but without the computational costs. This section is about that. It's taken from page 18 of the NTRU Prime NIST submission.
Our motivating issue is thus: if you have a whole byte then taking it mod 10 to generate a digit works fairly well. The digits 0–5 have probability 26/256 while 6–9 have probability 25/256. That's not uniform therefore it doesn't encode the maximum amount of entropy, but it's 99.992% efficient, which is pretty good. But when you have a smaller range of input values the non-uniformity becomes significant and so does the reduction in information transmitted.
The encoding in NTRU Prime takes input values (which can be larger than bytes) and combines pairs of them. It produces some output values from each pair but, once the non-uniformity gets unconfortable, it pairs up the leftovers to increase the range. This repeats in a binary-tree.
As a concrete example we'll use the Python code from page 18 and set M = [256…] (because our inputs are bytes), change the 256 and 255 to 10 and 9 (to extract digits, not bytes), and set limit to 1024. Our example input will be 419dd0ed371f44b7.
The input bytes are written (in base 10) along the top. Pairs of them are combined. Take the top-right box: its value is 157×256 + 65 = 40257. That can be considered to be 40257 mod 65536 and, since there's a reasonable number of bits in there, two digits are extracted. Obviously 40257 mod 10 = 7, and 4025 mod 10 = 5. So the two digits are 7 and 5. That leaves 402 mod 656, and 656 is below the limit of 1024 that we set, so it's passed down to be combined into the box below. This continues down the tree: each time there's too little data left to extract another digit, the leftovers are passed down to be combinined. At the bottom there's nothing else to combine with so the final leftover value, 132 mod 186, is converted into the digits 2, 3, and 1. The ultimate output digits are concatenated from left-to-right, top-to-bottom.
This achieves good encoding efficiency without repeated long-division, and can be parallelised.