Fixing Facebook's Borked File Encoding

While working on the Kyanite Facebook archive reader I kept running into garbled text. Ninety-nine percent of the time everything would be fine, but then I’d see a string of weirdly out-of-place characters. I had assumed that the default encoding picked by the Dart file reader was to blame: perhaps it should have been UTF-16 instead of UTF-8, or something along those lines. Experimenting with encodings didn’t help. It turns out the problem is that Facebook does not properly encode their files. With the help of another blogger who discovered the problem and fixed it, I was able to write Dart code that fixes it for my uses as well. Here is the snippet. Read below the fold for more details.

As luck would have it, <sarcasm>or maybe my superior DDG search skills</sarcasm>, I found another developer who ran into the exact same problem. Paweł Krawczyk had attempted to process his Facebook archive file, which had lots of Polish diacritical characters, and all of them were coming out wrong. In this blog post he documents exactly where Facebook engineers went wrong:

The story here seems to be that Facebook programmers mixed up the concepts of Unicode encoding and escape sequences, probably while implementing their own ad-hoc serializer. What Facebook outputs is binary representation of UTF-8 encoded Unicode character U+105 (LATIN SMALL LETTER A WITH OGONEK) but confusingly prefixed with \u.

The prefix is confusing, because it implies it’s an escape sequence referring to a single Unicode character U+C4 (LATIN CAPITAL LETTER A WITH DIAERESIS) followed by another single Unicode character U+85 (NEXT LINE).

This kind of “Unicode characters pretending to be bytes” is possible for characters U+0000 up to U+00FF, or simply the basic ASCII range, because their UTF-8 encoding is identical to their ASCII counterparts. Nonetheless, it’s used against its purpose here and clearly confusing for both humans and JSON decoders.
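
To see the mix-up concretely, here is a small Dart sketch. The helper names utf8Hex and decodeAsJsonString are made up for illustration. It shows the UTF-8 bytes of “ą” next to what a standards-compliant JSON decoder makes of the escapes Facebook writes:

```dart
import 'dart:convert';

/// Hypothetical helper: the UTF-8 bytes of [text] as space-separated hex.
String utf8Hex(String text) => utf8
    .encode(text)
    .map((b) => b.toRadixString(16).padLeft(2, '0'))
    .join(' ');

/// Hypothetical helper: decodes a JSON string literal the standard way and
/// returns the resulting code points as U+XXXX labels.
String decodeAsJsonString(String literal) => (jsonDecode(literal) as String)
    .runes
    .map((r) => 'U+${r.toRadixString(16).toUpperCase().padLeft(4, '0')}')
    .join(' ');
```

Calling utf8Hex('ą') gives c4 85, while decodeAsJsonString(r'"\u00c4\u0085"') gives U+00C4 U+0085: the same two numbers, but read as two unrelated characters instead of two bytes of one character.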

Algorithm

Along with diagnosing the problem, he also posted Python code that reads the Facebook JSON files as raw bytes, re-encodes the mis-encoded Unicode elements, and returns a correctly encoded file. I needed this in Dart for my Flutter application, so I wrote a Dart implementation.

The premise of the algorithm he presented is pretty straightforward:

  • Load the text file as raw bytes, not as Unicode characters
  • Read the data byte by byte, looking for the character sequence \u
    • If there is no \u sequence, simply write the character to the string buffer
    • Otherwise read the next four characters, which are hex digits (nominally a Unicode character index, but here really a single UTF-8 byte), and store the resulting value in a byte buffer
    • Keep reading and parsing six characters at a time until what follows is no longer a \u sequence, adding each result to the byte buffer
    • When you’ve reached the end of the run, decode the byte buffer as UTF-8 into Unicode character(s) and write the result to the string buffer

Practically speaking, what sorts of codes are we looking at? Krawczyk’s example from his post was the Polish “ą”, which Facebook wrote in the file as \u00c4\u0085. In my own archive I ran into a heart character “❤” written as \u00e2\u009d\u00a4\u00ef\u00b8\u008f. That is six \u sequences for one visible character (three bytes for the heart itself and three more for an emoji variation selector), so you cannot assume that exactly two \u sequences define a character. A string of emojis is likewise best converted by reading the whole run before decoding. For example, five star characters one right after the other, “★★★★★”, were written by Facebook as \u00e2\u0098\u0085\u00e2\u0098\u0085\u00e2\u0098\u0085\u00e2\u0098\u0085\u00e2\u0098\u0085.
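
As a quick sanity check of those examples, the sketch below treats each \u00XX escape in a run as one raw byte and decodes the whole run as UTF-8. The function name decodeEscapeRun is hypothetical, and it uses a regular expression and int.parse for brevity rather than speed:

```dart
import 'dart:convert';

/// Hypothetical helper: decodes a run of Facebook-style \u00XX escapes,
/// treating the value of each escape as a single UTF-8 byte.
String decodeEscapeRun(String escapes) {
  final bytes = RegExp(r'\\u([0-9a-fA-F]{4})')
      .allMatches(escapes)
      .map((m) => int.parse(m.group(1)!, radix: 16))
      .toList();
  // Decode the whole run at once; a character may span several escapes.
  return utf8.decode(bytes);
}
```

With this, decodeEscapeRun(r'\u00c4\u0085') returns “ą”, and the fifteen-escape star sequence above comes back as “★★★★★”.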

Dart Implementation

My first implementation used String.fromCharCode to read each of the four hex digits after the \u escape sequence. Once those were built up into a string I could use int.parse with the radix: 16 option to convert it into a byte value, which was then added to the byte array I was building up. That felt like a lot of byte-to-string-to-byte conversion. It worked, but it seemed it could be a lot faster if I just stayed in byte space.

The final implementation converts each ASCII digit byte directly into its numeric value. For example, the ASCII/Unicode character “0” has code 48, so if I get a byte value of 48 I return 0 by subtracting 48. The hex digits a-f need slightly different arithmetic (“a” has code 97, so subtracting 87 gives 10), but the premise is the same. With the values of all four hex digits in hand, we shift each one into its proper place to build the numeric value captured by the \u sequence. That value is stored in a byte array that keeps growing until the run of \u sequences ends, at which point the whole array is decoded by the standard UTF-8 decoder. Below is the conversion for a single \u sequence (where data is the original byte array and i is our current position in it):

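// data[i] and data[i + 1] are '\' and 'u'; the next four bytes are hex digits.
final chars = data
  .sublist(i + 2, i + 6)
  // '0'-'9' are codes 48-57; lowercase 'a'-'f' are codes 97-102.
  .map((e) => e < 97 ? e - 48 : e - 87)
  .toList(growable: false);
// Shift each 4-bit digit into place to rebuild the 16-bit value.
final byte =
  (chars[0] << 12) + (chars[1] << 8) + (chars[2] << 4) + (chars[3]);
byteBuffer.add(byte);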

…and this is what the whole loop over the data looks like:

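// Assumed context: data holds the file's raw bytes, buffer is a StringBuffer,
// i starts at 0, and leadingSlash / leadingU are the codes of '\' (92) and 'u' (117).
while (i < data.length - 1) {
  if (data[i] == leadingSlash && data[i + 1] == leadingU) {
    // Start of a run of \u sequences: collect one byte per sequence.
    final byteBuffer = <int>[];
    while (i < data.length - 1 &&
        data[i] == leadingSlash &&
        data[i + 1] == leadingU) {
      final chars = data
          .sublist(i + 2, i + 6)
          .map((e) => e < 97 ? e - 48 : e - 87)
          .toList(growable: false);
      final byte =
          (chars[0] << 12) + (chars[1] << 8) + (chars[2] << 4) + (chars[3]);
      byteBuffer.add(byte);
      i += 6; // Skip past "\uXXXX".
    }
    // The run is over: decode the collected bytes as UTF-8.
    final unicodeChar = utf8.decode(byteBuffer);
    buffer.write(unicodeChar);
  } else {
    // Ordinary byte: copy it through unchanged.
    buffer.writeCharCode(data[i]);
    i++;
  }
}
// The loop stops one byte short so that data[i + 1] is always in range;
// copy the final byte unless a \u run already consumed it.
if (i < data.length) {
  buffer.writeCharCode(data[i]);
}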
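
The digit arithmetic in the loop assumes well-formed input with lowercase hex digits, as in the examples earlier in this post. If you would rather not rely on that, a more defensive conversion might look like the following sketch. The names hexDigitValue and parseFourHexDigits are made up for illustration and are not part of the snippet:

```dart
/// Hypothetical helper: value (0-15) of one ASCII hex digit, or -1 if the
/// byte is not a hex digit at all.
int hexDigitValue(int c) {
  if (c >= 48 && c <= 57) return c - 48; // '0'..'9'
  if (c >= 97 && c <= 102) return c - 87; // 'a'..'f'
  if (c >= 65 && c <= 70) return c - 55; // 'A'..'F'
  return -1;
}

/// Hypothetical helper: combines the four hex digits starting at [start]
/// into one integer, throwing if any of them is not a hex digit.
int parseFourHexDigits(List<int> data, int start) {
  var value = 0;
  for (var k = 0; k < 4; k++) {
    final digit = hexDigitValue(data[start + k]);
    if (digit < 0) {
      throw FormatException('Not a hex digit', data, start + k);
    }
    value = (value << 4) | digit; // Make room for the next 4 bits.
  }
  return value;
}
```

The trade-off is a few extra comparisons per digit in exchange for accepting uppercase digits and failing loudly on anything unexpected.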

Performance

All of this reading does incur some performance hit, but it is not egregious. On smaller files, less than 100KB, the difference is in the noise, but on larger files it can be up to 10% slower than a straight file read. The original version, which built and converted a string for each byte, was 2-3 times slower than the present byte-only version. Because reading even large JSON files still takes far less than a second, it’s technically not that big of a deal, but I appreciate the extra efficiency enough to keep this version.
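
If you want to check numbers like these on your own archive, a rough timing helper along the following lines is enough. It is only a sketch: averageMicros is a made-up name, and a serious benchmark would also need warm-up runs and more than one file:

```dart
/// Hypothetical helper: runs [action] [runs] times and returns the mean
/// wall-clock time per run in microseconds.
double averageMicros(void Function() action, {int runs = 20}) {
  final watch = Stopwatch()..start();
  for (var r = 0; r < runs; r++) {
    action();
  }
  watch.stop();
  return watch.elapsedMicroseconds / runs;
}
```

Passing it one closure that does a plain read and decode and another that does the fixed-up read gives two averages you can compare directly.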

As posted above, the full code, with an example driver main program, can be found in this GitLab snippet. In that incarnation, reading a Facebook archive file with proper encoding is as simple as:

final data = await File(path).readFacebookEncodedFileAsString();
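
For readers who have not used Dart extension methods, here is a rough guess at how a call like that could be wired up. This is not the code from the snippet: decodeFacebookBytes is a hypothetical name, the extension is a simplified stand-in, and it uses int.parse rather than the byte arithmetic shown earlier:

```dart
import 'dart:convert';
import 'dart:io';

const _backslash = 92; // ASCII code of '\'
const _lowerU = 117; // ASCII code of 'u'

/// True if a complete six-byte "\uXXXX" sequence starts at [i].
bool _isEscapeAt(List<int> data, int i) =>
    i + 5 < data.length && data[i] == _backslash && data[i + 1] == _lowerU;

/// Hypothetical helper: rewrites runs of \u00XX escapes as the UTF-8 text
/// they were meant to represent and copies everything else through.
String decodeFacebookBytes(List<int> data) {
  final out = StringBuffer();
  var i = 0;
  while (i < data.length) {
    if (_isEscapeAt(data, i)) {
      final run = <int>[];
      while (_isEscapeAt(data, i)) {
        final hex = ascii.decode(data.sublist(i + 2, i + 6));
        run.add(int.parse(hex, radix: 16));
        i += 6;
      }
      out.write(utf8.decode(run)); // Throws FormatException on invalid UTF-8.
    } else if (data[i] == _backslash && i + 1 < data.length) {
      // Some other escape such as \\ or \": copy both bytes so that an
      // escaped backslash followed by 'u' is not mistaken for \uXXXX.
      out.writeCharCode(data[i]);
      out.writeCharCode(data[i + 1]);
      i += 2;
    } else {
      out.writeCharCode(data[i]);
      i++;
    }
  }
  return out.toString();
}

/// Hypothetical extension so the decoder can be called directly on a File.
extension FacebookEncodedFile on File {
  Future<String> readFacebookEncodedFileAsString() async =>
      decodeFacebookBytes(await readAsBytes());
}
```

Keeping the decoding in a plain function that takes bytes makes it easy to test without touching the file system; the extension only adds the file read on top.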