Facebook Data Export Hidden Pitfalls

2018-12-27 in SOFTWARE ENGINEERING

fediverse open source

5 min read

As I get more and more fed up with Facebook while getting more and more embedded in the Fediverse, I’ve been considering the whole #deletefacebook campaign again. I turned off Facebook earlier but never deleted it. As the new year approaches, the thought of shutting it down appealed to me, but then I went a step further and thought I should just blow away all of my data as well. There are lots of posts with lots of data and lots of associations that I want to keep, though. Thankfully, Facebook provides a mechanism for extracting your data. Unfortunately, if you assume all of your data is there, you’ll probably be wrong.

I’m not going to ascribe malicious intent to Facebook, but I will ascribe incompetence in that regard. When I first tried this data extraction exercise a few months ago, I was disappointed to find that the links had been ripped out of all of my posts. One of the few things I was actually trying to preserve was the links I shared. That is probably the majority of my posts: here’s a link and what I thought of it. Without that data the post lacks context and usefulness. Lo and behold, though, if one exports as JSON instead of HTML, the link data is preserved after all. JSON is probably a better format anyway, so I figure that’s a good thing. After a small test extraction covering a couple of days, I went ahead and told it to dump the entire history of my time on Facebook, which spans 5 years. I clicked the archive creation button with all options selected and then waited the requisite time for it to e-mail me that it was done.

With e-mail in hand (although Facebook is happy to give you an in-browser notification as well), I had my archive. A full 1GB+ of data packaged up in a neat little bundle and ready for use. I downloaded and extracted it and then began to pore over this voluminous collection of posts and comments, thousands of them. It’s raw JSON with timestamps, my original post comments, and the links! Thank god the links were finally there! The times of the posts are in seconds since the Unix Epoch (1 January 1970 00:00:00 UTC). That may sound like a bizarre thing to many people, but it’s actually a pretty standard way of tracking time for developers. Wanting to make sure these thousands of posts covered the whole period, I did a little spreadsheet math, expecting to confirm that the posts began in July 2013 and ended on the day I did the archive. That was the plan anyway. Instead, what I got was that the archive began some time in 2015 and ended the day before the day I was archiving. Okay… I mean, I guess it’s possible that the posts are out of order in the middle, but it’s more likely that I’m missing data. A little experimentation later, it seems that the request essentially times out and gives you whatever it has gathered, or simply stops at a certain post count or time threshold. It also seems that the date selector really gets interpreted as “posts on or after the start date, up to but not including the end date”. Thanks, Facebook, for telling us that!
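The spreadsheet math above can also be done in a few lines of Python. This is only a minimal sketch: it assumes the posts have already been loaded from the JSON into a list of dicts and that each one carries its epoch-seconds time under a key named `timestamp`. That key name, the helper names, and the sample data are my assumptions for illustration, not a documented Facebook schema.

```python
from datetime import datetime, timezone

def to_utc(seconds):
    """Convert seconds since the Unix Epoch to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

def post_date_range(posts, key="timestamp"):
    """Return (earliest, latest) UTC datetimes across a list of post dicts."""
    stamps = [p[key] for p in posts if key in p]
    if not stamps:
        raise ValueError("no timestamps found")
    return to_utc(min(stamps)), to_utc(max(stamps))

def largest_gap_days(posts, key="timestamp"):
    """Return the longest stretch, in days, between two consecutive posts."""
    stamps = sorted(p[key] for p in posts if key in p)
    gaps = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
    return max(gaps, default=0) / 86400

# Hypothetical sample data; "timestamp" is an assumed key name.
sample = [
    {"timestamp": 1372636800, "title": "first"},   # 2013-07-01 00:00:00 UTC
    {"timestamp": 1545868800, "title": "latest"},  # 2018-12-27 00:00:00 UTC
]

earliest, latest = post_date_range(sample)
print(earliest.date(), latest.date())  # 2013-07-01 2018-12-27
```

If the earliest date is years later than when you joined, or `largest_gap_days` reports a suspiciously long silence, the archive is probably missing posts.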

There was other confusion as well. I experimented with doing extractions from back in 2013 for one month, 2013 for one year, the previous two days, et cetera, to get to my above conclusions. The history of who you friended, unfriended, or didn’t accept seems pretty legit. The message comments history does too. The photos, however, seem to be nearly arbitrary. The 2013 archive had pictures from November of this year. Likewise, the archive for the previous 48 hours, which only returns data for the previous day and not the present day, included pictures from 2014. I guess having to de-duplicate data is better than losing it, which is what I’d be dealing with for these posts.

Beyond data loss, there are data usability issues. The archive includes every post you ever interacted with. You get all of your likes and comments, as well as your posts. The problem is that there is nothing linking your interactions to the other post. Each entry has a timestamp, what you said or what your reaction was, and who the owner of the post was. That’s it. There are no GUIDs or other mechanisms for actually relating that information to anything. That’s true even for posts you created! So the comments and interactions are essentially a giant cloud of unconnected data. Boy, that’s useful! Since the primary thing I wanted to save was my posts, I was willing to live with that, but what is the point of that data except maybe to get around GDPR in an almost passive-aggressive way? “Oh okay, yes, here is all the data, but it’s not in any useful format for you.”

I’m not surprised, but I am disappointed by how poorly this archive functionality works. Before I go through the process of deleting my entire account, I’m going to have to spend time manually archiving in chunks of time, validating the begin and end dates of posts, etc. I’m then also going to want to de-duplicate the corresponding data. Those who want to do the same, or who have already done so thinking they got all of their data, should really tread very lightly and be diligent about checking that they got everything they think they did.
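For the merge and de-duplication step, something along these lines might work. It is a rough sketch rather than a tested tool: because the export has no GUIDs, it treats two entries as the same post only when their entire content is identical, and it assumes each chunk is already loaded as a list of dicts with the time under a `timestamp` key. The function name and the key name are hypothetical.

```python
import json

def merge_chunks(chunks, key="timestamp"):
    """Merge several lists of post dicts, dropping exact duplicates.

    With no unique IDs in the export, a post's whole content serialized
    with sorted keys serves as its fingerprint.
    """
    seen = set()
    merged = []
    for chunk in chunks:
        for post in chunk:
            fingerprint = json.dumps(post, sort_keys=True)
            if fingerprint not in seen:
                seen.add(fingerprint)
                merged.append(post)
    # Oldest first, so the begin and end dates are easy to eyeball.
    merged.sort(key=lambda p: p.get(key, 0))
    return merged

# Two hypothetical overlapping chunks; the post at timestamp 2 is in both.
chunk_a = [{"timestamp": 1, "title": "x"}, {"timestamp": 2, "title": "y"}]
chunk_b = [{"title": "y", "timestamp": 2}, {"timestamp": 3, "title": "z"}]
print(len(merge_chunks([chunk_b, chunk_a])))  # 3
```

Exact-match fingerprinting is deliberately conservative: it will never merge two different posts, but it will keep both copies if the same post comes back with even slightly different fields in two exports.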