The Zombie Stack Exchanges That Just Won't Die
I wonder whether there is a data format for archiving messages from Twitter and similar microblogging sites, including metadata. The Library of Congress receives all messages from Twitter, but I found no information about how they store them or how they would make them available to researchers. There are several tools to back up your tweets, but in most cases you don't get a download that includes all the data. One can also access the Twitter API, but then everyone stores their archive in a different format, so there is no easy way to exchange and combine archived messages.
On the other hand, there is interest in analyzing microblogging, and tweets are analyzed by companies and researchers. But are there open, standard tools to collect, archive, and analyze tweets? And which collections of archived microblogs exist in addition to the LoC's? I don't want to propagate a "MARC for Twitter", but I'd like at least something more precise than "just use CSV/JSON". Relying on the custom format of one particular provider (Twitter) at one particular point in time does not look like a reliable solution for long-term archiving.
P.S.: Ed Summers gave a brief overview of the format used by the LoC for archiving Twitter. There are some open questions about how to use the format for other services (e.g. Google+) and for particular selections of messages (somehow documented in bag-info.txt). I'd like to see tools to create your own archives in the same format - for instance all tweets by some user accounts, or all tweets with a given hashtag like #VenusTransit in a given time span - and tools to read and analyze these archive files. There are several closed web applications that do so, but they don't provide import/export in such a defined standard format, do they?
Jakob
I would use RDF and serialize it in Turtle or, better yet, JSON-LD.
Why? Tweets often contain links, hashtags, and mentions; every tweet has a URI, e.g. https://twitter.com/jindrichmynarz/status/176326368701853696; and all in all it's about graphs: "social" graphs. Retweeting is "_:personX twitter:retweets _:tweetY". Liking is linking (see https://headtoweb.posterous.com/liking-is-linking), and so on.
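To make that concrete, here is a minimal sketch in Python with rdflib. The tw: namespace and its property names are invented for illustration (no standard Twitter vocabulary is implied); only the SIOC terms come from an existing vocabulary.

from rdflib import Graph, Literal, Namespace, URIRef

# Hypothetical vocabulary for tweet-specific relations; SIOC is a real vocabulary.
TW = Namespace("http://example.org/twitter-vocab#")
SIOC = Namespace("http://rdfs.org/sioc/ns#")

g = Graph()
g.bind("tw", TW)
g.bind("sioc", SIOC)

tweet = URIRef("https://twitter.com/jindrichmynarz/status/176326368701853696")
author = URIRef("https://twitter.com/jindrichmynarz")

g.add((tweet, SIOC.has_creator, author))                      # who tweeted
g.add((tweet, SIOC.content, Literal("Example tweet text")))   # what was said
g.add((tweet, TW.hashtag, Literal("VenusTransit")))           # tagging
g.add((author, TW.retweets, URIRef("https://twitter.com/example/status/1")))

print(g.serialize(format="turtle"))

Depending on the rdflib version, the same graph can also be serialized as JSON-LD (format="json-ld"), possibly with the rdflib-jsonld plugin installed.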
I think some public information about how the Twitter data is being stored at LC could be helpful. One of the reasons for the lack thereof is probably that it's quite uninteresting (at the moment). In the interest of transparency I can provide some rudimentary information about how the data is archived, as I have been involved in the little bit of software development LC has done around the Twitter data to date. These remarks are not an official statement from the Library of Congress; they are merely the reflections of a software developer who has worked on the project.
LC currently receives the Twitter data from a third-party data provider, Gnip. Gnip packages up each hour's tweet and delete activity using BagIt; the bag is tarred up and made available on Amazon S3. The structure of an example bag looks like:
2012050105
|-- bag-info.txt
|-- bagit.txt
|-- data
| |-- 2012050105_deletes.gz
| `-- 2012050105_tweets.gz
|-- manifest-md5.txt
`-- manifest-sha1.txt
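As a sketch of how a bag laid out like this could be checked on receipt, assuming the Library of Congress bagit library for Python is installed (this is not LC's actual code):

import bagit

# Open the hourly bag from the example above and verify its payload
# against manifest-md5.txt / manifest-sha1.txt.
bag = bagit.Bag("2012050105")
bag.validate()              # raises BagValidationError if anything is off
print(bag.info)             # metadata from bag-info.txt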
A simple custom Python/Django application at the Library of Congress periodically polls S3 for new bags to download. When it finds a new one it downloads the tar file, untars it, counts the number of tweets and deletes, verifies the bag, and uses an internal data transfer application to inventory and copy the bag to archival storage, after which the bag is deleted from the local filesystem. This process runs 24/7 in order to keep up.
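A rough sketch of what such a polling step might look like, using boto3; the bucket name, prefix, and bookkeeping are invented for illustration, and this is not the LC application:

import tarfile
import boto3

s3 = boto3.client("s3")
BUCKET, PREFIX = "example-gnip-bucket", "twitter/"   # hypothetical names

seen = set()                          # in practice this would be persisted
resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
for obj in resp.get("Contents", []):
    key = obj["Key"]
    if not key.endswith(".tar") or key in seen:
        continue
    local = key.rsplit("/", 1)[-1]
    s3.download_file(BUCKET, key, local)     # fetch the new bag
    with tarfile.open(local) as tar:
        tar.extractall()                     # yields a bag directory like 2012050105/
    seen.add(key)
    # ...count tweets and deletes, validate the bag, hand off to archival storage...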
The tweet and delete data look like the JSON that the Twitter API itself emits. We have made no effort (to date) to normalize the data into another format, for several reasons.
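For a sense of what working with the payload looks like, here is a small sketch that reads one hour's tweets, assuming one JSON document per line inside the gzipped file; the field names shown are the familiar Twitter API ones and may differ in the actual delivery format:

import gzip
import json

with gzip.open("2012050105/data/2012050105_tweets.gz", "rt") as fh:
    for line in fh:
        if not line.strip():
            continue
        activity = json.loads(line)
        # pull out a couple of familiar Twitter API fields
        print(activity.get("id_str"), activity.get("text", "")[:40])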
I hope this brief on-the-ground description of the Twitter archiving activity provides some guidance, and doesn't do a disservice to any of the folks at LC who have been involved in the effort.
Although your question is mostly about formats, the title suggests you're interested in tools as well. If so, maybe t, a command-line interface to the Twitter API, could be of some interest here.
I'm not sure how useful this is for professional archiving, and I must say that I don't have any hands-on experience with the tool myself. However, I've spoken to some people who seem impressed with it (although they were mainly using it for personal backups). Possibly worth a look.