If you’ve tweeted anything in the last seven years or so, there’s a pretty good chance your 140-character musing has been enshrined in the Library of Congress. In a rather ambitious project that came to an end last month, the Library collected every public tweet since 2006. Every ICYMI. Every TBT. Every hashtag.
When the Library announced its Twitter Collection in 2010, users were posting about 55 million tweets per day. By the end of the project, the Library was adding about 500 million tweets to the collection daily, a challenging volume to manage even in its quip-length packaging. According to Information Science Assistant Professor Alex Poole, PhD, whose research in the College of Computing & Informatics focuses on archives, records and digital curation, collecting the tweets might have been the easy part. Now the Library faces the onerous task of sorting them in a way that will allow for public access.
Poole recently shed light on how and why libraries decide to start collections, how they go about managing them, and what the Library of Congress is going to do with 11 years of tweets.
From 2010 until Dec. 31, 2017, the Library of Congress collected tweets (and it has archived all tweets since 2006). Considering the scope of what the Library typically archives from the internet, why was this deemed a viable and useful undertaking at the time?
The Library first dipped into digital preservation at scale in 2000, when it started selectively capturing blogs and websites. It seems likely that its decision to archive Twitter flowed from this initiative. What is more, Twitter offered a ready-made opportunity to preserve a new medium of communication at the beginning of its lifecycle, a tantalizing opportunity for archivists. Twitter constituted a large corpus of real-time data produced by ordinary folks at an unprecedented scale.
Traditionally, by contrast, repositories have acquired materials only selectively or even accidentally, often after the materials’ active phase, and those materials have usually been biased toward elite white males. Advocates of the new endeavor therefore suggested that the corpus would prove a boon to future historical research. Social media could serve as a supplement to traditional documents such as correspondence and other materials long collected by research libraries. The Library suggested that the Twitter corpus could inform not only future scholarship, but also legislation and education.
What sort of considerations go into the Library’s — or any library’s — decision to archive certain information?
A repository’s collection development channels its mission statement. Yet repositories often have broad mission statements and broad collecting policies: the Library of Congress, for example, seeks to “acquire, preserve, and provide access to a universal collection of knowledge and the record of America’s creativity,” which gives it much latitude, for better or worse.
It bears noting that the Library, like most repositories, has rarely if ever committed to collecting any materials comprehensively. Most collecting policies, moreover, have been written retroactively, i.e. once a body of materials has already been assembled. This helps account for their generally expansive nature.
As important as a repository’s mission statement and concomitant collecting policy is its ability to commit resources to a collecting project. Public institutions, even those as august as the Library of Congress, are particularly affected by resource shortages. Complicating matters still further, repositories invariably grapple with tremendous backlogs of work and must balance those ongoing initiatives against any new ventures. On top of that, the Library chose not to hire an information officer until 2014, and the Librarian of Congress from 1987 to 2015 refused even to adopt email.
Undoubtedly these two factors, existing commitments and inertia vis-à-vis technology, played a key role in the institution’s decision to cease archiving all tweets.
What factors played into the Library’s decision to cease archiving tweets?
When the Library decided to start collecting tweets in 2010, it assumed responsibility for a relatively small number of tweets that were text-based and limited to 140 characters. The first batch of tweets, accessioned in 2011 and covering the years 2006-2010, comprised only 2.3 terabytes of data. The number of tweets to be collected, however, skyrocketed: from roughly 55 million per day in 2010 to 140 million per day in 2011 to 500 million per day in 2012. More important, perhaps, was the challenge of developing use policies and of indexing.
Collecting tweets raised vexing privacy concerns; as a result, the Library planned to embargo tweets for six months. Similarly, the Library agreed to redact any messages by users who made their account “protected”; it also elected not to collect deleted tweets. Further, it did not collect any material attached to the tweets, such as images or retweets. Indexing to facilitate use was likely the most daunting problem: the goal of giving researchers access to the original 2006-2010 materials has not been achieved even as of 2018.
The Library justified its decision to stop collecting all tweets by claiming that it had fulfilled its earlier, albeit vague, mission: to document the origins and formative period of a social medium. More specifically, it offered three reasons, all debatable, for its evolved policy: the sheer volume of tweets produced daily, the fact that the maximum size of each tweet had expanded to 280 characters, and the fact that it had collected only text, not images and other metadata. Additionally, its public statements mentioned resource allocation as a challenge. It is worth noting that Twitter has not offered to help sponsor the continuation of the project.
The Library’s Twitter archiving policy going forward will concentrate on covering themes, such as public policy, or events, such as elections; this stance is in accord with its overall policy on collecting. As noted above, it still has no timetable for when access will be provided to the existing collection.
How have the Library’s policies changed to accommodate the growing rate of information creation in the last 10-15 years?
It is unclear whether they have! Apropos of their website/blog preservation efforts, for instance, they hedge, “Thousands of sites have been preserved in a variety [sic] event and thematic Web archives, selected by subject specialists.”
How will the Library determine which tweets it will archive on a “very selective basis” going forward?
This is a big problem and points back to the intertwined issues of collecting policy and mission. Because both are so broad, they can be tailored to rationalize effectively any course of action the Library chooses in this regard. It will be quite fascinating to see how they handle this issue, especially because it is politically fraught: the very point of collecting tweets was to document ordinary people, and it is unclear how this can best be effected on a “very selective” basis. Again, it appears that neither Twitter nor any other private entity has offered to support financially the Library’s efforts, past or present, in this respect.
For media inquiries, contact Britt Faulstick, assistant director, Media Relations, at bef29@drexel.edu.