There is a building in northern San Francisco that looks like a cousin to the Acropolis in Greece. It used to be a Christian Science church.
Now, however, it houses 26 petabytes of digital information in a forest of blinking, heat-generating servers. Welcome to the Internet Archive headquarters.
“The Internet Archive is part of the vision to build the Library of Alexandria, version 2,” says digital librarian Brewster Kahle. “We hit the record button on the World Wide Web in 1996. We take a snapshot of every web site and every web page on every website.”
The archive doesn’t just collect digital information. It also archives old video games, film and hardware. And it gets a lot of traffic. Between two and three million users upload or download something from the archive every day.
“What we want is the wackiness and the wildness of all the people participating in the big conversation that is the Internet,” Kahle says.
The Internet Archive collects web pages at about one billion pages per week. Currently the collection consists of some 450 billion web objects.
“Which is just freaking huge,” Kahle says. “The Library of Congress's number of books is 28 million. We collect that in about, oh I don't know, six hours.”
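As a rough check on Kahle's comparison, the quoted figures can be run through a short calculation. The Python sketch below assumes a steady crawl rate and counts one web page as one item, neither of which the article states. The function name `hours_to_collect` is illustrative.

```python
# Figures quoted in the article. Treating one web page as comparable to one
# book is a simplifying assumption made only for this comparison.
PAGES_PER_WEEK = 1_000_000_000
LIBRARY_OF_CONGRESS_BOOKS = 28_000_000
HOURS_PER_WEEK = 7 * 24


def hours_to_collect(items: int, pages_per_week: int = PAGES_PER_WEEK) -> float:
    """Hours needed to collect `items` pages at a steady weekly crawl rate."""
    pages_per_hour = pages_per_week / HOURS_PER_WEEK
    return items / pages_per_hour


hours = hours_to_collect(LIBRARY_OF_CONGRESS_BOOKS)
print(f"{hours:.1f} hours")  # prints "4.7 hours"
```

At the quoted rate the result is a little under five hours, in the same range as Kahle's offhand "about six hours."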
The Internet Archive cannot access all of the web, however. Information on Facebook and Twitter is closed to the public and difficult for the archive to collect.
“You have to understand these are privately owned information assets,” says historian Abby Smith Rumsey. “People don't think of things that they put on Facebook or Twitter as belonging to somebody else because they come from us. But in fact they don't have ultimate control over this. ... We need more organizations that control information like Twitter, Apple, Google, to actually develop partnerships with public institutions, like the Internet Archive or the Library of Congress, so that they can be archived and made available to the public for the long term.”
The Internet Archive’s work involves not only collecting information, but figuring out new ways to store it, and constantly updating it to keep it available in relevant formats.
“We have to be out there all the time not only gathering the stuff, but keeping it in formats and keeping it relevant,” Kahle says.