Saturday, September 20, 2008

The Wayback Machine

We often think that whatever we publish on the web stays there forever. This is not the case. The average lifespan of a page is only 40-75 days, forums often go defunct, web designers are forever tinkering with layouts, and the need to be "current" constantly drives the turnover of web pages in the corporate world. All that data, lost? Not so. Not if you have access to The Wayback Machine. The Wayback Machine is a web service that lets you browse an archive of the internet going back to 1996. The brainchild of Brewster Kahle, inventor, librarian and one of the directors of the EFF, it is named after the fictional machine that transports Mr. Peabody and Sherman back in time in a segment of the cartoon The Rocky and Bullwinkle Show. Type in the URL of the site in question, click on an archival date, and the web page will be retrieved as it was, with the links in working order. It is free, it is legal and it works.
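
For those who would rather skip the typing and clicking, the lookup described above can be sketched in a few lines of Python. This is only an illustration, not the Archive's own code: it assumes the public address pattern web.archive.org/web/<timestamp>/<url>, and the function name wayback_url is made up for this example.

```python
from datetime import datetime

# Assumed public address pattern for archived pages (not taken from the post itself).
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(page_url: str, when: datetime) -> str:
    """Build the address of the archived copy of page_url closest to `when`.

    Hypothetical helper: it only formats a string and makes no network request.
    """
    # Snapshots are addressed by a 14-digit timestamp: YYYYMMDDhhmmss.
    timestamp = when.strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{timestamp}/{page_url}"


if __name__ == "__main__":
    # Ask for example.com as it looked on the day this post was written.
    print(wayback_url("http://example.com/", datetime(2008, 9, 20)))
```

Pasting the resulting address into a browser should land on the snapshot nearest to that date, if one was ever taken.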

Kahle's premise for The Wayback Machine was "to build a library of everything, and the opportunity is to build a great library that offers universal access to all of human knowledge." As he often says in his promotional lectures, "We really need to put the best we can offer in reach of our children. If we don't, we are going to get the generation we deserve." The project was designed to take a snapshot of the web and preserve it for future generations. There are 85 billion pages archived, from 1996 to a few months ago.

It is a computer system with close to 400 parallel processors, 100 terabytes of disk space and hundreds of gigabytes of RAM, all for under half a million dollars. The Archive has turned clusters of PCs into a single parallel computer running the biggest database in existence, with its own operating system, P2, which allows programmers with no expertise in parallel systems to program it. The crawlers record pages into 100MB files in a standard archive file format and then store them on one of the storage machines, which are nothing more than normal PCs with IDE drives. The files are then indexed on another set of machines, and that index is kept up to date on an hourly basis. The system uses the link structure of the Net and the usage trails of Alexa users to compute this data. Alexa Internet, Inc. is a California-based subsidiary of Amazon.com, best known for operating a website that provides information on web traffic to other websites.

Some of the notable archives are Web Pioneers, sites ranging from Trekkie user groups to Amazon that made a global impact in the early days, and the Asian Tsunami Web Archive, a collection of over 1,500 sites relating to the December 2004 tsunami disaster in Asia. A snapshot of these sites has been taken once a week, starting from the first week of January 2005. All in all, you can source pretty much any site you want; however, the archive is restricted to sites whose owners gave copyright permission.
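
The crawler step mentioned above, recording pages into 100MB files, can be pictured with a small Python sketch. This is a guess at the general idea only, not the Archive's actual software: the function name pack_into_archive_files and the simple fill-until-full rule are assumptions made for illustration.

```python
from typing import Iterable, List

# Size cap taken from the figure quoted in the post: 100MB per archive file.
ARCHIVE_FILE_LIMIT = 100 * 1024 * 1024


def pack_into_archive_files(
    page_sizes: Iterable[int], limit: int = ARCHIVE_FILE_LIMIT
) -> List[List[int]]:
    """Group crawled pages (given by size in bytes) into files of at most `limit` bytes.

    Hypothetical helper: pages are appended in crawl order, and a new file is
    started whenever the next page would push the current file past the limit.
    A single page larger than the limit simply gets a file to itself.
    """
    files: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for size in page_sizes:
        if current and current_size + size > limit:
            files.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        files.append(current)
    return files


if __name__ == "__main__":
    # Toy numbers with a limit of 100 so the grouping is easy to follow.
    print(pack_into_archive_files([60, 30, 20, 90, 10], limit=100))
```

With the toy numbers, the five pages end up in three files: the first two together, the third alone, and the last two together.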

The Wayback Machine is not limited to web pages; it includes a move to archive all books, music, film, television and software. Kahle, ever the librarian, says "whenever I try to read a book on my laptop it feels like work". "I like the physical book. And I think we can go and use our technology to digitize things, put them on the net, and then download, print them and bind them and end up with books again." The archive is scanning 15,000 books a month and has 250,000 books online in 8 collections. A book is about a megabyte, so the entire Library of Congress comes to about 26 terabytes, which fits on a 3x3 ft machine of spinning Linux drives at a total cost of around $60,000. The archive only holds what is freely and publicly available. Kahle started a program for children called the bookmobile, a van with its own satellite dish, printer and binder. It costs about $3 to download, print and bind a book. There is even an Espresso Book Machine, for personalized books, printed on an assembly line. The New Library of Alexandria in Egypt has scanned 30,000 national books, the Chinese 1,000,000 and the Indians about 300,000. The push now is to provide in-library machines, and so far there are 8 scanning centres in North America.
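
The storage figures above are easier to appreciate once the arithmetic is spelled out. The sketch below only reworks the numbers quoted in the post (about a megabyte per book, 26 terabytes, roughly $60,000); the decimal conversion of one million megabytes per terabyte and the variable names are assumptions for illustration.

```python
# Figures quoted in the post.
BOOK_SIZE_MB = 1            # "a book is about a megabyte"
LIBRARY_SIZE_TB = 26        # estimated size of the whole Library of Congress
MACHINE_COST_USD = 60_000   # approximate cost of the storage machine

# Assumption: decimal units, 1 terabyte = 1,000,000 megabytes.
MB_PER_TB = 1_000_000

# Number of one-megabyte books that fit in 26 terabytes.
implied_books = LIBRARY_SIZE_TB * MB_PER_TB // BOOK_SIZE_MB

# Hardware cost spread over the storage and over each book.
cost_per_tb = MACHINE_COST_USD / LIBRARY_SIZE_TB
cost_per_book = MACHINE_COST_USD / implied_books

if __name__ == "__main__":
    print(f"Books implied by 26 TB: {implied_books:,}")
    print(f"Cost per terabyte: ${cost_per_tb:,.0f}")
    print(f"Storage cost per book: ${cost_per_book:.4f}")
```

In other words, the quoted numbers imply about 26 million books, stored for a fraction of a cent each.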

When it comes to film, you may be surprised to learn that the total number of theatrical releases since the medium's inception is estimated at 100,000, and about half of these are from India. About 1,000 are out of copyright, and these are legally available. If you factor in independent flicks and archival films, the total is a couple of hundred thousand. For the user, the archive offers unlimited storage and unlimited bandwidth, forever; not only is it free, you can do whatever you want with it.

One of the points that Kahle raised when dealing with television is this: how can we have critical thinking without being able to quote, and to compare with what happened in the past? There is so much rich information, but without inclusive viewpoints and a historical perspective, context is lost. The archive is recording 20 channels of global TV 24 hours a day, which adds up to about a petabyte. There are 50,000 videos in the archive now. Noteworthy is the archive of a full week of news from September 11, 2001, some of which you can find here. Music is quantitatively tiny in comparison. Only 2-3 million records have been produced over the last century, and the archive has stored about 100,000. As is obvious given the music industry's stance on copyright infringement, this is taking the longest to consolidate. Software titles number only about 50,000, and again there are legal issues. Kahle is trying to build a business model where "free for all" and "loan for all" meet somewhere in the middle.

Of course, every Utopian dream that comes out of Silicon Valley is followed keenly by sharks, and the great white that's in the thick of it today is none other than Google Books. According to Wikipedia, Google is scanning more than 3,000 books per day, a rate that translates into more than 1 million annually. Google has encountered choppy waters itself, as the Authors Guild of America and the Association of American Publishers have separately sued Google, citing "massive copyright infringement." Apparently it has been surreptitiously scanning books and uploading them until the owner becomes aware and yanks them off. The books are also scanned in English, leaving 3 billion people out of their "historical or literary" say. Google's aim runs contrary to the adage "what is in the public domain should remain in the public domain."

The Library of Alexandria, repository of all written knowledge in its time, was burnt by Julius Caesar during his conquest of Egypt in 48 BC, when he set fire to his own ships, burnt the docks and accidentally destroyed the great library. Today a copy of The Wayback Machine resides within the Library of Alexandria, free, for all.
