Web Sites Forgotten

I’m not at a loss for diversity of approaches in my current archiving work on Prom Week. Teams of developers in modern academia (and even in the real world) need ways to share and evaluate information about their work and projects. Thanks to the never-ending stream of new services and means of disseminating digital information, there are generally quite a few places you need to look nowadays to find everything. For instance, say you are working on a project and it needs to be announced and shared. You’ll start a blog for it, and maybe get mentioned in some press. All good, but those are now two more research objects potentially worthy of appraisal and ingestion into a digital archive. If the press builds up enough, there will be tens to hundreds of individual stories, some derivative, some not, about your work. Maybe a famous person even likes your stuff and mentions it on Twitter or Tumblr; you’d definitely want that saved for posterity, right?

In light of all the new modes of dissemination on the Internet, it has become imperative to find ways of aggregating and saving large amounts of online documents. This isn’t a reference to cloud documents; I’m mainly referring to anything involved in or incumbent upon the development process that gets posted to some addressable folder on a server. I split this type of documentation into two types: dissemination documents and development documents. Dissemination documentation is everything I mentioned in the first paragraph: documents released to the online public by the developers, in the form of press releases, development blogs, and other affiliated blog posts. These documents, in turn, can produce community reactions in the form of press coverage, reviews, and other online references to the game. Development documents are online references to the game intended for the development team only. These can range from shared folders containing demonstration applications to group chat logs and meeting organization tools. Both of these document types sit on the accessible web and might be informative to later researchers, so how to organize and save it all?

Internet Archive?

One thing to do would be to rely on someone else to pick up the slack. In this instance I’m mainly referring to the Internet Archive and their Wayback Machine. The Internet Archive’s tool regularly scrapes the Internet for pages, and stores and logs changes to a large number of websites. You can even browse through older versions of websites and see what they looked like at earlier points in time. The major problem is that the crawler will respect robots.txt (a file located at the root of a web directory that tells crawlers which paths they may not visit) and will apparently remove websites upon request. Another issue is that the crawler is not omniscient: it ‘crawls,’ and needs a medium through which to exercise that activity, namely hyperlinks. If you put up a file on your domain, don’t tell anyone the link, and don’t post it anywhere, there is no way for the crawler to find or know about it. As such, a large portion of the Internet does not function in a way conducive to crawling. Sites like Facebook block crawlers, and as a result the Wayback Machine does not have a copy of the original Prom Week page. So, the short answer is no; if you want to save all dissemination documents, you’ll need to do it yourself.
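To make the crawler’s blind spots concrete, here is a minimal sketch of how a link-following crawler operates, using only Python’s standard library. The seed URL is whatever you choose, and real crawlers are far more elaborate, but the structural point is the same: pages are only discovered through links on already-fetched pages, and robots.txt can veto a fetch even when a link exists.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

class LinkCollector(HTMLParser):
    """Collects href values from anchor tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=50):
    """Breadth-first crawl of a single domain, honoring robots.txt."""
    robots = RobotFileParser(urljoin(seed, "/robots.txt"))
    robots.read()
    seen, queue = set(), [seed]
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        # A disallowed or already-visited page is skipped entirely;
        # nothing linked only from it can ever be discovered.
        if url in seen or not robots.can_fetch("*", url):
            continue
        seen.add(url)
        try:
            html = urlopen(url).read().decode("utf-8", "replace")
        except OSError:
            continue
        collector = LinkCollector()
        collector.feed(html)
        for link in collector.links:
            absolute = urljoin(url, link)
            # Stay on the seed's domain, as a site crawl would.
            if urlparse(absolute).netloc == urlparse(seed).netloc:
                queue.append(absolute)
    return seen
```

A file sitting on the server with no inbound link simply never enters the queue, which is exactly the problem described above.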

Certain applications, including most browsers, will let you save a webpage in its entirety, although the scope of “entirety” is up to the application. Two services I use for this activity are Zotero (for individual pages) and Archive-It (for whole sites). Zotero is a fantastic citation and reference management system that functions as a browser plug-in, allowing you to save a website and attach it to relevant metadata in a folder of your choosing. Archive-It is a commercial web application through which the Internet Archive provides paid access to their web crawler. Conveniently, unless a user of the service explicitly prevents it, anything they crawl will also show up somewhere on the archive.org website. Archive-It is the better solution if you want to save a larger class of website. As an example, I used the tool to save the Prom Week development blog. The crawler managed to grab quite a bit of information and all the blog posts at the domain. Surprisingly, just a handful of posts turned up over 243 URLs and 45 megabytes of data, so there are generally far more documents and data than one would assume. Granted, some of these are links to other services like Twitter and Facebook or to individually hosted images, so the document count is a little inflated. Still, even with this seemingly deep dive into a website it is possible to miss something significant.

Results from undirected Archive-It Crawl
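Between saving single pages by hand in Zotero and commissioning a full Archive-It crawl, there is also a middle ground: the Wayback Machine exposes an on-demand “Save Page Now” endpoint that snapshots a single URL when asked. A minimal sketch follows; the target address is a placeholder, not the blog’s real URL, and the exact response headers may vary.

```python
from urllib.request import Request, urlopen

# Placeholder target; substitute the page you actually want captured.
page = "https://example-devblog.ucsc.edu/promweek/"

# Save Page Now takes the target URL as a path suffix under /save/.
req = Request("https://web.archive.org/save/" + page,
              headers={"User-Agent": "archival-script/0.1"})
with urlopen(req) as response:
    # The archived copy's path is typically reported in the headers.
    print(response.status, response.headers.get("Content-Location"))
```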

One of the project’s developers stored some demo applications and explanatory web sites for conferences on his personal UCSC web space. Crawling his site turned up 156 URLs, but none of them linked to the Prom Week documents. Since he hadn’t linked them from any file in his main directory, the crawler didn’t find them. I had to feed the crawler individual URLs for each demo file and directory. Needless to say, this would become untenable for even a moderately complex grouping of online documents.
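When the orphaned URLs are known, from emails, chat logs, or the developers themselves, the hand-feeding can at least be scripted. A sketch under the same assumptions as above; every URL here is a made-up placeholder standing in for the real demo files.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Hypothetical seed list of unlinked demo files and directories,
# gathered by hand because no crawler can discover them on its own.
seeds = [
    "https://users.soe.ucsc.edu/~dev/promweek/demo/",
    "https://users.soe.ucsc.edu/~dev/promweek/demo/app.swf",
    "https://users.soe.ucsc.edu/~dev/promweek/talk-slides.pdf",
]

for url in seeds:
    try:
        with urlopen("https://web.archive.org/save/" + quote(url, safe=":/")) as resp:
            print("captured", url, "->", resp.headers.get("Content-Location"))
    except OSError as err:
        print("failed", url, "-", err)
```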

Ephemeral Sites

Along with development demos and files, another task teams frequently engage in is scheduling. In looking through the email list archives for the project, I came across references to third-party scheduling apps. These apps were used for organizing project meeting times and are still active on the web. This is a new type of documentation that I haven’t really seen discussed in the archival frame yet. Many people use simple online services to schedule meetings, share notes, and even code. I figure most people don’t expect these texts to hang around, but a large number of them do. Because these documents persist only at the whim of whatever service hosts them, it’s important for archival entities to be aware of the problems inherent in finding, appraising, and saving such ephemeral documentation. I have quite a few still-active links to meetings the development team scheduled over two years ago. Although this information is not something I’d deem very important, it does further substantiate a number of things, including who was working on the project and when they were meeting. That type of information might be irrelevant for my current work, but it would be nifty to have the date of the first UI design meeting at Google or something similar. Awareness of the range of documentation can help prevent potentially valuable information from slipping through. The continuing problem is knowing, in advance, what the relevant documents are.
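One small defensive habit follows from this: periodically check which of those ephemeral links still resolve, so they can be captured before the hosting service disappears. A minimal sketch, with invented placeholder links of the kind one might pull from old mailing-list archives:

```python
from urllib.request import Request, urlopen

# Placeholder links standing in for the real scheduling-app URLs.
links = [
    "https://www.when2meet.com/?1234567-abcde",
    "https://doodle.com/poll/xyz123",
]

for url in links:
    # HEAD asks only for headers, so nothing is downloaded.
    req = Request(url, method="HEAD")
    try:
        with urlopen(req, timeout=10) as resp:
            print(url, "still resolves:", resp.status)
    except OSError as err:
        print(url, "gone or unreachable:", err)
```

Any link that still resolves is a candidate for one of the capture methods above; any that doesn’t is a reminder of how quickly this class of documentation evaporates.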