Don’t Look Back: Cloud APIs and Revisions

Cloud APIs represent an interesting problem for the archival security of documents over extended periods. Storing files with a secondary source (in this case, Google Docs and Dropbox) is usually the most convenient option. Google Docs provides a means to create, edit, share and browse documents without leaving the confines of a browser. Since I’m usually doing research or testing applications within that environment, having some of my documents there is definitely a useful addition to my workflow. Similarly, Dropbox’s main appeal is that it shadows the cloud storage of my files by being present as a locally accessible folder. Anything I drop in there then gets whisked up to the mystical cloud, floating into the immense sea of offsite data storage.

The obvious problem with cloud storage and a definite reason for skepticism is the reliance on secondary corporate structures to persist. Granted, the stability of the backups on Google Docs or Dropbox is probably greater than on my local hard drive but there is still something a little unnerving about giving up that control. From an archival perspective, which is very…very long, there is a general fear about the lasting permanence of cloud institutions. Google and Dropbox seem totally stable right now, but will I be able to leave something on there for 20 years and still be able to retrieve it? Technological turnover is astoundingly fast and many users have been burned by storage companies blinking out of existence. A good example was the shutdown of MegaUpload, in which 25 petabytes of user data vanished after a government raid.

The goal is to find the most stable possible place for long term digital storage. And although they are sometimes derided as slow and conservative, government and educational institutions generally have the best track record for long term retention. So the question, in light of all the online documents produced by game development, is what is the best solution to the probable collapse of a cloud service? Do you just migrate the stored data to a more secure digital repository? What data do you take, or more specifically, what data is available? This last question is where access to cloud service APIs becomes important because they are the main conduit for detailed documentary information and also the limiting factor for getting data off of a service.

Permission and Access

Both Google and Dropbox require user authentication to access cloud files. This is necessary for security and is at the heart of the trust networks that allow these services to function. Through a weird inversion, this security framework is also a liability for archives. Since getting access to files requires membership in either service, if a group wants to share data with a repository through the service, that repository must then become a member. The problem here is that as an archivist I should not, in any way, be inserting myself into the material. Yes, when you prepare a collection you make tons of decisions about potential organization and delineation betwixt documents, but I’m not going to write my name on every one. When a Dropbox or Google Docs folder is shared, the metadata associated with that transaction becomes permanent, so I effectively become a member of the project team from the viewpoint of the service.

Now, as far as future recommendations go, I don’t know if this type of thing is avoidable. For instance, in Google Docs you can assign privileges for a file and thus make it impossible for the archival entity to modify the document. Sadly, this option is not available on Dropbox, so if I accidentally do something to the files on my machine I could permanently overwrite some actual historical document. And yes, I could always make sure to make a full, non-synced copy of the folder somewhere else and then recursively modify those permissions, but my main point is that this is a ton more to think about and manage on a technical level than with a standard digital document.
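As a rough sketch of that last workaround, the Python below copies a synced folder to a separate location and strips write permission from every copied file. The function name make_readonly_copy and the example paths are made up for illustration, and it assumes a POSIX-style filesystem where permission bits are honoured.

```python
import os
import shutil
import stat
from pathlib import Path

# Write permission for owner, group and everyone else.
WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def make_readonly_copy(src, dest):
    """Copy src to dest (which must not exist yet) and clear the write bits on every file."""
    # copytree uses shutil.copy2 by default, so file timestamps are carried over.
    shutil.copytree(src, dest)
    for root, _dirs, files in os.walk(dest):
        for name in files:
            path = Path(root) / name
            path.chmod(path.stat().st_mode & ~WRITE_BITS)
    return Path(dest)


# Hypothetical usage, with made-up paths:
# make_readonly_copy("/home/me/Dropbox/project", "/archive/project")
```

This only guards against accidental edits on the local machine. It does nothing about changes made by other members of the shared folder, which is why the copy has to live outside the synced directory.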

Access to the various APIs for these services is also based on membership. If I want to use Google’s or Dropbox’s JavaScript API, I need to register with them as a developer and then register my potential application of the API to get a token allowing access to the service. Now, if you don’t follow that immediately, you can see where some problems start to set in. I’ve developed web applications, so all this is pretty straightforward to me. To libraries or other traditional repository structures it means the need for technical expertise and cost. The APIs for Dropbox and Google provide a lot of information about users’ files. There are heaps of metadata associated with cloud documents that might interest a repository and are only available through sending programmed requests to the APIs.
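To make “programmed requests” a little more concrete, here is a minimal Python sketch of what an authenticated request looks like once a token has been issued. It only builds the request and never sends it. The token is a placeholder, the helper name build_file_list_request is invented here, and the URL assumes the Drive v2 file listing endpoint.

```python
from urllib.request import Request

# Assumed endpoint: the Drive v2 file listing URL. Check the current docs before relying on it.
DRIVE_FILES_URL = "https://www.googleapis.com/drive/v2/files"


def build_file_list_request(access_token):
    """Build (but do not send) a GET request carrying an OAuth 2.0 bearer token."""
    headers = {"Authorization": f"Bearer {access_token}"}
    return Request(DRIVE_FILES_URL, headers=headers)


# The token below is a placeholder, not a real credential.
request = build_file_list_request("PLACEHOLDER_TOKEN")
print(request.get_method(), request.full_url)
```

Even this small step presumes someone has already registered an application and walked through the authorization flow to obtain the token, which is exactly the overhead a traditional repository may not be staffed for.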

Metadata and Contextual Information

When you create a document on Google Docs or add something to a Dropbox folder, a significant amount of computational work takes place, virtually all of it hidden from the user. Native formats for Google Docs are anyone’s guess, and odds are they are probably drawn from multiple services spread across multiple servers and then assembled through complex JavaScript and AJAX queries. In order to remove these items from the service, a user can choose to download them as a selection of different files. Google Documents (text documents) are downloadable as PDF, Microsoft Word (.docx), OpenOffice (.odt), Rich Text (.rtf), Plain Text (.txt) or HTML files. These files don’t exist until downloaded and do not include most of the metadata generated for the file. The downloaded version is just one manifestation of the data Google has associated with the document. To get the rest you have to use calls to the API.

The downloaded documents also shed information not expressible in the format. Comments added to documents by a user are saved in Microsoft Word downloads, but removed for PDF downloads. Additionally, while Google is tracking revisions to files, that information is not present in the downloaded copies.
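Because each format sheds something different, one option is to avoid betting on a single download and grab every export on offer. The Python sketch below plans those downloads. It assumes the file’s metadata includes a mapping from MIME type to export URL (the Drive v2 file resource calls this exportLinks, but treat that name as an assumption), and the example URLs are invented.

```python
# MIME types for the text document export formats listed above.
EXPORT_EXTENSIONS = {
    "application/pdf": ".pdf",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": ".docx",
    "application/vnd.oasis.opendocument.text": ".odt",
    "application/rtf": ".rtf",
    "text/plain": ".txt",
    "text/html": ".html",
}


def plan_exports(title, export_links):
    """Return (filename, url) pairs for every recognised export format, sorted by filename."""
    plan = []
    for mime_type, url in export_links.items():
        extension = EXPORT_EXTENSIONS.get(mime_type)
        if extension is not None:  # skip formats we have no extension for
            plan.append((title + extension, url))
    return sorted(plan)


# Made-up metadata for illustration only.
sample_links = {
    "application/pdf": "https://example.invalid/export?format=pdf",
    "text/plain": "https://example.invalid/export?format=txt",
    "application/x-unknown": "https://example.invalid/export?format=other",
}
for filename, url in plan_exports("design-notes", sample_links):
    print(filename, url)
```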

Revisions

According to the Google Drive SDK API, Google is tracking a significant amount of metadata for each file on Drive. Revisions also have a detailed metadata trail that is only accessible through code. Another issue is that revisions to documents are discarded over time, so there isn’t really a concept of versioning to the files. Files can be “pinned” to specific revisions by setting a specific flag through the API, but that option is not available to anyone without programming knowledge. After a long enough period of time Google services will prune revisions from less active documents. Therefore, though some of the shared documents do still have histories, I don’t know if they are inclusive of most changes.
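As a sketch of what working with that revision metadata might look like, the Python below takes a list of revision records and picks out the ones that are not pinned, and so may eventually be pruned. The field names id, modifiedDate and pinned are assumed to match the revision resource in the Drive v2 API, and the sample records are invented.

```python
from datetime import datetime, timezone


def parse_timestamp(value):
    """Parse a UTC timestamp such as '2012-10-01T12:00:00.000Z'."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def unpinned_revisions(revisions):
    """Return the ids of revisions that are not pinned, oldest first."""
    at_risk = [rev for rev in revisions if not rev.get("pinned", False)]
    at_risk.sort(key=lambda rev: parse_timestamp(rev["modifiedDate"]))
    return [rev["id"] for rev in at_risk]


# Invented sample records; real ones would come back from an API call.
sample_revisions = [
    {"id": "30", "modifiedDate": "2012-11-05T09:30:00.000Z", "pinned": False},
    {"id": "12", "modifiedDate": "2012-10-01T12:00:00.000Z", "pinned": True},
    {"id": "21", "modifiedDate": "2012-10-15T18:45:00.000Z"},
]
print(unpinned_revisions(sample_revisions))
```

A repository could run something like this against each shared document to decide which revisions to pin or copy out first.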

Dropbox presents a similar issue in that most of the metadata associated with a file is also available only through the API. Thankfully, since Dropbox is a small company, the API is a lot simpler and less intense than Google’s offering. Additionally, Dropbox’s online application does allow one to view the previous versions of a file, but it also deletes version history after 30 days unless you pay extra for their Packrat add-on. All the files in Prom Week’s Dropbox now have only one version, including drafts of academic publications. All that data is now gone.

Since Google at least keeps some of the revisions around for a decent period of time, it would be feasible to write some add-on or application to back up the revisions (even to Google Drive). A problem with this is deciding which revisions are meaningful and which are just mild edits. Each revision would then be its own full document, instead of just the fragmentary diff common to source control programs like Git.
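One crude way to separate meaningful revisions from mild edits is to measure how much of the text changed between versions. The Python sketch below uses the standard difflib module for that, and also shows the diff-style alternative to storing each revision whole. The 0.9 threshold and the function names are arbitrary choices for illustration, not anything either service provides.

```python
import difflib


def meaningful_revisions(texts, threshold=0.9):
    """Return indexes of revisions that differ enough from the last kept one.

    The first revision is always kept. A later revision is kept when its
    similarity ratio to the most recently kept revision is below threshold.
    """
    if not texts:
        return []
    kept = [0]
    for index in range(1, len(texts)):
        matcher = difflib.SequenceMatcher(None, texts[kept[-1]], texts[index])
        if matcher.ratio() < threshold:
            kept.append(index)
    return kept


def revision_diff(old, new):
    """Return a unified diff between two revisions as a list of lines."""
    return list(difflib.unified_diff(
        old.splitlines(), new.splitlines(), "old", "new", lineterm=""))


drafts = [
    "The quick brown fox.",
    "The quick brown fox!",  # a one-character tweak
    "Completely different paragraph about archives.",
]
print(meaningful_revisions(drafts))
print("\n".join(revision_diff("a\nb", "a\nc")))
```

Character-level similarity is a blunt instrument: a single corrected date could matter more to a historian than a rewritten paragraph, so a real tool would probably keep everything and use a measure like this only to flag revisions for review.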

More on the APIs later, if I have time to work out some scripts and such.