Don’t Look Back: Cloud APIs and Revisions

Cloud APIs represent an interesting problem for the archival security of documents over extended periods. Storing files with a secondary source (in this case I’ll be discussing Google Docs and Dropbox) is usually the most convenient option. Google Docs provides a means to create, edit, share and browse documents without leaving the confines of a browser. Since I’m usually doing research or testing applications within that environment, having some of my documents there is definitely a useful addition to my workflow. Similarly, Dropbox’s main appeal is that it shadows the cloud storage of my files by being present as a locally accessible folder. Anything I drop in there then gets whisked up to the mystical cloud, floating into the immense sea of offsite data storage.

The obvious problem with cloud storage and a definite reason for skepticism is the reliance on secondary corporate structures to persist. Granted, the stability of the backups on Google Docs or Dropbox is probably greater than on my local hard drive but there is still something a little unnerving about giving up that control. From an archival perspective, which is very…very long, there is a general fear about the lasting permanence of cloud institutions. Google and Dropbox seem totally stable right now, but will I be able to leave something on there for 20 years and still be able to retrieve it? Technological turnover is astoundingly fast and many users have been burned by storage companies blinking out of existence. A good example was the shutdown of MegaUpload, in which 25 petabytes of user data vanished after a government raid.

The goal is to find the most stable possible place for long term digital storage. And although they are sometimes derided as slow and conservative, government and educational institutions generally have the best track record for long term retention. So the question, in light of all the online documents produced by game development, is what is the best solution to the probable collapse of a cloud service? Do you just migrate the stored data to a more secure digital repository? What data do you take, or more specifically, what data is available? This last question is where access to cloud service APIs becomes important because they are the main conduit for detailed documentary information and also the limiting factor for getting data off of a service.

Permission and Access

Both Google and Dropbox require user authentication to access cloud files. This is necessary for security and is at the heart of the trust networks that allow these services to function. Through a weird inversion, this security framework is also a liability for archives. Since getting access to files requires membership to either service, if a group wants to share data with a repository through the service, that repository must then become a member. The problem here is that as an archivist I should not, in any way, be inserting myself into the material. Yes, when you prepare a collection you make tons of decisions about potential organization and delineation betwixt documents, but I’m not going to write my name on every one. When a Dropbox or Google documents folder is shared, the metadata associated with that transaction becomes permanent, so I effectively become a member of the project team from the viewpoint of the service.

Now, as far as future recommendations go, I don’t know if this type of thing is avoidable. For instance, in Google Docs you can assign privileges for a file and thus make it impossible for the archival entity to modify the document. Sadly, this option is not available on Dropbox, so if I accidentally do something to the files on my machine I could permanently overwrite some actual historical document. And yes, I could always make sure to make a full, non-synced copy of the folder somewhere else and then recursively modify those permissions, but my main point is that this is a ton more to think about and manage on a technical level than with a standard digital document.
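The "full, non-synced copy" idea above can be sketched in a few lines of Python. This is only a rough illustration, assuming a POSIX-style permission model; the function name and folder names are made up for the example, and on Windows the chmod call only toggles the read-only flag.

```python
import os
import shutil
import stat
from pathlib import Path


def make_read_only_copy(synced_dir, archive_dir):
    """Copy a synced folder elsewhere and strip write permission from the copied files."""
    source = Path(synced_dir)
    target = Path(archive_dir)
    # copytree refuses to overwrite an existing target, which is what we want here.
    # Its default copy function (copy2) also tries to preserve timestamps.
    shutil.copytree(source, target)

    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    for root, _dirs, files in os.walk(target):
        for name in files:
            path = Path(root) / name
            # Clear every write bit but leave read/execute bits as they were.
            path.chmod(path.stat().st_mode & ~write_bits)
    return target
```

Something like make_read_only_copy("Dropbox/PromWeek", "/archive/PromWeek") would then give a copy that a stray save can't clobber. Directories are left writable in this sketch, so files could still be deleted or added; it guards against accidental edits, not against everything.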

Access to the various APIs for these services is also based on membership. If I want to use Google’s or Dropbox’s JavaScript API, I need to register with them as a developer and then register my potential application of the API to get a token allowing access to the service. Now, if you don’t follow that immediately, you can see where some problems start to set in. I’ve developed web applications so all this is pretty straightforward to me. To libraries or other traditional repository structures it means the need for technical expertise and cost. The APIs for Dropbox and Google provide a lot of information about users’ files. There are heaps of metadata associated with cloud documents that might interest a repository and are only available through sending programmed requests to the APIs.

Metadata and Contextual Information

When you create a document on Google Docs or add something to a Dropbox folder, a significant amount of computational work takes place, virtually all of it hidden from the user. Native formats for Google Docs are anyone’s guess, and odds are they are drawn from multiple services spread across multiple servers and then assembled through complex JavaScript and AJAX queries. In order to remove these items from the service, a user can choose to download them in a selection of different formats. Google Documents (text documents) are downloadable as PDF, Microsoft Word (.docx), OpenOffice (.odt), Rich Text (.rtf), Plain Text (.txt) or HTML files. These files don’t exist until downloaded and do not include most of the metadata generated for the file. The downloaded version is just one manifestation of the data Google has associated with the document. To get the rest you have to use calls to the API.

The downloaded documents also shed information not expressible in the format. Comments added to documents by a user are saved in Microsoft Word downloads, but removed for PDF downloads. Additionally, while Google is tracking revisions to files, that information is not present in the downloaded copies.

Revisions

According to the Google Drive SDK API documentation, Google is tracking a significant amount of metadata for each file on Drive. Revisions also have a detailed metadata trail that is only accessible through code. Another issue is that revisions to documents are discarded over time, so there isn’t really a concept of versioning to the files. Revisions can be “pinned” by setting a specific flag through the API, but that option is not available to anyone without programming knowledge. After a long enough period of time Google services will prune revisions from less active documents. Therefore, though some of the shared documents do still have histories, I don’t know if they are inclusive of most changes.
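To give a sense of what "only accessible through code" means, here is a minimal sketch that builds (but never sends) the two HTTP requests involved. It rests on some assumptions to check against the current documentation: it targets the later v3 version of the Drive REST API, where, as far as I know, the pin flag is named keepForever (the older API called it pinned). The function names, file ID, revision ID and token are all placeholders.

```python
import json
import urllib.request

# Assumed base URL for the v3 Drive REST API.
DRIVE_API = "https://www.googleapis.com/drive/v3"


def list_revisions_request(file_id, token):
    """Build (but do not send) a request for a file's revision list."""
    url = f"{DRIVE_API}/files/{file_id}/revisions"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def keep_revision_request(file_id, revision_id, token):
    """Build (but do not send) a request asking Drive to retain one revision."""
    url = f"{DRIVE_API}/files/{file_id}/revisions/{revision_id}"
    # Assumption: v3 names the pin flag "keepForever".
    body = json.dumps({"keepForever": True}).encode("utf-8")
    headers = {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers, method="PATCH")
```

Passing either request to urllib.request.urlopen would actually send it, but only with a real OAuth token from a registered application, which is exactly the membership hurdle described earlier.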

Dropbox presents a similar issue in that most of the metadata associated with a file is also available only through the API. Thankfully since Dropbox is a small company, the API is a lot simpler and less intense than Google’s offering. Additionally, Dropbox’s online application does allow one to view the previous versions of a file, but also deletes version history after 30 days unless you pay extra for their Packrat add-on. All the files in Prom Week’s Dropbox now have only one version, including drafts of academic publications. All that data is now gone.

Since Google at least keeps some of the revisions around for a decent period of time, it would be feasible to write some add-on or application to back up the revisions (even to Google Drive). A problem with this is deciding which revisions are meaningful and which are just mild edits. Each revision would then be its own full document, instead of just the fragmentary diff common to source control programs like Git.
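One crude way to separate "meaningful" revisions from mild edits is to compare each snapshot against the last one kept and only keep it when the text has drifted far enough. The sketch below uses Python's difflib for that. The 0.9 similarity threshold is an arbitrary assumption, the function name is made up, and it presumes the revisions have already been exported as plain text, which is its own problem.

```python
import difflib


def significant_revisions(revisions, threshold=0.9):
    """Return indexes of revisions that differ noticeably from the last one kept.

    `revisions` is a list of full-text snapshots, oldest first.
    """
    if not revisions:
        return []
    kept = [0]  # always keep the earliest snapshot
    for index in range(1, len(revisions)):
        previous = revisions[kept[-1]]
        matcher = difflib.SequenceMatcher(None, previous, revisions[index], autojunk=False)
        # ratio() is 1.0 for identical texts and falls toward 0.0 as they diverge.
        if matcher.ratio() < threshold:
            kept.append(index)
    # Always keep the final state, even if the last edit was small.
    if kept[-1] != len(revisions) - 1:
        kept.append(len(revisions) - 1)
    return kept
```

Whether a similarity ratio is a good stand-in for "meaningful" is debatable: a one-word change to a thesis statement matters more than a reflowed paragraph, and this heuristic can't tell the difference.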

More on the APIs later, if I have time to work out some scripts and such.

Lost in the Cloud II: The Thickening

My previous post on identifying file formats introduced my three approaches to uncovering evasive file definitions: examine the context, search the internet and mine the header. In this post I’m returning again to those methods and the files from Prom Week’s Dropbox folder to highlight more file format related issues. This time I’ll focus on versioning issues and software obscurity.

First, since I didn’t elaborate on this in the last post, the purpose for this type of investigation is to ascertain what a future researcher would need, in terms of executable resources (operating systems, programs, hardware configurations, etc.) to properly run and examine the work produced by a development team. In the case of just a subset of the shared cloud files for Prom Week the answer is surprisingly extensive and diverse. Looking through over 2000 files in the folder, I identified 60 different file types requiring the interpretation of at least 29 programs.

I’m plopping the list here for effect; this is not a final assessment of the required programs, as I’ll explain in a sec. The file extensions are on the left, notes on the right.

  1. .txt – text files
  2. .docx – Microsoft Office Document 2007 or later, archive / ooxml filetype
  3. .svg – Scalable Vector Graphics
  4. .vue – Visual Understanding Environment (VUE) file (Tufts University)
  5. .png – Portable Network Graphics
  6. .zip – ZIP File Archive
  7. .pptx – Microsoft Powerpoint Files 2007 or later
  8. .ppt – Microsoft Powerpoint Files
  9. .pdf – Portable Document Format
  10. .xlsx – Microsoft Excel Files 2007 or later
  11. .as3proj – FlashDevelop ActionScript 3.0 Project File
  12. .xml – Extensible Markup Language
  13. .bat – Windows Batch File
  14. .swf – Shockwave / Flash Files
  15. .old – “Old” backup file
  16. .mxml – Flex meta XML file format
  17. .as – Actionscript 3.0 file
  18. .jpg – JPEG Joint Photographic Expert Group image file
  19. .java – Java Source File
  20. .fla – Adobe Flash File
  21. .pbm – portable bitmap file (Netpbm)
  22. .ai – Adobe Illustrator File
  23. .html – Hypertext Markup Language
  24. .css – Cascading Style Sheet
  25. .js – Javascript File
  26. .doc – Microsoft Office Document
  27. .psd – Adobe Photoshop File
  28. .gif – Graphics Interchange Format image file
  29. .avi – Audio Video Interleave
  30. .camproj – Camtasia Studio project file (Camtasia Studio 7.0)
  31. .rtf – Rich Text Format
  32. .swc – Compiled Shockwave Flash File
  33. .potx – Microsoft PowerPoint Template File (XML Format)
  34. .xcf – GIMP (GNU Image Manipulation Program) file format
  35. .vpk – VUE package file (for sharing VUE maps)
  36. .camrec – Camtasia Studio recording file
  37. .mp3 – MPEG-1 or MPEG-2 Audio Layer III digital audio file
  38. .jpeg – Alternate extension for jpg file
  39. .mov – Quicktime video file
  40. .pages – Pages document file
  41. directory file – Folder file
  42. .dropbox – Dropbox configuration file
  43. .wav – Waveform audio file format
  44. .au – Audacity block file (NOT Sun .au audio file)
  45. .aup – Audacity project file
  46. .aup.bak – Audacity project file backup
  47. .bak – Backup File
  48. .lel – from “robert-portrait Logon.lel” filename clue, Windows 7 Logon editor file (http://www.tweakscene.com/viewtopic.php?f=149&t=4614)
  49. .dll – Dynamic-link Library Microsoft Windows shared library
  50. .pyd – same as a .dll though written in Python (http://docs.python.org/2/faq/windows.html#is-a-pyd-file-the-same-as-a-dll)
  51. .sql – Structured Query Language File
  52. .php – PHP language file
  53. .odp – OpenDocument presentation file (OpenOffice Impress)
  54. .mp4 – MPEG-4 Part 14 multimedia format
  55. .7z – 7Zip Archive File
  56. .odt – OpenDocument text document (OpenOffice Writer)
  57. .fxp – Adobe Flash Builder Flex Project File
  58. .csv – Comma-Separated Value file
  59. .m4v – Apple video format
  60. .xmpses – Adobe Premiere Elements DVD Marker file

Aside from being a rather imposing listing, at least to my archival mind, this crash of files highlights the difficulty in finding relevant programs to read and run individual entries.
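A first pass at a census like the one above doesn't have to be done by hand. As a rough sketch (the function name is made up, and this only looks at extensions, which is the least trustworthy of the three identification methods), something like this tallies a folder:

```python
from collections import Counter
from pathlib import Path


def extension_census(folder):
    """Count files under `folder` by lowercased extension."""
    counts = Counter()
    for path in Path(folder).rglob("*"):
        if path.is_file():
            # Path.suffix only sees the last dot, so "x.aup.bak" counts as ".bak".
            counts[path.suffix.lower() or "(no extension)"] += 1
    return counts
```

Calling extension_census(folder).most_common() lists the extensions from most to least frequent. It says nothing about what the files actually are; that still takes the context, the internet and the header.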

Versions

The first problem is the need to find the correct version of a program for a specific file. For example, 7 and 8 on the list above are .pptx and .ppt files for Microsoft Powerpoint. The former is an XML-based document scheme used for Microsoft Office documents created in 2007 or later. The latter is a pre-Office 2007 Powerpoint document, or a document saved as a .ppt in a post-Office 2007 Powerpoint. I don’t have a way to tell which without doing some immediate digging.
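A small part of that digging can be automated by mining the header. The newer Office formats are ZIP archives and the older binary ones are OLE2 compound files, and each starts with a well-known byte signature. The sketch below (function name made up) checks which container a file really is, regardless of its extension. To be clear about its limits: it does not tell you which version of Powerpoint saved a .ppt, which is the harder question.

```python
ZIP_MAGIC = b"PK\x03\x04"  # start of a ZIP archive (.pptx, .docx, .xlsx)
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # OLE2 compound file (.ppt, .doc)


def sniff_office_container(path):
    """Return "zip", "ole2" or "unknown" based on the first bytes of the file."""
    with open(path, "rb") as handle:
        head = handle.read(8)
    if head.startswith(ZIP_MAGIC):
        return "zip"
    if head.startswith(OLE2_MAGIC):
        return "ole2"
    return "unknown"
```

A .ppt that comes back as "zip" or a .pptx that comes back as "ole2" has been renamed somewhere along the way, which is worth knowing before trusting the extension.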

Another version issue arises with Adobe Creative Suite programs. Adobe Illustrator (22) and Photoshop (27) files have maintained the same file extension for a large number of versions. Some features in older files are not reproducible in newer versions of the software and vice-versa. This creates an issue when you load an older file in a newer program; it attempts to change it to the newer format, thus making it unreadable to the program that created the file and potentially damaging the contents of the file itself. It’s effectively destroying the provenance of the file, which is a huge no-no. In order to ensure that everything remained in the exact condition that the developers left it, an archivist or researcher would need to identify the file type and then find the corresponding version of the software that created the file.

A good example that I worked through was for the .camproj file type (30) above. This file is produced by Camtasia Studio, a screen capture program for Mac and Windows popular with screen casters, and also used here to record demo gameplay videos for Prom Week. After finding the file type through a simple search, I mined the header anyway just to see if there was anything interesting in there. It turned out to be a good example of a potential solution to the version issue: sometimes the information is just explicitly present in the file.

.camproj file header with program version

That seems pretty straightforward, but since that is just a project data version and not a program version, I ventured a little further and found the path to the program’s executable.

.camproj directory illustrating 7.0 application folder

So there, sometimes it’s not too difficult to find the version of a specific file. Since .camproj is also just a text file, it would be rather trivial to write a script or parser to analyze and provide the correct file version without snooping. However, the approach here would not generally scale, especially if the information in the file is in an arbitrary, non-textual form and could then only be identified through structural analysis. One would have to write a program to do statistical reasoning over a corpus of filetypes, and who would be that nuts?…Oh Harvard? Harvard and JSTOR…really…huh. So yeah, there is a program called JHOVE that does do that and I’ll be looking at it in a future post. By JHOVE!
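For the easy, text-based case, the "trivial script" might look something like the sketch below. It just scans text for the word "version" followed closely by a dotted number. The pattern, the function name and the sample line are all invented for illustration; they are not taken from the actual Camtasia project format.

```python
import re

# "version", up to five separator characters (=, ", :, space...), then a dotted number.
VERSION_PATTERN = re.compile(r"version\W{0,5}(\d+(?:\.\d+)+)", re.IGNORECASE)


def find_versions(text):
    """Return (line_number, version) pairs for every version-like string in the text."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for match in VERSION_PATTERN.finditer(line):
            hits.append((number, match.group(1)))
    return hits


# Hypothetical project file content, made up for this example.
sample = '<ExampleProject version="7.0">\n  <Clip name="demo"/>\n</ExampleProject>'
print(find_versions(sample))
```

As with the header example, this finds a data or schema version only when the file happens to spell one out. It won't catch whole-number versions, and it is no help at all for binary formats.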

Obscurity 

Even after you’ve found a file type, or rather think you’ve found a file type, there are still other potential barriers to getting it running. Two file types in the list, the .vue file (4) and the .lel file (48), illustrated the difficulties better than most. Starting with the .vue file, which is a Visual Understanding Environment file for mind-mapping software developed at Tufts University, I hit some identification road blocks when searching the internet. The .vue extension is also commonly used for three-dimensional geometry files (something that would conceivably be found in a game) and for Microsoft FoxPro, a data schema development application (I hadn’t heard of it either). Since both of these uses apparently aren’t uncommon, I figured one of them might be correct. Upon further analysis, however, it became obvious that there was no use for 3D geometry in a strictly 2D Flash game, so that file type was out. Additionally, I couldn’t find any mention of FoxPro in the group email or discussion and it didn’t seem to make much sense, especially since the filenames for .vue files didn’t line up with data schema type uses.

Eventually, after I had given up on the .vue file (since its header didn’t have anything helpful), I was examining another file, a .vpk (35), through a text editor and lo and behold, there was some additional information:

.vpk file header

After seeing the Tufts reference in there, I went back to the Google and found the Visual Understanding Environment project page. This allowed me to identify the .vpk file and the complementary .vue file! As should be evident, there will even be confusion when you think you’ve found a game related file type but it doesn’t fit the development context. Another issue is that the .vue and .vpk files are proprietary formats used for a little known application. In the future it will probably become even more difficult to locate a correct binary.

Even more extreme is the .lel file (48). Searching the internet for a decent amount of time finally led me to a forum posting mentioning the format. The post discusses a Windows 7 log on screen modification program, which I figured was correct since the file I was trying to identify was called “robert-portrait Logon.lel”. The link to the program from that post had expired, but given that I now knew it was associated with a specific program, I searched for and found a newer version. Sadly, both the application link there and its mirror are now gone and I can’t find them elsewhere, meaning I have no way to open the .lel file nor verify it. The log on screen application is probably the most obscure program of the thirty or so I found. Its existence was the most in danger, given that it’s a small, non-professional application made for a subset of the DeviantArt community.

Obscurity of software is a double-edged sword for preservation. If something is really obscure and made for a small user base it usually isn’t that large or complex (yes I know some people write their own flight simulators). So it’s generally easy to just save a copy with the data you want it to read. The log on editor program file is probably very small, and I could have just wrapped it up with all the other data in the project without much fuss. Therefore, after sussing out file types, one should definitely see if any of their dependent programs are in immediate danger of disappearing, or they might not be recoverable. The .lel file was created less than three years ago and is now unreadable. Things go away fast on the Internet. A main point of this work is to highlight the transience and instability of data online. Another site mentioning the .lel file literally told me to just Google it, assuming that it would just be available somewhere. Sigh.

So there, a bit more information on file types. I’ll return to them at some point in the future, but I’m going to break up the posts a bit since there are so many things to talk about.