File Format Identification Resources and Apps

In two earlier posts I detailed the simple methods I use for file type identification and the issues that arise from them. Here I’m going to look at a few more resources for file type identification, including some applications and websites that host useful file type information. The first part will just discuss places to look; the second will be a mild analysis of some software built for digital repositories, including one tool from the UK National Archives. I’m using file type and format interchangeably from now on to make it less boring to write.

File Formats and Internets

Aside from searching for a file format on the Google or Wikipedia, both of which meet with sometimes limited success, there are some good places to check. One is Jason Scott’s effort at the Just Solve the Problem Project, in which he got Archive Team members and anyone on the Internet to rabidly organize a file formats wiki. The site has a rather large number of file formats, though your mileage may vary with more obscure proprietary types. At least if you find something new you can add it to the listings. I like this site since it tries to include many formats that might not be floating around the Internet, like internal Playstation development files and other internal formats. There’s also a great collection of links on the project proposal wiki with tons of other file format resources.

Another listing of file formats, which is much less complete but still potentially useful, is the UK National Archives’ file format database, PRONOM. The UK National Archives actually uses this database for file identification in one of the applications I’m going to mention below. It’s a little hard to parse at first and the UI needs some work, but there is some good information in there.


I’m going to look at three applications in brief, just to give a flavor of what’s available for free and immediately. There are tons of expensive computational forensics tools that you could use for this analysis. In fact, when I was working on the Stanford collections I had access to some; they were quite nice and around $5000 a license. FTK (Forensic Toolkit) is an example of this type of forensic software and its purpose is mostly document discovery in court cases. As such it is not interested in weird file format identification. It can, however, search a hard drive for any data that might be an image of a human. Not going any farther with that comment.

Back to the free, non-spook world, I’m going to look at JHOVE, JHOVE2 and DROID: three programs designed for digital repositories and libraries. Starting with JHOVE, or the JSTOR/Harvard Object Validation Environment, I’m going to outline an attempt to use the software for identification of file types. As is apparent from the name, JHOVE was an effort to provide document validation for university digital repositories. Validating a file is very important since it ensures that the file follows an accepted standard specification and thus will probably be readable in the future. Sadly, for my current purposes, this software is not very useful. JHOVE only validates and identifies files related to digitization of print material, which is the basis for most digital collections and certainly not related in the slightest to arbitrary software archives. JHOVE will tell you if a certain PDF is version 1 or 1.7, but it treats most non-text and non-image formats as black-boxed bitstreams. It is also not really supported at this point, so while it’s probably possible to modify its internals, I really don’t have the time.


JHOVE interpreting an unknown file type

JHOVE2 is the result of a two-year grant to try and improve the scope and functionality of the original JHOVE software. However, it seems to be only partly finished and is labor intensive. JHOVE2 provides for a larger range of file type identification but still only deals with the major print and image based formats of the original JHOVE. Additionally, JHOVE2 does not have a graphical user interface, so all analysis must be done either through the command line or by writing your own rules for the system and designing custom output. I could see the latter being great if you had a full-time staff member managing it, but the 40-page installation and usage manual is too intense for my current needs and time constraints. JHOVE2 also appears to have just run out of development time. Running the default configuration on a single text file gives me a 300+ line XML document that I can barely parse. Luckily, JHOVE2 uses the DROID application as a backend for its file format search, a fact that led me, and will lead this blog post, to brighter pastures.

Partial JHOVE2 Output for the Apache License

DROID, or Digital Record Object IDentification, is a Java program developed by the UK National Archives for file format identification. It accesses the online PRONOM database mentioned above and will try to identify the file types in a given folder. Since the PRONOM database does have a decent number of common formats, DROID was able to identify some common Actionscript and Flash development files. Although some of the more confusing files threw it off, specifically all the difficult ones I mentioned in this post, it also managed to find file extensions that I had missed the first time around. So that’s a total win! DROID has a nice reporting feature that will enumerate the different file extensions it found and give a count for each. While I would have liked full path information, at least this can give someone a basic overview of the proprietary or potentially confusing file formats immediately, allowing one to narrow later searches to just those files for identification.
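For the full path information I was missing, a few lines of Python get most of the way there. This is only a minimal sketch that groups files by extension; it does no signature matching, so it is not a stand-in for DROID or PRONOM. The function names are my own, not part of any tool mentioned here.

```python
import os
from collections import defaultdict


def extension_census(root):
    """Map each lowercased extension to the full paths of the files that use it."""
    census = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # splitext treats dotfiles such as ".dropbox" as having no extension.
            ext = os.path.splitext(name)[1].lower() or "(no extension)"
            census[ext].append(os.path.join(dirpath, name))
    return dict(census)


def format_census(census):
    """Render the census as text, most common extension first, paths indented."""
    lines = []
    ordered = sorted(census.items(), key=lambda item: (-len(item[1]), item[0]))
    for ext, paths in ordered:
        lines.append(f"{ext}\t{len(paths)}")
        for path in sorted(paths):
            lines.append(f"\t{path}")
    return "\n".join(lines)
```

Calling `print(format_census(extension_census("some/folder")))` lists each extension with its count and every path that carries it, which makes it easy to jump straight to the odd files.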



DROID Generated Report

Okay, so that’s a couple of the programs I’ve messed with to identify files. Next time join me for some basic email analysis thanks to the good folks at the Stanford MUSE Project.

Web Sites Forgotten

I’m not at a loss for diversity of approaches in my current archiving work on Prom Week. Teams of developers in the modern academy (or even the real world) need ways to share and evaluate information about their work and projects. As a result of the never-ending new services and means of disseminating digital information, there are generally quite a few places you need to look nowadays to find everything. For instance, say you are working on a project and it needs to be announced and shared. Well then you’ll start a blog for it, and maybe get mentioned in some press. All good, but those are now two more research objects potentially worthy of appraisal and ingestion into a digital archive. If the press builds up enough, there will be tens to hundreds of individual stories, some derivative, some not, about your work. Maybe even a famous person likes your stuff and mentions it on Twitter or Tumblr. You’d definitely want to have that saved for posterity, right?

In light of all the new modes of dissemination on the Internet, it has become imperative to find ways of aggregating and saving large amounts of online documents. This isn’t a reference to cloud documents; I’m mainly referring to anything involved or incumbent upon the development process that gets posted to some addressable folder on a server. I split this type of documentation into two types: dissemination documents and development documents. Dissemination documentation is everything I mentioned in the first paragraph. These are documents released to the online public by the developers, in the form of press releases, development blogs, and other affiliated blog posts. These documents, in turn, can produce community reactions in the form of press coverage, reviews, and other online references to the game. Development documents are online references to the game intended for the development team only. These can range from shared folders containing demonstration applications to group chat logs and meeting organization tools. Both of these document types sit on the accessible web and might be informative to later researchers, so how to organize and save it all?

Internet Archive?

One thing to do would be to rely on someone else to pick up the slack. In this instance I’m mainly referring to the Internet Archive and their Wayback Machine. Internet Archive’s tool regularly scrapes the Internet for pages, and stores and logs changes to a large number of websites. You can even browse through older versions of websites and see what they looked like before they were acquired. The major problem is that the crawler will respect robots.txt (a file located at the root of a web directory explicitly blocking crawls) and apparently remove websites upon request. Another issue is that the crawler is not omniscient, it ‘crawls’ and needs a medium through which to exercise that activity, namely hyperlinks. If you put up a file on your domain, don’t tell anyone the link, and don’t post it anywhere, there is no way for the crawler to find or know about it. As such, a large portion of the Internet does not function in a way conducive to crawling. Sites like Facebook block crawlers and as a result the Wayback Machine does not have a copy of the original Prom Week page. So, the short answer is no, if you want to save all dissemination documents, you’ll need to do it yourself.
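To make the robots.txt point concrete, here is a small sketch using Python’s standard `urllib.robotparser` module to ask whether a given crawler may fetch a given URL. The rules in `EXAMPLE_ROBOTS`, the `examplebot` user agent and the example.org URLs are all made up for illustration; this says nothing about how the Internet Archive’s crawler is actually configured.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that blocks every crawler from one directory.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if the robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With these rules, a polite crawler would skip anything under `/private/` on that site while still collecting the rest, which is exactly how material ends up missing from a crawl without anyone noticing.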

Certain applications, including most browsers, will let you save a webpage in its entirety, although the scope of “entirety” is up to the application. The two services I use for this are Zotero (for individual pages) and Archive-It. Zotero is a fantastic citation and reference management system and it will function as a browser plug-in. This allows you to save a website and have it attached to relevant metadata in a folder of your choosing. Archive-It is a commercial web application through which the Internet Archive provides paid access to their web crawler. Conveniently, unless a user of the service explicitly prevents it, anything crawled by a target user will also show up somewhere on the website. Archive-It is a better solution if you want to save a larger class of website. As an example I used the tool to save the Prom Week development blog. The crawler managed to grab quite a bit of information and all the blog posts at the domain. Surprisingly, just a handful of posts turned up over 243 URLs and 45 megabytes of data. So there are generally a lot more documents and data than one would assume. Granted, some of these are links to other services like Twitter and Facebook or to individually hosted images, so the document count is a little inflated. Still, even with this seemingly deep dive into a website it is possible to miss something significant.

Results from undirected Archive-It Crawl

One of the project’s developers stored some demo applications and explanatory web sites for conferences on his personal UCSC web space. After crawling his site I came up with 156 URLs, but none of those linked to the Prom Week documents. Since he hadn’t linked them to any file in his main directory the crawler didn’t find them. I had to feed the crawler individual URLs for each demo file and directory. Needless to say this would become untenable for even a moderately complex grouping of online documents.
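The reason the crawler missed those files is easy to see in a toy model. The sketch below stands in for a crawler: it is given a dictionary of pages (URL to HTML) and a list of seed URLs, and it can only reach what is linked from something it has already fetched. The URLs are hypothetical and nothing here touches the network or reflects how Archive-It is implemented.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(pages, seeds):
    """Return the set of known URLs reachable from the seeds by following links."""
    seen = set()
    queue = list(seeds)
    while queue:
        url = queue.pop()
        if url in seen or url not in pages:
            continue  # already visited, or outside our toy web
        seen.add(url)
        collector = LinkCollector()
        collector.feed(pages[url])
        queue.extend(urljoin(url, href) for href in collector.links)
    return seen
```

A page that nothing links to only shows up if its URL is passed in as an extra seed, which is the manual step described above.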

Ephemeral Sites

Along with development demos and files, another task that teams frequently engage with is scheduling. In looking through the email list archives for the project, I came across references to third party scheduling apps. These apps were used for organizing project meeting times and are still active on the web. This is a new type of documentation that I haven’t really seen discussed in the archival frame yet. Many people use simple online services to schedule meetings, share notes and even code. I figure most people don’t expect these texts to hang around, but a large number of them do. Since they are at the whims of whatever service is hosting them, it’s important for archival entities to be aware of the problems inherent in finding, appraising and saving such ephemeral documentation. I have quite a few still-active links to meetings the development team scheduled over two years ago. Although this information is not something I’d deem to be very important, it does further substantiate a number of things, including who was working on the project and when they were meeting. This type of information might be irrelevant for my current work, but it would be nifty to have the date for the first UI design meeting of Google or something similar. Awareness of the range of documentation can help prevent potentially valuable information from slipping through. The continuing problem is knowing, in advance, what the relevant documents are.

Don’t Look Back: Cloud APIs and Revisions

Cloud APIs represent an interesting problem for the archival security of documents over extended periods. Storing files with a secondary source, in this case I’ll be discussing Google Docs and Dropbox, is usually a most convenient option. Google Docs provides a means to create, edit, share and browse documents without leaving the confines of a browser. Since I’m usually doing research or testing applications within that environment, having some of my documents there is definitely a useful addition to my workflow. Similarly, Dropbox’s main appeal is that it shadows the cloud storage of my files by being present as a locally accessible folder. Anything I drop in there then gets whisked up to the mystical cloud, floating into the immense sea of offsite data storage.

The obvious problem with cloud storage and a definite reason for skepticism is the reliance on secondary corporate structures to persist. Granted, the stability of the backups on Google Docs or Dropbox is probably greater than on my local hard drive but there is still something a little unnerving about giving up that control. From an archival perspective, which is very…very long, there is a general fear about the lasting permanence of cloud institutions. Google and Dropbox seem totally stable right now, but will I be able to leave something on there for 20 years and still be able to retrieve it? Technological turnover is astoundingly fast and many users have been burned by storage companies blinking out of existence. A good example was the shutdown of MegaUpload, in which 25 petabytes of user data vanished after a government raid.

The goal is to find the most stable possible place for long term digital storage. And although they are sometimes derided as slow and conservative, government and educational institutions generally have the best track record for long term retention. So the question, in light of all the online documents produced by game development, is what is the best solution to the probable collapse of a cloud service? Do you just migrate the stored data to a more secure digital repository? What data do you take, or more specifically, what data is available? This last question is where access to cloud service APIs becomes important because they are the main conduit for detailed documentary information and also the limiting factor for getting data off of a service.

Permission and Access

Both Google and Dropbox require user authentication to access cloud files. This is necessary for security and is at the heart of the trust networks that allow these services to function. Through a weird inversion, this security framework is also a liability for archives. Since getting access to files requires membership in either service, if a group wants to share data with a repository through the service, that repository must then become a member. The problem here is that as an archivist I should not, in any way, be inserting myself into the material. Yes, when you prepare a collection you make tons of decisions about potential organization and delineation betwixt documents, but I’m not going to write my name on every one. When a Dropbox or Google documents folder is shared, the metadata associated with that transaction becomes permanent, so I effectively become a member of the project team from the viewpoint of the service.

Now, as far as future recommendations go, I don’t know if this type of thing is avoidable. For instance, in Google Docs you can assign privileges for a file and thus make it impossible for the archival entity to modify the document. Sadly, this option is not available on Dropbox, so if I accidentally do something to the files on my machine I could permanently overwrite some actual historical document. And yes, I could always make sure to make a full, non-synced copy of the folder somewhere else and then recursively modify those permissions, but my main point is that this is a ton more to think about and manage on a technical level than with a standard digital document.
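For what it’s worth, the “full, non-synced copy with modified permissions” step could look something like the following sketch. It copies a folder to a location outside the synced directory and then strips the write bits from every copied file. The function name and paths are placeholders of mine, and on Windows `os.chmod` can only toggle the read-only flag, so treat this as a rough illustration rather than a preservation workflow.

```python
import os
import shutil
import stat

WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def make_read_only_copy(source, destination):
    """Copy source to destination (which must not exist) and make the files read-only."""
    # copytree uses copy2 by default, which tries to keep timestamps intact.
    shutil.copytree(source, destination)
    for dirpath, _dirnames, filenames in os.walk(destination):
        for name in filenames:
            path = os.path.join(dirpath, name)
            mode = os.stat(path).st_mode
            os.chmod(path, mode & ~WRITE_BITS)  # clear write permission for everyone
    return destination
```

The synced originals are left untouched, so an accidental save in the working copy fails instead of propagating back up to the shared folder.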

Access to the various APIs for these services is also based on membership. If I want to use Google’s or Dropbox’s Javascript API, I need to register with them as a developer and then register my potential application of the API to get a token allowing access to the service. Now, if you don’t follow that immediately, you see where some problems start to set in. I’ve developed web applications so all this is pretty straightforward to me. To libraries or other traditional repository structures it means the need for technical expertise and cost. The APIs for Dropbox and Google provide a lot of information about users’ files. There are heaps of metadata associated with cloud documents that might interest a repository and are only available through sending programmed requests to the APIs.

Metadata and Contextual Information

When you create a document on Google Docs or add something to a Dropbox folder, a significant amount of computational work takes place, virtually all of it hidden from the user. Native formats for Google Docs are anyone’s guess, and odds are they are probably drawn from multiple services spread across multiple servers and then assembled through complex Javascript and AJAX queries. In order to remove these items from the service, a user can choose to download them as a selection of different files. Google Documents (text documents) are downloadable as PDF, Microsoft Word (.docx), OpenOffice (.odt), Rich Text (.rtf), Plain Text(.txt) or HTML files. These files don’t exist until downloaded and do not include most of the metadata generated for the file. The downloaded version is just one manifestation of the data Google has associated with the document. To get the rest you have to use calls to the API.

The downloaded documents also shed information not expressible in the format. Comments added to documents by a user are saved in Microsoft Word downloads, but removed for PDF downloads. Additionally, while Google is tracking revisions to files, that information is not present in the downloaded copies.


According to the Google Drive SDK API, Google is tracking a significant amount of metadata for each file on Drive. Revisions also have a detailed metadata trail that is only accessible through code. Another issue is that revisions to documents are discarded over time, so there isn’t really a concept of versioning to the files. Files can be “pinned” to specific revisions by setting a specific flag through the API, but that option is not available to anyone without programming knowledge. After a long enough period of time Google services will prune revisions from less active documents. Therefore, though some of the shared documents do still have histories, I don’t know if they are inclusive of most changes.

Dropbox presents a similar issue in that most of the metadata associated with a file is also available only through the API. Thankfully since Dropbox is a small company, the API is a lot simpler and less intense than Google’s offering. Additionally, Dropbox’s online application does allow one to view the previous versions of a file, but also deletes version history after 30 days unless you pay extra for their Packrat add-on. All the files in Prom Week’s Dropbox now have only one version, including drafts of academic publications. All that data is now gone.

Since Google at least keeps some of the revisions around for a decent period of time, it would be feasible to write some add-on or application to back up the revisions (even to Google Drive). A problem with this is deciding which revisions are meaningful and which are just mild edits. Each revision would then be its own full document, instead of just the fragmentary diff common to source control programs like Git.
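One crude way to separate meaningful revisions from mild edits is to score how much text changed between them. The sketch below assumes the revisions have already been exported as plain text, in order; it uses Python’s standard `difflib`, and the 0.2 threshold is an arbitrary number I picked for illustration, not something derived from the Prom Week documents.

```python
import difflib


def change_ratio(old_text, new_text):
    """Return 0.0 for identical texts, approaching 1.0 as they share less."""
    matcher = difflib.SequenceMatcher(None, old_text, new_text)
    return 1.0 - matcher.ratio()


def significant_revisions(revisions, threshold=0.2):
    """Return indexes of revisions that differ enough from the last one kept."""
    kept = []
    for index, text in enumerate(revisions):
        if not kept or change_ratio(revisions[kept[-1]], text) > threshold:
            kept.append(index)
    return kept


def revision_diff(old_text, new_text):
    """Return a Git-style unified diff between two revisions."""
    diff = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        "previous",
        "current",
        lineterm="",
    )
    return "\n".join(diff)
```

Comparing against the last revision that was kept, rather than the immediately preceding one, means a long run of small edits still registers once they add up. Whether a character-level score actually matches what a researcher would call “meaningful” is a separate question.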

More on the APIs later, if I have time to work out some scripts and such.

Lost in the Cloud II: The Thickening

My previous post on identifying file formats introduced my three approaches to uncovering evasive file definitions. Examine the context, search the internet and mine the header. In this post I’m returning again to those methods and the files from Prom Week’s Dropbox folder to highlight more file format related issues. This time I’ll focus on versioning issues and software obscurity.

First, since I didn’t elaborate on this in the last post, the purpose for this type of investigation is to ascertain what a future researcher would need, in terms of executable resources (operating systems, programs, hardware configurations, etc.) to properly run and examine the work produced by a development team. In the case of just a subset of the shared cloud files for Prom Week the answer is surprisingly extensive and diverse. Looking through over 2000 files in the folder, I identified 60 different file types requiring the interpretation of at least 29 programs.

Plopping the list here for effect, this is not a final assessment of the required programs, as I’ll explain in a sec. The file extensions are on the left, notes on the right.

  1. .txt – text files
  2. .docx – Microsoft Office Document 2007 or later, archive / ooxml filetype
  3. .svg – Scalable Vector Graphics
  4. .vue – Visual Understanding Environment (VUE) file (Tufts University)
  5. .png – Portable Network Graphics
  6. .zip – ZIP File Archive
  7. .pptx – Microsoft PowerPoint Files 2007 or later
  8. .ppt – Microsoft PowerPoint Files
  9. .pdf – Portable Document Format
  10. .xlsx – Microsoft Excel Files 2007 or later
  11. .as3proj – FlashDevelop ActionScript 3.0 Project File
  12. .xml – Extensible Markup Language
  13. .bat – Windows Batch File
  14. .swf – Shockwave / Flash Files
  15. .old – “Old” backup file
  16. .mxml – Flex meta XML file format
  17. .as – Actionscript 3.0 file
  18. .jpg – JPEG Joint Photographic Expert Group image file
  19. .java – Java Source File
  20. .fla – Adobe Flash File
  21. .pbm – portable bitmap file (Netpbm)
  22. .ai – Adobe Illustrator File
  23. .html – Hypertext Markup Language
  24. .css – Cascading Style Sheet
  25. .js – Javascript File
  26. .doc – Microsoft Office Document
  27. .psd – Adobe Photoshop File
  28. .gif – Graphics Interchange Format image file
  29. .avi – Audio Video Interleave
  30. .camproj – Camtasia Studio project file (Camtasia Studio 7.0)
  31. .rtf – Rich Text Format
  32. .swc – Compiled Shockwave Flash File
  33. .potx – Microsoft PowerPoint Template File (XML Format)
  34. .xcf – GIMP (GNU Image Manipulation Program) file format
  35. .vpk – VUE package file (for sharing VUE maps)
  36. .camrec – Camtasia Studio recording file
  37. .mp3 – MPEG-1 or MPEG-2 Audio Layer III digital audio file
  38. .jpeg – Alternate extension for jpg file
  39. .mov – Quicktime video file
  40. .pages – Pages document file
  41. directory file – Folder file
  42. .dropbox – Dropbox configuration file
  43. .wav – Waveform audio file format
  44. .au – Audacity block file (NOT Sun .au audio file)
  45. .aup – Audacity project file
  46. .aup.bak – Audacity project file backup
  47. .bak – Backup File
  48. .lel – from “robert-portrait Logon.lel” filename clue, Windows 7 Logon editor file
  49. .dll – Dynamic-link Library Microsoft Windows shared library
  50. .pyd – Python extension module, effectively a .dll that Python can import
  51. .sql – Structured Query Language File
  52. .php – PHP language file
  53. .odp – OpenDocument presentation file (OpenOffice Impress)
  54. .mp4 – MPEG-4 Part 14 multimedia format
  55. .7z – 7Zip Archive File
  56. .odt – OpenDocument text document (OpenOffice Writer)
  57. .fxp – Adobe Flash Builder Flex Project File
  58. .csv – Comma-Separated Value file
  59. .m4v – Apple video format
  60. .xmpses – Adobe Premiere Elements DVD Marker file

Aside from being a rather imposing listing, at least to my archival mind, this crush of files highlights the difficulty in finding relevant programs to read and run individual entries.


The first problem is the need to find the correct version of a program for a specific file. For example, 7 and 8 on the list above are .pptx and .ppt files for Microsoft PowerPoint. The former is an XML-based document scheme used for Microsoft Office documents created in 2007 or later. The latter is either a pre-Office 2007 PowerPoint document, or a document saved as a .ppt in a post-Office 2007 PowerPoint. I don’t have a way to tell without doing some immediate digging.
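A first bit of digging is checking that the extension matches the container. Legacy binary Office files (.ppt, .doc, .xls) are OLE2 compound files that start with the bytes D0 CF 11 E0 A1 B1 1A E1, while the 2007-and-later formats are ZIP archives that start with “PK”. The sketch below only reports the container type; it cannot say which version of PowerPoint wrote a .ppt, and the function names are mine.

```python
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy binary Office container
ZIP_MAGIC = b"PK\x03\x04"                       # ZIP local file header


def sniff_office_container(header):
    """Classify the first bytes of a file as 'ole2', 'zip' or 'unknown'."""
    if header.startswith(OLE2_MAGIC):
        return "ole2"   # e.g. .ppt, .doc, .xls
    if header.startswith(ZIP_MAGIC):
        return "zip"    # e.g. .pptx, .docx, .xlsx, but also any ordinary .zip
    return "unknown"


def sniff_file(path):
    """Read just enough of a file to classify its container."""
    with open(path, "rb") as handle:
        return sniff_office_container(handle.read(8))
```

A .ppt that comes back as “zip”, or a .pptx that comes back as “ole2”, has been renamed at some point and deserves a closer look.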

Another version issue arises with Adobe Creative Suite programs. Adobe Illustrator (22) and Photoshop (27) files have maintained the same file extension for a large number of versions. Some features in older files are not reproducible in newer versions of the software and vice-versa. This creates an issue when you load an older file in a newer program; it attempts to change it to the newer format, thus making it unreadable to the program that created the file and potentially damaging the contents of the file itself. It’s effectively destroying the provenance of the file, which is a huge no-no. In order to ensure that everything remained in the exact condition that the developers left it, an archivist or researcher would need to identify the file type and then find the contingent version of the software that created the file.

A good example that I worked through was for the .camproj file type (30) above. This file is produced by Camtasia Studio, a screen capture program for Mac and Windows popular with screen casters, and also used here to record demo gameplay videos for Prom Week. After identifying the file type with a simple search, I mined the header anyway just to see if there was anything interesting in there. Turns out it was a good example of a potential solution to the version issue: sometimes the information is just explicitly present in the file.

Nice color scheme

.camproj file header with program version

That seems pretty straightforward, but since that is just a project data version and not a program version, I ventured a little further and found the path to the program’s executable.

.camproj directory illustrating 7.0 application folder

So there, sometimes it’s not too difficult to find the version of a specific file. Since the .camproj file is also just a text file, it would be rather trivial to write a script or parser to analyze and provide the correct file version without snooping. However, the approach here would not generally scale, especially if the information in the file is in an arbitrary, non-textual form and could then only be identified through structural analysis. One would have to write a program to do statistical reasoning over a corpus of filetypes, and who would be that nuts?…Oh Harvard? Harvard and JSTOR…really…huh. So yeah, there is a program called JHOVE that does do that and I’ll be looking at it in a future post. By JHOVE!
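As a rough idea of what such a script might look like for text-based project files, the sketch below scans the start of a file for dotted numbers that follow the word “version”. The pattern is a guess at a common convention, not the actual layout of a Camtasia project file, so anything it finds still needs a human to confirm what the number refers to.

```python
import re

# Hypothetical convention: the word "version", up to three separator
# characters, then a dotted number such as 7.0 or 7.0.1.
VERSION_PATTERN = re.compile(r"version\W{0,3}(\d+(?:\.\d+)+)", re.IGNORECASE)


def find_version_strings(text):
    """Return every dotted version number that follows the word 'version'."""
    return VERSION_PATTERN.findall(text)


def versions_in_file(path, limit=4096):
    """Scan only the first `limit` characters of a text file for version strings."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        return find_version_strings(handle.read(limit))
```

This is the easy case. It does nothing for binary formats, which is where structural analysis of the kind described above comes in.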


Even after you’ve found a file type or rather, think you’ve found a file type, there are still other potential barriers to getting it running. Two file types in the list, the .vue file (4) and the .lel file (48), illustrated the difficulties better than most. Starting with the .vue file, which is a Visual Understanding Environment file for mind-mapping software developed at Tufts University, I hit some identification road blocks when searching the internet. VUE files are commonly used three-dimensional geometry files (something that would conceivably be found in a game) and are also the file format for Microsoft FoxPro, a data schema development application (I hadn’t heard of it either). Since both of these uses apparently aren’t uncommon, I figured one of them might be correct. Upon further analysis, however, it became obvious that there was no use for 3D geometry in a strictly 2D Flash game, so that file type was out. Additionally, I couldn’t find any mention of FoxPro in the group email or discussion and it didn’t seem to make much sense, especially since the filenames for .vue files didn’t line up with data schema type uses.

Eventually, after I had given up on the .vue file (since its header didn’t have anything helpful), I was examining another file, a .vpk (35), through a text editor and lo and behold, there was some additional information:

.vpk file header

After seeing the Tufts reference in there, I went back to the Google and found the Visual Understanding Environment project page. This allowed me to identify the .vpk file and the complementary .vue file! As should be evident, there will even be confusion when you think you’ve found a game-related file type but it doesn’t fit the development context. Another issue is that the .vue and .vpk files are proprietary formats used for a little-known application. In the future it will probably become even more difficult to locate a correct binary.

Even more extreme is the .lel file (48). Searching the internet for a decent amount of time finally led me to a forum posting mentioning the format. The post discusses a Windows 7 log on screen modification program, which I figured was correct since the file I was trying to identify was called “robert-portrait Logon.lel”. The link to the program from that post had expired, but given that I now knew it was associated with a specific program, I searched for and found a newer version. Sadly, both the application link there and its mirror are now gone and I can’t find them elsewhere, meaning I have no way to open the .lel file or verify it. The log on screen application is probably the most obscure program of the thirty or so I found. Its existence was the most in danger, given that it’s a small, non-professional application made for a subset of the DeviantArt community.

Obscurity of software is a double-edged sword for preservation. If something is really obscure and made for a small user base it usually isn’t that large or complex (yes I know some people write their own flight simulators). So it’s generally easy to just save a copy with the data you want it to read. The log on editor program file is probably very small, and I could have just wrapped it up with all the other data in the project without much fuss. Therefore after sussing out file types, one should definitely see if any of their dependent programs are in immediate danger of disappearing or they might not be recoverable. The .lel file was created less than three years ago and is now unreadable. Things go away fast on the Internet. A main point of this work is to highlight the transience and instability of data online. Another site mentioning the .lel file literally told me to just Google it, assuming that it would just be available somewhere. Sigh.

So there, a bit more information on file types, I’ll return to them at some point in the future but I’m going to break up the posts a bit since there are so many things to talk about.

Lost in the Cloud

Any software development process involves a fair amount of extraneous creation. Code is revised, documents created and destroyed, prototypes and demos constructed, all in the pursuit of a final, stable digital object. Digital games add even more to this crush of documentation with an unending multitude of art assets, proprietary file types, and a lack of internal documentation.  Since most development today relies on cloud storage and backup, code repositories and all forms of digital spatio-temporal communication, just finding out where everything is stored necessitates significant technical effort and time.

The team for Prom Week, the object at the heart of my current research for the NEH (info here), made use of numerous cloud services throughout the duration of the project. Fortunately, most of the documents are stored on only two services, Dropbox and Google Docs. Unfortunately, the organization is about as structured as I would expect from a rotating development team with intense time pressures and significant distractions. The Dropbox repository proved particularly onerous in analysis. Each team member had their own individual directory, which usually duplicated some files from another major folder. Aside from duplicates, there is no real structure to the folder names or documentation. This is usually not a problem, however, as Dropbox is searchable and I’m assuming when this folder was active each person responsible for a file knew where and what it was. As an outsider to the Prom Week development process, I can usually ascertain what a document relates to, but that is definitely due to the last few months I’ve spent researching the project.

I’m going to continue the focus on Dropbox for two reasons: first, the Google Documents for the project, while interesting and post-worthy, are only 24 in number and 5 in type, and second, the points I want to make about file extensions and confusion in the cloud are easier to argue when I’m dealing with the 1.8 gigabytes of haphazardly organized Dropbox data. Those nearly two gigabytes of information break down into 2,051 individual files spanning 4 years of creation and modification by 8 people. Now while this appears to be a rather small set of data, making sense of it and potentially using it turns out to be even more difficult than I had assumed. And I’m generally rather cynical about such things. The following post is mainly about file extensions, and is the first in a series on file formats and the cloud.
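A quick way to get a first feel for a pile like this is to tally the files by extension. The following is a minimal Python sketch, not the method used to produce the counts above. The function name `tally_extensions` and the example path are hypothetical, and it assumes the shared folder has been synced to a local disk.

```python
from collections import Counter
from pathlib import Path


def tally_extensions(root):
    """Count regular files under `root` by lowercased extension.

    Files with no extension are counted under the empty string.
    """
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            counts[path.suffix.lower()] += 1
    return counts


# Hypothetical usage, assuming a local copy of the shared folder:
#   for ext, n in tally_extensions("/path/to/local/copy").most_common():
#       print(n, ext or "(no extension)")
```

Lowercasing folds .AU and .au into one bucket. Keep in mind that an extension only tells you what a file claims to be, so a tally like this is a starting inventory and not an identification.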

The major issue for archiving such a collection of documents isn’t necessarily about trying to figure out what they represent at the level of content. Although that does get quite hairy, the major issue I had with the documents is at a much lower level that I’ll discuss in a second. Ascertaining what a document might be about is usually possible from context, like the name of the document and its related folders, and from personal development experience. I’ve worked on many software projects with multiple collaborators and so I’m generally attuned to the types of documents created. To give a sense of Prom Week’s documentary complexity, here is a rough outline of the types of documents I found in just the Dropbox folder:

  • Research Papers
    • outlines
    • shared notes
    • major versions
    • revisions
    • abstracts
    • submissions
    • figures
    • templates
  • Conferences
    • posters
    • poster templates
    • presentations
    • ephemera documents and records
  • Assignments for undergraduate researchers
  • Data analysis maps
  • Demo and Test Programs for different game elements
  • Screenshots
  • Demo Videos
  • Game Files
    • background images
    • character art
    • sound files
    • video files
    • application project files for a specific Integrated Development Environment (IDE) (ActionScript project files)
    • application files
    • structured data files
    • operating system scripting files
    • system configuration files
    • IDE configuration files
    • source code files
      • game processes
      • data processing
      • asset management
      • software objects
      • data structures
    • user interface files
      • mock ups
      • test applications
      • database files
  • Developer Specific Folders
  • Secondary Creative Software Files
  • Web Site Resources
    • web embedding files
    • website icons
  • Backup Files

This list is assembled from document names and my personal understanding of Flash game development files. However, not all the files were easy to identify, which leads to what I consider the most pressing issue: file types.

There are many files in the folder that are obviously named but not easy to open. Essentially, you can know the context of a file (what type of file it should be) and still have no idea what program created it or how the data is organized. This leads to the three major ways I figure out how to read a file:

  1. Examine the Context
  2. Search the Internet
  3. Mine the Headers

I’m going to illustrate these approaches through examples of problematic files from the Prom Week Dropbox folder, illuminating the pitfalls of each approach.

The first example is the aptly named Now you probably know exactly what type of file this is (aren’t you smart!) but I had no flippin’ clue. So the first thing I did was examine the context and the file’s full path gave me a pretty big clue:

…/altprom/Jacob’s SFX/VOCAL/utterances/mohawk/mohawk_data/e00/d00/

Evidently this is a part of an audio file for the vocal sound effects for Prom Week. The ‘mohawk’ refers to (I think) an earlier version of the be-mohawked character in the earlier demos of the game.

I still don’t know what program is associated with the file extension .au, so I use the second approach: I search the Internet. The first hit is a Wikipedia page describing an audio format created by Sun Microsystems and popular on NeXT workstations and early Web sites. This seems totally off, since I’m positive that no one has used a NeXT machine at UCSC for at least 15 years and possibly never (cue angry UCSC NeXT users). Now NeXT is the progenitor of Apple’s OS X, the former being purchased by Apple in 1996, and is a totally interesting topic not for this blog post. In fact, my iTunes application detected the .au as an audio file but could not play it, which is a good sign it’s not a valid Sun .au file.

Looking at the Internet results again, I noticed that the second result is  a FAQ answer for the open-source application Audacity. The site asks, “Why does Audacity create a folder full of .au files when I save a project?” Looking at the file in question, our friend, I see that she’s in a folder with a bunch of other .au files, there’s a big (.au)dacious party up in there. The FAQ page also mentions that there should be an .aup (Audacity Project File) associated with the .au Audacity Block Files and sure enough, there’s a .aup file in a parent directory.
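That relationship between block files and a project file can be checked mechanically. Here is a small Python sketch that walks up from a block file looking for an .aup file in the enclosing folders. The name `find_project_file` and the `max_levels` limit are illustrative and not anything from Audacity itself; the only assumption about Audacity is the one from the FAQ, that an .aup sits somewhere above the .au files.

```python
from pathlib import Path


def find_project_file(block_file, max_levels=6):
    """Walk up from `block_file` and return the first .aup file found.

    Checks the block file's own folder first, then each parent folder,
    for at most `max_levels` folders. Returns None if nothing turns up.
    """
    current = Path(block_file).resolve().parent
    for _ in range(max_levels):
        matches = sorted(current.glob("*.aup"))
        if matches:
            return matches[0]
        if current.parent == current:  # reached the filesystem root
            break
        current = current.parent
    return None
```

Capping the climb keeps the search from wandering up into unrelated folders and matching some other project's .aup by accident.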

Now I’ve figured out what type of file .au is referring to, and it makes sense that a student researcher on the project would use a free, open-source audio editor for an academic project. However, I still haven’t mentioned the third identification method, mine the headers, because I actually didn’t need to do that for this particular file. If I had I would have seen this:

[Screenshot: Audacity Block File in vi editor. Image title: “So many colors...”]

The file clearly states in the header information that it’s associated with Audacity, so I could have examined that first and probably saved a bit of work. Regardless, I’ll explain the process for doing dirt-cheap header analysis on UNIX-based systems. I don’t generally use a Windows PC for anything but gaming, so most of the methodology on this blog will be from the technical context of tools available on Mac OS X. All of OS X’s flashy graphical flourishes are underpinned by the BSD-derived Darwin operating system, which is UNIX compliant, so I’m using mostly common UNIX tools for my surface analysis.

A good many file formats, though not all, have some text-based header information at the ‘head’ of the file. You know, at the top. So if you open those files in a format-agnostic text editor and they contain encoded text, you can see what type of file it is. To obtain the screenshot of the Audacity file above I used a terminal application, basically a program that lets a user interact with the command line interface to the Darwin OS running my Mac. Every Mac has the Terminal application installed, so you can follow along if you open it. If you’re on Linux I’m assuming you are already aware of how to access the terminal. When I’m in a command line interface, I just use the command: vi path-to-file to open the file in the vi editor. vi will open pretty much anything, though if it’s not encoded as text it will be gobbledygook like the Audacity file above. I keep active files in a convenient place if I just want to snoop, so I copy them to my desktop temporarily. Therefore, the command to look at the audio file was:

vi ~/Desktop/
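The same peek can be done without an editor. Below is a minimal Python sketch that reads the first few bytes of a file and shows the printable ones, plus a helper that checks whether a marker string such as "Audacity" appears near the top. The function names, the 64 and 512 byte limits, and the example path are illustrative assumptions; real formats put their identifying bytes in different places, so treat a miss as inconclusive.

```python
def peek_header(path, length=64):
    """Return the first `length` bytes of a file as text.

    Printable ASCII bytes are kept; everything else becomes '.'.
    """
    with open(path, "rb") as handle:
        head = handle.read(length)
    return "".join(chr(b) if 32 <= b < 127 else "." for b in head)


def header_mentions(path, marker, length=512):
    """Report whether the byte string `marker` occurs in the first `length` bytes."""
    with open(path, "rb") as handle:
        return marker in handle.read(length)


# Hypothetical usage on a file copied to the desktop:
#   print(peek_header("/path/to/copied/file"))
#   print(header_mentions("/path/to/copied/file", b"Audacity"))
```

Opening the file in binary mode matters here: it avoids any guess about text encoding, which is the same reason vi shows gobbledygook for the non-text parts.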

Okay, so I’ve covered the types of files and common methods I use to find out about file types. In the next few blog posts I’ll elaborate on how these methods can lead to some confusion with particularly knotty files, and discuss some other issues related to file formats, like versioning and dependent applications. I’ll also try to make them less than 1,537 words in length. Bye for now.

National Endowment for the Humanities

Last April I, along with Noah Wardrip-Fruin and Christy Caldwell at University of California, Santa Cruz (UCSC) and Henry Lowood at Stanford, received a National Endowment for the Humanities Digital Humanities Start Up Grant. Our project proposal (online here) is a first pass on an archival and appraisal strategy for academically produced computer games. The focus of the research is on the process of creating computer games in an academic context. We are looking deeply at a game produced by UCSC graduate students and divining the trajectory of development and all the types of artifacts produced by such an effort. There is a general lack of knowledge about how game production functions in an academic research project and our goal is to shed some light on it, both through a technical dive into the development process and an archival narrative of object production.


The game we chose is Prom Week, a social simulation game produced by my lab (the Expressive Intelligence Studio) at UCSC. Although its selection is slightly self-serving, we needed full access to a development process and its resulting game. If we wanted to understand how the development process worked and aggregate all the different outputs it produced, then we had to choose something close to home. Private software development is notoriously insular and shielded, in an effort to protect IP and development talent. Therefore we figured an academic game would provide more open tools and access, especially one in which I could just ask the developers questions if I ran into them at lab meetings, in the hallway or at the food truck.

Given that this type of work is new, we needed to find helpful examples to provide some initial guidance. The two major sources of inspiration are the 1983 Joint Committee on the Archives of Science and Technology (JCAST) report on scientific process, and the Preserving Virtual Worlds Report on game preservation issues. The JCAST report is essentially a detailed description of the problems inherent in the records management and archiving of the scientific process. There is less interest in the official publications, since those are generally clean and organized documents representing the output of a messy research process. JCAST is concerned with archiving the mess, specifically how research institutions should handle and evaluate the myriad artifacts that accompany scientific research. This type of investigation seemed applicable to the archival treatment of video game development processes and has provided nice guidance so far. The other source, Preserving Virtual Worlds, was the first major government research into the issue of game preservation specifically, and while not concerned with the game development process, it still highlights numerous types of documents and gives extensive technical warnings about the issues inherent in reproducing digital documents.

The process of my current work on Prom Week is informed by both reports and seeks a middle path to explain how the development process works in an academic context, what types of documents are produced and what technical issues one would face if they were crazy enough to actually archive it all.

A final note here. The academically produced games that we are concerned with are those pieces of digital entertainment software produced with a specifically teleological bent. They are designed to research some processes or validate some novel system of play, design, or pedagogy with hopes of publishable academic results. This context is then slightly different from the corporate or independent development process, but it is hoped that many considerations will map accordingly.

References (yes in a blog post):