File Format Identification Resources and Apps

In two earlier posts I detailed the simple methods I use for file type identification and the issues that arise from them. Here I’m going to look at a few more resources for file type identification, including some applications and websites that host useful file type information. The first part will discuss places to look; the second will be a mild analysis of some software designed for digital repositories, including a tool from the National Archives of the United Kingdom. I’m using file type and format interchangeably from now on to make it less boring to write.

File Formats and Internets

Aside from searching for a file format on the Google or Wikipedia, both of which meet with sometimes limited success, there are some good places to check. One is Jason Scott’s effort at the Just Solve the Problem Project, in which he got Archive Team members and anyone else on the Internet to rabidly organize a file formats wiki. The site has a rather large number of file formats, though your mileage may vary with the more obscure proprietary types. At least if you find something new you can add it to the listings. I like this site since it tries to include formats that might not be floating around the Internet, like internal PlayStation development files and other in-house formats. There’s also a great collection of links on the project proposal wiki with tons of other file format resources.

Another listing of file formats, much less complete but still potentially useful, is PRONOM, the UK National Archives’ file format database. The UK National Archives actually uses this database for file identification in the applications I’m going to mention below. It’s a little hard to parse at first and the UI needs some work, but there is some good information in there.

Applications

I’m going to look at three applications in brief, just to give a flavor of what’s available for free and immediately. There are tons of expensive computational forensics tools that you could use for this analysis. In fact, when I was working on the Stanford collections I had access to some; they were quite nice and around $5,000 a license. FTK (Forensic Toolkit) is an example of this type of forensic software, and its purpose is mostly document discovery in court cases. As such, it is not interested in weird file format identification. It can, however, search a hard drive for any data that might be an image of a human. Not going any farther with that comment.

Back to the free, non-spook world, I’m going to look at JHOVE, JHOVE2 and DROID, three programs designed for digital repositories and libraries. Starting with JHOVE, the JSTOR/Harvard Object Validation Environment, I’m going to outline an attempt to use the software for identification of file types. As is apparent from the name, JHOVE was an effort to provide document validation for university digital repositories. Validating a file format is very important since it ensures that the file follows an accepted standard specification and thus will probably be readable in the future. Sadly, for my current purposes, this software is not very useful. JHOVE only validates and identifies files related to the digitization of print material, which is the basis for most digital collections and not related in the slightest to arbitrary software archives. JHOVE will tell you if a certain PDF is version 1.0 or 1.7, but it treats most non-text and non-image formats as black-boxed bitstreams. It is also not really supported at this point, so while it’s probably possible to modify its internals, I really don’t have the time.

JHOVE Logo

JHOVE interpreting an unknown file type

JHOVE2 is the result of a two-year grant to improve the scope and functionality of the original JHOVE software; however, it seems to be only partly finished and labor intensive. JHOVE2 provides for a larger range of file type identification but still only deals with the major print- and image-based formats of the original JHOVE. Additionally, JHOVE2 does not have a graphical user interface, so all analysis must be done either through the command line or by writing your own rules for the system and designing custom output. I could see the latter being great if you had a full-time staff member managing it, but the 40-page installation and usage manual is too intense for my current needs and time constraints. JHOVE2 also appears to have simply run out of development time. Running the default configuration on a single text file gives me a 300+ line XML document that I can barely parse. Luckily, JHOVE2 uses the DROID application as a back-end file format identifier, a fact that led me, and will now lead this blog post, to brighter pastures.

Partial JHOVE2 Output for the Apache License

DROID, or Digital Record Object IDentification, is a Java program developed by the UK National Archives for file format identification. It accesses the online PRONOM database mentioned above and tries to identify the file types in a given folder. Since the PRONOM database does cover a decent number of common files, DROID was able to identify some common ActionScript and Flash development files. Although some of the more confusing files threw it off, specifically all the difficult ones I mentioned in this post, it also managed to find file extensions that I had missed the first time around. So that’s a total win! DROID has a nice reporting feature that will enumerate the different file extensions it found and give a count for each. While I would have liked full path information, at least this can give someone an immediate overview of the proprietary or potentially confusing file formats, thereby allowing one to narrow later searches to just those files for identification.

DROID UI

DROID Generated Report

Okay, so that’s a couple of the programs I’ve messed with to identify files. Next time, join me for some basic email analysis thanks to the good folks at the Stanford MUSE Project.

Web Sites Forgotten

I’m not at a loss for diversity of approaches in my current archiving work on Prom Week. Teams of developers in the modern academy (or even the real world) need ways to share and evaluate information about their work and projects. As a result of the never-ending stream of new services and means of disseminating digital information, there are generally quite a few places you need to look nowadays to find everything. For instance, say you are working on a project and it needs to be announced and shared. Well then, you’ll start a blog for it, and maybe get mentioned in some press. All good, but those are now two more research objects potentially worthy of appraisal and ingestion into a digital archive. If the press builds up enough, there will be tens to hundreds of individual stories, some derivative, some not, about your work. Maybe a famous person likes your stuff and mentions it on Twitter or Tumblr; you’d definitely want to have that saved for posterity, right?

In light of all the new modes of dissemination on the Internet, it has become imperative to find ways of aggregating and saving large amounts of online documents. This isn’t a reference to cloud documents; I’m mainly referring to anything involved in the development process that gets posted to some addressable folder on a server. I split this documentation into two types: dissemination documents and development documents. Dissemination documentation is everything I mentioned in the first paragraph: documents released to the online public by the developers in the form of press releases, development blogs, and other affiliated blog posts. These documents, in turn, can produce community reactions in the form of press coverage, reviews, and other online references to the game. Development documents are online references to the game intended for the development team only. These can range from shared folders containing demonstration applications to group chat logs and meeting organization tools. Both of these document types sit on the accessible web and might be informative to later researchers, so how does one organize and save it all?

Internet Archive?

One option would be to rely on someone else to pick up the slack. In this instance I’m mainly referring to the Internet Archive and their Wayback Machine. The Internet Archive’s tool regularly scrapes the Internet for pages, storing and logging changes to a large number of websites. You can even browse through older versions of websites and see what they looked like before they were acquired. The major problem is that the crawler respects robots.txt (a file located at the root of a web directory explicitly blocking crawls) and will apparently remove websites upon request. Another issue is that the crawler is not omniscient: it ‘crawls’ and needs a medium through which to exercise that activity, namely hyperlinks. If you put up a file on your domain, don’t tell anyone the link, and don’t post it anywhere, there is no way for the crawler to find it or know about it. As such, a large portion of the Internet does not function in a way conducive to crawling. Sites like Facebook block crawlers, and as a result the Wayback Machine does not have a copy of the original Prom Week page. So the short answer is no: if you want to save all dissemination documents, you’ll need to do it yourself.
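
For illustration, this is all a well-behaved crawler needs to see to skip a site entirely; a minimal robots.txt blocking every crawler from everything:

    # robots.txt at https://example.com/robots.txt (hypothetical site)
    User-agent: *
    Disallow: /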

Certain applications, including most browsers, will let you save a webpage in its entirety, although the scope of “entirety” is up to the application. Two services I use for this activity are Zotero (for individual pages) and Archive-It. Zotero is a fantastic citation and reference management system that functions as a browser plug-in, allowing you to save a website and have it attached to relevant metadata in a folder of your choosing. Archive-It is a commercial web application through which the Internet Archive provides paid access to their web crawler. Conveniently, unless a user of the service explicitly prevents it, anything crawled by a user will also show up somewhere on the archive.org website. Archive-It is the better solution if you want to save a larger class of website. As an example, I used the tool to save the Prom Week development blog. The crawler managed to grab quite a bit of information and all the blog posts at the domain. Surprisingly, just a handful of posts turned up over 243 URLs and 45 megabytes of data, so there are generally a lot more documents and data than one would assume. Granted, some of these are links to other services like Twitter and Facebook or to individually hosted images, so the document count is a little inflated. Still, even with this seemingly deep dive into a website it is possible to miss something significant.

Results from undirected Archive-It Crawl

One of the project’s developers stored some demo applications and explanatory web sites for conferences on his personal UCSC web space. After crawling his site I came up with 156 URLs, but none of them linked to the Prom Week documents. Since he hadn’t linked them from any file in his main directory, the crawler didn’t find them. I had to feed the crawler individual URLs for each demo file and directory. Needless to say, this would become untenable for even a moderately complex grouping of online documents.
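
If you do end up hand-feeding URLs, a few lines of scripting can at least automate the fetching side. A minimal sketch in Python, assuming a hypothetical seeds.txt with one absolute URL per line:

    # fetch_seeds.py -- mirror a hand-built list of unlinked URLs locally
    import os
    import urllib.parse
    import urllib.request

    with open("seeds.txt") as f:
        seeds = [line.strip() for line in f if line.strip()]

    os.makedirs("mirror", exist_ok=True)
    for url in seeds:
        # Flatten the URL path into a safe local filename.
        name = urllib.parse.quote(urllib.parse.urlparse(url).path, safe="") or "index"
        with urllib.request.urlopen(url) as resp, open(os.path.join("mirror", name), "wb") as out:
            out.write(resp.read())
        print("saved", url, "->", name)

This grabs only the documents themselves, not their metadata or link structure, so it is a supplement to a real crawl rather than a replacement.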

Ephemeral Sites

Along with development demos and files, another task that teams frequently engage in is scheduling. In looking through the email list archives for the project, I came across references to third-party scheduling apps. These apps were used for organizing project meeting times and are still active on the web. This is a type of documentation that I haven’t really seen discussed in the archival frame yet. Many people use simple online services to schedule meetings, share notes and even code. I figure most people don’t expect these texts to hang around, but a large number of them do. Because such documents exist at the whims of whatever service is hosting them, it’s important for archival entities to be aware of the problems inherent in finding, appraising and saving such ephemeral documentation. I have quite a few still-active links to meetings the development team scheduled over two years ago. Although this information is not something I’d deem to be very important, it does further substantiate a number of things, including who was working on the project and when they were meeting. This type of information might be irrelevant for my current work, but it would be nifty to have the date of the first UI design meeting of Google or something similar. Awareness of the range of documentation can help prevent potentially valuable information from slipping through. The continuing problem is knowing, in advance, what the relevant documents are.

Don’t Look Back: Cloud APIs and Revisions

Cloud APIs represent an interesting problem for the archival security of documents over extended periods. Storing files with a secondary source (in this case I’ll be discussing Google Docs and Dropbox) is usually the most convenient option. Google Docs provides a means to create, edit, share and browse documents without leaving the confines of a browser. Since I’m usually doing research or testing applications within that environment, having some of my documents there is definitely a useful addition to my workflow. Similarly, Dropbox’s main appeal is that it shadows the cloud storage of my files by being present as a locally accessible folder. Anything I drop in there gets whisked up to the mystical cloud, floating into the immense sea of offsite data storage.

The obvious problem with cloud storage, and a definite reason for skepticism, is the reliance on secondary corporate structures to persist. Granted, the stability of the backups on Google Docs or Dropbox is probably greater than that of my local hard drive, but there is still something a little unnerving about giving up that control. From an archival perspective, which is very…very long, there is a general fear about the lasting permanence of cloud institutions. Google and Dropbox seem totally stable right now, but will I be able to leave something on there for 20 years and still retrieve it? Technological turnover is astoundingly fast and many users have been burned by storage companies blinking out of existence. A good example was the shutdown of MegaUpload, in which 25 petabytes of user data vanished after a government raid.

The goal is to find the most stable possible place for long-term digital storage. And although they are sometimes derided as slow and conservative, government and educational institutions generally have the best track record for long-term retention. So the question, in light of all the online documents produced by game development, is: what is the best solution to the probable collapse of a cloud service? Do you just migrate the stored data to a more secure digital repository? What data do you take, or more specifically, what data is available? This last question is where access to cloud service APIs becomes important, because they are the main conduit for detailed documentary information and also the limiting factor for getting data off of a service.

Permission and Access

Both Google and Dropbox require user authentication to access cloud files. This is necessary for security and is at the heart of the trust networks that allow these services to function. Through a weird inversion, this security framework is also a liability for archives. Since getting access to files requires membership in the service, if a group wants to share data with a repository through the service, that repository must then become a member. The problem here is that as an archivist I should not, in any way, be inserting myself into the material. Yes, when you prepare a collection you make tons of decisions about potential organization and delineation betwixt documents, but I’m not going to write my name on every one. When a Dropbox or Google documents folder is shared, the metadata associated with that transaction becomes permanent, so I effectively become a member of the project team from the viewpoint of the service.

Now, as far as future recommendations go, I don’t know if this type of thing is avoidable. For instance, in Google Docs you can assign privileges for a file and thus make it impossible for the archival entity to modify the document. Sadly, this option is not available on Dropbox, so if I accidentally do something to the files on my machine I could permanently overwrite an actual historical document. And yes, I could always make a full, non-synced copy of the folder somewhere else and then recursively modify those permissions, but my main point is that this is a ton more to think about and manage on a technical level than with a standard digital document.
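
For what it’s worth, the copy-and-lock-down step is easily scripted. A minimal sketch in Python, assuming the non-synced copy lives at a hypothetical archive_copy/ folder:

    # make_readonly.py -- recursively strip write permission from an archival copy
    import os
    import stat

    ROOT = "archive_copy"  # hypothetical path to the non-synced copy

    for dirpath, dirnames, filenames in os.walk(ROOT):
        for name in filenames:
            path = os.path.join(dirpath, name)
            mode = os.stat(path).st_mode
            # Clear the owner/group/other write bits; leave everything else alone.
            os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))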

Access to the various APIs for these services is also based on membership. If I want to use Google’s or Dropbox’s JavaScript API, I need to register with them as a developer and then register my potential application to get a token allowing access to the service. If you didn’t follow that immediately, you can see where some problems start to set in. I’ve developed web applications, so all this is pretty straightforward to me; for libraries or other traditional repository structures it means additional technical expertise and cost. The APIs for Dropbox and Google provide a lot of information about users’ files. There are heaps of metadata associated with cloud documents that might interest a repository, and they are only available by sending programmed requests to the APIs.
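
To make “programmed requests” concrete, here’s a minimal sketch against Dropbox’s current v2 HTTP endpoint (a newer API than the one I was working with, and the file path is hypothetical); the bearer token is exactly the credential that developer registration hands you:

    # dropbox_metadata.py -- sketch: fetch the metadata Dropbox holds for one file
    import json
    import urllib.request

    ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"  # issued after registering an app

    req = urllib.request.Request(
        "https://api.dropboxapi.com/2/files/get_metadata",
        data=json.dumps({"path": "/PromWeek/design-notes.docx"}).encode(),  # hypothetical path
        headers={
            "Authorization": "Bearer " + ACCESS_TOKEN,
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        # Server-side modification time, revision id, content hash, and so on.
        print(json.dumps(json.load(resp), indent=2))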

Metadata and Contextual Information

When you create a document on Google Docs or add something to a Dropbox folder, a significant amount of computational work takes place, virtually all of it hidden from the user. The native formats for Google Docs are anyone’s guess; odds are they are drawn from multiple services spread across multiple servers and assembled through complex JavaScript and AJAX queries. In order to remove these items from the service, a user can choose to download them in a selection of different formats. Google Documents (text documents) are downloadable as PDF, Microsoft Word (.docx), OpenOffice (.odt), Rich Text (.rtf), Plain Text (.txt) or HTML files. These files don’t exist until downloaded and do not include most of the metadata generated for the file. The downloaded version is just one manifestation of the data Google has associated with the document. To get the rest you have to use calls to the API.
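
As a sketch of those calls using today’s Drive v3 API and the google-api-python-client library (both my assumptions; this post predates them), pulling the hidden metadata and one exported manifestation looks roughly like this:

    # drive_export.py -- sketch: metadata plus one exported copy of a Google Doc
    from google.oauth2.credentials import Credentials
    from googleapiclient.discovery import build

    creds = Credentials.from_authorized_user_file("token.json")  # from a prior OAuth flow
    service = build("drive", "v3", credentials=creds)
    FILE_ID = "your-file-id"  # hypothetical

    # Metadata that never appears inside a downloaded copy.
    meta = service.files().get(
        fileId=FILE_ID,
        fields="name, mimeType, owners, lastModifyingUser, modifiedTime, version",
    ).execute()
    print(meta)

    # One manifestation of the document: a .docx export.
    data = service.files().export(
        fileId=FILE_ID,
        mimeType="application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ).execute()
    with open(meta["name"] + ".docx", "wb") as f:
        f.write(data)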

The downloaded documents also shed information not expressible in the chosen format. Comments added to documents by a user are preserved in Microsoft Word downloads, but removed from PDF downloads. Additionally, while Google is tracking revisions to files, that information is not present in the downloaded copies.

Revisions

According to the Google Drive SDK documentation, Google is tracking a significant amount of metadata for each file on Drive. Revisions also have a detailed metadata trail that is only accessible through code. Another issue is that revisions to documents are discarded over time, so there isn’t really a durable concept of versioning for the files. Files can be “pinned” to specific revisions by setting a flag through the API, but that option is not available to anyone without programming knowledge. After a long enough period of time Google’s services will prune revisions from less active documents. Therefore, though some of the shared documents do still have histories, I don’t know if they are inclusive of most changes.
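
The pinning itself is one small call per revision. A sketch continuing the hypothetical v3 service object from the previous example, where the flag is now named keepForever:

    # pin_revisions.py -- sketch: ask Drive to keep every surviving revision of a file
    FILE_ID = "your-file-id"  # hypothetical

    revisions = service.revisions().list(fileId=FILE_ID).execute().get("revisions", [])
    for rev in revisions:
        # keepForever exempts a revision from auto-pruning; Drive only honors it
        # for files with binary content, not native Google Docs.
        service.revisions().update(
            fileId=FILE_ID, revisionId=rev["id"], body={"keepForever": True}
        ).execute()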

Dropbox presents a similar issue in that most of the metadata associated with a file is available only through the API. Thankfully, since Dropbox is a smaller company, the API is a lot simpler and less intense than Google’s offering. Additionally, Dropbox’s online application does allow one to view previous versions of a file, but it also deletes version history after 30 days unless you pay extra for the Packrat add-on. All the files in Prom Week’s Dropbox now have only one version, including drafts of academic publications. All that data is now gone.

Since Google at least keeps some of the revisions around for a decent period of time, it would be feasible to write some add-on or application to back up the revisions (even to Google Drive itself). A problem with this is deciding which revisions are meaningful and which are just mild edits. Each revision would then be its own full document, instead of just the fragmentary diff common to source control programs like Git.
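
A starting point for such a backup tool, again sketched against the hypothetical v3 setup from above. For a native Google Doc each revision carries exportLinks, so, as just noted, every revision comes down as a complete document rather than a diff:

    # backup_revisions.py -- sketch: save each surviving revision as a full PDF
    import urllib.request

    FILE_ID = "your-file-id"  # hypothetical

    revisions = service.revisions().list(
        fileId=FILE_ID, fields="revisions(id, modifiedTime, exportLinks)"
    ).execute().get("revisions", [])

    for rev in revisions:
        url = rev["exportLinks"]["application/pdf"]
        req = urllib.request.Request(url, headers={"Authorization": "Bearer " + creds.token})
        with urllib.request.urlopen(req) as resp:
            with open("revision-%s.pdf" % rev["id"], "wb") as out:
                out.write(resp.read())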

More on the APIs later, if I have time to work out some scripts and such.

Lost in the Cloud II: The Thickening

My previous post on identifying file formats introduced my three approaches to uncovering evasive file definitions: examine the context, search the internet, and mine the header. In this post I’m returning to those methods and the files from Prom Week’s Dropbox folder to highlight more file format related issues. This time I’ll focus on versioning issues and software obscurity.

First, since I didn’t elaborate on this in the last post: the purpose of this type of investigation is to ascertain what a future researcher would need, in terms of executable resources (operating systems, programs, hardware configurations, etc.), to properly run and examine the work produced by a development team. In the case of just a subset of the shared cloud files for Prom Week, the answer is surprisingly extensive and diverse. Looking through over 2,000 files in the folder, I identified 60 different file types requiring the interpretation of at least 29 programs.
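
The raw census is the easy part; a sketch of the tally in Python, pointed at a hypothetical local copy of the shared folder:

    # tally_extensions.py -- count file extensions under a folder tree
    import collections
    import os

    ROOT = "PromWeek-dropbox"  # hypothetical local copy of the shared folder

    counts = collections.Counter()
    for dirpath, dirnames, filenames in os.walk(ROOT):
        for name in filenames:
            # Keep compound extensions like ".aup.bak" intact: split on the first
            # dot rather than using os.path.splitext.
            ext = "." + name.split(".", 1)[1].lower() if "." in name else "(no extension)"
            counts[ext] += 1

    for ext, n in counts.most_common():
        print("%6d  %s" % (n, ext))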

Plopping the list here for effect, this is not a final assessment of the required programs, as I’ll explain in a sec. The file extensions are on the left, notes on the right.

  1. .txt – text files
  2. .docx – Microsoft Office Document 2007 or later, archive / ooxml filetype
  3. .svg – Scalable Vector Graphics
  4. .vue – Visual Understanding Environment (VUE) file (Tufts University)
  5. .png – Portable Network Graphics
  6. .zip – ZIP File Archive
  7. .pptx – Microsoft Powerpoint Files 2007 or later
  8. .ppt – Microsoft Powerpoint Files
  9. .pdf – Portable Document Format
  10. .xlsx – Microsoft Excel Files 2007 or later
  11. .as3proj – FlashDevelop ActionScript 3.0 Project File
  12. .xml – Extensible Markup Language
  13. .bat – Windows Batch File
  14. .swf – Shockwave / Flash Files
  15. .old – “Old” backup file
  16. .mxml – Flex meta XML file format
  17. .as – Actionscript 3.0 file
  18. .jpg – JPEG (Joint Photographic Experts Group) image file
  19. .java – Java Source File
  20. .fla – Adobe Flash File
  21. .pbm – portable bitmap file (Netpbm)
  22. .ai – Adobe Illustrator File
  23. .html – Hypertext Markup Language
  24. .css – Cascading Style Sheet
  25. .js – Javascript File
  26. .doc – Microsoft Office Document
  27. .psd – Adobe Photoshop File
  28. .gif – Graphics Interchange Format image file
  29. .avi – Audio Video Interleave
  30. .camproj – Camtasia Studio project file (Camtasia Studio 7.0)
  31. .rtf – Rich Text Format
  32. .swc – Compiled Shockwave Flash File
  33. .potx – Microsoft PowerPoint Template File (XML Format)
  34. .xcf – GIMP (GNU Image Manipulation Program) file format
  35. .vpk – VUE package file (for sharing VUE maps)
  36. .camrec – Camtasia Studio recording file
  37. .mp3 – MPEG-1 or MPEG-2 Audio Layer III digital audio file
  38. .jpeg – Alternate extension for jpg file
  39. .mov – Quicktime video file
  40. .pages – Pages document file
  41. directory file – Folder file
  42. .dropbox – Dropbox configuration file
  43. .wav – Waveform audio file format
  44. .au – Audacity block file (NOT Sun .au audio file)
  45. .aup – Audacity project file
  46. .aup.bak – Audacity project file backup
  47. .bak – Backup File
  48. .lel – from “robert-portrait Logon.lel” filename clue, Windows 7 Logon editor file (http://www.tweakscene.com/viewtopic.php?f=149&t=4614)
  49. .dll – Dynamic-link Library Microsoft Windows shared library
  50. .pyd – same as a .dll though written in Python (http://docs.python.org/2/faq/windows.html#is-a-pyd-file-the-same-as-a-dll)
  51. .sql – Structured Query Language File
  52. .php – PHP language file
  53. .odp – OpenDocument presentation file (OpenOffice Impress)
  54. .mp4 – MPEG-4 Part 14 multimedia format
  55. .7z – 7Zip Archive File
  56. .odt – OpenDocument text document (OpenOffice Writer)
  57. .fxp – Adobe Flash Builder Flex Project File
  58. .csv – Comma-Separated Value file
  59. .m4v – Apple video format
  60. .xmpses – Adobe Premiere Elements DVD Marker file

Aside from being a rather imposing listing, at least to my archival mind, this crash of files highlights the difficulty of finding relevant programs to read and run individual entries.

Versions

The first problem is the need to find the correct version of a program for a specific file. For example, numbers 7 and 8 on the list above are .pptx and .ppt files for Microsoft PowerPoint. The former is an XML-based document scheme used for Microsoft Office documents created in Office 2007 or later. The latter is either a pre-Office 2007 PowerPoint document or a document saved as .ppt by a post-2007 copy of PowerPoint; I don’t have a way to tell without doing some immediate digging.
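
Some of that digging is automatable: whatever the extension claims, the container type sits in the first few bytes, since OOXML files are ZIP archives and legacy Office files are OLE2 compound documents. A quick sketch that settles the container question (though not which PowerPoint release wrote a given .ppt):

    # ppt_or_pptx.py -- classify a PowerPoint file by its magic bytes
    import sys

    ZIP_MAGIC = b"PK\x03\x04"                         # OOXML (.pptx) is a ZIP archive
    OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy Office binary container

    with open(sys.argv[1], "rb") as f:
        head = f.read(8)

    if head.startswith(ZIP_MAGIC):
        print("OOXML container: Office 2007 or later (.pptx family)")
    elif head.startswith(OLE2_MAGIC):
        print("OLE2 container: pre-2007 binary format (.ppt family)")
    else:
        print("neither signature; something else entirely")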

Another version issue arises with Adobe Creative Suite programs. Adobe Illustrator (22) and Photoshop (27) files have kept the same file extensions across a large number of versions. Some features in older files are not reproducible in newer versions of the software, and vice versa. This creates an issue when you load an older file in a newer program: the program attempts to convert it to the newer format, making it unreadable to the program that created it and potentially damaging the contents of the file itself. That effectively destroys the provenance of the file, which is a huge no-no. In order to ensure that everything remains in the exact condition the developers left it, an archivist or researcher would need to identify the file type and then find the contingent version of the software that created the file.

A good example that I worked through was the .camproj file type (30) above. This file is produced by Camtasia Studio, a screen capture program for Mac and Windows popular with screencasters, and used here to record demo gameplay videos for Prom Week. After identifying the file type with a simple search, I mined the header anyway just to see if there was anything interesting in there. It turned out to be a good example of a potential solution to the version issue: sometimes the information is just explicitly present in the file:

Nice color scheme

.camproj file header with program version

That seems pretty straightforward, but since that is just a project data version and not a program version, I ventured a little further and found the path to the program’s executable.

.camproj directory illustrating 7.0 application folder

So there, sometimes it’s not too difficult to find the version of a specific file. Since .camproj is also just a text file, it would be rather trivial to write a script or parser to analyze it and report the correct version without snooping (a sketch follows below). However, the approach would not generally scale, especially if the information in the file is in an arbitrary, non-textual form that could only be identified through structural analysis. One would have to write a program to do statistical reasoning over a corpus of file types, and who would be that nuts?…Oh, Harvard? Harvard and JSTOR…really…huh. So yeah, there is a program called JHOVE that does do that, and I’ll be looking at it in a future post. By JHOVE!
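
Here’s roughly what that trivial no-snooping version check could look like, assuming (hypothetically) that the project file carries a version="…" attribute like the one visible in the header screenshot:

    # camproj_version.py -- sketch: pull a version attribute from a text-based project file
    import re
    import sys

    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        text = f.read(4096)  # the identifying header sits near the top of the file

    # Hypothetical pattern; adjust to whatever the real .camproj header contains.
    match = re.search(r'version\s*=\s*"([^"]+)"', text, re.IGNORECASE)
    print(match.group(1) if match else "no version attribute found")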

Obscurity 

Even after you’ve found a file type, or rather think you’ve found a file type, there are still other potential barriers to getting it running. Two file types in the list, the .vue file (4) and the .lel file (48), illustrate the difficulties better than most. Starting with the .vue file, which belongs to the Visual Understanding Environment, a mind-mapping application developed at Tufts University, I hit some identification road blocks when searching the internet. VUE files are also a commonly used three-dimensional geometry format (something that would conceivably be found in a game) and the file format for Microsoft FoxPro, a data schema development application (I hadn’t heard of it either). Since both of these uses apparently aren’t uncommon, I figured one of them might be correct. Upon further analysis, however, it became obvious that there was no use for 3D geometry in a strictly 2D Flash game, so that file type was out. Additionally, I couldn’t find any mention of FoxPro in the group email or discussion, and it didn’t seem to make much sense, especially since the filenames of the .vue files didn’t line up with data schema type uses.

Eventually, after I had given up on the .vue file (since its header didn’t have anything helpful), I was examining another file, a .vpk (35), in a text editor and lo and behold, there was some additional information:

.vpk file header

After seeing the Tufts reference in there, I went back to the Google and found the Visual Understanding Environment project page. This allowed me to identify the .vpk file and the complementary .vue file! As should be evident, there can be confusion even when you think you’ve found a game-related file type but it doesn’t fit the development context. Another issue is that the .vue and .vpk files are proprietary formats used by a little-known application, so in the future it will probably become even more difficult to locate a correct binary.
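
Header mining of this sort is also scriptable. The classic trick is to pull runs of printable characters out of a binary, just like the Unix strings utility; a minimal Python version:

    # mine_header.py -- print printable-ASCII runs from the start of an unknown file
    import re
    import sys

    with open(sys.argv[1], "rb") as f:
        head = f.read(4096)  # identifying strings usually live near the top

    # Runs of four or more printable ASCII characters, the same heuristic
    # the `strings` utility uses by default.
    for run in re.findall(rb"[\x20-\x7e]{4,}", head):
        print(run.decode("ascii"))

Running this over the .vpk file would have surfaced the Tufts reference without ever opening a text editor.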

Even more extreme is the .lel file (48). Searching the internet for a decent amount of time finally led me to a forum post mentioning the format. The post discusses a Windows 7 logon screen modification program, which I figured was correct since the file I was trying to identify was called “robert-portrait Logon.lel”. The link to the program from that post had expired, but given that I now knew the extension was associated with a specific program, I searched for and found a newer version. Sadly, both the application link there and its mirror are now gone and I can’t find them elsewhere, meaning I have no way to open the .lel file, nor verify it. The logon screen application is probably the most obscure program of the thirty or so I found. Its existence was the most in danger, given that it’s a small, non-professional application made for a subset of the DeviantArt community.

Obscurity of software is a double-edged sword for preservation. If something is really obscure and made for a small user base, it usually isn’t that large or complex (yes, I know some people write their own flight simulators), so it’s generally easy to just save a copy alongside the data you want it to read. The logon editor program file is probably very small, and I could have just wrapped it up with all the other data in the project without much fuss. Therefore, after sussing out file types, one should definitely check whether any of their dependent programs are in immediate danger of disappearing, or they might not be recoverable. The .lel file was created less than three years ago and is already unreadable; things go away fast on the Internet. A main point of this work is to highlight the transience and instability of data online: another site mentioning the .lel file literally told me to just Google it, assuming that it would just be available somewhere. Sigh.

So there, a bit more information on file types. I’ll return to them at some point in the future, but I’m going to break up the posts a bit since there are so many things to talk about.