Lost in the Cloud II: The Thickening

My previous post on identifying file formats introduced my three approaches to uncovering evasive file definitions: examine the context, search the internet and mine the header. In this post I’m returning to those methods and the files from Prom Week’s Dropbox folder to highlight more file-format-related issues. This time I’ll focus on versioning issues and software obscurity.

First, since I didn’t elaborate on this in the last post, the purpose for this type of investigation is to ascertain what a future researcher would need, in terms of executable resources (operating systems, programs, hardware configurations, etc.) to properly run and examine the work produced by a development team. In the case of just a subset of the shared cloud files for Prom Week the answer is surprisingly extensive and diverse. Looking through over 2000 files in the folder, I identified 60 different file types requiring the interpretation of at least 29 programs.
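A first pass at this kind of survey can be scripted. The following is a minimal sketch, not the process I actually used: it walks a folder and tallies files by extension. The function name `count_extensions` and the folder path in the usage note are placeholders of my own.

```python
from collections import Counter
from pathlib import Path


def count_extensions(root):
    """Tally every file under root by its lowercased final extension."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Path.suffix keeps only the last extension, e.g. ".bak"
            counts[path.suffix.lower()] += 1
    return counts


if __name__ == "__main__":
    # "shared_folder" is a hypothetical path; point it at a real directory.
    for ext, n in count_extensions("shared_folder").most_common():
        print(ext or "(no extension)", n)
```

Because only the final suffix is counted, compound names such as .aup.bak would be tallied under .bak, so the output is a starting point for manual review rather than a finished list.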

I’m plopping the list here for effect; it is not a final assessment of the required programs, as I’ll explain in a sec. The file extensions are on the left, notes on the right.

  1. .txt – text files
  2. .docx – Microsoft Office Document 2007 or later, archive / ooxml filetype
  3. .svg – Scalable Vector Graphics
  4. .vue – Visual Understanding Environment (VUE) file (Tufts University)
  5. .png – Portable Network Graphics
  6. .zip – ZIP File Archive
  7. .pptx – Microsoft Powerpoint Files 2007 or later
  8. .ppt – Microsoft Powerpoint Files
  9. .pdf – Portable Document Format
  10. .xlsx – Microsoft Excel Files 2007 or later
  11. .as3proj – FlashDevelop ActionScript 3.0 Project File
  12. .xml – Extensible Markup Language
  13. .bat – Windows Batch File
  14. .swf – Shockwave / Flash Files
  15. .old – “Old” backup file
  16. .mxml – Flex meta XML file format
  17. .as – Actionscript 3.0 file
  18. .jpg – JPEG Joint Photographic Expert Group image file
  19. .java – Java Source File
  20. .fla – Adobe Flash File
  21. .pbm – portable bitmap file (Netpbm)
  22. .ai – Adobe Illustrator File
  23. .html – Hypertext Markup Language
  24. .css – Cascading Style Sheet
  25. .js – Javascript File
  26. .doc – Microsoft Office Document
  27. .psd – Adobe Photoshop File
  28. .gif – Graphics Interchange Format image file
  29. .avi – Audio Video Interleave
  30. .camproj – Camtasia Studio project file (Camtasia Studio 7.0)
  31. .rtf – Rich Text Format
  32. .swc – Compiled Shockwave Flash File
  33. .potx – Microsoft PowerPoint Template File (XML Format)
  34. .xcf – GIMP (GNU Image Manipulation Program) file format
  35. .vpk – VUE package file (for sharing VUE maps)
  36. .camrec – Camtasia Studio recording file
  37. .mp3 – MPEG-1 or MPEG-2 Audio Layer III digital audio file
  38. .jpeg – Alternate extension for jpg file
  39. .mov – Quicktime video file
  40. .pages – Pages document file
  41. directory file – Folder file
  42. .dropbox – Dropbox configuration file
  43. .wav – Waveform audio file format
  44. .au – Audacity block file (NOT Sun .au audio file)
  45. .aup – Audacity project file
  46. .aup.bak – Audacity project file backup
  47. .bak – Backup File
  48. .lel – from “robert-portrait Logon.lel” filename clue, Windows 7 Logon editor file (http://www.tweakscene.com/viewtopic.php?f=149&t=4614)
  49. .dll – Dynamic-link Library Microsoft Windows shared library
  50. .pyd – Python extension module, essentially a .dll that Python can import (http://docs.python.org/2/faq/windows.html#is-a-pyd-file-the-same-as-a-dll)
  51. .sql – Structured Query Language File
  52. .php – PHP language file
  53. .odp – OpenDocument presentation file (OpenOffice Impress)
  54. .mp4 – MPEG-4 Part 14 multimedia format
  55. .7z – 7Zip Archive File
  56. .odt – OpenDocument text document (OpenOffice Writer)
  57. .fxp – Adobe Flash Builder Flex Project File
  58. .csv – Comma-Separated Value file
  59. .m4v – Apple video format
  60. .xmpses – Adobe Premiere Elements DVD Marker file

Aside from being a rather imposing listing, at least to my archival mind, this crash of files highlights the difficulty in finding relevant programs to read and run individual entries.

Versions

The first problem is the need to find the correct version of a program for a specific file. For example, 7 and 8 on the list above are .pptx and .ppt files for Microsoft Powerpoint. The former is an XML-based document scheme used for Microsoft Office documents created in 2007 or later. The latter is either a pre-Office 2007 Powerpoint document or a document saved as a .ppt in a post-Office 2007 Powerpoint. I don’t have a way to tell which without doing some further digging.
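Mining the header at least confirms which container a Powerpoint file really uses, whatever its extension claims. Here is a minimal sketch, assuming only the well-known signatures: the pre-2007 binary Office formats are OLE2 compound files, and the 2007-and-later formats are ZIP packages. The function names are my own, and this does not reveal which version of Powerpoint saved a given .ppt.

```python
# Leading bytes ("magic numbers") of the two Office container formats.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy binary .ppt/.doc/.xls
ZIP_MAGIC = b"PK\x03\x04"                        # .pptx/.docx/.xlsx packages


def sniff_office_container(header: bytes) -> str:
    """Classify a file by its first bytes: 'ole2', 'zip' or 'unknown'."""
    if header.startswith(OLE2_MAGIC):
        return "ole2"
    if header.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"


def sniff_file(path) -> str:
    """Read the first eight bytes of a file on disk and classify them."""
    with open(path, "rb") as handle:
        return sniff_office_container(handle.read(8))
```

A result of "zip" only says the file is some ZIP-based package, so a plain .zip archive would match too; the extension and the context still have to do the rest of the work.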

Another version issue arises with Adobe Creative Suite programs. Adobe Illustrator (22) and Photoshop (27) files have maintained the same file extension for a large number of versions. Some features in older files are not reproducible in newer versions of the software and vice-versa. This creates an issue when you load an older file in a newer program: it attempts to convert the file to the newer format, which can make it unreadable to the program that created it and potentially damage the contents of the file itself. It’s effectively destroying the provenance of the file, which is a huge no-no. In order to ensure that everything remains in the exact condition that the developers left it, an archivist or researcher would need to identify the file type and then find the corresponding version of the software that created the file.

A good example that I worked through was the .camproj file type (30) above. This file is produced by Camtasia Studio, a screen capture program for Mac and Windows popular with screencasters, and used here to record demo gameplay videos for Prom Week. Although a simple search identified the file type, I mined the header anyway just to see if there was anything interesting in there. It turned out to be a good example of a potential solution to the version issue, because sometimes the information is just explicitly present in the file:

Nice color scheme

.camproj file header with program version

That seems pretty straightforward, but since that is just a project data version and not a program version, I ventured a little further and found the path to the program’s executable.

.camproj directory illustrating 7.0 application folder

So there, sometimes it’s not too difficult to find the version of a specific file. Since .camproj is also just a text file, it would be rather trivial to write a script or parser to analyze it and provide the correct file version without snooping. However, the approach here would not generally scale, especially if the information in the file is in an arbitrary, non-textual form that can only be identified through structural analysis. One would have to write a program to do statistical reasoning over a corpus of file types, and who would be that nuts?… Oh, Harvard? Harvard and JSTOR… really… huh. So yeah, there is a program called JHOVE that does do that and I’ll be looking at it in a future post. By JHOVE!
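To give a sense of how small such a script could be for a text-based project file, here is a rough sketch. It assumes nothing about the real layout of a Camtasia project file: it simply looks for any label containing the word "version" followed closely by a dotted number. The pattern, the function name and the sample strings in the usage note are all made up for illustration.

```python
import re

# A label containing "version", up to three punctuation or space characters,
# then a dotted number such as 7.0 or 7.0.1 (a bare "7" is ignored).
VERSION_PATTERN = re.compile(r"(\w*version\w*)\W{1,3}(\d+(?:\.\d+)+)", re.IGNORECASE)


def find_versions(text):
    """Return (label, number) pairs for every version-like field in text."""
    return VERSION_PATTERN.findall(text)


def find_versions_in_file(path):
    """Read a text-based project file and scan it for version fields."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_versions(handle.read())
```

On an invented line such as `<Project_Data_Version>0.95</Project_Data_Version>` this returns the label together with `0.95`, and on `appVersion="7.0.1"` it returns `appVersion` with `7.0.1`. Someone still has to decide which of the hits is the program version and which is merely a data version.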

Obscurity 

Even after you’ve found a file type or rather, think you’ve found a file type, there are still other potential barriers to getting it running. Two file types in the list, the .vue file (4) and the .lel file (48), illustrate the difficulties better than most. Starting with the .vue file, which is a Visual Understanding Environment file for mind-mapping software developed at Tufts University, I hit some identification roadblocks when searching the internet. The .vue extension is also commonly used for three-dimensional geometry files (something that would conceivably be found in a game) and for Microsoft FoxPro, a data schema development application (I hadn’t heard of it either). Since both of these uses apparently aren’t uncommon, I figured one of them might be correct. Upon further analysis, however, it became obvious that there was no use for 3D geometry in a strictly 2D Flash game, so that file type was out. Additionally, I couldn’t find any mention of FoxPro in the group email or discussion and it didn’t seem to make much sense, especially since the filenames for the .vue files didn’t line up with data-schema-type uses.

Eventually, after I had given up on the .vue file (since its header didn’t have anything helpful), I was examining another file, a .vpk (35), in a text editor and lo and behold, there was some additional information:

.vpk file header

After seeing the Tufts reference in there, I went back to the Google and found the Visual Understanding Environment project page. This allowed me to identify the .vpk file and the complementary .vue file! As should be evident, there will even be confusion when you think you’ve found a game-related file type but it doesn’t fit the development context. Another issue is that the .vue and .vpk files are proprietary formats used for a little-known application. In the future it will probably become even more difficult to locate a correct binary.
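Opening a binary file in a text editor works, but the same hunt for readable clues can be automated. This is a small sketch in the spirit of the Unix strings utility: it pulls runs of printable ASCII out of raw bytes so they can be searched for a keyword. The function names and the default minimum length of four characters are my own choices.

```python
import re


def printable_strings(data: bytes, min_len: int = 4):
    """Return every run of at least min_len printable ASCII characters."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [run.decode("ascii") for run in pattern.findall(data)]


def strings_mentioning(path, keyword):
    """List the readable strings in a file that contain keyword."""
    with open(path, "rb") as handle:
        found = printable_strings(handle.read())
    return [s for s in found if keyword.lower() in s.lower()]
```

Of course, this only helps once you have a keyword to look for. The "tufts" clue only meant something to me after I had seen it in the header.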

Even more extreme is the .lel file (48). Searching the internet for a decent amount of time finally led me to a forum posting mentioning the format. The post discusses a Windows 7 log on screen modification program, which I figured was correct since the file I was trying to identify was called “robert-portrait Logon.lel”. The link to the program from that post had expired, but given that I now knew it was associated with a specific program, I searched for and found a newer version. Sadly, both the application link there and its mirror are now gone and I can’t find them elsewhere, meaning I have no way to open the .lel file or verify it. The log on screen application is probably the most obscure program of the thirty or so I found. Its existence was also the most in danger, given that it’s a small, non-professional application made for a subset of the DeviantArt community.

Obscurity of software is a double-edged sword for preservation. If something is really obscure and made for a small user base it usually isn’t that large or complex (yes, I know some people write their own flight simulators). So it’s generally easy to just save a copy with the data you want it to read. The log on editor program file is probably very small, and I could have just wrapped it up with all the other data in the project without much fuss. Therefore, after sussing out file types, one should definitely check whether any of the programs they depend on are in immediate danger of disappearing, or they might not be recoverable. The .lel file was created less than three years ago and is now unreadable. Things go away fast on the Internet. A main point of this work is to highlight the transience and instability of data online. Another site mentioning the .lel file literally told me to just Google it, assuming that it would just be available somewhere. Sigh.

So there, a bit more information on file types. I’ll return to them at some point in the future, but I’m going to break up the posts a bit since there are so many things to talk about.