Tuesday, May 28, 2013

The vicious life-cycle of the digital scholarly edition

Most digital scholarly editions (DSEs) are stored on web-servers as a collection of files (images and text) and as database fields and tables in binary form. The structure of the data is tied closely to the software, which expects resources to be in a precise format and location. So moving a digital scholarly edition to another web-server, which probably runs a different content management system, different scripting languages and a different database, is usually considered either impossible or too expensive. And sooner or later, as automatic updates change the server and the languages and tools on which it depends, the DSE will break. That usually takes about one to two years. That's fine if there is still money to fix it, but more than likely the technicians who created it have moved on, the money has run out, and pretty soon that particular DSE will die. This vicious life-cycle of the typical DSE has already played itself out on the Web countless times, and everybody knows it.

But XML will save us, won't it?

I hear people saying: 'but XML will save us. It is a permanent storage format that transcends the fragility of the software that gives it life.' Firstly, XML just puts the special data structures that the software needs into a form that people (I mean programmers) can read. That changes none of the problems described above. Yes, XML itself as a metalanguage is interoperable with all the XML tools, but the languages defined using it, if not rigidly standardised, are as numberless as the grains of sand on the beach, and are just as tied to the software that animates them as earlier binary formats were.

Escaping the vicious cycle

Secondly, a digital scholarly edition is much more than a collection of transcriptions and images. There is simply no way to make the technological part of a DSE interoperable between disparate systems. But what we can do is make the content of a DSE portable between systems that run different databases and, to some extent, different content management systems. Doing so also gives the DSE a life outside its original software environment and leaves behind something for future researchers. What we have recently done in the AustESE project is to create a portable digital edition format (PDEF), which encapsulates the 'digital scholarly edition' in a single ZIP file containing:

  1. The source documents of each version in TEXT and MVD (multi-version document) formats. The MVD format makes it easy to move documents between installations of the AustESE system, and the TEXT files record exactly the same data in a more or less timeless form.
  2. The markup associated with the text. This is in the form of JSON (JavaScript Object Notation), which is now supplanting XML for sending and storing data in many web applications. Several layers of markup are possible for each text version, and these can be combined to produce Web pages on demand (a sketch of one such layer follows this list).
  3. The formats of the marked-up text. These define, using CSS, different renditions of the combined text+markup.
  4. The images that may be referred to by the markup.
  5. (In future) the annotations on the text that are common to all the versions to which they apply.
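
To make the markup item concrete: a layer is essentially a list of named properties, each anchored to a range of the underlying plain text. The snippet below is a simplified TypeScript sketch of that idea, not our exact format; the field and property names are invented for illustration. Because the ranges are standoff, two properties may freely overlap, something embedded XML tags cannot express directly.

    // A simplified sketch of one standoff markup layer. Each property
    // names a range of the plain text instead of embedding tags in it.
    interface Property {
        name: string;    // e.g. "line", "deleted", "stage-direction"
        offset: number;  // start position in the plain text
        len: number;     // length of the range in characters
    }

    interface MarkupLayer {
        style: string;          // which CSS format renders this layer
        properties: Property[];
    }

    // Two overlapping properties over the same stretch of text:
    const exampleLayer: MarkupLayer = {
        style: "default",
        properties: [
            { name: "line", offset: 0, len: 42 },
            { name: "deleted", offset: 30, len: 25 }
        ]
    };

    console.log(JSON.stringify(exampleLayer, null, 2));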

This enables something I don't think anyone else can do yet: download a DSE, send it to another installation, and have them upload it so that it works 'out of the box'. We need to think in those terms if we are to get beyond the experimental stage in which many digital scholarly editions currently seem stuck. Otherwise we run the risk of becoming an irrelevance in the face of massive and simplistic attempts to digitise our textual cultural heritage by Project Gutenberg and Google. We need much more than what these services offer. We need a space in which we can play out on the Web the timeless activities of editing, annotation and research into texts – what we call 'scholarship'. The only way to do that is to have 'a thing' that we call a 'digital scholarly edition'.
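
To give a feel for how little machinery the receiving end needs, here is a short sketch using the jszip library to open a PDEF archive and sort its entries by kind. The file layout, extensions and archive name are illustrative only, not a fixed specification:

    import * as fs from "fs";
    import JSZip from "jszip";

    // Open a PDEF archive and report what kind of resource each entry is.
    // The directory names and extensions here are illustrative only.
    async function listPdef(path: string): Promise<void> {
        const zip = await JSZip.loadAsync(fs.readFileSync(path));
        zip.forEach((entryPath, entry) => {
            if (entry.dir) return;
            if (entryPath.endsWith(".mvd"))
                console.log("multi-version document:", entryPath);
            else if (entryPath.endsWith(".json"))
                console.log("markup layer:", entryPath);
            else if (entryPath.endsWith(".css"))
                console.log("format:", entryPath);
            else if (entryPath.startsWith("images/"))
                console.log("image:", entryPath);
            else if (entryPath.endsWith(".txt"))
                console.log("text version:", entryPath);
        });
    }

    listPdef("edition.pdef").catch(console.error);

Everything the edition needs travels inside the one file, which is the whole point.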

Saturday, May 18, 2013

Faster than a speeding bullet

I have just completed the move from one type of NoSQL database (CouchDB) to another (MongoDB). The speed increase is exhilarating, though I haven't yet measured exactly by how much. The comparisons are not cached; they are computed on the fly, and the speed is the fruit of good design and careful implementation. As I have said all along, conventional approaches based on XML are just too inefficient and inadequate. This design, based on multi-version documents and hosted on a Jetty/Java service with a C core, is faster than any other such service I know of, and it is more flexible. There is also the new C version of nmerge yet to add, which should increase capacity and speed further. Try out the test interface for yourself.
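
The service itself is Java, but for anyone wondering what the switch buys us: a document store like MongoDB lets us keep each multi-version document as a single record, fetched in one round trip, with no joins and no schema migrations. A minimal TypeScript sketch with the official Node.js driver shows the shape of it; the database, collection, field names and docid are my own inventions for this example:

    import { MongoClient } from "mongodb";

    // Fetch one multi-version document by its docid in a single round trip.
    // Database, collection, field names and docid are invented for illustration.
    async function fetchMvd(docid: string): Promise<void> {
        const client = new MongoClient("mongodb://localhost:27017");
        try {
            await client.connect();
            const mvds = client.db("austese").collection("mvds");
            const doc = await mvds.findOne({ docid: docid });
            console.log(doc ? `found ${docid}` : `no such document: ${docid}`);
        } finally {
            await client.close();
        }
    }

    fetchMvd("english/harpur/h642").catch(console.error);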