Wednesday, March 5, 2008

What's a Multi-Version Document?

What's it for?

A multi-version document is for recording texts that exist in multiple versions. This mustn't be confused with recording multiple drafts of a single text you might be working on. Instead, a multi-version document is for recording the non-linear structure of a text, like the Greek New Testament, which exists in thousands of versions, or equally the text of a modern literary or philosophical work, which might have been published in several versions or may exist only in the form of manuscripts heavily edited by their author. Another major use is in recording multiple marked-up versions of digital documents, for example, texts in linguistics that express multiple perspectives on the same text, or which contain overlapping markup.

Why we need them

Multi-version documents are the ideal form for recording our textual cultural heritage in an increasingly digital age. Year on year we are reading fewer paper books. As people play more games, watch more DVDs and browse more information on the Internet, our rich cultural heritage comes under threat. Ultimately, those written texts that give our language depth and history, and our culture an identity, the collection of works written on physical media that goes back thousands of years, will have to be transferred to the digital medium if they are to survive. The problem is that existing forms of digital text can't accurately record these documents. Subtle structural differences between the two media mean that important information will be lost, or it will simply prove too difficult to transfer old knowledge into the new form. If we fail to develop new means for representing our textual cultural heritage now, we may soon lose it forever.

How does it work?

A multi-version document represents a text as a set of merged versions in a single digital entity, which can be efficiently edited, and whose versions can be listed, compared and searched. Versions can overlap freely, which overcomes the limitation of markup languages, all of which are still based on the formal generative grammars invented by linguists in the 1950s. A multi-version document is represented as a list of text fragments t_i, each of which is assigned a set of versions v_i:

{v_1, t_1}, {v_2, t_2}, ..., {v_n, t_n}

This extremely simple form is all you need to record texts that contain thousands of versions. It is a form of digital text that trades complexity for mere size, and size is something modern computers are very good at handling. The structure of the document is implied by the intersection of the versions of each fragment. For example, to read a single version all you need to do is read through the list, picking out the fragments that belong to it. Searching all versions or comparing any two versions to find out what is different is just as easy.
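
As a rough illustration of the list-of-pairs form, here is a minimal Python sketch. It is not the actual MVD implementation: the sample sentence, the integer version numbers and the function names read_version, compare and search are all made up for this example, and the search shown is the naive one that simply reads each version in turn.

```python
from typing import FrozenSet, List, Tuple

# One pair = (set of versions the fragment belongs to, the fragment text).
Pair = Tuple[FrozenSet[int], str]

# Hypothetical sample data: three versions of one short sentence.
mvd: List[Pair] = [
    (frozenset({1, 2, 3}), "The quick "),
    (frozenset({1}), "brown"),
    (frozenset({2, 3}), "red"),
    (frozenset({1, 2, 3}), " fox jumps"),
    (frozenset({3}), " high"),
    (frozenset({1, 2, 3}), "."),
]


def read_version(pairs: List[Pair], version: int) -> str:
    """Read one version by picking out the fragments that belong to it."""
    return "".join(text for versions, text in pairs if version in versions)


def compare(pairs: List[Pair], a: int, b: int) -> Tuple[List[str], List[str]]:
    """Return the fragments found only in version a and only in version b."""
    only_a = [t for vs, t in pairs if a in vs and b not in vs]
    only_b = [t for vs, t in pairs if b in vs and a not in vs]
    return only_a, only_b


def search(pairs: List[Pair], needle: str) -> List[int]:
    """Naive search: list the versions whose full text contains needle."""
    all_versions = set().union(*(vs for vs, _ in pairs))
    return sorted(v for v in all_versions if needle in read_version(pairs, v))


print(read_version(mvd, 1))    # The quick brown fox jumps.
print(read_version(mvd, 3))    # The quick red fox jumps high.
print(compare(mvd, 1, 3))      # (['brown'], ['red', ' high'])
print(search(mvd, "red fox"))  # [2, 3]
```

Each operation is a single pass over the list, testing set membership for every fragment, which is the sense in which the form trades complexity for size.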

The mathematical basis for this model

This form is equivalent to a 'graph': a set of intermingling paths that start at one point, branch, rejoin and split again, until they all join back together at the end. A paper recently accepted by the International Journal of Human-Computer Studies proves that these two forms of multi-version text, namely:

  1. the intuitive graph representation, and
  2. the list of pairs described above

are equivalent, that is, we can transform one into the other and back again with no loss of information. This kind of solid mathematical basis is in contrast to previous attempts at representing versions and overlapping structures in digital text, all of which were based on markup, which can only efficiently represent hierarchical structures. As many humanists and linguists have discovered, natural texts in their disciplines are much more frequently overlapping in structure than they are hierarchical.
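
The round trip between the two forms can be illustrated with a simplified Python sketch. This is not the construction used in the paper. As an assumption made purely for brevity, fragments are treated here as nodes joined by "is directly followed by" links, whereas in the variant graph proper the fragments label the arcs. The sample data and the names pairs_to_graph and graph_to_pairs are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Pair = Tuple[FrozenSet[int], str]

# Hypothetical sample data: three versions of one short sentence.
mvd: List[Pair] = [
    (frozenset({1, 2, 3}), "The quick "),
    (frozenset({1}), "brown"),
    (frozenset({2, 3}), "red"),
    (frozenset({1, 2, 3}), " fox jumps"),
    (frozenset({3}), " high"),
    (frozenset({1, 2, 3}), "."),
]


def read_version(pairs: List[Pair], version: int) -> str:
    """Read one version by picking out the fragments that belong to it."""
    return "".join(text for versions, text in pairs if version in versions)


def pairs_to_graph(pairs: List[Pair]) -> Dict[int, Set[int]]:
    """Link fragment i to fragment j if j directly follows i in some version."""
    edges: Dict[int, Set[int]] = {}
    last_seen: Dict[int, int] = {}  # version -> index of its latest fragment
    for i, (versions, _) in enumerate(pairs):
        for v in versions:
            if v in last_seen:
                edges.setdefault(last_seen[v], set()).add(i)
            last_seen[v] = i
    return edges


def graph_to_pairs(fragments: List[Pair], edges: Dict[int, Set[int]]) -> List[Pair]:
    """Flatten the graph back into a list using a topological sort."""
    indegree = {i: 0 for i in range(len(fragments))}
    for targets in edges.values():
        for j in targets:
            indegree[j] += 1
    ready = sorted(i for i, d in indegree.items() if d == 0)
    order: List[int] = []
    while ready:
        i = ready.pop(0)
        order.append(i)
        for j in sorted(edges.get(i, ())):
            indegree[j] -= 1
            if indegree[j] == 0:
                ready.append(j)
    return [fragments[i] for i in order]


graph = pairs_to_graph(mvd)
rebuilt = graph_to_pairs(mvd, graph)
# Every version reads the same before and after the round trip.
for v in (1, 2, 3):
    assert read_version(rebuilt, v) == read_version(mvd, v)
```

Any topological ordering of the graph gives a valid list, because the links preserve the relative order of the fragments within each version. The rebuilt list may therefore order unrelated variants differently from the original without changing the text of any version.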

The Variant Graph is not a Replacement for Markup: it Complements it

A Multi-Version Document cleanly separates content from variation. The content of a document is expressed by the textual fragments in the list, or by the textual labels on the arcs of the graph. The structure of the document, its variation, on the other hand, is expressed by the order of the pairs and by their sets of versions.

This means that any technology can be used to represent the content, even ordinary markup. Since all the overlapping structures have been removed and placed in the Variant Graph structure, the markup can be simple enough to handle in a wiki. Of course, we are not tied to markup. If in future markup becomes obsolete, we can still use Multi-Version Documents to record the content using some other technology, in binary form for example.

A Multi-Version Document doesn't, and need not, represent any of the complexities of text relating to content. It just represents versions and its complexity ends right there.

What we have so far

The first publication of the idea of 'network text' (submitted 2004) was in:

Schmidt, D. 2006 'A Graphical Editor for Manuscripts' Literary and Linguistic Computing 21: 341-351.

The first publication of the variant graph idea was in:

Schmidt, D., and Wyeld, T., 2005. 'A novel user interface for online literary documents.' ACM International Conference Proceeding Series 122, 1-4.

Subsequent conference papers appeared in:

Schmidt, D., and Fiormonte, D., 2006. 'A Fresh Computational Approach to Textual Variation', in: The First International Conference of the Alliance of Digital Humanities Organisations (ADHO) 5-9 July Paris-Sorbonne, Conference Abstracts, 193-196.

Schmidt, D. and Fiormonte, D. 2007. 'Documenti Multiversione: una soluzione per gli artefatti testuali del patrimonio culturale / Multi-Version Documents: a Digitisation Solution for Textual Cultural Heritage Artefacts'. In Bordoni, L. (ed.) Proceedings of the AI*IA Workshop for Cultural Heritage. 10th Congress of Italian Association for Artificial Intelligence, Università di Roma Tor Vergata, Villa Mondragone, 10 settembre 2007, 9-16.

This was subsequently accepted for Intelligenza Artificiale, and is awaiting publication.

Schmidt, D., Brocca, N., Fiormonte, D. 'A Multi-Version Wiki', accepted for Digital Humanities 2008, Oulu, Finland, June 2008.

Schmidt, D. and Colomb, R., 2009. 'A Data Structure for Representing Multi-version Texts Online', International Journal of Human-Computer Studies 67.6.

Schmidt, D., 2009. 'Merging Multi-Version Texts: a General Solution to the Overlap Problem', in The Markup Conference 2009 Preliminary Proceedings, Montreal, August.

2 comments:

Dominguero said...

Maybe I did not understand what you're saying, but I was wondering if it is necessary to exclude "multiple drafts of a single text" from the realm of Multi-Version Documents and their representation. In the case of Sanvitale, for example (see Digital Variants), we have one short story in eight writing passages. I don't see why these should not be represented by the MVD model.
Domenico

desmond said...

Domenico, the model covers all forms of texts that exist in multiple versions, including "multiple drafts of a single text". If you examine the images of the Sibylline Gospel closely you will see that some of the versions are actually revisions of single manuscripts, just as when the author revises his own text. For example, "A2 (Angers, Bibl. Publ. 26 (22), saec. XIII post correctionem)". There are not so many changes as in genetic texts, but nevertheless there are revisions. When you first commit a text to an MVD you choose the version that is to be its base. In the case of a revised text this is always the previous version. In the case of a separate manuscript it is up to the editor to choose, perhaps the reference text of that recension.

We really need to demonstrate this at Oulu, that the model covers all types of textual variation, both those traditionally covered by the classical and bibliographical models or by the genetic model, and also the kinds of overlapping hierarchies you find in linguistics texts.