Wednesday, March 5, 2008

What's a Multi-Version Document?

What's it for?

A multi-version document is for recording texts that exist in multiple versions. This mustn't be confused with recording multiple drafts of a single text you might be working on. Instead, a multi-version document is for recording the non-linear structure of a text. Examples are the Greek New Testament, which exists in thousands of versions, or equally the text of a modern literary or philosophical work, which might have been published in several versions or may exist only in the form of manuscripts heavily edited by their author.

How does it work?

A multi-version document represents a text as a set of merged versions in a single digital entity, which can be efficiently edited, and whose versions can be listed, compared and searched. Versions can overlap freely, which overcomes a limitation of markup languages: all markup systems today are based on the formal generative grammars invented by linguists in the 1950s. A multi-version document is represented as a list of text fragments t1 ... tn, each of which is assigned a set of versions v1 ... vn:

{v1, t1}, {v2, t2}, ... {vn, tn}

This extremely simple form is all you need to record texts that contain thousands of versions. It is a form of digital text that trades complexity for mere size, and size is something modern computers are very good at handling. The structure of the document is implied by the intersection of the versions of each fragment. For example, to read a single version, all you need to do is read through the list, picking out the fragments that belong to it. Other common operations, such as comparing two texts to find the differences, or printing the variants of a particular range within a given version, are just as easy.
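As a rough illustration, the list of pairs and two of the operations mentioned above can be sketched in Python. This is only a minimal sketch under assumptions: the sample text, the version names "A", "B" and "C", and the function names read_version and differences are invented for the example and are not taken from any actual MVD software.

```python
from typing import FrozenSet, List, Tuple

# One pair: (set of version names, text fragment).
# The type name and the sample data are hypothetical.
Pair = Tuple[FrozenSet[str], str]

mvd: List[Pair] = [
    (frozenset({"A", "B", "C"}), "The quick "),
    (frozenset({"A"}), "brown"),
    (frozenset({"B", "C"}), "red"),
    (frozenset({"A", "B", "C"}), " fox jumps over the "),
    (frozenset({"A", "B"}), "lazy "),
    (frozenset({"A", "B", "C"}), "dog."),
]


def read_version(doc: List[Pair], version: str) -> str:
    """Read one version by picking out, in order, the fragments that belong to it."""
    return "".join(text for versions, text in doc if version in versions)


def differences(doc: List[Pair], v1: str, v2: str) -> List[Tuple[str, str]]:
    """List the fragments that belong to exactly one of the two versions,
    each paired with the version that contains it."""
    result = []
    for versions, text in doc:
        if (v1 in versions) != (v2 in versions):
            result.append((text, v1 if v1 in versions else v2))
    return result


print(read_version(mvd, "A"))
print(read_version(mvd, "C"))
print(differences(mvd, "A", "B"))
```

Both operations are a single pass over the list, which is the sense in which the form trades complexity for size: the fragments shared by all three versions are stored once, and nothing more elaborate than set membership is needed to recover any version.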

The mathematical basis for this model

This form is equivalent to a 'graph', a set of intermingling paths that start at one point, branch, rejoin and split again, until they all join back together at the end. The paper in the International Journal of Human-Computer Studies cited below (Schmidt and Colomb, 2009) proves that these two forms of multi-version text, namely:

  1. the intuitive graph representation, and
  2. the list of pairs described above

are equivalent, that is, we can transform one into the other and back again with no loss of information. This kind of solid mathematical basis is in contrast to previous attempts at representing versions and overlapping structures in digital text, all of which were based on markup, which can only efficiently represent hierarchical structures. As many humanists and linguists have discovered, natural texts in their disciplines are much more frequently overlapping in structure than they are hierarchical.
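To make the relationship between the two forms a little more concrete, here is a small Python sketch that flattens a toy graph into a list of pairs and checks that every version reads the same from both. It is not the construction or the proof from the cited paper: it covers only the graph-to-list direction, and the arc format (from_node, to_node, versions, text), the sample text and all function names are assumptions made for this example.

```python
from collections import defaultdict, deque

# Hypothetical toy graph. Each arc is (from_node, to_node, versions, text).
# Node 0 is the start and node 3 the end; the arcs are deliberately unordered.
arcs = [
    (2, 3, {"A", "B"}, " was the word"),
    (0, 1, {"A", "B"}, "In the "),
    (1, 2, {"B"}, "start"),
    (1, 2, {"A"}, "beginning"),
]


def walk_graph(arcs, version, start, end):
    """Read one version by following its path through the graph."""
    out = {}
    for src, dst, versions, text in arcs:
        if version in versions:
            out[src] = (dst, text)
    node, parts = start, []
    while node != end:
        dst, text = out[node]
        parts.append(text)
        node = dst
    return "".join(parts)


def graph_to_pairs(arcs):
    """Flatten the graph into (versions, text) pairs by visiting the nodes
    in topological order and emitting each node's outgoing arcs."""
    outgoing = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for arc in arcs:
        src, dst = arc[0], arc[1]
        outgoing[src].append(arc)
        indegree[dst] += 1
        nodes.update((src, dst))
    ready = deque(sorted(n for n in nodes if indegree[n] == 0))
    pairs = []
    while ready:
        node = ready.popleft()
        for _, dst, versions, text in outgoing[node]:
            pairs.append((frozenset(versions), text))
            indegree[dst] -= 1
            if indegree[dst] == 0:
                ready.append(dst)
    return pairs


def read_pairs(pairs, version):
    """Read one version from the list of pairs."""
    return "".join(text for versions, text in pairs if version in versions)


pairs = graph_to_pairs(arcs)
for v in ("A", "B"):
    assert walk_graph(arcs, v, 0, 3) == read_pairs(pairs, v)
```

The ordering is what matters here: because every arc is emitted only after all the arcs leading into its starting node, the fragments of any one version appear in the list in the same order as along that version's path in the graph.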

The Variant Graph is not a Replacement for Markup: it Complements it

A Multi-Version Document cleanly separates content from variation. The content of a document is expressed by the textual fragments in the list, or by the textual labels to the arcs in the graph. The structure of the document, its variation, on the other hand, is expressed by the order of the pairs and by their sets of versions.

This means that any technology can be used to represent the content, even ordinary markup. Since all the overlapping structures have been removed and placed in the Variant Graph structure, the markup can be simple enough to handle in a wiki. Of course, we are not tied to markup. If markup becomes obsolete in future, we can still use Multi-Version Documents to record the content using some other technology, in binary form for example.

A Multi-Version Document doesn't, and need not, represent any of the complexities of text relating to content. It just represents versions and its complexity ends right there.

What we have so far

The first publication of the idea of 'network text' (submitted 2004) was in:

Schmidt, D. 2006 'A Graphical Editor for Manuscripts' Literary and Linguistic Computing 21: 341-351.

The first publication of the variant graph idea was in:

Schmidt, D., and Wyeld, T., 2005. 'A novel user interface for online literary documents.' ACM International Conference Proceeding Series 122, 1--4.

Subsequent conference papers appeared in:

Schmidt, D., and Fiormonte, D., 2006. 'A Fresh Computational Approach to Textual Variation', in: The First International Conference of the Alliance of Digital Humanities Organisations (ADHO) 5-9 July Paris-Sorbonne, Conference Abstracts, 193--196.

Schmidt, D. and Fiormonte, D. 2007. 'Documenti Multiversione: una soluzione per gli artefatti testuali del patrimonio culturale / Multi-Version Documents: a Digitisation Solution for Textual Cultural Heritage Artefacts'. In Bordoni, L. (ed.) Proceedings of the AI*IA Workshop for Cultural Heritage. 10th Congress of Italian Association for Artificial Intelligence, Università di Roma Tor Vergata, Villa Mondragone, 10 settembre 2007, 9-16.

This was subsequently accepted for Intelligenza Artificiale (see below)

Schmidt, D., Brocca, N., Fiormonte, D. 'A Multi-Version Wiki', Proceedings of Digital Humanities 2008, Oulu, Finland, June 2008

Schmidt, D. and Colomb, R., 2009. 'A Data Structure for Representing Multi-version Texts Online', International Journal of Human Computer Studies 67.6, pp. 497-514.

Schmidt, D., 2009. Merging Multi-Version Texts: a General Solution to the Overlap Problem, in The Markup Conference 2009 Proceedings, Montreal, August.

Schmidt, D., 2010. The Inadequacy of Embedded Markup for Cultural Heritage Texts. Literary and Linguistic Computing, 25.3, 337-356.

Schmidt, D., Fiormonte, D., 2010. Documenti multiversione: una soluzione per gli artefatti testuali del patrimonio culturale/Multi-version documents: a digitisation solution for textual cultural heritage artefacts. Intelligenza artificiale, IV.1 (Dec) 56-61.

'The Role of Markup in the Digital Humanities', Historical and Social Research / Historische Sozialforschung 37.3 (2012), 125-146.

Schmidt, D., 2013. "Collation on the Web", Digital Humanities 2013 Abstracts.

Schmidt, D., 2014. Towards an Interoperable Digital Scholarly Edition, Journal of the Text Encoding Initiative. 7 (forthcoming)


Dominguero said...

Maybe I did not understand what you're saying, but I was wondering if it is necessary to exclude "multiple drafts of a single text" from the realm of Multi-Version Documents and their representation. In the case of Sanvitale, for example (see Digital Variants), we have one short story in eight writing passages. I don't see why these should not be represented by the MVD model.

desmond said...

Domenico, the model covers all forms of texts that exist in multiple versions, including "multiple drafts of a single text". If you examine the images of the Sibylline Gospel closely you will see that some of the versions are actually revisions of single manuscripts, just as when the author revises his own text. For example, "A2 (Angers, Bibl. Publ. 26 (22), saec. XIII post correctionem)". There are not so many changes as in genetic texts, but nevertheless there are revisions. When you first commit a text to an MVD you choose the version that is to be its base. In the case of a revised text this is always the previous version. In the case of a separate manuscript it is up to the editor to choose, perhaps the reference text of that recension.

We really need to demonstrate this at Oulu, that the model covers all types of textual variation, both those traditionally covered by the classical and bibliographical models or by the genetic model, and also the kinds of overlapping hierarchies you find in linguistics texts.