Sunday, March 16, 2008

Merging N versions into an MVD

Tonight I finally managed to automatically merge 3 versions of a short story in Spanish into a valid MVD, after literally several years of trying. Previously I had to create MVDs manually for all my demos. Now I have an automatic tool for doing it accurately and fairly quickly. The steps are quite simple, and will be the same steps you would use in the multi-version wiki to build up a multi-version document.

  1. Starting with an empty document containing no text at all, update it with the full text of version 1. This produces an MVD with only one version.
  2. Now update the MVD with the text of the second version. The program does three things:
    1. It calculates the differences between version 1 and version 2. These identify parts of version 2 that are the same, or are inserted, deleted, or alternatives to the corresponding parts of version 1. The differences are normally aligned by word or, optionally, by character.
    2. It 'stitches in' the differences using version 1 as a base text to produce a correct MVD of two versions.
    3. It optimises the MVD so that any two arcs that join the same pair of nodes and contain the same text (in different versions, of course) are merged into one arc. This happens all the time. Suppose, for example, you have a text containing the letter 'A', change it to 'B', and then change it back to 'A' again. This produces three separate arcs for the three versions when you actually want two arcs (or two pairs): one for versions '1,3' containing the text 'A' and another for version '2' containing the text 'B'.
  3. Finally, update the MVD using version 2 as the base of version 3. If you have more versions, then you just repeat this step as often as needed. You would normally choose a different base text each time, preferably one quite similar to the new one you are committing.
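The first and third sub-steps of an update can be illustrated with a short Python sketch. This is not the Nmerge code: it uses the standard-library difflib for the word alignment, and the function names word_diff and merge_identical_arcs, as well as the (from_node, to_node, text, versions) layout of an arc, are assumptions made purely for illustration. The 'stitching in' step is omitted.

```python
from difflib import SequenceMatcher

def word_diff(base, new):
    """Align two texts by word and label each span relative to the base."""
    a, b = base.split(), new.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    ops = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # difflib's tags are 'equal', 'insert', 'delete' and 'replace';
        # 'replace' corresponds to an alternative (variant) reading.
        ops.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return ops

def merge_identical_arcs(arcs):
    """Merge arcs that share the same end nodes and the same text."""
    merged = {}
    for from_node, to_node, text, versions in arcs:
        key = (from_node, to_node, text)
        merged.setdefault(key, set()).update(versions)
    return [(f, t, text, vs) for (f, t, text), vs in merged.items()]

# The 'A' -> 'B' -> 'A' case: three arcs between the same two nodes.
arcs = [(0, 1, "A", {1}), (0, 1, "B", {2}), (0, 1, "A", {3})]
print(merge_identical_arcs(arcs))
print(word_diff("the quick brown fox", "the slow brown fox jumps"))
```

In the example, the two 'A' arcs collapse into a single arc carrying versions 1 and 3, while the 'B' arc for version 2 is left alone.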

The result is an optimal alignment between the new version and the base version, rather than for all possible pairs of versions. For genetic texts at least, a writer conceptually changes an existing base version each time he/she makes an alteration to the text. Insertions, deletions and variants happen in pairwise fashion, not globally. Admittedly this is not true of texts that have evolved over time in many different, physically separate versions, as in the case of the complex manuscript tradition of an ancient text, but I still wonder how useful an alignment optimised over N versions would really be even in that case. Traditionally, at least, variants have always been aligned to a particular base text, whether a lost, shared original, or the editor's version or a copy text etc. The same thing happens in bioinformatics: a multiple sequence alignment program figures out which pairs of sequences are most similar, and then aligns them pairwise. All I am doing is taking away the automatic selection of a base version and replacing it with the user's choice.
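Since the choice of base is left to the user, a tool could still offer a hint about which existing version is closest to the new text. The following Python sketch is only an assumption about how such a hint might be computed, using the word-level similarity ratio from the standard-library difflib. The function name rank_bases and the sample texts are hypothetical, and nothing here is part of Nmerge.

```python
from difflib import SequenceMatcher

def rank_bases(existing, new_text):
    """Return version ids ordered from most to least similar to new_text."""
    new_words = new_text.split()
    scored = []
    for version_id, text in existing.items():
        matcher = SequenceMatcher(None, text.split(), new_words, autojunk=False)
        scored.append((matcher.ratio(), version_id))
    # Highest similarity first; ties are broken by version id.
    return [vid for ratio, vid in sorted(scored, key=lambda p: (-p[0], p[1]))]

existing = {
    1: "the cat sat on the mat",
    2: "the cat sat on the red mat",
    3: "a dog lay on the floor",
}
print(rank_bases(existing, "the cat sat on the red rug"))
```

The ranking is only a suggestion: the user remains free to pick any version as the base.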

The same process works for updating an existing version as well as for adding a new one. If you only change a few words, as you typically would during an editing session, the response time will be only a few milliseconds. In my test, adding all of version 2 to version 1 took only 0.7 seconds. Adding version 3 to version 2, however, took over 12 seconds. It all depends on the number of differences.

Unresolved issues

I haven't yet worked out how to include transpositions in this process, although it ought to be possible. The MVD format supports them fully, but we obviously can't calculate them automatically or it would take forever. I think the user should specify them, because they are always visible in the original texts, rather than let a machine 'calculate' them somehow, probably badly.

There is also the question of 'transclusion', to misuse Nelson's term. What I mean by transclusion is altering the text of one version and having that change also applied to any other versions that share the same text. This ought to be optional, but it would make the wiki really useful. Imagine an aircraft manual made up of a set of systems shared across different aircraft. Updating one system ought to be propagated automatically to all the other manuals. Again, this is possible, but I couldn't fully work out the details in this version of the multi-version document platform, which I call Nmerge.

Structure of the multi-version Wiki

Here's a drawing of the overall structure of the wiki. The wiki module itself is called Phaidros, after the dialogue in Plato where Socrates criticises the medium of writing:

The next stage is to build the Phaidros web application. I have an old version that just listed one version of an MVD. I need to add at least the facility of editing the source in XML and committing the changes back to the MVD to turn it into a wiki. Then we will have proof of concept, but the hard work is now done.

Wednesday, March 5, 2008

What's a Multi-Version Document?

What's it for?

A multi-version document is for recording texts that exist in multiple versions. This mustn't be confused with recording multiple drafts of a single text you might be working on. Instead, a multi-version document is for recording the non-linear structure of a text, like the Greek New Testament, which exists in thousands of versions, or equally the text of a modern literary or philosophical work, which might have been published in several versions or may exist only in the form of manuscripts heavily edited by their author.

How does it work?

A multi-version document represents a text as a set of merged versions in a single digital entity, which can be efficiently edited, and its versions listed, compared and searched. Versions can overlap freely, which overcomes a limitation of markup languages: they rest on the formal generative grammars invented by linguists in the 1950s, which underlie all markup systems today. A multi-version document is represented as a list of text fragments t_i, each of which is assigned a set of versions v_i:

{v_1, t_1}, {v_2, t_2}, ... {v_n, t_n}

This extremely simple form is all you need to record texts that contain thousands of versions. It is a form of digital text that trades complexity for mere size, and size is something modern computers are very good at handling. The structure of the document is implied by the intersection of the versions of each fragment. For example, to read a single version all you need to do is read through the list, picking out the fragments that belong to it. Other common operations, such as comparing two texts to find the differences, or printing the variants of a particular range within a given version are just as easy.
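A minimal Python sketch of the list of pairs shows how reading and comparing versions reduce to a single pass over the list. The sample fragments and the function names read_version and compare are illustrative assumptions and do not reflect the actual MVD file format.

```python
# An MVD as an ordered list of (set of versions, text fragment) pairs.
# The data below is made up for illustration.
mvd = [
    ({1, 2, 3}, "The "),
    ({1, 3}, "quick "),
    ({2}, "slow "),
    ({1, 2, 3}, "fox"),
]

def read_version(mvd, version):
    """Read one version by picking out the fragments that belong to it."""
    return "".join(text for versions, text in mvd if version in versions)

def compare(mvd, a, b):
    """Return the fragments found only in version a and only in version b."""
    only_a = [text for versions, text in mvd if a in versions and b not in versions]
    only_b = [text for versions, text in mvd if b in versions and a not in versions]
    return only_a, only_b

print(read_version(mvd, 1))   # The quick fox
print(read_version(mvd, 2))   # The slow fox
print(compare(mvd, 1, 2))     # (['quick '], ['slow '])
```

Versions 1 and 3 share every fragment here, so comparing them yields no differences at all.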

The mathematical basis for this model

This form is equivalent to a 'graph', a set of intermingling paths that start at one point, branch, rejoin and split again, until they all join back together at the end. It has been proven, in the International Journal of Human-Computer Studies paper cited below, that these two forms of multi-version text, namely:

  1. the intuitive graph representation, and
  2. the list of pairs described above

are equivalent, that is, we can transform one into the other and back again with no loss of information. This kind of solid mathematical basis is in contrast to previous attempts at representing versions and overlapping structures in digital text, all of which were based on markup, which can only efficiently represent hierarchical structures. As many humanists and linguists have discovered, natural texts in their disciplines are much more frequently overlapping in structure than they are hierarchical.

The Variant Graph is not a Replacement for Markup: it Complements it

A Multi-Version Document cleanly separates content from variation. The content of a document is expressed by the textual fragments in the list, or by the textual labels to the arcs in the graph. The structure of the document, its variation, on the other hand, is expressed by the order of the pairs and by their sets of versions.

This means that any technology can be used to represent the content, even ordinary markup. Since all the overlapping structures have been removed and placed in the Variant Graph structure the markup can be simple enough to handle in a wiki. Of course, we are not tied to markup. If in future markup becomes obsolete we can still use Multi-Version Documents to record the content using some other technology, in binary form for example.

A Multi-Version Document doesn't, and need not, represent any of the complexities of text relating to content. It just represents versions and its complexity ends right there.

What we have so far

The first publication of the idea of 'network text' (submitted 2004) was in:

Schmidt, D. 2006 'A Graphical Editor for Manuscripts' Literary and Linguistic Computing 21: 341-351.

The first publication of the variant graph idea was in:

Schmidt, D., and Wyeld, T., 2005. 'A novel user interface for online literary documents.' ACM International Conference Proceeding Series 122, 1-4.

Subsequent conference papers appeared in:

Schmidt, D., and Fiormonte, D., 2006. 'A Fresh Computational Approach to Textual Variation', in: The First International Conference of the Alliance of Digital Humanities Organisations (ADHO) 5-9 July Paris-Sorbonne, Conference Abstracts, 193-196.

Schmidt, D. and Fiormonte, D. 2007. 'Documenti Multiversione: una soluzione per gli artefatti testuali del patrimonio culturale / Multi-Version Documents: a Digitisation Solution for Textual Cultural Heritage Artefacts'. In Bordoni, L. (ed.) Proceedings of the AI*IA Workshop for Cultural Heritage. 10th Congress of Italian Association for Artificial Intelligence, Università di Roma Tor Vergata, Villa Mondragone, 10 settembre 2007, 9-16.

This was subsequently accepted for Intelligenza Artificiale (see below)

Schmidt, D., Brocca, N., Fiormonte, D. 'A Multi-Version Wiki', Proceedings of Digital Humanities 2008, Oulu, Finland, June 2008

Schmidt, D. and Colomb, R., 2009. 'A Data Structure for Representing Multi-version Texts Online', International Journal of Human Computer Studies 67.6, pp. 497-514.

Schmidt, D., 2009. Merging Multi-Version Texts: a General Solution to the Overlap Problem, in The Markup Conference 2009 Proceedings, Montreal, August.

Schmidt, D., 2010. The Inadequacy of Embedded Markup for Cultural Heritage Texts. Literary and Linguistic Computing, 25.3, 337-356.

Schmidt, D., Fiormonte, D., 2010. Documenti multiversione: una soluzione per gli artefatti testuali del patrimonio culturale/Multi-version documents: a digitisation solution for textual cultural heritage artefacts. Intelligenza artificiale, IV.1 (Dec) 56-61.

Schmidt, D., 2012. 'The Role of Markup in the Digital Humanities', Historical Social Research / Historische Sozialforschung 37.3, 125-146.

Schmidt, D., 2013. "Collation on the Web", Digital Humanities 2013 Abstracts.

Schmidt, D., 2014. Towards an Interoperable Digital Scholarly Edition, Journal of the Text Encoding Initiative. 7 (forthcoming)