Sunday, March 16, 2008

Merging N versions into an MVD

Tonight I finally managed to automatically merge 3 versions of a short story in Spanish into a valid MVD, after literally several years of trying. Previously I had to create MVDs manually for all my demos. Now I have an automatic tool for doing it accurately and fairly quickly. The steps are quite simple, and will be the same steps you would use in the multi-version wiki to build up a multi-version document.

  1. Starting with an empty document (one that contains no text yet), update it with the full text of version 1. This produces an MVD with only one version.
  2. Now update the MVD with the text of the second version. The program does three things:
    1. It calculates the differences between version 1 and version 2. These identify parts of version 2 that are the same, or are inserted, deleted, or alternatives to the corresponding parts of version 1. The differences are normally aligned by word or, optionally, by character.
    2. It 'stitches in' the differences using version 1 as a base text to produce a correct MVD of two versions.
    3. It optimises the MVD so that any two arcs that run between the same pair of nodes and contain the same text (in different versions, of course) are merged into one arc. This happens all the time. Suppose, for example, that you have a text containing the letter 'A', change it to 'B', and then change it back to 'A' again. This produces three separate arcs for the three versions when you actually want only two arcs, or pairs: one for versions '1,3' containing the text 'A' and another for version '2' containing the text 'B'.
  3. Finally, update the MVD using version 2 as the base of version 3. If you have more versions, then you just repeat this step as often as needed. You would normally choose a different base text each time, preferably one quite similar to the new one you are committing.
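
To make steps 2.1 and 2.3 a little more concrete, here is a minimal sketch in Python. It is not the Nmerge code: the function names word_diff and merge_identical_arcs and the representation of an arc as a (from-node, to-node, versions, text) tuple are assumptions made purely for illustration, and the standard library's difflib stands in for whatever difference algorithm the real tool uses. The 'stitching in' of step 2.2 is left out.

```python
from difflib import SequenceMatcher


def word_diff(base, new):
    """Align `new` against `base` word by word.

    Returns (tag, base_text, new_text) triples, where tag is one of
    'equal', 'insert', 'delete' or 'replace' (an alternative reading).
    """
    a, b = base.split(), new.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, " ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


def merge_identical_arcs(arcs):
    """Merge arcs that join the same two nodes and carry the same text.

    Each arc is a hypothetical (src, dst, versions, text) tuple, with
    `versions` a set of version numbers. The versions of matching arcs
    are unioned; the order of first appearance is kept.
    """
    merged = {}
    for src, dst, versions, text in arcs:
        merged.setdefault((src, dst, text), set()).update(versions)
    return [(src, dst, vs, text) for (src, dst, text), vs in merged.items()]


if __name__ == "__main__":
    print(word_diff("the cat sat", "the dog sat"))
    # The 'A' -> 'B' -> 'A' case from step 2.3
    print(merge_identical_arcs([(0, 1, {1}, "A"), (0, 1, {2}, "B"), (0, 1, {3}, "A")]))
```

In the 'A', 'B', 'A' case the three input arcs collapse to two, one shared by versions 1 and 3 and one for version 2 alone.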

The result is an optimal alignment between the new version and the base version, rather than one optimised over all possible pairs of versions. For genetic texts at least, a writer conceptually changes an existing base version each time he/she makes an alteration to the text. Insertions, deletions and variants happen in pairwise fashion, not globally. Admittedly this is not true of texts that have evolved over time in many different, physically separate versions, as in the case of the complex manuscript tradition of an ancient text, but I still wonder how useful an alignment optimised over N versions would really be even in that case. Traditionally, at least, variants have always been aligned to a particular base text, whether a lost, shared original, the editor's version, a copy text or something similar. The same thing happens in bioinformatics: a multiple sequence alignment program figures out which pairs of sequences are most similar, and then aligns them pairwise. All I am doing is taking away the automatic selection of a base version and replacing it with the user's choice.
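
The choice of base stays with the user, but a program could still offer a hint about which existing version is closest to the new one. The sketch below is only an illustration of that idea and is not part of Nmerge: rank_bases is a hypothetical helper, and the word-level similarity ratio from Python's difflib is an assumed stand-in for a proper similarity measure.

```python
from difflib import SequenceMatcher


def rank_bases(versions, new_text):
    """Rank existing versions by word-level similarity to `new_text`.

    `versions` maps a version name to its full text. Returns a list of
    (name, ratio) pairs, most similar first; ratio lies between 0 and 1.
    """
    new_words = new_text.split()
    scored = []
    for name, text in versions.items():
        matcher = SequenceMatcher(None, text.split(), new_words, autojunk=False)
        scored.append((name, matcher.ratio()))
    # sorted() is stable, so versions with equal ratios keep their input order
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    existing = {
        "v1": "the cat sat on a mat",
        "v2": "the dog sat on the rug",
    }
    for name, ratio in rank_bases(existing, "the dog sat on the mat"):
        print(name, round(ratio, 3))
```

The user would remain free to ignore the ranking, for instance when the new version is known to derive from an older, less similar text.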

The same process as the one described above works for updates of an existing version as well as for adding a new version. If you only change a few words, as you would during a typical editing session, the response time will usually be only a few milliseconds. In my test, adding all of version 2 to version 1 took only 0.7 seconds. Adding version 3 to version 2, however, took over 12 seconds. It all depends on the number of differences.

Unresolved issues

I haven't yet worked out how to include transpositions in this process, although it ought to be possible. The MVD format supports them fully, but we obviously can't calculate them or it would take forever. I think the user should specify them, because they are always visible in the original texts, rather than let a machine 'calculate' them somehow, probably badly.

There is also the question of 'transclusion', to misuse Nelson's term. What I mean by transclusion is altering the text of one version and having that change also applied to any other versions that share the same text. This ought to be optional, but it would make the wiki really useful. Imagine an aircraft manual made up of a set of systems shared across different aircraft. Updating one system ought to be propagated automatically to all the other manuals. Again, this is possible, but I couldn't fully work out the details in this version of the multi-version document platform, which I call Nmerge.

Structure of the multi-version Wiki

Here's a drawing of the overall structure of the wiki. The wiki module itself is called Phaidros, after the dialogue in Plato where Socrates criticises the medium of writing:

The next stage is to build the Phaidros web application. I have an old version that just listed one version of an MVD. I need to add at least the facility of editing the source in XML and committing the changes back to the MVD to turn it into a wiki. Then we will have proof of concept, but the hard work is now done.
