Multi-Version Documents: July 2010

Tuesday, July 27, 2010

Greek MVDs

Having come from a background in classics, it came as a shock to get a recent query about ancient Greek texts. And of course behind the problem was a bug. What nmerge actually does is merge a set of versions on byte boundaries, not character boundaries, and one consequence of this is that in multi-byte encodings characters can get split. When you read an entire version this isn't a problem, but what if you want to compute the variants of a text? Then the bits that vary might be only part of a character. This can play havoc with encodings like UTF-8 when they are used to encode anything beyond plain ASCII. So Greek was an acid test, and although nmerge initially wasn't good enough, I fixed the problem by migrating the partial characters after the alignment to the correct 'side' of the arc. So no more splits.
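To make the kind of split described above concrete, here is a minimal Python sketch. It is not nmerge's code: the function names are invented for illustration, and it only shows the general idea of moving a byte-level split point back to the start of a UTF-8 character, so that the whole character lands on one side of the split.

```python
def is_continuation(byte: int) -> bool:
    """UTF-8 continuation bytes have the bit pattern 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def snap_to_char_start(data: bytes, offset: int) -> int:
    """Move a split point left until it sits on the first byte of a character."""
    while 0 < offset < len(data) and is_continuation(data[offset]):
        offset -= 1
    return offset


def common_prefix_len(a: bytes, b: bytes) -> int:
    """Length in bytes of the shared prefix, as a byte-wise alignment would see it."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Two Greek letters that differ only in their last UTF-8 byte.
version_a = "\u1f41".encode("utf-8")  # omicron with rough breathing: E1 BD 81
version_b = "\u1f45".encode("utf-8")  # same letter plus acute accent: E1 BD 85

byte_split = common_prefix_len(version_a, version_b)   # falls inside the character
char_split = snap_to_char_start(version_a, byte_split) # moved back to its first byte
```

Here the byte-wise comparison reports a shared prefix of two bytes, which is not valid UTF-8 on its own, while the snapped split point treats the whole three-byte character as the variant.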

Here's the output of issuing an nmerge -c variants command on two versions of Athenaeus' Deipnosophists, with version 'A' as the base:

[B:συντετάσθαι]
[B:ὁ]
[B:ἔφη, ὥρα]
[B:κἂν]
[B:διαστησῶμεθ’]
[B:τραγῳδίαν·]
[B:αἰνίγμασιν· ἱκανῶς]

I've begun to realise though that what this needs is a reference system so the user can relate the apparatus to the text. Which is more work, of course.

Saturday, July 17, 2010

Improvements to Apparatus

One advantage of Multi-Version Documents is that generating an apparatus is so easy. There is just a simple command in nmerge: you specify the range (offset and length) and the desired base text, and it then computes the traditional 'apparatus' type display of all the variants, aligned on word boundaries. Maybe it's old-fashioned and print-related, but it does show you many versions of the text in a very compact way. So I think it's still useful.

The problem I had been struggling with for the past couple of weeks was how to ensure that this range in the MVD could be specified precisely via a selection in the GUI. Of course what the user sees is not the contents of an MVD. It is extracted and transformed via XSLT (at the moment) and the user selection in HTML bears no clear relation to the corresponding selection in the underlying data. The problem boils down to aligning the XML and HTML versions of the text fast enough for the user not to notice. There are plenty of techniques for doing this, but they all take waaaay too long. I wanted it in fractions of a second. After perhaps the sixth try my new method finds the correct answer in around 28 milliseconds for the King Lear example in slow old PHP. What is perhaps most annoying is that the method I used was incredibly simple. It's just 58 lines of code. Strange that you can never see the simple things that are right under your nose. :-) And when you finally have the answer you can't explain why it didn't occur to you earlier.

If anyone really wants to know how I did it they can download the MVD_GUI code to find out. I'm not going to bore you all with technical details here. You might have to wait until I update the Google code site.

Friday, July 9, 2010

Tree View

Tree View is finally working. What this does is compute the genealogical tree of a set of versions. Although this is normally of use mostly for manuscript traditions, I believe that it is also useful for printed works. It can show at a glance the relationships between texts that make up a work. Previous attempts to do this (by others) were based on collation output and didn't take account of invariants, only variants. I think this casts doubt on the accuracy of the result. Also, rather than being offline and manual this method is online and automatic. There's a basic zoom facility which is useful for the larger trees. Changing any of the options recomputes the tree. Check it out at Harpur.

Here's a small sample from the DV website: the relationships between 9 texts of Vincenzo Cerami's The Serpent Woman. This was published in a newspaper (so it's kind of a print tradition) and the author made available the pre-texts in the form of edited drafts. The length of the branches is significant (it indicates the distance between versions), but in case this gets confusing you can make all the lengths the same.

If you are wondering how I produce this online the process is basically:

1. Query the MVD to produce a difference matrix (the edit distance of each version from each other version).
2. Pipe the result into the Fitch-Margoliash tree-building program from Phylip.
3. Pipe the result into drawtree from Phylip. This outputs a PostScript version of the diagram.
4. Pipe the result of that into Ghostscript to produce a temporary JPG file, which you can view.

All this is done by executing a succession of binaries using exec() in PHP. I had to adapt fitch and drawtree extensively to get this to work with pipes. Fitch chokes a bit on the biggest tree (Sibylline Gospel), but that's to be expected. It does work, though.
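The first step of that process can be sketched in Python as follows. This is only an illustration under stated assumptions: the real distances come from querying the MVD, whereas this sketch computes a plain Levenshtein distance between short strings, and the function names and sample versions are invented. The output follows the square distance-matrix layout that PHYLIP documents (a count line, then one row per version with the name padded to 10 characters).

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance by dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def phylip_matrix(versions: dict) -> str:
    """Format pairwise distances as a square PHYLIP-style distance matrix."""
    names = list(versions)
    lines = [f"{len(names):5d}"]
    for n in names:
        row = " ".join(f"{edit_distance(versions[n], versions[m]):.4f}"
                       for m in names)
        lines.append(f"{n[:10]:<10}{row}")  # name truncated/padded to 10 chars
    return "\n".join(lines) + "\n"


# Hypothetical sample data standing in for versions stored in an MVD.
versions = {"A": "kitten", "B": "sitting", "C": "kitten"}
matrix = phylip_matrix(versions)
print(matrix, end="")
```

A matrix like this is the kind of input a distance-based tree builder expects; how it is handed to the adapted fitch binary through pipes is specific to the setup described above and is not shown here.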