Multi-Version Documents: Alpha Prototype Ready

I am renaming the multi-version wiki Alpha, simply because it's easier to say than Phaidros. It's a bit of a joke, really, because 'Alpha' was just my description of the product I developed for DH2008: it was the 'alpha' release.

The old Alpha didn't do transpositions, and I have been labouring hard for the past year to remedy this deficiency. NMerge had been revised to support transpositions, but I hadn't yet integrated it into the multi-version wiki. When I finally saw the output of the new nmerge in the web browser, it was suddenly clear that there were still some bugs in the transposition algorithm. Finding out exactly what was going wrong took about a week of solid debugging, but it is done now and I am finally satisfied. Now I have something to take to Montréal to show the audience, and I can say: 'Hey folks, you said this conference was all about theory, but here's something that actually works.' I think that is a pretty good argument.

In this screendump of part of the TwinView of Galiano's 'El mapa de las aguas' you can see the transposition of 'otras de un hachazo' from after 'de un bocado rabioso' (in version B, left) to before it (in version C, right). Detecting cases like this consistently by hand would be nearly impossible.
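To give a rough idea of what detecting a transposition involves, here is a minimal sketch in Python. It is not how nmerge works: it uses the standard library's difflib as a stand-in, and the function name find_transpositions and the two toy word lists are my own inventions, built only from the two phrases quoted above rather than from Galiano's full text. It treats a run of words that is deleted in one place and inserted verbatim in another as moved.

```python
from difflib import SequenceMatcher

def find_transpositions(old_words, new_words):
    """Pair each deleted run of words with an identical inserted run."""
    matcher = SequenceMatcher(None, old_words, new_words, autojunk=False)
    deleted, inserted = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "delete":
            deleted.append(old_words[i1:i2])
        elif tag == "insert":
            inserted.append(new_words[j1:j2])
    # A run that vanishes in one place and reappears verbatim elsewhere
    # is reported as moved rather than as two unrelated changes.
    return [run for run in deleted if run in inserted]

# Toy versions (hypothetical), made only from the two quoted phrases.
version_b = "de un bocado rabioso otras de un hachazo".split()
version_c = "otras de un hachazo de un bocado rabioso".split()

moved = find_transpositions(version_b, version_c)
print(" ".join(moved[0]))  # otras de un hachazo
```

This naive pairing ignores runs that were moved and also slightly altered, and it only compares two versions at a time, so it should be read as an illustration of the problem, not a solution to it.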

Red text is deleted in the left-hand version with respect to the version on the right. Blue text is inserted, and transpositions are shown in grey. Black text is merged, that is, common to both versions; clicking on it, as on transposed text, aligns the text on each side. Using these simple HTML features results in a surprisingly effective UI.
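The colour scheme itself needs very little HTML. The following Python sketch shows one way it could be generated; it is not the wiki's actual code, and the names COLOURS, render and the (kind, text) segment format are assumptions made for the example. The click-to-align behaviour is left out.

```python
from html import escape

# Colour scheme as described above; the dictionary itself is hypothetical.
COLOURS = {
    "deleted": "red",
    "inserted": "blue",
    "transposed": "grey",
    "merged": "black",
}

def render(segments):
    """Turn (kind, text) pairs into a string of coloured HTML spans."""
    return "".join(
        f'<span style="color:{COLOURS[kind]}">{escape(text)}</span>'
        for kind, text in segments
    )

# Toy left-hand side: shared text followed by a transposed phrase.
left_side = [
    ("merged", "de un bocado rabioso "),
    ("transposed", "otras de un hachazo"),
]
html_fragment = render(left_side)
print(html_fragment)
```

Escaping the text before wrapping it keeps characters such as '<' in the source from being mistaken for markup.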

Character-Level vs Word-Level Alignment

The use of character-level alignment by default is new to this version. For example, the expression 'el molino chico' became 'el molino' through the deletion of the character sequence 'o chic'. This goes to show that what humans would expect (the deletion of ' chico') and what the computer detects don't always correspond. I don't think that is a bad thing. The alternative would be to fail to see changes of spelling such as 'desaparecido' for 'desparecido' or the capitalisation of 'Ojos' for 'ojos'. Word-level granularity would flag the whole word and leave readers puzzling over what the difference actually is. It is clearer to see small changes like these highlighted, so I agree with the MEDITE people that character-level alignment is more powerful. After all, you can always reduce character-level granularity to word-level, but if you only have word-level alignment you are stuck with it.
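The difference between the two granularities, and the one-way reduction from characters to words, can be sketched in a few lines of Python. Again this uses the standard difflib module as a stand-in for nmerge, so its alignments need not match what the wiki shows (for 'el molino chico' it reports ' chico', not 'o chic'). The function names changes and touched_words, and the sample sentence built from the two spelling examples, are hypothetical.

```python
import re
from difflib import SequenceMatcher

def changes(old, new):
    """List (tag, old_piece, new_piece) for every non-matching region.
    Accepts strings (character level) or lists of words (word level)."""
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    return [(tag, old[i1:i2], new[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]

def touched_words(old, new):
    """Reduce character-level changes to the words of `old` they touch."""
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    edits = [(i1, i2) for tag, i1, i2, _, _ in matcher.get_opcodes()
             if tag != "equal"]
    words = []
    for m in re.finditer(r"\S+", old):
        for i1, i2 in edits:
            if i1 == i2:   # pure insertion: does it fall inside the word?
                hit = m.start() <= i1 <= m.end()
            else:          # deletion or replacement: do the ranges overlap?
                hit = i1 < m.end() and i2 > m.start()
            if hit:
                words.append(m.group())
                break
    return words

old = "los ojos desparecido"
new = "los Ojos desaparecido"

char_level = changes(old, new)
word_level = changes(old.split(), new.split())
print(char_level)  # [('replace', 'o', 'O'), ('insert', '', 'a')]
print(word_level)  # [('replace', ['ojos', 'desparecido'], ['Ojos', 'desaparecido'])]
print(touched_words(old, new))  # ['ojos', 'desparecido']
```

The character-level result pinpoints the capital 'O' and the added 'a', while the word-level result only says that two words changed. Going from the first to the second is a simple widening step, as touched_words shows; going the other way requires aligning the text again.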

'Collation' programs based on XML use word-level granularity because a finer resolution would make the markup impossibly complex (you'd have to mark up each letter separately). That doesn't have to be a restriction once we abandon the print-oriented concept of 'apparatus'. For the digital medium, at least, a new digital presentation of variation is needed. Let it evolve.


Sunday, July 26, 2009
