Tuesday, July 27, 2010

Greek MVDs

Coming from a background in classics, I was shocked to get a recent query about ancient Greek texts. And of course behind the problem was a bug. What nmerge actually does is merge a set of versions on byte boundaries, not character boundaries, and one consequence is that in multi-byte encodings a character can get split. When you read an entire version this isn't a problem, but what if you want to compute the variants of a text? Then the bits that vary might be only half a character. This can play havoc with encodings like UTF-8, which uses more than one byte for every character outside ASCII, so anything more complex than English is at risk. So Greek was an acid test, and although nmerge initially wasn't good enough, I fixed the problem by migrating the half-characters after the alignment to the correct 'side' of the arc. So no more splits.
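To make the failure concrete, here is a minimal Python sketch of how a byte-level comparison can cut a Greek character in half, and how the cut can be moved back to a character boundary. It is not nmerge's code: the function names (common_prefix_len, snap_to_char_start) and the sample strings are made up for illustration, and it only compares a shared prefix rather than doing a real alignment. It relies on one fact about UTF-8: continuation bytes always have the bit pattern 10xxxxxx.

```python
def common_prefix_len(a: bytes, b: bytes) -> int:
    """Length of the shared prefix of a and b, counted in bytes."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def is_continuation(byte: int) -> bool:
    """True if byte is a UTF-8 continuation byte (bit pattern 10xxxxxx)."""
    return (byte & 0xC0) == 0x80


def snap_to_char_start(data: bytes, pos: int) -> int:
    """Move pos left until it sits on the first byte of a character."""
    while 0 < pos < len(data) and is_continuation(data[pos]):
        pos -= 1
    return pos


# Hypothetical sample data: the two versions differ only in their first letter.
# U+1F41 encodes as E1 BD 81 and U+1F65 as E1 BD A5, so two bytes are shared.
a = "\u1f41 \u03bb\u03cc\u03b3\u03bf\u03c2".encode("utf-8")
b = "\u1f65 \u03bb\u03cc\u03b3\u03bf\u03c2".encode("utf-8")

split = common_prefix_len(a, b)        # lands inside the first character
fixed = snap_to_char_start(a, split)   # moved back to the character's start

try:
    a[split:].decode("utf-8")
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False                   # the variant began with half a character

variant_a = a[fixed:].decode("utf-8")  # decodes cleanly
variant_b = b[fixed:].decode("utf-8")
```

The sketch always pushes the orphaned bytes into the varying part. The real fix has to decide, arc by arc, which side each half-character belongs to.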

Here's the output of issuing an nmerge -c variants command on two versions of Athenaeus' Deipnosophists, with version 'A' as the base:

[B:συντετάσθαι]
[B:ὁ]
[B:ἔφη, ὥρα]
[B:κἂν]
[B:διαστησῶμεθ’]
[B:τραγῳδίαν·]
[B:αἰνίγμασιν· ἱκανῶς]

I've begun to realise, though, that what this needs is a reference system so the user can relate the apparatus to the text. Which is more work, of course.

1 comment:

Anonymous said...

I totally agree about the need for a reference system to relate the apparatus to the text. It would be great to see it implemented in the Joomla GUI at some point!