Coming from a background in classics, I found it a shock to get a recent query about ancient Greek texts. And of course behind the problem was a bug. What nmerge actually does is merge a set of versions on byte boundaries, not character boundaries, and one consequence is that in multi-byte encodings a character can get split. When you read an entire version this isn't a problem, but what if you want to compute the variants of a text? Then the bits that vary might be only half a character. This can play havoc with encodings like UTF-8 when they are used to encode anything more complex than English. So Greek was an acid test, and although nmerge initially wasn't good enough, I fixed the problem by migrating the half-characters after the alignment to the correct 'side' of the arc. So no more splits.
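To make the half-character problem concrete, here is a minimal Python sketch. It is not nmerge's code, and the function names are made up for illustration. It assumes the simplest possible case: two versions that share a common prefix, where a byte-wise comparison stops in the middle of a Greek letter and the boundary then has to be moved back to the start of that letter.

```python
def is_continuation(byte: int) -> bool:
    """True if the byte is a UTF-8 continuation byte (bit pattern 10xxxxxx)."""
    return (byte & 0xC0) == 0x80


def common_prefix_len(a: bytes, b: bytes) -> int:
    """Length in bytes of the shared prefix, compared byte by byte."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def snap_to_char_start(data: bytes, offset: int) -> int:
    """Move a byte offset left until it sits on the first byte of a character."""
    while 0 < offset < len(data) and is_continuation(data[offset]):
        offset -= 1
    return offset


# Hypothetical two-letter versions: both start with kappa, then differ.
a = "κα".encode("utf-8")  # bytes CE BA CE B1
b = "κβ".encode("utf-8")  # bytes CE BA CE B2

raw = common_prefix_len(a, b)       # stops inside the second letter
fixed = snap_to_char_start(a, raw)  # moved back to the letter's first byte

print(raw, fixed)
print(a[fixed:].decode("utf-8"), b[fixed:].decode("utf-8"))
```

With the raw offset the variant in version A would be the single byte B1, which is not valid UTF-8 on its own. After snapping, each variant is a whole letter. The real fix has to do the same kind of adjustment on every arc of the aligned graph, not just on a common prefix.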
Here's the output of issuing an nmerge -c variants command on two versions of Athenaeus' Deipnosophists, with version 'A' as the base:
[B:συντετάσθαι] [B:ὁ] [B:ἔφη, ὥρα] [B:κἂν] [B:διαστησῶμεθ’] [B:τραγῳδίαν·] [B:αἰνίγμασιν· ἱκανῶς]
I've begun to realise though that what this needs is a reference system so the user can relate the apparatus to the text. Which is more work, of course.