Thursday, May 14, 2009

HyperNietzsche vs MVD

I decided after all to make some general remarks about the recently proposed 'Encoding Model for Genetic Editions' being promoted by the HyperNietzsche people and the TEI. Since it is being put forward as a rival solution for a small subset of the multi-version texts covered by my own solution, I thought that readers of this blog might like to know the main reasons why I think that the MVD technology is much the better of the two.

One Work = One Text

Because it is difficult to record many versions in one file using markup, the proposal recommends a document-centric approach, in which each physical document is encoded separately, even when the documents are merely drafts of a single text. As a result there is a great deal of redundancy in the representation. Variants between documents are interconnected by means of links weighted with a probability, and the authors see in this their main advantage over MVD. But this is based purely on a misunderstanding of the MVD model. The weights can of course be encoded in the version information of the MVD as user-constructed paths: we can have an 80% probable version and a 20% probable version just as easily as physical versions.
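To make the point concrete, here is a minimal sketch in Python (invented names and data, not the actual MVD or nmerge API) of the idea that an MVD is essentially an ordered list of pairs <set of versions, text fragment>, and that a weighted reading is simply another user-constructed path through the same data:

```python
# A minimal sketch (not the real MVD/nmerge API): an MVD as an ordered list
# of pairs <set-of-versions, text-fragment>. A "reading weighted at 80%" is
# just another named path through the structure.

# Each pair says: this fragment of text belongs to these versions.
mvd = [
    ({"A", "B", "p80", "p20"}, "The quick "),
    ({"A", "p80"},             "brown "),
    ({"B", "p20"},             "red "),
    ({"A", "B", "p80", "p20"}, "fox."),
]

# Version metadata can carry a transcriber's weighting as ordinary data.
versions = {
    "A":   {"description": "first draft"},
    "B":   {"description": "second draft"},
    "p80": {"description": "editor's preferred reading", "probability": 0.8},
    "p20": {"description": "alternative reading",        "probability": 0.2},
}

def text_of(version: str) -> str:
    """Reconstruct one version by concatenating the fragments it shares."""
    return "".join(frag for vset, frag in mvd if version in vset)

print(text_of("A"))    # The quick brown fox.
print(text_of("p20"))  # The quick red fox.
```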

Actually I think it is wrong to encode one transcriber's opinion about the probability that a certain combination of variants is 'correct'. A transcription should just record the text and any interpretations should be kept separate. How else can it be shared? The display of alternative paths is a task for the software, mediated by the user's preferences.

The main disadvantage of having multiple copies of the same text is that every subsequent operation on the text has to re-establish or maintain the connections between the bits that are supposed to be the same, so there is much more work to do than in an MVD. I believe that text that is the same across versions should literally be the same text: this simplifies the whole approach to multi-version texts. Nor do I believe that humanists want to maintain complex markup that essentially records the interconnections between versions, when the same information can be recorded automatically as simple identity.
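Continuing the sketch above, once shared text is literally the same fragment, questions such as 'what do two versions have in common?' or 'where do they differ?' require no re-collation at all - they are simple filters over the list:

```python
# Sketch continued from above: because shared text is stored once, comparing
# versions is a filter over the existing structure, not a fresh alignment.

def common(v1: str, v2: str):
    """Fragments shared by both versions."""
    return [frag for vset, frag in mvd if v1 in vset and v2 in vset]

def variants(v1: str, v2: str):
    """Fragments belonging to one version but not the other."""
    return [(sorted(vset), frag) for vset, frag in mvd
            if (v1 in vset) != (v2 in vset)]

print(common("A", "B"))    # ['The quick ', 'fox.']
print(variants("A", "B"))  # [(['A', 'p80'], 'brown '), (['B', 'p20'], 'red ')]
```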

OHCO Thesis Redux

The section on 'grouping changes' implies that manuscript texts have a structure that can be broken down into a hierarchy of changes which can be conveniently grouped and nested arbitrarily. Similarly, in section 4.1 a strict hierarchy is imposed, consisting of document -> writing surface -> zone -> line. Since Barnard's paper of 1988, which pointed out the inherent failure of markup to adequately represent even a simple case of nested speeches and lines in Shakespeare - sometimes a line is spread over two speeches - the problem of overlap has become the dominant issue in the digital encoding of historical texts. This representation, which seeks to reassert the OHCO thesis - a thesis withdrawn by its own authors - will fail to represent these genetic texts adequately until it is recognised that they are fundamentally non-hierarchical. The last 20 years of research cannot simply be ignored: it is no longer possible to propose something for the future that does not address the overlap problem. And MVD neatly disposes of it.
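Barnard's problem is easy to state in terms of ranges. The following sketch (with invented character offsets) shows why a verse line that straddles a speech boundary cannot be fitted into a single hierarchy:

```python
# Speeches and verse lines are two hierarchies over the same text, and they
# cannot always be nested. Ranges are (start, end) character offsets; the
# offsets here are invented for illustration.

speeches = [(0, 40), (40, 90)]            # speaker changes at offset 40
lines    = [(0, 25), (25, 55), (55, 90)]  # one verse line runs from 25 to 55

def crosses(a, b):
    """True if ranges a and b overlap without either containing the other."""
    return a[0] < b[0] < a[1] < b[1] or b[0] < a[0] < b[1] < a[1]

conflicts = [(s, l) for s in speeches for l in lines if crosses(s, l)]
print(conflicts)  # [((0, 40), (25, 55)), ((40, 90), (25, 55))]
# The line (25, 55) straddles the speech boundary at offset 40, so no single
# XML hierarchy can contain both the speech and the line without splitting one.
```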

Collation of XML Texts

I am also curious as to how they propose to 'collate' XML documents arranged in this structure, especially when the variants are distributed via two mechanisms: as markup within individual files and as links between documentary versions. Collation programs work by comparing what are essentially plain text files, containing at most light markup - COCOA references or empty XML elements (as in the case of Juxta). The virtual absence of collation programs able to process arbitrary XML makes this part of the proposal very difficult to achieve. It would be better if a purely digital representation of the text were the objective, since in that case an apparatus would not be needed.
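To illustrate the kind of input such tools expect, here is a toy collation of two plain-text versions, using Python's difflib as a stand-in for a real collation engine (the versions are invented):

```python
# Collation tools effectively compare sequences of plain words: strip the
# markup, tokenise, align. difflib stands in here for a real collation engine.

import difflib

v1 = "The quick brown fox jumps over the lazy dog".split()
v2 = "The quick red fox leaps over the lazy dog".split()

matcher = difflib.SequenceMatcher(a=v1, b=v2)
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag != "equal":
        print(tag, v1[i1:i2], "->", v2[j1:j2])
# replace ['brown'] -> ['red']
# replace ['jumps'] -> ['leaps']
```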

Transpositions

The mechanism for transposition as described also sounds infeasible. It is unclear what is meant by the proposed standoff mechanism. If it allows chunks of transposed text to be moved around, however, it will fail when a chunk contains non-well-formed markup or when the destination does not permit that markup at that point in the schema. And if transpositions between physical versions are allowed - and these actually comprise the majority of cases - how is such a mechanism to work, especially when transposed chunks may well overlap?
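As a small illustration of the first difficulty, here is a sketch (with invented fragments) that tests whether a chunk of markup is even well-formed on its own before it could be moved around as a unit:

```python
# A transposed chunk can only be relocated safely if it is well-formed as a
# self-contained piece of XML. The fragments below are invented examples.

import xml.etree.ElementTree as ET

def is_well_formed_fragment(chunk: str) -> bool:
    """Can this chunk be parsed as a self-contained piece of XML?"""
    try:
        ET.fromstring(f"<wrapper>{chunk}</wrapper>")
        return True
    except ET.ParseError:
        return False

print(is_well_formed_fragment("<del>first thoughts</del> second thoughts"))  # True
print(is_well_formed_fragment("thoughts</del> <add>and revisions"))          # False
```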

Simplicity = Limited Scope

Much is made in the supporting documentation of the HyperNietzsche Markup Language (HNML) and 'GML' (Genetic Markup Language) of the greater simplicity of the proposed encoding schemes. Clearly, the more general an encoding scheme, the less succinct it is going to be. Since the proposal is to incorporate the encoding model for genetic editions into the TEI, this advantage will surely be lost. In any case there seems to be very little in the proposal that cannot already be encoded as well (or as poorly, depending on your point of view) in the TEI Guidelines as they now stand.
