Thursday, May 28, 2009

The Light at the End of the Tunnel

Well, I finally got 'compare' to work properly. The delay was caused by having to redesign the 'chunking' mechanism that delivers the text back to the browser as a series of blocks, each sharing the same characteristics. That way all the deleted text can be made red, the inserted text blue and the merged text black. And the user can click on the black text and be taken to the corresponding part of the compared text. Very important, but also very tricky to get absolutely right. And in this version I had to allow for transpositions, which are even more complicated. But now at last it works. I will post the project on Google Code in the morning, because I am too tired now.
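For anyone curious, here is a toy version of the chunking idea in Java. It is not the actual nmerge code and all the names are invented; it also leaves out the version links that drive the click-through. The point is just that the comparison yields a run of small pieces, each marked deleted, inserted or merged, and the chunker glues adjacent pieces with the same state into one block, so the browser can colour each block in a single span.

import java.util.*;

// Toy sketch of the chunking idea, not the real nmerge code: glue adjacent
// pieces of compared text that share the same state into one block, so each
// block can be coloured (red/blue/black) in one go.
public class ChunkDemo
{
    enum State { DELETED, INSERTED, MERGED }
    static class Piece
    {
        State state; String text;
        Piece( State s, String t ) { state = s; text = t; }
    }
    static class Chunk
    {
        State state; StringBuilder text = new StringBuilder();
        Chunk( State s ) { state = s; }
    }
    // merge consecutive pieces with the same state into chunks
    static List<Chunk> chunk( List<Piece> pieces )
    {
        List<Chunk> chunks = new ArrayList<Chunk>();
        Chunk current = null;
        for ( Piece p : pieces )
        {
            if ( current == null || current.state != p.state )
            {
                current = new Chunk( p.state );
                chunks.add( current );
            }
            current.text.append( p.text );
        }
        return chunks;
    }
    public static void main( String[] args )
    {
        List<Piece> pieces = Arrays.asList(
            new Piece(State.MERGED,"the "), new Piece(State.MERGED,"quick "),
            new Piece(State.DELETED,"brown "), new Piece(State.INSERTED,"red "),
            new Piece(State.MERGED,"fox") );
        for ( Chunk c : chunk(pieces) )
            System.out.println( c.state+": \""+c.text+"\"" );
    }
}

Running that prints four blocks: a merged "the quick ", a deleted "brown ", an inserted "red " and a merged "fox".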

[Test progress bar: usage, create, help, add, del, desc, arch, unarch, export, import, update, read, list, comp, find, vars]

Several Days Later ...

Almost done testing the code. Just a few minor problems with find (again) and variants. The latter could be quite a useful feature in the GUI. For example, selecting a piece of text could conceivably show its variants dynamically in a sub-window at the bottom. I favour an in-line solution using popup text, but that will have to wait. This feature should demonstrate that we don't need to 'collate' separate physical versions any longer to get this information.
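To see why, picture the MVD, much simplified, as an ordered list of pairs, each pair holding a set of version numbers plus the text those versions share. The variants of a selected fragment in one version are then simply whatever the other versions contain between the nearest pairs shared on either side of the selection. The little Java sketch below shows the idea; the class and method names are invented and the real nmerge structures are rather more involved.

import java.util.*;

// Toy model of an MVD as an ordered list of pairs (version set + text), just
// to show that variants fall out of the merged form without any separate
// collation step. Not the real nmerge data structures.
public class VariantDemo
{
    static class Pair
    {
        Set<Integer> versions; String text;
        Pair( String text, Integer... vs )
        {
            this.text = text;
            this.versions = new HashSet<Integer>( Arrays.asList(vs) );
        }
    }
    // index of the nearest pair beyond sel (in direction step) shared by v and w
    static int boundary( List<Pair> mvd, int v, int w, int sel, int step )
    {
        for ( int i=sel+step; i>=0 && i<mvd.size(); i+=step )
        {
            Set<Integer> vs = mvd.get(i).versions;
            if ( vs.contains(v) && vs.contains(w) )
                return i;
        }
        return (step<0) ? -1 : mvd.size();
    }
    // text of version w strictly between pair indices from and to
    static String textOf( List<Pair> mvd, int w, int from, int to )
    {
        StringBuilder sb = new StringBuilder();
        for ( int i=from+1; i<to; i++ )
            if ( mvd.get(i).versions.contains(w) )
                sb.append( mvd.get(i).text );
        return sb.toString();
    }
    public static void main( String[] args )
    {
        // version 1: "the quick brown fox"; version 2: "the slow brown dog"
        List<Pair> mvd = Arrays.asList(
            new Pair("the ",1,2), new Pair("quick ",1), new Pair("slow ",2),
            new Pair("brown ",1,2), new Pair("fox",1), new Pair("dog",2) );
        int v = 1, sel = 1;   // the user has selected "quick " in version 1
        int w = 2;            // in general you would loop over the other versions
        int from = boundary( mvd, v, w, sel, -1 );
        int to = boundary( mvd, v, w, sel, +1 );
        System.out.println( "version "+w+" reads: \""+textOf(mvd,w,from,to)+"\"" );
    }
}

Running that reports that version 2 reads "slow " where version 1 has "quick " - no collation of separate files required.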

Thursday, May 14, 2009

HyperNietzsche vs MVD

I decided after all to make some general remarks about the recently proposed 'Encoding Model for Genetic Editions' being promoted by the HyperNietzsche people and the TEI. Since it is being put forward as a rival solution for a small subset of the multi-version texts covered by my approach, I thought that readers of this blog might like to know the main reasons why I think the MVD technology is much the better of the two.

One Work = One Text

Because it is difficult to record many versions in one file using markup, the proposal recommends a document-centric approach: each physical document is encoded separately, even when the documents are merely drafts of the same text. As a result there is a great deal of redundancy in their representation. They interconnect the variants between documents by means of links weighted with a probability, and they see in this their main advantage over MVD. But this rests purely on a misunderstanding of the MVD model. The weights can of course be encoded in the version information of the MVD as user-constructed paths: we can have an 80% probable version and a 20% probable version just as easily as physical versions.
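Here is a rough sketch of what I mean, with invented names rather than the real nmerge classes: since an MVD 'version' is just a named path through the merged text, weighted interpretive readings can sit in the same version table as the physical documents.

import java.util.*;

// Rough sketch with invented names: in an MVD a 'version' is just a named
// path through the merged text, so weighted interpretive readings can live
// in the same table as the physical documents.
public class VersionTableDemo
{
    static class Version
    {
        String shortName, description;
        Version( String shortName, String description )
        {
            this.shortName = shortName;
            this.description = description;
        }
    }
    public static void main( String[] args )
    {
        List<Version> table = new ArrayList<Version>();
        // physical documents
        table.add( new Version("A","first draft (physical document)") );
        table.add( new Version("B","fair copy (physical document)") );
        // user-constructed paths carrying the editor's weighting
        table.add( new Version("p80","editor's preferred reading (80% probable)") );
        table.add( new Version("p20","alternative reading (20% probable)") );
        for ( int i=0; i<table.size(); i++ )
            System.out.println( (i+1)+". "+table.get(i).shortName
                +": "+table.get(i).description );
    }
}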

Actually I think it is wrong to encode one transcriber's opinion about the probability that a certain combination of variants is 'correct'. A transcription should just record the text and any interpretations should be kept separate. How else can it be shared? The display of alternative paths is a task for the software, mediated by the user's preferences.

The main disadvantage in having multiple copies of the same text is that every subsequent operation on the text has to reestablish or maintain the connections between bits that are supposed to be the same. You thus have much more work to do than in an MVD. I believe that text that is the same across versions should literally be the same text. This simplifies the whole approach to multi-version texts. I also don't believe that humanists want to maintain complex markup that essentially records interconnections between versions, when this same information can be recorded automatically as simple identity.
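A toy illustration of what 'literally the same text' buys you - again with invented names, not the nmerge internals: shared text is stored once, tagged with the set of versions it belongs to, so a correction to a shared fragment shows up in every version automatically, with no inter-document links to update.

import java.util.*;

// Toy illustration, not the nmerge internals: shared text is stored once,
// tagged with the versions it belongs to, so a correction to a shared
// fragment is automatically consistent across every version.
public class SharedTextDemo
{
    static class Pair
    {
        Set<Integer> versions; String text;
        Pair( String text, Integer... vs )
        {
            this.text = text;
            this.versions = new HashSet<Integer>( Arrays.asList(vs) );
        }
    }
    // reconstruct the full text of one version from the pair list
    static String versionText( List<Pair> mvd, int v )
    {
        StringBuilder sb = new StringBuilder();
        for ( Pair p : mvd )
            if ( p.versions.contains(v) )
                sb.append( p.text );
        return sb.toString();
    }
    public static void main( String[] args )
    {
        List<Pair> mvd = Arrays.asList(
            new Pair("It was the best of tymes, ",1,2),
            new Pair("it was the worst of times",1),
            new Pair("it was the age of wisdom",2) );
        // correct the misspelling once, in the shared pair ...
        mvd.get(0).text = "It was the best of times, ";
        // ... and every version that contains it is already up to date
        for ( int v=1; v<=2; v++ )
            System.out.println( v+": "+versionText(mvd,v) );
    }
}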

OHCO Thesis Redux

The section on 'grouping changes' implies that manuscript texts have a structure that can be broken down into a hierarchy of changes that can be conveniently grouped and arbitrarily nested. Similarly, in section 4.1 a strict hierarchy is imposed consisting of document->writing surface->zone->line. Ever since Barnard's paper of 1988, which pointed out the inherent failure of markup to adequately represent even a simple case of nested speeches and lines in Shakespeare - sometimes a line is spread over two speeches - the problem of overlap has been the dominant issue in the digital encoding of historical texts. This representation seeks to reassert the OHCO thesis, which has been withdrawn by its own authors, and it will fail to adequately represent these genetic texts until it is recognised that they are fundamentally non-hierarchical. The last 20 years of research cannot simply be ignored. It is no longer possible to propose something for the future that does not address the overlap problem. And MVD neatly disposes of it.

Collation of XML Texts

I am also curious as to how they propose to 'collate' XML documents arranged in this structure, especially when the variants are distributed via two mechanisms: as markup within individual files and as links between documentary versions. Collation programs work by comparing what are essentially plain text files, containing at most light markup such as COCOA references or empty XML elements (as in the case of Juxta). The virtual absence of collation programs able to process arbitrary XML makes this proposal at the very least difficult to achieve. It would be better if a purely digital representation of the text were the objective, since in that case no apparatus would be needed.

Transpositions

The mechanism described for transposition also sounds infeasible. It is unclear what is meant by the proposed standoff mechanism, but if it allows chunks of transposed text to be moved around, it will fail whenever the chunks contain non-well-formed markup or the destination does not permit that markup at that point in the schema. And if transpositions between physical versions are allowed - and these actually comprise the majority of cases - how is such a mechanism to work, especially when transposed chunks may well overlap?

Simplicity = Limited Scope

Much is made in the supporting documentation of the HyperNietzsche Markup Language (HNML) and 'GML' (Genetic Markup Language) of the greater simplicity of the proposed encoding schemes. Clearly, the more general an encoding scheme is, the less succinct it is going to be. Since the proposal is to incorporate the encoding model for genetic editions into the TEI, this advantage will surely be lost. In any case there seems to be very little in the proposal that cannot already be encoded just as well (or as poorly, depending on your point of view) with the TEI Guidelines as they now stand.

Friday, May 8, 2009

A Slight Delay in a Good Cause

OK, I'm not finished yet, even though I said I would be by now, but software is like that. Sorry. I decided that in order to test the program properly I should have a complete test suite that I can run after making any changes, to make sure that everything in the release is OK. Well, when I say 'make sure', a test can only tell you that a bug is present, not that there are none. But that's a lot better than letting the user find them. If I release something that is incomplete or not fully tested then I know the sceptics will attack the flaws. They will say 'See, it doesn't work, I told you so!' I can't afford that, so I have to be careful. So far I have tests for 14 of the 16 commands.
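For the curious, the suite is roughly this shape, though the folder layout and names below are invented for the sketch rather than copied from the actual code: each command gets a folder holding the command line to run and the output from the last known-good run, and the harness simply compares the two.

import java.io.*;

// A sketch of the regression harness, not the actual test code; the folder
// layout and file names are invented. tests/<command>/cmdline.txt holds the
// command line to run; tests/<command>/expected.txt holds the output from
// the last known-good run.
public class RegressionSuite
{
    public static void main( String[] args ) throws Exception
    {
        File[] dirs = new File("tests").listFiles();
        if ( dirs == null )
        {
            System.out.println( "no tests folder found" );
            return;
        }
        int passed = 0, failed = 0;
        for ( File dir : dirs )
        {
            if ( !dir.isDirectory() )
                continue;
            String cmd = readFile( new File(dir,"cmdline.txt") ).trim();
            String expected = readFile( new File(dir,"expected.txt") );
            Process p = Runtime.getRuntime().exec( cmd );
            String actual = readStream( p.getInputStream() );
            p.waitFor();
            if ( expected.equals(actual) )
                passed++;
            else
            {
                failed++;
                System.out.println( "FAIL: "+dir.getName() );
            }
        }
        System.out.println( passed+" passed, "+failed+" failed" );
    }
    static String readFile( File f ) throws IOException
    {
        return readStream( new FileInputStream(f) );
    }
    static String readStream( InputStream in ) throws IOException
    {
        StringBuilder sb = new StringBuilder();
        int c;
        while ( (c=in.read()) != -1 )
            sb.append( (char)c );
        in.close();
        return sb.toString();
    }
}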

I also added an unarchive command to go with the archive command. With 'archive' users can save an MVD as a set of versions in a folder, plus a small XML file instructing nmerge how to reassemble them into an MVD. This contains all the version and group information etc. So if you don't believe the MVD format will last, it doesn't matter. You always have the archive and that is in whatever format the original files were in. A user could even construct such an archive manually. The 'unarchive' command takes this archive and builds an MVD from it in one step.

Here's a progress bar for the tests. Green means there is a test routine and it passes. Yellow means there is a test routine but it doesn't pass yet. Red means there is no test routine and I don't know for sure if it works, but it might. There was an intermittent problem with update, but this is now fixed.

[Test progress bar: usage, create, help, add, del, desc, arch, unarch, export, import, update, read, list, comp, find, vars]

I'm going for a beta version with this release. I think it's good enough.

OK, now there's a project on Google Code. I must say it was much easier than creating a SourceForge project. They wanted me to write an epic about it and even then I had to wait 1-3 days for their royal approval. On Google Code it was instant. Cool.