It is important that there is an open discussion about the forms that digital representations of historical artefacts may take. With this end in view I have prepared an article for the Journal of the Text Encoding Initiative entitled "Towards an Interoperable Digital Scholarly Edition". The shared data format most people are using at the moment is the TEI Guidelines. The problem with this is twofold:
- it is tied to a piece of technology, namely XML, rather than being an abstract specification, and
- it is based on subjective human judgements and is most definitely not interoperable.
That's a big problem, because it means we can't share our work. For example, I currently can't efficiently edit an edition of anything with colleagues located in various parts of the world: I would have to spend a great deal of time homogenising the codes used to describe features and forcing everyone to use them consistently. Nor can I cite or annotate such a digital work and share those annotations with others. And I can't reuse someone else's digital scholarly edition by extending or repurposing it. I think that's a waste of human effort, and until we fix the problem this part of the digital humanities is going to suffer. So my paper is about how that problem can be fixed.
The reaction so far has been to adduce two arguments:
1. We don't really need interoperability, just the ability to record subjective judgements about the text. That's what humanists do, after all.
2. The interoperability problem can be solved by reducing the tag set so that there is only one way to encode each feature.
In response to 1., digital humanists working on digital editions have been calling for interoperability for years. A glance at the Interedition project makes it clear that this objective is essential rather than optional. It is hard to believe that a consensus will ever be reached that collaboration isn't necessary, but that is what this objection amounts to.
In response to 2., this has already been tried multiple times, e.g. TEI Lite, TEI Tite, Textgrid Baseline encoding, DTA Basisformat, TEI Nudge. For example, the TEI Tite specification says it "is meant to prescribe exactly one way of encoding a particular feature of a document in as many cases as possible, ensuring that any two encoders would produce the same XML document for a source document". But if I see italics in a text and try to encode it with TEI Tite, I can still use <i>, <abbr>, <foreign>, <hi> or <seg> with various rend attributes, <label>, <ornament>, <stage>, or <title>. If every tag I choose may differ from the tag you choose for the same feature in the same document, the divergence is quickly magnified the longer the text goes on. Like a human fingerprint, no two transcriptions can ever be the same.
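To make the point concrete, here is a sketch of the problem using the element names listed above. The phrase and the rend values are illustrative only, not prescribed by Tite; each line is a plausible encoding of one and the same italic phrase:

```xml
<!-- One italic phrase in the source document.
     Each line below is an encoding a different Tite encoder might
     legitimately produce; rend values are illustrative. -->
<hi rend="italic">terra incognita</hi>
<i>terra incognita</i>
<foreign rend="italic">terra incognita</foreign>
<title rend="italic">terra incognita</title>
<seg rend="italic">terra incognita</seg>
```

Each variant is valid markup of the same typographic fact, so a tag-reduced schema alone cannot guarantee that two encoders produce the same document.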
The fact is that marking up a text that never had any markup when it was created is a very different proposition to writing one with tags in it from the start. Since few people other than digital humanists do this, the distinction is easy to overlook. If I say that something is italics when I write it, that's what it is. But if I print it and someone else then marks it up, they may use a code indicating that my "italics" is really a foreign word, a title, an emphatic statement, a stage direction, etc. That is interpretation, and on a tag-by-tag basis the alternatives (including whether to record the feature at all) are manifold. So tag-reduction doesn't work, and for this reason it never will.
If you're curious about these arguments, read the article now by following the link above, or wait until volume 7 of the Journal of the TEI comes out in a month or so.