Multi-Version Documents: Standoff Properties explained

I've been asked for a more detailed explanation of how CorCode works as a set of standoff properties. I'll try but it won't be all that brief.

Embedded markup

Since at least the 1980s humanities texts have been described using embedded markup codes, but this leads to several problems:

The structure imposed on the text is a tree and maybe the structure we want to describe is not.
The embedded codes need to be standardised because otherwise we can't share texts or create shared software. But there are so many codes we need to define that the standard soon becomes unwieldy.
Embedded markup lacks flexibility. We can't easily exchange one set of markup for another, or merge two sets.
Users who edit the texts have to read it through the smoke-screeen of the tags and their attributes. And they have to learn a complex system that is becoming ever more complex.

Standoff markup is a partial solution to these problems. Removing tags from the text clarifies it for the reader, and allows the exchange of one set of tags for another. But with standoff markup we still can't combine two tag sets or define non-tree structures. And because the standoff codes depend on the inviolability of the text, we can't edit it.

What I was trying to explain in the previous post is that we can in fact overcome all of these problems by defining markup as a simple set of overlapping named properties. I'm not the first to suggest this by any means: in fact it resembles to varying degrees George's valency idea, LMNL, Thaller's extended string model, eComma, LORE and other annotation systems, and even TexMECS to some extent. But I'd like to describe my implementation because I think it offers some advantages over previous attempts.

Properties

Properties have a name, an offset and a length, which describe a range in the text. That's one string and two numbers. An example of a property is 'italics' at offset 23 with length 5. Let's just consider the offsets first.

Absolute versus relative offsets

With absolute offsets (as used in JITM and every other standoff system I know) the offsets increase for properties as we move through the text. So if we had properties at offsets 2, 10, 23, 45, 106, 230, 1022, 1100, 1495, 1567 and we added 121 characters at the start of the text, the first offset would have to change to 123, AND we would have to add 121 to all subsquent ones: 123, 131, 144, 166, 227, 351, 1143, 1221, 1616, 1688.

With relative offsets we only record the differences between an offset and the previous one. So the same sequence would read: 2, 8, 13, 12, 61, 124, 792, 78, 395, 72. (That's obtained by subtracting 2 from 10, then 10 from 23, then 23 from 45 etc.) Now when we add 121 characters at the start, the sequence changes to 123, 8, 13, 12, 61, 124, 792, 78, 395, 72. Only the first one needs to change because the relative distances between the other properties haven't altered.

Property lengths

If, instead of just inserting text outside of a property we altered the length of the property itself, say by extending a paragraph labelled with a 'p' (if we use TEI), then with relative offsets only the length of that property and the offset of the following one would change. Let's say that the length of the text covered by property 3 was 5 characters and we extended it to 12, then we'd change the length of property 3 from 5 to 12 and the offset of property 4 from 12 to 19 by adding 7 (i.e. 12-5). The other properties preceding property 3 AND following property 4 would not change.

Property names

Now let's consider the names of the properties. We can make them multi-lingual and go beyond what TEI can do. Europeans see TEI as based on English texts. (e.g. look at Domenico's objections in Scrittura e filologica nell'era digitale p. 170). Why should we not call 'italics' 'kursiv' if we are Germans? The entire standard encoding scheme is based on English words and concepts. Why do we have to standardise them when we can just let the users choose what they want to call them? Or they can provide translations for their property names so others can read their markup. So instead of explicit names I propose that we have a table of properties at the start of the list:

1 italics
2 paragraph
3 stage
etc.

Then when we want to use the italics property we just say #1 23 5 – which means 'property 1 (italics) at relative offset 23 of length 5'. Of course the computer handles all this. We never see these values directly, only through their representation on the screen via formatted text, not even when we edit them via the GUI.

Having got the properties into this form we can write a table that provides translations of all the properties in the file into any other languages we choose. And texts marked with '#1' will show up as 'kursiv' for Germans and 'corsivo' for Italians (or المائل for Arabs). TEI can't do this because the English names are burned into the standard.

Editing the text in this form

Each time we edit the underlying text we have to adjust the standoff properties so that they still correspond. But thanks to the use of relative offsets this is easy. After editing the base text the user commits it to the server. The server computes the differences between the old version and the new one. From this we obtain a set of insertions and deletions.

Insertions

If we insert text outside of an property or within a property we just follow the rules described above in adjusting the relative offsets and property lengths of any lists of properties that describe that text.

Deletions

If we delete a bit of text that completely contains a property we delete that property and adjust the relative offset of the next one in the list. If the property's range is only partly deleted (at the start or at the end) we simply adjust its length and also the offset of the next property in the list.

So, in both cases we can edit the text and its underlying properties quite cleanly.

Publishing digital editions

If we publish version 1 of a text and someone writes a property list for it, and then we change the text and issue edition 2, then their properties can easily be adjusted using these procedures. So, on requesting a copy of King Lear, the server informs the user that his edition of King Lear is out of date and would he/she like to update it. The updates are performed automatically and the old properties now refer to edition 2.

Merging property lists into CorCode

Yet another advantage of relative offsets is the ability to merge lists of properties belonging to different versions. Let's say we have 5 versions of Shakespeare's King Lear. We could define properties like stage, speaker, speech, paragraph, line, italics etc for ranges within each version, but like the text these properties would mostly be the same. Tedious. If we had used absolute offsets the lists of properties would all be different because they would contain different offsets throughout. Just one extra character would change all the absolute offsets from then on, and it would fail to merge. But with relative offsets most of the properties, like the text they describe, will be exactly the same. So we can merge all the property lists into one CorCode to correspond to the one CorTex. And when we apply a new property to one version it will automatically be adopted by all other versions - should we so desire - without having to redefine it for each version separately.

Turning overlapping properties into HTML

To make all these advantages practical we will have to convert a text marked up in this way into HTML for the browser. But how to do it? There is no hierarachical structure left, it's not XML, we can't use XSLT and the target language IS a hierarchy. In fact all this has already been done in eComma. How eComma works I don't know so I'll explain how I would do it.

We can scan the text and its property lists and deduce the hierarchical structure. If a property of type line is always inside a property called speech we can deduce that we might render that in HTML as lines inside speeches, say as inside , i.e. as .... But often in Shakespeare a line is divided between two speakers. Then we can simply break the line up into two lines, because it is most often contained by the speech property. (If it had been the other way around we would have had to split the ... instead). So we can resolve all cases of overlap and discover hierachical structure from a simple analysis of the properties.

Translating the properties into HTML tags is also easy. Each HTML file these days comes with a CSS file that tells the browser how to format elements. So if we want to format speech properties specially we provide a css rule:

p.speech { text-indent: 5px }

This tells the browser to indent paragraphs of class 'speech' by five pixels. The neat thing about CorCode is that we can reuse this definition to convert speech properties into paragraphs. We just follow the recipe contained in the CSS rule: speeches are p's of class 'speech'. So we don't need XSLT.

Simplication of XML to properties

When we convert legacy XML files to CorCode format we have to simplify XML elements with their attributes into property names. So <hi rend="italics">word</hi> becomes #1 23 5 (remember we defined '#1' to be italics). So the CSS rules can all be simple and don't have to take account of complex XML attributes. Not all XML properties are as simple as the italics example, but if we provide a list of recipes of how to convert them I think it can be done. For example the TEI coding for a page number looks like this: <pb n="42" ed="1"/>. The "42" is really text and should be represented in the main text with the property of 'page-number'. The 'ed' attribute is really a version specification and should be expressed in the CorText and its versions. So there's nothing left. TEI markup is complex because it is a mixture of every kind of information we want to include that involves text. But we really need to separate out things like anchors to external images into a separate CorCode file that is handled specially in a GUI. When we focus on actual properties that the text really has, as opposed to programming information, there is not much left to represent.

Multi-Version Documents

Sunday, February 13, 2011

Standoff Properties explained