Thursday, February 24, 2011

Multi-lingual MVDs

There are plenty of cases where the concept of 'work' spans more than one basic version in one language. Just think of the multi-lingual laws of the EU, the Romulo of Virgilio Malvezzi translated into several languages, each having its own textual history, or the Chronicles of Eusebius, in Latin, Greek and Armenian. The question is, how can you align the same text written in a different language? Can one align Latin and Greek, or French and German? In my opinion, no, or at least not automatically. Quite apart from the language dissimilarity, translations often have quite different structures, making alignment particularly difficult. But a tiny change to the definition of an MVD makes it possible to align such texts manually and to use the MVD format as a storage facility.

Tweaking the groups

MVDs have always had a simple grouping mechanism. You can group versions by type. For example, versions of a particular recension, or internal versions (corrections or revisions of a single manuscript) can be grouped together to keep them separate from versions in other physically different documents. Now if we assign one of these groups a simple attribute, called 'merge' and set it to 'true' or 'false', then we can control how an MVD is built up. For example, imagine we have French, German and Italian translations of some work, each in several versions. We could group all the Italian versions together, and similarly for the German and French ones. And we could set each group's attribute 'merge' to 'true'. But each such group would belong to a higher group, whose 'merge' attribute would be 'false'. So the merging program would know, on being given version 23 (French) to add to the MVD, not to merge it with version 16 (German) because their shared parent group is not merged. Here's how it would look schematically inside the resulting MVD:

This might also be a good strategy whenever the same 'work' is substantially rewritten, like the Morte d'Arthur and other medieval tales. Versions of each rewrite would get their own group and we wouldn't attempt to align them automatically because it just gets too messy.

Linking the translations

Now we can extend the standoff markup mechanism described in the previous post to link the texts of the different languages manually. We add a view that displays two versions of an MVD side by side:

Selecting some text on the right or left highlights it independently (you can do this in Javascript). Now select something in the opposite version and press the 'link' button. This creates an annotated property that specifies a link between the two selected ranges and records it via standoff markup. The view could then give the user graphical feedback by formatting the two selected blocks rigidly side-by-side:

They could also scroll together in sync, as they currently do in compare view. If blocks are transposed between languages (as often happens) the text might jump around a bit as you scroll, but so long as we align on the most central block it should work OK. Also, the alignment would hold for all the aligned versions on either side, not merely for the ones currently selected. If you had 12 German versions and 16 French ones, they would all be aligned at the same point of their shared text. You could even display an apparatus at the bottom of each side so the user could see the variants of the versions in each language.

How much work is that?

Although a special view would have to be designed, there is not much else needed to make it work. It might even be a good idea to add such a view to the MVD-GUI suite and see what people can do with it – but only once the standoff mechanism is up and running, because this solution depends on it.

Sunday, February 13, 2011

Standoff Properties explained

I've been asked for a more detailed explanation of how CorCode works as a set of standoff properties. I'll try but it won't be all that brief.

Embedded markup

Since at least the 1980s humanities texts have been described using embedded markup codes, but this leads to several problems:

  1. The structure imposed on the text is a tree and maybe the structure we want to describe is not.
  2. The embedded codes need to be standardised because otherwise we can't share texts or create shared software. But there are so many codes we need to define that the standard soon becomes unwieldy.
  3. Embedded markup lacks flexibility. We can't easily exchange one set of markup for another, or merge two sets.
  4. Users who edit the texts have to read it through the smoke-screeen of the tags and their attributes. And they have to learn a complex system that is becoming ever more complex.

Standoff markup is a partial solution to these problems. Removing tags from the text clarifies it for the reader, and allows the exchange of one set of tags for another. But with standoff markup we still can't combine two tag sets or define non-tree structures. And because the standoff codes depend on the inviolability of the text, we can't edit it.

What I was trying to explain in the previous post is that we can in fact overcome all of these problems by defining markup as a simple set of overlapping named properties. I'm not the first to suggest this by any means: in fact it resembles to varying degrees George's valency idea, LMNL, Thaller's extended string model, eComma, LORE and other annotation systems, and even TexMECS to some extent. But I'd like to describe my implementation because I think it offers some advantages over previous attempts.

Properties

Properties have a name, an offset and a length, which describe a range in the text. That's one string and two numbers. An example of a property is 'italics' at offset 23 with length 5. Let's just consider the offsets first.

Absolute versus relative offsets

With absolute offsets (as used in JITM and every other standoff system I know) the offsets increase for properties as we move through the text. So if we had properties at offsets 2, 10, 23, 45, 106, 230, 1022, 1100, 1495, 1567 and we added 121 characters at the start of the text, the first offset would have to change to 123, AND we would have to add 121 to all subsquent ones: 123, 131, 144, 166, 227, 351, 1143, 1221, 1616, 1688.

With relative offsets we only record the differences between an offset and the previous one. So the same sequence would read: 2, 8, 13, 12, 61, 124, 792, 78, 395, 72. (That's obtained by subtracting 2 from 10, then 10 from 23, then 23 from 45 etc.) Now when we add 121 characters at the start, the sequence changes to 123, 8, 13, 12, 61, 124, 792, 78, 395, 72. Only the first one needs to change because the relative distances between the other properties haven't altered.

Property lengths

If, instead of just inserting text outside of a property we altered the length of the property itself, say by extending a paragraph labelled with a 'p' (if we use TEI), then with relative offsets only the length of that property and the offset of the following one would change. Let's say that the length of the text covered by property 3 was 5 characters and we extended it to 12, then we'd change the length of property 3 from 5 to 12 and the offset of property 4 from 12 to 19 by adding 7 (i.e. 12-5). The other properties preceding property 3 AND following property 4 would not change.

Property names

Now let's consider the names of the properties. We can make them multi-lingual and go beyond what TEI can do. Europeans see TEI as based on English texts. (e.g. look at Domenico's objections in Scrittura e filologica nell'era digitale p. 170). Why should we not call 'italics' 'kursiv' if we are Germans? The entire standard encoding scheme is based on English words and concepts. Why do we have to standardise them when we can just let the users choose what they want to call them? Or they can provide translations for their property names so others can read their markup. So instead of explicit names I propose that we have a table of properties at the start of the list:

1 italics
2 paragraph
3 stage
etc.

Then when we want to use the italics property we just say #1 23 5 – which means 'property 1 (italics) at relative offset 23 of length 5'. Of course the computer handles all this. We never see these values directly, only through their representation on the screen via formatted text, not even when we edit them via the GUI.

Having got the properties into this form we can write a table that provides translations of all the properties in the file into any other languages we choose. And texts marked with '#1' will show up as 'kursiv' for Germans and 'corsivo' for Italians (or المائل for Arabs). TEI can't do this because the English names are burned into the standard.

Editing the text in this form

Each time we edit the underlying text we have to adjust the standoff properties so that they still correspond. But thanks to the use of relative offsets this is easy. After editing the base text the user commits it to the server. The server computes the differences between the old version and the new one. From this we obtain a set of insertions and deletions.

Insertions

If we insert text outside of an property or within a property we just follow the rules described above in adjusting the relative offsets and property lengths of any lists of properties that describe that text.

Deletions

If we delete a bit of text that completely contains a property we delete that property and adjust the relative offset of the next one in the list. If the property's range is only partly deleted (at the start or at the end) we simply adjust its length and also the offset of the next property in the list.

So, in both cases we can edit the text and its underlying properties quite cleanly.

Publishing digital editions

If we publish version 1 of a text and someone writes a property list for it, and then we change the text and issue edition 2, then their properties can easily be adjusted using these procedures. So, on requesting a copy of King Lear, the server informs the user that his edition of King Lear is out of date and would he/she like to update it. The updates are performed automatically and the old properties now refer to edition 2.

Merging property lists into CorCode

Yet another advantage of relative offsets is the ability to merge lists of properties belonging to different versions. Let's say we have 5 versions of Shakespeare's King Lear. We could define properties like stage, speaker, speech, paragraph, line, italics etc for ranges within each version, but like the text these properties would mostly be the same. Tedious. If we had used absolute offsets the lists of properties would all be different because they would contain different offsets throughout. Just one extra character would change all the absolute offsets from then on, and it would fail to merge. But with relative offsets most of the properties, like the text they describe, will be exactly the same. So we can merge all the property lists into one CorCode to correspond to the one CorTex. And when we apply a new property to one version it will automatically be adopted by all other versions - should we so desire - without having to redefine it for each version separately.

Turning overlapping properties into HTML

To make all these advantages practical we will have to convert a text marked up in this way into HTML for the browser. But how to do it? There is no hierarachical structure left, it's not XML, we can't use XSLT and the target language IS a hierarchy. In fact all this has already been done in eComma. How eComma works I don't know so I'll explain how I would do it.

We can scan the text and its property lists and deduce the hierarchical structure. If a property of type line is always inside a property called speech we can deduce that we might render that in HTML as lines inside speeches, say as <span> inside <p>, i.e. as <p><span>...</span></p>. But often in Shakespeare a line is divided between two speakers. Then we can simply break the line up into two lines, because it is most often contained by the speech property. (If it had been the other way around we would have had to split the <p>...</p> instead). So we can resolve all cases of overlap and discover hierachical structure from a simple analysis of the properties.

Translating the properties into HTML tags is also easy. Each HTML file these days comes with a CSS file that tells the browser how to format elements. So if we want to format speech properties specially we provide a css rule:

p.speech { text-indent: 5px }

This tells the browser to indent paragraphs of class 'speech' by five pixels. The neat thing about CorCode is that we can reuse this definition to convert speech properties into paragraphs. We just follow the recipe contained in the CSS rule: speeches are p's of class 'speech'. So we don't need XSLT.

Simplication of XML to properties

When we convert legacy XML files to CorCode format we have to simplify XML elements with their attributes into property names. So <hi rend="italics">word</hi> becomes #1 23 5 (remember we defined '#1' to be italics). So the CSS rules can all be simple and don't have to take account of complex XML attributes. Not all XML properties are as simple as the italics example, but if we provide a list of recipes of how to convert them I think it can be done. For example the TEI coding for a page number looks like this: <pb n="42" ed="1"/>. The "42" is really text and should be represented in the main text with the property of 'page-number'. The 'ed' attribute is really a version specification and should be expressed in the CorText and its versions. So there's nothing left. TEI markup is complex because it is a mixture of every kind of information we want to include that involves text. But we really need to separate out things like anchors to external images into a separate CorCode file that is handled specially in a GUI. When we focus on actual properties that the text really has, as opposed to programming information, there is not much left to represent.

Friday, February 11, 2011

The death of the angle-bracket

I was pleasantly surprised to learn that the eComma project uses overlapping properties rather than embedded markup to encode humanities texts. This emboldens me to take a similar approach with my rewrite of the MVD-GUI. For a relatively small effort I can transform ugly bits of XML such as <hi rend="italic">word</hi> into the standoff property called italics that applies to a specific range in the text. So to kill off angle brackets for good all I have to do is the following:

  1. Take a TEI text and use my splitter program to split the markup from the text for all the versions of a work. This yields as many versions of plain text as versions of markup.
  2. Simplify the markup: remove attributes by merging them with element names and swapping them for something shorter. And we can have multi-lingual property-names – no need to always use English.
  3. Merge the text of all the versions into a CorTex (MVD) and all the markup into a CorCode. The CorCode is just a list of properties and their ranges in the text, one for each version.
  4. Design 3 Joomla components:
    1. A formatted view of any chosen version, with expanding/collapsing apparatus.
    2. Edit the CorCode. A formatted view of the CorTex+CorCode for the currently chosen version: oft-used markup tags on the right as buttons, the rest as a dropdown list. Either just pressing a button or selecting an item from the dropdown and pressing 'apply' would apply that format to the current selection.
    3. Edit the CorTex. This view is just a text editing box, with possibly an expanding/collapsing apparatus.

That's not too much work, and when it is done users won't have to struggle with complex syntax ever again. In its place a set of simple overlapping properties that automatically format themselves into HTML in the browser. And all steps will be reversible: so we can go back to the XML representation at any stage, with no loss of information (hopefully).

Here are some mock-ups of how the user interface would look:

The Combined view

This is partly implemented in the new version, (all browsers) and more fully implemented in the old version (markup still embedded, Firefox only).

The CorCode view

We only have to show the properties present in this text. Note the language dropdown menu – this will translate the property names into whatever we provided in the property list. 'clear all' clears all properties from the current selection.

The CorTex view

This is just a plain edit text box, although I have enhanced it with a collapsable apparatus showing textual (not formatting) variants. The user simply edits then clicks 'save'. Carriage returns are not passed on to the display so they can be added as desired to lay out the text so it is more readable.

Friday, February 4, 2011

Intelligenza artificiale

We had another publication this time in Intelligenza artificiale, a journal of the IA*AI (University of Bologna). This is a published version of a conference paper presented by my colleague in Italy several years ago when we were just starting up this multi-version-document thing. So it's kind of interesting historically. I'm content that what we said then is more or less what we say now. In other words the idea appears to be stable.