Saturday, October 16, 2010

Electronic Editions without Embedded Markup

I am going to do a demo for Chicago, which I visit on the 28th for a talk at Loyola University. I want to demonstrate a working method for converting legacy XML files with multiple embedded versions into separate HTML files, a bit like the Versioning Machine. However, between the XML and the HTML the interim files will consist solely of plain text and potentially overlapping markup extracted from the XML. Each of these two kinds of file could then be combined into its own Multi-Version-Document, so that electronic editions of multi-version texts can carry markup that can be freely combined or removed at any time without affecting the text.

This isn't just standoff markup. I'm using a custom technique called HRIT format, which is based on xLMNL. This just records a set of ranges with attributes (garnered from the XML) so potentially any property can overlap with any other property. And these ranges, tied by fixed offsets to the original text, have no hierarchical structure.
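
To make the idea of non-hierarchical ranges concrete, here is a minimal sketch in C. It is not the actual HRIT format: the struct layout and the function names (range, ranges_overlap, range_text) are assumptions made up for illustration. It shows a named property tied to the plain text by an offset and a length, and a test for whether two such ranges overlap without either containing the other, which is exactly what embedded XML cannot express directly.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical standoff range: a named property covering
   text[offset, offset + len). Not the real HRIT layout. */
typedef struct {
    const char *name;   /* property name, e.g. taken from an XML element */
    size_t offset;      /* start position in the plain text */
    size_t len;         /* number of characters covered */
} range;

static size_t range_end(const range *r)
{
    return r->offset + r->len;
}

/* True if inner lies entirely within outer. */
static int range_contains(const range *outer, const range *inner)
{
    return inner->offset >= outer->offset
        && range_end(inner) <= range_end(outer);
}

/* Two ranges overlap in the non-hierarchical sense when they
   intersect but neither one contains the other. */
static int ranges_overlap(const range *a, const range *b)
{
    int intersect = a->offset < range_end(b) && b->offset < range_end(a);
    return intersect && !range_contains(a, b) && !range_contains(b, a);
}

/* Copy the text that a range covers into buf, NUL-terminated.
   The range must lie within the text and fit in the buffer. */
static void range_text(const char *text, const range *r,
                       char *buf, size_t bufsize)
{
    assert(range_end(r) <= strlen(text));
    assert(r->len < bufsize);
    memcpy(buf, text + r->offset, r->len);
    buf[r->len] = '\0';
}
```

For example, over the text "Thou, Nature, art my goddess; to thy law", a range covering offsets 0 to 29 and another starting at offset 14 with length 26 overlap in this sense, whereas a range at offset 6 with length 6 ("Nature") is simply nested inside the first. Because the text itself is never touched, ranges can be added or dropped without re-editing it.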

Cool. But how does this get turned into hierarchical HTML? That's the tricky part. I will define a simple CSS file, which is interpreted as a recipe for constructing the HTML file. For example, we might have a style definition "p.stage". This would mean that we should generate a paragraph (<p class="stage">...</p>) for all ranges called "stage", and apply the formats of the p.stage style definition. The beauty of this is that the same CSS file can be used both for formatting the HTML and for generating it. Now that's cool. Here's an outline of the demo (I'll tick them off as I do them):

  1. Encode all versions of Act 1, scene 2 of King Lear as ONE TEI-XML file, using parallel segmentation. ✓ Done, at least for the first three folio versions. But I should really include the quartos too.
  2. Write a simple C program called splitter to separate out the versions. This produces copies of each version as separate TEI-XML files. ✓ Done.
  3. Strip out the markup from these files with stripper - another C-program. This produces 2 files for each XML file:
    1. the original text, stripped of all markup.
    2. The markup expressed in HRIT standoff format with coordinates for where it is now in the plain text. ✓ Done.
  4. Write a simple CSS stylesheet and another C program, formatter, that takes the standoff markup from step 3b and recombines it with the text from step 3a into HTML, using the stylesheet definitions. This is the most complex program: it needs to parse CSS, though only superficially, and use definitions of the type element.class to construct the HTML. The class will be the name of an XML element and the element will be the HTML element name. Then the program need only dumbly create elements for the given ranges. Since the markup was originally nested, the result will also be nested. (This was in fact a requirement for XSLT to do the same work.) Eventually the program will need more flexibility, so that it can handle nesting more intelligently. ✓ Done, but needs further extension.
  5. Display the result of one version in the browser. ✓ Done
  6. Write a simple interactive web program consisting of a web page and some JavaScript. Divide the page into two parts. On the right, a few of the most common properties as buttons: paragraph, speaker, speech, etc. Less common properties can be selected from a dropdown menu, with a button to apply the chosen property. On the left, the raw text of King Lear. Now select a bit of the text and press a button. This sends the selection to the server, which adds a format to that range, calls the formatter program to regenerate the HTML, then refreshes the page, so the text is formatted interactively. For the server just use Apache + PHP, and call the command-line tools via exec. In progress
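
The core of step 4 can be sketched roughly as follows. This is not the real formatter, just a small C illustration under stated assumptions: the names rule, range, out, parse_rules and format_html are invented here, only selectors of the plain form element.class are recognised (no grouped or descendant selectors), and the ranges are assumed to be sorted by offset with outer ranges first, properly nested, and inside the text. A range whose name matches no rule produces no element.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_DEPTH 32

/* All three types are hypothetical, for illustration only. */
typedef struct { char element[32]; char cls[32]; } rule;
typedef struct { const char *name; size_t offset; size_t len; } range;
typedef struct { char *buf; size_t cap; size_t len; } out;

/* Append a string; this sketch simply asserts on overflow. */
static void put(out *o, const char *s)
{
    size_t n = strlen(s);
    assert(o->len + n < o->cap);
    memcpy(o->buf + o->len, s, n + 1);
    o->len += n;
}

/* Superficial CSS scan: keep each selector of the form element.class
   that precedes a '{', and skip the declarations up to the next '}'. */
static int parse_rules(const char *css, rule *rules, int max)
{
    int n = 0;
    const char *p = css;
    while (n < max) {
        const char *open = strchr(p, '{');
        if (open == NULL)
            break;
        const char *s = p, *e = open;
        while (s < e && isspace((unsigned char)*s)) s++;
        while (e > s && isspace((unsigned char)e[-1])) e--;
        const char *dot = memchr(s, '.', (size_t)(e - s));
        if (dot != NULL && dot > s && dot + 1 < e
                && (size_t)(dot - s) < sizeof rules[n].element
                && (size_t)(e - dot - 1) < sizeof rules[n].cls) {
            memcpy(rules[n].element, s, (size_t)(dot - s));
            rules[n].element[dot - s] = '\0';
            memcpy(rules[n].cls, dot + 1, (size_t)(e - dot - 1));
            rules[n].cls[e - dot - 1] = '\0';
            n++;
        }
        const char *close = strchr(open, '}');
        if (close == NULL)
            break;
        p = close + 1;
    }
    return n;
}

/* Find the rule whose class equals the range name, or NULL. */
static const rule *find_rule(const rule *rules, int nrules, const char *name)
{
    int i;
    for (i = 0; i < nrules; i++)
        if (strcmp(rules[i].cls, name) == 0)
            return &rules[i];
    return NULL;
}

/* Emit an opening or closing tag; ranges without a rule emit nothing. */
static void put_tag(out *o, const rule *r, int closing)
{
    if (r == NULL)
        return;
    if (closing) {
        put(o, "</"); put(o, r->element); put(o, ">");
    } else {
        put(o, "<"); put(o, r->element);
        put(o, " class=\""); put(o, r->cls); put(o, "\">");
    }
}

/* Walk the text once, closing ranges that end at position i, then
   opening ranges that start at i, then copying the character. */
static void format_html(const char *text, const range *ranges, int nranges,
                        const rule *rules, int nrules, out *o)
{
    const range *stack[MAX_DEPTH];
    int depth = 0, next = 0;
    size_t i, n = strlen(text);
    for (i = 0; i <= n; i++) {
        while (depth > 0
                && stack[depth - 1]->offset + stack[depth - 1]->len == i)
            put_tag(o, find_rule(rules, nrules, stack[--depth]->name), 1);
        while (next < nranges && ranges[next].offset == i) {
            if (ranges[next].len > 0) {      /* ignore empty ranges */
                assert(depth < MAX_DEPTH);
                stack[depth++] = &ranges[next];
                put_tag(o, find_rule(rules, nrules, ranges[next].name), 0);
            }
            next++;
        }
        if (i < n) {
            char c[2] = { text[i], '\0' };
            put(o, text[i] == '<' ? "&lt;" : text[i] == '>' ? "&gt;"
                 : text[i] == '&' ? "&amp;" : c);
        }
    }
}
```

Given the rules p.speech and span.speaker, and the ranges speech (offset 0, length 17) and speaker (offset 0, length 4) over the text "Edm. Thou, Nature", format_html should write <p class="speech"><span class="speaker">Edm.</span> Thou, Nature</p> into the buffer. Because only the range on top of the stack is ever closed, genuinely overlapping ranges are not handled; that is the extra flexibility the nesting logic would eventually need.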

Once this is incorporated into the MVD-GUI (in place of the XSLT step that currently transforms the XML of the versions into HTML) we will have an electronic edition that is truly free of embedded markup!