Sunday, March 18, 2018

The end of XML

XML is now 20 years old. We might expect the first author of the XML 1.0 specification, Tim Bray, to be enthusiastic about XML's achievements and excited about its prospects for the future. Not a bit of it. In a limp endorsement on xml.com Tim tries diplomatically to think up some nice things to say about XML. But by the end of the article he lets out his true feelings:

People did a lot of that with XML just because there was no other alternative and, well… while it worked, you could do better, and in fact we have done better, for weak values of “better”. I wonder if we’ll ever do better still? As the editor of the IETF JSON RFCs, I’m a pessimist.

It’s been OK

Seriously; XML made a lot of things possible that previously weren’t. It has extended the lifetime of big chunks of our intellectual and legal heritage. It’s created a lot of interesting jobs and led to the production of a lot of interesting software. We could have done better, but you always could have done better.

Happy birthday!

I don't think there are any candles on the cake.

HTML is the new XML

More evidence of the disappearance of XML can be found in the new HTML Imports standard published by the W3C. Remember XInclude?

This document specifies a processing model and syntax for general purpose inclusion. Inclusion is accomplished by merging a number of XML information sets into a single composite infoset.

Now we have HTML Imports:

HTML Imports are a way to include and reuse HTML documents in other HTML documents.
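For comparison, the two mechanisms look like this side by side (the filenames are hypothetical, and HTML Imports syntax is as given in the W3C draft):

```
<!-- XInclude: merge another XML infoset into this document -->
<xi:include href="chapter1.xml"
            xmlns:xi="http://www.w3.org/2001/XInclude"/>

<!-- HTML Imports: include and reuse another HTML document -->
<link rel="import" href="component.html">
```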

Sound familiar? Of course the two are not the same, because XML and HTML are not the same, but the basic need that existed when the XML standards were being defined has now produced yet another case of HTML taking on the capabilities of XML. Other examples: RDFa being redefined for HTML, HTML5 being made independent of SGML, and the CSS 3 Paged Media Module replacing XSL formatting objects. Does the list go on beyond what I know? Probably. One thing is clear: HTML is steadily being made into a replacement for XML in all things. In a couple more years people will even be asking: 'what is XML?' And the museum guide will point a stick at a funny page of complex markup and everyone will go 'Ooooh!'

Saturday, December 23, 2017

Ecdosis and the Charles Harpur Critical Archive

Now that we are close to finishing our first historical digital edition, the Charles Harpur Critical Archive, it is time to articulate the technical design that led to its realisation. It is also worth reflecting on what we achieved. The extant papers of Charles Harpur (1813-1868) consist of 5,225 manuscript pages ranging in difficulty from easy to diabolically complex, 674 published newspaper poems, 140 letters on 403 manuscript pages, and 250 published pages in book form. To give you some idea of how large that is: printing just the last version of each poem took Elizabeth Perkins 1,000 pages. We have included all the versions, which is three times that much, plus all the notes to the poems and the letters, which is double that again. So think 6,000 pages of printed matter. And we did it, including an elaborate user interface, in just 3 years. We recorded every last deleted full-stop. Here's a sample, in case you thought it was easy:

The technical design that made this possible is now described in outline on the CHCA website and also the revised website that will eventually supplant it. It is a general system that can be reused to create a wide range of other editions.

Tuesday, November 17, 2015

The importance of hyphenation

When transcribing original source documents, hyphens are used to break words at the ends of lines. It is vital that we record them (and hence also the lineation), because otherwise we lose a vital piece of information: we can no longer reconstruct the document as it was, and we lose a referencing system for synchronised scrolling and a fine-grained control over where we are in the document. Similarly, for side-by-side display of a transcription and a page-facsimile we need to be able to view the text with the original line-breaks, to compare line for line. Unfortunately, in English and other languages like French (and to a lesser extent Italian) it is not always clear whether the hyphen should be removed when the word is reconstructed. And we reconstruct words all the time: for example, when we index a text for searching, or when displaying a text with line-breaks removed. We need to know that the part-words "guard-" and "ing" are not two words but one word, "guarding". But we are less sure about "thunder-" and "head". Is that "thunderhead" or "thunder-head"? Both are possible.

Fortunately, there is a simple algorithm that tells us whether to remove the hyphen or not in the vast majority of cases:

  1. If the hyphenated form is already in the dynamic dictionary we are building, then don't remove the hyphen. Since hyphenated words occur more often in the middle of lines than at the end, they will usually occur there before we see them split over a line-end. So we simply go with the author's preference.
  2. If at least one part of the hyphenated word is not in the big static dictionary for that language, then we remove the hyphen.
  3. If the two halves are both in the static dictionary but the compound word (sans hyphen) is not, then we retain the hyphen.
  4. If the hyphenated form is in our exceptions table, then we retain the hyphen; otherwise we remove it.
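The four rules above can be sketched in Python. This is an illustration only: the function name, arguments and dictionary representation are my own, not the edition's actual code, and the dictionaries are modelled as simple sets of words.

```python
def rejoin(first, second, dynamic_dict, static_dict, exceptions):
    """Decide how to rejoin a word split over a line-end.

    first, second -- the two halves, without the line-end hyphen
                     (e.g. "guard" and "ing")
    dynamic_dict  -- hyphenated forms already seen mid-line in this text
    static_dict   -- the big static dictionary for the language
    exceptions    -- hyphenated forms we always keep hyphenated
    """
    hyphenated = first + "-" + second
    joined = first + second
    # Rule 1: the author's own mid-line usage wins.
    if hyphenated in dynamic_dict:
        return hyphenated
    # Rule 2: if either half is not a word on its own, the hyphen is
    # a pure line-break artefact, so remove it.
    if first not in static_dict or second not in static_dict:
        return joined
    # Rule 3: both halves are words, but the unhyphenated compound is
    # unknown, so keep the hyphen.
    if joined not in static_dict:
        return hyphenated
    # Rule 4: the compound exists; keep the hyphen only for listed
    # exceptions, otherwise remove it.
    return hyphenated if hyphenated in exceptions else joined
```

So "guard-" / "ing" falls under rule 2 ("ing" is not a dictionary word) and becomes "guarding", while "thunder-" / "head" falls under rule 3 or 4 depending on whether the dictionary contains "thunderhead".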

This works so well that the list of exceptions is scarcely needed.