Saturday, December 23, 2017

Ecdosis and the Charles Harpur Critical Archive

Now that we are close to finishing our first historical digital edition, the Charles Harpur Critical Archive, it is time to articulate the technical design that led to its realisation. It is also worth reflecting on what we achieved. The extant papers of Charles Harpur (1813-1868) consist of 5,225 manuscript pages ranging in difficulty from easy to diabolically complex, 674 published newspaper poems, 140 letters on 403 manuscript pages, and 250 published pages in book form. To give you some idea of how large that is: just to print the last version of each poem took Elizabeth Perkins 1,000 pages. We have included all the versions, which is three times that much, plus all the notes to the poems and the letters, which doubles it again. So think 6,000 pages of printed matter. And we did it, including an elaborate user interface, in just 3 years. We recorded every last deleted full-stop. Here's a sample, in case you thought it was easy:

The technical design that made this possible is now described in outline on the CHCA website and also the revised website that will eventually supplant it. It is a general system that can be reused to create a wide range of other editions.

Tuesday, November 17, 2015

The importance of hyphenation

In original source documents, hyphens are used to break words at the ends of lines. When transcribing, it is vital that we record them (and hence also the lineation), because otherwise we lose information: we can no longer reconstruct the document as it was, and we lose a referencing system that gives synchronised scrolling fine-grained control over where we are in the document. Similarly, for a side-by-side display of a transcription and a page facsimile, we need to be able to view the text with its original line-breaks so that the two can be compared line for line. Unfortunately, in English and in other languages such as French (and to a lesser extent Italian), it is not always clear whether the hyphen should be removed when the word is reconstructed. And we reconstruct words all the time: for example, when we index a text for searching, or when we display a text with line-breaks removed. We need to know that the part-words "guard-" and "ing" are not two words but the single word "guarding". But we are less sure about "thunder-" and "head". Is that "thunderhead" or "thunder-head"? Both are possible.

Fortunately, there is a simple algorithm that tells us whether to remove the hyphen or not in the vast majority of cases:

  1. If the hyphenated form is already in the dynamic dictionary we are building, then we don't remove the hyphen. Since hyphenated words occur more often in the middle of lines than at the end, they will usually turn up there before we see them split over a line-end. So we simply go with the author's preference.
  2. If at least one part of the hyphenated word is not in the large static dictionary for that language, then we remove the hyphen.
  3. If the two halves are both in the static dictionary but the compound word (sans hyphen) is not, then we retain the hyphen.
  4. If the hyphenated form is in our exceptions table then we retain the hyphen; otherwise we remove it.

This works so well that the list of exceptions is scarcely needed.
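
The four rules can be sketched in a few lines of Python. This is only an illustration, not the Ecdosis code: the function name `join_line_end`, its parameters and the tiny sample dictionaries are all invented for the example, and it assumes the rules are tested in the order listed and that dictionary lookups ignore case.

```python
def join_line_end(first, second, seen_hyphenated, static_dict, exceptions):
    """Reconstruct a word split over a line-end, e.g. 'guard-' + 'ing'.

    first, second   -- the two part-words, without the line-end hyphen
    seen_hyphenated -- hypothetical "dynamic dictionary": hyphenated forms
                       already met in the middle of lines
    static_dict     -- hypothetical static word list for the language
    exceptions      -- hypothetical table of forms that always keep the hyphen
    """
    hyphenated = f"{first}-{second}"
    joined = first + second

    # Rule 1: the author has already used the hyphenated form mid-line.
    if hyphenated.lower() in seen_hyphenated:
        return hyphenated
    # Rule 2: at least one half is not a word in its own right.
    if first.lower() not in static_dict or second.lower() not in static_dict:
        return joined
    # Rule 3: both halves are words, but the unhyphenated compound is not.
    if joined.lower() not in static_dict:
        return hyphenated
    # Rule 4: fall back on the exceptions table; otherwise drop the hyphen.
    if hyphenated.lower() in exceptions:
        return hyphenated
    return joined


# Invented sample data, just to exercise each rule.
WORDS = {"guard", "thunder", "head", "thunderhead", "sea", "bird"}

print(join_line_end("guard", "ing", set(), WORDS, set()))
print(join_line_end("sea", "bird", set(), WORDS, set()))
print(join_line_end("thunder", "head", set(), WORDS, set()))
print(join_line_end("thunder", "head", {"thunder-head"}, WORDS, set()))
```

With this sample data "guard-" + "ing" falls under rule 2 because "ing" is not a word, "sea-" + "bird" keeps its hyphen under rule 3 because "seabird" is missing from the toy word list, and "thunder-" + "head" is joined unless the hyphenated form has already been seen mid-line or is listed as an exception.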

Tuesday, September 8, 2015


There was an interesting workshop yesterday in Verona, "EKDOSIS ON THE NET: How do/can old frames endure/resist the new displaying". I presented, via Skype, a description of how far the Ecdosis editing tools have progressed. Immediately before me another speaker presented a theoretical approach based on standard TEI-XML encoding, but my presentation was not polemical: it sought only to explain the tools and the reasons for our different approach. The audience reaction, however, was surprising. The first question was "Why do we have to learn TEI?", and other comments during the discussion that followed seemed to indicate that many agreed with me that what we need are graphical interface tools for the digital edition, not complex markup solutions.