Tuesday, November 17, 2015

The importance of hyphenation

In original source documents, hyphens are used to break words at the ends of lines. When transcribing, it is vital that we record them (and hence also the lineation), because otherwise we lose important information: we can no longer reconstruct the document as it was, and we lose a referencing system that gives synchronised scrolling fine-grained control over where we are in the document. Similarly, for side-by-side display of a transcription and a page facsimile, we need to be able to view the text with its original line-breaks so that we can compare line for line. Unfortunately, in English and other languages like French (and to a lesser extent Italian) it is not always clear whether the hyphen should be removed when the word is reconstructed. And we reconstruct words all the time: for example, when we index a text for searching, or when we display a text with line-breaks removed. We need to know that the part-words "guard-" and "ing" are not two words but the one word "guarding". But we are less sure about "thunder-" and "head". Is that "thunderhead" or "thunder-head"? Both are possible.

Fortunately, there is a simple algorithm that tells us whether to remove the hyphen or not in the vast majority of cases:

  1. If the hyphenated form is already in the dynamic dictionary we are building, then don't remove the hyphen. Since hyphenated words occur more often in the middle of lines than at the end, they will usually occur there before we see them split over a line-end. So we simply go with the author's preference.
  2. If at least one part of the hyphenated word is not in the big static dictionary for that language, then we remove the hyphen.
  3. If the two halves are both in the static dictionary but the compound word (sans hyphen) is not, then we retain the hyphen.
  4. Otherwise (both halves and the compound are all in the static dictionary), if the hyphenated form is in our exceptions table then we retain the hyphen; if not, we remove it.

This works so well that the list of exceptions is scarcely needed.
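
The four rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the Ecdosis tools: the function name join_line_end, the toy LEXICON, and the choice to hold all three word lists as sets of lowercase strings are all hypothetical.

```python
# Toy stand-in for the big static dictionary of the language (assumption).
LEXICON = {"guard", "guarding", "thunder", "head", "thunderhead", "well", "known"}


def join_line_end(first, second, seen, lexicon, exceptions=frozenset()):
    """Rejoin a word split over a line-end, e.g. ("thunder-", "head").

    seen       -- dynamic dictionary: words already met in this document
    lexicon    -- static dictionary for the language
    exceptions -- hyphenated forms that must always keep their hyphen
    All three are assumed to hold lowercase strings.
    """
    left = first.rstrip("-")
    hyphenated = left + "-" + second
    joined = left + second

    # Rule 1: the author has already used the hyphenated form, so keep it.
    if hyphenated.lower() in seen:
        return hyphenated
    # Rule 2: at least one half is not a word on its own, so drop the hyphen.
    if left.lower() not in lexicon or second.lower() not in lexicon:
        return joined
    # Rule 3: both halves are words but the compound is not, so keep the hyphen.
    if joined.lower() not in lexicon:
        return hyphenated
    # Rule 4: everything is a word; keep the hyphen only for listed exceptions.
    return hyphenated if hyphenated.lower() in exceptions else joined
```

With this toy lexicon, join_line_end("guard-", "ing", set(), LEXICON) falls through to rule 2 because "ing" is not a word, while join_line_end("thunder-", "head", {"thunder-head"}, LEXICON) is settled by rule 1 because the hyphenated form has already been seen in the document.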

Tuesday, September 8, 2015


There was an interesting workshop yesterday at Verona, "EKDOSIS ON THE NET: How do/can old frames endure/resist the new displaying". Via Skype, I presented a description of how far the Ecdosis editing tools have progressed. Immediately before me another speaker presented a theoretical approach based on standard TEI-XML encoding, but my presentation was not polemical: it only sought to explain the tools and the reason for our different approach. The audience reaction, however, was surprising. The first question was "Why do we have to learn TEI?", and other comments during the discussion that followed seemed to indicate that many agreed with me that what we need are graphical interface tools for the digital edition, not complex markup solutions.