Tuesday, November 17, 2015

The importance of hyphenation

In original source documents, hyphens are used to break words at the ends of lines. When transcribing, it is vital that we record them (and hence also the lineation), because otherwise we lose an important piece of information: we can no longer reconstruct the document as it was, and we lose a referencing system for synchronised scrolling, which gives fine-grained control over where we are in the document. Similarly, for a side-by-side display of a transcription and a page facsimile, we need to be able to view the text with its original line-breaks so that we can compare line for line.

Unfortunately, in English and other languages such as French (and to a lesser extent Italian), it is not always clear whether the hyphen should be removed when the word is reconstructed. And we reconstruct words all the time: for example, when we index a text for searching, or when we display a text with line-breaks removed. We need to know that the part-words "guard-" and "ing" are not two words but one word, "guarding". But we are less sure about "thunder-" and "head". Is that "thunderhead" or "thunder-head"? Both are possible.

Fortunately, there is a simple algorithm that tells us whether to remove the hyphen or not in the vast majority of cases:

  1. If the hyphenated form is already in the dynamic dictionary we are building, then we don't remove the hyphen. Since hyphenated words occur more often in the middle of lines than at the end, they will usually occur there before we see them split over a line-end. So we simply go with the author's preference.
  2. If at least one part of the hyphenated word is not in the big static dictionary for that language, then we remove the hyphen.
  3. If the two halves are both in the static dictionary but the compound word (sans hyphen) is not, then we retain the hyphen.
  4. If the hyphenated form is in our exceptions table, then we retain the hyphen; otherwise we remove it.

This works so well that the list of exceptions is scarcely needed.
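
The four rules can be sketched in a few lines of Python. This is only an illustration, assuming the rules are tried in the order listed. The function name resolve_line_end_hyphen, the tiny STATIC_DICTIONARY and EXCEPTIONS sets and the "re-creation" example are all made up for the purpose, not taken from any real tool or word list.

```python
# Hypothetical stand-ins: a real static dictionary would hold a full
# word list for the language, and the exceptions table would be curated.
STATIC_DICTIONARY = {
    "guard", "guarding", "thunder", "head", "thunderhead",
    "well", "known", "re", "creation", "recreation",
}
EXCEPTIONS = {"re-creation"}


def resolve_line_end_hyphen(first, second, seen, dictionary, exceptions):
    """Rebuild a word split over a line-end, e.g. "thunder-" + "head".

    `seen` is the dynamic dictionary: hyphenated forms already met
    in the middle of a line earlier in the same text.
    """
    first = first.rstrip("-")
    hyphenated = f"{first}-{second}"
    joined = first + second

    # Rule 1: the author has already used the hyphenated form mid-line.
    if hyphenated.lower() in seen:
        return hyphenated
    # Rule 2: at least one half is not a word on its own, so join them.
    if first.lower() not in dictionary or second.lower() not in dictionary:
        return joined
    # Rule 3: both halves are words but the joined compound is not.
    if joined.lower() not in dictionary:
        return hyphenated
    # Rule 4: keep the hyphen only if the form is a listed exception.
    if hyphenated.lower() in exceptions:
        return hyphenated
    return joined


# "ing" is not a word on its own, so rule 2 joins the halves.
print(resolve_line_end_hyphen("guard-", "ing", set(),
                              STATIC_DICTIONARY, EXCEPTIONS))
# Once "thunder-head" has been seen mid-line, rule 1 keeps the hyphen.
print(resolve_line_end_hyphen("thunder-", "head", {"thunder-head"},
                              STATIC_DICTIONARY, EXCEPTIONS))
```

In this sketch the dynamic dictionary is just a set passed in by the caller, who would add every hyphenated word met in the middle of a line as the transcription is read.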