Saturday, August 3, 2013

Those pesky hyphens

One of the first problems I ever had to deal with in our Wittgenstein edition was 'what to do with the line-endings'. It seems such a simple problem: the author writes or types a text, and hits the return key or starts a new line whenever he/she runs out of space. But this means that the author may hyphenate words over line-breaks. It is these particular line endings, and not those automatically put in afterwards by software, that the scholarly editor seeks to preserve.

In print or on screen, lines are usually reflowed to make up the full line length of the edition. New hyphens not written by the author may thus be introduced. But remember, the sources were, if prepared correctly, recorded using the author's hyphens. If a word was hyphenated naturally, say the word 'sudden-ly', it is straightforward for software to restore the original word, 'suddenly', and even to record that the original had a hyphen there (perhaps to be revealed later via a stylesheet that breaks lines as in the original). But what about a line-break at a hyphen? This happens quite frequently, perhaps 10% of the time: e.g. 'dog-flesh', 'ram-paddock'. A hyphen is such a convenient place to break the line that authors frequently avail themselves of this opportunity.

Hard vs soft hyphens

Ah, but now you see the dilemma. The correct restoration of 'ram-[new-line]paddock' is not 'rampaddock' but 'ram-paddock'. Humans recognise the difference at once, and hence don't even bother to record the difference between such 'hard' hyphens and the 'soft' hyphen in 'sudden-ly'. But computers are a bit stupid. My guess is that most digital scholarly editions don't even consider this problem, and simply choose either to keep every hyphen at line-end or to drop them all. Admittedly hard hyphens are mostly an Anglo-French phenomenon (e.g. grand-mère), though they are also common in some Italian words, e.g. gonna-pantalone, Milano-Roma etc. They are quite rare in the other Germanic languages.

Strategy 1

Recording the hyphens means that you have to display the transcript exactly as it was typed or written -- limiting the ways that the text can be redisplayed on, say, small screens. Alternatively you can have a stylesheet hide all the hyphens at line-end. But that will get the hard hyphens, about 10% of the cases, wrong.
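
To make the failure concrete, here is a minimal sketch of the "hide every hyphen at line-end" approach. The function name join_naive and the sample lines are made up for illustration; they are not part of our software.

```python
def join_naive(lines):
    """Join lines into one string, deleting every hyphen found at line-end."""
    out = []
    for line in lines:
        line = line.rstrip()
        if line.endswith("-"):
            out.append(line[:-1])   # drop the hyphen and glue to the next line
        else:
            out.append(line + " ")  # ordinary line-break becomes a space
    return "".join(out).strip()


# Hypothetical sample: one soft hyphen (sud-denly), one hard (ram-paddock).
sample = ["he sud-", "denly saw the ram-", "paddock gate"]
print(join_naive(sample))  # he suddenly saw the rampaddock gate
```

The soft hyphen is restored correctly, but the hard one is lost: 'ram-paddock' comes out as 'rampaddock'.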

Strategy 2

Deleting the hyphens and joining up the words means a lot of hard work if they have already been recorded, and in any case distorts the text. The advantage is that the text can be reflowed to fit the window on whatever device it is displayed on and the hyphens are always right, since only the hard hyphens are left. The drawback is that we can't get the original soft hyphens or line-breaks back if we need them.


There has to be some way to encode the hyphens and yet display them or hide them on request, without manually entering the distinction between hard and soft hyphens. If we look up 'ram' and 'paddock' in a dictionary and find both, but do not find 'rampaddock', then the correct restoration must be 'ram-paddock'. On the other hand 'sudden' is a word, but 'ly' probably isn't, and in any case 'suddenly' is in the dictionary, so we can work out that the correct restoration must be 'suddenly'. Making this work for all imports of new files recorded in either XML or plain text has taken me some time, but the addition of 'intelligent hyphens' to our software gives us the edge over our rivals.
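
The lookup rule can be sketched in a few lines. This is only an illustration under stated assumptions, not the code in our software: the function name classify_hyphen is hypothetical, a plain Python set stands in for a real dictionary, and the order of the checks (joined word first, then the two halves, otherwise soft) is my reading of the rule above.

```python
def classify_hyphen(left, right, words):
    """Return 'hard' or 'soft' for a hyphen that ends a line.

    left, right: the fragments before and after the line-break.
    words: any container supporting `in`, standing in for a dictionary.
    """
    joined = (left + right).lower()
    if joined in words:
        return "soft"   # 'suddenly' is a word, so the hyphen was incidental
    if left.lower() in words and right.lower() in words:
        return "hard"   # 'ram' and 'paddock' exist, 'rampaddock' does not
    return "soft"       # assumed fallback when the lookup is inconclusive


# Toy word list, for illustration only.
WORDS = {"ram", "paddock", "sudden", "suddenly", "dog", "flesh"}

print(classify_hyphen("ram", "paddock", WORDS))  # hard
print(classify_hyphen("sudden", "ly", WORDS))    # soft
```

Checking the joined form first means a word that exists both ways is treated as soft by default.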

I used GNU's aspell library, which has numerous dictionaries. So we can dehyphenate almost any living language. The only difficulty is variant spellings. For example, 'ram-piddick' is a hyphenated word in Joseph Furphy, but 'piddick' is not in the English dictionary, so the hyphen will be mis-recognised as soft. Similar problems will occur with authors from the 16th century like Shakespeare. My solution in such cases is to just allow the editor to change it back to hard manually. Alternatively, if that is too much work, a dictionary could be compiled and added to aspell. However, the success rate even with the Furphy texts is close to 100% just using ordinary dictionary lookup, so I don't think there is too much to be worried about and a lot to be satisfied with.


I had the idea that a list of exceptions might overcome residual problems. The default behaviour would work fine in most cases, but if you wanted to preserve hyphens at line-end even when it might seem that you don't need them -- for example if an author consistently wrote 'scape-goat' instead of 'scapegoat' -- then you could do that.
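
One way such an exception list might sit on top of the dictionary lookup is sketched below. Everything here is hypothetical: the names resolve and keep_hyphen, the toy word list, and the choice to consult the exceptions before the dictionary are assumptions for illustration only.

```python
def resolve(left, right, words, keep_hyphen=frozenset()):
    """Restore a word broken at a line-end hyphen.

    keep_hyphen: lower-case hyphenated forms that must always keep their
    hyphen, whatever the dictionary says (the exception list).
    """
    hyphenated = f"{left}-{right}"
    joined = left + right
    if hyphenated.lower() in keep_hyphen:
        return hyphenated   # editor's exception overrides the lookup
    if joined.lower() in words:
        return joined       # soft hyphen: the joined form is a word
    if left.lower() in words and right.lower() in words:
        return hyphenated   # hard hyphen: only the halves are words
    return joined           # assumed default when the lookup fails


# Toy word list, for illustration only.
WORDS = {"scapegoat", "goat", "ram", "paddock"}

print(resolve("scape", "goat", WORDS))                  # scapegoat
print(resolve("scape", "goat", WORDS, {"scape-goat"}))  # scape-goat
```

The same list would also cover a case like 'ram-piddick', where the dictionary lookup alone falls back to soft.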