<h1>Multi-Version Documents</h1>
<p>This project is about developing a general system for creating and maintaining digital scholarly editions that exist in multiple versions.</p>
<h2>The end of XML (2018-03-18)</h2>
<p>XML is now 20 years old. We might expect the first author of the XML 1.0 specification, Tim Bray, to be enthusiastic about XML's achievements and excited about its prospects for the future. Not a bit of it. In a limp endorsement on <a href="https://www.xml.com/articles/2018/02/10/xml-20/">xml.com</a> Tim tries diplomatically to think up some nice things to say about XML. But by the end of the article he lets out his true feelings:</p>
<blockquote>
<p>People did a lot of that with XML just because there was no other alternative and, well… while it worked, you could do better, and in fact we have done better, for weak values of “better”. I wonder if we’ll ever do better still? As the editor of the IETF JSON RFCs, I’m a pessimist.</p>
<h4>It’s been OK</h4>
<p>Seriously; XML made a lot of things possible that previously weren’t. It has extended the lifetime of big chunks of our intellectual and legal heritage. It’s created a lot of interesting jobs and led to the production of a lot of interesting software. We could have done better, but you always could have done better.</p>
<p>Happy birthday!</p>
</blockquote>
<p>I don't think there are any candles on the cake.</p>
<h3>HTML is the new XML</h3>
<p>More evidence of the disappearance of XML can be found in the new HTML Imports standard published by the W3C. Remember <a href="https://www.w3.org/TR/xinclude/">XInclude</a>?</p>
<blockquote>This document specifies a processing model and syntax for general purpose inclusion. Inclusion is accomplished by merging a number of XML information sets into a single composite infoset. </blockquote>
<p>Now we have <a href="https://www.w3.org/TR/html-imports/">HTML Imports</a>:</p>
<blockquote>HTML Imports are a way to include and reuse HTML documents in other HTML documents.</blockquote>
<p>Sound familiar? Of course they are not the same, because XML and HTML are not the same, but the basic need that existed when the XML standards were being defined has now produced yet another instance of HTML taking on the capabilities of XML. The others are <a href="https://www.w3.org/TR/html-rdfa/">RDFa being redefined for HTML</a>, HTML5 being defined independently of SGML, and the <a href="https://www.w3.org/TR/css3-page/">CSS 3 Paged Media Module</a> replacing XSL formatting objects. Does the list go on beyond what I know? Probably. One thing is clear: HTML is more and more being made into a replacement for XML in all things. In a couple more years people will even be asking: 'what is XML?' And the Museum Guide will point with a stick to a funny page of complex markup and everyone will go 'Ooooh!'</p>
<h2>Ecdosis and the Charles Harpur Critical Archive (2017-12-23)</h2>
<p>Now that we are close to finishing our first historical digital edition, the <a href="http://charles-harpur.org">Charles Harpur Critical Archive</a>, it was time to articulate the technical design that led to its realisation. It is also worth reflecting on what we achieved. The extant papers of Charles Harpur (1813-1868) consist of 5,225 manuscript pages ranging in difficulty from easy to diabolically complex, 674 published newspaper poems, 140 letters on 403 manuscript pages, and 250 published pages in book form. To give you some idea of the scale: just printing the last version of each poem took Elizabeth Perkins 1,000 pages. We have included <em>all</em> the versions, which is three times that much, plus all the notes to the poems and the letters, which is double that again. So think 6,000 pages of printed matter. And we did it, including an elaborate user interface, in just 3 years.
We recorded every last deleted full-stop. Here's a sample, in case you thought it was easy:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhaTJM8bc8Y2x71-T1yK8Bf8J6v2XP3fNq6eUGPbwZRxoR7fMMqNivcmDK7ITzWLlJvFEi0p7ObsSmcs9cfOkrRjAJQrBBOCrn4eRPPv0FCC3chZM3snaQ5S_mbODO_F8fvjKRSbY6IW_4k/s1600/complex.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhaTJM8bc8Y2x71-T1yK8Bf8J6v2XP3fNq6eUGPbwZRxoR7fMMqNivcmDK7ITzWLlJvFEi0p7ObsSmcs9cfOkrRjAJQrBBOCrn4eRPPv0FCC3chZM3snaQ5S_mbODO_F8fvjKRSbY6IW_4k/s320/complex.png" width="320" height="134" data-original-width="1431" data-original-height="601" /></a></div>
<p>The technical design that made this possible is now described in outline on the <a href="http://charles-harpur.org/harpur/tabs?docid=english/harpur/about/technical&module=para%3Fdocid%3Denglish/harpur/about/technical&tabset=about">CHCA website</a> and also on the <a href="http://chca-test.pro/About/Technical%20design/">revised website</a> that will eventually supplant it. It is a general system that can be reused to create a wide range of other editions.</p>
<h2>The importance of hyphenation (2015-11-17)</h2>
<p>When transcribing original source documents, hyphens are used to break words at the ends of lines. It is vital that we record them (and hence also the lineation), because otherwise we lose a vital piece of information: we can no longer reconstruct the document as it was, and we lose a referencing system for synchronised scrolling, a fine-grained control over where we are in the document. Similarly, for side-by-side display of a transcription and a page-facsimile we need to be able to view the text with the original line-breaks, to compare line for line. Unfortunately, in English and other languages like French (and to a lesser extent Italian) it is not always clear whether the hyphen should be removed when the word is reconstructed. And we reconstruct words all the time: for example, when we index a text for searching, or when displaying a text with line-breaks removed. We need to know that the part-words "guard-" and "ing" are not two words but one word, "guarding". But we are less sure about "thunder-" and "head". Is that "thunderhead" or "thunder-head"? Both are possible.</p>
<p>Fortunately, there is a simple algorithm that tells us whether to remove the hyphen or not in the vast majority of cases:</p>
<ol><li>If the hyphenated form is already in the dynamic dictionary we are building then don't remove the hyphen. Since hyphenated words occur more often in the middle of lines than at the end they will usually occur there before we see them split over line-end. So we simply go with the author's preference.</li>
<li>If at least one part of the hyphenated word is not in the static big dictionary for that language then we remove the hyphen.</li>
<li>If the two halves are both in the static dictionary but the compound word (sans hyphen) is not, then we retain the hyphen.</li>
<li>If the hyphenated form is in our exceptions table then we retain the hyphen, otherwise we remove it.</li></ol>
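<p>The four rules above can be sketched in a few lines of code. Plain sets stand in here for the dynamic dictionary, the big static dictionary and the exceptions table; all the names are illustrative, not taken from the actual implementation:</p>

```python
def keep_hyphen(first, second, dynamic_dict, static_dict, exceptions):
    """Return True if the line-end hyphen in first+'-'+second should be
    kept (a 'hard' hyphen), False if it should be removed ('soft')."""
    hyphenated = first + "-" + second
    # 1. The author already used the hyphenated form mid-line: keep it.
    if hyphenated in dynamic_dict:
        return True
    # 2. At least one half is not a word on its own: remove the hyphen.
    if first not in static_dict or second not in static_dict:
        return False
    # 3. Both halves are words but the fused compound is not: keep it.
    if (first + second) not in static_dict:
        return True
    # 4. Otherwise consult the exceptions table.
    return hyphenated in exceptions

static = {"guard", "thunder", "head", "sudden", "suddenly"}
print(keep_hyphen("guard", "ing", set(), static, set()))     # False -> 'guarding'
print(keep_hyphen("thunder", "head", set(), static, set()))  # True  -> 'thunder-head'
```

<p>Note that rule 3 keeps 'thunder-head' here only because 'thunderhead' is absent from this toy dictionary; with a dictionary containing both forms, rule 4 would decide.</p>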
<p>This works so well that the list of exceptions is scarcely needed.</p>
<h2>Ecdosis (2015-09-08)</h2>
<p>There was an interesting workshop yesterday at Verona, <a href="http://filologiadigitale-verona.it"><q>EKDOSIS ON THE NET
How do/can old frames endure/resist the new displaying</q></a>. I <a href="http://ecdosis.net/presentations/verona2015.pdf">presented</a> a description, via Skype, of how far the Ecdosis editing tools have progressed. Immediately before me another speaker had presented a theoretical approach based on standard TEI-XML encoding, but my presentation was not polemical: it only sought to explain the tools and the reasons for our different approach. The audience reaction, however, was surprising. The first question was <q>Why do we have to learn TEI?</q>, and other comments during the discussion that followed seemed to indicate that many agreed with me that what we need are graphical interface tools for the digital edition, <em>not</em> complex markup solutions.</p>
<h2>TILT makes good progress (2014-06-27)</h2>
<p>Check out the <a href="http://bltilt.blogspot.com.au/">TILT blog</a>. You can follow us on Twitter @bltilt.</p>
<h2>NMergec and Tagore's Golden Boat (2014-05-18)</h2>
<p>Nmergec, the rewritten form of nmerge in the C language, has reached a milestone: it has managed to merge 11 versions of the Tagore poem সোনার তরী (Golden Boat). Since there are several transpositions and much variation between the versions, there is good reason to think that nmergec now 'works', and that it can handle much longer texts. It proves that the revised program can handle Asian scripts like Bengali with ease, and it promises to deliver other benefits, such as the ability to merge (and hence compare) any number of versions of texts of any size.
So, for example, it is designed to handle entire novels with rearranged chapters, or texts that exist in thousands of versions (such as the Greek New Testament). For my next test, though, I'll be trying to merge an entire novel, perhaps Joseph Furphy's <em>Such is Life</em> or Conrad's <em>Under Western Eyes</em>, to get an idea of the practical size limits. There is a description of the nmergec program and its advantages on <a href="http://digitalvariants.blogspot.com.au/2014/05/merging-multi-version-texts-mark-2.html">Digital Variants</a>.</p>
<p>The merging time for the 11 versions was just 0.5 seconds, and that is with a ton of debugging code. Once it is stripped down and optimised it should be much faster. Here's one version of the poem:</p>
<pre>সোনার তরী
গগনে গরজে মেঘ, ঘন বরষা।
কূলে একা বসে’ আছি, নাহি ভরসা।
রাশি রাশি ভারা ভারা
ধান কাটা হ’ল সারা,
ভরা নদী ক্ষুরধারা
খর-পরশা।
কাটিতে কাটিতে ধান এল বরষা।
একখানি ছোট ক্ষেত আমি একেলা,
চারিদিকে বাঁকা জল করিছে খেলা।
পরপারে দেখি আঁকা
তরুছায়ামসীমাখা
গ্রামখানি মেঘে ঢাকা
প্রভাত বেলা।
এ পারেতে ছোট ক্ষেত আমি একেলা।
গান গেয়ে তরী বেয়ে কে আসে পারে!
দেখে’ যেন মনে হয় চিনি উহারে।
ভরা-পালে চলে’ যায়,
কোনো দিকে নাহি চায়,
ঢেউগুলি নিরুপায়
ভাঙে দু’ধারে,
দেখে’ যেন মনে হয় চিনি উহারে।
ওগো তুমি কোথা যাও কোন বিদেশে!
বারেক ভিড়াও তরী কূলেতে এসে।
যেয়ো যেথা যেতে চাও,
যারে খুসি তা’রে দাও,
শুধু তুমি নিয়ে যাও
ক্ষণিক হেসে
আমার সোনার ধান কূলেতে এসে।
যত চাও তত লও তরণী পরে।
আর আছে?—আর নাই, দিয়েছি ভরে’।
এতকাল নদীকূলে
যাহা লয়ে’ ছিনু ভুলে’
সকলি দিলাম তুলে’
থরে বিথরে
এখন আমারে লহ করুণা করে’।
ঠাঁই নাই, ঠাঁই নাই! ছোট সে তরী
আমারি সোনার ধানে গিয়েছে ভরি’।
শ্রাবণ গগন ঘিরে
ঘন মেঘ ঘুরে ফিরে
শূন্য নদীর তীরে
রহিনু পড়ি’,
যাহা ছিল নিয়ে গেল সোনার তরী।
ফাল্গুন, ১২৯৮।</pre>
<h2>The slow tide turns... (2014-03-14)</h2>
<p>It is important that there is an open discussion about the forms that digital representations of historical artefacts may take. With this end in view I have prepared an article for the Journal of the Text Encoding Initiative entitled <a href="https://docs.google.com/document/d/1nvyFrfWFuSdsfxIkWYpI5Ga639BBIjqLM-LPAuJtyxM/edit">"Towards an Interoperable Digital Scholarly Edition"</a>. What most people are using as a shared data format at the moment are the <a href="http://www.tei-c.org/Guidelines/P5/">TEI Guidelines</a>. The problems with this are that:</p>
<ol type="a"><li>it is tied to a piece of technology, namely XML, rather than being an abstract specification, and</li><li>it is based on subjective human judgements and is most definitely <em>not</em> interoperable.</li></ol>
<p>That's a big problem, because it means we can't share our work. For example, currently I can't efficiently edit an edition of anything with a number of colleagues located in various parts of the world. I would have to spend a great deal of time trying to homogenise the codes used to describe features and to force everyone to use them consistently. I also can't cite or annotate that digital work and share that work with others. And I can't reuse someone else's digital scholarly edition, by extending or repurposing it. I think that's a waste of human effort and until we fix the problem this part of the digital humanities is going to suffer. So my paper is about how that problem can be fixed.</p>
<p>The reaction so far has been to adduce two arguments:</p>
<ol><li>We don't really need interoperability, just the ability to record subjective judgements about the text. That's what humanists do, after all.</li>
<li>The interoperability problem can be solved by reducing the tag set so that there is only one way to encode everything.</li></ol>
<p>In response to 1., it is clear that the digital humanists working on digital editions have for years been calling for interoperability. Just a glance at the <a href="http://www.interedition.eu/wiki/index.php/Main_Page">Interedition project</a> makes it clear that this objective is essential rather than optional. It is hard to believe that a consensus will be reached that collaboration isn't necessary, but that is what this objection amounts to.</p>
<p>In response to 2., this has already been tried multiple times: e.g. <a href="http://www.tei-c.org/Guidelines/Customization/Lite/">TEI Lite</a>, <a href="http://www.tei-c.org/release/doc/tei-p5-exemplars/html/tei_tite.doc.html">TEI Tite</a>, <a href="https://dev2.dariah.eu/wiki/display/TextGrid/Search+Index+and+Baseline+Encoding">Textgrid Baseline encoding</a>, <a href="http://www.deutschestextarchiv.de/doku/basisformat_texterschliessung_inhaltlich">DTA Basisformat</a>, <a href="http://cscdc.northwestern.edu/blog/?p=872">TEI Nudge</a>. For example, the TEI Tite specification says it "is meant to prescribe exactly one way of encoding a particular feature of a document in as many cases as possible, ensuring that any two encoders would produce the same XML document for a source document". But if I see italics in a text and try to encode it with TEI Tite I can still use &lt;i&gt;, &lt;abbr&gt;, &lt;foreign&gt;, &lt;hi&gt; or &lt;seg&gt; with various rend attributes, &lt;label&gt;, &lt;ornament&gt;, &lt;stage&gt; or &lt;title&gt;. If every tag I encode can differ from every tag you encode in the same document, the problem is magnified the longer the text goes on. Like a human fingerprint, no two transcriptions can ever be the same.</p>
<p>The fact is that marking up a text that never had any markup when it was created is a very different proposition to writing one with tags in it from the start. Since only digital humanists do this much, it is easy to overlook this distinction. If I say that something is italics when I write it, that's what it is. If I print it and then someone else marks it up they may use a code that indicates that my "italics" is really a foreign word, title, emphatic statement or stage direction, etc. That's interpretation and on a tag-by-tag basis the alternatives (including whether to record the feature at all) are manifold. So tag-reduction doesn't work and never will for this reason.</p>
<p>If you're curious about these arguments, have a read now by following the link above, or wait until volume 7 of the Journal of the TEI comes out in a month or so.</p>
<h2>An all-in-one installer for AustESE (2013-12-13)</h2>
<p>There is now <a href="https://github.com/AustESE-Infrastructure/austese_install">an installer for the AustESE publishing system</a>. I am painfully aware of the missing bits, as listed in the readme, but I prefer to focus on the positives. You can run the installer on Ubuntu and you get something that can manage the creation of a digital scholarly edition. And what it addresses are the perennial problems of interoperability and overlap. This is a general system that does everything that users asked for, not what I think it should be. It's what we have all been working on for 18 months, and we will keep improving it until it is finished and perfected. Check it out, see if you can get it going, and tell me what you think. I'm listening.</p>
<h2>Those pesky hyphens (2013-08-03)</h2>
<p>One of the first problems I ever had to deal with in our Wittgenstein edition was what to do with the line-endings. It seems such a simple problem: an author writes or types a text, and hits the return key or starts a new line whenever he or she runs out of space. But this means that the author may hyphenate words over line-breaks. These particular line endings, and <em>not</em> those automatically inserted afterwards by software, are what the scholarly editor seeks to preserve.</p>
<p>In print or on screen lines are usually reflowed to make up the full line length of the edition. New hyphens not written by the author may thus be introduced. But remember, the sources were, if prepared correctly, recorded using the author's hyphens. If a word was hyphenated naturally, say the word 'sudden-ly', it is straightforward for software to restore the original word, 'suddenly', and even to indicate that in the original a hyphen was stored there. (Perhaps to be revealed later via a stylesheet that breaks lines as in the original). But what about a line-break <em>at a hyphen</em>? This happens quite frequently, perhaps 10% of the time: e.g. 'dog-flesh', 'ram-paddock'. A hyphen is such a convenient place to break the line that authors frequently avail themselves of this opportunity.</p>
<h3>Hard vs soft hyphens</h3>
<p>Ah, but now you see the dilemma. The correct restoration of 'ram-[new-line]paddock' is <em>not</em> 'rampaddock' but 'ram-paddock'. Humans recognise the difference at once, and hence don't even bother to record the distinction between such 'hard' hyphens and the 'soft' hyphen in 'sudden-ly'. But computers are a bit stupid. My guess is that most digital scholarly editions don't even consider this problem, choosing either to ignore hyphens at line-end or to keep them all. Admittedly, hard hyphens are mostly an Anglo-French phenomenon (e.g. grand-mère), but they are also common in some Italian words (e.g. gonna-pantalone, Milano-Roma) and quite rare in Germanic languages.
</p>
<h3>Strategy 1</h3>
<p>Recording the hyphens means that you have to display the transcript exactly as it was typed or written -- limiting the ways that the text can be redisplayed on, say, small screens. Alternatively you can have a stylesheet hide all the hyphens at line-end. But that will get about 10% of the cases wrong.</p>
<h3>Strategy 2</h3>
<p>Deleting the hyphens and joining up the words means a lot of hard work if they have already been recorded, and in any case distorts the text. But the advantage is that the text can be reflowed to fit the window on whatever device it is displayed on and the hyphens are always right, since only the hard hyphens are left. But then we can't get the original soft hyphens or line-breaks back if we need them.</p>
<h3>Solution</h3>
<p>There has to be some way to encode the hyphens and yet display them or hide them on request, without manually entering the distinction between hard and soft hyphens. If we have a dictionary and look up the words 'ram' and 'paddock' (both present) but the word 'rampaddock' is not, then the correct hyphenation must be 'ram-paddock'. On the other hand 'sudden' is a word, but 'ly' probably isn't. At least 'suddenly' is in the dictionary, so we can work out that the correct hyphenation must be 'suddenly'. Making this work for all imports of new files recorded in either XML or plain text has taken me some time, but the addition of 'intelligent hyphens' to our software gives us the edge over our rivals.</p>
<p>I used GNU's aspell library, which has numerous dictionaries, so we can <em>de</em>hyphenate almost any living language. The only problem is that variant spellings will cause trouble. For example, 'ram-piddick' is a hyphenated word in Joseph Furphy, but 'piddick' is not in the English dictionary, so the hyphen will be mis-recognised as soft. Similar problems will occur with authors from the 16th century like Shakespeare. My solution in such cases is simply to allow the editor to change it back to hard manually. Alternatively, if that is too much work, a custom dictionary could be compiled and added to aspell. However, the success rate even with the Furphy texts is close to 100% using ordinary dictionary lookup, so I think there is little to worry about and a lot to be satisfied with.</p>
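<p>The dictionary test described above can be sketched as follows. A plain word set stands in for the aspell lookup, and the predicate name <code>known_word</code> is purely illustrative:</p>

```python
def hyphen_is_hard(first, second, known_word):
    """Classify a line-end hyphen as hard (keep) or soft (remove).
    `known_word` is any word-lookup predicate, e.g. a spell-checker call."""
    if known_word(first + second):       # 'sudden' + 'ly' -> 'suddenly'
        return False                     # soft: the halves fuse into one word
    # hard only if both halves are themselves words, as in 'ram-paddock'
    return known_word(first) and known_word(second)

words = {"sudden", "suddenly", "ram", "paddock"}
print(hyphen_is_hard("sudden", "ly", words.__contains__))    # False: soft
print(hyphen_is_hard("ram", "paddock", words.__contains__))  # True: hard
# 'piddick' is missing from the dictionary, so 'ram-piddick' is
# mis-classified as soft -- exactly the failure mode described above.
print(hyphen_is_hard("ram", "piddick", words.__contains__))  # False
```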
<h3>Addendum</h3>
<p>I had the idea that a list of exceptions might overcome the residual problems. The default behaviour would then work fine in most cases, but if you wanted to preserve hyphens at line-end even when it might seem that you don't need them -- for example if an author consistently wrote 'scape-goat' instead of 'scapegoat' -- then you could do that.</p>
<h2>The vicious life-cycle of the digital scholarly edition (2013-05-28)</h2>
<p>Most digital scholarly editions (DSEs) are stored on web-servers as a collection of files (images and text), database fields and tables in binary form. The structure of the data is tied closely to the software, which expects resources to be in a precise format and location. So moving a digital scholarly edition to another web-server, which probably runs a different content management system, different scripting languages and a different database, is usually considered either impossible or too expensive. And sooner or later, due to changes in the technical environment caused by automatic updates of the server, languages and tools on which it depends, the DSE will break. That usually takes about one to two years. That's fine if there is still money to fix it, but more than likely the technicians who created it have moved on, the money has run out, and pretty soon that particular DSE will die. This vicious life-cycle of the typical DSE has already played itself out on the Web countless times, and everybody knows it.</p>
<h3>But XML will save us, won't it?</h3>
<p>I hear people saying: 'but XML will save us. It is a permanent storage format that transcends the fragility of the software that gives it life.' Firstly, XML just puts those special data structures that the software needs to work into a form that people (I mean programmers) can read. It doesn't change anything of the above. Yes, XML itself as a metalanguage is interoperable with all the XML tools, but the languages that are defined using it, if not <em>rigidly</em> standardised, are as numberless as the grains of sand on the beach, and are just as tied to the software that animates them as earlier binary formats.</p>
<h3>Escaping the vicious cycle</h3>
<p>Secondly, a digital scholarly edition is much more than a collection of transcriptions and images. There is simply no way to make the technological part of a DSE interoperable between disparate systems. But what we can do is make the content of a DSE portable between systems that run different databases and to some extent different content management systems. Also, this gives the DSE a life outside its original software environment and leaves behind something for future researchers. What we have recently done in the <a href="http://www.austese.net/">AustESE project</a> is to create a portable digital edition format (PDEF) which encapsulates the 'digital scholarly edition' in a single ZIP file containing:</p>
<ol>
<li>The source documents of each version in TEXT and MVD (multi-version document) formats. The MVD format makes it easy to move documents between installations of the AustESE system, and the TEXT files record exactly the same data in a more or less timeless form.</li>
<li>The markup associated with the text. This is in the form of JSON (JavaScript Object Notation), which is now supplanting XML for the transmission and storage of data in many web applications. Several layers of markup are possible for each text version, and these can be combined to produce Web pages on demand.</li>
<li>The formats of the marked-up text. These define, using CSS, different renditions of the combined text+markup.</li>
<li>The images that may be referred to by the markup.</li>
<li>(In future) the annotations about the text which are common to all the versions to which they apply.</li>
</ol>
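<p>As a rough illustration of the idea, here is how such a self-contained archive could be packed and unpacked with a generic ZIP library. All the paths and the JSON markup shape below are invented for illustration; the actual PDEF layout is defined by the AustESE tools:</p>

```python
import io
import json
import zipfile

# Hypothetical contents of a tiny edition: one version's plain text,
# one standoff-markup layer in JSON, and one CSS format.
edition = {
    "mvd/poem.txt": "Words are like leaves...\n",
    "markup/poem-default.json": json.dumps({
        "ranges": [{"name": "line", "reloff": 0, "len": 24}]
    }),
    "formats/default.css": ".line { display: block }",
}

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
    for path, content in edition.items():
        z.writestr(path, content)

# Another installation unpacks the same archive 'out of the box'.
with zipfile.ZipFile(buf) as z:
    print(sorted(z.namelist()))
```

<p>The point is that everything the edition needs — text, markup and formats — travels in one file, independent of any particular database or content management system.</p>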
<p>This allows something I don't think anyone else can do yet: namely, to download a DSE, send it to another installation, and have them upload it so that it works 'out of the box'. We need to think in those terms if we are to get beyond the experimental stage in which many digital scholarly editions currently seem stuck. Otherwise we run the risk of becoming an irrelevance in the face of massive and simplistic attempts to digitise our textual cultural heritage by Gutenberg and Google. We need much more than what these services are offering. We need a space in which we can play out on the Web the timeless activities of editing, annotation and research into texts – what we call 'scholarship'. The only way to do that is to have 'a thing' that we call a 'digital scholarly edition'.</p>
<h2>Faster than a speeding bullet (2013-05-18)</h2>
<p>I have just completed the move from one NoSQL database (CouchDB) to another (MongoDB). The speed increase is exhilarating, though I haven't measured it precisely. The comparisons are not cached; they are computed on the fly, and the speed is the fruit of good design and careful implementation. As I said all along, conventional approaches based on XML are just too inefficient and inadequate. This design, based on multi-version documents and hosted on a Jetty/Java service with a C core, is faster than any other such service I know of, and it is more flexible. There is also the new C version of nmerge yet to add, which should increase capacity and speed further.
Try out <a href="http://austese.net/calliope/tests/">the test interface</a> for yourself.</p>
<h2>Multi-encoding text-merging (2013-02-24)</h2>
<p>MVDs now support any text encoding. I've used ICU, the IBM-supplied library for conversions to and from Unicode, and it's very good. Texts may now be added to an MVD from any textual source; all you do is specify the encoding. Internally, merging is now done on 16-bit UTF-16, no longer on 8-bit UTF-8. I don't believe any rival text-comparison programs can do this. It is a big improvement over the previous version of nmerge.</p>
<h2>Comparing Chinese and Bengali texts (2013-02-05)</h2>
<p>Extending multi-version documents (MVDs) to properly support languages like Chinese and Bengali, whose characters need more than one byte, turns out to be easier than I thought. Currently the nmerge tool, which produces MVDs, works only with 8-bit bytes internally, so that individual characters may be split over several bytes, as in UTF-8 encoding. Things get complicated whenever differences are detected between parts of characters. Making everything 16-bit will facilitate the comparison of texts in <em>any living</em> language and avoid such complications (unless you want to compare dead languages like Phoenician or Lydian, and even then UTF-16 can encode them).
I don't have any Chinese examples, but my friends in India have provided me with some interesting Bengali texts, which I'll be using for testing.</p>
<h2>Hritserver 0.2.0 released (2013-01-08)</h2>
<p>I've made an early release of hritserver 0.2.0. The version number reflects my rough feeling that this is about 20% finished. That doesn't sound like much, but most of the unfinished part is in the services it will eventually perform. The basic infrastructure of hritserver -- the merging, formatting and importing facilities -- is closer to 90% complete.</p>
<p>There's a mixed import dialog in this release that should be able to import any TEI-Lite document. The import process can be configured in several ways, or you can just follow the defaults. The result is always supposed to "work". The stages for XML files are:</p>
<ol>
<li>XSLT <em>transform</em> of XML sources. The default transform fixes some anomalies in the TEI data model that make it hard to convert it into HTML. Or you can substitute your own stylesheet to do anything you like. In TEI-Lite this step also splits the input into the main text and any embedded &lt;note&gt;s and &lt;interp&gt;s.</li>
<li>The versions within each XML file (add/del, sic/corr, abbrev/expan, app/rdg etc.) are all <em>split</em> into separate files. This is done safely by first splitting the document into a variant graph wherever a valid splittable tag is found and then the graph is written out as N separate files.</li>
<li>The individual files are <em>stripped</em> into their remaining markup and the plain text. The markup may be in several files, such as a separate one for page divisions.</li>
<li>The markup files and the plain text files are <em>merged</em> into CorCode and CorText multi-version documents.</li>
<li>The CorCodes and the CorTexts are then <em>stored</em> in the database.</li></ol>
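<p>Stage 2, the <em>split</em>, can be illustrated with a toy fragment: from a TEI-like paragraph containing an add/del pair, extract each version's plain text. This is only a sketch under stated assumptions — the function name is invented, Python's standard XML parser stands in for the real importer, and the actual system builds a variant graph and handles nesting, sic/corr and the rest:</p>

```python
import xml.etree.ElementTree as ET

def version_text(xml_src, keep):
    """Extract one version's plain text from a tiny TEI-like fragment:
    keep='del' yields the earlier state, keep='add' the later one."""
    root = ET.fromstring(xml_src)
    drop = {"add", "del"} - {keep}
    parts = []
    def walk(el):
        if el.tag in drop:
            # skip this reading, but keep the text that follows it
            if el.tail:
                parts.append(el.tail)
            return
        if el.text:
            parts.append(el.text)
        for child in el:
            walk(child)
        if el.tail:
            parts.append(el.tail)
    walk(root)
    return "".join(parts)

src = "<p>the <del>old</del><add>new</add> text</p>"
print(version_text(src, "del"))  # the old text
print(version_text(src, "add"))  # the new text
```

<p>Each extracted version then goes through the strip and merge stages as plain text plus standoff markup.</p>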
<p>Imported files should then appear in the Home tab of the Test interface.</p>
<h3>Installation</h3>
<p>You can try downloading version 0.2.0 if you use a Mac; I'll get around to supporting other platforms presently. To download it you should use git. If you don't have git you can install it easily via Homebrew:</p>
<p><code>brew install git</code></p>
<p>Then download the latest hritserver code:</p>
<p><code>git clone https://github.com/HRIT-Infrastructure/hritserver.git</code></p>
<p>That creates a folder "hritserver" in that directory. Then you should
run the installer:</p>
<p><code>cd hritserver<br>
sudo ./install-macosx.sh</code></p>
<p>And it should work. This version will be tested and gradually improved. The advantage of using git is that you can easily update to the latest version by typing</p>
<p><code>git pull</code></p>
<p>in the hritserver directory.</p>desmondhttp://www.blogger.com/profile/01722159590093138289noreply@blogger.com0tag:blogger.com,1999:blog-4555078640999654611.post-45952393935521507312012-12-01T12:48:00.000-08:002012-12-02T15:29:43.234-08:00A critical edition view for the Web<p>Critical editions, with their explanatory notes at the foot of the page, 'the apparatus', which described the variants of the version printed as the 'main' text above it were very popular in print editions of classical authors and scholarly editions of early printed books, from the late 18th century on. But in the digital medium this view, so dependent on print for its effectiveness, has struggled to make an impact, other than to evoke the response that this is a fake or <a href="http://etjanst.hb.se/bhs/ith/4-00/md.htm">copy of a print edition on screen</a>. The apparatus showed a cross-section of versions that differ from the printed base text, but the connection between the two was via line-numbers. Unfortunately this does not work well on screen. Since the text is not divided into pages the line-numbers would get too large. In any case the idiom of the screen is based on clicks and highlighting rather than numbering. In prose texts, unless the reader were satisfied with the line-breaks of a particular print edition (and its hyphenation), numbering the lines would not even be possible. Another problem is that an auto-generated (versus hand-edited) apparatus is either too detailed to be readable or is filtered, so that important small variants may be lost. A more natural way of displaying variation on screen has to be found.</p>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhHVj_LJ81Sl8b3IgRpHAhHu2aBrzQxytasiPchBsIi_47c8WpVOLHZ-ckTkPws8qLtmYWaWFuEtfhnKWZ0maI3OPs972Fey5wF9e2qp0o5Y5WCerLK5OEKQGJ5vG2qNBMeFOUIf4lZEitA/s1600/murray.jpg" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="400" width="252" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhHVj_LJ81Sl8b3IgRpHAhHu2aBrzQxytasiPchBsIi_47c8WpVOLHZ-ckTkPws8qLtmYWaWFuEtfhnKWZ0maI3OPs972Fey5wF9e2qp0o5Y5WCerLK5OEKQGJ5vG2qNBMeFOUIf4lZEitA/s400/murray.jpg" /></a></div>
<h3>Side-by-side</h3>
<p>On screen, this is the most popular view for displaying multiple versions, e.g. the <a href="http://www.hrionline.ac.uk/onlinefroissart/apparatus.jsp?type=context&context=navigate">Online Froissart</a> and <a href="http://juxtacommons.org/shares/yY6Wrj/sidebyside?docs=46,47&top=0.00000">Juxta Commons</a>. Usually just two versions are shown, and the user can choose which two to compare. Some versions of this view allow the user to choose multiple versions for display, but it gets a bit confusing after two because only the differences between each pane and one designated base can be shown. But fundamentally what is missing in this view is an ability to show all (or a subset of) versions in one go.</p>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_r7NQAOjR1oR7NbRP88lDPkVhc4Wjjtm24k-nEjw70BZeA66i647_aeBTGw9nM9OzplPCVDNYTgAJ7NELF0YDAWQiGC4UFh11JYmeGNwfr6fn8mlWZyGDcWs5b8qbDHdkVLG5TNqoBaGo/s1600/twin.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="253" width="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_r7NQAOjR1oR7NbRP88lDPkVhc4Wjjtm24k-nEjw70BZeA66i647_aeBTGw9nM9OzplPCVDNYTgAJ7NELF0YDAWQiGC4UFh11JYmeGNwfr6fn8mlWZyGDcWs5b8qbDHdkVLG5TNqoBaGo/s400/twin.png" /></a></div>
<h3>Table view</h3>
<p>Another way to display variants is in a table. This aligns all the variants <a href="http://austese.net/tests/table">one above the other wherever they correspond</a>. Editors love this view, because it resembles the apparatus of the critical edition, but there are only a few examples on the Web, usually as <a href="http://cervantes.tamu.edu/V2/CPI/variorum/index.htm">a window that floats above the text and so obscures it.</a></p>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgskaOjM6J0zZsyUl68n0LN4hv5h-raSmOnc1TKivF08EaO19h0ab5sKT3AO3xlc9WT74Hr3wBukDBvq_roLbQSVjdetO9iJP7At-Ax68gE_i85EOzcJKQCJGMoxxq4dVyh0p3DOd5p5iGw/s1600/table.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="148" width="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgskaOjM6J0zZsyUl68n0LN4hv5h-raSmOnc1TKivF08EaO19h0ab5sKT3AO3xlc9WT74Hr3wBukDBvq_roLbQSVjdetO9iJP7At-Ax68gE_i85EOzcJKQCJGMoxxq4dVyh0p3DOd5p5iGw/s400/table.png" /></a></div>
<h3>The 'killer app'</h3>
<p>It seems obvious to combine table view with a view containing one chosen version at a time, to act as the base. It is also fairly obvious that the table should scroll <em>horizontally</em> in sync with the main text. Our project supervisor called this, a bit hyperbolically, 'a killer app'. It may be observed, however, that:
<ol>
<li>Using table view in this way has never been done before.</li>
<li>Table view is better adapted to the digital medium and the screen than the apparatus.</li>
<li>It is easy to use because the horizontal scrolling of the table matches the vertical scrolling of the main text, and the highlighting shows what's in sync with what.</li>
<li>The table can be collapsed out of sight or replaced by a set of options that change the appearance and size of the table (and the number of versions shown).</li>
<li>The blend of the two views seems so natural that I would venture to suggest that it could be a digital replacement for the print critical edition.</li></ol>
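<p>The scroll synchronisation in point 3 reduces to a simple proportional mapping. The pure function below is an illustrative sketch, not the prototype's actual code; the DOM event wiring is omitted:</p>

```javascript
// Map the base text's vertical scroll position onto a horizontal offset
// for the variant table, so the two panes stay in sync proportionally.
// scrollRange = scrollHeight - clientHeight of the text pane;
// tableRange  = scrollWidth - clientWidth of the table.
function tableScrollLeft(scrollTop, scrollRange, tableRange) {
  if (scrollRange <= 0) return 0;
  return (scrollTop / scrollRange) * tableRange;
}
```

<p>Hooked to the text pane's scroll event, this sets the table's scrollLeft whenever the main text scrolls, and the highlighting then shows what is in sync with what.</p>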
<p>Check out <a href="http://austese.net/slider.html">the first working prototype.</a> There are still some things to be done, such as:
<ol><li>Making the sigla stick to the left of the scrolling table instead of scrolling away out of view</li>
<li>Integrating this prototype into <a href="http://austese.net/tests">the main test panel</a> so that the options work.</li>
<li>Adding a popup menu to select a new base version</li></ol>
But as it stands you can still 'get the picture'. It seems to have potential.</p>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjK6-DxKBZDyv8xPV373AVrKy7J61Mm0VaAeJyNDaBUxMSkGbPV-DDxVH42k4UYiK8L3W2UwF2BCU8qYGXyK7EPbXunD7c3yOXxAr-4B_TmUm5gDwWuxKMIUaZ1-eQw2daGVyNeqp3DR3HX/s1600/crited.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="400" width="272" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjK6-DxKBZDyv8xPV373AVrKy7J61Mm0VaAeJyNDaBUxMSkGbPV-DDxVH42k4UYiK8L3W2UwF2BCU8qYGXyK7EPbXunD7c3yOXxAr-4B_TmUm5gDwWuxKMIUaZ1-eQw2daGVyNeqp3DR3HX/s400/crited.png" /></a></div>
desmondhttp://www.blogger.com/profile/01722159590093138289noreply@blogger.com2tag:blogger.com,1999:blog-4555078640999654611.post-32003236265152346932012-10-23T16:25:00.003-07:002012-10-23T16:57:43.653-07:00Automating word and line selection<p>TILT has now been repackaged as an <a href="http://austese.net/tests/tilt">applet</a>. It can auto-recognise words and lines. The user selects the word or line tool and then clicks on a likely candidate to select it. Also shown are the black and white image (which TILT actually uses to do word and line-recognition), and the view showing the detected baselines.</p>
<p>One thing this does show is that an applet is an excellent form for this application. It is based on a stable and powerful software platform (Java), is delivered over the Web, and facilitates both rapid development and user feedback. Although still far from finished, it is surprising what has been possible in just two weeks.</p>
<h3>Word-detection</h3>
<p>This works by assessing a small square around the click-point. If this is darker than average it is added to the selection; otherwise it is subdivided into four smaller squares and each is reassessed. If either the big square or one of the small squares is accepted, then the four big squares to the north, south, east and west are assessed similarly. Word boundaries are thus rapidly identified.</p>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEizgdSPDIiBSluefgKRmDldcQKK-ysE2eJMvMVFBBpy8hrWL5qb5Rd2zAMHBUrZ624ggPnAgTco_MDGX01Vzoif3SYBdoGb12cAo8jslpHLOL2VVfpWoAHok_xzm_nf6jXzrljFmNtVU1IQ/s1600/quando.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="191" width="269" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEizgdSPDIiBSluefgKRmDldcQKK-ysE2eJMvMVFBBpy8hrWL5qb5Rd2zAMHBUrZ624ggPnAgTco_MDGX01Vzoif3SYBdoGb12cAo8jslpHLOL2VVfpWoAHok_xzm_nf6jXzrljFmNtVU1IQ/s400/quando.png" /></a></div>
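<p>The square-assessment idea can be sketched as a breadth-first search over fixed-size squares. This is an illustrative simplification (it omits the subdivision of rejected squares); the image is assumed to be a 2D array of 0/1 pixels, 1 meaning dark:</p>

```javascript
// Fraction of dark pixels in the sq×sq block with top-left corner (x, y).
function density(img, x, y, sq) {
  let dark = 0, total = 0;
  for (let j = y; j < Math.min(y + sq, img.length); j++)
    for (let i = x; i < Math.min(x + sq, img[j].length); i++) {
      dark += img[j][i];
      total++;
    }
  return total ? dark / total : 0;
}

// Grow a word selection outwards from the clicked pixel: accept a square
// if it is darker than the page average, then assess its four neighbours.
function detectWord(img, clickX, clickY, sq, pageAvg) {
  const key = (x, y) => x + ',' + y;
  const start = [Math.floor(clickX / sq) * sq, Math.floor(clickY / sq) * sq];
  const seen = new Set([key(...start)]);
  const queue = [start], accepted = [];
  while (queue.length) {
    const [x, y] = queue.shift();
    if (density(img, x, y, sq) <= pageAvg) continue; // lighter than average
    accepted.push([x, y]);
    for (const [dx, dy] of [[sq, 0], [-sq, 0], [0, sq], [0, -sq]]) {
      const nx = x + dx, ny = y + dy;
      if (nx < 0 || ny < 0 || seen.has(key(nx, ny))) continue;
      seen.add(key(nx, ny));
      queue.push([nx, ny]);
    }
  }
  return accepted; // top-left corners of the squares covering the word
}
```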
<h3>Line detection</h3>
<p>This works by counting up the concentration of black pixels in each row of the image. Anything below the average intensity of the page is ignored. The peaks are mostly baselines.</p>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjMSNT2NgcIrNaAV_U4Bhm_TFqxo7IHK-oQOnaI-sE7Q-QJxAFH8vGsqb7wp8pF09u578I7dFigffhGLiaCf4hqUQ66fTA3HreImlBmxpTbXkaIVRlX3_mk0lBUo8XyoJ4qDN9qoofvTSNN/s1600/verticals.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="151" width="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjMSNT2NgcIrNaAV_U4Bhm_TFqxo7IHK-oQOnaI-sE7Q-QJxAFH8vGsqb7wp8pF09u578I7dFigffhGLiaCf4hqUQ66fTA3HreImlBmxpTbXkaIVRlX3_mk0lBUo8XyoJ4qDN9qoofvTSNN/s400/verticals.png" /></a></div>
<p> Although this simple algorithm won't work for slanted or curved lines, I will later subdivide the image into narrow columns and join up the detected baselines. This should overcome the limitation.</p>
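<p>The row-histogram approach is easy to sketch: count the dark pixels in each pixel row, ignore rows below the page average, and keep the local maxima. An illustrative version, not the applet's code:</p>

```javascript
// Candidate baselines: rows whose dark-pixel count beats the page average
// and is a local maximum relative to its neighbouring rows.
function detectBaselines(img) {
  const counts = img.map(row => row.reduce((a, b) => a + b, 0));
  const avg = counts.reduce((a, b) => a + b, 0) / counts.length;
  const baselines = [];
  for (let y = 0; y < counts.length; y++) {
    if (counts[y] <= avg) continue; // below average intensity: ignore
    const peak =
      (y === 0 || counts[y] >= counts[y - 1]) &&
      (y === counts.length - 1 || counts[y] >= counts[y + 1]);
    if (peak) baselines.push(y);
  }
  return baselines;
}
```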
<p>There are still a lot of things it doesn't do yet, such as create actual links, or store the results. But, hey, one step at a time.</p>desmondhttp://www.blogger.com/profile/01722159590093138289noreply@blogger.com1tag:blogger.com,1999:blog-4555078640999654611.post-67129410831247318762012-10-13T01:16:00.003-07:002012-10-14T14:38:51.995-07:00TILT (Text to Image Linking Tool)<p>Visualising links between image and text on the Web is one thing, but how do you create them in the first place? Unless you have a lot of spare time on your hands and don't get bored easily, you may find the available tools aren't up to much. For example, the Harpur poems come in the form of manuscript anthologies, 70-100 pages in length. Even if you restricted yourself to just one link per line, it would take quite a while to define image maps using a tool like the <a href="http://gimp.open-source-solution.org/manual/plug-in-imagemap.html">Gimp image-map plugin</a>. One ordinary-looking manuscript page from the Harpur collection took me more than one hour, and that was without defining any links to the text. The British Library edition of the <a href="http://codexsinaiticus.org/en/">Codex Sinaiticus</a> looks good, but cost millions of dollars to make (reportedly). There has to be a better way.</p>
<p>What I needed was a simple tool that would truly <em>facilitate</em> an essentially tedious task, and leave me with data in a form I could use. One clear requirement was for the definition and editing of polygonal areas, not just rectangles. Polygons are a must if anything that is not strictly rectilinear happens in a text, for example a correction, or even if the paper was slightly curved when it was photographed. Since the available manual methods were all much too slow, I decided to write my own tool.</p>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhT0kw-Ab3ZvHSEVy5EmecsjoZ2gk_ApKElnjfxIW2Ihl7PBmBmLvtM64MBacLSnn6QTpDxZQHdnXWSQI5sUOFPkCMdebmEUvt6v1UgfqIIEMLwEc_145XuRryFpaZWHfm9m1XdhNLqA5XM/s1600/silts.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="56" width="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhT0kw-Ab3ZvHSEVy5EmecsjoZ2gk_ApKElnjfxIW2Ihl7PBmBmLvtM64MBacLSnn6QTpDxZQHdnXWSQI5sUOFPkCMdebmEUvt6v1UgfqIIEMLwEc_145XuRryFpaZWHfm9m1XdhNLqA5XM/s400/silts.png" /></a></div>
<p align="center"><i>Try defining rectangles on a per-line basis here</i></p>
<p>There are two current tools that are freely available. One is the <a href="http://www.textgrid.de/en/beta/text-image-link-editor.html">Text-Image alignment tool in TextGrid</a>. Although it works quite well, it is entirely manual, offering no automation, and it is tied in with TextGrid, Eclipse and XML, none of which I wanted to use. <a href="http://mith.umd.edu/tile/">TILE</a>, on the other hand, is written in Javascript, which would allow it to be used over the Web, but it appears to be unfinished, and there is doubt that a program of this complexity can be built quickly enough, and made sufficiently stable and robust, using only web languages. So I decided to use good old Java. This gave me plenty of power, stability, and a single language to develop in. After a few days' coding I have nearly exceeded the capabilities of TILE. It can also be turned into an applet, so it can be used on the Web.</p>
<p>So far TILT can define rectangles and polygons on top of a resizable image. Either type of shape can be selected, deleted, moved, scaled and edited by dragging control points. The background image can also be reduced to black and white to facilitate automatic word and line selection. This should be easy to achieve with the image-handling facilities of Java.</p>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjqehK8OM-sChjnPyq-69TeU2ZN9eSvQ596KxIl53dzsJ2fePfy4YPQIpNPzHKEDAhaT-2AK8oryrjFzxmXtstQU_Qz1UCUMnu6HvJ_rzAquqNJ9UO7RRyyS-5RO1t-SN30TohnFBA6zpaM/s1600/tilt.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="232" width="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjqehK8OM-sChjnPyq-69TeU2ZN9eSvQ596KxIl53dzsJ2fePfy4YPQIpNPzHKEDAhaT-2AK8oryrjFzxmXtstQU_Qz1UCUMnu6HvJ_rzAquqNJ9UO7RRyyS-5RO1t-SN30TohnFBA6zpaM/s400/tilt.png" /></a></div>
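<p>Selecting a polygonal shape needs a hit test, and the standard ray-casting point-in-polygon algorithm is all that is required. A sketch of the technique, not TILT's actual code:</p>

```javascript
// Ray-casting hit test: cast a horizontal ray from pt and count how many
// polygon edges it crosses; an odd count means the point is inside.
// pt is [x, y]; poly is an array of [x, y] vertices.
function pointInPolygon(pt, poly) {
  let inside = false;
  for (let i = 0, j = poly.length - 1; i < poly.length; j = i++) {
    const [xi, yi] = poly[i], [xj, yj] = poly[j];
    if ((yi > pt[1]) !== (yj > pt[1]) &&
        pt[0] < ((xj - xi) * (pt[1] - yi)) / (yj - yi) + xi)
      inside = !inside;
  }
  return inside;
}
```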
<p>I also have to add linking with the text, saving in HTML and JSON formats, and uploading to the server. But all of these are easy to do.</p>desmondhttp://www.blogger.com/profile/01722159590093138289noreply@blogger.com2tag:blogger.com,1999:blog-4555078640999654611.post-27810525845056389732012-10-06T11:27:00.001-07:002012-10-06T13:54:45.714-07:00Text and Facsimile Side-by-Side<p>One of the most common requests you hear about in digital humanities is to display a facsimile next to its transcription. The <a href="http://mith.umd.edu/tile/">TILE project</a> is one example, another was developed some years back, perhaps for the first time, by some guys at the <a href="http://digitalcommons.calpoly.edu/cgi/viewcontent.cgi?article=1100&context=csse_fac&sei-redir=1&referer=http%3A%2F%2Fwww.google.com.au%2Furl%3Fsa%3Dt%26rct%3Dj%26q%3Dimage%2520based%2520electronic%2520editions%26source%3Dweb%26cd%3D12%26ved%3D0CGUQFjAL%26url%3Dhttp%253A%252F%252Fdigitalcommons.calpoly.edu%252Fcgi%252Fviewcontent.cgi%253Farticle%253D1100%2526context%253Dcsse_fac%26ei%3Du4lwUKLgDKmViAeap4CoBQ%26usg%3DAFQjCNHIEKJO0MlrLALJmGz0mWQhsoCPKw%26sig2%3DAruzlz4u5Ls89wUG4rseFQ#search=%22image%20based%20electronic%20editions%22">University of Kentucky</a>, for their electronic Beowulf edition. As soon as you think about putting a facsimile next to its transcription you see the need to link areas of the image to spans of text in the transcription. Sometimes the facsimile is hard to read, and it only becomes useful if someone has already gone over it to link those areas and spans to help the reader to work out what corresponds to what.
This gives rise to two main problems:</p>
<ol><li>How to provide an environment to define those links and save them</li>
<li>How to display this information and update it as the user scrolls through the text and zooms in on the image</li></ol>
<p>Both solutions must work over the Web, because that is the medium everyone wants to work in. TILE dealt with problem 1. The Beowulf edition was mostly about 2, I think (you have to buy the DVD to find out). We are tackling both problems as part of our AustESE project - to provide an editing environment for electronic scholarly editions. Since someone else is tackling problem 1 I'll be talking about my solution to problem 2. I'd be curious to know if anyone else has already developed something similar.</p>
<h3>How it works</h3>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhHzRK_l_ww0CO9WErlg1_DbkqI8vfWRAftFTPdbJL9h6AUlWqxINhOiPE_UY2H47Dbe-O9H5FepVwpCI4f1qYaTAu2tZr3rFSvelAGt1qHGHp0KKEfrLXFcaq-nXp3iEtI0qjbutT_im4Y/s1600/spans.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="81" width="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhHzRK_l_ww0CO9WErlg1_DbkqI8vfWRAftFTPdbJL9h6AUlWqxINhOiPE_UY2H47Dbe-O9H5FepVwpCI4f1qYaTAu2tZr3rFSvelAGt1qHGHp0KKEfrLXFcaq-nXp3iEtI0qjbutT_im4Y/s400/spans.png" /></a></div>
<p>This is a screen dump of the <a href="http://austese.net/tests/image">live demo on AustESE</a>. The user has already zoomed into an area of interest and has just moved the mouse over a polygonal area. The correct span of text on the right has been highlighted. This seems disarmingly simple but is quite hard to achieve in the medium of the Web. Years ago, when HTML was first developed, imagemaps were invented that allowed rectangular, circular and polygonal AREAs to be defined for an image. They can be defined, for example, in a tool like <a href="http://www.gimp.org/tutorials/Image_Map/">Gimp</a>. Mouse movements and mouse clicks could be detected through the maps to fire events that might, for example, be used to highlight parts of the transcription. What was not provided, however, was any way to style or graphically affect the image map. So to make the map more interactive we need to employ some fancy new Web tricks.</p>
<p>The first thing I found that highlights areas on an image map is a thing called <a href="http://davidlynch.org/projects/maphilight/docs/">maphighlight</a>. Unfortunately it doesn't do zooming, scaling or scrolling. Scaling is required to fit the image and its map onto the user's screen size. Zooming is used to drill down to details on an image that is unlikely to be readable on many devices otherwise. And scrolling is needed to move around the zoomed image. But the maphighlight designers did hit upon a cool workaround for the lame functionality of MAP, AREA and IMG in HTML:</p>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgigdfoMouEU_aOsL2035vr3a1EzcZbWFcBVkMIGPZdFP5-5L3cdN13PKmKI0IPXFOR4t_iRUGawjOzFg06wI1GJIDshwxGZl5-1IITab4YuF9GdVsYT4daVRmVCCxyPNTzj68PThUzsEEQ/s1600/layers.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="386" width="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgigdfoMouEU_aOsL2035vr3a1EzcZbWFcBVkMIGPZdFP5-5L3cdN13PKmKI0IPXFOR4t_iRUGawjOzFg06wI1GJIDshwxGZl5-1IITab4YuF9GdVsYT4daVRmVCCxyPNTzj68PThUzsEEQ/s400/layers.png" /></a></div>
<p>Their display is broken into three panes layered on top of each other. The top layer is the original image and its image map. The image is set to be invisible, so only the defined regions on the image are "visible" to the mouse and are activated as it moves over them. This is important because <a href="#FOOTNOTE">if the imagemap was behind the canvas the mouse events wouldn't fire and you'd get no interactivity*</a>. Looking through the image and its map the user can see a transparent canvas, which gets drawn to in response to the mouse movements. Behind it, <em>in the background</em> is the scaled and panned image itself, this time visible. What this needs to work is support for HTML5. Fortunately all modern browsers have the required features. To get the zooming and scrolling to work I had to modify and augment the existing maphighlight jQuery plugin, which I renamed <a href="http://github.com/schmidda/maphilite">maphilite</a>. There are some more features I want to add, but it basically works. I've tested it on the latest versions of Chrome, Firefox, Safari and Opera. I'm pretty sure it will work also in IE9.</p>
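<p>The bookkeeping that keeps the map registered with the zoomed, panned background is a simple affine mapping. Something along these lines (an illustrative sketch, not the plugin's actual code) rewrites each AREA's coords and the CSS background properties together:</p>

```javascript
// Scale an AREA's coords (x1,y1,x2,y2,...) to the current zoom and pan,
// so the invisible image map stays aligned with the visible background.
function scaleCoords(coords, zoom, panX, panY) {
  return coords.map((c, i) =>
    Math.round(c * zoom - (i % 2 === 0 ? panX : panY)));
}

// The matching CSS for the background layer: enlarge it with
// background-size and shift it with background-position.
function backgroundFor(natW, natH, zoom, panX, panY) {
  return {
    'background-size': natW * zoom + 'px ' + natH * zoom + 'px',
    'background-position': -panX + 'px ' + -panY + 'px'
  };
}
```

<p>Applying both on every zoom or pan keeps mouse events over the map firing on the right regions of the scaled image.</p>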
<p>Why not support older browsers? Firstly, it's just too hard. Secondly, anyone can install a free browser on their computer even if they currently use IE8 or less. Thirdly, in a few more months or years everyone will be using HTML5, so why bother?</p>
<p>I'd like to stress that this is just a demo at this stage. The next stage will be to create a bunch of image maps and define them in the HRIT system so they can be applied to images and transcriptions on the fly as the user scrolls through the text.</p>
<p><em><a name="FOOTNOTE">* I imagine some techies are asking 'Why not just use CSS z-indices to stack the three panes?' Because MAPs are bound to their images and are not separately stackable, that's why. Either you have the canvas in front of the image and you get no detection of mouse movements, or you can't see the canvas behind the image. Pointer-events might get around that, but they don't work in IE or Opera. As a CSS-defined background the image can be scrolled by use of background-position and scaled with background-size. Otherwise I don't see how you can get it to work.</a></em></p>desmondhttp://www.blogger.com/profile/01722159590093138289noreply@blogger.com0tag:blogger.com,1999:blog-4555078640999654611.post-82468508059146944922012-07-15T06:07:00.002-07:002012-07-15T06:07:55.617-07:00The Role of Markup in the Digital Humanities<p>My paper from the Cologne colloquium has been published in:</p>
<p><em>Historical and Social Research / Historische Sozialforschung</em> 37.3 (2012), 125-146.</p>
<p>It contains a fairly detailed description of my alternative to embedded XML markup as an overall system for representing texts in the digital humanities, and how interoperable software can be built upon that foundation. Since it is a subscription-only journal I can't publish the text here, but a free copy of that volume is reportedly being given to all attendees to the current Digital Humanities conference at Hamburg this year. So it should be read by a wide audience.</p>desmondhttp://www.blogger.com/profile/01722159590093138289noreply@blogger.com2tag:blogger.com,1999:blog-4555078640999654611.post-74380979788577396582012-07-11T22:02:00.000-07:002012-07-11T22:08:12.804-07:00Restricting Versions in Table View<p>One refinement suggested by my initial version of Table View was to restrict the number of versions for comparison to some subset of those available. This has the advantage of further improving the signal to noise ratio, and does so in a purely digital way.</p>
<p>To help the user select a subset of versions intuitively I used a dropdown menu with the selected versions marked by a × sign. (A tick mark can't be used because browsers already mark the currently selected item with a tick). This method is very compact. It always occupies the same space however many versions there are – something not achieved by the usual technique of a set of check boxes:</p>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhe9eMQm13swl5WhvlJU9_qWXp_BHUZAY8zk5aTZdwoEPSVwZedhskamQ3e0eoWiq6s7Se7sANp9VF-EvbdUAbpelnhOT2by4ZJuDjrsPXnyRsGavhsEubNBACLE-KlIzUf-wQ8iEq4pT7k/s1600/multi-version-select.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="79" width="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhe9eMQm13swl5WhvlJU9_qWXp_BHUZAY8zk5aTZdwoEPSVwZedhskamQ3e0eoWiq6s7Se7sANp9VF-EvbdUAbpelnhOT2by4ZJuDjrsPXnyRsGavhsEubNBACLE-KlIzUf-wQ8iEq4pT7k/s400/multi-version-select.png" /></a></div>desmondhttp://www.blogger.com/profile/01722159590093138289noreply@blogger.com0tag:blogger.com,1999:blog-4555078640999654611.post-36847899109517936392012-07-02T07:39:00.003-07:002012-07-02T07:44:28.350-07:00Table View<p>As we struggle to find ways to effectively
represent textual variation on screen, one of the persistent requests from various quarters has been for a table view: a hierarchical representation of the variants of a range within the chosen base text. This kind of view, for example, is used in the <a href="http://www.csdl.tamu.edu:8080/veri/coll-en.html">Cervantes hypertextual edition</a>, or <a href="http://gregor.middell.net/collatex/api/collate">CollateX</a>. Unlike the apparatus, which is a compact series of footnotes about variations in a text, table view shows variants in a strict rectilinear grid. Although variation is naturally overlapping in structure, not rectilinear or recursive, we can use such a format to clarify for the reader what is a variant of what <em>across a number of versions</em> – something side-by-side view cannot achieve.</p>
<h3>Full text</h3>
<p>One way to make table view work would be to show the text of all versions covering a particular range in the base text. Although this duplicates text between versions it is quite clear:</p>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjwhWoypLKPEe0CSOxoK_6Wz5DRLoeUMXuRmRJ5CTurVzdBSGhlgNtMXoXoX9iiOrnn1nPKjE7uBhcDYr4jPhQ9848ejyS6hqMwlpbxtD6na6T_jjrV74DFZMo99luvrmxwx_b8zYiucyWr/s1600/no-options.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjwhWoypLKPEe0CSOxoK_6Wz5DRLoeUMXuRmRJ5CTurVzdBSGhlgNtMXoXoX9iiOrnn1nPKjE7uBhcDYr4jPhQ9848ejyS6hqMwlpbxtD6na6T_jjrV74DFZMo99luvrmxwx_b8zYiucyWr/s400/no-options.png" /></a></div>
<p>The rectilinear grid is implemented as a simple table, which can be seen by turning on the table cell borders:</p>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8Ug87CFRQ7Sr9YmEXalrySPEb-H6pV6jOHUsOPcjIRgep79tfCBx9xwcKDE8e-sETiDwoBl5xxXfl4ZdK4gjRNpTFzV5oFzMZIGp2eTb28Z682kGiKCr8mIUhXtgMn5q2geF0CQy9gNEY/s1600/borders.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8Ug87CFRQ7Sr9YmEXalrySPEb-H6pV6jOHUsOPcjIRgep79tfCBx9xwcKDE8e-sETiDwoBl5xxXfl4ZdK4gjRNpTFzV5oFzMZIGp2eTb28Z682kGiKCr8mIUhXtgMn5q2geF0CQy9gNEY/s400/borders.png" /></a></div>
<p>This ensures that variants are vertically aligned, but since much of the text is the same, we might want to collapse the grid wherever the text is the same, and show only variants of the chosen base text above the line as highlighted alternatives:</p>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgNEDdtVP4_YgiPZ8eJwIaBlaPKfeNO-VhompX1kZ5pikfMU4JJRhOZqvBLxyL9wkfXLo9_VkDmZjWMMT5v_g6OsJv4XeG87dfM4hnqCSy0lDD_u9uWVlIirmWUknJ-z6ykuEupMUxPVsx4/s1600/hide-merged.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgNEDdtVP4_YgiPZ8eJwIaBlaPKfeNO-VhompX1kZ5pikfMU4JJRhOZqvBLxyL9wkfXLo9_VkDmZjWMMT5v_g6OsJv4XeG87dfM4hnqCSy0lDD_u9uWVlIirmWUknJ-z6ykuEupMUxPVsx4/s400/hide-merged.png" /></a></div>
<p>This reduces clutter, but introduces another problem: the context of part-word variants is now removed and they may be regarded as unreadable. Extending them to the nearest word-boundary overcomes this:</p>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEha0AK6nSGV6CoASFdQRdJC17VoaQGYzLDSKIXYZEFkCQlQrFykTnRZdaYzBHeLMRJoDA1tfIlj3OdR3x_GFIUBpHQBoRXwgyFqlt_YRmgoxVQSI2kDctf0e_qIV2CSUgHI2xuTE11fIfQr/s1600/whole-words.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEha0AK6nSGV6CoASFdQRdJC17VoaQGYzLDSKIXYZEFkCQlQrFykTnRZdaYzBHeLMRJoDA1tfIlj3OdR3x_GFIUBpHQBoRXwgyFqlt_YRmgoxVQSI2kDctf0e_qIV2CSUgHI2xuTE11fIfQr/s400/whole-words.png" /></a></div>
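<p>The word-boundary expansion is a two-pointer scan over the base text. A minimal illustrative sketch:</p>

```javascript
// Widen a variant's [start, end) range until both ends sit on whitespace
// or the edge of the text, so part-word variants remain readable.
function expandToWords(text, start, end) {
  while (start > 0 && !/\s/.test(text[start - 1])) start--;
  while (end < text.length && !/\s/.test(text[end])) end++;
  return [start, end];
}
```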
<p>What this view highlights is another need: many versions are almost the same. For example, in the Shakespeare example, Q1 and Q2 are practically the same, just like F1-F4. The differences are only minor punctuation changes. Collapsing these further introduces nested tables of variants that are best hidden from the reader:</p>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjwqKsJ4ho4M2ghegog_s_sdU2JNuW8EBFWYXX9QJnTo9asoLg1oiT1Ds16N-F9C50TlxQQ0eDYBcKI_Lau9QgnevQqKQceckskjBNjuxTTdRsq062PzpLfd73dzUS90ujoMHXTZdF-4DNb/s1600/compact-collapsed.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjwqKsJ4ho4M2ghegog_s_sdU2JNuW8EBFWYXX9QJnTo9asoLg1oiT1Ds16N-F9C50TlxQQ0eDYBcKI_Lau9QgnevQqKQceckskjBNjuxTTdRsq062PzpLfd73dzUS90ujoMHXTZdF-4DNb/s400/compact-collapsed.png" /></a></div>
<p> The underlined text-ranges can be expanded by clicking on them, and the same action collapses them again. In the expanded form the sigla are displayed as a guide to the reader:</p>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEghuhtg-xMDBoZHACiSPmCZZqyt3tdDwc9qV5MFkP_naZSrUbtEJdxha26gVAeIHWuLvhA_B7KbSE30I71YtEk0MYMHeZkmN0arvFJuRXJCUaN_99HyMZLTjU1_b_G-BlaS9EdZ8e3Xjtys/s1600/compact-expanded.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEghuhtg-xMDBoZHACiSPmCZZqyt3tdDwc9qV5MFkP_naZSrUbtEJdxha26gVAeIHWuLvhA_B7KbSE30I71YtEk0MYMHeZkmN0arvFJuRXJCUaN_99HyMZLTjU1_b_G-BlaS9EdZ8e3Xjtys/s400/compact-expanded.png" /></a></div>
<h3>How this table view differs from the others</h3>
<p>This table view differs from other attempts in two key ways:</p>
<ol><li>It is generated directly from a merged multi-version text, not from a collation of many separate texts</li>
<li>It has three combinable options: 1) expansion to word-boundaries, 2) hiding merged text and 3) collapsing minor variants into sub-tables. These may be combined where desired to produce different effects.</li></ol>
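<p>Point 1 is what makes generation simple. If the merged text is an ordered list of segments, each tagged with the set of versions that share it, the table falls out directly: one row per version, one column per segment. An illustrative sketch of the idea, not the actual multi-version document code:</p>

```javascript
// Build table rows from a merged multi-version text: each segment carries
// the versions that contain it; a version's row shows its text per column.
function tableRows(segments, versions) {
  return versions.map(v =>
    segments.map(s => (s.versions.includes(v) ? s.text : '')));
}
```

<p>Hiding merged text then simply means suppressing any column whose segment belongs to every version.</p>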
<p>The tables are generated as simple HTML. The cross-browser Javascript and CSS required to animate and format them may also be generated optionally. </p>desmondhttp://www.blogger.com/profile/01722159590093138289noreply@blogger.com0tag:blogger.com,1999:blog-4555078640999654611.post-37730745494499458922012-05-15T15:43:00.002-07:002012-05-15T18:39:30.846-07:00Anthologies<p>One of the curious problems we face is anthologies, which represent revisions and collections of works whose order is often rearranged.</p>
<p>For example, Charles Harpur produced several collections of poetry, and often the same works appear in multiple anthologies in altered form, with further corrections. The complexity of this natural arrangement of data is not appreciated or even recognised by most technicians who have undertaken to put such material online. Usually they just treat each anthology as a separate document, transcribe only the final version, catalogue it with metadata, and leave the user to find his or her way around the vast collection. The key problem of interrelating such information is left as a conundrum to be solved manually by the user.</p>
<p>The technique I used to overcome this is to split up each anthology into separate poems and then write out an anthology file that contains links to the individual works. In this way merging differences between works is easy, and the user can still get a feel for how each anthology looked. A case in point is the poem 'Eden Lost', which appears in MS A88 and also MS C376. Apart from inevitable differences in wording and punctuation there is also an extra stanza in the C376 version. The reader needs to know what those similarities and differences are, not by clumsily comparing anthology with anthology on screen, that is, separate document with separate document, but by visualising differences between versions of the same poetic work interactively. Since I am receiving transcriptions of entire anthologies I have written a splitting program that stores them in such a way that subsequent anthologies will place versions of the same poem into the same folder. Although this is an edition-specific program, it is worth considering as a general approach for handling any collection of anthologies or similar material:</p>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEheOr4i_Dh5moO1hQH0nkq1WRzhfmlWfWoJYOLnUBvt-w30qepInwpKXjJQ8h4VYHpH5Yp6gmgJZnvNYS00L2K-oK-Z_45mvOyafED1bpujbTGg9dWbDcrB0e5tZ5aAFFJrQZRFTl7NcR5d/s1600/anthologies.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="242" width="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEheOr4i_Dh5moO1hQH0nkq1WRzhfmlWfWoJYOLnUBvt-w30qepInwpKXjJQ8h4VYHpH5Yp6gmgJZnvNYS00L2K-oK-Z_45mvOyafED1bpujbTGg9dWbDcrB0e5tZ5aAFFJrQZRFTl7NcR5d/s400/anthologies.png" /></a></div>
<p>The 'versions' file stores all successive versions, that is, the names and descriptions of each anthology. The Poem folders, which are very numerous, each contain the sources of the poem's versions, to facilitate importing. The 'anthologies' folder contains files with links to the poems. These are documents which have the same status as the poems themselves, and also appear in the catalog, although they only have one version. The next step is to write a script that uses the import facility developed in the previous-but-one post to automate importation. This speeds up the process of building a website of archival material considerably.</p>
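<p>A minimal sketch of this splitting step (in Python; the folder layout follows the figure above, but all names here are illustrative – the real program is edition-specific):</p>

```python
import os
import re

def slug(title):
    """Normalise a poem title so that versions of the same poem
    from different anthologies land in the same folder."""
    return re.sub(r'[^a-z0-9]+', '-', title.lower()).strip('-')

def split_anthology(name, poems, root):
    """Split one anthology, given as (title, text) pairs, into
    poems/<slug>/<anthology>.txt files, then write an anthology
    file that simply links to the individual poems."""
    links = []
    for title, text in poems:
        folder = os.path.join(root, 'poems', slug(title))
        os.makedirs(folder, exist_ok=True)
        with open(os.path.join(folder, name + '.txt'), 'w') as f:
            f.write(text)
        links.append('poems/' + slug(title))
    os.makedirs(os.path.join(root, 'anthologies'), exist_ok=True)
    with open(os.path.join(root, 'anthologies', name + '.txt'), 'w') as f:
        f.write('\n'.join(links))
```

<p>Run over MS A88 and then MS C376, the two versions of 'Eden Lost' end up side by side in one folder, ready for merging, while each anthology file preserves the original ordering.</p>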
<p>In fact, it is beginning to look as if I will need to create a separate program on the desktop to manage online archives: to back up, upload, import, export and test whether they are still up. That's not such a trivial task, and is only clumsily served by my rapidly growing collection of scripts.</p>desmondhttp://www.blogger.com/profile/01722159590093138289noreply@blogger.com0tag:blogger.com,1999:blog-4555078640999654611.post-17804361139254667142012-05-06T15:32:00.002-07:002012-05-08T04:48:25.650-07:00Useless elements in TEI<p>Writing a program to separate out versions in TEI (<a href="http://www.tei-c.org">Text Encoding Initiative</a>) encoded documents reveals some surprising facts about the latest tags added to that now vast scheme. Real-world texts such as manuscripts written by their authors (holographs) contain lots of corrections and may exist in the form of separate physical drafts. In recording variation in such texts, what functions do &lt;choice&gt;, &lt;app&gt;, &lt;subst&gt; and &lt;mod&gt; actually perform? By design they are supposed to group various kinds of alternatives, but functionally speaking they serve no purpose. You could leave them out and the encoded text would record <em>exactly the same information</em>.</p>
<p>Admittedly there is a human factor here. Humans want things spelled out clearly, and tags like &lt;subst&gt; make it clearer – or do they? Since &lt;choice&gt;, &lt;subst&gt; and &lt;mod&gt; are new tags in version 5, not present in version 4, one wonders how confused people were back in the good old days. Now perhaps they are confused even more by the addition of extra tags that obscure the text and serve no functional purpose whatsoever. You might think that &lt;app&gt; (apparatus entry) groups together a set of readings in parallel, but since each successive &lt;rdg&gt; (reading) carries a "wit" attribute that spells out which versions it represents, &lt;app&gt; is left with nothing to do.</p>
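<p>The redundancy is easy to demonstrate. The sketch below (Python, standard library only; a toy reader, not a TEI processor) extracts the uncorrected and corrected states of a passage from its &lt;add&gt; and &lt;del&gt; elements alone – wrapping the pair in &lt;subst&gt; changes the output not at all:</p>

```python
import xml.etree.ElementTree as ET

def version(xml, keep):
    """Extract one state of a lightly encoded passage.
    keep='del' yields the uncorrected text, keep='add' the corrected one.
    A <subst> wrapper is walked through like any other element, which is
    the point: removing it changes nothing."""
    root = ET.fromstring(xml)
    out = []
    def walk(el):
        if el.tag in ('add', 'del') and el.tag != keep:
            out.append(el.tail or '')   # skip the unwanted state, keep tail
            return
        out.append(el.text or '')
        for child in el:
            walk(child)
        out.append(el.tail or '')
    walk(root)
    return ''.join(out)
```

<p>Encoded with or without the &lt;subst&gt; wrapper, both extracted states come out identical.</p>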
<p>One might argue that these tags are comments on variation, and their information should be preserved somehow. On the other hand &lt;app&gt; is clearly end-result related – it refers to the creation of a printed apparatus. Even &lt;mod&gt;, which might be used to record "open" variants, where the first version is not cancelled, only makes explicit what should already be implied by its contents. The fact that there is no element to describe an uncancelled first variant (&lt;undel&gt;?) is a problem with TEI, not a justification for &lt;mod&gt;.</p>desmondhttp://www.blogger.com/profile/01722159590093138289noreply@blogger.com0tag:blogger.com,1999:blog-4555078640999654611.post-31921813911763049482012-03-30T16:30:00.018-07:002012-05-12T03:15:13.003-07:00Importing into HRIT<p>Having a good import tool is one of the most important things in building a digital text archive. When you have 2,000+ files to load into the system, doing anything with them by hand is prohibitively expensive. I've tried to identify the steps that must be implemented to represent texts as multi-version works described by multi-version standoff properties. I intend to tick off these steps as I do them. Most of the stuff is already available in some form; it's really just a matter of pulling it all together.</p><ol><li> <a href="http://dhtestbed.ctsdh.luc.edu/tests/Import?DOC_ID=english/shakespeare/kinglear/act1/scene1">A GUI to interact with the user:</a> <span style="color:green">✓</span> <ul><li>Select files for upload.</li>
<li>Gather information about where the merged files are to go.</li>
<li>Specify a filter for plain-text files, if any.</li>
<li>Verify that the entered information makes sense.</li>
</ul></li>
<li>Compare each submitted file with the first file. If it is less than, say, 10% similar, or is too large, reject it and tell the user.<span style="color:green">✓</span></li>
<li>Split the remaining files into two groups: a) plain text and b) XML <span style="color:green">✓</span></li>
<li>Filter the plain-text files to produce one set of markup for each.<span style="color:green">✓</span></li>
<li>Split the XML files into further versions, so multiplying them. This happens if there are any add/del, sic/corr, app/rdg, etc., variations.<span style="color:green">✓</span></li>
<li>Strip each XML file into markup and text.<span style="color:green">✓</span></li>
<li>Merge the sets of markup and text into a corcode and a cortex, and install it in the specified location.<span style="color:green">✓</span></li>
</ol>
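<p>Step 6 above, stripping an XML file into markup and text, can be sketched as follows (Python; the real cortex and corcode formats are richer than this – the sketch illustrates only the idea of standoff properties as (name, offset, length) triples over the bare text):</p>

```python
import xml.etree.ElementTree as ET

def strip(xml):
    """Separate an XML document into plain text (the 'cortex') and a
    list of standoff properties (a simplified 'corcode'), each property
    being a (name, offset, length) triple into that text."""
    root = ET.fromstring(xml)
    text_parts = []
    props = []
    pos = 0
    def walk(el):
        nonlocal pos
        start = pos
        if el.text:
            text_parts.append(el.text)
            pos += len(el.text)
        for child in el:
            walk(child)
            if child.tail:                 # text between/after children
                text_parts.append(child.tail)
                pos += len(child.tail)
        props.append((el.tag, start, pos - start))
    walk(root)
    return ''.join(text_parts), props
```

<p>The text and the properties can then be merged independently into the multi-version cortex and corcode.</p>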
<p>That completes the basic import process.</p>desmondhttp://www.blogger.com/profile/01722159590093138289noreply@blogger.com0tag:blogger.com,1999:blog-4555078640999654611.post-24979609128041109952012-03-11T12:53:00.024-07:002012-03-11T14:39:50.739-07:00A Better Way to do Transpositions<p>One of the weaknesses of nmerge is its handling of block-size transpositions. It does the small ones OK, but large transposed blocks pose a problem because they are rarely contiguous. They tend to break up into short strings of literal similarity, punctuated by small differences in spelling. For example, if you transposed a paragraph in Shakespeare between two editions, differences in spelling would make it hard for the program to see that all the small similarities add up to an entire transposed block. Every idea I had to get around this limitation threatened to make nmerge much slower. Until now.</p>
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjLkGoDfNT1eC2TXotAd1Ehl0m2lENQAhc8NgZVJoTdB6lUlirF1dBr7xuxKoj9l6N5mespm__9XP0P9YtbVXokR74RLf9lZlUEUzRytgDQez8FJENjE2jmsrIIXBSSmQV9O-TXC4p02VMR/s1600/old-transpose.png"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 204px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjLkGoDfNT1eC2TXotAd1Ehl0m2lENQAhc8NgZVJoTdB6lUlirF1dBr7xuxKoj9l6N5mespm__9XP0P9YtbVXokR74RLf9lZlUEUzRytgDQez8FJENjE2jmsrIIXBSSmQV9O-TXC4p02VMR/s400/old-transpose.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5718754832298851890" /></a>
<p>At the moment, when you add a version to the variant graph (as shown above) it aligns it with the graph directly opposite. It does this recursively, by merging sections of identical text and then gradually making the leftover sub-graphs and sub-sections of the new version smaller and smaller until all the new text is merged into the main graph. So in the drawing above the left-over sub-graphs are "The quick red/brown" and "lazy dog." and the new version fragments are "The lazy grey" and "quick dog." Transpositions are looked for to the left and right of the opposite graph-section and replace the direct alignment if a longer match is found. This can only find short contiguous sections of transposed text, and if they are far enough away nmerge simply ignores them. The longest match between "The lazy grey" and its opposite sub-graph is "The", but there is a longer match "lazy" with the other sub-graph. NMerge might miss this because it is too far away, relatively speaking.</p>
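<p>This recursive step can be illustrated in miniature (Python; a toy longest-match merge over plain strings, not nmerge's actual variant-graph code):</p>

```python
from difflib import SequenceMatcher

def align(graph_text, version):
    """Toy illustration of the recursive step: merge the longest common
    block, then recurse on the leftover sub-sections to either side.
    Returns ('both'|'graph'|'version', text) segments in document order."""
    m = SequenceMatcher(None, graph_text, version).find_longest_match(
        0, len(graph_text), 0, len(version))
    if m.size == 0:                       # nothing shared: keep both leftovers
        segs = []
        if graph_text:
            segs.append(('graph', graph_text))
        if version:
            segs.append(('version', version))
        return segs
    return (align(graph_text[:m.a], version[:m.b])
            + [('both', graph_text[m.a:m.a + m.size])]
            + align(graph_text[m.a + m.size:], version[m.b + m.size:]))
```

<p>For "The quick brown fox" against "The quick red fox" it merges "The quick " first, then recurses into the shrinking leftovers, in the same spirit as the description above.</p>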
<p>The new algorithm is much simpler. Rather than align a section of new text with its directly opposite sub-graph and then look to either side for transpositions, it aligns it with <em>all</em> the remaining subgraphs equally. So if there is transposition of an entire block – and we found this quite often in the Tagore poems – nmerge will simply choose the best subgraph to align with for each new section of the text. The problem now is to stop it making trivial transpositions like "the" between the start and end of the work. Some kind of weighting based on previous alignments between blocks as well as distance and length might be the way to go.</p>desmondhttp://www.blogger.com/profile/01722159590093138289noreply@blogger.com0
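<p>A sketch of the idea (Python; the scoring formula is my guess at the kind of length-and-distance weighting suggested above, not what nmerge actually does):</p>

```python
from difflib import SequenceMatcher

def longest_match_size(a, b):
    """Length of the longest common substring of a and b."""
    m = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return m.size

def best_alignment(fragment, subgraphs, pos):
    """Score a new-version fragment against every leftover sub-graph,
    not just the one directly opposite. Each sub-graph is a
    (position, text) pair. The distance penalty discourages trivial
    far-away transpositions like 'the'; long nearby matches win."""
    best, best_score = None, 0.0
    for sg_pos, sg_text in subgraphs:
        size = longest_match_size(fragment, sg_text)
        distance = abs(pos - sg_pos) + 1
        score = size * size / distance     # favour long, nearby matches
        if score > best_score:
            best, best_score = (sg_pos, sg_text), score
    return best
```

<p>With the distance penalty, a short match at the far end of the work scores too low to displace a long alignment close by, while a genuinely transposed block, being long, can still win against a distant sub-graph.</p>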