Tuesday, May 15, 2012


One of the curious problems we face is how to handle anthologies: collections in which the same works are revised from one collection to the next, and in which their order is often rearranged.

For example, Charles Harpur produced several collections of poetry, and often the same works appear in multiple anthologies in altered form, with further corrections. The complexity of this natural arrangement of data is not appreciated or even recognised by most technicians who have undertaken to put such material online. Usually they just treat each anthology as a separate document, transcribe only the final version, catalogue it with metadata, and leave the user to find his or her way around the vast collection. The key problem of interrelating such information is left as a conundrum to be solved manually by the user.

The technique I use to overcome this is to split each anthology into separate poems and then write out an anthology file that contains links to the individual works. In this way merging the versions of each work is easy, and the user can still get a feel for how each anthology looked.

A case in point is the poem 'Eden Lost', which appears in MS A88 and also in MS C376. Apart from the inevitable differences in wording and punctuation, there is also an extra stanza in the C376 version. The reader needs to know what those similarities and differences are, not by clumsily comparing anthology with anthology on screen, that is, separate document with separate document, but by visualising the differences between versions of the same poetic work interactively.

Since I am receiving transcriptions of entire anthologies, I have written a splitting program that stores them in such a way that subsequent anthologies will place versions of the same poem into the same folder. Although this is an edition-specific program, it is worth considering as a general approach for handling any collection of anthologies or similar material:

The 'versions' file stores all successive versions, that is, the names and descriptions of each anthology. The poem folders, which are very numerous, each contain the sources of that poem's versions, to facilitate importing. The 'anthologies' folder contains files with links to the poems. These are documents that have the same status as the poems themselves, and they also appear in the catalogue, although they have only one version. The next step is to write a script that uses the import facility developed in the previous-but-one post to automate importation. This considerably speeds up the process of building a website of archival material.
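As a rough illustration of that layout, here is a minimal sketch in Python. It is not the actual splitting program: the function names, the plain-text file format, the tab-separated 'versions' file and the rule that poems are matched by title are all assumptions made for the example. It also assumes the anthology has already been divided into (title, text) pairs.

```python
import re
from pathlib import Path


def slug(title):
    """Turn a poem title into a folder name (hypothetical naming rule)."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def store_anthology(root, anthology_id, description, poems):
    """Store one anthology as separate poem versions plus a file of links.

    poems is a list of (title, text) pairs in anthology order.
    Versions of the same poem from different anthologies end up in the
    same folder because the folder name is derived from the title alone.
    """
    root = Path(root)
    links = []
    for title, text in poems:
        folder = root / "poems" / slug(title)
        folder.mkdir(parents=True, exist_ok=True)
        # One source file per version, named after the anthology it came from.
        (folder / f"{anthology_id}.txt").write_text(text, encoding="utf-8")
        links.append(f"poems/{slug(title)}/{anthology_id}.txt")

    # The anthology itself is just an ordered list of links to the poems.
    anthologies = root / "anthologies"
    anthologies.mkdir(parents=True, exist_ok=True)
    (anthologies / f"{anthology_id}.txt").write_text(
        "\n".join(links) + "\n", encoding="utf-8"
    )

    # Record the name and description of this anthology as a version.
    with open(root / "versions", "a", encoding="utf-8") as f:
        f.write(f"{anthology_id}\t{description}\n")
    return links
```

Calling store_anthology once for A88 and once for C376 with a poem titled 'Eden Lost' would leave both versions side by side in poems/eden-lost, ready to be imported and merged. Matching by title is the weak point of this sketch: a poem retitled between anthologies would need a manual mapping.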

In fact, it is beginning to look as if I will need to create a separate program on the desktop to manage online archives: to back up, upload, import, export and test whether they are still up. That's not such a trivial task, and it is only clumsily served by my rapidly growing collection of scripts.

Sunday, May 6, 2012

Useless elements in TEI

Writing a program to separate out versions in TEI (Text Encoding Initiative) encoded documents reveals some surprising facts about the latest tags added to that now vast scheme. Real-world texts such as manuscripts written by their authors (holographs) contain lots of corrections and may exist in the form of separate physical drafts. In recording variation in such texts, what functions do <choice>, <app>, <subst> and <mod> actually perform? By design they are supposed to group various kinds of alternatives, but functionally speaking they serve no purpose. You could leave them out and the encoded text would record exactly the same information.

Admittedly there is a human factor here. Humans want things spelled out clearly, and tags like <subst> make it clearer, or do they? Since <choice>, <subst> and <mod> are new tags in TEI P5 that were not present in P4, one wonders how confused people were back in the good old days. Now perhaps they are confused even more by the addition of extra tags that obscure the text and serve no functional purpose whatsoever. You might think that <app> (apparatus entry) groups together a set of readings in parallel, but since each successive <rdg> (reading) contains a "wit" attribute that spells out which versions it belongs to, <app> is left with nothing to do.
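The point about <app> can be checked with a small sketch in Python. This is not the version-separating program itself: the function name and the sample line are invented for illustration, and the fragments omit the TEI namespace to keep things short. The same line is encoded once with an <app> wrapper and once without, and the text of each witness is extracted by looking only at the "wit" attributes of the <rdg> elements.

```python
import xml.etree.ElementTree as ET


def text_of_version(elem, witness):
    """Return the text of one witness, using only the wit attributes.

    Every element other than <rdg> is treated as transparent, so a
    grouping element such as <app> has no effect on the result.
    """
    parts = []

    def walk(node):
        if node.tag == "rdg":
            # wit holds a space-separated list of witness pointers.
            if witness not in node.get("wit", "").split():
                return  # this reading belongs to other versions
        if node.text:
            parts.append(node.text)
        for child in node:
            walk(child)
            if child.tail:
                parts.append(child.tail)

    walk(elem)
    return "".join(parts)


# Invented sample line, not a quotation from Harpur.
with_app = (
    '<l>The <app><rdg wit="#A88">garden</rdg>'
    '<rdg wit="#C376">Eden</rdg></app> is lost</l>'
)
without_app = (
    '<l>The <rdg wit="#A88">garden</rdg>'
    '<rdg wit="#C376">Eden</rdg> is lost</l>'
)

for witness in ("#A88", "#C376"):
    a = text_of_version(ET.fromstring(with_app), witness)
    b = text_of_version(ET.fromstring(without_app), witness)
    print(witness, repr(a), a == b)
```

Both encodings yield "The garden is lost" for #A88 and "The Eden is lost" for #C376, so for the purpose of separating versions the wrapper adds nothing that the readings do not already say.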

One might argue that these tags are comments on variation, and that their information should be preserved somehow. On the other hand, <app> is clearly end-result related: it refers to the creation of a printed apparatus. Even <mod>, which might be used to record "open" variants, where the first version is not cancelled, only makes explicit what should already be implied by the contents. The fact that there is no element to describe an uncancelled first variant (<undel>?) is a problem with TEI, not a justification for <mod>.