Multi-Version Documents: 2012

Saturday, December 1, 2012

A critical edition view for the Web

Critical editions, with their explanatory notes at the foot of the page ('the apparatus') describing the variants of the version printed as the 'main' text above, were very popular in print editions of classical authors and scholarly editions of early printed books from the late 18th century on. But in the digital medium this view, so dependent on print for its effectiveness, has struggled to make an impact, other than to evoke the response that this is a fake or copy of a print edition on screen. The apparatus showed a cross-section of versions that differ from the printed base text, but the connection between the two was via line-numbers. Unfortunately this does not work well on screen. Since the text is not divided into pages the line-numbers would get too large. In any case the idiom of the screen is based on clicks and highlighting rather than numbering. In prose texts, unless the reader were satisfied with the line-breaks of a particular print edition (and its hyphenation), numbering the lines would not even be possible. Another problem is that an auto-generated (versus hand-edited) apparatus is either too detailed to be readable or is filtered, so that important small variants may be lost. A more natural way of displaying variation on screen has to be found.

Side-by-side

On screen, this is the most popular view for displaying multiple versions, e.g. the Online Froissart and Juxta Commons. Usually just two versions are shown, and the user can choose which two to compare. Some versions of this view allow the user to choose multiple versions for display, but it gets a bit confusing after two because only the differences between each pane and one designated base can be shown. But fundamentally what is missing in this view is an ability to show all (or a subset of) versions in one go.

Table view

Another way to display variants is in a table. This aligns all the variants one above the other wherever they correspond. Editors love this view, because it resembles the apparatus of the critical edition, but there are only a few examples on the Web, usually as a window that floats above the text and so obscures it.

The 'killer app'

It seems obvious to combine table view with a view containing one chosen version at a time, to act as the base. It is also fairly obvious that the table should scroll horizontally in sync with the main text. Our project supervisor called this a bit hyperbolically 'a killer app'. It may be observed, however, that:

Using table view in this way has never been done before.
Table view is more adapted to the digital medium and the screen than the apparatus.
It is easy to use because the horizontal scrolling of the table matches the vertical scrolling of the main text, and the highlighting shows what's in sync with what.
The table can be collapsed out of sight or replaced by a set of options that change the appearance and size of the table (and the number of versions shown).
The blend of the two views seems so natural that I would venture to suggest that it could be a digital replacement for the print critical edition.

Check out the first working prototype. There are still some things to be done, such as:

Making the sigla stick to the left of the scrolling table instead of scrolling away out of view.
Integrating this prototype into the main test panel so that the options work.
Adding a popup menu to select a new base version.

But as it stands you can still 'get the picture'. It seems to have potential.

Tuesday, October 23, 2012

Automating word and line selection

TILT has now been repackaged as an applet. It can auto-recognise words and lines. The user selects the word or line tool and then clicks on a likely candidate to select it. Also shown are the black and white image (which TILT actually uses to do word and line-recognition), and the view showing the detected baselines.

One thing this does show is that an applet is an excellent form for this application. It is based on a stable and powerful software platform (Java), is delivered over the Web, and facilitates both rapid development and user feedback. Although still far from finished, it is surprising what has been possible in just two weeks.

Word-detection

This works by assessing a small square around the click-point. If this is darker than average then it is added to the selection, otherwise it is subdivided into four smaller squares and reassessed. If either the big square or one of the small squares is accepted, then the four big squares to the north, south, east and west are assessed similarly. Word boundaries are thus rapidly identified.
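The square-growing idea described above can be sketched roughly as follows. This is not TILT's actual code (TILT is written in Java), only a minimal illustration in Python. It assumes a greyscale image held as a list of rows with 0 for black and 255 for white, uses the page mean as "average", and aligns the squares to a fixed grid; the names `square_mean`, `is_dark` and `select_word` are hypothetical.

```python
from collections import deque

def square_mean(img, x0, y0, size):
    """Mean intensity of a square clipped to the image, or None if it lies outside."""
    h, w = len(img), len(img[0])
    vals = [img[y][x]
            for y in range(max(y0, 0), min(y0 + size, h))
            for x in range(max(x0, 0), min(x0 + size, w))]
    return sum(vals) / len(vals) if vals else None

def is_dark(img, x0, y0, size, threshold):
    """Accept the big square, or failing that any one of its four quarters."""
    m = square_mean(img, x0, y0, size)
    if m is None:
        return False
    if m < threshold:          # lower value = darker
        return True
    half = size // 2
    if half == 0:
        return False
    for dy in (0, half):
        for dx in (0, half):
            q = square_mean(img, x0 + dx, y0 + dy, half)
            if q is not None and q < threshold:
                return True
    return False

def select_word(img, click_x, click_y, size=4):
    """Return the set of grid squares (gx, gy) reached from the click-point."""
    h, w = len(img), len(img[0])
    threshold = sum(map(sum, img)) / (w * h)   # assumption: page mean as "average"
    start = (click_x // size, click_y // size)
    selected, seen, queue = set(), {start}, deque([start])
    while queue:
        gx, gy = queue.popleft()
        if not is_dark(img, gx * size, gy * size, size, threshold):
            continue
        selected.add((gx, gy))
        # an accepted square triggers assessment of its N, S, E and W neighbours
        for nb in ((gx, gy - 1), (gx, gy + 1), (gx + 1, gy), (gx - 1, gy)):
            if nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return selected
```

The selection stops growing wherever a square and all four of its quarters are lighter than the threshold, which is what lets the white gap between words act as a boundary.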

Line detection

This works by counting up the concentration of black pixels in each row of the image. Anything below the average intensity of the page is ignored. The peaks are mostly baselines.
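A rough sketch of this row-counting step, again in Python rather than TILT's Java. It assumes the black and white image is a list of rows of 0/1 values with 1 for black, and it takes the mean row count as the cut-off, which is one possible reading of "average intensity of the page". The function name `detect_baselines` is hypothetical.

```python
def detect_baselines(bw):
    """Return the row indices where the black-pixel count peaks above average."""
    counts = [sum(row) for row in bw]          # black pixels per row
    avg = sum(counts) / len(counts)            # assumption: mean row count as cut-off
    peaks = []
    for y, c in enumerate(counts):
        if c <= avg:
            continue                           # ignore rows at or below average
        prev = counts[y - 1] if y > 0 else 0
        nxt = counts[y + 1] if y + 1 < len(counts) else 0
        if c >= prev and c > nxt:              # local maximum (last row of a plateau)
            peaks.append(y)
    return peaks
```

Because each row is summed across the full width of the page, a slanted line smears its peak over several rows, which is the limitation noted next.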

Although this simple algorithm won't work for slanted or curved lines, I will later subdivide the image into narrow columns and join up the detected baselines. This should overcome the limitation.

There are still a lot of things it doesn't do yet, such as create actual links, or store the results. But, hey, one step at a time.

Saturday, October 13, 2012

TILT (Text to Image Linking Tool)

Visualising links between image and text on the Web is one thing, but how do you create them in the first place? Unless you have a lot of spare time on your hands and don't get bored easily you may find the available tools aren't up to much. For example, the Harpur poems come in the form of manuscript anthologies, 70-100 pages in length. Even if you restricted yourself to just one link per line it would take quite a while to define image maps using a tool like the Gimp image-map plugin. One ordinary-looking manuscript page from the Harpur collection took me more than one hour, and that was without defining any links to the text. The British Library edition of the Codex Sinaiticus looks good, but cost millions of dollars to make (reportedly). There has to be a better way.

What I needed was a simple tool that would truly facilitate an essentially tedious task, and leave me with data in a form I could use. One clear requirement was for the definition and editing of polygonal areas, not just rectangles. Polygons are a must, if anything that is not strictly rectilinear happens in a text, for example a correction, or even if the paper was slightly curved when it was photographed. Since the available manual methods were all much too slow, I decided to write my own tool.

Try defining rectangles on a per-line basis here

There are two current tools that are freely available. One is the Text-Image alignment tool in TextGrid. Although it works quite well it is entirely manual, and it doesn't offer automation. It's also tied in with TextGrid, Eclipse and XML. I didn't want to use any of these. TILE on the other hand is written in Javascript, which would allow it to be usable over the Web. But it appears to be unfinished. Also there is doubt that a program of this complexity can be built quickly enough, and be sufficiently stable and robust using only web-languages. So I decided to use good old Java. This gave me plenty of power, stability, and a single language to develop in. After a few days coding I have nearly exceeded the capabilities of TILE. It can also be turned into an applet, so it can be used on the Web.

So far TILT can define rectangles and polygons on top of a resizable image. Either type of shape can be selected, deleted, moved, scaled and edited by dragging control points. The background image can also be reduced to black and white to facilitate automatic word and line selection. This should be easy to achieve with the image-handling facilities of Java.

I also have to add linking with the text, saving in HTML and JSON formats, and uploading to the server. But all of these are easy to do.

Saturday, October 6, 2012

Text and Facsimile Side-by-Side

One of the most common requests you hear about in digital humanities is to display a facsimile next to its transcription. The TILE project is one example; another was developed some years back, perhaps for the first time, by some guys at the University of Kentucky, for their electronic Beowulf edition. As soon as you think about putting a facsimile next to its transcription you see the need to link areas of the image to spans of text in the transcription. Sometimes the facsimile is hard to read, and it only becomes useful if someone has already gone over it to link those areas and spans to help the reader to work out what corresponds to what. This gives rise to two main problems:

How to provide an environment to define those links and save them
How to display this information and update it as the user scrolls through the text and zooms in on the image

Both solutions must work over the Web, because that is the medium everyone wants to work in. TILE dealt with problem 1. The Beowulf edition was mostly about 2, I think (you have to buy the DVD to find out). We are tackling both problems as part of our AustESE project - to provide an editing environment for electronic scholarly editions. Since someone else is tackling problem 1 I'll be talking about my solution to problem 2. I'd be curious to know if anyone else has already developed something similar.

How it works

This is a screen dump of the live demo on AustESE. The user has already zoomed into an area of interest and has just moved the mouse over a polygonal area. The correct span of text on the right has been highlighted. This seems disarmingly simple but is quite hard to achieve in the medium of the Web. Years ago, when HTML was first developed, imagemaps were invented that allowed rectangular, circular and polygonal AREAs to be defined for an image. They can be defined for example in a tool like Gimp. Mouse movements and mouse clicks could be detected through the maps to fire events that might, for example, be used to highlight parts of the transcription. What was not provided, however, was any way to style or graphically affect the image map. So to make the map more interactive we need to employ some fancy new Web tricks.

The first thing I found that highlights areas on an image map is a thing called maphighlight. Unfortunately it doesn't do zooming, scaling or scrolling. Scaling is required to fit the image and its map onto the user's screen size. Zooming is used to drill down to details on an image that is unlikely to be readable on many devices otherwise. And scrolling is needed to move around the zoomed image. But the maphighlight designers did hit upon a cool workaround for the lame functionality of MAP, AREA and IMG in HTML:

Their display is broken into three panes layered on top of each other. The top layer is the original image and its image map. The image is set to be invisible, so only the defined regions on the image are "visible" to the mouse and are activated as it moves over them. This is important because if the imagemap was behind the canvas the mouse events wouldn't fire and you'd get no interactivity*. Looking through the image and its map the user can see a transparent canvas, which gets drawn to in response to the mouse movements. Behind it, in the background is the scaled and panned image itself, this time visible. What this needs to work is support for HTML5. Fortunately all modern browsers have the required features. To get the zooming and scrolling to work I had to modify and augment the existing maphighlight jQuery plugin, which I renamed maphilite. There are some more features I want to add, but it basically works. I've tested it on the latest versions of Chrome, Firefox, Safari and Opera. I'm pretty sure it will work also in IE9.

Why not support older browsers? Firstly, it's just too hard. Secondly, anyone can install a free browser on their computer even if they currently use IE8 or less. Thirdly, in a few more months or years everyone will be using HTML5, so why bother?

I'd like to stress that this is just a demo at this stage. The next stage will be to create a bunch of image maps and define them in the HRIT system so they can be applied to images and transcriptions on the fly as the user scrolls through the text.

* I imagine some techies are asking 'Why not just use CSS z-indices to stack the three panes?' Because MAPs are bound to their images and are not separately stackable, that's why. Either you have the canvas in front of the image and you get no detection of mouse movements or you can't see the canvas behind the image. Pointer-events might get around that but they don't work on IE or Opera. As a CSS-defined background the image can be scrolled by use of background-position and scaled with background-size. Otherwise I don't see how you can get it to work.

Sunday, July 15, 2012

The Role of Markup in the Digital Humanities

My paper from the Cologne colloquium has been published in:

Historical and Social Research / Historische Sozialforschung 37.3 (2012), 125-146.

It contains a fairly detailed description of my alternative to embedded XML markup as an overall system for representing texts in the digital humanities, and how interoperable software can be built upon that foundation. Since it is a subscription-only journal I can't publish the text here, but a free copy of that volume is reportedly being given to all attendees of this year's Digital Humanities conference at Hamburg. So it should be read by a wide audience.

Wednesday, July 11, 2012

Restricting Versions in Table View

One refinement suggested by my initial version of Table View was to restrict the number of versions for comparison to some subset of those available. This has the advantage of further improving the signal to noise ratio, and does so in a purely digital way.

To help the user select a subset of versions intuitively I used a dropdown menu with the selected versions marked by a × sign. (A tick mark can't be used because browsers already mark the currently selected item with a tick). This method is very compact. It always occupies the same space however many versions there are – something not achieved by the usual technique of a set of check boxes:

Monday, July 2, 2012

Table View

As we struggle to find ways to effectively represent textual variation on screen one of the persistent requests from various quarters has been the need for a table view: a hierarchical representation of the variants of a range within the chosen base text. This kind of view, for example, is used in the Cervantes hypertextual edition, or CollateX. Unlike the apparatus, which is a compact series of footnotes about variations in a text, table view shows variants in a strict rectilinear grid. Although variation is naturally overlapping in structure, not rectilinear or recursive, we can use such a format to clarify for the reader what is a variant of what across a number of versions – something side-by-side view cannot achieve.

Full text

One way to make table view work would be to show the text of all versions covering a particular range in the base text. Although this duplicates text between versions it is quite clear:

The rectilinear grid is implemented as a simple table, which can be seen by turning on the table cell borders:

This ensures that variants are vertically aligned, but since much of the text is the same, we might want to collapse the grid wherever the text is the same, and show only variants of the chosen base text above the line as highlighted alternatives:

This reduces clutter, but introduces another problem: the context of part-word variants is now removed and they may be regarded as unreadable. Extending them to the nearest word-boundary overcomes this:

What this view highlights is another need: many versions are almost the same. For example, in the Shakespeare example, Q1 and Q2 are practically the same, just like F1-F4. The differences are only minor punctuation changes. Collapsing these further introduces nested tables of variants that are best hidden from the reader:

The underlined text-ranges can be expanded by clicking on them, and the same action collapses them again. In the expanded form the sigla are displayed as a guide to the reader:

How this table view differs from the others

This table view differs from other attempts in two key ways:

It is generated directly from a merged multi-version text, not from a collation of many separate texts
It has three combinable options: 1) expansion to word-boundaries, 2) hiding merged text and 3) collapsing minor variants into sub-tables. These may be combined where desired to produce different effects.

The tables are generated as simple HTML. The cross-browser Javascript and CSS required to animate and format them may also be generated optionally.
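Of the three options, expansion to word-boundaries is the easiest to illustrate in isolation. The following is only a sketch in Python, not the code that generates these tables. It assumes a variant is given as a half-open character range within the text of one version, and that whitespace alone marks a word boundary; `expand_to_word` is a hypothetical name.

```python
def expand_to_word(text, start, end):
    """Widen the range [start, end) until it is bounded by whitespace or the text edge."""
    while start > 0 and not text[start - 1].isspace():
        start -= 1
    while end < len(text) and not text[end].isspace():
        end += 1
    return start, end
```

A one-letter variant inside a word is thereby shown as the whole word, at the cost of repeating the letters it shares with the base text.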

Tuesday, May 15, 2012

Anthologies

One of the curious problems we face is anthologies, which represent revisions and collections of works whose order is often rearranged.

For example, Charles Harpur produced several collections of poetry, and often the same works appear in multiple anthologies in altered form, with further corrections. The complexity of this natural arrangement of data is not appreciated or even recognised by most technicians who have undertaken to put such material online. Usually they just treat each anthology as a separate document, transcribe only the final version, catalogue it with metadata, and leave the user to find his or her way around the vast collection. The key problem of interrelating such information is left as a conundrum to be solved manually by the user.

The technique I used to overcome this is to split up each anthology into separate poems and then write out an anthology file that contains links to the individual works. In this way merging differences between works is easy, and the user can still get a feel for how each anthology looked. A case in point is the poem 'Eden Lost', which appears in MS A88 and also MS C376. Apart from inevitable differences in wording and punctuation there is also an extra stanza in the C376 version. The reader needs to know what those similarities and differences are, not by clumsily comparing anthology with anthology on screen, that is, separate document with separate document, but by visualising differences between versions of the same poetic work interactively. Since I am receiving transcriptions of entire anthologies I have written a splitting program that stores them in such a way that subsequent anthologies will place versions of the same poem into the same folder. Although this is an edition-specific program, it is worth considering as a general approach for handling any collection of anthologies or similar material:

The 'versions' file stores all successive versions, that is, the names and descriptions of each anthology. The Poem folders, which are very numerous, each contain the sources of the poem's versions, to facilitate importing. The 'anthologies' folder contains files with links to the poems. These are documents which have the same status as the poems themselves, and also appear in the catalog, although they only have one version. The next step is to write a script that uses the import facility developed in the previous-but-one post to automate importation. This speeds up the process of building a website of archival material considerably.
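The layout just described can be sketched as follows. This is not the edition-specific splitting program itself, only an illustration in Python of how versions of the same poem end up in the same folder. To stay self-contained it models the archive as an in-memory dictionary of path to content; the path scheme, the `slug` helper and the `add_anthology` function are all assumptions.

```python
import re

def slug(title):
    """Turn a poem title into a folder name, e.g. 'Eden Lost' -> 'eden-lost'."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def add_anthology(archive, siglum, description, poems):
    """Split one anthology, given as (title, text) pairs in order, into the archive."""
    links = []
    for title, text in poems:
        folder = f"poems/{slug(title)}"
        # the same title from a later anthology lands in the same folder
        archive[f"{folder}/{siglum}.txt"] = text
        links.append(folder)
    # the anthology itself is only a list of links, preserving the original order
    archive[f"anthologies/{siglum}.txt"] = "\n".join(links)
    # the 'versions' file records the name and description of each anthology
    archive["versions"] = archive.get("versions", "") + f"{siglum}\t{description}\n"
    return archive
```

Calling this once for MS A88 and once for MS C376 would leave both versions of 'Eden Lost' side by side in one poem folder, ready to be merged.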

In fact, it is beginning to look as if I will need to create a separate program on the desktop to manage online archives: to backup, upload, import, export and test if they are still up. That's not such a trivial task, and is only clumsily served by my rapidly growing collection of scripts.

Sunday, May 6, 2012

Useless elements in TEI

Writing a program to separate out versions in TEI (Text Encoding Initiative) encoded documents reveals some surprising facts about the latest tags added to that now vast scheme. Real-world texts such as manuscripts written by their authors (holographs) contain lots of corrections and may exist in the form of separate physical drafts. In recording variation in such texts what functions do <choice>, <app>, <subst> and <mod> actually perform? By design they are supposed to group various kinds of alternatives but functionally speaking they serve no purpose. You could leave them out and the encoded text would record exactly the same information.

Admittedly there is a human factor here. Humans want things spelled out clearly and tags like <subst> make it clearer, or do they? Since <choice>, <subst> and <mod> are new tags in version 5 not present in version 4 one wonders how confused people were back in the good old days. Now perhaps they are confused even more by the addition of extra tags that obscure the text and serve no functional purpose whatsoever. You might think that <app> (apparatus entry) groups together a set of readings in parallel, but since each successive <rdg> (reading) contains a "wit" attribute that spells out which versions it contains, <app> is left with nothing to do.

One might argue that these tags are comments on variation, and their information should be preserved somehow. On the other hand <app> is clearly end-result related – it refers to the creation of a printed apparatus. Even <mod>, which might be used to record "open" variants, where the first version is not cancelled, only makes explicit what should already be implied by the contents. The fact that there is no element to describe an uncancelled first variant (<undel>?) is a problem with TEI not a justification for <mod>.
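The point about <app> can be illustrated with a small sketch in Python. This is not the version-splitting program mentioned above; it assumes un-namespaced TEI-like XML in which every <rdg> carries a "wit" attribute of space-separated sigla (with or without a leading "#"), and the name `version_text` is hypothetical.

```python
import xml.etree.ElementTree as ET

def version_text(elem, wit):
    """Collect the text of one witness, deciding solely on each <rdg>'s wit attribute."""
    parts = [elem.text or ""]
    for child in elem:
        sigla = child.get("wit", "").replace("#", "").split()
        # <app> gets no special treatment: it is descended into like any other element
        if child.tag != "rdg" or wit in sigla:
            parts.append(version_text(child, wit))
        parts.append(child.tail or "")
    return "".join(parts)
```

For example, with `root = ET.fromstring('<l>The <app><rdg wit="#A">quick</rdg><rdg wit="#B">slow</rdg></app> fox</l>')`, calling `version_text(root, "A")` selects the first reading without ever testing for <app>.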

Friday, March 30, 2012

Importing into HRIT

Having a good import tool is one of the most important things in building a digital text archive. When you have 2,000+ files to load into the system doing anything by hand with them is prohibitively expensive. I've tried to identify the steps that must be implemented to represent texts as multi-version works described by multi-version standoff properties. I intend to tick off these steps as I do them. Most of the stuff is already available in some form. It's really just a matter of pulling it all together.

A GUI to interact with the user: ✓
- Select files for upload.
- Gather information about where the merged files are to go.
- Also specify a filter for plain text files if any
- Verify that the inputted information makes sense
Compare each submitted file with the first file. If it is less than, say, 10% similar, or it is too large, reject it and tell the user. ✓
Split the remaining files into two groups: a) plain text and b) XML. ✓
Filter the plain text files and so produce one set of markup for each. ✓
Split the XML files into further versions, so multiplying them. This happens if there are any add/del, sic/corr, app/rdg etc. variations. ✓
Strip each XML file into markup and text. ✓
Merge the sets of markup and text into a corcode and a cortex, and install it in the specified location. ✓

That completes the basic import process.
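The similarity check and the plain-text/XML split from the list above might look something like this in outline. It is a sketch in Python, not the HRIT importer. The similarity measure (`difflib.SequenceMatcher.ratio`), the size limit and the "starts with <" test for XML are all assumptions, and `triage` is a hypothetical name.

```python
from difflib import SequenceMatcher

def triage(files, min_similarity=0.10, max_size=1_000_000):
    """files: list of (name, text). Returns (plain, xml, rejected) lists of names."""
    first = files[0][1]
    plain, xml, rejected = [], [], []
    for name, text in files:
        # assumption: character-level ratio stands in for "10% similar"
        similarity = SequenceMatcher(None, first, text).ratio()
        if len(text) > max_size or similarity < min_similarity:
            rejected.append(name)              # report these back to the user
        elif text.lstrip().startswith("<"):    # crude XML test, for illustration only
            xml.append(name)
        else:
            plain.append(name)
    return plain, xml, rejected
```

`SequenceMatcher` is slow on long texts, so a real importer handling 2,000+ files would want a cheaper measure. The point here is only the order of the decisions.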

Sunday, March 11, 2012

A Better Way to do Transpositions

One of the weaknesses of nmerge is its handling of block-size transpositions. It does the small ones OK, but large transposed blocks pose a problem because they are rarely contiguous. They tend to break up into short strings of literal similarity, punctuated by small differences in spelling. For example, if you transposed a paragraph in Shakespeare between two editions, differences in spelling would make it hard for the program to see that all the small similarities add up to an entire transposed block. Every idea I had to get around this limitation threatened to make nmerge much slower. Until now.

At the moment, when you add a version to the variant graph (as shown above) it aligns it with the graph directly opposite. It does this recursively, by merging sections of identical text and then gradually making the leftover sub-graphs and sub-sections of the new version smaller and smaller until all the new text is merged into the main graph. So in the drawing above the left-over sub-graphs are "The quick red/brown" and "lazy dog." and the new version fragments are "The lazy grey" and "quick dog." Transpositions are looked for to the left and right of the opposite graph-section and replace the direct alignment if a longer match is found. This can only find short contiguous sections of transposed text, and if they are far enough away nmerge simply ignores them. The longest match between "The lazy grey" and its opposite sub-graph is "The", but there is a longer match "lazy" with the other sub-graph. nmerge might miss this because it is too far away, relatively speaking.

The new algorithm is much simpler. Rather than align a section of new text with its directly opposite sub-graph and then look to either side for transpositions, it aligns it with all the remaining subgraphs equally. So if there is transposition of an entire block – and we found this quite often in the Tagore poems – nmerge will simply choose the best subgraph to align with for each new section of the text. The problem now is to stop it making trivial transpositions like "the" between the start and end of the work. Some kind of weighting based on previous alignments between blocks as well as distance and length might be the way to go.
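The core of the new approach, choosing the best sub-graph for each fragment regardless of position, can be sketched like this. It is not nmerge's code and makes a large simplifying assumption: each sub-graph is flattened to a list of its path strings, and "best" means the longest common substring, found with Python's `difflib`. The weighting by distance and previous alignments is not attempted, and the names `longest_match` and `best_subgraph` are hypothetical.

```python
from difflib import SequenceMatcher

def longest_match(a, b):
    """Longest contiguous run of characters shared by a and b."""
    m = SequenceMatcher(None, a, b, autojunk=False).find_longest_match(
        0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]

def best_subgraph(fragment, subgraphs):
    """Compare the fragment with every sub-graph equally; return (index, match)."""
    best = (-1, "")
    for i, paths in enumerate(subgraphs):
        for path in paths:                     # each path is one version's text
            match = longest_match(fragment, path)
            if len(match) > len(best[1]):
                best = (i, match)
    return best
```

With the sub-graphs from the example, `[["The quick red", "The quick brown"], ["lazy dog."]]`, the fragment "The lazy grey" is matched against both and aligned with the second, because "lazy " is longer than "The ". As written, nothing stops the sketch from picking a trivial match such as "the" from anywhere in the work, which is exactly the problem the weighting would have to solve.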

Tuesday, March 6, 2012

HritServer Progress

HritServer has progressed to version 0.1.2. It can generate HTML for compare view, with parallel texts that will be able to scroll in sync, plain HTML versions, a dropdown menu to select versions, and a few services like stripping XML into markup and plain text. The back-end is also taking shape:

The idea here is to build up a web page in any web-development system whatsoever from a series of configurable components. Each component is just a fragment of HTML whose formatting and markup can be specified by the user. When I say markup I don't mean XML, I mean standoff properties. So you can have multiple markup sets and stylesheets for one text that combine sensibly. Or you can just ask for that component and let the default formatting do all the hard work for you.

The back-end shown above in a very early version is supposed to be a 'reference implementation' that fully exercises the HritServer application. So you can see what it is capable of and how to do it at the same time.