Multi-Version Documents: November 2009

Friday, November 27, 2009

Interedition Handout

I've had some positive feedback from the recent meeting of the Interedition initiative in Brussels. One of my colleagues distributed a handout that was favourably received, and in response to which I have already received one offer of collaboration. Since it expresses the essence of MVD in a non-technical form and has a stunning graphic comparing the 1845 and 1888 editions of Charles Harpur's The Creek of the Four Graves, which have only around 40% similarity, I thought I'd share it with you:

Multi-Version Documents and the Harpur Archive

The Multi-Version Document or MVD system is designed to automate as far as possible the work of editing our textual cultural heritage. Existing markup-based approaches pose serious problems for the modern digital scholarly editor, including:

Failure to adequately and accurately represent ordinary textual phenomena
Obscuring the text and confusing the editor with excessive density of technical markup
Requiring manual tasks that could be performed much better and automatically by computer
Embedding subjective and potentially obsolescent technical information into texts that are supposed to be archived for the long term

These problems can mostly be overcome by separating the versions from their content. In this way editing a text becomes relatively simple, because all the complexities of versions (insertions, deletions, variants and transpositions) are handled automatically. Instead the editor works on a simplified text marked up only with the textual structure of each version.

An MVD represents 'the work' as an interrelated set of versions that can be searched, compared, edited and archived as a single, compact digital entity. An MVD also has a zero footprint. You can always get out the texts in exactly the same form as you put them in.
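One way to picture the "single entity" and "zero footprint" ideas is a list of text fragments, each tagged with the set of versions that share it. The sketch below is a toy model only: the names `Pair` and `extract` and the sample text are hypothetical, and this is not the file format nmerge actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pair:
    """A fragment of text plus the ids of the versions that contain it."""
    versions: frozenset
    text: str


def extract(mvd, version):
    """Rebuild one version by joining, in order, the fragments it belongs to."""
    return "".join(p.text for p in mvd if version in p.versions)


# Two versions that differ in a single word, stored as one merged list.
mvd = [
    Pair(frozenset({1, 2}), "The quick "),
    Pair(frozenset({1}), "brown"),
    Pair(frozenset({2}), "red"),
    Pair(frozenset({1, 2}), " fox"),
]

version_1 = extract(mvd, 1)
version_2 = extract(mvd, 2)
```

Shared text is stored once, and each version can still be read back unchanged from the merged list.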

What we have now:

The following tools are available for download from the Google Code site:

The nmerge command-line tool. This can be used to create, edit and manipulate MVDs.
The Alpha wiki prototype. This can be used to visualise and edit MVDs. For copyright reasons it only has one example text: all major versions of Act 1 Scene 1 from Shakespeare’s King Lear.

Future Developments

We are currently developing a plugin for Joomla! that will incorporate all the current technology, with further enhancements, so that a humanities web archive can be built and deployed easily on ordinary web hosts with only a low level of technical expertise. This will be used as the basis of the new Digital Variants website and also the Harpur Text Archive. Progress reports will be posted on the MVD blog.

References

Schmidt, D. (2009a). Merging Multi-Version Texts: a Generic Solution to the Overlap Problem. In: Usdin, B.T. (ed) Proceedings of Balisage: The Markup Conference 2009. doi:10.4242/BalisageVol3.Schmidt01.

Schmidt, D. and Colomb, R. (2009). A data structure for representing multi-version texts online. International Journal of Human-Computer Studies, 67.6: 497-514.

Schmidt, D., Brocca, N. and Fiormonte, D. (2008). A Multi-Version Wiki. In: L.L. Opas-Hänninen, M. Jokelainen, I. Juuso, T. Seppänen (eds), Proceedings of Digital Humanities 2008, Oulu, Finland, June, 2008, pp. 187-188.

Multi-Version Documents. http://multiversiondocs.blogspot.com.

Merge and edit N versions in one document. http://code.google.com/p/multiversiondocs/.

Tuesday, November 24, 2009

Minor updates to nmerge, Alpha

I have added a README to Alpha to help with installing it and getting it working. It didn't have one, which was an oversight. I also noticed that the nmerge installer didn't work properly, which is due to my inexperience with automake. In fact it installed correctly; it just complained about the Java source code directory, which wasn't listed properly in the makefile. I'll try to be more careful in future.

Sunday, November 1, 2009

C++ Version of nmerge

One problem with the current design of nmerge is that it is written in Java. The command-line tool is a thin C wrapper around the Java code, and it gives you no way to pass in arguments that increase the available memory, so it simply fails on large files. And if you want to run it on a server that doesn't have, or won't allow, Java (true of many commercial hosting sites), you're out of luck. Since the Digital Variants people, and probably a large number of other humanities projects, will run into the same problems, I have decided to convert it to pure C++. This should be relatively easy, and the benefits are:

Memory usage will be limited only by what is available on the machine, not by what is allocated to the Java Virtual Machine (JVM).
nmerge-c++ will be callable from PHP or another scripting language without requiring installation of a JVM.
nmerge can optionally write to a database instead of directly to disk. This is usually the only way you can save changes on a commercial hosting site.
The C++ version will use far less memory than the Java version and should be a bit faster.

Overall, these changes will facilitate the building of a practical web application or plugin, which can be added to existing sites. Initially, my intention is to produce a Joomla! plugin that other people can use.

Some changes that will be possible in this revision include:

Grouped transpositions. By assessing individual transposition candidates as a group it will be possible to detect larger transpositions that contain small corrections.
Proper multi-tasking of the merging process in C++ will hopefully speed up the algorithm considerably.
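The post does not say how grouped transpositions would be detected, so the following is only one hypothetical reading of the idea, not the planned nmerge algorithm. It assumes each candidate is a tuple `(src, dst, length)`, meaning a block of `length` characters found at offset `src` in one version and at offset `dst` in another. Candidates separated by a small gap on both sides (a small correction inside a moved passage) are fused into one larger transposition. The function name `group_candidates` and the `max_gap` threshold are invented for illustration.

```python
def group_candidates(candidates, max_gap=5):
    """Fuse transposition candidates that sit close together on both sides.

    candidates: iterable of (src, dst, length) tuples.
    Returns a list of (src_start, src_end, dst_start, dst_end) groups.
    """
    groups = []
    for src, dst, length in sorted(candidates):
        if groups:
            last = groups[-1]
            src_gap = src - last[1]   # distance from end of group in source
            dst_gap = dst - last[3]   # distance from end of group in destination
            if 0 <= src_gap <= max_gap and 0 <= dst_gap <= max_gap:
                # Close enough on both sides: extend the current group.
                last[1] = src + length
                last[3] = dst + length
                continue
        groups.append([src, src + length, dst, dst + length])
    return [tuple(g) for g in groups]


# The first two blocks are split only by a short correction, so they merge;
# the third lies far away and stays a separate transposition.
grouped = group_candidates([(0, 100, 20), (22, 123, 30), (200, 10, 15)])
```

A real implementation would also have to decide how large a gap still counts as a "small correction", which is simply a fixed parameter here.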

That's the plan. I thought I'd let you know where I'm taking this: the aim is to turn nmerge into a generally usable tool.

There is at least one drawback, of course. C++ is cumbersome to write code in, compared to the relative heaven of Java. It's like painting a room with a brush instead of a roller.