Tuesday, July 29, 2008

Improvements to Alignment

In preparation to publicly releasing the NMerge code I have been revising the alignment algorithm, as there were significant weaknesses in the naïve approach I had previously adopted. In cases like the Sibylline Gospel, there is a great deal of variation in a small space, and simply requiring the user to choose a base text for alignment doesn’t work very well. There seem to be a few reasons for this:

  1. The user doesn't actually know to which existing version a new version should be aligned. It should be the task of the software to find this out.
  2. There can't be a distinction, as I originally supposed, between updating an existing version and adding a new one. The newly revised version might resemble another version more than the one it came from. For example, we might change ‘the quick brown dog’ back to ‘the quick brown fox’. In the case of manuscript traditions like the Sibylline Gospel this happens all the time, because the versions aren’t a succession of edits, but a set of parallel alternatives. To avoid this problem I now automatically calculate the most similar version already in the MVD and then align with that.
  3. When you add a new arc to the graph you need to look for identical paths not merely identical arcs that are already in that position. Now when the program finds an existing path with the same text, spanning the same two end-points, the new arc can be discarded and its version simply added to the path instead: In this simplified real-world example, the new D-version was aligned with version C, overall its most similar version. However, in this location the D-variant ‘milia hominum’ already exists in version A. When the program tried to add the D-Arc it realised that there was already an A-path with the same text and instead merely added the D-version to that path.

This is only the first stage of a series of improvements. Still to come:

  • Calling the alignment algorithm recursively to handle contamination, i.e. aligning to more than one base text. After aligning to the most similar version, any significant portions of the new version that didn't align can be realigned to the most similar version between the two endpoints of the unaligned section.
  • Calculating what is a variant of what, one use of which might be to generate a kind of apparatus criticus
  • Calculating transpositions. These can be done after the other alignments are complete: any leftover insertion/deletion pairs meeting certain criteria can be tested for equality, and the transposition carried out.

XML Awareness

While NMerge remains ignorant of XML, as it should, the public interface class now breaks up ‘words’ based on angle-brackets as well as white space. In addition I am contemplating moving this code and the class that XML-izes the differences between versions into a separate package. The latter functionality is needed by the wiki since a difference might occur in the middle of a tag, and the marker for this needs to be moved to the start of the next piece of real text, so it can be displayed.

Saturday, July 12, 2008

Question and Answer

As Ted Nelson said in his book Literary Machines, it is difficult to explain a simple but radical idea. What came out of Oulu were a number of misunderstandings about what an MVD is and what it can record. So I thought I would publish a FAQ that I have accumulated not just from Oulu, but also from all previous presentations.

MVD-related

What’s an ‘MVD’?
What’s it good for?
Why don't you just use a Version Control system like CVS?
Why can't MVDs record all I need to say about versions?
Can the list form of an MVD record individual variants?
Isn't an MVD or Variant Graph just another kind of GODDAG?
If I adopt MVD won't I'll be stuck with a technology that might disappear later?

Markup-related

How can your format compete with XML?
How can your solution represent overlapping hierarchies?
Why can't I convert your MVD format into TEI-XML?
Why should I stop using markup to record variation?

Wiki-related

Can I gain access to the underlying wiki technology?
How do I get my texts into the wiki?
What are ‘NMerge’ and ‘Alpha’?

Q. What’s an ‘MVD’?

A. An MVD is a Multi-Version Document. It’s primarily a binary document format that allows you to store multiple versions, perhaps thousands, in a single integrated digital entity, that can be searched and edited, and its versions extracted, compared and examined. It is based on a directed graph structure stored in a simplified list form. There is a basic description of it at the start of this blog, which also provides links to published papers.

Q. What’s it good for?

A. MVDs were originally designed for recording multiple versions of cultural heritage texts: mostly older written or printed documents such as literature, correspondence, legal documents, journals, parallel translations etc. However, since they are designed to represent overlapping structures in text they can also be used to record ‘overlapping hierarchies’, or different marked up versions of the same underlying text. In principle they could also be applied to genetic sequencing and version control systems.

Q. Why don't you just use a Version Control system like CVS?

A. Version Control or SCM (Software Configuration Management) Systems are unsuited to storing texts that exist in multiple versions. Specifically:

  1. SCM systems only store the differences between the version you are committing and one designated version already in the repository. Their purpose is to allow the user to reconstruct any version from a series of updates. By contrast, the purpose of an MVD is to maintain a large number of simultaneous versions and to record the global differences between all of them.
  2. The level of granularity in SCM systems is usually the line. The required granularity for cultural heritage texts is usually the word or the character.
  3. SCM systems usually only calculate the insertions and deletions between two texts. For humanities texts it is also necessary to calculate variants and transpositions.
  4. In SCM systems versions are necessarily immutable: the idea is to be able to recover any older version on demand. On the other hand versions in an MVD must always be mutable: the editor of a cultural heritage text needs to be able to update any given version and save the changes back as a replacement text, without changing the overall number of versions.

Q. Why can't MVDs record all I need to say about versions?

A. An MVD records variation. It’s a general solution to textual variation in all its forms. It doesn’t need to record details about the content and it doesn’t impose any restrictions on whatever detail you may want to add to the content of a particular version. Whereas in pure markup systems you record at the same time both the version status of a piece of text and its physical appearance (e.g. it’s in red ink) or placement (it’s above the line) or version information (it’s from manuscript such and such) in an MVD you only put the version-related information and all the rest goes into the content of each version. For example, you might define a version called the ‘red ink version’. That frees you from having to record that fact each time there is some text in red ink, and instead you merely specify that this piece of text belongs to the ‘red ink version’ and record in the content any unique details about that particular instance of red ink text, for example, that it is crossed out, or written in the margin etc.

Q. Can the list form of an MVD record individual variants?

A. Yes. You can use the nmerge tool to list the variants of a given span of text in any version. It finds the innermost variants of all versions other than the specified base version.

Q. How can your format compete with XML?

A. An MVD does not compete with markup. It is perfectly acceptable for a version to use markup for its content. Indeed, the Alpha multi-version wiki described on this blog is an example of that approach. An MVD doesn’t care about whether you use markup or not because it is one level above it, one step ahead of it. Because of that it can leverage any existing technology to represent the content. If you want to use markup for that purpose, you can. Markup languages are very useful in a wide variety of modern applications but they also have the serious limitation of not being able to represent overlapping structures in text very well. An MVD is designed to overcome that.

Q. How can your solution represent overlapping hierarchies?

A. Every known form of overlapping hierarchies is a form of textual variation. If you read the extensive literature on the topic you will see that there are essentially three kinds of overlapping hierarchies:

  1. Completely separate hierarchies of tags that share the same text, e.g. CONCUR in SGML. This the OHCO2 model of Renear, Mylonas and Durand
  2. Hierarchies that describe the same text but which partly overlap with each other. This is the OHCO3 model described by Renear, Mylonas and Durand in the same paper.
  3. Hierarchies that may or may not overlap, but which share only part of the same text. (e.g. Durand, Mylonas and DeRose, 1996)
All three types of overlapping hierarchies can be written out as versions. Each individual hierarchy, whether or not it overlaps with another hierarchy, whether or not it describes part or all of the text, can be written out as a separate version simply by traversing its DOM or parse tree. What varies in this case is not the text but the markup, but that’s still a version. If we can solve the problem of how to represent multiple versions we have also solved the problem of how to represent overlapping hierarchies.

Q. Why can't I convert your MVD format into TEI-XML?

A. Each pair in the MVD’s sequence of pairs contains some text, whose content is not necessarily well-formed. For example, you might have a deleted paragraph end as a fragment: ‘</p><p>’. An MVD treats markup as plain text or CDATA – it doesn’t parse it. You can’t have variation of the metadata in XML, but you can in an MVD. That’s one thing that makes it a powerful technique, but it is also something that prevents it being represented usefully in XML. The XML export format in the nmerge library is just to show you what’s in an MVD. It's actually XML inside XML and is not intended for editing. Even a small change could break the MVD and render it invalid. All that you might want to do with an MVD is already implemented in the nmerge library anyway. There can be no possible use for an XML version of an MVD, except to see what is inside it. A binary MVD is only one fifth the size of the same file rendered in XML.

Q. Can I gain access to the underlying wiki technology?

A. The multi-version wiki is one example of what you can do with MVDs. The underlying technology is contained in a JAVA library called nmerge, available here. Since this library does not implement any user interface, but is wholly concerned with loading, saving, updating, searching and extracting versions etc. from MVDs, you can build a whole application on top of it. Alternatively, if that appears too daunting, the multi-version wiki can be customised, for example, by changing the stylesheets or some of the servlets of which it is composed. Both programs are available under the GPLv3 license.

Q. How do I get my texts into the wiki?

A. If you have a number of separate versions you can use nmerge to build an MVD. What you do is supply each version as a separate file and add them one by one to the MVD that you specify. You have to write a small guide file in xml so that nmerge knows what you want to call your versions and what their grouping is, and where the files are. Alternatively, you will soon be able to add new versions directly in the wiki. If you have a text marked up with, say TEI <app> structures, you will either have to write a conversion script to strip out each version separately, or you can use the Versioning Machine.

Q. What are ‘NMerge’ and ‘Alpha’?

A. NMerge is the JAVA library that powers the multi-version wiki, which is called Alpha. NMerge saves, merges, compares, queries and reads versions in MVDs. It knows nothing about the structure of the data it represents. Alpha, on the other hand, is a JAVA web application and is entirely concerned with formatting and content. Its purpose is to provide at least one means for people to update and read MVDs.

Q. Isn't an MVD or Variant Graph just another kind of GODDAG?

A. The only similarity shared by a Variant Graph and a GODDAG, as described by Sperberg-McQueen and Huitfeldt 1999, is that both are types of directed acyclic graph. However, a GODDAG is really a subsumption hierarchy, a set of partly merged trees describing a single linear version of text. A Variant Graph on the other hand, is a flow or PERT graph with an entirely different structure and labeling, used to describe an entire set of versions. A GODDAG can't be serialised, but an MVD can. In other words they have very little in common and one did not inspire the other.

Q. Why should I stop using markup to record variation?

A. The biggest drawback of encoding variation using markup, apart from the inaccuracy, is the excessive cost in human labour. Editors have to be trained to use complex markup. Someone needs to oversee the work to ensure consistency. The markup has to be checked multiple times against the source texts. Someone has to write specialised software to produce an electronic or printed edition from that information. This is all very costly. Using an automatic alignment method, as provided by nmerge, takes away the need to encode interconnections between versions altogether. Now they can be computed, rather than specified by hand. And the MVD format is simply more accurate at recording variation than markup.

Q. If I adopt MVD won't I'll be stuck with a technology that might disappear later?

A. No. You can get your texts out at any time in the same form as you put them in, by using the archive facility of nmerge. MVD has a zero footprint - it doesn't add anything at all to the files it manipulates and merges together. If you want to abandon the MVD format in the future that choice is always open. Can you say the same about XML?

T.H. Nelson, Literary Machines, Mindful Press, Sausalito, 1992