As Ted Nelson said in his book
Literary Machines, it is difficult to explain a simple but radical idea. What came out of Oulu were a number of misunderstandings about what an MVD is and what it can record. So I thought I would publish a FAQ that I have accumulated not just from Oulu, but also from all previous presentations.
MVD-related
What’s an ‘MVD’?
What’s it good for?
Why don't you just use a Version Control system like CVS?
Why can't MVDs record all I need to say about versions?
Can the list form of an MVD record individual variants?
Isn't an MVD or Variant Graph just another kind of GODDAG?
If I adopt MVD won't I'll be stuck with a technology that might disappear later?
Markup-related
How can your format compete with XML?
How can your solution represent overlapping hierarchies?
Why can't I convert your MVD format into TEI-XML?
Why should I stop using markup to record variation?
Wiki-related
Can I gain access to the underlying wiki technology?
How do I get my texts into the wiki?
What are ‘NMerge’ and ‘Alpha’?
Q. What’s an ‘MVD’?
A. An MVD is a Multi-Version Document. It’s primarily a binary document format that allows you to store multiple versions, perhaps thousands, in a single integrated digital entity, that can be searched and edited, and its versions extracted, compared and examined. It is based on a directed graph structure stored in a simplified list form. There is a basic description of it at the start of this blog, which also provides links to published papers.
Q. What’s it good for?
A. MVDs were originally designed for recording multiple versions of cultural heritage texts: mostly older written or printed documents such as literature, correspondence, legal documents, journals, parallel translations etc. However, since they are designed to represent overlapping structures in text they can also be used to record ‘overlapping hierarchies’, or different marked up versions of the same underlying text. In principle they could also be applied to genetic sequencing and version control systems.
Q. Why don't you just use a Version Control system like CVS?
A. Version Control or SCM (Software Configuration Management) Systems are unsuited to storing texts that exist in multiple versions. Specifically:
- SCM systems only store the differences between the version you are committing and one designated version already in the repository. Their purpose is to allow the user to reconstruct any version from a series of updates. By contrast, the purpose of an MVD is to maintain a large number of simultaneous versions and to record the global differences between all of them.
- The level of granularity in SCM systems is usually the line. The required granularity for cultural heritage texts is usually the word or the character.
- SCM systems usually only calculate the insertions and deletions between two texts. For humanities texts it is also necessary to calculate variants and transpositions.
- In SCM systems versions are necessarily immutable: the idea is to be able to recover any older version on demand. On the other hand versions in an MVD must always be mutable: the editor of a cultural heritage text needs to be able to update any given version and save the changes back as a replacement text, without changing the overall number of versions.
Q. Why can't MVDs record all I need to say about versions?
A. An MVD records variation. It’s a general solution to textual variation in all its forms. It doesn’t need to record details about the content and it doesn’t impose any restrictions on whatever detail you may want to add to the content of a particular version. Whereas in pure markup systems you record at the same time both the version status of a piece of text and its physical appearance (e.g. it’s in red ink) or placement (it’s above the line) or version information (it’s from manuscript such and such) in an MVD you only put the version-related information and all the rest goes into the content of each version. For example, you might define a version called the ‘red ink version’. That frees you from having to record that fact each time there is some text in red ink, and instead you merely specify that this piece of text belongs to the ‘red ink version’ and record in the content any unique details about that particular instance of red ink text, for example, that it is crossed out, or written in the margin etc.
Q. Can the list form of an MVD record individual variants?
A. Yes. You can use the nmerge tool to list the variants of a given span of text in any version. It finds the innermost variants of all versions other than the specified base version.
Q. How can your format compete with XML?
A. An MVD does not compete with markup. It is perfectly acceptable for a version to use markup for its content. Indeed, the Alpha multi-version wiki described on this blog is an example of that approach. An MVD doesn’t care about whether you use markup or not because it is one level above it, one step ahead of it. Because of that it can leverage any existing technology to represent the content. If you want to use markup for that purpose, you can. Markup languages are very useful in a wide variety of modern applications but they also have the serious limitation of not being able to represent overlapping structures in text very well. An MVD is designed to overcome that.
Q. How can your solution represent overlapping hierarchies?
A. Every known form of overlapping hierarchies is a form of textual variation. If you read the extensive literature on the topic you will see that there are essentially three kinds of overlapping hierarchies:
- Completely separate hierarchies of tags that share the same text, e.g. CONCUR in SGML. This the OHCO2 model of Renear, Mylonas and Durand
- Hierarchies that describe the same text but which partly overlap with each other. This is the OHCO3 model described by Renear, Mylonas and Durand in the same paper.
- Hierarchies that may or may not overlap, but which share only part of the same text. (e.g. Durand, Mylonas and DeRose, 1996)
All three types of overlapping hierarchies can be written out as versions. Each individual hierarchy, whether or not it overlaps with another hierarchy, whether or not it describes part or all of the text, can be written out as a separate
version simply by traversing its DOM or parse tree. What varies in this case is not the text but the markup, but that’s still a version. If we can solve the problem of how to represent multiple versions we have also solved the problem of how to represent overlapping hierarchies.
Q. Why can't I convert your MVD format into TEI-XML?
A. Each pair in the MVD’s sequence of pairs contains some text, whose content is not necessarily well-formed. For example, you might have a deleted paragraph end as a fragment: ‘</p><p>’. An MVD treats markup as plain text or CDATA – it doesn’t parse it. You can’t have variation of the metadata in XML, but you can in an MVD. That’s one thing that makes it a powerful technique, but it is also something that prevents it being represented usefully in XML. The XML export format in the nmerge library is just to show you what’s in an MVD. It's actually XML inside XML and is not intended for editing. Even a small change could break the MVD and render it invalid. All that you might want to do with an MVD is already implemented in the nmerge library anyway. There can be no possible use for an XML version of an MVD, except to see what is inside it. A binary MVD is only one fifth the size of the same file rendered in XML.
Q. Can I gain access to the underlying wiki technology?
A. The multi-version wiki is one example of what you can do with MVDs. The underlying technology is contained in a JAVA library called nmerge, available here. Since this library does not implement any user interface, but is wholly concerned with loading, saving, updating, searching and extracting versions etc. from MVDs, you can build a whole application on top of it. Alternatively, if that appears too daunting, the multi-version wiki can be customised, for example, by changing the stylesheets or some of the servlets of which it is composed. Both programs are available under the GPLv3 license.
Q. How do I get my texts into the wiki?
A. If you have a number of separate versions you can use nmerge to build an MVD. What you do is supply each version as a separate file and add them one by one to the MVD that you specify. You have to write a small guide file in xml so that nmerge knows what you want to call your versions and what their grouping is, and where the files are. Alternatively, you will soon be able to add new versions directly in the wiki. If you have a text marked up with, say TEI <app> structures, you will either have to write a conversion script to strip out each version separately, or you can use the Versioning Machine.
Q. What are ‘NMerge’ and ‘Alpha’?
A. NMerge is the JAVA library that powers the multi-version wiki, which is called Alpha. NMerge saves, merges, compares, queries and reads versions in MVDs. It knows nothing about the structure of the data it represents. Alpha, on the other hand, is a JAVA web application and is entirely concerned with formatting and content. Its purpose is to provide at least one means for people to update and read MVDs.
Q. Isn't an MVD or Variant Graph just another kind of GODDAG?
A. The only similarity shared by a Variant Graph and a GODDAG, as described by Sperberg-McQueen and Huitfeldt 1999, is that both are types of directed acyclic graph. However, a GODDAG is really a subsumption hierarchy, a set of partly merged trees describing a single linear version of text. A Variant Graph on the other hand, is a flow or PERT graph with an entirely different structure and labeling, used to describe an entire set of versions. A GODDAG can't be serialised, but an MVD can. In other words they have very little in common and one did not inspire the other.
Q. Why should I stop using markup to record variation?
A. The biggest drawback of encoding variation using markup, apart from the inaccuracy, is the excessive cost in human labour. Editors have to be trained to use complex markup. Someone needs to oversee the work to ensure consistency. The markup has to be checked multiple times against the source texts. Someone has to write specialised software to produce an electronic or printed edition from that information. This is all very costly. Using an automatic alignment method, as provided by nmerge, takes away the need to encode interconnections between versions altogether. Now they can be computed, rather than specified by hand. And the MVD format is simply more accurate at recording variation than markup.
Q. If I adopt MVD won't I'll be stuck with a technology that might disappear later?
A. No. You can get your texts out at any time in the same form as you put them in, by using the archive facility of nmerge. MVD has a zero footprint - it doesn't add anything at all to the files it manipulates and merges together. If you want to abandon the MVD format in the future that choice is always open. Can you say the same about XML?
T.H. Nelson, Literary Machines, Mindful Press, Sausalito, 1992