<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss'><id>tag:blogger.com,1999:blog-4555078640999654611</id><updated>2009-11-07T17:13:00.962-08:00</updated><title type='text'>Multi-Version Documents</title><subtitle type='html'>This project is about creating a Wiki to handle documents consisting of multiple simultaneous versions (MVDs) or which contain overlapping markup.</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://multiversiondocs.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default'/><link rel='alternate' type='text/html' href='http://multiversiondocs.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><link rel='next' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default?start-index=26&amp;max-results=25'/><author><name>desmond</name><uri>http://www.blogger.com/profile/01722159590093138289</uri><email>noreply@blogger.com</email></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>36</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>25</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-4555078640999654611.post-3342341601161739986</id><published>2009-11-01T13:51:00.000-08:00</published><updated>2009-11-01T14:13:02.461-08:00</updated><title type='text'>C++ Version of nmerge</title><content type='html'>&lt;p&gt;One problem with the current design of nmerge is that it is written in Java. The commandline tool is a thin C wrapper around that, and if you want to process larger files you can't pass in arguments to increase available memory. 
So it simply fails on large files. And if you want to run it on servers that don't have, or won't allow, Java (true of many commercial hosting sites) you're out of luck. Since the Digital Variants people, and probably a large number of humanities projects, will have the same problems, I have decided to convert it to pure C++. This should be relatively easy, and the benefits are:&lt;/p&gt;
&lt;ol&gt;&lt;li&gt;Memory usage will be limited only by what is available on the machine, not by what is allocated to the Java Virtual Machine (JVM).&lt;/li&gt;&lt;li&gt;nmerge-c++ will be callable from PHP or another scripting language without requiring installation of a JVM.&lt;/li&gt;&lt;li&gt;nmerge can optionally write to a database instead of directly to disk. This is usually the only way you can save changes on a commercial hosting site.&lt;/li&gt;&lt;li&gt;The C++ version will use far less memory than the Java version and should be a bit faster.&lt;/li&gt;&lt;/ol&gt;
&lt;p&gt;Overall, these changes will facilitate the building of a practical web application or plugin, which can be added to existing sites. Initially, my intention is to produce a Joomla! plugin that other people can use.&lt;/p&gt;
&lt;p&gt;Some changes that will be possible in this revision include:&lt;/p&gt;
&lt;ol&gt;&lt;li&gt;Grouped transpositions. By assessing individual transposition candidates as a group it will be possible to detect larger transpositions that contain small corrections.&lt;/li&gt;&lt;li&gt;Proper multi-tasking of the merging process in C++ will hopefully speed up the algorithm considerably.&lt;/li&gt;&lt;/ol&gt;
&lt;p&gt;That's the plan. I thought I'd let you know where I'm taking this: the aim is to turn nmerge into a generally usable tool.&lt;/p&gt;
&lt;p&gt;There is at least one drawback, of course. C++ is cumbersome to write code in, compared to the relative heaven of Java. It's like painting a room with a brush instead of a roller.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4555078640999654611-3342341601161739986?l=multiversiondocs.blogspot.com'/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://multiversiondocs.blogspot.com/feeds/3342341601161739986/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=4555078640999654611&amp;postID=3342341601161739986' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default/3342341601161739986'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default/3342341601161739986'/><link rel='alternate' type='text/html' href='http://multiversiondocs.blogspot.com/2009/11/c-version-of-nmerge.html' title='C++ Version of nmerge'/><author><name>desmond</name><uri>http://www.blogger.com/profile/01722159590093138289</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='06159622047589651206'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4555078640999654611.post-4428730191377233423</id><published>2009-10-23T15:16:00.000-07:00</published><updated>2009-10-23T18:05:20.931-07:00</updated><title type='text'>Whoops!</title><content type='html'>&lt;p&gt;A favourite quotation of Edsger Dijkstra is that 'testing shows the presence, not the absence of bugs'. This is very true. 
In Australian homes too you can squash a few cockroaches and think you've got them all, but how do you know there isn't a whole colony hiding in the skirting boards? I'm guilty of putting in a '!' when I shouldn't have. My only explanation was that I was jetlagged in Montreal and fed up with preparing my presentation. For some reason I put in that 'not', which prevented nmerge from finding any left-side transpositions at all. All I can say is: 'Whoops!'&lt;/p&gt;
&lt;p&gt;I'll fix it in the next hour or so and upload the new version as 1.0.2, and update Alpha too. The transposition algorithm is not perfect - I never said it was, as you will see if you read the Balisage paper, particularly the end - but it is workable. One thing to keep in mind is that this program is unique in its field. Several people have written merging programs for humanistic texts, and a couple have even included transpositions (MEDITE, JNDiff), but only between &lt;em&gt;two&lt;/em&gt; texts at a time. I merge &lt;em&gt;N&lt;/em&gt; texts into one digital representation.&lt;/p&gt;
&lt;p&gt;One thing I'd like to do soon is make it find transpositions in groups (a flaw that Peter Robinson rightly pointed out). And it could be even faster, if I can work out how to parallelise the algorithm. That's why I 'built' this fancy i7 computer.&lt;/p&gt;
&lt;p&gt;The good thing about computing variants automatically rather than manually is that it is not final. Any improvements in the algorithm are immediately visible. Whereas making systematic changes to a manually coded set of texts with complex variants is &lt;em&gt;not&lt;/em&gt; trivial.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4555078640999654611-4428730191377233423?l=multiversiondocs.blogspot.com'/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://multiversiondocs.blogspot.com/feeds/4428730191377233423/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=4555078640999654611&amp;postID=4428730191377233423' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default/4428730191377233423'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default/4428730191377233423'/><link rel='alternate' type='text/html' href='http://multiversiondocs.blogspot.com/2009/10/whoops.html' title='Whoops!'/><author><name>desmond</name><uri>http://www.blogger.com/profile/01722159590093138289</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='06159622047589651206'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4555078640999654611.post-7159088014283086467</id><published>2009-09-01T04:50:00.001-07:00</published><updated>2009-09-01T04:55:01.257-07:00</updated><title type='text'>New Versions of nmerge and Alpha posted</title><content type='html'>&lt;p&gt;The difference here is that nmerge now includes the full source code, released under the GPL v3, and also contains a single example text that I can give 
away under the same license. It is the first scene of Shakespeare's King Lear. I have tried to make it as true to the source texts as I can but it's a lot of work getting markup to look like a manuscript. I never realised before how much the tags interfere with that. It's very annoying. Anyway, let me know if there are any mistakes. Or any ideas on how Alpha can be improved. I'm sure there are lots.&lt;/p&gt;
&lt;p&gt;Of course it is full of markup hacks, mainly lines split over speeches, but I couldn't fix that without introducing another layer for each MS. I'd prefer to use some technology other than markup for the content but there isn't one yet. Oh well!&lt;/p&gt;
&lt;p&gt;&lt;a href="http://code.google.com/p/multiversiondocs/downloads/list"&gt;Here's the link.&lt;/a&gt;&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4555078640999654611-7159088014283086467?l=multiversiondocs.blogspot.com'/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://multiversiondocs.blogspot.com/feeds/7159088014283086467/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=4555078640999654611&amp;postID=7159088014283086467' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default/7159088014283086467'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default/7159088014283086467'/><link rel='alternate' type='text/html' href='http://multiversiondocs.blogspot.com/2009/09/new-versions-of-nmerge-and-alpha-posted.html' title='New Versions of nmerge and Alpha posted'/><author><name>desmond</name><uri>http://www.blogger.com/profile/01722159590093138289</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='06159622047589651206'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4555078640999654611.post-4656383745883872749</id><published>2009-08-13T19:44:00.001-07:00</published><updated>2009-08-18T15:42:46.164-07:00</updated><title type='text'>Balisage Presentation 13 August 2009</title><content type='html'>&lt;p&gt;My talk at Balisage in Montreal went very much as planned. The slides took as long as I had paced them to last: 28 minutes. Then I did two software demonstrations, one of the nmerge commandline tool and another of the Alpha multi-version wiki. 
The former is more or less finished (though I keep tweaking it) and the Alpha wiki is about half done, but usable. There was time afterwards for a few questions. The best of these came from Fabio Vitali, who also works with Angelo Di Iorio on a diff calculating algorithm for edited XML texts. He convinced me after the talk that their method of computing diffs has some advantages over my simplistic greedy approach for XML texts. But my method I think is still a good fallback in the general case. I think the best thing is to try to incorporate the basic idea of their JNDiff algorithm, which is making the merging algorithm &lt;em&gt;optionally&lt;/em&gt; XML-aware, rather than try to use their code, which is not really open-source yet.&lt;/p&gt;
&lt;p&gt;I think the paper went down well because of the demos. No one else whose talk I saw presented any finished software. It was mostly work in progress - the usual conference fare. But reactions to it were not very critical. They had little to say, I think, because it was not about an application of XSLT or XQuery - their favourite tools. But the talk has at least exposed the MVD idea to a wider audience. There are no more excuses for &lt;em&gt;not&lt;/em&gt; mentioning it when discussing solutions to overlapping hierarchies.&lt;/p&gt;
&lt;p&gt;I received favourable comments from the upper reaches of the Balisage hierarchy which seemed genuine. And I am encouraged by that.&lt;/p&gt;
&lt;p&gt;I have updated nmerge with &lt;a href="http://code.google.com/p/multiversiondocs/downloads/list"&gt;the version I demonstrated at Montreal.&lt;/a&gt; Also there is a copy of the wiki in its current state, minus any MVDs. I can't use any of the usual examples because of copyright restrictions. So I'll have to create some of my own pretty soon.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4555078640999654611-4656383745883872749?l=multiversiondocs.blogspot.com'/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://multiversiondocs.blogspot.com/feeds/4656383745883872749/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=4555078640999654611&amp;postID=4656383745883872749' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default/4656383745883872749'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default/4656383745883872749'/><link rel='alternate' type='text/html' href='http://multiversiondocs.blogspot.com/2009/08/balisage-presentation-13-august-2009.html' title='Balisage Presentation 13 August 2009'/><author><name>desmond</name><uri>http://www.blogger.com/profile/01722159590093138289</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='06159622047589651206'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4555078640999654611.post-516643747070155549</id><published>2009-08-12T17:40:00.000-07:00</published><updated>2009-08-12T18:02:26.338-07:00</updated><title type='text'>The Biggest Advantage of Using MVDs</title><content type='html'>&lt;p&gt;I suddenly realised now I am here in 
Montreal preparing to defend my ideas against 100 experts from around the world that I have failed to notice all this time the biggest advantage of MVDs. And it is this: The alternative to computing the interrelations between multi-version texts can only be encoding them &lt;em&gt;manually.&lt;/em&gt; In speaking of the supposed advantages of standard XML tools what is often forgotten is the enormous human cost of training people to use markup, and getting them to encode it and check it against the originals. I know from experience that this is very expensive. We literally spent thousands of man-hours encoding variants in Wittgenstein. If we could have had a tool for doing that automatically, much of that time and money would have been saved.&lt;/p&gt;
&lt;p&gt;Another advantage of computing interrelations automatically is that it is so easy to get back what you put in, unmolested. Hand-encoded XML hard-wires the interconnections between versions, and getting back the original text can be a hard problem if you decide later to change to another technology. With nmerge I just press the "archive" button and it is done.&lt;/p&gt;
&lt;p&gt;If computers are good for anything they are good for saving human effort.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4555078640999654611-516643747070155549?l=multiversiondocs.blogspot.com'/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://multiversiondocs.blogspot.com/feeds/516643747070155549/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=4555078640999654611&amp;postID=516643747070155549' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default/516643747070155549'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default/516643747070155549'/><link rel='alternate' type='text/html' href='http://multiversiondocs.blogspot.com/2009/08/biggest-advantage-of-using-mvds.html' title='The Biggest Advantage of Using MVDs'/><author><name>desmond</name><uri>http://www.blogger.com/profile/01722159590093138289</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='06159622047589651206'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4555078640999654611.post-7845460932957899906</id><published>2009-07-26T17:52:00.000-07:00</published><updated>2009-07-28T03:16:15.857-07:00</updated><title type='text'>Alpha Prototype Ready</title><content type='html'>&lt;p&gt;I am renaming the multi-version wiki Alpha, simply because it's easier to say than Phaidros. It's a bit of a joke, really, because 'Alpha' was just the description of the product I developed for DH2008. It was the 'alpha' release of that.&lt;/p&gt;
&lt;p&gt;The old Alpha didn't do transpositions, and to remedy this deficiency I have been labouring hard for the past year. NMerge was revised to support transpositions, but I hadn't integrated it into the multi-version wiki. But when I finally saw the result of the new nmerge in the web browser, it was suddenly clear that there were still some bugs in the transposition algorithm. Finding out exactly &lt;em&gt;what&lt;/em&gt; was going wrong, though, took me about a week of solid debugging. But it is done now and I am finally satisfied. And now I have something to take to Montr&amp;eacute;al to show the audience. And I can say: 'Hey folks, you said this conference was all about theory, but here's something that &lt;em&gt;actually works.&lt;/em&gt;' I think that is a pretty good argument.&lt;/p&gt;
&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_GGwOcLYrsVk/Sm4xmHk37FI/AAAAAAAAAIM/V2BxTmDGPG8/s1600-h/Screenshot.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 550px; height: 90px;" src="http://4.bp.blogspot.com/_GGwOcLYrsVk/Sm4xmHk37FI/AAAAAAAAAIM/V2BxTmDGPG8/s1600/Screenshot.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5363278737183337554" /&gt;&lt;/a&gt;
&lt;p&gt;In this screendump of part of the TwinView of Galiano's 'El mapa de las aguas' you can see the transposition of 'otras de un hachazo' from after 'de un bocado rabioso' (in version B, left) to before (in version C, right). To consistently detect cases like this manually would be near to impossible.&lt;/p&gt;
&lt;p&gt;Red text is deleted in the left-hand version with respect to the version on the right. Blue text is inserted, and transpositions are shown in grey. Black text is merged and, like transpositions, clicking on it aligns the text on each side. This use of these simple features of HTML results in a surprisingly effective UI.&lt;/p&gt;
&lt;h4&gt;Character-Level vs Word-Level Alignment&lt;/h4&gt;
&lt;p&gt;The use of character-level alignment by default is new to this version. For example, the expression 'el molino chico' became 'el molino' through the deletion of the character sequence 'o chic'. This goes to show that what humans would expect &amp;ndash; the deletion of '&amp;nbsp;chico' &amp;ndash; and what the computer detects, don't always correspond. I don't think that is a bad thing. The alternative would be to fail to see changes of spelling such as 'desaparecido' for 'desparecido' or the capitalisation of 'Ojos' for 'ojos'. A word-level granularity would puzzle the reader while he/she tried to work out the difference. It is clearer to see small changes like these highlighted, so I agree with the MEDITE people that character-level alignment is more powerful. After all, you can always reduce character-level granularity to word-level but if you only have word-level alignment you are stuck with it.&lt;/p&gt;
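&lt;p&gt;To make the difference in granularity concrete, here is a minimal sketch in Python. It uses the standard library's difflib, not the nmerge algorithm, and the helper names (diff, char_diff, word_diff) are invented for this illustration. Run on whole words, the spelling change is reported as one word replacing another; run on characters, it is reported as the insertion of a single 'a'. Note that difflib reports the 'el molino chico' change as the deletion of ' chico', whereas nmerge found 'o chic': both edits produce the same result.&lt;/p&gt;
&lt;pre&gt;
```python
from difflib import SequenceMatcher


def diff(old, new):
    """Return the non-equal edit operations that turn old into new."""
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, old[i1:i2], new[j1:j2]))
    return changes


def char_diff(old, new):
    """Compare two strings character by character."""
    return diff(old, new)


def word_diff(old, new):
    """Compare two strings word by word (split on whitespace)."""
    return diff(old.split(), new.split())


if __name__ == "__main__":
    print(char_diff("desparecido", "desaparecido"))
    print(word_diff("desparecido", "desaparecido"))
    print(char_diff("ojos", "Ojos"))
    print(char_diff("el molino chico", "el molino"))
```
&lt;/pre&gt;
&lt;p&gt;A character-level result can always be widened to the enclosing words afterwards, but a word-level result cannot be narrowed down again without comparing the words a second time.&lt;/p&gt;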
&lt;p&gt;'Collation' programs based on XML use word-level granularity because a finer resolution would make the markup impossibly complex (you'd have to mark up each letter separately). That doesn't have to be a restriction once we abandon the print-oriented concept of 'apparatus.' For the digital medium, at least, a new digital presentation of variation is needed. Let it evolve.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4555078640999654611-7845460932957899906?l=multiversiondocs.blogspot.com'/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://multiversiondocs.blogspot.com/feeds/7845460932957899906/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=4555078640999654611&amp;postID=7845460932957899906' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default/7845460932957899906'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default/7845460932957899906'/><link rel='alternate' type='text/html' href='http://multiversiondocs.blogspot.com/2009/07/alpha-prototype-ready.html' title='Alpha Prototype Ready'/><author><name>desmond</name><uri>http://www.blogger.com/profile/01722159590093138289</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='06159622047589651206'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_GGwOcLYrsVk/Sm4xmHk37FI/AAAAAAAAAIM/V2BxTmDGPG8/s72-c/Screenshot.png' height='72' width='72'/><thr:total 
xmlns:thr='http://purl.org/syndication/thread/1.0'>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4555078640999654611.post-1989968327116034439</id><published>2009-07-02T15:27:00.000-07:00</published><updated>2009-07-02T15:40:59.681-07:00</updated><title type='text'>Interface 09 and Multi-Version Wiki</title><content type='html'>&lt;p&gt;We will be presenting a poster at &lt;a href="http://www.interface09.org.uk"&gt;Interface09&lt;/a&gt; at the University of Southampton. There will also be a demo of the multi-version wiki, which I hope will be an iteration further on from that presented at Oulu for Digital Humanities 2008. The new multi-version wiki is simply the old wiki with the new nmerge library added, but that includes support for transpositions, which is kind of important. It is a Jetty 6 based web application that runs inside your browser, and allows you to view and edit MVDs in a variety of intuitive ways.&lt;/p&gt;
&lt;h4&gt;Digital Variants Portal&lt;/h4&gt;
&lt;p&gt;Eventually the wiki will be broken up and integrated into the Digital Variants Website I am building. In this form the wiki will be a series of portlets inside a portal. Each portlet conforms to JSR 286 and is implemented in &lt;a href="http://portals.apache.org/jetspeed-2/"&gt;Jetspeed 2.&lt;/a&gt; A portal allows the user to configure his or her own interface on the web using the portlet components. It also promotes reuse of the portlets by other parties. We are going for broke with this design: I for one don't believe that deficient or obsolescent technology has any place in designs for the future. If we can build it, we will.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4555078640999654611-1989968327116034439?l=multiversiondocs.blogspot.com'/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://multiversiondocs.blogspot.com/feeds/1989968327116034439/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=4555078640999654611&amp;postID=1989968327116034439' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default/1989968327116034439'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default/1989968327116034439'/><link rel='alternate' type='text/html' href='http://multiversiondocs.blogspot.com/2009/07/interface-09-and-multi-version-wiki.html' title='Interface 09 and Multi-Version Wiki'/><author><name>desmond</name><uri>http://www.blogger.com/profile/01722159590093138289</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='06159622047589651206'/></author><thr:total 
xmlns:thr='http://purl.org/syndication/thread/1.0'>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4555078640999654611.post-5025115656802956367</id><published>2009-06-05T01:57:00.000-07:00</published><updated>2009-06-05T13:44:37.285-07:00</updated><title type='text'>nmerge 1.0 posted</title><content type='html'>&lt;p&gt;OK, I've posted the first &lt;a href="http://code.google.com/p/multiversiondocs/downloads/list"&gt;BETA version of nmerge&lt;/a&gt; for UNIX/Linux/OSX only. I'll add a Windows installer as soon as I can get around to it. Of course I expect it to go wrong immediately, even though I have tested it thoroughly. But I can only really gather more information by trying it on other files. And it comes currently with &lt;em&gt;no&lt;/em&gt; example files. &lt;/p&gt;
&lt;p&gt;Some basic installation instructions for the non-GNU aficionados:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Download the nmerge-1.0.tar.gz file using the link above&lt;/li&gt;
&lt;li&gt;Open a terminal window, navigate to the folder containing the downloaded file and unpack it
   using &lt;code&gt;tar xzf nmerge-1.0.tar.gz&lt;/code&gt;, or just double-click on it if you have a Mac&lt;/li&gt;
&lt;li&gt;In the terminal window type &lt;code&gt;cd nmerge-1.0&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;./configure&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;make&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;sudo make install&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;You should now have a command "&lt;code&gt;nmerge&lt;/code&gt;". If it complains about Java, make sure you have a valid JRE installed. It must be at least version 1.5.0 (1.4.2 is no good). To find out which version you have, type &lt;code&gt;java -version&lt;/code&gt; in the terminal window. If necessary, download a more recent JRE from &lt;a href="http://java.sun.com/javase/downloads/index.jsp"&gt;Sun.&lt;/a&gt; (You only need the JRE, not the JDK, unless you also want to develop Java software.) If it still doesn't work you have an issue that you should post on &lt;a href="http://code.google.com/p/multiversiondocs/issues/list"&gt;Google code&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The first update will contain the source code and documentation. I left it out because of my inexperience with GNU automake.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4555078640999654611-5025115656802956367?l=multiversiondocs.blogspot.com'/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://multiversiondocs.blogspot.com/feeds/5025115656802956367/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=4555078640999654611&amp;postID=5025115656802956367' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default/5025115656802956367'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default/5025115656802956367'/><link rel='alternate' type='text/html' href='http://multiversiondocs.blogspot.com/2009/06/nmerge-10-posted.html' title='nmerge 1.0 posted'/><author><name>desmond</name><uri>http://www.blogger.com/profile/01722159590093138289</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='06159622047589651206'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4555078640999654611.post-5190496177537744994</id><published>2009-06-02T15:45:00.000-07:00</published><updated>2009-06-02T15:51:27.631-07:00</updated><title type='text'>Balisage Paper Accepted</title><content type='html'>&lt;p&gt;My Balisage paper about how to create and edit MVD files has been accepted. I have already bought the flight tickets and registered, so I will be going to Montreal on August 11-14. That's the other side of the world for me and I think I must be mad. 
But this is the only way to properly air the MVD concept and get some reactions from the people most likely to field valid objections. If they clear it, then I think that will vindicate it as far as it can be at this stage. The draft paper is &lt;a href="http://www.itee.uq.edu.au/~schmidt/_articles/balisagepaper.zip"&gt;here&lt;/a&gt;, although it is rather technical. I will post my simplified slide show when I have it.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4555078640999654611-5190496177537744994?l=multiversiondocs.blogspot.com'/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://multiversiondocs.blogspot.com/feeds/5190496177537744994/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=4555078640999654611&amp;postID=5190496177537744994' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default/5190496177537744994'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default/5190496177537744994'/><link rel='alternate' type='text/html' href='http://multiversiondocs.blogspot.com/2009/06/balisage-paper-accepted.html' title='Balisage Paper Accepted'/><author><name>desmond</name><uri>http://www.blogger.com/profile/01722159590093138289</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='06159622047589651206'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4555078640999654611.post-1582700898342146904</id><published>2009-06-02T03:13:00.000-07:00</published><updated>2009-06-02T03:20:33.878-07:00</updated><title type='text'>Out of the Tunnel</title><content type='html'>&lt;p&gt;Well, it all 
works. Now I just have to build an installable package for it. To be honest I don't think many if any people will want to use nmerge. It's too user unfriendly because it has no real user interface. People want a GUI these days, and nmerge is designed to be the Swiss army knife for whatever GUI you might want to put on top of it. Nevertheless I will post it as soon as possible with a GNU type installer and maybe a Windows one if that is not too hard (perhaps using Nullsoft). The main point is that a milestone has been reached: the MVD file format is born. (Hooray!)&lt;/p&gt;
&lt;p&gt;After that it will be time to add my own GUI, which is just an updating of the Phaidros wiki which has lain untouched for nearly a year now. It is time to update it with some killer features: e.g. Tree View, which will show the genealogy of a set of versions via a graphical tree which you can configure and regenerate according to taste. I have some other ideas too which can be blended in gradually.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4555078640999654611-1582700898342146904?l=multiversiondocs.blogspot.com'/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://multiversiondocs.blogspot.com/feeds/1582700898342146904/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=4555078640999654611&amp;postID=1582700898342146904' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default/1582700898342146904'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default/1582700898342146904'/><link rel='alternate' type='text/html' href='http://multiversiondocs.blogspot.com/2009/06/out-of-tunnel.html' title='Out of the Tunnel'/><author><name>desmond</name><uri>http://www.blogger.com/profile/01722159590093138289</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='06159622047589651206'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4555078640999654611.post-1365632653568191720</id><published>2009-05-28T04:50:00.000-07:00</published><updated>2009-06-02T03:16:57.740-07:00</updated><title type='text'>The Light at the End of the Tunnel</title><content type='html'>&lt;p&gt;Well I 
&lt;em&gt;finally&lt;/em&gt; got 'compare' to work properly. The delay was caused by having to redesign the 'chunking' mechanism that delivers the text back to the browser as a series of blocks with all the same characteristics. So all the deleted text can be made red, and the inserted blue and the merged black. And the user can click on the black text and be taken to the corresponding part of the compared text. Very important, but also very tricky to get absolutely right. And in this version I had to allow for transpositions, and they are even more complicated. But now at last it works. I will post the project on Google Code in the morning, because I am too tired now.&lt;/p&gt;
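&lt;p&gt;To make the idea of chunking concrete, here is a minimal Python sketch. It is not nmerge's code: it uses the standard difflib module in place of the MVD comparison, it ignores transpositions entirely, and the function name chunk and the state labels are invented for illustration. All it shows is how a comparison can be delivered as a series of blocks that each have a single state, ready to be coloured.&lt;/p&gt;
&lt;pre&gt;
```python
from difflib import SequenceMatcher

# Colour for each block state, following the scheme described above.
COLOURS = {"deleted": "red", "inserted": "blue", "merged": "black"}


def chunk(base, other):
    """Split a comparison of two versions into blocks that share one state.

    Returns a list of (state, text) pairs. This is an illustrative stand-in
    built on difflib, not the nmerge algorithm, and it knows nothing about
    transpositions.
    """
    chunks = []
    matcher = SequenceMatcher(None, base, other, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Text common to both versions.
            chunks.append(("merged", base[i1:i2]))
        else:
            # A replace is reported as a deletion followed by an insertion.
            if i1 != i2:
                chunks.append(("deleted", base[i1:i2]))
            if j1 != j2:
                chunks.append(("inserted", other[j1:j2]))
    return chunks


for state, text in chunk("the quick brown fox", "the slow brown fox"):
    print(COLOURS[state], repr(text))
```
&lt;/pre&gt;
&lt;p&gt;Joining the merged and deleted blocks gives back the first version, and joining the merged and inserted blocks gives the second, which is a handy sanity check for any chunking scheme.&lt;/p&gt;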
&lt;p&gt;&lt;table&gt;&lt;tr&gt;&lt;td bgcolor="green"&gt;usage&lt;/td&gt;&lt;td bgcolor="green"&gt;create&lt;/td&gt;&lt;td bgcolor="green"&gt;help&lt;/td&gt;&lt;td bgcolor="green"&gt;add&lt;/td&gt;&lt;td bgcolor="green"&gt;del&lt;/td&gt;&lt;td bgcolor="green"&gt;desc&lt;/td&gt;&lt;td bgcolor="green"&gt;arch&lt;/td&gt;&lt;td bgcolor="green"&gt;unarch&lt;/td&gt;&lt;td bgcolor="green"&gt;export&lt;/td&gt;&lt;td bgcolor="green"&gt;import&lt;/td&gt;&lt;td bgcolor="green"&gt;update&lt;/td&gt;&lt;td bgcolor="green"&gt;read&lt;/td&gt;&lt;td bgcolor="green"&gt;list&lt;/td&gt;&lt;td bgcolor="green"&gt;comp&lt;/td&gt;&lt;td bgcolor="green"&gt;find&lt;/td&gt;&lt;td bgcolor="green"&gt;vars&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/p&gt;
&lt;h4&gt;Several Days Later ...&lt;/h4&gt;
&lt;p&gt;Almost done testing the code. Just a few minor problems with find (again) and variants. The latter could be quite a useful feature in the GUI. For example, selecting a piece of text could conceivably show its variants dynamically in a sub-window at the bottom. I favour an in-line solution using popup text, but that will have to wait. This feature should demonstrate that we don't need to 'collate' separate physical versions any longer to get this information.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4555078640999654611-1365632653568191720?l=multiversiondocs.blogspot.com'/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://multiversiondocs.blogspot.com/feeds/1365632653568191720/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=4555078640999654611&amp;postID=1365632653568191720' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default/1365632653568191720'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default/1365632653568191720'/><link rel='alternate' type='text/html' href='http://multiversiondocs.blogspot.com/2009/05/light-at-end-of-tunnel.html' title='The Light at the End of the Tunnel'/><author><name>desmond</name><uri>http://www.blogger.com/profile/01722159590093138289</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='06159622047589651206'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4555078640999654611.post-1228803817033945809</id><published>2009-05-14T16:22:00.000-07:00</published><updated>2009-06-02T15:03:55.529-07:00</updated><title 
type='text'>HyperNietzsche vs MVD</title><content type='html'>&lt;p&gt;I decided after all to make some general remarks about the recently proposed &lt;a href="http://wiki.tei-c.org/index.php/Genetic_Editions"&gt;'Encoding Model for Genetic Editions'&lt;/a&gt; being promoted by the HyperNietzsche people and the TEI. Since this is being put forward as a rival solution for a small subset of multi-version texts covered by my solution, I thought that readers of this blog might like to know the main reasons why I think that the MVD technology is much the better of the two.&lt;/p&gt;
&lt;h4&gt;One Work = One Text&lt;/h4&gt;
&lt;p&gt;Because it is difficult to record many versions in one file using markup, the proposal recommends a document-centric approach. In this method each physical document is encoded separately, even when they are just drafts of the one text. As a result there is a great deal of redundancy in their representation. They interconnect the variants between documents by means of links which are weighted with a probability, and they see in this their main advantage over MVD. But this is based purely on a misunderstanding of the MVD model. The weights can of course be encoded in the version information of the MVD as user-constructed paths. We can have an 80% probable version and a 20% probable version just as well as physical versions.&lt;/p&gt;
&lt;p&gt;Actually I think it is wrong to encode &lt;em&gt;one transcriber's opinion&lt;/em&gt; about the probability that a certain combination of variants is 'correct'. A transcription should just record the text and any interpretations should be kept separate. How else can it be shared? The display of alternative paths is a task for the software, mediated by the user's preferences.&lt;/p&gt;
&lt;p&gt;The main disadvantage in having multiple copies of the same text is that every subsequent operation on the text has to reestablish or maintain the connections between bits that are supposed to be the same. You thus have much more work to do than in an MVD. I believe that &lt;em&gt;text that is the same across versions should literally be the same text.&lt;/em&gt; This simplifies the whole approach to multi-version texts. I also don't believe that humanists want to maintain complex markup that essentially records interconnections between versions, when this same information can be recorded automatically as simple identity.&lt;/p&gt;
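&lt;p&gt;A rough Python sketch of what 'literally the same text' means in practice. This is only an illustration in the spirit of an MVD, not the actual file format: the version names A, B and C, the sample text and both function names are made up. Each fragment is stored once and tagged with the set of versions it belongs to, so identity between versions never has to be recorded by links.&lt;/p&gt;
&lt;pre&gt;
```python
# Each fragment is stored once, tagged with the set of versions that contain it.
# Version names and sample text are invented for illustration only.
FRAGMENTS = [
    ({"A", "B", "C"}, "The quick "),
    ({"A"}, "brown"),
    ({"B", "C"}, "red"),
    ({"A", "B", "C"}, " fox jumps"),
    ({"C"}, " high"),
]


def read_version(fragments, version):
    """Reassemble one version by following the fragments that carry its name."""
    return "".join(text for versions, text in fragments if version in versions)


def shared_text(fragments, v1, v2):
    """Text that two versions share is simply the fragments tagged with both."""
    return [text for versions, text in fragments if v1 in versions and v2 in versions]


for name in ("A", "B", "C"):
    print(name, repr(read_version(FRAGMENTS, name)))
print(shared_text(FRAGMENTS, "A", "B"))
```
&lt;/pre&gt;
&lt;p&gt;Nothing here has to 'reestablish' a connection between versions: what A and B share is read straight off the version sets.&lt;/p&gt;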
&lt;h4&gt;OHCO Thesis Redux&lt;/h4&gt;
&lt;p&gt;The section on 'grouping changes' implies that manuscript texts have a structure that can be broken down into a hierarchy of changes that can be conveniently grouped and nested arbitrarily. Similarly, in section 4.1 a strict hierarchy is imposed consisting of document-&gt;writing surface-&gt;zone-&gt;line. Since Barnard's 1988 paper, which pointed out the inherent failure of markup to adequately represent a simple case of nested speeches and lines in Shakespeare (sometimes a line was spread over two speeches), the problem of overlap has become the dominant issue in the digital encoding of historical texts. This representation seeks to reassert the OHCO thesis, which has been withdrawn by its own authors, and it will fail to adequately represent these genetic texts until it is recognised that they are fundamentally non-hierarchical. The last 20 years of research cannot simply be ignored. It is no longer possible to propose something for the future that does not address the overlap problem. And MVD neatly disposes of that.&lt;/p&gt;
&lt;h4&gt;Collation of XML Texts&lt;/h4&gt;
&lt;p&gt;I am also curious as to how they propose to 'collate' XML documents arranged in this structure, especially when the variants are distributed via two mechanisms: as markup in individual files and also as links between documentary versions. Collation programs work by comparing basically plain text files, containing only light markup for references in COCOA or empty XML elements (as in the case of Juxta). The virtual absence of collation programs able to process arbitrary XML renders this proposal at least very difficult to achieve. It would be better if a purely digital representation of the text were the objective, since in this case, an apparatus would not be needed.&lt;/p&gt;
&lt;h4&gt;Transpositions&lt;/h4&gt;
&lt;p&gt;The mechanism for transposition as described also sounds infeasible. It is unclear what is meant by the proposed standoff mechanism. However, if this allows chunks of transposed text to be moved around, it will fail if the chunks contain non-well-formed markup or if the schema does not permit that markup at the destination. Also, if transpositions between physical versions are allowed - and this actually comprises the majority of cases - how is such a mechanism to work, especially when transposed chunks may well overlap?&lt;/p&gt;
&lt;h4&gt;Simplicity = Limited Scope&lt;/h4&gt;
&lt;p&gt;Much is made in the supporting documentation of the HyperNietzsche Markup Language (HNML) and 'GML' (Genetic Markup Language) of the greater simplicity of the proposed encoding schemes. Clearly, the more general an encoding scheme, the less succinct it is going to be. Since the proposal is to incorporate the encoding model for genetic editions into TEI, this advantage will surely be lost. In any case there seems very little in the proposal that cannot already be encoded as well (or as poorly, depending on your point of view) in the TEI Guidelines as they now stand.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4555078640999654611-1228803817033945809?l=multiversiondocs.blogspot.com'/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://multiversiondocs.blogspot.com/feeds/1228803817033945809/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=4555078640999654611&amp;postID=1228803817033945809' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default/1228803817033945809'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default/1228803817033945809'/><link rel='alternate' type='text/html' href='http://multiversiondocs.blogspot.com/2009/05/genetic-editions.html' title='HyperNietzsche vs MVD'/><author><name>desmond</name><uri>http://www.blogger.com/profile/01722159590093138289</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='06159622047589651206'/></author><thr:total 
xmlns:thr='http://purl.org/syndication/thread/1.0'>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4555078640999654611.post-6864692459326419027</id><published>2009-05-08T03:48:00.000-07:00</published><updated>2009-05-28T04:56:13.182-07:00</updated><title type='text'>A Slight Delay in a Good Cause</title><content type='html'>&lt;p&gt;OK, I'm not finished when I said I would be, but software is like that. Sorry. I decided that in order to really test the program properly I should have a complete test suite that I can run after making any changes to make sure that everything in the release is OK. Well, when I say 'make sure', a test can only tell you that a bug is present, not that there are none. But that's a lot better than letting the user find them. If I release something that is incomplete or not fully tested then I know the sceptics will attack the flaws. They will say 'See, it doesn't work, I told you so!' I can't afford that, so I have to be careful. So far I have tests for 14 out of 16 commands.&lt;/p&gt;
&lt;p&gt;I also added an unarchive command to go with the archive command. With 'archive' users can save an MVD as a set of versions in a folder, &lt;em&gt;plus&lt;/em&gt; a small XML file instructing nmerge how to reassemble them into an MVD. This contains all the version and group information etc. So if you don't believe the MVD format will last, &lt;em&gt;it doesn't matter.&lt;/em&gt; You always have the archive and that is in whatever format the original files were in. A user could even construct such an archive manually. The 'unarchive' command takes this archive and builds an MVD from it in one step. &lt;/p&gt;
&lt;p&gt;Here's a progress bar for the tests. Green means there is a test routine and it passes. Yellow means there is a test routine but it doesn't pass yet. Red means there is no test routine and I don't know for sure if it works, but it might. There was an intermittent problem with update, but this is now fixed.&lt;/p&gt;
&lt;table&gt;&lt;tr&gt;&lt;td bgcolor="green"&gt;usage&lt;/td&gt;&lt;td bgcolor="green"&gt;create&lt;/td&gt;&lt;td bgcolor="green"&gt;help&lt;/td&gt;&lt;td bgcolor="green"&gt;add&lt;/td&gt;&lt;td bgcolor="green"&gt;del&lt;/td&gt;&lt;td bgcolor="green"&gt;desc&lt;/td&gt;&lt;td bgcolor="green"&gt;arch&lt;/td&gt;&lt;td bgcolor="green"&gt;unarch&lt;/td&gt;&lt;td bgcolor="green"&gt;export&lt;/td&gt;&lt;td bgcolor="green"&gt;import&lt;/td&gt;&lt;td bgcolor="green"&gt;update&lt;/td&gt;&lt;td bgcolor="green"&gt;read&lt;/td&gt;&lt;td bgcolor="green"&gt;list&lt;/td&gt;&lt;td bgcolor="green"&gt;comp&lt;/td&gt;&lt;td bgcolor="red"&gt;find&lt;/td&gt;&lt;td bgcolor="red"&gt;vars&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;I'm going for a beta version with this release. I think it's good enough.&lt;/p&gt;
&lt;p&gt;OK now there's a project on &lt;a href="http://code.google.com/p/multiversiondocs/"&gt;Google code&lt;/a&gt;. I must say it was much easier than creating a Sourceforge project. They wanted me to write an epic about it and even then I had to wait 1-3 days for their royal approval. On Google code it was instant. Cool.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4555078640999654611-6864692459326419027?l=multiversiondocs.blogspot.com'/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://multiversiondocs.blogspot.com/feeds/6864692459326419027/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=4555078640999654611&amp;postID=6864692459326419027' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default/6864692459326419027'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default/6864692459326419027'/><link rel='alternate' type='text/html' href='http://multiversiondocs.blogspot.com/2009/05/slight-delay.html' title='A Slight Delay in a Good Cause'/><author><name>desmond</name><uri>http://www.blogger.com/profile/01722159590093138289</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='06159622047589651206'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4555078640999654611.post-2817768114281872489</id><published>2009-04-30T16:32:00.001-07:00</published><updated>2009-05-20T16:49:31.627-07:00</updated><title type='text'>Nmerge tool code-complete</title><content type='html'>&lt;p&gt;The nmerge commandline tool is now code-complete. I guess it's a 'pre-alpha' version. 
Since this is a revision of a previous working version, though, testing should not take too long. I would estimate that, after the Labor day weekend (Monday 4th May) I should have an alpha-version. But with software you never know. This version supports the new merging algorithm from the submitted &lt;a href="http://www.itee.uq.edu.au/~schmidt/_articles/balisagepaper.zip"&gt;Balisage 2009 paper,&lt;/a&gt; which works pretty well.&lt;/p&gt;
&lt;p&gt;Nmerge is also a Java library that can be used from within a Java application, like the Phaidros wiki, to provide support for Multi-Version Documents. Once it has stabilised I will rewrite it as a C++ commandline tool. But for now we have to put up with a slightly more cumbersome syntax. Here is the "usage" statement produced by the program so you can get some idea of what it does. Once it is reasonably well tested I will put the source code on SourceForge under the GPL v3.&lt;/p&gt;
&lt;p&gt;The command syntax is a bit complicated, but so is what it is trying to do. I envisage that this tool could be used in a shell or commandline script to automate, say, the construction of an MVD from a set of files. At least that's what &lt;em&gt;I&lt;/em&gt; use it for. In any case the -h option prints out an example or two of how to use each command. The -c option specifies the command you want to perform on the MVD, and the other arguments are the parameters that the command uses, provided they make sense. If they don't, you'll get an error message.&lt;/p&gt;
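&lt;p&gt;As a hedged illustration of that kind of scripting, here is a small Python sketch that only &lt;em&gt;builds&lt;/em&gt; the command lines for constructing an MVD from a set of files; it does not run them. The option letters are taken from the usage statement, but which options each command actually requires is an assumption on my part here (check with -h), and the file names, sigla and function names are invented.&lt;/p&gt;
&lt;pre&gt;
```python
import shlex

# Base invocation, as given in the usage statement.
NMERGE = ["java", "-jar", "nmerge.jar"]


def create_cmd(mvd):
    """Command line to create a new empty MVD (assumes -m is all it needs)."""
    return NMERGE + ["-c", "create", "-m", mvd]


def add_cmd(mvd, textfile, shortname, longname, group):
    """Command line to add one text file as a new version.

    The particular combination of options is an assumption, not documented
    behaviour; the tool's own -h output is the authority.
    """
    return NMERGE + ["-c", "add", "-m", mvd, "-t", textfile,
                     "-s", shortname, "-l", longname, "-g", group]


def build_script(mvd, versions):
    """Return shell-quoted command lines: one create, then one add per file."""
    commands = [create_cmd(mvd)]
    for textfile, shortname, longname, group in versions:
        commands.append(add_cmd(mvd, textfile, shortname, longname, group))
    return [shlex.join(command) for command in commands]


# Hypothetical input files and sigla.
VERSIONS = [
    ("draft1.txt", "D1", "First draft", "drafts"),
    ("draft2.txt", "D2", "Second draft", "drafts"),
]

for line in build_script("work.mvd", VERSIONS):
    print(line)
```
&lt;/pre&gt;
&lt;p&gt;Keeping each command as a list of arguments and quoting it only at the end means long names with spaces survive the trip through the shell.&lt;/p&gt;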
&lt;p&gt;With the nmerge tool MVD becomes a real format. There's no GUI because if I added one, you couldn't take it away and put in your own. If you need one, wait for Phaidros.&lt;/p&gt;
&lt;pre&gt;
usage: java -jar nmerge.jar [-c command] [-a archive] [-b backup] 
     [-d description] [-e encoding] [-f string] [-g group] [-h command] 
     [-k length] [-l longname] [-m MVD] [-n mask] [-o offset] [-p]
     [-s shortname] [-t textfile] [-v version] [-w with] [-x XMLfile]
     [-?] 

-a archive - folder to use with archive and unarchive commands
-b backup - the version number of a backup (for partial versions)
-c command - operation to perform. One of:
     add - add the specified version to the MVD
     archive - save MVD in a folder as a set of separate versions
     compare - compare specified version 'with' another version
     create - create a new empty MVD
     description - print or change the MVD's description string
     delete - delete specified version from the MVD
     export - export the MVD as XML
     find - find specified text in all versions or in specified version
     import - convert XML file to MVD
     list - list versions and groups
     read - print specified version to standard out
     update - replace specified version with contents of textfile
     unarchive - convert an MVD archive into an MVD
     variants - find variants of specified version, offset and length
-d description - specified when setting/changing the MVD description
-e encoding - the encoding of the version's text e.g. UTF-8
-f string - to be found (used with command find)
-g group - name of group for new version
-h command - print example for command
-k length - find variants of this length in the base version's text
-l longname - the long name/description of the new version (quoted)
-m MVD - the MVD file to create/update
-n mask - mask out which kind of data in new mvd: none, xml or text
-o offset - in given version to look for variants
-p - specified version is partial
-s shortname - short name or siglum of specified version
-t textfile - the text file to add to/update in the MVD
-v version - number of version for command (starting from 1)
-w with - another version to compare with version
-x XML - the XML file to export or import
-? - print this message
&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4555078640999654611-2817768114281872489?l=multiversiondocs.blogspot.com'/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://multiversiondocs.blogspot.com/feeds/2817768114281872489/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=4555078640999654611&amp;postID=2817768114281872489' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default/2817768114281872489'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default/2817768114281872489'/><link rel='alternate' type='text/html' href='http://multiversiondocs.blogspot.com/2009/04/nmerge-tool-code-complete.html' title='Nmerge tool code-complete'/><author><name>desmond</name><uri>http://www.blogger.com/profile/01722159590093138289</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='06159622047589651206'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4555078640999654611.post-4405738502512958482</id><published>2009-04-23T15:13:00.000-07:00</published><updated>2009-04-24T12:13:45.985-07:00</updated><title type='text'>MVDs in binary or XML?</title><content type='html'>&lt;p&gt;A pattern is emerging in the effect that the MVD concept is having on people. They take on board its power at representing variation but they don't like the idea of representing the data in binary form. Instead they think it is possible to represent variation in some form of XML. So far I've heard proposals to use TEI-XML, RDF or GraphML. 
It's tempting, of course, to carry on using XML when this is the tool we are all most familiar with. However, my point of developing the MVD format was precisely to get around the limitations of all forms of markup. You can't represent a variant graph in XML satisfactorily if the text you are recording the variation of is itself XML &amp;ndash; and it usually is. The reason is that you can't represent cases where the markup itself varies: for example the deletion of a paragraph break:&lt;/p&gt;
&lt;pre&gt;
&amp;lt;del&amp;gt;&amp;lt;/p&amp;gt;&amp;lt;p&amp;gt;&amp;lt;/del&amp;gt;???
&lt;/pre&gt;
&lt;p&gt;Of course there are hacks to get around this particular case but they have negative consequences. What you end up doing is &lt;em&gt;modifying the markup to accommodate weaknesses in the representational power of markup itself.&lt;/em&gt; I think that is a fundamentally flawed strategy. It is just another form of putting presentational information into markup that is supposed to be generic. If you try to represent variation in a set of texts or in one text using markup you very quickly run up against the problem of overlap. And markup is very poor at representing that, as we all know. The only way to completely get around the overlap problem is to represent variation using a non-markup based technology. That's the whole point of MVDs, and it doesn't seem to have been widely acknowledged yet.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4555078640999654611-4405738502512958482?l=multiversiondocs.blogspot.com'/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://multiversiondocs.blogspot.com/feeds/4405738502512958482/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=4555078640999654611&amp;postID=4405738502512958482' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default/4405738502512958482'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default/4405738502512958482'/><link rel='alternate' type='text/html' href='http://multiversiondocs.blogspot.com/2009/04/common-misunderstanding-of-mvd.html' title='MVDs in binary or XML?'/><author><name>desmond</name><uri>http://www.blogger.com/profile/01722159590093138289</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' 
value='06159622047589651206'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4555078640999654611.post-3176054187289386672</id><published>2009-04-05T00:32:00.000-07:00</published><updated>2009-04-05T14:24:10.770-07:00</updated><title type='text'>MergeTester released</title><content type='html'>&lt;p&gt;For the thesis I wrote &lt;a href="http://www.itee.uq.edu.au/~schmidt/downloads.html"&gt;MergeTester&lt;/a&gt;, a simple utility that implements the merging algorithm from chapter 5. Although not a practical program, it does demonstrate how the program works and allows the user to test it on folders of versions in any format. It builds up a variant graph of the versions and prints them out one arc at a time. From the printout the user could manually reconstruct the graph or part of it.&lt;/p&gt;
&lt;p&gt;The advantage of the program lies in the fact that the way it works is not obscured by any other code and it does not depend on 3rd party libraries. Any comments and reports of bugs found will be gratefully received!&lt;/p&gt;
&lt;p&gt;At the moment I am incorporating it into nmerge, which will also be released shortly. Nmerge can convert a variant graph into an MVD, so the merging algorithm will then become practical.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4555078640999654611-3176054187289386672?l=multiversiondocs.blogspot.com'/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://multiversiondocs.blogspot.com/feeds/3176054187289386672/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=4555078640999654611&amp;postID=3176054187289386672' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default/3176054187289386672'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default/3176054187289386672'/><link rel='alternate' type='text/html' href='http://multiversiondocs.blogspot.com/2009/04/mergetester-released.html' title='MergeTester released'/><author><name>desmond</name><uri>http://www.blogger.com/profile/01722159590093138289</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='06159622047589651206'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4555078640999654611.post-843746255674046858</id><published>2009-03-18T19:16:00.000-07:00</published><updated>2009-04-04T20:03:56.175-07:00</updated><title type='text'>Final Version of Multi-Version Documents Paper Published by Elsevier</title><content type='html'>&lt;p&gt;&lt;a href="http://dx.doi.org/10.1016/j.ijhcs.2009.02.001"&gt;The final version of my MVD paper&lt;/a&gt; has now appeared online. This hyperlink is permanent and can be used in citations. 
The paper reference is Schmidt, D. and Colomb, R., 2009. A data structure for representing multi-version texts online, &lt;i&gt;International Journal of Human-Computer Studies,&lt;/i&gt; 67.6, 497-514.&lt;/p&gt;
&lt;h3&gt;Thesis Submission&lt;/h3&gt;
&lt;p&gt;Also I have now submitted my thesis. The final title was 'Multiple Versions and Overlap in Digital Text'. Here's the abstract:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;This thesis is unusual in that it tries to solve a problem that exists between two widely separated disciplines: the humanities (and to some extent also linguistics) on the one hand and information science on the other.  
&lt;/p&gt;&lt;p&gt;
Chapter 1 explains why it is essential to strike a balance between study of the solution and problem domains.
&lt;/p&gt;&lt;p&gt;
Chapter 2 surveys the various models of cultural heritage text, starting in the remote past, through the coming of the digital era to the present. It establishes why current models are outdated and need to be revised, and also what significance such a revision would have.
&lt;/p&gt;&lt;p&gt;
Chapter 3 examines the history of markup in an attempt to trace how inadequacies of representation arose. It then examines two major problems in cultural heritage and linguistics digital texts: overlapping hierarchies and textual variation. It assesses previously proposed solutions to both problems and explains why they are all inadequate. It argues that overlapping hierarchies is a subset of the textual variation problem, and also why markup cannot be the solution to either problem.
&lt;/p&gt;&lt;p&gt;
Chapter 4 develops a new data model for representing cultural heritage and linguistics texts, called a 'variant graph', which separates the natural overlapping structures from the content. It develops a simplified list-form of the graph that scales well as the number of versions increases. It also describes the main operations that need to be performed on the graph and explores their algorithmic complexities.
&lt;/p&gt;&lt;p&gt;
Chapter 5 draws on research in bioinformatics and text processing to develop a greedy algorithm that aligns &lt;i&gt;n&lt;/i&gt; versions with non-overlapping block transpositions in &lt;i&gt;O(MN)&lt;/i&gt; time in the worst case, where &lt;i&gt;M&lt;/i&gt; is the size of the graph and &lt;i&gt;N&lt;/i&gt; is the length of the new version being added or updated. It shows how this algorithm can be applied to texts in corpus linguistics and the humanities, and tests an implementation of the algorithm on a variety of real-world texts.&lt;/p&gt;&lt;/blockquote&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4555078640999654611-843746255674046858?l=multiversiondocs.blogspot.com'/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://multiversiondocs.blogspot.com/feeds/843746255674046858/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=4555078640999654611&amp;postID=843746255674046858' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default/843746255674046858'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default/843746255674046858'/><link rel='alternate' type='text/html' href='http://multiversiondocs.blogspot.com/2009/03/final-version-of-multi-version.html' title='Final Version of Multi-Version Documents Paper Published by Elsevier'/><author><name>desmond</name><uri>http://www.blogger.com/profile/01722159590093138289</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='06159622047589651206'/></author><thr:total 
xmlns:thr='http://purl.org/syndication/thread/1.0'>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4555078640999654611.post-7791140286533517335</id><published>2009-03-10T13:47:00.000-07:00</published><updated>2009-03-11T00:29:41.542-07:00</updated><title type='text'>MVD is Not a Replacement for Markup</title><content type='html'>&lt;p&gt;Some people still think of MVD as a &lt;em&gt;replacement&lt;/em&gt; for markup. It isn't. It &lt;em&gt;complements&lt;/em&gt; markup systems or any technology that can represent content. As I said in the main page &lt;a href="http://multiversiondocs.blogspot.com/2008/03/whats-multi-version-document.html"&gt;What's a Multi-Version Document?&lt;/a&gt; an MVD represents the overlapping structure of a set of versions or markup perspectives. It doesn't need to represent any of the detail of the content, which is the responsibility of the markup.&lt;/p&gt;
&lt;p&gt;I realise that it's easy, and natural, to seek to dismiss radical ideas simply because they are radical. The difference in this case is that MVD is a technology that definitely works. It's not all that radical anyway. Consider the direction in which multiple-sequence alignment is going in biology. Researchers there have also realised that the best way to represent multi-version genomes or protein sequences is via a directed graph (e.g. Raphael et al., 2004. A novel method for multiple alignment of sequences with repeated and shuffled elements, Genome Research, 14, 2336-2346). I prefer to think of that idea as parallel to mine, and their 'A-Bruijn' graph is rather different from my MVD, &lt;em&gt;but it represents the same kind of data in much the same way&lt;/em&gt;. Acceptance that this basic idea can also be applied to texts in humanities and linguistics is just a matter of time.&lt;/p&gt;
&lt;h3&gt;The Inadequacy of Markup&lt;/h3&gt;
&lt;p&gt;If markup is adequate for linguistics texts, why is it that every year someone thinks up a new way to manipulate markup systems to try to represent overlap? If it were adequate there would be no need for new systems, but we continue to see 1-3 new papers on the subject every year. It's seen as a game. Look at the &lt;a href="http://www.balisage.net/"&gt;Balisage website&lt;/a&gt;: 'There's nothing so practical as a good theory'. Perceived as an unsolvable problem, overlap is the perfect topic for a paper or a thesis.&lt;/p&gt;
&lt;p&gt;In the humanities, overlap in markup systems is more than an annoyance; it wrecks the whole process of digitisation. In simple texts you can just about get by, but it's a question of degree. Try to use markup to record the following structures:
&lt;ol&gt;
&lt;li&gt;Deletion of a paragraph break&lt;/li&gt;
&lt;li&gt;Deletion of underlining&lt;/li&gt;
&lt;li&gt;Changes to document &lt;em&gt;structure&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;Transposition&lt;/li&gt;
&lt;li&gt;Overlapping variants&lt;/li&gt;
&lt;/ol&gt;
These can all be done somehow in markup, I admit, but very poorly. And they are features that occur all the time in original texts. The fundamental problem is that you can't adequately fit a non-hierarchical structure into a hierarchical template. To choose markup alone as a medium to preserve our textual cultural heritage is to resign yourself to &lt;em&gt;mangling&lt;/em&gt; that information.&lt;/p&gt;
&lt;p&gt;Why do we have to use markup to record complex structures it was never designed to represent? Hand that complexity over to the computer and let &lt;em&gt;it&lt;/em&gt; work it out. That's what MVD lets you do. If you are getting a headache shuffling around angle brackets and xml:ids, then think again. Is this any proper way for humans of the 21st century to interact with the texts of their forebears?&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4555078640999654611-7791140286533517335?l=multiversiondocs.blogspot.com'/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://multiversiondocs.blogspot.com/feeds/7791140286533517335/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=4555078640999654611&amp;postID=7791140286533517335' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default/7791140286533517335'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default/7791140286533517335'/><link rel='alternate' type='text/html' href='http://multiversiondocs.blogspot.com/2009/03/mvd-is-not-replacement-for-markup.html' title='MVD is Not a Replacement for Markup'/><author><name>desmond</name><uri>http://www.blogger.com/profile/01722159590093138289</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='06159622047589651206'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4555078640999654611.post-7633838304486093510</id><published>2009-02-18T13:59:00.001-08:00</published><updated>2009-02-18T15:08:38.728-08:00</updated><title type='text'>MVD Paper available online</title><content 
type='html'>&lt;p&gt;Elsevier have published &lt;a href="http://dx.doi.org/10.1016/j.ijhcs.2009.02.001"&gt;the paper I wrote with Bob Colomb about Multi-Version Documents online&lt;/a&gt;. The Greek text has dropped out of Figure 16, but the rest is good. I hope this has an impact, and it is certainly something I will be referring to in future. It represents everything I knew about the MVD idea and its implications as of December 2008.&lt;/p&gt;
&lt;h3&gt;Thesis Complete&lt;/h3&gt;
&lt;p&gt;This morning I submitted a near-final draft of my thesis 'Multiple Versions and Overlap in Digital Text' to my two supervisors. The last chapter describes some new work on aligning multi-version texts automatically. Here's a table taken from the thesis which summarises its performance on a variety of multi-version texts.&lt;/p&gt;

&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_GGwOcLYrsVk/SZyIJypD_cI/AAAAAAAAAG8/FG3WzM-sP9o/s1600-h/table.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 281px;" src="http://1.bp.blogspot.com/_GGwOcLYrsVk/SZyIJypD_cI/AAAAAAAAAG8/FG3WzM-sP9o/s400/table.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5304264162929802690" /&gt;&lt;/a&gt;

&lt;p&gt;The SZ column is the average version size in kilobytes, NV is the number of versions, TT is the total time taken to merge all versions, AT is the average time to merge one version after the first, both in seconds. The test machine had a 1.66GHz Core Duo processor, using one core. The Romulo doesn't merge properly at the moment because there is almost nothing in common between the versions, so the merge times don't mean much in this case.&lt;/p&gt;
&lt;p&gt;The key is the AT column, which is how long it takes to 'save' an edited version back into the document. As you can see, it's pretty fast, considering that this is a hard problem. As far as quality goes, I can't see any bad alignments or false transpositions, except in the Malvezzi case. Once I can coerce the input into a sensible format this should also work.&lt;/p&gt;
&lt;h3&gt;Balisage&lt;/h3&gt;
&lt;p&gt;It looks as if I will be going to Balisage this year. I will be presenting a boiled down version of Chapter 5 of the thesis, which is all new work. I'll be very interested to hear their reactions, especially as I can now demonstrate the theory. (Their motto is 'There is nothing so practical as a good theory').&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4555078640999654611-7633838304486093510?l=multiversiondocs.blogspot.com'/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://multiversiondocs.blogspot.com/feeds/7633838304486093510/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=4555078640999654611&amp;postID=7633838304486093510' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default/7633838304486093510'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default/7633838304486093510'/><link rel='alternate' type='text/html' href='http://multiversiondocs.blogspot.com/2009/02/mvd-paper-available-online.html' title='MVD Paper available online'/><author><name>desmond</name><uri>http://www.blogger.com/profile/01722159590093138289</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='06159622047589651206'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_GGwOcLYrsVk/SZyIJypD_cI/AAAAAAAAAG8/FG3WzM-sP9o/s72-c/table.png' height='72' width='72'/><thr:total 
xmlns:thr='http://purl.org/syndication/thread/1.0'>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4555078640999654611.post-6618258187052007115</id><published>2008-12-04T13:19:00.000-08:00</published><updated>2008-12-04T13:31:08.293-08:00</updated><title type='text'>The MVD File Format</title><content type='html'>&lt;p&gt;Several people have asked me what is inside an MVD, so I thought I would put it on the record.&lt;/p&gt;
&lt;p&gt;The idea behind the Multi-Version Document or MVD format is to use the list form of the variant graph as the basis for an encoding of a single work, in all its versions or markup perspectives, as a single digital entity. The advantages of this form of digital document should be obvious. It enables a work to be viewed and searched, its versions compared and edited as one file. For example, all versions of Homer's Iliad or the seven markup perspectives of the American National Corpus (Ide, 2006) could be encapsulated in a single compact and editable representation. Also, the relationships between various parts of each version, the what-is-a-variant-of-what information, is also recorded. Storing a multi-version work as a set of separate files has the great disadvantage of requiring this kind of data to be recalculated each time it is needed. In an MVD this has already been calculated once and is thus built-in.&lt;/p&gt;

&lt;p&gt;If the content of each version is itself XML, then XML is a poor format for an MVD. An MVD may, however, be written in binary or XML format. In the latter case, the XML content of each version, &lt;em&gt;inside&lt;/em&gt; the XML encoding of the MVD structure, is escaped. That is, all instances of '&amp;lt;', '&amp;gt;' and '&amp;amp;' have to be replaced by their equivalent entities '&amp;amp;lt;', '&amp;amp;gt;' and '&amp;amp;amp;'. The purpose of the XML form of an MVD is merely to allow the researcher to look inside it to see what is there. Editing it by hand is virtually impossible, because the delicate list format produced by Algorithm 1 can so easily be broken.&lt;/p&gt;

&lt;p&gt;A tinker-proof binary format is therefore preferred. If desired for archival purposes, an MVD can be written out as a set of separate XML files, but the format uses open-source software to encode its content, so it is also archivable. The structure of an MVD is shown below:&lt;/p&gt;

&lt;a href="http://2.bp.blogspot.com/_GGwOcLYrsVk/SThKAlC9Y2I/AAAAAAAAAGg/ScqBj-WDdqo/s1600-h/mvd.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: undefinedpx; height: undefinedpx;" src="http://2.bp.blogspot.com/_GGwOcLYrsVk/SThKAlC9Y2I/AAAAAAAAAGg/ScqBj-WDdqo/s400/mvd.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5276048337269515106" /&gt;&lt;/a&gt;

&lt;p&gt;The outer wrapper is a Base64 encoding, expressing binary data as plain text. &lt;/p&gt;

&lt;p&gt;The inner wrapper is the ZIP encoding performed by the open source Zlib library (Gailly and Adler, 1995). This serves the double purpose of scrambling the data to deter tinkering, and compressing it so that one MVD typically occupies little more space than a single original version. Even the alteration of a single byte of the outer Base64 wrapper will very likely break the inner ZIP encoding and the document will fail to load, as it should. Inside the ZIP container are the four parts that comprise the real content:&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;&lt;td valign="top"&gt;&lt;em&gt;Magic&lt;/em&gt;&lt;/td&gt;&lt;td&gt;the presence of this hexadecimal string guarantees that this is an MVD&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td valign="top"&gt;&lt;em&gt;Groups&lt;/em&gt;&lt;/td&gt;&lt;td&gt;these are labels for a hierarchy of arbitrary depth used to group versions or other groups&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td valign="top"&gt;&lt;em&gt;Versions&lt;/em&gt;&lt;/td&gt;&lt;td&gt;these provide a simple description sufficient to identify the ID, short name and long name of each version, whether or not it is a partial version, and its group&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td valign="top"&gt;&lt;em&gt;Pairs&lt;/em&gt;&lt;/td&gt;&lt;td&gt;the pairs list that defines the variant graph itself&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;No further detail is needed; adding more would in fact damage the general applicability of the format. Groups can be used to express any desired classification system for versions. The short name of a version would typically be a siglum or other abbreviation for convenient reference, and the long name would typically be a full version name. All other details of a version's text are the responsibility of the content format.&lt;/p&gt;
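&lt;p&gt;As a rough illustration of the two wrappers, here is a minimal Python sketch. It is not the NMerge code: the magic value, the names &lt;code&gt;wrap&lt;/code&gt; and &lt;code&gt;unwrap&lt;/code&gt;, and the flat body are all assumptions made for the example. Only the order of the layers (Zlib inside, Base64 outside) is taken from the description above.&lt;/p&gt;

```python
import base64
import zlib

# Hypothetical magic string; the post does not give the real value.
MAGIC = b"DEADC0DE"


def wrap(body):
    """Compress magic + body with zlib, then Base64-encode the result."""
    return base64.b64encode(zlib.compress(MAGIC + body))


def unwrap(text):
    """Reverse wrap(); raise ValueError if the magic string is missing."""
    raw = zlib.decompress(base64.b64decode(text))
    if not raw.startswith(MAGIC):
        raise ValueError("not an MVD: magic string missing")
    return raw[len(MAGIC):]


# Stand-in for the groups, versions and pairs parts, whose real byte
# layout is not described in the post.
body = b"groups|versions|pairs"
encoded = wrap(body)
assert unwrap(encoded) == body
```

&lt;p&gt;Because the Zlib stream carries its own checksum, most accidental changes to the encoded text should make decompression fail rather than load a damaged document silently.&lt;/p&gt;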

&lt;h4&gt;References&lt;/h4&gt;
&lt;p&gt;J.-L. Gailly and M. Adler (1995) &lt;a href="http://www.zlib.net/"&gt;Zlib&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;N. Ide and K. Suderman (2006) &lt;a href="http://www.cs.vassar.edu/~ide/papers/ANC-LREC06.pdf"&gt;Integrating Linguistic Resources: The American National Corpus Model.&lt;/a&gt;
In &lt;em&gt;Proceedings of the Fifth Language Resources and Evaluation Conference.&lt;/em&gt;&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4555078640999654611-6618258187052007115?l=multiversiondocs.blogspot.com'/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://multiversiondocs.blogspot.com/feeds/6618258187052007115/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=4555078640999654611&amp;postID=6618258187052007115' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default/6618258187052007115'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default/6618258187052007115'/><link rel='alternate' type='text/html' href='http://multiversiondocs.blogspot.com/2008/12/mvd-file-format.html' title='The MVD File Format'/><author><name>desmond</name><uri>http://www.blogger.com/profile/01722159590093138289</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='06159622047589651206'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_GGwOcLYrsVk/SThKAlC9Y2I/AAAAAAAAAGg/ScqBj-WDdqo/s72-c/mvd.png' height='72' width='72'/><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4555078640999654611.post-8209189649917661893</id><published>2008-11-30T12:49:00.000-08:00</published><updated>2008-11-30T12:54:29.945-08:00</updated><title type='text'>From Toy Time to Big Time</title><content type='html'>&lt;p&gt;I don't know if this warrants another entry but the test program is now robust enough to handle large real world files in XML. 
I tried it on three 16K texts and it took 13.5 seconds overall to merge them with hundreds of transpositions. That is probably too many, but it does break up longer transpositions if it finds an alignment or insertion/deletion in the middle. The next step is to incorporate the test program into the NMerge library and thus allow the results to be displayed in the multi-version wiki.&lt;/p&gt;
&lt;p&gt;The transposition program works in the real world and it is fast.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4555078640999654611-8209189649917661893?l=multiversiondocs.blogspot.com'/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://multiversiondocs.blogspot.com/feeds/8209189649917661893/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=4555078640999654611&amp;postID=8209189649917661893' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default/8209189649917661893'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default/8209189649917661893'/><link rel='alternate' type='text/html' href='http://multiversiondocs.blogspot.com/2008/11/from-toy-time-to-big-time.html' title='From Toy Time to Big Time'/><author><name>desmond</name><uri>http://www.blogger.com/profile/01722159590093138289</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='06159622047589651206'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4555078640999654611.post-6915047723380378986</id><published>2008-11-16T14:23:00.000-08:00</published><updated>2008-11-18T17:45:30.955-08:00</updated><title type='text'>Transpositions Conquered</title><content type='html'>&lt;p&gt;Today the test program correctly merged three versions of a single sentence of the Sibylline Gospel, detecting four transpositions and encoding them correctly. The sentences were:&lt;/p&gt;
&lt;p&gt;A: Et sumpno suscepto tribus diebus morte morietur et deinde ab inferis regressus ad lucem veniet.&lt;/p&gt;
&lt;p&gt;B: Et mortem sortis finiet post tridui somnum et morte morietur tribus diebus somno suscepto et tunc ab inferis regressus ad lucem veniet.&lt;/p&gt;
&lt;p&gt;C: Et sortem mortis tribus diebus sompno suscepto et tunc ab inferis regressus ad lucem veniet.&lt;/p&gt;
&lt;p&gt;I must thank Nicoletta for supplying this splendid example, which in a small space contains so many transpositions. Here is the variant graph built &lt;em&gt;automatically&lt;/em&gt; from the three versions. When I say 'automatically' what I mean is that I drew the graph manually from the program's textual output. The program was set to make no variants of less than five characters, although it does split arcs down to a single character. There are two transpositions, each present twice. I have indicated these by drawing the transposed forms in grey. The parent arcs are in black and the two are connected by dotted lines. The triple repetition of 'Et' at the start of the graph could be removed by reducing the minimal variant size. At the moment I am happy to see such high quality output without resorting to fine tuning.&lt;/p&gt;&lt;p&gt;The best thing about the program is the degree to which repetitions between versions have been systematically removed. This is the whole objective of the variant graph model.&lt;/p&gt;
&lt;a href="http://3.bp.blogspot.com/_GGwOcLYrsVk/SSKTmIxCZ0I/AAAAAAAAAGY/e_7wiH6u_0c/s1600-h/transpose.jpg"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: undefinedpx; height:undefinedpx" src="http://3.bp.blogspot.com/_GGwOcLYrsVk/SSKTmIxCZ0I/AAAAAAAAAGY/e_7wiH6u_0c/s400/transpose.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5269936797374375746" /&gt;&lt;/a&gt;
&lt;p&gt;This is, of course, only a test program. The algorithm will eventually be added to NMerge and all this will happen behind the scenes in the multi-version wiki whenever you save.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4555078640999654611-6915047723380378986?l=multiversiondocs.blogspot.com'/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://multiversiondocs.blogspot.com/feeds/6915047723380378986/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=4555078640999654611&amp;postID=6915047723380378986' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default/6915047723380378986'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default/6915047723380378986'/><link rel='alternate' type='text/html' href='http://multiversiondocs.blogspot.com/2008/11/mvd-overview.html' title='Transpositions Conquered'/><author><name>desmond</name><uri>http://www.blogger.com/profile/01722159590093138289</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='06159622047589651206'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_GGwOcLYrsVk/SSKTmIxCZ0I/AAAAAAAAAGY/e_7wiH6u_0c/s72-c/transpose.jpg' height='72' width='72'/><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4555078640999654611.post-5220808325213373473</id><published>2008-10-26T17:39:00.000-07:00</published><updated>2008-11-10T18:03:13.178-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Align'/><title type='text'>Transpositions</title><content type='html'>&lt;p&gt;Transpositions 
are undeniably part of real world texts, and must be included in any practical solution to the technical problems of how to represent overlapping structures in digital texts. The MVD or variant graph model described on this website includes transpositions, but until now there has been no way to calculate them automatically.&lt;/p&gt;
&lt;p&gt;Transpositions can't be supported in any markup scheme. In principle, any section of the document can be transposed, not only small bits of text. This means that a feature like the &amp;lsquo;copyof&amp;rsquo; attribute (Smith, 1999), if used to record transpositions, would have to allow &lt;em&gt;any&lt;/em&gt; element to be contained in any other element &amp;ndash; thus destroying the whole idea of a document schema or DTD. Also, the transposed sections might not even contain well-formed markup, i.e. they might contain unmatched start or end-tags. So this approach doesn&amp;rsquo;t work.&lt;/p&gt;
&lt;p&gt;The method I have been using until now is that suggested by Lopresti and Tomkins (1997): that one should do all the alignments, variants, deletions and insertions &lt;em&gt;first&lt;/em&gt;, and then consider pairs of insertions/deletions as candidates for transposition &lt;em&gt;afterwards&lt;/em&gt;. The problem is that, while this works well for two versions, it leads to problems caused by &amp;lsquo;contamination&amp;rsquo; between multiple versions. You end up with the same word being recorded as a variant of itself in another version. What is left over after the completion of alignment is a lot of noise that does not form a suitable basis for the calculation of transpositions.&lt;/p&gt;

&lt;h4&gt;The French Connection&lt;/h4&gt;
&lt;p&gt;The approach adopted by Bourdaillet (2007) in his thesis is the exact opposite: do the transpositions (and alignments) &lt;em&gt;first,&lt;/em&gt; then what is left over can be considered as candidate variants, insertions and deletions. His method is still tied to just two versions, but it contains some useful ideas that can also be applied to the case of aligning N versions.&lt;/p&gt;
&lt;p&gt;I suspect he avoided trying to merge N versions into one document partly because he didn't have a data structure to record the result, and partly because, until now, this has been considered too hard a problem. However, I believe that it is solvable by combining his general approach with the MVD document structure. I will try to describe how it will work by going through a simple example step by step.&lt;/p&gt;
&lt;h4&gt;The Case for Automatic Detection of Transpositions&lt;/h4&gt;
&lt;p&gt;You might think that automatically detecting transpositions would be a bad idea. If the author transposed some text in a holograph, this should be clear from the manuscript and can simply be encoded as such by the editor. But what about the case where there are several versions of the same work &amp;ndash; perhaps redrafts of a single text or independent versions created by copying? Spotting such transpositions visually is very hard work. In these cases calculating transpositions is a good idea, and saves the editor a lot of trouble. Even in the holograph case, calculating transpositions can still be useful. So long as the computer gets it right we don't have to encode it manually (which saves work); only when it gets it wrong do we have to do anything.&lt;/p&gt;
&lt;h4&gt;Outline of the Proposed Method&lt;/h4&gt;
&lt;p&gt;Imagine you have already aligned N versions into an MVD. Now you want to add the N+1th version to the structure. (This is the standard inductive formulation: if we can prove that it works in this case, then it will always work.) The Longest-Common-Substring or LCS is the longest section of text shared by the MVD and the new version where successive characters are all the same. The basic algorithm uses this property to merge the entire graph. In essence the algorithm is simply:
&lt;ol&gt;&lt;li&gt;Merge the variant graph and the new version where the LCS occurs.&lt;/li&gt;
&lt;li&gt;Call the algorithm recursively on the two unaligned sections before and after the LCS.&lt;/li&gt;&lt;/ol&gt;&lt;/p&gt;
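&lt;p&gt;The two steps above can be sketched in a few lines of Python. This is a deliberately reduced version, not the real algorithm: it aligns one plain version against another instead of against a variant graph, it works on whole words rather than characters, and it knows nothing about transpositions. The function name &lt;code&gt;align&lt;/code&gt; is made up for the example; the longest common run is found with the standard library's &lt;code&gt;difflib.SequenceMatcher&lt;/code&gt;.&lt;/p&gt;

```python
from difflib import SequenceMatcher


def align(old, new):
    """Recursively align two word lists around their longest common run.

    Returns a list of (old_part, new_part) pairs. Pairs with equal parts
    are merged text; the rest are variants, insertions or deletions.
    """
    if not old and not new:
        return []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    m = matcher.find_longest_match(0, len(old), 0, len(new))
    if m.size == 0:
        # Nothing in common: keep both sides as one unaligned pair.
        return [(old, new)]
    shared = old[m.a:m.a + m.size]
    # Step 1: merge where the LCS occurs.
    # Step 2: recurse on the sections before and after it.
    before = align(old[:m.a], new[:m.b])
    after = align(old[m.a + m.size:], new[m.b + m.size:])
    return before + [(shared, shared)] + after


result = align(
    "The quick brown fox jumps over the lazy dog.".split(),
    "The quick white rabbit jumps over the lazy dog.".split(),
)
for old_part, new_part in result:
    print(" ".join(old_part), "|", " ".join(new_part))
```

&lt;p&gt;Working on words keeps the output readable; a character-based version would need a minimum match length to avoid aligning stray single letters.&lt;/p&gt;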
&lt;h4&gt;The Challenge&lt;/h4&gt;
&lt;p&gt;As an example consider the three versions:&lt;/p&gt;
&lt;pre&gt;
1. The quick brown fox jumps over the lazy dog.
2. The quick white rabbit jumps over the lazy dog.
3. The quick brown ferret leaps over the lazy dog.
&lt;/pre&gt;
&lt;p&gt;Imagine we have already built an MVD out of these. Now we want to merge it with:&lt;/p&gt;
&lt;pre&gt;
4. The white quick rabbit jumps over the dog.
&lt;/pre&gt;
&lt;p&gt;There is a small transposition here: version 4 has &amp;lsquo;white quick&amp;rsquo; instead of &amp;lsquo;quick white&amp;rsquo; in version 2. Let&amp;rsquo;s see if the algorithm can detect it.&lt;/p&gt;
&lt;h4&gt;Following the Algorithm as it Works&lt;/h4&gt;
&lt;p&gt;We can add the fourth version to the graph very easily, by creating one big arc with the text of the version and attaching it to the start and end of the graph, like this:&lt;/p&gt;
&lt;a href="http://3.bp.blogspot.com/_GGwOcLYrsVk/SRS2zAkkenI/AAAAAAAAAEo/Zm0rtXWt-CY/s1600-h/trans1.jpg"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: undefinedpx; height: undefinedpx;" src="http://3.bp.blogspot.com/_GGwOcLYrsVk/SRS2zAkkenI/AAAAAAAAAEo/Zm0rtXWt-CY/s400/trans1.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5266034851745921650" /&gt;&lt;/a&gt;
&lt;p&gt;This is already a valid variant graph, but it is full of redundancy. Every word of the new version can also be found in the &amp;lsquo;old&amp;rsquo; graph. Apart from wasting storage, this redundancy fails to inform us of the relationship between the various parts of the new version and the rest of the document. The algorithm will correct this problem by gradually removing all of the copies.&lt;/p&gt;
&lt;p&gt;The &amp;lsquo;longest-common-substring&amp;rsquo; (or LCS) between the first three versions and the 4th one is &amp;lsquo;rabbit jumps over the&amp;rsquo; from version 2, even though bits of that string are shared by other versions. What we do is align version 4 with the LCS, leaving two bits at either end non-aligned. The two bits are &amp;lsquo;The white quick&amp;rsquo; and &amp;lsquo;dog.&amp;rsquo;:&lt;/p&gt;
&lt;a href="http://3.bp.blogspot.com/_GGwOcLYrsVk/SRS4EmdSB6I/AAAAAAAAAE4/I-7nZ7viegY/s1600-h/trans2.jpg"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: undefinedpx; height: undefinedpx;" src="http://3.bp.blogspot.com/_GGwOcLYrsVk/SRS4EmdSB6I/AAAAAAAAAE4/I-7nZ7viegY/s400/trans2.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5266036253485303714" /&gt;&lt;/a&gt;
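&lt;p&gt;This first step can be checked with a short Python sketch. It is only an approximation of what the program does: instead of searching the variant graph it compares version 4 with each of the three existing versions in turn, character by character, and keeps the longest match. The helper name &lt;code&gt;longest_match&lt;/code&gt; is an assumption of the example.&lt;/p&gt;

```python
from difflib import SequenceMatcher

versions = {
    1: "The quick brown fox jumps over the lazy dog.",
    2: "The quick white rabbit jumps over the lazy dog.",
    3: "The quick brown ferret leaps over the lazy dog.",
}
new = "The white quick rabbit jumps over the dog."


def longest_match(old, new):
    """Longest run of identical characters shared by old and new."""
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    return matcher.find_longest_match(0, len(old), 0, len(new))


# Compare the new version with every existing one and keep the best match.
matches = {vid: longest_match(text, new) for vid, text in versions.items()}
best = max(matches, key=lambda vid: matches[vid].size)
m = matches[best]

lcs = new[m.b:m.b + m.size]
left = new[:m.b]             # unaligned text before the LCS
right = new[m.b + m.size:]   # unaligned text after the LCS
print(best, repr(lcs.strip()), repr(left), repr(right))
```

&lt;p&gt;The character-level match includes the spaces on either side of the words, which is why the sketch strips them before printing.&lt;/p&gt;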
&lt;p&gt;Note that &amp;lsquo;rabbit jumps over the&amp;rsquo; has now acquired version D (versions 1 to 4 are labelled A to D in the figures). We then call the same routine recursively on the two leftover bits, but align each of them with the parts of the MVD that precede &lt;em&gt;OR&lt;/em&gt; follow the LCS in version 2.&lt;/p&gt;
&lt;a href="http://3.bp.blogspot.com/_GGwOcLYrsVk/SRS9fqJGF_I/AAAAAAAAAFA/PQfNgHI1AJQ/s1600-h/trans3.jpg"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: undefinedpx; height: undefinedpx;" src="http://3.bp.blogspot.com/_GGwOcLYrsVk/SRS9fqJGF_I/AAAAAAAAAFA/PQfNgHI1AJQ/s400/trans3.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5266042215888984050" /&gt;&lt;/a&gt;
&lt;p&gt;We now have two subgraphs containing arcs that definitely precede or follow the LCS. All the other arcs that are effectively parallel to the LCS are left in place but are not considered further. Now we have to try to align Arc 1 &amp;lsquo;The white quick&amp;rsquo; with Graph 1 &lt;em&gt;and&lt;/em&gt; Graph 2, and likewise align Arc 2 &amp;lsquo;dog.&amp;rsquo; with graphs 1 &amp;amp; 2. Because we now have more than one subgraph we will have to consider transpositions. However, when we compare the LCS between Arc 2 and Graph 1 (nothing) with that calculated between Arc 2 and Graph 2 (&amp;lsquo;dog.&amp;rsquo;) it is clear that no transpositions are possible. Instead, the best alignment is the direct one of &amp;lsquo;dog.&amp;rsquo; between versions ABC and D:&lt;/p&gt;
&lt;a href="http://4.bp.blogspot.com/_GGwOcLYrsVk/SRTB2K5tHsI/AAAAAAAAAFQ/8tSPQDwG0VU/s1600-h/trans4.jpg"&gt;&lt;img width="150" style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: undefinedpx; height: undefinedpx;" src="http://4.bp.blogspot.com/_GGwOcLYrsVk/SRTB2K5tHsI/AAAAAAAAAFQ/8tSPQDwG0VU/s400/trans4.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5266047000686436034" /&gt;&lt;/a&gt;
&lt;p&gt;We have no more D-text to align here so the process stops. The empty D-arc becomes a deletion in version D. On the other hand, there is still work to do on the left, in Graph 1. Here comparison between Graph 1 and Arc 1 suggests either &amp;lsquo;white&amp;rsquo; or &amp;lsquo;quick&amp;rsquo; as the LCS. We will choose &amp;lsquo;quick&amp;rsquo; because it is more central:&lt;/p&gt;
&lt;a href="http://3.bp.blogspot.com/_GGwOcLYrsVk/SRTJRQA_ScI/AAAAAAAAAFY/4eB-vEqBIrM/s1600-h/trans5.jpg"&gt;&lt;img width="250" style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: undefinedpx; height: undefinedpx;" src="http://3.bp.blogspot.com/_GGwOcLYrsVk/SRTJRQA_ScI/AAAAAAAAAFY/4eB-vEqBIrM/s400/trans5.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5266055162497026498" /&gt;&lt;/a&gt;
&lt;p&gt;This is still not quite right. The instance of &amp;lsquo;white&amp;rsquo; on the left is clearly a transposition of the &amp;lsquo;white&amp;rsquo; on the right. Again the merging of the LCS leads to two arcs and two graphs:&lt;/p&gt;
&lt;a href="http://2.bp.blogspot.com/_GGwOcLYrsVk/SRTMYVtOXQI/AAAAAAAAAFo/oLnH0Z9Y1eI/s1600-h/trans6.jpg"&gt;&lt;img width="200" style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: undefinedpx; height: undefinedpx;" src="http://2.bp.blogspot.com/_GGwOcLYrsVk/SRTMYVtOXQI/AAAAAAAAAFo/oLnH0Z9Y1eI/s400/trans6.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5266058582818708738" /&gt;&lt;/a&gt;
&lt;p&gt;Calculation of the LCS between Arc 1 and Graph 2 yields &amp;lsquo;white&amp;rsquo;, shown here in bold. So we merge the LCS into the graph, except that this is a transposition, and so we must leave the text where it is and point to the target of the transposition. This leaves only one copy of &amp;lsquo;white&amp;rsquo; in the graph, and another copy, shown in grey, that points to it.&lt;/p&gt;
&lt;a href="http://2.bp.blogspot.com/_GGwOcLYrsVk/SRTOzZLatsI/AAAAAAAAAFw/bDs7h0ArtBE/s1600-h/trans7.jpg"&gt;&lt;img width="250" style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: undefinedpx; height: undefinedpx;" src="http://2.bp.blogspot.com/_GGwOcLYrsVk/SRTOzZLatsI/AAAAAAAAAFw/bDs7h0ArtBE/s400/trans7.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5266061246630377154" /&gt;&lt;/a&gt;
&lt;p&gt;We are still not finished. &amp;lsquo;The&amp;rsquo; appears twice. So we need one further LCS calculation between the &amp;lsquo;The&amp;rsquo; D-arc and the &amp;lsquo;The&amp;rsquo; ABC-arc:&lt;/p&gt;
&lt;a href="http://2.bp.blogspot.com/_GGwOcLYrsVk/SRUBSR1lqoI/AAAAAAAAAF4/67k5q-aMD4g/s1600-h/trans8.jpg"&gt;&lt;img width="150" style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: undefinedpx; height: undefinedpx;" src="http://2.bp.blogspot.com/_GGwOcLYrsVk/SRUBSR1lqoI/AAAAAAAAAF4/67k5q-aMD4g/s400/trans8.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5266116752817105538" /&gt;&lt;/a&gt;
&lt;p&gt;Now that the two &amp;lsquo;The&amp;rsquo;s are merged, all that remains is to introduce an empty ABC-arc to indicate that &amp;lsquo;white&amp;rsquo; only appears in that position in version D.&lt;/p&gt;
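&lt;p&gt;The idea that a transposed arc keeps no text of its own, and instead points at the arc that does, can be sketched in a few lines of Python. This is only an illustration: the class name Arc, its fields, and the choice of version A as the owner of the text are assumptions, not the data structures nmerge actually uses.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Arc:
    """One arc of a variant graph (hypothetical layout, for illustration)."""
    versions: frozenset             # versions that follow this arc
    text: str = ""                  # own text; empty for a transposed copy
    parent: Optional["Arc"] = None  # arc that holds the text, if transposed

    def resolved_text(self) -> str:
        # A transposed copy stores nothing itself and reads via its parent.
        return self.parent.text if self.parent is not None else self.text


# 'white' is stored once (assumed here to belong to version A) ...
white_original = Arc(versions=frozenset({"A"}), text="white ")
# ... and version D refers to it from a different position in the graph.
white_moved = Arc(versions=frozenset({"D"}), parent=white_original)
```

&lt;p&gt;Because the copy only holds a reference, the text of &amp;lsquo;white&amp;rsquo; is stored once however many versions move it.&lt;/p&gt;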
&lt;h4&gt;Taking Stock&lt;/h4&gt;
&lt;p&gt;We have been recursing into smaller and smaller portions of the graph. That does not mean that these portions or subgraphs are in any way detached from the rest of the graph. The other parts were simply omitted for clarity. Overall the graph now looks like this:&lt;/p&gt;
&lt;a href="http://1.bp.blogspot.com/_GGwOcLYrsVk/SRUKXlnfU2I/AAAAAAAAAGI/mnBcPknuHgo/s1600-h/trans9.jpg"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: undefinedpx; height: undefinedpx;" src="http://1.bp.blogspot.com/_GGwOcLYrsVk/SRUKXlnfU2I/AAAAAAAAAGI/mnBcPknuHgo/s400/trans9.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5266126739630674786" /&gt;&lt;/a&gt;
&lt;p&gt;The text of version D has been fully assimilated. It has been aligned with &lt;em&gt;all&lt;/em&gt; the versions of the graph, not just with the one that was most similar to the version we were trying to add. (Aligning only with the most similar version is what biologists do in their &amp;lsquo;progressive&amp;rsquo; alignment technique, and they don't even attempt transpositions.) The result is a much better alignment with a lot less redundancy. To add further versions to the MVD we simply repeat the above steps, with the new variant graph as our starting point.&lt;/p&gt;
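&lt;p&gt;The recursive shape of the procedure can be sketched in Python under heavy simplifying assumptions: two plain word lists stand in for the variant graph and the new version, the longest common run of words stands in for the LCS, and transpositions are ignored altogether. The function names are invented for this sketch.&lt;/p&gt;

```python
def longest_common_run(a, b):
    """Return (start_a, start_b, length) of the longest run of equal words."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended at the previous pair of words.
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[2]:
                    best = (i - cur[j], j - cur[j], cur[j])
        prev = cur
    return best


def align(old, new):
    """Align two word lists by merging the longest run, then recursing."""
    if not old and not new:
        return []
    i, j, n = longest_common_run(old, new)
    if n == 0:
        # Nothing in common: old words are deleted, new words inserted.
        out = []
        if old:
            out.append(("delete", old))
        if new:
            out.append(("insert", new))
        return out
    left = align(old[:i], new[:j])            # text preceding the run
    right = align(old[i + n:], new[j + n:])   # text following the run
    return left + [("match", old[i:i + n])] + right


result = align("The quick brown fox".split(),
               "The white quick brown dog".split())
```

&lt;p&gt;Each call merges one common run and hands the text on either side of it to two smaller calls, which is the same divide-and-recurse pattern as in the walkthrough above.&lt;/p&gt;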
&lt;h4&gt;Time Complexity&lt;/h4&gt;
&lt;p&gt;I believe this routine may eventually be O(N log N), that is, as fast as the famous &amp;lsquo;quicksort&amp;rsquo; routine of Hoare from 1961, which it resembles. At the moment it is somewhat slower because my current calculation of the LCS takes O(N&lt;sup&gt;2&lt;/sup&gt;) time in the worst case. According to Gusfield, the LCS between two &lt;em&gt;strings&lt;/em&gt; can be calculated in O(N) time, but that requires the construction of two suffix trees using Ukkonen's 1995 algorithm. I have implemented that for the text of the new version, but I can't generate a suffix tree for the variant graph: it is too difficult, and may not be possible in O(N) time. Instead, to calculate the LCS I traverse the graph, looking for runs of matching characters in the new version, which has been converted into a suffix tree using Ukkonen's algorithm. Overall I think this is O(N&lt;sup&gt;2&lt;/sup&gt;). However, even as it now stands the algorithm is still very fast, because expected performance is usually much better than the worst case.&lt;/p&gt;
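&lt;p&gt;A rough way to see where these two figures come from is to write the cost as a recurrence. This is only a sketch: it assumes, optimistically, that each LCS splits the remaining text into two roughly equal halves, that each level of recursion costs one LCS calculation over the text still to be aligned, and it ignores the extra comparisons needed to test for transpositions. Here c is an unspecified constant.&lt;/p&gt;

```latex
% Balanced splits with a linear-time LCS (the hoped-for case):
\[ T(N) = 2\,T(N/2) + cN \quad\Longrightarrow\quad T(N) = O(N \log N) \]

% Balanced splits with a quadratic-time LCS (the current calculation):
\[ T(N) = 2\,T(N/2) + cN^{2} \quad\Longrightarrow\quad T(N) = O(N^{2}) \]
```

&lt;p&gt;If the splits are badly unbalanced, the first case also degrades towards O(N&lt;sup&gt;2&lt;/sup&gt;), just as quicksort does on unfavourable input.&lt;/p&gt;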
&lt;h3&gt;References&lt;/h3&gt;
&lt;p&gt;D. Lopresti and A. Tomkins, &amp;lsquo;Block edit models for approximate string matching&amp;rsquo;, &lt;i&gt;Theoretical Computer Science&lt;/i&gt; 1997, 181, 159&amp;ndash;179.&lt;br/&gt;
J. Bourdaillet, &lt;i&gt;Alignment textuel monolingue avec recherche de d&amp;eacute;placements: algorithmique pour la critique g&amp;eacute;n&amp;eacute;tique&lt;/i&gt; PhD Thesis, Universit&amp;eacute; Paris 6 Pierre et Marie Curie, 2007.&lt;br/&gt;
C. Hoare, &amp;lsquo;Partition: Algorithm 63&amp;rsquo;, &amp;lsquo;Quicksort: Algorithm 64&amp;rsquo;, &lt;i&gt;Communications of the ACM&lt;/i&gt; 4(7), 1961, 321&amp;ndash;322.&lt;br/&gt;
D. Smith, &amp;lsquo;Textual Variation and Version Control in the TEI&amp;rsquo;, &lt;i&gt;Computers and the Humanities&lt;/i&gt;, 33.1, 1999, 103&amp;ndash;112.&lt;br/&gt;
E. Ukkonen, &amp;lsquo;On-line Construction of Suffix Trees&amp;rsquo;, &lt;i&gt;Algorithmica&lt;/i&gt; 14 (1995), 249&amp;ndash;260.&lt;br/&gt;
D. Gusfield &lt;i&gt;Algorithms on Strings, Trees and Sequences&lt;/i&gt;, Cambridge: Cambridge University Press, 1997.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4555078640999654611-5220808325213373473?l=multiversiondocs.blogspot.com'/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://multiversiondocs.blogspot.com/feeds/5220808325213373473/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=4555078640999654611&amp;postID=5220808325213373473' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default/5220808325213373473'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default/5220808325213373473'/><link rel='alternate' type='text/html' href='http://multiversiondocs.blogspot.com/2008/10/transpositions.html' title='Transpositions'/><author><name>desmond</name><uri>http://www.blogger.com/profile/01722159590093138289</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='06159622047589651206'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_GGwOcLYrsVk/SRS2zAkkenI/AAAAAAAAAEo/Zm0rtXWt-CY/s72-c/trans1.jpg' height='72' width='72'/><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4555078640999654611.post-6216410517893927606</id><published>2008-09-11T00:23:00.001-07:00</published><updated>2008-09-11T17:17:10.960-07:00</updated><title type='text'>MVD paper accepted</title><content type='html'>In case anyone was wondering if my theories have been independently verified, the International Journal of Human-Computer Studies has just accepted my paper that 
explains the core idea behind the MVD technology. This is a respected technical journal and I spared no detail in the paper, which is 16 pages long. There is a link at the bottom of the &lt;a href="http://multiversiondocs.blogspot.com/2008/03/whats-multi-version-document.html"&gt;"What is an MVD?"&lt;/a&gt; page to the PDF I submitted.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4555078640999654611-6216410517893927606?l=multiversiondocs.blogspot.com'/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://multiversiondocs.blogspot.com/feeds/6216410517893927606/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=4555078640999654611&amp;postID=6216410517893927606' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default/6216410517893927606'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default/6216410517893927606'/><link rel='alternate' type='text/html' href='http://multiversiondocs.blogspot.com/2008/09/mvd-paper-accepted.html' title='MVD paper accepted'/><author><name>desmond</name><uri>http://www.blogger.com/profile/01722159590093138289</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='06159622047589651206'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4555078640999654611.post-7250671412279799697</id><published>2008-07-29T14:27:00.000-07:00</published><updated>2008-08-06T04:50:08.792-07:00</updated><title type='text'>Improvements to Alignment</title><content type='html'>&lt;p&gt;In preparation to publicly releasing the NMerge code I have been revising the alignment algorithm, as there were 
significant weaknesses in the naïve approach I had previously adopted. In cases like the Sibylline Gospel, there is a great deal of variation in a small space, and simply requiring the user to choose a base text for alignment doesn&amp;rsquo;t work very well. There seem to be a few reasons for this:&lt;/p&gt;
&lt;ol&gt;&lt;li&gt;The user doesn't actually know to which existing version a new version should be aligned. It should be the task of the software to find this out.&lt;/li&gt;
&lt;li&gt;There can't be a distinction, as I originally supposed, between updating an existing version and adding a new one. The newly revised version might resemble another version more than the one it came from. For example, we might change &amp;lsquo;the quick brown dog&amp;rsquo; &lt;em&gt;back to&lt;/em&gt; &amp;lsquo;the quick brown fox&amp;rsquo;. In the case of manuscript traditions like the Sibylline Gospel this happens all the time, because the versions aren&amp;rsquo;t a succession of edits, but a set of parallel alternatives. To avoid this problem I now automatically calculate the most similar version already in the MVD and then align with that.&lt;/li&gt;
&lt;li&gt;When you add a new arc to the graph you need to look for identical &lt;em&gt;paths&lt;/em&gt; not merely identical &lt;em&gt;arcs&lt;/em&gt; that are already in that position. Now when the program finds an existing path with the same text, spanning the same two end-points, the new arc can be discarded and its version simply added to the path instead:
&lt;a href="http://4.bp.blogspot.com/_GGwOcLYrsVk/SJmP3KT853I/AAAAAAAAADc/-6XU_S1Atn8/s1600-h/optimise.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://4.bp.blogspot.com/_GGwOcLYrsVk/SJmP3KT853I/AAAAAAAAADc/-6XU_S1Atn8/s400/optimise.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5231370619991156594" /&gt;&lt;/a&gt;
In this simplified real-world example, the new D-version was aligned with version C, its most similar version overall. However, in this location the D-variant &amp;lsquo;milia hominum&amp;rsquo; already exists in version A. When the program tried to add the D-arc it found that there was already an A-path with the same text, and instead merely added the D-version to that path.&lt;/li&gt;&lt;/ol&gt;
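&lt;p&gt;The path check in the third point might look something like the following Python sketch. The graph layout (a dictionary from node to outgoing arcs), the function names, and the variant &amp;lsquo;virorum&amp;rsquo; in the sample data are all made up for illustration; only &amp;lsquo;milia hominum&amp;rsquo; comes from the example above. A path only counts if at least one existing version follows every arc along it.&lt;/p&gt;

```python
def find_path(arcs, start, end, text, shared=None):
    """Return the arcs of a path from start to end spelling text, or None.

    arcs maps a node to a list of outgoing arc dicts with the keys
    'to', 'text' and 'versions' (a hypothetical layout).
    """
    if start == end:
        return [] if text == "" else None
    for arc in arcs.get(start, []):
        if not text.startswith(arc["text"]):
            continue
        # Keep only the versions that follow every arc chosen so far.
        if shared is None:
            common = set(arc["versions"])
        else:
            common = shared.intersection(arc["versions"])
        if not common:
            continue
        rest = find_path(arcs, arc["to"], end,
                         text[len(arc["text"]):], common)
        if rest is not None:
            return [arc] + rest
    return None


def add_version(arcs, start, end, text, version):
    """Reuse an identical existing path if there is one, else add an arc."""
    path = find_path(arcs, start, end, text)
    if path is None:
        arcs.setdefault(start, []).append(
            {"to": end, "text": text, "versions": {version}})
        return False
    for arc in path:
        arc["versions"].add(version)
    return True


# Made-up graph: version A reads 'milia hominum', version B 'milia virorum'.
arcs = {
    1: [{"to": 2, "text": "milia ", "versions": {"A", "B"}}],
    2: [{"to": 3, "text": "hominum", "versions": {"A"}},
        {"to": 3, "text": "virorum", "versions": {"B"}}],
}
reused = add_version(arcs, 1, 3, "milia hominum", "D")
```

&lt;p&gt;In this sketch the D-text is found along the existing A-path, so no new arc is created and D is simply added to the version sets of the two arcs on that path.&lt;/p&gt;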
&lt;p&gt;This is only the first stage of a series of improvements. Still to come: &lt;/p&gt;
&lt;ul&gt;&lt;li&gt;Calling the alignment algorithm recursively to handle contamination, i.e. aligning to more than one base text. After aligning to the most similar version, any significant portions of the new version that didn't align can be realigned to the most similar version &lt;em&gt;between the two endpoints&lt;/em&gt; of the unaligned section.&lt;/li&gt;
&lt;li&gt;Calculating what is a variant of what, one use of which might be to generate a kind of &lt;em&gt;apparatus criticus&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Calculating transpositions. These can be done after the other alignments are complete: any leftover insertion/deletion pairs meeting certain criteria can be tested for equality, and the transposition carried out.&lt;/li&gt;&lt;/ul&gt;
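&lt;p&gt;As a rough illustration of the last point, leftover deletions and insertions could be paired up by comparing their text. The Python sketch below is an assumption about how that might be done, not the planned implementation: the (position, text) tuples, the function name, and the minimum-length criterion are all invented here.&lt;/p&gt;

```python
def find_transpositions(deletions, insertions, min_length=3):
    """Pair leftover deletions and insertions that carry the same text.

    Each item is a (position, text) tuple. min_length is an assumed
    criterion that stops very short strings being treated as moves.
    Returns a list of (text, deleted_at, inserted_at) tuples.
    """
    unused = list(insertions)
    moves = []
    for del_pos, del_text in deletions:
        if len(del_text) >= min_length:
            for ins in unused:
                if ins[1] == del_text:
                    moves.append((del_text, del_pos, ins[0]))
                    unused.remove(ins)   # each insertion is used only once
                    break
    return moves


moves = find_transpositions(
    deletions=[(20, "white"), (31, "a")],
    insertions=[(4, "white"), (9, "a"), (15, "lazy")],
)
```

&lt;p&gt;Here only &amp;lsquo;white&amp;rsquo; qualifies as a move; the single-letter pair is ignored because it falls below the assumed minimum length.&lt;/p&gt;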
&lt;h3&gt;XML Awareness&lt;/h3&gt;
&lt;p&gt;While NMerge remains ignorant of XML, as it should, the public interface class now breaks up &amp;lsquo;words&amp;rsquo; based on angle-brackets as well as white space. In addition I am contemplating moving this code and the class that XML-izes the differences between versions into a separate package. The latter functionality is needed by the wiki since a difference might occur in the middle of a tag, and the marker for this needs to be moved to the start of the next piece of real text, so it can be displayed.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4555078640999654611-7250671412279799697?l=multiversiondocs.blogspot.com'/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://multiversiondocs.blogspot.com/feeds/7250671412279799697/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=4555078640999654611&amp;postID=7250671412279799697' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default/7250671412279799697'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4555078640999654611/posts/default/7250671412279799697'/><link rel='alternate' type='text/html' href='http://multiversiondocs.blogspot.com/2008/07/improvements-to-alignment.html' title='Improvements to Alignment'/><author><name>desmond</name><uri>http://www.blogger.com/profile/01722159590093138289</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='06159622047589651206'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_GGwOcLYrsVk/SJmP3KT853I/AAAAAAAAADc/-6XU_S1Atn8/s72-c/optimise.png' height='72' width='72'/><thr:total 
xmlns:thr='http://purl.org/syndication/thread/1.0'>0</thr:total></entry></feed>