Friday, March 30, 2012

Importing into HRIT

Having a good import tool is one of the most important things in building a digital text archive. When you have 2,000+ files to load into the system, doing anything by hand with them is prohibitively expensive. I've tried to identify the steps that must be implemented to represent texts as multi-version works described by multi-version standoff properties, and I intend to tick them off as I complete them. Most of the pieces are already available in some form; it's really just a matter of pulling them all together.

  1. A GUI to interact with the user:
    • Select files for upload.
    • Gather information about where the merged files are to go.
    • Specify a filter for plain text files, if there are any.
    • Verify that the information entered makes sense.
  2. Compare each submitted file with the first file. If a file is less than, say, 10% similar to it, or is too large, reject it and tell the user.
  3. Split the remaining files into two groups: a) plain text and b) XML
  4. Filter the plain text files and so produce one set of markup for each.
  5. Split the XML files into further versions, so multiplying them. This happens if they contain any add/del, sic/corr, app/rdg or similar variations.
  6. Strip each XML file into markup and text.
  7. Merge the sets of markup and text into a corcode and a cortex, and install them in the specified location.
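
Steps 2 and 3 above can be sketched roughly as follows. This is only an illustration in Python, not the actual HRIT code: the similarity measure (the ratio from Python's `difflib.SequenceMatcher`), the size limit `MAX_BYTES`, and the function names are all assumptions made for the example. A file counts as XML here simply if it parses as a well-formed document.

```python
import difflib
import xml.etree.ElementTree as ET

MIN_SIMILARITY = 0.10    # the "say 10%" threshold from step 2
MAX_BYTES = 1_000_000    # hypothetical size limit; the post does not give one


def is_xml(text):
    """Treat a file as XML if it parses as a well-formed document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def screen_and_split(files):
    """files: dict mapping file name -> text, in submission order.

    Returns (plain, xml, rejected); rejected maps name -> reason.
    """
    plain, xml, rejected = {}, {}, {}
    names = list(files)
    if not names:
        return plain, xml, rejected
    base = files[names[0]]  # every file is compared with the first one
    for name in names:
        text = files[name]
        if len(text.encode("utf-8")) > MAX_BYTES:
            rejected[name] = "too large"
            continue
        # Stand-in similarity measure (an assumption, not HRIT's own).
        ratio = difflib.SequenceMatcher(None, base, text, autojunk=False).ratio()
        if ratio < MIN_SIMILARITY:
            rejected[name] = "only {:.0%} similar to the first file".format(ratio)
            continue
        (xml if is_xml(text) else plain)[name] = text
    return plain, xml, rejected
```

`SequenceMatcher` can be slow on long inputs, so for a few thousand real files a cheaper comparison would probably be needed. The reasons collected in `rejected` are what the GUI would report back to the user.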

That completes the basic import process.
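
Step 6, stripping each XML file into markup and text, might look something like the sketch below. The representation of the standoff markup (a list of dicts with `name`, `attrs`, `offset` and `len`) is a hypothetical one chosen for the example; it is not the actual corcode format. The idea is just that every element becomes a range over the plain text it used to enclose.

```python
import xml.etree.ElementTree as ET


def strip_xml(xml_text):
    """Return (text, ranges): the plain text plus standoff ranges.

    Each range records the element name, its attributes, and the
    offset and length (in characters) of the text it covers.
    """
    pieces = []  # text fragments in document order
    ranges = []  # standoff properties, in document order
    pos = 0      # current character offset in the output text

    def emit(s):
        nonlocal pos
        if s:
            pieces.append(s)
            pos += len(s)

    def walk(elem):
        rng = {"name": elem.tag, "attrs": dict(elem.attrib),
               "offset": pos, "len": 0}
        ranges.append(rng)    # appended before the children are visited
        emit(elem.text)
        for child in elem:
            walk(child)
            emit(child.tail)  # text after a child belongs to the parent
        rng["len"] = pos - rng["offset"]

    walk(ET.fromstring(xml_text))
    return "".join(pieces), ranges
```

For example, `strip_xml('<p n="1">Hello <hi rend="it">brave</hi> world</p>')` returns the text `Hello brave world` with two ranges: `p` at offset 0 with length 17, and `hi` at offset 6 with length 5. Comments and processing instructions are dropped by the default parser, which a real importer might not want.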
