A good import tool is one of the most important parts of building a digital text archive. When you have 2,000+ files to load into the system, doing anything with them by hand is prohibitively expensive. I've tried to identify the steps that must be implemented to represent texts as multi-version works described by multi-version standoff properties, and I intend to tick off these steps as I complete them. Most of the pieces are already available in some form; it's really just a matter of pulling them all together.
- A GUI to interact with the user: ✓
- Select files for upload.
- Gather information about where the merged files are to go.
- Also specify a filter for plain text files, if there are any.
- Verify that the information entered makes sense.
- Compare each submitted file with the first file. If it is less than, say, 10% similar, or it is too large, reject it and tell the user. ✓
- Split the remaining files into two groups: a) plain text and b) XML ✓
- Filter the plain text files to produce one set of markup for each. ✓
- Split the XML files into further versions, which multiplies them. This happens if there are any add/del, sic/corr, app/rdg etc. variations. ✓
- Strip each XML file into markup and text. ✓
- Merge the sets of markup and text into a corcode and a cortex, and install them in the specified location. ✓
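The screening step in the list above (compare each file with the first and reject anything too dissimilar or too large) could be sketched roughly as follows. This is only an illustration, not the importer's actual code: the function name `screen_files`, the size limit, and the use of Python's `difflib` as the similarity measure are all my assumptions here.

```python
from difflib import SequenceMatcher


def screen_files(files, min_similarity=0.10, max_chars=1_000_000):
    """Screen (name, text) pairs against the first file in the list.

    Hypothetical helper: the 10% threshold comes from the checklist,
    but max_chars and the difflib-based ratio are assumptions.
    Returns (accepted_names, {rejected_name: reason}).
    """
    if not files:
        return [], {}
    base_name, base_text = files[0]
    accepted, rejected = [], {}
    for index, (name, text) in enumerate(files):
        # Size check applies to every file, including the first.
        if len(text) > max_chars:
            rejected[name] = "too large"
            continue
        # Every file after the first is compared with the first one.
        if index > 0:
            matcher = SequenceMatcher(None, base_text, text, autojunk=False)
            ratio = matcher.ratio()
            if ratio < min_similarity:
                rejected[name] = f"only {ratio:.0%} similar to {base_name}"
                continue
        accepted.append(name)
    return accepted, rejected
```

The reasons collected in the second return value are what would be reported back to the user. `SequenceMatcher` is quadratic in the worst case, so on a real archive the size limit matters for speed as well as for sanity.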
That completes the basic import process.
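To make the "strip each XML file into markup and text" step a little more concrete, here is a minimal sketch of the idea: walk the XML tree, collect the character data as plain text, and record each element as a standoff range (name, offset, length) pointing into that text. The `Range` class and `strip_xml` function are hypothetical names for illustration, and the result is not the real corcode format, which I haven't described here.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Range:
    """One standoff property (hypothetical layout, not the corcode format)."""
    name: str                 # element name, e.g. "p" or "hi"
    offset: int               # start position in the plain text
    length: int               # number of characters covered
    attrs: dict = field(default_factory=dict)


def strip_xml(xml_source):
    """Split an XML string into (plain_text, list_of_ranges)."""
    root = ET.fromstring(xml_source)
    chunks = []   # pieces of plain text, in document order
    ranges = []   # standoff ranges, in document order
    pos = 0       # current offset into the plain text

    def walk(elem):
        nonlocal pos
        start = pos
        rng = Range(elem.tag, start, 0, dict(elem.attrib))
        ranges.append(rng)
        if elem.text:
            chunks.append(elem.text)
            pos += len(elem.text)
        for child in elem:
            walk(child)
            # Text after a child's end tag belongs to the parent.
            if child.tail:
                chunks.append(child.tail)
                pos += len(child.tail)
        rng.length = pos - start

    walk(root)
    return "".join(chunks), ranges
```

For example, `strip_xml('<p>Hello <hi rend="italic">brave</hi> world</p>')` yields the text `Hello brave world` plus two ranges: `p` covering offsets 0 to 17 and `hi` covering the five characters starting at offset 6. Once the markup lives outside the text like this, the text and the markup can each be merged across versions separately.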