XML for mark-up of text projects for the web
- 1 What is XML?
- 2 OK, I get the basics of XML. What tools are out there to help me edit?
- 3 I am starting a project and XML makes sense for it, but I don't want to invent my own markup schema. What sorts of schemas are already available?
- 4 I have a text I want to annotate. Is it better to combine my annotations with the source as a single file, or should I keep the annotations in a separate file?
- 5 If the latter, how should I go about this?
- 6 I have a text and a translation, and I want to use XML to align them. How do I do that?
What is XML?
XML, which stands for eXtensible Markup Language, is a widely used set of rules for marking up text. It is not the only possible way to annotate a text, but it is popular with digital classicists working with materials that fit a word processor better than they do a spreadsheet. XML allows a scholar to annotate a text in depth so that it can be read and understood by both humans and computers. It makes aspects of the text available for data processing without forcing the text to look like a database. For a good basic introduction to XML, see the Wikipedia article.
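To make the "humans and computers" point concrete, here is a minimal sketch in Python, using only the standard library. The element names (line, place) and the attribute n are invented for this illustration and do not belong to any particular schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: <line> and <place> are invented names for this sketch.
SAMPLE = '<line n="1">Arma virumque cano, <place>Troiae</place> qui primus ab oris</line>'


def plain_text(xml_string):
    """Return the text with all markup removed, as a human would read it."""
    root = ET.fromstring(xml_string)
    return "".join(root.itertext())


def tagged(xml_string, tag):
    """Return the text content of every element with the given tag name."""
    root = ET.fromstring(xml_string)
    return [element.text for element in root.iter(tag)]


print(plain_text(SAMPLE))
print(tagged(SAMPLE, "place"))
```

The same file yields both a readable text and a list of the places a scholar has marked, which is the sense in which the markup serves data processing without turning the text into a database.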
OK, I get the basics of XML. What tools are out there to help me edit?
Any text editor can be used to write XML, but some are better than others: some programs can validate your files, and some display the annotation and the text in different colors, making the markup more readable. New tools are constantly appearing, and some tools fall by the wayside. A good place to start looking for software is the Bamboo DiRT wiki, which lists tools particularly helpful in the digital humanities. Many digital classicists like oXygen, an affordable program that greatly facilitates writing in XML.
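If you have no XML-aware editor at hand, the most basic check can be sketched in a few lines of Python. Note the assumption: this tests only whether a file is well-formed (every tag properly opened, closed, and nested), which is a weaker test than validation against a schema. The function name is invented for this example.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(xml_string):
    """Return None if the string is well-formed XML, else the parser's message.

    This does not validate against a schema; it only checks basic XML syntax.
    """
    try:
        ET.fromstring(xml_string)
    except ET.ParseError as err:
        return str(err)
    return None


print(well_formedness_error("<a><b>x</b></a>"))  # properly nested
print(well_formedness_error("<a><b>x</a>"))      # <b> is never closed
```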
I am starting a project and XML makes sense for it, but I don't want to invent my own markup schema. What sorts of schemas are already available?
There are numerous XML markup schemas for all sorts of purposes. One of the most widely discussed, if not most widely used, is the Text Encoding Initiative (TEI), which provides a set of rules for the markup of texts of any kind, particularly historical ones. Because TEI's aims are broad, the schema is relatively loose (for example, it is not always clear when <q> or <quote> should be used). That makes interchange difficult. There are proper subsets of TEI, such as TEI-Analytics, which facilitate interchange by restricting the tagset but do not resolve TEI ambiguities (e.g., the <q> vs. <quote> issue remains). But customizations of TEI such as EpiDoc, widely used in papyrology and epigraphy, provide greater structure, and may be appropriate for your project.
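As a rough illustration of why such looseness matters for interchange, the sketch below counts TEI's two quotation-like elements, <q> and <quote>, in a tiny fragment. The fragment is invented for this example and is not a valid TEI document (it lacks a teiHeader, among other things); only the TEI namespace URI is taken from the standard. A tool that looks for one element will silently miss passages an encoder chose to mark with the other.

```python
import xml.etree.ElementTree as ET

# The TEI namespace; element names must be qualified with it when searching.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented fragment for illustration only; not a complete or valid TEI file.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <p>He said <q>I came</q>, echoing <quote>veni, vidi, vici</quote>.</p>
  </body></text>
</TEI>"""


def count_elements(xml_string, names):
    """Count how often each named TEI element occurs in the document."""
    root = ET.fromstring(xml_string)
    return {name: len(root.findall(f".//tei:{name}", TEI_NS)) for name in names}


print(count_elements(SAMPLE, ["q", "quote"]))
```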
Different research purposes call for different tagging schemas.
- Morphology: See the relevant DC wiki entries, especially Morpheus, and the relevant FAQ entry.
- Syntax: XML schemas for treebanks are widely used by linguists, but linguists' views on language and their research purposes differ so much that numerous schemas are in use. A good place for classicists to start is the Perseus Ancient Greek and Latin Dependency Treebank. For another XML model useful to classicists, and developed in conjunction with ISO standards, see <tiger2/>.
- Text reuse: Annotating quotations, allusions, paraphrases, and other forms of text reuse is important, but as of 2012 few published schemas are available. The TEI offers some guidance on marking quotations, but the definitions of the handful of relevant elements are vague and open to interpretation. CiTO, the Citation Typing Ontology, purports to give a standard for linked open data, but as of July 2012 there were no known examples of how the ontology might be used. For real projects trying to model a schema see:
I have a text I want to annotate. Is it better to combine my annotations with the source as a single file, or should I keep the annotations in a separate file?
If the latter, how should I go about this?
Whether inline annotations (a single file) or stand-off annotations (multiple files) make sense depends upon the complexity of the project and the end users of your data. Simple projects are often served well by inline annotations, whereas complex ones, especially those that require multiple levels of annotation, are best served by a stand-off system. Stand-off annotation, however, can follow any of numerous models. An excellent overview of the possible models is Bański, Piotr. “Why TEI stand-off annotation doesn't quite work: and why you might want to use it nevertheless.” Presented at Balisage: The Markup Conference 2010, Montréal, Canada, August 3-6, 2010. In Proceedings of Balisage: The Markup Conference 2010. Balisage Series on Markup Technologies, vol. 5 (2010). doi:10.4242/BalisageVol5.Banski01.
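The basic idea behind stand-off annotation can be sketched as follows. The source file gives each word an identifier, and a separate annotation file points at those identifiers instead of wrapping the words. Everything here is an assumption made for illustration: the element names (text, w, annotations, ann) and attributes (id, target, pos) are invented, and the sketch does not reproduce any of the models Bański discusses.

```python
import xml.etree.ElementTree as ET

# Hypothetical source file: each word carries an identifier.
SOURCE = """<text>
  <w id="w1">arma</w>
  <w id="w2">virumque</w>
  <w id="w3">cano</w>
</text>"""

# Hypothetical stand-off file: annotations point at word identifiers.
ANNOTATIONS = """<annotations>
  <ann target="w1" pos="noun"/>
  <ann target="w3" pos="verb"/>
</annotations>"""


def resolve(source_xml, annotation_xml):
    """Pair each annotation with the word it targets in the source file."""
    words = {w.get("id"): w.text for w in ET.fromstring(source_xml).iter("w")}
    annotations = ET.fromstring(annotation_xml).iter("ann")
    return [(words[a.get("target")], a.get("pos")) for a in annotations]


print(resolve(SOURCE, ANNOTATIONS))
```

Because the source is never modified, a second annotation file (for syntax, say, or text reuse) can target the same identifiers without conflicting with the first, which is why stand-off systems suit projects with multiple levels of annotation.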
I have a text and a translation, and I want to use XML to align them. How do I do that?
Take a look at the TEI guidelines on linking, segmentation, and alignment. This is not the only way to approach this issue. The Alpheios project has a tool under development for an XML-based, stand-off alignment scheme. See also Bamboo DiRT's tools for text alignment and the list by Rada Mihalcea.
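As a minimal sketch of the linking approach, the example below aligns one Latin sentence with its translation. The linkGrp and link elements and the space-separated target attribute are loosely modeled on the TEI guidelines; the rest (the doc wrapper, the omission of the TEI namespace, the identifiers) is simplified for illustration, so treat it as a sketch of the idea and not as valid TEI.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:id attribute under the XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Simplified, non-namespaced document; <doc> is an invented wrapper element.
DOC = """<doc>
  <div type="source">
    <s xml:id="la1">Gallia est omnis divisa in partes tres.</s>
  </div>
  <div type="translation">
    <s xml:id="en1">All Gaul is divided into three parts.</s>
  </div>
  <linkGrp type="translation">
    <link target="#la1 #en1"/>
  </linkGrp>
</doc>"""


def aligned_pairs(xml_string):
    """Return (source, translation) sentence pairs named by each link."""
    root = ET.fromstring(xml_string)
    by_id = {s.get(XML_ID): s.text for s in root.iter("s")}
    pairs = []
    for link in root.iter("link"):
        # Each target holds two pointers of the form "#identifier".
        source_ref, translation_ref = (
            ref.lstrip("#") for ref in link.get("target").split()
        )
        pairs.append((by_id[source_ref], by_id[translation_ref]))
    return pairs


print(aligned_pairs(DOC))
```

The links could equally live in a separate file, which would make this a stand-off alignment of the kind mentioned above.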