XML for mark-up of text projects for the web

From The Digital Classicist Wiki
Revision as of 15:54, 5 August 2014 by GabrielBodard (talk | contribs) (internal link)

What is XML?

XML, which stands for eXtensible Markup Language, is a widely used set of rules for marking up text. It is not the only possible way to annotate a text, but it is popular with digital classicists working with materials that fit a word processor better than they do a spreadsheet. XML allows a scholar to deeply annotate a text such that it can be read and understood by both humans and computers. It allows a scholar to make aspects of the text useful for data processing, but doesn't force the text to look like a database. For a good basic introduction to XML, see the Wikipedia article.
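As a rough illustration of how one marked-up text can serve both human readers and programs, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The tag and attribute names (sentence, name, type) are invented for this example and do not come from any particular schema.

```python
import xml.etree.ElementTree as ET

# A hypothetical inline-annotated sentence; tag names are illustrative only.
SAMPLE = (
    '<sentence lang="la">'
    '<name type="person">Caesar</name> in '
    '<name type="place">Galliam</name> contendit.'
    '</sentence>'
)


def plain_text(xml_string):
    """Return the text a human reader sees, with all markup removed."""
    root = ET.fromstring(xml_string)
    return "".join(root.itertext())


def names_by_type(xml_string):
    """Collect the content of every <name> element, grouped by its type attribute."""
    root = ET.fromstring(xml_string)
    grouped = {}
    for name in root.iter("name"):
        grouped.setdefault(name.get("type", "unknown"), []).append(name.text or "")
    return grouped


if __name__ == "__main__":
    print(plain_text(SAMPLE))
    print(names_by_type(SAMPLE))
```

The same file yields running prose for a reader and a small table of people and places for a program, without the text having been reshaped into database rows.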

OK, I get the basics of XML. What tools are out there to help me edit?

Any text editor can be used to write XML. But some editors are better than others: some can validate your files, and some display the markup and the text in different colors, which makes the markup more readable. New tools appear constantly, and others fall by the wayside. A good place to start looking for software is the Bamboo DiRT wiki, which lists tools particularly helpful in the digital humanities. Many digital classicists like Oxygen, an affordable program that greatly facilitates writing in XML.
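Even without a dedicated editor, the most basic check can be scripted. The sketch below uses Python's standard library to test whether a string is well-formed XML, meaning that tags are properly nested and closed. This is a weaker check than validation against a schema, which the standard library does not perform and which needs a schema-aware tool. The function name check_well_formed is made up for this example.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_string):
    """Return (True, None) if the string parses as XML, else (False, error message)."""
    try:
        ET.fromstring(xml_string)
    except ET.ParseError as err:
        # The parser's message includes the line and column of the problem.
        return False, str(err)
    return True, None


if __name__ == "__main__":
    print(check_well_formed("<p>A <hi>properly</hi> nested line.</p>"))
    print(check_well_formed("<p>An <hi>unclosed tag.</p>"))
```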

I am starting a project and XML makes sense for it, but I don't want to invent my own markup schema. What sorts of schemas are already available?

There are numerous XML markup schemas for all sorts of purposes. One of the most widely discussed, if not used, is that of the Text Encoding Initiative (TEI), which provides a set of rules for the markup of any text, particularly historical ones. Because TEI's aims are broad, the schema is relatively loose (for example, it is not always clear when <q> or <quote> should be used). That makes interchange difficult. There are proper subsets of TEI, such as TEI Analytics, which facilitate interchange by restricting the tagset but do not resolve TEI's ambiguities (e.g., the <q> vs. <quote> issue remains). But customizations of TEI such as EpiDoc, widely used in papyrology and epigraphy, provide greater structure and may be appropriate for your project.

Different research purposes call for different tagging schemas.

I have a text I want to annotate. Is it better to combine my annotations with the source as a single file, or should I keep the annotations in a separate file? If the latter, how should I go about this?

Whether inline annotations (a single file) or stand-off annotations (multiple files) make sense depends upon the complexity of the project and the end users of your data. Simple projects are often served well by inline annotations, whereas complex ones, especially those that require multiple levels of annotation, are best served by a stand-off system. There are, however, numerous models for stand-off annotation. For an excellent overview of the possibilities, see Bański, Piotr. “Why TEI stand-off annotation doesn't quite work: and why you might want to use it nevertheless.” Presented at Balisage: The Markup Conference 2010, Montréal, Canada, August 3-6, 2010. In Proceedings of Balisage: The Markup Conference 2010. Balisage Series on Markup Technologies, vol. 5 (2010). doi:10.4242/BalisageVol5.Banski01.
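To make the stand-off idea concrete, here is a minimal sketch of one possible model, not the one any particular project uses. The source text carries only word identifiers, and a separate document holds annotations that point back at those identifiers. The element and attribute names (w, ann, target, layer, value) are assumptions invented for this illustration.

```python
import xml.etree.ElementTree as ET

# The source file: words carry identifiers but no analysis.
SOURCE = """<text>
  <w id="w1">arma</w> <w id="w2">virumque</w> <w id="w3">cano</w>
</text>"""

# A separate file: each annotation points at a word by its identifier.
ANNOTATIONS = """<annotations>
  <ann target="w1" layer="morphology" value="noun, accusative plural"/>
  <ann target="w3" layer="morphology" value="verb, first person singular"/>
  <ann target="w1" layer="commentary" value="opening word of the Aeneid"/>
</annotations>"""


def resolve(source_xml, annotations_xml, layer):
    """Pair each annotation in the given layer with the word it points to."""
    words = {w.get("id"): w.text for w in ET.fromstring(source_xml).iter("w")}
    result = []
    for ann in ET.fromstring(annotations_xml).iter("ann"):
        if ann.get("layer") == layer:
            result.append((words[ann.get("target")], ann.get("value")))
    return result


if __name__ == "__main__":
    print(resolve(SOURCE, ANNOTATIONS, "morphology"))
    print(resolve(SOURCE, ANNOTATIONS, "commentary"))
```

Because each layer lives outside the source, several annotators can work on the same text without touching it or each other's files. The cost is that every pointer must stay in step with the identifiers in the source.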

I have a text and a translation, and I want to use XML to align them. How do I do that?

Take a look at the TEI guidelines on linking, segmentation, and alignment. This is not the only way to approach this issue. The Alpheios project has a tool under development for an XML-based, stand-off alignment scheme. See also Bamboo DiRT's tools for text alignment and the list by Rada Mihalcea.
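As a hedged sketch of what such an alignment might look like, the example below follows the general pattern of TEI's seg, linkGrp, and link elements, where each link lists the identifiers of the segments it joins. The sample is simplified (it omits the TEI namespace and header), and the identifiers and the aligned_pairs helper are invented for this illustration. Consult the TEI guidelines for the full mechanism.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Simplified, TEI-like sample: no TEI namespace or header.
DOC = """<text>
  <div type="edition">
    <seg xml:id="la1">Gallia est omnis divisa in partes tres</seg>
    <seg xml:id="la2">quarum unam incolunt Belgae</seg>
  </div>
  <div type="translation">
    <seg xml:id="en1">All Gaul is divided into three parts</seg>
    <seg xml:id="en2">one of which the Belgae inhabit</seg>
  </div>
  <linkGrp type="translation">
    <link target="#la1 #en1"/>
    <link target="#la2 #en2"/>
  </linkGrp>
</text>"""


def aligned_pairs(xml_string):
    """Return one tuple of segment texts for each <link>, in document order."""
    root = ET.fromstring(xml_string)
    segments = {seg.get(XML_ID): seg.text for seg in root.iter("seg")}
    pairs = []
    for link in root.iter("link"):
        # Each target is a space-separated list of "#id" pointers.
        ids = [ref.lstrip("#") for ref in link.get("target").split()]
        pairs.append(tuple(segments[i] for i in ids))
    return pairs


if __name__ == "__main__":
    for source, translation in aligned_pairs(DOC):
        print(source, "=", translation)
```

The links could equally be kept in a separate file from the text and the translation, which is the stand-off variant of the same idea.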