XML for mark-up of text projects for the web: Difference between revisions

Revision as of 16:54, 5 August 2014

What is XML?

XML, which stands for eXtensible Markup Language, is a widely used set of rules for marking up text. It is not the only possible way to annotate a text, but it is popular with digital classicists working with materials that fit a word processor better than they do a spreadsheet. XML allows a scholar to deeply annotate a text such that it can be read and understood by both humans and computers. It allows a scholar to make aspects of the text useful for data processing, but doesn't force the text to look like a database. For a good basic introduction to XML, see the Wikipedia article.
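As a rough illustration of text that is readable by both humans and computers, the sketch below parses one marked-up sentence with Python's standard library. The element and attribute names (`sentence`, `name`, `type`, `lang`) are invented for this example and do not come from any published schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: element and attribute names are invented for illustration.
SAMPLE = (
    '<sentence lang="la">'
    '<name type="person">Caesar</name> venit ad '
    '<name type="place">Galliam</name>.'
    '</sentence>'
)

root = ET.fromstring(SAMPLE)

# A program can pull the annotated names out as structured data...
names = [(el.get("type"), el.text) for el in root.iter("name")]

# ...while the running text is still recoverable for a human reader.
plain_text = "".join(root.itertext())

print(names)
print(plain_text)
```

The same file serves both uses: the tags carry the data, and stripping them gives back the sentence.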

OK, I get the basics of XML. What tools are out there to help me edit?

Any text editor can be used to write XML, but some are better than others: some programs can validate your files, or present the annotation and the text in different colors, making the markup more readable. New tools are constantly appearing, and some fall by the wayside. A good place to start looking for software is the Bamboo DiRT wiki, which lists tools particularly helpful in the digital humanities. Many digital classicists like Oxygen, an affordable program that greatly facilitates writing in XML.
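If you have no dedicated editor at hand, a minimal check is still possible. The sketch below, assuming only Python's standard library, tests whether a file is well-formed (every tag opened is closed and properly nested). This is weaker than validating against a schema, which the standard library does not do; the helper name `is_well_formed` is made up for this example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return (True, None) if xml_text parses, else (False, parser message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return False, str(err)
    return True, None

good = "<line n='1'>arma virumque cano</line>"
bad = "<line n='1'>arma virumque cano"  # closing tag is missing

print(is_well_formed(good))
print(is_well_formed(bad))
```

A well-formedness check catches typing slips early; whether the markup follows a particular schema is a separate question that needs a validating tool.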

I am starting a project and XML makes sense for it, but I don't want to invent my own markup schema. What sorts of schemas are already available?

There are numerous XML markup schemas for all sorts of purposes. One of the most widely discussed, if not most widely used, is the Text Encoding Initiative (TEI), which provides a set of rules for the markup of any text, particularly historical ones. Because TEI's aims are broad, the schema is relatively loose (for example, it is not always clear when <q> or <quote> should be used), which makes interchange difficult. There are proper subsets of TEI such as TEI analytics, which facilitate interchange by restricting the tagset but do not resolve TEI's ambiguities (e.g., the <q> vs. <quote> issue remains). Customizations of TEI such as EpiDoc, widely used in papyrology and epigraphy, provide greater structure and may be appropriate for your project.

Different research purposes call for different tagging schemas.

Morphology: See the relevant DC wiki entries, especially Morpheus and the relevant FAQ entry.
Syntax: Linguists widely use XML schemas for treebanks, but their views on language and their research purposes differ enough that numerous schemas are in use. A good place for classicists to start is the Perseus Ancient Greek and Latin Dependency Treebank. For another XML model useful to classicists, and developed in conjunction with ISO standards, see <tiger2/>.
Text reuse: Annotating quotations, allusions, paraphrases, and other forms of text reuse is important, but as of 2012 not many published schemes are available. The TEI has some guidance on marking quotations, but the definitions of the handful of relevant elements are vague and subject to interpretation. CiTO, the Citation Typing Ontology, purports to give a standard for linked open data, but as of July 2012 there were no known examples of how the ontology might be used. For real projects trying to model a schema, see Fragmentary Texts, Biblindex, and Sharing Ancient Wisdoms (SAWS).

I have a text I want to annotate. Is it better to combine my annotations with the source as a single file, or should I keep the annotations in a separate file? If the latter, how should I go about this?

Whether inline annotations (a single file) or stand-off annotations (multiple files) make sense depends upon the complexity of the project and the end users of your data. Simple projects are often served well with inline annotations, whereas complex ones, especially those that require multiple levels of annotation, are best served with a stand-off system. Stand-off annotation, however, comes in numerous models. An excellent survey of the different possible models is Bański, Piotr. “Why TEI stand-off annotation doesn't quite work: and why you might want to use it nevertheless.” Presented at Balisage: The Markup Conference 2010, Montréal, Canada, August 3-6, 2010. In Proceedings of Balisage: The Markup Conference 2010. Balisage Series on Markup Technologies, vol. 5 (2010). doi:10.4242/BalisageVol5.Banski01.
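To make the stand-off idea concrete, here is a minimal sketch in Python. It is only one of the many possible models, not the one any particular project uses: the source text carries token ids and nothing else, and a separate document points at those ids. All element and attribute names (`w`, `ann`, `target`, `layer`, `value`) are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical source file: tokens carry ids, but no annotation.
SOURCE = (
    '<text><w id="w1">arma</w> <w id="w2">virumque</w> '
    '<w id="w3">cano</w></text>'
)

# Hypothetical stand-off file: each annotation points at a token id.
ANNOTATIONS = (
    '<annotations>'
    '<ann target="w1" layer="morph" value="noun"/>'
    '<ann target="w3" layer="morph" value="verb"/>'
    '<ann target="w3" layer="syntax" value="PRED"/>'
    '</annotations>'
)

# Index the source tokens by id.
tokens = {w.get("id"): w.text for w in ET.fromstring(SOURCE).iter("w")}

# Attach every annotation layer to the token it targets.
resolved = {}
for ann in ET.fromstring(ANNOTATIONS).iter("ann"):
    target = ann.get("target")
    entry = resolved.setdefault(target, {"form": tokens[target]})
    entry[ann.get("layer")] = ann.get("value")

print(resolved)
```

Because the source is never modified, further layers (or competing analyses of the same token) can be added as additional annotation files, which is the main attraction of stand-off systems for complex projects.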

I have a text and a translation, and I want to use XML to align them. How do I do that?

Take a look at the TEI guidelines on linking, segmentation, and alignment. This is not the only way to approach this issue. The Alpheios project has a tool under development for an XML-based, stand-off alignment scheme. See also Bamboo DiRT's tools for text alignment and the list by Rada Mihalcea.
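As a rough sketch of what a stand-off alignment can look like, the Python example below links segments of a Latin text to segments of an English translation. The link format is loosely modelled on the pointer-pair style described in the TEI guidelines, but it is simplified and is an assumption of this example: the ids are plain `id` attributes, and the markup has not been validated against any schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: two texts with segment ids...
LATIN = (
    '<text><seg id="la1">Gallia est omnis divisa in partes tres</seg>'
    '<seg id="la2">quarum unam incolunt Belgae</seg></text>'
)
ENGLISH = (
    '<text><seg id="en1">All Gaul is divided into three parts</seg>'
    '<seg id="en2">one of which the Belgae inhabit</seg></text>'
)
# ...and a separate file pairing a source segment with a translation segment.
LINKS = (
    '<linkGrp>'
    '<link target="#la1 #en1"/>'
    '<link target="#la2 #en2"/>'
    '</linkGrp>'
)

def segments(xml_text):
    """Map each segment id to its text."""
    return {s.get("id"): s.text for s in ET.fromstring(xml_text).iter("seg")}

latin, english = segments(LATIN), segments(ENGLISH)

# Resolve each link into a (source text, translation text) pair.
aligned = []
for link in ET.fromstring(LINKS).iter("link"):
    source_ref, target_ref = (ref.lstrip("#") for ref in link.get("target").split())
    aligned.append((latin[source_ref], english[target_ref]))

for pair in aligned:
    print(pair)
```

Keeping the links in their own file means neither the text nor the translation has to be altered, and alternative alignments can coexist.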

@@ Line 1: / Line 1: @@
-(to be added)
+''''' What is XML? '''''
+[http://en.wikipedia.org/wiki/Xml XML], which stands for eXtensible Markup Language, is a widely used set of rules for marking up text. It is not the only possible way to annotate a text, but it is popular with digital classicists working with materials that fit a word processor better than they do a spreadsheet. XML allows a scholar to deeply annotate a text such that it can be read and understood by both humans and computers. It allows a scholar to make aspects of the text useful for data processing, but doesn't force the text to look like a database. For a good basic introduction to XML, see the [http://en.wikipedia.org/wiki/Xml Wikipedia article].
+''''' OK, I get the basics of XML. What tools are out there to help me edit? '''''
+''Any'' text processor can be used to write XML. But some text processors are better than others, because some programs can validate your files, or present the annotation and text in different colors, making the markup more readable. New tools are constantly appearing, and some tools fall by the wayside. A good place to start to look for software is on the [http://dirt.projectbamboo.org/search/node/xml Bamboo DiRT] wiki, which lists tools particularly helpful in the digital humanities. Many digital classicists like [http://oxygenxml.com/ Oxygen], an affordable program that greatly facilitates writing in XML.
+''''' I am starting a project and XML makes sense for it, but I don't want to invent my own markup schema. What sorts of schemas are already available? '''''
+There are numerous XML markup schemas for all sorts of purposes. One of the most widely discussed, if not used, is the [http://www.tei-c.org/Guidelines/P5/ Text Encoding Initiative], TEI, which provides a set of rules for the markup of any texts, particularly historical. Because TEI's aims are broad, the schema is relatively loose (for example, it is not always clear when <q> or <quote> should be used). That makes interchange difficult. There are proper subsets of TEI such as [http://segonku.unl.edu/teianalytics/TEIAnalytics.html TEI analytics], which facilitate interchange by restricting the tagset but do not resolve TEI ambiguities (e.g., the <q> vs. <quote> issue remains). But customizations of TEI such as [[EpiDoc]], widely used in [[:category:papyrology|papyrology]] and [[:category:epigraphy|epigraphy]] provide greater structure, and may be appropriate for your project.
+Different research purposes call for different tagging schemas.
+* '''Morphology''' See relevant [[:category:morphology|DC wiki]] entries, especially [[Morpheus]] and the [[Morphological_parsing_or_lemmatising_Greek_and_Latin|relevant FAQ entry]].
+* '''Syntax''' XML schemes for treebanks are widely used by linguists, whose views on language, and research purposes, differ considerably, enough that there are numerous XML schemas being used. A good place for classicists to start is the [[Perseus Ancient Greek and Latin Dependency Treebank]]. For another XML model useful to classicists, and developed in conjunction with ISO standards, see [http://korpling.german.hu-berlin.de/tiger2/homepage/index.html <tiger2/>].
+* '''Text reuse''' Annotating quotations, allusions, paraphrases, and other forms of text reuse is important, but as of 2012 not many published schemes are available. The TEI has [http://www.tei-c.org/release/doc/tei-p5-doc/en/html/CO.html#COHQQ some guidance on marking quotations] but the definitions of the handful of relevant elements are vague and subject to interpretation. [http://speroni.web.cs.unibo.it/cgi-bin/lode/req.py?req=http:/purl.org/spar/cito CiTO, Citation Typing Ontology] purports to give a standard for [[linked open data]], but as of July 2012 there were no known examples of how the ontology might be used. For real projects trying to model a schema see:
+** [[Fragmentary Texts]]
+** [[Biblindex]]
+** [[Sharing_Ancient_Wisdoms_%28SAWS%29|Sharing Ancient Wisdoms]]
+''''' I have a text I want to annotate. Is it better to combine my annotations with the source as a single file, or should I keep the annotations in a separate file? If the latter, how should I go about this? '''''
+Whether inline annotations (a single file) or stand-off annotations (multiple files) make sense depends upon the complexity of the project and the end users of your data. Simple projects are often served well with inline annotations whereas complex ones, especially those that require multiple levels of annotation, are best served with a stand-off system. But the latter has numerous models. An excellent way to see the different possible models is in Bański, Piotr. “[http://www.balisage.net/Proceedings/vol5/html/Banski01/BalisageVol5-Banski01.html Why TEI stand-off annotation doesn't quite work: and why you might want to use it nevertheless].” Presented at Balisage: The Markup Conference 2010, Montréal, Canada, August 3 - 6, 2010. In Proceedings of Balisage: The Markup Conference 2010. Balisage Series on Markup Technologies, vol. 5 (2010). doi:10.4242/BalisageVol5.Banski01.
+''''' I have a text and a translation, and I want to use XML to align them. How do I do that? '''''
+Take a look at the [http://www.tei-c.org/release/doc/tei-p5-doc/en/html/SA.html TEI guidelines on linking, segmentation, and alignment]. This is not the only way to approach this issue. The [[Alpheios Tools|Alpheios project]] has a [http://alpheios.net/content/resources-under-development tool under development for an XML-based, stand-off alignment scheme]. See also Bamboo DiRT's [http://dirt.projectbamboo.org/search/node/alignment tools for text alignment] and the [http://www.cse.unt.edu/~rada/wa/ list by Rada Mihalcea].
 [[category:FAQ]]
+[[category:XML]]
