OSCE Bodard Paper

Positioning Paper on Markup
Gabriel Bodard

(*in progress*)

What is markup?
When an editor "marks up" a Greek or Latin text, whether in "deep" TEI XML, basic formatting (X)HTML, or even in Beta Code, she is adding a level of information to that text. At the simplest level this may be information about paragraphs, sections, or line breaks, not much more than formatting information. A text in defined paragraphs will be represented with breaks between sections; a verse text will have line and perhaps stanza breaks. These breaks may be purely visual, but they may also affect the way in which a text is searched, for example: find me two words in collocation within two verse lines of one another.
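A line-aware search of this kind might be sketched as follows. This is only an illustration, assuming the verse has already been split into one string per line; the function name `collocated` and its `max_distance` parameter are invented here and do not belong to any existing tool.

```python
import re

def collocated(lines, word_a, word_b, max_distance=2):
    """Return (i, j) pairs of 1-based line numbers where word_a occurs on
    line i and word_b on line j, no more than max_distance lines apart."""
    def line_numbers(word):
        # Whole-word, case-insensitive match on each verse line.
        pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
        return [n for n, line in enumerate(lines, start=1) if pattern.search(line)]

    hits_a = line_numbers(word_a)
    hits_b = line_numbers(word_b)
    return [(i, j) for i in hits_a for j in hits_b if abs(i - j) <= max_distance]

# Aeneid 1.1-4, unpunctuated, one string per verse line.
verse = """arma virumque cano Troiae qui primus ab oris
Italiam fato profugus Laviniaque venit
litora multum ille et terris iactatus et alto
vi superum saevae memorem Iunonis ob iram""".splitlines()

print(collocated(verse, "arma", "litora"))  # lines 1 and 3: within range
print(collocated(verse, "arma", "iram"))    # lines 1 and 4: out of range
```

The search only works because the line breaks are recorded as structure rather than as visual layout; in a text stored as one undivided string the same query could not be asked.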

There are several levels of added information that can be expressed with such markup. Structural information, such as line or section breaks, can be expressed with the simplest of markup, carriage returns in a text-only document, or a simple section escape mark in Beta Code (1), for example. Internal referencing information can probably also be expressed in this simple form, although some agreed siglum becomes necessary if the references are to be machine-readable; this can be as simple as employing an otherwise-unused bracket to surround a number, a combination of alphabetic and numeric characters as in Beta Code, or a fully unambiguous element or attribute value in XML.

At a further level down into the text, markup can be used to unambiguously express critical information about the nature of the text itself. In a sense the brackets and sigla used by epigraphic and papyrological editors who follow the Leiden conventions, or the similar conventions used by manuscript scholars, are doing exactly this. Square brackets represent characters restored by editorial conjecture from text completely lost; sub-puncted characters are visible but incomplete, etc. Beta Code extends this range of sigla to a very large degree, but (traditionally at least, although the TLG were trying to be more consistent when I worked there in 2000/1) it replicates the physical appearance of the sigla used rather than their semantic association. On the other hand, for example, the EpiDoc Guidelines for expressing epigraphic editions in TEI XML (2) recommend a fully unambiguous set of markup for representing the semantic meaning behind all of the Leiden sigla—including variants from Krummrey-Panciera to more idiosyncratic usages. Whatever the specific siglum used in a given edition or style sheet, it is the meaning of restored text (vel sim.) that is expressed in XML.
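The difference between a visual siglum and its semantic equivalent can be illustrated with a small conversion sketch. It assumes that restored text is encoded as `<supplied reason="lost">` and sub-puncted letters as `<unclear>`, which is how I understand the EpiDoc recommendation; the function name `leiden_to_tei` and the sample string are invented, and the sketch handles only these two sigla in input that contains no XML special characters.

```python
import re

DOT_BELOW = "\u0323"  # combining dot below, the "sub-punct" of Leiden

def leiden_to_tei(text):
    """Convert two Leiden sigla into semantic TEI-style elements."""
    # Square brackets: text restored by the editor where the original is lost.
    text = re.sub(r"\[([^\[\]]+)\]",
                  r'<supplied reason="lost">\1</supplied>', text)

    # Runs of dotted letters: characters visible but incomplete.
    def unclear(match):
        return "<unclear>" + match.group(0).replace(DOT_BELOW, "") + "</unclear>"

    return re.sub(r"(?:\w" + DOT_BELOW + r")+", unclear, text)

# Invented sample: two dotted letters and one bracketed restoration.
sample = "θεο\u0323ι\u0323ς [επη]κοοις"
print(leiden_to_tei(sample))
```

Once the text is in this form, a style sheet is free to render the restoration with square brackets, colour, or any other convention, since the encoding records what the editor means rather than what the page looks like.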

All of the levels of markup described above are considered essential by most projects, and indeed should be expressed unambiguously even in traditional texts, although they may not always be machine-readable. A further level of information that many digital projects add to their texts is the tagging of specific words (names, places, grammatical words, key subjects) within the text. These words can then be extracted from the text and indexed, or made contextually searchable (find the string "ΑΓΓΕΛ" when it is part of a personal name), or even hyperlinked to a reference table or prosopography. It is here that XML is really the only feasible way forward, although at a bare minimum these keywords could be disambiguated in XHTML with classed anchors, I suppose. TEI allows one to tag a personal name, give a regularised spelling, give the nominative singular, and give a link to the database key or table id of the person to whom the name refers.
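The contextual search mentioned above (the string "ΑΓΓΕΛ" only when it is part of a personal name) might look something like this. It is a sketch under several assumptions: the sample sentence is invented, the names are tagged with `<persName>` without a namespace, and the `key` and `nymRef` values are placeholders rather than real identifiers.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Invented sample: one tagged personal name and one untagged common noun.
SAMPLE = (
    '<ab>ἐπὶ <persName key="person-017" nymRef="#angelos">Ἀγγέλου</persName>'
    ' ἄρχοντος ὁ ἄγγελος ἦλθεν</ab>'
)

def fold(text):
    """Strip accents and breathings, then upper-case, for loose matching."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).upper()

def names_containing(xml_source, needle):
    """Return (key, text) for every personal name containing the string."""
    root = ET.fromstring(xml_source)
    return [
        (el.get("key"), "".join(el.itertext()))
        for el in root.iter("persName")
        if fold(needle) in fold("".join(el.itertext()))
    ]

for key, name in names_containing(SAMPLE, "ΑΓΓΕΛ"):
    print(key, name)
```

The common noun ἄγγελος later in the sample is not returned, because it is not inside a tagged name; the returned key is what would be used to hyperlink the name to a prosopography.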

Beyond this, cross-references can be tagged, so that texts can point to other texts, or to other sections within the same text, either at the level of the base text itself or of editorial comment, if this is part of the marked-up document. Other types of commentary, whether historical, philological, physical description, observation or conjecture, can also be given explicit, machine-readable form in markup. Markup can also be used to give metadata about a text or edition, as well as about the electronic file itself, including revisions, authorship, and versioning information, in machine-readable form and inside the marked-up text. It is not my intention to go into great detail about all of these possibilities.

Critical markup
Rather, the question I would like to focus on for a moment is that level of markup which makes our text a true digital critical edition. That is to say, not merely a digital representation of a traditional critical text, but one where the markup itself reflects and enriches those elements that make an edition critical: the explicit recording of editorial decisions, critical apparatus, and notes. The nature of a digital critical apparatus depends somewhat on the kind of edition that is being created. Texts with a single exemplar, such as most documentary inscriptions and papyri, need to record editorial restorations, physical observations about damage, lacunae, and uncertainty, and detailed description of letterforms and scripts that can contribute to dating and reading, for example. A text that is a consolidated or eclectic edition, such as a literary text with dozens of manuscript exemplars, needs to record different features: principally textual variants between witnesses. Individual lacunae and physical features are of less interest, and tend not to be included in the apparatus.

The TEI, with its strong support in the palaeographical and manuscript communities, has very good features in place for recording the second kind of apparatus information in XML. In fact the <app> element in TEI is extremely well-placed to record lemmata, textual variants, and multiple editorial restorations. The lack of flexibility in both the P4 and the P5 guidelines for this element (<app> may contain one <lem> and must contain one or more <rdg>, with no top-level note or other free text element) highlights the fact that textual variants are considered the only use for the apparatus construction. The epigraphic or papyrological apparatus usage currently needs to either abuse the TEI elements and attributes somewhat, modify the schema for local use, or use a different mechanism: free paragraphs, for example. This need for adaptation may well lead to a wide variety in usage for marked-up critical apparatus, especially if there are no corpus-wide guidelines.

There are already several ways of marking up critical apparatus within a TEI file. The traditional apparatus in a printed text appears as a list of notes either at the end of the edition, or as a running set at the bottom of each page, much like footnotes. Either of these visual layouts (or other electronic alternatives such as a sidebar or frame, a pop-up window, a tool-tip, modified mouse-over text, etc.) can be generated both from in-line and external apparatus markup. In-line apparatus is perhaps the less intuitive to a traditional philologist: it involves tagging the text itself, as it appears in the edition, with all of the variant readings or restorations associated with it. It is not even necessary to indicate a preference for one reading over the others; indeed some projects such as the OCP (3) eschew this artificial selection, preferring to offer all variants in an egalitarian apparatus. This in-line tagging allows apparatus features to be displayed non-traditionally, as pop-ups or dynamic text, for example, as well as extracting editorial comments and displaying them in the manner of footnotes or endnotes. External apparatus markup is closer to the traditional manner of keeping such comments outside of the text itself, but since the apparatus element is explicitly linked to the point in the text to which it refers, it is equally possible to display an external apparatus entry in any of the ways suggested above.
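To make the in-line case concrete, here is a sketch of how both the reading text and a footnote-style apparatus could be generated from the same in-line markup. The verse line, its variant, and the witness sigla A and B are invented for illustration, and the sample uses un-namespaced `<app>`, `<lem>` and `<rdg>` elements; a real TEI P5 file would carry the TEI namespace.

```python
import xml.etree.ElementTree as ET

# Invented sample: one verse line with a single in-line apparatus entry.
SAMPLE = ('<l n="1">arma <app><lem wit="#A">virumque</lem>'
          '<rdg wit="#B">virumve</rdg></app> cano</l>')

def reading_text(line):
    """Text as printed: the lemma is kept and other readings are dropped."""
    parts = [line.text or ""]
    for child in line:
        if child.tag == "app":
            lem = child.find("lem")
            parts.append("".join(lem.itertext()) if lem is not None else "")
        else:
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")
    return "".join(parts)

def format_reading(el):
    """A reading followed by its witness sigla, e.g. 'virumque A'."""
    sigla = el.get("wit", "").replace("#", "")
    return ("".join(el.itertext()) + " " + sigla).strip()

def apparatus_entries(line):
    """One traditional apparatus note per <app>, keyed by line number."""
    entries = []
    for app in line.iter("app"):
        readings = [format_reading(el) for el in app if el.tag in ("lem", "rdg")]
        entries.append(line.get("n") + " " + " : ".join(readings))
    return entries

line = ET.fromstring(SAMPLE)
print(reading_text(line))       # arma virumque cano
print(apparatus_entries(line))  # ['1 virumque A : virumve B']
```

The same two functions could equally feed a pop-up or tool-tip display; the layout is a matter for the output stage and is not fixed by the encoding.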

It is especially important in the case of critical apparatus remarks and variants that the markup in an Open Source Critical Edition be explicit and semantic, since this is information that affects the content of the text itself. A critical text that might be repurposed for a future publication or study, imported into a database for contextual searching, or otherwise used as source rather than merely output, needs to contain all of the text, including all feasible readings, somewhere in the encoding. Many scholars might be primarily or even only interested in searching the privileged reading that the editor prints on the page; choosing to trust, for example, West's judgement when studying Homer's use of the verb κηλέω. But a scholar re-using the Open Source text of Homer to create her own edition (with whatever slant or digital angle she brings to it) may need to examine every possible instance of this verb, even those occurring only in very few manuscripts or rejected by the consensus of the best modern scholarship. Not all digital editions will have the opportunity to list every single manuscript idiosyncrasy in what may be a huge and complex tradition, but a truly digital critical apparatus will no doubt need to encode many such variants, even in the case of the most modest edition.
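The two kinds of search described here, over the privileged text alone or over every encoded reading, can be sketched as a single function with a switch. The sample uses placeholder words rather than real Homeric variants, and the function names are invented; the only assumption about the encoding is that rejected readings sit in `<rdg>` elements.

```python
import xml.etree.ElementTree as ET

# Placeholder text: "gamma" is printed in line 2 but is only a variant in line 1.
SAMPLE = """<lg>
<l n="1">alpha <app><lem wit="#A">beta</lem><rdg wit="#B">gamma</rdg></app></l>
<l n="2">gamma delta</l>
</lg>"""

def printed_words(line):
    """Words of the line as printed: everything except the content of <rdg>."""
    words = []

    def walk(el):
        if el.tag == "rdg":
            return  # skip rejected readings entirely
        words.extend((el.text or "").split())
        for child in el:
            walk(child)
            words.extend((child.tail or "").split())

    walk(line)
    return words

def variant_words(line):
    """Words that occur only inside <rdg> elements of the line."""
    return [w for rdg in line.iter("rdg") for w in "".join(rdg.itertext()).split()]

def lines_with(xml_source, word, include_variants=False):
    """Line numbers on which the word occurs, optionally counting variants."""
    found = []
    for line in ET.fromstring(xml_source).iter("l"):
        words = printed_words(line)
        if include_variants:
            words = words + variant_words(line)
        if word in words:
            found.append(line.get("n"))
    return found

print(lines_with(SAMPLE, "gamma"))                         # ['2']
print(lines_with(SAMPLE, "gamma", include_variants=True))  # ['1', '2']
```

The second search is only possible if the variants are present in the encoding in the first place, which is the point at issue.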

Consistency of markup
Another important issue that I would like to address for the purposes of this meeting is the question of how much markup we should aim for in an Open Source Critical Edition. Indeed, if we are looking to build a distributed collection of compatible texts that can be handled, displayed, searched, and processed by a single corpus or database (or that can feed into more than one such engine), do we need consistency of markup at all? Clearly we need enough information attached to the text for it to be identifiable within the registry and protocols for identification, but these are issues that are being discussed by other sessions. How much does the structure of the critical markup need to be dictated by the collection?

It has been an assumption of my discussion of markup, above, that a sensible (perhaps the only reasonable) approach to marking up a digital critical edition would be to use TEI for the structural and semantic layers of markup (possibly with some stand-off solution to keep the base text clear for further modification). Would it be sufficient for consistency and interoperability that all the files validate to the TEI P5 schema for interchange, for example? There are often many different ways to tag the same phenomenon in TEI, depending on the editor's priorities and focus, for example, or the practice of a local interest group. Some of these varieties of practice might make a mixed collection of disparately marked-up texts less useful for global searching and processing, especially if "deep" features such as names and subjects are tagged.

One solution to this problem would be to recommend the use of a subset of TEI for use in Open Source Critical Editions, with specific options dictated for encoding various semantic distinctions, reference points, deep features, etc. This is the approach taken, for example, by the EpiDoc Guidelines and the constellation of epigraphic, papyrological, sigillographic, numismatic and other projects that take a similar approach. While EpiDoc is (mildly adapted) TEI XML, and while the Guidelines allow for a certain amount of flexibility of structure and editorial decision-making by individual projects and publications, in many cases the range of options available has been drastically cut down, reducing the opportunity for confusion and problematic variety. A search across a large corpus of EpiDoc or *Doc files will find more consistency, and therefore more useful results, than across a free corpus of TEI XML with no restrictions.
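A very rough idea of what enforcing such a subset might involve is sketched below. The list of permitted elements is hypothetical and far shorter than any real set of guidelines, the sample is un-namespaced, and a real project would validate against a schema rather than a hand-written list; the sketch only shows that conformance to a subset is something a machine can check.

```python
import xml.etree.ElementTree as ET

# Hypothetical subset: the only element names this imaginary corpus accepts.
ALLOWED = {"ab", "lb", "supplied", "unclear", "gap", "persName"}

def elements_outside_subset(xml_source, allowed=ALLOWED):
    """Return the sorted names of elements used but not in the agreed subset."""
    root = ET.fromstring(xml_source)
    return sorted({el.tag for el in root.iter()} - set(allowed))

# Invented sample using one element (<sic>) that the subset does not permit.
SAMPLE = ('<ab><lb n="1"/>text <supplied reason="lost">abc</supplied> '
          '<sic>x</sic> <persName>N</persName></ab>')

print(elements_outside_subset(SAMPLE))  # ['sic']
```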

A second possible solution to this problem, one which allows for a more forgiving attitude to markup in an OSCE corpus, might be simply to have the specific decisions made about choice of TEI markup within a project documented in a machine-readable format. A TEI P5 schema declared in the RelaxNG language and built using the online tool Roma (4) will include an ODD file which documents in machine-readable form the decisions made in the definition of the TEI schema subset. It is not yet possible to use two ODDs to compare the schemas of two independent TEI projects and create a consolidated version of both with all diverging markup decisions normalised in a fully automated process, since the important information about element usage in an ODD is in the form of human-readable prose comments. Similarly, the statements about encoding in the <encodingDesc> of the TEI header are ultimately prose, even if tagged and structured prose, and it is hard to see how a machine will detect the equivalence of one project's markup choice to another's, for example. It might yet be possible to devise a secondary typology, using for example Leiden or Krummrey-Panciera numbered categories, to make this documentation more machine-readable. At this point, however, it becomes arguable that agreeing on such a typology is about as arduous as agreeing on a strict markup scheme in the first place.

Essential questions
The important questions that I think we need to take away from this session include:


 * 1) Does an Open Source Critical Edition need to be marked up in TEI, or is a more relaxed approach to digital texts acceptable?
 * 2) How much attention needs to be paid to the apparatus criticus of a digital edition?
 * 3) How much consistency do we need to recommend for Open Source texts that we hope will be compatible with one or more large publication or search projects?
 * 4) How well can these hypothetical (and tangible!) large projects handle a mixture of deeply encoded and merely structurally marked-up documents in the same process?