OSCE Smith Paper

Neel Smith, College of the Holy Cross: OSCE position paper

=An architecture for a distributed library incorporating open-source critical editions=

In this position paper, I outline recent work at the Center for Hellenic Studies (Washington, D.C.) on a suite of protocols for creating a distributed library of interoperable scholarly resources. In the opening section, I provide some background to our approach. In the following section, I describe the service stack we are currently testing in collaboration with the Perseus project. At our meeting in London, I hope to use my introductory time to illustrate the ideas presented here with a couple of concrete examples of applications.

Background: digital publications
Designing a technical architecture for scholarly publication is the last link in a logical chain. We must first define what we mean by “publication,”  identify its distinctive features, and translate those into functional requirements. Functional requirements in turn can be expressed as technical requirements, and we can then choose an architecture that satisfies those requirements. Here I summarize very briefly views on those topics I have spelled out more fully in a paper entitled "[Digital publication for digital libraries>http://chs75.harvard.edu/projects/diginc/techpub/digitalpub]."

In the scholarly world, publication serves as the *permanent record of reference* for scholarly work. In any medium therefore, scholarly publications must be designed for both *permanence* and *citability*.

I would translate these defining characteristics of scholarly publication into at least three functional requirements:


 * it must be identically replicable
 * it must be alienated from its author
 * it must be citable in a fixed version

We could rephrase these functional requirements by defining the form of scholarly published works as *works possessing an explicitly identified edition and explicitly identified citation scheme, that can be irrevocably and identically replicated*.

In "[Digital publication for digital libraries>http://chs75.harvard.edu/projects/diginc/techpub/digitalpub]," I develop arguments for a list of technical specifications that are necessary to satisfy this understanding of digital publication. Rather than repeat those in detail here, I wish simply to underscore that a digital publication has to capture the *functionality* rather than the appearance of a scholarly work. Beyond identifying appropriate ways to represent an open-source critical edition (e.g., recommended applications of TEI encoding to a document), then, we need to develop an infrastructure for working with critical editions in the broader context of a distributed and interoperating digital library.

Architecture: digital libraries.
The natural architecture permitting interactions among potentially distributed objects is a suite of network services following defined protocols. In much of our work defining services for scholarly work, we have been influenced by the pioneering work of the Open Geospatial Consortium developing service protocols to enable distributed GIS operation. (See the Geospatial Consortium home page.)

Our initial goal is to work with the most fundamental kinds of services to provide functionality that other services can in turn build on. A structured “diff” service describing differences in the structure and content of two XML fragments, for example, might be layered on top of an elementary retrieval service that abstracts the problem of retrieving text passages from canonical references. The structured diff service in turn might serve as a base for a higher-order service statistically summarizing or analyzing differences in two pieces of text.
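The layering described above can be illustrated with a minimal sketch. It is not the CHS implementation: the function names, the in-memory `PASSAGES` table, and the placeholder passages are all assumptions made for illustration, and the diff here compares plain words rather than XML structure.

```python
import difflib
from typing import Dict, List, Tuple

# Hypothetical stand-in for a retrieval service: (version, reference) -> passage.
# The passages are placeholder strings, not real text.
PASSAGES: Dict[Tuple[str, str], str] = {
    ("version-a", "1.1"): "alpha beta gamma delta",
    ("version-b", "1.1"): "alpha beta epsilon delta",
}


def retrieve(version: str, reference: str) -> str:
    """Base layer: resolve a canonical reference to a passage of text."""
    return PASSAGES[(version, reference)]


def diff(reference: str, version_a: str, version_b: str) -> List[Tuple[str, str, str]]:
    """Middle layer: word-level differences between two versions of a passage."""
    words_a = retrieve(version_a, reference).split()
    words_b = retrieve(version_b, reference).split()
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, " ".join(words_a[i1:i2]), " ".join(words_b[j1:j2])))
    return changes


def summarize(reference: str, version_a: str, version_b: str) -> Dict[str, int]:
    """Top layer: count the differences by kind."""
    counts: Dict[str, int] = {}
    for tag, _, _ in diff(reference, version_a, version_b):
        counts[tag] = counts.get(tag, 0) + 1
    return counts
```

Each layer calls only the layer directly beneath it, so the retrieval layer could be replaced by a remote service without changing the diff or the summary.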

Part of the attraction of the service model is its technical simplicity, since protocols for scholarly services can be layered on top of well established technical protocols: HTTP as the transport mechanism, XML for service requests and replies. Part of the attraction, too, is that this hierarchical model corresponds to a scholarly ideal: it simultaneously allows for high-level abstraction of complexity, while ensuring the transparency of supporting or underlying functionality.
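As a rough sketch of what HTTP as transport and XML for replies can look like from a client's side, the following builds a request URL and reads a reply. The parameter names (`request`, `ref`), the `<reply>` and `<passage>` element names, and the example.org address are hypothetical; they are not taken from the CHS protocol drafts.

```python
import xml.etree.ElementTree as ET
from typing import List
from urllib.parse import urlencode


def request_url(base_url: str, request: str, **parameters: str) -> str:
    """Encode a service request as an HTTP GET URL (parameter names are hypothetical)."""
    query = urlencode({"request": request, **parameters})
    return f"{base_url}?{query}"


def passages_in_reply(reply_xml: str) -> List[str]:
    """Extract the text of every <passage> element from an XML reply."""
    root = ET.fromstring(reply_xml)
    return [element.text or "" for element in root.iter("passage")]


# A hand-written reply standing in for what a server might send back.
SAMPLE_REPLY = "<reply><passage>placeholder text</passage></reply>"
```

Because both sides of the exchange are plain HTTP and XML, a client needs nothing beyond a standard library to take part.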

Fundamental services
While we can easily imagine interesting, complex services we might like to have as easily available as an internet access point, I would argue that the most fundamental services for scholarly publication are those supporting the *simple identification and retrieval of fundamental objects with stable, location-independent references* -- services, in other words, that directly support our view of publication as a permanent and citable record.

For many kinds of material we refer to, citation is comparatively straightforward. We often work with collections of discrete objects cited simply by a unique identifier: an “author-year” label to identify one entry in a bibliographic list, a museum inventory number to identify a specific archaeological artifact, a catalog number to identify a listing in a collection like Erbse's scholia vetera of the Iliad. Even when we refer to specific properties of an object (the author property of a bibliographic entry, the die axis of a coin, Erbse's source attribution of a scholiastic comment ...), we continue to cite the object as a discrete entity. One fundamental service we need then is a service for identification and retrieval of discrete entities in a collection.

Texts present a different challenge. In the first place, the entities we refer to with textual citations are not simple discrete objects, as librarians attempting to catalog texts are aware. The Functional Requirements for Bibliographic Records (FRBR) describes a hierarchical model for texts, from the notional work, to the expression of that work in some version, to the manifestation of a version in some concrete form, to an individual item. (A good introduction to FRBR is the U.S. Library of Congress' page [What is FRBR?>http://www.loc.gov/cds/FRBR.html]). Classicists and biblical scholars have long implied a similar but not identical abstraction of notional work from particular versions in their use of version-independent, canonical reference systems. One difference is that classicists' citation practice normally associates texts in groups or corpora that may or may not appear in documentary components of FRBR; another is that FRBR's “manifestation” distinguishes different reproductions of a given expression (such as identical printings of a given edition) that may not be significant for scholarly citation.

FRBR, of course, as a cataloging model does not address citation, and a second problem texts present is that we must allow for continuous citation. Canonical citation schemes are often hierarchical (e.g., book/chapter/section of a prose work); our service must support citation to this level of granularity, and beyond that should allow citation of subsections of text for a specific version.

A second fundamental service, then, is a service for identifying texts and retrieving textual references in accordance with the semantics of citation practice traditional in fields like classics or biblical studies.
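A minimal sketch of retrieval by hierarchical reference follows. It assumes a three-level book/chapter/section scheme with dot-separated references; the nested dictionary, the function names, and the placeholder strings are illustrative assumptions, not part of the CHS services.

```python
from typing import Dict, List, Tuple, Union

# A node is either a leaf passage or a mapping from labels to child nodes.
Node = Union[str, Dict[str, "Node"]]

# Placeholder text organized as book / chapter / section.
TEXT: Dict[str, Node] = {
    "1": {
        "1": {"1": "text of 1.1.1", "2": "text of 1.1.2"},
        "2": {"1": "text of 1.2.1"},
    },
    "2": {"1": {"1": "text of 2.1.1"}},
}


def leaves(node: Node, prefix: str) -> List[Tuple[str, str]]:
    """Collect every (reference, passage) pair at or below a node."""
    if isinstance(node, str):
        return [(prefix, node)]
    found: List[Tuple[str, str]] = []
    for label, child in node.items():
        found.extend(leaves(child, f"{prefix}.{label}"))
    return found


def retrieve(reference: str) -> List[Tuple[str, str]]:
    """Retrieve all passages contained in a reference at any level of the hierarchy."""
    node: Node = TEXT
    for label in reference.split("."):
        if isinstance(node, str):
            raise KeyError(reference)  # reference is deeper than the citation scheme
        node = node[label]
    return leaves(node, reference)
```

A reference to a book returns every section in it, while a full three-part reference returns a single section; citation of a span within a section would need a further convention on top of this.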

To make these two methods of identifying and retrieving citable objects useful together in a distributed library, we can define a third basic service: indexing information to either form of citation. An index of personal names in a text, for example, might literally index strings with forms of names to a text reference, but it might also, more usefully for many purposes, index identifiers in a prosopographic collection to textual references. The identifier could both disambiguate superficial strings of characters in the text, and provide a key to the prosopographic collection.
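The difference between indexing strings and indexing identifiers can be sketched as follows. The data model, the identifiers, and the text references are placeholders invented for illustration (the references are not real citations), and the function names are assumptions rather than part of the Reference Index Service.

```python
from typing import Dict, List, NamedTuple


class IndexEntry(NamedTuple):
    text_reference: str  # canonical reference to a passage
    surface_form: str    # the string as it appears in the text
    person_id: str       # identifier in a prosopographic collection


# Placeholder collection: two people who share a name.
PERSONS: Dict[str, str] = {
    "person-1": "Ajax, son of Telamon",
    "person-2": "Ajax, son of Oileus",
}

# Placeholder index entries; the references are invented.
INDEX: List[IndexEntry] = [
    IndexEntry("1.10", "Ajax", "person-1"),
    IndexEntry("2.25", "Ajax", "person-2"),
    IndexEntry("3.40", "Ajax", "person-1"),
]


def references_by_string(surface_form: str) -> List[str]:
    """Index on the literal string: cannot tell the two people apart."""
    return [e.text_reference for e in INDEX if e.surface_form == surface_form]


def references_by_person(person_id: str) -> List[str]:
    """Index on the identifier: returns only one person's passages."""
    return [e.text_reference for e in INDEX if e.person_id == person_id]


def persons_at(text_reference: str) -> List[str]:
    """Use the identifier as a key into the prosopographic collection."""
    return [PERSONS[e.person_id] for e in INDEX if e.text_reference == text_reference]
```

Searching on the string returns all three passages, while searching on an identifier separates the two people and links each passage to a collection record.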

At CHS, we have drafted standards for these three services, and have implemented each as a Java servlet. For more detailed information, see this page on "[Fundamental services for scholarly reference>http://chs75.harvard.edu/projects/diginc/techpub/tic]."

Ancillary services and standards
As the abstract in the conference program indicates, to create an effectively interoperable network of resources, we need to agree not only on service protocols, but on the meaning of standard *values* that can be used in the framework of the protocol. Having an agreed-upon system for finding what texts a service offers, discovering their citation schemes, and requesting sections of the text in that scheme will not help us to interoperate if we can't agree on how to identify Herodotus' Histories, or an inscription from Aphrodisias. To support the three fundamental services previously described, we have also developed ancillary services and standards to address these issues.

Texts cited by canonical reference are a comparatively stable set of resources. Technically, we need a simple service that resolves some kind of query string to standard identifiers, comparable to the [uBio service>http://www.ubio.org/] that scientists can use to automatically search for standard taxonomic identifiers for species. In contrast to uBio, however, our service must be able to support a hierarchical scheme of identifiers so that we can refer to texts at the level of works, versions (such as a specific translation or edition) or individual exemplars. To fill this technical gap, we have developed a hierarchical Registry service (see [fuller information with links>http://chs75.harvard.edu/projects/diginc/techpub/registry]).
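A hierarchical registry of this kind might be sketched as below. The identifiers (`hdt.histories` and its children), the labels, and the function names are invented for illustration and do not reflect the identifiers actually assigned by the CHS Registry.

```python
from typing import Dict, List, NamedTuple, Optional


class Entry(NamedTuple):
    label: str
    level: str             # "work", "version" or "exemplar"
    parent: Optional[str]  # identifier one level up, if any


# Hypothetical registry contents.
REGISTRY: Dict[str, Entry] = {
    "hdt.histories": Entry("Herodotus, Histories", "work", None),
    "hdt.histories.greek1": Entry(
        "Herodotus, Histories (a Greek edition)", "version", "hdt.histories"),
    "hdt.histories.eng1": Entry(
        "Herodotus, Histories (an English translation)", "version", "hdt.histories"),
    "hdt.histories.greek1.copy1": Entry(
        "Herodotus, Histories (an exemplar of the Greek edition)", "exemplar",
        "hdt.histories.greek1"),
}


def resolve(query: str, level: Optional[str] = None) -> List[str]:
    """Resolve a query string to identifiers, optionally at one level only."""
    needle = query.lower()
    return [
        identifier
        for identifier, entry in REGISTRY.items()
        if needle in entry.label.lower() and (level is None or entry.level == level)
    ]


def children(identifier: str) -> List[str]:
    """List the identifiers registered directly beneath another identifier."""
    return [i for i, entry in REGISTRY.items() if entry.parent == identifier]
```

A query can thus be answered at the level of the notional work, and a client can then walk down to specific versions or exemplars.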

Institutionally, we need to find appropriate custodians to manage these authority lists for given domains. CHS has taken responsibility for maintaining a Registry service for identifiers of Greek literary works; the Aphrodisias project would be a logical choice to assume responsibility for assigning identifiers to inscriptions from Aphrodisias. (Whether a custodian chooses to administer a service directly or to take editorial responsibility for material served elsewhere is not important.) The internet's DNS offers a good analogy to what we might ultimately develop: the equivalent of a root server or servers is being run at CHS as a Registry of authoritative registries for given domains or corpora; individual registries in turn may be disseminated so that an actual application might consult a local copy of the registry information to resolve a reference.

In contrast to canonically cited texts, collections of discrete objects may be created so freely that a comparable system of registries would be unrealistically burdensome. What authority should I register my collection with if I, as an individual scholar, create a database of results of my work, and want to expose it to the world using a Collections Service? I am the only authority responsible for defining the unique identifiers in my collection, so I need a namespace of my own within which I can freely manage my collection's IDs. This is very similar to the problem that authors of XML document structures face, and we are adopting a very similar solution. Just as XML namespaces utilize the same mechanism used for URLs to provide unique namespaces to anyone creating a new XML structure, so we use that structure to provide unique *data namespaces*. At CHS, a Collection of data about digital images is given unique identifiers from the data namespace chs.harvard.edu/datans/images; the Perseus project could, for example, use a data namespace like perseus.tufts.edu/images/namespaces, and if both collections have an image with the same ID, they can be correctly resolved.
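The effect of data namespaces can be shown with a short sketch. The two namespace strings are the ones given above; the local ID `img001`, the record descriptions, and the `QualifiedId` type are hypothetical.

```python
from typing import Dict, NamedTuple


class QualifiedId(NamedTuple):
    namespace: str  # data namespace controlled by the collection's owner
    local_id: str   # identifier unique only within that namespace


# Two collections that happen to use the same local ID (placeholder records).
COLLECTIONS: Dict[str, Dict[str, str]] = {
    "chs.harvard.edu/datans/images": {"img001": "record in the CHS collection"},
    "perseus.tufts.edu/images/namespaces": {"img001": "record in a Perseus collection"},
}


def resolve(qualified_id: QualifiedId) -> str:
    """Look up a record by namespace first, then by local ID."""
    return COLLECTIONS[qualified_id.namespace][qualified_id.local_id]


chs_image = QualifiedId("chs.harvard.edu/datans/images", "img001")
perseus_image = QualifiedId("perseus.tufts.edu/images/namespaces", "img001")
```

The two qualified IDs compare as different values even though their local parts are identical, so neither collection owner needs to coordinate ID assignment with the other.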

We need to consider one further important difference between reference by unique ID and the kind of canonical reference we use for texts. Unique IDs can be represented by simple strings of characters; the semantics of a reference within a hierarchical citation scheme to a text in a FRBR-like hierarchy cannot. We have therefore proposed a syntax for a notation scheme with explicit semantics, following the requirements of the IETF's URN system. These Canonical Text Services URNs make it possible to reduce the complexity of a reference like “First occurrence of the string 'cano' in line 1 of book 1 of Vergil's Aeneid” to a flat string that can then be used by any application that understands CTS-URNs. (For more information and links, see [CTS URNs>http://chs75.harvard.edu/projects/diginc/techpub/cts-urn].)
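To show how a flat string can carry these semantics, here is a sketch of a parser. The syntax it accepts, `urn:cts:<namespace>:<work hierarchy>:<passage>@<string>[<n>]`, is an assumption made for illustration; the authoritative syntax is defined in the linked CTS URN documentation. The identifiers `latinLit` and `vergil.aeneid` in the usage note are placeholders, not registered values.

```python
import re
from typing import NamedTuple, Optional, Tuple


class CtsUrn(NamedTuple):
    namespace: str
    work: Tuple[str, ...]      # levels of the FRBR-like work hierarchy
    passage: Tuple[str, ...]   # levels of the citation scheme
    substring: Optional[str]   # string cited within the passage, if any
    occurrence: Optional[int]  # which occurrence of that string, if given


# Assumed form of a sub-reference: a string with an optional [n] index.
_SUBREF = re.compile(r"^(?P<string>[^\[\]]+)(?:\[(?P<n>\d+)\])?$")


def parse(urn: str) -> CtsUrn:
    """Split a URN in the assumed form into its semantic components."""
    parts = urn.split(":")
    if len(parts) != 5 or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN in the assumed form: {urn!r}")
    namespace, work, passage = parts[2], parts[3], parts[4]
    substring: Optional[str] = None
    occurrence: Optional[int] = None
    if "@" in passage:
        passage, subref = passage.split("@", 1)
        match = _SUBREF.match(subref)
        if match is None:
            raise ValueError(f"malformed sub-reference: {subref!r}")
        substring = match.group("string")
        if match.group("n") is not None:
            occurrence = int(match.group("n"))
    return CtsUrn(namespace, tuple(work.split(".")), tuple(passage.split(".")),
                  substring, occurrence)
```

Under these assumptions, `parse("urn:cts:latinLit:vergil.aeneid:1.1@cano[1]")` yields the work hierarchy, book 1, line 1, the string `cano`, and occurrence 1 as separate fields.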

For an overview of CHS work on these topics, see "[Ancillary services supporting scholarly reference>http://chs75.harvard.edu/projects/diginc/techpub/ancillary]."

Composite objects and the TICI stack
An extraordinary range of scholarly citation can be handled through the simple mechanisms of Collection Services and Canonical Text Services, while indexing using Reference Index Services enables a complex web of associations to be built on top of these citation mechanisms. We want to incorporate spatial manipulation into our stack of services, but for the present are very happy to let others, including the Open Geospatial Consortium, take the lead in this area. In the summer of 2006, we began to build the first examples of compound objects, adding more specialized manipulation for binary images to the simple identification and retrieval of Collections and Canonical Texts.

Image Processing Services perform operations such as scaling an image, selecting a subsection of it, or altering its brightness and contrast. (See “[Image Processing Services>http://chs75.harvard.edu/projects/diginc/techpub/images].”) By itself, an image processing service is of little use; it really becomes valuable only in association with some other information. Collections services already provide a ready means of working with metadata about each image; Reference Index Services make it possible to associate binary image identifiers with objects in other collections, or with texts. An index of, say, page images to CTS URNs could define the relation between a text and images of pages in a specific edition; a CTS instance could provide access to an XML text, while a related Image Processing Service could work with the image data.

At CHS, the result in the fall of 2006 is a stack of four principal interrelated services (Texts, Indexes, Collections and Images) that together provide a sufficient infrastructure for a surprising range of scholarly publications. We have been closely collaborating with the Perseus project over the last several months to test these services, and build end-user applications on top of them. Text browsing and reading applications work simultaneously with CHS implementations of Canonical Text Services in Washington, D.C., at the College of the Holy Cross in Worcester, Massachusetts, and at Furman University in Greenville, S.C., as well as with an independent implementation using completely different back-end technology at the Perseus project at Tufts University.

For more information, see "[An overview of services for composite objects>http://chs75.harvard.edu/projects/diginc/techpub/composites]"

Current work: Scenarios
Even as small a set of services as the TICI stack allows for very complex networks of information, and it is becoming increasingly apparent that we need to plan now for a further dimension to our work: a means of making machine-parseable statements about the relations among these resources.

In September 2006, we began work on a simple XML schema for inventorying and describing the relations among stable, citable resources anywhere in the TICI stack. These inventories, which we are provisionally calling “Scenarios,” are in a sense a digital extension of bibliography: they add to the static lists of print bibliography a specification of how resources relate to each other. Scenarios are declarative or descriptive, not functional: applications may use the information in a Scenario as they choose, but just as a print bibliography ideally catalogs resources needed to read a print publication, so Scenarios catalog resources needed to read a digital publication.

A simple text reader can, for example, list a single resource with a CTS URN referring to a passage in a text; in  this instance, the Scenario amounts to a simple bookmark. But a text reader that filters the text with information from an index might overlay links on the words of a Latin text to a morphological index. Its Scenario can specify how the text resource and index relate. An even more sophisticated reader might in turn associate the lemma with other morphological data; this could appear as a Collection in the application's Scenario.
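Since the Scenario schema is still preliminary, the following is only a guess at what a declarative inventory of this kind might look like. The element and attribute names (`scenario`, `resource`, `relation`, `service`, `ref`, `type`) and the example identifiers are invented for illustration and are not taken from the CHS schema.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def build_scenario() -> ET.Element:
    """Describe a reader that filters a text passage through a morphological index."""
    scenario = ET.Element("scenario")
    ET.SubElement(scenario, "resource", {
        "id": "text", "service": "texts",
        "ref": "urn:cts:example:group.work:1.1",  # placeholder URN
    })
    ET.SubElement(scenario, "resource", {
        "id": "morphology", "service": "indexes",
        "ref": "example.org/datans/morphology",   # placeholder data namespace
    })
    ET.SubElement(scenario, "relation", {
        "from": "text", "to": "morphology", "type": "filtered-by",
    })
    return scenario


def relations(scenario: ET.Element) -> List[Tuple[str, str, str]]:
    """Read the declared relations back as (from, type, to) triples."""
    return [
        (r.get("from", ""), r.get("type", ""), r.get("to", ""))
        for r in scenario.findall("relation")
    ]


# Serialize to a string, as an application would receive it, then read it back.
serialized = ET.tostring(build_scenario(), encoding="unicode")
declared = relations(ET.fromstring(serialized))
```

The document only declares which resources exist and how they relate; deciding what to do with a `filtered-by` relation is left entirely to the application that reads it.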

Our work on Scenarios is very preliminary at this point, but illustrates a number of themes that are relevant to the broader topic of this conference: the leverage we can obtain from building on openly available resources, the ways very simple, even minimal resources can in their complex interrelations lead to  sophisticated scholarly productions, and the ease of interoperation that is possible when we can work with common protocols and standards.

More information
 * Documentation of technical work at CHS, [Digital Incunabula>http://chs75.harvard.edu/projects/diginc/home]
 * "Update blog" with syndicated feeds for [announcements and updates>https://chs76.harvard.edu/weblog/neel/] from the CHS Technical Working Group

License (c) Neel Smith 2006 Distributed under the [Creative Commons Attribution-Share-alike license v. 2.5>http://creativecommons.org/licenses/by-sa/2.5/]