OSCE Crane Paper
(Position paper at the OSCE Programme workshop, September 2016. Saved here for archival purposes. Please do not delete.)
We need a comprehensive library of initial editions, openly accessible and freely available for re-use in derivative works. This paper outlines one strategy for starting with print editions and moving into a more purely digital stage.
There are two components to this argument, both on the Perseus Development Wiki:
- 1 Open Content Scholarly Sources
- 2 Next Generation Editions
- 3 Building a digital library of primary sources
Open Content Scholarly Sources
Google, Microsoft, Yahoo and other internet giants are now creating digital libraries designed to become more comprehensive than any academic library in human history. The current philosophy of these efforts stresses open access: the creators of the Google project and the Internet Archive have both expressed a dedication to it. Open access also maximizes the potential audience and thus reinforces the advertising-based business model on which these internet giants have founded their library efforts.
The funders, however, retain varying rights to their work. Google, for example, has now made available full PDF image books of public domain documents but it asserts proprietary rights over the page images and does not allow third parties to apply their own OCR or document recognition software. The Open Content Alliance in principle encourages its partners to share everything but individual funders can impose their own restrictions on what they submit to OCA.
We are therefore creating a completely open source library of core resources such as reference works and critical editions. Our goal is to provide access to foundational information and also a foundation of materials that subsequent authors can modify, update, expand, and otherwise improve.
Our selection criteria differ from those of the print world. A print library picks the best, most up-to-date documents available, knowing that print publications can be replaced but cannot change. In a true digital library, documents can be dynamic and evolve in real time. A recent encyclopedia will, presumably, be superior to another that is a century old. But if the century-old encyclopedia can be freely updated and attracts high quality modifications, it can evolve and become more up-to-date and more authoritative than its frozen print counterpart.
The classics component of the Open Content Scholarly Library that Perseus is helping create is being made available under a sharealike/attribution/non-commercial Creative Commons license. It contains the following:
- Source texts of Greek and Latin: We have already released c. 8.5 million words of Greek and Latin source texts in TEI-compliant XML. We have also digitized several hundred volumes of source texts. These will be available as image books with searchable OCR and, where feasible, XML transcriptions. Unlike most previous collections, this includes, where possible, multiple editions as well as traditional lists of places where on-line editions differ from editions not yet available on-line.
- Lexica of Greek and Latin: These include major works such as the Liddell Scott Jones Greek-English Lexicon and the Lewis and Short Latin-English Lexicon as well as more specialized works such as Cunliffe's Homeric Lexicon.
- Grammars: These include student grammars such as Smyth's Greek Grammar and Allen and Greenough's Latin Grammar as well as extensive scholarly works such as Kühner-Gerth.
- Commentaries: These include scholarly editions as well as school commentaries with linguistic annotations. Commentaries lend themselves particularly well to electronic publication, which is optimally designed for the production, display and management of annotations.
- Tools: These include Morpheus, the morphological analysis system developed in the late 1980s and still providing useful analyses of Greek and Latin words. More importantly, this will include the databases with c. 100,000 stems and endings, mined from many sources, and of potential use to third party morphological analysis systems. All the core tools in the Perseus Digital Library have been rewritten in Java and will be available as additions to institutional repositories such as Fedora and to any developers.
- FRBR Catalog Records for source texts: Large projects such as dictionaries and text corpora have developed checklists of editions which they have used. We are creating a modern catalog that builds on prior work (e.g., we use the author and work numbers developed by the TLG and PHI for Greek and Latin authors) but provides an extensible architecture that can manage multiple editions, translations (e.g., English, French and German translations of an author), multiple versions of the same edition (e.g., an image book vs. a TEI transcription), and multiple citation schemes (e.g., sections vs. chapters in Cicero).
- Authority lists of people, places, dictionary entries, organizations, etc.: The reference works that we are producing lay the foundation for a comprehensive, extensible set of authority lists -- shared names with which we can uniquely identify particular people, places, dictionary entries, organizations, etc. Such authority lists are difficult to build: experts may differ on which Sallust a particular passage designates and will never all agree on when we have one dictionary word with two distinct meanings vs. two distinct dictionary words. Nevertheless, all scholarly work depends upon the entries that appear in our reference works, and electronic authority lists, however imperfect, are essential tools for large digital collections.
- Service providers: we would like the released data to be useful to as many groups and in as many ways as possible. Thus, we hope to see the content in Google and the Open Content Alliance as well as in scholarly environments such as Chicago's Philologic and the Canadian TAPOR project.
- Experts in the field: we hope that experts in the field will revise and extend every document that we release, with versioning systems tracking these changes and allowing experts to get the credit which they deserve for the work that they do.
- General students of the field: we hope to see Wiki based commentaries in which non-experts working their way through a text pose and answer the questions which puzzle them.
- Advanced service developers: we hope that developers will mine the encyclopedias to drive their named entity identification systems (e.g., analyze the articles in Smith's to determine which Alexander a particular document is discussing), sense disambiguation (e.g., which sense of a word in an on-line lexicon is in play in a given passage), machine translation (e.g., mine the parallel texts and translations and the bilingual dictionaries so that a modern machine translation system can provide Greek/English, Latin/English translations etc.).
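The catalog architecture described in the FRBR item above can be pictured with a small data model. The following is only a sketch under stated assumptions: the class names follow the general FRBR vocabulary (work, expression, manifestation), while every field name, identifier and path is a hypothetical placeholder rather than part of any actual Perseus catalog.

```python
from dataclasses import dataclass, field


@dataclass
class Manifestation:
    """One concrete version of an edition, e.g. an image book or a TEI file."""
    kind: str       # hypothetical labels: "image-book", "tei-xml"
    location: str   # hypothetical path or identifier


@dataclass
class Expression:
    """One edition or translation of a work."""
    editor: str
    language: str   # e.g. "lat", "eng", "fra", "deu"
    is_translation: bool = False
    manifestations: list = field(default_factory=list)


@dataclass
class Work:
    """A work keyed by author and work numbers (placeholders in this sketch)."""
    author_id: str
    work_id: str
    title: str
    citation_schemes: list = field(default_factory=list)
    expressions: list = field(default_factory=list)

    def editions(self):
        """Source-language editions only."""
        return [e for e in self.expressions if not e.is_translation]

    def translations(self, language=None):
        """Translations, optionally filtered by language code."""
        return [e for e in self.expressions
                if e.is_translation and (language is None or e.language == language)]


# Placeholder records, invented purely for illustration.
edition = Expression(
    editor="Editor A",
    language="lat",
    manifestations=[
        Manifestation("image-book", "scans/example"),
        Manifestation("tei-xml", "tei/example.xml"),
    ],
)
translation = Expression(editor="Translator B", language="eng", is_translation=True)
work = Work(
    author_id="author-0001",
    work_id="work-001",
    title="Example Work",
    citation_schemes=["chapter", "section"],
    expressions=[edition, translation],
)
```

Keeping citation schemes on the work and file formats on the manifestation means a new scan or a new translation can be added without touching existing records.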
Next Generation Editions
We propose a new generation of primary source corpora that are:
- Permanent: The texts are not leased from a commercial vendor over a period of time but are permanently accessible, with reference copies and versioning information stored in multiple institutional repositories for long term preservation as well as freely available.
- Openly accessible: Cultural heritage primary sources in the public domain should be openly accessible to all. If it is necessary to restrict access to newly digitized materials in order to secure funding, that restriction should be clearly delimited and as short as possible: e.g., those who fund digitization may have exclusive access for five years before the texts are released for universal access.
- Multi-versioned: The texts themselves can be updated, with all changes tracked in a versioning system. Alternately, the texts provide a stable foundation for standoff markup representing textual variants or advanced interpretation.
- Paid for and maintained by academic libraries: While external funding may help begin this process, library acquisition budgets are the long term source of funding for costs such as data entry. Libraries already pay for the production of digital resources by commercial, for-profit entities, which restrict access to public domain content. The same library budgets can support open access databases built on public domain source materials.
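The "multi-versioned" property above mentions standoff markup layered over a stable text. A minimal sketch of that idea follows, assuming character offsets as the addressing scheme; the `Annotation` class, its fields and the sample variant are all invented for illustration and do not describe an existing system.

```python
from dataclasses import dataclass

# The base text is never modified; annotations only point into it.
BASE_TEXT = "arma virumque cano"


@dataclass(frozen=True)
class Annotation:
    start: int    # offset of the first character covered
    end: int      # offset one past the last character covered
    kind: str     # hypothetical labels: "variant", "note"
    value: str    # replacement reading or note text
    source: str   # who or what the annotation is attributed to


def covered(base, ann):
    """Return the stretch of base text an annotation refers to."""
    return base[ann.start:ann.end]


def apply_variants(base, annotations):
    """Build a derived text with all variant readings substituted."""
    out = base
    variants = [a for a in annotations if a.kind == "variant"]
    # Work from right to left so that earlier offsets stay valid.
    for ann in sorted(variants, key=lambda a: a.start, reverse=True):
        out = out[:ann.start] + ann.value + out[ann.end:]
    return out


# An invented orthographic variant attributed to a hypothetical witness "X".
annotations = [Annotation(5, 13, "variant", "uirumque", "witness X")]
derived = apply_variants(BASE_TEXT, annotations)
```

Because the annotations live outside the text, any number of competing layers can be stored side by side and each derived version can be regenerated on demand.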
Open Content Editions
The Perseus Project has released TEI conformant XML texts with 55 million words of American English, 13 million words of Latin and Greek source texts, and, for most of the Greek and Latin, corresponding English translations. These texts are available under a Creative Commons non-commercial license: they must be used with attribution; changes must be shared; they cannot be used as part of a commercial corpus. Commercial entities can, however, freely design for-profit services that add value to these openly accessible sources.
While these source texts can freely circulate, they will also be part of the university's permanent institutional repository, thus providing a stable, long term home that will outlast any single project or contributor.
The Greek and Latin corpus contains most of the major works of classical literature. The Perseus Latin Collection contains more than half of the classical corpus and that coverage will approach 100% over the course of 2006/2007.
Working wish lists for Latin and Greek are available for comment/addition.
- Links to page images of paper sources: With Google Library, the Open Content Alliance and Europe's i2010 we see the emergence of digital libraries with millions of books with high quality page images. Copyright restrictions complicate these efforts but solid versions of most major authors are available in the public domain.
- Full coverage including apparatus, introduction, indices etc.: Digital editions can include all information in the print text and not only the text.
- Semantic markup: Markup should reflect meaning and not only appearance.
- Collation of multiple sources: Semantic markup, if applied to the apparatus criticus, should result in machine actionable data, allowing users to compare multiple versions of the same text.
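To make "machine actionable" concrete, the sketch below treats each apparatus entry as a record saying which witnesses carry which reading at which position, and then reconstructs and compares witness texts. The record layout, the sigla "M" and "P", and the placeholder tokens are assumptions chosen for illustration; they are not drawn from a real apparatus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    position: int         # index of the word in the base text
    text: str             # the reading found in the listed witnesses
    witnesses: frozenset  # sigla of the witnesses that carry it


def text_of(base_words, readings, witness):
    """Reconstruct the text as it stands in one witness."""
    words = list(base_words)
    for r in readings:
        if witness in r.witnesses:
            words[r.position] = r.text
    return words


def differences(base_words, readings, a, b):
    """List (position, reading in a, reading in b) wherever two witnesses disagree."""
    text_a = text_of(base_words, readings, a)
    text_b = text_of(base_words, readings, b)
    return [(i, x, y) for i, (x, y) in enumerate(zip(text_a, text_b)) if x != y]


# Placeholder tokens stand in for the words of an edition.
base = ["alpha", "beta", "gamma", "delta"]
apparatus = [
    Reading(1, "BETA", frozenset({"M"})),
    Reading(3, "DELTA", frozenset({"M", "P"})),
]
diffs = differences(base, apparatus, "M", "P")
disagreement_rate = len(diffs) / len(base)
```

This simplified model only handles one-word substitutions; omissions, additions and transpositions would need a richer record.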
Building a digital library of primary sources
The first generation of large scale, on-line text corpora provided transcriptions of primary materials. Projects such as the TLG and the Packard Humanities Institute Latin CD ROM carefully document the copy texts on which their electronic versions depend. The provenance of texts in the extensive Latin corpus at [the Latin Library] is often unclear, with volunteer transcribers blending texts and leaving no trail of their changes.
We now see vast libraries with millions of digital books either in active development or in advanced stages of planning. Most, if not all, of the books now in the public domain will be available in electronic form. Rights disputes may slow digitization of the rest but Google's aggressive stance may, at worst, make publishers more open to pursuing an acceptable arrangement with Yahoo, Microsoft and others now entering this market. In this model, readers view scanned page images but search text automatically generated by OCR software. For many purposes, such "image front" collections are quite effective: narrative prose printed since the mid 19th century lends itself very well to commercial OCR.
Image books do not, however, provide the accuracy and detailed markup that users of primary sources expect. Text collections with millions of words will contain errors for some time after publication but we want to minimize these errors. We want to be able to identify pieces of texts by standard citation (e.g., "Liv. 3.22" should retrieve the text of Book 3, Chapter 22 of Livy's History of Rome). We also want text searches to be able to distinguish between primary text, textual notes and other annotations.
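Retrieval by standard citation can be sketched as a two-step lookup: parse the citation string, then walk a nested structure keyed by book and chapter. The regular expression, the function names and the `corpus` layout below are assumptions made for this example, and the stored passage is a placeholder rather than real text.

```python
import re

# Accepts an abbreviation, a full stop, and dot-separated numbers, e.g. "Liv. 3.22".
CITATION = re.compile(r"^(?P<abbr>[A-Za-z]+)\.\s*(?P<ref>\d+(?:\.\d+)*)$")


def parse_citation(citation):
    """Split a citation into its abbreviation and a tuple of numeric levels."""
    match = CITATION.match(citation.strip())
    if match is None:
        raise ValueError(f"unrecognized citation: {citation!r}")
    levels = tuple(int(n) for n in match.group("ref").split("."))
    return match.group("abbr"), levels


def resolve(citation, corpus):
    """Follow the citation levels down a nested dict and return the passage."""
    abbr, levels = parse_citation(citation)
    node = corpus[abbr]
    for level in levels:
        node = node[level]
    return node


# Hypothetical corpus layout: abbreviation -> book -> chapter -> text.
corpus = {"Liv": {3: {22: "<text of Livy, Book 3, Chapter 22>"}}}
passage = resolve("Liv. 3.22", corpus)
```

A production resolver would also need to handle alternative abbreviations and multiple citation schemes for the same work.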
The following describes an approach of adding structure to digital image books of primary sources.
- Collate an image-front edition with searchable, OCR generated text against other electronic editions of the same text: Many classical texts are available on-line in at least one edition. Once we have scanned a new edition and generated text with OCR, we can collate the OCR against pre-existing electronic editions with surprisingly little effort: half of the word forms in a book length document are generally unique. By comparing sequences of unique word forms in the pre-existing text and the new OCR, we can align the two texts. In our experiments, we have found that we can immediately align one word in ten. We can then compare the intervening sequence (on average nine words long) to identify variations. Variations include errors in data entry (whether in the OCR or in the pre-existing text), deliberate textual variations and non-textual elements such as headers and textual notes. Where a variation involves one or two words and we cannot generate a morphological analysis for the new words, then we probably have an error. If we can generate morphological analyses for the variants in both versions, then we probably have deliberate variations. If we have extra text at the start or end of pages, we probably have headers or notes. If we have extraneous numbers in the source texts, then these are probably citations. Even if we are working with a pre-existing text that contains errors or whose provenance is unknown, we can often use this text to determine that page 123 of edition X contains book 3, lines 33 to 57 of a given edition, thus making the OCR generated edition citable by chapter and verse. If we have an accurate pre-existing text without textual notes, we can compare the results of searching that text with searching the relevant sections of the OCR-generated text. If a word shows up in the OCR generated text but not in the pre-existing text, then we probably have a match in the textual notes.
While OCR quality varies from text to text and from language to language, we can thus produce initial searches of the textual notes with relatively little effort.
- Create an accurate, carefully marked up transcription of a print original: In this stage, we aim to capture every character on the printed source page and to represent the logical structure of the document: ideally, the text should be sufficiently well encoded that readers could ask to compare the readings reported by different witnesses (e.g., "display places where M differs from P and provide a statistical analysis of how often these sources differ").
- Create a new edition, traceable to its print original, but able to represent multiple versions representing multiple witnesses and multiple new editions: The source text becomes the foundation for multiple new editions. Once we have a carefully constructed source text, we can generate as many variations as we like. The source may -- and probably will -- soon recede into the background but will provide a stable framework whereby we can compare all subsequent editions.
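The first step above, aligning new OCR against a pre-existing text by way of word forms that occur only once, can be sketched as follows. This is an illustration under assumptions, not the procedure Perseus actually ran: it anchors on words unique in both texts, keeps the longest run of anchors that appear in the same order (a longest-increasing-subsequence step added here for robustness), and reports the stretches between anchors that differ. The sample sentence is a toy input.

```python
from bisect import bisect_left
from collections import Counter


def unique_positions(words):
    """Map each word that occurs exactly once to its index."""
    counts = Counter(words)
    return {w: i for i, w in enumerate(words) if counts[w] == 1}


def anchor_pairs(a, b):
    """Pair up words unique in both texts, keeping only pairs in consistent order."""
    ua, ub = unique_positions(a), unique_positions(b)
    pairs = sorted((ua[w], ub[w]) for w in ua.keys() & ub.keys())
    tails, tail_js = [], []        # best chain end (index into pairs, and its j) per length
    prev = [None] * len(pairs)     # back-pointers for rebuilding the chain
    for idx, (_, j) in enumerate(pairs):
        k = bisect_left(tail_js, j)
        prev[idx] = tails[k - 1] if k > 0 else None
        if k == len(tails):
            tails.append(idx)
            tail_js.append(j)
        else:
            tails[k] = idx
            tail_js[k] = j
    chain = []
    idx = tails[-1] if tails else None
    while idx is not None:
        chain.append(pairs[idx])
        idx = prev[idx]
    return chain[::-1]


def differing_gaps(a, b, anchors):
    """Return the word sequences between consecutive anchors that do not match."""
    gaps = []
    prev_i, prev_j = -1, -1
    for i, j in anchors + [(len(a), len(b))]:
        seg_a, seg_b = a[prev_i + 1:i], b[prev_j + 1:j]
        if seg_a != seg_b:
            gaps.append((seg_a, seg_b))
        prev_i, prev_j = i, j
    return gaps


existing = "the quick brown fox jumps over the lazy dog".split()
ocr = "the quick brovvn fox jumps over the lazy dog".split()
anchors = anchor_pairs(existing, ocr)
variations = differing_gaps(existing, ocr, anchors)
```

Each reported gap could then be passed to a morphological analyzer to decide, as the text suggests, whether it looks like a data entry error or a deliberate variant.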
Choice of source texts
If we were creating a traditional scholarly text collection, we would want the most up-to-date current editions. In this model, however, we need to balance the authority of the source texts against their ability to evolve into richer editions encoding multiple sources and editorial versions. If a serious user community exists, if it values additions to textual scholarship and if it has reasonable technical and editorial mechanisms to enhance its editions, living older texts will overtake any static edition.
The two extreme cases are:
- Recent editions that may be at present the most comprehensive and authoritative but cannot be augmented. Whether or not publishers can claim copyright to scholarly reconstructions of primary source materials, editors should certainly have the right to prepare a single version of an edition to which no one else can make changes.
- Editions that are designed to accept -- and document -- new witnesses and editorial decisions. In the simplest case, this would include careful transcriptions of public domain editions. A mature versioning environment tracks each addition and can reconstruct any given version. Versioning software analyzes new transcriptions of witnesses and editions.
In practical terms, the best accessible editions will usually be the best public domain editions, with a few editors initially offering their work. It would probably be best to use public domain editions as initial test cases and to use these to work out inevitable bugs and organizational issues. Current editors may, in any event, find it as easy to add their changes to a well-structured public domain edition as to supervise the markup of their own print editions or the word processing files from which they derive.
Sources for Images of Print Editions
- Local book scanning: A number of institutions (including Perseus) can scan limited numbers of books. Sheet feeder scanners can process c. 1,000 pages an hour but they require that the source books be disbound. Look down scanners do not damage the source materials and are slower but they still can process 100+ pages in an hour and are useful for smaller jobs.
- Large book scanning projects: There are now a number of projects that are scanning very large numbers of books. [Google Print] has begun assembling a library that will include tens of millions of books. Google plans to make the library openly searchable and will return copies of the scanned books to the participating research libraries, but it is not clear how easily other developers will be able to get their own copies on which to apply specialized OCR and content analysis. The [Open Content Alliance] constitutes a growing consortium of content providers and third party service providers. Led by the [Internet Archive], the OCA has begun making high resolution image books available and is providing [a clearing house for related efforts] such as the [Million Book Project]. The newer robotic scanners do a very good job of turning pages -- even pausing to let one page clinging to another drop off as they turn. They seem to be able to process more than 1,000 pages an hour and thus to exceed the best throughput we have achieved running disbound pages through a sheet feeder -- very impressive. The drawback is that these robots are expensive: the most recent ones from Kirtas cost $140,000-$180,000. You need to get high volume to justify this economically. If you can get 1,200 pages an hour, then you might do three books an hour and 120 books a week. That would be about 6,000 books a year -- or about $30-$40 per book for the hardware investment alone exclusive of labor and postprocessing. If you consider 100 hours/week over two years and thus 300 400-page books a week, you get to 15,000 a year and the price clearly comes down. Run that over three years with 45,000 books and the cost becomes manageable.
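The throughput and cost estimates above can be reproduced with a few lines of arithmetic. The sketch assumes 400-page books, a 40-hour week for the first scenario and 50 working weeks a year; the last two are inferred from the figures of 120 books a week and 6,000 books a year rather than stated outright. Under these assumptions the one-year hardware cost works out to roughly $23-$30 per book, somewhat below the $30-$40 quoted above, which may rest on assumptions the text does not spell out.

```python
def books_per_year(pages_per_hour, pages_per_book, hours_per_week, weeks_per_year):
    """Estimate yearly output of one scanner."""
    books_per_hour = pages_per_hour / pages_per_book
    return books_per_hour * hours_per_week * weeks_per_year


def hardware_cost_per_book(scanner_price, total_books):
    """Spread the scanner price over the books scanned (labor excluded)."""
    return scanner_price / total_books


# Scenario 1: 1,200 pages/hour, 400-page books, assumed 40 hours/week, 50 weeks/year.
single_shift = books_per_year(1200, 400, 40, 50)   # 6,000 books

# Scenario 2: the same scanner run 100 hours/week.
extended = books_per_year(1200, 400, 100, 50)      # 15,000 books
three_years = 3 * extended                         # 45,000 books

for price in (140_000, 180_000):
    one_year = hardware_cost_per_book(price, single_shift)
    long_run = hardware_cost_per_book(price, three_years)
    print(f"${price:,}: {one_year:.2f} per book in year one, {long_run:.2f} over three years")
```

Spread over 45,000 books, the same hardware comes to roughly $3-$4 per book, which is the sense in which the cost "becomes manageable".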
In practice, editors interested in a few authors can get their source materials scanned at a variety of locations. Larger series (such as the Patrologia Latina) are well suited to the large scale book scanning projects. The biggest problem involves getting copies of the desired books to a location where large scale scanning is taking place. The California Digital Library, which serves the UC system, and the University of Toronto were early partners in the OCA and between them would have virtually every edition of Greek or Latin texts published in the past two centuries. An [article in LibraryJournal from November 1, 2005] reports that the European Commission is planning a large digital library project of its own that will focus initially on the public domain.
Components of next generation electronic editions
These editions will have the following components:
- One or more baseline print editions available as image books: At least one print edition should be available as an electronic source to which readers can refer if they feel that they have detected a data entry or formatting error. Everything necessary for representing at least one core edition in a tagged file should be available to the community. Given the demands of publishers, these may not be the most up-to-date editions of an author but they are intended as a starting point. All such texts should, of course, have OCR generated searchable text. If the original source texts have page numbers, then these should be encoded and citable.
- A flexible editing environment which allows user communities to improve the current document: Electronic documents are by nature dynamic and can evolve over time. Where print editions constitute end points of a long stage of development, electronic editions can serve as starting points for on-going development. Initial tasks may focus on correcting OCR errors, adding structural markup and other basic chores. Ultimately, however, users will want to associate higher level annotations (e.g., specifying that a given "Salamis" is the Salamis in Cyprus rather than near Athens, or indicating that "faciam" is a subjunctive rather than a future, etc.). Examples of decentralized editing environments that link transcriptions with images of the source pages include the [Distributed Proofreaders] program of [Project Gutenberg] and the [Digital Facsimile Editions] of the [Christian Classics Ethereal Library].
- A tagged transcript of one or more print editions: This should include everything from the original edition, including introduction, textual notes, commentary, index, and any other materials from the source book. At this stage, the idiosyncratic line breaks of particular editions should be preserved if the textual notes, commentary or other parts of the book use these line breaks for internal citations. All citations should be tagged and activated: e.g., wherever the text refers to "page 132 line 18" or "chapter 44, line 8", these expressions should be converted into active links. Textual notes should appear as simple notes and be placed within the body of the source texts. This version serves as a temporary work space and should yield to the following stage. It should become the official representation of the original print edition. The [Camena project]
- Fully interpreted electronic version of the print text: While many documents may be complete at this stage, textual notes in critical editions should be converted from human readable descriptions into machine interpretable operations. Thus, readers should be able to view the text as it appears in any given manuscript, view places where any two witnesses disagree with one another, and see analyses of how far different versions of the text differ from one another. This version of the text should become the default and replace the tagged transcript.
- One or more translations: Translations should have provenance so that readers know whether or not they reflect the online version of the source text. Translations should, like the editions, include all accompanying materials including introduction, notes, appendices, indices etc. Like editions, translations should also be available as image books so that readers can, when in doubt, consult the print originals.
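The requirement in the tagged-transcript component that expressions such as "page 132 line 18" be converted into active links can be approximated with a pattern match. The sketch below wraps each match in a TEI-style `<ref>` element; the pattern covers only the two phrasings quoted above, and the `#page132.line18` target scheme is a hypothetical convention invented for this example.

```python
import re

# Matches "page 132 line 18" and "chapter 44, line 8" (comma optional).
INTERNAL_CITATION = re.compile(r"\b(page|chapter)\s+(\d+),?\s+line\s+(\d+)\b")


def activate_citations(text):
    """Wrap each internal citation in a <ref> element pointing at its target."""
    def link(match):
        unit, number, line = match.group(1), match.group(2), match.group(3)
        target = f"#{unit}{number}.line{line}"   # hypothetical target scheme
        return f'<ref target="{target}">{match.group(0)}</ref>'
    return INTERNAL_CITATION.sub(link, text)


sample = "See page 132 line 18 and chapter 44, line 8."
linked = activate_citations(sample)
```

Real editions use many more citation phrasings and abbreviations, so a working tagger would need a larger set of patterns plus manual review.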
The fully interpreted electronic edition should then provide a starting point for subsequent edits. The text could evolve in a variety of ways.
- Systematic collations: Individuals may systematically collate the source text against new witnesses (e.g., manuscripts, papyri, etc.) or new editions (where editors may have derived different conclusions and printed different readings). All additions must be transparent: thus, we cannot record new readings without providing their justification. We can add new readings from manuscripts and other sources without necessarily changing the text. We cannot record different editorial decisions without encoding the source for those decisions.
- Coordination of edition, textual notes and at least one reference translation: We may have multiple translations reflecting multiple editions of a given work but we should have at least one translation that reflects the content of the base edition and that can represent the different readings in the textual notes. Readers should always be able to see how (or whether) any given reading affects the main translation. Readers should thus be able to filter out those notes which do not impact upon the English and to analyze the aggregate impact of choosing one version over another. While small changes of language can have dramatic effects upon meaning, readers should be able to gauge the overall significance of different versions.
A great deal more can be done with and for any given edition: we can add (and have added) commentaries, linguistic markup, links to scholarship and other supplementary materials. At the same time, the above represents a basic level of documentation towards which producers should, in our view, aim.
- Changes from the source text to the transcription: The Text Encoding Initiative provides tags to record locations where editors have corrected errors in the source, expanded abbreviations, and regularized spellings.
- Markup stylesheet: The Text Encoding Initiative offers a range of tags but is not universal. In some cases, we will need to extend the TEI. In other cases, the TEI allows us to represent the same information in different ways: e.g., <name type="place">Rome</name> or <placeName>Rome</placeName>. The more homogeneous editions can be, the easier it will be to search, browse and maintain them over time. Perseus has evolved conventions of its own over time, but even within Perseus different projects have approached the same problems differently. We need documentation that is more extensive and that can be updated in real time (e.g., a Wiki).
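One way to move a collection towards the homogeneity described above is a normalization pass that rewrites one of the equivalent encodings into the other. The sketch below uses Python's standard `xml.etree.ElementTree` and assumes, purely for illustration, that the specific elements are preferred over the generic `<name type="...">` form. The mapping table is a hypothetical house convention, and the sample fragment omits the XML namespace that full TEI documents may declare.

```python
import xml.etree.ElementTree as ET

# Hypothetical house convention: generic type value -> specific element name.
GENERIC_TO_SPECIFIC = {
    "place": "placeName",
    "person": "persName",
    "org": "orgName",
}


def normalize_names(root):
    """Rewrite <name type="X"> as the matching specific element; return the count."""
    changed = 0
    # Collect matches first so that renaming does not disturb the iteration.
    for element in list(root.iter("name")):
        specific = GENERIC_TO_SPECIFIC.get(element.get("type"))
        if specific is not None:
            element.tag = specific
            del element.attrib["type"]
            changed += 1
    return changed


fragment = ('<p>He left <name type="place">Rome</name> for '
            '<placeName>Athens</placeName>.</p>')
root = ET.fromstring(fragment)
count = normalize_names(root)
normalized = ET.tostring(root, encoding="unicode")
```

Elements with an unrecognized or missing `type` are left untouched, so the pass can be rerun safely as the mapping table grows.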