Text Alignment Network

From The Digital Classicist Wiki



The Text Alignment Network (TAN) is a suite of XML formats designed to maximize the syntactic and semantic interoperability of texts, annotations, and language resources. TAN is particularly suited to aligning texts that exist in multiple versions (copies, translations, paraphrases), and to annotating quotations, translation clusters (word-to-word), and lexico-morphological features. Simple, modular, and networked, the TAN format allows users, working independently or collaboratively, to find, create, edit, study, align, and share their texts and annotations. The extensive validation rules are integrated into a library of functions that definitively interpret the format and provide a foundation for third-party tools and applications.

TAN XML resembles Text Encoding Initiative (TEI) XML, but shifts traditional inline annotations to stand-off annotation, assigning each format one specific job. Formats fall into one of three classes:

Class 1: transcriptions: TEI (all, slightly modified), TAN-T (like TEI, but closer to plain text)

Class 2: annotations and alignments of class 1 files

  • TAN-A: basic claims about class 1 files, particularly supportive of assertions about text reuse and variations in witnesses
  • TAN-A-lm: lexico-morphological data about any class 1 file
  • TAN-A-tok: word/phrase-to-word/phrase alignments of any two texts that are purported to be alternate versions (normally a source and its translation)

Class 3: other types

  • TAN-mor: a format for defining the rules that govern TAN-A-lm files: the grammatical categories that are permitted, the codes that should be used for them, and the combinations that are allowed or disallowed.
  • TAN-voc: a format for declaring the IRIs and names of entities to be invoked in another TAN file.

Some other innovative features include:

  • A text pointer system that relies upon the reference coordinates of a division in a reference system, rather than on @xml:id values, position in the character stream, or position in the XML tree.
  • The ability to provide any number of concurrent, overlapping annotations on a given text, by any number of people.
  • An allowance to define tokens precisely, via regular expression.
  • Dependence upon Semantic Web / RDF principles, integrated with synonymous scoped ids, to support human-readable alternatives.
  • A heavily regulated validation suite that deeply checks aspects not covered by TEI schema (e.g., Unicode normalization, dates set in the future) and allows the administrator of a file to "talk" to dependent files to communicate updates.
  • An extensive XSLT function library applicable in other contexts, e.g., TAN-regex, an extension of regular expressions for deeper use of Unicode, tan:diff(), and tan:collate().
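
Three of the ideas above (pointing at text by reference coordinate, defining tokens by regular expression, and checking Unicode normalization) can be illustrated with a minimal Python sketch. It is not TAN syntax and does not use the TAN function library: the data layout, the function names (tokenize, resolve, is_nfc), and the sample text are all hypothetical and chosen only for illustration.

```python
import re
import unicodedata

# Hypothetical transcription: text keyed by reference coordinate
# (e.g. chapter.verse), not by an id or a character offset.
divisions = {
    "1.1": "In the beginning was the Word",
    "1.2": "a well-known text",
}

# Two hypothetical token definitions, each expressed as a regular expression.
WORD_CHARS = re.compile(r"\w+")       # runs of word characters
NON_SPACE = re.compile(r"\S+")        # runs of non-whitespace characters


def tokenize(text, pattern):
    """Return the tokens of text as defined by the given pattern."""
    return pattern.findall(text)


def resolve(divs, ref, token_no, pattern):
    """Resolve a stand-off pointer (reference + 1-based token number)."""
    return tokenize(divs[ref], pattern)[token_no - 1]


def is_nfc(text):
    """True if text is already in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text


# A stand-off annotation lives apart from the text and points into it.
annotation = {"ref": "1.1", "token": 6, "claim": "noun"}

print(resolve(divisions, annotation["ref"], annotation["token"], WORD_CHARS))
print(tokenize(divisions["1.2"], WORD_CHARS))
print(tokenize(divisions["1.2"], NON_SPACE))
print(is_nfc("e\u0301"), is_nfc("\u00e9"))
```

The two tokenizations of the same division differ (the hyphenated word is split under one definition and kept whole under the other), which is why a pointer such as "token 2" is only meaningful once the token definition is stated explicitly.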

Alpha (early) versions of TAN were released in 2018 and 2020.

TAN Corpora and Projects