Unicode for ancient languages

From The Digital Classicist Wiki
Revision as of 17:42, 1 December 2020 by ElliMylonas (talk | contribs) (added info for other ancient languages and scripts)


Unicode is the de facto standard for the consistent encoding, representation, and handling of text in most of the world's writing systems. Maintained by a nonprofit organization, the Unicode Consortium, it is the basis on which we can create and edit text in mixed alphabets and reliably share that data with other people, whatever font is in use. That is, Unicode-compliant text remains constant no matter what font displays it. If software displays such text in a font that does not support a particular alphabet and shows boxes instead, the underlying data is still intact: switching to a font that does support the alphabet will show the text correctly.

With more than 110,000 characters, Unicode is as complex as human writing itself, and so it needs organization. Because computers work in binary, the characters in Unicode are numbered in hexadecimal, which uses the digits 0 through 9 and the letters A through F. (Decimal 10 is hexadecimal A; decimal 16 = hex 10; decimal 79 = hex 4F.) It is therefore helpful to think of Unicode as a very long ribbon sixteen characters wide. That ribbon is divided into Unicode blocks, each one corresponding, more or less, to a particular alphabet. (This ribbon image is illustrated nicely here.) Even more important, it is common to refer to a Unicode character by its assigned hexadecimal value in four or more digits, prefaced by "U+", and to state the official name of the character fully in uppercase. Thus U+0020 is SPACE, U+00AE is ®, the REGISTERED SIGN, and U+044E is ю, CYRILLIC SMALL LETTER YU.
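The notation described above can be checked in Python, whose standard `unicodedata` module exposes the official character names. This is only a minimal sketch; the helper name `describe` is our own and not part of any library.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character."""
    # ord() gives the code point as an integer; print it as at least four hex digits.
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# Decimal-to-hexadecimal conversions.
assert format(10, "X") == "A"
assert format(16, "X") == "10"
assert format(79, "X") == "4F"

print(describe(" "))   # U+0020 SPACE
print(describe("®"))   # U+00AE REGISTERED SIGN
print(describe("ю"))   # U+044E CYRILLIC SMALL LETTER YU
```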

Special Features in Unicode Classicists Should Know

Because writing systems from antiquity have been introduced to Unicode in phases, there are special exceptions or behaviors that classicists should be aware of.

The Greek alphabet is split principally between two major blocks:

  • Greek and Coptic U+0370..03FF
  • Greek Extended U+1F00..1FFF

The first block retains its original name, but the Coptic alphabet has been given its own Unicode block: Coptic U+2C80..2CFF. Other Unicode blocks that carry Greek characters:

  • Ancient Greek Numbers U+10140..1018F (note that the end of this block includes papyrological characters that are not numbers)
  • Linear B Syllabary U+10000..1007F
  • Linear B Ideograms U+10080..100FF
  • Aegean Numbers U+10100..1013F
  • Byzantine Musical Symbols U+1D000..1D0FF
  • General Punctuation U+2000..206F (note especially the punctuation marks toward the end)
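To find out which of these blocks a character belongs to, one option is a small lookup table. The sketch below is in Python; as far as we know the standard library has no function that returns block names, so the table `BLOCKS` and the helper `block_of` are our own illustrations, and the table covers only the ranges listed above.

```python
from typing import Optional

# (first code point, last code point, block name), copied from the lists above.
BLOCKS = [
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x1F00, 0x1FFF, "Greek Extended"),
    (0x2C80, 0x2CFF, "Coptic"),
    (0x10140, 0x1018F, "Ancient Greek Numbers"),
    (0x10000, 0x1007F, "Linear B Syllabary"),
    (0x10080, 0x100FF, "Linear B Ideograms"),
    (0x10100, 0x1013F, "Aegean Numbers"),
    (0x1D000, 0x1D0FF, "Byzantine Musical Symbols"),
    (0x2000, 0x206F, "General Punctuation"),
]

def block_of(ch: str) -> Optional[str]:
    """Return the block name for ch, or None if it is outside this partial table."""
    cp = ord(ch)
    for start, end, name in BLOCKS:
        if start <= cp <= end:
            return name
    return None

print(block_of("\u03B1"))      # Greek and Coptic
print(block_of("\u1F04"))      # Greek Extended
print(block_of("\U00010000"))  # Linear B Syllabary
print(block_of("a"))           # None
```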

Combining characters are generally drawn from the general Combining Diacritical Marks block, U+0300..036F, rather than from the Greek blocks. The same holds for common punctuation, which comes from the shared general blocks.
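As an illustration, decomposing a precomposed polytonic letter with Python's `unicodedata.normalize` shows the combining marks it is built from. The choice of U+1F04 as the sample character is ours.

```python
import unicodedata

precomposed = "\u1F04"  # GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA
decomposed = unicodedata.normalize("NFD", precomposed)

for ch in decomposed:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+03B1 GREEK SMALL LETTER ALPHA
# U+0313 COMBINING COMMA ABOVE
# U+0301 COMBINING ACUTE ACCENT

# Recomposing (NFC) gives back the single precomposed character.
assert unicodedata.normalize("NFC", decomposed) == precomposed
```

Both marks fall inside U+0300..036F, while the base letter comes from the Greek and Coptic block.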

A false distinction was introduced to Unicode between the oxia (acute) and the tonos, resulting in wrongly duplicated code points: the vowels with oxia in Greek Extended duplicate the vowels with tonos in Greek and Coptic. See Greek Unicode duplicated vowels for a full discussion.
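The practical consequence is that two strings can look identical and still compare as different. A minimal Python sketch of the problem and one common remedy, normalizing before comparison, follows; the helper name `same_text` is hypothetical.

```python
import unicodedata

oxia = "\u1F71"   # GREEK SMALL LETTER ALPHA WITH OXIA (Greek Extended)
tonos = "\u03AC"  # GREEK SMALL LETTER ALPHA WITH TONOS (Greek and Coptic)

# As typed, the two code points are different strings.
print(oxia == tonos)  # False

def same_text(a: str, b: str) -> bool:
    """Compare two strings after canonical normalization (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# NFC maps the oxia code point onto the tonos one, so the comparison succeeds.
print(same_text(oxia, tonos))  # True
```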

For the Greek question mark, use the common semicolon (U+003B). A separate GREEK QUESTION MARK exists at U+037E, but the Unicode database defines it as canonically equivalent to U+003B, so normalization replaces it with the semicolon.
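This behavior can be confirmed with Python's `unicodedata.normalize`. The constant names below are ours, chosen for readability.

```python
import unicodedata

GREEK_QUESTION_MARK = "\u037E"
SEMICOLON = "\u003B"

# The two look alike but are different code points, so they compare unequal as typed.
print(GREEK_QUESTION_MARK == SEMICOLON)  # False

# Every normalization form replaces U+037E with U+003B.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    assert unicodedata.normalize(form, GREEK_QUESTION_MARK) == SEMICOLON
```

A search for ";" in un-normalized text can therefore miss question marks that were typed as U+037E.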

Unicode introduced a distinction between uppercase and lowercase forms of the numerals for six, ninety, and nine hundred (stigma, koppa, and sampi): U+03DA, U+03DB (Ϛϛ), and U+03DE..U+03E1 (ϞϟϠϡ). No rule dictates which form is preferred.
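Because these are ordinary case pairs in the Unicode database, generic case conversion will change one form into the other. The short Python sketch below lists the pairs; the list name `NUMERAL_CAPITALS` is our own.

```python
import unicodedata

# Uppercase code points of the three numeral letters.
NUMERAL_CAPITALS = ["\u03DA", "\u03DE", "\u03E0"]

# str.lower() applies the case mapping from the Unicode database.
pairs = [(upper, upper.lower()) for upper in NUMERAL_CAPITALS]

for upper, lower in pairs:
    print(
        f"U+{ord(upper):04X} {unicodedata.name(upper)} / "
        f"U+{ord(lower):04X} {unicodedata.name(lower)}"
    )
```

Anyone lowercasing a text for searching should expect a numeral such as Ϛ to come out as ϛ.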

Latin Code Points

Unicode provides characters in the Ancient Symbols block (U+10190..101CF) for Roman currency and the Tau-Rho monogram.
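To see what the block contains, one can ask Python for the name of every code point in it. This sketch scans the full Ancient Symbols block, U+10190..101CF; the helper `assigned_names` is hypothetical, and the result depends on the Unicode version bundled with your Python.

```python
import unicodedata

def assigned_names(start: int, end: int) -> dict:
    """Map each code point in [start, end] that has a name to that name."""
    result = {}
    for cp in range(start, end + 1):
        # unicodedata.name() returns the default (None) for code points without a name.
        name = unicodedata.name(chr(cp), None)
        if name is not None:
            result[cp] = name
    return result

ancient_symbols = assigned_names(0x10190, 0x101CF)
for cp, name in ancient_symbols.items():
    print(f"U+{cp:04X} {name}")
```

The list should include entries such as U+10190 ROMAN SEXTANS SIGN and U+101A0 GREEK SYMBOL TAU RHO.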

Other Ancient Alphabets

Unicode provides characters for the scripts of several ancient languages of the Mediterranean. Not all languages are represented, and even for those that are, not all glyphs are present. A sample is listed below.

  • Lycian U+10280..U+1029F
  • Phoenician U+10900..U+1091F (applies to Archaic Phoenician, Phoenician, Early Aramaic, Late Phoenician cursive, Phoenician papyri, Siloam Hebrew, Palaeo-Hebrew, Hebrew seals, Ammonite, Moabite, and Punic)
  • Old Italic U+10300..U+1032F (applies to glyphs for Etruscan, Faliscan, Oscan, Umbrian, South Picene)
  • Cuneiform U+12000..U+123FF (applies to Sumerian, Akkadian, Elamite, Hittite, Hurrian)
    • Also Cuneiform Numbers and Punctuation U+12400..U+1247F and Early Dynastic Cuneiform U+12480..U+1254F
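All of these blocks lie above U+FFFF, so their code points have five hexadecimal digits. The Python sketch below looks up the first character of three of the blocks; the dictionary `SAMPLES` is our own, and the UTF-8 byte count is included only to show that such characters take more storage than Greek or Latin letters.

```python
import unicodedata

# First code point of three of the blocks listed above.
SAMPLES = {
    "Phoenician": 0x10900,
    "Old Italic": 0x10300,
    "Cuneiform": 0x12000,
}

for block, cp in SAMPLES.items():
    ch = chr(cp)
    utf8_len = len(ch.encode("utf-8"))
    print(f"{block}: U+{cp:04X} {unicodedata.name(ch)} ({utf8_len} bytes in UTF-8)")

# In a Python string literal, five-digit code points need the eight-digit \U escape.
assert "\U00010900" == chr(0x10900)
```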

Standardization of Glyphs Not in Unicode

If you think a glyph deserves to be included in Unicode but are not certain, it is best to start with the Unicode discussion list and Deborah Anderson at Berkeley.

If a new character is in order and you need to create a proposal, you may wish to study how other proposals have been developed.

  • The Thesaurus Linguae Graecae has made several Unicode proposals for the encoding of Ancient Greek characters and symbols (several TLG proposals are online).
  • The EAGLE committee has made recommendations for Latin epigraphic symbols.

If a glyph is not considered to merit inclusion in Unicode but is thought important for specialized fonts, you may wish to study how others have been designed and developed:

  • David Perry (creator of the Cardo font) has done work in this area.
  • MUFI (the Mediaeval Unicode Font Initiative) is a group with related concerns.
  • Athena Ruby, from Dumbarton Oaks, has encoded a number of specialized symbols and characters by tying the glyphs not only to the Private Use Area but also to their proper code point (where it exists).
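Text that relies on the Private Use Area only displays correctly with the font it was made for, so it can be useful to check a text for such characters before sharing it. A minimal Python sketch follows; the function name `private_use_chars` is hypothetical, and it relies on the general category "Co", which the Unicode database gives to private-use code points.

```python
import unicodedata

def private_use_chars(text: str) -> list:
    """Return the distinct Private Use characters in text, in order of first appearance."""
    seen = []
    for ch in text:
        # "Co" is the general category for private-use code points.
        if unicodedata.category(ch) == "Co" and ch not in seen:
            seen.append(ch)
    return seen

sample = "abc\uE000def\uE000"
for ch in private_use_chars(sample):
    print(f"U+{ord(ch):04X}")  # U+E000
```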

Learning more about Unicode blocks and glyphs

The definitive reference for Unicode is the set of publications of the Unicode Consortium, the most recent of which, as of this revision, is Unicode 13.0. Unicode blocks and characters are also documented in Wikipedia, which may provide a faster and more accessible way to see which characters are available and what their code points are. Search for the script name, for example "Old Italic (Unicode block)".
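Software often lags behind the latest version of the standard, so a recently added character may be unknown to your tools. In Python, the bundled version can be read from `unicodedata.unidata_version`; the helper `is_assigned` below is our own sketch.

```python
import unicodedata

# Version of the Unicode Character Database bundled with this Python build.
print(unicodedata.unidata_version)

def is_assigned(cp: int) -> bool:
    """True if this Python's Unicode database assigns a character to code point cp."""
    # Category "Cn" means "not assigned" (it also covers noncharacters).
    return unicodedata.category(chr(cp)) != "Cn"

print(is_assigned(0x03B1))  # True: GREEK SMALL LETTER ALPHA
print(is_assigned(0x0378))  # False: a gap in the Greek and Coptic block
```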

Other Resources