Difference between revisions of "Unicode for ancient languages"

From The Digital Classicist Wiki
m (cat)
(Drastic revisions)
Line 1: Line 1:
==Where should one go to find out about Unicode Greek/Latin/epigraphic symbols, etc.==
+
==Unicode==
  
(alternative question: '''Is there anything like MUFI for Classicists?''')
+
Unicode is the de facto standard for the consistent encoding, representation, and handling of text expressed in most of the world's digital writing systems. Maintained by a nonprofit organization, Unicode is the basis upon which we can create and edit text in mixed alphabets and reliably share that data with other people, changes in fonts notwithstanding. That is, any text that is Unicode compliant remains constant, no matter what font is used to display the data. If some software tries to display some Unicode-compliant text in a particular font that does not support a particular alphabet, and ends up displaying boxes, the underlying data is still fine. Swapping the text to a font that does support the alphabet will reveal this to be the case.
  
(advice received so far; please add comments or expand)
+
With more than 110,000 characters, Unicode is as complex as human writing itself, and so lends itself to organization. Because computers work on the binary system, it was considered ideal to number the characters or glyphs in Unicode in hexadecimal numeration, which uses the digits 0 through 9 and the letters A through F. (Decimal 10 is Hexadecimal A; decimal 16 = hex 10; decimal 79 = hex 4F.) It is therefore helpful to think of Unicode as a very long ribbon sixteen characters wide. That ribbon is divided into Unicode blocks, each one corresponding, more or less, with a particular alphabet. (This ribbon image [http://en.wikibooks.org/wiki/Unicode/Character_reference/0000-0FFF is illustrated nicely here].) Even more important, it is common to refer to Unicode characters with the assigned hexadecimal value in four or more digits, prefaced by "U+", and to state the official name of the Unicode character fully in uppercase. Thus U+0020 is SPACE, U+00AE is ®, the REGISTERED SIGN, and U+044E is ю, CYRILLIC SMALL LETTER YU.
  
* I think the people who most concern themselves with this sort of thing include: the [[Thesaurus Linguae Graecae]], who have made several Unicode proposals for the encoding of Ancient Greek characters and symbols (several [http://repositories.cdlib.org/tlg/unicode/ TLG proposals] online);
+
==Special Features in Unicode Classicists Should Know==
 +
 
 +
Because writing systems from antiquity have been introduced to Unicode in phases, there are special exceptions or behaviors that classicists should be aware of.
 +
 
 +
A false distinction was introduced to Unicode between the oxia (acute) and tonos, resulting in wrongly duplicated code points. See [[Greek Unicode duplicated vowels]] for a full discussion.
 +
 
 +
Unicode introduced an anachronistic distinction between uppercase and lowercase numerals for six, ninety, and nine hundred: U+03DA, U+03DB (Ϛϛ), and U+03DE..U+03E1 (ϞϟϠϡ). There are no normalization rules that dictate which form is preferred.
 +
 
 +
==Standardization of Glyphs Not in Unicode==
 +
 
 +
If you think a glyph deserves to be included in Unicode but are not certain, it is best to start with the [[Unicode discussion list]] and Deborah Anderson at Berkeley.
 +
 
 +
If a new character is in order and you need to create a proposal, you may wish to study how other proposals have been developed.
 +
 
 +
* The [[Thesaurus Linguae Graecae]] has made several Unicode proposals for the encoding of Ancient Greek characters and symbols (several [http://repositories.cdlib.org/tlg/unicode/ TLG proposals] online);
 
* the [[EAGLE]] committee, who have also made recommendations for Latin epigraphic symbols;
 
 +
 +
If a glyph is not considered to have the merit of being included in Unicode, but is thought important for specialized fonts, you may wish to study how others have been designed and developed:
 +
 
* David Perry (creator of the [http://scholarsfonts.net/cardofnt.html Cardo font]) has also done work in this area;
 
* The best person to ask about this would be Deborah Anderson at Berkeley
 
* or consult the [[Unicode discussion list]]
 
 
* [http://gandalf.aksis.uib.no/mufi/ MUFI] (the Mediaeval Unicode Font Initiative) are a group with related concerns
 
* several items in the [[:Category:Unicode]] in the Digital Classicist Wiki may be of interest
+
* [[Athena Ruby]], from Dumbarton Oaks, has encoded a number of specialized symbols and characters by tying the glyphs not only to the Private Use Area but to their proper code point (where it exists).
 +
 
 +
==Other Resources ==
 +
 
 +
* [[Unicode discussion list]]
 +
* [[:Category:Unicode]]
  
 
[[category:FAQ]]
 
 
[[category:unicode]]
 

Revision as of 17:24, 3 February 2015


Unicode

Unicode is the de facto standard for the consistent encoding, representation, and handling of text in most of the world's writing systems. Maintained by the Unicode Consortium, a nonprofit organization, it is the basis on which we can create and edit text in mixed alphabets and reliably share that data with other people, changes in fonts notwithstanding. That is, Unicode-compliant text remains constant no matter what font is used to display it. If software tries to display such text in a font that does not support a particular alphabet and ends up showing boxes, the underlying data is still intact; switching to a font that does support the alphabet will show this.

With more than 110,000 characters, Unicode is as complex as human writing itself, and so it needs systematic organization. Because computers work in binary, Unicode characters are numbered in hexadecimal, which uses the digits 0 through 9 and the letters A through F. (Decimal 10 is hexadecimal A; decimal 16 = hex 10; decimal 79 = hex 4F.) It is therefore helpful to think of Unicode as a very long ribbon sixteen characters wide. That ribbon is divided into Unicode blocks, each one corresponding, more or less, to a particular alphabet. (The ribbon image is illustrated nicely in the Wikibooks Unicode character reference: http://en.wikibooks.org/wiki/Unicode/Character_reference/0000-0FFF.) Even more important, it is common to refer to a Unicode character by its assigned hexadecimal value in four or more digits, prefaced by "U+", and to state its official name fully in uppercase. Thus U+0020 is SPACE, U+00AE is ®, the REGISTERED SIGN, and U+044E is ю, CYRILLIC SMALL LETTER YU.
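The "U+" notation and the ribbon image can be checked with a few lines of Python. This is only a minimal sketch using the standard `unicodedata` module; the helper names `describe` and `ribbon_position` are made up for this illustration.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the conventional 'U+XXXX NAME' label for one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def ribbon_position(ch: str) -> tuple[int, int]:
    """Return (row, column) of a character on a ribbon sixteen characters wide."""
    return divmod(ord(ch), 16)


# Hexadecimal values mentioned in the text.
assert int("A", 16) == 10
assert int("10", 16) == 16
assert int("4F", 16) == 79

for ch in [" ", "®", "ю"]:
    row, col = ribbon_position(ch)
    # The last hex digit of the code point is the column; the rest is the row.
    print(describe(ch), f"(row {row:X}, column {col:X})")
```

The column is simply the last hexadecimal digit of the code point, which is why charts of Unicode blocks are laid out sixteen characters wide.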

Special Features in Unicode Classicists Should Know

Because writing systems from antiquity have been introduced to Unicode in phases, there are special exceptions or behaviors that classicists should be aware of.

A false distinction was introduced to Unicode between the oxia (acute) and tonos, resulting in wrongly duplicated code points. See Greek Unicode duplicated vowels for a full discussion.
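In practice the duplication means that two strings can look identical and still fail a naive comparison. The sketch below assumes Python's standard `unicodedata` module and relies on the fact that Unicode normalization treats the oxia and tonos forms as canonically equivalent; the helper name `same_text` is hypothetical.

```python
import unicodedata

# The same word typed with the two duplicated forms of accented omicron.
LOGOS_OXIA = "λ\u1F79γος"   # U+1F79 GREEK SMALL LETTER OMICRON WITH OXIA
LOGOS_TONOS = "λ\u03CCγος"  # U+03CC GREEK SMALL LETTER OMICRON WITH TONOS


def same_text(a: str, b: str) -> bool:
    """Compare two strings after canonical normalization (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(LOGOS_OXIA == LOGOS_TONOS)           # raw comparison of code points
print(same_text(LOGOS_OXIA, LOGOS_TONOS))  # comparison after normalization

# NFC replaces the oxia form with the tonos form.
print(f"U+{ord(unicodedata.normalize('NFC', chr(0x1F79))):04X}")
```

Normalizing text before searching, sorting, or comparing is one way to keep the duplicated code points from causing missed matches.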

Unicode introduced an anachronistic distinction between uppercase and lowercase numerals for six, ninety, and nine hundred: U+03DA, U+03DB (Ϛϛ), and U+03DE..U+03E1 (ϞϟϠϡ). There are no normalization rules that dictate which form is preferred.
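Because normalization leaves these letters alone, a project that wants consistent numerals has to pick a case itself. The following sketch is one possible approach, not a Unicode rule: the function name `unify_numeral_case` and the default preference for lowercase are assumptions made for this example.

```python
import unicodedata

# Uppercase -> lowercase code points for the three numeral letters named above.
TO_LOWER = {0x03DA: 0x03DB, 0x03DE: 0x03DF, 0x03E0: 0x03E1}
TO_UPPER = {low: up for up, low in TO_LOWER.items()}


def unify_numeral_case(text: str, prefer: str = "lower") -> str:
    """Map the numeral letters to one case; the choice is a local convention."""
    table = TO_LOWER if prefer == "lower" else TO_UPPER
    return text.translate(table)


def changed_by_normalization(ch: str) -> bool:
    """Report whether any of the four normalization forms alters the character."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return any(unicodedata.normalize(form, ch) != ch for form in forms)


for cp in sorted(set(TO_LOWER) | set(TO_UPPER)):
    ch = chr(cp)
    print(f"U+{cp:04X}", unicodedata.name(ch), changed_by_normalization(ch))
```

Only the three numeral pairs are touched, so ordinary Greek text keeps its capitalization.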

Standardization of Glyphs Not in Unicode

If you think a glyph may deserve inclusion in Unicode but are not certain, it is best to start by consulting the Unicode discussion list and Deborah Anderson at Berkeley.

If a new character is in order and you need to create a proposal, you may wish to study how other proposals have been developed.

  • The Thesaurus Linguae Graecae has made several Unicode proposals for the encoding of Ancient Greek characters and symbols (several TLG proposals are online at http://repositories.cdlib.org/tlg/unicode/).
  • The EAGLE committee has made recommendations for Latin epigraphic symbols.

If a glyph is not judged to merit inclusion in Unicode but is thought important for specialized fonts, you may wish to study how other such glyphs have been designed and developed:

  • David Perry, creator of the Cardo font (http://scholarsfonts.net/cardofnt.html), has done work in this area.
  • MUFI (the Mediaeval Unicode Font Initiative, http://gandalf.aksis.uib.no/mufi/) is a group with related concerns.
  • Athena Ruby, from Dumbarton Oaks, has encoded a number of specialized symbols and characters by tying the glyphs not only to the Private Use Area but also to their proper code points (where they exist).
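Text that relies on the Private Use Area only displays correctly with the font it was made for, so it can be useful to find such characters before sharing a file. The sketch below uses the general category "Co" (private use) from Python's `unicodedata` module; the helper names and the sample code point U+E000 are illustrative assumptions and are not taken from Athena Ruby or any other font.

```python
import unicodedata


def is_private_use(ch: str) -> bool:
    """True if the character lies in a Private Use Area (general category Co)."""
    return unicodedata.category(ch) == "Co"


def private_use_report(text: str) -> list[tuple[int, str]]:
    """List (position, 'U+XXXX') for every private-use character in the text."""
    return [
        (index, f"U+{ord(ch):04X}")
        for index, ch in enumerate(text)
        if is_private_use(ch)
    ]


# Hypothetical sample: two ordinary Greek letters followed by U+E000.
sample = "αβ\uE000"
print(private_use_report(sample))
```

An empty report means the text uses only standard code points and should survive a change of font.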

Other Resources

  • Unicode discussion list
  • Category:Unicode
