OCR for ancient Greek: Difference between revisions

From The Digital Classicist Wiki
m (typos fixed)
(Added missing link, adjusted structure)
Line 3: Line 3:
'''Optical Character Recognition''' or '''OCR''' is the process of using software to read analogue, printed texts (or raster images of such texts) and interpret them as character data, usually using probabilistic pattern-recognition methods. It is related to, but usually more straightforward than, [[Handwritten Text-Recognition]] (HTR). OCR is relatively easy to perform on modern printed text, but struggles significantly more with: older print and non-standard fonts; less-common languages with complex diacritical systems; and historical languages not normally of interest to the AI and intelligence communities, who invest heavily in text analysis applications. Ancient Greek is at the intersection of all these difficulties, and has traditionally been among the most difficult printed languages to OCR. The [[TLG]], for example, has compiled hundreds of millions of words of Greek literature through outsourced manual keying, rather than even attempting OCR.
'''Optical Character Recognition''' or '''OCR''' is the process of using software to read analogue, printed texts (or raster images of such texts) and interpret them as character data, usually using probabilistic pattern-recognition methods. It is related to, but usually more straightforward than, [[Handwritten Text-Recognition]] (HTR). OCR is relatively easy to perform on modern printed text, but struggles significantly more with: older print and non-standard fonts; less-common languages with complex diacritical systems; and historical languages not normally of interest to the AI and intelligence communities, who invest heavily in text analysis applications. Ancient Greek is at the intersection of all these difficulties, and has traditionally been among the most difficult printed languages to OCR. The [[TLG]], for example, has compiled hundreds of millions of words of Greek literature through outsourced manual keying, rather than even attempting OCR.


However, there have recently been several more successful attempts at applying OCR to Ancient Greek, especially involving shared training sets and machine learning approaches. Please add more recent examples and discussion below.
However, there have recently been several more successful attempts at applying OCR to Ancient Greek, especially involving shared training sets and machine learning approaches.


==Projects==
==Projects collecting Ancient Greek texts digitized with OCR technology==


* [https://en.wikipedia.org/wiki/Million_Book_Project Million Book Project]
* [https://en.wikipedia.org/wiki/Million_Book_Project Million Book Project]
* [[Perseus Digital Library]]
* [[Perseus Digital Library]]
* [[First Thousand Years of Greek]]
* [https://opengreekandlatin.github.io/First1KGreek/ First Thousand Years of Greek]
* [[Open Greek and Latin]]
* [[Open Greek and Latin project]]
* [[Lace: Greek OCR]]
* [[Lace: Greek OCR]]


Line 32: Line 32:
===Alternatives===
===Alternatives===


* [http://accesstei.apexcovantage.com/ AccessTEI] is a service for members of the TEI for manual keying of texts which can handle ancient Greek
* [http://accesstei.apexcovantage.com/ AccessTEI] is a service for members of the TEI for manual keying of texts which can handle ancient Greek.





Revision as of 11:38, 16 December 2023

Definitions

Optical Character Recognition or OCR is the process of using software to read analogue, printed texts (or raster images of such texts) and interpret them as character data, usually using probabilistic pattern-recognition methods. It is related to, but usually more straightforward than, Handwritten Text-Recognition (HTR). OCR is relatively easy to perform on modern printed text, but struggles significantly more with: older print and non-standard fonts; less-common languages with complex diacritical systems; and historical languages not normally of interest to the AI and intelligence communities, who invest heavily in text analysis applications. Ancient Greek is at the intersection of all these difficulties, and has traditionally been among the most difficult printed languages to OCR. The TLG, for example, has compiled hundreds of millions of words of Greek literature through outsourced manual keying, rather than even attempting OCR.

However, there have recently been several more successful attempts at applying OCR to Ancient Greek, especially involving shared training sets and machine learning approaches.

Projects collecting Ancient Greek texts digitized with OCR technology

Tools, recommendations and policies

Alternatives

  • AccessTEI is a service for members of the TEI for manual keying of texts which can handle ancient Greek.