OCR for ancient Greek: Difference between revisions

From The Digital Classicist Wiki
(Added Tesseract tool, reordered)
mNo edit summary
 
(14 intermediate revisions by 7 users not shown)
Line 1: Line 1:
* [http://code.google.com/p/tesseract-ocr/ Tesseract] is an ongoing Google open source project for OCR.
==Definitions==
 
'''Optical Character Recognition''' or '''OCR''' is the process of using software to read analogue, printed texts (or raster images of such texts) and interpret them as character data, usually using probabilistic pattern-recognition methods. It is related to, but usually more straightforward than, [[Handwritten Text-Recognition]] (HTR). OCR is relatively easy to perform on modern printed text, but struggles significantly more with: older print and non-standard fonts; less-common languages with complex diacritical systems; and historical languages not normally of interest to the AI and intelligence communities, which invest heavily in text analysis applications.
 
Ancient Greek is at the intersection of all these difficulties, and has traditionally been among the most difficult printed languages to OCR. The [[TLG]], for example, has compiled hundreds of millions of words of Greek literature through outsourced manual keying, rather than even attempting OCR. However, there have recently been several more successful attempts at applying OCR to Ancient Greek, especially involving shared training sets and machine learning approaches.
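The "complex diacritical systems" mentioned above can be made concrete with a minimal sketch using only Python's standard <code>unicodedata</code> module. It decomposes a single polytonic Greek letter into its base letter and combining marks, each of which an OCR engine must recognise correctly. The function names <code>describe_marks</code> and <code>count_diacritics</code> are illustrative, not taken from any of the tools listed on this page.

```python
import unicodedata


def describe_marks(text: str) -> list[str]:
    """Return the Unicode name of every code point in the NFD form of text."""
    decomposed = unicodedata.normalize("NFD", text)
    return [unicodedata.name(ch) for ch in decomposed]


def count_diacritics(text: str) -> int:
    """Count combining marks (non-zero combining class) in the NFD form of text."""
    decomposed = unicodedata.normalize("NFD", text)
    return sum(1 for ch in decomposed if unicodedata.combining(ch) != 0)


if __name__ == "__main__":
    # U+1F84: alpha with psili (smooth breathing), oxia (acute) and iota subscript
    letter = "\u1f84"
    for name in describe_marks(letter):
        print(name)
    print("diacritics on one letter:", count_diacritics(letter))
```

A single printed glyph here carries three separate marks, and confusing any one of them (for example smooth for rough breathing) yields a different character. Normalising OCR output and ground truth to the same form (NFC or NFD) before comparing them avoids counting purely encoding-level differences as errors.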
 
==Projects collecting Ancient Greek texts digitized with OCR==
 
* [https://en.wikipedia.org/wiki/Million_Book_Project Million Book Project]
* [[Perseus Digital Library]]
* [https://opengreekandlatin.github.io/First1KGreek/ First Thousand Years of Greek]
* [[Open Greek and Latin project]]
* [[Lace: Greek OCR]]
 
==Tools, recommendations and policies==
 
* [http://ancientgreekocr.org Ancient Greek OCR] provides downloads and instructions for OCR using the [http://code.google.com/p/tesseract-ocr Tesseract] engine. Works on Windows, Linux, OSX & Android.
* [https://dcthree.github.io/antigrapheus/ Antigrapheus] allows you to use the Ancient Greek OCR training file above to OCR documents in a web browser, using Tesseract.js.
* Bruce Robertson has created "Rigaudon", "a complete suite of scripts, python code and data required for producing polytonic Greek OCR"
** [https://github.com/brobertson/rigaudon Rigaudon GitHub page]
** [[Lace: Greek OCR]] collects results of OCR processing with Rigaudon on public domain texts
** Initial reports on preliminary results of a survey of techniques: http://www.heml.org/RobertsonGreekOCR/
* A number of people have produced training files for specific Greek fonts in the [http://kraken.re/ Kraken] OCR engine:
** [https://github.com/pharos-alexandria/kraken-ocr-greek_cursive Greek Cursive, from an edition of John Chrysostom's works by Henry Savile]
** [https://github.com/ryanfb/kraken-gaza-iliad Greek from an edition of Theodorus Gaza's Attic paraphrase of the Iliad]
** [https://github.com/mittagessen/kraken-models Greek models in the Kraken models repo] (these are in the legacy pyrnn model format and may not work with the latest version of Kraken, see [https://github.com/mittagessen/kraken/issues/118 this issue])
* The [http://gamera.informatik.hsnr.de/ Gamera] toolkit for analysing and scanning complex texts includes some experiments with polytonic Greek
* Bruce Robertson reports on some preliminary results of a survey of techniques: http://www.heml.org/RobertsonGreekOCR/
* Federico Boschetti did some earlier experimentation with adapting/training Google's OCR engine [http://code.google.com/p/tesseract-ocr/ tesseract] to ancient Greek texts: http://www.himeros.eu/ ([http://www.perseus.tufts.edu/~ababeu/ecdl2009-preprint.pdf related paper])
* Federico Boschetti has been experimenting with adapting/training Google's OCR engine [http://code.google.com/p/tesseract-ocr/ tesseract] to ancient Greek texts: http://www.himeros.eu/ ([http://www.perseus.tufts.edu/~ababeu/ecdl2009-preprint.pdf related paper])
* The commercial OCR software [http://www.ideatech-online.com/index.php?option=com_content&task=view&id=23&Itemid=27 Anagnostis] (€585) can handle ancient Greek, though apparently poorly
* [http://finereader.abbyy.com/ ABBYY FineReader] can be made to work with ancient Greek with extensive training
* Google Docs can now perform [http://googledocs.blogspot.com/2011/02/optical-character-recognition-ocr-in-34.html OCR on uploaded documents in a variety of languages]; specifying "Greek" and uploading a PDF yields some results (images seem not to work). Quality is roughly on the level of Google Books OCR of printed ancient Greek.
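
As a rough illustration of the Tesseract-based workflow listed above, the following sketch calls the <code>tesseract</code> command-line program from Python. It assumes that Tesseract is installed and on the PATH, and that an Ancient Greek training file is available under the language code <code>grc</code>; the file name <code>page.png</code> and both function names are hypothetical. Refer to the instructions of the individual tools for their recommended settings.

```python
import shutil
import subprocess


def build_tesseract_command(image_path: str, lang: str = "grc") -> list[str]:
    """Build the argument list for a Tesseract run that prints text to stdout.

    "stdout" as the output base tells Tesseract to write the recognised
    text to standard output instead of a file; "-l" selects the language.
    """
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_page(image_path: str, lang: str = "grc") -> str:
    """Run Tesseract on one page image and return the recognised text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError if Tesseract fails
    )
    return result.stdout


if __name__ == "__main__":
    # "page.png" is a placeholder for a scanned page of printed Greek.
    print(ocr_page("page.png"))
```

Keeping the command construction separate from its execution makes it easy to inspect or log the exact invocation before processing a large batch of pages.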


===alternatives===
===Alternatives===
 
* [http://accesstei.apexcovantage.com/ AccessTEI] is a service for members of the TEI for manual keying of texts which can handle ancient Greek.


* [http://accesstei.apexcovantage.com/ AccessTEI] is a service for members of the TEI for manual keying of texts which can handle ancient Greek


==External links==
* [https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind1005&L=DIGITALCLASSICIST&F=&S=&P=2180 Discussion of ancient Greek OCR software on Digital Classicist mailing list]
* [http://www.odl.ox.ac.uk/papers/OCRFeasibility_final.pdf Deciding whether Optical Character Recognition is feasible, Simon Tanner (KDCS), 2004]


[[category:FAQ]]
[[category:Tools]]
[[category:OCR]]

Latest revision as of 10:00, 18 December 2023

Definitions

Optical Character Recognition or OCR is the process of using software to read analogue, printed texts (or raster images of such texts) and interpret them as character data, usually using probabilistic pattern-recognition methods. It is related to, but usually more straightforward than, Handwritten Text-Recognition (HTR). OCR is relatively easy to perform on modern printed text, but struggles significantly more with: older print and non-standard fonts; less-common languages with complex diacritical systems; and historical languages not normally of interest to the AI and intelligence communities, which invest heavily in text analysis applications.

Ancient Greek is at the intersection of all these difficulties, and has traditionally been among the most difficult printed languages to OCR. The TLG, for example, has compiled hundreds of millions of words of Greek literature through outsourced manual keying, rather than even attempting OCR. However, there have recently been several more successful attempts at applying OCR to Ancient Greek, especially involving shared training sets and machine learning approaches.

Projects collecting Ancient Greek texts digitized with OCR

Tools, recommendations and policies

Alternatives

  • AccessTEI is a service for members of the TEI for manual keying of texts which can handle ancient Greek.