OCR Support Files
Tesseract OCR files come in three variants:
Variant | Trained models | Speed | Accuracy
LSTM Only (Fast) | Best “value for money” in terms of speed vs accuracy | Fastest | Least accurate
LSTM Only (Best) | Use if you are willing to trade a lot of speed for slightly better accuracy | Slowest | Most accurate
LSTM + Legacy | Includes both “LSTM-Fast” and legacy models | Faster than “LSTM-Best” | Slightly less accurate than “LSTM-Best”
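The trade-offs in the table above can be written down as a small lookup. The following is only an illustrative sketch: the `Variant` class, the rank numbers, and the `pick_variant` helper are hypothetical names introduced here, and the ranks are read off the table rather than measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str
    speed_rank: int      # 1 = fastest (ranks inferred from the table)
    accuracy_rank: int   # 1 = most accurate (ranks inferred from the table)
    has_legacy: bool     # True if legacy models are bundled


# Hypothetical encoding of the three variants described above.
VARIANTS = [
    Variant("LSTM Only (Fast)", speed_rank=1, accuracy_rank=3, has_legacy=False),
    Variant("LSTM + Legacy", speed_rank=2, accuracy_rank=2, has_legacy=True),
    Variant("LSTM Only (Best)", speed_rank=3, accuracy_rank=1, has_legacy=False),
]


def pick_variant(prefer: str = "speed", need_legacy: bool = False) -> Variant:
    """Return the best-ranked variant for the given preference."""
    if prefer not in ("speed", "accuracy"):
        raise ValueError("prefer must be 'speed' or 'accuracy'")
    # Drop variants without legacy models only when legacy support is required.
    candidates = [v for v in VARIANTS if v.has_legacy or not need_legacy]
    if prefer == "speed":
        return min(candidates, key=lambda v: v.speed_rank)
    return min(candidates, key=lambda v: v.accuracy_rank)
```

For example, `pick_variant("accuracy", need_legacy=True)` selects the only variant that bundles legacy models, at the cost of slightly lower accuracy than the best LSTM-only files.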
There are also two types of models:
· Language Models: Support for a single language
· Script Models: Support for a single script (e.g. Latin or Japanese) that may be used by multiple languages
The following data files support detection tasks rather than a specific language:
Lang Code | Description | Filename
osd | Orientation and script detection | osd.traineddata
equ | Math / equation detection | equ.traineddata
The following are OCR data files supporting a single language:
Lang Code | Language | Filename
afr | Afrikaans | afr.traineddata
amh | Amharic | amh.traineddata
ara | Arabic | ara.traineddata
asm | Assamese | asm.traineddata
aze | Azerbaijani | aze.traineddata
aze_cyrl | Azerbaijani - Cyrillic | aze_cyrl.traineddata
bel | Belarusian | bel.traineddata
ben | Bengali | ben.traineddata
bod | Tibetan | bod.traineddata
bos | Bosnian | bos.traineddata
bul | Bulgarian | bul.traineddata
cat | Catalan; Valencian | cat.traineddata
ceb | Cebuano | ceb.traineddata
ces | Czech | ces.traineddata
chi_sim | Chinese - Simplified | chi_sim.traineddata
chi_tra | Chinese - Traditional | chi_tra.traineddata
chr | Cherokee | chr.traineddata
cym | Welsh | cym.traineddata
dan | Danish | dan.traineddata
deu | German | deu.traineddata
dzo | Dzongkha | dzo.traineddata
ell | Greek, Modern (1453-) | ell.traineddata
eng | English | eng.traineddata
enm | English, Middle (1100-1500) | enm.traineddata
epo | Esperanto | epo.traineddata
est | Estonian | est.traineddata
eus | Basque | eus.traineddata
fas | Persian | fas.traineddata
fin | Finnish | fin.traineddata
fra | French | fra.traineddata
frk | German Fraktur | frk.traineddata
frm | French, Middle (ca. 1400-1600) | frm.traineddata
gle | Irish | gle.traineddata
glg | Galician | glg.traineddata
grc | Greek, Ancient (-1453) | grc.traineddata
guj | Gujarati | guj.traineddata
hat | Haitian; Haitian Creole | hat.traineddata
heb | Hebrew | heb.traineddata
hin | Hindi | hin.traineddata
hrv | Croatian | hrv.traineddata
hun | Hungarian | hun.traineddata
iku | Inuktitut | iku.traineddata
ind | Indonesian | ind.traineddata
isl | Icelandic | isl.traineddata
ita | Italian | ita.traineddata
ita_old | Italian - Old | ita_old.traineddata
jav | Javanese | jav.traineddata
jpn | Japanese | jpn.traineddata
kan | Kannada | kan.traineddata
kat | Georgian | kat.traineddata
kat_old | Georgian - Old | kat_old.traineddata
kaz | Kazakh | kaz.traineddata
khm | Central Khmer | khm.traineddata
kir | Kirghiz; Kyrgyz | kir.traineddata
kor | Korean | kor.traineddata
kur | Kurdish | kur.traineddata
lao | Lao | lao.traineddata
lat | Latin | lat.traineddata
lav | Latvian | lav.traineddata
lit | Lithuanian | lit.traineddata
mal | Malayalam | mal.traineddata
mar | Marathi | mar.traineddata
mkd | Macedonian | mkd.traineddata
mlt | Maltese | mlt.traineddata
msa | Malay | msa.traineddata
mya | Burmese | mya.traineddata
nep | Nepali | nep.traineddata
nld | Dutch; Flemish | nld.traineddata
nor | Norwegian | nor.traineddata
ori | Oriya | ori.traineddata
pan | Panjabi; Punjabi | pan.traineddata
pol | Polish | pol.traineddata
por | Portuguese | por.traineddata
pus | Pushto; Pashto | pus.traineddata
ron | Romanian; Moldavian; Moldovan | ron.traineddata
rus | Russian | rus.traineddata
san | Sanskrit | san.traineddata
sin | Sinhala; Sinhalese | sin.traineddata
slk | Slovak | slk.traineddata
slv | Slovenian | slv.traineddata
spa | Spanish; Castilian | spa.traineddata
spa_old | Spanish; Castilian - Old | spa_old.traineddata
sqi | Albanian | sqi.traineddata
srp | Serbian | srp.traineddata
srp_latn | Serbian - Latin | srp_latn.traineddata
swa | Swahili | swa.traineddata
swe | Swedish | swe.traineddata
syr | Syriac | syr.traineddata
tam | Tamil | tam.traineddata
tel | Telugu | tel.traineddata
tgk | Tajik | tgk.traineddata
tgl | Tagalog | tgl.traineddata
tha | Thai | tha.traineddata
tir | Tigrinya | tir.traineddata
tur | Turkish | tur.traineddata
uig | Uighur; Uyghur | uig.traineddata
ukr | Ukrainian | ukr.traineddata
urd | Urdu | urd.traineddata
uzb | Uzbek | uzb.traineddata
uzb_cyrl | Uzbek - Cyrillic | uzb_cyrl.traineddata
vie | Vietnamese | vie.traineddata
yid | Yiddish | yid.traineddata
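Every filename in the language table is the language code followed by `.traineddata`, so a quick check for missing files is easy to script. The sketch below assumes the data files sit together in one directory; the helper names `traineddata_filename` and `missing_traineddata` are hypothetical and not part of any OCR tool.

```python
from pathlib import Path
from typing import Iterable, List, Union


def traineddata_filename(lang_code: str) -> str:
    """Build the filename used in the language table, e.g. 'eng' -> 'eng.traineddata'."""
    return f"{lang_code}.traineddata"


def missing_traineddata(lang_codes: Iterable[str],
                        data_dir: Union[str, Path]) -> List[str]:
    """Return the language codes whose data file is absent from data_dir.

    data_dir is an assumed location; point it at wherever the files are installed.
    """
    data_dir = Path(data_dir)
    return [
        code
        for code in lang_codes
        if not (data_dir / traineddata_filename(code)).is_file()
    ]
```

Calling `missing_traineddata(["eng", "deu"], some_dir)` returns the codes that still need a data file, in the order they were requested.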
Script data files support all languages that use a specific script; for example, Latin supports Italian, French, Spanish, etc.
All script models also support English, except for Cyrillic.
Script Code | Script | Filename
Arabic | All Arabic languages | Arabic.traineddata
Armenian | Armenian language | Armenian.traineddata
Bengali | Bengali language | Bengali.traineddata
Canadian_Aboriginal | All Canadian languages (Aboriginal) | Canadian_aboriginal.traineddata
Cherokee | Cherokee language | Cherokee.traineddata
Cyrillic | All Cyrillic languages | Cyrillic.traineddata
Devanagari | Hindi, Sanskrit, Marathi, Nepali and English languages | Devanagari.traineddata
Ethiopic | Ethiopic language | Ethiopic.traineddata
Fraktur | Combination of all the Latin-based languages that have an “old” variant | Fraktur.traineddata
Georgian | Georgian language | Georgian.traineddata
Greek | Greek language | Greek.traineddata
Gujarati | Gujarati language | Gujarati.traineddata
Gurmukhi | Gurmukhi language | Gurmukhi.traineddata
Hangul | All Hangul languages | Hangul.traineddata
Hangul_Vert | All Hangul languages (Vertical) | Hangul_vert.traineddata
Hans | All Han languages (Simplified) | Hans.traineddata
Hans_Vert | All Han languages (Simplified, Vertical) | Hans_vert.traineddata
Hant | All Han languages (Traditional) | Hant.traineddata
Hant_Vert | All Han languages (Traditional, Vertical) | Hant_vert.traineddata
Hebrew | Hebrew language | Hebrew.traineddata
Japanese | Japanese language | Japanese.traineddata
Japanese_Vert | Japanese language (Vertical) | Japanese_vert.traineddata
Kannada | Kannada language | Kannada.traineddata
Khmer | Khmer language | Khmer.traineddata
Lao | Lao language | Lao.traineddata
Latin | All Latin-based languages, except Vietnamese | Latin.traineddata
Malayalam | Malayalam language | Malayalam.traineddata
Myanmar | Myanmar language | Myanmar.traineddata
Oriya | Oriya language | Oriya.traineddata
Sinhala | Sinhala language | Sinhala.traineddata
Syriac | Syriac language | Syriac.traineddata
Tamil | Tamil language | Tamil.traineddata
Telugu | Telugu language | Telugu.traineddata
Thaana | Thaana language | Thaana.traineddata
Thai | Thai language | Thai.traineddata
Tibetan | Tibetan language | Tibetan.traineddata
Vietnamese | Latin-based Vietnamese language | Vietnamese.traineddata
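In the tables on this page, language codes are written in lowercase (`eng`, `aze_cyrl`), while script codes start with an uppercase letter (`Latin`, `Hans_Vert`). Assuming that naming pattern holds, a code can be classified without a full lookup table. This is a sketch based only on that observation; `model_kind` and `SPECIAL_CODES` are hypothetical names.

```python
# Detection data files listed on this page; they are not tied to one language.
SPECIAL_CODES = {"osd", "equ"}


def model_kind(code: str) -> str:
    """Classify a code as 'special', 'script', or 'language'.

    Relies on the casing convention seen in the tables above, which is an
    assumption rather than a documented rule.
    """
    if not code:
        raise ValueError("empty model code")
    if code in SPECIAL_CODES:
        return "special"
    # Script codes in the table begin with an uppercase letter.
    return "script" if code[0].isupper() else "language"
```

Note that the script filenames do not always match the script code character for character (for example, `Hangul_Vert` is listed with `Hangul_vert.traineddata`), so filenames for script models are best taken from the table rather than derived from the code.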