OCR Support Files

 

Tesseract OCR files come in three variants:

| Variant | Trained models | Speed | Accuracy |
| --- | --- | --- | --- |
| LSTM Only (Fast) | Best “value for money” in terms of speed vs. accuracy | Fastest | Least accurate |
| LSTM Only (Best) | Use if you are willing to trade a lot of speed for slightly better accuracy | Slowest | Most accurate |
| LSTM + Legacy | Includes both “LSTM-Fast” and legacy models | Faster than “LSTM-Best” | Slightly less accurate than “LSTM-Best” |
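The trade-offs in the table can be read as a simple decision rule. The sketch below is only an illustration: the function name `choose_variant` and its two flags are hypothetical, and it assumes that a workflow needing the legacy models can only be served by the "LSTM + Legacy" variant.

```python
def choose_variant(needs_legacy: bool, prefer_accuracy: bool) -> str:
    """Return the variant name from the table that fits the stated needs.

    Both parameters are hypothetical inputs used only for this illustration.
    """
    if needs_legacy:
        # Only this variant bundles the legacy models.
        return "LSTM + Legacy"
    if prefer_accuracy:
        # Slowest, but most accurate.
        return "LSTM Only (Best)"
    # Fastest and least accurate: the usual speed/accuracy compromise.
    return "LSTM Only (Fast)"


if __name__ == "__main__":
    print(choose_variant(needs_legacy=False, prefer_accuracy=False))
```

Putting the legacy check first reflects the assumption that model availability is a hard constraint, while speed versus accuracy is a preference.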

 

There are also two types of models:

· Language models: support for a single language

· Script models: support for a single script (e.g. Latin or Japanese) that may be used by multiple languages

 

Special Data Files

 

| Lang Code | Description | Filename |
| --- | --- | --- |
| osd | Orientation and script detection | osd.traineddata |
| equ | Math / equation detection | equ.traineddata |
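As an illustration of how the `osd` file is typically used, the sketch below builds a Tesseract command line that runs orientation and script detection only (page segmentation mode 0). The helper name `build_osd_command` and the example paths are hypothetical. The code only assembles the argument list; it assumes a `tesseract` executable is on the PATH and that `osd.traineddata` is in the data folder if the command is actually run.

```python
from typing import List, Optional


def build_osd_command(image_path: str, tessdata_dir: Optional[str] = None) -> List[str]:
    """Build a tesseract command that performs orientation and script detection only."""
    # "stdout" as the output base makes Tesseract print the result instead of writing a file.
    cmd = ["tesseract", image_path, "stdout"]
    if tessdata_dir is not None:
        # Point Tesseract at the folder that holds osd.traineddata.
        cmd += ["--tessdata-dir", tessdata_dir]
    # Page segmentation mode 0: orientation and script detection (OSD) only.
    cmd += ["--psm", "0"]
    return cmd


if __name__ == "__main__":
    # Hypothetical paths, shown only to illustrate the resulting command line.
    print(" ".join(build_osd_command("scan.png", "/path/to/tessdata")))
```

The returned list can be passed to `subprocess.run` as-is, which avoids shell quoting problems with paths that contain spaces.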

 

 

Language Data Files


The following are OCR data files supporting a single language:

| Lang Code | Language | Filename |
| --- | --- | --- |
| afr | Afrikaans | afr.traineddata |
| amh | Amharic | amh.traineddata |
| ara | Arabic | ara.traineddata |
| asm | Assamese | asm.traineddata |
| aze | Azerbaijani | aze.traineddata |
| aze_cyrl | Azerbaijani - Cyrillic | aze_cyrl.traineddata |
| bel | Belarusian | bel.traineddata |
| ben | Bengali | ben.traineddata |
| bod | Tibetan | bod.traineddata |
| bos | Bosnian | bos.traineddata |
| bul | Bulgarian | bul.traineddata |
| cat | Catalan; Valencian | cat.traineddata |
| ceb | Cebuano | ceb.traineddata |
| ces | Czech | ces.traineddata |
| chi_sim | Chinese - Simplified | chi_sim.traineddata |
| chi_tra | Chinese - Traditional | chi_tra.traineddata |
| chr | Cherokee | chr.traineddata |
| cym | Welsh | cym.traineddata |
| dan | Danish | dan.traineddata |
| deu | German | deu.traineddata |
| dzo | Dzongkha | dzo.traineddata |
| ell | Greek, Modern (1453-) | ell.traineddata |
| eng | English | eng.traineddata |
| enm | English, Middle (1100-1500) | enm.traineddata |
| epo | Esperanto | epo.traineddata |
| est | Estonian | est.traineddata |
| eus | Basque | eus.traineddata |
| fas | Persian | fas.traineddata |
| fin | Finnish | fin.traineddata |
| fra | French | fra.traineddata |
| frk | German Fraktur | frk.traineddata |
| frm | French, Middle (ca. 1400-1600) | frm.traineddata |
| gle | Irish | gle.traineddata |
| glg | Galician | glg.traineddata |
| grc | Greek, Ancient (-1453) | grc.traineddata |
| guj | Gujarati | guj.traineddata |
| hat | Haitian; Haitian Creole | hat.traineddata |
| heb | Hebrew | heb.traineddata |
| hin | Hindi | hin.traineddata |
| hrv | Croatian | hrv.traineddata |
| hun | Hungarian | hun.traineddata |
| iku | Inuktitut | iku.traineddata |
| ind | Indonesian | ind.traineddata |
| isl | Icelandic | isl.traineddata |
| ita | Italian | ita.traineddata |
| ita_old | Italian - Old | ita_old.traineddata |
| jav | Javanese | jav.traineddata |
| jpn | Japanese | jpn.traineddata |
| kan | Kannada | kan.traineddata |
| kat | Georgian | kat.traineddata |
| kat_old | Georgian - Old | kat_old.traineddata |
| kaz | Kazakh | kaz.traineddata |
| khm | Central Khmer | khm.traineddata |
| kir | Kirghiz; Kyrgyz | kir.traineddata |
| kor | Korean | kor.traineddata |
| kur | Kurdish | kur.traineddata |
| lao | Lao | lao.traineddata |
| lat | Latin | lat.traineddata |
| lav | Latvian | lav.traineddata |
| lit | Lithuanian | lit.traineddata |
| mal | Malayalam | mal.traineddata |
| mar | Marathi | mar.traineddata |
| mkd | Macedonian | mkd.traineddata |
| mlt | Maltese | mlt.traineddata |
| msa | Malay | msa.traineddata |
| mya | Burmese | mya.traineddata |
| nep | Nepali | nep.traineddata |
| nld | Dutch; Flemish | nld.traineddata |
| nor | Norwegian | nor.traineddata |
| ori | Oriya | ori.traineddata |
| pan | Panjabi; Punjabi | pan.traineddata |
| pol | Polish | pol.traineddata |
| por | Portuguese | por.traineddata |
| pus | Pushto; Pashto | pus.traineddata |
| ron | Romanian; Moldavian; Moldovan | ron.traineddata |
| rus | Russian | rus.traineddata |
| san | Sanskrit | san.traineddata |
| sin | Sinhala; Sinhalese | sin.traineddata |
| slk | Slovak | slk.traineddata |
| slv | Slovenian | slv.traineddata |
| spa | Spanish; Castilian | spa.traineddata |
| spa_old | Spanish; Castilian - Old | spa_old.traineddata |
| sqi | Albanian | sqi.traineddata |
| srp | Serbian | srp.traineddata |
| srp_latn | Serbian - Latin | srp_latn.traineddata |
| swa | Swahili | swa.traineddata |
| swe | Swedish | swe.traineddata |
| syr | Syriac | syr.traineddata |
| tam | Tamil | tam.traineddata |
| tel | Telugu | tel.traineddata |
| tgk | Tajik | tgk.traineddata |
| tgl | Tagalog | tgl.traineddata |
| tha | Thai | tha.traineddata |
| tir | Tigrinya | tir.traineddata |
| tur | Turkish | tur.traineddata |
| uig | Uighur; Uyghur | uig.traineddata |
| ukr | Ukrainian | ukr.traineddata |
| urd | Urdu | urd.traineddata |
| uzb | Uzbek | uzb.traineddata |
| uzb_cyrl | Uzbek - Cyrillic | uzb_cyrl.traineddata |
| vie | Vietnamese | vie.traineddata |
| yid | Yiddish | yid.traineddata |
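Every filename in this table is the language code followed by `.traineddata`, so the files needed for a job can be derived from the codes. The sketch below is a minimal illustration with hypothetical helper names (`required_files`, `missing_files`). It assumes languages are combined with `+`, as in Tesseract's `-l eng+deu` option, and that all data files sit directly in one folder.

```python
from pathlib import Path
from typing import List


def required_files(lang_spec: str) -> List[str]:
    """Map a language spec such as 'eng+deu' to the data filenames it needs."""
    codes = [code.strip() for code in lang_spec.split("+") if code.strip()]
    return [f"{code}.traineddata" for code in codes]


def missing_files(lang_spec: str, tessdata_dir: str) -> List[str]:
    """Return the required filenames that are not present in tessdata_dir."""
    folder = Path(tessdata_dir)
    return [name for name in required_files(lang_spec) if not (folder / name).is_file()]


if __name__ == "__main__":
    # Hypothetical folder; replace with the location of your data files.
    print(missing_files("eng+deu", "/path/to/tessdata"))
```

Checking for missing files before starting a batch gives a clearer error than letting the OCR engine fail partway through.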

 

Script Data Files


Script data files support all languages of a specific script, e.g. Latin supports Italian, French, Spanish, etc.

All script models also support English, except for Cyrillic.

| Script Code | Supported languages | Filename |
| --- | --- | --- |
| Arabic | All Arabic languages | Arabic.traineddata |
| Armenian | Armenian language | Armenian.traineddata |
| Bengali | Bengali language | Bengali.traineddata |
| Canadian_Aboriginal | All Canadian Aboriginal languages | Canadian_aboriginal.traineddata |
| Cherokee | Cherokee language | Cherokee.traineddata |
| Cyrillic | All Cyrillic languages | Cyrillic.traineddata |
| Devanagari | Hindi, Sanskrit, Marathi, Nepali and English languages | Devanagari.traineddata |
| Ethiopic | Ethiopic language | Ethiopic.traineddata |
| Fraktur | Combination of all the Latin-based languages that have an “old” variant | Fraktur.traineddata |
| Georgian | Georgian language | Georgian.traineddata |
| Greek | Greek language | Greek.traineddata |
| Gujarati | Gujarati language | Gujarati.traineddata |
| Gurmukhi | Gurmukhi language | Gurmukhi.traineddata |
| Hangul | All Hangul languages | Hangul.traineddata |
| Hangul_Vert | All Hangul languages (Vertical) | Hangul_vert.traineddata |
| Hans | All Han languages (Simplified) | Hans.traineddata |
| Hans_Vert | All Han languages (Simplified, Vertical) | Hans_vert.traineddata |
| Hant | All Han languages (Traditional) | Hant.traineddata |
| Hant_Vert | All Han languages (Traditional, Vertical) | Hant_vert.traineddata |
| Hebrew | Hebrew language | Hebrew.traineddata |
| Japanese | Japanese language | Japanese.traineddata |
| Japanese_Vert | Japanese language (Vertical) | Japanese_vert.traineddata |
| Kannada | Kannada language | Kannada.traineddata |
| Khmer | Khmer language | Khmer.traineddata |
| Lao | Lao language | Lao.traineddata |
| Latin | All Latin-based languages, except Vietnamese | Latin.traineddata |
| Malayalam | Malayalam language | Malayalam.traineddata |
| Myanmar | Myanmar language | Myanmar.traineddata |
| Oriya | Oriya language | Oriya.traineddata |
| Sinhala | Sinhala language | Sinhala.traineddata |
| Syriac | Syriac language | Syriac.traineddata |
| Tamil | Tamil language | Tamil.traineddata |
| Telugu | Telugu language | Telugu.traineddata |
| Thaana | Thaana language | Thaana.traineddata |
| Thai | Thai language | Thai.traineddata |
| Tibetan | Tibetan language | Tibetan.traineddata |
| Vietnamese | Latin-based Vietnamese language | Vietnamese.traineddata |
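When no single-language file is available, a script model can stand in for it. The sketch below shows one way to look up the script file for a language code. The mapping is deliberately partial and hypothetical: it covers only the languages this section names as examples (Italian, French and Spanish for Latin; Hindi, Sanskrit, Marathi and Nepali for Devanagari), and the helper name `script_file_for` is an assumption.

```python
from typing import Optional

# Partial, illustrative mapping built only from the examples named in this section.
SCRIPT_FOR_LANGUAGE = {
    "ita": "Latin",
    "fra": "Latin",
    "spa": "Latin",
    "hin": "Devanagari",
    "san": "Devanagari",
    "mar": "Devanagari",
    "nep": "Devanagari",
}


def script_file_for(lang_code: str) -> Optional[str]:
    """Return the script data filename covering lang_code, or None if it is not mapped."""
    script = SCRIPT_FOR_LANGUAGE.get(lang_code)
    return f"{script}.traineddata" if script else None


if __name__ == "__main__":
    for code in ("fra", "hin", "jpn"):
        print(code, script_file_for(code))
```

Returning `None` for unmapped codes leaves the caller to decide whether to fail or to fall back to another model.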