
Bad OCR recognition with Tesseract, how to improve it

Posted: Mon Jul 22, 2019 10:05 am
by OpaHeinz
I still have problems with the quality of OCR text recognition with my installed Tesseract.

When I extract the text manually on the command line, the extracted text is first class.
Just one example (always the same JPG file): the text should be "Allmersbach im Tal", and on the command line I got "Allmersbach im Tal" --> fine!

When I let OpenKM extract the text, the extracted text in the database is bad.
From OpenKM I got "Gummersbacher Jim Tal" or "Angeknackst", and instead of "EUR" I got "EUER".

My configuration:
Ubuntu 18.04.2 LTS
tomcat-8.5.24
openkm: 6.3.6 build: 87d181f CE
system.ocr: /usr/bin/tesseract ${fileIn} ${fileOut} -l deu+deu_frak+eng
system.ocr.rotate: 0;90;180;270;
system.pdf.force.ocr: true
registered.text.extractors: com.openkm.extractor.Tesseract3TextExtractor
system.openoffice.dictionary: /home/openkm/dictonarys/dict-de_de-frami_2017-01-12.oxt
system.openoffice.path: /usr/lib/libreoffice
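
One way to narrow down where the difference comes from is to run exactly the command from system.ocr by hand and compare its output with the text stored in the database. The following is only a minimal Python sketch. It assumes that OpenKM replaces ${fileIn} and ${fileOut} with file paths and that tesseract appends ".txt" to the output base; the helper names build_ocr_command and run_ocr and the file names are hypothetical.

```python
import shlex
import shutil
import subprocess
from string import Template

# Value of system.ocr, copied from the configuration above.
SYSTEM_OCR = "/usr/bin/tesseract ${fileIn} ${fileOut} -l deu+deu_frak+eng"


def build_ocr_command(template, file_in, file_out):
    """Fill in the placeholders and split the result into an argument list."""
    # Quote the paths first so that names containing spaces survive shlex.split.
    filled = Template(template).substitute(
        fileIn=shlex.quote(file_in), fileOut=shlex.quote(file_out)
    )
    return shlex.split(filled)


def run_ocr(template, file_in, out_base):
    """Run the command and return the recognized text.

    Assumption: tesseract writes its result to out_base + ".txt".
    """
    cmd = build_ocr_command(template, file_in, out_base)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"OCR binary not found: {cmd[0]}")
    subprocess.run(cmd, check=True)
    with open(out_base + ".txt", encoding="utf-8") as handle:
        return handle.read()

# Example call (hypothetical file names):
#   print(run_ocr(SYSTEM_OCR, "scan.jpg", "/tmp/scan_ocr"))
```

If this output is already "Allmersbach im Tal", the recognition step itself is probably not where the text gets changed, and the remaining settings would be the next place to look.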

The following languages are installed:
user@server:# tesseract --list-langs
List of available languages (4):
eng
osd
deu_frak
deu
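
As a small sanity check, the language list above can be compared with the value passed to -l. This is just a sketch in Python; the function names are hypothetical and the sample text is the output shown above.

```python
def parse_list_langs(output):
    """Return the language codes from `tesseract --list-langs` output."""
    codes = set()
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages (N):" header.
        if line and not line.startswith("List of"):
            codes.add(line)
    return codes


def missing_languages(lang_option, installed):
    """Return the codes of a '-l a+b+c' value that are not installed."""
    return [code for code in lang_option.split("+") if code not in installed]


SAMPLE = """List of available languages (4):
eng
osd
deu_frak
deu
"""

print(missing_languages("deu+deu_frak+eng", parse_list_langs(SAMPLE)))  # prints []
```

An empty list means every language named in system.ocr is available to tesseract.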

For both languages, deu_frak and deu, the newest training data is installed.
The problem is that with such results a full-text search always fails.

Any help or hint on how I can improve the results?

Thank you in advance
OpaHeinz

Re: Bad OCR recognition with Tesseract, how to improve it

Posted: Thu Jul 25, 2019 7:36 am
by jllort
If you cannot live with the Tesseract results, you should consider buying a commercial OCR engine; the cost will be lower than trying to solve the issues.

Re: Bad OCR recognition with Tesseract, how to improve it

Posted: Thu Jul 25, 2019 8:27 am
by OpaHeinz
Hi Jllort, not the answer I expected :lol: :lol:
Just a short additional question: is my configuration OK so far?
I am still struggling because, as explained, in my experience the manual OCR process on the command line delivers a better result than the automatic process in OpenKM.

The comparison of OCR tools available in the OKM handbook is rather old.
Is there a newer recommendation on which OCR engine would be the best?

Addendum: I tried to get the ABBYY FineReader engine, but the company said they will not give it to private persons :cry:

Re: Bad OCR recognition with Tesseract, how to improve it

Posted: Thu Jul 25, 2019 3:17 pm
by jllort
I supposed that was not the answer you were expecting, but training Tesseract does not seem easy. You can play with it; that is another option (or look for a freelancer who does it for you; that will have some cost, but might be more reasonable). In my experience, unfortunately, when you have a lot of issues with OCR results, it is in most cases a lost battle. That's why my suggestion is to switch to a commercial engine or to somebody who trains the OCR engine for you. Otherwise, you will spend a lot of effort and hours for nothing and in the end arrive at the same point where you started (with more experience, but a similar result).

The best open source OCR engine is Tesseract. Depending on the OS version, version 3 or 4 will be installed by default.
Also, you can try ocr4linux (also from ABBYY) https://www.ocr4linux.com/en:start (we tested it in the past and it works really well, but you pay per page processed, not per document, and every year).

Another solution might be to use a tool like Chronoscan (cheap) and integrate the OCR text extraction in another manner, but this will need some development effort (send a document to a Chronoscan folder, periodically look for the extracted text and import it -> a crontab task bound to another folder).
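
The folder-based idea could look roughly like the following Python sketch. Everything in it is an assumption for illustration: the folder layout, the rule that the external tool writes a .txt file named after the document, and the function names are hypothetical and not taken from Chronoscan or OpenKM. The import of the text into OpenKM is left out.

```python
import shutil
import time
from pathlib import Path


def submit_document(document, inbox):
    """Copy a document into the folder the external OCR tool watches."""
    inbox.mkdir(parents=True, exist_ok=True)
    target = inbox / document.name
    shutil.copy2(document, target)
    return target


def collect_text(document_name, outbox, timeout=60.0, interval=1.0):
    """Poll the output folder until <name>.txt appears.

    Returns the extracted text, or None if nothing arrived before the timeout.
    """
    expected = outbox / (Path(document_name).stem + ".txt")
    deadline = time.monotonic() + timeout
    while True:
        if expected.exists():
            return expected.read_text(encoding="utf-8")
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)
```

A crontab entry could call a script like this periodically instead of polling in a loop; in that case collect_text would be called with timeout=0 so that it checks once and returns.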