Open Source Document Management System | OpenKM - Bad OCR recognition with Tesseract, how to improve it

Bad OCR recognition with Tesseract, how to improve it

#48464 by OpaHeinz
Mon Jul 22, 2019 10:05 am

I still have problems with the quality of OCR text recognition with my installed Tesseract.

When I extract text manually on the command line, the extracted text is first class.
Just one example (always the same JPG file): the text should be "Allmersbach im Tal", and on the command line I get "Allmersbach im Tal" --> fine!

When I let OpenKM extract the text, the extracted text in the database is bad.
With OpenKM I get "Gummersbacher Jim Tal" or "Angeknackst", and "EUR" becomes "EUER".

My configuration:
ubuntu: Ubuntu 18.04.2 LTS
tomcat-8.5.24
openkm: 6.3.6 build: 87d181f CE
system.ocr: /usr/bin/tesseract ${fileIn} ${fileOut} -l deu+deu_frak+eng
system.ocr.rotate: 0;90;180;270;
system.pdf.force.ocr: true
registered.text.extractors: com.openkm.extractor.Tesseract3TextExtractor
system.openoffice.dictionary: /home/openkm/dictonarys/dict-de_de-frami_2017-01-12.oxt
system.openoffice.path: /usr/lib/libreoffice
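
To compare the two code paths fairly, the manual run should use exactly the arguments from the system.ocr line above. The following is a minimal Python sketch that expands that template into an argument list. How OpenKM itself fills in ${fileIn} and ${fileOut} is an assumption here, and the function name build_ocr_command is hypothetical.

```python
import shlex
from string import Template
from typing import List

# Value copied from the system.ocr setting quoted in this post.
SYSTEM_OCR = "/usr/bin/tesseract ${fileIn} ${fileOut} -l deu+deu_frak+eng"


def build_ocr_command(template: str, file_in: str, file_out: str) -> List[str]:
    """Expand the ${fileIn}/${fileOut} placeholders into an argument list.

    Hypothetical helper: it mimics the template by plain substitution and
    is not taken from the OpenKM source.
    """
    # Split first, substitute second, so a path with spaces stays one argument.
    return [
        Template(token).substitute(fileIn=file_in, fileOut=file_out)
        for token in shlex.split(template)
    ]
```

Passing the resulting list to subprocess.run(cmd, check=True) would run Tesseract with the same language list as the configuration; Tesseract writes its result to the output base name plus ".txt". If that text differs from a run with a single language such as -l deu, the language combination is one candidate for the difference.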

The following languages are installed:
user@server:# tesseract --list-langs
List of available languages (4):
eng
osd
deu_frak
deu

For both languages, deu_frak and deu, the newest training data is installed.
The problem is that with such results a full-text search always fails.

Any help or hint on how I can improve the results?

Thank you in advance
OpaHeinz

Re: Bad OCR recognition with Tesseract, how to improve it

#48483 by jllort
Thu Jul 25, 2019 7:36 am

If you cannot live with the Tesseract results, you should consider buying a commercial OCR engine; the cost will be lower than trying to solve the issues.

Re: Bad OCR recognition with Tesseract, how to improve it

#48488 by OpaHeinz
Thu Jul 25, 2019 8:27 am

Hi jllort, not the answer I expected.

Just some additional short questions: is my configuration OK so far?
I am still struggling because, as explained, in my experience the manual OCR process on the command line delivers a better result than the automatic OpenKM process.

The comparison of OCR tools available in the OKM handbook is quite old.
Is there a newer recommendation on which OCR engine would be the best?

Addendum: I tried to get the ABBYY FineReader engine, but the company said they will not provide it to private persons.

Re: Bad OCR recognition with Tesseract, how to improve it

#48494 by jllort
Thu Jul 25, 2019 3:17 pm

I guessed that was not the answer you were expecting, but training Tesseract does not seem easy. You can experiment with it, which is another option (or look for a freelancer who does it for you; that will have some cost, but might be more reasonable). In my experience, unfortunately, when you have a lot of issues with OCR results, in most cases it is a lost battle. That's why my suggestion is to switch to a commercial engine or to somebody who trains the OCR engine for you. Otherwise you will spend a lot of effort and hours for nothing, and in the end you will arrive at the same point where you started (with more experience, but a similar result).

The best open source OCR engine is Tesseract. Depending on the OS version, version 3 or 4 is installed by default.
You can also try ocr4linux (also from ABBYY) https://www.ocr4linux.com/en:start (we tested it in the past and it works really well, but you pay per page processed, not per document, and each year).

Another solution might be using a tool like Chronoscan (cheap) and integrating the OCR text extraction in another manner, but it will need some development effort (send a document to a Chronoscan folder, periodically look for the extracted text and import it into OpenKM -> a crontab task bound to another folder).
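
The polling half of that folder-based idea could look roughly like the Python sketch below, meant to be started from a crontab entry. It assumes the external OCR tool drops one plain-text .txt file per document into an output folder; that layout, the function name collect_extracted_text, and the handler callback are all hypothetical. The actual import into OpenKM is left to the handler, because no OpenKM API call is assumed here.

```python
from pathlib import Path
from typing import Callable


def collect_extracted_text(out_dir: Path, handler: Callable[[str, str], None]) -> int:
    """Hand every finished .txt file in out_dir to handler, then archive it.

    handler receives (document name without extension, extracted text).
    Returns the number of files processed in this run.
    """
    done_dir = out_dir / "done"
    done_dir.mkdir(exist_ok=True)
    processed = 0
    # glob("*.txt") is not recursive, so files already in done/ are skipped.
    for txt_file in sorted(out_dir.glob("*.txt")):
        text = txt_file.read_text(encoding="utf-8")
        handler(txt_file.stem, text)
        # Move the file away so the next cron run does not import it again.
        txt_file.rename(done_dir / txt_file.name)
        processed += 1
    return processed
```

A cron job would call this with the OCR output folder and a handler that stores the text for the matching document. Moving processed files into a done subfolder keeps each run idempotent without needing a separate state file.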
