• Bad OCR recognition with Tesseract, how to improve it

  • OpenKM has many interesting features, but requires some configuration process to show its full potential.
 #48464  by OpaHeinz
 
I still have problems with the quality of OCR text recognition with the installed Tesseract.

When I extract the text manually on the command line, the extracted text is first class.
Just one example (always the same JPG file): the text should be "Allmersbach im Tal", and on the command line I got "Allmersbach im Tal" --> fine!

When I let OpenKM extract the text, the extracted text in the database is bad.
From OpenKM I got "Gummersbacher Jim Tal", or "Angeknackst", and instead of "EUR" I got "EUER".

My configuration:
ubuntu Ubuntu 18.04.2 LTS
tomcat-8.5.24
openkm: 6.3.6 build: 87d181f CE
system.ocr: /usr/bin/tesseract ${fileIn} ${fileOut} -l deu+deu_frak+eng
system.ocr.rotate: 0;90;180;270;
system.pdf.force.ocr: true
registered.text.extractors: com.openkm.extractor.Tesseract3TextExtractor
system.openoffice.dictionary: /home/openkm/dictonarys/dict-de_de-frami_2017-01-12.oxt
system.openoffice.path: /usr/lib/libreoffice

The following languages are installed:
user@server:# tesseract --list-langs
List of available languages (4):
eng
osd
deu_frak
deu

For both languages, deu_frak and deu, the newest training data is installed.
The problem is that with such results a full-text search always fails.

Any help or hints on how I can improve the results?

Thank you in advance
OpaHeinz
 #48483  by jllort
 
If you cannot live with the Tesseract results, you should consider buying a commercial OCR engine; the cost will be lower than trying to solve the issues.
 #48488  by OpaHeinz
 
Hi jllort, not the answer I expected :lol: :lol:
Just a few short additional questions: is my configuration OK so far?
I am still struggling because, as explained, in my experience the manual OCR process on the command line delivers a better result than the automatic process in OpenKM.
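
To rule out a difference in the command itself, it can help to expand the system.ocr template by hand and run exactly that command. Below is a minimal sketch in Python. It assumes that the ${fileIn} and ${fileOut} placeholders are replaced textually; how OpenKM expands them internally is not confirmed in this thread. The function name `expand_ocr_command` and the file names are hypothetical.

```python
import shlex
from string import Template

# Value of system.ocr copied from the configuration quoted in this thread.
SYSTEM_OCR = "/usr/bin/tesseract ${fileIn} ${fileOut} -l deu+deu_frak+eng"


def expand_ocr_command(template: str, file_in: str, file_out: str) -> list[str]:
    """Fill the ${fileIn}/${fileOut} placeholders and return an argv list.

    Assumption: plain textual substitution; OpenKM's own expansion may differ.
    The template is split first, so paths containing spaces stay one argument.
    """
    tokens = shlex.split(template)
    return [
        Template(token).safe_substitute(fileIn=file_in, fileOut=file_out)
        for token in tokens
    ]


if __name__ == "__main__":
    # Hypothetical input image and output base name.
    argv = expand_ocr_command(SYSTEM_OCR, "scan.jpg", "/tmp/scan")
    print(shlex.join(argv))
```

Running the printed command in a shell and comparing the resulting text file (Tesseract appends .txt to the output base name) with the manual run would show whether the language list or another option explains the difference.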

The comparison of OCR tools in the OKM handbook is rather old.
Is there a newer recommendation on which OCR engine would be the best?

Addendum: I tried to get the ABBYY FineReader engine, but the company said they will not sell it to private persons :cry:
 #48494  by jllort
 
I guessed that was not the answer you were expecting, but training Tesseract is not easy. You can experiment with it, which is another option (or look for a freelancer to do it for you; that will have some cost, but might be more reasonable). In my experience, unfortunately, when you have a lot of issues with OCR results, in most cases it is a lost battle. That is why my suggestion is to switch to a commercial engine or to have somebody train the OCR engine for you. Otherwise you will spend a lot of effort and hours for nothing and end up at the same point where you started (with more experience, but a similar result).

The best open source OCR engine is Tesseract. Depending on the OS version, version 3 or 4 is installed by default.
You can also try ocr4linux (also from ABBYY) https://www.ocr4linux.com/en:start (we tested it in the past and it works really well, but you pay for pages processed, not documents, and every year).
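
Regarding which Tesseract version the OS ships: a quick way to check is to parse the output of `tesseract --version`. This is a minimal Python sketch, assuming the output contains a line such as "tesseract 4.0.0-beta.1". The function names are hypothetical, and the binary path is the one from the configuration quoted earlier in this thread.

```python
import re
import subprocess


def parse_major_version(version_output: str) -> int:
    """Extract the major version from text such as 'tesseract 4.0.0-beta.1'."""
    match = re.search(r"tesseract\s+v?(\d+)\.", version_output)
    if match is None:
        raise ValueError("no tesseract version found in output")
    return int(match.group(1))


def installed_major_version(binary: str = "/usr/bin/tesseract") -> int:
    """Run `<binary> --version` and return the major version number."""
    result = subprocess.run(
        [binary, "--version"], capture_output=True, text=True, check=True
    )
    # Some releases print the version to stderr, so both streams are searched.
    return parse_major_version(result.stdout + result.stderr)
```

Knowing the major version may help when deciding whether an extractor class written for version 3 is the right match for the installed binary.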

Another solution might be to use a tool like Chronoscan (cheap) and integrate the OCR text extraction in another manner, but this will need some development effort (send a document to a Chronoscan folder, then periodically look for the extracted text and import it -> a crontab task bound to another folder).
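
The folder-based integration described above could be sketched roughly as follows in Python. Everything here is an assumption for illustration: the inbox/outbox folder layout, the idea that the external tool writes one .txt file per document, and the `import_text` callback that would store the text in OpenKM. None of it is Chronoscan's or OpenKM's actual API.

```python
import shutil
from pathlib import Path
from typing import Callable


def submit_for_ocr(document: Path, inbox: Path) -> Path:
    """Copy a document into the folder the external OCR tool watches."""
    inbox.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(document, inbox / document.name))


def collect_results(outbox: Path, import_text: Callable[[str, str], None]) -> int:
    """Hand every finished .txt file to import_text, then mark it as done.

    Returns the number of files processed in this pass.
    """
    done = 0
    for txt in sorted(outbox.glob("*.txt")):
        # txt.stem is the document name without the .txt suffix.
        import_text(txt.stem, txt.read_text(encoding="utf-8"))
        # Rename so the file is skipped on the next pass.
        txt.rename(txt.with_name(txt.name + ".imported"))
        done += 1
    return done
```

A crontab entry could call `collect_results` every few minutes. Renaming processed files keeps each pass idempotent, so a missed or repeated run does not import the same text twice.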
