No extraction after installing OCR

Posted: Thu Aug 26, 2021 2:50 pm
by Marco_I
Hi all

I'm running OpenKM in a Docker environment on an Intel NUC.

I have now installed OCR according to this guide:
https://docs.openkm.com/kcenter/view/ok ... ngine.html
Code:
system.ocr    String    /usr/bin/tesseract ${fileIn} ${fileOut}
If I test OCR from the Debian console, it works.
But in OpenKM nothing happens.
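
One way to narrow this down is to run the configured command by hand inside the same container OpenKM runs in. The following is a minimal Python sketch of that check. It assumes that OpenKM simply replaces ${fileIn} and ${fileOut} with file paths before calling the command; that is my reading of the setting, not something confirmed in this thread. The function names build_command and run_ocr are hypothetical helpers, not part of OpenKM.

```python
import shlex
import subprocess
from string import Template

# Value of system.ocr as quoted in the post above.
SYSTEM_OCR = "/usr/bin/tesseract ${fileIn} ${fileOut}"


def build_command(template, file_in, file_out):
    """Expand ${fileIn} and ${fileOut} into an argument list.

    Assumption: plain placeholder substitution, nothing more.
    Splitting before substituting keeps paths with spaces as one argument.
    """
    values = {"fileIn": file_in, "fileOut": file_out}
    return [Template(token).substitute(values) for token in shlex.split(template)]


def run_ocr(template, file_in, file_out):
    """Run the expanded command and return the recognised text.

    Tesseract appends ".txt" to the output base name it is given.
    """
    subprocess.run(build_command(template, file_in, file_out), check=True)
    with open(file_out + ".txt", encoding="utf-8") as handle:
        return handle.read()
```

Calling run_ocr(SYSTEM_OCR, "/tmp/test.png", "/tmp/test") inside the container (the paths are placeholders) should raise an error if the binary is missing or fails there, which would explain why the host console works while OpenKM does not.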

If I set system.pdf.force.ocr to "true", the regular text extraction stops working as well.
So I set it back to "false". Now the text extraction works again, but there is still no OCR.

Does anyone have an idea what I'm doing wrong? I haven't found a solution on Google.

Thank you very much
Marco

PS: "Check text extraction" shows exactly the same behaviour. If I test an image, it shows me the metadata but no recognized text from inside the picture.

Re: No extraction after installing OCR

Posted: Thu Aug 26, 2021 3:17 pm
by Marco_I
I think I found a solution. I was using the wrong Docker image.

https://hub.docker.com/r/openkm/openkm-ce

This helped me.

But I have an additional question:
https://s29843.pcdn.co/blog/wp-content/ ... 24x768.png

From this picture it can only extract "eee FROM AN IMAGE".
That is quite poor. Is there any way to improve the result?
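
A possible starting point, sketched here in Python, is to try the same image on the console with a few different page segmentation modes before changing anything in OpenKM. This assumes Tesseract 4.x, where -l, --psm and --oem are documented options. Whether any of these settings actually helps for this particular picture is untested, and candidate_commands is a hypothetical helper name.

```python
def candidate_commands(image, out_base, lang="eng", psm_modes=(3, 6, 11), oem=1):
    """Build one tesseract command line per page segmentation mode.

    Tesseract 4.x meanings: psm 3 is fully automatic segmentation (the
    default), psm 6 assumes a single uniform block of text, psm 11 looks
    for sparse text. oem 1 selects the LSTM engine only.
    """
    commands = []
    for psm in psm_modes:
        commands.append([
            "tesseract", image, f"{out_base}_psm{psm}",
            "-l", lang,
            "--psm", str(psm),
            "--oem", str(oem),
        ])
    return commands
```

Each list can be passed to subprocess.run, and the resulting .txt files compared by eye. If one mode clearly wins, the matching options could then be tried in the system.ocr value, assuming OpenKM passes extra arguments through unchanged.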

Re: No extraction after installing OCR

Posted: Fri Aug 27, 2021 6:50 pm
by jllort
Which tesseract-ocr engine version do you have configured? Is it 4.x?

Re: No extraction after installing OCR

Posted: Mon Aug 30, 2021 2:03 pm
by kvist
I am having a somewhat similar problem using the official Docker image, which already comes with Tesseract 4.00 installed. For some bizarre reason, OpenKM seems to pick one of the Abby, Cuneiform, Tesseract3, and Barcode TextExtractors at random, regardless of the configuration.

Every time I run
Code:
docker run --rm -p 8080:8080 openkm/openkm-ce
and then go to localhost:8080 and navigate to Administration > Utility > Test text extraction, OpenKM uses a different TextExtractor in each new container, but almost never the one I want it to use.

What exactly am I missing here? I've also created an issue for this on GitHub, complete with a demo repository.

Help would be greatly appreciated!