
OCR feature not working in community

Posted: Mon Nov 07, 2016 4:09 pm
by Tazbir
Hi,

I have spent several days configuring OpenKM. I would like to use the program to manage my documents at home. The OCR feature is critical for me, because I want the contents of all uploaded documents to be taken into account when searching. That is all I need.

I've installed OpenKM Community 6.3.2 under Debian Stretch 4.7.8-1 (2016-10-19) x86_64 GNU/Linux
I've installed tesseract 3.04.01
I've installed all the required Java stuff.

Below is the configuration that I performed in the administration tab in OpenKM.
Code:
registered.text.extractors= com.openkm.extractor.Tesseract3TextExtractor -l eng
system.ocr=/usr/bin/tesseract
system.ocr.rotate= 90;180;270; 
system.pdf.force.ocr=TRUE
The OCR feature does not seem to be working. When I run Tesseract from the command line, I do get results.

In the log file I see the following message:
Code:
WARN  com.openkm.extractor.RegisteredExtractors- Text extraction failure: Full text indexing of 'image/png' is not supported

Re: OCR feature not working in community

Posted: Tue Nov 08, 2016 12:55 pm
by jllort
This is wrong:
Code:
registered.text.extractors= com.openkm.extractor.Tesseract3TextExtractor -l eng
It should be:
Code:
registered.text.extractors= com.openkm.extractor.Tesseract3TextExtractor -l eng
As for this setting:
Code:
system.ocr=/usr/bin/tesseract
It should be (as explained here: http://wiki.openkm.com/index.php/Third- ... ation:_OCR ):
Code:
system.ocr=/usr/bin/tesseract ${fileIn} ${fileOut} -l eng
In fact, if you have only installed English language support for Tesseract, it is not necessary to specify the -l option.
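
To make the role of the two placeholders clearer, here is a minimal Python sketch of how a template like the system.ocr value above might be expanded into an actual command line. This is not OpenKM's own code. The function name build_ocr_command and the example paths are hypothetical, and the substitution logic is only an assumption about what happens to ${fileIn} and ${fileOut}.

```python
import shlex
from string import Template


def build_ocr_command(template, file_in, file_out):
    """Expand ${fileIn} and ${fileOut} in a system.ocr-style template.

    Hypothetical helper for illustration only; OpenKM's real
    substitution code may behave differently.
    """
    # Quote the paths first so that spaces in file names survive
    # the shlex.split() call below.
    expanded = Template(template).substitute(
        fileIn=shlex.quote(file_in),
        fileOut=shlex.quote(file_out),
    )
    # Split into an argument list suitable for subprocess.run().
    return shlex.split(expanded)


# Example paths are made up for this sketch.
cmd = build_ocr_command(
    "/usr/bin/tesseract ${fileIn} ${fileOut} -l eng",
    "/tmp/scan 01.png",
    "/tmp/scan_01",
)
print(cmd)
```

Passing the resulting list to subprocess.run(cmd, check=True) would run the OCR. Note that the tesseract command line treats the second argument as an output base name and writes the recognized text to that name with .txt appended. With only "system.ocr=/usr/bin/tesseract" there is nothing in the value to say where the input and output paths go.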