• OCR feature not working in community

  • OpenKM has many interesting features, but requires some configuration process to show its full potential.
 #42541  by Tazbir
 
Hi,

I have spent several days configuring OpenKM. I would like to use it to manage my documents at home. The OCR feature is critical because I want the contents of all uploaded documents to be taken into account when searching. That is all I need.

I've installed OpenKM Community 6.3.2 under Debian Stretch 4.7.8-1 (2016-10-19) x86_64 GNU/Linux
I've installed tesseract 3.04.01
I've installed all the required Java stuff.

Below is the configuration that I set in the administration tab of OpenKM.
Code:
registered.text.extractors= com.openkm.extractor.Tesseract3TextExtractor -l eng
system.ocr=/usr/bin/tesseract
system.ocr.rotate= 90;180;270; 
system.pdf.force.ocr=TRUE
The OCR feature does not seem to be working. When I run Tesseract from the command line, I do get results.

In the log file I see the following message:
Code:
WARN  com.openkm.extractor.RegisteredExtractors- Text extraction failure: Full text indexing of 'image/png' is not supported
 #42549  by jllort
 
This is wrong:
Code:
registered.text.extractors= com.openkm.extractor.Tesseract3TextExtractor -l eng
It should be:
Code:
registered.text.extractors=com.openkm.extractor.Tesseract3TextExtractor
As for this setting:
Code:
system.ocr=/usr/bin/tesseract
It should be (as explained here: http://wiki.openkm.com/index.php/Third- ... ation:_OCR):
Code:
system.ocr=/usr/bin/tesseract ${fileIn} ${fileOut} -l eng
In fact, if you have only installed English language support for Tesseract, it is not necessary to specify -l.
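
To illustrate why the placeholders matter, here is a minimal Python sketch of how a command template like the system.ocr value above might be expanded before the OCR program is run. This is not OpenKM's actual code: the helper name build_ocr_command and the file paths are made up for illustration, and it only assumes that ${fileIn} and ${fileOut} are replaced with the input image and the output base name.

```python
import shlex
from string import Template

# The system.ocr value suggested above.
SYSTEM_OCR = "/usr/bin/tesseract ${fileIn} ${fileOut} -l eng"


def build_ocr_command(template: str, file_in: str, file_out: str) -> list[str]:
    """Hypothetical helper: expand ${fileIn}/${fileOut} and return an argv list.

    Assumption: the application substitutes these two placeholders with the
    path of the image to scan and the base name of the output text file.
    """
    expanded = Template(template).substitute(
        fileIn=shlex.quote(file_in),    # quote so paths with spaces survive
        fileOut=shlex.quote(file_out),
    )
    return shlex.split(expanded)


if __name__ == "__main__":
    # With placeholders, Tesseract receives an input and an output argument.
    print(build_ocr_command(SYSTEM_OCR, "/tmp/scan.png", "/tmp/scan"))
    # Without placeholders, only the bare executable is left, so no file
    # is ever passed to Tesseract.
    print(build_ocr_command("/usr/bin/tesseract", "/tmp/scan.png", "/tmp/scan"))
```

Under that assumption, the original setting system.ocr=/usr/bin/tesseract expands to the executable alone, which would explain why OCR works from the command line but not from inside OpenKM.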
