• OCR on existing documents and full text search not working

  • OpenKM has many interesting features, but requires some configuration process to show its full potential.
 #54030  by Toormser
 
Hello Community,

I've been having trouble with my OpenKM installation since a few days after installing it.

----
My first problem is that on the latest OpenKM Community Edition (6.3.12 (build: a3587ce)), OCR is not working for either newly uploaded or existing files.

I've already installed and configured OCR like this:
Code: Select all
system.ocr = /usr/bin/tesseract ${fileIn} ${fileOut}	
system.ocr.rotate = 90;180;270;
system.pdf.force.ocr = True 
When I test a document with an SQL query like this:
Code: Select all
select * from OKM_NODE_DOCUMENT WHERE NBS_UUID='id from doc which should be ocr';
I get 0 rows returned.

----
My second problem is that full-text search is not working for documents that already contained OCR text before they were uploaded. When I search for a specific word from a PDF file, the search does not find the document.

Thanks for your assistance and best regards
Toorms
 #54056  by jllort
 
The NBS_UUID column is the unique document identifier, so filtering on it does not tell you anything about the extracted text. I think you should filter on a column named NBS_CONTENT instead, with a LIKE '%content searched%' condition.
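
A minimal sketch of such a query, assuming the text column really is named NBS_CONTENT as suggested above. The name is not verified here against the 6.3.12 schema, so list the table's columns first and adjust if it differs:
Code: Select all
```sql
-- Assumption: NBS_CONTENT holds the extracted text (column name unverified).
-- Replace 'content searched' with a word you expect to be in the document.
SELECT NBS_UUID
FROM OKM_NODE_DOCUMENT
WHERE NBS_CONTENT LIKE '%content searched%';
```
If this returns no rows for a word you know is in the document, the text may never have been extracted for it.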

As for why OCR is not working, I suggest checking from Administration > Tools > Check Text Extraction.

Finally, ensure you have these values in the configuration parameter named "registered.text.extractors":
Code: Select all
com.openkm.extractor.PlainTextExtractor
com.openkm.extractor.MsWordTextExtractor
com.openkm.extractor.MsExcelTextExtractor
com.openkm.extractor.MsPowerPointTextExtractor
com.openkm.extractor.OpenOfficeTextExtractor
com.openkm.extractor.RTFTextExtractor
com.openkm.extractor.HTMLTextExtractor
com.openkm.extractor.XMLTextExtractor
com.openkm.extractor.MsOutlookTextExtractor
com.openkm.extractor.PdfTextExtractor
com.openkm.extractor.AudioTextExtractor
com.openkm.extractor.ExifTextExtractor
com.openkm.extractor.Tesseract3TextExtractor
com.openkm.extractor.SourceCodeTextExtractor
com.openkm.extractor.MsOffice2007TextExtractor
 #54094  by jllort
 
Keep only the values I shared previously. After updating, restart the OpenKM service and check again. You can test extraction from Administration > Tools > Check Text Extraction.
