• OCR on existing documents and full text search not working

  • OpenKM has many interesting features, but requires some configuration process to show its full potential.
 #54030  by Toormser
 
Hello Community,

I have been having trouble with my OpenKM installation since a few days after installing it.

----
My first problem is that on the latest version of OpenKM Community Edition (6.3.12, build a3587ce), OCR is not working for either newly uploaded or existing files.

I have already installed and configured OCR like this:
Code:
system.ocr = /usr/bin/tesseract ${fileIn} ${fileOut}
system.ocr.rotate = 90;180;270;
system.pdf.force.ocr = True
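
Before looking at OpenKM itself, it can help to confirm that the configured command line is usable on its own. The following is a minimal sketch, not OpenKM's actual implementation: it assumes the `${fileIn}` and `${fileOut}` placeholders are simply replaced by file paths, and the helper name `build_ocr_command` and the sample paths are hypothetical.

```python
import shlex
import shutil
from string import Template


def build_ocr_command(template: str, file_in: str, file_out: str) -> list[str]:
    """Expand ${fileIn} and ${fileOut} in a system.ocr style template.

    The template is split into arguments first, so paths that contain
    spaces stay in a single argument after substitution.
    """
    tokens = shlex.split(template)
    return [Template(t).substitute(fileIn=file_in, fileOut=file_out) for t in tokens]


if __name__ == "__main__":
    # Value taken from the configuration above; the paths are made up.
    cmd = build_ocr_command(
        "/usr/bin/tesseract ${fileIn} ${fileOut}", "/tmp/scan.tif", "/tmp/scan"
    )
    print("command:", cmd)
    # shutil.which returns None when the binary is missing or not executable.
    print("binary found:", shutil.which(cmd[0]) is not None)
```

If the binary is found, running the printed command by hand as the user that runs Tomcat shows whether Tesseract itself works. Note that the Tesseract command line treats its second argument as an output base name and appends `.txt` to it.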
When I check a document with an SQL query like this:
Code:
select * from OKM_NODE_DOCUMENT WHERE NBS_UUID='id from doc which should be ocr';
I get 0 rows returned.

----
My second problem is that full-text search does not work for documents that already contained OCR text before they were uploaded. When I search for a specific word from such a PDF file, the search does not find the document.

Thanks for your assistance and best regards
Toorms
 #54056  by jllort
 
The column NBS_UUID is the unique document identifier. I think you should query a column named NBS_CONTENT instead, with a LIKE '%content searched%'.
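
The suggested content lookup can be sketched as follows. This is only an illustration run against an in-memory SQLite stand-in, not against a real OpenKM database: the column name NBS_CONTENT is taken from the reply above and should be checked against the actual schema, and the sample rows and helper names are hypothetical.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create a tiny stand-in for OKM_NODE_DOCUMENT (assumed columns)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE OKM_NODE_DOCUMENT (NBS_UUID TEXT PRIMARY KEY, NBS_CONTENT TEXT)"
    )
    conn.executemany(
        "INSERT INTO OKM_NODE_DOCUMENT VALUES (?, ?)",
        [
            ("doc-1", "Invoice 2021 for ACME"),  # text was extracted
            ("doc-2", None),                     # nothing extracted yet
        ],
    )
    return conn


def find_by_content(conn: sqlite3.Connection, needle: str) -> list[str]:
    """Return the UUIDs of documents whose stored text contains `needle`."""
    rows = conn.execute(
        "SELECT NBS_UUID FROM OKM_NODE_DOCUMENT WHERE NBS_CONTENT LIKE ?",
        (f"%{needle}%",),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = make_demo_db()
    print(find_by_content(conn, "Invoice"))
```

A document whose text column is still empty never matches such a query, which is one way to tell "not extracted" apart from "extracted but not indexed".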

As for why OCR is not working, I suggest testing it from Administration > Tools > Check Text Extraction.

Finally, ensure you have these values in the configuration parameter named "registered.text.extractors":
Code:
com.openkm.extractor.PlainTextExtractor
com.openkm.extractor.MsWordTextExtractor
com.openkm.extractor.MsExcelTextExtractor
com.openkm.extractor.MsPowerPointTextExtractor
com.openkm.extractor.OpenOfficeTextExtractor
com.openkm.extractor.RTFTextExtractor
com.openkm.extractor.HTMLTextExtractor
com.openkm.extractor.XMLTextExtractor
com.openkm.extractor.MsOutlookTextExtractor
com.openkm.extractor.PdfTextExtractor
com.openkm.extractor.AudioTextExtractor
com.openkm.extractor.ExifTextExtractor
com.openkm.extractor.Tesseract3TextExtractor
com.openkm.extractor.SourceCodeTextExtractor
com.openkm.extractor.MsOffice2007TextExtractor
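
Comparing a long class list by eye is error-prone, so a small script can do it. This is a minimal sketch that assumes the parameter value is stored with one class name per line; the function name `diff_extractors` is hypothetical and nothing here talks to OpenKM itself.

```python
# The list recommended above for "registered.text.extractors".
RECOMMENDED = [
    "com.openkm.extractor.PlainTextExtractor",
    "com.openkm.extractor.MsWordTextExtractor",
    "com.openkm.extractor.MsExcelTextExtractor",
    "com.openkm.extractor.MsPowerPointTextExtractor",
    "com.openkm.extractor.OpenOfficeTextExtractor",
    "com.openkm.extractor.RTFTextExtractor",
    "com.openkm.extractor.HTMLTextExtractor",
    "com.openkm.extractor.XMLTextExtractor",
    "com.openkm.extractor.MsOutlookTextExtractor",
    "com.openkm.extractor.PdfTextExtractor",
    "com.openkm.extractor.AudioTextExtractor",
    "com.openkm.extractor.ExifTextExtractor",
    "com.openkm.extractor.Tesseract3TextExtractor",
    "com.openkm.extractor.SourceCodeTextExtractor",
    "com.openkm.extractor.MsOffice2007TextExtractor",
]


def diff_extractors(configured: str) -> tuple[list[str], list[str]]:
    """Compare a pasted parameter value with the recommended list.

    Returns (missing, extra): recommended classes that are absent, and
    configured classes that are not in the recommended list.
    """
    current = [line.strip() for line in configured.splitlines() if line.strip()]
    missing = [name for name in RECOMMENDED if name not in current]
    extra = [name for name in current if name not in RECOMMENDED]
    return missing, extra


if __name__ == "__main__":
    # Paste the current parameter value between the triple quotes.
    pasted = """
    com.openkm.extractor.PlainTextExtractor
    com.openkm.extractor.PdfTextExtractor
    """
    missing, extra = diff_extractors(pasted)
    print("missing:", missing)
    print("extra:", extra)
```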
 #54094  by jllort
 
Keep only the values I shared previously. After updating the parameter, restart the OpenKM service and check again. You can test extraction from Administration > Tools > Check Text Extraction.
 #54273  by ndorf
 
I have this same issue. I am running CE, and I do have Tesseract4 (not 3). The registered text extractor value for Tesseract only shows Tesseract3, and I see no way to edit that.

Full-text search works for documents that are not images, such as word processing files or even spreadsheets, but TIF files are never OCRed. If I understand the documentation correctly, Tesseract will not OCR image-only PDF files (such as documents scanned to PDF), so they would have to be converted to TIF first?

Thank you
 #54279  by ndorf
 
Thank you for the suggestion and link. I tried that (disabling Cuneiform and ABBYY) with no success. I even shut down and restarted Tomcat / OKM with no luck.

I was under the impression that after OCR with Tesseract within OKM, a text layer would be saved, associated with its source TIFF file, and used for full-text search. Maybe it's a feature not available in the CE version?
 #54295  by jllort
 
I suggest disabling these options:
Code:
system.ocr.rotate =
system.pdf.force.ocr = False
Under Administration > Tools there is an option to test text extraction. Run it while watching openkm.log and you will see what happens.

Finally, if the documents already passed through the indexing queue without being indexed, they must be put back in the queue (first make sure OCR is working).

Simply execute the following SQL query to do so:
Code:
UPDATE OKM_NODE_DOCUMENT SET NDC_TEXT_EXTRACTED='F';
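
The effect of that statement can be tried out safely on a stand-in table first. The sketch below uses an in-memory SQLite database rather than a real OpenKM database. The 'F' value comes from the query above, while 'T' as the marker for already processed documents, the sample rows, and the helper names are assumptions made for illustration.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Stand-in table with only the columns this example needs."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE OKM_NODE_DOCUMENT "
        "(NBS_UUID TEXT PRIMARY KEY, NDC_TEXT_EXTRACTED TEXT)"
    )
    conn.executemany(
        "INSERT INTO OKM_NODE_DOCUMENT VALUES (?, ?)",
        [("doc-1", "T"), ("doc-2", "T"), ("doc-3", "F")],  # 'T' is assumed
    )
    return conn


def pending_count(conn: sqlite3.Connection) -> int:
    """Number of documents still flagged for text extraction."""
    return conn.execute(
        "SELECT COUNT(*) FROM OKM_NODE_DOCUMENT WHERE NDC_TEXT_EXTRACTED='F'"
    ).fetchone()[0]


def requeue_all(conn: sqlite3.Connection) -> int:
    """Run the statement from the post and return the rows it touched."""
    cur = conn.execute("UPDATE OKM_NODE_DOCUMENT SET NDC_TEXT_EXTRACTED='F'")
    conn.commit()
    return cur.rowcount


if __name__ == "__main__":
    conn = make_demo_db()
    print("pending before:", pending_count(conn))
    print("rows updated:", requeue_all(conn))
    print("pending after:", pending_count(conn))
```

Because the statement has no WHERE clause, it flags every document, so on a large repository the extraction queue may take a long time to work through.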
