
OCR on existing documents and full text search not working

Posted: Mon Jan 30, 2023 7:50 am
by Toormser
Hello Community,

I've been having trouble with my OpenKM installation since a few days after installing it.

----
My first problem is that on the latest version of OpenKM Community Edition (6.3.12 (build: a3587ce)), OCR is not working for either newly uploaded or existing files.

I've already installed and configured OCR like this:
Code:
system.ocr = /usr/bin/tesseract ${fileIn} ${fileOut}
system.ocr.rotate = 90;180;270;
system.pdf.force.ocr = True
When I test a document with an SQL query like this:
Code:
select * from OKM_NODE_DOCUMENT WHERE NBS_UUID='id from doc which should be ocr';
I get 0 rows returned.

----
My second problem is that full-text search is not working for documents that already contained OCR text before being uploaded. When I search for a specific word from a PDF file, the document is not found.

Thanks for your assistance and best regards
Toorms

Re: OCR on existing documents and full text search not working

Posted: Sat Feb 18, 2023 9:18 am
by jllort
The column NBS_UUID is the unique document identifier. I think you should instead filter on a column named NBS_CONTENT, using LIKE '%content searched%'.

As for why OCR is not working, I suggest running the text extractors check under Administration > Tools.

Finally, ensure that the configuration parameter named "registered.text.extractors" contains these values:
Code:
com.openkm.extractor.PlainTextExtractor
com.openkm.extractor.MsWordTextExtractor
com.openkm.extractor.MsExcelTextExtractor
com.openkm.extractor.MsPowerPointTextExtractor
com.openkm.extractor.OpenOfficeTextExtractor
com.openkm.extractor.RTFTextExtractor
com.openkm.extractor.HTMLTextExtractor
com.openkm.extractor.XMLTextExtractor
com.openkm.extractor.MsOutlookTextExtractor
com.openkm.extractor.PdfTextExtractor
com.openkm.extractor.AudioTextExtractor
com.openkm.extractor.ExifTextExtractor
com.openkm.extractor.Tesseract3TextExtractor
com.openkm.extractor.SourceCodeTextExtractor
com.openkm.extractor.MsOffice2007TextExtractor

Re: OCR on existing documents and full text search not working

Posted: Wed Feb 22, 2023 11:57 am
by Toormser
Howdy,

These are my enabled extractors:
Image

Re: OCR on existing documents and full text search not working

Posted: Mon Mar 06, 2023 8:19 am
by jllort
Keep only the extractors I shared previously. Once the list is updated, restart the OpenKM service and check again. You can check extraction from Administration > Tools > Check Text Extraction.
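Comparing the configured extractors against the recommended list can be done mechanically. A minimal sketch in Python, assuming the current value of registered.text.extractors has been pasted in as text with one class name per line. The function name extractor_diff and the class com.example.SomeOtherExtractor are made up for illustration and are not part of OpenKM:

```python
# Recommended extractor list, copied from the earlier reply in this thread.
RECOMMENDED = """
com.openkm.extractor.PlainTextExtractor
com.openkm.extractor.MsWordTextExtractor
com.openkm.extractor.MsExcelTextExtractor
com.openkm.extractor.MsPowerPointTextExtractor
com.openkm.extractor.OpenOfficeTextExtractor
com.openkm.extractor.RTFTextExtractor
com.openkm.extractor.HTMLTextExtractor
com.openkm.extractor.XMLTextExtractor
com.openkm.extractor.MsOutlookTextExtractor
com.openkm.extractor.PdfTextExtractor
com.openkm.extractor.AudioTextExtractor
com.openkm.extractor.ExifTextExtractor
com.openkm.extractor.Tesseract3TextExtractor
com.openkm.extractor.SourceCodeTextExtractor
com.openkm.extractor.MsOffice2007TextExtractor
""".split()


def extractor_diff(configured_text, recommended=RECOMMENDED):
    """Return (extra, missing) class names relative to the recommended list."""
    # One class name per line; blank lines and surrounding spaces are ignored.
    configured = [line.strip() for line in configured_text.splitlines() if line.strip()]
    extra = [name for name in configured if name not in recommended]
    missing = [name for name in recommended if name not in configured]
    return extra, missing


if __name__ == "__main__":
    # Hypothetical configured value: one recommended entry plus a made-up extra one.
    current = """
    com.openkm.extractor.PdfTextExtractor
    com.example.SomeOtherExtractor
    """
    extra, missing = extractor_diff(current)
    print("Entries to remove:", extra)
    print("Entries to add:", missing)
```

Anything reported as an entry to remove is a candidate for deletion from the parameter before restarting the service.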

Re: OCR on existing documents and full text search not working

Posted: Fri Jun 16, 2023 3:28 pm
by ndorf
I have this same issue. I am running CE, and I do have Tesseract4 (not 3). The registered text extractor value for Tesseract only shows Tesseract3, and I see no way to edit that.

Full-text search works for non-image documents such as word processing files or even spreadsheets, but nothing is done to OCR TIF files. If I understand the documentation correctly, Tesseract will not OCR image PDF files (such as documents scanned to PDF), so they would have to be converted to TIF first?

Thank you

Re: OCR on existing documents and full text search not working

Posted: Sat Jun 17, 2023 2:53 pm
by patson
I had a similar issue. Try following this post and disabling unnecessary plugins: viewtopic.php?t=24710#p53869
This solved the issue for me, and Tesseract is working as expected.

Re: OCR on existing documents and full text search not working

Posted: Sat Jun 17, 2023 9:57 pm
by ndorf
Thank you for the suggestion and link. I tried that (disabling Cuneiform and Abby) with no success. I even shut down and restarted Tomcat / OKM with no luck.

I was under the impression that after OCR with Tesseract, within OKM, a text layer would be saved and associated with its source TIFF file and used for full-text search. Maybe it's a feature not available in the CE version?

Re: OCR on existing documents and full text search not working

Posted: Fri Jun 23, 2023 4:22 pm
by jllort
I suggest disabling these options:
Code:
system.ocr.rotate =
system.pdf.force.ocr = False
Under Administration > Tools there is an option to test the text extraction. By running it and watching openkm.log, you will discover what is happening.

Finally, if the documents already passed through the indexing queue without being indexed, they must be put in the queue again (first make sure the OCR is working).

Simply execute the following SQL query to do so:
Code:
UPDATE OKM_NODE_DOCUMENT SET NDC_TEXT_EXTRACTED='F';
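
Before re-queuing everything, it may help to confirm that the configured OCR command works outside OpenKM. A minimal sketch in Python, assuming the system.ocr value from the first post and assuming the ${fileIn} and ${fileOut} placeholders are simply replaced by file paths (how OpenKM performs the substitution internally is not confirmed here). The helper names build_ocr_command and run_ocr and the sample paths are hypothetical:

```python
import shlex
import shutil
import subprocess
from string import Template

# Value of system.ocr as posted at the start of this thread.
SYSTEM_OCR = "/usr/bin/tesseract ${fileIn} ${fileOut}"


def build_ocr_command(template, file_in, file_out):
    """Expand the placeholders token by token so paths with spaces stay intact."""
    tokens = shlex.split(template)
    return [Template(tok).substitute(fileIn=file_in, fileOut=file_out) for tok in tokens]


def run_ocr(template, file_in, file_out):
    """Run the expanded command and return the completed process."""
    cmd = build_ocr_command(template, file_in, file_out)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"OCR binary not found or not executable: {cmd[0]}")
    # Tesseract treats the second argument as an output base name
    # and writes the recognized text to <file_out>.txt.
    return subprocess.run(cmd, capture_output=True, text=True)


if __name__ == "__main__":
    # Hypothetical sample paths; only the expanded command is printed here.
    print(build_ocr_command(SYSTEM_OCR, "/tmp/scan.tif", "/tmp/scan"))
```

If calling run_ocr on a real TIFF gives a non-zero return code or an empty output file, the problem is likely in the Tesseract installation rather than in the OpenKM indexing queue.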