Open Source Document Management System | OpenKM - OCR on existing documents and full text search not working

#54030 by Toormser
Mon Jan 30, 2023 7:50 am

Hello Community,

I have been having trouble with my OpenKM installation since a few days after installing it.

----
My first problem is that OCR does not work for uploaded or existing files on the latest OpenKM Community Edition (6.3.12, build: a3587ce).

I have already installed and configured OCR like this:

Code:

system.ocr = /usr/bin/tesseract ${fileIn} ${fileOut}
system.ocr.rotate = 90;180;270;
system.pdf.force.ocr = True
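
One way to rule out the Tesseract side is to expand the configured command by hand and run it outside OpenKM. The following is a minimal sketch, assuming OpenKM simply substitutes file paths for `${fileIn}` and `${fileOut}`; the helper name `build_ocr_command` and the example paths are hypothetical and not part of OpenKM.

```python
import shlex
from string import Template

# The command template exactly as configured in system.ocr above.
OCR_TEMPLATE = "/usr/bin/tesseract ${fileIn} ${fileOut}"


def build_ocr_command(template, file_in, file_out):
    """Expand ${fileIn}/${fileOut} and split the result into an argv list.

    Assumption: the placeholders are replaced by plain file paths.
    Paths are shell-quoted first so that spaces survive the split.
    """
    expanded = Template(template).substitute(
        fileIn=shlex.quote(file_in),
        fileOut=shlex.quote(file_out),
    )
    return shlex.split(expanded)


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    cmd = build_ocr_command(OCR_TEMPLATE, "/tmp/scan.tif", "/tmp/scan-out")
    print(" ".join(cmd))
    # To actually run it as the OpenKM service user:
    # import subprocess; subprocess.run(cmd, check=True)
```

The Tesseract command line takes an input image and an output base name, and writes the recognized text to the base name plus `.txt`. If the manual run produces no text file, the problem is likely in the Tesseract installation rather than in OpenKM.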

When I test a document with an SQL query like this:

Code:

select * from OKM_NODE_DOCUMENT WHERE NBS_UUID='id from doc which should be ocr';

I get 0 rows returned.

----
My second problem is that full-text search does not work for documents that already had OCR applied before upload. When I search for a specific word contained in a PDF file, the search does not find the document.

Thanks for your assistance and best regards
Toorms

Re: OCR on existing documents and full text search not working

#54056 by jllort
Sat Feb 18, 2023 9:18 am

The column NBS_UUID is the unique document identifier. To search by content, I think you should filter on a column named NBS_CONTENT with LIKE '%content searched%'.

As for why OCR is not working, I suggest using the text extractor check under Administration > Tools.

Finally, ensure you have these values in the configuration parameter named "registered.text.extractors":

Code:

com.openkm.extractor.PlainTextExtractor
com.openkm.extractor.MsWordTextExtractor
com.openkm.extractor.MsExcelTextExtractor
com.openkm.extractor.MsPowerPointTextExtractor
com.openkm.extractor.OpenOfficeTextExtractor
com.openkm.extractor.RTFTextExtractor
com.openkm.extractor.HTMLTextExtractor
com.openkm.extractor.XMLTextExtractor
com.openkm.extractor.MsOutlookTextExtractor
com.openkm.extractor.PdfTextExtractor
com.openkm.extractor.AudioTextExtractor
com.openkm.extractor.ExifTextExtractor
com.openkm.extractor.Tesseract3TextExtractor
com.openkm.extractor.SourceCodeTextExtractor
com.openkm.extractor.MsOffice2007TextExtractor
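
To compare an existing `registered.text.extractors` value against the list above, a small helper along these lines may be useful. This is only a sketch: it assumes the parameter holds one class name per line, as displayed here, and the function name `diff_extractors` and the sample class `com.example.SomeOtherExtractor` are hypothetical.

```python
# The extractor list recommended in the post above, in the same order.
RECOMMENDED = [
    "com.openkm.extractor.PlainTextExtractor",
    "com.openkm.extractor.MsWordTextExtractor",
    "com.openkm.extractor.MsExcelTextExtractor",
    "com.openkm.extractor.MsPowerPointTextExtractor",
    "com.openkm.extractor.OpenOfficeTextExtractor",
    "com.openkm.extractor.RTFTextExtractor",
    "com.openkm.extractor.HTMLTextExtractor",
    "com.openkm.extractor.XMLTextExtractor",
    "com.openkm.extractor.MsOutlookTextExtractor",
    "com.openkm.extractor.PdfTextExtractor",
    "com.openkm.extractor.AudioTextExtractor",
    "com.openkm.extractor.ExifTextExtractor",
    "com.openkm.extractor.Tesseract3TextExtractor",
    "com.openkm.extractor.SourceCodeTextExtractor",
    "com.openkm.extractor.MsOffice2007TextExtractor",
]


def diff_extractors(configured_text):
    """Return (missing, extra) relative to the recommended list.

    Assumption: configured_text contains one class name per line.
    """
    configured = [ln.strip() for ln in configured_text.splitlines() if ln.strip()]
    missing = [name for name in RECOMMENDED if name not in configured]
    extra = [name for name in configured if name not in RECOMMENDED]
    return missing, extra


if __name__ == "__main__":
    # Hypothetical current value, pasted from the configuration screen.
    current = "com.openkm.extractor.PdfTextExtractor\ncom.example.SomeOtherExtractor\n"
    missing, extra = diff_extractors(current)
    print("missing:", missing)
    print("extra:", extra)
```

Entries reported as extra are the ones to remove, and entries reported as missing are the ones to add.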

Re: OCR on existing documents and full text search not working

#54072 by Toormser
Wed Feb 22, 2023 11:57 am

Howdy,

these are my enabled extractors

Re: OCR on existing documents and full text search not working

#54094 by jllort
Mon Mar 06, 2023 8:19 am

Keep only the extractors I shared previously. After updating the list, restart the OpenKM service and check again. You can test extraction from Administration > Tools > Check Text Extraction.

Re: OCR on existing documents and full text search not working

#54273 by ndorf
Fri Jun 16, 2023 3:28 pm

I have the same issue. I am running CE, and I have Tesseract 4 (not 3). The registered text extractor value for Tesseract only shows Tesseract3, and I see no way to edit that.

Full-text search works for documents such as word-processing files or spreadsheets that are not images, but TIF files are not OCRed. If I understand the documentation correctly, Tesseract will not OCR image-only PDF files (such as documents scanned to PDF), so they would have to be converted to TIF first?

Thank you

Re: OCR on existing documents and full text search not working

#54278 by patson
Sat Jun 17, 2023 2:53 pm

I had a similar issue. Try following this post and disabling the unnecessary plugins: viewtopic.php?t=24710#p53869
This solved the issue for me, and Tesseract is working as expected.

Re: OCR on existing documents and full text search not working

#54279 by ndorf
Sat Jun 17, 2023 9:57 pm

Thank you for the suggestion and link. I tried that (disabling Cuneiform and Abby) with no success. I even shut down and restarted Tomcat / OKM, with no luck.

I was under the impression that after OCR with Tesseract within OKM, a text layer would be saved, associated with its source TIFF file, and used for full-text search. Maybe it's a feature not available in the CE version?

Re: OCR on existing documents and full text search not working

#54295 by jllort
Fri Jun 23, 2023 4:22 pm

I suggest disabling these options:

Code:

system.ocr.rotate = 
system.pdf.force.ocr = False

Under Administration > Tools there is an option to test text extraction. Running it while watching openkm.log should show what is happening.

Finally, if the documents already passed through the indexing queue without being indexed, they must be queued again (first make sure OCR is working).

To do that, simply execute the following SQL statement:

Code:

UPDATE OKM_NODE_DOCUMENT SET NDC_TEXT_EXTRACTED='F';
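
Note that the statement above has no WHERE clause, so it flags every document for re-extraction. The sketch below illustrates the effect on a toy in-memory SQLite table, and shows a narrower variant limited to one document. It is an illustration only: the real OpenKM database is not SQLite, the toy table contains just the two columns named in this thread, and the `requeue` helper, the per-UUID variant, and the `doc-N` identifiers are assumptions of mine rather than anything from OpenKM.

```python
import sqlite3


def requeue(conn, uuid=None):
    """Set NDC_TEXT_EXTRACTED back to 'F' so documents are queued again.

    With uuid=None this mirrors the statement from the post (all rows).
    With a uuid it only touches that one document (my variation).
    Returns the number of rows the UPDATE touched.
    """
    if uuid is None:
        cur = conn.execute("UPDATE OKM_NODE_DOCUMENT SET NDC_TEXT_EXTRACTED='F'")
    else:
        cur = conn.execute(
            "UPDATE OKM_NODE_DOCUMENT SET NDC_TEXT_EXTRACTED='F' WHERE NBS_UUID=?",
            (uuid,),
        )
    conn.commit()
    return cur.rowcount


def demo():
    # Toy stand-in for the real table: only the columns named in the thread.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE OKM_NODE_DOCUMENT ("
        "NBS_UUID TEXT PRIMARY KEY, NDC_TEXT_EXTRACTED TEXT)"
    )
    conn.executemany(
        "INSERT INTO OKM_NODE_DOCUMENT VALUES (?, ?)",
        [("doc-1", "T"), ("doc-2", "T"), ("doc-3", "F")],
    )
    changed = requeue(conn, "doc-1")  # requeue a single document
    pending = conn.execute(
        "SELECT COUNT(*) FROM OKM_NODE_DOCUMENT WHERE NDC_TEXT_EXTRACTED='F'"
    ).fetchone()[0]
    conn.close()
    return changed, pending


if __name__ == "__main__":
    print(demo())
```

Requeueing a single known document first can be a cheaper way to confirm that extraction now works before flagging the whole repository.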
