Open Source Document Management System | OpenKM - Text Extraction with Tesseract and OCR

Text Extraction with Tesseract and OCR

#39413 by Catscratch
Thu May 07, 2015 10:02 am

Hi,

I have PDFs that contain scanned images, so I configured OpenKM to use Tesseract 3 for OCR. I also enabled system.pdf.force.ocr, because without this option no text is extracted from scanned PDF files.

I'm using the German dictionary: dict-de_de-frami_2013-12-06.oxt

After running the text extractor cron, text was extracted.

Code:

DEBUG com.openkm.extractor.Tesseract3TextExtractor- TEXT: <my text goes here>
DEBUG com.openkm.extractor.PdfTextExtractor- OCR Extracted: <my text goes here>

But when I search for any text from <my text goes here>, nothing is found in the full-text search window. Why is that? I also rebuilt the indexes: Admin -> Utils -> Rebuild indexes -> Text extractor, and Lucene too.

Can you give me a hint as to why this does not work?

Thanks!

Re: Text Extraction with Tesseract and OCR

#39432 by jllort
Sat May 09, 2015 10:05 am

Try it in our online demo -> demo.openkm.com (wait 10-15 minutes, because the document first goes into a batch queue and needs some time to be processed).

In your OpenKM, take the document UUID (available in the Properties tab; select it and press CTRL+C to copy), then go to Administration -> Database query and execute:

Code:

select * from OKM_NODE_DOCUMENT WHERE NBS_UUID='YOUR UUID HERE'

The result should be the document row -> does it contain the extracted text?

Re: Text Extraction with Tesseract and OCR

#39527 by Catscratch
Mon May 11, 2015 1:18 pm

Hm, you're right. Maybe I just had to be more patient. After some time, the index contained the extracted text.

Thanks for the help!

Re: Text Extraction with Tesseract and OCR

#39537 by jllort
Wed May 13, 2015 9:58 am

Under Administration -> Stats -> Text extractor queue you can see the documents that are still waiting to be processed.

Also, with the UUID and the query I gave you earlier, you can check whether the document has been processed: the field NDC_TEXT_EXTRACTED='F' indicates it has not been processed yet, and 'T' indicates it has.
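
For anyone who wants to script this check instead of running the query by hand, here is a rough sketch in Python. It relies only on the table and the two columns named in this thread (OKM_NODE_DOCUMENT, NBS_UUID, NDC_TEXT_EXTRACTED). The helper name extraction_status and the sample UUIDs are made up for illustration, and an in-memory SQLite table stands in for the real OpenKM database, so the connection and the `?` placeholder style are assumptions that would need adapting to your actual database driver.

```python
import sqlite3


def extraction_status(conn, uuid):
    """Return 'processed', 'pending', or 'not found' for a document UUID.

    Hypothetical helper: it only reads the NDC_TEXT_EXTRACTED flag
    described above ('T' = processed, 'F' = not processed yet).
    """
    row = conn.execute(
        "SELECT NDC_TEXT_EXTRACTED FROM OKM_NODE_DOCUMENT WHERE NBS_UUID = ?",
        (uuid,),
    ).fetchone()
    if row is None:
        return "not found"
    return "processed" if row[0] == "T" else "pending"


# Stand-in table (assumption): only the two columns mentioned in this thread.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE OKM_NODE_DOCUMENT ("
    "NBS_UUID TEXT PRIMARY KEY, NDC_TEXT_EXTRACTED TEXT)"
)
conn.executemany(
    "INSERT INTO OKM_NODE_DOCUMENT VALUES (?, ?)",
    [("uuid-a", "T"), ("uuid-b", "F")],  # made-up sample rows
)

for doc_uuid in ("uuid-a", "uuid-b", "uuid-c"):
    print(doc_uuid, extraction_status(conn, doc_uuid))
```

If the status stays at pending for a long time, the text extractor queue page mentioned above is the place to look next.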
