
Text Extraction with Tesseract and OCR

Posted: Thu May 07, 2015 10:02 am
by Catscratch
Hi,

I have PDFs containing scanned images, so I configured OpenKM to use Tesseract 3 for OCR. I also enabled system.pdf.force.ocr, because without this option no text is extracted from scanned PDF files.

I am using the German dictionary: dict-de_de-frami_2013-12-06.oxt

After running the text extractor cron, text was extracted.
Code:
DEBUG com.openkm.extractor.Tesseract3TextExtractor- TEXT: <my text goes here>
DEBUG com.openkm.extractor.PdfTextExtractor- OCR Extracted: <my text goes here>
But when I search for any text from <my text goes here>, nothing is found in the full-text search window. Why is that? I also rebuilt the indexes: Admin -> Utils -> Rebuild indexes -> Text extractor, and Lucene too.

Can you give me a hint why this does not work?

Thanks!

Re: Text Extraction with Tesseract and OCR

Posted: Sat May 09, 2015 10:05 am
by jllort
Try it in our online demo -> demo.openkm.com (wait 10-15 minutes, because the document first goes into a batch queue and needs some time to be processed).

In your OpenKM, take the document UUID (it is available on the properties tab; select it and press CTRL+C to copy), then go to administration -> database query and execute:
Code:
select * from OKM_NODE_DOCUMENT WHERE NBS_UUID='YOUR UUID HERE'
The result should be the document -> does it contain the extracted text?

Re: Text Extraction with Tesseract and OCR

Posted: Mon May 11, 2015 1:18 pm
by Catscratch
Hm, you're right. Maybe I should have been more patient. After some time, the index contained the extracted text.

So thanks for the help!

Re: Text Extraction with Tesseract and OCR

Posted: Wed May 13, 2015 9:58 am
by jllort
At administration -> stats -> text extractor queue you can see the documents that are still waiting to be processed.

Also, with the UUID and the query I gave you earlier, you can check whether the document has been processed: the field NDC_TEXT_EXTRACTED='F' indicates it has not been processed yet, and 'T' that it has.
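
The flag check can be sketched as a small script. This is only an illustration, not how OpenKM does it internally. It uses an in-memory SQLite table as a stand-in for OKM_NODE_DOCUMENT (a real installation uses whatever database it was configured with). The sample UUIDs are made up, and the helper names extraction_status, pending_count and make_demo_db are hypothetical. Only the table name, the columns NBS_UUID and NDC_TEXT_EXTRACTED, and the 'T'/'F' values come from this thread.

```python
import sqlite3


def extraction_status(conn, uuid):
    """Return 'processed', 'pending' or 'not found' for a document UUID."""
    row = conn.execute(
        "SELECT NDC_TEXT_EXTRACTED FROM OKM_NODE_DOCUMENT WHERE NBS_UUID = ?",
        (uuid,),
    ).fetchone()
    if row is None:
        return "not found"
    # 'T' means the text extractor has handled the document, 'F' means not yet.
    return "processed" if row[0] == "T" else "pending"


def pending_count(conn):
    """Count the documents whose text has not been extracted yet."""
    row = conn.execute(
        "SELECT COUNT(*) FROM OKM_NODE_DOCUMENT WHERE NDC_TEXT_EXTRACTED = 'F'"
    ).fetchone()
    return row[0]


def make_demo_db():
    """Build a stand-in table holding only the two columns used here (assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE OKM_NODE_DOCUMENT ("
        "NBS_UUID TEXT PRIMARY KEY, NDC_TEXT_EXTRACTED TEXT)"
    )
    conn.executemany(
        "INSERT INTO OKM_NODE_DOCUMENT VALUES (?, ?)",
        [("uuid-done", "T"), ("uuid-waiting", "F")],  # made-up sample rows
    )
    return conn


if __name__ == "__main__":
    conn = make_demo_db()
    for uuid in ("uuid-done", "uuid-waiting", "uuid-missing"):
        print(uuid, "->", extraction_status(conn, uuid))
    print("pending documents:", pending_count(conn))
```

If the status stays at "pending", searching for the text will find nothing yet, which matches what happened earlier in this thread.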