Open Source Document Management System | OpenKM

Posted: **Thu May 07, 2015 10:02 am**

Hi,

I have PDFs containing scanned images, so I configured OpenKM to use Tesseract 3 for OCR. I also enabled system.pdf.force.ocr, because without this option no text is extracted from scanned PDF files.

I am using the German dictionary: dict-de_de-frami_2013-12-06.oxt

After running the text extractor cron, text was extracted.

Code:

DEBUG com.openkm.extractor.Tesseract3TextExtractor- TEXT: <my text goes here>
DEBUG com.openkm.extractor.PdfTextExtractor- OCR Extracted: <my text goes here>

But when I search for any text from <my text goes here>, nothing is found in the full-text search window. Why is that? I also rebuilt the indexes: Admin -> Utils -> Rebuild indexes -> Text extractor, and Lucene too.

Can you give me a hint why this does not work?

Thanks!

Posted: **Sat May 09, 2015 10:05 am**

Try it in our online demo -> demo.openkm.com (wait 10-15 minutes, because the document will first sit in the batch queue and needs some time to be processed).

In your OpenKM, take the document UUID (it is available on the properties tab; select it and press CTRL+C to copy), then go to administration -> database query and execute:

Code:

select * from OKM_NODE_DOCUMENT WHERE NBS_UUID='YOUR UUID HERE'

The result should be the document -> does it contain the extracted text?

Posted: **Mon May 11, 2015 1:18 pm**

Hm, you're right. Maybe I should have been more patient. After some time, the index contained the extracted text.

So thanks for the help!

Posted: **Wed May 13, 2015 9:58 am**

At administration -> stats -> text extractor queue you can see the documents still waiting to be processed.

Also, with the UUID and the previous query I gave you, you can check whether the document has been processed: the field NDC_TEXT_EXTRACTED='F' indicates it has not been processed yet, and 'T' indicates it has.
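
As a rough sketch, the same check can be narrowed to just the flag so the result is easier to read. This assumes the table and column names quoted in the posts above and has not been verified against a specific OpenKM version; the UUID value is a placeholder.

```sql
-- Sketch only: table and column names are taken from the posts above.
-- Replace the placeholder with the UUID copied from the properties tab.
-- 'F' = not yet processed by the text extractor, 'T' = processed.
SELECT NBS_UUID, NDC_TEXT_EXTRACTED
FROM OKM_NODE_DOCUMENT
WHERE NBS_UUID = 'YOUR UUID HERE'
```

If the flag is still 'F', the document is probably still in the text extractor queue, so searching for its content will not find anything yet.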

Text Extraction with Tesseract and OCR
