Text Extraction with Tesseract and OCR

 #39413  by Catscratch
 
Hi,

I have PDFs with scanned images, so I configured OpenKM to use Tesseract 3 for OCR. I also enabled system.pdf.force.ocr, because without this option no text is extracted from scanned PDF files.

I am using the German dictionary: dict-de_de-frami_2013-12-06.oxt

After running the text extractor cron, text was extracted, as the debug log shows:
DEBUG com.openkm.extractor.Tesseract3TextExtractor- TEXT: <my text goes here>
DEBUG com.openkm.extractor.PdfTextExtractor- OCR Extracted: <my text goes here>
But when I search for any text from <my text goes here>, nothing is found in the full-text search window. Why is that? I also rebuilt the indexes: Admin -> Utils -> Rebuild indexes -> Text extractor, and Lucene too.

Can you give me a hint as to why this does not work?

Thanks!
 #39432  by jllort
 
Try it in our online demo -> demo.openkm.com (wait 10-15 minutes, because the document will first sit in the batch queue and needs some time to be processed).

In your OpenKM, take the document UUID (it is available on the Properties tab; select it and press CTRL+C to copy), then go to Administration -> Database query and execute:
select * from OKM_NODE_DOCUMENT WHERE NBS_UUID='YOUR UUID HERE'
The result should be the document row. Does it contain the extracted text?
 #39537  by jllort
 
Under Administration -> Stats -> Text extractor queue you can see the documents that are still pending processing.

Also, with the UUID and the previous query I gave you, you can check whether the document has been processed: the field NDC_TEXT_EXTRACTED='F' indicates it has not been processed yet, and 'T' indicates it has.
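
To make that check easier to read, the earlier query could be narrowed to just the relevant columns. This is only a sketch: OKM_NODE_DOCUMENT, NBS_UUID and NDC_TEXT_EXTRACTED come from the posts above, while NDC_TEXT as the column holding the extracted text is an assumption, so verify it against your own schema (for example with the select * query) before relying on it.

```sql
-- Sketch only: NDC_TEXT is assumed to be the column that stores the extracted text.
-- Replace the placeholder with the UUID copied from the document's Properties tab.
SELECT NBS_UUID, NDC_TEXT_EXTRACTED, NDC_TEXT
FROM OKM_NODE_DOCUMENT
WHERE NBS_UUID = 'YOUR UUID HERE';
```

If NDC_TEXT_EXTRACTED is 'T' and the text column contains the OCR output, extraction itself has worked and the problem is more likely on the indexing or search side.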
