• Tesseract not working with scanned PDFs & images

  • OpenKM has many interesting features, but requires some configuration process to show its full potential.
 #41957  by kashif0777
 
Hi all,

I installed OpenKM with the following software:

OS: Windows Server 2008
JDK: jdk1.8 (32-bit) (Xmx1024m)
OpenKM: 6.3.1 (Build 8235) (Community Edition)
Tesseract 3

To enable Tesseract as the OCR engine, I performed the following configuration:

1. Installed Tesseract 3 for Windows.

2. Set system.ocr=C:\tomcat\Tesseract-OCR\tesseract.exe ${fileIn} ${fileOut}

(It works fine from the command prompt and with the Text Extraction utility, available in the Utility tab of OpenKM administration.)
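For readers who want to reproduce the command-prompt check, here is a minimal sketch of how a system.ocr-style template could be expanded into a command and run. It is not OpenKM's own code: the helper names build_ocr_command and run_ocr and the C:\temp paths are hypothetical, and it assumes ${fileOut} is passed as an output base name, to which the tesseract command line appends ".txt".

```python
import subprocess


def build_ocr_command(template, file_in, file_out):
    """Expand a system.ocr-style template into an argument list.

    Hypothetical helper, not OpenKM code. The template is split on
    whitespace first and the placeholders are filled in afterwards, so an
    input path that contains spaces stays a single argument. An executable
    path with spaces (e.g. under "Program Files") is not handled here.
    """
    values = {"${fileIn}": file_in, "${fileOut}": file_out}
    return [values.get(token, token) for token in template.split()]


def run_ocr(template, file_in, file_out):
    """Run the expanded command and return the recognised text."""
    subprocess.run(build_ocr_command(template, file_in, file_out), check=True)
    # Assumption: file_out is an output base name; the tesseract CLI
    # writes its result to "<base>.txt".
    with open(file_out + ".txt", encoding="utf-8") as handle:
        return handle.read()


# Template as quoted in the post; the two file paths are made up.
template = r"C:\tomcat\Tesseract-OCR\tesseract.exe ${fileIn} ${fileOut}"
command = build_ocr_command(template, r"C:\temp\hello world.png", r"C:\temp\out")
print(command)
```

If run_ocr returns the expected text for an image but content search still finds nothing, the command itself is probably fine and the issue is more likely in how or when the extractor is invoked.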

3. The default values of registered.text.extractors are:

org.apache.jackrabbit.extractor.PlainTextExtractor
org.apache.jackrabbit.extractor.MsWordTextExtractor
org.apache.jackrabbit.extractor.MsExcelTextExtractor
org.apache.jackrabbit.extractor.MsPowerPointTextExtractor
org.apache.jackrabbit.extractor.OpenOfficeTextExtractor
org.apache.jackrabbit.extractor.RTFTextExtractor
org.apache.jackrabbit.extractor.HTMLTextExtractor
org.apache.jackrabbit.extractor.XMLTextExtractor
org.apache.jackrabbit.extractor.MsOutlookTextExtractor
com.openkm.extractor.PdfTextExtractor
com.openkm.extractor.AudioTextExtractor
com.openkm.extractor.ExifTextExtractor
com.openkm.extractor.Tesseract3TextExtractor
com.openkm.extractor.SourceCodeTextExtractor
com.openkm.extractor.MsOffice2007TextExtractor
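A quick way to double-check a list like this is to parse the property value and confirm the extractor classes of interest are present. The sketch below is only an illustration: parse_extractors and missing_extractors are hypothetical helpers, the value is shortened, and it assumes one class name per line as shown above.

```python
# Shortened copy of the registered.text.extractors value quoted above.
registered = """
org.apache.jackrabbit.extractor.PlainTextExtractor
com.openkm.extractor.PdfTextExtractor
com.openkm.extractor.Tesseract3TextExtractor
"""


def parse_extractors(value):
    """Return the non-empty lines of the property value, trimmed."""
    return [line.strip() for line in value.splitlines() if line.strip()]


def missing_extractors(value, required):
    """Return the required class names that are not registered."""
    present = set(parse_extractors(value))
    return [name for name in required if name not in present]


required = [
    "com.openkm.extractor.PdfTextExtractor",
    "com.openkm.extractor.Tesseract3TextExtractor",
]
missing = missing_extractors(registered, required)
print(missing)
```

An empty result only shows that the classes are listed; it does not prove that they are invoked for a given MIME type.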

With these settings, OpenKM content search works fine for all files except scanned PDFs and all kinds of images.


I have the following questions:

1. Is Tesseract actually being used for OCR or not?
2. What needs to be configured so that "content search" finds scanned pages and images?
3. Is OCR for scanned pages and images only supported in the Professional edition?
4. I also uploaded a simple jpg/png file containing the text "hello world" to the online demo (user4/pass4), and content search did not work there either. Why?

Best Regards.
 #41965  by jllort
 
First of all, check whether the OCR process works correctly at demo.openkm.com (wait 10 minutes before checking, because files go into the pending text extractor queue, which is executed every 5 minutes).

Ensure the documents are no longer present in Administration / Stats / Pending text extractor queue (make sure the files have really been processed and are not stalled there).

The class com.openkm.extractor.Tesseract3TextExtractor is configured correctly, so there should not be any problem there. Take a look at the catalina.log file to see whether there is an error while trying to process the file.
 #41971  by kashif0777
 
Thanks for the reply.

Today I added a scanned PDF document and got the following warning in the Tomcat console:

2016-06-27 11:30:00,052 [Thread-2717] WARN com.openkm.dao.NodeDocumentDAO- There was a problem extracting text from '/okm:root/scan0001.pdf': Full text indexing of 'application/pdf' is not supported


Is full text extraction not supported in the Community edition?
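When a log contains many such entries, a small script can list which documents failed and why. This is a rough sketch that assumes every warning follows the exact layout of the line quoted above; the pattern and the extraction_failures helper are illustrative and not part of OpenKM.

```python
import re

# Matches warnings shaped like the one quoted above. Assumption: the
# document path itself contains no apostrophe.
PATTERN = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"\[(?P<thread>[^\]]+)\] WARN .*?"
    r"There was a problem extracting text from '(?P<path>[^']+)': "
    r"(?P<reason>.*)$"
)


def extraction_failures(lines):
    """Yield a dict (time, thread, path, reason) for each matching line."""
    for line in lines:
        match = PATTERN.match(line.strip())
        if match:
            yield match.groupdict()


sample = [
    "2016-06-27 11:30:00,052 [Thread-2717] WARN com.openkm.dao.NodeDocumentDAO- "
    "There was a problem extracting text from '/okm:root/scan0001.pdf': "
    "Full text indexing of 'application/pdf' is not supported",
]
failures = list(extraction_failures(sample))
for failure in failures:
    print(failure["path"], "->", failure["reason"])
```

To scan a real log, pass an open file object instead of the sample list, for example extraction_failures(open(path_to_log, encoding="utf-8")), where path_to_log points at your Tomcat log.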
 #42124  by Punitha
 
Hi,

I am a new user of this tool. I don't see the OCR button in the Administration section of the Community version. Could you please let me know how to enable the OCR button?

Thanks,
Punitha
 #42127  by jllort
 
Are you talking about OCR Zone (which is only present in the Professional edition)? I do not know what you have in mind when you talk about the "OCR button".
