tesseract not working with scanned PDF's & images

Posted: Fri Jun 24, 2016 6:22 am
by kashif0777
Hi all,

I installed OpenKM with the following software:

OS: Windows Server 2008
JDK: jdk1.8 (32-bit) (Xmx1024m)
OpenKM: 6.3.1 (Build 8235) (Community Edition)
Tesseract 3

To enable Tesseract as the OCR engine, I performed the following configuration:

1. Installed Tesseract 3 for Windows.

2. Set system.ocr=C:\tomcat\Tesseract-OCR\tesseract.exe ${fileIn} ${fileOut}

(It works fine from the command prompt and with the Text Extraction utility, available in the utility tab of OpenKM administration.)

3. The default values of registered.text.extractors are:

org.apache.jackrabbit.extractor.PlainTextExtractor
org.apache.jackrabbit.extractor.MsWordTextExtractor
org.apache.jackrabbit.extractor.MsExcelTextExtractor
org.apache.jackrabbit.extractor.MsPowerPointTextExtractor
org.apache.jackrabbit.extractor.OpenOfficeTextExtractor
org.apache.jackrabbit.extractor.RTFTextExtractor
org.apache.jackrabbit.extractor.HTMLTextExtractor
org.apache.jackrabbit.extractor.XMLTextExtractor
org.apache.jackrabbit.extractor.MsOutlookTextExtractor
com.openkm.extractor.PdfTextExtractor
com.openkm.extractor.AudioTextExtractor
com.openkm.extractor.ExifTextExtractor
com.openkm.extractor.Tesseract3TextExtractor
com.openkm.extractor.SourceCodeTextExtractor
com.openkm.extractor.MsOffice2007TextExtractor

With these settings, OpenKM content search works fine for all files except scanned PDFs and all kinds of images.
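
The system.ocr value in step 2 is a command template with ${fileIn} and ${fileOut} placeholders. As a rough illustration of what such a template expands to, here is a minimal Python sketch. It is not OpenKM's actual code: the function name build_ocr_command and the sample paths are hypothetical, and it assumes the template is split on whitespace before the placeholders are filled in.

```python
from string import Template


def build_ocr_command(template, file_in, file_out):
    """Expand ${fileIn}/${fileOut} in a system.ocr-style template into an argument list.

    Hypothetical helper for illustration only; it does not reproduce OpenKM internals.
    """
    args = []
    # Split first, substitute second, so paths containing spaces stay in one argument.
    for token in template.split():
        args.append(Template(token).safe_substitute(fileIn=file_in, fileOut=file_out))
    return args


if __name__ == "__main__":
    template = r"C:\tomcat\Tesseract-OCR\tesseract.exe ${fileIn} ${fileOut}"
    # Sample paths are made up; the tesseract CLI takes an input image and an output base name.
    print(build_ocr_command(template, r"C:\temp\scan0001.png", r"C:\temp\scan0001"))
```

The resulting list could be passed to subprocess.run on a machine where Tesseract is installed, which is one way to reproduce the command-prompt test with the exact arguments the template produces.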


I have the following questions:

1. Is Tesseract actually being used for OCR or not?
2. What needs to be configured so that "content search" shows documents for scanned pages and images?
3. Is OCR for scanned pages and images only supported in the Professional edition?
4. I also uploaded a simple JPG/PNG file containing the text "hello world" to the online demo (user4/pass4), and content search did not work there either. Why?

Best Regards.

Re: tesseract not working with scanned PDF's & images

Posted: Sat Jun 25, 2016 3:00 pm
by jllort
First of all, check whether the OCR process works at demo.openkm.com (wait 10 minutes before checking, because files go into the pending text extractor queue, which is processed every 5 minutes).

Ensure the documents are no longer present in Administration / Stats / Pending text extractor queue (that is, that the files were really processed and are not stalled there).

The class com.openkm.extractor.Tesseract3TextExtractor is correctly configured, so there should not be any problem there. Take a look at the catalina.log file to see whether there is an error while trying to process the file.
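
If the log is large, a small script can narrow it down. This is only a sketch: the function name find_extraction_problems is hypothetical, the line layout "date time [thread] LEVEL message" is an assumption about the logging pattern, and the keyword "extract" is just a guess at what extraction-related messages contain.

```python
import re
import sys

# Assumed layout: "<date> <time> [<thread>] <LEVEL> <rest>". Adjust to the real log pattern.
LINE_RE = re.compile(r"^\S+ \S+ \[[^\]]*\] (?P<level>[A-Z]+)\s+(?P<rest>.*)$")


def find_extraction_problems(lines, keyword="extract"):
    """Return (level, message) pairs for WARN/ERROR lines whose message mentions the keyword."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if not match:
            continue  # stack-trace lines and anything else outside the assumed layout
        level, rest = match.group("level"), match.group("rest")
        if level in ("WARN", "ERROR") and keyword.lower() in rest.lower():
            hits.append((level, rest))
    return hits


if __name__ == "__main__":
    # Usage: python scan_log.py path/to/catalina.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log_file:
        for level, message in find_extraction_problems(log_file):
            print(level, message)
```

Lines that do not match the assumed layout are skipped, so a stack trace following a warning would still need to be read in the log itself.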

Re: tesseract not working with scanned PDF's & images

Posted: Mon Jun 27, 2016 6:57 am
by kashif0777
Thanks for the reply,

Today I added a scanned PDF document and got the following warning in the Tomcat console:

2016-06-27 11:30:00,052 [Thread-2717] WARN com.openkm.dao.NodeDocumentDAO- There was a problem extracting text from '/okm:root/scan0001.pdf': Full text indexing of 'application/pdf' is not supported


Is full text extraction not supported in the Community edition?

Re: tesseract not working with scanned PDF's & images

Posted: Wed Jun 29, 2016 2:59 pm
by jllort
This PDF should be checked to understand why you are getting this message. PDF extraction is supported in the Community edition.

There is no OCR button in Community version

Posted: Sun Aug 07, 2016 2:57 am
by Punitha
Hi,

I am a new user of this tool. I don't see the OCR button in the Administration section in the Community version. Could you please let me know how to enable the OCR button?

Thanks,
Punitha

Re: tesseract not working with scanned PDF's & images

Posted: Sun Aug 07, 2016 4:21 pm
by jllort
Are you talking about OCR Zone (which is only present in the Professional edition)? I do not know what you have in mind when you talk about the "OCR button".