Open Source Document Management System | OpenKM - tesseract not working with scanned PDF's & images

tesseract not working with scanned PDF's & images

#41957 by kashif0777
Fri Jun 24, 2016 6:22 am

Hi Dears,

I installed OpenKM with the following software:

OS: Windows Server 2008
JDK: jdk1.8 (32-bit) (Xmx1024m)
OpenKM: 6.3.1 (Build 8235) (Community Edition)
Tesseract 3

To enable Tesseract as the OCR engine, I performed the following configuration:

1. Installed Tesseract 3 for Windows.

2. set system.ocr=C:\tomcat\Tesseract-OCR\tesseract.exe ${fileIn} ${fileOut}

(It works fine from the command prompt and with the Text Extraction utility, available in the utility tab of OpenKM administration.)

3. The default values of registered.text.extractors are:

org.apache.jackrabbit.extractor.PlainTextExtractor
org.apache.jackrabbit.extractor.MsWordTextExtractor
org.apache.jackrabbit.extractor.MsExcelTextExtractor
org.apache.jackrabbit.extractor.MsPowerPointTextExtractor
org.apache.jackrabbit.extractor.OpenOfficeTextExtractor
org.apache.jackrabbit.extractor.RTFTextExtractor
org.apache.jackrabbit.extractor.HTMLTextExtractor
org.apache.jackrabbit.extractor.XMLTextExtractor
org.apache.jackrabbit.extractor.MsOutlookTextExtractor
com.openkm.extractor.PdfTextExtractor
com.openkm.extractor.AudioTextExtractor
com.openkm.extractor.ExifTextExtractor
com.openkm.extractor.Tesseract3TextExtractor
com.openkm.extractor.SourceCodeTextExtractor
com.openkm.extractor.MsOffice2007TextExtractor

With these settings, OpenKM content search works fine for all files except scanned PDFs and all kinds of images.
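One way to double-check the command from step 2 outside OpenKM is to expand the system.ocr template by hand and run it. The following is a minimal Python sketch, not OpenKM's own code. The helper names build_ocr_command and run_ocr are hypothetical, the sketch assumes the executable path contains no spaces, and it assumes Tesseract 3 writes its result to the output base name plus ".txt".

```python
import subprocess
from pathlib import Path
from typing import List

# Template copied from the system.ocr setting quoted above.
OCR_TEMPLATE = r"C:\tomcat\Tesseract-OCR\tesseract.exe ${fileIn} ${fileOut}"


def build_ocr_command(template: str, file_in: str, file_out: str) -> List[str]:
    """Expand ${fileIn} and ${fileOut} token by token (hypothetical helper)."""
    values = {"${fileIn}": file_in, "${fileOut}": file_out}
    # Splitting on whitespace assumes the executable path has no spaces.
    return [values.get(token, token) for token in template.split()]


def run_ocr(template: str, image: str, out_base: str) -> str:
    """Run the expanded command and return the recognised text."""
    command = build_ocr_command(template, image, out_base)
    subprocess.run(command, check=True)
    # Assumption: Tesseract 3 appends ".txt" to the output base name.
    return Path(out_base + ".txt").read_text(encoding="utf-8")
```

Calling run_ocr(OCR_TEMPLATE, image, out_base) on the server, with an image path and an output base name of your choice, should return the text of the image if Tesseract itself is working. That only confirms the command line, not that OpenKM is invoking it.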

I have the following questions:

1. Is Tesseract actually being used for OCR or not?
2. What needs to be configured so that content search returns documents for scanned pages and images?
3. Is OCR for scanned pages and images supported only in the Professional edition?
4. I also uploaded a simple jpg/png file containing the text "hello world" to the online demo (user4/pass4), and content search did not work there either. Why?

Best Regards.


Re: tesseract not working with scanned PDF's & images

#41965 by jllort
Sat Jun 25, 2016 3:00 pm

First of all, check whether the OCR process works correctly at demo.openkm.com (wait 10 minutes before checking, because files go into the pending text extractor queue, which is executed every 5 minutes).

Ensure the documents are no longer present in Administration / Stats / Pending text extractor queue (make sure the files have really been processed and are not stalled there).

The class com.openkm.extractor.Tesseract3TextExtractor is correctly configured, so there should not be any problem there. Take a look at the catalina.log file to see whether there is an error while trying to process the file.


Re: tesseract not working with scanned PDF's & images

#41971 by kashif0777
Mon Jun 27, 2016 6:57 am

Thanks for the reply.

Today I added a scanned PDF document and got the following warning in the Tomcat console:

2016-06-27 11:30:00,052 [Thread-2717] WARN com.openkm.dao.NodeDocumentDAO- There was a problem extracting text from '/okm:root/scan0001.pdf': Full text indexing of 'application/pdf' is not supported
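Warnings of this shape can be collected from a log file with a small script. This is only a sketch: the regular expression is modelled on the single line quoted above, so other OpenKM versions may word the message differently, and the function name find_extraction_warnings is hypothetical.

```python
import re
from typing import Iterable, List, Tuple

# Pattern modelled on the one warning quoted above (an assumption, not a spec).
EXTRACTION_WARNING = re.compile(
    r"There was a problem extracting text from '(?P<path>[^']+)': (?P<reason>.*)$"
)


def find_extraction_warnings(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (document path, reason) pairs for matching log lines."""
    found = []
    for line in lines:
        match = EXTRACTION_WARNING.search(line.rstrip("\r\n"))
        if match:
            found.append((match.group("path"), match.group("reason")))
    return found
```

Passing an open log file to find_extraction_warnings (the location of catalina.log depends on the installation) gives a list of the documents that failed and the reason reported for each.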

Is full-text extraction not supported in the Community edition?


Re: tesseract not working with scanned PDF's & images

#41981 by jllort
Wed Jun 29, 2016 2:59 pm

You should check this PDF to understand why you are getting this message. PDF extraction is supported in the Community edition.
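One possible first check on such a file, offered only as a rough heuristic and not as the cause of the warning, is whether the PDF appears to contain any fonts at all. A purely scanned PDF often holds page images and no font resources. The sketch below just searches the raw bytes for the PDF names /Font and /Image; it can give a wrong answer when the PDF stores its objects in compressed object streams, and the function names are hypothetical.

```python
def summarize_pdf_markers(data: bytes) -> dict:
    """Report which raw markers are present in the file bytes."""
    return {
        "is_pdf": data.startswith(b"%PDF-"),
        "has_font": b"/Font" in data,    # also matches /FontDescriptor
        "has_image": b"/Image" in data,  # e.g. /Subtype /Image
    }


def looks_image_only(data: bytes) -> bool:
    """Heuristic: a PDF with image objects and no visible font objects."""
    markers = summarize_pdf_markers(data)
    return markers["is_pdf"] and markers["has_image"] and not markers["has_font"]
```

If looks_image_only returns True for the bytes of the file, the document probably has no text layer, so any searchable text would have to come from OCR.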


There is no OCR button in Community version

#42124 by Punitha
Sun Aug 07, 2016 2:57 am

Hi,

I am a new user of this tool. I don't see the OCR button in the Administration section of the Community version. Could you please let me know how to enable the OCR button?

Thanks,
Punitha


Re: tesseract not working with scanned PDF's & images

#42127 by jllort
Sun Aug 07, 2016 4:21 pm

Are you talking about OCR Zone (which is only present in the Professional edition)? I do not know what you have in mind when you talk about an "OCR button".
