Open Source Document Management System | OpenKM

Posted: **Fri Jun 24, 2016 6:22 am**

Hi Dears,

I installed OpenKM with the following software:

OS: Windows Server 2008
JDK: jdk1.8 (32-bit) (Xmx1024m)
OpenKM: 6.3.1 (Build 8235) (Community Edition)
Tesseract 3

To enable Tesseract as the OCR engine, I performed the following configuration:

1. Installed Tesseract 3 for Windows.

2. Set system.ocr=C:\tomcat\Tesseract-OCR\tesseract.exe ${fileIn} ${fileOut}

(It works fine from the command prompt and with the Text Extraction utility, available in the utility tab of OpenKM administration.)

3. The default values of registered.text.extractors are:

org.apache.jackrabbit.extractor.PlainTextExtractor
org.apache.jackrabbit.extractor.MsWordTextExtractor
org.apache.jackrabbit.extractor.MsExcelTextExtractor
org.apache.jackrabbit.extractor.MsPowerPointTextExtractor
org.apache.jackrabbit.extractor.OpenOfficeTextExtractor
org.apache.jackrabbit.extractor.RTFTextExtractor
org.apache.jackrabbit.extractor.HTMLTextExtractor
org.apache.jackrabbit.extractor.XMLTextExtractor
org.apache.jackrabbit.extractor.MsOutlookTextExtractor
com.openkm.extractor.PdfTextExtractor
com.openkm.extractor.AudioTextExtractor
com.openkm.extractor.ExifTextExtractor
com.openkm.extractor.Tesseract3TextExtractor
com.openkm.extractor.SourceCodeTextExtractor
com.openkm.extractor.MsOffice2007TextExtractor

With these settings, OpenKM content search works fine for all files except scanned PDFs and all kinds of images.
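To reproduce the command-prompt check outside OpenKM, here is a minimal Python sketch that fills in the system.ocr template and runs it. The helper names `build_ocr_command` and `run_ocr` are made up for illustration, and whether OpenKM itself passes `${fileOut}` with or without an extension is not stated in this thread, so treat that part as an assumption. The sketch relies only on the standard tesseract CLI behaviour of appending ".txt" to the output base name.

```python
import subprocess
from pathlib import Path

# Template copied from the system.ocr setting in this post.
OCR_TEMPLATE = r"C:\tomcat\Tesseract-OCR\tesseract.exe ${fileIn} ${fileOut}"


def build_ocr_command(template, file_in, file_out):
    """Substitute the two placeholders and return an argument list.

    Splitting on whitespace before substituting keeps input paths that
    contain spaces intact, but it assumes the executable path itself
    contains no spaces.
    """
    return [
        token.replace("${fileIn}", file_in).replace("${fileOut}", file_out)
        for token in template.split()
    ]


def run_ocr(template, image_path, output_base):
    """Run the OCR command and return the recognized text.

    Assumption: output_base is given without an extension, because the
    tesseract CLI appends ".txt" to the output base name on its own.
    """
    command = build_ocr_command(template, image_path, output_base)
    subprocess.run(command, check=True)  # raises if tesseract exits non-zero
    return Path(output_base + ".txt").read_text(encoding="utf-8")
```

Calling `run_ocr(OCR_TEMPLATE, r"C:\tmp\hello.png", r"C:\tmp\hello")` (hypothetical paths) should return the recognized text if the executable and the template are correct. If that works but content search still finds nothing, the problem is more likely in the extraction queue than in Tesseract itself.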

I have the following questions:

1. Is Tesseract actually being used for OCR or not?
2. What needs to be configured so that content search returns scanned pages and images?
3. Is OCR for scanned pages and images supported only in the Professional edition?
4. I also uploaded a simple jpg/png file containing the text "hello world" to the online demo (using user4/pass4), and content search did not work there either. Why?

Best Regards.

Posted: **Sat Jun 25, 2016 3:00 pm**

First of all, check whether the OCR process works correctly at demo.openkm.com (wait 10 minutes before checking, because files go into the pending text extractor queue, which is processed every 5 minutes).

Ensure the documents are no longer present in Administration / Stats / Pending text extractor queue (that is, make sure the files have really been processed and are not stalled there).

The class com.openkm.extractor.Tesseract3TextExtractor is correctly configured, so there should not be any problem there. Take a look at the catalina.log file to see whether there is an error while trying to process the file.

Posted: **Mon Jun 27, 2016 6:57 am**

Thanks for the reply.

Today I added a scanned PDF document and got the following warning in the Tomcat console:

2016-06-27 11:30:00,052 [Thread-2717] WARN com.openkm.dao.NodeDocumentDAO- There was a problem extracting text from '/okm:root/scan0001.pdf': Full text indexing of 'application/pdf' is not supported

Is full-text extraction not supported in the Community edition?
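For anyone hunting for the same message in a larger log, here is a minimal Python sketch that pulls the document path and the reason out of warnings shaped like the one quoted above. The pattern is modelled on that single line only, so other OpenKM versions may word the message differently; the function name `find_extraction_warnings` is made up for illustration.

```python
import re

# Pattern modelled on the one warning line quoted in this post; it is an
# assumption that other extraction warnings follow the same shape.
EXTRACTION_WARNING = re.compile(
    r"WARN\s+(?P<logger>\S+?)\s*-\s*There was a problem extracting text "
    r"from '(?P<path>[^']+)': (?P<reason>.*)$"
)


def find_extraction_warnings(lines):
    """Return (document path, reason) pairs for matching log lines."""
    results = []
    for line in lines:
        match = EXTRACTION_WARNING.search(line.strip())
        if match:
            results.append((match.group("path"), match.group("reason")))
    return results
```

The function accepts any iterable of lines, so an open file handle for the Tomcat log can be passed directly. Grouping the results by reason shows quickly whether only certain MIME types are being rejected.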

Posted: **Wed Jun 29, 2016 2:59 pm**

You should check this PDF to understand why you are getting this message. PDF extraction is supported in the Community edition.

Posted: **Sun Aug 07, 2016 2:57 am**

Hi,

I am a new user of this tool. I don't see the OCR button in the Administration section of the Community version. Could you please let me know how to enable the OCR button?

thanks,
Punitha

Posted: **Sun Aug 07, 2016 4:21 pm**

Are you talking about OCR Zone (which is only present in the Professional edition)? I do not know what you have in mind when you talk about an "OCR button".

tesseract not working with scanned PDF's & images
