Hi all,
I installed OpenKM with the following software:
OS: Windows Server 2008
JDK: jdk1.8 (32-bit) (Xmx1024m)
OpenKM: 6.3.1 (Build 8235) (Community Edition)
Tesseract 3
To enable Tesseract as the OCR engine, I performed the following configuration:
1. Installed Tesseract 3 for Windows.
2. Set system.ocr=C:\tomcat\Tesseract-OCR\tesseract.exe ${fileIn} ${fileOut}
(It works fine from the command prompt and with the Text Extraction utility, available in the utility tab of OpenKM administration.)
3. The default values of registered.text.extractors are:
org.apache.jackrabbit.extractor.PlainTextExtractor
org.apache.jackrabbit.extractor.MsWordTextExtractor
org.apache.jackrabbit.extractor.MsExcelTextExtractor
org.apache.jackrabbit.extractor.MsPowerPointTextExtractor
org.apache.jackrabbit.extractor.OpenOfficeTextExtractor
org.apache.jackrabbit.extractor.RTFTextExtractor
org.apache.jackrabbit.extractor.HTMLTextExtractor
org.apache.jackrabbit.extractor.XMLTextExtractor
org.apache.jackrabbit.extractor.MsOutlookTextExtractor
com.openkm.extractor.PdfTextExtractor
com.openkm.extractor.AudioTextExtractor
com.openkm.extractor.ExifTextExtractor
com.openkm.extractor.Tesseract3TextExtractor
com.openkm.extractor.SourceCodeTextExtractor
com.openkm.extractor.MsOffice2007TextExtractor
With these settings, OpenKM content search works fine for all files except scanned PDFs and all kinds of images.
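To make the command-prompt check from step 2 reproducible, here is a minimal Python sketch that expands the system.ocr value and runs it by hand. It is only an illustration: the helper names and the sample paths under C:\temp are hypothetical, and the way OpenKM itself substitutes ${fileIn} and ${fileOut} internally is an assumption. The one thing it relies on is that Tesseract 3 writes its result to the output base name plus ".txt".

```python
import subprocess
from string import Template

# Value of system.ocr exactly as configured in step 2.
SYSTEM_OCR = r"C:\tomcat\Tesseract-OCR\tesseract.exe ${fileIn} ${fileOut}"


def build_ocr_command(template, file_in, file_out):
    """Expand ${fileIn} and ${fileOut} and return an argument list.

    The template is split on whitespace first, so a substituted path
    that contains spaces still ends up as a single argument.
    """
    values = {"fileIn": file_in, "fileOut": file_out}
    return [Template(token).substitute(values) for token in template.split()]


def run_ocr(template, file_in, file_out):
    """Run the OCR command and return the recognised text."""
    subprocess.run(build_ocr_command(template, file_in, file_out), check=True)
    # Tesseract 3 appends ".txt" to the output base name it is given.
    with open(file_out + ".txt", encoding="utf-8") as handle:
        return handle.read()


if __name__ == "__main__":
    # Hypothetical sample image and output base name.
    print(run_ocr(SYSTEM_OCR, r"C:\temp\hello.png", r"C:\temp\hello"))
```

If this prints the expected text, Tesseract itself and the system.ocr value are fine, and the problem is more likely in how the extractor is invoked during indexing.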
I have the following questions:
1. Is Tesseract actually being used for OCR or not?
2. What needs to be configured so that "content search" returns scanned pages and images?
3. Is OCR for scanned pages and images only supported in the Professional edition?
4. I also uploaded a simple jpg/png file containing the text "hello world" to the online demo (user4/pass4), and content search did not work there either. Why?
Best Regards.