
Tesseract integration does not search scanned PDFs

Posted: Tue Jul 01, 2014 6:56 pm
by kesha1
Hey,
I am seeing an error in the Catalina log for the Tesseract integration with OpenKM. I am running OpenKM 6.3 Community Edition on Ubuntu 10.04. I have configured the system.ocr property and the dictionary for Tesseract 3.00. When I use Admin -> Utilities -> Check Extraction, Tesseract extracts the text from any file format, but searching for a word does not return the scanned PDFs that contain it.

I have been stuck on this for the past 4 days.
Any help with this would be really appreciated.
Code:
org.apache.jackrabbit.extractor.PlainTextExtractor 
org.apache.jackrabbit.extractor.MsWordTextExtractor 
org.apache.jackrabbit.extractor.MsExcelTextExtractor 
org.apache.jackrabbit.extractor.PdfTextExtractor 
org.apache.jackrabbit.extractor.MsPowerPointTextExtractor 
org.apache.jackrabbit.extractor.OpenOfficeTextExtractor 
org.apache.jackrabbit.extractor.RTFTextExtractor 
org.apache.jackrabbit.extractor.HTMLTextExtractor 
org.apache.jackrabbit.extractor.XMLTextExtractor 
org.apache.jackrabbit.extractor.PngTextExtractor 
org.apache.jackrabbit.extractor.MsOutlookTextExtractor 
com.openkm.extractor.PdfTextExtractor 
com.openkm.extractor.AudioTextExtractor 
com.openkm.extractor.ExifTextExtractor 
com.openkm.extractor.SourceCodeTextExtractor 
com.openkm.extractor.MsOffice2007TextExtractor 
com.openkm.extractor.Tesseract3TextExtractor

Re: Tesseract integration does not search scanned PDFs

Posted: Fri Jul 04, 2014 6:39 pm
by jllort
Is there any error in the log?
Are you sure Tesseract 3 is installed on this older Ubuntu version? Does it work correctly when run from a terminal?
In Administration -> Stats -> indexing queue, are there a lot of files pending, or have they all been processed? Keep in mind that documents go into a queue and are processed by a crontab task called "text extractor" or a similar name.
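
To check Tesseract outside OpenKM, as suggested above, something like the following rough sketch may help. It only assumes the standard command-line form `tesseract <image> <outputbase> -l <lang>`, which writes the recognised text to `<outputbase>.txt`. The function names and the sample file name are made up for illustration and are not part of OpenKM.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Return the argv for: tesseract <image> <output_base> -l <lang>."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_image(image_path, lang="eng"):
    """Run tesseract on one image and return the recognised text.

    Returns None when no tesseract binary is found on PATH.
    (Hypothetical helper, shown only to test the install by hand.)
    """
    if shutil.which("tesseract") is None:
        return None
    with tempfile.TemporaryDirectory() as tmp:
        output_base = Path(tmp) / "out"
        # tesseract appends ".txt" to the output base name itself
        subprocess.run(
            build_tesseract_command(image_path, output_base, lang),
            check=True,
        )
        return (Path(tmp) / "out.txt").read_text(encoding="utf-8", errors="replace")
```

Calling `ocr_image("scan.tif")` on a scanned page (the file name is just a placeholder) should return the page text if the install is working. If it returns None or raises an error, the problem is probably in the Tesseract installation rather than in the OpenKM indexing queue.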