Open Source Document Management System | OpenKM - Tesseract integration does not search scanned PDFs

#29132 by kesha1
Tue Jul 01, 2014 6:56 pm

Hey,
I am seeing an error in the Catalina log for the Tesseract integration. I am running OpenKM 6.3 Community Edition on Ubuntu 10.04, and I have configured the system.ocr property and the dictionary for Tesseract 3.00. When I use Admin --> Utilities --> Check Extraction, Tesseract extracts the text from any file format correctly, but a regular search does not return the scanned PDFs that contain the search word.

I have been stuck on this for the past 4 days.
Any help would be really appreciated.

Code:

org.apache.jackrabbit.extractor.PlainTextExtractor 
org.apache.jackrabbit.extractor.MsWordTextExtractor 
org.apache.jackrabbit.extractor.MsExcelTextExtractor 
org.apache.jackrabbit.extractor.PdfTextExtractor 
org.apache.jackrabbit.extractor.MsPowerPointTextExtractor 
org.apache.jackrabbit.extractor.OpenOfficeTextExtractor 
org.apache.jackrabbit.extractor.RTFTextExtractor 
org.apache.jackrabbit.extractor.HTMLTextExtractor 
org.apache.jackrabbit.extractor.XMLTextExtractor 
org.apache.jackrabbit.extractor.PngTextExtractor 
org.apache.jackrabbit.extractor.MsOutlookTextExtractor 
com.openkm.extractor.PdfTextExtractor 
com.openkm.extractor.AudioTextExtractor 
com.openkm.extractor.ExifTextExtractor 
com.openkm.extractor.SourceCodeTextExtractor 
com.openkm.extractor.MsOffice2007TextExtractor 
com.openkm.extractor.Tesseract3TextExtractor

Re: Tesseract integration does not search scanned PDFs

#29161 by jllort
Fri Jul 04, 2014 6:39 pm

Is there any error in the log?
Are you sure Tesseract 3 is installed on this older Ubuntu version? Does it work correctly when run from a terminal?
In Administration -> Stats -> indexing queue, are there a lot of files still pending, or have they all been processed? Keep in mind that documents go into a queue and are processed by the crontab task called "text extractor" (or a similar name), so newly uploaded files are not searchable until that task has run.
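
To check the "from a terminal" part, something like the following might help. It is only a rough sketch, not OpenKM code: the helper names are made up for illustration, and it assumes the binary is called tesseract and is on the PATH of the user that runs Tomcat. It uses the standard command form `tesseract <image> <output_base> -l <lang>`, which writes the recognized text to `<output_base>.txt`.

```python
import shutil
import subprocess
from pathlib import Path


def tesseract_path():
    """Return the full path of the tesseract binary, or None if it is not on PATH."""
    return shutil.which("tesseract")


def build_ocr_command(image, output_base, lang="eng"):
    """Build the command line: tesseract <image> <output_base> -l <lang>."""
    return ["tesseract", str(image), str(output_base), "-l", lang]


def run_ocr(image, output_base, lang="eng"):
    """Run tesseract on one image file and return the recognized text.

    Tesseract writes its result to <output_base>.txt, which is read back here.
    Raises CalledProcessError if tesseract exits with a non-zero status.
    """
    subprocess.run(build_ocr_command(image, output_base, lang), check=True)
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")
```

If `tesseract_path()` returns None, the binary is not visible to that user and the path in system.ocr should be checked. Otherwise, calling `run_ocr("sample.tif", "/tmp/ocr_check")` on a test scan (sample.tif is a placeholder for a real file) shows whether the OCR works outside OpenKM.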
