Open Source Document Management System | OpenKM

Posted: **Tue Jul 01, 2014 6:56 pm**

Hey,
I am seeing an error in the Catalina log for the Tesseract integration. I am running OpenKM 6.3 Community Edition on Ubuntu 10.04, and I have configured the system.ocr property and the dictionary for Tesseract 3.00. When I use the Check Extraction option from Admin-->Utilities-->Check Extraction, Tesseract extracts the text from any file format I try, but the search results do not return the PDFs that contain the search word.

I have been stuck on this for the past 4 days.
Any help with this would be really appreciated.

Code:

org.apache.jackrabbit.extractor.PlainTextExtractor 
org.apache.jackrabbit.extractor.MsWordTextExtractor 
org.apache.jackrabbit.extractor.MsExcelTextExtractor 
org.apache.jackrabbit.extractor.PdfTextExtractor 
org.apache.jackrabbit.extractor.MsPowerPointTextExtractor 
org.apache.jackrabbit.extractor.OpenOfficeTextExtractor 
org.apache.jackrabbit.extractor.RTFTextExtractor 
org.apache.jackrabbit.extractor.HTMLTextExtractor 
org.apache.jackrabbit.extractor.XMLTextExtractor 
org.apache.jackrabbit.extractor.PngTextExtractor 
org.apache.jackrabbit.extractor.MsOutlookTextExtractor 
com.openkm.extractor.PdfTextExtractor 
com.openkm.extractor.AudioTextExtractor 
com.openkm.extractor.ExifTextExtractor 
com.openkm.extractor.SourceCodeTextExtractor 
com.openkm.extractor.MsOffice2007TextExtractor 
com.openkm.extractor.Tesseract3TextExtractor

Posted: **Fri Jul 04, 2014 6:39 pm**

Is there any error in the log?
Are you sure Tesseract 3 is installed on this older Ubuntu version? Does it work correctly when you run it from the terminal?
In Administration -> Stats -> indexing queue, are there a lot of files waiting, or have they all been processed? Keep in mind that documents go into a queue and are processed by a crontab task called "text extractor" or a similar name.
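
For the terminal check, a small sketch like the one below may help. It only wraps the classic `tesseract <image> <outputbase> -l <lang>` invocation. The function names, the `eng` language default, and the sample file paths are hypothetical, and it assumes the `tesseract` binary is on the PATH of the user running it (ideally the same user that runs Tomcat).

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image, output_base, lang="eng", binary="tesseract"):
    """Build the classic 'tesseract <image> <outputbase> -l <lang>' call."""
    return [binary, str(image), str(output_base), "-l", lang]


def ocr_to_text(image, output_base, lang="eng", binary="tesseract"):
    """Run Tesseract on an image and return the recognised text.

    Returns None when the binary cannot be found on the PATH, which is
    the first thing worth ruling out.
    """
    if shutil.which(binary) is None:
        return None
    # Tesseract reads image files (for example TIFF); it does not open PDFs itself.
    subprocess.run(build_tesseract_command(image, output_base, lang, binary), check=True)
    # Tesseract appends ".txt" to the output base name.
    return Path(f"{output_base}.txt").read_text(encoding="utf-8", errors="replace")
```

For example, `print(ocr_to_text("scan.tif", "/tmp/scan"))` with a scanned page you have at hand. If this prints `None` or raises an error, the problem is probably in the Tesseract installation and not in the OpenKM indexing queue.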

Tesseract integration does not search scanned PDFs
