Tesseract integration does not search scanned PDFs

OpenKM has many interesting features, but requires some configuration process to show its full potential.
 #29132  by kesha1
 
Hey,
I am having a problem with the Tesseract integration in OpenKM. I am running OpenKM 6.3 Community Edition on Ubuntu 10.04. I have configured the system.ocr property and the dictionary corresponding to Tesseract 3.00. When I use the Check Extraction option from Admin --> Utilities --> Check Extraction, I am able to extract the text from any file format with Tesseract, but a regular search does not return the scanned PDFs containing the search word.

I have been stuck on this for the past 4 days.
Any help with this will be really appreciated.
Code:
org.apache.jackrabbit.extractor.PlainTextExtractor 
org.apache.jackrabbit.extractor.MsWordTextExtractor 
org.apache.jackrabbit.extractor.MsExcelTextExtractor 
org.apache.jackrabbit.extractor.PdfTextExtractor 
org.apache.jackrabbit.extractor.MsPowerPointTextExtractor 
org.apache.jackrabbit.extractor.OpenOfficeTextExtractor 
org.apache.jackrabbit.extractor.RTFTextExtractor 
org.apache.jackrabbit.extractor.HTMLTextExtractor 
org.apache.jackrabbit.extractor.XMLTextExtractor 
org.apache.jackrabbit.extractor.PngTextExtractor 
org.apache.jackrabbit.extractor.MsOutlookTextExtractor 
com.openkm.extractor.PdfTextExtractor 
com.openkm.extractor.AudioTextExtractor 
com.openkm.extractor.ExifTextExtractor 
com.openkm.extractor.SourceCodeTextExtractor 
com.openkm.extractor.MsOffice2007TextExtractor 
com.openkm.extractor.Tesseract3TextExtractor
 #29161  by jllort
 
Is there any error in the log?
Are you sure you have Tesseract 3 installed on this older Ubuntu version? Does it run correctly from the terminal?
In Administration -> Stats -> indexing queue, are there a lot of files pending, or have they all been processed? Keep in mind that documents go into a queue and are processed by the crontab task called "text extractor" or a similar name.
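
To check the "from terminal" part outside of OpenKM, something like the following sketch may help. It assumes the classic `tesseract <image> <outputbase> -l <lang>` command-line form and that the binary is named `tesseract` on the PATH. The helper names and the file names in the usage note are hypothetical, and this is not how OpenKM itself invokes the OCR.

```python
import shutil
import subprocess
from typing import List, Optional


def build_ocr_command(image_path: str, output_base: str, lang: str = "eng") -> List[str]:
    """Build the classic `tesseract <image> <outputbase> -l <lang>` invocation."""
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path: str, output_base: str, lang: str = "eng") -> Optional[str]:
    """Run Tesseract and return the recognized text, or None if it is not on the PATH."""
    if shutil.which("tesseract") is None:
        return None  # binary not visible to the current user
    # check=True raises CalledProcessError if Tesseract exits with an error
    subprocess.run(build_ocr_command(image_path, output_base, lang), check=True)
    # Tesseract writes its result to <output_base>.txt
    with open(output_base + ".txt", encoding="utf-8") as fh:
        return fh.read()
```

Calling `run_ocr("scan.tif", "/tmp/scan_out")` as the same operating system user that runs Tomcat should show whether that user can reach the binary and the language data. If it returns text here but searches still find nothing, the indexing queue is the next thing to look at.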
