Hi,
I've got everything up and running on Ubuntu 14.04. The problem is OCR'ing PDF files.
If I convert them to PNG files on the command line with "convert -density 200 -quality 90" and upload them to OpenKM, everything gets recognized fine.
But if I upload the source PDF file itself, I only get garbage text and can't full-text search the document.
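To make the manual route concrete, here is a minimal sketch of it in Python. It assumes a single-page PDF and that `convert` and `tesseract` are on the PATH. The helper names `build_commands` and `ocr_pdf` and the example paths are made up for illustration, and the second step only mirrors my system.ocr setting; I don't know that OpenKM calls the tools in exactly this way.

```python
import subprocess
from pathlib import Path

DENSITY = "200"   # same values as in the convert command above
QUALITY = "90"
LANGUAGE = "nld"  # Dutch, as in my system.ocr setting


def build_commands(pdf_path, work_dir):
    """Return the convert and tesseract argument lists for one single-page PDF."""
    pdf = Path(pdf_path)
    png = Path(work_dir) / (pdf.stem + ".png")
    text_base = Path(work_dir) / pdf.stem  # tesseract appends ".txt" itself
    convert_cmd = ["convert", "-density", DENSITY, "-quality", QUALITY,
                   str(pdf), str(png)]
    ocr_cmd = ["tesseract", str(png), str(text_base), "-l", LANGUAGE]
    return convert_cmd, ocr_cmd


def ocr_pdf(pdf_path, work_dir):
    """Run both steps and return the recognized text.

    Needs ImageMagick and Tesseract installed; raises if either step fails.
    """
    convert_cmd, ocr_cmd = build_commands(pdf_path, work_dir)
    subprocess.run(convert_cmd, check=True)
    subprocess.run(ocr_cmd, check=True)
    text_file = Path(work_dir) / (Path(pdf_path).stem + ".txt")
    return text_file.read_text(encoding="utf-8")
```

Calling `ocr_pdf("/tmp/scan.pdf", "/tmp/out")` with an existing output directory would leave `scan.png` and `scan.txt` in `/tmp/out`. With a multi-page PDF, ImageMagick writes several numbered PNG files instead, which this sketch does not handle.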
Settings are:

org.apache.jackrabbit.extractor.PlainTextExtractor
org.apache.jackrabbit.extractor.MsWordTextExtractor
org.apache.jackrabbit.extractor.MsExcelTextExtractor
org.apache.jackrabbit.extractor.MsPowerPointTextExtractor
org.apache.jackrabbit.extractor.OpenOfficeTextExtractor
org.apache.jackrabbit.extractor.RTFTextExtractor
org.apache.jackrabbit.extractor.HTMLTextExtractor
org.apache.jackrabbit.extractor.XMLTextExtractor
org.apache.jackrabbit.extractor.PngTextExtractor
org.apache.jackrabbit.extractor.MsOutlookTextExtractor
com.openkm.extractor.Tesseract3TextExtractor
com.openkm.extractor.PdfTextExtractor
com.openkm.extractor.AudioTextExtractor
com.openkm.extractor.ExifTextExtractor
com.openkm.extractor.SourceCodeTextExtractor
com.openkm.extractor.MsOffice2007TextExtractor

system.imagemagick.convert String /usr/bin/convert -density 200 -quality 90
system.ocr String /usr/bin/tesseract ${fileIn} ${fileOut} -l nld
system.swftools.pdf2swf String /opt/openkm-6.3.0-community/tomcat/bin/pdf2swf -f -T 9 -t -s storeallcharacters ${fileIn} -o ${fileOut}
system.openoffice.dictionary String
system.openoffice.path String /usr/lib/libreoffice
system.pdf.force.ocr Boolean Inactive

Anyone who has the golden answer for me?