My Tesseract 3.0 setup works well, but only with TIF files. When I check text extraction, JPG files and scanned PDF files do not work.
Updated:
After I unchecked the force OCR PDF option and moved com.openkm.extractor.Tesseract3TextExtractor after com.openkm.extractor.PdfTextExtractor in the text.extractors key, full-text search now finds both a DOCX file saved as PDF and a scanned (image) PDF file.
Code:
org.apache.jackrabbit.extractor.PlainTextExtractor
org.apache.jackrabbit.extractor.MsWordTextExtractor
org.apache.jackrabbit.extractor.MsExcelTextExtractor
org.apache.jackrabbit.extractor.MsPowerPointTextExtractor
org.apache.jackrabbit.extractor.OpenOfficeTextExtractor
org.apache.jackrabbit.extractor.RTFTextExtractor
org.apache.jackrabbit.extractor.HTMLTextExtractor
org.apache.jackrabbit.extractor.XMLTextExtractor
org.apache.jackrabbit.extractor.PngTextExtractor
org.apache.jackrabbit.extractor.MsOutlookTextExtractor
com.openkm.extractor.PdfTextExtractor
com.openkm.extractor.Tesseract3TextExtractor
com.openkm.extractor.AudioTextExtractor
com.openkm.extractor.SourceCodeTextExtractor
com.openkm.extractor.MsOffice2007TextExtractor
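
In case it helps anyone comparing their own list against the one above, here is a minimal sketch in Java that checks whether one extractor is listed before another. It only does string handling on a pasted list and does not call any OpenKM API; the class name ExtractorOrderCheck and the methods parse and isBefore are hypothetical names made up for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of OpenKM: inspects a pasted extractor list.
public class ExtractorOrderCheck {

    // Splits a newline-separated extractor list into trimmed, non-empty class names.
    public static List<String> parse(String value) {
        return Arrays.stream(value.split("\\R"))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    // True when both classes are listed and 'first' appears before 'second'.
    public static boolean isBefore(List<String> extractors, String first, String second) {
        int i = extractors.indexOf(first);
        int j = extractors.indexOf(second);
        return i >= 0 && j >= 0 && i < j;
    }

    public static void main(String[] args) {
        // Shortened copy of the list from the post, in the order that worked.
        String value = String.join("\n",
                "org.apache.jackrabbit.extractor.PlainTextExtractor",
                "com.openkm.extractor.PdfTextExtractor",
                "com.openkm.extractor.Tesseract3TextExtractor");
        List<String> extractors = parse(value);
        System.out.println(isBefore(extractors,
                "com.openkm.extractor.PdfTextExtractor",
                "com.openkm.extractor.Tesseract3TextExtractor"));
    }
}
```

The check returns false if either class is missing from the list, so a typo in a class name shows up as a failed check rather than a silent pass.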
Updated:
In the registered.text.extractors key I removed exifTextextractor, and I left the system.openoffice.dictionary key, which links to the .oxt file, empty. Now, when I check text extraction, scanned (image) PDF files work well, with about 80% of the Vietnamese text extracted accurately, and JPG files work too. However, when I save a DOCX file as PDF (Word 2010) and upload it to OpenKM, no text is extracted. Why?
Code:
org.apache.jackrabbit.extractor.PlainTextExtractor
org.apache.jackrabbit.extractor.MsWordTextExtractor
org.apache.jackrabbit.extractor.MsExcelTextExtractor
org.apache.jackrabbit.extractor.MsPowerPointTextExtractor
org.apache.jackrabbit.extractor.OpenOfficeTextExtractor
org.apache.jackrabbit.extractor.RTFTextExtractor
org.apache.jackrabbit.extractor.HTMLTextExtractor
org.apache.jackrabbit.extractor.XMLTextExtractor
org.apache.jackrabbit.extractor.PngTextExtractor
org.apache.jackrabbit.extractor.MsOutlookTextExtractor
com.openkm.extractor.Tesseract3TextExtractor
com.openkm.extractor.PdfTextExtractor
com.openkm.extractor.AudioTextExtractor
com.openkm.extractor.SourceCodeTextExtractor
com.openkm.extractor.MsOffice2007TextExtractor