Hello, I have a problem with the text extractor in OpenKM (CE): it is not extracting the text from any PDF, even though all of them are reported as processed correctly (checked from the Administration panel).
As an example, this is one of the documents that was successfully analyzed by OCR, but its 'NDC_TEXT' column is actually empty.
That PDF, and the other ones I have uploaded to OpenKM, are all OCR compatible and fully searchable.
I left the list of included text extractors (registered.text.extractors) at its default:
com.openkm.extractor.PlainTextExtractor
com.openkm.extractor.MsWordTextExtractor
com.openkm.extractor.MsExcelTextExtractor
com.openkm.extractor.MsPowerPointTextExtractor
com.openkm.extractor.OpenOfficeTextExtractor
com.openkm.extractor.RTFTextExtractor
com.openkm.extractor.HTMLTextExtractor
com.openkm.extractor.XMLTextExtractor
com.openkm.extractor.MsOutlookTextExtractor
com.openkm.extractor.PdfTextExtractor
com.openkm.extractor.AudioTextExtractor
com.openkm.extractor.ExifTextExtractor
com.openkm.extractor.Tesseract3TextExtractor
com.openkm.extractor.SourceCodeTextExtractor
com.openkm.extractor.MsOffice2007TextExtractor
I also left system.ocr without any value. I did try setting it to tesseract (value=/usr/bin/tesseract ${fileIn} ${fileOut} -l spa), but since that is used for images, I cleared the field again (it didn't extract any text in the test I made).
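In case it helps anyone reproduce this outside OpenKM, here is a minimal sketch of how I understand the system.ocr value to expand into a tesseract command line. This is only my assumption about the placeholder substitution, not OpenKM's actual code; the function name and the /tmp paths are hypothetical.

```python
import shlex
from string import Template

# The system.ocr value I tried; ${fileIn} and ${fileOut} are the
# placeholders used in that setting.
OCR_TEMPLATE = "/usr/bin/tesseract ${fileIn} ${fileOut} -l spa"


def expand_ocr_command(template, file_in, file_out):
    """Fill in the two placeholders and return the command as an argv list.

    Assumption: plain textual substitution of ${fileIn} / ${fileOut}.
    Paths are shell-quoted first so names with spaces stay one argument.
    """
    filled = Template(template).substitute(
        fileIn=shlex.quote(file_in),
        fileOut=shlex.quote(file_out),
    )
    return shlex.split(filled)


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    print(expand_ocr_command(OCR_TEMPLATE, "/tmp/page.png", "/tmp/page"))
```

The resulting list can be run by hand inside the container to see whether tesseract itself produces any text for an image; as far as I know, tesseract takes an output base name and adds the .txt extension itself.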
Any ideas?
- - - - - - - - - - - - - - - - -
Some useful information:
- Docker installation (v6.3.11).
Last edited by LauryMenton on Mon Jul 26, 2021 3:21 pm, edited 1 time in total.