
Text extraction of pdf files

Posted: Mon Jun 20, 2022 11:19 am
by mbrain
I have installed the community version 6.3.11. Office documents are indexed correctly and full-text search works; the only problem is with PDF files. When I export a Word document that was successfully indexed to PDF and upload it, the log shows:
Code:
2022-06-20 12:50:08,723 [Thread-22] WARN  c.o.extractor.CuneiformTextExtractor - Undefined OCR application
2022-06-20 12:50:08,724 [Thread-22] WARN  com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:trash/okmAdmin/Prozessliste_2017.pdf': Too few text extracted
What does "Undefined OCR application" mean? Is no OCR engine included in the bundle? I read that Tesseract has to be installed manually; is that correct?

Thank you in advance

Re: Text extraction of pdf files

Posted: Tue Jun 28, 2022 7:27 am
by jllort
Cuneiform is a very old OCR engine and should be disabled; you must enable Tesseract OCR instead.

The system.ocr configuration parameter should be set to use Tesseract. Take a look here: https://docs.openkm.com/kcenter/view/ok ... ngine.html
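
Before changing system.ocr, it can help to confirm that a Tesseract executable is actually present on the server. The following is a minimal sketch, assuming Tesseract is installed separately and is reachable on the PATH of the user running OpenKM. The helper names find_ocr_binary and ocr_version are hypothetical and are not part of OpenKM; the exact value expected by system.ocr should be taken from the linked documentation.

```python
import shutil
import subprocess
from typing import Optional


def find_ocr_binary(name: str = "tesseract") -> Optional[str]:
    """Return the absolute path of the OCR executable, or None if it is not on PATH."""
    return shutil.which(name)


def ocr_version(path: str) -> str:
    """Run '<path> --version' and return the first line of its output."""
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Some Tesseract releases print the version banner to stderr, so check both streams.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else ""


if __name__ == "__main__":
    path = find_ocr_binary()
    if path is None:
        print("tesseract not found on PATH; install it before setting system.ocr")
    else:
        print(f"found {path}: {ocr_version(path)}")
```

If the script reports a path, that path is a reasonable candidate for the executable referenced in the system.ocr setting; if it reports nothing, the "Undefined OCR application" warning is likely to persist until an OCR engine is installed and configured.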

I have attached some screenshots of the configuration:
Selección_059.png (26.38 KiB)
Selección_060.png (88.7 KiB)
Selección_061.png (47.03 KiB)