Open Source Document Management System | OpenKM

Posted: **Mon Jun 20, 2022 11:19 am**

I have installed the community version 6.3.11. Office documents are indexed correctly and full-text search works; the only problem is with PDF files. When I take a Word document that was indexed successfully, export it as PDF, and upload it, the log shows:

2022-06-20 12:50:08,723 [Thread-22] WARN  c.o.extractor.CuneiformTextExtractor - Undefined OCR application
2022-06-20 12:50:08,724 [Thread-22] WARN  com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:trash/okmAdmin/Prozessliste_2017.pdf': Too few text extracted

What does "Undefined OCR application" mean? Is there no OCR engine included in the bundle? I read that Tesseract has to be installed manually; is that correct?

Thank you in advance

Posted: **Tue Jun 28, 2022 7:27 am**

Cuneiform is a very old OCR engine and should be disabled; enable Tesseract OCR instead.

The system.ocr configuration parameter should be set to use Tesseract. Take a look here: https://docs.openkm.com/kcenter/view/ok ... ngine.html

I have attached some screenshots of the configuration:

Selección_059.png (26.38 KiB)

Selección_060.png (88.7 KiB)

Selección_061.png (47.03 KiB)
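
Before pointing system.ocr at Tesseract, it may help to confirm that the binary is actually installed on the server. The following is only a minimal sketch in Python, not part of OpenKM: the helper names `find_ocr_binary` and `ocr_version` are made up for illustration, and it assumes the executable is called `tesseract` and is on the PATH of the user running the application server. The exact value that system.ocr expects is covered by the linked documentation and is not shown here.

```python
import shutil
import subprocess
from typing import Optional


def find_ocr_binary(name: str = "tesseract") -> Optional[str]:
    """Return the absolute path of the executable, or None if it is not on PATH."""
    return shutil.which(name)


def ocr_version(binary_path: str) -> str:
    """Return the first line printed by `<binary> --version`."""
    result = subprocess.run(
        [binary_path, "--version"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Depending on the release, the version banner may go to stdout or stderr.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else "unknown"


if __name__ == "__main__":
    path = find_ocr_binary()
    if path is None:
        print("tesseract not found on PATH; install it before setting system.ocr")
    else:
        print(f"found {path}: {ocr_version(path)}")
```

If the script reports that the binary is missing, the "Undefined OCR application" warning is expected, since no OCR engine is available to be configured.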

Text extraction of pdf files