Text extraction of PDF files

 #53617  by mbrain
 
I have installed the community version 6.3.11. Office documents are indexed correctly and full-text search works; the problem is only with PDF files. When I export a Word document that was successfully indexed to PDF and upload it, the log shows:
Code:
2022-06-20 12:50:08,723 [Thread-22] WARN  c.o.extractor.CuneiformTextExtractor - Undefined OCR application
2022-06-20 12:50:08,724 [Thread-22] WARN  com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:trash/okmAdmin/Prozessliste_2017.pdf': Too few text extracted
What does "Undefined OCR application" mean? Is there no OCR engine included in the bundle? I read that Tesseract has to be installed manually; is that correct?

Thank you in advance
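
For anyone hitting the same warnings, here is a minimal sketch for listing which documents failed text extraction. It assumes the log lines have exactly the shape quoted above; the function name `extraction_failures` and the regular expression are illustrative and not part of OpenKM.

```python
import re

# Assumption: the warning always has the form seen in the log excerpt above:
#   There was a problem extracting text from '<path>': <reason>
# Paths that themselves contain a single quote are not handled by this sketch.
WARN_RE = re.compile(
    r"There was a problem extracting text from '(?P<path>[^']+)': (?P<reason>.+)$"
)


def extraction_failures(lines):
    """Return (path, reason) tuples for every extraction warning in the given lines."""
    failures = []
    for line in lines:
        match = WARN_RE.search(line)
        if match:
            failures.append((match.group("path"), match.group("reason").strip()))
    return failures


if __name__ == "__main__":
    import sys

    # Usage: python scan_log.py /path/to/logfile (the log location depends on your setup)
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for path, reason in extraction_failures(handle):
            print(f"{path}: {reason}")
```

Lines that do not match, such as the "Undefined OCR application" warning, are simply skipped.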
 #53635  by jllort
 
Cuneiform is a very old OCR engine and should be disabled. Enable Tesseract OCR instead.

The system.ocr configuration parameter should be configured with Tesseract. Take a look here: https://docs.openkm.com/kcenter/view/ok ... ngine.html
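
Before setting system.ocr, it may help to confirm that Tesseract is actually installed on the server. The following is a rough sketch under the assumption that the binary is named `tesseract` and is on the PATH of the user running OpenKM; the helper names are made up for illustration, and the exact value system.ocr expects should be taken from the linked documentation.

```python
import shutil
import subprocess


def find_ocr_binary(name="tesseract"):
    """Return the absolute path of the OCR binary found on PATH, or None if it is missing."""
    return shutil.which(name)


def ocr_version(path):
    """Run '<path> --version' and return the first line of whatever it prints."""
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Some Tesseract releases print the version to stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else ""


if __name__ == "__main__":
    path = find_ocr_binary()
    if path is None:
        print("tesseract not found on PATH; install it before configuring system.ocr")
    else:
        print(f"found {path}: {ocr_version(path)}")
```

If the script reports a path, that path is a reasonable candidate to use when configuring the OCR parameter.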

I attach some screenshots of the configuration:
Selección_059.png (26.38 KiB)
Selección_060.png (88.7 KiB)
Selección_061.png (47.03 KiB)
