Text extraction is OK but search terms get cut off

  • OpenKM has many interesting features, but requires some configuration process to show its full potential.
 #53312  by creatronics
 
Hi,
my documents are indexed and search generally works. Sometimes, though, a search term has no match even when it appears as plain text in a PDF document.

For example:
In one document, text extraction shows words like:
paraller Auslesungsfunktion • USB Adapter für • SD • usw Gesamt Lieferanschrift Handelsregister

When I look at the index list, some of these words are cut off:
lieferanschrif auslesungsfunktio parall handelsregist

Sometimes one character is missing, sometimes more.

If you enter Auslesungsfunktio into the search, you get a match, while the correct word won't find anything.
Any ideas what to do? Rebuilding the index did not help.
 #53325  by jllort
 
I suppose these PDFs are image-based. In that scenario the OCR engine takes the images and tries to extract their contents, which is sometimes not as accurate as you would wish. Make sure the Tesseract engine is using the right language, set with the -l parameter ( https://tesseract-ocr.github.io/tessdoc ... sions.html ). With Tesseract 4 you can configure several languages in this manner: -l eng+deu
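
As a rough illustration of the -l parameter mentioned above, here is a minimal Python sketch that only assembles the Tesseract command line for several languages. The function name and the file names are hypothetical, and it assumes the standard "tesseract IMAGE OUTPUTBASE -l LANG" call form with the eng and deu language data installed. How OpenKM itself invokes the OCR engine depends on your configuration and is not shown here.

```python
import shlex
from typing import List, Sequence


def build_tesseract_command(image_path: str, output_base: str,
                            languages: Sequence[str]) -> List[str]:
    """Return the argument list for: tesseract IMAGE OUTPUTBASE -l LANG[+LANG...]"""
    if not languages:
        raise ValueError("at least one language code is required")
    # Several languages are joined with '+', e.g. eng+deu
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


if __name__ == "__main__":
    # Hypothetical file names; the command is printed, not executed.
    cmd = build_tesseract_command("scan.png", "scan", ["eng", "deu"])
    print(shlex.join(cmd))
```

Running the printed command by hand on one of the affected pages is a quick way to check whether the OCR output itself contains the full words.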

As for how to get all the documents indexed again: the table OKM_NODE_DOCUMENT has a column named NDC_TEXT_EXTRACTED (when this value is set to F, the document is pending text extraction). You can update this table directly, or in combination with OKM_NODE_BASE, filtering on NBS_NAME with something like '%.pdf' to select only PDF files. You will find valuable information about the database structure here: https://docs.openkm.com/kcenter/view/ok ... ption.html
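
The update described above could look roughly like the following sketch. It runs against a small in-memory SQLite stand-in, not a real OpenKM database. The table and column names OKM_NODE_DOCUMENT, NDC_TEXT_EXTRACTED, OKM_NODE_BASE and NBS_NAME come from the post; the join column NBS_UUID, the LOWER() call and the sample rows are assumptions made for illustration, so check them against the database documentation and take a backup before running anything similar on a live system.

```python
import sqlite3

# Assumption: both tables share the key column NBS_UUID.
# LOWER() is added here so that names ending in .PDF are matched as well.
REQUEUE_PDFS_SQL = """
UPDATE OKM_NODE_DOCUMENT
   SET NDC_TEXT_EXTRACTED = 'F'
 WHERE NBS_UUID IN (
       SELECT NBS_UUID
         FROM OKM_NODE_BASE
        WHERE LOWER(NBS_NAME) LIKE '%.pdf')
"""


def make_demo_db() -> sqlite3.Connection:
    """Create a tiny stand-in schema with made-up rows (not real OpenKM data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE OKM_NODE_BASE (NBS_UUID TEXT PRIMARY KEY, NBS_NAME TEXT);
        CREATE TABLE OKM_NODE_DOCUMENT (NBS_UUID TEXT PRIMARY KEY,
                                        NDC_TEXT_EXTRACTED TEXT);
        INSERT INTO OKM_NODE_BASE VALUES
            ('u1', 'invoice.pdf'), ('u2', 'notes.txt'), ('u3', 'SCAN.PDF');
        INSERT INTO OKM_NODE_DOCUMENT VALUES
            ('u1', 'T'), ('u2', 'T'), ('u3', 'T');
    """)
    return conn


def requeue_pdfs(conn: sqlite3.Connection) -> int:
    """Mark every PDF as pending text extraction; return the number of rows changed."""
    cur = conn.execute(REQUEUE_PDFS_SQL)
    conn.commit()
    return cur.rowcount


if __name__ == "__main__":
    demo = make_demo_db()
    print("documents re-queued:", requeue_pdfs(demo))
```

The SQL statement itself is the part that carries over; the Python wrapper is only there so the query can be tried safely on throwaway data first.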
