text extract is ok but search terms get cut off

Posted: Mon Mar 07, 2022 2:51 pm
by creatronics
Hi,
My documents are indexed and search is generally working. Sometimes, though, a search term has no match even though it is just plain text in a PDF document.

For example:
In one document, text extraction shows words like:
paraller Auslesungsfunktion • USB Adapter für • SD • usw Gesamt Lieferanschrift Handelsregister

When I take a look at the list indexes, some of these words are cut off:
lieferanschrif auslesungsfunktio parall handelsregist

Sometimes one character is missing, sometimes more.

If you enter Auslesungsfunktio into the search, you get a match, while the correct word won't find anything.
Any ideas what to do? Rebuilding the index did not work.

Re: text extract is ok but search terms get cut off

Posted: Sat Mar 12, 2022 10:11 am
by jllort
I suppose these PDFs are image-based. In that scenario the OCR engine takes the images and tries to extract the content, which is sometimes not as accurate as you would wish. Make sure you choose the right language in the Tesseract engine with the -l parameter ( https://tesseract-ocr.github.io/tessdoc ... sions.html ). With Tesseract 4 you can configure several languages this way: -l eng+deu
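
As a rough illustration of the -l parameter, here is a minimal Python sketch that builds a Tesseract command line with several languages joined by "+". The file names scan.png and scan_text and the helper build_tesseract_command are hypothetical, and this only shows the command-line form; how OpenKM itself invokes the OCR engine depends on your configuration.

```python
import os
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, languages):
    """Return the argument list for one Tesseract run (hypothetical helper)."""
    if not languages:
        raise ValueError("at least one language code is required")
    # Several languages are passed as a single value joined with '+'.
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


# 'scan.png' and 'scan_text' are placeholder names, not taken from the thread.
cmd = build_tesseract_command("scan.png", "scan_text", ["eng", "deu"])
print(" ".join(cmd))

# Run OCR only if the tesseract binary and the sample image are really there.
if shutil.which("tesseract") and os.path.exists("scan.png"):
    subprocess.run(cmd, check=True)
```

For German documents like the one in the first post, the deu language data has to be installed for Tesseract, otherwise the run will fail when it tries to load that language.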

As for how to mark all documents to be indexed again: there is a column named NDC_TEXT_EXTRACTED in the table OKM_NODE_DOCUMENT (when this value is set to F, the document is pending text extraction). You can update this table directly, or combine it with OKM_NODE_BASE and filter by NBS_NAME with something like '%.pdf' to select only PDF files (you will find valuable information about the database structure here: https://docs.openkm.com/kcenter/view/ok ... ption.html )
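
The update described above could look roughly like the following sketch. It runs against a throwaway in-memory SQLite database with a toy two-table schema, not against a real OpenKM database. The join columns NDC_UUID and NBS_UUID are an assumption on my side (only the table names, NDC_TEXT_EXTRACTED and NBS_NAME come from this thread), so check them against the database structure documentation and take a backup before running anything similar on a live system.

```python
import sqlite3

# Mark every PDF as pending text extraction again ('F').
# ASSUMPTION: the two tables are linked via NDC_UUID = NBS_UUID.
RESET_PDF_EXTRACTION = """
UPDATE OKM_NODE_DOCUMENT
SET NDC_TEXT_EXTRACTED = 'F'
WHERE NDC_UUID IN (
    SELECT NBS_UUID FROM OKM_NODE_BASE WHERE NBS_NAME LIKE '%.pdf'
)
"""


def demo():
    """Apply the update to a toy schema and return (name, flag) rows."""
    conn = sqlite3.connect(":memory:")
    # Toy tables with only the columns needed for the illustration.
    conn.executescript("""
        CREATE TABLE OKM_NODE_BASE (NBS_UUID TEXT PRIMARY KEY, NBS_NAME TEXT);
        CREATE TABLE OKM_NODE_DOCUMENT (NDC_UUID TEXT PRIMARY KEY,
                                        NDC_TEXT_EXTRACTED TEXT);
        INSERT INTO OKM_NODE_BASE VALUES ('u1', 'invoice.pdf'), ('u2', 'notes.txt');
        INSERT INTO OKM_NODE_DOCUMENT VALUES ('u1', 'T'), ('u2', 'T');
    """)
    conn.execute(RESET_PDF_EXTRACTION)
    rows = conn.execute("""
        SELECT b.NBS_NAME, d.NDC_TEXT_EXTRACTED
        FROM OKM_NODE_BASE b
        JOIN OKM_NODE_DOCUMENT d ON d.NDC_UUID = b.NBS_UUID
        ORDER BY b.NBS_NAME
    """).fetchall()
    conn.close()
    return rows


rows = demo()
for name, flag in rows:
    print(name, flag)
```

Only the PDF row is switched to F here; the other document keeps its flag, which is the point of filtering by NBS_NAME.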