
Lucene Index does not contain extracted Text

Posted: Thu Jul 23, 2015 1:30 pm
by Catscratch
Hi guys,

I have a question. I have OCR'd PDFs that contain text. The text extraction mechanism extracted everything successfully.

When I check the extraction in the backend (Administration -> Utilities -> Check text extraction), I find a lot of extracted text for the documents, so every document was extracted successfully.

However, full-text search does not find every document. A few of them are still not found.

To check this, I went back to Administration -> Utilities -> List indexes and activated "Show terms". Many documents have the extracted text as terms, but the ones I can't search for don't contain any terms.

I also tried rebuilding the indexes a few times (Administration -> Utilities -> Rebuild indexes -> Lucene indexes), but without success. The terms for some documents are still empty.

So my question is: where do these terms for documents come from? And do you have any idea what's going wrong here?

Regards!

Re: Lucene Index does not contain extracted Text

Posted: Sat Jul 25, 2015 10:07 am
by jllort
Rebuild indexes does not perform the text extraction again.
Take a look at your OKM_NODE_DOCUMENT table and filter (from the database query) by the NBS_UUID field (the document UUID). Is the problematic word present in the NDV_TEXT field?
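
The check described above could be sketched roughly as follows. This is only an illustration, not OpenKM code: it assumes an in-memory SQLite table standing in for the real database. The table and column names are taken from this thread, with the text column spelled NDC_TEXT as in the follow-up post. The helper names, the sample UUIDs and the sample text are hypothetical.

```python
import sqlite3


def text_contains_word(conn, uuid, word):
    """Return True if the extracted text stored for `uuid` contains `word`."""
    row = conn.execute(
        "SELECT NDC_TEXT FROM OKM_NODE_DOCUMENT WHERE NBS_UUID = ?",
        (uuid,),
    ).fetchone()
    # No such document, or no extracted text stored for it.
    if row is None or row[0] is None:
        return False
    # Case-insensitive substring match on the stored text.
    return word.lower() in row[0].lower()


def make_demo_db():
    """Build a tiny stand-in table (hypothetical sample rows, not real data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE OKM_NODE_DOCUMENT (NBS_UUID TEXT PRIMARY KEY, NDC_TEXT TEXT)"
    )
    conn.execute(
        "INSERT INTO OKM_NODE_DOCUMENT VALUES (?, ?)",
        ("demo-uuid-1", "Invoice scanned with OCR"),
    )
    conn.execute(
        "INSERT INTO OKM_NODE_DOCUMENT VALUES (?, ?)",
        ("demo-uuid-2", None),
    )
    return conn
```

Against a real installation, only the SELECT statement would be relevant; it could be run from the database query screen with the UUID of a problematic document.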

Re: Lucene Index does not contain extracted Text

Posted: Mon Jul 27, 2015 6:44 am
by Catscratch
You mean NDC_TEXT, right?

Yeah. All documents have text in this field, but not all documents contain terms. After rebuilding the Lucene index for the Nth time, it seems to work now, at least for the problematic documents I used for verification. I don't know whether it worked for all documents.

Is there a way to check if there are any documents in the repository with empty terms?
(Administration -> Utilities -> List indexes -> Search indexes -> "Show terms" ... and there the last field "terms")

Re: Lucene Index does not contain extracted Text

Posted: Wed Jul 29, 2015 10:09 am
by jllort
A query for documents with no terms cannot be run against Lucene the way it can against a database. The only option would be, after the Lucene engine has finished setting the terms, to perform a query and, if the result is empty, record that in a database column.
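
As a rough sketch of that idea (not an existing OpenKM feature): once indexing has finished, look up the terms for each document and store a flag in a database column, so that documents without terms can later be found with a plain database query. Everything named here is an assumption for illustration: the INDEX_CHECK table, its HAS_TERMS column, and the `get_terms` callable, which stands in for whatever lookup against the Lucene index is actually available.

```python
import sqlite3


def record_term_status(conn, uuids, get_terms):
    """Store, per document, whether the index lookup returned any terms.

    `get_terms` is a hypothetical callable: uuid -> list of indexed terms.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS INDEX_CHECK ("
        "UUID TEXT PRIMARY KEY, HAS_TERMS INTEGER NOT NULL)"
    )
    for uuid in uuids:
        # 1 if the lookup returned at least one term, otherwise 0.
        has_terms = 1 if get_terms(uuid) else 0
        conn.execute(
            "INSERT OR REPLACE INTO INDEX_CHECK (UUID, HAS_TERMS) VALUES (?, ?)",
            (uuid, has_terms),
        )
    conn.commit()


def documents_without_terms(conn):
    """Return the UUIDs flagged as having no terms, sorted by UUID."""
    rows = conn.execute(
        "SELECT UUID FROM INDEX_CHECK WHERE HAS_TERMS = 0 ORDER BY UUID"
    ).fetchall()
    return [row[0] for row in rows]
```

With the flag stored in a table, the question "which documents have empty terms?" becomes an ordinary SELECT on that column.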

Re: Lucene Index does not contain extracted Text

Posted: Wed Jul 29, 2015 2:13 pm
by Catscratch
Is there a way to query the Lucene index through the admin panel -> Database Query -> Hibernate?

If so, what would the query look like? I don't have much knowledge of Hibernate syntax.
And if not, how can I query the Lucene index?

Re: Lucene Index does not contain extracted Text

Posted: Thu Jul 30, 2015 7:09 am
by pavila
I was not able to reproduce the issue with the current development code. Please try to reproduce it with last night's build from http://integration.openkm.com/6.3/ and, if you can, please give me detailed steps to reproduce it.

Thanks.

Re: Lucene Index does not contain extracted Text

Posted: Thu Jul 30, 2015 7:30 am
by Catscratch
OK, I'll report back when I find something.