Open Source Document Management System | OpenKM - Lucene Index does not contain extracted Text

Lucene Index does not contain extracted Text

#40145 by Catscratch
Thu Jul 23, 2015 1:30 pm

Hi guys,

I have a question. I have OCR'd PDFs that contain text. The text extraction mechanism extracted everything successfully.

When I check the extraction in the backend (Administration -> Utilities -> Check text extraction), I find plenty of extracted text for the documents, so every document was extracted successfully.

However, full-text search does not find every document. A small number of them are never found.

To check this, I went back to Administration -> Utilities -> List indexes and activated "Show terms". Many documents have the extracted text as terms, but the ones I can't search for don't contain any terms.

I also tried rebuilding the indexes a few times (Administration -> Utilities -> Rebuild indexes -> Lucene indexes), but without success. The terms for some documents are still empty.

So my question is: where do these terms for documents come from? And do you have any idea what's going wrong here?

Regards!

Last edited by Catscratch on Mon Jul 27, 2015 6:38 am, edited 2 times in total.

Username

Catscratch

Rank

Platinum Boarder

Posts

336

Joined

Wed Feb 16, 2011 10:35 am

Re: Lucene Index does not contain extracted Text

#40163 by jllort
Sat Jul 25, 2015 10:07 am

Rebuilding the indexes does not perform the text extraction again.
Take a look at your OKM_NODE_DOCUMENT table and filter (from a database query) by the NBS_UUID field (the document UUID). Is the problematic word present in the NDV_TEXT field?

Username

jllort

Rank

Moderator

Posts

12182

Joined

Fri Dec 21, 2007 11:23 am

Location

Sineu - ( Illes Balears ) - Spain

Re: Lucene Index does not contain extracted Text

#40170 by Catscratch
Mon Jul 27, 2015 6:44 am

You mean NDC_TEXT, right?

Yes. All documents have text in this field, but not all documents contain terms. After rebuilding the Lucene index for the Nth time, it seems to work now, at least for the problematic documents I used for verification. I don't know whether it now works for all documents.

Is there a way to check if there are any documents in the repository with empty terms?
(Administration -> Utilities -> List indexes -> Search indexes -> "Show terms" ... and there the last field "terms")

Re: Lucene Index does not contain extracted Text

#40180 by jllort
Wed Jul 29, 2015 10:09 am

Lucene cannot be queried the way a database can to find documents with no terms. The only option would be, once the Lucene engine has finished setting the terms for a document, to perform a query and, if the result is empty, record that in a database column.
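
For what it's worth, the check described above could be sketched roughly as follows. This is only an illustration in plain Java, not OpenKM or Lucene code: the class `EmptyTermsCheck`, the method `findDocsWithoutTerms`, and the two maps (standing in for the extracted text column and for the terms shown under "Show terms") are all hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: flags documents that have extracted text but no indexed terms.
class EmptyTermsCheck {

    // extractedText: document UUID -> extracted text (assumed to come from the database).
    // indexedTerms:  document UUID -> terms found in the index for that document (assumed).
    static List<String> findDocsWithoutTerms(Map<String, String> extractedText,
                                             Map<String, Set<String>> indexedTerms) {
        List<String> missing = new ArrayList<>();
        for (Map.Entry<String, String> entry : extractedText.entrySet()) {
            String text = entry.getValue();
            boolean hasText = text != null && !text.trim().isEmpty();
            Set<String> terms = indexedTerms.get(entry.getKey());
            boolean hasTerms = terms != null && !terms.isEmpty();
            // Only documents with text but without terms are suspicious.
            if (hasText && !hasTerms) {
                missing.add(entry.getKey());
            }
        }
        Collections.sort(missing); // stable output order for reporting
        return missing;
    }

    public static void main(String[] args) {
        Map<String, String> text = new HashMap<>();
        text.put("doc-1", "invoice total");
        text.put("doc-2", "scanned contract");
        text.put("doc-3", ""); // nothing extracted, so empty terms are expected

        Map<String, Set<String>> terms = new HashMap<>();
        terms.put("doc-1", new HashSet<>(List.of("invoice", "total")));
        terms.put("doc-2", new HashSet<>());

        System.out.println(findDocsWithoutTerms(text, terms));
    }
}
```

The resulting list of UUIDs is what one would then store in a database column (or simply log) so that the affected documents can be found later with an ordinary query.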

Re: Lucene Index does not contain extracted Text

#40188 by Catscratch
Wed Jul 29, 2015 2:13 pm

Is there a way to query the Lucene index through the admin panel -> Database Query -> Hibernate?

If so, what would the query look like? I don't have much knowledge of Hibernate syntax.
And if not, how can I query the Lucene index?

Re: Lucene Index does not contain extracted Text

#40193 by pavila
Thu Jul 30, 2015 7:09 am

I was not able to reproduce the issue in the current development code. Please try to reproduce it with the latest nightly build from http://integration.openkm.com/6.3/ and, if you can, give me detailed steps to reproduce it.

Thanks.

Username

pavila

Rank

Moderator

Posts

3144

Joined

Tue Dec 11, 2007 6:02 pm

Location

Alicante, Spain

Re: Lucene Index does not contain extracted Text

#40195 by Catscratch
Thu Jul 30, 2015 7:30 am

OK, I'll report back when I find something.
