• Lucene Index does not contain extracted Text

 #40145  by Catscratch
 
Hi guys,

I have a question. I have OCR'd PDFs that contain text. The text extraction mechanism extracted everything successfully.

When I check the extraction in the backend (Administration -> Utilities -> Check text extraction), I find a lot of extracted text for the documents, so every document was extracted successfully.

However, full-text search doesn't work for every document: a few of them are never found.

To check this, I went back to Administration -> Utilities -> List indexes and activated "Show terms". A lot of documents have the extracted text as terms, but the ones I can't search for don't contain any terms.

I also tried rebuilding the indexes a few times (Administration -> Utilities -> Rebuild indexes -> Lucene indexes), but without success. The terms for some documents are still empty.

So my question is: where do these terms for documents come from? And do you have any idea what's going wrong here?

Regards!
Last edited by Catscratch on Mon Jul 27, 2015 6:38 am, edited 2 times in total.
 #40163  by jllort
 
Rebuilding the indexes does not perform the text extraction again.
Take a look at your OKM_NODE_DOCUMENT table and filter (from a database query) by the NBS_UUID field (the document UUID). Is the problematic word present in the NDV_TEXT field?
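
The check described above can be sketched roughly as follows. This is only an illustration, not OpenKM code: it uses an in-memory SQLite table as a stand-in for the real database, the two-column layout is a simplifying assumption (the real schema may differ), the text column is written as NDC_TEXT as clarified in the reply below, and `find_extracted_text` and the sample UUIDs are hypothetical.

```python
import sqlite3


def find_extracted_text(conn, uuid, word):
    """Return (has_text, contains_word) for one document UUID."""
    row = conn.execute(
        "SELECT NDC_TEXT FROM OKM_NODE_DOCUMENT WHERE NBS_UUID = ?",
        (uuid,),
    ).fetchone()
    # A missing row and a NULL text column are both treated as "no text".
    text = row[0] if row and row[0] else ""
    return bool(text), word.lower() in text.lower()


# In-memory stand-in for the real database (assumption: simplified schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE OKM_NODE_DOCUMENT (NBS_UUID TEXT PRIMARY KEY, NDC_TEXT TEXT)"
)
conn.executemany(
    "INSERT INTO OKM_NODE_DOCUMENT VALUES (?, ?)",
    [("uuid-1", "Invoice for scanned contract"), ("uuid-2", None)],
)

print(find_extracted_text(conn, "uuid-1", "contract"))
print(find_extracted_text(conn, "uuid-2", "contract"))
```

If the word shows up here but full-text search still misses the document, the extraction itself worked and the problem is more likely on the index side.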
 #40170  by Catscratch
 
You mean NDC_TEXT, right?

Yeah. All documents have text in this field, but not all documents contain terms. After rebuilding the Lucene index for the Nth time, it seems to work now, at least for the problematic documents I used for verification. I don't know whether it now works for all documents.

Is there a way to check if there are any documents in the repository with empty terms?
(Administration -> Utilities -> List indexes -> Search indexes -> "Show terms" ... and there the last field "terms")
 #40180  by jllort
 
Unlike with a database, you can't run a query on Lucene that looks for documents with no terms. The only option would be, once the Lucene engine has finished setting the terms, to perform a query for each document and, if the result is empty, save that in a database column.
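
A minimal sketch of that idea, under stated assumptions: `terms_by_uuid` is a hypothetical dictionary standing in for whatever the Lucene index returns per document after indexing, and the `DOC_INDEX_STATUS` table with its `HAS_TERMS` column is invented for illustration (it is not part of the OpenKM schema). Once the flag is stored, documents without terms can be found with a plain database query.

```python
import sqlite3


def flag_documents_without_terms(conn, terms_by_uuid):
    """Store a 1/0 HAS_TERMS flag per document; return the UUIDs with no terms."""
    for uuid, terms in terms_by_uuid.items():
        conn.execute(
            "INSERT OR REPLACE INTO DOC_INDEX_STATUS (UUID, HAS_TERMS) VALUES (?, ?)",
            (uuid, 1 if terms else 0),
        )
    conn.commit()
    # The flag is now an ordinary column, so a normal SQL query works.
    rows = conn.execute(
        "SELECT UUID FROM DOC_INDEX_STATUS WHERE HAS_TERMS = 0 ORDER BY UUID"
    ).fetchall()
    return [row[0] for row in rows]


# Hypothetical status table in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DOC_INDEX_STATUS (UUID TEXT PRIMARY KEY, HAS_TERMS INTEGER)"
)

# Hypothetical per-document terms, as they might be read from the index.
terms_by_uuid = {
    "uuid-1": ["invoice", "contract"],
    "uuid-2": [],
    "uuid-3": [],
}

print(flag_documents_without_terms(conn, terms_by_uuid))
```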
 #40188  by Catscratch
 
Is there a way to query the Lucene index through the admin panel -> Database Query -> Hibernate?

If so, what would the query look like? I don't have much knowledge of Hibernate syntax.
And if not, how can I query the Lucene index?
