Open Source Document Management System | OpenKM

Posted: **Wed May 15, 2013 9:31 am**

Hi there,

I'm testing OpenKM and I'm very impressed with the features, but I have a problem with the search function: when I search for a word, not all documents containing it are returned.

I tried rebuilding the indexes (in this order: first "Text extractor" then "Lucene indexes" and finally I used "Optimize indexes", I don't know if this is useful) but still not all instances of the word I'm looking for are returned.

My test case is the word "medicines", contained several times in a particular (rather old) MS Word document. When I search for the word, that particular DOC file is not returned by the search.

I tried converting the file to RTF and to DOCX and both these files are indexed correctly. I also opened the document with LibreOffice and re-saved it in DOC (not DOCX) format, again the indexing works correctly on this "refreshed" file. My final test was opening the original DOC file and re-saving it using MS Word 2011 for Mac: this time the indexing worked correctly as well.

My guess is that the Word 97-2003 Binary File Format is not 100% supported by Lucene or whatever other indexing tool OpenKM is using (I must say this is hardly surprising, given that this particular format is well known to be problematic).

The only workaround I can think of at the moment is either converting all old documents to DOCX or opening and re-saving them with a recent version of Word, but that sounds like a lot of work. Is there any other solution? Has anyone else faced this problem before?

Posted: **Thu May 16, 2013 9:23 pm**

First, you should make sure that the document has actually been indexed. There is a queue (Administration -> Stats -> pending queue), and until the document has been processed you will not find it with the search engine. Text extractors are independent of Lucene (they sit outside Lucene, and you can define whichever text extractor you wish; that's the idea). Using Administration / Database query you can select from OKM_NODE_DOCUMENT, where you can see the extracted text passed to Lucene. In Utilities there is an option "check text extraction" where you can test directly by uploading the document (I'm not sure whether that option is available in the community version).

I suggest you upload the document to our demo.openkm.com and test there too.
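
The check on OKM_NODE_DOCUMENT described above could look roughly like the sketch below. It is only an illustration: the column names `NBS_UUID` and `NDC_TEXT` are assumptions about the schema, and an in-memory SQLite table with made-up rows stands in for the real OpenKM database. It looks up one document's extracted text and reports whether the search term appears in it.

```python
import sqlite3


def extracted_text_contains(conn, uuid, term):
    """Return True/False if the stored extracted text contains `term`
    (case-insensitive), or None if the document is not in the table."""
    row = conn.execute(
        # NBS_UUID and NDC_TEXT are assumed column names.
        "SELECT NDC_TEXT FROM OKM_NODE_DOCUMENT WHERE NBS_UUID = ?",
        (uuid,),
    ).fetchone()
    if row is None:
        return None
    text = row[0] or ""  # a NULL text means nothing was extracted
    return term.lower() in text.lower()


# In-memory stand-in for the real database, with hypothetical rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE OKM_NODE_DOCUMENT (NBS_UUID TEXT PRIMARY KEY, NDC_TEXT TEXT)"
)
conn.executemany(
    "INSERT INTO OKM_NODE_DOCUMENT VALUES (?, ?)",
    [
        ("doc-rtf", "List of medicines and dosages"),
        ("doc-old-word", None),  # extraction produced no text
    ],
)

print(extracted_text_contains(conn, "doc-rtf", "medicines"))
print(extracted_text_contains(conn, "doc-old-word", "medicines"))
```

If the term is missing from the extracted text, the problem lies in text extraction rather than in Lucene, which only indexes what the extractor hands over.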

Posted: **Fri May 17, 2013 11:16 am**

Thank you for your reply, I'll try that and let you know how it goes.

I have a couple of other problems though: yesterday I uploaded a large number of texts via WebDAV (about 5 GB of data). I left it running overnight, but almost none of the new documents (including RTF, PDF and DOCX files) got indexed. It's really strange, because some of them were indexed while others weren't.

At the moment I'm running the "Text extractor". I launched it manually this morning, but it's taking a very long time (I project about 6-7 hours). In a production environment this could be a real problem: I need to be able to count on new documents being indexed automatically and incrementally when I add them. I don't want to schedule nightly rebuilds of the index (I'll have about 10 GB of data, so I'm guessing it could take up to 14 hours to rebuild the index).

Is it possible that the indexing wasn't carried out because I added the files via webdav?
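
The 14-hour figure above follows from scaling the observed run linearly. A minimal sketch of that arithmetic, assuming extraction time grows in proportion to data size (which may not hold, since it depends on file types and CPU load); the function name is made up for illustration:

```python
def projected_hours(measured_gb, measured_hours, target_gb):
    """Scale a measured run linearly to a different data size (assumption)."""
    return measured_hours * target_gb / measured_gb


# 5 GB took an estimated 6 to 7 hours; project that to 10 GB.
low = projected_hours(5, 6, 10)
high = projected_hours(5, 7, 10)
print(low, high)  # 12.0 14.0
```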

Posted: **Mon May 20, 2013 8:14 am**

The documents go to the extraction queue (you can see it at Administration -> Stats -> pending extraction queue).
Bear in mind that the text extractor is a heavy consumer of resources (especially CPU). OpenKM can be configured for more or less aggressive text extraction (the setting can differ between day and night) and to use one or all of the CPUs in your system (the default configuration uses only one).

Take a look here http://wiki.openkm.com/index.php/Applic ... extraction

10 GB is small; storing and indexing it should not be a problem. Having to reindex is not normal. You should investigate which documents are indexed and which are not, and then try to understand why some are not indexed. First identify the ones that are not indexed (take a look at OKM_NODE_DOCUMENT to see which have been extracted and which have not; there is a column which indicates whether the text has been extracted or is still in the queue).

You could force extraction when a document is uploaded, but that is not good practice.
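
As a rough sketch of that first step, the query below separates extracted documents from those still waiting. The flag column name `NDC_TEXT_EXTRACTED` and its `'T'`/`'F'` values are assumptions about the schema, and the in-memory SQLite table with invented rows only stands in for the real database.

```python
import sqlite3


def extraction_status(conn):
    """Return (number of extracted documents, UUIDs still pending)."""
    rows = conn.execute(
        # NBS_UUID and NDC_TEXT_EXTRACTED are assumed column names.
        "SELECT NBS_UUID, NDC_TEXT_EXTRACTED FROM OKM_NODE_DOCUMENT "
        "ORDER BY NBS_UUID"
    ).fetchall()
    pending = [uuid for uuid, flag in rows if flag != "T"]
    return len(rows) - len(pending), pending


# In-memory stand-in for the real database, with hypothetical rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE OKM_NODE_DOCUMENT "
    "(NBS_UUID TEXT PRIMARY KEY, NDC_TEXT_EXTRACTED TEXT)"
)
conn.executemany(
    "INSERT INTO OKM_NODE_DOCUMENT VALUES (?, ?)",
    [("a.pdf", "T"), ("b.docx", "F"), ("c.rtf", "T"), ("d.doc", "F")],
)

extracted, pending = extraction_status(conn)
print(extracted, pending)
```

Comparing the pending list across two runs a few minutes apart shows whether the queue is advancing at all or whether the same documents stay stuck.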

Posted: **Tue Sep 17, 2013 5:41 pm**

I'm having this problem too. I also tried reindexing manually as he did (Administration --> Utilities --> rebuild indexes), but with no luck, even though OpenKM says the indexing completed (it finished too fast and I doubt it: would indexing really take only a few seconds?).

I have about 140 documents (Word, PDF, PPTX...) and I see in Administration --> Stats --> text extraction queue that there are 120 documents pending.

Those 140 documents have been on my server for nearly 2 months. That means only ~20 documents were indexed in about 2 months. Is that too long to index?

My documents vary in size from 200 KB to 10 MB.

I'm using the 6.2.4 community version. Can someone help me make it faster?

I can't search correctly!

Posted: **Wed Sep 18, 2013 5:56 pm**

Which OpenKM version are you using?
I think the index queue has stopped on one file for some reason and will not advance until that file is finished (before telling you how to skip it, I would like to know your OpenKM version).
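
The behaviour suspected here, a queue that stops at one problem file and never reaches the files behind it, can be illustrated with a small sketch. This is not OpenKM's code; `drain`, `fake_extract` and the file names are invented to show the difference between a queue that blocks on a failure and one that skips it.

```python
from collections import deque


def drain(queue, extract, skip_failures):
    """Process the queue in order and return the names that succeeded."""
    done = []
    while queue:
        name = queue[0]
        try:
            extract(name)
        except Exception:
            if not skip_failures:
                break  # the failing head blocks everything behind it
            queue.popleft()  # drop the failing item and carry on
            continue
        queue.popleft()
        done.append(name)
    return done


def fake_extract(name):
    # Hypothetical extractor: one file always fails.
    if name == "broken.doc":
        raise ValueError("cannot store extracted text")


pending = ["a.pdf", "broken.doc", "b.docx", "c.rtf"]
blocked = drain(deque(pending), fake_extract, skip_failures=False)
skipped = drain(deque(pending), fake_extract, skip_failures=True)
print(blocked)  # ['a.pdf']
print(skipped)  # ['a.pdf', 'b.docx', 'c.rtf']
```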

Posted: **Thu Sep 19, 2013 4:34 am**

jllort wrote: Which OpenKM version are you using?
I think the index queue has stopped on one file for some reason and will not advance until that file is finished (before telling you how to skip it, I would like to know your OpenKM version).

I'm using OpenKM 6.2.4 community.

Posted: **Fri Sep 20, 2013 10:12 am**

In OpenKM Professional there are features which can be used to detect problems in text extraction. Sadly, they are not present in Community, and I don't know whether they can be ported without problems.

Meanwhile, you can check whether a document's text has been extracted by querying the OKM_NODE_DOCUMENT table.

Posted: **Sun Sep 22, 2013 12:31 pm**

pavila wrote: In OpenKM Professional there are features which can be used to detect problems in text extraction. Sadly, they are not present in Community, and I don't know whether they can be ported without problems.

Meanwhile, you can check whether a document's text has been extracted by querying the OKM_NODE_DOCUMENT table.

Sad to hear that.

I've checked that table, and I see some extracted text, but not for all documents.

So is there no way to force the Lucene indexing, or to restart it when it gets stuck? What can I do now?

By the way, I wonder about the difference between the Lucene index and the text extractors. Are they different?

Posted: **Mon Sep 23, 2013 4:35 am**

could not update: [com.openkm.dao.bean.NodeDocument#[id of the document]

Packet for query is too large (1728595 > 1048576). You can change this value on the server by setting the max_allowed_packet'

I see these two lines in the log file catalina.out.

Maybe the indexing got jammed for some reason. I looked up max_allowed_packet and set it to a new value of 2000M. I'm not sure whether these two lines are related to each other?

Can someone please help me with this problem?
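
The numbers in the second log line can be read directly: the first is the size of the rejected packet in bytes and the second is the server's current limit. The sketch below parses the message and converts both values. That the failed update of the NodeDocument row was caused by this oversized packet is plausible but remains an assumption, and `parse_packet_error` is a made-up helper name.

```python
import math
import re

LOG_LINE = (
    "Packet for query is too large (1728595 > 1048576). You can change "
    "this value on the server by setting the max_allowed_packet'"
)


def parse_packet_error(line):
    """Return (packet_bytes, limit_bytes) from the error, or None."""
    match = re.search(r"Packet for query is too large \((\d+) > (\d+)\)", line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


packet, limit = parse_packet_error(LOG_LINE)
limit_mib = limit / 2**20               # current limit in MiB
needed_mib = math.ceil(packet / 2**20)  # smallest whole-MiB limit that fits
print(limit_mib, needed_mib)  # 1.0 2
```

So the limit in effect was 1 MiB and this particular packet would have fit within 2 MiB. MySQL caps max_allowed_packet at 1 GB, so a value of 2000M is above the maximum the server accepts, and a changed value only applies to connections opened after the change.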

Posted: **Wed Oct 02, 2013 7:09 am**

If there are problems with the text extraction, you should see errors in the Tomcat log. In any case, I recommend that you try to reproduce this problem with the latest nightly build from http://integration.openkm.com/6.2/

Search functionality does not return all matches