Hi there,
I'm testing OpenKM and I'm very impressed with the features, but I have a problem with the search function: when I search for a word, not all documents containing it are returned.
I tried rebuilding the indexes in this order: first "Text extractor", then "Lucene indexes", and finally "Optimize indexes" (I don't know whether that last step is useful). Still, not all documents containing the word I'm looking for are returned.
My test case is the word "medicines", contained several times in a particular (rather old) MS Word document. When I search for the word, that particular DOC file is not returned by the search.
I tried converting the file to RTF and to DOCX, and both of these files were indexed correctly. I also opened the document with LibreOffice and re-saved it in DOC (not DOCX) format; again, the indexing worked correctly on this "refreshed" file. My final test was opening the original DOC file and re-saving it using MS Word 2011 for Mac: this time the indexing worked correctly as well.
My guess is that the Word 97-2003 Binary File Format is not 100% supported by Lucene or whatever other indexing tool OpenKM is using (I must say this is hardly surprising, given that this particular format is well known to be problematic).
The only workaround I can think of at the moment is either converting all old documents to DOCX or opening and re-saving them with a recent version of Word, but that sounds like a lot of work. Is there any other solution? Has anyone else faced this problem before?
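If nothing better turns up, the bulk conversion could probably be scripted rather than done by hand. Below is a minimal sketch, assuming LibreOffice's `soffice` binary is installed and on the PATH; the function names and the directory layout are hypothetical, and I have not checked whether every old file converts cleanly this way.

```python
import subprocess
import sys
from pathlib import Path


def find_legacy_docs(root):
    """Return all .doc files (not .docx) below root, in sorted order."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".doc"
    )


def build_convert_command(doc_path, out_dir, soffice="soffice"):
    """Build the LibreOffice headless command that converts one file to DOCX.

    Assumption: "soffice" resolves to a LibreOffice installation.
    """
    return [
        soffice, "--headless",
        "--convert-to", "docx",
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def convert_all(root, out_root):
    """Convert every legacy .doc under root, mirroring the folder structure."""
    root = Path(root)
    for doc in find_legacy_docs(root):
        # Keep the relative folder so equally named files do not collide.
        target_dir = Path(out_root) / doc.parent.relative_to(root)
        target_dir.mkdir(parents=True, exist_ok=True)
        subprocess.run(build_convert_command(doc, target_dir), check=True)


if __name__ == "__main__":
    # Usage: python convert_docs.py <source_folder> <output_folder>
    convert_all(sys.argv[1], sys.argv[2])
```

The converted copies would still have to be imported into OpenKM and re-indexed, so this only automates the conversion step of the workaround.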
