Content search not working for notepad content and styled text
Posted: Thu Feb 26, 2015 12:43 pm
by Prajakta
Hi OpenKM Support Team,
We have OpenKM Community Edition 6.3.0 installed on our machine. Browsers used: Mozilla Firefox, IE.
We are facing the following issues with content search:
1. Content search is not working for notepad content.
2. No search results are returned for styled text such as bold and italics.
Please let us know whether any configuration is needed so that the appropriate results are returned.
Regards,
Prajakta
Re: Content search not working for notepad content and styled text
Posted: Fri Feb 27, 2015 5:42 pm
by jllort
Please provide some screenshots (attach a zip to the post) and, if possible, a document that reproduces the problem.
Re: Content search not working for notepad content and styled text
Posted: Mon Mar 02, 2015 10:10 am
by Prajakta
Please find attached:
screenshots.zip, containing screenshots of the OpenKM search, and
notepad reference document.zip, containing the notepad file used for searching.
We were not able to attach the PDF reference document because of its size. The link to the PDF reference document is below:
http://docs.spring.io/spring/docs/2.5.x ... erence.pdf
Re: Content search not working for notepad content and styled text
Posted: Fri Mar 06, 2015 5:24 pm
by jllort
Did you check whether the document has been processed by the text extractor queue (Administration -> Stats -> pending extractor queue)? Documents are not processed immediately; they go into a queue and are processed later to extract the text.
In the Administration -> Crontab tab you have the "Text extractor" task, which does this; you can force its execution from there.
Re: Content search not working for notepad content and styled text
Posted: Wed May 13, 2015 1:05 pm
by ravikumar
Hi,
I am a colleague of the member who posted this issue and will be working on it.
After adding debug logs for the Text Extractor, I see the exception below in the logs:
Code:
2015-05-13 18:30:00,112 [Thread-29] INFO com.openkm.extractor.TextExtractorWorker - processSerial.Working on {docUuid=00d1db8d-4dbd-4376-a1c1-47ddb8d851f8, docPath=/okm:trash/okmAdmin/date results.txt, docVerUuid=548b2da3-97ad-4053-93c0-ec6fd59dfbf4, date=Fri Oct 04 16:35:00 IST 2013}
2015-05-13 18:30:00,113 [Thread-29] WARN com.openkm.extractor.TextExtractorWorker - /usr/share/apache-tomcat-7.0.53/repository/datastore/54/8b/2d/a3/548b2da3-97ad-4053-93c0-ec6fd59dfbf4 (No such file or directory)
java.io.FileNotFoundException: /usr/share/apache-tomcat-7.0.53/repository/datastore/54/8b/2d/a3/548b2da3-97ad-4053-93c0-ec6fd59dfbf4 (No such file or directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:146)
at com.openkm.module.db.stuff.FsDataStore.read(FsDataStore.java:68)
at com.openkm.dao.NodeDocumentDAO.textExtractorHelper(NodeDocumentDAO.java:1291)
at com.openkm.extractor.TextExtractorWorker.processSerial(TextExtractorWorker.java:138)
at com.openkm.extractor.TextExtractorWorker.processQueue(TextExtractorWorker.java:125)
at com.openkm.extractor.TextExtractorWorker.run(TextExtractorWorker.java:80)
Is this exception the reason the text extractor is not completing? Please help.
Re: Content search not working for notepad content and styled text
Posted: Thu May 14, 2015 2:30 pm
by jllort
It seems the document being processed is in the trash (/okm:trash/okmAdmin). Can you go to Administration -> Utilities and run a repository check from the /okm:trash node (choose the version history check)? I suspect the version file /usr/share/apache-tomcat-7.0.53/repository/datastore/54/8b/2d/a3/548b2da3-97ad-4053-93c0-ec6fd59dfbf4 is missing; the repository checker will report whether there are any errors in the repository.
Re: Content search not working for notepad content and styled text
Posted: Mon May 18, 2015 6:58 am
by JavaDev
We are still getting the FileNotFoundException, even after running the repository checker from Administration -> Utilities as suggested in the post above.
Code:
2015-05-18 11:35:00,022 [Thread-2481] INFO com.openkm.extractor.TextExtractorWorker - processSerial.Working on {docUuid=00d1db8d-4dbd-4376-a1c1-47ddb8d851f8, docPath=/okm:trash/okmAdmin/date results.txt, docVerUuid=548b2da3-97ad-4053-93c0-ec6fd59dfbf4, date=Fri Oct 04 16:35:00 IST 2013}
2015-05-18 11:35:00,023 [Thread-2481] WARN com.openkm.extractor.TextExtractorWorker - /usr/share/apache-tomcat-7.0.53/repository/datastore/54/8b/2d/a3/548b2da3-97ad-4053-93c0-ec6fd59dfbf4 (No such file or directory)
java.io.FileNotFoundException: /usr/share/apache-tomcat-7.0.53/repository/datastore/54/8b/2d/a3/548b2da3-97ad-4053-93c0-ec6fd59dfbf4 (No such file or directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:146)
at com.openkm.module.db.stuff.FsDataStore.read(FsDataStore.java:68)
at com.openkm.dao.NodeDocumentDAO.textExtractorHelper(NodeDocumentDAO.java:1291)
at com.openkm.extractor.TextExtractorWorker.processSerial(TextExtractorWorker.java:138)
at com.openkm.extractor.TextExtractorWorker.processQueue(TextExtractorWorker.java:125)
at com.openkm.extractor.TextExtractorWorker.run(TextExtractorWorker.java:80)
at sun.reflect.GeneratedMethodAccessor856.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at bsh.Reflect.invokeOnMethod(Unknown Source)
Is there any cleanup we can do to resolve this issue?
Re: Content search not working for notepad content and styled text
Posted: Mon May 18, 2015 7:09 am
by JavaDev
One more thing I want to share: I am not able to purge the trash folder. When I select any folder or file and try to delete it, it does not get deleted, and I do not get any error message either.
Re: Content search not working for notepad content and styled text
Posted: Tue May 19, 2015 2:33 pm
by jllort
My suggestion is to upgrade to the nightly build (integration.openkm.com). The migration process you must follow is:
http://wiki.openkm.com/index.php/Migrat ... 3_to_6.3.1
There was a bug when deleting documents with more than one version: the versions were not deleted in the correct order, and that caused this problem. To solve it, you should create the missing files on the hard disk (you will probably have to run the process several times until it is solved).
1- Go to administration -> utilities -> check repository
2- For each missing file execute the command
touch /usr/share/apache-tomcat-7.0.53/repository/datastore/54/8b/2d/a3/548b2da3-97ad-4053-93c0-ec6fd59dfbf4
For example, if the document had 5 versions, you will probably have to run the process 5 times (apologies for this tedious bug).
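If there are many missing files, the two steps above could be scripted. The following is only a rough sketch, not an OpenKM tool. It assumes the log contains java.io.FileNotFoundException lines like the ones posted earlier in this thread (a path followed by a message in parentheses), and the function names missing_paths and create_placeholders are made up for illustration. It is probably wise to back up the datastore before running anything like this.

```python
import re
from pathlib import Path

# Assumption: each missing file shows up in the log as
#   java.io.FileNotFoundException: <path> (<OS error message>)
MISSING = re.compile(
    r"FileNotFoundException: (.+) \([^()]*\)[ \t]*$", re.MULTILINE
)


def missing_paths(log_text):
    """Return the unique missing paths, in the order they first appear."""
    seen = []
    for match in MISSING.finditer(log_text):
        path = match.group(1)
        if path not in seen:
            seen.append(path)
    return seen


def create_placeholders(paths, dry_run=True):
    """Create an empty file for each path that does not exist yet.

    Works like `touch` on a missing file. With dry_run=True nothing is
    written; the paths that would be created are only returned.
    """
    created = []
    for p in map(Path, paths):
        if p.exists():
            continue  # never overwrite or modify a real version file
        if not dry_run:
            # Unlike plain `touch`, also create missing parent folders.
            p.parent.mkdir(parents=True, exist_ok=True)
            p.touch()
        created.append(str(p))
    return created
```

A possible way to use it: read your OpenKM log into a string, call create_placeholders(missing_paths(text)) once as a dry run to review the list, then call it again with dry_run=False. Because each run of the repository check may reveal a further missing version, the whole cycle may need repeating, as described above.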
Re: Content search not working for notepad content and styled text
Posted: Wed May 20, 2015 4:32 am
by JavaDev
Thank you for your suggestion, but we have some customizations in OpenKM, so upgrading would not be simple.
I have OpenKM hosted on a Windows machine, so can you tell me which command to run in place of "touch"?
Re: Content search not working for notepad content and styled text
Posted: Fri May 22, 2015 1:40 pm
by jllort
Use this:
Code:
echo $null >> d:\repository\tenant_2\datastore\6e\c5\b5\ef\6ec5b5ef-b3de-4698-bd17-abf7cb8ea099
If you've modified the 6.3.0 code, you can create a patch and apply it to 6.3.1 (the current 6.3 branch); we've made minimal changes, so it should apply without conflicts.
Re: Content search not working for notepad content and styled text
Posted: Thu Jul 09, 2015 12:05 pm
by Prajakta
Hi,
Content search is not working properly for notepad content or for styled text such as bold and italics.
I uploaded 10 sample .text files, all with the same content (Admin & Admin), and then searched for the content "Admin & Admin", but no documents were returned.
Then, as you suggested, I checked whether the recently uploaded documents had been processed by the text extractor queue (Administration -> Stats -> pending extractor queue) and found that the documents stay in the pending queue for around 5 minutes.
What if I don't want to force execution from Crontab -> Text extractor? Can you please tell me where to configure how often the text extractor wakes up (in our case it is every 5 minutes)?
I tried adding a new property, managed.text.extraction.pool.timeout = 1 minute, but it is not working.
Re: Content search not working for notepad content and styled text
Posted: Mon Jul 13, 2015 10:14 am
by jllort
Hi
You cannot search by exact phrase; you are searching by keywords (tokens), so your query should be a single keyword: Admin. Keep in mind that when content goes into Lucene to be processed, some characters (stop characters) are removed, and so on. You can set your own Lucene analyzer.
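To illustrate the idea of searching by tokens, here is a deliberately simplified sketch. It is not Lucene's analyzer and does not reproduce OpenKM's actual behaviour, which depends on the analyzer that is configured; the stop-word list and the function names tokenize and matches are assumptions made up for the example.

```python
import re

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of"}


def tokenize(text):
    """Lowercase, split on non-alphanumeric characters, drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def matches(query, document):
    """True if every query token occurs among the document's tokens."""
    doc_tokens = set(tokenize(document))
    return all(tok in doc_tokens for tok in tokenize(query))


print(tokenize("Admin & Admin"))  # ['admin', 'admin']
```

In this toy model the "&" never reaches the index, so the content "Admin & Admin" is stored only as the keyword admin, and a query is compared token by token rather than as a literal phrase.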
The first step should be to see which content has been extracted:
Code:
SELECT * FROM OKM_NODE_DOCUMENT WHERE NBS_UUID='your doc uuid here';
Also, where are you running the query: in the simple or the advanced search view? They do not do exactly the same thing.
OpenKM Text extraction is not working properly
Posted: Mon Jul 13, 2015 12:37 pm
by Prajakta
Text extraction is not working properly.
I uploaded 5 documents with the same content and found that the text extractor marked them as extracted in the database (OKM_NODE_DOCUMENT, NDC_TEXT_EXTRACTED = T), but the NDC_TEXT column contains NULL.
Also, for some documents I can see an entry in the OKM_NODE_BASE table with the same NBS_UUID (present in OKM_NODE_DOCUMENT).
So even though the documents have been processed by the text extractor, searching is not working.
Can you please tell me the root cause of this problem and how the extraction process works internally, so that we can fix it?
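The check described above (documents flagged as extracted but with no stored text) can be expressed as a single query. The sketch below runs it against an in-memory SQLite table as a stand-in, since the real database will differ. The table is reduced to the three columns named in this thread, the sample rows are invented, and the helper names are hypothetical.

```python
import sqlite3

# Documents marked as extracted whose text column is empty or NULL.
QUERY = """
SELECT NBS_UUID
FROM OKM_NODE_DOCUMENT
WHERE NDC_TEXT_EXTRACTED = 'T'
  AND (NDC_TEXT IS NULL OR NDC_TEXT = '')
ORDER BY NBS_UUID
"""


def extracted_without_text(conn):
    """Return the UUIDs of rows flagged as extracted but holding no text."""
    return [row[0] for row in conn.execute(QUERY)]


def demo_connection():
    """Build a stand-in table with invented rows (not real OpenKM data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE OKM_NODE_DOCUMENT ("
        "NBS_UUID TEXT PRIMARY KEY, NDC_TEXT_EXTRACTED TEXT, NDC_TEXT TEXT)"
    )
    conn.executemany(
        "INSERT INTO OKM_NODE_DOCUMENT VALUES (?, ?, ?)",
        [
            ("doc-1", "T", None),           # flagged, but no text stored
            ("doc-2", "T", "admin admin"),  # flagged, text present
            ("doc-3", "F", None),           # still pending extraction
        ],
    )
    return conn


print(extracted_without_text(demo_connection()))  # ['doc-1']
```

Only the SQL in QUERY is meant to carry over; against the real OpenKM database it would be run with whatever client that database uses.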
Re: Content search not working for notepad content and styled text
Posted: Thu Jul 16, 2015 4:01 pm
by jllort
What kind of documents are you uploading? It seems the extractor is not working correctly for your documents.