Open Source Document Management System | OpenKM - Content search not working for notepad content and styled text

#31431 by Prajakta
Thu Feb 26, 2015 12:43 pm

Hi OpenKM Support Team,
We have OpenKM Community Edition 6.3.0 installed on our machine. Browsers used: Mozilla Firefox, IE.

We are facing the following issues with content search:
1. Content search is not working for Notepad content.
2. No search results are returned for styled text such as bold and italics.

Please let us know if any configuration is needed so that the appropriate results are returned.

Regards,
Prajakta

#31453 by jllort
Fri Feb 27, 2015 5:42 pm

Please provide us with some screenshots ( zipped and attached to the post ) and, if possible, a document that reproduces the problem.

#31481 by Prajakta
Mon Mar 02, 2015 10:10 am

Please find attached screenshots.zip, containing the screenshots of the OpenKM search, and notepad reference document.zip, containing the Notepad file used for searching.
We were not able to attach the PDF reference document due to its size. The link to the PDF reference document is below:
http://docs.spring.io/spring/docs/2.5.x ... erence.pdf

Attachments
notepad reference document.zip - Reference (6.22 KiB)
screenshots.zip - Screenshots (543.32 KiB)

#31513 by jllort
Fri Mar 06, 2015 5:24 pm

Did you check whether the document has been processed by the text extractor queue? See Administration -> Stats -> pending extractor queue. Documents are not processed immediately; they go into a queue and are processed later to extract their text.

In the Administration -> Crontab tab there is a task called "Text extractor" that does this; you can force its execution from there.

#39539 by ravikumar
Wed May 13, 2015 1:05 pm

Hi,

I am a colleague of the member who posted this issue and will be working on it.
After adding debug logs for the text extractor, I see the exception below in the logs:

Code:

2015-05-13 18:30:00,112 [Thread-29] INFO  com.openkm.extractor.TextExtractorWorker - processSerial.Working on {docUuid=00d1db8d-4dbd-4376-a1c1-47ddb8d851f8, docPath=/okm:trash/okmAdmin/date results.txt, docVerUuid=548b2da3-97ad-4053-93c0-ec6fd59dfbf4, date=Fri Oct 04 16:35:00 IST 2013}
2015-05-13 18:30:00,113 [Thread-29] WARN  com.openkm.extractor.TextExtractorWorker - /usr/share/apache-tomcat-7.0.53/repository/datastore/54/8b/2d/a3/548b2da3-97ad-4053-93c0-ec6fd59dfbf4 (No such file or directory)
java.io.FileNotFoundException: /usr/share/apache-tomcat-7.0.53/repository/datastore/54/8b/2d/a3/548b2da3-97ad-4053-93c0-ec6fd59dfbf4 (No such file or directory)
        at java.io.FileInputStream.open(Native Method)
        at java.io.FileInputStream.<init>(FileInputStream.java:146)
        at com.openkm.module.db.stuff.FsDataStore.read(FsDataStore.java:68)
        at com.openkm.dao.NodeDocumentDAO.textExtractorHelper(NodeDocumentDAO.java:1291)
        at com.openkm.extractor.TextExtractorWorker.processSerial(TextExtractorWorker.java:138)
        at com.openkm.extractor.TextExtractorWorker.processQueue(TextExtractorWorker.java:125)
        at com.openkm.extractor.TextExtractorWorker.run(TextExtractorWorker.java:80)

Is this exception the reason the text extractor is not completing? Please help.

#39559 by jllort
Thu May 14, 2015 2:30 pm

It seems the document being processed is in the trash, /okm:trash/okmAdmin. Can you go to Administration -> Utilities and run a repository check from the /okm:trash node ( choose the version history check )? I suspect a version file is missing, /usr/share/apache-tomcat-7.0.53/repository/datastore/54/8b/2d/a3/548b2da3-97ad-4053-93c0-ec6fd59dfbf4, and the repository checker tool will report whether there are any errors in the repository.
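
The path in the logged exception appears to be built from the version UUID: the first four two-character pairs of docVerUuid (54, 8b, 2d, a3) become nested directories under the datastore folder. Below is a minimal Python sketch for checking whether a given version file exists, assuming that layout holds for every version file. The layout is inferred from the log in this thread, not taken from OpenKM documentation, and the function name `datastore_path` is hypothetical.

```python
from pathlib import Path


def datastore_path(datastore_root, version_uuid):
    """Build the expected file path for a version UUID.

    Assumption: the first four two-character pairs of the UUID are used
    as nested directory names, as seen in the logged path.
    """
    compact = version_uuid.replace("-", "")
    pairs = [compact[i:i + 2] for i in range(0, 8, 2)]
    return Path(datastore_root).joinpath(*pairs, version_uuid)


if __name__ == "__main__":
    # Datastore root and version UUID copied from the log posted above.
    root = "/usr/share/apache-tomcat-7.0.53/repository/datastore"
    path = datastore_path(root, "548b2da3-97ad-4053-93c0-ec6fd59dfbf4")
    print(path, "exists" if path.is_file() else "is missing")
```

Running it on the server with the docVerUuid from a log entry shows whether that version file is really absent from disk.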

#39583 by JavaDev
Mon May 18, 2015 6:58 am

We are still getting the FileNotFoundException, even after running the repository checker from Administration -> Utilities as suggested in the post above.

Code:

2015-05-18 11:35:00,022 [Thread-2481] INFO  com.openkm.extractor.TextExtractorWorker - processSerial.Working on {docUuid=00d1db8d-4dbd-4376-a1c1-47ddb8d851f8, docPath=/okm:trash/okmAdmin/date results.txt, docVerUuid=548b2da3-97ad-4053-93c0-ec6fd59dfbf4, date=Fri Oct 04 16:35:00 IST 2013}
2015-05-18 11:35:00,023 [Thread-2481] WARN  com.openkm.extractor.TextExtractorWorker - /usr/share/apache-tomcat-7.0.53/repository/datastore/54/8b/2d/a3/548b2da3-97ad-4053-93c0-ec6fd59dfbf4 (No such file or directory)
java.io.FileNotFoundException: /usr/share/apache-tomcat-7.0.53/repository/datastore/54/8b/2d/a3/548b2da3-97ad-4053-93c0-ec6fd59dfbf4 (No such file or directory)
        at java.io.FileInputStream.open(Native Method)
        at java.io.FileInputStream.<init>(FileInputStream.java:146)
        at com.openkm.module.db.stuff.FsDataStore.read(FsDataStore.java:68)
        at com.openkm.dao.NodeDocumentDAO.textExtractorHelper(NodeDocumentDAO.java:1291)
        at com.openkm.extractor.TextExtractorWorker.processSerial(TextExtractorWorker.java:138)
        at com.openkm.extractor.TextExtractorWorker.processQueue(TextExtractorWorker.java:125)
        at com.openkm.extractor.TextExtractorWorker.run(TextExtractorWorker.java:80)
        at sun.reflect.GeneratedMethodAccessor856.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at bsh.Reflect.invokeOnMethod(Unknown Source)

Is there any way we can do some cleanup to resolve this issue?

#39584 by JavaDev
Mon May 18, 2015 7:09 am

One more thing I want to share: I am not able to purge the trash folder. When I select any folder or file and try to delete it, it does not get deleted, and I do not get any error message either.

#39592 by jllort
Tue May 19, 2015 2:33 pm

My suggestion is to upgrade to the nightly build ( integration.openkm.com ). The migration process you must follow is http://wiki.openkm.com/index.php/Migrat ... 3_to_6.3.1

There was a bug when deleting documents with more than one version: the versions were not deleted in the correct order, and that caused this problem. To solve it, you should create the missing files on the hard disk ( you will probably have to run the process several times until it is solved ).
1- Go to Administration -> Utilities -> check repository
2- For each missing file execute the command
touch /usr/share/apache-tomcat-7.0.53/repository/datastore/54/8b/2d/a3/548b2da3-97ad-4053-93c0-ec6fd59dfbf4

For example, if the document had 5 versions, you will probably have to run the process 5 times ( apologies for this tedious bug ).
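
Step 2 above can be partly scripted. The following is a hedged Python sketch, not an OpenKM tool: it pulls the missing paths out of FileNotFoundException lines like the ones posted earlier in this thread and creates an empty file for each, using `Path.touch`, which works on both Linux and Windows. The function names and the log file name in the usage comment are assumptions. The pattern matches the Linux error text as it appears in the posted log; the wording on Windows differs, so the pattern would need adjusting there. Back up the repository before creating any files.

```python
import re
import sys
from pathlib import Path

# Matches lines such as:
# java.io.FileNotFoundException: /path/to/file (No such file or directory)
MISSING = re.compile(r"FileNotFoundException: (.+?) \(No such file or directory\)")


def missing_paths(log_lines):
    """Return the distinct missing file paths reported in the log, in order."""
    seen = []
    for line in log_lines:
        match = MISSING.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


def create_placeholders(paths):
    """Create an empty file for each path that does not exist yet."""
    created = []
    for name in paths:
        path = Path(name)
        if not path.exists():
            path.parent.mkdir(parents=True, exist_ok=True)
            path.touch()
            created.append(path)
    return created


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical log location): python touch_missing.py catalina.out
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for path in create_placeholders(missing_paths(log)):
            print("created", path)
```

Since each run of the text extractor may only reveal the next missing version file, the script would likely need to be rerun after each extractor run or repository check.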

#39596 by JavaDev
Wed May 20, 2015 4:32 am

Thank you for your suggestion, but we have some customizations in OpenKM, so upgrading would not be simple.

I have OpenKM hosted on a Windows machine, so can you tell me what command to execute in place of "touch"?

#39617 by jllort
Fri May 22, 2015 1:40 pm

Use this ( in PowerShell, where $null is a built-in variable ):

Code:

echo $null >> d:\repository\tenant_2\datastore\6e\c5\b5\ef\6ec5b5ef-b3de-4698-bd17-abf7cb8ea099

If you've modified the 6.3.0 code, you can create a patch and apply it to 6.3.1 ( the current 6.3 branch ); we've made minimal changes, so it should apply without conflicts.

#40070 by Prajakta
Thu Jul 09, 2015 12:05 pm

Hi,

Content search is not working properly for Notepad content or for styled text such as bold and italics.
I uploaded 10 sample .text files, all with the same content ( Admin & Admin ), and then searched for the content "Admin & Admin", but no documents were returned.
As you suggested, I then checked whether the recently uploaded documents had been processed by the text extractor queue ( Administration -> Stats -> pending extractor queue ), and found that the documents stay in the pending queue for around 5 minutes.
What if I don't want to force execution of the Text extractor task from the Crontab tab?

Can you please tell me where to configure how often the text extractor wakes up ( in our case it is every 5 minutes )?
I tried adding a new property, managed.text.extraction.pool.timeout =1 minute, but it is not working.

#40094 by jllort
Mon Jul 13, 2015 10:14 am

Hi

You cannot search by exact phrase; you are searching by keywords ( tokens ), so your query should be a single keyword, Admin. Keep in mind that when content goes into Lucene to be processed, some characters ( stop characters ) etc. are removed; you can set your own Lucene analyzer.
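
To illustrate why "Admin & Admin" does not match as typed, here is a simplified Python sketch of keyword tokenization. It is not Lucene's analyzer and not OpenKM code; it only assumes, in line with the description above, that text is split into word tokens and that characters such as "&" are dropped. Lowercasing is an additional assumption, and the function names are hypothetical.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation such as '&'.

    Assumption: a rough stand-in for an analyzer; real Lucene analyzers
    apply their own rules and may also remove stop words.
    """
    return re.findall(r"\w+", text.lower())


def matches(query, content):
    """True if every token of the query occurs among the content tokens."""
    content_tokens = set(tokenize(content))
    return all(token in content_tokens for token in tokenize(query))


if __name__ == "__main__":
    print(tokenize("Admin & Admin"))           # the '&' never becomes a token
    print(matches("Admin", "Admin & Admin"))   # a single keyword query
```

In this simplified model the "&" is never indexed, so only the keyword Admin is available for matching.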

The first step should be to see which content has been extracted:

Code:

SELECT * FROM OKM_NODE_DOCUMENT WHERE NBS_UUID='your doc uuid here';

Also, where are you running the query, in the simple or the advanced search view? They do not work in exactly the same way.

OpenKM Text extraction is not working properly

#40102 by Prajakta
Mon Jul 13, 2015 12:37 pm

Text extraction is not working properly.
I uploaded 5 documents with the same content and found that the text extractor marked them as extracted in the database table OKM_NODE_DOCUMENT (NDC_TEXT_EXTRACTED =T), but the NDC_TEXT column holds the value "NULL".
Also, for some documents I can see an entry in the OKM_NODE_BASE table with the same NBS_UUID ( present in OKM_NODE_DOCUMENT ).

So even though the documents are processed by the text extractor, searching is not working.

Can you please tell me the root cause of this problem and how the extraction process works internally, so that we can fix it?

#40112 by jllort
Thu Jul 16, 2015 4:01 pm

What kind of documents are you uploading? It seems the extractor is not working correctly for your documents.
