Message Extractor Problem
Posted: Sun Mar 09, 2014 7:18 pm
by Raphel
We are hosting OpenKM 6.2.5 on the Amazon cloud platform, and there seems to be a problem with the text extractor. Three days ago, I uploaded 5 text files to test the findByContent web service. Of the 5 files, only two had their text extracted. I confirmed this by looking at the log and at the indexes built for these 5 files: in the index, the 'message Extracted' property is set to true only for the two files that returned text matches when tested with the findByContent web service. I am not sure about the status of the remaining 3 files: did text extraction fail, or is it still pending? I ran the query "select * from okm_activity where act_action = 'MISC_TEXT_EXTRACTION_FAILURE'", but none of the 3 files appeared in the result, so I guess extraction failure is not the reason.
That leaves only one possibility: the text extraction is still pending. I checked the 'Pending Text Extraction' option under the 'Stats' feature. The screen listed only 20 results, but the count showed that more than 7000 files are pending text extraction! I couldn't confirm whether the 3 files that returned no results in the text search were part of the pending queue, as the UI showed only 20 files and I couldn't find any option to view all the files pending extraction.
Next, I checked the Crontab. The text extractor job is scheduled to run every 5 seconds, but the count of pending jobs never changes. I even invoked it manually from the 'Execute' option of the Crontab for the text extractor job, but still nothing changed.
My questions are:
1) How can I find a particular document's text extraction status? (Some kind of database query would be helpful.)
2) Why does the text extractor job not seem to execute? Where can I check whether it is executing as scheduled?
3) If the text extractor job is not executing as scheduled, is there any way I can execute it manually? Also, can I select a particular file for text extraction?
4) Can the text extractor be triggered through code while the file is being uploaded through the OpenKM web services?
Please share your suggestions.
Re: Message Extractor Problem
Posted: Mon Mar 10, 2014 8:42 am
by jllort
1- The document status can be seen in the OKM_NODE_DOCUMENT table (there is a column that indicates whether it has been processed or not).
2- First, I think the job is executed every 5 minutes, not every 5 seconds. You should watch the log (catalina.log) while you click on execute to see if something happens. If you have not changed the configuration, OpenKM will use only one CPU and extract only one file at a time. If for some reason a file is blocking the queue, that could be why it does not advance and everything is stopped there.
3- Text extraction can be executed on demand, but not through the web service. An event would have to be added after document creation and the extraction executed then. The event would then be part of the upload process, and the upload would not finish until the extraction had finished.
Are you uploading xls files or similar? Sometimes these kinds of files can have problems with the default parsers. I suggest marking the file currently at the head of the queue as EXTRACTED = 'T' in the OKM_NODE_DOCUMENT table.
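The status check and the suggested workaround can be sketched with a small, self-contained example. This is only an illustration under stated assumptions: it uses an in-memory SQLite table as a stand-in for the real OpenKM database, reduced to two columns. The column names NBS_UUID and NDC_TEXT_EXTRACTED are taken from the row dump posted later in this thread; the UUID values and the helper names are hypothetical.

```python
import sqlite3


def make_status_db():
    """Build an in-memory stand-in for OKM_NODE_DOCUMENT (two columns only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE OKM_NODE_DOCUMENT ("
        "NBS_UUID TEXT PRIMARY KEY, NDC_TEXT_EXTRACTED TEXT)"
    )
    # Hypothetical rows: one extracted document and two pending ones.
    conn.executemany(
        "INSERT INTO OKM_NODE_DOCUMENT VALUES (?, ?)",
        [("uuid-a", "T"), ("uuid-b", "F"), ("uuid-c", "F")],
    )
    return conn


def pending_uuids(conn):
    """Return the UUIDs of documents whose extraction flag is still 'F'."""
    rows = conn.execute(
        "SELECT NBS_UUID FROM OKM_NODE_DOCUMENT "
        "WHERE NDC_TEXT_EXTRACTED = 'F' ORDER BY NBS_UUID"
    )
    return [row[0] for row in rows]


def mark_extracted(conn, uuid):
    """Flag one document as extracted; return the number of rows changed."""
    cur = conn.execute(
        "UPDATE OKM_NODE_DOCUMENT SET NDC_TEXT_EXTRACTED = 'T' "
        "WHERE NBS_UUID = ?",
        (uuid,),
    )
    return cur.rowcount


if __name__ == "__main__":
    conn = make_status_db()
    print(pending_uuids(conn))           # documents still waiting
    print(mark_extracted(conn, "uuid-b"))
    print(pending_uuids(conn))
```

Note that setting the flag to 'T' by hand only tells the extractor to skip the document; it does not extract any text, so the skipped document's content presumably stays unsearchable until it is processed some other way.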
Re: Message Extractor Problem
Posted: Mon Mar 10, 2014 1:34 pm
by Raphel
Hi,
Thanks for the reply. Using your suggestion, I could see that the text extraction status is set to F in the OKM_NODE_DOCUMENT table for the 3 files I mentioned in my post, so I guess they are still pending extraction. It makes me wonder why, when I uploaded the 5 files at around the same time, 2 files were picked for text extraction and 3 were not. Is there any particular logic/algorithm for picking files for text extraction, or is everything first put into a queue and then picked using something like FIFO/LIFO?
I need a clarification regarding the last point you mentioned. Did you mean that the extraction status of all files in OKM_NODE_DOCUMENT should be changed to T, or only the problematic file's status? And to answer your question, we don't upload xls files. Usually they are PDF, Doc, txt or some image files.
Re: Message Extractor Problem
Posted: Tue Mar 11, 2014 11:22 am
by jllort
The queue is FIFO. T indicates that the document has already been processed by the text extractor. I suggest changing the status of these files; I think for some reason they are blocking the queue.
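How a single problematic file could stall a FIFO queue that is processed one item at a time can be illustrated with a toy model. This is a sketch, not OpenKM's actual scheduler: the file names and the is_stuck predicate are hypothetical, and a real hang is represented simply by stopping the loop.

```python
from collections import deque


def run_single_worker(files, is_stuck):
    """Process a FIFO queue one file at a time.

    Returns (processed, still_pending). Processing stops at the first
    file for which is_stuck() is true, modelling an extractor that hangs.
    """
    pending = deque(files)
    processed = []
    while pending:
        if is_stuck(pending[0]):
            break  # a real extractor would hang here instead of returning
        processed.append(pending.popleft())
    return processed, list(pending)


if __name__ == "__main__":
    files = ["a.txt", "b.txt", "bad.doc", "c.txt", "d.txt"]
    done, waiting = run_single_worker(files, lambda name: name == "bad.doc")
    print("processed:", done)
    print("pending:", waiting)
```

In this model everything queued behind the stuck file stays pending no matter how often the job is triggered, which would be consistent with a pending count that never changes.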
Re: Message Extractor Problem
Posted: Fri Mar 14, 2014 5:36 pm
by Raphel
Hi,
Using your suggestion, I have cleared the status of the files in the pending queue, and I can see that the message extraction process now continues without any issue. Analyzing this further, I think one of the root causes was that message extraction ran with the default setting (sequential, with a single thread). I think this could have been avoided if message extraction ran in parallel mode, so that even if one thread gets stuck, the other threads can continue extracting the rest of the files in the pending queue.
I want to try the parallel mode for message extraction. Please let me know which settings I need to change to achieve this. What settings does OpenKM recommend for parallel mode, so that it doesn't become a performance/resource issue? Also, any additional input on the pros and cons of sequential vs. parallel mode would be really beneficial.
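The reasoning about sequential versus parallel extraction can be demonstrated with a generic thread pool. This is a minimal sketch of the general idea only, not OpenKM's implementation or configuration: the extract_all function, the file names, and the simulated hang (a wait on an event) are all assumptions made for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, wait


def extract_all(files, stuck, workers, timeout=0.5):
    """Run a fake extraction over files with the given number of workers.

    Files named in `stuck` hang until the deadline has passed. Returns the
    sorted names of the files that finished within `timeout` seconds.
    """
    release = threading.Event()

    def extract(name):
        if name in stuck:
            release.wait()  # simulate a parser that hangs
        return name

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(extract, name): name for name in files}
        done, _not_done = wait(list(futures), timeout=timeout)
        finished = sorted(futures[f] for f in done)
        release.set()  # end the simulated hang so the pool can shut down
    return finished


if __name__ == "__main__":
    files = ["a.txt", "bad.doc", "c.txt"]
    print("1 worker: ", extract_all(files, {"bad.doc"}, workers=1))
    print("2 workers:", extract_all(files, {"bad.doc"}, workers=2))
```

With one worker, the file queued behind the stuck one does not finish before the deadline; with two workers it does. The trade-off is that each extra worker uses more CPU and memory, and a second stuck file would still exhaust a pool of two.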
Re: Message Extractor Problem
Posted: Sat Mar 15, 2014 7:02 pm
by jllort
You should read this to understand how to use all the cores in your OpenKM server:
http://wiki.openkm.com/index.php/Applic ... extraction
Re: Message Extractor Problem
Posted: Tue Mar 18, 2014 3:35 am
by matt81
Hi,
Just to let you know that as of version 6.2.26 Pro trial, there is no need to change the settings for text extraction; it is done automatically. There is no such setting as "managed.text.extraction.schedule".
What I noticed is that when I upload a new document and then check the value of 'textExtracted' under Utilities -> List Indexes, the value is false at first, and therefore the document cannot be found by search. I checked again after 10 minutes or so and it was set to true, so it is updated automatically, which is good.
However, when I create a new file using the web service OKMDocument->createSimple, the file is never searchable, even though the value of 'textExtracted' is true. I have even used 'Check text extraction' in Utilities, and it still doesn't return anything. Why are documents created through web services not searchable?
Re: Message Extractor Problem
Posted: Tue Mar 18, 2014 9:49 pm
by matt81
Can I get an answer on the above please?
Re: Message Extractor Problem
Posted: Wed Mar 19, 2014 7:49 pm
by jllort
First of all, we normally try to answer questions within the first 24 hours. Sometimes that is not possible, but we always answer. About your problems with the trial: I do not know where you see this property, but it has probably been deprecated in version 6.2.26. Simply search for "managed" and you will see all the possible values (information about all the parameters is here:
http://wiki.openkm.com/index.php/Applic ... extraction ).
If you want to test the search feature, it is always better to test in our online demo (demo.openkm.com), with the application installed by us. There could be some problem in your trial OS environment that explains why indexing is not going right (quite strange, but possible). Anyway, keep in mind that indexing uses a queue, which means documents are indexed after some time has passed, not at the moment you upload them. You can see the queue at Administration -> Stats -> Indexing queue.
To be sure whether a document has already been indexed, run a query: Administration -> Database query -> select * from OKM_NODE_DOCUMENT WHERE NBS_UUID='NODE UUID' (you can copy the node UUID from UI -> Properties), and then check whether the column that indicates the document has been indexed is set to true.
Re: Message Extractor Problem
Posted: Wed Mar 19, 2014 10:24 pm
by matt81
Thanks for the info. I don't know how you cannot see the value 'textExtracted'; it is found in the Administration tab, then Utilities, where you will see a setting called 'List Indexes'. I cannot use the demo version as there is no Administration tab. I have created documents and waited for 2 days, and they are still not searchable. As I said, if I create a document using the upload feature it works; the problem occurs only when I create documents through the web service. There shouldn't be a problem with the installation or the OS, as otherwise nothing would have worked. There is some sort of issue with the API on the server side.
I have executed the SQL statement and I couldn't find the column for indexing. Please find the columns and values below, and let me know which column it is. As you can see, text extracted is true, but search still doesn't work:
Code:
NDC_CHECKED_OUT = F
NDC_CIPHER_NAME = arron
NDC_DESCRIPTION = ''
NDC_ENCRYPTION = T
NDC_LANGUAGE = ''
NDC_LAST_MODIFIED = 2014-03-17 22:14:47.712000
NLK_CREATED = ''
NLK_OWNER = ''
NLK_TOKEN = ''
NDC_LOCKED = F
NDC_MIME_TYPE = application/vnd.openxmlformats-officedocument.wordprocessingml.document
NDC_SIGNED = F
NDC_TEXT = ''
NDC_TEXT_EXTRACTED = T
NDC_TITLE = ''
NBS_UUID = 28281f82-5eb6-4abf-be5d-6b6abf81a955
Re: Message Extractor Problem
Posted: Fri Mar 21, 2014 6:57 pm
by jllort
NDC_TEXT contains the text extracted by the extraction process.
Can you give us the minimal Java code you are using to upload the document through the API, and the document you upload, so we can test it ourselves?
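Following the point about NDC_TEXT, one quick diagnostic is to look for rows that are flagged as extracted but hold no text, which is the combination visible in the dump above (NDC_TEXT_EXTRACTED = T with NDC_TEXT = ''). The sketch below assumes an in-memory SQLite table standing in for OKM_NODE_DOCUMENT; the column names come from that dump, while the rows and helper names are hypothetical.

```python
import sqlite3


def make_text_db():
    """Build an in-memory stand-in for OKM_NODE_DOCUMENT (three columns only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE OKM_NODE_DOCUMENT ("
        "NBS_UUID TEXT PRIMARY KEY, NDC_TEXT_EXTRACTED TEXT, NDC_TEXT TEXT)"
    )
    # Hypothetical rows covering the three interesting combinations.
    conn.executemany(
        "INSERT INTO OKM_NODE_DOCUMENT VALUES (?, ?, ?)",
        [
            ("doc-1", "T", ""),             # flagged extracted, but no text
            ("doc-2", "T", "hello world"),  # extracted, with text
            ("doc-3", "F", ""),             # still pending
        ],
    )
    return conn


def flagged_but_empty(conn):
    """Return UUIDs marked as extracted whose NDC_TEXT is empty or NULL."""
    rows = conn.execute(
        "SELECT NBS_UUID FROM OKM_NODE_DOCUMENT "
        "WHERE NDC_TEXT_EXTRACTED = 'T' "
        "AND (NDC_TEXT IS NULL OR NDC_TEXT = '') "
        "ORDER BY NBS_UUID"
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(flagged_but_empty(make_text_db()))
```

A document that shows up in this list has nothing for a content search to match, whatever the flag says, so it is a reasonable place to start looking when a document is marked as extracted but cannot be found.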