Open Source Document Management System | OpenKM

Posted: **Fri Feb 01, 2013 6:34 am**

Hi,
I have installed OpenKM 6.2.2 on Windows 7. It was working well with the embedded HSQLDB database. I then configured it to use MySQL instead of HSQLDB. OpenKM now works, except for full-text search in .pdf files.

On the console window I can see the following error:

chDAO - findBySimpleQuery(database AND context:okm_root, 0, 10)
2013-02-01 18:25:00,103 [Thread-16] INFO  com.openkm.extractor.TextExtractorWorker - processSerial.Working on {docUuid=1183f2af-bd56-479f-bb14-0039644380eb, docPath=/okm:root/my_pdf.pdf, docVerUuid=1afc47ea-25c7-4404-8ae8-ae7c6af5cc87, date=Fri Feb 01 18:05:54 PKT 2013}
2013-02-01 18:25:00,957 [Thread-16] WARN  com.openkm.extractor.PdfTextExtractor - PDF does not contains text layer
2013-02-01 18:25:00,958 [Thread-16] WARN  com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/my_pdf.pdf': Too few text extracted

I have tested my_pdf.pdf on the OpenKM online demo, where full-text search finds it successfully, but on my local machine it produces the above error.
Can anybody please tell me how to resolve this problem?
Thanks in advance.
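
For anyone who wants to check how many documents hit this warning, here is a minimal Python sketch that scans console output for the NodeDocumentDAO extraction warning quoted above. It assumes each log record has already been joined onto a single line (the Windows console wraps long lines), and the helper name `failed_extractions` is made up for this illustration; it is not part of OpenKM.

```python
import re

# Pattern for the NodeDocumentDAO warning quoted in this post.
# Assumption: each log record sits on one line (console wrapping undone).
EXTRACTION_WARNING = re.compile(
    r"WARN\s+com\.openkm\.dao\.NodeDocumentDAO - "
    r"There was a problem extracting text from '(?P<path>[^']+)': (?P<reason>.+)$"
)


def failed_extractions(log_lines):
    """Return (document path, reason) pairs for every extraction warning."""
    failures = []
    for line in log_lines:
        match = EXTRACTION_WARNING.search(line.rstrip())
        if match:
            failures.append((match.group("path"), match.group("reason")))
    return failures


sample = [
    "2013-02-01 18:25:00,957 [Thread-16] WARN  com.openkm.extractor.PdfTextExtractor - PDF does not contains text layer",
    "2013-02-01 18:25:00,958 [Thread-16] WARN  com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/my_pdf.pdf': Too few text extracted",
]
print(failed_extractions(sample))
```

On the two warning lines from my console this reports only my_pdf.pdf with the reason "Too few text extracted".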

Posted: **Sat Feb 02, 2013 7:34 pm**

Does your PDF have a text layer, or is it based on images? In the latter case you need to have OCR configured.

Posted: **Mon Feb 04, 2013 5:41 am**

My PDF has selectable text in it.
I tried with the following PDF file:
http://freepdfhosting.com/92aa8d6dd0.pdf

Posted: **Thu Feb 07, 2013 6:27 pm**

I have tested it in my environment without major problems. The only possibility I can think of is that the document is still in the pending queue and has not been processed yet. Go to Administration > Stats; there is a menu option there to see the current queue. How many files are in the queue pending to be indexed?

In this case, if the documents have a text layer, indexing does not need OCR and they should only be processed by the text extraction queue (here OCR or any other misconfiguration has no influence).

Posted: **Fri Feb 08, 2013 6:15 am**

Hi jllort, thank you for your reply.
I have uploaded my test document css.pdf to my local machine.
I can see my test document css.pdf in Administration > Status > Extraction queue as follows:

How long will the text extraction process take?
When will my document be indexed?
When will my document be available for "Full Text Search"?

Posted: **Sat Feb 09, 2013 9:57 am**

TextExtractor should run every 1-5 minutes. If it does not, then we have identified the problem. Tell us whether the document has been indexed or is still in the queue.

Posted: **Mon Feb 11, 2013 6:32 am**

Hi jllort,

After uploading a .pdf document I can see it listed in the "Pending Extraction" queue. After some time the pending extraction queue becomes empty. The "progress in extraction" queue always remains empty.
Please tell me how to check whether my document has been indexed.

I think that after leaving the pending queue the document should enter the "progress in extraction" queue. But on my local machine no document ever shows up in the progress in extraction queue. So I am guessing there is some problem in moving documents from the pending extraction queue to the progress in extraction queue. Is that right? If so, please tell me how to solve this problem.
Thank you!

Posted: **Tue Feb 12, 2013 3:58 pm**

If the pending extraction queue is empty, it means all files have been indexed. You should see the extracted text in the OKM_BASE_DOCUMENT table.

Posted: **Wed Feb 13, 2013 6:50 am**

Hi jllort,
I didn't find the OKM_BASE_DOCUMENT table on my local machine. I have configured OpenKM 6.2.2 to use MySQL.
Please tell me the schema of this table so that I can create it in my database.

Posted: **Thu Feb 14, 2013 10:14 pm**

SELECT * FROM OKM_NODE_DOCUMENT;

Posted: **Fri Feb 15, 2013 1:47 pm**

Hi jllort,
I have looked at the okm_node_document table on my local machine. A screenshot of it is available at the following URL:
http://s8.postimage.org/3rzta3fj9/okm_node_document.png

In the okm_node_document table on my local machine I can see the extracted text of .txt, .docx, .png, .jpeg, etc. files in the NDC_TEXT field.
But the NDC_TEXT field contains NULL whenever the NDC_MIME_TYPE field contains "application/pdf".
Why does the NDC_TEXT field contain NULL only for .pdf documents?
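
To make the pattern easier to see than in a screenshot, the check can be sketched as one grouped query. The Python snippet below runs it against an in-memory SQLite table that only stands in for the real MySQL table: the column names NDC_MIME_TYPE and NDC_TEXT are the ones mentioned above, while the DOC_ID column and the sample rows are invented for this illustration.

```python
import sqlite3

# In-memory stand-in for OKM_NODE_DOCUMENT; DOC_ID and the rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE OKM_NODE_DOCUMENT (DOC_ID TEXT, NDC_MIME_TYPE TEXT, NDC_TEXT TEXT)"
)
conn.executemany(
    "INSERT INTO OKM_NODE_DOCUMENT VALUES (?, ?, ?)",
    [
        ("a", "text/plain", "hello world"),
        ("b", "application/pdf", None),
        ("c", "application/pdf", None),
        ("d", "image/png", "text recognised in the image"),
    ],
)


def missing_text_by_mime(connection):
    """Per MIME type: total documents and how many have no extracted text."""
    query = """
        SELECT NDC_MIME_TYPE,
               COUNT(*) AS total,
               SUM(CASE WHEN NDC_TEXT IS NULL THEN 1 ELSE 0 END) AS without_text
        FROM OKM_NODE_DOCUMENT
        GROUP BY NDC_MIME_TYPE
        ORDER BY NDC_MIME_TYPE
    """
    return connection.execute(query).fetchall()


for mime_type, total, without_text in missing_text_by_mime(conn):
    print(mime_type, total, without_text)
```

The SELECT uses only standard SQL, so the same statement should also run in the MySQL client against the real table, although I have only sketched it here against the stand-in.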

Posted: **Sat Feb 16, 2013 6:27 pm**

Does your PDF contain a text layer and images, or only images?
Can you try the configuration parameter system.pdf.force.ocr?

Posted: **Mon Feb 18, 2013 6:43 am**

Yes, my PDF document contains a text layer. My test PDF document is available at the following URL:
http://freepdfhosting.com/92aa8d6dd0.pdf

The boolean configuration parameter system.pdf.force.ocr is also set to true.

Posted: **Mon Feb 18, 2013 9:56 am**

Please, post your whole OpenKM configuration ( Administration > Configuration ). You can attach screenshots in this forum thread, BTW.

Posted: **Tue Feb 19, 2013 6:24 am**

Hi jllort,

Screenshots of my OpenKM configuration (Administration > Configuration):

Now please tell me whether there is some problem in my OpenKM configuration.

Full Text Search from .pdf file is Not Working
