• Full Text Search from .pdf file is Not Working

  • OpenKM has many interesting features, but requires some configuration process to show its full potential.
 #21216  by Muhammad Imran
 
Hi,
I have installed OpenKM 6.2.2 on Windows 7. It was working well with the embedded HSQLDB database. Then I configured it to use MySQL instead of HSQLDB. Right now OpenKM is working, except for full-text search in .pdf files.

In the console window I can see the following error:
Code:
chDAO - findBySimpleQuery(database AND context:okm_root, 0, 10)
2013-02-01 18:25:00,103 [Thread-16] INFO  com.openkm.extractor.TextExtractorWorker - processSerial.Working on {docUuid=1183f2af-bd56-479f-bb14-0039644380eb, docPath=/okm:root/my_pdf.pdf, docVerUuid=1afc47ea-25c7-4404-8ae8-ae7c6af5cc87, date=Fri Feb 01 18:05:54 PKT 2013}
2013-02-01 18:25:00,957 [Thread-16] WARN  com.openkm.extractor.PdfTextExtractor - PDF does not contains text layer
2013-02-01 18:25:00,958 [Thread-16] WARN  com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/my_pdf.pdf': Too few text extracted
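To see which documents are affected, the extraction warnings can be pulled out of a saved copy of the console output. The following is only a minimal sketch in Python: it assumes the log format shown above, and the function name find_extraction_problems is made up for illustration. It also undoes the hard line wrapping of the Windows console before matching.

```python
import re

# Start of a log record, e.g. "2013-02-01 18:25:00,958 [" (format assumed from the log above).
TIMESTAMP = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} \["

# The NodeDocumentDAO warning exactly as it appears in the log above.
PROBLEM_RE = re.compile(
    r"WARN\s+com\.openkm\.dao\.NodeDocumentDAO\s*-\s*"
    r"There was a problem extracting text from '([^']+)': (.*)"
)


def find_extraction_problems(log_text):
    """Return a list of (document path, reason) pairs found in log_text."""
    # The console wraps long lines, so remove the line breaks first...
    joined = log_text.replace("\r", "").replace("\n", "")
    # ...then cut the text back into records in front of every timestamp.
    records = re.split(r"(?=" + TIMESTAMP + r")", joined)
    problems = []
    for record in records:
        match = PROBLEM_RE.search(record)
        if match:
            problems.append((match.group(1), match.group(2).strip()))
    return problems
```

Passing the saved console text to find_extraction_problems gives one (path, reason) pair per warning. Anything printed after a warning without its own timestamp, such as a stack trace, would end up in the reason text.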
I have tested the my_pdf.pdf file on the OpenKM online demo, where full-text search finds it successfully, but on my local machine it generates the above error.
Can anybody please tell me how to resolve this problem?
Thanks in advance.
 #21268  by jllort
 
I have tested in my environment without major problems. The only possibility I have in mind is that the document is still in the pending queue and has not been processed yet. Go to Administration -> Stats; there is a menu option to see the current queue. How many files are in the queue pending to be indexed?

In this case, if the documents have a text layer, indexing does not need OCR and they should only be processed by the text extraction queue (here OCR or any other misconfiguration has no influence).
 #21285  by Muhammad Imran
 
Hi jllort, thank you for your reply.
I have uploaded my test document css.pdf to my local machine.
I can see it in Administration -> status -> extraction queue, as follows:
Image

How much time will the text extraction process take?
When will my document be indexed?
When will my document be available for "Full Text Search"?
 #21300  by jllort
 
TextExtractor should run every 1-5 minutes. If it does not, then we have identified the problem. Tell us whether the document has been indexed or is still in the queue.
 #21315  by Muhammad Imran
 
Hi jllort,

After uploading a .pdf document I can see it in the "Pending Extraction" queue. After some time the pending extraction queue becomes empty. The "progress in extraction" queue always remains empty.
Please tell me how to check whether my document has been indexed.

I think that after leaving the pending queue the document should enter the "progress in extraction" queue. But on my local machine no document ever appears in the progress in extraction queue. So I am guessing that there is some problem in moving documents from the pending extraction queue to the progress in extraction queue. Is that right? If yes, then please tell me how to solve this problem.
Thank you!
 #21341  by jllort
 
If the pending extraction queue is empty, it means all files have been indexed. In the OKM_BASE_DOCUMENT table you should see the extracted text.
 #21392  by Muhammad Imran
 
Hi jllort,
I have looked at the okm_node_document table on my local machine. Its screenshot is available at the following URL:
http://s8.postimage.org/3rzta3fj9/okm_node_document.png

In the okm_node_document table on my local machine I can see the extracted text of .txt, .docx, .png, .jpeg etc. files in the NDC_TEXT field.
But the NDC_TEXT field contains NULL whenever the NDC_MIME_TYPE field contains "application/pdf".
Why does the NDC_TEXT field contain NULL only for .pdf documents?
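
The same check can be done with a query instead of reading the table by eye. This is only a rough sketch in Python: the table and column names (okm_node_document, NDC_TEXT, NDC_MIME_TYPE) are the ones mentioned above, while the function names and the in-memory SQLite table are made-up stand-ins so the example runs without a MySQL server. The real table has more columns than the two used here.

```python
import sqlite3

# Count, per MIME type, how many rows have no extracted text at all.
QUERY = """
SELECT NDC_MIME_TYPE,
       COUNT(*) AS total,
       SUM(CASE WHEN NDC_TEXT IS NULL THEN 1 ELSE 0 END) AS without_text
FROM okm_node_document
GROUP BY NDC_MIME_TYPE
ORDER BY NDC_MIME_TYPE
"""


def missing_text_by_mime_type(conn):
    """Return (mime type, total rows, rows with NULL text) for each MIME type."""
    cur = conn.cursor()
    cur.execute(QUERY)
    rows = cur.fetchall()
    cur.close()
    return [tuple(row) for row in rows]


def demo_connection():
    """Build a tiny stand-in table (illustrative data only, not the real schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE okm_node_document (NDC_MIME_TYPE TEXT, NDC_TEXT TEXT)"
    )
    conn.executemany(
        "INSERT INTO okm_node_document VALUES (?, ?)",
        [
            ("application/pdf", None),
            ("text/plain", "hello"),
            ("image/png", "scanned text"),
        ],
    )
    return conn
```

Calling missing_text_by_mime_type(demo_connection()) lists every MIME type with its row count and the number of rows where NDC_TEXT is NULL. The SELECT statement itself should also be usable directly in a MySQL client against the OpenKM database, assuming the table layout described above.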
 #21429  by jllort
 
Does your PDF contain a text layer and images, or only images?
Can you try with the configuration parameter system.pdf.force.ocr?
