Full Text Search from .pfd file is Not Working
PostPosted:Fri Feb 01, 2013 6:34 am
by Muhammad Imran
Hi,
I have installed OpenKM 6.2.2 on Windows 7. It was working well with the embedded HSQLDB database. I then configured it to use MySQL instead of HSQLDB. OpenKM now works, except for full-text search in .pdf files.
In the console window I can see the following error:
chDAO - findBySimpleQuery(database AND context:okm_root, 0, 10)
2013-02-01 18:25:00,103 [Thread-16] INFO com.openkm.extractor.TextExtractorWorker - processSerial.Working on {docUuid=1183f2af-bd56-479f-bb14-0039644380eb, docPath=/okm:root/my_pdf.pdf, docVerUuid=1afc47ea-25c7-4404-8ae8-ae7c6af5cc87, date=Fri Feb 01 18:05:54 PKT 2013}
2013-02-01 18:25:00,957 [Thread-16] WARN com.openkm.extractor.PdfTextExtractor - PDF does not contains text layer
2013-02-01 18:25:00,958 [Thread-16] WARN com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/my_pdf.pdf': Too few text extracted
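For anyone triaging a longer console log, here is a minimal Python sketch that pulls out the documents whose text extraction failed. It assumes the log line layout shown in the output above (timestamp, thread, level, logger, message); the function name extraction_failures and both regular expressions are made up for this illustration and are not part of OpenKM.

```python
import re

# Sample lines copied from the console output in this post, with the
# console's line wrapping undone so each log record is on one line.
LOG = """\
2013-02-01 18:25:00,103 [Thread-16] INFO com.openkm.extractor.TextExtractorWorker - processSerial.Working on {docUuid=1183f2af-bd56-479f-bb14-0039644380eb, docPath=/okm:root/my_pdf.pdf, docVerUuid=1afc47ea-25c7-4404-8ae8-ae7c6af5cc87, date=Fri Feb 01 18:05:54 PKT 2013}
2013-02-01 18:25:00,957 [Thread-16] WARN com.openkm.extractor.PdfTextExtractor - PDF does not contains text layer
2013-02-01 18:25:00,958 [Thread-16] WARN com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/my_pdf.pdf': Too few text extracted
"""

# Assumed layout: "<timestamp> [<thread>] <LEVEL> <logger> - <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"\[(?P<thread>[^\]]+)\] "
    r"(?P<level>[A-Z]+) +"
    r"(?P<logger>\S+) - "
    r"(?P<msg>.*)$"
)

# Matches the failure message wording seen in the log above.
FAIL_RE = re.compile(
    r"problem extracting text from '(?P<path>[^']+)': (?P<reason>.*)"
)


def extraction_failures(log_text):
    """Return (timestamp, document path, reason) for each failed extraction."""
    failures = []
    for line in log_text.splitlines():
        record = LINE_RE.match(line)
        if record is None or record.group("level") != "WARN":
            continue
        failure = FAIL_RE.search(record.group("msg"))
        if failure:
            failures.append(
                (record.group("ts"), failure.group("path"), failure.group("reason"))
            )
    return failures


failures = extraction_failures(LOG)
for ts, path, reason in failures:
    print(ts, path, reason)
```

Records that do not match the assumed layout (for example the truncated first line of the paste) are simply skipped.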
I tested my_pdf.pdf on the OpenKM online demo, where full-text search finds it successfully, but on my local machine it produces the error above.
Can anybody please tell me how to resolve this problem?
Thanks in advance
Re: Full Text Search from .pfd file is Not Working
PostPosted:Sat Feb 02, 2013 7:34 pm
by jllort
Does your PDF have a text layer, or is it based on images? In the latter case you need to have OCR configured.
Re: Full Text Search from .pfd file is Not Working
PostPosted:Mon Feb 04, 2013 5:41 am
by Muhammad Imran
My PDF has selectable text in it.
I tried with the following PDF file:
http://freepdfhosting.com/92aa8d6dd0.pdf
Re: Full Text Search from .pfd file is Not Working
PostPosted:Thu Feb 07, 2013 6:27 pm
by jllort
I have tested it in my environment without major problems. The only possibility I can think of is that the document is still in the pending queue and has not been processed yet. Go to Administration -> Stats; there is a menu option to see the current queue. How many files are in the queue pending indexing?
In this case, if the documents have a text layer, indexing does not need OCR and they should only be processed by the text extraction queue (here OCR or any other misconfiguration has no influence).
Re: Full Text Search from .pfd file is Not Working
PostPosted:Fri Feb 08, 2013 6:15 am
by Muhammad Imran
Hi jllort, thank you for your reply.
I have uploaded my test document css.pdf to my local machine.
I can see css.pdf in Administration -> status -> extraction queue as follows:
How long will the text extraction process take?
When will my document be indexed?
When will my document be available for "Full Text Search"?
Re: Full Text Search from .pfd file is Not Working
PostPosted:Sat Feb 09, 2013 9:57 am
by jllort
TextExtractor should run every 1-5 minutes. If it does not, then we have identified the problem. Tell us whether the document is indexed or still in the queue.
Re: Full Text Search from .pfd file is Not Working
PostPosted:Mon Feb 11, 2013 6:32 am
by Muhammad Imran
hi jllort,
After uploading a .pdf document I can see it in the "Pending Extraction" queue. After some time the pending extraction queue becomes empty. The "progress in extraction" queue always remains empty.
Please tell me how to check whether my document has been indexed.
I think that after leaving the pending queue the document should enter the "progress in extraction" queue. But on my local machine no document ever appears in the progress in extraction queue. So I am guessing that there is some problem moving documents from the pending extraction queue to the progress in extraction queue. Is that right? If so, please tell me how to solve this problem.
Thank you...!
Re: Full Text Search from .pfd file is Not Working
PostPosted:Tue Feb 12, 2013 3:58 pm
by jllort
If the pending extraction queue is empty, it means all files have been indexed. You should see the extracted text in the OKM_BASE_DOCUMENT table.
Re: Full Text Search from .pfd file is Not Working
PostPosted:Wed Feb 13, 2013 6:50 am
by Muhammad Imran
Hi Jllort,
I couldn't find the OKM_BASE_DOCUMENT table on my local machine. I have configured OpenKM 6.2.2 to use MySQL.
Please tell me the schema of this table so that I can create it in my database.
Re: Full Text Search from .pfd file is Not Working
PostPosted:Thu Feb 14, 2013 10:14 pm
by jllort
Re: Full Text Search from .pfd file is Not Working
PostPosted:Fri Feb 15, 2013 1:47 pm
by Muhammad Imran
hi jllort,
I have viewed the okm_node_document table on my local machine. A screenshot is available at the following URL:
http://s8.postimage.org/3rzta3fj9/okm_node_document.png
In the okm_node_document table on my local machine I can see the extracted text of .txt, .docx, .png, .jpeg, etc. files in the NDC_TEXT field.
But the NDC_TEXT field contains NULL only when the NDC_MIME_TYPE field contains "application/pdf".
Why does the NDC_TEXT field contain NULL only for .pdf documents?
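To confirm that pattern across the whole table rather than by eye, a grouped query can count rows with no extracted text per MIME type. The Python sketch below is only an illustration: it uses an in-memory SQLite database as a stand-in for MySQL, a simplified table with just the two columns named in this post plus a hypothetical id column, and made-up sample rows. The real okm_node_document table has more columns, and the same SELECT would need to be run against the actual MySQL database.

```python
import sqlite3

# In-memory SQLite stand-in for the MySQL database (assumption for the demo).
conn = sqlite3.connect(":memory:")

# Simplified table: only NDC_MIME_TYPE and NDC_TEXT come from the thread;
# the id column is hypothetical.
conn.execute(
    """
    CREATE TABLE okm_node_document (
        id INTEGER PRIMARY KEY,
        NDC_MIME_TYPE TEXT,
        NDC_TEXT TEXT
    )
    """
)

# Made-up rows that mimic the situation described in the post.
conn.executemany(
    "INSERT INTO okm_node_document (NDC_MIME_TYPE, NDC_TEXT) VALUES (?, ?)",
    [
        ("text/plain", "some extracted text"),
        ("image/png", "text recognised in the image"),
        ("application/pdf", None),
        ("application/pdf", None),
    ],
)

# Count documents and documents without extracted text, per MIME type.
QUERY = """
    SELECT NDC_MIME_TYPE,
           COUNT(*) AS total,
           SUM(CASE WHEN NDC_TEXT IS NULL THEN 1 ELSE 0 END) AS without_text
    FROM okm_node_document
    GROUP BY NDC_MIME_TYPE
    ORDER BY NDC_MIME_TYPE
"""

rows = conn.execute(QUERY).fetchall()
for mime_type, total, without_text in rows:
    print(mime_type, total, without_text)
```

If only the application/pdf row shows a non-zero without_text count, the problem is specific to PDF extraction rather than to the extraction queue as a whole.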
Re: Full Text Search from .pfd file is Not Working
PostPosted:Sat Feb 16, 2013 6:27 pm
by jllort
Does your PDF contain a text layer and images, or only images?
Can you try the configuration parameter system.pdf.force.ocr?
Re: Full Text Search from .pfd file is Not Working
PostPosted:Mon Feb 18, 2013 6:43 am
by Muhammad Imran
Yes, my PDF document contains a text layer. My test PDF document is available at the following URL:
http://freepdfhosting.com/92aa8d6dd0.pdf
The Boolean configuration parameter system.pdf.force.ocr is also set to true.
Re: Full Text Search from .pfd file is Not Working
PostPosted:Mon Feb 18, 2013 9:56 am
by pavila
Please, post your whole OpenKM configuration ( Administration > Configuration ). You can attach screenshots in this forum thread, BTW.
Re: Full Text Search from .pfd file is Not Working
PostPosted:Tue Feb 19, 2013 6:24 am
by Muhammad Imran
Hi jllort,
My OpenKM configuration (Administration > Configuration) screenshots:
Now please tell me whether there is some problem in my OpenKM configuration.