Open Source Document Management System | OpenKM - Full Text Search from .pfd file is Not Working

Full Text Search from .pfd file is Not Working

#21216 by Muhammad Imran
Fri Feb 01, 2013 6:34 am

Hi,
I have installed OpenKM 6.2.2 on Windows 7. It was working well with the embedded HSQLDB database. Then I configured it to use MySQL instead of HSQLDB. Now OpenKM works, except for full-text search in .pdf files.

In the console window I can see the following error:

Code:

chDAO - findBySimpleQuery(database AND context:okm_root, 0, 10)
2013-02-01 18:25:00,103 [Thread-16] INFO  com.openkm.extractor.TextExtractorWorker - processSerial.Working on {docUuid=1183f2af-bd56-479f-bb14-0039644380eb, docPath=/okm:root/my_pdf.pdf, docVerUuid=1afc47ea-25c7-4404-8ae8-ae7c6af5cc87, date=Fri Feb 01 18:05:54 PKT 2013}
2013-02-01 18:25:00,957 [Thread-16] WARN  com.openkm.extractor.PdfTextExtractor - PDF does not contains text layer
2013-02-01 18:25:00,958 [Thread-16] WARN  com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/my_pdf.pdf': Too few text extracted

I tested my_pdf.pdf on the OpenKM online demo, where full-text search finds it successfully, but on my local machine it produces the error above.
Can anybody please tell me how to resolve this problem?
Thanks in advance.
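
For anyone hitting the same warnings: one way to see which documents the extractor is failing on is to filter the console output for the NodeDocumentDAO warning. Below is a minimal Python sketch, assuming the console output has been saved as text with the wrapped lines rejoined. The regular expression is based only on the log lines quoted in this post, and the function name extraction_failures is hypothetical, not part of OpenKM.

```python
import re

# Pattern built from the single warning format quoted in this thread, e.g.:
# ... WARN  com.openkm.dao.NodeDocumentDAO - There was a problem extracting
# text from '/okm:root/my_pdf.pdf': Too few text extracted
FAILURE = re.compile(
    r"WARN\s+com\.openkm\.dao\.NodeDocumentDAO - "
    r"There was a problem extracting text from '(?P<path>[^']+)': (?P<reason>.+)"
)


def extraction_failures(log_text):
    """Return (document path, reason) pairs, one per extraction warning."""
    return [
        (m.group("path"), m.group("reason").strip())
        for m in FAILURE.finditer(log_text)
    ]


# Sample input: the two WARN lines from the post, unwrapped.
sample = (
    "2013-02-01 18:25:00,957 [Thread-16] WARN  com.openkm.extractor.PdfTextExtractor"
    " - PDF does not contains text layer\n"
    "2013-02-01 18:25:00,958 [Thread-16] WARN  com.openkm.dao.NodeDocumentDAO"
    " - There was a problem extracting text from '/okm:root/my_pdf.pdf':"
    " Too few text extracted\n"
)

for path, reason in extraction_failures(sample):
    print(path, "->", reason)
```

Other OpenKM versions may word the warning differently, so the pattern would likely need adjusting against a real log.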

Username: Muhammad Imran | Rank: Junior Boarder | Joined: Wed Jan 02, 2013 11:00 am

Re: Full Text Search from .pfd file is Not Working

#21229 by jllort
Sat Feb 02, 2013 7:34 pm

Does your PDF have a text layer, or is it based on images? In the latter case you need to have OCR configured.

Username: jllort | Rank: Moderator | Posts: 12187 | Joined: Fri Dec 21, 2007 11:23 am | Location: Sineu - ( Illes Balears ) - Spain

Re: Full Text Search from .pfd file is Not Working

#21240 by Muhammad Imran
Mon Feb 04, 2013 5:41 am

My PDF has selectable text in it.
I tried with the following PDF file:
http://freepdfhosting.com/92aa8d6dd0.pdf

Re: Full Text Search from .pfd file is Not Working

#21268 by jllort
Thu Feb 07, 2013 6:27 pm

I have tested it in my environment without major problems. The only possibility I can think of is that the document is still in the pending queue and has not been processed yet. Go to Administration -> Stats; there is a menu option there to see the current queue. How many files in the queue are still pending indexing?

In this case, if the documents have a text layer, indexing does not need OCR and they should be processed only by the text extraction queue (here OCR or any other misconfiguration has no influence).

Re: Full Text Search from .pfd file is Not Working

#21285 by Muhammad Imran
Fri Feb 08, 2013 6:15 am

Hi jllort, thank you for your reply.
I have uploaded my test document css.pdf to my local machine.
I can see it in Administration -> status -> extraction queue as follows:

How long will the text extraction process take?
When will my document be indexed?
When will my document be available for "Full Text Search"?

Re: Full Text Search from .pfd file is Not Working

#21300 by jllort
Sat Feb 09, 2013 9:57 am

The TextExtractor should run every 1-5 minutes. If it does not, then we have identified the problem. Tell us whether the document gets indexed or is still in the queue.

Re: Full Text Search from .pfd file is Not Working

#21315 by Muhammad Imran
Mon Feb 11, 2013 6:32 am

Hi jllort,

After uploading a .pdf document I can see it listed in the "Pending Extraction" queue. After some time the pending extraction queue becomes empty. The "progress in extraction" queue always remains empty.
Please tell me how to check that my document has been indexed.

I think that after leaving the pending queue the document should enter the "progress in extraction" queue. But on my local machine no document ever appears in the progress in extraction queue. So I am guessing there is some problem in moving documents from the pending extraction queue to the progress in extraction queue. Is that right? If yes, please tell me how to solve this problem.
Thank you!

Re: Full Text Search from .pfd file is Not Working

#21341 by jllort
Tue Feb 12, 2013 3:58 pm

If the pending extraction queue is empty, it means all files have been indexed. You should see the extracted text in the OKM_BASE_DOCUMENT table.

Re: Full Text Search from .pfd file is Not Working

#21356 by Muhammad Imran
Wed Feb 13, 2013 6:50 am

Hi jllort,
I didn't find the OKM_BASE_DOCUMENT table on my local machine. I have configured OpenKM 6.2.2 to use MySQL.
Please tell me the schema of this table so that I can create it in my database.

Re: Full Text Search from .pfd file is Not Working

#21374 by jllort
Thu Feb 14, 2013 10:14 pm

Code:

SELECT * FROM OKM_NODE_DOCUMENT;

Re: Full Text Search from .pfd file is Not Working

#21392 by Muhammad Imran
Fri Feb 15, 2013 1:47 pm

Hi jllort,
I have looked at the okm_node_document table on my local machine. A screenshot is available at the following URL:
http://s8.postimage.org/3rzta3fj9/okm_node_document.png

In the okm_node_document table on my local machine I can see the extracted text of .txt, .docx, .png, .jpeg, etc. files in the NDC_TEXT field.
But the NDC_TEXT field contains NULL only when the NDC_MIME_TYPE field contains "application/pdf".
Why does the NDC_TEXT field contain NULL only for .pdf documents?
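
The same observation can be checked with a grouped query rather than by reading the table by eye. Below is a rough Python sketch that uses an in-memory SQLite database purely as a stand-in so it runs anywhere. The table and column names (OKM_NODE_DOCUMENT, NDC_MIME_TYPE, NDC_TEXT) are the ones mentioned in this thread; the two-column table layout, the sample rows, and the function name text_coverage are made up for illustration. The SELECT uses only standard SQL and should also be usable against MySQL, but that is an assumption and has not been verified here.

```python
import sqlite3

# Per MIME type: how many documents exist and how many have no extracted text.
QUERY = """
SELECT NDC_MIME_TYPE,
       COUNT(*) AS total,
       SUM(CASE WHEN NDC_TEXT IS NULL THEN 1 ELSE 0 END) AS without_text
FROM OKM_NODE_DOCUMENT
GROUP BY NDC_MIME_TYPE
ORDER BY NDC_MIME_TYPE
"""


def text_coverage(conn):
    """Return (mime_type, total, without_text) rows for the given connection."""
    return conn.execute(QUERY).fetchall()


# Stand-in database with invented rows; a real OpenKM table has more columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE OKM_NODE_DOCUMENT (NDC_MIME_TYPE TEXT, NDC_TEXT TEXT)")
conn.executemany(
    "INSERT INTO OKM_NODE_DOCUMENT VALUES (?, ?)",
    [
        ("text/plain", "hello"),
        ("application/pdf", None),
        ("application/pdf", None),
    ],
)

for row in text_coverage(conn):
    print(row)
```

If every PDF row shows without_text equal to total while other types do not, the problem is specific to the PDF extractor rather than to the extraction queue as a whole.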

Re: Full Text Search from .pfd file is Not Working

#21429 by jllort
Sat Feb 16, 2013 6:27 pm

Does your PDF contain a text layer and images, or only images?
Can you try with the configuration parameter system.pdf.force.ocr?

Re: Full Text Search from .pfd file is Not Working

#21440 by Muhammad Imran
Mon Feb 18, 2013 6:43 am

Yes, my PDF document contains a text layer. My test PDF document is available at the following URL:
http://freepdfhosting.com/92aa8d6dd0.pdf

The Boolean configuration parameter system.pdf.force.ocr is also set to true.

Re: Full Text Search from .pfd file is Not Working

#21445 by pavila
Mon Feb 18, 2013 9:56 am

Please post your whole OpenKM configuration (Administration > Configuration). You can attach screenshots in this forum thread, BTW.

Username: pavila | Rank: Moderator | Posts: 3145 | Joined: Tue Dec 11, 2007 6:02 pm | Location: Alicante, Spain

Re: Full Text Search from .pfd file is Not Working

#21453 by Muhammad Imran
Tue Feb 19, 2013 6:24 am

Hi jllort,

My OpenKM configuration (Administration > Configuration) screenshots:

Now please tell me whether there is a problem in my OpenKM configuration.