Open Source Document Management System | OpenKM - Certain PDF files not indexed ?!

Certain PDF files not indexed ?!


#23148 by meadowtec
Tue May 21, 2013 3:24 pm

Hi !
I'm playing with okm 6.2.3 on Debian Linux. Everything works fine except that certain PDF files are not indexed.

Those PDFs definitely contain a text layer. The only difference is that they were created with ABBYY FineReader 11.

The logfile has the following entries:

Code:

2013-05-21 16:03:33,857 [http-bio-0.0.0.0-8080-exec-18] INFO  com.openkm.servlet.frontend.FileUploadServlet - Filename: 'fff.pdf'
2013-05-21 16:03:33,857 [http-bio-0.0.0.0-8080-exec-18] INFO  com.openkm.servlet.frontend.FileUploadServlet - Upload file 'fff.pdf' into '/okm:root (137.7 KB)'
2013-05-21 16:03:33,857 [http-bio-0.0.0.0-8080-exec-18] INFO  com.openkm.servlet.frontend.FileUploadServlet - Wizard: {path=, showWizardCategories=false, showWizardKeywords=false, groupsList=[], workflowList=[], hasAutomation=false, error=, digitalSignature=false}
2013-05-21 16:03:33,896 [http-bio-0.0.0.0-8080-exec-18] INFO  com.openkm.servlet.frontend.FileUploadServlet - Wizard: {path=%2Fokm%3Aroot%2Ffff.pdf, showWizardCategories=false, showWizardKeywords=false, groupsList=[], workflowList=[], hasAutomation=false, error=, digitalSignature=false}
2013-05-21 16:03:33,897 [http-bio-0.0.0.0-8080-exec-18] INFO  com.openkm.servlet.frontend.FileUploadServlet - Action: 0, JSON Response: {"hasAutomation":false,"path":"%2Fokm%3Aroot%2Ffff.pdf","groupsList":[],"workflowList":[],"showWizardCategories":false,"showWizardKeywords":false,"digitalSignature":false,"error":""}
2013-05-21 16:05:00,014 [Thread-4072] INFO  com.openkm.extractor.TextExtractorWorker - processSerial.Working on {docUuid=08375552-ff32-4a0c-8f88-234fa0c1986a, docPath=/okm:root/fff.pdf, docVerUuid=883f7a97-0a19-449b-a23a-9cc81dbd5b54, date=Tue May 21 16:03:33 CEST 2013}
2013-05-21 16:05:00,021 [Thread-4072] WARN  com.openkm.extractor.PdfTextExtractor - PDF does not contains text layer
2013-05-21 16:05:00,733 [Thread-4072] INFO  com.openkm.extractor.Tesseract3TextExtractor - TEXT:
2013-05-21 16:05:00,734 [Thread-4072] WARN  com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/fff.pdf': Too few text extracted

The files index fine in Adobe; they just don't work in OKM.

Has anyone had similar problems and can perhaps suggest a solution?
Thanks
Alex
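
To see which uploads are affected, the log can be scanned for the NodeDocumentDAO warning shown above. The following is only a minimal sketch, assuming Java 8 or newer and a log format identical to the excerpt in this post; the class name `ExtractionFailureScanner` and the method `findFailures` are made up for illustration and are not part of OpenKM.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenKM: lists documents whose text extraction failed.
public class ExtractionFailureScanner {

    // Matches the NodeDocumentDAO warning from the log excerpt and captures
    // the document path (group 1) and the reported reason (group 2).
    private static final Pattern FAILURE = Pattern.compile(
        "WARN\\s+com\\.openkm\\.dao\\.NodeDocumentDAO - "
        + "There was a problem extracting text from '([^']+)': (.*)$");

    /** Returns one "path -> reason" entry per extraction failure found in the log lines. */
    public static List<String> findFailures(List<String> logLines) {
        List<String> failures = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = FAILURE.matcher(line);
            if (m.find()) {
                failures.add(m.group(1) + " -> " + m.group(2));
            }
        }
        return failures;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java ExtractionFailureScanner <path-to-log-file>");
            return;
        }
        // The log file location depends on the installation; pass it as the only argument.
        for (String failure : findFailures(Files.readAllLines(Paths.get(args[0])))) {
            System.out.println(failure);
        }
    }
}
```

Comparing the resulting list against the tool that produced each PDF would show whether only the FineReader files are affected.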


Re: Certain PDF files not indexed ?!

#23162 by jllort
Wed May 22, 2013 9:36 am

A PDF containing a text layer is the ideal case, so it is strange that these files are not indexed. I'm not sure whether the community edition has it, but under Administration > Utilities there should be a "check text extractor" utility. Can you check there?

Can you also test in our online demo at demo.openkm.com and, if possible, upload a sample PDF here so we can take a look?


Re: Certain PDF files not indexed ?!

#23164 by meadowtec
Wed May 22, 2013 11:31 am

Hi !
The text extractor check utility is not there in the community version.

I uploaded a sample PDF to the demo site. It's called meadowtec.pdf, in the folder meadowtec. Interestingly enough, I can't even search in the preview there, although this works fine in my own installation.
I managed to extract more from my logfile:

Code:

2013-05-22 13:05:00,015 [Thread-582] INFO  com.openkm.extractor.TextExtractorWorker - processSerial.Working on {docUuid=732f5ba7-bc4d-475d-afa1-4ef0a56305e0, docPath=/okm:root/meadowtec.pdf, docVerUuid=9174bace-50e5-4809-b2de-9b78af185a4a, date=Wed May 22 13:00:03 CEST 2013}
2013-05-22 13:05:00,030 [Thread-582] WARN  com.openkm.extractor.PdfTextExtractor - Failed to extract PDF text content
java.lang.NullPointerException
        at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:100)
        at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
        at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:115)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:243)
        at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
        at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
        at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
        at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
        at com.openkm.extractor.PdfTextExtractor.extractText(PdfTextExtractor.java:70)
        at com.openkm.extractor.RegisteredExtractors.getText(RegisteredExtractors.java:161)
        at com.openkm.dao.NodeDocumentDAO.textExtractorHelper(NodeDocumentDAO.java:1306)
        at com.openkm.extractor.TextExtractorWorker.processSerial(TextExtractorWorker.java:138)
        at com.openkm.extractor.TextExtractorWorker.processQueue(TextExtractorWorker.java:125)
        at com.openkm.extractor.TextExtractorWorker.run(TextExtractorWorker.java:80)
        at sun.reflect.GeneratedMethodAccessor305.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)
        at bsh.Reflect.invokeOnMethod(Unknown Source)
        at bsh.Reflect.invokeObjectMethod(Unknown Source)
        at bsh.BSHPrimarySuffix.doName(Unknown Source)
        at bsh.BSHPrimarySuffix.doSuffix(Unknown Source)
        at bsh.BSHPrimaryExpression.eval(Unknown Source)
        at bsh.BSHPrimaryExpression.eval(Unknown Source)
        at bsh.Interpreter.eval(Unknown Source)
        at bsh.Interpreter.eval(Unknown Source)
        at bsh.Interpreter.eval(Unknown Source)
        at com.openkm.util.ExecutionUtils.runScript(ExecutionUtils.java:112)
        at com.openkm.core.Cron$RunnerBsh.run(Cron.java:103)
        at java.lang.Thread.run(Thread.java:679)
2013-05-22 13:05:00,032 [Thread-582] WARN  com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/meadowtec.pdf': Too few text extracted

Does this tell you anything?
What I forgot to mention in my initial post: PDFs scanned and OCRed with the utility that came with my HP scanner work fine. They are full-text indexed without any problem. But that OCR engine is not as good as ABBYY's.


Re: Certain PDF files not indexed ?!

#23176 by jllort
Thu May 23, 2013 7:26 am

Is the PDF protected in some way?
From the error alone I cannot tell you anything more. We need to test in a controlled environment to see whether the problem is related to the PDF or to your installation.
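
For a quick first answer to the "protected" question, one rough heuristic is to look for an `/Encrypt` key in the raw file, since an encrypted PDF references its encryption dictionary from the trailer under that name. The sketch below uses only the Java standard library (Java 7 or newer); the class name `PdfEncryptCheck` is made up for illustration. It can give false positives if the string happens to appear elsewhere in the file, and it says nothing about the font-related NullPointerException in the stack trace, so treat it as a hint, not a diagnosis.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of OpenKM or PDFBox: crude check for PDF encryption.
public class PdfEncryptCheck {

    /**
     * Heuristic only: returns true if the raw bytes contain the name "/Encrypt",
     * which an encrypted PDF uses in its trailer to reference the encryption dictionary.
     */
    public static boolean looksEncrypted(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to exactly one char, so binary content is preserved.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        return raw.contains("/Encrypt");
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PdfEncryptCheck <file.pdf>");
            return;
        }
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(looksEncrypted(bytes)
            ? "An /Encrypt entry was found; the PDF may be protected."
            : "No /Encrypt entry found.");
    }
}
```

If no `/Encrypt` entry turns up, protection is unlikely to be the cause and the sample file itself needs a closer look.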
