• Certain PDF files not indexed ?!

  • OpenKM has many interesting features, but requires some configuration process to show its full potential.
 #23148  by meadowtec
 
Hi !
I'm playing with okm 6.2.3 on Debian Linux. Everything works fine except that certain PDF files are not indexed.

Those PDFs definitely contain a text layer. The only difference is that they were created with ABBYY FineReader 11.

The logfile has the following entries:
2013-05-21 16:03:33,857 [http-bio-0.0.0.0-8080-exec-18] INFO  com.openkm.servlet.frontend.FileUploadServlet - Filename: 'fff.pdf'
2013-05-21 16:03:33,857 [http-bio-0.0.0.0-8080-exec-18] INFO  com.openkm.servlet.frontend.FileUploadServlet - Upload file 'fff.pdf' into '/okm:root (137.7 KB)'
2013-05-21 16:03:33,857 [http-bio-0.0.0.0-8080-exec-18] INFO  com.openkm.servlet.frontend.FileUploadServlet - Wizard: {path=, showWizardCategories=false, showWizardKeywords=false, groupsList=[], workflowList=[], hasAutomation=false, error=, digitalSignature=false}
2013-05-21 16:03:33,896 [http-bio-0.0.0.0-8080-exec-18] INFO  com.openkm.servlet.frontend.FileUploadServlet - Wizard: {path=%2Fokm%3Aroot%2Ffff.pdf, showWizardCategories=false, showWizardKeywords=false, groupsList=[], workflowList=[], hasAutomation=false, error=, digitalSignature=false}
2013-05-21 16:03:33,897 [http-bio-0.0.0.0-8080-exec-18] INFO  com.openkm.servlet.frontend.FileUploadServlet - Action: 0, JSON Response: {"hasAutomation":false,"path":"%2Fokm%3Aroot%2Ffff.pdf","groupsList":[],"workflowList":[],"showWizardCategories":false,"showWizardKeywords":false,"digitalSignature":false,"error":""}
2013-05-21 16:05:00,014 [Thread-4072] INFO  com.openkm.extractor.TextExtractorWorker - processSerial.Working on {docUuid=08375552-ff32-4a0c-8f88-234fa0c1986a, docPath=/okm:root/fff.pdf, docVerUuid=883f7a97-0a19-449b-a23a-9cc81dbd5b54, date=Tue May 21 16:03:33 CEST 2013}
2013-05-21 16:05:00,021 [Thread-4072] WARN  com.openkm.extractor.PdfTextExtractor - PDF does not contains text layer
2013-05-21 16:05:00,733 [Thread-4072] INFO  com.openkm.extractor.Tesseract3TextExtractor - TEXT:
2013-05-21 16:05:00,734 [Thread-4072] WARN  com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/fff.pdf': Too few text extracted
The files index fine in Adobe; they just don't work in OKM.

Has anyone had similar problems and can perhaps suggest a solution?
Thanks
Alex
 #23162  by jllort
 
A PDF containing a text layer is the ideal case, so it is strange that these files are not indexed. I'm not sure whether the community edition has it, but under Administration > Utilities there should be a "check text extractor" utility. Can you check there?

Can you also try it in our online demo at demo.openkm.com and, if possible, upload a PDF file here so we can take a look at it?
 #23164  by meadowtec
 
Hi !
The text extractor utility check is not there in the community version :-(
I uploaded a sample PDF to the demo site. It's called meadowtec.pdf and is in the folder meadowtec. Interestingly enough, I can't even search in the preview there. This, however, works fine in my own installation.
I managed to extract more from my logfile:
2013-05-22 13:05:00,015 [Thread-582] INFO  com.openkm.extractor.TextExtractorWorker - processSerial.Working on {docUuid=732f5ba7-bc4d-475d-afa1-4ef0a56305e0, docPath=/okm:root/meadowtec.pdf, docVerUuid=9174bace-50e5-4809-b2de-9b78af185a4a, date=Wed May 22 13:00:03 CEST 2013}
2013-05-22 13:05:00,030 [Thread-582] WARN  com.openkm.extractor.PdfTextExtractor - Failed to extract PDF text content
java.lang.NullPointerException
        at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:100)
        at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
        at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:115)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:243)
        at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
        at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
        at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
        at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
        at com.openkm.extractor.PdfTextExtractor.extractText(PdfTextExtractor.java:70)
        at com.openkm.extractor.RegisteredExtractors.getText(RegisteredExtractors.java:161)
        at com.openkm.dao.NodeDocumentDAO.textExtractorHelper(NodeDocumentDAO.java:1306)
        at com.openkm.extractor.TextExtractorWorker.processSerial(TextExtractorWorker.java:138)
        at com.openkm.extractor.TextExtractorWorker.processQueue(TextExtractorWorker.java:125)
        at com.openkm.extractor.TextExtractorWorker.run(TextExtractorWorker.java:80)
        at sun.reflect.GeneratedMethodAccessor305.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)
        at bsh.Reflect.invokeOnMethod(Unknown Source)
        at bsh.Reflect.invokeObjectMethod(Unknown Source)
        at bsh.BSHPrimarySuffix.doName(Unknown Source)
        at bsh.BSHPrimarySuffix.doSuffix(Unknown Source)
        at bsh.BSHPrimaryExpression.eval(Unknown Source)
        at bsh.BSHPrimaryExpression.eval(Unknown Source)
        at bsh.Interpreter.eval(Unknown Source)
        at bsh.Interpreter.eval(Unknown Source)
        at bsh.Interpreter.eval(Unknown Source)
        at com.openkm.util.ExecutionUtils.runScript(ExecutionUtils.java:112)
        at com.openkm.core.Cron$RunnerBsh.run(Cron.java:103)
        at java.lang.Thread.run(Thread.java:679)
2013-05-22 13:05:00,032 [Thread-582] WARN  com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/meadowtec.pdf': Too few text extracted
Does this tell you anything?
What I forgot to mention in my initial post: PDFs scanned and OCRed with the utility that came with my HP scanner work fine. They are full-text indexed without any problem. But that OCR engine is not as good as ABBYY's.
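
To see how many documents are affected, the extraction warnings can be pulled out of the log in one pass. The following is only a minimal sketch, assuming the NodeDocumentDAO warning always has the shape shown in the excerpts above; the class name `ExtractionFailureScanner` and the method `findFailures` are hypothetical, and the log file location depends on the installation.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenKM.
class ExtractionFailureScanner {
    // Assumption: the warning format matches the log excerpts in this thread.
    private static final Pattern FAILURE = Pattern.compile(
        "WARN\\s+com\\.openkm\\.dao\\.NodeDocumentDAO - "
        + "There was a problem extracting text from '(.+)': (.*)$");

    /** Returns one "path -> reason" entry per extraction warning in the given lines. */
    static List<String> findFailures(List<String> logLines) {
        List<String> failures = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = FAILURE.matcher(line);
            if (m.find()) {
                failures.add(m.group(1) + " -> " + m.group(2));
            }
        }
        return failures;
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to the log file (reads it as UTF-8).
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        for (String failure : findFailures(lines)) {
            System.out.println(failure);
        }
    }
}
```

Run against the log, this prints one line per document whose text could not be extracted, which makes it easier to check whether all of them come from the same PDF producer.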
 #23176  by jllort
 
Is the PDF protected in some way?
From the error alone I cannot tell you anything more. We need to test in a controlled environment to see whether the problem is related to the PDF or to your installation.
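
A rough way to answer the "protected" question without any PDF library is to look for an `/Encrypt` entry, which encrypted PDFs carry in their trailer. This is only a heuristic sketch: the class `PdfEncryptCheck` and the method `mentionsEncrypt` are hypothetical names, and a plain byte search can give false positives (the string appearing elsewhere in the file) and says nothing about other restrictions.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of OpenKM or PDFBox.
class PdfEncryptCheck {
    /** Heuristic: true if the raw bytes contain the PDF name "/Encrypt". */
    static boolean mentionsEncrypt(byte[] pdfBytes) {
        // ISO-8859-1 maps each byte to exactly one char, so binary data is not mangled.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        return raw.contains("/Encrypt");
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to the PDF file to inspect.
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(mentionsEncrypt(bytes)
            ? "found /Encrypt: the file may be protected"
            : "no /Encrypt entry found");
    }
}
```

If no `/Encrypt` entry turns up, protection is an unlikely cause and the font handling shown in the stack trace is the more plausible place to look.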
