Text in searchable PDFs

 #18241  by STB2010
 
Hi!

I'm using OpenKM 5.1.10 on openSUSE 12.1.

My searchable PDFs created with ABBYY are not indexed. They are searchable in preview, but I cannot search for words in them in OpenKM.
I uploaded a test document to your demo system: user0 => searchablepdf => rubiks.pdf. The document is not indexed.

I also tested this with older versions.

Greetings
Stephan
 #18248  by STB2010
 
It seems as if the PDF export from ABBYY does something special to these PDF files ...
After converting the input file with Ghostscript to each of the different PDF levels, OpenKM is able to index the documents.

What is used to index PDF documents in OpenKM?

Error log when uploading a non-working PDF:
[PdfTextExtractor] Failed to extract PDF text content                                                          
java.lang.NullPointerException                                                                                                    
        at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:100)
        at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
        at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:115)                      
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:243)
        at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
        at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
        at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
        at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
        at com.openkm.extractor.PdfTextExtractor.extractText(PdfTextExtractor.java:70)
        at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
        at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
        at org.apache.jackrabbit.core.query.lucene.TextExtractorJob$1.call(TextExtractorJob.java:93)
        at EDU.oswego.cs.dl.util.concurrent.FutureResult$1.run(Unknown Source)
        at org.apache.jackrabbit.core.query.lucene.TextExtractorJob.run(TextExtractorJob.java:172)
        at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown Source)
 #18264  by jllort
 
PDF files can hold different kinds of content. If you ran the ABBYY OCR engine over them, it should normally create PDFs with an extra layer containing the text. That does not seem to be your case: your PDF files appear to be stored as images. Whether such files can be indexed by an open source OCR engine depends on the resolution. For example, below 300 dpi open source OCR normally cannot index images, while the ABBYY engine works perfectly at 100 dpi (but that is a paid engine, which you can use in place of Tesseract or Cuneiform).
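
One rough way to tell an image-only PDF from one that carries a text layer is to ask Ghostscript to dump the text and see whether anything comes back. The sketch below assumes a reasonably recent Ghostscript (one that ships the txtwrite device) is installed as gs; the function names are made up for illustration and are not part of OpenKM. Note that a positive result only shows that Ghostscript found text. It does not guarantee that the extractor used by OpenKM will be able to read the same file.

```python
import subprocess


def txtwrite_command(pdf_path):
    """Build a Ghostscript call that dumps a PDF's text to stdout."""
    return [
        "gs", "-q", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=txtwrite",   # Ghostscript's text-extraction device
        "-sOutputFile=-",      # write the extracted text to stdout
        str(pdf_path),
    ]


def has_text(extracted):
    """True if the extracted output contains any non-whitespace characters."""
    return bool(extracted.strip())


def pdf_has_text_layer(pdf_path):
    """Run Ghostscript and report whether it found any text in the PDF."""
    result = subprocess.run(
        txtwrite_command(pdf_path),
        capture_output=True, text=True, errors="replace", check=True,
    )
    return has_text(result.stdout)
```

For example, pdf_has_text_layer("rubiks.pdf") would be expected to return False for a pure scan without OCR output.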
 #18272  by STB2010
 
The text is searchable in preview, and also when opening the file with Acrobat Reader.

In the meantime I found a workaround and batch-converted all my PDF files to PDF 1.5 with Ghostscript.
I reset all the ABBYY settings and suddenly they are indexed by OpenKM again.
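
For anyone who wants to try the same workaround, a batch conversion along these lines could look roughly like the sketch below. It assumes Ghostscript is installed as gs and uses its pdfwrite device with -dCompatibilityLevel; the folder layout and function names are only illustrative, and the exact options used in the original conversion are not known.

```python
import subprocess
from pathlib import Path


def gs_convert_command(src, dst, level="1.5"):
    """Build the Ghostscript call that rewrites src as a PDF of the given level."""
    return [
        "gs", "-q", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=pdfwrite",
        f"-dCompatibilityLevel={level}",
        f"-sOutputFile={dst}",
        str(src),
    ]


def convert_folder(in_dir, out_dir, level="1.5"):
    """Convert every *.pdf in in_dir, writing the results into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    converted = []
    for src in sorted(Path(in_dir).glob("*.pdf")):
        dst = out / src.name
        # check=True stops the batch at the first file Ghostscript rejects
        subprocess.run(gs_convert_command(src, dst, level), check=True)
        converted.append(dst)
    return converted
```

Writing into a separate output folder keeps the original ABBYY files untouched, so they can still be compared against the converted ones if indexing fails again.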

Thanks
Stephan
 #18278  by jllort
 
Files are not indexed immediately; the batch queue needs some time to process them, especially if you have uploaded a lot of documents at the same time. In version 6.0 we take more control of the batch queue; in version 5.1 this is delegated to Jackrabbit.
