Open Source Document Management System | OpenKM

Posted: **Thu Aug 23, 2012 7:03 am**

Hi!

I'm using OpenKM 5.1.10 on openSUSE 12.1.

My searchable PDFs created with ABBYY are not indexed. They are searchable in preview, but I cannot search for words in OpenKM.
I uploaded a test document to your demo system: user0 => searchablepdf => rubiks.pdf. The document is not indexed.

I also tested it with older versions.

Greetings
Stephan

Posted: **Thu Aug 23, 2012 10:23 am**

It seems as if the PDF export from ABBYY does something special to these PDF files ...
After converting the input file with Ghostscript to any of the PDF levels, OpenKM is able to index the documents.

What does OpenKM use to index PDF documents?

Error log when uploading a non-working PDF:

[PdfTextExtractor] Failed to extract PDF text content
java.lang.NullPointerException
        at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:100)
        at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
        at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:115)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:243)
        at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
        at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
        at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
        at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
        at com.openkm.extractor.PdfTextExtractor.extractText(PdfTextExtractor.java:70)
        at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
        at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
        at org.apache.jackrabbit.core.query.lucene.TextExtractorJob$1.call(TextExtractorJob.java:93)
        at EDU.oswego.cs.dl.util.concurrent.FutureResult$1.run(Unknown Source)
        at org.apache.jackrabbit.core.query.lucene.TextExtractorJob.run(TextExtractorJob.java:172)
        at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown Source)

Posted: **Fri Aug 24, 2012 4:21 pm**

Your PDF files can have different kinds of content. If you ran them through ABBYY's OCR engine, you would normally get a PDF with an extra layer containing the text. That does not seem to be the case here; your PDF files appear to be stored as images. Depending on the resolution, an open source OCR engine may or may not be able to index them: below 300 dpi, open source OCR normally cannot index images, whereas the ABBYY engine, for example, works perfectly at 100 dpi (but it is a paid engine that you can use in place of Tesseract or Cuneiform).

Posted: **Sat Aug 25, 2012 8:38 am**

The text is searchable in preview, and also when opening the file with Acrobat Reader.

In the meantime I found a workaround and batch-converted all my PDF files to PDF 1.5 with Ghostscript.
I reset all ABBYY settings and suddenly they are indexed by OpenKM again.
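
A batch conversion like the one described here could be scripted roughly as follows. This is a minimal sketch, not the exact commands used in this thread: it assumes Ghostscript is installed as `gs` on the PATH, and the function names `gs_command` and `convert_all` as well as the directory layout are hypothetical.

```python
import subprocess
from pathlib import Path


def gs_command(src, dst, level="1.5"):
    """Build the Ghostscript argument list that rewrites src as a PDF of the given level."""
    return [
        "gs",                             # assumption: Ghostscript binary is on the PATH
        "-sDEVICE=pdfwrite",              # write a new PDF
        f"-dCompatibilityLevel={level}",  # target PDF level, e.g. 1.5
        "-dNOPAUSE",                      # do not prompt between pages
        "-dBATCH",                        # exit when the input has been processed
        f"-sOutputFile={dst}",
        str(src),
    ]


def convert_all(in_dir, out_dir, level="1.5"):
    """Convert every *.pdf in in_dir and return the list of files written to out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(in_dir.glob("*.pdf")):
        dst = out_dir / src.name
        # check=True raises CalledProcessError if Ghostscript reports a failure
        subprocess.run(gs_command(src, dst, level), check=True)
        written.append(dst)
    return written
```

Calling `convert_all(Path("scans"), Path("converted"))` would write a PDF 1.5 copy of every file in `scans` (a hypothetical directory) to `converted`, leaving the originals untouched so the results can be checked before re-uploading them to OpenKM.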

Thanks
Stephan

Posted: **Sat Aug 25, 2012 9:25 am**

Files are not indexed immediately; the batch queue needs some time to process them, especially if you have uploaded a lot of documents at the same time. In version 6.0 we take more control over the batch queue; in version 5.1 this is delegated to Jackrabbit.

Text in searchable pdfs
