Open Source Document Management System | OpenKM

Text in searchable pdfs

5 posts

Text in searchable pdfs

#18241 by STB2010
Thu Aug 23, 2012 7:03 am

Hi!

I'm using OpenKM 5.1.10 on openSUSE 12.1.

My searchable PDFs created with ABBYY are not indexed. They are searchable in the preview, but I cannot search for words in OpenKM.
I uploaded a test document to your demo system: user0 => searchablepdf => rubiks.pdf. The document is not indexed.

I also tested this with older versions.

Greetings
Stephan

STB2010 (Fresh Boarder), joined Fri Mar 18, 2011 7:27 am

Re: Text in searchable pdfs

#18248 by STB2010
Thu Aug 23, 2012 10:23 am

It seems the PDF export from ABBYY does something special to these PDF files ...
After converting the input file with Ghostscript to any of the PDF compatibility levels, OpenKM is able to index the documents.

What is used to index PDF documents in OpenKM?

Error log when uploading a non-working PDF:

[PdfTextExtractor] Failed to extract PDF text content                                                          
java.lang.NullPointerException                                                                                                    
        at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:100)
        at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
        at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:115)                      
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:243)
        at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
        at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
        at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
        at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
        at com.openkm.extractor.PdfTextExtractor.extractText(PdfTextExtractor.java:70)
        at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
        at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
        at org.apache.jackrabbit.core.query.lucene.TextExtractorJob$1.call(TextExtractorJob.java:93)
        at EDU.oswego.cs.dl.util.concurrent.FutureResult$1.run(Unknown Source)
        at org.apache.jackrabbit.core.query.lucene.TextExtractorJob.run(TextExtractorJob.java:172)
        at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown Source)
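
Read from bottom to top, the trace above shows the call chain: Jackrabbit's TextExtractorJob calls OpenKM's PdfTextExtractor, which uses Apache PDFBox's PDFTextStripper, and the NullPointerException is thrown in PDFontFactory.createFont while the page fonts are being loaded. A minimal Python sketch for pulling that chain out of a Java stack trace follows; the helper names (`parse_frames`, `libraries_in_call_order`) are made up for illustration and the sample uses only four frames copied from the log.

```python
import re

# One Java stack frame, e.g.
# "at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)"
FRAME = re.compile(r"^\s*at\s+([\w.$]+)\.([\w$<>]+)\((.*?)\)\s*$")


def parse_frames(trace):
    """Return (class_name, method, location) tuples, innermost frame first."""
    frames = []
    for line in trace.splitlines():
        match = FRAME.match(line)
        if match:
            frames.append(match.groups())
    return frames


def libraries_in_call_order(frames, depth=3):
    """Collapse frames to package prefixes, outermost caller first."""
    seen = []
    for class_name, _method, _location in reversed(frames):
        prefix = ".".join(class_name.split(".")[:depth])
        if not seen or seen[-1] != prefix:
            seen.append(prefix)
    return seen


# Four frames taken from the log in this post (the rest are omitted here).
TRACE = """
        at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:100)
        at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
        at com.openkm.extractor.PdfTextExtractor.extractText(PdfTextExtractor.java:70)
        at org.apache.jackrabbit.core.query.lucene.TextExtractorJob.run(TextExtractorJob.java:172)
"""

if __name__ == "__main__":
    frames = parse_frames(TRACE)
    print("failing frame:", frames[0])
    print("call chain:", " -> ".join(libraries_in_call_order(frames)))
```

The innermost frame is the place to start when reporting the problem, since it names the library and source line where extraction actually failed.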

Re: Text in searchable pdfs

#18264 by jllort
Fri Aug 24, 2012 4:21 pm

PDF files can hold different kinds of content. If you ran the ABBYY OCR engine over them, it should normally produce a PDF with an extra layer containing the text. That does not seem to be your case; your PDF files appear to be stored as images. Depending on the resolution, they may or may not be indexable by an open source OCR engine: below 300 dpi, open source OCR usually cannot extract text from the images, whereas the ABBYY engine works fine at 100 dpi (but it is a commercial engine, which you can replace with Tesseract or Cuneiform).
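
To relate the 300 dpi figure above to an actual scan, the resolution can be worked out from the image's pixel size and the page size. The sketch below assumes the image fills the whole page width and that the page uses the default PDF unit of 72 points per inch; the function names are illustrative, and the 300 dpi threshold is only the rule of thumb from this post, not a hard limit of any OCR engine.

```python
POINTS_PER_INCH = 72.0  # default PDF user-space units per inch


def effective_dpi(pixels, page_points):
    """Resolution of an image spanning `page_points` of page length."""
    if page_points <= 0:
        raise ValueError("page_points must be positive")
    return pixels / (page_points / POINTS_PER_INCH)


def meets_rule_of_thumb(dpi, threshold=300.0):
    """True if the scan reaches the threshold suggested in the post."""
    return dpi >= threshold


if __name__ == "__main__":
    # US Letter is 612 points (8.5 inches) wide.
    for width_px in (2550, 850):
        dpi = effective_dpi(width_px, 612.0)
        print(width_px, "px wide:", round(dpi), "dpi,",
              "ok" if meets_rule_of_thumb(dpi) else "below threshold")
```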

jllort (Moderator), 12193 posts, joined Fri Dec 21, 2007 11:23 am, Sineu (Illes Balears), Spain

Re: Text in searchable pdfs

#18272 by STB2010
Sat Aug 25, 2012 8:38 am

In the preview the text is searchable, and also when opening the file with Acrobat Reader.

In the meantime I found a workaround and batch-converted all my PDF files to PDF 1.5 with Ghostscript.
I also reset all ABBYY settings, and suddenly the files are indexed by OpenKM again.

Thanks
Stephan
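
The exact Ghostscript command used for the batch conversion is not given in this thread. One way it might look, as a sketch: the Python helper below rewrites every PDF in a folder through Ghostscript's `pdfwrite` device with `-dCompatibilityLevel=1.5`. It assumes the `gs` executable is on the PATH; the function names and folder names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_gs_command(src, dst, level="1.5"):
    """Ghostscript command that rewrites `src` as a PDF of the given level."""
    return [
        "gs",                             # assumes Ghostscript is on the PATH
        "-sDEVICE=pdfwrite",              # write PDF output
        f"-dCompatibilityLevel={level}",  # target PDF version
        "-dNOPAUSE",                      # do not pause between pages
        "-dBATCH",                        # exit when the input is processed
        "-dQUIET",                        # suppress startup messages
        f"-sOutputFile={dst}",
        str(src),
    ]


def convert_all(src_dir, dst_dir, level="1.5"):
    """Convert every *.pdf in src_dir into dst_dir; return the written paths."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(src_dir.glob("*.pdf")):
        dst = dst_dir / src.name
        # check=True raises CalledProcessError if Ghostscript fails.
        subprocess.run(build_gs_command(src, dst, level), check=True)
        written.append(dst)
    return written


if __name__ == "__main__":
    # Hypothetical folder names.
    for path in convert_all("abbyy_export", "converted"):
        print("wrote", path)
```

Writing to a separate output folder keeps the original ABBYY exports untouched, so the two versions can be compared if a converted file still fails to index.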

Re: Text in searchable pdfs

#18278 by jllort
Sat Aug 25, 2012 9:25 am

Files are not indexed immediately; the batch queue needs some time to process them, especially if you have uploaded a lot of documents at the same time. In version 6.0 we have more control over the batch queue; in version 5.1 this is delegated to Jackrabbit.
