Text in searchable pdfs
Posted: Thu Aug 23, 2012 7:03 am
by STB2010
Hi!
I'm using OpenKM 5.1.10 on openSUSE 12.1.
My searchable PDFs created with ABBYY are not indexed. Their text is searchable in the preview, but I cannot search for words in OpenKM.
I uploaded a test document to your demo system: user0 => searchablepdf => rubiks.pdf. The document is not indexed.
I also tested it with older versions.
Greetings
Stephan
Re: Text in searchable pdfs
Posted: Thu Aug 23, 2012 10:23 am
by STB2010
It seems as if the PDF export from ABBYY does something special to these PDF files ...
After converting the input file with Ghostscript to each of the different PDF levels, OpenKM is able to index the documents.
What is used to index PDF documents in OpenKM?
Error log when uploading a non-working PDF:
Code:
[PdfTextExtractor] Failed to extract PDF text content
java.lang.NullPointerException
at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:100)
at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:115)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:243)
at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
at com.openkm.extractor.PdfTextExtractor.extractText(PdfTextExtractor.java:70)
at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
at org.apache.jackrabbit.core.query.lucene.TextExtractorJob$1.call(TextExtractorJob.java:93)
at EDU.oswego.cs.dl.util.concurrent.FutureResult$1.run(Unknown Source)
at org.apache.jackrabbit.core.query.lucene.TextExtractorJob.run(TextExtractorJob.java:172)
at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown Source)
Re: Text in searchable pdfs
Posted: Fri Aug 24, 2012 4:21 pm
by jllort
Your PDF files can contain different kinds of content. If you have run them through an OCR engine such as ABBYY, the resulting PDF should normally have an extra layer containing the text. That does not seem to be your case: your PDF files appear to be stored as image-only PDFs. Depending on the resolution, they may or may not be indexable by an open source OCR engine. Below 300 dpi, for example, open source OCR usually cannot index images, whereas the ABBYY engine works perfectly at 100 dpi (but it is a commercial engine, which you can use in place of Tesseract or Cuneiform).
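One way to check whether a PDF really carries a text layer, independently of OpenKM, is to run an external extractor over it. The following is a minimal sketch, assuming poppler's `pdftotext` command-line tool is installed; the function names and the `min_chars` threshold are illustrative assumptions, not part of OpenKM. Note that OpenKM 5.1 extracts text through PDFBox (see the stack trace above), so a file that passes this check can still fail to index there.

```python
import subprocess


def extract_text(pdf_path, runner=subprocess.run):
    """Return the text that pdftotext extracts from pdf_path (may be empty)."""
    # Passing "-" as the output file makes pdftotext write to stdout.
    result = runner(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def has_text_layer(pdf_path, min_chars=20, runner=subprocess.run):
    """Heuristic: the PDF has a text layer if enough non-blank characters come out.

    min_chars is an arbitrary threshold chosen for this sketch.
    """
    text = extract_text(pdf_path, runner=runner)
    non_blank = "".join(text.split())  # drop spaces, newlines, form feeds
    return len(non_blank) >= min_chars
```

Calling `has_text_layer("rubiks.pdf")` on an image-only scan would be expected to return `False`, while a PDF with an OCR text layer should return `True`.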
Re: Text in searchable pdfs
Posted: Sat Aug 25, 2012 8:38 am
by STB2010
The text is searchable in the preview, and also when opening the file with Acrobat Reader.
In the meantime I found a workaround and batch-converted all my PDF files to PDF 1.5 with Ghostscript.
I reset all ABBYY settings, and suddenly the files are indexed by OpenKM again.
Thanks
Stephan
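The exact Ghostscript command used for this workaround is not given in the thread. A batch conversion of this kind could look roughly like the sketch below, which assumes the `gs` executable is on the PATH and uses the `pdfwrite` device with `-dCompatibilityLevel=1.5`. The function names and the directory layout are hypothetical.

```python
import subprocess
from pathlib import Path


def gs_command(src, dst, level="1.5"):
    """Build a Ghostscript command that rewrites src as a PDF of the given level."""
    return [
        "gs",
        "-sDEVICE=pdfwrite",              # re-emit the document as PDF
        f"-dCompatibilityLevel={level}",  # target PDF version, e.g. 1.5
        "-dNOPAUSE",
        "-dBATCH",
        "-dQUIET",
        f"-sOutputFile={dst}",
        str(src),
    ]


def convert_all(src_dir, dst_dir, level="1.5", runner=subprocess.run):
    """Convert every *.pdf in src_dir, writing the results into dst_dir."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    converted = []
    for src in sorted(src_dir.glob("*.pdf")):
        dst = dst_dir / src.name
        # check=True raises CalledProcessError if Ghostscript fails on a file.
        runner(gs_command(src, dst, level), check=True)
        converted.append(dst)
    return converted
```

Writing into a separate output directory keeps the originals untouched, which matters because Ghostscript should not read from and write to the same file. A call such as `convert_all("scans", "scans_pdf15")` would then produce the files to upload.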
Re: Text in searchable pdfs
Posted: Sat Aug 25, 2012 9:25 am
by jllort
Files are not indexed immediately; the batch queue needs some time to process them, especially if you have uploaded a lot of documents at the same time. In version 6.0 we have more control over the batch queue; in version 5.1 this is delegated to Jackrabbit.