problem with textextractor from fax

Posted: Wed Jan 21, 2015 1:47 pm
by MartinR
Hi,
We receive fax messages with our Fritzbox, which sends them to my email as PDFs. Trying to extract the text from one of them throws a NullPointerException:
Code:
2015-01-21 10:52:45,623 [http-bio-0.0.0.0-8080-exec-5] DEBUG com.openkm.extractor.PdfTextExtractor- TextStripped: ''
2015-01-21 10:52:45,623 [http-bio-0.0.0.0-8080-exec-5] WARN  com.openkm.extractor.PdfTextExtractor- PDF does not contains text layer
2015-01-21 10:52:45,625 [http-bio-0.0.0.0-8080-exec-5] DEBUG com.openkm.extractor.PdfTextExtractor- Writing image: /opt/openkm-6.3.0-community/tomcat/temp/I1FY4679468667909435275.tiff
2015-01-21 10:52:45,633 [http-bio-0.0.0.0-8080-exec-5] WARN  com.openkm.extractor.PdfTextExtractor- Failed to extract PDF text content
java.lang.NullPointerException
        at org.apache.pdfbox.pdmodel.graphics.xobject.PDCcitt$TiffWrapper.buildHeader(PDCcitt.java:562)
        at org.apache.pdfbox.pdmodel.graphics.xobject.PDCcitt$TiffWrapper.<init>(PDCcitt.java:407)
        at org.apache.pdfbox.pdmodel.graphics.xobject.PDCcitt$TiffWrapper.<init>(PDCcitt.java:398)
        at org.apache.pdfbox.pdmodel.graphics.xobject.PDCcitt.write2OutputStream(PDCcitt.java:172)
        at org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage.write2file(PDXObjectImage.java:165)
        at com.openkm.extractor.PdfTextExtractor.extractText(PdfTextExtractor.java:99)
        at com.openkm.extractor.RegisteredExtractors.getText(RegisteredExtractors.java:214)
        at com.openkm.servlet.admin.CheckTextExtractionServlet.doPost(CheckTextExtractionServlet.java:139)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:646)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:727)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:303)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
        at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
        at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:311)
        at org.springframework.security.web.access.intercept.FilterSecurityInterceptor.invoke(FilterSecurityInterceptor.java:116)
        at org.springframework.security.web.access.intercept.FilterSecurityInterceptor.doFilter(FilterSecurityInterceptor.java:83)
        at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:323)
        at org.springframework.security.web.access.ExceptionTranslationFilter.doFilter(ExceptionTranslationFilter.java:113)
        at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:323)
        at org.springframework.security.web.session.SessionManagementFilter.doFilter(SessionManagementFilter.java:101)
        at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:323)
        at org.springframework.security.web.authentication.AnonymousAuthenticationFilter.doFilter(AnonymousAuthenticationFilter.java:113)
        at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:323)
        at org.springframework.security.web.servletapi.SecurityContextHolderAwareRequestFilter.doFilter(SecurityContextHolderAwareRequestFilter.java:54)
        at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:323)
        at org.springframework.security.web.savedrequest.RequestCacheAwareFilter.doFilter(RequestCacheAwareFilter.java:45)
        at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:323)
        at org.springframework.security.web.authentication.AbstractAuthenticationProcessingFilter.doFilter(AbstractAuthenticationProcessingFilter.java:182)
        at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:323)
        at org.springframework.security.web.context.SecurityContextPersistenceFilter.doFilter(SecurityContextPersistenceFilter.java:87)
        at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:323)
        at org.springframework.security.web.FilterChainProxy.doFilter(FilterChainProxy.java:173)
        at org.springframework.web.filter.DelegatingFilterProxy.invokeDelegate(DelegatingFilterProxy.java:346)
        at org.springframework.web.filter.DelegatingFilterProxy.doFilter(DelegatingFilterProxy.java:259)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:220)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
        at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:501)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:170)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
        at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:950)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
        at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1040)
        at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:607)
        at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:315)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
All other PDFs work fine.

OpenKM 6.3.0 Community on Fedora 21, 64-bit.

Regards, Martin

Re: problem with textextractor from fax

Posted: Sat Jan 24, 2015 9:26 am
by jllort
I've debugged the source code and everything seems right. Sometimes the problem comes from security restrictions applied to the PDF document, but that does not seem to be the case here. The document appears to contain a TIFF image. My suggestion is to scan your documents as PDF, or as PNG and then convert to PDF; this TIFF format does not seem to work well when we try to extract content.

How are you doing the conversion: directly with the scanner tool, or with some process beforehand? Take a look at the options available in the application you use to convert scanned documents to PDF.

Re: problem with textextractor from fax

Posted: Mon Jan 26, 2015 10:05 am
by MartinR
The PDF is created by my Fritzbox 7270 (DSL router) when it receives a fax message.

Re: problem with textextractor from fax

Posted: Thu Jan 29, 2015 11:33 am
by jllort
Can you try another tool? When the code tries to extract the TIFF image embedded in the PDF document, it returns null, which is quite strange. If you can only use this software for the conversion, take a look at its options, for example whether it can store the image in the PDF as JPG or PNG (not TIFF).

My suggestion: take a fax, scan it, and convert it with another tool, to check whether the problem lies where I suspect.

Re: problem with textextractor from fax

Posted: Thu Jan 29, 2015 12:21 pm
by MartinR
Most of our faxes are spam, so we let the Fritzbox receive them. They are never printed.
When I run ImageMagick on one of them, I get a TIFF (convert -density 400x400 27.01.15_12.09_Telefax.unbekannt.pdf -resize 25% test.tif), but the quality of the text extraction is very poor.

Re: problem with textextractor from fax

Posted: Fri Jan 30, 2015 6:06 am
by pavila
The PDFBox library we use to extract images from the PDF has a problem with the TIFF image inside your PDF and fails. I don't know whether this PDF was generated incorrectly or whether it's a bug in PDFBox.

Re: problem with textextractor from fax

Posted: Fri Jan 30, 2015 10:16 am
by MartinR
I have no problems with these PDF files in other software; every other program can read them. So I guess the problem is in PDFBox.

Re: problem with textextractor from fax

Posted: Sat Jan 31, 2015 7:46 am
by pavila
The problem is not reading the PDF, but extracting the image inside it. This PDF does not contain text, so OpenKM tried to extract the images in order to run OCR on them.
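
To make the fallback described above concrete, here is a minimal sketch in plain Java. It is not OpenKM's or PDFBox's actual code: the class OcrFallbackSketch, the EmbeddedImage interface, and the ocr function are hypothetical names invented for illustration. It only shows the general idea: use the text layer if there is one, otherwise export each embedded image and run OCR on it, skipping any image whose export fails (as the TIFF export does in the stack trace) rather than aborting the whole extraction.

```java
import java.util.List;
import java.util.function.Function;

class OcrFallbackSketch {

    /** Hypothetical stand-in for one image embedded in a PDF. */
    interface EmbeddedImage {
        String name();
        // Exporting may fail, for example with a NullPointerException.
        byte[] export() throws Exception;
    }

    /**
     * Returns the text layer if it is non-empty; otherwise runs the given
     * OCR function over every image that can be exported.
     */
    static String extractText(String textLayer,
                              List<EmbeddedImage> images,
                              Function<byte[], String> ocr) {
        if (textLayer != null && !textLayer.trim().isEmpty()) {
            return textLayer;
        }
        StringBuilder result = new StringBuilder();
        for (EmbeddedImage image : images) {
            try {
                byte[] data = image.export();
                result.append(ocr.apply(data)).append('\n');
            } catch (Exception e) {
                // One broken image should not discard the rest of the document.
                System.err.println("Skipping image " + image.name() + ": " + e);
            }
        }
        return result.toString().trim();
    }
}
```

With this structure, a document containing one image that cannot be exported still yields the OCR text of the remaining images. Whether OpenKM should behave this way is a design decision for its developers; the sketch only illustrates the option.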