• problem with textextractor from fax

 #30984  by MartinR
 
Hi,
We receive fax messages with our Fritzbox, which sends them to my email as PDF files. Trying to extract the text from them throws a NullPointerException:
2015-01-21 10:52:45,623 [http-bio-0.0.0.0-8080-exec-5] DEBUG com.openkm.extractor.PdfTextExtractor- TextStripped: ''
2015-01-21 10:52:45,623 [http-bio-0.0.0.0-8080-exec-5] WARN  com.openkm.extractor.PdfTextExtractor- PDF does not contains text layer
2015-01-21 10:52:45,625 [http-bio-0.0.0.0-8080-exec-5] DEBUG com.openkm.extractor.PdfTextExtractor- Writing image: /opt/openkm-6.3.0-community/tomcat/temp/I1FY4679468667909435275.tiff
2015-01-21 10:52:45,633 [http-bio-0.0.0.0-8080-exec-5] WARN  com.openkm.extractor.PdfTextExtractor- Failed to extract PDF text content
java.lang.NullPointerException
        at org.apache.pdfbox.pdmodel.graphics.xobject.PDCcitt$TiffWrapper.buildHeader(PDCcitt.java:562)
        at org.apache.pdfbox.pdmodel.graphics.xobject.PDCcitt$TiffWrapper.<init>(PDCcitt.java:407)
        at org.apache.pdfbox.pdmodel.graphics.xobject.PDCcitt$TiffWrapper.<init>(PDCcitt.java:398)
        at org.apache.pdfbox.pdmodel.graphics.xobject.PDCcitt.write2OutputStream(PDCcitt.java:172)
        at org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage.write2file(PDXObjectImage.java:165)
        at com.openkm.extractor.PdfTextExtractor.extractText(PdfTextExtractor.java:99)
        at com.openkm.extractor.RegisteredExtractors.getText(RegisteredExtractors.java:214)
        at com.openkm.servlet.admin.CheckTextExtractionServlet.doPost(CheckTextExtractionServlet.java:139)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:646)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:727)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:303)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
        at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
        at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:311)
        at org.springframework.security.web.access.intercept.FilterSecurityInterceptor.invoke(FilterSecurityInterceptor.java:116)
        at org.springframework.security.web.access.intercept.FilterSecurityInterceptor.doFilter(FilterSecurityInterceptor.java:83)
        at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:323)
        at org.springframework.security.web.access.ExceptionTranslationFilter.doFilter(ExceptionTranslationFilter.java:113)
        at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:323)
        at org.springframework.security.web.session.SessionManagementFilter.doFilter(SessionManagementFilter.java:101)
        at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:323)
        at org.springframework.security.web.authentication.AnonymousAuthenticationFilter.doFilter(AnonymousAuthenticationFilter.java:113)
        at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:323)
        at org.springframework.security.web.servletapi.SecurityContextHolderAwareRequestFilter.doFilter(SecurityContextHolderAwareRequestFilter.java:54)
        at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:323)
        at org.springframework.security.web.savedrequest.RequestCacheAwareFilter.doFilter(RequestCacheAwareFilter.java:45)
        at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:323)
        at org.springframework.security.web.authentication.AbstractAuthenticationProcessingFilter.doFilter(AbstractAuthenticationProcessingFilter.java:182)
        at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:323)
        at org.springframework.security.web.context.SecurityContextPersistenceFilter.doFilter(SecurityContextPersistenceFilter.java:87)
        at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:323)
        at org.springframework.security.web.FilterChainProxy.doFilter(FilterChainProxy.java:173)
        at org.springframework.web.filter.DelegatingFilterProxy.invokeDelegate(DelegatingFilterProxy.java:346)
        at org.springframework.web.filter.DelegatingFilterProxy.doFilter(DelegatingFilterProxy.java:259)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:220)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
        at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:501)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:170)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
        at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:950)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
        at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1040)
        at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:607)
        at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:315)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
All other PDFs work fine.

OpenKM 6.3.0 Community on Fedora 21, 64-bit.

Regards Martin
 #31027  by jllort
 
I've debugged the source code and everything seems right. Sometimes this kind of problem comes from security restrictions applied to the PDF document, but that does not seem to be the case here. The document seems to contain a TIFF image. My suggestion is to scan your documents as PDF, or as PNG and then convert them to PDF; this TIFF format does not seem to cope well with content extraction.

How are you doing the conversion: directly with the scanner tool, or with some process beforehand? Take a look at the options available in the application you use to convert scanned documents to PDF.
 #31078  by jllort
 
Can you try another tool? When the code tries to extract the TIFF image embedded in the PDF document, it returns null, which is quite strange. If you can only use this software for the conversion, take a look at its options, for example whether it can store the image inside the PDF in JPG or PNG format (not TIFF).

My suggestion: take a fax, scan it, and convert it with another tool, to determine whether the problem is on that side, as I suspect.
 #31083  by MartinR
 
Most of our faxes are spam, so we let the fritzbox receive them. They are never printed.
When I use ImageMagick on it, I get a TIFF (convert -density 400x400 27.01.15_12.09_Telefax.unbekannt.pdf -resize 25% test.tif), but the quality of the text extraction is very poor.
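
A possible factor in the poor result, offered only as a hypothesis: rendering at 400 dpi and then resizing to 25% leaves a raster with the pixel dimensions of a 100 dpi render, and OCR engines are commonly said to work better at around 300 dpi. Nothing in this thread confirms that this is the cause. The arithmetic can be sketched as follows; the class and method names are made up for illustration.

```java
// Sketch: resolution that a raster corresponds to after rendering at one
// density and then applying a percentage resize.
final class EffectiveDpi {

    /** Returns the resolution equivalent of the final, resized raster. */
    static double effectiveDpi(double renderDpi, double resizePercent) {
        return renderDpi * resizePercent / 100.0;
    }

    public static void main(String[] args) {
        // Values taken from the convert command above: -density 400x400 -resize 25%
        System.out.println(effectiveDpi(400, 25)); // prints 100.0
    }
}
```

If that is the issue, dropping -resize and rendering directly at the target density (for example -density 300) might be worth a try.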
 #31099  by pavila
 
The PDFBox library we use to extract the images from the PDF has a problem with the TIFF image inside your PDF and fails. I don't know whether this PDF was generated incorrectly or whether it's a bug in PDFBox.
 #31104  by MartinR
 
I have no problems with these PDF files in other software; they can all read them. So I guess it is something in PDFBox.
 #31117  by pavila
 
The problem is not reading the PDF, but extracting the image inside it. This PDF does not contain text, so OpenKM tried to extract the images to run OCR on them.
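
The flow described here (use the text layer if there is one, otherwise export the embedded images and OCR them) can be sketched roughly as below. This is not OpenKM's or PDFBox's actual code: EmbeddedImage, extract, and the ocr function are hypothetical stand-ins, and the per-image try/catch is just one way to keep a single image that fails to export, like the TIFF in the log above, from aborting the whole extraction.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

final class OcrFallbackSketch {

    /** Hypothetical stand-in for one embedded image; not an OpenKM or PDFBox type. */
    interface EmbeddedImage {
        byte[] export() throws Exception; // may fail, as the TIFF export did in the log
    }

    /** Returns the text layer if present, otherwise the OCR text of every exportable image. */
    static String extract(String textLayer, List<EmbeddedImage> images,
                          Function<byte[], String> ocr, List<String> warnings) {
        if (textLayer != null && !textLayer.trim().isEmpty()) {
            return textLayer; // normal case: the PDF already contains text
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < images.size(); i++) {
            try {
                sb.append(ocr.apply(images.get(i).export())).append('\n');
            } catch (Exception e) { // also catches NullPointerException
                warnings.add("image " + i + " skipped: " + e);
            }
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        List<EmbeddedImage> images = new ArrayList<>();
        images.add(() -> "page one".getBytes(StandardCharsets.US_ASCII));
        images.add(() -> { throw new NullPointerException("broken image"); });

        // Fake OCR for the demo: treats the image bytes as ASCII text.
        Function<byte[], String> fakeOcr = b -> new String(b, StandardCharsets.US_ASCII);

        List<String> warnings = new ArrayList<>();
        System.out.println(extract("", images, fakeOcr, warnings)); // prints page one
        System.out.println(warnings.size());                        // prints 1
    }
}
```

A guard like this would not recover text from the problematic image itself; it only limits the damage to that one image.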
