Re: OCR/Indexing Problem
PostPosted:Sat Oct 22, 2011 4:35 pm
by jllort
Curiously, the error mentions com.openkm.extractor.CuneiformTextExtractor, but you've got Tesseract configured.
Which OpenKM version are you using?
Re: OCR/Indexing Problem
PostPosted:Fri Nov 04, 2011 6:01 am
by Alexires
Sorry, I didn't see your post; it was on the next page.
I'm using Version: 5.1.8-SNAPSHOT (build: 7221)
Re: OCR/Indexing Problem
PostPosted:Fri Nov 04, 2011 4:48 pm
by jllort
Upgrade to the latest build from integration.openkm.com.
Re: OCR/Indexing Problem
PostPosted:Thu Nov 10, 2011 7:06 am
by Alexires
Upgraded and it doesn't throw an error now (except on the preview) but I don't think it is OCRing. I'll leave it for a day or so and see if it gets around to it...
Speaking of that, what does OpenKM do at midnight every night?
Re: OCR/Indexing Problem
PostPosted:Thu Nov 10, 2011 7:52 am
by jllort
At midnight we build the latest SVN code. The next 5.1.x version will be 5.1.8, which covers all the bug fixes we've made since 5.1.7; really it's 5.1.7 with the bugs solved, and we don't introduce new features in that version. While we wait for the 5.1.8 release, you can start using the nightly build, which solves problems in the current 5.1.7 release (that's the idea).
Re: OCR/Indexing Problem
PostPosted:Sat Nov 12, 2011 4:58 am
by Alexires
Sorry, I didn't make that clear. On my server, at midnight every night, it spins up the HDDs and does something that looks like it is reading words. Or perhaps it is running some kind of cron task?
Any idea when 5.1.8 is coming out? I'll keep upgrading to the latest svn code once a week till it comes out.
Re: OCR/Indexing Problem
PostPosted:Sat Nov 12, 2011 9:37 am
by jllort
Version 5.1.8 is really closed, but we've not found time to release it (that's the actual problem); we hope to do it next week.
About the nightly OpenKM operation: we've got an internal procedure that calculates statistics, and that is probably what you're noticing. If it's a problem, tell us and we'll look into a solution.
Re: OCR/Indexing Problem
PostPosted:Tue Nov 15, 2011 9:17 am
by Alexires
Nah, not a problem at all. Just noticed and wondered what it did. Thanks for the answer.

Re: OCR/Indexing Problem
PostPosted:Tue Jan 17, 2012 7:09 am
by Alexires
Has anyone actually managed to get OCR working in their instance of OpenKM? I got sick of fighting with it and took a break for a while, but I'm back and I thought the new version might have helped. Still fighting with errors though, and I'm at a loss.
If someone has managed to get an instance of OpenKM working, could you please give me the details of the install so I can attempt to replicate it?
Re: OCR/Indexing Problem
PostPosted:Thu Jan 19, 2012 11:05 am
by jllort
You should describe your scenario: OS, OpenKM version, your current configuration parameters, and which OCR engine you are trying to configure.
Re: OCR/Indexing Problem
PostPosted:Tue Jan 24, 2012 5:49 am
by Alexires
Alright. This is my current setup:
I have a VPS with OpenKM 5.1.8_2 freshly installed on it. I have installed/configured OpenOffice, ImageMagick, SWFTools, ClamAV and an OpenOffice dictionary.
I have installed Tesseract 3.0.0 and have edited the textFilterClasses parameter of SearchIndex in repository.xml to include com.openkm.extractor.Tesseract3TextExtractor, and have also inserted the same into the database configuration (found in the admin tools).
I am getting this error at the moment:
Code:
08:41:54,429 WARN [PdfTextExtractor] PDF does not contains text layer
08:41:54,429 WARN [PdfTextExtractor] Failed to extract PDF text content
java.lang.IllegalArgumentException: Prefix string too short
at java.io.File.createTempFile(File.java:1782)
at java.io.File.createTempFile(File.java:1828)
at com.openkm.extractor.PdfTextExtractor.extractText(PdfTextExtractor.java:89)
at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
at com.openkm.extractor.RegisteredExtractors.getText(RegisteredExtractors.java:75)
at com.openkm.extractor.RegisteredExtractors.index(RegisteredExtractors.java:117)
at com.openkm.module.base.BaseDocumentModule.create(BaseDocumentModule.java:161)
at com.openkm.module.direct.DirectDocumentModule.create(DirectDocumentModule.java:199)
at com.openkm.module.direct.DirectDocumentModule.create(DirectDocumentModule.java:98)
at com.openkm.api.OKMDocument.create(OKMDocument.java:71)
at com.openkm.servlet.frontend.FileUploadServlet.doPost(FileUploadServlet.java:176)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:710)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.jboss.web.tomcat.filters.ReplyHeaderFilter.doFilter(ReplyHeaderFilter.java:96)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:230)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
at org.jboss.web.tomcat.security.SecurityAssociationValve.invoke(SecurityAssociationValve.java:182)
at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:524)
at org.jboss.web.tomcat.security.JaccContextValve.invoke(JaccContextValve.java:84)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at org.jboss.web.tomcat.service.jca.CachedConnectionValve.invoke(CachedConnectionValve.java:157)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:262)
at org.apache.coyote.ajp.AjpAprProcessor.process(AjpAprProcessor.java:419)
at org.apache.coyote.ajp.AjpAprProtocol$AjpConnectionHandler.process(AjpAprProtocol.java:378)
at org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1508)
at java.lang.Thread.run(Thread.java:662)
08:41:54,431 WARN [RegisteredExtractors] There was a problem extracting text from '/okm:root/interstellar.pdf'
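A note on the exception itself: "Prefix string too short" is raised by java.io.File.createTempFile, which requires a prefix of at least three characters. What OpenKM actually passes as the prefix at PdfTextExtractor.java:89 is not shown in this thread, so the following is only a minimal sketch of how the exception arises in plain Java and one way a caller could guard against it. The class name TempFilePrefixDemo and the helpers safePrefix and isRejected are hypothetical and are not part of OpenKM.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;

public class TempFilePrefixDemo {

    // Pads a prefix with underscores up to the three-character
    // minimum that File.createTempFile requires.
    public static String safePrefix(String prefix) {
        StringBuilder sb = new StringBuilder(prefix == null ? "" : prefix);
        while (sb.length() < 3) {
            sb.append('_');
        }
        return sb.toString();
    }

    // Returns true if File.createTempFile rejects the prefix with
    // IllegalArgumentException; otherwise deletes the created file
    // and returns false.
    public static boolean isRejected(String prefix) {
        try {
            File tmp = File.createTempFile(prefix, ".pdf");
            tmp.delete();
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A two-character prefix is rejected.
        System.out.println("\"ok\" rejected: " + isRejected("ok"));
        // The padded prefix "ok_" is accepted.
        System.out.println("\"ok_\" rejected: " + isRejected(safePrefix("ok")));
    }
}
```

If the prefix in OpenKM is derived from something short (that is an assumption, not something the trace confirms), padding it in this way would avoid the exception; the actual cause can only be confirmed from the OpenKM source.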
Re: OCR/Indexing Problem
PostPosted:Tue Jan 24, 2012 12:20 pm
by pavila
Can you attach the PDF document which generates this error?
Re: OCR/Indexing Problem
PostPosted:Tue Jan 31, 2012 9:14 am
by Alexires
Apparently I can't upload the document (document PDF is not allowed).
Re: OCR/Indexing Problem
PostPosted:Wed Feb 01, 2012 9:54 am
by jllort
By default you can upload any document. Have you made any changes to the default OpenKM configuration? What kind of error do you get when you try to upload a PDF file?
Re: OCR/Indexing Problem
PostPosted:Wed Feb 01, 2012 2:56 pm
by pavila
Try to upload the zipped PDF. ZIP extensions are allowed. Anyway I will take a look at the forum configuration to enable PDF attachments.