
Re: OCR/Indexing Problem

PostPosted:Sat Oct 22, 2011 4:35 pm
by jllort
Curiously, the error mentions com.openkm.extractor.CuneiformTextExtractor, but you've got Tesseract configured.

Which OpenKM version are you using?

Re: OCR/Indexing Problem

PostPosted:Fri Nov 04, 2011 6:01 am
by Alexires
Sorry, I didn't see your post; it was on the next page.

I'm using Version: 5.1.8-SNAPSHOT (build: 7221)

Re: OCR/Indexing Problem

PostPosted:Fri Nov 04, 2011 4:48 pm
by jllort
Upgrade to the latest build from integration.openkm.com.

Re: OCR/Indexing Problem

PostPosted:Thu Nov 10, 2011 7:06 am
by Alexires
Upgraded and it doesn't throw an error now (except on the preview) but I don't think it is OCRing. I'll leave it for a day or so and see if it gets around to it...

Speaking of that, what does OpenKM do at midnight every night?

Re: OCR/Indexing Problem

PostPosted:Thu Nov 10, 2011 7:52 am
by jllort
At midnight we build the latest SVN code. The next 5.1.x version will be 5.1.8, which covers all the bug fixes we've made since 5.1.7; really it's 5.1.7 with the bugs solved, and we don't introduce new features in that version. While we're waiting to release 5.1.8, you can start using the nightly build, which fixes problems in the current 5.1.7 release (that's the idea).

Re: OCR/Indexing Problem

PostPosted:Sat Nov 12, 2011 4:58 am
by Alexires
Sorry, I didn't make that clear. On my server, at midnight every night, it spins up the HDDs and does something that looks like reading words, or perhaps it is running some kind of cron task?

Any idea when 5.1.8 is coming out? I'll keep upgrading to the latest svn code once a week till it comes out.

Re: OCR/Indexing Problem

PostPosted:Sat Nov 12, 2011 9:37 am
by jllort
Version 5.1.8 is essentially finished, but we haven't found time to release it (that's the current problem); we hope to do it next week.

As for the nightly OpenKM activity, we have an internal procedure that calculates statistics; that's probably what you're noticing. If it's a problem, tell us and we'll look into a solution.

Re: OCR/Indexing Problem

PostPosted:Tue Nov 15, 2011 9:17 am
by Alexires
Nah, not a problem at all. Just noticed and wondered what it did. Thanks for the answer :D

Re: OCR/Indexing Problem

PostPosted:Tue Jan 17, 2012 7:09 am
by Alexires
Has anyone actually managed to get OCR working in their instance of OpenKM? I got sick of fighting with it and took a break for a while, but I'm back and I thought the new version might have helped. Still fighting with errors though, and I'm at a loss.

If someone has managed to get an instance of OpenKM working, could you please give me the details of the install so I can attempt to replicate it?

Re: OCR/Indexing Problem

PostPosted:Thu Jan 19, 2012 11:05 am
by jllort
You should describe your scenario: OS, OpenKM version, your current configuration parameters, and which OCR engine you are trying to configure.

Re: OCR/Indexing Problem

PostPosted:Tue Jan 24, 2012 5:49 am
by Alexires
Alright. This is my current setup:

I have a VPS with OpenKM 5.1.8_2 freshly installed on it. I have installed/configured OpenOffice, ImageMagick, SWFTools, ClamAV and an OpenOffice dictionary.

I have installed Tesseract 3.0.0 and edited the textFilterClasses parameter of SearchIndex in repository.xml to include com.openkm.extractor.Tesseract3TextExtractor, and I have also entered the same value in the database configuration (found in the admin tools).

I am getting this error at the moment:
Code:
08:41:54,429 WARN  [PdfTextExtractor] PDF does not contains text layer
08:41:54,429 WARN  [PdfTextExtractor] Failed to extract PDF text content
java.lang.IllegalArgumentException: Prefix string too short
        at java.io.File.createTempFile(File.java:1782)
        at java.io.File.createTempFile(File.java:1828)
        at com.openkm.extractor.PdfTextExtractor.extractText(PdfTextExtractor.java:89)
        at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
        at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
        at com.openkm.extractor.RegisteredExtractors.getText(RegisteredExtractors.java:75)
        at com.openkm.extractor.RegisteredExtractors.index(RegisteredExtractors.java:117)
        at com.openkm.module.base.BaseDocumentModule.create(BaseDocumentModule.java:161)
        at com.openkm.module.direct.DirectDocumentModule.create(DirectDocumentModule.java:199)
        at com.openkm.module.direct.DirectDocumentModule.create(DirectDocumentModule.java:98)
        at com.openkm.api.OKMDocument.create(OKMDocument.java:71)
        at com.openkm.servlet.frontend.FileUploadServlet.doPost(FileUploadServlet.java:176)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:710)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.jboss.web.tomcat.filters.ReplyHeaderFilter.doFilter(ReplyHeaderFilter.java:96)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:230)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
        at org.jboss.web.tomcat.security.SecurityAssociationValve.invoke(SecurityAssociationValve.java:182)
        at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:524)
        at org.jboss.web.tomcat.security.JaccContextValve.invoke(JaccContextValve.java:84)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at org.jboss.web.tomcat.service.jca.CachedConnectionValve.invoke(CachedConnectionValve.java:157)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:262)
        at org.apache.coyote.ajp.AjpAprProcessor.process(AjpAprProcessor.java:419)
        at org.apache.coyote.ajp.AjpAprProtocol$AjpConnectionHandler.process(AjpAprProtocol.java:378)
        at org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1508)
        at java.lang.Thread.run(Thread.java:662)
08:41:54,431 WARN  [RegisteredExtractors] There was a problem extracting text from '/okm:root/interstellar.pdf'
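
For reference, the exception at the top of that trace comes from java.io.File.createTempFile, which rejects any prefix shorter than three characters. The OpenKM code at PdfTextExtractor.java:89 is not shown in this thread, so what prefix it passes is unknown; the following is only a minimal sketch of the JDK behaviour. The class name TempFilePrefixDemo and the helpers prefixAccepted and safePrefix are hypothetical, made up for illustration, and are not part of OpenKM.

```java
import java.io.File;
import java.io.IOException;

public class TempFilePrefixDemo {

    /**
     * Returns true if File.createTempFile accepts the prefix, false if the JDK
     * rejects it with IllegalArgumentException (prefix shorter than 3 characters).
     */
    static boolean prefixAccepted(String prefix) throws IOException {
        try {
            File tmp = File.createTempFile(prefix, ".pdf");
            tmp.delete(); // clean up the file created by the probe
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    /**
     * Hypothetical workaround: pad a short name with underscores so it meets
     * the three-character minimum. Names that are already long enough are
     * returned unchanged.
     */
    static String safePrefix(String name) {
        StringBuilder sb = new StringBuilder(name == null ? "" : name);
        while (sb.length() < 3) {
            sb.append('_');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println("\"ab\" accepted:  " + prefixAccepted("ab"));
        System.out.println("\"okm\" accepted: " + prefixAccepted("okm"));
        System.out.println("padded \"ab\":    " + safePrefix("ab"));
    }
}
```

If the extractor builds its temp-file prefix from a very short or empty string, padding it along these lines would avoid the exception, but that is an assumption about code not quoted here.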

Re: OCR/Indexing Problem

PostPosted:Tue Jan 24, 2012 12:20 pm
by pavila
Can you attach the PDF document which generates this error?

Re: OCR/Indexing Problem

PostPosted:Tue Jan 31, 2012 9:14 am
by Alexires
Apparently I can't upload the document (document PDF is not allowed).

Re: OCR/Indexing Problem

PostPosted:Wed Feb 01, 2012 9:54 am
by jllort
By default you can upload any document. Have you made any changes to the default OpenKM configuration? What kind of error do you get when you try to upload a PDF file?

Re: OCR/Indexing Problem

PostPosted:Wed Feb 01, 2012 2:56 pm
by pavila
Try uploading the PDF zipped; ZIP attachments are allowed. Anyway, I will take a look at the forum configuration to enable PDF attachments.