Open Source Document Management System | OpenKM

Posted: **Sat Oct 22, 2011 4:35 pm**

Curiously, the error mentions com.openkm.extractor.CuneiformTextExtractor, but you've got Tesseract configured.

Which OpenKM version are you using?

Posted: **Fri Nov 04, 2011 6:01 am**

Sorry, I didn't see your post; it was on the next page.

I'm using Version: 5.1.8-SNAPSHOT (build: 7221)

Posted: **Fri Nov 04, 2011 4:48 pm**

Upgrade to the latest build from integration.openkm.com.

Posted: **Thu Nov 10, 2011 7:06 am**

Upgraded and it doesn't throw an error now (except on the preview) but I don't think it is OCRing. I'll leave it for a day or so and see if it gets around to it...

Speaking of that, what does OpenKM do at midnight every night?

Posted: **Thu Nov 10, 2011 7:52 am**

At midnight we build the latest SVN code. The next 5.1.x version will be 5.1.8, and it covers all the bug fixes we've made since 5.1.7; really it's 5.1.7 with the bugs solved, and we don't introduce new features in that version. While we wait for the 5.1.8 release, you can start using the nightly build, which fixes problems found in the current release, 5.1.7 (that's the idea).

Posted: **Sat Nov 12, 2011 4:58 am**

Sorry, I didn't make that clear. On my server, at midnight every night, it spins up the HDDs and does something; it looks like it reads words, or perhaps it is running some kind of cron task?

Any idea when 5.1.8 is coming out? I'll keep upgrading to the latest svn code once a week till it comes out.

Posted: **Sat Nov 12, 2011 9:37 am**

Version 5.1.8 is really closed, but we haven't found time to release it (that's the actual problem). We hope to do it next week.

About the nightly OpenKM operation: we have an internal procedure that calculates statistics, and that is probably what you're detecting. If it's a problem, tell us and we'll look into a solution.

Posted: **Tue Nov 15, 2011 9:17 am**

Nah, not a problem at all. Just noticed and wondered what it did. Thanks for the answer

Posted: **Tue Jan 17, 2012 7:09 am**

Has anyone actually managed to get OCR working in their instance of OpenKM? I got sick of fighting with it and took a break for a while, but I'm back and I thought the new version might have helped. Still fighting with errors though, and I'm at a loss.

If someone has managed to get an instance of OpenKM working, could you please give me the details of the install so I can attempt to replicate it?

Posted: **Thu Jan 19, 2012 11:05 am**

You should describe your scenario: OS, OpenKM version, your current configuration parameters, and which OCR engine you are trying to configure.

Posted: **Tue Jan 24, 2012 5:49 am**

Alright. This is my current setup:

I have a VPS with OpenKM 5.1.8_2 freshly installed on it. I have installed/configured OpenOffice, ImageMagick, SWFTools, ClamAV and an OpenOffice dictionary.

I have installed Tesseract 3.0.0 and have edited the textFilterClasses parameter of SearchIndex in repository.xml to include com.openkm.extractor.Tesseract3TextExtractor, and have also inserted the same into the database configuration (found in the admin tools).
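A quick sanity check on a setting like the one described above is to split the textFilterClasses value and confirm which extractor classes it actually names, since an earlier reply in this thread noticed a Cuneiform extractor in the error while Tesseract was the one configured. The following is only a minimal sketch, assuming the value is a comma-separated list of class names; the class `ExtractorListCheck`, its methods, and the sample value are hypothetical and not part of OpenKM.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenKM or Jackrabbit.
final class ExtractorListCheck {

    // Splits a comma-separated list of class names, trimming whitespace
    // and skipping empty entries.
    static List<String> parseClassNames(String value) {
        List<String> names = new ArrayList<>();
        if (value == null) {
            return names;
        }
        for (String part : value.split(",")) {
            String name = part.trim();
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    // True when the given class name appears in the configured list.
    static boolean isConfigured(String value, String className) {
        return parseClassNames(value).contains(className);
    }

    public static void main(String[] args) {
        // Sample value for illustration only; a real repository.xml will differ.
        String value = "com.openkm.extractor.PdfTextExtractor, "
                + "com.openkm.extractor.Tesseract3TextExtractor";
        System.out.println(isConfigured(value, "com.openkm.extractor.Tesseract3TextExtractor"));
        System.out.println(isConfigured(value, "com.openkm.extractor.CuneiformTextExtractor"));
    }
}
```

Running the same check against both the repository.xml value and the value stored in the database configuration would show whether the two lists agree.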

I am getting this error at the moment:

08:41:54,429 WARN  [PdfTextExtractor] PDF does not contains text layer
08:41:54,429 WARN  [PdfTextExtractor] Failed to extract PDF text content
java.lang.IllegalArgumentException: Prefix string too short
        at java.io.File.createTempFile(File.java:1782)
        at java.io.File.createTempFile(File.java:1828)
        at com.openkm.extractor.PdfTextExtractor.extractText(PdfTextExtractor.java:89)
        at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
        at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
        at com.openkm.extractor.RegisteredExtractors.getText(RegisteredExtractors.java:75)
        at com.openkm.extractor.RegisteredExtractors.index(RegisteredExtractors.java:117)
        at com.openkm.module.base.BaseDocumentModule.create(BaseDocumentModule.java:161)
        at com.openkm.module.direct.DirectDocumentModule.create(DirectDocumentModule.java:199)
        at com.openkm.module.direct.DirectDocumentModule.create(DirectDocumentModule.java:98)
        at com.openkm.api.OKMDocument.create(OKMDocument.java:71)
        at com.openkm.servlet.frontend.FileUploadServlet.doPost(FileUploadServlet.java:176)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:710)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.jboss.web.tomcat.filters.ReplyHeaderFilter.doFilter(ReplyHeaderFilter.java:96)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:230)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
        at org.jboss.web.tomcat.security.SecurityAssociationValve.invoke(SecurityAssociationValve.java:182)
        at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:524)
        at org.jboss.web.tomcat.security.JaccContextValve.invoke(JaccContextValve.java:84)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at org.jboss.web.tomcat.service.jca.CachedConnectionValve.invoke(CachedConnectionValve.java:157)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:262)
        at org.apache.coyote.ajp.AjpAprProcessor.process(AjpAprProcessor.java:419)
        at org.apache.coyote.ajp.AjpAprProtocol$AjpConnectionHandler.process(AjpAprProtocol.java:378)
        at org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1508)
        at java.lang.Thread.run(Thread.java:662)
08:41:54,431 WARN  [RegisteredExtractors] There was a problem extracting text from '/okm:root/interstellar.pdf'
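
For context on the trace above: the exception is raised inside java.io.File.createTempFile, which rejects any prefix shorter than three characters with an IllegalArgumentException. The thread does not show what prefix PdfTextExtractor passes, so the sketch below only reproduces the JDK behaviour in isolation. The class `TempFilePrefixDemo` and the padding helper `safePrefix` are hypothetical illustrations, not OpenKM code or the fix the developers applied.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class; it only exercises the JDK's prefix check.
final class TempFilePrefixDemo {

    // Returns true when File.createTempFile accepts the prefix, false when
    // the JDK rejects it with IllegalArgumentException.
    static boolean isPrefixAccepted(String prefix) {
        try {
            File tmp = File.createTempFile(prefix, ".pdf");
            tmp.delete(); // clean up the file created by the probe
            return true;
        } catch (IllegalArgumentException e) {
            return false; // e.g. "Prefix string too short"
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Assumed workaround: pad short (or null) prefixes with underscores
    // until they reach the three-character minimum.
    static String safePrefix(String prefix) {
        StringBuilder sb = new StringBuilder(prefix == null ? "" : prefix);
        while (sb.length() < 3) {
            sb.append('_');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(isPrefixAccepted("ab"));             // false
        System.out.println(isPrefixAccepted(safePrefix("ab"))); // true
    }
}
```

If the prefix turns out to be derived from something short, padding it as in `safePrefix` is one possible way to avoid the exception; whether that applies here cannot be confirmed from the log alone.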

Posted: **Tue Jan 24, 2012 12:20 pm**

Can you attach the PDF document which generates this error?

Posted: **Tue Jan 31, 2012 9:14 am**

Apparently I can't upload the document (document PDF is not allowed).

Posted: **Wed Feb 01, 2012 9:54 am**

By default you can upload any document. Have you made any changes to the default OpenKM configuration? What kind of error do you get when you try to upload a PDF file?

Posted: **Wed Feb 01, 2012 2:56 pm**

Try uploading the PDF zipped; ZIP extensions are allowed. Anyway, I will take a look at the forum configuration to enable PDF attachments.

OCR/Indexing Problem
