• OCR/Indexing Problem

 #12670  by jllort
 
Curiously, the error mentions com.openkm.extractor.CuneiformTextExtractor, but you've got Tesseract configured.

Which OpenKM version are you using?
 #12818  by Alexires
 
Sorry, I didn't see your post; it was on the next page.

I'm using Version: 5.1.8-SNAPSHOT (build: 7221)
 #12827  by jllort
 
Upgrade to the latest build from integration.openkm.com.
 #12853  by Alexires
 
Upgraded and it doesn't throw an error now (except on the preview) but I don't think it is OCRing. I'll leave it for a day or so and see if it gets around to it...

Speaking of that, what does OpenKM do at midnight every night?
 #12862  by jllort
 
At midnight we build the latest SVN code. The next 5.1.x version will be 5.1.8, which covers all the bug fixes we've made since 5.1.7; really it's 5.1.7 with the bugs solved, and we don't introduce new features in that version. While we're waiting to release 5.1.8, you can start using the nightly build, which solves problems from the current release 5.1.7 (that's the idea).
 #12872  by Alexires
 
Sorry, I didn't make that clear. On my server, at midnight every night, it spins up the HDDs and does something. It looks like it reads words, or perhaps it is doing some kind of cron task?

Any idea when 5.1.8 is coming out? I'll keep upgrading to the latest svn code once a week till it comes out.
 #12878  by jllort
 
Version 5.1.8 is effectively closed, but we haven't found time to release it (that's the real problem); we hope to do it next week.

About the nightly OpenKM operation: we have an internal procedure that calculates statistics, and that's probably what you're detecting. If it's a problem, tell us and we'll look into a solution.
 #12897  by Alexires
 
Nah, not a problem at all. Just noticed and wondered what it did. Thanks for the answer :D
 #13541  by Alexires
 
Has anyone actually managed to get OCR working in their instance of OpenKM? I got sick of fighting with it and took a break for a while, but I'm back and I thought the new version might have helped. Still fighting with errors though, and I'm at a loss.

If someone has managed to get an instance of OpenKM working, could you please give me the details of the install so I can attempt to replicate it?
 #13567  by jllort
 
You should describe your scenario: OS, OpenKM version, your current configuration parameters, and which OCR engine you are trying to configure.
 #13619  by Alexires
 
Alright. This is my current setup:

I have a VPS with OpenKM 5.1.8_2 freshly installed on it. I have installed/configured OpenOffice, ImageMagick, SWFTools, ClamAV and an OpenOffice dictionary.

I have installed Tesseract 3.0.0 and have edited the textFilterClasses parameter of SearchIndex in repository.xml to include com.openkm.extractor.Tesseract3TextExtractor, and have also inserted the same into the database configuration (found in the admin tools).

I am getting this error at the moment:
08:41:54,429 WARN  [PdfTextExtractor] PDF does not contains text layer
08:41:54,429 WARN  [PdfTextExtractor] Failed to extract PDF text content
java.lang.IllegalArgumentException: Prefix string too short
        at java.io.File.createTempFile(File.java:1782)
        at java.io.File.createTempFile(File.java:1828)
        at com.openkm.extractor.PdfTextExtractor.extractText(PdfTextExtractor.java:89)
        at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
        at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
        at com.openkm.extractor.RegisteredExtractors.getText(RegisteredExtractors.java:75)
        at com.openkm.extractor.RegisteredExtractors.index(RegisteredExtractors.java:117)
        at com.openkm.module.base.BaseDocumentModule.create(BaseDocumentModule.java:161)
        at com.openkm.module.direct.DirectDocumentModule.create(DirectDocumentModule.java:199)
        at com.openkm.module.direct.DirectDocumentModule.create(DirectDocumentModule.java:98)
        at com.openkm.api.OKMDocument.create(OKMDocument.java:71)
        at com.openkm.servlet.frontend.FileUploadServlet.doPost(FileUploadServlet.java:176)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:710)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.jboss.web.tomcat.filters.ReplyHeaderFilter.doFilter(ReplyHeaderFilter.java:96)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:230)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
        at org.jboss.web.tomcat.security.SecurityAssociationValve.invoke(SecurityAssociationValve.java:182)
        at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:524)
        at org.jboss.web.tomcat.security.JaccContextValve.invoke(JaccContextValve.java:84)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at org.jboss.web.tomcat.service.jca.CachedConnectionValve.invoke(CachedConnectionValve.java:157)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:262)
        at org.apache.coyote.ajp.AjpAprProcessor.process(AjpAprProcessor.java:419)
        at org.apache.coyote.ajp.AjpAprProtocol$AjpConnectionHandler.process(AjpAprProtocol.java:378)
        at org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1508)
        at java.lang.Thread.run(Thread.java:662)
08:41:54,431 WARN  [RegisteredExtractors] There was a problem extracting text from '/okm:root/interstellar.pdf'
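
The `Prefix string too short` message in the trace above is the exception `java.io.File.createTempFile` raises when its prefix argument has fewer than three characters. A minimal sketch of that behaviour follows. `TempFilePrefixDemo`, `padPrefix` and `isRejected` are hypothetical names used only for illustration, and this thread does not show how OpenKM's `PdfTextExtractor` builds its prefix, so the padding workaround is an assumption and not the project's actual fix.

```java
import java.io.File;
import java.io.IOException;

public class TempFilePrefixDemo {
    // Hypothetical helper: pad a prefix with underscores until it reaches
    // the three-character minimum that File.createTempFile requires.
    static String padPrefix(String prefix) {
        StringBuilder sb = new StringBuilder(prefix == null ? "" : prefix);
        while (sb.length() < 3) {
            sb.append('_');
        }
        return sb.toString();
    }

    // Returns true when createTempFile rejects the prefix with an
    // IllegalArgumentException; any file that does get created is deleted.
    static boolean isRejected(String prefix) throws IOException {
        try {
            File tmp = File.createTempFile(prefix, ".tmp");
            tmp.delete();
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("\"ab\" rejected: " + isRejected("ab"));
        System.out.println("padded prefix rejected: " + isRejected(padPrefix("ab")));
    }
}
```

If the prefix here is derived from something short, such as part of a document name, that would be consistent with the failure, but only the source at `PdfTextExtractor.java:89` can confirm it.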
 #13631  by pavila
 
Can you attach the PDF document which generates this error?
 #13727  by jllort
 
By default you can upload any document. Have you made any changes to the default OpenKM configuration? What kind of error do you get when you try uploading a PDF file?
 #13736  by pavila
 
Try uploading the PDF zipped; ZIP extensions are allowed. Anyway, I will take a look at the forum configuration to enable PDF attachments.
