Open Source Document Management System | OpenKM - OCR/Indexing Problem

Re: OCR/Indexing Problem

#12670 by jllort
Sat Oct 22, 2011 4:35 pm

Curiously, the error mentions com.openkm.extractor.CuneiformTextExtractor, but you have Tesseract configured.

Which OpenKM version are you using?

Username

jllort

Rank

Moderator

Posts

12145

Joined

Fri Dec 21, 2007 11:23 am

Location

Sineu - ( Illes Balears ) - Spain

Contact

Re: OCR/Indexing Problem

#12818 by Alexires
Fri Nov 04, 2011 6:01 am

Sorry, I didn't see your post; it was on the next page.

I'm using Version: 5.1.8-SNAPSHOT (build: 7221)

Username

Alexires

Rank

Expert Boarder

Posts

130

Joined

Thu Jul 14, 2011 9:24 am

Re: OCR/Indexing Problem

#12827 by jllort
Fri Nov 04, 2011 4:48 pm

Upgrade to the latest build from integration.openkm.com.

Re: OCR/Indexing Problem

#12853 by Alexires
Thu Nov 10, 2011 7:06 am

Upgraded and it doesn't throw an error now (except on the preview) but I don't think it is OCRing. I'll leave it for a day or so and see if it gets around to it...

Speaking of that, what does OpenKM do at midnight every night?

Re: OCR/Indexing Problem

#12862 by jllort
Thu Nov 10, 2011 7:52 am

At midnight we build the latest SVN code. The next 5.1.x version will be 5.1.8, and it covers all the bug fixes we've made since 5.1.7. It is really 5.1.7 with the bugs solved; we don't introduce new features in that version. While we wait for the 5.1.8 release, you can start using the nightly build, which solves problems in the current 5.1.7 release (that's the idea).

Re: OCR/Indexing Problem

#12872 by Alexires
Sat Nov 12, 2011 4:58 am

Sorry, I didn't make that clear. On my server, at midnight every night, it spins up the HDDs and does something that looks like reading words. Perhaps it is running some kind of cron task?

Any idea when 5.1.8 is coming out? I'll keep upgrading to the latest svn code once a week till it comes out.

Re: OCR/Indexing Problem

#12878 by jllort
Sat Nov 12, 2011 9:37 am

Version 5.1.8 is really closed, but we haven't found time to release it (that's the current problem). We hope to do it next week.

About the nightly OpenKM operation: we have an internal procedure that calculates statistics, and that is probably what you're noticing. If it's a problem, tell us and we'll look into a solution.

Re: OCR/Indexing Problem

#12897 by Alexires
Tue Nov 15, 2011 9:17 am

Nah, not a problem at all. I just noticed it and wondered what it did. Thanks for the answer.

Re: OCR/Indexing Problem

#13541 by Alexires
Tue Jan 17, 2012 7:09 am

Has anyone actually managed to get OCR working in their instance of OpenKM? I got sick of fighting with it and took a break for a while, but I'm back and I thought the new version might have helped. Still fighting with errors though, and I'm at a loss.

If someone has managed to get an instance of OpenKM working, could you please give me the details of the install so I can attempt to replicate it?

Re: OCR/Indexing Problem

#13567 by jllort
Thu Jan 19, 2012 11:05 am

You should describe your scenario: OS, OpenKM version, your current configuration parameters, and which OCR engine you are trying to configure.

Re: OCR/Indexing Problem

#13619 by Alexires
Tue Jan 24, 2012 5:49 am

Alright. This is my current setup:

I have a VPS with OpenKM 5.1.8_2 freshly installed on it. I have installed and configured OpenOffice, ImageMagick, SWFTools, ClamAV and an OpenOffice dictionary.

I have installed Tesseract 3.0.0 and edited the textFilterClasses parameter of SearchIndex in repository.xml to include com.openkm.extractor.Tesseract3TextExtractor, and I have also inserted the same value into the database configuration (found in the admin tools).

I am getting this error at the moment:

Code:

08:41:54,429 WARN  [PdfTextExtractor] PDF does not contains text layer
08:41:54,429 WARN  [PdfTextExtractor] Failed to extract PDF text content
java.lang.IllegalArgumentException: Prefix string too short
        at java.io.File.createTempFile(File.java:1782)
        at java.io.File.createTempFile(File.java:1828)
        at com.openkm.extractor.PdfTextExtractor.extractText(PdfTextExtractor.java:89)
        at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
        at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
        at com.openkm.extractor.RegisteredExtractors.getText(RegisteredExtractors.java:75)
        at com.openkm.extractor.RegisteredExtractors.index(RegisteredExtractors.java:117)
        at com.openkm.module.base.BaseDocumentModule.create(BaseDocumentModule.java:161)
        at com.openkm.module.direct.DirectDocumentModule.create(DirectDocumentModule.java:199)
        at com.openkm.module.direct.DirectDocumentModule.create(DirectDocumentModule.java:98)
        at com.openkm.api.OKMDocument.create(OKMDocument.java:71)
        at com.openkm.servlet.frontend.FileUploadServlet.doPost(FileUploadServlet.java:176)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:710)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.jboss.web.tomcat.filters.ReplyHeaderFilter.doFilter(ReplyHeaderFilter.java:96)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:230)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
        at org.jboss.web.tomcat.security.SecurityAssociationValve.invoke(SecurityAssociationValve.java:182)
        at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:524)
        at org.jboss.web.tomcat.security.JaccContextValve.invoke(JaccContextValve.java:84)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at org.jboss.web.tomcat.service.jca.CachedConnectionValve.invoke(CachedConnectionValve.java:157)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:262)
        at org.apache.coyote.ajp.AjpAprProcessor.process(AjpAprProcessor.java:419)
        at org.apache.coyote.ajp.AjpAprProtocol$AjpConnectionHandler.process(AjpAprProtocol.java:378)
        at org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1508)
        at java.lang.Thread.run(Thread.java:662)
08:41:54,431 WARN  [RegisteredExtractors] There was a problem extracting text from '/okm:root/interstellar.pdf'
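
A note on the trace above: the exception is raised inside java.io.File.createTempFile, which the JDK documents as throwing IllegalArgumentException when the prefix argument has fewer than three characters. The thread does not show which prefix PdfTextExtractor passes at line 89, so the sketch below only reproduces the JDK behaviour in isolation. The class name TempFilePrefixDemo and the padding helper safePrefix are hypothetical and are not part of OpenKM.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;

public class TempFilePrefixDemo {

    // Hypothetical helper (not OpenKM code): pads a prefix with '_' until it
    // reaches the three-character minimum that File.createTempFile requires.
    static String safePrefix(String prefix) {
        StringBuilder sb = new StringBuilder(prefix == null ? "" : prefix);
        while (sb.length() < 3) {
            sb.append('_');
        }
        return sb.toString();
    }

    // Returns true if File.createTempFile rejects the prefix with
    // IllegalArgumentException, false if a temp file was created.
    static boolean rejectsPrefix(String prefix) {
        try {
            File tmp = File.createTempFile(prefix, ".pdf");
            tmp.delete(); // clean up the file we just created
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        } catch (IOException e) {
            // Temp directory missing or not writable: a different problem.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("\"ok\" rejected: " + rejectsPrefix("ok"));
        System.out.println("padded \"ok\" rejected: " + rejectsPrefix(safePrefix("ok")));
    }
}
```

If the extractor does build its prefix from a short string, padding it in this way would be one possible workaround; whether that is the actual cause here is an assumption, not something the log confirms.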

Re: OCR/Indexing Problem

#13631 by pavila
Tue Jan 24, 2012 12:20 pm

Can you attach the PDF document which generates this error?

Username

pavila

Rank

Moderator

Posts

3142

Joined

Tue Dec 11, 2007 6:02 pm

Location

Alicante, Spain

Contact

Re: OCR/Indexing Problem

#13709 by Alexires
Tue Jan 31, 2012 9:14 am

Apparently I can't upload the document (document PDF is not allowed).

Re: OCR/Indexing Problem

#13727 by jllort
Wed Feb 01, 2012 9:54 am

By default you can upload any document. Have you made any changes to the default OpenKM configuration? What kind of error do you get when you try to upload a PDF file?

Re: OCR/Indexing Problem

#13736 by pavila
Wed Feb 01, 2012 2:56 pm

Try uploading the PDF zipped; ZIP extensions are allowed. In any case, I will take a look at the forum configuration to enable PDF attachments.

Page 2 of 4
51 posts