Open Source Document Management System | OpenKM

OCR/Indexing Problem

#12077 by Alexires
Wed Sep 07, 2011 12:13 pm

Hi guys,

I'm trying to implement OpenKM 5.1.6 on Ubuntu 11.04 and I've gotten everything working great... except the OCR. I installed tesseract and eventually got it "working" in the sense that it didn't throw any errors. However, it just hangs at "Indexing Document" after the document has uploaded. The same happens with Cuneiform, although that one does give errors. One of the CPUs is sitting at 100% whilst the other is only at about 20%. RAM is almost maxed out and there is some data in swap.

I'm using a dual-core 1.8 GHz machine with 1 GB of RAM, and the repository is on an external HDD.

Username

Alexires

Rank

Expert Boarder

Posts

130

Joined

Thu Jul 14, 2011 9:24 am

Re: OCR/Indexing Problem

#12097 by pavila
Fri Sep 09, 2011 2:10 pm

Please try the OpenKM 5.1.8-SNAPSHOT nightly build and tell me which OCR engine you have configured. Typically, you should first run the OCR program from the command line to see whether it works.
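
One way to follow this advice is to call the OCR binary outside OpenKM and look at its exit code and error output. The sketch below assumes Tesseract's classic `tesseract imagename outputbase [-l lang]` invocation. The file names `page.tif` and `page_out` and the 120-second timeout are placeholders for illustration, not values from this thread.

```python
import shutil
import subprocess


def build_ocr_command(binary, image, output_base, lang="eng"):
    """Return the argument list for: tesseract imagename outputbase -l lang."""
    return [binary, image, output_base, "-l", lang]


def run_ocr(image, output_base, binary="tesseract", lang="eng", timeout=120):
    """Run the OCR binary once and return (exit_code, stderr_text).

    Returns (None, message) if the binary is missing or the run times out,
    so a hang shows up as a timeout instead of blocking forever.
    """
    path = shutil.which(binary)
    if path is None:
        return None, f"{binary} not found on PATH"
    cmd = build_ocr_command(path, image, output_base, lang)
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return None, f"timed out after {timeout}s"
    return proc.returncode, proc.stderr


if __name__ == "__main__":
    # Hypothetical file names; replace with a single scanned page.
    code, err = run_ocr("page.tif", "page_out")
    print("exit code:", code)
    print(err)
```

If a single page already hangs or exits with a non-zero code here, the problem is probably in the OCR installation rather than in OpenKM.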

Username

pavila

Rank

Moderator

Posts

3140

Joined

Tue Dec 11, 2007 6:02 pm

Location

Alicante, Spain

Re: OCR/Indexing Problem

#12110 by jllort
Sat Sep 10, 2011 7:04 am

That is too little RAM for OpenKM and tesseract running at the same time. I suggest increasing it to 2 GB.

Username

jllort

Rank

Moderator

Posts

12048

Joined

Fri Dec 21, 2007 11:23 am

Location

Sineu - ( Illes Balears ) - Spain

Re: OCR/Indexing Problem

#12142 by Alexires
Wed Sep 14, 2011 2:46 am

Thanks for the reply guys. Unfortunately, I've turned that box into a paperweight by screwing up the BIOS. I'll need to get another motherboard before I try your suggestions (I'd just upgraded to 4 gig ram too), but from memory it wasn't much better.

Still, I'll give you something more definite when I fix it.

Re: OCR/Indexing Problem

#12186 by pavila
Mon Sep 19, 2011 4:08 pm

May the Force be with you...

Re: OCR/Indexing Problem

#12238 by Alexires
Wed Sep 21, 2011 12:10 pm

Alright. New motherboard is in, along with a new 2 GB stick of DDR3-1333; same problem. One of the CPUs is running at 100% and htop (this is on Ubuntu) reports tesseract using 100% CPU on and off (it seems to be opening and closing with a different tmp file each time). The repository is located on an external HDD which is mounted at "repository" in the jboss folder.

The document being OCR'd is a quantum mechanics textbook that is 11 MB in size and 522 pages long.

I'll give 5.1.8-nightly a run and see how it goes.

Re: OCR/Indexing Problem

#12239 by Alexires
Wed Sep 21, 2011 1:23 pm

Alright. I've upgraded to 5.1.8-nightly. I've tried another book that looks cleaner to give the OCR a run. I'm now using a 267-page book that is 1.5 MB in total so as not to stress the system.

The file uploads fine, gets to "Indexing Document" and then gets past that and returns me to the taxonomy screen. The document is visible in the taxonomy screen (successful addition to repository), but the file doesn't preview (throws an error) and doesn't appear to be OCR'd.

Suggestions?

Re: OCR/Indexing Problem

#12280 by jllort
Fri Sep 23, 2011 4:22 pm

Preview and OCR are different configuration modules. If you've got a preview problem, I suggest creating another post. Include your server log error and your preview configuration, and indicate which OS you have OpenKM installed on.

Re: OCR/Indexing Problem

#12300 by Alexires
Sat Sep 24, 2011 5:37 am

Yeah, I was just thinking that. Still, no OCR as far as I can see. All I've been getting in the start.sh terminal window (since I uploaded the file) is

00:00:15,102 INFO [LRUNodeIdCache] num=13/10240 hits=235 miss=39765
00:00:15,158 INFO [BundleCache] num=1095 mem=8190k max=8192k avg=7659 hits=31938 miss=8062
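
These two lines look like periodic cache statistics rather than OCR output. As a minimal sketch, assuming the `hits=` and `miss=` fields mean what their names suggest, the hit ratio of each cache can be read off like this (the function name `hit_ratio` is made up for illustration):

```python
import re

# Matches key=value pairs such as "hits=235" or "miss=39765".
PAIR = re.compile(r"(\w+)=(\d+)")


def hit_ratio(line):
    """Return hits / (hits + miss) for one cache statistics line, or None."""
    fields = {key: int(value) for key, value in PAIR.findall(line)}
    if "hits" not in fields or "miss" not in fields:
        return None
    total = fields["hits"] + fields["miss"]
    return fields["hits"] / total if total else None


if __name__ == "__main__":
    lines = [
        "00:00:15,102 INFO [LRUNodeIdCache] num=13/10240 hits=235 miss=39765",
        "00:00:15,158 INFO [BundleCache] num=1095 mem=8190k max=8192k avg=7659 hits=31938 miss=8062",
    ]
    for line in lines:
        print(f"{hit_ratio(line):.2%}")
```

For the two lines above this works out to 235/40000 (about 0.6%) and 31938/40000 (about 79.8%). Neither figure says anything about whether the OCR step ran.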

EDIT: I've posted in a preview problem thread.

Re: OCR/Indexing Problem

#12354 by pavila
Tue Sep 27, 2011 5:15 pm

Perhaps you want to try an experimental text extraction mechanism which is more verbose. Take a look at Experimental features.

Re: OCR/Indexing Problem

#12362 by Alexires
Wed Sep 28, 2011 3:37 am

Alright, here is an update. The experimental text extraction has been on the whole time, and I turned on Force.OCR, which has generated some errors. The most notable is the first error it throws after the "No text to extract" message, which has to do with a zip problem (see the server.log output below).

2011-09-27 17:39:36,472 WARN  [com.openkm.extractor.CuneiformTextExtractor] IO exception executing command: /usr/local/bin/tesseract
java.util.zip.ZipException: error in opening zip file
        at java.util.zip.ZipFile.open(Native Method)
        at java.util.zip.ZipFile.<init>(ZipFile.java:127)
        at java.util.zip.ZipFile.<init>(ZipFile.java:88)

Now, this is strange, as it suggests that perhaps tesseract wasn't built correctly with all the appropriate libraries. So I double-checked that I had all the libraries (including zlib) and compiled it again. The tesseract readme says to double-check the config_auto.h file for a line that says something like #HAVE_ZLIB, which isn't there. There is a line that says #HAVE_LIBZ, which I think might be a mistake in the code. After much searching, I narrowed it down to a problem in leptonica: during compilation it doesn't recognise that zlib is installed, so it doesn't include it in the build, which in turn means that tesseract can't use it, hence the error above.

Unfortunately, this problem won't be fixed until the next build of leptonica, although there is a patch. I've applied the patch and rebuilt leptonica, then tesseract and still get an error. It is copied out of the server.log as follows:

2011-09-28 13:56:48,796 WARN  [com.openkm.extractor.PdfTextExtractor] PDF does not contains text layer
2011-09-28 13:56:48,811 WARN  [com.openkm.util.ExecutionUtils] Abnormal program termination: 1
2011-09-28 13:56:48,811 WARN  [com.openkm.util.ExecutionUtils] STDERR: Usage:/usr/local/bin/tesseract imagename outputbase [-l lang] [configfile [[+|-]varfile]...]

2011-09-28 13:56:48,811 WARN  [com.openkm.extractor.CuneiformTextExtractor] IO exception executing command: /usr/local/bin/tesseract
java.util.zip.ZipException: error in opening zip file
	at java.util.zip.ZipFile.open(Native Method)
	at java.util.zip.ZipFile.<init>(ZipFile.java:127)
	at java.util.zip.ZipFile.<init>(ZipFile.java:88)
	at com.openkm.util.DocumentUtils.spellChecker(DocumentUtils.java:177)
	at com.openkm.extractor.CuneiformTextExtractor.doOcr(CuneiformTextExtractor.java:130)
	at com.openkm.extractor.PdfTextExtractor.extractText(PdfTextExtractor.java:92)
	at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
	at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
	at com.openkm.extractor.RegisteredExtractors.getText(RegisteredExtractors.java:75)
	at com.openkm.extractor.RegisteredExtractors.index(RegisteredExtractors.java:117)
	at com.openkm.module.base.BaseDocumentModule.create(BaseDocumentModule.java:161)
	at com.openkm.module.direct.DirectDocumentModule.create(DirectDocumentModule.java:199)
	at com.openkm.module.direct.DirectDocumentModule.create(DirectDocumentModule.java:98)
	at com.openkm.api.OKMDocument.create(OKMDocument.java:71)
	at com.openkm.servlet.frontend.FileUploadServlet.doPost(FileUploadServlet.java:176)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:710)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
	at org.jboss.web.tomcat.filters.ReplyHeaderFilter.doFilter(ReplyHeaderFilter.java:96)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
	at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:230)
	at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
	at org.jboss.web.tomcat.security.SecurityAssociationValve.invoke(SecurityAssociationValve.java:182)
	at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:524)
	at org.jboss.web.tomcat.security.JaccContextValve.invoke(JaccContextValve.java:84)
	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
	at org.jboss.web.tomcat.service.jca.CachedConnectionValve.invoke(CachedConnectionValve.java:157)
	at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:262)
	at org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:856)
	at org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:566)
	at org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1508)
	at java.lang.Thread.run(Thread.java:662)
2011-09-28 13:56:48,813 ERROR [org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap] java.lang.IllegalArgumentException: Number of bits must be >= 0
java.lang.IllegalArgumentException: Number of bits must be >= 0
	at java.awt.image.ColorModel.<init>(ColorModel.java:353)
	at java.awt.image.ComponentColorModel.<init>(ComponentColorModel.java:256)
	at org.apache.pdfbox.pdmodel.graphics.color.PDDeviceGray.createColorModel(PDDeviceGray.java:91)
	at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.getRGBImage(PDPixelMap.java:238)
	at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.write2OutputStream(PDPixelMap.java:285)
	at org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage.write2file(PDXObjectImage.java:165)
	at com.openkm.extractor.PdfTextExtractor.extractText(PdfTextExtractor.java:91)
	at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
	at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
	at com.openkm.extractor.RegisteredExtractors.getText(RegisteredExtractors.java:75)
	at com.openkm.extractor.RegisteredExtractors.index(RegisteredExtractors.java:117)
	at com.openkm.module.base.BaseDocumentModule.create(BaseDocumentModule.java:161)
	at com.openkm.module.direct.DirectDocumentModule.create(DirectDocumentModule.java:199)
	at com.openkm.module.direct.DirectDocumentModule.create(DirectDocumentModule.java:98)
	at com.openkm.api.OKMDocument.create(OKMDocument.java:71)
	at com.openkm.servlet.frontend.FileUploadServlet.doPost(FileUploadServlet.java:176)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:710)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
	at org.jboss.web.tomcat.filters.ReplyHeaderFilter.doFilter(ReplyHeaderFilter.java:96)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
	at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:230)
	at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
	at org.jboss.web.tomcat.security.SecurityAssociationValve.invoke(SecurityAssociationValve.java:182)
	at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:524)
	at org.jboss.web.tomcat.security.JaccContextValve.invoke(JaccContextValve.java:84)
	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
	at org.jboss.web.tomcat.service.jca.CachedConnectionValve.invoke(CachedConnectionValve.java:157)
	at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:262)
	at org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:856)
	at org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:566)
	at org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1508)
	at java.lang.Thread.run(Thread.java:662)
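
With stack traces this long, it can help to strip the log down to its entry lines to see the order of events. The following is a rough sketch that assumes the `timestamp LEVEL [logger] message` layout seen in the excerpt; the name `log_entries` and the `server.log` path argument are illustrative only.

```python
import re
import sys

# One log entry header: "2011-09-28 13:56:48,796 WARN  [logger.Name] message"
ENTRY = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+(\w+)\s+\[([^\]]+)\]\s+(.*)$"
)


def log_entries(lines, levels=("WARN", "ERROR")):
    """Yield (level, logger, message) for entry lines at the given levels.

    Stack-trace lines ("at ...", bare exception names) do not match ENTRY
    and are skipped.
    """
    for line in lines:
        match = ENTRY.match(line.rstrip("\n"))
        if match and match.group(2) in levels:
            # Keep only the class name, e.g. "ExecutionUtils".
            logger = match.group(3).rsplit(".", 1)[-1]
            yield match.group(2), logger, match.group(4)


if __name__ == "__main__":
    # Usage (path is an assumption): python log_entries.py server.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for level, logger, message in log_entries(handle):
            print(f"{level:5} {logger}: {message}")
```

Applied to the excerpt, this leaves five entries in order: PdfTextExtractor finds no text layer, ExecutionUtils reports exit status 1 followed by tesseract's usage message, CuneiformTextExtractor logs the IO exception, and PDPixelMap logs the "Number of bits" error. A usage message on STDERR usually means the program was started with arguments it did not accept, which is worth checking separately from the ZipException.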

Re: OCR/Indexing Problem

#12426 by pavila
Mon Oct 03, 2011 10:44 am

The ZIP error is related to the SpellChecker:

com.openkm.util.DocumentUtils.spellChecker

Please, post the value of the system.openoffice.dictionary configuration property.
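
Since the exception comes from opening a ZIP file, one thing worth ruling out is that the configured dictionary file is missing or is not a valid archive. An OpenOffice `.oxt` extension is a ZIP container, so a quick check can be sketched as follows. The function name `check_dictionary` is hypothetical, and the test for a `.dic` entry assumes a Hunspell-style spelling dictionary; how OpenKM itself reads the file is not shown in this thread.

```python
import os
import zipfile


def check_dictionary(path):
    """Return a short diagnosis of a dictionary extension file.

    An .oxt extension is a ZIP container, so a missing file or a file
    that is not a valid ZIP archive would both fail to open as one.
    """
    if not os.path.isfile(path):
        return "missing"
    if not zipfile.is_zipfile(path):
        return "not a zip archive"
    with zipfile.ZipFile(path) as archive:
        names = archive.namelist()
    # Assumption: spelling dictionaries ship a .dic word list inside the archive.
    has_dic = any(name.lower().endswith(".dic") for name in names)
    return "ok" if has_dic else "zip archive without .dic entry"


if __name__ == "__main__":
    import sys

    # Pass the value of the dictionary configuration property as the argument.
    print(check_dictionary(sys.argv[1]))
```

Running this as the same user that starts JBoss also shows whether that user can read the file at all.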

Re: OCR/Indexing Problem

#12505 by Alexires
Sat Oct 08, 2011 11:42 am

System.OpenOffice.Dictionary: /home/alexires/jboss-4.2.3/ChemDictOOo.oxt

Re: OCR/Indexing Problem

#12571 by Alexires
Fri Oct 14, 2011 10:28 am

Is this setting ok?

Re: OCR/Indexing Problem

#12631 by Alexires
Wed Oct 19, 2011 3:55 am

I really hate to post hog, but I'm waiting on the OCR functionality before I put my OpenKM online. This is the only thing I am waiting on, so any help would be greatly appreciated...

Page 1 of 4
51 posts