OCR/Indexing Problem

Posted: Wed Sep 07, 2011 12:13 pm
by Alexires
Hi guys,

I'm trying to set up OpenKM 5.1.6 on Ubuntu 11.04 and I've gotten everything working great... except the OCR. I installed tesseract and eventually got it "working" in the sense that it doesn't throw any errors. However, it just hangs at "Indexing Document" after the document has been uploaded. The same happens with Cuneiform, although that one does give errors. One of the CPUs is sitting at 100% whilst the other is only at about 20%. RAM is almost maxed out and there is some data in swap.

I'm using a dual-core 1.8 GHz machine with 1 GB of RAM, and the repository is on an external HDD.

Re: OCR/Indexing Problem

Posted: Fri Sep 09, 2011 2:10 pm
by pavila
Please try the OpenKM 5.1.8-SNAPSHOT nightly build and tell me which OCR engine you have configured. Typically, you should run this OCR program from the command line first to see if it works.
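
A minimal sketch of that command-line check, written in Python and assuming the classic `tesseract imagename outputbase [-l lang]` invocation. The helper names and the 120-second timeout are illustrative assumptions, not OpenKM settings.

```python
import shutil
import subprocess


def build_tesseract_command(image, output_base, lang=None, binary="tesseract"):
    """Build the argument list: tesseract imagename outputbase [-l lang]."""
    command = [binary, str(image), str(output_base)]
    if lang:
        command += ["-l", lang]
    return command


def run_ocr_check(image, output_base, lang=None, binary="tesseract", timeout=120):
    """Run the OCR engine once, outside OpenKM, and report what happened.

    Returns (exit_code, stderr_text). exit_code is None when the binary is
    missing or the run exceeds the timeout (the "hang" case).
    """
    resolved = shutil.which(binary)
    if resolved is None:
        return None, f"{binary} not found on PATH"
    command = build_tesseract_command(image, output_base, lang, resolved)
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return None, f"no result after {timeout} s"
    return result.returncode, result.stderr
```

For example, `run_ocr_check("page.tif", "out", lang="eng")` would try a single scanned page (`page.tif` is a hypothetical file name). If that already fails or times out, the problem is probably in the OCR installation rather than in OpenKM.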

Re: OCR/Indexing Problem

Posted: Sat Sep 10, 2011 7:04 am
by jllort
That is too little RAM for OpenKM and tesseract running at the same time. I suggest increasing it to 2 GB.

Re: OCR/Indexing Problem

Posted: Wed Sep 14, 2011 2:46 am
by Alexires
Thanks for the reply guys. Unfortunately, I've turned that box into a paperweight by screwing up the BIOS. I'll need to get another motherboard before I try your suggestions (I'd just upgraded to 4 gig ram too), but from memory it wasn't much better.

Still, I'll give you something more definite when I fix it.

Re: OCR/Indexing Problem

Posted: Mon Sep 19, 2011 4:08 pm
by pavila
May the Force be with you... :|

Re: OCR/Indexing Problem

Posted: Wed Sep 21, 2011 12:10 pm
by Alexires
Alright. The new motherboard is in, along with a new stick of 2 GB DDR3/1333; same problem. One of the CPUs is running at 100%, and htop (this is on Ubuntu) reports that tesseract is using 100% CPU on and off (it seems to be opening and closing with a different tmp file each time). The repository is located on an external HDD which is mounted at "repository" in the jboss folder.

The document being OCR'd is a quantum mechanics textbook that is 11 MB in size and 522 pages long.

I'll give 5.1.8-nightly a run and see how it goes.

Re: OCR/Indexing Problem

Posted: Wed Sep 21, 2011 1:23 pm
by Alexires
Alright. I've upgraded to 5.1.8-nightly. I've tried another book that looks cleaner to give the OCR a run. I'm now using a 267-page book that is 1.5 MB in total so as not to stress the system.

The file uploads fine, gets to "Indexing Document" and then gets past that and returns me to the taxonomy screen. The document is visible in the taxonomy screen (successful addition to repository), but the file doesn't preview (throws an error) and doesn't appear to be OCR'd.

Suggestions?

Re: OCR/Indexing Problem

Posted: Fri Sep 23, 2011 4:22 pm
by jllort
Preview and OCR are different configuration modules. If you've got a preview problem, I suggest creating a separate post. Include your server log error and your preview configuration, and indicate which OS you have OpenKM installed on.

Re: OCR/Indexing Problem

Posted: Sat Sep 24, 2011 5:37 am
by Alexires
Yeah, I was just thinking that. Still, no OCR as far as I can see. All I've been getting in the start.sh terminal window (since I uploaded the file) is
00:00:15,102 INFO [LRUNodeIdCache] num=13/10240 hits=235 miss=39765
00:00:15,158 INFO [BundleCache] num=1095 mem=8190k max=8192k avg=7659 hits=31938 miss=8062
EDIT: I've posted in a preview problem thread.

Re: OCR/Indexing Problem

Posted: Tue Sep 27, 2011 5:15 pm
by pavila
Perhaps you want to try an experimental text extraction mechanism which is more verbose. Take a look at Experimental features.

Re: OCR/Indexing Problem

Posted: Wed Sep 28, 2011 3:37 am
by Alexires
Alright, here is an update. The experimental text extraction has been on the whole time, and I turned on Force.OCR which has generated some errors. The most notable is the first error it throws after the "No text to extract" which has to do with a zip problem (see server.log output below)
Code:
2011-09-27 17:39:36,472 WARN  [com.openkm.extractor.CuneiformTextExtractor] IO exception executing command: /usr/local/bin/tesseract
java.util.zip.ZipException: error in opening zip file
        at java.util.zip.ZipFile.open(Native Method)
        at java.util.zip.ZipFile.<init>(ZipFile.java:127)
        at java.util.zip.ZipFile.<init>(ZipFile.java:88)
Now, this is strange, as it indicates that perhaps tesseract wasn't built correctly with all the appropriate libraries. So I double-checked that I had all the libraries (including zlib) and compiled it again. The tesseract readme says to double-check the config_auto.h file for a line that says something like #HAVE_ZLIB, which isn't there. There is a line that says #HAVE_LIBZ, which I think might be an error in the coding. So after much searching, I narrowed it down to a problem in leptonica. During compile, it doesn't recognise that zlib is installed, so zlib isn't included in the build, which in turn means that tesseract can't use it, hence the error above.
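
One way to see which library macros a configure run actually set is to scan the generated header for `#define HAVE_...` lines. Below is a rough sketch in Python. The function name and the sample header text are made up for illustration, and a real config_auto.h may look different.

```python
import re

# Matches lines such as "#define HAVE_LIBZ 1". Lines that are commented
# out, e.g. "/* #undef HAVE_LIBZ */", are deliberately not matched.
_DEFINE_RE = re.compile(
    r"^[ \t]*#[ \t]*define[ \t]+(HAVE_\w+)(?:[ \t]+(\S+))?", re.MULTILINE
)


def defined_have_macros(header_text):
    """Return {macro_name: value} for every HAVE_* macro defined in the text."""
    return {name: value for name, value in _DEFINE_RE.findall(header_text)}


# Illustrative header text only, not copied from a real build.
sample = """\
/* #undef HAVE_LIBJPEG */
#define HAVE_LIBPNG 1
/* #undef HAVE_LIBZ */
"""

macros = defined_have_macros(sample)
print("HAVE_LIBZ" in macros)  # False for this sample: zlib was not detected
```

If the macro for zlib is missing from the result, the configure step did not find the library, which is the situation described above.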

Unfortunately, this problem won't be fixed until the next build of leptonica, although there is a patch. I've applied the patch and rebuilt leptonica, then tesseract and still get an error. It is copied out of the server.log as follows:
Code:
2011-09-28 13:56:48,796 WARN  [com.openkm.extractor.PdfTextExtractor] PDF does not contains text layer
2011-09-28 13:56:48,811 WARN  [com.openkm.util.ExecutionUtils] Abnormal program termination: 1
2011-09-28 13:56:48,811 WARN  [com.openkm.util.ExecutionUtils] STDERR: Usage:/usr/local/bin/tesseract imagename outputbase [-l lang] [configfile [[+|-]varfile]...]

2011-09-28 13:56:48,811 WARN  [com.openkm.extractor.CuneiformTextExtractor] IO exception executing command: /usr/local/bin/tesseract
java.util.zip.ZipException: error in opening zip file
	at java.util.zip.ZipFile.open(Native Method)
	at java.util.zip.ZipFile.<init>(ZipFile.java:127)
	at java.util.zip.ZipFile.<init>(ZipFile.java:88)
	at com.openkm.util.DocumentUtils.spellChecker(DocumentUtils.java:177)
	at com.openkm.extractor.CuneiformTextExtractor.doOcr(CuneiformTextExtractor.java:130)
	at com.openkm.extractor.PdfTextExtractor.extractText(PdfTextExtractor.java:92)
	at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
	at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
	at com.openkm.extractor.RegisteredExtractors.getText(RegisteredExtractors.java:75)
	at com.openkm.extractor.RegisteredExtractors.index(RegisteredExtractors.java:117)
	at com.openkm.module.base.BaseDocumentModule.create(BaseDocumentModule.java:161)
	at com.openkm.module.direct.DirectDocumentModule.create(DirectDocumentModule.java:199)
	at com.openkm.module.direct.DirectDocumentModule.create(DirectDocumentModule.java:98)
	at com.openkm.api.OKMDocument.create(OKMDocument.java:71)
	at com.openkm.servlet.frontend.FileUploadServlet.doPost(FileUploadServlet.java:176)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:710)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
	at org.jboss.web.tomcat.filters.ReplyHeaderFilter.doFilter(ReplyHeaderFilter.java:96)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
	at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:230)
	at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
	at org.jboss.web.tomcat.security.SecurityAssociationValve.invoke(SecurityAssociationValve.java:182)
	at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:524)
	at org.jboss.web.tomcat.security.JaccContextValve.invoke(JaccContextValve.java:84)
	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
	at org.jboss.web.tomcat.service.jca.CachedConnectionValve.invoke(CachedConnectionValve.java:157)
	at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:262)
	at org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:856)
	at org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:566)
	at org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1508)
	at java.lang.Thread.run(Thread.java:662)
2011-09-28 13:56:48,813 ERROR [org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap] java.lang.IllegalArgumentException: Number of bits must be >= 0
java.lang.IllegalArgumentException: Number of bits must be >= 0
	at java.awt.image.ColorModel.<init>(ColorModel.java:353)
	at java.awt.image.ComponentColorModel.<init>(ComponentColorModel.java:256)
	at org.apache.pdfbox.pdmodel.graphics.color.PDDeviceGray.createColorModel(PDDeviceGray.java:91)
	at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.getRGBImage(PDPixelMap.java:238)
	at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.write2OutputStream(PDPixelMap.java:285)
	at org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage.write2file(PDXObjectImage.java:165)
	at com.openkm.extractor.PdfTextExtractor.extractText(PdfTextExtractor.java:91)
	at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
	at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
	at com.openkm.extractor.RegisteredExtractors.getText(RegisteredExtractors.java:75)
	at com.openkm.extractor.RegisteredExtractors.index(RegisteredExtractors.java:117)
	at com.openkm.module.base.BaseDocumentModule.create(BaseDocumentModule.java:161)
	at com.openkm.module.direct.DirectDocumentModule.create(DirectDocumentModule.java:199)
	at com.openkm.module.direct.DirectDocumentModule.create(DirectDocumentModule.java:98)
	at com.openkm.api.OKMDocument.create(OKMDocument.java:71)
	at com.openkm.servlet.frontend.FileUploadServlet.doPost(FileUploadServlet.java:176)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:710)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
	at org.jboss.web.tomcat.filters.ReplyHeaderFilter.doFilter(ReplyHeaderFilter.java:96)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
	at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:230)
	at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
	at org.jboss.web.tomcat.security.SecurityAssociationValve.invoke(SecurityAssociationValve.java:182)
	at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:524)
	at org.jboss.web.tomcat.security.JaccContextValve.invoke(JaccContextValve.java:84)
	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
	at org.jboss.web.tomcat.service.jca.CachedConnectionValve.invoke(CachedConnectionValve.java:157)
	at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:262)
	at org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:856)
	at org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:566)
	at org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1508)
	at java.lang.Thread.run(Thread.java:662)

Re: OCR/Indexing Problem

Posted: Mon Oct 03, 2011 10:44 am
by pavila
The ZIP error is related to the SpellChecker:
Code:
com.openkm.util.DocumentUtils.spellChecker
Please, post the value of the system.openoffice.dictionary configuration property.
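
Reading the trace from the top makes this easier to see: the first frame that is not a JDK class is the code that tried to open the zip file. Here is a small Python sketch of that reading, using a shortened copy of the frames posted in this thread. The function name and the list of library prefixes are assumptions for illustration.

```python
import re

# One stack frame looks like: "at package.Class.method(File.java:123)"
_FRAME_RE = re.compile(r"^\s*at\s+([\w.$<>]+)\(([^)]*)\)\s*$")


def first_application_frame(trace_text, library_prefixes=("java.", "javax.", "sun.")):
    """Return the first frame whose class is outside the given library packages."""
    for line in trace_text.splitlines():
        match = _FRAME_RE.match(line)
        if match and not match.group(1).startswith(library_prefixes):
            return match.group(1)
    return None


# Shortened copy of the trace posted earlier in this thread.
trace = """\
java.util.zip.ZipException: error in opening zip file
    at java.util.zip.ZipFile.open(Native Method)
    at java.util.zip.ZipFile.<init>(ZipFile.java:127)
    at java.util.zip.ZipFile.<init>(ZipFile.java:88)
    at com.openkm.util.DocumentUtils.spellChecker(DocumentUtils.java:177)
    at com.openkm.extractor.CuneiformTextExtractor.doOcr(CuneiformTextExtractor.java:130)
"""

print(first_application_frame(trace))
```

The zip file is being opened from Java inside OpenKM, so the exception does not come from the tesseract binary or its zlib support.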

Re: OCR/Indexing Problem

Posted: Sat Oct 08, 2011 11:42 am
by Alexires
System.OpenOffice.Dictionary: /home/alexires/jboss-4.2.3/ChemDictOOo.oxt
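
Since an .oxt OpenOffice extension is a zip archive, one quick sanity check is whether the configured file exists and opens as a zip at all. A minimal Python sketch follows. The function name is made up, and it only mirrors the kind of failure in the trace, not OpenKM's actual spell-checker code.

```python
import os
import zipfile


def check_dictionary_archive(path):
    """Return a short diagnosis for a dictionary file that should be a zip."""
    if not os.path.isfile(path):
        return "missing: no file at this path"
    if not os.access(path, os.R_OK):
        return "unreadable: check permissions for the user running JBoss"
    if not zipfile.is_zipfile(path):
        return "not a zip: file may be corrupt or an incomplete download"
    with zipfile.ZipFile(path) as archive:
        bad_member = archive.testzip()  # first corrupt member, or None
        if bad_member is not None:
            return f"corrupt member: {bad_member}"
        return f"ok: {len(archive.namelist())} entries"


# Path taken from the setting quoted above.
print(check_dictionary_archive("/home/alexires/jboss-4.2.3/ChemDictOOo.oxt"))
```

Any result other than "ok" would be consistent with the "error in opening zip file" message in server.log.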

Re: OCR/Indexing Problem

Posted: Fri Oct 14, 2011 10:28 am
by Alexires
Is this setting ok?

Re: OCR/Indexing Problem

Posted: Wed Oct 19, 2011 3:55 am
by Alexires
I really hate to post hog, but I'm waiting on the OCR functionality before I put my OpenKM online. This is the only thing I am waiting on, so any help would be greatly appreciated...