• OCR/Indexing Problem

  • OpenKM has many interesting features, but requires some configuration process to show its full potential.
 #12077  by Alexires
 
Hi guys,

I'm trying to implement OpenKM 5.1.6 on Ubuntu 11.04 and I've gotten everything working great... except the OCR. I installed tesseract and eventually got it "working" in the sense that it didn't throw any errors. However, it just hung at "Indexing Document" after uploading the document. The same is happening with Cuneiform, although that one is giving errors. One of the CPUs is sitting at 100% whilst the other is only at about 20%. RAM is almost maxed out and there is some data in the swap.

I'm using a dual-core 1.8GHz machine with 1 GB of RAM, and the repository is on an external HDD.
 #12097  by pavila
 
Please try the OpenKM 5.1.8-SNAPSHOT nightly build and tell me which OCR engine you have configured. Typically, you should run the OCR program from the command line first to see if it works.
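For anyone who wants to script that command-line check, here is a minimal sketch in Python. It assumes the usual `tesseract imagename outputbase [-l lang]` argument form; the helper names, the timeout value, and any file names you pass in are illustrative assumptions, not anything OpenKM itself uses.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, lang=None, binary="tesseract"):
    """Build the argument list: <binary> imagename outputbase [-l lang]."""
    cmd = [binary, image_path, output_base]
    if lang:
        cmd += ["-l", lang]
    return cmd


def run_ocr(image_path, output_base, lang=None, binary="tesseract", timeout=120):
    """Run the OCR binary once, outside OpenKM.

    Returns (exit_code, stderr_text). exit_code is None when the binary
    could not be found or the run exceeded the timeout.
    """
    # Fail early with a readable message if the binary is not installed.
    if shutil.which(binary) is None:
        return None, f"{binary} not found"
    cmd = build_tesseract_command(image_path, output_base, lang, binary)
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # A hang here points at the OCR engine rather than at OpenKM.
        return None, f"timed out after {timeout}s"
    return result.returncode, result.stderr
```

Calling `run_ocr("page.tif", "page_out", lang="eng")` on a single scanned page (a hypothetical file name) should show whether the engine itself exits cleanly, hangs, or prints a usage error, before OpenKM is involved at all.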
 #12110  by jllort
 
That is too little RAM for OpenKM and tesseract running at the same time. I suggest increasing it to 2 GB.
 #12142  by Alexires
 
Thanks for the reply guys. Unfortunately, I've turned that box into a paperweight by screwing up the BIOS. I'll need to get another motherboard before I try your suggestions (I'd just upgraded to 4 gig ram too), but from memory it wasn't much better.

Still, I'll give you something more definite when I fix it.
 #12238  by Alexires
 
Alright. New motherboard is in, along with a new stick of 2 gig DDR3/1333; same problem. One of the CPUs is running at 100%, and htop (this is in Ubuntu) is reporting that tesseract is using 100% CPU on and off (it seems to be opening and closing with a different tmp file each time). The repository is located on an external HDD which is mounted into "repository" in the jboss folder.

The document being OCR'd is a quantum mechanics textbook that is 11 MB in size and 522 pages long.

I'll give 5.1.8-nightly a run and see how it goes.
 #12239  by Alexires
 
Alright. I've upgraded to 5.1.8-nightly. I've tried another book that looks cleaner to give the OCR a run. I'm now using a 267-page book that is 1.5 MB in total so as not to stress the system.

The file uploads fine, gets to "Indexing Document" and then gets past that and returns me to the taxonomy screen. The document is visible in the taxonomy screen (successful addition to repository), but the file doesn't preview (throws an error) and doesn't appear to be OCR'd.

Suggestions?
 #12280  by jllort
 
Preview and OCR are different configuration modules. If you've got a preview problem, I suggest creating another post. Put your server log error and your preview configuration there, and indicate which OS you have OpenKM installed on.
 #12300  by Alexires
 
Yeah, I was just thinking that. Still, no OCR as far as I can see. All I've been getting in the start.sh terminal window (since I uploaded the file) is
00:00:15,102 INFO [LRUNodeIdCache] num=13/10240 hits=235 miss=39765
00:00:15,158 INFO [BundleCache] num=1095 mem=8190k max=8192k avg=7659 hits=31938 miss=8062
EDIT: I've posted in a preview problem thread.
 #12362  by Alexires
 
Alright, here is an update. The experimental text extraction has been on the whole time, and I turned on Force.OCR which has generated some errors. The most notable is the first error it throws after the "No text to extract" which has to do with a zip problem (see server.log output below)
2011-09-27 17:39:36,472 WARN  [com.openkm.extractor.CuneiformTextExtractor] IO exception executing command: /usr/local/bin/tesseract
java.util.zip.ZipException: error in opening zip file
        at java.util.zip.ZipFile.open(Native Method)
        at java.util.zip.ZipFile.<init>(ZipFile.java:127)
        at java.util.zip.ZipFile.<init>(ZipFile.java:88)
Now, this is strange, as it indicates that perhaps tesseract wasn't built correctly with all the appropriate libraries. So I double-checked that I had all the libraries (including zlib) and compiled it again. The tesseract readme says to double-check the config_auto.h file for a line that says something like #HAVE_ZLIB, which isn't there. There is a line that says #HAVE_LIBZ, which I think might be an error in the coding. After much searching, I narrowed it down to a problem in leptonica. During compile, it doesn't recognise that zlib is installed, so it isn't including it in the build, which in turn means that tesseract can't use it, hence the error above.

Unfortunately, this problem won't be fixed until the next build of leptonica, although there is a patch. I've applied the patch and rebuilt leptonica, then tesseract and still get an error. It is copied out of the server.log as follows:
2011-09-28 13:56:48,796 WARN  [com.openkm.extractor.PdfTextExtractor] PDF does not contains text layer
2011-09-28 13:56:48,811 WARN  [com.openkm.util.ExecutionUtils] Abnormal program termination: 1
2011-09-28 13:56:48,811 WARN  [com.openkm.util.ExecutionUtils] STDERR: Usage:/usr/local/bin/tesseract imagename outputbase [-l lang] [configfile [[+|-]varfile]...]

2011-09-28 13:56:48,811 WARN  [com.openkm.extractor.CuneiformTextExtractor] IO exception executing command: /usr/local/bin/tesseract
java.util.zip.ZipException: error in opening zip file
	at java.util.zip.ZipFile.open(Native Method)
	at java.util.zip.ZipFile.<init>(ZipFile.java:127)
	at java.util.zip.ZipFile.<init>(ZipFile.java:88)
	at com.openkm.util.DocumentUtils.spellChecker(DocumentUtils.java:177)
	at com.openkm.extractor.CuneiformTextExtractor.doOcr(CuneiformTextExtractor.java:130)
	at com.openkm.extractor.PdfTextExtractor.extractText(PdfTextExtractor.java:92)
	at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
	at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
	at com.openkm.extractor.RegisteredExtractors.getText(RegisteredExtractors.java:75)
	at com.openkm.extractor.RegisteredExtractors.index(RegisteredExtractors.java:117)
	at com.openkm.module.base.BaseDocumentModule.create(BaseDocumentModule.java:161)
	at com.openkm.module.direct.DirectDocumentModule.create(DirectDocumentModule.java:199)
	at com.openkm.module.direct.DirectDocumentModule.create(DirectDocumentModule.java:98)
	at com.openkm.api.OKMDocument.create(OKMDocument.java:71)
	at com.openkm.servlet.frontend.FileUploadServlet.doPost(FileUploadServlet.java:176)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:710)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
	at org.jboss.web.tomcat.filters.ReplyHeaderFilter.doFilter(ReplyHeaderFilter.java:96)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
	at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:230)
	at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
	at org.jboss.web.tomcat.security.SecurityAssociationValve.invoke(SecurityAssociationValve.java:182)
	at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:524)
	at org.jboss.web.tomcat.security.JaccContextValve.invoke(JaccContextValve.java:84)
	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
	at org.jboss.web.tomcat.service.jca.CachedConnectionValve.invoke(CachedConnectionValve.java:157)
	at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:262)
	at org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:856)
	at org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:566)
	at org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1508)
	at java.lang.Thread.run(Thread.java:662)
2011-09-28 13:56:48,813 ERROR [org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap] java.lang.IllegalArgumentException: Number of bits must be >= 0
java.lang.IllegalArgumentException: Number of bits must be >= 0
	at java.awt.image.ColorModel.<init>(ColorModel.java:353)
	at java.awt.image.ComponentColorModel.<init>(ComponentColorModel.java:256)
	at org.apache.pdfbox.pdmodel.graphics.color.PDDeviceGray.createColorModel(PDDeviceGray.java:91)
	at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.getRGBImage(PDPixelMap.java:238)
	at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.write2OutputStream(PDPixelMap.java:285)
	at org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage.write2file(PDXObjectImage.java:165)
	at com.openkm.extractor.PdfTextExtractor.extractText(PdfTextExtractor.java:91)
	at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
	at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
	at com.openkm.extractor.RegisteredExtractors.getText(RegisteredExtractors.java:75)
	at com.openkm.extractor.RegisteredExtractors.index(RegisteredExtractors.java:117)
	at com.openkm.module.base.BaseDocumentModule.create(BaseDocumentModule.java:161)
	at com.openkm.module.direct.DirectDocumentModule.create(DirectDocumentModule.java:199)
	at com.openkm.module.direct.DirectDocumentModule.create(DirectDocumentModule.java:98)
	at com.openkm.api.OKMDocument.create(OKMDocument.java:71)
	at com.openkm.servlet.frontend.FileUploadServlet.doPost(FileUploadServlet.java:176)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:710)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
	at org.jboss.web.tomcat.filters.ReplyHeaderFilter.doFilter(ReplyHeaderFilter.java:96)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
	at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:230)
	at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
	at org.jboss.web.tomcat.security.SecurityAssociationValve.invoke(SecurityAssociationValve.java:182)
	at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:524)
	at org.jboss.web.tomcat.security.JaccContextValve.invoke(JaccContextValve.java:84)
	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
	at org.jboss.web.tomcat.service.jca.CachedConnectionValve.invoke(CachedConnectionValve.java:157)
	at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:262)
	at org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:856)
	at org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:566)
	at org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1508)
	at java.lang.Thread.run(Thread.java:662)
 #12426  by pavila
 
The ZIP error is related to the SpellChecker:
com.openkm.util.DocumentUtils.spellChecker
Please, post the value of the system.openoffice.dictionary configuration property.
 #12505  by Alexires
 
System.OpenOffice.Dictionary: /home/alexires/jboss-4.2.3/ChemDictOOo.oxt
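
Since the trace shows `java.util.zip.ZipFile` failing inside `spellChecker`, one quick check is whether the configured dictionary file exists, is readable, and is actually a valid ZIP archive (an `.oxt` extension is a ZIP container). The following is only a rough sketch of such a check in Python; the function name and the returned strings are made up for illustration, and it does not reproduce what OpenKM does internally.

```python
import os
import zipfile


def check_dictionary(path):
    """Return a short diagnosis for a dictionary archive path."""
    if not os.path.isfile(path):
        return "missing"
    if not os.access(path, os.R_OK):
        return "unreadable"
    # A failure here would roughly correspond to an "error in opening zip file".
    if not zipfile.is_zipfile(path):
        return "not a zip archive"
    with zipfile.ZipFile(path) as archive:
        # testzip() returns the name of the first corrupt member, or None.
        bad_member = archive.testzip()
    return "ok" if bad_member is None else f"corrupt member: {bad_member}"
```

Run it as the same user that starts JBoss, for example `check_dictionary("/home/alexires/jboss-4.2.3/ChemDictOOo.oxt")`, because file permissions may differ between accounts.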
 #12631  by Alexires
 
I really hate to post hog, but I'm waiting on the OCR functionality before I put my OpenKM online. This is the only thing I am waiting on, so any help would be greatly appreciated...
