OCR with OpenKM under CentOS 5.7 x64?

 #14454  by techexpress
 
Hi all!
I run OpenKM 5.1.9 under CentOS 5.7 x64. Previews for JPEG, PDF and Office documents work fine.
Now I'm trying to configure OCR with Tesseract 3. Here is the value set via the admin settings:
Code:
system.ocr	String 	/usr/local/bin/tesseract ${fileIn} ${fileOut}  
Do we need to use /usr/local/bin/tesseract ${fileIn} ${fileOut} -l fra instead for the French language?
Also, when I scan a very well-printed document I get these errors:
Code:
2012-03-12 17:24:57,984 WARN  [com.openkm.extractor.CuneiformTextExtractor] IO exception executing command: /usr/local/bin/tesseract /tmp/image02251514838640367430.tiff /tmp/okm4578699148093339593.txt -l fra
java.util.zip.ZipException: error in opening zip file
	at java.util.zip.ZipFile.open(Native Method)
	at java.util.zip.ZipFile.<init>(Unknown Source)
	at java.util.zip.ZipFile.<init>(Unknown Source)
	at com.openkm.util.DocumentUtils.spellChecker(DocumentUtils.java:177)
	at com.openkm.extractor.CuneiformTextExtractor.doOcr(CuneiformTextExtractor.java:130)
	at com.openkm.extractor.PdfTextExtractor.doOcr(PdfTextExtractor.java:137)
	at com.openkm.extractor.PdfTextExtractor.extractText(PdfTextExtractor.java:98)
	at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
	at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
	at org.apache.jackrabbit.core.query.lucene.TextExtractorJob$1.call(TextExtractorJob.java:93)
	at EDU.oswego.cs.dl.util.concurrent.FutureResult$1.run(Unknown Source)
	at org.apache.jackrabbit.core.query.lucene.TextExtractorJob.run(TextExtractorJob.java:172)
	at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Unknown Source)
2012-03-12 17:24:58,007 WARN  [com.openkm.extractor.CuneiformTextExtractor] IO exception executing command: /usr/local/bin/tesseract /tmp/image03630279688831400324.tiff /tmp/okm6425780083587314556.txt -l fra
java.util.zip.ZipException: error in opening zip file
	at java.util.zip.ZipFile.open(Native Method)
	at java.util.zip.ZipFile.<init>(Unknown Source)
	at java.util.zip.ZipFile.<init>(Unknown Source)
	at com.openkm.util.DocumentUtils.spellChecker(DocumentUtils.java:177)
	at com.openkm.extractor.CuneiformTextExtractor.doOcr(CuneiformTextExtractor.java:130)
	at com.openkm.extractor.PdfTextExtractor.doOcr(PdfTextExtractor.java:137)
	at com.openkm.extractor.PdfTextExtractor.extractText(PdfTextExtractor.java:98)
	at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
	at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
	at org.apache.jackrabbit.core.query.lucene.TextExtractorJob$1.call(TextExtractorJob.java:93)
	at EDU.oswego.cs.dl.util.concurrent.FutureResult$1.run(Unknown Source)
	at org.apache.jackrabbit.core.query.lucene.TextExtractorJob.run(TextExtractorJob.java:172)
	at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Unknown Source)
Last edited by techexpress on Wed Mar 14, 2012 1:42 pm, edited 1 time in total.
 #14504  by techexpress
 
Thanks for the reply.
I removed all the settings from the configuration file and put them in Administration -> Configuration.
Code:
#
# Since OpenKM 5.1 this file is only used for Hibernate configuration.
# To change configuration parameters, use Administration -> Configuration
#
hibernate.dialect=org.hibernate.dialect.MySQL5Dialect
hibernate.hbm2ddl=none
Settings from Administration -> Configuration:
Code:
system.ocr	String 	/usr/local/bin/tesseract ${fileIn} ${fileOut} -l fra 
system.openoffice.dictionary	String 	/opt/openoffice.org/basis3.2/share/wordbook/fr/ooo-dictionnaire-fr-moderne-v4.2.oxt 
system.openoffice.path	String 	/opt/openoffice.org3 	
system.openoffice.port	Integer 	2002 	
system.openoffice.server	String 	http://localhost:8080/converter/convert 
system.openoffice.tasks	Integer 	200 	
system.pdf.force.ocr	Boolean 	Inactive 	
system.previewer	String 	zviewer 	
system.readonly	Boolean 	Inactive 	
system.swftools.pdf2swf	String 	/usr/local/bin/pdf2swf -T 9 -f ${fileIn} -o ${fileOut} 

Now I get these messages:
Code:
2012-03-14 09:51:06,466 WARN  [com.openkm.extractor.PdfTextExtractor] PDF does not contains text layer
2012-03-14 09:51:06,982 WARN  [com.openkm.extractor.PdfTextExtractor] PDF does not contains text layer
Here is the link to the source PDF (in French):
http://tech-express.ca/docups3.pdf
 #14549  by techexpress
 
Thanks for the reply.
I changed the setting but I still get the same message, "PDF does not contains text layer", twice.
Here is the log:
Code:
2012-03-16 08:17:46,636 DEBUG [org.jboss.deployment.scanner.URLDeploymentScanner] Watch URL for: file:/opt/jboss-as/server/default/deploy/jmx-console.war/ -> file:/opt/jboss-as/server/default/deploy/jmx-console.war/WEB-INF/web.xml
2012-03-16 08:17:46,656 DEBUG [org.jboss.deployment.scanner.AbstractDeploymentScanner$ScannerThread] Notified that enabled: true
2012-03-16 08:17:46,656 DEBUG [org.jboss.deployment.scanner.URLDeploymentScanner] Started jboss.deployment:type=DeploymentScanner,flavor=URL
2012-03-16 08:17:46,656 DEBUG [org.jboss.system.ServiceController] Starting dependent components for: jboss.deployment:type=DeploymentScanner,flavor=URL dependent components: []
2012-03-16 08:17:46,870 DEBUG [org.jboss.deployment.MainDeployer] End deployment start on package: jboss-service.xml
2012-03-16 08:17:46,870 DEBUG [org.jboss.deployment.MainDeployer] Deployed package: file:/opt/jboss-as/server/default/conf/jboss-service.xml
2012-03-16 08:17:46,871 DEBUG [org.jboss.web.tomcat.service.JBossWeb] Saw org.jboss.system.server.started notification, starting connectors
2012-03-16 08:17:46,878 INFO  [org.apache.coyote.http11.Http11Protocol] Démarrage de Coyote HTTP/1.1 sur http-0.0.0.0-8080
2012-03-16 08:17:46,907 INFO  [org.apache.coyote.ajp.AjpProtocol] Starting Coyote AJP/1.3 on ajp-0.0.0.0-8009
2012-03-16 08:17:47,062 INFO  [org.jboss.system.server.Server] JBoss (MX MicroKernel) [4.2.3.GA (build: SVNTag=JBoss_4_2_3_GA date=200807181439)] Started in 47s:699ms
2012-03-16 08:31:22,996 DEBUG [org.jboss.security.plugins.JaasSecurityManager.OpenKM] CallbackHandler: org.jboss.security.auth.callback.SecurityAssociationHandler@3c859513
2012-03-16 08:31:22,997 DEBUG [org.jboss.security.plugins.JaasSecurityManagerService] Created securityMgr=org.jboss.security.plugins.JaasSecurityManager@135fa2b9
2012-03-16 08:31:22,998 DEBUG [org.jboss.security.plugins.JaasSecurityManager.OpenKM] CachePolicy set to: org.jboss.util.TimedCachePolicy@6bb4299e
2012-03-16 08:31:22,998 DEBUG [org.jboss.security.plugins.JaasSecurityManagerService] setCachePolicy, c=org.jboss.util.TimedCachePolicy@6bb4299e
2012-03-16 08:31:22,998 DEBUG [org.jboss.security.plugins.JaasSecurityManagerService] Added OpenKM, org.jboss.security.plugins.SecurityDomainContext@1187b50 to map
2012-03-16 08:32:03,897 WARN  [com.openkm.extractor.PdfTextExtractor] PDF does not contains text layer
2012-03-16 08:32:03,901 WARN  [com.openkm.extractor.PdfTextExtractor] PDF does not contains text layer
2012-03-16 08:32:09,383 INFO  [org.apache.jackrabbit.core.query.lucene.MultiIndex] updating index with 1 nodes from indexing queue.
BUT... I checked the temp directory and found two files, okm518291356940814048.txt.txt and okm7108162072939302808.txt.txt. Yes, two files for the same scan, each with a double extension.
Here is the content of these files; we can see that OCR works:
Code:
À propos de ce manuel
Ce manuel contient des instructions détaillées pour Pinstallation et a été conçu
pour servir de référence lors de l”exploitation, des interventions de dépannage,
et des futures mises à niveau.
Conventions typographiques
Ce document utilise les conventions suivantes pour distinguer les différents
éléments de texte :
Touches
ENTREE
UTILISATEUR
NOMS DE FICHIERS
Options Menu, Noms de
commandes, Noms de
boîtes de dialogue
COMMANDES,
REPERTOIRES, et
UNITES
Les touches apparaissent en gras. Un signe plus (+)
entre deux touches signifie que ces demièrcs
doivent être enfoncées simultanément.
Les entrees de l'utilisateur apparaissent dans un
style différent et en majuscule.
Les noms de fichiers apparaissent en italique et en
majuscule.
La première lettre de ces éléments est en majuscule
Ces éléments apparaissent en majuscule.
So I guess the problem isn't with the OCR itself but with what happens after OCR.
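A note on the double extension: the Tesseract 3 command line takes an output base name and appends .txt to it, so a ${fileOut} value that already ends in .txt would give exactly the .txt.txt names seen above. The sketch below only illustrates that naming and where the -l fra option ends up. The helpers expand_template and tesseract_output_path are hypothetical, and how OpenKM itself substitutes ${fileIn} and ${fileOut} is an assumption, not something confirmed in this thread.

```python
from string import Template

# The system.ocr value quoted in this thread.
SYSTEM_OCR = "/usr/local/bin/tesseract ${fileIn} ${fileOut} -l fra"


def expand_template(command, file_in, file_out):
    """Fill in ${fileIn} and ${fileOut}, then split into an argument list.

    Hypothetical helper: it mimics what the placeholders suggest, not
    OpenKM's real code. Splitting on whitespace breaks on paths that
    contain spaces; that is acceptable for the /tmp paths in the log.
    """
    filled = Template(command).substitute(fileIn=file_in, fileOut=file_out)
    return filled.split()


def tesseract_output_path(output_base):
    """Tesseract 3 writes its text result to '<output_base>.txt'."""
    return output_base + ".txt"


if __name__ == "__main__":
    args = expand_template(SYSTEM_OCR, "/tmp/image.tiff", "/tmp/okm123.txt")
    print(args)
    print(tesseract_output_path(args[2]))
```

If that assumption holds, the double extension is a side effect of how Tesseract names its output rather than a sign that OCR failed. Note that -l fra also needs the French language data (fra.traineddata) installed in Tesseract's tessdata directory.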
 #14572  by jllort
 
Could you try the same file in our online demo at demo.openkm.com to see whether OCR runs correctly there?

Have you configured a dictionary? There seems to be a problem with it at com.openkm.util.DocumentUtils.spellChecker(DocumentUtils.java:177).
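OpenOffice .oxt extensions are ZIP archives, and the trace shows a ZipException raised under spellChecker, so one quick check is whether the configured dictionary path points to a readable ZIP file. Below is a minimal sketch using Python's standard zipfile module. The helper check_dictionary is hypothetical, and the idea that spellChecker opens the system.openoffice.dictionary file is an assumption drawn from the stack trace only.

```python
import os
import zipfile


def check_dictionary(path):
    """Return a short status string for a dictionary file path.

    Hypothetical helper for diagnosis; it does not call any OpenKM code.
    """
    if not os.path.isfile(path):
        return "missing"
    if not zipfile.is_zipfile(path):
        return "not a zip archive"
    with zipfile.ZipFile(path) as archive:
        # testzip() returns the name of the first corrupt member, or None.
        corrupt = archive.testzip()
    if corrupt is not None:
        return "corrupt member: " + corrupt
    return "ok"


if __name__ == "__main__":
    # Path taken from the system.openoffice.dictionary setting in this thread.
    print(check_dictionary(
        "/opt/openoffice.org/basis3.2/share/wordbook/fr/"
        "ooo-dictionnaire-fr-moderne-v4.2.oxt"
    ))
```

Any result other than "ok" would be consistent with the "error in opening zip file" message, though it would not prove that this is the cause.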
 #14701  by techexpress
 
Hi again,

OK, I just tested scans of the same document on the demo as user0. The first test is in standard quality: docups.pdf (17.5 k). The second is in photo quality: docupshq.pdf (135 k).
Then I selected the search tab and searched for these two words: 'manuel' and 'typographiques'. The demo website didn't find anything.
I tried with both the English and the French configuration.

For the dictionary I use this setting: system.openoffice.dictionary String /opt/openoffice.org/basis3.2/share/wordbook/fr/ooo-dictionnaire-fr-moderne-v4.2.oxt
 #14884  by pavila
 
Tesseract and Cuneiform need 300 dpi to obtain good results. Commercial OCR such as ABBYY works with lower resolutions.
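To check whether a scan actually reaches the 300 dpi mentioned above, the effective resolution can be estimated from the image size in pixels and the physical page size. This is a minimal sketch: effective_dpi, meets_ocr_minimum and MIN_OCR_DPI are illustrative names, and the US Letter page size (8.5 x 11 inches) in the example is an assumption about the scanned page.

```python
MIN_OCR_DPI = 300  # threshold quoted in the post above


def effective_dpi(pixels, inches):
    """Resolution along one axis: pixels divided by physical length."""
    return pixels / inches


def meets_ocr_minimum(width_px, height_px, width_in, height_in):
    """True if both axes reach the minimum resolution."""
    lowest = min(
        effective_dpi(width_px, width_in),
        effective_dpi(height_px, height_in),
    )
    return lowest >= MIN_OCR_DPI


if __name__ == "__main__":
    # Assumed page: US Letter, 8.5 x 11 inches.
    print(effective_dpi(2550, 8.5))                   # full-resolution scan
    print(meets_ocr_minimum(2550, 3300, 8.5, 11.0))
    print(meets_ocr_minimum(1275, 1650, 8.5, 11.0))   # half the pixels
```

The pixel dimensions of the image embedded in the PDF matter here, not the scanner's hardware resolution, since a scanning profile may downsample before saving.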
 #14892  by techexpress
 
OK,

Sorry for my poor English; I'm French.
I understand your answer, but if I scan the document at photo quality it's more than 300 dpi (the hardware resolution is 600 dpi).
Also, I think the document is decoded correctly, since I find the .txt files in the temp directory.
Do you think it can't be indexed because of the double extension?
 #14970  by techexpress
 
Sorry for the delay; I was too tired to try again.
I tried to upload the PDF file but got:
The extension pdf is not allowed.
You can download it here: http://tech-express.ca/docups3.pdf

Do you know where I can find a very good tutorial for CentOS?
By the way, I will try with Ubuntu tomorrow.
Thank you in advance and have a nice day.
 #15359  by techexpress
 
Hi,
Sorry for the long delay; I was on another project for a client.
Today I tried OpenKM on Ubuntu 10.04 and now OCR works, even though OpenOffice isn't installed!
Some words are recognized ("manuel", "conventions") while others are not ("typographique").
Do you think I need to install a dictionary? Do you have a good tutorial for installing OpenOffice on Ubuntu 10?
Thanks and have a nice day.
