Open Source Document Management System | OpenKM - OCR with OpenKM under Centos 5.7 x64 ?

OCR with OpenKM under Centos 5.7 x64 ?

#14454 by techexpress
Mon Mar 12, 2012 10:00 pm

Hi all!
I run OpenKM 5.1.9 under CentOS 5.7 x64. Previews for JPEG, PDF and Office documents work fine.
Now I'm trying to configure OCR with Tesseract 3. Here is the value I set in the admin settings:

Code:

system.ocr	String 	/usr/local/bin/tesseract ${fileIn} ${fileOut}

Do we need to put /usr/local/bin/tesseract ${fileIn} ${fileOut} -l fra instead for the French language?
Also, when I scan a very well-printed document I get these errors:

Code:

2012-03-12 17:24:57,984 WARN  [com.openkm.extractor.CuneiformTextExtractor] IO exception executing command: /usr/local/bin/tesseract /tmp/image02251514838640367430.tiff /tmp/okm4578699148093339593.txt -l fra
java.util.zip.ZipException: error in opening zip file
	at java.util.zip.ZipFile.open(Native Method)
	at java.util.zip.ZipFile.<init>(Unknown Source)
	at java.util.zip.ZipFile.<init>(Unknown Source)
	at com.openkm.util.DocumentUtils.spellChecker(DocumentUtils.java:177)
	at com.openkm.extractor.CuneiformTextExtractor.doOcr(CuneiformTextExtractor.java:130)
	at com.openkm.extractor.PdfTextExtractor.doOcr(PdfTextExtractor.java:137)
	at com.openkm.extractor.PdfTextExtractor.extractText(PdfTextExtractor.java:98)
	at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
	at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
	at org.apache.jackrabbit.core.query.lucene.TextExtractorJob$1.call(TextExtractorJob.java:93)
	at EDU.oswego.cs.dl.util.concurrent.FutureResult$1.run(Unknown Source)
	at org.apache.jackrabbit.core.query.lucene.TextExtractorJob.run(TextExtractorJob.java:172)
	at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Unknown Source)
2012-03-12 17:24:58,007 WARN  [com.openkm.extractor.CuneiformTextExtractor] IO exception executing command: /usr/local/bin/tesseract /tmp/image03630279688831400324.tiff /tmp/okm6425780083587314556.txt -l fra
java.util.zip.ZipException: error in opening zip file
	at java.util.zip.ZipFile.open(Native Method)
	at java.util.zip.ZipFile.<init>(Unknown Source)
	at java.util.zip.ZipFile.<init>(Unknown Source)
	at com.openkm.util.DocumentUtils.spellChecker(DocumentUtils.java:177)
	at com.openkm.extractor.CuneiformTextExtractor.doOcr(CuneiformTextExtractor.java:130)
	at com.openkm.extractor.PdfTextExtractor.doOcr(PdfTextExtractor.java:137)
	at com.openkm.extractor.PdfTextExtractor.extractText(PdfTextExtractor.java:98)
	at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
	at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
	at org.apache.jackrabbit.core.query.lucene.TextExtractorJob$1.call(TextExtractorJob.java:93)
	at EDU.oswego.cs.dl.util.concurrent.FutureResult$1.run(Unknown Source)
	at org.apache.jackrabbit.core.query.lucene.TextExtractorJob.run(TextExtractorJob.java:172)
	at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Unknown Source)

Last edited by techexpress on Wed Mar 14, 2012 1:42 pm, edited 1 time in total.

Username

techexpress

Rank

Junior Boarder

Posts

37

Joined

Tue Mar 06, 2012 3:54 pm

Location

Québec , Canada

Contact

Re: OCR with OpenKM under Centos 5.7 x64 ?

#14484 by pavila
Tue Mar 13, 2012 6:50 pm

This error is related to the spell checker configuration.

Username

pavila

Rank

Moderator

Posts

3145

Joined

Tue Dec 11, 2007 6:02 pm

Location

Alicante, Spain

Contact

Re: OCR with OpenKM under Centos 5.7 x64 ?

#14504 by techexpress
Wed Mar 14, 2012 2:05 pm

Thanks for the reply.
I removed all the settings from the configuration file and moved them to Administration -> Configuration:

Code:

#
# Since OpenKM 5.1 this file is only used for Hibernate configuration.
# To change configuration parameters, use Administration -> Configuration
#
hibernate.dialect=org.hibernate.dialect.MySQL5Dialect
hibernate.hbm2ddl=none

Settings from Administration -> Configuration:

Code:

system.ocr	String 	/usr/local/bin/tesseract ${fileIn} ${fileOut} -l fra 
system.openoffice.dictionary	String 	/opt/openoffice.org/basis3.2/share/wordbook/fr/ooo-dictionnaire-fr-moderne-v4.2.oxt 
system.openoffice.path	String 	/opt/openoffice.org3 	
system.openoffice.port	Integer 	2002 	
system.openoffice.server	String 	http://localhost:8080/converter/convert 
system.openoffice.tasks	Integer 	200 	
system.pdf.force.ocr	Boolean 	Inactive 	
system.previewer	String 	zviewer 	
system.readonly	Boolean 	Inactive 	
system.swftools.pdf2swf	String 	/usr/local/bin/pdf2swf -T 9 -f ${fileIn} -o ${fileOut}

Now I get these warnings:

Code:

2012-03-14 09:51:06,466 WARN  [com.openkm.extractor.PdfTextExtractor] PDF does not contains text layer
2012-03-14 09:51:06,982 WARN  [com.openkm.extractor.PdfTextExtractor] PDF does not contains text layer

Here is the link to the source PDF (in French):
http://tech-express.ca/docups3.pdf

Re: OCR with OpenKM under Centos 5.7 x64 ?

#14534 by jllort
Fri Mar 16, 2012 8:22 am

You could try changing

system.pdf.force.ocr Boolean active

Username

jllort

Rank

Moderator

Posts

12187

Joined

Fri Dec 21, 2007 11:23 am

Location

Sineu - ( Illes Balears ) - Spain

Contact

Re: OCR with OpenKM under Centos 5.7 x64 ?

#14549 by techexpress
Fri Mar 16, 2012 12:40 pm

Thanks for the reply.
I changed the setting but I still get the same message, "PDF does not contains text layer", twice.
Here is the log:

Code:

2012-03-16 08:17:46,636 DEBUG [org.jboss.deployment.scanner.URLDeploymentScanner] Watch URL for: file:/opt/jboss-as/server/default/deploy/jmx-console.war/ -> file:/opt/jboss-as/server/default/deploy/jmx-console.war/WEB-INF/web.xml
2012-03-16 08:17:46,656 DEBUG [org.jboss.deployment.scanner.AbstractDeploymentScanner$ScannerThread] Notified that enabled: true
2012-03-16 08:17:46,656 DEBUG [org.jboss.deployment.scanner.URLDeploymentScanner] Started jboss.deployment:type=DeploymentScanner,flavor=URL
2012-03-16 08:17:46,656 DEBUG [org.jboss.system.ServiceController] Starting dependent components for: jboss.deployment:type=DeploymentScanner,flavor=URL dependent components: []
2012-03-16 08:17:46,870 DEBUG [org.jboss.deployment.MainDeployer] End deployment start on package: jboss-service.xml
2012-03-16 08:17:46,870 DEBUG [org.jboss.deployment.MainDeployer] Deployed package: file:/opt/jboss-as/server/default/conf/jboss-service.xml
2012-03-16 08:17:46,871 DEBUG [org.jboss.web.tomcat.service.JBossWeb] Saw org.jboss.system.server.started notification, starting connectors
2012-03-16 08:17:46,878 INFO  [org.apache.coyote.http11.Http11Protocol] Démarrage de Coyote HTTP/1.1 sur http-0.0.0.0-8080
2012-03-16 08:17:46,907 INFO  [org.apache.coyote.ajp.AjpProtocol] Starting Coyote AJP/1.3 on ajp-0.0.0.0-8009
2012-03-16 08:17:47,062 INFO  [org.jboss.system.server.Server] JBoss (MX MicroKernel) [4.2.3.GA (build: SVNTag=JBoss_4_2_3_GA date=200807181439)] Started in 47s:699ms
2012-03-16 08:31:22,996 DEBUG [org.jboss.security.plugins.JaasSecurityManager.OpenKM] CallbackHandler: org.jboss.security.auth.callback.SecurityAssociationHandler@3c859513
2012-03-16 08:31:22,997 DEBUG [org.jboss.security.plugins.JaasSecurityManagerService] Created securityMgr=org.jboss.security.plugins.JaasSecurityManager@135fa2b9
2012-03-16 08:31:22,998 DEBUG [org.jboss.security.plugins.JaasSecurityManager.OpenKM] CachePolicy set to: org.jboss.util.TimedCachePolicy@6bb4299e
2012-03-16 08:31:22,998 DEBUG [org.jboss.security.plugins.JaasSecurityManagerService] setCachePolicy, c=org.jboss.util.TimedCachePolicy@6bb4299e
2012-03-16 08:31:22,998 DEBUG [org.jboss.security.plugins.JaasSecurityManagerService] Added OpenKM, org.jboss.security.plugins.SecurityDomainContext@1187b50 to map
2012-03-16 08:32:03,897 WARN  [com.openkm.extractor.PdfTextExtractor] PDF does not contains text layer
2012-03-16 08:32:03,901 WARN  [com.openkm.extractor.PdfTextExtractor] PDF does not contains text layer
2012-03-16 08:32:09,383 INFO  [org.apache.jackrabbit.core.query.lucene.MultiIndex] updating index with 1 nodes from indexing queue.

BUT... I checked the temp directory and found two files, okm518291356940814048.txt.txt and okm7108162072939302808.txt.txt. Yes, two files for the same scan, each with a double extension.
Here is the content of these files; we can see that OCR works:

Code:

À propos de ce manuel
Ce manuel contient des instructions détaillées pour Pinstallation et a été conçu
pour servir de référence lors de l”exploitation, des interventions de dépannage,
et des futures mises à niveau.
Conventions typographiques
Ce document utilise les conventions suivantes pour distinguer les différents
éléments de texte :
Touches
ENTREE
UTILISATEUR
NOMS DE FICHIERS
Options Menu, Noms de
commandes, Noms de
boîtes de dialogue
COMMANDES,
REPERTOIRES, et
UNITES
Les touches apparaissent en gras. Un signe plus (+)
entre deux touches signiﬁe que ces demièrcs
doivent être enfoncées simultanément.
Les entrees de l'utilisateur apparaissent dans un
style différent et en majuscule.
Les noms de ﬁchiers apparaissent en italique et en
majuscule.
La première lettre de ces éléments est en majuscule
Ces éléments apparaissent en majuscule.

So I guess the problem isn't with the OCR itself but with what happens after the OCR step.
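A note on the double extension: Tesseract 3 treats its second argument as an output base name and appends ".txt" itself, so a ${fileOut} value that already ends in ".txt" produces ".txt.txt". Below is a minimal Python sketch of that naming rule. The helper name is hypothetical, and whether OpenKM then fails to find the file under the name it expected is an assumption that this thread does not confirm.

```python
def tesseract_output_path(output_base: str) -> str:
    """Return the file name Tesseract 3 writes for a given output base.

    Tesseract 3 appends ".txt" to whatever base name it receives.
    """
    return output_base + ".txt"


# Temp name of the same shape as the ${fileOut} values seen in the log
file_out = "/tmp/okm518291356940814048.txt"
print(tesseract_output_path(file_out))  # /tmp/okm518291356940814048.txt.txt
```

If that assumption holds, the text is produced correctly but never read back, which would match the symptoms described here.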

Re: OCR with OpenKM under Centos 5.7 x64 ?

#14572 by jllort
Mon Mar 19, 2012 12:42 pm

Could you try the same file in our online demo at demo.openkm.com to see whether OCR runs correctly there?

Have you configured a dictionary? There seems to be a problem with it at com.openkm.util.DocumentUtils.spellChecker(DocumentUtils.java:177).
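The earlier stack trace shows java.util.zip.ZipFile failing inside spellChecker, which suggests the dictionary file is opened as a ZIP archive (OpenOffice .oxt extensions are ZIP archives). A quick way to check the configured file outside OpenKM is sketched below in Python. The function name is hypothetical and this is only a diagnostic aid, not what OpenKM itself runs.

```python
import zipfile


def check_dictionary(path: str) -> str:
    """Report whether the file at `path` can be opened as a ZIP archive."""
    if not zipfile.is_zipfile(path):
        # Covers a missing file, an unreadable file, and a non-ZIP file
        return "not a readable ZIP archive"
    with zipfile.ZipFile(path) as archive:
        return "ZIP archive with %d entries" % len(archive.namelist())


if __name__ == "__main__":
    # Path taken from the system.openoffice.dictionary setting in this thread
    print(check_dictionary(
        "/opt/openoffice.org/basis3.2/share/wordbook/fr/"
        "ooo-dictionnaire-fr-moderne-v4.2.oxt"))
```

Run it as the same system user that runs JBoss, since a permission problem on the path would also make the file unreadable.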

Re: OCR with OpenKM under Centos 5.7 x64 ?

#14701 by techexpress
Tue Mar 20, 2012 12:56 am

Hi again

OK, I just scanned the same document into the demo as user0. The first test is in standard quality: docups.pdf (17.5 k); the second is in photo quality: docupshq.pdf (135 k).
Then I went to the search tab and looked for these two words: 'manuel' and 'typographiques'. The demo web site didn't find anything.
I tried with both the English and the French configuration.

For the dictionary I use this setting: system.openoffice.dictionary String /opt/openoffice.org/basis3.2/share/wordbook/fr/ooo-dictionnaire-fr-moderne-v4.2.oxt

Re: OCR with OpenKM under Centos 5.7 x64 ?

#14884 by pavila
Wed Mar 21, 2012 9:31 am

Tesseract and Cuneiform need 300 dpi input to obtain good results. Commercial OCR engines like ABBYY work with lower resolutions.
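What counts here is the resolution of the image actually stored in the PDF, which can be lower than what the scanner hardware supports. The arithmetic is just pixels divided by inches. Below is a small Python sketch; the function names and the sample pixel width are made up for illustration.

```python
def scan_dpi(pixels: int, inches: float) -> float:
    """Resolution in dots per inch along one axis of a scanned page."""
    return pixels / inches


def pixels_needed(inches: float, dpi: int = 300) -> int:
    """Pixels needed along one axis to reach the target resolution."""
    return round(inches * dpi)


# US Letter is 8.5 x 11 inches; 1275 px is a hypothetical image width
print(scan_dpi(1275, 8.5))                     # 150.0
print(pixels_needed(8.5), pixels_needed(11))   # 2550 3300
```

So a Letter page needs roughly 2550 x 3300 pixels to reach 300 dpi.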

Re: OCR with OpenKM under Centos 5.7 x64 ?

#14892 by techexpress
Wed Mar 21, 2012 2:33 pm

OK ,

Sorry for my poor English, but I'm French.
I understand your answer, but if I scan the document at photo quality it is more than 300 dpi (the hardware resolution is 600 dpi).
Also, I think the document is decoded correctly, since I find the .txt file in the temp directory.
Do you think it can't be indexed because of the double extension?

Re: OCR with OpenKM under Centos 5.7 x64 ?

#14955 by pavila
Sun Mar 25, 2012 4:07 pm

Two suggestions:

- Attach a PDF to this forum thread, so I can test it on my local installation.

- Try to reproduce the problem in a nightly build from http://integration.openkm.com/5.1.x/.

Re: OCR with OpenKM under Centos 5.7 x64 ?

#14970 by techexpress
Tue Mar 27, 2012 1:18 am

Sorry for the delay, I was too tired to make a new attempt.
I tried to upload the PDF file, but I got:

The extension pdf is not allowed.

You can download it here: http://tech-express.ca/docups3.pdf

Do you know where I can find a very good tutorial for CentOS?
By the way, I will try with Ubuntu tomorrow.
Thank you in advance and have a nice day.

Re: OCR with OpenKM under Centos 5.7 x64 ?

#15033 by pavila
Sat Mar 31, 2012 7:47 am

According to http://wiki.openkm.com/index.php/Third- ... ation:_OCR, the minimal configuration for Cuneiform is:

/usr/bin/cuneiform ${fileIn} -o ${fileOut}

Be careful with the -o parameter.
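To see what a system.ocr value turns into once the placeholders are filled in, here is a minimal Python sketch. It only illustrates ${fileIn} / ${fileOut} substitution and is not OpenKM's implementation; the function name and the sample file names are hypothetical.

```python
import shlex
from string import Template


def build_command(template: str, file_in: str, file_out: str) -> list:
    """Expand ${fileIn} and ${fileOut}, then split into an argument list."""
    expanded = Template(template).substitute(fileIn=file_in, fileOut=file_out)
    return shlex.split(expanded)


print(build_command("/usr/bin/cuneiform ${fileIn} -o ${fileOut}",
                    "/tmp/page.tiff", "/tmp/page.txt"))
# ['/usr/bin/cuneiform', '/tmp/page.tiff', '-o', '/tmp/page.txt']
```

The point of the -o warning shows up in the result: Cuneiform takes its output file through the -o option, while the Tesseract line used earlier in this thread passes the output name as a plain second argument.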

Re: OCR with OpenKM under Centos 5.7 x64 ?

#15359 by techexpress
Wed Apr 25, 2012 12:33 am

Hi,
Sorry for the looooong delay, I was on another project for a client.
Today I tried OpenKM with Ubuntu 10.04 and now the OCR works, even though OpenOffice isn't installed!!
Some words are recognized while others are not: "manuel" and "conventions" are found, but not "typographique".
Do you think I need to install a dictionary? Do you have a good tutorial for installing OpenOffice on Ubuntu 10?
Thanks and have a nice day.
