OCR with Tesseract doesn't work
Posted: Wed Dec 21, 2011 5:13 pm
by andydld
Hi,
I just installed Debian Squeeze with OpenKM 5.1.8.
I'm not sure about the configuration for Tesseract OCR.
Just a question before I start:
Is OCR only for PDFs, or also for images like JPEG, GIF, etc.?
I installed tesseract this way:
apt-get install tesseract-ocr tesseract-ocr-deu
And I configured it within OpenKM (on the admin page, not in the openkm.cfg file) this way:
"system.ocr=/usr/bin/tesseract -l deu ${fileIn} ${fileOut}" <- Also tested with "/usr/bin/tesseract -l deu" and "/usr/bin/tesseract".
system.pdf.force.ocr=on <- I think this activates OCR for PDFs; is that right?
OpenOffice, OO-Dictionary, ImageMagick, SWFTools are installed, too.
I uploaded several pictures and PDFs. Searching finds none of the pictures and only some of the PDFs.
Maybe someone can point me in the right direction about what's going on.
Regards,
Andy
Re: OCR with Tesseract, some questions and maybe a problem
Posted: Wed Dec 21, 2011 9:43 pm
by andydld
Hi again,
I searched and tried a lot over the last few hours.
Now I have Tesseract 3.01 on the system.
With my test.tiff it works in bash (tested with the command lines "tesseract test.tiff test" and "tesseract test.tiff test -l deu").
I got a text file with readable text.
But it doesn't seem to work within OpenKM.
I tried with the originally installed Tesseract 2.04 (from the Debian Squeeze repo).
I tried with Tesseract 3.01.
I tried with and without com.openkm.extractor.Tesseract2TextExtractor and com.openkm.extractor.Tesseract3TextExtractor.
I tried with system.ocr=/usr/bin/tesseract, system.ocr=/usr/bin/tesseract -l deu, system.ocr=/usr/bin/tesseract ${fileIn} ${fileOut}, system.ocr=/usr/bin/tesseract ${fileIn} ${fileOut} -l deu, system.ocr=/usr/bin/tesseract -l deu ${fileIn} ${fileOut}.
Nothing seems to work.
There is no OCR-related error in the server.log.
Any ideas?
Regards and good night,
Andy
Re: OCR with Tesseract doesn't work
Posted: Thu Dec 22, 2011 5:29 pm
by andydld
Hi to all,
today I tested OpenKM 5.1.8 on Windows 7 Pro x64 with the same result.
OCR with Tesseract doesn't seem to work.
There's no error in the log. On the console everything is fine.
I added "com.openkm.extractor.Tesseract3TextExtractor" to "registered.text.extractors".
I set "system.ocr" to "C:\Program Files (x86)\Tesseract-OCR\tesseract.exe".
I also tried a trick to see whether Tesseract is called from OpenKM at all.
I replaced the "tesseract.exe" in "system.ocr" with the name of a batch script.
This script accepts any parameters and passes them through to tesseract.exe; it also writes the current time to a report file.
It works on the console: a report entry is written. But when I try it with OpenKM, it seems Tesseract is never called.
The question is: what is wrong?
Best regards,
Andy
Re: OCR with Tesseract doesn't work
Posted: Fri Dec 23, 2011 3:09 pm
by jllort
OK,
You should register the extractor in repository.xml and workspace.xml (in the repository folder's subdirectories), and in the administration tab there are some extractor properties that must be updated too.
I suggest testing Tesseract from a terminal console to make sure there's no problem with it.
Tell us if it solves the problem.
Re: OCR with Tesseract doesn't work
Posted: Fri Dec 23, 2011 6:16 pm
by andydld
Thanks for the answer.
What exactly do I have to do in these two XML files?
Add "com.openkm.extractor.Tesseract3TextExtractor" to "textFilterClasses"?
My current "registered.text.extractors" on the admin-tab are:
Code:
org.apache.jackrabbit.extractor.PlainTextExtractor
org.apache.jackrabbit.extractor.MsWordTextExtractor
org.apache.jackrabbit.extractor.MsExcelTextExtractor
org.apache.jackrabbit.extractor.MsPowerPointTextExtractor
org.apache.jackrabbit.extractor.OpenOfficeTextExtractor
org.apache.jackrabbit.extractor.RTFTextExtractor
org.apache.jackrabbit.extractor.HTMLTextExtractor
org.apache.jackrabbit.extractor.XMLTextExtractor
org.apache.jackrabbit.extractor.PngTextExtractor
org.apache.jackrabbit.extractor.MsOutlookTextExtractor
com.openkm.extractor.PdfTextExtractor
com.openkm.extractor.AudioTextExtractor
com.openkm.extractor.ExifTextExtractor
com.openkm.extractor.CuneiformTextExtractor
com.openkm.extractor.SourceCodeTextExtractor
com.openkm.extractor.MsOffice2007TextExtractor
com.openkm.extractor.Tesseract3TextExtractor
What should the "system.ocr" value be?
At the moment mine looks like
Code:
C:\Program Files (x86)\Tesseract-OCR\tesseract.exe ${fileIn} ${fileOut}
on my Windows machine.
On the console I have no problem with Tesseract on either system (Debian, Windows).
For a short test I added "com.openkm.extractor.Tesseract3TextExtractor" to "textFilterClasses" in "workspace.xml". Then I restarted OpenKM, logged in, and uploaded one of the two test images included with Tesseract on Windows. I found this in the "server.log":
Code:
2011-12-23 19:05:06,496 WARN [com.openkm.util.ExecutionUtils] Abnormal program termination: 1
2011-12-23 19:05:06,575 WARN [com.openkm.util.ExecutionUtils] STDERR: read_params_file: parameter not found: II*
2011-12-23 19:05:06,576 WARN [com.openkm.extractor.Tesseract3TextExtractor] IO exception executing command: C:\Program Files (x86)\Tesseract-OCR\tesseract.exe C:\Users\andy\AppData\Local\Temp\okm2293481143648870807.tif C:\Users\andy\AppData\Local\Temp\okm8839435122050927398 C:\Users\andy\AppData\Local\Temp\okm2293481143648870807.tif C:\Users\andy\AppData\Local\Temp\okm8839435122050927398
java.io.FileNotFoundException: C:\Users\andy\AppData\Local\Temp\okm8839435122050927398.txt (Das System kann die angegebene Datei nicht finden) <- File not found.
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:120)
at java.io.FileInputStream.<init>(FileInputStream.java:79)
at com.openkm.extractor.Tesseract3TextExtractor.extractText(Tesseract3TextExtractor.java:92)
at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
at org.apache.jackrabbit.core.query.lucene.TextExtractorJob$1.call(TextExtractorJob.java:93)
at EDU.oswego.cs.dl.util.concurrent.FutureResult$1.run(Unknown Source)
at org.apache.jackrabbit.core.query.lucene.TextExtractorJob.run(TextExtractorJob.java:172)
at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Thread.java:662)
2011-12-23 19:05:06,780 WARN [com.openkm.util.ExecutionUtils] Abnormal program termination: 1
2011-12-23 19:05:06,781 WARN [com.openkm.util.ExecutionUtils] STDERR: read_params_file: parameter not found: II*
2011-12-23 19:05:06,781 WARN [com.openkm.extractor.Tesseract3TextExtractor] IO exception executing command: C:\Program Files (x86)\Tesseract-OCR\tesseract.exe C:\Users\andy\AppData\Local\Temp\okm2141153793137157575.tif C:\Users\andy\AppData\Local\Temp\okm5031118750275653742 C:\Users\andy\AppData\Local\Temp\okm2141153793137157575.tif C:\Users\andy\AppData\Local\Temp\okm5031118750275653742
java.io.FileNotFoundException: C:\Users\andy\AppData\Local\Temp\okm5031118750275653742.txt (Das System kann die angegebene Datei nicht finden) <- File not found.
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:120)
at java.io.FileInputStream.<init>(FileInputStream.java:79)
at com.openkm.extractor.Tesseract3TextExtractor.extractText(Tesseract3TextExtractor.java:92)
at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
at org.apache.jackrabbit.core.query.lucene.TextExtractorJob$1.call(TextExtractorJob.java:93)
at EDU.oswego.cs.dl.util.concurrent.FutureResult$1.run(Unknown Source)
at org.apache.jackrabbit.core.query.lucene.TextExtractorJob.run(TextExtractorJob.java:172)
at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Thread.java:662)
2011-12-23 19:05:07,281 INFO [org.apache.jackrabbit.core.query.lucene.MultiIndex] updating index with 1 nodes from indexing queue.
Re: OCR with Tesseract doesn't work
Posted: Fri Dec 23, 2011 6:48 pm
by andydld
I tested with all the "registered.text.extractors" entries in the two XML files on both systems (Debian, Windows).
On Debian I got the same "file not found / abnormal program termination" errors:
Code:
2011-12-23 19:45:19,928 WARN [com.openkm.util.ExecutionUtils] Abnormal program termination: 1
2011-12-23 19:45:19,930 WARN [com.openkm.util.ExecutionUtils] STDERR: read_params_file: parameter not found: II*
2011-12-23 19:45:19,930 WARN [com.openkm.extractor.Tesseract3TextExtractor] IO exception executing command: /usr/bin/tesseract /tmp/okm7108422049875095523.tif /tmp/okm2543197829169296670 -l deu /tmp/okm7108422049875095523.tif /tmp/okm2543197829169296670
java.io.FileNotFoundException: /tmp/okm2543197829169296670.txt (No such file or directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:137)
at java.io.FileInputStream.<init>(FileInputStream.java:96)
at com.openkm.extractor.Tesseract3TextExtractor.extractText(Tesseract3TextExtractor.java:92)
at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
at org.apache.jackrabbit.core.query.lucene.TextExtractorJob$1.call(TextExtractorJob.java:93)
at EDU.oswego.cs.dl.util.concurrent.FutureResult$1.run(Unknown Source)
at org.apache.jackrabbit.core.query.lucene.TextExtractorJob.run(TextExtractorJob.java:172)
at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Thread.java:636)
2011-12-23 19:45:20,042 WARN [com.openkm.util.ExecutionUtils] Abnormal program termination: 1
2011-12-23 19:45:20,051 WARN [com.openkm.util.ExecutionUtils] STDERR: read_params_file: parameter not found: II*
2011-12-23 19:45:20,051 WARN [com.openkm.extractor.Tesseract3TextExtractor] IO exception executing command: /usr/bin/tesseract /tmp/okm110838518289848022.tif /tmp/okm8996897110758953388 -l deu /tmp/okm110838518289848022.tif /tmp/okm8996897110758953388
java.io.FileNotFoundException: /tmp/okm8996897110758953388.txt (No such file or directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:137)
at java.io.FileInputStream.<init>(FileInputStream.java:96)
at com.openkm.extractor.Tesseract3TextExtractor.extractText(Tesseract3TextExtractor.java:92)
at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
at org.apache.jackrabbit.core.query.lucene.TextExtractorJob$1.call(TextExtractorJob.java:93)
at EDU.oswego.cs.dl.util.concurrent.FutureResult$1.run(Unknown Source)
at org.apache.jackrabbit.core.query.lucene.TextExtractorJob.run(TextExtractorJob.java:172)
at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Thread.java:636)
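A note on the two logs above: the logged command contains the input and output paths twice (once where ${fileIn} and ${fileOut} were expanded, once appended at the end), and "II*" is what the first bytes of a little-endian TIFF file look like when read as text. That would be consistent with Tesseract treating the extra paths as config-file arguments and trying to parse the .tif as a parameter file, though nothing in the thread so far confirms this. The Python sketch below only illustrates the idea. build_command and extra_arguments are hypothetical helpers, not OpenKM code, and the assumption that the extractor appends both paths after expanding the template is inferred from the log line alone.

```python
# Hypothetical model of what the log suggests: the system.ocr template is
# expanded AND the two paths are appended again. Not OpenKM's real code.
from string import Template
from typing import List

TIFF_LITTLE_ENDIAN_MAGIC = b"II*\x00"  # first four bytes of a little-endian TIFF


def build_command(system_ocr: str, file_in: str, file_out: str) -> List[str]:
    """Expand ${fileIn}/${fileOut}, then append both paths (assumed behaviour)."""
    expanded = Template(system_ocr).safe_substitute(fileIn=file_in, fileOut=file_out)
    return expanded.split() + [file_in, file_out]


def extra_arguments(command: List[str]) -> List[str]:
    """Return positional arguments beyond 'program image outputbase'.

    The option pair '-l <lang>' is skipped; whatever is left over would be
    read by the tesseract command line as config file names.
    """
    positional = []
    skip_next = False
    for arg in command[1:]:
        if skip_next:
            skip_next = False
        elif arg == "-l":
            skip_next = True  # the following token is the language code
        else:
            positional.append(arg)
    return positional[2:]


with_placeholders = build_command(
    "/usr/bin/tesseract ${fileIn} ${fileOut} -l deu", "/tmp/in.tif", "/tmp/out"
)
bare_executable = build_command("/usr/bin/tesseract", "/tmp/in.tif", "/tmp/out")

print(with_placeholders)                   # paths appear twice, as in the log
print(extra_arguments(with_placeholders))  # leftover paths, starting with the .tif
print(extra_arguments(bare_executable))    # nothing left over
print(TIFF_LITTLE_ENDIAN_MAGIC[:3].decode("ascii"))  # the token seen in STDERR
```

If this reading is right, a system.ocr value without the placeholders would leave no stray arguments. Whether that holds for OpenKM 5.1.8 is not confirmed here.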
Re: OCR with Tesseract doesn't work
Posted: Sat Dec 24, 2011 5:54 pm
by jllort
Try first without the -l deu parameter: system.ocr=/usr/bin/tesseract ${fileIn} ${fileOut}. Does the error persist?
If that change solves the error, then try system.ocr=/usr/bin/tesseract ${fileIn} ${fileOut} -l deu
Re: OCR with Tesseract doesn't work
Posted: Sun Dec 25, 2011 12:52 pm
by andydld
I tried both variants, with and without "-l deu". Still the same error on Windows and Debian.
Re: OCR with Tesseract doesn't work
Posted: Mon Dec 26, 2011 11:07 am
by andydld
I just ran a test with Tesseract 2.04 on Windows, with the same error:
Code:
2011-12-26 12:04:16,839 WARN [com.openkm.util.ExecutionUtils] Abnormal program termination: 1
2011-12-26 12:04:16,846 WARN [com.openkm.util.ExecutionUtils] STDERR: error: Could not find variable 'II*'
2011-12-26 12:04:16,847 WARN [com.openkm.extractor.Tesseract2TextExtractor] IO exception executing command: C:\Tesseract-OCR\2.04\tesseract.exe C:\Users\andy\AppData\Local\Temp\okm1919319298833890175.tif C:\Users\andy\AppData\Local\Temp\okm5118280720318210091 C:\Users\andy\AppData\Local\Temp\okm1919319298833890175.tif C:\Users\andy\AppData\Local\Temp\okm5118280720318210091
java.io.FileNotFoundException: C:\Users\andy\AppData\Local\Temp\okm5118280720318210091.txt (Das System kann die angegebene Datei nicht finden)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:120)
at java.io.FileInputStream.<init>(FileInputStream.java:79)
at com.openkm.extractor.Tesseract2TextExtractor.extractText(Tesseract2TextExtractor.java:100)
at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
at org.apache.jackrabbit.core.query.lucene.TextExtractorJob$1.call(TextExtractorJob.java:93)
at EDU.oswego.cs.dl.util.concurrent.FutureResult$1.run(Unknown Source)
at org.apache.jackrabbit.core.query.lucene.TextExtractorJob.run(TextExtractorJob.java:172)
at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Thread.java:662)
2011-12-26 12:04:16,916 WARN [com.openkm.util.ExecutionUtils] Abnormal program termination: 1
2011-12-26 12:04:17,060 WARN [com.openkm.util.ExecutionUtils] STDERR: error: Could not find variable 'II*'
2011-12-26 12:04:17,062 WARN [com.openkm.extractor.Tesseract2TextExtractor] IO exception executing command: C:\Tesseract-OCR\2.04\tesseract.exe C:\Users\andy\AppData\Local\Temp\okm5283665428412893901.tif C:\Users\andy\AppData\Local\Temp\okm8045755092797813934 C:\Users\andy\AppData\Local\Temp\okm5283665428412893901.tif C:\Users\andy\AppData\Local\Temp\okm8045755092797813934
java.io.FileNotFoundException: C:\Users\andy\AppData\Local\Temp\okm8045755092797813934.txt (Das System kann die angegebene Datei nicht finden)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:120)
at java.io.FileInputStream.<init>(FileInputStream.java:79)
at com.openkm.extractor.Tesseract2TextExtractor.extractText(Tesseract2TextExtractor.java:100)
at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
at org.apache.jackrabbit.core.query.lucene.TextExtractorJob$1.call(TextExtractorJob.java:93)
at EDU.oswego.cs.dl.util.concurrent.FutureResult$1.run(Unknown Source)
at org.apache.jackrabbit.core.query.lucene.TextExtractorJob.run(TextExtractorJob.java:172)
at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Thread.java:662)
2011-12-26 12:04:18,014 INFO [org.apache.jackrabbit.core.query.lucene.MultiIndex] updating index with 1 nodes from indexing queue.
Re: OCR with Tesseract doesn't work
Posted: Tue Dec 27, 2011 8:05 am
by jllort
For some reason it seems the temporary file C:\Users\andy\AppData\Local\Temp\okm8045755092797813934.txt cannot be generated. I don't know what the reason could be. It's really strange that you get the same problem on both OSes, so it seems something is wrong in both configurations. Take a look at repository.xml, workspace.xml and the configuration parameters in administration. Please post a screenshot of the administration parameters where system.ocr is set.
Re: OCR with Tesseract doesn't work
Posted: Tue Dec 27, 2011 11:29 am
by andydld
This error reminds me of another error we had:
http://forum.openkm.com/viewtopic.php?f ... ick#p12559
At that time, an ImageMagick bug was the problem.
Now I see "the same" thing. I mean, the program (Tesseract or OpenKM's call) crashes ("Abnormal program termination") and as a consequence there are no temp files.
A screenshot of the admin tab with "system.ocr" on my Windows machine is attached.
Here's the "repository.xml":
Code:
<?xml version="1.0"?>
<!DOCTYPE Repository PUBLIC "-//The Apache Software Foundation//DTD Jackrabbit 1.6//EN"
"http://jackrabbit.apache.org/dtd/repository-1.6.dtd">
<Repository>
<!-- virtual file system where the repository stores global state
(e.g. registered namespaces, custom node types, etc.) -->
<FileSystem class="org.apache.jackrabbit.core.fs.local.LocalFileSystem">
<param name="path" value="${rep.home}/repository"/>
</FileSystem>
<!-- Security configuration -->
<Security appName="OpenKM">
<!-- Security manager: FQN of class implementing the JackrabbitSecurityManager interface -->
<!--<SecurityManager class="org.apache.jackrabbit.core.DefaultSecurityManager" workspaceName="security">-->
<!-- workspace access: FQN of class implementing the WorkspaceAccessManager interface -->
<!-- <WorkspaceAccessManager class="..."/> -->
<!-- <param name="config" value="${rep.home}/security.xml"/> -->
<!--</SecurityManager>-->
<!-- Access manager: FQN of class implementing the AccessManager interface -->
<AccessManager class="com.openkm.core.OKMAccessManager"/>
<!-- <AccessManager class="org.apache.jackrabbit.core.security.SimpleAccessManager"/> -->
<!-- <AccessManager class="org.apache.jackrabbit.core.security.DefaultAccessManager"> -->
<!-- <param name="config" value="${rep.home}/access.xml"/> -->
<!-- </AccessManager> -->
<!-- <LoginModule class="org.apache.jackrabbit.core.security.simple.SimpleLoginModule"> -->
<!-- <LoginModule class="org.apache.jackrabbit.core.security.authentication.DefaultLoginModule"> -->
<!-- Anonymous user name ('anonymous' is the default value) -->
<!-- <param name="anonymousId" value="anonymous"/> -->
<!-- Administrator user id (default value if param is missing is 'admin') -->
<!-- <param name="adminId" value="admin"/> -->
<!-- <param name="principalProvider" value="..."/> -->
<!--</LoginModule>-->
</Security>
<!-- Location of workspaces root directory and name of default workspace -->
<Workspaces rootPath="${rep.home}/workspaces" defaultWorkspace="default"/>
<!-- Workspace configuration template:
used to create the initial workspace if there's no workspace yet -->
<Workspace name="${wsp.name}">
<!-- Virtual file system of the workspace:
class: FQN of class implementing the FileSystem interface -->
<FileSystem class="org.apache.jackrabbit.core.fs.local.LocalFileSystem">
<param name="path" value="${wsp.home}"/>
</FileSystem>
<!-- Persistence manager of the workspace:
class: FQN of class implementing the PersistenceManager interface -->
<PersistenceManager class="org.apache.jackrabbit.core.persistence.bundle.DerbyPersistenceManager">
<param name="url" value="jdbc:derby:${wsp.home}/db;create=true"/>
<param name="schemaObjectPrefix" value="${wsp.name}_"/>
</PersistenceManager>
<!-- Search index and the file system it uses.
class: FQN of class implementing the QueryHandler interface -->
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
<param name="path" value="${wsp.home}/index"/>
<param name="textFilterClasses" value="
org.apache.jackrabbit.extractor.PlainTextExtractor,
org.apache.jackrabbit.extractor.MsWordTextExtractor,
org.apache.jackrabbit.extractor.MsExcelTextExtractor,
org.apache.jackrabbit.extractor.MsPowerPointTextExtractor,
org.apache.jackrabbit.extractor.OpenOfficeTextExtractor,
org.apache.jackrabbit.extractor.RTFTextExtractor,
org.apache.jackrabbit.extractor.HTMLTextExtractor,
org.apache.jackrabbit.extractor.XMLTextExtractor,
org.apache.jackrabbit.extractor.PngTextExtractor,
org.apache.jackrabbit.extractor.MsOutlookTextExtractor,
com.openkm.extractor.PdfTextExtractor,
com.openkm.extractor.AudioTextExtractor,
com.openkm.extractor.ExifTextExtractor,
com.openkm.extractor.CuneiformTextExtractor,
com.openkm.extractor.SourceCodeTextExtractor,
com.openkm.extractor.MsOffice2007TextExtractor,
com.openkm.extractor.Tesseract2TextExtractor"/>
<param name="extractorPoolSize" value="2"/>
<param name="supportHighlighting" value="false"/>
<param name="indexingConfiguration" value="${wsp.home}/../../../indexing_configuration.xml"/>
</SearchIndex>
</Workspace>
<!-- Configures the versioning -->
<Versioning rootPath="${rep.home}/version">
<!-- Configures the filesystem to use for versioning for the respective
persistence manager -->
<FileSystem class="org.apache.jackrabbit.core.fs.local.LocalFileSystem">
<param name="path" value="${rep.home}/version" />
</FileSystem>
<!-- Configures the persistence manager to be used for persisting version state.
Please note that the current versioning implementation is based on
a 'normal' persistence manager, but this could change in future
implementations. -->
<PersistenceManager class="org.apache.jackrabbit.core.persistence.bundle.DerbyPersistenceManager">
<param name="url" value="jdbc:derby:${rep.home}/version/db;create=true"/>
<param name="schemaObjectPrefix" value="version_"/>
</PersistenceManager>
</Versioning>
<!-- Search index for content that is shared repository wide
(/jcr:system tree, contains mainly versions) -->
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
<param name="path" value="${rep.home}/repository/index"/>
<param name="textFilterClasses" value=""/>
<param name="extractorPoolSize" value="2"/>
<param name="supportHighlighting" value="false"/>
</SearchIndex>
<!-- DataStore improve file handling performance -->
<DataStore class="org.apache.jackrabbit.core.data.FileDataStore">
<param name="path" value="${rep.home}/repository/datastore"/>
<param name="minRecordLength" value="100"/>
</DataStore>
</Repository>
Here's the "workspace.xml":
Code:
<?xml version="1.0" encoding="UTF-8"?>
<Workspace name="default">
<!-- Virtual file system of the workspace:
class: FQN of class implementing the FileSystem interface -->
<FileSystem class="org.apache.jackrabbit.core.fs.local.LocalFileSystem">
<param name="path" value="${wsp.home}"/>
</FileSystem>
<!-- Persistence manager of the workspace:
class: FQN of class implementing the PersistenceManager interface -->
<PersistenceManager class="org.apache.jackrabbit.core.persistence.bundle.DerbyPersistenceManager">
<param name="url" value="jdbc:derby:${wsp.home}/db;create=true"/>
<param name="schemaObjectPrefix" value="${wsp.name}_"/>
</PersistenceManager>
<!-- Search index and the file system it uses.
class: FQN of class implementing the QueryHandler interface -->
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
<param name="path" value="${wsp.home}/index"/>
<param name="textFilterClasses" value="
org.apache.jackrabbit.extractor.PlainTextExtractor,
org.apache.jackrabbit.extractor.MsWordTextExtractor,
org.apache.jackrabbit.extractor.MsExcelTextExtractor,
org.apache.jackrabbit.extractor.MsPowerPointTextExtractor,
org.apache.jackrabbit.extractor.OpenOfficeTextExtractor,
org.apache.jackrabbit.extractor.RTFTextExtractor,
org.apache.jackrabbit.extractor.HTMLTextExtractor,
org.apache.jackrabbit.extractor.XMLTextExtractor,
org.apache.jackrabbit.extractor.PngTextExtractor,
org.apache.jackrabbit.extractor.MsOutlookTextExtractor,
com.openkm.extractor.PdfTextExtractor,
com.openkm.extractor.AudioTextExtractor,
com.openkm.extractor.ExifTextExtractor,
com.openkm.extractor.CuneiformTextExtractor,
com.openkm.extractor.SourceCodeTextExtractor,
com.openkm.extractor.MsOffice2007TextExtractor,
com.openkm.extractor.Tesseract2TextExtractor"/>
<param name="extractorPoolSize" value="2"/>
<param name="supportHighlighting" value="false"/>
<param name="indexingConfiguration" value="${wsp.home}/../../../indexing_configuration.xml"/>
</SearchIndex>
</Workspace>
At the moment everything is configured for Tesseract 2.
Re: OCR with Tesseract doesn't work
Posted: Fri Dec 30, 2011 10:12 am
by pavila
In the configuration page, the paths should not contain spaces. Perhaps this is the problem.
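To illustrate why spaces in a path can matter: if the command string is split on whitespace before it is handed to the process launcher, an executable path such as C:\Program Files (x86)\... breaks apart, while the spaces between the executable and its arguments are still needed as separators. This is a sketch under that assumption only. The thread does not show how com.openkm.util.ExecutionUtils tokenizes the command, and naive_split is a hypothetical helper. Note also that the earlier Windows log contains STDERR output from Tesseract itself, so the program was evidently started despite the space in "Program Files".

```python
# Sketch under an assumption: the command line is split on whitespace.
# naive_split is a hypothetical helper, not part of OpenKM.
from typing import List


def naive_split(command_line: str) -> List[str]:
    """Split a command line on runs of whitespace, with no quoting support."""
    return command_line.split()


spaced_path = r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe in.tif out"
plain_path = r"C:\Tesseract-OCR\2.04\tesseract.exe in.tif out"

# The first token is what a launcher would try to execute.
print(naive_split(spaced_path)[0])   # only the part of the path before the first space
print(naive_split(plain_path)[0])    # the full executable path
print(len(naive_split(plain_path)))  # executable plus two separate arguments
```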
Re: OCR with Tesseract doesn't work
Posted: Fri Dec 30, 2011 12:32 pm
by andydld
Thanks for the answer.
I changed system.ocr on my Windows machine from
Code:
C:\Tesseract-OCR\2.04\tesseract.exe ${fileIn} ${fileOut}
to
Code:
C:\Tesseract-OCR\2.04\tesseract.exe${fileIn}${fileOut}
On Debian I did the same. I changed system.ocr from
Code:
/usr/bin/tesseract ${fileIn} ${fileOut}
to
Code:
/usr/bin/tesseract${fileIn}${fileOut}
Both changes were made in the admin tab.
I tested again with the eurotext.tif and phototest.tif files from the Tesseract Windows package, and OCR still fails. Only the "Abnormal program termination" message is gone.
server.log from Windows with Tesseract 2.04 at the moment:
Code:
2011-12-30 13:13:07,402 WARN [com.openkm.extractor.Tesseract2TextExtractor] IO exception executing command: C:\Tesseract-OCR\2.04\tesseract.exeC:\Users\andy\AppData\Local\Temp\okm4924962631318534852.tifC:\Users\andy\AppData\Local\Temp\okm4341221566327818840 C:\Users\andy\AppData\Local\Temp\okm4924962631318534852.tif C:\Users\andy\AppData\Local\Temp\okm4341221566327818840
java.io.IOException: Cannot run program "C:\Tesseract-OCR\2.04\tesseract.exeC:\Users\andy\AppData\Local\Temp\okm4924962631318534852.tifC:\Users\andy\AppData\Local\Temp\okm4341221566327818840": CreateProcess error=2, Das System kann die angegebene Datei nicht finden
at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
at com.openkm.util.ExecutionUtils.runCmdImpl(ExecutionUtils.java:246)
at com.openkm.util.ExecutionUtils.runCmd(ExecutionUtils.java:225)
at com.openkm.extractor.Tesseract2TextExtractor.extractText(Tesseract2TextExtractor.java:97)
at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
at org.apache.jackrabbit.core.query.lucene.TextExtractorJob$1.call(TextExtractorJob.java:93)
at EDU.oswego.cs.dl.util.concurrent.FutureResult$1.run(Unknown Source)
at org.apache.jackrabbit.core.query.lucene.TextExtractorJob.run(TextExtractorJob.java:172)
at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: CreateProcess error=2, Das System kann die angegebene Datei nicht finden
at java.lang.ProcessImpl.create(Native Method)
at java.lang.ProcessImpl.<init>(ProcessImpl.java:81)
at java.lang.ProcessImpl.start(ProcessImpl.java:30)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
... 10 more
2011-12-30 13:13:07,588 WARN [com.openkm.extractor.Tesseract2TextExtractor] IO exception executing command: C:\Tesseract-OCR\2.04\tesseract.exeC:\Users\andy\AppData\Local\Temp\okm1872047461444103958.tifC:\Users\andy\AppData\Local\Temp\okm5038797940072836687 C:\Users\andy\AppData\Local\Temp\okm1872047461444103958.tif C:\Users\andy\AppData\Local\Temp\okm5038797940072836687
java.io.IOException: Cannot run program "C:\Tesseract-OCR\2.04\tesseract.exeC:\Users\andy\AppData\Local\Temp\okm1872047461444103958.tifC:\Users\andy\AppData\Local\Temp\okm5038797940072836687": CreateProcess error=2, Das System kann die angegebene Datei nicht finden
at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
at com.openkm.util.ExecutionUtils.runCmdImpl(ExecutionUtils.java:246)
at com.openkm.util.ExecutionUtils.runCmd(ExecutionUtils.java:225)
at com.openkm.extractor.Tesseract2TextExtractor.extractText(Tesseract2TextExtractor.java:97)
at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
at org.apache.jackrabbit.core.query.lucene.TextExtractorJob$1.call(TextExtractorJob.java:93)
at EDU.oswego.cs.dl.util.concurrent.FutureResult$1.run(Unknown Source)
at org.apache.jackrabbit.core.query.lucene.TextExtractorJob.run(TextExtractorJob.java:172)
at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: CreateProcess error=2, Das System kann die angegebene Datei nicht finden
at java.lang.ProcessImpl.create(Native Method)
at java.lang.ProcessImpl.<init>(ProcessImpl.java:81)
at java.lang.ProcessImpl.start(ProcessImpl.java:30)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
... 10 more
2011-12-30 13:13:08,281 INFO [org.apache.jackrabbit.core.query.lucene.MultiIndex] updating index with 1 nodes from indexing queue.
server.log from Debian with Tesseract 3.01 at the moment:
Code:
2011-12-30 13:24:25,598 WARN [com.openkm.extractor.Tesseract3TextExtractor] IO exception executing command: /usr/bin/tesseract/tmp/okm3939815078924811309.tif/tmp/okm2344878681557850872 /tmp/okm3939815078924811309.tif /tmp/okm2344878681557850872
java.io.IOException: Cannot run program "/usr/bin/tesseract/tmp/okm3939815078924811309.tif/tmp/okm2344878681557850872": java.io.IOException: error=20, Not a directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:475)
at com.openkm.util.ExecutionUtils.runCmdImpl(ExecutionUtils.java:246)
at com.openkm.util.ExecutionUtils.runCmd(ExecutionUtils.java:225)
at com.openkm.extractor.Tesseract3TextExtractor.extractText(Tesseract3TextExtractor.java:89)
at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
at org.apache.jackrabbit.core.query.lucene.TextExtractorJob$1.call(TextExtractorJob.java:93)
at EDU.oswego.cs.dl.util.concurrent.FutureResult$1.run(Unknown Source)
at org.apache.jackrabbit.core.query.lucene.TextExtractorJob.run(TextExtractorJob.java:172)
at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Thread.java:636)
Caused by: java.io.IOException: java.io.IOException: error=20, Not a directory
at java.lang.UNIXProcess.<init>(UNIXProcess.java:164)
at java.lang.ProcessImpl.start(ProcessImpl.java:81)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:468)
... 10 more
2011-12-30 13:24:25,601 WARN [com.openkm.extractor.Tesseract3TextExtractor] IO exception executing command: /usr/bin/tesseract/tmp/okm6229961688656192035.tif/tmp/okm5378590165912760933 /tmp/okm6229961688656192035.tif /tmp/okm5378590165912760933
java.io.IOException: Cannot run program "/usr/bin/tesseract/tmp/okm6229961688656192035.tif/tmp/okm5378590165912760933": java.io.IOException: error=20, Not a directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:475)
at com.openkm.util.ExecutionUtils.runCmdImpl(ExecutionUtils.java:246)
at com.openkm.util.ExecutionUtils.runCmd(ExecutionUtils.java:225)
at com.openkm.extractor.Tesseract3TextExtractor.extractText(Tesseract3TextExtractor.java:89)
at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
at org.apache.jackrabbit.core.query.lucene.TextExtractorJob$1.call(TextExtractorJob.java:93)
at EDU.oswego.cs.dl.util.concurrent.FutureResult$1.run(Unknown Source)
at org.apache.jackrabbit.core.query.lucene.TextExtractorJob.run(TextExtractorJob.java:172)
at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Thread.java:636)
Caused by: java.io.IOException: java.io.IOException: error=20, Not a directory
at java.lang.UNIXProcess.<init>(UNIXProcess.java:164)
at java.lang.ProcessImpl.start(ProcessImpl.java:81)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:468)
... 10 more
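The two logs above follow from the changed setting: with the spaces removed, the executable path and the two expanded placeholders form one long string, which the JVM reports as the program it cannot run. On Debian the message is "error=20, Not a directory", which is the error a POSIX system returns when a path component that should be a directory is a regular file (here /usr/bin/tesseract followed by /tmp/...). The Python sketch below reproduces that error class with a temporary file. The file names are made up for the illustration, stat_errno is a hypothetical helper, and a POSIX system is assumed.

```python
# Illustration only (POSIX assumed): using a regular file as if it were a
# directory component yields ENOTDIR, the "Not a directory" error.
import errno
import os
import tempfile
from typing import Optional


def stat_errno(path: str) -> Optional[int]:
    """Return the errno raised by os.stat(path), or None if it succeeds."""
    try:
        os.stat(path)
    except OSError as exc:
        return exc.errno
    return None


with tempfile.TemporaryDirectory() as tmp_dir:
    fake_binary = os.path.join(tmp_dir, "tesseract")  # stands in for /usr/bin/tesseract
    with open(fake_binary, "w") as handle:
        handle.write("placeholder")

    glued = fake_binary + "/tmp/in.tif/tmp/out"  # executable and paths with no spaces
    glued_errno = stat_errno(glued)
    plain_errno = stat_errno(fake_binary)

print(glued_errno == errno.ENOTDIR)  # the glued path fails with "Not a directory"
print(plain_errno is None)           # the file on its own is found
```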
Re: OCR with Tesseract doesn't work
Posted: Fri Dec 30, 2011 3:12 pm
by pberden
I have the same problem with Ubuntu server 10.11 64 bit, tesseract 3.01 and OpenKM 5.1.8.
Re: OCR with Tesseract doesn't work
Posted: Fri Dec 30, 2011 4:34 pm
by andydld
Good to know that I'm not alone.
Just a note:
Because of a license change, Java isn't available through the repos anymore.
If you execute the recommended command (found in the wiki), you get OpenJDK on Debian Squeeze and an error on Ubuntu (tested on 10.04 Server LTS AMD64).
I use the original Oracle Java JDK 6u30 on Windows and have OpenJDK on Debian.
I think we can rule out a Java version problem this way.