OCR with Tesseract doesn't work

 #13262  by andydld
 
Hi,

I just installed Debian Squeeze with OpenKM 5.1.8.
I'm not sure about the configuration for Tesseract OCR.

Just a question before I start:

Is OCR only for PDFs, or also for images such as JPEG, GIF, etc.?

I installed tesseract this way:

apt-get install tesseract-ocr tesseract-ocr-deu

And I configured it within OpenKM (on the admin page, not in the openkm.cfg file) this way:

"system.ocr=/usr/bin/tesseract -l deu ${fileIn} ${fileOut}" <- Also tested with "/usr/bin/tesseract -l deu" and "/usr/bin/tesseract".

system.pdf.force.ocr=on <- I think this activates OCR for PDFs; is that right?

OpenOffice, OO-Dictionary, ImageMagick, SWFTools are installed, too.

I uploaded several pictures and PDFs. Searching finds none of the pictures and only some of the PDFs.

Maybe someone can point me in the right direction about what's going on.

Regards,

Andy
Last edited by andydld on Thu Dec 22, 2011 5:20 pm, edited 1 time in total.
 #13264  by andydld
 
Hi again,

I searched and tried a lot over the last few hours.
Now I have Tesseract 3.01 on the system.

With my test.tiff it works in bash (tested with the command lines "tesseract test.tiff test" and "tesseract test.tiff test -l deu").
I got a text file with readable text.
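
The console check above can be scripted so that it is repeatable on both machines. A minimal sketch in Python, assuming `tesseract` is on the PATH and a `test.tiff` exists in the current directory; the helper names `build_command` and `run_ocr` are made up for this illustration and are not part of OpenKM:

```python
import subprocess
from pathlib import Path


def build_command(image, out_base, lang=None, exe="tesseract"):
    """Build the argument list: tesseract <image> <outputbase> [-l <lang>]."""
    cmd = [exe, str(image), str(out_base)]
    if lang:
        cmd += ["-l", lang]
    return cmd


def run_ocr(image, out_base, lang=None, exe="tesseract"):
    """Run tesseract and return the recognised text, or raise on failure."""
    result = subprocess.run(build_command(image, out_base, lang, exe),
                            capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"tesseract exited with {result.returncode}: {result.stderr}")
    # tesseract appends ".txt" to the output base name
    return Path(f"{out_base}.txt").read_text(encoding="utf-8")
```

Calling `run_ocr("test.tiff", "test", lang="deu")` would then mirror the second command line above. Passing the arguments as a list avoids any quoting or whitespace questions at this stage.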

But it does not seem to work within OpenKM.

I tried with the originally installed Tesseract 2.04 (from the Debian Squeeze repo).
I tried with Tesseract 3.01.
I tried with and without com.openkm.extractor.Tesseract2TextExtractor and com.openkm.extractor.Tesseract3TextExtractor.
I tried with system.ocr=/usr/bin/tesseract, system.ocr=/usr/bin/tesseract -l deu, system.ocr=/usr/bin/tesseract ${fileIn} ${fileOut}, system.ocr=/usr/bin/tesseract ${fileIn} ${fileOut} -l deu, system.ocr=/usr/bin/tesseract -l deu ${fileIn} ${fileOut}.

Nothing seems to work.

There are no OCR-related errors in the server.log.

Any ideas?

Regards and good night,

Andy
 #13269  by andydld
 
Hi to all,

Today I tested OpenKM 5.1.8 on Windows 7 Pro x64, with the same result.
OCR with Tesseract does not seem to work.

There's no error in the log. On the console everything is fine.

I added "com.openkm.extractor.Tesseract3TextExtractor" to "registered.text.extractors".
I set "system.ocr" to "C:\Program Files (x86)\Tesseract-OCR\tesseract.exe".

I also tried a trick to see whether Tesseract is called from OpenKM at all.
I replaced the "tesseract.exe" within "system.ocr" with the name of a batch script.
This script accepts whatever parameters it is given and passes them on to tesseract.exe; it also writes the current time to a report file.
It works on the console: a report entry is written. But when I try it with OpenKM, it seems Tesseract is never called.
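
The same trick can be written as a small Python wrapper, which behaves identically on Debian and Windows. This is only a sketch: the report file name `ocr-wrapper.log` and the `REAL_TESSERACT` path are assumptions to adapt, and "system.ocr" would then point at this script instead of the Tesseract binary.

```python
import subprocess
import sys
from datetime import datetime

REAL_TESSERACT = "/usr/bin/tesseract"   # assumption: adjust to the local install
REPORT_FILE = "ocr-wrapper.log"         # hypothetical report file name


def log_call(args, report_file=REPORT_FILE):
    """Append a timestamp and the received arguments to the report file."""
    with open(report_file, "a", encoding="utf-8") as fh:
        fh.write(f"{datetime.now().isoformat()} args={args!r}\n")


def main(argv):
    """Record the call, then hand the arguments to the real tesseract unchanged."""
    log_call(argv)
    return subprocess.call([REAL_TESSERACT, *argv])


# Entry point when the file is used as the wrapper script:
#   sys.exit(main(sys.argv[1:]))
```

Logging the full argument list, not just the time, also shows exactly which parameters arrive and in which order.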

The question is: What is wrong?

Best regards,

Andy
 #13289  by jllort
 
OK,

You should register the extractor in repository.xml and workspace.xml (found in the subdirectories of the repository folder). In the administration tab there are also some extractor properties that must be updated.
I suggest testing Tesseract from a terminal console to make sure there's no problem with it.

Tell us if it solves the problem.
 #13290  by andydld
 
Thanks for the answer.

What exactly do I have to do within these two XML files?

Add "com.openkm.extractor.Tesseract3TextExtractor" to "TextFilterClasses"?

My current "registered.text.extractors" on the admin-tab are:
Code: Select all
org.apache.jackrabbit.extractor.PlainTextExtractor org.apache.jackrabbit.extractor.MsWordTextExtractor org.apache.jackrabbit.extractor.MsExcelTextExtractor org.apache.jackrabbit.extractor.MsPowerPointTextExtractor org.apache.jackrabbit.extractor.OpenOfficeTextExtractor org.apache.jackrabbit.extractor.RTFTextExtractor org.apache.jackrabbit.extractor.HTMLTextExtractor org.apache.jackrabbit.extractor.XMLTextExtractor org.apache.jackrabbit.extractor.PngTextExtractor org.apache.jackrabbit.extractor.MsOutlookTextExtractor com.openkm.extractor.PdfTextExtractor com.openkm.extractor.AudioTextExtractor com.openkm.extractor.ExifTextExtractor com.openkm.extractor.CuneiformTextExtractor com.openkm.extractor.SourceCodeTextExtractor com.openkm.extractor.MsOffice2007TextExtractor com.openkm.extractor.Tesseract3TextExtractor 
What should the "system.ocr" value be?
At the moment mine looks like
Code: Select all
C:\Program Files (x86)\Tesseract-OCR\tesseract.exe ${fileIn} ${fileOut} 
on my Windows machine.

On the console I have no problem with Tesseract on either system (Debian, Windows).

For a short test I added "com.openkm.extractor.Tesseract3TextExtractor" to "TextFilterClasses" in "workspace.xml". Then I restarted OpenKM, logged in, and uploaded one of the two test images included with Tesseract on Windows. I found this in the "server.log":
Code: Select all
2011-12-23 19:05:06,496 WARN  [com.openkm.util.ExecutionUtils] Abnormal program termination: 1
2011-12-23 19:05:06,575 WARN  [com.openkm.util.ExecutionUtils] STDERR: read_params_file: parameter not found: II*

2011-12-23 19:05:06,576 WARN  [com.openkm.extractor.Tesseract3TextExtractor] IO exception executing command: C:\Program Files (x86)\Tesseract-OCR\tesseract.exe C:\Users\andy\AppData\Local\Temp\okm2293481143648870807.tif C:\Users\andy\AppData\Local\Temp\okm8839435122050927398 C:\Users\andy\AppData\Local\Temp\okm2293481143648870807.tif C:\Users\andy\AppData\Local\Temp\okm8839435122050927398
java.io.FileNotFoundException: C:\Users\andy\AppData\Local\Temp\okm8839435122050927398.txt (Das System kann die angegebene Datei nicht finden) <- File not found.
	at java.io.FileInputStream.open(Native Method)
	at java.io.FileInputStream.<init>(FileInputStream.java:120)
	at java.io.FileInputStream.<init>(FileInputStream.java:79)
	at com.openkm.extractor.Tesseract3TextExtractor.extractText(Tesseract3TextExtractor.java:92)
	at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
	at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
	at org.apache.jackrabbit.core.query.lucene.TextExtractorJob$1.call(TextExtractorJob.java:93)
	at EDU.oswego.cs.dl.util.concurrent.FutureResult$1.run(Unknown Source)
	at org.apache.jackrabbit.core.query.lucene.TextExtractorJob.run(TextExtractorJob.java:172)
	at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Thread.java:662)
2011-12-23 19:05:06,780 WARN  [com.openkm.util.ExecutionUtils] Abnormal program termination: 1
2011-12-23 19:05:06,781 WARN  [com.openkm.util.ExecutionUtils] STDERR: read_params_file: parameter not found: II*

2011-12-23 19:05:06,781 WARN  [com.openkm.extractor.Tesseract3TextExtractor] IO exception executing command: C:\Program Files (x86)\Tesseract-OCR\tesseract.exe C:\Users\andy\AppData\Local\Temp\okm2141153793137157575.tif C:\Users\andy\AppData\Local\Temp\okm5031118750275653742 C:\Users\andy\AppData\Local\Temp\okm2141153793137157575.tif C:\Users\andy\AppData\Local\Temp\okm5031118750275653742
java.io.FileNotFoundException: C:\Users\andy\AppData\Local\Temp\okm5031118750275653742.txt (Das System kann die angegebene Datei nicht finden) <- File not found.
	at java.io.FileInputStream.open(Native Method)
	at java.io.FileInputStream.<init>(FileInputStream.java:120)
	at java.io.FileInputStream.<init>(FileInputStream.java:79)
	at com.openkm.extractor.Tesseract3TextExtractor.extractText(Tesseract3TextExtractor.java:92)
	at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
	at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
	at org.apache.jackrabbit.core.query.lucene.TextExtractorJob$1.call(TextExtractorJob.java:93)
	at EDU.oswego.cs.dl.util.concurrent.FutureResult$1.run(Unknown Source)
	at org.apache.jackrabbit.core.query.lucene.TextExtractorJob.run(TextExtractorJob.java:172)
	at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Thread.java:662)
2011-12-23 19:05:07,281 INFO  [org.apache.jackrabbit.core.query.lucene.MultiIndex] updating index with 1 nodes from indexing queue.
 #13291  by andydld
 
I tested with all the "registered.text.extractors" entries also listed in the two XML files, on both systems (Debian, Windows).
On Debian I got the same "file not found / abnormal program termination" errors:
Code: Select all
2011-12-23 19:45:19,928 WARN  [com.openkm.util.ExecutionUtils] Abnormal program termination: 1
2011-12-23 19:45:19,930 WARN  [com.openkm.util.ExecutionUtils] STDERR: read_params_file: parameter not found: II*

2011-12-23 19:45:19,930 WARN  [com.openkm.extractor.Tesseract3TextExtractor] IO exception executing command: /usr/bin/tesseract /tmp/okm7108422049875095523.tif /tmp/okm2543197829169296670 -l deu /tmp/okm7108422049875095523.tif /tmp/okm2543197829169296670
java.io.FileNotFoundException: /tmp/okm2543197829169296670.txt (No such file or directory)
	at java.io.FileInputStream.open(Native Method)
	at java.io.FileInputStream.<init>(FileInputStream.java:137)
	at java.io.FileInputStream.<init>(FileInputStream.java:96)
	at com.openkm.extractor.Tesseract3TextExtractor.extractText(Tesseract3TextExtractor.java:92)
	at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
	at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
	at org.apache.jackrabbit.core.query.lucene.TextExtractorJob$1.call(TextExtractorJob.java:93)
	at EDU.oswego.cs.dl.util.concurrent.FutureResult$1.run(Unknown Source)
	at org.apache.jackrabbit.core.query.lucene.TextExtractorJob.run(TextExtractorJob.java:172)
	at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Thread.java:636)
2011-12-23 19:45:20,042 WARN  [com.openkm.util.ExecutionUtils] Abnormal program termination: 1
2011-12-23 19:45:20,051 WARN  [com.openkm.util.ExecutionUtils] STDERR: read_params_file: parameter not found: II*

2011-12-23 19:45:20,051 WARN  [com.openkm.extractor.Tesseract3TextExtractor] IO exception executing command: /usr/bin/tesseract /tmp/okm110838518289848022.tif /tmp/okm8996897110758953388 -l deu /tmp/okm110838518289848022.tif /tmp/okm8996897110758953388
java.io.FileNotFoundException: /tmp/okm8996897110758953388.txt (No such file or directory)
	at java.io.FileInputStream.open(Native Method)
	at java.io.FileInputStream.<init>(FileInputStream.java:137)
	at java.io.FileInputStream.<init>(FileInputStream.java:96)
	at com.openkm.extractor.Tesseract3TextExtractor.extractText(Tesseract3TextExtractor.java:92)
	at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
	at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
	at org.apache.jackrabbit.core.query.lucene.TextExtractorJob$1.call(TextExtractorJob.java:93)
	at EDU.oswego.cs.dl.util.concurrent.FutureResult$1.run(Unknown Source)
	at org.apache.jackrabbit.core.query.lucene.TextExtractorJob.run(TextExtractorJob.java:172)
	at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Thread.java:636)
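
One possible reading of these log lines: the logged command lists the input and output paths twice, and the tesseract command line accepts config file names after the output base, so the repeated `.tif` path may be getting parsed as a config file. That would fit the STDERR message, because `II*` is exactly how the first bytes of a little-endian TIFF read as text. The Python sketch below only demonstrates that last point; the function name `leading_token` is made up, and the byte string is a hand-written TIFF header, not one of the files from this thread.

```python
def leading_token(data: bytes) -> str:
    """Return the bytes before the first NUL, decoded as ASCII text."""
    return data.split(b"\x00", 1)[0].decode("ascii", errors="replace")


# Little-endian TIFF header: byte order "II", magic number 42 (0x2A == "*"),
# followed by the 4-byte offset of the first image directory.
tiff_header = b"II\x2a\x00\x08\x00\x00\x00"

print(leading_token(tiff_header))  # prints II*
```

If this reading is right, the missing `.txt` file is a consequence, not the cause: Tesseract exits with an error before it writes any output.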
 #13299  by jllort
 
First try without the -l deu parameter: system.ocr=/usr/bin/tesseract ${fileIn} ${fileOut}. Does the error still occur?

If that change solves the error, then try system.ocr=/usr/bin/tesseract ${fileIn} ${fileOut} -l deu
 #13302  by andydld
 
I just ran a test with Tesseract 2.04 on Windows, with the same error:
Code: Select all
2011-12-26 12:04:16,839 WARN  [com.openkm.util.ExecutionUtils] Abnormal program termination: 1
2011-12-26 12:04:16,846 WARN  [com.openkm.util.ExecutionUtils] STDERR: error: Could not find variable 'II*'

2011-12-26 12:04:16,847 WARN  [com.openkm.extractor.Tesseract2TextExtractor] IO exception executing command: C:\Tesseract-OCR\2.04\tesseract.exe C:\Users\andy\AppData\Local\Temp\okm1919319298833890175.tif C:\Users\andy\AppData\Local\Temp\okm5118280720318210091 C:\Users\andy\AppData\Local\Temp\okm1919319298833890175.tif C:\Users\andy\AppData\Local\Temp\okm5118280720318210091
java.io.FileNotFoundException: C:\Users\andy\AppData\Local\Temp\okm5118280720318210091.txt (Das System kann die angegebene Datei nicht finden)
	at java.io.FileInputStream.open(Native Method)
	at java.io.FileInputStream.<init>(FileInputStream.java:120)
	at java.io.FileInputStream.<init>(FileInputStream.java:79)
	at com.openkm.extractor.Tesseract2TextExtractor.extractText(Tesseract2TextExtractor.java:100)
	at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
	at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
	at org.apache.jackrabbit.core.query.lucene.TextExtractorJob$1.call(TextExtractorJob.java:93)
	at EDU.oswego.cs.dl.util.concurrent.FutureResult$1.run(Unknown Source)
	at org.apache.jackrabbit.core.query.lucene.TextExtractorJob.run(TextExtractorJob.java:172)
	at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Thread.java:662)
2011-12-26 12:04:16,916 WARN  [com.openkm.util.ExecutionUtils] Abnormal program termination: 1
2011-12-26 12:04:17,060 WARN  [com.openkm.util.ExecutionUtils] STDERR: error: Could not find variable 'II*'

2011-12-26 12:04:17,062 WARN  [com.openkm.extractor.Tesseract2TextExtractor] IO exception executing command: C:\Tesseract-OCR\2.04\tesseract.exe C:\Users\andy\AppData\Local\Temp\okm5283665428412893901.tif C:\Users\andy\AppData\Local\Temp\okm8045755092797813934 C:\Users\andy\AppData\Local\Temp\okm5283665428412893901.tif C:\Users\andy\AppData\Local\Temp\okm8045755092797813934
java.io.FileNotFoundException: C:\Users\andy\AppData\Local\Temp\okm8045755092797813934.txt (Das System kann die angegebene Datei nicht finden)
	at java.io.FileInputStream.open(Native Method)
	at java.io.FileInputStream.<init>(FileInputStream.java:120)
	at java.io.FileInputStream.<init>(FileInputStream.java:79)
	at com.openkm.extractor.Tesseract2TextExtractor.extractText(Tesseract2TextExtractor.java:100)
	at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
	at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
	at org.apache.jackrabbit.core.query.lucene.TextExtractorJob$1.call(TextExtractorJob.java:93)
	at EDU.oswego.cs.dl.util.concurrent.FutureResult$1.run(Unknown Source)
	at org.apache.jackrabbit.core.query.lucene.TextExtractorJob.run(TextExtractorJob.java:172)
	at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Thread.java:662)
2011-12-26 12:04:18,014 INFO  [org.apache.jackrabbit.core.query.lucene.MultiIndex] updating index with 1 nodes from indexing queue.
 #13310  by jllort
 
For some reason it seems the temporary file C:\Users\andy\AppData\Local\Temp\okm8045755092797813934.txt cannot be generated. I don't know what the reason could be; it's really strange that you get the same problem on both operating systems, so it seems something is wrong in both configurations. Take a look at repository.xml, workspace.xml, and the configuration parameters in administration. Please post a screenshot of the administration parameters where system.ocr is set.
 #13312  by andydld
 
This error reminds me of another error we had:

http://forum.openkm.com/viewtopic.php?f ... ick#p12559

At that time, an ImageMagick bug was the problem.

Now I see "the same" again. I mean, the program (Tesseract, or OpenKM's call to it) crashes ("Abnormal program termination"), and as a consequence no temp files are created.

A screenshot of the admin tab with "system.ocr" from my Windows machine is attached.

Here's the "repository.xml":
Code: Select all
<?xml version="1.0"?>
<!DOCTYPE Repository PUBLIC "-//The Apache Software Foundation//DTD Jackrabbit 1.6//EN"
                            "http://jackrabbit.apache.org/dtd/repository-1.6.dtd">
<Repository>
    <!-- virtual file system where the repository stores global state
        (e.g. registered namespaces, custom node types, etc.) -->
    <FileSystem class="org.apache.jackrabbit.core.fs.local.LocalFileSystem">
        <param name="path" value="${rep.home}/repository"/>
    </FileSystem>

    <!-- Security configuration -->
    <Security appName="OpenKM">
        <!-- Security manager: FQN of class implementing the JackrabbitSecurityManager interface -->
        <!--<SecurityManager class="org.apache.jackrabbit.core.DefaultSecurityManager" workspaceName="security">-->
            <!-- workspace access: FQN of class implementing the WorkspaceAccessManager interface -->
            <!-- <WorkspaceAccessManager class="..."/> -->
            <!-- <param name="config" value="${rep.home}/security.xml"/> -->
        <!--</SecurityManager>-->

        <!-- Access manager: FQN of class implementing the AccessManager interface -->
        <AccessManager class="com.openkm.core.OKMAccessManager"/>
        <!-- <AccessManager class="org.apache.jackrabbit.core.security.SimpleAccessManager"/> -->
        <!-- <AccessManager class="org.apache.jackrabbit.core.security.DefaultAccessManager"> -->
            <!-- <param name="config" value="${rep.home}/access.xml"/> -->
        <!-- </AccessManager> -->

        <!-- <LoginModule class="org.apache.jackrabbit.core.security.simple.SimpleLoginModule"> -->
        <!-- <LoginModule class="org.apache.jackrabbit.core.security.authentication.DefaultLoginModule"> -->
           <!-- Anonymous user name ('anonymous' is the default value) -->
           <!-- <param name="anonymousId" value="anonymous"/> -->
           <!-- Administrator user id (default value if param is missing is 'admin') -->
           <!-- <param name="adminId" value="admin"/> -->
           <!-- <param name="principalProvider" value="..."/> -->
        <!--</LoginModule>-->
    </Security>

    <!-- Location of workspaces root directory and name of default workspace -->
    <Workspaces rootPath="${rep.home}/workspaces" defaultWorkspace="default"/>

    <!-- Workspace configuration template:
         used to create the initial workspace if there's no workspace yet -->
    <Workspace name="${wsp.name}">
        <!-- Virtual file system of the workspace:
             class: FQN of class implementing the FileSystem interface -->
        <FileSystem class="org.apache.jackrabbit.core.fs.local.LocalFileSystem">
            <param name="path" value="${wsp.home}"/>
        </FileSystem>

        <!-- Persistence manager of the workspace:
             class: FQN of class implementing the PersistenceManager interface -->
        <PersistenceManager class="org.apache.jackrabbit.core.persistence.bundle.DerbyPersistenceManager">
          <param name="url" value="jdbc:derby:${wsp.home}/db;create=true"/>
          <param name="schemaObjectPrefix" value="${wsp.name}_"/>
        </PersistenceManager>

        <!-- Search index and the file system it uses.
             class: FQN of class implementing the QueryHandler interface -->
        <SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
            <param name="path" value="${wsp.home}/index"/>
            <param name="textFilterClasses" value="
			org.apache.jackrabbit.extractor.PlainTextExtractor,
			org.apache.jackrabbit.extractor.MsWordTextExtractor,
			org.apache.jackrabbit.extractor.MsExcelTextExtractor,
			org.apache.jackrabbit.extractor.MsPowerPointTextExtractor,
			org.apache.jackrabbit.extractor.OpenOfficeTextExtractor,
			org.apache.jackrabbit.extractor.RTFTextExtractor,
			org.apache.jackrabbit.extractor.HTMLTextExtractor,
			org.apache.jackrabbit.extractor.XMLTextExtractor,
			org.apache.jackrabbit.extractor.PngTextExtractor,
			org.apache.jackrabbit.extractor.MsOutlookTextExtractor,
			com.openkm.extractor.PdfTextExtractor,
			com.openkm.extractor.AudioTextExtractor,
			com.openkm.extractor.ExifTextExtractor,
			com.openkm.extractor.CuneiformTextExtractor,
			com.openkm.extractor.SourceCodeTextExtractor,
			com.openkm.extractor.MsOffice2007TextExtractor,
			com.openkm.extractor.Tesseract2TextExtractor"/>
            <param name="extractorPoolSize" value="2"/>
            <param name="supportHighlighting" value="false"/>
            <param name="indexingConfiguration" value="${wsp.home}/../../../indexing_configuration.xml"/>
        </SearchIndex>
    </Workspace>

    <!-- Configures the versioning -->
    <Versioning rootPath="${rep.home}/version">
        <!-- Configures the filesystem to use for versioning for the respective
             persistence manager -->
        <FileSystem class="org.apache.jackrabbit.core.fs.local.LocalFileSystem">
            <param name="path" value="${rep.home}/version" />
        </FileSystem>

        <!-- Configures the persistence manager to be used for persisting version state.
             Please note that the current versioning implementation is based on
             a 'normal' persistence manager, but this could change in future
             implementations. -->
        <PersistenceManager class="org.apache.jackrabbit.core.persistence.bundle.DerbyPersistenceManager">
          <param name="url" value="jdbc:derby:${rep.home}/version/db;create=true"/>
          <param name="schemaObjectPrefix" value="version_"/>
        </PersistenceManager>
    </Versioning>

    <!-- Search index for content that is shared repository wide
         (/jcr:system tree, contains mainly versions) -->
    <SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
        <param name="path" value="${rep.home}/repository/index"/>
        <param name="textFilterClasses" value=""/>
        <param name="extractorPoolSize" value="2"/>
        <param name="supportHighlighting" value="false"/>
    </SearchIndex>

    <!-- DataStore improve file handling performance -->
    <DataStore class="org.apache.jackrabbit.core.data.FileDataStore">
        <param name="path" value="${rep.home}/repository/datastore"/>
        <param name="minRecordLength" value="100"/>
    </DataStore>
</Repository>
Here's the "workspace.xml":
Code: Select all
<?xml version="1.0" encoding="UTF-8"?>
<Workspace name="default">
        <!-- Virtual file system of the workspace:
             class: FQN of class implementing the FileSystem interface -->
        <FileSystem class="org.apache.jackrabbit.core.fs.local.LocalFileSystem">
            <param name="path" value="${wsp.home}"/>
        </FileSystem>

        <!-- Persistence manager of the workspace:
             class: FQN of class implementing the PersistenceManager interface -->
        <PersistenceManager class="org.apache.jackrabbit.core.persistence.bundle.DerbyPersistenceManager">
          <param name="url" value="jdbc:derby:${wsp.home}/db;create=true"/>
          <param name="schemaObjectPrefix" value="${wsp.name}_"/>
        </PersistenceManager>

        <!-- Search index and the file system it uses.
             class: FQN of class implementing the QueryHandler interface -->
        <SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
            <param name="path" value="${wsp.home}/index"/>
            <param name="textFilterClasses" value="
			org.apache.jackrabbit.extractor.PlainTextExtractor,
			org.apache.jackrabbit.extractor.MsWordTextExtractor,
			org.apache.jackrabbit.extractor.MsExcelTextExtractor,
			org.apache.jackrabbit.extractor.MsPowerPointTextExtractor,
			org.apache.jackrabbit.extractor.OpenOfficeTextExtractor,
			org.apache.jackrabbit.extractor.RTFTextExtractor,
			org.apache.jackrabbit.extractor.HTMLTextExtractor,
			org.apache.jackrabbit.extractor.XMLTextExtractor,
			org.apache.jackrabbit.extractor.PngTextExtractor,
			org.apache.jackrabbit.extractor.MsOutlookTextExtractor,
			com.openkm.extractor.PdfTextExtractor,
			com.openkm.extractor.AudioTextExtractor,
			com.openkm.extractor.ExifTextExtractor,
			com.openkm.extractor.CuneiformTextExtractor,
			com.openkm.extractor.SourceCodeTextExtractor,
			com.openkm.extractor.MsOffice2007TextExtractor,
			com.openkm.extractor.Tesseract2TextExtractor"/>
            <param name="extractorPoolSize" value="2"/>
            <param name="supportHighlighting" value="false"/>
            <param name="indexingConfiguration" value="${wsp.home}/../../../indexing_configuration.xml"/>
        </SearchIndex>
    </Workspace>
At the moment everything is configured for Tesseract 2.
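
Because the extractor class has to match the installed Tesseract version in several places, a quick consistency check can help. A sketch in Python, assuming the `textFilterClasses` layout shown above; the function names are made up, and the embedded XML is a shortened stand-in for the real workspace.xml.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for workspace.xml (assumption: same param layout).
SAMPLE = """<Workspace name="default">
  <SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
    <param name="textFilterClasses" value="
      com.openkm.extractor.PdfTextExtractor,
      com.openkm.extractor.Tesseract2TextExtractor"/>
  </SearchIndex>
</Workspace>"""


def text_filter_classes(xml_text):
    """Return the extractor class names from every textFilterClasses param."""
    root = ET.fromstring(xml_text)
    classes = []
    for param in root.iter("param"):
        if param.get("name") == "textFilterClasses":
            value = param.get("value", "")
            classes += [c.strip() for c in value.split(",") if c.strip()]
    return classes


def tesseract_extractors(xml_text):
    """Return only the registered Tesseract extractor classes."""
    return [c for c in text_filter_classes(xml_text) if "Tesseract" in c]


print(tesseract_extractors(SAMPLE))
```

Running the same check against repository.xml and comparing both results with "registered.text.extractors" on the admin tab would show whether a Tesseract 2 extractor is still registered somewhere while Tesseract 3 is installed.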
Attachments
OpenKM-Admin-Tab with system.ocr: opkm-admintab.PNG (47.32 KiB)
 #13344  by pavila
 
In the configuration page, the paths should not contain spaces. Perhaps this is the problem.
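
A guess at why spaces matter: if the configured command line is split on whitespace before it is run (an assumption about OpenKM's behaviour, not something confirmed in this thread), then a space inside the executable path cuts the program name apart, while removing the spaces between the arguments glues everything into one program name. The Python sketch below imitates that with `string.Template`, whose `${name}` syntax happens to match the placeholders; `expand` is a made-up helper.

```python
from string import Template


def expand(template, file_in, file_out):
    """Substitute the placeholders, then split the result on whitespace."""
    command = Template(template).substitute(fileIn=file_in, fileOut=file_out)
    return command.split()


# Path without spaces, arguments separated by spaces: three clean tokens.
print(expand("/usr/bin/tesseract ${fileIn} ${fileOut}", "/tmp/in.tif", "/tmp/out"))

# Executable path containing spaces: the program name is cut at "C:\Program".
print(expand(r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe ${fileIn} ${fileOut}",
             "in.tif", "out")[0])

# No separating spaces: everything collapses into a single token.
print(len(expand("/usr/bin/tesseract${fileIn}${fileOut}", "/tmp/in.tif", "/tmp/out")))
```

Under that assumption, the advice is about the path to the executable, so the spaces between the executable and the `${fileIn}` and `${fileOut}` placeholders need to stay.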
 #13345  by andydld
 
Thanks for the answer.

I changed system.ocr on my Windows machine from
Code: Select all
C:\Tesseract-OCR\2.04\tesseract.exe ${fileIn} ${fileOut}
to
Code: Select all
C:\Tesseract-OCR\2.04\tesseract.exe${fileIn}${fileOut}
On Debian I did the same. I changed system.ocr from
Code: Select all
/usr/bin/tesseract ${fileIn} ${fileOut}
to
Code: Select all
/usr/bin/tesseract${fileIn}${fileOut}
Both changes were made within the admin tab.

I tested again with the eurotext.tif and phototest.tif files from the Tesseract Windows package, with the same result: OCR still fails. Only the "Abnormal program termination" message is gone.

Current server.log from Windows with Tesseract 2.04:
Code: Select all
2011-12-30 13:13:07,402 WARN  [com.openkm.extractor.Tesseract2TextExtractor] IO exception executing command: C:\Tesseract-OCR\2.04\tesseract.exeC:\Users\andy\AppData\Local\Temp\okm4924962631318534852.tifC:\Users\andy\AppData\Local\Temp\okm4341221566327818840 C:\Users\andy\AppData\Local\Temp\okm4924962631318534852.tif C:\Users\andy\AppData\Local\Temp\okm4341221566327818840
java.io.IOException: Cannot run program "C:\Tesseract-OCR\2.04\tesseract.exeC:\Users\andy\AppData\Local\Temp\okm4924962631318534852.tifC:\Users\andy\AppData\Local\Temp\okm4341221566327818840": CreateProcess error=2, Das System kann die angegebene Datei nicht finden
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
	at com.openkm.util.ExecutionUtils.runCmdImpl(ExecutionUtils.java:246)
	at com.openkm.util.ExecutionUtils.runCmd(ExecutionUtils.java:225)
	at com.openkm.extractor.Tesseract2TextExtractor.extractText(Tesseract2TextExtractor.java:97)
	at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
	at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
	at org.apache.jackrabbit.core.query.lucene.TextExtractorJob$1.call(TextExtractorJob.java:93)
	at EDU.oswego.cs.dl.util.concurrent.FutureResult$1.run(Unknown Source)
	at org.apache.jackrabbit.core.query.lucene.TextExtractorJob.run(TextExtractorJob.java:172)
	at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: CreateProcess error=2, Das System kann die angegebene Datei nicht finden
	at java.lang.ProcessImpl.create(Native Method)
	at java.lang.ProcessImpl.<init>(ProcessImpl.java:81)
	at java.lang.ProcessImpl.start(ProcessImpl.java:30)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
	... 10 more
2011-12-30 13:13:07,588 WARN  [com.openkm.extractor.Tesseract2TextExtractor] IO exception executing command: C:\Tesseract-OCR\2.04\tesseract.exeC:\Users\andy\AppData\Local\Temp\okm1872047461444103958.tifC:\Users\andy\AppData\Local\Temp\okm5038797940072836687 C:\Users\andy\AppData\Local\Temp\okm1872047461444103958.tif C:\Users\andy\AppData\Local\Temp\okm5038797940072836687
java.io.IOException: Cannot run program "C:\Tesseract-OCR\2.04\tesseract.exeC:\Users\andy\AppData\Local\Temp\okm1872047461444103958.tifC:\Users\andy\AppData\Local\Temp\okm5038797940072836687": CreateProcess error=2, Das System kann die angegebene Datei nicht finden
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
	at com.openkm.util.ExecutionUtils.runCmdImpl(ExecutionUtils.java:246)
	at com.openkm.util.ExecutionUtils.runCmd(ExecutionUtils.java:225)
	at com.openkm.extractor.Tesseract2TextExtractor.extractText(Tesseract2TextExtractor.java:97)
	at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
	at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
	at org.apache.jackrabbit.core.query.lucene.TextExtractorJob$1.call(TextExtractorJob.java:93)
	at EDU.oswego.cs.dl.util.concurrent.FutureResult$1.run(Unknown Source)
	at org.apache.jackrabbit.core.query.lucene.TextExtractorJob.run(TextExtractorJob.java:172)
	at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: CreateProcess error=2, Das System kann die angegebene Datei nicht finden
	at java.lang.ProcessImpl.create(Native Method)
	at java.lang.ProcessImpl.<init>(ProcessImpl.java:81)
	at java.lang.ProcessImpl.start(ProcessImpl.java:30)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
	... 10 more
2011-12-30 13:13:08,281 INFO  [org.apache.jackrabbit.core.query.lucene.MultiIndex] updating index with 1 nodes from indexing queue.
Current server.log from Debian with Tesseract 3.01:
Code: Select all
2011-12-30 13:24:25,598 WARN  [com.openkm.extractor.Tesseract3TextExtractor] IO exception executing command: /usr/bin/tesseract/tmp/okm3939815078924811309.tif/tmp/okm2344878681557850872 /tmp/okm3939815078924811309.tif /tmp/okm2344878681557850872
java.io.IOException: Cannot run program "/usr/bin/tesseract/tmp/okm3939815078924811309.tif/tmp/okm2344878681557850872": java.io.IOException: error=20, Not a directory
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:475)
	at com.openkm.util.ExecutionUtils.runCmdImpl(ExecutionUtils.java:246)
	at com.openkm.util.ExecutionUtils.runCmd(ExecutionUtils.java:225)
	at com.openkm.extractor.Tesseract3TextExtractor.extractText(Tesseract3TextExtractor.java:89)
	at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
	at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
	at org.apache.jackrabbit.core.query.lucene.TextExtractorJob$1.call(TextExtractorJob.java:93)
	at EDU.oswego.cs.dl.util.concurrent.FutureResult$1.run(Unknown Source)
	at org.apache.jackrabbit.core.query.lucene.TextExtractorJob.run(TextExtractorJob.java:172)
	at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Thread.java:636)
Caused by: java.io.IOException: java.io.IOException: error=20, Not a directory
	at java.lang.UNIXProcess.<init>(UNIXProcess.java:164)
	at java.lang.ProcessImpl.start(ProcessImpl.java:81)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:468)
	... 10 more
2011-12-30 13:24:25,601 WARN  [com.openkm.extractor.Tesseract3TextExtractor] IO exception executing command: /usr/bin/tesseract/tmp/okm6229961688656192035.tif/tmp/okm5378590165912760933 /tmp/okm6229961688656192035.tif /tmp/okm5378590165912760933
java.io.IOException: Cannot run program "/usr/bin/tesseract/tmp/okm6229961688656192035.tif/tmp/okm5378590165912760933": java.io.IOException: error=20, Not a directory
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:475)
	at com.openkm.util.ExecutionUtils.runCmdImpl(ExecutionUtils.java:246)
	at com.openkm.util.ExecutionUtils.runCmd(ExecutionUtils.java:225)
	at com.openkm.extractor.Tesseract3TextExtractor.extractText(Tesseract3TextExtractor.java:89)
	at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
	at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
	at org.apache.jackrabbit.core.query.lucene.TextExtractorJob$1.call(TextExtractorJob.java:93)
	at EDU.oswego.cs.dl.util.concurrent.FutureResult$1.run(Unknown Source)
	at org.apache.jackrabbit.core.query.lucene.TextExtractorJob.run(TextExtractorJob.java:172)
	at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Thread.java:636)
Caused by: java.io.IOException: java.io.IOException: error=20, Not a directory
	at java.lang.UNIXProcess.<init>(UNIXProcess.java:164)
	at java.lang.ProcessImpl.start(ProcessImpl.java:81)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:468)
	... 10 more
 #13350  by andydld
 
Good to know that I'm not alone.

Just a note:

Because of a license change, Java isn't available through the repos anymore.

If you execute the recommended command (found in the wiki), you get OpenJDK on Debian Squeeze and an error on Ubuntu (tested on 10.04 Server LTS AMD64).

I use the original Oracle Java JDK 6u30 on Windows and OpenJDK on Debian.

I think we can rule out a Java version problem this way.
