Open Source Document Management System | OpenKM - OCR with Tesseract doesn't work

OCR with Tesseract doesn't work

#13262 by andydld
Wed Dec 21, 2011 5:13 pm

Hi,

I just installed Debian Squeeze with OpenKM 5.1.8.
I'm not sure about the configuration for the Tesseract OCR.

Just a question before I start:

Is OCR only for PDFs, or for images like JPEG, GIF etc., too?

I installed tesseract this way:

apt-get install tesseract-ocr tesseract-ocr-deu

And configured it within OpenKM (on the admin page, not in the openkm.cfg file) this way:

"system.ocr=/usr/bin/tesseract -l deu ${fileIn} ${fileOut}" <- Also tested with "/usr/bin/tesseract -l deu" and "/usr/bin/tesseract".

system.pdf.force.ocr=on <- I think this activates OCR for PDFs, is this right?

OpenOffice, the OO dictionary, ImageMagick and SWFTools are installed, too.

I uploaded several pictures and PDFs. Searching finds none of the pictures and only some of the PDFs.

Maybe someone can point me in the right direction as to what's going on.

Regards,

Andy

Last edited by andydld on Thu Dec 22, 2011 5:20 pm, edited 1 time in total.

Username

andydld

Rank

Fresh Boarder

Posts

17

Joined

Thu Oct 13, 2011 7:29 am

Re: OCR with Tesseract, some questions and maybe a problem

#13264 by andydld
Wed Dec 21, 2011 9:43 pm

Hi again,

I searched and tried a lot over the last hours.
Now I have Tesseract 3.01 on the system.

With my test.tiff it works in bash (tested with the command lines "tesseract test.tiff test" and "tesseract test.tiff test -l deu").
I got a text file with readable text.
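
The console check above can also be scripted. The following is a minimal Python sketch, assuming `tesseract` is on the PATH; the helper names `build_tesseract_cmd` and `run_ocr` are made up for illustration and are not part of OpenKM. It relies on Tesseract appending `.txt` to the output base name, which is what produced the text file mentioned above.

```python
import subprocess
from pathlib import Path


def build_tesseract_cmd(image, out_base, lang=None, binary="tesseract"):
    """Build the argument list: tesseract <image> <out_base> [-l <lang>]."""
    cmd = [binary, str(image), str(out_base)]
    if lang:
        cmd += ["-l", lang]
    return cmd


def run_ocr(image, out_base, lang=None):
    """Run tesseract and return the recognized text (needs tesseract installed)."""
    subprocess.run(build_tesseract_cmd(image, out_base, lang), check=True)
    # Tesseract appends ".txt" to the output base name.
    return Path(f"{out_base}.txt").read_text(encoding="utf-8")


# Show the command that would be run for the German test from the post.
print(build_tesseract_cmd("test.tiff", "test", lang="deu"))
```

Passing the command as a list avoids any shell quoting issues, so the same helper should behave the same on Debian and Windows.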

But it doesn't seem to work within OpenKM.

I tried with the originally installed Tesseract 2.04 (from the Debian Squeeze repo).
Tried with tesseract 3.01.
Tried with and w/o com.openkm.extractor.Tesseract2TextExtractor and com.openkm.extractor.Tesseract3TextExtractor.
Tried with system.ocr=/usr/bin/tesseract, system.ocr=/usr/bin/tesseract -l deu, system.ocr=/usr/bin/tesseract ${fileIn} ${fileOut}, system.ocr=/usr/bin/tesseract ${fileIn} ${fileOut} -l deu, system.ocr=/usr/bin/tesseract -l deu ${fileIn} ${fileOut}.

Nothing seems to work.

No OCR-related error in the server.log.

Any ideas?

Regards and good night,

Andy

Re: OCR with Tesseract doesn't work

#13269 by andydld
Thu Dec 22, 2011 5:29 pm

Hi to all,

Today I tested OpenKM 5.1.8 on Windows 7 Pro x64 with the same result.
OCR with Tesseract doesn't seem to work.

There's no error in the log. On the console everything is fine.

I added "com.openkm.extractor.Tesseract3TextExtractor" to "registered.text.extractors".
Set "system.ocr" to "C:\Program Files (x86)\Tesseract-OCR\tesseract.exe".

I also tried a trick to see whether Tesseract is called from OpenKM.
I replaced the "tesseract.exe" within "system.ocr" with the name of a batch script.
This script accepts any parameters blindly and passes them on to tesseract.exe; it also writes the current time to a report file.
It works on the console, and a report entry is written. But when I try it with OpenKM, it seems Tesseract isn't called.
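
For anyone who wants to repeat the wrapper trick, here is a rough Python equivalent of such a script. It is only a sketch: `REAL_TESSERACT` and `REPORT_FILE` are placeholder values you would have to adjust, and the function names are made up for illustration.

```python
import subprocess
from datetime import datetime

REAL_TESSERACT = "/usr/bin/tesseract"   # assumption: path of the real binary
REPORT_FILE = "tesseract-calls.log"     # hypothetical report file name


def format_report_line(args, now=None):
    """One report line: timestamp followed by the arguments received."""
    now = now or datetime.now()
    return f"{now.isoformat(timespec='seconds')} args={args!r}"


def main(argv):
    """Log the call, then forward all arguments to the real tesseract."""
    # Write the report entry first, so it exists even if tesseract fails.
    with open(REPORT_FILE, "a", encoding="utf-8") as report:
        report.write(format_report_line(argv) + "\n")
    # Pass the arguments through unchanged and return tesseract's exit code.
    return subprocess.call([REAL_TESSERACT, *argv])
```

To use it as the wrapper, add `import sys` and `sys.exit(main(sys.argv[1:]))` at the bottom of the file and point "system.ocr" at the script. If no line shows up in the report file after an upload, the extractor was probably never invoked.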

The question is: What is wrong?

Best regards,

Andy

Re: OCR with Tesseract doesn't work

#13289 by jllort
Fri Dec 23, 2011 3:09 pm

OK,

You should register the extractor in repository.xml and workspace.xml (in the subdirectories of the repository folder). In the administration tab there are also some extractor properties that must be updated.
I suggest testing Tesseract from a terminal to make sure there's no problem with it.

Tell us if it solves the problem.

Username

jllort

Rank

Moderator

Posts

12187

Joined

Fri Dec 21, 2007 11:23 am

Location

Sineu - ( Illes Balears ) - Spain

Contact

Re: OCR with Tesseract doesn't work

#13290 by andydld
Fri Dec 23, 2011 6:16 pm

Thanks for the answer.

What exactly do I have to do in these two XML files?

Add "com.openkm.extractor.Tesseract3TextExtractor" to "TextFilterClasses"?

My current "registered.text.extractors" on the admin-tab are:

Code: Select all

org.apache.jackrabbit.extractor.PlainTextExtractor
org.apache.jackrabbit.extractor.MsWordTextExtractor
org.apache.jackrabbit.extractor.MsExcelTextExtractor
org.apache.jackrabbit.extractor.MsPowerPointTextExtractor
org.apache.jackrabbit.extractor.OpenOfficeTextExtractor
org.apache.jackrabbit.extractor.RTFTextExtractor
org.apache.jackrabbit.extractor.HTMLTextExtractor
org.apache.jackrabbit.extractor.XMLTextExtractor
org.apache.jackrabbit.extractor.PngTextExtractor
org.apache.jackrabbit.extractor.MsOutlookTextExtractor
com.openkm.extractor.PdfTextExtractor
com.openkm.extractor.AudioTextExtractor
com.openkm.extractor.ExifTextExtractor
com.openkm.extractor.CuneiformTextExtractor
com.openkm.extractor.SourceCodeTextExtractor
com.openkm.extractor.MsOffice2007TextExtractor
com.openkm.extractor.Tesseract3TextExtractor

What should the "system.ocr" value look like?
At the moment mine looks like

Code: Select all

C:\Program Files (x86)\Tesseract-OCR\tesseract.exe ${fileIn} ${fileOut}

on my Windows machine.

On the console I have no problem with Tesseract on either system (Debian, Windows).

For a short test I added "com.openkm.extractor.Tesseract3TextExtractor" to "TextFilterClasses" in "workspace.xml". Then I restarted OpenKM, logged in and uploaded one of the two test images included with Tesseract on Windows. I found this in the "server.log":

Code: Select all

2011-12-23 19:05:06,496 WARN  [com.openkm.util.ExecutionUtils] Abnormal program termination: 1
2011-12-23 19:05:06,575 WARN  [com.openkm.util.ExecutionUtils] STDERR: read_params_file: parameter not found: II*

2011-12-23 19:05:06,576 WARN  [com.openkm.extractor.Tesseract3TextExtractor] IO exception executing command: C:\Program Files (x86)\Tesseract-OCR\tesseract.exe C:\Users\andy\AppData\Local\Temp\okm2293481143648870807.tif C:\Users\andy\AppData\Local\Temp\okm8839435122050927398 C:\Users\andy\AppData\Local\Temp\okm2293481143648870807.tif C:\Users\andy\AppData\Local\Temp\okm8839435122050927398
java.io.FileNotFoundException: C:\Users\andy\AppData\Local\Temp\okm8839435122050927398.txt (Das System kann die angegebene Datei nicht finden) <- File not found.
	at java.io.FileInputStream.open(Native Method)
	at java.io.FileInputStream.<init>(FileInputStream.java:120)
	at java.io.FileInputStream.<init>(FileInputStream.java:79)
	at com.openkm.extractor.Tesseract3TextExtractor.extractText(Tesseract3TextExtractor.java:92)
	at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
	at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
	at org.apache.jackrabbit.core.query.lucene.TextExtractorJob$1.call(TextExtractorJob.java:93)
	at EDU.oswego.cs.dl.util.concurrent.FutureResult$1.run(Unknown Source)
	at org.apache.jackrabbit.core.query.lucene.TextExtractorJob.run(TextExtractorJob.java:172)
	at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Thread.java:662)
2011-12-23 19:05:06,780 WARN  [com.openkm.util.ExecutionUtils] Abnormal program termination: 1
2011-12-23 19:05:06,781 WARN  [com.openkm.util.ExecutionUtils] STDERR: read_params_file: parameter not found: II*

2011-12-23 19:05:06,781 WARN  [com.openkm.extractor.Tesseract3TextExtractor] IO exception executing command: C:\Program Files (x86)\Tesseract-OCR\tesseract.exe C:\Users\andy\AppData\Local\Temp\okm2141153793137157575.tif C:\Users\andy\AppData\Local\Temp\okm5031118750275653742 C:\Users\andy\AppData\Local\Temp\okm2141153793137157575.tif C:\Users\andy\AppData\Local\Temp\okm5031118750275653742
java.io.FileNotFoundException: C:\Users\andy\AppData\Local\Temp\okm5031118750275653742.txt (Das System kann die angegebene Datei nicht finden) <- File not found.
	at java.io.FileInputStream.open(Native Method)
	at java.io.FileInputStream.<init>(FileInputStream.java:120)
	at java.io.FileInputStream.<init>(FileInputStream.java:79)
	at com.openkm.extractor.Tesseract3TextExtractor.extractText(Tesseract3TextExtractor.java:92)
	at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
	at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
	at org.apache.jackrabbit.core.query.lucene.TextExtractorJob$1.call(TextExtractorJob.java:93)
	at EDU.oswego.cs.dl.util.concurrent.FutureResult$1.run(Unknown Source)
	at org.apache.jackrabbit.core.query.lucene.TextExtractorJob.run(TextExtractorJob.java:172)
	at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Thread.java:662)
2011-12-23 19:05:07,281 INFO  [org.apache.jackrabbit.core.query.lucene.MultiIndex] updating index with 1 nodes from indexing queue.
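
A possible reading of the `II*` in the STDERR lines, offered as a guess and not a confirmed diagnosis: `II*` is what the first three bytes of a little-endian TIFF file look like as text. Tesseract's usage allows config file names after the output base, and the logged command lists the .tif path a second time, so Tesseract may be trying to read the image as a parameter file. The small Python sketch below only shows where `II*` comes from; `tiff_magic` is a made-up helper name.

```python
import struct


def tiff_magic(little_endian=True):
    """First four bytes of a TIFF file: byte-order mark plus the number 42."""
    if little_endian:
        return struct.pack("<2sH", b"II", 42)
    return struct.pack(">2sH", b"MM", 42)


header = tiff_magic()
print(header)                      # b'II*\x00'
print(header[:3].decode("ascii"))  # II*
```

The number 42 is byte 0x2A, which is the ASCII character `*`, so a text-oriented reader that opens the .tif sees `II*` as its first token.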

Re: OCR with Tesseract doesn't work

#13291 by andydld
Fri Dec 23, 2011 6:48 pm

I tested with all "registered.text.extractors" in the two XML files on both systems (Debian, Windows).
On Debian I get the same "file not found/abnormal program termination" errors:

Code: Select all

2011-12-23 19:45:19,928 WARN  [com.openkm.util.ExecutionUtils] Abnormal program termination: 1
2011-12-23 19:45:19,930 WARN  [com.openkm.util.ExecutionUtils] STDERR: read_params_file: parameter not found: II*

2011-12-23 19:45:19,930 WARN  [com.openkm.extractor.Tesseract3TextExtractor] IO exception executing command: /usr/bin/tesseract /tmp/okm7108422049875095523.tif /tmp/okm2543197829169296670 -l deu /tmp/okm7108422049875095523.tif /tmp/okm2543197829169296670
java.io.FileNotFoundException: /tmp/okm2543197829169296670.txt (No such file or directory)
	at java.io.FileInputStream.open(Native Method)
	at java.io.FileInputStream.<init>(FileInputStream.java:137)
	at java.io.FileInputStream.<init>(FileInputStream.java:96)
	at com.openkm.extractor.Tesseract3TextExtractor.extractText(Tesseract3TextExtractor.java:92)
	at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
	at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
	at org.apache.jackrabbit.core.query.lucene.TextExtractorJob$1.call(TextExtractorJob.java:93)
	at EDU.oswego.cs.dl.util.concurrent.FutureResult$1.run(Unknown Source)
	at org.apache.jackrabbit.core.query.lucene.TextExtractorJob.run(TextExtractorJob.java:172)
	at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Thread.java:636)
2011-12-23 19:45:20,042 WARN  [com.openkm.util.ExecutionUtils] Abnormal program termination: 1
2011-12-23 19:45:20,051 WARN  [com.openkm.util.ExecutionUtils] STDERR: read_params_file: parameter not found: II*

2011-12-23 19:45:20,051 WARN  [com.openkm.extractor.Tesseract3TextExtractor] IO exception executing command: /usr/bin/tesseract /tmp/okm110838518289848022.tif /tmp/okm8996897110758953388 -l deu /tmp/okm110838518289848022.tif /tmp/okm8996897110758953388
java.io.FileNotFoundException: /tmp/okm8996897110758953388.txt (No such file or directory)
	at java.io.FileInputStream.open(Native Method)
	at java.io.FileInputStream.<init>(FileInputStream.java:137)
	at java.io.FileInputStream.<init>(FileInputStream.java:96)
	at com.openkm.extractor.Tesseract3TextExtractor.extractText(Tesseract3TextExtractor.java:92)
	at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
	at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
	at org.apache.jackrabbit.core.query.lucene.TextExtractorJob$1.call(TextExtractorJob.java:93)
	at EDU.oswego.cs.dl.util.concurrent.FutureResult$1.run(Unknown Source)
	at org.apache.jackrabbit.core.query.lucene.TextExtractorJob.run(TextExtractorJob.java:172)
	at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Thread.java:636)
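
One detail in the logged command is easy to miss: the input and output paths appear twice. A quick way to check a logged command for this is sketched below in Python. Why the paths are repeated is not established in this thread; one guess is that the extractor appends them itself in addition to expanding `${fileIn}` and `${fileOut}`. The function name `repeated_tail` is made up for illustration.

```python
import shlex


def repeated_tail(command_line):
    """Return the trailing arguments that repeat earlier ones, if any."""
    args = shlex.split(command_line)[1:]  # drop the program name
    # Try the longest possible tail first, then shorter ones.
    for size in range(len(args) // 2, 0, -1):
        tail = args[-size:]
        if all(arg in args[:-size] for arg in tail):
            return tail
    return []


logged = ("/usr/bin/tesseract /tmp/okm7108422049875095523.tif "
          "/tmp/okm2543197829169296670 -l deu "
          "/tmp/okm7108422049875095523.tif /tmp/okm2543197829169296670")
print(repeated_tail(logged))
```

For the Debian command above this prints the .tif path and the output base, i.e. the two arguments that were passed a second time.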

Re: OCR with Tesseract doesn't work

#13299 by jllort
Sat Dec 24, 2011 5:54 pm

First try without the -l deu parameter: system.ocr=/usr/bin/tesseract ${fileIn} ${fileOut}. Does the error still occur?

If that change solves the error, then try system.ocr=/usr/bin/tesseract ${fileIn} ${fileOut} -l deu

Re: OCR with Tesseract doesn't work

#13300 by andydld
Sun Dec 25, 2011 12:52 pm

I tried both variants, with and without "-l deu". Still the same error on Windows and Debian.

Re: OCR with Tesseract doesn't work

#13302 by andydld
Mon Dec 26, 2011 11:07 am

I just ran a test with Tesseract 2.04 on Windows, with the same error:

Code: Select all

2011-12-26 12:04:16,839 WARN  [com.openkm.util.ExecutionUtils] Abnormal program termination: 1
2011-12-26 12:04:16,846 WARN  [com.openkm.util.ExecutionUtils] STDERR: error: Could not find variable 'II*'

2011-12-26 12:04:16,847 WARN  [com.openkm.extractor.Tesseract2TextExtractor] IO exception executing command: C:\Tesseract-OCR\2.04\tesseract.exe C:\Users\andy\AppData\Local\Temp\okm1919319298833890175.tif C:\Users\andy\AppData\Local\Temp\okm5118280720318210091 C:\Users\andy\AppData\Local\Temp\okm1919319298833890175.tif C:\Users\andy\AppData\Local\Temp\okm5118280720318210091
java.io.FileNotFoundException: C:\Users\andy\AppData\Local\Temp\okm5118280720318210091.txt (Das System kann die angegebene Datei nicht finden)
	at java.io.FileInputStream.open(Native Method)
	at java.io.FileInputStream.<init>(FileInputStream.java:120)
	at java.io.FileInputStream.<init>(FileInputStream.java:79)
	at com.openkm.extractor.Tesseract2TextExtractor.extractText(Tesseract2TextExtractor.java:100)
	at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
	at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
	at org.apache.jackrabbit.core.query.lucene.TextExtractorJob$1.call(TextExtractorJob.java:93)
	at EDU.oswego.cs.dl.util.concurrent.FutureResult$1.run(Unknown Source)
	at org.apache.jackrabbit.core.query.lucene.TextExtractorJob.run(TextExtractorJob.java:172)
	at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Thread.java:662)
2011-12-26 12:04:16,916 WARN  [com.openkm.util.ExecutionUtils] Abnormal program termination: 1
2011-12-26 12:04:17,060 WARN  [com.openkm.util.ExecutionUtils] STDERR: error: Could not find variable 'II*'

2011-12-26 12:04:17,062 WARN  [com.openkm.extractor.Tesseract2TextExtractor] IO exception executing command: C:\Tesseract-OCR\2.04\tesseract.exe C:\Users\andy\AppData\Local\Temp\okm5283665428412893901.tif C:\Users\andy\AppData\Local\Temp\okm8045755092797813934 C:\Users\andy\AppData\Local\Temp\okm5283665428412893901.tif C:\Users\andy\AppData\Local\Temp\okm8045755092797813934
java.io.FileNotFoundException: C:\Users\andy\AppData\Local\Temp\okm8045755092797813934.txt (Das System kann die angegebene Datei nicht finden)
	at java.io.FileInputStream.open(Native Method)
	at java.io.FileInputStream.<init>(FileInputStream.java:120)
	at java.io.FileInputStream.<init>(FileInputStream.java:79)
	at com.openkm.extractor.Tesseract2TextExtractor.extractText(Tesseract2TextExtractor.java:100)
	at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
	at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
	at org.apache.jackrabbit.core.query.lucene.TextExtractorJob$1.call(TextExtractorJob.java:93)
	at EDU.oswego.cs.dl.util.concurrent.FutureResult$1.run(Unknown Source)
	at org.apache.jackrabbit.core.query.lucene.TextExtractorJob.run(TextExtractorJob.java:172)
	at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Thread.java:662)
2011-12-26 12:04:18,014 INFO  [org.apache.jackrabbit.core.query.lucene.MultiIndex] updating index with 1 nodes from indexing queue.

Re: OCR with Tesseract doesn't work

#13310 by jllort
Tue Dec 27, 2011 8:05 am

For some reason it seems the temporary file C:\Users\andy\AppData\Local\Temp\okm8045755092797813934.txt cannot be generated. I don't know what the reason could be. It's really strange that you have the same problem on both OSes; it seems something is wrong in both configurations. Take a look at repository.xml, workspace.xml and the configuration parameters in administration. Make a screenshot of the administration parameters where system.ocr is set.

Re: OCR with Tesseract doesn't work

#13312 by andydld
Tue Dec 27, 2011 11:29 am

This error reminds me of another error we had:

http://forum.openkm.com/viewtopic.php?f ... ick#p12559

At that time, an ImageMagick bug was the problem.

Now I see "the same". I mean, the program (Tesseract or OpenKM's call) crashes ("Abnormal program termination") and as a consequence there are no temp files.

A screenshot of the admin tab with "system.ocr" from my Windows machine is attached.

Here's the "repository.xml":

Code: Select all

<?xml version="1.0"?>
<!DOCTYPE Repository PUBLIC "-//The Apache Software Foundation//DTD Jackrabbit 1.6//EN"
                            "http://jackrabbit.apache.org/dtd/repository-1.6.dtd">
<Repository>
    <!-- virtual file system where the repository stores global state
        (e.g. registered namespaces, custom node types, etc.) -->
    <FileSystem class="org.apache.jackrabbit.core.fs.local.LocalFileSystem">
        <param name="path" value="${rep.home}/repository"/>
    </FileSystem>

    <!-- Security configuration -->
    <Security appName="OpenKM">
        <!-- Security manager: FQN of class implementing the JackrabbitSecurityManager interface -->
        <!--<SecurityManager class="org.apache.jackrabbit.core.DefaultSecurityManager" workspaceName="security">-->
            <!-- workspace access: FQN of class implementing the WorkspaceAccessManager interface -->
            <!-- <WorkspaceAccessManager class="..."/> -->
            <!-- <param name="config" value="${rep.home}/security.xml"/> -->
        <!--</SecurityManager>-->

        <!-- Access manager: FQN of class implementing the AccessManager interface -->
        <AccessManager class="com.openkm.core.OKMAccessManager"/>
        <!-- <AccessManager class="org.apache.jackrabbit.core.security.SimpleAccessManager"/> -->
        <!-- <AccessManager class="org.apache.jackrabbit.core.security.DefaultAccessManager"> -->
            <!-- <param name="config" value="${rep.home}/access.xml"/> -->
        <!-- </AccessManager> -->

        <!-- <LoginModule class="org.apache.jackrabbit.core.security.simple.SimpleLoginModule"> -->
        <!-- <LoginModule class="org.apache.jackrabbit.core.security.authentication.DefaultLoginModule"> -->
           <!-- Anonymous user name ('anonymous' is the default value) -->
           <!-- <param name="anonymousId" value="anonymous"/> -->
           <!-- Administrator user id (default value if param is missing is 'admin') -->
           <!-- <param name="adminId" value="admin"/> -->
           <!-- <param name="principalProvider" value="..."/> -->
        <!--</LoginModule>-->
    </Security>

    <!-- Location of workspaces root directory and name of default workspace -->
    <Workspaces rootPath="${rep.home}/workspaces" defaultWorkspace="default"/>

    <!-- Workspace configuration template:
         used to create the initial workspace if there's no workspace yet -->
    <Workspace name="${wsp.name}">
        <!-- Virtual file system of the workspace:
             class: FQN of class implementing the FileSystem interface -->
        <FileSystem class="org.apache.jackrabbit.core.fs.local.LocalFileSystem">
            <param name="path" value="${wsp.home}"/>
        </FileSystem>

        <!-- Persistence manager of the workspace:
             class: FQN of class implementing the PersistenceManager interface -->
        <PersistenceManager class="org.apache.jackrabbit.core.persistence.bundle.DerbyPersistenceManager">
          <param name="url" value="jdbc:derby:${wsp.home}/db;create=true"/>
          <param name="schemaObjectPrefix" value="${wsp.name}_"/>
        </PersistenceManager>

        <!-- Search index and the file system it uses.
             class: FQN of class implementing the QueryHandler interface -->
        <SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
            <param name="path" value="${wsp.home}/index"/>
            <param name="textFilterClasses" value="
			org.apache.jackrabbit.extractor.PlainTextExtractor,
			org.apache.jackrabbit.extractor.MsWordTextExtractor,
			org.apache.jackrabbit.extractor.MsExcelTextExtractor,
			org.apache.jackrabbit.extractor.MsPowerPointTextExtractor,
			org.apache.jackrabbit.extractor.OpenOfficeTextExtractor,
			org.apache.jackrabbit.extractor.RTFTextExtractor,
			org.apache.jackrabbit.extractor.HTMLTextExtractor,
			org.apache.jackrabbit.extractor.XMLTextExtractor,
			org.apache.jackrabbit.extractor.PngTextExtractor,
			org.apache.jackrabbit.extractor.MsOutlookTextExtractor,
			com.openkm.extractor.PdfTextExtractor,
			com.openkm.extractor.AudioTextExtractor,
			com.openkm.extractor.ExifTextExtractor,
			com.openkm.extractor.CuneiformTextExtractor,
			com.openkm.extractor.SourceCodeTextExtractor,
			com.openkm.extractor.MsOffice2007TextExtractor,
			com.openkm.extractor.Tesseract2TextExtractor"/>
            <param name="extractorPoolSize" value="2"/>
            <param name="supportHighlighting" value="false"/>
            <param name="indexingConfiguration" value="${wsp.home}/../../../indexing_configuration.xml"/>
        </SearchIndex>
    </Workspace>

    <!-- Configures the versioning -->
    <Versioning rootPath="${rep.home}/version">
        <!-- Configures the filesystem to use for versioning for the respective
             persistence manager -->
        <FileSystem class="org.apache.jackrabbit.core.fs.local.LocalFileSystem">
            <param name="path" value="${rep.home}/version" />
        </FileSystem>

        <!-- Configures the persistence manager to be used for persisting version state.
             Please note that the current versioning implementation is based on
             a 'normal' persistence manager, but this could change in future
             implementations. -->
        <PersistenceManager class="org.apache.jackrabbit.core.persistence.bundle.DerbyPersistenceManager">
          <param name="url" value="jdbc:derby:${rep.home}/version/db;create=true"/>
          <param name="schemaObjectPrefix" value="version_"/>
        </PersistenceManager>
    </Versioning>

    <!-- Search index for content that is shared repository wide
         (/jcr:system tree, contains mainly versions) -->
    <SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
        <param name="path" value="${rep.home}/repository/index"/>
        <param name="textFilterClasses" value=""/>
        <param name="extractorPoolSize" value="2"/>
        <param name="supportHighlighting" value="false"/>
    </SearchIndex>

    <!-- DataStore improve file handling performance -->
    <DataStore class="org.apache.jackrabbit.core.data.FileDataStore">
        <param name="path" value="${rep.home}/repository/datastore"/>
        <param name="minRecordLength" value="100"/>
    </DataStore>
</Repository>

Here's the "workspace.xml":

Code: Select all

<?xml version="1.0" encoding="UTF-8"?>
<Workspace name="default">
        <!-- Virtual file system of the workspace:
             class: FQN of class implementing the FileSystem interface -->
        <FileSystem class="org.apache.jackrabbit.core.fs.local.LocalFileSystem">
            <param name="path" value="${wsp.home}"/>
        </FileSystem>

        <!-- Persistence manager of the workspace:
             class: FQN of class implementing the PersistenceManager interface -->
        <PersistenceManager class="org.apache.jackrabbit.core.persistence.bundle.DerbyPersistenceManager">
          <param name="url" value="jdbc:derby:${wsp.home}/db;create=true"/>
          <param name="schemaObjectPrefix" value="${wsp.name}_"/>
        </PersistenceManager>

        <!-- Search index and the file system it uses.
             class: FQN of class implementing the QueryHandler interface -->
        <SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
            <param name="path" value="${wsp.home}/index"/>
            <param name="textFilterClasses" value="
			org.apache.jackrabbit.extractor.PlainTextExtractor,
			org.apache.jackrabbit.extractor.MsWordTextExtractor,
			org.apache.jackrabbit.extractor.MsExcelTextExtractor,
			org.apache.jackrabbit.extractor.MsPowerPointTextExtractor,
			org.apache.jackrabbit.extractor.OpenOfficeTextExtractor,
			org.apache.jackrabbit.extractor.RTFTextExtractor,
			org.apache.jackrabbit.extractor.HTMLTextExtractor,
			org.apache.jackrabbit.extractor.XMLTextExtractor,
			org.apache.jackrabbit.extractor.PngTextExtractor,
			org.apache.jackrabbit.extractor.MsOutlookTextExtractor,
			com.openkm.extractor.PdfTextExtractor,
			com.openkm.extractor.AudioTextExtractor,
			com.openkm.extractor.ExifTextExtractor,
			com.openkm.extractor.CuneiformTextExtractor,
			com.openkm.extractor.SourceCodeTextExtractor,
			com.openkm.extractor.MsOffice2007TextExtractor,
			com.openkm.extractor.Tesseract2TextExtractor"/>
            <param name="extractorPoolSize" value="2"/>
            <param name="supportHighlighting" value="false"/>
            <param name="indexingConfiguration" value="${wsp.home}/../../../indexing_configuration.xml"/>
        </SearchIndex>
    </Workspace>

At the moment everything is configured for Tesseract 2.
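
Since the extractor has to be named both in the XML files ("textFilterClasses", comma separated) and on the admin tab ("registered.text.extractors"), it may help to compare the two lists mechanically. Below is a minimal Python sketch; the two sample values are shortened and deliberately mismatched for illustration, they are not the real files, and the helper names are made up.

```python
import xml.etree.ElementTree as ET

# Shortened sample of a workspace.xml, not the full file.
WORKSPACE_XML = """<Workspace name="default">
  <SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
    <param name="textFilterClasses" value="
        com.openkm.extractor.PdfTextExtractor,
        com.openkm.extractor.Tesseract2TextExtractor"/>
  </SearchIndex>
</Workspace>"""

# Shortened sample of the admin-tab value (whitespace separated).
ADMIN_VALUE = """com.openkm.extractor.PdfTextExtractor
com.openkm.extractor.Tesseract3TextExtractor"""


def xml_extractors(xml_text):
    """Class names listed in the textFilterClasses param of the XML."""
    root = ET.fromstring(xml_text)
    for param in root.iter("param"):
        if param.get("name") == "textFilterClasses":
            value = param.get("value", "")
            return {c.strip() for c in value.split(",") if c.strip()}
    return set()


def admin_extractors(value):
    """Class names from the admin-tab setting, split on any whitespace."""
    return set(value.split())


def only_in_one_place(xml_text, admin_value):
    """Extractor classes registered in just one of the two places."""
    return sorted(xml_extractors(xml_text) ^ admin_extractors(admin_value))


print(only_in_one_place(WORKSPACE_XML, ADMIN_VALUE))
```

With the sample values this lists both Tesseract extractors, because the XML names version 2 and the admin value names version 3. An empty list would mean the two places agree.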

Attachments

OpenKM-Admin-Tab with system.ocr
opkm-admintab.PNG (47.32 KiB) Viewed 23637 times

Re: OCR with Tesseract doesn't work

#13344 by pavila
Fri Dec 30, 2011 10:12 am

In the configuration page, the paths should not contain spaces. Perhaps this is the problem.
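
To illustrate why a space inside the program path can matter: if a command string is split on whitespace before it is executed, a path like "C:\Program Files (x86)\..." is cut at the first space. Whether OpenKM 5.1.8 tokenizes "system.ocr" exactly this way is an assumption here; the Python sketch only shows the effect, and `naive_split` is a made-up name.

```python
def naive_split(command):
    """Split a command string on whitespace, as a simple launcher might."""
    return command.split()


with_spaces = r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe in.tif out"
without_spaces = r"C:\Tesseract-OCR\2.04\tesseract.exe in.tif out"

print(naive_split(with_spaces)[0])     # C:\Program
print(naive_split(without_spaces)[0])  # C:\Tesseract-OCR\2.04\tesseract.exe
```

Note that the spaces separating the program from its arguments are still needed; only the program path itself should be free of spaces.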

Username

pavila

Rank

Moderator

Posts

3145

Joined

Tue Dec 11, 2007 6:02 pm

Location

Alicante, Spain

Contact

Re: OCR with Tesseract doesn't work

#13345 by andydld
Fri Dec 30, 2011 12:32 pm

Thanks for the answer.

I changed system.ocr on my Windows machine from

Code: Select all

C:\Tesseract-OCR\2.04\tesseract.exe ${fileIn} ${fileOut}

to

Code: Select all

C:\Tesseract-OCR\2.04\tesseract.exe${fileIn}${fileOut}

On Debian I did the same. I changed system.ocr from

Code: Select all

/usr/bin/tesseract ${fileIn} ${fileOut}

to

Code: Select all

/usr/bin/tesseract${fileIn}${fileOut}

Both changes were made in the admin tab.

I tested again with the eurotext.tif and phototest.tif files from the Tesseract Windows package, with the same error. Only the "Abnormal program termination" is gone.

server.log from Windows with Tesseract 2.04 at the moment:

Code: Select all

2011-12-30 13:13:07,402 WARN  [com.openkm.extractor.Tesseract2TextExtractor] IO exception executing command: C:\Tesseract-OCR\2.04\tesseract.exeC:\Users\andy\AppData\Local\Temp\okm4924962631318534852.tifC:\Users\andy\AppData\Local\Temp\okm4341221566327818840 C:\Users\andy\AppData\Local\Temp\okm4924962631318534852.tif C:\Users\andy\AppData\Local\Temp\okm4341221566327818840
java.io.IOException: Cannot run program "C:\Tesseract-OCR\2.04\tesseract.exeC:\Users\andy\AppData\Local\Temp\okm4924962631318534852.tifC:\Users\andy\AppData\Local\Temp\okm4341221566327818840": CreateProcess error=2, Das System kann die angegebene Datei nicht finden
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
	at com.openkm.util.ExecutionUtils.runCmdImpl(ExecutionUtils.java:246)
	at com.openkm.util.ExecutionUtils.runCmd(ExecutionUtils.java:225)
	at com.openkm.extractor.Tesseract2TextExtractor.extractText(Tesseract2TextExtractor.java:97)
	at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
	at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
	at org.apache.jackrabbit.core.query.lucene.TextExtractorJob$1.call(TextExtractorJob.java:93)
	at EDU.oswego.cs.dl.util.concurrent.FutureResult$1.run(Unknown Source)
	at org.apache.jackrabbit.core.query.lucene.TextExtractorJob.run(TextExtractorJob.java:172)
	at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: CreateProcess error=2, Das System kann die angegebene Datei nicht finden
	at java.lang.ProcessImpl.create(Native Method)
	at java.lang.ProcessImpl.<init>(ProcessImpl.java:81)
	at java.lang.ProcessImpl.start(ProcessImpl.java:30)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
	... 10 more
2011-12-30 13:13:07,588 WARN  [com.openkm.extractor.Tesseract2TextExtractor] IO exception executing command: C:\Tesseract-OCR\2.04\tesseract.exeC:\Users\andy\AppData\Local\Temp\okm1872047461444103958.tifC:\Users\andy\AppData\Local\Temp\okm5038797940072836687 C:\Users\andy\AppData\Local\Temp\okm1872047461444103958.tif C:\Users\andy\AppData\Local\Temp\okm5038797940072836687
java.io.IOException: Cannot run program "C:\Tesseract-OCR\2.04\tesseract.exeC:\Users\andy\AppData\Local\Temp\okm1872047461444103958.tifC:\Users\andy\AppData\Local\Temp\okm5038797940072836687": CreateProcess error=2, Das System kann die angegebene Datei nicht finden
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
	at com.openkm.util.ExecutionUtils.runCmdImpl(ExecutionUtils.java:246)
	at com.openkm.util.ExecutionUtils.runCmd(ExecutionUtils.java:225)
	at com.openkm.extractor.Tesseract2TextExtractor.extractText(Tesseract2TextExtractor.java:97)
	at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
	at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
	at org.apache.jackrabbit.core.query.lucene.TextExtractorJob$1.call(TextExtractorJob.java:93)
	at EDU.oswego.cs.dl.util.concurrent.FutureResult$1.run(Unknown Source)
	at org.apache.jackrabbit.core.query.lucene.TextExtractorJob.run(TextExtractorJob.java:172)
	at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: CreateProcess error=2, Das System kann die angegebene Datei nicht finden
	at java.lang.ProcessImpl.create(Native Method)
	at java.lang.ProcessImpl.<init>(ProcessImpl.java:81)
	at java.lang.ProcessImpl.start(ProcessImpl.java:30)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
	... 10 more
2011-12-30 13:13:08,281 INFO  [org.apache.jackrabbit.core.query.lucene.MultiIndex] updating index with 1 nodes from indexing queue.

server.log from Debian with Tesseract 3.01 at the moment:

Code: Select all

2011-12-30 13:24:25,598 WARN  [com.openkm.extractor.Tesseract3TextExtractor] IO exception executing command: /usr/bin/tesseract/tmp/okm3939815078924811309.tif/tmp/okm2344878681557850872 /tmp/okm3939815078924811309.tif /tmp/okm2344878681557850872
java.io.IOException: Cannot run program "/usr/bin/tesseract/tmp/okm3939815078924811309.tif/tmp/okm2344878681557850872": java.io.IOException: error=20, Not a directory
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:475)
	at com.openkm.util.ExecutionUtils.runCmdImpl(ExecutionUtils.java:246)
	at com.openkm.util.ExecutionUtils.runCmd(ExecutionUtils.java:225)
	at com.openkm.extractor.Tesseract3TextExtractor.extractText(Tesseract3TextExtractor.java:89)
	at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
	at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
	at org.apache.jackrabbit.core.query.lucene.TextExtractorJob$1.call(TextExtractorJob.java:93)
	at EDU.oswego.cs.dl.util.concurrent.FutureResult$1.run(Unknown Source)
	at org.apache.jackrabbit.core.query.lucene.TextExtractorJob.run(TextExtractorJob.java:172)
	at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Thread.java:636)
Caused by: java.io.IOException: java.io.IOException: error=20, Not a directory
	at java.lang.UNIXProcess.<init>(UNIXProcess.java:164)
	at java.lang.ProcessImpl.start(ProcessImpl.java:81)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:468)
	... 10 more
2011-12-30 13:24:25,601 WARN  [com.openkm.extractor.Tesseract3TextExtractor] IO exception executing command: /usr/bin/tesseract/tmp/okm6229961688656192035.tif/tmp/okm5378590165912760933 /tmp/okm6229961688656192035.tif /tmp/okm5378590165912760933
java.io.IOException: Cannot run program "/usr/bin/tesseract/tmp/okm6229961688656192035.tif/tmp/okm5378590165912760933": java.io.IOException: error=20, Not a directory
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:475)
	at com.openkm.util.ExecutionUtils.runCmdImpl(ExecutionUtils.java:246)
	at com.openkm.util.ExecutionUtils.runCmd(ExecutionUtils.java:225)
	at com.openkm.extractor.Tesseract3TextExtractor.extractText(Tesseract3TextExtractor.java:89)
	at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
	at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
	at org.apache.jackrabbit.core.query.lucene.TextExtractorJob$1.call(TextExtractorJob.java:93)
	at EDU.oswego.cs.dl.util.concurrent.FutureResult$1.run(Unknown Source)
	at org.apache.jackrabbit.core.query.lucene.TextExtractorJob.run(TextExtractorJob.java:172)
	at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Thread.java:636)
Caused by: java.io.IOException: java.io.IOException: error=20, Not a directory
	at java.lang.UNIXProcess.<init>(UNIXProcess.java:164)
	at java.lang.ProcessImpl.start(ProcessImpl.java:81)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:468)
	... 10 more
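
The "Cannot run program" messages above fit what happens when the spaces between the program and the placeholders are removed: after substitution everything fuses into one long token, which the OS then tries to run as a single program path. A minimal Python sketch of the substitution follows; the `expand` helper and the short /tmp paths are made up for illustration and do not come from OpenKM.

```python
def expand(template, file_in, file_out):
    """Fill the ${fileIn} / ${fileOut} placeholders of a command template."""
    return (template.replace("${fileIn}", file_in)
                    .replace("${fileOut}", file_out))


file_in, file_out = "/tmp/in.tif", "/tmp/out"

spaced = expand("/usr/bin/tesseract ${fileIn} ${fileOut}", file_in, file_out)
fused = expand("/usr/bin/tesseract${fileIn}${fileOut}", file_in, file_out)

print(spaced.split())  # ['/usr/bin/tesseract', '/tmp/in.tif', '/tmp/out']
print(fused.split())   # ['/usr/bin/tesseract/tmp/in.tif/tmp/out']
```

So the advice about spaces most likely refers to spaces inside the program path, not to the separators between the arguments.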

Re: OCR with Tesseract doesn't work

#13348 by pberden
Fri Dec 30, 2011 3:12 pm

I have the same problem with Ubuntu server 10.11 64 bit, tesseract 3.01 and OpenKM 5.1.8.

Username

pberden

Rank

Fresh Boarder

Posts

1

Joined

Thu Dec 29, 2011 5:01 pm

Re: OCR with Tesseract doesn't work

#13350 by andydld
Fri Dec 30, 2011 4:34 pm

Good to know that I'm not alone.

Just a note:

Because of a license change, Java isn't available through the repos anymore.

If you execute the recommended command (found in the wiki) you get OpenJDK on Debian Squeeze and get an error on Ubuntu (tested on 10.04 Server LTS AMD64).

I use the original Oracle Java JDK 6u30 on Windows and have OpenJDK on Debian.

I think we can rule out a Java version problem this way.
