Open Source Document Management System | OpenKM - OCR with tesseract

OCR with tesseract - Problem & dirty workaround

#21355 by mauorrizze
Wed Feb 13, 2013 3:28 am

Googling "Too few text extracted" turned up many reports of problems getting tesseract to work, but no solutions.
Comparing Tesseract 3 and Cuneiform on the command line, I noticed two things:
1. Tesseract has improved greatly, and in my examples it worked better than Cuneiform.
2. Passing an output file name that already includes the desired extension produces a file with the extension doubled ("tesseract in.jpg out.txt" -> "out.txt.txt").

The second point led me to experiment. The OpenKM call "/path/to/tesseract ${fileIn} ${fileOut}" also passes the output file name including its extension, so I suspect that OpenKM simply can't find the generated output file and therefore reports "Too few text extracted". There might be more precise error messages out there...

I wrote a tiny workaround, mainly for testing. I'm no Bash expert, but the following worked for me on Linux. Don't use it 1:1; it will probably lead to even more errors.

Code:

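#!/bin/bash
# Wrapper for OpenKM: tesseract 3 appends ".txt" to the output name itself,
# so strip that extension from ${fileOut} before calling it.

ifname="$1"
ofname="${2%.txt}"   # only a trailing ".txt" is removed

# Quote both paths so spaces in them don't split the arguments.
/usr/bin/tesseract "$ifname" "$ofname" -l deu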

(deu is for German; eng should be the default)

I saved this file as /usr/bin/teswrap and changed the command in OpenKM to /usr/bin/teswrap ${fileIn} ${fileOut}

The cleaner way would be if we could use ${fileOutBase} in the command string in OpenKM.
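
A caveat when stripping the extension with shell parameter expansion: the generic pattern `${name%.*}` cuts at the last dot anywhere in the path, which matters here because the Tomcat directory name itself contains dots. The sketch below compares it with a pattern that removes only a trailing ".txt". The helper name `strip_txt` and the example paths are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: drop a trailing ".txt" only, leave everything else alone.
strip_txt() {
    printf '%s\n' "${1%.txt}"
}

out="/home/user/tomcat-7.0.27/temp/okm123.txt"   # made-up example path
strip_txt "$out"                # /home/user/tomcat-7.0.27/temp/okm123

# With no extension on the file name, the generic pattern finds the
# last dot in the directory name instead and truncates the path there.
noext="/home/user/tomcat-7.0.27/temp/okm123"
printf '%s\n' "${noext%.*}"     # /home/user/tomcat-7.0
strip_txt "$noext"              # /home/user/tomcat-7.0.27/temp/okm123 (unchanged)
```

As long as OpenKM always passes a name ending in ".txt", both patterns give the same result; the narrower one is only a safeguard.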

Re: OCR with tesseract - Problem & dirty workaround

#21373 by jllort
Thu Feb 14, 2013 10:13 pm

I think the problem could be where the "tmp" files are stored. Have you made any configuration change on the server that could affect where "tmp" files are stored?

Re: OCR with tesseract - Problem & dirty workaround

#21391 by mauorrizze
Fri Feb 15, 2013 12:06 pm

I used Arch Linux and now Debian; on both, the temp directory is the default one (/tmp). But I installed the combined Tomcat & OpenKM Community 6.22 package, and it uses the temp dir within Tomcat ($CATALINA_HOME/temp or something like that).
I have no experience with advanced OpenKM debugging/logging, so sorry that I resort to some tricks. If I deliberately enter a wrong tesseract path, I can see the command OpenKM tries to execute (in the catalina.out log):

Code:

/path/to/tesseract /home/*user*/tomcat-7.0.27/temp/okm236275203897839349.jpg /home/*user*/tomcat-7.0.27/temp/okm2717505807773049149.txt

With the correct tesseract path, this leads to the well-known "Too few text extracted" error message.

If I execute tesseract on the console, e.g. with the command

Code:

/usr/bin/tesseract input.jpg output.txt

it produces the file

Code:

output.txt.txt

My tesseract version is 3.02, which is the version in both the Debian wheezy and Arch package repositories.
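
The doubled extension can be checked for directly. The following is a minimal sketch, assuming only the behaviour described above (tesseract 3.02 appends ".txt" to whatever output name it is given); `find_ocr_output` is a hypothetical helper name, not part of tesseract or OpenKM.

```shell
#!/bin/bash
# Hypothetical diagnostic: given the path that was passed as the output
# argument, print the file that actually exists on disk.
find_ocr_output() {
    local requested="$1" candidate
    for candidate in "$requested" "$requested.txt"; do
        if [ -f "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1    # neither file exists
}

# Example (run after "tesseract input.jpg output.txt"):
# find_ocr_output output.txt
```

If this prints the name with ".txt" appended, the caller looking for the name it asked for will not find the result.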

My assumption was that OpenKM can't find okm2717505807773049149.txt.txt, so my little script cuts the file extension off the ${fileOut} parameter.
If I add log output to my script, I can see the resulting call:

Code:

/usr/bin/tesseract /home/*user*/tomcat-7.0.27/temp/okm2098970522202044220.jpg /home/*user*/tomcat-7.0.27/temp/okm8124108125134778178 -l deu

As a result, "Too few text extracted" no longer appears in the log, and the extracted text can be found via full-text search in OpenKM.
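
One way to add such log output is sketched below. This is not the exact script used in this thread: `run_ocr`, `TESSERACT_BIN` and `TESWRAP_LOG` are hypothetical names introduced here, and the default log path is an arbitrary choice.

```shell
#!/bin/bash
# Sketch of a logging wrapper. TESSERACT_BIN and TESWRAP_LOG are
# hypothetical environment variables, not OpenKM or tesseract settings.
run_ocr() {
    local src="$1" dst="$2"
    local base="${dst%.txt}"                          # tesseract 3 re-adds ".txt"
    local bin="${TESSERACT_BIN:-/usr/bin/tesseract}"
    local log="${TESWRAP_LOG:-/tmp/teswrap.log}"

    # Record the exact command before running it.
    printf '%s %s %s -l deu\n' "$bin" "$src" "$base" >> "$log"
    "$bin" "$src" "$base" -l deu
}

# When saved as a wrapper script, pass OpenKM's two arguments through:
# run_ocr "$1" "$2"
```

Keeping the binary path overridable makes it possible to try the wrapper with a stand-in command before pointing OpenKM at it.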

Re: OCR with tesseract - Problem & dirty workaround

#21434 by pavila
Sun Feb 17, 2013 11:35 am

Please, paste your "registered.text.extractors" configuration property value.

Re: OCR with tesseract - Problem & dirty workaround

#21435 by mauorrizze
Sun Feb 17, 2013 12:39 pm

Code:

org.apache.jackrabbit.extractor.PlainTextExtractor
org.apache.jackrabbit.extractor.MsWordTextExtractor
org.apache.jackrabbit.extractor.MsExcelTextExtractor
org.apache.jackrabbit.extractor.MsPowerPointTextExtractor
org.apache.jackrabbit.extractor.OpenOfficeTextExtractor
org.apache.jackrabbit.extractor.RTFTextExtractor
org.apache.jackrabbit.extractor.HTMLTextExtractor
org.apache.jackrabbit.extractor.XMLTextExtractor
org.apache.jackrabbit.extractor.PngTextExtractor
org.apache.jackrabbit.extractor.MsOutlookTextExtractor
com.openkm.extractor.PdfTextExtractor
com.openkm.extractor.AudioTextExtractor
com.openkm.extractor.ExifTextExtractor
com.openkm.extractor.CuneiformTextExtractor
com.openkm.extractor.SourceCodeTextExtractor
com.openkm.extractor.MsOffice2007TextExtractor

Does that mean tesseract is missing from this config, and that OpenKM would normally be aware of the tempfile.txt.txt file?

Note: my temp directory still has the following content:

Code:

~/tomcat-7.0.27/temp$ ls
okm1043105818973058510.txt.txt	okm3778300903172236830.txt.txt	okm7313556725340927314.txt.txt	okm7853102012711839460.txt.txt
okm1238734102778642262.txt.txt	okm3944205123158091582.txt.txt	okm7444060295834607103.txt.txt	okm8579924113660039877.txt.txt
okm1868275720582807002.txt.txt	okm412239872954792346.txt.txt	 okm776575320287834025.txt.txt	 okm9214329425559532.txt.txt
okm2389180805481113821.txt.txt	okm5470182959380227958.txt.txt	okm7834870244283743619.txt.txt	safeToDelete.tmp

Image and .txt files are properly taken care of, but these .txt.txt files generated by tesseract (3.02) aren't handled by OpenKM.
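
Leftover files like these can be listed and removed by hand. The sketch below assumes the "okm*.txt.txt" naming seen in the listing above and that no OCR job is writing to the directory at the time; the function names are hypothetical.

```shell
#!/bin/bash
# List leftover tesseract outputs directly inside the given directory.
list_stale_ocr() {
    find "$1" -maxdepth 1 -type f -name 'okm*.txt.txt'
}

# Delete the same set of files. Review the list first.
remove_stale_ocr() {
    find "$1" -maxdepth 1 -type f -name 'okm*.txt.txt' -delete
}

# Example:
# list_stale_ocr ~/tomcat-7.0.27/temp
# remove_stale_ocr ~/tomcat-7.0.27/temp
```

The pattern matches only the doubled extension, so single ".txt" files and safeToDelete.tmp are left alone.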

Re: OCR with tesseract - Problem & dirty workaround

#21484 by pavila
Thu Feb 21, 2013 5:34 pm

According to http://wiki.openkm.com/index.php/Third- ... ation:_OCR you need to replace "com.openkm.extractor.CuneiformTextExtractor" with "com.openkm.extractor.Tesseract3TextExtractor".
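
For a list as long as the one posted earlier, the swap can be done as a text substitution before pasting the value back into the configuration. This is only a sketch: `swap_extractor` is a hypothetical helper and extractors.txt a hypothetical file holding the copied property value; how the property is edited in OpenKM itself is not covered here.

```shell
#!/bin/bash
# Replace the Cuneiform extractor with the Tesseract 3 one in a list
# read from standard input; all other lines pass through unchanged.
swap_extractor() {
    sed 's/com\.openkm\.extractor\.CuneiformTextExtractor/com.openkm.extractor.Tesseract3TextExtractor/'
}

# Example:
# swap_extractor < extractors.txt
```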
