• OCR with tesseract - Problem & dirty workaround

  • OpenKM has many interesting features, but requires some configuration process to show its full potential.
 #21355  by mauorrizze
 
Searching Google for "Too few text extracted", I found many reports of problems getting Tesseract to work, but no solutions.
Comparing Tesseract 3 and Cuneiform on the command line, I noticed two things:
1. Tesseract has improved greatly, and in my examples it worked better than Cuneiform.
2. Passing the output file parameter with the desired extension produces a file with the extension doubled ("tesseract in.jpg out.txt" -> "out.txt.txt").

The second point led me to some experiments. The call "/path/to/tesseract ${fileIn} ${fileOut}" passes an output file name that already includes the extension, so Tesseract appends a second one. I therefore suspect that OpenKM simply can't find the generated output file, hence "Too few text extracted". There might be more exact error messages out there... :lol:

I wrote a tiny workaround, mainly just for testing. I can't really write Bash, but the following worked for me on Linux. Don't use it 1:1, it will probably lead to even more errors :!:
Code: Select all
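#!/bin/bash
# Wrapper around tesseract: tesseract appends ".txt" to the output name itself,
# so strip the extension that OpenKM passes in ${fileOut}.

ifname=$1
ofname=$2
ofext="${ofname##*.}"   # extension of the output file (not used further)
ofname="${ofname%.*}"   # output path without its last extension

# Quote the paths so that spaces in them do not split the arguments.
/usr/bin/tesseract "$ifname" "$ofname" -l deu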
(deu is for German; eng should be the default.)

I saved this file as /usr/bin/teswrap and changed the command in OpenKM to /usr/bin/teswrap ${fileIn} ${fileOut}

The cleaner way would be if we could use ${fileOutBase} in the command string in OpenKM.
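To illustrate what the wrapper's extension stripping does, here is a minimal sketch. The function name strip_ext is made up for this example and is not part of OpenKM or tesseract. It also guards against one edge case: the plain "${name%.*}" expansion would cut into a directory name if the file itself had no extension.

```shell
#!/bin/bash
# strip_ext is a hypothetical helper name, used only for this illustration.
# It removes the last extension from a path, but only if the final path
# component actually contains one.
strip_ext() {
    local path file
    path=$1
    file=${path##*/}                          # last path component
    case $file in
        ?*.*) printf '%s\n' "${path%.*}" ;;   # has an extension: drop it
        *)    printf '%s\n' "$path" ;;        # no extension: leave unchanged
    esac
}

strip_ext "/tmp/okm123.txt"   # prints /tmp/okm123
```

With the output name reduced this way, tesseract can append its own ".txt" and the resulting file should match the name OpenKM asked for.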
 #21373  by jllort
 
I think the problem could be where the "tmp" files are stored. Have you made any configuration change on the server that could affect where "tmp" files are stored?
 #21391  by mauorrizze
 
I used Arch Linux before and now use Debian; on both, the temp directory is the default one (/tmp). But I installed the combined Tomcat & OpenKM Community 6.22 package, and it uses the temp dir within Tomcat ($CATALINA_HOME/temp or something).
I have no experience with advanced OpenKM debugging/logging, so sorry that I use some tricks. But if I deliberately enter the wrong tesseract path, I can see the command OpenKM tries to execute (in the catalina.out log):
Code: Select all
/path/to/tesseract /home/*user*/tomcat-7.0.27/temp/okm236275203897839349.jpg /home/*user*/tomcat-7.0.27/temp/okm2717505807773049149.txt
With the correct tesseract path, this leads to the well-known "Too few text extracted" error message.

If I execute tesseract on the console, e.g. with the command
Code: Select all
/usr/bin/tesseract input.jpg output.txt
it produces the file
Code: Select all
output.txt.txt
My tesseract version is 3.02, which is included in both the Debian wheezy and the Arch package repositories.

My assumption was that OpenKM can't find okm2717505807773049149.txt.txt, so my little script cuts the file extension off the ${fileOut} parameter.
If I add log output to my script I can see the resulting call:
Code: Select all
/usr/bin/tesseract /home/*user*/tomcat-7.0.27/temp/okm2098970522202044220.jpg /home/*user*/tomcat-7.0.27/temp/okm8124108125134778178 -l deu
As a result, "Too few text extracted" no longer appears in the log, and I can find text from the images via full-text search in OpenKM :)
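For anyone who wants to reproduce the logging, a variant of the wrapper could look roughly like the sketch below. The names TESSERACT_BIN, TESWRAP_LOG and run_ocr, and the log location /tmp/teswrap.log, are assumptions made for this example and not anything OpenKM defines.

```shell
#!/bin/bash
# Sketch of a logging wrapper. TESSERACT_BIN, TESWRAP_LOG and run_ocr are
# hypothetical names chosen for this example.
TESSERACT_BIN=${TESSERACT_BIN:-/usr/bin/tesseract}
TESWRAP_LOG=${TESWRAP_LOG:-/tmp/teswrap.log}

run_ocr() {
    in=$1
    out_base=${2%.*}    # drop the extension; tesseract appends ".txt" itself
    # Append the command line that is about to run to the log file.
    printf '%s %s %s -l deu\n' "$TESSERACT_BIN" "$in" "$out_base" >> "$TESWRAP_LOG"
    "$TESSERACT_BIN" "$in" "$out_base" -l deu
}

# Run only when both arguments are given, so the file can also be sourced.
if [ "$#" -eq 2 ]; then
    run_ocr "$1" "$2"
fi
```

Saved in place of the wrapper and called as before with ${fileIn} ${fileOut}, each conversion would leave one line in the log showing the exact tesseract call.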
 #21435  by mauorrizze
 
Code: Select all
org.apache.jackrabbit.extractor.PlainTextExtractor
org.apache.jackrabbit.extractor.MsWordTextExtractor
org.apache.jackrabbit.extractor.MsExcelTextExtractor
org.apache.jackrabbit.extractor.MsPowerPointTextExtractor
org.apache.jackrabbit.extractor.OpenOfficeTextExtractor
org.apache.jackrabbit.extractor.RTFTextExtractor
org.apache.jackrabbit.extractor.HTMLTextExtractor
org.apache.jackrabbit.extractor.XMLTextExtractor
org.apache.jackrabbit.extractor.PngTextExtractor
org.apache.jackrabbit.extractor.MsOutlookTextExtractor
com.openkm.extractor.PdfTextExtractor
com.openkm.extractor.AudioTextExtractor
com.openkm.extractor.ExifTextExtractor
com.openkm.extractor.CuneiformTextExtractor
com.openkm.extractor.SourceCodeTextExtractor
com.openkm.extractor.MsOffice2007TextExtractor
Does that mean Tesseract is missing from this config, and that OpenKM is normally aware of the tempfile.txt.txt file?

Note: my temp directory still has the following content:
Code: Select all
~/tomcat-7.0.27/temp$ ls
okm1043105818973058510.txt.txt	okm3778300903172236830.txt.txt	okm7313556725340927314.txt.txt	okm7853102012711839460.txt.txt
okm1238734102778642262.txt.txt	okm3944205123158091582.txt.txt	okm7444060295834607103.txt.txt	okm8579924113660039877.txt.txt
okm1868275720582807002.txt.txt	okm412239872954792346.txt.txt	 okm776575320287834025.txt.txt	 okm9214329425559532.txt.txt
okm2389180805481113821.txt.txt	okm5470182959380227958.txt.txt	okm7834870244283743619.txt.txt	safeToDelete.tmp
Image and .txt files are properly taken care of, but these .txt.txt files generated by tesseract (3.02) aren't handled by OpenKM.
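To check whether such leftovers are piling up, something like the following sketch could be used. The function name list_stale_ocr_output is made up for this example, and the okm*.txt.txt pattern is only inferred from the listing above. It lists the files and deliberately deletes nothing. Note that -maxdepth is supported by GNU and BSD find but is not part of POSIX.

```shell
#!/bin/bash
# list_stale_ocr_output is a hypothetical helper name. It prints the
# double-extension files directly inside the given directory and does not
# remove anything.
list_stale_ocr_output() {
    find "$1" -maxdepth 1 -type f -name 'okm*.txt.txt'
}
```

It would be called with the Tomcat temp directory as its only argument, for example list_stale_ocr_output ~/tomcat-7.0.27/temp.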
