OCR with tesseract - Problem & dirty workaround
Posted: Wed Feb 13, 2013 3:28 am
Googling for "Too few text extracted", I found many reports of problems getting tesseract to work, but no solutions.
Comparing Tesseract 3 and Cuneiform on the command line, I noticed two things:
1. Tesseract has improved greatly, and in my examples it worked better than Cuneiform.
2. Passing the output file parameter with the desired extension resulted in a file with the extension doubled ("tesseract in.jpg out.txt" -> "out.txt.txt").
The second point led me to some experiments. The call "/path/to/tesseract ${fileIn} ${fileOut}" also passes the output file name including its extension, so tesseract writes a file with a doubled extension. I suspect that OpenKM simply can't find the generated output file, hence "Too few text extracted". There might be more precise error messages out there...
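To make the naming mismatch concrete, here is a small sketch in plain Bash. It does not call tesseract at all; tesseract_output_name is a made-up helper that only mimics the behaviour I observed on the command line (".txt" gets appended to whatever output name is passed).

```shell
#!/bin/bash
# Hypothetical helper, not part of tesseract or OpenKM: prints the file name
# tesseract ends up writing for a given output argument, based on the
# observed behaviour that ".txt" is appended to the name as given.
tesseract_output_name() {
    printf '%s.txt\n' "$1"
}

requested="out.txt"                        # what ${fileOut} looks like
tesseract_output_name "$requested"         # prints out.txt.txt (not what OpenKM expects)
tesseract_output_name "${requested%.*}"    # prints out.txt (extension stripped first)
```

So stripping the extension before handing the name to tesseract should give the file OpenKM is looking for.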
I wrote a tiny workaround, mainly just for testing. I can't really write Bash, but the following worked for me on Linux. Don't use it 1:1; it will probably lead to even more errors.
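(-l deu is for German; eng should be the default)
Code:
#!/bin/bash
# Wrapper around tesseract: strip the extension from the output name,
# because tesseract appends .txt on its own.
ifname=$1
ofname=$2
ofext="${ofname##*.}"    # extension of the requested output file (not used further)
ofname="${ofname%.*}"    # output name without the extension
/usr/bin/tesseract "$ifname" "$ofname" -l deu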
I saved this file as /usr/bin/teswrap (it has to be executable) and changed the command in OpenKM to /usr/bin/teswrap ${fileIn} ${fileOut}
The cleaner way would be if we could use ${fileOutBase} in the command string in OpenKM.
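Until something like that exists, a slightly more defensive version of the wrapper might look like the sketch below. It assumes that ${fileOut} normally ends in .txt and strips only that suffix, so a dot elsewhere in the path (for example in a directory name) is left alone. TESSERACT_BIN and OCR_LANG are environment variables I made up for this example; they are not OpenKM or tesseract settings. Untested against OpenKM itself.

```shell
#!/bin/bash
# Sketch of a more defensive tesseract wrapper.
# TESSERACT_BIN and OCR_LANG are hypothetical variables introduced here.

# Strip a trailing ".txt" only; any other name is passed through unchanged.
output_base() {
    case "$1" in
        *.txt) printf '%s\n' "${1%.txt}" ;;
        *)     printf '%s\n' "$1" ;;
    esac
}

# Call tesseract with the stripped output name; quoting keeps paths
# containing spaces intact.
run_ocr() {
    local in=$1 out=$2
    "${TESSERACT_BIN:-/usr/bin/tesseract}" "$in" "$(output_base "$out")" -l "${OCR_LANG:-eng}"
}

# Only run OCR when called with exactly two arguments (input, output).
if [ "$#" -eq 2 ]; then
    run_ocr "$1" "$2"
fi
```

The command in OpenKM would stay the same as before, e.g. /usr/bin/teswrap ${fileIn} ${fileOut}, with OCR_LANG=deu set in the environment for German.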