Failed to extract OCR text

Posted: Wed Sep 30, 2009 4:54 pm
by djdifulvio
Getting the following error when adding TIFF files:

12:45:02,545 WARN [TiffTextExtractor] Failed to extract OCR text
java.io.FileNotFoundException: /tmp/okm1330748909376977501.txt (No such file or directory)

Current setup/build is:
Ubuntu 9.04 Server x64
Sun Java 1.5.0_19-b02
ImageMagick 6.5.6-6
Tesseract 2.04 (with English Tessdata)
Tifflib4 and Tifflib4-dev

Running "tesseract" manually works fine, apart from a handful of "unknown field tag" errors. tiffinfo shows the file is a multi-page TIFF, 2550x3300 at 300 dpi, 1-bit, with CCITT Group 4 compression. I ran "convert" on one file to resample it to 72 dpi, but still got the same error.

Any ideas?

Re:Failed to extract OCR text

Posted: Wed Sep 30, 2009 9:43 pm
by djdifulvio
In reviewing the server.log I found that this command is sent:

convert -depth 8 -monochrome /$rome /tmp/filename.tiff /tmp/filenamenew.tiff

However, when I try this command manually I get this error:

convert: no decode delegate for this image format `filename.tif' @ constitute.c/ReadImage/503.
convert: missing an image filename `filenamenew.tif' @ convert.c/ConvertImageCommand/2822.

Ideas?

Re:Failed to extract OCR text

Posted: Wed Sep 30, 2009 10:07 pm
by oliver
I'm having exactly the same problem. I can see OpenKM creating temporary TIFF files in /var/tmp:
Code:
-rw-r--r--   1 root     root      111838 Sep 30 22:53 bin55572.tmp
-rw-r--r--   1 root     root      111838 Sep 30 22:53 bin55573.tmp
These tmp files are actually TIFF files, so if I rename one to end in .tif I can run it through tesseract without any problems. The failure is only that the oddly named file OpenKM looks for doesn't exist:
Code:
22:53:36,622 WARN  [TiffTextExtractor] Failed to extract OCR text
java.io.FileNotFoundException: /var/tmp/okm55581.txt (No such file or directory)
Could it be that tesseract is failing and so never creates the .txt file, hence the FileNotFoundException saying it can't find the file?

I'm running on Solaris 10:
-bash-3.00$ java -version
java version "1.5.0_15"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_15-b04)
Java HotSpot(TM) Client VM (build 1.5.0_15-b04, mixed mode, sharing)

This is with the latest OpenKM version.
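
The hypothesis above (an external tool fails silently, and the later read of its .txt output throws FileNotFoundException) can be made visible with a check right after the tool runs. The following is only a minimal sketch: OcrStepCheck, diagnose and runAndVerify are hypothetical names for illustration and are not part of OpenKM, whose actual extractor code is not shown in this thread.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Hypothetical helper (not OpenKM code): runs one external step and checks its result. */
final class OcrStepCheck {

    /** Returns null when the step looks healthy, otherwise a human-readable reason. */
    static String diagnose(int exitCode, boolean outputExists) {
        if (exitCode != 0) {
            return "tool exited with status " + exitCode;
        }
        if (!outputExists) {
            return "tool reported success but did not create its output file";
        }
        return null;
    }

    /** Runs the command and fails immediately instead of letting a later read hit a missing file. */
    static void runAndVerify(List<String> command, Path expectedOutput)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .inheritIO() // pass the tool's own stdout/stderr through to our console
                .start();
        int exitCode = process.waitFor();
        String problem = diagnose(exitCode, Files.isRegularFile(expectedOutput));
        if (problem != null) {
            throw new IOException(String.join(" ", command) + ": " + problem);
        }
    }
}
```

As a usage example with made-up paths: tesseract is invoked as "tesseract imagename outputbase" and writes outputbase.txt, so a call such as runAndVerify(Arrays.asList("tesseract", "/tmp/page.tif", "/tmp/page"), Paths.get("/tmp/page.txt")) would report which step failed rather than a bare missing-file error.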

Re:Failed to extract OCR text

Posted: Wed Sep 30, 2009 10:42 pm
by djdifulvio
Update:

Found the cause and solution.

In my case the cause was that the convert command was failing, so the file that tesseract needs to process was never created. The reason was that when I built ImageMagick I had only installed Libtiff, and more was needed. After downloading and building tiff-3.9.1.tar.gz from http://ftp.fifi.org/ImageMagick/delegates/, the errors from the convert command were resolved.

That resolved my main problem. Now I just have to find a way to make the convert and OCR steps faster; it is P3 slow.
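
For anyone hitting the same cause, the delegate failure can be detected by capturing what convert prints and looking for the message quoted earlier in this thread. This is a rough sketch under assumptions: ConvertProbe and its methods are hypothetical names, the match relies only on the "no decode delegate" wording shown above, and readAllBytes needs Java 9 or later.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Hypothetical helper (not OpenKM code): captures a tool's output and spots the delegate error. */
final class ConvertProbe {

    /** True when the output contains the ImageMagick delegate error quoted in this thread. */
    static boolean isMissingDelegate(String toolOutput) {
        return toolOutput.contains("no decode delegate for this image format");
    }

    /** Runs a command and returns its stdout and stderr merged into one string. */
    static String capture(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .redirectErrorStream(true) // fold stderr into stdout so errors are captured too
                .start();
        byte[] raw = process.getInputStream().readAllBytes(); // Java 9+
        process.waitFor();
        return new String(raw, StandardCharsets.UTF_8);
    }
}
```

If isMissingDelegate(capture(...)) returns true for the convert command taken from server.log, the ImageMagick build lacks a usable TIFF delegate, which is the situation described above.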

Re:Failed to extract OCR text

Posted: Mon Nov 30, 2009 5:50 pm
by MartinR
Hello,

I have the same problem. I tried uploading G4-compressed TIFFs.

When I run convert manually I get:
convert: Bits/sample must be 1 for Group 3/4 encoding/decoding. `t1.tif' @ tiff.c/TIFFErrors/493.

I've compiled tesseract with libtiff, so I replaced convert with a shell script that copies the file.

The next problem was that German umlauts are not recognized.

So in tesseractmain.cpp I changed
const char* lang = "eng";
to
const char* lang = "deu";

When I test on a console everything is fine, but in the server.log
H??r L??f appears (it should be Hör Löf).

Best regards Martin
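
Two question marks per umlaut is consistent with UTF-8 text (which is what tesseract writes) being decoded with a charset that cannot represent those bytes, for example when the JVM's default charset is US-ASCII. That is only one possible explanation, and nothing in this thread confirms how OpenKM reads the file. The sketch below reproduces the symptom; UmlautDecodingDemo and its methods are hypothetical names for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo (not OpenKM code): shows how a wrong charset turns umlauts into "??". */
final class UmlautDecodingDemo {

    /** "Hör Löf" encoded as UTF-8; each ö becomes two bytes. */
    static byte[] sampleUtf8() {
        return "H\u00f6r L\u00f6f".getBytes(StandardCharsets.UTF_8);
    }

    /** Decodes with an explicit charset instead of relying on the platform default. */
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    /** Simulates writing text to an ASCII-only log: unmappable characters become '?'. */
    static String asAsciiLog(String text) {
        return new String(text.getBytes(StandardCharsets.US_ASCII), StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] ocrBytes = sampleUtf8();
        // Wrong charset: each of the two bytes of 'ö' is replaced separately.
        System.out.println(asAsciiLog(decode(ocrBytes, StandardCharsets.US_ASCII)));
        // Correct charset: the original characters are recovered.
        System.out.println(decode(ocrBytes, StandardCharsets.UTF_8).equals("H\u00f6r L\u00f6f"));
    }
}
```

If this is the cause, checking the JVM's default charset (Charset.defaultCharset()) under the account that starts the server, and comparing it with the console session where the test works, would be a reasonable next step.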