• Failed to extract OCR text

  • OpenKM has many interesting features, but requires some configuration process to show its full potential.
 #3171  by djdifulvio
 
Getting the following error when adding TIFF files:

12:45:02,545 WARN [TiffTextExtractor] Failed to extract OCR text
java.io.FileNotFoundException: /tmp/okm1330748909376977501.txt (No such file or directory)

Current setup/build is:
Ubuntu 9.04 Server x64
Sun Java 1.5.0_19-b02
ImageMagick 6.5.6-6
Tesseract 2.04 (with English Tessdata)
Tifflib4 and Tifflib4-dev

Running "tesseract" manually works fine, apart from a handful of "unknown field tag" errors. Tiffinfo shows the file is a multi-page TIFF at 2550x3300, 300 dpi, 1 bit, with CCITT Group 4 compression. I ran "convert" on one file to resample it to 72 dpi, but still got the same error.

Any ideas?
 #3172  by djdifulvio
 
In reviewing the server.log I found that this command is sent:

convert -depth 8 -monochrome /$rome /tmp/filename.tiff /tmp/filenamenew.tiff

However in trying this command manually I am getting this error:

convert: no decode delegate for this image format `filename.tif' @ constitute.c/ReadImage/503.
convert: missing an image filename `filenamenew.tif' @ convert.c/ConvertImageCommand/2822.

Ideas?
 #3173  by oliver
 
I'm having exactly the same problem. I can see OpenKM creating temporary TIFF files in /var/tmp:
-rw-r--r--   1 root     root      111838 Sep 30 22:53 bin55572.tmp
-rw-r--r--   1 root     root      111838 Sep 30 22:53 bin55573.tmp
These tmp files are actually TIFF files, so if I rename one to end in .tif I can run it through tesseract without any problems. The only issue is the oddly named file it's looking for, which doesn't exist:
22:53:36,622 WARN  [TiffTextExtractor] Failed to extract OCR text
java.io.FileNotFoundException: /var/tmp/okm55581.txt (No such file or directory)
Could it be that tesseract is failing and so not creating the .txt file, hence the IOException saying it can't find the file?

I\'m running on Solaris 10
-bash-3.00$ java -version
java version "1.5.0_15"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_15-b04)
Java HotSpot(TM) Client VM (build 1.5.0_15-b04, mixed mode, sharing)

This is with the latest OpenKM version.
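
The hypothesis above (tesseract fails, so the .txt file is never written) can be checked outside OpenKM. Here is a minimal sketch, assuming the classic `tesseract <image> <outputbase>` invocation that writes `<outputbase>.txt`. The function name `check_ocr_output` is hypothetical and is not part of OpenKM or tesseract:

```shell
# Run a command and confirm that it really produced the expected output file.
# Usage: check_ocr_output <expected-output-file> <command> [args...]
check_ocr_output() (
    out=$1
    shift
    # Run the command exactly as given and report its exit status on failure.
    "$@" || {
        status=$?
        echo "command failed with exit status $status: $*" >&2
        exit 1
    }
    # A zero exit status is not enough: the file must exist and be non-empty.
    if [ ! -s "$out" ]; then
        echo "command succeeded but $out is missing or empty" >&2
        exit 1
    fi
    echo "ok: $out"
)
```

For example, after copying one of the bin*.tmp files to a name ending in .tif, something like `check_ocr_output /var/tmp/test.txt tesseract /var/tmp/test.tif /var/tmp/test` would show whether the OCR step itself is at fault (the `test` names are placeholders).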
 #3174  by djdifulvio
 
:Update:

Found the cause and solution.

In my case the cause was that the convert command was not working, so the file tesseract needed to process was never created. When I built ImageMagick I had only installed libtiff, and I needed to install more. After downloading and building tiff-3.9.1.tar.gz from http://ftp.fifi.org/ImageMagick/delegates/, the errors from the convert command were resolved.
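
For anyone hitting the same "no decode delegate" message, one way to check whether an ImageMagick build can read TIFF is to inspect the output of `convert -list format`. The sketch below is only an illustration: the helper name `has_tiff_read_support` is made up, and it assumes the listing uses the column layout "Format Module Mode Description", with a mode starting with `r` for readable formats.

```shell
# Read `convert -list format` output on stdin and succeed only if a TIFF
# entry with read support is present.
# Assumed column layout: Format  Module  Mode  Description
has_tiff_read_support() {
    awk '$1 ~ /^TIFF\*?$/ && $3 ~ /^r/ { found = 1 } END { exit !found }'
}
```

Usage would look like `convert -list format | has_tiff_read_support && echo "TIFF can be read"`. If nothing is printed, the build probably lacks the TIFF delegate and needs to be rebuilt against the TIFF library.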

That resolved my main problem. Now I just have to find a way to make the convert and OCR steps faster; it is P3 slow.
 #3391  by MartinR
 
Hello,

I have the same problem. I tried uploading G4-compressed TIFFs.

When I run convert manually, I get:
convert: Bits/sample must be 1 for Group 3/4 encoding/decoding. `t1.tif' @ tiff.c/TIFFErrors/493.

I've compiled tesseract with libtiff, so I replaced convert with a shell script that copies the file.
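
A pass-through replacement of this kind might look roughly like the sketch below. It is not the script actually used in this post. It assumes the call has the form `convert [options] input output`, as in the command quoted earlier in the thread, and simply copies the second-to-last argument to the last one. The function name `convert_passthrough` is hypothetical.

```shell
# Stand-in for `convert`: ignore all options and copy the input file
# (second-to-last argument) to the output file (last argument) unchanged.
convert_passthrough() (
    if [ "$#" -lt 2 ]; then
        echo "usage: convert_passthrough [options] input output" >&2
        exit 2
    fi
    src=
    dst=
    # Walk the arguments so that src/dst end up as the last two.
    for arg in "$@"; do
        src=$dst
        dst=$arg
    done
    cp -- "$src" "$dst"
)
```

This only makes sense when tesseract can read the original TIFF directly, since no depth or colour conversion happens at all.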

The next problem was that German umlauts are not recognized.

So in tesseractmain.cpp I changed
const char* lang = "eng";
to
const char* lang = "deu";

When I test on a console, everything is fine, but in the server.log
H??r L??f appears (it should be Hör Löf).
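
The two question marks per umlaut are consistent with each umlaut being a two-byte UTF-8 sequence whose bytes are individually replaced somewhere along the way by an ASCII-only consumer. That is only one possible explanation and is not confirmed here. A small sketch that reproduces the pattern (the helper name `ascii_only` is made up for illustration):

```shell
# Replace every byte outside 7-bit ASCII with '?', mimicking a consumer that
# cannot represent non-ASCII bytes. LC_ALL=C makes tr work byte by byte.
ascii_only() {
    LC_ALL=C tr '\200-\377' '[?*]'
}

# "Hör Löf" written as UTF-8 bytes (ö is the byte pair \303\266).
printf 'H\303\266r L\303\266f\n' | ascii_only   # prints: H??r L??f
```

If this matches what happens in the server, checking the locale and default character encoding the Java process runs with would be a reasonable next step.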

Best regards Martin
