Open Source Document Management System | OpenKM

Failed to extract OCR text

#3171 by djdifulvio
Wed Sep 30, 2009 4:54 pm

Getting the following error when adding TIFF files:

12:45:02,545 WARN [TiffTextExtractor] Failed to extract OCR text
java.io.FileNotFoundException: /tmp/okm1330748909376977501.txt (No such file or directory)

Current setup/build is:
Ubuntu 9.04 Server x64
Sun Java 1.5.0_19-b02
ImageMagick 6.5.6-6
Tesseract 2.04 (with English Tessdata)
libtiff4 and libtiff4-dev

Running "tesseract" manually works fine, apart from a handful of "unknown field tag" errors. tiffinfo shows the file is a multi-page TIFF, 2550x3300 at 300 dpi, 1-bit, with CCITT Group 4 compression. I ran "convert" on one file to resample it to 72 dpi, but still got the same error.

Any ideas?

Re:Failed to extract OCR text

#3172 by djdifulvio
Wed Sep 30, 2009 9:43 pm

In reviewing the server.log I found that this command is sent:

convert -depth 8 -monochrome /tmp/filename.tiff /tmp/filenamenew.tiff

However, when I run this command manually I get this error:

convert: no decode delegate for this image format `filename.tif' @ constitute.c/ReadImage/503.
convert: missing an image filename `filenamenew.tif' @ convert.c/ConvertImageCommand/2822.

Ideas?

Re:Failed to extract OCR text

#3173 by oliver
Wed Sep 30, 2009 10:07 pm

I'm having exactly the same problem. I can see OpenKM creating temporary TIFF files in /var/tmp:

Code:

-rw-r--r--   1 root     root      111838 Sep 30 22:53 bin55572.tmp
-rw-r--r--   1 root     root      111838 Sep 30 22:53 bin55573.tmp

These tmp files are actually TIFF files: if I rename one to end in .tif, I can run it through tesseract without any problems. It's just that the oddly named file OpenKM is looking for doesn't exist:

Code:

22:53:36,622 WARN  [TiffTextExtractor] Failed to extract OCR text
java.io.FileNotFoundException: /var/tmp/okm55581.txt (No such file or directory)

Could it be that tesseract is failing and therefore not creating the .txt file, hence the IOException saying it can't find the file?
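The hypothesis above is easy to test outside OpenKM. The following is a minimal sketch, not OpenKM's actual code: it assumes the usual `tesseract <image> <output base>` command form, where tesseract writes `<output base>.txt`, and the helper name `run_ocr` is hypothetical. It reports the OCR tool's exit status and stderr instead of failing later on a missing file.

```python
import subprocess
from pathlib import Path


def run_ocr(command, output_txt):
    """Run an external OCR command and return the text it wrote.

    `command` is the full argument list, for example
    ["tesseract", "/tmp/page.tif", "/tmp/page"] (tesseract appends ".txt").
    `output_txt` is the file the command is expected to create.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    out = Path(output_txt)
    if result.returncode != 0 or not out.is_file():
        # Report the real failure here rather than a later "file not found".
        raise RuntimeError(
            f"OCR command failed (exit {result.returncode}), "
            f"output present: {out.is_file()}, "
            f"stderr: {result.stderr.strip()!r}"
        )
    return out.read_text(encoding="utf-8", errors="replace")
```

If the external command exits cleanly but never writes its output, or exits with an error, the message shows which of the two happened, which is the information the FileNotFoundException in the log hides.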

I'm running on Solaris 10:
-bash-3.00$ java -version
java version "1.5.0_15"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_15-b04)
Java HotSpot(TM) Client VM (build 1.5.0_15-b04, mixed mode, sharing)

With the latest OpenKM version.

Re:Failed to extract OCR text

#3174 by djdifulvio
Wed Sep 30, 2009 10:42 pm

:Update:

Found the cause and solution.

In my case the cause was that the convert command was not working, so the file that tesseract was supposed to process was never created. The reason was that when I built ImageMagick I had only installed libtiff, and I needed to install more. After downloading and building tiff-3.9.1.tar.gz from http://ftp.fifi.org/ImageMagick/delegates/, the errors from the convert command were resolved.

That resolved my main problem. Now I just have to find a way to make the convert and OCR steps faster; it is P3 slow.
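The missing TIFF delegate described above can be checked for before uploading anything. This is a minimal sketch under assumptions: it relies on `convert -list format` printing one line per format with the format name (optionally followed by `*`) in the first column and a mode such as `rw+` or `---` in the third. The column layout may differ between ImageMagick versions, and the function names are hypothetical.

```python
import subprocess


def can_read_format(listing, name="TIFF"):
    """Return True if `listing` shows `name` with a readable mode.

    `listing` is the text printed by `convert -list format`.
    """
    for line in listing.splitlines():
        fields = line.split()
        # Assumed columns: Format, Module, Mode, Description.
        if len(fields) >= 3 and fields[0].rstrip("*").upper() == name.upper():
            return fields[2].startswith("r")
    return False


def convert_supports(name="TIFF", convert="convert"):
    """Ask the local ImageMagick binary whether it can read the format."""
    result = subprocess.run([convert, "-list", "format"],
                            capture_output=True, text=True)
    return result.returncode == 0 and can_read_format(result.stdout, name)
```

On a build like the one that produced the "no decode delegate" error, `convert_supports("TIFF")` would be expected to return False until ImageMagick is rebuilt against a working TIFF library.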

Re:Failed to extract OCR text

#3391 by MartinR
Mon Nov 30, 2009 5:50 pm

Hello,

I have the same problem. I tried uploading G4-compressed TIFFs.

When I run convert manually I get:
convert: Bits/sample must be 1 for Group 3/4 encoding/decoding. `t1.tif' @ tiff.c/TIFFErrors/493.

I've compiled tesseract with libtiff, so I replaced convert with a shell script that copies the file.

The next problem was that German umlauts were not recognized.

So in tesseractmain.cpp I changed
const char* lang = "eng";
to
const char* lang = "deu";

When I test on a console, everything is fine, but in server.log the text appears as H??r L??f (it should be Hör Löf).

Best regards Martin
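Two question marks per umlaut is the pattern one would expect when UTF-8 text is decoded or logged with an ASCII-only charset, since ö occupies two bytes in UTF-8. Whether that is what happens inside OpenKM or its logging setup is an assumption; the thread does not confirm it. A small demonstration of the effect, with a hypothetical helper name:

```python
def decode_as_ascii(text):
    """Encode text as UTF-8, then decode it as ASCII like a misconfigured reader."""
    raw = text.encode("utf-8")
    # Each non-ASCII byte becomes U+FFFD; show it as "?" as a plain-ASCII log would.
    return raw.decode("ascii", errors="replace").replace("\ufffd", "?")


print(decode_as_ascii("Hör Löf"))  # H??r L??f
```

If this is the cause, the places to look would be the JVM's default charset (for example the `file.encoding` system property) and the locale of the account running the server, since a console session often has a UTF-8 locale while a service account may not.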
