
OCR issues - convert pre-process not working

Posted: Wed Apr 13, 2011 2:47 pm
by sess
OpenKM 5.0.3 (build: 5159) on Ubuntu server
tesseract 3
imagemagick 6.6.2.6-1ubuntu1.1

Running tesseract from the terminal, I'm able to OCR a TIFF file without any problems, but when I upload the same document to OKM it does not get processed. This is what I get in my log file:
Code:
2011-04-13 15:06:05,607 DEBUG [com.openkm.extractor.TiffTextExtractor] CMD: convert -depth 8 -monochrome /tmp/okm2222865511775852701.tif /tmp/okm8301523301139778998.tif
2011-04-13 15:06:05,797 DEBUG [com.openkm.extractor.TiffTextExtractor] CMD: convert -depth 8 -monochrome /tmp/okm2134408032814981524.tif /tmp/okm7437339512715843922.tif
2011-04-13 15:06:08,968 DEBUG [com.openkm.extractor.TiffTextExtractor] CMD: /usr/bin/tesseract /tmp/okm7437339512715843922.tif /tmp/okm998230814336069662
2011-04-13 15:06:08,969 DEBUG [com.openkm.extractor.TiffTextExtractor] CMD: /usr/bin/tesseract /tmp/okm8301523301139778998.tif /tmp/okm1392151765958597643
2011-04-13 15:06:09,216 DEBUG [com.openkm.extractor.TiffTextExtractor] TEXT: 
2011-04-13 15:06:09,338 DEBUG [com.openkm.extractor.TiffTextExtractor] TEXT: 
2011-04-13 15:06:09,362 INFO  [org.apache.jackrabbit.core.query.lucene.MultiIndex] updating index with 1 nodes from indexing queue.
The command convert -depth 8 -monochrome gets called first, then tesseract is run for the OCR. Testing the same convert command from the terminal (with the same TIFF file), I get this error:
Code:
> /usr/bin/convert -depth 8 -monochrome /root/file.tif /tmp/filenew.tif
convert: Bits/sample must be 1 for Group 3/4 encoding/decoding. `/tmp/filenew.tif' @ warning/tiff.c/TIFFErrors/494.
Any ideas as to what's causing this? I've checked for missing libraries, but from what I can see everything looks in order.

Re: OCR issues - convert pre-process not working

Posted: Wed Apr 13, 2011 3:31 pm
by sess
convert only seems to work on this file with the depth set to 1:

/usr/bin/convert -depth 1 -monochrome /root/file.tif /tmp/filenew.tif
file.tif-depth1.png (33.3 KiB)
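
Building on the observation above, here is a rough sketch of a wrapper that forwards a convert call but replaces whatever value follows -depth with 1. The function name force_depth_1 is made up for illustration, and whether tesseract in OpenKM is happy with the depth-1 output is an assumption that would need testing:

```shell
#!/bin/sh
# Hypothetical helper (not part of OpenKM or ImageMagick).
# Usage: force_depth_1 <real convert binary> <original convert arguments...>
force_depth_1() {
    real_convert=$1
    shift
    n=$#
    prev=
    # Rotate through the argument list once, rewriting the value after -depth.
    while [ "$n" -gt 0 ]; do
        arg=$1
        shift
        if [ "$prev" = "-depth" ]; then
            arg=1
        fi
        set -- "$@" "$arg"
        prev=$arg
        n=$((n - 1))
    done
    # Run the real binary with the rewritten arguments.
    "$real_convert" "$@"
}
```

For example, force_depth_1 /usr/bin/convert -depth 8 -monochrome /root/file.tif /tmp/filenew.tif would end up running the same command as the one shown above with -depth 1.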

Re: OCR issues - convert pre-process not working

Posted: Wed Apr 20, 2011 7:15 am
by pavila
In our testing, 8-bit depth works fine. Is the problem specific to one particular TIFF file, or does it happen with many others? Can you attach a sample TIFF here so we can test it?

Re: OCR issues - convert pre-process not working

Posted: Wed Apr 20, 2011 11:59 am
by sess
Thanks pavila,
our scanner produces TIFF documents with Group 3/4 compression. I've attached a sample for you to look at.

Re: OCR issues - convert pre-process not working

Posted: Fri Apr 22, 2011 7:09 am
by pavila
The TIFF has some problems:
Code:
TIFFReadDirectory: Warning, 20110420_185355.tif: invalid TIFF directory; tags are not sorted in ascending order.
TIFFReadDirectory: Warning, 20110420_185355.tif: unknown field with tag 32931 (0x80a3) encountered.
TIFFReadDirectory: Warning, 20110420_185355.tif: unknown field with tag 32934 (0x80a6) encountered.
But the OCR text extraction itself works fine. You have several options:
  • Fix the problems in the TIFF files
  • Download the source code, comment out the call to ImageMagick convert, and recompile
  • Create a dummy "convert" which only copies the input file to the output without any modification
  • Wait until next week and test the new major OpenKM release (currently there is no upgrade document)
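
The dummy "convert" option could look roughly like the sketch below. It assumes OpenKM finds convert through the PATH (the log shows it being called without a full path) and always passes the input and output files as the last two arguments. The function name dummy_convert is just for illustration:

```shell
#!/bin/sh
# Hypothetical stand-in for ImageMagick's convert: ignores every option and
# copies the input file (second-to-last argument) to the output file (last).
dummy_convert() {
    if [ "$#" -lt 2 ]; then
        echo "usage: convert [options] input output" >&2
        return 1
    fi
    input=
    output=
    # Walk the arguments, remembering only the last two.
    for arg in "$@"; do
        input=$output
        output=$arg
    done
    cp -- "$input" "$output"
}

# When saved as an executable script named "convert", forward the arguments.
if [ "$#" -gt 0 ]; then
    dummy_convert "$@"
fi
```

Saved as an executable file named convert in a directory that comes before /usr/bin in the PATH of the user running OpenKM, it would make the pre-process step a plain copy, so tesseract receives the original TIFF. Keep in mind that it also shadows the real convert for anything else run with that PATH.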