OCR issues - convert pre-process not working

  • OpenKM has many interesting features, but requires some configuration process to show its full potential.
 #10427  by sess
 
OpenKM 5.0.3 (build: 5159) on Ubuntu server
tesseract 3
imagemagick 6.6.2.6-1ubuntu1.1

Running tesseract from the terminal, I'm able to OCR a TIFF file without any problems, but when I upload the same document to OKM it does not get processed. This is what I get in my log file:
Code:
2011-04-13 15:06:05,607 DEBUG [com.openkm.extractor.TiffTextExtractor] CMD: convert -depth 8 -monochrome /tmp/okm2222865511775852701.tif /tmp/okm8301523301139778998.tif
2011-04-13 15:06:05,797 DEBUG [com.openkm.extractor.TiffTextExtractor] CMD: convert -depth 8 -monochrome /tmp/okm2134408032814981524.tif /tmp/okm7437339512715843922.tif
2011-04-13 15:06:08,968 DEBUG [com.openkm.extractor.TiffTextExtractor] CMD: /usr/bin/tesseract /tmp/okm7437339512715843922.tif /tmp/okm998230814336069662
2011-04-13 15:06:08,969 DEBUG [com.openkm.extractor.TiffTextExtractor] CMD: /usr/bin/tesseract /tmp/okm8301523301139778998.tif /tmp/okm1392151765958597643
2011-04-13 15:06:09,216 DEBUG [com.openkm.extractor.TiffTextExtractor] TEXT: 
2011-04-13 15:06:09,338 DEBUG [com.openkm.extractor.TiffTextExtractor] TEXT: 
2011-04-13 15:06:09,362 INFO  [org.apache.jackrabbit.core.query.lucene.MultiIndex] updating index with 1 nodes from indexing queue.
The command convert -depth 8 -monochrome gets called first, then tesseract is run for the OCR. Testing the same convert command from the terminal (with the same TIFF file), I get this error:
Code:
> /usr/bin/convert -depth 8 -monochrome /root/file.tif /tmp/filenew.tif
convert: Bits/sample must be 1 for Group 3/4 encoding/decoding. `/tmp/filenew.tif' @ warning/tiff.c/TIFFErrors/494.
Any ideas as to what's causing this? I've checked for missing libraries, but from what I can see everything looks in order.
 #10428  by sess
 
convert only seems to work with the depth set to 1:

/usr/bin/convert -depth 1 -monochrome /root/file.tif /tmp/filenew.tif
file.tif-depth1.png (33.3 KiB)
 #10601  by pavila
 
In our testing, 8-bit depth works fine. Is the problem specific to one TIFF file, or does it happen with many others? Can you attach a sample TIFF here so we can test it?
 #10616  by sess
 
Thanks pavila,
Our scanner produces TIFF documents with Group 3/4 compression. I've attached a sample for you to look at.
Attachments
20110420_185355.tif (35 KiB)
 #10718  by pavila
 
The TIFF has some problems:
Code:
TIFFReadDirectory: Warning, 20110420_185355.tif: invalid TIFF directory; tags are not sorted in ascending order.
TIFFReadDirectory: Warning, 20110420_185355.tif: unknown field with tag 32931 (0x80a3) encountered.
TIFFReadDirectory: Warning, 20110420_185355.tif: unknown field with tag 32934 (0x80a6) encountered.
But the OCR text extraction works fine. You have several options:
  • Fix the TIFF problems.
  • Download the source code, comment out the call to ImageMagick convert, and recompile.
  • Create a dummy "convert" that only copies the input file to the output without any modification.
  • Wait until next week and test the new OpenKM major release (currently there is no document upgrade).
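
For the dummy "convert" option, a minimal sketch of such a pass-through script might look like the following. It assumes OpenKM always calls convert with the input file as the second-to-last argument and the output file as the last one, as in the log lines earlier in this thread. The function name passthrough_convert is only illustrative, and how you make OpenKM pick this script up instead of the real convert depends on your installation.

```shell
#!/bin/sh
# Hypothetical stand-in for ImageMagick's convert (not part of OpenKM).
# It ignores every option (-depth, -monochrome, ...) and simply copies
# the input file to the output file, unchanged.
passthrough_convert() {
    if [ "$#" -lt 2 ]; then
        echo "usage: convert [options] input output" >&2
        return 1
    fi
    input=""
    output=""
    # Walk all arguments, remembering the last two seen:
    # the second-to-last is taken as input, the last as output.
    for arg in "$@"; do
        input=$output
        output=$arg
    done
    cp -- "$input" "$output"
}

# When run as a script with arguments, behave like the fake convert.
if [ "$#" -gt 0 ]; then
    passthrough_convert "$@"
fi
```

Saved as an executable file named convert, a call such as convert -depth 8 -monochrome /tmp/in.tif /tmp/out.tif would then just copy /tmp/in.tif to /tmp/out.tif. Keep in mind that this disables the pre-processing for every image, not only for the Group 3/4 TIFFs that trigger the warning.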
