• OCR/Indexing Problem

OpenKM has many interesting features, but requires some configuration process to show its full potential.
 #13839  by Alexires
 
I've upgraded to OpenKM 5.1.9 and it seems to be working (a Tesseract process showed up when running top on Ubuntu). No errors were thrown when scanning the document I linked to earlier, but none of the words were searchable. This could be because Tesseract recognized nothing, or OCR may still not be working. Does anyone have a test document without an OCR text layer, or one that has been used for testing in the past?
 #13843  by okmuser
 
I did a quick check on my server (5.1.9 Build:7446) with a PDF file that has no text layer, and found this error.

The PDF is not protected.
Code:
2012-02-08 19:53:31,196 WARN  [com.openkm.extractor.PdfTextExtractor] PDF does not contains text layer
2012-02-08 19:53:31,269 WARN  [com.openkm.util.ExecutionUtils] Abnormal program termination: 139
2012-02-08 19:53:31,271 WARN  [com.openkm.util.ExecutionUtils] STDERR: Tesseract Open Source OCR Engine
name_to_image_type:Error:Unrecognized image type:/tmp/Im238448562289920155604.tiff
IMAGE::read_header:Error:Can't read this image type:/tmp/Im238448562289920155604.tiff
/usr/bin/tesseract:Error:Read of file failed:/tmp/Im238448562289920155604.tiff

2012-02-08 19:53:31,345 WARN  [com.openkm.util.ExecutionUtils] Abnormal program termination: 139
2012-02-08 19:53:31,345 WARN  [com.openkm.util.ExecutionUtils] STDERR: Tesseract Open Source OCR Engine
name_to_image_type:Error:Unrecognized image type:/tmp/Im275595064840849541596.tiff
IMAGE::read_header:Error:Can't read this image type:/tmp/Im275595064840849541596.tiff
/usr/bin/tesseract:Error:Read of file failed:/tmp/Im275595064840849541596.tiff

2012-02-08 19:53:31,346 WARN  [com.openkm.extractor.RegisteredExtractors] There was a problem extracting text from '/okm:root/Temporary Folder/img-Y10085108-0001.pdf'

Cheers,
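
A note on the "Abnormal program termination: 139" lines above: by Unix shell convention, an exit status above 128 usually means the process was killed by signal (status - 128), and on Linux signal 11 is SIGSEGV. The sketch below decodes a status that way. It assumes OpenKM reports the status the same way a shell would, which the log does not confirm; `describe_exit_status` is a hypothetical helper name.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Describe a shell-style exit status (assumption: 128 + N means signal N)."""
    if status == 0:
        return "exit 0 (success)"
    if status > 128:
        signum = status - 128
        try:
            # Signal numbering is platform dependent; 11 is SIGSEGV on Linux.
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"exit {status} (likely killed by {name})"
    return f"exit {status} (program reported an error)"


if __name__ == "__main__":
    for status in (139, 1):
        print(describe_exit_status(status))
```

Read this way, 139 would point at Tesseract crashing on the TIFF, while the status 1 seen with other engines would be an ordinary error return.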
 #13889  by pavila
 
I can recommend Cuneiform, which supports many image types and has a better recognition engine. If you have problems with Tesseract, try running it from the command line to check that it works.
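
One way to run the command-line check suggested above, outside OpenKM, is sketched here. It assumes the usual `tesseract <image> <outputbase>` form and the `cuneiform <image> -o <output>` form; `build_ocr_command` and `run_ocr` are hypothetical helper names, and the image path is whatever test scan you supply.

```python
import shutil
import subprocess
import sys
from typing import List


def build_ocr_command(engine: str, image: str, output: str) -> List[str]:
    """Build the argument list for a manual OCR run."""
    if engine == "tesseract":
        # tesseract takes an output base name and appends ".txt" itself.
        base = output[:-4] if output.endswith(".txt") else output
        return ["tesseract", image, base]
    if engine == "cuneiform":
        return ["cuneiform", image, "-o", output]
    raise ValueError(f"unknown OCR engine: {engine}")


def run_ocr(engine: str, image: str, output: str) -> subprocess.CompletedProcess:
    """Run the OCR engine and capture its output for inspection."""
    if shutil.which(engine) is None:
        raise FileNotFoundError(f"{engine} is not on PATH")
    return subprocess.run(
        build_ocr_command(engine, image, output),
        capture_output=True,
        text=True,
    )


if __name__ == "__main__" and len(sys.argv) == 4:
    # Usage: python ocr_check.py <engine> <image> <output.txt>
    result = run_ocr(sys.argv[1], sys.argv[2], sys.argv[3])
    # With subprocess, a negative return code means "killed by that signal".
    print("return code:", result.returncode)
    print(result.stderr)
```

If the engine fails here with the same message as in server.log, the problem is in the OCR installation or the image format rather than in OpenKM.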
 #13925  by Alexires
 
Pavila - So you think Cuneiform is better? I was trying to work out which one would give me better results, and the only test I could find online had tesseract giving slightly better recognition levels.
 #13959  by pavila
 
In my experience, Cuneiform 1.0 and 1.0.1 give better word recognition, at least for English and Spanish texts.
 #14436  by Alexires
 
Was that in comparison to Tesseract 2 or Tesseract 3?

Also, how do I test if OCR is working correctly in my instance of OpenKM?
 #14477  by pavila
 
I think Cuneiform has a better recognition engine.
 #14959  by Alexires
 
Ok, so I've tried Cuneiform (via aptitude install) and I get this error in server.log:
Code:
2012-03-26 10:39:19,892 WARN  [com.openkm.extractor.PdfTextExtractor] PDF does not contains text layer
2012-03-26 10:39:20,380 WARN  [com.openkm.util.ExecutionUtils] Abnormal program termination: 1
2012-03-26 10:39:20,381 WARN  [com.openkm.util.ExecutionUtils] STDERR: PUMA_XFinalrecognition failed.

2012-03-26 10:39:20,380 WARN  [com.openkm.util.ExecutionUtils] Abnormal program termination: 1
2012-03-26 10:39:20,381 WARN  [com.openkm.util.ExecutionUtils] STDERR: PUMA_XFinalrecognition failed.

2012-03-26 10:39:20,380 WARN  [com.openkm.util.ExecutionUtils] Abnormal program termination: 1
2012-03-26 10:39:20,381 WARN  [com.openkm.util.ExecutionUtils] STDERR: PUMA_XFinalrecognition failed.

2012-03-26 10:39:20,380 WARN  [com.openkm.util.ExecutionUtils] Abnormal program termination: 1
2012-03-26 10:39:20,381 WARN  [com.openkm.util.ExecutionUtils] STDERR: PUMA_XFinalrecognition failed.

2012-03-26 10:39:20,380 WARN  [com.openkm.util.ExecutionUtils] Abnormal program termination: 1
2012-03-26 10:39:20,381 WARN  [com.openkm.util.ExecutionUtils] STDERR: PUMA_XFinalrecognition failed.

2012-03-26 10:39:22,345 WARN  [com.openkm.extractor.RegisteredExtractors] There was a problem extracting text from '/okm:root/Sciences and Mathematics/Physics/LA-UR--97-1534- Free Electron Laser.pdf'
This is on 5.1.9 running on Ubuntu 10.10 x64. I think it is running Cuneiform 1.1.1
 #15206  by Alexires
 
Sorry it took so long, there have been problems with my VPS provider.

To assist you, the upload was started at 2012-04-14 12:34:51,834 on the log with a file called "Acty Instructions Corps Day 2011.pdf".
Attachment (154.26 KiB)
 #15243  by pavila
 
I have seen this:
Code:
java.lang.IllegalStateException: java.util.concurrent.ExecutionException: java.io.UnsupportedEncodingException: num
So disable the spell checker (OpenOffice dictionary).
 #15267  by Alexires
 
Seems to be working now...

Any idea why the dictionary function wouldn't work?
 #15291  by pavila
 
I need to take a look at this. Perhaps recent OpenOffice dictionaries do not work with OpenKM.
 #15679  by michaeled
 
Hi!
I have the same issue:
Code:
2012-05-23 10:29:45,308 WARN  [com.openkm.util.ExecutionUtils] Abnormal program termination: 1
2012-05-23 10:29:45,308 WARN  [com.openkm.util.ExecutionUtils] CommandLine: [/usr/bin/cuneiform, /tmp/XIPLAYER_CM26088752938913727698.png, -o, /tmp/okm7428846735421532831.txt]
2012-05-23 10:29:45,308 WARN  [com.openkm.util.ExecutionUtils] STDERR: PUMA_XFinalrecognition failed.
But the spell checker is not activated (I think; how can I be sure?)

Version: 0.9.0 (I also tried 1.1.0, same issue)

Any ideas?
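
Since the log above records the exact command OpenKM executed, one way to investigate is to pull that command out of the "CommandLine:" line and re-run it by hand. The sketch below does the extraction. `argv_from_log_line` is a hypothetical helper, and it assumes no argument contains ", ". The temporary files under /tmp may no longer exist by the time you try this, so you may need to substitute a copy of the image.

```python
import re
import shlex
from typing import List, Optional

# Matches the bracketed argument list that ExecutionUtils writes to server.log.
COMMAND_LINE = re.compile(r"CommandLine: \[(.*)\]\s*$")


def argv_from_log_line(line: str) -> Optional[List[str]]:
    """Return the logged command as an argument list, or None if absent."""
    match = COMMAND_LINE.search(line)
    if match is None:
        return None
    # Assumption: arguments are separated by ", " and never contain it.
    return match.group(1).split(", ")


if __name__ == "__main__":
    line = (
        "2012-05-23 10:29:45,308 WARN  [com.openkm.util.ExecutionUtils] "
        "CommandLine: [/usr/bin/cuneiform, "
        "/tmp/XIPLAYER_CM26088752938913727698.png, -o, "
        "/tmp/okm7428846735421532831.txt]"
    )
    argv = argv_from_log_line(line)
    # Print a shell-safe version that can be pasted into a terminal.
    print(" ".join(shlex.quote(arg) for arg in argv))
```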
