Open Source Document Management System | OpenKM

Posted: **Tue Feb 07, 2012 4:41 am**

Now the file is too big; maximum allowed file size is 256KiB. Here is a link to the file: https://www.princeton.edu/~pkrugman/interstellar.pdf

The quality is bad, but that shouldn't matter for the OCR; it should still scan it.

Posted: **Wed Feb 08, 2012 7:11 am**

I've upgraded to OpenKM 5.1.9 and it seems to be working (a tesseract process showed up when running top on Ubuntu). No errors were thrown when scanning the document I linked to earlier, but none of the words were searchable. This could be because Tesseract recognised nothing, or OCR may still not be working. Does anyone have a test document that has had its OCR text removed, or one that has been tested successfully in the past?

Posted: **Wed Feb 08, 2012 9:02 am**

I did a quick check on my server (5.1.9, build 7446) with a PDF file that has no text layer, and found this error.

The PDF is not protected.

Code:

2012-02-08 19:53:31,196 WARN  [com.openkm.extractor.PdfTextExtractor] PDF does not contains text layer
2012-02-08 19:53:31,269 WARN  [com.openkm.util.ExecutionUtils] Abnormal program termination: 139
2012-02-08 19:53:31,271 WARN  [com.openkm.util.ExecutionUtils] STDERR: Tesseract Open Source OCR Engine
name_to_image_type:Error:Unrecognized image type:/tmp/Im238448562289920155604.tiff
IMAGE::read_header:Error:Can't read this image type:/tmp/Im238448562289920155604.tiff
/usr/bin/tesseract:Error:Read of file failed:/tmp/Im238448562289920155604.tiff

2012-02-08 19:53:31,345 WARN  [com.openkm.util.ExecutionUtils] Abnormal program termination: 139
2012-02-08 19:53:31,345 WARN  [com.openkm.util.ExecutionUtils] STDERR: Tesseract Open Source OCR Engine
name_to_image_type:Error:Unrecognized image type:/tmp/Im275595064840849541596.tiff
IMAGE::read_header:Error:Can't read this image type:/tmp/Im275595064840849541596.tiff
/usr/bin/tesseract:Error:Read of file failed:/tmp/Im275595064840849541596.tiff

2012-02-08 19:53:31,346 WARN  [com.openkm.extractor.RegisteredExtractors] There was a problem extracting text from '/okm:root/Temporary Folder/img-Y10085108-0001.pdf'

Cheers,
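A side note on the `Abnormal program termination: 139` lines in the log above: by the common Unix convention, an exit status above 128 means the process was killed by signal number (status - 128), so 139 would point to signal 11 (SIGSEGV), i.e. Tesseract crashing rather than exiting cleanly. That OpenKM reports the status following this convention is an assumption, and the helper name `describe_exit_status` is hypothetical. A minimal Python sketch:

```python
import signal


def describe_exit_status(status: int) -> str:
    """Describe an exit status using the common 128 + signal convention.

    Assumption: the status in the log follows that convention.
    """
    if status == 0:
        return "success"
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform
            name = f"unknown signal {signum}"
        return f"killed by {name}"
    return f"exited with error code {status}"


if __name__ == "__main__":
    # 139 appears in the Tesseract log, 1 in the Cuneiform log
    for status in (139, 1):
        print(status, "->", describe_exit_status(status))
```

Read this way, the Cuneiform failures later in the thread (status 1) would be ordinary error exits, while the Tesseract ones would be crashes.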

Posted: **Thu Feb 09, 2012 6:17 pm**

I can recommend Cuneiform, which supports many image types and has a better recognition engine. If you have problems with Tesseract, try running it from the command line to test that it works.
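The command-line test suggested above can be sketched in Python as follows. It builds the argument list for either engine and runs it outside OpenKM so that the exit status and STDERR are visible directly. The function names `ocr_command` and `run_ocr` and the file paths are hypothetical; the Cuneiform form (`image -o output`) mirrors the CommandLine entry logged later in this thread, and the Tesseract form assumes the usual `tesseract image outputbase` usage, where Tesseract appends `.txt` itself.

```python
import shutil
import subprocess
from typing import List, Optional


def ocr_command(engine: str, image: str, out_txt: str) -> List[str]:
    """Build the argument list for a manual OCR run."""
    if engine == "tesseract":
        # tesseract takes an output base name and adds ".txt" on its own
        base = out_txt[:-4] if out_txt.endswith(".txt") else out_txt
        return ["tesseract", image, base]
    if engine == "cuneiform":
        return ["cuneiform", image, "-o", out_txt]
    raise ValueError(f"unsupported engine: {engine}")


def run_ocr(engine: str, image: str, out_txt: str) -> Optional[int]:
    """Run the engine by hand and print what it reports."""
    if shutil.which(engine) is None:
        print(f"{engine} is not on PATH")
        return None
    result = subprocess.run(
        ocr_command(engine, image, out_txt),
        capture_output=True,
        text=True,
    )
    # A negative return code means the process was killed by a signal
    print("exit status:", result.returncode)
    print("stderr:", result.stderr.strip())
    return result.returncode
```

For example, `run_ocr("tesseract", "/tmp/page.tiff", "/tmp/page.txt")` with a sample image of your own; if the same error appears here, the problem lies with the OCR engine or the image format rather than with OpenKM.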

Posted: **Sat Feb 11, 2012 2:38 am**

Pavila - So you think Cuneiform is better? I was trying to work out which one would give me better results, and the only test I could find online had Tesseract giving slightly better recognition levels.

Posted: **Mon Feb 13, 2012 1:06 pm**

In my experience, Cuneiform 1.0 and 1.0.1 give better word recognition, at least for English and Spanish texts.

Posted: **Mon Mar 12, 2012 7:43 am**

Was that in comparison to Tesseract 2 or Tesseract 3?

Also, how do I test if OCR is working correctly in my instance of OpenKM?

Posted: **Tue Mar 13, 2012 6:35 pm**

I think Cuneiform has a better recognition engine.

Posted: **Mon Mar 26, 2012 6:41 am**

OK, so I've tried Cuneiform (installed via aptitude) and I get this error in server.log:

Code:

2012-03-26 10:39:19,892 WARN  [com.openkm.extractor.PdfTextExtractor] PDF does not contains text layer
2012-03-26 10:39:20,380 WARN  [com.openkm.util.ExecutionUtils] Abnormal program termination: 1
2012-03-26 10:39:20,381 WARN  [com.openkm.util.ExecutionUtils] STDERR: PUMA_XFinalrecognition failed.

2012-03-26 10:39:20,380 WARN  [com.openkm.util.ExecutionUtils] Abnormal program termination: 1
2012-03-26 10:39:20,381 WARN  [com.openkm.util.ExecutionUtils] STDERR: PUMA_XFinalrecognition failed.

2012-03-26 10:39:20,380 WARN  [com.openkm.util.ExecutionUtils] Abnormal program termination: 1
2012-03-26 10:39:20,381 WARN  [com.openkm.util.ExecutionUtils] STDERR: PUMA_XFinalrecognition failed.

2012-03-26 10:39:20,380 WARN  [com.openkm.util.ExecutionUtils] Abnormal program termination: 1
2012-03-26 10:39:20,381 WARN  [com.openkm.util.ExecutionUtils] STDERR: PUMA_XFinalrecognition failed.

2012-03-26 10:39:20,380 WARN  [com.openkm.util.ExecutionUtils] Abnormal program termination: 1
2012-03-26 10:39:20,381 WARN  [com.openkm.util.ExecutionUtils] STDERR: PUMA_XFinalrecognition failed.

2012-03-26 10:39:22,345 WARN  [com.openkm.extractor.RegisteredExtractors] There was a problem extracting text from '/okm:root/Sciences and Mathematics/Physics/LA-UR--97-1534- Free Electron Laser.pdf'

This is OpenKM 5.1.9 running on Ubuntu 10.10 x64. I think it is using Cuneiform 1.1.1.
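One rough way to check whether OCR is working on an instance is to scan server.log for the two kinds of warnings shown above: the `Abnormal program termination` lines and the final `There was a problem extracting text from` line. The sketch below counts the former by exit status and collects the affected documents. The message texts are taken from the excerpts in this thread; the function name `summarize_ocr_failures` is hypothetical, and other OpenKM versions may log differently.

```python
import re
from collections import Counter

# Patterns copied from the log excerpts quoted in this thread
TERMINATION = re.compile(r"Abnormal program termination: (\d+)")
FAILED_DOC = re.compile(r"There was a problem extracting text from '(.*)'")


def summarize_ocr_failures(lines):
    """Return (Counter of exit statuses, list of documents that failed)."""
    exit_codes = Counter()
    failed_docs = []
    for line in lines:
        match = TERMINATION.search(line)
        if match:
            exit_codes[int(match.group(1))] += 1
            continue
        match = FAILED_DOC.search(line)
        if match:
            failed_docs.append(match.group(1))
    return exit_codes, failed_docs


if __name__ == "__main__":
    sample = [
        "2012-03-26 10:39:20,380 WARN  [com.openkm.util.ExecutionUtils] "
        "Abnormal program termination: 1",
        "2012-03-26 10:39:22,345 WARN  [com.openkm.extractor.RegisteredExtractors] "
        "There was a problem extracting text from '/okm:root/example.pdf'",
    ]
    codes, docs = summarize_ocr_failures(sample)
    print(dict(codes), docs)
```

To use it on a real log, pass an open file object, for instance `summarize_ocr_failures(open("server.log"))`; the location of server.log depends on the installation. An empty result after uploading a scanned PDF would suggest the extractor ran without reporting errors, although it does not prove that any text was recognised.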

Posted: **Tue Mar 27, 2012 6:54 pm**

Please read http://wiki.openkm.com/index.php/Debug_log_info and attach the zipped log.

Posted: **Sat Apr 14, 2012 2:56 am**

Sorry it took so long; there have been problems with my VPS provider.

To help you find it, the upload starts at 2012-04-14 12:34:51,834 in the log, with a file called "Acty Instructions Corps Day 2011.pdf".

Posted: **Tue Apr 17, 2012 11:05 am**

I have seen this:

Code:

java.lang.IllegalStateException: java.util.concurrent.ExecutionException: java.io.UnsupportedEncodingException: num

So disable the spell checker (OpenOffice dictionary).

Posted: **Wed Apr 18, 2012 9:01 am**

Seems to be working now...

Any idea why the dictionary function wouldn't work?

Posted: **Thu Apr 19, 2012 7:05 am**

I need to take a look at this. Perhaps the recent OpenOffice dictionaries do not work with OpenKM.

Posted: **Wed May 23, 2012 10:44 am**

Hi!
I have the same issue:

Code:

2012-05-23 10:29:45,308 WARN  [com.openkm.util.ExecutionUtils] Abnormal program termination: 1
2012-05-23 10:29:45,308 WARN  [com.openkm.util.ExecutionUtils] CommandLine: [/usr/bin/cuneiform, /tmp/XIPLAYER_CM26088752938913727698.png, -o, /tmp/okm7428846735421532831.txt]
2012-05-23 10:29:45,308 WARN  [com.openkm.util.ExecutionUtils] STDERR: PUMA_XFinalrecognition failed.

But the spell checker is not activated (I think; how can I be sure?).

Version: 0.9.0 (I also tried 1.1.0, same issue)

Any ideas?
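Since the log above includes the exact `CommandLine` that OpenKM executed, one way to narrow this down is to pull the arguments out of that line and re-run the same command by hand. The sketch below assumes the bracketed list is comma-and-space separated, as in the excerpt, and would misparse arguments that themselves contain ", ". The function names are hypothetical.

```python
import re
import shlex

# Matches the bracketed argument list in a "CommandLine: [...]" log entry
COMMAND_LINE = re.compile(r"CommandLine: \[(.*)\]\s*$")


def parse_command_line(log_line):
    """Return the logged arguments as a list, or None if the line has none."""
    match = COMMAND_LINE.search(log_line)
    if match is None:
        return None
    # Assumption: arguments are joined with ", " and contain no ", " themselves
    return match.group(1).split(", ")


def as_shell_command(argv):
    """Quote the arguments so the command can be pasted into a shell."""
    return " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    line = (
        "2012-05-23 10:29:45,308 WARN  [com.openkm.util.ExecutionUtils] "
        "CommandLine: [/usr/bin/cuneiform, "
        "/tmp/XIPLAYER_CM26088752938913727698.png, -o, "
        "/tmp/okm7428846735421532831.txt]"
    )
    print(as_shell_command(parse_command_line(line)))
```

The temporary image named in the log has probably been deleted by the time you look (an assumption), so substitute an image of your own when re-running the command. If Cuneiform still reports `PUMA_XFinalrecognition failed.` on its own, the problem is independent of OpenKM and its spell checker setting.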

OCR/Indexing Problem
