Re: OCR/Indexing Problem
Posted: Tue Feb 07, 2012 4:41 am
by Alexires
Now the file is too big; maximum allowed file size is 256KiB. Here is a link to the file:
https://www.princeton.edu/~pkrugman/interstellar.pdf
The quality is bad, but that shouldn't matter for the OCR; it should still scan it.
Re: OCR/Indexing Problem
Posted: Wed Feb 08, 2012 7:11 am
by Alexires
I've upgraded to OpenKM 5.1.9 and it seems to be working (a tesseract process showed up when running top in Ubuntu). No errors were thrown when scanning the document I linked to earlier; however, none of the words were searchable. This could be because Tesseract recognized nothing, or OCR may still not be working. Does anyone have a test document that has had its OCR text removed, or one that has been tested successfully in the past?
Re: OCR/Indexing Problem
Posted: Wed Feb 08, 2012 9:02 am
by okmuser
I did a quick check on my server with a PDF file that has no text layer, and got this error on 5.1.9 Build: 7446.
The PDF is not protected.
Code:
2012-02-08 19:53:31,196 WARN [com.openkm.extractor.PdfTextExtractor] PDF does not contains text layer
2012-02-08 19:53:31,269 WARN [com.openkm.util.ExecutionUtils] Abnormal program termination: 139
2012-02-08 19:53:31,271 WARN [com.openkm.util.ExecutionUtils] STDERR: Tesseract Open Source OCR Engine
name_to_image_type:Error:Unrecognized image type:/tmp/Im238448562289920155604.tiff
IMAGE::read_header:Error:Can't read this image type:/tmp/Im238448562289920155604.tiff
/usr/bin/tesseract:Error:Read of file failed:/tmp/Im238448562289920155604.tiff
2012-02-08 19:53:31,345 WARN [com.openkm.util.ExecutionUtils] Abnormal program termination: 139
2012-02-08 19:53:31,345 WARN [com.openkm.util.ExecutionUtils] STDERR: Tesseract Open Source OCR Engine
name_to_image_type:Error:Unrecognized image type:/tmp/Im275595064840849541596.tiff
IMAGE::read_header:Error:Can't read this image type:/tmp/Im275595064840849541596.tiff
/usr/bin/tesseract:Error:Read of file failed:/tmp/Im275595064840849541596.tiff
2012-02-08 19:53:31,346 WARN [com.openkm.extractor.RegisteredExtractors] There was a problem extracting text from '/okm:root/Temporary Folder/img-Y10085108-0001.pdf'
Cheers,
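A note on the "Abnormal program termination: 139" lines above: on Linux, a status above 128 reported for a child process conventionally means the process was killed by a signal, with the signal number being the status minus 128. A minimal Python sketch of that decoding follows. The helper name `describe_exit_status` is made up for illustration and is not part of OpenKM, and the convention is only a heuristic, since a program can also exit with such a code deliberately.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Describe a shell-style exit status (hypothetical helper, not OpenKM code)."""
    if status == 0:
        return "success"
    if status > 128:
        # Convention: 128 + N means the process was terminated by signal N.
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"unknown signal {signum}"
        return f"killed by signal {signum} ({name})"
    return f"exited with error code {status}"


if __name__ == "__main__":
    print(describe_exit_status(139))  # the Tesseract status from the log
    print(describe_exit_status(1))    # an ordinary failure exit
```

Read this way, 139 corresponds to signal 11 (SIGSEGV), which suggests the Tesseract process crashed on the temporary TIFF rather than returning an ordinary error.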
Re: OCR/Indexing Problem
Posted: Thu Feb 09, 2012 6:17 pm
by pavila
I can recommend Cuneiform, which supports many image types and has a better recognition engine. If you have problems with Tesseract, try running it from the command line to test that it works.
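As a rough illustration of such a command-line check, the Python sketch below runs Tesseract directly on an image, outside OpenKM. It assumes `tesseract` is on the PATH and uses the standard `tesseract <image> <output base> -l <lang>` invocation; the function names, the default output path, and the language `eng` are placeholders chosen for this example.

```python
import shutil
import subprocess
import sys
from pathlib import Path


def build_tesseract_command(image: str, output_base: str, lang: str = "eng") -> list:
    """Build the argument list; Tesseract appends '.txt' to output_base itself."""
    return ["tesseract", image, output_base, "-l", lang]


def run_ocr_check(image: str, output_base: str = "/tmp/ocr_check") -> str:
    """Run Tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image, output_base),
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        raise RuntimeError(
            f"tesseract failed ({result.returncode}): {result.stderr.strip()}"
        )
    return Path(output_base + ".txt").read_text(encoding="utf-8", errors="replace")


if __name__ == "__main__":
    # Usage: python ocr_check.py /path/to/scanned_page.tif
    if len(sys.argv) > 1:
        print(run_ocr_check(sys.argv[1]))
```

If this fails on an image taken from the problem PDF, the fault is probably in the Tesseract installation or the image format rather than in OpenKM.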
Re: OCR/Indexing Problem
Posted: Sat Feb 11, 2012 2:38 am
by Alexires
Pavila - So you think Cuneiform is better? I was trying to work out which one would give me better results, and the only test I could find online had tesseract giving slightly better recognition levels.
Re: OCR/Indexing Problem
Posted: Mon Feb 13, 2012 1:06 pm
by pavila
In my experience, Cuneiform 1.0 and 1.0.1 give better word recognition, at least for English and Spanish texts.
Re: OCR/Indexing Problem
Posted: Mon Mar 12, 2012 7:43 am
by Alexires
Was that in comparison to Tesseract 2 or Tesseract 3?
Also, how do I test if OCR is working correctly in my instance of OpenKM?
Re: OCR/Indexing Problem
Posted: Tue Mar 13, 2012 6:35 pm
by pavila
I think Cuneiform has a better recognition engine.
Re: OCR/Indexing Problem
Posted: Mon Mar 26, 2012 6:41 am
by Alexires
Ok, so I've tried Cuneiform (via aptitude install) and I get this error in server.log:
Code:
2012-03-26 10:39:19,892 WARN [com.openkm.extractor.PdfTextExtractor] PDF does not contains text layer
2012-03-26 10:39:20,380 WARN [com.openkm.util.ExecutionUtils] Abnormal program termination: 1
2012-03-26 10:39:20,381 WARN [com.openkm.util.ExecutionUtils] STDERR: PUMA_XFinalrecognition failed.
2012-03-26 10:39:20,380 WARN [com.openkm.util.ExecutionUtils] Abnormal program termination: 1
2012-03-26 10:39:20,381 WARN [com.openkm.util.ExecutionUtils] STDERR: PUMA_XFinalrecognition failed.
2012-03-26 10:39:20,380 WARN [com.openkm.util.ExecutionUtils] Abnormal program termination: 1
2012-03-26 10:39:20,381 WARN [com.openkm.util.ExecutionUtils] STDERR: PUMA_XFinalrecognition failed.
2012-03-26 10:39:20,380 WARN [com.openkm.util.ExecutionUtils] Abnormal program termination: 1
2012-03-26 10:39:20,381 WARN [com.openkm.util.ExecutionUtils] STDERR: PUMA_XFinalrecognition failed.
2012-03-26 10:39:20,380 WARN [com.openkm.util.ExecutionUtils] Abnormal program termination: 1
2012-03-26 10:39:20,381 WARN [com.openkm.util.ExecutionUtils] STDERR: PUMA_XFinalrecognition failed.
2012-03-26 10:39:22,345 WARN [com.openkm.extractor.RegisteredExtractors] There was a problem extracting text from '/okm:root/Sciences and Mathematics/Physics/LA-UR--97-1534- Free Electron Laser.pdf'
This is on OpenKM 5.1.9 running on Ubuntu 10.10 x64. I think it is running Cuneiform 1.1.1.
Re: OCR/Indexing Problem
Posted: Tue Mar 27, 2012 6:54 pm
by pavila
Re: OCR/Indexing Problem
Posted: Sat Apr 14, 2012 2:56 am
by Alexires
Sorry it took so long; there have been problems with my VPS provider.
To assist you: the upload starts at 2012-04-14 12:34:51,834 in the log, with a file called "Acty Instructions Corps Day 2011.pdf".
Re: OCR/Indexing Problem
Posted: Tue Apr 17, 2012 11:05 am
by pavila
I have seen this:
Code:
java.lang.IllegalStateException: java.util.concurrent.ExecutionException: java.io.UnsupportedEncodingException: num
So disable the spell checker (OpenOffice dictionary).
Re: OCR/Indexing Problem
Posted: Wed Apr 18, 2012 9:01 am
by Alexires
Seems to be working now...
Any idea why the dictionary function wouldn't work?
Re: OCR/Indexing Problem
Posted: Thu Apr 19, 2012 7:05 am
by pavila
I need to take a look at this. Perhaps the recent OpenOffice dictionaries do not work with OpenKM.
Re: OCR/Indexing Problem
Posted: Wed May 23, 2012 10:44 am
by michaeled
Hi!
I have the same issue:
Code:
2012-05-23 10:29:45,308 WARN [com.openkm.util.ExecutionUtils] Abnormal program termination: 1
2012-05-23 10:29:45,308 WARN [com.openkm.util.ExecutionUtils] CommandLine: [/usr/bin/cuneiform, /tmp/XIPLAYER_CM26088752938913727698.png, -o, /tmp/okm7428846735421532831.txt]
2012-05-23 10:29:45,308 WARN [com.openkm.util.ExecutionUtils] STDERR: PUMA_XFinalrecognition failed.
But the spell checker is not activated (I think; how can I be sure?).
Version: 0.9.0 (I also tried 1.1.0, same issue).
Any ideas?
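One way to narrow this down is to re-run, outside OpenKM, the exact command that the `CommandLine:` log entry reports. The Python sketch below pulls the argument list out of such a log line and executes it. The function names are made up for this example, and the parsing assumes no argument contains ", ". The temporary files named in the log will usually be gone by then, so an image of your own would need to be substituted.

```python
import re
import shutil
import subprocess


def parse_command_line(log_line: str) -> list:
    """Extract the argument list from a 'CommandLine: [a, b, c]' log entry."""
    match = re.search(r"CommandLine: \[(.*)\]", log_line)
    if match is None:
        raise ValueError("no CommandLine entry found")
    # Assumption: individual arguments never contain ", ".
    return match.group(1).split(", ")


def rerun(args: list) -> None:
    """Run the parsed command and show its exit status and stderr."""
    if shutil.which(args[0]) is None:
        print(f"{args[0]} is not installed or not executable")
        return
    result = subprocess.run(args, capture_output=True, text=True)
    print("exit status:", result.returncode)
    print("stderr:", result.stderr.strip())


if __name__ == "__main__":
    line = (
        "2012-05-23 10:29:45,308 WARN [com.openkm.util.ExecutionUtils] "
        "CommandLine: [/usr/bin/cuneiform, /tmp/XIPLAYER_CM26088752938913727698.png, "
        "-o, /tmp/okm7428846735421532831.txt]"
    )
    print(parse_command_line(line))
    # To test with your own image, replace the input path and call rerun(...).
```

If the same "PUMA_XFinalrecognition failed." message appears when the command is run by hand, the problem lies with Cuneiform or the input image rather than with the OpenKM spell checker setting.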