
OCR Problem

Posted: Wed Aug 27, 2014 10:25 pm
by skorpion78
Hello,
my config:
OpenKM Community 6.3.0
Debian 7.6
Tesseract 3

Please help with OCR of PDF documents. We scan documents on a RICOH MPC 3000 and save them as multipage PDFs. If the scan is done in B&W, OCR recognizes the text correctly, but when scanning in COLOR mode OCR does not recognize the text and produces only garbage like this: "vümąäümaüwêa Saša? 320% S =a..S‹N.mNS: .Sonia .aaasxanssm . aaänaaêœ."

Config in OpenKM:
Przechwytywanie.PNG

Re: OCR Problem

Posted: Thu Aug 28, 2014 2:34 pm
by jllort
Very strange. To extract text from a PDF (when the PDF has no text layer and contains only images), the process extracts the image and then runs the OCR on it. For some reason the extracted image is horizontal, not vertical. You should investigate why this happens, because it is quite strange. I've attached the image that the library extracts. This is a general-purpose library and I do not think it is a bug in it.

Here is the code of the class that does it: http://doxygen.openkm.com/openkm/d6/d84 ... actor.html, in case you want to run some tests on your side.
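
To test the second step (running the OCR on the extracted image) outside OpenKM, a small Java helper like the one below may help. This is only a sketch, not the code of the OpenKM class linked above: the class name OcrStepCheck and its methods are made up for illustration, and it assumes the tesseract binary is on the PATH and is called in the usual Tesseract 3 form "tesseract <image> <outputbase> -l <lang>".

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of OpenKM: runs Tesseract on one image file.
public class OcrStepCheck {

    /** Builds the command line: tesseract <image> <outputBase> -l <lang>. */
    static List<String> buildCommand(String imagePath, String outputBase, String lang) {
        return Arrays.asList("tesseract", imagePath, outputBase, "-l", lang);
    }

    /** Runs tesseract and returns the recognized text (assumes tesseract is on the PATH). */
    static String runOcr(String imagePath, String outputBase, String lang)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(buildCommand(imagePath, outputBase, lang))
                .inheritIO()   // show tesseract's own messages on the console
                .start();
        int exit = p.waitFor();
        if (exit != 0) {
            throw new IOException("tesseract exited with code " + exit);
        }
        // Tesseract appends ".txt" to the output base name.
        byte[] bytes = Files.readAllBytes(new File(outputBase + ".txt").toPath());
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

Running runOcr once on the image extracted from the B&W scan and once on the one from the COLOR scan should show whether the garbage comes from the image itself rather than from OpenKM.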

Re: OCR Problem

Posted: Thu Aug 28, 2014 8:55 pm
by skorpion78
Thank you for your response.
Well, this is really weird. I think I will check the various settings on the device.

greetings
Sebastian

Re: OCR Problem

Posted: Sat Aug 30, 2014 10:31 am
by jllort
It is very strange, because when the document is opened in Acrobat Reader or any other viewer it is shown correctly, but the extracted image comes out at +80 (horizontal). If you have some Java skills, the best approach would be to write a minimal sample that tests the code with this PDF and with others that work correctly. It could be a library problem or a problem in the PDF; this kind of thing is difficult to determine. The best option is probably to go to the library's website and explain the problem to them: https://pdfbox.apache.org/. They are the ones who can give us a clue about what's happening in your case.
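
As a possible workaround while the cause is being investigated, the extracted image could be turned upright before it is passed to the OCR. The sketch below uses only the standard Java library; the class RotateBeforeOcr is hypothetical, and which direction the image has to be turned is an assumption, so try both 1 and 3 quarter turns.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper, not part of OpenKM or PDFBox.
public class RotateBeforeOcr {

    /** Rotates the image clockwise by quarterTurns * 90 degrees (negative values turn counter-clockwise). */
    static BufferedImage rotate(BufferedImage src, int quarterTurns) {
        int turns = ((quarterTurns % 4) + 4) % 4;   // normalize to 0..3
        BufferedImage out = src;
        for (int t = 0; t < turns; t++) {
            out = rotate90Clockwise(out);
        }
        return out;
    }

    /** Plain pixel copy, so no interpolation blurs the scanned text. */
    static BufferedImage rotate90Clockwise(BufferedImage src) {
        int w = src.getWidth();
        int h = src.getHeight();
        // Width and height swap after a quarter turn.
        BufferedImage dst = new BufferedImage(h, w, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                // Source pixel (x, y) lands at (h - 1 - y, x) after a clockwise quarter turn.
                dst.setRGB(h - 1 - y, x, src.getRGB(x, y));
            }
        }
        return dst;
    }
}
```

The image can be loaded and saved with javax.imageio.ImageIO (read and write). If the OCR result on the rotated image is fine, that would confirm the orientation of the extracted image is the only problem.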