Open Source Document Management System | OpenKM

Posted: **Wed Aug 27, 2014 10:25 pm**

Hello,
my config:
OpenKM Community 6.3.0
Debian 7.6
Tesseract 3

Please help with OCR of PDF documents. We scan documents on a RICOH MPC 3000 and save them as multipage PDFs. When the scan is done in B&W, OCR recognizes the text correctly, but when the scan is done in COLOR mode, OCR does not recognize the text and returns only garbage like this: "vümąäümaüwêa Saša? 320% S =a..S‹N.mNS: .Sonia .aaasxanssm . aaänaaêœ."

Config in OpenKM:

[Attachment: Przechwytywanie.PNG (84.58 KiB)]

Posted: **Thu Aug 28, 2014 2:34 pm**

Very strange. The process for extracting text from a PDF (when the PDF has no text layer and contains only images) is to extract the image and then run OCR on it. For some reason the extracted image is horizontal instead of vertical. You should investigate why this happens, because it is quite strange. I've attached the image that the library extracts. This is a general-purpose library, and I do not think this is a bug in it.

Here is the code of the class that does this, http://doxygen.openkm.com/openkm/d6/d84 ... actor.html, in case you want to run some tests on your side.
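
To make the flow described above concrete, here is a minimal sketch in plain Java. It is not the OpenKM class linked above: `TextExtractionSketch`, `extract`, and the `ocr` callback are hypothetical names used only for illustration, and the page images and the OCR engine are passed in as placeholders so the example runs without PDFBox or Tesseract.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration of the "text layer first, OCR as fallback" flow.
final class TextExtractionSketch {

    /**
     * Returns the PDF text layer if it contains any text; otherwise runs the
     * supplied OCR function on every extracted page image and joins the results.
     */
    static <I> String extract(String textLayer, List<I> pageImages, Function<I, String> ocr) {
        if (textLayer != null && !textLayer.trim().isEmpty()) {
            return textLayer; // the PDF already has text, no OCR needed
        }
        StringBuilder out = new StringBuilder();
        for (I image : pageImages) {
            // Each image is handed to OCR exactly as it was extracted,
            // so a rotated image yields garbage at this step.
            out.append(ocr.apply(image)).append('\n');
        }
        return out.toString();
    }
}
```

The point of the sketch is that OCR quality depends entirely on what the image extraction step returns, so the extracted image is the first thing to inspect.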

Posted: **Thu Aug 28, 2014 8:55 pm**

Thank you for your response.
Well, this is really weird. I think I will check the various settings on the device.

greetings
Sebastian

Posted: **Sat Aug 30, 2014 10:31 am**

It is very strange, because when you open the document in Acrobat Reader or any other viewer it is shown correctly, but when the image is extracted it comes out at +80 (horizontal). If you have some Java skills, the best approach would be to write a minimal sample that tests the code against this PDF and against others that work correctly. It could be a problem in the library or a problem in the PDF; this kind of thing is difficult to tell. Probably the best option is to go to the library's website, https://pdfbox.apache.org/, and explain the problem to them. They are the ones who can give us some clue about what is happening in your case.
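
As a starting point for such a test, here is a small sketch that rotates an already extracted image in quarter turns, so that each orientation can be passed to the OCR engine and the results compared. It assumes the problem really is a rotated embedded image, which has not been confirmed in this thread. `QuarterTurn` and its methods are hypothetical names, and only `java.awt.image.BufferedImage` from the JDK is used; loading the image from the PDF and calling Tesseract are left out.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper: rotate an extracted page image before handing it to OCR.
final class QuarterTurn {

    /** Returns a copy of src rotated clockwise by 90 degrees. */
    static BufferedImage rotate90Clockwise(BufferedImage src) {
        int w = src.getWidth();
        int h = src.getHeight();
        // Width and height swap after a quarter turn.
        BufferedImage dst = new BufferedImage(h, w, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                // Pixel (x, y) moves to column (h - 1 - y), row x.
                dst.setRGB(h - 1 - y, x, src.getRGB(x, y));
            }
        }
        return dst;
    }

    /** Rotates clockwise by a multiple of 90 degrees; negative values rotate counterclockwise. */
    static BufferedImage rotate(BufferedImage src, int degrees) {
        if (degrees % 90 != 0) {
            throw new IllegalArgumentException("degrees must be a multiple of 90");
        }
        int turns = ((degrees % 360) + 360) % 360 / 90;
        BufferedImage out = src;
        for (int i = 0; i < turns; i++) {
            out = rotate90Clockwise(out);
        }
        return out;
    }
}
```

If one of the rotated variants gives clean OCR output for the color scan, that points at the orientation of the embedded image rather than at the OCR engine itself.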

OCR Problem