Open Source Document Management System

OCR Problem


OCR Problem

#29697 by skorpion78
Wed Aug 27, 2014 10:25 pm

Hello,
my config:
OpenKM Community 6.3.0
Debian 7.6
Tesseract 3

Please help with OCR of PDF documents. We scan documents on a RICOH MPC 3000 device and save them as multipage PDFs. When the scan is done in B&W, OCR recognizes the text correctly. When the scan is done in color mode, OCR does not recognize the text and produces only garbage like this: "vümąäümaüwêa Saša? 320% S =a..S‹N.mNS: .Sonia .aaasxanssm . aaänaaêœ."

Config in OpenKM:

Przechwytywanie.PNG (84.58 KiB)

Attachments

20140715133503011_0.pdf
PDF sample B&W (46.63 KiB)

20140715103101968.pdf
PDF sample Color (204.96 KiB)


Re: OCR Problem

#29719 by jllort
Thu Aug 28, 2014 2:34 pm

Very strange. When a PDF has no text layer and contains only images, text extraction works by extracting each image and then running OCR on it. For some reason the image extracted from your color PDF is horizontal (landscape) instead of vertical. You should investigate why this happens, because it is quite strange. I've attached the image that the library extracts. It is a general-purpose library, and I do not think this is a bug in it.

Here is the code of the class that does the extraction, in case you want to run some tests on your side: http://doxygen.openkm.com/openkm/d6/d84 ... actor.html
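
The symptom described in this post (the extracted image comes out landscape although the page displays upright) can be tested in isolation by rotating the extracted image before handing it to OCR. The following is a minimal sketch that uses only the JDK. It is not OpenKM's extractor code, the class and method names are hypothetical, and the direction of rotation needed for this particular scan is an assumption to verify by trying both.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper, not part of OpenKM or PDFBox.
final class ImageOrientation {
    private ImageOrientation() {}

    /**
     * Returns a copy of src rotated clockwise by quarterTurns * 90 degrees.
     * Negative values rotate counter-clockwise.
     */
    static BufferedImage rotateQuarterTurns(BufferedImage src, int quarterTurns) {
        int turns = Math.floorMod(quarterTurns, 4);
        int w = src.getWidth();
        int h = src.getHeight();
        // Odd numbers of quarter turns swap width and height.
        boolean swap = (turns % 2) == 1;
        BufferedImage dst = new BufferedImage(
                swap ? h : w, swap ? w : h, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int rgb = src.getRGB(x, y);
                switch (turns) {
                    case 1:  dst.setRGB(h - 1 - y, x, rgb); break;         // 90 clockwise
                    case 2:  dst.setRGB(w - 1 - x, h - 1 - y, rgb); break; // 180
                    case 3:  dst.setRGB(y, w - 1 - x, rgb); break;         // 90 counter-clockwise
                    default: dst.setRGB(x, y, rgb);                        // no rotation
                }
            }
        }
        return dst;
    }
}
```

One way to use it: load the attached JPG with `javax.imageio.ImageIO.read`, call `rotateQuarterTurns(image, 1)` or `rotateQuarterTurns(image, -1)`, save the result with `ImageIO.write`, and run Tesseract on that file by hand. If the rotated image yields readable text, the orientation of the extracted image is the likely cause of the garbage output.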

Attachments

Im13023204248111430395.jpg (677.79 KiB)


Re: OCR Problem

#29722 by skorpion78
Thu Aug 28, 2014 8:55 pm

Thank you for your response.
This is really weird. I think I will check the various settings on the device.

Greetings,
Sebastian


Re: OCR Problem

#29743 by jllort
Sat Aug 30, 2014 10:31 am

It is very strange. When the document is opened in Acrobat Reader or any other viewer it is shown correctly, but the extracted image comes out at +80 (horizontal). If you have some Java skills, the best approach would be to write a minimal sample that runs the extraction code against this PDF and against others that work correctly. It could be a problem in the library or a problem in the PDF; this kind of thing is difficult to tell apart. The best option is probably to go to the library's website, https://pdfbox.apache.org/, and explain the problem there. They are the ones who can give us a clue about what is happening in your case.
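
Before writing a full PDFBox test, one hypothesis can be checked cheaply. This is an assumption, not something confirmed in this thread: a PDF page may carry a `/Rotate` entry that viewers apply at display time, while raw image extraction returns the stored pixels unrotated. The sketch below scans the raw file bytes for such entries using only the JDK. The class name is hypothetical, and the scan is a crude heuristic: it cannot see entries stored inside compressed object streams, and a rotation can also come from a transformation matrix in the page content, so an empty result is inconclusive.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical diagnostic helper, not part of OpenKM or PDFBox.
final class RotateEntryScan {
    // Matches a "/Rotate" key followed by an integer, e.g. "/Rotate 90".
    private static final Pattern ROTATE = Pattern.compile("/Rotate\\s+(-?\\d{1,4})");

    private RotateEntryScan() {}

    /** Returns every integer that follows a /Rotate key in the raw bytes. */
    static List<Integer> findRotateValues(byte[] pdfBytes) {
        // ISO-8859-1 maps each byte to exactly one char, so binary
        // stream data in the file cannot cause decoding errors.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        List<Integer> values = new ArrayList<>();
        Matcher m = ROTATE.matcher(raw);
        while (m.find()) {
            values.add(Integer.parseInt(m.group(1)));
        }
        return values;
    }

    /** Convenience overload that reads the whole file into memory first. */
    static List<Integer> findRotateValues(Path pdf) throws IOException {
        return findRotateValues(Files.readAllBytes(pdf));
    }
}
```

Running `findRotateValues` on both the B&W sample and the color sample and comparing the results would show whether the two files differ in this respect. If only the color PDF reports a non-zero value, that is useful detail to include when reporting the problem to the PDFBox project.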
