• OCR Problem

  • OpenKM has many interesting features, but requires some configuration process to show its full potential.
 #29697  by skorpion78
 
Hello,
My config:
OpenKM Community 6.3.0
Debian 7.6
Tesseract 3

Please help with OCR of PDF documents. We scan documents on a RICOH MPC 3000 device and save them as multipage PDFs. If the scan is done in B&W, OCR recognizes the text correctly, but when we scan in COLOR mode OCR does not recognize the text and produces only garbage like this: "vümąäümaüwêa Saša? 320% S =a..S‹N.mNS: .Sonia .aaasxanssm . aaänaaêœ."

Config in OpenKM:
Przechwytywanie.PNG (84.58 KiB)
Attachments
PDF sample B&W (46.63 KiB)
PDF sample Color (204.96 KiB)
 #29719  by jllort
 
Very, very strange. The process for extracting text from a PDF (when the PDF has no text layer and contains only images) is to extract the image and then run OCR on it. For some reason the extracted image is horizontal, not vertical. You should investigate why this happens, because it is quite strange. I've attached the image that the library extracts. This is a general-purpose library and I do not think the bug is in it.

Here is the code of the class that does it, http://doxygen.openkm.com/openkm/d6/d84 ... actor.html, in case you want to run some tests on your side.
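
If the extracted image really does come out on its side, one thing worth trying is to rotate it back before handing it to the OCR engine. Below is a minimal sketch in plain Java (standard library only). The class and method names are hypothetical and are not part of OpenKM or PDFBox. It also rests on an unconfirmed assumption: that the page is stored with a rotation which viewers apply but raw image extraction ignores, so the correct number of quarter turns has to be found by experiment.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper, not part of OpenKM or PDFBox.
class ImageRotation {

    // Rotates the image clockwise by quarterTurns * 90 degrees.
    // Negative values rotate counterclockwise.
    static BufferedImage rotateQuarterTurns(BufferedImage src, int quarterTurns) {
        int turns = ((quarterTurns % 4) + 4) % 4; // normalize to 0..3
        BufferedImage result = src;
        for (int t = 0; t < turns; t++) {
            result = rotate90Clockwise(result);
        }
        return result;
    }

    // Rotates the image by 90 degrees clockwise; width and height swap.
    static BufferedImage rotate90Clockwise(BufferedImage src) {
        int w = src.getWidth();
        int h = src.getHeight();
        BufferedImage dst = new BufferedImage(h, w, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                // Source pixel (x, y) lands at (h - 1 - y, x) in the rotated image.
                dst.setRGB(h - 1 - y, x, src.getRGB(x, y));
            }
        }
        return dst;
    }
}
```

An image loaded with javax.imageio.ImageIO.read can be passed through rotateQuarterTurns and written back with ImageIO.write before running Tesseract on it. If the OCR output becomes readable after one or three quarter turns, that points at orientation rather than color as the cause.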
Attachments
Im13023204248111430395.jpg (677.79 KiB)
 #29722  by skorpion78
 
Thank you for your response.
Well, this is really weird. I think I will check the various settings on the device.

Greetings,
Sebastian
 #29743  by jllort
 
It is very, very strange, because when you open the document in Acrobat Reader or any other viewer it is shown correctly, but when the image is extracted it comes out at +80 (horizontal). If you have some Java skills, the best approach would be to write a minimal sample that tests the code with this PDF and with others that work correctly. It could be a library problem or a problem in the PDF; this kind of thing is difficult to know. Probably the best option is to go to the library's website, https://pdfbox.apache.org/, and explain the problem to them. They are the ones who can give us some clue about what is happening in your case.
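
As a starting point for such a minimal sample, here is a rough sketch in plain Java that does not depend on PDFBox. It assumes the page images have already been extracted to files by whatever means. It offers two hypothetical helpers: one reports whether an extracted image is wider than it is tall, and one builds the command line for the tesseract program in its basic form (image, output base name, -l language). The class name, method names and the language code passed in are illustration only.

```java
import java.awt.image.BufferedImage;
import java.util.Arrays;
import java.util.List;

// Hypothetical test helper, not part of OpenKM or PDFBox.
class OcrOrientationCheck {

    // True when the image is wider than tall. For a scanned portrait page
    // this hints that the extracted image may be lying on its side.
    static boolean isLandscape(BufferedImage img) {
        return img.getWidth() > img.getHeight();
    }

    // Builds the argument list: tesseract <image> <outputBase> -l <lang>.
    // Tesseract itself appends ".txt" to outputBase.
    static List<String> tesseractCommand(String imagePath, String outputBase, String lang) {
        return Arrays.asList("tesseract", imagePath, outputBase, "-l", lang);
    }
}
```

The list can be run with new ProcessBuilder(command).inheritIO().start().waitFor(). Running it on the image extracted from the B&W sample and on the one from the color sample, and comparing isLandscape for both, would show whether the two scans differ in orientation and not only in color. A landscape result is not proof of a problem by itself, since some pages are legitimately landscape.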
