• Scanner and OCR

  • OpenKM has many interesting features, but requires some configuration process to show its full potential.
 #15010  by lineac
 
Hi, me again : )

I would like to use the very handy multi-page scanning (or at least single-page scanning) and OCR features. But I have no idea where to start. I've read the official documentation about Tesseract integration and done what it said, but after that I'm lost.
After some extensive research on the Internet, I gathered the following information:
-scanner must be TWAIN compliant;
-we must provide a scanner.jar or something to make it work;
-there's some code in this thread: http://forum.openkm.com/viewtopic.php?f=31&t=4795 that would allow multi-page scanning...

Beyond that, I have no idea what to do. First I tried to upload a .tif, but how do I run OCR on it so that the document gets indexed? And how can I link a scanner to OpenKM and make the scanner applet work?
Can anyone give me hints, links, or how-to procedures?
 #15022  by jllort
 
The current OpenKM UI scanner supports multi-page scanning.
If you have configured Tesseract (we suggest Cuneiform, which seems to give better results), then you should be able to index the text in images.
 #15031  by pavila
 
The minimum recommended scanning resolution is 300 dpi for use with an OCR engine like Tesseract or Cuneiform.
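To give a feel for what 300 dpi means in pixels, here is a minimal sketch that converts a page size to pixel dimensions. The function name `page_pixels` and the 96 dpi comparison value (an assumed typical screen resolution) are illustrative and do not come from OpenKM or the OCR engines.

```python
MM_PER_INCH = 25.4


def page_pixels(width_mm, height_mm, dpi):
    """Return the (width, height) in pixels of a page scanned at the given dpi."""
    return (
        round(width_mm / MM_PER_INCH * dpi),
        round(height_mm / MM_PER_INCH * dpi),
    )


# A4 page (210 x 297 mm) at the recommended 300 dpi
print(page_pixels(210, 297, 300))  # (2480, 3508)

# Same page at an assumed screen resolution of 96 dpi, for comparison
print(page_pixels(210, 297, 96))   # (794, 1123)
```

An image much smaller than the 300 dpi figure for its physical size may give the OCR engine too few pixels per character to work with.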
 #15045  by lineac
 
For the OCR: I've tried to upload a .tif image (made in MS Paint, with just basic text like "Test Sample 1 2"), but it seems it is not indexed: when I search for a word I put in it, nothing is found... Is there a step I'm missing?

For the scanner: when I click on the scanner menu entry, a pop-up appears, and when I click "scan and upload" it shows an error about trying to find source data, etc. I suppose I must indicate somewhere where to find the scanner (its IP?), but I have no idea where I should configure this.
 #15054  by pavila
 
Please attach the TIFF image so we can test it, or try it in the online demo.
 #15078  by lineac
 
That's weird, I tried it in the online demo and it doesn't work. Here's how I did it:
>Paint
>Write sample words (with keyboard, not drawing them)
>Save as TIF
>Upload it to OpenKM. It shows : uploading...100%, then Indexing...100%.
>Search: Content: 1 of the sample words in the file I just uploaded. Results: no file found.

Still, I'm missing something?
 #15080  by vkasgpta
 
lineac wrote:That's weird, I tried it in the online demo and it doesn't work. Here's how I did it:
>Paint
>Write sample words (with keyboard, not drawing them)
>Save as TIF
>Upload it to OpenKM. It shows : uploading...100%, then Indexing...100%.
>Search: Content: 1 of the sample words in the file I just uploaded. Results: no file found.

Still, I'm missing something?
OCR is not 100% foolproof; it's possible it recognized some characters as something else.
 #15082  by lineac
 
Is there a way to see which words the OCR recognized (and what OpenKM indexed)? I've only tested it by searching for content in the file; maybe there's a better way.
Also, it would be odd for OCR not to recognize characters typed directly from the keyboard (and therefore perfectly formed), right?
 #15106  by vkasgpta
 
Well, if the document was computer text and not scanned, OpenKM recognizes it 100%, but if it was printed and then scanned, the OCR software takes over. It's not OpenKM's fault here. Yes, the scan may look like a perfect computer font, but even then OCR can make errors! That's just how it is. Think of it like voice dialing before Apple's Siri. Siri is still buggy, but before it came out voice dialing was really terrible and unreliable...

I had the privilege of talking to a senior Lexmark technician a few days ago when I was trying to get OCR to work.
According to him (not me), OCR was popular when it first came out, but people got really frustrated by it and threw it aside.
Because you have your scanner working, it would be advisable to either disable the OCR or have someone vet every document after the OCR runs.
If you want to check what OpenKM got from the OCR, run the OCR software without OpenKM and see what it read; such tools usually have a previewer that shows what was recognized wrongly.
OpenKM does not have its own OCR, so whichever OCR you are using, check its guide if you are not sure how to use it separately :)

Maybe wait for a Siri or Google Voice of OCR? :)
 #15112  by lineac
 
vkasgpta wrote:well if the document was a computer text and not scanned, openkm recognizes it 100%
That's the case: it was created, saved as TIF and uploaded directly! Something must be wrong with OpenKM here; it can't be the OCR...
 #15113  by vkasgpta
 
Haha, touché! You seem to have cornered me on my statement, and you would technically win against me, BUT you don't win against the system :P
When you saved it as an image, it lost all the text that OpenKM could read directly. When you uploaded it, OpenKM simply invoked the OCR software you configured to read it. Remember, even clean computer-rendered text can give wrong results when read by OCR. I would still say OpenKM is not at fault here, because OpenKM is not the one converting an image into searchable text.

Try this experiment: use the same OCR you configured in OpenKM to read the same document without OpenKM, and see what results it gets.
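The experiment above could be scripted roughly as follows, assuming the configured engine is Tesseract and its `tesseract` executable is on the PATH. The helper names, the `ocr_check` output base, and the `eng` language are illustrative choices, not OpenKM settings. The sketch relies only on the standard `tesseract <image> <outputbase> -l <lang>` invocation, which writes the recognized text to `<outputbase>.txt`.

```python
import shutil
import subprocess
import sys
from pathlib import Path


def tesseract_command(image_path, output_base, lang="eng"):
    """Build the command line: tesseract <image> <outputbase> -l <lang>."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_to_text(image_path, output_base="ocr_check", lang="eng"):
    """Run Tesseract outside OpenKM and return the text it recognized."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(tesseract_command(image_path, output_base, lang), check=True)
    # Tesseract appends ".txt" to the output base name.
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python ocr_check.py sample.tif
    print(ocr_to_text(sys.argv[1]))
```

If the printed text does not contain the words typed into the image, the search in OpenKM cannot find them either, which points at the OCR step rather than the indexing.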

If that doesn't work: I remember a user on the forum mentioning a text file that is created when the OCR runs. Try searching the forums for it and see if you can find the file at the location mentioned there. If you get no results, just search for all .txt files in the OpenKM directory; it should be there... Open the text file and it will show you what the OCR read.
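Searching the OpenKM directory for .txt files could look like the sketch below. The installation path is not given in this thread, so it is passed on the command line; the function names are illustrative.

```python
import sys
from pathlib import Path


def find_text_files(root):
    """Return all .txt files under root, searched recursively and sorted by path."""
    return sorted(p for p in Path(root).rglob("*.txt") if p.is_file())


def preview(path, max_chars=200):
    """Return the first max_chars characters of a text file."""
    return path.read_text(encoding="utf-8", errors="replace")[:max_chars]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python find_txt.py <OpenKM directory>
    for txt in find_text_files(sys.argv[1]):
        print(txt)
        print("   ", preview(txt, 80).replace("\n", " "))
```

Printing a short preview of each file makes it easier to spot the one that holds the OCR output for the uploaded image.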
 #15222  by pavila
 
Open-source OCR engines like Cuneiform are good, but if you need more accurate character recognition you should go with commercial software like ABBYY.
