Scanner and OCR

Posted: Thu Mar 29, 2012 3:31 pm
by lineac
Hi, me again : )

I would like to use the very handy multi-page scanning feature (or at least single-page scanning) and OCR, but I have no idea where to start. I've read the official documentation about Tesseract integration and did what it said, but after that I'm lost.
After some extensive research on the Internet, I gathered some info:
-scanner must be TWAIN compliant;
-we must provide a scanner.jar or something to make it work;
-there's some code in this thread: http://forum.openkm.com/viewtopic.php?f=31&t=4795 that would allow multi-page scanning...

Beyond that, I have no idea what to do. First I tried to upload a .tif, but then how do I run OCR on it so that the document gets indexed? And how can I link a scanner to OpenKM and make the scanner applet work?
Can anyone give me hints/links/how-to procedures?

Re: Scanner and OCR

Posted: Fri Mar 30, 2012 11:28 am
by jllort
The current OpenKM UI scanner supports multi-page scanning.
If you have configured Tesseract (we suggest Cuneiform, which seems to give better results), then you should be able to index the text in images.

Re: Scanner and OCR

Posted: Sat Mar 31, 2012 7:10 am
by pavila
The minimum recommended scanning resolution for use with an OCR engine like Tesseract or Cuneiform is 300 dpi.
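
As a rough illustration of the 300 dpi guideline, the sketch below works out the effective resolution of an image from its pixel width and the physical width of the page it represents. The function names and the US Letter example are assumptions made for illustration; none of this is part of OpenKM or of any OCR engine.

```python
import math

MIN_OCR_DPI = 300  # minimum scanning resolution recommended in this thread


def effective_dpi(pixels: int, inches: float) -> float:
    """Resolution of an image that spans `inches` of paper with `pixels` pixels."""
    if inches <= 0:
        raise ValueError("physical size must be positive")
    return pixels / inches


def pixels_needed(inches: float, dpi: int = MIN_OCR_DPI) -> int:
    """Smallest pixel count that covers `inches` of paper at `dpi`."""
    return math.ceil(inches * dpi)


def meets_ocr_minimum(pixels: int, inches: float, minimum: int = MIN_OCR_DPI) -> bool:
    """True if the image reaches the recommended minimum resolution."""
    return effective_dpi(pixels, inches) >= minimum
```

For example, a US Letter page is 8.5 inches wide, so `pixels_needed(8.5)` asks for 2550 pixels across, while an image only 1275 pixels wide would correspond to 150 dpi and fall below the recommendation.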

Re: Scanner and OCR

Posted: Sat Mar 31, 2012 6:41 pm
by lineac
For the OCR: I've tried to upload a .tif image (made in MS Paint; I just put in basic text like "Test Sample 1 2"), but it seems it is not indexed: when I search for a word from it, nothing is found... Is there a step I'm missing?

For the scanner: when I click on the scanner menu entry, a pop-up appears, and when I click "scan and upload" it reports an error about trying to find the source data, etc. I assume I must indicate somewhere how to find the scanner (its IP?), but I have no idea where to configure this.

Re: Scanner and OCR

Posted: Mon Apr 02, 2012 12:08 pm
by pavila
Please attach the TIFF image so we can test it, or try it in the online demo.

Re: Scanner and OCR

Posted: Wed Apr 04, 2012 10:24 am
by lineac
That's weird, I tried it in the online demo and it doesn't work. Here's how I did it:
>Paint
>Write sample words (with keyboard, not drawing them)
>Save as TIF
>Upload it to OpenKM. It shows: uploading...100%, then Indexing...100%.
>Search: Content: one of the sample words in the file I just uploaded. Results: no file found.

Am I still missing something?

Re: Scanner and OCR

Posted: Thu Apr 05, 2012 4:29 am
by vkasgpta
lineac wrote: That's weird, I tried it in the online demo and it doesn't work. Here's how I did it:
>Paint
>Write sample words (with keyboard, not drawing them)
>Save as TIF
>Upload it to OpenKM. It shows: uploading...100%, then Indexing...100%.
>Search: Content: one of the sample words in the file I just uploaded. Results: no file found.

Am I still missing something?

OCR is not 100% foolproof; it's possible it recognized some characters as something else.

Re: Scanner and OCR

Posted: Thu Apr 05, 2012 7:53 am
by lineac
Is there a way to see which words the OCR recognized (and what OpenKM indexed)? So far I've only tested it by searching for content in the file; maybe there's a better way.
Also, it would be odd for the OCR not to recognize characters typed directly from the keyboard (and therefore perfectly formed), right?

Re: Scanner and OCR

Posted: Fri Apr 06, 2012 5:42 am
by vkasgpta
Well, if the document was computer text and not scanned, OpenKM recognizes it 100%, but if it was printed and then scanned, the OCR API/software takes over. It's not OpenKM's fault here. Yes, you look at the scan and it looks like a perfect computer font, but even then OCR can make errors! That's just how it is. Think of it like voice dialing before Apple's Siri. Siri is still buggy, but before it came out voice dialing was really terrible and unreliable...

I had the privilege of talking to a senior technician at Lexmark a few days ago when I was trying to get OCR to work.
According to him (not me), OCR was popular when it first came out, but people got really frustrated with it and cast it aside.
Since you have your scanner working, it would be advisable either to disable OCR or to have someone vet every document after OCR has run.
If you want to check what OpenKM got from the OCR, run the OCR software without OpenKM and see what it recognized; such tools usually have a previewer that shows what was recognized wrongly.
OpenKM does not have its own OCR, so whichever OCR you are using, check its guide if you are not sure how to use it separately :)

Maybe wait for a Siri or Google Voice of OCR? :)

Re: Scanner and OCR

Posted: Fri Apr 06, 2012 10:01 am
by lineac
vkasgpta wrote: well if the document was a computer text and not scanned, openkm recognizes it 100%

That's the case: it was created, saved as TIF and uploaded directly! Something must be wrong with OpenKM here; it can't be the OCR...

Re: Scanner and OCR

Posted: Fri Apr 06, 2012 10:27 am
by vkasgpta
Haha, touché! You seem to have cornered me on my statement, and you would technically win against me, BUT you don't win against the system :P
When you saved it as an image, it lost all the text that OpenKM could recognize. When you uploaded it, OpenKM simply invoked the OCR software you configured to read it. Remember, even clear computer text can give wrong results when read by OCR. I would still say OpenKM is not at fault here, because OpenKM is not the one translating an image into a computer-searchable document.

Try the experiment: use the same OCR you set up in OpenKM, get it to read the same document without OpenKM, and see what results it gives.
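
A minimal sketch of that experiment, assuming the engine is Tesseract and that its command-line tool is installed and on the PATH. It relies only on the basic `tesseract <image> <outputbase>` invocation, which writes the recognized text to `<outputbase>.txt`. The helper names and the file name `sample.tif` are hypothetical, and nothing here is OpenKM code.

```python
import re
import shutil
import subprocess
import tempfile
from pathlib import Path


def build_tesseract_command(image_path, output_base):
    """Command line for `tesseract <image> <outputbase>` (writes <outputbase>.txt)."""
    return ["tesseract", str(image_path), str(output_base)]


def run_ocr(image_path):
    """Return the text Tesseract recognizes, or None if the tool is not on PATH."""
    if shutil.which("tesseract") is None:
        return None
    with tempfile.TemporaryDirectory() as tmp:
        output_base = Path(tmp) / "ocr_result"
        # check=True raises CalledProcessError if Tesseract fails on the image.
        subprocess.run(
            build_tesseract_command(image_path, output_base),
            check=True,
            capture_output=True,
        )
        result_file = Path(tmp) / "ocr_result.txt"
        return result_file.read_text(encoding="utf-8", errors="replace")


def missing_words(ocr_text, expected_words):
    """Expected words that do not appear (case-insensitively) in the OCR text."""
    recognized = {word.lower() for word in re.findall(r"\w+", ocr_text)}
    return [word for word in expected_words if word.lower() not in recognized]
```

Calling `run_ocr("sample.tif")` and passing the result to `missing_words` along with the words typed into the image shows which of them the engine misread, independently of what OpenKM indexed.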

If that doesn't work, I remember reading somewhere in the forum that a user mentioned a text file that is created when OCR runs. Try searching the forums for it and see if you can find the file at the location mentioned there. If you get no results, just search for all .txt files in the OpenKM directory; it should be there. Open the text file and it'll show you what the OCR read.
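
The search for .txt files could be sketched like this. Whether OpenKM actually leaves such a file behind is only a recollection in this thread, and the directory to search is an assumption you would replace with your own installation path; `find_text_files` is a hypothetical helper name.

```python
from pathlib import Path


def find_text_files(root):
    """Return all .txt files below `root`, most recently modified first."""
    files = [path for path in Path(root).rglob("*.txt") if path.is_file()]
    return sorted(files, key=lambda path: path.stat().st_mtime, reverse=True)
```

Sorting by modification time puts files written right after an upload at the top of the list, which makes a freshly produced OCR result easier to spot if one exists.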

Re: Scanner and OCR

Posted: Sun Apr 15, 2012 12:27 pm
by pavila
Open source OCR engines like Cuneiform are good, but if you need more accurate character recognition you should go with commercial software like ABBYY.