Open Source Document Management System | OpenKM

Scanner and OCR

12 posts

Scanner and OCR

#15010 by lineac
Thu Mar 29, 2012 3:31 pm

Hi, me again : )

I would like to use the very handy multi-page scanning feature (or at least single-page scanning) together with OCR, but I have no idea where to start. I've read the official documentation about Tesseract integration and followed it, but after that I'm lost.
After some extensive research on the Internet, I gathered the following information:
-scanner must be TWAIN compliant;
-we must provide a scanner.jar or something to make it work;
-there's some code in this thread: http://forum.openkm.com/viewtopic.php?f=31&t=4795 that would allow multi-page scanning...

Beyond this, I have no idea what to do. First I tried to upload a .tif, but how do I run OCR on it so that the document gets indexed? And how can I link a scanner to OpenKM and make the scanner applet work?
Can anyone give me hints, links, or how-to procedures?

Username

lineac

Rank

Fresh Boarder

Posts

Joined

Fri Mar 16, 2012 4:09 pm

Re: Scanner and OCR

#15022 by jllort
Fri Mar 30, 2012 11:28 am

The current OpenKM UI scanner supports multi-page scanning.
If you have configured Tesseract (we suggest Cuneiform, which seems to give better results), then the text in images should be indexed.

Username

jllort

Rank

Moderator

Posts

12187

Joined

Fri Dec 21, 2007 11:23 am

Location

Sineu - ( Illes Balears ) - Spain

Contact

Re: Scanner and OCR

#15031 by pavila
Sat Mar 31, 2012 7:10 am

The minimum recommended scanning resolution for use with an OCR engine like Tesseract or Cuneiform is 300 dpi.
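
To make the 300 dpi figure concrete, here is a small arithmetic sketch in Python. It only converts between physical size, pixel count, and resolution; the function names and the A4 and 794-pixel figures are illustrative assumptions, not anything OpenKM or the OCR engines provide.

```python
MM_PER_INCH = 25.4


def pixels_for(size_mm: float, dpi: int) -> int:
    """Number of pixels needed to cover size_mm at the given resolution."""
    return round(size_mm / MM_PER_INCH * dpi)


def effective_dpi(pixels: int, size_mm: float) -> float:
    """Resolution you actually get when `pixels` are spread over size_mm."""
    return pixels * MM_PER_INCH / size_mm


def meets_minimum(pixels: int, size_mm: float, minimum_dpi: int = 300) -> bool:
    """True if the image is at least as dense as the recommended minimum."""
    return effective_dpi(pixels, size_mm) >= minimum_dpi


# An A4 page is 210 mm x 297 mm.
A4_WIDTH_MM, A4_HEIGHT_MM = 210, 297
print(pixels_for(A4_WIDTH_MM, 300), pixels_for(A4_HEIGHT_MM, 300))  # 2480 3508

# Hypothetical example: an image only 794 pixels wide, printed A4-wide,
# works out to roughly 96 dpi, far below the recommended minimum.
print(round(effective_dpi(794, A4_WIDTH_MM)))  # 96
```

In other words, a full A4 page scanned at the recommended resolution is roughly 2480 x 3508 pixels; an image with far fewer pixels for the same amount of text gives the OCR engine much less to work with.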

Username

pavila

Rank

Moderator

Posts

3145

Joined

Tue Dec 11, 2007 6:02 pm

Location

Alicante, Spain

Contact

Re: Scanner and OCR

#15045 by lineac
Sat Mar 31, 2012 6:41 pm

For the OCR: I've tried to upload a .tif image (drawn in MS Paint, with just some basic text like "Test Sample 1 2"), but it does not seem to be indexed: when I search for a word that appears in it, nothing is found... Is there a step I'm missing?

For the scanner: when I click on the scanner menu entry, a pop-up appears, and when I click "scan and upload" it shows an error about trying to find the source data, etc. I assume I must indicate somewhere where to find the scanner (its IP?), but I have no idea where to configure this.

Re: Scanner and OCR

#15054 by pavila
Mon Apr 02, 2012 12:08 pm

Please attach the TIFF image so we can test it, or try it in the online demo.

Re: Scanner and OCR

#15078 by lineac
Wed Apr 04, 2012 10:24 am

That's weird, I tried it in the online demo and it doesn't work. Here's how I did it:
>Paint
>Write sample words (typed with the keyboard, not drawn)
>Save as TIF
>Upload it to OpenKM. It shows: Uploading...100%, then Indexing...100%.
>Search by content for one of the sample words in the file I just uploaded. Result: no file found.

Am I still missing something?

Re: Scanner and OCR

#15080 by vkasgpta
Thu Apr 05, 2012 4:29 am

lineac wrote: That's weird, I tried it in the online demo and it doesn't work. Here's how I did it:
>Paint
>Write sample words (typed with the keyboard, not drawn)
>Save as TIF
>Upload it to OpenKM. It shows: Uploading...100%, then Indexing...100%.
>Search by content for one of the sample words in the file I just uploaded. Result: no file found.

Am I still missing something?

OCR is not 100% foolproof; it's possible it recognized some characters as something else.
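
A small Python sketch of why a single misread character can make a content search come back empty. The OCR output below is invented for illustration (a "1" read in place of an "l"); it is not what OpenKM, Tesseract, or Cuneiform actually produced for the file in this thread. The fuzzy comparison uses the standard library's `difflib`.

```python
import difflib


def exact_hit(word, ocr_words):
    """True if the word appears verbatim (ignoring case) in the OCR output."""
    return word.lower() in (w.lower() for w in ocr_words)


def closest_ocr_word(word, ocr_words, cutoff=0.8):
    """Return the OCR word most similar to `word`, or None if nothing is close."""
    lowered = [w.lower() for w in ocr_words]
    matches = difflib.get_close_matches(word.lower(), lowered, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical OCR result for an image containing "Test Sample 1 2".
ocr_words = "Test Samp1e 1 2".split()

print(exact_hit("Test", ocr_words))           # True
print(exact_hit("Sample", ocr_words))         # False: the index holds "Samp1e"
print(closest_ocr_word("Sample", ocr_words))  # samp1e
```

Searching for several different words from the image, rather than only one, is a quick way to tell "OCR misread one word" apart from "nothing was indexed at all".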

Username

vkasgpta

Rank

Senior Boarder

Posts

Joined

Sun Feb 26, 2012 12:37 pm

Re: Scanner and OCR

#15082 by lineac
Thu Apr 05, 2012 7:53 am

Is there a way to see which words the OCR recognized (and what OpenKM indexed)? I have only tested it by searching for content in the file; maybe there's a better way.
Also, it would be odd for OCR not to recognize characters typed directly from the keyboard (and therefore perfectly formed), right?

Re: Scanner and OCR

#15106 by vkasgpta
Fri Apr 06, 2012 5:42 am

Well, if the document is computer text and not a scan, OpenKM recognizes it 100%, but if it was printed and then scanned, the OCR API/software takes over. It's not OpenKM's fault here. Yes, you look at the scan and it looks like a perfect computer font, but even then OCR can make errors! That's just how it is. Think of it like voice dialing before Apple's Siri. Siri is still buggy, but before it came out voice dialing was really terrible and unreliable...

I had the privilege of talking to a senior technician at Lexmark a few days ago when I was trying to get OCR to work.
According to him (not me), OCR was popular when it first came out, but people got really frustrated with it and cast it aside.
Since you have your scanner working, it would be advisable to either disable OCR or get someone to vet every document after OCR has run.
If you want to check what OpenKM got from the OCR, run the OCR software without OpenKM and see what it read; such tools usually have a previewer that will show you what was recognized incorrectly.
OpenKM does not have its own OCR, so whichever OCR you are using, check its guide if you are not sure how to use it separately.

Maybe wait for a Siri or Google Voice of OCR?

Re: Scanner and OCR

#15112 by lineac
Fri Apr 06, 2012 10:01 am

vkasgpta wrote: well if the document was a computer text and not scanned, openkm recognizes it 100%

That's the case: it was created, saved as TIF and uploaded directly! Something must be wrong with OpenKM here; it can't be the OCR...

Re: Scanner and OCR

#15113 by vkasgpta
Fri Apr 06, 2012 10:27 am

Haha, touché! You seem to have cornered me on my statement, and technically you would win against me, BUT you don't win against the system.

When you saved it as an image, it lost all the text that OpenKM could recognize. When you uploaded it to OpenKM, OpenKM simply invoked the OCR software you configured to read it. Remember, even clean computer text can give wrong results when read by OCR. I would still say OpenKM is not at fault here, because OpenKM is not the one translating an image into a computer-searchable document.

Try the experiment: use the same OCR you configured in OpenKM, have it read the same document without OpenKM, and see what results it gives.
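
The experiment can be sketched in Python roughly as follows, assuming the configured engine is Tesseract and that the `tesseract` binary is on the PATH. Tesseract's command line takes an image and an output base name and writes the recognized text to `<output base>.txt`. The file names `sample.tif` and `sample_ocr` are placeholders, and the helper functions are illustrative, not part of OpenKM.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Build the command line: tesseract <image> <output base> -l <language>."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def run_ocr(image_path, output_base, lang="eng"):
    """Run Tesseract outside OpenKM and return the text it recognized."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on the PATH")
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    # Tesseract appends ".txt" to the output base name.
    return Path(f"{output_base}.txt").read_text(encoding="utf-8", errors="replace")


# Example use (placeholder file names):
#   text = run_ocr("sample.tif", "sample_ocr")
#   print(repr(text))
```

If the text printed here is already empty or garbled, the problem lies with the OCR engine or the image itself rather than with OpenKM's indexing.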

If that doesn't work, I remember reading somewhere in the forum that a user mentioned a text file created when OCR runs. Try searching the forums for it and see if you can find a file at the location given there. If you get no results, just search for all .txt files in the OpenKM directory; it should be there... Open the text file and it will show you what the OCR read.
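
A minimal Python sketch of the "search for all .txt files" step. Whether OpenKM actually leaves such a file behind is only something recalled from another forum post, and the directory to search is an assumption that depends on the installation, so both the path and the helper name below are placeholders.

```python
from pathlib import Path


def find_text_files(root, newest_first=True):
    """Recursively list .txt files under root, most recently modified first."""
    files = [p for p in Path(root).rglob("*.txt") if p.is_file()]
    files.sort(key=lambda p: p.stat().st_mtime, reverse=newest_first)
    return files


# Example use (replace the placeholder with the real installation directory):
#   for path in find_text_files("/path/to/openkm")[:20]:
#       print(path)
```

Sorting by modification time puts files written right after the test upload at the top of the list, which makes a freshly created OCR output easier to spot.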

Re: Scanner and OCR

#15222 by pavila
Sun Apr 15, 2012 12:27 pm

Open source OCR engines like Cuneiform are good, but if you need more accurate character recognition you should go with commercial software like ABBYY.
