Open Source Document Management System | OpenKM - Multiple OCR For Search Index Sources

#18572 by Netvoid
Sat Sep 29, 2012 10:01 pm

Any thoughts about supporting (or maybe it already does) multiple OCR engines on import, each exporting its text for indexing?

Upload Document / Image

Phase 1 Tesseract 3 OCR Process
Phase 2 Tesseract 2 OCR Process
Phase 3 Cuneiform OCR Process
Phase 4 ABBYY OCR Process
Phase 5 Generic OCR handler Process

This is an extreme example, but many OCR processors are strong in some areas/formats and weak in others. It would be nice if OpenKM supported multiple stages of OCR processing, maybe definable by file type. So a JPG would only go through Cuneiform, but a TIFF would go through 3 stages, and so on.
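
The per-file-type idea above could be sketched roughly as follows. This is only an illustration of the proposal, not an existing OpenKM feature: the mapping, the stage names, and the functions `stages_for` and `run_pipeline` are hypothetical, and the post does not say which three stages a TIFF would use.

```python
from pathlib import Path

# Hypothetical mapping from file extension to an ordered list of OCR stages.
# The names are labels for this sketch only, not OpenKM settings, and the
# three TIFF stages are an arbitrary pick.
OCR_STAGES = {
    ".jpg": ["cuneiform"],
    ".tiff": ["tesseract3", "tesseract2", "cuneiform"],
}
DEFAULT_STAGES = ["tesseract3"]


def stages_for(filename):
    """Return the OCR stages configured for a file, based on its extension."""
    ext = Path(filename).suffix.lower()
    return OCR_STAGES.get(ext, DEFAULT_STAGES)


def run_pipeline(filename, engines):
    """Run every configured stage in order and collect each engine's text.

    `engines` maps a stage name to a callable that takes a file name and
    returns the recognized text (a stand-in for a real OCR call).
    """
    return [engines[name](filename) for name in stages_for(filename)]
```

With this shape, adding an engine for one format is a one-line change to the mapping, while what to do with several text outputs per document is left open.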

#18577 by jllort
Sun Sep 30, 2012 10:25 am

Open source OCR works fine for some problems, but meeting specific compliance levels requires a commercial OCR. Your proposal would make configuration more complex than it is now, and it is not clear that you would get good results. Some environments would really need an OCR workflow to validate the results, etc.

If you need more accurate OCR results, I suggest a commercial OCR such as ABBYY for Linux, which we have tested with OpenKM with excellent results. Most major professional OCR solutions can be run from the command line (in which case they are compatible with OpenKM). You could also take a look at http://cognitiveforms.com/; they have interesting solutions too.
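
As a rough sketch of what "can be run from the command line" implies, an OCR tool can be driven through a command template with input and output placeholders. The template string, the placeholder names `${fileIn}` and `${fileOut}`, and the functions `build_command` and `run_ocr` are assumptions made for this example, not confirmed OpenKM configuration. The only external fact relied on is that `tesseract <image> <outputbase>` writes its text to `<outputbase>.txt`.

```python
import shlex
import subprocess
from string import Template

# Hypothetical command template; the placeholder names are chosen for this
# sketch. Any OCR tool with a command-line interface could be swapped in.
OCR_COMMAND = "tesseract ${fileIn} ${fileOut}"


def build_command(template, file_in, file_out):
    """Expand the template into an argument list without using a shell.

    The template is split first and substituted afterwards, so a path
    containing spaces stays a single argument.
    """
    args = shlex.split(template)
    return [Template(a).substitute(fileIn=file_in, fileOut=file_out) for a in args]


def run_ocr(template, file_in, file_out, timeout=600):
    """Run the OCR tool; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_command(template, file_in, file_out), check=True, timeout=timeout)
```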

#18663 by pavila
Thu Oct 11, 2012 6:54 am

I understand what you need, but how do you determine the best OCR output? If OCR #1 gives a bad text conversion and OCR #2 gives a better one, how can OpenKM determine which is best? Or should it add result #2 to result #1?
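
Both options raised here can be sketched in a few lines. Neither is something OpenKM is known to do: `merge_for_index` simply adds the words of every result together (which may be acceptable when the text only feeds a search index), and `best_by_vocabulary` uses a crude, assumed heuristic that prefers the output with the larger share of words found in a known vocabulary. All function names are hypothetical.

```python
import re


def tokens(text):
    """Lower-cased word tokens in order of first appearance, without duplicates."""
    seen = {}
    for word in re.findall(r"\w+", text.lower()):
        seen.setdefault(word, None)
    return list(seen)


def merge_for_index(results):
    """Union of the tokens of all OCR outputs (result #2 added to result #1)."""
    merged = {}
    for text in results:
        for word in tokens(text):
            merged.setdefault(word, None)
    return " ".join(merged)


def best_by_vocabulary(results, vocabulary):
    """Pick the output with the highest share of tokens present in `vocabulary`."""
    def score(text):
        words = tokens(text)
        return sum(w in vocabulary for w in words) / len(words) if words else 0.0
    return max(results, key=score)
```

Merging keeps misread words in the index alongside the correct ones, so it trades index noise for recall. The vocabulary score only works when a suitable word list for the documents' language exists.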

Also keep in mind that OCR can be an expensive task, depending on the number of pages to process. OCR of a 10-page document could take 3 to 5 minutes. If you have to pass it through another OCR engine, add 5 minutes for each one. 20 or 30 minutes to process a file is not desirable.
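
The arithmetic behind that estimate can be written out as a small sketch. It assumes one reading of the figures above: the first engine takes 3 to 5 minutes for a 10-page document, and each further engine adds 5 minutes, run one after another. The function name is hypothetical.

```python
def estimate_minutes(engines, first=(3, 5), extra=5):
    """Return a (low, high) estimate in minutes for running engines in sequence.

    `first` is the range for the first engine; every further engine adds
    `extra` minutes to both ends of the range.
    """
    if engines < 1:
        return (0, 0)
    added = extra * (engines - 1)
    return (first[0] + added, first[1] + added)
```

Under these assumptions, the five-phase example from the first post comes to `estimate_minutes(5)`, that is 23 to 25 minutes for a single 10-page file.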
