Multiple OCR For Search Index Sources

Posted: Sat Sep 29, 2012 10:01 pm
by Netvoid
Any thoughts on supporting multiple OCR engines (or maybe OpenKM already does)? Each engine would run on import and export its text for indexing.

Upload Document / Image

Phase 1 Tesseract 3 OCR process
Phase 2 Tesseract 2 OCR process
Phase 3 Cuneiform OCR process
Phase 4 ABBYY OCR process
Phase 5 Generic OCR handler process

This is just an extreme example, but many OCR engines are strong in some areas or formats and weak in others. It would be nice if OpenKM supported multiple stages of OCR processing, perhaps definable by file type, so that JPG would only go through Cuneiform but TIFF would go through three stages, and so on.
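To make the idea concrete, here is a rough sketch of a per-file-type stage table. Everything in it is hypothetical: the stage names, `OCR_STAGES`, `stages_for` and `extract_text` are made up for illustration and are not OpenKM settings or APIs. Each engine is assumed to be a plain function that takes a file name and returns text.

```python
from pathlib import Path

# Hypothetical stage names; none of these are OpenKM configuration keys.
OCR_STAGES = {
    ".jpg": ["cuneiform"],
    ".tif": ["tesseract3", "tesseract2", "cuneiform"],
    ".tiff": ["tesseract3", "tesseract2", "cuneiform"],
}
DEFAULT_STAGES = ["generic"]


def stages_for(filename):
    """Return the ordered list of OCR stages for a file, based on its extension."""
    ext = Path(filename).suffix.lower()
    return list(OCR_STAGES.get(ext, DEFAULT_STAGES))


def extract_text(filename, engines):
    """Run every configured stage and join the non-empty outputs for indexing.

    `engines` maps a stage name to a callable taking a file name and
    returning the recognized text (an assumption made for this sketch).
    """
    parts = [engines[name](filename) for name in stages_for(filename)]
    return "\n".join(p for p in parts if p)
```

With this table a JPG goes through one stage and a TIFF through three, and any other file type falls back to the generic handler.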

Re: Multiple OCR For Search Index Sources

Posted: Sun Sep 30, 2012 10:25 am
by jllort
Open source OCR works fine for some problems, but some specific compliance levels need commercial OCR. Your proposal would make configuration more complex than it is now, and it is not clear that you would get good results. Really, some environments would need an OCR workflow to validate the results, etc.

If you need more accurate OCR results, I suggest a commercial OCR such as ABBYY for Linux, which we have tested in OpenKM with excellent results. Major professional OCR solutions can be run from the command line (and in that case are compatible with OpenKM). You could also take a look at http://cognitiveforms.com/; they have interesting solutions too.

Re: Multiple OCR For Search Index Sources

Posted: Thu Oct 11, 2012 6:54 am
by pavila
I understand what you need, but how do we determine the best OCR output? If OCR #1 gives a bad text conversion and OCR #2 gives a better one, how can OpenKM determine which is best? Or should it add result #2 to result #1?
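Both options can be sketched in a few lines. This is only an illustration under assumptions, not something OpenKM does: "best" is taken to mean the output with the highest share of words found in a known vocabulary, and "add" is taken to mean indexing the union of distinct words from all outputs. The function names and the vocabulary are hypothetical.

```python
import re


def tokens(text):
    """Lowercase alphabetic words; digits and punctuation split words."""
    return re.findall(r"[a-z]+", text.lower())


def score(text, vocabulary):
    """Fraction of tokens found in the vocabulary (0.0 for empty text)."""
    toks = tokens(text)
    if not toks:
        return 0.0
    return sum(t in vocabulary for t in toks) / len(toks)


def pick_best(outputs, vocabulary):
    """Option 1: keep only the output with the highest vocabulary score."""
    return max(outputs, key=lambda text: score(text, vocabulary))


def merge_for_index(outputs):
    """Option 2: index the distinct tokens of all outputs, in first-seen order."""
    seen = {}
    for text in outputs:
        for t in tokens(text):
            seen.setdefault(t, None)
    return list(seen)
```

A vocabulary score is a crude proxy: it favors documents in the vocabulary's language and says nothing about layout or numbers. Merging avoids choosing at all, but it also puts the misread words of the weaker engine into the index.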

Also keep in mind that OCR can be an expensive task, depending on the number of pages to process. OCR of a 10-page document could take 3 to 5 minutes. If you have to pass it through another OCR engine, add 5 minutes for each one. 20 or 30 minutes to process a file is not desirable.
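As a back-of-the-envelope check, the cost can be written down as a small function. It assumes the engines run one after another and that each engine's time grows linearly with the page count; both are simplifications, and `total_ocr_minutes` is a hypothetical helper.

```python
def total_ocr_minutes(pages, minutes_per_page_by_engine):
    """Total minutes when every engine processes every page sequentially.

    Assumes time scales linearly with page count, which is a simplification.
    """
    return sum(pages * rate for rate in minutes_per_page_by_engine)


# 5 minutes per 10 pages is 0.5 minutes per page for each engine.
one_engine = total_ocr_minutes(10, [0.5])
five_engines = total_ocr_minutes(10, [0.5] * 5)
```

Under those assumptions one engine needs 5 minutes for a 10-page file and five engines need 25 minutes, which is in the range mentioned above.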