• Multiple OCR For Search Index Sources

  • We tried to make OpenKM as intuitive as possible, but advice is always welcome.
Forum rules: Before asking something, please check the documentation wiki or use the forum's search feature. And remember that we don't have a crystal ball and aren't mind readers, so if you post about an issue, tell us which OpenKM version you are using, along with your browser and operating system versions. For more information, read How to Report Bugs Effectively.
 #18572  by Netvoid
 
Any thoughts about supporting (or maybe it already does) multiple OCR engines on import, each exporting its text for indexing?

Upload Document / Image

Phase 1 Tesseract 3 OCR Process
Phase 2 Tesseract 2 OCR Process
Phase 3 Cuneiform OCR Process
Phase 4 ABBYY OCR Process
Phase 5 Generic OCR handler Process

This is just an extreme example, but many OCR processors are strong in some areas and formats and weak in others. It would be nice if OpenKM supported multiple stages of OCR processing, perhaps definable by file type, so that JPG would only go through Cuneiform but TIFF would go through three stages, and so on.
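
To make the idea concrete, here is a rough sketch of what a per-file-type stage table could look like. This is not an existing OpenKM feature: the names STAGES_BY_EXTENSION, stages_for and run_stages are hypothetical, the engines are passed in as plain callables, and the three TIFF stages are only an assumption picked from the phases listed above.

```python
import os
from typing import Callable, Dict, List, Tuple

# An OCR engine is modelled as a callable: file path in, extracted text out.
OcrEngine = Callable[[str], str]

# Hypothetical mapping from file extension to an ordered list of engine names.
# Which three engines TIFF uses is an assumption made for illustration.
STAGES_BY_EXTENSION: Dict[str, List[str]] = {
    "jpg": ["cuneiform"],
    "tiff": ["tesseract3", "tesseract2", "cuneiform"],
}


def stages_for(path: str) -> List[str]:
    """Return the ordered engine names configured for this file's extension."""
    ext = os.path.splitext(path)[1].lower().lstrip(".")
    return STAGES_BY_EXTENSION.get(ext, [])


def run_stages(path: str, engines: Dict[str, OcrEngine]) -> List[Tuple[str, str]]:
    """Run every configured stage in order and collect (engine name, text) pairs."""
    return [(name, engines[name](path)) for name in stages_for(path)]
```

File types with no entry simply get no OCR stages here; what to do with the collected texts afterwards is a separate question.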
 #18577  by jllort
 
Open source OCR works fine for some problems, but certain compliance levels require commercial OCR. Your proposal would make configuration more complex than it currently is, and it is not clear that you would get good results. Really, some environments would need an OCR workflow to validate the results, etc.

If you need more accurate OCR results, I suggest a commercial OCR such as ABBYY for Linux, which we have tested in OpenKM with excellent results. Most major professional OCR solutions can be executed from the command line (in which case they are compatible with OpenKM). You could also take a look at http://cognitiveforms.com/; they have interesting solutions too.
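
As an illustration of the command-line approach, a minimal sketch follows. It assumes the OCR tool takes an input file and an output name as arguments (the Tesseract CLI works this way: tesseract imagename outputbase). The {file_in} and {file_out} placeholders and both function names are hypothetical and are not OpenKM's actual configuration syntax.

```python
import shlex
import subprocess
from typing import List


def build_ocr_command(template: str, file_in: str, file_out: str) -> List[str]:
    """Split the template into arguments first, then fill in the placeholders,
    so that paths containing spaces stay a single argument."""
    args = shlex.split(template)
    return [a.replace("{file_in}", file_in).replace("{file_out}", file_out) for a in args]


def run_ocr(template: str, file_in: str, file_out: str, timeout_s: int = 600) -> None:
    """Run the external OCR tool; raises if it exits non-zero or times out."""
    subprocess.run(build_ocr_command(template, file_in, file_out), check=True, timeout=timeout_s)
```

For example, build_ocr_command("tesseract {file_in} {file_out}", "/tmp/scan.tif", "/tmp/scan") gives the argument list for a Tesseract run. Swapping in another engine then only means changing the template string.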
 #18663  by pavila
 
I understand what you need, but how do we determine the best OCR output? If OCR #1 gives a bad text conversion and OCR #2 gives a better one, how can OpenKM determine which is best? Or should it add result #2 to result #1?
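
Both options can be sketched roughly. The code below is only an illustration, not something OpenKM does: merge_for_index adds the results together by indexing the union of tokens, and dictionary_score is one possible heuristic for ranking outputs, namely the fraction of tokens found in a known vocabulary. All names and the vocabulary are assumptions.

```python
import re
from typing import Iterable, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def merge_for_index(outputs: Iterable[str]) -> Set[str]:
    """Option A: index the union of tokens from every engine's output."""
    merged: Set[str] = set()
    for text in outputs:
        merged.update(tokenize(text))
    return merged


def dictionary_score(text: str, vocabulary: Set[str]) -> float:
    """Option B: score an output by the share of its tokens that are known words."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in vocabulary) / len(tokens)
```

Merging means a search can hit a word that only one engine read correctly, but it also puts every engine's misreads into the index. Scoring avoids that, but it depends on having a vocabulary that fits the documents.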

Also keep in mind that OCR can be an expensive task, depending on the number of pages to process. OCR of a 10-page document could take 3 to 5 minutes. If you have to pass it through another OCR engine, add 5 minutes for each one. Taking 20 or 30 minutes to process a file is not desirable.
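
The arithmetic can be written down as a back-of-the-envelope estimate. This sketch assumes the engines run one after another and that time scales linearly with page count; both are simplifications, and the function name is hypothetical.

```python
from typing import Tuple


def estimate_total_minutes(
    pages: int,
    engines: int,
    minutes_per_10_pages: Tuple[float, float] = (3, 5),
) -> Tuple[float, float]:
    """Return a (low, high) estimate in minutes for running every engine in sequence."""
    low, high = minutes_per_10_pages
    return (pages * engines * low / 10, pages * engines * high / 10)
```

With the five phases from the original example, estimate_total_minutes(10, 5) returns (15.0, 25.0), so a single 10-page document could take somewhere between 15 and 25 minutes.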
