• Overview of OCR-Capabilities

  • OpenKM has many interesting features, but requires some configuration process to show its full potential.
 #41537  by Bummibaer
 
Hello,

I'm looking for an open-source CMS with OCR and stumbled upon OpenKM.
I'm an absolute beginner, so sorry for any silly questions...

The User Guide (http://wiki.openkm.com/index.php/User_Guide)
shows an OCR/OMR menu button, but it is not in my GUI.

I've searched the forum and the wiki, but it is not clear to me which configuration is best for OCR in general.
First I saw in http://wiki.openkm.com/index.php/OCR that I have to set
system.ocr=...
but where?
And where is this one set:
You need to modify the registered.text.extractors configuration property to match the OCR engine you have configured using system.ocr. By default only Cuneiform text extractor is enabled. If you want to configure Tesseract remove the Cuneiform extractor and add the Tesseract extractor.
Then I read:
You can enable any of these text extractors adding it in the textFilterClasses param of the SearchIndex section in your repository.xml
Where is the repository.xml?

How is content extracted from non-scanned files, e.g. PDF, FrameMaker and so on (Apache Tika?), and where can I
find it?
In the plugin search I found some "pdf to text" entries. Is there a recommended solution?

Thanks for any hint or pointer to a user guide for beginners,
Steffen
 #41540  by jllort
 
The wiki will no longer be maintained; it is a big mess and we are migrating to a new format at docs.openkm.com. Unfortunately we have not yet finished migrating the current community documentation, so for now you will have to get by with the existing one.

The system.ocr property is used to configure the OCR engine (we suggest Tesseract). Read:
http://wiki.openkm.com/index.php/Third- ... _Tesseract
http://wiki.openkm.com/index.php/Third- ... ation:_OCR

As for how text extraction is done: we have specific classes (TextExtractors) that implement it. These are configured with the registered.text.extractors parameter (each class converts documents to text based on their MIME type, and you can extend it).

In the PDF case, the class com.openkm.extractor.Tesseract3TextExtractor takes the images in the document and runs them through the OCR engine (this is transparent to you).
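
To make the idea of extractors selected by MIME type more concrete, here is a rough sketch in Python. It is not OpenKM code: every name in it (ExtractorRegistry, plain_text_extractor, make_ocr_extractor, fake_ocr) is made up for illustration, and the OCR step is a stub standing in for whatever engine system.ocr points to.

```python
from typing import Callable, Dict, List

# Hypothetical type: an extractor takes raw document bytes and returns plain text.
Extractor = Callable[[bytes], str]


class ExtractorRegistry:
    """Maps MIME types to extractors (illustrative only, not the OpenKM API)."""

    def __init__(self) -> None:
        self._by_mime: Dict[str, List[Extractor]] = {}

    def register(self, mime_type: str, extractor: Extractor) -> None:
        # Several extractors may be registered for the same MIME type.
        self._by_mime.setdefault(mime_type, []).append(extractor)

    def extract(self, mime_type: str, data: bytes) -> str:
        # Run every extractor registered for this MIME type and join
        # the non-empty results; unknown MIME types yield an empty string.
        parts = [ex(data) for ex in self._by_mime.get(mime_type, [])]
        return "\n".join(p for p in parts if p)


def plain_text_extractor(data: bytes) -> str:
    # Plain text needs no OCR: decode the bytes directly.
    return data.decode("utf-8", errors="replace")


def make_ocr_extractor(ocr_engine: Callable[[bytes], str]) -> Extractor:
    """Wrap an OCR engine (a stand-in for Tesseract) as an extractor."""
    def extractor(data: bytes) -> str:
        return ocr_engine(data)
    return extractor


def fake_ocr(data: bytes) -> str:
    # Stub: a real setup would hand the images to the configured OCR binary.
    return "recognized:%d bytes" % len(data)


registry = ExtractorRegistry()
registry.register("text/plain", plain_text_extractor)
registry.register("application/pdf", make_ocr_extractor(fake_ocr))
```

Calling registry.extract("application/pdf", data) then goes through the OCR-backed extractor without the caller having to know about it, which is the sense in which the OCR step is transparent to the user.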
