
Overview of OCR-Capabilities

Posted: Fri Apr 01, 2016 8:36 am
by Bummibaer
Hello,

I'm looking for an open-source CMS with OCR and stumbled upon OpenKM.
I'm an absolute beginner, so sorry for any silly questions...

The User Guide at http://wiki.openkm.com/index.php/User_Guide
shows an OCR/OMR menu button, but it does not appear in my GUI.

I've searched the forum and the wiki, but it is not clear to me which configuration is best for OCR in general.
First I saw in http://wiki.openkm.com/index.php/OCR that I have to set
system.ocr=...
but where do I set it?
And where do I find the property mentioned here:
"You need to modify the registered.text.extractors configuration property to match the OCR engine you have configured using system.ocr. By default only Cuneiform text extractor is enabled. If you want to configure Tesseract remove the Cuneiform extractor and add the Tesseract extractor."
Then I read:
"You can enable any of these text extractors adding it in the textFilterClasses param of the SearchIndex section in your repository.xml"
Where is the repository.xml?

How is content extracted from non-scanned files, e.g. PDF, FrameMaker and so on (Apache Tika?), and where can I
find it?
In the plugin search I found some "pdf to text" entries. Is there a recommended solution?

Thanks for any hint or pointer to a user guide for beginners,
Steffen

Re: Overview of OCR-Capabilities

Posted: Fri Apr 01, 2016 10:34 pm
by jllort
The wiki will no longer be maintained. It is a big mess, and we are migrating to a new format at docs.openkm.com. Unfortunately we have not yet finished migrating the current community documentation, so for now you will have to get by with the existing one.

The system.ocr parameter configures the OCR engine (we suggest Tesseract). Read:
http://wiki.openkm.com/index.php/Third- ... _Tesseract
http://wiki.openkm.com/index.php/Third- ... ation:_OCR

As for how text extraction is done: specific classes (TextExtractors) implement it. They are configured with the registered.text.extractors parameter (each class converts a document to text based on its MIME type, and you can extend them).
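
To make the "one extractor per MIME type" idea concrete, here is a minimal sketch in Python. It is not OpenKM code: the registry, the function names and the two toy extractors are all hypothetical, and only illustrate how a lookup by MIME type could choose the class that converts a document to text.

```python
from typing import Callable, Dict

# Hypothetical type: an extractor turns raw document bytes into indexable text.
Extractor = Callable[[bytes], str]


def extract_plain_text(data: bytes) -> str:
    """Decode plain text; undecodable bytes are replaced instead of raising."""
    return data.decode("utf-8", errors="replace")


def extract_csv(data: bytes) -> str:
    """Flatten CSV into space-separated words so it can be indexed."""
    return " ".join(extract_plain_text(data).replace(",", " ").split())


# Hypothetical stand-in for a list of registered extractors, keyed by MIME type.
REGISTERED_EXTRACTORS: Dict[str, Extractor] = {
    "text/plain": extract_plain_text,
    "text/csv": extract_csv,
}


def extract_text(mime_type: str, data: bytes) -> str:
    """Pick the extractor for the MIME type; return "" if none is registered."""
    extractor = REGISTERED_EXTRACTORS.get(mime_type)
    return extractor(data) if extractor else ""
```

Supporting a new format in this sketch means adding one more entry to the dictionary, which is roughly the role the registered list plays.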

In the PDF case, the class com.openkm.extractor.Tesseract3TextExtractor takes the images in the document and runs them through the OCR engine (this is transparent to you).
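
For illustration only, the "images go through the OCR engine" step might look roughly like the Python sketch below. It assumes the page images have already been pulled out of the PDF (that part is omitted) and that a `tesseract` binary is on the PATH. The helper names are hypothetical and this is not how Tesseract3TextExtractor is actually written.

```python
import subprocess
from pathlib import Path
from typing import Callable, Iterable, List, Optional


def build_tesseract_command(image: Path, output_base: Path, lang: str = "eng") -> List[str]:
    """Build the CLI call: tesseract <image> <outputbase> -l <lang>.

    Tesseract writes the recognised text to <outputbase>.txt.
    """
    return ["tesseract", str(image), str(output_base), "-l", lang]


def ocr_images(images: Iterable[Path],
               run: Optional[Callable[[List[str]], object]] = None) -> str:
    """OCR each image in order and join the recognised text with newlines."""
    if run is None:
        # Default runner: really invoke tesseract and fail loudly on errors.
        run = lambda cmd: subprocess.run(cmd, check=True)
    texts = []
    for image in images:
        output_base = image.with_suffix("")  # page1.png -> page1
        run(build_tesseract_command(image, output_base))
        texts.append(Path(str(output_base) + ".txt").read_text(encoding="utf-8"))
    return "\n".join(texts)
```

The `run` parameter exists only so the sketch can be tried without Tesseract installed by passing a stand-in function.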