• Skipping OCR and Indexing for certain files

  • We tried to make OpenKM as intuitive as possible, but advice is always welcome.
 #43701  by alexkcy
 
Dear all,

I have been bulk-importing documents into OpenKM (6.3.2 CE).

From time to time, some PDF files cannot be OCRed, so their contents are not indexed. (I know this is a problem with Tesseract or ImageMagick, not with OpenKM.)

As a result, such files get stuck at the head of the text extraction queue and prevent other files from being processed.

Is there any way to exclude such problematic documents from text extraction while keeping them in the system?

Also, does anyone have tips for converting such documents so that their text can be extracted?

Thanks in advance.

Regards
Alex
 #43717  by jllort
 
Several options:
- Connect a paid OCR engine (the bad news is that these are usually limited to a certain number of pages per year). For example, you can integrate with ocr4linux (ABBYY) or another Windows solution (if you are on that OS). Community 6.3.3 should ship with ocr4linux support (if my memory does not fail me, the code changes needed to get ocr4linux working are already done).
- Make minimal changes to the source code to connect another OCR engine when the default one fails (it is really not very complex to do). If you want, I can give you some pointers for doing it.
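
The second option (fall back to another engine when the default fails) can be sketched roughly as below. This is only an illustration in Python, not OpenKM code: the `OcrEngine` type and the `extract_with_fallback` function are hypothetical names, and each engine is assumed to be a plain function that takes a file path and either returns text or raises an exception.

```python
from typing import Callable, Iterable, Optional

# Hypothetical type: an engine takes a file path and returns the extracted
# text, raising an exception when it cannot process the file.
OcrEngine = Callable[[str], str]


def extract_with_fallback(path: str, engines: Iterable[OcrEngine]) -> Optional[str]:
    """Try each engine in order; return the first non-empty text, or None."""
    for engine in engines:
        try:
            text = engine(path)
        except Exception:
            # This engine failed on the file; move on to the next one.
            continue
        if text and text.strip():
            return text
    # No engine produced any text for this file.
    return None
```

Returning `None` instead of raising lets the caller mark the file as "no text available" and continue with the next document.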
 #43721  by alexkcy
 
Thanks for your advice.

At the moment I will not use a paid external OCR, as I use OpenKM personally.

Perhaps I will go back to the office and use Adobe for OCR, adding a text layer to the documents I really want to keep.

After finishing some home projects, I will contribute to the project by adding a field to the table to 'mute' a document and changing the SQL statement used during text extraction to filter out muted documents, but that will be at least a few months from now.
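
To make the 'mute' idea concrete, here is a minimal sketch using Python and an in-memory SQLite database. The table name `document` and the columns `text_extracted` and `muted` are made up for illustration; they are assumptions and not OpenKM's actual schema or queries.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical table; OpenKM's real schema is different.
    conn.execute(
        "CREATE TABLE document ("
        " id INTEGER PRIMARY KEY,"
        " name TEXT NOT NULL,"
        " text_extracted INTEGER NOT NULL DEFAULT 0,"
        " muted INTEGER NOT NULL DEFAULT 0)"
    )


def mute(conn: sqlite3.Connection, doc_id: int) -> None:
    """Flag a document so the extraction queue ignores it."""
    conn.execute("UPDATE document SET muted = 1 WHERE id = ?", (doc_id,))


def pending_extraction(conn: sqlite3.Connection, limit: int = 10):
    """Return (id, name) rows still waiting for extraction, skipping muted ones."""
    rows = conn.execute(
        "SELECT id, name FROM document"
        " WHERE text_extracted = 0 AND muted = 0"
        " ORDER BY id LIMIT ?",
        (limit,),
    )
    return rows.fetchall()
```

With this kind of filter, a muted document stays in the system but no longer blocks the head of the queue.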

Of course, I would be happy to see Open Document Management System S.L. change the source code for such a minor amendment ^^.

Thanks again.

Alex
 #43740  by alexkcy
 
jllort,

I have come across the following script, which adds a text layer to a PDF.

https://gist.github.com/wcaleb/7337097

In fact, I could OCR the files using the above script, but not from within OpenKM.

Therefore, I suspect the cause is related to the execution timeout option. But when importing tens of thousands of files, increasing the execution timeout is not practical. Anyway...

Alex
 #43747  by jllort
 
You really can execute the script from OpenKM; there is no problem with that, since it works the same way as executing the convert tool, etc. You must create a text extractor for it or modify the existing one. If you want to know which classes to look at, tell me and I will try to describe the parts you might be interested in.
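
As a rough illustration of running an external script with a time limit, so that one bad file is skipped instead of blocking the queue, here is a small Python sketch. It is not OpenKM's text extractor API (OpenKM itself is written in Java); `run_extractor` is a hypothetical helper, and the command passed to it is a placeholder for whatever OCR script you use.

```python
import subprocess
import sys
from typing import Optional, Sequence


def run_extractor(command: Sequence[str], timeout_seconds: float) -> Optional[str]:
    """Run an external command; return its stdout, or None on failure or timeout."""
    try:
        result = subprocess.run(
            list(command),
            capture_output=True,      # collect stdout and stderr
            text=True,                # decode output as text
            timeout=timeout_seconds,  # the child is killed if it runs too long
            check=True,               # a non-zero exit status raises an error
        )
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError, OSError):
        # Timed out, exited with an error, or could not be started:
        # report "no text" so the caller can move on to the next file.
        return None
    return result.stdout


if __name__ == "__main__":
    # Demo with the Python interpreter standing in for a real OCR command.
    print(run_extractor([sys.executable, "-c", "print('extracted text')"], 30))
```

The design choice here is to treat a timeout like any other failure, which keeps the per-file timeout short without stalling a large import.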
