Open Source Document Management System | OpenKM - Skipping OCR and Indexing for certain file

Skipping OCR and Indexing for certain file

#43701 by alexkcy
Fri Apr 28, 2017 11:37 am

Dear all,

I have been bulk-importing documents into OpenKM (6.3.2 CE).

From time to time, some PDF files cannot be OCRed, so their contents are not indexed (I know this is a problem with tesseract or imagick, not with OpenKM).

Such files then get stuck at the head of the text extraction queue and prevent other files from being processed.

Is there any way to exclude such problematic documents from text extraction while keeping them in the system?

Also, does anyone have tips for converting such documents so that their text can be extracted?

Thanks in advance.

Regards
Alex

Username

alexkcy

Rank

Junior Boarder

Posts

Joined

Tue Feb 14, 2017 6:17 am

Re: Skipping OCR and Indexing for certain file

#43717 by jllort
Sat Apr 29, 2017 11:25 am

Several options:
- Connect a commercial OCR engine (the bad news is that these are usually limited to a certain number of pages per year). For example, you can integrate with ocr4linux (ABBYY) or another Windows solution (if you are on that OS). Community 6.3.3 should ship with ocr4linux support (if my memory does not fail me, the code changes needed to get ocr4linux working are already done).
- Make minimal changes to the source code to call another OCR engine when the default one fails (it really is not very complex). If you want, I can give you some pointers on how to do it.
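The second option (call another OCR engine when the default one fails) can be sketched roughly as follows. This is only an illustration of the fallback idea in Python, not OpenKM code: the `Engine` type and `extract_with_fallback` are hypothetical names, and each engine is assumed to be a callable that returns the extracted text or raises an exception.

```python
from typing import Callable, Optional, Sequence

# Hypothetical engine interface: takes a file path, returns the extracted
# text, and raises an exception when it cannot process the file.
Engine = Callable[[str], str]


def extract_with_fallback(path: str, engines: Sequence[Engine]) -> Optional[str]:
    """Try each engine in order and return the first non-empty result."""
    for engine in engines:
        try:
            text = engine(path)
        except Exception as exc:
            # Broad on purpose: any failure just means "try the next engine".
            name = getattr(engine, "__name__", repr(engine))
            print(f"{name} failed on {path}: {exc}")
            continue
        if text.strip():
            return text
    # Every engine failed or returned nothing; the caller decides what to do.
    return None
```

In practice the first entry would wrap the default engine and the second a slower or commercial one, so the expensive engine is only used for the files the default cannot handle.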

Username

jllort

Rank

Moderator

Posts

12048

Joined

Fri Dec 21, 2007 11:23 am

Location

Sineu - ( Illes Balears ) - Spain


Re: Skipping OCR and Indexing for certain file

#43721 by alexkcy
Sat Apr 29, 2017 1:48 pm

Thanks for your advice.

At the moment I will not use a paid external OCR engine, as I use OpenKM for personal purposes.

Perhaps when I am back at the office I will use Adobe for OCR and add a text layer to the documents I really want to keep.

After finishing some home projects, I will contribute to the project by adding a field to the table to 'mute' a document, and by changing the SQL statement used during text extraction to filter out muted documents, but that will be at least a few months from now.
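The 'mute' idea can be illustrated with a small, self-contained sketch. It uses an in-memory SQLite database, and the table and column names (`document`, `text_extracted`, `muted`) are hypothetical; they are not OpenKM's actual schema, which would have to be checked in the source.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical table: one row per stored document.
    conn.execute(
        """
        CREATE TABLE document (
            uuid           TEXT PRIMARY KEY,
            name           TEXT NOT NULL,
            text_extracted INTEGER NOT NULL DEFAULT 0,
            muted          INTEGER NOT NULL DEFAULT 0
        )
        """
    )


def mute(conn: sqlite3.Connection, uuid: str) -> None:
    """Exclude one document from text extraction without deleting it."""
    conn.execute("UPDATE document SET muted = 1 WHERE uuid = ?", (uuid,))


def pending_documents(conn: sqlite3.Connection) -> list:
    """Names of documents still waiting for extraction, skipping muted ones."""
    rows = conn.execute(
        "SELECT name FROM document "
        "WHERE text_extracted = 0 AND muted = 0 "
        "ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]
```

The muted document stays in the table (and so in the repository); only the query that feeds the extraction queue changes.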

Of course, I would be happy if Open Document Management System S.L. made this minor change to the source code ^^.

Thanks again.

Alex


Re: Skipping OCR and Indexing for certain file

#43740 by alexkcy
Wed May 03, 2017 12:21 pm

jllort,

I have come across the following script, which adds a text layer to a PDF.

https://gist.github.com/wcaleb/7337097

In fact, I can OCR the files using the above script, but not under OpenKM.

Therefore, I suspect the cause is related to the execution timeout option. But when importing tens of thousands of files, increasing the execution timeout is not practical. Anyway...
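One way to reason about the timeout trade-off: rather than raising the timeout for every file, keep it short and set aside whatever exceeds it, so that one slow document does not hold up the rest of the queue. The sketch below shows that pattern with Python's `subprocess`; `run_with_timeout` and `process_queue` are hypothetical helpers, not part of OpenKM, and the command is whatever OCR tool is being called.

```python
import subprocess
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def run_with_timeout(cmd: List[str], timeout_s: float) -> Optional[str]:
    """Run cmd and return its stdout, or None if it fails or times out."""
    try:
        result = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed when this expires
            check=True,         # a non-zero exit status raises CalledProcessError
        )
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError, OSError):
        return None
    return result.stdout


def process_queue(
    paths: Iterable[str],
    make_cmd: Callable[[str], List[str]],
    timeout_s: float,
) -> Tuple[Dict[str, str], List[str]]:
    """Process every path; collect failures instead of stopping at them."""
    extracted: Dict[str, str] = {}
    skipped: List[str] = []
    for path in paths:
        text = run_with_timeout(make_cmd(path), timeout_s)
        if text is None:
            skipped.append(path)  # retry later with a longer timeout or another tool
        else:
            extracted[path] = text
    return extracted, skipped
```

The skipped list can then be reprocessed separately, for example overnight with a much longer timeout, without affecting the bulk import.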

Alex


Re: Skipping OCR and Indexing for certain file

#43747 by jllort
Wed May 03, 2017 8:38 pm

You really can execute the script from OpenKM; there is no problem with that, since it works the same way as executing the convert tool, etc. You must create a text extractor for it or modify the existing one. If you want to know which classes to look at, tell me and I will try to describe the ones you might be interested in.
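The general shape of such an extractor (hand an input file to an external program, then read back the text it produced) can be sketched as follows. This is a Python illustration only: OpenKM's real text extractors are Java classes, and the `${fileIn}` / `${fileOut}` placeholder names and the `extract_text` function are assumptions made for this example.

```python
import subprocess
import tempfile
from pathlib import Path
from string import Template
from typing import Sequence


def extract_text(command_template: Sequence[str], file_in: str) -> str:
    """Run an external tool on file_in and return the text it writes.

    command_template is an argument list in which ${fileIn} and ${fileOut}
    (hypothetical placeholder names) are replaced with real paths.
    """
    with tempfile.TemporaryDirectory() as tmp:
        file_out = Path(tmp) / "out.txt"
        cmd = [
            Template(arg).safe_substitute(fileIn=file_in, fileOut=str(file_out))
            for arg in command_template
        ]
        # check=True turns a non-zero exit status into an exception, so a
        # failed run is never mistaken for an empty document.
        subprocess.run(cmd, check=True)
        return file_out.read_text(encoding="utf-8")
```

Passing the command as a list rather than a single shell string avoids quoting problems with file names that contain spaces.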

