Open Source Document Management System | OpenKM

Text Extractor Worker Cron

#43050 by openkm_user
Fri Jan 20, 2017 4:02 pm

Hi,

Tesseract is always running (we see it in the logs) and constantly uses 20% or more of the CPU. Can anyone explain why? As I mentioned in other threads, we use OpenKM as a container and access all files and folders (read, upload, download and delete) through the REST API, searching by folder and document name.

Does this cron task need to keep running, or can it be stopped?

Thanks!

Re: Text Extractor Worker Cron

#43059 by jllort
Sat Jan 21, 2017 11:04 am

It all depends on whether you want to search the contents of the documents. If you do not, you can disable the crontab task "text extractor worker". Alternatively, if you want to search by content but do not want OCR performed on the documents, simply disable system.ocr (set it to an empty value).

Re: Text Extractor Worker Cron

#43063 by openkm_user
Mon Jan 23, 2017 8:01 am

Thanks for the reply. system.ocr is already empty by default, or am I missing something?

Please let us know whether there are other services we can disable because they do not need to run in our case (we only read, write, download and delete through the REST API).

Re: Text Extractor Worker Cron

#43077 by jllort
Wed Jan 25, 2017 7:59 am

If system.ocr is empty, then OpenKM is certainly not executing any OCR engine. I do not understand how Tesseract OCR can be running in your case. Anyway, you can disable specific text extractors from Administration > Configuration parameters. Take a look at the registered.text.extractors parameter and remove the entries named like XXTesseractXX, XXCuneiformXX and XXOCRXX.

Re: Text Extractor Worker Cron

#43079 by openkm_user
Wed Jan 25, 2017 8:10 am

The property registered.text.extractors has the following values:

org.apache.jackrabbit.extractor.PlainTextExtractor
org.apache.jackrabbit.extractor.MsWordTextExtractor
org.apache.jackrabbit.extractor.MsExcelTextExtractor
org.apache.jackrabbit.extractor.MsPowerPointTextExtractor
org.apache.jackrabbit.extractor.OpenOfficeTextExtractor
org.apache.jackrabbit.extractor.RTFTextExtractor
org.apache.jackrabbit.extractor.HTMLTextExtractor
org.apache.jackrabbit.extractor.XMLTextExtractor
org.apache.jackrabbit.extractor.MsOutlookTextExtractor
com.openkm.extractor.PdfTextExtractor
com.openkm.extractor.AudioTextExtractor
com.openkm.extractor.ExifTextExtractor
com.openkm.extractor.Tesseract3TextExtractor
com.openkm.extractor.SourceCodeTextExtractor
com.openkm.extractor.MsOffice2007TextExtractor

Should I remove only com.openkm.extractor.Tesseract3TextExtractor from this list? I am not sure whether we need any of these extractors at all, for our case I mean.

Re: Text Extractor Worker Cron

#43098 by jllort
Thu Jan 26, 2017 5:21 pm

If you disable the extractor "com.openkm.extractor.Tesseract3TextExtractor", the OCR engine will not be executed during the text extraction process (you do not need to disable anything else to turn this feature off; the other extractors can keep working).
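
To make the change concrete, here is a minimal Python sketch that takes the registered.text.extractors values posted earlier in this thread and drops the OCR-based entries. The name fragments (Tesseract, Cuneiform, OCR) come from the advice above; the helper name `without_ocr` is only illustrative and is not part of OpenKM. The result would still have to be pasted back into the parameter by hand under Administration > Configuration parameters.

```python
# Values of registered.text.extractors as posted in this thread.
REGISTERED = """\
org.apache.jackrabbit.extractor.PlainTextExtractor
org.apache.jackrabbit.extractor.MsWordTextExtractor
org.apache.jackrabbit.extractor.MsExcelTextExtractor
org.apache.jackrabbit.extractor.MsPowerPointTextExtractor
org.apache.jackrabbit.extractor.OpenOfficeTextExtractor
org.apache.jackrabbit.extractor.RTFTextExtractor
org.apache.jackrabbit.extractor.HTMLTextExtractor
org.apache.jackrabbit.extractor.XMLTextExtractor
org.apache.jackrabbit.extractor.MsOutlookTextExtractor
com.openkm.extractor.PdfTextExtractor
com.openkm.extractor.AudioTextExtractor
com.openkm.extractor.ExifTextExtractor
com.openkm.extractor.Tesseract3TextExtractor
com.openkm.extractor.SourceCodeTextExtractor
com.openkm.extractor.MsOffice2007TextExtractor
""".split()

# Name fragments that identify OCR-based extractors (taken from the reply above).
OCR_MARKERS = ("tesseract", "cuneiform", "ocr")


def without_ocr(extractors):
    """Return the class names whose simple name contains no OCR marker."""
    kept = []
    for name in extractors:
        # Compare only the class name, not the package prefix.
        simple = name.rsplit(".", 1)[-1].lower()
        if not any(marker in simple for marker in OCR_MARKERS):
            kept.append(name)
    return kept


if __name__ == "__main__":
    # Print the new parameter value, one extractor per line.
    print("\n".join(without_ocr(REGISTERED)))
```

With the list from this thread, only com.openkm.extractor.Tesseract3TextExtractor is filtered out; the other fourteen extractors are kept.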

Re: Text Extractor Worker Cron

#43109 by openkm_user
Fri Jan 27, 2017 3:12 pm

Thank you!

Re: Text Extractor Worker Cron

#43191 by openkm_user
Thu Feb 09, 2017 5:24 pm

Hi,

I have disabled all the properties you mentioned, but it still shows this activity in Administration -> Statistics -> Text extraction queue.

[Attachment: text.JPG]

Does this slow down document import when it is done through Administration -> Import?

Please advise!

Re: Text Extractor Worker Cron

#43210 by jllort
Sat Feb 11, 2017 9:44 am

The queue will not disappear until you process it or mark all documents as processed in the database.
The following query marks all documents as processed and clears the entire queue.

UPDATE OKM_NODE_DOCUMENT SET NDC_TEXT_EXTRACTED='T';
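
Since the UPDATE touches every row, it can help to check how many documents are still pending before running it. Below is a rough Python sketch of that check, run against an in-memory SQLite table so that it is safe to try. This is an illustration only: the two-column table is a hypothetical stand-in for the real OKM_NODE_DOCUMENT table, the DOC_ID column is invented, and 'F' as the "not yet extracted" value is an assumption. Against a real installation you would connect to the actual OpenKM database instead.

```python
import sqlite3


def pending_count(conn):
    """Count documents not yet marked as text-extracted."""
    row = conn.execute(
        "SELECT COUNT(*) FROM OKM_NODE_DOCUMENT WHERE NDC_TEXT_EXTRACTED <> 'T'"
    ).fetchone()
    return row[0]


def mark_all_extracted(conn):
    """Run the UPDATE from the post above; return the number of rows touched."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute("UPDATE OKM_NODE_DOCUMENT SET NDC_TEXT_EXTRACTED='T'")
    return cur.rowcount


def demo():
    conn = sqlite3.connect(":memory:")
    # Hypothetical stand-in schema; the real table has many more columns.
    conn.execute(
        "CREATE TABLE OKM_NODE_DOCUMENT "
        "(DOC_ID TEXT PRIMARY KEY, NDC_TEXT_EXTRACTED TEXT)"
    )
    # Assumption: 'F' marks a document still waiting in the queue.
    conn.executemany(
        "INSERT INTO OKM_NODE_DOCUMENT VALUES (?, ?)",
        [("doc-1", "F"), ("doc-2", "F"), ("doc-3", "T")],
    )
    conn.commit()
    before = pending_count(conn)
    touched = mark_all_extracted(conn)
    after = pending_count(conn)
    conn.close()
    return before, touched, after


if __name__ == "__main__":
    print(demo())
```

In this toy table two documents are pending before the update and none afterwards. Note that the UPDATE rewrites all three rows, including the one that was already marked.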
