Text Extractor Worker Cron
Posted: Fri Jan 20, 2017 4:02 pm
by openkm_user
Hi,
Tesseract (as we see in the logs) is always running and takes 20% or more of the CPU all the time. Can anyone explain more about this? As I mentioned in other threads, we are using OpenKM as a container and access all files and folders (read, upload, download and delete) through the REST API, searching by folder/document name.
Is it necessary to continue to run this Cron or can it be stopped?
Thanks!
Re: Text Extractor Worker Cron
Posted: Sat Jan 21, 2017 11:04 am
by jllort
It all depends on whether you want to query the contents of the documents. If you do not, you can disable the crontab task "text extractor worker". Alternatively, if you want to search by content but do not want OCR performed on the documents, simply disable system.ocr (set it to an empty value).
Re: Text Extractor Worker Cron
Posted: Mon Jan 23, 2017 8:01 am
by openkm_user
Thanks for the reply. system.ocr is already empty by default, or am I missing something?
Please also suggest whether we can disable other services that do not need to run in our case (we only read, write, download and delete through the REST API).
Re: Text Extractor Worker Cron
Posted: Wed Jan 25, 2017 7:59 am
by jllort
If system.ocr is empty, then OpenKM is certainly not executing any OCR engine. I do not understand how it is possible that you have Tesseract OCR running. Anyway, you can disable specific text extractors from Administration > Configuration parameters. Take a look at the registered.text.extractors parameter and remove the entries named like XXTesseractXX, XXCuneiformXX and XXOCRXX.
Re: Text Extractor Worker Cron
Posted: Wed Jan 25, 2017 8:10 am
by openkm_user
The property registered.text.extractors has the following values:
Code:
org.apache.jackrabbit.extractor.PlainTextExtractor
org.apache.jackrabbit.extractor.MsWordTextExtractor
org.apache.jackrabbit.extractor.MsExcelTextExtractor
org.apache.jackrabbit.extractor.MsPowerPointTextExtractor
org.apache.jackrabbit.extractor.OpenOfficeTextExtractor
org.apache.jackrabbit.extractor.RTFTextExtractor
org.apache.jackrabbit.extractor.HTMLTextExtractor
org.apache.jackrabbit.extractor.XMLTextExtractor
org.apache.jackrabbit.extractor.MsOutlookTextExtractor
com.openkm.extractor.PdfTextExtractor
com.openkm.extractor.AudioTextExtractor
com.openkm.extractor.ExifTextExtractor
com.openkm.extractor.Tesseract3TextExtractor
com.openkm.extractor.SourceCodeTextExtractor
com.openkm.extractor.MsOffice2007TextExtractor
Should I remove only com.openkm.extractor.Tesseract3TextExtractor from this list? I am not sure whether we need any of these extractors at all, for our case I mean.
Re: Text Extractor Worker Cron
Posted: Thu Jan 26, 2017 5:21 pm
by jllort
If you disable the extractor "com.openkm.extractor.Tesseract3TextExtractor", the OCR engine will not be executed during the text extraction process. (You do not need to disable anything else to turn this feature off; the other extractors can keep working.)
Re: Text Extractor Worker Cron
Posted: Fri Jan 27, 2017 3:12 pm
by openkm_user
Thank you!
Re: Text Extractor Worker Cron
Posted: Thu Feb 09, 2017 5:24 pm
by openkm_user
Hi,
I have disabled all the properties you mentioned, but it still shows this activity in Administration -> Statistics -> Text extraction queue.
Attachment: text.JPG (248.28 KiB)
Does this slow down document import when it is done through Administration -> Import?
Please advise!
Re: Text Extractor Worker Cron
Posted: Sat Feb 11, 2017 9:44 am
by jllort
The queue will not disappear until you process it or mark all documents as processed in the database.
The following query marks all documents as processed and clears the entire queue.
Code:
UPDATE OKM_NODE_DOCUMENT SET NDC_TEXT_EXTRACTED='T';
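Before running that update, it may help to see how many documents are still waiting in the queue. This is only a sketch: it reuses the table and column named in the query above and assumes that any value other than 'T' (or a missing value) means "not yet extracted". The value actually stored for pending documents is not stated in this thread.

```sql
-- Count documents not yet flagged as text-extracted.
-- Assumption: pending rows hold a value other than 'T', or NULL.
SELECT COUNT(*) AS pending_documents
FROM OKM_NODE_DOCUMENT
WHERE NDC_TEXT_EXTRACTED <> 'T'
   OR NDC_TEXT_EXTRACTED IS NULL;
```

If the count is non-zero, the UPDATE should bring it down to zero. Keep in mind that documents marked this way are treated as processed without their text having been extracted, so they would not be found by content searches.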