Open Source Document Management System | OpenKM

Posted: **Fri Jan 20, 2017 4:02 pm**

Hi,

Tesseract (as we see in the logs) is always running and uses 20% or more of the CPU all the time. Can anyone explain this? As I mentioned in other threads, we are using OpenKM as a container and access all files and folders (read, upload, download and delete) through the REST API, searching by folder/document name.

Is it necessary to continue to run this Cron or can it be stopped?

Thanks!

Posted: **Sat Jan 21, 2017 11:04 am**

It all depends on whether you want to query the contents of the documents. If you do not, you can disable the crontab task "text extractor worker". Alternatively, if you want to search by content but do not want OCR performed on the documents, simply disable system.ocr (set an empty value).

Posted: **Mon Jan 23, 2017 8:01 am**

Thanks for the reply, system.ocr is already empty by default, or am I missing something?

Please let us know whether we can disable other services that don't need to run in our case (accessing through the REST API to read, write, download and delete only).

Posted: **Wed Jan 25, 2017 7:59 am**

If system.ocr is empty, then OpenKM is certainly not executing any OCR engine, so I do not understand how you can have Tesseract OCR running. In any case, you can disable specific text extractors from Administration > Configuration parameters. Take a look at the registered.text.extractors parameter and remove the entries with names like XXTesseractXX, XXCuneiformXX and XXOCRXX.

Posted: **Wed Jan 25, 2017 8:10 am**

The property registered.text.extractors has the following values:

Code:

org.apache.jackrabbit.extractor.PlainTextExtractor
org.apache.jackrabbit.extractor.MsWordTextExtractor
org.apache.jackrabbit.extractor.MsExcelTextExtractor
org.apache.jackrabbit.extractor.MsPowerPointTextExtractor
org.apache.jackrabbit.extractor.OpenOfficeTextExtractor
org.apache.jackrabbit.extractor.RTFTextExtractor
org.apache.jackrabbit.extractor.HTMLTextExtractor
org.apache.jackrabbit.extractor.XMLTextExtractor
org.apache.jackrabbit.extractor.MsOutlookTextExtractor
com.openkm.extractor.PdfTextExtractor
com.openkm.extractor.AudioTextExtractor
com.openkm.extractor.ExifTextExtractor
com.openkm.extractor.Tesseract3TextExtractor
com.openkm.extractor.SourceCodeTextExtractor
com.openkm.extractor.MsOffice2007TextExtractor

Should I remove only com.openkm.extractor.Tesseract3TextExtractor from this list? I am not sure whether we need any of these extractors at all, for our case I mean.

Posted: **Thu Jan 26, 2017 5:21 pm**

If you disable the extractor "com.openkm.extractor.Tesseract3TextExtractor", the OCR engine will not be executed during the text extraction process. You do not need to disable anything else to turn this feature off; the other extractors can keep working.
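For anyone who wants to double-check the resulting value before pasting it back into registered.text.extractors, here is a minimal Python sketch. It is only an illustration: the class names are copied from the listing earlier in this thread, while the helper `without_ocr` and the constant names are hypothetical and not part of OpenKM. It simply drops every entry whose name contains Tesseract, Cuneiform or OCR.

```python
# Values copied from the registered.text.extractors listing in this thread.
REGISTERED_EXTRACTORS = """\
org.apache.jackrabbit.extractor.PlainTextExtractor
org.apache.jackrabbit.extractor.MsWordTextExtractor
org.apache.jackrabbit.extractor.MsExcelTextExtractor
org.apache.jackrabbit.extractor.MsPowerPointTextExtractor
org.apache.jackrabbit.extractor.OpenOfficeTextExtractor
org.apache.jackrabbit.extractor.RTFTextExtractor
org.apache.jackrabbit.extractor.HTMLTextExtractor
org.apache.jackrabbit.extractor.XMLTextExtractor
org.apache.jackrabbit.extractor.MsOutlookTextExtractor
com.openkm.extractor.PdfTextExtractor
com.openkm.extractor.AudioTextExtractor
com.openkm.extractor.ExifTextExtractor
com.openkm.extractor.Tesseract3TextExtractor
com.openkm.extractor.SourceCodeTextExtractor
com.openkm.extractor.MsOffice2007TextExtractor
""".split()

# Substrings mentioned in the thread for spotting OCR-related extractors.
OCR_MARKERS = ("Tesseract", "Cuneiform", "OCR")


def without_ocr(extractors):
    """Return the class names that contain none of the OCR markers (hypothetical helper)."""
    return [
        name
        for name in extractors
        if not any(marker in name for marker in OCR_MARKERS)
    ]


if __name__ == "__main__":
    # Print one class name per line, ready to paste into the parameter.
    print("\n".join(without_ocr(REGISTERED_EXTRACTORS)))
```

With the list above, only the Tesseract3TextExtractor entry is dropped and the other extractors stay in their original order.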

Posted: **Fri Jan 27, 2017 3:12 pm**

Thank you!

Posted: **Thu Feb 09, 2017 5:24 pm**

Hi,

I have disabled all the properties you mentioned, but it still shows this activity in Administration -> Statistics -> Text extraction queue.

Attachment: text.JPG (248.28 KiB)

Does this slow down document import when it is done through Administration -> Import?

Please advise!

Posted: **Sat Feb 11, 2017 9:44 am**

The queue will not disappear until you process it or mark all documents as processed in the database.
The following query marks all documents as processed and clears the entire queue.

Code:

UPDATE OKM_NODE_DOCUMENT SET NDC_TEXT_EXTRACTED='T';
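
If you want to see how many documents are still pending before running the update, something along these lines could help. This is a rough Python sketch under stated assumptions, not an OpenKM tool: it uses an in-memory SQLite table as a stand-in for the real database, the NBS_UUID column and the sample rows are made up for the demo, and it assumes unprocessed rows hold 'F'. Only the table name OKM_NODE_DOCUMENT and the column NDC_TEXT_EXTRACTED come from the query above.

```python
import sqlite3


def mark_all_extracted(conn):
    """Count rows not yet marked 'T', then mark every row as extracted.

    Returns the number of rows that were pending before the update.
    """
    cur = conn.cursor()
    cur.execute(
        "SELECT COUNT(*) FROM OKM_NODE_DOCUMENT WHERE NDC_TEXT_EXTRACTED <> 'T'"
    )
    pending = cur.fetchone()[0]
    # Same statement as the query posted above.
    cur.execute("UPDATE OKM_NODE_DOCUMENT SET NDC_TEXT_EXTRACTED='T'")
    conn.commit()
    return pending


def demo():
    """Run the update against a toy table; the layout here is hypothetical."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE OKM_NODE_DOCUMENT ("
        "NBS_UUID TEXT PRIMARY KEY, NDC_TEXT_EXTRACTED TEXT)"
    )
    # Assumption: 'F' marks a document whose text has not been extracted yet.
    conn.executemany(
        "INSERT INTO OKM_NODE_DOCUMENT VALUES (?, ?)",
        [("doc-1", "F"), ("doc-2", "F"), ("doc-3", "T")],
    )
    pending_before = mark_all_extracted(conn)
    pending_after = conn.execute(
        "SELECT COUNT(*) FROM OKM_NODE_DOCUMENT WHERE NDC_TEXT_EXTRACTED <> 'T'"
    ).fetchone()[0]
    conn.close()
    return pending_before, pending_after


if __name__ == "__main__":
    print(demo())
```

Against a real installation you would connect with the driver for whatever database OpenKM is configured to use, and the count query alone is enough to check whether the queue is really empty afterwards.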

Text Extractor Worker Cron
