Text Extractor Worker Cron

 #43050  by openkm_user
 
Hi,

Tesseract (as we see in the logs) is always running and uses 20% or more of the CPU. Can anyone explain this? As I mentioned in other threads, we use OpenKM as a container and access all files and folders (read, upload, download, and delete) through the REST API, searching by folder and document name.

Is it necessary to continue to run this Cron or can it be stopped?

Thanks!
 #43059  by jllort
 
It all depends on whether you want to search the contents of the documents. If you do not, you can disable the crontab task "text extractor worker". Alternatively, if you want to search by content but do not want OCR performed on the documents, simply disable system.ocr (set it to an empty value).
 #43063  by openkm_user
 
Thanks for the reply, system.ocr is already empty by default, or am I missing something?

Please let us know whether we can disable other services that do not need to run in our case (we only read, write, download, and delete through the REST API).
 #43077  by jllort
 
If system.ocr is empty, then OpenKM is certainly not executing any OCR engine. I do not understand how you could have any Tesseract OCR process running. Anyway, you can disable specific text extractors from Administration > Configuration parameters. Take a look at the registered.text.extractors parameter and remove the entries named like XXTesseractXX, XXCuneiformXX, and XXOCRXX.
 #43079  by openkm_user
 
The property registered.text.extractors has the following values:
Code:
org.apache.jackrabbit.extractor.PlainTextExtractor
org.apache.jackrabbit.extractor.MsWordTextExtractor
org.apache.jackrabbit.extractor.MsExcelTextExtractor
org.apache.jackrabbit.extractor.MsPowerPointTextExtractor
org.apache.jackrabbit.extractor.OpenOfficeTextExtractor
org.apache.jackrabbit.extractor.RTFTextExtractor
org.apache.jackrabbit.extractor.HTMLTextExtractor
org.apache.jackrabbit.extractor.XMLTextExtractor
org.apache.jackrabbit.extractor.MsOutlookTextExtractor
com.openkm.extractor.PdfTextExtractor
com.openkm.extractor.AudioTextExtractor
com.openkm.extractor.ExifTextExtractor
com.openkm.extractor.Tesseract3TextExtractor
com.openkm.extractor.SourceCodeTextExtractor
com.openkm.extractor.MsOffice2007TextExtractor


Should I remove only com.openkm.extractor.Tesseract3TextExtractor from this list? I am not sure whether we need any of these extractors at all for our use case.
 #43098  by jllort
 
If you disable the extractor "com.openkm.extractor.Tesseract3TextExtractor", the OCR engine will not be executed during the text extraction process. You do not need to disable anything else to turn this feature off; the other extractors can keep working.
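
As a rough illustration of the advice above, the following Python sketch filters the posted registered.text.extractors value and drops any class whose name contains Tesseract, Cuneiform, or OCR. The helper name `without_ocr` and the marker tuple are assumptions made for this example, not part of OpenKM; the resulting list would still have to be pasted back into Administration > Configuration parameters by hand.

```python
# Value of registered.text.extractors as posted earlier in this thread.
REGISTERED = """
org.apache.jackrabbit.extractor.PlainTextExtractor
org.apache.jackrabbit.extractor.MsWordTextExtractor
org.apache.jackrabbit.extractor.MsExcelTextExtractor
org.apache.jackrabbit.extractor.MsPowerPointTextExtractor
org.apache.jackrabbit.extractor.OpenOfficeTextExtractor
org.apache.jackrabbit.extractor.RTFTextExtractor
org.apache.jackrabbit.extractor.HTMLTextExtractor
org.apache.jackrabbit.extractor.XMLTextExtractor
org.apache.jackrabbit.extractor.MsOutlookTextExtractor
com.openkm.extractor.PdfTextExtractor
com.openkm.extractor.AudioTextExtractor
com.openkm.extractor.ExifTextExtractor
com.openkm.extractor.Tesseract3TextExtractor
com.openkm.extractor.SourceCodeTextExtractor
com.openkm.extractor.MsOffice2007TextExtractor
""".split()

# Name fragments that identify OCR-backed extractors (taken from the advice
# in this thread; the tuple itself is just a convenience for this sketch).
OCR_MARKERS = ("tesseract", "cuneiform", "ocr")


def without_ocr(extractors):
    """Return the extractors whose simple class name contains no OCR marker."""
    kept = []
    for name in extractors:
        # Only look at the class name, not the package path.
        simple_name = name.rsplit(".", 1)[-1].lower()
        if not any(marker in simple_name for marker in OCR_MARKERS):
            kept.append(name)
    return kept


kept = without_ocr(REGISTERED)
print("\n".join(kept))
```

For the list posted in this thread, only com.openkm.extractor.Tesseract3TextExtractor matches, so the other fourteen entries are kept unchanged.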
 #43109  by openkm_user
 
Thank you!
 #43191  by openkm_user
 
Hi,

I have disabled all the properties you mentioned, but this activity still shows in Administration -> Statistics -> Text extraction queue.
[Attachment: text.JPG]
Does this slow down document import when it is run from Administration -> Import?

Please advise!
 #43210  by jllort
 
The queue will not disappear until you process it or mark all documents as processed in the database.
The following query marks all documents as processed and clears the entire queue.
Code:
UPDATE OKM_NODE_DOCUMENT SET NDC_TEXT_EXTRACTED='T';
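
Before running an unconditional UPDATE like the one above on a production database, it may help to count how many rows it will touch. The Python sketch below rehearses that against an in-memory SQLite table that only imitates OKM_NODE_DOCUMENT. The DOC_ID column, the sample rows, and the 'F' value for pending documents are assumptions made for illustration; the real schema and database engine will differ.

```python
import sqlite3

# In-memory stand-in for the OpenKM database (assumption: the real table has
# many more columns and lives in a different database engine).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE OKM_NODE_DOCUMENT ("
    " DOC_ID TEXT PRIMARY KEY,"  # hypothetical key column for this sketch
    " NDC_TEXT_EXTRACTED TEXT)"
)
conn.executemany(
    "INSERT INTO OKM_NODE_DOCUMENT VALUES (?, ?)",
    [("doc-a", "T"), ("doc-b", "F"), ("doc-c", "F")],  # 'F' = pending (assumed)
)


def pending_count(connection):
    """Count documents not yet marked as text-extracted."""
    row = connection.execute(
        "SELECT COUNT(*) FROM OKM_NODE_DOCUMENT"
        " WHERE NDC_TEXT_EXTRACTED IS NULL OR NDC_TEXT_EXTRACTED <> 'T'"
    ).fetchone()
    return row[0]


before = pending_count(conn)

# Same statement as in the post above; the 'with' block commits on success
# and rolls back if the statement raises.
with conn:
    conn.execute("UPDATE OKM_NODE_DOCUMENT SET NDC_TEXT_EXTRACTED='T'")

after = pending_count(conn)
print(f"pending before: {before}, after: {after}")
```

Note that marking documents as processed this way skips their text extraction entirely, so their contents will not be searchable.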
