• OCR not working

  • OpenKM has many interesting features, but requires some configuration process to show its full potential.
 #53854  by puce
 
Hello,
After a Mozilla Firefox update, I lost the preview functionality.
I was running OpenKM 6.3.0.
I have now installed 6.3.11 and am testing it before going live.

Text recognition is not working.

System :
- Ubuntu 22.04.1 LTS (GNU/Linux 5.15.0-48-generic x86_64)
- OpenKM 6.3.12 (build: a3587ce) With Community Extension
- Tesseract 4.1.1 leptonica-1.82.0

An uploaded document goes into the queue and then to "extraction in progress".

When the document is a PDF, I get this error in catalina.out:
2022-09-22 19:10:57,139 [Thread-6295] WARN  com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/Licences/2012_01_16 VMware Fusion 4.pdf': Too few text extracted
2022-09-22 19:10:57,143 [Thread-6295] INFO  c.o.extractor.TextExtractorWorker - processSerial.Working on {docUuid=a31e356d-c9a5-4eab-9f42-afcaa25bec32, docPath=/okm:root/Licences/2014_06_16 AssistiveWare activation.pdf, docVerUuid=ea6a1cb2-8ac3-4b6f-beaa-6ced3b6a6b31, date=Thu Sep 22 15:04:44 CEST 2022}
2022-09-22 19:10:57,278 [Thread-6295] WARN  com.openkm.util.ExecutionUtils - Abnormal program termination: 1
2022-09-22 19:10:57,278 [Thread-6295] WARN  com.openkm.util.ExecutionUtils - CommandLine: [/usr/bin/tesseract, /home/openkm/tomcat/temp/okm5678788871933443952.pdf, /home/openkm/tomcat/temp/okm7604132903759119539.txt]
2022-09-22 19:10:57,278 [Thread-6295] WARN  com.openkm.util.ExecutionUtils - STDERR: Tesseract Open Source OCR Engine v4.1.1 with Leptonica
Error in pixReadStream: Pdf reading is not supported
Error in pixRead: pix not read
Error during processing.

Converting the file from the terminal also fails:

openkm@okm-vm:~/Polet/Manuels$ tesseract '2018 Manual FM voiture.pdf' test -l fra
Tesseract Open Source OCR Engine v4.1.1 with Leptonica
Error in pixReadStream: Pdf reading is not supported
Error in pixRead: pix not read
Error during processing.
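
The "Pdf reading is not supported" message above means this Tesseract/Leptonica build cannot take a PDF as input, so a terminal test on a PDF needs the pages rasterized first. The following is a minimal sketch of that idea, not something confirmed in this thread: it assumes `pdftoppm` (from poppler-utils) is installed, and the function names, the working directory, and the 300 dpi value are illustrative choices.

```python
import subprocess
from pathlib import Path


def rasterize_cmd(pdf, prefix, dpi=300):
    # pdftoppm writes one PNG per page, named <prefix>-<page>.png
    return ["pdftoppm", "-r", str(dpi), "-png", pdf, prefix]


def ocr_cmd(image, out_base, lang="fra"):
    # tesseract appends ".txt" to out_base on its own
    return ["tesseract", image, out_base, "-l", lang]


def ocr_pdf(pdf, workdir, lang="fra"):
    """Rasterize a PDF, OCR every page image, and return the joined text."""
    work = Path(workdir)
    work.mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterize_cmd(pdf, str(work / "page")), check=True)

    texts = []
    for png in sorted(work.glob("page-*.png")):
        out_base = str(png.with_suffix(""))  # e.g. ".../page-1"
        subprocess.run(ocr_cmd(str(png), out_base, lang), check=True)
        texts.append(Path(out_base + ".txt").read_text(encoding="utf-8"))
    return "\n".join(texts)
```

Calling `ocr_pdf("2018 Manual FM voiture.pdf", "/tmp/ocr-test")` would then run both tools in sequence. This only helps with testing from the terminal; how OpenKM itself hands PDFs to the OCR engine depends on its extractor configuration.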


When the document is a JPG it also fails. In catalina.out:
2022-09-23 09:55:00,055 [Thread-6964] INFO  c.o.extractor.TextExtractorWorker - processSerial.Working on {docUuid=7977b55b-a787-474d-a314-d40415d72776, docPath=/okm:root/Test/test Bpost TVA import.jpg, docVerUuid=481b4a0b-2efa-4615-b6ec-835e453e2601, date=Fri Sep 23 09:50:54 CEST 2022}
2022-09-23 09:55:15,447 [Thread-6964] WARN  com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/Test/test Bpost TVA import.jpg': Too few text extracted

But from the terminal, text recognition succeeds:

openkm@okm-vm:~$ tesseract bpostTest.jpg bpost.txt -l fra
Tesseract Open Source OCR Engine v4.1.1 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 625

System settings :

system.ocr String /usr/bin/tesseract ${fileIn} ${fileOut}
system.ocr.rotate String
system.pdf.force.ocr Boolean Inactive


Plug-in settings :

com.openkm.extractor.TextExtractor :
AbbyTextExtractor 	com.openkm.extractor.AbbyTextExtractor  Active
AudioTextExtractor 	com.openkm.extractor.AudioTextExtractor  Active
BarcodeTextExtractor 	com.openkm.extractor.BarcodeTextExtractor  Active
CuneiformTextExtractor 	com.openkm.extractor.CuneiformTextExtractor  Active
ExifTextExtractor 	com.openkm.extractor.ExifTextExtractor  Active
HTMLTextExtractor 	com.openkm.extractor.HTMLTextExtractor  Active
MsExcelTextExtractor 	com.openkm.extractor.MsExcelTextExtractor  Active
MsOffice2007TextExtractor 	com.openkm.extractor.MsOffice2007TextExtractor  Active
MsOutlookTextExtractor 	com.openkm.extractor.MsOutlookTextExtractor  Active
MsPowerPointTextExtractor 	com.openkm.extractor.MsPowerPointTextExtractor  Active
MsWordTextExtractor 	com.openkm.extractor.MsWordTextExtractor  Active
NativeMsExcelTextExtractor 	com.openkm.extractor.NativeMsExcelTextExtractor  Active
OOTextExtractor 	com.openkm.extractor.OOTextExtractor  Active
OpenOfficeTextExtractor 	com.openkm.extractor.OpenOfficeTextExtractor  Active
PdfTextExtractor 	com.openkm.extractor.PdfTextExtractor  Active 
PlainTextExtractor 	com.openkm.extractor.PlainTextExtractor  Active
RTFTextExtractor 	com.openkm.extractor.RTFTextExtractor  Active
SourceCodeTextExtractor 	com.openkm.extractor.SourceCodeTextExtractor  Active
Tesseract2TextExtractor 	com.openkm.extractor.Tesseract2TextExtractor  Active
Tesseract3TextExtractor 	com.openkm.extractor.Tesseract3TextExtractor Active
XMLTextExtractor 	com.openkm.extractor.XMLTextExtractor Active

Any advice, please?

Thank you,
Harold
 #53869  by jllort
 
You must disable these two extractors:
AbbyTextExtractor 	com.openkm.extractor.AbbyTextExtractor  Active
CuneiformTextExtractor 	com.openkm.extractor.CuneiformTextExtractor  Active
In the terminal you ran Tesseract with the French dictionary ("-l fra"). I suggest adding the same option at the end of the system.ocr value in the OpenKM configuration:
system.ocr String /usr/bin/tesseract ${fileIn} ${fileOut} -l fra
You can enable several dictionaries at the same time with "-l fra+spa+eng"; see the Tesseract documentation for details.
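
To make the difference concrete: the logged CommandLine shows Tesseract being called with only the input and output paths, so it runs with its default language (English), while the terminal test used "-l fra". The sketch below shows how a system.ocr template with a multi-language option would expand into an argument list. It is only an illustration: the way OpenKM really substitutes ${fileIn} and ${fileOut} is an assumption based on the log output, and the function names and file paths are hypothetical.

```python
def lang_option(languages):
    # Tesseract takes several languages joined with "+", e.g. "fra+spa+eng"
    return "+".join(languages)


def expand(template, file_in, file_out):
    # Assumed behaviour: split the template on whitespace, then substitute
    # the two placeholders, giving an argv like the logged CommandLine
    argv = []
    for token in template.split():
        token = token.replace("${fileIn}", file_in)
        token = token.replace("${fileOut}", file_out)
        argv.append(token)
    return argv


template = "/usr/bin/tesseract ${fileIn} ${fileOut} -l " + lang_option(["fra", "spa", "eng"])
print(expand(template, "/tmp/in.jpg", "/tmp/out"))
# ['/usr/bin/tesseract', '/tmp/in.jpg', '/tmp/out', '-l', 'fra+spa+eng']
```

Each language named after "-l" must have its traineddata file installed; `tesseract --list-langs` prints the ones available on the machine.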
