Open Source Document Management System | OpenKM

OCR not working

#53854 by puce
Fri Sep 23, 2022 8:44 am

Hello,
After a Mozilla Firefox update, I lost the preview functionality.
I was running OpenKM 6.3.0.
I have now installed 6.3.11 and am testing it before going live.

Text recognition (OCR) is not working.

System:
- Ubuntu 22.04.1 LTS (GNU/Linux 5.15.0-48-generic x86_64)
- OpenKM 6.3.12 (build: a3587ce) With Community Extension
- Tesseract 4.1.1 leptonica-1.82.0

An uploaded document goes to the queue and then to "extraction in progress".

When the document is a PDF, I get this error.

In catalina.out:

2022-09-22 19:10:57,139 [Thread-6295] WARN  com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/Licences/2012_01_16 VMware Fusion 4.pdf': Too few text extracted
2022-09-22 19:10:57,143 [Thread-6295] INFO  c.o.extractor.TextExtractorWorker - processSerial.Working on {docUuid=a31e356d-c9a5-4eab-9f42-afcaa25bec32, docPath=/okm:root/Licences/2014_06_16 AssistiveWare activation.pdf, docVerUuid=ea6a1cb2-8ac3-4b6f-beaa-6ced3b6a6b31, date=Thu Sep 22 15:04:44 CEST 2022}
2022-09-22 19:10:57,278 [Thread-6295] WARN  com.openkm.util.ExecutionUtils - Abnormal program termination: 1
2022-09-22 19:10:57,278 [Thread-6295] WARN  com.openkm.util.ExecutionUtils - CommandLine: [/usr/bin/tesseract, /home/openkm/tomcat/temp/okm5678788871933443952.pdf, /home/openkm/tomcat/temp/okm7604132903759119539.txt]
2022-09-22 19:10:57,278 [Thread-6295] WARN  com.openkm.util.ExecutionUtils - STDERR: Tesseract Open Source OCR Engine v4.1.1 with Leptonica
Error in pixReadStream: Pdf reading is not supported
Error in pixRead: pix not read
Error during processing.

Converting the PDF from the terminal also fails:

openkm@okm-vm:~/Polet/Manuels$ tesseract '2018 Manual FM voiture.pdf' test -l fra
Tesseract Open Source OCR Engine v4.1.1 with Leptonica
Error in pixReadStream: Pdf reading is not supported
Error in pixRead: pix not read
Error during processing.
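
For reference, the "Pdf reading is not supported" message means that tesseract (through Leptonica) accepts image input only, so a PDF has to be rasterized before OCR can be tested from the terminal. The following is a minimal sketch, not how OpenKM itself handles PDFs. It assumes pdftoppm from poppler-utils is installed, which this thread does not mention; the names ocr_pdf, run and DRY_RUN are hypothetical helpers made up for illustration.

```shell
#!/bin/sh
# Sketch: OCR a PDF by rendering each page to PNG first, because tesseract
# only reads images. Assumes pdftoppm (poppler-utils) is on the PATH.

# Hypothetical helper: print the command instead of running it when
# DRY_RUN is set, so the steps can be inspected without the tools.
run() {
  if [ -n "${DRY_RUN:-}" ]; then printf '%s\n' "$*"; else "$@"; fi
}

# Hypothetical helper: ocr_pdf FILE.pdf [LANG] writes the text to stdout.
# The body runs in a subshell so variables and the trap stay local.
ocr_pdf() (
  set -e
  pdf=$1
  lang=${2:-fra}
  workdir=$(mktemp -d)
  trap 'rm -rf "$workdir"' EXIT
  # Render every page at 300 dpi: page-1.png, page-2.png, ...
  run pdftoppm -r 300 -png "$pdf" "$workdir/page"
  for page in "$workdir"/page-*.png; do
    [ -e "$page" ] || continue   # nothing rendered (or dry run)
    # tesseract appends ".txt" to the output base name itself
    run tesseract "$page" "${page%.png}" -l "$lang"
  done
  cat "$workdir"/page-*.txt 2>/dev/null || true
)

# Example (not executed here):
#   ocr_pdf '2018 Manual FM voiture.pdf' fra > manual.txt
```

Running it with DRY_RUN=1 only prints the pdftoppm command, which is a quick way to check the invocation before installing anything.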

When the document is a JPG, it also fails. In catalina.out:

2022-09-23 09:55:00,055 [Thread-6964] INFO  c.o.extractor.TextExtractorWorker - processSerial.Working on {docUuid=7977b55b-a787-474d-a314-d40415d72776, docPath=/okm:root/Test/test Bpost TVA import.jpg, docVerUuid=481b4a0b-2efa-4615-b6ec-835e453e2601, date=Fri Sep 23 09:50:54 CEST 2022}
2022-09-23 09:55:15,447 [Thread-6964] WARN  com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/Test/test Bpost TVA import.jpg': Too few text extracted

But from the terminal, text recognition on a JPG succeeds:

openkm@okm-vm:~$ tesseract bpostTest.jpg bpost.txt -l fra
Tesseract Open Source OCR Engine v4.1.1 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 625

System settings:

system.ocr String /usr/bin/tesseract ${fileIn} ${fileOut}
system.ocr.rotate String
system.pdf.force.ocr Boolean Inactive

Plug-in settings:

com.openkm.extractor.TextExtractor:

AbbyTextExtractor 	com.openkm.extractor.AbbyTextExtractor  Active
AudioTextExtractor 	com.openkm.extractor.AudioTextExtractor  Active
BarcodeTextExtractor 	com.openkm.extractor.BarcodeTextExtractor  Active
CuneiformTextExtractor 	com.openkm.extractor.CuneiformTextExtractor  Active
ExifTextExtractor 	com.openkm.extractor.ExifTextExtractor  Active
HTMLTextExtractor 	com.openkm.extractor.HTMLTextExtractor  Active
MsExcelTextExtractor 	com.openkm.extractor.MsExcelTextExtractor  Active
MsOffice2007TextExtractor 	com.openkm.extractor.MsOffice2007TextExtractor  Active
MsOutlookTextExtractor 	com.openkm.extractor.MsOutlookTextExtractor  Active
MsPowerPointTextExtractor 	com.openkm.extractor.MsPowerPointTextExtractor  Active
MsWordTextExtractor 	com.openkm.extractor.MsWordTextExtractor  Active
NativeMsExcelTextExtractor 	com.openkm.extractor.NativeMsExcelTextExtractor  Active
OOTextExtractor 	com.openkm.extractor.OOTextExtractor  Active
OpenOfficeTextExtractor 	com.openkm.extractor.OpenOfficeTextExtractor  Active
PdfTextExtractor 	com.openkm.extractor.PdfTextExtractor  Active 
PlainTextExtractor 	com.openkm.extractor.PlainTextExtractor  Active
RTFTextExtractor 	com.openkm.extractor.RTFTextExtractor  Active
SourceCodeTextExtractor 	com.openkm.extractor.SourceCodeTextExtractor  Active
Tesseract2TextExtractor 	com.openkm.extractor.Tesseract2TextExtractor  Active
Tesseract3TextExtractor 	com.openkm.extractor.Tesseract3TextExtractor Active
XMLTextExtractor 	com.openkm.extractor.XMLTextExtractor Active

Any advice, please?

Thank you,
Harold

Re: OCR not working

#53869 by jllort
Mon Oct 03, 2022 7:36 am

You must disable these two extractors:

AbbyTextExtractor 	com.openkm.extractor.AbbyTextExtractor  Active
CuneiformTextExtractor 	com.openkm.extractor.CuneiformTextExtractor  Active

From the terminal you ran tesseract with the French dictionary ("-l fra"). I suggest adding the same option at the end of the OpenKM configuration:

system.ocr String /usr/bin/tesseract ${fileIn} ${fileOut} -l fra

You can enable several dictionaries at the same time with "-l fra+spa+eng"; see the Tesseract documentation for details.
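
Before putting a multi-language value into system.ocr, it can help to confirm that every language in it is actually installed. The sketch below splits a spec such as "fra+spa+eng" and prints the languages missing from an installed list. The function name missing_langs is hypothetical; the only tesseract option it relies on is --list-langs, and the first line of that output is assumed to be a header.

```shell
#!/bin/sh
# Sketch: report which languages of a "-l" spec are not installed.

# Hypothetical helper: missing_langs SPEC INSTALLED
#   SPEC       language spec as passed to "-l", e.g. "fra+spa+eng"
#   INSTALLED  newline-separated list of installed language names
# The body runs in a subshell so the IFS change stays local.
missing_langs() (
  spec=$1
  installed=$2
  IFS='+'
  for lang in $spec; do
    # -x matches the whole line, -F treats the name as a fixed string
    printf '%s\n' "$installed" | grep -qxF -- "$lang" || printf '%s\n' "$lang"
  done
)

# Only query tesseract when it is available; skip the header line.
if command -v tesseract >/dev/null 2>&1; then
  installed=$(tesseract --list-langs 2>&1 | tail -n +2)
  missing_langs 'fra+spa+eng' "$installed"
fi
```

An empty result means every language in the spec was found. On Ubuntu, missing languages are usually provided by separate tesseract-ocr-<lang> packages.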

Re: OCR not working

#53935 by puce
Sun Nov 20, 2022 3:41 pm

Solved!
Thank you.
