Open Source Document Management System | OpenKM - Search from OCR pdf documents

Search from OCR pdf documents

#28628 by matt81
Wed May 21, 2014 1:47 am

Hi,
Is it possible to do a full-text search on scanned PDF documents using OCR? I have tried it and it doesn't work. Is there any way to enable this?

Thanks

Username

matt81

Rank

Expert Boarder

Posts

95

Joined

Fri Feb 28, 2014 5:09 am

Re: Search from OCR PDF documents

#28659 by pavila
Thu May 22, 2014 11:29 am

You need to install an OCR engine and integrate it with OpenKM. See http://wiki.openkm.com/index.php/Third- ... ation:_OCR

Username

pavila

Rank

Moderator

Posts

3143

Joined

Tue Dec 11, 2007 6:02 pm

Location

Alicante, Spain

Contact

Re: Search from OCR PDF documents

#28677 by matt81
Fri May 23, 2014 6:04 am

Thanks for your reply.
I did set up the OCR software. However, when I upload a PDF document I get the following error in the log file.

Code:

2014-05-23 04:35:00,022 [Thread-38] INFO  com.openkm.extractor.TextExtractorWorker - processSerial.Working on {docUuid=9da253b7-6344-4c6d-a317-a3c633222bda, docPath=/okm:root/scanned-pdf.pdf, docVerUuid=8ce6c2cb-dba8-461e-bf22-d0477aef4270, date=Fri May 23 04:34:50 UTC 2014}
2014-05-23 04:35:00,035 [Thread-38] WARN  com.openkm.extractor.PdfTextExtractor - PDF does not contains text layer
2014-05-23 04:35:02,198 [Thread-38] WARN  com.openkm.util.ExecutionUtils - Abnormal program termination: 1
2014-05-23 04:35:02,199 [Thread-38] WARN  com.openkm.util.ExecutionUtils - CommandLine: [/usr/bin/cuneiform, /home/tomcat-7/temp/Im77607184968132204728.jpg, -o, /home/tomcat-7/temp/okm5636227013585324756.txt]
2014-05-23 04:35:02,199 [Thread-38] WARN  com.openkm.util.ExecutionUtils - STDERR: PUMA_XFinalrecognition failed.

2014-05-23 04:35:02,201 [Thread-38] WARN  com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/scanned-pdf.pdf': Too few text extracted

Do you know why this would occur?

I set the following in the config:
system.ocr = /usr/bin/cuneiform ${fileIn} -o ${fileOut}
system.ocr.rotate = 90;180;270
and com.openkm.extractor.CuneiformTextExtractor is listed among the text extractors.

Thanks

Re: Search from OCR PDF documents

#28708 by matt81
Mon May 26, 2014 5:51 am

I tried to execute it from the command prompt and I got the same error. Do you know what the problem could be?
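
To reproduce the failure outside OpenKM, the command can be rebuilt from the `system.ocr` value the same way the `CommandLine:` log entry above shows it. The following is a minimal Python sketch, not OpenKM's own code: the helper names `build_ocr_command` and `run_ocr` are hypothetical, and it assumes the `${fileIn}` and `${fileOut}` placeholders are simply replaced by file paths.

```python
import shlex
import subprocess
from string import Template


def build_ocr_command(template: str, file_in: str, file_out: str) -> list[str]:
    """Split a system.ocr value into arguments, then fill in the placeholders.

    Splitting first keeps a path that contains spaces as a single argument.
    """
    return [
        Template(token).substitute(fileIn=file_in, fileOut=file_out)
        for token in shlex.split(template)
    ]


def run_ocr(argv: list[str]) -> tuple[int, str]:
    """Run the OCR command and return (exit code, stderr text)."""
    result = subprocess.run(argv, capture_output=True, text=True)
    return result.returncode, result.stderr


if __name__ == "__main__":
    # Hypothetical input and output paths; replace with a real image to test.
    argv = build_ocr_command(
        "/usr/bin/cuneiform ${fileIn} -o ${fileOut}",
        "/tmp/page.jpg",
        "/tmp/page.txt",
    )
    print(argv)
    print(run_ocr(argv))
```

If the OCR binary exits with a non-zero code here too, the problem lies with the OCR program or the image rather than with OpenKM.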

Re: Search from OCR PDF documents

#28727 by pavila
Mon May 26, 2014 3:46 pm

If the problem is in "cuneiform" itself, I recommend contacting the program's developer.

You can also try Tesseract, for example.

Re: Search from OCR PDF documents

#28765 by matt81
Fri May 30, 2014 12:17 am

Thanks for the reply.

I have tried Tesseract, and in my experience it is the best option out there, so I would recommend it to everyone. When I upload .jpg and .tif images, the OCR extraction works pretty well, although I noticed better results with .jpg. So far so good. The odd part is that when I upload a PDF file, it doesn't work, and the result comes out as Portuguese. See the attached image. I haven't set Portuguese as a language anywhere.

Do you know why that would happen? When OpenKM finds that a PDF file is an image, does it use Tesseract, or do you have something else set as the default?

Attachments

language.png (80.17 KiB)

Re: Search from OCR PDF documents

#28792 by jllort
Sat May 31, 2014 5:11 pm

Document language identification is done by another library, not by Tesseract, and it is not 100% accurate. We can confirm that, at present, Tesseract gives better results than cuneiform in most cases.

Username

jllort

Rank

Moderator

Posts

12160

Joined

Fri Dec 21, 2007 11:23 am

Location

Sineu - ( Illes Balears ) - Spain

Contact

Re: Search from OCR PDF documents

#28806 by matt81
Mon Jun 02, 2014 4:37 am

OK, so you are saying that for PDF files, cuneiform is used by default for text extraction? Why is that, given that I have specified Tesseract in the config file?
It works fine for .jpg and .tiff formats, so I don't understand why it doesn't work for a PDF that contains an image.

Is there any way to change it to Tesseract?
Do you have any suggestions?

Thanks

Re: Search from OCR PDF documents

#28840 by pavila
Tue Jun 03, 2014 7:33 am

If you want to use Tesseract, you have to modify the system.ocr configuration property and register com.openkm.extractor.Tesseract3TextExtractor in registered.text.extractors (and remove the Cuneiform one).
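
As an illustration of the extractor change only, here is a minimal Python sketch that swaps the Cuneiform entry for the Tesseract one. The function name `swap_ocr_extractor` is hypothetical, and it assumes the registered.text.extractors value holds one class name per line; the property itself is still edited in the OpenKM configuration, not through this code.

```python
CUNEIFORM = "com.openkm.extractor.CuneiformTextExtractor"
TESSERACT = "com.openkm.extractor.Tesseract3TextExtractor"


def swap_ocr_extractor(registered: str) -> str:
    """Replace the Cuneiform extractor with the Tesseract one.

    The Tesseract entry takes the position Cuneiform had. If neither is
    present, Tesseract is appended at the end. Duplicates are dropped.
    """
    result: list[str] = []
    for line in registered.splitlines():
        entry = line.strip()
        if not entry:
            continue  # ignore blank lines
        if entry == CUNEIFORM:
            entry = TESSERACT
        if entry not in result:
            result.append(entry)
    if TESSERACT not in result:
        result.append(TESSERACT)
    return "\n".join(result)


if __name__ == "__main__":
    current = "\n".join([
        "com.openkm.extractor.PdfTextExtractor",
        CUNEIFORM,
        "com.openkm.extractor.AudioTextExtractor",
    ])
    print(swap_ocr_extractor(current))
```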

Re: Search from OCR PDF documents

#28856 by matt81
Tue Jun 03, 2014 11:25 pm

Thanks for your reply.
Yes, those settings are already set to Tesseract. However, as I mentioned, when I upload a scanned PDF document, it is recognized as a different language (see the image attached to my previous post). When I scan the same document as .jpg or .tif, it works perfectly fine.
Do you know why it behaves like this? It is the same document in a different format, and it doesn't work!

Thanks

Re: Search from OCR PDF documents

#28858 by matt81
Wed Jun 04, 2014 4:03 am

Just to let you know I have enabled DEBUG for PdfTextExtractor in log4j.properties.
See log file when I upload a scanned PDF file:

Code:

2014-06-04 02:35:00,145 [Thread-14] INFO  com.openkm.extractor.TextExtractorWorker - processSerial.Working on {docUuid=0fea632e-5dd1-4245-b08f-d0db3c0a3815, docPath=/okm:root/new-pdf/scanned-pdf.pdf, docVerUuid=78804a6f-e0d4-4e41-b28e-1fa1e06e072d, date=Wed Jun 04 02:31:13 UTC 2014}
2014-06-04 02:35:00,315 [Thread-14] DEBUG com.openkm.extractor.PdfTextExtractor - TextStripped: ''
2014-06-04 02:35:00,315 [Thread-14] WARN  com.openkm.extractor.PdfTextExtractor - PDF does not contains text layer
2014-06-04 02:35:00,316 [Thread-14] DEBUG com.openkm.extractor.PdfTextExtractor - Writing image: /dir/lar/tomcat-7.0.27/temp/Im74825222203928531329.jpg
2014-06-04 02:35:11,914 [Thread-14] DEBUG com.openkm.extractor.PdfTextExtractor - OCR Extracted: m.3_=$ Ouﬁunzcau
3 9.32 oﬁ §naoa3_am<
4w? W55»: 2 7.moaon_d—QN% QUO max Sam _<_mEo,:3m SO 83 >53...»
dpmazo 3>4mz_mza
._.=_m OE Emu bun. 1.303
_mw.._wn we Ema: mo: >mm.u:mm Em: _rm~»3<wx_

Re: Search from OCR PDF documents

#28862 by pavila
Wed Jun 04, 2014 7:11 am

The PDF seems to be protected. Please attach it so we can test it.

Re: Search from OCR PDF documents

#28871 by matt81
Wed Jun 04, 2014 11:28 pm

There is no PDF attached?
I did enable full permissions, but I still get the same results.
I'm not sure how to get past this.

Thanks

Re: Search from OCR PDF documents

#28874 by baolinhtv
Thu Jun 05, 2014 2:21 am

My Tesseract 3.0 works well, but only with TIF files. When I check text extraction, JPG and scanned PDF files do not work.

Updated:
After I unchecked force OCR PDF and moved com.openkm.extractor.Tesseract3TextExtractor after com.openkm.extractor.PdfTextExtractor in the text.extractors key,
I can now run a full-text search on a DOCX file saved as PDF and on a scanned (image) PDF file.

Code:

org.apache.jackrabbit.extractor.PlainTextExtractor
org.apache.jackrabbit.extractor.MsWordTextExtractor
org.apache.jackrabbit.extractor.MsExcelTextExtractor
org.apache.jackrabbit.extractor.MsPowerPointTextExtractor
org.apache.jackrabbit.extractor.OpenOfficeTextExtractor
org.apache.jackrabbit.extractor.RTFTextExtractor
org.apache.jackrabbit.extractor.HTMLTextExtractor
org.apache.jackrabbit.extractor.XMLTextExtractor
org.apache.jackrabbit.extractor.PngTextExtractor
org.apache.jackrabbit.extractor.MsOutlookTextExtractor
com.openkm.extractor.PdfTextExtractor
com.openkm.extractor.Tesseract3TextExtractor
com.openkm.extractor.AudioTextExtractor
com.openkm.extractor.SourceCodeTextExtractor
com.openkm.extractor.MsOffice2007TextExtractor
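
The reordering described in this post can be sketched in a few lines of Python. This is only an illustration of the list change: `move_after` is a hypothetical helper, and how OpenKM actually picks an extractor from the list is not shown in this thread.

```python
PDF = "com.openkm.extractor.PdfTextExtractor"
TESSERACT = "com.openkm.extractor.Tesseract3TextExtractor"


def move_after(entries: list[str], item: str, anchor: str) -> list[str]:
    """Return a copy of entries with item placed directly after anchor."""
    if item not in entries or anchor not in entries:
        raise ValueError("both item and anchor must be in the list")
    reordered = [entry for entry in entries if entry != item]
    reordered.insert(reordered.index(anchor) + 1, item)
    return reordered


if __name__ == "__main__":
    extractors = [
        TESSERACT,
        PDF,
        "com.openkm.extractor.AudioTextExtractor",
    ]
    # One class name per line, ready to paste back into the config key.
    print("\n".join(move_after(extractors, TESSERACT, PDF)))
```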

Updated:
In the registered.text.extractors key I removed exifTextextractor, and I left the system.openoffice.dictionary key (the link to the .oxt file) empty. Now when I check text extraction, a scanned (image) PDF file works well, with about 80% accuracy for Vietnamese. JPG files work too. But when I save a DOCX file as PDF (Word 2010) and upload it to OpenKM, no text is extracted???

Code:

org.apache.jackrabbit.extractor.PlainTextExtractor
org.apache.jackrabbit.extractor.MsWordTextExtractor
org.apache.jackrabbit.extractor.MsExcelTextExtractor
org.apache.jackrabbit.extractor.MsPowerPointTextExtractor
org.apache.jackrabbit.extractor.OpenOfficeTextExtractor
org.apache.jackrabbit.extractor.RTFTextExtractor
org.apache.jackrabbit.extractor.HTMLTextExtractor
org.apache.jackrabbit.extractor.XMLTextExtractor
org.apache.jackrabbit.extractor.PngTextExtractor
org.apache.jackrabbit.extractor.MsOutlookTextExtractor
com.openkm.extractor.Tesseract3TextExtractor
com.openkm.extractor.PdfTextExtractor
com.openkm.extractor.AudioTextExtractor
com.openkm.extractor.SourceCodeTextExtractor
com.openkm.extractor.MsOffice2007TextExtractor

Last edited by baolinhtv on Thu Jun 05, 2014 4:05 am, edited 3 times in total.

Username

baolinhtv

Rank

Fresh Boarder

Posts

14

Joined

Mon May 26, 2014 8:26 am

Re: Search from OCR PDF documents

#28875 by baolinhtv
Thu Jun 05, 2014 2:55 am

matt81 wrote: Thanks for the reply.
I have tried with tesseract and in fact this is the best one out there. So I would recommend to everyone to use tesseract as it is the best.
When I upload .jpg and .tif images, the OCR extraction works pretty well, although I noticed better results for .jpg formats. So all good.
The interesting part is that when I upload a PDF file, it doesn't work and in fact it translates into Portuguese. See image attached. I haven't set Portuguese language anywhere.
Do you know why that would happen? What does OpenKM use when it finds that a PDF file is an image, does it use tesseract or do you guys have something else set as default?

That is because your config has system.ocr = ${fileIn} ${fileOut} -l esp. You change the language by editing the code after -l. I am Vietnamese, so I changed it to: ${fileIn} ${fileOut} -l vie. I hope this helps.
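
The language change suggested here can be shown as a small Python sketch. `set_ocr_language` is a hypothetical helper that only rewrites the text of a system.ocr value. It assumes `-l` is followed by a single language code, as with Tesseract's `-l` option, where the code must match an installed language data file (for example `eng` or `vie`).

```python
def set_ocr_language(ocr_value: str, lang: str) -> str:
    """Return ocr_value with the code after -l set to lang.

    If the value has no -l option yet, "-l <lang>" is appended.
    """
    tokens = ocr_value.split()
    if "-l" in tokens:
        position = tokens.index("-l")
        if position + 1 < len(tokens):
            tokens[position + 1] = lang  # replace the existing code
        else:
            tokens.append(lang)  # "-l" was the last token
    else:
        tokens += ["-l", lang]
    return " ".join(tokens)


if __name__ == "__main__":
    print(set_ocr_language("${fileIn} ${fileOut} -l esp", "vie"))
```

Note that this setting tells the OCR engine which language to expect. It is separate from the document language identification mentioned earlier in the thread, which is done by another library.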

Page 1 of 2
30 posts