
Search from OCR PDF documents

Posted: Wed May 21, 2014 1:47 am
by matt81
Hi,
Is it possible to do a full-text search on OCR-scanned PDF documents? I have tried and it doesn't work. Is there any way to enable this?

Thanks

Re: Search from OCR PDF documents

Posted: Thu May 22, 2014 11:29 am
by pavila
You need to install an OCR engine and integrate it with OpenKM. See http://wiki.openkm.com/index.php/Third- ... ation:_OCR

Re: Search from OCR PDF documents

Posted: Fri May 23, 2014 6:04 am
by matt81
Thanks for your reply.
I did set up the OCR software. However, when I upload a PDF document I get the following error in the log file.
Code:
2014-05-23 04:35:00,022 [Thread-38] INFO  com.openkm.extractor.TextExtractorWorker - processSerial.Working on {docUuid=9da253b7-6344-4c6d-a317-a3c633222bda, docPath=/okm:root/scanned-pdf.pdf, docVerUuid=8ce6c2cb-dba8-461e-bf22-d0477aef4270, date=Fri May 23 04:34:50 UTC 2014}
2014-05-23 04:35:00,035 [Thread-38] WARN  com.openkm.extractor.PdfTextExtractor - PDF does not contains text layer
2014-05-23 04:35:02,198 [Thread-38] WARN  com.openkm.util.ExecutionUtils - Abnormal program termination: 1
2014-05-23 04:35:02,199 [Thread-38] WARN  com.openkm.util.ExecutionUtils - CommandLine: [/usr/bin/cuneiform, /home/tomcat-7/temp/Im77607184968132204728.jpg, -o, /home/tomcat-7/temp/okm5636227013585324756.txt]
2014-05-23 04:35:02,199 [Thread-38] WARN  com.openkm.util.ExecutionUtils - STDERR: PUMA_XFinalrecognition failed.

2014-05-23 04:35:02,201 [Thread-38] WARN  com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/scanned-pdf.pdf': Too few text extracted
Do you know why this would occur?

I set the following in the config:
system.ocr = /usr/bin/cuneiform ${fileIn} -o ${fileOut}
system.ocr.rotate = 90;180;270
and added com.openkm.extractor.CuneiformTextExtractor to the text extractors.
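
For anyone reading along, here is a minimal Python sketch of how a template like the system.ocr value above could be expanded into the argument list that shows up in the "CommandLine:" log entry. It is only an illustration of the placeholder idea, not OpenKM's actual code. The function name `expand_ocr_command` and the /tmp paths are made up for the example.

```python
import shlex
from string import Template


def expand_ocr_command(template: str, file_in: str, file_out: str) -> list[str]:
    """Expand a system.ocr-style template into an argument list.

    The template is split into tokens first and the placeholders are
    substituted afterwards, so a path containing spaces stays one argument.
    """
    values = {"fileIn": file_in, "fileOut": file_out}
    return [Template(token).substitute(values) for token in shlex.split(template)]


if __name__ == "__main__":
    argv = expand_ocr_command(
        "/usr/bin/cuneiform ${fileIn} -o ${fileOut}",
        "/tmp/page.jpg",  # hypothetical input image
        "/tmp/page.txt",  # hypothetical output file
    )
    print(argv)
```

Printing the expanded list makes it easy to copy the exact command and run it by hand outside OpenKM.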

Thanks

Re: Search from OCR PDF documents

Posted: Mon May 26, 2014 5:51 am
by matt81
I tried to execute it from the command prompt and I got the same error. Do you know what the problem could be?
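
A small sketch for reproducing this kind of check outside OpenKM: it runs a command and reports the exit code and stderr, much like the "Abnormal program termination" and "STDERR" log entries. The helper name `run_ocr` is hypothetical, and the demo uses a stand-in command that fails on purpose so the sketch runs without any OCR engine installed.

```python
import subprocess
import sys


def run_ocr(argv: list[str], timeout: float = 120.0) -> tuple[int, str, str]:
    """Run an OCR command and return (exit code, stdout, stderr)."""
    result = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    return result.returncode, result.stdout, result.stderr


if __name__ == "__main__":
    # Stand-in for a failing OCR binary, so the sketch runs anywhere.
    # Replace it with the real argument list, e.g. the one from the log.
    fake_ocr = [
        sys.executable,
        "-c",
        "import sys; sys.stderr.write('recognition failed\\n'); sys.exit(1)",
    ]
    code, out, err = run_ocr(fake_ocr)
    if code != 0:
        print(f"Abnormal program termination: {code}")
        print(f"STDERR: {err.strip()}")
```

If the command fails the same way when run directly, the problem is in the OCR engine or the input image rather than in the OpenKM integration.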

Re: Search from OCR PDF documents

Posted: Mon May 26, 2014 3:46 pm
by pavila
If the problem is in "cuneiform" itself, I recommend contacting the program's developers.

You can also try Tesseract, for example.

Re: Search from OCR PDF documents

Posted: Fri May 30, 2014 12:17 am
by matt81
Thanks for the reply.

I have tried Tesseract and found it to be the best one out there, so I would recommend it to everyone. When I upload .jpg and .tif images, the OCR extraction works pretty well, although I noticed better results with .jpg. So all good there. The interesting part is that when I upload a PDF file, it doesn't work, and in fact the text gets recognized as Portuguese. See image attached. I haven't set the Portuguese language anywhere.

Do you know why that would happen? When OpenKM finds that a PDF file is an image, does it use Tesseract, or do you have something else set as the default?

Re: Search from OCR PDF documents

Posted: Sat May 31, 2014 5:11 pm
by jllort
Document language identification is done by another library, not by Tesseract, and it is not 100% accurate. We can confirm that, at present, Tesseract gives better results than cuneiform in most cases.

Re: Search from OCR PDF documents

Posted: Mon Jun 02, 2014 4:37 am
by matt81
OK, so you are saying that for PDF files cuneiform is used by default for text extraction? Why is that, given that I have specified tesseract in the config file?
It works fine for .jpg and .tiff formats, so I don't understand why it doesn't work for a PDF, since it contains an image.

Is there any way that it can be changed to tesseract?
Do you have any suggestions?

Thanks

Re: Search from OCR PDF documents

Posted: Tue Jun 03, 2014 7:33 am
by pavila
If you want to use Tesseract you have to modify the system.ocr configuration property and set the com.openkm.extractor.Tesseract3TextExtractor in registered.text.extractors (and remove the Cuneiform one).
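
As a rough illustration of the extractor change described above, the following Python sketch swaps the Cuneiform extractor for the Tesseract one in a list of class names, as they would appear one per line in registered.text.extractors. The function name `switch_to_tesseract` is made up for the example. The system.ocr property still has to be pointed at the tesseract binary separately, and the exact command line depends on the installation.

```python
CUNEIFORM = "com.openkm.extractor.CuneiformTextExtractor"
TESSERACT = "com.openkm.extractor.Tesseract3TextExtractor"


def switch_to_tesseract(extractors: list[str]) -> list[str]:
    """Return a copy of the list with Cuneiform swapped for Tesseract."""
    result = []
    for name in extractors:
        if name == CUNEIFORM:
            name = TESSERACT
        if name not in result:  # avoid registering an extractor twice
            result.append(name)
    if TESSERACT not in result:
        result.append(TESSERACT)
    return result


if __name__ == "__main__":
    current = [
        "com.openkm.extractor.PdfTextExtractor",
        "com.openkm.extractor.CuneiformTextExtractor",
    ]
    print("\n".join(switch_to_tesseract(current)))
```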

Re: Search from OCR PDF documents

Posted: Tue Jun 03, 2014 11:25 pm
by matt81
Thanks for your reply.
Yes, those settings are already set to tesseract. However, as I mentioned, when I upload a scanned PDF document, it is recognized as a different language (see the image attached in my previous post). When I scan the same document as .jpg or .tif, it works perfectly fine.
Do you know why it behaves like this? It is the same document in a different format and it doesn't work!

Thanks

Re: Search from OCR PDF documents

Posted: Wed Jun 04, 2014 4:03 am
by matt81
Just to let you know I have enabled DEBUG for PdfTextExtractor in log4j.properties.
See log file when I upload a scanned PDF file:
Code:
2014-06-04 02:35:00,145 [Thread-14] INFO  com.openkm.extractor.TextExtractorWorker - processSerial.Working on {docUuid=0fea632e-5dd1-4245-b08f-d0db3c0a3815, docPath=/okm:root/new-pdf/scanned-pdf.pdf, docVerUuid=78804a6f-e0d4-4e41-b28e-1fa1e06e072d, date=Wed Jun 04 02:31:13 UTC 2014}
2014-06-04 02:35:00,315 [Thread-14] DEBUG com.openkm.extractor.PdfTextExtractor - TextStripped: ''
2014-06-04 02:35:00,315 [Thread-14] WARN  com.openkm.extractor.PdfTextExtractor - PDF does not contains text layer
2014-06-04 02:35:00,316 [Thread-14] DEBUG com.openkm.extractor.PdfTextExtractor - Writing image: /dir/lar/tomcat-7.0.27/temp/Im74825222203928531329.jpg
2014-06-04 02:35:11,914 [Thread-14] DEBUG com.openkm.extractor.PdfTextExtractor - OCR Extracted: m.3_=$ Oufiunzcau
3 9.32 ofi §naoa3_am<
4w? W55»: 2 7.moaon_d—QN% QUO max Sam _<_mEo,:3m SO 83 >53...»
dpmazo 3>4mz_mza
._.=_m OE Emu bun. 1.303
_mw.._wn we Ema: mo: >mm.u:mm Em: _rm~»3<wx_
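
Reading the DEBUG entries above, the sequence appears to be: strip the text layer, find it empty, write the page out as an image, then pass that image to the OCR engine. The Python sketch below mirrors that sequence. It is an interpretation of the log, not OpenKM's source code, and the three callables (`strip_text`, `render_page_image`, `ocr_image`) are hypothetical stand-ins.

```python
from typing import Callable


def extract_pdf_text(
    pdf_path: str,
    strip_text: Callable[[str], str],
    render_page_image: Callable[[str], str],
    ocr_image: Callable[[str], str],
) -> str:
    """Return the PDF's text layer, or OCR output when there is none."""
    text = strip_text(pdf_path).strip()
    if text:
        return text
    # No text layer: render the page to an image and hand it to the OCR engine.
    image_path = render_page_image(pdf_path)
    return ocr_image(image_path).strip()


if __name__ == "__main__":
    # Dummy callables that imitate a scanned PDF without a text layer.
    result = extract_pdf_text(
        "scanned-pdf.pdf",
        strip_text=lambda path: "",
        render_page_image=lambda path: "page.jpg",
        ocr_image=lambda image: "text recognized from the image\n",
    )
    print(result)
```

Seen this way, garbled output points at the image written from the PDF or at the OCR step, since the text-layer step returned nothing.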

Re: Search from OCR PDF documents

Posted: Wed Jun 04, 2014 7:11 am
by pavila
The PDF seems to be protected. Please attach it so we can test it.

Re: Search from OCR PDF documents

Posted: Wed Jun 04, 2014 11:28 pm
by matt81
There is no PDF attached?
I did enable full permissions; however, I still get the same results.
I'm not sure how to get past this.

Thanks

Re: Search from OCR PDF documents

Posted: Thu Jun 05, 2014 2:21 am
by baolinhtv
My Tesseract 3.0 is working well, but only with TIF files. When I check text extraction, JPG and scanned PDF files do not work.

Update:
After I unchecked force ocr pdf and moved com.openkm.extractor.Tesseract3TextExtractor after com.openkm.extractor.PdfTextExtractor in the text.extractors key, I can now find full text from a DOCX file saved as PDF and from scanned PDF image files.
Code:
org.apache.jackrabbit.extractor.PlainTextExtractor
org.apache.jackrabbit.extractor.MsWordTextExtractor
org.apache.jackrabbit.extractor.MsExcelTextExtractor
org.apache.jackrabbit.extractor.MsPowerPointTextExtractor
org.apache.jackrabbit.extractor.OpenOfficeTextExtractor
org.apache.jackrabbit.extractor.RTFTextExtractor
org.apache.jackrabbit.extractor.HTMLTextExtractor
org.apache.jackrabbit.extractor.XMLTextExtractor
org.apache.jackrabbit.extractor.PngTextExtractor
org.apache.jackrabbit.extractor.MsOutlookTextExtractor
com.openkm.extractor.PdfTextExtractor
com.openkm.extractor.Tesseract3TextExtractor
com.openkm.extractor.AudioTextExtractor
com.openkm.extractor.SourceCodeTextExtractor
com.openkm.extractor.MsOffice2007TextExtractor
Update:
In the registered.text.extractors key I removed exifTextextractor, and I left the system.openoffice.dictionary key (the link to the .oxt file) empty. Now when I check text extraction, scanned PDF (image) files work well, with about 80% of the Vietnamese text extracted correctly. JPG files work too. But when I save a DOCX file as PDF (Word 2010) and upload it to OpenKM, there is no text extraction at all?
Code:
org.apache.jackrabbit.extractor.PlainTextExtractor
org.apache.jackrabbit.extractor.MsWordTextExtractor
org.apache.jackrabbit.extractor.MsExcelTextExtractor
org.apache.jackrabbit.extractor.MsPowerPointTextExtractor
org.apache.jackrabbit.extractor.OpenOfficeTextExtractor
org.apache.jackrabbit.extractor.RTFTextExtractor
org.apache.jackrabbit.extractor.HTMLTextExtractor
org.apache.jackrabbit.extractor.XMLTextExtractor
org.apache.jackrabbit.extractor.PngTextExtractor
org.apache.jackrabbit.extractor.MsOutlookTextExtractor
com.openkm.extractor.Tesseract3TextExtractor
com.openkm.extractor.PdfTextExtractor
com.openkm.extractor.AudioTextExtractor
com.openkm.extractor.SourceCodeTextExtractor
com.openkm.extractor.MsOffice2007TextExtractor
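
The two lists in this post contain the same fifteen entries and differ only in the order of PdfTextExtractor and Tesseract3TextExtractor. A small Python sketch for spotting that kind of difference between two extractor lists is shown below. The function name `order_differences` is made up for the example, and the lists in the demo are shortened to the relevant entries.

```python
def order_differences(first: list[str], second: list[str]) -> list[tuple[int, str, str]]:
    """Return (position, name in first, name in second) where the lists differ."""
    return [
        (index, a, b)
        for index, (a, b) in enumerate(zip(first, second))
        if a != b
    ]


if __name__ == "__main__":
    first = [
        "org.apache.jackrabbit.extractor.MsOutlookTextExtractor",
        "com.openkm.extractor.PdfTextExtractor",
        "com.openkm.extractor.Tesseract3TextExtractor",
        "com.openkm.extractor.AudioTextExtractor",
    ]
    second = [
        "org.apache.jackrabbit.extractor.MsOutlookTextExtractor",
        "com.openkm.extractor.Tesseract3TextExtractor",
        "com.openkm.extractor.PdfTextExtractor",
        "com.openkm.extractor.AudioTextExtractor",
    ]
    for position, a, b in order_differences(first, second):
        print(f"{position}: {a} vs {b}")
```

The sketch only makes the difference visible. Whether the order changes extraction behavior is what the poster reports, not something the sketch can confirm.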

Re: Search from OCR PDF documents

Posted: Thu Jun 05, 2014 2:55 am
by baolinhtv
matt81 wrote: Thanks for the reply.
I have tried Tesseract and found it to be the best one out there, so I would recommend it to everyone.
When I upload .jpg and .tif images, the OCR extraction works pretty well, although I noticed better results with .jpg. So all good there.
The interesting part is that when I upload a PDF file, it doesn't work, and in fact the text gets recognized as Portuguese. See image attached. I haven't set the Portuguese language anywhere.
Do you know why that would happen? When OpenKM finds that a PDF file is an image, does it use Tesseract, or do you have something else set as the default?
That is because your config has system.ocr = ${fileIn} ${fileOut} -l esp. Change the language by editing the value after -l. I'm Vietnamese, so I set it to: ${fileIn} ${fileOut} -l vie. I hope this helps you out.
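
To illustrate the -l change, here is a minimal Python sketch that sets or replaces the language code in a system.ocr-style template string. The function name `set_ocr_language` is made up for the example. The code passed to -l has to match a language data file installed for Tesseract (for example vie.traineddata for vie).

```python
def set_ocr_language(template: str, lang: str) -> str:
    """Return the system.ocr template with its -l value set to lang."""
    tokens = template.split()
    if "-l" in tokens:
        position = tokens.index("-l")
        if position + 1 < len(tokens):
            tokens[position + 1] = lang  # replace the existing language code
        else:
            tokens.append(lang)  # "-l" was the last token, with no value
    else:
        tokens.extend(["-l", lang])  # no language set yet
    return " ".join(tokens)


if __name__ == "__main__":
    print(set_ocr_language("${fileIn} ${fileOut} -l esp", "vie"))
```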