Open Source Document Management System | OpenKM

PostPosted:**Wed May 21, 2014 1:47 am**

Hi,
Is it possible to do a full-text search on OCR-scanned PDF documents? I have tried, and it doesn't work. Is there any way to enable this?

Thanks

PostPosted:**Thu May 22, 2014 11:29 am**

You need to install an OCR engine and integrate it with OpenKM. See http://wiki.openkm.com/index.php/Third- ... ation:_OCR

PostPosted:**Fri May 23, 2014 6:04 am**

Thanks for your reply.
I did set up the OCR software; however, when I upload a PDF document I get the following error in the log file.

Code:

2014-05-23 04:35:00,022 [Thread-38] INFO  com.openkm.extractor.TextExtractorWorker - processSerial.Working on {docUuid=9da253b7-6344-4c6d-a317-a3c633222bda, docPath=/okm:root/scanned-pdf.pdf, docVerUuid=8ce6c2cb-dba8-461e-bf22-d0477aef4270, date=Fri May 23 04:34:50 UTC 2014}
2014-05-23 04:35:00,035 [Thread-38] WARN  com.openkm.extractor.PdfTextExtractor - PDF does not contains text layer
2014-05-23 04:35:02,198 [Thread-38] WARN  com.openkm.util.ExecutionUtils - Abnormal program termination: 1
2014-05-23 04:35:02,199 [Thread-38] WARN  com.openkm.util.ExecutionUtils - CommandLine: [/usr/bin/cuneiform, /home/tomcat-7/temp/Im77607184968132204728.jpg, -o, /home/tomcat-7/temp/okm5636227013585324756.txt]
2014-05-23 04:35:02,199 [Thread-38] WARN  com.openkm.util.ExecutionUtils - STDERR: PUMA_XFinalrecognition failed.

2014-05-23 04:35:02,201 [Thread-38] WARN  com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/scanned-pdf.pdf': Too few text extracted

Do you know why this would occur?

I set up the following settings in the config:
system.ocr = /usr/bin/cuneiform ${fileIn} -o ${fileOut}
system.ocr.rotate = 90;180;270
and made com.openkm.extractor.CuneiformTextExtractor part of the text extractors.

Thanks
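
For readers following the settings above, here is a minimal Python sketch of how a system.ocr-style template could expand into the command line shown in the log (`[/usr/bin/cuneiform, ...jpg, -o, ...txt]`). This is not OpenKM's code. The function names `build_ocr_command` and `parse_rotations` are made up for illustration, and the sketch only assumes that `${fileIn}` and `${fileOut}` are replaced by temporary file paths and that system.ocr.rotate is a semicolon-separated list of angles.

```python
import shlex
from string import Template


def build_ocr_command(template, file_in, file_out):
    """Expand ${fileIn} / ${fileOut} in a system.ocr-style template into an argv list.

    Hypothetical helper for illustration only, not part of OpenKM.
    """
    # Split first and substitute second, so a path containing spaces
    # still ends up as a single argument.
    return [
        Template(token).substitute(fileIn=file_in, fileOut=file_out)
        for token in shlex.split(template)
    ]


def parse_rotations(value):
    """Parse a system.ocr.rotate-style value such as '90;180;270' into integers."""
    return [int(part) for part in value.split(";") if part.strip()]
```

Printing the expanded list and comparing it with the "CommandLine:" entry in the log is a quick way to check that the template says what you meant.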

PostPosted:**Mon May 26, 2014 5:51 am**

I tried executing it from the command prompt and got the same error. Do you know what the problem could be?
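
A small Python sketch for reproducing this kind of check outside OpenKM: it runs an OCR command line and reports the exit code and STDERR, which are the two things the log shows ("Abnormal program termination: 1" and the STDERR line). The helper name `run_ocr` and the default timeout are assumptions for illustration. The argument list is whatever command you would type at the prompt.

```python
import subprocess


def run_ocr(argv, timeout=120):
    """Run an OCR command line and return (exit_code, stdout, stderr).

    Hypothetical helper for illustration. `argv` is a list such as
    ["/usr/bin/cuneiform", "scan.jpg", "-o", "scan.txt"].
    """
    completed = subprocess.run(
        argv,
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode the output as text
        timeout=timeout,      # give up if the OCR engine hangs
    )
    return completed.returncode, completed.stdout, completed.stderr
```

If the exit code is non-zero here too, the failure is in the OCR engine or the input image rather than in the OpenKM integration.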

PostPosted:**Mon May 26, 2014 3:46 pm**

If the problem is in "cuneiform" itself, I recommend contacting the program's developer.

You can also try Tesseract, for example.

PostPosted:**Fri May 30, 2014 12:17 am**

Thanks for the reply.

I have tried Tesseract, and it is in fact the best one out there, so I would recommend it to everyone. When I upload .jpg and .tif images, the OCR extraction works pretty well, although I noticed better results for .jpg. So all good. The interesting part is that when I upload a PDF file, it doesn't work, and in fact the result comes out as if it were Portuguese. See image attached. I haven't set Portuguese as the language anywhere.

Do you know why that would happen? What does OpenKM use when it finds that a PDF file is an image: does it use Tesseract, or do you have something else set as the default?

PostPosted:**Sat May 31, 2014 5:11 pm**

Document language identification is done by another library, not by Tesseract, and it is not 100% accurate. We can confirm that, at present, Tesseract gives better results than cuneiform in most cases.

PostPosted:**Mon Jun 02, 2014 4:37 am**

OK, so you are saying that for PDF files cuneiform is used by default for text extraction? Why is that, given that I have specified Tesseract in the config file?
It works fine for .jpg and .tiff formats, so I don't understand why it doesn't work for a PDF, which just contains an image.

Is there any way to change it to Tesseract?
Do you have any suggestions?

Thanks

PostPosted:**Tue Jun 03, 2014 7:33 am**

If you want to use Tesseract, you have to modify the system.ocr configuration property and add com.openkm.extractor.Tesseract3TextExtractor to registered.text.extractors (and remove the Cuneiform one).

PostPosted:**Tue Jun 03, 2014 11:25 pm**

Thanks for your reply.
Yes, those settings are already set to Tesseract. However, as I mentioned, when I upload a scanned PDF document it is recognized as a different language (see the image attached to my previous post). When I scan the same document as .jpg or .tif, it works perfectly fine.
Do you know why it behaves like this? It is the same document in a different format, and it doesn't work!

Thanks

PostPosted:**Wed Jun 04, 2014 4:03 am**

Just to let you know, I have enabled DEBUG for PdfTextExtractor in log4j.properties.
Here is the log from uploading a scanned PDF file:

Code:

2014-06-04 02:35:00,145 [Thread-14] INFO  com.openkm.extractor.TextExtractorWorker - processSerial.Working on {docUuid=0fea632e-5dd1-4245-b08f-d0db3c0a3815, docPath=/okm:root/new-pdf/scanned-pdf.pdf, docVerUuid=78804a6f-e0d4-4e41-b28e-1fa1e06e072d, date=Wed Jun 04 02:31:13 UTC 2014}
2014-06-04 02:35:00,315 [Thread-14] DEBUG com.openkm.extractor.PdfTextExtractor - TextStripped: ''
2014-06-04 02:35:00,315 [Thread-14] WARN  com.openkm.extractor.PdfTextExtractor - PDF does not contains text layer
2014-06-04 02:35:00,316 [Thread-14] DEBUG com.openkm.extractor.PdfTextExtractor - Writing image: /dir/lar/tomcat-7.0.27/temp/Im74825222203928531329.jpg
2014-06-04 02:35:11,914 [Thread-14] DEBUG com.openkm.extractor.PdfTextExtractor - OCR Extracted: m.3_=$ Ouﬁunzcau
3 9.32 oﬁ §naoa3_am<
4w? W55»: 2 7.moaon_d—QN% QUO max Sam _<_mEo,:3m SO 83 >53...»
dpmazo 3>4mz_mza
._.=_m OE Emu bun. 1.303
_mw.._wn we Ema: mo: >mm.u:mm Em: _rm~»3<wx_
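
The log above suggests a two-step flow: the text layer is stripped first ("TextStripped: ''"), and only when it is empty is the page written out as an image and sent to OCR. A minimal Python sketch of that fallback follows. It is an illustration of the flow seen in the log, not OpenKM's implementation, and the names `extract_pdf_text`, `ocr_page_images` and `min_chars` are made up.

```python
def extract_pdf_text(text_layer, ocr_page_images, min_chars=1):
    """Return (text, source), preferring the PDF text layer over OCR.

    Hypothetical sketch:
    - text_layer: the text stripped from the PDF (may be empty).
    - ocr_page_images: a zero-argument callable that runs OCR on the
      page images and returns the recognized text.
    - min_chars: how much stripped text counts as "has a text layer".
    """
    stripped = text_layer.strip()
    if len(stripped) >= min_chars:
        # The PDF carries real text, so OCR is never invoked.
        return stripped, "text-layer"
    # No usable text layer: fall back to OCR on the embedded images.
    return ocr_page_images().strip(), "ocr"
```

In this picture, garbage such as the "OCR Extracted" lines above comes from the second branch, so the image written from the PDF and the OCR command are the places to look, not the text-layer step.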

PostPosted:**Wed Jun 04, 2014 7:11 am**

It seems to be protected. Please attach this PDF so we can test it.

PostPosted:**Wed Jun 04, 2014 11:28 pm**

There is no PDF attached?
I did enable full permissions; however, I still get the same results.
I'm not sure how to get past this.

Thanks

PostPosted:**Thu Jun 05, 2014 2:21 am**

My Tesseract 3.0 works well, but only with TIF files. When I check text extraction, JPG and scanned PDF files do not work.

Update:
After I unchecked "force ocr pdf" and moved com.openkm.extractor.Tesseract3TextExtractor after com.openkm.extractor.PdfTextExtractor in the text.extractors key,
full-text search now finds both a DOCX file saved as PDF and a scanned PDF image file.

Code:

org.apache.jackrabbit.extractor.PlainTextExtractor
org.apache.jackrabbit.extractor.MsWordTextExtractor
org.apache.jackrabbit.extractor.MsExcelTextExtractor
org.apache.jackrabbit.extractor.MsPowerPointTextExtractor
org.apache.jackrabbit.extractor.OpenOfficeTextExtractor
org.apache.jackrabbit.extractor.RTFTextExtractor
org.apache.jackrabbit.extractor.HTMLTextExtractor
org.apache.jackrabbit.extractor.XMLTextExtractor
org.apache.jackrabbit.extractor.PngTextExtractor
org.apache.jackrabbit.extractor.MsOutlookTextExtractor
com.openkm.extractor.PdfTextExtractor
com.openkm.extractor.Tesseract3TextExtractor
com.openkm.extractor.AudioTextExtractor
com.openkm.extractor.SourceCodeTextExtractor
com.openkm.extractor.MsOffice2007TextExtractor

Update:
In the registered.text.extractors key I removed exifTextextractor, and I left the system.openoffice.dictionary key (the link to the .oxt file) empty. Now when I check text extraction, a scanned PDF (image) works well, about 80% accurate for Vietnamese, and a JPG file does too. But when I save a DOCX file as PDF (Word 2010) and upload it to OpenKM, no text is extracted. Why?

Code:

org.apache.jackrabbit.extractor.PlainTextExtractor
org.apache.jackrabbit.extractor.MsWordTextExtractor
org.apache.jackrabbit.extractor.MsExcelTextExtractor
org.apache.jackrabbit.extractor.MsPowerPointTextExtractor
org.apache.jackrabbit.extractor.OpenOfficeTextExtractor
org.apache.jackrabbit.extractor.RTFTextExtractor
org.apache.jackrabbit.extractor.HTMLTextExtractor
org.apache.jackrabbit.extractor.XMLTextExtractor
org.apache.jackrabbit.extractor.PngTextExtractor
org.apache.jackrabbit.extractor.MsOutlookTextExtractor
com.openkm.extractor.Tesseract3TextExtractor
com.openkm.extractor.PdfTextExtractor
com.openkm.extractor.AudioTextExtractor
com.openkm.extractor.SourceCodeTextExtractor
com.openkm.extractor.MsOffice2007TextExtractor

PostPosted:**Thu Jun 05, 2014 2:55 am**

matt81 wrote: Thanks for the reply.
I have tried with tesseract and in fact this is the best one out there. So I would recommend to everyone to use tesseract as it is the best.
When I upload .jpg and .tif images, the OCR extraction works pretty well, although I noticed better results for .jpg formats. So all good.
The interesting part is that when I upload a PDF file, it doesn't work and in fact it translates into Portuguese. See image attached. I haven't set the Portuguese language anywhere.
Do you know why that would happen? What does OpenKM use when it finds that a PDF file is an image, does it use tesseract or do you guys have something else set as default?

That is because your config has system.ocr = ${fileIn} ${fileOut} -l esp. Change the language by editing the value after -l. I'm Vietnamese, so I set it to: ${fileIn} ${fileOut} -l vie. Maybe this helps you out.

Search from OCR pdf documents
