• Search from OCR pdf documents

  • We tried to make OpenKM as intuitive as possible, but advice is always welcome.
 #28628  by matt81
 
Hi,
Is it possible to do a full-text search on OCR-scanned PDF documents? I have tried, and it doesn't work. Is there any way to enable this?

Thanks
 #28677  by matt81
 
Thanks for your reply.
I did set up the OCR software; however, when I upload a PDF document I get the following error in the log file.
Code:
2014-05-23 04:35:00,022 [Thread-38] INFO  com.openkm.extractor.TextExtractorWorker - processSerial.Working on {docUuid=9da253b7-6344-4c6d-a317-a3c633222bda, docPath=/okm:root/scanned-pdf.pdf, docVerUuid=8ce6c2cb-dba8-461e-bf22-d0477aef4270, date=Fri May 23 04:34:50 UTC 2014}
2014-05-23 04:35:00,035 [Thread-38] WARN  com.openkm.extractor.PdfTextExtractor - PDF does not contains text layer
2014-05-23 04:35:02,198 [Thread-38] WARN  com.openkm.util.ExecutionUtils - Abnormal program termination: 1
2014-05-23 04:35:02,199 [Thread-38] WARN  com.openkm.util.ExecutionUtils - CommandLine: [/usr/bin/cuneiform, /home/tomcat-7/temp/Im77607184968132204728.jpg, -o, /home/tomcat-7/temp/okm5636227013585324756.txt]
2014-05-23 04:35:02,199 [Thread-38] WARN  com.openkm.util.ExecutionUtils - STDERR: PUMA_XFinalrecognition failed.

2014-05-23 04:35:02,201 [Thread-38] WARN  com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/scanned-pdf.pdf': Too few text extracted
Do you know why this would occur?

I set up the following settings in the config:
system.ocr = /usr/bin/cuneiform ${fileIn} -o ${fileOut}
system.ocr.rotate = 90;180;270
and added com.openkm.extractor.CuneiformTextExtractor to the text extractors.

Thanks
 #28708  by matt81
 
I tried to execute it from the command prompt and I got the same error. Do you know what the problem could be?
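
For anyone who wants to repeat that check, here is a minimal Python sketch that runs an OCR command line outside OpenKM and reports the exit code and STDERR, the same two things the OpenKM log shows. The helper name `run_ocr` and the file names `page.jpg` and `page.txt` are placeholders for illustration only; the cuneiform arguments are the ones from the log above.

```python
import subprocess
import sys


def run_ocr(command):
    """Run a command line given as a list; return (exit_code, stdout, stderr)."""
    try:
        result = subprocess.run(command, capture_output=True, text=True)
    except FileNotFoundError as exc:
        # The binary itself is missing; 127 mirrors the usual shell convention.
        return 127, "", str(exc)
    return result.returncode, result.stdout, result.stderr


if __name__ == "__main__":
    # Placeholder file names: replace them with a real image and output file.
    code, out, err = run_ocr(["/usr/bin/cuneiform", "page.jpg", "-o", "page.txt"])
    print("exit code:", code)
    print("stderr:", err.strip())
```

If the exit code is non-zero here as well, the failure is in the OCR program or its input image rather than in OpenKM.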
 #28727  by pavila
 
If the problem is with "cuneiform", I recommend contacting the program's developer.

You can also try Tesseract, for example.
 #28765  by matt81
 
Thanks for the reply.

I have tried Tesseract, and in fact it is the best one out there, so I would recommend it to everyone. When I upload .jpg and .tif images, the OCR extraction works pretty well, although I noticed better results with .jpg. So all good. The interesting part is that when I upload a PDF file, it doesn't work, and in fact it translates into Portuguese. See the attached image. I haven't set the Portuguese language anywhere.

Do you know why that would happen? What does OpenKM use when it finds that a PDF file is an image: does it use Tesseract, or do you have something else set as the default?
Attachments
language.png
 #28792  by jllort
 
Document language identification is done by another library, not by Tesseract, and it is not 100% accurate. We can confirm that at present, in most cases, Tesseract gives better results than cuneiform.
 #28806  by matt81
 
OK, so you are saying that for PDF files, cuneiform is used by default for text extraction? Why is that, given that I have specified Tesseract in the config file?
It works fine for .jpg and .tiff formats, so I don't understand why it doesn't work for a PDF, since it contains an image.

Is there any way it can be changed to Tesseract?
Do you have any suggestions?

Thanks
 #28840  by pavila
 
If you want to use Tesseract you have to modify the system.ocr configuration property and set the com.openkm.extractor.Tesseract3TextExtractor in registered.text.extractors (and remove the Cuneiform one).
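
To make the system.ocr property more concrete, here is a minimal Python sketch of how a template with the ${fileIn} and ${fileOut} placeholders could expand into an argument list like the CommandLine entry in the log earlier in this thread. This is only an illustration: how OpenKM performs the substitution internally is not shown here, the function name `expand_ocr_command` is made up, and the /usr/bin/tesseract path and the `-l eng` language are assumptions.

```python
import shlex
from string import Template


def expand_ocr_command(template, file_in, file_out):
    """Expand ${fileIn} and ${fileOut} in a system.ocr style template.

    The template is split into arguments first, so a path that contains
    spaces still ends up as a single argument after substitution.
    """
    parts = shlex.split(template)
    return [
        Template(part).safe_substitute(fileIn=file_in, fileOut=file_out)
        for part in parts
    ]


if __name__ == "__main__":
    # Assumed Tesseract template: binary path and language are illustrative.
    template = "/usr/bin/tesseract ${fileIn} ${fileOut} -l eng"
    print(expand_ocr_command(template, "/tmp/scan 01.jpg", "/tmp/out"))
```

Printing the expanded list makes it easy to paste the exact command into a terminal and see whether the OCR program accepts it.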
 #28856  by matt81
 
Thanks for your reply.
Yes, those settings are already set to Tesseract; however, as I mentioned, when I upload a scanned PDF document it is recognized as a different language (see the image attached to my previous post). When I scan the same document as .jpg or .tif, it works perfectly fine.
Do you know why it behaves like this? It is the same document in different format and it doesn't work!

Thanks
 #28858  by matt81
 
Just to let you know I have enabled DEBUG for PdfTextExtractor in log4j.properties.
See log file when I upload a scanned PDF file:
Code:
2014-06-04 02:35:00,145 [Thread-14] INFO  com.openkm.extractor.TextExtractorWorker - processSerial.Working on {docUuid=0fea632e-5dd1-4245-b08f-d0db3c0a3815, docPath=/okm:root/new-pdf/scanned-pdf.pdf, docVerUuid=78804a6f-e0d4-4e41-b28e-1fa1e06e072d, date=Wed Jun 04 02:31:13 UTC 2014}
2014-06-04 02:35:00,315 [Thread-14] DEBUG com.openkm.extractor.PdfTextExtractor - TextStripped: ''
2014-06-04 02:35:00,315 [Thread-14] WARN  com.openkm.extractor.PdfTextExtractor - PDF does not contains text layer
2014-06-04 02:35:00,316 [Thread-14] DEBUG com.openkm.extractor.PdfTextExtractor - Writing image: /dir/lar/tomcat-7.0.27/temp/Im74825222203928531329.jpg
2014-06-04 02:35:11,914 [Thread-14] DEBUG com.openkm.extractor.PdfTextExtractor - OCR Extracted: m.3_=$ Oufiunzcau
3 9.32 ofi §naoa3_am<
4w? W55»: 2 7.moaon_d—QN% QUO max Sam _<_mEo,:3m SO 83 >53...»
dpmazo 3>4mz_mza
._.=_m OE Emu bun. 1.303
_mw.._wn we Ema: mo: >mm.u:mm Em: _rm~»3<wx_
 #28871  by matt81
 
There is no PDF attached?
I did enable full permissions; however, I still get the same results.
Not sure how to get past this.

Thanks
 #28874  by baolinhtv
 
My Tesseract 3.0 is working well, but only with TIF files. When I check text extraction, JPG and scanned PDF files are not working.

Update:
After I unchecked "force OCR PDF" and moved com.openkm.extractor.Tesseract3TextExtractor after com.openkm.extractor.PdfTextExtractor in the text.extractors key, I can now find full text from DOCX files saved as PDF and from scanned PDF image files.
Code:
org.apache.jackrabbit.extractor.PlainTextExtractor
org.apache.jackrabbit.extractor.MsWordTextExtractor
org.apache.jackrabbit.extractor.MsExcelTextExtractor
org.apache.jackrabbit.extractor.MsPowerPointTextExtractor
org.apache.jackrabbit.extractor.OpenOfficeTextExtractor
org.apache.jackrabbit.extractor.RTFTextExtractor
org.apache.jackrabbit.extractor.HTMLTextExtractor
org.apache.jackrabbit.extractor.XMLTextExtractor
org.apache.jackrabbit.extractor.PngTextExtractor
org.apache.jackrabbit.extractor.MsOutlookTextExtractor
com.openkm.extractor.PdfTextExtractor
com.openkm.extractor.Tesseract3TextExtractor
com.openkm.extractor.AudioTextExtractor
com.openkm.extractor.SourceCodeTextExtractor
com.openkm.extractor.MsOffice2007TextExtractor
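
A minimal Python sketch of the reordering described above, assuming the extractor list is kept as one class name per line. The helper name `move_after` is made up for illustration and is not part of OpenKM, and the sketch only edits the list; it does not show how OpenKM chooses an extractor at runtime.

```python
def move_after(extractors, name, anchor):
    """Return a new list with `name` placed immediately after `anchor`."""
    if name not in extractors or anchor not in extractors:
        raise ValueError("both class names must be present in the list")
    # Drop the entry first, then re-insert it right behind the anchor.
    reordered = [entry for entry in extractors if entry != name]
    reordered.insert(reordered.index(anchor) + 1, name)
    return reordered


if __name__ == "__main__":
    current = [
        "org.apache.jackrabbit.extractor.MsOutlookTextExtractor",
        "com.openkm.extractor.Tesseract3TextExtractor",
        "com.openkm.extractor.PdfTextExtractor",
        "com.openkm.extractor.AudioTextExtractor",
    ]
    fixed = move_after(
        current,
        "com.openkm.extractor.Tesseract3TextExtractor",
        "com.openkm.extractor.PdfTextExtractor",
    )
    print("\n".join(fixed))
```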
Update:
In the registered.text.extractors key I removed exifTextextractor, and I left the system.openoffice.dictionary key (the link to the .oxt file) empty. Now when I check text extraction, scanned PDF (image) files work well, with about 80% of the Vietnamese text extracted correctly, and JPG files too. But when I save a DOCX file as PDF (Word 2010) and upload it to OpenKM, there is no text extraction.
Code:
org.apache.jackrabbit.extractor.PlainTextExtractor
org.apache.jackrabbit.extractor.MsWordTextExtractor
org.apache.jackrabbit.extractor.MsExcelTextExtractor
org.apache.jackrabbit.extractor.MsPowerPointTextExtractor
org.apache.jackrabbit.extractor.OpenOfficeTextExtractor
org.apache.jackrabbit.extractor.RTFTextExtractor
org.apache.jackrabbit.extractor.HTMLTextExtractor
org.apache.jackrabbit.extractor.XMLTextExtractor
org.apache.jackrabbit.extractor.PngTextExtractor
org.apache.jackrabbit.extractor.MsOutlookTextExtractor
com.openkm.extractor.Tesseract3TextExtractor
com.openkm.extractor.PdfTextExtractor
com.openkm.extractor.AudioTextExtractor
com.openkm.extractor.SourceCodeTextExtractor
com.openkm.extractor.MsOffice2007TextExtractor
Last edited by baolinhtv on Thu Jun 05, 2014 4:05 am, edited 3 times in total.
 #28875  by baolinhtv
 
matt81 wrote:Thanks for the reply.
I have tried Tesseract, and in fact it is the best one out there, so I would recommend it to everyone.
When I upload .jpg and .tif images, the OCR extraction works pretty well, although I noticed better results with .jpg. So all good.
The interesting part is that when I upload a PDF file, it doesn't work, and in fact it translates into Portuguese. See the attached image. I haven't set the Portuguese language anywhere.
Do you know why that would happen? What does OpenKM use when it finds that a PDF file is an image: does it use Tesseract, or do you have something else set as the default?

That is because you configured system.ocr = ${fileIn} ${fileOut} -l esp. Change the language by editing the value after -l. I'm Vietnamese, so I changed it to: ${fileIn} ${fileOut} -l vie. Hope this helps you out.
