• Search from OCR pdf documents

  • We tried to make OpenKM as intuitive as possible, but advice is always welcome.
 #28885  by matt81
 
Thanks for your reply.
I did try the options mentioned, but they didn't work on my end. I have Tesseract version 3.02, so I am not sure whether that is the difference, since you have 3.00, right?
Can you upload the PDF you used so I can test?

I left system.openoffice.dictionary empty because I couldn't find any oxi file. I have read that the following value should be set there, as it improves the quality. However, when I tried it I had no luck; the issue was the same.

/var/libreoffice4.1/share/extensions/dict-en/en_AU.dic

Thanks
 #28887  by baolinhtv
 
I have 3.02 too. You should set registered.text.extractors the way I have:
Code:
org.apache.jackrabbit.extractor.PlainTextExtractor
org.apache.jackrabbit.extractor.MsWordTextExtractor
org.apache.jackrabbit.extractor.MsExcelTextExtractor
org.apache.jackrabbit.extractor.MsPowerPointTextExtractor
org.apache.jackrabbit.extractor.OpenOfficeTextExtractor
org.apache.jackrabbit.extractor.RTFTextExtractor
org.apache.jackrabbit.extractor.HTMLTextExtractor
org.apache.jackrabbit.extractor.XMLTextExtractor
org.apache.jackrabbit.extractor.PngTextExtractor
org.apache.jackrabbit.extractor.MsOutlookTextExtractor
com.openkm.extractor.PdfTextExtractor
com.openkm.extractor.Tesseract3TextExtractor
com.openkm.extractor.AudioTextExtractor
com.openkm.extractor.SourceCodeTextExtractor
com.openkm.extractor.MsOffice2007TextExtractor
and system.ocr (this is my value; you must change the path and language to match your setup):
Code:
c:\\esseract\\tesseract.exe ${fileIn} ${fileOut} -l vie 
After that you can use "Check text extraction" to check whether it works.
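
As a rough illustration of what a system.ocr value with ${fileIn} and ${fileOut} placeholders amounts to, here is a minimal Python sketch that expands such a template into an argument list. It is not OpenKM's actual code: the function name build_ocr_command and the file paths are hypothetical, and the Linux-style template is only an example value. The tesseract argument order (input image, output base name, then -l and a language code) follows the Tesseract command line.

```python
import shlex
from string import Template


def build_ocr_command(template, file_in, file_out):
    """Expand ${fileIn} and ${fileOut} in an OCR command template.

    The template is split into tokens first and substituted afterwards,
    so a path that contains spaces still ends up as a single argument.
    """
    tokens = shlex.split(template)
    return [
        Template(token).substitute(fileIn=file_in, fileOut=file_out)
        for token in tokens
    ]


if __name__ == "__main__":
    # Hypothetical template and paths, for illustration only.
    command = build_ocr_command(
        "/usr/bin/tesseract ${fileIn} ${fileOut} -l eng",
        "/tmp/my scan.tif",
        "/tmp/out",
    )
    print(command)
```

Running the resulting command by hand against one scanned page is a quick way to see whether the tesseract binary, the language pack, and the image orientation are fine before looking at the OpenKM settings. Tesseract writes the recognized text to the output base name plus a .txt suffix.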
 #28904  by matt81
 
Thanks for your feedback. I have tried everything with the same settings as you, and it still doesn't work. I have set system.ocr to /usr/bin/tesseract ${fileIn} ${fileOut} -l eng
system.pdf.force.ocr is unchecked, and I have also set system.ocr.rotate to 90;180;270.

The text extracted from my scanned PDF is garbage, see below:
"wmmmwm mam menmmwmwmm ammzsm on» Qonczsmim 8 25% 2 .fi:m< mximfl .85 _o3_om1<‘ _ nE4m::< cmm Hmmmmanfi we, mo Em: msammmwoafirmwmmcz. ._.:m:_G "

Please find attached the PDF I used. Can you test it on your end and tell me whether it works?
I scanned the PDF document at 400 DPI.

Thanks
 #28919  by jllort
 
The problem is that this PDF is rotated +90 degrees. I extracted the PDF contents and attached two images: test-000.pbm, which has the +90 rotation, and test-000-rotated, which is the result of rotating the original document by -90 degrees. If you run Tesseract on the rotated image, it works correctly.
 #28923  by matt81
 
Thanks for your reply.
OK, so what does that mean? It finds two images in this PDF file? I need to upload a PDF file and have the text extracted from that PDF file. I don't understand your comment.
How can I resolve this issue?

Thanks
 #28935  by jllort
 
1- First of all, the page must not be rotated +90 degrees; it should be scanned normally. When I extract the page from the PDF, it is already rotated +90 degrees (it's not a vertical A4, it's an A4 rotated +90 degrees).
2- Remove the rotation setting in Administration -> configuration.

The first image is what you really have in the PDF file. Although you apply +90, +180, etc., what you would really need is -90. But that is not the solution; the solution is to scan the A4 page vertically.
 #28957  by matt81
 
Thanks a lot, you clarified it for me. That was the issue; I was stuck there for a while.
As long as the PDF document is scanned vertically, there is no need for rotation; just leave it empty.
Rotation comes in handy if you are scanning .jpg or .tiff images; without rotation it will only work with a horizontal scan. If rotation is set, it will work for all types of scans, vertical or horizontal.
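
The idea of a rotation list such as 90;180;270 can be sketched as "try each orientation and keep the one whose OCR output looks best". The Python sketch below only illustrates that idea on a tiny character grid standing in for an image. How OpenKM actually picks among rotations is not described in this thread, so the score function, the helper names, and the clockwise convention are all assumptions.

```python
def parse_rotations(setting):
    """Parse a value like "90;180;270" into [90, 180, 270]; empty gives []."""
    return [int(part) for part in setting.split(";") if part.strip()]


def rotate_clockwise(grid, degrees):
    """Rotate a row-major 2D grid clockwise by a multiple of 90 degrees."""
    result = [list(row) for row in grid]
    for _ in range((degrees // 90) % 4):
        # Reversing the rows and transposing is one clockwise quarter turn.
        result = [list(row) for row in zip(*result[::-1])]
    return result


def best_orientation(grid, setting, score):
    """Try the unrotated grid plus each configured angle.

    `score` is a hypothetical stand-in for "how good does the OCR output
    look"; the candidate with the highest score is returned as (angle, grid).
    """
    candidates = [0] + parse_rotations(setting)
    rotated = [(angle, rotate_clockwise(grid, angle)) for angle in candidates]
    return max(rotated, key=lambda pair: score(pair[1]))


if __name__ == "__main__":
    upright = [list("AB"), list("CD")]
    sideways = rotate_clockwise(upright, 90)  # simulate a page scanned sideways
    angle, fixed = best_orientation(
        sideways, "90;180;270", lambda g: 1 if g == upright else 0
    )
    print(angle, fixed)
```

With an empty setting only the unrotated candidate is tried, which matches the advice to leave rotation empty when documents are already scanned upright.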

Thanks again
 #29872  by matt81
 
Hi, just coming back to this: I wanted to clarify what is meant by vertical or horizontal scanning. If I scan a page vertically, the OCR doesn't work and no text is extracted; it only works if the scan is horizontal. That holds as long as it is the text that runs vertically on the page, not the layout. For example, see the attached images: Vertical-scan.png doesn't work (no text is extracted), whereas Horizontal-scan.png works fine.
I wanted to get your confirmation: when you said to scan it vertically, is this what you meant, or am I missing something? Rotation is set to null.

Thanks
Attachments
Vertical Scan: Vertical-scan.png
Horizontal Scan: Horizontal-scan.png
 #29896  by jllort
 
If the text runs left to right, OCR works correctly. In your case the text runs bottom to top (I hope this explanation is clearer). Referring to your screenshots: the vertical one will work, but the horizontal one will not. The problem with your document is that when we extract the image from the PDF file, it comes out as shown in the Horizontal screenshot.
 #29913  by matt81
 
Thanks for your reply.
When I scan a document, I scan it to a PDF file; however, that PDF file is really an image. So when you say that the problem with my document is that the image extracted from the PDF file comes out as in the Horizontal screenshot, that is perfectly correct. That's what I am saying: when we scan documents as PDF files, they have to be horizontal with the text running up and down, otherwise it won't work. If I scan a document as an image, it works fine for me if it's vertical. However, it won't work horizontally with the text running up and down.

So I am not sure whether my settings are all correct, but that's the behaviour I am getting. Can you confirm that this is how it works?

Sorry for taking up your time, but I want to get this clear for myself and for others.
 #29929  by jllort
 
We do not have OCR text direction recognition for images at present. You must upload the image or PDF with the correct orientation. Although there is a parameter, system.ocr.rotate, to rotate images, I do not recommend using it in your case: this parameter is used for all repository OCR scanning actions and will be applied to every document. The problem you have is with the PDF format: for some reason the preview in Acrobat Reader works correctly (everything seems fine), but when we extract the image it comes out with the wrong rotation. It's the first time we have seen this, and it would be quite strange for it to be a library problem (possible, but strange). Is the PDF you generated a very recent version?
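
One possible explanation, not confirmed anywhere in this thread, for a viewer showing a page upright while the extracted image is sideways is a /Rotate entry in the PDF page dictionary: the PDF format lets a page declare a clockwise display rotation in multiples of 90 degrees, and a tool that pulls out the raw embedded image may not apply it. The Python sketch below is a crude heuristic check for such entries using only the standard library. The function names are hypothetical, and the raw byte scan will miss entries stored in compressed object streams or given as indirect references.

```python
import re

# Matches "/Rotate 90", "/Rotate -90", and similar in raw PDF bytes.
ROTATE_RE = re.compile(rb"/Rotate\s+(-?\d+)")


def find_rotate_entries(pdf_bytes):
    """Return every /Rotate value found, normalized to the range 0..359."""
    return [int(match.group(1)) % 360 for match in ROTATE_RE.finditer(pdf_bytes)]


def looks_rotated(pdf_bytes):
    """True if at least one non-zero /Rotate value is present."""
    return any(angle != 0 for angle in find_rotate_entries(pdf_bytes))


if __name__ == "__main__":
    # In-memory stand-in for a page dictionary; a real check would read
    # the scanned file with open(path, "rb").read() instead.
    sample = b"<< /Type /Page /Rotate 90 /MediaBox [0 0 595 842] >>"
    print(find_rotate_entries(sample), looks_rotated(sample))
```

If such a check reports a non-zero rotation for a scan, rescanning or re-exporting the page upright, as recommended above, avoids relying on a global rotation setting.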
 #29947  by matt81
 
Thanks for your reply.
So for now we will have to scan PDFs as I mentioned previously (horizontally, with the text running up and down) until this is resolved, right?
Do you know whether you can find a solution soon?
I believe the PDF I generated is a newer version, but that shouldn't be a problem, I guess, as most people will use newer versions.

Thanks
 #29970  by jllort
 
I have no idea why the image is being extracted incorrectly in your case. Could you try scanning a document and generating a PDF with another computer or application? If we are not able to reproduce the problem, it is quite difficult for us to find a clue to solve it.
 #29978  by matt81
 
Thanks for your reply.
I started again from scratch and scanned new documents, and now it's working! I am not sure whether the older version of OpenKM I had previously was causing the problem or whether there was an issue with the scanned documents, but now it's working. All good.

Thank you.
 #30002  by jllort
 
I vote for the second, because it would be quite strange for this problem to happen only to you; other people would have reported it, and that is not the case. Anyway, we're pleased to see the problem has been solved.
