Re: Search from OCR PDF documents
Posted: Fri Jun 06, 2014 3:44 am
by matt81
Thanks for your reply.
I did try the options mentioned, but it didn't work on my end. I have Tesseract version 3.02, so I am not sure if that is the difference, as you have 3.00, right?
Can you upload the PDF you used so I can test?
I left system.openoffice.dictionary empty, as I couldn't find any oxi file. I have read that the following value should be set there, as it improves the quality. However, when I tried it I had no luck; it was the same issue.
/var/libreoffice4.1/share/extensions/dict-en/en_AU.dic
Thanks
Re: Search from OCR PDF documents
Posted: Fri Jun 06, 2014 8:42 am
by baolinhtv
matt81 wrote:
Thanks for your reply.
I did try the options mentioned, but it didn't work on my end. I have Tesseract version 3.02, so I am not sure if that is the difference, as you have 3.00, right?
Can you upload the PDF you used so I can test?
I left system.openoffice.dictionary empty, as I couldn't find any oxi file. I have read that the following value should be set there, as it improves the quality. However, when I tried it I had no luck; it was the same issue.
/var/libreoffice4.1/share/extensions/dict-en/en_AU.dic
Thanks
I have 3.02 too. You should modify registered.text.extractors like mine:
Code:
org.apache.jackrabbit.extractor.PlainTextExtractor
org.apache.jackrabbit.extractor.MsWordTextExtractor
org.apache.jackrabbit.extractor.MsExcelTextExtractor
org.apache.jackrabbit.extractor.MsPowerPointTextExtractor
org.apache.jackrabbit.extractor.OpenOfficeTextExtractor
org.apache.jackrabbit.extractor.RTFTextExtractor
org.apache.jackrabbit.extractor.HTMLTextExtractor
org.apache.jackrabbit.extractor.XMLTextExtractor
org.apache.jackrabbit.extractor.PngTextExtractor
org.apache.jackrabbit.extractor.MsOutlookTextExtractor
com.openkm.extractor.PdfTextExtractor
com.openkm.extractor.Tesseract3TextExtractor
com.openkm.extractor.AudioTextExtractor
com.openkm.extractor.SourceCodeTextExtractor
com.openkm.extractor.MsOffice2007TextExtractor
and
system.ocr (this is mine; you must change the path and language to match your setup)
Code:
c:\\esseract\\tesseract.exe ${fileIn} ${fileOut} -l vie
After that you can use "Check text extraction" to check whether it works.
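A minimal sketch of how a system.ocr template like the one above could be expanded into a command line. How OpenKM performs the substitution internally is not shown in this thread, so build_ocr_command and the example paths are hypothetical; only the ${fileIn} / ${fileOut} placeholders and the tesseract arguments come from the posts.

```python
import shlex
from string import Template


def build_ocr_command(template, file_in, file_out):
    """Fill the ${fileIn}/${fileOut} placeholders and split into argv.

    Hypothetical helper for illustration only; it is not OpenKM code.
    """
    filled = Template(template).substitute(
        # Quote the paths so that names containing spaces stay one argument.
        fileIn=shlex.quote(file_in),
        fileOut=shlex.quote(file_out),
    )
    return shlex.split(filled)


# Example template in the same form as the system.ocr value (path is an assumption).
template = "/usr/bin/tesseract ${fileIn} ${fileOut} -l eng"
cmd = build_ocr_command(template, "/tmp/page.tif", "/tmp/page")
print(cmd)
# Tesseract 3.x treats the second argument as an output base name
# and writes the recognised text to that name plus ".txt".
```

Printing the expanded command this way makes it easy to run the same line by hand in a shell and see whether tesseract itself produces readable text.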
Re: Search from OCR PDF documents
Posted: Tue Jun 10, 2014 6:30 am
by matt81
Thanks for your feedback. I have tried everything with the same settings as you, and it still doesn't work. I have set
system.ocr to /usr/bin/tesseract ${fileIn} ${fileOut} -l eng
system.pdf.force.ocr is unchecked, and I have also set
system.ocr.rotate to 90;180;270.
The text extracted from my scanned PDF is garbage, see below:
"wmmmwm mam menmmwmwmm ammzsm on» Qonczsmim 8 25% 2 .fi:m< mximfl .85 _o3_om1<‘ _ nE4m::< cmm Hmmmmanfi we, mo Em: msammmwoafirmwmmcz. ._.

_G "
Please find attached the PDF I used. Can you test it on your end and tell me if it works?
I scanned the PDF document at 400 DPI.
Thanks
Re: Search from OCR PDF documents
Posted: Wed Jun 11, 2014 7:42 am
by jllort
The problem is that this PDF is rotated +90 degrees. I've extracted the PDF contents and got two images: test-000.pbm, which is rotated +90, and test-000-rotated, which is the result of rotating the original by -90. If you run tesseract on the rotated one, it works correctly.
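One way to check an extracted page image such as test-000.pbm is to read its width and height from the PBM header: a page that should be a vertical A4 but comes out wider than tall has been stored rotated. The sketch below is only an illustration under that assumption; pbm_size and is_landscape are hypothetical helper names, and the sample bytes are made up.

```python
def pbm_size(data):
    """Return (width, height) from the header of a PBM image (P1 or P4)."""
    tokens = []
    pos = 0
    while len(tokens) < 3:
        # Skip whitespace between header fields.
        while pos < len(data) and data[pos:pos + 1].isspace():
            pos += 1
        if pos >= len(data):
            raise ValueError("truncated PBM header")
        # Header comments start with '#' and run to the end of the line.
        if data[pos:pos + 1] == b"#":
            while pos < len(data) and data[pos:pos + 1] != b"\n":
                pos += 1
            continue
        start = pos
        while pos < len(data) and not data[pos:pos + 1].isspace():
            pos += 1
        tokens.append(data[start:pos])
    magic, width, height = tokens
    if magic not in (b"P1", b"P4"):
        raise ValueError("not a PBM image")
    return int(width), int(height)


def is_landscape(data):
    """True when the image is wider than it is tall."""
    width, height = pbm_size(data)
    return width > height


# Made-up 4x2 sample header; a real file would be read with open(path, "rb").
sample = b"P4\n# extracted page\n4 2\n\x00\x00"
print(pbm_size(sample), is_landscape(sample))
```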
Re: Search from OCR PDF documents
Posted: Thu Jun 12, 2014 6:13 am
by matt81
Thanks for your reply.
OK, so what does that mean? It finds 2 images in this PDF file? I need to upload a PDF file and have its text extracted. I don't understand your comment.
How can I resolve this issue?
Thanks
Re: Search from OCR PDF documents
Posted: Fri Jun 13, 2014 4:31 pm
by jllort
1- First of all, the page must not be rotated +90 degrees; it should be scanned normally. When I extract the page from the PDF it is already rotated +90 degrees (it's not a vertical A4, it's an A4 rotated +90 degrees).
2- Remove the rotation setting in Administration -> Configuration.
The first image is what you really have in the PDF file. Although you apply +90, +180, etc., what you would really need is -90. But that is not the solution; the solution is to scan the A4 page vertically.
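To make the rotation argument concrete, here is a small sketch on a toy pixel grid: a page stored rotated +90 only comes back upright when it is rotated by -90. Treating +90 as clockwise is an assumption made for the illustration, and rotate_cw / rotate_ccw are hypothetical helpers, not part of OpenKM or Tesseract.

```python
def rotate_cw(grid):
    """Rotate a 2D grid 90 degrees clockwise (assumed to be '+90' here)."""
    return [list(row) for row in zip(*grid[::-1])]


def rotate_ccw(grid):
    """Rotate a 2D grid 90 degrees counter-clockwise (assumed to be '-90')."""
    return [list(row) for row in zip(*grid)][::-1]


# Toy "page": two rows of three pixels, read left to right.
upright = [[1, 2, 3],
           [4, 5, 6]]

stored = rotate_cw(upright)    # what ends up inside the PDF
restored = rotate_ccw(stored)  # the correction that is actually needed

print(stored)
print(restored == upright)
```

The same idea applies to a real scan: correcting the orientation at scan time avoids needing any rotation step before OCR.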
Re: Search from OCR PDF documents
Posted: Tue Jun 17, 2014 4:59 am
by matt81
Thanks a lot, you clarified it for me. That was the issue; I was stuck on it for a while.
As long as the PDF document is scanned vertically, there is no need for rotation; just leave it empty.
Rotation comes in handy if you are scanning .jpg or .tiff images: without rotation it will only work with horizontal scans. If rotation is set, it will work for all types of scans, vertical or horizontal.
Thanks again
Re: Search from OCR PDF documents
Posted: Wed Sep 10, 2014 1:02 am
by matt81
Hi, just coming back to this: I wanted to clarify what is meant by vertical or horizontal scanning. If I scan a page vertically the OCR doesn't work; no text is extracted. It only works if it's horizontal. In other words, it works as long as the text itself runs vertically on the page, not the page layout. For example, see the attached images. Vertical-scan.png doesn't work (no text is extracted), whereas Horizontal-scan.png works fine.
I wanted to get your confirmation: when you said to scan it vertically, is this what you meant, or am I missing something? Rotation is set to null.
Thanks
Re: Search from OCR PDF documents
Posted: Sat Sep 13, 2014 9:29 am
by jllort
If the text runs left to right, then OCR works correctly. In your case the text runs bottom to top (I hope this explanation is clearer). Referring to your screenshots: vertical will work, but not horizontal. The problem with your document is that when we extract the image from the PDF file, it comes out as shown in the Horizontal screenshot.
Re: Search from OCR PDF documents
Posted: Sun Sep 14, 2014 11:47 pm
by matt81
Thanks for your reply.
When I scan a document, I scan it to a PDF file, but that PDF file is really an image. So when you say that the image extracted from the PDF file comes out as in the Horizontal screenshot, that is perfectly correct. That's what I am saying: when we scan documents as PDF files, they have to be horizontal with the text running up and down, otherwise it won't work. If I scan a document as an image, it works fine for me if it's vertical. However, it won't work horizontally with the text running up and down.
So I'm not sure if my settings are all correct, but that's the behaviour I am getting. Can you confirm that that's how it works?
Sorry for taking up your time, but I want to get things clear, and make them clear for others.
Re: Search from OCR PDF documents
Posted: Tue Sep 16, 2014 7:16 am
by jllort
We do not have OCR image text direction recognition at present. You must upload the image or PDF with the correct orientation. Although there is a parameter, ocr.rotate, to rotate images, I do not recommend using it in your case: this parameter is used for all repository OCR scanning actions and will be applied to every document. The problem you have is with the PDF format: for some reason the preview in Acrobat Reader works correctly (everything looks fine), but when we extract the image it comes out with the wrong rotation. It's the first time we've seen this, and it would be quite strange for it to be a library problem (possible, but strange). Was the PDF you generated created in a very recent version?
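As a rough illustration of the rotation setting discussed in this thread, the sketch below parses a value in the "90;180;270" form into a list of angles, with an empty value meaning no rotation at all. How OpenKM actually reads and applies the setting is not shown here, so parse_rotations is a hypothetical helper and its behaviour is an assumption.

```python
def parse_rotations(value):
    """Parse a semicolon-separated angle list such as "90;180;270".

    Hypothetical helper: an empty or missing value yields no rotations.
    """
    angles = []
    for part in (value or "").split(";"):
        part = part.strip()
        if part:
            angles.append(int(part))
    return angles


print(parse_rotations("90;180;270"))
print(parse_rotations(""))
```

Because the setting is global, every angle in the list would be a candidate for every document in the repository, which is why leaving it empty and fixing the scan orientation is the recommendation above.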
Re: Search from OCR PDF documents
Posted: Tue Sep 16, 2014 11:28 pm
by matt81
Thanks for your reply.
So for now we will have to scan PDFs as I mentioned previously (horizontally with the text running up and down) until this is resolved, right?
Do you know if you will be able to find a solution soon?
The PDF I generated is, I believe, a newer version, but that shouldn't be a problem I guess, as most people will use newer versions.
Thanks
Re: Search from OCR PDF documents
Posted: Thu Sep 18, 2014 4:58 pm
by jllort
I've got no idea why the image is being extracted incorrectly in your case. Could you try scanning a document and generating a PDF with another computer or application? If we're not able to reproduce the problem, it is quite difficult for us to find the clue to solve it.
Re: Search from OCR PDF documents
Posted: Thu Sep 18, 2014 11:17 pm
by matt81
Thanks for your reply.
I started again from scratch and scanned new documents, and now it's working!! I'm not sure whether the older version of OpenKM I had previously was causing the problem, or whether there was an issue with the scanned documents. Either way, it's working now. All good.
Thank you.
Re: Search from OCR PDF documents
Posted: Sun Sep 21, 2014 6:19 am
by jllort
I vote for the second, because it would be quite strange for this problem to happen only to you; other people would have reported it, and that is not the case. However, we're pleased to see the problem has been solved.