Open Source Document Management System | OpenKM

Posted: **Fri Jun 06, 2014 3:44 am**

Thanks for your reply.
I did try the options mentioned, but they didn't work on my end. I have Tesseract version 3.02, so I am not sure if that is the difference, as you have 3.00, right?
Can you upload the PDF you used so I can test?

I left system.openoffice.dictionary empty because I couldn't find any oxi file. I have read that the following value should be set there, as it improves the quality. However, when I tried it, I had no luck; it was the same issue.

/var/libreoffice4.1/share/extensions/dict-en/en_AU.dic

Thanks

Posted: **Fri Jun 06, 2014 8:42 am**

matt81 wrote: Thanks for your reply.
I did try the options mentioned, but they didn't work on my end. I have Tesseract version 3.02, so I am not sure if that is the difference, as you have 3.00, right?
Can you upload the PDF you used so I can test?

I left system.openoffice.dictionary empty because I couldn't find any oxi file. I have read that the following value should be set there, as it improves the quality. However, when I tried it, I had no luck; it was the same issue.

/var/libreoffice4.1/share/extensions/dict-en/en_AU.dic

Thanks

I have 3.02 too. You should modify registered.text.extractors as I did:

org.apache.jackrabbit.extractor.PlainTextExtractor
org.apache.jackrabbit.extractor.MsWordTextExtractor
org.apache.jackrabbit.extractor.MsExcelTextExtractor
org.apache.jackrabbit.extractor.MsPowerPointTextExtractor
org.apache.jackrabbit.extractor.OpenOfficeTextExtractor
org.apache.jackrabbit.extractor.RTFTextExtractor
org.apache.jackrabbit.extractor.HTMLTextExtractor
org.apache.jackrabbit.extractor.XMLTextExtractor
org.apache.jackrabbit.extractor.PngTextExtractor
org.apache.jackrabbit.extractor.MsOutlookTextExtractor
com.openkm.extractor.PdfTextExtractor
com.openkm.extractor.Tesseract3TextExtractor
com.openkm.extractor.AudioTextExtractor
com.openkm.extractor.SourceCodeTextExtractor
com.openkm.extractor.MsOffice2007TextExtractor

and system.ocr (this is my value; you must change the path and language to match your setup):

c:\\esseract\\tesseract.exe ${fileIn} ${fileOut} -l vie

After that you can use "Check text extraction" to check whether it works.
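To see what a system.ocr value turns into before relying on it, the placeholders can be filled in by hand. This is only a minimal sketch: how OpenKM itself substitutes ${fileIn} and ${fileOut} is an assumption here, the function name build_ocr_command is hypothetical, and the splitting uses POSIX shell rules, so it suits the Linux-style value rather than a Windows path with backslashes.

```python
import shlex
from string import Template
from typing import List


def build_ocr_command(template: str, file_in: str, file_out: str) -> List[str]:
    """Fill the ${fileIn}/${fileOut} placeholders and split into an argv list.

    Assumption: the placeholders are replaced textually, as in a shell template.
    """
    filled = Template(template).substitute(
        fileIn=shlex.quote(file_in),    # quote so paths with spaces survive
        fileOut=shlex.quote(file_out),
    )
    return shlex.split(filled)


# Example paths are made up for illustration.
cmd = build_ocr_command(
    "/usr/bin/tesseract ${fileIn} ${fileOut} -l eng",
    "/tmp/scan page.tif",
    "/tmp/scan-out",
)
print(cmd)
```

Running the printed command by hand on a sample image is a quick way to tell whether a problem lies in Tesseract itself or in the OpenKM configuration.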

Posted: **Tue Jun 10, 2014 6:30 am**

Thanks for your feedback. I have tried everything with the same settings as you, and it still doesn't work. I have set system.ocr to /usr/bin/tesseract ${fileIn} ${fileOut} -l eng
system.pdf.force.ocr is unchecked, and I have also set system.ocr.rotate to 90;180;270.

The text extracted from my scanned PDF is garbage, see below:
"wmmmwm mam menmmwmwmm ammzsm on» Qonczsmim 8 25% 2 .ﬁ:m< mximﬂ .85 _o3_om1<‘ _ nE4m::< cmm Hmmmmanﬁ we, mo Em: msammmwoaﬁrmwmmcz. ._.

_G "

Please find attached the PDF I used. Can you test it on your end and tell me if it works?
I scanned the PDF document at 400 DPI.

Thanks

Posted: **Wed Jun 11, 2014 7:42 am**

The problem is that this PDF is rotated +90 degrees. I've extracted the PDF contents and got two images: test-000.pbm, which has the +90 rotation, and test-000-rotated, which is the result of rotating the original by -90. If you run Tesseract on the rotated image, it works correctly.

Posted: **Thu Jun 12, 2014 6:13 am**

Thanks for your reply.
OK, so what does that mean? It finds two images in this PDF file? I need to upload a PDF file and have its text extracted. I don't understand your comment.
How can I resolve this issue?

Thanks

Posted: **Fri Jun 13, 2014 4:31 pm**

1- First of all, the page must not be rotated +90 degrees; it should be scanned normally. When I extract the page from the PDF it is already rotated +90 degrees (it's not a vertical A4, it's A4 +90 degrees).
2- Remove the rotation setting in Administration -> Configuration.

The first image is what you really have in the PDF file. Although you apply +90, +180, etc., what you really need is -90. But that is not the solution; the solution is to scan the A4 page vertically.
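The orientation described above can be checked before uploading. This is a minimal sketch, assuming the page image has been extracted to a PBM file like the test-000.pbm mentioned earlier in the thread. The function names pbm_size and looks_landscape are hypothetical, and "wider than tall" is only a heuristic for a portrait A4 page that ended up rotated by 90 degrees.

```python
import re
from typing import Tuple


def pbm_size(data: bytes) -> Tuple[int, int]:
    """Return (width, height) read from the header of a PBM image (P1 or P4)."""
    # Strip '#' comments, then take the first three whitespace-separated tokens:
    # the magic number, the width and the height.
    header = re.sub(rb"#[^\n]*", b"", data[:512])
    tokens = header.split(None, 3)
    if len(tokens) < 3 or tokens[0] not in (b"P1", b"P4"):
        raise ValueError("not a PBM image")
    return int(tokens[1]), int(tokens[2])


def looks_landscape(data: bytes) -> bool:
    """Heuristic: a page wider than it is tall was probably stored rotated."""
    width, height = pbm_size(data)
    return width > height


# Made-up header: roughly an A4 page at 400 DPI, stored sideways.
sample = b"P4\n# scanned page\n4677 3307\n\x00\x00"
print(pbm_size(sample), looks_landscape(sample))
```

A page flagged this way would need to be rescanned or rotated before upload, which matches the advice to scan the A4 page vertically.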

Posted: **Tue Jun 17, 2014 4:59 am**

Thanks a lot, you clarified it for me. That was the issue; I was stuck on it for a while.
As long as the PDF document is scanned vertically, there is no need for rotation; just leave it empty.
Rotation comes in handy if you are scanning .jpg or .tiff images: without rotation it will only work with horizontal scans. If rotation is set, it will work for all types of scans, vertical or horizontal.

Thanks again

Posted: **Wed Sep 10, 2014 1:02 am**

Hi, just coming back to this: I wanted to clarify what is meant by vertical or horizontal scanning. If I scan a page vertically, the OCR doesn't work and no text is extracted. Only a horizontal scan works, as long as the text, not the layout, is vertical on the page. For example, see the attached images: Vertical-scan.png doesn't work (no text is extracted), whereas Horizontal-scan.png works fine.
I wanted to get your confirmation: when you said to scan it vertically, is this what you meant, or am I missing something? Rotation is set to null.

Thanks

Posted: **Sat Sep 13, 2014 9:29 am**

If the text runs left to right, then OCR works correctly. In your case the text runs bottom to top (I hope this explanation is clearer). Referring to your screenshots: the vertical one will work, but not the horizontal one. The problem with your document is that when we extract the image from the PDF file, it comes out as shown in the Horizontal screenshot.

Posted: **Sun Sep 14, 2014 11:47 pm**

Thanks for your reply.
When I scan a document, I scan it to a PDF file; however, that PDF file is really an image. So when you say that the problem is that the image extracted from the PDF comes out as in the Horizontal screenshot, that is perfectly correct. That's what I am saying: when we scan documents as PDF files, they must be horizontal with the text running up and down, otherwise it won't work. If I scan a document as an image, it works fine for me if it's vertical. However, it won't work horizontally with the text running up and down.

So I am not sure if my settings are all correct, but that's the behaviour I am getting. Can you confirm that that's how it works?

Sorry for taking your time, but I want to get things clear, and make it clear for others.

Posted: **Tue Sep 16, 2014 7:16 am**

We do not have OCR image text direction recognition at present. You must upload the image or PDF with the correct orientation. Although there is a parameter, ocr.rotate, to rotate images, I do not recommend using it in your case: this parameter is used for all repository OCR scanning actions and will be applied to every document. The problem you have is with the PDF format: for some reason the preview in Acrobat Reader works correctly (everything seems fine), but when we extract the image it comes out with the wrong rotation. It's the first time we have seen this, and it would be quite strange for it to be a library problem (possible, but strange). Was the PDF you generated created in a very recent version?

Posted: **Tue Sep 16, 2014 11:28 pm**

Thanks for your reply.
So for now we will have to scan PDFs as I mentioned previously (horizontally, with the text running up and down) until this is resolved, right?
Do you know if you can find a solution soon?
I believe the PDF I generated is a newer version, but that shouldn't be a problem I guess, as most people will use newer versions.

Thanks

Posted: **Thu Sep 18, 2014 4:58 pm**

I have no idea why the image is being extracted incorrectly in your case. Could you try scanning a document and generating a PDF with another computer or application? If we are not able to reproduce the problem, it is quite difficult for us to find a clue to solve it.

Posted: **Thu Sep 18, 2014 11:17 pm**

Thanks for your reply.
I started again from scratch and scanned new documents, and now it's working! I'm not sure whether the older version of OpenKM I had previously was causing the problem or whether there was an issue with the scanned documents, but now it's working. All good.

Thank you.

Posted: **Sun Sep 21, 2014 6:19 am**

I vote for the second, because it would be quite strange for this problem to happen only to you; other people would have reported it, and that is not the case. Anyway, we're pleased to see the problem has been solved.

Search from OCR pdf documents
