Open Source Document Management System | OpenKM

Posted: **Sat Jun 21, 2014 5:02 pm**

Hi,

I can extract text from non-rotated images in OpenKM without any issue.
But I have an image that is rotated by +90 degrees.
On the Ubuntu command line, when I use the "-rotate -90" option in ImageMagick and then run tesseract, the text is extracted properly from this +90 rotated image.
However, when I set the OpenKM property "system.ocr.rotate" to "-90" and upload the same image (which is a PDF file), the extracted text is nothing but gibberish.
I tried
i) system.ocr.rotate String 90;180;270;
ii) system.ocr.rotate String -90;
iii) restarting OpenKM
but none of them worked.

Do you have any suggestions?
Here are my current configuration settings in OpenKM ...

*********************************************
registered.text.extractors List ->
org.apache.jackrabbit.extractor.PlainTextExtractor
org.apache.jackrabbit.extractor.MsWordTextExtractor
org.apache.jackrabbit.extractor.MsExcelTextExtractor
org.apache.jackrabbit.extractor.MsPowerPointTextExtractor
org.apache.jackrabbit.extractor.OpenOfficeTextExtractor
org.apache.jackrabbit.extractor.RTFTextExtractor
org.apache.jackrabbit.extractor.HTMLTextExtractor
org.apache.jackrabbit.extractor.XMLTextExtractor
org.apache.jackrabbit.extractor.PngTextExtractor
org.apache.jackrabbit.extractor.MsOutlookTextExtractor
com.openkm.extractor.PdfTextExtractor
com.openkm.extractor.AudioTextExtractor
com.openkm.extractor.ExifTextExtractor
com.openkm.extractor.Tesseract3TextExtractor
com.openkm.extractor.SourceCodeTextExtractor
com.openkm.extractor.MsOffice2007TextExtractor

system.imagemagick.convert String /usr/bin/convert -density 300 ${fileIn} -depth 8 ${fileOut}
system.ocr String /usr/local/bin/tesseract ${fileIn} ${fileOut}
system.ocr.rotate String -90;
system.pdf.force.ocr Boolean Inactive
*********************************************
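
For reference, the manual command-line sequence described above (rotate with ImageMagick, then run tesseract) could be reproduced outside OpenKM with something like the sketch below. The binary paths and the "-density 300" / "-depth 8" options are taken from the settings listed in this post; the intermediate ".tif" file name and the helper names `build_ocr_commands` and `run_ocr` are assumptions made for illustration only. It does not show how OpenKM itself invokes these tools.

```python
import subprocess
from pathlib import Path


def build_ocr_commands(file_in, rotate_degrees, work_dir="."):
    """Return two argv lists: the ImageMagick convert call, then the tesseract call."""
    src = Path(file_in)
    # Intermediate image name is a hypothetical choice for this sketch.
    image = Path(work_dir) / (src.stem + ".tif")
    # tesseract appends ".txt" to the output base on its own.
    out_base = Path(work_dir) / src.stem
    convert_cmd = [
        "/usr/bin/convert", "-density", "300", str(src),
        "-rotate", str(rotate_degrees), "-depth", "8", str(image),
    ]
    tesseract_cmd = ["/usr/local/bin/tesseract", str(image), str(out_base)]
    return convert_cmd, tesseract_cmd


def run_ocr(file_in, rotate_degrees, work_dir="."):
    """Run both commands in order and return the path of the text file tesseract writes."""
    convert_cmd, tesseract_cmd = build_ocr_commands(file_in, rotate_degrees, work_dir)
    subprocess.run(convert_cmd, check=True)    # rotate and rasterize the input
    subprocess.run(tesseract_cmd, check=True)  # OCR the rotated image
    return Path(work_dir) / (Path(file_in).stem + ".txt")
```

Calling `run_ocr("scan.pdf", -90)` would correspond to the manual test that produced readable text, which may help to compare against whatever OpenKM produces for the same file.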

Posted: **Sun Jun 22, 2014 10:47 am**

Can you take control of this kind of document? For example, can they always be stored in a particular folder (before OCR is run), or be identified by name, user, or metadata? If we can identify them in some way, we can process them separately and perform whatever tasks are needed for OCR to work correctly. Are you able to identify them somehow, always put them in the same folder, or have the user set some metadata that identifies this kind of document?

Posted: **Sun Jun 22, 2014 4:27 pm**

The +90 degree documents are random, so I cannot distinguish them from non-rotated documents.
Actually, I have been extracting text from a mix of such documents on Windows without any issue, so I was expecting the same on Linux.
But it looks like tesseract, being freeware, has its own limitations...
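
When rotated and non-rotated pages cannot be told apart in advance, one possible workaround is to OCR each page at several rotations and keep the output that looks most like real text. The sketch below is only an illustrative heuristic, not something OpenKM or tesseract is known to do: the function names `word_score` and `pick_best_rotation` are hypothetical, and the "word-like token" rule (letters only, at least two characters, contains a vowel) is an assumption that would need tuning for other languages.

```python
import re


def word_score(text):
    """Fraction of whitespace-separated tokens that look like ordinary words."""
    tokens = text.split()
    if not tokens:
        return 0.0
    wordlike = [
        t for t in tokens
        # Letters only (optionally one trailing punctuation mark) and at least one vowel.
        if re.fullmatch(r"[A-Za-z]{2,}[.,;:!?]?", t) and re.search(r"[aeiouAEIOU]", t)
    ]
    return len(wordlike) / len(tokens)


def pick_best_rotation(ocr_by_rotation):
    """Given {angle: ocr_text}, return the angle whose text scores highest."""
    return max(ocr_by_rotation, key=lambda angle: word_score(ocr_by_rotation[angle]))
```

Here `ocr_by_rotation` would be filled by running the OCR step once per candidate angle (for example 0, 90, 180 and 270) and collecting the resulting text for each.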

I will download the ABBYY CLI for Linux and see how it performs at extracting text from such a mix.

Thanks for looking into this!

Image rotation not working