
Tesseract OCR version update support for more image types?

Posted: Wed Nov 24, 2010 9:14 pm
by bontscho
hi guys,

i did a clean install of openkm and i'm very happy so far.

my question is:

i have tesseract version 1.02 running on my server, and tesseract 2.04 and tesseract 3 are available. does openkm support the newer versions of tesseract, so that an upgrade would give me the additional support for more image formats like jpg/png, as stated on the official tesseract page?

it also says that version 3 is not compatible with the files from 2.04, so i would really appreciate a clear answer on that topic :-)

many thanks for any useful information here.

kind regards,
bontscho

Re: Tesseract OCR version update support for more image type

Posted: Thu Nov 25, 2010 8:45 pm
by pavila
By default, only TIFF images are used for OCR.

Re: Tesseract OCR version update support for more image type

Posted: Thu Nov 25, 2010 8:50 pm
by bontscho
so an upgrade to tesseract 2.04 should enable the multipage/compressed tif support?

btw is an update to support more image types planned?

thanks a lot for your answer,

kind regards,
bontscho

Re: Tesseract OCR version update support for more image type

Posted: Mon Nov 29, 2010 6:38 pm
by pavila
I'm not a tesseract hacker. Maybe you should try a dedicated forum or mailing list about tesseract.

Re: Tesseract OCR version update support for more image type

Posted: Mon Nov 29, 2010 6:50 pm
by bontscho
never mind, i successfully upgraded to tesseract 2.04 and now multipage tiffs and compressed tiffs are extracted correctly.

tesseract 2.04 also enables localizing, which means my german documents are now recognized correctly and available through lucene.

maybe in the future openkm will take advantage of tesseract 3, which would make its ocr more flexible (tesseract 3 supports more image formats, as mentioned).

kind regards,
bontscho

Re: Tesseract OCR version update support for more image type

Posted: Wed Dec 01, 2010 6:32 pm
by pavila
I was reading about tesseract 3.0 and it has some interesting improvements. If the command line parameters have not changed, the new tesseract 3.0 can run fine with OpenKM. If you want to pass other parameters to tesseract, you can point the "system.ocr" setting to a script which wraps the original binary call. The same trick is used with the pdf2swf utility. Search the wiki for more info.
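
A rough sketch of what such a wrapper could look like, written in Python. Everything specific here is an assumption and not taken from OpenKM itself: the binary path /usr/bin/tesseract, the extra "-l deu" arguments (German language data, as in the earlier post), and the idea that OpenKM passes the input image and the output base name as the only arguments. Adjust to whatever your installation really does.

```python
#!/usr/bin/env python3
"""Hypothetical wrapper around the tesseract binary, for use as "system.ocr"."""
import subprocess
import sys

# Assumption: location of the real tesseract binary on this server.
TESSERACT_BIN = "/usr/bin/tesseract"
# Assumption: extra arguments to append; "-l deu" selects German language data.
EXTRA_ARGS = ("-l", "deu")


def build_command(args, binary=TESSERACT_BIN, extra=EXTRA_ARGS):
    """Return the full command: real binary, caller's arguments, then the extras."""
    return [binary, *args, *extra]


def main(argv):
    # Forward whatever the caller passed (assumed: input image, output base)
    # and hand tesseract's exit code back to the caller.
    return subprocess.call(build_command(argv[1:]))


# Run only when arguments are given, so a bare invocation is a no-op.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

Saved under a hypothetical path such as /usr/local/bin/tesseract-wrapper and made executable, the script would be what "system.ocr" points to in place of the tesseract binary. Extra options go after the output base because that is where tesseract expects "-l".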