OCR in the Community version
Posted: Thu Jun 04, 2015 6:51 am
by gwaitsi
To clarify my understanding, please:
There are no OCR options in the community version.
Is the OCR text extraction used then? Does it make any sense to install tesseract or cuneiform in the community version?
If it is worth installing, how does one check the quality of the text extraction, i.e. where can we find the extracted text?
Re: OCR in the Community version
Posted: Fri Jun 05, 2015 4:18 pm
by jllort
The community version comes with OCR integration, but you must configure it:
http://wiki.openkm.com/index.php/Third- ... ation:_OCR . How to enable it is explained here:
http://wiki.openkm.com/index.php/Applic ... abling_OCR . My suggestion is to use the tesseract OCR engine.
To check the process, copy a document's UUID (Properties tab), go to Administration -> Database query and execute the following (with jdbc selected at the bottom right):
Code:
SELECT * FROM OKM_NODE_DOCUMENT WHERE NBS_UUID='the uuid'
The field NDC_TEXT_EXTRACTED='T' indicates that the document has been processed by the text extraction queue (keep in mind that documents are not processed immediately; they go into a queue and are then processed sequentially to extract their text).
The field NDC_TEXT holds the extracted content.
I hope this helps you get started.
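The same check can be scripted once you have database access. The sketch below is only an illustration: it uses an in-memory SQLite table as a stand-in for OKM_NODE_DOCUMENT (a real OpenKM installation uses whichever database it was configured with), and the helper name extraction_status, the sample UUIDs, and the 'F' value for unprocessed documents are assumptions made for the example. Only the table and column names come from the post above.

```python
import sqlite3


def extraction_status(conn, uuid):
    """Return (processed, text) for a document UUID, or None if not found."""
    row = conn.execute(
        "SELECT NDC_TEXT_EXTRACTED, NDC_TEXT FROM OKM_NODE_DOCUMENT "
        "WHERE NBS_UUID = ?",
        (uuid,),
    ).fetchone()
    if row is None:
        return None
    flag, text = row
    # 'T' means the text extraction queue has already handled the document.
    return flag == "T", text or ""


# Stand-in table with only the three columns named in the thread (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE OKM_NODE_DOCUMENT ("
    "NBS_UUID TEXT PRIMARY KEY, NDC_TEXT_EXTRACTED TEXT, NDC_TEXT TEXT)"
)
conn.executemany(
    "INSERT INTO OKM_NODE_DOCUMENT VALUES (?, ?, ?)",
    [
        ("uuid-done", "T", "some extracted text"),
        ("uuid-queued", "F", None),  # 'F' is an assumed placeholder value
    ],
)

print(extraction_status(conn, "uuid-done"))
print(extraction_status(conn, "uuid-queued"))
```

A document that reports processed = True together with an empty text is the case worth investigating, since it suggests the OCR step ran but produced nothing.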
Re: OCR in the Community version
Posted: Fri Jun 05, 2015 7:03 pm
by gwaitsi
Hi jllort,
That is perfect, thanks. I got the Java settings working so it seems stable, and I have tried both tesseract and cuneiform.
I found cuneiform processes quicker than tesseract, and the only problems I am getting from cuneiform are documents from Windows 95 and older, i.e. Word 95 and Excel 95, not being recognised.
With tesseract under Debian Wheezy I am getting all sorts of other errors and conversion is very slow.
I noticed two things in the temp folder: PDF files are being converted to JPG, and most output files have no extension, i.e. no .TXT.
But the info you posted helps me work on it, and once I have everything working well, I will redo the text extraction.
Re: OCR in the Community version
Posted: Sat Jun 06, 2015 5:23 am
by gwaitsi
For anyone interested, I have individually tested all scenarios with cuneiform and tesseract, and the command that gave the best results for the 3 languages I use was:
tesseract ${fileIn} ${fileOut} -l eng+ger+ces
Without the language specification, tesseract did not give good results for Czech, although it did recognise German without it.
cuneiform took half the time to process, but only because it produced a fraction of the text that tesseract did, and its results were rubbish and not usable.
I concur with jllort: tesseract is the best one to use.
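To try a system.ocr command by hand outside OpenKM, it helps to see what it looks like once the placeholders are filled in. The sketch below is an assumption about how such a template could be expanded, not a description of what OpenKM does internally; the function name build_ocr_command and the sample paths are made up for the example.

```python
from string import Template


def build_ocr_command(template, file_in, file_out):
    """Expand ${fileIn} and ${fileOut} and return an argument list.

    The template is split on whitespace first, so a substituted path
    that contains spaces still ends up as a single argument.
    """
    return [
        Template(token).safe_substitute(fileIn=file_in, fileOut=file_out)
        for token in template.split()
    ]


command = build_ocr_command(
    "tesseract ${fileIn} ${fileOut} -l eng+ger+ces",
    "/tmp/scan page.jpg",  # hypothetical input image
    "/tmp/scan_out",       # hypothetical output base name
)
print(command)
```

The resulting list can be passed to subprocess.run to reproduce the call from a terminal session. Note that tesseract treats its second argument as an output base name and appends .txt to it when writing plain text.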
Re: OCR in the Community version
Posted: Sat Jun 06, 2015 6:24 pm
by jllort
Regarding Word 95 and Excel 95, it could be a problem with the MSOfficeTextExtractor class. Could you test on our demo.openkm.com and also share here a document that is not indexed correctly (for testing in one of our development environments)?
3-4 years ago, cuneiform was a lot better than tesseract. But it seems Google has spent more time on tesseract, and the Cognitive Forms guys (http://cognitiveforms.com/) have frozen the cuneiform community version. Anyway, if you need specialized OCR, the Cognitive Forms guys can build a customized cuneiform on demand to fit your needs. I think this is something that should be made known: these people really have very good commercial tools, and cuneiform can go much further than the community edition.
Re: OCR in the Community version
Posted: Wed Mar 28, 2018 7:02 am
by hus.alkahli
I am hoping this is the right place.
I've installed OpenKM Community Version: 6.3.6 (build: 87d181f) under Windows 7 and installed tesseract 3.05.01.
Below is the configuration that I set in the administration tab in OpenKM.
Code:
system.ocr=c:\Tesseract-ocr-3.0.2\tesseract.exe ${fileIn} ${fileOut}
system.ocr.rotate= 90;180;270;
system.pdf.force.ocr=TRUE
Now I need to know how to use OCR, because I couldn't find the OCR module (icon) in the administration tab.
I've watched some videos on YouTube and found that there is a module (icon), as in the screenshot below,
but I don't have it in my OpenKM, as in the screenshot below. How do I add it to my system?
Please help, and thanks in advance.
Re: OCR in the Community version
Posted: Thu Mar 29, 2018 9:33 am
by jllort
The OCR engine works in both the community and professional editions, but OCR Zonal is only available in the professional edition.
Re: OCR in the Community version
Posted: Tue Oct 08, 2019 2:31 am
by siegfredev
gwaitsi wrote: Sat Jun 06, 2015 5:23 am
For anyone interested, I have individually tested all scenarios with cuneiform and tesseract, and the command that gave the best results for the 3 languages I use was:
tesseract ${fileIn} ${fileOut} -l eng+ger+ces
Without the language specification, tesseract did not give good results for Czech, although it did recognise German without it.
I am having trouble with OCR for the Japanese language. I have placed jpn.traineddata in the tessdata folder. tesseract works great on the command line with -l eng+jpn. I have tried adding -l eng+jpn to the system.ocr configuration.
[Attachment: ocrconfig.png]
It only shows "T" in the NDC_TEXT_EXTRACTED column.
[Attachment: extrtxt.png]
What am I missing here? I am using CentOS 7, by the way. Thank you in advance.
EDIT: SOLVED! Apparently all I needed was the 3.04 version of jpn.traineddata for tesseract 3.04.
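A quick first check for this kind of problem is whether every language named after -l has a matching file in the tessdata folder. The sketch below is a hedged illustration: the function name missing_traineddata is made up, and a throwaway temporary directory stands in for the real tessdata folder, whose location depends on the installation.

```python
import os
import tempfile


def missing_traineddata(lang_spec, tessdata_dir):
    """Return the languages in an 'eng+jpn' style spec that have no
    <lang>.traineddata file in tessdata_dir."""
    missing = []
    for lang in lang_spec.split("+"):
        path = os.path.join(tessdata_dir, lang + ".traineddata")
        if not os.path.isfile(path):
            missing.append(lang)
    return missing


# Throwaway directory standing in for the real tessdata folder (assumption).
demo = tempfile.TemporaryDirectory()
open(os.path.join(demo.name, "eng.traineddata"), "wb").close()

print(missing_traineddata("eng+jpn", demo.name))  # jpn has no file here
```

This only confirms that the files are present. As the fix above shows, a file can exist and still be unusable if it was built for a different tesseract version, so the traineddata version has to be checked against the installed tesseract separately.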
Re: OCR in the Community version
Posted: Wed Oct 09, 2019 6:49 pm
by jllort
I suggest upgrading to tesseract version 4 in the future.