Open Source Document Management System | OpenKM

OCR in the Community version


#39738 by gwaitsi
Thu Jun 04, 2015 6:51 am

To clarify my understanding, please:
There are no OCR options in the Community version.
Is OCR text extraction used at all, then? Does it make any sense to install tesseract or cuneiform with the Community version?
If it is worth installing, how does one check the quality of the text extraction, i.e. where can we find the extracted text?

Username

gwaitsi

Rank

Senior Boarder

Posts

Joined

Wed Sep 03, 2014 1:00 pm

Re: OCR in the Community version

#39820 by jllort
Fri Jun 05, 2015 4:18 pm

The Community version comes with OCR integration, but you must configure it: http://wiki.openkm.com/index.php/Third- ... ation:_OCR . How to enable it is explained here: http://wiki.openkm.com/index.php/Applic ... abling_OCR . My suggestion is to use the tesseract OCR engine.

To check the process, copy a document's UUID (Properties tab), go to Administration -> Database query and execute the following (with jdbc selected at the bottom right):

Code:

SELECT * FROM OKM_NODE_DOCUMENT WHERE NBS_UUID='the uuid'

The field NDC_TEXT_EXTRACTED='T' indicates that the document has been processed by the text extraction queue (keep in mind that documents are not processed immediately; they go into a queue and are then processed sequentially to extract their text).
The field NDC_TEXT holds the extracted content.
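
As a rough illustration of how to read those two fields, here is a minimal Python sketch. It runs against an in-memory SQLite table that only mimics the three column names mentioned above. The function names `extraction_status` and `demo_connection`, the sample UUIDs, and the 'F' value for unprocessed documents are assumptions for illustration, not part of OpenKM.

```python
import sqlite3


def extraction_status(conn, uuid):
    """Summarise the text extraction state of one document UUID."""
    row = conn.execute(
        "SELECT NDC_TEXT_EXTRACTED, NDC_TEXT "
        "FROM OKM_NODE_DOCUMENT WHERE NBS_UUID = ?",
        (uuid,),
    ).fetchone()
    if row is None:
        return "not found"
    extracted, text = row
    # Anything other than 'T' is treated as "not processed yet".
    if extracted != "T":
        return "still queued"
    # 'T' with empty text: the queue ran, but the extractor produced nothing.
    if not text or not text.strip():
        return "processed, but no text extracted"
    return f"processed, {len(text)} characters extracted"


def demo_connection():
    """Build a stand-in table (assumption: not the real OpenKM schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE OKM_NODE_DOCUMENT ("
        "NBS_UUID TEXT PRIMARY KEY, NDC_TEXT_EXTRACTED TEXT, NDC_TEXT TEXT)"
    )
    conn.executemany(
        "INSERT INTO OKM_NODE_DOCUMENT VALUES (?, ?, ?)",
        [
            ("uuid-queued", "F", None),   # 'F' is a hypothetical value
            ("uuid-empty", "T", ""),
            ("uuid-ok", "T", "Hello OCR"),
        ],
    )
    return conn
```

The "processed, but no text extracted" case is the one worth watching for: the flag alone does not say whether the OCR engine produced usable text.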

I hope this helps you get started.

Username

jllort

Rank

Moderator

Posts

12048

Joined

Fri Dec 21, 2007 11:23 am

Location

Sineu - ( Illes Balears ) - Spain

Contact

Re: OCR in the Community version

#39823 by gwaitsi
Fri Jun 05, 2015 7:03 pm

Hi jllort,

That is perfect, thanks. I got the Java settings working so it seems stable, and I have tried both tesseract and cuneiform.

I found that cuneiform processes more quickly than tesseract, and the only problems I am getting with cuneiform are documents from Windows 95 and older, i.e. Word 95 and Excel 95 files not being recognised.

With tesseract under Debian Wheezy I am getting all sorts of other errors and conversion is very slow.
I noticed two things in the temp folder: PDF files are being converted to JPG, and most output files have no extension, i.e. no .TXT.

But your info helped me work it out, and once I have everything working well, I will redo the text extraction.

Re: OCR in the Community version

#39824 by gwaitsi
Sat Jun 06, 2015 5:23 am

For anyone interested, I have individually tested all scenarios with cuneiform and tesseract, and the setting that gave the best results for the 3 languages I use was:

tesseract ${fileIn} ${fileOut} -l eng+ger+ces

Without the language specification, tesseract did not give good results for Czech, although it did recognise German without it.

The cuneiform results were rubbish. It took half the time to process only because the amount of text it produced was a fraction of tesseract's, and that text was not usable.
I concur with jllort: tesseract is the best one to use.
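
For anyone assembling that setting for a different set of languages, here is a small Python sketch that builds the same kind of command string. The function name `build_ocr_command` and its checks are assumptions for illustration; the only parts taken from this thread are the `${fileIn}` / `${fileOut}` placeholders and the `-l` option with codes joined by `+`.

```python
def build_ocr_command(languages, binary="tesseract"):
    """Build a system.ocr style command string for the given language codes."""
    base = f"{binary} ${{fileIn}} ${{fileOut}}"
    if not languages:
        # No -l option: tesseract falls back to its default language.
        return base
    for code in languages:
        # Reject codes that would break the "+"-joined list.
        if not code or "+" in code or any(ch.isspace() for ch in code):
            raise ValueError(f"invalid language code: {code!r}")
    return base + " -l " + "+".join(languages)
```

Each code has to match a traineddata file that is actually installed for the tesseract version in use, otherwise that language will not be applied.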

Re: OCR in the Community version

#39834 by jllort
Sat Jun 06, 2015 6:24 pm

The problem with Word 95 and Excel 95 could be in the MSOfficeTextExtractor class. Could you test on our demo.openkm.com and also share here a document that is not indexed correctly (for testing in one of our development environments)?

3-4 years ago, cuneiform was a lot better than tesseract. But it seems Google has spent more time on that project, and the Cognitive Forms guys - http://cognitiveforms.com/ - have frozen the cuneiform community version. Anyway, if you need specialized OCR, the Cognitive Forms guys can build a customized cuneiform on demand to fit your needs. I think this deserves to be said: these people have very good commercial tools, and cuneiform can go much further than the community edition.

Re: OCR in the Community version

#45550 by hus.alkahli
Wed Mar 28, 2018 7:02 am

I am hoping this is the right place.

I've installed OpenKM Community Version: 6.3.6 (build: 87d181f) under Windows 7 and installed tesseract 3.05.01.

Below is the configuration that I set in the administration tab in OpenKM.

Code:

system.ocr=c:\Tesseract-ocr-3.0.2\tesseract.exe ${fileIn} ${fileOut}
system.ocr.rotate= 90;180;270; 
system.pdf.force.ocr=TRUE

Now I need to know how to use OCR, because I couldn't find the OCR module (icon) in the administration tab.

I've watched some videos on YouTube and found that there is a module (icon) there, as shown in the attached screenshot.

But I don't have it in my OpenKM, as the second attached screenshot shows. How do I add it to my system?

Please help, and thanks in advance.

Username

hus.alkahli

Rank

Fresh Boarder

Posts

Joined

Sun Mar 18, 2018 10:44 am

Re: OCR in the Community version

#45579 by jllort
Thu Mar 29, 2018 9:33 am

The OCR engine works in both the Community and Professional editions, but zonal OCR is only available in the Professional edition.

Re: OCR in the Community version

#49038 by siegfredev
Tue Oct 08, 2019 2:31 am

gwaitsi wrote (Sat Jun 06, 2015 5:23 am):
for anyone interested, i have individually tested all scenarios with cuneiform and tesseract and the results that gave the best results for the 3 languages i use were:

tesseract ${fileIn} ${fileOut} -l eng+ger+ces

without the lang specification, tesseract did not give good results for Czech, although it did recognise German without the language addition.

I am having trouble with OCR for Japanese. I have placed jpn.traineddata in the tessdata folder. tesseract works great on the command line with -l eng+jpn. I have tried adding -l eng+jpn to the system.ocr configuration.

It only shows "T" in the NDC_TEXT_EXTRACTED column.

What am I missing here? I am using CentOS 7, by the way. Thank you in advance.

EDIT: SOLVED! Apparently all I needed was the 3.04 version of jpn.traineddata for tesseract 3.04.
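
For anyone debugging something similar, one option is to run the exact system.ocr command outside OpenKM and look at the text it writes. Below is a minimal Python sketch under stated assumptions: the function names `expand_ocr_command` and `run_ocr_check` are made up for illustration, the way the `${fileIn}` and `${fileOut}` placeholders are filled in here is my guess and not necessarily how OpenKM does it, and the shell-style splitting assumes a POSIX system such as CentOS.

```python
import shlex
import subprocess
from pathlib import Path
from string import Template


def expand_ocr_command(template, file_in, file_out):
    """Split the template into arguments, then fill in the two placeholders."""
    values = {"fileIn": file_in, "fileOut": file_out}
    # Split first so that paths containing spaces stay single arguments.
    return [Template(token).substitute(values) for token in shlex.split(template)]


def run_ocr_check(template, image_path, out_base):
    """Run the expanded command and return (exit code, OCR text, stderr)."""
    args = expand_ocr_command(template, str(image_path), str(out_base))
    result = subprocess.run(args, capture_output=True, text=True)
    # The tesseract CLI appends ".txt" to the output base name itself.
    out_file = Path(str(out_base) + ".txt")
    text = out_file.read_text(encoding="utf-8") if out_file.exists() else ""
    return result.returncode, text, result.stderr
```

Calling `run_ocr_check("tesseract ${fileIn} ${fileOut} -l eng+jpn", "sample.png", "/tmp/ocr_out")` (both paths are placeholders) as the same user that runs OpenKM should show whether the engine returns text at all, and the stderr output is where a traineddata loading problem would most likely appear.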

Username

siegfredev

Rank

Fresh Boarder

Posts

Joined

Tue Oct 08, 2019 1:59 am

Re: OCR in the Community version

#49063 by jllort
Wed Oct 09, 2019 6:49 pm

I suggest upgrading to tesseract version 4 in the future.
