Open Source Document Management System | OpenKM

Posted: **Thu Jun 04, 2015 6:51 am**

To clarify my understanding, please:
There are no OCR options in the community version.
Is OCR text extraction used at all, then? Does it make any sense to install tesseract or cuneiform with the community version?
If it is worth installing, how does one check the quality of the text extraction, i.e. where can we find the extracted text?

Posted: **Fri Jun 05, 2015 4:18 pm**

The community version comes with OCR integration, but you must configure it: http://wiki.openkm.com/index.php/Third- ... ation:_OCR . How to enable it is explained here: http://wiki.openkm.com/index.php/Applic ... abling_OCR . My suggestion is to use the tesseract OCR engine.

To check the process, copy a document's UUID (Properties tab), go to Administration -> Database query, and execute the following (with jdbc selected at the bottom right):

Code: Select all

SELECT * FROM OKM_NODE_DOCUMENT WHERE NBS_UUID='the uuid'

The field NDC_TEXT_EXTRACTED='T' indicates the document has been processed from the text extraction queue (keep in mind that documents are not processed immediately; they go into a queue and are then processed sequentially to extract their text).
The field NDC_TEXT holds the extracted content.
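The check described above can be sketched as a small script. This is only a toy illustration, not OpenKM code: it uses an in-memory SQLite table as a stand-in for the real database, models only the three columns named in this thread, and the sample rows, the 'F' flag value and the `extraction_status` helper are assumptions made for the example.

```python
import sqlite3

# Toy stand-in for the OpenKM database: only the three columns named in this
# thread are modelled, and every row below is made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE OKM_NODE_DOCUMENT ("
    "NBS_UUID TEXT PRIMARY KEY, NDC_TEXT_EXTRACTED TEXT, NDC_TEXT TEXT)"
)
conn.executemany(
    "INSERT INTO OKM_NODE_DOCUMENT VALUES (?, ?, ?)",
    [
        ("uuid-queued", "F", None),           # assumed value for "not yet processed"
        ("uuid-ok", "T", "Invoice 2015-06"),  # processed, text found
        ("uuid-empty", "T", ""),              # processed, nothing extracted
    ],
)


def extraction_status(connection, uuid):
    """Classify one document by its extraction flag and extracted text."""
    row = connection.execute(
        "SELECT NDC_TEXT_EXTRACTED, NDC_TEXT FROM OKM_NODE_DOCUMENT "
        "WHERE NBS_UUID = ?",
        (uuid,),
    ).fetchone()
    if row is None:
        return "unknown uuid"
    extracted, text = row
    if extracted != "T":
        return "not processed yet"
    if not (text or "").strip():
        return "processed, but no text was extracted"
    return "processed, text available"


for uuid in ("uuid-queued", "uuid-ok", "uuid-empty"):
    print(uuid, "->", extraction_status(conn, uuid))
```

The third case is the one to watch for when judging OCR quality: the flag says the document left the queue, but the text column shows the engine produced nothing usable.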

I hope this is enough to get you started.

Posted: **Fri Jun 05, 2015 7:03 pm**

Hi jllort,

That is perfect, thanks. I got the Java settings working so it seems stable, and I have tried both tesseract and cuneiform.

I found that cuneiform processes faster than tesseract. Also, the only problems I am getting from cuneiform are documents from Windows 95 and older, i.e. Word 95 and Excel 95, not being recognised.

With tesseract under Debian Wheezy I am getting all sorts of other errors, and conversion is very slow.
I noticed two things in the temp folder: PDF files are being converted to JPG, and most output files have no extension, i.e. no .TXT.

But your info helped me work it out, and once I have everything working well, I will redo the text extraction.

Posted: **Sat Jun 06, 2015 5:23 am**

For anyone interested, I have individually tested all scenarios with cuneiform and tesseract, and the command that gave the best results for the 3 languages I use was:

tesseract ${fileIn} ${fileOut} -l eng+ger+ces

Without the language specification, tesseract did not give good results for Czech, although it did recognise German without the language addition.

Cuneiform took half the time to process, but only because it produced a fraction of the text that tesseract did, and the results were rubbish and not usable.
I concur with jllort: tesseract is the best one to use.
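To make the placeholder syntax in the command above concrete, here is a minimal sketch of how a `${fileIn} ${fileOut}` template expands into an actual command line. How OpenKM performs the substitution internally is not shown in this thread, so this is only an illustration; the `build_ocr_command` helper and the file paths are hypothetical, and the language list is copied verbatim from the post.

```python
import shlex
from string import Template

# Command template in the style of the system.ocr setting quoted in this thread.
OCR_TEMPLATE = "tesseract ${fileIn} ${fileOut} -l eng+ger+ces"


def build_ocr_command(template, file_in, file_out):
    """Fill in the two placeholders and split the result into an argv list."""
    filled = Template(template).substitute(
        fileIn=shlex.quote(file_in),    # quote so paths with spaces survive
        fileOut=shlex.quote(file_out),
    )
    return shlex.split(filled)


# Hypothetical temp-file paths, chosen only for the example.
command = build_ocr_command(OCR_TEMPLATE, "/tmp/scan 01.jpg", "/tmp/scan 01")
print(command)
```

Note that tesseract treats the second argument as an output base name and writes its text to that name plus `.txt`, so the value substituted for `${fileOut}` carries no extension itself.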

Posted: **Sat Jun 06, 2015 6:24 pm**

The Word 95 and Excel 95 issue could be a problem with the MSOfficeTextExtractor class. Could you test on our demo.openkm.com and also share here a document that is not indexed correctly (for testing in one of our development environments)?

3-4 years ago, cuneiform was much better than tesseract. But it seems Google has spent more time on tesseract, and the Cognitive Forms guys - http://cognitiveforms.com/ - have frozen the cuneiform community version. Anyway, if you need specialized OCR, the Cognitive Forms guys can build a customized cuneiform on demand to fit your needs. I think this deserves to be known: these people really have very good commercial tools, and cuneiform can go much further than the community edition does.

Posted: **Wed Mar 28, 2018 7:02 am**

I am hoping this is in the right place.

I've installed OpenKM Community Version: 6.3.6 (build: 87d181f) under Windows 7 and installed tesseract 3.05.01.

Below is the configuration that I performed in the administration tab in OpenKM.

Code: Select all

system.ocr=c:\Tesseract-ocr-3.0.2\tesseract.exe ${fileIn} ${fileOut}
system.ocr.rotate= 90;180;270; 
system.pdf.force.ocr=TRUE

Now I need to know how to use OCR, because I couldn't find the OCR module (icon) in the administration tab.

I've watched some videos on YouTube and found that there is a module (icon), as in the screenshot below.

But I don't have it in my OpenKM, as in the screenshot below. How do I add it to my system?

Please help, and thanks in advance.

Posted: **Thu Mar 29, 2018 9:33 am**

The OCR engine works in both the community and professional editions, but Zonal OCR is only available in the professional edition.

Posted: **Tue Oct 08, 2019 2:31 am**

gwaitsi wrote: ↑ Sat Jun 06, 2015 5:23 am
for anyone interested, i have individually tested all scenarios with cuneiform and tesseract and the results that gave the best results for the 3 languages i use were:

tesseract ${fileIn} ${fileOut} -l eng+ger+ces

without the lang specification, tesseract did not give good results for Czech, although it did recognise German without the language addition.

I am having trouble with OCR for the Japanese language. I have placed jpn.traineddata in the tessdata folder. tesseract works great on the command line with -l eng+jpn. I have tried adding -l eng+jpn to the system.ocr configuration.

[Attachment: ocrconfig.png]

It only shows "T" in the NDC_TEXT_EXTRACTED column.

[Attachment: extrtxt.png]

What am I missing here? I am using CentOS 7, by the way. Thank you in advance.

EDIT: SOLVED! Apparently all I needed was the 3.04 version of jpn.traineddata for tesseract 3.04.

Posted: **Wed Oct 09, 2019 6:49 pm**

I suggest upgrading to tesseract version 4 in the future.

OCR in the Community version