• OCR in the Community version

  • We tried to make OpenKM as intuitive as possible, but advice is always welcome.
Forum rules: Before asking something, please check the documentation wiki or use the forum's search feature. And remember that we don't have a crystal ball and are not mind readers, so if you post about an issue, tell us which OpenKM version you are using, along with your browser and operating system versions. For more info, read How to Report Bugs Effectively.
 #39738  by gwaitsi
 
I would like to clarify my understanding, please.
There are no OCR options in the community version.
Is OCR text extraction used at all, then? Does it make any sense to install tesseract or cuneiform with the community version?
If it is worth installing, how does one check the quality of the text extraction, i.e. where can we find the extracted text?
 #39820  by jllort
 
The community version comes with OCR integration, but you must configure it: http://wiki.openkm.com/index.php/Third- ... ation:_OCR . How to enable it is explained here: http://wiki.openkm.com/index.php/Applic ... abling_OCR . My suggestion is to use the tesseract OCR engine.

To check the process, copy the UUID of some document (Properties tab), go to Administration -> Database query and execute the following (with jdbc selected at the bottom right):
SELECT * FROM OKM_NODE_DOCUMENT WHERE NBS_UUID='the uuid'
The field NDC_TEXT_EXTRACTED='T' indicates that the document has been processed by the text extraction queue (keep in mind that documents are not processed immediately: they go into a queue and are then processed sequentially to extract their text).
The field NDC_TEXT holds the extracted content.

I hope this is enough to get you started.
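For anyone who wants to script this check rather than use the admin screen, here is a minimal Python sketch of the same query. It runs against an in-memory SQLite stand-in table, not a real OpenKM database: the three columns are the ones named above, while the table layout, the UUID values and the `extraction_status` helper are assumptions made up for illustration.

```python
import sqlite3

# Query from the post above, with the UUID passed as a bound parameter.
QUERY = "SELECT * FROM OKM_NODE_DOCUMENT WHERE NBS_UUID=?"


def extraction_status(conn, uuid):
    """Return (processed, text) for one document, or None if the UUID is unknown."""
    row = conn.execute(QUERY, (uuid,)).fetchone()
    if row is None:
        return None
    # 'T' means the extraction queue has already handled this document.
    return row["NDC_TEXT_EXTRACTED"] == "T", row["NDC_TEXT"]


# Stand-in table for illustration only; a real OpenKM schema has more columns.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # allow access to columns by name
conn.execute(
    "CREATE TABLE OKM_NODE_DOCUMENT ("
    "NBS_UUID TEXT, NDC_TEXT_EXTRACTED TEXT, NDC_TEXT TEXT)"
)
conn.executemany(
    "INSERT INTO OKM_NODE_DOCUMENT VALUES (?, ?, ?)",
    [
        ("hypothetical-uuid-1", "T", "scanned invoice text"),  # processed
        ("hypothetical-uuid-2", "F", None),                    # still queued
    ],
)

print(extraction_status(conn, "hypothetical-uuid-1"))
```

Against a real installation you would swap the SQLite connection for a driver that matches your database; the check itself (flag equal to 'T', then look at the text) stays the same.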
 #39823  by gwaitsi
 
Hi jllort,

That is perfect, thanks. I got the Java settings working, so it seems stable, and I have tried both tesseract and cuneiform.

I found that cuneiform processes faster than tesseract. Also, the only problems I am getting with cuneiform are with documents from Windows 95 and older, i.e. Word 95 and Excel 95 files not being recognised.

With tesseract under Debian Wheezy I am getting all sorts of other errors, and conversion is very slow.
I noticed two things in the temp folder: PDF files are being converted to JPG, and most output files have no extension, i.e. no .TXT.

But your info helped me work it out, and once I have everything working well, I will redo the text extraction.
 #39824  by gwaitsi
 
For anyone interested, I have individually tested all scenarios with cuneiform and tesseract, and the setting that gave the best results for the 3 languages I use was:

tesseract ${fileIn} ${fileOut} -l eng+ger+ces

Without the language specification, tesseract did not give good results for Czech, although it did recognise German without the language addition.

The cuneiform results were rubbish. It took half the time to process, but only because the amount of text it produced was a fraction of what tesseract produced, and that text was not usable.
I concur with jllort: tesseract is the best one to use.
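To try a setting like this by hand before putting it into the configuration, the template can be expanded outside OpenKM. The sketch below is only an approximation: how OpenKM itself substitutes `${fileIn}` and `${fileOut}` is not shown in this thread, and the `build_ocr_command` helper and the file paths are hypothetical.

```python
import shlex
from string import Template


def build_ocr_command(template, file_in, file_out):
    """Expand ${fileIn}/${fileOut} and split the result into an argv list."""
    expanded = Template(template).substitute(
        fileIn=shlex.quote(file_in),    # quoting keeps paths with spaces intact
        fileOut=shlex.quote(file_out),
    )
    return shlex.split(expanded)


# Template copied from the post above; the paths are made up for the example.
cmd = build_ocr_command(
    "tesseract ${fileIn} ${fileOut} -l eng+ger+ces",
    "/tmp/scan 1.jpg",
    "/tmp/out",
)
print(cmd)
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where tesseract is installed. Note that tesseract treats the second argument as an output base name and writes its plain-text result to that name plus `.txt`.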
 #39834  by jllort
 
The Word 95 and Excel 95 issue could be a problem with the MSOfficeTextExtractor class. Could you test on our demo.openkm.com and also share here some documents that are not indexed correctly (for testing in one of our development environments)?

3-4 years ago, cuneiform was a lot better than tesseract. But it seems Google has spent more time on that project, and the Cognitive Forms guys - http://cognitiveforms.com/ - have frozen the cuneiform community version. Anyway, if you need specialized OCR, the Cognitive Forms guys can build a customized cuneiform on demand for your needs. I think this is something that should be made known: these people really have very good commercial tools, and cuneiform can go much further than the community edition does.
 #45550  by hus.alkahli
 
I am hoping this is in the right place.

I've installed OpenKM Community Version: 6.3.6 (build: 87d181f) under Windows 7 and installed tesseract 3.05.01.

Below is the configuration that I set in the administration tab in OpenKM.
system.ocr=c:\Tesseract-ocr-3.0.2\tesseract.exe ${fileIn} ${fileOut}
system.ocr.rotate= 90;180;270; 
system.pdf.force.ocr=TRUE
Now I need to know how to use OCR, because I couldn't find the OCR module (icon) in the administration tab.

I've watched some videos on YouTube and found that there is a module (icon), as in the screenshot below.

Image

But I don't have it in my OpenKM, as the screenshot below shows. How do I add it to my system?

Image

Please help, and thanks in advance.
 #45579  by jllort
 
The OCR engine works in both the community and professional editions, but Zonal OCR is only available in the professional edition.
 #49038  by siegfredev
 
gwaitsi wrote: Sat Jun 06, 2015 5:23 am for anyone interested, i have individually tested all scenarios with cuneiform and tesseract and the results that gave the best results for the 3 languages i use were:

tesseract ${fileIn} ${fileOut} -l eng+ger+ces

without the lang specification, tesseract did not give good results for Czech, although it did recognise German without the language addition.
I am having trouble with OCR for the Japanese language. I have placed jpn.traineddata in the tessdata folder. tesseract works great on the command line with -l eng+jpn. I have tried adding -l eng+jpn to the system.ocr configuration.
ocrconfig.png

It only shows "T" in the NDC_TEXT_EXTRACTED column.
extrtxt.png

What am I missing here? I am using CentOS 7 btw. Thank you in advance.


EDIT: SOLVED! Apparently all I needed was the 3.04 version of jpn.traineddata for tesseract 3.04.
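
A related sanity check is to compare the codes in the `-l` option with what the installed tesseract reports for `tesseract --list-langs`. The sketch below parses that output; the sample text, its header wording and both helper names are assumptions for illustration, and depending on the tesseract version the list may be printed on stdout or stderr.

```python
def parse_list_langs(output):
    """Parse the text printed by `tesseract --list-langs` into a set of codes."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # The first non-empty line is a header, e.g. "List of available languages (3):".
    return set(lines[1:])


def missing_languages(lang_spec, available):
    """Return the codes from a '-l' spec (e.g. 'eng+jpn') that are not installed."""
    return [code for code in lang_spec.split("+") if code not in available]


# Illustrative sample only; real output depends on the installed data files.
sample = "List of available languages (3):\neng\njpn\nosd\n"
available = parse_list_langs(sample)

print(missing_languages("eng+jpn", available))
print(missing_languages("eng+ger+ces", available))
```

This only confirms that a data file with that name is present. It would not have caught a version mismatch like the one above, so it is also worth checking `tesseract --version` against the release the traineddata files were built for. With this sample, `ger` is reported as missing; as far as I know the stock German data file is named deu.traineddata, so a spec containing `ger` only works if a file with that name has been installed.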
 #49063  by jllort
 
I suggest upgrading to tesseract version 4 in the future.
