• Unable to invoke tesseract OCR 3.x

  • OpenKM has many interesting features, but requires some configuration process to show its full potential.
 #20639  by venu.vijayagiri09
 
Hi,
I am working with OpenKM 6.2 on Windows.
I have downloaded Tesseract 3.01 to extract text from images. It works fine from the command line.
I configured the system.ocr property as "D:\Venu\Tesseract Images\Tesseract-OCR\tesseract.exe ${fileIn} ${fileOut}"
but it is not working.

When I uploaded the image file, I found the log messages below:
Code:
2013-01-09 12:30:51,351 [Text Extractor Worker] INFO  com.openkm.d.t - processSerial.Working on {docUuid=2183568a-a1ab-4265-9ee3-933cf4ce8c99, docPath=/okm:root/Invoice/test_text.png, docVerUuid=d4635339-cbfe-4697-869d-1208dcfa9018, date=Wed Jan 09 12:28:45 IST 2013}
2013-01-09 12:30:51,356 [Text Extractor Worker] WARN  com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/Invoice/test_text.png': Too few text extracted
Can you tell me the path of the files where the extracted text is stored?

What steps do we need to follow to use the Tesseract text extractor on images?
 #20676  by jllort
 
Depending on the image type, you may need to install ImageMagick (Tesseract needs it). I suggest you try Tesseract from the command line with the same image. Remember that OpenKM has an extraction queue (you can see it at Administration -> Stats -> Extraction queue), and if you run a database select you can see the extracted text in the table OKM_NODE_BASE.
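
A minimal sketch of that command-line check, driven from Python. Tesseract 3.x takes an image path and an output base name and writes the recognized text to <outputbase>.txt. The helper names build_tesseract_command and run_ocr are hypothetical, and any paths passed in are up to the local installation.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(exe, image, output_base):
    """Return the argument list for `tesseract <image> <outputbase>`.

    Passing a list (not one string) keeps paths with spaces intact.
    """
    return [str(exe), str(image), str(output_base)]


def run_ocr(exe, image, output_base):
    """Run Tesseract and return the text it wrote to <output_base>.txt."""
    # check=True raises CalledProcessError if Tesseract exits with an error.
    subprocess.run(build_tesseract_command(exe, image, output_base), check=True)
    # Tesseract appends ".txt" to the output base name by itself.
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")
```

Calling run_ocr with the tesseract.exe path, the same image that was uploaded to OpenKM, and an output base name should return the recognized text. If the returned text is empty or very short, the "Too few text extracted" warning may come from the image itself rather than from the OpenKM configuration.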
 #20684  by venu.vijayagiri09
 
Thanks for your reply.

I tried the same image from the command line and was able to get the output.
But when I tried to open the Stats and Scripts tabs, the following exception was shown.
Attachments
stats.png
 #20700  by okmuser
 
You also have to change the value in the administration configuration to use tesseract3.ocr.

Please see the picture below.
Attachments
Untitled.jpg
 #20704  by venu.vijayagiri09
 
Yes, I have changed the value and tried the following system.ocr property values:

D:\Venu\Tesseract Images\Tesseract-OCR\tesseract.exe ${fileIn} ${fileOut}
D:\Venu\Tesseract Images\Tesseract-OCR\tesseract.exe

It shows an error in the path; a screenshot is attached.

When I uploaded sample .tiff and .png images, I found the log below:
Code:
2013-01-11 15:27:14,393 [Text Extractor Worker] INFO  com.openkm.d.t - processSerial.Working on {docUuid=77f03c6d-46a4-480f-a21d-07375e54d0df, docPath=/okm:root/del/document_example.tiff, docVerUuid=aaab8558-ce3d-428a-b856-f03be604874d, date=Fri Jan 11 15:26:58 IST 2013}
2013-01-11 15:27:14,398 [Text Extractor Worker] WARN  com.openkm.d.m - Text extraction failure: Full text indexing of 'image/tiff' is not supported
2013-01-11 15:27:14,400 [Text Extractor Worker] WARN  com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/del/document_example.tiff': Full text indexing of 'image/tiff' is not supported

2013-01-11 15:44:14,398 [Text Extractor Worker] INFO  com.openkm.d.t - processSerial.Working on {docUuid=6d89bbaa-eb5e-4870-b94a-884ffd47feba, docPath=/okm:root/Purchase Order/test_text.png, docVerUuid=384e4825-e11e-4355-9afd-ad982858facf, date=Fri Jan 11 15:43:40 IST 2013}
2013-01-11 15:44:14,403 [Text Extractor Worker] WARN  com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/Purchase Order/test_text.png': Too few text extracted
I tried these two images from the Tesseract terminal and I am able to see the output (though not accurately).

I also tried to use the CuneiForm open OCR with
system.ocr="D:\Venu\CuneiForm"
system.ocr="D:\Venu\CuneiForm\face.exe"
system.ocr="D:\Venu\CuneiForm\Face.exe ${fileIn} -o ${fileOut}"

The extractor value I used for this is "com.openkm.extractor.CuneiformTextExtractor".

It validates the path correctly, but I cannot find anything in the log.

Is there any way to see the extracted text (for example, extracted .txt files)? Also, my Stats tab is not working, as shown in the attachment to my previous post.
Attachments
tesseract_path.png
 #20711  by okmuser
 
The issue is the space in the path.

Try removing the space from the path (for example, rename Tesseract Images to Tesseract_Images).
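
A small sketch of why a space can break such a setting. It assumes, purely for illustration, that the command template is split on whitespace before the executable is looked up; whether OpenKM does exactly this is not confirmed in this thread. The function names are hypothetical.

```python
def naive_tokens(command_template):
    """Split a command template on whitespace, as a simple launcher might."""
    return command_template.split()


def executable_of(command_template):
    """Return the token that such a launcher would treat as the program."""
    return naive_tokens(command_template)[0]


# The original setting: the path is cut at the space after "Tesseract".
with_space = r"D:\Venu\Tesseract Images\Tesseract-OCR\tesseract.exe ${fileIn} ${fileOut}"

# The same setting after renaming the folder: the full path survives.
without_space = r"D:\Venu\Tesseract_Images\Tesseract-OCR\tesseract.exe ${fileIn} ${fileOut}"

print(executable_of(with_space))
print(executable_of(without_space))
```

Under that assumption the first setting resolves to D:\Venu\Tesseract, which is not an executable, while the renamed folder yields the complete path to tesseract.exe.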
 #20730  by jllort
 
I'm not sure whether CuneiForm can be executed from a terminal. When you run it, a Windows application always appears, so I'm not sure it can be executed purely as a terminal command.
 #20748  by venu.vijayagiri09
 
Thanks for your reply.

The path is now detected correctly, but I don't know how to check whether Tesseract is actually working in OpenKM (I have included com.openkm.extractor.Tesseract3TextExtractor).

I uploaded a sample .tiff document to see the output and found the following lines in the Tomcat log.
Code:
2013-01-15 10:44:08,334 [Text Extractor Worker] INFO  com.openkm.d.t - processSerial.Working on {docUuid=fae64e67-b4dc-422b-bd12-a087a7faa6f1, docPath=/okm:root/Invoice/document_example.tiff, docVerUuid=42c2494a-347f-4917-b888-672c582d907d, date=Tue Jan 15 10:44:04 IST 2013}
2013-01-15 10:44:10,715 [Text Extractor Worker] WARN  com.openkm.d.m - Text extraction failure: Full text indexing of 'image/tiff' is not supported
2013-01-15 10:44:10,715 [Text Extractor Worker] WARN  com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/Invoice/document_example.tiff': Full text indexing of 'image/tiff' is not supported.
When I tried a .png image, I found the following lines in the log.
Code:
2013-01-15 11:05:08,476 [Text Extractor Worker] INFO  com.openkm.d.t - processSerial.Working on {docUuid=3efb3111-dfe8-40b3-ba2e-633313c360a0, docPath=/okm:root/Invoice/a-plaintext01.png, docVerUuid=ebf0c995-2767-421a-a91f-125602991bc3, date=Tue Jan 15 11:04:59 IST 2013}
2013-01-15 11:05:08,476 [Text Extractor Worker] WARN  com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/Invoice/a-plaintext01.png': Too few text extracted
Yes jllort,

I ran it from the CuneiForm Windows application (not from the terminal) and it works fine.
 #20789  by jllort
 
Your answer is not clear to me. The question is: have you tested Tesseract from the command line? First we need to confirm that everything works from the command line; then we can focus on OpenKM.
 #20965  by jllort
 
Go to the database query tool and execute:
Code:
SELECT NDC_TEXT FROM OKM_NODE_DOCUMENT WHERE NBS_UUID='fae64e67-b4dc-422b-bd12-a087a7faa6f1';
What is the value of NDC_TEXT?
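
The same lookup can be sketched as a parameterized query from Python. This is only an illustration: it uses an in-memory SQLite table as a stand-in that has just the two columns named in the query above, not the real OpenKM schema, and the "?" placeholder style is SQLite's (other database drivers may use a different one). The function names are hypothetical.

```python
import sqlite3

QUERY = "SELECT NDC_TEXT FROM OKM_NODE_DOCUMENT WHERE NBS_UUID = ?"


def extracted_text(conn, uuid):
    """Return NDC_TEXT for a document UUID, or None if the row is missing or NULL."""
    cur = conn.cursor()
    cur.execute(QUERY, (uuid,))
    row = cur.fetchone()
    return row[0] if row else None


def demo_connection():
    """Build an in-memory stand-in table; NOT the real OpenKM schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE OKM_NODE_DOCUMENT (NBS_UUID TEXT PRIMARY KEY, NDC_TEXT TEXT)"
    )
    conn.execute(
        "INSERT INTO OKM_NODE_DOCUMENT VALUES (?, ?)", ("uuid-with-text", "hello")
    )
    conn.execute(
        "INSERT INTO OKM_NODE_DOCUMENT VALUES (?, ?)", ("uuid-without-text", None)
    )
    return conn
```

A None result for a document that has already left the extraction queue suggests that no text was stored for it.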
 #20981  by venu.vijayagiri09
 
Thanks jllort,

Now I am able to see the extracted content in the tables. I also tried OpenKM 5.1.10 on Windows 7 to test the OCR functionality with Tesseract; it works fine, and
I can see the extracted text files with their content.

But I want to see the extracted text from images (tiff/png/pdf). I checked the table rows for the images, and their NDC_TEXT values are null.
I tried the same uploaded images from the Tesseract terminal and it works fine.
The log details:

Uploaded tiff
Code:
2013-01-20 11:18:25,989 [Text Extractor Worker] WARN  com.openkm.d.m - Text extraction failure: Full text indexing of 'image/tiff' is not supported
2013-01-20 11:18:25,990 [Text Extractor Worker] WARN  com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/Images/Tiffs/CCITT_1.TIF': Full text indexing of 'image/tiff' is not supported
Uploaded png
Code:
- Added field 'category' with value '5734b5e6-2a30-48f4-a864-ba175d0d1aaa'
2013-01-20 11:48:25,990 [Text Extractor Worker] INFO  com.openkm.d.t - processSerial.Working on {docUuid=dcb7e0b9-5100-455e-a2d9-61f6f3292875, docPath=/okm:root/a-plaintext01.png, docVerUuid=823bd7f0-0032-4301-93dc-d2f8af14e40c, date=Sun Jan 20 11:46:00 IST 2013}
2013-01-20 11:48:25,993 [Text Extractor Worker] WARN  com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/a-plaintext01.png': Too few text extracted
For your information, I had executed a workflow and then deleted its process definition; that may be the reason for the WorkflowServlet exception (I don't know why the workflow servlet runs automatically when a document is uploaded).

Uploaded pdf
Code:
2013-01-20 11:52:14,374 [http-bio-0.0.0.0-8090-exec-6] ERROR com.openkm.servlet.frontend.WorkflowServlet -
java.lang.NullPointerException
        at com.openkm.h.a.d.a(Unknown Source)
        at com.openkm.h.b.r.getProcessDefinitionForms(Unknown Source)
2013-01-20 11:53:25,991 [Text Extractor Worker] INFO  com.openkm.d.t - processSerial.Working on {docUuid=cfe87cb6-9a1d-41fc-9235-f1f1fb0ad7b6, docPath=/okm:root/h6118-captiva-module-pdg.pdf, docVerUuid=c2c005da-8c99-4f62-b2c1-9175539cff64, date=Sun Jan 20 11:52:13 IST 2013}
2013-01-20 11:53:25,999 [Text Extractor Worker] WARN  com.openkm.d.m - Text extraction failure: Full text indexing of 'application/pdf' is not supported
2013-01-20 11:53:26,000 [Text Extractor Worker] WARN  com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/h6118-captiva-module-pdg.pdf': Full text indexing of 'application/pdf' is not supported
Can you tell me how to resolve this?
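
To compare failures across uploads, a short sketch that pulls the document path and the reported reason out of the NodeDocumentDAO warnings. The regular expression is derived only from the log lines quoted in this thread, so it is an assumption about the message format, and the function name failure_reasons is hypothetical.

```python
import re

# Matches: There was a problem extracting text from '<path>': <reason>
PROBLEM = re.compile(
    r"There was a problem extracting text from '(?P<path>[^']+)': (?P<reason>.+?)\.?$"
)


def failure_reasons(log_lines):
    """Map each failing document path to the reason reported in the log."""
    reasons = {}
    for line in log_lines:
        match = PROBLEM.search(line.strip())
        if match:
            # A later warning for the same path overwrites the earlier one.
            reasons[match.group("path")] = match.group("reason")
    return reasons
```

Feeding it the lines of the Tomcat log separates the two failure types seen here: "Full text indexing of '...' is not supported" for tiff and pdf, and "Too few text extracted" for png.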
 #20995  by jllort
 
Are you using 6.2.2 Community or a trial version? Logs with package names like this look strange to me:
com.openkm.d.m
In the log4j.properties file, add this line:
log4j.logger.com.openkm.extractor=DEBUG
 #21060  by jllort
 
I do not know what the cause could be, but rest assured that it runs correctly on Windows, because we have some customers on Windows with Tesseract OCR configured. Anyway, if this test is critical for you, I suggest contacting us at http://www.openkm.com/en/contact.html and we'll try to help you more directly (include the URL of this post).
