Open Source Document Management System | OpenKM - Unable to invoke tesseract OCR 3.x

Unable to invoke tesseract OCR 3.x

#20639 by venu.vijayagiri09
Wed Jan 09, 2013 7:07 am

Hi,
I am running OpenKM 6.2 on Windows.
I downloaded Tesseract 3.01 to extract text from images. It works fine from the command line.
I set the system.ocr property to "D:\Venu\Tesseract Images\Tesseract-OCR\tesseract.exe ${fileIn} ${fileOut}",
but it does not work from OpenKM.

When I uploaded an image file, I found the following log messages:

Code: Select all

2013-01-09 12:30:51,351 [Text Extractor Worker] INFO  com.openkm.d.t - processSerial.Working on {docUuid=2183568a-a1ab-4265-9ee3-933cf4ce8c99, docPath=/okm:root/Invoice/test_text.png, docVerUuid=d4635339-cbfe-4697-869d-1208dcfa9018, date=Wed Jan 09 12:28:45 IST 2013}
2013-01-09 12:30:51,356 [Text Extractor Worker] WARN  com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/Invoice/test_text.png': Too few text extracted

Can you tell me exactly where the extracted text is stored (the path of the files)?

What steps do we need to follow to use the Tesseract text extractor on images?

Username

venu.vijayagiri09

Rank

Fresh Boarder

Posts

14

Joined

Mon Jan 07, 2013 4:33 am

Re: Unable to invoke tesseract OCR 3.x

#20676 by jllort
Wed Jan 09, 2013 7:29 pm

Depending on the image type, you need to install ImageMagick (Tesseract needs it). I suggest you try Tesseract from the command line with the same image. Remember that OpenKM has an extraction queue (you can see it at Administration -> Stats -> Extraction queue), and if you run a database select you can see the extracted text in the table OKM_NODE_BASE.
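
The command-line check suggested above could be scripted roughly as follows. This is only a sketch: the helper names `build_tesseract_command` and `run_ocr` are made up for illustration, and it assumes the usual Tesseract invocation `tesseract <image> <output base>`, which writes the recognised text to `<output base>.txt`.

```python
import subprocess
from pathlib import Path
from typing import List


def build_tesseract_command(exe: str, image: str, out_base: str) -> List[str]:
    """Return the argument list for: tesseract <image> <output base>."""
    # Hypothetical helper: each argument is its own list item.
    return [exe, image, out_base]


def run_ocr(exe: str, image: str, out_base: str) -> str:
    """Run Tesseract on one image and return the recognised text."""
    # check=True raises CalledProcessError if Tesseract exits with an error.
    subprocess.run(build_tesseract_command(exe, image, out_base), check=True)
    # Tesseract appends ".txt" to the output base name.
    return Path(out_base + ".txt").read_text(encoding="utf-8")
```

For example, `run_ocr(r"D:\Venu\Tesseract Images\Tesseract-OCR\tesseract.exe", "test_text.png", "out")` would return the contents of out.txt, provided Tesseract is installed at that path. If this returns text but OpenKM still logs "Too few text extracted", the problem is more likely in the OpenKM configuration than in Tesseract itself.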

Username

jllort

Rank

Moderator

Posts

12187

Joined

Fri Dec 21, 2007 11:23 am

Location

Sineu - ( Illes Balears ) - Spain

Contact

Re: Unable to invoke tesseract OCR 3.x

#20684 by venu.vijayagiri09
Thu Jan 10, 2013 5:13 am

Thanks for your reply.

I tried the same image from the command line and was able to get the output.
But when I try to open the Stats and Scripts tabs, the following exception is shown.

Attachment: stats.png (94.09 KiB)

Re: Unable to invoke tesseract OCR 3.x

#20700 by okmuser
Fri Jan 11, 2013 5:36 am

You also have to change the value in the administration configuration to use tesseract3.ocr.

Please see the picture below.

Attachment: Untitled.jpg (79.4 KiB)

Username

okmuser

Rank

Expert Boarder

Posts

123

Joined

Fri Dec 16, 2011 1:25 pm

Re: Unable to invoke tesseract OCR 3.x

#20704 by venu.vijayagiri09
Fri Jan 11, 2013 10:24 am

Yes, I changed the value and tried the following system.ocr property values:

D:\Venu\Tesseract Images\Tesseract-OCR\tesseract.exe ${fileIn} ${fileOut}
D:\Venu\Tesseract Images\Tesseract-OCR\tesseract.exe

It shows an error in the path; a screenshot is attached.

When I uploaded sample .tiff and .png images, I found the following in the log:

Code: Select all

2013-01-11 15:27:14,393 [Text Extractor Worker] INFO  com.openkm.d.t - processSerial.Working on {docUuid=77f03c6d-46a4-480f-a21d-07375e54d0df, docPath=/okm:root/del/document_example.tiff, docVerUuid=aaab8558-ce3d-428a-b856-f03be604874d, date=Fri Jan 11 15:26:58 IST 2013}
2013-01-11 15:27:14,398 [Text Extractor Worker] WARN  com.openkm.d.m - Text extraction failure: Full text indexing of 'image/tiff' is not supported
2013-01-11 15:27:14,400 [Text Extractor Worker] WARN  com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/del/document_example.tiff': Full text indexing of 'image/tiff' is not supported

2013-01-11 15:44:14,398 [Text Extractor Worker] INFO  com.openkm.d.t - processSerial.Working on {docUuid=6d89bbaa-eb5e-4870-b94a-884ffd47feba, docPath=/okm:root/Purchase Order/test_text.png, docVerUuid=384e4825-e11e-4355-9afd-ad982858facf, date=Fri Jan 11 15:43:40 IST 2013}
2013-01-11 15:44:14,403 [Text Extractor Worker] WARN  com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/Purchase Order/test_text.png': Too few text extracted

I tried these two images with Tesseract from the command line and I can see the output (although it is not accurate).

I also tried the CuneiForm open OCR with each of these values:
system.ocr="D:\Venu\CuneiForm"
system.ocr="D:\Venu\CuneiForm\face.exe"
system.ocr="D:\Venu\CuneiForm\Face.exe ${fileIn} -o ${fileOut}"

The extractor value I used for this is "com.openkm.extractor.CuneiformTextExtractor".

It validates the path correctly, but I cannot find anything in the log.

Is there any way to see the extracted text (for example, as extracted .txt files)? Also, my Stats tab is still not working, as shown in the attachment in my previous post.

Attachment: tesseract_path.png (111.92 KiB)

Re: Unable to invoke tesseract OCR 3.x

#20711 by okmuser
Sat Jan 12, 2013 4:22 am

The issue is the space in the path.

Try removing the space from the path (for example, rename "Tesseract Images" to "Tesseract_Images" or similar).
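
To see why a space in the path can matter: this thread does not show how OpenKM turns the system.ocr value into a command, but if the value is split on whitespace (an assumption, not confirmed here), the executable name is cut off at the first space. A small Python sketch of that effect, where `executable_token` is a hypothetical helper:

```python
# The two values below are the path from this thread and the renamed variant.
WITH_SPACE = r"D:\Venu\Tesseract Images\Tesseract-OCR\tesseract.exe ${fileIn} ${fileOut}"
NO_SPACE = r"D:\Venu\Tesseract_Images\Tesseract-OCR\tesseract.exe ${fileIn} ${fileOut}"


def executable_token(command_template: str) -> str:
    """Return the first whitespace-separated token of the command template.

    Assumption: the command is tokenised by splitting on whitespace.
    """
    return command_template.split()[0]


if __name__ == "__main__":
    # With the space, the "executable" is only the part before "Images".
    print(executable_token(WITH_SPACE))
    # Without the space, the full path to tesseract.exe survives.
    print(executable_token(NO_SPACE))
```

Under that assumption, the first value yields `D:\Venu\Tesseract` as the program to run, which does not exist, while the renamed path yields the full path to tesseract.exe.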

Re: Unable to invoke tesseract OCR 3.x

#20730 by jllort
Sat Jan 12, 2013 6:06 pm

I'm not sure whether CuneiForm can be executed from the terminal. When you run it, a Windows application always appears, so I'm not sure it can be run purely as a terminal command.

Re: Unable to invoke tesseract OCR 3.x

#20748 by venu.vijayagiri09
Tue Jan 15, 2013 5:38 am

Thanks for your reply.

Now the path is detected correctly, but I don't know how to check whether Tesseract is actually working in OpenKM (I have included com.openkm.extractor.Tesseract3TextExtractor).

I uploaded a sample .tiff document to see the output and found the following lines in the Tomcat log:

Code: Select all

2013-01-15 10:44:08,334 [Text Extractor Worker] INFO  com.openkm.d.t - processSerial.Working on {docUuid=fae64e67-b4dc-422b-bd12-a087a7faa6f1, docPath=/okm:root/Invoice/document_example.tiff, docVerUuid=42c2494a-347f-4917-b888-672c582d907d, date=Tue Jan 15 10:44:04 IST 2013}
2013-01-15 10:44:10,715 [Text Extractor Worker] WARN  com.openkm.d.m - Text extraction failure: Full text indexing of 'image/tiff' is not supported
2013-01-15 10:44:10,715 [Text Extractor Worker] WARN  com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/Invoice/document_example.tiff': Full text indexing of 'image/tiff' is not supported.

When I tried with a .png image, I found the following lines in the log:

Code: Select all

2013-01-15 11:05:08,476 [Text Extractor Worker] INFO  com.openkm.d.t - processSerial.Working on {docUuid=3efb3111-dfe8-40b3-ba2e-633313c360a0, docPath=/okm:root/Invoice/a-plaintext01.png, docVerUuid=ebf0c995-2767-421a-a91f-125602991bc3, date=Tue Jan 15 11:04:59 IST 2013}
2013-01-15 11:05:08,476 [Text Extractor Worker] WARN  com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/Invoice/a-plaintext01.png': Too few text extracted
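
When several documents are uploaded, it can help to pull just the failure reasons out of the log. The following Python sketch does that for the NodeDocumentDAO warning format quoted in this thread; the function name `parse_failures` and the regular expression are illustrative, and other OpenKM versions may word the message differently.

```python
import re
from typing import Iterable, List, Tuple

# Matches: There was a problem extracting text from '<path>': <reason>
# A trailing full stop after the reason, if present, is dropped.
FAILURE = re.compile(
    r"There was a problem extracting text from '(?P<path>[^']+)': (?P<reason>.+?)\.?$"
)


def parse_failures(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (document path, failure reason) for each matching log line."""
    failures = []
    for line in lines:
        match = FAILURE.search(line.strip())
        if match:
            failures.append((match.group("path"), match.group("reason")))
    return failures
```

Calling `parse_failures(open("catalina.log"))` (the log file name is an assumption) returns one pair per failed document, which makes it easy to separate "Too few text extracted" cases from "Full text indexing of '...' is not supported" cases.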

Yes, jllort,

I ran it from the CuneiForm Windows application (not from the terminal) and it works fine.

Re: Unable to invoke tesseract OCR 3.x

#20789 by jllort
Wed Jan 16, 2013 10:35 pm

Your answer is not clear to me. The question is: have you tested Tesseract from the command line? First we need to confirm that everything works from the command line. Then we can focus on OpenKM.

Re: Unable to invoke tesseract OCR 3.x

#20795 by venu.vijayagiri09
Thu Jan 17, 2013 4:54 am

Hi jllort,

Yes, I have tested the images with Tesseract from the command line and it works fine.

Re: Unable to invoke tesseract OCR 3.x

#20965 by jllort
Fri Jan 18, 2013 7:12 pm

Go to the database query tool and execute:

Code: Select all

SELECT NDC_TEXT FROM OKM_NODE_DOCUMENT where NBS_UUID='fae64e67-b4dc-422b-bd12-a087a7faa6f1';

What is the value of NDC_TEXT?
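
The same check can be scripted. The sketch below uses an in-memory SQLite table only as a stand-in so that it runs anywhere; the real OKM_NODE_DOCUMENT table lives in whatever database OpenKM is configured with and has more columns. The function name `extracted_text_status` and the example rows are made up for illustration.

```python
import sqlite3


def extracted_text_status(conn: sqlite3.Connection, uuid: str) -> str:
    """Report whether a document row has extracted text in NDC_TEXT."""
    row = conn.execute(
        "SELECT NDC_TEXT FROM OKM_NODE_DOCUMENT WHERE NBS_UUID = ?", (uuid,)
    ).fetchone()
    if row is None:
        return "no such document"
    if row[0] is None or not row[0].strip():
        return "no text extracted"
    return "text extracted"


# Stand-in table with hypothetical rows (not data from the thread).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE OKM_NODE_DOCUMENT (NBS_UUID TEXT PRIMARY KEY, NDC_TEXT TEXT)"
)
conn.executemany(
    "INSERT INTO OKM_NODE_DOCUMENT VALUES (?, ?)",
    [("example-uuid-1", None), ("example-uuid-2", "Invoice 2013")],
)
```

A NULL or empty NDC_TEXT for an uploaded image means the extractor stored nothing for it, which matches the "Too few text extracted" and "not supported" warnings in the log.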

Re: Unable to invoke tesseract OCR 3.x

#20981 by venu.vijayagiri09
Sun Jan 20, 2013 6:29 am

Thanks, jllort.

Now I can see the extracted content in the tables. I also tried OpenKM 5.1.10 on Windows 7 to test the OCR functionality with Tesseract. It works fine there, and I can see the extracted text files with their content.

But I want to see the text extracted from the images (tiff/png/pdf). I checked the table rows for these images, and their NDC_TEXT values are null.
I ran Tesseract from the command line on the same images I uploaded, and it works fine.
The log details:

Uploaded tiff

Code: Select all

2013-01-20 11:18:25,989 [Text Extractor Worker] WARN  com.openkm.d.m - Text extraction failure: Full text indexing of 'image/tiff' is not supported
2013-01-20 11:18:25,990 [Text Extractor Worker] WARN  com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/Images/Tiffs/CCITT_1.TIF': Full text indexing of 'image/tiff' is not supported

Uploaded png

Code: Select all

- Added field 'category' with value '5734b5e6-2a30-48f4-a864-ba175d0d1aaa'
2013-01-20 11:48:25,990 [Text Extractor Worker] INFO  com.openkm.d.t - processSerial.Working on {docUuid=dcb7e0b9-5100-455e-a2d9-61f6f3292875, docPath=/okm:root/a-plaintext01.png, docVerUuid=823bd7f0-0032-4301-93dc-d2f8af14e40c, date=Sun Jan 20 11:46:00 IST 2013}
2013-01-20 11:48:25,993 [Text Extractor Worker] WARN  com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/a-plaintext01.png': Too few text extracted

For your information, I executed a workflow and then deleted its process definition. That may be the reason for the WorkflowServlet exception in the PDF log below (I don't know why the workflow servlet runs automatically when a document is uploaded).

Uploaded pdf

Code: Select all

2013-01-20 11:52:14,374 [http-bio-0.0.0.0-8090-exec-6] ERROR com.openkm.servlet.frontend.WorkflowServlet -
java.lang.NullPointerException
        at com.openkm.h.a.d.a(Unknown Source)
        at com.openkm.h.b.r.getProcessDefinitionForms(Unknown Source)
2013-01-20 11:53:25,991 [Text Extractor Worker] INFO  com.openkm.d.t - processSerial.Working on {docUuid=cfe87cb6-9a1d-41fc-9235-f1f1fb0ad7b6, docPath=/okm:root/h6118-captiva-module-pdg.pdf, docVerUuid=c2c005da-8c99-4f62-b2c1-9175539cff64, date=Sun Jan 20 11:52:13 IST 2013}
2013-01-20 11:53:25,999 [Text Extractor Worker] WARN  com.openkm.d.m - Text extraction failure: Full text indexing of 'application/pdf' is not supported
2013-01-20 11:53:26,000 [Text Extractor Worker] WARN  com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/h6118-captiva-module-pdg.pdf': Full text indexing of 'application/pdf' is not supported

Can you tell me how to resolve this?

Re: Unable to invoke tesseract OCR 3.x

#20995 by jllort
Sun Jan 20, 2013 6:20 pm

Are you using 6.2.2 community or trial? I ask because log entries with package names like this one look strange to me:

com.openkm.d.m

Add this line to the log4j.properties file:
log4j.logger.com.openkm.extractor=DEBUG

Re: Unable to invoke tesseract OCR 3.x

#21013 by venu.vijayagiri09
Tue Jan 22, 2013 4:33 am

I am using the OpenKM 6.2.4 trial version on Windows 7. I added the line to the log4j.properties file, but nothing new appeared in the log.

Re: Unable to invoke tesseract OCR 3.x

#21060 by jllort
Thu Jan 24, 2013 4:06 pm

I do not know what the cause could be, but rest assured that it does run correctly on Windows, because we have some customers on Windows with Tesseract OCR configured. Anyway, if this test is critical for you, I suggest you contact us at http://www.openkm.com/en/contact.html and we'll try to help you in a more direct way (please include the URL of this post).
