Open Source Document Management System | OpenKM

Archiving bills


#28948 by meinereiner
Sun Jun 15, 2014 6:25 pm

Hello,

I'm looking for a solution/system that can archive scanned bills with as much automation as possible.
In detail, I would like the following:

When an invoice that needs to be archived (it can come from anywhere and cannot be processed with a template) is scanned and placed on a network share, a system (OpenKM) runs OCR on the file and prefills some information. When a user then logs in to the system, they should be able to see the document and process it further. Even if no tagging or other processing has been done, it should still be possible to find the document again with a text search.

I've read that OCR is possible, but how can the process described above be automated? Can someone give me some hints on which functions I need to look at? How is OCR run automatically?

Thanks in advance for your replies!

meinereiner

Username

meinereiner

Rank

Fresh Boarder

Posts

Joined

Sun Jun 15, 2014 6:13 pm

Re: Archiving bills

#28962 by jllort
Tue Jun 17, 2014 6:49 pm

Do you want to catalog invoices by evaluating the full OCR text output? Is that it?

Username

jllort

Rank

Moderator

Posts

12160

Joined

Fri Dec 21, 2007 11:23 am

Location

Sineu - ( Illes Balears ) - Spain

Contact

Re: Archiving bills

#28965 by meinereiner
Tue Jun 17, 2014 7:44 pm

In short: yes!
I want to be able to run a full-text search after scanning and find that text in a scanned document. I also do not want to have to process every scanned document manually before I can do that.

Re: Archiving bills

#28971 by jllort
Wed Jun 18, 2014 3:36 pm

Here you have two options, both based on automation events:
1- When a document is created or updated, force the OCR text extractor and then analyze the contents.
2- After the document has been processed in the queue, do the analysis.

Which one would you like me to explain?

Re: Archiving bills

#28975 by meinereiner
Wed Jun 18, 2014 9:01 pm

I assume that processing in a queue does not mean automatically importing documents from a folder, or am I wrong?

For a first test I would create a document manually, so it would be great if you could explain what I have to do to get option 1 working.

Thanks in advance for your support!

Re: Archiving bills

#28983 by jllort
Thu Jun 19, 2014 2:59 pm

I will try to explain it better. Any document uploaded (created in OpenKM) goes into the "text extraction queue" (the queue of documents pending indexing, i.e. text extraction). This queue is processed by a crontab task (every 5 minutes). At this point you have two options:
1- After a document has been processed from the "text extraction queue", OpenKM sends a signal (an automation event), and you then have the extracted text available to apply some logic, etc.
2- You can force text extraction to run immediately each time a document is created (that means the upload will take some extra time to finish), but you will be able to evaluate the extracted text immediately.

I hope the two scenarios are clearer now. Which one would you like me to explain in more depth?
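
To make the difference between the two scenarios concrete, here is a toy model in plain Java. It is only a sketch: none of the class or method names below exist in OpenKM, and "analyze" merely records the document name where a real automation action would evaluate the extracted text.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Toy model of the two scenarios. These are NOT OpenKM classes. */
final class ExtractionModes {
    private final Queue<String> pending = new ArrayDeque<>();
    private final List<String> analyzed = new ArrayList<>();

    /** Scenario 1: the upload returns at once and the document waits in the queue. */
    void uploadQueued(String docName) {
        pending.add(docName);
    }

    /** Scenario 2: extraction and analysis run inside the upload, which takes longer. */
    void uploadImmediate(String docName) {
        analyze(docName);
    }

    /** Stands in for the periodic task that drains the queue and then fires the event. */
    void runScheduledTask() {
        while (!pending.isEmpty()) {
            analyze(pending.poll());
        }
    }

    /** Placeholder for the logic applied to the extracted text. */
    private void analyze(String docName) {
        analyzed.add(docName);
    }

    List<String> analyzed() {
        return analyzed;
    }

    int pendingCount() {
        return pending.size();
    }

    public static void main(String[] args) {
        ExtractionModes modes = new ExtractionModes();
        modes.uploadQueued("a.pdf");     // not analyzed yet
        modes.uploadImmediate("b.pdf");  // analyzed right away
        System.out.println("before task: " + modes.analyzed());
        modes.runScheduledTask();        // now the queued document is analyzed too
        System.out.println("after task:  " + modes.analyzed());
    }
}
```

The trade-off is the one described above: with the queue, uploads stay fast but the text is only available after the next scheduled run; with immediate extraction, each upload pays the OCR cost up front.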

Re: Archiving bills

#28988 by meinereiner
Thu Jun 19, 2014 7:48 pm

Thanks for the explanation.
I'm looking for option 2.

After you told me that at least some things should already work with OCR, I checked the log files again and found that the automatic process is already triggered after 5 minutes.
The log always reports "Too few text extracted" after processing, although the files are available in high resolution. When I do this manually, I can see a lot of text being extracted perfectly.

Can you tell me how option 2 can be set up and how I can improve the text recognition?

Re: Archiving bills

#28994 by jllort
Fri Jun 20, 2014 10:07 am

- Take a look here at how to enable automation: http://wiki.openkm.com/index.php/Automation
- You must create an Automation Action. For that, the best approach is to set up a development IDE: http://wiki.openkm.com/index.php/Developer_Guide
- Take a look at the classes in com.openkm.automation.action, and here for how to create your own classes: http://wiki.openkm.com/index.php/Extend_automation_6.4

In the class you must:
1- Get the object uuid:

Code: Select all

String uuid = AutomationUtils.getUuid(env);

2- Test whether it is a document:

Code: Select all

if (OKMDocument.getInstance().isValid(null, uuid)) {
    // The node is a document: the OCR code from step 3 goes here
}

3- Force OCR:

Code: Select all

// Get path
String docPath = OKMRepository.getInstance().getNodePath(null, uuid);

// Get doc version uuid
NodeDocumentVersion currentVersion = NodeDocumentVersionDAO.getInstance().findCurrentVersion(uuid);
String docVerUuid = currentVersion.getUuid();

// Document extractor
TextExtractorWork tew = new TextExtractorWork();
tew.setDocUuid(uuid);
tew.setDocPath(docPath);
tew.setDocVerUuid(docVerUuid);

// Execute extractor
NodeDocumentDAO.getInstance().textExtractorHelper(tew);

// Get extracted text, lowercased so later matching is case-insensitive
NodeDocument docNode = NodeDocumentDAO.getInstance().findByPk(uuid);
String text = docNode.getText();
if (text == null) {
    text = "";
} else {
    text = text.toLowerCase();
}
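
Once you have the lowercased text, the "analyze contents" part is up to you. As a rough illustration only, the plain Java sketch below looks for an invoice keyword and tries to pull out an invoice number. The class name, the keywords and the regular expression are all hypothetical and not part of OpenKM; real invoices will need patterns tuned to your documents.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for evaluating lowercased OCR text. Not an OpenKM class. */
final class InvoiceTextAnalyzer {
    // Assumed pattern: matches e.g. "invoice no. 2014-0042" or "invoice number: 123".
    // The number must start with a digit so "invoice not paid" does not match.
    private static final Pattern INVOICE_NO =
            Pattern.compile("invoice\\s*(?:no\\.?|number|#)\\s*:?\\s*([0-9][a-z0-9/-]*)");

    /** Very rough check: does the text mention an invoice or bill at all? */
    static boolean looksLikeInvoice(String text) {
        return text.contains("invoice") || text.contains("bill");
    }

    /** Returns the first invoice number found, if any. Expects lowercased text. */
    static Optional<String> invoiceNumber(String text) {
        Matcher m = INVOICE_NO.matcher(text);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String text = "acme ltd\ninvoice no. 2014-0042\ntotal: 119.00 eur";
        System.out.println("invoice? " + looksLikeInvoice(text));
        System.out.println("number:  " + invoiceNumber(text).orElse("(none)"));
    }
}
```

Inside the automation action you would call something like this with the `text` variable from step 3 and then decide what to do with the result, for example setting a keyword or a property on the document.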

Re: Archiving bills

#29015 by meinereiner
Sun Jun 22, 2014 6:17 pm

Thanks for your tips. I will look into that as soon as I have resolved my current issue with OpenKM.

Before doing this, I'm wondering about the OCR process that should already work in OpenKM. When I add a document and wait for some time, I see the following output in the console:

Code: Select all

WARN  com.openkm.dao.NodeDocumentDAO- There was a problem extracting text from '/okm:root/xxxxxx.pdf': Too few text extracted

I scanned this document myself. The resolution doesn't matter; I get the same message every time. When I run the OCR command on the console, I see more than the 16 required characters being extracted. In the temp folder I can also see that the temp file (okm1234567891234567891.txt.txt) contains a lot of text.

I have seen that you already gave a hint on this (http://forum.openkm.com/viewtopic.php?f=4&t=8311 ), which unfortunately doesn't work for me.
Could it be a bug where the internal logic accidentally appends one .txt too many, so that no content is read back because the file cannot be found?

Do you have any idea what could cause this issue and how to resolve it?

Re: Archiving bills

#29032 by jllort
Mon Jun 23, 2014 8:37 pm

Try the extraction from the command line. I suggest using tesseract:

Code: Select all

tesseract filein fileout
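
If you prefer to script this check, the same call can be wrapped in a few lines of Java. This is a minimal sketch, assuming `tesseract` is on the PATH; the class and method names are made up for illustration. Keep in mind that tesseract appends `.txt` to the output base name it is given, so an output base that already ends in `.txt` produces a `.txt.txt` file. The 16-character threshold is taken from the figure mentioned earlier in this thread and may not match your OpenKM version exactly.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

/** Hypothetical wrapper around the tesseract command line. Not an OpenKM class. */
final class TesseractCheck {
    // Assumed minimum, based on the "16 required" characters quoted in the thread.
    static final int MIN_CHARS = 16;

    /** Builds the command "tesseract filein fileout". */
    static List<String> command(Path image, Path outputBase) {
        return Arrays.asList("tesseract", image.toString(), outputBase.toString());
    }

    /** tesseract writes its text to "<outputBase>.txt". */
    static Path outputFile(Path outputBase) {
        return Paths.get(outputBase.toString() + ".txt");
    }

    /** True if the text is at least MIN_CHARS long after trimming whitespace. */
    static boolean enoughText(String text) {
        return text != null && text.trim().length() >= MIN_CHARS;
    }

    /** Runs tesseract and returns the contents of the file it produced. */
    static String run(Path image, Path outputBase) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(image, outputBase)).inheritIO().start();
        int exit = process.waitFor();
        if (exit != 0) {
            throw new IOException("tesseract exited with code " + exit);
        }
        return new String(Files.readAllBytes(outputFile(outputBase)), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: java TesseractCheck.java <filein> <fileout>");
            return;
        }
        String text = run(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("characters extracted: " + text.trim().length());
        System.out.println("enough text: " + enoughText(text));
    }
}
```

If this reports plenty of text while OpenKM still logs "Too few text extracted", comparing the file name returned by `outputFile` with the one OpenKM reads back is a reasonable next thing to check.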
