Open Source Document Management System | OpenKM

Posted: **Sun Jun 15, 2014 6:25 pm**

Hello,

I'm looking for a solution/system that can be used to archive scanned bills with as much automation as possible.
In detail, I would like the following:

Any invoice that needs to be archived (it can come from anywhere and cannot be processed with a template) is scanned and put on a network share. A system (OpenKM) then runs OCR on those files and prefills some information. When a user later logs in to the system, they should be able to see the document and process it further. Even when no tagging or other processing has been done, it should still be possible to find the document again with a text search.

I've read that OCR is possible, but how can the process described above be automated? Can someone give me some hints on which functions I need to look at? How is OCR run automatically?

Thanks in advance for your replies!

meinereiner

Posted: **Tue Jun 17, 2014 6:49 pm**

Do you want to catalog invoices by evaluating the full OCR text output? Is that it?

Posted: **Tue Jun 17, 2014 7:44 pm**

In short: yes!
I want to be able to run a full-text search after scanning and find that text in a scanned document. I also do not want to process all scanned documents manually before I can do that.

Posted: **Wed Jun 18, 2014 3:36 pm**

Here you have two options based on automation events:
1- When a document is created or updated, force the OCR text extractor and then analyze the contents.
2- After the document has been processed in the queue, do the analysis.

Which one would you like me to explain?

Posted: **Wed Jun 18, 2014 9:01 pm**

I assume that processing in a queue does not mean automatically importing documents from a folder, or am I wrong?

For the first test I would create a document manually, so it would be great if you could explain what I have to do to get option 1 working.

Thanks in advance for your support!

Posted: **Thu Jun 19, 2014 2:59 pm**

I will try to explain better. Any document uploaded (created in OpenKM) goes into the "text extraction queue" (the queue of documents pending indexing, i.e. text extraction). A crontab task processes this queue every 5 minutes. At this point you have two options:
1- After a document has been processed from the "text extraction queue", OpenKM sends a signal (an automation event), and you then have the extracted text available to apply some logic to, etc.
2- You can force text extraction to run immediately each time a document is created (this means the upload will take some extra time to finish), but you will be able to evaluate the extracted text immediately.
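
To make the difference between the two scenarios concrete, here is a toy model in plain Java. It is not OpenKM code: the class `ExtractionModes` and all of its method names are hypothetical, and the `extractor` function merely stands in for whatever OCR or text extraction is configured.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;
import java.util.function.Function;

/** Toy model of deferred vs. immediate text extraction (hypothetical, not OpenKM code). */
final class ExtractionModes {
    private final Queue<String> pending = new ArrayDeque<>();
    private final Map<String, String> extractedText = new HashMap<>();
    private final Function<String, String> extractor; // stands in for OCR / text extraction

    ExtractionModes(Function<String, String> extractor) {
        this.extractor = extractor;
    }

    /** Scenario 1: the upload returns at once; text appears after the next queue run. */
    void uploadDeferred(String docId) {
        pending.add(docId);
    }

    /** What a periodic task would do on each run: drain the queue and extract. */
    int processQueue() {
        int processed = 0;
        String docId;
        while ((docId = pending.poll()) != null) {
            extractedText.put(docId, extractor.apply(docId));
            processed++;
        }
        return processed;
    }

    /** Scenario 2: the upload blocks on extraction; text is available right afterwards. */
    String uploadImmediate(String docId) {
        String text = extractor.apply(docId);
        extractedText.put(docId, text);
        return text;
    }

    /** Returns the extracted text, or null if the document is still pending. */
    String textOf(String docId) {
        return extractedText.get(docId);
    }
}
```

With `uploadDeferred`, `textOf` returns null until `processQueue` has run; with `uploadImmediate`, the caller pays the extraction cost during the upload but can evaluate the text straight away.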

I hope the two scenarios are clearer now. Which one would you like me to explain in more depth?

Posted: **Thu Jun 19, 2014 7:48 pm**

Thanks for the explanation.
I'm looking for option 2.

After you told me that at least some things should already work with OCR, I checked the log files again and found that the automatic process is already triggered after 5 minutes.
The log always reports "Too few text extracted" after processing, although the files are scanned at a high resolution. When I run the extraction manually, I can see a lot of text being extracted perfectly.

Can you tell me how option 2 can be set up and how I can improve the text recognition?

Posted: **Fri Jun 20, 2014 10:07 am**

- Take a look here at how to enable automation: http://wiki.openkm.com/index.php/Automation
- You must create an Automation Action. For that, it is best to set up a development IDE: http://wiki.openkm.com/index.php/Developer_Guide
- Take a look at the classes in com.openkm.automation.action, and here for creating your own classes: http://wiki.openkm.com/index.php/Extend_automation_6.4

In the class you must:
1- Get the object uuid:

Code:

String uuid = AutomationUtils.getUuid(env);

2- Test whether it is a document:

Code:

if (OKMDocument.getInstance().isValid(null, uuid)) {
}

3- Force OCR:

Code:

        // Get the repository path of the node
        String docPath = OKMRepository.getInstance().getNodePath(null, uuid);

        // Get the uuid of the current document version
        NodeDocumentVersion currentVersion = NodeDocumentVersionDAO.getInstance().findCurrentVersion(uuid);
        String docVerUuid = currentVersion.getUuid();

        // Describe the extraction job
        TextExtractorWork tew = new TextExtractorWork();
        tew.setDocUuid(uuid);
        tew.setDocPath(docPath);
        tew.setDocVerUuid(docVerUuid);

        // Run the extractor right away instead of waiting for the queue
        NodeDocumentDAO.getInstance().textExtractorHelper(tew);

        // Read back the extracted text, normalized to lower case
        NodeDocument docNode = NodeDocumentDAO.getInstance().findByPk(uuid);
        String text = docNode.getText();
        if (text == null) {
            text = "";
        } else {
            text = text.toLowerCase();
        }
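
Once the lower-cased text is available, the "apply some logic" step could look roughly like the sketch below. It uses only the JDK and no OpenKM classes. The class `InvoiceTextAnalyzer`, the regular expression and the keyword list are all hypothetical examples and would need tuning against real invoices.

```java
import java.util.LinkedHashSet;
import java.util.Optional;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Naive analysis of lower-cased OCR text (hypothetical helper, not part of OpenKM). */
final class InvoiceTextAnalyzer {
    // Assumed pattern: matches e.g. "invoice no. 2014-0042", "invoice number: ab123", "invoice # 77"
    private static final Pattern INVOICE_NUMBER =
            Pattern.compile("invoice\\s*(?:number|no\\b\\.?|#)\\s*:?\\s*([a-z0-9][a-z0-9/-]*)");

    // Assumed keyword candidates; adjust to the language of your invoices
    private static final String[] KEYWORDS = {"invoice", "vat", "total", "due date"};

    /** Returns the first invoice number found, if any. */
    static Optional<String> findInvoiceNumber(String lowerCaseText) {
        Matcher m = INVOICE_NUMBER.matcher(lowerCaseText);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    /** Returns the candidate keywords that occur in the text, in list order. */
    static Set<String> suggestKeywords(String lowerCaseText) {
        Set<String> found = new LinkedHashSet<>();
        for (String keyword : KEYWORDS) {
            if (lowerCaseText.contains(keyword)) {
                found.add(keyword);
            }
        }
        return found;
    }
}
```

The results could then be used to prefill keywords or metadata on the document. How they are written back to OpenKM is not covered here.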

Posted: **Sun Jun 22, 2014 6:17 pm**

Thanks for your tips. I will look into that as soon as I have resolved my current issue with OpenKM.

Before doing this, I'm wondering about the OCR process that should already work in OpenKM. When I add a document and wait for some time, I can see the following output in the console:

Code:

WARN  com.openkm.dao.NodeDocumentDAO- There was a problem extracting text from '/okm:root/xxxxxx.pdf': Too few text extracted

I scanned this document myself. The resolution doesn't matter; I get the same message every time. When I run the OCR command on the console, I see far more than the 16 required characters being extracted. Also, in the temp folder I can see that the temp file (okm1234567891234567891.txt.txt) contains a lot of text.

I have seen that you already gave a hint on this (http://forum.openkm.com/viewtopic.php?f=4&t=8311), which unfortunately doesn't work for me.
Could it be a bug where the internal logic accidentally appends an extra .txt, so that when the file is read back no content is found because the file does not exist under the expected name?

Do you have any idea what could cause this issue and how to resolve it?

Posted: **Mon Jun 23, 2014 8:37 pm**

Try the extraction from the command line. I suggest using tesseract:

Code:

tesseract filein fileout
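
The same check can be scripted. The sketch below runs the command above from Java and reads the result back. It relies on the fact that tesseract appends ".txt" to the output base name it is given, so an output base that already ends in ".txt" produces a ".txt.txt" file. The class `TesseractCheck` and its methods are hypothetical, and the 16-character threshold is simply the figure mentioned earlier in this thread; how OpenKM counts characters internally is an assumption.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Runs tesseract by hand and checks the amount of extracted text (hypothetical helper). */
final class TesseractCheck {
    static final int MIN_CHARS = 16; // threshold quoted in the thread, assumed here

    /** Tesseract writes its text to "<outBase>.txt", whatever outBase already ends with. */
    static Path outputFileFor(Path outBase) {
        return outBase.resolveSibling(outBase.getFileName() + ".txt");
    }

    /** True if the trimmed text reaches the assumed minimum length. */
    static boolean hasEnoughText(String text) {
        return text != null && text.trim().length() >= MIN_CHARS;
    }

    /** Runs "tesseract <image> <outBase>" and returns the text it produced. */
    static String runOcr(Path image, Path outBase) throws IOException, InterruptedException {
        Process process = new ProcessBuilder("tesseract", image.toString(), outBase.toString())
                .inheritIO()
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        return new String(Files.readAllBytes(outputFileFor(outBase)), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        String text = runOcr(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println(hasEnoughText(text)
                ? "OK: " + text.trim().length() + " characters extracted"
                : "Too little text extracted");
    }
}
```

Run it with an image file and an output base, for example `java TesseractCheck scan.png /tmp/out`, where both paths are placeholders and tesseract must be on the PATH. Comparing `outputFileFor(...)` with the file name that actually appears in the temp folder is one way to test the ".txt.txt" suspicion raised above.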

Archiving bills