• Archiving bills

  • We tried to make OpenKM as intuitive as possible, but an advice is always welcome.
 #28948  by meinereiner
 
Hello,

I'm looking for a solution/system that can archive scanned bills with as much automation as possible.
In detail, I would like the following:

Any invoice that needs to be archived (it can come from anywhere and cannot be processed with a template) is scanned and placed on a network share. A system (OpenKM) then runs OCR on those files and prefills some information. When a user later logs in to the system, they should be able to see the document and process it further. Even if no tagging or other processing has been done, it should still be possible to find the document again with a text search.

I've read that OCR is possible, but how can the process described above be automated? Can someone give me some hints on which functions I need to look at? How is OCR run automatically?

Thanks in advance for your replies!

meinereiner
 #28962  by jllort
 
Do you want to catalog invoices by evaluating the full OCR text output? Is that it?
 #28965  by meinereiner
 
In short: yes!
I want to be able to run a full-text search after scanning and find that text in a scanned document. I also do not want to process every scanned document manually before I can do that.
 #28971  by jllort
 
Here you have two options, both based on automation events:
1- When a document is created or updated, force the OCR text extractor and then analyze the contents.
2- After the document has been processed in the queue, do the analysis.

Which one would you like me to explain?
 #28975  by meinereiner
 
I assume that processing in a queue does not mean automatically importing documents from a folder, or am I wrong?

For the first test I would create a document manually, so it would be great if you could explain what I have to do to get option 1 working.

Thanks in advance for your support!
 #28983  by jllort
 
I will try to explain it better. Any document uploaded (created in OpenKM) goes into the "text extraction queue" (the queue of documents pending indexing, i.e. text extraction). This queue is processed by a crontab task (every 5 minutes). At this point you have two options:
1- After a document has been processed from the "text extraction queue", OpenKM sends a signal (an automation event), and you then have the extracted text available to apply some logic, etc.
2- You can force text extraction to run immediately each time a document is created (this means the upload will take some extra time to finish), but you will be able to evaluate the extracted text right away.

I hope the two scenarios are clearer now. Which one would you like me to explain in more depth?
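
The trade-off between the two scenarios can be pictured with a toy model. This is only a sketch: ExtractionModel and all of its methods are made-up names for illustration, not OpenKM classes, and the "extraction" is a placeholder string rather than real OCR.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Toy model of the two scenarios; none of these names are OpenKM classes. */
final class ExtractionModel {
    // Stands in for the "text extraction queue"
    private final Queue<String> pending = new ArrayDeque<>();
    private final Map<String, String> extracted = new HashMap<>();

    /** Scenario 1: upload returns at once; text is extracted on the next scheduled pass. */
    void uploadDeferred(String docId) {
        pending.add(docId);
    }

    /** Scenario 2: upload waits for extraction, so the text is available on return. */
    String uploadImmediate(String docId) {
        return extract(docId);
    }

    /** Stands in for the periodic task that drains the queue; returns how many documents it processed. */
    int runScheduledPass() {
        int processed = 0;
        String docId;
        while ((docId = pending.poll()) != null) {
            extract(docId);
            processed++;
        }
        return processed;
    }

    /** Returns the extracted text, or null if the document has not been processed yet. */
    String textOf(String docId) {
        return extracted.get(docId);
    }

    // Placeholder for the real OCR / text extraction step
    private String extract(String docId) {
        String text = "text of " + docId;
        extracted.put(docId, text);
        return text;
    }
}
```

In this model, textOf returns null for a deferred upload until runScheduledPass has been called, whereas uploadImmediate hands back the text directly at the cost of a slower upload.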
 #28988  by meinereiner
 
Thanks for the explanation.
I'm looking for option 2.

After you told me that at least some things should already work with OCR, I checked the log files again and found that the automatic process is already being triggered every 5 minutes.
The log always reports "Too few text extracted" after processing, although the files are scanned at high resolution. When I run the OCR manually, I can see a lot of text being extracted perfectly.

Can you tell me how option 2 can be set up and how I can improve the text recognition?
 #28994  by jllort
 
- Take a look here at how to enable automation: http://wiki.openkm.com/index.php/Automation
- You must create an Automation Action. The best way to do this is to set up a development IDE: http://wiki.openkm.com/index.php/Developer_Guide
- Take a look at the classes in com.openkm.automation.action, and see here how to create your own classes: http://wiki.openkm.com/index.php/Extend_automation_6.4

In the class you must:
1- Get the object uuid:
Code: Select all
String uuid = AutomationUtils.getUuid(env);
2- Test whether it is a document:
Code: Select all
if (OKMDocument.getInstance().isValid(null, uuid)) {
}
3- Force OCR and read the extracted text:
Code: Select all
        // Get path
        String docPath = OKMRepository.getInstance().getNodePath(null, uuid);

        // Get doc version uuid
        NodeDocumentVersion currentVersion = NodeDocumentVersionDAO.getInstance().findCurrentVersion(uuid);
        String docVerUuid = currentVersion.getUuid();

        // Document extractor
        TextExtractorWork tew = new TextExtractorWork();
        tew.setDocUuid(uuid);
        tew.setDocPath(docPath);
        tew.setDocVerUuid(docVerUuid);

        // Execute extractor
        NodeDocumentDAO.getInstance().textExtractorHelper(tew);

        // Get extracted text, lower-cased so later matching is case-insensitive
        NodeDocument docNode = NodeDocumentDAO.getInstance().findByPk(uuid);
        String text = docNode.getText();
        if (text == null) {
            text = "";
        } else {
            text = text.toLowerCase();
        }
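
Once the lower-cased text is available, the "analyze contents" step could start with a few regular expressions. The following is only a sketch under assumptions: InvoiceTextAnalyzer is a hypothetical helper, not an OpenKM class, and the patterns assume invoices that contain a phrase like "invoice no. 2014-0042" and dates written as dd.mm.yyyy or dd/mm/yyyy. Real invoices will need patterns adapted to their layout and language.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; the patterns are illustrative and not part of OpenKM. */
final class InvoiceTextAnalyzer {
    // Assumed wording: "invoice no. 2014-0042", "invoice number: ab123", "invoice # 77"
    private static final Pattern INVOICE_NO =
            Pattern.compile("invoice\\s*(?:no\\.?|number|#)\\s*:?\\s*([a-z0-9][a-z0-9/-]*)");

    // Assumed date format: dd.mm.yyyy or dd/mm/yyyy
    private static final Pattern DATE =
            Pattern.compile("\\b(\\d{1,2}[./]\\d{1,2}[./]\\d{4})\\b");

    /** Expects text that has already been lower-cased, as in the snippet above. */
    static Optional<String> invoiceNumber(String lowerCaseText) {
        return firstGroup(INVOICE_NO, lowerCaseText);
    }

    /** Returns the first date-like token found, if any. */
    static Optional<String> firstDate(String lowerCaseText) {
        return firstGroup(DATE, lowerCaseText);
    }

    private static Optional<String> firstGroup(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

The values found this way could then be used to prefill metadata. The patterns are deliberately naive and will produce false matches on some wording, so they are best treated as suggestions for the user to confirm.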
 #29015  by meinereiner
 
Thanks for your tips. I will look into that as soon as I have resolved my current issue with OpenKM.

Before doing this, I'm wondering about the OCR process that should already work in OpenKM. When I add a document and wait for some time, I can see the following output in the console:
Code: Select all
WARN  com.openkm.dao.NodeDocumentDAO- There was a problem extracting text from '/okm:root/xxxxxx.pdf': Too few text extracted
I scanned this document myself. The resolution doesn't matter; I get the same message every time. When I run the OCR command on the console, I see more than the 16 required characters being extracted. In the temp folder I can also see that the temp file (okm1234567891234567891.txt.txt) contains a lot of text.

I have seen that you already gave a hint on this (http://forum.openkm.com/viewtopic.php?f=4&t=8311 ), which unfortunately doesn't work for me.
Could it be a bug where the internal logic accidentally appends .txt one time too many, so that nothing is read back because the file cannot be found?

Do you have any idea what could cause this issue and how to resolve it?
 #29032  by jllort
 
Try the extraction from the command line. I suggest using tesseract:
Code: Select all
tesseract filein fileout
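
When reproducing this check, it may help to know that tesseract treats its second argument as an output base name and appends .txt itself. That is one possible, unconfirmed explanation for a temp file ending in .txt.txt. A minimal Java sketch of the same call, assuming tesseract is on the PATH; TesseractCheck and its methods are hypothetical names, not OpenKM code:

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for reproducing the command-line check from Java; not an OpenKM class. */
final class TesseractCheck {
    /** Builds the argument list for "tesseract filein fileout". */
    static List<String> command(Path image, Path outputBase) {
        return Arrays.asList("tesseract", image.toString(), outputBase.toString());
    }

    /** tesseract appends ".txt" to the output base, so this is the file to read back. */
    static Path resultFile(Path outputBase) {
        return outputBase.resolveSibling(outputBase.getFileName() + ".txt");
    }

    /** Runs tesseract and returns its exit code; requires tesseract to be installed. */
    static int run(Path image, Path outputBase) throws IOException, InterruptedException {
        return new ProcessBuilder(command(image, outputBase)).inheritIO().start().waitFor();
    }
}
```

If the output base passed in already ends in .txt, resultFile returns a name ending in .txt.txt, so whatever reads the result has to look for that name rather than the base name.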
