Archiving bills
Posted: Sun Jun 15, 2014 6:25 pm
by meinereiner
Hello,
I'm looking for a solution/system that can archive scanned bills with as much automation as possible.
In detail, I would like the following:
When an invoice that needs to be archived (it can come from anywhere and cannot be processed with a template) is scanned and put on a network share, a system (OpenKM) runs OCR on the file and prefills some information. When a user then logs in to the system, they should be able to see the document and process it further. Even if no tagging or other processing has been done, it should still be possible to find the document again with a text search.
I've read that OCR is possible, but how can the process described above be automated? Can someone give me some hints on which functions I need to look at? How is OCR run automatically?
Thanks in advance for your replies!
meinereiner
Re: Archiving bills
Posted: Tue Jun 17, 2014 6:49 pm
by jllort
Do you want to catalog invoices by evaluating the full OCR text output? Is that it?
Re: Archiving bills
Posted: Tue Jun 17, 2014 7:44 pm
by meinereiner
In short: yes!
I want to be able to run a full-text search after scanning and find that text in a scanned document. I also do not want to have to process every scanned document manually before I can do that.
Re: Archiving bills
Posted: Wed Jun 18, 2014 3:36 pm
by jllort
Here you have two options, both based on automation events:
1- When a document is created or updated, force the OCR text extractor to run and then analyze the contents.
2- After the document has been processed in the queue, do the analysis.
Which one would you like me to explain?
Re: Archiving bills
Posted: Wed Jun 18, 2014 9:01 pm
by meinereiner
I assume that processing in a queue does not mean automatically importing documents from a folder, or am I wrong?
For a first test I would create the document manually, so it would be great if you could explain what I have to do to get option 1 working.
Thanks in advance for your support!
Re: Archiving bills
Posted: Thu Jun 19, 2014 2:59 pm
by jllort
I will try to explain it better. Any document uploaded (created in OpenKM) goes into the "text extraction queue" (the queue of documents pending indexing, that is, text extraction). This queue is processed by a crontab task (every 5 minutes). At this point you have two options:
1- After the document is processed from the "text extraction queue", OpenKM sends a signal (an automation event), and you then have the extracted text available to apply some logic, etc.
2- You can force text extraction to run immediately each time a document is created (this means the upload will take some extra time to finish), but you will be able to evaluate the extracted text right away.
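To make the difference between the two options concrete, here is a toy model in plain Java. It is not OpenKM code: the class ExtractionModes, its methods, and the extractor and listener callbacks are all made up for illustration. It only shows that in the deferred case the text exists after the queue has run (and a callback fires), while in the immediate case the upload itself returns the text.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;
import java.util.function.BiConsumer;
import java.util.function.Function;

/** Toy model of the two scenarios; every name here is hypothetical, not OpenKM API. */
final class ExtractionModes {
    private final Queue<String> pending = new ArrayDeque<>();
    private final Map<String, String> textById = new HashMap<>();
    private final Function<String, String> extractor;       // stands in for OCR / text extraction
    private final BiConsumer<String, String> onExtracted;   // stands in for the automation event

    ExtractionModes(Function<String, String> extractor, BiConsumer<String, String> onExtracted) {
        this.extractor = extractor;
        this.onExtracted = onExtracted;
    }

    /** Option 1, part one: the upload returns quickly and only enqueues the document. */
    void uploadDeferred(String docId) {
        pending.add(docId);
    }

    /** Option 1, part two: a periodic task drains the queue and signals each result. */
    void processQueue() {
        String docId;
        while ((docId = pending.poll()) != null) {
            String text = extractor.apply(docId);
            textById.put(docId, text);
            onExtracted.accept(docId, text);
        }
    }

    /** Option 2: the upload blocks until extraction is done and returns the text. */
    String uploadImmediate(String docId) {
        String text = extractor.apply(docId);
        textById.put(docId, text);
        return text;
    }

    /** Returns the extracted text, or null if extraction has not happened yet. */
    String textOf(String docId) {
        return textById.get(docId);
    }

    public static void main(String[] args) {
        ExtractionModes modes = new ExtractionModes(
                id -> "text of " + id,
                (id, text) -> System.out.println("event for " + id + ": " + text));
        modes.uploadDeferred("invoice-1");
        System.out.println("before queue run: " + modes.textOf("invoice-1"));
        modes.processQueue();
        System.out.println("immediate: " + modes.uploadImmediate("invoice-2"));
    }
}
```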
I hope the two scenarios are clearer now. Which one would you like me to explain in more depth?
Re: Archiving bills
Posted: Thu Jun 19, 2014 7:48 pm
by meinereiner
Thanks for the explanation.
I'm looking for option 2.
After you told me that at least some OCR functionality should already work, I checked the log files again and found that the automatic process is already triggered every 5 minutes.
The log always reports "Too few text extracted" after processing, although the files are scanned at a high resolution. When I run the extraction manually, I can see a lot of text being extracted perfectly.
Can you tell me how option 2 can be set up and how I can improve the text recognition?
Re: Archiving bills
Posted: Fri Jun 20, 2014 10:07 am
by jllort
- Take a look here at how to enable automation:
http://wiki.openkm.com/index.php/Automation
- You must create an Automation Action. For that, it is best to set up a development IDE:
http://wiki.openkm.com/index.php/Developer_Guide
- Take a look at the classes in com.openkm.automation.action, and here for how to create your own classes:
http://wiki.openkm.com/index.php/Extend_automation_6.4
In the class you must:
1- Get the object uuid:
Code:
String uuid = AutomationUtils.getUuid(env);
2- Test whether it is a document:
Code:
if (OKMDocument.getInstance().isValid(null, uuid)) {
    // document-specific logic (step 3) goes here
}
3- Force OCR:
Code:
// Get path
String docPath = OKMRepository.getInstance().getNodePath(null, uuid);

// Get doc version uuid
NodeDocumentVersion currentVersion = NodeDocumentVersionDAO.getInstance().findCurrentVersion(uuid);
String docVerUuid = currentVersion.getUuid();

// Document extractor
TextExtractorWork tew = new TextExtractorWork();
tew.setDocUuid(uuid);
tew.setDocPath(docPath);
tew.setDocVerUuid(docVerUuid);

// Execute extractor
NodeDocumentDAO.getInstance().textExtractorHelper(tew);

// Get extracted text, normalized to lower case (empty string if nothing was extracted)
NodeDocument docNode = NodeDocumentDAO.getInstance().findByPk(uuid);
String text = docNode.getText();

if (text == null) {
    text = "";
} else {
    text = text.toLowerCase();
}
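Once the lower-cased text is available, the "analyze contents" part could look something like the sketch below. This is plain Java with no OpenKM dependency, and everything in it is an assumption for illustration: the class name InvoiceTextAnalyzer, the keyword list, and the invoice-number pattern are not part of OpenKM, and OCR output is usually noisy enough that the pattern would need tuning against real scans.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for evaluating extracted text; not part of OpenKM. */
final class InvoiceTextAnalyzer {
    // Assumed keywords; adjust to the languages your invoices actually use.
    private static final List<String> KEYWORDS = Arrays.asList("invoice", "rechnung", "bill");

    // Assumed pattern: "invoice no", "invoice number" or "invoice #", then an identifier.
    private static final Pattern INVOICE_NO =
            Pattern.compile("invoice\\s*(?:no\\.?|number|#)\\s*:?\\s*([a-z0-9][a-z0-9/-]*)");

    /** True if the (already lower-cased) text contains one of the keywords. */
    static boolean looksLikeInvoice(String lowerCaseText) {
        return KEYWORDS.stream().anyMatch(lowerCaseText::contains);
    }

    /** Returns the first identifier that follows an "invoice no"-style label, if any. */
    static Optional<String> findInvoiceNumber(String lowerCaseText) {
        Matcher m = INVOICE_NO.matcher(lowerCaseText);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up sample text, lower-cased the same way as in step 3.
        String text = "ACME Ltd.\nInvoice No: 2014-0617\nTotal: 42.00 EUR".toLowerCase();
        System.out.println("invoice? " + looksLikeInvoice(text));
        System.out.println("number:  " + findInvoiceNumber(text).orElse("(none)"));
    }
}
```

Inside an automation action, the result of such a check would then decide which category, keyword, or property to set on the document.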
Re: Archiving bills
Posted: Sun Jun 22, 2014 6:17 pm
by meinereiner
Thanks for your tips. I will look into that as soon as I have resolved my current issue with OpenKM.
Before doing this, I'm wondering about the OCR process that should already work in OpenKM. When I add a document and wait for some time, I can see the following output in the console:
Code:
WARN com.openkm.dao.NodeDocumentDAO- There was a problem extracting text from '/okm:root/xxxxxx.pdf': Too few text extracted
I have scanned this document. The resolution doesn't matter; I get the same message every time. When I run the OCR command on the console, I see more than the 16 required characters being extracted. In the temp folder I can also see that the temp file (okm1234567891234567891.txt.txt) contains a lot of text.
I have seen that you already gave a hint about this ( http://forum.openkm.com/viewtopic.php?f=4&t=8311 ), which unfortunately doesn't work for me.
Could it be a bug where the internal logic accidentally appends an extra .txt, so that when the result is read back no content is found because the file cannot be located?
Do you have any idea what causes this issue and how to resolve it?
Re: Archiving bills
Posted: Mon Jun 23, 2014 8:37 pm
by jllort
Try the extraction from the command line. I suggest using tesseract.
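A command-line check like this can also be scripted. The sketch below is plain Java, independent of OpenKM, and the class and method names are made up. It assumes a tesseract binary on the PATH and an image file as input (Tesseract reads images such as TIFF or PNG, so a scanned PDF would have to be converted first). One detail worth knowing: Tesseract is called as "tesseract <image> <outputbase>" and appends ".txt" to the output base itself, so an output base that already ends in .txt produces a .txt.txt file. Whether that explains the temp file mentioned earlier in the thread is only a guess. The 16-character threshold is taken from the thread, not checked against OpenKM.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-alone check of Tesseract output; not part of OpenKM. */
final class TesseractCheck {
    // Minimum length mentioned in the thread; an assumption, not verified in OpenKM.
    static final int MIN_CHARS = 16;

    /** Builds the command line: tesseract <image> <outputBase>. */
    static List<String> buildCommand(Path image, Path outputBase) {
        return Arrays.asList("tesseract", image.toString(), outputBase.toString());
    }

    /** Tesseract appends ".txt" to the output base, so this is the file to read back. */
    static Path expectedOutput(Path outputBase) {
        return Paths.get(outputBase.toString() + ".txt");
    }

    /** Runs tesseract and returns the recognized text. Requires tesseract on the PATH. */
    static String run(Path image, Path outputBase) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(image, outputBase))
                .inheritIO()
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        return new String(Files.readAllBytes(expectedOutput(outputBase)), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: java TesseractCheck <image> <outputBase>");
            return;
        }
        String text = run(Paths.get(args[0]), Paths.get(args[1])).trim();
        System.out.println(text.length() + " characters extracted"
                + (text.length() < MIN_CHARS ? " (below the assumed minimum)" : ""));
    }
}
```

If this reports plenty of characters while OpenKM still logs "Too few text extracted", the recognition itself is probably fine and the problem is more likely in how the output file is located or read back.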