• How to use OCR

  • We tried to make OpenKM as intuitive as possible, but advice is always welcome.
 #22562  by quelo1972
 
I've installed OpenKM 6.2.3 (build: 7945) on Ubuntu Server 12.04.2 LTS (Precise Pangolin).

I've installed Cuneiform and configured it as follows:
system.ocr String /usr/bin/cuneiform ${fileIn} ${fileOut}
system.ocr.rotate String 90;180;270;
system.openoffice.dictionary String /mnt/OpenKM/dict-it.oxt

But I cannot see any OCR button or any menu entry called OCR or similar.

How can I use this feature?

Thanks in advance for the help.
 #22587  by jllort
 
Basically, each time OpenKM finds an image, it automatically runs the OCR engine in the background. I suggest first trying from the command line to determine which OCR engine works better with your images, and then configuring it. Cuneiform is fine in most cases. Note that documents are not indexed in real time: they go into an indexing task queue (see the administration stats) and are periodically indexed by a batch process.
 #22596  by quelo1972
 
OK. But if I right-click on a file, I don't get any context menu entry named OCR or similar.
 #22608  by jllort
 
I think you're confused: OCR is not exactly the same as zonal OCR. OCR as such does not need any menu option. When an image document or a PDF with images is uploaded, OCR automatically tries to extract the text content (nothing else). Zonal OCR, which does come with menu options, is a more advanced feature that identifies documents and extracts some parts into metadata (currently this feature is only present in the Professional version).
 #22663  by pavila
 
You need to run a SQL query: select * from OKM_NODE_DOCUMENT
 #22684  by quelo1972
 
From Administration -> Database query I executed the SQL query, but nothing was returned.
 #22713  by jllort
 
You have to select JDBC from the list at the bottom right corner.
 #22886  by quelo1972
 
OK!
I've configured the OCR as follows:
system.ocr String /usr/bin/cuneiform ${fileIn} -o ${fileOut}
system.ocr.rotate String 90;180;270;
system.openoffice.dictionary String /mnt/OpenKM/dict-it.(oxt|zip)

I've uploaded a scanned page in TIF format containing some text, but when I run a query against the database I can't find any text.
How can I tell whether the OCR works or not?
 #22889  by anyonebutnoone
 
Testing your ocr:

a) copy the scanned image to your local filesystem
b) run Cuneiform OCR on this image: /usr/bin/cuneiform /path/to/scannedfile.xxx -o /path/to/output
c) check the output file and see what it contains

If you get an error or no output file is created, Cuneiform is not working on your system and therefore OpenKM has no way of making use of it.
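
Steps a) to c) can also be scripted. The following is only a rough sketch in Python, not anything OpenKM ships: it expands a system.ocr-style template, runs it on one image, and reports whether a non-empty output file appears. The helper names build_command and ocr_produces_text are made up for this example, and the way the ${fileIn} and ${fileOut} placeholders are expanded here is an assumption, not necessarily how OpenKM does it internally.

```python
import shlex
import subprocess
import tempfile
from pathlib import Path
from string import Template


def build_command(template: str, file_in: str, file_out: str) -> list[str]:
    """Expand ${fileIn}/${fileOut} in a system.ocr-style template into an argv list.

    Hypothetical helper: the expansion rules are an assumption for this sketch.
    """
    # Split first, substitute second, so paths containing spaces stay one argument.
    return [
        Template(token).safe_substitute(fileIn=file_in, fileOut=file_out)
        for token in shlex.split(template)
    ]


def ocr_produces_text(template: str, image: str, timeout: int = 120) -> bool:
    """Run the OCR command on `image` and report whether it wrote a non-empty file."""
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "ocr-output.txt"
        try:
            result = subprocess.run(
                build_command(template, image, str(out)),
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Binary missing, not executable, or hanging.
            print(f"OCR command could not be run: {exc}")
            return False
        if result.returncode != 0:
            print(f"OCR command failed ({result.returncode}): {result.stderr.strip()}")
            return False
        return out.is_file() and out.stat().st_size > 0
```

For example, ocr_produces_text("/usr/bin/cuneiform ${fileIn} -o ${fileOut}", "/path/to/scannedfile.xxx") should return True only if the command exits cleanly and writes some output. A False result here points at the OCR installation rather than at OpenKM.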

Also always very helpful: tomcathome/logs/catalina.log
Check whether there are any errors when OpenKM calls Cuneiform.
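
Since catalina.log can be long, it may help to filter it down to warnings and errors from the extractor classes. This is a minimal sketch in Python that assumes the log lines have the same layout as the ones quoted in this thread (timestamp, thread in brackets, level, logger, " - ", message). The function name extractor_problems and the default logger prefix are choices made for this example only.

```python
import re

# Layout assumed from the log lines quoted in this thread, e.g.
# "2013-05-06 17:40:00,447 [Thread-4848] WARN  com.openkm.extractor.PdfTextExtractor - message"
LOG_LINE = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"\[(?P<thread>[^\]]+)\] "
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<logger>\S+) - (?P<message>.*)$"
)


def extractor_problems(lines, prefix="com.openkm.extractor.", levels=("WARN", "ERROR")):
    """Return (time, logger, message) for WARN/ERROR lines logged under `prefix`."""
    found = []
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        # Stack-trace lines and other continuation lines do not match and are skipped.
        if match and match["level"] in levels and match["logger"].startswith(prefix):
            found.append((match["time"], match["logger"], match["message"]))
    return found
```

Open the log file and pass the file object to extractor_problems to get the matching entries. Stack-trace lines that follow a warning are not captured by this sketch, so look at the surrounding lines in the log itself for the full exception.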
 #22890  by quelo1972
 
This command generates a .txt file with the text inside:
/usr/bin/cuneiform prova.pdf -o prova.txt

When I upload the file prova.pdf I get this log in catalina.log:
2013-05-06 17:36:15,756 [http-bio-0.0.0.0-8080-exec-219] INFO  com.openkm.servlet.frontend.FileUploadServlet - Filename: 'prova.pdf'
2013-05-06 17:36:15,757 [http-bio-0.0.0.0-8080-exec-219] INFO  com.openkm.servlet.frontend.FileUploadServlet - Upload file 'prova.pdf' into '/okm:root/CIE (15,4 MB)'
2013-05-06 17:36:15,758 [http-bio-0.0.0.0-8080-exec-219] INFO  com.openkm.servlet.frontend.FileUploadServlet - Wizard: {path=, showWizardCategories=false, showWizardKeywords=false, groupsList=[], workflowList=[], hasAutomation=false, error=, digitalSignature=false}
2013-05-06 17:36:27,379 [http-bio-0.0.0.0-8080-exec-219] INFO  com.openkm.servlet.frontend.FileUploadServlet - Wizard: {path=%2Fokm%3Aroot%2FCIE%2Fprova.pdf, showWizardCategories=false, showWizardKeywords=false, groupsList=[], workflowList=[], hasAutomation=false, error=, digitalSignature=false}
2013-05-06 17:36:27,382 [http-bio-0.0.0.0-8080-exec-219] INFO  com.openkm.servlet.frontend.FileUploadServlet - Action: 0, JSON Response: {"hasAutomation":false,"path":"%2Fokm%3Aroot%2FCIE%2Fprova.pdf","groupsList":[],"workflowList":[],"showWizardCategories":false,"showWizardKeywords":false,"digitalSignature":false,"error":""}
2013-05-06 17:36:27,379 [http-bio-0.0.0.0-8080-exec-219] INFO  com.openkm.servlet.frontend.FileUploadServlet - Wizard: {path=%2Fokm%3Aroot%2FCIE%2Fprova.pdf, showWizardCategories=false, showWizardKeywords=false, groupsList=[], workflowList=[], hasAutomation=false, error=, digitalSignature=false}
2013-05-06 17:36:27,382 [http-bio-0.0.0.0-8080-exec-219] INFO  com.openkm.servlet.frontend.FileUploadServlet - Action: 0, JSON Response: {"hasAutomation":false,"path":"%2Fokm%3Aroot%2FCIE%2Fprova.pdf","groupsList":[],"workflowList":[],"showWizardCategories":false,"showWizardKeywords":false,"digitalSignature":false,"error":""}
2013-05-06 17:40:00,023 [Thread-4848] INFO  com.openkm.extractor.TextExtractorWorker - processSerial.Working on {docUuid=a8e6c061-1fcd-4393-9583-ad403a130925, docPath=/okm:root/CIE/prova.pdf, docVerUuid=5b531734-689d-4e1a-8e89-252f56e25b1e, date=Mon May 06 17:36:25 CEST 2013}
2013-05-06 17:40:00,447 [Thread-4848] WARN  com.openkm.extractor.PdfTextExtractor - PDF does not contains text layer
2013-05-06 17:40:19,709 [Thread-4848] INFO  com.openkm.util.DocumentUtils - Using OpenOffice dictionary: /mnt/OpenKM/dict-it.(oxt|zip)
2013-05-06 17:40:19,714 [Thread-4848] WARN  com.openkm.extractor.CuneiformTextExtractor - IO exception executing command: /usr/bin/cuneiform /mnt/OpenKM/tomcat/temp/Im07239809310552548179.png -o /mnt/OpenKM/tomcat/temp/okm4982710508936566750.txt
java.util.zip.ZipException: error in opening zip file
        at java.util.zip.ZipFile.open(Native Method)
        at java.util.zip.ZipFile.<init>(ZipFile.java:127)
        at java.util.zip.ZipFile.<init>(ZipFile.java:88)
        at com.openkm.util.DocumentUtils.spellChecker(DocumentUtils.java:51)
        at com.openkm.extractor.CuneiformTextExtractor.doOcr(CuneiformTextExtractor.java:150)
        at com.openkm.extractor.PdfTextExtractor.doOcr(PdfTextExtractor.java:141)
        at com.openkm.extractor.PdfTextExtractor.extractText(PdfTextExtractor.java:100)
        at com.openkm.extractor.RegisteredExtractors.getText(RegisteredExtractors.java:211)
        at com.openkm.extractor.RegisteredExtractors.getText(RegisteredExtractors.java:172)
        at com.openkm.dao.NodeDocumentDAO.textExtractorHelper(NodeDocumentDAO.java:1306)
        at com.openkm.extractor.TextExtractorWorker.processSerial(TextExtractorWorker.java:138)
        at com.openkm.extractor.TextExtractorWorker.processQueue(TextExtractorWorker.java:125)
        at com.openkm.extractor.TextExtractorWorker.run(TextExtractorWorker.java:80)
        at sun.reflect.GeneratedMethodAccessor211.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at bsh.Reflect.invokeOnMethod(Unknown Source)
        at bsh.Reflect.invokeObjectMethod(Unknown Source)
        at bsh.BSHPrimarySuffix.doName(Unknown Source)
        at bsh.BSHPrimarySuffix.doSuffix(Unknown Source)
        at bsh.BSHPrimaryExpression.eval(Unknown Source)
        at bsh.BSHPrimaryExpression.eval(Unknown Source)
        at bsh.Interpreter.eval(Unknown Source)
        at bsh.Interpreter.eval(Unknown Source)
        at bsh.Interpreter.eval(Unknown Source)
        at com.openkm.util.ExecutionUtils.runScript(ExecutionUtils.java:112)
        at com.openkm.core.Cron$RunnerBsh.run(Cron.java:103)
        at java.lang.Thread.run(Thread.java:662)
2013-05-06 17:40:19,719 [Thread-4848] WARN  com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/CIE/prova.pdf': Too few text extracted
 #22898  by joako
 
I have the same problem. Under /opt/openkm/temp I have 1730 files such as okm9159544668738627006.txt.txt. I can see the tesseract process running with the Linux command "top"; however, when I execute the SQL query there is no extracted text from PDF files. If I disable the OCR and import PDFs with text, they are indexed properly.
 #22919  by jllort
 
It seems you have configured a dictionary and its configuration is wrong. Remove it and test again. Then try to configure the dictionary correctly.
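
The earlier log shows the dictionary path being used as "/mnt/OpenKM/dict-it.(oxt|zip)" right before a java.util.zip.ZipException, which suggests the value must be the path of one real zip-format file (.oxt extensions are zip archives). A quick way to check a candidate path before putting it in system.openoffice.dictionary is sketched below in Python. The function name check_dictionary is invented for this example, and the check only covers "exists and opens as a zip"; it does not verify what OpenKM expects inside the archive.

```python
import zipfile
from pathlib import Path


def check_dictionary(path: str) -> str:
    """Report whether `path` is an existing file that opens as a zip archive."""
    candidate = Path(path)
    if not candidate.is_file():
        return f"missing: {path} is not an existing file"
    if not zipfile.is_zipfile(str(candidate)):
        return f"invalid: {path} exists but is not a zip archive"
    with zipfile.ZipFile(str(candidate)) as archive:
        entries = len(archive.namelist())
    return f"ok: {path} is a zip archive with {entries} entries"
```

A literal value such as /mnt/OpenKM/dict-it.(oxt|zip) would normally be reported as missing, whereas /mnt/OpenKM/dict-it.oxt should be reported as ok if that file is really there.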
 #23020  by pavila
 
joako wrote: I have the same problem. Under /opt/openkm/temp I have 1730 files such as okm9159544668738627006.txt.txt. I can see the tesseract process running with the Linux command "top"; however, when I execute the SQL query there is no extracted text from PDF files. If I disable the OCR and import PDFs with text, they are indexed properly.
This is because the configured OCR executable does not match the text extractor. I think you have configured the CuneiformTextExtractor but are using the tesseract executable.
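
One plausible reading of the leftover .txt.txt files, offered as an assumption rather than a confirmed diagnosis: the tesseract command line treats its second argument as an output base name and appends ".txt" to it, while cuneiform -o writes exactly the file it is given. If the extractor hands over a temporary name that already ends in .txt and then reads that same name back, a tesseract run leaves its text in a different file. The small Python sketch below only illustrates that naming difference; the function written_output_path is hypothetical.

```python
def written_output_path(engine: str, file_out: str) -> str:
    """Return the file an OCR engine actually writes when given `file_out`.

    Illustrative only: covers the two engines mentioned in this thread.
    """
    if engine == "tesseract":
        # tesseract <image> <outputbase> writes <outputbase>.txt
        return file_out + ".txt"
    if engine == "cuneiform":
        # cuneiform <image> -o <file> writes <file> as given
        return file_out
    raise ValueError(f"unknown engine: {engine}")


def extractor_finds_output(engine: str, file_out: str) -> bool:
    """True when the engine writes to the same path the caller will read back."""
    return written_output_path(engine, file_out) == file_out
```

With a temporary name like okm9159544668738627006.txt, the tesseract case yields okm9159544668738627006.txt.txt, which matches the file names reported above.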
 #23071  by quelo1972
 
I removed the dictionary configuration, and now if I scan a page to PDF I get this message:
2013-05-16 17:09:09,694 [http-bio-0.0.0.0-8080-exec-248] INFO  com.openkm.servlet.frontend.FileUploadServlet - Filename: 'prova2.pdf'
2013-05-16 17:09:09,697 [http-bio-0.0.0.0-8080-exec-248] INFO  com.openkm.servlet.frontend.FileUploadServlet - Upload file 'prova2.pdf' into '/okm:root (281,6 KB)'
2013-05-16 17:09:09,698 [http-bio-0.0.0.0-8080-exec-248] INFO  com.openkm.servlet.frontend.FileUploadServlet - Wizard: {path=, showWizardCategories=false, showWizardKeywords=false, groupsList=[], workflowList=[], hasAutomation=false, error=, digitalSignature=false}
2013-05-16 17:09:20,185 [http-bio-0.0.0.0-8080-exec-248] INFO  com.openkm.servlet.frontend.FileUploadServlet - Wizard: {path=%2Fokm%3Aroot%2Fprova2.pdf, showWizardCategories=false, showWizardKeywords=false, groupsList=[], workflowList=[], hasAutomation=false, error=, digitalSignature=false}
2013-05-16 17:09:20,186 [http-bio-0.0.0.0-8080-exec-248] INFO  com.openkm.servlet.frontend.FileUploadServlet - Action: 0, JSON Response: {"hasAutomation":false,"path":"%2Fokm%3Aroot%2Fprova2.pdf","groupsList":[],"workflowList":[],"showWizardCategories":false,"showWizardKeywords":false,"digitalSignature":false,"error":""}
2013-05-16 17:10:00,033 [Thread-11347] INFO  com.openkm.extractor.TextExtractorWorker - processSerial.Working on {docUuid=d9e76b01-b4a4-4751-a444-f7d9a105e215, docPath=/okm:root/prova2.pdf, docVerUuid=3407f033-a3f1-4889-b762-a6f86b2abdec, date=Thu May 16 17:09:19 CEST 2013}
2013-05-16 17:10:00,048 [Thread-11347] WARN  com.openkm.extractor.PdfTextExtractor - PDF does not contains text layer
But if I scan a page in TIF format, OCR works!
This is the log:
2013-05-16 17:14:51,275 [http-bio-0.0.0.0-8080-exec-251] INFO  com.openkm.servlet.frontend.FileUploadServlet - Filename: 'prova3.tif'
2013-05-16 17:14:51,277 [http-bio-0.0.0.0-8080-exec-251] INFO  com.openkm.servlet.frontend.FileUploadServlet - Upload file 'prova3.tif' into '/okm:root (12,1 MB)'
2013-05-16 17:14:51,278 [http-bio-0.0.0.0-8080-exec-251] INFO  com.openkm.servlet.frontend.FileUploadServlet - Wizard: {path=, showWizardCategories=false, showWizardKeywords=false, groupsList=[], workflowList=[], hasAutomation=false, error=, digitalSignature=false}
2013-05-16 17:15:03,024 [http-bio-0.0.0.0-8080-exec-251] INFO  com.openkm.servlet.frontend.FileUploadServlet - Wizard: {path=%2Fokm%3Aroot%2Fprova3.tif, showWizardCategories=false, showWizardKeywords=false, groupsList=[], workflowList=[], hasAutomation=false, error=, digitalSignature=false}
2013-05-16 17:15:03,026 [http-bio-0.0.0.0-8080-exec-251] INFO  com.openkm.servlet.frontend.FileUploadServlet - Action: 0, JSON Response: {"hasAutomation":false,"path":"%2Fokm%3Aroot%2Fprova3.tif","groupsList":[],"workflowList":[],"showWizardCategories":false,"showWizardKeywords":false,"digitalSignature":false,"error":""}
2013-05-16 17:20:00,023 [Thread-11355] INFO  com.openkm.extractor.TextExtractorWorker - processSerial.Working on {docUuid=344b9d9e-34e2-4dc9-9078-b8d9ad66f085, docPath=/okm:root/prova3.tif, docVerUuid=f2e2db1f-3aa5-45c2-9ec8-1a3120c07d98, date=Thu May 16 17:15:01 CEST 2013}
2013-05-16 17:20:03,340 [Thread-11355] INFO  com.openkm.extractor.CuneiformTextExtractor - Rotate image 90 degrees
2013-05-16 17:20:04,232 [Thread-11355] INFO  com.openkm.extractor.CuneiformTextExtractor - Rotate image 180 degrees
2013-05-16 17:20:05,054 [Thread-11355] INFO  com.openkm.extractor.CuneiformTextExtractor - Rotate image 270 degrees
2013-05-16 17:21:12,527 [http-bio-0.0.0.0-8080-exec-239] INFO  com.openkm.dao.SearchDAO - findBySimpleQuery(Enzo AND context:okm_root, 0, 10)
2013-05-16 17:21:12,530 [http-bio-0.0.0.0-8080-exec-239] INFO  com.openkm.dao.SearchDAO - findBySimpleQuery.query: +text:enzo +context:okm_root
Do you have any ideas?
