Open Source Document Management System | OpenKM

PostPosted:**Thu Apr 18, 2013 8:29 am**

I've installed OpenKM 6.2.3 (build: 7945) on an Ubuntu Server 12.04.2 LTS (Precise Pangolin) machine.

I've installed cuneiform and configured it as follows:
system.ocr String /usr/bin/cuneiform ${fileIn} ${fileOut}
system.ocr.rotate String 90;180;270;
system.openoffice.dictionary String /mnt/OpenKM/dict-it.oxt

But I cannot see any OCR button, or any menu entry called OCR or similar.

How can I use this feature?

Thanks in advance for the help.

PostPosted:**Fri Apr 19, 2013 6:20 pm**

Basically, each time OpenKM finds an image, it automatically runs the OCR engine in the background. I suggest you first try from the command line to determine which OCR engine works better with your images, and then configure it. Cuneiform is right for most cases. Note that documents are not indexed in real time: they go into an indexing task queue (see the administration stats) and are indexed periodically by a batch process.

PostPosted:**Sat Apr 20, 2013 7:43 am**

OK. But if I right-click on a file, I don't get any context menu entry named OCR or similar.

PostPosted:**Sun Apr 21, 2013 6:43 pm**

I think there is some confusion: OCR is not the same as zonal OCR. Plain OCR does not need any menu option. Basically, when an image document or a PDF with images is uploaded, OCR automatically tries to extract the text content (nothing else). Zonal OCR, which does come with menu options, is a more advanced feature that allows identifying documents and extracting some parts to metadata (currently this feature is only present in the professional version).

PostPosted:**Tue Apr 23, 2013 10:10 am**

OK, but how can I view the extracted text?
Thanks.

PostPosted:**Tue Apr 23, 2013 7:23 pm**

You need to run a SQL query: select * from OKM_NODE_DOCUMENT

PostPosted:**Wed Apr 24, 2013 10:55 am**

From Administration -> Database query I executed the SQL query, but nothing was returned.

PostPosted:**Thu Apr 25, 2013 9:54 pm**

You have to select jdbc from the list at the bottom right corner.

PostPosted:**Mon May 06, 2013 11:55 am**

Ok!
I've configured the OCR as follows:
system.ocr String /usr/bin/cuneiform ${fileIn} -o ${fileOut}
system.ocr.rotate String 90;180;270;
system.openoffice.dictionary String /mnt/OpenKM/dict-it.(oxt|zip)

I've uploaded a scanned page in tif format with some text on it, but when I run a query against the database I can't find any text.
How can I tell whether the OCR works or not?

PostPosted:**Mon May 06, 2013 1:46 pm**

Testing your ocr:

a) copy the scanned image to your local filesystem
b) run cuneiform OCR on this image: /usr/bin/cuneiform /path/to/scannedfile.xxx -o /path/to/output
c) check the output file and see what it contains

If you get an error or no output file is created, cuneiform is not working on your system and therefore OpenKM has no way of making use of it.
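
The steps a) to c) can also be scripted. This is only a rough sketch: it takes the same command template that is set in system.ocr, fills in the ${fileIn} and ${fileOut} placeholders, runs the command and checks that a non-empty output file appears. The function names are made up for illustration, and the way OpenKM itself expands the placeholders is an assumption here, not something taken from its source.

```python
import shlex
import subprocess
from pathlib import Path


def expand_ocr_command(template: str, file_in: str, file_out: str) -> list[str]:
    """Fill the ${fileIn}/${fileOut} placeholders and split into an argv list.

    Hypothetical helper: it only mimics what the system.ocr template suggests.
    """
    filled = template.replace("${fileIn}", shlex.quote(file_in))
    filled = filled.replace("${fileOut}", shlex.quote(file_out))
    return shlex.split(filled)


def ocr_produces_text(template: str, file_in: str, file_out: str) -> bool:
    """Run the OCR command; True only if it exits cleanly and wrote some output."""
    argv = expand_ocr_command(template, file_in, file_out)
    try:
        result = subprocess.run(argv, capture_output=True, text=True, timeout=300)
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Binary missing, not executable, or the OCR run hung.
        print(f"could not run {argv[0]}: {exc}")
        return False
    if result.returncode != 0:
        print(f"exit code {result.returncode}: {result.stderr.strip()}")
        return False
    out = Path(file_out)
    return out.is_file() and out.stat().st_size > 0
```

For example, ocr_produces_text("/usr/bin/cuneiform ${fileIn} -o ${fileOut}", "/path/to/scannedfile.tif", "/path/to/output.txt") should return True on a working installation. If it returns False, the printed message is the first thing to look at.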

Also always very helpful: tomcathome/logs/catalina.log
See if there are any errors when OpenKM calls cuneiform
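
Since catalina.log gets long quickly, it can help to filter it down to the warnings and errors logged by the text extractors. A minimal sketch, assuming the line layout that the log excerpts in this thread use (timestamp, [thread], level, logger, message). The function name and the default logger prefix are choices made for this example.

```python
import re

# Matches lines like:
# 2013-05-06 17:40:00,447 [Thread-4848] WARN  com.openkm.extractor.PdfTextExtractor - message
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"\[(?P<thread>[^\]]+)\] "
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<logger>\S+) - (?P<message>.*)$"
)


def extractor_problems(lines, logger_prefix="com.openkm.extractor."):
    """Yield (timestamp, class name, message) for WARN/ERROR lines of matching loggers.

    Lines that do not match the layout (stack trace lines, for example) are skipped.
    """
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if not match:
            continue
        if match["level"] not in {"WARN", "ERROR"}:
            continue
        if not match["logger"].startswith(logger_prefix):
            continue
        yield match["ts"], match["logger"].rsplit(".", 1)[-1], match["message"]
```

To use it, open tomcathome/logs/catalina.log and pass the file object to extractor_problems(), printing each tuple it yields. Passing logger_prefix="com.openkm." widens the filter to all OpenKM classes.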

PostPosted:**Mon May 06, 2013 3:41 pm**

With this command a .txt file is generated with the text inside:
/usr/bin/cuneiform prova.pdf -o prova.txt

When I upload the file prova.pdf, I get this in catalina.log:

Code: Select all

2013-05-06 17:36:15,756 [http-bio-0.0.0.0-8080-exec-219] INFO  com.openkm.servlet.frontend.FileUploadServlet - Filename: 'prova.pdf'
2013-05-06 17:36:15,757 [http-bio-0.0.0.0-8080-exec-219] INFO  com.openkm.servlet.frontend.FileUploadServlet - Upload file 'prova.pdf' into '/okm:root/CIE (15,4 MB)'
2013-05-06 17:36:15,758 [http-bio-0.0.0.0-8080-exec-219] INFO  com.openkm.servlet.frontend.FileUploadServlet - Wizard: {path=, showWizardCategories=false, showWizardKeywords=false, groupsList=[], workflowList=[], hasAutomation=false, error=, digitalSignature=false}
2013-05-06 17:36:27,379 [http-bio-0.0.0.0-8080-exec-219] INFO  com.openkm.servlet.frontend.FileUploadServlet - Wizard: {path=%2Fokm%3Aroot%2FCIE%2Fprova.pdf, showWizardCategories=false, showWizardKeywords=false, groupsList=[], workflowList=[], hasAutomation=false, error=, digitalSignature=false}
2013-05-06 17:36:27,382 [http-bio-0.0.0.0-8080-exec-219] INFO  com.openkm.servlet.frontend.FileUploadServlet - Action: 0, JSON Response: {"hasAutomation":false,"path":"%2Fokm%3Aroot%2FCIE%2Fprova.pdf","groupsList":[],"workflowList":[],"showWizardCategories":false,"showWizardKeywords":false,"digitalSignature":false,"error":""}
2013-05-06 17:40:00,023 [Thread-4848] INFO  com.openkm.extractor.TextExtractorWorker - processSerial.Working on {docUuid=a8e6c061-1fcd-4393-9583-ad403a130925, docPath=/okm:root/CIE/prova.pdf, docVerUuid=5b531734-689d-4e1a-8e89-252f56e25b1e, date=Mon May 06 17:36:25 CEST 2013}
2013-05-06 17:40:00,447 [Thread-4848] WARN  com.openkm.extractor.PdfTextExtractor - PDF does not contains text layer
2013-05-06 17:40:19,709 [Thread-4848] INFO  com.openkm.util.DocumentUtils - Using OpenOffice dictionary: /mnt/OpenKM/dict-it.(oxt|zip)
2013-05-06 17:40:19,714 [Thread-4848] WARN  com.openkm.extractor.CuneiformTextExtractor - IO exception executing command: /usr/bin/cuneiform /mnt/OpenKM/tomcat/temp/Im07239809310552548179.png -o /mnt/OpenKM/tomcat/temp/okm4982710508936566750.txt
java.util.zip.ZipException: error in opening zip file
        at java.util.zip.ZipFile.open(Native Method)
        at java.util.zip.ZipFile.<init>(ZipFile.java:127)
        at java.util.zip.ZipFile.<init>(ZipFile.java:88)
        at com.openkm.util.DocumentUtils.spellChecker(DocumentUtils.java:51)
        at com.openkm.extractor.CuneiformTextExtractor.doOcr(CuneiformTextExtractor.java:150)
        at com.openkm.extractor.PdfTextExtractor.doOcr(PdfTextExtractor.java:141)
        at com.openkm.extractor.PdfTextExtractor.extractText(PdfTextExtractor.java:100)
        at com.openkm.extractor.RegisteredExtractors.getText(RegisteredExtractors.java:211)
        at com.openkm.extractor.RegisteredExtractors.getText(RegisteredExtractors.java:172)
        at com.openkm.dao.NodeDocumentDAO.textExtractorHelper(NodeDocumentDAO.java:1306)
        at com.openkm.extractor.TextExtractorWorker.processSerial(TextExtractorWorker.java:138)
        at com.openkm.extractor.TextExtractorWorker.processQueue(TextExtractorWorker.java:125)
        at com.openkm.extractor.TextExtractorWorker.run(TextExtractorWorker.java:80)
        at sun.reflect.GeneratedMethodAccessor211.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at bsh.Reflect.invokeOnMethod(Unknown Source)
        at bsh.Reflect.invokeObjectMethod(Unknown Source)
        at bsh.BSHPrimarySuffix.doName(Unknown Source)
        at bsh.BSHPrimarySuffix.doSuffix(Unknown Source)
        at bsh.BSHPrimaryExpression.eval(Unknown Source)
        at bsh.BSHPrimaryExpression.eval(Unknown Source)
        at bsh.Interpreter.eval(Unknown Source)
        at bsh.Interpreter.eval(Unknown Source)
        at bsh.Interpreter.eval(Unknown Source)
        at com.openkm.util.ExecutionUtils.runScript(ExecutionUtils.java:112)
        at com.openkm.core.Cron$RunnerBsh.run(Cron.java:103)
        at java.lang.Thread.run(Thread.java:662)
2013-05-06 17:40:19,719 [Thread-4848] WARN  com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/CIE/prova.pdf': Too few text extracted

PostPosted:**Tue May 07, 2013 4:42 am**

I have the same problem. Under /opt/openkm/temp I have 1730 files such as okm9159544668738627006.txt.txt. I can see the tesseract process running with the Linux command "top"; however, when I execute the SQL query there is no extracted text from PDF files. If I disable the OCR and import PDFs with text, they are indexed properly.

PostPosted:**Wed May 08, 2013 8:01 pm**

It seems you have a dictionary configured and its configuration is wrong. Remove it and test again. Then try to configure the dictionary correctly.
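
The stack trace earlier in the thread fails with a java.util.zip.ZipException inside DocumentUtils.spellChecker, and the log line just before it shows the dictionary path being used literally as /mnt/OpenKM/dict-it.(oxt|zip). A quick way to check a candidate dictionary file before putting it into system.openoffice.dictionary could look like the sketch below. The function name is made up, and the idea that the setting must point to one real zip-format file is an inference from that stack trace, not from OpenKM documentation.

```python
import zipfile
from pathlib import Path


def check_dictionary(path: str) -> str:
    """Classify a dictionary path as 'missing', 'not a zip archive' or 'ok'."""
    candidate = Path(path)
    if not candidate.is_file():
        # Covers a path copied with a literal "(oxt|zip)" in it.
        return "missing"
    if not zipfile.is_zipfile(str(candidate)):
        return "not a zip archive"
    return "ok"
```

For example, check_dictionary("/mnt/OpenKM/dict-it.oxt") returns "ok" only if that exact file exists and can be opened as a zip archive.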

PostPosted:**Sun May 12, 2013 10:30 pm**

joako wrote: I have the same problem. Under /opt/openkm/temp I have 1730 files such as okm9159544668738627006.txt.txt. I can see the tesseract process running with the Linux command "top"; however, when I execute the SQL query there is no extracted text from PDF files. If I disable the OCR and import PDFs with text, they are indexed properly.

This is because the configured OCR executable does not match the text extractor. I think you have configured the CuneiformTextExtractor but are using the tesseract executable.

PostPosted:**Thu May 16, 2013 3:22 pm**

I removed the dictionary configuration, and now if I scan a page to PDF with the scanner I get this message:

Code: Select all

2013-05-16 17:09:09,694 [http-bio-0.0.0.0-8080-exec-248] INFO  com.openkm.servlet.frontend.FileUploadServlet - Filename: 'prova2.pdf'
2013-05-16 17:09:09,697 [http-bio-0.0.0.0-8080-exec-248] INFO  com.openkm.servlet.frontend.FileUploadServlet - Upload file 'prova2.pdf' into '/okm:root (281,6 KB)'
2013-05-16 17:09:09,698 [http-bio-0.0.0.0-8080-exec-248] INFO  com.openkm.servlet.frontend.FileUploadServlet - Wizard: {path=, showWizardCategories=false, showWizardKeywords=false, groupsList=[], workflowList=[], hasAutomation=false, error=, digitalSignature=false}
2013-05-16 17:09:20,185 [http-bio-0.0.0.0-8080-exec-248] INFO  com.openkm.servlet.frontend.FileUploadServlet - Wizard: {path=%2Fokm%3Aroot%2Fprova2.pdf, showWizardCategories=false, showWizardKeywords=false, groupsList=[], workflowList=[], hasAutomation=false, error=, digitalSignature=false}
2013-05-16 17:09:20,186 [http-bio-0.0.0.0-8080-exec-248] INFO  com.openkm.servlet.frontend.FileUploadServlet - Action: 0, JSON Response: {"hasAutomation":false,"path":"%2Fokm%3Aroot%2Fprova2.pdf","groupsList":[],"workflowList":[],"showWizardCategories":false,"showWizardKeywords":false,"digitalSignature":false,"error":""}
2013-05-16 17:10:00,033 [Thread-11347] INFO  com.openkm.extractor.TextExtractorWorker - processSerial.Working on {docUuid=d9e76b01-b4a4-4751-a444-f7d9a105e215, docPath=/okm:root/prova2.pdf, docVerUuid=3407f033-a3f1-4889-b762-a6f86b2abdec, date=Thu May 16 17:09:19 CEST 2013}
2013-05-16 17:10:00,048 [Thread-11347] WARN  com.openkm.extractor.PdfTextExtractor - PDF does not contains text layer

But if I scan a page in tif format, OCR works!
This is the log:

Code: Select all

2013-05-16 17:14:51,275 [http-bio-0.0.0.0-8080-exec-251] INFO  com.openkm.servlet.frontend.FileUploadServlet - Filename: 'prova3.tif'
2013-05-16 17:14:51,277 [http-bio-0.0.0.0-8080-exec-251] INFO  com.openkm.servlet.frontend.FileUploadServlet - Upload file 'prova3.tif' into '/okm:root (12,1 MB)'
2013-05-16 17:14:51,278 [http-bio-0.0.0.0-8080-exec-251] INFO  com.openkm.servlet.frontend.FileUploadServlet - Wizard: {path=, showWizardCategories=false, showWizardKeywords=false, groupsList=[], workflowList=[], hasAutomation=false, error=, digitalSignature=false}
2013-05-16 17:15:03,024 [http-bio-0.0.0.0-8080-exec-251] INFO  com.openkm.servlet.frontend.FileUploadServlet - Wizard: {path=%2Fokm%3Aroot%2Fprova3.tif, showWizardCategories=false, showWizardKeywords=false, groupsList=[], workflowList=[], hasAutomation=false, error=, digitalSignature=false}
2013-05-16 17:15:03,026 [http-bio-0.0.0.0-8080-exec-251] INFO  com.openkm.servlet.frontend.FileUploadServlet - Action: 0, JSON Response: {"hasAutomation":false,"path":"%2Fokm%3Aroot%2Fprova3.tif","groupsList":[],"workflowList":[],"showWizardCategories":false,"showWizardKeywords":false,"digitalSignature":false,"error":""}
2013-05-16 17:20:00,023 [Thread-11355] INFO  com.openkm.extractor.TextExtractorWorker - processSerial.Working on {docUuid=344b9d9e-34e2-4dc9-9078-b8d9ad66f085, docPath=/okm:root/prova3.tif, docVerUuid=f2e2db1f-3aa5-45c2-9ec8-1a3120c07d98, date=Thu May 16 17:15:01 CEST 2013}
2013-05-16 17:20:03,340 [Thread-11355] INFO  com.openkm.extractor.CuneiformTextExtractor - Rotate image 90 degrees
2013-05-16 17:20:04,232 [Thread-11355] INFO  com.openkm.extractor.CuneiformTextExtractor - Rotate image 180 degrees
2013-05-16 17:20:05,054 [Thread-11355] INFO  com.openkm.extractor.CuneiformTextExtractor - Rotate image 270 degrees
2013-05-16 17:21:12,527 [http-bio-0.0.0.0-8080-exec-239] INFO  com.openkm.dao.SearchDAO - findBySimpleQuery(Enzo AND context:okm_root, 0, 10)
2013-05-16 17:21:12,530 [http-bio-0.0.0.0-8080-exec-239] INFO  com.openkm.dao.SearchDAO - findBySimpleQuery.query: +text:enzo +context:okm_root

Do you have any ideas?

How to use OCR
