How to use OCR
Posted: Thu Apr 18, 2013 8:29 am
by quelo1972
I've installed OpenKM 6.2.3 (build: 7945) on an Ubuntu Server 12.04.2 LTS (Precise Pangolin) machine.
I've installed cuneiform and configured it as follows:
system.ocr String /usr/bin/cuneiform ${fileIn} ${fileOut}
system.ocr.rotate String 90;180;270;
system.openoffice.dictionary String /mnt/OpenKM/dict-it.oxt
But I cannot see any OCR button, or any menu entry called OCR or similar.
How can I use this feature?
Thanks in advance for the help.
Re: How to use OCR
Posted: Fri Apr 19, 2013 6:20 pm
by jllort
Basically, each time OpenKM finds an image, it automatically runs the OCR engine in the background. I suggest first trying from the command line to determine which OCR engine works better with your images, and then configuring it. Cuneiform is right for most cases. Note that documents are not indexed in real time: they go into an indexing task queue (see the statistics in Administration) and are indexed periodically by a batch process.
Re: How to use OCR
Posted: Sat Apr 20, 2013 7:43 am
by quelo1972
OK. But when I right-click on a file, there is no context menu entry named OCR or similar.
Re: How to use OCR
Posted: Sun Apr 21, 2013 6:43 pm
by jllort
I think you're confused: OCR is not exactly the same as zonal OCR. OCR as such does not need any menu option; basically, when an image document or a PDF with images is uploaded, OCR automatically tries to extract the text content (nothing else). Zonal OCR, which does come with some menu options, is a more advanced feature that identifies documents and extracts some parts into metadata (currently this feature is only present in the professional version).
Re: How to use OCR
Posted: Tue Apr 23, 2013 10:10 am
by quelo1972
OK, but how can I view the extracted text?
Thanks.
Re: How to use OCR
Posted: Tue Apr 23, 2013 7:23 pm
by pavila
You need to run a SQL query: select * from OKM_NODE_DOCUMENT
Re: How to use OCR
Posted: Wed Apr 24, 2013 10:55 am
by quelo1972
From Administration -> Database query I execute the SQL query, but nothing is returned.
Re: How to use OCR
Posted: Thu Apr 25, 2013 9:54 pm
by jllort
You have to select JDBC from the list at the bottom right corner.
Re: How to use OCR
Posted: Mon May 06, 2013 11:55 am
by quelo1972
OK!
I've configured the OCR as follows:
system.ocr String /usr/bin/cuneiform ${fileIn} -o ${fileOut}
system.ocr.rotate String 90;180;270;
system.openoffice.dictionary String /mnt/OpenKM/dict-it.(oxt|zip)
I've uploaded a scanned page in TIF format with some text, but when I run a query in the database I can't find any text.
How can I tell whether the OCR works or not?
Re: How to use OCR
Posted: Mon May 06, 2013 1:46 pm
by anyonebutnoone
Testing your OCR:
a) Copy the scanned image to your local filesystem.
b) Run cuneiform on this image: /usr/bin/cuneiform /path/to/scannedfile.xxx -o /path/to/output
c) Check the output file and see what it contains.
If you get an error or no output file is created, cuneiform is not working on your system and therefore OpenKM has no way of making use of it.
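The manual test above can also be scripted. This is a minimal sketch, assuming the command template uses the same ${fileIn} and ${fileOut} placeholders as the system.ocr setting. The helper names build_ocr_command and run_ocr_check are hypothetical and not part of OpenKM, and the script only mirrors the manual test rather than what OpenKM does internally.

```python
import os
import shlex
import subprocess


def build_ocr_command(template, file_in, file_out):
    """Expand the ${fileIn}/${fileOut} placeholders into an argument list."""
    # Split first, substitute afterwards, so paths containing spaces stay intact.
    args = shlex.split(template)
    return [a.replace("${fileIn}", file_in).replace("${fileOut}", file_out)
            for a in args]


def run_ocr_check(template, file_in, file_out):
    """Run the OCR command and report whether it produced non-empty output.

    Returns a tuple (ok, message).
    """
    cmd = build_ocr_command(template, file_in, file_out)
    try:
        result = subprocess.run(cmd, capture_output=True, text=True)
    except OSError as exc:
        # Raised, for example, when the executable is missing or not executable.
        return False, "could not run %s: %s" % (cmd[0], exc)
    if result.returncode != 0:
        return False, "exit code %d: %s" % (result.returncode, result.stderr.strip())
    if not os.path.isfile(file_out) or os.path.getsize(file_out) == 0:
        return False, "no output file or empty output: " + file_out
    return True, "output written to " + file_out
```

Calling run_ocr_check("/usr/bin/cuneiform ${fileIn} -o ${fileOut}", "/path/to/scannedfile.xxx", "/path/to/output") corresponds to steps b) and c).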
Also always very helpful: tomcathome/logs/catalina.log
See if there are any errors when OpenKM calls cuneiform
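To narrow catalina.log down to the text extraction entries, a small filter may help. This is a sketch, assuming log lines follow the layout "date time [thread] LEVEL logger - message" and that the relevant loggers are the com.openkm.extractor classes plus com.openkm.dao.NodeDocumentDAO. The function name extractor_problems is hypothetical.

```python
import re

# Assumed layout, for example:
# 2013-05-06 17:40:00,447 [Thread-4848] WARN com.openkm.extractor.PdfTextExtractor - message
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"\[(?P<thread>[^\]]+)\] +(?P<level>[A-Z]+) +"
    r"(?P<logger>\S+) - (?P<message>.*)$"
)

# Logger name prefixes considered relevant to OCR and text extraction (an assumption).
DEFAULT_LOGGERS = ("com.openkm.extractor.", "com.openkm.dao.NodeDocumentDAO")


def extractor_problems(lines, loggers=DEFAULT_LOGGERS):
    """Yield (timestamp, logger, message) for WARN/ERROR entries from the given loggers."""
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if not match:
            continue  # stack trace lines and other continuation lines are skipped
        if match["level"] in ("WARN", "ERROR") and match["logger"].startswith(loggers):
            yield match["ts"], match["logger"], match["message"]
```

The function accepts any iterable of lines, so an open file object for tomcathome/logs/catalina.log can be passed directly.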
Re: How to use OCR
Posted: Mon May 06, 2013 3:41 pm
by quelo1972
With this command a .txt file is generated with the text inside:
/usr/bin/cuneiform prova.pdf -o prova.txt
When I upload the file prova.pdf I get this in catalina.log:
2013-05-06 17:36:15,756 [http-bio-0.0.0.0-8080-exec-219] INFO com.openkm.servlet.frontend.FileUploadServlet - Filename: 'prova.pdf'
2013-05-06 17:36:15,757 [http-bio-0.0.0.0-8080-exec-219] INFO com.openkm.servlet.frontend.FileUploadServlet - Upload file 'prova.pdf' into '/okm:root/CIE (15,4 MB)'
2013-05-06 17:36:15,758 [http-bio-0.0.0.0-8080-exec-219] INFO com.openkm.servlet.frontend.FileUploadServlet - Wizard: {path=, showWizardCategories=false, showWizardKeywords=false, groupsList=[], workflowList=[], hasAutomation=false, error=, digitalSignature=false}
2013-05-06 17:36:27,379 [http-bio-0.0.0.0-8080-exec-219] INFO com.openkm.servlet.frontend.FileUploadServlet - Wizard: {path=%2Fokm%3Aroot%2FCIE%2Fprova.pdf, showWizardCategories=false, showWizardKeywords=false, groupsList=[], workflowList=[], hasAutomation=false, error=, digitalSignature=false}
2013-05-06 17:36:27,382 [http-bio-0.0.0.0-8080-exec-219] INFO com.openkm.servlet.frontend.FileUploadServlet - Action: 0, JSON Response: {"hasAutomation":false,"path":"%2Fokm%3Aroot%2FCIE%2Fprova.pdf","groupsList":[],"workflowList":[],"showWizardCategories":false,"showWizardKeywords":false,"digitalSignature":false,"error":""}
2013-05-06 17:40:00,023 [Thread-4848] INFO com.openkm.extractor.TextExtractorWorker - processSerial.Working on {docUuid=a8e6c061-1fcd-4393-9583-ad403a130925, docPath=/okm:root/CIE/prova.pdf, docVerUuid=5b531734-689d-4e1a-8e89-252f56e25b1e, date=Mon May 06 17:36:25 CEST 2013}
2013-05-06 17:40:00,447 [Thread-4848] WARN com.openkm.extractor.PdfTextExtractor - PDF does not contains text layer
2013-05-06 17:40:19,709 [Thread-4848] INFO com.openkm.util.DocumentUtils - Using OpenOffice dictionary: /mnt/OpenKM/dict-it.(oxt|zip)
2013-05-06 17:40:19,714 [Thread-4848] WARN com.openkm.extractor.CuneiformTextExtractor - IO exception executing command: /usr/bin/cuneiform /mnt/OpenKM/tomcat/temp/Im07239809310552548179.png -o /mnt/OpenKM/tomcat/temp/okm4982710508936566750.txt
java.util.zip.ZipException: error in opening zip file
at java.util.zip.ZipFile.open(Native Method)
at java.util.zip.ZipFile.<init>(ZipFile.java:127)
at java.util.zip.ZipFile.<init>(ZipFile.java:88)
at com.openkm.util.DocumentUtils.spellChecker(DocumentUtils.java:51)
at com.openkm.extractor.CuneiformTextExtractor.doOcr(CuneiformTextExtractor.java:150)
at com.openkm.extractor.PdfTextExtractor.doOcr(PdfTextExtractor.java:141)
at com.openkm.extractor.PdfTextExtractor.extractText(PdfTextExtractor.java:100)
at com.openkm.extractor.RegisteredExtractors.getText(RegisteredExtractors.java:211)
at com.openkm.extractor.RegisteredExtractors.getText(RegisteredExtractors.java:172)
at com.openkm.dao.NodeDocumentDAO.textExtractorHelper(NodeDocumentDAO.java:1306)
at com.openkm.extractor.TextExtractorWorker.processSerial(TextExtractorWorker.java:138)
at com.openkm.extractor.TextExtractorWorker.processQueue(TextExtractorWorker.java:125)
at com.openkm.extractor.TextExtractorWorker.run(TextExtractorWorker.java:80)
at sun.reflect.GeneratedMethodAccessor211.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at bsh.Reflect.invokeOnMethod(Unknown Source)
at bsh.Reflect.invokeObjectMethod(Unknown Source)
at bsh.BSHPrimarySuffix.doName(Unknown Source)
at bsh.BSHPrimarySuffix.doSuffix(Unknown Source)
at bsh.BSHPrimaryExpression.eval(Unknown Source)
at bsh.BSHPrimaryExpression.eval(Unknown Source)
at bsh.Interpreter.eval(Unknown Source)
at bsh.Interpreter.eval(Unknown Source)
at bsh.Interpreter.eval(Unknown Source)
at com.openkm.util.ExecutionUtils.runScript(ExecutionUtils.java:112)
at com.openkm.core.Cron$RunnerBsh.run(Cron.java:103)
at java.lang.Thread.run(Thread.java:662)
2013-05-06 17:40:19,719 [Thread-4848] WARN com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/CIE/prova.pdf': Too few text extracted
Re: How to use OCR
Posted: Tue May 07, 2013 4:42 am
by joako
I have the same problem. Under /opt/openkm/temp I have 1730 files such as okm9159544668738627006.txt.txt. I can see the tesseract process running with the Linux command "top"; however, when I execute the SQL query there is no extracted text from PDF files. If I disable OCR and import PDFs with text, they are indexed properly.
Re: How to use OCR
Posted: Wed May 08, 2013 8:01 pm
by jllort
It seems you have a dictionary configured and its configuration is wrong. Remove it and test again. Then try to configure the dictionary correctly.
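The ZipException in the stack trace is raised while DocumentUtils.spellChecker opens the dictionary, which suggests the configured path is not a readable zip archive (an .oxt file is a zip archive). A quick check, as a sketch: the function name check_dictionary is hypothetical, and this is not how OpenKM itself validates the setting.

```python
import os
import zipfile


def check_dictionary(path):
    """Return (ok, reason) describing whether path looks like a usable zip/.oxt archive."""
    if not os.path.isfile(path):
        return False, "not an existing file: " + path
    if not zipfile.is_zipfile(path):
        return False, "file exists but is not a zip archive: " + path
    with zipfile.ZipFile(path) as archive:
        names = archive.namelist()
    return True, "zip archive with %d entries" % len(names)
```

This check treats the path as a literal file name, so a value such as /mnt/OpenKM/dict-it.(oxt|zip) fails unless a file with exactly that name exists.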
Re: How to use OCR
Posted: Sun May 12, 2013 10:30 pm
by pavila
joako wrote: I have the same problem. Under /opt/openkm/temp I have 1730 files such as okm9159544668738627006.txt.txt. I can see the tesseract process running with the Linux command "top"; however, when I execute the SQL query there is no extracted text from PDF files. If I disable OCR and import PDFs with text, they are indexed properly.
This is because the configured OCR executable does not match the text extractor. I think you have configured the CuneiformTextExtractor but are using the tesseract executable.
Re: How to use OCR
Posted: Thu May 16, 2013 3:22 pm
by quelo1972
I removed the dictionary configuration, and now if I scan a page to PDF with the scanner I get this message:
2013-05-16 17:09:09,694 [http-bio-0.0.0.0-8080-exec-248] INFO com.openkm.servlet.frontend.FileUploadServlet - Filename: 'prova2.pdf'
2013-05-16 17:09:09,697 [http-bio-0.0.0.0-8080-exec-248] INFO com.openkm.servlet.frontend.FileUploadServlet - Upload file 'prova2.pdf' into '/okm:root (281,6 KB)'
2013-05-16 17:09:09,698 [http-bio-0.0.0.0-8080-exec-248] INFO com.openkm.servlet.frontend.FileUploadServlet - Wizard: {path=, showWizardCategories=false, showWizardKeywords=false, groupsList=[], workflowList=[], hasAutomation=false, error=, digitalSignature=false}
2013-05-16 17:09:20,185 [http-bio-0.0.0.0-8080-exec-248] INFO com.openkm.servlet.frontend.FileUploadServlet - Wizard: {path=%2Fokm%3Aroot%2Fprova2.pdf, showWizardCategories=false, showWizardKeywords=false, groupsList=[], workflowList=[], hasAutomation=false, error=, digitalSignature=false}
2013-05-16 17:09:20,186 [http-bio-0.0.0.0-8080-exec-248] INFO com.openkm.servlet.frontend.FileUploadServlet - Action: 0, JSON Response: {"hasAutomation":false,"path":"%2Fokm%3Aroot%2Fprova2.pdf","groupsList":[],"workflowList":[],"showWizardCategories":false,"showWizardKeywords":false,"digitalSignature":false,"error":""}
2013-05-16 17:10:00,033 [Thread-11347] INFO com.openkm.extractor.TextExtractorWorker - processSerial.Working on {docUuid=d9e76b01-b4a4-4751-a444-f7d9a105e215, docPath=/okm:root/prova2.pdf, docVerUuid=3407f033-a3f1-4889-b762-a6f86b2abdec, date=Thu May 16 17:09:19 CEST 2013}
2013-05-16 17:10:00,048 [Thread-11347] WARN com.openkm.extractor.PdfTextExtractor - PDF does not contains text layer
but if I scan a page in TIF format, OCR works!
This is the log:
2013-05-16 17:14:51,275 [http-bio-0.0.0.0-8080-exec-251] INFO com.openkm.servlet.frontend.FileUploadServlet - Filename: 'prova3.tif'
2013-05-16 17:14:51,277 [http-bio-0.0.0.0-8080-exec-251] INFO com.openkm.servlet.frontend.FileUploadServlet - Upload file 'prova3.tif' into '/okm:root (12,1 MB)'
2013-05-16 17:14:51,278 [http-bio-0.0.0.0-8080-exec-251] INFO com.openkm.servlet.frontend.FileUploadServlet - Wizard: {path=, showWizardCategories=false, showWizardKeywords=false, groupsList=[], workflowList=[], hasAutomation=false, error=, digitalSignature=false}
2013-05-16 17:15:03,024 [http-bio-0.0.0.0-8080-exec-251] INFO com.openkm.servlet.frontend.FileUploadServlet - Wizard: {path=%2Fokm%3Aroot%2Fprova3.tif, showWizardCategories=false, showWizardKeywords=false, groupsList=[], workflowList=[], hasAutomation=false, error=, digitalSignature=false}
2013-05-16 17:15:03,026 [http-bio-0.0.0.0-8080-exec-251] INFO com.openkm.servlet.frontend.FileUploadServlet - Action: 0, JSON Response: {"hasAutomation":false,"path":"%2Fokm%3Aroot%2Fprova3.tif","groupsList":[],"workflowList":[],"showWizardCategories":false,"showWizardKeywords":false,"digitalSignature":false,"error":""}
2013-05-16 17:20:00,023 [Thread-11355] INFO com.openkm.extractor.TextExtractorWorker - processSerial.Working on {docUuid=344b9d9e-34e2-4dc9-9078-b8d9ad66f085, docPath=/okm:root/prova3.tif, docVerUuid=f2e2db1f-3aa5-45c2-9ec8-1a3120c07d98, date=Thu May 16 17:15:01 CEST 2013}
2013-05-16 17:20:03,340 [Thread-11355] INFO com.openkm.extractor.CuneiformTextExtractor - Rotate image 90 degrees
2013-05-16 17:20:04,232 [Thread-11355] INFO com.openkm.extractor.CuneiformTextExtractor - Rotate image 180 degrees
2013-05-16 17:20:05,054 [Thread-11355] INFO com.openkm.extractor.CuneiformTextExtractor - Rotate image 270 degrees
2013-05-16 17:21:12,527 [http-bio-0.0.0.0-8080-exec-239] INFO com.openkm.dao.SearchDAO - findBySimpleQuery(Enzo AND context:okm_root, 0, 10)
2013-05-16 17:21:12,530 [http-bio-0.0.0.0-8080-exec-239] INFO com.openkm.dao.SearchDAO - findBySimpleQuery.query: +text:enzo +context:okm_root
Do you have any ideas?