Open Source Document Management System | OpenKM

How to use OCR

#22562 by quelo1972
Thu Apr 18, 2013 8:29 am

I've installed OpenKM 6.2.3 (build: 7945) on Ubuntu Server 12.04.2 LTS (Precise Pangolin).

I've installed cuneiform and configured it as follows:
system.ocr String /usr/bin/cuneiform ${fileIn} ${fileOut}
system.ocr.rotate String 90;180;270;
system.openoffice.dictionary String /mnt/OpenKM/dict-it.oxt

But I cannot see any OCR button or any menu entry called OCR or similar.

How can I use this feature?

Thanks in advance for the help.

Username

quelo1972

Rank

Fresh Boarder

Posts

17

Joined

Mon Apr 15, 2013 2:28 pm

Re: How to use OCR

#22587 by jllort
Fri Apr 19, 2013 6:20 pm

Basically, each time OpenKM finds an image, it automatically runs the OCR engine in the background. I suggest you first test from the command line to determine which OCR engine works best with your images, and then configure it. Cuneiform is the right choice in most cases. Note that documents are not indexed in real time: they go into an indexing task queue (see the statistics in Administration) and are indexed periodically by a batch process.

Username

jllort

Rank

Moderator

Posts

12126

Joined

Fri Dec 21, 2007 11:23 am

Location

Sineu - ( Illes Balears ) - Spain

Re: How to use OCR

#22596 by quelo1972
Sat Apr 20, 2013 7:43 am

OK. But when I right-click on a file, there is no context menu entry named OCR or similar.

Re: How to use OCR

#22608 by jllort
Sun Apr 21, 2013 6:43 pm

I think there is some confusion: OCR is not the same as zonal OCR. Plain OCR does not need any menu option. When an image document, or a PDF containing images, is uploaded, OCR automatically tries to extract the text contents (nothing else). Zonal OCR, which does come with menu options, is a more advanced feature that identifies documents and extracts parts of them into metadata (currently this feature is only present in the professional version).

Re: How to use OCR

#22648 by quelo1972
Tue Apr 23, 2013 10:10 am

OK, but how can I view the extracted text?
Thanks.

Re: How to use OCR

#22663 by pavila
Tue Apr 23, 2013 7:23 pm

You need to run a SQL query: select * from OKM_NODE_DOCUMENT

Username

pavila

Rank

Moderator

Posts

3142

Joined

Tue Dec 11, 2007 6:02 pm

Location

Alicante, Spain

Re: How to use OCR

#22684 by quelo1972
Wed Apr 24, 2013 10:55 am

From Administration -> Database query I executed the SQL query, but nothing was returned.

Re: How to use OCR

#22713 by jllort
Thu Apr 25, 2013 9:54 pm

You have to select JDBC from the list at the bottom right corner.

Re: How to use OCR

#22886 by quelo1972
Mon May 06, 2013 11:55 am

OK!
I've now configured OCR as follows:
system.ocr String /usr/bin/cuneiform ${fileIn} -o ${fileOut}
system.ocr.rotate String 90;180;270;
system.openoffice.dictionary String /mnt/OpenKM/dict-it.(oxt|zip)

I've uploaded a scanned page in TIF format with some text on it, but when I run a query against the database I can't find any text.
How can I tell whether OCR is working or not?

Re: How to use OCR

#22889 by anyonebutnoone
Mon May 06, 2013 1:46 pm

Testing your OCR:

a) copy the scanned image to your local filesystem
b) run cuneiform OCR on this image: /usr/bin/cuneiform /path/to/scannedfile.xxx -o /path/to/output
c) check the output file and see what it contains

If you get an error, or no output file is created, then cuneiform is not working on your system and therefore OpenKM has no way of making use of it.

Also always very helpful: tomcathome/logs/catalina.log
Check whether there are any errors when OpenKM calls cuneiform.
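
Steps a) to c) above can be scripted. The following is only a rough Python sketch, not anything shipped with OpenKM: it expands the ${fileIn} and ${fileOut} placeholders of a system.ocr value into an argument list, runs the command, and reports whether a non-empty output file was produced. The function names are made up for illustration, and the way the template is split into arguments is an assumption about how such a command line is usually tokenized, not a description of what OpenKM does internally.

```python
import os
import shlex
import subprocess
from string import Template
from typing import List


def build_ocr_command(template: str, file_in: str, file_out: str) -> List[str]:
    """Expand ${fileIn} / ${fileOut} in a system.ocr value into an argv list.

    The template is split first and substituted afterwards, so that paths
    containing spaces stay in a single argument.
    """
    tokens = shlex.split(template)
    return [Template(tok).substitute(fileIn=file_in, fileOut=file_out)
            for tok in tokens]


def ocr_produces_text(template: str, image_path: str, output_path: str) -> bool:
    """Run the OCR command once and check that it wrote a non-empty file."""
    cmd = build_ocr_command(template, image_path, output_path)
    try:
        result = subprocess.run(cmd, capture_output=True, text=True)
    except OSError as exc:
        # Typically: the executable does not exist or is not executable.
        print("Could not start OCR command:", exc)
        return False
    if result.returncode != 0:
        print("OCR command failed:", result.stderr.strip())
        return False
    return os.path.isfile(output_path) and os.path.getsize(output_path) > 0
```

For example, ocr_produces_text("/usr/bin/cuneiform ${fileIn} -o ${fileOut}", "/path/to/scannedfile.xxx", "/path/to/output") would run the same command as step b) and return False if cuneiform fails or writes nothing.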

Username

anyonebutnoone

Rank

Fresh Boarder

Posts

15

Joined

Sat Apr 13, 2013 11:00 am

Re: How to use OCR

#22890 by quelo1972
Mon May 06, 2013 3:41 pm

With this command, a .txt file is generated with the text inside:
/usr/bin/cuneiform prova.pdf -o prova.txt

When I upload the file prova.pdf, I get this in catalina.log:

2013-05-06 17:36:15,756 [http-bio-0.0.0.0-8080-exec-219] INFO  com.openkm.servlet.frontend.FileUploadServlet - Filename: 'prova.pdf'
2013-05-06 17:36:15,757 [http-bio-0.0.0.0-8080-exec-219] INFO  com.openkm.servlet.frontend.FileUploadServlet - Upload file 'prova.pdf' into '/okm:root/CIE (15,4 MB)'
2013-05-06 17:36:15,758 [http-bio-0.0.0.0-8080-exec-219] INFO  com.openkm.servlet.frontend.FileUploadServlet - Wizard: {path=, showWizardCategories=false, showWizardKeywords=false, groupsList=[], workflowList=[], hasAutomation=false, error=, digitalSignature=false}
2013-05-06 17:36:27,379 [http-bio-0.0.0.0-8080-exec-219] INFO  com.openkm.servlet.frontend.FileUploadServlet - Wizard: {path=%2Fokm%3Aroot%2FCIE%2Fprova.pdf, showWizardCategories=false, showWizardKeywords=false, groupsList=[], workflowList=[], hasAutomation=false, error=, digitalSignature=false}
2013-05-06 17:36:27,382 [http-bio-0.0.0.0-8080-exec-219] INFO  com.openkm.servlet.frontend.FileUploadServlet - Action: 0, JSON Response: {"hasAutomation":false,"path":"%2Fokm%3Aroot%2FCIE%2Fprova.pdf","groupsList":[],"workflowList":[],"showWizardCategories":false,"showWizardKeywords":false,"digitalSignature":false,"error":""}
2013-05-06 17:40:00,023 [Thread-4848] INFO  com.openkm.extractor.TextExtractorWorker - processSerial.Working on {docUuid=a8e6c061-1fcd-4393-9583-ad403a130925, docPath=/okm:root/CIE/prova.pdf, docVerUuid=5b531734-689d-4e1a-8e89-252f56e25b1e, date=Mon May 06 17:36:25 CEST 2013}
2013-05-06 17:40:00,447 [Thread-4848] WARN  com.openkm.extractor.PdfTextExtractor - PDF does not contains text layer
2013-05-06 17:40:19,709 [Thread-4848] INFO  com.openkm.util.DocumentUtils - Using OpenOffice dictionary: /mnt/OpenKM/dict-it.(oxt|zip)
2013-05-06 17:40:19,714 [Thread-4848] WARN  com.openkm.extractor.CuneiformTextExtractor - IO exception executing command: /usr/bin/cuneiform /mnt/OpenKM/tomcat/temp/Im07239809310552548179.png -o /mnt/OpenKM/tomcat/temp/okm4982710508936566750.txt
java.util.zip.ZipException: error in opening zip file
        at java.util.zip.ZipFile.open(Native Method)
        at java.util.zip.ZipFile.<init>(ZipFile.java:127)
        at java.util.zip.ZipFile.<init>(ZipFile.java:88)
        at com.openkm.util.DocumentUtils.spellChecker(DocumentUtils.java:51)
        at com.openkm.extractor.CuneiformTextExtractor.doOcr(CuneiformTextExtractor.java:150)
        at com.openkm.extractor.PdfTextExtractor.doOcr(PdfTextExtractor.java:141)
        at com.openkm.extractor.PdfTextExtractor.extractText(PdfTextExtractor.java:100)
        at com.openkm.extractor.RegisteredExtractors.getText(RegisteredExtractors.java:211)
        at com.openkm.extractor.RegisteredExtractors.getText(RegisteredExtractors.java:172)
        at com.openkm.dao.NodeDocumentDAO.textExtractorHelper(NodeDocumentDAO.java:1306)
        at com.openkm.extractor.TextExtractorWorker.processSerial(TextExtractorWorker.java:138)
        at com.openkm.extractor.TextExtractorWorker.processQueue(TextExtractorWorker.java:125)
        at com.openkm.extractor.TextExtractorWorker.run(TextExtractorWorker.java:80)
        at sun.reflect.GeneratedMethodAccessor211.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at bsh.Reflect.invokeOnMethod(Unknown Source)
        at bsh.Reflect.invokeObjectMethod(Unknown Source)
        at bsh.BSHPrimarySuffix.doName(Unknown Source)
        at bsh.BSHPrimarySuffix.doSuffix(Unknown Source)
        at bsh.BSHPrimaryExpression.eval(Unknown Source)
        at bsh.BSHPrimaryExpression.eval(Unknown Source)
        at bsh.Interpreter.eval(Unknown Source)
        at bsh.Interpreter.eval(Unknown Source)
        at bsh.Interpreter.eval(Unknown Source)
        at com.openkm.util.ExecutionUtils.runScript(ExecutionUtils.java:112)
        at com.openkm.core.Cron$RunnerBsh.run(Cron.java:103)
        at java.lang.Thread.run(Thread.java:662)
2013-05-06 17:40:19,719 [Thread-4848] WARN  com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/CIE/prova.pdf': Too few text extracted
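
In a long catalina.log the interesting lines are easy to miss among the upload messages. As a rough sketch (the regular expression is written only from the line layout visible in the log above, and the function name is invented for illustration), the following Python snippet keeps just the WARN and ERROR entries logged by com.openkm classes and skips stack-trace lines:

```python
import re
from typing import Iterable, List, Tuple

# Layout assumed from the log above:
# "2013-05-06 17:40:00,447 [Thread-4848] WARN  com.openkm.... - message"
LOG_LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "  # timestamp
    r"\[([^\]]+)\] "                                    # thread name
    r"(\w+)\s+"                                         # level
    r"(\S+) - "                                         # logger (class name)
    r"(.*)$"                                            # message
)


def extractor_warnings(lines: Iterable[str]) -> List[Tuple[str, str, str]]:
    """Return (timestamp, logger, message) for WARN/ERROR lines from OpenKM."""
    found = []
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match is None:
            continue  # stack-trace lines and anything else not in this layout
        timestamp, _thread, level, logger, message = match.groups()
        if level in ("WARN", "ERROR") and logger.startswith("com.openkm."):
            found.append((timestamp, logger, message))
    return found
```

Because a file object yields lines, an open catalina.log can be passed directly as the lines argument.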

Re: How to use OCR

#22898 by joako
Tue May 07, 2013 4:42 am

I have the same problem. Under /opt/openkm/temp I have 1730 files such as okm9159544668738627006.txt.txt. I can see the tesseract process running with the Linux command "top"; however, when I execute the SQL query there is no extracted text from PDF files. If I disable OCR and import PDFs that already contain text, they are indexed properly.

Username

joako

Rank

Expert Boarder

Posts

92

Joined

Wed Feb 23, 2011 5:31 am

Re: How to use OCR

#22919 by jllort
Wed May 08, 2013 8:01 pm

It seems you have a dictionary configured and its configuration is wrong. Remove it and test again. Then try to configure the dictionary correctly.
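
The stack trace posted earlier fails in java.util.zip.ZipFile while DocumentUtils.spellChecker is opening the configured dictionary, which suggests the path must point at a real zip archive (an .oxt extension is a zip file). A quick way to check a candidate path before putting it into system.openoffice.dictionary is sketched below in Python. This is an illustrative helper with an invented name, not part of OpenKM, and it only checks that the file exists and is a zip, not that its contents are a usable dictionary.

```python
import os
import zipfile


def check_dictionary(path: str) -> str:
    """Classify a dictionary path as 'missing', 'not a zip archive' or 'ok'."""
    if not os.path.isfile(path):
        # A value such as "dict-it.(oxt|zip)" taken literally ends up here
        # unless a file with exactly that name exists.
        return "missing"
    if not zipfile.is_zipfile(path):
        return "not a zip archive"
    return "ok"
```

Calling check_dictionary("/mnt/OpenKM/dict-it.oxt") on the server should return "ok" before the dictionary is configured again.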

Re: How to use OCR

#23020 by pavila
Sun May 12, 2013 10:30 pm

joako wrote: I have the same problem. Under /opt/openkm/temp I have 1730 files such as okm9159544668738627006.txt.txt. I can see the tesseract process running with the Linux command "top"; however, when I execute the SQL query there is no extracted text from PDF files. If I disable OCR and import PDFs that already contain text, they are indexed properly.

This is because the configured OCR executable does not match the text extractor. I think you have configured the CuneiformTextExtractor but are using the tesseract executable.

Re: How to use OCR

#23071 by quelo1972
Thu May 16, 2013 3:22 pm

I removed the dictionary configuration, and now when I scan a page to PDF with the scanner and upload it I get this message:

2013-05-16 17:09:09,694 [http-bio-0.0.0.0-8080-exec-248] INFO  com.openkm.servlet.frontend.FileUploadServlet - Filename: 'prova2.pdf'
2013-05-16 17:09:09,697 [http-bio-0.0.0.0-8080-exec-248] INFO  com.openkm.servlet.frontend.FileUploadServlet - Upload file 'prova2.pdf' into '/okm:root (281,6 KB)'
2013-05-16 17:09:09,698 [http-bio-0.0.0.0-8080-exec-248] INFO  com.openkm.servlet.frontend.FileUploadServlet - Wizard: {path=, showWizardCategories=false, showWizardKeywords=false, groupsList=[], workflowList=[], hasAutomation=false, error=, digitalSignature=false}
2013-05-16 17:09:20,185 [http-bio-0.0.0.0-8080-exec-248] INFO  com.openkm.servlet.frontend.FileUploadServlet - Wizard: {path=%2Fokm%3Aroot%2Fprova2.pdf, showWizardCategories=false, showWizardKeywords=false, groupsList=[], workflowList=[], hasAutomation=false, error=, digitalSignature=false}
2013-05-16 17:09:20,186 [http-bio-0.0.0.0-8080-exec-248] INFO  com.openkm.servlet.frontend.FileUploadServlet - Action: 0, JSON Response: {"hasAutomation":false,"path":"%2Fokm%3Aroot%2Fprova2.pdf","groupsList":[],"workflowList":[],"showWizardCategories":false,"showWizardKeywords":false,"digitalSignature":false,"error":""}
2013-05-16 17:10:00,033 [Thread-11347] INFO  com.openkm.extractor.TextExtractorWorker - processSerial.Working on {docUuid=d9e76b01-b4a4-4751-a444-f7d9a105e215, docPath=/okm:root/prova2.pdf, docVerUuid=3407f033-a3f1-4889-b762-a6f86b2abdec, date=Thu May 16 17:09:19 CEST 2013}
2013-05-16 17:10:00,048 [Thread-11347] WARN  com.openkm.extractor.PdfTextExtractor - PDF does not contains text layer

But if I scan a page in TIF format, OCR works!
This is the log:

2013-05-16 17:14:51,275 [http-bio-0.0.0.0-8080-exec-251] INFO  com.openkm.servlet.frontend.FileUploadServlet - Filename: 'prova3.tif'
2013-05-16 17:14:51,277 [http-bio-0.0.0.0-8080-exec-251] INFO  com.openkm.servlet.frontend.FileUploadServlet - Upload file 'prova3.tif' into '/okm:root (12,1 MB)'
2013-05-16 17:14:51,278 [http-bio-0.0.0.0-8080-exec-251] INFO  com.openkm.servlet.frontend.FileUploadServlet - Wizard: {path=, showWizardCategories=false, showWizardKeywords=false, groupsList=[], workflowList=[], hasAutomation=false, error=, digitalSignature=false}
2013-05-16 17:15:03,024 [http-bio-0.0.0.0-8080-exec-251] INFO  com.openkm.servlet.frontend.FileUploadServlet - Wizard: {path=%2Fokm%3Aroot%2Fprova3.tif, showWizardCategories=false, showWizardKeywords=false, groupsList=[], workflowList=[], hasAutomation=false, error=, digitalSignature=false}
2013-05-16 17:15:03,026 [http-bio-0.0.0.0-8080-exec-251] INFO  com.openkm.servlet.frontend.FileUploadServlet - Action: 0, JSON Response: {"hasAutomation":false,"path":"%2Fokm%3Aroot%2Fprova3.tif","groupsList":[],"workflowList":[],"showWizardCategories":false,"showWizardKeywords":false,"digitalSignature":false,"error":""}
2013-05-16 17:20:00,023 [Thread-11355] INFO  com.openkm.extractor.TextExtractorWorker - processSerial.Working on {docUuid=344b9d9e-34e2-4dc9-9078-b8d9ad66f085, docPath=/okm:root/prova3.tif, docVerUuid=f2e2db1f-3aa5-45c2-9ec8-1a3120c07d98, date=Thu May 16 17:15:01 CEST 2013}
2013-05-16 17:20:03,340 [Thread-11355] INFO  com.openkm.extractor.CuneiformTextExtractor - Rotate image 90 degrees
2013-05-16 17:20:04,232 [Thread-11355] INFO  com.openkm.extractor.CuneiformTextExtractor - Rotate image 180 degrees
2013-05-16 17:20:05,054 [Thread-11355] INFO  com.openkm.extractor.CuneiformTextExtractor - Rotate image 270 degrees
2013-05-16 17:21:12,527 [http-bio-0.0.0.0-8080-exec-239] INFO  com.openkm.dao.SearchDAO - findBySimpleQuery(Enzo AND context:okm_root, 0, 10)
2013-05-16 17:21:12,530 [http-bio-0.0.0.0-8080-exec-239] INFO  com.openkm.dao.SearchDAO - findBySimpleQuery.query: +text:enzo +context:okm_root

Do you have any ideas?
