Open Source Document Management System | OpenKM

Posted: **Fri Aug 07, 2015 8:06 am**

Hi all,

I have an archive with a huge number of PDF files containing scanned reports. I tried to run OCR on them, but without success...

To make them searchable, it would be handy to set the field NDC_TEXT or NDC_TEXT_EXTRACTED (I am new to OKM, so I don't know which one to use) to part of the content...

How can I add a tab (like the document notes tab) where I can see and modify the text, or is something like this already available?

Kind regards

Christian

Posted: **Sat Aug 08, 2015 7:40 am**

Which OCR engine have you configured?

In the professional version we have a feature that shows the extracted text on an extra tab; it is currently not present in the community version. If you want to extend it, I can guide you through the steps for adding it to the current source code.

Anyway, I suggest concentrating on PDF indexing first, since it seems your files are not being indexed correctly. First make sure the document has already been processed: documents go into a queue, and depending on how many documents are pending, processing can take more or less time. You can see the queue at Administration -> Stats -> top right button "Pending queue".

Once a document has been processed, you'll see the fields NDC_TEXT and NDC_TEXT_EXTRACTED change in the table OKM_NODE_DOCUMENT. You can check this with a query like SELECT * FROM OKM_NODE_DOCUMENT WHERE NBS_UUID='the document uuid'. You can get the document uuid from the properties tab (bottom panel) when you have the document selected in the file browser.

If NDC_TEXT is empty, then you have a problem either with the configuration or with the PDF documents. If that is the case, share some PDF files here (uploaded as a zip) and we will take a look.
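The check described above can be sketched as a small script. This is only an illustration under assumptions: it uses an in-memory SQLite table as a stand-in for the real OpenKM database, the helper name `text_status` and the sample uuids are made up, and only the table and column names (OKM_NODE_DOCUMENT, NBS_UUID, NDC_TEXT) come from this thread.

```python
import sqlite3


def text_status(conn, uuid):
    """Report whether NDC_TEXT holds any text for the given document uuid."""
    row = conn.execute(
        "SELECT NDC_TEXT FROM OKM_NODE_DOCUMENT WHERE NBS_UUID = ?",
        (uuid,),  # bound parameter instead of pasting the uuid into the SQL
    ).fetchone()
    if row is None:
        return "no such document"
    text = row[0]
    if text is None or not text.strip():
        return "NDC_TEXT is empty"
    return "NDC_TEXT has text"


# Stand-in database (assumption): a real installation would connect to the
# database configured for OpenKM instead of creating this table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE OKM_NODE_DOCUMENT (NBS_UUID TEXT PRIMARY KEY, NDC_TEXT TEXT)")
conn.executemany(
    "INSERT INTO OKM_NODE_DOCUMENT VALUES (?, ?)",
    [("uuid-indexed", "text of a scanned report"), ("uuid-not-indexed", "")],
)

for uuid in ("uuid-indexed", "uuid-not-indexed", "uuid-unknown"):
    print(uuid, "->", text_status(conn, uuid))
```

An empty result for a document that has already left the pending queue would point at the extraction step rather than at the search itself.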

Posted: **Mon Aug 10, 2015 7:55 am**

Hi Jlort,

I have configured Tesseract, but the extraction doesn't work (Java errors), which I could not resolve:

root@euve78434:~# less /opt/tomcat/logs/catalina.log
2015-08-10 09:50:23,548 [Thread-7616] WARN  com.openkm.util.ExecutionUtils- Abnormal program termination: 2
2015-08-10 09:50:23,548 [Thread-7616] WARN  com.openkm.util.ExecutionUtils- CommandLine: [/usr/bin/tesseract, , /opt/tomcat/temp/R168839771776080646766.png, /opt/tomcat/temp/okm6680092088155651071]
2015-08-10 09:50:23,548 [Thread-7616] WARN  com.openkm.util.ExecutionUtils- STDERR: Tesseract Open Source OCR Engine v3.03 with Leptonica
Cannot open input file:

2015-08-10 09:50:23,549 [Thread-7616] WARN  com.openkm.extractor.Tesseract3TextExtractor- IO exception executing command: /usr/bin/tesseract  /opt/tomcat/temp/R168839771776080646766.png /opt/tomcat/temp/okm6680092088155651071
java.io.FileNotFoundException: /opt/tomcat/temp/okm6680092088155651071.txt (No such file or directory)
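
One possible reading of the log above, offered as a guess rather than a confirmed diagnosis: the logged CommandLine contains an empty second element, and since Tesseract takes the image name and output base as its first two positional arguments, an empty string would land in the image slot. That would match the blank file name after "Cannot open input file:". The sketch below only illustrates the argument shift; the helper `positional_args` is hypothetical and is not OpenKM code.

```python
def positional_args(argv):
    """Return (image, output_base) as a 'program image outputbase' CLI reads them."""
    return argv[1], argv[2]


# Argument list copied from the CommandLine entry in the log above.
logged = [
    "/usr/bin/tesseract",
    "",
    "/opt/tomcat/temp/R168839771776080646766.png",
    "/opt/tomcat/temp/okm6680092088155651071",
]

image, output_base = positional_args(logged)
print(repr(image))        # the image slot holds the empty string
print(repr(output_base))  # the PNG path has slipped into the output slot

# Dropping blank elements restores the intended order.
cleaned = [arg for arg in logged if arg.strip()]
image, output_base = positional_args(cleaned)
print(repr(image))
print(repr(output_base))
```

If this reading is right, the missing .txt file in the FileNotFoundException would simply be a consequence of Tesseract exiting before it wrote any output.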

Another error I have since seen in the log file is:

2015-08-10 09:50:28,093 [Thread-7616] WARN  com.openkm.core.Cron- Error executing crontab task 'Text Extractor Worker': Sourced file: inline evaluation of: ``new com.openkm.extractor.TextExtractorWorker().run();'' : Method Invocation run : at Line: 1 : in file: inline evaluation of: ``new com.openkm.extractor.TextExtractorWorker().run();'' : .run ( )

Target exception: java.lang.OutOfMemoryError: Java heap space

Kind regards

Christian

Posted: **Wed Aug 12, 2015 7:36 am**

Please do not merge several questions into the same post; open a separate topic for each one.

Posted: **Wed Aug 12, 2015 1:11 pm**

OK, then let's continue with my beginner question.

I solved the OCR problems anyway and changed the PdfExtractor & TesseractExtractor...

Adding a new tab at document level