• Adding a new tab at document level

Here we discuss how to customize and improve the OpenKM source code.
 #40255  by blacknoir
 
Hi all,

I have an archive with a huge number of PDF files containing scanned reports. I tried to run OCR on them, but without success...

To make them searchable, it would be handy to set the field NDC_TEXT or NDC_TEXT_EXTRACTED (I am new to OKM, so I don't know which one to use) to part of the content...

How can I add a tab (like document notes) where I can view and modify the text, or is something like this already available?



Kind regards


Christian
 #40262  by jllort
 
Which OCR engine have you configured?

In the professional version we have a feature that shows the extracted text on an extra tab; it is currently not present in the community version. If you want to add it yourself, I can guide you through the steps for adding it to the current source code.

Anyway, I suggest concentrating on PDF indexing, since it seems your files are not being indexed correctly. First make sure the document has already been processed: documents go into a queue, and depending on the number of documents pending, processing can take more or less time. You can see the queue at Administration -> Stats -> top right button "Pending queue".

Once a document has been processed, you will see the fields NDC_TEXT and NDC_TEXT_EXTRACTED change in the table OKM_NODE_DOCUMENT. You can check this with a query like SELECT * FROM OKM_NODE_DOCUMENT WHERE NBS_UUID='the document uuid'. You can get the document UUID from the properties tab (bottom panel) when the document is selected in the file browser.
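
The check described above can be sketched as a small script. This is only an illustration: it uses Python's built-in sqlite3 module with an in-memory stand-in table so that it runs anywhere, whereas a real OpenKM installation would be queried through its own database client. The helper name text_status, the sample UUIDs and the 'T'/'F' flag values are hypothetical; only the table and column names come from this thread.

```python
import sqlite3

def text_status(conn, uuid):
    """Report whether extracted text is stored for one document.

    Returns "missing" if no row has that UUID, "empty" if NDC_TEXT is
    NULL or blank, and "extracted" otherwise.
    """
    row = conn.execute(
        "SELECT NDC_TEXT, NDC_TEXT_EXTRACTED "
        "FROM OKM_NODE_DOCUMENT WHERE NBS_UUID = ?",
        (uuid,),  # bound parameter instead of pasting the UUID into the SQL
    ).fetchone()
    if row is None:
        return "missing"
    text, _flag = row  # the flag is fetched for inspection but not interpreted
    return "extracted" if text and text.strip() else "empty"

# Stand-in table with made-up rows (assumption: not the real OpenKM schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE OKM_NODE_DOCUMENT ("
    "NBS_UUID TEXT PRIMARY KEY, NDC_TEXT TEXT, NDC_TEXT_EXTRACTED TEXT)"
)
conn.execute("INSERT INTO OKM_NODE_DOCUMENT VALUES ('uuid-a', 'scanned report text', 'T')")
conn.execute("INSERT INTO OKM_NODE_DOCUMENT VALUES ('uuid-b', '', 'F')")

for uuid in ("uuid-a", "uuid-b", "uuid-c"):
    print(uuid, text_status(conn, uuid))
```

A document that reports "empty" after the pending queue has drained is the case where the extractor configuration or the PDF itself is worth a closer look.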

If NDC_TEXT is empty, then you have a problem either with the configuration or with the PDF documents. If this is the case, share some PDF files here (uploaded as a zip) and we will take a look.
 #40275  by blacknoir
 
Hi jllort,

I have configured Tesseract, but the extraction doesn't work (Java errors), which I could not resolve:
Code:
root@euve78434:~# less /opt/tomcat/logs/catalina.log
2015-08-10 09:50:23,548 [Thread-7616] WARN  com.openkm.util.ExecutionUtils- Abnormal program termination: 2
2015-08-10 09:50:23,548 [Thread-7616] WARN  com.openkm.util.ExecutionUtils- CommandLine: [/usr/bin/tesseract, , /opt/tomcat/temp/R168839771776080646766.png, /opt/tomcat/temp/okm6680092088155651071]
2015-08-10 09:50:23,548 [Thread-7616] WARN  com.openkm.util.ExecutionUtils- STDERR: Tesseract Open Source OCR Engine v3.03 with Leptonica
Cannot open input file:

2015-08-10 09:50:23,549 [Thread-7616] WARN  com.openkm.extractor.Tesseract3TextExtractor- IO exception executing command: /usr/bin/tesseract  /opt/tomcat/temp/R168839771776080646766.png /opt/tomcat/temp/okm6680092088155651071
java.io.FileNotFoundException: /opt/tomcat/temp/okm6680092088155651071.txt (No such file or directory)
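
A detail in the log above that may be worth checking: the logged CommandLine contains an empty second argument ([/usr/bin/tesseract, , ...]), and Tesseract reports "Cannot open input file:" with no file name after it. Since tesseract is called as "tesseract imagename outputbase", an empty argument in that position would be taken as the image name. The sketch below only shows how such an empty argument can be spotted in a logged command line. The function name empty_arguments is hypothetical, and it assumes the log prints the arguments separated by commas inside square brackets, as in the lines above (a path containing a comma would confuse it). Why the argument is empty, for example an unset placeholder in the OCR command setting, is a guess and not something the log confirms.

```python
def empty_arguments(logged):
    """Return the positions of empty arguments in a logged '[a, b, c]' command line."""
    inner = logged.strip()
    if inner.startswith("[") and inner.endswith("]"):
        inner = inner[1:-1]  # drop the surrounding brackets
    args = [arg.strip() for arg in inner.split(",")]
    return [i for i, arg in enumerate(args) if arg == ""]

# Command line copied from the WARN entry in the log above.
line = (
    "[/usr/bin/tesseract, , /opt/tomcat/temp/R168839771776080646766.png, "
    "/opt/tomcat/temp/okm6680092088155651071]"
)
print(empty_arguments(line))
```

Position 0 is the executable itself, so an entry at position 1 means the first argument passed to tesseract is an empty string.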
A new error I have seen in the log file is:
Code:
2015-08-10 09:50:28,093 [Thread-7616] WARN  com.openkm.core.Cron- Error executing crontab task 'Text Extractor Worker': Sourced file: inline evaluation of: ``new com.openkm.extractor.TextExtractorWorker().run();'' : Method Invocation run : at Line: 1 : in file: inline evaluation of: ``new com.openkm.extractor.TextExtractorWorker().run();'' : .run ( )

Target exception: java.lang.OutOfMemoryError: Java heap space

Kind regards

Christian
 #40279  by jllort
 
Please do not merge several questions into the same post; open a separate topic for the other one.
