Adding a new tab at document level
Posted: Fri Aug 07, 2015 8:06 am
by blacknoir
Hi all,
I have an archive with a huge number of PDF files that contain scanned reports. I tried to run OCR on them, but without success...
To make them searchable, it would be handy to be able to set the field NDC_TEXT or NDC_TEXT_EXTRACTED (I am new to OKM, so I don't know which one to use) to part of the content...
How can I add a tab (like the document notes tab) where I can see and modify the text, or is something like this already available?
Kind regards
Christian
Re: Adding a new tab at document level
Posted: Sat Aug 08, 2015 7:40 am
by jllort
Which OCR engine have you configured?
The professional version has a feature to show the extracted text on an extra tab; it is currently not present in the community version. If you want to extend it, I can guide you through the steps for adding it to the current source code.
Anyway, I suggest concentrating on PDF indexing, since it seems your files are not being indexed correctly. First make sure the document has already been processed: documents go into a queue, and depending on the number of documents pending, processing can take more or less time. You can see the queue at Administration -> Stats -> top right button "Pending queue".
Once a document has been processed, you will see the fields NDC_TEXT and NDC_TEXT_EXTRACTED change in the table OKM_NODE_DOCUMENT. You can check with a query like SELECT * FROM OKM_NODE_DOCUMENT WHERE NBS_UUID='the document uuid'. You can get the document UUID from the properties tab (bottom panel) when the document is selected in the file browser.
If NDC_TEXT is empty, then you have a problem with the configuration or with the PDF documents. If that is the case, share some PDF files here (uploaded as a zip) and we will take a look.
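The check described above can also be sketched in Java over JDBC. This is only a rough sketch under assumptions: the class name ExtractedTextCheck and its methods are hypothetical and not part of OpenKM, it assumes you already have a java.sql.Connection to the OpenKM database, and it assumes the driver returns NDC_TEXT through getString. Only the table and column names come from the query above.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Illustrative sketch; class and method names are hypothetical, not OpenKM API.
class ExtractedTextCheck {

    // Table and column names are the ones mentioned in the post above.
    // The UUID is passed as a bind parameter instead of an inline literal.
    static final String QUERY =
            "SELECT NDC_TEXT FROM OKM_NODE_DOCUMENT WHERE NBS_UUID = ?";

    // True when the stored text is missing or contains only whitespace.
    static boolean isBlank(String text) {
        return text == null || text.trim().isEmpty();
    }

    // Returns true if the row for this UUID holds non-blank text in NDC_TEXT.
    // Returns false when no row matches or the text is blank.
    static boolean hasExtractedText(Connection con, String uuid) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(QUERY)) {
            ps.setString(1, uuid);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() && !isBlank(rs.getString(1));
            }
        }
    }
}
```

A false result for a document that has already left the pending queue would point to the extraction problem described above, not to a delay in the queue.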
Re: Adding a new tab at document level
Posted: Mon Aug 10, 2015 7:55 am
by blacknoir
Hi jllort,
I have configured Tesseract, but the extraction doesn't work (Java errors), which I could not resolve:
Code:
root@euve78434:~# less /opt/tomcat/logs/catalina.log
2015-08-10 09:50:23,548 [Thread-7616] WARN com.openkm.util.ExecutionUtils- Abnormal program termination: 2
2015-08-10 09:50:23,548 [Thread-7616] WARN com.openkm.util.ExecutionUtils- CommandLine: [/usr/bin/tesseract, , /opt/tomcat/temp/R168839771776080646766.png, /opt/tomcat/temp/okm6680092088155651071]
2015-08-10 09:50:23,548 [Thread-7616] WARN com.openkm.util.ExecutionUtils- STDERR: Tesseract Open Source OCR Engine v3.03 with Leptonica
Cannot open input file:
2015-08-10 09:50:23,549 [Thread-7616] WARN com.openkm.extractor.Tesseract3TextExtractor- IO exception executing command: /usr/bin/tesseract /opt/tomcat/temp/R168839771776080646766.png /opt/tomcat/temp/okm6680092088155651071
java.io.FileNotFoundException: /opt/tomcat/temp/okm6680092088155651071.txt (No such file or directory)
A new error I have seen in the logfile is:
Code:
2015-08-10 09:50:28,093 [Thread-7616] WARN com.openkm.core.Cron- Error executing crontab task 'Text Extractor Worker': Sourced file: inline evaluation of: ``new com.openkm.extractor.TextExtractorWorker().run();'' : Method Invocation run : at Line: 1 : in file: inline evaluation of: ``new com.openkm.extractor.TextExtractorWorker().run();'' : .run ( )
Target exception: java.lang.OutOfMemoryError: Java heap space
Kind regards
Christian
Re: Adding a new tab at document level
Posted: Wed Aug 12, 2015 7:36 am
by jllort
Please do not mix several questions in the same post; open a separate topic for each one.
Re: Adding a new tab at document level
Posted: Wed Aug 12, 2015 1:11 pm
by blacknoir
OK, then let's continue with my beginner question.
I solved the OCR problems anyway and changed the PdfExtractor & TesseractExtractor...