• Problem with Text Extractor

  • OpenKM has many interesting features, but requires some configuration process to show its full potential.
 #27659  by bsn
 
Hello all!
TextExtractor doesn't work correctly.
My OS: Ubuntu server 12.04 32bit
Java version "1.6.0_45"
Java(TM) SE Runtime Environment (build 1.6.0_45-b06)
I used these instructions: http://bonoty.info/index.php/OpenKM
My OpenKM: openkm-6.2.5-community-tomcat-bundle.
Settings:
Code:
system.ghostscript.ps2pdf: /usr/bin/ps2pdf
system.imagemagick.convert:  /usr/bin/convert
system.ocr: /usr/bin/tesseract -l rus+ukr+eng ${fileIn} ${fileOut}
system.openoffice.dictionary: /opt/ru_RU.oxt
system.openoffice.path: /usr/lib/libreoffice
I checked all the paths.
When I run the commands manually, they work fine.
For example:
Code:
root@openkm:/opt/tomcat-7.0.27/temp# /usr/bin/convert Безпроводные\ точки\ доступа.pdf test.jpg
root@openkm:/opt/tomcat-7.0.27/temp# /usr/bin/tesseract -l rus test.jpg test
Tesseract Open Source OCR Engine v3.02 with Leptonica
root@openkm:/opt/tomcat-7.0.27/temp# ls -l
total 984
-rw-r--r-- 1 root root 551531 Янв 24 11:06 test.jpg
-rw-r--r-- 1 root root   3457 Янв 24 11:06 test.txt
-rw------- 1 serg serg 447356 Янв 23 11:25 Безпроводные точки доступа.pdf
The resulting file test.txt is correct.
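The manual check above can also be scripted so that the exit code and stderr of each step are captured. This is only a minimal Python sketch: the binary paths and argument order are copied from the commands above, while the function names `build_commands` and `run_check` and the fixed output names are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_commands(pdf, workdir, lang="rus"):
    """Return the argv lists for the two manual steps: convert, then tesseract."""
    workdir = Path(workdir)
    image = workdir / "test.jpg"
    out_base = workdir / "test"  # tesseract adds the .txt extension itself
    convert_cmd = ["/usr/bin/convert", str(pdf), str(image)]
    ocr_cmd = ["/usr/bin/tesseract", "-l", lang, str(image), str(out_base)]
    return [convert_cmd, ocr_cmd]


def run_check(pdf, workdir, lang="rus"):
    """Run both steps in order and stop at the first one that fails."""
    results = []
    for argv in build_commands(pdf, workdir, lang):
        proc = subprocess.run(argv, capture_output=True, text=True)
        results.append((argv, proc.returncode, proc.stderr.strip()))
        if proc.returncode != 0:
            break
    return results
```

Calling `run_check("some.pdf", "/opt/tomcat-7.0.27/temp")` returns one `(argv, exit code, stderr)` tuple per step. Running it as the same user that runs Tomcat, rather than as root, may be closer to what OpenKM itself does, although that is an assumption about this setup.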

My log:
Code:
 2014-01-24 08:23:47,637 [http-bio-0.0.0.0-8080-exec-9] INFO  com.openkm.vernum.VersionNumerationFactory - VersionNumerationAdapter: com.openkm.vernum.MajorMinorVersionNumerationAdapter
 2014-01-24 08:25:00,012 [Thread-352] DEBUG com.openkm.extractor.TextExtractorWorker - *** Begin text extraction ***
 2014-01-24 08:25:00,016 [Thread-352] DEBUG com.openkm.extractor.TextExtractorWorker - processSerial(null, 10)
 2014-01-24 08:25:00,048 [Thread-352] INFO  com.openkm.extractor.TextExtractorWorker - processSerial.Working on {docUuid=2287a0b2-ce24-475a-9978-e34f3fe9df46, docPath=/okm:root/Разрешения/2014/Разрешение - тест.pdf, docVerUuid=891d23f8-8510-4bf0-aa06-b202afcc1111, date=Fri Jan 24 08:23:47 EET 2014}
 2014-01-24 08:25:00,051 [Thread-352] DEBUG com.openkm.extractor.RegisteredExtractors - getText(/okm:root/Разрешения/2014/Разрешение - тест.pdf, application/pdf, null, java.io.FileInputStream@228a14)
 2014-01-24 08:25:00,698 [Thread-352] DEBUG com.openkm.extractor.PdfTextExtractor - TextStripped: ''
 2014-01-24 08:25:00,698 [Thread-352] WARN  com.openkm.extractor.PdfTextExtractor - PDF does not contains text layer
 2014-01-24 08:25:00,699 [Thread-352] DEBUG com.openkm.extractor.PdfTextExtractor - Writing image: /opt/tomcat-7.0.27/temp/image04987538163922428421.jpg
 2014-01-24 08:25:01,094 [Thread-352] WARN  com.openkm.util.ExecutionUtils - Abnormal program termination: 2
 2014-01-24 08:25:01,095 [Thread-352] WARN  com.openkm.util.ExecutionUtils - CommandLine: [/usr/bin/tesseract, , -l, rus, /opt/tomcat-7.0.27/temp/image04987538163922428421.jpg, /opt/tomcat-7.0.27 /temp/okm4760634622237429462.txt]
2014-01-24 08:25:01,095 [Thread-352] WARN  com.openkm.util.ExecutionUtils - STDERR:
2014-01-24 08:25:01,115 [Thread-352] INFO  com.openkm.util.DocumentUtils - Using OpenOffice dictionary: /opt/ru_RU.oxt
2014-01-24 08:25:01,145 [Thread-352] DEBUG com.openkm.extractor.CuneiformTextExtractor - TEXT:
2014-01-24 08:25:01,145 [Thread-352] DEBUG com.openkm.extractor.PdfTextExtractor - OCR Extracted:
2014-01-24 08:25:01,146 [Thread-352] WARN  com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/Разрешения/2014/Разрешение - тест.pdf': Too few text extracted
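The CommandLine entry in the log above contains an empty second argument (`[/usr/bin/tesseract, , -l, rus, ...]`). One way such an argument can appear is when a command template with two consecutive spaces is split on single spaces. Whether OpenKM tokenizes `system.ocr` this way is an assumption and is not confirmed by the log. The Python sketch below only illustrates the effect; the function names and the doubled-space template are hypothetical.

```python
def split_naive(template):
    """Split on single spaces: consecutive spaces produce empty tokens."""
    return template.split(" ")


def split_tolerant(template):
    """Split on any run of whitespace: no empty tokens are produced."""
    return template.split()


def expand(tokens, file_in, file_out):
    """Substitute the ${fileIn} and ${fileOut} placeholders token by token."""
    mapping = {"${fileIn}": file_in, "${fileOut}": file_out}
    return [mapping.get(token, token) for token in tokens]


# Hypothetical template with a doubled space after the binary path.
template = "/usr/bin/tesseract  -l rus ${fileIn} ${fileOut}"

naive_argv = expand(split_naive(template), "in.jpg", "out.txt")
tolerant_argv = expand(split_tolerant(template), "in.jpg", "out.txt")
```

Here `naive_argv` has an empty string as its second element, the same shape as the logged command line, while `tolerant_argv` does not. If this is what happened, checking the `system.ocr` value for doubled spaces would be a cheap first step.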
Last edited by bsn on Sun Jan 26, 2014 2:36 pm, edited 1 time in total.
 #27665  by bsn
 
I imported documents.
Then I deleted the database.
I also deleted everything in the /opt/tomcat-7.0.27 folder except OpenKM.cfg, server.xml, and log4j.properties.
Then I started again from the beginning.
Now my log looks like this:
Code:
2014-01-24 16:00:00,027 [Thread-130] DEBUG com.openkm.extractor.TextExtractorWorker - *** Begin text extraction ***
2014-01-24 16:00:00,032 [Thread-130] DEBUG com.openkm.extractor.TextExtractorWorker - processSerial(null, 10)
2014-01-24 16:00:00,042 [Thread-130] DEBUG com.openkm.extractor.TextExtractorWorker - *** End text extraction ***
2014-01-24 16:02:16,411 [http-bio-0.0.0.0-8080-exec-22] INFO  com.openkm.vernum.VersionNumerationFactory - VersionNumerationAdapter: com.openkm.vernum.MajorMinorVersionNumerationAdapter
2014-01-24 16:05:00,017 [Thread-133] DEBUG com.openkm.extractor.TextExtractorWorker - *** Begin text extraction ***
2014-01-24 16:05:00,017 [Thread-133] DEBUG com.openkm.extractor.TextExtractorWorker - processSerial(null, 10)
2014-01-24 16:05:00,020 [Thread-133] INFO  com.openkm.extractor.TextExtractorWorker - processSerial.Working on {docUuid=024f3f54-bdab-4490-83c5-830f464e0d84, docPath=/okm:root/Test.pdf, docVerUuid=1b2b0ac4-3a4a-4242-b26f-991526e8fa63, date=Fri Jan 24 16:02:16 EET 2014}
2014-01-24 16:05:00,021 [Thread-133] DEBUG com.openkm.extractor.RegisteredExtractors - getText(/okm:root/Test.pdf, application/pdf, null, java.io.FileInputStream@fad411)
2014-01-24 16:05:00,038 [Thread-133] DEBUG com.openkm.extractor.PdfTextExtractor - TextStripped: ''
2014-01-24 16:05:00,038 [Thread-133] WARN  com.openkm.extractor.PdfTextExtractor - PDF does not contains text layer
2014-01-24 16:05:00,039 [Thread-133] DEBUG com.openkm.extractor.PdfTextExtractor - Writing image: /opt/tomcat-7.0.27/temp/image07267208762054520734.jpg
2014-01-24 16:05:05,990 [Thread-133] DEBUG com.openkm.extractor.CuneiformTextExtractor - TEXT:
2014-01-24 16:05:05,990 [Thread-133] DEBUG com.openkm.extractor.PdfTextExtractor - OCR Extracted:
2014-01-24 16:05:05,990 [Thread-133] DEBUG com.openkm.extractor.PdfTextExtractor - Writing image: /opt/tomcat-7.0.27/temp/image18969993581577475504.jpg
2014-01-24 16:05:11,406 [Thread-133] DEBUG com.openkm.extractor.CuneiformTextExtractor - TEXT:
2014-01-24 16:05:11,406 [Thread-133] DEBUG com.openkm.extractor.PdfTextExtractor - OCR Extracted:
2014-01-24 16:05:11,407 [Thread-133] WARN  com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/Test.pdf': Too few text extracted
2014-01-24 16:05:11,426 [Thread-133] DEBUG com.openkm.extractor.TextExtractorWorker - *** End text extraction ***
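With DEBUG logging enabled, the warnings are easy to miss among the other lines. A small Python sketch for pulling only the WARN entries out of a log in the format shown above follows. The regular expression is derived from the lines in this post, not from the log4j configuration, and the name `extract_warnings` is made up for illustration.

```python
import re

# Layout inferred from the log lines above:
# timestamp [thread] LEVEL logger - message
LOG_LINE = re.compile(
    r"^\s*(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"\[(?P<thread>[^\]]+)\] (?P<level>[A-Z]+)\s+(?P<logger>\S+) - (?P<message>.*)$"
)


def extract_warnings(lines):
    """Return (time, logger, message) for every WARN line that matches the layout."""
    found = []
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("level") == "WARN":
            found.append(
                (match.group("time"), match.group("logger"), match.group("message"))
            )
    return found
```

Passing an open log file to `extract_warnings` returns only entries such as the "Too few text extracted" warning from NodeDocumentDAO.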
Now the temp folder contains a file with the double extension .txt.txt:
Code:
ls -l /opt/tomcat-7.0.27/temp
-rw-r--r-- 1 root root 2228 янв.  24 16:05 /opt/tomcat-7.0.27/temp/okm4412878494546261539.txt.txt
This file contains text. The text is mostly correct, but there are some mistakes.

The query "SELECT NDC_TEXT_EXTRACTED FROM OKM_NODE_DOCUMENT WHERE NBS_UUID = '024f3f54-bdab-4490-83c5-830f464e0d84'" returns NULL.

Could the issue be caused by the large number of mistakes in the file temp/okm4412878494546261539.txt.txt?

Is there a way to get at least something written to the database, even a single word?

PS: When I use a *.doc file instead of a scanned image, the text extractor works fine.
 #27678  by bsn
 
I've resolved the problem.
The solution turned out to be simple.
The instructions at http://bonoty.info/index.php/OpenKM are missing a step: they say nothing about the registered.text.extractors setting.
By default, registered.text.extractors contains com.openkm.extractor.CuneiformTextExtractor.
If you use Tesseract for OCR, this entry should be changed to com.openkm.extractor.Tesseract3TextExtractor.
After that, everything works fine for me.
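As a rough illustration of this change, the following Python sketch swaps the class name in a registered.text.extractors value. It assumes the value is a list with one class name per line. The helper name `swap_ocr_extractor` and the two-line sample value are made up for illustration, and in practice the entry can simply be edited by hand in the OpenKM configuration.

```python
CUNEIFORM = "com.openkm.extractor.CuneiformTextExtractor"
TESSERACT3 = "com.openkm.extractor.Tesseract3TextExtractor"


def swap_ocr_extractor(value):
    """Replace the Cuneiform extractor entry with the Tesseract 3 one.

    Assumes one class name per line; all other lines are kept unchanged.
    """
    lines = []
    for line in value.splitlines():
        lines.append(TESSERACT3 if line.strip() == CUNEIFORM else line)
    return "\n".join(lines)


# Hypothetical, shortened sample value; a real list may be longer.
sample_value = "\n".join([
    "com.openkm.extractor.PdfTextExtractor",
    CUNEIFORM,
])
updated_value = swap_ocr_extractor(sample_value)
```

After the swap, `updated_value` lists the Tesseract 3 extractor where the Cuneiform one was, and the other entries keep their order.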
