Problem with Text Extractor

Posted: Fri Jan 24, 2014 9:19 am
by bsn
Hello all!
TextExtractor doesn't work correctly.
My OS: Ubuntu server 12.04 32bit
Java version "1.6.0_45"
Java(TM) SE Runtime Environment (build 1.6.0_45-b06)
I used these instructions: http://bonoty.info/index.php/OpenKM
My OpenKM: openkm-6.2.5-community-tomcat-bundle.
Settings:
Code:
system.ghostscript.ps2pdf: /usr/bin/ps2pdf
system.imagemagick.convert:  /usr/bin/convert
system.ocr: /usr/bin/tesseract -l rus+ukr+eng ${fileIn} ${fileOut}
system.openoffice.dictionary: /opt/ru_RU.oxt
system.openoffice.path: /usr/lib/libreoffice
I checked all the paths.
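
The path check mentioned above could be scripted roughly as follows. This is only a sketch: the setting values are copied from this post, while the helper names `configured_path` and `missing_paths` are made up for illustration and are not part of OpenKM.

```python
import os

# Settings as quoted in the post; a value is either a plain path
# or a command line whose first token is the binary to run.
SETTINGS = {
    "system.ghostscript.ps2pdf": "/usr/bin/ps2pdf",
    "system.imagemagick.convert": " /usr/bin/convert",
    "system.ocr": "/usr/bin/tesseract -l rus+ukr+eng ${fileIn} ${fileOut}",
    "system.openoffice.dictionary": "/opt/ru_RU.oxt",
    "system.openoffice.path": "/usr/lib/libreoffice",
}


def configured_path(value):
    """Return the path a setting points at: its first whitespace-separated token."""
    tokens = value.split()
    return tokens[0] if tokens else ""


def missing_paths(settings, exists=os.path.exists):
    """Return the names of settings whose path does not exist on this machine."""
    return sorted(
        name for name, value in settings.items()
        if not exists(configured_path(value))
    )


if __name__ == "__main__":
    for name in missing_paths(SETTINGS):
        print("missing:", name, "->", configured_path(SETTINGS[name]))
```

It only confirms that each path exists; it does not prove that the user running Tomcat is allowed to execute the binaries.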
When I run the commands manually, they work fine.
For example:
Code:
root@openkm:/opt/tomcat-7.0.27/temp# /usr/bin/convert Безпроводные\ точки\ доступа.pdf test.jpg
root@openkm:/opt/tomcat-7.0.27/temp# /usr/bin/tesseract -l rus test.jpg test
Tesseract Open Source OCR Engine v3.02 with Leptonica
root@openkm:/opt/tomcat-7.0.27/temp# ls -l
total 984
-rw-r--r-- 1 root root 551531 Янв 24 11:06 test.jpg
-rw-r--r-- 1 root root   3457 Янв 24 11:06 test.txt
-rw------- 1 serg serg 447356 Янв 23 11:25 Безпроводные точки доступа.pdf
The resulting test.txt is correct.

My log:
Code:
 2014-01-24 08:23:47,637 [http-bio-0.0.0.0-8080-exec-9] INFO  com.openkm.vernum.VersionNumerationFactory - VersionNumerationAdapter: com.openkm.vernum.MajorMinorVersionNumerationAdapter
 2014-01-24 08:25:00,012 [Thread-352] DEBUG com.openkm.extractor.TextExtractorWorker - *** Begin text extraction ***
 2014-01-24 08:25:00,016 [Thread-352] DEBUG com.openkm.extractor.TextExtractorWorker - processSerial(null, 10)
 2014-01-24 08:25:00,048 [Thread-352] INFO  com.openkm.extractor.TextExtractorWorker - processSerial.Working on {docUuid=2287a0b2-ce24-475a-9978-e34f3fe9df46, docPath=/okm:root/Разрешения/2014/Разрешение - тест.pdf, docVerUuid=891d23f8-8510-4bf0-aa06-b202afcc1111, date=Fri Jan 24 08:23:47 EET 2014}
 2014-01-24 08:25:00,051 [Thread-352] DEBUG com.openkm.extractor.RegisteredExtractors - getText(/okm:root/Разрешения/2014/Разрешение - тест.pdf, application/pdf, null, java.io.FileInputStream@228a14)
 2014-01-24 08:25:00,698 [Thread-352] DEBUG com.openkm.extractor.PdfTextExtractor - TextStripped: ''
 2014-01-24 08:25:00,698 [Thread-352] WARN  com.openkm.extractor.PdfTextExtractor - PDF does not contains text layer
 2014-01-24 08:25:00,699 [Thread-352] DEBUG com.openkm.extractor.PdfTextExtractor - Writing image: /opt/tomcat-7.0.27/temp/image04987538163922428421.jpg
 2014-01-24 08:25:01,094 [Thread-352] WARN  com.openkm.util.ExecutionUtils - Abnormal program termination: 2
 2014-01-24 08:25:01,095 [Thread-352] WARN  com.openkm.util.ExecutionUtils - CommandLine: [/usr/bin/tesseract, , -l, rus, /opt/tomcat-7.0.27/temp/image04987538163922428421.jpg, /opt/tomcat-7.0.27 /temp/okm4760634622237429462.txt]
2014-01-24 08:25:01,095 [Thread-352] WARN  com.openkm.util.ExecutionUtils - STDERR:
2014-01-24 08:25:01,115 [Thread-352] INFO  com.openkm.util.DocumentUtils - Using OpenOffice dictionary: /opt/ru_RU.oxt
2014-01-24 08:25:01,145 [Thread-352] DEBUG com.openkm.extractor.CuneiformTextExtractor - TEXT:
2014-01-24 08:25:01,145 [Thread-352] DEBUG com.openkm.extractor.PdfTextExtractor - OCR Extracted:
2014-01-24 08:25:01,146 [Thread-352] WARN  com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/Разрешения/2014/Разрешение - тест.pdf': Too few text extracted
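
One detail in the CommandLine entry of this log is the empty second argument (`[/usr/bin/tesseract, , -l, ...]`). A doubled space in the configured command would produce exactly that if the command template were split on single spaces. How OpenKM actually tokenises the command is not shown in this thread, so the sketch below is an assumption; the template string and both function names are hypothetical.

```python
def expand_naive(template, file_in, file_out):
    """Fill in the placeholders, then split on single spaces (assumed behaviour)."""
    filled = template.replace("${fileIn}", file_in).replace("${fileOut}", file_out)
    return filled.split(" ")


def expand_tolerant(template, file_in, file_out):
    """Split on any run of whitespace first, then fill in the placeholders per token."""
    return [
        token.replace("${fileIn}", file_in).replace("${fileOut}", file_out)
        for token in template.split()
    ]


# Hypothetical template with a doubled space after the binary name.
template = "/usr/bin/tesseract  -l rus ${fileIn} ${fileOut}"

print(expand_naive(template, "in.jpg", "out"))
print(expand_tolerant(template, "in.jpg", "out"))
```

The naive version yields an empty string as the second argument, which a program would receive as an empty file name; the tolerant version does not. Checking the `system.ocr` value for stray spaces is a cheap first step when the log shows an argument list like the one above.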

Re: Problem with Text Extractor

Posted: Sat Jan 25, 2014 6:31 am
by bsn
I imported documents.
Then I deleted the database and everything in the folder /opt/tomcat-7.0.27 except OpenKM.cfg, server.xml and log4j.properties.
After that I started again from the beginning.
Now my log looks like this:
Code:
2014-01-24 16:00:00,027 [Thread-130] DEBUG com.openkm.extractor.TextExtractorWorker - *** Begin text extraction ***
2014-01-24 16:00:00,032 [Thread-130] DEBUG com.openkm.extractor.TextExtractorWorker - processSerial(null, 10)
2014-01-24 16:00:00,042 [Thread-130] DEBUG com.openkm.extractor.TextExtractorWorker - *** End text extraction ***
2014-01-24 16:02:16,411 [http-bio-0.0.0.0-8080-exec-22] INFO  com.openkm.vernum.VersionNumerationFactory - VersionNumerationAdapter: com.openkm.vernum.MajorMinorVersionNumerationAdapter
2014-01-24 16:05:00,017 [Thread-133] DEBUG com.openkm.extractor.TextExtractorWorker - *** Begin text extraction ***
2014-01-24 16:05:00,017 [Thread-133] DEBUG com.openkm.extractor.TextExtractorWorker - processSerial(null, 10)
2014-01-24 16:05:00,020 [Thread-133] INFO  com.openkm.extractor.TextExtractorWorker - processSerial.Working on {docUuid=024f3f54-bdab-4490-83c5-830f464e0d84, docPath=/okm:root/Test.pdf, docVerUuid=1b2b0ac4-3a4a-4242-b26f-991526e8fa63, date=Fri Jan 24 16:02:16 EET 2014}
2014-01-24 16:05:00,021 [Thread-133] DEBUG com.openkm.extractor.RegisteredExtractors - getText(/okm:root/Test.pdf, application/pdf, null, java.io.FileInputStream@fad411)
2014-01-24 16:05:00,038 [Thread-133] DEBUG com.openkm.extractor.PdfTextExtractor - TextStripped: ''
2014-01-24 16:05:00,038 [Thread-133] WARN  com.openkm.extractor.PdfTextExtractor - PDF does not contains text layer
2014-01-24 16:05:00,039 [Thread-133] DEBUG com.openkm.extractor.PdfTextExtractor - Writing image: /opt/tomcat-7.0.27/temp/image07267208762054520734.jpg
2014-01-24 16:05:05,990 [Thread-133] DEBUG com.openkm.extractor.CuneiformTextExtractor - TEXT:
2014-01-24 16:05:05,990 [Thread-133] DEBUG com.openkm.extractor.PdfTextExtractor - OCR Extracted:
2014-01-24 16:05:05,990 [Thread-133] DEBUG com.openkm.extractor.PdfTextExtractor - Writing image: /opt/tomcat-7.0.27/temp/image18969993581577475504.jpg
2014-01-24 16:05:11,406 [Thread-133] DEBUG com.openkm.extractor.CuneiformTextExtractor - TEXT:
2014-01-24 16:05:11,406 [Thread-133] DEBUG com.openkm.extractor.PdfTextExtractor - OCR Extracted:
2014-01-24 16:05:11,407 [Thread-133] WARN  com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/Test.pdf': Too few text extracted
2014-01-24 16:05:11,426 [Thread-133] DEBUG com.openkm.extractor.TextExtractorWorker - *** End text extraction ***
Now there is a file with a double extension, .txt.txt, in the temp folder:
Code:
ls -l /opt/tomcat-7.0.27/temp
-rw-r--r-- 1 root root 2228 янв.  24 16:05 /opt/tomcat-7.0.27/temp/okm4412878494546261539.txt.txt
This file contains text that is mostly correct, with a few mistakes.

Query "SELECT NDC_TEXT_EXTRACTED FROM OKM_NODE_DOCUMENT WHERE NBS_UUID = '024f3f54-bdab-4490-83c5-830f464e0d84 '" returns NULL

Could the issue be caused by the large number of mistakes in the file temp/okm4412878494546261539.txt.txt?

Is there a way to get at least something, even a single word, stored in the database?

PS: When I use a *.doc file instead of an image, the text extractor works fine.

Re: Problem with Text Extractor (Solved)

Posted: Mon Jan 27, 2014 5:30 pm
by bsn
I've resolved the problem.
The solution is obvious.
The instructions at http://bonoty.info/index.php/OpenKM are missing a step: they say nothing about the registered.text.extractors variable.
By default, registered.text.extractors contains com.openkm.extractor.CuneiformTextExtractor.
If you use Tesseract for OCR, this entry should be changed to com.openkm.extractor.Tesseract3TextExtractor.
After that, everything works fine for me.
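
The logs earlier in the thread already show which extractor class was producing the (empty) OCR output. A small script can list the `com.openkm.extractor.*` logger names that appear in a log, which makes a wrong extractor easier to spot. This is a sketch: the regular expression is fitted to the log lines quoted in this thread, and whether Tesseract3TextExtractor logs under its own class name is an assumption.

```python
import re

# Captures the logger name in lines such as
# "... [Thread-133] DEBUG com.openkm.extractor.CuneiformTextExtractor - TEXT:"
LOGGER_RE = re.compile(r"\]\s+[A-Z]+\s+(com\.openkm\.extractor\.\w+)\s+-\s")


def extractor_loggers(lines):
    """Return distinct com.openkm.extractor.* logger names in order of first appearance."""
    seen = []
    for line in lines:
        match = LOGGER_RE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


# Three lines taken from the second log in this thread.
sample = [
    "2014-01-24 16:05:00,021 [Thread-133] DEBUG com.openkm.extractor.RegisteredExtractors - getText(/okm:root/Test.pdf, application/pdf, null, java.io.FileInputStream@fad411)",
    "2014-01-24 16:05:00,038 [Thread-133] WARN  com.openkm.extractor.PdfTextExtractor - PDF does not contains text layer",
    "2014-01-24 16:05:05,990 [Thread-133] DEBUG com.openkm.extractor.CuneiformTextExtractor - TEXT: ",
]

for name in extractor_loggers(sample):
    print(name)
```

For the sample lines, CuneiformTextExtractor is listed even though the `system.ocr` setting points at tesseract, which matches the mismatch described in this post.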

Re: Problem with Text Extractor

Posted: Wed Jan 29, 2014 6:22 pm
by jllort
From the next version it will be an independent class, and it will no longer be necessary to change it.

For an explanation of the current behaviour, see our documentation at http://wiki.openkm.com/index.php/OCR