Problem with Text Extractor
Posted: Fri Jan 24, 2014 9:19 am
Hello all!
TextExtractor doesn't work correctly.
My OS: Ubuntu server 12.04 32bit
Java version "1.6.0_45"
Java(TM) SE Runtime Environment (build 1.6.0_45-b06)
I used these instructions: http://bonoty.info/index.php/OpenKM
My OpenKM: openkm-6.2.5-community-tomcat-bundle.
Settings:
Code:
system.ghostscript.ps2pdf: /usr/bin/ps2pdf
system.imagemagick.convert: /usr/bin/convert
system.ocr: /usr/bin/tesseract -l rus+ukr+eng ${fileIn} ${fileOut}
system.openoffice.dictionary: /opt/ru_RU.oxt
system.openoffice.path: /usr/lib/libreoffice

I checked all the paths.
When I run it manually, it works fine.
For example:
Code:
root@openkm:/opt/tomcat-7.0.27/temp# /usr/bin/convert Безпроводные\ точки\ доступа.pdf test.jpg
root@openkm:/opt/tomcat-7.0.27/temp# /usr/bin/tesseract -l rus test.jpg test
Tesseract Open Source OCR Engine v3.02 with Leptonica
root@openkm:/opt/tomcat-7.0.27/temp# ls -l
total 984
-rw-r--r-- 1 root root 551531 Янв 24 11:06 test.jpg
-rw-r--r-- 1 root root 3457 Янв 24 11:06 test.txt
-rw------- 1 serg serg 447356 Янв 23 11:25 Безпроводные точки доступа.pdf

File test.txt is correct.
My log:
Code:
2014-01-24 08:23:47,637 [http-bio-0.0.0.0-8080-exec-9] INFO com.openkm.vernum.VersionNumerationFactory - VersionNumerationAdapter: com.openkm.vernum.MajorMinorVersionNumerationAdapter
2014-01-24 08:25:00,012 [Thread-352] DEBUG com.openkm.extractor.TextExtractorWorker - *** Begin text extraction ***
2014-01-24 08:25:00,016 [Thread-352] DEBUG com.openkm.extractor.TextExtractorWorker - processSerial(null, 10)
2014-01-24 08:25:00,048 [Thread-352] INFO com.openkm.extractor.TextExtractorWorker - processSerial.Working on {docUuid=2287a0b2-ce24-475a-9978-e34f3fe9df46, docPath=/okm:root/Разрешения/2014/Разрешение - тест.pdf, docVerUuid=891d23f8-8510-4bf0-aa06-b202afcc1111, date=Fri Jan 24 08:23:47 EET 2014}
2014-01-24 08:25:00,051 [Thread-352] DEBUG com.openkm.extractor.RegisteredExtractors - getText(/okm:root/Разрешения/2014/Разрешение - тест.pdf, application/pdf, null, java.io.FileInputStream@228a14)
2014-01-24 08:25:00,698 [Thread-352] DEBUG com.openkm.extractor.PdfTextExtractor - TextStripped: ''
2014-01-24 08:25:00,698 [Thread-352] WARN com.openkm.extractor.PdfTextExtractor - PDF does not contains text layer
2014-01-24 08:25:00,699 [Thread-352] DEBUG com.openkm.extractor.PdfTextExtractor - Writing image: /opt/tomcat-7.0.27/temp/image04987538163922428421.jpg
2014-01-24 08:25:01,094 [Thread-352] WARN com.openkm.util.ExecutionUtils - Abnormal program termination: 2
2014-01-24 08:25:01,095 [Thread-352] WARN com.openkm.util.ExecutionUtils - CommandLine: [/usr/bin/tesseract, , -l, rus, /opt/tomcat-7.0.27/temp/image04987538163922428421.jpg, /opt/tomcat-7.0.27/temp/okm4760634622237429462.txt]
2014-01-24 08:25:01,095 [Thread-352] WARN com.openkm.util.ExecutionUtils - STDERR:
2014-01-24 08:25:01,115 [Thread-352] INFO com.openkm.util.DocumentUtils - Using OpenOffice dictionary: /opt/ru_RU.oxt
2014-01-24 08:25:01,145 [Thread-352] DEBUG com.openkm.extractor.CuneiformTextExtractor - TEXT:
2014-01-24 08:25:01,145 [Thread-352] DEBUG com.openkm.extractor.PdfTextExtractor - OCR Extracted:
2014-01-24 08:25:01,146 [Thread-352] WARN com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/Разрешения/2014/Разрешение - тест.pdf': Too few text extracted
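
One difference I can see between my manual test and the log: the CommandLine entry lists an empty argument between /usr/bin/tesseract and -l, which my manual command did not have. I have not confirmed that this is the cause. Below is a minimal sketch for comparing the two argument lists by hand. The helper show_args and the output name test_empty are hypothetical names made up for this test, and test.jpg is the image from my manual run.

```shell
#!/bin/sh
# Print every argument in brackets, one per line,
# so that an empty argument shows up as [].
show_args() {
    for arg in "$@"; do
        printf '[%s]\n' "$arg"
    done
}

# Argument list as I typed it in the manual test (5 items).
show_args /usr/bin/tesseract -l rus test.jpg test

# Argument list as it appears in the OpenKM log
# (6 items, the second one is empty).
show_args /usr/bin/tesseract '' -l rus test.jpg test_empty

# To replay the logged form for real, uncomment the two lines below
# and run them in the directory that contains test.jpg.
# /usr/bin/tesseract '' -l rus test.jpg test_empty
# echo "exit status: $?"
```

If the uncommented tesseract call also fails, that would suggest the empty argument matters, and the value of system.ocr would be the first place to look for a stray extra space. This is only a guess based on the log.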