PDF Text Extractor

Posted: Wed Jan 18, 2012 8:02 pm
by michael.schefczyk
Dear All,

In my 5.1.8 installation, the only major thing not working is searching for text in PDF files. Previewing works with all major file types. Tesseract also works well from the command line on TIF files. However, when I upload a PDF file, the terminal shows the following warnings:
Code:
    18:19:59,453 WARN  [PdfTextExtractor] PDF does not contains text layer
    18:19:59,455 WARN  [RegisteredExtractors] There was a problem extracting text from '/okm:root/testpdf.pdf'
The file in question does yield full OCR/search results on the demo machine.

Can someone please point me to what to look for?

Thanks a lot

Michael

Re: PDF Text Extractor

Posted: Thu Jan 19, 2012 11:17 am
by jllort
What is your Tesseract parameter configuration? I ask because I think there was a bug in 5.1.8 that was fixed in 5.1.9.
And which Tesseract version are you using, 2.x or 3.x?

Re: PDF Text Extractor

Posted: Thu Jan 19, 2012 10:20 pm
by michael.schefczyk
My configuration for system.ocr is /usr/local/bin/tesseract ${fileIn} ${fileOut} -l deu
Omitting the -l deu and the ${fileIn} ${fileOut} does not make things better.

The version of tesseract in use is 3.01.
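
For checking the setting by hand, here is a minimal sketch of how a command template like the system.ocr value quoted in this post could be expanded into an argument list. It is an illustration only, not OpenKM's actual code: the function name expand_ocr_command and the /tmp paths are hypothetical, and it assumes that ${fileIn} and ${fileOut} are simple placeholders for the input image and the output base name.

```python
import shlex
from string import Template
from typing import List


def expand_ocr_command(template: str, file_in: str, file_out: str) -> List[str]:
    """Expand ${fileIn} and ${fileOut} in a command template.

    The template is split into tokens first and each token is filled in
    afterwards, so a path containing spaces stays a single argument.
    """
    tokens = shlex.split(template)
    return [
        Template(token).substitute(fileIn=file_in, fileOut=file_out)
        for token in tokens
    ]


# The value quoted in this post; the paths below are made up for the example.
SYSTEM_OCR = "/usr/local/bin/tesseract ${fileIn} ${fileOut} -l deu"
args = expand_ocr_command(SYSTEM_OCR, "/tmp/testpage.tif", "/tmp/testpage")
print(" ".join(args))
```

The resulting list can be passed to subprocess.run(args) to reproduce the call outside the server and compare it with what works on the command line.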

Re: PDF Text Extractor

Posted: Mon Jan 23, 2012 9:12 pm
by pavila
If the PDF extractor does not find any text, it performs OCR, but using the Cuneiform text extractor. Currently it does not work with Tesseract.

I have created the issue http://issues.openkm.com/view.php?id=2020 to handle the improvement.
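
The behaviour described in this post can be pictured roughly as follows. This is a simplified sketch under stated assumptions, not the real PdfTextExtractor: the function extract_pdf_text and its parameters are hypothetical, and the OCR step is passed in as a plain callable that stands for whichever engine the extractor is actually wired to.

```python
from typing import Callable, Optional


def extract_pdf_text(text_layer: Optional[str],
                     ocr_fallback: Optional[Callable[[], str]]) -> str:
    """Return the PDF text layer, or fall back to OCR when it is empty."""
    if text_layer and text_layer.strip():
        # The PDF already carries searchable text, so no OCR is needed.
        return text_layer
    if ocr_fallback is None:
        # No usable OCR engine is wired in for PDFs, so extraction fails.
        raise RuntimeError("PDF has no text layer and no OCR fallback is available")
    # Hypothetical hook: run the configured OCR engine on the rendered pages.
    return ocr_fallback()
```

In this picture, a scanned PDF with no text layer only becomes searchable if the fallback engine is one the PDF extractor actually supports.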