Open Source Document Management System | OpenKM

Posted: **Wed Jan 18, 2012 8:02 pm**

Dear All,

In my 5.1.8 installation, the only major thing not working is searching for text in PDF files. Previewing works with all major file types, and Tesseract works well from the command line on .tif files. However, when I upload a PDF file, the terminal shows the following warnings:

Code:

    18:19:59,453 WARN  [PdfTextExtractor] PDF does not contains text layer
    18:19:59,455 WARN  [RegisteredExtractors] There was a problem extracting text from '/okm:root/testpdf.pdf'

The file in question does yield full OCR/search results on the demo machine.

Can someone please point me to what to look for?

Thanks a lot

Michael

Posted: **Thu Jan 19, 2012 11:17 am**

What is your Tesseract parameter configuration? I ask because I think there was a bug in 5.1.8 that was fixed in 5.1.9.
And which Tesseract version are you using, 2.x or 3.x?

Posted: **Thu Jan 19, 2012 10:20 pm**

My configuration for system.ocr is /usr/local/bin/tesseract ${fileIn} ${fileOut} -l deu
Omitting the -l deu and the ${fileIn} ${fileOut} does not make things better.

The version of tesseract in use is 3.01.
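
For readers unfamiliar with this kind of setting, here is a minimal sketch of how a command template such as the system.ocr value quoted above could be expanded into an argument list. This is not OpenKM's actual code: the class name `OcrCommandTemplate`, the method `expand`, and the example file paths are all assumptions made for illustration. Only the `${fileIn}` and `${fileOut}` placeholders and the Tesseract command line come from the post.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, NOT part of OpenKM: shows one way a template like
// "/usr/local/bin/tesseract ${fileIn} ${fileOut} -l deu" might be expanded.
public class OcrCommandTemplate {

    // Splits the template on whitespace and substitutes the two placeholders
    // token by token, so paths are passed as single arguments.
    public static List<String> expand(String template, String fileIn, String fileOut) {
        List<String> args = new ArrayList<>();
        for (String token : template.trim().split("\\s+")) {
            args.add(token.replace("${fileIn}", fileIn)
                          .replace("${fileOut}", fileOut));
        }
        return args;
    }

    public static void main(String[] unused) {
        // Example paths are assumptions, chosen only for the demonstration.
        List<String> cmd = expand(
                "/usr/local/bin/tesseract ${fileIn} ${fileOut} -l deu",
                "/tmp/page.tif", "/tmp/page");
        System.out.println(cmd);
        // The list could then be handed to new ProcessBuilder(cmd).start().
    }
}
```

Note that Tesseract treats its second argument as an output base name and appends `.txt` to it, so whatever consumes the result has to read `<fileOut>.txt`.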

Posted: **Mon Jan 23, 2012 9:12 pm**

When the PDF extractor does not find any text, it performs OCR, but it uses the Cuneiform text extractor to do so. Currently it does not work with Tesseract.

I have created the issue http://issues.openkm.com/view.php?id=2020 to handle the improvement.
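
The behaviour described in this reply (try the text layer first, fall back to OCR only when nothing is found) can be sketched roughly as follows. This is an illustration under stated assumptions, not the OpenKM implementation: the class `PdfTextFallback` and its `extract` method are hypothetical, and the text-layer reader and OCR engine are passed in as plain functions so that either Cuneiform or Tesseract could stand behind the `ocr` argument.

```java
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch, NOT OpenKM code: text layer first, OCR as fallback.
public class PdfTextFallback {

    // textLayer returns the embedded text if the PDF has one;
    // ocr is whichever OCR engine is wired in (assumed pluggable here).
    public static String extract(byte[] pdf,
                                 Function<byte[], Optional<String>> textLayer,
                                 Function<byte[], String> ocr) {
        Optional<String> text = textLayer.apply(pdf)
                .filter(s -> !s.trim().isEmpty()); // blank text counts as "no text layer"
        return text.orElseGet(() -> ocr.apply(pdf)); // OCR runs only when needed
    }

    public static void main(String[] unused) {
        // Stand-in functions for the demonstration; a scanned PDF has no text layer.
        String result = extract(new byte[0],
                pdf -> Optional.<String>empty(),
                pdf -> "text recovered by OCR");
        System.out.println(result);
    }
}
```

Keeping the OCR step behind a function argument is one way the improvement requested in the issue could be approached, since the fallback would then not be tied to a single engine.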

PDF Text Extractor