• PDF Text Extractor

  • Problems with installing OpenKM? No problemo, the solution is closer than you think.
 #13559  by michael.schefczyk
 
Dear All,

In my 5.1.8 installation, the only major thing not working is searching for text in PDF files. Previewing works with all major file types. Tesseract also works well from the command line on TIFF files. However, when I upload a PDF file, the terminal shows the following warnings:
    18:19:59,453 WARN  [PdfTextExtractor] PDF does not contains text layer
    18:19:59,455 WARN  [RegisteredExtractors] There was a problem extracting text from '/okm:root/testpdf.pdf'
The same file does yield full OCR/search results on the demo machine.

Can someone please point me to what to look for?

Thanks a lot

Michael
 #13572  by jllort
 
What is your Tesseract parameter configuration? I think there was a bug in 5.1.8 that was fixed in 5.1.9.
And which Tesseract version are you using, 2.x or 3.x?
 #13582  by michael.schefczyk
 
My configuration for system.ocr is /usr/local/bin/tesseract ${fileIn} ${fileOut} -l deu
Omitting the -l deu and the ${fileIn} ${fileOut} does not make things better.

The version of tesseract in use is 3.01.
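
For anyone reading along, here is a rough sketch of how a command template like the system.ocr value above could be expanded into an actual Tesseract call. This is not OpenKM's code: the function name expand_ocr_command and the file paths are made up for illustration, and I am only assuming that the ${fileIn} and ${fileOut} placeholders are replaced by the input image and the output base name.

```python
import shlex
from string import Template
from typing import List


def expand_ocr_command(template: str, file_in: str, file_out: str) -> List[str]:
    """Fill the ${fileIn} / ${fileOut} placeholders and split into argv.

    Hypothetical helper, not part of OpenKM. The values are shell-quoted
    before substitution so that paths containing spaces stay one argument.
    """
    command = Template(template).substitute(
        fileIn=shlex.quote(file_in),
        fileOut=shlex.quote(file_out),
    )
    return shlex.split(command)


if __name__ == "__main__":
    # Illustrative paths only.
    argv = expand_ocr_command(
        "/usr/local/bin/tesseract ${fileIn} ${fileOut} -l deu",
        "/tmp/scan.tif",
        "/tmp/scan",
    )
    print(argv)
```

Under that assumption, a template without the placeholders would expand to a command that names no input or output file at all, so leaving them out would not be expected to help.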
 #13609  by pavila
 
If the PDF extractor does not find any text, it performs OCR, but using the Cuneiform text extractor. Currently this does not work with Tesseract.

I have created the issue http://issues.openkm.com/view.php?id=2020 to handle the improvement.
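
To make the behaviour described above easier to picture, here is a minimal sketch of the fallback logic: try the text layer first, and only run OCR when nothing is found. It is not the actual PdfTextExtractor code. The names extract_pdf_text, read_text_layer and ocr_fallback are hypothetical, and the two callables stand in for whatever text-layer reader and OCR engine are configured.

```python
from typing import Callable, Optional


def extract_pdf_text(
    path: str,
    read_text_layer: Callable[[str], str],
    ocr_fallback: Optional[Callable[[str], str]] = None,
) -> str:
    """Return the PDF's text layer, falling back to OCR when it is empty.

    Hypothetical sketch: both callables are placeholders supplied by the
    caller, not real OpenKM or Tesseract APIs.
    """
    text = read_text_layer(path).strip()
    if text:
        return text
    if ocr_fallback is None:
        # No OCR engine is wired in for this case, so report the failure.
        raise ValueError("PDF does not contain a text layer: " + path)
    return ocr_fallback(path)


if __name__ == "__main__":
    # Stub callables simulating a scanned PDF with no text layer.
    result = extract_pdf_text(
        "/okm:root/testpdf.pdf",
        read_text_layer=lambda p: "",
        ocr_fallback=lambda p: "text recognised by OCR",
    )
    print(result)
```

In this picture, the improvement tracked in the issue would amount to letting the OCR fallback use Tesseract instead of only Cuneiform.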
