• Disable PDF to image conversion before OCR

OpenKM has many interesting features, but requires some configuration process to show its full potential.
 #44469  by SaschaK89
 
Hi there,
I've written an application that converts PDFs to multi-page TIFF files with Ghostscript and then runs OCR on them with Tesseract 4 alpha.
When I run the application on its own, the accuracy is very good, but when I upload the same file to OpenKM CE 6.3, it converts the PDF to PNG or JPG at a low resolution and the accuracy is close to 0%.
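
For readers who want to reproduce this kind of pipeline, here is a minimal sketch in Python of the two steps described above (it is not the PDFOcr application itself). The executable names, the `tiffg4` output device, the 300 dpi resolution and the `eng` language are all assumptions; adjust them to your own installation.

```python
import subprocess
from pathlib import Path


def ghostscript_command(pdf, tiff, dpi=300, gs_exe="gswin32c"):
    """Build the Ghostscript call that renders a PDF to one multi-page TIFF.

    gs_exe, the tiffg4 device and the 300 dpi default are assumptions.
    """
    return [
        gs_exe,
        "-q",                      # suppress startup messages
        "-dNOPAUSE",               # do not wait between pages
        "-dBATCH",                 # exit when the input is processed
        "-sDEVICE=tiffg4",         # bilevel (black and white) multi-page TIFF
        f"-r{dpi}",                # rendering resolution in dpi
        f"-sOutputFile={tiff}",
        str(pdf),
    ]


def tesseract_command(tiff, out_base, lang="eng", tesseract_exe="tesseract"):
    """Build the Tesseract call; Tesseract writes the text to out_base + '.txt'."""
    return [tesseract_exe, str(tiff), str(out_base), "-l", lang]


def ocr_pdf(pdf, work_dir):
    """Run both steps and return the recognized text.

    Requires Ghostscript and Tesseract to be installed and on the PATH.
    """
    work_dir = Path(work_dir)
    tiff = work_dir / "pages.tif"
    out_base = work_dir / "pages"
    subprocess.run(ghostscript_command(pdf, tiff), check=True)
    subprocess.run(tesseract_command(tiff, out_base), check=True)
    return (work_dir / "pages.txt").read_text(encoding="utf-8")
```

Rendering at a fixed, fairly high resolution is the main point here: the OCR engine then always sees page images of the same quality, regardless of how the images are stored inside the PDF.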

I want to disable this conversion before OCR and use my application directly on the PDF.
What do I have to do to deactivate it?
My application takes the same parameters as Tesseract.
(C:\Programme\Tesseract-OCR\PDFOcr.exe ${fileIn} ${fileOut})

My OS is Windows XP SP3 on an embedded machine (not Windows XP Embedded!)

Please help me.
 #44478  by SaschaK89
 
I think I have to write my own text extractor and deactivate the pdftextextractor, right?

Where can I post my code for it?
Sorry for this question, but I'm a .NET engineer and don't use Git or anything like it.
Are you all interested in my PDFOcr application?

best regards
 #44490  by jllort
 
1- Ensure you are working with the latest code.
2- Fork the project https://github.com/openkm/document-management-system
3- Apply your changes in your 6.3-DEV branch
4- Submit a merge request from your project to ours, into the 6.3-DEV branch

That's all.

I checked the source code, and I think we extract the images embedded in the PDF to PNG with the PDFBox library. I assume you are converting the original images to TIFF for better performance with the OCR engine?
Have you configured the system.pdfimages configuration property? The tool will help you extract the raw images in their original format. The parameter value should be something like:
Code:
/home/openkm/tomcat-7.0.61/bin/pdfimages -j -f ${firstPage} -l ${lastPage} ${fileIn} ${imageRoot}
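
To make the role of the placeholders clearer, here is a small Python sketch of how a command template like the one above could be expanded into an argument list. This only illustrates the idea and is not OpenKM's actual implementation; the function name `expand_command` and the example paths are hypothetical.

```python
from string import Template


def expand_command(template, **values):
    """Expand ${name} placeholders in a command template.

    The template is split on whitespace first and each token is expanded
    separately, so a substituted path that contains spaces still ends up
    as a single argument. A missing placeholder value raises KeyError.
    """
    return [Template(token).substitute(values) for token in template.split()]


# Hypothetical values, for illustration only.
argv = expand_command(
    "/home/openkm/tomcat-7.0.61/bin/pdfimages -j -f ${firstPage} -l ${lastPage} "
    "${fileIn} ${imageRoot}",
    firstPage=1,
    lastPage=3,
    fileIn="/tmp/my scan.pdf",
    imageRoot="/tmp/img",
)
print(argv)
```

The same expansion would apply to a custom OCR command such as the PDFOcr.exe call with `${fileIn}` and `${fileOut}` mentioned earlier in the thread.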
