Disable PDF to Image convert before ocr

Posted: Wed Aug 02, 2017 1:52 pm
by SaschaK89
Hi there,
I've written an application that converts PDFs to multi-page TIFF files with Ghostscript and then runs OCR on them with Tesseract 4 alpha.
When I run the application directly, the accuracy is very good, but when I upload the same file to OpenKM CE 6.3, it converts the PDF to low-resolution PNG or JPG images and the accuracy is close to 0%.

I want to disable the conversion before OCR and have my application work on the PDF directly.
What do I have to do to deactivate it?
My application takes the same parameters as tesseract.
(C:\Programme\Tesseract-OCR\PDFOcr.exe ${fileIn} ${fileOut})
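
For readers who want to try the same idea, here is a minimal sketch of the pipeline described above: render the PDF to one multi-page TIFF with Ghostscript, then hand that TIFF to Tesseract. It is not the actual PDFOcr.exe, whose source is not shown in this thread. The executable names, the `tiffgray` device, the 300 dpi resolution, the `eng` language and all function names are assumptions chosen for illustration.

```python
import subprocess
import tempfile
from pathlib import Path

# Assumed executable names; adjust to the local installation.
# On Windows the Ghostscript console binary is usually gswin32c or gswin64c.
GS = "gswin32c"
TESSERACT = "tesseract"


def ghostscript_cmd(pdf, tiff, dpi=300):
    """Build the Ghostscript call that renders all pages into one TIFF file.

    Without a %d pattern in the output name, the TIFF devices write
    every page into the same (multi-page) file.
    """
    return [
        GS,
        "-q", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=tiffgray",       # assumed device: 8-bit grayscale TIFF
        f"-r{dpi}",                # render resolution in dpi
        f"-sOutputFile={tiff}",
        str(pdf),
    ]


def tesseract_cmd(tiff, out_base, lang="eng"):
    """Build the Tesseract call; Tesseract appends .txt to out_base itself."""
    return [TESSERACT, str(tiff), str(out_base), "-l", lang]


def run_ocr(pdf, out_base, dpi=300, lang="eng"):
    """Render the PDF to a temporary TIFF, then OCR it into out_base.txt."""
    with tempfile.TemporaryDirectory() as tmp:
        tiff = Path(tmp) / "pages.tif"
        subprocess.run(ghostscript_cmd(pdf, tiff, dpi), check=True)
        subprocess.run(tesseract_cmd(tiff, out_base, lang), check=True)
```

Calling `run_ocr("scan.pdf", "scan")` would then leave the recognized text in `scan.txt`, provided both tools are installed and on the PATH. The rendering resolution is the setting most likely to matter here, since low-resolution page images are what the post blames for the poor accuracy.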

My OS is Windows XP SP3 on an embedded machine (not Windows XP Embedded!)

Please help me.

Re: Disable PDF to Image convert before ocr

Posted: Thu Aug 03, 2017 6:13 pm
by SaschaK89
I think I have to write my own text extractor and deactivate the pdftextextractor, right?

Where can I post my code for it?
Sorry for the question, but I'm a .NET engineer and don't use git or anything like it.
Are you all interested in my PDFOcr application?

best regards

Re: Disable PDF to Image convert before ocr

Posted: Sat Aug 05, 2017 5:06 pm
by jllort
1- Make sure you are working with the latest code.
2- Fork the project https://github.com/openkm/document-management-system
3- Apply your changes in the 6.3-DEV branch of your fork.
4- Submit a merge request from your fork to the 6.3-DEV branch of our project.

That's all.

I checked the source code, and I think we extract the images embedded in the PDF to PNG with the pdfbox library. As I understand it, you are converting the original images to TIFF to get better results from the OCR engine?
Have you set the configuration property system.pdfimages? That tool will help you extract the raw images in their original format. The parameter value should be something like:
/home/openkm/tomcat-7.0.61/bin/pdfimages -j -f ${firstPage} -l ${lastPage} ${fileIn} ${imageRoot}
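
To make the placeholders in that value concrete, the sketch below fills them in with Python's `string.Template`, which happens to use the same `${name}` syntax. It only illustrates what the final command line might look like. How OpenKM itself performs the substitution is not shown in this thread, and the page numbers, file paths and the `expand` helper are hypothetical.

```python
from string import Template

# The property value quoted in the post above.
PDFIMAGES = Template(
    "/home/openkm/tomcat-7.0.61/bin/pdfimages -j "
    "-f ${firstPage} -l ${lastPage} ${fileIn} ${imageRoot}"
)


def expand(template, **values):
    """Fill in every ${...} placeholder; raises KeyError if one is missing."""
    return template.substitute(**values)


# Hypothetical values, for illustration only.
cmd = expand(
    PDFIMAGES,
    firstPage=1,
    lastPage=3,
    fileIn="/tmp/scan.pdf",
    imageRoot="/tmp/scan-img",
)
print(cmd)
```

With these values, pdfimages would be asked to extract the images from pages 1 to 3 of `/tmp/scan.pdf`, using `/tmp/scan-img` as the prefix for the output file names. The `-j` option tells it to write JPEG-encoded images as JPEG files instead of converting them.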