Open Source Document Management System | OpenKM - Disable PDF to Image convert before ocr

Disable PDF to Image convert before ocr


#44469 by SaschaK89
Wed Aug 02, 2017 1:52 pm

Hi there,
I've written an application that converts PDFs to multi-page TIFF files with Ghostscript and then runs OCR on them with Tesseract 4 alpha.
When I run the application on its own, the accuracy is very good, but when I upload the same file to OpenKM CE 6.3, it converts the PDF to PNG or JPG at a low resolution and the accuracy is close to 0%.

I want to disable this conversion step before OCR and have OpenKM pass the PDF directly to my application.
What do I have to do to deactivate it?
My application takes the same parameters as tesseract:
(C:\Programme\Tesseract-OCR\PDFOcr.exe ${fileIn} ${fileOut})

My OS is Windows XP SP3 on an embedded machine (not Windows XP Embedded).

Please help me.
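
For readers following along, the pipeline described in this post (PDF to multi-page TIFF with Ghostscript, then Tesseract on the TIFF) might look roughly like the sketch below. It is not the poster's PDFOcr.exe, whose source is not shown in the thread. The function names, the 300 dpi default, the `tiffg4` device and the `gswin32c` executable name are all assumptions chosen for illustration. The script takes the same two arguments as the command above: an input file and an output base name.

```python
import subprocess
import sys


def ghostscript_cmd(pdf_path, tiff_path, dpi=300, gs_exe="gswin32c"):
    """Build a Ghostscript call that renders every PDF page into one TIFF file.

    gs_exe, dpi and the tiffg4 device are assumptions, not values from the thread.
    """
    return [
        gs_exe,
        "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=tiffg4",                      # 1-bit CCITT G4, multi-page output
        "-r{}".format(dpi),                     # render resolution in dpi
        "-sOutputFile={}".format(tiff_path),
        str(pdf_path),
    ]


def tesseract_cmd(tiff_path, out_base, lang="eng", tesseract_exe="tesseract"):
    """Build the Tesseract call; Tesseract writes its text to out_base + '.txt'."""
    return [tesseract_exe, str(tiff_path), str(out_base), "-l", lang]


def ocr_pdf(file_in, file_out):
    """Run both steps in sequence, stopping if either tool reports an error."""
    tiff_path = str(file_out) + ".tif"          # hypothetical intermediate file name
    subprocess.check_call(ghostscript_cmd(file_in, tiff_path))
    subprocess.check_call(tesseract_cmd(tiff_path, file_out))


if __name__ == "__main__" and len(sys.argv) == 3:
    ocr_pdf(sys.argv[1], sys.argv[2])
```

Rendering at a fixed, fairly high resolution is the point of the sketch: it controls the input quality that Tesseract sees, which is what the low-resolution PNG/JPG conversion described above takes away.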


Re: Disable PDF to Image convert before ocr

#44478 by SaschaK89
Thu Aug 03, 2017 6:13 pm

I think I have to write my own text extractor and deactivate the pdftextextractor, right?

Where can I post my code for it?
Sorry for the question, but I'm a .NET engineer and don't use git or anything like it.
Is anyone interested in my PDFOcr application?

best regards


Re: Disable PDF to Image convert before ocr

#44490 by jllort
Sat Aug 05, 2017 5:06 pm

1- Make sure you are working with the latest code.
2- Fork the project https://github.com/openkm/document-management-system
3- Apply your changes in the 6.3-DEV branch of your fork.
4- Submit a merge request from your fork to our 6.3-DEV branch.

That's all.

I checked the source code, and I think we extract the images embedded in the PDF to PNG with the PDFBox library. So I assume you are converting the original images to TIFF to get better results from the OCR engine?
Have you configured the system.pdfimages configuration property? That tool extracts the images raw, in their original format. The parameter value should be something like:

/home/openkm/tomcat-7.0.61/bin/pdfimages -j -f ${firstPage} -l ${lastPage} ${fileIn} ${imageRoot}
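
To make the placeholders in that value concrete, here is a small sketch of how such a template could be expanded into an argument list. It only illustrates the `${...}` syntax and is not OpenKM's own substitution code, which is not shown in this thread. The `expand` helper and the sample page range and paths are hypothetical. In pdfimages itself, `-j` writes JPEG-encoded images out as JPEG files, `-f` and `-l` give the first and last page, and the final argument is the prefix for the extracted image files.

```python
import shlex
from string import Template

# The property value quoted in the post above.
PDFIMAGES_TEMPLATE = (
    "/home/openkm/tomcat-7.0.61/bin/pdfimages -j "
    "-f ${firstPage} -l ${lastPage} ${fileIn} ${imageRoot}"
)


def expand(template, **values):
    """Split the template into tokens first, then fill in each placeholder.

    Splitting before substituting keeps a path that contains spaces
    together as a single argument.
    """
    return [Template(token).substitute(values) for token in shlex.split(template)]


# Sample values below are made up for illustration.
args = expand(
    PDFIMAGES_TEMPLATE,
    firstPage=1,
    lastPage=3,
    fileIn="/tmp/my scan.pdf",
    imageRoot="/tmp/scan-img",
)
print(args)
```

The resulting list can be passed straight to something like `subprocess.check_call(args)` without going through a shell.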
