Open Source Document Management System | OpenKM - OCR function, PNG works except for PDF files

#31357 by fsouren
Fri Feb 20, 2015 12:28 pm

Hi,

I've got everything up and running on Ubuntu 14.04. The problem is OCR'ing PDF files.
If I convert them to PNG files on the command line with "convert -density 200 -quality 90" and upload them to OpenKM, everything is recognized fine.
But if I upload the source PDF file, I only get garbage text and can't full-text search the document.

Settings are:

Code:

org.apache.jackrabbit.extractor.PlainTextExtractor 
org.apache.jackrabbit.extractor.MsWordTextExtractor 
org.apache.jackrabbit.extractor.MsExcelTextExtractor 
org.apache.jackrabbit.extractor.MsPowerPointTextExtractor 
org.apache.jackrabbit.extractor.OpenOfficeTextExtractor 
org.apache.jackrabbit.extractor.RTFTextExtractor 
org.apache.jackrabbit.extractor.HTMLTextExtractor 
org.apache.jackrabbit.extractor.XMLTextExtractor 
org.apache.jackrabbit.extractor.PngTextExtractor 
org.apache.jackrabbit.extractor.MsOutlookTextExtractor 
com.openkm.extractor.Tesseract3TextExtractor 
com.openkm.extractor.PdfTextExtractor 
com.openkm.extractor.AudioTextExtractor 
com.openkm.extractor.ExifTextExtractor 
com.openkm.extractor.SourceCodeTextExtractor 
com.openkm.extractor.MsOffice2007TextExtractor

Code:

system.imagemagick.convert	String	/usr/bin/convert -density 200 -quality 90
system.ocr	String	/usr/bin/tesseract ${fileIn} ${fileOut} -l nld
system.swftools.pdf2swf	String	/opt/openkm-6.3.0-community/tomcat/bin/pdf2swf -f -T 9 -t -s storeallcharacters ${fileIn} -o ${fileOut}
system.openoffice.dictionary	String		
system.openoffice.path	String	/usr/lib/libreoffice
system.pdf.force.ocr	Boolean	Inactive

Does anyone have the golden answer for me?

Username: fsouren | Rank: Junior Boarder | Posts: 20 | Joined: Fri Feb 20, 2015 12:22 pm

#31375 by jllort
Sat Feb 21, 2015 10:18 am

Is this a migration or a new installation? (I ask because I suspect org.apache.jackrabbit should be com.openkm, although that is not the reason it is failing.) Upgrade to the nightly build, because it contains some corrections; you can find it at integration.openkm.com. The upgrade guide is here: http://wiki.openkm.com/index.php/Migrat ... 3_to_6.3.1

Can you upload the two files here in a zip (for testing purposes)?

Username: jllort | Rank: Moderator | Posts: 12185 | Joined: Fri Dec 21, 2007 11:23 am | Location: Sineu - ( Illes Balears ) - Spain

#31382 by fsouren
Sun Feb 22, 2015 7:21 am

Thanks for the reply! It's a fresh new install.

I can stop Tomcat and edit OpenKM.cfg, but how can I execute the query when OpenKM/Tomcat is not running?

I can do a fresh install of the new .WAR file is included in the ISO file, if that is necessary.

#31387 by jllort
Sun Feb 22, 2015 6:07 pm

If you replace OpenKM.war with the nightly build (integration.openkm.com) and set hibernate.hbm2ddl=create, the repository and the whole database will be wiped and recreated.

#31396 by fsouren
Mon Feb 23, 2015 10:07 am

Hi,

I replaced the .WAR on a fresh install just to be sure.
I'm uploading the whole PDF I tried, and one PNG file that I extracted manually with "convert -density 200 -quality 90".
The PDF is not OCR'd correctly, but the PNG is if I upload it.

http://www.famsouren.nl/vastgoed-2.png
http://www.famsouren.nl/vastgoed_juli_2014.pdf

#31398 by fsouren
Mon Feb 23, 2015 4:00 pm

If I execute the exact same commands on the command line, OCR works well.
These are the commands as configured in OpenKM.

Code:

/usr/bin/convert -density 200 -quality 90 vastgoed_juli_2014.pdf vastgoed.png
/usr/bin/cuneiform -l dut -o ocr.txt vastgoed-2.png

Does OpenKM treat these commands differently?
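
For reference, the two commands above can be written as explicit argument lists, which is roughly what a Java program would hand to ProcessBuilder. This is only a sketch of the manual check, not a description of how OpenKM invokes these tools; the class and method names are hypothetical, while the binaries, flags, and file names are the ones quoted in this post.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of OpenKM: builds the two manual commands.
final class ManualOcrCheck {

    // Step 1: rasterize the PDF with ImageMagick at the given density (DPI).
    static List<String> convertCommand(String pdf, String png, int density, int quality) {
        return Arrays.asList("/usr/bin/convert",
                "-density", String.valueOf(density),
                "-quality", String.valueOf(quality),
                pdf, png);
    }

    // Step 2: OCR a single page image with cuneiform, writing plain text to outTxt.
    static List<String> cuneiformCommand(String lang, String outTxt, String png) {
        return Arrays.asList("/usr/bin/cuneiform", "-l", lang, "-o", outTxt, png);
    }

    public static void main(String[] args) {
        // Print both command lines so they can be compared with the ones typed by hand.
        System.out.println(String.join(" ",
                convertCommand("vastgoed_juli_2014.pdf", "vastgoed.png", 200, 90)));
        System.out.println(String.join(" ",
                cuneiformCommand("dut", "ocr.txt", "vastgoed-2.png")));
    }
}
```

For a multi-page PDF, convert writes one PNG per page (vastgoed-0.png, vastgoed-1.png, and so on), which is why the second command takes vastgoed-2.png rather than vastgoed.png.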

#31408 by fsouren
Tue Feb 24, 2015 11:36 am

Is it possible to let OpenKM execute this command:

/usr/bin/convert -density 200 -quality 90 vastgoed_juli_2014.PDF vastgoed.png

For example, by configuring it like this (I've tried it, but I don't think OpenKM runs it):

/usr/bin/convert -density 200 -quality 90 ${fileIn} ${fileOut}
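
To illustrate what such a template means, here is a minimal sketch of how a command string with ${fileIn} and ${fileOut} placeholders could be expanded into an argument list. It is an assumption about the general mechanism, not OpenKM's actual implementation; CommandTemplate and expand are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not OpenKM code: fills ${fileIn}/${fileOut} in a command template.
final class CommandTemplate {

    static List<String> expand(String template, String fileIn, String fileOut) {
        List<String> argv = new ArrayList<>();
        // Split on whitespace first, then substitute, so a path containing spaces stays one argument.
        for (String token : template.trim().split("\\s+")) {
            argv.add(token.replace("${fileIn}", fileIn).replace("${fileOut}", fileOut));
        }
        return argv;
    }

    public static void main(String[] args) {
        String template = "/usr/bin/convert -density 200 -quality 90 ${fileIn} ${fileOut}";
        System.out.println(expand(template, "/tmp/in file.pdf", "/tmp/out.png"));
    }
}
```

Substituting after splitting is a deliberate choice in this sketch: replacing the placeholders in the raw string and splitting afterwards would break any file name that contains a space.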

#31417 by fsouren
Wed Feb 25, 2015 9:40 am

I'm wondering: could it be the Dutch language?
I can upload almost any English document and its text is recognized fine.

#31451 by jllort
Fri Feb 27, 2015 5:24 pm

First of all, I suggest using Tesseract. If you're using cuneiform, the class com.openkm.extractor.Tesseract3TextExtractor is the wrong one.

#31460 by fsouren
Fri Feb 27, 2015 8:13 pm

Yeah, I'm using Tesseract.
I tried cuneiform from the command line too, to check whether it works.

I've tried both in system.ocr and in the extractors, but it makes no difference.

If I upload any document with English text, everything works fine; only scanned Dutch files fail.
Text PDFs and images in Dutch work too.

#31468 by jllort
Sun Mar 01, 2015 4:23 pm

And from the command line, does Tesseract with Dutch work correctly?

#31478 by fsouren
Mon Mar 02, 2015 7:35 am

Yes, if I convert the PDF to PNG first with the command:

Code:

convert -density 300 -quality 100

#31480 by fsouren
Mon Mar 02, 2015 9:40 am

When does OpenKM convert a PDF to an image file?
When I upload a PDF file, the tesseract process starts right away; I can't see it running convert first.

#31489 by fsouren
Wed Mar 04, 2015 7:57 am

Anyone with a suggestion where to look?

#31575 by jllort
Thu Mar 12, 2015 7:52 am

Internally we don't apply this "convert -density 300 -quality 100" optimization; we extract the image at its own resolution. In your case you will probably want more control over the process, using an automation task to do the indexing. It is not very complicated to do, but explaining all the steps you should consider is fairly involved.

Download the dev environment here: http://sourceforge.net/projects/openkmportabledev/

First, upload these files to a separate folder to be processed and, once processing has finished, move them to another one (this should be an easy solution).
/okm:root/special/controled extraction to /okm:root/special/finished

Create an automation task and, in it, move the files from controled to finished (this is the easiest part).
Now the more complex part, controlling the text extraction: convert the PDF document to PNG. Take this code as an example of how to do it:

Code:

// Copy the document content from the repository into a temporary file
File tmpJpeg = FileUtils.createTempFileFromMime(doc.getMimeType());
try (InputStream is = OKMDocument.getInstance().getContent(null, doc.getUuid(), true);
     FileOutputStream fos = new FileOutputStream(tmpJpeg)) {
    IOUtils.copy(is, fos);
}

// PBM file: convert only after both streams are closed, so the temporary file is complete
File tmpPbm = File.createTempFile("okm", ".pbm");
ImageUtils.imageMagickConvert(tmpJpeg.getPath(), tmpPbm.getPath(), "${fileIn} ${fileOut}");

Then you need to extract the text (with Tesseract), store it in OpenKM, and so on. If you're interested in this option, tell me when you get this far and I will continue explaining how to do it.

Another way to do it is to create your own text extractor for PDF: take a look at the PdfTextExtractor.java class and modify it to your needs (debug it; you will probably want to make some changes to the extracted image with the code I proposed above).
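
As a rough idea of the text-extraction step mentioned above, the sketch below runs the tesseract command line on an already converted image and reads back the text file it writes. It uses only the JDK and is not OpenKM code: TesseractStep, ocrCommand, and extractText are hypothetical names, the binary path and the nld language code are taken from the system.ocr setting in the first post, and storing the result in OpenKM is left out.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not OpenKM code: OCR one image via the tesseract CLI.
final class TesseractStep {

    // "tesseract <image> <outputBase> -l <lang>" writes its result to <outputBase>.txt
    static List<String> ocrCommand(String image, String outputBase, String lang) {
        return Arrays.asList("/usr/bin/tesseract", image, outputBase, "-l", lang);
    }

    static String extractText(File image, String lang) throws IOException, InterruptedException {
        Path base = Files.createTempFile("okm", "");   // output base name, no extension
        Path txt = Paths.get(base.toString() + ".txt");
        try {
            Process p = new ProcessBuilder(ocrCommand(image.getPath(), base.toString(), lang))
                    .inheritIO()                        // show tesseract's messages on the console
                    .start();
            if (p.waitFor() != 0) {
                throw new IOException("tesseract exited with code " + p.exitValue());
            }
            return new String(Files.readAllBytes(txt), StandardCharsets.UTF_8);
        } finally {
            // Remove both temporary files whether or not OCR succeeded.
            Files.deleteIfExists(txt);
            Files.deleteIfExists(base);
        }
    }
}
```

It would be called after the ImageMagick conversion, for example as TesseractStep.extractText(tmpPbm, "nld") with the file produced by the snippet above.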
