Open Source Document Management System | OpenKM

Posted: **Fri Feb 20, 2015 12:28 pm**

Hi,

I've got everything up and running on Ubuntu 14.04. The problem is OCR'ing PDF files.
If I convert them to PNG files on the command line with "convert -density 200 -quality 90" and upload them to OpenKM, everything gets recognized fine.
But if I upload the source PDF file, I only get garbage text and can't full-text search the document.

Settings are:

Code:

org.apache.jackrabbit.extractor.PlainTextExtractor 
org.apache.jackrabbit.extractor.MsWordTextExtractor 
org.apache.jackrabbit.extractor.MsExcelTextExtractor 
org.apache.jackrabbit.extractor.MsPowerPointTextExtractor 
org.apache.jackrabbit.extractor.OpenOfficeTextExtractor 
org.apache.jackrabbit.extractor.RTFTextExtractor 
org.apache.jackrabbit.extractor.HTMLTextExtractor 
org.apache.jackrabbit.extractor.XMLTextExtractor 
org.apache.jackrabbit.extractor.PngTextExtractor 
org.apache.jackrabbit.extractor.MsOutlookTextExtractor 
com.openkm.extractor.Tesseract3TextExtractor 
com.openkm.extractor.PdfTextExtractor 
com.openkm.extractor.AudioTextExtractor 
com.openkm.extractor.ExifTextExtractor 
com.openkm.extractor.SourceCodeTextExtractor 
com.openkm.extractor.MsOffice2007TextExtractor

Code:

system.imagemagick.convert	String	/usr/bin/convert -density 200 -quality 90
system.ocr	String	/usr/bin/tesseract ${fileIn} ${fileOut} -l nld
system.swftools.pdf2swf	String	/opt/openkm-6.3.0-community/tomcat/bin/pdf2swf -f -T 9 -t -s storeallcharacters ${fileIn} -o ${fileOut}
system.openoffice.dictionary	String		
system.openoffice.path	String	/usr/lib/libreoffice
system.pdf.force.ocr	Boolean	Inactive

Anyone who has the golden answer for me?

Posted: **Sat Feb 21, 2015 10:18 am**

Is this a migration or a new installation? (I suspect org.apache.jackrabbit should be com.openkm, although that is not the reason it is failing.) Upgrade to the nightly build, because it contains some fixes; you can find it at integration.openkm.com. The upgrade guide is here: http://wiki.openkm.com/index.php/Migrat ... 3_to_6.3.1

Can you upload the two files here in a zip (for testing purposes)?

Posted: **Sun Feb 22, 2015 7:21 am**

Thanks for the reply! It's a fresh new install.

I can stop Tomcat and edit OpenKM.cfg, but how can I execute the query when OpenKM/Tomcat is not running?

If necessary, I can do a fresh install with the new .WAR file that is included in the ISO file.

Posted: **Sun Feb 22, 2015 6:07 pm**

If you replace OpenKM.war with the nightly build (integration.openkm.com) and set hibernate.hbm2ddl=create, the repository and the whole database will be wiped and created from scratch.

Posted: **Mon Feb 23, 2015 10:07 am**

Hi,

I replaced the .WAR on a fresh install just to be sure.
I'll upload the whole PDF I tried, and one PNG file that I extracted manually with "convert -density 200 -quality 90".
The PDF is not OCR'd correctly, but the PNG is when I upload it.

http://www.famsouren.nl/vastgoed-2.png
http://www.famsouren.nl/vastgoed_juli_2014.pdf

Posted: **Mon Feb 23, 2015 4:00 pm**

If I execute the exact same commands on the command line, OCR works well.
These are the commands as configured in OpenKM.

Code:

/usr/bin/convert -density 200 -quality 90 vastgoed_juli_2014.pdf vastgoed.png
/usr/bin/cuneiform -l dut -o ocr.txt vastgoed-2.png

Does OpenKM treat these commands differently?

Posted: **Tue Feb 24, 2015 11:36 am**

Is it possible to let OpenKM execute this command:

/usr/bin/convert -density 200 -quality 90 vastgoed_juli_2014.PDF vastgoed.png

for example by configuring it like this? (I've tried it, but I don't think OpenKM uses it.)

/usr/bin/convert -density 200 -quality 90 ${fileIn} ${fileOut}
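
For illustration: the ${fileIn} and ${fileOut} placeholders used in settings such as system.ocr are presumably replaced with real file paths before the command is run. The sketch below shows that kind of substitution in plain Java. The class name PlaceholderSketch, the expand helper and the /tmp paths are hypothetical. This is not OpenKM's actual implementation, and nothing in this thread confirms that system.imagemagick.convert accepts a full template like the one above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PlaceholderSketch {
    /** Replaces every ${key} in the template with its value from the map. */
    public static String expand(String template, Map<String, String> values) {
        String result = template;
        for (Map.Entry<String, String> e : values.entrySet()) {
            // String.replace works on literal text, so "${" needs no escaping here.
            result = result.replace("${" + e.getKey() + "}", e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> values = new LinkedHashMap<>();
        values.put("fileIn", "/tmp/vastgoed_juli_2014.pdf"); // hypothetical path
        values.put("fileOut", "/tmp/vastgoed.png");          // hypothetical path
        String template = "/usr/bin/convert -density 200 -quality 90 ${fileIn} ${fileOut}";
        System.out.println(expand(template, values));
    }
}
```

Printing the expanded command like this is a cheap way to compare what a template would run against what you type by hand on the command line.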

Posted: **Wed Feb 25, 2015 9:40 am**

I'm wondering: could it be the Dutch language?
I can upload almost any English document and its text gets recognized fine.

Posted: **Fri Feb 27, 2015 5:24 pm**

First of all, I suggest using tesseract. If you're using cuneiform, the class com.openkm.extractor.Tesseract3TextExtractor is the wrong extractor.

Posted: **Fri Feb 27, 2015 8:13 pm**

Yeah, I'm using Tesseract.
I tried cuneiform from the command line too, to check whether it works.

I've tried both in system.ocr and in the extractors, but it makes no difference.

If I upload any document with English text, everything works fine; only scanned Dutch files fail.
Text PDFs and images in Dutch are working too.

Posted: **Sun Mar 01, 2015 4:23 pm**

And from the command line, does tesseract work correctly with Dutch?
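
One way to check this from Java is to ask tesseract which language packs it can see. This is a minimal sketch, assuming the installed tesseract supports the --list-langs option and prints a header line ending in a colon followed by one language code per line. The class name TesseractLangCheck and the parseLangs helper are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

public class TesseractLangCheck {
    /** Extracts language codes from the text printed by "tesseract --list-langs". */
    public static List<String> parseLangs(String output) {
        List<String> langs = new ArrayList<>();
        for (String line : output.split("\\R")) {
            String t = line.trim();
            // Skip blank lines and the header line, assumed to end with a colon.
            if (!t.isEmpty() && !t.endsWith(":")) {
                langs.add(t);
            }
        }
        return langs;
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder("/usr/bin/tesseract", "--list-langs");
        pb.redirectErrorStream(true); // merge stderr in case the list is printed there
        Process p = pb.start();
        StringBuilder out = new StringBuilder();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                out.append(line).append('\n');
            }
        }
        p.waitFor();
        System.out.println("nld installed: " + parseLangs(out.toString()).contains("nld"));
    }
}
```

Running it as the same user that runs Tomcat may be more telling than running it from your own shell, since the two environments can differ.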

Posted: **Mon Mar 02, 2015 7:35 am**

Yes, if I first convert the PDF to PNG with the command:

Code:

convert -density 300 -quality 100

Posted: **Mon Mar 02, 2015 9:40 am**

When does OpenKM convert a PDF to an image file?
When I upload a PDF file, the tesseract process starts right away; I can't see it running convert first.

Posted: **Wed Mar 04, 2015 7:57 am**

Anyone with a suggestion where to look?

Posted: **Thu Mar 12, 2015 7:52 am**

Internally we don't apply this optimization ("convert -density 300 -quality 100"); we extract the image from the PDF at its own resolution. In your case you will probably want to take more control of the process, using an automation task that does the indexing. It is not very complicated to do, but explaining all the steps you should consider is rather involved.

Download the dev environment here http://sourceforge.net/projects/openkmportabledev/

First, upload these files to a separate folder to be processed and, once processing has finished, move them to another one (this should be the easy part):
/okm:root/special/controled extraction to /okm:root/special/finished

Create an automation task that moves the documents from controled to finished (this is the easiest part).
Now the more complex part, controlling the text extraction: convert the PDF document to PNG. Take this code as an example of how to do it:

Code:

// Copy the stored document to a temporary file (OpenKM helper classes).
File tmpJpeg = FileUtils.createTempFileFromMime(doc.getMimeType());
try (InputStream is = OKMDocument.getInstance().getContent(null, doc.getUuid(), true);
     FileOutputStream fos = new FileOutputStream(tmpJpeg)) {
    IOUtils.copy(is, fos);
} // both streams are closed, and the file flushed, before the conversion runs

// PBM file that receives the converted image
File tmpPbm = File.createTempFile("okm", ".pbm");
ImageUtils.imageMagickConvert(tmpJpeg.getPath(), tmpPbm.getPath(), "${fileIn} ${fileOut}");

Then you need to extract the text (tesseract), store it in OpenKM, etc. If you're interested in this option, tell me when you get to this point and I will continue explaining how to do it.
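
As a rough idea of what the conversion and extraction steps could look like outside OpenKM, here is a minimal sketch that shells out to the same two tools used earlier in this thread. It assumes convert and tesseract live under /usr/bin and that ImageMagick's file.pdf[n] syntax is used to pick a single page. The class OcrPipelineSketch and its methods are hypothetical, and storing the resulting text in OpenKM is not covered.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.List;

public class OcrPipelineSketch {
    /** ImageMagick command that rasterizes one PDF page (0-based index) to PNG. */
    public static List<String> convertCommand(String pdf, int page, String png) {
        return Arrays.asList("/usr/bin/convert", "-density", "300", "-quality", "100",
                pdf + "[" + page + "]", png);
    }

    /** Tesseract command; tesseract appends ".txt" to the output base name. */
    public static List<String> tesseractCommand(String png, String outBase, String lang) {
        return Arrays.asList("/usr/bin/tesseract", png, outBase, "-l", lang);
    }

    /** Runs a command and fails loudly on a non-zero exit code. */
    static void run(List<String> command) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(command).inheritIO().start();
        int exit = p.waitFor();
        if (exit != 0) {
            throw new IOException("Command failed (" + exit + "): " + command);
        }
    }

    /** Converts one page to PNG, runs OCR on it and returns the recognized text. */
    public static String ocrPage(File pdf, int page, String lang)
            throws IOException, InterruptedException {
        File png = File.createTempFile("okm", ".png");
        File txt = new File(png.getPath() + ".txt"); // where tesseract writes its output
        try {
            run(convertCommand(pdf.getPath(), page, png.getPath()));
            run(tesseractCommand(png.getPath(), png.getPath(), lang));
            return new String(Files.readAllBytes(txt.toPath()), StandardCharsets.UTF_8);
        } finally {
            png.delete();
            txt.delete();
        }
    }

    public static void main(String[] args) throws Exception {
        // Usage: java OcrPipelineSketch /path/to/scan.pdf
        System.out.println(ocrPage(new File(args[0]), 0, "nld"));
    }
}
```

Passing each argument as a separate list element avoids shell quoting problems with file names that contain spaces.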

The other way to do it is to create your own text extractor for PDF: take a look at the PdfTextExtractor.java class and modify it to your needs (debug it, and you will probably want to make some changes to the extracted image with the code I proposed above).

OCR function, PNG works except for PDF files
