org.apache.pdfbox is slowing openkm completely

Posted: Fri Sep 05, 2014 1:26 pm
by Catscratch
Hi,

I've got a problem. After OpenKM starts up, I get a lot of exceptions in the log file. And by "a lot" I mean a loooooooot.
They look like this:
Code: Select all
2014-09-05 15:15:05,952 [Thread-16] ERROR org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap- java.lang.IllegalArgumentException: Raster BytePackedRaster: width = 414 height = 13 #channels 1 xOff = 0 yOff = 0 is incompatible with ColorModel IndexColorModel: #pixelBits = 1 numComponents = 3 color space = java.awt.color.ICC_ColorSpace@4443d96 transparency = 1 transIndex   = -1 has alpha = false isAlphaPre = false
java.lang.IllegalArgumentException: Raster BytePackedRaster: width = 414 height = 13 #channels 1 xOff = 0 yOff = 0 is incompatible with ColorModel IndexColorModel: #pixelBits = 1 numComponents = 3 color space = java.awt.color.ICC_ColorSpace@4443d96 transparency = 1 transIndex   = -1 has alpha = false isAlphaPre = false
	at java.awt.image.BufferedImage.<init>(BufferedImage.java:630)
	at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.getRGBImage(PDPixelMap.java:248)
	at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.write2OutputStream(PDPixelMap.java:285)
	at org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage.write2file(PDXObjectImage.java:165)
	at com.openkm.extractor.PdfTextExtractor.extractText(PdfTextExtractor.java:99)
	at com.openkm.extractor.RegisteredExtractors.getText(RegisteredExtractors.java:214)
	at com.openkm.extractor.RegisteredExtractors.getText(RegisteredExtractors.java:173)
	at com.openkm.dao.NodeDocumentDAO.textExtractorHelper(NodeDocumentDAO.java:1343)
	at com.openkm.extractor.TextExtractorWorker.processSerial(TextExtractorWorker.java:164)
	at com.openkm.extractor.TextExtractorWorker.processQueue(TextExtractorWorker.java:149)
	at com.openkm.extractor.TextExtractorWorker.run(TextExtractorWorker.java:100)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:483)
	at bsh.Reflect.invokeMethod(Reflect.java:134)
	at bsh.Reflect.invokeObjectMethod(Reflect.java:80)
	at bsh.BSHPrimarySuffix.doName(BSHPrimarySuffix.java:176)
	at bsh.BSHPrimarySuffix.doSuffix(BSHPrimarySuffix.java:120)
	at bsh.BSHPrimaryExpression.eval(BSHPrimaryExpression.java:80)
	at bsh.BSHPrimaryExpression.eval(BSHPrimaryExpression.java:47)
	at bsh.Interpreter.eval(Interpreter.java:645)
	at bsh.Interpreter.eval(Interpreter.java:739)
	at bsh.Interpreter.eval(Interpreter.java:728)
	at com.openkm.util.ExecutionUtils.runScript(ExecutionUtils.java:112)
	at com.openkm.core.Cron$RunnerBsh.run(Cron.java:103)
	at java.lang.Thread.run(Thread.java:745)
I also got a second type of ERROR:
Code: Select all
2014-09-05 15:15:40,601 [Thread-16] ERROR org.apache.pdfbox.filter.FlateFilter- Stop reading corrupt stream
So it seems PDFBox has a problem with some PDF files. But how can I fix this? I don't even know what the problem is about or which files it relates to.

But these exceptions slow OpenKM down extremely. I have to wait about 3-5 minutes to simply load a document node in the tree view.

Re: org.apache.pdfbox is slowing openkm completely

Posted: Fri Sep 05, 2014 1:48 pm
by Catscratch
OK, sorry for replying to myself so quickly, but I think I found a solution (and maybe someone else has the same problem and is looking for one). So for completeness: I updated PDFBox to version 1.8.6 and now it seems to run without problems.

OpenKM was using PDFBox 1.6.0.

Re: org.apache.pdfbox is slowing openkm completely

Posted: Sat Sep 06, 2014 9:55 am
by pavila
I can't understand why text extraction slows down the entire system, because it runs on its own thread. How many cores does your server have?

Please attach one of the PDFs that have problems.

Re: org.apache.pdfbox is slowing openkm completely

Posted: Mon Sep 08, 2014 6:26 am
by Catscratch
It's running in several threads, not only one. But after some time and a certain number of threads, the system does indeed get slow.

And sorry, but I can't attach any PDF, because I don't know which of the PDFs is causing the problem. The log only tells me that there is a problem and what type of problem it is, but not which file causes it.

But anyway. With PDFBox 1.8.6 I only get:
Code: Select all
2014-09-07 01:05:00,215 [Thread-12296] ERROR org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap- Something went wrong ... the pixelmap doesn't contain any data.
2014-09-07 01:10:00,699 [Thread-12301] ERROR org.apache.pdfbox.filter.FlateFilter- FlateFilter: stop reading corrupt stream due to a DataFormatException
So maybe there is still a problem, but PDFBox handles it, which results in way better system performance. Nothing seems to be blocked anymore.

Re: org.apache.pdfbox is slowing openkm completely

Posted: Mon Sep 08, 2014 7:45 am
by pavila
By default OpenKM creates a thread for every CPU in the system, so it can degrade overall system performance if the text extraction tasks are heavy. This is true.
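
To illustrate why one worker per CPU can starve everything else, here is a minimal sketch of a pool that keeps one core in reserve. It is not OpenKM's actual code: the class name `ExtractorPoolSketch`, the helper `workerCount`, and the idea of reserving a core are all assumptions made for illustration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not OpenKM code: size a worker pool from the CPU count.
class ExtractorPoolSketch {

    // One worker per core, minus a reserve for the rest of the system,
    // but never fewer than one worker.
    static int workerCount(int cores, int reserved) {
        return Math.max(1, cores - reserved);
    }

    public static void main(String[] args) throws InterruptedException {
        int cores = Runtime.getRuntime().availableProcessors();
        int workers = workerCount(cores, 1);

        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            final int id = i;
            // Stand-in for a heavy text extraction task.
            pool.submit(() -> System.out.println("extractor worker " + id + " running"));
        }

        // Stop accepting tasks and wait for the submitted ones to finish.
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```

With `reserved` set to 0 this matches the one-thread-per-CPU behaviour described above; with 1, a core stays free for serving the user interface.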

As for PDF text extraction, not all PDFs are built correctly, so it's common to have problems parsing them. But it usually works pretty well; I haven't seen a PDF for which text extraction does not work.

If OpenKM detects an error in text extraction, it is logged as "MISC_TEXT_EXTRACTION_FAILURE" in the OKM_ACTIVITY table, which can be listed from Administration > Activity Log.

Re: org.apache.pdfbox is slowing openkm completely

Posted: Mon Sep 08, 2014 8:16 am
by Catscratch
Ah OK. I found the log. There are a looooot of failures. Most are "Too few text extracted" and "full text indexing of <mime> is not supported". For <mime> we get different file types like dwg, zip, png and so on.

Mainly our PDFs are sketches with only a few words or no words to extract. Maybe this was the problem for PDFBox 1.6.0. I don't know. But I'm sorry, I can't give you one of the PDFs because they all contain confidential information.

Re: org.apache.pdfbox is slowing openkm completely

Posted: Mon Sep 08, 2014 9:21 am
by pavila
"Too few text extracted" errors are reported when fewer than 16 characters could be extracted. But I wonder if this "error" should be ignored, because not all documents have "extractable" text, for example those PDFs with sketches.

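The 16-character rule mentioned above can be sketched as a simple length check. This is only an illustration: the class and method names are hypothetical, and whether OpenKM trims whitespace before counting is an assumption.

```java
// Hypothetical sketch, not OpenKM code: flag documents with too little extracted text.
class TextThresholdSketch {

    // Minimum number of characters, taken from the thread (16).
    static final int MIN_EXTRACTED_CHARS = 16;

    // Returns true when the extraction yielded fewer than 16 characters.
    // Trimming surrounding whitespace first is an assumption.
    static boolean tooFewText(String extracted) {
        return extracted == null || extracted.trim().length() < MIN_EXTRACTED_CHARS;
    }

    public static void main(String[] args) {
        System.out.println(tooFewText("Plan A"));
        System.out.println(tooFewText("Ground floor plan, revision 3, scale 1:100"));
    }
}
```

A sketch PDF with only a short label would trip this check even though nothing went wrong during extraction, which is why treating it as an error is debatable.
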
Re: org.apache.pdfbox is slowing openkm completely

Posted: Mon Sep 08, 2014 10:20 am
by Catscratch
But another question. I think the activity log is stored in the database. Do you clean up the log? Or is there a way to clean it? Maybe a cron job? Or does it simply get bigger and bigger?

Re: org.apache.pdfbox is slowing openkm completely

Posted: Mon Sep 08, 2014 11:39 am
by pavila
By default there is no cleanup task, because the log retention policy is not the same for every installation. You should purge it depending on your own requirements.

Re: org.apache.pdfbox is slowing openkm completely

Posted: Tue Sep 09, 2014 7:05 am
by Catscratch
Hm, my preferred way would be to register a cron task in OpenKM that runs a database query daily and removes all entries from the okm_activity table that are older than, e.g., 30 days. Is there a way to access the database via OpenKM scripting?

Re: org.apache.pdfbox is slowing openkm completely

Posted: Tue Sep 09, 2014 9:06 am
by pavila
You can run:
Code: Select all
com.openkm.dao.LegacyDAO.executeSQL(String query)
or
Code: Select all
com.openkm.dao.LegacyDAO.executeHQL(String query)
But wait for the nightly build, because I have only just added these methods.

Re: org.apache.pdfbox is slowing openkm completely

Posted: Mon Sep 15, 2014 11:47 am
by Catscratch
Thanks! It's working.

For completeness, here is the task for common SQL, Postgres, and MySQL.

Common:
Code: Select all
com.openkm.dao.LegacyDAO.executeSQL("DELETE from okm_activity WHERE act_date < curdate() - 30;")
Postgres:
Code: Select all
com.openkm.dao.LegacyDAO.executeSQL("DELETE from okm_activity WHERE act_date < CURRENT_DATE - 30;")
MySQL:
Code: Select all
com.openkm.dao.LegacyDAO.executeSQL("DELETE from OKM_ACTIVITY WHERE act_date < DATE_SUB(curdate(), INTERVAL 30 DAY);");
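
Since the date arithmetic differs between databases, another option is to compute the cutoff date in Java and put it into the statement as a literal. The following is a minimal sketch under assumptions: the class and method names are hypothetical, it needs Java 8 for `java.time`, and whether your database compares `act_date` against an ISO date string like `'2014-08-16'` should be checked before relying on it.

```java
import java.time.LocalDate;

// Hypothetical sketch: build a purge statement with the cutoff date computed in Java.
class ActivityPurgeSketch {

    // Returns a DELETE statement that removes entries older than retentionDays.
    // LocalDate.toString() produces an ISO date (yyyy-MM-dd).
    static String purgeStatement(LocalDate today, int retentionDays) {
        LocalDate cutoff = today.minusDays(retentionDays);
        return "DELETE FROM okm_activity WHERE act_date < '" + cutoff + "'";
    }

    public static void main(String[] args) {
        // Print the statement for a 30-day retention period starting from today.
        System.out.println(purgeStatement(LocalDate.now(), 30));
    }
}
```

The resulting string could then be passed to `com.openkm.dao.LegacyDAO.executeSQL` as in the examples above. Whether the scripting engine bundled with OpenKM runs this code as-is is not confirmed here.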

Re: org.apache.pdfbox is slowing openkm completely

Posted: Wed Sep 24, 2014 2:38 pm
by pavila
Ok, thanks!