Open Source Document Management System | OpenKM - org.apache.pdfbox is slowing openkm completely

org.apache.pdfbox is slowing openkm completely

#29829 by Catscratch
Fri Sep 05, 2014 1:26 pm

Hi,

I've got a problem. After OpenKM starts up, I get a lot of exceptions in the log file. And by "a lot" I mean a loooooooot.
They look like this:

Code: Select all

2014-09-05 15:15:05,952 [Thread-16] ERROR org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap- java.lang.IllegalArgumentException: Raster BytePackedRaster: width = 414 height = 13 #channels 1 xOff = 0 yOff = 0 is incompatible with ColorModel IndexColorModel: #pixelBits = 1 numComponents = 3 color space = java.awt.color.ICC_ColorSpace@4443d96 transparency = 1 transIndex   = -1 has alpha = false isAlphaPre = false
java.lang.IllegalArgumentException: Raster BytePackedRaster: width = 414 height = 13 #channels 1 xOff = 0 yOff = 0 is incompatible with ColorModel IndexColorModel: #pixelBits = 1 numComponents = 3 color space = java.awt.color.ICC_ColorSpace@4443d96 transparency = 1 transIndex   = -1 has alpha = false isAlphaPre = false
	at java.awt.image.BufferedImage.<init>(BufferedImage.java:630)
	at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.getRGBImage(PDPixelMap.java:248)
	at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.write2OutputStream(PDPixelMap.java:285)
	at org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage.write2file(PDXObjectImage.java:165)
	at com.openkm.extractor.PdfTextExtractor.extractText(PdfTextExtractor.java:99)
	at com.openkm.extractor.RegisteredExtractors.getText(RegisteredExtractors.java:214)
	at com.openkm.extractor.RegisteredExtractors.getText(RegisteredExtractors.java:173)
	at com.openkm.dao.NodeDocumentDAO.textExtractorHelper(NodeDocumentDAO.java:1343)
	at com.openkm.extractor.TextExtractorWorker.processSerial(TextExtractorWorker.java:164)
	at com.openkm.extractor.TextExtractorWorker.processQueue(TextExtractorWorker.java:149)
	at com.openkm.extractor.TextExtractorWorker.run(TextExtractorWorker.java:100)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:483)
	at bsh.Reflect.invokeMethod(Reflect.java:134)
	at bsh.Reflect.invokeObjectMethod(Reflect.java:80)
	at bsh.BSHPrimarySuffix.doName(BSHPrimarySuffix.java:176)
	at bsh.BSHPrimarySuffix.doSuffix(BSHPrimarySuffix.java:120)
	at bsh.BSHPrimaryExpression.eval(BSHPrimaryExpression.java:80)
	at bsh.BSHPrimaryExpression.eval(BSHPrimaryExpression.java:47)
	at bsh.Interpreter.eval(Interpreter.java:645)
	at bsh.Interpreter.eval(Interpreter.java:739)
	at bsh.Interpreter.eval(Interpreter.java:728)
	at com.openkm.util.ExecutionUtils.runScript(ExecutionUtils.java:112)
	at com.openkm.core.Cron$RunnerBsh.run(Cron.java:103)
	at java.lang.Thread.run(Thread.java:745)

I also got a second type of ERROR:

Code: Select all

2014-09-05 15:15:40,601 [Thread-16] ERROR org.apache.pdfbox.filter.FlateFilter- Stop reading corrupt stream

So it seems PDFBox has a problem with some PDF files. But how can I fix this? I don't even know what the problem is about or which files it is related to.

But these exceptions slow OpenKM down extremely. I have to wait about 3-5 minutes to simply load a document node in the tree view.

Username

Catscratch

Rank

Platinum Boarder

Posts

336

Joined

Wed Feb 16, 2011 10:35 am

Re: org.apache.pdfbox is slowing openkm completely

#29830 by Catscratch
Fri Sep 05, 2014 1:48 pm

Ok, sorry for replying to myself so quickly, but I think I found a solution (and maybe someone else has the same problem and is looking for one). So for completeness: I updated PDFBox to version 1.8.6 and now it seems to run without problems.

OpenKM was using PDFBox 1.6.0.

Re: org.apache.pdfbox is slowing openkm completely

#29841 by pavila
Sat Sep 06, 2014 9:55 am

I can't understand why text extraction slows down the entire system, because it runs on its own thread. How many cores does your server have?

Please attach one of the PDFs which have problems.

Username

pavila

Rank

Moderator

Posts

3143

Joined

Tue Dec 11, 2007 6:02 pm

Location

Alicante, Spain

Contact

Re: org.apache.pdfbox is slowing openkm completely

#29848 by Catscratch
Mon Sep 08, 2014 6:26 am

It's running in several threads, not only one. But after some time and a certain number of threads, the system does indeed get slow.

And sorry, but I can't attach any PDF, because I don't know which of the PDFs is causing the problem. The log only tells me that there is a problem and what type of problem it is, but not which file causes it.

But anyway. With PDFBox 1.8.6 I only get:

Code: Select all

2014-09-07 01:05:00,215 [Thread-12296] ERROR org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap- Something went wrong ... the pixelmap doesn't contain any data.
2014-09-07 01:10:00,699 [Thread-12301] ERROR org.apache.pdfbox.filter.FlateFilter- FlateFilter: stop reading corrupt stream due to a DataFormatException

So maybe there is still a problem, but PDFBox now handles it, which results in much better system performance. Nothing seems to be blocked anymore.

Re: org.apache.pdfbox is slowing openkm completely

#29850 by pavila
Mon Sep 08, 2014 7:45 am

By default OpenKM creates one thread for every CPU in the system, so it can degrade overall system performance if the text extraction tasks are heavy. This is true.
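
To illustrate the one-thread-per-CPU point, here is a minimal sketch. It is not OpenKM's actual code: the class `ExtractorPoolSketch`, the `workerCount` helper and the "leave one core free" policy are all made up for the example. It shows how a pool sized by `Runtime.getRuntime().availableProcessors()` occupies every core, and how a cap would keep one free for the rest of the application.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ExtractorPoolSketch {
    // Hypothetical helper: how many workers to start.
    // A cap <= 0 means "one worker per CPU", the default described above.
    static int workerCount(int availableCpus, int cap) {
        if (cap <= 0) {
            return availableCpus;
        }
        return Math.max(1, Math.min(availableCpus, cap));
    }

    public static void main(String[] args) throws InterruptedException {
        int cpus = Runtime.getRuntime().availableProcessors();
        // Assumed policy: keep one core free for the UI and other requests.
        // On a single-core machine the cap is 0, so one worker is still started.
        int workers = workerCount(cpus, cpus - 1);

        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            final int id = i;
            // Stand-in for a real text extraction task.
            pool.execute(() -> System.out.println("extraction worker " + id + " running"));
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```

With the default (no cap) every core can be busy with extraction at the same time, which would match the slowdown reported earlier in the thread.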

About PDF text extraction: not all PDFs are built correctly, so it's common to have problems parsing them. But it usually works pretty well; I have not seen a PDF for which text extraction does not work.

If OpenKM detects an error in text extraction, it is logged as "MISC_TEXT_EXTRACTION_FAILURE" in the OKM_ACTIVITY table, which can be viewed from Administration > Activity Log.
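
For anyone who prefers a query over the Administration screen, a minimal sketch of building such a statement follows. The column name `act_action` is an assumption about the OKM_ACTIVITY schema (only `act_date` appears in this thread), and `ActivityQuerySketch` is a made-up name, so check the table definition before running anything like this.

```java
public class ActivityQuerySketch {
    // Builds a SELECT listing text extraction failures, newest first.
    // "act_action" is an ASSUMED column name; "act_date" is the one used in this thread.
    static String failureQuery() {
        return "SELECT * FROM okm_activity"
             + " WHERE act_action = 'MISC_TEXT_EXTRACTION_FAILURE'"
             + " ORDER BY act_date DESC";
    }

    public static void main(String[] args) {
        // Print the statement so it can be pasted into a database client.
        System.out.println(failureQuery());
    }
}
```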

Re: org.apache.pdfbox is slowing openkm completely

#29852 by Catscratch
Mon Sep 08, 2014 8:16 am

Ah ok. I found the log. There are a looooot of failures. Most of them are "Too few text extracted" and "full text indexing of <mime> is not supported", where <mime> covers different file types like dwg, zip, png and so on.

Our PDFs are mainly sketches with only a few words or no words to extract. Maybe this was the problem for PDFBox 1.6.0. I don't know. But I'm sorry, I can't give you one of the PDFs because they all contain confidential information.

Re: org.apache.pdfbox is slowing openkm completely

#29853 by pavila
Mon Sep 08, 2014 9:21 am

"Too few text extracted" errors are reported when fewer than 16 characters can be extracted. But I wonder if this "error" should be ignored, because not all documents have "extractable" text. For example, those PDFs with sketches.
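
The 16-character check can be sketched roughly as below. This is not the OpenKM source: `TooFewTextSketch`, `MIN_EXTRACTION` and `hasEnoughText` are hypothetical names, and trimming whitespace before counting is an assumption.

```java
public class TooFewTextSketch {
    // Threshold mentioned in this thread: at least 16 characters.
    static final int MIN_EXTRACTION = 16;

    // Returns true when the extracted text is long enough to index.
    // Assumption: leading and trailing whitespace does not count.
    static boolean hasEnoughText(String extracted) {
        return extracted != null && extracted.trim().length() >= MIN_EXTRACTION;
    }

    public static void main(String[] args) {
        // A sketch-style PDF with only a short label fails the check.
        System.out.println(hasEnoughText("Sketch A1"));
        // A normal text document passes it.
        System.out.println(hasEnoughText("This page has plenty of text"));
    }
}
```

A document that fails this check is not necessarily broken, which is why treating the result as an error is debatable for drawings.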

Re: org.apache.pdfbox is slowing openkm completely

#29856 by Catscratch
Mon Sep 08, 2014 10:20 am

But another question. I think the activity log is stored in the database. Do you clean up the log? Or is there a way to clean it? Maybe a cron job? Or does it simply get bigger and bigger?

Re: org.apache.pdfbox is slowing openkm completely

#29857 by pavila
Mon Sep 08, 2014 11:39 am

By default there is no cleanup task, because the log retention policy is not the same for every installation. You should purge it according to your own requirements.

Re: org.apache.pdfbox is slowing openkm completely

#29866 by Catscratch
Tue Sep 09, 2014 7:05 am

Hm, my preferred way would be to register a cron task in OKM that runs a database query daily and removes all entries from the okm_activity table which are older than e.g. 30 days. Is there a way to access the database via OpenKM scripting?

Re: org.apache.pdfbox is slowing openkm completely

#29867 by pavila
Tue Sep 09, 2014 9:06 am

You can run:

Code: Select all

com.openkm.dao.LegacyDAO.executeSQL(String query)

or

Code: Select all

com.openkm.dao.LegacyDAO.executeHQL(String query)

But wait for the nightly build, because I have just added these methods.

Re: org.apache.pdfbox is slowing openkm completely

#29919 by Catscratch
Mon Sep 15, 2014 11:47 am

Thanks! It's working.

For completeness, here is the task for common SQL, Postgres and MySQL.

Common:

Code: Select all

com.openkm.dao.LegacyDAO.executeSQL("DELETE from okm_activity WHERE act_date < curdate() - 30;")

Postgres:

Code: Select all

com.openkm.dao.LegacyDAO.executeSQL("DELETE from okm_activity WHERE act_date < CURRENT_DATE - 30;")

MySQL:

Code: Select all

com.openkm.dao.LegacyDAO.executeSQL("DELETE from OKM_ACTIVITY WHERE act_date < DATE_SUB(curdate(), INTERVAL 30 DAY);");
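
If the retention period should be configurable, the Postgres and MySQL statements above can be generated instead of hard-coded. A minimal sketch, where `PurgeQuerySketch`, its `Dialect` enum and `purgeQuery` are made-up names; the returned string is what would be passed to `LegacyDAO.executeSQL`.

```java
public class PurgeQuerySketch {
    // Hypothetical enum: only the two dialects with an explicit statement above.
    enum Dialect { POSTGRES, MYSQL }

    // Builds the DELETE statement for entries older than the given number of days.
    static String purgeQuery(Dialect dialect, int days) {
        if (days < 1) {
            throw new IllegalArgumentException("days must be >= 1");
        }
        switch (dialect) {
            case POSTGRES:
                // Postgres allows subtracting an integer number of days from a date.
                return "DELETE FROM okm_activity WHERE act_date < CURRENT_DATE - " + days;
            case MYSQL:
                // Table name kept in upper case, as in the MySQL statement above.
                return "DELETE FROM OKM_ACTIVITY WHERE act_date < DATE_SUB(CURDATE(), INTERVAL "
                        + days + " DAY)";
            default:
                throw new IllegalArgumentException("unsupported dialect: " + dialect);
        }
    }

    public static void main(String[] args) {
        System.out.println(purgeQuery(Dialect.POSTGRES, 30));
        System.out.println(purgeQuery(Dialect.MYSQL, 30));
    }
}
```

Taking the number of days as an `int` rather than a string keeps user-supplied text out of the SQL.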

Last edited by Catscratch on Thu Apr 09, 2015 8:40 am, edited 1 time in total.

Re: org.apache.pdfbox is slowing openkm completely

#30060 by pavila
Wed Sep 24, 2014 2:38 pm

Ok, thanks!
