• org.apache.pdfbox is slowing openkm completely

 #29829  by Catscratch
 
Hi,

I've got a problem. After OpenKM starts up, I get a lot of exceptions in the log file. And by "a lot" I mean a loooooooot.
They look like this:
Code: Select all
2014-09-05 15:15:05,952 [Thread-16] ERROR org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap- java.lang.IllegalArgumentException: Raster BytePackedRaster: width = 414 height = 13 #channels 1 xOff = 0 yOff = 0 is incompatible with ColorModel IndexColorModel: #pixelBits = 1 numComponents = 3 color space = java.awt.color.ICC_ColorSpace@4443d96 transparency = 1 transIndex   = -1 has alpha = false isAlphaPre = false
java.lang.IllegalArgumentException: Raster BytePackedRaster: width = 414 height = 13 #channels 1 xOff = 0 yOff = 0 is incompatible with ColorModel IndexColorModel: #pixelBits = 1 numComponents = 3 color space = java.awt.color.ICC_ColorSpace@4443d96 transparency = 1 transIndex   = -1 has alpha = false isAlphaPre = false
	at java.awt.image.BufferedImage.<init>(BufferedImage.java:630)
	at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.getRGBImage(PDPixelMap.java:248)
	at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.write2OutputStream(PDPixelMap.java:285)
	at org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage.write2file(PDXObjectImage.java:165)
	at com.openkm.extractor.PdfTextExtractor.extractText(PdfTextExtractor.java:99)
	at com.openkm.extractor.RegisteredExtractors.getText(RegisteredExtractors.java:214)
	at com.openkm.extractor.RegisteredExtractors.getText(RegisteredExtractors.java:173)
	at com.openkm.dao.NodeDocumentDAO.textExtractorHelper(NodeDocumentDAO.java:1343)
	at com.openkm.extractor.TextExtractorWorker.processSerial(TextExtractorWorker.java:164)
	at com.openkm.extractor.TextExtractorWorker.processQueue(TextExtractorWorker.java:149)
	at com.openkm.extractor.TextExtractorWorker.run(TextExtractorWorker.java:100)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:483)
	at bsh.Reflect.invokeMethod(Reflect.java:134)
	at bsh.Reflect.invokeObjectMethod(Reflect.java:80)
	at bsh.BSHPrimarySuffix.doName(BSHPrimarySuffix.java:176)
	at bsh.BSHPrimarySuffix.doSuffix(BSHPrimarySuffix.java:120)
	at bsh.BSHPrimaryExpression.eval(BSHPrimaryExpression.java:80)
	at bsh.BSHPrimaryExpression.eval(BSHPrimaryExpression.java:47)
	at bsh.Interpreter.eval(Interpreter.java:645)
	at bsh.Interpreter.eval(Interpreter.java:739)
	at bsh.Interpreter.eval(Interpreter.java:728)
	at com.openkm.util.ExecutionUtils.runScript(ExecutionUtils.java:112)
	at com.openkm.core.Cron$RunnerBsh.run(Cron.java:103)
	at java.lang.Thread.run(Thread.java:745)
I also got a second type of ERROR:
Code: Select all
2014-09-05 15:15:40,601 [Thread-16] ERROR org.apache.pdfbox.filter.FlateFilter- Stop reading corrupt stream
So it seems PDFBox has a problem with some PDF files. But how can I fix this? I don't even know what the problem is or which files it relates to.

But these exceptions slow OpenKM down extremely. I have to wait about 3-5 minutes to simply load a document node in the tree view.
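
To see which component produces most of the errors, the log lines can be tallied by logger name. This is only a minimal sketch, assuming the line layout shown in the excerpts above (timestamp, thread, level, logger, then "- "); `ErrorTally` and `countByLogger` are hypothetical names and not part of OpenKM or PDFBox.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not OpenKM code: counts ERROR lines per logger.
final class ErrorTally {
    // Assumed layout: "2014-09-05 15:15:05,952 [Thread-16] ERROR some.Logger- message"
    private static final Pattern ERROR_LINE = Pattern.compile(
        "^\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3} \\[[^\\]]+\\] ERROR (\\S+?)- ");

    /** Returns a map of logger name to the number of ERROR lines it produced. */
    static Map<String, Integer> countByLogger(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = ERROR_LINE.matcher(line);
            // Stack trace lines ("\tat ...") do not match and are skipped.
            if (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

The lines could come from `java.nio.file.Files.readAllLines(...)` on the server log; the path of that file depends on the installation.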
 #29830  by Catscratch
 
OK, sorry for replying to myself so quickly, but I think I found a solution (and maybe someone else has the same problem and is looking for one). So, for completeness: I updated PDFBox to version 1.8.6, and now it seems to run without problems.

OpenKM was using PDFBox 1.6.0.
 #29841  by pavila
 
I can't understand why text extraction slows down the entire system, because it runs on its own thread. How many cores does your server have?

Please attach one of the PDFs that cause problems.
 #29848  by Catscratch
 
It's running in several threads, not only one. But after some time, once a certain number of threads is reached, the system does indeed get slow.

And sorry, but I can't attach any PDF, because I don't know which of the PDFs is causing the problem. The log only tells me that there is a problem and what type it is, but not which file causes it.

But anyway. With PDFBox 1.8.6 I only get:
Code: Select all
2014-09-07 01:05:00,215 [Thread-12296] ERROR org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap- Something went wrong ... the pixelmap doesn't contain any data.
2014-09-07 01:10:00,699 [Thread-12301] ERROR org.apache.pdfbox.filter.FlateFilter- FlateFilter: stop reading corrupt stream due to a DataFormatException
So there may still be a problem, but PDFBox now handles it, which results in much better system performance. Nothing seems to be blocked anymore.
 #29850  by pavila
 
By default, OpenKM creates a thread for every CPU in the system, so it can degrade overall system performance if the text extraction tasks are heavy. This is true.
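
As a rough illustration of why one extractor thread per CPU can starve the rest of the system, here is a minimal sketch of capping the worker count so that some CPUs stay free. `ExtractorThreads` and `workerCount` are hypothetical names, not OpenKM classes, and this thread does not say whether OpenKM offers such a setting.

```java
// Hypothetical helper, not OpenKM code: picks how many extractor threads to run.
final class ExtractorThreads {
    /** Returns the CPU count minus the reserved CPUs, but never less than 1. */
    static int workerCount(int availableCpus, int reservedCpus) {
        return Math.max(1, availableCpus - reservedCpus);
    }

    public static void main(String[] args) {
        int cpus = Runtime.getRuntime().availableProcessors();
        // Reserving one CPU for the web interface is an assumption for illustration.
        System.out.println("CPUs: " + cpus + ", extractor threads: " + workerCount(cpus, 1));
    }
}
```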

About PDF text extraction: not all PDFs are built correctly, so it's common to have problems parsing them. But it usually works pretty well; I have not seen a PDF for which text extraction does not work at all.

If OpenKM detects an error in text extraction, it is logged as "MISC_TEXT_EXTRACTION_FAILURE" in the OKM_ACTIVITY table, which can be listed from Administration > Activity Log.
 #29852  by Catscratch
 
Ah, OK. I found the log. There are a looooot of failures. Most are "Too few text extracted" and "full test indexing of <mime> is not supported", where <mime> covers different file types such as dwg, zip, png and so on.

Our PDFs are mainly sketches with only a few words, or none, to extract. Maybe this was the problem for PDFBox 1.6.0; I don't know. But I'm sorry, I can't give you one of the PDFs because they all contain confidential information.
 #29853  by pavila
 
"Too few text extracted" errors are reported when fewer than 16 characters can be extracted. But I wonder if this "error" should be ignored, because not all documents have "extractable" text, for example those PDFs with sketches.
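
The check described above could look roughly like the following. This is a minimal sketch and not the OpenKM source: `ExtractionCheck`, `classify` and the `Result` values are hypothetical, and trimming whitespace before counting is an assumption. Only the 16-character threshold comes from this thread.

```java
// Hypothetical sketch, not OpenKM code: flags documents with almost no text.
final class ExtractionCheck {
    static final int MIN_CHARS = 16; // threshold mentioned in the thread

    enum Result { OK, TOO_FEW_TEXT }

    /** Classifies extracted text; null is treated as empty (assumption). */
    static Result classify(String extracted) {
        String text = extracted == null ? "" : extracted.trim();
        return text.length() >= MIN_CHARS ? Result.OK : Result.TOO_FEW_TEXT;
    }

    public static void main(String[] args) {
        System.out.println(classify("Sheet 3"));                        // short label on a sketch
        System.out.println(classify("A paragraph with enough text."));  // normal document
    }
}
```

Returning a separate `TOO_FEW_TEXT` value rather than throwing would let a caller record it as a notice instead of a failure, which is one way to treat sketch-only PDFs.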
 #29856  by Catscratch
 
But another question: I think the activity log is stored in the database. Do you clean up the log? Or is there a way to clean it, maybe a cron job? Or does it simply get bigger and bigger?
 #29857  by pavila
 
By default there is no cleanup task, because the log retention policy is not the same for every installation. You should purge it depending on your own requirements.
 #29866  by Catscratch
 
Hm, my preferred way would be to register a cron task in OpenKM that runs daily and executes a database query to remove all entries from the okm_activity table that are older than, e.g., 30 days. Is there a way to access the database via OpenKM scripting?
 #29919  by Catscratch
 
Thanks! It's working.

For completeness, here are the tasks for common SQL, Postgres, and MySQL.

Common:
Code: Select all
com.openkm.dao.LegacyDAO.executeSQL("DELETE from okm_activity WHERE act_date < curdate() - 30;")
Postgres:
Code: Select all
com.openkm.dao.LegacyDAO.executeSQL("DELETE from okm_activity WHERE act_date < CURRENT_DATE - 30;")
MySQL:
Code: Select all
com.openkm.dao.LegacyDAO.executeSQL("DELETE from OKM_ACTIVITY WHERE act_date < DATE_SUB(curdate(), INTERVAL 30 DAY);");
Last edited by Catscratch on Thu Apr 09, 2015 8:40 am, edited 1 time in total.
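
If the retention period should be configurable instead of hard-coded to 30 days, the statement can be built first and then passed to `com.openkm.dao.LegacyDAO.executeSQL(...)` as in the tasks above. This is a minimal sketch: `ActivityPurgeSql`, `Dialect` and `build` are hypothetical names, and the SQL text simply mirrors the Postgres and MySQL variants posted here.

```java
// Hypothetical helper, not OpenKM code: builds the purge statement per database.
final class ActivityPurgeSql {
    enum Dialect { POSTGRES, MYSQL }

    /** Builds a DELETE for activity entries older than the given number of days. */
    static String build(Dialect dialect, int days) {
        if (days < 1) {
            throw new IllegalArgumentException("days must be >= 1");
        }
        switch (dialect) {
            case POSTGRES:
                return "DELETE FROM okm_activity WHERE act_date < CURRENT_DATE - " + days;
            case MYSQL:
                return "DELETE FROM OKM_ACTIVITY WHERE act_date < DATE_SUB(CURDATE(), INTERVAL "
                        + days + " DAY)";
            default:
                throw new IllegalArgumentException("unsupported dialect: " + dialect);
        }
    }

    public static void main(String[] args) {
        System.out.println(build(Dialect.POSTGRES, 30));
        System.out.println(build(Dialect.MYSQL, 30));
    }
}
```

Because `days` is an `int` and never user-supplied text, concatenating it into the statement does not open an injection path.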
