• TextExtractorWorker Error

  • OpenKM has many interesting features, but requires some configuration process to show its full potential.
 #54728  by MarcoOliveira
 
Hi, I got this error when I uploaded a PDF file:
2024-05-28 10:10:00,192 [Thread-37] [] INFO  c.o.extractor.TextExtractorWorker - processSerial.Working on {docUuid=8cd95a85-56f9-4471-a40e-ac881636c13c, docPath=/okm:root/EN4.pdf, docVerUuid=38b19a1e-1bee-4f46-a7eb-b9699bf075d5, date=Tue May 28 10:09:34 WEST 2024}

2024-05-28 10:10:02,580 [Thread-37] [] WARN  com.openkm.util.ExecutionUtils - Abnormal program termination: 1

2024-05-28 10:10:02,581 [Thread-37] [] WARN  com.openkm.util.ExecutionUtils - CommandLine: [/usr/bin/tesseract, /home/openkm/tomcat-8.5.69/temp/okm97678713134532836.pdf, /home/openkm/tomcat-8.5.69/temp/okm2327626557474858401.txt]

2024-05-28 10:10:02,582 [Thread-37] [] WARN  com.openkm.util.ExecutionUtils - STDERR: Error in pixReadStream: Pdf reading is not supported
Error in pixRead: pix not read
Error during processing.

2024-05-28 10:10:02,582 [Thread-37] [] WARN  com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/EN4.pdf': Too few text extracted
With this error, I can't find the document using search by content.
 #54736  by jllort
 
There are two different problems here:
1- "Too few text extracted" is a warning (by default it should not be treated as an error; it is only a hint that something should be checked).
2- "Pdf reading is not supported" / "Error in pixRead: pix not read" is an error, raised while processing an image in the PDF. We would need to check a sample PDF that triggers it (it may be something that can be solved by installing a missing package or library).

We need:
1- your current OpenKM version and OS
2- a sample PDF so we can check it ourselves
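
A note on the second point: the CommandLine entry in the log shows Tesseract being called directly on a .pdf temp file, and the STDERR says this build cannot read PDF input. Below is a minimal sketch, in Python, of checking whether a file is a PDF before handing it to an image-only OCR tool. The name `looks_like_pdf` is made up for illustration and is not part of OpenKM; the only fact relied on is that PDF files begin with the bytes `%PDF-`. How OpenKM itself decides when to run OCR on a PDF is not shown in this thread.

```python
PDF_MAGIC = b"%PDF-"  # every PDF file starts with this header


def looks_like_pdf(path):
    """Return True if the file at `path` starts with the PDF header bytes.

    Hypothetical helper for illustration only; not an OpenKM function.
    """
    with open(path, "rb") as fh:
        return fh.read(len(PDF_MAGIC)) == PDF_MAGIC
```

If such a check returns True, the file would need to be handled by a PDF text extractor, or rendered to an image first, rather than passed to Tesseract as-is.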
 #54739  by MarcoOliveira
 
My version: 6.3.12 (Community Extension), on Ubuntu 24.
My PDF is a simple one. Its only text is "Hello".

Other important info:

system.ocr -> /usr/bin/tesseract ${fileIn} ${fileOut}
-> version(5.3.4)

These parameters are empty (""). Is that OK?
system.ocr.crotate -> ""
system.pdf.force.ocr -> ""
system.pdfimages -> ""
system.swftools.pdf2swf -> ""
system.openoffice.dictionary -> ""
Do I need to configure any more parameters?
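
For readers following along, the `system.ocr` value above is a command template with `${fileIn}` and `${fileOut}` placeholders, and the CommandLine entry in the log shows the result after the temp file paths are filled in. Here is a rough sketch of that expansion in Python, whose `string.Template` happens to use the same `${name}` syntax. The function `build_ocr_command` and the `/tmp` paths are illustrative assumptions; OpenKM does this in its own Java code, which may differ in detail.

```python
import shlex
from string import Template


def build_ocr_command(template, file_in, file_out):
    """Expand ${fileIn} and ${fileOut}, then split into an argument list.

    Illustrative only; this mimics, but is not, OpenKM's own substitution.
    """
    expanded = Template(template).substitute(fileIn=file_in, fileOut=file_out)
    return shlex.split(expanded)


# Hypothetical paths, chosen only for the example.
command = build_ocr_command(
    "/usr/bin/tesseract ${fileIn} ${fileOut}",
    "/tmp/input.pdf",
    "/tmp/output.txt",
)
print(command)
```

The resulting list has the same shape as the logged CommandLine: the Tesseract binary, the input file, then the output file.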


I uploaded 3 documents:

Test.pdf -> Content -> Test123
Test.txt -> Content -> Test1234
Test.docx -> Content -> Test12345

When I search by content, I only find the .txt file.
I cannot find the .pdf or .docx files.

I checked the "OKM_NODE_DOCUMENT" table: when I upload a .txt file, the "NDC_TEXT" column has content.
When I upload a .pdf or .docx file, the "NDC_TEXT" column is null.
 #54753  by susserj
 
Hi

I appear to have the same problem.
When I add docx or pdf files to OpenKM CE, for some reason they don't get indexed if I use the MySQL database.
However, if I use the default H2 database the indexing seems to work fine.

My environment is as follows:
I am testing OpenKM-CE using the docker image from here. https://hub.docker.com/r/openkm/openkm-ce
I was successful in loading and starting two containers using OpenKM CE 6.13.12 (Build a3587ce).
One container is configured to use the H2 database which is the default.
One container is configured to use the mysql database which was configured using the sample docker-compose specified in your documentation.
I have them running in parallel on different ports so I can compare them.

- When using the H2 database the indexing seems to work fine for docx and pdf files.
- However, if I use my mysql docker container, configured to use a mysql database, it doesn't index the docx and pdf files. However, it does index txt and odt documents.


PS I am a complete newbie.

Cheers
joel
 #54793  by jllort
 
You should use files with more content. When the application detects that very little text was extracted, it assumes an error occurred, or at least raises a warning in the log. When the extracted text is shorter than 16 characters, we consider that something went wrong and the data is not saved.
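
To make the rule concrete, here is a small Python sketch of a 16-character minimum applied to the test contents mentioned earlier in the thread. The constant and function names are invented for illustration, and the exact comparison OpenKM uses (for example whether whitespace is trimmed first) is an assumption. Note that the thread reports the 8-character .txt file was still indexed, so the check presumably does not apply to every file type; that is not confirmed here.

```python
MIN_EXTRACTED_CHARS = 16  # threshold stated in the reply above


def is_too_short(text):
    """Return True if extracted text falls below the stated minimum.

    Hypothetical helper; the real check in OpenKM may differ in detail.
    """
    return text is None or len(text) < MIN_EXTRACTED_CHARS


# Contents of the uploaded test documents from the thread.
for sample in ("Hello", "Test123", "Test12345"):
    print(sample, len(sample), is_too_short(sample))
```

All three samples are well under 16 characters, which is consistent with the advice to retest using documents that contain more text.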
