Open Source Document Management System | OpenKM

TextExtractorWorker Error

#54728 by MarcoOliveira
Tue May 28, 2024 9:24 am

Hi, I get this error when I upload a file (.pdf):

2024-05-28 10:10:00,192 [Thread-37] [] INFO  c.o.extractor.TextExtractorWorker - processSerial.Working on {docUuid=8cd95a85-56f9-4471-a40e-ac881636c13c, docPath=/okm:root/EN4.pdf, docVerUuid=38b19a1e-1bee-4f46-a7eb-b9699bf075d5, date=Tue May 28 10:09:34 WEST 2024}

2024-05-28 10:10:02,580 [Thread-37] [] WARN  com.openkm.util.ExecutionUtils - Abnormal program termination: 1

2024-05-28 10:10:02,581 [Thread-37] [] WARN  com.openkm.util.ExecutionUtils - CommandLine: [/usr/bin/tesseract, /home/openkm/tomcat-8.5.69/temp/okm97678713134532836.pdf, /home/openkm/tomcat-8.5.69/temp/okm2327626557474858401.txt]

2024-05-28 10:10:02,582 [Thread-37] [] WARN  com.openkm.util.ExecutionUtils - STDERR: Error in pixReadStream: Pdf reading is not supported
Error in pixRead: pix not read
Error during processing.

2024-05-28 10:10:02,582 [Thread-37] [] WARN  com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/EN4.pdf': Too few text extracted

With this error, I can't find the document using search -> by content.

Re: TextExtractorWorker Error

#54736 by jllort
Wed May 29, 2024 7:32 am

There are two different problems here:
1- "Too few text extracted" is a warning (it should not be treated as an error by default; it is only a hint that something should be checked).
2- "Pdf reading is not supported / Error in pixRead: pix not read" is an error raised while processing the PDF as an image. We would need to check a sample PDF that triggers it (it may be solvable by installing a missing package or library).

We need:
1- your current OpenKM version and OS
2- a PDF sample to check ourselves
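
For context on the second point: the `pixReadStream` message comes from Leptonica, the image library Tesseract uses to load its input, and it does not read PDF files. A minimal sketch of one possible workaround follows, assuming poppler-utils' `pdftoppm` is installed so the PDF can be rasterized first. The helper name `build_ocr_commands`, the 300 dpi resolution, and the paths are illustrative assumptions, not part of OpenKM.

```python
from pathlib import Path


def build_ocr_commands(pdf_path, work_dir, dpi=300):
    """Return (rasterize_cmd, ocr_cmd_for_page) for a PDF-to-text pipeline.

    Hypothetical helper: it only builds argument lists; nothing is executed.
    """
    pdf = Path(pdf_path)
    # pdftoppm appends "-<page>.png" to this prefix for every rendered page.
    prefix = Path(work_dir) / pdf.stem

    # pdftoppm (poppler-utils) renders each PDF page to a PNG image.
    rasterize_cmd = ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]

    def ocr_cmd_for_page(png_path):
        # "tesseract <image> <outputbase>" writes its text to <outputbase>.txt
        png = Path(png_path)
        return ["tesseract", str(png), str(png.with_suffix(""))]

    return rasterize_cmd, ocr_cmd_for_page
```

Each command list can be passed to `subprocess.run(cmd, check=True)`. The PNG file names depend on how `pdftoppm` numbers the pages, so it is safer to list the work directory than to guess them.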

Re: TextExtractorWorker Error

#54739 by MarcoOliveira
Wed May 29, 2024 8:43 am

My version: OpenKM 6.3.12 (Community Extension), on Ubuntu 24.
My PDF: a simple PDF whose only text is "Hello".

Other important info:

system.ocr -> /usr/bin/tesseract ${fileIn} ${fileOut}
-> Tesseract version: 5.3.4

These parameters are empty (""). Is that OK?
system.ocr.crotate -> ""
system.pdf.force.ocr -> ""
system.pdfimages -> ""
system.swftools.pdf2swf -> ""
system.openoffice.dictionary -> ""
Do I need to configure any more parameters?

I uploaded 3 documents:

Test.pdf -> Content -> Test123
Test.txt -> Content -> Test1234
Test.docx -> Content -> Test12345

When I search by content, I only find the .txt file.
The .pdf and .docx files are not found.

I checked the "OKM_NODE_DOCUMENT" table: when I upload a .txt file, the "NDC_TEXT" column has content.
When I upload a .pdf or .docx file, the "NDC_TEXT" column is null.
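
The check described above can be scripted. This is a minimal sketch that uses an in-memory SQLite database as a stand-in for the real OpenKM database. Only the table name `OKM_NODE_DOCUMENT` and the column `NDC_TEXT` come from this post; the `DOC_NAME` column and the sample rows are hypothetical and exist purely for illustration.

```python
import sqlite3


def documents_without_text(conn):
    """Return the names of rows whose NDC_TEXT is NULL or empty."""
    rows = conn.execute(
        "SELECT DOC_NAME FROM OKM_NODE_DOCUMENT "
        "WHERE NDC_TEXT IS NULL OR NDC_TEXT = '' "
        "ORDER BY DOC_NAME"
    )
    return [name for (name,) in rows]


# Stand-in database; DOC_NAME is a hypothetical column, not OpenKM's schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE OKM_NODE_DOCUMENT (DOC_NAME TEXT, NDC_TEXT TEXT)")
conn.executemany(
    "INSERT INTO OKM_NODE_DOCUMENT VALUES (?, ?)",
    [("Test.pdf", None), ("Test.txt", "Test1234"), ("Test.docx", None)],
)

print(documents_without_text(conn))
```

Against a real installation the same `WHERE NDC_TEXT IS NULL` condition would apply, but the identifying column would have to be taken from the actual schema.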

Re: TextExtractorWorker Error

#54753 by susserj
Wed Jun 05, 2024 9:09 pm

Hi

I appear to have the same problem.
When I add docx or pdf files to OpenKM CE, they don't get indexed if I use the MySQL database.
However, if I use the default H2 database, indexing seems to work fine.

My environment is as follows:
I am testing OpenKM CE using the Docker image from https://hub.docker.com/r/openkm/openkm-ce
I successfully loaded and started two containers running OpenKM CE 6.13.12 (Build a3587ce).
One container is configured to use the H2 database, which is the default.
The other container is configured to use a MySQL database, set up with the sample docker-compose file from your documentation.
I have them running in parallel on different URL ports so I can compare them.

- With the H2 database, indexing seems to work fine for docx and pdf files.
- With the container configured to use MySQL, docx and pdf files are not indexed. However, txt and odt documents are indexed.

PS I am a complete newbie.

Cheers
joel

Re: TextExtractorWorker Error

#54793 by jllort
Mon Jul 15, 2024 6:23 am

You should use files with more content: when the application detects that too little text was extracted, it assumes an error, or at least logs a warning. When the extracted text is very short (fewer than 16 characters), we consider that something went wrong and the data is not saved.
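
The rule described above can be sketched as follows. This is an illustration, not OpenKM's actual code: the function name `should_store_extracted_text` is hypothetical, and whether OpenKM trims whitespace before counting characters is an assumption.

```python
MIN_EXTRACTED_CHARS = 16  # threshold quoted in this post


def should_store_extracted_text(text, min_chars=MIN_EXTRACTED_CHARS):
    """Return True when the extracted text is long enough to be kept."""
    if text is None:
        return False
    # Assumption: surrounding whitespace does not count toward the length.
    return len(text.strip()) >= min_chars


samples = (
    "Hello",
    "Test12345",
    "This sample sentence has well over sixteen characters.",
)
for sample in samples:
    print(repr(sample), should_store_extracted_text(sample))
```

Under this sketch, short test contents such as `Hello` or `Test12345` fall below the threshold, which would be consistent with an empty text column for those uploads.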

Re: TextExtractorWorker Error

#54801 by MarcoOliveira
Mon Jul 15, 2024 8:06 am

Maybe, yes! Thanks for the help.
