
TextExtractorWorker Error

Posted: Tue May 28, 2024 9:24 am
by MarcoOliveira
Hi, I get this error when I upload a file (.pdf):
Code:
2024-05-28 10:10:00,192 [Thread-37] [] INFO  c.o.extractor.TextExtractorWorker - processSerial.Working on {docUuid=8cd95a85-56f9-4471-a40e-ac881636c13c, docPath=/okm:root/EN4.pdf, docVerUuid=38b19a1e-1bee-4f46-a7eb-b9699bf075d5, date=Tue May 28 10:09:34 WEST 2024}

2024-05-28 10:10:02,580 [Thread-37] [] WARN  com.openkm.util.ExecutionUtils - Abnormal program termination: 1

2024-05-28 10:10:02,581 [Thread-37] [] WARN  com.openkm.util.ExecutionUtils - CommandLine: [/usr/bin/tesseract, /home/openkm/tomcat-8.5.69/temp/okm97678713134532836.pdf, /home/openkm/tomcat-8.5.69/temp/okm2327626557474858401.txt]

2024-05-28 10:10:02,582 [Thread-37] [] WARN  com.openkm.util.ExecutionUtils - STDERR: Error in pixReadStream: Pdf reading is not supported
Error in pixRead: pix not read
Error during processing.

2024-05-28 10:10:02,582 [Thread-37] [] WARN  com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/EN4.pdf': Too few text extracted
Because of this error, I can't find the document using search -> by content.

Re: TextExtractorWorker Error

Posted: Wed May 29, 2024 7:32 am
by jllort
There are two different problems here:
1- "Too few text extracted" is a warning (by default it should not be treated as an error; it is only a hint that something should be checked).
2- "Pdf reading is not supported" / "Error in pixRead: pix not read" is an error that occurs while processing an image in the PDF. We would need to check a PDF sample that produces this error (maybe it can be solved by installing some missing package or library).

We need:
1- your current OpenKM version and OS
2- a PDF sample so we can check it ourselves
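
For context on the second point: the CommandLine entry in the log shows tesseract being handed the .pdf file directly, and the STDERR says PDF reading is not supported. A minimal sketch of one common workaround outside OpenKM is to rasterize the page to an image first and OCR that image. It assumes `pdftoppm` (from poppler-utils) and `tesseract` are installed; the function name and paths are hypothetical, and this is not how OpenKM itself is wired.

```python
from pathlib import Path


def ocr_commands(pdf_path, work_dir, dpi=300):
    """Build two command lines: rasterize page 1 of a PDF, then OCR it.

    Hypothetical helper for illustration; it only builds argument lists
    and does not execute anything.
    """
    prefix = str(Path(work_dir) / "page")
    # -singlefile renders only the first page and writes <prefix>.png
    rasterize = ["pdftoppm", "-r", str(dpi), "-png", "-singlefile",
                 str(pdf_path), prefix]
    # tesseract takes an input image and an output base; it appends ".txt"
    ocr = ["tesseract", prefix + ".png", prefix]
    return rasterize, ocr
```

Each list can be passed to `subprocess.run(cmd, check=True)`; the recognized text would then be in `page.txt` inside the working directory.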

Re: TextExtractorWorker Error

Posted: Wed May 29, 2024 8:43 am
by MarcoOliveira
My version: 6.3.12 (Community Extension) Ubuntu: v24
My PDF: it is a simple PDF. PDF text -> Hello.

Other important info:

system.ocr -> /usr/bin/tesseract ${fileIn} ${fileOut}
-> version(5.3.4)

These parameters are empty (""). Is that OK?
system.ocr.crotate -> ""
system.pdf.force.ocr -> ""
system.pdfimages -> ""
system.swftools.pdf2swf -> ""
system.openoffice.dictionary -> ""
Do I need to configure any more parameters?


I uploaded 3 documents:

Test.pdf -> Content -> Test123
Test.txt -> Content -> Test1234
Test.docx -> Content -> Test12345

When I search by content, I only find the .txt file.
I cannot find the .pdf and .docx files.

I checked the table "OKM_NODE_DOCUMENT": when I upload a .txt file, the column "NDC_TEXT" has content.
When I upload a .pdf or .docx file, the "NDC_TEXT" column is null.
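
The check described above can be sketched as a query. This is only an illustration: it uses an in-memory SQLite database as a stand-in for the real one, the table name and the `NDC_TEXT` column come from this post, and the `DOC_NAME` column is a hypothetical placeholder rather than the real schema.

```python
import sqlite3


def documents_without_text(conn):
    """Return the names of rows whose extracted text is NULL."""
    rows = conn.execute(
        "SELECT DOC_NAME FROM OKM_NODE_DOCUMENT "
        "WHERE NDC_TEXT IS NULL ORDER BY DOC_NAME"
    )
    return [name for (name,) in rows]


# Stand-in data mirroring the three uploads described in this post
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE OKM_NODE_DOCUMENT (DOC_NAME TEXT, NDC_TEXT TEXT)")
conn.executemany(
    "INSERT INTO OKM_NODE_DOCUMENT VALUES (?, ?)",
    [("Test.pdf", None), ("Test.txt", "Test1234"), ("Test.docx", None)],
)
print(documents_without_text(conn))
```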

Re: TextExtractorWorker Error

Posted: Wed Jun 05, 2024 9:09 pm
by susserj
Hi

I appear to have the same problem.
When I add docx or pdf files to OpenKM CE, for some reason they don't get indexed if I use the MySQL database.
However, if I use the default H2 database the indexing seems to work fine.

My environment is as follows:
I am testing OpenKM-CE using the docker image from here: https://hub.docker.com/r/openkm/openkm-ce
I was successful in loading and starting two containers using OpenKM CE 6.3.12 (Build a3587ce).
One container is configured to use the H2 database, which is the default.
The other container is configured to use the MySQL database, set up using the sample docker-compose from your documentation.
I have them running in parallel on different URL ports so I can compare them.

- When using the H2 database, indexing seems to work fine for docx and pdf files.
- However, the container configured to use the MySQL database doesn't index the docx and pdf files, although it does index txt and odt documents.


PS I am a complete newbie.

Cheers
joel

Re: TextExtractorWorker Error

Posted: Mon Jul 15, 2024 6:23 am
by jllort
You should use files with more content. When the application detects that too little text was extracted, it assumes there is an error, or at least raises a warning in the log. When the extracted text is very short (less than 16 characters), we consider that there was some error and the data is not saved.
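
The rule described above can be sketched as follows. This is only an illustration of the described behavior, not OpenKM's actual code: the function name is hypothetical, the 16-character limit is the one quoted in this post, and trimming whitespace before counting is an assumption.

```python
MIN_EXTRACTED_CHARS = 16  # limit quoted in this post


def text_to_store(extracted):
    """Return the text to save, or None when too little was extracted."""
    # Assumption: surrounding whitespace does not count towards the limit
    text = (extracted or "").strip()
    if len(text) < MIN_EXTRACTED_CHARS:
        # Treated as a failed extraction, so nothing is saved
        return None
    return text
```

Under this sketch, short test contents such as "Hello" or "Test123" fall below the limit and would not be saved, while a document with a full sentence of text would be.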

Re: TextExtractorWorker Error

Posted: Mon Jul 15, 2024 8:06 am
by MarcoOliveira
Maybe, yes! Thanks for the help.