Open Source Document Management System | OpenKM

Posted: **Tue May 28, 2024 9:24 am**

Hi, I get this error when I upload a file (.pdf):

2024-05-28 10:10:00,192 [Thread-37] [] INFO  c.o.extractor.TextExtractorWorker - processSerial.Working on {docUuid=8cd95a85-56f9-4471-a40e-ac881636c13c, docPath=/okm:root/EN4.pdf, docVerUuid=38b19a1e-1bee-4f46-a7eb-b9699bf075d5, date=Tue May 28 10:09:34 WEST 2024}

2024-05-28 10:10:02,580 [Thread-37] [] WARN  com.openkm.util.ExecutionUtils - Abnormal program termination: 1

2024-05-28 10:10:02,581 [Thread-37] [] WARN  com.openkm.util.ExecutionUtils - CommandLine: [/usr/bin/tesseract, /home/openkm/tomcat-8.5.69/temp/okm97678713134532836.pdf, /home/openkm/tomcat-8.5.69/temp/okm2327626557474858401.txt]

2024-05-28 10:10:02,582 [Thread-37] [] WARN  com.openkm.util.ExecutionUtils - STDERR: Error in pixReadStream: Pdf reading is not supported
Error in pixRead: pix not read
Error during processing.

2024-05-28 10:10:02,582 [Thread-37] [] WARN  com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/EN4.pdf': Too few text extracted

With this error, I can't find the document using search -> by content.

Posted: **Wed May 29, 2024 7:32 am**

There are two different problems here:
1- "Too few text extracted" is a warning (by default it should not be treated as an error; it is only a hint that something may need checking).
2- "Pdf reading is not supported / Error in pixRead: pix not read" is an error raised while processing an image in the PDF. We would need to check a sample PDF that triggers it (it may be something that can be solved by installing a missing package or library).

We need:
1- your current OpenKM version and OS
2- a PDF sample to check ourselves
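
As a rough illustration of point 2: the CommandLine entry in the log above shows tesseract being called directly on a .pdf temp file, and Tesseract itself reports "Pdf reading is not supported" for that input. The Python sketch below pulls the arguments out of such a log line and flags a PDF input. The helper names are hypothetical and not part of OpenKM, and the argument order is assumed to follow the `system.ocr` layout (binary, input file, output file).

```python
import re

# Hypothetical helper, not part of OpenKM: extract the argument list from an
# ExecutionUtils "CommandLine: [...]" log entry.
def parse_command_line(log_line):
    match = re.search(r"CommandLine: \[(.*)\]", log_line)
    if match is None:
        return []
    return [arg.strip() for arg in match.group(1).split(",")]

# Assumption: arguments are ordered as <binary> <fileIn> <fileOut>.
def ocr_input_is_pdf(args):
    return len(args) >= 2 and args[1].lower().endswith(".pdf")

line = (
    "2024-05-28 10:10:02,581 [Thread-37] [] WARN  com.openkm.util.ExecutionUtils - "
    "CommandLine: [/usr/bin/tesseract, "
    "/home/openkm/tomcat-8.5.69/temp/okm97678713134532836.pdf, "
    "/home/openkm/tomcat-8.5.69/temp/okm2327626557474858401.txt]"
)
args = parse_command_line(line)
print(args[0], "called with PDF input:", ocr_input_is_pdf(args))
```

This only helps to spot the symptom in the log; it does not say which package or setting is missing.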

Posted: **Wed May 29, 2024 8:43 am**

My version: 6.3.12 (Community Extension) Ubuntu: v24
My PDF: a simple PDF. PDF text -> Hello.

Other important info:

system.ocr -> /usr/bin/tesseract ${fileIn} ${fileOut}
-> Tesseract version: 5.3.4

These parameters are empty (""). Is that OK?
system.ocr.crotate -> ""
system.pdf.force.ocr -> ""
system.pdfimages -> ""
system.swftools.pdf2swf -> ""
system.openoffice.dictionary -> ""
Do I need to configure any more parameters?

I uploaded 3 documents:

Test.pdf -> Content -> Test123
Test.txt -> Content -> Test1234
Test.docx -> Content -> Test12345

When I search by content I only find the .txt file.
I cannot find the .pdf and .docx files.

I checked the "OKM_NODE_DOCUMENT" table: when I upload a .txt file, the "NDC_TEXT" column has content.
When I upload a .pdf or .docx file, the "NDC_TEXT" column is null.

Posted: **Wed Jun 05, 2024 9:09 pm**

Hi

I appear to have the same problem.
When I add docx or pdf files to OpenKM CE, for some reason they don't get indexed if I use the MySQL database.
However, if I use the default H2 database the indexing seems to work fine.

My environment is as follows:
I am testing OpenKM CE using the Docker image from https://hub.docker.com/r/openkm/openkm-ce
I successfully loaded and started two containers running OpenKM CE 6.3.12 (Build a3587ce).
One container is configured to use the H2 database, which is the default.
The other container is configured to use a MySQL database, set up with the sample docker-compose from your documentation.
I have them running in parallel on different ports so I can compare them.

- With the H2 database, indexing seems to work fine for docx and pdf files.
- With the container configured to use the MySQL database, docx and pdf files are not indexed, although txt and odt documents are.

PS I am a complete newbie.

Cheers
joel

Posted: **Mon Jul 15, 2024 6:23 am**

You should use files with more content, because when the application detects that too little text was extracted it assumes there is an error, or at least raises a warning in the log. When the extracted text is very short (less than 16 characters) we consider that some error has occurred and the data is not saved.
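
The rule above can be sketched in a few lines of Python. This is only an illustration of the described behaviour, not OpenKM's actual code: the function name and constant are hypothetical, and trimming whitespace before counting is an assumption.

```python
# Assumed threshold, taken from the "less than 16 characters" figure above.
MIN_EXTRACTED_CHARS = 16

def text_to_store(extracted):
    """Return the text to index, or None when too little was extracted."""
    # Assumption: surrounding whitespace is ignored when counting characters.
    text = (extracted or "").strip()
    if len(text) < MIN_EXTRACTED_CHARS:
        return None  # treated as a failed extraction, nothing is saved
    return text

for sample in ["Hello", "Test123", "Test12345", "A longer sentence with enough text."]:
    print(repr(sample), "->", text_to_store(sample))
```

Under this sketch, the short PDF and DOCX test contents from this thread ("Hello", "Test123", "Test12345") would all be discarded, while a document with a full sentence would be kept.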

Posted: **Mon Jul 15, 2024 8:06 am**

Maybe yes! Thanks for the help.

TextExtractorWorker Error