Open Source Document Management System | OpenKM - Check text extraction for docx issue

Check text extraction for docx issue

#52935 by saleem55
Sat Oct 16, 2021 9:18 am

Hello,
When I run text extraction on a DOCX document, I get unreadable content.
Please see the attachments. Also, when I search for the content of the document, nothing is found.

Attachments

docx trxt extraction.PNG (236.08 KiB)

Gong Issue.docx (11.54 KiB)

Re: Check text extraction for docx issue

#52946 by jllort
Sat Oct 16, 2021 6:37 pm

You must install the LibreOffice Arabic dictionary to get it working. OOTextExtractor is the LibreOffice (OpenOffice) text extractor. I think the problem is here: a language is missing in the application, which would explain why it is not able to open the file to get the content.

Re: Check text extraction for docx issue

#52949 by saleem55
Sat Oct 16, 2021 7:46 pm

Hello jllort,
This is an English document.

Re: Check text extraction for docx issue

#52956 by jllort
Sat Oct 23, 2021 7:53 am

Looking at your screenshot again, the problem seems to be that this is not a DOCX file but a PDF file. If you look at the beginning of the raw output, you will see "PDF-1.5", etc.
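One way to confirm this outside OpenKM is to look at the first bytes of the file itself: a PDF starts with `%PDF`, while a genuine DOCX is a ZIP container and starts with `PK`. The following is only a minimal sketch; the function name `sniff_type` is hypothetical and is not part of OpenKM.

```shell
# sniff_type FILE: print "pdf", "zip" (the container used by DOCX) or "unknown",
# judging only by the first four bytes of FILE.
# Hypothetical helper for illustration, not an OpenKM tool.
sniff_type() {
    local magic
    # Dump the first 4 bytes as one lowercase hex string, e.g. "25504446".
    magic=$(head -c 4 -- "$1" | od -An -tx1 | tr -d ' \n')
    case "$magic" in
        25504446) echo "pdf" ;;      # bytes "%PDF"
        504b0304) echo "zip" ;;      # bytes "PK" 0x03 0x04
        *)        echo "unknown" ;;
    esac
}
```

Running `sniff_type` on the original file before upload shows whether it really is a DOCX. If it reports `zip` but the extraction output still begins with "PDF-1.5", the PDF header did not come from the uploaded file.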

Re: Check text extraction for docx issue

#53112 by ketarino
Wed Dec 15, 2021 10:41 am

Hello. I have the exact same issue. It seems that when I upload a Word document (docx), it gets converted to PDF automatically, but the file extension remains the same.

Re: Check text extraction for docx issue

#53120 by jllort
Sat Dec 18, 2021 9:48 am

OpenKM stores documents in their original format unless you have made a customization for this purpose (I suppose not). I suggest checking the type of the document before uploading it into OpenKM.

Re: Check text extraction for docx issue

#53186 by silverspr
Mon Jan 10, 2022 8:51 pm

I also have the same issue. It is definitely a .docx file and was uploaded directly to the "Check text extraction" tool under Administration, Utilities. It looks like the wrong extractor is being used? OOT I have no idea why the output identifies this as a PDF file.

Thanks

Attachments

docx 2022-01-10 124831.png (56.14 KiB)

Re: Check text extraction for docx issue

#53194 by jllort
Sat Jan 15, 2022 4:18 pm

Sorry, but you shared a small image and it is not possible to read anything in it. Try sharing a bigger one and, if possible, the document.

Re: Check text extraction for docx issue

#53468 by jllort
Sat May 07, 2022 3:48 pm

The application uses the file extension to choose the plugin used to extract the text. If you have a PDF file with the wrong extension, for example DOCX, the application will use the plugin for DOCX content (the wrong plugin) and the content will not be extracted. Please check that the extension of the document corresponds to the document type. If the file has the wrong extension, download it, change the extension, and upload it again with the right extension (this is easier than correcting it from the OpenKM side).
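The "change the extension" step can be partly scripted once the file has been downloaded. This is a rough sketch under stated assumptions: `suggest_name` is a hypothetical helper (not an OpenKM command), it recognizes only the PDF signature, and it prints a suggested name rather than renaming anything.

```shell
# suggest_name FILE: if FILE begins with the PDF signature "%PDF" but is not
# named *.pdf, print the path with its extension replaced by .pdf;
# in every other case print the path unchanged.
# Hypothetical helper for illustration, not an OpenKM tool.
suggest_name() {
    local file="$1" magic base
    # First 4 bytes as lowercase hex; "25504446" is "%PDF".
    magic=$(head -c 4 -- "$file" | od -An -tx1 | tr -d ' \n')
    base=${file##*/}                 # file name without its directory part
    case "$magic:$base" in
        25504446:*.[Pp][Dd][Ff]) echo "$file" ;;                        # already .pdf
        25504446:*)              echo "${file%"$base"}${base%.*}.pdf" ;; # PDF, wrong extension
        *)                       echo "$file" ;;                        # not a PDF
    esac
}
```

For example, `mv -- "$f" "$(suggest_name "$f")"` would apply the suggested name to a downloaded copy before uploading it again.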

Re: Check text extraction for docx issue

#53477 by UberStrike88
Tue May 10, 2022 1:04 pm

I have the same issue as well. I first installed the Arabic language pack in LibreOffice. Then I created a new file in LibreOffice, pasted in some Arabic text, and saved it as a DOCX file (Word 2007 - 365). I get the same PDF issue, with no text extracted:
In OpenKM:

In Database:

logs:

I use OpenKM 6.3 CE on Ubuntu 22.04.

Re: Check text extraction for docx issue

#53478 by streicher
Tue May 10, 2022 1:50 pm

Hello,

I have the exact same problem.
Ubuntu 20.04.4
OpenKM CE 6.3.11
Hungarian docx file.

Check text extraction sees docx file as pdf.

Attachments

docx_pdf_text_extractor_error.jpg (302.88 KiB)

Re: Check text extraction for docx issue

#53479 by UberStrike88
Tue May 10, 2022 2:03 pm

UPDATE:
I fixed this by completely removing LibreOffice and reinstalling it. I also set the document text to the correct language; before, it was somehow set to Hindi. After doing those two things, it works again!

Setting the text language:
https://i.imgur.com/Okn7m1z.png

To completely remove LibreOffice from an Ubuntu-based system:

Code:

# Quote the pattern so the shell does not expand it against local file names.
sudo apt-get remove --purge 'libreoffice*'
sudo apt-get clean
sudo apt-get autoremove

Then reinstall it:

Code:

sudo apt install libreoffice

The file now:

Re: Check text extraction for docx issue

#53480 by streicher
Tue May 10, 2022 4:48 pm

Unfortunately, reinstalling LibreOffice didn't work for me, but I found a solution for my case:

I had the parameter system.openoffice.dictionary = /opt/openkm/dict-hu.oxt
I deleted this parameter value and restarted OpenKM, and it worked.
(In older OpenKM versions I used this parameter without any problem, but I have since updated through several versions and installed the libreoffice-l10n-hu package on Linux. Maybe one of the version updates caused the problem, or there is a conflict between the libreoffice-l10n-hu package and the system.openoffice.dictionary OpenKM parameter.)

Re: Check text extraction for docx issue

#53495 by jllort
Sat May 14, 2022 4:20 pm

If you are using a dictionary, then you are restricting extraction to the words in that dictionary -> https://docs.openkm.com/kcenter/view/ok ... ngine.html

If you are using Tesseract version 4, you can configure it with several languages using the parameter eng+spa (English, Spanish). I think this is a better option nowadays; check the Tesseract configuration parameters.
