Check text extraction for docx issue

Posted: Sat Oct 16, 2021 9:18 am
by saleem55
Hello,
When I extract text from a docx document, I get unreadable content.
Please see the attachments. Also, when I search for the content of the document, nothing is shown.

Re: Check text extraction for docx issue

Posted: Sat Oct 16, 2021 6:37 pm
by jllort
You must install the LibreOffice Arabic dictionary to get it working. The OOTextExtractor is the LibreOffice (OpenOffice) text extractor, and I think the problem is there: a language is missing in the application, which would explain why it is not able to open the file to get the content.

Re: Check text extraction for docx issue

Posted: Sat Oct 16, 2021 7:46 pm
by saleem55
jllort wrote: Sat Oct 16, 2021 6:37 pm You must install the LibreOffice Arabic dictionary to get it working. The OOTextExtractor is the LibreOffice (OpenOffice) text extractor, and I think the problem is there: a language is missing in the application, which would explain why it is not able to open the file to get the content.
Hello jllort,
This is an English document.

Re: Check text extraction for docx issue

Posted: Sat Oct 23, 2021 7:53 am
by jllort
Looking at your screenshot again, the problem seems to be that this is not a DOCX file; it is a PDF file. If you take a look at the beginning of the raw content, you will see "PDF-1.5", etc.
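The check described above (looking at the first bytes of the raw file) can be sketched in a few lines of Python. This is only an illustration and not part of OpenKM: the function names `sniff_type` and `sniff_file` are made up for this example. It relies on two well-known facts: a PDF file starts with `%PDF-`, and a DOCX file is a ZIP container, so it starts with `PK\x03\x04`.

```python
# Leading bytes ("magic numbers") for the two formats discussed in this thread.
PDF_MAGIC = b"%PDF-"
ZIP_MAGIC = b"PK\x03\x04"  # DOCX files are ZIP containers


def sniff_type(header: bytes) -> str:
    """Guess the container format from the first bytes of a file.

    Hypothetical helper for illustration; it only looks at the header
    and does not validate the rest of the file.
    """
    if header.startswith(PDF_MAGIC):
        return "pdf"
    if header.startswith(ZIP_MAGIC):
        return "zip (possibly docx)"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read only the first 8 bytes of the file at `path` and classify them."""
    with open(path, "rb") as fh:
        return sniff_type(fh.read(8))
```

Running `sniff_file` on a file named `something.docx` that reports `pdf` would match the symptom in the screenshot: the extension says DOCX, but the content is a PDF.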

Re: Check text extraction for docx issue

Posted: Wed Dec 15, 2021 10:41 am
by ketarino
Hello. I have the exact same issue. It seems that when I upload a Word document (docx), it gets converted to PDF automatically, but the file extension remains the same.

Re: Check text extraction for docx issue

Posted: Sat Dec 18, 2021 9:48 am
by jllort
OpenKM stores documents in their original format unless you have made a customization for this purpose (I suppose not). I suggest checking the type of the document before uploading it into OpenKM.

Re: Check text extraction for docx issue

Posted: Mon Jan 10, 2022 8:51 pm
by silverspr
I also have the same issue. It is definitely a .docx file, and it was uploaded directly to the "check text extraction" tool under Administration, utilities. It looks like the wrong extractor is being used? OOT I have no idea why the output is indicating this as a PDF file.

Thanks

Re: Check text extraction for docx issue

Posted: Sat Jan 15, 2022 4:18 pm
by jllort
Sorry, but you shared a small image and it is not possible to read anything in it. Try sharing a bigger one and, if possible, the document itself.

Re: Check text extraction for docx issue

Posted: Mon Jan 17, 2022 9:31 am
by Brendon
Windows automatically recognizes the file as a zipped file. To extract the contents of the file, right-click on the file and select “Extract All” from the popup menu.
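Because a DOCX file is a ZIP archive, the same idea can be used programmatically to confirm that a file really is a Word document. The following is a minimal sketch using Python's standard `zipfile` module. The helper name `looks_like_docx` is hypothetical, and checking for the `word/document.xml` entry is a simple heuristic, not a full validation of the format.

```python
import io
import zipfile


def looks_like_docx(data: bytes) -> bool:
    """Return True if `data` is a ZIP archive containing word/document.xml.

    Heuristic only: a real DOCX has more required parts, but a PDF
    renamed to .docx will fail this check immediately.
    """
    buf = io.BytesIO(data)
    if not zipfile.is_zipfile(buf):
        return False
    buf.seek(0)
    with zipfile.ZipFile(buf) as archive:
        return "word/document.xml" in archive.namelist()
```

For example, reading the uploaded file with `open(path, "rb").read()` and passing the bytes to `looks_like_docx` would return `False` for a PDF saved with a .docx extension.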