Check text extraction for docx issue

 #52935  by saleem55
 
Hello,
When I extract the text of a docx document I get unreadable content.
Please see the attachments. Also, when I search for the content of the document, nothing shows up.
Attachments
docx trxt extraction.PNG
 #52946  by jllort
 
You must install the LibreOffice Arabic dictionary to get it working. OOTextExtractor is the LibreOffice (OpenOffice) text extractor. I think the problem lies there: a language is missing in the application, which would explain why it is not able to open the file to get the content.
 #52949  by saleem55
 
jllort wrote: Sat Oct 16, 2021 6:37 pm You must install the LibreOffice Arabic dictionary to get it working. OOTextExtractor is the LibreOffice (OpenOffice) text extractor. I think the problem lies there: a language is missing in the application, which would explain why it is not able to open the file to get the content.
Hello jllort,
This is an English document.
 #52956  by jllort
 
Looking at your screenshot again, the problem seems to be that this is not a DOCX file but a PDF file. If you look at the beginning of the raw content, you will see "PDF-1.5", etc.
 #53112  by ketarino
 
Hello. I have the exact same issue. It seems that when I upload a Word document (docx) it gets converted to PDF automatically, but the file extension remains the same.
 #53120  by jllort
 
OpenKM stores documents in their original format unless you have made a customization for this purpose (I suppose you have not). I suggest checking the type of the document before uploading it into OpenKM.
 #53186  by silverspr
 
I also have the same issue. It is definitely a .docx file and it was uploaded directly to the "Check text extraction" tool under Administration > Utilities. It looks like the wrong extractor (OOT) is being used? I have no idea why the output identifies this as a PDF file.

Thanks
Attachments
docx 2022-01-10 124831.png
 #53194  by jllort
 
Sorry, but the image you shared is small and it is not possible to read anything in it. Try sharing a bigger one and, if possible, the document itself.
 #53468  by jllort
 
The application uses the file extension to choose the plugin that extracts the text. If you have a PDF file with the wrong extension, for example DOCX, the application will use the DOCX extraction plugin (the wrong plugin) and the content will not be extracted. Please check that the extension of the document corresponds to the document type. If the file has the wrong extension, download it, change the extension, and upload it again with the right extension (this is easier than correcting it from the OpenKM side).
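One way to check whether a file's extension matches its real type is to look at its first bytes. The sketch below relies on two well-known signatures: PDF files start with "%PDF-", and DOCX files are ZIP archives that start with "PK". The function name detect_type is hypothetical and is not part of OpenKM.

```shell
# Report a file's real type from its leading bytes, ignoring the extension.
# detect_type is a hypothetical helper for illustration, not an OpenKM command.
detect_type() {
    # PDF files begin with the ASCII signature "%PDF-".
    if [ "$(head -c 5 "$1")" = "%PDF-" ]; then
        echo pdf
    # DOCX files are ZIP archives, which begin with the two bytes "PK".
    elif [ "$(head -c 2 "$1")" = "PK" ]; then
        echo zip
    else
        echo unknown
    fi
}
```

For example, `detect_type report.docx` (report.docx is a placeholder name) prints `pdf` when the file is really a PDF with the wrong extension. Note that `zip` only means the file is a ZIP container; it does not prove the file is a valid DOCX. On most Linux systems, `file --mime-type report.docx` gives a similar answer.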
 #53477  by UberStrike88
 
jllort wrote: Sat May 07, 2022 3:48 pm The application uses the file extension to choose the plugin that extracts the text. If you have a PDF file with the wrong extension, for example DOCX, the application will use the DOCX extraction plugin (the wrong plugin) and the content will not be extracted. Please check that the extension of the document corresponds to the document type. If the file has the wrong extension, download it, change the extension, and upload it again with the right extension (this is easier than correcting it from the OpenKM side).
I have the same issue as well. I first installed the Arabic language pack in LibreOffice. Then I created a new file in LibreOffice, copy-pasted some Arabic text, and saved it as a docx file (Word 2007-365). I get the same PDF issue, without any text extraction:
In OpenKM:
Image

In Database:
Image

logs:
Image

I use OpenKM 6.3 CE on Ubuntu 22.04.
 #53478  by streicher
 
Hello,

I have the exact same problem.
Ubuntu 20.04.4
OpenKM CE 6.3.11
Hungarian docx file.

Check text extraction sees the docx file as a PDF.
Attachments
docx_pdf_text_extractor_error.jpg
 #53479  by UberStrike88
 
UPDATE:
I fixed this by completely removing LibreOffice and reinstalling it. I also set the document text to the correct language; before, it was somehow set to Hindi. After doing those two things it works again!

Setting the text language:
https://i.imgur.com/Okn7m1z.png

To completely remove LibreOffice from an Ubuntu-based system:
Code:
# Quote the pattern so the shell passes it to apt-get instead of expanding it against local file names.
sudo apt-get remove --purge 'libreoffice*'
sudo apt-get clean
sudo apt-get autoremove
Afterwards, reinstall it:
Code:
sudo apt install libreoffice
The file now:
Image
 #53480  by streicher
 
Unfortunately, reinstalling LibreOffice didn't work for me, but I found a solution for my case:

I had the parameter system.openoffice.dictionary = /opt/openkm/dict-hu.oxt
I deleted this parameter value and restarted OpenKM, and it worked.
(I used this parameter in older OpenKM versions without any problem, but I have since upgraded through several versions and installed the libreoffice-l10n-hu package on Linux. Maybe one of the version updates caused the problem, or there is a conflict between the libreoffice-l10n-hu package and the system.openoffice.dictionary OpenKM parameter.)
 #53495  by jllort
 
If you are using a dictionary, then you are restricting it to the words in that dictionary -> https://docs.openkm.com/kcenter/view/ok ... ngine.html

If you are using Tesseract version 4, you can configure several languages with the parameter eng+spa (English, Spanish). I think this is a better option nowadays -> check the Tesseract configuration parameters.
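As a rough illustration of the multi-language parameter, the sketch below calls the Tesseract command line directly. How OpenKM passes this value to Tesseract is not shown in this thread, so check the OpenKM configuration documentation for that. The helper join_langs and the file name scan.png are hypothetical.

```shell
# Join language codes into the "eng+spa" form that Tesseract's -l option accepts.
# join_langs is a hypothetical helper for illustration.
join_langs() {
    old_ifs=$IFS
    IFS=+
    joined="$*"        # "$*" joins the arguments with the first character of IFS
    IFS=$old_ifs
    echo "$joined"
}

# Run OCR only if tesseract is installed and the placeholder image exists.
if command -v tesseract >/dev/null 2>&1 && [ -f scan.png ]; then
    tesseract --list-langs                                # show installed language packs
    tesseract scan.png stdout -l "$(join_langs eng spa)"  # OCR with English and Spanish
fi
```

Each language code passed to `-l` must appear in the output of `tesseract --list-langs`; otherwise Tesseract cannot load it.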
