• Uploading text files with different encodings

  • OpenKM has many interesting features, but requires some configuration process to show its full potential.
 #19883  by jho_ed
 
Hi,

When uploading regular text files with different character encodings (ISO-8859-1, UTF-8 and UTF-16), indexing, download as PDF and preview work fully only for the UTF-8 version. In the ISO-8859-1 file, words containing characters outside the ASCII range are not indexed correctly, and in the preview those characters are displayed as reversed question marks.

My question is whether it is possible to handle such a mix of text files in OpenKM. When I run the Linux 'file -i' command on all my example files, the encodings are correctly recognised.

I have tested this in a local installation of OpenKM Community version 6.2.1 and also the web demo version and they both handle these files the same way.

Thanks!
Attachments
Three test files holding the same text encoded in ISO-8859-1, UTF-8 and UTF-16LE
 #19931  by jllort
 
Indexing and preview are two different kinds of problem.

If you want, we can concentrate on the preview here and open a new ticket for the indexing problem. For the preview, make sure LibreOffice or OpenOffice is correctly installed, with the libraries and language packages needed to support all these kinds of files. To check that the soffice service is configured correctly, test the conversion to PDF. If the conversion is right, the first 50% of the problem is solved; otherwise keep working on it (check that the document displays correctly when you open it with LibreOffice / OpenOffice on the server; if it does, it will convert to PDF without strange characters). Once this stage is finished we can concentrate on the SWF conversion.
 #19980  by jho_ed
 
LibreOffice should be correctly installed on the server, as I can open all the supplied files correctly. When opening text files with different character encodings interactively, LibreOffice brings up the ASCII Filter Options dialog so the user can select which character encoding to use (called "character set" in the dialog).

If I try to run the following on the server:
libreoffice --headless --convert-to pdf FileToConvert.txt
then the PDF conversion works OK for ISO-8859-1 files, but not for UTF-8 or UTF-16. In this case I am not able to give LibreOffice a hint about which character encoding is used.

In OpenKM, when downloading a file as PDF, it works for the UTF-8 file, but not for the ISO-8859-1 or UTF-16 files. The SWF conversion behaves the same way, as it is based on the PDF.

How does OpenKM do the conversion? Is there any setting that I am not aware of to specify the character encoding for pdf conversion, preview (and indexing)?
 #19996  by jllort
 
Are you on Windows? And which locale do you have installed? UTF-16 files are normally Chinese or similar.
 #20066  by jho_ed
 
I am on Linux, OpenSUSE 12.2, and my locale is en_US.UTF-8.

I need to be able to handle text files using UTF-8 and ISO-8859-1. I tested UTF-16 just to know how those files were handled.

How do I configure OpenKM to handle text files using different encodings? As I said earlier, running the Linux command 'file -i' on all the supplied files returns the correct encoding, so the encodings are recognisable.
 #20556  by pavila
 
I have opened the three sample files with LibreOffice and you can see the result in the attached screenshot. Only UTF-8 is imported (or represented) correctly.
Attachments
Selección_008.png
 #20638  by jho_ed
 
pavila wrote: I have opened the three sample files with LibreOffice and you can see the result in the attached screenshot. Only UTF-8 is imported (or represented) correctly.
As I wrote before (from a standard install opening a text file):
jho_ed wrote: When opening text files with different character encodings interactively, LibreOffice brings up the ASCII Filter Options dialog so the user can select which character encoding to use (called "character set" in the dialog).
All three supplied test files are correctly encoded; the UTF-16 file uses a BOM, and so does the UTF-8 file. The harder one is ISO-8859-1, as there is no way to tell from the file itself which ISO-8859-X (or other) encoding it uses.

My question still stands: how do I upload files with different encodings into OpenKM? Is there any setting that I am not aware of to specify the character encoding for PDF conversion, preview and indexing?
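
Since only the UTF-8 files work end to end in OpenKM, one possible workaround (not an OpenKM setting) is to normalize text files to UTF-8 before uploading them. The following is a minimal sketch under stated assumptions: a BOM is trusted when present, and any file without a BOM that is not valid UTF-8 is assumed to be ISO-8859-1. That fallback is a guess rather than a detection, and the function name `to_utf8` is hypothetical.

```python
import codecs


def to_utf8(raw: bytes, fallback: str = "iso-8859-1") -> bytes:
    """Return `raw` re-encoded as UTF-8 without a BOM."""
    if raw.startswith(codecs.BOM_UTF8):
        # UTF-8 with a BOM: strip the BOM, then decode.
        text = raw[len(codecs.BOM_UTF8):].decode("utf-8")
    elif raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM, picks the byte order and drops it.
        text = raw.decode("utf-16")
    else:
        try:
            # BOM-less UTF-8 (and plain ASCII) decodes cleanly here.
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            # Assumption: anything else is the single-byte fallback encoding.
            text = raw.decode(fallback)
    return text.encode("utf-8")
```

The converted bytes can be written to a new file and uploaded in place of the original. The limitation noted above remains: a file in another ISO-8859-X variant would be silently misread as ISO-8859-1, so the fallback has to match the files you actually have.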
 #21111  by pavila
 
When I open these files with gedit, all of them are displayed correctly. The problem is LibreOffice, which only "understands" the UTF-8 encoding.

The question about "different encodings in OpenKM" is an interesting one. We have customers using Greek, Chinese and Cyrillic alphabets, and OpenKM works fine for all of them. I'm not sure whether you have problems only with the preview or also other encoding-related problems. If the trouble is with the preview, I understand that the piece causing it is OpenOffice / LibreOffice. I don't know whether you can pass an argument to OpenOffice to improve text import for different character encodings, but that would be a solution.
 #21122  by jho_ed
 
pavila wrote: The question about "different encodings in OpenKM" is an interesting one. We have customers using Greek, Chinese and Cyrillic alphabets, and OpenKM works fine for all of them.
That may be because they all use only UTF-8, which covers all alphabets, or at least a single encoding that covers their situation, rather than several encodings as in my case.
pavila wrote: I'm not sure whether you have problems only with the preview or also other encoding-related problems.
What about text indexing in OpenKM? Is that also done using OpenOffice/LibreOffice?
pavila wrote: I don't know whether you can pass an argument to OpenOffice to improve text import for different character encodings, but that would be a solution.
I do not know. Does OpenKM allow you to pass parameters to OpenOffice/LibreOffice? If so, how?

Thanks for your time!
 #21305  by pavila
 
jho_ed wrote: What about text indexing in OpenKM? Is that also done using OpenOffice/LibreOffice?
OpenKM uses Lucene as its search engine. Indexing has nothing to do with OpenOffice / LibreOffice, which is only used to generate the preview and for PDF conversion.
jho_ed wrote: I do not know. Does OpenKM allow you to pass parameters to OpenOffice/LibreOffice? If so, how?
I don't know if there is a way of passing parameters to OpenOffice / LibreOffice, because we use a third-party library to handle OpenOffice and I don't see any option to do it. But first of all I need to know whether such a parameter exists.

Regards.
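
On the open question of whether such a parameter exists: LibreOffice's command-line help lists an `--infilter` option and gives `--infilter="Text (encoded):UTF8,LF,Liberation Mono,en-US"` as an example. The sketch below builds a standalone conversion command around that documented example. It is untested against the LibreOffice version discussed in this thread, it runs outside OpenKM (nothing here shows that OpenKM or its third-party library can pass this option), and the function names are hypothetical. Only the `UTF8` charset token from the help example is used; tokens for other encodings would need to be looked up.

```python
import subprocess
from typing import List

# Filter string taken from the example in LibreOffice's command-line help:
# charset, line ending, font, language.
UTF8_TEXT_FILTER = "Text (encoded):UTF8,LF,Liberation Mono,en-US"


def build_convert_command(path: str) -> List[str]:
    """Build a headless LibreOffice command that imports `path` as UTF-8 text."""
    return [
        "libreoffice",
        "--headless",
        f"--infilter={UTF8_TEXT_FILTER}",
        "--convert-to",
        "pdf",
        path,
    ]


def convert_to_pdf(path: str) -> None:
    """Run the conversion; raises CalledProcessError if LibreOffice fails."""
    subprocess.run(build_convert_command(path), check=True)
```

If this option behaves as documented on a given installation, it would only help for files already known to be UTF-8, so it pairs naturally with converting the files to UTF-8 first.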
