Uploading text files with different encodings

Posted: Tue Dec 18, 2012 9:35 am
by jho_ed
Hi,

When I upload plain text files with different character encodings (ISO-8859-1, UTF-8 and UTF-16), indexing, download as PDF and preview only work fully for the UTF-8 versions. For ISO-8859-1 files, words containing characters outside the ASCII range are not indexed correctly, and in the preview those characters are displayed as reversed question marks.

My question is whether it is possible to handle such a mix of text files in OpenKM. When I run the Linux 'file -i' command on my example files, the encodings are all correctly recognised.

I have tested this on a local installation of OpenKM Community version 6.2.1 and also on the web demo version, and both handle these files the same way.

Thanks!

Re: Uploading text files with different encodings

Posted: Wed Dec 19, 2012 6:26 pm
by jllort
Indexing and preview are two different problems.

If you want, we can concentrate on the preview here and open a new ticket for the indexing problem. For the preview, first make sure LibreOffice or OpenOffice is correctly installed, with the libraries and language packages needed to support all these kinds of files. To check that the soffice service is configured correctly, test the conversion to PDF: if the conversion is right, the first 50% of the problem is solved; otherwise, keep working on it (check that when you open the document with LibreOffice / OpenOffice on the server you see it correctly; if you do, it will convert to PDF without strange characters). Once this stage is finished, we can concentrate on the SWF conversion.

Re: Uploading text files with different encodings

Posted: Fri Dec 21, 2012 10:51 am
by jho_ed
LibreOffice should be correctly installed on the server, as I can open all the supplied files correctly. The way LibreOffice handles text files with different character encodings interactively is to bring up the ASCII Filter Options dialog for the user to select which character encoding should be used (called character set in the dialog).

If I run the following on the server:
Code:
libreoffice --headless --convert-to pdf FileToConvert.txt
then the PDF conversion works for ISO-8859-1 files, but not for UTF-8 or UTF-16 files. In this case I have no way to give LibreOffice a hint about which character encoding is used.

In OpenKM, downloading a file as PDF works for the UTF-8 file, but not for the ISO-8859-1 or UTF-16 files. The SWF conversion behaves the same way, as it is based on the PDF.
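
Given that only the UTF-8 file converts correctly inside OpenKM, one possible workaround is to re-encode files to UTF-8 before uploading them. The following is a minimal Python sketch of that idea, not an OpenKM feature. The function name reencode_to_utf8 is hypothetical, and the sketch assumes the source encoding is already known (for example from 'file -i').

```python
def reencode_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a known source encoding and re-encode them as UTF-8.

    Hypothetical helper for illustration only; not part of OpenKM.
    """
    # Python's "utf-16" codec reads and strips a leading BOM;
    # "iso-8859-1" maps every possible byte to exactly one character.
    text = data.decode(source_encoding)
    return text.encode("utf-8")
```

The returned bytes can be written to a new file and that file uploaded instead of the original. The original encoding is lost in the process, so this only helps when keeping the file byte-for-byte identical is not a requirement.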

How does OpenKM do the conversion? Is there any setting that I am not aware of to specify the character encoding for pdf conversion, preview (and indexing)?

Re: Uploading text files with different encodings

Posted: Sat Dec 22, 2012 7:54 pm
by jllort
Are you on Windows? And which locale do you have installed? I ask because UTF-16 files are normally Chinese or similar.

Re: Uploading text files with different encodings

Posted: Mon Dec 31, 2012 8:27 am
by jho_ed
I am on Linux, OpenSUSE 12.2, and my locale is en_US.UTF-8.

I need to be able to handle text files using UTF-8 and ISO-8859-1. I tested UTF-16 just to know how those files were handled.

How do I configure OpenKM to handle text files using different encodings? As I said earlier, running the Linux command 'file -i' on all the supplied files reports the correct encoding, so the encodings are recognisable.
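
The kind of check that 'file -i' performs can be approximated in a few lines of Python. The sketch below is only an assumption about how such detection could be done before upload; it is not how OpenKM or 'file' actually work, and guess_encoding and its fallback parameter are hypothetical names. It does not handle UTF-32 and cannot tell ISO-8859-1 apart from other single-byte encodings.

```python
import codecs


def guess_encoding(data: bytes, fallback: str = "iso-8859-1") -> str:
    """Guess a text encoding from a BOM or from UTF-8 validity (illustrative only)."""
    # A UTF-8 BOM (EF BB BF) identifies the file unambiguously.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # A UTF-16 BOM (FF FE or FE FF); Python's "utf-16" codec uses it to pick the byte order.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # Bytes that pass strict UTF-8 decoding are very likely UTF-8 (pure ASCII also lands here).
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Assumption: everything else is treated as the fallback single-byte encoding.
        return fallback
```

The returned name can be passed straight to bytes.decode(), for example data.decode(guess_encoding(data)).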

Re: Uploading text files with different encodings

Posted: Wed Jan 02, 2013 9:05 am
by pavila
I have opened the three sample files with LibreOffice and you can see the result in the attached screenshot. Only UTF-8 is imported (or represented) correctly.

Re: Uploading text files with different encodings

Posted: Wed Jan 09, 2013 6:20 am
by jho_ed
pavila wrote:I have opened the three sample files with LibreOffice and you can see the result in the attached screenshot. Only UTF-8 is imported (or represented) correctly.
As I wrote before (from a standard install opening a text file):
jho_ed wrote:The way LibreOffice handles text files with different character encodings interactively is to bring up the ASCII Filter Options dialog for the user to select which character encoding should be used (called character set in the dialog).
All three supplied test files are correctly encoded; the UTF-16 file uses a BOM, and so does the UTF-8 file. The harder one is ISO-8859-1, as there is no way to indicate in the file itself which ISO-8859-X (or other) encoding it uses.
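
To illustrate why the ISO-8859-X case is harder, here is a small Python sketch (the variable names are just for illustration). The same byte is valid in every ISO-8859 variant shown but stands for a different character in each, so nothing in the file itself reveals which one was intended.

```python
raw = b"\xe4"  # one byte outside the ASCII range

# Decode the identical byte under three different ISO-8859 variants.
decoded = {name: raw.decode(name) for name in ("iso-8859-1", "iso-8859-5", "iso-8859-7")}

# decoded["iso-8859-1"] is "ä" (Latin), decoded["iso-8859-5"] is "ф" (Cyrillic),
# decoded["iso-8859-7"] is "δ" (Greek); none of the decodes raises an error.
```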

My question still stands: how do I upload files with different encodings into OpenKM? Is there any setting that I am not aware of to specify the character encoding for PDF conversion, preview and indexing?

Re: Uploading text files with different encodings

Posted: Sun Jan 27, 2013 11:19 am
by pavila
When I open these files with gedit, all of them are displayed correctly. The problem is LibreOffice, which only "understands" the UTF-8 encoding.

About "different encodings in OpenKM": it is an interesting question. We have customers using Greek, Chinese and Cyrillic alphabets, and OpenKM works fine in all of those cases. I'm not sure whether you have problems only with the preview or also other encoding-related problems. If the trouble is with the preview, I understand that the piece causing problems is OpenOffice / LibreOffice. I don't know whether you can pass any argument to OpenOffice to improve text import with different character encodings, but that would be a solution.

Re: Uploading text files with different encodings

Posted: Mon Jan 28, 2013 12:09 pm
by jho_ed
pavila wrote:About "different encodings in OpenKM": it is an interesting question. We have customers using Greek, Chinese and Cyrillic alphabets, and OpenKM works fine in all of those cases.
That may be because they all use only UTF-8, which covers all alphabets, or at least a single encoding that covers their situation, rather than several encodings as in my case.
pavila wrote:I'm not sure whether you have problems only with the preview or also other encoding-related problems.
What about text indexing in OpenKM? Is that also done using OpenOffice/LibreOffice?
pavila wrote:I don't know whether you can pass any argument to OpenOffice to improve text import with different character encodings, but that would be a solution.
I do not know. Does OpenKM allow you to pass parameters to OpenOffice/LibreOffice? If so, how?

Thanks for your time!

Re: Uploading text files with different encodings

Posted: Sat Feb 09, 2013 12:00 pm
by pavila
jho_ed wrote:What about text indexing in OpenKM? Is that also done using OpenOffice/LibreOffice?
OpenKM uses Lucene as its search engine. It has nothing to do with OpenOffice / LibreOffice, which is only used to generate the preview and for PDF conversion.
jho_ed wrote:I do not know. Does OpenKM allow you to pass parameters to OpenOffice/LibreOffice? If so, how?
I don't know if there is a way of passing parameters to OpenOffice / LibreOffice, because we use a third-party library to handle OpenOffice and I don't see any option to do it. But first of all I need to know whether such a parameter exists.

Regards.