UTF8 in MySQL database

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

UTF8 in MySQL database

Jacek Konieczny
Hello,

As I was just doing upgrade from OpenSIPs 1.4.4 to 1.5.0 I took a look
at the database and found out it was latin1-encode. I didn't like it
much (if any non ASCII characters are supposed to be allowed in the
database then why should it be limited to only a few languages in the
world?). I have found out that opensipsdbctl reads the MySQL server
default charset setting, so I have changed it to utf8… only to find out,
that:

„WARNING: Your current default mysql characters set cannot be used to
create DB. Please choice another one from the following list:”

What is the reason for rejecting UTF8? No other setting seems to make
much sense in an international environment.

Fortunately, most data in the database is not supposed to be
human-readable and I can live with ASCII encoding only. I am just
wondering where this limitation comes from.

Greets,
        Jacek

_______________________________________________
Users mailing list
[hidden email]
http://lists.opensips.org/cgi-bin/mailman/listinfo/users
Reply | Threaded
Open this post in threaded view
|

Re: UTF8 in MySQL database

Dan Pascu
On Tuesday 24 March 2009, Jacek Konieczny wrote:
> Hello,
>
> As I was just doing upgrade from OpenSIPs 1.4.4 to 1.5.0 I took a look
> at the database and found out it was latin1-encode. I didn't like it
> much (if any non ASCII characters are supposed to be allowed in the
> database then why should it be limited to only a few languages in the
> world?).

Latin-1 is 8 bit transparent. So you can throw at it whatever you like and
it will get you back exactly what you put in, without having to worry
what the input encoding is. UTF-8 requires you that your input in already
formatted UTF-8.

> I have found out that opensipsdbctl reads the MySQL server
> default charset setting, so I have changed it to utf8… only to find
> out, that:
>
> „WARNING: Your current default mysql characters set cannot be used to
> create DB. Please choice another one from the following list:”
>
> What is the reason for rejecting UTF8? No other setting seems to make
> much sense in an international environment.

I disagree. Many other settings make sense in an international environment
and latin-1 is the most transparent of them.

--
Dan

_______________________________________________
Users mailing list
[hidden email]
http://lists.opensips.org/cgi-bin/mailman/listinfo/users
Reply | Threaded
Open this post in threaded view
|

Re: UTF8 in MySQL database

Jacek Konieczny
On Wed, Mar 25, 2009 at 08:15:51AM +0200, Dan Pascu wrote:

> On Tuesday 24 March 2009, Jacek Konieczny wrote:
> > As I was just doing upgrade from OpenSIPs 1.4.4 to 1.5.0 I took a look
> > at the database and found out it was latin1-encode. I didn't like it
> > much (if any non ASCII characters are supposed to be allowed in the
> > database then why should it be limited to only a few languages in the
> > world?).
>
> Latin-1 is 8 bit transparent. So you can throw at it whatever you like and
> it will get you back exactly what you put in, without having to worry
> what the input encoding is.

... as long as every application connected to this database is
enconding-ignorant and would treat 'latin1' as just a binary string. And
that is not a sane way to do things. The problems will start as soon as
someone will try to process this data as 'latin1' (according to the
declaration on the database), when it is not latin1.

> UTF-8 requires you that your input in already formatted UTF-8.

But then you know what you have in the database.

> > What is the reason for rejecting UTF8? No other setting seems to make
> > much sense in an international environment.
>
> I disagree. Many other settings make sense in an international environment
> and latin-1 is the most transparent of them.

Then why don't we drop all primary keys, foreign keys and other
constraints from the database? Then it would be even more transparent --
we could put anything there.  Putting non-latin1 strings in a "latin1"
table is like putting non-unique values in an UNIQUE column. Even if the
RDBMS in use would not care, it won't seem right.

But, back to my original question, as can understand that 'latin1' is ok
for some or even most people. My my question was: is there any specific,
technical reason, that 'utf8' is forbidden? I don't think OpenSIPs does
any SQL queries when multibyte strings would be a problem (like asking
for 3 characters only and expecting 3 bytes in the result). I don't know
MySQL well (just forced to use it because of CDRTool limitations), but I
don't think it would cause any serious problems with that, either.

Greets,
        Jacek

_______________________________________________
Users mailing list
[hidden email]
http://lists.opensips.org/cgi-bin/mailman/listinfo/users
Reply | Threaded
Open this post in threaded view
|

Re: UTF8 in MySQL database

Phil Vandry
On Wed, Mar 25, 2009 at 09:03:35AM +0100, Jacek Konieczny wrote:
> that is not a sane way to do things. The problems will start as soon as
> someone will try to process this data as 'latin1' (according to the
> declaration on the database), when it is not latin1.

Agreed. And the database would be perfectly within its rights to reject
or corrupt any byte in the range 0x80 to 0x9f if the encoding is latin1
(those bytes are not used in latin1), so you cannot even count on binary
transparency. (I doubt MySQL actually does this, though.)

> But, back to my original question, as can understand that 'latin1' is ok
> for some or even most people. My my question was: is there any specific,
> technical reason, that 'utf8' is forbidden? I don't think OpenSIPs does

I don't know why you are getting a problem with UTF-8 but there is one
issue with MySQL and UTF-8 that's worth mentioning (it's not related to
OpenSIPS). The MySQL docs do draw attention to this point.

If you have a CHAR(n) column (not a VARCHAR column) and your table is
using a fixed-length record (usually, myisam with no VARCHAR columns),
the CHAR column must reserve 3*n bytes with UTF-8 but requires only n
bytes with latin1 or ASCII.

(Actually it should be 4*n, not 3*n, but MySQL's support for UTF-8 is
crippled and only supports characters up to U+00FFFF, and that means it
never needs more than 3 bytes to encode one character.)

-Phil

_______________________________________________
Users mailing list
[hidden email]
http://lists.opensips.org/cgi-bin/mailman/listinfo/users
Reply | Threaded
Open this post in threaded view
|

Re: UTF8 in MySQL database

Bogdan-Andrei Iancu
Hi Phil,

Phil Vandry wrote:

> On Wed, Mar 25, 2009 at 09:03:35AM +0100, Jacek Konieczny wrote:
>  
>> that is not a sane way to do things. The problems will start as soon as
>> someone will try to process this data as 'latin1' (according to the
>> declaration on the database), when it is not latin1.
>>    
>
> Agreed. And the database would be perfectly within its rights to reject
> or corrupt any byte in the range 0x80 to 0x9f if the encoding is latin1
> (those bytes are not used in latin1), so you cannot even count on binary
> transparency. (I doubt MySQL actually does this, though.)
>
>  
>> But, back to my original question, as can understand that 'latin1' is ok
>> for some or even most people. My my question was: is there any specific,
>> technical reason, that 'utf8' is forbidden? I don't think OpenSIPs does
>>    
>
> I don't know why you are getting a problem with UTF-8 but there is one
> issue with MySQL and UTF-8 that's worth mentioning (it's not related to
> OpenSIPS). The MySQL docs do draw attention to this point.
>
> If you have a CHAR(n) column (not a VARCHAR column) and your table is
> using a fixed-length record (usually, myisam with no VARCHAR columns),
> the CHAR column must reserve 3*n bytes with UTF-8 but requires only n
> bytes with latin1 or ASCII.
>
> (Actually it should be 4*n, not 3*n, but MySQL's support for UTF-8 is
> crippled and only supports characters up to U+00FFFF, and that means it
> never needs more than 3 bytes to encode one character.)
>  
So, more or less it is about the table size - what is not clear for me
(from what you say) is why for a char(n) you need n bytes when using  
latin1 charset? it means it supports only 256 chars? because according
to mysql docs, the latin1 supports a lot of non-standard chars (extended
codes).

Regards,
Bogdan


_______________________________________________
Users mailing list
[hidden email]
http://lists.opensips.org/cgi-bin/mailman/listinfo/users
Reply | Threaded
Open this post in threaded view
|

Re: UTF8 in MySQL database

Phil Vandry
(Sorry, I digress to MySQL issues not related to OpenSIPS. Future
 replies will be off-list.)

On Tue, Mar 31, 2009 at 11:44:42AM +0300, Bogdan-Andrei Iancu wrote:
> So, more or less it is about the table size - what is not clear for me

Yes -- but only if you are using a fixed record format. Usually a
variable length record format is used and then the data takes up only
the space it needs (as little as one byte per character if it's all
ASCII). In particular, the Mysql schemas included in the OpenSIPS
distribution all use a variable length record format.

> (from what you say) is why for a char(n) you need n bytes when using  
> latin1 charset? it means it supports only 256 chars? because according
> to mysql docs, the latin1 supports a lot of non-standard chars (extended
> codes).

Normally latin1 means the same as ISO-8859-1 (and that is what I had
always assumed). But according to this:

http://dev.mysql.com/doc/refman/5.0/en/charset-we-sets.html

Mysql's latin1 is actually something called "cp1252" which I am not
familiar with, not ISO-8859-1. It goes on to say that "cp1252" is a
superset of ISO-8859-1. Is this what you mean by non-standard chars?
But it appears that "cp1252" is still a single-byte character set,
just like ISO-8859-1 (and all ISO-8859-x), so it can only support 256
characters and only requires one byte per character to encode.

-Phil

_______________________________________________
Users mailing list
[hidden email]
http://lists.opensips.org/cgi-bin/mailman/listinfo/users
Reply | Threaded
Open this post in threaded view
|

Re: UTF8 in MySQL database

Bogdan-Andrei Iancu
Hi Phil,

sorry for the late reply....

Phil Vandry wrote:

> (Sorry, I digress to MySQL issues not related to OpenSIPS. Future
>  replies will be off-list.)
>
> On Tue, Mar 31, 2009 at 11:44:42AM +0300, Bogdan-Andrei Iancu wrote:
>  
>> So, more or less it is about the table size - what is not clear for me
>>    
>
> Yes -- but only if you are using a fixed record format. Usually a
> variable length record format is used and then the data takes up only
> the space it needs (as little as one byte per character if it's all
> ASCII). In particular, the Mysql schemas included in the OpenSIPS
> distribution all use a variable length record format.
>  
since 1.5, the DB, for mysql does not use varchar anymore, but only
char. I know it is a penalty as DB size (but is hdd/mem size in these
days?), but it is much faster when operating wit.

>> (from what you say) is why for a char(n) you need n bytes when using  
>> latin1 charset? it means it supports only 256 chars? because according
>> to mysql docs, the latin1 supports a lot of non-standard chars (extended
>> codes).
>>    
>
> Normally latin1 means the same as ISO-8859-1 (and that is what I had
> always assumed). But according to this:
>
> http://dev.mysql.com/doc/refman/5.0/en/charset-we-sets.html
>
> Mysql's latin1 is actually something called "cp1252" which I am not
> familiar with, not ISO-8859-1. It goes on to say that "cp1252" is a
> superset of ISO-8859-1. Is this what you mean by non-standard chars?
> But it appears that "cp1252" is still a single-byte character set,
> just like ISO-8859-1 (and all ISO-8859-x), so it can only support 256
> characters and only requires one byte per character to encode.
>  

aha, I see.....Thanks for the explanations.

Thanks and regards,
Bogdan


_______________________________________________
Users mailing list
[hidden email]
http://lists.opensips.org/cgi-bin/mailman/listinfo/users