Bienvenido! - Willkommen! - Welcome!

Bitácora Técnica de Tux&Cía., Santa Cruz de la Sierra, BO
Bitácora Central: Tux&Cía.
Bitácora de Información Avanzada: Tux&Cía.-Información
May the source be with you!

Friday, December 3, 2010

character_set fix and configuring MySQL

Source
Original MySQL variables
mysql> SHOW VARIABLES LIKE 'c%';

+-------------------------------+-------------------------------+
| Variable_name   | Value    |
+-------------------------------+-------------------------------+
| character_set_client  | latin1   |
| character_set_connection | latin1   |
| character_set_database | utf8    |
| character_set_results  | latin1   |
| character_set_server  | latin1   |
| character_set_system  | utf8    |
| collation_connection  | latin1_swedish_ci  |
| collation_database  | utf8_general_ci  |
| collation_server  | latin1_general_ci  |
+-------------------------------+-------------------------------+
The bold values are all "wrong" or may be I should say "undesired". The database character set and the database collation values (in blue) are the ones which should be leading.
Explanation of the variables

character_set_server and collation_server

The server character set and collation can be determined from the values of the character_set_server and collation_server system variables.
Ok, those are the defaults of the server, but we already overruled those when setting up the database, so for this exercise they are not really relevant.

character_set_database and collation_database

The character set and collation of the default database can be determined from the values of the character_set_database and collation_database system variables.
Those are the defaults of the database we selected. These are the ones we want to use.

character_set_client

What character set is the statement in when it leaves the client?
The server takes the character_set_client system variable to be the character set in which statements are sent by the client.
The client in this case is the PHP script. As the sql statement which is send by the PHP script to the server may contain utf-8 characters - for instance form input which (after validation) is used in an sql query -, the client character set needs to use the same character set as the database.

character_set_connection and collation_connection

What character set should the server translate a statement to after receiving it?
For this, the server uses the character_set_connection and collation_connection system variables. It converts statements sent by the client from character_set_client to character_set_connection (except for string literals that have an introducer such as _latin1 or _utf8).
collation_connection is important for comparisons of literal strings. For comparisons of strings with column values, collation_connection does not matter because columns have their own collation, which has a higher collation precedence.
If we would need sql statements to be translated between character sets when sending them from PHP to MySQL, this is where we would need to make any changes. However, for a stable application, using the same character set in the database as in the web output of an application is advisable.
This means for all practical purposes that the connection character set variables need to use the same character set as the database.

character_set_results

What character set should the server translate to before shipping result sets or error messages back to the client?
The character_set_results system variable indicates the character set in which the server returns query results to the client. This includes result data such as column values, and result metadata such as column names.
The results sets (for instance from a SELECT statement) will normally be used by the web application for display on the screen, so - again - for all practical purposes this should be set to the same character set as the database, i.e. the same character set which is used throughout the application.
Quotes taken from the MySQL manual page on Connection charsets and collations
Word of warning on using names of character sets For using the charset in PHP and sending the charset to the client's browser we would use utf-8 (i.e. for instance for use in the htmlspecialchars function or in the header as shown above).
For setting the charset in MySQL we need to set it to utf8 (take note: no dash).
So even if we have a variable defined which holds the default charset for the application to use throughout the code, we cannot use this to send a change to MySQL as it has it's own naming convention for charsets.
For more information on the character set names used in MySQL, please refer to the MySQL Manual on Charsets and Collations.
Testing Stuart's method Ok, so now I knew the cause, I could start looking for a solution.
For this database I was in a position to change the mysql defaults, but doing it that way would not create a portable solution as I knew very well I couldn't touch the mysql settings on any of the shared servers which a lot of my customers use.
Stuart Herbert turned out to have been working on a similar problem involving conflicting collations, so I first tried out his proposed solution:
mysql_query("set collation_connection = @@collation_database");

Resulting MySQL variables after changing the collation using this method

mysql> SHOW VARIABLES LIKE 'c%';

+-------------------------------+-------------------------------+
| Variable_name   | Value    |
+-------------------------------+-------------------------------+
| character_set_client  | latin1   |
| character_set_connection | utf8    |
| character_set_database | utf8    |
| character_set_results  | latin1   |
| character_set_server  | latin1   |
| character_set_system  | utf8    |
| collation_connection  | utf8_general_ci  |
| collation_database  | utf8_general_ci  |
| collation_server  | latin1_general_ci  |
+-------------------------------+-------------------------------+
Better, but not quite what I wanted.... and my text still came up as I?t?rn?ti?n?liz?ti?n.
Creating a portable solution So, I would need to use SET NAMES after all, but, to use SET NAMES, I would need to provide it with a character set name.
What does SET NAMES do ?
A SET NAMES 'x' statement is equivalent to these three statements:
SET character_set_client = x;
SET character_set_results = x;
SET character_set_connection = x;
Setting character_set_connection to x also sets collation_connection to the default collation for x. To specify one of the character set's collations explicitly, use the optional COLLATE clause:
SET NAMES 'charset_name' COLLATE 'collation_name'
From the MySQL manual page on Connection charsets and collations

The solution

Tests using a statement such as the following examples inspired by Stuart Herbert gave error messages as the sql statement didn't seem to interpret the @@character_set_database properly.
SET NAMES @@character_set_database
SET NAMES '@@character_set_database'
Through the SHOW VARIABLES statement I could access the value of character_set_database, even though it would be a roundabout way - getting the variable and then passing it back -, it would make for a portable solution. So the code needed was:
$db_charset = mysql_query( "SHOW VARIABLES LIKE 'character_set_database'" );
$charset_row = mysql_fetch_assoc( $db_charset );
mysql_query( "SET NAMES '" . $charset_row['Value'] . "'" );
unset( $db_charset, $charset_row );
Let's see what that would get us..

Resulting MySQL variables after changing the character set through Set Names

mysql> SHOW VARIABLES LIKE 'c%';

+-------------------------------+-------------------------------+
| Variable_name   | Value    |
+-------------------------------+-------------------------------+
| character_set_client  | utf8    |
| character_set_connection | utf8    |
| character_set_database | utf8    |
| character_set_results  | utf8    |
| character_set_server  | latin1   |
| character_set_system  | utf8    |
| collation_connection  | utf8_general_ci  |
| collation_database  | utf8_general_ci  |
| collation_server  | latin1_general_ci  |
+-------------------------------+-------------------------------+
Yup ! Result achieved and they lived happily ever after...

Code Examples using various standard database management classes
Using standard PHP MySQL functions

$db = mysql_connect( $dbhost, $dbuser, $dbpasswd );
mysql_select_db( $dbname, $db);

// Make sure any results we retrieve or commands we send use the same charset and collation as the database:
$db_charset = mysql_query( "SHOW VARIABLES LIKE 'character_set_database'" );
$charset_row = mysql_fetch_assoc( $db_charset );
mysql_query( "SET NAMES '" . $charset_row['Value'] . "'" );
unset( $db_charset, $charset_row );
====================================
Source 
Configuring MySQL to use UTF-8
A project I’m working on at the moment is going to have multiple language options available, not all of which use the same alphabet (e.g. Russian and Chinese).
To lessen the pain commonly associated with internationalisation on the web, it’s beneficial to use the UTF-8 character set. This short summary from the Unicode Consortium may help explain better;
Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.

Unicode enables a single software product or a single website to be targeted across multiple platforms, languages and countries without re-engineering. It allows data to be transported through many different systems without corruption.
Thankfully MySQL has supported Unicode for quite some time now, even if it’s not configured to use it by default.
First, let’s check what our settings are at the moment;
mysql> SHOW VARIABLES LIKE 'collation%';
+----------------------+-------------------+
| Variable_name        | Value             |
+----------------------+-------------------+
| collation_connection | latin1_swedish_ci |
| collation_database   | latin1_swedish_ci |
| collation_server     | latin1_swedish_ci |
+----------------------+-------------------+
3 rows in set (0.01 sec)
mysql> SHOW VARIABLES LIKE 'character_set%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     | 
| character_set_connection | latin1                     | 
| character_set_database   | latin1                     | 
| character_set_filesystem | binary                     | 
| character_set_results    | latin1                     | 
| character_set_server     | latin1                     | 
| character_set_system     | utf8                       | 
| character_sets_dir       | /usr/share/mysql/charsets/ | 
+--------------------------+----------------------------+
8 rows in set (0.00 sec)
That’s to be expected, but it’s not really what we wanted.
Find your MySQL configuration file (on most Linux/BSD systems it’s /etc/my.cnf) and make sure it’s got the following statements under the relevant headers.
[mysqld]
default-character-set=utf8
default-collation=utf8_general_ci
character-set-server=utf8
collation-server=utf8_general_ci
init-connect='SET NAMES utf8'
 
[client]
default-character-set=utf8
Restart MySQL and make sure it’s working;
service mysql restart
mysql> SHOW VARIABLES LIKE 'collation%';
+----------------------+-----------------+
| Variable_name        | Value           |
+----------------------+-----------------+
| collation_connection | utf8_general_ci | 
| collation_database   | utf8_general_ci | 
| collation_server     | utf8_general_ci | 
+----------------------+-----------------+
3 rows in set (0.00 sec)
mysql> SHOW VARIABLES LIKE 'character_set%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       | 
| character_set_connection | utf8                       | 
| character_set_database   | utf8                       | 
| character_set_filesystem | binary                     | 
| character_set_results    | utf8                       | 
| character_set_server     | utf8                       | 
| character_set_system     | utf8                       | 
| character_sets_dir       | /usr/share/mysql/charsets/ | 
+--------------------------+----------------------------+
8 rows in set (0.00 sec)
Update; Mo pointed out that it’s worth demonstrating setting the charset and collation when creating tables too (good point!).
CREATE TABLE `content` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `language` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  `title` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL DEFAULT '',
  PRIMARY KEY  (`id`)
) ENGINE=MyISAM CHARACTER SET utf8 COLLATE utf8_general_ci;

No comments: