Character Set Support in TSQL and PostgreSQL
TSQL and PostgreSQL each support multiple client-side string encodings and multiple server-side string encodings; however, in PostgreSQL there is only one client-side encoding and one server-side encoding, whereas in TSQL every string may specify its own encoding based on the code page of its collation.
In TSQL, the client side encoding of the
CHAR data types is determined by the code page of the collation, while the client side encoding of the
NCHAR data types is always
UTF-16. For example, if a string is declared with data type
VARCHAR(10) COLLATE sql_latin1_general_cp1_ci_as, it will be encoded on the client using the code page of the
sql_latin1_general_cp1_ci_as collation, which is
WIN1252 is such a widely used encoding that it gets the special abbreviation
cp1, which stands for
code page 1. The
WIN1252 encoding is a single-byte per character encoding, and can therefore represent only 256 characters. Prior to the development of Unicode, code pages evolved to cover the characters needed to represent languages in all parts of the world, and code point order of characters in each encoding provided the culturally expected ordering of characters in each region. This made it possible to support characters from most parts of the world with a single byte per character, and to perform sorting with efficient bytewise comparisons, but it didn’t work for case-insensitive or accent-insensitive comparisons, and having a proliferation of possible encodings led to more complexity in client-side applications, and ultimately led to the development of Unicode. The
NCHAR types represent Unicode strings encoded in
UTF-16, which can represent characters from around the world, and new characters like emojis, with a single encoding. Unicode separates the encoding of characters from the ordering and comparison of characters.
Babelfish emulates TSQL string encodings by storing all strings in UTF-8 on the server, and automatically translating between
UTF-8 and the encoding of the corresponding code page, or
UTF-16, when data is exchanged between the client and server.
How maximum string length is interpreted in TSQL and PostgreSQL
In PostgreSQL, the
(CHAR(n)) datatype specifies a string that may may hold up to (exactly)
n characters, but in TSQL, the
CHAR(n)) datatype specifies a string that may hold up to (exactly)
n bytes. In Western languages and for the most common European code pages, there is no distinction between bytes and characters because the encoding is 1 byte per character, but what about languages such as Korean, Japanese, or Chinese, where the encodings are variable width? TSQL encodes the string at
UPDATE time to determine if the string meets the maximum string length constraint, so for variable-width encodings, Babelfish must encode the string into the client-side encoding at
UPDATE time to check the maximum length in bytes before it can store the string in UTF-8 on the server.
A peculiar consequence of TSQL’s interpretation, the fixed-length
CHAR(n) data type can only store strings with exactly
n bytes, even if the encoding is variable width. This would seem to make the
CHAR(n) data type less flexible and probably less useful in languages such as Korean, Japanese, and Chinese.
Behavior of sys.varchar and sys.sysname data types in the PostgreSQL dialect
As of Babelfish 2.2, the recommendation is to invoke the
CREATE TABLE statement in the TSQL dialect. In the TSQL dialect a column declared as
VARCHAR(n) will resolve to the
sys.varchar data type and by default will be collated as if
COLLATE DATABASE_DEFAULT had been specified.