Collations and language support
A collation specifies the sort order and presentation format of data. Babelfish maps SQL Server collations to comparable collations provided by Babelfish. Babelfish predefines Unicode collations with culturally-sensitive string comparisons and sort orders, and provides a way to translate the collations in your SQL Server DB to the closest-matching Babelfish collation. Locale-specific collations are provided for different languages and regions.
Some collations specify a code page that corresponds to a client-side encoding. Babelfish will automatically translate from the server encoding to the client encoding depending on the collation of each output column.
For detailed information about PostgreSQL collation behavior, visit the PostgreSQL website.
Babelfish supports deterministic and non-deterministic collations:
-
A collation that is deterministic will consider two characters as equal if, and only if, they have the same byte sequence. For example,
x
is not equal toX
for a deterministic collation. Collations that are deterministic are case-sensitive (cs) and accent-sensitive (as). -
A non-deterministic collation does not require an identical match. Case-insensitivity (ci) is an example of a non-deterministic characteristic. To have
x
compare equal toX
, choose a non-deterministic collation that supports case-insensitivity.
Babelfish and SQL Server follow a naming convention for collations that describe the collation attributes, as shown in the table below:
Attribute | Description |
---|---|
ai | Accent insensitive |
as | Accent sensitive |
bin | BIN requests data to be sorted in code point order, treating the first character as a wchar. Code point order is a fast deterministic collation. |
bin2 | BIN2 requests data to be sorted in code point order. Code point order is a fast deterministic collation. |
ci | Case insensitive |
cs | Case sensitive |
pref | To sort uppercase letters before lowercase letters, use a pref collation. If the comparison is case-sensitive, the uppercase version of a letter sorts before the lowercase version, if there is no other distinction. The ICU library supports uppercase preference with colCaseFirst=upper , but not for ci_as collations. pref can be applied only to cs_as (deterministic). |
PostgreSQL doesn’t support the LIKE
clause on non-deterministic collations, but Babelfish supports it for ci_as
collations. Babelfish doesn’t support LIKE
on ai
collations. Pattern matching operations on non-deterministic collations are also not supported.
To establish Babelfish collation behavior, set the following parameters:
Parameter | Description |
---|---|
server_collation_name | The collation used as the default collation at both the server level and the database level. The default value is sql_latin1_general_cp1_ci_as . When you create your PostgreSQL cluster for use with Babelfish, you can choose the “Collation name” from the following table. Don’t modify server_collation_name after the Babelfish database has been created. |
default_locale | The default_locale parameter is used to customize all locale agnostic collations (that do not have a language in their name) for a specific language and region. The default value is en-US . This parameter is ignored if the chosen collation already implies a locale. This parameter may be changed after initial Babelfish database creation time, but it won’t affect the locale of existing collations. |
The following collations can be used in server_collation_name
or for column definitions. Note that only case insensitive collations are supported.
Collation Name | Notes |
---|---|
bbf_unicode_general_ci_as | Case insensitive collation for the UTF-8 encoding that is designed to work as well as possible in multi-language environments. |
bbf_unicode_cp1250_ci_as | Case insensitive collation for the Windows-1250 single-byte character encoding. |
bbf_unicode_cp1251_ci_as | Case insensitive collation for the Windows-1251 single-byte character encoding. |
bbf_unicode_cp1_ci_as | Case insensitive collation for the Windows-1252 single-byte character encoding. |
bbf_unicode_cp1253_ci_as | Case insensitive collation for the Windows-1253 single-byte character encoding. |
bbf_unicode_cp1254_ci_as | Case insensitive collation for the Windows-1254 single-byte character encoding. |
bbf_unicode_cp1255_ci_as | Case insensitive collation for the Windows-1255 single-byte character encoding. |
bbf_unicode_cp1256_ci_as | Case insensitive collation for the Windows-1256 single-byte character encoding. |
bbf_unicode_cp1257_ci_as | Case insensitive collation for the Windows-1257 single-byte character encoding. |
bbf_unicode_cp1258_ci_as | Case insensitive collation for the Windows-1258 single-byte character encoding. |
bbf_unicode_cp874_ci_as | Case insensitive collation for the Windows-874 single-byte character encoding. |
sql_latin1_general_cp1250_ci_as | The same as bbf_unicode_cp1250_ci_as . |
sql_latin1_general_cp1251_ci_as | The same as bbf_unicode_cp1251_ci_as . |
sql_latin1_general_cp1_ci_as | The same as bbf_unicode_cp1_ci_as . |
sql_latin1_general_cp1253_ci_as | The same as bbf_unicode_cp1253_ci_as . |
sql_latin1_general_cp1254_ci_as | The same as bbf_unicode_cp1254_ci_as . |
sql_latin1_general_cp1255_ci_as | The same as bbf_unicode_cp1255_ci_as . |
sql_latin1_general_cp1256_ci_as | The same as bbf_unicode_cp1256_ci_as . |
sql_latin1_general_cp1257_ci_as | The same as bbf_unicode_cp1257_ci_as . |
sql_latin1_general_cp1258_ci_as | The same as bbf_unicode_cp1258_ci_as . |
sql_latin1_general_cp874_ci_as | The same as bbf_unicode_cp874_ci_as . |
latin1_general_ci_as | The same as sql_latin1_general_cp1_ci_as . |
arabic_ci_as | Provides support for Arabic. |
chinese_prc_ci_as | Provides support for Chinese (PRC). |
cyrillic_general_ci_as | Provides support for Cyrillic. |
estonian_ci_as | Provides support for Estonian. |
finnish_swedish_ci_as | Provides support for Finnish. |
french_ci_as | Provides support for French. |
greek_ci_as | Provides support for Greek. |
hebrew_ci_as | Provides support for Hebrew. |
japanese_ci_as | Provides support for Japanese. |
korean_wansung_ci_as | Provides support for Korean (with dictionary sort) |
modern_spanish_ci_as | Provides support for Modern Spanish. |
mongolian_ci_as | Provides support for Mongolian. |
polish_ci_as | Provides support for Polish. |
thai_ci_as | Provides support for Thai. |
traditional_spanish_ci_as | Provides support for Spanish (Traditional sort) |
turkish_ci_as | Provides support for Turkish. |
ukrainian_ci_as | Provides support for Ukranian. |
vietnamese_ci_as | Provides support for Vietnamese. |
For column definitions, you can also use case sensitive or accent insensitive versions of the above collations. The following binary collations are also supported for columns:
Collation Name | Notes |
---|---|
latin1_general_bin2 | Binary collation for the Windows-1252 single-byte character encoding. |
latin1_general_100_bin2 | The same as latin1_general_bin2 . |
latin1_general_140_bin2 | The same as latin1_general_bin2 . |
latin1_general_90_bin2 | The same as latin1_general_bin2 . |
You can use the following collations for column definitions, if you want to support more than one locale in a single database:
Dialect | Deterministic version | Non-deterministic versions |
---|---|---|
arabic | arabic_cs_as | arabic_ci_as , arabic_ci_ai |
chinese | chinese_prc_cs_as | chinese_prc_ci_as , chinese_prc_ci_ai |
cyrillic_general | cyrillic_general_cs_as | cyrillic_general_ci_as , cyrillic_general_ci_ai |
estonian | estonian_cs_as | estonian_ci_as , estonian_ci_ai |
finnish_swedish | finnish_swedish_cs_as | finnish_swedish_ci_as , finnish_swedish_ci_ai |
french | french_cs_as | french_ci_as , french_ci_ai |
greek | greek_cs_as | greek_ci_as , greek_ci_ai |
hebrew | hebrew_cs_as | hebrew_ci_as , hebrew_ci_ai |
japanese | japanese_cs_as | japanese_ci_as , japanese_ci_ai |
korean_wamsung | korean_wamsung_cs_as | korean_wamsung_ci_as , korean_wamsung_ci_ai |
modern_spanish | modern_spanish_cs_as | modern_spanish_cs_as , modern_spanish_ci_ai |
mongolian | mongolian_cs_as | mongolian_ci_as , mongolian_ci_ai |
polish | polish_cs_as | polish_ci_as , polish_ci_ai |
thai | thai_cs_as | thai_ci_as , thai_ci_ai |
traditional_spanish | traditional_spanish_cs_as | traditional_spanish_ci_as , traditional_spanish_ci_ai |
turkish | turkish_cs_as | turkish_ci_as , turkish_ci_ai |
ukranian | ukranian_cs_as | ukranian_ci_as , ukranian_ci_ai |
vietnamese | vietnamese_cs_as | vietnamese_ci_as , vietnamese_ci_ai |
Managing collations
The ICU library provides collation version tracking to ensure that indexes that depend on collations can be reindexed when a new version of ICU becomes available. You can use the following query to identify all collations in the current database that need to be refreshed and the objects that depend on them:
SELECT pg_describe_object(refclassid, refobjid, refobjsubid) AS "Collation",
pg_describe_object(classid, objid, objsubid) AS "Object"
FROM pg_depend d JOIN pg_collation c
ON refclassid = 'pg_collation'::regclass
AND refobjid = c.oid
WHERE c.collversion < > pg_collation_actual_version(c.oid)
ORDER BY 1, 2;
Predefined collations are stored in the sys.fn_helpcollations
table. You can use the following command to display information about a collation (such as its lcid, style, and collate flags). To retrieve the list, connect a psql client to the PostgreSQL port (by default, 5432
) and enter:
postgres=# SET search_path = public, pg_temp, sys;
SET
postgres=# \dO
Connect to the T-SQL port (by default 1433
) and enter:
SELECT * FROM fn_helpcollations();
Collation limitations and behaviors
Babelfish uses the ICU library for collation support. The following section lists some of the known limitations and behavior variations of Babelfish collations.
Using CHARINDEX() with a non-deterministic collation
Note: the following limitation applies only to Babelfish version 1.x.x. This issue has been resolved in version 2.1.0.
CHARINDEX()
cannot currently be used when the applicable collation is non-deterministic. Babelfish (by default) uses a non-deterministic case-insensitive collation, so you may encounter a run-time error saying “nondeterministic collations are not supported for substring searches”. Until this is resolved, this issue can be worked around in two ways:
- You can explicitly convert the expression to a case-sensitive collation and case-fold both arguments by applying LOWER() or UPPER(). For example:
SELECT CHARINDEX('x', a) from t1
becomes:SELECT CHARINDEX(LOWER('x'), LOWER(a collate sql_latin1_general_cp1_cs_as)) from t1
- Create a SQL function named
f_charindex()
, and replaceCHARINDEX()
calls with calls to this function:
create function f_charindex(@s1 varchar(max), @s2 varchar(max)) returns int
as
begin
declare @i int = 1
while len(@s2) >= len(@s1)
begin
if lower(@s1) = lower(substring(@s2,1,len(@s1))) return @i
set @i += 1
set @s2 = substring(@s2,2,999999999)
end
return 0
end
go
Unicode sorting rules
In SQL Server, “SQL” collations (those that start with the letters SQL_
) sort Unicode-encoded data (nchar
and nvarchar
) one way, but non-Unicode encoded data (char
and varchar
) a different way. Babelfish databases will always be UTF-8 encoded and will always apply Unicode sorting rules consistently, regardless of the data type.
Secondary-equal collations
The default ICU Unicode secondary-equal (ci_as
) collation sorts punctuation marks and other non-alphanumeric characters before numeric characters, and numeric characters before alphabetic characters; however, within the set of punctuation or special characters, the order may be different from that of the corresponding collation in SQL Server.
Tertiary collations
SQL collations, such as sql_latin1_general_pref_cp1_ci_as
, support the TERTIARY_WEIGHTS
function and the ability to sort strings that compare equally in a ci_as
collation to be sorted upper case first: ABC
, ABc
, AbC
, Abc
, aBC
, aBc
, abC
, and finally abc
. Thus the DENSE_RANK() OVER (ORDER BY column)
analytic function would assess these strings as having the same rank but would order them upper case first within a partition.
A similar result can be obtained with Babelfish by adding a COLLATE
clause to the ORDER BY
clause that specifies a tertiary cs_as
collation that specifies @colCaseFirst=upper
. Unfortunately the colCaseFirst
modifier applies only to strings that are tertiary-equal (rather than secondary-equal like a ci_as
collation); therefore, tertiary SQL collations cannot be emulated using a single ICU collation. As a workaround, we recommend that you use a different collation for GROUP BY
than you use for ORDER BY
.
PostgreSQL supports exactly one version of each collation
PostgreSQL supports exactly one version of each collation, whereas SQL Server can support multiple versions of the same collation, typically by modifying the collation name with a version suffix such as _80
or _100
.
Character expansion
A character expansion treats a single character as equal to a sequence of characters at the primary level. SQL Server’s default ci_as
collation supports character expansion, whereas ICU collations support character expansion only for accent-insensitive collations.
Character reordering options are currently more restricted in PostgreSQL than in ICU, and the order of special characters can’t be made to follow the order of special characters in SQL Server exactly.
When character expansion is required, then an ai
collation must be used for comparisons, but such collations are not currently supported by the LIKE
operator. A column declared using the default ci_as
collation can use the LIKE
operator without character expansion being performed, which is typically the desired behavior, while preserving the option to explicitly collate any equality comparisons to use a ci_ai
collation to do character expansion. If you create a functional index on a ci_as
column with an explicit ci_ai
collation, then such an index can be exploited by the optimizer to perform equality and grouping operations efficiently.
char and varchar encoding
When collations that begin with SQL are used for char or varchar data types, the sort order for characters above ASCII 127 is determined by the specific code page for that SQL collation. For SQL collations, strings declared as char or varchar may sort in a different way from strings declared as nchar or nvarchar.
PostgreSQL encodes all strings with the database encoding, so will convert all characters to UTF-8 and sort using Unicode rules.
Since SQL collations sort nchar and nvarchar data types using Unicode rules, Babelfish encodes all strings on the server using UTF-8, and sorts nchar and nvarchar strings the same way it sorts char and varchar strings, using Unicode rules.
Supplementary Character
The SQL Server functions NCHAR
, UNICODE
, and LEN
support characters for code-points outside the Unicode Basic Multilingual Plane (BMP), whereas non-SC collations use surrogate pair characters to handle supplementary characters. For Unicode data types, SQL Server can represent up to 65,535 characters using UCS-2, or the full Unicode range (1,114,111 characters) if supplementary characters are used. Supplementary characters aren’t supported for use in metadata, such as in names of database objects, and collations without SC in the name behave differently in SQL Server depending on whether the encoding is UTF-16 or UCS-2.
Kana-Sensitive
When Japanese Kana characters Hiragana
and Katakana
are treated differently, the collation is called Kana sensitive (KS). ICU supports the Japanese collation standard JIS X 4061
. The now deprecated colhiraganaQ [on | off]
locale modifier may provide the same functionality as KS collations, but KS collations of the same name as SQL Server will not be provided by Babelfish in the current release.
Width-Sensitive
When a single-byte character (half-width) and the same character represented as a double-byte character (full-width) are treated differently, the collation is called width-sensitive (WS). WS collations with the same name as SQL Server won’t be provided by Babelfish in the current release.
Variation-Selector Sensitivity
Variation-Selector Sensitivity (VSS) collations distinguish between ideographic variation selectors in Japanese collations Japanese_Bushu_Kakusu_140
and Japanese_XJIS_140
. A variation sequence is made up of a base character plus an additional variation selector. If you don’t the select the _vss
option, the variation selector isn’t considered in the comparison.
VSS collations will not be provided by Babelfish in the current release.
bin
and bin2
A bin2
collation sorts characters according to code point order. The byte-by-byte binary order of UTF-8 preserves Unicode code point order, so this is also likely to be the best-performing collation. If Unicode code point order works for an application, using a bin2
collation should certainly be considered, but could result in data being displayed on the client in an order that is culturally unexpected. New mappings to lower-case characters are added to Unicode as time progresses, so the lower
function may perform in a different way on different versions of ICU. This is a special case of the more general collation versioning problem rather than something specific to the bin2
collation. Babelfish will provide the bbf_latin1_general_bin2
collation to collate in Unicode code point order. In a BIN collation only the first character is sorted as a wchar, and remaining characters are sorted byte-by-byte, effectively in code point order according to its encoding. This does not follow Unicode collation rules and it won’t be supported by Babelfish.