Collation in PostgreSQL

Don’t know what COLLATION is? Just remember one thing: always use C COLLATE!

Why does Pigsty default to locale=C and encoding=UTF8 when initializing PostgreSQL databases?

The answer is simple: Unless you explicitly need LOCALE-specific features, you should never configure anything other than C.UTF8 for character encoding and collation settings.

I’ve previously written about character encoding, so let’s focus on LOCALE configuration today.

While there might be some justification for using non-UTF8 character encoding on the server side, using any LOCALE other than C is unforgivable. In PostgreSQL, LOCALE isn’t just about trivial things like date and currency display formats - it affects critical functionality.

Incorrect LOCALE configuration can lead to performance degradation of several to dozens of times, and prevents LIKE queries from using regular indexes. Meanwhile, setting LOCALE=C doesn’t affect scenarios that genuinely need localization rules. As the official documentation states: “Use LOCALE only if you really need it.”

Unfortunately, PostgreSQL’s default locale and encoding settings depend on the operating system configuration, so C.UTF8 might not be the default. This leads many users to unknowingly misuse LOCALE, suffering significant performance penalties and missing out on certain database features.


TL;DR

  • Always use UTF8 character encoding and C collation rules.
  • Using non-C collation rules can increase string comparison operation overhead by several to dozens of times, significantly impacting performance.
  • Using non-C collation rules prevents LIKE queries from using regular indexes, creating potential pitfalls.
  • For instances using non-C collation rules, you can create indexes using text_ops COLLATE "C" or text_pattern_ops to support LIKE queries.

What is LOCALE?

We often see LOCALE (locale) settings in operating systems and various software, but what exactly is it?

LOCALE support refers to applications adhering to cultural preferences, including alphabets, sorting, number formats, etc. A LOCALE consists of many rules and definitions:

LC_COLLATE String sorting order
LC_CTYPE Character classification (What is a character? Is its uppercase form equivalent?)
LC_MESSAGES Language of messages
LC_MONETARY Currency format
LC_NUMERIC Number format
LC_TIME Date and time format
…… Others……

A LOCALE is a set of rules, typically named using a language code + country code. For example, the LOCALE zh_CN used in mainland China has two parts: zh is the language code, and CN is the country code. In the real world, one language might be used in multiple countries, and one country might have multiple languages. Taking Chinese and China as an example:

China (COUNTRY=CN) related language LOCALEs:

  • zh: Chinese: zh_CN
  • bo: Tibetan: bo_CN
  • ug: Uyghur: ug_CN

Chinese-speaking (LANG=zh) countries or regions:

  • CN China: zh_CN
  • HK Hong Kong: zh_HK
  • MO Macau: zh_MO
  • TW Taiwan: zh_TW
  • SG Singapore: zh_SG

A LOCALE Example

Let’s look at a typical Locale definition file: Glibc’s zh_CN

Here’s a small excerpt that shows various format definitions - how months and weeks are named, how currency and decimal points are displayed, etc.

But there’s one crucial element here: LC_COLLATE, the sorting method (Collation), which significantly impacts database behavior.

LC_CTYPE
copy "i18n"
translit_start
include  "translit_combining";""
translit_end
class	"hanzi"; /
<U4E00>..<U9FA5>;/
<UF92C>;<UF979>;<UF995>;<UF9E7>;<UF9F1>;<UFA0C>;<UFA0D>;<UFA0E>;/
<UFA0F>;<UFA11>;<UFA13>;<UFA14>;<UFA18>;<UFA1F>;<UFA20>;<UFA21>;/
<UFA23>;<UFA24>;<UFA27>;<UFA28>;<UFA29>
END LC_CTYPE

LC_COLLATE
copy "iso14651_t1_pinyin"
END LC_COLLATE

LC_TIME
% January, February, March, April, May, June, July, August, September, October, November, December
mon           "<U4E00><U6708>";/
     "<U4E8C><U6708>";/
     "<U4E09><U6708>";/
     "<U56DB><U6708>";/
...
% Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, Saturday
day           "<U661F><U671F><U65E5>";/
     "<U661F><U671F><U4E00>";/
     "<U661F><U671F><U4E8C>";/
...
week          7;19971130;1
first_weekday 2
% %Y年%m月%d日 %A %H时%M分%S秒
d_t_fmt       "%Y<U5E74>%m<U6708>%d<U65E5> %A %H<U65F6>%M<U5206>%S<U79D2>"
% %Y年%m月%d日
d_fmt         "%Y<U5E74>%m<U6708>%d<U65E5>"
% %H时%M分%S秒
t_fmt         "%H<U65F6>%M<U5206>%S<U79D2>"
% AM, PM
am_pm         "<U4E0A><U5348>";"<U4E0B><U5348>"
% %p %I时%M分%S秒
t_fmt_ampm    "%p %I<U65F6>%M<U5206>%S<U79D2>"
% %Y年 %m月 %d日 %A %H:%M:%S %Z
date_fmt      "%Y<U5E74> %m<U6708> %d<U65E5> %A %H:%M:%S %Z"
END LC_TIME

LC_NUMERIC
decimal_point "."
thousands_sep ","
grouping      3
END LC_NUMERIC

LC_MONETARY
% ¥
currency_symbol    "<UFFE5>"
int_curr_symbol    "CNY "

For example, zh_CN provides LC_COLLATE using the iso14651_t1_pinyin collation rule, which is a pinyin-based sorting rule.

Let’s demonstrate how LOCALE’s COLLATION affects PostgreSQL behavior with an example.


Collation Example

Create a table containing 7 Chinese characters and perform sorting operations.

CREATE TABLE some_chinese(
    name TEXT PRIMARY KEY
);
INSERT INTO some_chinese VALUES 
('阿'),('波'),('磁'),('得'),('饿'),('佛'),('割');

SELECT * FROM some_chinese ORDER BY name;

Execute the following SQL to sort the records using the default C collation rule. Here, we can see that it’s actually sorting based on the ascii|unicode code points.

vonng=# SELECT name, ascii(name) FROM some_chinese ORDER BY name COLLATE "C";
 name | ascii
------+-------
| 20315
| 21106
| 24471
| 27874
| 30913
| 38463
 饿   | 39295

But this code-point-based sorting might be meaningless for Chinese users. For example, a Chinese dictionary wouldn’t use this sorting method. Instead, it would use the pinyin sorting rule used by zh_CN, sorting by pinyin. Like this:

SELECT * FROM some_chinese ORDER BY name COLLATE "zh_CN";
 name
------
 
 
 
 
 饿
 
 

We can see that sorting with the zh_CN collation rule produces results in pinyin order abcdefg, rather than the meaningless Unicode code point order.

Of course, this query result depends on the specific definition of the zh_CN collation rule. Such collation rules aren’t defined by the database itself - the database only provides the C collation (or its alias POSIX). COLLATIONs typically come from either the operating system, glibc, or third-party localization libraries (like icu), so different actual definitions might produce different effects.

But at what cost?

The biggest negative impact of using non-C or non-POSIX LOCALE in PostgreSQL is:

Specific collation rules have a huge performance impact on operations involving string comparisons, and they also prevent LIKE queries from using regular indexes.

Additionally, the C LOCALE is guaranteed by the database itself to work on any operating system and platform, while other LOCALEs aren’t, making non-C Locale less portable.


Performance Impact

Let’s consider an example using LOCALE collation rules. We have 1.5 million Apple Store app names and want to sort them according to different regional rules.

-- Create a table of app names, containing both Chinese and English
CREATE TABLE app(
    name TEXT PRIMARY KEY
);
COPY app FROM '/tmp/app.csv';

-- View table statistics
SELECT
    correlation, -- correlation coefficient 0.03542578, basically random distribution
    avg_width,   -- average length 25 bytes
    n_distinct   -- -1, meaning 1,508,076 records with no duplicates
FROM pg_stats WHERE tablename = 'app';

-- Run a series of experiments with different collation rules
SELECT * FROM app;
SELECT * FROM app order by name; 
SELECT * FROM app order by name COLLATE "C";
SELECT * FROM app order by name COLLATE "en_US";
SELECT * FROM app order by name COLLATE "zh_CN"; 

The results are quite shocking - using C and zh_CN can differ by ten times:

# Scenario Time(ms) Notes
1 No sort 180 Uses index
2 order by name 969 Uses index
3 order by name COLLATE "C" 1430 Sequential scan, external sort
4 order by name COLLATE "en_US" 10463 Sequential scan, external sort
5 order by name COLLATE "zh_CN" 14852 Sequential scan, external sort

Here’s the detailed execution plan for experiment 5. Even with sufficient memory configured, it still spills to disk for external sorting. However, all experiments with explicit LOCALE specification showed this behavior, allowing us to compare the performance difference between C and zh_CN.

Another more comparative example is comparison operations.

Here, all strings in the table are compared with 'World', equivalent to performing 1.5 million specific rule comparisons on the table, without even involving disk I/O.

SELECT count(*) FROM app WHERE name > 'World';
SELECT count(*) FROM app WHERE name > 'World' COLLATE "C";
SELECT count(*) FROM app WHERE name > 'World' COLLATE "en_US";
SELECT count(*) FROM app WHERE name > 'World' COLLATE "zh_CN";

Even so, compared to C LOCALE, zh_CN still takes nearly 3 times longer.

# Scenario Time(ms)
1 Default 120
2 C 145
3 en_US 351
4 zh_CN 441

If sorting might be O(n²) comparisons with 10x overhead, then the 3x overhead for O(n) comparisons here roughly matches. We can draw a preliminary conclusion:

Compared to C Locale, using zh_CN or other Locales can cause several times additional performance overhead.

Besides performance issues, incorrect Locale can also lead to functional limitations.


Functional Limitations

Besides poor performance, another unacceptable issue is that using non-C LOCALE prevents LIKE queries from using regular indexes.

Let’s use our previous experiment as an example. We’ll execute the following query on database instances using C and en_US as default LOCALE:

SELECT * FROM app WHERE name LIKE '中国%';

Find all apps starting with “中国” (China).

On a C-based database

This query can normally use the app_pkey index, leveraging the ordered nature of the primary key B-tree to speed up the query, completing in about 2 milliseconds.

postgres@meta:5432/meta=# show lc_collate;
 C

postgres@meta:5432/meta=# EXPLAIN SELECT * FROM app WHERE name LIKE '中国%';
                                 QUERY PLAN
-----------------------------------------------------------------------------
 Index Only Scan using app_pkey on app  (cost=0.43..2.65 rows=1510 width=25)
   Index Cond: ((name >= '中国'::text) AND (name < '中图'::text))
   Filter: (name ~~ '中国%'::text)
(3 rows)

On an en_US-based database

We find that this query cannot use the index, performing a full table scan. The query degrades to 70 milliseconds, 30-40 times worse performance.

vonng=# show lc_collate;
 en_US.UTF-8

vonng=# EXPLAIN SELECT * FROM app WHERE name LIKE '中国%';
                        QUERY PLAN
----------------------------------------------------------
 Seq Scan on app  (cost=0.00..29454.95 rows=151 width=25)
   Filter: (name ~~ '中国%'::text)

Why?

Because index (B-tree index) construction is also based on ordering, which means equality and comparison operations.

However, LOCALE has its own set of rules for string equivalence. For example, the Unicode standard defines many counterintuitive equivalence rules (after all, it’s a universal language standard - like multiple characters combining to form a string equivalent to another single character, details in the Modern Character Encoding article).

Therefore, only the simplest C LOCALE can perform pattern matching normally. The C LOCALE’s comparison rules are very simple - just compare character code points one by one, without any fancy tricks. So, if your database unfortunately uses a non-C LOCALE, then LIKE queries cannot use default indexes.

Solution

For non-C LOCALE instances, only special types of indexes can support such queries:

CREATE INDEX ON app(name COLLATE "C");
CREATE INDEX ON app(name text_pattern_ops);

Here, using the text_pattern_ops operator family to create an index can also support LIKE queries. This is a special operator family for pattern matching that ignores LOCALE and directly performs pattern matching based on character-by-character comparison, which is the C LOCALE way.

Therefore, in this case, only indexes built on the text_pattern_ops operator family, or indexes using COLLATE "C"' on the default text_ops, can support LIKE queries.

vonng=# EXPLAIN ANALYZE SELECT * FROM app WHERE name LIKE '中国%';

Index Only Scan using app_name_idx on app  (cost=0.43..1.45 rows=151 width=25) (actual time=0.053..0.731 rows=2360 loops=1)
   Index Cond: ((name ~>=~ '中国'::text) AND (name ~<~ '中图'::text))
   Filter: (name ~~ '中国%'::text COLLATE "en_US.UTF-8")

After creating the index, we can see that the original LIKE query can use the index.

The issue of LIKE not being able to use regular indexes might seem solvable by creating an additional text_pattern_ops index. But this means that what could have been solved directly using the existing PRIMARY KEY or UNIQUE constraint’s built-in index now requires additional maintenance costs and storage space.

For developers unfamiliar with this issue, they might encounter performance issues in production because queries aren’t using indexes (e.g., if development uses C but production uses non-C LOCALE).


Compatibility

Suppose you’ve inherited a database already using non-C LOCALE (this is quite common), and now that you know the dangers of using non-C LOCALE, you decide to change it back.

What should you watch out for? Specifically, Locale configuration affects the following PostgreSQL features:

  1. Queries using LIKE clauses.
  2. Any queries relying on specific LOCALE collation rules, e.g., using pinyin sorting as result ordering.
  3. Queries using case conversion related functions: upper, lower, and initcap.
  4. The to_char function family, when formatting to local time.
  5. Case-insensitive matching in regular expressions (SIMILAR TO, ~).

So, for any queries involving case conversion, always “explicitly specify Collation!”

If unsure, you can list all queries involving the following keywords using pg_stat_statements for manual review:

LIKE|ILIKE                   -- Using pattern matching?
SIMILAR TO | ~ | regexp_xxx  -- Using i option?
upper, lower, initcap        -- Using for languages with case (Western characters)?
ORDER BY col                 -- When sorting by text columns, relying on specific collation? (e.g., pinyin)

Compatibility Modifications

Generally, C LOCALE is a superset of other LOCALE configurations in terms of functionality, and you can always switch from other LOCALEs to C. If your business doesn’t use these features, you usually don’t need to do anything. If you use localization features, you can always achieve the same effect in C LOCALE by explicitly specifying COLLATE.

SELECT upper('a' COLLATE "zh_CN");  -- Perform case conversion based on zh_CN rules
SELECT  '阿' < '波';                 -- false, under default collation 阿(38463) > 波(27874)
SELECT  '阿' < '波' COLLATE "zh_CN"; -- true, explicitly using Chinese pinyin collation: 阿(a) < 波(bo)

However, please note that collations provided by glibc - like “zh_CN” - aren’t necessarily stable. The PostgreSQL community’s recommended best practice is to use C or POSIX as the default collation. Then use ICU as the collation provider, for example:

SELECT  '阿' < '波' COLLATE "zh-x-icu"; -- true, explicitly using Chinese pinyin collation: 阿(a) < 波(bo)
  • zh-x-icu: Roughly represents “Generic Chinese (unspecified simplified/traditional)”, using ICU rules.
  • zh-Hans-x-icu: Represents Simplified Chinese (Hans = Han Simplified), ICU rules.
  • zh-Hans-CN-x-icu: Simplified Chinese (Mainland China)
  • zh-Hans-HK-x-icu: Simplified Chinese (Hong Kong)
  • zh-Hant-x-icu: Represents Traditional Chinese (Hant = Han Traditional)

Overriding CTYPE

You can override CTYPE when performing case conversions:

SELECT
    'é'          AS original,
    UPPER('é')   AS upper_default,  -- Using default Locale
    UPPER('é' COLLATE "C") AS upper_en_c, -- C Locale doesn't handle these characters 'é'
    UPPER('é' COLLATE PG_C_UTF8) AS upper_en_cutf8, -- C.UTF8 handles some case issues É
    UPPER('é' COLLATE "en_US.UTF-8") AS upper_en_us; -- en_US.UTF8 converts to uppercase É
  oid  |     collname     | collnamespace | collowner | collprovider | collisdeterministic | collencoding | collcollate  |  collctype   | colllocale | collicurules | collversion
-------+------------------+---------------+-----------+--------------+---------------------+--------------+--------------+--------------+------------+--------------+-------------
 12888 | lzh_TW           |            11 |        10 | c            | t                   |            6 | lzh_TW       | lzh_TW       | NULL       | NULL         | 2.28
 12889 | lzh_TW.utf8      |            11 |        10 | c            | t                   |            6 | lzh_TW.utf8  | lzh_TW.utf8  | NULL       | NULL         | 2.28
 13187 | zh_CN            |            11 |        10 | c            | t                   |            2 | zh_CN        | zh_CN        | NULL       | NULL         | 2.28
 13188 | zh_CN.gb2312     |            11 |        10 | c            | t                   |            2 | zh_CN.gb2312 | zh_CN.gb2312 | NULL       | NULL         | 2.28
 13189 | zh_CN.utf8       |            11 |        10 | c            | t                   |            6 | zh_CN.utf8   | zh_CN.utf8   | NULL       | NULL         | 2.28
 13190 | zh_HK.utf8       |            11 |        10 | c            | t                   |            6 | zh_HK.utf8   | zh_HK.utf8   | NULL       | NULL         | 2.28
 13191 | zh_SG            |            11 |        10 | c            | t                   |            2 | zh_SG        | zh_SG        | NULL       | NULL         | 2.28
 13192 | zh_SG.gb2312     |            11 |        10 | c            | t                   |            2 | zh_SG.gb2312 | zh_SG.gb2312 | NULL       | NULL         | 2.28
 13193 | zh_SG.utf8       |            11 |        10 | c            | t                   |            6 | zh_SG.utf8   | zh_SG.utf8   | NULL       | NULL         | 2.28
 13194 | zh_TW.euctw      |            11 |        10 | c            | t                   |            4 | zh_TW.euctw  | zh_TW.euctw  | NULL       | NULL         | 2.28
 13195 | zh_TW.utf8       |            11 |        10 | c            | t                   |            6 | zh_TW.utf8   | zh_TW.utf8   | NULL       | NULL         | 2.28
 13349 | zh_CN            |            11 |        10 | c            | t                   |            6 | zh_CN.utf8   | zh_CN.utf8   | NULL       | NULL         | 2.28
 13350 | zh_HK            |            11 |        10 | c            | t                   |            6 | zh_HK.utf8   | zh_HK.utf8   | NULL       | NULL         | 2.28
 13351 | zh_SG            |            11 |        10 | c            | t                   |            6 | zh_SG.utf8   | zh_SG.utf8   | NULL       | NULL         | 2.28
 13352 | zh_TW            |            11 |        10 | c            | t                   |            4 | zh_TW.euctw  | zh_TW.euctw  | NULL       | NULL         | 2.28
 13353 | zh_TW            |            11 |        10 | c            | t                   |            6 | zh_TW.utf8   | zh_TW.utf8   | NULL       | NULL         | 2.28
 14066 | zh-x-icu         |            11 |        10 | i            | t                   |           -1 | NULL         | NULL         | zh         | NULL         | 153.80.32.1
 14067 | zh-Hans-x-icu    |            11 |        10 | i            | t                   |           -1 | NULL         | NULL         | zh-Hans    | NULL         | 153.80.32.1
 14068 | zh-Hans-CN-x-icu |            11 |        10 | i            | t                   |           -1 | NULL         | NULL         | zh-Hans-CN | NULL         | 153.80.32.1
 14069 | zh-Hans-HK-x-icu |            11 |        10 | i            | t                   |           -1 | NULL         | NULL         | zh-Hans-HK | NULL         | 153.80.32.1
 14070 | zh-Hans-MO-x-icu |            11 |        10 | i            | t                   |           -1 | NULL         | NULL         | zh-Hans-MO | NULL         | 153.80.32.1
 14071 | zh-Hans-SG-x-icu |            11 |        10 | i            | t                   |           -1 | NULL         | NULL         | zh-Hans-SG | NULL         | 153.80.32.1
 14072 | zh-Hant-x-icu    |            11 |        10 | i            | t                   |           -1 | NULL         | NULL         | zh-Hant    | NULL         | 153.80.32.1
 14073 | zh-Hant-HK-x-icu |            11 |        10 | i            | t                   |           -1 | NULL         | NULL         | zh-Hant-HK | NULL         | 153.80.32.1
 14074 | zh-Hant-MO-x-icu |            11 |        10 | i            | t                   |           -1 | NULL         | NULL         | zh-Hant-MO | NULL         | 153.80.32.1
 14075 | zh-Hant-TW-x-icu |            11 |        10 | i            | t                   |           -1 | NULL         | NULL         | zh-Hant-TW | NULL         | 153.80.32.1

The only known issue currently appears in the pg_trgm extension.

https://www.pgevents.ca/events/pgconfdev2024/sessions/session/95/slides/26/pgcon24_collation.pdf

Last modified 2025-04-04: bump to v3.4.1 (98648b1)