From my previous post, I realized that pattern matching operators (LIKE
, ILIKE
) do not utilize indexes. As I explored further, I came across the concept of collation and decided to take some notes in this post.
Encoding
Encoding maps human-readable characters to numbers so computers can understand them. Essentially, it assigns a unique number to each character. Common encodings include UTF-8 and ASCII.
- ASCII: Represents 256 unique characters.
- UTF-8: Represents 1,112,064 characters, covering almost all characters from any language.
Most modern programming languages, such as Go, natively support UTF-8. Unlike ASCII, which uses 1 byte per character, UTF-8 uses up to 4 bytes. Strings in programming languages are typically represented as byte arrays. In ASCII, the number of bytes corresponds to the number of characters. However, this is not true for UTF-8.
In Go, iterating through a string’s byte array will produce incorrect results for non-ASCII characters. Instead, Go uses the rune
type to represent characters.
|
|
Result:
|
|
Collation
Why Do We Need Collation?
In programming, comparing two strings or characters often involves comparing their numeric representations from the encoding table. For example, in ASCII:
a
(decimal 97) >A
(decimal 65).
However, this approach does not work well for all languages. For instance, in the Vietnamese alphabet, a
has variations like â
, ă
, á
, à
, ạ
, and ả
. Comparing â
and b
using UTF-8’s numeric representation gives:
â
(decimal 226) >b
(decimal 98).
But in Vietnamese, â
< b
because â
comes earlier in the alphabet.
Additionally, determining the lowercase of Á
requires knowing the language and region. For example, in vi_VN.UTF-8
, the lowercase of Á
is á
. Collation provides rules for such comparisons.
Vietnamese Alphabet with UTF-8 Decimal Code Points
The table below shows the order of letters in the Vietnamese alphabet and their UTF-8 decimal codes. Notice how the order does not correspond to the numeric codes:
Letter | UTF-8 Decimal Code |
---|---|
A | 65 |
Á | 193 |
À | 192 |
 | 226 |
à | 195 |
Ă | 258 |
E | 69 |
Ê | 202 |
I | 73 |
O | 79 |
Ô | 212 |
Õ | 213 |
Ở | 472 |
U | 85 |
Ứ | 372 |
Y | 89 |
What Is Collation?
Collation defines a set of rules for comparing strings, affecting operations like ordering, comparison operators (<
, >
), and pattern matching (LIKE
, ILIKE
, regex). It also influences capitalization functions like UPPER()
, LOWER()
, and INITCAP()
.
Check the default collation in PostgreSQL:
In this case, the default collation is en_US.UTF-8
. Without explicitly specifying a collation, PostgreSQL uses the default.
Why Don’t Pattern Matching Operators Use Indexes?
Using locales other than C
or POSIX
impacts performance. Pattern matching operators like LIKE
do not utilize ordinary indexes under non-C collations.
The drawback of using locales other than C or POSIX in PostgreSQL is its performance impact. It slows character handling and prevents ordinary indexes from being used by LIKE. For this reason use locales only if you actually need them. As a workaround to allow PostgreSQL to use indexes with LIKE clauses under a non-C locale, several custom operator classes exist. These allow the creation of an index that performs a strict character-by-character comparison, ignoring locale comparison rules. Refer to Section 11.10 for more information. Another approach is to create indexes using the C collation, as discussed in Section 23.2.
Making Pattern Matching Operators Use Indexes
There are three options:
- Collation
C
text_pattern_ops
- Trigram (
pg_trgm
)
Let’s prepare some data for incoming demonstrations:
|
|
If you check the collation of the table’s columns and see it’s empty, it means the collation of the column is the default one:
|
|
Collation C
Collation C
ignores locale rules and performs byte-wise comparisons using memcmp
. It is faster but lacks localization.
The code snippet below shows how they handle collation in Postgres:
|
|
Since collate C compares the byte value of strings, its comparing result differs from other collations.
In the collation en_US.UTF-8
, á
is less than z
:
However, in the collate C, á
is greater than z
, since byte value of á
is greater than byte value of z
.
When using character capitalization functions like LOWER
or UPPER
, the C collation only supports characters in the a-z range. Characters outside this range remain unaffected:
Additionally, the C collation does not validate Unicode code points. It simply performs byte-by-byte comparisons, which can lead to surprising results:
Here, Unicode point U+0378
does not exist, but because the C collation performs a raw byte comparison, it treats it as valid and determines the result accordingly.
References: 1
Example
Create an index with collation C
:
Query with collation C
:
|
|
It is important to add collate "C"
into the query when you are using collate C indexes, otherwise, the operator like equal (=
) will not utilize the index.
Text Pattern Ops
Query:
|
|
Note on Collate C and Text Pattern Ops
Note that you should also create an index with the default operator class if you want queries involving ordinary <, <=, >, or >= comparisons to use an index. Such queries cannot use the xxx_pattern_ops operator classes. (Ordinary equality comparisons can use these operator classes, however.) It is possible to create multiple indexes on the same column with different operator classes. If you do use the C locale, you do not need the xxx_pattern_ops operator classes, because an index with the default operator class is usable for pattern-matching queries in the C locale.
That mean you should also create a normal index on the same column with the text_pattern_ops index, like this:
|
|
The ordinary <, <=, >, or >= comparisons will (probably) utilize the normal index.
Trigram
Trigram indexing is suitable for non-left-anchored patterns like %abc%
but adds overhead. Use it for advanced pattern matching:
Query:
|
|
References
- PostgreSQL Documentation - Locale Behavior
- PostgreSQL Documentation - Collation
- PostgreSQL Documentation - Indexes and Operator Classes
- https://stackoverflow.com/questions/68384666/b-tree-index-does-not-seem-to-be-used/68385039#68385039
- https://dba.stackexchange.com/questions/53811/why-would-you-index-text-pattern-ops-on-a-text-column