Searching for text within your data is a frequently requested feature, often leading to excellent UX. Gmail’s web interface is entirely built on top of the search. No wonder databases have supported basic text search operators like ~, LIKE, ILIKE, etc for a long time. But they often fall short or give inaccurate results as we try to evolve the feature. Say, searching in multiple languages or searching for different variants of the same word: consider realistically, realistic, and realist. Or searching for one word, but not the other.
This is where full-text search comes in. Postgres ships with excellent full-text search capabilities, which allow us to implement text search in our application without incurring additional dependencies and operational overhead. Moreover, having a built-in search allows us to compose a search with other existing queries and procedures. When I say excellent capabilities, I mean fully featured. And when I say fully featured, I mean that it supports stemming, search relevance, search highlights, fuzzy matching, and multiple languages.
In this post, we’ll look at basic text searches. In particular, we’ll look at the fundamental components of a text search, and how to use them.
Document
This is the source data within which we intend to search. Typically spread across multiple columns. Now, I know any search worth its salt must be able to search across multiple rows in multiple tables, but we’ll come to that bit a little later in the post. Postgres offers a function to_tsvector
to obtain a search document from the text. It accepts text
as an argument and returns tsvector
for the text.
1 -- Example 1a: tsvector
2 =# SELECT
3 to_tsvector('With great power, comes great responsibility!')
4 AS document;
5
6 document
7 --------------------------------------------
8 'come':4 'great':2,5 'power':3 'respons':6
9 (1 row)
10
11 -- Example 1b: have strings, will concat
12 =# SELECT
13 to_tsvector('With great power' || ' ' || 'comes great responsibility!')
14 AS document;
15
16 document
17 --------------------------------------------
18 'come':4 'great':2,5 'power':3 'respons':6
19 (1 row)
20
21 -- Example 1c: have tsvectors, will concat
22 =# SELECT
23 to_tsvector('With great power')
24 || to_tsvector('comes great responsibility!')
25 AS document;
26
27 document
28 --------------------------------------------
29 'come':4 'great':2,5 'power':3 'respons':6
30 (1 row)
A few things are noticeable:
Punctuation is gone, and so is ‘with.’ All commonly occurring words that don’t have search relevance, such as ‘with,’ are removed from the documents. These differ from language to language and are called “stop words.”
Words are reduced to a base form. These are called “lexemes”; they are nothing but normalized forms of words, a term taken from linguistics.
The lexemes are sorted alphabetically, and there are numbers associated with the lexemes.
The tsvectors can be concatenated or can operate on concatenated strings
If you want to dig deeper into how the search actually works, these are some of the things you can read about. For now, we’ll focus on just one thing from this: the to_tsvector accepted our text input and returned a searchable value.
As a side note, I like this example driven approach to learning. Code and examples make for an easier understanding of the subject, and give the reader a starting point to dig deeper. In that spirit, this blog post is a repl-driven-blog-post.
Query
The term we wish to search for. Typically a word or a phrase, but it can be any text. Even a full document, if you will. For obtaining a query object from the given search string, Postgres has two functions: to_tsquery
and plainto_tsquery
. to_tsquery
allows us to use control characters like wildcards, but is a lot stricter with the input. plainto_tsquery
on the other hand doesn’t use control characters, but escapes all the input, and is safe from SQL injection. In this post we’ll only look at to_tsquery
since it aligns with our “fully featured” requirement.
1 -- Example 2: tsquery
2 =# SELECT to_tsquery('responsibility');
3 to_tsquery
4 ------------
5 'respons'
6 (1 row)
7
8 -- Example 3: tsquery multiple words, with escaping
9 =# SELECT to_tsquery('great\ responsibility');
10 to_tsquery
11 ---------------------
12 'great' & 'respons'
13 (1 row)
14
15 -- Example 4: Wildcard search
16 =# SELECT to_tsquery('Eliz:*');
17 to_tsquery
18 ------------
19 'eliz':*
20 (1 row)
21
22 -- Example 5: Intersection (AND-ing)
23 =# SELECT to_tsquery('Barry') && to_tsquery('Allen');
24 ?column?
25 -------------------
26 'barri' & 'allen'
27 (1 row)
28
29 -- Example 5: Union (OR-ing)
30 =# SELECT to_tsquery('Barry') || to_tsquery('Wally');
31 ?column?
32 -------------------
33 'barri' | 'walli'
34 (1 row)
Examples demonstrate the various usages of to_tsquery. Wildcards, AND, and OR do exactly what you’d expect them to do. The key takeaway is that, these are just regular functions, like LOWER
, LENGTH
and so we can just use them in any query.
1 -- Example 6
2 =# SELECT
3 name,
4 to_tsvector(name) AS document
5 FROM states
6 ORDER BY name ASC
7 LIMIT 10;
8
9 name | document
10 ----------------------+---------------------------
11 Alabama | 'alabama':1
12 Alaska | 'alaska':1
13 Arizona | 'arizona':1
14 Arkansas | 'arkansa':1
15 California | 'california':1
16 Colorado | 'colorado':1
17 Connecticut | 'connecticut':1
18 Delaware | 'delawar':1
19 District of Columbia | 'columbia':3 'district':1
20 Florida | 'florida':1
21 (10 rows)
Putting document and query together
Now the question arises: How to actually use tsvector and tsquery? Enter the @@
operator. This is the operator that performs the search. Examples work the best, so let’s search for a US State with the word “North” in it:
1 =# SELECT
2 name,
3 to_tsvector(name) AS document
4 FROM states
5 WHERE to_tsvector(name) @@ to_tsquery('north')
6 ORDER BY name ASC;
7
8 name | document
9 ----------------+------------------------
10 North Carolina | 'carolina':2 'north':1
11 North Dakota | 'dakota':2 'north':1
12 (2 rows)
Or, let’s search for a state that has a word that starts with “CA”:
1 =# SELECT
2 name,
3 to_tsvector(name) AS document
4 FROM states
5 WHERE to_tsvector(name) @@ to_tsquery('ca:*')
6 ORDER BY name ASC;
7
8 name | document
9 ----------------+------------------------
10 California | 'california':1
11 North Carolina | 'carolina':2 'north':1
12 South Carolina | 'carolina':2 'south':1
13 (3 rows)
And that is your standard text search spanning multiple rows of a table. A friendly, neighborhood text-search is just a WHERE clause away! Another clear takeaway is that any @@
operation is no different from your average =
operation. Naturally, a multi-table search is just a join away. Multi-column searches are just a concatenation away.
However, what I like about this idea is that I can now compose this search with my existing SQL queries. Let’s say I have a query for “all contacts of a user that have a facebook profile”, and now I can “name search” in just this subset. To me, this is the best use-case of having the search built-in. Composability is a very powerful design pattern.
A note on NULL
s: As with everything in Postgres, text-search doesn’t quite work well with NULLs. If you have null columns, COALESCE is your friend, use it liberally!
Search relevance and rankings
Searches are often centered around finding “all matching documents”, rather than finding a specific document. In its most basic form, search relevance boils down to two questions:
How do I rank the results returned by the search?
How do I control the ranking based on my context and requirements?
The answer to the first question is the ts_rank
function. It accepts a tsvector and a tsquery as an argument, and returns the “rank”; which is a bit of a misnomer, because unlike a regular rank, where 1 is better than 2 is better than 3, this rank has the property “higher the better”. It’s best used in the ORDER clause. Here’s an example, searching all people whose name starts with “Eliz”. If that sounds odd, think autocomplete :
1 =# SELECT
2 first_name,
3 ts_rank(to_tsvector(first_name), to_tsquery('eliz:*')) as rank
4 FROM person_names
5 WHERE to_tsvector(first_name) @@ to_tsquery('eliz:*')
6 ORDER BY rank DESC
7 LIMIT 10;
8
9 first_name | rank
10 -----------------+-----------
11 Elizabeth Eliza | 0.1215850
12 Elizabeth | 0.0607927
13 Elizabeth | 0.0607927
14 Elizabeth | 0.0607927
15 Elizabeth | 0.0607927
16 Elizabeth | 0.0607927
17 Elizabeth | 0.0607927
18 Elizabeth | 0.0607927
19 ELIZABETH | 0.0607927
20 Elizabeth | 0.0607927
I think you will find that absolute rank isn’t as useful as being able to order results by it. That brings us to the next question, how can the ts_rank function be configured? Say, you are searching blog posts, and want a match in title to carry more weight than a match in the body. For this, Postgres offers “weights”. The weights are called as A, B, C and D, in the order of precedence. The default value for these weights are 1.0, 0.4, 0.2, 0.1. Which means a match with A carries 10 times more weight than a match with D.
Think of these weights as “tags”, i.e. you tag a tsvector as A or B or C or D, and specify which tags carry how much weight. With that, Postgres will yield appropriate rank. The tags analogy will make more sense once we look at how tsquery uses these weights. Here’s a query to demonstrate using weights in tsvector:
1 =# SELECT ts_rank(
2 setweight(to_tsvector('With great power'), 'A') ||
3 setweight(to_tsvector('comes great responsibility!'), 'D'),
4 to_tsquery('power'))
5 AS rank;
6
7 rank
8 ----------
9 0.607927
10 (1 row)
11
12 =# SELECT ts_rank(
13 setweight(to_tsvector('With great power'), 'A') ||
14 setweight(to_tsvector('comes great responsibility!'), 'D'),
15 to_tsquery('responsibility'))
16 AS rank;
17
18 rank
19 ----------
20 0.0607927
21 (1 row)
Our document here comprises two parts, “with great power”, tagged A, and “comes great responsibility”, tagged D.
The rank for “power” is 10 times higher than “responsibility” when searching this document, because “power” is in group A, while responsibility is in group D. Without the weights, or with same weights to all components, they will have the same rank.
The default assignment of weights A = 1, B = 0.4, C = 0.2 and D = 0.1 can be changed. Refer to the documentation for variations of ts_rank
that accept weights as an argument. Playing around with the values in the psql console would give you an idea what works best for you. I’ve often found that the default values really do work the best.
Coming back to the tags analogy, these same weights can also be assigned to a tsquery object, and the query would then match only amongst the given weight groups. Quite like a “filter”. This allows for features like “match this, but not that”. Have a look:
1 =# SELECT setweight(to_tsvector('With great power'), 'A')
2 || setweight(to_tsvector('comes great responsibility!'), 'A')
3 @@
4 to_tsquery('resp:*B')
5 AS is_match;
6
7 is_match
8 ----------
9 f
10 (1 row)
11
12 =# SELECT setweight(to_tsvector('With great power'), 'A')
13 || setweight(to_tsvector('comes great responsibility!'), 'A')
14 @@
15 to_tsquery('resp:*A')
16 AS is_match;
17
18 is_match
19 ----------
20 t
21 (1 row)
The official documentation about controlling text search is quite good, and detailed. I highly recommend reading at least this section of the documentation, if not all of it.
This concludes the post about basic full-text search in Postgres. Have fun searching! Let me know what curious cases you tried out with tsquery and tsvectors.
In a follow-up post, we’ll look at improving search performance by indexing, fuzzy matching, text highlighting, and supporting multiple languages.