implementing full-text search with postgres

By Per Fröjd

October 17, 2021

postgresql
snippets

Using text-search for fun (and probably not profit)

So you have your fancy table set up, you’ve got your content and now you want to start thinking about searchability of your content. You probably tried something like:

SELECT *
FROM table
WHERE content ILIKE '%$query%';

And it worked, sort of.. but not that great really. So you start looking at implement full-text search and immediately get blasted with things like Apache Solr, Apache Lucene, ElasticSearch and then realize this is all too much work for your little side-project.

Good news is that you don’t need any of that for your basic search functionality, since postgres has you covered.

To start with, you need to set up your tsv column, so tsv is short for TS_VECTOR which is a vectorized way of storing the text we’re going to operate on, but before that, we can quickly look at what the output will come to look like:

SELECT to_tsvector('pg_catalog.english', 'Hey, look mom, I made a fancy text search!');
-- 'fanci':7 'hey':1 'look':2 'made':5 'mom':3 'search':9 'text':8

To me, this is mostly magic, but we can see that it deconstructed our sentence, and added some numbers to it! It also removed some things known as stop-words, which is smaller words like “I” or “a” that are otherwise too complex to search for. Want to understand more in depth how this works? Well, I can’t really tell you, but I’m sure some other blog can.

So lets create our column.

CREATE TABLE article(
    id serial PRIMARY KEY,
    content TEXT NOT NULL,
    content_tsv TSVECTOR,
    updated_at timestamp with timezone default now(),
    created_at timestamp with timezone default now(),
    deleted_at timestamp with timezone
);

Some people argue that you don’t need to store the tsv separate, that you can do the to_tsvector call on the fly. I disagree, because in this way you can much more easily add indexing to the tsv column, but your mileage may vary.

Okay, so there it is, now how do we update it, with a trigger of course!

CREATE FUNCTION article_update_tsv() RETURNS trigger AS $$
BEGIN
    new.content_tsv :=
        to_tsvector('pg_catalog.english', new.content)
    return new;
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER article_update_tsv_trigger BEFORE INSERT OR UPDATE
ON article FOR EACH ROW EXECUTE PROCEDURE article_update_tsv();

This is pretty straight forward, we create the function that basically popuplates content_tsv with the results of the content column. Then we add the trigger that does this whenever we INSERT or UPDATE a row in the table.

Okay, great, but how do we do queries on this? This can vary a lot, because very few of the ORMs I’ve used have support for these columns, so I always found raw SQL to do the job here.

But here’s how it looks for me:

SELECT *
FROM article
WHERE content_tsv @@ TO_TSQUERY(${completeWithOrTokens})
AND deleted_at is null
ORDER BY RANDOM()
LIMIT 1;

One thing to note here (aside from the RANDOM and LIMIT stuff), is that we need to provide a nice way of searching either for a single word, or a full sentence, and it’s not that complicated!

Simply split the search terms with a ' | ' in between and there you go!

Also, we talked about indexing earlier, so here’s the indexing strategy I use:

CREATE INDEX article_tsv_index ON article USING gin (content_tsv);

One final thing, it’s pretty easy to include more columns in the content_tsv, it looks something like this.

CREATE FUNCTION article_update_tsv() RETURNS trigger AS $$
BEGIN
    new.content_tsv :=
        to_tsvector('pg_catalog.english', new.content)
        to_tsvector('pg_catalog.english', new.other_column) ||
        to_tsvector('pg_catalog.english', new.another_column)
    return new;
END
$$ LANGUAGE plpgsql;

Easy! Now we have some pretty snappy and (more correct) full-text search, without needing to resort to additional software or third-party applications.