Jędrzej Biedrzycki: Ignore nulls in Postgres

0. Problem statement

In my opinion, one of the features Postgres has been missing most is the ability to ignore nulls in analytic functions. Other databases support the ignore nulls clause, which makes the task much easier. Let’s start by creating a simple two-column table and populating it with data:

create table tbl (
    id bigint primary key,
    a int
);
insert into tbl (id, a)
select s, case when random() < .5 then s end
from generate_series(0, 100) s;

The problem is to generate a column (lnn, last not null) with the last non-null value ordered by id:

id  a       lnn
0   <null>  <null>
1   <null>  <null>
2   2       2
3   <null>  2
4   <null>  2
5   <null>  2
6   6       6
7   7       7
8   <null>  7
9   9       9
Table with a nullable column and a column with the gaps filled with the last non-null value

1. How we would do it in Oracle

As mentioned, in Oracle the solution is straightforward:

select id, a, last_value(a) ignore nulls over (order by id) lnn
from tbl;

The query passes over the rows ordered by id, from the first up to the current row (as determined by the over clause), and retrieves the last non-null value (or null if none is present).
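To make the "up to the current row" behavior explicit, the default window frame Oracle applies here can be spelled out; this sketch is equivalent to the query above:

```sql
-- Same query with the implicit default frame written out explicitly:
-- with an order by clause, the frame defaults to
-- range between unbounded preceding and current row.
select id, a,
       last_value(a) ignore nulls over (
           order by id
           range between unbounded preceding and current row
       ) lnn
from tbl;
```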

2. Nested query in Postgres

One of the possible solutions in Postgres is a nested query:

select *, (
    select a
    from tbl it
    where it.id <= t.id
    and it.a is not null
    order by id desc
    limit 1
) lnn
from tbl t;

This solution works and yields the expected results, but has two major drawbacks:

  • It’s much more convoluted than Oracle’s version – you need to think harder about what it does, whereas the intent of the ignore nulls version is obvious.
  • It is not reusable.

Fortunately, there is yet another possible solution.

3. Aggregates in Postgres

Although there is no ignore nulls syntax, Postgres provides a way to write an equivalent analytic function using create aggregate. One of the parameters it takes is sfunc, a reducing (state transition) function. In our case it should simply return the second value if it’s not null and the first one otherwise:

create function coalesce_r_sfunc(state anyelement, value anyelement)
    returns anyelement
    immutable parallel safe
as
$$
select coalesce(value, state);
$$ language sql;

I’ve named the function coalesce_r_sfunc because the built-in coalesce returns the first non-null argument, whereas our function returns the last of its exactly two arguments. Now we can define our custom analytic function:

create aggregate find_last_ignore_nulls(anyelement) (
    sfunc = coalesce_r_sfunc,
    stype = anyelement
);

We can use it just like an ordinary analytic function (notice how similar the query is to the one for Oracle):

select *, find_last_ignore_nulls(a) over (order by id)
from tbl;

You might have noticed the notation inconsistency – I’m talking about the analytic functions whereas the syntax is create aggregate. It’s because the defined function can be used both as an analytic and aggregate function, depending on the context. Since our function processes multiple rows and returns a result for every row – it is clearly analytic.
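The same function can also be called without an over clause, in which case it acts as a plain aggregate and collapses the input to a single value – here, the last non-null a when scanning in id order (a quick sketch using the table defined above):

```sql
-- Used as a plain aggregate: returns one row containing the last
-- non-null "a" encountered when the rows are fed in id order.
select find_last_ignore_nulls(a order by id) as last_a
from tbl;
```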

4. Performance considerations

To run the benchmarks, let us populate the table with a million rows (truncating it first, so the new ids don’t collide with the existing primary key values):

truncate tbl;
insert into tbl (id, a)
select s, case when random() < .5 then s end
from generate_series(0, 1000000) s;

We also need to prepare two scripts: ns.sql containing the nested select query and a.sql containing the analytic query. The benchmarks yielded the following results:
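The two scripts simply wrap the queries from the previous sections; as a sketch, their contents would be:

```sql
-- ns.sql: the nested-select version
select *, (
    select a
    from tbl it
    where it.id <= t.id
    and it.a is not null
    order by id desc
    limit 1
) lnn
from tbl t;
```

```sql
-- a.sql: the custom-aggregate version
select *, find_last_ignore_nulls(a) over (order by id) lnn
from tbl;
```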

root@5686cdefbcdf:/# pgbench -f ns.sql -U jb -t 10 postgres
starting vacuum...
transaction type: ns.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 10
number of transactions actually processed: 10/10
latency average = 7104.174 ms
tps = 0.140762 (including connections establishing)
tps = 0.140769 (excluding connections establishing)
root@5686cdefbcdf:/# pgbench -f a.sql -U jb -t 10 postgres
starting vacuum...
transaction type: a.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 10
number of transactions actually processed: 10/10
latency average = 3442.054 ms
tps = 0.290524 (including connections establishing)
tps = 0.290603 (excluding connections establishing)

The number of transactions per second (tps) for the analytic function is roughly twice that of the nested query, and the average latency is roughly halved.
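To see where the time goes, explain analyze is a quick way to inspect the plans (a sketch; the exact plan shape depends on your Postgres version and settings):

```sql
-- Inspect the execution plan of the aggregate version; run the same
-- command on the nested-select query to compare.
explain (analyze, timing off)
select *, find_last_ignore_nulls(a) over (order by id)
from tbl;
```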

5. Conclusions

Despite the lack of ignore nulls syntax in Postgres, we can mimic the required behavior by defining a custom aggregate. The code is really simple and it has several advantages:

  • The code is very readable.
  • Aggregates are reusable – there is no need to rewrite whole queries.
  • It’s about twice as fast as the equivalent nested select.

This was just a sample of what can be done using custom aggregates; I will dive deeper into this topic in upcoming posts.

Here are the db-fiddles for Oracle and Postgres. The code is also available on github.
