Normalizing queries for pg_stat_statements < 9.2

If you follow PostgreSQL’s development or Depesz’ blog, you might know that pg_stat_statement extension is getting a lot of improvement in 9.2 and especially is able to «lump “similar” queries together». I will not re-phrase here what Despsz already explain on his blog.

So, we have this great feature in 9.2, but what about previous release? Until 9.1, pg_stat_statement is keeping track of most frequent queries individually. No normalization, nothing. It’s been a while I’ve been thinking about importing pgBadger normalization code in SQL. Next pieces of code are tested under PostgreSQL 9.1 but should be easy to port to previous versions. So here is the function to create (I tried my best to keep it readable :-)):

CREATE OR REPLACE FUNCTION normalize_query(IN TEXT, OUT TEXT) AS $body$
  SELECT
    regexp_replace(regexp_replace(regexp_replace(regexp_replace(
    regexp_replace(regexp_replace(regexp_replace(regexp_replace(

    lower($1),
    
    -- Remove extra space, new line and tab caracters by a single space
    '\s+',                          ' ',           'g'   ),

    -- Remove string content                       
    $$\\'$$,                        '',            'g'   ),
    $$'[^']*'$$,                    $$''$$,        'g'   ),
    $$''('')+$$,                    $$''$$,        'g'   ),

    -- Remove NULL parameters                      
    '=\s*NULL',                     '=0',          'g'   ),

    -- Remove numbers                              
    '([^a-z_$-])-?([0-9]+)',        '\1'||'0',     'g'   ),

    -- Remove hexadecimal numbers                  
    '([^a-z_$-])0x[0-9a-f]{1,10}',  '\1'||'0x',    'g'   ),

    -- Remove IN values                            
    'in\s*\([''0x,\s]*\)',          'in (...)',    'g'   )
  ;
$body$
LANGUAGE SQL;

Keep in mind that I extracted these regular expressions straight from pgbadger. Any comment about how to make it quicker/better/simpler/whatever is appreciated!

Here the associated view to group everything according to the normalized queries:

CREATE OR REPLACE VIEW pg_stat_statements_normalized AS
SELECT userid, dbid, normalize_query(query) AS query, sum(calls) AS calls,
  sum(total_time) AS total_time, sum(rows) as rows,
  sum(shared_blks_hit) AS shared_blks_hit,
  sum(shared_blks_read) AS shared_blks_read,
  sum(shared_blks_written) AS shared_blks_written,
  sum(local_blks_hit) AS local_blks_hit,
  sum(local_blks_read) AS local_blks_read,
  sum(local_blks_written) AS local_blks_written, 
  sum(temp_blks_read) AS temp_blks_read,
  sum(temp_blks_written) AS temp_blks_written
FROM pg_stat_statements
GROUP BY 1,2,3;

Using this function and the view, a small pgbench -t 30 -c 10, gives:

pgbench=> SELECT round(total_time::numeric/calls, 2) AS avg_time, calls, 
  round(total_time::numeric, 2) AS total_time, rows, query 
FROM pg_stat_statements_normalized 
ORDER BY 1 DESC, 2 DESC;

 avg_time | calls | total_time | rows |                                               query                                               
----------+-------+------------+------+---------------------------------------------------------------------------------------------------
     0.05 |   187 |       9.86 |  187 | update pgbench_accounts set abalance = abalance + 0 where aid = 0;
     0.01 |   195 |       2.30 |  195 | update pgbench_branches set bbalance = bbalance + 0 where bid = 0;
     0.00 |   300 |       0.00 |    0 | begin;
     0.00 |   300 |       0.00 |    0 | end;
     0.00 |   196 |       0.00 |  196 | insert into pgbench_history (tid, bid, aid, delta, mtime) values (0, 0, 0, 0, current_timestamp);
     0.00 |   193 |       0.00 |  193 | select abalance from pgbench_accounts where aid = 0;
     0.00 |   183 |       0.26 |  183 | update pgbench_tellers set tbalance = tbalance + 0 where tid = 0;
     0.00 |     1 |       0.00 |    0 | truncate pgbench_history

For information, the real non-normalized pg_stat_statement view is 959 lines:

pgbench=> SELECT count(*) FROM pg_stat_statements;

 count 
-------
   959
(1 ligne)

Obvisouly, regular expression are not magic and this will never be as strict as the engine itself. But at least it helps while waiting for 9.2 in production!

Do not hesitate to report me bugs and comment to improve it!

Cheers!

Comments Leave a comment by email or Disqus

Note: I only publish your name/pseudo, mail subject and content. I will NOT publish your email address.

depesz reply

Mon, 06 Aug 2012 20:53:04 +0200

This is not the first place I see it, so I have to ask: why “[\t\s\r\n]+” and not simply \s+ ? \s class already includes \t, \r, \n, and most likely also \f and \v.

ioguix reply

Mon, 06 Aug 2012 23:06:04 +0200

@Depesz: mh, I have no explanation sorry, just copy/paste from pgfouine/pgbadger code :-)

Fixed in the blog post.

Peter Geoghegan reply

Mon, 06 Aug 2012 22:28:43 +0200

When I first started reading this post, I assumed that the idea was to normalise the query text on the fly within the executor hooks that pg_stat_statements uses. While that might be quite inefficient, it would at least have the advantage of not eating an entry from the shared hashtable for every set of constants for every query seen. The hashtable will constantly have entries evicted, so you’re only going to have a very limited window on the queries executed against the database. Look at the “calls” figure for each query after executing pgbench in your example.

Why not somehow compile the regular expression ahead of time?

FYI, it isn’t that hard to backport the changes to the core system that make the new pg_stat_statements work (mostly it’s just that there are new hooks added). You’d have to fix pg_stat_statements to work with the 9.1 representation of the query tree (the code won’t compile, because the representation of CTAS changed, and perhaps other small things too). The “jumbling” code you’d have to modify is fairly straightforward though.

ioguix reply

Mon, 06 Aug 2012 23:38:40 +0200

@Peter Geoghegan: About backporting, my goal here was to give an easy trick on <9.2 without patching/compiling code, just pure and easy SQL.

I’m not sure to understand the first part of your comment. If you talk about current implementation of pg_stat_statement in 9.2+, I’m not sure regular expression are the best way to normalize queries on the fly. I believe a tokenizer would be very useful in PostgreSQL core, and discussed it in past. Normalizing would just be another useful feature from it.

Peter Geoghegan reply

Tue, 07 Aug 2012 00:53:33 +0200

There is a low-level interface to the authoritative scanner (tokenizer) used for lexing SQL in Postgres. That’s how I got 9.2’s pg_stat_statements to produce a normalised query string after the parser stage (note that that’s totally orthogonal to how the query fingerprinting works). It’s been used by plpgsql for a long time.

If you don’t understand my remarks about the problem with what you’ve done here, run pgbench in just the same way against 9.1, but with -M prepared. Then run without -M prepared, and apply your technique. What you’ve done here is going to fill the internal pg_stat_statements table too fast to be useful, because there will be a distinct entry for every set of constants for every query. Minor variations will be evicted, and query execution costs will be dramatically underestimated. That’s clearly what’s happened in your example.

ioguix reply

Tue, 07 Aug 2012 11:07:01 +0200

@Peter Geoghegan: Ok, I understood this time !

Well, thank you to point this out, but this problem is inherent to pg_stat_statements anyway. The only thing I can think about to somehow limit this problem is to have a higher pg_stat_statements.max …