If you follow PostgreSQL's development or Depesz' blog, you might know that the pg_stat_statements extension is getting a lot of improvements in 9.2 and, in particular, is now able to «lump “similar” queries together». I will not rephrase here what Depesz already explained on his blog.
So, we have this great feature in 9.2, but what about previous releases? Up to 9.1, pg_stat_statements keeps track of the most frequent queries individually: no normalization, nothing. I have been thinking for a while about porting pgBadger's normalization code to SQL. The following pieces of code are tested under PostgreSQL 9.1 but should be easy to port to previous versions. So here is the function to create (I tried my best to keep it readable :-)):
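A minimal sketch of what such a function can look like is shown below; it applies only a simplified subset of pgbadger-style rules, and the normalize_query name is illustrative (the exact set of regular expressions in the original function may differ):

-- Sketch only: assumes standard_conforming_strings = on (the 9.1 default)
-- and covers a reduced set of normalization rules.
CREATE OR REPLACE FUNCTION normalize_query(IN TEXT, OUT TEXT) AS $body$
  SELECT
    regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(
      lower($1),
      -- strip block comments
      '/\*.*?\*/', '', 'g'),
      -- strip line comments
      '--[^\r\n]*', '', 'g'),
      -- mask quoted string literals (escaped quotes are not handled here)
      $q$'[^']*'$q$, $q$''$q$, 'g'),
      -- mask numeric literals
      '\m\d+(\.\d+)?', '0', 'g'),
      -- collapse lists of already-masked constants, e.g. IN (...) or VALUES (...)
      $q$\(\s*(''|0)(\s*,\s*(''|0))*\s*\)$q$, '(...)', 'g'),
      -- collapse whitespace
      '\s+', ' ', 'g');
$body$ LANGUAGE SQL;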
Keep in mind that I extracted these regular expressions straight from pgbadger.
Any comment about how to make it quicker/better/simpler/whatever is
appreciated!
Here is the associated view to group everything by normalized query:
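A minimal sketch of such a view, assuming the 9.1 pg_stat_statements columns (the pg_stat_statements_normalized name and the column subset are illustrative):

CREATE OR REPLACE VIEW pg_stat_statements_normalized AS
  SELECT normalize_query(query) AS query,
         sum(calls)             AS calls,
         sum(total_time)        AS total_time,
         sum(rows)              AS rows,
         sum(shared_blks_hit)   AS shared_blks_hit,
         sum(shared_blks_read)  AS shared_blks_read
    FROM pg_stat_statements
   GROUP BY normalize_query(query);

Grouping by userid and dbid as well may be worthwhile when several roles or databases are tracked.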
Using this function and the view, a small pgbench -t 30 -c 10 gives:
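The aggregated figures can be read back from the view with a query of this shape (the column selection and ordering are illustrative):

SELECT query, calls, total_time, rows
  FROM pg_stat_statements_normalized
 ORDER BY total_time DESC;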
For information, the real, non-normalized pg_stat_statements view is 959 lines long:
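That figure is presumably the number of entries in the raw view, as a count like this one would show:

SELECT count(*) FROM pg_stat_statements;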
Obviously, regular expressions are not magic, and this will never be as strict as the engine itself. But at least it helps while waiting for 9.2 in production! Do not hesitate to report bugs and leave comments to help improve it!
Comments
This is not the first place I see it, so I have to ask: why “[\t\s\r\n]+” and not simply \s+ ? \s class already includes \t, \r, \n, and most likely also \f and \v.
@Depesz: mh, I have no explanation, sorry; it's just a copy/paste from the pgfouine/pgbadger code :-)
Fixed in the blog post.
When I first started reading this post, I assumed that the idea was to normalise the query text on the fly within the executor hooks that pg_stat_statements uses. While that might be quite inefficient, it would at least have the advantage of not eating an entry from the shared hashtable for every set of constants for every query seen. The hashtable will constantly have entries evicted, so you’re only going to have a very limited window on the queries executed against the database. Look at the “calls” figure for each query after executing pgbench in your example.
Why not somehow compile the regular expression ahead of time?
FYI, it isn’t that hard to backport the changes to the core system that make the new pg_stat_statements work (mostly it’s just that there are new hooks added). You’d have to fix pg_stat_statements to work with the 9.1 representation of the query tree (the code won’t compile, because the representation of CTAS changed, and perhaps other small things too). The “jumbling” code you’d have to modify is fairly straightforward though.
@Peter Geoghegan: About backporting, my goal here was to give an easy trick for pre-9.2 releases without patching or compiling anything, just pure and easy SQL.
I'm not sure I understand the first part of your comment. If you are talking about the current implementation of pg_stat_statements in 9.2+, I'm not sure regular expressions are the best way to normalize queries on the fly. I believe a tokenizer would be very useful in PostgreSQL core, and this has been discussed in the past. Normalizing would just be another useful feature built on top of it.
There is a low-level interface to the authoritative scanner (tokenizer) used for lexing SQL in Postgres. That’s how I got 9.2’s pg_stat_statements to produce a normalised query string after the parser stage (note that that’s totally orthogonal to how the query fingerprinting works). It’s been used by plpgsql for a long time.
If you don’t understand my remarks about the problem with what you’ve done here, run pgbench in just the same way against 9.1, but with -M prepared. Then run without -M prepared, and apply your technique. What you’ve done here is going to fill the internal pg_stat_statements table too fast to be useful, because there will be a distinct entry for every set of constants for every query. Minor variations will be evicted, and query execution costs will be dramatically underestimated. That’s clearly what’s happened in your example.
@Peter Geoghegan: OK, I understood this time!
Well, thank you for pointing this out, but this problem is inherent to pg_stat_statements anyway. The only thing I can think of to somehow limit it is to use a higher pg_stat_statements.max…
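For reference, the current limit can be checked from SQL; raising it requires editing postgresql.conf and restarting the server (1000 is the default in 9.1):

-- Size of the pg_stat_statements hashtable (default: 1000 entries).
-- To raise it, set pg_stat_statements.max in postgresql.conf and restart.
SHOW pg_stat_statements.max;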