ioguix' README


Friday 28 March 2014

Playing with indexes and better bloat estimate

Most PostgreSQL DBAs probably know about the large bloat estimate query integrated in check_postgres. It is supposed to compute a rough estimate of the bloat of tables and indexes in a database. As the PostgreSQL wiki page says:

This query is for informational purposes only. It provides a loose estimate of table growth activity only, and should not be construed as a 100% accurate portrayal of space consumed by database objects

Lately, a customer asked me about existing tools to help decide when to REINDEX. While writing a detailed answer with examples, I was quite surprised to see an estimated index size factor of 0.7 for a simple table! I realized I had never paid much attention to this part of the result...

The problem

Here is a test table, a copy of table "rental" from the pagila project:

CREATE TABLE test (LIKE rental INCLUDING DEFAULTS INCLUDING CONSTRAINTS INCLUDING INDEXES );
INSERT INTO test SELECT * FROM rental;
ANALYZE;
postgres@pagila=# \d test
                                           Table "public.test"
    Column    |            Type             |                         Modifiers                          
--------------+-----------------------------+------------------------------------------------------------
 rental_id    | integer                     | not null default nextval('rental_rental_id_seq'::regclass)
 rental_date  | timestamp without time zone | not null
 inventory_id | integer                     | not null
 customer_id  | smallint                    | not null
 return_date  | timestamp without time zone | 
 staff_id     | smallint                    | not null
 last_update  | timestamp without time zone | not null default now()
Indexes:
    "test_pkey" PRIMARY KEY, btree (rental_id)
    "test_rental_date_inventory_id_customer_id_idx" UNIQUE, btree (rental_date, inventory_id, customer_id)
    "test_inventory_id_idx" btree (inventory_id)

I copy-pasted the bloat query into the file "~/tmp/bloat_original.sql"; here is its output for this table:

postgres@pagila=# \i ~/tmp/bloat_original.sql
 current_database | schemaname |    tablename     | tbloat | wastedbytes |                        iname                        | ibloat | wastedibytes 
------------------+------------+------------------+--------+-------------+-----------------------------------------------------+--------+--------------
...
 pagila           | public     | test             |    1.2 |      188416 | test_pkey                                           |    0.5 |            0
 pagila           | public     | test             |    1.2 |      188416 | test_rental_date_inventory_id_customer_id_idx       |    0.8 |            0
 pagila           | public     | test             |    1.2 |      188416 | test_inventory_id_idx                               |    0.7 |            0
...

A B-tree index is filled at 90% by default, see the fillfactor storage parameter. So a freshly created B-tree index with no bloat is supposed to be about 1.1x larger than it would be with a fillfactor of 100. As I had some time to dive into this, I couldn't resist investigating these insane estimated size factors for my test table indexes: 0.5, 0.8 and 0.7. How could an index be smaller than it is supposed to be?
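If you want to check that ~1.1 factor yourself, here is a quick sketch (assuming the "test" table above; note it rebuilds "test_pkey" twice):

ALTER INDEX test_pkey SET (fillfactor = 100);
REINDEX INDEX test_pkey;
SELECT pg_table_size('test_pkey');  -- index packed at 100%
ALTER INDEX test_pkey SET (fillfactor = 90);
REINDEX INDEX test_pkey;
SELECT pg_table_size('test_pkey');  -- back to the default 90%, roughly 1.11x larger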

I didn't have to dive deep into the query: after a closer look at its code and comments, we find:

COALESCE(CEIL((c2.reltuples*(datahdr-12))/(bs-20::float)),0) AS iotta -- very rough approximation, assumes all cols

Oh, OK. The query estimates the ideal size of each index as if it referenced ALL the table's fields. It's quite rare to find a B-tree index covering all the fields of a table, and obviously there's no point in having multiple indexes on a table that each reference every field of the table.
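To see the mismatch on our test table, here is a quick sketch comparing how many columns each index actually covers with the total number of columns in the table; the original query effectively prices every index as if it covered them all:

SELECT indexrelid::regclass AS index_name,
       indnatts             AS indexed_columns,
       (SELECT count(*) FROM pg_attribute
         WHERE attrelid = indrelid AND attnum > 0 AND NOT attisdropped) AS table_columns
FROM pg_index
WHERE indrelid = 'test'::regclass;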

A look at the real bloat

First, let's see how bloated the indexes really are:

CREATE schema stattuple;
CREATE extension pgstattuple WITH schema stattuple;
SELECT relname, pg_table_size(oid) AS index_size,
  100-(stattuple.pgstatindex(relname)).avg_leaf_density AS bloat_ratio
FROM pg_class
WHERE relname ~ 'test' AND relkind = 'i';
                    relname                    | index_size | bloat_ratio 
-----------------------------------------------+------------+-------------
 test_pkey                                     |     376832 |       10.25
 test_inventory_id_idx                         |     507904 |       34.11
 test_rental_date_inventory_id_customer_id_idx |     630784 |       26.14

First point: the bloat is not 10% on every index, only on the PK. This is because the indexes were created BEFORE inserting the data, and it looks like the data was naturally ordered on the PK when scanning table "rental" sequentially. What if we load the data sorted on the "inventory_id" field?

TRUNCATE test;
INSERT INTO test SELECT * FROM rental ORDER BY inventory_id ;
SELECT relname, pg_table_size(oid) AS index_size,
  100-(stattuple.pgstatindex(relname)).avg_leaf_density AS bloat_ratio
FROM pg_class
WHERE relname ~ 'test' AND relkind = 'i';
                    relname                    | index_size | bloat_ratio 
-----------------------------------------------+------------+-------------
 test_pkey                                     |     524288 |       36.22
 test_inventory_id_idx                         |     376832 |       10.25
 test_rental_date_inventory_id_customer_id_idx |     647168 |       28.04

OK, it totally makes sense there: index "test_inventory_id_idx" is now 10% bloated. First lesson: do not create your indexes before loading your data! Demo:

DROP TABLE test;
 
CREATE TABLE test (LIKE rental INCLUDING DEFAULTS INCLUDING CONSTRAINTS);
INSERT INTO test SELECT * FROM rental;
ALTER TABLE test ADD CONSTRAINT test_pkey PRIMARY KEY (rental_id); 
CREATE INDEX test_inventory_id_idx ON test (inventory_id);
CREATE INDEX test_rental_date_inventory_id_customer_id_idx ON test (rental_date, inventory_id, customer_id);
 
SELECT relname, pg_table_size(oid) AS index_size,
  100-(stattuple.pgstatindex(relname)).avg_leaf_density AS bloat_ratio
FROM pg_class
WHERE relname ~ 'test' AND relkind = 'i';
                    relname                    | index_size | bloat_ratio 
-----------------------------------------------+------------+-------------
 test_pkey                                     |     376832 |       10.25
 test_inventory_id_idx                         |     376832 |       10.25
 test_rental_date_inventory_id_customer_id_idx |     524288 |       10.73

A better query to estimate index bloat?

Wait, you just showed us you can have the real bloat on indexes, why would I want a loose estimate from a query relying on stats?!

Because I feel bad having monitoring tools read whole indexes again and again every few minutes. A loose estimate is good enough in some cases.

And anyway, I want to dig into this for education :-)

So, I tried to write a query able to give a better estimate of bloat ONLY for indexes. I picked some parts of the original bloat query, but rewrote mostly everything.

Here is the result:

-- change to the max number of fields per index if not default.
\set index_max_keys 32
-- (readonly) IndexTupleData size
\set index_tuple_hdr 2
-- (readonly) ItemIdData size
\set item_pointer 4
-- (readonly) IndexAttributeBitMapData size
\set index_attribute_bm (:index_max_keys + 8 - 1) / 8
 
SELECT current_database(), nspname, c.relname AS table_name, index_name, bs*(sub.relpages)::bigint AS totalbytes,
  CASE WHEN sub.relpages <= otta THEN 0 ELSE bs*(sub.relpages-otta)::bigint END                                    AS wastedbytes,
  CASE WHEN sub.relpages <= otta THEN 0 ELSE bs*(sub.relpages-otta)::bigint * 100 / (bs*(sub.relpages)::bigint) END AS realbloat
FROM (
  SELECT bs, nspname, table_oid, index_name, relpages, coalesce(
    ceil((reltuples*(:item_pointer+nulldatahdrwidth))/(bs-pagehdr::float)) +
      CASE WHEN am.amname IN ('hash','btree') THEN 1 ELSE 0 END , 0 -- btree and hash have a metadata reserved block
    ) AS otta
  FROM (
    SELECT maxalign, bs, nspname, relname AS index_name, reltuples, relpages, relam, table_oid,
      ( index_tuple_hdr_bm +
          maxalign - CASE /* Add padding to the index tuple header to align on MAXALIGN */
            WHEN index_tuple_hdr_bm%maxalign = 0 THEN maxalign
            ELSE index_tuple_hdr_bm%maxalign
          END
        + nulldatawidth + maxalign - CASE /* Add padding to the data to align on MAXALIGN */
            WHEN nulldatawidth::integer%maxalign = 0 THEN maxalign
            ELSE nulldatawidth::integer%maxalign
          END
      )::numeric AS nulldatahdrwidth, pagehdr
    FROM (
      SELECT
        i.nspname, i.relname, i.reltuples, i.relpages, i.relam, s.starelid, a.attrelid AS table_oid,
        current_setting('block_size')::numeric AS bs,
        /* MAXALIGN: 4 on 32bits, 8 on 64bits (and mingw32 ?) */
        CASE
          WHEN version() ~ 'mingw32' OR version() ~ '64-bit' THEN 8
          ELSE 4
        END AS maxalign,
        /* per page header, fixed size: 20 for 7.X, 24 for others */
        CASE WHEN substring(current_setting('server_version') FROM '#"[0-9]+#"%' FOR '#')::integer > 7
          THEN 24
          ELSE 20
        END AS pagehdr,
        /* per tuple header: add index_attribute_bm if some cols are null-able */
        CASE WHEN max(coalesce(s.stanullfrac,0)) = 0
          THEN :index_tuple_hdr
          ELSE :index_tuple_hdr + :index_attribute_bm
        END AS index_tuple_hdr_bm,
        /* data len: subtract the space saved by NULL values, using the null fraction from the stats */
        sum( (1-coalesce(s.stanullfrac, 0)) * coalesce(s.stawidth, 2048) ) AS nulldatawidth
      FROM pg_attribute AS a
        JOIN pg_statistic AS s ON s.starelid=a.attrelid AND s.staattnum = a.attnum
        JOIN (
          SELECT nspname, relname, reltuples, relpages, indrelid, relam, regexp_split_to_table(indkey::text, ' ')::smallint AS attnum
          FROM pg_index
            JOIN pg_class ON pg_class.oid=pg_index.indexrelid
            JOIN pg_namespace ON pg_namespace.oid = pg_class.relnamespace
        ) AS i ON i.indrelid = a.attrelid AND a.attnum = i.attnum
      WHERE a.attnum > 0
      GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9
    ) AS s1
  ) AS s2
    LEFT JOIN pg_am am ON s2.relam = am.oid
) AS sub
JOIN pg_class c ON c.oid=sub.table_oid
ORDER BY 2,3,4;

How does it perform compared to the actual values?

Let's go back to a bloated situation:

TRUNCATE test ;
INSERT INTO test SELECT * FROM rental;
SELECT relname, pg_table_size(oid) AS index_size,
  100-(stattuple.pgstatindex(relname)).avg_leaf_density AS bloat_ratio
FROM pg_class
WHERE relname ~ 'test' AND relkind = 'i'
ORDER BY 1;
ANALYZE test;
                    relname                    | index_size | bloat_ratio 
-----------------------------------------------+------------+-------------
 test_inventory_id_idx                         |     507904 |       34.11
 test_pkey                                     |     376832 |       10.25
 test_rental_date_inventory_id_customer_id_idx |     630784 |       26.14

I created the file "~/tmp/bloat_index.sql" with this index bloat estimation query, filtering on table "test"; here is its output:

postgres@pagila=# \i ~/tmp/bloat_index.sql 
 current_database | nspname | table_name |                  index_name                   | totalbytes | wastedbytes |      realbloat      
------------------+---------+------------+-----------------------------------------------+------------+-------------+---------------------
 pagila           | public  | test       | test_inventory_id_idx                         |     507904 |      172032 | 33.8709677419354839
 pagila           | public  | test       | test_pkey                                     |     376832 |       40960 | 10.8695652173913043
 pagila           | public  | test       | test_rental_date_inventory_id_customer_id_idx |     630784 |      172032 | 27.2727272727272727

Well, pretty close :-)

Let's REINDEX:

REINDEX TABLE test;
ANALYZE test;
SELECT relname, pg_table_size(oid) AS index_size,
  100-(stattuple.pgstatindex(relname)).avg_leaf_density AS bloat_ratio
FROM pg_class
WHERE relname ~ 'test' AND relkind = 'i'
ORDER BY 1;
                    relname                    | index_size | bloat_ratio 
-----------------------------------------------+------------+-------------
 test_inventory_id_idx                         |     376832 |       10.25
 test_pkey                                     |     376832 |       10.25
 test_rental_date_inventory_id_customer_id_idx |     524288 |       10.73

postgres@pagila=# \i ~/tmp/bloat_index.sql 
 current_database | nspname | table_name |                  index_name                   | totalbytes | wastedbytes |      realbloat      
------------------+---------+------------+-----------------------------------------------+------------+-------------+---------------------
 pagila           | public  | test       | test_inventory_id_idx                         |     376832 |       40960 | 10.8695652173913043
 pagila           | public  | test       | test_pkey                                     |     376832 |       40960 | 10.8695652173913043
 pagila           | public  | test       | test_rental_date_inventory_id_customer_id_idx |     524288 |       65536 | 12.5000000000000000

Not bad.

Let's create some bloat. The table has approx. 16,000 rows, so the following query updates ~10% of them:

UPDATE test SET rental_date = current_timestamp WHERE rental_id < 3200 AND rental_id%2 = 0;
ANALYZE test ;
postgres@pagila=# SELECT relname, pg_table_size(oid) as index_size,
  100-(stattuple.pgstatindex(relname)).avg_leaf_density AS bloat_ratio
FROM pg_class
WHERE relname ~ 'test' AND relkind = 'i'
ORDER BY 1
;
                    relname                    | index_size | bloat_ratio 
-----------------------------------------------+------------+-------------
 test_inventory_id_idx                         |     475136 |       22.42
 test_pkey                                     |     450560 |       18.04
 test_rental_date_inventory_id_customer_id_idx |     598016 |       14.26
(3 rows)

postgres@pagila=# \i ~/tmp/bloat_index.sql 
 current_database | nspname | table_name |                  index_name                   | totalbytes | wastedbytes |      realbloat      
------------------+---------+------------+-----------------------------------------------+------------+-------------+---------------------
 pagila           | public  | test       | test_inventory_id_idx                         |     475136 |      139264 | 29.3103448275862069
 pagila           | public  | test       | test_pkey                                     |     450560 |      114688 | 25.4545454545454545
 pagila           | public  | test       | test_rental_date_inventory_id_customer_id_idx |     598016 |      139264 | 23.2876712328767123
(3 rows)

Erk, not so good. Moar bloat?

UPDATE test SET rental_date = current_timestamp WHERE rental_id%2 = 0;
ANALYZE test ;
SELECT relname, pg_table_size(oid) AS index_size,
  100-(stattuple.pgstatindex(relname)).avg_leaf_density AS bloat_ratio
FROM pg_class
WHERE relname ~ 'test' AND relkind = 'i'
ORDER BY 1;
                    relname                    | index_size | bloat_ratio 
-----------------------------------------------+------------+-------------
 test_inventory_id_idx                         |     745472 |       55.48
 test_pkey                                     |     737280 |       54.98
 test_rental_date_inventory_id_customer_id_idx |     925696 |       49.96
(3 rows)

postgres@pagila=# \i ~/tmp/bloat_index.sql 
 current_database | nspname | table_name |                  index_name                   | totalbytes | wastedbytes |      realbloat      
------------------+---------+------------+-----------------------------------------------+------------+-------------+---------------------
 pagila           | public  | test       | test_inventory_id_idx                         |     745472 |      409600 | 54.9450549450549451
 pagila           | public  | test       | test_pkey                                     |     737280 |      401408 | 54.4444444444444444
 pagila           | public  | test       | test_rental_date_inventory_id_customer_id_idx |     925696 |      466944 | 50.4424778761061947
(3 rows)

Well, better again.

Conclusion

This new query performs much better for indexes than the well-known one we have all used for a long time. But it isn't perfect, sometimes showing quite wrong results.

First issue: this query doesn't make any difference between branch nodes and leaves in indexes. Row values and references are kept in the leaves, and only a fraction of them appear in the branches. I should do some more tests with larger indexes to see how it behaves with a lot of branch nodes. As the number of branch nodes is supposed to be logarithmic in the total number of rows, maybe we could include some more guessing in the optimal index size computation.
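As a quick illustration, pgstatindex() already exposes the split between internal (branch) pages and leaf pages; a small sketch using the extension installed earlier in schema "stattuple":

SELECT relname,
       (stattuple.pgstatindex(relname)).internal_pages AS internal_pages,
       (stattuple.pgstatindex(relname)).leaf_pages     AS leaf_pages
FROM pg_class
WHERE relname ~ 'test' AND relkind = 'i';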

Second issue...keep in mind it is based on statistics.
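So the estimate is only as good as the statistics it reads. A trivial sketch to check when a table's stats were last refreshed (here for "test"):

SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'test';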

Anyway, as useful as this query can be, at least it has been fun to dive into index and page guts.

As a side note, I was surprised to see that padding to MAXALIGN is applied to both the tuple header AND the tuple data. My understanding is that for each row in an index leaf page, we have:

  • the item_pointer at the beginning of the page,
  • the tuple header + data at the end of the page that must be aligned on MAXALIGN.

Why aren't they padded together to MAXALIGN? Wouldn't this save some space?
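To make the padding concrete, here is how the query above models the per-entry cost for "test_inventory_id_idx" (one NOT NULL int4 column, 64-bit build, so MAXALIGN is 8); this is just my reading of the estimate, not an authoritative page layout:

SELECT 4   -- ItemIdData (line pointer)
     + 8   -- index tuple header: 2 bytes, padded to MAXALIGN
     + 8   -- data: one 4-byte integer, padded to MAXALIGN
     AS modelled_bytes_per_entry;  -- 20 bytes per index entry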

Thank you for reading to the end! Cheers, and maybe see you in another post about table bloat next time ;-)

Monday 6 August 2012

Normalizing queries with pg_stat_statements < 9.2

Hey,

If you follow PostgreSQL's development or Depesz' blog, you might know that the "pg_stat_statements" extension is getting a lot of improvements in 9.2, and is especially able to «lump "similar" queries together». I will not re-phrase here what Depesz already explained on his blog.

So, we have this great feature in 9.2, but what about previous releases? Up to 9.1, "pg_stat_statements" keeps track of the most frequent queries individually. No normalization, nothing. I have been thinking for a while about porting the pgFouine/pgBadger normalization code to SQL. The following pieces of code are tested under PostgreSQL 9.1 but should be easy to port to previous versions. So here is the function to create (I tried my best to keep it readable :-)):

CREATE OR REPLACE FUNCTION normalize_query(IN TEXT, OUT TEXT) AS $body$
  SELECT
    regexp_replace(regexp_replace(regexp_replace(regexp_replace(
    regexp_replace(regexp_replace(regexp_replace(regexp_replace(

    lower($1),
    
    -- Replace runs of spaces, newlines and tabs with a single space
    '\s+',                          ' ',           'g'   ),

    -- Remove string content                       
    $$\\'$$,                        '',            'g'   ),
    $$'[^']*'$$,                    $$''$$,        'g'   ),
    $$''('')+$$,                    $$''$$,        'g'   ),

    -- Remove NULL parameters                      
    '=\s*NULL',                     '=0',          'g'   ),

    -- Remove numbers                              
    '([^a-z_$-])-?([0-9]+)',        '\1'||'0',     'g'   ),

    -- Remove hexadecimal numbers                  
    '([^a-z_$-])0x[0-9a-f]{1,10}',  '\1'||'0x',    'g'   ),

    -- Remove IN values                            
    'in\s*\([''0x,\s]*\)',          'in (...)',    'g'   )
  ;
$body$
LANGUAGE SQL;

Keep in mind that I extracted these regular expressions straight from pgFouine/pgBadger. Any comment about how to make it quicker/better/simpler/whatever is appreciated!
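As a quick illustration, here is the kind of result the function gives on a hand-written query (the output shown in the comment is what I expect from the regular expressions above):

SELECT normalize_query($$SELECT * FROM foo WHERE id = 123 AND name = 'bob'$$);
-- should give something like:
--  select * from foo where id = 0 and name = ''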

Here is the associated view to group everything according to the normalized queries:

CREATE OR REPLACE VIEW pg_stat_statements_normalized AS
SELECT userid, dbid, normalize_query(query) AS query, sum(calls) AS calls,
  sum(total_time) AS total_time, sum(rows) AS rows,
  sum(shared_blks_hit) AS shared_blks_hit,
  sum(shared_blks_read) AS shared_blks_read,
  sum(shared_blks_written) AS shared_blks_written,
  sum(local_blks_hit) AS local_blks_hit,
  sum(local_blks_read) AS local_blks_read,
  sum(local_blks_written) AS local_blks_written, 
  sum(temp_blks_read) AS temp_blks_read,
  sum(temp_blks_written) AS temp_blks_written
FROM pg_stat_statements
GROUP BY 1,2,3;

Using this function and the view, a small pgbench test (-t 30 -c 10) gives:

SELECT round(total_time::numeric/calls, 2) AS avg_time, calls, 
  round(total_time::numeric, 2) AS total_time, rows, query 
FROM pg_stat_statements_normalized 
ORDER BY 1 DESC, 2 DESC;
 avg_time | calls | total_time | rows |                                               query                                               
----------+-------+------------+------+---------------------------------------------------------------------------------------------------
     0.05 |   187 |       9.86 |  187 | update pgbench_accounts set abalance = abalance + 0 where aid = 0;
     0.01 |   195 |       2.30 |  195 | update pgbench_branches set bbalance = bbalance + 0 where bid = 0;
     0.00 |   300 |       0.00 |    0 | begin;
     0.00 |   300 |       0.00 |    0 | end;
     0.00 |   196 |       0.00 |  196 | insert into pgbench_history (tid, bid, aid, delta, mtime) values (0, 0, 0, 0, current_timestamp);
     0.00 |   193 |       0.00 |  193 | select abalance from pgbench_accounts where aid = 0;
     0.00 |   183 |       0.26 |  183 | update pgbench_tellers set tbalance = tbalance + 0 where tid = 0;
     0.00 |     1 |       0.00 |    0 | truncate pgbench_history


For comparison, the real, non-normalized "pg_stat_statements" view holds 959 rows:

SELECT count(*) FROM pg_stat_statements;
 count 
-------
   959
(1 row)

Obviously, regular expressions are not magic and this will never be as accurate as the engine itself. But at least it helps while waiting for 9.2 in production!

Do not hesitate to report bugs and comments to help improve it!

Cheers,

Using pgBadger and logsaw for scheduled reports

Hey,

While waiting for the next version of pgBadger, here is a tip to create scheduled pgBadger reports. For this demo, I'll assume:

  • we have PostgreSQL's log files in "/var/log/pgsql"
  • we want to produce a weekly report using pgBadger with the "postgres" system user...
  • ...so we keep at least 8 days of log

You will need pgbadger and logsaw. Both tools are under the BSD license and are pure Perl scripts with no dependencies.

"logsaw" is a tool aimed to parse log files, looking for some regexp-matching lines, printing them to standard output and remembering where it stop last time. At start, it searches for the last line it parsed on the previous call, and starts working from there. Yes, for those familiar with "tail_n_mail", it does the same thing, but without the mail and report processing part. Moreover (not sure about "tail_n_mail"), it supports rotation and compression of log files. Thanks to this tool, we'll be able to create new reports from where the last one finished !

We need to create a simple configuration file for logsaw:

$ cat <<EOF > ~postgres/.logsaw
LOGDIR=/var/log/pgsql/
EOF

There are three more optional parameters in this configuration file you might want to know about (a combined example follows the list):

  • if you want to process only some particular files, you can use the "LOGFILES" parameter, a regular expression to filter files in the "LOGDIR" folder. When not defined, the default empty string means: «take all files in the folder».
  • if your log files are compressed, you can use the "PAGER" parameter, which should be a command that uncompresses your log files to standard output, e.g. "PAGER=zcat -f". Note that, if available, "logsaw" will silently use "IO::Zlib" to read your compressed files.
  • you can filter the extracted lines using the "REGEX" parameters. You can add as many "REGEX" parameters as needed, each of them being a regular expression. When not defined, the default empty "REGEX" array means: «match all the lines».
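For instance, a hypothetical "~/.logsaw" combining these parameters (keeping only error-ish lines from compressed files) could look like this; the exact values are just an illustration:

LOGDIR=/var/log/pgsql/
LOGFILES=^postgresql-.*\.log
PAGER=zcat -f
REGEX=ERROR:
REGEX=FATAL: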

Each time you call "logsaw", it saves its state to your configuration file, adding the "FILEID" and "OFFSET" parameters.

That's it for "logsaw". See its README files for more details, options and sample configuration file.

Now, the command line to create the report:

$ logsaw | pgbadger -g -o /path/to/report-$(date +%Y%m%d).html -

You might want to add the "-f" option to pgbadger if it doesn't guess the log format itself (stderr or syslog or csv).

To create reports on a weekly basis, let's say every Sunday at 2:10am, using crontab:

  10 2 * * 0  logsaw | pgbadger -g -o /path/to/report-$(date +\%Y\%m\%d).html -

Here you go, enjoy :)

That's why I like the UNIX style and spirit: simple, single-task, but powerful and complementary tools.

Wait for, or help with, more nice features in pgBadger!

Cheers !

PS: There's another tool to deal with log files and reports you might be interested in: "logwatch"

Thursday 16 June 2011

phpPgAdmin 5.0.2 in Debian unstable !

Today, Christoph Berg has uploaded a new version of the phpPgAdmin package to the Debian unstable branch. This package is based on the latest version of phpPgAdmin: 5.0.2.

This story starts around another beer during PGCon 2011. I was talking with Dimitri Fontaine about the status of the PostgreSQL Debian repository driven by our community. I explained to him how bad the situation was with phpPgAdmin's package, which was stuck at version 4.2.3 for all Debian-like distributions. Because of that, we received a bunch of bug reports, comments and mails about things we had already fixed in newer versions.

As Dimitri works with Christoph on the PostgreSQL Debian repo, he gave me his email address, telling me that, as an official Debian developer, he could probably help me. After some discussion about me becoming the official package maintainer, it seems Christoph finally did the package pretty quickly himself and added himself as co-maintainer of the project.

In conclusion, I'm glad to see a new package maintainer for our PPA package and I'll do my best to help him with it. Thank you to Christoph and Dimitri!

Friday 20 May 2011

2nd chance for my PGCon 2011 Lightning Talk

I'm proud to announce I'm the only one that couldn't finish his Lightning talk on time this year.

Someone had to do it to justify Magnus' job of cutting speakers off after 5 minutes, so I picked the job :)

Here is your second chance to watch my lightning talk slides online.

Too bad I didn't have time to explain how I wanted to kill pgFouine and how a "Replay" script on top of pgShark might be awesome for benchmarking and testing purposes...

Hopefully I'll be able to make a longer talk at pgconf.eu !

Gleu: yeah I know, you warned me :-)

Tuesday 26 April 2011

GSoC project "New phpPgAdmin Plugin Architecture" accepted

Leonardo's project has been accepted for the GSoC 2011 today !

Knowing Leonardo a bit, I'm sure he will do everything possible to finish this project, or go as far as he can. One thing to keep in mind about Leonardo, which I learned a bit late last year: as a Brazilian student, he was actually still studying at school while working on GSoC 2010!

Moreover, Leonardo kept sending small comments and contributions outside of the GSoC during the year. This is, in my opinion, the best thing about the GSoC: making students meet FOSS communities, get used to them and to team work, and turning them into real new contributors.

...but this year, he will have two mentors to convince! Both Andreas Scherbaum and I will mentor Leonardo on this project, helping him and discussing the code on ppa-dev and irc://irc.freenode.net/#phppgadmin.

These will be Andreas Scherbaum's first contributions to PPA, and I'm really looking forward to having some fresh eyes and manpower on the project!

Tuesday 29 March 2011

Let's play with PostgreSQL network dumps!

A bit more than a year ago, I started messing with tshark's PDML output to extract queries from a network dump. As I was writing a proof of concept, I did it in PHP.

As it was relying on tshark, I called it «pgShark». pgShark worked and was useful a couple of times, but had some drawbacks:

  • really, really slow
  • PHP is not the best language to parse and mess with data
  • one useless step: pcap -> XML/PDML -> parsing
  • I have no fun coding in PHP
  • it supported only a limited part of the PostgreSQL protocol

So I decided to rewrite it from scratch using Perl and the Net::Pcap module.

Why Perl? First, probably the most frequent reasons people use Perl: it is portable and fast at parsing data. Then, I discovered the useful Net::Pcap module, bound directly to libpcap. And finally, I wanted to learn a bit more about Perl than just quick scripting, and I needed a challenging project :)

At first I wanted a very simple program, dealing with frontend messages only and using a plugin architecture to process them and output various useful information. But then I realized it could go way further: crazy ideas about plugins popped up in my mind, and I quickly faced the fact that I would have to support the whole PostgreSQL protocol, not just a small specific part.

The project started in early February 2011 and ate all my open-source-related personal and professional time until now. I finally managed to add the last missing PostgreSQL messages on Monday, March 28th.

So I guess the pgShark::Core module is now stable enough to focus on plugins. I'll have to split this project into two pieces at some point and release this core on its own.

One of the most exciting plugins is "Fouine". Yeah, if you know pgFouine, you understand this plugin's purpose (and if you don't, here is the pgfouine project). Having Fouine as a plugin of pgShark has many advantages. One of the most important is the ability to extract many more statistics from the network. Think about errors, notices, session time, busy ratio, amount of data per session, connection statistics, you name it... Another really important point is that you don't need to tweak your PostgreSQL logging behaviour. No need to log every query, no need to change the log_line_prefix parameter or to use syslog if you don't want to. You just leave your PostgreSQL configuration the way you love it. Even better: working on network dumps means you can sniff anywhere between the frontend and the backend, with 0% performance penalty on your PostgreSQL box.

Check the output example page from pgshark's wiki: https://github.com/dalibo/pgshark/wiki

The plugins are all at the alpha development stage and need some more work for accuracy or bug fixes, but they already spit out some useful data!

Here is my TODO list for plugins:

  • fix the SQL plugin with regard to named portals (prepared statements with bound data)
  • work on the TODO list for the Fouine plugin (includes accuracy of statistics)
  • make graphical reports (HTML) from the Fouine plugin
  • graph various stats from the Fouine plugin

I have another really useful plugin idea, but a really challenging one. Hopefully I'll be able to start it and blog about it soon! However, I suspect I'll have to spend some of my time on phpPgAdmin very soon, as I have been pretty much idle on it for a while now and stuff is stacking up on its TODO list as well.

So here is the project's home: https://github.com/dalibo/pgshark

New phpPgAdmin website

Some days or weeks ago now, I released a new website for phpPgAdmin.

The goals were to refresh its old-fashioned style a bit, but most importantly to move it to a wiki. And here it is!

I hope this will help us keep the FAQ up-to-date, ease website updates and let us centralize all project information and data in one place (and outside of CVS!).

Thanks to the postgresql.fr team for their work hosting the beta version until sf.net finally updated their hosting services.

Monday 28 March 2011

GSoC project ideas for phpPgAdmin

GSoC season has started!

PostgreSQL has been accepted again this year as a project organization. Big thanks to Selena and the others for handling this for the community! You can find the PostgreSQL GSoC 2011 page here. As far as I know, student project submissions start today.

phpPgAdmin is a sub-project of the PostgreSQL organization and has already had 3 or 4 projects accepted in past GSoC programs.

I mentored last year's (only) project for phpPgAdmin and enjoyed it. Leonardo Augusto Sápiras was the student who worked on this project, and he did a good job. He worked hard, learned how to use the community's tools and how to communicate with us, discussed issues and code, and managed to finish his project on time. All his work was committed during, or soon after, the end of the GSoC... everything while still being at school as a Brazilian student!

The good news is that he's motivated again this year and is currently writing a proposal to add a long-wanted feature: a proper plugin architecture. It might be a good subject for another blog post later. Hopefully his proposal will be accepted.

So, here are some more ideas for PPA :

  • support PostgreSQL 9.1
  • switch PPA to UTF-8 only
  • graphical explain: kind of a merge between pgAdmin's one and the excellent http://explain.depesz.com/. But to be honest, as I told Leonardo, I would like to keep this one for myself
  • add support for multi-edit/delete/add data from tables
  • drop adodb while keeping compatibility. A lot of advantages here: one dependency removed, the ability to use some PostgreSQL-only PHP functions, a lighter low-level db access layer and, last but not least, the ability to keep a full history of PPA queries, download it, etc.
  • support for showing/editing database/user/database+user/function specific configuration parameters
  • clean up / improve the xhtml code for better themes and accessibility, and add some more themes

Plus, check our TODO file for a bunch of pending TODOs or our feature request list on sf.net for more ideas !

Saturday 11 December 2010

Problems and workarounds with recreating implicit casts when upgrading to 8.3+

I started reading Gregory Smith's book «PostgreSQL 9.0 High Performance» some days ago. Early on, he talks about this very popular issue when upgrading to version 8.3 and above: the removal in 8.3 of some implicit casts from text types to time or numeric ones. Obviously, the only clean solution here is to fix the application itself.

However, for those who cannot quickly afford such work, he also points to a popular quick workaround that suffers from a side effect. It reminded me that I found another quick-n-dirty fix about a year ago, which I spoke a bit about on our favorite chan. Again, keep in mind that the only clean way to go is to fix your application if you hit this problem!

So here is the problem:

casts=# CREATE TABLE test AS SELECT generate_series(1,10) AS id;
SELECT
casts=# SELECT id, 'value = ' || id FROM test WHERE id = '5'::text;
ERROR:  operator does not exist: integer = text
LINE 1: SELECT id, 'value = ' || id FROM test WHERE id = '5'::text;
                                                        ^
HINT:  No operator matches the given name and argument type(s). You might need to add explicit type casts.

The very well-known solution is to recreate some of these implicit casts that were removed in 8.3. Peter Eisentraut blogged about that; you'll find his SQL script here.

However, as some users noticed in the comments, there is a side-effect bug with this solution: it can break the concatenation operator:

casts=# BEGIN ;
BEGIN
casts=# \i /tmp/implicit_casts.sql 
CREATE FUNCTION
CREATE CAST
-- [...]
CREATE FUNCTION
CREATE CAST
casts=# SELECT id, 'value = ' || id FROM test WHERE id = '5'::text;
ERROR:  operator is not unique: unknown || integer
LINE 1: SELECT id, 'value = ' || id FROM test WHERE id = '5'::text;
                                ^
HINT:  Could not choose a best candidate operator. You might need to add explicit type casts.
casts=# ROLLBACK ;
ROLLBACK

Oops.

From here, the solution could be to cast one of the operands:

casts=# SELECT id, 'value = ' || id::text FROM test WHERE id = '5'::text;
  5 | value = 5

But then, we are back to fixing the application, where it might be worth spending more time doing things the right way.

There is another solution: creating the missing operators instead of the implicit casts. Here is an SQL file with a lot of those operators: operators_workaround.sql, and a sample for text vs. integer:

CREATE FUNCTION pg_catalog.texteqint(text, integer) RETURNS BOOLEAN STRICT IMMUTABLE LANGUAGE SQL AS $$SELECT textin(int4out($2)) = $1;$$;
CREATE FUNCTION pg_catalog.inteqtext(integer, text) RETURNS BOOLEAN STRICT IMMUTABLE LANGUAGE SQL AS $$SELECT textin(int4out($1)) = $2;$$;
CREATE OPERATOR pg_catalog.= ( PROCEDURE = pg_catalog.texteqint, LEFTARG=text, RIGHTARG=integer, COMMUTATOR=OPERATOR(pg_catalog.=));
CREATE OPERATOR pg_catalog.= ( PROCEDURE = pg_catalog.inteqtext, LEFTARG=integer, RIGHTARG=text, COMMUTATOR=OPERATOR(pg_catalog.=));

And here is the same previous test:

casts=# BEGIN ;
BEGIN
casts=# \i /tmp/create_operators.sql 
CREATE FUNCTION
-- [...]
CREATE FUNCTION
CREATE OPERATOR
-- [...]
CREATE OPERATOR
casts=# SELECT id, 'value = ' || id FROM test WHERE id = '5'::text;
  5 | value = 5
 
casts=# -- what, you don't trust me ?
casts=# ROLLBACK ;
ROLLBACK
casts=# SELECT id, 'value = ' || id FROM test WHERE id = '5'::text;
ERROR:  operator does not exist: integer = text
LINE 1: SELECT id, 'value = ' || id FROM test WHERE id = '5'::text;
                                                        ^
HINT:  No operator matches the given name and argument type(s). You might need to add explicit type casts.

Same advice from Peter here: if possible, only create the operators you need to fix your application !

So far, I have had only one positive feedback about this workaround, about a year ago, and I don't consider that enough to claim it is a safe solution. So please, comments, tests and reports are welcome!

Monday 6 December 2010

« install done. »

$ ./configure --prefix=blog.ioguix.net
$ make
# make install