28 March 2014

postgresql

4 comments

Playing with indexes and better bloat estimate

Most of the PostgreSQL DBAs might know about this large bloat estimate query integrated in check_postgres. It is supposed to compute a rough estimate of the bloat for tables and indexes in a database. As the PostgreSQL wiki page says:

This query is for informational purposes only. It provides a loose estimate of table growth activity only, and should not be construed as a 100% accurate portrayal of space consumed by database objects

Lately, a customer asked me about existing tools to help deciding when he is supposed to REINDEX. While writing a detailed answer with examples, I had quite a surprise seeing an estimated index size ratio of 0.7 for a simple table! I realized I never payed much attention to this part of the result…

The problem

Here is a test table, a copy table rental from pagila project:

pagila=# CREATE TABLE test (LIKE rental INCLUDING DEFAULTS INCLUDING CONSTRAINTS INCLUDING INDEXES );
pagila=# INSERT INTO test SELECT * FROM rental;
pagila=# ANALYZE;
pagila=# \d test
                                           Table "public.test"
    Column    |            Type             |                         Modifiers                          
--------------+-----------------------------+------------------------------------------------------------
 rental_id    | integer                     | not null default nextval('rental_rental_id_seq'::regclass)
 rental_date  | timestamp without time zone | not null
 inventory_id | integer                     | not null
 customer_id  | smallint                    | not null
 return_date  | timestamp without time zone | 
 staff_id     | smallint                    | not null
 last_update  | timestamp without time zone | not null default now()
Indexes:
    "test_pkey" PRIMARY KEY, btree (rental_id)
    "test_rental_date_inventory_id_customer_id_idx" UNIQUE, btree (rental_date, inventory_id, customer_id)
    "test_inventory_id_idx" btree (inventory_id)

I copy pasted the bloat query in file ~/tmp/bloat_original.sql, here is its output for this table:

postgres@pagila=# \i ~/tmp/bloat_original.sql
 current_database | schemaname |    tablename     | tbloat | wastedbytes |                        iname                        | ibloat | wastedibytes 
------------------+------------+------------------+--------+-------------+-----------------------------------------------------+--------+--------------
...
 pagila           | public     | test             |    1.2 |      188416 | test_pkey                                           |    0.5 |            0
 pagila           | public     | test             |    1.2 |      188416 | test_rental_date_inventory_id_customer_id_idx       |    0.8 |            0
 pagila           | public     | test             |    1.2 |      188416 | test_inventory_id_idx                               |    0.7 |            0
...

A B-tree index is filled at 90% per default, see storage parameter FILL FACTOR. So, a freshly created B-tree index with no bloat is supposed to be 1.1x larger than it should be with a FILL FACTOR of 100. As I had some time to dive in this, I couldn’t resist to investigate these insane estimated size factors for my test table indexes: 0.5, 0.8 and 0.7. How an index could be smaller than it is supposed to be ?

I hadn’t to dive deap in the query, after a closer look at the query code and comments, we find:

COALESCE(CEIL((c2.reltuples*(datahdr-12))/(bs-20::float)),0) AS iotta -- very rough approximation, assumes all cols

Oh, ok. The query estimates the ideal size of each index considering it references ALL the table fields. It’s quite rare to find a btree index on all fields of a table, and obviously there’s no point having multiple indexes on a table, all of them referencing all fields of the table.

A look at the real bloat

First, let see how the indexes are really bloated:

pagila=# create schema stattuple;
pagila=# create extension pgstattuple with schema stattuple;
pagila=# SELECT relname, pg_table_size(oid) as index_size,
  100-(stattuple.pgstatindex(relname)).avg_leaf_density AS bloat_ratio
FROM pg_class
WHERE relname ~ 'test' AND relkind = 'i';

                    relname                    | index_size | bloat_ratio 
-----------------------------------------------+------------+-------------
 test_pkey                                     |     376832 |       10.25
 test_inventory_id_idx                         |     507904 |       34.11
 test_rental_date_inventory_id_customer_id_idx |     630784 |       26.14

First point, the bloat on indexes is not 10% everywhere, only for the PK. This is because indexes were created BEFORE inserting data. So it looks like data were naturally order on the PK on table “rental” when scanning it sequentially. What if we load data sorting on “inventory_id” field ?

pagila=# TRUNCATE test;
pagila=# INSERT INTO test SELECT * FROM rental ORDER BY inventory_id ;
pagila=# SELECT relname, pg_table_size(oid) as index_size,
  100-(stattuple.pgstatindex(relname)).avg_leaf_density AS bloat_ratio
FROM pg_class
WHERE relname ~ 'test' AND relkind = 'i';

                    relname                    | index_size | bloat_ratio 
-----------------------------------------------+------------+-------------
 test_pkey                                     |     524288 |       36.22
 test_inventory_id_idx                         |     376832 |       10.25
 test_rental_date_inventory_id_customer_id_idx |     647168 |       28.04

Ok, it totally makes sense there: index test_inventory_id_idx is now 10% bloated. First lesson: Do not create your indexes before loading your datas! Demo:

pagila=# DROP TABLE test;
pagila=# CREATE TABLE test (LIKE rental INCLUDING DEFAULTS INCLUDING CONSTRAINTS);
pagila=# INSERT INTO test SELECT * FROM rental;
pagila=# ALTER TABLE test ADD CONSTRAINT test_pkey PRIMARY KEY (rental_id); 
pagila=# CREATE INDEX test_inventory_id_idx ON test (inventory_id);
pagila=# CREATE INDEX test_rental_date_inventory_id_customer_id_idx ON test (rental_date, inventory_id, customer_id);

pagila=# SELECT relname, pg_table_size(oid) as index_size,
  100-(stattuple.pgstatindex(relname)).avg_leaf_density AS bloat_ratio
FROM pg_class
WHERE relname ~ 'test' AND relkind = 'i';

                    relname                    | index_size | bloat_ratio 
-----------------------------------------------+------------+-------------
 test_pkey                                     |     376832 |       10.25
 test_inventory_id_idx                         |     376832 |       10.25
 test_rental_date_inventory_id_customer_id_idx |     524288 |       10.73

A better query to estimate index bloat?

Wait, you just showed us you can have the real bloat on indexes, why would I want a loose estimate from a query relying on stats?!

Because I feel bad reading the whole index again and again from monitoring tools every few minutes. Having a loose estimate is good enough in some cases. And anyway, I want to dig into this for education :-)

So, I tried to write a query able to give a better estimate of bloat ONLY for indexes. I picked some parts of the original bloat query, but rewrote mostly everything. Here is the result:

-- change to the max number of field per index if not default.
\set index_max_keys 32
-- (readonly) IndexTupleData size
\set index_tuple_hdr 2
-- (readonly) ItemIdData size
\set item_pointer 4
-- (readonly) IndexAttributeBitMapData size
\set index_attribute_bm (:index_max_keys + 8 - 1) / 8

SELECT current_database(), nspname, c.relname AS table_name, index_name, bs*(sub.relpages)::bigint AS totalbytes,
  CASE WHEN sub.relpages <= otta THEN 0 ELSE bs*(sub.relpages-otta)::bigint END                                    AS wastedbytes,
  CASE WHEN sub.relpages <= otta THEN 0 ELSE bs*(sub.relpages-otta)::bigint * 100 / (bs*(sub.relpages)::bigint) END AS realbloat
FROM (
  SELECT bs, nspname, table_oid, index_name, relpages, coalesce(
    ceil((reltuples*(:item_pointer+nulldatahdrwidth))/(bs-pagehdr::float)) +
      CASE WHEN am.amname IN ('hash','btree') THEN 1 ELSE 0 END , 0 -- btree and hash have a metadata reserved block
    ) AS otta
  FROM (
    SELECT maxalign, bs, nspname, relname AS index_name, reltuples, relpages, relam, table_oid,
      ( index_tuple_hdr_bm +
          maxalign - CASE /* Add padding to the index tuple header to align on MAXALIGN */
            WHEN index_tuple_hdr_bm%maxalign = 0 THEN maxalign
            ELSE index_tuple_hdr_bm%maxalign
          END
        + nulldatawidth + maxalign - CASE /* Add padding to the data to align on MAXALIGN */
            WHEN nulldatawidth::integer%maxalign = 0 THEN maxalign
            ELSE nulldatawidth::integer%maxalign
          END
      )::numeric AS nulldatahdrwidth, pagehdr
    FROM (
      SELECT
        i.nspname, i.relname, i.reltuples, i.relpages, i.relam, s.starelid, a.attrelid AS table_oid,
        current_setting('block_size')::numeric AS bs,
        /* MAXALIGN: 4 on 32bits, 8 on 64bits (and mingw32 ?) */
        CASE
          WHEN version() ~ 'mingw32' OR version() ~ '64-bit' THEN 8
          ELSE 4
        END AS maxalign,
        /* per page header, fixed size: 20 for 7.X, 24 for others */
        CASE WHEN substring(current_setting('server_version') FROM '#"[0-9]+#"%' FOR '#')::integer > 7
          THEN 24
          ELSE 20
        END AS pagehdr,
        /* per tuple header: add index_attribute_bm if some cols are null-able */
        CASE WHEN max(coalesce(s.stanullfrac,0)) = 0
          THEN :index_tuple_hdr
          ELSE :index_tuple_hdr + :index_attribute_bm
        END AS index_tuple_hdr_bm,
        /* data len: we remove null values save space using it fractionnal part from stats */
        sum( (1-coalesce(s.stanullfrac, 0)) * coalesce(s.stawidth, 2048) ) AS nulldatawidth
      FROM pg_attribute AS a
        JOIN pg_statistic AS s ON s.starelid=a.attrelid AND s.staattnum = a.attnum
        JOIN (
          SELECT nspname, relname, reltuples, relpages, indrelid, relam, regexp_split_to_table(indkey::text, ' ')::smallint AS attnum
          FROM pg_index
            JOIN pg_class ON pg_class.oid=pg_index.indexrelid
            JOIN pg_namespace ON pg_namespace.oid = pg_class.relnamespace
        ) AS i ON i.indrelid = a.attrelid AND a.attnum = i.attnum
      WHERE a.attnum > 0
      GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9
    ) AS s1
  ) AS s2
    LEFT JOIN pg_am am ON s2.relam = am.oid
) as sub
JOIN pg_class c ON c.oid=sub.table_oid
ORDER BY 2,3,4

How does it perform compared to the actual values? Let’s go back to a bloated situation:

pagila=# TRUNCATE test ;
pagila=# INSERT INTO test SELECT * FROM rental;
pagila=# SELECT relname, pg_table_size(oid) as index_size,
  100-(stattuple.pgstatindex(relname)).avg_leaf_density AS bloat_ratio
FROM pg_class
WHERE relname ~ 'test' AND relkind = 'i'
ORDER BY 1;
pagila=# ANALYZE test;

                    relname                    | index_size | bloat_ratio 
-----------------------------------------------+------------+-------------
 test_inventory_id_idx                         |     507904 |       34.11
 test_pkey                                     |     376832 |       10.25
 test_rental_date_inventory_id_customer_id_idx |     630784 |       26.14

I created the file “~/tmp/bloat_index.sql” with this estimated indexes bloat query filtering on table test, here is its result:

pagila=# \i ~/tmp/bloat_index.sql 
 current_database | nspname | table_name |                  index_name                   | totalbytes | wastedbytes |      realbloat      
------------------+---------+------------+-----------------------------------------------+------------+-------------+---------------------
 pagila           | public  | test       | test_inventory_id_idx                         |     507904 |      172032 | 33.8709677419354839
 pagila           | public  | test       | test_pkey                                     |     376832 |       40960 | 10.8695652173913043
 pagila           | public  | test       | test_rental_date_inventory_id_customer_id_idx |     630784 |      172032 | 27.2727272727272727

Well, pretty close :-)

Let’s REINDEX:

pagila=# REINDEX TABLE test;
pagila=# ANALYZE test;
pagila=# SELECT relname, pg_table_size(oid) as index_size,
  100-(stattuple.pgstatindex(relname)).avg_leaf_density AS bloat_ratio
FROM pg_class
WHERE relname ~ 'test' AND relkind = 'i'
ORDER BY 1;

                    relname                    | index_size | bloat_ratio 
-----------------------------------------------+------------+-------------
 test_inventory_id_idx                         |     376832 |       10.25
 test_pkey                                     |     376832 |       10.25
 test_rental_date_inventory_id_customer_id_idx |     524288 |       10.73

pagila=# \i ~/tmp/bloat_index.sql 
 current_database | nspname | table_name |                  index_name                   | totalbytes | wastedbytes |      realbloat      
------------------+---------+------------+-----------------------------------------------+------------+-------------+---------------------
 pagila           | public  | test       | test_inventory_id_idx                         |     376832 |       40960 | 10.8695652173913043
 pagila           | public  | test       | test_pkey                                     |     376832 |       40960 | 10.8695652173913043
 pagila           | public  | test       | test_rental_date_inventory_id_customer_id_idx |     524288 |       65536 | 12.5000000000000000

Not bad.

Let’s create some bloat. The table is approx. 160,000 rows, so the following query updates ~10% of the table:

pagila=# UPDATE test SET rental_date = current_timestamp WHERE rental_id < 3200 AND rental_id%2 = 0;
pagila=# ANALYZE test ;

pagila=# SELECT relname, pg_table_size(oid) as index_size,
  100-(stattuple.pgstatindex(relname)).avg_leaf_density AS bloat_ratio
FROM pg_class
WHERE relname ~ 'test' AND relkind = 'i'
ORDER BY 1;

                    relname                    | index_size | bloat_ratio 
-----------------------------------------------+------------+-------------
 test_inventory_id_idx                         |     475136 |       22.42
 test_pkey                                     |     450560 |       18.04
 test_rental_date_inventory_id_customer_id_idx |     598016 |       14.26
(3 rows)

pagila=# \i ~/tmp/bloat_index.sql 
 current_database | nspname | table_name |                  index_name                   | totalbytes | wastedbytes |      realbloat      
------------------+---------+------------+-----------------------------------------------+------------+-------------+---------------------
 pagila           | public  | test       | test_inventory_id_idx                         |     475136 |      139264 | 29.3103448275862069
 pagila           | public  | test       | test_pkey                                     |     450560 |      114688 | 25.4545454545454545
 pagila           | public  | test       | test_rental_date_inventory_id_customer_id_idx |     598016 |      139264 | 23.2876712328767123
(3 rows)

Erk, not so good. Moar bloat?

pagila=# UPDATE test SET rental_date = current_timestamp WHERE rental_id%2 = 0;
pagila=# ANALYZE test ;
pagila=# SELECT relname, pg_table_size(oid) as index_size,
  100-(stattuple.pgstatindex(relname)).avg_leaf_density AS bloat_ratio
FROM pg_class
WHERE relname ~ 'test' AND relkind = 'i'
ORDER BY 1;

                    relname                    | index_size | bloat_ratio 
-----------------------------------------------+------------+-------------
 test_inventory_id_idx                         |     745472 |       55.48
 test_pkey                                     |     737280 |       54.98
 test_rental_date_inventory_id_customer_id_idx |     925696 |       49.96
(3 rows)

pagila=# \i ~/tmp/bloat_index.sql 
 current_database | nspname | table_name |                  index_name                   | totalbytes | wastedbytes |      realbloat      
------------------+---------+------------+-----------------------------------------------+------------+-------------+---------------------
 pagila           | public  | test       | test_inventory_id_idx                         |     745472 |      409600 | 54.9450549450549451
 pagila           | public  | test       | test_pkey                                     |     737280 |      401408 | 54.4444444444444444
 pagila           | public  | test       | test_rental_date_inventory_id_customer_id_idx |     925696 |      466944 | 50.4424778761061947
(3 rows)

Well, better again.

Conclusion

This new query performs much better for indexes than the usual one everyone knew for a long time. But it isn’t perfect, showing sometimes some quite wrong results.

First issue, this query doesn’t make any difference between branch nodes and leaves in indexes. Rows values and references are kept in leaves, and only a fractional part of them are in branches. I should do some more tests with larger indexes to see how it behaves with a lot of branch nodes. As the number of branch nodes is supposed to be a logarithm of the total number of rows, maybe we could include some more guessing in the optimal index size computing.

Second issue…keep in mind it is based on statistics.

Anyway, as useful as this query can be, at least it has been funny to dive in indexes and pages guts.

As a side note, I had been surprised to see that padding to MAXALIGN was applied on both tuple header AND tuple data. My understanding is that for each row in an index leave page, we have:

the item_pointer at the beginning of the page,
the tuple header + data at the end of the page that must be aligned on MAXALIGN.

Why ain’t they padded together to MAXALIGN? Wouldn’t this save some space?

Thank you for reading till the end! Cheers, see you maybe in another post about table bloat next time ;-)

Comments Leave a comment by email or Disqus

Note: I only publish your name/pseudo, mail subject and content. I will NOT publish your email address.

eggyknap reply

Sat, 29 Mar 2014 05:45:45 +0100

The check_postgres project welcomes patches! :)

Josh Berkus reply

Tue, 01 Apr 2014 01:58:01 +0200

Thanks for posting this! I’ve known for a while that the index bloat calculation was faulty, but haven’t quite known what was wrong with it.

I’m working on a 2nd bloat calculation involving dead row counts – just for tables,though.

Josh Berkus reply

Tue, 01 Apr 2014 02:10:31 +0200

Actually, there’s some bugs in the query version as you’ve written it above:

It’s \set, not \SET
index_tuple_hdr is missing the : in two places.

Thanks!

ioguix reply

Thu, 03 Apr 2014 15:43:33 +0200

Josh,

About index_tuple_hdr, I just realize I used it as a field name AND a psql variable, my bad. See the code bellow this comment in the query:

/* per tuple header: add index_attribute_bm if some cols are null-able */

So if you are referring to index_tuple_hdr at lines 11, 13 and 14, those are referring to the computed field, so no :, it would produce bad stats with indexes on NULLable cols.

I fixed to query to remove this confusion.

Good catch about the \SET! It appears there was some magic in my old blog system transforming it in uppercase. I’ll not investigate further as I moved to jekyll…

Thanks!