Not JOINing on PERFORMANCE_SCHEMA

isotopp image Kristian Köhntopp -
December 1, 2020
a featured image

The tables in PERFORMANCE_SCHEMA (P_S) are not actually tables. You should not think of them as tables, even if your SQL works on them. You should not JOIN them, and you should not GROUP or ORDER BY them.

Unlocked memory buffers without indexes

The stuff in P_S has been created with “keep the impact on production small” in mind. That is, from a users point of view, you can think of them as unlocked memory buffers - the values in there change as you look at them, and there are precisely zero stability guarantees.

There are also no indexes (EDIT: There actually are secondary indexes on P_S tables in MySQL 8, see below).

Unstable comparisons

When sorting a table for a GROUP BY or ORDER BY, it may be necessary to compare the value of one row to other rows multiple times in order to determine where the row goes. The value compared to other rows can change while this happens, and will change more often the more load the server has. The end result is unstable. Also, as the table you sort may be larger on a server under load, the row may need more comparisons, making this even more likely to happen.

The table you look at may produce correct results on your stable, underutilized test systems, but the monitoring you base on this will fail on a loaded test system.

Do not use GROUP BY or ORDER BY on P_S tables.

No indexes, meaning slow joins on loaded systems

When JOINing a P_S table against other tables, the join is done without indexes (in MySQL up to and including 5.7). There are no indexes defined in P_S before MySQL 8.

In practice that means your join against the processlist or session variables tables in P_S do little harm in test, but will fail in production environments with many connections. You will be losing monitoring the moment you need it most - under load, in critital situations.

Do not JOIN P_S tables to anything.

How to monitor

About the only type of query you can reliably run on P_S is a single table SELECT * FROM P_S.table, maybe with a simple WHERE clause. That is, you can download and materialize data from a single P_S table at a time, unsorted, unaggregated.

Connection to other tables, aggregation and sorting should be done on tables that are not P_S tables.

There are multiple ways to do this.

Subqueries, without optimization

It used to be that the MySQL optimizer did not resolve simple subqueries properly. So

mysql> select <complicated stuff> from
    ->   ( select * from performance_schema.sometable ) as t
    -> order by <something>

used to work. The subquery t would materialize the P_S table as whatever your version of MYSQL used for implicit temporary tables, and the rest of the query resolution would happen on the materialized temptable. This is a snapshot, and would be stable.

And it still would not add up to 100%, of course. That is, queries like Dennis Kaarsemakers “How loaded is the SQL_THREAD” Replication Load analysis never came out at 100%, because the various values changed while the temporary table would be materialized, so you do not get a consistent snapshot (and by construction, this kind of consistency is impossible in P_S).

Anyway, with older versions of MySQL, this results in the query plan we want. Starting with MySQL 5.7, this does no longer work, because the optimizer became too smart. :-)

mysql> select version();
+-----------+
| version() |
+-----------+
| 8.0.22    |
+-----------+
1 row in set (0.00 sec)

mysql> explain select * from ( select * from processlist ) as t;
+----+-------------+-------------+------------+------+---------------+------+---------+------+------+----------+-------+
| id | select_type | table       | partitions | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra |
+----+-------------+-------------+------------+------+---------------+------+---------+------+------+----------+-------+
|  1 | SIMPLE      | processlist | NULL       | ALL  | NULL          | NULL | NULL    | NULL |  256 |   100.00 | NULL  |
+----+-------------+-------------+------------+------+---------------+------+---------+------+------+----------+-------+
1 row in set, 1 warning (0.00 sec)

Newer MySQL (5.7 and above) will apply the derived_merge optimization and fold the subquery into the outer query, resulting in a rewritten single query that again is executed on P_S directly. This is extremely useful, but in this one particular case precisely not what we want.

You either need to SET SESSION optimizer_switch = "derived_merge=off"; or provide an advanced MySQL 8 optimizer hint to prevent the optimizer from ruining your cunning plan:

mysql> explain select /*+ NO_MERGE(t) */ * from ( select * from processlist ) as t;
+----+-------------+-------------+------------+------+---------------+------+---------+------+------+----------+-------+
| id | select_type | table       | partitions | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra |
+----+-------------+-------------+------------+------+---------------+------+---------+------+------+----------+-------+
|  1 | PRIMARY     | <derived2>  | NULL       | ALL  | NULL          | NULL | NULL    | NULL |  256 |   100.00 | NULL  |
|  2 | DERIVED     | processlist | NULL       | ALL  | NULL          | NULL | NULL    | NULL |  256 |   100.00 | NULL  |
+----+-------------+-------------+------------+------+---------------+------+---------+------+------+----------+-------+

Here we get the DERIVED table as a non-P_S temptable, and then run our “advanced” SQL on that as PRIMARY on it.

In the client

The alternative is, of course, to completely download the tables in question into client side hashes, and then perform the required operations on them on the client side, in memory. The important thing here is to limit the amount of memory spent - do not download unconstrained result sets into your client monitoring program.

Then use a linearly scaling join method to construct the connections between the tables. Effectively, load data into hashes, and then program a client side hash join. This is additive (n + m) instead of quadratic (n * m), so you can survive this.

This is the recommended method.

Who is doing it wrong?

Getting monitoring queries that use P_S wrongly is common - it understands SQL, it handles SHOW CREATE TABLE, so it is treated as a table and exposed to full SQL all the time. And on idle test boxen, it even looks like it works.

At work, see this in our own code (still using a deprecated Diamond collector) and in SolarWinds nee Vividcortex.

SolarWinds kindly highlights itself:

-- Most time consuming query - Coming from solar winds monitoring itself ¯\_(ツ)_/¯
select `ifnull` (`s`.`sql_text` , ?) , `ifnull` (`t`.`processlist_user` , ?) , `ifnull` (`t`.`processlist_host` , ?) 
  from `performance_schema`.`events_statements_history` `s` 
  left join `performance_schema`.`threads` `t` 
    using (`thread_id`) 
  where `s`.`thread_id`=? and `s`.`event_id`=?
 
-- Coming from the "table ownership write identifier".
select count (*) as `cnt` , `digest_text` , `current_schema` , `processlist_user` as system_user 
  from `performance_schema`.`events_statements_history` `esh` 
  inner join `performance_schema`.`threads` `t` 
    on `t`.`thread_id`=`esh`.`thread_id` 
  where `event_name` in (...) 
    and `current_schema` in (...) 
  group by `digest_text` , `current_schema` , `processlist_user`
 
-- Coming from diamond collector
select `t`.`processlist_user` , `sbt`.`variable_value` , count (*)
  from `performance_schema`.`status_by_thread` `sbt` 
  join `performance_schema`.`threads` `t`
    using (`thread_id`) 
  where `sbt`.`variable_name`=? 
    and `t`.`processlist_user` is not null
group by `t`.`processlist_user` , `variable_value`

Many of the above examples fail in multiple ways: Using JOIN for bad scalability (this is how we spotted them), at least before MySQL 8 (and you probably want your monitoring to work anywhere). Some are also using unstable sorting.

We also see ORDER BY statements in the Telegraf MySQL plugin in one place. It uses LIMIT, but if the ORDER BY does not work (ie does not actually sort), you cut off randomly.

Is PERFORMANCE_SCHEMA broken?

Clearly, it is not. Just badly misunderstood.

The alternative is INFORMATION_SCHEMA, which often locks, and that can be actually deadly:

Just select * from INFORMATION_SCHEMA.INNODB_BUFFER_PAGE on a server with a few hundreds of GB of buffer pool, humming at 10k QPS. The query will freeze the server completely for the runtime of the query – which with a large buffer pool size can be substantial.

I’d rather have this in P_S and then deal with the vagaries of the data changing while I read it than lose an important production server. So P_S is much better than I_S, provided that you ask it in the right way.

Addendum

Both Øystein Grøvlen and Mark Leith remind me that in MySQL 8 we geht secondary indexes on P_S - this will address some of the problematic queries above and will need some research on my side.

The sorting is still complicated. Øystein writes :

It depends on the sort algorithm used by MySQL. Many sorts will add all requested columns to the sort buffer, and will not need to fetch anything from the tables after sorting, but it depends on the size of the sort buffer and how much data needs to be fetched.

So there are some sorts that are stable, and some sorts that will always be unstable (those that work with bloblike columns such as SQL_TEXT or DIGEST_TEXT likely), but it is really hard to see this when looking at the query.

I still maintain my advice of forcing stable copies with subselects and NO_MERGE() hints - this is much easier to teach and guaranteed to be stable.

Another addendum

In a tweet Justin Swanhart pointed me at PS_HISTORY , which makes snapshots of P_S that try to be as consistent as possible.

It will allow you to replay and review old P_S states and query them, and it will - if you configure it - also automatically expire the old recordings. It may be useful as a kind of flight recorder to debug performance problems that cannot otherwise reproduced.

Share