“He’s making a table. He’s sorting it twice. SELECT * FROM contacts WHERE behavior = “nice”; SQL Clause is coming town! (buy here)
I mean, except for the fact that sorting something twice is TERRIBLY optimized
So how bad is this? Let’s find out.
Some test data
We are defining a table
santa, where we store peoples names (GDPR, EU Regulation 2016/679 applies!), their behavior (naughty or nice), their age, their location, and their wishlist items.
We are also writing some code to generate data (to evade GDPR, we are using randomly generated test data):
The full data generator is available as santa.py. Note that the data generator there defines more indexes - see below.
In our example we generate one million rows, and assume a general niceness of 0.9 (90% of the children are nice). Also, all of our children have 64 characters long names, a single 64 characters long wish, a random age, and are equidistributed on a perfect sphere.
Our real planet is not a perfect sphere, and also not many people live in the Pacific Ocean. Also, not many children have 64 character names.
Sorting it twice
How do you even sort the data twice? Now, assuming we sort by name, we can run an increasingly deeply nested subquery:
Out of 1 million children, we have around 900k nice children. No indexes can be used to resolve the query.
Let’s order by name, using a subquery:
We can already see that the MySQL 8 optimizer recognizes that this subquery can be merged with the inner query, and does this.
This can be done multiple times, but the optimizer handles this just fine:
We can see
using filesort, so while we ask for the query result to be sorted by name twice, it is actually only sorted once.
No sorting at all
We can improve on this, using a covering index in appropriate order:
Having done this, we now see that we lost the
using filesort altogether:
The query is now annotated
using index, which means that all data we ask for is present in the (covering) index
behavior_name, and is stored in sort order. That means the data is physically stored and read in sort order and no actual sorting has to be done on read - despite us asking for sorting, twice.
Hidden ‘SELECT *’ and Index Condition Pushdown
In the example above, we have been asking for
t.name only, and because the name is part of the index,
using index is shown to indicate use of a covering index. We do not actually go to the table to generate the result set, we are using the index only.
Now, if we were to ask for
t.* in the middle subquery, what will happen?
Code 1003 Note we still see the exact same reconstituted query, but as can be seen in the plan annotastions, the internal handling changes - so the optimizer has not been working on this query at all times, but on some intermediary representation.
The ‘using index condition’ annotation points to Index Condition Pushdown Optimization being used. In our example, this is not good.
Worse than sorting: Selectivity
The column we select on is a column with a cardinality of 2:
behavior can be either
nice. That means, in an equidistribution, around half of the values are
naughty, the other half is
Data from disk is read in pages of 16 KB. If one row in a page matches, the entire page has to be read from disk. In our example, we have a row length of around 200 Byte, so we end up with 75-80 records per page. Half of them will be
nice, so with an average of around 40
nice records per page, we will very likely have to read all pages from disk anyway.
Using the index will not decrease the amount of data read from disk at all. In fact we will have to read the index pages on top of the data pages, so using an index on a low cardinality column has the potential of making the situation slightly worse than even a full table scan.
Generally speaking, defining an index on a low cardinality column is usually not helpful - if there are 10 or fewer values, benchmark carefully and decide, or just don’t define an index.
In our case, the index is not even equidistributed, but biased to 90%
nice. We end up with mostly
nice records, so we can guarantee that all data pages will be read for the SQL
SELECT * FROM santa WHERE behavior = "nice", and the index usage will not be contributing in any positive way.
We could try to improve the query by adding conditions to make it more selective. For example, we could ask for people close to our current position, using an RTREE index such as this:
ALTER defines a spatial index (an RTREE), which can help to speed up coordinate queries.
SET defines a coordinate rectangle around our current position (supposedly 15/15).
We then use the
mbrcovers() function to find all points
loc that are covered by the
@rect. It seems to be somewhat complicated to get MySQL to actually use the index, but I have not been investigating deeply.
If we added an
ORDER BY name here, we would see
using filesort again, because data is retrieved in RTREE order, if the index
loc is used, but we want output in
- The Santa query is inefficient, but likely sorting twice is not the root cause for that.
- The optimizer will be able to merge the multiple sorts and be able to deliver the result with one or no sorting, depending on our index construction.
- The optimizer is not using the reconstituted query shown in the warning to plan the execution, and that is weird.
- Selectivity matters, especially for indices on low cardinality columns.
- Asking for all
nicebehaviors on a
nicecolumn is usually not benefitting from index usage.
- Additional indexable conditions that improve selectivity can help, a lot.
- Asking for all