We have had a look at how MySQL 8 handles JSON recently, but with all those JSON functions and expressions it is clear that many JSON accesses cannot be fast. To grab data from a JSON column, you will use a lot of $->>field expressions and similar, and without indexes nothing of this will be fast.
JSON cannot be indexed.
But MySQL 8 offers another feature that comes in handy: Generated columns and indexes on those. Let’s look at the parts, step by step, and how to make them work, because they are useful even outside of the context of JSON.
An example table
For the following example we are going to define a table t1 with an integer id and two integer data fields, a and b. We will be filling it with random integers up to 999 for the data values:
A generated column is a column with values that are calculated from a deterministic expression provided in the column definition. It has the usual name and type, and then a GENERATED ALWAYS AS () term. The parentheses are part of the syntax and cannot be left off. The GENERATED ALWAYS is optional, and we are going to leave it off, because we are lazy.
The column can be VIRTUAL, in which case the expression is evaluated when reading every time a value is needed, or STORED, in which case the value is materialized and stored on write.
In may also contain inline index definition and a column comment.
VIRTUAL generated columns
So we get our trivial example:
That was fast - the table definition is changed, but because the column is VIRTUAL, no data values need to be changed. Instead, the data is calculated on read access. We could have written our sample read as SELECT id, a, b, a+b AS c FROM t1 LIMIT 3 for the same effect, because that is what happened.
We may even store that statement in a view and then call it, and that’s effectively the same:
Well, not quite. Let’s explain the same query on t1 and v1 and see what the optimizer has to say:
The output differs slightly in two places: the estimate given for filtered is different, and the view “sees” and exposes the definition for c as a+b in the reparsed statement in the “Note” section.
Further down we will also see how the generated column can be indexed, while the statement expression can not - and that in the end what makes the key difference in performance.
STORED generated columns
Let’s flip from VIRTUAL to STORED and see what happens. We drop the old definition of c, and re-add the same one, but with a STORED attribute.
If we looked at the average row length in INFORMATION_SCHEMA.TABLES, we would see it as a bit longer (but as is usual with I_S.TABLES output for small and narrow tables, the values are a bit off).
We also see the ALTER TABLE now takes actual time, proportional to the table size. What happened is that the values for c now get materialized on write, as if we defined a BEFORE INSERT trigger maintaining the values in c.
Trying to write to a generated column fails (except when it doesn’t)
VIRTUALand STORED don’t matter: you can’t write to generated columns:
With one exception:
So if you aren’t actually writing to c, you are allowed to write to c. That sounds stupid until you define a view on t1 that includes c and is considered updatable - by allowing this construct, it stays updatable, even if it includes c.
Filling in the correct value is not the same as default and does not work:
Caution: CREATE TABLE … AS SELECT vs. generated columns
We already know (I hope) that CREATE TABLE ... AS SELECT is of the devil and should not be used to copy table definitions: It creates a table from the result set of the select statement, which is most definitively not the definition of the original table.
We have seen this fail already with indexes and foreign key definitions, and in case you didn’t, here is what I mean:
broken is most decidedly not the same table as sane. The definition of broken has been inferred from the format of the result set, which may or may not have the same types as the base table(s). It also has no indexes and no constraints.
The correct way to copy a table definition is CREATE TABLE ... LIKE ... and then move the data with INSERT ... SELECT .... You still have to move the foreign key constraints manually, though:
And here is how it works with generated columns:
CREATE TABLE ... AS SELECT ... defined a table from the result set of the select clause, and the fact that c is generated is completely lost. So we now have a normal 4-column table.
So, how about CREATE TABLE ... LIKE ...?
Yes! Success! Ok, now the data:
Awww, yes. Okay, the full monty:
Ok, copying data between tables with generated columns requires a bit more engineering than a mindless INSERT ... SELECT *. The rules are not unexpected, we have explored them right above, still…
The wrong data type
Ok, let’s get a bit mean. What happens when we define c tinyint as (a+b) virtual so that the values exceed the range possible in a signed single bit value?
Oh, they are on to us!?!? Are they?
They are not when we do it in two steps:
It clips the values according to the rules that MySQL always had, and that ate so much data.
Now, let’s CREATE TABLE ... AS SELECT again:
Wow. No less than three error messages. At least they mention the column c and the word “range”, so we kind of can have an idea what goes on. Still, this is only medium helpful and initially confusing.
What happens, and why?
SQL_MODE helpfully detected the problem and prevented data loss. As usual, SQL_MODE was as useless as it was helpful - while it prevented data loss, it did not directly point us into the right direction with its error messages.
By turning off SQL_MODE we get the clipped values copied and a bunch of warnings that everybody ignores all of the time, anyway, so I guess it’s an improvement.
Allowed and disallowed functions
For generated columns to work it is a requirement that the functions are deterministic, idempotent and side-effect free. All user defined functions and stored functions are disallowed, and the usual suspects from the set of builtins are also out:
From the final example above we learn that it is also impossible to change the existing definition of any column that is used by a generated column definition. We need to drop the generated column, change the definition of the base columns and then recreate the generated column.
For VIRTUAL columns that is cheap, for STORED - less so.
Secondary indexes and generated columns
So far, so nice. Now let’s cash in on this: Indexes, we have them. At least secondary indexes:
As expected, adding the index takes time, even if the column c is VIRTUAL: For an index we extract the indexed values from the table, sort them and store them together with pointers to the base row in the (secondary) index tree. In InnoDB, the pointer to the base row always is the primary key, so what we get in the index is actually pairs of (c, id).
We can prove that:
Queries for c can be answered from the index. The index is called covering, it saves us chasing the row pointer and an access to the actual row. In an EXPLAIN we see this being indicated with using index.
Queries for c and id should also be covering: the queried values are all present in the index so that going to the base row is unnecessary.
Querying for c and a is not covering, so the using index should be gone.
As predicted, the final query for c, a cannot be covering and is missing the using index notice in the Extra column. This is still a good query: it is considering and using the index on c - the index alone is just not sufficient to resolve the query.
This should give us an idea about how to design:
In almost all cases STORED columns will not be paying off. They use disk space, and still need to evaluate the expression at least once for storage. If indexed, they will use disk space in the index a second time - the column is actually materialized twice, in the table in primary key order and the index in index order.
STORED generated columns make sense only if the expression is complicated and slow to calculate. But with the set of functions available to us that is hardly going to be the case, ever. So unless the expression is being evaluated really often the cost for the storage is not ever amortized.
Even then, for generated columns STORED and VIRTUAL, many queries can probably be answered leveraging an index on the generated column so that we might try to get away with VIRTUAL columns all of the time.
Generated columns and the Optimizer
The optimizer is aware of the generated column definitions, and can leverage them, as long as they match:
The optimizer is still the MySQL optimizer we all love to hate, so you have to be pretty literal for the match:
Yup, no canonicalization, for reasons.
Making it work with JSON
That’s a long article. Do you still remember how we started?
JSON cannot be indexed.
Well, now it can and you know how.
Yay, ref: const, primary key lookup in the optimizer and we did not even have a query to run.
We have been looking at the two flavors of generated columns, and how they can make our life easier in many ways. We have been looking at various pitfalls with respect to copying data and table definitions around. We have been learning about indexing generated columns, and how the optimizer can leverage indexes even against the expressions defined in generated columns.
Finally we put the parts together and made JSON data lookups fast.
This should give us a number of ideas in terms of sensible table design around JSON. Often we use JSON for variable-ish data while we explore a data model. Then a JSON schema solidifies, and we can leverage values we require and rely on by putting them into generated columns and index these, then use these for search and access.
Eventually we may extract the columns from the variable JSON part of the schema completely and turn them into actually statically typed columns of the SQL schema, because we require them all of the time.
This opens up a pathway to incremental schema design while at the same time being flexible enough to have bag style soft and denormalized data types where we need them.
The Fine Manual
There is a lot more to all of this than I can show here. This means you have homework. Read the following manual pages:
The CREATE INDEX statement and multi valued indexes
The entire page is useful, because it speaks about functional indexes and how they are implemented as hidden virtual columns and indexes on these (which has implications). But within the discussion of JSON, the interesting part are Multi-Valued Indexes, which are indexes on non-scalar values such as JSON arrays, and how they are being used to speed up certain JSON functions that deal with array membership and overlaps.