Iterating on schemas over time is not uncommon. Often requirements emerge only after you have data, and only then is directed action possible. Consequently, working on existing data, restructuring and cleaning it up, is a common task.
In today's example we work with a log table that logged state transitions of things in freeform VARCHAR fields. After some time the log table grew quite sizeable, and the log strings are repeated rather often, contributing considerably to the overall size of the table.
We are starting with this table:
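A minimal sketch of what such a table might look like in MySQL (the column definitions here are assumptions, not the original DDL):

```sql
-- Hypothetical reconstruction of the log table described below.
CREATE TABLE log (
  id INTEGER NOT NULL AUTO_INCREMENT,   -- individual row addressing
  device_id INTEGER NOT NULL,           -- the thing changing state
  change_time DATETIME NOT NULL,        -- when the transition happened
  old_state VARCHAR(64),                -- freeform state string, before
  new_state VARCHAR(64),                -- freeform state string, after
  PRIMARY KEY (id)
);
```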
That is, our log table has an id field to allow individual row addressing, and then logs the state change of a device_id at a certain change_time from an old_state into a new_state. The two state fields are varchar(64) and contain one of some 13 or so different strings.
Maybe they also contain typos, outdated state codes or other stuff that will later need remapping and cleanup, but in today's example we want to concentrate on the encoding.
Some small manual sample data:
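For illustration, a few made-up rows of this shape (the state strings are invented, not the original data):

```sql
-- Made-up sample rows; the state strings are illustrative only.
INSERT INTO log (device_id, change_time, old_state, new_state) VALUES
  (1, NOW(), NULL,       'offline'),
  (1, NOW(), 'offline',  'starting'),
  (1, NOW(), 'starting', 'running'),
  (2, NOW(), NULL,       'offline'),
  (2, NOW(), 'offline',  'maintenance');
```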
Let’s build a list of all possible states from both columns, old_state and new_state. This is easily done. Starting with old_state, we prepend a NULL column, because we want to INSERT ... SELECT ... this later into a map table that maps each string to an id value. We will then encode the strings using exactly this id.
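A sketch of such a query, assuming the table above:

```sql
-- Distinct old_state values, with a NULL id column prepended
-- so the result has the shape of the future map table.
SELECT DISTINCT NULL AS id, old_state AS state FROM log;
```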
Now using a UNION we extend it to new_state as well.
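That might look like this:

```sql
-- Same query, extended to new_state. UNION (as opposed to UNION ALL)
-- also deduplicates across both branches.
SELECT DISTINCT NULL AS id, old_state AS state FROM log
UNION
SELECT DISTINCT NULL AS id, new_state AS state FROM log;
```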
The result strings look good, but something happened to our NULL values. In the original statement they were still good, but in the UNION they get mangled.
The data types of the result table are derived from the values and their types in the first result set of the UNION. And NULL is not a good value to guess anything from.
So let’s be more explicit about the types:
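One way to do that (an assumption about the exact fix used) is to CAST the NULL to the intended integer type:

```sql
-- Casting the NULL forces a usable type for the id column of the result.
SELECT DISTINCT CAST(NULL AS UNSIGNED) AS id, old_state AS state FROM log
UNION
SELECT DISTINCT CAST(NULL AS UNSIGNED) AS id, new_state AS state FROM log;
```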
This now works, and we can feed it into an INSERT ... SELECT ....
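Assuming a map table with an auto-incrementing id, the whole step might be sketched like this:

```sql
-- Hypothetical map table: one row per distinct state string.
CREATE TABLE map (
  id INTEGER NOT NULL AUTO_INCREMENT,
  state VARCHAR(64),
  PRIMARY KEY (id)
);

-- The NULL id values let AUTO_INCREMENT assign the numbers.
INSERT INTO map (id, state)
SELECT DISTINCT CAST(NULL AS UNSIGNED) AS id, old_state AS state FROM log
UNION
SELECT DISTINCT CAST(NULL AS UNSIGNED) AS id, new_state AS state FROM log;
```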
Yay. A nice and autonumbered list of all possible states from the existing data. We need this indexed, and we also need indices on the two source columns.
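Something along these lines, as a sketch:

```sql
-- A unique index on the state string, so lookups by name are fast,
-- and indexes on the two source columns in log.
ALTER TABLE map ADD UNIQUE KEY (state);
ALTER TABLE log ADD KEY (old_state), ADD KEY (new_state);
```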
If we want to encode these two columns with id values from map, we need to add columns for that.
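For example:

```sql
-- New integer columns that will hold the encoded state ids.
ALTER TABLE log
  ADD COLUMN old_state_id INTEGER,
  ADD COLUMN new_state_id INTEGER;
```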
We can now encode by looking up each log.old_state value in map.state and returning the matching map.id value. With this we should be able to construct an UPDATE statement that fills in the log.old_state_id column.
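The lookup itself can be sketched as a correlated subselect (using the NULL-safe comparison operator `<=>` is my assumption, so that NULL state strings also find their map row):

```sql
-- For each log row, find the map.id matching its old_state string.
SELECT log.id,
       log.old_state,
       (SELECT map.id FROM map WHERE map.state <=> log.old_state) AS old_state_id
  FROM log;
```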
Let’s try this out in an UPDATE:
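A sketch, assuming the columns added above:

```sql
-- Fill old_state_id from the map, row by row, via a correlated subselect.
-- <=> is the NULL-safe comparison, so a NULL old_state also finds its map row.
UPDATE log
   SET old_state_id = (SELECT map.id FROM map WHERE map.state <=> log.old_state);
```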
So we update log, setting each log.old_state_id value to whatever the subselect returns as the match for that row. We then do the same for the new_state column, and it works.
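A sketch of that second statement, same pattern:

```sql
-- Same encoding step for new_state.
UPDATE log
   SET new_state_id = (SELECT map.id FROM map WHERE map.state <=> log.new_state);
```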
We now have two redundant columns and the indexes that go with them, and can drop these. Let’s also check the new table structure, and validate the output by joining the id values for resolution against the map.
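Sketched, with the default MySQL index names (a single-column index created with ADD KEY is named after its column):

```sql
-- Drop the VARCHAR columns and the indexes defined on them.
ALTER TABLE log
  DROP KEY old_state,
  DROP KEY new_state,
  DROP COLUMN old_state,
  DROP COLUMN new_state;

-- Check the resulting structure.
SHOW CREATE TABLE log;
```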
Note that we have to join against the map twice, once for each source column, which also means we have to give each use of the map table its own alias to get unique names.
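A validation query of that shape, with two aliases for map:

```sql
-- Resolve both id columns back into strings, using two aliases of map.
SELECT log.id,
       log.device_id,
       log.change_time,
       o.state AS old_state,
       n.state AS new_state
  FROM log
  LEFT JOIN map AS o ON o.id = log.old_state_id
  LEFT JOIN map AS n ON n.id = log.new_state_id;
```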
Well, that’s a toy example. Let’s do that at scale, using a Python driver implementing exactly this procedure. Code is on github.com.
We also define a list of possible states and, in a data generation loop, fill the log table with many state transitions for random devices. Note that we do not really care about matching source and target states, so the transitions are really random jumps. We commit once every statecount rows (once for each device) to make it a bit faster.
On my super slow test machine, filling the table with a million rows from a thousand devices yields a 64M data file after 2 minutes. I have some 59M of data, and a bit of empty space:
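These numbers can be read from information_schema, for example (an assumption about how they were obtained, not the original query):

```sql
-- Data, index and free space for the log table, in bytes.
SELECT data_length, index_length, data_free
  FROM information_schema.tables
 WHERE table_schema = DATABASE()
   AND table_name = 'log';
```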
The file size will go up to 132M in the transformation step, most of that is from the indexing we add.
Finishing up drops the two VARCHAR columns and the indices defined on them, leaving us with the encoded values.
So data length went from 59326464 bytes to 48824320 bytes, a reduction to 82.3% of the original size. We could save even more by encoding not with 4-byte integer values, but with, for example, 1-byte tinyint unsigned values. On the other hand, that may become a problem later on, when we add new states and exceed the id space of the map table.
On top of that, the size of the target log table is now a function of the row count alone, as we have no variable-length columns any more. The size of the source table depends on the variable-length string values as well, so by encoding we also got a more predictable table size.
Using a map table and UPDATE statements with a subselect, we can encode repeated variable-length string values into integer values.
We also get better data quality, as now only legal values from the map table can be encoded.
The transformation requires at least two ALTER TABLE statements (or matching online schema changes), and one full scan for each column to be encoded. A lot of intermediate disk space is required. This needs planning.