One of the new features in MySQL 8.0 is the implementation of Window Functions. They are related to aggregates, but do not actually lump values together.
To better understand what goes on, let’s create some fake data to work with:
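The original generator script is not reproduced here; the following is a minimal sketch of such a generator. The table name data, the column names, and the exact config keys (sensors, values, minvalue, maxvalue, starttime, delta) are assumptions taken from the prose, and sqlite3 stands in for MySQL so the snippet is self-contained and runnable:

```python
# Sketch of a fake-data generator (assumed table name "data"; sqlite3
# stands in for MySQL so the example runs as-is).
import random
import sqlite3
from datetime import datetime, timedelta

config = {
    "sensors": 3,                       # number of fictional sensors
    "values": 10,                       # readings per sensor
    "minvalue": 0.0,                    # smallest possible reading
    "maxvalue": 10.0,                   # largest possible reading
    "starttime": datetime(2020, 1, 1),  # readings start here ...
    "delta": 30,                        # ... and fall within this many days
}

db = sqlite3.connect(":memory:")
db.execute("""create table data (
    id integer primary key autoincrement,
    sensor integer,
    checktime timestamp,
    value float
)""")

for sensor in range(config["sensors"]):
    for _ in range(config["values"]):
        checktime = config["starttime"] + timedelta(
            seconds=random.uniform(0, config["delta"] * 86400))
        db.execute(
            "insert into data (sensor, checktime, value) values (?, ?, ?)",
            (sensor,
             checktime.isoformat(sep=" "),
             random.uniform(config["minvalue"], config["maxvalue"])))
db.commit()
```

With this sample config, that is 3 sensors with 10 readings each, values between 0 and 10, at random points in January 2020.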
This will create test data for a number of fictional sensors. For each sensor, there will be values readings with a random float value between minvalue and maxvalue. Each reading will be taken at a random point in time between starttime and delta days later. In our sample config, that is 3 sensors with 10 readings each, values between 0 and 10, at some random point in time in January 2020.
We get data similar to this:
We could group this data, for example by sensor. The database would then make piles for each sensor value, and we could apply aggregate functions to each pile:
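As a runnable sketch (sqlite3 standing in for MySQL, a few hand-made rows standing in for the random data), grouping by sensor and aggregating each pile looks like this:

```python
# GROUP BY makes one pile per sensor; aggregates apply to each pile.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("create table data (id integer primary key, sensor integer, value float)")
db.executemany("insert into data values (?, ?, ?)",
               [(1, 0, 4.0), (2, 0, 6.0),
                (3, 1, 1.0), (4, 1, 3.0), (5, 1, 5.0)])

rows = db.execute("""
    select sensor, count(*) as readings, avg(value) as average
      from data
     group by sensor
     order by sensor
""").fetchall()
print(rows)  # [(0, 2, 5.0), (1, 3, 3.0)]
```

One result row comes back per pile: the grouping key plus the aggregates computed over that pile.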
Window Functions work similarly, but they do not make piles. They work by defining partitions over the data, and then walking through each partition, resetting the window functions applied at each partition boundary.
As with GROUP BY, when leaving out the partitioning, the entire table is taken as one partition:
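A minimal sketch of this (same stand-in table as above): without PARTITION BY, row_number() simply numbers all rows of the table in the window's order.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("create table data (id integer primary key, sensor integer, value float)")
db.executemany("insert into data values (?, ?, ?)",
               [(1, 0, 4.0), (2, 0, 6.0),
                (3, 1, 1.0), (4, 1, 3.0), (5, 1, 5.0)])

# No PARTITION BY: the whole table is one partition, numbered 1..n.
rows = db.execute("""
    select id, row_number() over (order by id) as rn
      from data
     order by id
""").fetchall()
print(rows)  # [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5)]
```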
The id is stored, the row_number() is generated. Let’s add a partitioning by sensor number and re-run the command:
By defining partition boundaries at sensor value changes with PARTITION BY sensor, the row_number() counter is reset each time the sensor value changes. Within each partition, we can order the values as we wish. Using the same statement as before, but ordering the values by id in reverse, we get
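A sketch of the partitioned, reverse-ordered variant (same stand-in table): the counter restarts whenever the sensor value changes, and within each partition the rows are numbered by id in reverse.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("create table data (id integer primary key, sensor integer, value float)")
db.executemany("insert into data values (?, ?, ?)",
               [(1, 0, 4.0), (2, 0, 6.0),
                (3, 1, 1.0), (4, 1, 3.0), (5, 1, 5.0)])

# The counter resets at each partition boundary; inside a partition
# the numbering follows "order by id desc".
rows = db.execute("""
    select sensor, id,
           row_number() over (partition by sensor order by id desc) as rn
      from data
     order by sensor, id
""").fetchall()
print(rows)  # [(0, 1, 2), (0, 2, 1), (1, 3, 3), (1, 4, 2), (1, 5, 1)]
```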
Note that the OVER () clause is written as part of a field in the select list. Different windows, and hence different partitions and orderings, can be defined concurrently.
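To illustrate (toy data, sqlite3 standing in for MySQL): two independent windows in one statement, one counting forward and one counting backward.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("create table data (id integer primary key, sensor integer, value float)")
db.executemany("insert into data values (?, ?, ?)",
               [(1, 0, 4.0), (2, 0, 6.0), (3, 1, 1.0)])

# Each OVER () clause belongs to its own select expression, so one
# statement can use several different windows at the same time.
rows = db.execute("""
    select id,
           row_number() over (order by id)      as fwd,
           row_number() over (order by id desc) as rev
      from data
     order by id
""").fetchall()
print(rows)  # [(1, 1, 3), (2, 2, 2), (3, 3, 1)]
```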
Most aggregate functions can be used with OVER() clauses, among them not only AVG() and the other statistics functions, but also JSON_ARRAYAGG() and JSON_OBJECTAGG(). On top of that, a number of window-specific functions have been defined, as listed in Window Function Descriptions in the manual. Among these are RANK() and DENSE_RANK(), as well as LAG() and LEAD(), which refer to the value in the row preceding or following the current row.
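The difference between RANK() and DENSE_RANK() shows up on ties; a quick sketch (toy values, sqlite3 standing in for MySQL):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("create table data (id integer primary key, value float)")
db.executemany("insert into data values (?, ?)",
               [(1, 1.0), (2, 2.0), (3, 2.0), (4, 3.0)])

# RANK() leaves a gap after a group of ties; DENSE_RANK() does not.
rows = db.execute("""
    select value,
           rank()       over (order by value) as r,
           dense_rank() over (order by value) as dr
      from data
     order by value, id
""").fetchall()
print(rows)  # [(1.0, 1, 1), (2.0, 2, 2), (2.0, 2, 2), (3.0, 4, 3)]
```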
We have shown above already how to produce global row numbers and row numbers per sensor. We can also quite easily run a topK query, for example “show me the three smallest values per sensor”:
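A sketch of such a topK query (toy data, sqlite3 standing in for MySQL; the exact query from the post is not shown here): rank the rows inside each sensor's partition in an inner query, then filter on that rank in an outer query.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("create table data (id integer primary key, sensor integer, value float)")
db.executemany("insert into data values (?, ?, ?)",
               [(1, 0, 5.0), (2, 0, 1.0), (3, 0, 4.0), (4, 0, 2.0), (5, 0, 3.0),
                (6, 1, 9.0), (7, 1, 7.0), (8, 1, 8.0), (9, 1, 6.0)])

# Inner query: number each sensor's values from smallest to largest.
# Outer query: keep only the three smallest per sensor.
rows = db.execute("""
    select sensor, value, rn
      from (select sensor, value,
                   row_number() over (partition by sensor order by value) as rn
              from data) as ranked
     where rn <= 3
     order by sensor, rn
""").fetchall()
print(rows)
# [(0, 1.0, 1), (0, 2.0, 2), (0, 3.0, 3), (1, 6.0, 1), (1, 7.0, 2), (1, 8.0, 3)]
```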
We need a subquery here, because according to the manual:
Query result rows are determined from the FROM clause, after WHERE, GROUP BY, and HAVING processing, and windowing execution occurs before ORDER BY, LIMIT, and SELECT DISTINCT.
So if we want to filter on the result of a window function like rank, we need to take our result from an inner query and then apply the filter in an outer query.
To answer the second Reddit question, we need to use LAG() and LEAD(). With them we can also process differences between two adjacent rows in the same partition:
We define a partition by sensor, and process values in each partition in temporal order (we generated our fake sensor data in random temporal order within a time window). We calculate the difference between the current value and the lagging (preceding) value as delta.
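A sketch of the delta query (toy rows with fixed checktime strings instead of the random timestamps, sqlite3 standing in for MySQL):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("create table data (id integer primary key, sensor integer,"
           " checktime timestamp, value float)")
db.executemany("insert into data values (?, ?, ?, ?)",
               [(1, 0, "2020-01-03 12:00", 2.0),
                (2, 0, "2020-01-07 09:00", 5.0),
                (3, 0, "2020-01-20 18:00", 1.0),
                (4, 1, "2020-01-05 08:00", 4.0),
                (5, 1, "2020-01-09 14:00", 6.0)])

# lag(value) is the value of the preceding row in the partition; the
# first row of each partition has no preceding row, so delta is NULL.
rows = db.execute("""
    select sensor, checktime, value,
           lag(value) over (partition by sensor order by checktime) - value
               as delta
      from data
     order by sensor, checktime
""").fetchall()
for row in rows:
    print(row)
```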
The way we have written the delta calculation highlights a peculiarity of the syntax: The expression for the preceding value is not LAG(value), it is lag(value) over (partition by sensor order by checktime). We have to put the subtracted - value behind the full window expression, not into the middle of it: lag(value) - value over (partition by sensor order by checktime) as delta yields a syntax error.
Because OVER() expressions can become quite unwieldy (especially if you also define frames - not explained here - in them), they can be named and put into the select statement in a WINDOW clause behind the FROM clause (after any HAVING, before ORDER BY).
This gives us
as a different syntax with the same result. It is possible to define more than one window, as long as they have different window names, and it is possible for a window to be referred to zero or more times.
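Under the same stand-ins (toy data, sqlite3 in place of MySQL; SQLite also supports the WINDOW clause), a named window referred to twice looks like this:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("create table data (id integer primary key, sensor integer,"
           " checktime timestamp, value float)")
db.executemany("insert into data values (?, ?, ?, ?)",
               [(1, 0, "2020-01-03 12:00", 2.0),
                (2, 0, "2020-01-07 09:00", 5.0),
                (3, 1, "2020-01-05 08:00", 4.0),
                (4, 1, "2020-01-09 14:00", 6.0)])

# The window is defined once in the WINDOW clause (after FROM, before
# ORDER BY) and referred to by name in two select expressions.
rows = db.execute("""
    select sensor, value,
           lag(value)   over w - value as delta,
           row_number() over w         as rn
      from data
    window w as (partition by sensor order by checktime)
     order by sensor, checktime
""").fetchall()
print(rows)  # [(0, 2.0, None, 1), (0, 5.0, -3.0, 2), (1, 4.0, None, 1), (1, 6.0, -2.0, 2)]
```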