This is an updated translation of a much older article, which I originally wrote in German. I was experimenting with importing the account statements from my German Sparkasse, which at that time were made available as a CSV.
The initial data load
The data looked like this:
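The original sample is not reproduced here, but a hypothetical excerpt in the same shape - semicolon-separated, German date and amount notation, and a book date without a year - might look like this (all values made up, except the merchant name, which appears again below):

```
"01.09.";"01.09.2020";"SCHECK IN CENTER KA DURLACH";"DE00123456780000000000";"Kartenzahlung";"-23,99";"Umsatz gebucht"
"02.09.";"02.09.2020";"AGIP TANKSTELLE";"DE00987654320000000000";"Kartenzahlung";"-54,70";"Umsatz gebucht"
```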
Because I want to know how I spend my money, I am loading the data into MySQL. Here is how:
As a first step we are defining a table into which to load the data. The table has columns whose primary purpose is to hold the raw data. We still have to clean and adjust the data before we can do anything useful with it, so we are not looking at the final fields or types at all.
Sadly, we have no unique transaction identifiers, so I cannot really define a proper primary key. I am trying to make do with a `UNIQUE INDEX`, but it is likely overly specific.
The failure scenario is that I have to specify start and end dates when exporting the account statements, and sometimes a few records are exported twice: at the end of one statement file, and again at the beginning of the next.
To prevent loading the same line twice if it happens to appear in multiple CSV files, this unique index is probably okay.
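A minimal sketch of such a staging table - the column names are my choice here, matching the fields discussed below; everything arrives as text:

```sql
CREATE TABLE t (
  bookdate_text   VARCHAR(20)  NOT NULL,  -- "01.09.", day and month only
  valutadate_text VARCHAR(20)  NOT NULL,  -- "01.09.2020", full date
  name            VARCHAR(128) NOT NULL,  -- remote account holder
  account         VARCHAR(34)  NOT NULL,  -- remote account number
  reason          VARCHAR(128) NOT NULL,  -- booking text
  amount          VARCHAR(20)  NOT NULL,  -- "-1.234,56", German notation
  info            VARCHAR(128) NOT NULL,  -- useless, dropped later
  UNIQUE INDEX dedup (bookdate_text, valutadate_text, name, account, reason, amount)
) ENGINE=InnoDB;
```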
I can now load the various CSV files into this table:
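A sketch of the load, assuming the hypothetical file name and the semicolon-separated format from above. The `IGNORE` keyword makes duplicate rows hit the unique index silently instead of aborting the load:

```sql
LOAD DATA LOCAL INFILE 'statements-2020-09.csv'
  IGNORE INTO TABLE t
  FIELDS TERMINATED BY ';' OPTIONALLY ENCLOSED BY '"'
  LINES TERMINATED BY '\n'
  IGNORE 1 LINES  -- skip the header row
  (bookdate_text, valutadate_text, name, account, reason, amount, info);
```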
Now it is time to clean up the data. For this we create a target table, and add a primary key to it.
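A sketch of the target table, using the name `b` that the text refers to later - now with proper types, a surrogate primary key, and a `category` column to be filled in below:

```sql
CREATE TABLE b (
  id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  bookdate   DATE          NOT NULL,
  valutadate DATE          NOT NULL,
  name       VARCHAR(128)  NOT NULL,
  account    VARCHAR(34)   NOT NULL,
  reason     VARCHAR(128)  NOT NULL,
  amount     DECIMAL(10,2) NOT NULL,
  category   VARCHAR(32)   NULL,     -- assigned later via moneysinks
  INDEX (bookdate)
) ENGINE=InnoDB;
```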
Cleaning and changing the data
We can now copy the data over, and in the process clean it up.
- The `amount` field has to be changed from “xxx.xxx,yy” (German monetary notation, with dots between the thousands and a comma before the cents) to “xxxxxxx.yy”.
- We also have to turn the date fields `bookdate_text` and `valutadate_text` into ISO date syntax.
- We need to add a year to `bookdate_text`, taking it from `valutadate_text`.
- The `info` column is useless, so we do not copy it.
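One way to do all of this in a single pass - a sketch, assuming the staging table and formats from above:

```sql
INSERT INTO b (bookdate, valutadate, name, account, reason, amount)
SELECT
  -- bookdate_text has no year; borrow it from valutadate_text
  STR_TO_DATE(CONCAT(bookdate_text, RIGHT(valutadate_text, 4)), '%d.%m.%Y'),
  STR_TO_DATE(valutadate_text, '%d.%m.%Y'),
  name,
  account,
  reason,
  -- "x.xxx,yy" -> "xxxx.yy": strip thousands dots, turn the comma into a decimal point
  REPLACE(REPLACE(amount, '.', ''), ',', '.')
FROM t;
-- the info column is simply not copied
```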
Preparing binning categories
I now want to aggregate my expenses. In the raw data, spent money is listed together with the remote account it went to. To aggregate, we need to assign each of these accounts to a category, for example we would assign all gas stations to the “fuel” category or all supermarkets to the “food” category.
I can have a table `moneysinks` that does this account-to-category assignment.
And here is the `moneysinks` table, which needed manual population (I had to assign a category to each pattern by hand):
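A sketch of what this could look like; the `pattern` column and the example rows are made up for illustration:

```sql
CREATE TABLE moneysinks (
  pattern  VARCHAR(128) NOT NULL PRIMARY KEY,  -- LIKE pattern matched against b.name
  category VARCHAR(32)  NOT NULL
);

INSERT INTO moneysinks (pattern, category) VALUES
  ('%AGIP%',       'fuel'),
  ('%EDEKA%',      'food'),
  ('%STADTWERKE%', 'utilities');
```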
Using the mapping in `moneysinks` and the query below, I can now permanently assign the `category` field in `b`.
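The assignment can be done with a multi-table `UPDATE` - a sketch, assuming the pattern matching from above:

```sql
UPDATE b
  JOIN moneysinks AS m ON b.name LIKE m.pattern
   SET b.category = m.category
 WHERE b.category IS NULL;  -- only touch rows not yet categorized
```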
As I complete my list of patters I get a
category assigned to everything.
That enables me to ask questions:
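For example, a hypothetical question - how much went into fuel this year:

```sql
SELECT SUM(amount) AS fuel_total
  FROM b
 WHERE category = 'fuel'
   AND bookdate >= '2020-01-01';
```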
Using the `category`, I can also group this and make group totals.
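A sketch of the group totals:

```sql
SELECT category,
       COUNT(*)    AS transactions,
       SUM(amount) AS total
  FROM b
 GROUP BY category
 ORDER BY total;
```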
This will tell me how I spend my money.
Often I want to observe how my spending habits change over time, so I would not just use the categories from the `moneysinks` table as grouping criterion, but also group over some time dimension (`GROUP BY year(bookdate) as year, category`).
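For example, year-over-year per category:

```sql
SELECT YEAR(bookdate) AS year,
       category,
       SUM(amount)    AS total
  FROM b
 GROUP BY YEAR(bookdate), category
 ORDER BY year, total;
```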
As my data warehouse grows, it may be useful to run the aggregation query and save the result in a preaggregated table (by day or month), as a kind of materialized view of the aggregates. I can use these to create coarser-grained aggregates quickly (take the daily sums and counts, and build monthly or yearly aggregates from them, fast).
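A sketch of such a preaggregated table, assuming every row in `b` has a category by now; the daily sums can then be rolled up into monthly numbers without touching `b`:

```sql
CREATE TABLE agg_daily (
  day      DATE          NOT NULL,
  category VARCHAR(32)   NOT NULL,
  cnt      INT           NOT NULL,
  total    DECIMAL(12,2) NOT NULL,
  PRIMARY KEY (day, category)
);

INSERT INTO agg_daily
SELECT bookdate, category, COUNT(*), SUM(amount)
  FROM b
 GROUP BY bookdate, category;

-- Monthly totals from the daily preaggregate:
SELECT YEAR(day) AS year, MONTH(day) AS month, category,
       SUM(cnt) AS transactions, SUM(total) AS total
  FROM agg_daily
 GROUP BY YEAR(day), MONTH(day), category;
```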
Summary and outlook
Using account statements from a bank, we have built a basic data warehouse. We demonstrated data import, staging and cleaning processes. Using category tables, we demonstrated how to bin individual records into larger sets that are useful in statistical analysis, and we looked at how we could materialize these aggregations into preaggregated tables.
This is a toy data warehouse, and it poses no challenges in data management due to the small size. Still, it has many properties that are also present in larger structures. Let’s look at this in a more generalized way:
A data warehouse is centered around one or more fact tables. Fact tables always have a time dimension; they are log-like in nature.
Literal attribute values, not ids
Fact tables contain snapshot data that is not normalized, but contains literal values. In our example: actual account holder names, and account numbers.
For example, in a sales warehouse for a web shop, we are never interested in the current price of an item in a sales-fact table, but always in what we sold it for back then, for each individual sale. We are never interested in where the customer lives now, but in where we sent the article. Hence we log literal attributes, never ids pointing to the current article or customer record.
Encoding can shrink tables, but only do what is necessary
This often leads to data duplication. In our example, the string “SCHECK IN CENTER KA DURLACH” is repeated 36 times. For the amount of data shown in our example this does not matter.
Even in the one-million-rows example in “Coding fields for great profit”, the gain is substantial, but not critical. A data warehouse starting out can often skip encoding the values and just take the hit from the duplication. An encoding step can be performed later, as long as there is infrastructure in place for online schema change, and sufficient disk space.
The Star and the Snowflake
When we talk about OLTP databases, we often talk about normalization, and aiming for the 3rd normal form as an idealized schema to work from. For transactional data that is a good model and a good goal.
In data warehousing, this is obviously not the goal - our data has a time dimension, and unlike in OLTP, we are interested in how things were then, not in what they are like now. In our example, we are interested in how much we spent on fuel back then, and not in what the same amount of fuel would cost us now.
Conversely, in a data warehouse the logged data usually does not change much after the fact: there can be
- backfilling, due to data arriving late,
- backpropagation of newly added attributes,
- or there can be corrections, because transactions were changed after the fact at the business level.
All of these require us to be able to rewrite our logs for some time window. But eventually that window closes, the data becomes immutable, and we can seal it (run `OPTIMIZE TABLE`, and apply encoding and page level compression).
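A sketch of sealing a table whose time window has closed; the table name is hypothetical, and `ROW_FORMAT=COMPRESSED` assumes InnoDB file-per-table tablespaces:

```sql
OPTIMIZE TABLE b_2019;  -- rewrite the table, compacting data and indexes
ALTER TABLE b_2019 ROW_FORMAT = COMPRESSED KEY_BLOCK_SIZE = 8;  -- page level compression
```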
In our example, once money is spent, it would not come back (or if it does, it comes back as a second, corrective reverse booking), and the amounts booked would never change. The log records are immutable from the outset.
Things that OLTP structures need to avoid by normalizing - the insert, update, and delete anomalies - do not happen to us in the same way. Also, because our data does not change much (if ever), but grows a lot over time, it is useful to invest the CPU cycles for data compression (using encoding at the data model level, and using page compression at the database engine level, in this order) once the data can be sealed.
The classical structure of a data warehouse is therefore the star, in which we have a fact table, and a number of category tables aiding binning, plus a number of generated daily/weekly/monthly preaggregates as outputs. All of these auxiliary tables group around the fact table, linked to it in some way, hence the “Star Schema”.
In our example, we have the `b` fact table, and only a single binning input table, `moneysinks`. More complicated warehouses can have many more of these.
We have no materialized output tables (expenses per day, or similar), because our data set is small enough to do all of this ad hoc. In larger warehouses, materializing aggregates that help to speed up reports is useful.
If you add categories of categories, or produce multidimensional aggregates, you go from the Star to the Snowflake Schema.
The Star and the Snowflake are the normal forms for data warehouses. Normalization as practiced in transactional databases is not helpful here; 3NF is not a thing.
Time Dimensional Tables in OLTP schemas
Almost every OLTP schema earning money has some tables in it that have a time dimension. It is often visible either in the table name (`shop.sales_202009`) or in the table's primary key - often a compound primary key which contains a pair of an id and a date.
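For illustration, a hypothetical sales table of the second kind, with the time dimension in the primary key (names and columns are made up):

```sql
CREATE TABLE sales (
  id          BIGINT UNSIGNED NOT NULL,
  sale_date   DATE            NOT NULL,
  customer_id BIGINT UNSIGNED NOT NULL,
  amount      DECIMAL(10,2)   NOT NULL,
  PRIMARY KEY (id, sale_date)   -- the pair of an id and a date
);
```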
It will be visible in any case, if you think about each table's growth.
Assume unchanging conditions - a webshop with a mostly fixed number of articles, and a mostly fixed number of customers buying at a fixed rate. Which tables will stay at a fixed size, and which tables will grow without bounds?
Solution: It is the tables with the time dimension - the log-like tables such as the orders and their items - that grow without bounds, while the articles and customers stay at a roughly fixed size.
Find these unbounded structures in your schema.
At the heart of every OLTP schema there is a data warehouse that is struggling to get out. You get it out by completing the data lifecycle for the OLTP phase of things and establishing an Extract, Transform and Load cycle (ETL cycle).
In planning a data warehouse, you identify tables without growth boundaries, and decide which data attributes you want to record as literal data for the warehouse.
In the Extract phase you create a monster join in which you resolve all id values present in your normalized data, extract the literal attributes of interest, and push them into a CSV file or some staging table.
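A sketch of such an Extract query for a hypothetical web shop schema (all table and column names here are made up):

```sql
SELECT o.order_date,
       c.name          AS customer_name,  -- literal value, not customer_id
       c.shipping_city AS shipped_to,     -- where we sent it, back then
       a.title         AS article_title,
       oi.unit_price   AS price_paid      -- the price at the time of sale
  FROM orders o
  JOIN customers   c  ON c.id = o.customer_id
  JOIN order_items oi ON oi.order_id = o.id
  JOIN articles    a  ON a.id = oi.article_id
 WHERE o.order_date >= CURRENT_DATE - INTERVAL 1 DAY
  INTO OUTFILE '/tmp/orders_extract.csv'
       FIELDS TERMINATED BY ';' ENCLOSED BY '"';
```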
Data in the OLTP system can now be deleted by the OLTP system's data lifecycle management without concern for other, non-transactional business functions: we have isolated the non-transactional concerns and confined them to the Extraction process (which will have to be adjusted when the OLTP schema changes, so we are a stakeholder in the OLTP schema evolution).
This isolation of concerns means that the OLTP system of our shop can now delete or archive orders at will after fulfillment or whatever other concerns the transactional system has to serve. Our non-transactional needs are all served from the data gathered in the Extract phase of the ETL process.
The staged extracted values can now be cleaned, categorized and aggregated, depending on the demands from the non-transactional business concerns (statistics, business intelligence, decision making and so on).
Identifying data warehousey tables in OLTP systems, and isolating the transactional and non-transactional business concerns is important to keep the OLTP system small and agile.
Without taking the non-transactional, long-term business functions out, the OLTP system will grow without bounds, and ultimately transactional performance will suffer - the system will choke on itself. Non-transactional concerns that are not isolated and bundled will also impede OLTP schema evolution and choke the development process on the money-earning side.
Data Lifecycle Management at the Warehouse
Fact tables in the data warehouse have a time dimension, and as time passes, they will accumulate more data. In order to manage our data warehouse, we will also have to complete the data lifecycle on this side of the ETL boundary, and establish some deletion policy.
Queries to the data warehouse table are often time bounded (“How did sales change in the last 2 years”) and binned by - among other things - a time dimension (“How did demand change month-over-month”, “Which of our products are seasonal?”).
Handling data at volume, and getting rid of data no longer needed, is much easier in MySQL when using partitions.
In data warehouses, partitions are usually on a time value as the first dimension. That is, we partition our data set by year, month or day, and we delete data by dropping old partitions. This leaves all internal B-trees in all the partitions' subtables untouched, exactly as they were after import, optimization and compression.
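A sketch, partitioning a fact table by month; the table, partition names and ranges are illustrative:

```sql
CREATE TABLE facts (
  id       BIGINT UNSIGNED NOT NULL,
  bookdate DATE            NOT NULL,
  amount   DECIMAL(10,2)   NOT NULL,
  PRIMARY KEY (id, bookdate)          -- must include the partitioning column
)
PARTITION BY RANGE COLUMNS (bookdate) (
  PARTITION p202008 VALUES LESS THAN ('2020-09-01'),
  PARTITION p202009 VALUES LESS THAN ('2020-10-01'),
  PARTITION p_max   VALUES LESS THAN (MAXVALUE)
);

-- Expiring the oldest month is a cheap metadata operation:
ALTER TABLE facts DROP PARTITION p202008;
```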
As queries also have a time dimension, the optimizer profits from this structure as well. It can exclude entire partitions from consideration, making the subset of data to look at much smaller.
Daily Snapshots are slow, but simple
A daily snapshot ETL import is slow - you do not get realtime data updates - but simple to implement retroactively on existing structures. Realtime structures require more engineering, but can be crafted on top of existing architectures as well. We are going to look at them in some later examples.