Importing account statements and building a data warehouse

This is an update and translation of a much older article, which I wrote in German back then. I was experimenting with importing the account statements from my German Sparkasse, which at that time were made available as CSV files.
The initial data load
The data looked like this:
$ head -2 /home/kris/Documents/banking/umsatz-22758031-29122004.csv
"Local Account";"Book Date";"Valuta Date";"Transaction Type";
"Purpose";
"Remote Party";"Remote Account";"Bank Code";
"Amount";"Currency";"Info"
"08154711";"30.12";"30.12.05";"Direct Debit";
"DRP 08154711 040441777 INKL. 16% UST 5.38 EUR";
"STRATO MEDIEN AG";"040441777";"10050000";
"-39,00";"EUR";"Direct Debit booked"
Because I want to know how I spend my money, I am loading the data into MySQL. Here is how:
As a first step we define a table into which to load the data. The table has columns whose primary purpose is to hold the raw data. We still have to clean and adjust the data before we can do anything useful with it, so we are not looking at the final field names or types yet.
-- load data
warnings;
DROP TABLE IF EXISTS transactions;
CREATE TABLE transactions (
localaccount char(8) NOT NULL,
bookdate_text char(10) NOT NULL,
valutadate_text char(10) NOT NULL,
transactiontype varchar(50) NOT NULL,
purpose varchar(180) NOT NULL,
remoteaccount_name varchar(100) NOT NULL,
remoteaccount_number char(20) NOT NULL,
remoteaccount_bankcode char(8) NOT NULL,
amount_text char(12) NOT NULL,
currency char(3) NOT NULL,
info varchar(255) NOT NULL,
unique index ( bookdate_text
, transactiontype
, purpose
, remoteaccount_number
, remoteaccount_bankcode
, amount_text)
) ;
truncate table transactions;
Sadly, we have no unique transaction identifiers, so I cannot really define a proper primary key. I am trying to make do with a UNIQUE INDEX, but it is likely overly specific.
The failure scenario is that I have to specify start and end dates when exporting the account statements, and sometimes a few records are exported twice: at the end of one statement file, and again at the beginning of the next.
To prevent loading the same line twice when it appears in multiple CSV files, this unique index is probably okay - combined with the IGNORE keyword in the LOAD DATA statements below, duplicate rows are skipped instead of aborting the load.
I can now load the various CSV files into this table:
load data infile
'/home/kris/Documents/banking/umsatz-22758031-29122004.csv'
ignore into table transactions
fields terminated by ";"
optionally enclosed by '"'
ignore 1 lines;
load data infile
'/home/kris/Documents/banking/umsatz-22758031-24012005.csv'
ignore into table transactions
fields terminated by ";"
optionally enclosed by '"'
ignore 1 lines;
...
Now it is time to clean up the data. For this we create a target table, and add a primary key to it.
-- prepare conversion stage
DROP TABLE IF EXISTS b;
create table b like transactions;
alter table b add column id integer unsigned not null first;
alter table b add primary key (id);
alter table b change column id
id integer unsigned not null auto_increment;
Cleaning and changing the data
We can now copy the data over, and in the process clean it up.
- The amount field has to be changed from “xxx.xxx,yy” (German monetary notation, with dots between the thousands and a comma before the cents) to “xxxxxxx.yy”.
- We also have to turn the date fields valutadate_text and bookdate_text into ISO date syntax.
- We need to add a year to bookdate_text, taking it from valutadate_text.
- The info column is useless.
-- load data into conversion stage
insert into b select NULL, transactions.* from transactions;
-- adapt amount
update b set amount_text = replace(amount_text, ".", "");
update b set amount_text = replace(amount_text, ",", ".");
alter table b change column amount_text
amount decimal(12,2) not null;
-- adapt valutadate
update b set valutadate_text =
concat("20",
substring(valutadate_text, 7, 2),
"-",
substring(valutadate_text, 4, 2),
"-",
substring(valutadate_text, 1,2));
alter table b change column valutadate_text
valutadate date not null;
-- adapt bookdate
update b set bookdate_text =
concat(year(valutadate),
"-",
substring(bookdate_text, 4,2),
"-",
substring(bookdate_text, 1,2));
alter table b change column bookdate_text
bookdate date not null;
-- drop info
alter table b drop column info;
Preparing binning categories
I now want to aggregate my expenses. In the raw data, spent money is listed together with the remote account it went to. To aggregate, we need to assign each of these accounts to a category; for example, we would assign all gas stations to the “fuel” category and all supermarkets to the “food” category.
I can have a table “moneysinks” that does this account-to-category assignment.
-- add category
alter table b add column category varchar(20) not null;
And here is the moneysinks table, which needed manual population (I had to assign a category to each pattern by hand):
DROP TABLE IF EXISTS `moneysinks`;
CREATE TABLE `moneysinks` (
`id` int(10) unsigned NOT NULL auto_increment,
`pattern` varchar(100) NOT NULL,
`category` varchar(20) NOT NULL,
PRIMARY KEY (`id`)
);
LOCK TABLES `moneysinks` WRITE;
INSERT INTO `moneysinks` VALUES (77,'sparkasse','account and card');
INSERT INTO `moneysinks` VALUES (78,'ga','ATM domestic');
INSERT INTO `moneysinks` VALUES (79,'qsc','Internet');
INSERT INTO `moneysinks` VALUES (80,'linux new media','Newspaper');
INSERT INTO `moneysinks` VALUES (81,'premiere','TV');
INSERT INTO `moneysinks` VALUES (82,'walmart','Food');
INSERT INTO `moneysinks` VALUES (83,'kabel bw','TV');
INSERT INTO `moneysinks` VALUES (84,'gez','TV');
INSERT INTO `moneysinks` VALUES (85,'t-mobile','Telephone');
INSERT INTO `moneysinks` VALUES (86,'finanzkasse','Tax');
INSERT INTO `moneysinks` VALUES (87,'domainfactory','Internet');
INSERT INTO `moneysinks` VALUES (88,'nagel ue','Clothing');
INSERT INTO `moneysinks` VALUES (89,'mobilcom','Telephone');
INSERT INTO `moneysinks` VALUES (90,'domainfactory','Internet');
INSERT INTO `moneysinks` VALUES (91,'strato','Internet');
INSERT INTO `moneysinks` VALUES (92,'stadtwerke','Gas Water Shit');
INSERT INTO `moneysinks` VALUES (93,'deutsche bahn','Rail');
INSERT INTO `moneysinks` VALUES (94,'debeka','Insurance');
INSERT INTO `moneysinks` VALUES (95,'ec-ga','ATM intl');
INSERT INTO `moneysinks` VALUES (96,'scheck in','Food');
INSERT INTO `moneysinks` VALUES (97,'mastercard','Credit card');
INSERT INTO `moneysinks` VALUES (98,'dr.','Rent');
INSERT INTO `moneysinks` VALUES (99,'a.t.u','Auto');
INSERT INTO `moneysinks` VALUES (100,'ungeheuer','Auto');
INSERT INTO `moneysinks` VALUES (101,'agip','Fuel');
INSERT INTO `moneysinks` VALUES (102,'aral','Fuel');
INSERT INTO `moneysinks` VALUES (103,'avia','Fuel');
INSERT INTO `moneysinks` VALUES (104,'bab','Fuel');
INSERT INTO `moneysinks` VALUES (105,'citti','Food');
INSERT INTO `moneysinks` VALUES (106,'efa','Fuel');
INSERT INTO `moneysinks` VALUES (107,'esso','Fuel');
INSERT INTO `moneysinks` VALUES (108,'expedia','Travel');
INSERT INTO `moneysinks` VALUES (109,'fantasy','RPG');
INSERT INTO `moneysinks` VALUES (110,'gravis','Toys and Gadgets');
INSERT INTO `moneysinks` VALUES (111,'foto','Toys and Gadgets');
INSERT INTO `moneysinks` VALUES (112,'heinrich','Clothing');
INSERT INTO `moneysinks` VALUES (113,'hem','Fuel');
INSERT INTO `moneysinks` VALUES (114,'hotel','Travel');
INSERT INTO `moneysinks` VALUES (115,'jet-tank','Fuel');
INSERT INTO `moneysinks` VALUES (116,'karstadt','Food');
INSERT INTO `moneysinks` VALUES (117,'kassen-','Tax');
INSERT INTO `moneysinks` VALUES (118,'leichtsinn','Toys and Gadgets');
INSERT INTO `moneysinks` VALUES (119,'media markt','Toys and Gadgets');
INSERT INTO `moneysinks` VALUES (120,'mios','Fuel');
INSERT INTO `moneysinks` VALUES (121,'plaza','Food');
INSERT INTO `moneysinks` VALUES (122,'rundfunkgebuehren','TV');
INSERT INTO `moneysinks` VALUES (123,'saturn','Toys and Gadgets');
INSERT INTO `moneysinks` VALUES (124,'sb tank','Fuel');
INSERT INTO `moneysinks` VALUES (125,'segelkiste','Clothing');
INSERT INTO `moneysinks` VALUES (126,'total/','Fuel');
INSERT INTO `moneysinks` VALUES (127,'trx','Travel');
INSERT INTO `moneysinks` VALUES (128,'tst. bensheim','Fuel');
INSERT INTO `moneysinks` VALUES (129,'vobis','Toys and Gadgets');
INSERT INTO `moneysinks` VALUES (130,'willenberg','Toys and Gadgets');
INSERT INTO `moneysinks` VALUES (131,'shell','Fuel');
INSERT INTO `moneysinks` VALUES (132,'spreadshirt','Clothing');
INSERT INTO `moneysinks` VALUES (133,'armin meier','Clothing');
INSERT INTO `moneysinks` VALUES (134,'itzehoer','Insurance');
INSERT INTO `moneysinks` VALUES (135,'ec-pos','ATM intl');
INSERT INTO `moneysinks` VALUES (136,'euf-ga','ATM intl');
INSERT INTO `moneysinks` VALUES (137,'dell','Toys and Gadgets');
INSERT INTO `moneysinks` VALUES (138,'yvonne','RPG');
UNLOCK TABLES;
Asking Questions
Using the mapping in moneysinks and the query below, I can now permanently assign the category field in b:
update b set category = (
select category
from moneysinks as w
where b.remoteaccount_name like concat(w.pattern, "%")
order by length(pattern) desc
limit 1) where b.amount < 0;
As I complete my list of patterns, I get a category assigned to everything.
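A quick way to spot payees that no pattern matches yet is a query along these lines (a sketch added here for illustration, using only the tables defined above):
-- Payees with outgoing money that no moneysinks pattern covers yet.
select distinct remoteaccount_name
from b
where amount < 0
  and not exists (
    select 1
    from moneysinks as w
    where b.remoteaccount_name like concat(w.pattern, "%"));
Every name this returns points at a pattern still to be added to moneysinks.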
That enables me to ask questions:
select remoteaccount_name,
count(remoteaccount_name) as income
from b
where amount>=0
group by remoteaccount_name
order by income desc;
+--------------------------------------------------------+-----------+
| remoteaccount_name | income |
+--------------------------------------------------------+-----------+
| WEB.DE AG AMALIENBADSTR. 41 | 12 |
| MYSQL GMBH | 7 |
| MYSQL GMBH SCHLOSSERSTR. 4 72622 NUERTINGEN | 3 |
| COOP SCHLESWIG-HOLSTEIN EG BENZSTR. 10 | 2 |
+--------------------------------------------------------+-----------+
select remoteaccount_name,
count(remoteaccount_name) as payments
from b
where amount<0
group by remoteaccount_name
order by payments desc;
+---------------------------------------------------------+----------+
| remoteaccount_name | payments |
+---------------------------------------------------------+----------+
| SCHECK IN CENTER KA DURLACH | 36 |
| MOBILCOM COMMUNICATIONSTECH | 22 |
| STRATO MEDIEN AG | 22 |
| VERMIETER | 19 |
| T-MOBILE DEUTSCHLAND GMBH | 18 |
| DEUTSCHE BAHN KARLSRUHE HB | 17 |
| KABEL BW GMBH & CO. KG | 17 |
| QSC AG | 17 |
| STADTWERKE KARLSRUHE | 17 |
...
Using the category column I can also group this and make group totals.
select category,
count(amount) as payments,
sum(amount) as total
from b
where amount<0
group by category
order by total;
This will tell me how I spend my money.
Often I want to observe how my spending habits change over time, so I would not just use the categories from the moneysinks table as grouping criterion, but also group over some time dimension (GROUP BY year(bookdate), category), as in the sketch below.
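A yearly breakdown per category could look like this (a monthly version would group by year(bookdate), month(bookdate) instead):
-- Spending per year and category (expenses only).
select year(bookdate) as year,
       category,
       count(amount) as payments,
       sum(amount) as total
from b
where amount < 0
group by year(bookdate), category
order by year, total;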
As my data warehouse grows, it may be useful to run the aggregation query and save the result in a pre-aggregated table (by day or month), as a kind of materialized view of the aggregates. I can use these to build coarser-grained aggregates quickly (take the daily sums and counts, and roll them up into monthly or yearly aggregates).
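If the data set ever gets big enough to warrant it, such a pre-aggregate could look like this sketch (the table name expenses_by_month is made up for illustration):
-- Hypothetical monthly pre-aggregate, refreshed after each import.
drop table if exists expenses_by_month;
create table expenses_by_month (
  month date not null,
  category varchar(20) not null,
  payments integer unsigned not null,
  total decimal(12,2) not null,
  primary key (month, category)
);
insert into expenses_by_month
select date_format(bookdate, '%Y-%m-01') as month,
       category,
       count(amount),
       sum(amount)
from b
where amount < 0
group by month, category;
Yearly reports can then be built from this table instead of scanning the full fact table.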
Summary and outlook
Using account statements from a bank, we have built a basic data warehouse. We demonstrated data import, staging and cleaning. Using category tables, we demonstrated how to bin individual records into larger sets that are useful for statistical analysis, and we looked at how we could materialize these aggregations into pre-aggregated tables.
This is a toy data warehouse, and it poses no challenges in data management due to the small size. Still, it has many properties that are also present in larger structures. Let’s look at this in a more generalized way:
A data warehouse is centered around one or more fact tables. Fact tables always have a time dimension; they are log-like in nature.
Literal attribute values, resolve IDs
Fact tables contain snapshot data that is not normalized, but contains literal values. In our example: actual account holder names, and account numbers.
For example, in a sales warehouse for a web shop, we are never interested in the current price of an item in a sales-fact table, but always in what we sold it for back then, for each individual sale. We are never interested in where the customer lives now, but in where we sent the article. Hence, we log literal attributes, never IDs pointing to the current article or customer record.
Encoding can shrink tables, but do only what is necessary
This often leads to data duplication. In our example, the string “SCHECK IN CENTER KA DURLACH” is repeated 36 times. For the amount of data shown in our example this does not matter.
Even in the one-million-row example in “Coding fields for great profit” the gain is substantial, but not critical. A data warehouse starting out can often skip encoding the values and just take the hit from the duplication. An encoding step can be performed later, as long as there is infrastructure in place for online schema change, and sufficient disk space.
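As an illustration of what such an encoding step could look like (the remote_parties lookup table and the remoteparty_id column are made up here, not part of the schema above), the repeated payee strings get replaced by a small integer key:
-- Hypothetical encoding step: deduplicate payee names into a lookup table.
create table remote_parties (
  id integer unsigned not null auto_increment,
  name varchar(100) not null,
  primary key (id),
  unique key (name)
);
insert into remote_parties (name)
select distinct remoteaccount_name from b;
alter table b add column remoteparty_id integer unsigned null;
update b join remote_parties as r
  on b.remoteaccount_name = r.name
set b.remoteparty_id = r.id;
-- Once everything is backfilled, the wide text column could be dropped.
alter table b drop column remoteaccount_name,
  change column remoteparty_id remoteparty_id integer unsigned not null;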
The Star and the Snowflake
When we talk about OLTP databases, we often talk about normalization, and aiming for the 3rd normal form as an idealized schema to work from. For transactional data that is a good model and a good goal.
In data warehousing, this is obviously not the case - our data has a time dimension, and unlike OLTP, in a data warehouse we are interested in how things were then, not what they are like now. In our example, we are interested in how much we spent on fuel back then, and not what the same amount of fuel would cost us now.
Conversely, in a data warehouse the logged data usually does not change much after the fact: there can be
- back filling, due to data arriving late,
- backpropagation of newly added attributes,
- or there can be corrections due to transactions changed after the fact at the business level.
All of these require us to be able to rewrite our logs for some time window. But eventually that window closes and the data becomes immutable, and we can seal it (run OPTIMIZE TABLE, and apply encoding and page-level compression).
In our example, once money is spent, it does not come back (or if it does, only as a second, corrective reverse booking), and the amounts booked never change. The log records are immutable from the outset.
Things that OLTP structures need to avoid by normalizing - insert, update and delete anomalies - do not happen to us in the same way. Also, because our data does not change much (if ever), but grows a lot over time, it is useful to invest the CPU cycles for data compression (using encoding at the data model level, and page compression at the database engine level, in this order) once the data can be sealed.
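A minimal sketch of such a sealing step, assuming the closed data lives in a per-year table b_2005 (a made-up name) on InnoDB with file-per-table tablespaces:
-- Hypothetical sealing of a closed year: compress the pages, then rebuild.
alter table b_2005 row_format=compressed key_block_size=8;
optimize table b_2005;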
The classical structure of a data warehouse is therefore the star, in which we have a fact table, and a number of category tables aiding binning, plus a number of generated daily/weekly/monthly pre-aggregates as outputs. All of these auxiliary tables group around the fact table, linked to it in some way, hence the “Star Schema”.
In our example, we have the b fact table, and only a single binning input table, moneysinks. More complicated warehouses can have many more of these.
We have no materialized output tables (expenses per day or similar), because our data set is small enough to do all of this ad hoc. In larger warehouses, materializing aggregates that help to speed up reports is useful.
If you add categories of categories, or produce multidimensional aggregates, you go from Star to Snowflake Schema.
The Star and the Snowflake are the normal forms for data warehouses. Normalization as practiced in transactional databases is not helpful here; 3NF is not a thing.
Time Dimensional Tables in OLTP schemas
Almost every OLTP schema earning money has some tables in it that have a time dimension. It is often visible either in the table name (shop.sales_202009) or in the table's primary key - often a compound primary key which contains a pair of an id and a date.
It will be visible in any case if you think about each table's growth.
Assume unchanging conditions - a webshop with a mostly fixed number of articles, and a mostly fixed number of customers buying at a fixed rate. Which tables will stay at a fixed size, and which tables will grow without bounds?
Solution: It is the orders table. Find these unbounded structures in your schema.
At the heart of every OLTP schema there is a data warehouse that is struggling to get out. You get it out by completing the data lifecycle for the OLTP phase of things and establishing an Extract, Transform and Load cycle (ETL cycle).
In planning a data warehouse, you identify tables without growth boundaries, and decide which data attributes you want to record as literal data for the warehouse.
In the Extract phase you create a monster join in which you resolve all ID values present in your normalized data, extract the literal attributes of interest, and push them into a CSV file or some staging table.
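For the hypothetical webshop above, such an extract could look like this sketch (the orders, order_items, customers and articles tables and all of their columns are made up for illustration):
-- Hypothetical extract: resolve IDs into the literal values we want to keep.
select o.order_id,
       o.ordered_at,
       c.name as customer_name,
       c.shipping_city as shipped_to_city,
       a.title as article_title,
       i.unit_price as price_paid  -- the price back then, not the current one
from orders o
join order_items i on i.order_id = o.order_id
join customers c on c.customer_id = o.customer_id
join articles a on a.article_id = i.article_id
where o.ordered_at >= '2020-09-01'
  and o.ordered_at < '2020-10-01'
into outfile '/tmp/sales_202009.csv'
fields terminated by ';' optionally enclosed by '"';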
Data in the OLTP system can now be deleted by the OLTP system's data lifecycle management without concern for other, non-transactional business functions: we have isolated the non-transactional concerns and confined them to the Extraction process (which will have to be adjusted when the OLTP schema changes, so we are a stakeholder in the OLTP schema evolution).
This isolation of concerns means that the OLTP system of our shop can now delete or archive orders at will after fulfillment or whatever other concerns the transactional system has to serve. Our non-transactional needs are all served from the data gathered in the Extract phase of the ETL process.
The staged extracted values can now be cleaned, categorized and aggregated, depending on the demands from the non-transactional business concerns (statistics, business intelligence, decision-making and so on).
Identifying data warehousey tables in OLTP systems, and isolating the transactional and non-transactional business concerns is important to keep the OLTP system small and agile.
Without taking the non-transactional, long-term business functions out, the OLTP system will grow without bounds, and ultimately transactional performance will suffer - the system will choke on itself. Leaving the non-transactional concerns unisolated and unbundled will also impede OLTP schema evolution and choke the development process on the money-earning side.
Data Lifecycle Management at the Warehouse
Fact tables in the data warehouse have a time dimension, and as time passes, they will accumulate more data. In order to manage our data warehouse, we will also have to complete the data lifecycle on this side of the ETL boundary, and establish some deletion policy.
Queries to the data warehouse table are often time bounded (“How did sales change in the last 2 years”) and binned by - among other things - a time dimension (“How did demand change month-over-month”, “Which of our products are seasonal?”).
Handling data at volume, and getting rid of data no longer needed, is much easier in MySQL when using partitions.
In data warehouses, partitions are usually on a time value as the first dimension. That is, we partition our data set by year, month or day, and we delete data by dropping old partitions. This leaves all internal B-trees in all the partitions' sub-tables untouched, exactly as they were after import, optimization and compression.
As queries also have a time dimension, the optimizer profits from this structure as well. It can exclude entire partitions from consideration, making the subset of data to look at much smaller.
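A sketch of what that could look like for our fact table (the partitioned copy b_part and the year boundaries are made up for illustration):
-- Hypothetical range partitioning of the fact table by booking year.
create table b_part (
  id integer unsigned not null,
  bookdate date not null,
  category varchar(20) not null,
  amount decimal(12,2) not null,
  primary key (id, bookdate)
)
partition by range (year(bookdate)) (
  partition p2004 values less than (2005),
  partition p2005 values less than (2006),
  partition p2006 values less than (2007),
  partition pmax values less than maxvalue
);
-- Expiring old data is then a metadata operation, not a row-by-row delete.
alter table b_part drop partition p2004;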
Daily Snapshots are slow, but simple
A daily snapshot ETL import is slow - you do not get realtime data updates - but simple to implement retroactively on existing structures. Realtime structures require more engineering, but can be crafted on top of existing architectures as well. We are going to look at them in some later examples.