Bi-temporal database design

Today, databases are the primary system of record, and organizations (especially heavily regulated ones) are required to keep an accurate picture of all the facts, as they occur. Most organizations manage data that changes over time. If you are in a regulated industry that does not allow you to physically delete data, you must maintain a database design which allows you to view the state of data as of time T. A temporal database design helps you to achieve that easily. While some databases do come with built-in features to achieve temporality, even if the database doesn't have temporal querying features we can achieve the same with smart table designs.

Let's consider a multi-national ecommerce application which receives forex rates from exchange periodically - the application stores these rates in a table (the rows are updated whenever we get new rates). The prices are listed in USD and whenever an order is processed, the table is queried and buyer's credit card is charged the appropriate amount in the local currency.

Let's assume our forex rate table looks like this:

CNC1	CNC2	RATE
USD	CAD	1.33
USD	EUR	0.88
USD	JPY	111.58

Now let's say we have a situation where a customer complained that his/her credit card was inappropriately charged an incorrect amount. With the current table design we have no way of knowing what the rate was in the table when the order was processed. Unfortunately most conventional databases don't maintain or expose a versioned or timed view of the data. In 1992, while formalizing the SQL standards (https://en.wikipedia.org/wiki/SQL-92 ) this issue was recognized and an extension (called TSQL2) to the SQL standards was defined to handle temporality. There was an attempt to incorporate a subset of TSQL2 into the SQL:1999 standard, a subset named SQL/Temporal but it was heavily criticized and not included. A new standard called SQL:2011, introduced in Dec 2011, finally included clauses for definition of temporal tables. However it's not widely adopted yet and only a few vendors have included support for SQL:2011 standards in their products so far.

Even if a database isn't compliant with SQL:2001 standards and don't support built-in temporal features, we can still have user managed temporality through smart DB design.

Before we take a deeper dive into user managed temporality let's understand the following temporality-related concepts first :

Valid-Time State Tables : Valid time is the time period during which a fact is true with respect to the real world. The actual time when the data was persisted in database has no impact on the valid time. For example if we get a message from exchange that USD-CAD rate was 1.33 at time T1 and another message indicating that USD-CAD rate was 1.35 at time T2, then during the time range T1-T2, the applicable rate was 1.33. This is true irrespective of when the data was actually entered in the system or when the data is being queried.
Transaction-Time State Tables : Transaction time is the time period during which a fact stored in the database is considered to be true. Lets assume that the messages we receive from the exchange dont have any time info on them. So we assume them to be valid as and when we receive them. So if we receive a message at time T3 indicating that the rate is 1.33 and another message at T4 indicating that the rate is 1.35, then during the time range T3-T4, we assume that the applicable rate id 1.33. Note that if there is a significant lag between the message being published by the exchange and the message being persisted in DB, there would be inconsistency between temporality in our database vs temporality in the real world.

Clearly both valid-time state tables and transaction-time state tables associate state of data with a time range. SQL lacks a time-range data type, so we'll have to use two different time columns to capture that information. Let's call these columns VALID_FROM and VALID_TO.

Lets assume we are getting event timestamp information as part of the messages from exchange and hence we are implementing a valid-time state table. If we get a message indicating that USD-CAD rate at time T1 is 1.33, VALID_FROM could be T1 and VALID_TO could be null (which will be updated when we get the next message). Like this :

CNC1	CNC2	RATE	VALID_FROM	VALID_TO
USD	CAD	1.33	2019-03-11 9:00	NULL

Ideally we would want the primary key of the table to uniquely identify a rate. In order to achieve that, we'll have to include VALID_FROM and VALID_TO as part of the primary key which means they should not be nullable. (Though we can solve the problem by having an auto-generated primary key ) Also, when we insert a row with VALID_TO as NULL we are really using NULL to represent infinity here ( A row is valid until we get new information, if we don't get new, updated information ever, the row is valid forever). One way to handle this is to insert a row with VALID_TO column containing the highest data supported by the database (on many databases it's 9999-12-31 23:59:59. 999.

CNC1	CNC2	RATE	VALID_FROM	VALID_TO
USD	CAD	1.33	2019-03-11 9:00	9999-12-31 23:59

Now let's assume we get another message from exchange indicating that the rate has changed to 1.35 at 9:01. We need to update the VALID_TO column of the existing row (the existing row stops being valid at 9:01 and insert a new row with the latest state of the data.

CNC1	CNC2	RATE	VALID_FROM	VALID_TO
USD	CAD	1.33	2019-03-11 9:00	2019-03-11 9:01
USD	CAD	1.35	2019-03-11 9:01	9999-12-31 23:59

Its important that the previous state's VALID_TO exactly matches the current state's VALID_FROM. If there are gaps between the two it means we have no valid data for that duration of the gap.

A key concept in this design is considering our "temporal granularity." We should pick the right granularity for the problem. This could be days, hours, minutes, seconds, or whatever we need as a granularity level. In this example we are using minute as the granularity level.

We keep on continuing with this approach - for every new state of data, we update the VALID_TO column of the existing state and insert a new row with the new state. At any point of time only one row will have VALID_TO > current timestamp, and that row represents the current state

CNC1	CNC2	RATE	VALID_FROM	VALID_TO
USD	CAD	1.33	2019-03-11 9:00	2019-03-11 9:01
USD	CAD	1.35	2019-03-11 9:01	2019-03-11 9:12
USD	CAD	1.37	2019-03-11 9:12	2019-03-11 9:21
USD	CAD	1.32	2019-03-11 9:21	2019-03-11 9:27
USD	CAD	1.29	2019-03-11 9:27	9999-12-31 23:59

Now, let's consider a situation involving data correction. We got a message from exchange indicating that the rate at 9:12 was 1.39 however we incorrectly persisted 1.37 in database. We detected this at 9:20. We can go ahead and update the RATE value from 1.37 to 1.39 but there might be orders in between which queried the system and believed the rate to be 1.37 and not 1.39. If we update the existing row, we lose track of this.

This is essentially the limitation of a monotemporal or unitemporal database design - it can't distinguish between "what was the state of the system at time T ?" and "what did the system believe it's state to be at time T ?" in case of a data correction.

A bitemporal data model combines the concepts of both Valid and Transaction Time together. So we have 4 timestamp columns - VALID_FROM/VALID_TO capturing valid time range and TRANSACTION_FROM/TRANSACTION_TO capturing transaction time range (or in other words the processing time range).

So, if we receive a message from exchange indicating that the rate was 1.33 at time T1 and if we persist the information in DB at time T2, we'll have VALID_FROM as T1 and TRANSACTION_FROM as T2. VALID_TO and TRANSACTION_TO will be infinity (or 9999-12-31 23:59:59.999 for the lack of a suitable date representation for infinity)

CNC1	CNC2	RATE	VALID_FROM	VALID_TO	TRANSACTION_FROM	TRANSACTION_TO
USD	CAD	1.33	2019-03-11 9:00	9999-12-31 23:59	2019-03-11 9:01	9999-12-31 23:59

The above row represents that USD-CAD rate was 1.33 at 9 am on 2019-03-11 and we captured this information in Database at 9:01 am on the same day

Let's say at 9:04 we decide to delete the data. To delete a row, we first update the TRANSACTION_TO to 9:04 am and then insert another row with TRANSACTION_TO as infinity but VALID_TO as 9:04 am

CNC1	CNC2	RATE	VALID_FROM	VALID_TO	TRANSACTION_FROM	TRANSACTION_TO
USD	CAD	1.33	2019-03-11 9:00	9999-12-31 23:59	2019-03-11 9:01	2019-03-11 9:04
USD	CAD	1.33	2019-03-11 9:00	2019-03-11 9:04	2019-03-11 9:04	9999-12-31 23:59

The first row indicates that between 9:01 and 9:04, any system that queried the database would have found the rate to be 1.33

The second row indicates that post 9:04, we know that the data was valid from 9:00 to 9:04 but ceased to be valid afterwards

Lets go back to our original message again

CNC1	CNC2	RATE	VALID_FROM	VALID_TO	TRANSACTION_FROM	TRANSACTION_TO
USD	CAD	1.33	2019-03-11 9:00	9999-12-31 23:59	2019-03-11 9:01	9999-12-31 23:59

Now lets say that we received another message indicating that the rate changed to 1.37 effective 9:05 am and we are persisting this information at 9:06 am

We need to bi temporally delete the existing row (we have already explored bi temporal delete in the last step) and insert a new row with VALID_FROM = 9:05 am and TRANSACTION_FROM = 9:06 am

CNC1	CNC2	RATE	VALID_FROM	VALID_TO	TRANSACTION_FROM	TRANSACTION_TO
USD	CAD	1.33	2019-03-11 9:00	9999-12-31 23:59	2019-03-11 9:01	2019-03-11 9:06
USD	CAD	1.33	2019-03-11 9:00	2019-03-11 9:05	2019-03-11 9:06	9999-12-31 23:59
USD	CAD	1.37	2019-03-11 9:05	9999-12-31 23:59	2019-03-11 9:06	9999-12-31 23:59

The first row indicates that between 9:01 and 9;06, system believed that the active rate was 1.33

The second row indicates that post 9:06, the system realized that the rate of 1.33 stopped being valid at 9:05 am

The third row indicates that the currently active (since 9:05) rate is 1.37 and it was inserted at 9:06 am

Lets deal with a correction now. Lets assume that at 9:08 we realized that the last message was processed incorrectly and the rate should have been 1.39 and not 1.37

When the correction is made, we expire out the incorrect row (like a delete) and then insert a new row with the correct data

CNC1	CNC2	RATE	VALID_FROM	VALID_TO	TRANSACTION_FROM	TRANSACTION_TO
USD	CAD	1.33	2019-03-11 9:00	9999-12-31 23:59	2019-03-11 9:01	2019-03-11 9:06
USD	CAD	1.33	2019-03-11 9:00	2019-03-11 9:05	2019-03-11 9:06	9999-12-31 23:59
USD	CAD	1.37	2019-03-11 9:05	9999-12-31 23:59	2019-03-11 9:06	2019-03-11 9:08
USD	CAD	1.39	2019-03-11 9:05	9999-12-31 23:59	2019-03-11 9:08	9999-12-31 23:59

Note that the correction has no impact on the start or end time of the data so VALID_FROM and VALID_TO remain unchanged from row 3 to row 4

Querying data:

If our table design is bitemporal the select queries have to be take temporality into account too.

Here are a few useful hints for running select queries on bitemporal tables :

How to query for the current state of the data ?

VALID_TO >CURRENT_TIMESTAMP and TRANSACTION_TO > CURRENT_TIMESTAMP

How do I query for data as of time T ?

check for row with T between VALID_FROM and VALID_TO

How do I query for what my system believed the state to be at time T ?

check for row with T between TRANSACTION_FROM and TRANSACTION_TO