News » Datum aggregate database schema redesign
This is a follow-on to the Datum database schema redesign post, where I outlined a redesign of the main datum database schema used by SolarNetwork. Another big part of what SolarNetwork does is take all the raw data posted by SolarNode devices and calculate aggregate summaries of common time periods (hours, days, and months) and store them so they don’t need to be calculated every time someone wants to view their data at one of those aggregate levels. This enables SolarNetwork to respond very quickly to queries like “show me the average power per day generated by my solar panels over the past month” or “show me the energy used per hour by my house over the past week”.
What that means is SolarNetwork continuously computes and stores hourly, daily, and monthly aggregate records for every datum stream; these records contain averages, minimum and maximum values, and other statistical data for each aggregate level.
For example, an hourly node datum computed from a power meter datum stream might have properties like this:
Property | Value |
---|---|
`ts_start` | 2021-01-01 12:00:00+13 |
`node_id` | 123 |
`source_id` | /meter/1 |
`watts` | 2400 |
`watts_min` | 350 |
`watts_max` | 13100 |
`watts_count` | 60 |
`wattHours` | 2411 |
`wattHours_start` | 340983098 |
`wattHours_end` | 340985509 |
`status` | OK |
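To make the relationship between these properties concrete, here is a minimal Python sketch (not SolarNetwork's actual implementation) of how such an hourly record could be derived from raw samples; the sample values here are invented for illustration:

```python
from statistics import mean

# Hypothetical raw samples from a power meter datum stream within one hour.
samples = [
    {"watts": 350, "wattHours": 340983098},
    {"watts": 2400, "wattHours": 340984200},
    {"watts": 13100, "wattHours": 340985509},
]

watts = [s["watts"] for s in samples]
watt_hours = [s["wattHours"] for s in samples]

hourly = {
    "watts": mean(watts),             # average of instantaneous values
    "watts_min": min(watts),
    "watts_max": max(watts),
    "watts_count": len(watts),        # how many values contributed
    "wattHours_start": watt_hours[0],
    "wattHours_end": watt_hours[-1],
    # accumulating properties aggregate to the difference over the period
    "wattHours": watt_hours[-1] - watt_hours[0],
}
```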
Old database design
Until now, SolarNetwork stored this data as a set of JSON objects, based on a simple classification of instantaneous vs accumulating vs status property types. Along with columns for the timestamp, object ID, and source ID of the datum, the database table had JSON columns for each set of properties, and several JSON columns for statistical metadata, sort of like this:
Column | Datatype | Value |
---|---|---|
`ts_start` | timestamp | 2021-01-01 12:00:00+13 |
`node_id` | bigint | 123 |
`source_id` | text | /meter/1 |
`jdata_i` | JSON | `{"watts":2400,"watts_min":350,"watts_max":13100}` |
`jdata_a` | JSON | `{"wattHours":2411}` |
`jdata_s` | JSON | `{"status":"OK"}` |
`jmeta` | JSON | `{"i":{"watts":{"min":350,"max":13100,"count":60}}}` |
`jdata_as` | JSON | `{"wattHours":340983098}` |
`jdata_af` | JSON | `{"wattHours":340985509}` |
`jdata_ad` | JSON | `{"wattHours":2411}` |
The `jdata_i` column stored the average of the instantaneous properties. The `jmeta` column stored the minimum/maximum values seen for those properties, and the count of values contributing to each average. The `jdata_a` column stored the difference of the accumulating properties. The `jdata_as`, `jdata_af`, and `jdata_ad` columns stored the starting, final, and difference values of the accumulating properties, using the reading aggregation logic (which is slightly different from the logic used to compute the `jdata_a` column). The `jdata_s` column stored the last seen status property values.
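As a sketch of how the example hourly datum maps onto these columns (illustrative only, not SolarNetwork's code), note how `watts` and its min/max values end up repeated across JSON documents:

```python
import json

# Old-style JSON column layout for the example hourly datum.
jdata_i = {"watts": 2400, "watts_min": 350, "watts_max": 13100}
jmeta = {"i": {"watts": {"min": 350, "max": 13100, "count": 60}}}
jdata_a = {"wattHours": 2411}
jdata_as = {"wattHours": 340983098}
jdata_af = {"wattHours": 340985509}
jdata_ad = {"wattHours": 2411}
jdata_s = {"status": "OK"}

# Each column was persisted as its own JSON document; the name "watts"
# appears in two documents and "wattHours" in four, so property names
# and min/max statistics are stored repeatedly for every row.
row = {name: json.dumps(doc) for name, doc in {
    "jdata_i": jdata_i, "jmeta": jmeta, "jdata_a": jdata_a,
    "jdata_as": jdata_as, "jdata_af": jdata_af, "jdata_ad": jdata_ad,
    "jdata_s": jdata_s,
}.items()}
```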
Design problems
Although quite flexible (and having served SolarNetwork well for many years), the database design had some notable drawbacks:
- Hard to handle source ID changes — source ID values are just text defined by users, but they usually have important information encoded into them, such as a logical topology of where the data is being collected from. Sometimes source IDs need to be changed, but in this design the source ID is physically part of each datum. To change a source ID, every datum row with that source ID needs to be updated, which can be a very expensive operation to perform.
- Property names are duplicated — every datum row stores the property names along with their corresponding values, as multiple JSON objects. Instantaneous property names are stored twice, and accumulating property names are stored four times. This duplication increases the amount of storage required to hold the data, and decreases the speed at which the data can be queried because there is simply more data to sift through.
- Instantaneous statistics are duplicated — the instantaneous `_min` and `_max` values are duplicated in the `jdata_i` and `jmeta` columns. Not only that, since they are part of the `jdata_i` JSON document, every instantaneous property effectively has 3 property names added. This duplication has the same adverse effects as detailed in the previous point.
- Local date column duplicates timestamp column — the `local_date` column was used to be able to show the “local date” of the associated data, without having to join to another table to do so. Storing it here increases the amount of storage required to hold the data.
- Last seen status values are not ideal — using the “last seen” value for status properties does not capture the best overall sense of what happened over the given time period. For example, a stream might report 59 “OK” values followed by 1 “Error” value for a status property. Even though “Error” was the last reported value, for the majority of the time it was actually “OK”.
New database design
To overcome these design problems, we’ve developed a new database design that preserves the general flexibility of how datum aggregates are stored but solves the design issues. Building on the datum stream concept introduced previously, a new table design was devised with the following attributes:
- Store the stream ID instead of the object and source IDs.
- Drop the local date column, and rely on joining with stream metadata to determine the time zone of the stream.
- Store an ordered list of property aggregate values where the order matches the property name list of the associated stream.
- Store an ordered list of instantaneous property statistic values where the order matches the instantaneous property name list of the associated stream. Each property will have an associated 3-element list of (count, min, max) values.
- Store an ordered list of accumulating reading statistic values where the order matches the accumulating property name list of the associated stream. Each property will have an associated 3-element list of (difference, start, end) values.
- Store the most frequently seen status property values, instead of the last seen.
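The packing described by these points can be sketched in Python (an illustrative helper, not SolarNetwork's API); the name lists come from the stream metadata, and the array positions follow them:

```python
# Property name lists from the stream metadata (order is significant).
names_i = ["watts"]
names_a = ["wattHours"]
names_s = ["status"]

# The example hourly datum, as named properties.
datum = {
    "watts": 2400, "watts_min": 350, "watts_max": 13100, "watts_count": 60,
    "wattHours": 2411, "wattHours_start": 340983098,
    "wattHours_end": 340985509,
    "status": "OK",
}

# Values are stored positionally, matching the metadata name lists.
data_i = [datum.get(n) for n in names_i]
data_a = [datum.get(n) for n in names_a]
data_s = [datum.get(n) for n in names_s]
# stat_i: one (count, min, max) triple per instantaneous property
stat_i = [[datum.get(f"{n}_count"), datum.get(f"{n}_min"),
           datum.get(f"{n}_max")] for n in names_i]
# read_a: one (difference, start, end) triple per accumulating property
read_a = [[datum.get(n), datum.get(f"{n}_start"), datum.get(f"{n}_end")]
          for n in names_a]
```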
A diagram makes this easier to take in, showing just the hourly aggregate tables:
The new `agg_datm_hourly` table has a primary key defined by a stream ID and timestamp. The datum property values are stored as `numeric[]` and `text[]` array columns. The instantaneous property statistic values are stored in the `stat_i` `numeric[]` array column. The accumulating property reading values are stored in the `read_a` `numeric[]` array column.
Using the same example datum as shown earlier, in the new database design there would be one row in the `da_datm_meta` “stream metadata” table like this:
Column | Datatype | Value |
---|---|---|
`stream_id` | UUID | a4840264-2b5b-4673-83f7-66a404d6faf0 |
`node_id` | bigint | 123 |
`source_id` | text | /meter/1 |
`time_zone` | text | Pacific/Auckland |
`names_i` | text[] | [watts] |
`names_a` | text[] | [wattHours] |
`names_s` | text[] | [status] |
Then there would be one row in the `agg_datm_hourly` table like this:
Column | Datatype | Value |
---|---|---|
`stream_id` | UUID | a4840264-2b5b-4673-83f7-66a404d6faf0 |
`ts_start` | timestamp | 2021-01-01 12:00:00+13 |
`data_i` | numeric[] | [2400] |
`data_a` | numeric[] | [2411] |
`data_s` | text[] | [OK] |
`stat_i` | numeric[] | [[60,350,13100]] |
`read_a` | numeric[] | [[2411,340983098,340985509]] |
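Reading such a row back requires the stream metadata to recover the property names. A sketch of that decoding (the dict shapes are assumed for illustration, not SolarNetwork's API):

```python
# Stream metadata and aggregate row, matching the example tables above.
meta = {"names_i": ["watts"], "names_a": ["wattHours"],
        "names_s": ["status"]}
row = {
    "data_i": [2400], "data_a": [2411], "data_s": ["OK"],
    "stat_i": [[60, 350, 13100]],
    "read_a": [[2411, 340983098, 340985509]],
}

# Zip the positional arrays back up with their metadata name lists.
props = {}
for name, value in zip(meta["names_i"], row["data_i"]):
    props[name] = value
for name, (count, vmin, vmax) in zip(meta["names_i"], row["stat_i"]):
    props[f"{name}_count"] = count
    props[f"{name}_min"] = vmin
    props[f"{name}_max"] = vmax
for name, value in zip(meta["names_a"], row["data_a"]):
    props[name] = value
for name, (diff, start, end) in zip(meta["names_a"], row["read_a"]):
    props[f"{name}_start"] = start
    props[f"{name}_end"] = end
for name, value in zip(meta["names_s"], row["data_s"]):
    props[name] = value
```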
Problems solved
The new design solves the first problem of renaming source IDs quite nicely. Now a source ID is just a bit of metadata for a datum stream: updating a single row in the `da_datm_meta` table is all it takes to update the source ID for an entire stream.
The new design solves the second problem of duplicate property names by turning the property names into metadata on a datum stream as well. As more and more datum are collected for a given stream, there will still be only one row in the `da_datm_meta` table. SolarNetwork is also flexible enough that property names can be added over time to a given stream, and `NULL` (empty) values are allowed (and stored efficiently) if a datum doesn’t always contain values for all properties.
The third problem of duplicate statistics is solved by simply not storing the statistics twice. Additionally, the property names are not stored with the statistics, so they are not duplicated either.
The fourth problem of the duplicate date column is solved by dropping the local date column and relying on a join to the `da_datm_meta.time_zone` stream metadata column to translate the timestamp values into node-local dates.
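For example, the stored timestamp from the sample row can be translated to the stream's local date using the `Pacific/Auckland` value from the metadata (a Python sketch of the translation, not the actual SQL join):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# 2021-01-01 12:00:00+13 stored as an absolute instant (UTC).
ts_start = datetime(2020, 12, 31, 23, 0, tzinfo=timezone.utc)

# The stream metadata's time zone turns it back into a node-local date.
local = ts_start.astimezone(ZoneInfo("Pacific/Auckland"))
local_date = local.date()
```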
The fifth problem of the “last seen” status property values is solved by calculating the “most seen” value instead.
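A sketch of the difference between the two rules, using the 59 “OK” / 1 “Error” example from earlier:

```python
from collections import Counter

# 59 "OK" readings followed by a single "Error" reading in the hour.
status_values = ["OK"] * 59 + ["Error"]

last_seen = status_values[-1]                          # old rule
most_seen = Counter(status_values).most_common(1)[0][0]  # new rule
```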
Overall, this change significantly reduced the size of the datum aggregate tables in SolarNetwork, and has resulted in notably faster query times.