News » Datum aggregate database schema redesign
This is a follow-on to the Datum database schema redesign post, where I outlined a redesign of the main datum database schema used by SolarNetwork. Another big part of what SolarNetwork does is take all the raw data posted by SolarNode devices and calculate aggregate summaries of common time periods (hours, days, and months) and store them so they don’t need to be calculated every time someone wants to view their data at one of those aggregate levels. This enables SolarNetwork to respond very quickly to queries like “show me the average power per day generated by my solar panels over the past month” or “show me the energy used per hour by my house over the past week”.
What that means is SolarNetwork continuously computes and stores hourly, daily, and monthly aggregate records for every datum stream; these records contain averages, minimum and maximum values, and other statistical data for each aggregate level.
For example, an hourly node datum computed from a power meter datum stream might have properties like this:
Property | Value |
---|---|
`ts_start` | 2021-01-01 12:00:00+13 |
`node_id` | 123 |
`source_id` | /meter/1 |
`watts` | 2400 |
`watts_min` | 350 |
`watts_max` | 13100 |
`watts_count` | 60 |
`wattHours` | 2411 |
`wattHours_start` | 340983098 |
`wattHours_end` | 340985509 |
`status` | OK |
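To make the relationship between these properties concrete, here is a minimal Python sketch (not SolarNetwork's actual implementation) of how such an hourly record could be derived from raw samples; the sample values here are invented for illustration:

```python
from statistics import mean

# Hypothetical raw samples from a power meter datum stream within one hour.
samples = [
    {"watts": 350, "wattHours": 340983098},
    {"watts": 2400, "wattHours": 340984200},
    {"watts": 13100, "wattHours": 340985509},
]

watts = [s["watts"] for s in samples]
watt_hours = [s["wattHours"] for s in samples]

hourly = {
    "watts": mean(watts),             # average of instantaneous values
    "watts_min": min(watts),
    "watts_max": max(watts),
    "watts_count": len(watts),        # how many values contributed
    "wattHours_start": watt_hours[0],
    "wattHours_end": watt_hours[-1],
    # accumulating properties aggregate to the difference over the period
    "wattHours": watt_hours[-1] - watt_hours[0],
}
```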
Old database design
Until now, SolarNetwork stored this data as a set of JSON objects, based on a simple classification of instantaneous vs accumulating vs status property types. Along with columns for the timestamp, object ID, and source ID of the datum, the database table had JSON columns for each set of properties, and several JSON columns for statistical metadata, sort of like this:
Column | Datatype | Value |
---|---|---|
`ts_start` | timestamp | 2021-01-01 12:00:00+13 |
`node_id` | bigint | 123 |
`source_id` | text | /meter/1 |
`jdata_i` | JSON | `{"watts":2400,"watts_min":350,"watts_max":13100}` |
`jdata_a` | JSON | `{"wattHours":2411}` |
`jdata_s` | JSON | `{"status":"OK"}` |
`jmeta` | JSON | `{"i":{"watts":{"min":350,"max":13100,"count":60}}}` |
`jdata_as` | JSON | `{"wattHours":340983098}` |
`jdata_af` | JSON | `{"wattHours":340985509}` |
`jdata_ad` | JSON | `{"wattHours":2411}` |
The `jdata_i` column stored the average of the instantaneous properties. The `jmeta` column stored the minimum/maximum values seen for those properties, and the count of values contributing to each average. The `jdata_a` column stored the difference of the accumulating properties. The `jdata_as`, `jdata_af`, and `jdata_ad` columns stored the starting, final, and difference values of the accumulating properties, using the reading aggregation logic (which is slightly different from the logic used to compute the `jdata_a` column). The `jdata_s` column stored the last seen status property values.
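As a sketch of how the example hourly datum maps onto these columns (illustrative only, not SolarNetwork's code), note how `watts` and its min/max values end up repeated across JSON documents:

```python
import json

# Old-style JSON column layout for the example hourly datum.
jdata_i = {"watts": 2400, "watts_min": 350, "watts_max": 13100}
jmeta = {"i": {"watts": {"min": 350, "max": 13100, "count": 60}}}
jdata_a = {"wattHours": 2411}
jdata_as = {"wattHours": 340983098}
jdata_af = {"wattHours": 340985509}
jdata_ad = {"wattHours": 2411}
jdata_s = {"status": "OK"}

# Each column was persisted as its own JSON document; the name "watts"
# appears in two documents and "wattHours" in four, so property names
# and min/max statistics are stored repeatedly for every row.
row = {name: json.dumps(doc) for name, doc in {
    "jdata_i": jdata_i, "jmeta": jmeta, "jdata_a": jdata_a,
    "jdata_as": jdata_as, "jdata_af": jdata_af, "jdata_ad": jdata_ad,
    "jdata_s": jdata_s,
}.items()}
```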
Design problems
Although quite flexible (and having served SolarNetwork well for many years), the database design had some notable drawbacks:
- Hard to handle source ID changes — source ID values are just text defined by users, but they usually have important information encoded into them, such as a logical topology of where the data is being collected from. Sometimes source IDs need to be changed, but in this design the source ID is physically part of each datum. To change a source ID, every datum row with that source ID needs to be updated, which can be a very expensive operation to perform.
- Property names are duplicated — every datum row stores the property names along with their corresponding values, as multiple JSON objects. Instantaneous property names are stored twice, and accumulating property names are stored four times. This duplication increases the amount of storage required to hold the data, and decreases the speed at which the data can be queried because there is simply more data to sift through.
- Instantaneous statistics are duplicated — the instantaneous `_min` and `_max` values are duplicated in the `jdata_i` and `jmeta` columns. Not only that, since they are part of the `jdata_i` JSON document, every instantaneous property effectively has 3 property names added. This duplication has the same adverse effects as detailed in the previous point.
- Local date column duplicates timestamp column — the `local_date` column was used to be able to show the “local date” of the associated data, without having to join to another table to do so. Storing it here increases the amount of storage required to hold the data.
- Last seen status values are not ideal — using the “last seen” value for status properties does not capture the best overall sense of what happened over the given time period. For example, a stream might report 59 “OK” values followed by 1 “Error” value for a status property. Even though “Error” was the last reported value, for the majority of the time it was actually “OK”.
New database design
To overcome these design problems, we’ve developed a new database design that preserves the general flexibility of how datum aggregates are stored but solves the design issues. Building on the datum stream concept introduced previously, a new table design was devised with the following attributes:
- Store the stream ID instead of the object and source IDs.
- Drop the local date column, and rely on joining with stream metadata to determine the time zone of the stream.
- Store an ordered list of property aggregate values where the order matches the property name list of the associated stream.
- Store an ordered list of instantaneous property statistic values where the order matches the instantaneous property name list of the associated stream. Each property will have an associated 3-element list of (count, min, max) values.
- Store an ordered list of accumulating reading statistic values where the order matches the accumulating property name list of the associated stream. Each property will have an associated 3-element list of (difference, start, end) values.
- Store the most frequently seen status property values, instead of the last seen.
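The packing described by these points can be sketched in Python (an illustrative helper, not SolarNetwork's API); the name lists come from the stream metadata, and the array positions follow them:

```python
# Property name lists from the stream metadata (order is significant).
names_i = ["watts"]
names_a = ["wattHours"]
names_s = ["status"]

# The example hourly datum, as named properties.
datum = {
    "watts": 2400, "watts_min": 350, "watts_max": 13100, "watts_count": 60,
    "wattHours": 2411, "wattHours_start": 340983098,
    "wattHours_end": 340985509,
    "status": "OK",
}

# Values are stored positionally, matching the metadata name lists.
data_i = [datum.get(n) for n in names_i]
data_a = [datum.get(n) for n in names_a]
data_s = [datum.get(n) for n in names_s]
# stat_i: one (count, min, max) triple per instantaneous property
stat_i = [[datum.get(f"{n}_count"), datum.get(f"{n}_min"),
           datum.get(f"{n}_max")] for n in names_i]
# read_a: one (difference, start, end) triple per accumulating property
read_a = [[datum.get(n), datum.get(f"{n}_start"), datum.get(f"{n}_end")]
          for n in names_a]
```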
A diagram makes this easier to take in, showing just the hourly aggregate tables:
The new `agg_datm_hourly` table has a primary key defined by a stream ID and timestamp. The datum property values are stored as `numeric[]` and `text[]` array columns. The instantaneous property statistic values are stored in the `stat_i` `numeric[]` array column. The accumulating property reading values are stored in the `read_a` `numeric[]` array column.
Using the same example datum as shown earlier, in the new database design there would be one row in the `da_datm_meta` “stream metadata” table like this:
Column | Datatype | Value |
---|---|---|
`stream_id` | UUID | a4840264-2b5b-4673-83f7-66a404d6faf0 |
`node_id` | bigint | 123 |
`source_id` | text | /meter/1 |
`time_zone` | text | Pacific/Auckland |
`names_i` | text[] | [watts] |
`names_a` | text[] | [wattHours] |
`names_s` | text[] | [status] |
Then there would be one row in the `agg_datm_hourly` table like this:
Column | Datatype | Value |
---|---|---|
`stream_id` | UUID | a4840264-2b5b-4673-83f7-66a404d6faf0 |
`ts_start` | timestamp | 2021-01-01 12:00:00+13 |
`data_i` | numeric[] | [2400] |
`data_a` | numeric[] | [2411] |
`data_s` | text[] | [OK] |
`stat_i` | numeric[] | [[60,350,13100]] |
`read_a` | numeric[] | [[2411,340983098,340985509]] |
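Reading such a row back requires the stream metadata to recover the property names. A sketch of that decoding (the dict shapes are assumed for illustration, not SolarNetwork's API):

```python
# Stream metadata and aggregate row, matching the example tables above.
meta = {"names_i": ["watts"], "names_a": ["wattHours"],
        "names_s": ["status"]}
row = {
    "data_i": [2400], "data_a": [2411], "data_s": ["OK"],
    "stat_i": [[60, 350, 13100]],
    "read_a": [[2411, 340983098, 340985509]],
}

# Zip the positional arrays back up with their metadata name lists.
props = {}
for name, value in zip(meta["names_i"], row["data_i"]):
    props[name] = value
for name, (count, vmin, vmax) in zip(meta["names_i"], row["stat_i"]):
    props[f"{name}_count"] = count
    props[f"{name}_min"] = vmin
    props[f"{name}_max"] = vmax
for name, value in zip(meta["names_a"], row["data_a"]):
    props[name] = value
for name, (diff, start, end) in zip(meta["names_a"], row["read_a"]):
    props[f"{name}_start"] = start
    props[f"{name}_end"] = end
for name, value in zip(meta["names_s"], row["data_s"]):
    props[name] = value
```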
Problems solved
The new design solves the first problem of renaming source IDs quite nicely. Now a source ID is just a bit of metadata for a datum stream: updating a single row in the `da_datm_meta` table is all it takes to update the source ID for an entire stream.
The new design solves the second problem of duplicate property names by turning the property names into metadata on a datum stream as well. As more and more datum are collected for a given stream, there will still be only one row in the `da_datm_meta` table. SolarNetwork is also flexible enough that property names can be added over time to a given stream, and `NULL` (empty) values are allowed (and stored efficiently) if a datum doesn’t always contain values for all properties.
The third problem of duplicate statistics is solved by simply not storing the statistics twice. Additionally, the property names are not stored with the statistics, so they are not duplicated either.
The fourth problem of the duplicate date column is solved by dropping the local date column and relying on a join to the `da_datm_meta.time_zone` stream metadata column to translate the timestamp values into node-local dates.
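For example, the stored timestamp from the sample row can be translated to the stream's local date using the `Pacific/Auckland` value from the metadata (a Python sketch of the translation, not the actual SQL join):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# 2021-01-01 12:00:00+13 stored as an absolute instant (UTC).
ts_start = datetime(2020, 12, 31, 23, 0, tzinfo=timezone.utc)

# The stream metadata's time zone turns it back into a node-local date.
local = ts_start.astimezone(ZoneInfo("Pacific/Auckland"))
local_date = local.date()
```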
The fifth problem of the “last seen” status property values is solved by calculating the “most seen” value instead.
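A sketch of the difference between the two rules, using the 59 “OK” / 1 “Error” example from earlier:

```python
from collections import Counter

# 59 "OK" readings followed by a single "Error" reading in the hour.
status_values = ["OK"] * 59 + ["Error"]

last_seen = status_values[-1]                          # old rule
most_seen = Counter(status_values).most_common(1)[0][0]  # new rule
```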
Overall, this change significantly reduced the size of the datum aggregate tables in SolarNetwork, and has resulted in notably faster query times.