News » Datum database schema redesign

A big part of what SolarNetwork does is collect and store data from all sorts of devices and services. The type of data collected varies quite a bit, so SolarNetwork has a flexible design for storing it, which we call datum. A datum is a collection of properties tied to a specific timestamp and object and source. An object is either a SolarNode device (a node) or a physical location (such as a building address or GPS coordinates). A source is a user-assigned name.

For example, a node datum collected from a power meter might have properties like this:

Property	Value
`ts`	`2021-01-01 12:00:00+13`
`node_id`	`123`
`source_id`	`/meter/1`
`current`	`9.733`
`voltage`	`239.5`
`watts`	`2331`
`wattHours`	`340983098`
`status`	`OK`

Old database design

Until now, SolarNetwork stored this data as a set of JSON objects, based on a simple classification of instantaneous vs accumulating vs status property types. Along with columns for the timestamp, object ID, and source ID of the datum, the database table had JSON columns for each set of properties, sort of like this:

Column	Datatype	Value
`ts`	timestamp	`2021-01-01 12:00:00+13`
`node_id`	bigint	`123`
`source_id`	text	`/meter/1`
`jdata_i`	JSON	`{"current":9.733,"voltage":239.5,"watts":2331}`
`jdata_a`	JSON	`{"wattHours":340983098}`
`jdata_s`	JSON	`{"status":OK}`

Additionally, users are allowed to store arbitrary metadata for object + source ID pairs, in the form of a JSON object. In the old database design a separate table holds this information, using the object ID and source ID pair as its primary key. A row in the database table for node source metadata might look like this:

Column	Datatype	Value
`node_id`	bigint	`123`
`source_id`	text	`/meter/1`
`jdata`	JSON	`{"description":"Power meter for house supply."}`

Design problems

Although quite flexible (and having served SolarNetwork well for many years), the database design had some notable drawbacks:

Hard to handle source ID changes — source ID values are just text defined by users, but they usually have important information encoded into them, such as a logical typology of where the data is being collected from. Sometimes source IDs need to be changed, but in this design the source ID is physically part of each datum. To change the source ID, every datum row with the source ID to change needs to be updated, which can be a very expensive operation to perform.
Property names are duplicated — every datum row stores the property names along with their corresponding values, as JSON objects. This is very inefficient given most datum for a given source ID will contain the same properties, just with different values. This duplication increases the amount of storage required to hold the data, and decreases the speed at which the data can be queried because there is simply more data to sift through.

New database design

To overcome these design problems, we’ve developed a new database design that preserves the general flexibility of how datum are stored but solves the design issues. We introduced the concept of a datum stream, with these changes:

Assign a UUID to each object ID and source ID pair; call this a stream ID.
Store the stream ID and associated object ID and source ID in a new table.
Store the stream ID with each datum.
For each stream maintain an ordered list of property names for each property classification.
For each datum store ordered lists of property values where the order matches the property names list of the associated stream.

A diagram makes this easier to take in:

Table redesign diagram

The new da_datm table has a primary key defined by a stream ID and timestamp. The datum property values are stored as numeric[] and text[] array columns.

The new da_datm_meta and da_loc_datm_meta tables associate stream IDs to node/location ID and source ID pairs. These tables hold the datum property names for each property classification in text[] array columns (the names_* columns). They also hold user-defined JSON metadata for each stream in the jdata column, formerly stored in the da_meta and da_loc_meta tables.

Using the same example datum as shown earlier, in the new database design there would be one row in the da_datm_meta table like this:

Column	Datatype	Value
`stream_id`	UUID	`a4840264-2b5b-4673-83f7-66a404d6faf0`
`node_id`	bigint	`123`
`source_id`	text	`/meter/1`
`names_i`	text[]	`[current,voltage,watts]`
`names_a`	text[]	`[wattHours]`
`names_s`	text[]	`[status]`

Then there would be one row in the da_datm table like this:

Column	Datatype	Value
`stream_id`	UUID	`a4840264-2b5b-4673-83f7-66a404d6faf0`
`ts`	timestamp	`2021-01-01 12:00:00+13`
`data_i`	numeric[]	`[9.733,239.5,2331]`
`data_a`	numeric[]	`[340983098]`
`data_s`	text[]	`[OK]`

Problems solved

The new design solves the first problem of renaming source IDs quite nicely. Now a source ID is just a bit of metadata for a datum stream: updating a single row in the da_datm_meta table is all it takes to update the source ID for an entire stream. In fact, object IDs are now essentially another bit of metadata for a datum stream, and could be easily updated if needed.

The new design solves the second problem of duplicate property names by turning the property names into metadata on a datum stream as well. As more and more datum are collected for a given stream, there will still be only one row in the da_datm_meta table. SolarNetwork is also flexible enough that property names can be added over time to a given stream, and NULL (empty) values are allowed (and stored efficiently) if a datum doesn’t always contain values for all properties.