SQR-053: Representing missing values in the EFD

  • Angelo Fausti

Latest Revision: 2021-03-11

Note

You can download the code in this technote and execute it as a Jupyter notebook. You also need to download the docker-compose.yaml file used to start the required services.

Download Jupyter notebook

1 Abstract

In this technote, we discuss the problem of representing missing values in the EFD.

In particular, we demonstrate that we can use the symbol NaN to represent missing values for floating-point data types and that it works across all systems involved from a CSC to the EFD client.

The symbol NaN, however, does not solve the problem entirely: it works for floating-point data types only, and there is no equivalent symbol for other data types that works as well as NaN across all systems. Currently, we don’t have a solution for representing missing values for other data types, and we discourage using a valid number such as -99 as a sentinel value for integers.

That said, using NaN for floating-point types solves the problem for about 80% of the fields in Telemetry topics, and we recommend that the CSCs explicitly use NaN for missing values and fill sparse arrays with NaN when appropriate.

We also found that DDS has system default values for each IDL type. The default value used by DDS for floating-point types is 0. We recommend overwriting the default value with NaN in the IDL schema definition generated by SAL, in which case DDS will automatically fill in NaN for a field that exists in the topic schema but is omitted by the CSC.

2 Introduction

Telescope & Site implements CSCs for about 60 different telescope subsystems. The CSCs publish data to DDS, and this data is recorded in the EFD.

Missing values can occur in different ways:

  • A field is not implemented. A situation often encountered during integration and testing is a DDS topic with a field that exists in the topic schema but is not yet implemented in the CSC. That happened recently when testing the Main Telescope Camera Cable Wrap subsystem.

  • A sensor produces no values. Another situation, which can happen at any time, is when a sensor goes offline.

  • A sensor produces invalid values. For example, the Environment subsystem has a sensor that measures snow depth. The sensor may produce invalid values if there is no snow. In this case the CSC might represent those values as missing.

In these situations, a value is designated as “missing” in the EFD, and high-level context information (e.g. from the Event logs) is usually required to know why the value is missing.

In practice, we need a representation for missing values when one or more field values in a topic are missing. In this case, there is a timestamp, and there may be other fields with valid values to record in the EFD.

In Section 3 (Representing missing values) we discuss alternatives for representing missing values. In Section 4 (The data flow) we review the data flow and how the different systems involved handle missing values. Finally, in Section 6 (Conclusion) we conclude this investigation with a solution that works for most cases and ensures that we can handle missing values in a convenient way when using the EFD client, the supported method for accessing EFD data.

3 Representing missing values

Generally, there are two approaches for representing missing values: one uses a mask to indicate the null status of a value, and the other uses a sentinel value.

In the sentinel approach, a special value indicates that a value is missing. It can follow a data-specific convention, such as -99, or a more global convention, such as NaN (Not a Number), a symbol that is part of the IEEE 754 floating-point specification. The sentinel value may or may not be stored. See, for example, the sparse data structures in Pandas.

The two approaches have trade-offs. In particular, not all systems support masking. While a sentinel value like -99 works for all numeric types, it may cause problems because it is a valid number. The problem with NaN is that it is available for floating-point types only.
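To make the trade-off concrete, here is a minimal Python illustration of the two approaches, using a NumPy masked array for masking and NaN as the sentinel value in a Pandas series (this example is not part of the EFD data flow):

import numpy as np
import pandas as pd

# Masking: a separate boolean mask marks the second value as missing.
masked = np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])
print(masked.mean())  # 2.0, the masked value is ignored

# Sentinel: NaN is stored in place of the missing value.
series = pd.Series([1.0, np.nan, 3.0])
print(series.mean())  # 2.0, Pandas skips NaN by default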

Most systems support derived data types, where it is possible to mark a member of the derived data type as optional and/or specify a default value for it. In this case, if the missing value is omitted, the system will automatically fill in a default value on behalf of the application. However, the question of which symbol to use as the default value for every primitive data type still remains.
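As a rough analogy, a Python dataclass can play the role of a derived data type with a default value for an optional member (the field names below are hypothetical):

import math
from dataclasses import dataclass

@dataclass
class WeatherSample:
    temperature: float
    snowDepth: float = math.nan  # filled in automatically when omitted

print(WeatherSample(temperature=12.3))
# WeatherSample(temperature=12.3, snowDepth=nan)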

4 The data flow

Different systems use different conventions for representing missing values. We have to ensure the representation we use works throughout the complete data flow:

CSCs -> DDS -> SAL Kafka -> Kafka -> InfluxDB Sink -> InfluxDB -> EFD Client

CSCs are implemented in multiple languages (C++, LabVIEW, Java, and Python) and publish records to DDS topics.

SAL Kafka and InfluxDB Sink are of particular importance because they are responsible for converting the records between the data formats used by the different systems. SAL Kafka converts the DDS records into Avro records and forwards them to Kafka. SAL Kafka is also responsible for converting the IDL schemas to Avro schemas and registering them with the Kafka schema registry. The InfluxDB Sink connector uses the Avro schemas to deserialize the Avro records and convert them into the InfluxDB line protocol format before writing to InfluxDB.

Finally, the EFD client reads from InfluxDB and returns a Pandas dataframe.

In summary, we need to understand how each system handles missing values and how data conversion between them is done.

4.1 CSCs

The ts_xml repository contains the interface definition for all CSCs.

As of Feb 5, 2021, ts_xml defines the schema for 249 Telemetry, 390 Commands, and 533 Events topics for about 60 different subsystems. From ts_xml, SAL creates the schema used by DDS in the IDL format.

In Table 1, we show the number of fields in the ts_xml schema per IDL type in Telemetry, Commands, and Events topics.

Table 1 The number of fields in the ts_xml schema per IDL type in Telemetry, Commands, and Events topics. In parentheses, new names for fixed-width integer types as introduced in IDL 4.2.

IDL type                   Telemetry   Commands   Events
boolean                          197        181      604
byte                               2          7       11
octet*                             2          2        2
char*                              1          1        1
string                            43         83      467
int                                8         44      102
short (int16)                      9         66       12
long (int32)                      36         67      267
long long (int64)                  3          2       13
unsigned int*                      2          2        2
unsigned short (uint16)            8          2        6
unsigned long (uint32)             3          2        4
float                            437        137      358
double                          1091        206      800
Total                           1842        802     2649

* Only present in test topics.

For the EFD, telemetry is where a representation for missing values is most important. From Table 1, the majority of the telemetry fields (83%) have float or double IDL types.

In Table 1, arrays are counted only once. When the topic field is an array, we show the type of the array items.

Table 2 shows the number of arrays in the ts_xml schema per array size in Telemetry topics, for arrays with 50 or more elements. The largest arrays are from the MTCamera and MTM1M3 subsystems and usually have float or double IDL types. However, there are fairly large arrays in the MTM1M3TS and MTM2 subsystems with boolean, unsigned short, or long IDL types as well.

Table 2 The largest arrays in Telemetry topics in the ts_xml schema.

Array size   # of arrays   Type of the array item   Subsystem
3024                   1   double                   MTCamera
197                    6   double                   MTCamera
156                    5   float, double            MTM1M3
144                    1   double                   MTCamera
112                    3   float                    MTM1M3
100                   62   float, double            ATMCS, ATPtg, MTPtg, MTM1M3
96                     7   boolean, float           MTM1M3TS
78                     1   unsigned short           MTM2
72                     7   long, double             MTM2
71                    85   double                   MTCamera
68                     6   float                    MTDome
67                     5   double                   MTCamera
50                    18   double                   MTVMS

4.2 DDS

The DDS type system has annotations to specify the default values for fields in the IDL schema.

Table 3 shows the system default values in DDS for the most common IDL types.

Table 3 System default values in DDS.

IDL type                      Default value
boolean                       FALSE
byte                          0x00
int16, int32, int64           0
uint16, uint32, uint64        0
float, double, long double    0
char                          '\0'
string                        ""
enum                          The first value in the enumeration.

The @default annotation is used to overwrite the system default values in Table 3.

struct MyTopic {
  long id;                // default value is 0
  float x;                // default value is 0
  @default(NaN) float y;  // default value is NaN
};

In the example above, the @default annotation overwrites the default value of the MyTopic.y field to NaN. If this field is omitted by the CSC that publishes this topic, DDS would automatically fill in the default value NaN for the application that subscribes to it.

The default value of an enumeration corresponds to the first value in the enumeration. The @default_literal annotation is used to select a different value in the enumeration as the default value.

enum Color {
  GREEN,
  @default_literal RED,
  BLUE
};

Currently, there is no mechanism in the CSCs interface definition (ts_xml) to overwrite the DDS system default values. A possible way to use NaN as the default value for the float, double, and long double IDL types is to modify SAL to add the @default annotation when creating the IDL schema.

In summary, an application that subscribes to a DDS topic will get what the CSC publishes, or the DDS system default values presented in Table 3 for fields that are omitted.

4.3 Kafka

We use Apache Avro to encode messages in Kafka. In an Avro schema, the default value for a field can be specified as follows:

{
  "type": "record",
  "name": "foo",
  "fields" : [
    {"name": "bar", "type": "float"},
    {"name": "baz", "type": "float", "default": NaN} // field baz default to NaN
  ]
}

From this and previous sections, SAL Kafka can, in principle, convert an IDL schema with default values into an Avro schema with default values.

Also, we have confirmed that NaN values in a DDS record are passed along to Kafka by SAL Kafka as NaN.
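We can sketch this round trip in Python (this is not the SAL Kafka implementation; it assumes the fastavro package) to verify that a float NaN survives Avro serialization and deserialization:

import io
import math
import fastavro

schema = {
    "type": "record",
    "name": "foo",
    "fields": [
        {"name": "bar", "type": "float"},
        {"name": "baz", "type": "float"},
    ],
}

# Write a record with a NaN value and read it back from the same buffer.
buf = io.BytesIO()
fastavro.writer(buf, schema, [{"bar": 1.0, "baz": math.nan}])
buf.seek(0)
record = next(fastavro.reader(buf))
print(math.isnan(record["baz"]))  # True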

4.4 InfluxDB

InfluxDB represents a point in a time series by the line protocol:

<measurement>[,<tag_key>=<tag_value>[,<tag_key>=<tag_value>]] <field_key>=<field_value> [,<field_key>=<field_value>] [<timestamp>]

In the EFD, a measurement corresponds to a DDS topic, and tags are metadata associated with fields. In InfluxDB, we specify timestamps in the Unix epoch. If a timestamp is not specified, InfluxDB uses the server’s local time with nanosecond precision.

In the simplest case, if tags and timestamps are not specified, the above simplifies to:

<measurement> <field_key>=<field_value>[,<field_key>=<field_value>]

In InfluxDB 1.8, the default type for field values is float; however, InfluxDB 1.8 does not support NaN values (see InfluxDB data types), which has been the subject of a long debate.

We can show that by running a local instance of InfluxDB.

%%bash
docker-compose up -d influxdb

By default, the InfluxDB API runs at http://localhost:8086. Let’s use the Python requests module to interact with the InfluxDB API and create the mydb database for our tests:

import requests
requests.post(url="http://localhost:8086/query", params={'q':'CREATE DATABASE "mydb"'})
<Response [200]>

Now let’s write a point to the measurement foo with a field key bar and value NaN:

p = "foo bar=NaN"
r = requests.post(url="http://localhost:8086/write?db=mydb", data=p)
r.text
'{"error":"unable to parse \'foo bar=NaN\': invalid number"}\n'

This confirms that we cannot use NaN in InfluxDB to represent a missing value for the float type.

However, because InfluxDB is a schema-less database, we can change the schema on write. We can add new fields or drop existing fields at any time. This suggests that fields are optional and that InfluxDB should automatically fill them in with a default value.

To verify this property of InfluxDB, let’s write a sequence of points, and change the schema as we write.

p1 = "foo bar=1.0,baz=1.0"
p2 = "foo baz=2.0"
p3 = "foo bar=3.0,baz=3.0"
requests.post(url="http://localhost:8086/write?db=mydb", data=p1)
requests.post(url="http://localhost:8086/write?db=mydb", data=p2)
requests.post(url="http://localhost:8086/write?db=mydb", data=p3)
<Response [204]>

The following query returns the foo measurement:

r = requests.get(url="http://localhost:8086/query", params={'q': 'SELECT * FROM "mydb"."autogen"."foo"'})
r.json()['results'][0]['series']
[{'name': 'foo',
  'columns': ['time', 'bar', 'baz'],
  'values': [['2021-03-15T17:17:41.087503933Z', 1, 1],
   ['2021-03-15T17:17:41.107504568Z', None, 2],
   ['2021-03-15T17:17:41.111964009Z', 3, 3]]}]

Notice that, when querying the foo measurement, InfluxDB returns None for the missing value of bar in the second point. The null in the JSON response is converted to the Python None, which here acts as the default value for the optional field, as expected.

The InfluxDB Sink connector is the right place for handling missing values. In particular, the influxdb-java library used by the connector skips fields with NaN values when writing to InfluxDB. This feature was implemented upstream a month after we first noticed this problem back in September 2019, during the AuxTel integration activities at the Summit.

In summary, fields are optional, and we should skip missing values when writing to InfluxDB.
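The NaN-skipping behavior can be sketched in Python as follows (the connector itself is implemented in Java; the helper below is hypothetical):

import math

def to_line_protocol(measurement, fields):
    """Build an InfluxDB line protocol entry, skipping fields with NaN values."""
    valid = {k: v for k, v in fields.items()
             if not (isinstance(v, float) and math.isnan(v))}
    field_set = ",".join(f"{k}={v}" for k, v in valid.items())
    return f"{measurement} {field_set}"

print(to_line_protocol("foo", {"bar": float("nan"), "baz": 2.0}))
# foo baz=2.0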

4.4.1 Filling in time intervals with no data

When querying InfluxDB, you can group the result by time intervals and use fill() to specify how InfluxDB handles time intervals with no data.

For example, this query will resample the values on a regular time grid of 10ms and use fill(linear) to perform a linear interpolation for time intervals with no data:

r = requests.get(url="http://localhost:8086/query", params={'q': 'SELECT mean(baz), mean(baz) FROM "mydb"."autogen"."foo" GROUP BY time(10ms) fill(linear)'})
r.json()['results'][0]['series']
[{'name': 'foo',
  'columns': ['time', 'mean', 'mean_1'],
  'values': [['2021-03-15T17:17:41.08Z', 1, 1],
   ['2021-03-15T17:17:41.09Z', 1.5, 1.5],
   ['2021-03-15T17:17:41.1Z', 2, 2],
   ['2021-03-15T17:17:41.11Z', 3, 3],
   ['2021-03-15T17:17:41.12Z', None, None],
   ['2021-03-15T17:17:41.13Z', None, None],
   ['2021-03-15T17:17:41.14Z', None, None]]}]

4.4.2 Writing arrays to InfluxDB

InfluxDB does not support derived types like arrays; the solution we found was to have the InfluxDB Sink connector extract the array items and write them to individual fields in InfluxDB.

For sparse arrays with NaN values, the connector extracts the array items and skips the NaN values before writing to InfluxDB.

For example, if the Avro record contains a field with the array [1.0, NaN, 2.0], the InfluxDB Sink connector will first extract the array items into the field set foo1=1.0,foo2=NaN,foo3=2.0 and then write foo1=1.0,foo3=2.0 to InfluxDB.
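A minimal Python sketch of this behavior (the connector is implemented in Java; the helper below is hypothetical and follows the field naming in the example above):

import math

def flatten_array_field(name, values):
    """Expand an array field into indexed scalar fields, skipping NaN items."""
    return {f"{name}{i}": v
            for i, v in enumerate(values, start=1)
            if not (isinstance(v, float) and math.isnan(v))}

print(flatten_array_field("foo", [1.0, float("nan"), 2.0]))
# {'foo1': 1.0, 'foo3': 2.0}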

4.5 The EFD client

The EFD client uses the aioinflux Python client for InfluxDB.

Here we show that missing values in InfluxDB are converted back to NaN when aioinflux returns a Pandas dataframe:

from aioinflux import InfluxDBClient
client = InfluxDBClient(db="mydb", output="dataframe")
await client.query('SELECT * FROM foo')
                                     bar  baz
2021-03-15 17:17:41.087503933+00:00  1.0    1
2021-03-15 17:17:41.107504568+00:00  NaN    2
2021-03-15 17:17:41.111964009+00:00  3.0    3

which is a convenient way of representing missing values with the Pandas float64 dtype. For more information on working with missing data in Pandas, we refer the reader to this guide.
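For illustration, a few standard Pandas operations for handling missing values in a dataframe like the one above (this example is constructed here, not queried from the EFD):

import numpy as np
import pandas as pd

df = pd.DataFrame({"bar": [1.0, np.nan, 3.0], "baz": [1.0, 2.0, 3.0]})

print(df["bar"].isna())         # boolean mask of the missing values
print(df.dropna())              # drop rows that contain missing values
print(df["bar"].interpolate())  # linear interpolation: 1.0, 2.0, 3.0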

5 Recommendations

  • We recommend adopting the symbol NaN to represent missing values for floating-point data types in the CSCs.

  • For DDS topics with sparse arrays of floating-point data types, we also recommend filling in the missing values with NaN values.

  • We should consider using the @default annotation in the IDL specification to set the default values for float, double, and long double fields to NaN instead of 0. That way, the InfluxDB Sink connector would skip the NaN value instead of recording a 0 in InfluxDB.

  • There is no obvious symbol to represent missing values for other data types that works across all systems in our data flow. We recommend revisiting this issue only if it becomes a problem in the future and, as much as possible, following what Pandas does.

6 Conclusion

In this investigation we demonstrate that the symbol NaN can be used to represent missing values for floating-point data types across all systems in our data flow. While this is a partial solution to the problem, we showed that it covers about 80% of the fields in Telemetry topics.

With this solution in place, a field with the value NaN in a DDS record is passed along to an Avro record in Kafka. The InfluxDB Sink connector skips NaN values before writing to InfluxDB so that they are correctly represented as “missing” in InfluxDB. In particular, we showed that, when querying InfluxDB from Python, missing values are returned as None.

Finally, in the EFD client, missing values in InfluxDB are returned as NaN, which is a convenient way of representing missing values in Pandas.