Note
You can download the code in this technote and execute it as a Jupyter notebook. You also need to download the docker-compose.yaml file used to start the required services.
1 Abstract¶
We demonstrate that the symbol NaN
can be used to represent missing values for floating-point data types across all systems involved in the data flow from the CSCs to the EFD client.
However, the symbol NaN
works for floating-point data types only.
There is no equivalent symbol for other data types that work as well as NaN
across all systems. We don’t have a solution for representing missing values for other data types, and we discourage using a valid number (e.g. -99
) as sentinel value.
That said, using NaN
to represent missing values for floating-point types solves the problem for about 80% of the fields in Telemetry topics.
We recommend the CSCs to explicitly use NaN
for missing values and to fill in sparse arrays with NaN
. We also found that DDS has system default values for each IDL type.
We recommend overwriting the DDS default values for floating-point types to NaN
in the IDL schema definition generated by SAL, in which case DDS will automatically fill in with NaN
a field that exists in the topic schema but is omitted by the CSC.
2 Introduction¶
Telescope & Site implements CSCs for about 60 different telescope subsystems. The CSCs publish data to DDS, and this data is recorded in the EFD.
Missing values can occur in different ways:
A field is not implement. A situation that is often encountered during integration and testing is a DDS topic with a field that exists in the topic schema, but is not implemented in the CSC yet. That happened when testing the Main Telescope Camera Cable Wrap subsystem recently.
A sensor produces no values. Another situation, which can happen at anytime, is when a sensor goes offline.
A sensor produces invalid values. For example, the Environment subsystem has a sensor that measures snow depth. The sensor may produce invalid values if there is no snow. In this case the CSC might represent those values as missing.
In these situations, a value is designated as “missing” in the EFD, and high-level context information (e.g. from the Event logs) is usually required to know why the value is missing.
In practice, we need a representation for missing values when one or more field values in a topic are missing. In this case, there is a timestamp, and there may be other fields with valid values to record in the EFD.
In section 3 Representing missing values we discuss alternatives for representing missing values. In section 4 The data flow we review the data flow and how the different systems involved handle missing values. Finally, in section 6 Conclusion we conclude this investigation with a solution that works for most of the cases, and ensures that we can handle missing values in a convenient way when using the EFD client, the supported method for accessing EFD data.
3 Representing missing values¶
Generally, there are two approaches for representing missing values, one uses a mask to indicate the null status of a value, and the other uses a sentinel value.
In the sentinel value approach, a sentinel value indicates the missing value.
It can use some data-specific convention such as -99
or use a more global convention, such as NaN
(Not a Number), a symbol that is part of the IEEE 754 floating-point specification.
The sentinel value may or may not be stored. See, for example, the sparse data structures in Pandas.
The two approaches have trade-offs.
In particular, not all systems support masking.
While a sentinel value like -99
works for all numeric types, it may cause problems because it is a valid number.
The problem with NaN
is that it is available for floating-point types only.
Most systems support derived data types, where it is possible to mark a member of the derived data type as optional and/or specify a default value for it. In this case, if the missing value is omitted, the system will automatically fill in a default value on behalf of the application. However, the question of which symbol to use as the default value for every primitive data type still remains.
4 The data flow¶
Different systems use different conventions for representing missing values. We have to ensure the representation we use works throughout the complete data flow:
CSCs -> DDS -> SAL Kafka -> Kafka -> InfluxDB Sink -> InfluxDB -> EFD Client
CSCs are implemented in multiple languages (C++, LabVIEW, Java, and Python) and publish records to DDS topics.
SAL Kafka and InfluxDB Sink are of particular importance because they are responsible for converting the records between the data formats used by the different systems. SAL Kafka converts the DDS records into Avro records and forwards them to Kafka. SAL Kafka is also responsible for converting the IDL schema to Avro schema and registering them with the Kafka schema registry. The InfluxDB Sink connector uses the Avro schemas to deserialize the Avro records and convert them into the InfluxDB line protocol format before writing to InfluxDB.
Finally, the EFD client reads from InfluxDB and returns a Pandas dataframe.
In summary, we need to understand how each system handles missing values and how data conversion between them is done.
4.1 CSCs¶
The ts_xml repository contains the interface definition for all CSCs.
As of Feb 5, 2021, ts_xml
defines the schema for 249 Telemetry, 390 Commands, and 533 Events topics for about 60 different subsystems.
From ts_xml
, SAL creates the schema used by DDS in the IDL format.
In Table 1, we show the number of fields in the ts_xml
schema per IDL type in Telemetry, Commands, and Events topics.
IDL type |
Telemetry |
Commands |
Events |
---|---|---|---|
|
197 |
181 |
604 |
|
2 |
7 |
11 |
|
2 |
2 |
2 |
|
1 |
1 |
1 |
|
43 |
83 |
467 |
|
8 |
44 |
102 |
|
9 |
66 |
12 |
|
36 |
67 |
267 |
|
3 |
2 |
13 |
|
2 |
2 |
2 |
|
8 |
2 |
6 |
|
3 |
2 |
4 |
|
437 |
137 |
358 |
|
1091 |
206 |
800 |
Total |
1842 |
802 |
2649 |
|
For the EFD, telemetry is where a representation for missing values is more important.
From Table 1, the majority of the telemetry fields (83%) have float
or double
IDL types.
In Table 1, arrays are counted only once. When the topic field is an array, we show the type of the array items.
Table 2 shows the number of arrays in the ts_xml
schema per array size in Telemetry topics for arrays larger than 50 elements. The largest arrays are from the MTCamera
, MTM1M3
subsystems and usually have float
or double
IDL types. However, there are fairly large arrays in the MTM1M3TS
and MTM2
subsystems with boolean
, unsigned short
or long
IDL types as well.
Array size |
# of arrays |
Type of the array item |
Subsystem |
---|---|---|---|
3024 |
1 |
|
|
197 |
6 |
|
|
156 |
5 |
|
|
144 |
1 |
|
|
112 |
3 |
|
|
100 |
62 |
|
|
96 |
7 |
|
|
78 |
1 |
|
|
72 |
7 |
|
|
71 |
85 |
|
|
68 |
6 |
|
|
67 |
5 |
|
|
50 |
18 |
|
|
4.2 DDS¶
The DDS type system has annotations specify the default values for fields in the IDL schema.
Table 3 shows the system default values in DDS for the most common IDL types.
IDL type |
Default value |
---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
The first value in the enumeration. |
The @default
annotation is used to overwrite the system default values in Table 3.
struct MyTopic {
long id; # default value is 0
float x; # default value is 0
@default(NaN) float y; # default value is NaN
};
In the example above, the @default
annotation overwrites the default value of the MyTopic.y
field to NaN
.
If this field is omitted by the CSC that publishes this topic, DDS would automatically fill in the default value NaN
for the application that subscribes to it.
The default value of an enumeration corresponds to the first value in the enumeration.
The @default_literal
annotation is used to select a different value in the enumeration as the default value.
enum Color {
GREEN,
@default_literal RED,
BLUE
};
Currently, there is no mechanism to overwrite the DDS system default values for fields in the CSCs interface definition (ts_xml
). A possible alternative to use NaN
as the default value for float
, double
, and long double
IDL types, is to modify SAL to use the @default
annotation when creating the IDL schema.
In summary, an application that subscribes to a DDS topic will get what the CSCs publishes or the DDS system default values presented in Table 3.
4.3 Kafka¶
We use Apache Avro to enconde messages in Kafka.
In Avro records
the default value for a field can be specified as follows:
{
"type": "record",
"name": "foo",
"fields" : [
{"name": "bar", "type": "float"},
{"name": "baz", "type": "float", "default": NaN} // field baz default to NaN
]
}
From this and previous sections, SAL Kafka can, in principle, convert an IDL schema with default values into an Avro schema with default values.
Also, we have confirmed that NaN
values in a DDS record are passed along to Kafka by SAL Kafka as NaN
.
4.4 InfluxDB¶
InfluxDB represents a point in a time series by the line protocol:
<measurement>[,<tag_key>=<tag_value>[,<tag_key>=<tag_value>]] <field_key>=<field_value> [,<field_key>=<field_value>] [<timestamp>]
In the EFD, a mesasurement corresponds to a DDS topic, and tags are metadata associated with fields. In InfluxDB, we specify a timestamp in the Unix epoch. If a timestamp is not specified, InfluxDB uses the server’s local time with nanosecond precision.
In the simplest case, if tags and timestamps are not specified, the above simplifies to:
<measurement> <field_key>=<field_value>[,<field_key>=<field_value>]
In InfluxDB 1.8, the default type for field values is float
, however InfluxDB 1.8 does not support NaN
values (see InfluxDB data types), which has been subject to a long debate.
We can show that by running a local instance of InfluxDB.
%%bash
docker-compose up -d influxdb
By default, the InfluxDB API runs at http://localhost:8086. Let’s use the Python requests
module to interact with the InfluxDB API and create the mydb
database for our tests:
import requests
requests.post(url="http://localhost:8086/query", params={'q':'CREATE DATABASE "mydb"'})
<Response [200]>
Now let’s write a point to the measurement foo
with a field key bar
and value NaN
:
p = "foo bar=NaN"
r = requests.post(url="http://localhost:8086/write?db=mydb", data=p)
r.text
'{"error":"unable to parse \'foo bar=NaN\': invalid number"}\n'
This confirms that we cannot use NaN
in InfluxDB to represent a missing value for the float
type.
However, because InfluxDB is a schema-less database, we can change the schema on write. We can add new fields or drop existing fields at any time. This suggests that fields are optional and that InfluxDB should automatically fill them in with a default value.
To verify this property of InfluxDB, let’s write a sequence of points, and change the schema as we write.
p1 = "foo bar=1.0,baz=1.0"
p2 = "foo baz=2.0"
p3 = "foo bar=3.0,baz=3.0"
requests.post(url="http://localhost:8086/write?db=mydb", data=p1)
requests.post(url="http://localhost:8086/write?db=mydb", data=p2)
requests.post(url="http://localhost:8086/write?db=mydb", data=p3)
<Response [204]>
The following query returns the foo
measurement:
r = requests.get(url="http://localhost:8086/query", params={'q': 'SELECT * FROM "mydb"."autogen"."foo"'})
r.json()['results'][0]['series']
[{'name': 'foo',
'columns': ['time', 'bar', 'baz'],
'values': [['2021-03-11T17:07:08.257339123Z', 1, 1],
['2021-03-11T17:07:08.271710713Z', None, 2],
['2021-03-11T17:07:08.27527696Z', 3, 3]]}]
Notice that when querying the foo measurement, InfluxDB returns None
for the missing value of bar in the second point.
The Python keyword None
is used here as the default value for an optional parameter (the InfluxDB field in this case), as expected.
The InfluxDB Sink connector is the right place for handling missing values.
In particular, the influxdb-java
library used by the connector, skips fields with NaN values when writing to InfluxDB.
This feature was implemented upstream a month after we first noticed this problem back in September 2019, during the AuxTel integration activities at the Summit.
In summary, fields are optional, and we should skip missing values when writing to InfluxDB.
4.4.1 Filling in time intervals with no data¶
When querying InfluxDB, you can group the result by time intervals and use fill() to specify how InfluxDB handles time intervals with no data.
For example, this query will resample the values on a regular time grid of 10ms
and use fill(linear)
to perform a linear interpolation for time intervals with no data:
r = requests.get(url="http://localhost:8086/query", params={'q': 'SELECT mean(baz), mean(baz) FROM "mydb"."autogen"."foo" GROUP BY time(10ms) fill(linear)'})
r.json()['results'][0]['series']
[{'name': 'foo',
'columns': ['time', 'mean', 'mean_1'],
'values': [['2021-03-11T17:07:08.25Z', 1, 1],
['2021-03-11T17:07:08.26Z', 1.75, 1.75],
['2021-03-11T17:07:08.27Z', 2.5, 2.5],
['2021-03-11T17:07:08.28Z', None, None],
['2021-03-11T17:07:08.29Z', None, None]]}]
4.4.2 Writing arrays to InfluxDB¶
InfluxDB does not support derived types like arrays, the solution we found was to extract the array items in the InfluxDB Sink connector and write them to individual fields in InfluxDB.
For sparse arrays with NaN
values the connector will extract them and skip NaN
values before writing to InfluxDB.
For example, if the Avro record contains a field with the array [1.0, NaN, 2.0]
, the InfluxDB Sink connector will first extract the array items into the field set foo1=1.0,foo2=NaN,foo3=2.0
and then write foo1=1.0,foo3=3.0
to InfluxDB.
4.5 The EFD client¶
The EFD client uses the aioinflux Python client for InfluxDB.
Here we show that missing values in InfluxDB are converted back to NaN
when aioinflux
returns a Pandas dataframe:
from aioinflux import InfluxDBClient
client = InfluxDBClient(db="mydb", output="dataframe")
await client.query('SELECT * FROM foo')
bar | baz | |
---|---|---|
2021-03-11 17:07:08.257339123+00:00 | 1.0 | 1 |
2021-03-11 17:07:08.271710713+00:00 | NaN | 2 |
2021-03-11 17:07:08.275276960+00:00 | 3.0 | 3 |
which is a convenient way of representing missing values with the Pandas float64
dtype. For more information on working with missing data in Pandas, we refer the reader to this guide.
5 Recommendations¶
We recommend adopting the symbol
NaN
to represent missing values for floating-point data types in the CSCs.For DDS topics with sparse arrays of floating-point data types, we also recommend filling in the missing values with
NaN
values.We should consider using the
@default
annotation in the IDL specification to set the default values forfloat
,double
andlong double
fields toNaN
instead of0
. That would prevent the InfluxDB Sink connector from recording a0
to InfluxDB when it should instead skip theNaN
value.There is no obvious symbol to represent missing values for other data types that works across all systems in our data flow. We recommend revisiting this issue only if it becomes a problem in the future and follow as much as possible what Pandas does.
6 Conclusion¶
In this investigation we demonstrate that the symbol NaN
can be used to represent missing values for floating-point data types across all systems in our data flow. While this is a partial solution to the problem, we showed that it covers about 80% of the fields in Telemetry topics.
With this solution in place, a field with the value NaN
in a DDS record is passed along to an Avro record in Kafka.
The InfluxDB Sink connector skip NaN
values before writing to InfluxDB so that they are correctly represented as “missing” in InfluxDB.
In particular, we showed that when querying InfluxDB in Python, it fills the missing values with None
.
Finally, in the EFD client, missing values in InfluxDB are returned back as NaN
which is a convenient way of representing missing values in Pandas.