Outlier Detection on Multiple Temperature Sensors
When running large-scale services, continuously monitoring asset temperatures can provide essential information for smooth long-term operation. Whether the assets are large office spaces, machinery in a production line, or server racks in a data center, many sensors are often deployed at once. If one or more sensors report temperature values that deviate too far from the norm, preventive steps can be taken to avoid further degradation.
Due to their small size and long battery life, Disruptive Technologies (DT) Wireless Temperature Sensors are well suited for monitoring large numbers of assets in parallel. They can be deployed in almost any environment, and by measuring the temperature every 15 minutes, they allow the data trend and behavior to be monitored and possible outliers to be caught in real time.
In this application note, the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm is applied to a stream of 25 temperature sensors with the aim of catching outlier events. As shown in Figure 1, the data from most sensors are similar in both level and trend. Sudden spikes or level shifts caught by the algorithm are therefore considered outliers for which appropriate action can be taken.
Figure 1: One week of temperature data from 25 DT Wireless Temperature Sensors where outlier events in the data caught by the DBSCAN algorithm are highlighted for visibility.
If the aim is to highlight outlier behavior in the temperature originating from a specific device or environment, certain considerations should be taken when mounting the sensors. For instance, if room temperatures throughout a building are of interest, sensors should be placed away from external heating or cooling sources such as air conditioning or direct sunlight. Otherwise, the algorithm might classify such external influence as an outlier, resulting in false alarms.
The implementation is built around using the DT Developer API to interact with a single DT Studio project containing all temperature sensors for which outlier detection is performed. If not already done, a project needs to be created and configured to enable the API functionality.
For authenticating the developer API against a Service Account in your DT Studio project, three separate authentication details have to be located, later to be used in the example code. If you're unfamiliar with the concept, refer to our Introduction to Service Accounts.
The script will use all temperature devices in the target project. Note that DBSCAN works better the more devices you include, preferably 10 or more.
An example code repository is provided in this application note. It illustrates one way of detecting outliers in multistream data and is meant to serve as a precursor for further development and implementation. It uses our Python API to interact with the DT Studio project.
The example code source is publicly hosted on the official Disruptive Technologies GitHub repository under the MIT license. It can be found by following this link.
The code has been written in and tested for Python 3.9+. Dependencies can be installed using pip and the provided requirements text file.
pip3 install -r requirements.txt
Using your authentication details, set the following environment variables.
If the example code is correctly authenticated to the DT Studio project as described above, running the script main.py will start streaming data from each temperature sensor in the project, performing outlier detection as new data arrive.
Use the -h flag to print the additional flags available.
Classifying data for outlier detection is an ongoing research field that has seen many approaches over the years. Lately, machine learning techniques have been the new frontier in this area at the cost of complexity. In contrast, clustering techniques can be comparably simple while still providing good performance. In particular, the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm has been found to provide good performance with relatively little parameter tweaking.
Depending on the application, time-series data are often feature-engineered before being applied to a classification scheme. However, each sample in a time series of length N can also be considered a feature in an N-dimensional space and be applied directly. This was, during testing, found to result in much better performance than extracting mean, kurtosis, skew, and other typical time-series features for cluster input.
Figure 3: Windowing of the most recent 24 hours of data, which are uniformly resampled before being provided as an N-dimensional input to the DBSCAN clustering algorithm.
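The windowing and resampling step can be sketched as follows. This is a minimal illustration and not necessarily how the example repository does it: the function name window_to_vector, the choice of 96 samples per window (one per 15 minutes), and the use of linear interpolation are all assumptions made here, and the sensor data are synthetic.

```python
import numpy as np

def window_to_vector(timestamps, values, window_end,
                     window_seconds=24 * 3600, n_samples=96):
    """Resample the last `window_seconds` of one series onto a uniform grid.

    Hypothetical helper: returns an N-dimensional feature vector (N = n_samples).
    """
    timestamps = np.asarray(timestamps, dtype=float)
    values = np.asarray(values, dtype=float)
    # Uniform time grid covering the window and ending at window_end.
    grid = np.linspace(window_end - window_seconds, window_end, n_samples)
    # np.interp requires increasing x-coordinates, so sort by time first.
    order = np.argsort(timestamps)
    return np.interp(grid, timestamps[order], values[order])

# Synthetic stand-in for three sensors reporting roughly every 15 minutes.
rng = np.random.default_rng(0)
window_end = 7 * 24 * 3600.0  # one week of data, in seconds
features = []
for _ in range(3):
    t = np.arange(0, window_end + 1, 900.0)
    t = t + rng.uniform(-30, 30, size=t.size)  # small reporting jitter
    v = 21 + 0.5 * np.sin(2 * np.pi * t / 86400) + rng.normal(0, 0.05, t.size)
    features.append(window_to_vector(t, v, window_end))

# One row per sensor, one column per resampled sample.
X = np.vstack(features)
print(X.shape)
```

Resampling onto a shared grid matters because the sensors do not report at exactly the same instants, while a distance metric needs vectors whose components line up in time.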
Compared to the likes of k-means clustering, DBSCAN does not require prior knowledge about the number of clusters in the data. It is also unsupervised, simplifying its use in many applications. One feature that makes it particularly useful for outlier detection is its notion of noise in the data. If a point does not fit in any cluster, it is classified as noise instead of the closest match. Figure 4 shows the result of applying DBSCAN on some synthetic data with two features. This website provides excellent animated visualizations of the clustering procedure.
Figure 4: DBSCAN applied to data in 2 dimensions, identifying two individual clusters and noise.
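A two-dimensional case like the one in Figure 4 can be reproduced roughly with scikit-learn's DBSCAN implementation. The data, the eps and min_samples values, and the use of scikit-learn itself are assumptions chosen for this sketch; they are not taken from the figure.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(42)

# Two dense synthetic blobs plus three stray points far from both.
cluster_a = rng.normal(loc=(0.0, 0.0), scale=0.2, size=(50, 2))
cluster_b = rng.normal(loc=(3.0, 3.0), scale=0.2, size=(50, 2))
stray = np.array([[1.5, 1.5], [-2.0, 3.0], [5.0, -1.0]])
X = np.vstack([cluster_a, cluster_b, stray])

# Illustrative parameters; the number of clusters is never specified.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# scikit-learn marks noise with the label -1.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"clusters: {n_clusters}, noise points: {n_noise}")
```

Points that are not assigned to either blob end up with the noise label rather than being forced into the nearest cluster, which is the property that makes the algorithm convenient for outlier detection.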
When grouping the features into clusters, DBSCAN uses a distance metric, here Euclidean distance, to determine if two or more points should be linked. For this, the two search parameters ϵ and p must be given, where ϵ is the search radius and p the minimum number of points that can define a cluster. When scanning the dataset, each N-dimensional point is classified as one out of three possible categories. A core point is defined as one that neighbors at least p other points within a distance of ϵ. A border point is one that can be reached by a core point, but does not fulfill the requirement itself, marking the edge of a cluster. If a point is not reached by any core point, it is defined as noise. Figure 5 shows an example of how points are classified to form a cluster.
Figure 5: Cluster generation procedure of the DBSCAN algorithm where the ϵ neighborhood is found for each point, classifying said point as either noise, border, or core.
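The three point categories can be illustrated with a small from-scratch classifier. This is a sketch of the classification rule only, not a full DBSCAN implementation (it does not assign cluster labels), and the function name classify_points is hypothetical. It counts neighbors excluding the point itself, following the "at least p other points" wording above; note that scikit-learn's min_samples parameter counts the point itself as well.

```python
import numpy as np

def classify_points(X, eps, p):
    """Label each row of X as 'core', 'border', or 'noise'."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances between all points.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    within = dists <= eps
    # Number of *other* points inside the eps-neighborhood of each point.
    n_neighbors = within.sum(axis=1) - 1
    is_core = n_neighbors >= p
    # A border point is not core itself but lies within eps of a core point.
    near_core = (within & is_core[None, :]).any(axis=1)
    return np.where(is_core, "core", np.where(near_core, "border", "noise"))

# Five points along a line and one far away (illustrative data).
points = np.array([[0.0, 0.0], [0.5, 0.0], [1.0, 0.0],
                   [1.5, 0.0], [2.4, 0.0], [10.0, 10.0]])
kinds = classify_points(points, eps=1.0, p=3)
for point, kind in zip(points, kinds):
    print(point, kind)
```

In this example the three middle points each have three neighbors within the radius and become core points, the two end points of the line are reached by a core point without qualifying themselves, and the distant point is noise.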
Finding a balance between generalized behavior and performance is one of the challenges when choosing ϵ and p. Here, if we assume that an outlier does not correlate with other potential outliers, setting p appropriately should result in said outliers being classified as noise by DBSCAN, as there should be no other similar series. On the other hand, ϵ is dynamically recalculated on each call to compensate for changes in the data. By finding the average of every time series in a window, ϵ is calculated as the median Euclidean distance from each series to the average.
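The ϵ calculation described above can be sketched in a few lines of NumPy. The function name dynamic_eps and the synthetic data (24 series differing only by a constant offset, plus one series with a spike) are assumptions for illustration; the input is expected to hold one resampled series per row.

```python
import numpy as np

def dynamic_eps(X):
    """Median Euclidean distance from each series (row) to the average series."""
    X = np.asarray(X, dtype=float)
    mean_series = X.mean(axis=0)                         # average over all series
    distances = np.linalg.norm(X - mean_series, axis=1)  # one distance per series
    return float(np.median(distances))

# 24 synthetic series sharing a daily trend but with different levels.
n_samples = 96
trend = 21 + 0.5 * np.sin(np.linspace(0, 2 * np.pi, n_samples))
offsets = np.linspace(-1.0, 1.0, 24)
regular = trend[None, :] + offsets[:, None]

# A 25th series that follows the trend but contains a 10-degree spike.
spiky = trend.copy()
spiky[40:48] += 10.0
with_outlier = np.vstack([regular, spiky])

eps_regular = dynamic_eps(regular)
eps_with_outlier = dynamic_eps(with_outlier)
print(eps_regular, eps_with_outlier)
```

Because the median is insensitive to a single extreme value, the radius in this synthetic example changes only slightly when the spiky series is added, so the outlier does not widen the neighborhood enough to hide itself.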
The script can be extended to work in real time by utilizing the disruptive.Stream module in our Python API. Below is a short visualization of how outlier classification can work in real time. The GIF is significantly sped up here.
Figure 6: DBSCAN being continuously applied to 25 different temperature data streams in real time as they arrive in the stream, highlighting data that deviate from the rest.
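One way to prepare for real-time operation is to keep a rolling 24-hour buffer per device that is updated as each event arrives, and to rerun the clustering on the buffered windows. The sketch below shows only the buffering part and uses simulated events; the class name RollingWindows and the (device_id, timestamp, celsius) event shape are assumptions made here, and connecting it to the actual event stream of the Python API is left out.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 24 * 3600

class RollingWindows:
    """Keep the most recent `window_seconds` of samples for each device."""

    def __init__(self, window_seconds=WINDOW_SECONDS):
        self.window_seconds = window_seconds
        self.buffers = defaultdict(deque)

    def add(self, device_id, timestamp, celsius):
        buf = self.buffers[device_id]
        buf.append((timestamp, celsius))
        # Evict samples that have fallen out of the window.
        while buf and buf[0][0] < timestamp - self.window_seconds:
            buf.popleft()

    def snapshot(self):
        """Return a copy of the current window for every device."""
        return {dev: list(buf) for dev, buf in self.buffers.items()}

# Simulated events: two devices reporting every 15 minutes for 30 hours.
windows = RollingWindows()
for t in range(0, 30 * 3600 + 1, 900):
    windows.add("sensor-a", t, 21.0)
    windows.add("sensor-b", t, 22.5)

current = windows.snapshot()
print({dev: len(samples) for dev, samples in current.items()})
```

Each snapshot can then be resampled and passed to the clustering step, so that classification always reflects the latest 24 hours of data.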