Anomaly Detection (AD) module

The anomaly detection (AD) module consists of two components: on-node anomaly detection (AD) and paramter server (PS).

Anomaly detection module architecture

Anomaly detection (AD) module: on-node AD module and paramter server (PS).

On-node AD Module

The on-node anomaly detection (AD) module (per applications’ process) takes streamed trace data. Each AD parses the streamed trace data and maintains a function call stack along with any communication events (if available). Then, it determines anomalous function calls that have extraordinary behaviors. If there are any anomalies within the current trace data, the AD module stores them in files or DB. This is where significant data reduction occurs because we only save the anomalies and a few nearby normal function calls of the anomalies.

Statistical anomaly analysis

An anomaly function call is a function call that has a longer (or shorter) execution time than a upper (or a lower) threshold.

\[\begin{split}threshold_{upper} = \mu_{i} + \alpha * \sigma_{i} \\ threshold_{lower} = \mu_{i} - \alpha * \sigma_{i}\end{split}\]

where \(\mu_{i}\) and \(\sigma_{i}\) are mean and standard deviation of execution time of a function \(i\), respectively, and \(\alpha\) is a control parameter (the lesser value, the more anomalies or the more false positives).

Advanced anomaly analysis

TBD

Parameter Server

The parameter server (PS) provides two services:

  • Maintain global parameters to provide consistent and robust anomaly detection power over on-node AD modules

  • Keep a global view of workflow-level performance trace analysis results and stream to visualization server

Scalable Parameter Server

TBD

Furthermore

More details can be found in PerformanceAnalysis documentation.