Altair SmartWorks Analytics

 

Model Evaluation Explained

Supported Metrics

The Model Evaluation plugin is designed for single-output models trained on supervised learning data. It provides metrics for classification and regression problems (Table 1).

 

Table 1. Available metrics for supervised learning problems

Prediction Type

Meaningful Metrics

Binary Classification

  • Accuracy

  • Area under the receiver operating characteristic curve (AUC–ROC)

  • Confusion matrix

  • Cumulative gains

  • F1 score

  • Gini coefficient

  • Kolmogorov–Smirnov (KS) plot

  • KS statistic

  • Lift (deciles)

  • Precision

  • Recall (true positive rate, sensitivity)

  • ROC

Multiclass Classification

  • Accuracy

  • AUC–ROC, weighted, one-versus-one (OvO)

  • Confusion matrix

  • F1 score, weighted

  • Precision, weighted

  • Recall, weighted

 

Regression

  • Mean absolute error (MAE)

  • Mean squared error (MSE)

  • Root-mean-square error (RMSE)

  • R-squared
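
The plugin computes all of these metrics internally. For readers who want to relate the binary-classification metrics in Table 1 to familiar library calls, the following sketch shows roughly how several of them can be derived with scikit-learn (assumed available); the column names follow the example input in Table 2, and the positive label 'anomaly' is only an illustration.

# Illustration only, not the plugin's implementation: how several of the
# binary-classification metrics in Table 1 relate to one another.
import pandas as pd
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score, roc_curve)

def binary_metrics(df: pd.DataFrame, positive_label: str = "anomaly") -> dict:
    y_true = (df["actual"] == positive_label).astype(int)
    y_pred = (df["prediction"] == positive_label).astype(int)
    y_score = df["probability"]               # probability of the positive class

    auc = roc_auc_score(y_true, y_score)
    fpr, tpr, _ = roc_curve(y_true, y_score)  # points on the ROC curve
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "auc_roc": auc,
        "gini": 2 * auc - 1,                  # Gini coefficient = 2 * AUC - 1
        "ks_statistic": (tpr - fpr).max(),    # largest gap between TPR and FPR
    }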

 

Input

The Model Evaluation plugin accepts exactly one Pandas dataframe (Table 2). The dataframe must contain certain required columns, may contain one or more metadata columns, and must not contain null values.

 

Table 2. Example: Head of a valid input dataframe

timestamp | prediction_type | deployment_type | model_name | model_version | model_type | model_label | prediction | probability | actual
2021-03-14 16:33:05.755733435Z | CLASSIFICATION_BINARY | CHAMPION_CHALLENGER | nid-logreg-2 | 2 | SKLEARN | CHAMPION | normal | 0.383549681239271 | anomaly
2021-03-14 16:33:05.755733435Z | CLASSIFICATION_BINARY | CHAMPION_CHALLENGER | nid-ranfor-1 | 1 | SKLEARN | CHALLENGER | anomaly | 0.962626859671747 | anomaly
2021-03-14 16:33:52.227698635Z | CLASSIFICATION_BINARY | CHAMPION_CHALLENGER | nid-logreg-2 | 2 | SKLEARN | CHAMPION | normal | 0.027955472385276 | normal
2021-03-14 16:33:52.227698635Z | CLASSIFICATION_BINARY | CHAMPION_CHALLENGER | nid-ranfor-1 | 1 | SKLEARN | CHALLENGER | normal | 0.010713248754487 | normal

 

 

Required Columns

The columns that must be present in the input dataframe depend on the metrics that you want to calculate (Table 3).

 

Table 3. Dependencies of metrics on the required columns in the input dataframe

Required Column | Metrics
timestamp | All
actual | All
prediction | Accuracy; confusion matrix; F1 score; F1 score, weighted; MAE; MSE; precision; precision, weighted; recall; recall, weighted; RMSE
probability or probability_<positive_class> | AUC–ROC; cumulative gains; Gini coefficient; lift (deciles); KS plot; KS statistic; ROC
probability_<class_1>, probability_<class_2>, ..., probability_<class_k> | AUC–ROC, OvO, weighted

 

In general, the input dataframe must contain:

  • timestamp, a string or datetime column containing the prediction timestamps and

  • actual, a string or numeric column containing the ground-truth values.

NOTE: If the timestamp column contains time zone information, then it will be converted to UTC time within the plugin.

 

The input dataframe must also contain a prediction and/or one or more probability columns:

  • The prediction column is a string or numeric column containing the values of the dependent variable predicted by the respective model.

  • Probability columns take on continuous numeric values in the interval [0.0, 1.0]. In binary classification, the single probability column holds the probability of the positive class.

Additional constraints may apply (a minimal example follows this list):

  • In binary classification, the prediction column (if present) must contain exactly two unique values.

  • If there are multiple probability columns, then they must sum to 1.0 row-wise.
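
For illustration only (this is not the plugin's own validation logic), a minimal binary-classification input that satisfies the rules above could be assembled and checked as follows:

import pandas as pd

# Minimal sketch of a valid binary-classification input (column names as above).
df = pd.DataFrame({
    "timestamp": ["2021-03-14 16:33:05Z", "2021-03-14 16:33:52Z"],
    "actual": ["anomaly", "normal"],
    "prediction": ["anomaly", "normal"],
    "probability": [0.96, 0.01],   # probability of the positive class
})

# Time zone information, if present, is converted to UTC (as the plugin does).
df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True)

assert not df.isna().any().any(), "the dataframe must not contain null values"
assert df["prediction"].nunique() == 2, "binary prediction needs exactly two unique values"
assert df["probability"].between(0.0, 1.0).all(), "probabilities must lie in [0.0, 1.0]"

# For multiclass inputs with probability_<class> columns, the probabilities
# must sum to 1.0 row-wise, e.g.:
#   prob_cols = [c for c in df.columns if c.startswith("probability_")]
#   assert (df[prob_cols].sum(axis=1) - 1.0).abs().max() < 1e-6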

Metadata Columns

The input dataframe may contain one or more optional string columns (Table 4). These metadata are displayed in the plugin UI.

 

Table 4. Special string columns recognized by the Model Evaluation plugin

Column Name | Column Contents
prediction_type | Type of supervised learning problem, e.g., binary classification
deployment_type | Model deployment setup, e.g., champion–challenger
model_name | Model identifier
model_type | Model library/framework, e.g., scikit-learn, PySpark
model_version | Model version
model_label | Arbitrary string; used for champion–challenger deployments and A/B/n tests

 

Extra columns may be present but will be ignored.

You can add these metadata:

  • When creating or editing a deployment using the feedback loop in the MLOps app or

  • by using the Column Changes plugin.

If absent from the input dataframe, some metadata will be imputed within the plugin.

The prediction_type and deployment_type columns, if present, must each contain exactly one unique value because the plugin is designed to evaluate only one deployment at a time. (We assume a deployment contains one or more models that were trained on exactly the same training data.)

Special Named Constants

The plugin recognizes certain string literals in some columns (Table 5) that can be used to control the UI.

 

Table 5. String literals recognized by the Model Evaluation plugin

Column Name | Special Named Constants
prediction_type | 'CLASSIFICATION_BINARY', 'CLASSIFICATION_MULTICLASS', 'REGRESSION', 'CUSTOM'
deployment_type | 'REGULAR', 'CHAMPION_CHALLENGER', 'ABN_TEST', 'OTHER'
model_label | 'CHALLENGER', 'CHAMPION'

 

If the deployment type is 'CHAMPION_CHALLENGER', then there may be only one model name associated with the 'CHAMPION' label.
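
Before handing a dataframe to the plugin, a quick check along the following lines can confirm these constraints; this is a sketch, not the plugin's own validation:

# Sketch: verify the single-deployment assumption and the champion constraint.
def check_metadata(df):
    for col in ("prediction_type", "deployment_type"):
        if col in df.columns:
            assert df[col].nunique() == 1, f"{col} must contain exactly one unique value"

    is_cc = ("deployment_type" in df.columns
             and df["deployment_type"].iloc[0] == "CHAMPION_CHALLENGER")
    if is_cc and "model_label" in df.columns:
        champions = df.loc[df["model_label"] == "CHAMPION", "model_name"].nunique()
        assert champions <= 1, "only one model name may carry the CHAMPION label"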

User Interface

The plugin is configured on the left-hand panel; the data are visualized on the right-hand panel (Figure 1).

Figure 1. Model Evaluation plugin graphical UI

 

Information

Here, you can see information that provides context for the evaluation:

  • Prediction type, e.g., binary classification

  • Deployment type, e.g., champion–challenger

  • Model names and labels, for a regular deployment, A/B/n test, or other deployment

  • Name of the champion model, for a champion–challenger deployment

Before visualizing any metric, you should refer to this information to confirm that you are investigating the right problem and set of models.

Configure

Choose the label for the positive class from the “Positive Label” drop-down menu. This section appears only when the prediction_type column is equal to the string literal 'CLASSIFICATION_BINARY'.

Define Window Size

Choose the base frequency from a drop-down menu and enter a positive integer multiple for that frequency. These inputs define the time windows (bins) over which the plugin should calculate the metrics.

Under the hood, the base frequencies are mapped to Pandas DateOffset objects (Table 6).

 

Table 6. Correspondence between base frequency and Pandas DateOffset

Base Frequency | Pandas DateOffset Object
SECONDS | Second
MINUTES | Minute
HOURS | Hour
DAYS | Day
WEEKS | Week
MONTHS | MonthEnd
YEARS | YearEnd

 

The smallest base frequency that you can choose is constrained by the temporal resolution of the timestamp column. For example, if your deployment served one prediction every 1–5 seconds, then the smallest window size that you can define is 1 minute.

You can see the approximate number of predictions in each time window expressed as a range. Use this estimate to ascertain whether you have a sufficiently large sample in each time window to draw reliable conclusions from your data. The number of windows per model is also displayed and includes both empty and nonempty windows.

NOTE: The range spans two standard deviations around the mean number of predictions per window. If there are at least four windows, then the first and last windows are excluded from these statistics since they may skew the results.
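
The exact implementation is internal to the plugin, but under the description above the windowing and the displayed range can be approximated with Pandas roughly as follows. The timestamp column is assumed to be a parsed datetime, and the frequency string uses Pandas aliases (e.g., "min") rather than the UI names in Table 6.

import pandas as pd

# Sketch: bin predictions into fixed time windows and estimate the typical
# number of predictions per window as mean +/- 2 standard deviations.
def window_counts(df, base="min", multiple=5):
    # e.g. base="min", multiple=5 gives 5-minute windows; the MONTHS and YEARS
    # base frequencies correspond to month-end / year-end offsets (Table 6).
    counts = (df.set_index("timestamp")
                .groupby(pd.Grouper(freq=f"{multiple}{base}"))
                .size())

    inner = counts.iloc[1:-1] if len(counts) >= 4 else counts  # drop endpoints
    mean, std = inner.mean(), inner.std()
    return counts, (mean - 2 * std, mean + 2 * std)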

Select Metrics

Choose at least one metric to calculate and visualize. Only those metrics that can be calculated from the input dataframe will be available for selection (Table 3).

Visualize Metrics

When you select a metric, an interactive chart will appear on the right-hand panel under a new tab. Hover your cursor over any data point to bring up a legend.

When visualizing a one-dimensional metric that can be plotted as a time series, you can control the x-limits using the “From” and “To” calendar widgets.

You may use the “Model Selection” drop-down filter to hide or show plots for one or more models.

Classification

When visualizing the accuracy, F1 score, precision or recall, you may plot the metric over one or more individual classes by choosing those classes from the “Class Labels” drop-down filter. Each selection adds a new plot of the metric computed over just the records where actual is equal to the selection.

NOTE: The plot of the metric calculated over all the classes is always shown as a reference point.
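
In other words, each per-class curve is computed over a filtered subset of the records. A rough sketch of that behaviour, using accuracy and the column names from Table 2 (illustration only, not the plugin's code):

from sklearn.metrics import accuracy_score

# Sketch: a per-class plot uses only the rows whose ground truth equals the
# selected class label; the reference plot uses all rows.
def per_class_metric(df, class_label, metric=accuracy_score):
    subset = df[df["actual"] == class_label]
    return metric(subset["actual"], subset["prediction"])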

 

When visualizing the confusion matrix, cumulative gains, lift, KS plot or ROC, you may use the “Periods” drop-down filter to hide or show plots for one or more periods (time windows). The confusion matrix is visualized for only one model and one period at a time.

Preview Data

Under the “Data Preview” tab, you can see the calculated metrics in tabular format. The table contains:

  • A period column that represents the time windows,

  • any special metadata columns (Table 4) that are present in the input dataframe, and

  • one column for each metric selected in the configuration panel.

NOTES:

  • The data preview does not take into account any charting filters that are in effect.

  • The following metrics are currently not exported from the plugin:

    • Confusion matrix

    • Cumulative gains

    • Lift

    • KS plot

    • ROC

 

Save and Run

Save the configuration to persist the choice of positive class (if applicable), the window size, the metric selection and any visualization tabs that are currently open.

Closing the Feedback Loop

All the payloads going to and coming from a model deployment created through the MLOps app contain all the metadata needed for the Model Evaluation plugin to work smoothly. As long as you have turned the feedback loop on for a deployment, you can extract the relevant request–response data using the Import from Database plugin in SQL mode.

The following code snippet shows a sample query to import Model Evaluation-ready predictions for a PostgreSQL-backed feedback loop. Enter your database name on lines 11 and 12 and deployment name on line 14.

 

SELECT pl.model_id,
       pl.response_timestamp AS timestamp,
       pl.deployment_name,
       md.model_prediction_type AS prediction_type,
       md.deployment_type,
       md.model_name,
       md.model_version,
       md.model_package_type AS model_type,
       md.model_label,
       json_array_elements(pl.response_payload::json -> 'data' -> 'ndarray') ->> 0 AS prediction
  FROM <database_name>.mlops_payload AS pl
  JOIN <database_name>.mlops_metadata AS md
    ON pl.model_id = md.model_id
 WHERE pl.deployment_name = '<deployment_name>'
 ORDER BY timestamp

 

 

All that is left for you to do is to bring your ground-truth values onto the canvas. You could:

  • Import the ground-truth values from another source, then join those values onto the imported feedback loop data using the Join plugin on an appropriate key, e.g., mlops_payload.request_id (see the sketch after this list), or

  • insert the ground-truth values directly into the feedback loop database and pull them as part of your SQL query.
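
For the first option, the equivalent operation outside the canvas looks roughly like the pandas sketch below. The request_id join key is the one suggested above (it would need to be added to the SELECT list of the sample query), and the two small dataframes are stand-ins for illustration; in SmartWorks Analytics the Join plugin performs this step.

import pandas as pd

# Sketch: attach ground-truth labels to imported feedback-loop predictions
# on a shared key such as request_id.
predictions = pd.DataFrame({           # stand-in for the imported query result
    "request_id": ["a1", "a2"],
    "prediction": ["anomaly", "normal"],
})
ground_truth = pd.DataFrame({          # stand-in for the externally sourced labels
    "request_id": ["a1", "a2"],
    "actual": ["anomaly", "normal"],
})

evaluated = predictions.merge(ground_truth, on="request_id", how="inner")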