Scorecard Dashboard

About the Scorecard Dashboard

This dashboard acts as a “cockpit” from which an entire environment or set of environments can be observed in real time. The reliability of those environments is continuously analyzed and scored across three dimensions: deployments, applications and tiers.

The Reliability Scorecard provides a high-level view of the reliability of the selected environment(s) across three dimensions: deployments, applications and code tiers.

Deployments – with this view, the reliability of each code version within the selected environment(s) is dynamically scored. This view shows the number of severe and non-severe new errors, increasing errors (regressions) and performance slowdowns incurred by that release. For each new error, regression or slowdown, QA, SRE or Dev users can jump to its complete analysis via the new errors, regressions and slowdowns dashboards.

The deployments view shows the anomalies detected in, and the reliability score of, each of the active deployments currently running in the selected environment(s).

Applications – with this view, the reliability of each application (microservice or monolith) running within the selected environment(s) is dynamically scored. This view shows any key applications defined by the user via the Settings dashboard. If no key applications are defined, this view shows the applications with the highest volume of errors and slowdowns within the target environment(s).
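The selection rule described above can be sketched as follows. This is only an illustration: the function name, the `limit` parameter and the shape of the input data are assumptions, since the text does not say how many top applications are shown or how volume is computed.

```python
def applications_to_show(key_apps, volume_by_app, limit=5):
    """Pick the applications to display in the applications view.

    key_apps:      application names the user marked as key (may be empty).
    volume_by_app: dict mapping application name to its combined error and
                   slowdown volume (hypothetical input shape).
    limit:         how many top applications to fall back to (assumed value).
    """
    # User-defined key applications take precedence when any exist.
    if key_apps:
        return list(key_apps)
    # Otherwise rank applications by volume, highest first.
    ranked = sorted(volume_by_app, key=lambda app: volume_by_app[app], reverse=True)
    return ranked[:limit]


if __name__ == "__main__":
    volumes = {"checkout": 10, "search": 30, "billing": 20}
    print(applications_to_show([], volumes, limit=2))
    print(applications_to_show(["billing"], volumes))
```

With no key applications the call falls back to the ranking; once a key application is defined, the ranking is ignored entirely.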

Like the deployments view, the applications view shows the number of severe and non-severe new errors, increasing errors (regressions) and performance slowdowns incurred by that application. For each new error, regression or slowdown, QA, SRE or Dev users can jump to its complete analysis via the new errors, increasing errors and slowdowns dashboards.

The applications view shows the anomalies detected in, and the score of, each of the currently running applications (monoliths or microservices) within the selected environment(s).

Tiers – with this view, the reliability of each code tier (custom or 3rd party) running within the selected environment(s) is dynamically scored. A tier is defined as a top-level package in which a new error, regression or slowdown originated. For example, all errors originating from AWS APIs (e.g. com.amazonaws.) will be labeled as part of the “AWS” tier; any errors originating from a MySQL DB (com.mysql.) will be labeled as “MySQL”; and any errors originating from an Oracle DB (com.oracle.) will be labeled as “Oracle”.

The 3rd party tiers used by the application are detected automatically. The default definitions of code layers for 3rd party frameworks are available here: https://git.io/fpPT0.

Users may also add their own custom tiers that designate code components that are used horizontally across their applications. For example, a team that is in charge of a payment module used by multiple applications across their environments, and that wants to understand its reliability, can use the Settings dashboard to label the com.acme.payment.* package as the “Payment” tier.
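The labeling of errors by originating package can be pictured as a prefix lookup. The sketch below is not OverOps code: the table, the function name and the longest-prefix rule are assumptions made for illustration, and only the package prefixes and tier names come from the examples in the text.

```python
# Hypothetical prefix-to-tier table. The first three entries mirror the
# 3rd party examples in the text; "Payment" mirrors the custom tier example.
TIER_PREFIXES = {
    "com.amazonaws.": "AWS",
    "com.mysql.": "MySQL",
    "com.oracle.": "Oracle",
    "com.acme.payment.": "Payment",  # custom tier added by the user
}


def tier_for(class_name, prefixes=TIER_PREFIXES):
    """Return the tier label for a fully qualified class name, or None."""
    best = None
    for prefix, label in prefixes.items():
        if class_name.startswith(prefix):
            # Assumed rule: the longest matching prefix wins, so a more
            # specific custom tier overrides a broader one.
            if best is None or len(prefix) > len(best[0]):
                best = (prefix, label)
    return best[1] if best else None


if __name__ == "__main__":
    for name in ("com.amazonaws.services.s3.AmazonS3Client",
                 "com.acme.payment.InvoiceService",
                 "org.example.Unlabeled"):
        print(name, "->", tier_for(name))
```

A class that matches no prefix returns `None`, which corresponds to an error that belongs to no labeled tier.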

For each of the three observed reliability aspects (new errors, regressions and slowdowns), the user can see any P1 (severe) and P2 (non-severe) issues observed by OverOps, and jump to the corresponding drill-down dashboard, which provides deeper context and the cause of that issue. The severity of an issue is determined by a set of machine learning algorithms and thresholds applied to the data collected by OverOps, which can be controlled via the Settings dashboard.

The tiers view shows the anomalies detected in, and the reliability score of, each of the active code tiers currently in the selected environment(s).

As you can see, each of these three views is composed of 2 vertical parts: the top widget provides you with a visualization of the score for each Deployment / Application / Code Tier, while the bottom widget provides you with the score details and a drill-down into them.

For each column in the bottom widget, the following drill-downs are available (for each deployment / application / code tier) to inspect any anomalies detected:

Name: Reliability Analysis – this dashboard provides an overview of the overall reliability of a target application, deployment or code tier.

New: New Error – this dashboard shows all severe and non-severe new issues experienced by the selected deployment, application or code tier.

Increasing: Increasing Errors – this dashboard shows all severe and non-severe increasing errors (regressions) experienced by the selected deployment, application or code tier.

Slow: Slowdowns – this dashboard shows all severe and non-severe new slowdowns experienced by the selected deployment, application or code tier.

Scorecard Dashboard JSON Model
Customize the dashboard, or integrate any of the widgets in it into your Grafana using the Grafana JSON Model of this dashboard.
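As a rough sketch of how a single widget might be picked out of a dashboard's JSON Model for reuse, the snippet below searches the model's `panels` list by title. It relies only on the general Grafana dashboard layout (a top-level `panels` array, with collapsed rows holding their children in a nested `panels` array). The sample model and its panel titles are made up for illustration and are not taken from the actual Scorecard dashboard.

```python
import json

# Made-up stand-in for an exported dashboard JSON Model.
SAMPLE_MODEL = json.loads("""
{
  "title": "Scorecard",
  "panels": [
    {"id": 1, "type": "graph", "title": "Deployment Scores"},
    {"id": 2, "type": "row", "title": "Details", "collapsed": true,
     "panels": [{"id": 3, "type": "table", "title": "Deployment Details"}]}
  ]
}
""")


def find_panels(node, title_contains):
    """Return every panel under `node` whose title contains the given text.

    The match is case-insensitive, and panels nested inside row panels
    are searched as well.
    """
    needle = title_contains.lower()
    matches = []
    for panel in node.get("panels", []):
        if needle in panel.get("title", "").lower():
            matches.append(panel)
        # Row panels may carry child panels in their own "panels" list.
        matches.extend(find_panels(panel, title_contains))
    return matches


if __name__ == "__main__":
    for panel in find_panels(SAMPLE_MODEL, "deployment"):
        # Each match is a plain dict that can be pasted into another
        # dashboard's "panels" list (ids may need to be made unique there).
        print(panel["id"], panel["title"])
```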