Settings Dashboard
Introduction
The Settings dashboard contains the settings and configurations that control the behavior of the OverOps Reliability dashboards and let you manage your OverOps environments. Settings are stored per environment in the form of an editable JSON document.
Because settings are set per environment, if you have multiple environments you'll need to specify the settings for each one.
Using the Settings Dashboard
To open the Settings dashboard:
- Open Settings by clicking Settings in the top right corner of the OverOps dashboard.
This opens the Settings dashboard for the environment you're in (you can see the environment in the upper left corner of the screen).
- To switch to another environment, click Manage Environments. This opens the Manage Environments window.
- Click Settings next to the environment you wish to configure and follow the steps below.
Configuring Environment Settings
The Settings dashboard's values are divided into the following primary categories.
- General Settings
- Installation
- Security
- Integrations
- Reliability Analysis
Note
Account owners and admins will have a list of environments that can be configured.
Adding Multiple Values
When entering multiple values in the Settings dashboard, if one or more values can't be added, those values will be skipped.
- If there are critical exceptions, you'll receive an indication at the top of the screen (Some were not added).
- If you've entered multiple values and some of them can't be added, either because they are duplicates or because of issues such as long names, you'll receive a popup message detailing which values weren't added.
Setting Alert Defaults
- Under the General Settings tab, select Alerts to display the Alerts - Default settings window.
- For each channel, specify the default settings that will be applied to the alerts sent via this channel.
- After specifying each channel, click Apply to save your changes. If you don't click Apply, you'll be asked whether you wish to keep the changes.
About Default Alert Settings
Default settings are optional; users can set up customized alerts for each View. When creating an alert, the user can override the default settings by either defining new channel details or disabling the channel.
3.1 For the Email channel, specify whether to email the user creating the alert, all team members, or enter the emails of additional recipients.
Adding Multiple Values
To enter multiple values - in this case emails - specify the emails separated either by a comma with a space or by a semicolon without a space.
3.2 For the Slack channel, enter the incoming webhook URL provided by Slack.
3.3 For the PagerDuty channel, enter the Integration Key provided by PagerDuty.
3.4 For the ServiceNow channel:
i. Enter the ServiceNow instance URL and credentials.
ii. Click Load Tables, and then choose a ServiceNow table to update.
3.5 For the Webhook channel, enter the URL for receiving webhooks.
Advanced Settings
The Advanced Settings tab is visible only to admins and account owners. It contains advanced settings for the Dashboard Views and for the Automated Root Cause (ARC) screen, which enables you to get to the root of errors and exceptions.
The Advanced window lets you specify the following:
- Allowed IPs - Limits user access to a list of selected IPs.
- Show rethrows - Displays exception rethrows in the dashboard.
- Clear sticky filters when moving between views - Clears the Apps / Deployments / Server filters when moving between Views.
- Show log links - Enables log links for exceptions. This requires a JVM restart and is supported in version 4.9.0 and above.
- Enhanced decompilation - Minimizes the difference between the original source code and decompiled sources by using raw bytecode as a reference. This is supported in version 4.39 and above.
About the Reliability Analysis Parameters
This section explains how to configure the settings related to the Reliability Analysis features. To learn more about how to set the reliability analysis parameters, refer to the presentation below.
Reliability Scoring Settings
These settings control how applications, deployments and key tiers are scored for reliability. Reliability scoring takes into account the number of anomalies (i.e., new errors, increasing errors and transaction slowdowns) detected within a target application, deployment or tier, their severity, and the length of time the code has been running. Duration matters because code that has been running for a month with 3 anomalies is considered more reliable than code that has the same number of anomalies detected during its first 12 hours, for example.
Setting | Description |
---|---|
New Event Score | The number of points deducted for each new event. |
Regression Score | The number of points deducted for a regressed (P2) event. |
Critical Regression Score | The number of points deducted for a severe (P1) regression or slowdown. |
Score Weight | A factor applied to the score deduction of a new error, regression or slowdown. |
Key Score Weight | A factor applied to the score deduction of a new error, regression or slowdown in a key app or tier. |
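The interaction between these settings can be illustrated with a minimal sketch. The exact formula OverOps uses is not given here, so the one below is an assumption: deductions are summed per anomaly type, multiplied by the relevant weight, spread over the number of days the code has been running, and subtracted from a base score of 100. All names (`ScoringSettings`, `reliability_score`, `days_running`) and the numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScoringSettings:
    # Field names mirror the table above; the values used below are illustrative.
    new_event_score: float
    regression_score: float
    critical_regression_score: float
    score_weight: float = 1.0
    key_score_weight: float = 1.0


def reliability_score(settings, new_events, regressions, critical_regressions,
                      days_running, is_key=False):
    """Return a 0-100 score under the assumed formula described in the lead-in."""
    if days_running <= 0:
        raise ValueError("days_running must be positive")
    deduction = (new_events * settings.new_event_score
                 + regressions * settings.regression_score
                 + critical_regressions * settings.critical_regression_score)
    # Key apps and tiers use their own weight.
    weight = settings.key_score_weight if is_key else settings.score_weight
    # Assumption: dividing by run time makes long-running code score higher.
    return max(0.0, 100.0 - deduction * weight / days_running)


settings = ScoringSettings(new_event_score=5, regression_score=5,
                           critical_regression_score=10)
# Three new errors over a month versus over the first 12 hours.
print(reliability_score(settings, 3, 0, 0, days_running=30))   # 99.5
print(reliability_score(settings, 3, 0, 0, days_running=0.5))  # 70.0
```

Under this assumed formula, the same three anomalies cost far more when they appear within the first 12 hours than when they are spread over a month, which matches the behavior described above.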
Groups
The Groups tab enables you to organize transactions, applications, and tiers into logical groupings that can be reported on and scored.
Adding Multiple Values
To enter multiple values - transactions, applications or tiers - specify the values separated either by a comma with a space or by a semicolon without a space.
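As a rough illustration of the multi-value rules, the sketch below splits an input string on commas and semicolons and separates the values that would be added from those that would be skipped as duplicates or overly long names. This is not OverOps code: the function name `parse_values` and the `MAX_NAME_LENGTH` limit are hypothetical, since the actual length limit is not stated.

```python
import re

MAX_NAME_LENGTH = 100  # hypothetical limit; the real limit is not documented here


def parse_values(raw, existing=()):
    """Split a multi-value input string into (added, skipped) lists."""
    # Accept commas or semicolons as separators, with or without spaces.
    candidates = [part.strip() for part in re.split(r"[;,]", raw)]
    seen = set(existing)
    added, skipped = [], []
    for value in candidates:
        if not value:
            continue  # ignore empty fragments such as a trailing separator
        if value in seen or len(value) > MAX_NAME_LENGTH:
            skipped.append(value)  # duplicate or name too long
        else:
            seen.add(value)
            added.append(value)
    return added, skipped


added, skipped = parse_values("billing, checkout;billing;search", existing=["search"])
print(added)    # ['billing', 'checkout']
print(skipped)  # ['billing', 'search']
```

The skipped list corresponds to the values that the popup message would report as not added.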
Transaction Groups
This tab enables you to group different entry point classes into logical groupings based on package and namespace, so that you can report and alert on the reliability of key business transactions within your environment(s).
Application Groups
This tab enables you to group different applications into logical groupings that can be reported on and scored.
Tier Groups
This tab enables you to define key 3rd party or custom application top level packages within your environment to report and alert on. 3rd party tiers are automatically detected by OverOps based on the definitions which can be found here: https://git.io/fpPT0.
You may also add your own custom tier definition by specifying a top level package name. By defining key tiers, you can then observe the reliability of infrastructure and utility code components used horizontally across your environment(s).
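To make the idea of a tier defined by a top level package concrete, here is a minimal sketch that maps a fully qualified class name to a tier by package prefix. The tier names, package names, and the longest-prefix rule are assumptions made for illustration; they do not describe how OverOps resolves tiers internally.

```python
def find_tier(class_name, tiers):
    """Return the tier whose package is the longest prefix of class_name, or None."""
    best, best_len = None, -1
    for tier, packages in tiers.items():
        for package in packages:
            # Match on whole package segments, so "com.example.pay" does not
            # match "com.example.payments.Foo".
            if class_name == package or class_name.startswith(package + "."):
                if len(package) > best_len:
                    best, best_len = tier, len(package)
    return best


# Hypothetical tier definitions: one 3rd party tier and one custom tier.
tiers = {
    "Spring": ["org.springframework"],
    "Payments": ["com.example.payments"],
}

print(find_tier("org.springframework.web.client.RestTemplate", tiers))  # Spring
print(find_tier("com.example.payments.CardService", tiers))             # Payments
print(find_tier("com.example.paymentsx.Other", tiers))                  # None
```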
Importing Groups
The Groups tab enables account owners and admins to import groups that were already defined for other environments. Note that each import applies only to the current group type (i.e., transactions, applications or tiers) and not to all group types in one import.
Handling Group Import Issues
When importing groups, values won't be imported in any of the following cases:
- A group that you import contains values that are already being used in an existing group.
- You import a group with the same title as one of your existing groups.
- You import an item that has a very long name.
Values that aren't imported are displayed in a notification.
Important
When moving to another tab within the Groups section, unsaved changes will be kept. However, they'll be deleted after moving to another section (in the sidebar) if you don't click Apply.
Increasing Errors (Regressions) Settings
These settings control how OverOps categorizes events as new or increasing in comparison to their corresponding baseline period, and how it assigns severity levels to them.
Setting | Explanation |
---|---|
Active Timespan | The default active timespan used to compare the volume of a target event against its baseline |
Min Baseline Timespan | The minimal length of the baseline window against which the active window is compared |
Baseline Timespan Factor | The minimal ratio that must be maintained between the active window and baseline window |
Error Minimal Volume Threshold | The minimal volume an event must have within the active window for it to be considered a regression. |
Error Minimal Rate Threshold | The minimal ratio between the event volume and the number of calls into the event location that an event must have within the active window for it to be considered a regression. |
Error Regression Delta | The minimal change in % between the relative volume of an event in the active time frame vs. that of the baseline for it to be considered a regression. |
Critical Regression Delta | The minimal change in % between the relative volume of an event in the active time frame vs. that of the baseline for it to be considered a critical regression. |
Apply Seasonality | Controls whether to exclude an object for which more than one slice/season within the baseline window exceeded the volume of the active window, or for which two windows exceeded it by >50% of the absolute volume of the active window. |
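The threshold and delta settings above can be read as a simple decision procedure. The sketch below is one possible interpretation, not the OverOps implementation: it treats the "relative volume" as hits divided by calls, expresses the deltas as fractions (0.5 meaning 50%), and ignores the timespan and seasonality settings. The function and parameter names are hypothetical.

```python
def classify_regression(active_hits, active_calls, baseline_hits, baseline_calls,
                        min_volume, min_rate, regression_delta, critical_delta):
    """Return "critical", "regression" or "none" for an existing event."""
    if active_calls == 0 or baseline_calls == 0:
        return "none"
    active_rate = active_hits / active_calls
    # The event must pass both minimal thresholds within the active window.
    if active_hits < min_volume or active_rate < min_rate:
        return "none"
    baseline_rate = baseline_hits / baseline_calls
    if baseline_rate == 0:
        return "none"  # no baseline occurrences: a new event, not a regression
    # Relative change of the event rate versus its baseline.
    change = (active_rate - baseline_rate) / baseline_rate
    if change >= critical_delta:
        return "critical"
    if change >= regression_delta:
        return "regression"
    return "none"


# Illustrative thresholds: 50 hits, 1% rate, +50% regression, +100% critical.
limits = dict(min_volume=50, min_rate=0.01, regression_delta=0.5, critical_delta=1.0)
print(classify_regression(300, 1000, 100, 1000, **limits))  # critical
print(classify_regression(160, 1000, 100, 1000, **limits))  # regression
print(classify_regression(5, 1000, 1, 1000, **limits))      # none
```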
Slowdown Settings
These settings control how OverOps determines which code entry points / transaction handlers are slowing down in comparison to their matching dynamic baseline.
The Slowdown settings set the rules that define an event type (whether new / regressed / increasing) and its severity.
Setting | Description |
---|---|
Active Invocations Threshold | The minimal number of calls to a target entry point - within the selected time frame - for the event to be taken into consideration for slowdown analysis. |
Baseline Invocations Threshold | The minimal number of calls to a target entry point - within the baseline time frame - for the event to be taken into consideration for slowdown analysis. |
Minimal Delta Threshold | The minimal change between the average response time - within the selected time frame and baseline - for a transaction's state to be marked as Slowing or Critical. |
Minimal Delta Threshold Percentage | The minimal percentage increase between the avg baseline response time and the avg response time in the selected time window. For example, an increase from 40ms to 60ms would constitute a 50% increase. |
Slowing Percentage | The percentage of calls within the active time frame whose response time must exceed the avg of the baseline + the std dev of the baseline * std_dev_factor for a transaction to be considered slowing. |
Critical Slowing Percentage | The percentage of calls within the active time frame whose response time must exceed the avg of the baseline + the std dev of the baseline * std_dev_factor for a transaction to be considered critical. |
Standard Deviation Factor | The number of additional std devs (as calculated from the baseline time frame) added to the current baseline avg when calculating the percentage of calls exceeding that combined value for a transaction to be considered slow or slowing. |
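The way these settings combine can be sketched as follows. This is an interpretation of the table rather than the actual OverOps algorithm: response times are assumed to be in milliseconds, the population standard deviation is used, and the function and parameter names are hypothetical.

```python
from statistics import mean, pstdev


def classify_slowdown(active_times, baseline_times, *, active_invocations_threshold,
                      baseline_invocations_threshold, min_delta_ms, min_delta_pct,
                      slowing_pct, critical_pct, std_dev_factor):
    """Return "critical", "slowing" or "ok" for one entry point."""
    # Skip entry points that were not called often enough in either window.
    if (len(active_times) < active_invocations_threshold
            or len(baseline_times) < baseline_invocations_threshold):
        return "ok"
    base_avg = mean(baseline_times)
    if base_avg <= 0:
        return "ok"
    delta = mean(active_times) - base_avg
    # Both the absolute and the percentage increase must reach their minimum.
    if delta < min_delta_ms or delta / base_avg * 100 < min_delta_pct:
        return "ok"
    # Calls slower than baseline avg + std dev * factor count as slow calls.
    cutoff = base_avg + pstdev(baseline_times) * std_dev_factor
    slow_share = 100 * sum(t > cutoff for t in active_times) / len(active_times)
    if slow_share >= critical_pct:
        return "critical"
    if slow_share >= slowing_pct:
        return "slowing"
    return "ok"


limits = dict(active_invocations_threshold=5, baseline_invocations_threshold=5,
              min_delta_ms=5, min_delta_pct=20, slowing_pct=30, critical_pct=80,
              std_dev_factor=1.5)
baseline = [40] * 8 + [38, 42]                                       # avg 40ms
print(classify_slowdown([60] * 10, baseline, **limits))              # critical
print(classify_slowdown([60] * 6 + [40] * 4, baseline, **limits))    # slowing
print(classify_slowdown([41] * 10, baseline, **limits))              # ok
```

In the first call the average rises from 40ms to 60ms, the 50% increase mentioned in the table, and every call exceeds the cutoff, so the entry point is marked critical under these illustrative thresholds.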
Event Types
Manage the types of events that are shown, and control how tables and graphs are displayed, and how data is presented, grouped, and ordered.
- Event Types: The default list of event types to be used when performing volume functions
- Transaction Failures: A list of event types, each of which marks a transaction as failed if it is found within the context of an entry point (transaction) call. Note that more than one of these events can take place within the context of a single transaction call (i.e., more than one log error can take place within the execution of a single entry point call).
- Critical Exception Types: A list of event types that are always treated as severe. If a new event whose type is in this list is introduced within the active time frame, it is considered a severe (P1) new issue, regardless of whether it exceeded the error minimal volume/rate threshold settings.
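The two list-based rules above can be sketched in a few lines. The event type names and threshold values are illustrative only, and the assumption that a new event which is neither critical nor above the thresholds falls back to "P2" is a guess that is not stated in this document.

```python
def transaction_failed(event_types_in_call, failure_types):
    """A call fails if any event raised within it has a listed failure type."""
    return any(event_type in failure_types for event_type in event_types_in_call)


def new_event_severity(event_type, volume, rate, critical_types,
                       min_volume, min_rate):
    """Return "P1" or "P2" for a new event (the P2 fallback is an assumption)."""
    # Critical exception types are severe regardless of volume or rate.
    if event_type in critical_types:
        return "P1"
    if volume >= min_volume and rate >= min_rate:
        return "P1"
    return "P2"


# Hypothetical settings values.
failure_types = {"Logged Error", "Uncaught Exception"}
critical_types = {"NullPointerException"}

print(transaction_failed(["Logged Warning", "Logged Error"], failure_types))  # True
print(new_event_severity("NullPointerException", 1, 0.0001,
                         critical_types, min_volume=50, min_rate=0.01))       # P1
print(new_event_severity("TimeoutException", 1, 0.0001,
                         critical_types, min_volume=50, min_rate=0.01))       # P2
```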
Settings Dashboard JSON Model
Customize the dashboard, or integrate any of its widgets into your own Grafana, using the Grafana JSON Model of this dashboard.