Observability

RBFS runs Prometheus, an open-source monitoring tool with a built-in time-series database. It gathers metrics (data) from RBFS and its microservices (brick daemons). Prometheus uses a polling model and pulls metrics data.

RBFS includes pre-packaged Prometheus software, so a separate Prometheus installation is not required. However, to begin collecting and storing data, you must enable Prometheus observability on the RBFS device. Prometheus remains inactive until a time-series has been configured for the first time.

Prometheus observability is an optional service in RBFS.

Understanding Prometheus Observability in RBFS

Prometheus operates as a microservice in the RBFS container and pulls metrics data directly from the various RBFS brick daemons (processes) at regular intervals. It stores each collected metric with a timestamp that shows when the data was recorded. RBFS supports direct export of metrics to Prometheus.

By monitoring and analyzing these metrics, you can assess the health of your applications, detect any anomalies, and take necessary corrective actions. You can define custom metrics.

Prometheus pulls metrics periodically from all targeted processes at a scrape interval. A scrape interval indicates how frequently these metrics are collected.

At a high level, three key operations take place:

  • Prometheus scrapes metrics from various applications and services at regular intervals.

  • The collected data is stored in a time-series database for querying and analysis.

  • Alert rules analyze the data and trigger alerts to notify about critical events or anomalies.

Example Scenarios Where Time-Series Metrics Data Helps

The following are a few example scenarios of how and when observability data helps network administrators:

  • Detecting Memory Leaks: A gradual increase in memory usage without a corresponding change in the PPPoE session count indicates a memory leak (see the query sketch after this list).

  • Predicting Hardware Failures: Persistently high chassis temperature readings over a five-minute period signal an imminent hardware fault, prompting a replacement.

  • Identifying Fabric Congestion: Consistent 80 percent utilization across all fabric interfaces suggests the need to add new fabric links.

  • Detecting Weakening Optical Links: A drop in optical receive power while a network device receives a high volume of incoming traffic suggests that the optical signal is weakening.
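
As an illustration of the memory-leak scenario, a PromQL expression along the following lines could surface it. This is only a sketch; the metric and label names are placeholders and depend on the metrics actually configured on the device:

# Memory usage grows over six hours while the PPPoE session count stays flat
# (metric and label names are placeholders).
deriv(daemon_memory_used_bytes{daemon="subscriberd"}[6h]) > 0
  and on()
deriv(pppoe_session_count[6h]) == 0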

Figure 1. Prometheus Monitoring in RBFS

The communication uses the Prometheus exposition format, which operates through an HTTP pull from the metrics endpoint of the scrape target.
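
For reference, a scrape response in the exposition format consists of plain-text lines with a metric name, optional labels, and a value. The snippet below is illustrative; the metric name and labels are placeholders:

# HELP interface_in_packets Illustrative packet counter (placeholder name)
# TYPE interface_in_packets counter
interface_in_packets{interface_name="ifp-0/0/1",element_name="rtbrick"} 102734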

RBFS brick daemons read the configuration table called global.time-series.metric.config, which lists the metrics each daemon is supposed to export. Only a daemon whose BDS holds the corresponding data table responds with metrics. The data scraped by Prometheus is stored in its own local time-series database on the device.

Figure 2. Prometheus in RBFS with the different scrape targets

Visualization

RBFS supports Grafana, a visualization platform, to present metrics data visually. For more information, see Grafana.

Federated Deployment of Prometheus

In a federated Prometheus deployment model, multiple Prometheus instances are deployed hierarchically in large-scale environments. A global Prometheus instance aggregates data from the other Prometheus nodes, enabling hierarchical data collection. This structure helps manage large volumes of time-series data and is useful for environments where a single Prometheus instance is inadequate.

Prometheus Federation allows you to consolidate metrics from multiple Prometheus instances into a single, centralized instance.

The following prometheus.yml file shows a global Prometheus instance configured to scrape metrics from local Prometheus instances using the '/federate' endpoint:

global:
  scrape_interval: 60s

scrape_configs:
  - job_name: "federate"
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="bds"}'
    scrape_interval: 15s
    file_sd_configs:
      - files:
          - ./bds.target.yml
        refresh_interval: 5m

The 'match[]' parameter instructs the global Prometheus instance to retrieve only time-series metrics associated with the BDS job. This ensures efficient data selection and eliminates the need to modify the configuration when new aggregation rules are introduced.

The Prometheus targets, which are the local Prometheus instances from which to pull data, are defined in the 'bds.target.yml' file:

- targets: ['198.51.100.125:19091']
  labels:
    __metrics_path__: "/api/v1/rbfs/elements/rtbrick/services/prometheus/proxy/federate"
    box: 125_rtbrick

This configuration specifies the endpoint to scrape and assigns custom labels to the data, such as identifying the RBFS device.

For more information about Prometheus Federation, see Prometheus Federation.

Control Daemon

Control Daemon (CtrlD) is a key component that facilitates service discovery, state monitoring, and proxying for other services within a monitoring architecture. It interacts with Prometheus for metrics collection and Alertmanager for alert management.

CtrlD operates as a proxy and acts as a middle layer, so that clients do not need to query Prometheus directly. Both Prometheus and Alertmanager are registered with CtrlD, which makes CtrlD aware of their existence and allows it to track their operational state (whether they are up or down). Alertmanager manages alerts triggered by Prometheus based on predefined conditions. When an alert fires in Alertmanager, CtrlD forwards it to external destinations, such as Grafana.

CtrlD can also be used to make Prometheus queries via HTTP. For example:

curl 'http://198.51.100.125:19091/api/v1/rbfs/elements/rtbrick/services/prometheus/proxy/api/v1/query?query=up' | jq .

This command queries Prometheus to check the status of the services to see if they are up.
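
A successful query returns a JSON document. The response below is an abbreviated, illustrative example; the instance and job values depend on the actual deployment:

{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": { "__name__": "up", "instance": "127.0.0.1:9090", "job": "prometheus" },
        "value": [ 1700000000.000, "1" ]
      }
    ]
  }
}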

For federation, the following URL allows for sharing data between multiple Prometheus instances:

http://198.51.100.125:19091/api/v1/rbfs/elements/rtbrick/services/prometheus/proxy/federate

Down-sampling

RBFS supports down-sampling for federated Prometheus deployments. Down-sampling reduces the amount of stored data by averaging, aggregating, or sampling it over a specified period of time. The mechanism groups the data over time periods and calculates aggregates such as averages, sums, or maximums.
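
In plain Prometheus terms, down-sampling of this kind is typically expressed with recording rules evaluated by the global instance. The following sketch assumes a placeholder gauge named chassis_temperature_celsius and stores its five-minute average under a new metric name:

groups:
  - name: downsampling
    interval: 5m                      # evaluate the rule every five minutes
    rules:
      - record: chassis_temperature_celsius:avg5m
        expr: avg_over_time(chassis_temperature_celsius[5m])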

Metrics and Metric Profiles

Metrics

Metrics are data points that are recorded with a timestamp. The metric as such is the current value; storing the metric values with timestamps forms a metric time-series, which can then be scanned for anomalies or used for trend analysis. Metrics are often accompanied by labels, which provide additional context about the data. For example, a CPU usage metric can include labels that indicate the instance or application that it belongs to.

Index Metrics and Object Metrics

Index metrics and object metrics are two different ways of storing and querying time-series data. Index metrics refer to metrics stored within the index of Prometheus.

Object metrics are the actual data points stored within Prometheus. They represent the measured values over time.

For more information, see Metrics.

Attributes

Attributes represent the numeric fields (columns) in BDS tables that are monitored and exported as time-series metrics. These are the actual data points, such as packet counters, byte counters, and so on. Attributes are key elements because they determine the specific information to be collected; they specify which state is exported.

An attribute must be either a numeric type or a type that can be converted to a numeric value.

Labels

Labels (Attribute Labels) provide additional context and metadata about metrics. They are key-value pairs attached to metrics. Labels help differentiate between different instances of the same attribute. For example, when Prometheus monitors multiple applications simultaneously, each application may have a system_memory_used_bytes metric. Labels enable Prometheus to store multiple metrics from various sources under a single metric name.
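
For example, two time-series that share the metric name but differ only in their labels could look like this in the exposition format (the label names and values are illustrative):

system_memory_used_bytes{application="bgp.iod.1",element_name="rtbrick"} 187234304
system_memory_used_bytes{application="pppoed.1",element_name="rtbrick"} 95420416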

Moreover, attribute labels are useful for filtering, grouping, and analyzing metrics across different entities. They are necessary for uniquely identifying metric streams, especially when the same attribute appears in multiple rows of a table.

For more information, see Metrics Labels.

Filters

Filters (Attribute Filters) are used to narrow down which objects of a table should be considered for metric generation. In some tables, there may be hundreds of entries, but you may only want to track metrics for a few specific rows. Filters let you define conditions (match rules) that rows must meet to be included in metric generation.

Metric Profiles

In RBFS, you can create metric profiles that allow you to group multiple metrics which share the same characteristics, such as metrics from the same application or instances. Instead of configuring each metric individually, you can define the scrape interval and the append-timestamp flag commonly for all associated metrics in a metric profile.

If you do not explicitly assign a metric to a specific metric profile, it is assigned to the default metric profile. The default profile is set to scrape metrics every 5 seconds. If you do not create or use any custom metric profile, all metrics automatically fall under this default profile, and Prometheus scrapes all daemons that expose metrics every 5 seconds.

Alert and Alert Profiles

Alert

You can configure Prometheus to generate alerts based on defined rules, which are then managed by the Alertmanager.

Alerts are triggered based on rules that evaluate the time-series data stored in Prometheus. When the specified conditions are met, the Alertmanager notifies CtrlD, which forwards the notification to the configured GELF or syslog servers.
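
Alert rules follow the standard Prometheus alerting-rule syntax. The sketch below uses a placeholder metric and threshold, and fires when a daemon's memory usage stays above the limit for ten minutes:

groups:
  - name: rbfs-alerts
    rules:
      - alert: HighDaemonMemoryUsage
        expr: daemon_memory_used_bytes > 2e9   # placeholder metric and threshold
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Daemon memory usage above 2 GB for 10 minutes"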

For more information, see Alerting and Alertmanager.

Alert Profile

An alert profile allows you to group multiple alerts into a single profile by applying the same characteristics to them. The alert profile also controls the repeat interval, which specifies how often a notification will be repeated if the underlying alert condition remains unresolved.

Tracking Invalid Time-series Configurations

RBFS provides the ability to track invalid time-series configurations. This functionality helps to debug Prometheus time-series that are not in the expected state (for example, missing labels) due to configuration settings that cause runtime issues. It is enabled by default. You can view the list of misconfigurations using the show command: show time-series metric <metric-name> rtbrick_invalid_metric_config.

Monitoring Daemon Status

The built-in Prometheus 'up' metric must not be used to monitor the brick daemons, because Prometheus scrapes metrics only from daemons that actually expose metric data. Instead, RBFS allows you to monitor whether the daemons are operational or not. You can enable this functionality; for configuration information, see section Monitoring Daemon Status.

Prometheus Self-Monitoring

With the self-monitoring functionality, Prometheus can monitor its own performance and health. This functionality gathers metrics about Prometheus' own operations, including query performance and resource usage.

Append Timestamp

RBFS allows you to enable Append Timestamp at either the metric or the metric profile level. When the append-timestamp flag is set, the first metric generated after an object is updated includes the actual timestamp of the update, which accurately reflects when the change took place. For all subsequent metrics for the same object (as long as it has not been updated again), the current scrape time is used.
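
In the exposition format, an appended timestamp appears as an optional millisecond value after the sample. The following line is illustrative, with placeholder metric and label names:

interface_out_packets{interface_name="ifp-0/0/2"} 52311 1700000000000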

Storage

On average, Prometheus uses only about 1 to 2 bytes per sample. Therefore, to plan the capacity of a Prometheus server, you can use the following rough formula:

needed_disk_space = retention_time_seconds * ingested_samples_per_second * bytes_per_sample
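
As a rough worked example, assuming a retention time of 15 days (1,296,000 seconds), 1,000 ingested samples per second, and 2 bytes per sample:

needed_disk_space = 1296000 * 1000 * 2 bytes = 2592000000 bytes (approximately 2.6 GB)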

The disk space used by the individual binaries:

-rwxr-xr-x 1 root root 27M Sep 2 22:51 alertmanager
-rwxr-xr-x 1 root root 81M Sep 2 22:51 prometheus
-rwxr-xr-x 1 root root 49M Sep 3 19:55 promtool

Use 'promtool' to test configurations before applying them in Prometheus. For more information, see promtool.
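
For example, assuming the configuration and rule files are stored locally as prometheus.yml and rules.yml (the file names are placeholders), they can be validated as follows:

promtool check config prometheus.yml
promtool check rules rules.yml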

For more information about storage, see Storage.