Alarms

In the Fabric Services System, alarms arise when a managed object enters an undesirable state; for example, if a managed node goes out of service, an associated alarm is raised.

The Fabric Services System supports the following alarm types, each with its own subtypes:
  • Communication
  • Configuration
  • Environment
  • Equipment
  • Operational

The Fabric Services System includes the following tools that can help you manage alarms:
  • the Alarms panel on the dashboard, which summarizes current alarms
  • the Alarms List page, which you can use to view and manage individual alarms
  • the policy manager, which you can use to customize the severity level for specific types of alarm or to suppress alarms of a specific type entirely

Alarm states

In the Fabric Services System, an alarm can adopt the following states:

  • Acknowledged: An acknowledged alarm still displays in the Alarms List page. When viewing details for the individual alarm, its state displays as Acknowledged, and any note you added to the alarm while acknowledging it is displayed as well. You can use the Acknowledged state as the basis for filtering or sorting the alarm list.
  • Closed: A Closed alarm still displays in the Alarms List page. This state can be the basis for filtering the alarms included in the list. Closing an alarm does not resolve the condition that caused the alarm to be raised in the first place.
  • Cleared: An alarm is Cleared when the condition that raised the alarm has been resolved. Unlike Acknowledged and Closed, the Cleared state cannot be assigned manually by a Fabric Services System operator. Only the device or devices that raised the original alarm can determine that the condition is resolved and communicate the clearance.

Displaying alarms

The Alarms List view displays a list of current alarms known to the Fabric Services System. From this page, you can view details about the state of each alarm and also acknowledge any alarm.

To view and manage alarms with the Alarms List page:

  1. From the main menu, select Alarms List.
    The alarm list displays, showing all active alarms for the current region (where "active" refers to alarms that have not been cleared).
    Note: Cleared alarms are not included in this list because the "Cleared" filter for this display is set to "False" by default. To view cleared alarms in this list, clear that filter.
    Note: A set of default columns displays in the Alarms List view:
    • Severity
    • Alarm type
    • Node name
    • Resource name
    • Cleared
    • Occurrence
    • Last Raised

    Additional columns are available to show more information about each alarm. You can add or remove columns from any list.

  2. If required, use the Region Selector at the top of the page to display alarms from a different region.
  3. To view details about an alarm and its state:
    1. Select an alarm in the list.
    2. At the right edge of the row, open the action list and select State Details.
    3. Click the ALARM STATE tab to view details about the alarm's severity, a description of the alarm, and the time it was raised.
    4. Click the OPERATOR STATE tab to view the state assigned by the operator to address the alarm (either Acknowledged or Closed).
    5. When you are finished, click CLOSE to return to the Alarms List page.
  4. To acknowledge an alarm:
    Acknowledging an alarm marks it as received, but does not clear the alarm from the alarm list.
    1. Select an alarm in the list.
    2. At the right edge of the row, open the action list and select Acknowledge.
    3. Optionally, enter any comments about the acknowledgement in the Additional Info field.
    4. Click SAVE.
    The alarm is marked as Acknowledged (but not Closed).
  5. To close an alarm:
    Closing an alarm prevents the alarm from appearing in the Alarms List page, but does not resolve the condition that raised the alarm in the first place.
    1. Select an alarm from the list.
    2. At the right edge of the row, open the action list and select Close.
    3. Optionally, enter any comments about the closure in the Additional Info field.
      Note: The text you enter here is displayed in the Additional Info column of the OPERATOR STATE tab in the Alarm Details overlay.
    4. Click SAVE.

Customizing an alarm severity level

Policies allow you to customize the severity level associated with individual supported alarms.

A policy affects all future alarms of the specified type; it does not retroactively modify existing alarms of that type.

Each policy can include a start time and an end time; these are boundaries on the time of day during which the policy applies. An alarm raised outside these boundaries has its default severity instead of the severity level defined by the policy. If no start and end times are defined, the policy is always active.

You can also use a policy to suppress an alarm entirely while the policy is in effect.

The Fabric Services System supports the definition of a policy's scope in two mutually exclusive ways:
  • by key value, which allows you to trigger a policy based on the name of the object (node, fabric, intent, or region) affected by the alarm
  • by alarm category and type, to apply the policy regardless of the object affected
Note: All policies are specific to the region in which they are created. A policy that is created within one region is not visible within, or available to, other regions.

To customize an alarm's severity:

  1. From the main menu, select Policies.
  2. Use the Region Selector at the top of the page to select the region in which to create the policy.
  3. Click + CREATE A POLICY.
  4. Set the Name and Description fields for the policy.
  5. In the Policy Definition panel, set the Start Time and End Time fields.
    An alarm that would be affected by this policy uses the customized severity level only if it is raised during this period. If it is raised outside this period, it uses the default severity level.
  6. Do one of the following:
    • To configure a policy based on the object it affects, click the Key Value toggle to enable it and go to step 7.
    • To configure a policy based on alarm category and type, leave the Key Value toggle disabled and go to step 8.
  7. With the Key Value toggle enabled, do the following:
    1. Click the Key Value Objects drop-down list and select one or more of the displayed key value candidates:
      • Node Name
      • Fabric Name
      • Intent Name
      • Region Name
      For each selected item, a field displays.
    2. Use the displayed field or fields to provide the unique name of each object type you selected.
      This name identifies the unique object of that type for which the system changes the alarm severity or suppresses alarms, depending on how you configure the remainder of the policy.
    3. Go to step 9.
  8. In the Policy Definition panel, do the following:
    1. Click the Alarm Category drop-down list and select from the following values:
      • Communication
      • Equipment
      • Operational
      • FSS
    2. Click the Alarm Type drop-down list and check the box beside one or more alarm types in the displayed list.
    3. Select a value for the Alarm Severity field:
      • Major
      • Minor
      • Critical
      • Warning
      • Default
  9. Configure the way this policy modifies alarms:
    1. Click the Priority drop-down list and select a value from 0 to 9, with 0 being the highest priority.
    2. Optionally, enable the Suppress Alarms toggle.
      Enabling this option means that alarms of this type are disabled and are not triggered while the policy is in effect.
    3. Optionally, enable the Deployed Intent Alarms Only toggle.
      This option applies to "Communication – Interface Down" alarms. If this option is enabled, alarms are raised only on interfaces that are part of the intent configuration.
  10. Click CREATE.

Third-party tool access to Fabric Services System alarms

You can configure the system to allow third-party tools to access Fabric Services System alarms, so that operators can monitor and operate their network using their existing operational tool sets. The Fabric Services System exposes raised alarms to third-party tools through a Kafka message bus: the system publishes all generated alarms on a Kafka topic to which an external system can subscribe.

The Kafka broker used for this topic exposes only SSL connections for external systems. An external client must authenticate before it can subscribe to the topic, and the broker allows the external client only to subscribe to the topic, not to publish to it.
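As an illustration, the following Python sketch (using the open-source kafka-python client) shows how an external system might subscribe to the alarm topic over SSL with SCRAM-SHA-512 authentication. The host, port, topic name, group prefix, credentials, and CA file below are placeholders, not values defined by the Fabric Services System; substitute the values from your deployment's kafkaconfig settings.

```python
# Sketch of an external alarm subscriber using the kafka-python client
# (pip install kafka-python). All connection values here are placeholders.

def build_consumer_config(host: str, port: str, user: str, password: str,
                          group_prefix: str, cafile: str) -> dict:
    """Assemble the client settings the broker expects: SSL transport with
    SASL/SCRAM-SHA-512 authentication and a group ID based on the
    configured group prefix."""
    return {
        "bootstrap_servers": f"{host}:{port}",
        "security_protocol": "SASL_SSL",
        "sasl_mechanism": "SCRAM-SHA-512",
        "sasl_plain_username": user,
        "sasl_plain_password": password,
        "group_id": f"{group_prefix}-alarms",  # placeholder group name
        "ssl_cafile": cafile,
    }

def consume_alarms(config: dict, topic: str) -> None:
    """Subscribe (read-only) and print each protobuf-encoded alarm record."""
    from kafka import KafkaConsumer
    consumer = KafkaConsumer(topic, **config)
    for record in consumer:
        print(record.value)  # raw protobuf bytes; decode with the .proto schema
```

A client would call, for example, `consume_alarms(build_consumer_config("fss.domain.tld", "32425", "myuser", "mypasswd", "mygrp", "/path/to/ca.crt"), "fss-alarms")`, where "fss-alarms" stands in for the actual topic name exposed by your system.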

Alarm messages

The alarm messages that are published to the Kafka topic are in Protocol Buffer (protobuf) format. For an example, see Appendix B: Protobuf file message format.

From the Fabric Services System, you can obtain the .proto schema file using the following REST call:

https://fss.domain.tld/rest/alarmmgr/fss_alarmexternal.proto
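A minimal Python sketch for retrieving the schema is shown below; the hostname is a placeholder for your deployment's domain host, and the call may additionally require the TLS and authentication setup of your installation.

```python
# Download the protobuf schema that describes the published alarm messages.
# The hostname below is a placeholder; use your deployment's domain host.
import urllib.request

FSS_HOST = "fss.domain.tld"
PROTO_URL = f"https://{FSS_HOST}/rest/alarmmgr/fss_alarmexternal.proto"

def fetch_proto(url: str = PROTO_URL,
                dest: str = "fss_alarmexternal.proto") -> str:
    """Save the .proto file locally so it can be compiled with protoc."""
    with urllib.request.urlopen(url) as resp:
        data = resp.read()
    with open(dest, "wb") as f:
        f.write(data)
    return dest
```

After downloading, the schema can be compiled into message classes, for example with `protoc --python_out=. fss_alarmexternal.proto`, and used to decode the records read from the Kafka topic.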

Configuration

The settings that enable third-party tools to access Fabric Services System alarms are configured during the Fabric Services System application installation; for more information, see “Editing the installation configuration file” in the Fabric Services System Software Installation Guide.

After the Fabric Services System application has been installed, you can update the settings as described in Updating configuration for the external Kafka service.

Enabling Kafka alarms after software installation

Perform this procedure only during a maintenance window.
Use this procedure to enable Kafka alarms after the Fabric Services System application has been installed. Perform this procedure only if the generation of Kafka alarms was not configured during initial installation or software upgrade.
  1. Update the sample-input.json file with the Kafka alarm settings:
    In the fss section of the sample-input.json file, add the following lines:
    "kafkaconfig": {
         "port": "31000",
         "groupprefix": "fsskafka",
         "user": "fssalarms",
         "password": "fssalarms",
         "maxConnections": 2
    }
  2. Update the system configuration.
    Execute the following commands:
    /root/bin/fss-install.sh configure input.json
    /root/bin/fss-upgrade.sh upgrade
    /root/bin/update-kafka.sh

Updating configuration for the external Kafka service

  • You must perform this procedure during a maintenance window.
  • All external connections must be closed before executing this procedure; you can initiate the connections again after you have completed this procedure.
  1. Update the sample-input.json file.
    The parameters are in the kafkaconfig subsection of the fss section.
     "fss": {
    
         "dhcpnode": "fss-node01",
         "dhcpinterface": "192.0.2.11/24",
         "ztpaddress": "192.0.2.11",
         "httpsenabled":  true,
         "certificate": "/root/certs/fss-tls.crt",
         "privatekey": "/root/certs/fss-tls.key",
         "domainhost": "myhost.mydomain.com",
         "kafkaconfig": {
           "port": "32425",
           "groupprefix": "mygrp",
           "user": "myuser",
           "password": "mypasswd",
           "maxConnections": 2
         }
    }
    Currently, you can change only the maxConnections parameter. This parameter specifies the maximum number of clients that can connect to the Kafka service; its maximum value is 10. For example:
    [root@fss-deployer ~]# diff updated-input.json input.json
    <        "maxConnections": 3
    ---
    >        "maxConnections": 2
  2. Run the fss-install.sh script to update the system configuration.
    The fss-install.sh script is available in the /root/bin directory.
    [root@fss-deployer ~]# /root/bin/fss-install.sh configure updated-input.json
    WARNING: truststore not configured
        Timesync service is running on 10.254.45.123  Time difference is 0 seconds
        Timesync service is running on 10.254.44.123  Time difference is 0 seconds
        Timesync service is running on 10.254.43.123  Time difference is -1 seconds
        Timesync service is running on 10.254.42.123  Time difference is 0 seconds
        Timesync service is running on 10.254.41.123  Time difference is 0 seconds
        Timesync service is running on 10.254.40.123  Time difference is 0 seconds
      Maximum time difference between nodes 1 seconds
    WARNING: Storage related disks will be wiped clean during install, data will be lost. Please verify that correct disks are referred in the input configuration.
  3. Update the Kafka service.
    [root@fss-deployer ~]# /root/bin/update-kafka.sh
    Kafka will be updated with the current config.
    release "kafka" uninstalled
    Using User certificates for the cluster
    secret "kafka-fss-cluster-ca-cert" deleted
    secret/kafka-fss-clients-ca-cert created
    secret/kafka-fss-cluster-ca-cert created
    secret "kafka-fss-cluster-ca" deleted
    secret/kafka-fss-cluster-ca created
    secret/kafka-fss-clients-ca created
    secret/kafka-fss-cluster-ca-cert labeled
    secret/kafka-fss-clients-ca-cert labeled
    secret/kafka-fss-cluster-ca labeled
    secret/kafka-fss-clients-ca labeled
    secret/kafka-fss-cluster-ca-cert annotated
    secret/kafka-fss-clients-ca-cert annotated
    secret/kafka-fss-cluster-ca annotated
    secret/kafka-fss-clients-ca annotated
    NAME: kafka
    LAST DEPLOYED: Fri Mar 31 05:03:05 2023
    NAMESPACE: default
    STATUS: deployed
    REVISION: 1
    TEST SUITE: None
    Fri Mar 31 05:03:07 UTC 2023 Start: Checking Kafka pods status
    Fri Mar 31 05:03:07 UTC 2023 wait 800s for kafka cluster to startup
    Fri Mar 31 05:03:18 UTC 2023 wait 800s for kafka cluster to startup
    Fri Mar 31 05:03:28 UTC 2023 wait 800s for kafka cluster to startup
    Fri Mar 31 05:03:39 UTC 2023 wait 800s for kafka cluster to startup
    Fri Mar 31 05:03:49 UTC 2023 wait 800s for kafka cluster to startup
    Fri Mar 31 05:04:00 UTC 2023 wait 800s for kafka cluster to startup
    Fri Mar 31 05:04:10 UTC 2023 wait 800s for kafka cluster to startup
    Fri Mar 31 05:04:52 UTC 2023 wait 800s for kafka cluster to startup
    Fri Mar 31 05:05:02 UTC 2023 Kafka Operator is up
        
    NAME                                        READY   STATUS    RESTARTS   AGE
    kafka-fss-entity-operator-b6757b664-bvvpq   3/3     Running   0          37s
    kafka-fss-kafka-0                           1/1     Running   0          71s
    kafka-fss-kafka-1                           1/1     Running   0          71s
    kafka-fss-kafka-2                           1/1     Running   0          71s
    kafka-fss-zookeeper-0                       1/1     Running   0          115s
    kafka-fss-zookeeper-1                       1/1     Running   0          115s
    kafka-fss-zookeeper-2                       1/1     Running   0          115s
    strimzi-cluster-operator-5bc66cb4f9-dnkcv   1/1     Running   0          12h
    NAME              CLUSTER     AUTHENTICATION   AUTHORIZATION   READY
    fss-kafka-admin   kafka-fss   scram-sha-512    simple          True
    myuser            kafka-fss   scram-sha-512    simple          True
    NAME                                    TYPE                                  DATA   AGE
    default-token-tr6nz                     kubernetes.io/service-account-token   3      12h
    fss-kafka-admin                         Opaque                                2      116s
    kafka-fss-clients-ca                    Opaque                                1      2m1s
    kafka-fss-clients-ca-cert               Opaque                                3      2m3s
    kafka-fss-cluster-ca                    Opaque                                1      2m1s
    kafka-fss-cluster-ca-cert               Opaque                                3      2m2s
    kafka-fss-cluster-operator-certs        Opaque                                4      115s
    kafka-fss-entity-operator-token-zhz2r   kubernetes.io/service-account-token   3      37s
    kafka-fss-entity-topic-operator-certs   Opaque                                4      37s
    kafka-fss-entity-user-operator-certs    Opaque                                4      37s
    kafka-fss-kafka-brokers                 Opaque                                12     71s
    kafka-fss-kafka-token-52ddx             kubernetes.io/service-account-token   3      72s
    kafka-fss-zookeeper-nodes               Opaque                                12     115s
    kafka-fss-zookeeper-token-xfvfb         kubernetes.io/service-account-token   3      115s
    myuser                                  Opaque                                2      116s
    sh.helm.release.v1.kafkaop.v1           helm.sh/release.v1                    1      12h
    strimzi-cluster-operator-token-q5kkt    kubernetes.io/service-account-token   3      12h
    NAME                                 TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                               AGE
    kafka-fss-kafka-0                    NodePort    10.233.61.12    <none>        9094:31239/TCP                        72s
    kafka-fss-kafka-1                    NodePort    10.233.5.58     <none>        9094:30182/TCP                        72s
    kafka-fss-kafka-2                    NodePort    10.233.56.222   <none>        9094:30026/TCP                        72s
    kafka-fss-kafka-bootstrap            ClusterIP   10.233.34.56    <none>        9091/TCP,9092/TCP,9093/TCP            72s
    kafka-fss-kafka-brokers              ClusterIP   None            <none>        9090/TCP,9091/TCP,9092/TCP,9093/TCP   72s
    kafka-fss-kafka-external-bootstrap   NodePort    10.233.45.207   <none>        9094:32425/TCP                        72s
    kafka-fss-zookeeper-client           ClusterIP   10.233.21.201   <none>        2181/TCP                              115s
    kafka-fss-zookeeper-nodes            ClusterIP   None            <none>        2181/TCP,2888/TCP,3888/TCP            115s
  4. Wait for the Fabric Services System application to stabilize.
    Convergence may take some time, and some pods may fail and restart during this period.
    The system is stable when all the pods are in Running state.
    [root@fss-deployer ~]# export KUBECONFIG=/var/lib/fss/config.fss
    [root@fss-deployer ~]# kubectl get pods
    NAME                                                  READY   STATUS    RESTARTS        AGE
    fss-logs-fluent-bit-56t99                             1/1     Running   0               12h
    fss-logs-fluent-bit-d94x2                             1/1     Running   0               12h
    fss-logs-fluent-bit-hbvzt                             1/1     Running   0               12h
    fss-logs-fluent-bit-q7f6g                             1/1     Running   0               12h
    fss-logs-fluent-bit-r5tr4                             1/1     Running   0               12h
    fss-logs-fluent-bit-tmldd                             1/1     Running   0               12h
    prod-ds-apiserver-88fcd7cd7-lhmhh                     1/1     Running   0               12h
    prod-ds-cli-7cfd7664db-6xhk5                          1/1     Running   0               12h
    prod-ds-docker-registry-5b467bbf67-4lh2z              1/1     Running   0               12h
    prod-ds-imgsvc-deploy-5f99648577-fjfdg                1/1     Running   0               12h
    prod-fss-alarmmgr-78fd576464-2tfl9                    1/1     Running   1 (2m19s ago)   12h
    prod-fss-auth-6c99d44ccb-tnt8t                        1/1     Running   1 (3m20s ago)   12h
    prod-fss-catalog-54cb57645-s6mj7                      1/1     Running   1 (2m50s ago)   12h
    prod-fss-cfggen-6dfc6d8ccb-rjmxt                      1/1     Running   1 (2m49s ago)   12h
    prod-fss-cfgsync-78df54976f-nqrcm                     1/1     Running   0               12h
    prod-fss-connect-58c98db7d4-x4w5g                     1/1     Running   1 (3m18s ago)   12h
    prod-fss-da-0                                         1/1     Running   1 (2m20s ago)   12h
    prod-fss-da-1                                         1/1     Running   1 (2m20s ago)   12h
    prod-fss-da-2                                         1/1     Running   1 (2m20s ago)   12h
    prod-fss-da-3                                         1/1     Running   1 (2m20s ago)   12h
    prod-fss-da-4                                         1/1     Running   1 (2m18s ago)   12h
    prod-fss-da-5                                         1/1     Running   1 (2m48s ago)   12h
    prod-fss-da-6                                         1/1     Running   1 (2m18s ago)   12h
    prod-fss-da-7                                         1/1     Running   1 (2m18s ago)   12h
    prod-fss-deviationmgr-acl-7d8d878d66-jc48z            1/1     Running   0               12h
    prod-fss-deviationmgr-bfd-5f6bcf7d-xsq46              1/1     Running   0               12h
    prod-fss-deviationmgr-interface-5f7fdcfc6c-fpk48      1/1     Running   0               12h
    prod-fss-deviationmgr-netinst-c7d5648d7-z9mdp         1/1     Running   0               12h
    prod-fss-deviationmgr-platform-6d9c574bb9-l4cb7       1/1     Running   0               12h
    prod-fss-deviationmgr-qos-5b99fcc7d9-977r6            1/1     Running   0               12h
    prod-fss-deviationmgr-routingpolicy-775f49b66-qnqrj   1/1     Running   0               12h
    prod-fss-deviationmgr-system-557bbbc75f-rjknq         1/1     Running   0               12h
    prod-fss-dhcp-5bc95b6966-kzd2n                        1/1     Running   0               12h
    prod-fss-dhcp6-69d8785d64-l4qdk                       1/1     Running   0               12h
    prod-fss-digitalsandbox-5c44679f86-4bp8p              1/1     Running   1 (2m50s ago)   12h
    prod-fss-filemgr-65c6799996-ggl27                     1/1     Running   0               12h
    prod-fss-imagemgr-fd97fc4fb-6w8t4                     1/1     Running   1 (2m50s ago)   12h
    prod-fss-intentmgr-64f97dc466-ftjgm                   1/1     Running   1 (2m20s ago)   12h
    prod-fss-inventory-6f84769f46-w8h97                   1/1     Running   1 (3m18s ago)   12h
    prod-fss-labelmgr-847575b8c6-4m8xj                    1/1     Running   1 (3m19s ago)   12h
    prod-fss-maintmgr-7f599dd5db-fqk29                    1/1     Running   1 (2m20s ago)   12h
    prod-fss-mgmtstack-79c67c585c-pk2nv                   1/1     Running   1 (2m20s ago)   12h
    prod-fss-oper-da-0                                    1/1     Running   1 (2m20s ago)   12h
    prod-fss-oper-da-1                                    1/1     Running   1 (2m20s ago)   12h
    prod-fss-oper-da-2                                    1/1     Running   1 (2m20s ago)   12h
    prod-fss-oper-da-3                                    1/1     Running   1 (2m20s ago)   12h
    prod-fss-oper-da-4                                    1/1     Running   1 (2m19s ago)   12h
    prod-fss-oper-da-5                                    1/1     Running   1 (2m18s ago)   12h
    prod-fss-oper-da-6                                    1/1     Running   1 (2m18s ago)   12h
    prod-fss-oper-da-7                                    1/1     Running   1 (2m18s ago)   12h
    prod-fss-oper-topomgr-6b848bbcf7-5z8c9                1/1     Running   1 (2m19s ago)   12h
    prod-fss-protocolmgr-776bdf59c7-zvfl2                 1/1     Running   0               12h
    prod-fss-topomgr-5dd97997b8-jw8rk                     1/1     Running   1 (2m19s ago)   12h
    prod-fss-transaction-79bdb7d78d-lxwpp                 1/1     Running   1 (2m50s ago)   12h
    prod-fss-version-767b859c96-t2v5w                     1/1     Running   1 (2m20s ago)   12h
    prod-fss-web-5c94fd7455-l4sfz                         1/1     Running   1 (2m20s ago)   12h
    prod-fss-workloadmgr-7b8f44b95d-f8cv6                 1/1     Running   1 (3m19s ago)   12h
    prod-fss-ztp-86cbf5cdc-xtx9q                          1/1     Running   1 (2m49s ago)   12h
    prod-keycloak-0                                       1/1     Running   0               12h
    prod-mongodb-arbiter-0                                1/1     Running   0               12h
    prod-mongodb-primary-0                                1/1     Running   0               12h
    prod-mongodb-secondary-0                              1/1     Running   0               12h
    prod-neo4j-core-0                                     1/1     Running   0               12h
    prod-postgresql-0                                     1/1     Running   0               12h
    prod-sftpserver-77cd8696d5-fxswn                      1/1     Running   0               12h
    [root@6node-deployer-vm ~]#
  5. Initiate external connections from Kafka clients.