System Health

Container

The steps described in this section are performed on the host OS. For logging into the host OS instead of into the Linux container, use SSH port 1022 instead of the default port 22. Example:

ssh supervisor@10.10.0.100 -p 1022

Checking the Container Status

The RBFS services run in a Linux container (LXC). If you are able to log in into the container, obviously the container is running. If you are not able to log in, you can verify the status of the container on the host OS, i.e. ONL on hardware switches, as follows:

supervisor@spine1:~$ sudo lxc-ls -f
NAME    STATE   AUTOSTART GROUPS IPV4      IPV6
rtbrick RUNNING 1         -      10.0.3.10 -

On a hardware switch, there will be a single container called "rtbrick" and the state shall be "Running". If the state is "Stopped" or "Failed", the container has failed and the system is in a non-operational state.

Recovering from a Container Failure

If the container exists but is not running, you can start it using the rtb-image tool:

supervisor@spine1:~$ sudo rtb-image container start rtbrick

Alternatively you can use the following lxc command:

supervisor@spine1:~$ sudo lxc-start -n rtbrick

If the container does not exist, or if starting it fail, you can try to recover by restarting the device at the ONL layer:

supervisor@spine1:~$ sudo reboot

Brick Daemons

Checking the BD’s Status

RBFS runs multiple Brick Daemons (BD). You can verify the status of the daemons using the Ubuntu system control (systemctl) or using the RBFS show command: 'show bd running-status'. The following commands will show all rtbrick services. The status should be "running":

Example 1: Show output using the Ubuntu system control command

supervisor@rtbrick:~$ sudo systemctl list-units | grep rtbrick
var-rtbrick-auth.mount                    loaded active     mounted   /var/rtbrick/auth
rtbrick-alertmanager.service              loaded active     running   rtbrick-alertmanager service
rtbrick-bgp.appd.1.service                loaded active     running   rtbrick-bgp.appd.1 service
rtbrick-bgp.iod.1.service                 loaded active     running   rtbrick-bgp.iod.1 service
<...>

Example 2: Show output using the 'show bd running-status' command

supervisor@rtbrick: op> show bd running-status
Daemon              Status
alertmanager        running
bgp.appd.1          running
bgp.iod.1           running
confd               running
etcd                running
fibd                running
<...>

Please note the supported BDs differ by role and may change in a future release. You can display further details as shown in the following example:

supervisor@rtbrick:~$ sudo systemctl status rtbrick-fibd.service

 rtbrick-fibd.service - rtbrick-fibd service
   Loaded: loaded (/lib/systemd/system/rtbrick-fibd.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2020-10-19 05:01:16 UTC; 4h 41min ago
  Process: 248 ExecStartPost=/bin/bash -c [ -f "/usr/local/bin/rtbrick-ems-service-event" ] && { /usr/bin/python3 /usr/local/bin/rtbr
  Process: 240 ExecStartPre=/bin/mkdir -p /var/run/rtbrick/fibd_pipe (code=exited, status=0/SUCCESS)
  Process: 225 ExecStartPre=/usr/local/bin/rtbrick-bcm-sdk-symlink.bash (code=exited, status=0/SUCCESS)
  Process: 150 ExecStartPre=/usr/local/bin/rtbrick-vpp-startup-conf.bash (code=exited, status=0/SUCCESS)
 Main PID: 246 (vpp_main)
   CGroup: /system.slice/rtbrick-fibd.service
           246 /usr/local/bin/bd -i /etc/rtbrick/bd/config/fibd.json

If the status is "failed", the respective service or daemon has failed. A daemon failure is a major issue, and depending on which BD has failed, the system is likely to be in a non-operational state. If a BD has failed, inspect the system log (syslog) file as well as the respective BD log file as described in section 5, then proceed to sections 2.3.2 and 2.3.3, and finally report the failure to RtBrick as described in section 6.

Core Dump

If a BD fails, it shall create a core dump file. A core dump is a file containing a process’s state like its address space (memory) when the process terminates unexpectedly. These files are located in /var/crash/rtbrick. If you have identified or suspect a BD failure, navigate to this directory and check for core dump files:

supervisor@rtbrick:~$ cd /var/crash/rtbrick/
supervisor@rtbrick:/var/crash/rtbrick$ ls -l
-rw-r--r-- 1 root root 3236888576 Apr  9 10:17 core.fibd_136_2020-04-09_10-16-52

If there is a core dump file, you can decode it using the GNU debugger tool like in the following example:

supervisor@rtbrick:/var/crash/rtbrick$ sudo gdb bd core.fibd_136_2020-04-09_10-16-52
<...>

At the resulting 'gdb' prompt, type and enter 'bt' for backtrace:

(gdb) bt
<...>

Report the resulting output to RtBrick as described in section 6. Analysing the core dump file will typically require support from RtBrick and is beyond the scope of this guide.

Recovering from a BD Failure

If a brick daemon fails, RBFS will automatically restart it. If the automatic restart does not succeed, you can use the Ubuntu system control to start a daemon like in the following example:

supervisor@rtbrick:~$ sudo systemctl start rtbrick-fibd.service

Alternatively you can recover from a BD failure by rebooting the container from the Linux container shell:

supervisor@rtbrick:~$ sudo reboot

Running Configuration

Verifying the Configuration

A missing running configuration is another possible problem scenario. There are several reasons why the system might be missing its configuration. Posting the running configuration might have failed for example due an invalid configuration syntax, or there was a connectivity issue between the device and the provisioning system.

You can easily verify via CLI if there is a running configuration:

supervisor@rtbrick: op> show config

If you suspect a configuration load issue, inspect the confd logs as well as the CtrlD log as describe in section 5.

Restoring a Configuration

It depends on the customer deployment scenario how a running configuration shall be applied or restored in case of an issue.

If the device already had a configuration previously, and you have saved it in a file, you can simply load it via the CLI:

supervisor@rtbrick: cfg> load config spine1-2020-10-19.json

If the device already had a configuration previously, and has been configured to load the last configuration with the 'load-last-config: true' attribute, you can restore it by rebooting the container at the Linux container shell:

supervisor@rtbrick:~$ sudo reboot

Otherwise you can also copy an automatically stored running configuration file into your user directory and load it manually like in the following example:

supervisor@leaf1:~$ sudo cp /var/rtbrick/commit_rollback/766e102957bf99ec79100c2acfa9dbb9/config/running_config.json running_config.json

supervisor@leaf1:~$ ls -l
total 12
-rw-r--r-- 1 root root 8398 Oct 21 09:45 running_config.json

supervisor@leaf1:~$ cli
supervisor@leaf1: op> switch-mode config
Activating syntax mode : cfg [config]
supervisor@leaf1: cfg> load config running_config.json
supervisor@leaf1: cfg> commit

If it’s a newly deployed or upgraded device, and there is out-of-band connectivity from your network management system (for example RBMS), you can trigger the configuration from your NMS.

If it’s a newly deployed or upgraded device, and the configuration shall be applied via a ZTP process from a local ZTP server, you need to reboot the device at the ONL layer in order to trigger the ZTP process:

supervisor@spine1:~$ sudo reboot

There is also an option to manually copy a configuration file to the device and into the container. If you have copied a configuration file via an out-of-band path to the ONL layer of the device, you can copy it into the container as follows. Please note the name in the directory path needs to match the name of the container, like "spine" in this example:

supervisor@spine1:~$ cp spine1-2020-10-19.json /var/lib/lxc/spine1/rootfs/home/supervisor/

Next you can load this configuration via the CLI as already described above:

supervisor@rtbrick: cfg> load config spine1-2020-10-19.json

License

RBFS software requires a license to ensure its ligitimate use. The license will be automatically validated and enforced. After an initial grace period of 7 days, if a license is missing or expired, RBFS will be restricted. The CLI as well as the BDS APIs will not work anymore.

If all CLI commands do not work, the license might be missing or expired.

Verifying a License

You can verify the license via the CLI:

supervisor@rtbrick: op> show system license
License Validity:
  License index 1:
    Start date : Fri Mar 12 06:43:25 GMT +0000 2021
    End date   : Sat Mar 12 06:43:25 GMT +0000 2022

The output will indicate if there is a valid license, no license, or if the license is expired.

Restoring or Updating a License

The license is installed by configuration. If the license is missing but the device already had a license configuration previously, please restore the configuration as described in the Restoring a Configuration section above.

If the license is expired, please configure a new valid license key. If you do not have a license key yet, please contact your RtBrick support or sales representative to obtain a license.

Control Daemon

In addition to the Brick Daemons running inside the LXC container, there are some RBFS services running on the host OS. The most important one is CtrlD (Control Daemon). CtrlD acts as the single entry point to the system. Verify the status of CtrlD:

supervisor@spine1:~$ sudo service rtbrick-ctrld status
[....] Checking the rtbrick ctrld service:3751
. ok

If the status is not "ok", restart, start, or stop and start CtrlD:

supervisor@spine1:~$ sudo service rtbrick-ctrld restart

supervisor@spine1:~$ sudo service rtbrick-ctrld stop
supervisor@spine1:~$ sudo service rtbrick-ctrld start

If the status is "ok", but you suspect an issue related to CtrlD, inspect the ctrld logs as also described in section 5:

supervisor@spine1:/var/log$ more rtbrick-ctrld.log

System Utilization

There are cases when system-related information has to be checked: symptoms like sluggish response, or daemons/processes crashing repeatedly can mean that system resources are overutilized. In such cases first steps are to verify CPU, memory, and disk. Before, it is good to remember the general architecture of an RtBrick switch: we have the physical box, on which ONL (Open Network Linux) sits. In ONL we run the LXC container (which has Ubuntu 22.04 installed), which in turn has RBFS running inside it.

In the following sections we’ll mainly concentrate on the LXC container verifications, and will specify where the commands are executed in ONL. The order in which the commands are shown in this section can also be used to do the basic system troubleshooting.

Memory and CPU Verification

When suspecting the memory is overutilized, a quick way to verify that use free: this command provides information about unused and used memory and swap space. By providing the -h (human readable) flag, we can quickly see the memory availability of the system:

supervisor@rtbrick:~$ free -h
              total        used        free      shared  buff/cache   available
Mem:            31G        4.3G         24G        469M        2.3G         26G
Swap:            0B          0B          0B

The output from free is based on what the system reads from /proc/meminfo; of importance are the available and used columns. The description of the fields can be seen below (since man is not available on the switches):

`free` command fields description
Name	Description
total	Total amount of memory that can be used by the applications.
used	Used memory, which is calculated as `total` - `free` - `buffers` - `cache`
free	Unused memory.
shared	Backwards compatibility, not used.
buff/cache	The combined memory used by the kernel buffers and page cache. This memory can be reclaimed at any time if needed by the applications.
available	Estimate of the amount of memory that is available for starting new applications. Does not account swap memory.

free has a few useful options that can be used:

-h - human readable: makes the output easier to read, by using the common shortcuts for units (e.g M for mebibytes, G fo gibibytes etc)
-t - total: will display a total at the bottom of each column (basically adding physical+swap memory)
-s - continuous print output (can be interrupted with Ctrl+C): by giving a seconds value at which the output is refreshed, you will get a continuous display of values (similar to the watch command; e.g free -s 5)

As it can be seen, free is a basic utility that displays relevant information in a compressed format. It does not offer detailed or real-time information about the running processes. As with the CPU, we can use top to obtain realtime information about memory usage

Another way to check memory consumption, as well as CPU utlization, is to use top; it is one of the most common ways to start troubleshooting a Linux-based system, because it provides a wealth of information, and, in general, is a good starting point for system troubleshooting.

Basically, this command allows users to monitor processes and CPU/memory usage, but, unlike many other commands, it does so in an interactive way. top output can be customized in many ways, depending on the information we want to focus on, but in this guide we will not go through all the possible options top has.

A typical top output looks like the one below:

supervisor@rtbrick:~$ top

top - 21:12:41 up 1 day,  8:13,  1 users,  load average: 2.66, 2.72, 2.73
Tasks:  46 total,   1 running,  45 sleeping,   0 stopped,   0 zombie
%Cpu(s): 12.4 us,  8.5 sy,  0.0 ni, 79.0 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem : 32785636 total, 26181044 free,  4135516 used,  2469076 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 27834804 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  260 root      20   0 10.563g 1.473g 177440 S 109.3  4.7   2280:59 vpp_main
  108 root      20   0  406552 121648  46196 S  31.9  0.4  36:49.48 pppoed.1
  168 root      20   0 3203196  40704   9824 S   7.0  0.1 117:40.63 rtbrick-resmond
  156 root      20   0  461888 134852  59328 S   2.3  0.4  38:11.79 subscriberd.1
  112 root      20   0  438592 140644  46892 S   2.0  0.4  36:35.40 igmp.iod.1
  166 root      20   0  408076 117936  43416 S   2.0  0.4  36:35.14 pim.iod.1
  176 root      20   0  392036 114596  40312 S   2.0  0.3  37:15.43 l2tpd.1
  183 root      20   0  586944 144644  51256 S   1.7  0.4  38:27.94 bgp.iod.1
  136 root      20   0  425212 147080  35636 S   1.3  0.4  22:50.32 resmond
  193 root      20   0 1453416 929168  93672 S   0.7  2.8  17:09.07 confd
  266 root      20   0  836892 107804  35424 S   0.7  0.3  17:59.60 prometheus
  <...output omitted...>

top output is divided in two different sections: the upper half (summary area) of the output contains statistics on processes and resource usage, while the lower half contains a list of the currently running processes. You can use the arrow keys and Page Up/Down keys to browse through the list. If you want to quit, press “q” or Ctrl+C.

On the first line, you will notice the system time and the uptime, followed by the number of users logged into the system. The first row concludes with load average over one, five and 15 minutes. “Load” means the amount of computational work a system performs. In our case, the load is the number of processes in the R (runnable) and D (uninterruptible sleep) states at any given moment.

A word on process states

We’ve mentioned above about a "process state". In Linux, a process may be in of these states:

Runnable ( R ): The process is either executing on the CPU, or it is present in the run queue, ready to be executed.
Interruptible sleep (S): The process is waiting for an event to complete.
Uninterruptible sleep (D): The process is waiting for an I/O operation to complete.
Zombie (Z): A process may create a number of child processes, which can exit while the parent is still active. However, the data structures that the kernel creates in memory to keep track of these child processes have to be maintained until the parent find out about the status of its child. These terminated processes who still have associated data structures in memory are called zombies.

While looking at the summary area, it is also good practice to check if any zombie processes exist (on the Tasks row); a high number of zombies is indicative of a system problem.

The CPU-related statistics are on the %CPU(s) row:

us: Time the CPU spends executing processes for users in “user space.”
sy: Time spent running system “kernel space” processes.
ni: Time spent executing processes with a manually set "nice" value ( nice values determine the priority of a process relative to others - higer nice values of a process means that process will get a lower priority to run).
id: CPU idle time.
wa: Time the CPU spends waiting for I/O to complete.
hi: Time spent servicing hardware interrupts.
si: Time spent servicing software interrupts.
st: Time lost due to running virtual machines (“steal time”).

For systems with multiple CPU cores, we can see the per-core load by pressing "1" in the interface; another useful way of visualisation is to have a graphical display of CPU load: this can be done by pressing "t" in the top interface. Below is an example of top with both "1" and "t" pressed:

top - 12:11:56 up 1 day, 23:12,  4 users,  load average: 2.76, 2.86, 2.97
Tasks:  58 total,   1 running,  57 sleeping,   0 stopped,   0 zombie
%Cpu0  :   5.6/1.6     7[||||||||                                                                                            ]
%Cpu1  :  59.8/40.2  100[||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||]
%Cpu2  :  19.0/2.0    21[|||||||||||||||||||||                                                                               ]
%Cpu3  :  20.5/3.1    24[|||||||||||||||||||||||                                                                             ]
%Cpu4  :   2.1/3.4     5[|||||                                                                                               ]
%Cpu5  :   0.0/0.3     0[                                                                                                    ]
%Cpu6  :   1.7/0.7     2[|||                                                                                                 ]
%Cpu7  :   2.0/2.0     4[||||                                                                                                ]
GiB Mem :   31.267 total,   24.086 free,    4.059 used,    3.122 buff/cache
GiB Swap:    0.000 total,    0.000 free,    0.000 used.   26.328 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  260 root      20   0 10.783g 1.486g 191704 S 142.7  4.8 521:41.08 vpp_main
  148 root      20   0  399540 120656  44336 S   2.3  0.4   3:15.48 lldpd
  124 root      20   0  461888 137532  59312 S   2.0  0.4   8:22.26 subscriberd.1
  136 root      20   0  485168 183640  58500 S   2.0  0.6  98:59.56 igmp.iod.1
  180 root      20   0  588120 152332  52132 S   2.0  0.5   8:28.70 bgp.iod.1
  117 root      20   0  441872 137428  52904 S   1.7  0.4   9:59.74 pim.iod.1
  169 root      20   0  410648 131128  52828 S   1.7  0.4   8:50.00 pppoed.1
  171 root      20   0  425068 148420  36016 S   1.7  0.5   4:50.66 resmond
  176 root      20   0  388964 116732  40536 S   1.7  0.4   8:08.61 l2tpd.1
  <...output omitted...>

The next two lines are dedicated to memory information, and as expected, the “total”, “free” and “used” values have their usual meanings. The “avail mem” value is the amount of memory that can be allocated to processes without causing more swapping.

As with "t", the same thing can be done for displaying memory usage, but this time we will press "m". It is also worth noting that we can change the units in which memory values are displayed by pressing "E" (pressing repeatedly will cycle through kibibytes, mebibytes, gibibytes, tebibytes, and pebibytes). The following example shows the unit changed from kibi to gibibytes:

top - 12:25:19 up 1 day, 23:26,  4 users,  load average: 3.29, 3.12, 3.05
Tasks:  58 total,   1 running,  57 sleeping,   0 stopped,   0 zombie
%Cpu(s):  17.2/6.9    24[||||||||||||||||||||||||                                                                            ]
GiB Mem :   31.267 total,   24.081 free,    4.063 used,    3.123 buff/cache
GiB Swap:    0.000 total,    0.000 free,    0.000 used.   26.323 avail Mem
<...output omitted...>

Moving to the lower half of the output (the task area), here we can see the list of processes that are running on the system. Below you can find a short explanation for each of the columns in the task area:

`top` task area columns description
Name	Description
PID	process ID, a unique positive integer that identifies a process.
USER	the "effective" username of the user who started the process; Linux assigns a real user ID and an effective user ID to processes; the second one allows a process to act on behalf of another user (e.g.: a non-root user can elevate to root in order to install a package).
PR and NI	NI show the "nicety" value of a process, while the "PR" shows the scheduling priority fomr the perspective of the kernel. Higher nice values give a processa lower priority.
VIRT, RES, SHR and %MEM	these fields are related to the memory consumed by each process. “VIRT” is the total amount of memory consumed by a process. “RES” is the memory consumed by the process in RAM, and “%MEM” shows this value as a percentage of the total RAM available. “SHR” is the amount of memory shared with other processes.
S	state of the process, in single letter form.
TIME+	total CPU time used by the process since it started, in seconds/100.
COMMAND	the name of the process.

From a troubleshooting standpoint, check for processes that consume large amounts of CPU and/or memory. In the task area, of interest are the RES, for memory, and %CPU columns.

For a cleaner (and possibly more relevant) output of top, we can sort only the active processes to be displayed, by running top -i, and we can sort even further by CPU usage, by pressing Shift+P while running top (or by initially running top -o %CPU):

supervisor@rtbrick:~$ top -i

top - 23:55:20 up 1 day, 10:56,  0 users,  load average: 2.98, 2.87, 2.79
Tasks:  46 total,   1 running,  45 sleeping,   0 stopped,   0 zombie
%Cpu(s):  9.6 us,  6.5 sy,  0.0 ni, 83.7 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
KiB Mem : 32785636 total, 26168764 free,  4137340 used,  2479532 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 27832552 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  260 root      20   0 10.564g 1.474g 177952 S 110.0  4.7   2475:13 vpp_main
  112 root      20   0  438592 140908  47112 S   2.0  0.4  39:40.10 igmp.iod.1
  129 root      20   0  399544 118640  44244 S   2.0  0.4  16:40.52 lldpd
  156 root      20   0  461888 134852  59328 S   2.0  0.4  41:22.55 subscriberd.1
  166 root      20   0  408076 117936  43416 S   2.0  0.4  39:39.87 pim.iod.1
  183 root      20   0  586944 144644  51256 S   2.0  0.4  41:42.46 bgp.iod.1
  108 root      20   0  406552 122028  46576 S   1.7  0.4  39:57.25 pppoed.1
  136 root      20   0  425212 147080  35636 S   1.7  0.4  24:43.47 resmond
  176 root      20   0  392036 114596  40312 S   1.7  0.3  40:23.35 l2tpd.1
  193 root      20   0 1453420 929224  93728 S   1.0  2.8  18:35.60 confd
  168 root      20   0 3424392  41132   9824 S   0.7  0.1 127:35.58 rtbrick-resmond
  125 root      20   0  434464 132772  50396 S   0.3  0.4   2:54.41 mribd
  215 root      20   0 1527800  12736   5952 S   0.3  0.0   0:10.43 rtbrick-restcon
  266 root      20   0  837180 107808  35424 S   0.3  0.3  19:30.23 prometheus

As with the example above, we can also filter by any column present in the task area.

If, for example, a process is hogged and starts consuming too much CPU or memory, thus preventing the good functioning of the system, top offers the option to kill the respective process: you can press "k" and enter the process ID of the process to be killed; in the example below, the operator will terminate the cron process (make sure to run top as root when terminating processes spawned with the root user):

top - 07:39:16 up 1 day, 18:40,  3 users,  load average: 2.89, 2.90, 2.91
Tasks:  56 total,   2 running,  54 sleeping,   0 stopped,   0 zombie
%Cpu(s): 10.7 us,  7.5 sy,  0.0 ni, 81.7 id,  0.0 wa,  0.0 hi,  0.2 si,  0.0 st
KiB Mem : 32785636 total, 26042276 free,  4145480 used,  2597880 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 27766808 avail Mem
PID to signal/kill [default pid = 260] 126
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  260 root      20   0 10.559g 1.471g 181376 R 106.2  4.7 174:41.75 vpp_main
12192 supervi+  20   0   39572   3564   3048 R   6.2  0.0   0:00.01 top
    1 root      20   0   77448   8740   6840 S   0.0  0.0   0:00.27 systemd
   21 root      19  -1   70268  12152  11476 S   0.0  0.0   0:00.37 systemd-journal
   33 root      20   0   42584   3992   2980 S   0.0  0.0   0:00.20 systemd-udevd
   75 systemd+  20   0   71860   5388   4792 S   0.0  0.0   0:00.03 systemd-network
   81 systemd+  20   0   70640   5088   4532 S   0.0  0.0   0:00.05 systemd-resolve
  107 root      20   0 1604588  15864   8376 S   0.0  0.0   0:00.77 rtbrick-hostcon
  109 root      20   0  612252 189544  90092 S   0.0  0.6   0:10.33 etcd
  114 syslog    20   0  263036   4164   3652 S   0.0  0.0   0:00.10 rsyslogd
  117 root      20   0  408076 119448  43908 S   0.0  0.4   2:50.89 pim.iod.1
  120 root      20   0  503648 151880  66984 S   0.0  0.5   0:11.60 ifmd
  <...output omitted...>

Alternatively, ps can be used; ps is an utility for viewing information related with the processes on a system; it’s abbreviated from "Process Status", and gets its information from /proc. It can be used in conjunction with tools like top, or standalone. Usually you would run ps after seeing a summary with top, for example. ps is useful to get more information about some specific process (for example the command - or arguments - a process is executed with). Normally ps is executed with one or more options, in order to obtain a meaningful output.

Common `ps` options
Name	Description
e	Show all processes
u	Select processes by effective user ID (EUID) or name
f	Full-format listing (there is also F - Extra full format)
L	Show threads

Some common example are:

listing all running processes, detailed

supervisor@rtbrick:~$ ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 17:07 ?        00:00:00 /sbin/init
root        23     1  0 17:07 ?        00:00:00 /lib/systemd/systemd-journald
root        31     1  0 17:07 ?        00:00:00 /lib/systemd/systemd-udevd
systemd+    53     1  0 17:07 ?        00:00:00 /lib/systemd/systemd-networkd
systemd+    94     1  0 17:07 ?        00:00:00 /lib/systemd/systemd-resolved
root       136     1  1 17:07 ?        00:03:46 /usr/local/bin/bd -i /etc/rtbrick/bd/config/lldpd.json
syslog     138     1  0 17:07 ?        00:00:00 /usr/sbin/rsyslogd -n
root       139     1  1 17:07 ?        00:06:53 /usr/local/bin/bd -i /etc/rtbrick/bd/config/pim_iod.json
root       142     1  1 17:07 ?        00:07:06 /usr/local/bin/bd -i /etc/rtbrick/bd/config/bgp_iod.json
root       145     1  0 17:07 ?        00:00:18 /usr/local/bin/bd -i /etc/rtbrick/bd/config/isis_appd.json
<...output omitted...>

listing all running processes and threads

supervisor@rtbrick:~$ ps -eLf
UID        PID  PPID   LWP  C NLWP STIME TTY          TIME CMD
<...output omitted...>
root       136     1   136  1    1 17:07 ?        00:03:48 /usr/local/bin/bd -i /etc/rtbrick/bd/config/lldpd.json
root       139     1   139  1    1 17:07 ?        00:06:56 /usr/local/bin/bd -i /etc/rtbrick/bd/config/pim_iod.json
root       142     1   142  1    1 17:07 ?        00:07:09 /usr/local/bin/bd -i /etc/rtbrick/bd/config/bgp_iod.json
root       145     1   145  0    1 17:07 ?        00:00:18 /usr/local/bin/bd -i /etc/rtbrick/bd/config/isis_appd.json
root       147     1   147  1    1 17:07 ?        00:07:04 /usr/local/bin/bd -i /etc/rtbrick/bd/config/isis_iod.json
<...output omitted...>
root       157     1   157  0    1 17:07 ?        00:00:18 /usr/local/bin/bd -i /etc/rtbrick/bd/config/policy_server.json
root       160     1   160  0   19 17:07 ?        00:00:00 /usr/local/bin/rtbrick-hostconfd -proxy-onl-config http://10.0.3.1:22022
root       160     1   202  0   19 17:07 ?        00:00:00 /usr/local/bin/rtbrick-hostconfd -proxy-onl-config http://10.0.3.1:22022
root       160     1   203  0   19 17:07 ?        00:00:00 /usr/local/bin/rtbrick-hostconfd -proxy-onl-config http://10.0.3.1:22022
<...output omitted...>
root       165     1   165  0    3 17:07 ?        00:00:18 /usr/bin/python3 /usr/local/bin/rtbrick-resmond-agent
root       314     1   349  0   22 17:07 ?        00:00:01 /usr/local/bin/alertmanager --config.file=/etc/prometheus/alertmanager.yml --storage.path=/var/db/alertmanager
root       314     1   366  0   22 17:07 ?        00:00:01 /usr/local/bin/alertmanager --config.file=/etc/prometheus/alertmanager.yml --storage.path=/var/db/alertmanager
root       314     1   367  0   22 17:07 ?        00:00:01 /usr/local/bin/alertmanager --config.file=/etc/prometheus/alertmanager.yml --storage.path=/var/db/alertmanager
<...output omitted...>

listing all processes run by a user

supervisor@rtbrick:~$ ps -u syslog -f
UID        PID  PPID  C STIME TTY          TIME CMD
syslog     138     1  0 17:07 ?        00:00:00 /usr/sbin/rsyslogd -n
supervisor@rtbrick:~$

Along with ps you can use pgrep and pkill to search, and then terminate a process:

supervisor@rtbrick:~$ pgrep -u root -a
1 /sbin/init
23 /lib/systemd/systemd-journald
31 /lib/systemd/systemd-udevd
136 /usr/local/bin/bd -i /etc/rtbrick/bd/config/lldpd.json
139 /usr/local/bin/bd -i /etc/rtbrick/bd/config/pim_iod.json
142 /usr/local/bin/bd -i /etc/rtbrick/bd/config/bgp_iod.json
145 /usr/local/bin/bd -i /etc/rtbrick/bd/config/isis_appd.json
147 /usr/local/bin/bd -i /etc/rtbrick/bd/config/isis_iod.json
149 /usr/local/bin/bd -i /etc/rtbrick/bd/config/etcd.json
152 /usr/local/bin/bd -i /etc/rtbrick/bd/config/resmond.json
154 /usr/local/bin/bd -i /etc/rtbrick/bd/config/staticd.json
<...output omitted...>
314 /usr/local/bin/alertmanager --config.file=/etc/prometheus/alertmanager.yml --storage.path=/var/db/alertmanager
316 /usr/local/bin/prometheus --config.file=/etc/prometheus/prometheus.yml --storage.tsdb.retention.time=5d --storage.tsdb.path=/var/db/prometheus
<...output omitted...>
supervisor@rtbrick:~$ pkill prometheus

Disk Space

Another issue that can affect the functioning of the system is the lack of disk space; in severe situations, the system will become unusable. From this standpoint, checking disk space is one of the first things you do when doing first troubleshooting steps.

On Linux-based systems there are two main tools to check disk space: du (disk usage) and df (disk free). As in the case with ps and top, it is important to understand the uses cases for the two, and how they can complement each other.

Normally, you would first use df to have a quick look of the overall system disk space, then you would use du to look deeper into the problem. This approach is due to how these two tools work: df reads the superblocks only and trusts it completely, while du traverses a directory and reads each object, then sums the values up. This means that, most of the times, there will be differences between the exact values reported by these two; you can say that df sacrifices accuracy for speed.

First, we can look at the total space on the switch (we run the command in ONL):

supervisor@5916-nbg1:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        1.0M     0  1.0M   0% /dev
/dev/sdb7       113G  5.2G  102G   5% /
/dev/sdb6       2.0G  1.2G  677M  64% /mnt/onl/images
/dev/sdb1       256M  252K  256M   1% /boot/efi
/dev/sdb4       120M   43M   69M  39% /mnt/onl/boot
/dev/sdb5       120M  1.6M  110M   2% /mnt/onl/config
tmpfs           3.2G  720K  3.2G   1% /run
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           6.3G     0  6.3G   0% /run/shm
cgroup           12K     0   12K   0% /sys/fs/cgroup
tmpfs           6.0G  546M  5.5G   9% /shm
supervisor@5916-nbg1:~$

We then verify container disk space, by looking at the general snapshot of the system:

supervisor@rtbrick:~$ df -h
Filesystem                                                                                                                                                  Size  Used Avail Use% Mounted on
/var/cache/rtbrick/imagestores/847c6ecd-df58-462e-a447-38c620a12fe1/rbfs-cont/rbfs-accessleaf-qmx-20.10.0-g4internal.20201103065150+Bmvpn.C1067d22e/rootfs  113G  5.1G  102G   5% /
none                                                                                                                                                        492K     0  492K   0% /dev
/dev/sdb7                                                                                                                                                   113G  5.1G  102G   5% /var/log
tmpfs                                                                                                                                                       6.0G  546M  5.5G   9% /shm
devtmpfs                                                                                                                                                    1.0M     0  1.0M   0% /dev/mem
tmpfs                                                                                                                                                        16G  4.3M   16G   1% /dev/shm
tmpfs                                                                                                                                                        16G  9.0M   16G   1% /run
tmpfs                                                                                                                                                       5.0M     0  5.0M   0% /run/lock
tmpfs                                                                                                                                                        16G     0   16G   0% /sys/fs/cgroup
tmpfs                                                                                                                                                       3.2G     0  3.2G   0% /run/user/1000
supervisor@rtbrick:~$

At a quick glance we can see here that the root partition has a 5% usage, from a total of 113GB. You will also notice that /dev/sdb7 in the container has the same values as the output reported in ONL. It also has the same total size and same used space as the root filesystem. Notice the usage of the -h flag, which makes the output easier to read ("human readable").

Then you can verify the details of a specific directory, let’s say you want too see how much disk space is used by user files in /usr:

supervisor@rtbrick:~$ ls -l /usr/
total 44
drwxr-xr-x  1 root root 4096 Nov  3 11:54 bin
drwxr-xr-x  2 root root 4096 Apr 24  2018 games
drwxr-xr-x 37 root root 4096 Nov  3 06:59 include
drwxr-xr-x  1 root root 4096 Nov  3 11:54 lib
drwxr-xr-x  1 root root 4096 Nov  3 06:57 local
drwxr-xr-x  2 root root 4096 Nov  3 06:59 sbin
drwxr-xr-x  1 root root 4096 Nov  3 11:54 share
drwxr-xr-x  2 root root 4096 Apr 24  2018 src
supervisor@rtbrick:~$ du -sh /usr/
2.6G	/usr/
supervisor@rtbrick:~$ ls -l /usr
total 44
drwxr-xr-x  1 root root 4096 Nov  3 11:54 bin
drwxr-xr-x  2 root root 4096 Apr 24  2018 games
drwxr-xr-x 37 root root 4096 Nov  3 06:59 include
drwxr-xr-x  1 root root 4096 Nov  3 11:54 lib
drwxr-xr-x  1 root root 4096 Nov  3 06:57 local
drwxr-xr-x  2 root root 4096 Nov  3 06:59 sbin
drwxr-xr-x  1 root root 4096 Nov  3 11:54 share
drwxr-xr-x  2 root root 4096 Apr 24  2018 src

We then go even deeper, to check what takes most space in the /usr directory

supervisor@rtbrick:~$ du -h /usr/ | sort -rh | head -5
2.6G	/usr/
1.8G	/usr/local
1.7G	/usr/local/lib
506M	/usr/lib
169M	/usr/share

We used du in conjunction with sort (options r - reverse the result -, and h - compare human readable numbers -), as well as with head, to get only the biggest 5 directories from the output.