System Health
Container
The steps described in this section are performed on the host OS. To log into the host OS instead of the Linux container, use SSH port 1022 instead of the default port 22. Example:
ssh supervisor@10.10.0.100 -p 1022
Checking the Container Status
The RBFS services run in a Linux container (LXC). If you are able to log in to the container, the container is obviously running. If you are not able to log in, you can verify the status of the container on the host OS, i.e. ONL on hardware switches, as follows:
supervisor@spine1:~$ sudo lxc-ls -f
NAME    STATE   AUTOSTART GROUPS IPV4      IPV6
rtbrick RUNNING 1         -      10.0.3.10 -
On a hardware switch, there will be a single container called "rtbrick" and its state should be "RUNNING". If the state is "STOPPED" or "FAILED", the container has failed and the system is in a non-operational state.
Recovering from a Container Failure
If the container exists but is not running, you can start it using the rtb-image tool:
supervisor@spine1:~$ sudo rtb-image container start rtbrick
Alternatively you can use the following lxc command:
supervisor@spine1:~$ sudo lxc-start -n rtbrick
If the container does not exist, or if starting it fails, you can try to recover by restarting the device at the ONL layer:
supervisor@spine1:~$ sudo reboot
Brick Daemons
Checking the BD’s Status
RBFS runs multiple Brick Daemons (BD). You can verify the status of the daemons using the Ubuntu system control (systemctl) or using the RBFS show command: 'show bd running-status'. The following commands will show all rtbrick services. The status should be "running":
Example 1: Show output using the Ubuntu system control command
supervisor@rtbrick:~$ sudo systemctl list-units | grep rtbrick
var-rtbrick-auth.mount        loaded active mounted /var/rtbrick/auth
rtbrick-alertmanager.service  loaded active running rtbrick-alertmanager service
rtbrick-bgp.appd.1.service    loaded active running rtbrick-bgp.appd.1 service
rtbrick-bgp.iod.1.service     loaded active running rtbrick-bgp.iod.1 service
<...>
Example 2: Show output using the 'show bd running-status' command
supervisor@rtbrick: op> show bd running-status
Daemon         Status
alertmanager   running
bgp.appd.1     running
bgp.iod.1      running
confd          running
etcd           running
fibd           running
<...>
Please note the supported BDs differ by role and may change in a future release. You can display further details as shown in the following example:
supervisor@rtbrick:~$ sudo systemctl status rtbrick-fibd.service
rtbrick-fibd.service - rtbrick-fibd service
   Loaded: loaded (/lib/systemd/system/rtbrick-fibd.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2020-10-19 05:01:16 UTC; 4h 41min ago
  Process: 248 ExecStartPost=/bin/bash -c [ -f "/usr/local/bin/rtbrick-ems-service-event" ] && { /usr/bin/python3 /usr/local/bin/rtbr
  Process: 240 ExecStartPre=/bin/mkdir -p /var/run/rtbrick/fibd_pipe (code=exited, status=0/SUCCESS)
  Process: 225 ExecStartPre=/usr/local/bin/rtbrick-bcm-sdk-symlink.bash (code=exited, status=0/SUCCESS)
  Process: 150 ExecStartPre=/usr/local/bin/rtbrick-vpp-startup-conf.bash (code=exited, status=0/SUCCESS)
 Main PID: 246 (vpp_main)
   CGroup: /system.slice/rtbrick-fibd.service
           246 /usr/local/bin/bd -i /etc/rtbrick/bd/config/fibd.json
If the status is "failed", the respective service or daemon has failed. A daemon failure is a major issue, and depending on which BD has failed, the system is likely to be in a non-operational state. If a BD has failed, inspect the system log (syslog) file as well as the respective BD log file as described in section 5, then proceed to sections 2.3.2 and 2.3.3, and finally report the failure to RtBrick as described in section 6.
Core Dump
If a BD fails, it will create a core dump file. A core dump is a file containing a process’s state, such as its address space (memory), when the process terminates unexpectedly. These files are located in /var/crash/rtbrick. If you have identified or suspect a BD failure, navigate to this directory and check for core dump files:
supervisor@rtbrick:~$ cd /var/crash/rtbrick/
supervisor@rtbrick:/var/crash/rtbrick$ ls -l
-rw-r--r-- 1 root root 3236888576 Apr  9 10:17 core.fibd_136_2020-04-09_10-16-52
If there is a core dump file, you can decode it using the GNU debugger tool like in the following example:
supervisor@rtbrick:/var/crash/rtbrick$ sudo gdb bd core.fibd_136_2020-04-09_10-16-52
<...>
At the resulting 'gdb' prompt, type and enter 'bt' for backtrace:
(gdb) bt
<...>
Report the resulting output to RtBrick as described in section 6. Analysing the core dump file will typically require support from RtBrick and is beyond the scope of this guide.
Recovering from a BD Failure
If a brick daemon fails, RBFS will automatically restart it. If the automatic restart does not succeed, you can use the Ubuntu system control to start a daemon like in the following example:
supervisor@rtbrick:~$ sudo systemctl start rtbrick-fibd.service
Alternatively you can recover from a BD failure by rebooting the container from the Linux container shell:
supervisor@rtbrick:~$ sudo reboot
Running Configuration
Verifying the Configuration
A missing running configuration is another possible problem scenario. There are several reasons why the system might be missing its configuration. Posting the running configuration might have failed, for example due to an invalid configuration syntax, or there may have been a connectivity issue between the device and the provisioning system.
You can easily verify via CLI if there is a running configuration:
supervisor@rtbrick: op> show config
If you suspect a configuration load issue, inspect the confd logs as well as the CtrlD log as described in section 5.
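As a quick check, you can search the CtrlD log on the host OS for configuration-related messages. The grep filter below is only illustrative; the same log file is inspected in the Control Daemon section further below:

supervisor@spine1:/var/log$ grep -i config rtbrick-ctrld.log | tail -n 20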
Restoring a Configuration
How a running configuration is applied or restored in case of an issue depends on the customer deployment scenario.
If the device already had a configuration previously, and you have saved it in a file, you can simply load it via the CLI:
supervisor@rtbrick: cfg> load config spine1-2020-10-19.json
If the device already had a configuration previously, and has been configured to load the last configuration with the 'load-last-config: true' attribute, you can restore it by rebooting the container at the Linux container shell:
supervisor@rtbrick:~$ sudo reboot
Otherwise you can also copy an automatically stored running configuration file into your user directory and load it manually like in the following example:
supervisor@leaf1:~$ sudo cp /var/rtbrick/commit_rollback/766e102957bf99ec79100c2acfa9dbb9/config/running_config.json running_config.json
supervisor@leaf1:~$ ls -l
total 12
-rw-r--r-- 1 root root 8398 Oct 21 09:45 running_config.json
supervisor@leaf1:~$ cli
supervisor@leaf1: op> switch-mode config
Activating syntax mode : cfg [config]
supervisor@leaf1: cfg> load config running_config.json
supervisor@leaf1: cfg> commit
If it’s a newly deployed or upgraded device, and there is out-of-band connectivity from your network management system (for example RBMS), you can trigger the configuration from your NMS.
If it’s a newly deployed or upgraded device, and the configuration shall be applied via a ZTP process from a local ZTP server, you need to reboot the device at the ONL layer in order to trigger the ZTP process:
supervisor@spine1:~$ sudo reboot
There is also an option to manually copy a configuration file to the device and into the container. If you have copied a configuration file via an out-of-band path to the ONL layer of the device, you can copy it into the container as follows. Please note that the name in the directory path needs to match the name of the container, like "spine1" in this example:
supervisor@spine1:~$ cp spine1-2020-10-19.json /var/lib/lxc/spine1/rootfs/home/supervisor/
Next you can load this configuration via the CLI as already described above:
supervisor@rtbrick: cfg> load config spine1-2020-10-19.json
License
RBFS software requires a license to ensure its legitimate use. The license is automatically validated and enforced. If a license is missing or expired, RBFS will be restricted after an initial grace period of 7 days: the CLI as well as the BDS APIs will no longer work. Therefore, if none of the CLI commands work, the license might be missing or expired.
Verifying a License
You can verify the license via the CLI:
supervisor@rtbrick: op> show system license
License Validity:
  License index 1:
    Start date : Fri Mar 12 06:43:25 GMT +0000 2021
    End date   : Sat Mar 12 06:43:25 GMT +0000 2022
The output will indicate if there is a valid license, no license, or if the license is expired.
Restoring or Updating a License
The license is installed by configuration. If the license is missing but the device already had a license configuration previously, please restore the configuration as described in the Restoring a Configuration section above.
If the license is expired, please configure a new valid license key. If you do not have a license key yet, please contact your RtBrick support or sales representative to obtain a license.
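The exact CLI syntax for installing a license key depends on the RBFS release; the following is only a sketch, assuming a 'set system license encrypted-string' configuration command, to illustrate the workflow (the placeholder stands for your actual key):

supervisor@rtbrick: cfg> set system license encrypted-string <license-key>
supervisor@rtbrick: cfg> commit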
Control Daemon
In addition to the Brick Daemons running inside the LXC container, there are some RBFS services running on the host OS. The most important one is CtrlD (Control Daemon). CtrlD acts as the single entry point to the system. Verify the status of CtrlD:
supervisor@spine1:~$ sudo service rtbrick-ctrld status
[....] Checking the rtbrick ctrld service:3751 . ok
If the status is not "ok", restart, start, or stop and start CtrlD:
supervisor@spine1:~$ sudo service rtbrick-ctrld restart
supervisor@spine1:~$ sudo service rtbrick-ctrld stop
supervisor@spine1:~$ sudo service rtbrick-ctrld start
If the status is "ok", but you suspect an issue related to CtrlD, inspect the ctrld logs as also described in section 5:
supervisor@spine1:/var/log$ more rtbrick-ctrld.log
System Utilization
There are cases when system-related information has to be checked: symptoms like sluggish response or daemons/processes crashing repeatedly can indicate that system resources are overutilized. In such cases, the first steps are to verify CPU, memory, and disk utilization. Before that, it is good to recall the general architecture of an RtBrick switch: there is the physical box, on which ONL (Open Network Linux) runs. In ONL we run the LXC container (which has Ubuntu 22.04 installed), which in turn has RBFS running inside it.
In the following sections we will mainly concentrate on verifications inside the LXC container, and will indicate where commands are executed in ONL instead. The order in which the commands are presented in this section can also be followed when doing basic system troubleshooting.
Memory and CPU Verification
When suspecting that memory is overutilized, a quick way to verify it is the free command: it provides information about unused and used memory and swap space. By providing the -h (human readable) flag, we can quickly see the memory availability of the system:
supervisor@rtbrick:~$ free -h
              total        used        free      shared  buff/cache   available
Mem:            31G        4.3G         24G        469M        2.3G         26G
Swap:            0B          0B          0B
The output from free is based on what the system reads from /proc/meminfo; of most importance are the available and used columns. The description of the fields can be seen below (since man is not available on the switches):
Name | Description
---|---
total | Total amount of memory that can be used by the applications.
used | Used memory, calculated as total - free - buffers - cache.
free | Unused memory.
shared | Backwards compatibility, not used.
buff/cache | The combined memory used by the kernel buffers and page cache. This memory can be reclaimed at any time if needed by the applications.
available | Estimate of the amount of memory that is available for starting new applications. Does not take swap memory into account.
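Since free simply reports values taken from /proc/meminfo, you can also read the relevant fields directly, for example:

supervisor@rtbrick:~$ grep -E 'MemTotal|MemFree|MemAvailable|Buffers|^Cached' /proc/meminfo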
free has a few useful options that can be used:

- -h - human readable: makes the output easier to read by using the common shortcuts for units (e.g. M for mebibytes, G for gibibytes, etc.)
- -t - total: will display a total at the bottom of each column (basically adding physical + swap memory)
- -s - continuous print output (can be interrupted with Ctrl+C): by giving a seconds value at which the output is refreshed, you will get a continuous display of values (similar to the watch command; e.g. free -s 5)
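These flags can also be combined; the following invocations are illustrative (output omitted):

supervisor@rtbrick:~$ free -h -t     # human readable, with a Total row added
supervisor@rtbrick:~$ free -h -s 5   # refresh the output every 5 seconds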
As can be seen, free is a basic utility that displays relevant information in a compressed format. It does not offer detailed or real-time information about the running processes. As with the CPU, we can use top to obtain real-time information about memory usage.
Another way to check memory consumption, as well as CPU utilization, is to use top; it is one of the most common ways to start troubleshooting a Linux-based system, because it provides a wealth of information and, in general, is a good starting point for system troubleshooting. Basically, this command allows users to monitor processes and CPU/memory usage, but, unlike many other commands, it does so in an interactive way. top output can be customized in many ways, depending on the information we want to focus on, but in this guide we will not go through all the possible options top has.
A typical top output looks like the one below:
supervisor@rtbrick:~$ top
top - 21:12:41 up 1 day,  8:13,  1 users,  load average: 2.66, 2.72, 2.73
Tasks:  46 total,   1 running,  45 sleeping,   0 stopped,   0 zombie
%Cpu(s): 12.4 us,  8.5 sy,  0.0 ni, 79.0 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem : 32785636 total, 26181044 free,  4135516 used,  2469076 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 27834804 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  260 root      20   0 10.563g 1.473g 177440 S 109.3  4.7   2280:59 vpp_main
  108 root      20   0  406552 121648  46196 S  31.9  0.4  36:49.48 pppoed.1
  168 root      20   0 3203196  40704   9824 S   7.0  0.1 117:40.63 rtbrick-resmond
  156 root      20   0  461888 134852  59328 S   2.3  0.4  38:11.79 subscriberd.1
  112 root      20   0  438592 140644  46892 S   2.0  0.4  36:35.40 igmp.iod.1
  166 root      20   0  408076 117936  43416 S   2.0  0.4  36:35.14 pim.iod.1
  176 root      20   0  392036 114596  40312 S   2.0  0.3  37:15.43 l2tpd.1
  183 root      20   0  586944 144644  51256 S   1.7  0.4  38:27.94 bgp.iod.1
  136 root      20   0  425212 147080  35636 S   1.3  0.4  22:50.32 resmond
  193 root      20   0 1453416 929168  93672 S   0.7  2.8  17:09.07 confd
  266 root      20   0  836892 107804  35424 S   0.7  0.3  17:59.60 prometheus
<...output omitted...>
top output is divided into two different sections: the upper half (summary area) contains statistics on processes and resource usage, while the lower half contains a list of the currently running processes. You can use the arrow keys and Page Up/Down keys to browse through the list. If you want to quit, press "q" or Ctrl+C.
On the first line, you will notice the system time and the uptime, followed by the number of users logged into the system. The first row concludes with the load average over one, five and 15 minutes. "Load" means the amount of computational work a system performs. In our case, the load is the number of processes in the R (runnable) and D (uninterruptible sleep) states at any given moment.
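The same load average values can also be read outside of top, for example:

supervisor@rtbrick:~$ uptime
supervisor@rtbrick:~$ cat /proc/loadavg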
A word on process states
We’ve mentioned a "process state" above. In Linux, a process may be in one of these states:

Name | Description
---|---
R | Runnable: the process is either currently running or waiting for its turn on a CPU.
S | Interruptible sleep: the process is waiting for an event to complete.
D | Uninterruptible sleep: the process is waiting for an I/O operation (typically disk) to complete.
T | Stopped: the process has been stopped or is being traced.
Z | Zombie: the process has terminated, but its entry has not yet been reaped by its parent.
While looking at the summary area, it is also good practice to check if any zombie processes exist (on the Tasks row); a high number of zombies is indicative of a system problem.
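To see the state of each process directly, ps can print the state column; these invocations are illustrative:

supervisor@rtbrick:~$ ps -eo pid,stat,comm | head
supervisor@rtbrick:~$ ps -eo stat | grep -c '^Z'    # count zombie processes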
The CPU-related statistics are on the %Cpu(s) row:

- us: Time the CPU spends executing processes for users in “user space.”
- sy: Time spent running system “kernel space” processes.
- ni: Time spent executing processes with a manually set "nice" value (nice values determine the priority of a process relative to others - higher nice values of a process mean that process will get a lower priority to run).
- id: CPU idle time.
- wa: Time the CPU spends waiting for I/O to complete.
- hi: Time spent servicing hardware interrupts.
- si: Time spent servicing software interrupts.
- st: Time lost due to running virtual machines (“steal time”).
For systems with multiple CPU cores, we can see the per-core load by pressing "1" in the interface; another useful way of visualisation is a graphical display of the CPU load: this can be done by pressing "t" in the top interface. Below is an example of top with both "1" and "t" pressed:
top - 12:11:56 up 1 day, 23:12,  4 users,  load average: 2.76, 2.86, 2.97
Tasks:  58 total,   1 running,  57 sleeping,   0 stopped,   0 zombie
%Cpu0  :   5.6/1.6     7[||||||||                                                                                            ]
%Cpu1  :  59.8/40.2  100[||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||]
%Cpu2  :  19.0/2.0    21[|||||||||||||||||||||                                                                               ]
%Cpu3  :  20.5/3.1    24[|||||||||||||||||||||||                                                                             ]
%Cpu4  :   2.1/3.4     5[|||||                                                                                               ]
%Cpu5  :   0.0/0.3     0[                                                                                                    ]
%Cpu6  :   1.7/0.7     2[|||                                                                                                 ]
%Cpu7  :   2.0/2.0     4[||||                                                                                                ]
GiB Mem :   31.267 total,   24.086 free,    4.059 used,    3.122 buff/cache
GiB Swap:    0.000 total,    0.000 free,    0.000 used.   26.328 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  260 root      20   0 10.783g 1.486g 191704 S 142.7  4.8 521:41.08 vpp_main
  148 root      20   0  399540 120656  44336 S   2.3  0.4   3:15.48 lldpd
  124 root      20   0  461888 137532  59312 S   2.0  0.4   8:22.26 subscriberd.1
  136 root      20   0  485168 183640  58500 S   2.0  0.6  98:59.56 igmp.iod.1
  180 root      20   0  588120 152332  52132 S   2.0  0.5   8:28.70 bgp.iod.1
  117 root      20   0  441872 137428  52904 S   1.7  0.4   9:59.74 pim.iod.1
  169 root      20   0  410648 131128  52828 S   1.7  0.4   8:50.00 pppoed.1
  171 root      20   0  425068 148420  36016 S   1.7  0.5   4:50.66 resmond
  176 root      20   0  388964 116732  40536 S   1.7  0.4   8:08.61 l2tpd.1
<...output omitted...>
The next two lines are dedicated to memory information, and as expected, the “total”, “free” and “used” values have their usual meanings. The “avail mem” value is the amount of memory that can be allocated to processes without causing more swapping.
As with "t", the same thing can be done for displaying memory usage, but this time we will press "m". It is also worth noting that we can change the units in which memory values are displayed by pressing "E" (pressing repeatedly will cycle through kibibytes, mebibytes, gibibytes, tebibytes, and pebibytes). The following example shows the unit changed from kibi to gibibytes:
top - 12:25:19 up 1 day, 23:26,  4 users,  load average: 3.29, 3.12, 3.05
Tasks:  58 total,   1 running,  57 sleeping,   0 stopped,   0 zombie
%Cpu(s):  17.2/6.9    24[||||||||||||||||||||||||                                                                            ]
GiB Mem :   31.267 total,   24.081 free,    4.063 used,    3.123 buff/cache
GiB Swap:    0.000 total,    0.000 free,    0.000 used.   26.323 avail Mem
<...output omitted...>
Moving to the lower half of the output (the task area), here we can see the list of processes that are running on the system. Below you can find a short explanation for each of the columns in the task area:
Name | Description
---|---
PID | Process ID, a unique positive integer that identifies a process.
USER | The "effective" username of the user who started the process; Linux assigns a real user ID and an effective user ID to processes; the second one allows a process to act on behalf of another user (e.g. a non-root user can elevate to root in order to install a package).
PR and NI | "NI" shows the "nice" value of a process, while "PR" shows the scheduling priority from the perspective of the kernel. Higher nice values give a process a lower priority.
VIRT, RES, SHR and %MEM | These fields are related to the memory consumed by each process. “VIRT” is the total amount of memory consumed by a process. “RES” is the memory consumed by the process in RAM, and “%MEM” shows this value as a percentage of the total RAM available. “SHR” is the amount of memory shared with other processes.
S | State of the process, in single-letter form.
TIME+ | Total CPU time used by the process since it started, in hundredths of a second.
COMMAND | The name of the process.
From a troubleshooting standpoint, check for processes that consume large amounts of CPU and/or memory. In the task area, the columns of interest are RES (for memory) and %CPU.
For a cleaner (and possibly more relevant) output of top, we can display only the active processes by running top -i, and we can sort further by CPU usage by pressing Shift+P while running top (or by initially running top -o %CPU):
supervisor@rtbrick:~$ top -i
top - 23:55:20 up 1 day, 10:56,  0 users,  load average: 2.98, 2.87, 2.79
Tasks:  46 total,   1 running,  45 sleeping,   0 stopped,   0 zombie
%Cpu(s):  9.6 us,  6.5 sy,  0.0 ni, 83.7 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
KiB Mem : 32785636 total, 26168764 free,  4137340 used,  2479532 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 27832552 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  260 root      20   0 10.564g 1.474g 177952 S 110.0  4.7   2475:13 vpp_main
  112 root      20   0  438592 140908  47112 S   2.0  0.4  39:40.10 igmp.iod.1
  129 root      20   0  399544 118640  44244 S   2.0  0.4  16:40.52 lldpd
  156 root      20   0  461888 134852  59328 S   2.0  0.4  41:22.55 subscriberd.1
  166 root      20   0  408076 117936  43416 S   2.0  0.4  39:39.87 pim.iod.1
  183 root      20   0  586944 144644  51256 S   2.0  0.4  41:42.46 bgp.iod.1
  108 root      20   0  406552 122028  46576 S   1.7  0.4  39:57.25 pppoed.1
  136 root      20   0  425212 147080  35636 S   1.7  0.4  24:43.47 resmond
  176 root      20   0  392036 114596  40312 S   1.7  0.3  40:23.35 l2tpd.1
  193 root      20   0 1453420 929224  93728 S   1.0  2.8  18:35.60 confd
  168 root      20   0 3424392  41132   9824 S   0.7  0.1 127:35.58 rtbrick-resmond
  125 root      20   0  434464 132772  50396 S   0.3  0.4   2:54.41 mribd
  215 root      20   0 1527800  12736   5952 S   0.3  0.0   0:10.43 rtbrick-restcon
  266 root      20   0  837180 107808  35424 S   0.3  0.3  19:30.23 prometheus
As in the example above, the output can also be sorted by any other column present in the task area.
If, for example, a process hangs and starts consuming too much CPU or memory, thus preventing the good functioning of the system, top offers the option to kill the respective process: press "k" and enter the ID of the process to be killed; in the example below, the operator will terminate the cron process (make sure to run top as root when terminating processes spawned by the root user):
top - 07:39:16 up 1 day, 18:40,  3 users,  load average: 2.89, 2.90, 2.91
Tasks:  56 total,   2 running,  54 sleeping,   0 stopped,   0 zombie
%Cpu(s): 10.7 us,  7.5 sy,  0.0 ni, 81.7 id,  0.0 wa,  0.0 hi,  0.2 si,  0.0 st
KiB Mem : 32785636 total, 26042276 free,  4145480 used,  2597880 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 27766808 avail Mem
PID to signal/kill [default pid = 260] 126
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  260 root      20   0 10.559g 1.471g 181376 R 106.2  4.7 174:41.75 vpp_main
12192 supervi+  20   0   39572   3564   3048 R   6.2  0.0   0:00.01 top
    1 root      20   0   77448   8740   6840 S   0.0  0.0   0:00.27 systemd
   21 root      19  -1   70268  12152  11476 S   0.0  0.0   0:00.37 systemd-journal
   33 root      20   0   42584   3992   2980 S   0.0  0.0   0:00.20 systemd-udevd
   75 systemd+  20   0   71860   5388   4792 S   0.0  0.0   0:00.03 systemd-network
   81 systemd+  20   0   70640   5088   4532 S   0.0  0.0   0:00.05 systemd-resolve
  107 root      20   0 1604588  15864   8376 S   0.0  0.0   0:00.77 rtbrick-hostcon
  109 root      20   0  612252 189544  90092 S   0.0  0.6   0:10.33 etcd
  114 syslog    20   0  263036   4164   3652 S   0.0  0.0   0:00.10 rsyslogd
  117 root      20   0  408076 119448  43908 S   0.0  0.4   2:50.89 pim.iod.1
  120 root      20   0  503648 151880  66984 S   0.0  0.5   0:11.60 ifmd
<...output omitted...>
Alternatively, ps can be used; ps is a utility for viewing information related to the processes on a system; it is abbreviated from "Process Status", and gets its information from /proc. It can be used in conjunction with tools like top, or standalone.

Usually you would run ps after seeing a summary with top, for example. ps is useful to get more information about a specific process (for example the command - or arguments - a process is executed with). Normally ps is executed with one or more options, in order to obtain a meaningful output. Some of the more common options are listed below:
Name | Description
---|---
e | Show all processes
u | Select processes by effective user ID (EUID) or name
f | Full-format listing (there is also F - extra full format)
L | Show threads
Some common examples are:
- Listing all running processes, detailed:
supervisor@rtbrick:~$ ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 17:07 ?        00:00:00 /sbin/init
root        23     1  0 17:07 ?        00:00:00 /lib/systemd/systemd-journald
root        31     1  0 17:07 ?        00:00:00 /lib/systemd/systemd-udevd
systemd+    53     1  0 17:07 ?        00:00:00 /lib/systemd/systemd-networkd
systemd+    94     1  0 17:07 ?        00:00:00 /lib/systemd/systemd-resolved
root       136     1  1 17:07 ?        00:03:46 /usr/local/bin/bd -i /etc/rtbrick/bd/config/lldpd.json
syslog     138     1  0 17:07 ?        00:00:00 /usr/sbin/rsyslogd -n
root       139     1  1 17:07 ?        00:06:53 /usr/local/bin/bd -i /etc/rtbrick/bd/config/pim_iod.json
root       142     1  1 17:07 ?        00:07:06 /usr/local/bin/bd -i /etc/rtbrick/bd/config/bgp_iod.json
root       145     1  0 17:07 ?        00:00:18 /usr/local/bin/bd -i /etc/rtbrick/bd/config/isis_appd.json
<...output omitted...>
- Listing all running processes and threads:
supervisor@rtbrick:~$ ps -eLf
UID        PID  PPID   LWP  C NLWP STIME TTY          TIME CMD
<...output omitted...>
root       136     1   136  1    1 17:07 ?        00:03:48 /usr/local/bin/bd -i /etc/rtbrick/bd/config/lldpd.json
root       139     1   139  1    1 17:07 ?        00:06:56 /usr/local/bin/bd -i /etc/rtbrick/bd/config/pim_iod.json
root       142     1   142  1    1 17:07 ?        00:07:09 /usr/local/bin/bd -i /etc/rtbrick/bd/config/bgp_iod.json
root       145     1   145  0    1 17:07 ?        00:00:18 /usr/local/bin/bd -i /etc/rtbrick/bd/config/isis_appd.json
root       147     1   147  1    1 17:07 ?        00:07:04 /usr/local/bin/bd -i /etc/rtbrick/bd/config/isis_iod.json
<...output omitted...>
root       157     1   157  0    1 17:07 ?        00:00:18 /usr/local/bin/bd -i /etc/rtbrick/bd/config/policy_server.json
root       160     1   160  0   19 17:07 ?        00:00:00 /usr/local/bin/rtbrick-hostconfd -proxy-onl-config http://10.0.3.1:22022
root       160     1   202  0   19 17:07 ?        00:00:00 /usr/local/bin/rtbrick-hostconfd -proxy-onl-config http://10.0.3.1:22022
root       160     1   203  0   19 17:07 ?        00:00:00 /usr/local/bin/rtbrick-hostconfd -proxy-onl-config http://10.0.3.1:22022
<...output omitted...>
root       165     1   165  0    3 17:07 ?        00:00:18 /usr/bin/python3 /usr/local/bin/rtbrick-resmond-agent
root       314     1   349  0   22 17:07 ?        00:00:01 /usr/local/bin/alertmanager --config.file=/etc/prometheus/alertmanager.yml --storage.path=/var/db/alertmanager
root       314     1   366  0   22 17:07 ?        00:00:01 /usr/local/bin/alertmanager --config.file=/etc/prometheus/alertmanager.yml --storage.path=/var/db/alertmanager
root       314     1   367  0   22 17:07 ?        00:00:01 /usr/local/bin/alertmanager --config.file=/etc/prometheus/alertmanager.yml --storage.path=/var/db/alertmanager
<...output omitted...>
- Listing all processes run by a user:
supervisor@rtbrick:~$ ps -u syslog -f
UID        PID  PPID  C STIME TTY          TIME CMD
syslog     138     1  0 17:07 ?        00:00:00 /usr/sbin/rsyslogd -n
supervisor@rtbrick:~$
Along with ps, you can use pgrep and pkill to search for, and then terminate, a process:
supervisor@rtbrick:~$ pgrep -u root -a
1 /sbin/init
23 /lib/systemd/systemd-journald
31 /lib/systemd/systemd-udevd
136 /usr/local/bin/bd -i /etc/rtbrick/bd/config/lldpd.json
139 /usr/local/bin/bd -i /etc/rtbrick/bd/config/pim_iod.json
142 /usr/local/bin/bd -i /etc/rtbrick/bd/config/bgp_iod.json
145 /usr/local/bin/bd -i /etc/rtbrick/bd/config/isis_appd.json
147 /usr/local/bin/bd -i /etc/rtbrick/bd/config/isis_iod.json
149 /usr/local/bin/bd -i /etc/rtbrick/bd/config/etcd.json
152 /usr/local/bin/bd -i /etc/rtbrick/bd/config/resmond.json
154 /usr/local/bin/bd -i /etc/rtbrick/bd/config/staticd.json
<...output omitted...>
314 /usr/local/bin/alertmanager --config.file=/etc/prometheus/alertmanager.yml --storage.path=/var/db/alertmanager
316 /usr/local/bin/prometheus --config.file=/etc/prometheus/prometheus.yml --storage.tsdb.retention.time=5d --storage.tsdb.path=/var/db/prometheus
<...output omitted...>

supervisor@rtbrick:~$ pkill prometheus
Disk Space
Another issue that can affect the functioning of the system is lack of disk space; in severe situations, the system will become unusable. From this standpoint, checking disk space is one of the first steps in basic troubleshooting.
On Linux-based systems there are two main tools to check disk space: du (disk usage) and df (disk free). As in the case of ps and top, it is important to understand the use cases for the two, and how they can complement each other.
Normally, you would first use df to have a quick look at the overall system disk space, then you would use du to look deeper into the problem. This approach is due to how these two tools work: df reads only the superblocks and trusts them completely, while du traverses a directory, reads each object, and then sums the values up. This means that, most of the time, there will be differences between the exact values reported by these two; you can say that df sacrifices accuracy for speed.
First, we can look at the total space on the switch (we run the command in ONL):
supervisor@5916-nbg1:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        1.0M     0  1.0M   0% /dev
/dev/sdb7       113G  5.2G  102G   5% /
/dev/sdb6       2.0G  1.2G  677M  64% /mnt/onl/images
/dev/sdb1       256M  252K  256M   1% /boot/efi
/dev/sdb4       120M   43M   69M  39% /mnt/onl/boot
/dev/sdb5       120M  1.6M  110M   2% /mnt/onl/config
tmpfs           3.2G  720K  3.2G   1% /run
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           6.3G     0  6.3G   0% /run/shm
cgroup           12K     0   12K   0% /sys/fs/cgroup
tmpfs           6.0G  546M  5.5G   9% /shm
supervisor@5916-nbg1:~$
We then verify the container disk space by looking at a general snapshot of the system:
supervisor@rtbrick:~$ df -h
Filesystem                                                                                                                                                   Size  Used Avail Use% Mounted on
/var/cache/rtbrick/imagestores/847c6ecd-df58-462e-a447-38c620a12fe1/rbfs-cont/rbfs-accessleaf-qmx-20.10.0-g4internal.20201103065150+Bmvpn.C1067d22e/rootfs  113G  5.1G  102G   5% /
none                                                                                                                                                         492K     0  492K   0% /dev
/dev/sdb7                                                                                                                                                    113G  5.1G  102G   5% /var/log
tmpfs                                                                                                                                                        6.0G  546M  5.5G   9% /shm
devtmpfs                                                                                                                                                     1.0M     0  1.0M   0% /dev/mem
tmpfs                                                                                                                                                         16G  4.3M   16G   1% /dev/shm
tmpfs                                                                                                                                                         16G  9.0M   16G   1% /run
tmpfs                                                                                                                                                        5.0M     0  5.0M   0% /run/lock
tmpfs                                                                                                                                                         16G     0   16G   0% /sys/fs/cgroup
tmpfs                                                                                                                                                        3.2G     0  3.2G   0% /run/user/1000
supervisor@rtbrick:~$
At a quick glance we can see here that the root partition has a 5% usage, out of a total of 113GB. You will also notice that /dev/sdb7 in the container reports the same values as in the ONL output, and that it has the same total size and used space as the root filesystem. Notice the usage of the -h flag, which makes the output easier to read ("human readable").
Then you can verify the details of a specific directory; let’s say you want to see how much disk space is used by user files in /usr:
supervisor@rtbrick:~$ ls -l /usr/
total 44
drwxr-xr-x  1 root root 4096 Nov  3 11:54 bin
drwxr-xr-x  2 root root 4096 Apr 24  2018 games
drwxr-xr-x 37 root root 4096 Nov  3 06:59 include
drwxr-xr-x  1 root root 4096 Nov  3 11:54 lib
drwxr-xr-x  1 root root 4096 Nov  3 06:57 local
drwxr-xr-x  2 root root 4096 Nov  3 06:59 sbin
drwxr-xr-x  1 root root 4096 Nov  3 11:54 share
drwxr-xr-x  2 root root 4096 Apr 24  2018 src
supervisor@rtbrick:~$ du -sh /usr/
2.6G    /usr/
supervisor@rtbrick:~$
We then go even deeper, to check what takes up the most space in the /usr directory:
supervisor@rtbrick:~$ du -h /usr/ | sort -rh | head -5
2.6G    /usr/
1.8G    /usr/local
1.7G    /usr/local/lib
506M    /usr/lib
169M    /usr/share
We used du in conjunction with sort (option r - reverse the result - and option h - compare human readable numbers), as well as with head, to get only the biggest 5 directories from the output.
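If you only want a per-directory summary one level deep, du can also be limited in depth; this is an illustrative variant, assuming GNU du (which supports the --max-depth option):

supervisor@rtbrick:~$ sudo du -h --max-depth=1 / 2>/dev/null | sort -rh | head -10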