One day though the CPU spiked from 0.5 to 18 on a 12 core system on every node on that cluster, including the ones that were just passive fail-overs and that did not do any processing, just being a member of the same multicast domain. Nothing seemed out of the ordinary though, the traffic pattern was the same, no upgrades, updates or any system changes whatsoever. OSSEC did not report any file checksum changes, so I knew the systems were not compromised.
This is a perfect case to demonstrate a way to troubleshoot sudden and persistent CPU utilization spikes and how to look for memory leaks. Of course this will not apply to every case, but demonstrates the use of certain tool-set and the methodology behind approaching similar problems.
First I checked the load averages on the server:
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
root@server1:~# uptime | |
14:00:31 up 13 days, 17:55, 1 user, load average: 13.34, 13.59, 13.73 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
root@server1:~# ps -eo pcpu,pid,user,args | sort -b -k1,1n | |
... | |
10.6 30146 root /opt/nd/networkingd | |
92.0 18352 nobody /opt/nd/networkingd | |
92.0 18360 nobody /opt/nd/networkingd | |
92.0 18377 nobody /opt/nd/networkingd | |
92.1 18344 nobody /opt/nd/networkingd | |
92.1 18349 nobody /opt/nd/networkingd | |
92.1 18357 nobody /opt/nd/networkingd | |
92.1 18361 nobody /opt/nd/networkingd | |
92.1 18370 nobody /opt/nd/networkingd | |
92.2 18345 nobody /opt/nd/networkingd | |
92.2 18346 nobody /opt/nd/networkingd | |
92.2 18353 nobody /opt/nd/networkingd | |
92.2 18354 nobody /opt/nd/networkingd | |
... |
Next I performed a strace on that pid for about 10 seconds:
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
root@server1:~# strace -o pid18354_strace.log -p 18354 |
This returned lots of the following lines:
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
... | |
05:01:24.541575 read(11, 0x2dd3944, 12) = -1 EAGAIN (Resource temporarily unavailable) <0.000008> | |
... |
Then I checked to see what this file descriptor is:
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
root@server1:~# ls -la /proc/18354/fd | grep " 11 " | |
lrwx------ 1 root root 64 2013-06-07 05:08 11 -> socket:[2635620716] |
Got some more information about that particular socket:
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
root@server1:~# lsof -p 18354 | grep 2635620720 | |
networkingd 18354 nobody 11u unix 0xffff88039e7ff1c0 0t0 2635620720 socket |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
root@server1:~# pmap -d 18354 | grep mapped | awk '{print $4}' | |
450740K |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
root@server1:~# pmap -x 18354 | |
18354: /opt/nd/networkingd | |
Address Kbytes RSS Anon Locked Mode Mapping | |
0000000000400000 9448 - - - r-x-- networkingd | |
0000000000e3a000 76 - - - rwx-- networkingd | |
0000000000e4d000 1344 - - - rwx-- [ anon ] | |
0000000002bbf000 442288 - - - rwx-- [ anon ] | |
00007f9860000000 132 - - - rwx-- [ anon ] | |
00007f9860021000 65404 - - - ----- [ anon ] | |
00007f98677fc000 88 - - - r-x-- libresolv-2.11.1.so | |
00007f9867812000 2044 - - - ----- libresolv-2.11.1.so | |
00007f9867a11000 4 - - - r-x-- libresolv-2.11.1.so | |
00007f9867a12000 4 - - - rwx-- libresolv-2.11.1.so | |
00007f9867a13000 8 - - - rwx-- [ anon ] | |
00007f9867a15000 20 - - - r-x-- libnss_dns-2.11.1.so | |
00007f9867a1a000 2044 - - - ----- libnss_dns-2.11.1.so | |
00007f9867c19000 4 - - - r-x-- libnss_dns-2.11.1.so | |
00007f9867c1a000 4 - - - rwx-- libnss_dns-2.11.1.so | |
00007f9867c1f000 4 - - - rwxs- shared | |
00007f9867c20000 4 - - - rwxs- zero (deleted) | |
00007f9867c21000 4 - - - rwxs- status | |
00007f986ac2a000 17536 - - - r-xs- ip-to-location.bin | |
00007f986bd4a000 15648 - - - r-xs- base_locations.txt | |
00007f986cc92000 1028 - - - rwx-- [ anon ] | |
00007f986ce65000 4 - - - ----- [ anon ] | |
00007f986ce66000 3972 - - - rwx-- [ anon ] | |
00007f986d247000 98308 - - - rwxs- zero (deleted) | |
00007f9873248000 61444 - - - rwxs- zero (deleted) | |
00007f9876e49000 5242880 - - - rwxs- zero (deleted) | |
00007f99b6e49000 104692 - - - rwxs- zero (deleted) | |
00007f99bd486000 195316 - - - rwxs- zero (deleted) | |
00007f99c9343000 88 - - - r-x-- libgcc_s.so.1 | |
00007f99c9359000 2044 - - - ----- libgcc_s.so.1 | |
00007f99c9558000 4 - - - r-x-- libgcc_s.so.1 | |
00007f99c9559000 4 - - - rwx-- libgcc_s.so.1 | |
00007f99c955a000 48 - - - r-x-- libnss_files-2.11.1.so | |
00007f99c9566000 2044 - - - ----- libnss_files-2.11.1.so | |
00007f99c9765000 4 - - - r-x-- libnss_files-2.11.1.so | |
00007f99c9766000 4 - - - rwx-- libnss_files-2.11.1.so | |
00007f99c9767000 40 - - - r-x-- libnss_nis-2.11.1.so | |
00007f99c9771000 2044 - - - ----- libnss_nis-2.11.1.so | |
00007f99c9970000 4 - - - r-x-- libnss_nis-2.11.1.so | |
00007f99c9971000 4 - - - rwx-- libnss_nis-2.11.1.so | |
00007f99c9972000 92 - - - r-x-- libnsl-2.11.1.so | |
00007f99c9989000 2044 - - - ----- libnsl-2.11.1.so | |
00007f99c9b88000 4 - - - r-x-- libnsl-2.11.1.so | |
00007f99c9b89000 4 - - - rwx-- libnsl-2.11.1.so | |
00007f99c9b8a000 8 - - - rwx-- [ anon ] | |
00007f99c9b8c000 32 - - - r-x-- libnss_compat-2.11.1.so | |
00007f99c9b94000 2044 - - - ----- libnss_compat-2.11.1.so | |
00007f99c9d93000 4 - - - r-x-- libnss_compat-2.11.1.so | |
00007f99c9d94000 4 - - - rwx-- libnss_compat-2.11.1.so | |
00007f99c9d99000 5120 - - - rwxs- zero (deleted) | |
00007f99ca299000 1856512 - - - rwxs- zero (deleted) | |
00007f9a3b799000 10240 - - - rwxs- zero (deleted) | |
00007f9a3c199000 4 - - - rwxs- zero (deleted) | |
00007f9a3c19a000 4 - - - rwxs- conninfo | |
00007f9a3c19b000 40 - - - rwxs- tipgroups | |
00007f9a3c1a5000 15628 - - - rwxs- zero (deleted) | |
00007f9a3d0e8000 208 - - - rwxs- zero (deleted) | |
00007f9a3d11c000 772 - - - rwxs- locations | |
00007f9a3d1dd000 772 - - - rwxs- pools | |
00007f9a3d29e000 1796 - - - rwxs- monitors | |
00007f9a3d45f000 580 - - - rwxs- zero (deleted) | |
00007f9a3d4f0000 6152 - - - rwxs- zero (deleted) | |
00007f9a3daf2000 1588 - - - rwxs- zero (deleted) | |
00007f9a3dc7f000 2276 - - - rwxs- zero (deleted) | |
00007f9a3deb8000 180 - - - rwxs- zero (deleted) | |
00007f9a3dee5000 180 - - - rwxs- zero (deleted) | |
00007f9a3df12000 180 - - - rwxs- zero (deleted) | |
00007f9a3df3f000 5080 - - - rwxs- zero (deleted) | |
00007f9a3e435000 1524 - - - r-x-- libc-2.11.1.so | |
00007f9a3e5b2000 2044 - - - ----- libc-2.11.1.so | |
00007f9a3e7b1000 16 - - - r-x-- libc-2.11.1.so | |
00007f9a3e7b5000 4 - - - rwx-- libc-2.11.1.so | |
00007f9a3e7b6000 20 - - - rwx-- [ anon ] | |
00007f9a3e7bb000 520 - - - r-x-- libm-2.11.1.so | |
00007f9a3e83d000 2044 - - - ----- libm-2.11.1.so | |
00007f9a3ea3c000 4 - - - r-x-- libm-2.11.1.so | |
00007f9a3ea3d000 4 - - - rwx-- libm-2.11.1.so | |
00007f9a3ea3e000 8 - - - r-x-- libdl-2.11.1.so | |
00007f9a3ea40000 2048 - - - ----- libdl-2.11.1.so | |
00007f9a3ec40000 4 - - - r-x-- libdl-2.11.1.so | |
00007f9a3ec41000 4 - - - rwx-- libdl-2.11.1.so | |
00007f9a3ec42000 96 - - - r-x-- libpthread-2.11.1.so | |
00007f9a3ec5a000 2044 - - - ----- libpthread-2.11.1.so | |
00007f9a3ee59000 4 - - - r-x-- libpthread-2.11.1.so | |
00007f9a3ee5a000 4 - - - rwx-- libpthread-2.11.1.so | |
00007f9a3ee5b000 16 - - - rwx-- [ anon ] | |
00007f9a3ee5f000 28 - - - r-x-- librt-2.11.1.so | |
00007f9a3ee66000 2044 - - - ----- librt-2.11.1.so | |
00007f9a3f065000 4 - - - r-x-- librt-2.11.1.so | |
00007f9a3f066000 4 - - - rwx-- librt-2.11.1.so | |
00007f9a3f067000 128 - - - r-x-- ld-2.11.1.so | |
00007f9a3f092000 4 - - - ----- [ anon ] | |
00007f9a3f093000 28 - - - rwx-- [ anon ] | |
00007f9a3f09a000 324 - - - rwxs- zero (deleted) | |
00007f9a3f0eb000 1620 - - - rwx-- [ anon ] | |
00007f9a3f282000 4 - - - rwx-- [ anon ] | |
00007f9a3f283000 4 - - - rwxs- zero (deleted) | |
00007f9a3f284000 8 - - - rwx-- [ anon ] | |
00007f9a3f286000 4 - - - r-x-- ld-2.11.1.so | |
00007f9a3f287000 4 - - - rwx-- ld-2.11.1.so | |
00007f9a3f288000 4 - - - rwx-- [ anon ] | |
00007fffb7f5c000 132 - - - rwx-- [ stack ] | |
00007fffb7fff000 4 - - - r-x-- [ anon ] | |
ffffffffff600000 4 - - - r-x-- [ anon ] | |
---------------- ------ ------ ------ ------ | |
total kB 8196392 - - - |
After this I checked to see if we have enough file descriptors for connection allocation:
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
root@server1:~# sysctl fs.file-nr | |
fs.file-nr = 9760 0 3624316 | |
root@server1:~# cat /proc/sys/fs/file-max | |
3624316 |
Next I checked to see how much memory the kernel is configured to dedicate to TCP:
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
root@server1:~# cat /proc/sys/net/ipv4/tcp_mem | |
3480768 4641024 6961536 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
root@server1:~# cat /proc/net/sockstat | |
sockets: used 8012 | |
TCP: inuse 7751 orphan 529 tw 14095 alloc 8209 mem 9413 | |
UDP: inuse 39 mem 11 | |
UDPLITE: inuse 0 | |
RAW: inuse 1 | |
FRAG: inuse 0 memory 0 |
Then I checked to see if there are too many orphaned sockets:
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
root@server1:~# cat /proc/sys/net/ipv4/tcp_max_orphans | |
262144 |
At this point I wanted to see the different states of the INET sockets so I wrote this quick script that shows the 4 important TCP states - ESTABLISHED, SYN_RECEIVED, TIME_WAIT and LAST_ACK and show me the top 5 users for them:
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
usage() | |
{ | |
cat << EOF | |
usage: $0 options | |
OPTIONS: | |
-c top ips | |
EOF | |
} | |
EXCLUDEIP=`echo $(cat /etc/network/interfaces | grep address | awk '{print $2}') | sed 's/ /|/g'` | |
count=5 | |
seq=$(seq 0 2 10) | |
while getopts “hc:†OPTION | |
do | |
case $OPTION in | |
h) | |
usage | |
exit | |
;; | |
c) | |
count=$OPTARG | |
seq=$(seq 0 2 $(expr $count \* 2)) | |
;; | |
esac | |
done | |
TIME_WAIT=$(ss -n state time-wait | awk '{print $3}' | awk -F":" '{print $1}' | sort | uniq -c | sort -rn | egrep -vwi "$EXCLUDEIP" | head -n $count) | |
ESTABLISHED=$(ss -n state established | awk '{print $3}' | awk -F":" '{print $1}' | sort | uniq -c | sort -rn | egrep -vwi "$EXCLUDEIP" | head -n $count) | |
SYN_RECEIVED=$(ss -n state syn-recv | awk '{print $3}' | awk -F":" '{print $1}' | sort | uniq -c | sort -rn | egrep -vwi "$EXCLUDEIP" | head -n $count) | |
LAST_ACK=$(ss -n state last-ack | awk '{print $3}' | awk -F":" '{print $1}' | sort | uniq -c | sort -rn | egrep -vwi "$EXCLUDEIP" | head -n $count) | |
let i=0 | |
for num in $TIME_WAIT | |
do | |
array_timewait[i]="$num" | |
let i=i+1 | |
done | |
let i=0 | |
for num in $ESTABLISHED | |
do | |
array_established[i]="$num" | |
let i=i+1 | |
done | |
let i=0 | |
for num in $SYN_RECEIVED | |
do | |
array_synreceived[i]="$num" | |
let i=i+1 | |
done | |
let i=0 | |
for num in $LAST_ACK | |
do | |
array_lastack[i]="$num" | |
let i=i+1 | |
done | |
echo -e "ESTABLISHED SYN_RECEIVED TIME_WAIT LAST_ACK" | |
echo -e "=======================================================================================" | |
for i in $seq | |
do | |
echo -n "${array_established[$i]} " | |
echo -n -e "${array_established[$i+1]}\t" | |
echo -n "${array_synreceived[$i]} " | |
echo -n -e "${array_synreceived[$i+1]}\t" | |
echo -n "${array_timewait[$i]} " | |
echo -n -e "${array_timewait[$i+1]}\t" | |
echo -n "${array_lastack[$i]} " | |
echo -e "${array_lastack[$i+1]}" | |
done | column -t |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
ESTABLISHED SYN_RECEIVED TIME_WAIT LAST_ACK | |
======================================================================================= | |
567 XXX.56.48.163 35 XXX.56.52.28 2169 XXX.56.48.163 33 XXX.57.214.223 | |
402 XXX.57.214.223 33 XXX.57.214.223 1350 XXX.56.4.156 30 XXX.57.204.205 | |
314 XXX.56.4.156 14 XXX.56.49.195 841 XXX.57.204.205 13 XXX.56.4.46 | |
268 XXX.57.204.205 13 XXX.57.204.205 783 XXX.56.52.28 12 XXX.56.48.163 | |
213 XXX.57.214.133 11 XXX.56.48.163 611 XXX.56.52.53 9 XXX.56.4.54 |
At this point I ran out of ideas as to what to look for. I knew it's the networking software that was causing the high load and memory consumption as shown by the process list (or top), but since the software is closed source I could not go and look at the actual code.
The last thing I tried though was to start gathering memory consumption for all running processes and their children every minute for 5 days, by running the following through cron and storing it in a file along with the PIDs:
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
root@server1:~# pmap -d $pid | grep mapped | awk '{print $4}'` | sed s/.$// |
I sent the results to the product vendor and sure enough they found 2 bugs in their code, one that was not cleaning sockets and a memory leak.
Hope this was helpful.