Saturday, October 25, 2014

Easily Managing a Radius Server with a Linux Shell Script


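Our company's wireless network uses MAC address authentication: each MAC address is bound in the Radius users configuration file, and a registered MAC address serves as both the username and the password. To make these MAC addresses easier to manage, I wrote a shell script.

The shell's strong text-processing capabilities, combined with its many commands and functions, make an administrator's work much easier.

The script's features are listed below for reference: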

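  1. Add a MAC address

  2. Delete a MAC address

  3. Find a MAC address

  4. Remove duplicate MAC addresses

  5. Check that a MAC address is valid

  6. TODO: import and export MAC addresses, add comments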
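As a small illustration of feature 4, duplicate entries can be dropped with a single awk expression. This is only a sketch, not the approach the script below takes: the helper name `dedupe_entries` is hypothetical, the entry format is copied from the script, and the demo runs on a throwaway file rather than the real users file.

```shell
#!/bin/bash
# Hypothetical helper: print FILE with repeated entry lines removed.
# Blank lines and comment lines are always kept; for every other line only
# the first occurrence is printed, in the original order.
dedupe_entries() {
    awk 'NF == 0 || /^#/ || !seen[$0]++' "$1"
}

# Demo on a temporary file; the entry format mirrors the script in this post.
demo=$(mktemp)
printf '%s\n' \
    "001122aabbcc    Cleartext-Password :='001122aabbcc'" \
    "aabbccddeeff    Cleartext-Password :='aabbccddeeff'" \
    "001122aabbcc    Cleartext-Password :='001122aabbcc'" > "$demo"
dedupe_entries "$demo"
rm -f "$demo"
```

The demo prints the two distinct entries once each. Writing the result to a new file and comparing it with the original before replacing anything would be the cautious way to use it.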

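The shell scripting techniques used include, but are not limited to:

  1. Column and line processing of text files, with commands such as sed and awk

  2. String searching, filtering, and case conversion, with bash, grep, and similar commands

  3. Getting, computing, and comparing string lengths, with bash, wc, and similar commands

  4. Regular expression matching and format conversion of MAC addresses

  5. Shell programming constructs, including file inclusion (sourcing), functions, argument passing, and return values

  6. Others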
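Techniques 2 to 4 can be combined in one small function. The sketch below is an assumption-laden alternative to the sed and grep pipeline used in the script: the name `normalize_mac` is hypothetical, and it relies on bash's built-in `[[ =~ ]]` regex matching instead of external commands.

```shell
#!/bin/bash
# Hypothetical helper: normalize a MAC address to 12 lowercase hex digits.
# Accepts aa:bb:cc:dd:ee:ff, AA-BB-CC-DD-EE-FF or aabbccddeeff.
# Prints the normalized form and returns 0, or returns 1 for invalid input.
normalize_mac() {
    local raw=$1
    local hex='[0-9A-Fa-f]{2}'
    local colon="^${hex}(:${hex}){5}\$"
    local dash="^${hex}(-${hex}){5}\$"
    local plain='^[0-9A-Fa-f]{12}$'
    if [[ $raw =~ $colon || $raw =~ $dash || $raw =~ $plain ]]; then
        raw=${raw//[-:]/}                          # strip the separators
        printf '%s\n' "$raw" | tr '[:upper:]' '[:lower:]'
    else
        echo "invalid mac address: $raw" >&2
        return 1
    fi
}

normalize_mac 'AA-BB-CC-DD-EE-FF'   # prints aabbccddeeff
```

Because the three patterns are anchored, mixed separators such as `00:1a-2b:3c:4d:5e` are rejected instead of being silently accepted.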

Code example:

  #!/bin/bash
  #
  # Source function library.
  . /etc/rc.d/init.d/functions

  RADIUSD=/usr/sbin/radiusd
  LOCKF=/var/lock/subsys/radiusd
  CONFIG=/etc/raddb/radiusd.conf
  USERCONFIG=/etc/raddb/users

  [ -f $RADIUSD ] || exit 0
  [ -f $CONFIG ] || exit 0
  [ -f $USERCONFIG ] || exit 0

  RETVAL=0
  OPERATION=$1
  MACADDRESS=$2

  # Print usage information and exit.
  function help()
  {
          clear
          echo $""
          echo $"===================================================================================="
          echo $"For Radius on Fedora/CentOS/RedHat Linux Server, Written by Chris"
          echo $"===================================================================================="
          echo $"A tool to manage Radius server"
          echo $""
          echo $"Usage: $0 {find|add|modify|delete|check|remove|start|stop|status|restart|reload} mac"
          #TODO
          echo $"Usage: $0 {import|export|debug}"
          echo $""
          echo $"For more information please contact dgdenterprise@gmail.com"
          echo $"===================================================================================="
          echo $""
          exit 1
  }

  # Validate $MACADDRESS and print it as 12 lowercase hex digits.
  # Diagnostics go to stderr so that callers using MAC=`mac` capture only the result.
  function mac()
  {
          if [ -z "$MACADDRESS" ];then
                  echo $"no mac address is signed! " >&2
                  echo $"\$2 is $MACADDRESS" >&2
                  exit 1
          else
                  if [[ "${#MACADDRESS}" != "12" ]] && [[ "${#MACADDRESS}" != "17" ]] ;then
                          echo "mac length is ${#MACADDRESS}" >&2
                          echo "mac address is illegal! " >&2
                          exit 1
  #                else
  #                        echo $"mac which you input is $MACADDRESS"
                  fi
                  #echo $MACADDRESS | sed -nr '/[A-Fa-f0-9]{2}:[A-Fa-f0-9]{2}:[A-Fa-f0-9]{2}:[A-Fa-f0-9]{2}:[A-Fa-f0-9]{2}:[A-Fa-f0-9]{2}/p'
                  #echo $MACADDRESS | sed -nr '/[A-Fa-f0-9]{2}-[A-Fa-f0-9]{2}-[A-Fa-f0-9]{2}-[A-Fa-f0-9]{2}-[A-Fa-f0-9]{2}-[A-Fa-f0-9]{2}/p'
                  #echo $MACADDRESS | sed -nr '/[A-Fa-f0-9]{12}/p'
                  if [[ `echo $MACADDRESS | grep -` ]];then
                          PROMAC=`echo $MACADDRESS | sed -nr '/[A-Fa-f0-9]{2}-[A-Fa-f0-9]{2}-[A-Fa-f0-9]{2}-[A-Fa-f0-9]{2}-[A-Fa-f0-9]{2}-[A-Fa-f0-9]{2}/p' | tr '[:upper:]' '[:lower:]' | sed 's/-//g'`
                  elif [[ `echo $MACADDRESS | grep :` ]];then
                          PROMAC=`echo $MACADDRESS | sed -nr '/[A-Fa-f0-9]{2}:[A-Fa-f0-9]{2}:[A-Fa-f0-9]{2}:[A-Fa-f0-9]{2}:[A-Fa-f0-9]{2}:[A-Fa-f0-9]{2}/p' | tr '[:upper:]' '[:lower:]' | sed 's/://g'`
                  else
                          # No separator: accept exactly 12 hex digits.
                          PROMAC=`echo $MACADDRESS | sed -nr '/^[A-Fa-f0-9]{12}$/p' | tr '[:upper:]' '[:lower:]'`
                  fi
                  # sed prints nothing when the pattern does not match.
                  if [ -z "$PROMAC" ];then
                          echo "mac address is illegal! " >&2
                          exit 1
                  fi
                  echo $PROMAC
          fi
  }

  # Look up the mac in the users file; exit 1 if it is missing or duplicated.
  function find()
  {
          # `exit` inside `mac` only leaves the subshell, so check its status here.
          MAC=`mac` || exit 1
          echo $"accepted mac is $MAC"
          if [[ `grep $MAC $USERCONFIG` ]]; then
                  MACLINE=`grep -n $MAC $USERCONFIG | awk -F ':' '{print $1}'`
                  #echo $MACLINE
                  MACLINECOUNT=$(echo $MACLINE | wc -w)
                  #echo $MACLINECOUNT
                  if [[ "$MACLINECOUNT" != "1" ]];then
                          echo $"ERROR, this mac $MAC has duplicate record, you should use $0 remove $MAC to remove duplicate record"
                          exit 1
                  fi
                  echo $"Successfully find $MAC in $MACLINE line of file $USERCONFIG! "
                  echo
                  RETVAL=$?
          else
                  echo $"Can not find $MAC in file $USERCONFIG! "
                  echo
                  exit 1
          fi
  }

  # Append the mac after the first non-comment Cleartext-Password entry, then restart radiusd.
  function add()
  {
          MAC=`mac` || exit 1
          echo $"accepted mac is $MAC"
          #find $MAC
          LINENUM=`grep -n "Cleartext-Password :='" $USERCONFIG | grep -v \# | head -n1 | awk -F ":" '{print $1}'`
          SEDOPERATION=$LINENUM"a"
          sed -i "$SEDOPERATION $MAC    Cleartext-Password :='$MAC'" $USERCONFIG
          find $MAC
          restart
  }

  function modify()
  {
          MAC=`mac` || exit 1
          find $MAC
          #TODO
  }

  # Delete the line containing the mac from the users file.
  function delete()
  {
          MAC=`mac` || exit 1
          echo $"accepted mac is $MAC"
          if [[ `grep $MAC $USERCONFIG` ]]; then
                  MACLINE=`grep -n $MAC $USERCONFIG | awk -F ':' '{print $1}'`
                  ##echo $MACLINE
                  #MACLINECOUNT=$(echo $MACLINE | wc -w)
                  ##echo $MACLINECOUNT
                  #if [[ "$MACLINECOUNT" != "1" ]];then
                  #        echo $"ERROR, this mac $MAC has duplicate record, you should use $0 remove $MAC to remove duplicate record"
                  #        exit 1
                  #fi
                  echo $"Successfully find $MAC in $MACLINE line of file $USERCONFIG! "
                  echo $"It will be deleted! "
                  sed -i "$MACLINE d" $USERCONFIG
                  #TODO
                  echo $"If you see 'Can not find $MAC in file $USERCONFIG! ', it means successfully! "
                  find $MAC
                  echo
                  RETVAL=$?
          else
                  echo $"Can not find $MAC in file $USERCONFIG! "
                  echo
                  RETVAL=$?
          fi
  }

  function check()
  {
          MAC=`mac` || exit 1
          find $MAC
          remove $MAC
  }

  # Remove duplicate records of the mac, then add a single record back.
  function remove()
  {
          MAC=`mac` || exit 1
          echo $"accepted mac is $MAC"
          #TODO
          #echo $"backuped file to file $FILENAME"
          if [[ `grep $MAC $USERCONFIG` ]]; then
                  MACLINE=`grep -n $MAC $USERCONFIG | awk -F ':' '{print $1}'`
                  #echo $MACLINE
                  MACLINECOUNT=$(echo $MACLINE | wc -w)
                  #echo $MACLINECOUNT
                  if [[ "$MACLINECOUNT" == "1" ]];then
                          echo $"WARNING, this mac $MAC is good record, no duplicate record has found! "
                          exit 0
                  fi
                  TOREMOVE="$MAC    Cleartext-Password :='$MAC'"
                  sed -i "/^$TOREMOVE$/d" $USERCONFIG
                  add $MAC
          fi
  }

  # start and stop are dispatched below but were not defined in the original
  # listing; they are filled in here by analogy with restart.
  function start()
  {
          service radiusd start
  }

  function stop()
  {
          service radiusd stop
  }

  function restart()
  {
          service radiusd restart
  }

  function reload()
  {
          service radiusd reload
  }

  function status()
  {
          service radiusd status
  }

  case "$1" in
          find)
                  find
                  RETVAL=$?
          ;;
          add)
                  add
                  RETVAL=$?
          ;;
          modify)
                  modify
                  RETVAL=$?
          ;;
          delete)
                  delete
                  RETVAL=$?
          ;;
          check)
                  check
                  RETVAL=$?
          ;;
          remove)
                  remove
                  RETVAL=$?
          ;;
          start)
                  start
                  RETVAL=$?
          ;;
          stop)
                  stop
                  RETVAL=$?
          ;;
          status)
                  status
                  RETVAL=$?
          ;;
          restart)
                  restart
                  RETVAL=$?
          ;;
          reload)
                  reload
                  RETVAL=$?
          ;;
          *)
                  help
                  exit 1
          ;;
  esac

  exit $RETVAL

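There is still room for improvement, for example a different approach to some steps or a better user experience. Suggestions are welcome.

This article comes from the blog "通信,我的最爱"; please keep this attribution: http://dgd2010.blog.51cto.com/1539422/1567085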
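One possible improvement, offered only as a suggestion: the script looks up a MAC with a plain `grep`, which also matches the address inside comments or inside longer strings. A stricter lookup could compare just the first field of each line. The helper name `count_mac_entries` below is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: count the entries in FILE whose first field is exactly MAC.
# Appending "" forces awk to compare strings; otherwise all-digit values such as
# 001122334455 and 1122334455 would be compared as numbers and treated as equal.
count_mac_entries() {
    local mac=$1 file=$2
    awk -v mac="$mac" '($1 "") == (mac "") { n++ } END { print n + 0 }' "$file"
}

# Demo on a temporary file; the entry format mirrors the script above.
demo=$(mktemp)
printf '%s\n' \
    "aabbccddeeff    Cleartext-Password :='aabbccddeeff'" \
    "# aabbccddeeff was registered twice by mistake" \
    "aabbccddeeff    Cleartext-Password :='aabbccddeeff'" > "$demo"
count_mac_entries aabbccddeeff "$demo"   # prints 2
rm -f "$demo"
```

A count of 0 means the address is not registered, 1 means a clean record, and anything larger means duplicates, which is the same three-way distinction the find and remove functions make.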

Gaps in Zabbix graphs caused by a sudo bug

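In production we use Zabbix's host update to check whether monitoring values are complete (for how host update is implemented, see http://caiguangguang.blog.51cto.com/1652935/1345789).

We kept finding that, after a while, the update value of some machines would inexplicably drop. We had never found the root cause and simply restarted the agent to fix it. Recently a careful colleague noticed that it might be related to a sudo bug.

Let's go back and verify the whole troubleshooting process.

1. Query the Zabbix database for the items that lost data, listing the items whose values are missing (not updated for 20 minutes)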

  select b.key_, b.lastvalue, from_unixtime(b.lastclock)
  from hosts a, items b
  where a.hostid = b.hostid
    and a.host = 'xxxxxx'
    and b.lastclock < (unix_timestamp() - 1200)
  limit 10;

For example, take agent.ping here:

The monitoring graph shows that data is missing after 18:20:

[Image wKiom1RH0cKSp6HeAAJbnn76bS0678.jpg: monitoring graph for agent.ping, with data missing after 18:20]

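2. Analyze the log on the Zabbix agent side

Around 18:24 the log below appears. There is no sign of values being fetched and sent as usual, only a large number of update_cpustats entries, plus one line reporting that killing the command failed: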

  27589:20141021:182442.143 In zbx_popen() command:'sudo hadoop_stats.sh nodemanager StopContainerAvgTime'
  27589:20141021:182442.143 End of zbx_popen():5
  48430:20141021:182442.143 zbx_popen(): executing script
  27585:20141021:182442.284 In update_cpustats()
  27585:20141021:182442.285 End of update_cpustats()
  27585:20141021:182443.285 In update_cpustats()
  27585:20141021:182443.286 End of update_cpustats()
  27585:20141021:182444.286 In update_cpustats()
  27585:20141021:182444.287 End of update_cpustats()
  27585:20141021:182445.287 In update_cpustats()
  27585:20141021:182445.287 End of update_cpustats()
  27585:20141021:182446.288 In update_cpustats()
  27585:20141021:182446.288 End of update_cpustats()
  ..........
  27585:20141021:182508.305 In update_cpustats()
  27585:20141021:182508.305 End of update_cpustats()
  27585:20141021:182509.306 In update_cpustats()
  27585:20141021:182509.306 End of update_cpustats()
  27585:20141021:182510.306 In update_cpustats()
  27585:20141021:182510.307 End of update_cpustats()
  27585:20141021:182511.307 In update_cpustats()
  27585:20141021:182511.308 End of update_cpustats()
  27589:20141021:182512.154 failed to kill [sudo hadoop_stats.sh nodemanager StopContainerAvgTime]: [1] Operation not permitted
  27589:20141021:182512.155 In zbx_waitpid()
  27585:20141021:182512.308 In update_cpustats()
  27585:20141021:182512.309 End of update_cpustats()
  27585:20141021:182513.309 In update_cpustats()
  27585:20141021:182513.309 End of update_cpustats()

Compare this with the normal log:

  27589:20141021:180054.376 In zbx_popen() command:'sudo hadoop_stats.sh nodemanager StopContainerAvgTime'
  27589:20141021:180054.377 End of zbx_popen():5
  18798:20141021:180054.377 zbx_popen(): executing script
  27589:20141021:180054.384 In zbx_waitpid()
  27589:20141021:180054.384 zbx_waitpid() exited, status:1
  27589:20141021:180054.384 End of zbx_waitpid():18798
  27589:20141021:180054.384 Run remote command [sudo  hadoop_stats.sh nodemanager StopContainerAvgTime] Result [2] [-1]...
  27589:20141021:180054.384 For key [hadoop_stats[nodemanager,StopContainerAvgTime]] received value [-1]
  27589:20141021:180054.384 In process_value() key:'gd6g203s80-hadoop-datanode.idc.vipshop.com:hadoop_stats[nodemanager,StopContainerAvgTime]' value:'-1'
  27589:20141021:180054.384 In send_buffer() host:'10.200.100.28' port:10051 values:37/50
  27589:20141021:180054.384 Will not send now. Now 1413885654 lastsent 1413885654 < 1
  27589:20141021:180054.385 End of send_buffer():SUCCEED
  27589:20141021:180054.385 buffer: new element 37
  27589:20141021:180054.385 End of process_value():SUCCEED

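In the normal case the script returns a value; when the problem occurs it returns nothing. Moreover, because the script is run through sudo, the Zabbix agent, which is started by an ordinary user, cannot kill the command when it times out (the "Operation not permitted" error).

3. Assume the ordinary user that starts the Zabbix agent is apps, and look at the current state of the script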
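To check quickly whether other hosts are affected, the agent log can be searched for the same "failed to kill" message. The following is a rough sketch: the helper name `failed_kills` is hypothetical, the log path is whatever your agent is configured to write to, and the message format is assumed to match the log line quoted above.

```shell
#!/bin/bash
# Hypothetical helper: list the commands the agent failed to kill,
# together with how many times each one appears in the given log file.
failed_kills() {
    grep -F 'failed to kill [' "$1" |
        sed -E 's/.*failed to kill \[(.*)\]: \[[0-9]+\] .*/\1/' |
        sort | uniq -c
}

# Demo on a temporary file holding two lines taken from the log above.
demo=$(mktemp)
printf '%s\n' \
    '27585:20141021:182511.308 End of update_cpustats()' \
    '27589:20141021:182512.154 failed to kill [sudo hadoop_stats.sh nodemanager StopContainerAvgTime]: [1] Operation not permitted' > "$demo"
failed_kills "$demo"
rm -f "$demo"
```

Each output line is a count followed by the command that could not be killed, so a command that shows up repeatedly is a good candidate for closer inspection.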

  ps -ef|grep hadoop_stats.sh
  root     34494 31429  0 12:54 pts/0    00:00:00 grep 48430
  root     48430 27589  0 Oct21 ?        00:00:00 sudo hadoop_stats.sh nodemanager StopContainerAvgTime
  root     48431 48430  0 Oct21 ?        00:00:00 [hadoop_stats.sh] <defunct>

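As shown, a zombie process has appeared (on zombie processes, see http://en.wikipedia.org/wiki/Zombie_process).

A zombie process arises when a child process finishes and SIGCHLD is sent to its parent, but the parent does not handle the signal properly, that is, it never calls wait to reap the child.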

  You have killed the process, but a dead process doesn't disappear from the process table
  until its parent process performs a task called "reaping" (essentially calling wait(3)
  for that process to read its exit status). Dead processes that haven't been reaped are
  called "zombie processes."

  The parent process id you see for 31756 is process id 1, which always belongs to init.
  That process should reap its zombie processes periodically, but if it can't, they will
  remain zombies in the process table until you reboot.
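The mechanism described in the quote can be reproduced in a few lines. This is a sketch under assumptions: a Linux host with a procps-style `ps`, and sleep durations chosen arbitrarily. It is unrelated to the Zabbix setup and only shows how an unreaped child turns into a zombie.

```shell
#!/bin/bash
# Start a parent that never reaps its child: bash launches a short-lived
# "sleep 1" in the background, then exec replaces bash itself with
# "sleep 10", which never calls wait() on the child it inherited.
bash -c 'sleep 1 & exec sleep 10' &
parent=$!

sleep 2   # give the child time to exit

# The exited child should now be listed with state Z under the parent.
zombies=$(ps -e -o pid= -o ppid= -o stat= -o comm= |
    awk -v p="$parent" '$2 == p && $3 ~ /^Z/')
echo "zombie children of $parent:"
echo "$zombies"

# Killing the parent hands the zombie over to init, which reaps it.
kill "$parent"
wait "$parent" 2>/dev/null || true
```

In this demo the parent belongs to the same user, so it can be killed. The case in this article differs in exactly that point.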

With a normal process, if we attach strace to the parent and then kill the child, we see the following:

  Process 3036 attached - interrupt to quit
  select(6, [5], [], NULL, NULL)          = ? ERESTARTNOHAND (To be restarted)
  --- SIGCHLD (Child exited) @ 0 (0) ---
  rt_sigreturn(0x11)                      = -1 EINTR (Interrupted system call)
  wait4(3037, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGTERM}], WNOHANG|WSTOPPED, NULL) = 3037
  exit_group(143)                         = ?
  Process 3036 detached

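Once a zombie exists, killing its parent turns the zombie into an orphan (its parent becomes init), which init can then reap.

Here, however, the script was started through sudo, so the processes run as root and the apps user has no permission to kill the command it started. As a result the child process remains a zombie.

4. Look at the processes started on the Zabbix agent side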
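To see at a glance which parent is holding a zombie and which user owns that parent, the `ps -ef` output can be post-processed. The helper name `defunct_with_parents` is hypothetical, and the sketch assumes the usual `ps -ef` column order (UID, PID, PPID, ...).

```shell
#!/bin/bash
# Hypothetical helper: read `ps -ef` output on stdin and, for every
# <defunct> process, print its PID, its parent PID and the parent's owner.
defunct_with_parents() {
    awk '
        NR == 1 && $2 == "PID" { next }        # skip the header row if present
        { owner[$2] = $1; rows[++n] = $0 }     # remember the owner of every PID
        END {
            for (i = 1; i <= n; i++) {
                if (rows[i] !~ /<defunct>/) continue
                split(rows[i], f, " ")
                who = (f[3] in owner) ? owner[f[3]] : "unknown"
                printf "zombie %s parent %s owner %s\n", f[2], f[3], who
            }
        }'
}

# Demo with two of the ps rows quoted in this article.
defunct_with_parents <<'EOF'
root     48430 27589  0 Oct21 ?        00:00:00 sudo hadoop_stats.sh nodemanager StopContainerAvgTime
root     48431 48430  0 Oct21 ?        00:00:00 [hadoop_stats.sh] <defunct>
EOF
# prints: zombie 48431 parent 48430 owner root
```

On a live host the same function can be fed with `ps -ef | defunct_with_parents`. If the reported owner is root while the agent runs as apps, the agent cannot signal that parent, which matches the "Operation not permitted" line in the log.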

  ps -ef|grep zabbix
  apps     27583     1  0 Sep09 ?        00:00:00 /apps/svr/zabbix/sbin/zabbix_agentd -c /apps/conf/zabbix_agentd.conf
  apps     27585 27583  0 Sep09 ?        00:33:25 /apps/svr/zabbix/sbin/zabbix_agentd -c /apps/conf/zabbix_agentd.conf
  apps     27586 27583  0 Sep09 ?        00:00:14 /apps/svr/zabbix/sbin/zabbix_agentd -c /apps/conf/zabbix_agentd.conf
  apps     27587 27583  0 Sep09 ?        00:00:14 /apps/svr/zabbix/sbin/zabbix_agentd -c /apps/conf/zabbix_agentd.conf
  apps     27588 27583  0 Sep09 ?        00:00:14 /apps/svr/zabbix/sbin/zabbix_agentd -c /apps/conf/zabbix_agentd.conf
  apps     27589 27583  0 Sep09 ?        02:28:12 /apps/svr/zabbix/sbin/zabbix_agentd -c /apps/conf/zabbix_agentd.conf
  root     34207 31429  0 12:54 pts/0    00:00:00 grep zabbix
  root     48430 27589  0 Oct21 ?        00:00:00 sudo /apps/sh/zabbix_scripts/hadoop/hadoop_stats.sh nodemanager StopContainerAvgTime

With strace we find that process 27589 has been waiting for process 48430 all along:

  strace  -p 27589
  Process 27589 attached - interrupt to quit
  wait4(48430, ^C <unfinished ...>
  Process 27589 detached
