Automatically Restart a Service if it Crashes
Problem:
We all run into services that die at random now and then. The proper fix, such as applying a patch, rebuilding the server, or waiting for a service window, can take some time. Being able to quickly put a watchdog in place for that service makes our life as admins so much better. The following solution is simple, quick, and works in most cases. I have it in production use right now with very good results.
Solution:
The solution I use isn’t my own invention, but I really like its simplicity. It’s basically just a shell script called from cron. The script checks whether the service is running and starts it again if it has crashed. It saves the users on our network loads of grief.
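To call the script from cron, an entry along these lines in root’s crontab (edited with crontab -e) would do. The five-minute interval and the path /usr/local/sbin/service-watchdog.sh are assumptions for illustration only; use whatever name and location you saved the script under.

```
# min  hour  day  month  weekday  command
*/5    *     *    *      *        /usr/local/sbin/service-watchdog.sh
```

A shorter interval gets the service back sooner after a crash, at the cost of running the check more often.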
Example:
This is what a sample script for LUM on SLED10 looks like:
#!/bin/bash
MYPROC=namcd   # The name of the process
INITS=namcd    # The name of the /etc/init.d/ file
# Count the running instances of $MYPROC. If it is not running, this gives 0.
COUNT=$(UNIX95=1 ps -C "$MYPROC" -o pid= -o args= | wc -l)
if [ "$COUNT" -lt 1 ]   # Checks whether the service appears to be running or not.
then
    /etc/init.d/$INITS start   # The command to start the service
fi
If we want to check for an open port, we get a script that looks like this:
#!/bin/bash
PORT=:445     # The port. The : makes it easy to snag only ports and not other numbers in the output.
INITS=samba   # The name of the service in /etc/init.d/
# Count the listening sockets that match $PORT. If nothing is listening, this gives 0.
COUNT=$(netstat -lpn | grep "$PORT" | wc -l)
if [ "$COUNT" -lt 1 ]
then
    /etc/init.d/$INITS start
fi
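One caveat with a plain grep for :445 is that it also matches longer port numbers such as :4450. If that matters in your setup, a stricter check could look something like the sketch below. The function name port_is_listening is made up for this example, and it assumes the usual netstat layout where the local address is the fourth column.

```shell
#!/bin/bash
# Sketch only: port_is_listening is a hypothetical helper name.
# Reads netstat-style lines on stdin and succeeds (exit status 0) only if
# the local address in column 4 ends in exactly :PORT.
port_is_listening() {
    awk -v port="$1" '$4 ~ (":" port "$") { found = 1 } END { exit !found }'
}
```

It would be used in place of the grep and wc -l pair, for example: if ! netstat -ltn | port_is_listening 445; then /etc/init.d/$INITS start; fi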
We can also change the actions taken when we find out the service isn’t running. For example with GroupWise we probably want to add a command after “then” to remove the leftover pid file:
rm /var/run/novell/groupwise/pidfile.pid
Here pidfile.pid stands for the pid file of the agent that has crashed. If the stale pid file is left in place, the agent won’t restart.
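Put together, a GroupWise variant might look something like the following sketch. The function names are made up for this example, the process name gwpoa and the init script /etc/init.d/grpwise in the sample call are assumptions to adjust for your agent, and pidfile.pid is still a placeholder for the real pid file name.

```shell
#!/bin/bash
# Sketch only: function names are hypothetical, not part of any Novell tool.

# Succeeds when at least one process with the given name is running.
is_running() {
    [ "$(UNIX95=1 ps -C "$1" -o pid= | wc -l)" -ge 1 ]
}

# Starts the service only if its process is gone, removing the stale pid file first.
# $1 = process name, $2 = init script, $3 = pid file
watchdog() {
    if ! is_running "$1"
    then
        rm -f "$3"      # Remove the leftover pid file so the agent can start
        "$2" start      # The command to start the service
    fi
}

# Example call (assumed names, pidfile.pid is a placeholder):
# watchdog gwpoa /etc/init.d/grpwise /var/run/novell/groupwise/pidfile.pid
```

The pid file is only removed when the process is really gone, so a healthy agent is left alone.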
Environment:
These scripts should work on any SUSE version.