Nagios: Host and Service Monitoring Tool - quick overview and basic installation guide
What is Nagios?

This is a sample screen from our installation (service problems view).

Basically, Nagios (http://www.nagios.org) is an open source host, service and
network monitoring tool. It lets you monitor different types of services and
hosts running on different operating systems such as Linux, NetWare, Windows
and AIX. It is flexible in its configuration and can be extended as much as
you want. It is configured in text files and managed with a web browser.

When you do a basic installation you get a set of Nagios check programs that
let you start monitoring your first hosts and services in less than an hour,
installation included.

Based on that default configuration you can start extending the configuration
to your special needs.

You can use the existing check programs or add more from
http://www.monitoringexchange.org, where other developers offer many of their
check programs for download. Or just write your own in any programming
language that is available on Linux. For each service you can set up time
frames for the monitoring process as well as notifications when alarms arise.
You can, but you do not have to. It is always your choice, and that is what is
great about Nagios. If you do not want to set up any notifications, ignore
that part of the configuration; you can always watch the service problems in
the web browser. That is what I currently do. It is good to come to the office
in the morning, take one look at the Nagios service problems view and know
what is going on. And when I do an eDirectory migration in our tree of 260+
NetWare servers, Nagios is open in the background to show which servers are
not reachable.
Here is a list of the check programs in the base system:

check_breeze      check_http           check_nt        check_ssh
check_by_ssh      check_icmp           check_ntp       check_swap
check_dhcp        check_ifoperstatus   check_nwstat    check_tcp
check_dig         check_ifstatus       check_oracle    check_time
check_disk        check_imap           check_overcr    check_udp
check_disk_smb    check_ircd           check_ping      check_udp2
check_dns         check_ldaps          check_pop       check_ups
check_dummy       check_load           check_procs     check_users
check_file_age    check_log            check_radius    check_wave
check_flexlm      check_mailq          check_real      negate
check_fping       check_mrtg           check_rpc       urlize
check_ftp         check_mrtgtraf       check_sensors   utils.pm
check_hpjd        check_nagios         check_smtp      utils.sh
                  check_nntp           check_snmp

At http://www.monitoringexchange.org you will find many more if you are
missing something here.
Here are some things we wanted to solve with Nagios:

We are running a lot of different management solutions (ZEN for server, IBM
Tivoli, HP / Compaq System Insight Manager, McAfee ePolicy Orchestrator, ...),
and all of them are very strong and powerful in some areas. The problem is
that some of them are really time consuming to install, configure and
administer. It is difficult to add your own services and hard to get a single
view of all current services and the problems that exist.

So I went looking for a solution that is quite easy to deploy but powerful
enough to handle all our requirements. Of course I will try to retire some of
the other management solutions in our company in favour of it, but I am sure
it is not possible to drop them all. A combination of them will be the best
for us.
Here are some examples (besides the usual monitoring requirements such as CPU
and file systems) that I had to deal with during day-to-day administration
and that should be possible to automate with Nagios:

- check Bordermanager SurfControl logfiles to see if the update happened
- check ftp/tls and ftp/ssh server functionality with a real ftp user connect
- check time synchronization on oes/linux servers
- check running virus scanner processes (like LinuxShield from McAfee)
- check HP Proliant server status from the HP Proliant support pack agents
- check DNS round robin functionality
- check VMware remote connection and web interface ports
- check ZEN linux management mirror process for ZLM and RCD targets
- check memory / swap usage on linux servers
- check linux drbd synchronization status
- check server availability on one screen during, e.g., eDirectory migrations
- check ldap functionality
- check web applications
- check oes/linux cluster resources
- ….
So how does it work?

Put simply, you define a host by its name, its IP address and a few other
parameters, and assign it some services. A service is nothing more than a
configured check program that runs some tests at a predefined interval. For
example, the ftp service uses the check_ftp program to make ftp connections
to the server. The check program reports its result back to Nagios with an
exit code: exit code 0 means the test was okay (green), exit code 1 means
warning (yellow), exit code 2 means critical (red) and exit code 3 means
unknown (orange).
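
The exit code convention described here can be captured in a few lines of shell. This is only a minimal sketch and not part of Nagios itself: the function name status_for_exit_code is hypothetical, and reporting any other code as "out of range" is an assumption of this sketch, not a statement about how Nagios treats such codes.

```shell
#!/bin/bash
# Translate a check program's exit code into the status name and colour
# that the article lists for the Nagios web interface.
status_for_exit_code() {
    case "$1" in
        0) echo "OK (green)" ;;
        1) echo "WARNING (yellow)" ;;
        2) echo "CRITICAL (red)" ;;
        3) echo "UNKNOWN (orange)" ;;
        *) echo "out of range" ;;   # assumption: not covered by the article
    esac
}

# Example: capture the exit code of a command and classify it.
# "( exit 2 )" stands in for a real check program that reports critical.
code=0
( exit 2 ) || code=$?
status_for_exit_code "$code"
```

Replacing the "( exit 2 )" line with a call to a real check program shows how its result would be coloured in the web interface.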
Nagios itself shows you the currently existing problems in the service
problems view (hopefully this list is short or even empty), or a list of all
configured services and their status in the service details.

Depending on your notification configuration, it also notifies you with the
relevant information.

If you are interested in such a system monitoring tool, it is well worth the
time to do a test installation and spend a few hours with it. We are
currently monitoring 308 hosts and about 1058 services.
Additional configurations / extensions:

There are a lot of configurations and extensions that are not covered in this
document. As you discover more of the possible configurations you will find
things like how to put a web application link behind a host's extra notes.
For example, if a host is down, just select its extra host notes and you are
taken directly to, e.g., the HP remote insight board. Or if a server shows
some warnings from the HP Proliant agents, you can select the additional
service tasks and go directly to the Proliant web management interface on the
server. There is no need for us to enter the URL for that web page anymore,
or even to look at the HP system insight manager for the server status. That
is all in Nagios now.

I'll try to show you how you can set up a Nagios installation with a little
Linux knowledge within an hour or two. I think that is the best way to take a
look at it. After that I'll show how to add a new check program for Nagios
that monitors whether a single file exists.
So here is a basic installation guide:

Needless to say, you should do this on a test server first. If you use, e.g.,
PuTTY for the ssh connection to your server, you can copy and paste the
commands from this document directly into the shell, so you do not have to
type them yourself.
1. Do a default OES/Linux or SLES 9 (32 bit) installation.
   Nagios does not need very much memory (less than 32 MB) or disk space
   (less than 50 MB), so it can run on a "small" server or even a virtual
   machine.
2. Install / remove some packages of your installation
   OES and SLES contain Nagios version 1.2 in their distribution, but I
   decided to use the current version 2.0rc1 from http://www.nagios.org.
   So I had to remove and install a few packages with yast:
   remove, if installed: nagios, nagios-nsca and nagios-plugins
   install, if not yet done: the gd-devel and libpng-devel packages and the
   whole Simple Webserver selection
3. Download the Nagios and the plugin tarball
   Download the most recent Nagios tarball and the official plugin tarball
   for the current Nagios version from http://www.nagios.org/download and
   copy them to /tmp on your test server.
   Right now Nagios version 2.0rc1 is available. Normally I prefer to use
   only rpms for installation, but this time I use the tarballs so that I can
   configure some additional parts during compilation and installation. I
   tried this installation procedure with other 2.0x versions and it worked
   well.
   current nagios tarball: nagios-2.0rc1.tar.gz
   current plugin tarball: nagios-plugins-1.4.1.tar.gz
4. Create the Nagios user and group
   You're probably going to want to run Nagios under a normal user account,
   so add a new user and group to your system with the following commands:
       # useradd -m nagios
       # groupadd nagios
5. Create the installation directory
   Create the base directory where you would like to install Nagios as
   follows:
       # mkdir /usr/local/nagios
   Change the owner of the base installation directory to the Nagios user
   and group you added earlier as follows:
       # chown nagios.nagios /usr/local/nagios
6. Identify the web server user
   You're probably going to want to issue external commands (like
   acknowledgements and scheduled downtime) from the web interface. To do
   so, you need to identify the user your web server runs as (typically
   wwwrun). This setting is found in your web server configuration files.
   The following command can be used to determine quickly which user Apache
   is running as:
       # grep -R "^User" /etc/apache2/*
   Normally the user is wwwrun. We will add it to a new command group in the
   next step. If the user differs, be sure to use the right one in step 7.
7. Add a command file group
   Next we're going to create a new group whose members include the user
   your web server is running as and the user Nagios is running as. Let's
   say we call this new group 'nagcmd':
       # groupadd nagcmd
   Next, add the users that your web server and Nagios run as to the newly
   created group with the following commands:
       # usermod -G nagcmd wwwrun
       # usermod -G nagcmd nagios
   Note that -G sets the complete list of supplementary groups, so if either
   user already belongs to other supplementary groups, include those in the
   list as well.
8. Extract the Nagios tarball
       # cd /tmp
       # tar -xvzf nagios-2.0rc1.tar.gz
       # cd nagios-2.0rc1
9. Compile the Nagios package
   Run the configure script to initialize variables and create a Makefile as
   follows:
       # ./configure --prefix=/usr/local/nagios --with-cgiurl=/nagios/cgi-bin \
             --with-htmurl=/nagios --with-nagios-user=nagios \
             --with-nagios-group=nagios --with-command-group=nagcmd
10. Compile the binaries
    Compile Nagios and the CGIs with the following command:
        # make all
11. Installing the binaries and HTML files
    Install the binaries and HTML files (documentation and main web page)
    with the following command:
        # make install
12. Installing an init script
    If you want, you can also install the init script /etc/init.d/nagios
    with the following command:
        # make install-init
13. Installing command mode
    If you want, you can also install the command mode environment with the
    following command:
        # make install-commandmode
14. Installing sample config files
    Now we install some default configuration files, which have to be
    changed a little later on:
        # make install-config
15. Installing the plugins
    Plugins are usually installed in the libexec/ directory of your Nagios
    installation (i.e. /usr/local/nagios/libexec). Plugins are scripts or
    binaries which perform all the service and host checks that constitute
    monitoring.
        # cd /tmp
        # tar -xvzf nagios-plugins-1.4.1.tar.gz
        # cd nagios-plugins-1.4.1
        # ./configure
        # make
        # make install
16. Setup the Apache web interface
    To make Nagios accessible through the Apache web server we have to set
    up a config file for it. Create the config file as follows:
        # vi /etc/apache2/conf.d/nagios.conf
    Insert these elements into that new file:
        ScriptAlias /nagios/cgi-bin /usr/local/nagios/sbin
        <Directory "/usr/local/nagios/sbin">
            AllowOverride AuthConfig
            Options ExecCGI
            Order allow,deny
            Allow from all
        </Directory>

        Alias /nagios /usr/local/nagios/share
        <Directory "/usr/local/nagios/share">
            Options None
            AllowOverride AuthConfig
            Order allow,deny
            Allow from all
        </Directory>
    After that, restart the Apache web server with the following command:
        # rcapache2 restart
17. Setup a minimum Nagios configuration
        # cd /usr/local/nagios/etc
        # cp cgi.cfg-sample cgi.cfg
        # cp nagios.cfg-sample nagios.cfg
        # cp minimal.cfg-sample minimal.cfg
        # cp resource.cfg-sample resource.cfg
    For this Nagios configuration to work properly we have to create two
    additional, empty config files. Do not copy the sample files for these
    two; otherwise there would be duplicate command definitions.
        # touch checkcommands.cfg
        # touch misccommands.cfg
18. The last configuration steps
    Set the Nagios user and group as owner of the Nagios installation:
        # chown -R nagios.nagios /usr/local/nagios
    Deactivate the authentication for the CGIs:
        # vi cgi.cfg
    Search for the line "use_authentication=1" and change it to
    "use_authentication=0". That makes testing easier, but not all functions
    are available while authentication is disabled. For production use it
    should be activated again later, but then you have to configure some
    other parts of Nagios as well.
19. Verify the Nagios configuration and start it
        # /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
    The "Total Warnings" and "Total Errors" should be 0 if you have done
    everything correctly. If so, start it for the first time:
        # /etc/init.d/nagios start
    Activate Nagios to start automatically within the runlevel scripts:
        # insserv nagios
20. Test Nagios access with a web browser
        http://<servername or ip address>/nagios
    You should see the following:

Now you can start exploring Nagios yourself.
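
The verify-then-start sequence from the last installation steps can be wrapped in a small helper, so that Nagios is only started or restarted when the verification reports no problems. This is a rough sketch under one assumption: that the summary printed by nagios -v contains lines of the form "Total Warnings: N" and "Total Errors: N", matching the totals mentioned in the guide. The function name config_is_clean is hypothetical.

```shell
#!/bin/bash
# Succeed only if the verification output read from stdin reports
# zero warnings and zero errors.
# Assumption: the summary lines look like "Total Warnings: 0" and
# "Total Errors:   0" (any amount of whitespace after the colon).
config_is_clean() {
    local output warnings errors
    output=$(cat)
    warnings=$(printf '%s\n' "$output" |
        sed -n 's/^Total Warnings:[[:space:]]*\([0-9][0-9]*\).*/\1/p')
    errors=$(printf '%s\n' "$output" |
        sed -n 's/^Total Errors:[[:space:]]*\([0-9][0-9]*\).*/\1/p')
    [ "$warnings" = "0" ] && [ "$errors" = "0" ]
}

# Usage with the paths from this guide:
#   /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg \
#       | config_is_clean && /etc/init.d/nagios restart
```

If the summary lines are missing altogether, the helper fails as well, which errs on the side of not restarting.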
Nice to know: all screens refresh every 30 seconds, so there is no need to
reload them. That time can be changed, even to lower values, in the Nagios
configuration files.

Just let me show you the most important views:
1. Tactical Overview:
This screen gives you an overview of the current status of the monitored
services and hosts. Take a look at the "Hosts" section and you see that you
are currently monitoring only one host. In the "Services" section you see
that you are monitoring 5 services, and probably all of them are reporting
the status "OK". As a summary, the "Network Health" bars for host and service
health are filled completely with green, indicating that all configured hosts
and services are OK.

Most fields on this page are links to more specific views. If you want to
know more about your 5 monitored services you can either click on the "5 OK"
field or choose "Service Detail" on the left.
2. Service Detail
This screen shows you all the services configured to be monitored and their
current status. As we saw in the tactical screen, here are the five services
we monitor right now. You can see basic information about each service on
this page:

Host: the host on which this service is configured. If this field is marked
  red, the host itself is down; if it is just grey, the server is up and
  reachable with ping.
Status: shows the current status of the service
  (OK = green, Warning = yellow, Critical = red, Unknown = orange)
Last Check: date and time of the last check
Duration: shows for how long the service has been in this status
Attempt: how many attempts were needed for the check
Status Information: the output from the check program

Again, if you want to know more about a single service, select it by its name
and you are taken to a more detailed page about it.
3. Host Detail
This is the same kind of view as the service detail, showing the details of
the monitored hosts, so I have no screenshot of it. You would see all
configured hosts and can again select one to get more information about it.
4. Service Problems
Hopefully this screen will stay empty for as long as possible. This is the
screen that I keep open all day long. At the top of this document is a real
screenshot of our system. There are some service problems. Hmm... something
for me to do.
Whenever a service reports a failure, you will see it on this page. The
browser also refreshes this page every 30 seconds, so you always get the
current list of failed services.
When a service reports a failure, its line is shown here. The interval at
which a service is checked can be set in the Nagios configuration. The
minimum interval is 1 minute. When the next check reports that everything is
okay for that service, it disappears from this list. So this is the page
where you can see the current failures of your monitored services.
Last but not least for this article, I want to show you how you can add
another service with a new check program. I think this is the best way to
understand the configuration of hosts and services.

To show you how you could add new check programs on your own, here is an easy
example:
We add a simple bash script that checks whether the file /tmp/nagios.chk
exists. If it is there and executable, the service goes to critical; if it is
there and not executable, it goes to warning; and if it does not exist, the
service is OK.
1. Create the executable check file
       # vi /usr/local/nagios/libexec/check_file_exist.sh
   Add the following to that file:
       #!/bin/bash
       #
       # Check if a local file exists
       #
       while getopts F: VAR
       do
           case "$VAR" in
               F ) LOGFILE=$OPTARG ;;
               * ) echo "wrong syntax: use $0 -F <file to check>"
                   exit 3 ;;
           esac
       done

       if test "$LOGFILE" = ""
       then
           echo "wrong syntax: use $0 -F <file to check>"
           # Nagios exit code 3 = status UNKNOWN = orange
           exit 3
       fi

       if test -e "$LOGFILE"
       then
           if test -x "$LOGFILE"
           then
               echo "Critical $LOGFILE is executable !"
               # Nagios exit code 2 = status CRITICAL = red
               exit 2
           else
               echo "Warning $LOGFILE exists !"
               # Nagios exit code 1 = status WARNING = yellow
               exit 1
           fi
       else
           echo "OK: $LOGFILE does not exist !"
           # Nagios exit code 0 = status OK = green
           exit 0
       fi
   Now set the file attributes:
       # chown nagios.nagios /usr/local/nagios/libexec/check_file_exist.sh
       # chmod +x /usr/local/nagios/libexec/check_file_exist.sh
2. Add the check program to the Nagios configuration
   Each new check command has to be defined once in the global Nagios
   configuration:
       # vi /usr/local/nagios/etc/minimal.cfg
   Add the following block at the end of the file:
       define command{
           command_name    check_file_exist
           command_line    $USER1$/check_file_exist.sh -F /tmp/nagios.chk
           }
3. Add a new service to the localhost
   Each new service has to be defined once in the Nagios configuration and
   can be assigned to a single host, multiple hosts or even a host group. We
   assign it only to the localhost that is already defined in this base
   configuration:
       # vi /usr/local/nagios/etc/minimal.cfg
   Add the following block at the end of the file:
       define service{
           use                     generic-service
           host_name               localhost
           service_description     File check
           is_volatile             0
           check_period            24x7
           max_check_attempts      4
           normal_check_interval   5
           retry_check_interval    1
           contact_groups          admins
           notification_options    w,u,c,r
           notification_interval   960
           notification_period     24x7
           check_command           check_file_exist
           }
4. Verify the Nagios configuration and restart it
   After every change to the config files you should check the Nagios
   configuration, and you have to restart Nagios after that:
       # /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
   The Total Warnings and Total Errors should be 0 if you have done
   everything correctly. If so, restart it with:
       # /etc/init.d/nagios restart
5. Check if the new program is working
   First take a look at the tactical screen; you should see that one service
   is in status pending. That means no check has been done for this service
   yet. Wait a few minutes and it should no longer be pending, and the
   number of OKs should increase from 5 to 6.
   Now create the file and watch the tactical screen, the service detail
   screen or the service problems screen.
       # touch /tmp/nagios.chk
   As we set the normal_check_interval to 5 minutes in the service
   definition, you should get the warning message within that time. Now add
   the executable attribute and watch:
       # chmod +x /tmp/nagios.chk
   The status should change to critical within the check interval.
   When you delete the file, the service should return to status OK.
       # rm /tmp/nagios.chk
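
Instead of waiting for the check interval, the three states can also be exercised by hand from a shell. The following is a minimal sketch: it reproduces the decision logic of check_file_exist.sh as a function with the hypothetical name check_file_state, and it uses a scratch file from mktemp (an assumption of this sketch) rather than /tmp/nagios.chk, so it does not interfere with the running service check.

```shell
#!/bin/bash
# Same decision logic as check_file_exist.sh, reproduced as a function so
# it can be tried without touching the Nagios installation.
check_file_state() {
    local file="$1"
    if [ -e "$file" ]; then
        if [ -x "$file" ]; then
            echo "Critical: $file is executable"
            return 2    # CRITICAL
        fi
        echo "Warning: $file exists"
        return 1        # WARNING
    fi
    echo "OK: $file does not exist"
    return 0            # OK
}

# Walk a scratch file through the three states used in the test above.
scratch=$(mktemp)       # file exists, not executable
rc=0; check_file_state "$scratch" || rc=$?; echo "exit code: $rc"
chmod +x "$scratch"     # file exists and is executable
rc=0; check_file_state "$scratch" || rc=$?; echo "exit code: $rc"
rm -f "$scratch"        # file is gone again
rc=0; check_file_state "$scratch" || rc=$?; echo "exit code: $rc"
```

The three runs should report the exit codes 1, 2 and 0 in that order, which correspond to the warning, critical and OK states that Nagios displays.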
So that's all for the moment. I hope I have shown you a little bit about
Nagios and how it works.

For me it's a great tool, and it saves me a lot of time in my day-to-day
business.

If you continue to work with it, there are a lot of things that could be
improved in the configuration files. Please remember that this is only a
simple installation. If you would like, I can write some more articles about
it, covering how we manage our config files and which other check programs we
added to the system. We even added user authentication for Nagios access with
ldap against our eDirectory, and so on.

Rainer Brunold