Nagios: Host and Service Monitoring Tool
Host and Service Monitoring Tool – quick overview and basic installation
What is Nagios?
Here is a sample screen from our installation (service problems view).
Nagios (http://www.nagios.org) is an open source host, service and
network monitoring tool. It lets you manage different types of
services and hosts running on different operating systems like Linux,
NetWare, Windows, AIX, … It’s flexible in configuration and can be
extended as much as you want. It’s configured with text files and
managed with a web browser.
If you do a basic installation you get a set of Nagios check programs
which let you start monitoring your first hosts and services in less
than an hour, including the installation itself.
Based on that default configuration you can start extending the
configuration to your special needs.
You can use the existing check programs or add more of them from
http://www.monitoringexchange.org,
where other developers offer a lot of their check programs for
download. Or just write your own in any programming language that is
available on Linux. Based on the services, you can set up time frames
for the monitoring process as well as notifications when alarms
arise. You can, but you do not have to. It’s always your choice, and
that’s what is great about Nagios. If you do not want to set up any
notifications, ignore that part of the configuration; you can always
watch the service problems in the web browser. That is what I
currently do. It’s good to come to the office in the morning, take one
look at the Nagios service problems view and know what is going on. Or
when I do an eDirectory migration in our 260+ NetWare server tree,
Nagios is open in the background to show which servers aren’t reachable.
Here is a list of check programs in the base system:
check_http check_nt check_ssh
check_icmp check_ntp check_swap
check_ifoperstatus check_nwstat check_tcp
check_ifstatus check_oracle check_time
check_imap check_overcr check_udp
check_ircd check_ping check_udp2
check_ldaps check_pop check_ups
check_load check_procs check_users
check_log check_radius check_wave
check_mailq check_real negate
check_mrtg check_rpc urlize
check_mrtgtraf check_sensors utils.pm
check_nagios check_smtp utils.sh
You’ll find a lot more if you are missing something here.
Here are some things we wanted to solve with Nagios:
We are running a lot of different management solutions (ZEN for server,
IBM Tivoli, HP / Compaq System Insight Manager, McAfee ePolicy
Orchestrator, …) and all of them are very strong and powerful in some
areas. The problem is that some of them are really time-consuming to
install, configure and administer. It’s difficult to add your own
services and hard to get a single view of all current services and
the problems that exist.
So I went around looking for a solution that is quite
easy to deploy but as powerful as possible, to handle all our
requirements. Of course I will try to replace some other
existing management solutions in our company with it, but I am sure
it’s not possible to drop them all for this one. A combination of
them will be the best for us.
Here are some samples (besides a lot of default monitoring requirements
like CPU, filesystem, …) that I had to deal with during day-to-day
administration and that should be possible to automate with Nagios:
Bordermanager SurfControl logfiles to see if the update happened
ftp/tls and ftp/ssh server functionality with a real ftp user
time synchronization on oes/linux server
running virus scanner processes (like LinuxShield from McAfee)
HP Proliant server status from the HP Proliant support pack agents
DNS round robin functionality
Vmware remote connection and web interface ports
ZEN linux management mirror process for ZLM and RCD targets
memory / swap usage on linux servers
linux drbd synchronization status
server availability on one screen during e.g. eDirectory migrations
oes/linux cluster resources
So how does it work?
To say it in a simple way: you define a host by its name, its IP
address and a few other parameters, and assign it some services.
A service is nothing other than a configured check program
that runs some tests at a predefined interval. For example, the ftp
service uses the check_ftp program to make some ftp connections to the
server. It reports the result back to Nagios with an exit code: exit
code 0 means the test was okay = green, exit code 1 means warning =
yellow, exit code 2 means critical = red and exit code 3 means unknown.
Nagios itself shows you the currently existing problems in the service
problems view (hopefully this list is short or even empty), or a list
of all configured services and their status in the service details.
Depending on your notification configuration it also notifies you.
If you are interested in such a system monitoring tool, it is well
worth the time to do a test installation and spend a few hours with it.
We are currently monitoring 308 hosts and about 1058 services.
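The exit code convention described above can be tried out without a Nagios installation. The following is only a small sketch: the function names nagios_status_name and run_check are made up for this illustration, and the two sample "checks" are plain shell one-liners, not real Nagios check programs.

```shell
#!/bin/bash
# Translate a check program exit code into the status name used in the text.
nagios_status_name() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        3) echo "UNKNOWN" ;;
        *) echo "INVALID($1)" ;;   # codes outside 0-3 are not covered by the text
    esac
}

# Run any command like a check program and print status plus its output.
run_check() {
    local output rc
    if output="$("$@" 2>&1)"; then
        rc=0
    else
        rc=$?                      # exit status of the command that was run
    fi
    echo "$(nagios_status_name "$rc"): $output"
    return "$rc"
}

# Two stand-in checks: one reports okay, one reports a warning.
run_check sh -c 'echo "everything fine"; exit 0' || true
run_check sh -c 'echo "getting close to the limit"; exit 1' || true
```

The point of the sketch is that Nagios only needs two things from a check program: one line of text and an exit code between 0 and 3.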
Additional configurations / extensions:
There are a lot of configurations / extensions that are not covered in
this document. As you discover more and more of the possible
configurations you will find things like how to put a web application
link behind a host’s extra notes. For example, if a host is down, just
select its extra host notes and you are forwarded directly to e.g. the
HP remote insight board. Or if a server shows some warnings from the HP
Proliant agents, you can select the additional service tasks and go
directly to the Proliant web management interface on the server.
There is no need for us to enter the URL for that web page anymore
or even take a look at the HP system insight manager for server
status. That’s all in Nagios now.
I’ll try to show you how you can set up a Nagios installation with a
little Linux knowledge in less than an hour or two. I think that’s the
best way to take a look at it. After that I’ll show how to add a new
check program for Nagios that monitors whether a single file exists.
So here is a basic installation guide:
Just to say that you should do it on a test server first. If you use
e.g. PuTTY for the ssh connection to your server, you can copy / paste
the commands from this document directly into the shell, so you do not
have to type them yourself.
Install a default OES/Linux or SLES 9 – 32 bit installation.
Nagios does not need very much memory (less than 32 MB) and disk space
(less than 50 MB). So it could be a "small" server or even a virtual
one.
Install / remove some packages of your installation
OES and SLES
contain Nagios version 1.2 in their distribution, but I decided to
use the current version 2.0 rc1 from http://www.nagios.org.
So I had to remove and install a few packages with yast:
Remove, if installed: nagios,
Install, if not yet done: gd-devel
and the whole Simple
Download the Nagios and the plugin tarball
Download the most recent
Nagios tarball and the official plugin tarball for the current Nagios
version from http://www.nagios.org/download
and copy them to /tmp
on your test server.
Right now the Nagios version 2.0rc1 is available. Normally I prefer
to use only rpms for installation, but this time I use the tarball
so I can configure some additional parts during compilation and
installation. I have tried this installation procedure with other 2.0x
versions and it worked well.
plugin tarball: nagios-plugins-1.4.1.tar.gz
Create the Nagios user and group
You’re probably going to want to run Nagios under a normal user account,
so add a new user and group to your system with the following commands:
# groupadd nagios
Create the installation directory
Create the base directory where you would like to install Nagios
(/usr/local/nagios in this guide).
Change the owner of the base installation directory to be the Nagios
user and group you added earlier.
Identify the web server user
You’re probably going to want to issue external commands
(like acknowledgements and scheduled downtime)
from the web interface. To do so, you need to identify the user your
web server runs as (typically wwwrun). This setting is found
in your web server configuration file. The following command can be
used to determine quickly what user Apache is running as:
# grep -R "^User" /etc/apache2/*
Here the user is wwwrun. We will add it to a Nagios command group in
the next step. If the user differs, be sure to use the right user in
step 7.
Add a command file group
Next we’re going to create a new group whose members include the user
your web server is running as and the user Nagios is running as.
Let’s say we call this new group ‘nagcmd’.
Now add the users that your web server and Nagios run as to the newly
created group with the following commands:
# usermod -G nagcmd wwwrun
# usermod -G nagcmd nagios
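Several commands of the account and directory steps above did not survive in this copy of the article, so here is a hedged sketch of how the whole preparation could look. The useradd, mkdir, chown and groupadd nagcmd lines are assumptions derived from the surrounding text, not the original commands. The helper run is hypothetical and only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/bash
# Sketch: account and directory preparation for the source installation.
# By default the commands are only printed; set DRY_RUN=0 and run as root
# to really execute them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

prepare_nagios_accounts() {
    # User and group the Nagios daemon runs as
    run groupadd nagios
    run useradd -g nagios nagios       # assumed form of the missing useradd call

    # Base installation directory, owned by the Nagios user and group
    run mkdir -p /usr/local/nagios
    run chown nagios:nagios /usr/local/nagios

    # Command file group shared by Nagios and the web server user
    run groupadd nagcmd
    # Note: -G sets the complete list of supplementary groups of a user
    run usermod -G nagcmd wwwrun
    run usermod -G nagcmd nagios
}

prepare_nagios_accounts
```

Because usermod -G replaces the list of supplementary groups, it is worth checking with "id wwwrun" beforehand whether the web server user already belongs to other groups that have to be kept.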
Extract the Nagios tarball
# tar -xvzf nagios-2.0rc1.tar.gz
# cd nagios-2.0rc1
Compile the Nagios package
Run the configure script to initialize
variables and create a Makefile, then compile
Nagios and the CGIs with make.
Install the binaries and HTML files
Install the binaries and HTML files (documentation and main web page).
Install an init script
If you want, you can also
install the init script /etc/init.d/nagios.
If you want, you can also
install the command mode environment.
Install sample config files
Now we install some default configuration files, which have to be
changed a little bit later.
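The commands for the compile and install steps above are missing from this copy. As a sketch, the sequence below uses the make targets that the Nagios 2.x source tree provides for exactly these steps. The configure options are an assumption based on the user, group and directory names used in this guide, so compare them with the output of ./configure --help before relying on them. The run helper is hypothetical and only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/bash
# Sketch: build and install Nagios from the extracted source directory.
# Run it inside nagios-2.0rc1; DRY_RUN=0 executes, the default only prints.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

build_nagios() {
    # Create the Makefile (options are assumptions matching this guide)
    run ./configure --prefix=/usr/local/nagios \
        --with-nagios-user=nagios --with-nagios-group=nagios \
        --with-command-group=nagcmd || return 1

    run make all || return 1          # compile Nagios and the CGIs
    run make install                  # binaries and HTML files
    run make install-init             # init script /etc/init.d/nagios
    run make install-commandmode      # command mode environment
    run make install-config           # sample config files
}

build_nagios
```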
Plugins are usually
installed in the libexec/ directory of your Nagios
installation (i.e. /usr/local/nagios/libexec). Plugins are
scripts or binaries which perform all the service and host checks
that constitute monitoring.
# tar -xvzf nagios-plugins-1.4.1.tar.gz
# cd nagios-plugins-1.4.1
# ./configure && make
# make install
Configure the Apache web interface
To make Nagios accessible through the Apache web server we have to
set up a config file for it. Create the config file and
copy these elements into that new file:
Allow from all
After that, restart the Apache web server.
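Only the "Allow from all" line of the Apache configuration survived here. The following sketch writes a configuration of the kind the Nagios 2.x documentation suggests, using the old Order/Allow access directives of Apache 2.0/2.2. The function name write_nagios_apache_conf is made up, and the real target path (for example a file below /etc/apache2/conf.d/) is an assumption, so the demo writes to a scratch file instead.

```shell
#!/bin/bash
# Sketch: write an Apache config that publishes the Nagios CGIs and HTML files.
write_nagios_apache_conf() {
    local target="$1"
    cat > "$target" <<'EOF'
ScriptAlias /nagios/cgi-bin /usr/local/nagios/sbin
<Directory "/usr/local/nagios/sbin">
    AllowOverride AuthConfig
    Options ExecCGI
    Order allow,deny
    Allow from all
</Directory>

Alias /nagios /usr/local/nagios/share
<Directory "/usr/local/nagios/share">
    Options None
    AllowOverride AuthConfig
    Order allow,deny
    Allow from all
</Directory>
EOF
}

# Demo: write to a scratch file and show the result
conf="$(mktemp)"
write_nagios_apache_conf "$conf"
cat "$conf"
```

On the real server the function would be called with the path of the new config file, followed by a restart of Apache through its init script.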
Create a minimum Nagios configuration
# cp cgi.cfg-sample cgi.cfg
# cp minimal.cfg-sample minimal.cfg
# cp resource.cfg-sample resource.cfg
For proper use of this Nagios configuration we have to create two
additional, empty config files. Do not copy the sample files for
these, otherwise there would be duplicate command definitions.
# touch checkcommands.cfg
# touch misccommands.cfg
Last configuration steps …
Set the Nagios user and group as
owner of the Nagios installation:
chown -R nagios.nagios /usr/local/nagios
Disable the authentication for the cgi’s:
Search for the line "use_authentication=1" in cgi.cfg
and change it to "use_authentication=0".
That’s easier to handle for testing,
but not all functions are
available if it’s disabled.
For production use it should be
activated later, but then you have to configure some other parts of
Nagios as well.
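Instead of editing the file by hand, the switch can also be flipped with sed. This is a sketch only: the function name set_cgi_authentication is made up, and the demo works on a scratch file with two stand-in lines rather than on the real /usr/local/nagios/etc/cgi.cfg.

```shell
#!/bin/bash
# Sketch: set use_authentication in a cgi.cfg style file.
# sed keeps a backup copy of the original file with the suffix .bak.
set_cgi_authentication() {
    local file="$1" value="$2"
    sed -i.bak "s/^use_authentication=.*/use_authentication=${value}/" "$file"
}

# Demo on a scratch file that stands in for cgi.cfg
demo="$(mktemp)"
printf '%s\n' 'main_config_file=/usr/local/nagios/etc/nagios.cfg' \
              'use_authentication=1' > "$demo"

set_cgi_authentication "$demo" 0
grep '^use_authentication=' "$demo"
```

Calling the function again with the value 1 switches authentication back on for production use.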
Check the Nagios configuration and start it
"Total Warnings" and "Total Errors" should be 0 if you have done
everything right.
If so, just start it for the first time.
Activate Nagios to start
automatically within the runlevel scripts.
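The commands for this step are missing in this copy, so the following is a sketch under assumptions. The -v option of the nagios binary verifies a configuration file, and the paths follow the /usr/local/nagios layout of this guide. Using chkconfig for the runlevel activation is an assumption; on SUSE systems insserv is the native alternative. The run helper is hypothetical and only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/bash
# Sketch: verify the configuration, start Nagios, enable it at boot.
DRY_RUN="${DRY_RUN:-1}"
NAGIOS_BIN=/usr/local/nagios/bin/nagios
NAGIOS_CFG=/usr/local/nagios/etc/nagios.cfg

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

start_nagios_first_time() {
    # Stop here if the verification fails
    run "$NAGIOS_BIN" -v "$NAGIOS_CFG" || return 1

    run /etc/init.d/nagios start
    run chkconfig nagios on           # assumption, see the note above
}

start_nagios_first_time
```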
Test Nagios access with a web browser
http://<server name or ip address>/nagios
You should see the Nagios web interface.
Now you can start exploring Nagios yourself.
Good to know: All screens refresh every 30 seconds, no need to reload.
That time can even be changed to lower values in the Nagios
configuration.
Just let me show you the most important screens.
The tactical screen gives you an overview of the current status of the
monitored services and hosts. Take a look at the "Hosts" line and you
see that you are currently monitoring only one host. In the "Services"
line you see that you are monitoring 5 services and maybe all are
reporting the status "OK". As a summary, the "Network Health"
– Host and Service health bar is filled completely with green,
indicating all configured hosts and services are OK.
The fields on this page are links to more specific views. If you want
to know more about your 5 monitored services you can either click on
the "5 OK" field, or choose "Service Detail" on the left.
The service detail screen shows you all the
services configured to be monitored and their current status.
As we saw in the tactical screen, here are the five services we monitor.
You can see basic information about each service
on this page:
Host: the host to which the service is assigned. If this field is
marked red, the host itself is down; if it’s just grey, the server is
up and reachable.
Status: shows the current status of the service (OK = green,
Warning = yellow, Critical = red)
Last Check: date and time of the last check
Duration: shows for how long the service has been in this status
Attempt: how many attempts were needed for the check
Information: the output from the check program
If you want to know more about a single service, select it by its name
and you are redirected to a more detailed page about it.
The host detail screen is the same kind of view as the service detail,
showing the details of the monitored hosts. Therefore I have no
screen shot of it. You would see all configured hosts and again have
the choice to select one to get more information about it.
Hopefully the service problems screen will be empty
as long as possible. This is the screen that I keep open all day
long. At the top of this document is an actual screen shot of our system.
There are some service problems. Hmm.. something to do for me …
Whenever a service reports a failure you will get the information on
this page. The browser also refreshes this page every 30 seconds, so
you get the current list of failed services.
When a service reports a
failure, its line will be shown here. The interval at which a service
is checked can be configured in the Nagios configuration. The minimum
interval is 1 minute. When the next check reports that everything is
okay for that service, it will disappear from this list. So this is the
page where you can see the current failures of your monitored services.
Last but not least for this article I want to show you how you can add
another service with a new check program. I think this is the best
way to understand the configuration of hosts and services.
To show you how you could add new check programs on your own, here is
an example. We add a simple bash script that checks if
the file /tmp/nagios.chk
exists. If it is there and it’s executable the service goes to
critical, if it is there and not executable it goes to
warning, and if it doesn’t exist the service is ok.
Create the executable check file
Create the file /usr/local/nagios/libexec/check_file_exist.sh and add
the following to that file:
#!/bin/bash
# Check if a local file exist
while getopts F: VAR
do
    case "$VAR" in
        F ) LOGFILE=$OPTARG ;;
        * ) echo "wrong syntax: use $0 -F <file to check>"
            exit 3 ;;
    esac
done

if test "$LOGFILE" = ""
then
    echo "wrong syntax: use $0 -F <file to check>"
    # Nagios exit code 3 = status UNKNOWN
    exit 3
fi

if test -e "$LOGFILE"
then
    if test -x "$LOGFILE"
    then
        echo "Critical: $LOGFILE is executable !"
        # Nagios exit code 2 = status CRITICAL = red
        exit 2
    else
        echo "Warning: $LOGFILE exists !"
        # Nagios exit code 1 = status WARNING = yellow
        exit 1
    fi
else
    echo "OK: $LOGFILE does not exist !"
    # Nagios exit code 0 = status OK = green
    exit 0
fi
Now set the file attributes:
chown nagios.nagios /usr/local/nagios/libexec/check_file_exist.sh
chmod +x /usr/local/nagios/libexec/check_file_exist.sh
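Before the script is wired into Nagios, its three outcomes can be tried by hand. The sketch below repeats the decision logic of the script as a shell function so that it runs on its own. The function names check_file_exist and show and the scratch file (a stand-in for /tmp/nagios.chk) are made up for this illustration.

```shell
#!/bin/bash
# Same decision logic as the check script, as a function for experiments.
check_file_exist() {
    local file="$1"
    if [ -e "$file" ]; then
        if [ -x "$file" ]; then
            echo "Critical: $file is executable !"
            return 2
        fi
        echo "Warning: $file exists !"
        return 1
    fi
    echo "OK: $file does not exist !"
    return 0
}

# Print the check output together with the exit code Nagios would see.
show() {
    local rc
    if check_file_exist "$1"; then rc=0; else rc=$?; fi
    echo "exit code: $rc"
}

scratch="$(mktemp)"     # exists, not executable -> warning
show "$scratch"
chmod +x "$scratch"     # exists and executable  -> critical
show "$scratch"
rm -f "$scratch"        # does not exist         -> ok
show "$scratch"
```

With the installed script the same test is a call of /usr/local/nagios/libexec/check_file_exist.sh -F /tmp/nagios.chk followed by a look at $?.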
Add the check program to the Nagios configuration
Each new check
command has to be defined once in the global Nagios configuration.
Add the following block at the end of the file:
command_line $USER1$/check_file_exist.sh -F /tmp/nagios.chk
Add a new service to the localhost
Each new service has to be
defined once in the Nagios configuration and can be assigned to a
single host, multiple hosts or even a host group. We assign it only
to the localhost that is already defined in this base configuration.
Add the following block at the end of the file:
service_description File check
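Only one line of each definition block survived in this copy, so here is a hedged sketch of how the two blocks could look in Nagios 2.x object syntax, appended to a config file with a here-document. The command name check_file_exist is hypothetical. The contact group admins and the time period 24x7 are assumed to exist as in the stock minimal configuration, and the retry and notification values are illustrative. The name of the target config file is not given in the surviving text, so the demo appends to a scratch file.

```shell
#!/bin/bash
# Sketch: append a command and a service definition to a Nagios config file.
append_file_check_definitions() {
    local cfg="$1"
    # Quoted 'EOF' keeps the $USER1$ macro literal
    cat >> "$cfg" <<'EOF'

define command{
        command_name    check_file_exist
        command_line    $USER1$/check_file_exist.sh -F /tmp/nagios.chk
        }

define service{
        host_name               localhost
        service_description     File check
        check_command           check_file_exist
        check_period            24x7
        max_check_attempts      4
        normal_check_interval   5
        retry_check_interval    1
        contact_groups          admins
        notification_interval   960
        notification_period     24x7
        notification_options    w,u,c,r
        }
EOF
}

# Demo on a scratch file instead of the real configuration
cfg="$(mktemp)"
append_file_check_definitions "$cfg"
cat "$cfg"
```

The service refers to the command only by its command_name, which is why the command has to be defined exactly once.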
Check the Nagios configuration and restart it
After all changes to the
config files you should check the Nagios configuration, and you have
to restart Nagios after that.
Total Warnings and Total Errors should be 0 if you have done everything
right. If so, restart Nagios using its init script /etc/init.d/nagios.
Test if the new program is working
First take a look at the
tactical screen and you should see that one service is in status
pending. That means no check was done before for this service.
Wait a few minutes and it should disappear as pending and the
number of OKs should increment from 5 to 6.
Now create the
file and watch the tactical screen, the service detail screen or the
service problems screen.
As we set the
normal_check_interval to 5 minutes in the service definition, you
should get the warning message during that time. Now add the
executable attribute and watch:
chmod +x /tmp/nagios.chk
The status should change
during the check interval to critical.
When you delete the file
the service should return to status ok.
So that’s all for the moment. I hope I have shown you a little bit about
Nagios and how it works.
For me it’s a great tool and it saves me a lot of time during the day-to-day business.
As you continue to work with it, there are a lot of things that could be
improved in the configuration files. Please remember this is
only a simple installation. If you would like, I can write some
more articles about it, how we manage our config files and what
other check programs we added to the system. We even added user
authentication for Nagios access with ldap to our eDirectory and so
on.