Nagios: Host and Service Monitoring Tool


Nagios: Host and Service Monitoring Tool – quick overview and basic
installation guide

What is Nagios?


This is a sample screen from our installation (service problems view).

Basically, Nagios (http://www.nagios.org) is an open source host,
service and network monitoring tool. It lets you monitor different
types of services and hosts running on different operating systems
such as Linux, NetWare, Windows and AIX. It’s flexible to configure
and can be extended as much as you want. It’s configured in text files
and managed with a web browser.

When you do a basic installation you get a set of Nagios check
programs that let you start monitoring your first hosts and services.
Including the installation itself, this takes less than an hour.

Based on that default configuration you can start extending the
configuration to your specific needs.

You can use the existing check programs or add more from
http://www.monitoringexchange.org, where other developers offer many
of their check programs for download. Or just write your own in any
programming language that is available on Linux. For each service you
can set up time frames for the monitoring process as well as
notifications when alarms arise. You can, but you do not have to. It’s
always your choice, and that’s what is great about Nagios. If you do
not want to set up any notifications, skip that part of the
configuration; you can always watch the service problems in the web
browser. That is what I currently do. It’s good to come to the office
in the morning, take one look at the Nagios service problems view and
know what is going on. And when I do an eDirectory migration in our
tree of 260+ NetWare servers, Nagios is open in the background so I
can see which servers aren’t reachable.

Here is a list of check programs in the base system:

check_breeze    check_http          check_nt       check_ssh
check_by_ssh    check_icmp          check_ntp      check_swap
check_dhcp      check_ifoperstatus  check_nwstat   check_tcp
check_dig       check_ifstatus      check_oracle   check_time
check_disk      check_imap          check_overcr   check_udp
check_disk_smb  check_ircd          check_ping     check_udp2
check_dns       check_ldaps         check_pop      check_ups
check_dummy     check_load          check_procs    check_users
check_file_age  check_log           check_radius   check_wave
check_flexlm    check_mailq         check_real     negate
check_fping     check_mrtg          check_rpc      urlize
check_ftp       check_mrtgtraf      check_sensors  utils.pm
check_hpjd      check_nagios        check_smtp     utils.sh
                check_nntp          check_snmp

At http://www.monitoringexchange.org you’ll find a lot more if you are
missing something here.

Here are some things we wanted to solve with Nagios:

We are running a lot of different management solutions (ZEN for
server, IBM Tivoli, HP / Compaq System Insight Manager, McAfee ePolicy
Orchestrator, ...) and all of them are very strong and powerful in
some areas. The problem is that some of them are really time-consuming
to install, configure and administer. It’s difficult to add your own
services and hard to get a single view of all current services and the
problems that exist.

So I went looking for a solution that is quite easy to deploy but
powerful enough to handle all our requirements. Of course I will try
to replace some of the other existing management solutions in our
company with it, but I am sure it’s not possible to drop them all for
this one. A combination of them will be the best for us.

Here are some examples (besides a lot of default monitoring
requirements like CPU, filesystem, ...) that I had to deal with during
day-to-day administration and that should be possible to automate with
Nagios:

  • check Bordermanager SurfControl logfiles to see if the update
    happened

  • check ftp/tls and ftp/ssh server functionality with a real ftp user
    connect

  • check time synchronization on OES/Linux servers

  • check running virus scanner processes (like LinuxShield from McAfee)

  • check HP Proliant server status from the HP Proliant support pack
    agents

  • check DNS round robin functionality

  • check VMware remote connection and web interface ports

  • check ZEN Linux management mirror process for ZLM and RCD targets

  • check memory / swap usage on Linux servers

  • check Linux drbd synchronization status

  • check server availability on one screen during, e.g., eDirectory
    migrations

  • check LDAP functionality

  • check web applications

  • check OES/Linux cluster resources

  • ...

So how does it work?

Put simply, you define a host by its name and IP address and a few
other parameters, and assign it some services. Behind a service there
is nothing more than a check program that is configured to run some
tests at a predefined interval. For example, the ftp service uses the
check_ftp program to make ftp connections to the server. The program
reports the result back to Nagios with an exit code: exit code 0 means
the test was okay (green), exit code 1 means warning (yellow), exit
code 2 means critical (red) and exit code 3 means unknown (orange).
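The exit code convention can be tried out in a shell without Nagios.
The following is only a minimal sketch: status_name and run_check are
illustrative helper names that are not part of Nagios, and the sh -c
commands merely stand in for real check programs.

```shell
#!/bin/bash
# Sketch only: status_name and run_check are illustrative helpers,
# not part of Nagios.

# Translate a check program's exit code into the status name.
status_name() {
    case "$1" in
        0) echo "OK" ;;        # green
        1) echo "WARNING" ;;   # yellow
        2) echo "CRITICAL" ;;  # red
        3) echo "UNKNOWN" ;;   # orange
        *) echo "INVALID" ;;   # outside the range described in this article
    esac
}

# Run a command, capture its output and exit code, and print both
# in readable form. Returns the command's own exit code.
run_check() {
    local output rc
    rc=0
    output=$("$@" 2>&1) || rc=$?
    echo "$(status_name "$rc"): $output"
    return "$rc"
}

# Stand-ins for real check programs; each prints one line and exits.
run_check sh -c 'echo "connection accepted"; exit 0' || true
run_check sh -c 'echo "slow response"; exit 1' || true
run_check sh -c 'echo "connection refused"; exit 2' || true
```

Once the check programs are installed, the same wrapper could be
pointed at a real one, for example
run_check /usr/local/nagios/libexec/check_ftp -H <host>.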

Nagios itself shows you the currently existing problems in the service
problems view (hopefully this list is short or even empty), or a list
of all configured services and their status in the service details
view.

Depending on your notification configuration it also notifies you with
the relevant information.

If you are interested in such a system monitoring tool, it is well
worth the time to do a test installation and spend a few hours with
it.

We are currently monitoring 308 hosts and about 1058 services.

Additional configurations / extensions:

There are a lot of configurations / extensions that are not covered in
this document. As you discover more and more of the possible
configurations you will find things like how to put a web application
link behind a host’s extra notes. For example, if a host is down, just
select its extra host notes and you are forwarded directly to, e.g.,
the HP remote insight board. Or if a server shows some warnings from
the HP Proliant agents, you can select the additional service tasks
and go directly to the Proliant web management interface on the
server. There is no need for us to enter the URL for that web page
anymore or even take a look at the HP System Insight Manager for the
server status. That’s all in Nagios now.

I’ll try to show you how you can set up a Nagios installation with a
little Linux knowledge in an hour or two. I think that’s the best way
to take a look at it. After that I’ll show how to add a new check
program to Nagios that monitors whether a single file exists.

So here is a basic installation guide:

Needless to say, you should do it on a test server first. If you use,
e.g., PuTTY for the ssh connection to your server, you can copy and
paste the commands from this document directly into the shell, so you
do not have to type them yourself.

  1. Do a default OES/Linux or SLES 9 (32 bit) installation.

    Nagios does not need very much memory (less than 32 MB) or disk
    space (less than 50 MB), so it can be a "small" server or even a
    virtual machine.

  2. Install / remove some packages of your installation

    OES and SLES contain Nagios version 1.2 in their distribution, but
    I decided to use the current version 2.0rc1 from
    http://www.nagios.org. So I had to remove and install a few
    packages with YaST:

    remove if installed: nagios, nagios-nsca and nagios-plugins

    install if not yet done: the gd-devel and libpng-devel packages and
    the whole Simple Webserver selection

  3. Download the Nagios and the plugin tarball

    Download the most recent Nagios tarball and the official plugin
    tarball for the current Nagios version from
    http://www.nagios.org/download and copy them to /tmp on your test
    server.

    Right now Nagios version 2.0rc1 is available. Normally I prefer to
    use only RPMs for installation, but this time I use the tarball so
    I can configure some additional parts during compilation and
    installation. I tried this installation procedure with other 2.0x
    versions and it worked well.

    current nagios tarball: nagios-2.0rc1.tar.gz
    current plugin tarball: nagios-plugins-1.4.1.tar.gz

  4. Create the Nagios user and group

    You’re probably going to want to run Nagios under a normal user
    account, so add a new user and group to your system with the
    following commands:

    # useradd -m nagios
    # groupadd nagios

  5. Create the installation directory

    Create the base directory where you would like to install Nagios as
    follows:

    # mkdir /usr/local/nagios

    Change the owner of the base installation directory to the Nagios
    user and group you added earlier as follows:

    # chown nagios:nagios /usr/local/nagios

  6. Identify the web server user

    You’re probably going to want to issue external commands (like
    acknowledgements and scheduled downtime) from the web interface. To
    do so, you need to identify the user your web server runs as
    (typically wwwrun). This setting is found in your web server
    configuration file. The following command can be used to quickly
    determine which user Apache is running as:

    # grep -R "^User" /etc/apache2/*

    Normally the user is wwwrun. We will add it to a new command group
    in the next step. If the user differs, be sure to use the right one
    in step 7.

  7. Add a command file group

    Next we’re going to create a new group whose members include the
    user your web server is running as and the user Nagios is running
    as. Let’s say we call this new group "nagcmd":

    # groupadd nagcmd

    Next, add the users that your web server and Nagios run as to the
    newly created group with the following commands. Note that
    usermod -G sets the complete list of supplementary groups, so any
    other supplementary groups these users already belong to have to
    be listed as well:

    # usermod -G nagcmd wwwrun
    # usermod -G nagcmd nagios
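To confirm that step 7 worked, the group memberships can be checked
with id. This is a small sketch under the assumptions of this guide:
in_group is a hypothetical helper name, and the users wwwrun and
nagios and the group nagcmd are the ones used above.

```shell
#!/bin/bash
# Sketch only: in_group is an illustrative helper name.

# Succeeds if the given user is a member of the given group.
in_group() {
    local user=$1 group=$2
    # id -nG prints all group names of the user on one line
    id -nG "$user" 2>/dev/null | tr ' ' '\n' | grep -qx -- "$group"
}

# Users and group as assumed in this guide.
for user in wwwrun nagios; do
    if in_group "$user" nagcmd; then
        echo "$user is in nagcmd"
    else
        echo "$user is NOT in nagcmd"
    fi
done
```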

  8. Extract the Nagios tarball

    # cd /tmp
    # tar -xvzf nagios-2.0rc1.tar.gz
    # cd nagios-2.0rc1

  9. Configure the Nagios package

    Run the configure script to initialize variables and create a
    Makefile as follows:

    # ./configure --prefix=/usr/local/nagios --with-cgiurl=/nagios/cgi-bin \
        --with-htmurl=/nagios --with-nagios-user=nagios \
        --with-nagios-group=nagios --with-command-group=nagcmd

  10. Compile the binaries

    Compile Nagios and the CGIs with the following command:

    # make all

  11. Install the binaries and HTML files

    Install the binaries and HTML files (documentation and main web
    page) with the following command:

    # make install

  12. Install an init script

    If you want, you can also install the init script
    /etc/init.d/nagios with the following command:

    # make install-init

  13. Install command mode

    If you want, you can also install the command mode environment with
    the following command:

    # make install-commandmode

  14. Install sample config files

    Now we install some default configuration files, which have to be
    changed a little later:

    # make install-config

  15. Install the plugins

    Plugins are usually installed in the libexec/ directory of your
    Nagios installation (i.e. /usr/local/nagios/libexec). Plugins are
    scripts or binaries which perform all the service and host checks
    that constitute monitoring.

    # cd /tmp
    # tar -xvzf nagios-plugins-1.4.1.tar.gz
    # cd nagios-plugins-1.4.1
    # ./configure
    # make
    # make install

  16. Set up the Apache web interface

    To make Nagios accessible through the Apache web server we have to
    set up a config file for it. Create the config file as follows:

    # vi /etc/apache2/conf.d/nagios.conf

    Insert these elements into that new file:

    ScriptAlias /nagios/cgi-bin /usr/local/nagios/sbin
    <Directory "/usr/local/nagios/sbin">
        AllowOverride AuthConfig
        Options ExecCGI
        Order allow,deny
        Allow from all
    </Directory>

    Alias /nagios /usr/local/nagios/share
    <Directory "/usr/local/nagios/share">
        Options None
        AllowOverride AuthConfig
        Order allow,deny
        Allow from all
    </Directory>

    After that, restart the Apache web server with the following
    command:

    # rcapache2 restart

  17. Set up a minimum Nagios configuration

    # cd /usr/local/nagios/etc
    # cp cgi.cfg-sample cgi.cfg
    # cp nagios.cfg-sample nagios.cfg
    # cp minimal.cfg-sample minimal.cfg
    # cp resource.cfg-sample resource.cfg

    For proper use of this Nagios configuration we have to create two
    additional, empty config files. Do not copy the sample files for
    these two; that would lead to duplicate command definitions.

    # touch checkcommands.cfg
    # touch misccommands.cfg

  18. The last configuration steps ...

    Set the Nagios user and group as owner of the Nagios installation:

    # chown -R nagios:nagios /usr/local/nagios

    Deactivate the authentication for the CGIs:

    # vi cgi.cfg

    Search for the line "use_authentication=1" and change it to
    "use_authentication=0". That makes testing easier to handle, but
    not all functions are available while it is disabled. For
    production use it should be activated again later, but then you
    have to configure some other parts of Nagios as well.
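If you prefer not to edit cgi.cfg by hand in step 18, the same change
can be scripted. This is only a sketch: set_cgi_auth is a hypothetical
helper name, and sed -i (in-place editing) is assumed to be available,
as it is with GNU sed. The demo below works on a scratch file rather
than the real cgi.cfg.

```shell
#!/bin/bash
# Sketch only: set_cgi_auth is an illustrative helper name.

# set_cgi_auth FILE VALUE
# Sets use_authentication to VALUE (0 or 1) in a cgi.cfg-style file.
set_cgi_auth() {
    local file=$1 value=$2
    sed -i "s/^use_authentication=.*/use_authentication=$value/" "$file"
}

# Demonstration on a scratch file instead of the real cgi.cfg.
demo=$(mktemp)
printf 'main_config_file=/usr/local/nagios/etc/nagios.cfg\nuse_authentication=1\n' > "$demo"
set_cgi_auth "$demo" 0
grep '^use_authentication' "$demo"
rm -f "$demo"
```

On the test server the call would be
set_cgi_auth /usr/local/nagios/etc/cgi.cfg 0, and later
set_cgi_auth /usr/local/nagios/etc/cgi.cfg 1 to switch authentication
back on.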

  19. Verify the Nagios configuration and start it

    # /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg

    The "Total Warnings" and "Total Errors" should be 0 if you have
    done everything correctly. If so, start Nagios for the first time:

    # /etc/init.d/nagios start

    Activate Nagios so that it starts automatically with the runlevel
    scripts:

    # insserv nagios

  20. Test Nagios access with a web browser

    http://<servername or ip address>/nagios

    You should see the following:

Now you can start exploring Nagios yourself.

Nice to know: all screens refresh every 30 seconds, so there is no
need to reload them. That time can be changed, even to lower values,
in the Nagios configuration files.

Let me show you the most important views:

1. Tactical Overview

This screen gives you an overview of the current status of the
monitored services and hosts. Take a look at the "Hosts" line and you
see that you are currently monitoring only one host. In the "Services"
line you see that you are monitoring 5 services, and probably all of
them are reporting the status "OK". As a summary, the "Network Health"
bars for host and service health are filled completely with green,
indicating that all configured hosts and services are OK.

Most fields on this page are links to more specific views. If you want
to know more about your 5 monitored services you can either click on
the "5 OK" field or choose "Service Detail" on the left.


2. Service Detail

This screen shows you all the services configured to be monitored and
their current status. As we saw in the tactical screen, here are the
five services we monitor right now.

You can see basic information about each service on this page:

Host                the host on which this service is configured.
                    If this field is marked red, the host itself is
                    down; if it’s just grey, the server is up and
                    reachable with ping.
Status              shows the current status of the service:
                    OK = green, Warning = yellow, Critical = red,
                    Unknown = orange
Last Check          date and time of the most recent check
Duration            shows how long the service has been in this status
Attempt             how many attempts were needed for the check
Status Information  the output from the check program

Again, if you want to know more about a single service, select it by
its name and you are redirected to a more detailed page about it.

3. Host Detail

This is the same kind of view as the service detail, showing the
details of the monitored hosts, so I have no screenshot of it. You
would see all configured hosts and again have the choice to select one
to get more information about it.


4. Service Problems

Hopefully this screen will stay empty as long as possible. This is the
screen that I keep open all day long. At the top of this document is a
current screenshot of our system. There are some service problems.
Hmm.. something to do for me ... Whenever a service reports a failure
you will get the information on this page. The browser refreshes this
page every 30 seconds as well, so you always get the current list of
failed services.

When a service reports a failure, its line is shown here. The interval
at which a service is checked can be set in the Nagios configuration;
the minimum interval is 1 minute. When the next check reports that
everything is okay for that service, it disappears from this list. So
this is the page where you can see the current, up-to-date failures of
your monitored services.

Last but not least for this article, I want to show you how you can
add another service with a new check program. I think this is the best
way to understand the configuration of hosts and services.

To show you how you could add new check programs on your own, here is
an easy example:

We add a simple bash script that checks whether the file
/tmp/nagios.chk exists. If it is there and it’s executable, the
service goes to critical; if it is there and not executable, it goes
to warning; and if it doesn’t exist, the service is ok.

  1. Create the executable check file

    # vi /usr/local/nagios/libexec/check_file_exist.sh

    Add the following to that file:

    #!/bin/bash
    #
    # Check if a local file exists
    #
    # Parse the command line: -F <file to check>
    while getopts F: VAR
    do
        case "$VAR" in
            F ) LOGFILE=$OPTARG ;;
            * ) echo "wrong syntax: use $0 -F <file to check>"
                exit 3 ;;
        esac
    done

    # No file given on the command line
    if test "$LOGFILE" = ""
    then
        echo "wrong syntax: use $0 -F <file to check>"
        # Nagios exit code 3 = status UNKNOWN = orange
        exit 3
    fi

    if test -e "$LOGFILE"
    then
        if test -x "$LOGFILE"
        then
            echo "Critical $LOGFILE is executable !"
            # Nagios exit code 2 = status CRITICAL = red
            exit 2
        else
            echo "Warning $LOGFILE exists !"
            # Nagios exit code 1 = status WARNING = yellow
            exit 1
        fi
    else
        echo "OK: $LOGFILE does not exist !"
        # Nagios exit code 0 = status OK = green
        exit 0
    fi

    Now set the file attributes:

    # chown nagios:nagios /usr/local/nagios/libexec/check_file_exist.sh
    # chmod +x /usr/local/nagios/libexec/check_file_exist.sh

  2. Add the check program to the Nagios configuration

    Each new check command has to be defined once in the global Nagios
    configuration:

    # vi /usr/local/nagios/etc/minimal.cfg

    Add the following block at the end of the file:

    define command{
        command_name    check_file_exist
        command_line    $USER1$/check_file_exist.sh -F /tmp/nagios.chk
        }

  3. Add a new service to the localhost

    Each new service has to be defined once in the Nagios configuration
    and can be assigned to a single host, multiple hosts or even a host
    group. We assign it only to the localhost that is already defined
    in this base configuration:

    # vi /usr/local/nagios/etc/minimal.cfg

    Add the following block at the end of the file:

    define service{
        use                     generic-service
        host_name               localhost
        service_description     File check
        is_volatile             0
        check_period            24x7
        max_check_attempts      4
        normal_check_interval   5
        retry_check_interval    1
        contact_groups          admins
        notification_options    w,u,c,r
        notification_interval   960
        notification_period     24x7
        check_command           check_file_exist
        }

  4. Verify the Nagios configuration and restart it

    After every change to the config files you should check the Nagios
    configuration, and you have to restart Nagios after that:

    # /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg

    The Total Warnings and Total Errors should be 0 if you have done
    everything correctly. If so, restart it with:

    # /etc/init.d/nagios restart
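Since the verify and restart commands always go together, they can be
wrapped in one small function. This is a sketch, not something shipped
with Nagios: safe_restart is a hypothetical name, the paths are the
ones used in this guide, and it assumes that nagios -v returns a
non-zero exit code when the configuration check fails.

```shell
#!/bin/bash
# Sketch only: safe_restart is an illustrative helper name.
# Paths as used in this guide; override via the environment if needed.
NAGIOS_BIN=${NAGIOS_BIN:-/usr/local/nagios/bin/nagios}
NAGIOS_CFG=${NAGIOS_CFG:-/usr/local/nagios/etc/nagios.cfg}
NAGIOS_INIT=${NAGIOS_INIT:-/etc/init.d/nagios}

# Restart Nagios only if the configuration check succeeds.
# Assumption: "nagios -v" exits non-zero when the check fails.
safe_restart() {
    if "$NAGIOS_BIN" -v "$NAGIOS_CFG" > /dev/null 2>&1; then
        "$NAGIOS_INIT" restart
    else
        echo "Configuration check failed; not restarting." >&2
        echo "Run: $NAGIOS_BIN -v $NAGIOS_CFG" >&2
        return 1
    fi
}
```

After sourcing this file as root, a config change is followed by a
single call to safe_restart. If the check fails, the running Nagios is
left untouched and the verify command is printed so the errors can be
inspected.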

  5. Check if the new program is working

    First take a look at the tactical screen and you should see that
    one service is in status pending. That means no check has been
    done for this service yet.

    Wait a few minutes and it should no longer be shown as pending,
    and the number of OKs should go up from 5 to 6.

    Now create the file and watch the tactical screen, the service
    detail screen or the service problems screen.

    # touch /tmp/nagios.chk

    As we set the normal_check_interval to 5 minutes in the service
    definition, you should get the warning message within that time.
    Now add the executable attribute and watch:

    # chmod +x /tmp/nagios.chk

    The status should change to critical within the check interval.
    When you delete the file the service should return to status ok.

    # rm /tmp/nagios.chk
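The three states can also be walked through in a shell, without
waiting for the check interval. The sketch below does not call the
script from step 1; file_state is a hypothetical function that mirrors
its tests, and a scratch directory is used instead of /tmp/nagios.chk
so that the running Nagios check is not affected.

```shell
#!/bin/bash
# Sketch only: file_state mirrors the tests in check_file_exist.sh.

# Prints a status line and returns the matching exit code.
file_state() {
    local file=$1
    if [ ! -e "$file" ]; then
        echo "OK: $file does not exist"
        return 0
    elif [ -x "$file" ]; then
        echo "CRITICAL: $file is executable"
        return 2
    else
        echo "WARNING: $file exists"
        return 1
    fi
}

# Print the status line followed by the exit code.
show() {
    local rc=0
    file_state "$1" || rc=$?
    echo "exit code: $rc"
}

# Walk a scratch file through all three states.
dir=$(mktemp -d)
scratch="$dir/nagios.chk"
show "$scratch"         # file missing
touch "$scratch"
show "$scratch"         # file exists, not executable
chmod +x "$scratch"
show "$scratch"         # file exists and is executable
rm -rf "$dir"
```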

So that’s all for the moment. I hope I have shown you a little bit
about Nagios and how it works.

For me it’s a great tool and it saves me a lot of time in my
day-to-day business.

If you continue to work with it, there are a lot of things that could
be improved in the configuration files. Please remember this is only a
simple installation. If you like, I can write some more articles about
it, how we manage our config files and what other check programs we
added to the system. We even added user authentication for Nagios
access with LDAP against our eDirectory, and so on.

Rainer Brunold
