Analyzing Access Statistics with the Webalizer
Martin Sommer
General
Webalizer is an open source program to display homepage access statistics. It has been ported to a variety of different platforms, including Linux for PC, Alpha and PPC, Solaris for Sparc and Windows. The analysis options are quite extensive and also depend on how Webalizer's configuration file and that of the Web server have been set up. Since the majority of Linux users surely use Apache as their Web server, only the required settings for the Apache configuration file httpd.conf are described. Generally, these settings can be used on other Web servers as well.
Webalizer is included in SUSE 7.1 Professional and can be installed with YaST. On the net, Webalizer is available in the download area of the Webalizer homepage or one of its mirror sites - for example, mrunix.net. There, you will find the source code as well as complete binaries for all platforms.
For an installation in German (or in other languages), DLR or the Swedish company Chalmers, for example, provide the source files for download.
Installation
If you are installing from the source files, you must compile them by running the command sequence
./configure
make
make install
as usual. For detailed instructions that include all the options available for ./configure, go to Webalizer's homepage and read the simple Installation Guide. Those who want to use the binaries should save the zipped file to a separate directory and unpack it there. Then copy the program file webalizer into a bin directory. The configuration file may be saved to the Webalizer directory or the /etc directory. The webalizer.1 file is the manual page and provides good documentation.
Using Webalizer
The following information refers to Webalizer Version 2.01. Webalizer analyzes log files. Usually, users want to analyze the main log file, which contains the access data for the homepage including all subordinate pages. The main log file's default name is access_log, and it is located in the directory /var/log/httpd. The program analyzes the different parameters that the Web server logs and saves to the access_log file. The most important parameters include:
- The user's IP address
- Access date and time
- All files the user loads
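To make the three parameters above more concrete, the following minimal sketch pulls them out of a single line in Common Log Format. It only illustrates the log layout and is not how Webalizer itself is implemented; the function name parse_clf_line and the sample line are made up for this example.

```python
import re

# Common Log Format: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

def parse_clf_line(line):
    """Return host, timestamp and requested file of one log line, or None."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None  # line does not look like Common Log Format
    return {
        "host": match.group("host"),   # the user's IP address
        "time": match.group("time"),   # access date and time
        "path": match.group("path"),   # the file the user loaded
    }

# Made-up sample line, for illustration only
sample = '192.0.2.10 - - [12/Mar/2001:14:05:07 +0100] "GET /index.html HTTP/1.0" 200 2326'
print(parse_clf_line(sample))
```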
Apache Configuration
Those who want to receive additional user-related information must make the appropriate settings in the Apache Web server's configuration file /etc/httpd/httpd.conf. Two of these options are particularly interesting: the "referer" and the "agent", that is, the page from which the user arrived, and the user's computer system, including operating system and browser. The disadvantage of requesting more information is that the log files for frequently visited pages, which are already quite large, will grow even larger.
In the file httpd.conf, simply activate the line
# CustomLog /var/log/httpd/access_log combined
by removing the hash mark (#). Alternatively, you can also log the agent and referer information to separate files (the directory size, however, will then grow even faster). To use this option, simply activate the two lines
# CustomLog /var/log/httpd/referer_log referer
# CustomLog /var/log/httpd/agent_log agent
located above the previous line.
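In Apache's combined format, the referer and the agent are appended to each log line as two additional quoted fields. The following sketch shows how these two fields could be read from such a line; the function name referer_and_agent and the sample line are assumptions made for illustration only.

```python
import re

# In the combined format, two quoted fields follow the status code and
# the size at the end of the line: "referer" and "user agent".
COMBINED_TAIL = re.compile(r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"\s*$')

def referer_and_agent(line):
    """Return (referer, agent) of a combined-format line, or None."""
    match = COMBINED_TAIL.search(line)
    if match is None:
        return None  # e.g. a plain Common Log Format line
    return match.group("referer"), match.group("agent")

# Made-up sample line, for illustration only
sample = (
    '192.0.2.10 - - [12/Mar/2001:14:05:07 +0100] "GET /index.html HTTP/1.0" '
    '200 2326 "http://www.example.com/links.html" '
    '"Mozilla/4.76 [en] (X11; U; Linux 2.4.0 i686)"'
)
print(referer_and_agent(sample))
```

A line written without the combined setting has no such fields, which is why the sketch returns None in that case.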
Configuring Webalizer
Please note: In general, a configuration file is not necessary. However, all settings that are not default settings must be added manually to the start command. This is why it is much more convenient to specify in the configuration file all settings for the program to use. Once Webalizer has been installed, the directory includes a standard configuration file with a number of preset, useful options. Other settings are provided only as examples so that users can simply activate them without having to worry about incorrect syntax. There are default settings for most of the options, which means that, in general, users do not need to make entries in the .conf file.
Accessing the Configuration File
Due to the limited scope of this article, only the most important configuration file options are discussed here. By default, the file is called webalizer.conf. The file should be located in the /etc/ directory, so that Webalizer can locate it without a path name upon startup. To use the file, simply start the process with webalizer. If you are using different configuration files for different tasks, that is, files other than /etc/webalizer.conf, you must always use the -c option to tell the program the path name of the file. For example, if you are using the webalizer.test file in the /etc/ directory, you need to enter the following program call:
webalizer -c /etc/webalizer.test
Log File and Output Directory
You must use the configuration file to specify which log file to use, i.e., there is no default. However, /var/log/httpd/access_log is already preset in the standard configuration file. Enter the file name in the following line:
LogFile /var/log/httpd/access_log
There are several log file formats that you can use. The standard format is clf. Zipped log files in gz format are specified in exactly the same way, which is useful if you occasionally zip large log files to free up disk space.
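If you want to inspect such log files with a script of your own, plain and zipped files can be treated the same way there, too. The following is a minimal sketch under the assumption that zipped files can be recognized by the ending .gz; the function names open_log and count_lines are hypothetical, and the code does not reflect Webalizer's internals.

```python
import gzip
import os
import tempfile

def open_log(path):
    """Open a log file as text, whether it is plain or zipped in gz format."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="latin-1")
    return open(path, "r", encoding="latin-1")

def count_lines(path):
    """Count the log entries in a file, one entry per line."""
    with open_log(path) as log:
        return sum(1 for _ in log)

# Demonstration with two throwaway files that hold the same made-up entries
entry = '192.0.2.10 - - [12/Mar/2001:14:05:07 +0100] "GET / HTTP/1.0" 200 10\n'
with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, "access_log")
    zipped = os.path.join(tmp, "access_log.gz")
    with open(plain, "w", encoding="latin-1") as f:
        f.write(entry * 3)
    with gzip.open(zipped, "wt", encoding="latin-1") as f:
        f.write(entry * 3)
    print(count_lines(plain), count_lines(zipped))
```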
Similarly, you should specify the directory where you want to save the results. The directory name is part of the following line:
OutputDir /usr/local/httpd/htdocs/webalizer
Naturally, it makes sense to create several output directories. If you are using different configuration files to perform different jobs, you should, of course, define the respective output directory in the corresponding .conf file. Otherwise, the system overwrites old data or, depending on the settings, appends new data that is completely unrelated to the old data, for example, because you analyzed a different log file or specified other unusual settings.
Since Webalizer generates HTML output, you may want to place the output directory below the document root directory (usually /usr/local/httpd/htdocs/) so that you can view the pages in your browser. In other words, the pages are then located in a directory accessible via http, for example, http://www.yourdomain.de/webalizer. Keep in mind, however, that your statistics are then accessible to outside users as well. If you want to avoid this situation, you must password-protect the directory or place it outside the document root directory. In the latter case, however, you are limited to loading the pages as files in your browser: file:///Path_to_Webalizer_directory/index.html. For businesses, the data may very well be important for internal business management purposes, so naturally not everybody should have access to it. For that reason, as well as to provide better protection from hackers, the directory should be located outside the document root directory, if possible.
Incremental or not?
Having made the two very important entries for the log file and the output directory, you next encounter the setting Incremental yes or no in the configuration file. Short answer: many hits, yes; few hits, no. If your page gets a large number of hits, it makes sense to zip the access_log file more frequently and to start over with a new file. With Incremental yes, Webalizer keeps an internal history, so that data from log files already processed is not lost when you start a new one. Generally, it is not necessary to activate the history for small, entirely private Web sites.
Output
Once you have selected the setting described above, the configuration file continues with a series of relatively unimportant parameters, or with parameters that already have a sensible default setting, such as the DNS lookup. This parameter determines which database the system uses to resolve IP addresses, that is, to translate IP addresses into host names. There are many output parameters that affect text output (usually, these parameters start with HTML). Among these, the only important parameter is the one that specifies which file types are counted as a "page" and included in the eventual "page impression" output (lines with PageType). The default settings here are htm* and cgi. If you are using php and/or Perl, it is easy to activate the corresponding lines. If you are using other formats, you must add the corresponding lines yourself (e.g., PageType asp).
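The difference between a hit and a page can be illustrated with a short sketch that classifies requested files by their extension, using the defaults htm* and cgi named above. The function name is_page is hypothetical, shell-style wildcards stand in for Webalizer's own matching, and counting requests without any extension (such as /) as pages is an assumption of this sketch.

```python
import posixpath
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

PAGE_TYPES = ["htm*", "cgi"]  # the default PageType settings named in the text

def is_page(url, page_types=PAGE_TYPES):
    """Decide by file extension whether a requested URL counts as a page."""
    path = urlsplit(url).path
    ext = posixpath.splitext(path)[1].lstrip(".").lower()
    if not ext:
        # Assumption of this sketch: requests without an extension are pages
        return True
    return any(fnmatchcase(ext, pattern) for pattern in page_types)

# Made-up requests, for illustration only
requests = ["/index.html", "/logo.gif", "/button.png", "/news.htm",
            "/cgi-bin/search.cgi?q=x"]
hits = len(requests)
pages = sum(1 for url in requests if is_page(url))
print(hits, pages)
```

Adding an entry such as "php" to the list corresponds to activating a further PageType line.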
Further down the list, there is another interesting parameter - the display size for "top tables". Once again, you can simply accept the default setting as an intelligent selection. Still, you should play with this parameter's settings to create an output format that reflects your personal preferences. As discussed above, the agent and the referer that you can specify here are included in the output only if logging of this information has been activated in the Apache configuration.
For example, if you want to use home.html instead of index.html as your default start page, you must define the appropriate entry in the IndexAlias section.
The subsequent section with the keywords Hide, Group, Ignore and Include is of greater importance for an intelligent homepage traffic analysis. In this section, you can specify for the system to hide or ignore completely all access activity that originates, for example, from your own machine, from other computers on the same network (e.g., all computers in your company) or from unwelcome users. On the other hand, you also have the option to hide all users, or to explicitly specify for the system to display selected users only (e.g., for internal purposes). You can (and should) also hide image access data (or access information related to other homepage file types, e.g., txt or tpl). Otherwise, every button click counts as a hit. If you select the Hide keyword to conceal certain information, the system leaves the corresponding numbers out of the tables and graphics displaying "top" statistics. This data, however, does appear in the "total" tables at the beginning of the Webalizer output, which means that the system does count those hits at that point. On the other hand, if you select Ignore, the system completely ignores that access data, even in the "total" tables. In addition, some grouping functions are available here that you can use to group certain parameters. Playing around with the grouping functions can result in clearer and better-organized output.
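The different effects of Hide and Ignore can be sketched as follows: ignored entries are dropped before anything is counted, while hidden entries still count toward the totals and are only left out of the "top" table. The function name summarize and the sample requests are made up, and shell-style wildcards stand in for Webalizer's own pattern matching.

```python
from collections import Counter
from fnmatch import fnmatchcase

def summarize(urls, hide=(), ignore=()):
    """Return (total_hits, top_table) for a list of requested URLs."""
    def matches(url, patterns):
        return any(fnmatchcase(url, pattern) for pattern in patterns)

    # Ignore: these requests are dropped entirely, even from the totals
    counted = [url for url in urls if not matches(url, ignore)]
    total_hits = len(counted)
    # Hide: these requests stay in the totals but not in the "top" table
    visible = [url for url in counted if not matches(url, hide)]
    top_table = Counter(visible).most_common()
    return total_hits, top_table

# Made-up requests, for illustration only
urls = ["/index.html", "/logo.gif", "/index.html",
        "/internal/report.html", "/logo.gif"]
total, top = summarize(urls, hide=["*.gif"], ignore=["/internal/*"])
print(total, top)
```

In this example the two image requests still count toward the total, whereas the request below /internal/ does not appear anywhere.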
The Search Word Function
The last section of interest is located towards the end of the file and concerns the definition of search words for search engines (lines starting with SearchEngine). Here you will already find default settings for a few important search engines. These settings specify which URL variable contains the search word. You can use this information to analyze what the users entered in the search engine to find the page. In terms of output, however, the system does not differentiate between search engines. It merely compiles a hit list for the search words. This can be helpful, for example, when you want to optimize the homepage meta tag keywords. This function becomes considerably more interesting if you are dealing with large Web sites that have their own search functions and you want to analyze the information visitors were looking for on your own site. This analysis makes it possible to redesign your homepage to better meet user needs, since it reveals just what the visitors are looking for. To perform such an analysis, it is best to first deactivate the other search engines by commenting out their lines with the hash mark (#). For example, add the following line for the open source search engine htdig, which is also being used on the portal:
SearchEngine htsearch words=
In this context, you must specify htsearch instead of htdig, since the system requires a substring from the URL. After a search with htdig, the URL contains the string htsearch, not the string htdig. The words= string is entered here, since words is the name of the variable that htdig uses to save search terms.
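The principle behind such a SearchEngine line can be sketched in a few lines: if the referring URL contains the given substring, the value of the named variable is read from its query string and counted. The function name search_words and the sample URLs are hypothetical, and counting the whole entered string rather than splitting it into single words is a simplification of this sketch.

```python
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# (substring of the URL, name of the query variable), mirroring the
# line "SearchEngine htsearch words=" from the text
SEARCH_ENGINES = [("htsearch", "words")]

def search_words(referer, engines=SEARCH_ENGINES):
    """Return the search strings found in a referring URL, if any."""
    for substring, variable in engines:
        if substring in referer:
            query = parse_qs(urlsplit(referer).query)
            terms = query.get(variable, [])
            return [term.strip().lower() for term in terms if term.strip()]
    return []

# Made-up referring URLs, for illustration only
referers = [
    "http://www.example.com/cgi-bin/htsearch?words=linux+kernel&method=and",
    "http://www.example.com/cgi-bin/htsearch?words=Linux+Kernel",
    "http://www.example.com/cgi-bin/htsearch?words=apache",
    "http://www.example.com/index.html",
]
hit_list = Counter(word for ref in referers for word in search_words(ref))
print(hit_list.most_common())
```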
You can view an example of a working file here. To provide a clearer overview, all comments were removed from this file, as well as all parameters that are superfluous and those parameters with meaningful default settings.