Searching Web Sites with ht://Dig
Martin Sommer
This article is an introduction to the free search engine ht://Dig. It does not attempt to present every feature and function of this complex program; rather, it focuses on the options that are most useful in everyday use. The sample configurations given will meet the needs of most users who want an easy-to-use search engine for their own Web site or intranet. In the following, "ht://Dig" refers to the entire program package, while "htdig", "htmerge" etc. refer to the individual programs.
What is ht://Dig?
ht://Dig is a free full-text search engine for Web sites and intranets. It is not suitable for searching the whole of the Internet. Nevertheless, it is a very powerful engine offering a wide range of options that are easy to configure and well documented on the ht://Dig homepage. Among the program's outstanding features are the fuzzy search option (which completes substrings to words, finds words that sound similar and matches words with all possible endings), Boolean expression searching (with AND, OR and NOT) and the (almost) completely editable output templates. A short list of the major features can be found on the ht://Dig site (see "Features and requirements" under Links).
Requirements
htdig runs on a number of different Unix and Linux systems and is released under the GNU GPL. In this article it is assumed that htdig has been installed on SUSE LINUX with YaST. The compilation of the program and the settings that need to be made are therefore not discussed. Apart from a relatively large amount of disk space, there are no further requirements for installing and successfully running the search engine. In terms of disk space requirements, however, htdig is quite greedy. Basically, you need about 12KB per document. So, for 1,000 HTML pages you would need 12 MB.
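As a rough illustration of this rule of thumb, the following shell sketch counts the HTML files below a document root and multiplies the count by 12 KB. The function name estimate_index_kb is an assumption made for this illustration, and only files ending in .html are counted, so the result is an estimate at best.

```shell
# Rough database size in KB for all *.html files below a directory,
# using the rule of thumb of about 12 KB per document.
estimate_index_kb() {
    docroot="$1"
    # Count the HTML files below the document root.
    count=$(find "$docroot" -type f -name '*.html' | wc -l)
    # Multiply the number of documents by 12 KB each.
    echo $(( $count * 12 ))
}
```

Called with the Web server's document root as its argument, the function prints the estimated size in KB; 1,000 pages give 12000, which is the 12 MB mentioned above.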
Installation
The ht://Dig package is installed with YaST. It is part of the n Series (Network Support). When you install the program, the directory /opt/www is created and all the necessary files are unpacked there. That is basically all there is to the installation. To search Web sites, it is a good idea to copy the search program (htsearch) into the Web server's cgi-bin directory (usually /usr/local/httpd/cgi-bin). After installation, htsearch is located in the directory /opt/www/htdig/bin.
The Individual Programs
- htdig
- This is the "spider". This program is responsible for parsing all the files, and gathering and saving the necessary information about them. ht://Dig calls this program the "search robot".
- htmerge
- This is the "indexer". It generates the document index and the word database using the information gathered (or 'dug up') by htdig.
- htfuzzy
- This is the fuzzy search. It uses the endings script and the endings dictionary to create a database with which all possible forms of a word in the relevant language are recognized and included in the search results. This also makes it possible for substrings to be completed to words. htfuzzy only needs to run once to create these databases, since they do not depend on the documents. In addition to the endings database, htfuzzy can create further databases. A synonym dictionary is required to generate a database of synonyms; an English one is installed with the program. Creating soundex and metaphone databases, which are used to find words that sound similar, makes little sense: this can be taken too far, with the effect that the results have little in common with the search term. See below for details on activation.
- htnotify
- htnotify scans the files to find those that are out-of-date. If any are found, an e-mail message is sent to the relevant person. The e-mail address, the subject and the date after which a file is considered out-of-date can be defined for each file using meta-tags. The relevant meta-tags in the head of an HTML page look like this:
<meta name="htdig-email" content="maintainer@bigpage.de">
<meta name="htdig-email-subject" content="Update page!!!">
<meta name="htdig-notification-date" content="01/07/2001">
- htsearch
- This is the actual search function, called when the user clicks the Submit button in the search form. It can be called with both the "POST" and "GET" methods. Use "GET" if you can, as the variables passed then appear in the URL and a tool such as Webalizer can be used to analyze the search terms (see the Webalizer article for more information). htsearch also uses many of the settings in the configuration file, in particular those concerning the location of the databases to be searched, how the results pages are generated and which images are used on them.
Configuration
The section that follows describes a sample search engine configuration, shows how to set up the search form and how to modify the output templates to suit your own requirements. After installing htdig, you will find very useful sample files for each of these three areas in the directories /opt/www/htdig/conf (file: htdig.conf) and /opt/www/htdig/common (files: search.html, header.html, footer.html, long.html, short.html, wrapper.html, nomatch.html and syntax.html). Most of these sample files can be used as-is or simply modified by replacing the values there with your own ones.
The Search Form
There are a number of variables that can be passed as "hidden" values with the search form. The only variable that is actually required is words. It is passed from the search form's text box and holds the search terms for htsearch (see below). Most of the other variables do not have to be defined here as they have defaults. The most important of them is the name of the configuration file (the default is htdig.conf). If the configuration file has been renamed or if you are using several configuration files in different search forms, then the name has to be specified in the form. For example, if the configuration is in the file myfiles.conf, the following line needs to be added to the form's source text:
<input value="myfiles" type="hidden" name="config">
If you have chosen to leave the defaults as they are, the following lines will be sufficient for a simple and complete form:
<form method="get" action="/cgi-bin/htsearch">
<input value="" type="text" size="12" name="words">
<input type="submit" value="Find">
</form>
All the other variables that you can use in the search form are well described on the ht://Dig pages.
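Because the form uses the GET method, the field values end up in the query string of the request. The shell sketch below builds such a URL by hand, which can be handy for testing htsearch without a browser. The function name htsearch_url and the host www.example.com are hypothetical; only spaces are encoded here, whereas a browser encodes other special characters as well.

```shell
# Build the URL that a GET search form would request (sketch).
htsearch_url() {
    base="$1"         # e.g. http://www.example.com/cgi-bin/htsearch (hypothetical host)
    words="$2"        # search terms as typed by the user
    config="${3:-}"   # optional: config name without ".conf", as in the hidden field
    # Replace spaces with "+"; other special characters are left untouched.
    query="words=$(printf '%s' "$words" | tr ' ' '+')"
    # Prepend the config field only if one was given.
    if [ -n "$config" ]; then
        query="config=$config&$query"
    fi
    printf '%s?%s\n' "$base" "$query"
}
```

Calling htsearch_url with the base URL, the search terms "suse linux" and the config name myfiles yields a URL ending in ?config=myfiles&words=suse+linux.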
The Configuration File
The configuration file is the most important editable element in ht://Dig. For the sake of clarity, it is assumed that you are using only one configuration file and that it is called htdig.conf (the default). As listing and explaining all the options in htdig.conf is beyond the scope of this article, we have provided a separate sample file with detailed comments on the most important settings. The most important variables are explained briefly below.
ht://Dig uses variables that are unique to ht://Dig in the configuration file. These variables identify the directories for certain files. The paths, however, can also be specified as absolute paths. Here are some of the path variables:
- ${CONFIG_DIR}: the directory in which the .conf files are located. By default, this is /opt/www/htdig/conf
- ${COMMON_DIR}: the directory in which the templates, the databases created by htfuzzy, the corresponding dictionaries and the "bad words" (see below for more information) are located. By default, this is /opt/www/htdig/common
- ${DATABASE_DIR}: the directory in which the generated word databases are located. By default this is /opt/www/htdig/db
There are further variables, e.g. ${IMAGE_DIR} and ${BIN_DIR}, for the directory with the images and the one with the programs. They do not have to be set when the images and the programs are located in the intended directories. In that case, the image file name is enough; the default path (${IMAGE_DIR}) is prepended automatically.
The Most Important Attributes in the Configuration File
- database_dir
- The database directory. This is where the databases, which are created when htdig and htmerge search and index the pages, are stored.
- start_url
- The URL(s) from which the pages are searched recursively.
- exclude_urls
- Pages that you definitely want to exclude from the search.
- search_algorithm
- Possible algorithms: exact, endings, substring, synonyms, prefix, metaphone, soundex. Each is assigned a weight between 0 and 1, which is how htsearch weights the search. Only exact, endings and substring are actually useful here. The weighting is a bit cryptic: even a weighting of 0 for endings will search for and return the requested word with all possible endings, and a weighting of 0 for substring will still expand the search string into words.
- keywords_meta_tag_names
- Here, you can specify the meta-tags to be searched.
- locale
- The language locale. This is particularly important for non-English sites. For example, if you want German umlaut characters in HTML documents (ä, ö, ü and ß) to be recognized, set locale: de_DE here. These letters will then be recognized regardless of whether they are written on the HTML page as literal characters (ä, ö, ü, ß), as named entities (&auml;, &ouml;, &uuml;, &szlig;) or as numeric entities (&#228;, &#246;, &#252;, &#223;).
- excerpt_length
- Here, you can specify the number of bytes (in other words characters) for the small excerpt that is printed with the search result.
- match_method
- This specifies how the entered search words are combined: with "and" (the default) all words must match, with "or" any one of them may match, and with "boolean" the user can enter Boolean expressions.
- use_meta_description
- If this is true, the long results output (including the excerpt) also includes the contents of the description meta-tag and not just the first few lines of the body text.
- The following four attributes are extremely useful (it is almost impossible to do without them) for searching German Web sites. The original version of the program exists only in English and is designed to search English-language Web sites, so the English endings dictionary and the endings script that creates the endings are of little use. The solution is to look on the Internet for a German dictionary and the corresponding script (the "affix file"). These attributes are not necessary for searches on English-language sites as the defaults apply.
- endings_affix_file
- For German Web sites, enter: endings_affix_file: german.aff
- endings_dictionary
- Enter the dictionary here: endings_dictionary: german.0
- In order to find these words, htfuzzy also needs to have run at least once. It creates the large endings databases. These are defined with the following attributes (or their paths), and they are usually stored in the common directory.
- endings_root2word_db
- For example: endings_root2word_db: ${common_dir}/r2wgerman.db
- endings_word2root_db
- For example: endings_word2root_db: ${common_dir}/w2rgerman.db
- A quick word about the file ${common_dir}/bad_words: This file typically contains (English) words to be excluded from indexing because such words are never searched for. "the", "and", "for", "with", "not", "by" etc. are examples of such words. To prevent the corresponding German words from being indexed, enter them manually in the file: und, an, auf, bei, zu, in, ab, der, die, das....
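Put together, the attributes above might be combined as in the following sketch for a German-language site. It is written as a shell function that prints the file contents so that they can be redirected into a .conf file. The function name sample_conf, the start URL, the excluded patterns and all numeric values are assumptions chosen for illustration, and the dictionary files are assumed to be in the common directory; check the attribute documentation before relying on any of them.

```shell
# Print a minimal ht://Dig configuration for a German-language site (sketch).
# The quoted 'EOF' keeps the shell from expanding ${common_dir}.
sample_conf() {
    cat <<'EOF'
# Hypothetical site; replace with your own values.
database_dir:          /opt/www/htdig/db
start_url:             http://www.example.com/
exclude_urls:          /cgi-bin/ .cgi
# Weights between 0 and 1 for each search algorithm.
search_algorithm:      exact:1 endings:0.1 substring:0.1
locale:                de_DE
excerpt_length:        200
match_method:          and
use_meta_description:  true
# German endings support (files assumed to be in the common directory).
endings_affix_file:    ${common_dir}/german.aff
endings_dictionary:    ${common_dir}/german.0
endings_root2word_db:  ${common_dir}/r2wgerman.db
endings_word2root_db:  ${common_dir}/w2rgerman.db
EOF
}
```

Redirecting the output, for example with sample_conf > /opt/www/htdig/conf/myfiles.conf, would create a configuration file under the name used in the later examples.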
The Output Templates
Basically, there are two ways to use the templates to generate the results pages. The first is to use a separate template each for the page header, the body (the actual results) and the footer. Normally, these templates are in the files header.html, long.html and footer.html. Alternatively, you can use the file short.html for the middle part; however, it takes the concept of 'short' to extremes, which is why we recommend the more verbose (longer) output. The second option is the file wrapper.html. Here, everything is processed in a single file and the link output is generated in a somewhat different manner. This approach is not quite as flexible as the three-file approach, but it is simpler. In the examples that follow, the three-file alternative is used. If no match is found or if the syntax entered for a Boolean search is incorrect, the files nomatch.html and syntax.html are used.
Here, too, the sample files provided are very useful. As they are HTML files, they are very easy to edit (assuming you know your way around HTML, of course ;-). JavaScript and cascading style sheets can also be integrated without any trouble. Unfortunately, PHP cannot be used: PHP-based Web sites and the dynamic results pages returned by the search engine do not work together, which is why static output has to be offered in this case. The variables you can use here are less flexible. They are defined in htsearch's program code and therefore cannot be modified (unless you happen to be a C++ wizard and plan to rewrite htsearch).
The most important variables for the output templates are listed briefly here. A list of all the template variables can be found on the ht://Dig site (see "Templates and variables" under Links).
- $&(URL): This is where the link is saved.
- $&(EXCERPT): This is for the short text that is displayed with each match. You define the contents and length of the excerpt in the configuration file.
- $&(WORDS): This is where the entered search terms are saved; on the results page, this appears in the URL as words=.... This is used in header.html, the header for the results page (as well as in nomatch.html).
- $&(LOGICAL_WORDS): All the words that the search engine will search for using the fuzzy search feature (i.e., completed substrings and the search terms with all possible endings). To display all of these on the results page, the header template (header.html) must include $&(LOGICAL_WORDS) instead of $&(WORDS). To see this in action, enter a term in the Quick Search form on the ht://Dig site; the "logical words" then appear on the results page.
- $&(MATCHES): The number of matches
Below is an example of the file long.html. This file is used to generate the search results (without a header or footer). This is simply a definition list which is filled with the different variables.
<dl><dt><strong><a href="$&(URL)">$&(TITLE)</a></strong>$(STARSLEFT)
</dt><dd>$(EXCERPT)<br>
<i><a href="$&(URL)">$&(URL)</a></i>
<font size="-1">$(MODIFIED), $(SIZE) Bytes</font>
<br>
</dd></dl>
Starting and Automation Options
All that needs to be done now is to make sure that the spider and the indexer run regularly so that the databases are always up-to-date. The first time round, you should also run htfuzzy once, as explained above, to create the endings databases. Apart from that, startup works as follows: if the default configuration file, i.e. /opt/www/htdig/conf/htdig.conf, is being used, all you need to do is change to the directory /opt/www/htdig/bin as root and start both programs. If you want the old databases to be deleted in the process instead of having new results appended, you also need to specify the -i option. Startup looks like this:
./htdig -i
./htmerge
If you are using a configuration file with a different name, e.g., /opt/www/htdig/conf/myfiles.conf, also change to /opt/www/htdig/bin and enter:
./htdig -c ../conf/myfiles.conf -i
./htmerge -c ../conf/myfiles.conf
Enter the following to create the databases with htfuzzy:
./htfuzzy endings -c ../conf/myfiles.conf #generates the endings database
./htfuzzy synonyms -c ../conf/myfiles.conf #generates a synonym database
Further options for htfuzzy are metaphone and soundex, each of which generates additional databases. See above for details.
To finish, it is a good idea to write a shell script that will handle the start routine and then to start this as often as required as a cronjob. If the script is called digger.sh, is located in /usr/local/httpd/cgi-bin and is to run once every twenty-four hours at 4:30 A.M., start the Crontab editor as root with crontab -e, type i for vi Insert mode and enter the following line:
30 4 * * * /usr/local/httpd/cgi-bin/digger.sh
Then press Esc and ZZ to save the cronjob.
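The digger.sh script mentioned above is not shown in this article; the following is one way it might look. It is a minimal sketch that wraps the two commands from this section in a function. The function name rebuild_index and the environment variables BIN_DIR and CONF are assumptions made for this illustration; the default paths are the ones used throughout the article.

```shell
#!/bin/sh
# digger.sh: rebuild the ht://Dig databases (sketch).
# Override BIN_DIR or CONF in the environment for a different layout.

rebuild_index() {
    bin_dir="${BIN_DIR:-/opt/www/htdig/bin}"
    conf="${CONF:-/opt/www/htdig/conf/htdig.conf}"

    # -i deletes the old databases; -c selects the configuration file.
    "$bin_dir/htdig" -i -c "$conf" || return 1
    # Build the document index and word database from what htdig gathered.
    "$bin_dir/htmerge" -c "$conf" || return 1
}
```

The script would end with a plain call to rebuild_index, and it must be executable (chmod +x) for cron to start it. Stopping at the first failing step avoids merging a half-finished dig.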
Tips and Tricks
Documents in Other Formats
ht://Dig will also search other document formats such as Word and PDF in addition to HTML and TXT files. This simply requires the appropriate parsers. With Acrobat Reader as the parser, PDF files offered for download on Web sites can be searched directly. The parser simply needs to be activated with the following line in the configuration file:
pdf_parser: path_to_parser/acroread -toPostScript
Documents Embedded in PHP
If the entire Web site is PHP-based, chances are that the documents are not displayed using hard links but are called up by a PHP script, either to reflect specific user settings or to build the right templates around the document. As the search engine output does not support PHP and the links returned in the search results are always hard links, you need to use a little workaround: the url_part_aliases attribute. With this, the output link can be different from the one actually found. This requires two configuration files: a "FROM" file with a "FROM" string and a "TO" file with a "TO" string. The portal search function is a good example. Here, every article found is called by the script content.php, to which the user configuration is also passed. htsearch replaces parts of the hard link to the article found with a string containing the script content.php and a standard user configuration. The exact syntax can be found in the attribute documentation on the ht://Dig site (see "Attributes" under Links).
Links
Unfortunately, the ht://Dig site uses frames, which means that most of the links below will take you to the respective frame only but without displaying the necessary navigation bar on the left. You may prefer to visit the main page and get an overview for yourself using the good navigation system in place. The most important page with the attributes for the configuration file can be accessed as follows via the navigation bar: Go to main page (http://htdig.org) --> Configuration file --> Alphabetical or By Program
- Homepage: http://htdig.org
- Mirror: http://htdig.sourceforge.net
- Attributes: http://htdig.org/attrs.html
- Features and requirements: http://htdig.org/require.html
- FAQ: http://htdig.org/FAQ.html
- ht://Dig sample files: http://htdig.org/config.html
- htdig program: http://htdig.org/htdig.html
- htmerge program: http://htdig.org/htmerge.html
- htfuzzy program: http://htdig.org/htfuzzy.html
- htnotify program: http://htdig.org/htnotify.html
- htsearch program: http://htdig.org/htsearch.html
- Meta-tags: http://htdig.org/meta.html
- Templates and variables: http://htdig.org/hts_templates.html
- PHP and ht://Dig: http://www.devshed.com/Server_Side/PHP/search/ - very interesting (and complex) instructions on how ht://Dig can be combined with dynamic PHP pages.