Changes to wwwstat: httpd logfile analysis package
==================================================
Copyright (c) 1994, 1996 Regents of the University of California.
See the file LICENSE for licensing and redistribution information.
See the file INSTALL for installation information.
See the file README for more information.

If you have any suggestions, bug reports, fixes, or enhancements,
join the mailing list by sending a mail message with "subscribe"
in the subject to .

Known problems

2.0 will be my last (planned) official release of wwwstat. I have set up
the mailing list for people who would like to help each other use wwwstat
or distribute modifications. List discussions, updates, and other trivia
are archived at and .

Version 2.0      04 November 1996
   Added splitlog script for splitting logfile by virtual host or URL path.
   Added manual for splitlog in all three formats.
   Changed manpage.* to wwwstat.html and wwwstat.ps.
   Changed wwwstatrc to wwwstat.rc to be more PC-friendly (yuck).
   Changed mechanism for finding the configuration files on @INC.
   Made timestamp parser slightly more lenient (for phttpd brokenness).
   Removed unnecessary split on address.

Version 2.0b1    07 October 1996
   Added user and system config files [suggested by everyone].
   Replaced old getopts with hand-coded function, which means that
     multiple search options are allowed (they get OR'd together).
   Added ability to read in any number of old summary files.
   Rewrote inclusion mechanism to parse by section heading.
   Added options to enable/disable output of CGI headers.
   Added options to enable/disable output by section.
   Added options to change sort ordering function for each section.
   Added options to display only top N for each section.
   Added option to display sections for both sorted top N and all entries.
   Added options to enable/disable creating link to each archive entry.
   Added options to truncate archive URL by level and/or filename.
   Added -R option for displaying daily stats in reverse order [Reinier Post].
   Added -m|M options for selection based on the HTTP method.
   Added option to lookup DNS (with cached results) on unresolved addresses.
   Added option to disable escaping of "+" and "." in -aAnN regexps.
   Added config ability to exclude or replace any URL matching pattern
     with a special string in the archive listing.
   Added config ability to exclude or replace any subdomain match
     with a special string in the domain listing (overrides country-code).
   Removed parsing of srm.conf and -s option for greater portability.
   Removed -i include option (now we just look at first line of file).
   Added "--" as last option indicator (to avoid treating files as options).
   Added "-" as filename to indicate standard input.
   Added "+" as filename to indicate the default logfile.
   Added summary and estimates for all HTTP/1.1 status codes.
   Added %Y pattern for placing year in link to last summary.
   Added -X option for setting last summary URL on command-line.
   Added -H option for setting HTML title and heading text.
   Made the DirectoryIndex a perl regex so that it can match multiple
     forms of index/overview/... file or script names.
   Replaced country-codes file and initialization with the %DomainMap
     table in domains.pl, which will make it easier to override names.
   Now displays empty tables rather than error on no matching data.
   Added workaround for perl4 bug of overflowing %12d in printf.
   Stopped reversing of already-reversed unresolved subdomains.
   Forced parsing of timestamp to be more discriminating [Bob Kieronski].
   Improved efficiency of matches containing variable patterns [Dan Klein].
   Added example perl script for monthly log rotation.
   Added wwwerrs perl script for analyzing the error_log.
   Changed distribution URL for dual http/ftp access to my site.
   Added a new man page.

Version 1.1      never released
   Added the ability to exclude today via the -D 'today' option
     (or include only today via -d 'today' option).
     This vastly simplifies nightly runs to generate the previous day's
     summary.
   Remove NULLs from the logfile entry before processing [Terry West].
   Assume 200 response if "-" (unknown) is in logfile.
   Replace any %7E with the original tilde "~" in the archive section.
   Fix dumb browsers' inability to parse relative URLs (on 200 status).
   Added example for globbing "hidden" directories.

Version 1.01     April 24, 1994
   Minor change: new log format uses leading zero in day number field,
     so that is added to oldlog2new and blanked by space in wwwstat.

Version 1.0      April 23, 1994
   Now supports the NCSA httpd_1.2 "common" log format. As a result,
     all attempts to figure out file size are gone and there is no longer
     any need for all those fstat tests. Code for srm parsing of aliases
     and scripts has been removed. Basically, the entire log parsing
     section was rewritten and then placed in a subroutine to allow for
     multiple logfiles.
   Bunches of unnecessary backslashes removed from print statements.
   Time of last update now includes GMT offset instead of full GMT.
   Tries to estimate size of headers and error messages to account for
     bytes that are not included in the log entry byte count.
   Allows perl regular expressions (where possible) in all searches.
   Allows multiple logfiles to be analyzed in sequence, with any
     compressed logfiles automatically recognized by their file extension.
   Removed -f and -z options because they are no longer needed.
   Added -c option for searching based on server response code.
   Added the uppercase options -A, -C, -D, -T, and -N which perform the
     negation of the corresponding lowercase letters, i.e. they force
     wwwstat to not include any log entries with the given pattern in the
     address, response code, date, time, or archive name.

Version 0.4 (later called oldwwwstat)    April 19, 1994
   Removed escapes to allow regular expressions in -d and -t searches.
   Fixed minor bug of outputting instead of .
   Made use of $startTag and $endTag explicit for report output.
   Added option to append subdomain info on end of local hosts.
   Added support for IdentityCheck (rfc931) logfile format.
   Added output of Totals by Remote Identifier when Do_Ident is requested.
   Added -r option to select Do_Ident when IdentityCheck is enabled.
     NOTE: For security reasons, you should not publish to the web any
     report that lists the Remote Identifiers. This option is intended
     for server maintenance only.

Version 0.3      March 9, 1994
   Added links for last server summary, table-of-contents, and a reference
     to the standard distribution site (all because similar things looked
     good in Kevin Hughes' getstats output).
   Automatically determines URL of previous month's summary.
   Now allows extra spaces on Alias directive lines in srm.conf.
   Now recognizes Redirect directives and estimates size of message.
   No longer counts automatically redirected directory names twice --
     it estimates size of redirect message and counts that instead.
   Now uses normal printf's instead of perl forms.
   Reversed order of printed fields to allow for long names and the
     ability to read its own output (see the -i option below).
   Updated the country-codes file to reflect latest standards/spelling.
   Added the following options (phew!):

     Display Options:
        -h  Help -- just display the usage message and quit.
        -e  Display all invalid log entries on STDERR;
            -- this is great for finding trashed log entries for cleaning.
        -l  Do display full IP address of clients in my domain.
        -L  Don't display full IP address of clients in my domain.
        -o  Do display full IP address of clients from other domains.
        -O  Don't display full IP address of clients from other domains.
        -u  Do display IP address from unresolved domain names.
        -U  Don't display IP address from unresolved domain names.
        -v  Verbose display (to STDERR) of each log entry processed;
            -- useful, but not recommended for long logs.
        -x  Display all requests of nonexistent files to STDERR;
            -- this is great for finding misadvertised or moved URLs.
     Input Options:
        -f  Read from the following access_log file instead of the default;
            -- allows you to read archived (or test) logfiles.
        -z  Use zcat to uncompress the log file while reading [requires -f];
            -- allows you to read compressed archive logfiles;
               use "gzip -9" to get factor of 10 reduction in file sizes.
        -s  Get the server directives from the following srm.conf file;
            -- allows you to archive the configuration along with the log.
        -i  Include the following file (assumed to be a prior wwwstat output);
            -- incredibly great, allows you to keep partial summary periods
               in wwwstat output files and purge the logfile. Inventive
               admins can find many uses for this, such as being used by
               scripts to provide fast, up-to-the-minute stats.

     Search Options (include in summary only those log entries):
        -a  Containing the following "substring" in the IP address.
        -d  Containing the following "substring" in the date.
        -t  Containing the following "substring" in the time.
        -n  Containing the following "substring" in the archive (URL) name.
            -- allows you to restrict logfile summaries to an area of
               particular interest; great for custom author summaries.
               Search strings are matched as substrings, prefix (if string
               starts with a caret "^"), or suffix (if string ends with "$").
               Note that strings containing $ must be enclosed in single
               quotes for most shell command lines.

Version 0.2      January 21, 1994
   Added support for the /~username form of files.
   Added general support for Alias and ScriptAlias configurations.
   Now reads the server config file to get site configuration.
   Sped up the process by caching file sizes (fewer file stats).
   Added options to display full IP addresses in subdomain listing.
   Expanded some form field sizes.
   Now sorts archive section by name.

Version 0.1      January 14, 1994
   Added support for HTML output.
   Added reversed subdomain statistics.
   Added the logic for grouping files in archive sections.
   Rewrote the whole damn thing.

Version 0.0
   Originally from fwgstat 0.035 (jem@sunsite.unc.edu) with all the extra
     options stripped out and many bugs fixed. In turn, fwgstat was heavily
     based on xferstats, which is packaged with the Wuarchive FTP daemon.
     Fwgstat is for multi-server stats.
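
Illustration (not part of the wwwstat distribution): the Version 0.3 notes
say that search strings are matched as substrings, as a prefix when the
string starts with a caret "^", or as a suffix when it ends with "$". The
sketch below shows one way to read that rule. It is written in Python rather
than Perl, the function name matches() is hypothetical, and it assumes the
rest of the string is compared literally (as in 0.3, before regular
expressions were allowed) and that a string with both anchors means an
exact match.

```python
def matches(search: str, field: str) -> bool:
    """Return True if `field` satisfies the 0.3-style search string.

    Assumptions (not taken from the wwwstat source): literal comparison,
    and "^...$" is treated as an exact match.
    """
    anchored_start = search.startswith("^")
    anchored_end = search.endswith("$")

    # Strip the anchor characters to get the literal text to look for.
    core = search[1:] if anchored_start else search
    if anchored_end:
        core = core[:-1]

    if anchored_start and anchored_end:
        return field == core
    if anchored_start:
        return field.startswith(core)   # prefix match
    if anchored_end:
        return field.endswith(core)     # suffix match
    return core in field                # plain substring match
```

For example, matches("^/docs", "/docs/index.html") and
matches(".html$", "/docs/index.html") are both true, while
matches("^index", "/docs/index.html") is false. On a shell command line the
second pattern would need single quotes, as noted under Version 0.3.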