NTP Monitoring System

CESNET technical report number 23/2007
also available in PDF, PostScript, and XML formats.

Vladimír Smotlacha
6.12.2007

1   Abstract

The NTPMON system provides centralized monitoring of NTP processes. Its main features include storing collected data in a database, graphical visualisation of observed parameters, generation of events associated with unexpected parameter changes or out-of-range values, and sending e-mail warnings whenever a serious NTP malfunction is detected. NTPMON is used in CESNET to verify the performance of NTP servers and to check the time accuracy of all passive and active network monitoring sites.

Keywords: NTP, monitoring

2   Introduction

All passive and active monitoring systems require an accurate local clock. Knowing the exact system time is essential mainly for timestamping captured packets and for time-sensitive applications such as one-way delay measurement. The expected absolute accuracy (the difference between local time and UTC) ranges from 10⁻³ s to 10⁻⁵ s.

The most common clock synchronization method in the networking environment is NTP [Mil92], optionally with a GPS receiver as an external time source. The NTP process should be monitored; otherwise, we have no evidence that any particular time-dependent measurement is correct.

Universal tools for monitoring network services exist (e.g., Nagios); however, they mainly test the availability of the NTP service and do not deal with its performance. Another tool is NTP Time Server Monitor, but it is designed mainly for local NTP monitoring. We looked for a centralized system that can monitor many external NTP sites, and we decided to develop such a system and integrate it into our network monitoring infrastructure.

This document describes NTPMON, a centralized NTP monitoring system that checks parameters of NTP processes running on remote workstations, collects data, and plots graphs. When any parameter is above or below the specified threshold, the system generates an event and, in some cases, also an alert that is sent via e-mail to the administrator. NTPMON can monitor even sites that are administered by another authority, as it does not need any nonstandard cooperation with the monitored site.

3   Data collection

There is no single method for collecting all important parameters of the NTP process. Some data are logged by the NTP process; others are available via system functions (e.g., adjtimex(), ntp_adjtime()) and tools (e.g., ntpq). As our goal was to implement a centralized system without any new software running on the monitored site, we decided to avoid log mining and all locally running tools.

We designed and programmed agents for parameter collection. Each monitoring agent runs and saves its data independently.

3.1   NTP process polling

The agent ntpq periodically queries the status of the remote NTP process using the command ntpq -c rl. The returned parameters are parsed and inserted into the database.
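The parsing step can be illustrated with a minimal Python sketch. It is not the NTPMON agent (which is written in C); the function names parse_rl and query_rl and the sample text are hypothetical, and the sketch only assumes that ntpq prints comma-separated name=value pairs, some of them quoted.

```python
import re
import subprocess

# name=value pairs; a value is either a quoted string or a bare token
_PAIR = re.compile(r'(\w+)=("[^"]*"|[^,\s]+)')


def parse_rl(text):
    """Turn the text printed by `ntpq -c rl` into a dict of strings."""
    return {name: value.strip('"') for name, value in _PAIR.findall(text)}


def query_rl(host, timeout=10):
    """Run ntpq against a remote host (requires ntpq to be installed)."""
    result = subprocess.run(
        ["ntpq", "-c", "rl", host],
        capture_output=True, text=True, timeout=timeout, check=True,
    )
    return parse_rl(result.stdout)


# Illustrative sample only, not output captured from a real server.
SAMPLE = '''associd=0 status=0615 leap_none, sync_ntp, 1 event, clock_sync,
version="ntpd 4.2.0", system="Linux", leap=00, stratum=2,
rootdelay=1.234, refid=192.0.2.1, offset=0.123, frequency=-12.345'''

params = parse_rl(SAMPLE)
```

All values are kept as strings at this stage, so that the caller can decide how to convert each parameter before storing it.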

The status of each NTP process is described by a set of qualitative and quantitative parameters. NTPMON displays a selected subset of these parameters:

3.2   NTP client

In principle, the agent clie compares local time with the time of the observed system; it therefore has to run on a computer with a very accurate and stable clock, which we consider the reference clock REF.

The agent clie behaves like an NTP client that sends an NTP query to the monitored (remote) NTP process. From the response, the agent clie calculates and stores in the database:

As a side effect, the agent also checks that the remote NTP service is available.

The value of θC is evaluated on the assumption of a symmetrical one-way delay between the server and the client; it is therefore only an estimate of the real offset θ. Let us assume that the uncertainty of the REF clock is negligible. The relation between the real offset θ of the remote NTP clock and the measured value θC is given by the formula

θC - δC/2 ≤ θ ≤ θC + δC/2,

where θC is the time offset calculated by clie and δC is the round-trip delay between REF and the remote clock.
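The calculation can be sketched in Python using the standard NTP on-wire formulas from RFC 1305. The function names and the sample timestamps are hypothetical; t1 and t4 are read from the REF clock (query sent, reply received), while t2 and t3 are read from the remote clock (query received, reply sent).

```python
def offset_and_delay(t1, t2, t3, t4):
    """Return (offset, delay) of the remote clock relative to REF.

    The offset assumes equal one-way delays in both directions.
    """
    delay = (t4 - t1) - (t3 - t2)          # round-trip time without server processing
    offset = ((t2 - t1) + (t3 - t4)) / 2   # remote clock minus REF clock
    return offset, delay


def offset_bounds(offset, delay):
    """Interval that contains the real offset when REF uncertainty is negligible."""
    return offset - delay / 2, offset + delay / 2


# Timestamps in microseconds: the remote clock is 500 us ahead of REF,
# each direction takes 10 ms, and the server needs 100 us to reply.
theta_c, delta_c = offset_and_delay(0, 10_500, 10_600, 20_100)
low, high = offset_bounds(theta_c, delta_c)
```

With these sample values the estimate is 500 μs with a round-trip delay of 20 ms, so the real offset is only known to lie between -9.5 ms and +10.5 ms. This is why a short, symmetrical path between REF and the monitored site matters.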

3.3   SNMP client

NTP version 4 is going to support SNMP; unfortunately, this support has been neither finally standardized by the IETF nor implemented yet. In the future, when NTP version 4 is widely deployed, we expect to program an snmp agent, which will probably replace the ntpq agent.

4   Database

NTPMON uses two databases: MySQL and RRD (round-robin database).

4.1   SQL database

Agents store all collected data in the MySQL database. They also check specified parameters and compare them with either a threshold or the previous value. Whenever a limit is exceeded, the agent generates an event and stores it in the database.

We decided to avoid floating-point types and therefore restricted field types to CHAR (fixed-length text) and INT (integer value). We chose appropriate parameter units:
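As a purely illustrative sketch of integer-only storage, the following Python function scales a decimal value to a finer unit before it is stored in an INT field. The units used here (microseconds for an offset printed in milliseconds, parts per billion for a frequency printed in ppm) are assumptions made for the example and are not necessarily the units chosen for NTPMON; the function name to_scaled_int is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def to_scaled_int(text, scale):
    """Convert a decimal string to an integer in a finer unit.

    Decimal arithmetic is used so that values such as "0.123" are not
    distorted by binary floating-point representation.
    """
    scaled = Decimal(text) * scale
    return int(scaled.to_integral_value(rounding=ROUND_HALF_EVEN))


# Assumed units: ms -> us for the offset, ppm -> ppb for the frequency.
offset_us = to_scaled_int("0.123", 1000)
frequency_ppb = to_scaled_int("-12.345", 1000)
```

Choosing the unit fine enough for the smallest value of interest keeps the precision while allowing exact integer comparisons against thresholds.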

4.2   RRD database

NTPMON displays several types of parameters in graphs. All such parameters are stored in the RRD, as it implements two useful features: graph plotting and aggregation of old data that corresponds to the intervals displayed by the daily, weekly and monthly graphs. The database contains individual values and the average, minimum and maximum for every 10 minutes, 1 hour and 6 hours.
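The aggregation into average, minimum and maximum can be illustrated with a short Python sketch. It mimics the idea in plain Python and does not use or reproduce the RRD implementation; the function name consolidate and the sample values are hypothetical.

```python
from collections import defaultdict


def consolidate(samples, step):
    """Group (timestamp, value) samples into buckets of `step` seconds.

    Returns {bucket_start: (average, minimum, maximum)}.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        buckets[timestamp - timestamp % step].append(value)
    return {
        start: (sum(values) / len(values), min(values), max(values))
        for start, values in sorted(buckets.items())
    }


# Offsets in microseconds sampled every 5 minutes, aggregated per 10 minutes.
samples = [(0, 10), (300, 20), (600, 30), (900, 50)]
ten_minutes = consolidate(samples, 600)
```

Calling the same function with a step of 3600 or 21600 seconds gives the hourly and six-hourly aggregates that back the weekly and monthly graphs.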

Each monitored site has its own RRD database, which is split into two parts in order to avoid interaction between the agents: time offset, frequency offset and dispersion are collected and stored by the ntpq agent, while measured time offset and measured delay are collected and stored by the clie agent.

5   Events

Agents check the values of collected parameters in real time and generate static events (i.e., the value exceeds a threshold) or dynamic events (i.e., the value changes too rapidly). The set of events and thresholds has been selected according to our long-term experience with NTP, and we expect to keep updating the heuristic algorithms that generate events. Currently, we recognize the following 11 types of events, which belong to 3 groups:

  1. availability

    • no system response - the observed system did not reply within one minute,

    • no NTP service - the observed system did not answer with a valid NTP message,

    • system restart,

  2. qualitative parameters change

    • OS version - the OS has been changed recently,

    • NTP version - NTP has been changed recently,

    • stratum - the stratum level has been changed,

    • REFID - the ID of the reference NTP server has been changed,

    • PPS signal.

  3. threshold exceeded

    • offset - the measured offset exceeded 50 μs (Stratum-1 server) or 1 ms (Stratum-2 and higher),

    • delay - measured delay (round-trip time) between monitored site and reference site exceeded 20 ms,

    • frequency stability.
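The three kinds of checks can be sketched in Python as follows. The offset and delay thresholds are the ones listed above; the function names, the event labels and the max_step parameter of the dynamic check are hypothetical and do not describe the heuristics actually used by the agents.

```python
OFFSET_LIMIT_STRATUM1_US = 50      # 50 us for Stratum-1 servers
OFFSET_LIMIT_OTHER_US = 1000       # 1 ms for Stratum-2 and higher
DELAY_LIMIT_US = 20000             # 20 ms round-trip time


def static_events(stratum, offset_us, delay_us):
    """Static events: a value exceeds its fixed threshold."""
    limit = OFFSET_LIMIT_STRATUM1_US if stratum == 1 else OFFSET_LIMIT_OTHER_US
    events = []
    if abs(offset_us) > limit:
        events.append("offset")
    if delay_us > DELAY_LIMIT_US:
        events.append("delay")
    return events


def dynamic_event(previous, current, max_step):
    """Dynamic event: the value changed by more than max_step (assumed limit)."""
    return abs(current - previous) > max_step


def changed_fields(previous, current, names=("stratum", "refid")):
    """Qualitative events: names of parameters whose value differs."""
    return [name for name in names if previous.get(name) != current.get(name)]
```

For example, static_events(1, 60, 5000) reports an offset event for a Stratum-1 server, while the same 60 μs offset on a Stratum-2 server raises no event.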

NTPMON aggregates events in order to reduce the number of past, less important events. Aggregation is done in two steps, every week and every month. It includes the deletion of warnings and the assignment of coarser time intervals to events.

6   Alerts

Some events are so serious that it is necessary to inform the system administrator. Currently, NTPMON generates an alert and sends an e-mail when

We designed the alert rules with the intention of avoiding alerts in some non-standard but frequent situations, including temporary loss of connectivity, a restart of the monitored system, and long-term poor performance of the NTP system. Our goal is not to disturb the system administrator too often; therefore, an alert signals only a system condition that requires intervention.

7   Graphs

NTPMON generates graphs of the following parameters for intervals of 6 hours, one day, one week and one month:

All graphs can be plotted with one of two Y-axis ranges: predefined or dynamically adjusted. The predefined range is suitable for a quick comparison of several graphs, but it does not show values exceeding the limit. The dynamic range shows all values in the observed time interval.

When the user clicks on any graph, a detailed graph twice as large is displayed, with a dynamically adjusted Y-axis range.

8   Implementation

The NTPMON front-end has been programmed in PHP v.4 and both agents have been written in C. The application also includes several PHP and bash scripts.

NTPMON is split across two computers. The clie agent runs on the 'reference NTP system', a dedicated NTP server with a stable and accurate system clock. The computer is equipped with an oven-controlled oscillator, and the system clock is synchronized by the 1 PPS signal from a rubidium clock. All other parts of NTPMON, including the front-end and the database, are installed and operated on a standard Linux server.

Using NTPMON is simple and intuitive. The user has to select program parameters in several sections:

The user finishes the selection by clicking on the Go button, and all graphs and tables are displayed immediately. When the user clicks on any graph, a more detailed graph twice as large is plotted.

9   Conclusion

NTPMON currently monitors 12 sites running NTP, including our NTP servers, all PerfMON sites (PerfMON is the CESNET network monitoring system) and several test computers. We plan to add several new features in the next version, for instance sending alerts by SMS, profiles specifying subsets of investigated sites, and access to archived graphs. NTPMON is available from its home page.

10   Appendix A. Screen snapshots

Figure 1: Input screen

Figure 2: Detailed graph

Figure 3: Status table

Figure 4: Plotted graphs

11   Appendix B. SQL database structure

The database consists of four main tables:

The following list of fields is not complete; it shows and explains only selected items:

host

sample

meas

event

References

[Mil92] Mills D.L.: Network Time Protocol (Version 3) Specification, Implementation and Analysis, RFC 1305, IETF, 1992.