NTP Monitoring System
CESNET
technical report number 23/2007
Vladimír Smotlacha
6.12.2007
1 Abstract
The NTPMON system provides centralized monitoring of NTP processes. Its main features include storing collected data in a database, graphical visualisation of observed parameters, generation of events associated with unexpected parameter changes or out-of-range values, and sending e-mail warnings whenever a serious NTP malfunction is detected. NTPMON is used at CESNET to verify the performance of NTP servers and to check time accuracy at all passive and active network monitoring sites.
Keywords: NTP, monitoring
2 Introduction
All passive or active monitoring systems require an accurate local clock. Knowledge of the exact system time is essential mainly for timestamping captured packets and for time-sensitive applications, for example one-way delay measurement. The expected absolute accuracy (the difference between local time and UTC) ranges from $10^{-3}$ s to $10^{-5}$ s.
The most common clock synchronization method in a networking environment is NTP [Mil92], optionally with a GPS receiver as an external time source. The NTP process should be monitored; otherwise we have no evidence that any particular time-dependent measurement is correct.
Universal tools for monitoring network services exist (e.g., Nagios), but they mainly test the availability of the NTP service and do not deal with its performance. Another tool is NTP Time Server Monitor, but it is designed mainly for local NTP monitoring. We looked for a centralized system that can monitor many external NTP sites, and we decided to develop such a system and integrate it into our network monitoring infrastructure.
This document describes NTPMON, a centralized NTP monitoring system that checks parameters of NTP processes running on remote workstations, collects data and plots graphs. When any parameter rises above or falls below its specified threshold, the system generates an event and, in some cases, also an alert that is sent via e-mail to the administrator. NTPMON can monitor even sites that are administered by another authority, as it does not need any nonstandard cooperation from the monitored site.
3 Data collection
No single method collects all important parameters of the NTP process. Some data are logged by the NTP process; others are available via system functions (e.g., adjtimex(), ntp_adjtime()) and tools (e.g., ntpq). As our goal was to implement a centralized system without any new software running on the monitored site, we decided to avoid log mining and all locally running tools.
We designed and programmed agents for parameter collection. Each monitoring agent runs and saves data independently.
3.1 NTP process polling
The ntpq agent periodically queries the status of the remote NTP process with the command ntpq -c rl. The parameters are parsed and inserted into the database.
The status of each NTP process is described by a set of qualitative and quantitative parameters. NTPMON displays a selected subset of these parameters:
stratum - a "synchronization distance" to the primary NTP server. Primary NTP server (i.e., NTP server with an external clock) has stratum 1,
time offset - difference between the NTP server time and the local time,
frequency offset - correction factor of the local clock frequency. It is expressed as a relative unitless value in ppm (parts per million). What matters is not the frequency offset value itself but rather its variation due to changes of the oscillator frequency,
root dispersion - maximal difference between local time and the root (primary) NTP server time. Its calculation is based on the assumption of the worst possible oscillator instability.
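The polling and parsing step can be sketched as follows. This is a minimal illustration in Python, not the NTPMON agent itself (which is written in C). The function names, the sample text and the host address are assumptions made for the example, and the exact variable names printed by ntpq (for instance the name of the dispersion variable) differ between ntpd releases.

```python
import re
import subprocess

# A name=value pair; the value is either a quoted string or a run of
# characters up to the next comma or whitespace.
_PAIR = re.compile(r'(\w+)=("[^"]*"|[^,\s]+)')


def parse_rl(text):
    """Return a dict of the name=value pairs found in `ntpq -c rl` output."""
    return {name: value.strip('"') for name, value in _PAIR.findall(text)}


def poll(host):
    """Run `ntpq -c rl <host>` and return the parsed variables (hypothetical helper)."""
    result = subprocess.run(["ntpq", "-c", "rl", host],
                            capture_output=True, text=True,
                            timeout=60, check=True)
    return parse_rl(result.stdout)


# Illustrative text only; real output depends on the ntpd version.
SAMPLE = '''associd=0 status=0615 leap_none, sync_ntp, 1 event, clock_sync,
version="ntpd 4.2.8p15", system="Linux/5.10", leap=00, stratum=2,
rootdelay=1.250, rootdisp=20.300, refid=192.0.2.1,
offset=0.012, frequency=-12.345, sys_jitter=0.004'''

params = parse_rl(SAMPLE)
stratum = int(params["stratum"])
offset_ms = float(params["offset"])       # ntpq reports the offset in milliseconds
freq_ppm = float(params["frequency"])     # and the frequency offset in ppm
rootdisp_ms = float(params["rootdisp"])   # root dispersion in milliseconds
```

Status words without an equals sign (such as sync_ntp) are ignored by this parser, and only the first token of multi-word values such as reftime is kept.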
3.2 NTP client
In principle, the clie agent compares local time with the time of the observed system; therefore it has to run on a computer with a very accurate and stable clock, which we treat as the reference clock REF.
The clie agent behaves like an NTP client that sends an NTP query to the monitored (remote) NTP process. From the response, the agent calculates and stores in the database:
measured time offset ($\theta_C$) - time difference between the REF time and the monitored computer time,
measured delay ($\delta_C$) - round-trip propagation delay of the NTP query and response.
As a side effect, the agent also checks that the remote NTP service is available.
The value of $\theta_C$ is evaluated on the assumption of a symmetrical one-way delay between the server and the client; therefore it is only an estimate of the real offset $\theta$. Let us assume that the REF clock uncertainty is negligible. The relation between the real offset $\theta$ of the remote NTP clock and the measured value $\theta_C$ is given by the formula
$$\theta_C - \delta_C/2 \le \theta \le \theta_C + \delta_C/2,$$
where $\theta_C$ is the time offset calculated by clie and $\delta_C$ is the delay between REF and the remote clock.
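As an illustration, the following Python sketch applies the standard NTP on-wire calculation to the four timestamps of one query and response and then derives the bounds given above. Whether clie uses exactly this calculation and sign convention is an assumption; the function names and the example timestamps are hypothetical.

```python
def offset_and_delay(t1, t2, t3, t4):
    """Standard NTP on-wire calculation for one query/response exchange.

    t1: query sent      (client clock, here REF)
    t2: query received  (server clock, here the monitored host)
    t3: response sent   (server clock)
    t4: response received (client clock)
    Returns (offset, delay); offset is server clock minus client clock.
    """
    offset = ((t2 - t1) + (t3 - t4)) / 2
    delay = (t4 - t1) - (t3 - t2)
    return offset, delay


def offset_bounds(offset, delay):
    """Interval that contains the real offset, whatever the path asymmetry."""
    return offset - delay / 2, offset + delay / 2


# Example in microseconds: the remote clock is 5000 us ahead of REF,
# each direction takes 10000 us and the server needs 1000 us to respond.
theta_c, delta_c = offset_and_delay(0, 15_000, 16_000, 21_000)
low, high = offset_bounds(theta_c, delta_c)
```

In this symmetric example the measured offset equals the real one; with an asymmetric path the estimate moves away from the real offset but stays within the computed bounds.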
3.3 SNMP client
NTP version 4 is going to support SNMP; unfortunately, this support has been neither finally standardized by the IETF nor implemented yet. In the future, when NTP version 4 is widely deployed, we intend to program an snmp agent, which will probably replace the ntpq agent.
4 Database
NTPMON uses two databases: MySQL and RRD (round-robin database).
4.1 SQL database
Agents store all collected data into the MySQL database and they also check specified parameters and compare them with either the threshold or the previous value. Whenever a limit is exceeded, the agent generates an event and stores it into the database.
We decided to avoid floating-point types; therefore we restricted field types to CHAR (fixed-length text) and INT (integer value). We chose appropriate parameter units:
timestamps - all timestamps have a resolution of 1 s and are expressed as an unsigned integer value, the number of UTC seconds since 0:00:00 1.1.1970,
time offset, delay, dispersion - expressed in microseconds as a signed integer value,
frequency offset - expressed in ppb (parts per billion, i.e., $10^{-9}$ or ns/s) as a signed integer value.
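The conversion to integer units can be sketched in Python as follows. The helper names are hypothetical, and the input units (milliseconds and ppm, as reported by ntpq) are an assumption about what the agent receives.

```python
import time


def ms_to_us(value_ms):
    """Convert milliseconds to a signed integer number of microseconds."""
    return round(value_ms * 1000)


def ppm_to_ppb(value_ppm):
    """Convert parts per million to a signed integer number of parts per billion."""
    return round(value_ppm * 1000)


def now_utc_seconds():
    """Current time as whole UTC seconds since 1.1.1970."""
    return int(time.time())


offset_us = ms_to_us(0.012)     # 0.012 ms
disper_us = ms_to_us(20.3)      # 20.3 ms
freq_ppb = ppm_to_ppb(-12.345)  # -12.345 ppm
```

Rounding rather than truncating avoids a systematic bias of up to one unit when the floating-point input is slightly below the intended value.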
4.2 RRD database
NTPMON displays several types of parameters in graphs. All such parameters are stored in the RRD, as it implements two useful features: graph plotting and aggregation of old data matching the intervals displayed in daily, weekly and monthly graphs. The database contains individual values as well as the average, minimum and maximum for every 10 minutes, 1 hour and 6 hours.
Each monitored site has its own RRD database, which is split into two parts in order to avoid interaction between agents: time offset, frequency offset and dispersion are collected and stored by the ntpq agent, while measured time offset and measured delay are collected and stored by the clie agent.
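To make the aggregation concrete, the sketch below computes the average, minimum and maximum per fixed-width time bucket in plain Python. It only illustrates what such a consolidation produces; it does not call RRD, and the function name and sample data are hypothetical.

```python
from collections import defaultdict


def consolidate(samples, bucket_seconds):
    """Group (timestamp, value) samples into fixed-width buckets.

    Returns {bucket_start: (average, minimum, maximum)}.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % bucket_seconds].append(value)
    return {start: (sum(vals) / len(vals), min(vals), max(vals))
            for start, vals in sorted(buckets.items())}


# Offsets in microseconds, one sample every 5 minutes.
samples = [(0, 10), (300, 20), (600, 40), (900, 60)]
ten_minutes = consolidate(samples, 600)
one_hour = consolidate(samples, 3600)
```

The same function covers the three intervals named in the text with bucket widths of 600, 3600 and 21600 seconds.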
5 Events
Agents check the values of collected parameters in real time and generate static events (i.e., the value exceeds a threshold) or dynamic events (i.e., the value changes too rapidly). The set of events and thresholds has been selected according to our long-term experience with NTP, and we intend to keep updating the heuristic algorithms that generate events. Currently, we recognize the following 11 types of events, which belong to 3 groups:
availability
no system response - the observed system did not reply within one minute,
no NTP service - the observed system did not answer with a valid NTP message,
system restart,
qualitative parameters change
OS version - the OS has been changed recently,
NTP version - the NTP version has been changed recently,
stratum - the stratum level has changed,
REFID - the ID of the reference NTP server has changed,
PPS signal.
threshold exceeded
offset - measured offset exceeded 50 μs (Stratum-1 server) or 1 ms (Stratum-2 and above),
delay - measured delay (round-trip time) between monitored site and reference site exceeded 20 ms,
frequency stability.
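A threshold check and a qualitative-change check of the kind listed above might look like the following Python sketch. The thresholds are those stated in the list; the function names, the event labels and the use of the absolute value of the signed offset are assumptions made for illustration.

```python
OFFSET_LIMIT_STRATUM1_US = 50    # 50 us for Stratum-1 servers
OFFSET_LIMIT_OTHER_US = 1_000    # 1 ms for Stratum-2 and above
DELAY_LIMIT_US = 20_000          # 20 ms round-trip time


def threshold_events(stratum, offset_us, delay_us):
    """Return the threshold events raised by one measurement."""
    events = []
    limit = OFFSET_LIMIT_STRATUM1_US if stratum == 1 else OFFSET_LIMIT_OTHER_US
    # Assumption: the signed offset is compared by magnitude.
    if abs(offset_us) > limit:
        events.append("offset")
    if delay_us > DELAY_LIMIT_US:
        events.append("delay")
    return events


def change_events(previous, current, keys=("stratum", "refid")):
    """Return the qualitative parameters that differ from the previous sample."""
    return [key for key in keys if previous.get(key) != current.get(key)]


raised = threshold_events(stratum=1, offset_us=-60, delay_us=1_000)
changed = change_events({"stratum": 2, "refid": "192.0.2.1"},
                        {"stratum": 3, "refid": "192.0.2.1"})
```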
NTPMON implements an aggregation of events in order to reduce the number of past, less important events. Aggregation is done in two steps every week and month. The aggregation includes the deletion of warnings and the assignment of coarser time intervals to events.
6 Alerts
Some events are so serious that the system administrator must be informed. Currently, NTPMON generates an alert and sends an e-mail when
the NTP service is discontinued for more than 15 minutes after at least 2 hours of normal operation,
the time offset exceeds the threshold of 500 μs (Stratum-1) or 10 ms (Stratum-2 and above) at least twice during a 15-minute interval after 24 hours of normal operation.
We designed the alert rules to avoid alerts in some non-standard but frequent situations, including temporary loss of connectivity, restart of the monitored system, and long-term poor performance of the NTP system. Our goal is not to disturb the system administrator too often; therefore an alert signals only a system condition that requires intervention.
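The offset rule can be sketched in Python as shown below. The interpretation of "normal operation" (no threshold violation during the preceding 24 hours, with samples covering that whole period) is an assumption, as are the function name and the sample data; the actual NTPMON heuristics may differ.

```python
def offset_alert(samples, stratum, window=900, quiet=86_400):
    """Decide whether the offset alert rule fires.

    samples: list of (timestamp, offset_us), sorted by timestamp and
             assumed to be free of long gaps.
    window:  length of the interval in which two violations must fall (15 min).
    quiet:   required length of preceding normal operation (24 h).
    """
    if not samples:
        return False
    limit = 500 if stratum == 1 else 10_000
    start = samples[0][0]
    bad = [ts for ts, off in samples if abs(off) > limit]
    for i in range(len(bad) - 1):
        first, second = bad[i], bad[i + 1]
        # Normal operation: data reach back at least `quiet` seconds and
        # the previous violation, if any, is older than that.
        quiet_before = (first - start >= quiet
                        and (i == 0 or first - bad[i - 1] > quiet))
        if quiet_before and second - first <= window:
            return True
    return False


# 25 hours of good samples every 5 minutes, then two violations 5 minutes apart.
normal = [(t, 100) for t in range(0, 90_000, 300)]
spikes = [(90_000, 20_000), (90_300, 20_000)]
fires = offset_alert(normal + spikes, stratum=2)
```

A single violation, or violations that follow less than 24 hours of good data, do not trigger the alert, which matches the intention of suppressing alerts after restarts and during long periods of poor performance.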
7 Graphs
NTPMON generates graphs of the following parameters for intervals of 6 hours, one day, one week and one month:
time offset - time offset reported by the NTP process. Predefined range is from -50 μs to +50 μs.
frequency offset - correction factor of local clock frequency. Predefined range is from AVR - 1 ppm to AVR + 1 ppm where AVR is the average frequency offset.
root dispersion - maximal difference between local clock and the root (primary) NTP server. Predefined range is from 0 ms to +5 ms.
measured time offset - time offset measured by the clie agent. Predefined range is from -50 μs to +50 μs.
measured delay - round-trip time spent by NTP protocol packets between monitored and reference sites. Predefined range is from 0 ms to +5 ms.
All graphs can be plotted with one of two Y-axis ranges: predefined or dynamically adjusted. The predefined range is suitable for a quick comparison of several graphs, but it does not show values that exceed the limit. The dynamic range shows all values in the observed time interval.
When the user clicks on any graph, a detailed graph twice as large is displayed, with a dynamically adjusted Y-axis range.
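The choice between the two Y-axis ranges can be illustrated with a short Python sketch. The function names and the sample values are hypothetical; the predefined ranges are those listed for the individual graphs.

```python
def y_range(values, predefined=None):
    """Return the predefined range if given, else the range of the data."""
    if predefined is not None:
        return predefined
    return min(values), max(values)


def frequency_predefined(values_ppm):
    """Predefined range for frequency offset: AVR - 1 ppm to AVR + 1 ppm."""
    avr = sum(values_ppm) / len(values_ppm)
    return avr - 1.0, avr + 1.0


def hidden(values, y_min, y_max):
    """Values that fall outside the plotted range and are therefore not shown."""
    return [v for v in values if v < y_min or v > y_max]


offsets_us = [-12, 3, 75, 20]
fixed = y_range(offsets_us, predefined=(-50, 50))   # time offset graph
dynamic = y_range(offsets_us)
clipped = hidden(offsets_us, *fixed)
```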
8 Implementation
The NTPMON front-end has been programmed in PHP v.4 and both agents have been written in C. The application also includes several PHP and bash scripts.
NTPMON is split across two computers. The clie agent runs on the 'reference NTP system', a dedicated NTP server with a stable and accurate system clock. The computer is equipped with an oven-controlled oscillator and the system clock is synchronized by the 1pps signal from a rubidium clock. All other parts of NTPMON, including the front-end and the database, are installed and operated on a standard Linux server.
Using NTPMON is simple and intuitive. The user has to select program parameters in several sections:
list of sites,
type of time interval for displayed graphs and/or events: last 6 hours, last 24 hours, last 7 days, last 30 days, selected day, selected week or selected month,
beginning or end of the time interval - valid only when selected day / week / month is chosen,
displayed objects: status, graphs, events.
The user finishes the selection by clicking on the Go button, and all graphs and tables are immediately displayed. When the user clicks on any graph, a more detailed graph twice as large is plotted.
9 Conclusion
NTPMON currently monitors 12 sites running NTP, including our NTP servers, all PerfMON sites (PerfMON is the CESNET network monitoring system) and several test computers. We plan to add several new features in the next version, for instance sending alerts by SMS, profiles specifying subsets of monitored sites, and access to archived graphs. NTPMON is available from its home page.
10 Appendix A. Screen snapshots
Figure 1: Input screen
Figure 2: Detailed graph
Figure 3: Status table
Figure 4: Plotted graphs
11 Appendix B. SQL database structure
The database consists of four main tables:
host - description of monitored sites. The majority of fields are filled in by the system administrator; only the operating system version and the NTP process description are updated by the agent,
sample - table stores data collected by the ntpq agent,
meas - table stores data collected by the clie agent,
event - all agents check specified parameters and compare them with either the threshold or the previous value. Whenever a limit is exceeded, the agent generates an event and stores it into the table.
The following list of fields is not complete; it shows and explains only selected items:
host
- id - unique host system identification
- name - unique short host name (human readable)
- url - network address
- descr - long host name
- os - operating system type and version
- ver - NTP version
sample
- id_host - link to the host table
- time - sample timestamp
- stratum - NTP stratum
- refid - source of synchronization (NTP server, external clock)
- offset - time offset (declared by the system)
- freq - relative frequency offset
- disper - time dispersion (traced to stratum-1 server)
- reftime - last reference time
- stabil - frequency stability
- status - clock status
meas
- id_host - link to the host table
- time - sample timestamp
- stratum - NTP stratum
- refid - source of synchronization (NTP server, external clock)
- mea_offset - time offset (measured by the reference system)
- mea_delay - time delay (between local and reference clock)
event
- id_host - link to the host table
- time - timestamp of the event
- id_ev - type of the event
- id_var - variable (parameter) associated with the event
- par - value of the variable
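For illustration, part of this structure can be expressed as the following Python sketch. It uses an in-memory SQLite database as a stand-in for MySQL, restricts itself to the host and sample tables, and assumes the column widths and key constraints, none of which are given in this report.

```python
import sqlite3

# Only CHAR and integer columns, following the stated design decision.
# Column widths and constraints are assumptions for this example.
SCHEMA = """
CREATE TABLE host (
    id    INTEGER PRIMARY KEY,
    name  CHAR(16) UNIQUE,
    url   CHAR(64),
    descr CHAR(64),
    os    CHAR(32),
    ver   CHAR(32)
);
CREATE TABLE sample (
    id_host  INTEGER REFERENCES host(id),
    "time"   INTEGER,   -- UTC seconds since 1.1.1970
    stratum  INTEGER,
    refid    CHAR(16),
    "offset" INTEGER,   -- microseconds
    freq     INTEGER,   -- ppb
    disper   INTEGER    -- microseconds
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO host (id, name) VALUES (?, ?)", (1, "ntp1"))
conn.execute("INSERT INTO sample VALUES (?, ?, ?, ?, ?, ?, ?)",
             (1, 1196942400, 2, "192.0.2.1", 12, -12345, 20300))
row = conn.execute('SELECT "offset", freq, disper FROM sample WHERE id_host = ?',
                   (1,)).fetchone()
```

The column names time and offset are quoted here because they can clash with SQL keywords in some database engines.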
References
[Mil92] Mills D.L.: Network Time Protocol (Version 3) Specification, Implementation and Analysis, RFC 1305, IETF, 1992.