Monitoring of the eduroam.cz
CESNET
technical report number 28/2006
also available in PDF,
PostScript, and
XML formats.
Jan Tomášek
23. 11. 2006
1 Abstract
eduroam is an IP roaming infrastructure built to allow users of one organisation to connect to a WiFi network at another organisation. Authentication is always done at the home organisation of the user trying to access the network. This process requires the cooperation of many systems maintained by different people. The monitoring described here provides end-to-end tests of every possible combination of roaming situations to ensure that the system works for everyone.
2 Introduction & Motivation
The infrastructure of eduroam is built as a hierarchy of RADIUS servers. The servers do not communicate directly with each other. The server of each institution processes only requests that contain one of its home realms. Any other request is forwarded to an upper-level server, which works as an intermediary. It forwards the authentication request to the server appropriate for the realm, or to a top-level server when the access request does not belong to .cz.
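The forwarding rule described above can be sketched in a few lines of Python. This is only an illustration of the decision logic, not the configuration of any real RADIUS server: the function names, the return labels, the example host name, and the handling of an unknown .cz realm are all assumptions.

```python
def realm_of(username):
    """Return the lower-cased realm part of user@realm, or '' if there is none."""
    _, separator, realm = username.rpartition("@")
    return realm.lower() if separator else ""


def route_at_organisation(username, home_realms):
    """An organisational server handles its home realms and forwards the rest."""
    if realm_of(username) in home_realms:
        return "local"
    return "national-proxy"


def route_at_national_proxy(username, realm_to_server):
    """The upper-level server forwards by realm, or hands over to the top level."""
    realm = realm_of(username)
    if not realm.endswith(".cz"):
        return "top-level"
    # Assumption: a .cz realm that is not connected is rejected.
    return realm_to_server.get(realm, "reject")


# Hypothetical host name, used for illustration only.
servers = {"vse.cz": "radius.vse.example"}
print(route_at_organisation("visitor@vse.cz", {"osu.cz"}))
print(route_at_national_proxy("visitor@vse.cz", servers))
```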
During the implementation of eduroam for the first few organisations we had to resolve many problems where users of one organisation were able to roam while users of another organisation were not. This was mainly caused by misconfiguration and faulty cooperation between different RADIUS servers and clients. That phase of implementing eduroam taught us that we need real end-to-end monitoring which makes sure that authentication requests work for every possible combination of home and visited organisations.
3 System architecture
Having a WiFi-enabled testing computer simulating real visitors at every site would help the monitoring, but such an installation would hardly be maintainable. Our approach is much simpler to maintain but still closely simulates the real traffic.
The main component of the system is the Driver. It is responsible for scheduling tests on all RADIUS servers configured in the Config database. The actual tests are conducted by Probes, which simulate authentication requests on individual RADIUS servers. The type of Probe is selected according to the configuration, since home institutions may differ in their preferred EAP method. In theory it is possible to run the probes on different hosts, but we have decided to maintain just one installation of them.
All data collected by the Driver are stored in the DB (database) together with the configuration. The DB is used as a source for different presentation views: the View Map showing a weather map of eduroam, the Report View showing the reliability of eduroam in the past, the User UI, a simplified user interface that helps common eduroam users locate a well-working eduroam site nearby, and the Admin UI for detailed reports about all system components and for configuration maintenance.
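As a rough illustration of the Driver's role, the sketch below expands a configuration into one planned test per combination of visited server and test account, choosing the Probe by the EAP method of the account's home realm. The class names, probe names, and host names are hypothetical and are not taken from the actual system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RealmAccount:
    realm: str
    eap_method: str  # for example "PEAP-MSCHAPv2" or "EAP-TLS"


@dataclass(frozen=True)
class PlannedTest:
    server: str
    realm: str
    probe: str


# Hypothetical probe names, one per supported EAP method.
PROBE_FOR_METHOD = {"PEAP-MSCHAPv2": "probe_peap", "EAP-TLS": "probe_tls"}


def plan_tests(servers, accounts):
    """Return one test for every (visited server, test account) combination."""
    return [
        PlannedTest(server, account.realm, PROBE_FOR_METHOD[account.eap_method])
        for server in servers
        for account in accounts
    ]


accounts = [RealmAccount("osu.cz", "PEAP-MSCHAPv2"), RealmAccount("vse.cz", "EAP-TLS")]
for test in plan_tests(["radius.osu.example", "radius.vse.example"], accounts):
    print(test)
```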
4 Current implementation
- Probes
- Probes are implemented as a wrapper around eapol_test from the wpa_supplicant package. The wrapper was written by Pavel Poláček as a shell script and is publicly available online. At the moment only methods for testing PEAP-MSCHAPv2 and EAP-TLS are implemented.
- Driver
- The Driver is implemented with Nagios 2.2, a well-known tool for monitoring network infrastructure. It supports running probes locally from the computer hosting Nagios or remotely via nrpe (Nagios Remote Plugin Executor).
- Data storage
- Nagios 2.0 introduced its Event Broker interface, which can be used
for writing plug-ins (in the form of a dynamically linked library). Those
plug-ins get information about any event in Nagios.
ndoutils is such a plug-in for the Event Broker, designed to collect all Nagios events and store them in a MySQL database. I had some problems when trying to run ndoutils. The MySQL tables are not optimised for performance and indices are completely missing. Even after adding indices and deploying scripts for removing old history data, the module was still causing monitoring failures.
Currently the Nagios internal data file is used as the data storage. I am considering developing an eduroam.cz-specific database structure and using Event Handler scripts to record data in the future.
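To make the Probe item above more concrete, here is a minimal Python sketch of what a wrapper around eapol_test has to do: build the command line and translate the result into a Nagios plug-in status. It is not the shell script mentioned above. The options -c, -a, -p, -s and -t are those of eapol_test; the outcome labels, the default values, and the way a real wrapper tells a reject from a timeout are assumptions and are left out.

```python
# Nagios plug-in exit codes. The mapping of outcomes to codes follows the
# status codes shown in the availability matrix of this report.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def build_command(config_path, server_ip, secret, port=1812, timeout_s=10):
    """Build an eapol_test command line for one authentication attempt."""
    return [
        "eapol_test",
        "-c", config_path,     # wpa_supplicant-style config with the EAP settings
        "-a", server_ip,       # RADIUS server to test
        "-p", str(port),       # RADIUS authentication port
        "-s", secret,          # shared secret of the monitoring client
        "-t", str(timeout_s),  # give up after this many seconds
    ]


def nagios_status(outcome):
    """Translate a probe outcome label (an assumption) into a Nagios status."""
    return {"accept": OK, "reject": WARNING, "timeout": CRITICAL}.get(outcome, UNKNOWN)


print(build_command("peap-osu.conf", "192.0.2.10", "not-a-real-secret"))
print(nagios_status("timeout"))
```

Keeping command construction and status mapping as separate pure functions makes both easy to check without a RADIUS server at hand.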
5 Configuration of system
The configuration of the Czech NREN servers and of the monitoring system itself is stored in CESNET CAAS (an LDAP directory). It stores all information about RADIUS servers, connected organisations, realms, and the administrators responsible for the realms. CAAS offers a good user interface for NREN and organisational administrators, and the LDAP directory itself provides very good control over access permissions to configuration entries.
The configuration files for Nagios, as well as for the rest of the eduroam.cz-related services, are built by Perl scripts. The script ncp.pl, used for building the Nagios configuration files, maintains dependencies between monitored services. This dependency logic significantly reduces the number of tests and false error reports when some parts of eduroam.cz are down.
5.1 Monitored services and their dependencies
Nagios by default distinguishes between a host and a service running on the host. If a host is down, none of its services are tested. However, the dependencies between services in eduroam.cz are a bit more complicated. Services usually depend on one or more other services running on other hosts. For that reason, I do not use the default Nagios dependencies at all. Hosts are by default always considered to be up and the dependencies are set up only between services.
- PING
- The PING service is monitored on every host in the system. Tests of all
other services are made dependent on it, some directly, the others
indirectly through other services.
The PING service on eduroam.cz-connected RADIUS servers depends on PING at gw, the gateway through which the monitoring is connected to the Internet. The purpose of this rule is to protect administrators of monitored RADIUS servers from false alarm messages when the monitoring system is disconnected from the network.
- RACOON
- The RACOON (IPsec) service monitors whether the racoon daemon on radius1.eduroam.cz is healthy and working.
- IPSEC-radius1.eduroam.cz
- IPSEC-radius1.eduroam.cz service monitors PINGs from
radius1.eduroam.cz to a particular RADIUS server. The
test is executed from radius1.eduroam.cz using
nrpe. The purpose of this test is to verify that
IPsec on the RADIUS server is working.
This service depends on PING on the respective host, on PING at radius1.eduroam.cz, and on RACOON on radius1.eduroam.cz. Notification of a problem is sent to the administrators of the respective server.
- Home realm
- Each monitored RADIUS server has at least one realm marked as the
home one. The home realms are being tested much more frequently than others to
know if the local RADIUS server is working correctly.
This test depends on PING on the same host. Notification of a problem is sent to the administrators of the respective server.
- Other realms
- Finally, the non-local realms are tested. This test simulates visits of
visitors from other realms at the monitored RADIUS server (organisation).
This test depends on PING, the home realm, and IPSEC-radius1.eduroam.cz on the tested RADIUS server, and on the tested realm on its home RADIUS server. No notifications are sent for this test; problems are just recorded for further processing.
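The dependency rules listed above can be written down as data and rendered into Nagios servicedependency objects. The sketch below is in Python rather than Perl and is not ncp.pl itself. Using the realm name as the Nagios service name, the example host names, and the chosen execution_failure_criteria are assumptions.

```python
GATEWAY = "gw"
NREN_PROXY = "radius1.eduroam.cz"
IPSEC = "IPSEC-" + NREN_PROXY


def service_dependencies(server, home_realm, realm_home_server):
    """Map each (host, service) on `server` to the services it depends on.

    realm_home_server maps every tested realm to the host of its home
    RADIUS server.
    """
    deps = {
        (server, "PING"): [(GATEWAY, "PING")],
        (server, IPSEC): [
            (server, "PING"),
            (NREN_PROXY, "PING"),
            (NREN_PROXY, "RACOON"),
        ],
        (server, home_realm): [(server, "PING")],
    }
    for realm, home_server in realm_home_server.items():
        if realm == home_realm:
            continue
        deps[(server, realm)] = [
            (server, "PING"),
            (server, home_realm),
            (server, IPSEC),
            (home_server, realm),
        ]
    return deps


def render_nagios(deps):
    """Render the mapping as Nagios servicedependency object definitions."""
    blocks = []
    for (dep_host, dep_service), masters in deps.items():
        for host, service in masters:
            blocks.append(
                "define servicedependency {\n"
                f"    host_name                      {host}\n"
                f"    service_description            {service}\n"
                f"    dependent_host_name            {dep_host}\n"
                f"    dependent_service_description  {dep_service}\n"
                "    execution_failure_criteria     w,u,c\n"  # assumption
                "}\n"
            )
    return "\n".join(blocks)


realms = {"osu.cz": "radius.osu.example", "vse.cz": "radius.vse.example"}
print(render_nagios(service_dependencies("radius.osu.example", "osu.cz", realms)))
```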
5.2 Aggregated servers
Some organisations run more than one local RADIUS server. To simplify the dependency system, I have defined aggregated (virtual) servers on which a service is considered OK if it is OK on at least one of the original servers.
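The aggregation rule amounts to a logical OR over the real servers. A minimal sketch, assuming that results are available as a mapping from server to per-service Nagios status (the data layout and the host names are hypothetical):

```python
OK = 0


def aggregate(per_server):
    """Return {service: True/False}; True if the service is OK on any server."""
    services = set()
    for statuses in per_server.values():
        services.update(statuses)
    return {
        service: any(statuses.get(service) == OK for statuses in per_server.values())
        for service in sorted(services)
    }


results = {
    "radius1.org.example": {"PING": 0, "osu.cz": 2},
    "radius2.org.example": {"PING": 2, "osu.cz": 2},
}
print(aggregate(results))
```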
6 Load caused by monitoring
One of the interesting questions on end-to-end monitoring is: How will the number of performed tests grow with the number of connected organisations?
Obviously, the number of EAP sessions going through the national proxy server grows with the square of the number of connected organisations.
TA ... number of test accounts = number of connected organisations
RS ... number of RADIUS servers; for the calculation I assume RS = 1.5 * TA, which means that every second organisation has two RADIUS servers
Ng ... frequency of testing with a guest account (tests per hour); for the calculation I assume Ng = 2, that means that a test is done every 30 minutes
Nl ... frequency of testing with a local account (tests per hour); for the calculation I assume Nl = 12, that means that a test is done every 5 minutes
Qm = Ng*(TA-1) + Nl ... number of tests sent directly by the monitoring to an organisational RADIUS server per hour
Qt = Ng*(TA-1) + Nl + Ng*(RS-1) ... number of tests which have to be processed by an organisational RADIUS server per hour. Nl + Ng*(RS-1) is the number of tests processed locally; the rest is forwarded to the NREN-level server.
Qt = Ng*(TA+RS-2) + Nl ... simplified version of the above equation
Qnren = RS*Ng*(TA-1) ... number of tests passing through the NREN server per hour
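As a worked check of the quadratic growth, the assumed values RS = 1.5 TA, Ng = 2 and Nl = 12 can be substituted into the formulas above. This is only a restatement under those assumptions:

```latex
\begin{align*}
Q_t &= N_g\,(TA + RS - 2) + N_l = 2\,(2.5\,TA - 2) + 12 = 5\,TA + 8,\\
Q_{nren} &= RS \cdot N_g\,(TA - 1) = 1.5\,TA \cdot 2\,(TA - 1) = 3\,TA\,(TA - 1).
\end{align*}
```

For TA = 20 this gives Qt = 108 and Qnren = 1140 tests per hour. The load on an organisational server therefore grows linearly with TA, while the load on the NREN server grows quadratically.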
| | TA=20, RS=30 [t/h] | TA=20, RS=30 [t/sec] | TA=50, RS=75 [t/h] | TA=50, RS=75 [t/sec] | TA=500, RS=750 [t/h] | TA=500, RS=750 [t/sec] | TA=2000, RS=3000 [t/h] | TA=2000, RS=3000 [t/sec] |
|---|---|---|---|---|---|---|---|---|
| org. server | 108 | 0.03 | 258 | 0.07 | 2508 | 0.70 | 10008 | 2.78 |
| NREN servers | 1140 | 0.32 | 7350 | 2.04 | 748500 | 207.92 | 11994000 | 3331.67 |
Table 1: Number of tests processed by organisational and NREN RADIUS servers
The table shows the number of tests processed by organisational and NREN RADIUS servers. When reading the table, please note that one test consists of approximately 10 RADIUS packets.
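The figures can be recomputed with a short script. The sketch below implements the formulas for Qt and Qnren with the assumed frequencies Ng = 2 and Nl = 12; the function names are hypothetical.

```python
def organisational_load(ta, rs, ng=2, nl=12):
    """Qt: tests per hour processed by one organisational RADIUS server."""
    return ng * (ta + rs - 2) + nl


def nren_load(ta, rs, ng=2):
    """Qnren: tests per hour passing through the NREN server."""
    return rs * ng * (ta - 1)


for ta, rs in [(20, 30), (50, 75), (500, 750), (2000, 3000)]:
    q_t = organisational_load(ta, rs)
    q_nren = nren_load(ta, rs)
    print(f"TA={ta:5d} RS={rs:5d}  org: {q_t:6d} t/h ({q_t / 3600:.2f} t/sec)"
          f"  NREN: {q_nren:9d} t/h ({q_nren / 3600:.2f} t/sec)")
```

Multiplying the results by roughly 10 gives the approximate number of RADIUS packets involved.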
Conclusions:
- the monitoring system does not put any serious load on organisational RADIUS servers
- due to the eduroam structure, this type of monitoring cannot be used when the number of organisations in an NREN is higher than approximately 500. In that case, methods for traffic load balancing and for reducing the frequency of roaming tests must be investigated.
7 User interface
Nagios provides its own user interface for administrators (examples are on the Nagios site). Because our monitoring produces a huge amount of data, the native Nagios interface is not always easy to use.
A better overview of the whole of eduroam.cz is offered by the eduroam availability matrix. The matrix displays the health status of the whole eduroam.cz infrastructure. Three examples of its output are presented below.
7.0.1 Two institutions are down
The image shows the eduroam.cz availability matrix output when two sites (osu.cz and vse.cz) are down. They are marked red (status 2 means timeout) and the rest is green (status 0 - OK).
A red row means that the site (for example osu.cz) is not able to host anyone. A red column shows that the users from the respective realm (e.g. osu.cz) are not able to get online anywhere.
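The reading of rows and columns can be expressed as a small check over the matrix data. In this sketch the matrix is assumed to be a nested mapping from visited site to home realm to status code, which is an assumption about the data layout; the sample values are made up for illustration.

```python
OK = 0


def fully_failing(matrix):
    """Return (sites unable to host anyone, realms unable to roam anywhere).

    matrix[visited_site][home_realm] holds the status code of one test.
    """
    sites = sorted(
        site for site, row in matrix.items()
        if row and all(status != OK for status in row.values())
    )
    realms = set()
    for row in matrix.values():
        realms.update(row)
    stuck_realms = sorted(
        realm for realm in realms
        if all(row.get(realm) != OK for row in matrix.values())
    )
    return sites, stuck_realms


sample = {
    "osu.cz":   {"osu.cz": 2, "vse.cz": 2, "vscht.cz": 2},
    "vse.cz":   {"osu.cz": 2, "vse.cz": 2, "vscht.cz": 2},
    "vscht.cz": {"osu.cz": 2, "vse.cz": 2, "vscht.cz": 0},
}
print(fully_failing(sample))
```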
7.0.2 Some misconfigurations in infrastructure
This image shows a real-world situation. The institution lf1.cuni.cz is not able to host some visitors. Some of them just time out (red with status code 2), some receive a false Access-Reject (yellow with status code 1). Such a situation on the map requires close attention from eduroam administrators.
The institution fel.cvut.cz disabled monitoring totally (orange, code 3, monitoring data are unknown) but their users are able to roam - the column for fel.cvut.cz is green.
No visitor is able to get online at vscht.cz except for their own users. In addition, their users can get connected everywhere. Output like this usually means that forwarding to the upper-level server is misconfigured at the site. Again, the attention of the administrator is needed.
7.0.3 Infrastructure crash
To produce this image, I took the infrastructure down: I disabled the RADIUS server on radius1.eduroam.cz and disabled the dependency system in the monitoring. But I did not leave it down long enough for all the green to disappear. A close look reveals:
- Some sites respond with Access-Reject even when their RADIUS server is disconnected from eduroam.cz. This is typical for FreeRADIUS.
- Some sites go down even for their own users under the heavy load caused by monitoring (note: the dependency system was down for this experiment, which greatly increased the number of test requests). Note the sssvt.cz line. Properly maintained sites should have green fields for their own realms.
8 Conclusions
I developed this monitoring system mainly during the first quarter of 2006. For the rest of the year I introduced minor changes, mainly to tune the performance and to reduce the impact of monitoring activities on the infrastructure. A deeper study of the impact was prompted by the feedback received after my presentation of the system at the TF-Mobility meeting in Catania, Sicily.
This end-to-end monitoring system makes it possible to discover problems in the eduroam.cz infrastructure before real users find them. This was proven several times during 2006. Not only did it help when connecting new organisations, the system also proved to be a valuable tool for discovering small configuration glitches that can have a great impact on users.
I plan to introduce a user interface for common users in the future. I am working on a clickable map of the Czech Republic showing the current status of each site as well as historical reports. I am also preparing an ad-hoc monitoring add-on to the existing system which would allow users to conduct roaming tests in situations where the scheduling granularity is too coarse.