Network monitor

The internet does not have a simple star or tree topology; it is more like a mesh. Add to this the existence of complex routing tables, firewall rules and DNS views, and you’ll realise that it’s impossible to monitor the availability of a network or node from a single point of view. For a long time, I’ve been thinking about a distributed system of monitoring stations that exchange information on which networks/nodes they perceive as being unreachable. Systems like this do exist, and their services are offered both for free and commercially, but I would like to create something that is flexible and easy to deploy for an individual person running a few servers.

The first step for this, however, is a program that runs on only one host and gathers all possible data from there. I’ve started by creating an engine that reads a list of IP addresses or hostnames and sends, receives and interprets ICMP Echo (‘ping’) messages. Next, I’ve developed a configuration-file format that allows users to represent the network topology to a certain degree. Lastly, I’ve added a statistical routine that tries to interpret the ping data (basically RTTs, round-trip times). The reason for this is that what’s fast for one node may be very high latency for another. The routine learns what ‘good performance’ is for a given node and then starts tagging probes that fall outside the margin of ‘normal’ latency. These tags range from yellow (‘jitter’ or minor congestion) through blue (‘lag’ or heavy congestion) to red (packet loss or unreachability). The colour tags for the probes are shown on screen as a grid, with one column per monitored node and time on the vertical axis. Below this grid, the individual probe results are displayed, and next to it a tree of the network topology is shown, with each node tagged with the colour of its current ‘state’. State is defined as the least congested state seen in the last x probes (x is 3 in my setup, so far). A bell character (beep or screen flash) can be output on packet loss, or this can be inverted so that it signals when a node that was down for a while comes back up.
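To make the tagging and the ‘state’ rule a bit more concrete, here is a minimal Python sketch. It is not the actual routine: the class name NodeMonitor, the tag name ‘green’ for a normal probe, the mean-plus-k-standard-deviations thresholds, the warm-up length and the 0.5 ms floor on the deviation are all assumptions made for illustration. Only the yellow/blue/red tags and the ‘least congested of the last x probes’ rule come from the description above.

```python
from collections import deque
from statistics import mean, stdev

# Severity order, least to most congested. "green" (a normal probe) is an
# assumed name; yellow, blue and red are the tags described in the text.
SEVERITY = ["green", "yellow", "blue", "red"]


class NodeMonitor:
    """Hypothetical per-node tracker: learns a latency baseline, tags probes."""

    def __init__(self, window=3, history=50, jitter_k=2.0, lag_k=4.0,
                 warmup=5, min_sigma=0.5):
        self.rtts = deque(maxlen=history)  # recent 'normal' RTTs (ms) = baseline
        self.tags = deque(maxlen=window)   # tags of the last `window` probes
        self.jitter_k = jitter_k           # assumed threshold for yellow
        self.lag_k = lag_k                 # assumed threshold for blue
        self.warmup = warmup               # probes needed before judging
        self.min_sigma = min_sigma         # floor (ms) so a very stable link
                                           # does not flag tiny fluctuations

    def tag_probe(self, rtt_ms):
        """Tag one probe. rtt_ms is the RTT in ms, or None if no reply came."""
        if rtt_ms is None:
            tag = "red"                    # packet loss or unreachable
        elif len(self.rtts) < self.warmup:
            tag = "green"                  # still learning what 'normal' is
        else:
            mu = mean(self.rtts)
            sigma = max(stdev(self.rtts), self.min_sigma)
            if rtt_ms > mu + self.lag_k * sigma:
                tag = "blue"               # 'lag', heavy congestion
            elif rtt_ms > mu + self.jitter_k * sigma:
                tag = "yellow"             # 'jitter', minor congestion
            else:
                tag = "green"
        if tag == "green":
            self.rtts.append(rtt_ms)       # only normal probes update the baseline
        self.tags.append(tag)
        return tag

    def state(self):
        """Least congested tag seen in the last `window` probes."""
        if not self.tags:
            return "green"
        return min(self.tags, key=SEVERITY.index)


if __name__ == "__main__":
    node = NodeMonitor()
    for rtt in [10.0, 10.5, 9.5, 10.2, 9.8, 11.5, 30.0, None, None, None]:
        print(rtt, node.tag_probe(rtt), node.state())
```

Because the state is the least congested tag in the window, a single lost packet does not turn a node red; with x = 3 that only happens after three consecutive losses, which keeps the topology tree from flickering on isolated glitches.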

This functionality is currently in beta testing. I’ll probably reach a point suitable for first release within one or two months.