A Monitoring Infrastructure for PlanetLab
Part of the CoDeeN project
Please pour a drink for our dead server. CoMon the service was getting harder to maintain as the various PlanetLab forks moved away from shared accounting mechanisms. The final straw was a building power shutdown that killed its disk shelves.
photo by Linda, with a CC license
CoMon provides monitoring statistics for PlanetLab at both the node level and the slice level. It can be used to see what is affecting the performance of nodes, and to examine the resource profiles of individual experiments.
The status page provides several views of PlanetLab, including node-centric, slice-centric, and others. To see more views, click on any of the links shown in the "Summaries:" line at the top of each page. Also available are pages showing the nodes with problems, and the slices with problems, which can be useful for general problem monitoring.
How you use CoMon depends on what you need. If you suspect that your experiment is acting strangely on some nodes, you may want to use the node-centric view to see if any of its statistics seem out of line with respect to other nodes.
If you're running a long-lived test, you can use the slice-centric views to see how your slice is consuming resources, in aggregate or on each node. For example, using the "Slice Max" page, you can see the maximum amount of memory your slice is using on any node, and then click on that value to see a two-day history. This kind of technique is useful for spotting memory leaks and similar problems.
Practically everything in CoMon is sortable by clicking on the column headings. On the node-centric view, most cells have two values, and so each column has two separate headings, which are both clickable.
Almost every value is also graphable. Clicking on any value loads a two-day history of that value. In the case of the per-slice/per-node pages, the values tend to lag a little. This is normal, and happens because we simply generate more graphs than the disk can handle. None of the data gets lost, but the graphs do lag a little.
CoMon also has support for selecting rows based on user-provided criteria. It works by allowing you to add a C-style expression to CoMon queries that selects only the rows that satisfy the expression. The current set of supported operators covers the popular comparison operators (>, <, >=, <=, ==, and !=), parentheses, and logical and/or (&&, ||). Column names are derived by using just the alphanumeric characters of the column heading, with no spaces, and constants are also recognized.
To make it easier to use CoMon with various forms of scripting, the data output format can be specified. By default, CoMon's output is HTML tables in a human-friendly format. Options for script-friendly formats are nameonly for just the row names, formatcsv for comma-separated values, and formatspaces for space-separated values. You can also specify a limit on the number of rows to display, and you can sort by the column of your choice. Example queries using all of these features are shown below.
Note about spaces: Many browsers and utilities correctly convert spaces in the URL to the appropriate escape sequence. However, it appears that curl and Firefox 3 (and possibly others) do not perform the escaping. As such, the queries below do not include any spaces in the URL. Please be careful when using tools that do not understand how to properly handle spaces in the URL.
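For scripts that build queries themselves, one way to sidestep the escaping problem is to percent-encode the select expression before it goes into the URL. The following is a minimal Python sketch, not the actual CoMon query syntax: the endpoint in BASE_URL, the parameter names (format, select, limit, sort), and the column names in the example expression are all assumptions made for illustration.

```python
from urllib.parse import quote

# Hypothetical endpoint; substitute the real CoMon query URL.
BASE_URL = "http://comon.example.org/status/query"


def build_query(select_expr, output_format="formatcsv", limit=None, sort=None):
    """Return a query URL in which every parameter value is percent-encoded.

    The parameter names used here are assumptions, not documented CoMon names.
    """
    params = [("format", output_format), ("select", select_expr)]
    if limit is not None:
        params.append(("limit", str(limit)))
    if sort is not None:
        params.append(("sort", sort))
    # safe="" makes quote() escape spaces as %20 and also escape > < & | =
    return BASE_URL + "?" + "&".join(
        f"{name}={quote(value, safe='')}" for name, value in params
    )


# Column names below are made up for the example.
url = build_query("resptime > 0 && liveslices >= 5", limit=10, sort="resptime")
```

Because the expression is encoded before the request is made, the resulting URL contains no literal spaces, so it should behave the same in tools that do not escape spaces on their own.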
The best way of accessing the data from CoMon is to use the query interface above. For the most part, we try to keep the column names and representations stable. In the event that you like pain, two daemons run on each node and make its data available in different ways. The node-centric daemon listens on port 3121 and returns its data as soon as a connection is made; no request needs to be sent. The slice-centric daemon listens on port 3120 and expects an HTTP request for "/cotop". For example, you could generate a request for http://planetlab-1.cs.princeton.edu:3120/cotop
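As a rough illustration of the difference between the two daemons, here is a minimal Python sketch. It assumes that the node-centric daemon closes the connection once it has sent its data, and that both daemons return text that can be decoded as UTF-8; neither assumption is confirmed here, and the function names are our own. The daemons may also no longer be reachable on any given node.

```python
import socket
import urllib.request


def read_node_daemon(host, port=3121, timeout=10.0):
    """Read the node-centric data: connect, send nothing, read until EOF."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(4096)
            if not data:  # assumed: the daemon closes the socket when done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def read_slice_daemon(host, port=3120, timeout=10.0):
    """Read the slice-centric data with a plain HTTP GET for /cotop."""
    url = f"http://{host}:{port}/cotop"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Example use (requires a reachable node):
#   text = read_node_daemon("planetlab-1.cs.princeton.edu")
#   text = read_slice_daemon("planetlab-1.cs.princeton.edu")
```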
The data from these sensors is what is actually used to generate the data seen in CoMon, and these values are archived into the "monall" (node-centric) and "topall" (slice-centric) dumps. The "monall" dump begins with a line indicating the start time, both in seconds since the epoch and as a human-readable value. Then, each node has a set of name-value lines, with the node's name and IP address included in the lines. Nodes are separated by blank lines. The "topall" dump is more web-like, but each node's data is delimited by a line specifying the start time via the word "Start". Following that is the node name, and then the dump from the daemon.
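The "monall" layout described above can be read with a few lines of Python. This is only a sketch under stated assumptions: the description does not say how a name is separated from its value, so the colon separator here is a guess (and is a parameter so it can be changed), the field names in the sample are invented, and the first line is returned unparsed because its exact layout is not specified.

```python
def parse_monall(text, separator=":"):
    """Split a "monall" dump into its header line and one dict per node.

    Assumes name-value lines look like "name<separator>value"; the
    separator is an assumption, not a documented part of the format.
    """
    lines = text.splitlines()
    if not lines:
        return "", []
    header, body = lines[0], lines[1:]
    nodes, current = [], {}
    for line in body:
        if not line.strip():
            # A blank line ends the current node's block.
            if current:
                nodes.append(current)
                current = {}
            continue
        name, found, value = line.partition(separator)
        if not found:
            continue  # skip lines that do not match the assumed layout
        current[name.strip()] = value.strip()
    if current:
        nodes.append(current)
    return header, nodes


# Invented sample data, shaped like the description above.
sample = (
    "1200000000 start line\n"
    "Name: node-a\nAddress: 10.0.0.1\n"
    "\n"
    "Name: node-b\nAddress: 10.0.0.2\n"
)
header, nodes = parse_monall(sample)
```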
The archived files contain the results of the checks every 5 minutes. They are available as http://summer.cs.princeton.edu/status/dump_cotop_YYYYMMDD.bz2 for the slice-centric data and http://summer.cs.princeton.edu/status/dump_comon_YYYYMMDD.bz2 for the node-centric data. Obviously, replace YYYYMMDD with the year, month, and day of the data being requested. The current day's file does not have the bz2 extension, and is not compressed. This data is manually moved to an off-line repository periodically, so it's hard to predict how many days back the on-line archive goes. Unfortunately, due to disk space issues, we no longer have an on-line archive of older data.
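The naming rule for the archive files can be captured in a small helper. The sketch below is in Python and follows only what is stated above; the function name and the kind argument are our own, and which time zone defines the "current day" on the server is not specified, so the today parameter is left for the caller to set.

```python
from datetime import date

ARCHIVE_BASE = "http://summer.cs.princeton.edu/status"


def archive_url(day, kind="comon", today=None):
    """Build the archive URL for one day of data.

    kind is "comon" for node-centric data or "cotop" for slice-centric data.
    The current day's file is uncompressed and has no .bz2 extension.
    """
    if kind not in ("comon", "cotop"):
        raise ValueError("kind must be 'comon' or 'cotop'")
    if today is None:
        today = date.today()  # local date; the server's notion may differ
    name = f"dump_{kind}_{day:%Y%m%d}"
    if day != today:
        name += ".bz2"
    return f"{ARCHIVE_BASE}/{name}"
```

Files that end in .bz2 can then be decompressed with Python's standard bz2 module after download.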
KyoungSoo Park and Vivek Pai, with input from lots of others. We may collectively be contacted at princeton_comon at slices.planet-lab.org