Bundle execution time randomization in cfengine3


In our environment we use cfengine to manage servers across the organization. With a fairly large infrastructure, we have to give a lot of thought to smoothing the load on cfengine hubs and other parts of the infrastructure.

This article presents some approaches to randomizing bundle execution time. This is useful when a bundle is going to affect a lot of servers and you don’t want it to execute on all of them simultaneously, creating a pressure point and possibly an event storm.

The first approach that comes to mind is the splayclass() function. It defines a class if the system clock lies within a scheduled time interval that maps to a hash of the first argument, an arbitrary string that is usually set to the fqdn. Different strings hash to different time intervals, so different hosts (or different tasks) can be mapped to different intervals.
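A minimal sketch of a policy using this function is shown here. The bundle name and the class name my_turn are made up for illustration, and $(sys.fqhost) is assumed to be an acceptable string to hash:

```cf3
body common control
{
      bundlesequence => { "splay_hourly" };
}

bundle agent splay_hourly
{
  classes:
      # my_turn is defined only during this host's 5-minute slot in each
      # hour; the slot is derived from a hash of the string argument.
      "my_turn" expression => splayclass("$(sys.fqhost)", "hourly");

  reports:
    my_turn::
      "Hourly splayed report on $(sys.fqhost)";
}
```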

Guarding a report with such a class under the “hourly” policy executes the report at a pseudo-random, host-specific moment every hour.

A nuisance with this function is that it is somewhat limited, having only “hourly” and “daily” policies. With the “hourly” policy, the class is defined for a 5-minute interval every hour, and with the “daily” policy it is defined for one 5-minute interval every day. This might be either too frequent or too infrequent for a specific case. It can also be a problem if your cfengine run interval differs from the default.

To address this nuisance we can use the dist keyword in a class definition, which generates a probabilistic class distribution.
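As a rough illustration, a 15/85 split could be written as follows. The bundle name is hypothetical; the class name percent_of_runs_15 follows from the promiser name and the chosen weight:

```cf3
body common control
{
      bundlesequence => { "dist_example" };
}

bundle agent dist_example
{
  classes:
      # Weights 15 and 85: on each run one of percent_of_runs_15 or
      # percent_of_runs_85 is defined, picked at random in proportion
      # to the weights.
      "percent_of_runs" dist => { "15", "85" };

  reports:
    percent_of_runs_15::
      "Selected in roughly 15% of cf-agent runs";
}
```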

With weights of 15 and 85, the class “percent_of_runs_15” will be defined in 15 out of (15+85=) 100 cases, or in 15% of cf-agent runs. Considering that cf-agent runs at a 5-minute interval by default, that makes 15% of (24*12 =) 288 runs per day, or about 43 runs, which is just under twice per hour on average, at random moments. By tuning the weights we can change the frequency at which the class gets defined.

dist can give us even more flexibility, for example when we need a bundle to execute during a random hour every 12 hours, but within that hour we’d like the bundle to run every 5 minutes. This can be needed when a bundle requires multiple runs to fix things (deleting stuff from a file is a good example). For that purpose we can combine the dist keyword with persistent classes.
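One way this combination might look is sketched below. It is not the original listing: the bundle and class names are invented, the weights assume the default 5-minute run interval (144 runs per 12 hours), and it assumes a CFEngine version that supports the persistence attribute on classes promises. A classes body with persist_time would be another way to get a similar effect.

```cf3
body common control
{
      bundlesequence => { "dist_window" };
}

bundle agent dist_window
{
  classes:
      # About 1 run in 144, i.e. on average once per 12 hours
      # at a 5-minute run interval (an assumption of this sketch).
      "start_window" dist => { "1", "143" };

      # Once triggered, window_open stays defined for 60 minutes,
      # so every agent run within that hour sees it.
      "window_open" expression => "start_window_1",
                    persistence => "60";

  reports:
    window_open::
      "Inside the one-hour window; the real promises would go here";
}
```

Because the trigger is random, windows will not be spaced exactly 12 hours apart, and a second trigger inside an open window would extend it.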

This approach is more flexible, but it has its own issue: dist is probabilistic by nature, so it doesn’t guarantee that the observed distribution will match the weights exactly. Keep in mind that some deviation in either direction is normal here. For instance, running the 15%/85% example produced results ranging from 13% to 18% for the 15% class.

We can also apply the persistent-class approach to the splayclass() function.
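A possible sketch follows, again with hypothetical bundle and class names and assuming the persistence attribute is available on classes promises in your CFEngine version:

```cf3
body common control
{
      bundlesequence => { "splay_window" };
}

bundle agent splay_window
{
  classes:
      # splayclass() is true for one 5-minute slot per day; persisting
      # the class for 60 minutes stretches that slot into an hour.
      "window_open" expression => splayclass("$(sys.fqhost)", "daily"),
                    persistence => "60";

  reports:
    window_open::
      "Inside today's one-hour window on $(sys.fqhost)";
}
```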

Combining splayclass() with a persistent class in this way allows us to execute a report (or bundle) every 5 minutes throughout one host-specific, effectively random hour of the day.