Aleksey Tsalolikhin, 9 July 2010

On Thursday, June 24th, the USENIX 2010 conference in Boston hosted the first Configuration Management Summit on automating system administration using open source configuration management tools. The summit brought together developers, power users and new adopters.

Why Configuration Management?

Internet use is growing and new services are appearing hourly. The number of servers (both physical and virtual) is becoming uncountable. Automating system administration is a must to handle the deluge; otherwise, swarms of sysadmins would be needed to handle all these systems.

Other drivers for automating system administration:

  • In companies with multiple sysadmins working the old way, in interactive root sessions, admins making changes at the same time can step on each other’s toes (and on the config!);

  • system administration is a relatively new profession, without a standard curriculum, so practitioners have different philosophies and practices. Going from organization to organization, it is a challenge for a new sysadmin to learn:

    • how the system is set up,

    • why it was set up that way,

    • how it needs to be set up to keep operating,

    • how to set it up again after a disaster or during normal growth.

Automating system administration addresses all the above and makes new things possible.

For example, a CM tool can detect and remedy a deviation from configuration policy faster than a human sysadmin, or it can automatically instantiate, configure and bring online a new virtual server instance if an old one dies.

There are over a dozen different CM tools actively used in production.

So many choices can bewilder a sysadmin searching for a CM tool.

The summit included representatives of four tools: Bcfg2 (pronounced "bee-config two"), Cfengine, Chef and Puppet.

The summit had three parts: four presentations, a panel session and a mini BarCamp with six presentations. The panel session was quite lively.

I will attempt to compare and contrast the four tools; however, using any robust configuration management tool, with discipline, is better than administering systems manually.

Quick Overview of the Four Tools

Bcfg2: Came out of Argonne National Lab. Lightweight on the node. Each server can easily handle 1000 nodes. Relies on centralization. Uses a complete model of each node’s configuration, both desired and current.

Strengths: Reporting system and debugging.

Weaknesses: Documentation. (A new set of documentation is coming out now, but it is still weak on examples.) Sharing policies between sites is not easy; group names need to be standardized first.

Cfengine: Came out of the University of Oslo. Strong philosophy of allowing decentralization and potential local autonomy. Oriented toward consensus building as opposed to top-down policy dictation. Underlying philosophies are promise theory, convergence and self-healing. Also has a healthy paranoid streak and an impressive security record (only 3 serious vulnerabilities in 17 years).

Strengths: Highly multiplatform (it even runs on unmanned underwater vehicles!). Lightweight. Largest user base - more companies use it than all the other tools combined! Able to continue operating under degraded conditions (network down, for example).

Weaknesses: It’s hard to get started because there is a lot to learn.

Chef: Has its origins in the Ruby-on-Rails world and in the cloud. Grew out of dissatisfaction with Puppet’s non-deterministic ordering. Resilient (each node can run stand-alone if the server disappears). Sequence of execution is tightly ordered.

Strengths: Cloud integration (automating provisioning and configuration of new instances in one fell swoop). Multi-node orchestration (more below). Reusable policy cookbooks, and the highest degree of recipe reuse among its users of the four tools.

Weaknesses: Attributes have nine different levels of precedence (role, node, etc.), and this can be daunting.

Puppet: Grew out of dissatisfaction with Cfengine 2. Centralized model; however, if the server is unreachable, node agents will still run, applying the cached configuration. A simple and human-readable DSL gives safety at the cost of flexibility. Determines and applies only the delta changes.

Strengths: Large community of users (over 2000 users on the Puppet mailing list).

Weaknesses: The Puppet server is currently a potential bottleneck (which can be solved by going to multiple servers). Execution ordering can be non-deterministic. (But reports will always tell you what succeeded and what failed, and ordering can be mandated where it is required.)

Bcfg2

Bcfg2 was represented by its creator, Sr. Systems Administrator Narayan Desai.

Some notes on the name: pronounced "bee-config two". The "B" in Bcfg2 originally stood for "bundle"; now it doesn’t stand for anything. The 2 indicates version 2.

Narayan thinks of configuration management as an API for programming your configuration.

Goals of Bcfg2’s design:

  • Efficient representation of diverse configuration.

  • Scalability to thousands of nodes.

  • Programmability (ability to insert code, not just declarative descriptions of desired configuration).

  • Model configuration in simple unambiguous terms. (No way to end up with two different models of the same configuration.) "Closing the loops between goals and reality" is very important in Bcfg2. To enable this, there is a lot of reporting built in; the client reports its state upstream to the server to enable "gap analysis". This is a very popular feature.

  • Support extensive configuration debugging. Help the sysadmin get to the bottom of things quickly. Bcfg2 has full system introspection capability (why is Bcfg2 making the decisions that it is). Many people think the debugger is the coolest feature of Bcfg2.

  • The Bcfg2 client can be run in dry-run mode (no changes), interactive mode (are you sure you want to do this?) and non-interactive mode, which facilitates learning it (see the command sketch after this list).

  • Composition of information from a number of sources. (For example, combine policies from different sources, such as an organization-wide policy (FTP should not be running) and a departmental policy (all user home directories must be NFS-mounted from fileserver HAPPYHOME).)

  • Expose plugin API to all aspects of the configuration process to enable handling of corner/edge cases. (Flexibility is an important part of Bcfg2’s design.)
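
For example, a hedged sketch of the three client modes from the command line (assuming the stock bcfg2 client options, where -n means dry run, -I interactive and -v verbose):

# Dry run: report what would change, but change nothing
bcfg2 -v -n

# Interactive: prompt before each change
bcfg2 -v -I

# Non-interactive: apply the configuration
bcfg2 -v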

Because Bcfg2 can capture the entire configuration of a system in a model, you can diff the goals model (the desired state) against the current model and find your degree of deviation, or you can diff two current models (Server A and Server B) and find out what’s different about their configurations.

Bcfg2 is very flexible - for example, you can designate a node as an exemplar, and the Bcfg2 server will copy its config to other nodes.

Bcfg2 is configuration "plumbing" (it just works). Here is how:
  1. Server probes client’s local state. (Client gets the probes from the server; client runs the probes; feeds data back to server.)

  2. Server builds configuration goals and sends them to the client.

  3. Client validates its local state and figures out what it has to do to meet the goals.

  4. Finally, state information is sent back to the server, where it is processed and can be reported.

Bcfg2 Policy: Install the Postfix Package
<Base>
    <Group name='Mail-server'>
        <Package name='postfix'/>
    </Group>
</Base>
Bcfg2 Policy: Install Multiple Packages
<Base>
    <Group name='Web-server'>
        <Package name='apache2'/>
        <Package name='apache2-mod_php'/>
        <Package name='php5'/>
    </Group>
</Base>
Bcfg2 Policy: Generate MOTD file using templating plug-in

Generate /etc/motd (message of the day) file that describes the system in terms of its Bcfg2 metadata and probe responses.

Here is the template (stored on the server):

 ------------------------------------------------------------------------
                    GOALS FOR SERVER MANAGED BY BCFG2
 ------------------------------------------------------------------------
 Hostname is ${metadata.hostname}

 Groups:
 {% for group in metadata.groups %}\
  * ${group}
 {% end %}\

 {% if metadata.categories %}\
 Categories:
 {% for category in metadata.categories %}\
  * ${category}
 {% end %}\
 {% end %}\


 {% if metadata.Probes %}\
 Probes:
 {% for probe, value in metadata.Probes.iteritems() %}\
  * ${probe} \
    ${value}
 {% end %}\
 {% end %}\

 -------------------------------------------------------------------------
                        ITOPS MOTD
 -------------------------------------------------------------------------
Please create a Ticket for any system level changes you need from IT.

This template gets the hostname, the host’s group memberships, the host’s categories (if any), and the results of probes on the host (if any). The template formats this with a header and footer that make it visually more appealing.

Output

One possible output of this template would be:

 ------------------------------------------------------------------------
                     GOALS FOR SERVER MANAGED BY BCFG2
 ------------------------------------------------------------------------
 Hostname is cobra.example.com

 Groups:
  * oracle-server
  * centos5-5.2
  * centos5
  * redhat
  * x86_64
  * sys-vmware

 Categories:
  * os-variant
  * os
  * database-server
  * os-version


 Probes:
  * arch    x86_64
  * network    intranet_network
  * diskspace    Filesystem            Size  Used Avail Use% Mounted on
 /dev/mapper/VolGroup00-LogVol00
                        18G  2.1G   15G  13% /
 /dev/sda1              99M   13M   82M  13% /boot
 tmpfs                 3.8G     0  3.8G   0% /dev/shm
 /dev/mapper/mhcdbo-clear
                       1.5T  198M  1.5T   1% /mnt/san-oracle
  * virtual    vmware

 -------------------------------------------------------------------------
                        IT MOTD
 -------------------------------------------------------------------------
 Please create a Ticket for any system level changes you need from IT.

Screenshots of Reporting System

[Screenshots: summary calendar view, node drop-down, item detail view.]
Configuration Management Tips from the author of Bcfg2

Narayan pointed out configuration management can benefit from software engineering approaches:

  • Version control of configuration policy.

  • Testing and validation of changes to configuration policy.

  • Release management process for changes to configuration policy.

Narayan said, "It’s really hard to do roll backs due to lack of roll back support in package management systems". His solution? The best way to roll back is a filesystem snapshot.

Cfengine

Mark Burgess, the author of Cfengine, started his presentation by re-focusing the projector. The image was not blurry to start with, just not completely in focus, and I had no trouble reading the prior presenter’s slides; but it was very crisp after the adjustment!

Such attention to detail inspired my confidence.

Cfengine is the granddaddy of open source configuration management tools, dating back to 1993, when Mark worked as a part-time sysadmin at the University of Oslo, struggling to handle many different kinds of Unix and Unix-like systems.

Mark describes Cfengine as an agent-based change management system with "convergent" or "self-healing" behavior. (Cfengine will continuously return a system to the configured state, or keep it there if it’s already there. Another way to put it: regardless of where you start from, you can always get to the defined state.)

The Cfengine language is a largely declarative language for describing desired or "promised" states. Like Bcfg2, the language is a pragmatic mix of declarative and procedural.

Cfengine includes a self-learning monitoring framework (to deal with an unknown environment) and a knowledge management framework (to help handle complexity of system configuration). Cfengine introduced the idea of "classes", which are patterns in space and time and implicit if/then tests.

Examples of Cfengine Classes
  • The name of an operating system (Solaris or Red Hat Enterprise Linux)

  • Architecture (x86 or SPARC)

  • Time (Sunday, or 3 AM - 3:59 AM)

  • The name of a host, or a user-defined name of a group of hosts.

  • Any arbitrary string.

You can use Boolean logic with classes to select systems for a configuration promise. For example: Linux servers on x86 platform with -dev in the name should have their OS updated on Sunday at 3 AM.
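
A minimal Cfengine 3 sketch of such a class expression (the built-in hard classes linux, x86_64, Sunday and Hr03 are real; the bundle name and yum command line are just illustrations, and the "-dev in the name" test would be an additional user-defined class):

body common control
{
  bundlesequence => { "os_update" };
}

bundle agent os_update
{
  commands:

    # "." is AND, "|" is OR, "!" is NOT in class expressions.
    # Only Linux x86_64 hosts, on Sunday between 03:00 and 03:59:
    linux.x86_64.Sunday.Hr03::

      "/usr/bin/yum -y update";
}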

Cfengine is model-based in the sense that you describe the model of the end state that you want.

Cfengine is self-documenting because you are using a declarative language.

Cfengine is lightweight (1.9 MB footprint). It has very few prerequisites (the Berkeley DB library, a crypto library and an optional PCRE library). Today, Cfengine runs on everything from unmanned underwater vehicles to Nokia handheld phones to supercomputing clusters.

Because it’s a C binary with very few prerequisites, it has the largest span of systems it runs on out of all the open source CM tools.

At first, Cfengine was modeled as a computer immune system, helping a system stay healthy in an uncertain, changing, and possibly hostile environment.

The current philosophy of Cfengine is "promise theory", where the defined state is promised by different system components (such as files, packages, processes, etc.), and Cfengine is a "promise engine" — an engine for keeping promises.

Key Principles of Cfengine’s Design:
  • Voluntary cooperation, local autonomy. Cfengine allows local control of policy in anticipation of consensus building amongst human administrators. Voluntary cooperation is expected, so Cfengine always pulls policy, never pushes it. A policy push is indistinguishable from an attack. (Cfengine has had 3 security vulnerabilities in 17 years due to this principle.)

  • Pragmatism. Work with what you’ve got: allow shell commands.

  • Resilience. Expect the unpredictable (therefore convergence back to the promised state). The design allows for a single point of control without a single point of failure. (If a policy server goes away, the Cfengine agents on the nodes will keep running, using cached policy.)

  • Allow freedom. For example, allow the use of package systems. Cfengine is about "constraint", not "control". The philosophy behind this is, "You do not control environments, you participate in environments."

  • Convergence. Run Cfengine many times and the system should always get better; it should never get worse. Always move closer to the promised state, or stay there. Stay there by always trying to move closer to it. (This counteracts the natural force of entropy, which would otherwise result in system state drift over time.)

Promise Theory

Promise Theory is based on the key principles of convergence and autonomy.

Everything is a promise in cfengine language. Files promise to be there (and are created or copied by Cfengine if they are not); packages promise to be installed; processes promise to be running.

Cfengine configuration is composed of promises and patterns. A class is an example of a pattern; a list of packages to be installed is another (see example below).

Another practical pattern in Cfengine is abstraction of promise details so you can see at a glance what is promised, and can still drill down if necessary to get the promise details.

For example:

Abstracted Promise
copy_from => my_secure_cp("myfile","myserver")
Promise Body (Like "Contract Body" - Contains Details)
body copy_from my_secure_cp(file,server)
{
source      => "$(file)";
servers     => { "$(server)" };
compare     => "digest";
encrypt     => "true";
verify      => "true";
force_ipv4  => "false";
collapse_destination_dir => "false";
copy_size => irange("0","50000");
findertype => "MacOSX";
# etc etc
}
How Does Cfengine Work?
  1. The agent wakes up and classifies its environment (time, network address, OS, group defined by LDAP, etc.) This sets up all the classes.

  2. The agent reviews and executes promises. It may download the latest promise policy from a server, or use its local copy. Executing the promises, Cfengine makes three passes, checking everything and fulfilling as many promises as it can. For example, if the SNMP package promises to be installed and the SNMP daemon promises to be running, then on the first pass Cfengine could install the SNMP package, and on the second pass it would start the daemon.

  3. The agent reports on success.
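
On a node this cycle is normally driven by the cf-execd scheduler, but it can also be triggered by hand; a rough sketch of common invocations (assuming the standard cf-agent options):

# Run the agent once, ignoring locks (-K) and reporting what it
# repairs (-I):
cf-agent -K -I

# Dry run: show what would be repaired without changing anything:
cf-agent -K -I --dry-run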

Cfengine Promise: Install the Postfix Package:
packages:
  "postfix"
     package_policy => "add",
     package_method => yum;
Cfengine Promise: Install Multiple Packages.

First, create a variable of type "list of strings" named @match_package.

Second, use an implicit loop over each element of the list (as in Perl, @var is an array and $var is a scalar/string), and promise that each package is added using YUM.

Loops are implicit in Cfengine; this is a powerful abstraction.

 vars:

  "match_package" slist => {
                           "apache2",
                           "apache2-mod_php5",
                           "php5"
                           };
 packages:

    "$(match_package)"

         package_policy => "add",
         package_method => yum;

Chef

Aaron Peterson, a seasoned systems engineer and Technical Evangelist at Opscode, Inc., presented. Chef is primarily a configuration management library system and a systems integration platform (it helps integrate new systems into existing platforms).

Chef grew out of power user dissatisfaction with aspects of Puppet. Made available in 2009, Chef is in beta (version 0.9.x.x) and is settling down now.

Key Principles of Chef’s Design:
  • The cloud is the future, be able to operate in the cloud. (For example, automated provisioning of new server instances.)

  • Fully automated infrastructure is hard. Make it easier through configuration sharing and re-use. Chef is a library for CM (or a CM system built on that library).

  • Facilitate integration of new servers into an existing platform.

  • Idempotence. Describe the end state, and Chef will get you there and keep you there. However, Chef only takes action when the current state differs from the described end state.

  • Reasonability (easy to think about).

  • Sane defaults (yet easily changed).

  • Hackability (easy to extend).

  • TMTOWTDI (There’s More Than One Way To Do It).

  • Pragmatism. Chef’s language is Ruby with some DSL. The Ruby mix allows you to include programming.

  • Enable infrastructure as code, to benefit from software engineering practices such as agile methodologies, code sharing through GitHub, release management, etc.

  • Enable you to solve your problem.

  • Data-driven. Configuration is just data.

  • Cultivate a culture of quality and reusability through Chef cookbooks. Example freely available cookbooks: Postgresql, tomcat6, Apache2, Kickstart, OpenSSL, etc.

  • Chef exposes data and behavior over HTTP to enable integration with external tools.

  • Recipes are run in order. Nodes have a run list: what roles or recipes to apply, in order.

Example: Run List
"run_list": [
  "role[webserver]",
  "role[database_master]",
  "role[development]"
]
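
Each run list entry names a role or a recipe; a role itself is just a short Ruby file. A minimal sketch (the file name, role name and attributes here are hypothetical):

# roles/webserver.rb -- hypothetical role definition
name "webserver"
description "Front-end web server"
run_list "recipe[apache2]", "recipe[apache2::mod_php5]"
default_attributes "apache" => { "listen_ports" => [ "80" ] }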

Chef: Infrastructure is Code

Chef Recipe: Install the Postfix Package
package "postfix" do
  action :install
end
Chef Recipe: Install Multiple Packages
package "apache2" do
  action :install
end

package "apache2-mod_php5" do
  action :install
end

package "php5" do
  action :install
end
Chef Recipe: Upgrade sudo package and configure /etc/sudoers using a template
package "sudo" do
  action :upgrade
end

template "/etc/sudoers" do
  source "sudoers.erb"
  mode 0440
  owner "root"
  group "root"
  variables(
    :sudoers_groups => node[:authorization][:sudo][:groups],
    :sudoers_users => node[:authorization][:sudo][:users]
  )
end

Manage configuration as resources.

Put them together in recipes.

Track it like source code.

Configure your servers.

The above example has two parts: a package resource recipe and an accompanying file template recipe.
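
The template itself is ordinary ERB; a minimal sudoers.erb sketch (hypothetical contents, assuming the :sudoers_users and :sudoers_groups variables passed above, which appear in the template as instance variables):

# sudoers.erb -- hypothetical template body for the recipe above
root    ALL=(ALL) ALL

<% @sudoers_users.each do |user| -%>
<%= user %>    ALL=(ALL) ALL
<% end -%>

<% @sudoers_groups.each do |group| -%>
%<%= group %>   ALL=(ALL) ALL
<% end -%>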

Under the hood, a Chef Provider will handle the required action to bring the resource into the described state. Example provider for the above: Chef::Provider::Package::Yum.
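
Normally Chef picks the appropriate provider for the platform automatically, but a resource can also pin one explicitly; a small sketch (rarely needed in practice):

package "postfix" do
  # Force the Yum provider instead of letting Chef choose
  provider Chef::Provider::Package::Yum
  action :install
end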

Recipes are lists of resources (files, packages, processes, filesystems, users, etc.)

Cookbooks are packages of recipes.

How Does Chef Work?
  • Install Chef. Install all the Ruby gems/dependencies.

  • Create your Chef repository

Beginner’s Cookbook:
git clone  git://github.com/opscode/chef-repo.git
Advanced Version:
git clone  git://github.com/opscode/chef.git

Chef assumes you start from a base OS. (This is particularly true in cloud environments, where it is financially feasible for servers to have brief lifetimes.)

  • Fire up Chef Server and Chef Client. Managed clients do most of the work. (Or you can use Chef Solo, which can run recipes without a server.)

Chef prefers failure over non-deterministic "success" when something goes wrong. If it cannot complete the recipe in full and in sequence, that is a failure. This is one of the primary things differentiating it from Puppet, where ordering is non-deterministic and a policy may be fulfilled partially, but it can be hard to predict which parts get fulfilled.

Example of Node Attributes
default[:nginx][:gzip] = "on"
default[:nginx][:gzip_http_version] = "1.0"
default[:nginx][:gzip_comp_level] = "2"
default[:nginx][:gzip_proxied] = "any"

default[:nginx][:keepalive] = "on"
default[:nginx][:keepalive_timeout] = 65

default[:nginx][:worker_processes] = cpu[:total]
default[:nginx][:worker_connections] = 4096
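
These attributes are then consumed by recipes and templates; for example, a fragment of a hypothetical nginx.conf.erb template might read:

# nginx.conf.erb fragment (hypothetical) consuming the attributes above
worker_processes  <%= node[:nginx][:worker_processes] %>;

gzip              <%= node[:nginx][:gzip] %>;
gzip_http_version <%= node[:nginx][:gzip_http_version] %>;
gzip_comp_level   <%= node[:nginx][:gzip_comp_level] %>;
keepalive_timeout <%= node[:nginx][:keepalive_timeout] %>;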
Platform Support - Operating Systems

Chef is known to run on the following platforms:

  • Ubuntu (6.06, 8.04-9.10) and Debian (5.0)

  • RHEL and CentOS (5.x)

  • Gentoo (1.12.11.1)

  • FreeBSD (7.1)

  • OpenBSD (4.4)

  • MacOS X (10.4, 10.5)

  • OpenSolaris (2008.11)

  • Preliminary support for Windows

Platform Support - Clouds

There are three cloud providers supported by "knife", the Chef command-line tool: Rackspace, Amazon EC2 and Terremark. The support comes through the open source fog project; provider support surfaces in "knife" as the API interaction in fog matures.

Example: instantiate a server; configure it as a Ruby-on-Rails Web app server:
knife rackspace server create 'role[rails]'
knife ec2 server create 'role[rails]'
knife terremark server create 'role[rails]'
Web UI

There is an optional Web UI. Anything that can be done through it (such as adding users or roles) can be done on the CLI with "knife" or interactively via "shef", the Chef shell.

The commercial platform has some additional Web UI features, but again, the primary use of Chef is expected to be via libraries, following the "infrastructure as code" paradigm.

Configuration Management Tip from Aaron Peterson

Never depend on a single sysadmin’s knowledge. Put all the knowledge into a Chef cookbook.

Puppet

Michael de Haan of Puppet Labs, and formerly of Red Hat engineering, presented.

Key Principles of Puppet’s Design:
  • Puppet is centralized.

  • Puppet’s internal logic is graph-based. It uses decision trees and reports on what it was able to do and what failed (and everything after it). Manual ordering is very important, as the decision trees will be based on it. Ordering is very fine-grained.

  • The Puppet language is a data center modelling language representing the desired state. It is designed to be very simple and human-readable. This prevents you from inserting (Ruby) code, but it also makes it safer (prevents you from shooting yourself in the foot). However, you can still call external (shell) scripts. Also, an upcoming version (2.6) will support programming in a Ruby DSL.

  • Portability. Works anywhere Ruby works.

  • Pluggability. Puppet does not allow arbitrary language in the code; however, there is pluggability: server-side functions can interact with an external data source (e.g., query a database or read a text file). There is a feature called "external_nodes", which you can enable on the Puppet server (the puppetmaster) and which kicks in whenever a Puppet client (puppetd) connects. Instead of storing a node’s name, class membership and attributes in your Puppet config, you can keep them in an external database, and "external_nodes" will fetch that information (a minimal classifier sketch follows this list).
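
A minimal classifier sketch (hypothetical: any executable that prints YAML for the node name it is given will do; the class and parameter names are made up):

#!/bin/sh
# external_nodes classifier sketch. Puppet calls it with the node
# name as the only argument; it must print YAML describing the node.
nodename="$1"

# A real classifier would look the node up in a database or CMDB.
cat <<EOF
# classification for $nodename
classes:
  - webserver
parameters:
  datacenter: boston
EOF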

How does Puppet Work?

Puppet only performs actions that are necessary. The basic formula for Puppet’s operation is: on the server side, poll information from the client, then decide what to do and tell the client what to do. In detail:

The server gets the client to tell the server about itself. These are facts in Puppet. The configuration policies are the manifests.

The server compares the facts (what is) to the manifests (what should be), and, if necessary, creates instructions to the clients on the managed nodes for moving from what is to what should be. These instructions are encoded as a JSON catalog.

Manifests + Facts → JSON catalog → Nodes

The JSON catalog contains a declarative description about desired state, and the client then runs that catalog to achieve the desired state.
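
In practice, a catalog run on a managed node looks roughly like this (a sketch using the pre-2.6 command name, puppetd):

# One-off run against the puppetmaster, verbose, in the foreground:
puppetd --test

# Same, but only report what would change (dry run):
puppetd --test --noop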

Puppet is pre-installed on Ubuntu (cloud and main editions).

If a service subscribes to a file and the file changes, the service will automatically know it needs to restart. For example:

service { 'sshd':
  ensure => running,
  subscribe => File['/etc/ssh/sshd_config'],
}

file { '/etc/ssh/sshd_config':
  ensure => present,
  source => 'puppet:///sshd/sshd_config',
  owner => root,
  group => root,
}
Puppet Language

Resource Types are the building blocks of Puppet configuration. Here is a simple example:

file { "/etc/passwd":
    owner => root,
    group => root,
    mode => 644
}

This is the "file" resource type. It controls ownership and access permissions to the named file.

Providers are what make the resource type an actuality: the part of Puppet that actually executes the configuration, the interface between the resource description and the OS; the "doer".

There can be multiple providers for a resource. For example, you might specify that the mod-php package be installed, and it could be installed by package providers for dpkg, rpm, yum, openbsd, and so on. The most appropriate provider is picked automatically; or you can specify certain features in the resource type, and then the providers will be probed for which features they support.
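
For instance, a resource can pin the provider explicitly instead of letting Puppet choose (a small sketch; the package name is illustrative):

package { 'php5':
    ensure   => installed,
    provider => yum,
}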

There is an advanced and experimental feature "exported resources" that allows one host to configure another host (in Puppet terms, it allows resources to move between hosts) — this allows inter-node orchestration.
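
A minimal sketch of the classic ssh_known_hosts use case (it assumes storeconfigs is enabled on the puppetmaster; $hostname and $sshrsakey are Facter facts):

# Each node exports its own SSH host key (the @@ prefix marks the
# resource as exported)...
@@sshkey { $hostname:
    type => rsa,
    key  => $sshrsakey,
}

# ...and every node collects all exported keys into its known_hosts:
Sshkey <<| |>>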

Puppet, of course, can export reports.

What Lies Ahead? What Are the Challenges in Configuration Management?

Narayan: "Configuration meta-programming" or "multi-node orchestration". For example: "NTP clients should talk to our NTP servers", or "the ssh_known_hosts file should contain entries for all machines", or "the load balancer should direct traffic to all production Web servers".

Mark Burgess: Including network devices in configuration management; manipulating mechanical devices (such as controlling satellite position in Earth orbit); most importantly, knowledge management (tracking state, understanding intentions, aligning with business goals). Mark is working on tying Cfengine with ISO13250 Topic Maps.

Appendix - Quick Comparisons

What Language Is The Tool Written In?
  • Bcfg2: Python.

  • Cfengine: C.

  • Chef: Ruby.

  • Puppet: Ruby.

How Long Has it Been Around?
  • Bcfg2: 2004

  • Cfengine: 1993

  • Chef: 2009 (currently in beta)

  • Puppet: 2005

How Widely Is It Used?
  • Bcfg2: Used by at least 100 sites.

  • Cfengine: Used in over 5000 companies. Mark’s conservative estimate is over 1,000,000 computers running Cfengine today. There are over two thousand sites with tens of servers, and thirteen sites with tens of thousands of servers.

  • Chef: There are 14 companies listed as using Chef at http://www.opscode.com/adoption, including EngineYard and RightScale.

  • Puppet: Over 80 organizations use it. [Corrected on 4 Jan 2010: the "over 80" figure is for PAYING users, and that figure is now over 100, per James Turnbull of Puppet Labs. There are over 5000 sites using the free version of Puppet, running on over 1,000,000 nodes.] Top users are:

    • Google: 45K Macs + internal servers.

    • Zynga: 80K servers.

    • JPMorganChase: 35K servers.

Does It Allow Re-use of Configuration Policies?

Bcfg2: Recipes are in the source code control repo in version 2, but sharing is not easy; group names need to be standardized first.

Cfengine: Promises are shared through the Cfengine company, which vets and standardizes them in the Community Open Promise Body Library: http://www.cfengine.org/manuals/CfengineStdLibrary.html

Chef: Recipes are very actively shared at http://cookbooks.opscode.com/

Puppet: Manifests are shared at http://forge.puppetlabs.com/