Orchestrating Puppet with Serf

By Bill Ward | June 16, 2017

If you have used Puppet in a production environment before, then you have probably used MCollective, the orchestration system that ships with Puppet Enterprise. Most of the time this works out great. In my shop, though, we have had many problems with MCollective. This post gives you a way to replace MCollective with HashiCorp’s Serf and explains why I think it is a much better solution.

MCollective vs Serf

Both MCollective and Serf achieve the same thing: orchestrating systems. They go about it very differently, though. MCollective is used for a lot of tasks, including running Puppet agents, running r10k on your compile masters, checking node availability, and so on. It does this using a hub-and-spoke topology in which nodes connect to your spokes, which in turn connect to your hub. The nodes communicate over ActiveMQ messaging. MCollective can be secured (and is by default when installed with Puppet Enterprise) using SSL validation.

Serf is from HashiCorp and is relatively new. From their website: “Serf is a tool for cluster membership, failure detection, and orchestration that is decentralized, fault-tolerant and highly available.” Serf uses a gossip protocol to broadcast messages to the cluster, either as events or as queries. Events are fire-and-forget and can be used for things like notifying a cluster of web servers to deploy a new version of your application; if a member is down when an event is sent, the event is held and delivered when the member comes back online. Queries send a message to the cluster and expect a response; they are usually time sensitive, so if a member is down the query for that member is lost. With Serf you write custom event handlers that run whenever a node receives a custom event or query. Finally, Serf can be secured using a shared-secret cryptosystem. This barely scratches the surface of what Serf can do; for more information, check out HashiCorp’s Serf site.
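As a quick illustration of those pieces, here is a minimal sketch of the Serf CLI in action. The event name, payload, bind address, and the placeholder encryption key are all just assumptions for illustration:

# Generate a shared secret and start an agent that uses it
serf keygen                                    # prints a base64 key
serf agent -encrypt=<base64-key> -bind=10.0.0.10

# Fire-and-forget event: tell the cluster a new release is available
serf event deploy "v1.2.3"

# Named query: collect responses (a matching handler must be registered on each node)
serf query -timeout 15s uptime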

The main advantage Serf gives us over MCollective is the gossip protocol. With a minimal amount of configuration, you can have a large number of servers in your Serf cluster. If one or more of the servers fails, Serf will detect it and notify all other cluster members of the failure. There is no central point of failure that could stop our nodes from talking to each other.

Workflow Overview

Refer to the diagram below. We start with a Jenkins job that builds and compiles our application. During the Build Release step we run a shell script that fires a custom Serf event called “deploy.” Our Puppet server is configured to respond to this event by calling a script that pulls the updated Puppet module code from git (committed during our Jenkins run) and installs the updated module on our Puppet master. Once that is complete, the script fires a custom query that tells the target Puppet agent(s) to start a Puppet run, and sends an event to our logging system recording that the run has started. We want everything in a central logging system so we can troubleshoot any issues that come up during the deployment. The Jenkins server polls with queries to see whether the Puppet run has completed. Once it has, we send another event to the logging server with the final status. Jenkins then sees that the Puppet run is finished and queries the logging server for that status. (A sketch of the Jenkins step appears after the diagram.)

[Diagram: Serf workflow]
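For context, the Jenkins Build Release shell step boils down to something like the following. The event name “deploy” matches the handler on the Puppet master; the payload format is a hypothetical example, so use whatever your handler expects:

#!/bin/bash
# Jenkins "Build Release" shell step (sketch): broadcast a deploy event
# BUILD_NUMBER is provided by Jenkins; the payload format is an assumption
serf event deploy "myapp ${BUILD_NUMBER}"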

How it works

The first thing we do is install Serf on all of our servers. This is a simple process outlined on the Serf website. Joining servers into a cluster is also super simple: you just run the Serf agent with the -join parameter pointing at the IP address of any other Serf agent already running. As other members join, the gossip protocol notifies all the other servers of the new members automatically. This makes scaling your Serf cluster a snap.
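For example, something along these lines (node names and IP addresses are placeholders):

# On the first node (for example, the Puppet master)
serf agent -node=puppet-master -bind=10.0.0.10

# On every additional node, join via any existing member
serf agent -node=agent-01 -bind=10.0.0.11 -join=10.0.0.10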

The next step is to set up tags for our different agents. Tags are a way of dividing the cluster into functional areas or selecting individual nodes. When we send a custom event or query, it is broadcast throughout the cluster using the gossip protocol, but when we send our Puppet run query we only want the Puppet agent we specify to run Puppet, so that our application gets deployed to that server. For this we assign every Puppet agent a tag key/value of ‘fqdn={hostname}’, which lets us select which systems run Puppet. But what if you want to target a whole segment of servers, like your QA environment? Simply assign another tag key/value of ‘segment=qa’ and send your event or query to it.
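Tags can be set when the agent starts, or adjusted later with the serf tags command; the hostnames and values below are placeholders:

# Start the agent with tags identifying the node and its environment
serf agent -tag fqdn=agent-01.example.com -tag segment=qa -join=10.0.0.10

# Or change tags on a running agent
serf tags -set segment=qa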

On each node that you want to handle these events, you have to write an event handler for the Serf agent and pass it to the agent using the ‘-event-handler’ parameter. I followed the instructions here to use an event handler router.
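If you do not use a router, the -event-handler flag itself accepts an optional filter of the form type:name=script, which is enough to wire up the query handler used below:

# Invoke /etc/serf/handlers/puppetrun only for the "puppetrun" query
serf agent -join=10.0.0.10 \
  -event-handler=query:puppetrun=/etc/serf/handlers/puppetrun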

Firing off a puppet run

We need to write a custom query handler for our Puppet runs. It goes into the /etc/serf/handlers/ directory and is made executable. Here is a simple example that starts a Puppet run:

#!/bin/bash
# Kick off the Puppet run in the background so we can answer the query quickly
/opt/puppetrun.sh >/dev/null 2>&1 &
disown
echo "Starting Puppet Run"

This simply calls another script that does the actual Puppet run (a one-liner: /opt/puppetlabs/bin/puppet agent --test). We have to fork it off into its own process because Serf queries expect an answer back within 15 seconds by default. The Serf agent captures our echo statement and sends it back as the query response.
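A sketch of /opt/puppetrun.sh, assuming the logging events from the workflow are named puppet-started and puppet-finished (those names and the status payload are placeholders, not part of the handler above):

#!/bin/bash
# /opt/puppetrun.sh (sketch): run Puppet and report status via Serf events
serf event puppet-started "$(hostname -f)"
/opt/puppetlabs/bin/puppet agent --test
status=$?
serf event puppet-finished "$(hostname -f) exit=${status}"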

We trigger this by running the command below from our Puppet master:

# serf query -tag fqdn=agent-01 puppetrun
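On the Jenkins side, the continuous “is it done yet” check from the workflow could look something like this. The puppetstatus query name, its handler, and the “finished” response string are hypothetical and not shown in this post:

#!/bin/bash
# Jenkins-side poll (sketch): keep asking until the agent reports completion
until serf query -tag fqdn=agent-01 puppetstatus | grep -q "finished"; do
  sleep 30
done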

Conclusion

We could go into more detail on how to replace other MCollective functionality with Serf, but this article is already getting long. I have all my Terraform code for an example setup at GitHub admintome/puppet-serf. It configures a Puppet master and three agents, with Serf installed on everything and managed via systemd.

I hope you enjoyed this post. If it was helpful, or if it was way off, please comment and let me know.
