Disaster Averted

Working for a client recently, we discovered a ZFS server that was not being monitored appropriately.

Now you might be looking at this and wondering, WTF? I was, and it took me a bit to figure out everything that was going on and why those failover disks didn't kick into action.

Looking at the image above, you can see that the zpool is composed of two RAIDZ2 arrays, both of which are having problems, but that the zpool has a number of spares available. When ZFS decides a disk is degraded and it has spare disks available, it will automatically pull one of the spares, pair it with the degraded disk, and put the two into a mirror together. Keeping that in mind, look again at the image above: raidz2-0 has two degraded disks and one that is faulted. But you can also see that all three of those have been properly put into a mirror with a spare disk, and so the data is safe.

Then, apparently, two disks on the raidz2-1 side became faulted. This is where it gets serious, because there were only three valid spare disks to begin with, and all three had already been allocated to handle the degraded disks in raidz2-0. "But wait," you say, "I see there are three more spare disks available." This is true, but there is a small problem with these disks... they are Advanced Format, or 4K sector, disks. The other disks that make up the pool are 512 byte sector disks, and ZFS will not replace a 512 byte sector disk with a 4K sector disk. Something about geometry and maths.

So there we were, on the brink of losing 30 terabytes of irrecoverable and extremely important data. We spent the next week nail-biting while we SpinRite'd old 3TB drives and shoved them into this thing as soon as they were available. Eventually we did succeed in getting the pool back into a non-critical state. We were lucky to be able to do so, and we have since implemented multiple layers of controls and backup strategies to help prevent, and if necessary recover from, this type of incident.
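
In that spirit, here is a toy sketch of the kind of check that would have caught this early: scan zpool status output for anything degraded, faulted, or unavailable. The sample output below is fabricated for illustration; a real check would capture the output of zpool status itself.

```shell
# Toy pool-health check. The sample output is hard-coded for illustration;
# in a real check you would run: zpool_output=$(zpool status)
zpool_output='  pool: tank
 state: DEGRADED
config:
        raidz2-0    DEGRADED
          disk0     ONLINE
          disk1     DEGRADED
        raidz2-1    DEGRADED
          disk2     FAULTED
          disk3     ONLINE'

# Count any lines reporting a non-healthy device or vdev.
bad=$(printf '%s\n' "$zpool_output" | grep -cE 'DEGRADED|FAULTED|UNAVAIL')
if [ "$bad" -gt 0 ]; then
    echo "CRITICAL: $bad degraded/faulted entries in zpool status"
else
    echo "OK: pool healthy"
fi
```

Wrapped in proper OK/WARNING/CRITICAL exit codes, a script like this is exactly the sort of thing to hang off NRPE.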

TL;DR: monitoring is important, m'kay.


Nagios PushOver Integration

Recently I discovered a great little app called PushOver. This app allows me to receive push notifications on my iOS devices or Mac (it also works on Windows and Android platforms, but really, who wants to use those) from any service that utilizes the PushOver API, such as IFTTT. As it turns out, you can very easily integrate PushOver into Nagios as well, using a script created by Jedda Wignall (@jedda). And that is what we did.

It starts with getting the PushOver app installed on your device. It's a few bucks, but seriously, the coolness of getting this to work is worth it.

Next, you'll need to grab Jedda's script from GitHub and put it somewhere on your Nagios server. 

You'll then need to configure Nagios to use this script for notifications. In commands.cfg, where you have the notify-host-by-email and notify-service-by-email commands, also add PushOver commands. They should look similar to:

# 'notify-host-pushover' command definition
define command{
        command_name    notify-host-pushover
        command_line    /usr/lib64/nagios/plugins/ -u $CONTACTADDRESS1$ -a $CONTACTADDRESS2$ -c 'persistent' -w 'siren' -t "Nagios" -m "$NOTIFICATIONTYPE$ Host $HOSTNAME$ $HOSTSTATE$"
        }

# 'notify-service-pushover' command definition
define command{
        command_name   notify-service-pushover
        command_line   /usr/lib64/nagios/plugins/ -u $CONTACTADDRESS1$ -a $CONTACTADDRESS2$ -c 'persistent' -w 'siren' -t "Nagios" -m "$HOSTNAME$ $SERVICEDESC$ : $SERVICESTATE$ Additional info: $SERVICEOUTPUT$"
        }

Since we still wanted to receive Nagios alert emails in addition to PushOver alerts, it was necessary to define a new PushOver Contact Template:

define contact{
        name                            generic-pushover
        host_notifications_enabled      1
        service_notifications_enabled   1
        host_notification_period        24x7
        service_notification_period     24x7
        host_notification_options       d,u
        service_notification_options    c,u
        host_notification_commands      notify-host-pushover
        service_notification_commands   notify-service-pushover
        can_submit_commands             1
        retain_status_information       1
        retain_nonstatus_information    1
        register                        0
        }

This enabled us to create new contacts to send the PushOver notifications to.

define contact{
        use                    generic-pushover
        contact_name           alek_pushover
        alias                  Alek Pushover
        contactgroups          smartalek_ops
        address1               pushover_user_key_goes_here
        address2               pushover_application_api_key
        }
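
With the contact in place, the last step is to reference it (or its contact group) from your host and service definitions. A hypothetical service definition using the smartalek_ops group from above might look like this (the host name and check command are placeholders):

```
define service{
        use                    generic-service
        host_name              fileserver01
        service_description    ZFS Pool Status
        check_command          check_nrpe!check_zpool
        contact_groups         smartalek_ops
        }
```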

That's it. Just restart Nagios and you should be good to go.

SaltStack First State

In my last couple of posts, I've been going over SaltStack and this one is no different. Today I want to tackle setting up a state file. As I'm a Nagios guy, we'll be setting up a state to install NRPE.

If you don't already have a working SaltStack environment, please see my previous post on getting one up and running.

Before we get going on the state file itself, I need to take a minute to explain file_roots. In the master config file, there is a setting for 'file_roots'. By default, the master sets the base file_roots to be the /srv/salt directory. This is where we will store our state files. But first we need to create this directory and while we're at it a directory for the state file we're about to create.
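
For reference, the corresponding stanza in the master config (/etc/salt/master) looks like the following; /srv/salt is the built-in default for the base environment, so you only need to touch this if you want a different layout:

```yaml
file_roots:
  base:
    - /srv/salt
```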

mkdir -p /srv/salt/nrpe

Now, on to setting up our NRPE state.

# Set the minion id as a jinja variable that we can use later
{% set hostname = salt.grains.get('id') %}

# Make sure the NRPE package is installed.
nrpe_install:
  pkg.installed:
    - name: nrpe

# Drop in the nrpe.cfg file specific to this host, or the default if it's a new host.
nrpe_config_file:
  file.managed:
    - name: /etc/nagios/nrpe.cfg
    - source:
      - salt://hosts/{{ hostname }}/etc/nagios/nrpe.cfg
      - salt://nrpe/nrpe.cfg
    - makedirs: True
    - user: root
    - group: root
    - mode: 644

# Make sure NRPE is running.
nrpe_running:
  service.running:
    - name: nrpe
    - enable: True
    - require:
      - pkg: nrpe_install
    - watch:
      - file: nrpe_config_file

In this state file, the first line that is not a comment looks really fancy and not at all like the rest of the YAML that composes the file. That's because it isn't YAML; it's Jinja. This 'set hostname' Jinja statement sets a variable, which can be used later in the state, to a value retrieved from the Salt Grains system.

The Grains system is a collection of facts about the Minion that are generated when the salt-minion process is started. There is all kinds of cool stuff in there and I recommend executing 'salt-call grains.items' on one of your minions to have a look at everything that is in there by default.

Now in this state file, there are 3 states to be satisfied. They go by the ID declarations of nrpe_install, nrpe_config_file, and nrpe_running. ID declarations are simply the names of different states and can be anything so long as they are not repeated. In some cases, ID declarations can even be assumed, but that is a story for another day.

Each of these three states then has a state declaration and a function declaration. The ID declaration nrpe_install has a state declaration of pkg and a function declaration of installed, which simply tells Salt to use the 'installed' function of the 'pkg' module. Just like in coding, different functions take different inputs to achieve their purpose, and everything indented beneath a function declaration is simply a parameter for that function. To find out what these functions accept, we need to look at the SaltStack documentation for pkg.installed, file.managed, and service.running.

You should get used to looking up and reading the documentation pages; they are essential to using SaltStack. Of course, like any *nix utility, you can also read the documentation from the command line.

# List all salt modules
salt-call sys.list_modules

# List all functions for a particular module
salt-call sys.list_functions pkg

# Description of use for a function of a module
salt-call sys.doc pkg.installed

Getting back on track: in the nrpe_config_file ID declaration, you may have noticed that the source parameter has two arguments, one of which includes the Jinja variable defined at the top. If you read the documentation, you'll know that the source parameter of the file.managed function allows multiple locations to be listed for the file being managed. In this particular instance, I'm using Jinja to handle any snowflake systems I might have, by telling Salt to use the nrpe.cfg found in /srv/salt/hosts/snowflake_system_name/etc/nagios/nrpe.cfg. If that file doesn't exist, Salt will just move on to using /srv/salt/nrpe/nrpe.cfg.

Now I hope you picked up on the fact that salt:// is actually referring to /srv/salt. It's important to understand that state files get rendered on the Minion and so when a salt:// request is made to the master, the master looks at its file_roots parameter and starts from there.

Lastly, I need to talk about the requisites being used in the nrpe_running ID declaration. service.running is a special circumstance that allows a watch requisite to be listed underneath it. This watch requisite means that the service named nrpe will only be restarted when a change is detected in the ID declaration nrpe_config_file.

The require requisite means that ID declaration nrpe_running cannot be run until after ID declaration nrpe_install has been run.

Now that we've covered all of that, let's get to the business of actually running our state. First, copy the code above into a file on the master called /srv/salt/nrpe/init.sls. Then copy the text below into the file /srv/salt/nrpe/nrpe.cfg. This is the nrpe.cfg file that Salt will deploy for us. The one that comes with the NRPE package has a ton of white space and commented lines; this one is shortened up so we can make sure our Salt state worked.

command[check_load]=/usr/lib64/nagios/plugins/check_load -w 15,10,5 -c 30,25,20

With all of that done, on the master execute: salt '*' state.sls nrpe. Now you may be asking, "Wait, I named my state file init.sls. How did calling state.sls nrpe work?" It worked because Salt assumes that in file_roots you will have either a file called nrpe.sls or a folder called nrpe containing a file called init.sls. Either way will work, but it's usually easier to use the folder method, as you can keep other files the state needs in that same folder.
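
Once the state works, you can also wire it into a top file so that a plain state.highstate run applies it. A minimal /srv/salt/top.sls targeting every minion would look like:

```yaml
base:
  '*':
    - nrpe
```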

I hope that this is at least clearer than mud. I remember struggling a bit when I was first trying to learn Salt, but hopefully this post will help you along.

SaltStack Up And Running



Setting up KVM/Libvirt on CentOS7

I always get so annoyed with myself when I can't remember things I've done in the past. Setting up KVM/Libvirt is one of those things. Recently while working for a client, I found myself fumbling around trying to figure out why their VMs couldn't talk to the internet. So rather than let future me bang his head against the desk trying to figure this out again, I thought I'd write a quick post detailing the setup and solution. Hopefully others will find this post helpful as well.

First off, this procedure is written for CentOS7. It may well work on other distros, but I haven't tested it.

Okay, so you have your CentOS7 host installed and updated. Now it's time to install the necessary packages and set the libvirt daemon to run at boot:

yum -y install libvirt libvirt-python qemu qemu-kvm
systemctl enable libvirtd

Now that KVM/libvirt is installed, we can work on the area that usually causes me problems: networking. First, let's disable and stop NetworkManager, as it will cause problems with the networking setup we will be implementing.

systemctl disable NetworkManager
systemctl stop NetworkManager

Next, let's make a new config for a bridge interface by creating the file /etc/sysconfig/network-scripts/ifcfg-br0. It should contain our system's network settings. If you use static addressing, this is where you will want to include the IP address, subnet mask, gateway, and DNS settings. I use DHCP, so it's a bit easier.
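
A minimal DHCP version might look something like this (swap BOOTPROTO=dhcp for IPADDR, PREFIX, GATEWAY, and DNS1 lines if you use static addressing):

```
# /etc/sysconfig/network-scripts/ifcfg-br0
DEVICE=br0
TYPE=Bridge
BOOTPROTO=dhcp
ONBOOT=yes
DELAY=0
```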


Next, let's edit the network config file that the system is currently using. For me, on this particular machine, that file is /etc/sysconfig/network-scripts/ifcfg-enp1s0. Due to the way RHEL7-based systems now name their interfaces, yours may be named differently. In any case, we're simplifying this file, as we've already moved the network settings to our bridge file. The big thing this one does is tell the interface to use the new bridge.
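
Something along these lines should do it (substitute your own interface name):

```
# /etc/sysconfig/network-scripts/ifcfg-enp1s0
DEVICE=enp1s0
TYPE=Ethernet
BOOTPROTO=none
ONBOOT=yes
BRIDGE=br0
```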


Alright, if all of that is done, then our last step is to restart networking.

systemctl restart network

If all went well, you should have a new br0 interface with an IP. 

We're almost done, but there are a couple of things left that need to be addressed. First, you need to allow IP forwarding on your host, else your VMs will not be able to pass traffic. To enable this, add the following to /etc/sysctl.conf.

net.ipv4.ip_forward = 1

That done, tell the OS to re-read the file.

sysctl -p /etc/sysctl.conf

The last step is to make sure IP Tables allows your VM traffic to traverse the system. There are two ways to do this. You can disable IP Tables on bridges altogether, which means you rely on the VMs to provide their own firewalls. To go this route, add/edit another variable in /etc/sysctl.conf.

net.bridge.bridge-nf-call-iptables = 0

If you choose to have IP Tables on the host continue evaluating traffic destined for the VMs, i.e. leaving net.bridge.bridge-nf-call-iptables at its default value of 1, then you will need to make sure there are IP Tables rules on the host that allow traffic meant for your virtual machines to pass.
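
One common catch-all rule for this (adapt it to your existing ruleset) is to accept any traffic that is merely being bridged between two ports of the same bridge:

```
# Accept frames whose ingress and egress are both bridge ports,
# so bridged VM traffic isn't dropped by the FORWARD chain.
iptables -I FORWARD -m physdev --physdev-is-bridged -j ACCEPT
```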

To take a little bit of the headache out of this, I've started on a Salt state to handle the easy stuff.

SaltStack After 6 Months

Late last year, I decided that I needed to start trying to figure out this whole configuration management thing. I was feeling late to the party, and being able not only to understand everything that went into a machine, but also to replicate that machine with code, sounded pretty nice. So rather than spend a whole lot of time trying to figure this stuff out on my own, I signed up for a class on Puppet.

"But wait" you say, "the title of this post is SaltStack After 6 Months ... I thought I was reading about SaltStack".

And you're right. A couple of weeks after taking that course, I decided that the Puppet syntax was not really for me... i.e. I was having a hard time reading and writing it. It's nothing I couldn't have overcome given some time, but around then a friend and fellow systems engineer turned me onto SaltStack.

"Yeah, it's pretty cool because it's Python and you get remote execution on all of your Minions...and because they call them Minions." -Fellow Systems Engineer

I wasn't really sold on Salt's remote execution. Actually, at the time, I failed to see how amazing and useful this feature is. Nor was I sold on calling nodes Minions. But the choice seemed like a no-brainer because I knew how to write some Python; it meant I wouldn't need to spend time spinning my wheels learning a new language. I was wrong. I didn't end up spinning my wheels on Puppet's Ruby-like syntax, but I spent that time learning YAML and Jinja instead. These are the mainstays of using SaltStack for configuration management. Well, Jinja is kind of optional, as you can switch between several templating languages, but YAML is a must. Python makes up the core of SaltStack, but if you're trying to set up config management, get to love YAML and Jinja.

I spent a lot of time just trying to understand how these two things worked together, and it doesn't help that I started with the complicated task of setting up user accounts. It would have been far easier to start with something like setting up NRPE, where all that's required is to install a package, drop in a customized nrpe.cfg, and restart the service. And that's where I recommend you start if you're just getting going with SaltStack.

My next post will be about setting up a minimal SaltStack environment (really just one virtual machine) so that we can play around with some things. In the post after that, I'll actually show you how to create a state to manage NRPE.