Disaster Averted

Working for a client recently, we discovered a running ZFS server that was not being monitored appropriately.
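Roughly, the pool looked like this (an illustrative sketch with invented device names, not the actual output; the 4K-sector notes are my annotations):

```
  pool: tank
 state: DEGRADED
        NAME             STATE
        tank             DEGRADED
          raidz2-0       DEGRADED
            spare-0      DEGRADED
              c0d0       DEGRADED
              spareA     ONLINE
            spare-1      DEGRADED
              c0d1       DEGRADED
              spareB     ONLINE
            spare-2      DEGRADED
              c0d2       FAULTED
              spareC     ONLINE
            c0d3         ONLINE
          raidz2-1       DEGRADED
            c1d0         FAULTED
            c1d1         FAULTED
            c1d2         ONLINE
        spares
          spareA         INUSE
          spareB         INUSE
          spareC         INUSE
          spareD         AVAIL    (4K sector)
          spareE         AVAIL    (4K sector)
          spareF         AVAIL    (4K sector)
```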

Now you might be looking at this and wondering, WTF? I was, and it took me a bit to figure out everything that was going on and why those failover disks didn't kick into action.

Looking at the image above, you can see that the zpool is composed of two RAIDZ2 arrays, both of which are having problems, but that the zpool has a number of spares available. When ZFS decides a disk is degraded and it has spare disks available, it will automatically pull a spare and put it into a mirror with the degraded disk. Keeping that in mind, look again at the image above: raidz2-0 has two degraded disks and one that is faulted. But you can also see that all three of those have been properly put into mirrors with spare disks, and so the data is safe.

Then, apparently, two disks on the raidz2-1 side became faulted. This is where it gets serious, because there were only three valid spare disks to begin with, and all three had already been allocated to handle the degraded disks in raidz2-0. "But wait," you say, "I see there are three more spare disks available." True, but there is a small problem with these disks... they are Advanced Format, or 4K sector, disks. The other disks that make up the pool are 512 byte sector disks, and ZFS will not replace a 512 byte sector disk with a 4K sector disk. Something about geometry and maths.

So there we were, on the brink of losing 30 terabytes of irrecoverable and extremely important data. We spent the next week nail-biting while we SpinRite'd old 3TB drives and shoved them into this thing as soon as they were available. Eventually we did succeed in getting it back into a non-critical state. We were lucky to be able to do so, and we have since implemented multiple layers of controls and backup strategies to help prevent, and if necessary recover from, this type of incident.

TL;DR: monitoring is important, m'kay.


Git Cycle

Git Cycle is a process that I created to help in small-ish environments where Linux servers are often unique little snowflakes and the admins are accustomed to making changes on the fly rather than through config management. By using the Git Cycle, I only had to convince the admins to git add and git commit the config files they were changing on those systems.

I developed this methodology using SaltStack as my config management system, but there's no reason why it can't be used in other systems. However, everything here is based on using SaltStack.

Overview Example:
  1. An Admin makes a change in the nrpe config file, /etc/nagios/nrpe.cfg
  2. The Admin then git adds and commits /etc/nagios/nrpe.cfg
  3. From the Salt Master, the git_cycle.sh script is kicked off via crontab.
  4. The script connects to the server and pulls the committed changes back to the Salt Master.
  5. The next time I execute a state run in the config management system, the pulled-back changes will be used, thus avoiding overwriting the changes that the admin made.
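Steps 3 and 4 above hang off a single cron entry for the git user on the Salt Master. It might look something like this (the schedule, script path, and log location here are assumptions; adjust to taste):

```
# Pull minion changes every 15 minutes, logging the output
*/15 * * * * /home/git/git_cycle.sh >> /home/git/git_cycle.log 2>&1
```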

How it works:
    1. On the Salt Master we need to do some setup work:
        a. Create a git user and group.
        b. Generate an SSH public/private key pair for the git user and put the public key into the SaltStack pillar data for git_cycle.
        c. Copy the git_cycle.sh script into the git user's home directory.
        d. Create an entry in the git user's crontab to execute the git_cycle.sh script. Bonus points for redirecting the output to a log file.
        e. Create a folder to store the files from your hosts in. For me, I put those files in alongside the state files that will be using them. So, /srv/salt/hosts.
        f. For each Minion I want to retrieve files for, just create a subdirectory. 
            ├── Minion1
            └── Minion2
    2. Running the git_cycle/init.sls SaltStack state file on a Minion does several things:
        a. It creates a "git" user and group.
        b. Adds a public key to the git user's authorized_keys file. 
        c. Adds specified users to the git group.
        d. Copies a .gitconfig file to each user's home directory.
        e. It creates a git repository in the root of the Minion's filesystem. Putting it there allows files from anywhere in the filesystem to be added.
    3.  Go through and git add and commit some files on your minion.
    4. Back on the Salt Master as the Git user: 
        a. Execute the bash script git_cycle.sh. Accept when prompted about the remote hosts' keys.
        b. You should now see the Minion files populating in the /srv/salt/hosts/ directory.
    5. Keep in mind that you have these files when crafting your states. In the below example, Salt first tries to use the host-specific nrpe.cfg file that we git pull from the server. Failing that, it falls back to the default file, which I keep stored with the state file.

# Drop in the nrpe.cfg file specific for this host. Or if it's a new host, just use the default.
{% set hostname = salt.grains.get('id') %}

nrpe_config_file:
  file.managed:
    - name: /etc/nagios/nrpe.cfg
    - source:
      - salt://hosts/{{ hostname }}/etc/nagios/nrpe.cfg
      - salt://nrpe/nrpe.cfg
    - makedirs: True
    - user: root
    - group: root
    - mode: 644

You can check out the state and git_cycle.sh script at: https://github.com/alektant/git_cycle


Creating a SpinRite Virtual Machine

Like many IT Pros out there, occasionally I have need to recover data off a hard disk. And while I have a copy of SpinRite which usually makes such recoveries possible, I do not have a spare computer to run it on. And so perusing the interwebs a while back I came up with a procedure that allows running SpinRite in a VM against a physical hard disk.

This post makes a few assumptions about your setup.

  • You are running on a Mac using OS X.
  • You have a copy of VMware Fusion to create and run Virtual Machines.
  • You have an ISO file created using the SpinRite.exe binary on a Windows system. If not, please see "I purchased and downloaded the SpinRite program file.  Now what?" in the Gibson Research Corporation FAQ.
  • You have an external USB drive with a disk in need of recovery.

Just one last thing before we get started: I am not responsible if these instructions do not work for you. They've worked for me time and time again, and I hope that others will find them useful. Understand that we are going to physically map a hard drive into a VM, so if you map the wrong drive, it could mean bad things. All that to say: if anything breaks, you can keep the pieces. Now on to the fun.

Without further ado, let's start by creating a new virtual machine in VMware Fusion.

When you finish creating the VM, it will boot automatically. Go ahead and turn it off.

Now, let's point the VM’s CD-ROM at the SpinRite.iso.

Once you reach this screen, click the drop-down and choose “Select disk or image”. Finder will pop up and allow you to navigate to where you have your SpinRite.iso file.

Next, attach the external disk to your Mac and power it on.

Open the Disk Utility application on your Mac. Select the external disk and click the “Info” button. You are looking for the value of “BSD device Node”. In this example, it is disk3.

Armed with the disk identifier, we can now create a Raw Device Mapping so that the VM can talk directly to the physical hard disk. Unfortunately, this is not something that VMware has made available via the VMware Fusion interface, so we have to use the Terminal app to get it done.

Open the Terminal App on your Mac and enter the following command being sure to replace where I used “disk3” with the BSD device node value that your disk has:

/Applications/VMware\ Fusion.app/Contents/Library/vmware-rawdiskCreator create /dev/disk3 fullDevice ~/Desktop/rdm ide

You should now have a file on your desktop with the name of rdm.vmdk. We need to add this into our VM. 

Use Finder to navigate to where your new VM is located on disk. Once you’ve found it, control-click it to reveal the context menu and select “Show Package Contents”.

Go ahead and drag the rdm.vmdk file from your desktop to this folder. You can also delete any other .vmdk files that exist in that directory as we won't be using them.

Now that the mapping for our external disk is in the virtual machine folder, let's tell the VM to use it. Control-click on the .vmx file and choose “Open With” followed by “TextEdit”. You will be presented with a whole lot of key-value pairs. The only one we are interested in is ide0:0.fileName. Change its value to "rdm.vmdk".
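For reference, after the edit the line should end up looking like this:

```
ide0:0.fileName = "rdm.vmdk"
```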

Be sure that the quotation marks do not change to fancier-looking curly quotation marks. OS X has a bad tendency to do that, and it can really screw things up. In this instance, if that happens and you try to start up the virtual machine, VMware will spit up an error that it cannot find the disk.


Now start up the VM. You will be prompted for your password by OS X.

Follow the prompts by SpinRite and you should be able to select your disk and begin recovery or maintenance operations.

When you finish running the VM, I would recommend hanging onto it. This will greatly simplify checking another disk in the future because all you will need to do is create a new rdm file and copy it into the VM directory.

UDR Transport Utility

Today I'd like to go over a utility that I've been exposed to that has changed the way I move large files between systems. It is called UDR. UDR is a wrapper for UDT, which is short for UDP-based Data Transfer. As you may have guessed, UDT uses the UDP protocol to transfer large files really, really fast. 

The problem is that, up to now, the best way I've seen to transfer large files is to run multiple rsync instances in order to fill the amount of bandwidth available. A single instance of rsync simply has too much overhead to transfer a large file very fast. Of course, it's not rsync's fault. From what I've seen, the problem lies more in the fact that rsync utilizes TCP, which has characteristics that cause it to perform poorly in high-bandwidth situations.

As an example of the performance difference: when transferring a file using rsync from one server to another over a 1Gb link, I was unable to send faster than 160 Mbps. Sending the same file using UDT, it transferred at 900 Mbps. Watching where the load was on the machines during the transfer, it seemed UDT could have sent the file even faster, except the receiving server was using a single rotational disk and could only write at 112 MBps.

The best part is that UDT has a wrapper, as I mentioned, called UDR. UDR uses UDT to transfer files, but then uses rsync to verify they transferred correctly. The setup and syntax are really easy to understand if you're used to using rsync. Keep in mind that rsync uses SSH, so you'll be asked for login information for the destination server unless you're using SSH keys. For my example, the command looked similar to:

udr rsync -av --progress /path/to/file dest_server:/dest/path/

Now, UDR does support some options of its own, such as encryption, timeout value, and ports to use. See the help for more details.

For more information on UDT: http://udt.sourceforge.net/doc.html

For more information on UDR: https://github.com/LabAdvComp/UDR

And for a Salt State to get this up and running: https://github.com/alektant/salt_states/tree/master/udr



Amon

Through some colleagues of mine, I recently discovered a new monitoring and management tool called Amon. Amon is the full-time project of one Martin Rusev and is shaping up to be a fantastic tool. Martin has been kind enough to provide me with a hosted instance of Amon to mess around with and do some testing on for the last several weeks. I've found that it provides a very clean and easy to use interface for viewing the current status of nodes, allows easy creation of new dashboards from within the web interface, and comes with several plugins to pull metrics beyond the load and disk utilization that are already there by default.

"But my network monitoring system already has all of that", you might say. The key differentiator to me is that Amon also utilizes the SaltStack remote execution engine for running commands and even scripts on nodes/minions. And, on the roadmap is a feature that will add the ability to execute state files from within Amon.

I'm not quite familiar enough with it to drop Nagios, but the more I play with Amon, the more I want to deploy it into my environments. Oh, and as a follow-up to last week's post, Amon also supports PushOver as well as a host of other notification methods.

If you're in Ops and using SaltStack, contact Martin to give Amon a try.

Nagios PushOver Integration

Recently I discovered a great little app called PushOver. This app allows me to receive push notifications on my iOS devices or Mac (it also works on Windows and Android, but really, who wants to use those) from any service that utilizes the PushOver API, such as IFTTT. As it turns out, you can very easily integrate PushOver into Nagios as well, utilizing a script called notify_by_pushover.sh created by Jedda Wignall (@jedda). And that is what we did.

It starts with getting the PushOver app installed on your device. It's a few bucks, but seriously, the coolness of getting this to work is worth it.

Next, you'll need to grab Jedda's script from GitHub and put it somewhere on your Nagios server. 

You'll then need to configure Nagios to utilize this script for notifications. So in the commands.cfg, where you have the notify-host-by-email and notify-service-by-email commands, also add PushOver commands. They should look similar to:

# 'notify-host-pushover' command definition
define command{
        command_name    notify-host-pushover
        command_line    /usr/lib64/nagios/plugins/notify_by_pushover.sh -u $CONTACTADDRESS1$ -a $CONTACTADDRESS2$ -c 'persistent' -w 'siren' -t "Nagios" -m "$NOTIFICATIONTYPE$ Host $HOSTNAME$ $HOSTSTATE$"
        }

# 'notify-service-pushover' command definition
define command{
        command_name   notify-service-pushover
        command_line   /usr/lib64/nagios/plugins/notify_by_pushover.sh -u $CONTACTADDRESS1$ -a $CONTACTADDRESS2$ -c 'persistent' -w 'siren' -t "Nagios" -m "$HOSTNAME$ $SERVICEDESC$ : $SERVICESTATE$ Additional info: $SERVICEOUTPUT$"
        }

Since we still wanted to receive Nagios alert emails in addition to PushOver alerts, it was necessary to define a new PushOver Contact Template:

define contact{
        name                            generic-pushover
        host_notifications_enabled      1
        service_notifications_enabled   1
        host_notification_period        24x7
        service_notification_period     24x7
        host_notification_options       d,u
        service_notification_options    c,u
        host_notification_commands      notify-host-pushover
        service_notification_commands   notify-service-pushover
        can_submit_commands             1
        retain_status_information       1
        retain_nonstatus_information    1
        register                        0
        }

This enabled us to create new contacts to send the PushOver notifications to.

define contact{
        use                    generic-pushover
        contact_name           alek_pushover
        alias                  Alek Pushover
        contactgroups          smartalek_ops
        address1               pushover_user_key_goes_here
        address2               pushover_application_api_key
        }

That's it. Just restart Nagios and you should be good to go.

SaltStack First State

In my last couple of posts, I've been going over SaltStack and this one is no different. Today I want to tackle setting up a state file. As I'm a Nagios guy, we'll be setting up a state to install NRPE.

If you don't already have a working SaltStack environment, please see my previous post on getting one up and running.

Before we get going on the state file itself, I need to take a minute to explain file_roots. In the master config file, there is a setting called 'file_roots'. By default, the master sets the base file_roots to the /srv/salt directory. This is where we will store our state files. But first we need to create this directory, and while we're at it, a directory for the state file we're about to create.

mkdir -p /srv/salt/nrpe
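For reference, the default file_roots block in the master config (/etc/salt/master) looks like this:

```
file_roots:
  base:
    - /srv/salt
```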

Now on to setting up our NRPE state.

# Set the minion id as a jinja variable that we can use later
{% set hostname = salt.grains.get('id') %}

# Make sure the NRPE package is installed.
nrpe_install:
  pkg.installed:
    - name: nrpe

# Drop in the nrpe.cfg file specific for this host. Or if it's a new host, just the default.
nrpe_config_file:
  file.managed:
    - name: /etc/nagios/nrpe.cfg
    - source:
      - salt://hosts/{{ hostname }}/etc/nagios/nrpe.cfg
      - salt://nrpe/nrpe.cfg
    - makedirs: True
    - user: root
    - group: root
    - mode: 644

# Make sure NRPE is running.
nrpe_running:
  service.running:
    - name: nrpe
    - enable: True
    - watch:
      - file: nrpe_config_file
    - require:
      - pkg: nrpe_install

In this state file, the first non-comment line looks really fancy and not at all like the rest of the YAML that composes this file. That's because it isn't YAML; it's Jinja. This 'set hostname' Jinja statement sets a variable, which can be used later in the state, to a value retrieved from the Salt Grains system.

The Grains system is a collection of facts about the Minion that are generated when the salt-minion process starts. There's all kinds of cool stuff in there, and I recommend executing 'salt-call grains.items' on one of your minions to have a look at everything that's there by default.

Now in this state file, there are 3 states to be satisfied. They go by the ID declarations of nrpe_install, nrpe_config_file, and nrpe_running. ID declarations are simply the names of different states and can be anything so long as they are not repeated. In some cases, ID declarations can even be assumed, but that is a story for another day.

Each of these three states then has a state declaration and a function declaration. The ID declaration nrpe_install has a state declaration of pkg and a function declaration of installed. This simply tells Salt to use the 'installed' function of the 'pkg' module. Now, just like in coding, different functions have different input requirements to achieve their purpose; everything indented beneath a function declaration is simply a parameter for that function. To find out the requirements of these functions, we need to look at the SaltStack documentation:




You should get used to looking up and reading the documentation pages; they are essential to utilizing SaltStack. Though of course, like any *nix utility, you can read the documentation from the command line.

# List all salt modules
salt-call sys.list_modules

# List all functions for a particular module
salt-call sys.list_functions pkg

# Description of use for a function of a module
salt-call sys.doc pkg.installed

Getting back on track: in the nrpe_config_file ID declaration, you may have noticed that the source parameter has two arguments, one of which includes the Jinja variable that was defined at the top. If you read the documentation, you know that the source parameter of the file.managed function allows multiple locations to be listed for the file we are to manage. In this particular instance, I'm using Jinja to handle any snowflake systems I might have, by telling Salt to use the nrpe.cfg found in /srv/salt/hosts/snowflake_system_name/etc/nagios/nrpe.cfg. If that file doesn't exist, Salt will just move on to using /srv/salt/nrpe/nrpe.cfg. 

Now, I hope you picked up on the fact that salt:// actually refers to /srv/salt. It's important to understand that state files get rendered on the Minion, so when a salt:// request is made to the master, the master looks at its file_roots parameter and starts from there.

Lastly, I need to talk about the requisites being used in the nrpe_running ID declaration. Now, service.running is a special circumstance that allows a watch requisite to be listed underneath it. This watch requisite means that the service named nrpe will only be restarted when a change is detected in the ID declaration nrpe_config_file.

The require requisite means that ID declaration nrpe_running cannot be run until after ID declaration nrpe_install has been run.

Now that we've covered all of that, let's get down to the business of actually running our state. First, copy the code above into a file on the master called /srv/salt/nrpe/init.sls. Then copy the text below into the file /srv/salt/nrpe/nrpe.cfg. This is going to be the nrpe.cfg file that Salt deploys for us. The one that comes with the NRPE package has a ton of white space and commented lines; this one is shortened up so we can make sure our Salt state worked.

command[check_load]=/usr/lib64/nagios/plugins/check_load -w 15,10,5 -c 30,25,20

With all of that done, on the master execute: salt '*' state.sls nrpe. Now you may be asking, "Wait, I named my state file init.sls. How did calling state.sls nrpe work?" It worked because Salt assumes that in file_roots you will have either a file called nrpe.sls or a folder called nrpe with a file inside called init.sls. Either way works, but it's usually easier to use the folder method, as you can put other things the state needs in that same folder.
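So the layout under file_roots ends up looking like this (snowflake_system_name is a placeholder for a real minion id):

```
/srv/salt/
├── hosts/
│   └── snowflake_system_name/
│       └── etc/nagios/nrpe.cfg   # host-specific override, pulled back from the minion
└── nrpe/
    ├── init.sls                  # the state file
    └── nrpe.cfg                  # the default config Salt deploys
```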

I hope that this is at least clearer than mud. I remember struggling a bit when I was first trying to learn Salt, but hopefully this post will help you along.

SaltStack Up And Running



Setting up KVM/Libvirt on CentOS7

I always get so annoyed with myself when I can't remember things I've done in the past. Setting up KVM/Libvirt is one of those things. Recently while working for a client, I found myself fumbling around trying to figure out why their VMs couldn't talk to the internet. So rather than let future me bang his head against the desk trying to figure this out again, I thought I'd write a quick post detailing the setup and solution. Hopefully others will find this post helpful as well.

First off, this procedure is written for CentOS7. It may well work on other distros, but I haven't tested it.

Okay, so you have your CentOS7 host installed and updated. Now it's time to install the necessary packages and set the libvirt daemon to run on reboot: 

yum -y install libvirt libvirt-python qemu qemu-kvm
systemctl enable libvirtd

Now that KVM/libvirt is installed, we can work on the area that usually causes me problems: networking. First, let's disable and stop NetworkManager, else it will cause problems with the networking setup we will be implementing.

systemctl disable NetworkManager
systemctl stop NetworkManager

Next, let's make a new config for a bridge interface by creating the file /etc/sysconfig/network-scripts/ifcfg-br0. It should contain our system's network settings. If you use static addressing, this is where you will want to include the IP address, subnet mask, gateway, and DNS settings. I use DHCP, so it's a bit easier.
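A minimal DHCP version of that file might look like this (a sketch; if you use static addressing, swap BOOTPROTO and add your IP settings):

```
# /etc/sysconfig/network-scripts/ifcfg-br0
DEVICE=br0
TYPE=Bridge
BOOTPROTO=dhcp
ONBOOT=yes
DELAY=0
```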


Next, let's edit the network config file that the system is currently using. For me, on this particular machine, that file is /etc/sysconfig/network-scripts/ifcfg-enp1s0. Due to the way RHEL7-based systems now name their interfaces, yours may be named differently. In any case, we're simplifying this file, as we've already moved the network settings to our bridge file. The big thing this one will do is tell the interface to use the new bridge. 
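For example, the simplified interface file might end up like this (a sketch; substitute your own interface name):

```
# /etc/sysconfig/network-scripts/ifcfg-enp1s0
DEVICE=enp1s0
TYPE=Ethernet
BOOTPROTO=none
ONBOOT=yes
BRIDGE=br0
```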


Alright, if all of that is done, then our last step is to restart networking.

systemctl restart network

If all went well, you should have a new br0 interface with an IP. 

We're almost done, but there are a couple of things left that need to be addressed. First, you need to allow IP forwarding on your host, else your VMs will not be able to pass traffic. To enable this, add the following to /etc/sysctl.conf.

net.ipv4.ip_forward = 1

That done, tell the OS to re-read the file.

sysctl -p /etc/sysctl.conf

The last step is to make sure IP Tables allows your VM traffic to traverse the system. Now, there are two ways to do this. You can disable IP tables on bridges altogether, which means that you rely on the VMs to provide their own firewalls. To go this route, add/edit another variable in /etc/sysctl.conf.

net.bridge.bridge-nf-call-iptables = 0

If you choose to have IP Tables on the host continue evaluating traffic destined for the VMs (i.e. leaving net.bridge.bridge-nf-call-iptables at its default value of 1), then you will need IP Tables rules on the host that allow traffic meant for your virtual machines to pass.

To take a little bit of the headache out of this, I've started on a Salt state to handle the easy stuff. https://github.com/alektant/salt_states/tree/master/kvm

SaltStack After 6 Months

Late last year, I decided that I needed to start trying to figure out this whole configuration management thing. I was feeling late to the party, and being able to not only understand everything that went into a machine, but also to replicate that machine using code, sounded pretty nice. So rather than spend a whole lot of time trying to figure this stuff out on my own, I signed up for a class on Puppet. 

"But wait" you say, "the title of this post is SaltStack After 6 Months ... I thought I was reading about SaltStack".

And you're right. A couple of weeks after taking that course, I decided that the Puppet syntax was not really for me... i.e. I was having a hard time reading and writing it. It's nothing I couldn't have overcome given some time, but around then a friend and fellow systems engineer turned me onto SaltStack.

"Yeah, it's pretty cool because it's Python and you get remote execution on all of your Minions...and because they call them Minions." -Fellow Systems Engineer

I wasn't really sold on Salt's remote execution. Actually, at the time, I failed to see how amazing and useful this feature is. Nor was I sold on calling nodes Minions. Still, the choice was a no-brainer for me because I knew how to write some Python; it meant I wouldn't need to spend time spinning my wheels learning a new language. I was wrong. While I didn't end up spinning my wheels on Puppet's Ruby-like syntax, I instead spent the time learning YAML and Jinja. These are the mainstays of using SaltStack for configuration management. Well, Jinja is kind of optional, as you can switch between several templating languages, but YAML is a must. Python makes up the core of SaltStack, but if you're trying to set up config management, get to love YAML and Jinja.

I spent a lot of time just trying to understand how these two things worked together, and it didn't help that I started with the complicated task of setting up user accounts. It would have been far easier to start with something like setting up NRPE, where all that's required is to install a package, install a customized nrpe.cfg, and restart the service. And that is where I recommend you start if you're just starting out with SaltStack. 

My next post will be about setting up a minimal SaltStack environment (really just one virtual machine) so that we can play around with some things. In the post after that, I'll actually show you how to create a state to manage NRPE.

PXE Booting

In this post I'm going to try and give a detailed set of instructions for setting up a PXE server that can boot to Ubuntu Live, WinPE, or a number of other utilities. The instruction set given here allows booting to: Ubuntu Desktop 12.04 x86 & x64, WinPE x86 & x64, Clonezilla, Memtest, MHDD, Parted Magic, and lastly Darik's Boot And Nuke (DBAN).

One last thing before getting started, you pretty much need to sudo every command, but I’ve left it off.

I’m using Ubuntu 12.04 Server Edition as my PXE server. So right off the bat I need to install some packages:

apt-get install tftpd-hpa syslinux nfs-kernel-server samba proftpd

Now edit /etc/default/tftpd-hpa. We need to change our TFTP_DIRECTORY and add the option for a remap file. The relevant lines should look like the following:

vim /etc/default/tftpd-hpa
TFTP_DIRECTORY="/media/pxe"
TFTP_OPTIONS="--secure --map-file /etc/tftpd.remap"

You can also add the argument '-vvv' to 'TFTP_OPTIONS' to get some more verbose output in '/var/log/syslog' to help in troubleshooting.

If you notice, TFTP_OPTIONS includes the argument '--map-file /etc/tftpd.remap'. If you are not planning on trying to PXE boot Windows, then you don't need it. However, since I am planning to boot Windows, I need to create a remap file.

vim /etc/tftpd.remap
#Remap path separators
gr \\ /
#Locate bootmgr.exe
r ^bootmgr.exe /WinPE/bootmgr.exe
#Map /Boot to /WinPE
r ^/Boot /WinPE
#Map /boot to /WinPE
r ^/boot /WinPE

Next, create a directory for tftp to have access to and then copy the necessary files to enable tftpbooting.

mkdir /media/pxe
cp /usr/lib/syslinux/pxelinux.0 /media/pxe/
cp /usr/lib/syslinux/vesamenu.c32 /media/pxe/
mkdir /media/pxe/pxelinux.cfg
touch /media/pxe/pxelinux.cfg/default
touch /media/pxe/pxelinux.cfg/pxe.conf

This is where our menu system starts. When it comes to these menus, it's important to remember that a PXE-booted machine essentially gets chroot'd to our PXE directory, in this case '/media/pxe'. This means that to the PXE-booted machine, '/media/pxe' becomes '/'.

Add some params to '/media/pxe/pxelinux.cfg/pxe.conf'. If you want a logo, replace logo.png with yours. Note that the image needs to be 640x480:

MENU BACKGROUND pxelinux.cfg/logo.png
menu width 80
menu rows 14
menu color border 30;44 #ffffffff #00000000 std

Add entries to '/media/pxe/pxelinux.cfg/default'. This one defaults to booting to the local hard drive after 60 seconds. But if one of the options (Utilities, Ubuntu, or Windows) is chosen, it will display menus that exist in the location indicated by the MENU INCLUDE parameter.

DEFAULT vesamenu.c32
TIMEOUT 600
ONTIMEOUT BootLocal
MENU INCLUDE pxelinux.cfg/pxe.conf
LABEL BootLocal
localboot 0
TEXT HELP
Boot to local hard disk
ENDTEXT
MENU BEGIN Utilities
MENU TITLE Utilities
LABEL Previous
MENU LABEL Previous Menu
TEXT HELP
Return to previous menu
ENDTEXT
MENU INCLUDE utilities/utilities.menu
MENU END
MENU BEGIN Ubuntu
MENU TITLE Ubuntu
LABEL Previous
MENU LABEL Previous Menu
TEXT HELP
Return to previous menu
ENDTEXT
MENU INCLUDE ubuntu1204/ubuntu.menu
MENU END
MENU BEGIN Windows
MENU TITLE Windows
LABEL Previous
MENU LABEL Previous Menu
TEXT HELP
Return to previous menu
ENDTEXT
MENU END

Looking at the MENU INCLUDE params in /media/pxe/pxelinux.cfg/default, I'm going to need some folders. But because I know I want to include several items in the utilities menu, I'm going to go ahead and create directories for those items as well.

mkdir -p /media/pxe/utilities/{clonezilla,dariks-boot-and-nuke,memtest,mhdd,parted-magic}

Download the PXE-specific version of Clonezilla from http://www.clonezilla.org/livepxe.php, mount it to /mnt, and copy the needed files to its utility directory.

cp /mnt/live/{filesystem.squashfs,initrd.img,vmlinuz} /media/pxe/utilities/clonezilla/

Next, download DBAN from http://sourceforge.net/projects/dban/, mount it, and copy its needed file to the utility directory.

cp /mnt/dban.bzi /media/pxe/utilities/dariks-boot-and-nuke/

Continue this process of downloading, mounting, and copying the necessary files from each of these isos.

You can get Memtest from http://www.memtest.org/#downiso and you will only need the memtest86+ file. 

For MHDD, get it from http://hddguru.com/software/2005.10.02-MHDD/. You will need the mhdd.img file. Booting it also requires the memdisk binary from the syslinux package; copy /usr/lib/syslinux/memdisk to /media/pxe/ alongside pxelinux.0.

Get Parted Magic from http://partedmagic.com/ and copy out the files: bzImage and initrd.img.


Now you need to build the /media/pxe/utilities/utilities.menu file. If you've placed all the files as described, this file should look like this:

LABEL Memtest
MENU LABEL Memtest+ v4.20
KERNEL utilities/memtest/memtest86+-4.20
TEXT HELP
Boot Memtest+ v4.20
ENDTEXT
LABEL MHDD
MENU LABEL MHDD v4.6
KERNEL memdisk
APPEND initrd=utilities/mhdd/mhdd.img
TEXT HELP
Boot MHDD v4.6 Hard Drive Utility
ENDTEXT
LABEL Clonezilla
MENU LABEL Clonezilla
KERNEL utilities/clonezilla/vmlinuz
APPEND boot=live netboot=nfs initrd=utilities/clonezilla/initrd.img config noswap nolocales edd=on nomodeset ocs_live_run="ocs-live-general" ocs_live_extra_param="" ocs_live_keymap="" ocs_live_batch="no" ocs_lang="" vga=788 fetch=tftp://<your_server_ip>/utilities/clonezilla/filesystem.squashfs
TEXT HELP
Boot Clonezilla
ENDTEXT
LABEL Parted Magic
MENU LABEL Parted Magic
LINUX utilities/parted-magic/bzImage
INITRD utilities/parted-magic/initrd.img
APPEND edd=off load_ramdisk=1 prompt_ramdisk=0 rw vga=normal loglevel=9 max_loop=256
TEXT HELP
Boot Parted Magic
ENDTEXT
LABEL DBAN
MENU LABEL Darik's Boot and Nuke (2.2.7-Beta)
KERNEL utilities/dariks-boot-and-nuke/dban.bzi
APPEND nuke="dwipe" silent floppy=0,16,cmos
TEXT HELP
Warning!!! This will erase the hard drive!
ENDTEXT

PXE boot Ubuntu Live or Ubuntu Server:

I will actually need two sets of folders. I need some for tftp access and some for nfs access.
Download the latest Ubuntu distro from http://www.ubuntu.com/download
Start by building this folder structure:
mkdir -p /media/pxe/ubuntu1204/desktop/amd64
mkdir -p /media/pxe/ubuntu1204/server/amd64
mkdir -p /media/pxe/nfs/ubuntu1204/desktop/amd64
mkdir -p /media/pxe/nfs/ubuntu1204/server/amd64

Now mount the desktop iso to /mnt. Copy initrd.lz and vmlinuz from /mnt/casper to the tftp directory, then copy the full contents of the iso (including the hidden .disk directory) to the nfs directory:

cp /mnt/casper/{initrd.lz,vmlinuz} /media/pxe/ubuntu1204/desktop/amd64/
cp -a /mnt/* /media/pxe/nfs/ubuntu1204/desktop/amd64/
cp -a /mnt/.disk /media/pxe/nfs/ubuntu1204/desktop/amd64/

After that, unmount the iso.
Repeat for all the other Ubuntu versions, placing the files in the appropriate directories.

With all of that done, you just need to create the /media/pxe/ubuntu1204/ubuntu.menu file. It should look like what follows:

LABEL Ubuntu Desktop 12.04 (64-bit)
MENU LABEL Ubuntu Desktop 12.04 (64-bit)
KERNEL ubuntu1204/desktop/amd64/vmlinuz
APPEND boot=casper netboot=nfs nfsroot=<your_server_ip>:/media/pxe/nfs/ubuntu1204/desktop/amd64 initrd=ubuntu1204/desktop/amd64/initrd.lz
TEXT HELP
Boot the Ubuntu 12.04 Desktop 64-bit DVD
ENDTEXT
LABEL Ubuntu Server 12.04 (64-bit)
MENU LABEL Ubuntu Server 12.04 (64-bit)
KERNEL ubuntu1204/server/amd64/vmlinuz
APPEND boot=casper netboot=nfs nfsroot=<your_server_ip>:/media/pxe/nfs/ubuntu1204/server/amd64 initrd=ubuntu1204/server/amd64/initrd.gz
TEXT HELP
Boot the Ubuntu 12.04 Server 64-bit DVD
ENDTEXT

Windows – now it gets quite a bit more complicated.
Credit for helping to figure out this section is due to two sites I read while researching it.

First you will need a Windows machine with WAIK (the Windows Automated Installation Kit) installed.
Download it from here: http://www.microsoft.com/en-us/download/details.aspx?id=5753
On that Windows machine, open Notepad and paste in the following:

cd "C:\Program Files\Windows AIK\Tools\PETools"
mkdir c:\temp
call copype.cmd amd64 c:\temp\WinPE-x64
del /q etfsboot.com
move iso\boot\boot.sdi boot.sdi
rmdir /s /q iso
imagex /mountrw winpe.wim 1 mount
copy mount\Windows\Boot\PXE\pxeboot.n12 pxeboot.n12
copy mount\Windows\Boot\PXE\bootmgr.exe bootmgr.exe
copy mount\Windows\System32\bcdedit.exe bcdedit.exe
copy "C:\Program Files\Windows AIK\Tools\amd64\imagex.exe" mount\windows\system32\imagex.exe
copy /Y c:\users\<username>\Desktop\PXE_Tools\startnet.cmd mount\windows\system32\startnet.cmd
copy /Y c:\users\<username>\Desktop\PXE_Tools\WinPE.bmp mount\windows\system32\WinPE.bmp
imagex /unmount mount /commit
rmdir /q mount

bcdedit -createstore BCD
set BCDEDIT=bcdedit -store BCD
%BCDEDIT% -create {ramdiskoptions}
%BCDEDIT% -set {ramdiskoptions} ramdisksdidevice boot
%BCDEDIT% -set {ramdiskoptions} ramdisksdipath \Boot\boot.sdi
%BCDEDIT% -create {bootmgr} -d "Windows Boot Manager"
%BCDEDIT% -set {bootmgr} timeout 30

for /f "tokens=3" %%a in ('%BCDEDIT% -create -d "WinPE-x86" -application osloader') do set GUID1=%%a
%BCDEDIT% -set %GUID1% systemroot \Windows
%BCDEDIT% -set %GUID1% detecthal Yes
%BCDEDIT% -set %GUID1% winpe Yes
%BCDEDIT% -set %GUID1% osdevice ramdisk=[boot]\WinPE\WinPE-x86\winpe.wim,{ramdiskoptions}
%BCDEDIT% -set %GUID1% device ramdisk=[boot]\WinPE\WinPE-x86\winpe.wim,{ramdiskoptions}

for /f "tokens=3" %%a in ('%BCDEDIT% -create -d "WinPE-x64" -application osloader') do set GUID2=%%a
%BCDEDIT% -set %GUID2% systemroot \Windows
%BCDEDIT% -set %GUID2% detecthal Yes
%BCDEDIT% -set %GUID2% winpe Yes
%BCDEDIT% -set %GUID2% osdevice ramdisk=[boot]\WinPE\WinPE-x64\winpe.wim,{ramdiskoptions}
%BCDEDIT% -set %GUID2% device ramdisk=[boot]\WinPE\WinPE-x64\winpe.wim,{ramdiskoptions}
%BCDEDIT% -displayorder %GUID1% %GUID2%

del /q bcdedit.exe
cd c:\
move c:\temp c:\WinPE
mkdir c:\WinPE-x86
copy "C:\Program Files\Windows AIK\Tools\PETools\x86\winpe.wim" c:\WinPE-x86
cd c:\WinPE-x86
mkdir c:\WinPE-x86\mount
imagex /mountrw winpe.wim 1 mount
copy "C:\Program Files\Windows AIK\Tools\x86\imagex.exe" mount\windows\system32\imagex.exe
copy /Y c:\users\<username>\Desktop\startnet.cmd mount\windows\system32\startnet.cmd
copy /Y c:\users\<username>\Desktop\WinPE.bmp mount\windows\system32\WinPE.bmp
imagex /unmount mount /commit
rmdir /q mount

cd c:\WinPE
move c:\WinPE-x86 c:\WinPE\WinPE-x86

move WinPE-x64\BCD .
move WinPE-x64\boot.sdi .
move WinPE-x64\bootmgr.exe .
move WinPE-x64\efisys.bin .
move WinPE-x64\efisys_noprompt.bin .
move WinPE-x64\pxeboot.n12 .

echo LABEL 1 > WinPE.menu
echo MENU LABEL WinPE >> WinPE.menu
echo KERNEL WinPE/pxeboot.0 >> WinPE.menu
echo APPEND - >> WinPE.menu
echo TEXT HELP >> WinPE.menu
echo Boot WinPE >> WinPE.menu
echo ENDTEXT >> WinPE.menu
cd c:\

Save it on the desktop as a 'winpe.bat' file.
Change the 4 <username> entries to match your username.
On the desktop, create a file called 'startnet.cmd' and open it up in Notepad.
You can put any commands you want your WinPE session to run in here. I like to have it automatically map back to the SMB share on my PXE server:

net use z: \\<pxe_server>\<share> /user:nobody
dir z:

You should also put an image that you want as the background on the desktop and rename it to 'WinPE.bmp'.
Run the 'winpe.bat' file.
Now copy 'C:\WinPE' to '/media/pxe/WinPE' on your PXE server.

Back on our PXE Server:
Our PXE server won’t boot anything that doesn’t end in .0, so we need to create a symlink,
`cd /media/pxe/WinPE`
`ln -s pxeboot.n12 pxeboot.0`

Restart the TFTP-HPA process: `sudo /etc/init.d/tftp-hpa restart`

Now we need to configure our other listening services:
`vim /etc/exports`
Add the line: '/media/pxe/nfs <allowed_subnet>(ro,async,no_root_squash,no_subtree_check)'
Your <allowed_subnet> should be a network/mask pair; something like '' is what goes here (adjust for your own network).
As I use IP Tables to help me secure my servers, I need to make sure that NFS only uses ports that I know about.
`vim /etc/default/nfs-common`
STATDOPTS="-p 4000 -c 4004"
`vim /etc/default/nfs-kernel-server`
RPCMOUNTDOPTS="-p 4002 -g"
`/etc/init.d/nfs-kernel-server restart`
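Since statd and mountd are now pinned, the iptables side can be tightened to a known port list. Here's a sketch in iptables-restore form; the subnet and the inclusion of 111 (rpcbind) and 2049 (nfsd) are my assumptions, so adjust both to your environment:

```
# Allow NFS traffic from the PXE subnet only (subnet is illustrative)
# 111 = rpcbind, 2049 = nfsd, 4000-4004 = the statd/mountd ports set above
-A INPUT -s -p tcp -m multiport --dports 111,2049,4000:4004 -j ACCEPT
-A INPUT -s -p udp -m multiport --dports 111,2049,4000:4004 -j ACCEPT
```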

`vim /etc/samba/smb.conf`
Add a section (the name in brackets becomes the share name clients will see; 'pxe' is just an example):

[pxe]
comment = Stuff
browseable = yes
create mask = 0775
directory mask = 0775
path = /media/Something
guest ok = yes
guest only = yes
guest account = nobody
writable = yes

Save it.
`/etc/init.d/smbd restart`
Now put anything in there that you want available to your PXE booted WinPE sessions.
I use this method to prototype and deploy special Windows builds.
If you copy the contents of a Windows install iso here, PXE boot to WinPE and run setup.exe, it will install Windows.

The last thing you need to do to get your PXE server working is to point your DHCP server at it.
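For example, if your DHCP server is ISC dhcpd, the pointer is just the next-server and filename options in the subnet declaration (all the addresses below are stand-ins for your own):

```
# /etc/dhcp/dhcpd.conf -- is a stand-in for your PXE server
subnet netmask {
  range;
  next-server;
  filename "pxelinux.0";
}
```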



For the last several weeks I’ve been working on implementing a Nagios setup in my client’s environment. For those who don’t know, Nagios is an amazing piece of open source software that enables monitoring of pretty much any kind of network node or service. Unfortunately, it’s also amazingly complicated. There are several components to Nagios, and in order to use this solution effectively, you need to understand how to set up all of them.

I think the logical place to start with implementing this solution is with the server application. I chose to use Ubuntu 12.04 server as my host OS for this service. This makes the installation very easy: `sudo apt-get install nagios3 nagios-nrpe-plugin`. This should install all of the necessary components for nagios server and the nagios web pages to run.

Before we go any further, we need to make a few configuration changes to get Nagios configured to run correctly.

Edit ‘/etc/nagios3/nagios.cfg’

Change to ‘check_external_commands=1’

Edit ‘/etc/group’

Add ‘www-data’ to the nagios group: ‘nagios:x:127:www-data’

Edit ‘/etc/nagios3/cgi.cfg’ and edit/add entries:



Execute `chmod g+x /var/lib/nagios/rw`

With the server installed and configured, the next step for me was to create host-groups based on common services. This can be done by editing the file ‘/etc/nagios3/conf.d/hostgroups_nagios2.cfg’. For example, I would create a ‘web-servers’ host group with an entry like:

define hostgroup {
        hostgroup_name  web-servers
        alias           Web Servers
        members         web-server1,web-server2
}

The alias field is what will be displayed on the Nagios webpage for this host-group. The members field defines which servers are to be included in this group. The names that are used here are not necessarily the hostnames of those servers. Rather they are the names that we use in other configuration files that define those hosts…more on that in a minute.

With the host-group defined, the next thing I want to do is define a service to check against that host-group. For this I’ll edit the file ‘/etc/nagios3/conf.d/services_nagios2.cfg’. I’m going to create a check here that uses the check_http plugin, which runs from the Nagios server and makes sure that a webpage is being served by the web server. The command that invokes this plugin is also called ‘check_http’ and is defined in ‘/etc/nagios-plugins/config/http.cfg’, so if you need to modify the way it checks, edit http.cfg. For our purposes here, we’ll just stick with the default.
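For reference, the stock check_http command definition in http.cfg looks something like this (quoted from memory and abbreviated, so treat it as a sketch and check your local copy):

```
# /etc/nagios-plugins/config/http.cfg (approximate)
define command {
        command_name    check_http
        command_line    /usr/lib/nagios/plugins/check_http -H '$HOSTADDRESS$' -I '$HOSTADDRESS$'
}
```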

define service {
        hostgroup_name          web-servers
        service_description     HTTP
        check_command           check_http
        use                     generic-service
}


Note that this check does not require an agent to be installed on the machine it is checking since it is directly querying that service.

Another way to make sure the Apache2 web server is running is to use the nrpe plugin to talk to the agent we’ll be installing on our web server in a minute, and have the agent check that the web server process, in this case apache2, is running. How this check needs to be called depends on the operating system of the machine being checked, so you should create a new configuration file just for the client machine.

It will require these lines to be created:

define host {
        use             generic-host
        host_name       web-server1
        alias           Web Server 1
        address         <IP or domain name of web-server1>
}


Now append into that file this service definition if it is Linux:

define service {
        hostgroup_name          web-servers
        service_description     Apache2
        check_command           check_nrpe!check_apache
        use                     generic-service
}


Or this service definition if it is Windows:

define service {
        hostgroup_name          web-servers
        service_description     Apache2
        check_command           check_process!apache2.exe
        use                     generic-service
}


The reason for this difference is that Nagios has made the job of checking Windows a bit easier on us by already having several commands in the ‘/etc/nagios-plugins/config/nt.cfg’ file that interface with the check_nrpe plugin for us.
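Those nt.cfg wrappers follow the same define-command pattern. I won’t copy the file here, but the shape is roughly this (the command_name and the NSClient++ arguments below are illustrative, not quoted from the file):

```
# illustrative sketch -- compare against /etc/nagios-plugins/config/nt.cfg
define command {
        command_name    check_process
        command_line    /usr/lib/nagios/plugins/check_nrpe -H '$HOSTADDRESS$' -c CheckProcState -a ShowAll '$ARG1$=started'
}
```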

Okay, so with the checks configured, it’s time to install agents on Windows and Linux client machines. Again, I use Ubuntu in my environment pretty much exclusively so it’s very simple to install the Nagios agent: `sudo apt-get install nagios-nrpe-server`. From there I just need to jump into the nrpe.cfg file and make a few changes to allow my Nagios server to talk to the agent, and then setup the checks it will be allowed to call for.

On the client, edit the file ‘/etc/nagios/nrpe.cfg’ and change the entry ‘allowed_hosts=’ so it reflects the IP address of your Nagios server. Now scroll to the bottom and add the line:

command[check_apache]=/usr/lib/nagios/plugins/check_procs -a apache

This line tells the agent that, when told to run a check called ‘check_apache’, it should execute ‘/usr/lib/nagios/plugins/check_procs’, a standard Nagios plugin that looks through the process list for various things based on the arguments it is supplied. In our case, ‘-a apache’ tells it to look for processes that have ‘apache’ in them.
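The same mechanism covers other local checks. The stock nrpe.cfg ships with sample command lines along these lines (the thresholds are the shipped examples and the partition is machine-specific, so adjust both):

```
# sample checks in the style of the stock nrpe.cfg
command[check_load]=/usr/lib/nagios/plugins/check_load -w 15,10,5 -c 30,25,20
command[check_disk]=/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /dev/sda1
```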

Now on Windows, I have to go to sourceforge.net, download NSClient++, and install it on my Windows client. Much like on the Linux client, I have to change some lines in the configuration file once it’s done installing: to load the specific modules needed by the checks I want to run, to allow my Nagios server to talk to it, and to add my machine-specific checks. I ended up changing this config file quite a bit and removed all of the comments for the sake of readability.

Make sure the following lines are uncommented:


That should pretty much do it. Now go back to your Nagios server. Since we listed web-server1 and web-server2 in our host group, make sure that you create a host configuration file for web-server2 as well. If you don’t, Nagios will error when you test the config. The other option is just to remove web-server2 from the web-servers host-group. Now run:

 `sudo nagios3 -v /etc/nagios3/nagios.cfg`

This tests your configuration. If it comes back without any errors then proceed to run:

 `sudo /etc/init.d/nagios3 restart`

I hope this guide helps prevent others from struggling as much as I did.

Account Recovery

This week saw attention given to a new angle in which we all need to take precautions to protect ourselves online. If you haven’t seen or heard about it yet, you should definitely check out the article: http://www.wired.com/gadgetlab/2012/08/apple-amazon-mat-honan-hacking/

The gist of which describes a truly epic piecing together of the weaknesses of several different services, the password recovery mechanisms they use, and the data they consider to be good enough for authentication.

Now, I’ve been a user of LastPass for a while now, and while I think this service is fantastic and highly recommend it, I also think I allowed it to give me a little bit of a false sense of security. I’ve thought myself pretty well insulated from online assault by using LastPass and having different, randomly generated passwords for every service I use. It turns out I was wrong. As was demonstrated at Mat Honan’s expense, real consideration needs to be given to the account recovery options, procedures, and personal data that all of your online services have.

As we all should have learned four years ago from the Sarah Palin/Yahoo Mail hack, you can have as secure a password as you want, but if you don’t put much thought into the account recovery questions, it doesn’t mean a thing. For my security questions, I’ve decided to answer them with still more passwords. My reasoning is that with more and more of our lives being recorded online by ourselves, by companies we engage with, and by governmental bodies loading the public record onto the internet, it’s not so hard to imagine someone doing a little bit of digging to find out what my first car was, or what city my mother was born in…

With all of this in mind, I spent a number of hours today going through all of my major online services and reviewing how their password and account recovery options work, as well as what personal and credit card data they were storing. And while it was a huge PIA and sucked away much of the day, I may have just spared myself the kind of pain that Mat went through.

Making The Most Of Resources

Last week a DB server lost one of its OS partition drives. I should explain that this particular server is a custom build and consists of two arrays run off different controllers. The data storage runs off a 13-drive RAID 5 array, and the OS runs off a 2-drive RAID 1. For whatever reason, whoever built this server decided to use old SCSI-160 drives for the OS. So when the drive went bad I didn’t have a replacement readily available. What I eventually decided upon was to replace both SCSI drives with SATA disks. I was fortunate in that the motherboard has a couple of SATA connectors built in. Though I lost the hardware RAID 1, all the real work is going to be done off the RAID 5 anyway.

I setup the software RAID during the install process for Ubuntu 12.04 Server. It’s not exactly hard but not exactly intuitive either and so I ended up looking around the web just to make sure I knew what I was doing.

Now for the big change. This server has thirteen 750 GB drives, which builds out to 7.5 TB. The database that was being run on here took only a fraction of that; somewhere around 15 GB. It seemed a waste to use so many spindles just to run a single mid-level-usage MySQL instance. What if, instead, I hooked this server up as another node in the SAN? Sure, it wouldn’t have the fancy features that machines from NetApp and EqualLogic come with, but that’s not the point. The point is to make the best use of the resources at hand. The hot spare DB server that took over when I brought this one down was handling the load with ease, and it has far fewer spindles. What better use could I put this machine to?

The configuration I ended up with for the RAID 5 so far includes 6 logical volumes. Two of those volumes host iSCSI LUNs attached to virtual machines which now run the backup MySQL instances for two primary DB servers. Neither of those virtual machines, nor the custom box, shows much load, and I’m confident that should one of the primary databases go down, its virtual machine backup could easily pick up the slack. The other four logical volumes I have attached to VMware VDR virtual machine instances over NFS. With all six of these virtual machines running their services, usage is now pushing the custom machine pretty much to capacity. I’m now planning to max out the number of hard drives this machine can house in order to create another RAID 5. That will allow me to move some of the services over to a different set of spindles.

A Look Back At Security

Wow, what a week in terms of security, what with all the information coming out about the Flame malware and the LinkedIn, eHarmony, and Last.fm hacks. These recent events have certainly turned my focus on its heels. So this week I spent the majority of my time going back and making sure my systems are as locked down as they can be. New threats and attack vectors pop up all the time, but some of them can be mitigated by layering security. The idea being that if Apache has a new vulnerability that will expose you, but the system is limited in who it can talk to by iptables, then maybe it’s a headache you don’t have to worry so much about. Now, securing each application as best you can is just good general security practice. And when it comes to a new and highly sophisticated piece of malware like Flame, that good practice may be the only thing standing in the way of data compromise. But let’s be honest: how many admins actually have the time to go back, check, and harden all the applications after they get them up and running? I can tell you from experience, it’s a luxury rarely afforded. But sometimes you just have to let items pile up in the ‘to-do’ list. For me, making sure that the systems I’ve set up are as secure as I can make them is worth it.

So this week, I focused on learning what I could about Apache2 and double checking my IPtables rules. This led me down a path filled with a lot of head shaking as I realized how easy it can be to misconfigure a web server. So, if you’re looking to secure your Apache2 install you need to do some reading on the Apache website. Get to know how it works and what you need from it. Most of the web servers I oversee are fairly simple static pages. For those I was just able to go through, adjust my default config to be locked down a bit more using ‘order deny,allow’ directives, and then disable modules that aren’t needed to serve those pages. Next week, I plan on trying to get those installations running out of chroot jails. Now, I do have a few other web servers that require PHP to run. If you have a web server running PHP, stop what you are doing right now and go look through that config file because it will almost certainly need some work. Here is a good tutorial on how to get started securing both Apache and PHP.

Lastly, with all the hacks that happened this week, I feel the need to say what so many other security experts have been saying for a long time: do not use the same password for every site you create an account on. Each account should use a different password. And while that sounds onerous, it doesn’t have to be. My recommendation for password management is LastPass. I’ve been using them for about 2 years now and haven’t had any issues. I believe in this product so much that I’ve recently turned several people, including my girlfriend, on to using it. The pricing is very reasonable at only $12 per year. And now they even have an enterprise offering, which I’ll be looking into on Monday.

Endian Community Firewall

Today I got to play with a pretty cool open source firewall called Endian Community (http://www.endian.com/en/community/). It’s a UTM distro based on RHEL that you configure via a pretty slick web interface. Endian is loaded down with a ton of features including: an iptables firewall, Snort IDS/IPS, Squid proxy, DansGuardian content filtering, ClamAV virus protection for the proxy, VPN, SpamAssassin spam filtering, ntop traffic graphs, and on and on.

As the needs of my project were pretty simple, I’m not using most of those features. However, the parts that I have configured, namely the iptables firewall and Squid proxy, were pretty easy and straightforward to get going. So if you’re looking for a decent UTM appliance and have some old hardware lying around, you may want to give Endian Community a shot.