
      The Importance of Data Backups

      Backups are important, even at the filesystem level!

      I’ve been a Linux user for around 13 years now and am amazed at how far the overall experience has come. Thirteen years ago you were usually running Slackware 3, Red Hat 5.x or Mandrake. At 14, I was one of the “newbies” stuck on Mandrake because my 56k modem was what is known as a softmodem: a modem that lacks quite a bit of hardware and relies on your computer’s resources to function. Getting these to work in Linux back then was a complete nightmare, and Mandrake was the only distribution that worked out of the box with softmodems.

      Back in those days you didn’t have the package management tools you have today, be it yum, aptitude, portage or any of the others. You had rpmfind.net to find your RPMs, praying you had found the right ones for your specific operating system while playing the dependency tracking game. Slackware largely meant source installs, and the truly Linux proficient prided themselves on how small a Slackware install footprint they could get away with and still have a running desktop.

      Growing frustrated at not understanding the build process and at being constantly referred to as a newbie who uses “N00bdrake”, I forced myself into the depths of Linux, and after a year or so had a working Slackware box with XFree86 running Enlightenment, with sound and support for my modem. I learned an extensive amount about how Linux works: compiling your own kernel, searching mailing lists for bug fixes, applying patches to software and walking through your hardware to build proper .conf files so daemons would function properly. This kind of knowledge seems to be getting lost among Linux users these days, as they are no longer forced to drop down to the lowest level of Linux to make their systems function.

      A good example of this came up recently with a hard drive whose ext3 filesystem was showing no data on it. The df command, which reports disk usage per partition, showed the data as taking up space, but you couldn’t see any of it. A lot of people conferred and figured that the data was lost for good, while I sat there saying NOPE and waiting for someone to give a correct answer. When nobody did, I divulged the reason: a special block known as the superblock had become corrupted and the journal on the filesystem had lost all its information. Running fsck would not fix the issue and would routinely report the filesystem as “ok”, because it was using the bad primary superblock.

      There are actually multiple superblocks on ext2, ext3 and ext4 filesystems. The extra copies exist specifically as backups, so that you can repair the filesystem should the primary one become corrupt. Having toasted Linux countless times playing around with things such as software RAID, and having hard-locked it with poorly configured kernels, I have probably spent more time than I should have reading about how the ext filesystems work. You might have overlooked it in the past, but you will see output similar to this when creating an ext filesystem:

      Superblock backups stored on blocks:

      32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208

      These are very, very important blocks and integral to maintaining data on your filesystem. Lost these numbers? You can still recover them:

      1. First you need to know the block size of your filesystem. mke2fs picks a default based on the size of the filesystem: normally 4096 bytes (4k), with 1024 bytes (1k) used only for small filesystems. So unless you passed a specific block size to mkfs.ext3, yours is most likely 4096. The backup list above, which starts at block 32768, matches a filesystem with 4k blocks.
      2. Now issue the command “mke2fs -n -b block-size /dev/sdc1”. This assumes that sdc1 is the corrupt partition showing no data. The -n flag means the command will not actually create a new filesystem; it only prints what it would do, including those precious backup superblocks. This only gives the right numbers if you use the same settings the filesystem was originally created with.
      3. Make sure the partition is unmounted, then pick any of those backup superblocks and issue “fsck -fy -c -b 163840 /dev/sdc1” to hopefully fix your filesystem. Once it completes, mount the drive; more than likely all your data will be in the lost+found folder. The original folder names may be gone, but at least your data is there, and with a little digging you can figure out which folder is which.
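
The block numbers in that list follow a pattern, so they can also be estimated by hand. Below is a minimal Python sketch, assuming the filesystem was created with the sparse_super feature (backups kept in block group 1 and in groups that are powers of 3, 5 and 7) and the default of 8 × block-size blocks per group. The function names and the group count of 100 are made up for illustration; treat the output of “mke2fs -n” as the authority for a real disk.

```python
def backup_superblock_groups(group_count):
    """Block groups that hold a backup superblock under sparse_super:
    group 1, plus every power of 3, 5 and 7 below group_count."""
    groups = {1} if group_count > 1 else set()
    for base in (3, 5, 7):
        power = base
        while power < group_count:
            groups.add(power)
            power *= base
    return sorted(groups)


def backup_superblock_blocks(block_size, group_count):
    """Block numbers of the backup superblocks.

    Assumes the default layout: 8 * block_size blocks per group, and a
    first data block of 1 for 1k blocks (0 for larger block sizes).
    """
    blocks_per_group = 8 * block_size
    first_data_block = 1 if block_size == 1024 else 0
    return [
        group * blocks_per_group + first_data_block
        for group in backup_superblock_groups(group_count)
    ]


# Hypothetical filesystem: 4k blocks, 100 block groups.
print(backup_superblock_blocks(4096, 100))
```

With 4k blocks and a hypothetical 100 block groups, this reproduces the nine block numbers shown in the mkfs output above. With 1k blocks the first backup lands at block 8193 instead, which is why knowing the block size matters before passing a number to -b.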

      Now take a breather, relax and be happy that your data is not completely gone. For the future, I suggest picking up a distribution like Slackware and trying to set up an entire system from source, without using any package management. See how it goes and prepare to read a lot of documentation; in the end you will be thankful, as you will learn more about Linux this way than by any other method.

      The Value of Virtualization

      You’re probably familiar with many of the arguments for virtualizing your systems. Virtualization can make your systems more secure by reducing the number of applications and users on a single machine. It makes scaling easier, uses your resources more efficiently, reduces costs, is faster to set up, and shields you from hardware failures, giving you better uptime. VMs have another advantage over conventional servers, though, one that is less commonly listed but still pretty important: they’re automatically instrumented.

      Let me explain what I mean. Let’s say you’re having some problems running your website on a traditional server. Your traffic has gone up, and now you are having outages during peak times. You speak to your tech support staff, and the admins agree that there’s a problem, but they’re not exactly sure what the cause is.

      Usually at this point, the admins will start ‘keeping an eye’ on the system in question. This often means being logged in and running top or vmstat. If the problem recurs, hopefully the admins will catch it, and the output from these tools will give them hints as to what went wrong. If the admin is not around when the problem happens, though, they might not get the data they need, and then the process will have to start all over again.

      Another solution is to start monitoring the server using a monitoring program like Cacti or Ganglia. This is a little more reliable than manual monitoring because the software won’t get bored or distracted and miss the fault event. But monitoring software has its own problems. It is often a hassle to set up. It requires punching holes in your firewall, making your server less secure. It takes up resources on an already precarious machine, possibly making downtime more likely. And if the problem affects the network, the remote monitoring machine might not be able to reach the troubled server to get any useful data at the exact time when the data is needed.

      This is where virtualization comes to the rescue. The hypervisor — the software which makes virtualization possible — already has a lot of statistics about the virtual machine. Our Cascade cloud platform automatically gathers such statistics for every VM, storing the data in approximately five-minute increments in our own internal logging database. The data gathering happens in the context of the node, not the VM, meaning that the VM will not see a performance impact from the monitoring. Also, the fact that every VM is already monitored means that if a fault occurs, you won’t have to wait for a second fault to figure out what went wrong. The data to analyze the original fault might already be there.

      Let me give you a concrete example. Just today, we had a problem with a customer’s VM; his PHP site went offline and the VM required a reboot to bring the site back online. The admins had a hunch that the VM was overloaded and couldn’t handle the traffic, but they didn’t know what resource was running low.

      Here are some graphs generated by our internal system, Manage, which allowed our admins to get to the bottom of the problem. First, let’s start with a graph of network bandwidth for the VM:

      [Graph: network bandwidth for the VM]

      This graph illustrates the problem precisely. Around 8:50 PM, the VM stopped serving requests, or at least the amount of data it served dropped precipitously. When admins logged in, they saw this in the kernel logs:

      Oct 31 20:58:03 vm1 kernel: INFO: task php:41632 blocked for more than 120 seconds.

      But why did this happen? Maybe the CPU usage for the VM was too high? Well, we can answer this question using our CPU graphs, which display CPU usage data for both the VM as a whole on its node and for each virtual CPU inside the VM:

      [Graph: CPU usage for the VM on its node and for each virtual CPU]

      Sure enough, the VM is pretty busy. It is making good use of all of its virtual CPUs, and its overall load on its node is often over 100%. However, the VM has 4 VCPUs, which means that if CPU were the limiting resource, the load would be close to 400%. It looks like each VCPU is only about 25% utilized. Also, the fault occurred at 8:50 PM, and we don’t see a CPU spike around that time. In fact, CPU usage for some of the virtual CPUs appears to drop around 8:50; VCPU 0, at least, had nothing much to do during the outage.
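
The arithmetic in the paragraph above can be written out in a few lines. This is only an illustrative Python sketch: the function names and the 90% saturation threshold are assumptions made for the example, not part of Manage or Cascade.

```python
def per_vcpu_utilization(node_load_percent, vcpu_count):
    """Average utilization of each VCPU, where the VM's load on its node
    is expressed so that 100% equals one fully busy VCPU."""
    return node_load_percent / vcpu_count


def cpu_is_saturated(node_load_percent, vcpu_count, threshold=0.9):
    """True if the load is near the ceiling of vcpu_count * 100%.
    The 0.9 threshold is an arbitrary value chosen for illustration."""
    ceiling = 100.0 * vcpu_count
    return node_load_percent >= threshold * ceiling


# A VM with 4 VCPUs showing roughly 100% load on its node.
print(per_vcpu_utilization(100, 4))  # 25.0
print(cpu_is_saturated(100, 4))      # False
```

A load of about 100% on a 4-VCPU machine is a quarter of what the VM could use, which is why CPU can be ruled out as the bottleneck here.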

      So what could be the problem? For an answer, let’s turn to yet another batch of data we are able to get from the hypervisor: disk statistics.

      [Graph: disk statistics for the VM, including swap usage]

      The VM is not a particularly heavy user of disk I/O: mostly steady writes consistent with logging, with a few read spikes which might indicate someone searching through the file system or perhaps a scheduled backup. But here’s something interesting: right around the time the VM experienced its failure, swap file usage skyrocketed. Now we know exactly why the VM failed: it ran out of memory, and swap was too slow to keep up with the heavy traffic the VM was serving.
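
The same reasoning, comparing each metric at the time of the fault against its earlier baseline, can be sketched in Python. Everything here is hypothetical: the metric names, the sample values and the helper functions are invented for illustration and are not taken from the graphs or from our logging database.

```python
from statistics import median


def spike_ratio(samples, fault_index):
    """Value at the fault divided by the median of the samples before it.
    Requires fault_index >= 1 so that a baseline exists."""
    baseline = median(samples[:fault_index])
    if baseline == 0:
        return float("inf") if samples[fault_index] > 0 else 1.0
    return samples[fault_index] / baseline


def most_suspicious(metrics, fault_index):
    """Name of the metric that jumped the most relative to its baseline."""
    return max(metrics, key=lambda name: spike_ratio(metrics[name], fault_index))


# Made-up five-minute samples; the last entry is the sample at the fault.
metrics = {
    "cpu_percent": [110, 120, 105, 115, 90],
    "disk_write_kbps": [300, 310, 290, 305, 300],
    "swap_used_mb": [10, 12, 11, 10, 900],
}

print(most_suspicious(metrics, fault_index=4))  # swap_used_mb
```

A ratio against the median is a crude measure, but it captures what the admins did by eye: CPU and disk writes stayed near their usual levels, while swap usage was far above anything seen earlier.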

      Getting data like this on a regular, traditional server would have required complex monitoring software, a steady stream of network traffic, a whole other monitoring server and skilled labor to set the whole thing up. On a Cascade VM, you get this kind of data for free, automatically. You won’t see these exact graphs in LEAP, our customer portal, as these are generated for our internal interfaces only. But the data behind them is also available to LEAP, which will generate much prettier, more usable visualizations that let you easily drill down and explore what’s happening with your virtual machine.

      The conclusion to this story is that we increased the amount of memory available to the VM from 4 GB to 8 GB. This required just a quick reboot of the VM, with none of the downtime or stress required for pulling a physical server out of the racks and opening it up. This solved the customer’s problems, with better performance and no outages. So here is yet another way virtualization with Cascade and LEAP makes your life easier.