Nov 11 2010

Background: Our oldest focus at RoundTower Technologies is backup.  Because of this, we are very familiar with backup systems, and since my background is in VMware, I specialize in backing up virtualized environments.  As you may know, Changed Block Tracking (CBT) in vSphere allows your backup and replication processes to be much more efficient.  CBT basically sets a marker when a backup or replication occurs and tracks which disk blocks have changed since then.  When the next backup or replication occurs, CBT tells the app exactly which blocks have changed.  This is a huge benefit to backup and replication, as those apps used to have to figure out which blocks changed by comparing snapshots, which can take a long time and use a lot of CPU.
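To make the marker idea concrete, here is a toy sketch in Python.  It is not the vSphere API; the TrackedDisk class and its method names are made up for illustration.  It only shows the bookkeeping: every write records its block number, and setting the marker at backup time clears the list.

```python
class TrackedDisk:
    """Toy disk that remembers which blocks were written since the last marker.

    Hypothetical illustration only; real CBT lives inside vSphere.
    """

    def __init__(self, num_blocks):
        self.blocks = [b""] * num_blocks
        self.changed = set()  # block numbers written since the last marker

    def write(self, index, data):
        self.blocks[index] = data
        self.changed.add(index)

    def changed_blocks(self):
        # What a backup app would ask for instead of comparing snapshots.
        return sorted(self.changed)

    def set_marker(self):
        # Called when a backup completes: start tracking from scratch.
        self.changed.clear()


disk = TrackedDisk(num_blocks=8)
for i in range(8):
    disk.write(i, b"initial")
disk.set_marker()            # first (full) backup finished
disk.write(2, b"new data")
disk.write(5, b"new data")
to_scan = disk.changed_blocks()  # only these blocks need to be read next time
```

The second backup in this sketch only has to look at two blocks out of eight, which is the whole point of CBT.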

You may know a little about Avamar.  It’s a backup solution that uses source-based deduplication to perform backups.  It basically always takes full backups, but it only stores pieces of files that it has not seen before anywhere in the environment.  Everything that it has seen across your organization is tracked.  This includes which client each piece was seen on and when, but only one copy of the file piece is stored on disk.  This makes for extremely efficient and rapid backups.  For VMware environments, Avamar can take file-level backups by running a client in the guest OS, or VM image-level backups by running a proxy VM in the infrastructure.
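The "store each piece only once" idea can be sketched in a few lines of Python.  This is a simplified illustration and not how Avamar is actually implemented: the DedupStore class is hypothetical, and the fixed-size chunks and SHA-256 hashes are assumptions I picked to keep the example short.

```python
import hashlib


class DedupStore:
    """Toy deduplicating store: each unique chunk is kept exactly once."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size  # assumed fixed size, for simplicity
        self.chunks = {}              # digest -> chunk bytes

    def backup(self, data):
        """Return (recipe, new_chunks): the digests needed to rebuild
        the data, and how many chunks had to be stored for the first time."""
        recipe = []
        new_chunks = 0
        for start in range(0, len(data), self.chunk_size):
            piece = data[start:start + self.chunk_size]
            digest = hashlib.sha256(piece).hexdigest()
            if digest not in self.chunks:
                self.chunks[digest] = piece  # never seen before: store it
                new_chunks += 1
            recipe.append(digest)
        return recipe, new_chunks

    def restore(self, recipe):
        return b"".join(self.chunks[digest] for digest in recipe)


store = DedupStore(chunk_size=4)
first_recipe, first_new = store.backup(b"AAAABBBBCCCC")
second_recipe, second_new = store.backup(b"AAAAXXXXCCCC")
```

Both backups are "full" in the sense that each recipe can rebuild the whole input, yet the second one only adds the single chunk that was new.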

When you combine these two technologies, the result is the best of both worlds, specifically for image-level backups with CBT enabled.  Avamar only backs up the pieces of the vmdk files that it has not seen before, and with CBT it only scans the blocks that have changed since the last backup when looking for pieces to deduplicate.  Very efficient and very optimized: we’re talking hundreds of GB in just minutes.

Here’s the issue I ran into: I added a client to Avamar and set up a policy to do image-level backups of the VM.  I kicked one off, and Avamar started by creating a snapshot of the VM and mounting the snap to the Avamar proxy VM.  Avamar then queries CBT on vSphere and gets the list of blocks that changed since the last backup.  The proxy then scans through only the blocks that changed and sends only the file segments within those blocks that it has not seen before to disk for backup.  When finished, Avamar unmounts the snap from the proxy and deletes the snap.  When I ran through this procedure at the customer site, the first backup took about 15 minutes for 100GB on their system.  This is expected: there is no CBT information yet, so the proxy must read through the entire 100GB to determine which file pieces it has and has not seen before, and that takes the majority of the 15 minutes.  On the second backup, however, I expected that CBT would show the proxy only the blocks that changed, and that Avamar would dedupe only those and fill in all of the other blocks from the inventory of blocks it already has (as CBT said those blocks have not changed).  When I ran the second backup, it took 15 minutes again.  It should have taken only a minute or two.  What’s the deal?

The solution:  I did some hard digging on the net for a solution, and then one of our other engineers sent me this article on the EMC support site (thanks Judson!).  Basically it said that VMware has an issue (documented here) with CBT and VMware snapshots.  In a very specific scenario, a customer could revert a VM to a snapshot from vCenter and its CBT information would be inaccurate.  When the backup or replication app then asked CBT for the blocks that changed, CBT could provide incorrect information.  The app would back up or replicate incomplete data without showing an error of any kind.  That’s bad.

Avamar knows about this issue and protects people.  It does this by checking whether CBT is in use and whether there are any VMware snapshots older than the last retained backup of the data in Avamar.  If there is an older snap, Avamar assumes that a customer could revert to it at any time (or already did) and that the CBT data could be invalid, so it ignores the CBT info and reads through the entire VM.  This is why my backups above took 15 minutes each time: I had snapshots on that VM older than the oldest Avamar backup retained.  When I removed the snaps, the next backup still took 15 minutes (I later found that this run was needed to reset the CBT information).  The next backup after that took 47 seconds.  Now we’re in business.
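As I understand it, the check boils down to a simple rule, sketched below in Python.  This is my reading of the behavior described above and not Avamar’s actual code; the can_trust_cbt function and its parameters are hypothetical names for illustration.

```python
from datetime import datetime


def can_trust_cbt(cbt_enabled, snapshot_times, oldest_retained_backup):
    """Return True if an incremental (CBT-based) scan looks safe.

    Hypothetical rule: fall back to reading the whole VM if any VMware
    snapshot predates the oldest backup that is still retained.
    """
    if not cbt_enabled:
        return False
    if oldest_retained_backup is None:
        return False  # no earlier backup, so a full read is needed anyway
    return all(taken >= oldest_retained_backup for taken in snapshot_times)


oldest_backup = datetime(2010, 11, 1)
stale_snaps = [datetime(2010, 9, 15)]  # a snapshot left over from September

use_cbt_before_cleanup = can_trust_cbt(True, stale_snaps, oldest_backup)
use_cbt_after_cleanup = can_trust_cbt(True, [], oldest_backup)
```

With the old snapshot present the rule says to read everything; with the snapshots cleaned out it allows CBT.  Note that this sketch does not model the one extra full pass I saw right after the cleanup.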

If you see these kinds of performance issues on image-level backups in Avamar, try cleaning out the VMware snapshots.  This issue does not affect file-level backups, only image-level.  I hope this helps out the users who are trying to run image-level backups with Avamar.  Now you’ll know what to try when backup performance slows down for no apparent reason.

Thanks and good computing.