Horizon Mirage is part of the Horizon Suite from VMware, and it is generating a lot of buzz. I’m not going to go into the reasons why; you can read the link I’ve provided for that. However, one of the most amazing things about Mirage is that it uses a technology called source-based deduplication to back up all of the desktop endpoints. Let’s talk about that technology, how it works, and when it works best.
Source-based deduplication works by having a server in the datacenter with a lot of capacity attached to it. We’ll refer to this server as the “repository.” Then there are the endpoints, which in the case of Mirage are Windows-based desktops and laptops. The client on each endpoint takes backups of the endpoint (Mirage calls them snapshots) and copies them to the repository. It’s how this process works that is so amazing. You would immediately think that when I take a backup of an endpoint that is 10GB on disk, the system will send 10GB over the network. For the FIRST machine that you back up, it typically does: practically the whole image of the endpoint goes to the server. It’s when you back up the second endpoint that the magic starts to happen. Once the first endpoint has been “ingested,” the repository reuses the data it has already seen when composing the backups of every additional endpoint. I know this can be somewhat confusing; you can look at this article for some comparisons of different deduplication technologies. For our example, let’s go a little deeper into exactly what happens during this process.
We will begin with the first Windows desktop, which is 10GB on disk in total, and back it up. The repository will “ingest” the files from the endpoint. As part of this, a hashing algorithm is run against each file to give it a hash code. Once that is done for every file, the client also breaks each file into “blocks” or “chunks” and runs a hashing algorithm against those chunks. After all this, the backup is stored on disk in the repository.
Now, for the next (and every subsequent) client we want to back up or capture, the client asks the server for its hash table of files. This is a small amount of data sent from the server to the client, because the hash table is a list of the hash codes for all of the files in the repository, not the actual data in the files. The client then takes this data and analyzes each file on the second endpoint’s file system. It develops a list of files that have never been seen before in the repository, and tells the repository which files on this endpoint it has seen before. Typically we see about 90-95% common files between images.
This is where it starts to get even more crazy efficient. The client has figured out which files the server already has in the repository and has given the server a list of those files on Endpoint #2. Now the client looks at the files that the server has not seen before. Let’s suppose there are 100 files on that list. The client separates those files into blocks at the client (this is why it’s called source-based: the majority of the processing and checking for deduplicated data happens at the endpoint, not the server) and runs the same hashing algorithm on the blocks. It then compares those block hashes to the blocks the server has in the repository and develops a list of blocks that the server has not seen before.
Let’s say the client finds 10 blocks that the server has never seen before. It tells the server to record all of the blocks that are on this endpoint as being part of this endpoint’s backup. Note: up to this point in the process, the client has not sent any of the actual backup data to the server. The last step is to take the blocks that are unique to this endpoint, compress them, and send them to the server for storage. That completes the backup: all of the common data has been inventoried and only the unique data has been sent.
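To make the sequence concrete, here is a minimal Python sketch of the two-stage process described above: file-level hash comparison, then block-level comparison, then compression of whatever is left. It is an illustration only, not Mirage’s implementation. The fixed 4 KB block size, SHA-256, zlib, and every name in it (Repository, backup, restore) are my own assumptions, and the toy client reads the repository’s hash tables directly instead of requesting them over the network.

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # assumption: fixed-size blocks; the real chunking scheme is not described here


def digest(data: bytes) -> str:
    """Hash code for a file or a block (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def split_blocks(data: bytes, size: int = BLOCK_SIZE) -> list:
    """Break a file's contents into fixed-size blocks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class Repository:
    """Toy stand-in for the server: unique blocks plus per-endpoint manifests."""

    def __init__(self):
        self.file_hashes = {}  # file hash -> ordered list of block hashes
        self.blocks = {}       # block hash -> compressed block bytes
        self.manifests = {}    # endpoint name -> {path: file hash}


def backup(endpoint: str, files: dict, repo: Repository) -> dict:
    """Run the client-side steps and return the size (in bytes) left after each one."""
    stats = {"total": 0, "after_file_dedupe": 0, "after_block_dedupe": 0, "sent": 0}
    manifest = {}
    for path, data in files.items():
        stats["total"] += len(data)
        file_hash = digest(data)
        manifest[path] = file_hash
        if file_hash in repo.file_hashes:
            continue  # file already known: only a reference is recorded
        stats["after_file_dedupe"] += len(data)
        block_hashes = []
        for block in split_blocks(data):
            block_hash = digest(block)
            block_hashes.append(block_hash)
            if block_hash in repo.blocks:
                continue  # block already known: nothing to send
            stats["after_block_dedupe"] += len(block)
            payload = zlib.compress(block)  # only unique blocks are compressed and sent
            stats["sent"] += len(payload)
            repo.blocks[block_hash] = payload
        repo.file_hashes[file_hash] = block_hashes
    repo.manifests[endpoint] = manifest
    return stats


def restore(endpoint: str, repo: Repository) -> dict:
    """Rebuild every file of an endpoint from the stored blocks."""
    return {
        path: b"".join(zlib.decompress(repo.blocks[h]) for h in repo.file_hashes[file_hash])
        for path, file_hash in repo.manifests[endpoint].items()
    }


# Two endpoints built from the same image; the second has one extra file.
repo = Repository()
image = {"C:/Windows/notepad.exe": b"A" * 10000, "C:/Windows/calc.exe": b"B" * 8192}
first = backup("endpoint-1", image, repo)
second_files = dict(image)
second_files["C:/Users/bob/report.doc"] = b"B" * 4096 + b"unique data"
second = backup("endpoint-2", second_files, repo)
print(first)
print(second)
```

In this toy run the second endpoint’s common files drop out at the file-hash step, and the first 4 KB of the new file drops out at the block-hash step because the repository already holds an identical block. Only the short unique tail is compressed and stored.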
Whew! What does all this look like in reality? Let’s take a look at this log entry from a Proof-of-concept we are running for a customer right now:
This is an initial upload from a client to the Mirage repository. This endpoint is running a Windows 7 base image. It is about 7,634 MB on disk (listed as the total change size). Since this is the first time this endpoint has been backed up, all of the data on the endpoint is counted in the total change size. On all subsequent backups, this figure will be the size of the files that have changed since the last backup. The next statistic is the killer number: Data Transferred is 29MB! Mirage took a full backup of this system’s 7,634 MB and only sent 29MB (the unique data) over the network to the repository!
Here’s how it got there: Mirage inventoried 36,436 files on the endpoint that had changed since the last backup (all the files on the endpoint had “changed” since there was no previous backup of this endpoint). Mirage ran the hash on all of those files and found 2,875 files that it had not seen before in the repository (the Unique Files number). These 2,875 files totaled 221MB (the Size after file dedupe number). Then Mirage pulled those files apart and looked for the blocks of those 2,875 files that it had not seen before. Once it found those unique blocks, it had whittled the 221MB of unique files down to 95MB of unique blocks (the Size after Block Dedupe number). Mirage then took the 95MB of unique blocks (which is the real uniqueness of this endpoint) and compressed it. Every single step in processing up to this point happened at the client. The last step is to send the unique data to the Mirage Server (repository). The data sent is 29MB (the Size after compression number) for a full backup! This whole process took 5 minutes and 11 seconds on the client. The first backup of an endpoint takes longer because the hashing has to happen on all of the changed files (36,436 files for this backup). However, all subsequent backups from this machine will only look at the files that have changed since the last backup, because we already have a copy of the files that have not changed.
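If you want to see how much each step contributes, the arithmetic is easy to check. The short Python sketch below uses only the figures quoted from the log entry; the variable and function names are mine, not Mirage’s.

```python
# Figures quoted from the log entry discussed above (sizes in MB).
stages = [
    ("Total change size", 7634),
    ("Size after file dedupe", 221),
    ("Size after block dedupe", 95),
    ("Size after compression", 29),
]
files_scanned = 36436
unique_files = 2875


def reduction_report(stages):
    """Return (label, size, percent of previous stage, percent of original) rows."""
    rows = []
    original = stages[0][1]
    previous = original
    for label, size in stages:
        rows.append((label, size, 100 * size / previous, 100 * size / original))
        previous = size
    return rows


# Share of files the repository had already seen, and the end-to-end ratio.
common_file_pct = 100 * (1 - unique_files / files_scanned)
overall_ratio = stages[0][1] / stages[-1][1]

for label, size, of_prev, of_orig in reduction_report(stages):
    print(f"{label:<26}{size:>6} MB  {of_prev:6.1f}% of previous  {of_orig:6.2f}% of original")
print(f"Files already in repository: {common_file_pct:.1f}%")
print(f"Overall reduction: about {overall_ratio:.0f}:1")
```

About 92% of the files were already in the repository, which lines up with the 90-95% common files we typically see, and the 29MB that crossed the network is well under 1% of the total change size.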
Where source-based dedupe works and where it does not
Source-based dedupe works best when we have tons of endpoints with very similar OSes, apps, and data (this is why it’s perfect for desktops and laptops). Where source-based dedupe has its challenges is when the files are big and really unique. Audio and video files are like this. Unless the files are copies, no two video files are alike, at all. Not all is lost if your users perform video or audio editing or just work with a lot of these files. There are ways to accommodate that as well. We would typically recommend using folder redirection or persona management to move those files off the endpoints to a network drive, where we would back them up with the typical methods. We can also exclude certain file types from being backed up at all by Mirage.
As shown above, Mirage includes an upload policy that lets you set rules for the file types on the endpoints that you do not want to protect. Media files are among the standard exclusions already included (however, as you can see in the rule exceptions, media files in the c:\windows directory will still be backed up).
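The logic of such a rule is simple: skip files of an excluded type unless they sit under an exception path. Here is a hedged Python sketch of that decision. It does not reflect Mirage’s actual policy format; the extension list, the exception list, and the is_protected function are all hypothetical, chosen only to mirror the “media files excluded, except under c:\windows” behavior described above.

```python
from pathlib import PureWindowsPath

# Hypothetical rule set: these extensions are illustrative, not Mirage's defaults.
EXCLUDED_EXTENSIONS = {".mp3", ".wav", ".avi", ".wmv"}
# Exception from the text: media files under c:\windows are still backed up.
EXCEPTION_ROOTS = [PureWindowsPath(r"C:\Windows")]


def is_under(path: PureWindowsPath, root: PureWindowsPath) -> bool:
    """Case-insensitive check that path lies inside root."""
    path_parts = [part.lower() for part in path.parts]
    root_parts = [part.lower() for part in root.parts]
    return path_parts[:len(root_parts)] == root_parts


def is_protected(path: str) -> bool:
    """Return True if the file would be uploaded under this toy policy."""
    p = PureWindowsPath(path)
    if p.suffix.lower() not in EXCLUDED_EXTENSIONS:
        return True  # not an excluded type, so it is backed up
    return any(is_under(p, root) for root in EXCEPTION_ROOTS)


for candidate in (
    r"C:\Users\bob\Music\song.mp3",
    r"c:\windows\Media\tada.wav",
    r"C:\Users\bob\Documents\report.docx",
):
    print(candidate, "->", "backed up" if is_protected(candidate) else "skipped")
```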
Mirage is definitely the way to go for any mobile endpoints or branch office endpoints where bandwidth limits and connectivity reliability make VDI a less-than-optimal choice for the management and recoverability of these endpoints. I don’t recommend products that don’t work as advertised. Once the light bulb kicks on and customers understand this technology, the real value of it shines through. Make no mistake, Mirage is not a mirage; it’s a reality, and a really good one at that.