It was only natural, since I invented Network Address Translation (NAT), that I somehow find a way to use it in network storage. Doing so resulted in a unique storage appliance - one that is useful, resilient, fast, very flexible, and, with our new price of $1,995, very affordable. Use two and you have a No Single Point of Failure (NSPoF) configuration. It also resulted in the appliance not having any disks.
Here’s how the VSX started out back in 2008. We had had the SRX Media Arrays (SRX) out for a while. The SRX was still simple, fast, and very affordable. It did what it did really well, but I wanted to offer something more.
For many customers, what the SRX offers suits their needs perfectly fine. SRX serves up simple block storage. Anywhere you can use a disk drive, you can use logical units (LUNs) from the SRX. The operating system using the SRX can use the block device just as if it were a local disk.
But these operating systems, for the most part, lack the ability to do what’s called logical volume management.
The SRX LUNs can be created from a single disk or multiple disks arranged in RAID configurations, which are then exported as ATA-over-Ethernet (AoE) targets. Since you can have up to 36 disks in a single SRX, you can have from 1 to 36 LUNs exported from that array. (You can have 65,000 arrays, by the way, so you can be awash in LUNs if you want. It’s block storage so it scales.)
What you can’t have is a LUN smaller than a single disk drive. Nor can you have a small LUN on a striped RAID, which gives you faster performance in some circumstances. You do things in chunks of whole disks.
Another thing you can’t do is dynamically allocate blocks from the LUN as they are needed. The features that depend on this include thin provisioning, taking snapshots for backup, and copying changes to a remote system.
The SRX, to be cheap and go fast, performs simple RAID mapping. The mapping for a RAID converts the LUN’s Logical Block Address (LBA) into a particular disk element of the RAID and an LBA on that disk. This is a simple mathematical relationship. (In the case of a write, you will also have to update any redundant information on the array.)
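To make that "simple mathematical relationship" concrete, here is a minimal sketch of the mapping for a plain striped array with no redundancy. It is an illustration only, not the SRX code: the function name raid0_map, the stripe size, and the layout are all assumptions.

```python
def raid0_map(lba, n_disks, stripe_sectors):
    """Map a LUN LBA to (disk index, LBA on that disk) for a striped array.

    Hypothetical layout: stripe units of `stripe_sectors` sectors are dealt
    out to the disks round-robin. The real SRX layout may differ.
    """
    # Which stripe unit the LBA falls in, and the offset inside that unit.
    stripe_no, offset = divmod(lba, stripe_sectors)
    # Which row of stripe units it is, and which disk holds it.
    row, disk = divmod(stripe_no, n_disks)
    return disk, row * stripe_sectors + offset


# Example: 4 disks, 128-sector stripe units.
print(raid0_map(130, 4, 128))  # (1, 2): second disk, third sector
```

Two integer divisions and no table lookups: that is why this kind of mapping is cheap and fast.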
At first I thought about how to do this on the SRX - how would I add another level of indirection to the system? But then I realized that there was a better way than compromising the elegance and simplicity of the SRX by adding more and more features. What was needed was NAT for ATA-over-Ethernet.
I’ve been good at adding boxes to a network. In 1994 I created the NTI PIX NAT Firewall, inventing stateful packet inspection and network address translation. I added a box to the network that translated martian internal IP addresses to public IP addresses on the fly. The data structure used to do this was also what was needed to do stateful packet inspection. Today all current firewalls use stateful packet inspection.
Why not add a box to the storage network and do the same here?
The VSX is the result. The "V" stands for virtual. The software runs on commodity hardware with just a bunch of NICs and no disk other than the small flash DoM used to boot it. And by small, I mean the least expensive DoM. Currently that’s a 16 GB DoM. Of that we leave 16 GB free. The amount of data we use is very tiny, something on the order of five megabytes. So, what we leave free is really 15,992,500,000 bytes free. It’s just the cheapest DoM we can buy.
So, how does it work?
You install the VSX and create a pool. To this pool you add SRX LUNs. This creates available storage in 4 MB Physical Extents. These LUNs are called Physical Volumes (PV) when placed in a pool. The Physical Volume Table (PVT) tracks the use of each physical extent.
Then you can create Logical Volumes (LV) out of the 4 MB Physical Extents in the pool.
Some of the extents in the pool hold the metadata to keep track of the PVs and LVs. Each SRX LUN in the pool has a table (PVT) that tracks the use of each of its Physical Extents. Likewise, a Logical Volume Table (LVT) keeps track of the extents in an LV.
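The pool and its tables can be pictured roughly as follows. This is a sketch under stated assumptions, not the VSX's on-disk format: the class and method names are hypothetical, "4 MB" is taken as 4 x 1024 x 1024 bytes, and the extents reserved for metadata are ignored.

```python
EXTENT_BYTES = 4 * 1024 * 1024  # one Physical Extent (binary megabytes assumed)


class Pool:
    """Hypothetical in-memory model of a pool of Physical Volumes."""

    def __init__(self):
        # PVT per Physical Volume: one flag per extent, True when in use.
        self.pvt = {}

    def add_pv(self, name, size_bytes):
        # An SRX LUN added to the pool is carved into whole extents.
        self.pvt[name] = [False] * (size_bytes // EXTENT_BYTES)

    def free_extents(self):
        return sum(flags.count(False) for flags in self.pvt.values())

    def allocate(self):
        # Hand out the first free extent found and mark it used.
        for name, flags in self.pvt.items():
            for index, used in enumerate(flags):
                if not used:
                    flags[index] = True
                    return (name, index)
        raise RuntimeError("pool is out of free extents")


pool = Pool()
pool.add_pv("lun0", 10 * EXTENT_BYTES)
# An LVT is then just an ordered list of (PV, extent) references.
lvt = [pool.allocate() for _ in range(3)]
```

In this model the PVT answers "is this extent in use?" and the LVT answers "where does the Nth extent of this volume live?".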
The LVs are then given an AoE LUN number and exported in exactly the same way as RAIDs on the SRX are exported. They just appear as local drives to the hosts.
For thin provisioning, one creates a lot of empty LVT entries, going to the pool as needed to create places to write data. For thick provisioning, one goes ahead and allocates the extents when the LV is created.
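The difference between the two is small enough to sketch in a few lines. The names here (Pool, LogicalVolume, write) are hypothetical, and the pool is reduced to a bare list of free extent numbers.

```python
class Pool:
    """Hypothetical pool: just a list of free extent numbers."""

    def __init__(self, n_extents):
        self.free = list(range(n_extents))

    def allocate(self):
        if not self.free:
            raise RuntimeError("pool is out of free extents")
        return self.free.pop(0)


class LogicalVolume:
    def __init__(self, pool, n_extents, thin):
        self.pool = pool
        # Thin: every LVT entry starts empty (None).
        # Thick: every LVT entry gets a real extent up front.
        self.lvt = [None if thin else pool.allocate() for _ in range(n_extents)]

    def write(self, index):
        # A thin volume goes to the pool only on the first write to an entry.
        if self.lvt[index] is None:
            self.lvt[index] = self.pool.allocate()
        return self.lvt[index]


pool = Pool(8)
thick = LogicalVolume(pool, 3, thin=False)   # consumes 3 extents immediately
thin = LogicalVolume(pool, 100, thin=True)   # consumes nothing yet
thin.write(42)                               # now consumes exactly one
```

Note that the thin volume above advertises 100 extents against a pool that only has 8, which is the whole point of thin provisioning and also why the pool has to be watched.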
To do a snapshot, one just creates a new LVT, and thereby a new volume, whose entries all point to the same Physical Extents. Shared extents are copy-on-write: an extent gets copied when one of the volumes wants to write to it.
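A minimal sketch of that idea, assuming a per-extent reference count decides whether an extent is shared. The reference count and every name below are illustrative assumptions, not a description of how the VSX records sharing.

```python
from collections import Counter


class Pool:
    """Hypothetical pool that also holds extent contents, for illustration."""

    def __init__(self, n_extents):
        self.free = list(range(n_extents))
        self.refs = Counter()   # how many LVT entries point at each extent
        self.data = {}          # extent number -> contents

    def allocate(self, payload=b""):
        extent = self.free.pop(0)
        self.refs[extent] = 1
        self.data[extent] = payload
        return extent


def snapshot(pool, lvt):
    # A snapshot is a new LVT pointing at the same extents; no data moves.
    for extent in lvt:
        pool.refs[extent] += 1
    return list(lvt)


def write(pool, lvt, index, payload):
    extent = lvt[index]
    if pool.refs[extent] > 1:
        # Shared extent: copy it first, then repoint this volume's entry.
        pool.refs[extent] -= 1
        extent = pool.allocate(pool.data[extent])
        lvt[index] = extent
    pool.data[extent] = payload


pool = Pool(8)
lv = [pool.allocate(b"a"), pool.allocate(b"b")]
snap = snapshot(pool, lv)
write(pool, lv, 0, b"A")   # lv gets a private copy; snap still sees b"a"
```

Taking the snapshot costs one table copy regardless of volume size. Extents are only duplicated later, one at a time, as writes arrive.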
Flags in the Physical Volume entries keep track of which extents have been changed since the last time the LV was remotely copied. Only these changes get copied to the remote disaster site.
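In sketch form, this amounts to a changed-since-last-copy flag per extent. The class below is a hypothetical illustration, and the send callback stands in for whatever actually ships an extent to the remote site.

```python
class ChangeTracker:
    """Hypothetical per-extent changed flags for remote copy."""

    def __init__(self, n_extents):
        self.changed = [False] * n_extents

    def note_write(self, extent):
        # Any write marks the extent as changed since the last remote copy.
        self.changed[extent] = True

    def replicate(self, send):
        # Ship only the changed extents, clearing each flag as it goes.
        sent = []
        for extent, dirty in enumerate(self.changed):
            if dirty:
                send(extent)
                self.changed[extent] = False
                sent.append(extent)
        return sent


tracker = ChangeTracker(6)
for extent in (4, 1, 4):
    tracker.note_write(extent)
shipped = []
tracker.replicate(shipped.append)   # ships extents 1 and 4, once each
```

Writing the same extent many times between copies still costs only one transfer, and an idle volume costs none.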
I know this explanation is kind of terse. I didn’t want to make this blog post a paper on all the details of how the VSX works.
But it still leaves the question - what’s the relation between all this and NAT?
When an AoE request arrives at the VSX, the VSX simply looks up the corresponding LUN and LBA on the SRX, translates the AoE address, and puts the request back out on the network. It merely translates the address in-flight!
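The lookup can be sketched as a table-driven address rewrite, much like a NAT table. Assumptions in this sketch: 512-byte sectors, 4 MB taken as 4 x 1024 x 1024 bytes, AoE targets written as (shelf, slot) pairs, and an LVT whose entries are ((shelf, slot), extent number on that PV). The function name and table shape are hypothetical.

```python
SECTOR_BYTES = 512
EXTENT_SECTORS = 4 * 1024 * 1024 // SECTOR_BYTES  # 8192 sectors per extent


def translate(lvt, lba):
    """Rewrite an LV-relative LBA into (SRX target, LBA on that target).

    `lvt` is a hypothetical Logical Volume Table: entry i holds the SRX
    target and the extent number backing the LV's i-th extent.
    """
    index, offset = divmod(lba, EXTENT_SECTORS)
    target, pv_extent = lvt[index]
    return target, pv_extent * EXTENT_SECTORS + offset


# An LV whose first extent lives on shelf 10 slot 0, second on shelf 11 slot 3.
lvt = [((10, 0), 7), ((11, 3), 2)]
print(translate(lvt, 8292))  # ((11, 3), 16484)
```

The host addressed one LUN and one LBA; the request leaves the box addressed to a different LUN and a different LBA, with the payload untouched.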
A very useful thing, Network Address Translation.