Tuesday, 16 December 2008

Redundant Array of Inexpensive Clouds - Pt II

In my previous post I started the discussion on how cloud storage could actually be useful to organisations and not simply for consumer use.


One of the big issues that will arise is the subject of standards. To my knowledge, there is no standard so far which determines how cloud storage should be accessed and how objects should be stored. Looking at the two main infrastructure providers, Amazon and Nirvanix, the following services are offered:

S3 (Simple Storage Service) - storage of data objects up to 5GB in size. These objects are basically files with metadata and can be accessed via HTTP or BitTorrent protocols. The application programming interface (API) uses REST/SOAP (which is standard) but follows Amazon's own standards in terms of functions to store and retrieve data.

Elastic Block Store (EBS) - this feature offers block-level storage to Amazon EC2 (Elastic Compute Cloud) instances to store persistent data outside of the compute instance itself. Data is accessed at the block level; point-in-time snapshots of a volume are persisted to S3.


Storage Delivery Network (SDN) - provides file-based access to store and retrieve data on Nirvanix's Internet Media File System. Access is via HTTP(S) using standard REST/SOAP protocols but follows Nirvanix's proprietary API. Nirvanix also offer access to files with their CloudNAS and FTP Proxy services.

The protocols from both Amazon and Nirvanix follow standard access methods (i.e. REST/SOAP) but the format of the APIs is proprietary in nature. This means the terminology is different, command structures are different, the method of storing and retrieving objects is different and the metadata format for referencing those objects is different.

Lack of standards is a problem. Without a consistent method for storing and retrieving data, it will become necessary to program to each service provider implementation, effectively causing lock-in to that solution or creating significant overhead for development.
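To make the lock-in point concrete, here is a minimal sketch of how an application could program to one neutral interface instead of to each provider. Everything in it is an assumption for illustration: `ObjectStore`, the two adapters and the two "fake" services are hypothetical stand-ins held in memory, and their method names are invented to show differing terminology. They are not the real S3 or SDN APIs.

```python
from abc import ABC, abstractmethod


class ObjectStore(ABC):
    """Hypothetical provider-neutral interface the application codes against."""

    @abstractmethod
    def put(self, name: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, name: str) -> bytes: ...


class FakeBucketService:
    """In-memory stand-in for a provider that thinks in buckets and keys."""

    def __init__(self):
        self._objects = {}

    def store_object(self, bucket, key, body):
        self._objects[(bucket, key)] = body

    def fetch_object(self, bucket, key):
        return self._objects[(bucket, key)]


class FakeFileService:
    """In-memory stand-in for a provider that thinks in file paths."""

    def __init__(self):
        self._files = {}

    def upload_file(self, path, content):
        self._files[path] = content

    def download_file(self, path):
        return self._files[path]


class BucketAdapter(ObjectStore):
    """Maps the neutral calls onto the bucket/key style service."""

    def __init__(self, service, bucket):
        self._service = service
        self._bucket = bucket

    def put(self, name, data):
        self._service.store_object(self._bucket, name, data)

    def get(self, name):
        return self._service.fetch_object(self._bucket, name)


class FileAdapter(ObjectStore):
    """Maps the neutral calls onto the file path style service."""

    def __init__(self, service, root):
        self._service = service
        self._root = root

    def put(self, name, data):
        self._service.upload_file(f"{self._root}/{name}", data)

    def get(self, name):
        return self._service.download_file(f"{self._root}/{name}")


def archive(store: ObjectStore, name: str, data: bytes) -> bytes:
    """Application code: it only ever sees the neutral interface."""
    store.put(name, data)
    return store.get(name)
```

The application function `archive` never changes when a provider is added; only a new adapter is written. The cost is that the neutral interface can expose only what every provider has in common.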

What about availability? Some customers may choose not to use one service provider in isolation, in order to improve the availability of data. Unfortunately this means programming to two (or potentially more) interfaces and investing time to standardise data access to those features available in both products.

What's required is middleware to sit between the service providers and the customer. The middleware would provide a set of standardised services, which would allow data to be stored in either cloud, or both, depending on the requirement. This is where RAIC comes in:

RAIC-0 - data is striped across multiple Cloud Storage infrastructure providers. No redundancy is provided; however, data can be stored selectively based on cost or performance.

RAIC-1 - data is replicated across multiple Cloud Storage infrastructure providers. Redundancy is provided by multiple copies (as many as required by the customer) and data can be retrieved using the cheapest or fastest service provider.
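The two levels above could be sketched roughly as follows. This is only an illustration under stated assumptions: `InMemoryProvider` is a hypothetical stand-in for a cloud provider, the chunk size and the `key.index` naming scheme are invented, and the stripe manifest is kept in local memory, whereas real middleware would have to persist it somewhere.

```python
class InMemoryProvider:
    """Hypothetical stand-in for one cloud storage provider."""

    def __init__(self, name):
        self.name = name
        self.blobs = {}

    def put(self, key, data):
        self.blobs[key] = data

    def get(self, key):
        return self.blobs[key]  # raises KeyError if the object is missing


class Raic0:
    """Striping: fixed-size chunks are dealt round-robin across providers."""

    def __init__(self, providers, chunk_size=4):
        self.providers = providers
        self.chunk_size = chunk_size
        self.manifest = {}  # key -> number of chunks (assumed kept locally)

    def put(self, key, data):
        size = self.chunk_size
        chunks = [data[i:i + size] for i in range(0, len(data), size)]
        for index, chunk in enumerate(chunks):
            provider = self.providers[index % len(self.providers)]
            provider.put(f"{key}.{index}", chunk)
        self.manifest[key] = len(chunks)

    def get(self, key):
        count = self.manifest[key]
        parts = []
        for index in range(count):
            provider = self.providers[index % len(self.providers)]
            parts.append(provider.get(f"{key}.{index}"))
        return b"".join(parts)


class Raic1:
    """Mirroring: every provider holds a full copy of each object."""

    def __init__(self, providers):
        # List order is the read preference, e.g. cheapest or fastest first.
        self.providers = providers

    def put(self, key, data):
        for provider in self.providers:
            provider.put(key, data)

    def get(self, key):
        for provider in self.providers:
            try:
                return provider.get(key)
            except KeyError:
                continue  # this copy is unavailable, try the next provider
        raise KeyError(key)
```

With `Raic0`, losing any one provider loses every object that has a chunk on it, which is the "no redundancy" trade-off. With `Raic1`, a read succeeds as long as one copy survives, at the price of paying for storage once per provider.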

There are already service providers out there offering services that store data on Amazon S3 and Nirvanix SDN, companies like FreeDrive and JungleDisk. However, these companies are providing cloud storage as a service rather than offering a tool which integrates the datacentre directly with S3 and SDN.

I'm proposing middleware which sits on the customer's infrastructure and provides the bridge between the internal systems and the infrastructure providers. How this middleware should work, I haven't formulated yet. Perhaps it sits on a server, perhaps it is integrated into a NAS application, or a fabric device. I guess it depends on the data itself.

At this stage there are only two cloud storage infrastructure providers (CSIPs); however, barriers to entry in the market are low: just get yourself some kit and an API and off you go. I envisage that we'll see lots of companies entering the CSIP space (EMC have already set their stall out by offering Atmos as a product; they just need to now offer it as a service via Decho) and if that's the case, then competition will be fierce. As the number of offerings grows, the ability to differentiate between and access multiple suppliers becomes critical. When costs are forced down and access becomes transparent, then we'll truly have usable cloud storage.


Taylor Allis said...

Chris - wow, RAIC - great, original thought. I think there is a great middleware opportunity here.

Another way to look at cloud storage is as another tier of storage with a very low $/GB - in thinking about it this way, RAIC is a great idea. However, one of the reasons why some users may want to move to the cloud is b/c they don't want to deal with RAID levels in the first place. But if you want reliability and redundancy out of your cloud storage, then this is the way to do it. This could be a feature from the cloud providers - if users see that their data is being replicated to different sites behind the cloud then they get reliability and ease of use.

I have been interested in how SmugMug has leveraged the cloud with S3 - all their data sits there, non-replicated and they use disk in their own data center as hot cache...

Chris M Evans said...

Thanks Taylor!

I agree with the tiering issue; it is possible to make Cloud another tier. Question is, how to seamlessly access it.

craig said...

Chris - great insight. My company will be releasing a product in Q1 that addresses much of the middleware functionality you've proposed. In terms of the interface, we've chosen iSCSI. I'd be happy to share more on a one-off basis, or discuss beta testing CloudArray when it's available.

Chris M Evans said...

Craig, yes I would be very interested. If you post a comment with your contact details, I will get hold of you from there (I won't publish the contact details comment).

Geoff said...


Do you have a website up and running where this concept (Cloud Array) is hashed out more? Would like to learn more about your solution in this context.

craig said...

We haven't yet posted content on CloudArray, but will do so later this month on our corporate site: www.twinstrata.com. I'm also going to be briefing Chris on the solution later this week.