Wednesday, 7 February 2007

Write Acceleration

Reading Sangod's recent post on write acceleration, I couldn't help writing a response, as I've been looking at this whole subject recently.

First of all, I don't disagree with the concept of synchronous replication. Yes, the I/O must be confirmed (key word there) at both the local and remote sites before the host is given acknowledgement that the I/O is complete. Typically, enterprise arrays will cache a user I/O, issue a write to the remote array (which will also be cached), acknowledge the I/O to the host and destage to disk at some later point.

Sangod mentioned two techniques: cranking up buffer credits and acknowledgement spoofing. Buffer credits are the standard way in which the Fibre Channel protocol manages flow control. As each FC device passes data to the next (for example an HBA to a switch), the sender can keep transmitting frames only while it has credits left; each R_RDY signal returned by the receiving device restores one credit. Buffer credits matter because FC frames take a finite amount of time to travel along a fibre optic cable. The longer the distance between devices, the more frames must be "on the line" in order to fully utilise the link. The rule of thumb is 1 buffer credit for each km of distance on a 2Gb/s connection. So if you don't have enough buffer credits, you don't make best use of the bandwidth you have available. Having lots of data in transit does not compromise integrity, as nothing has been confirmed to the originating device.
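
As a rough check on that rule of thumb, here is a minimal Python sketch that estimates how many credits are needed to keep a link full. The constants are my assumptions rather than anything from Sangod's post or a vendor sizing guide: about 5 microseconds per km of propagation delay in fibre, full-size 2148-byte frames, and 8b/10b encoding at the usual 1/2/4Gb/s line rates. The function names are made up for illustration.

```python
import math

# Assumed textbook figures, not vendor sizing guidance.
FIBRE_DELAY_US_PER_KM = 5.0   # one-way propagation delay in glass, approx.
FULL_FRAME_BYTES = 2148       # full-size FC frame: SOF + header + 2112 payload + CRC + EOF
BITS_PER_BYTE_ON_WIRE = 10    # 8b/10b encoding used at 1/2/4 Gb/s
LINE_RATE_GBAUD = {1: 1.0625, 2: 2.125, 4: 4.25}


def frame_time_us(speed_gbps, frame_bytes=FULL_FRAME_BYTES):
    """Time to serialise one frame onto the link, in microseconds."""
    baud = LINE_RATE_GBAUD[speed_gbps] * 1e9
    return frame_bytes * BITS_PER_BYTE_ON_WIRE / baud * 1e6


def credits_needed(distance_km, speed_gbps, frame_bytes=FULL_FRAME_BYTES):
    """Credits required so the sender never stalls waiting for R_RDY.

    A credit is only returned after the frame has crossed the link and
    the R_RDY has come back, so the sender must cover one round trip.
    """
    round_trip_us = 2 * distance_km * FIBRE_DELAY_US_PER_KM
    return max(1, math.ceil(round_trip_us / frame_time_us(speed_gbps, frame_bytes)))


if __name__ == "__main__":
    for speed in (1, 2, 4):
        print(f"{speed}Gb/s over 100km: {credits_needed(100, speed)} credits")
```

Under these assumptions a full frame occupies roughly 2km of fibre at 2Gb/s, which is where the "1 credit per km" figure comes from once the return trip is counted. Smaller frames need proportionally more credits, so real sizing should use the actual average frame size.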

Moving on, there are devices which can perform write acceleration by using acknowledgement spoofing to reduce the SCSI protocol overhead. I've taken the liberty of borrowing a couple of graphics from Cisco to explain how a SCSI transaction works.

When a SCSI initiator (source device) starts a write command, it issues an FCP_CMND_WRT to the target device. This is confirmed by the target with an FCP_XFER_RDY. The initiator then issues data transfers (FCP_DATA) repeatedly until all the data is transferred. The target confirms successful receipt of all of the data with FCP_RSP "Status Good". The initial preamble at the start of the write request can be shortened by having the switch connected to the initiator issue an immediate FCP_XFER_RDY, allowing the initiator to start sending data straight away. The data transfer and the FCP_CMND_WRT then operate in parallel, saving the round trip that this part of the exchange would otherwise cost. No integrity is risked, as nothing is confirmed by the target until all data is received.
I see no issue with this kind of spoofing, as the initiator is not being told that the I/O has completed before it actually has. What I do see as a concern is that the target may not be able to accept the data, so if the write fails the source needs to be able to roll back.
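
To put a number on the saving, the exchange described above can be sketched as a small Python model that counts how many times a message has to cross the long-distance link. This is only an illustration under my own assumptions: 5 microseconds per km one way, a single FCP_XFER_RDY per write, and no allowance for serialisation or array service time. The message lists and the function name are mine, not Cisco's.

```python
# Assumed one-way propagation delay in fibre, in milliseconds per km.
FIBRE_DELAY_MS_PER_KM = 0.005

# Each step is (message, crosses_long_distance_link).
STANDARD_WRITE = [
    ("FCP_CMND_WRT", True),    # initiator -> target
    ("FCP_XFER_RDY", True),    # target -> initiator
    ("FCP_DATA", True),        # initiator -> target
    ("FCP_RSP", True),         # target -> initiator, "Status Good"
]

ACCELERATED_WRITE = [
    ("FCP_XFER_RDY (spoofed by local switch)", False),  # never leaves the local site
    ("FCP_CMND_WRT + FCP_DATA", True),                  # sent back to back
    ("FCP_RSP", True),                                  # still comes from the real target
]


def write_latency_ms(distance_km, exchange):
    """Propagation delay for one write: one-way delay times link crossings."""
    one_way_ms = distance_km * FIBRE_DELAY_MS_PER_KM
    crossings = sum(1 for _, crosses_link in exchange if crosses_link)
    return crossings * one_way_ms


if __name__ == "__main__":
    for km in (10, 50, 100):
        std = write_latency_ms(km, STANDARD_WRITE)
        acc = write_latency_ms(km, ACCELERATED_WRITE)
        print(f"{km}km: standard {std:.2f}ms, accelerated {acc:.2f}ms")
```

In this model the standard write costs two round trips and the accelerated write costs one, so the propagation component halves. Note that the final FCP_RSP is never spoofed in either list, which is the point made above: the initiator only hears "Status Good" from the real target.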
In terms of getting the best performance, multiple replication groups (e.g. multiple RA groups) would make best use of any write acceleration (WA) technology; Cisco publish stats which show that. So WA could be good and safe.

1 comment:

jg said...

I suspect the biggest part of the issue is that I have not yet heard of an SRDF mode which provides an "in-between". Right now, there are only a few:

SRDF/S - Full Synchronous mode - all writes are ACK'd when they are committed to disk.

SRDF/Async - No acknowledgement is required before ACK is sent to the host.

SRDF/A - SRDF/A is the best of both worlds. SRDF/A puts checkpoints into the RDF datastream. Both start and end checkpoints need to be received by the target Symm before that data is committed to disk. Since all SRDF is essentially a serial datastream, this provides a clear recovery point should the stream fail. If the ending checkpoint isn't received, the delta-set is discarded and the last checkpoint is your recovery point. Depending on speed and distance/latency you can theoretically be as little as a couple of minutes out of sync over thousands of kilometers.

SRDF/Async is broken into three subtypes:

Full Async - data is not acknowledged. Changed tracks are not monitored while the session is active. An unplanned session interruption requires a full resync.

Adaptive Copy - Disk mode - data is acknowledged as soon as it's committed to disk. Changes are constantly tracked and an unplanned interruption can be recovered from immediately. This is the most commonly used Async mode.

Adaptive Copy - Write-Pending mode - data is acknowledged as soon as it's committed to remote cache. Changes are constantly tracked and an unplanned interruption can be recovered from immediately, so long as the remote Symmetrix doesn't experience a simultaneous failure.