While researching distributed cloud storage we came up with these interesting points on storage throughput along with some great links for additional reading.
- Input/Output Operations Per Second (IOPS). This is the most commonly quoted measurement. For Amazon's Relational Database Service (RDS) public cloud offerings, there is even "Provisioned IOPS Storage" (http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_PIOPS.html) that specifies the IOPS to request based upon storage gigabytes (GBs). For example, a SQL Server Enterprise instance with 500GB of storage has a recommended IOPS of 5000.
Unfortunately, IOPS is a scam: by itself it has almost zero relevance to overall storage throughput (http://www.brentozar.com/archive/2013/09/iops-are-a-scam/). The problem is that measuring the number of theoretical I/O operations ignores the other key elements of storage throughput, as follows:
- Block Size. The size of each I/O operation has a tremendous effect on throughput. For example, Amazon expects I/O to occur in block sizes of 16KB or less (https://www.datadoghq.com/2013/07/aws-ebs-provisioned-iops-getting-optimal-performance/). However, SQL Server uses 64KB block sizes, so each SQL Server operation consumes (64 / 16) = 4 provisioned I/Os; thus, purchasing 5000 IOPS results in only (5000 / 4) = 1250 available IOPS with SQL Server. For comparison, Microsoft best practices for SharePoint multiply content data size in GB by 0.5 to get the number of expected IOPS (high end; see http://technet.microsoft.com/en-us/library/cc298801.aspx#sectoin1b). Thus, a 500GB SharePoint content database would need around 250 IOPS at 64KB each, which leaves (1250 - 250) = 1000 IOPS still available for the Amazon datastore.
- Throughput. How fast the operations can be carried out is a real concern. A 1GbE pipe has a theoretical ceiling of 125MB/s, with roughly 100MB/s a realistic figure in practice, and the public Internet can be counted on to limit this unless the entire communications chain is known to be at a higher speed. At an average sustained throughput of 3MB/s observed on many SMB WAN links routing through the public Internet, the effective IOPS of a remote SQL Server database would be around 45 (45 x 64KB blocks = 2880KB). In other words, not a scenario for high-volume updates.
- Latency. Numerous factors are at work here. The communication protocol has latency from store-and-forward within the switch, the switching software itself, wireline delay, and queuing if the network starts to saturate. Latency also exists at the application level in terms of optimal block sizes (such as 64KB for SQL Server), at the CPU level based upon memory page access and bus speeds, at the adapter level (especially considering the shared PCIe bus), at the storage processor level for shared storage, at the storage array controller level (such as RAID), and at the individual disk level. The net effect of all this latency is hard to estimate; one blogger found that the correlation between higher IOPS and latency for Amazon EBS storage was not at all what was expected (more IOPS did not necessarily introduce more latency). See https://www.datadoghq.com/2013/08/aws-ebs-latency-and-iops-the-surprising-truth/ for details.
(For a great 2008 primer on Ethernet latency, please see "Latency on a Switched Ethernet Network" at http://www.ruggedcom.com/pdfs/application_notes/latency_on_a_switched_ethernet_network.pdf.)
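The block-size and throughput arithmetic in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, provisioned I/O is assumed to be counted in 16KB units (so a 64KB operation consumes 64 / 16 = 4 of them), and 3MB/s is taken as 3072KB/s, which gives 48 rather than the rounder "around 45".

```python
def effective_iops(provisioned_iops, billed_block_kb, app_block_kb):
    """IOPS left once each application I/O is charged in billed-size units.

    Assumption: an application I/O larger than the billed block size consumes
    proportionally more units; a smaller one still consumes one whole unit.
    """
    units_per_io = max(1.0, app_block_kb / billed_block_kb)
    return provisioned_iops / units_per_io


def bandwidth_limited_iops(link_kb_per_s, app_block_kb):
    """Upper bound on IOPS imposed by a link's sustained throughput."""
    return link_kb_per_s / app_block_kb


if __name__ == "__main__":
    # 5000 provisioned IOPS at 16KB, consumed by 64KB SQL Server I/O.
    print(effective_iops(5000, 16, 64))          # 1250.0
    # 3MB/s sustained WAN throughput carrying 64KB blocks.
    print(bandwidth_limited_iops(3 * 1024, 64))  # 48.0
```

Whichever of the two results is smaller is the one that matters; over a 3MB/s link the provisioned IOPS figure is never reached.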
It is only by looking at all four elements (IOPS, block sizes, throughput, and latency) that a true understanding of storage capabilities can be had. In a virtualized environment, storage has traditionally been the bottleneck because server compute limits are typically so much higher than true effective storage throughput. As the hypervisor hosts support more VMs, the problems of limited disk throughput begin to show, and that is true regardless of whether the storage is locally attached or network attached (both SAN and NAS).
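As a rough way to look at the four elements together, the sketch below computes one ceiling per element and reports the tightest. It is a deliberately simplified model, not a measurement method. The function name and the 40ms and 5ms latency figures are hypothetical. The latency ceiling assumes a fixed number of outstanding requests (queue depth), each taking the full round-trip latency to complete.

```python
def storage_bottleneck(provisioned_iops, billed_block_kb, app_block_kb,
                       link_kb_per_s, latency_ms, queue_depth=1):
    """Return (limit_name, iops) for the tightest of three simple ceilings."""
    limits = {
        # Provisioned IOPS scaled down when application blocks exceed the
        # billed block size (assumed proportional charging).
        "provisioned IOPS / block size":
            provisioned_iops / max(1.0, app_block_kb / billed_block_kb),
        # Sustained link throughput divided by the application block size.
        "throughput": link_kb_per_s / app_block_kb,
        # With queue_depth requests in flight, each taking latency_ms to
        # complete, at most queue_depth * 1000 / latency_ms finish per second.
        "latency": queue_depth * 1000.0 / latency_ms,
    }
    name = min(limits, key=limits.get)
    return name, limits[name]


if __name__ == "__main__":
    # Hypothetical 40ms round trip: latency is the tightest limit.
    print(storage_bottleneck(5000, 16, 64, 3 * 1024, latency_ms=40))
    # Hypothetical 5ms round trip: the 3MB/s link becomes the limit.
    print(storage_bottleneck(5000, 16, 64, 3 * 1024, latency_ms=5))
```

In both hypothetical cases the purchased IOPS number is far from being the limiting factor.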