Hunting for bandwidth on a consumer NVMe drive

The Samsung SSD 970 EVO 500GB claims a sequential read bandwidth of 3400 MB/s. This is the story of trying to achieve that number.

Starting point

We start with a Haswell-based machine whose PCIe slots are Gen2. First, write data across the entire drive with fio:

[global]
# 64KiB random writes; refill_buffers regenerates the buffer contents on
# every submit so the drive sees fresh random data rather than repeated buffers
bs=64k
refill_buffers
rw=randwrite
ioengine=libaio
iodepth=128
# direct=1 bypasses the page cache
direct=1

[write-random]
# no size= is given, so fio writes across the whole device
filename=/dev/nvme0n1

A quick sanity check to ensure we have data on the drive – the Usage column shows 500.11 GB used.

$ sudo nvme list
Node          SN               Model                         Namespace Usage                      
------------- ---------------- ---------------------------   --------- -------------------------  
/dev/nvme0n1  S466NX0M105530F  Samsung SSD 970 EVO 500GB      1         500.11  GB / 500.11  GB   

It sure looks like random data:

$ sudo od -x /dev/nvme0n1 | head
0000000 2100 70d1 1737 77fb 2420 5d9a aa7e 0a63
0000020 4484 b9a3 4682 00ed 6890 b4f2 4bfe 0b78
0000040 4d12 a256 af06 135f c9a2 adc1 a4b5 09e6
0000060 3934 d0e7 fdb5 11cf e726 1d82 1743 16cb
0000100 5ce4 f01d d426 17c1 ab9c 2f91 c276 1a20
0000120 3573 f024 0085 1c4d 06ae a34b a7fe 18ab
0000140 60d5 5112 235a 0a58 cc1a 19b7 dbe2 0430
0000160 f983 5d29 0ea0 08cb bf30 0ee3 333f 079e
0000200 77e6 0244 3ef6 15a4 8efc 8455 86bc 0903
0000220 b1df c90c 2eac 17a6 163b 4032 614f 1cf7

Experiment 1 – read random data serially (PCIe Gen2)

Let’s see what bandwidth we can achieve. We will read the first 100G of the drive using 4 threads, a 1MB IO size, and 32 outstanding IOs (OIO) per thread – each thread reads the same first 100G region, so 400GiB of IO in total. See Experiment 1 detail below for the fio file and full output.

  read: IOPS=1058, BW=1058MiB/s (1110MB/s)(400GiB/387050msec)

That’s a little over 1 GB/s, nowhere near the 3.4 GB/s we’re looking for. Admittedly, this is a PCIe Gen2 bus, but even Gen2 should allow more than 1 GB/s across 4 lanes.
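
As a quick sanity check, fio’s reported bandwidth is just the total bytes read divided by the runtime – a minimal sketch in a few lines of Python, using only the numbers from the summary line above:

# Reproduce fio's reported bandwidth from the summary line above:
# 4 jobs x 100 GiB each = 400 GiB read in 387,050 ms.
total_bytes = 400 * 1024**3        # 400 GiB
runtime_s = 387050 / 1000          # 387.05 s

print(f"{total_bytes / runtime_s / 2**20:.0f} MiB/s")   # ~1058 MiB/s
print(f"{total_bytes / runtime_s / 10**6:.0f} MB/s")    # ~1110 MB/s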

Experiment 2 – read trimmed blocks serially (PCIe Gen2)

We can try a trick: trim the first 100G and see if that helps. fio can send trim commands to the drive (see the trim fio workload at the end of this post). The output of nvme list shows that Usage dropped from 500G to ~393G after the trim.

  • Before trim (note the Usage column)
Node          SN               Model                       Namespace Usage
------------- ---------------- --------------------------- --------- -------------------------
/dev/nvme0n1  S466NX0M105530F  Samsung SSD 970 EVO 500GB   1         500.11  GB / 500.11  GB
  • After trim (note the Usage column)
Node          SN               Model                       Namespace Usage
------------- ---------------- --------------------------- --------- -------------------------
/dev/nvme0n1  S466NX0M105530F  Samsung SSD 970 EVO 500GB   1         392.73  GB / 500.11  GB

When we read blocks that have been trimmed, the device returns zeroes – it is not an error to read a trimmed block.

$ sudo od -x /dev/nvme0n1
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*

Then re-run the same fio job as before against the first 100G: 4 threads, 1MB IO size, 32 OIO per thread. See Experiment 2 detail below for the fio file and full output.

 read: IOPS=1582, BW=1583MiB/s (1660MB/s)(400GiB/258802msec)

The results are better, ~1600 MB/s, but still less than half of what Samsung claims the drive can deliver. Remember that this machine has a PCIe Gen2 bus, and the M.2 form factor of this drive uses 4 lanes. Gen2 lanes signal at 5 GT/s each, and the bus’s 8b/10b encoding means only 8 bits of data arrive for every 10 bits that cross the wire, so each lane carries at most 500 MB/s of data, or about 2 GB/s across 4 lanes. Once PCIe packet headers and other protocol overhead are accounted for, the practical payload throughput of a Gen2 x4 link for this kind of large transfer is lower still – in the neighbourhood of 1600 MB/s – which is about what we get. In other words, this test is now hitting the 4-lane maximum of Gen2 PCIe. At this point we need a faster bus.
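
As a rough back-of-the-envelope check, here is the same arithmetic in a few lines of Python – a sketch using the nominal PCIe Gen2 signalling rate from the spec, not anything measured here:

# Nominal PCIe Gen2 payload bandwidth per lane and across 4 lanes.
# Gen2 signals at 5 GT/s per lane; 8b/10b encoding leaves 8 payload bits
# for every 10 bits on the wire.
gen2_payload_bits = 5e9 * 8 / 10                 # payload bits/s per lane
gen2_mb_per_lane = gen2_payload_bits / 8 / 1e6   # -> MB/s per lane

print(f"Gen2 per lane: {gen2_mb_per_lane:.0f} MB/s")       # 500 MB/s
print(f"Gen2 x4      : {gen2_mb_per_lane * 4:.0f} MB/s")   # 2000 MB/s
# Packet headers and link-level protocol overhead reduce what a device can
# actually move, so sustaining ~1600 MB/s on a Gen2 x4 link is a plausible
# practical ceiling for this workload.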

Experiment 3 – read random data serially (PCIe Gen3)

Next, I physically move the drive to a PCIe Gen3 machine and read the portion of the drive that still contains the random data.

Node            SN                Model                        Namespace Usage                     
--------------- ----------------- ---------------------------- --------- ------
/dev/nvme0n1    S466NX0M105530F   Samsung SSD 970 EVO 500GB    1        392.73  GB / 500.11  GB

Since we trimmed the first 100G of the device, we need to add a 100G offset to the fio file (offset=100g) to skip the trimmed blocks. See Experiment 3 detail below for the fio file and full output.

  read: IOPS=1064, BW=1065MiB/s (1117MB/s)(400GiB/384672msec)

Interestingly, despite being on PCIe Gen3, the throughput of the device is about the same (around 1 GB/s) as it was on PCIe Gen2, and nowhere near the 3400 MB/s claimed by Samsung.

Experiment 4 – read trimmed blocks serially (PCIe Gen3)

Now we switch back to reading the trimmed section of the drive, this time over PCIe Gen3. See Experiment 4 detail below for the full fio file and output. The available data throughput of Gen3 across 4 lanes is much closer to the raw bus bandwidth of ~1 GB/s per lane, because Gen3 uses 128b/130b encoding – 128 data bits for every 130 bits that pass across the bus (~2% overhead rather than 20% for Gen2) – giving roughly 3.9 GB/s before protocol overhead.

  read: IOPS=3222, BW=3222MiB/s (3379MB/s)(400GiB/127124msec)

Now we see 3379 MB/s from the drive, which is pretty close to the claimed 3400 MB/s. Since PCIe Gen3 x4 offers about 3900 MB/s, we can assume we are not limited by the bus bandwidth.
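
The same arithmetic for Gen3 (again a sketch using the nominal signalling rate, not a measurement) shows there is a little headroom above the drive’s claimed 3400 MB/s:

# Nominal PCIe Gen3 payload bandwidth across 4 lanes.
# Gen3 signals at 8 GT/s per lane with 128b/130b encoding.
gen3_mb_per_lane = 8e9 * 128 / 130 / 8 / 1e6
gen3_x4 = gen3_mb_per_lane * 4

print(f"Gen3 x4: {gen3_x4:.0f} MB/s")   # ~3938 MB/s before protocol overhead
# The measured 3379 MB/s and the claimed 3400 MB/s both fit within that,
# so the bus is not the limiting factor here.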

Conclusion

The specs for this device claim 3400 MB/s in a PCIe Gen3 slot. We showed that, regardless of Gen2 or Gen3, reading trimmed blocks is much faster than reading real data blocks. On Gen2 the trimmed-block reads maxed out the bus; on Gen3 the only way we could get close to 3400 MB/s was to read nothing but trimmed blocks. When reading real data blocks the device is roughly 3x slower than its claimed bandwidth. We get about the same speed reading data blocks on Gen2 and Gen3, so the ~1 GB/s when reading real data is down to the device itself and not the bus. And since the IOs are large and the queue depth is high, the full fio output shows we are not CPU limited either.

Experiment 1 detail

fio file

[global]
direct=1
group_reporting
bs=1024k
ioengine=libaio
iodepth=32
rw=read
numjobs=4

[read]
size=100g 
filename=/dev/nvme0n1

fio output

read: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=32
...
fio-3.16
Starting 4 processes
Jobs: 1 (f=1): [_(1),R(1),_(2)][100.0%][r=1498MiB/s][r=1498 IOPS][eta 00m:00s]
read: (groupid=0, jobs=4): err= 0: pid=15597: Wed Nov 30 08:04:00 2022
  read: IOPS=1058, BW=1058MiB/s (1110MB/s)(400GiB/387050msec)
    slat (usec): min=17, max=521, avg=105.58, stdev=11.96
    clat (msec): min=15, max=328, avg=120.35, stdev=48.21
     lat (msec): min=15, max=328, avg=120.45, stdev=48.22
    clat percentiles (msec):
     |  1.00th=[   41],  5.00th=[   60], 10.00th=[   64], 20.00th=[   78],
     | 30.00th=[   92], 40.00th=[   97], 50.00th=[  110], 60.00th=[  126],
     | 70.00th=[  144], 80.00th=[  171], 90.00th=[  192], 95.00th=[  207],
     | 99.00th=[  232], 99.50th=[  241], 99.90th=[  257], 99.95th=[  268],
     | 99.99th=[  284]
   bw (  MiB/s): min=  544, max= 2918, per=100.00%, avg=1059.14, stdev=79.90, samples=3081
   iops        : min=  544, max= 2918, avg=1059.08, stdev=79.90, samples=3081
  lat (msec)   : 20=0.01%, 50=3.31%, 100=40.61%, 250=55.88%, 500=0.19%
  cpu          : usr=0.87%, sys=3.95%, ctx=408356, majf=0, minf=32818
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=409600,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=1058MiB/s (1110MB/s), 1058MiB/s-1058MiB/s (1110MB/s-1110MB/s), io=400GiB (429GB), run=387050-387050msec

Disk stats (read/write):
  nvme0n1: ios=415780/0, merge=0/0, ticks=50035559/0, in_queue=49210716, util=100.00%
Experiment 2 detail

fio file

[global]
direct=1
group_reporting
bs=1024k
ioengine=libaio
iodepth=32
rw=read
numjobs=4
[read]
size=100g
filename=/dev/nvme0n1

fio output

read: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=32
...
fio-3.16
Starting 4 processes
Jobs: 2 (f=2): [R(2),_(2)][98.9%][r=1578MiB/s][r=1578 IOPS][eta 00m:03s]
read: (groupid=0, jobs=4): err= 0: pid=15814: Wed Nov 30 08:17:45 2022
  read: IOPS=1582, BW=1583MiB/s (1660MB/s)(400GiB/258802msec)
    slat (usec): min=19, max=449, avg=93.12, stdev=17.15
    clat (msec): min=14, max=190, avg=80.23, stdev=28.86
     lat (msec): min=14, max=190, avg=80.33, stdev=28.86
    clat percentiles (msec):
     |  1.00th=[   39],  5.00th=[   40], 10.00th=[   41], 20.00th=[   48],
     | 30.00th=[   62], 40.00th=[   81], 50.00th=[   82], 60.00th=[   82],
     | 70.00th=[   90], 80.00th=[  112], 90.00th=[  122], 95.00th=[  124],
     | 99.00th=[  146], 99.50th=[  150], 99.90th=[  159], 99.95th=[  163],
     | 99.99th=[  169]
   bw (  MiB/s): min=  956, max= 3596, per=100.00%, avg=1590.91, stdev=115.35, samples=2054
   iops        : min=  956, max= 3596, avg=1590.88, stdev=115.35, samples=2054
  lat (msec)   : 20=0.01%, 50=21.24%, 100=53.61%, 250=25.14%
  cpu          : usr=1.29%, sys=5.08%, ctx=409864, majf=0, minf=32821
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=409600,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=1583MiB/s (1660MB/s), 1583MiB/s-1583MiB/s (1660MB/s-1660MB/s), io=400GiB (429GB), run=258802-258802msec

Disk stats (read/write):
  nvme0n1: ios=415751/0, merge=0/0, ticks=33357501/0, in_queue=32603432, util=100.00%
Experiment 3 detail

fio file that skips the first 100g of trimmed blocks and reads the next 100g of random data

[global]
direct=1
group_reporting
bs=1024k
ioengine=libaio
iodepth=32
rw=read
numjobs=4

[read]
size=100g
offset=100g
filename=/dev/nvme0n1

fio output

read: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=32
...
fio-3.28
Starting 4 processes
Jobs: 1 (f=1): [_(3),R(1)][98.2%][r=2504MiB/s][r=2504 IOPS][eta 00m:07s]
read: (groupid=0, jobs=4): err= 0: pid=5352: Wed Nov 30 08:40:36 2022
  read: IOPS=1064, BW=1065MiB/s (1117MB/s)(400GiB/384672msec)
    slat (usec): min=19, max=1172, avg=93.97, stdev=27.12
    clat (msec): min=11, max=347, avg=117.20, stdev=56.41
     lat (msec): min=11, max=347, avg=117.29, stdev=56.41
    clat percentiles (msec):
     |  1.00th=[   13],  5.00th=[   25], 10.00th=[   55], 20.00th=[   72],
     | 30.00th=[   85], 40.00th=[   94], 50.00th=[  109], 60.00th=[  125],
     | 70.00th=[  144], 80.00th=[  167], 90.00th=[  197], 95.00th=[  222],
     | 99.00th=[  257], 99.50th=[  271], 99.90th=[  292], 99.95th=[  305],
     | 99.99th=[  321]
   bw (  MiB/s): min=  490, max= 5110, per=100.00%, avg=1091.95, stdev=177.40, samples=3000
   iops        : min=  490, max= 5110, avg=1091.24, stdev=177.38, samples=3000
  lat (msec)   : 20=4.69%, 50=3.08%, 100=36.70%, 250=54.11%, 500=1.42%
  cpu          : usr=0.48%, sys=3.31%, ctx=406629, majf=0, minf=32816
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=409600,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=1065MiB/s (1117MB/s), 1065MiB/s-1065MiB/s (1117MB/s-1117MB/s), io=400GiB (429GB), run=384672-384672msec

Disk stats (read/write):
  nvme0n1: ios=515273/0, merge=0/0, ticks=60416749/0, in_queue=60416749, util=100.00%
Experiment 4 detail

fio file; this reads the first 100g, which is trimmed in this experiment.

[global]
direct=1
group_reporting
bs=1024k
ioengine=libaio
iodepth=32
rw=read
numjobs=4

[read]
size=100g 
filename=/dev/nvme0n1

fio output

read: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=32
...
fio-3.28
Starting 4 processes
Jobs: 1 (f=1): [_(1),R(1),_(2)][100.0%][r=3275MiB/s][r=3275 IOPS][eta 00m:00s]
read: (groupid=0, jobs=4): err= 0: pid=53501: Wed Nov 30 08:55:02 2022
  read: IOPS=3222, BW=3222MiB/s (3379MB/s)(400GiB/127124msec)
    slat (usec): min=19, max=1716, avg=101.31, stdev=31.84
    clat (msec): min=8, max=115, avg=37.58, stdev=15.77
     lat (msec): min=8, max=115, avg=37.69, stdev=15.77
    clat percentiles (msec):
     |  1.00th=[   10],  5.00th=[   15], 10.00th=[   19], 20.00th=[   25],
     | 30.00th=[   29], 40.00th=[   33], 50.00th=[   35], 60.00th=[   40],
     | 70.00th=[   45], 80.00th=[   52], 90.00th=[   59], 95.00th=[   67],
     | 99.00th=[   81], 99.50th=[   85], 99.90th=[   93], 99.95th=[   97],
     | 99.99th=[  105]
   bw (  MiB/s): min= 1818, max= 7626, per=100.00%, avg=3390.91, stdev=293.70, samples=961
   iops        : min= 1818, max= 7626, avg=3390.87, stdev=293.66, samples=961
  lat (msec)   : 10=3.37%, 20=8.93%, 50=66.24%, 100=21.43%, 250=0.03%
  cpu          : usr=1.51%, sys=10.76%, ctx=403853, majf=0, minf=32810
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=409600,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=3222MiB/s (3379MB/s), 3222MiB/s-3222MiB/s (3379MB/s-3379MB/s), io=400GiB (429GB), run=127124-127124msec

Disk stats (read/write):
  nvme0n1: ios=578490/0, merge=0/0, ticks=21785496/0, in_queue=21785496, util=100.00%

trim fio workload

[global]
# send trim (deallocate) commands rather than reads or writes
rw=trim
bs=1m
direct=1
ioengine=libaio
iodepth=32

[nvme]
# trim only the first 100g of the device
size=100g
filename=/dev/nvme0n1
