What is the difference between Device level and System level testing?

Device level testing at the Block IO level allows you to compare SSD-to-SSD performance by applying a stimulus directly to the physical device at the HBA level.  Device testing isolates the Device Under Test (DUT) and minimizes or eliminates the effects of the File System, OS and Memory Cache on the performance measurements.

System level tests measure the logical device by applying a stimulus at the OS File system level.  The IO stimuli may be managed and affected by system memory cache and hardware and software drivers that exist between the test stimulus at the user level and the SSD Device Under Test.

By testing at the Device level, the test sponsor can achieve an "Apples-to-Apples" comparison of SSD Performance.

What is FOB, Sustained & Steady State?

"FOB" stands for "Fresh-Out-of-Box" and refers to the initial performance state when the device is new.  In this state, there are no writes to any of the memory cells and all of the blocks are erased and available for use.  This provides a peak, but quickly transitory, performance state.

"Sustained" refers to a performance state when the metric under observation is relatively time invariant, i.e. when the performance level is relatively stable and unchanging.  This can occur after a variety of "pre writes" or "pre conditioning" which applies some type of access pattern (RND/SEQ R/W) to the device for some amount of activity (Total GB Written).  Sustained performance will depend on the type and amount of pre writes and can be useful when the exact user application pre write conditions are known.

"Steady State" is the preferred methodology and is defined as a performance state that follows a precisely defined and repeatable pre conditioning regime.  This provides a valid comparison for performance data taken under conditions that can be replicated by a test sponsor or auditor. The PTS Steady State is a specific sequence of Purge, Pre Conditioning and Workload related pre writes that ensure performance measurements will fairly represent response to valid workloads.

How do SPC and PTS Performance tests differ?

Both SPC System level and PTS Device level testing are beneficial for understanding differences in SSD Performance.

SPC is a system level test conducted by the Storage Performance Council.  SPC applies a specific workload to evaluate the performance of a storage system or subsystem.  In SPC testing, all hardware items are disclosed and run the Storage Performance Council test workload.  SPC is useful for evaluating SSD performance within a total System or Sub System.

PTS testing attempts to isolate SSD testing at the SSD Device level and to eliminate or normalize the contributions of the hardware and software system.  By applying a known and repeatable stimulus at the Block IO level, reasonable and fair performance comparisons can be made between SSDs on an "Apples-to-Apples" basis.


What is the typical PTS Pre Conditioning Methodology?

PTS Pre Conditioning for steady state is composed of several steps: 1) Purge; 2) Workload Independent Pre Conditioning; and 3) Workload Dependent Pre Conditioning.

Purge is the application of a Security Erase command for ATA devices or a Format Unit command for SCSI devices.  Purge may also be accomplished via a vendor unique command that puts the SSD in a state as if no writes had occurred.

Workload Independent PC (WIPC) usually applies 128KiB SEQ Writes to the device in an amount equal to twice the stated User Capacity.  The available Active Range of LBAs for WIPC may be set differently for Client or Enterprise tests.

Workload Dependent PC (WDPC) is the application of the test stimulus of interest to the device and the measurement of performance in "Rounds" until five consecutive Rounds meet the Steady State Window requirement.  The available Active Range of LBAs for WDPC may be set differently for Client or Enterprise tests.
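The three-step sequence above can be sketched in code.  This is a hypothetical illustration only: the callbacks (purge, seq_write, run_round) and the simplified steady-state check are placeholders, not a real test-tool API, and the capacity value is an arbitrary example.

```python
# Hypothetical sketch of the PTS preconditioning sequence described above.
# The callbacks (purge, seq_write, run_round) and the steady-state check
# are illustrative placeholders, not a real test-tool API.

def precondition(user_capacity_gb, purge, seq_write, run_round, is_steady):
    """Run the three PTS preconditioning steps; return the 5 SS rounds."""
    # Step 1: Purge - return the device to an "as if no writes" state.
    purge()

    # Step 2: Workload Independent Preconditioning (WIPC):
    # 128 KiB sequential writes totaling twice the stated User Capacity.
    seq_write(block_kib=128, total_gb=2 * user_capacity_gb)

    # Step 3: Workload Dependent Preconditioning (WDPC): apply the test
    # stimulus in rounds until five consecutive rounds meet the SS window.
    rounds = []
    while True:
        rounds.append(run_round())        # one round of the test stimulus
        if len(rounds) >= 5 and is_steady(rounds[-5:]):
            return rounds[-5:]            # the five-round SS window

# Demo with dummy callbacks and a simplified 20% data-excursion check:
vals = iter([200, 150, 120, 100, 90, 88, 87, 87, 86])
window = precondition(
    user_capacity_gb=480,
    purge=lambda: None,
    seq_write=lambda block_kib, total_gb: None,
    run_round=lambda: next(vals),
    is_steady=lambda w: max(w) - min(w) <= 0.20 * (sum(w) / len(w)),
)
```

In the demo, the round values settle over time, so the loop stops at the first five-round window whose spread falls within the simplified excursion limit.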

How does the PTS determine Steady State?

PTS Steady State (SS) is determined when the "tracking variable" measurements fall within a specifically defined range, or Steady State Window.  See page 3 of the SNIA IOPS Report for an example of a SS Window.

Steady State is met when the tracking variable is measured over five consecutive rounds wherein the maximum data excursion does not exceed 20% of the average and the maximum slope excursion of the least squares linear fit across the rounds does not exceed 10% of the average.

When the fifth round meets the SS Window requirements, test results are taken from the average of the fifth round and the preceding four rounds of the tracking variable.

CTS software algorithms determine when SS is achieved, stop the run and log the results to the database.  Test Sponsors may optionally run twenty-five rounds and manually post process the data to ascertain if, and when, the SS conditions were met.

What are OIOs, how are they determined, and why are they important?

OIO stands for Outstanding IO and refers to the Driving Intensity of the stimuli to the SSD under test.  The Host system, or test hardware, will send IO stimuli to the SSD and measure its response.

The OIO is the total number of IOs in flight at once and is defined as the number of Threads (or "workers") times the Queue Depth per Thread (or "jobs").  This is often referred to as Threads by Queue Depth.  Note: care should be taken to distinguish between Queue Depths at different levels of the OS stack - e.g. SSD QD vs HBA QD.

It is important to note the total OIO in a given test to ensure that enough driving intensity is being applied to the DUT.  A low OIO of one Thread and one Queue can "starve" the DUT and throttle measured performance, as there are not enough outstanding IOs to reach the device's maximum performance.  On the other hand, too many OIOs can increase response time if the SSD Controller has to spend undue time managing (or switching between) multiple threads at the expense of pushing IO rate.
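The Threads-by-Queue-Depth arithmetic is simple but worth making explicit.  A minimal sketch, with example thread/QD values that are illustrative rather than PTS-mandated settings:

```python
# Illustrative sketch: total OIO equals Threads times Queue Depth per
# Thread.  The thread/QD values below are examples, not PTS settings.

def total_oio(threads, queue_depth_per_thread):
    """Total driving intensity (outstanding IOs) applied to the DUT."""
    return threads * queue_depth_per_thread

low = total_oio(1, 1)     # 1 outstanding IO - may starve the DUT
high = total_oio(4, 32)   # 128 outstanding IOs - much higher intensity
```

In practice a test sponsor would sweep thread and queue combinations to find where the DUT saturates without response times climbing.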

Can I compare different test software tools like IOmeter, vdBench, FIO and CTS?

There are a variety of test software tools, each of which has strengths and weaknesses.  Regardless of the test software used, any comparison testing should be done on the same hardware test system.  Test hardware can substantially affect - and may bottleneck - the performance results.

Similarly, test software has an even more profound effect on performance measurements.  Accordingly, great care should be taken to select the proper test software tool, and to ensure that comparisons are only made between measurements taken with the same software tool.

The biggest factor is whether the test tool operates at the file system level.  System level tools like IOmeter can be affected by the File system, system cache and various hardware and software drivers in the hardware/software stack.  Similarly, vdBench is a Java based tool, which can affect the timing of IOs.

FIO and CTS operate at the same OS level.  However, the way in which the tools spawn stimulus and measure and record IO results has a dramatic effect on measurements.  This becomes exacerbated as performance increases, IO rates increase and latency / response times decrease.

Attempts to correlate results between test software tools have proven problematic.  Accordingly, test sponsors should always limit performance comparisons to identical test environments.  This is one of the prime considerations that the SNIA SSS Technical Working Group addressed in defining an Industry Standard Reference Test Platform for making fair comparisons of SSD performance.


Is Windows or Linux a better test environment?

Each OS has its strengths and weaknesses.  Windows is more widely installed in user systems, so system level test under Windows is often valuable for functional test and comparison - i.e. how does SSD A behave in "my" Windows system.

However, for comparative benchmarking, it is often preferable to test under a Linux OS (such as RHEL, CentOS or SUSE).  Test stimulus IOs are affected by the hardware/software stack.  Various caches and drivers can divert, delay, repackage or otherwise affect the timing of OIOs and hence affect device level performance tests.

In addition, SSD Test Best Practices dictate that the test sponsor isolate the DUT and disable as much system, background and adjacent activity as possible in order to minimize unwanted influences on DUT performance.

For example, high IO tests often rely on the availability of system cpu resources.  These cpu resources can be diverted if the OS (Windows, for example) is invoking background or parallel processes, or is addressing an error condition in a parallel DUT test.  Windows systems are also notorious for automatically updating and rebooting in the middle of the night... often in the middle of a test run.

For these and other reasons, the RTP environment is based on a Back end server running under CentOS.

Which is preferred: Synthetic or Trace-based tests?

Synthetic and Trace-based tests both have valid applications.  However, for comparative SSD device level test, the use of synthetic test stimuli is preferred.

Synthetic testing refers to the use of a "known and repeatable" test stimulus - key for obtaining valid comparative performance data.  Since SSD performance is highly dependent on write history, test stimulus and test environment, great care must be taken to control the nature, amount and type of test stimulus applied (whether as pre conditioning or test stimulus).

Trace based tests, on the other hand, are extremely valuable when examining SSD behavior relative to a specific workload.  However, use of Trace based stimulus is problematic in comparative SSD performance testing for several reasons.

As noted elsewhere, the hardware/software stack affects IOs as they move from user space to the SSD.  Any trace of IO traffic will reflect the IO as it traversed the specific hw/sw stack in which it was captured - and hence may not reflect how the IO would behave in another test or user system.  That is, a Trace is an exact capture of a "given" IO, but it is specific to that system and is not indicative of a universal or "typical" IO taken on other systems.

Notwithstanding the effect of the hw/sw stack on IO traffic, the "holy grail" of testing is the quest to define and capture a "typical user workload" and its analog at the SSD device level.  See following posts for a discussion on User Workloads.


Can "Typical User Workloads" be defined and used for testing?

The quest to capture, define and reproduce "Typical User Workloads" is arguably the holy grail of SSD benchmark testing.

To use a "Typical Workload" as a test stimulus, one has to achieve two requirements: 1) Define a Typical Workload; and 2) Synthesize the workload into a Device Level Stimulus.

User applications, or workloads, are typically characterized by an access pattern of IOs - Random or Sequential Read/Write ratios of Block Sizes.  For example, streaming video may consist primarily of large block sequential Read traffic - which could be expressed as a SEQ 128KiB 90:10 R/W mix.
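An access pattern like this can be turned into a synthetic stimulus.  A hypothetical sketch of a generator for the SEQ 128KiB 90:10 R/W example above - the function name and seeded randomness are illustrative, not a real test-tool workload definition:

```python
# Hypothetical sketch: generate IOs matching a SEQ 128KiB 90:10 R/W mix.
# Each IO is a read with probability 0.9, at sequentially advancing
# offsets.  Seeded so the stimulus is "known and repeatable".

import random

def seq_90_10_workload(num_ios, block_bytes=128 * 1024, seed=0):
    """Yield (op, offset, length) tuples for a SEQ 128KiB 90:10 mix."""
    rng = random.Random(seed)
    offset = 0
    for _ in range(num_ios):
        op = "read" if rng.random() < 0.90 else "write"
        yield op, offset, block_bytes
        offset += block_bytes        # sequential: next adjacent block

ios = list(seq_90_10_workload(1000))
reads = sum(1 for op, _, _ in ios if op == "read")
```

Because the generator is seeded, the same IO sequence can be replayed exactly on another DUT, which is what makes such synthetic stimuli comparable across devices.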

Once again, the hw/sw stack will affect the composition of the user application access pattern as it traverses to the SSD.  The nature of the access pattern will necessarily be defined by the specific hardware and software used by "that particular user."  Hence, while one can capture a user application access pattern with relative ease, defining a "typical access pattern" is very difficult, with arguably as many access patterns as there are users and applications in the world.

Even if we agree that a given workload is "typical" in that it may be representative of a large group of users (if not a majority), it is a methodological challenge to synthesize this into a usable test stimulus.  This is because the timing of the access pattern IOs includes many idle periods.

Dealing with idle times in a given trace is problematic - compressing or eliminating idle times creates a non representative trace, one where the stimulus is constant and continuous when the original trace may occur over many hours or days.

Conversely, if the trace idle times are replicated, the test trace becomes unreasonably long and not conducive to practical testing.

Finally, the way in which the original trace is captured is dependent on the user environment - and includes the type of mass storage used when the trace is captured.  Thus, the source mass storage device will inherently affect the trace - whether it was a HDD, slow SSD or fast SSD will affect the nature of the idle times and IO traffic captured in the trace.

Nevertheless, industry work continues on this topic, with efforts focused on characterizing the behaviors of specific user applications and on trace capture methodologies that record summary IO statistics in various use cases.