Performance testing: how to survive terminology and start thinking about the goals

When it comes to those types of testing where we measure how an application performs under different conditions, and compare its metrics against other applications or against standards, almost every testing culture has its own definitions for the same terms, which is very confusing. Even performance testing itself may mean completely different things to different people. There are commonly three groups of perceptions:

  • Performance testing seen as an umbrella term for any type of testing related to an application’s quantifiable limits, stability, throughput, etc. In this view it includes load, stress, volume, endurance, soak, peak-rest, spike, storm and other types of testing. Examples of this view are the Software Performance Testing article on Wikipedia and Microsoft’s Fundamentals of Web Application Performance Testing.
  • Performance testing seen as a specific type of testing, separate from other types such as load, stress, etc. One of the most ardent defenders of this point of view explains his position in his blog post.
  • And in the middle there are all those who believe that performance testing is an umbrella term for some types of testing but not for others. For example, I once had a lengthy discussion with a colleague who believed performance testing to be any type of testing that deals with operations on the application (for example load, stress, endurance), but not with operations on data (e.g. volume).

Advocates of each approach have their supporting arguments, of course, along with references to the literature. So when working in a specific company, it’s always a good idea either to find out the existing common vocabulary or to establish a new one. But beyond that, does it really matter what you call it? Not at all. Even though I do have my preferences (I would prefer to have one common umbrella term, and performance testing seems perfect for this role, and I would like to see agreement on what each of the other terms means), I am ready to give up any definition if insisting on it shifts the focus from reaching the testing goals to fighting over the terms.

Unlike the definitions, the goals of performance testing are usually quite clear.

***
The first goal is to make sure the anticipated workload can be supported: the application will not “break”, and its performance characteristics (response time, throughput, or any other measurements relevant for the application) will not degrade below acceptable limits. Such testing is 80% planning and 20% execution, as the environment, the transaction/event distribution, the number of operations/events, and the volumes of data must be carefully planned to represent the typical production environment as closely as possible. This involves two important preparatory steps:

  • Understanding the performance goals, finding the appropriate and relevant metrics, defining the typical distribution of operations, the typical environment, etc. This step involves interviewing multiple stakeholders, getting confusing “wish lists”, or the answer “I don’t know”… all of which could be a topic for a separate post. (A sketch of how such a distribution can be encoded in a load scenario follows this list.)
  • Testing each transaction or operation by itself (including a concurrency test) to eliminate obvious problems (e.g. functional bugs, memory leaks, or other inefficiencies) within the transaction itself.
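Once the typical distribution of operations is known, it can be encoded directly in the load scenario. Below is a minimal sketch using Locust (one of many load-testing tools); the endpoints, weights and think times are hypothetical placeholders that would come out of the stakeholder interviews above.

```python
from locust import HttpUser, task, between

class TypicalUser(HttpUser):
    # Hypothetical "think time" between operations of one simulated user.
    wait_time = between(1, 5)

    # Task weights encode the anticipated distribution of operations:
    # roughly 60% searches, 30% item views, 10% orders (placeholder numbers).
    @task(6)
    def search(self):
        self.client.get("/search", params={"q": "widgets"})

    @task(3)
    def view_item(self):
        self.client.get("/items/42")

    @task(1)
    def place_order(self):
        self.client.post("/orders", json={"item_id": 42, "qty": 1})
```

Running this scenario with a fixed number of users against the test environment (e.g. `locust -f scenario.py --host <test environment URL>`) produces the response-time and throughput numbers that later serve as the benchmark mentioned below.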

Usually the first outcome of this type of testing is the need to deal with resolvable bottlenecks, e.g. inefficiencies within the environment and the application itself. At later stages some further fine-tuning may be required (e.g. database/application maintenance procedures and policies, hardware recommendations, etc.). This type of testing is complete when:

  • The application is able to support all anticipated workload scenarios: it doesn’t break, and its performance characteristics are acceptable.
  • We have collected the application’s performance characteristics for each of those scenarios. These characteristics can be used as a benchmark for other tests, or for other versions of the application.
  • We can specify the hardware requirements and maintenance procedures needed to support the anticipated workload.

***
Once we know that the anticipated workload can be properly supported, the workload on the application is increased to and beyond its limits. This can be subdivided into two sub-goals:

  • Finding the workload limit: the maximal workload the application can handle with acceptable performance characteristics. This is the point just before the application breaks or its performance degrades significantly. At this point performance may not be optimal, merely acceptable.
  • Finding out what happens when the workload exceeds that limit: the application breaks, its performance characteristics degrade to unacceptable levels, or one of the “unresolvable” bottlenecks is reached.

This testing can use the same environment and transaction/event distribution as the load testing, but the number of operations and the volumes of data are increased gradually to reach and then exceed the limit. One of the most important goals of this testing is to make sure that the anticipated workload is not dangerously close to the workload limit, and that the application fails predictably and somewhat gracefully.
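As a rough sketch of the “increased gradually” part: Locust, for example, lets the user count be driven by a custom load shape. The step size, step duration and ceiling below are placeholders to be tuned for the application under test.

```python
from locust import LoadTestShape

class StepRamp(LoadTestShape):
    """Add users in steps until a hard ceiling, watching at which step
    response times or error rates cross the acceptable limits."""
    step_users = 50       # users added at each step (placeholder)
    step_duration = 300   # seconds to hold each step (placeholder)
    max_users = 2000      # safety ceiling well beyond the expected limit

    def tick(self):
        run_time = self.get_run_time()
        users = self.step_users * (int(run_time // self.step_duration) + 1)
        if users > self.max_users:
            return None  # returning None stops the test
        return (users, self.step_users)  # (target user count, spawn rate)
```

The workload limit is then read off the last step at which the measured characteristics were still acceptable, and the behaviour past that step shows how (un)gracefully the application fails.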

Sometimes it is impossible to tell whether the application performs well under the anticipated load, since nobody seems to know what the anticipated load is. In such cases, working backwards (i.e. finding the application’s limit and judging whether it is acceptable and how close it is to the typical workload) can help.

***
At the same time we can start the lengthiest test of all: the one that verifies whether the anticipated workload can be sustained for a long period of time. This type of testing is commonly called endurance testing. “Long” is defined differently for different types of applications, but in an enterprise environment we should be talking about at least a few weeks. As a basis we can still use the same scenarios as in the first test, but it’s good to extend them with some additional “real environment” features, such as periodic issues with the underlying infrastructure (e.g. lost network connections) and erroneous inputs, if those are not already part of the original tests. This testing can reveal “hidden” issues, like memory corruption caused by multiple failures, or insignificant memory leaks that turn into a real problem over time. It also yields the magical “number of hours the application can run without a failure” measure, so popular with managers.
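A cheap companion to an endurance run is a long-lived resource monitor. The sketch below (assuming the third-party psutil package and a known process id of the application under test, both placeholders) samples resident memory every few minutes, so a slow leak shows up as a trend rather than a surprise weeks later.

```python
import csv
import time

import psutil  # third-party package: pip install psutil

APP_PID = 12345          # hypothetical PID of the process under test
SAMPLE_INTERVAL = 300    # seconds between samples

proc = psutil.Process(APP_PID)
with open("endurance_memory.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["timestamp", "rss_bytes"])
    while proc.is_running():
        writer.writerow([int(time.time()), proc.memory_info().rss])
        out.flush()
        time.sleep(SAMPLE_INTERVAL)
```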

***
When it is already known how the anticipated workload is handled and what the application’s limits are, the workload can be increased, sharply or gradually, from regular to maximum and then decreased back to regular, as if it were going through a rush hour, peak or spike. Another version of this test takes the workload from idle to maximum and then back to idle. The goal of both tests is to observe how the application behaves during the peak and how long it takes to recover afterwards. This test is particularly important for applications where such waves of workload are expected on a regular basis. In that case it might be a good idea to combine this type of testing with the long-duration testing above, by running alternating anticipated-maximal-anticipated-low-… workloads against the application for an extended period of time. Also, the distribution of transactions or operations during the spike might differ from the distribution under regular workload (for example: in the morning many people try to log in, while very few have started doing anything else yet).
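A load shape similar to the one sketched earlier can express the alternating regular-peak-regular pattern; again, the user counts and cycle lengths below are placeholders.

```python
from locust import LoadTestShape

class PeakRestShape(LoadTestShape):
    """Alternate between a regular workload and a short peak,
    repeating the cycle for as long as the test runs."""
    regular_users = 100   # placeholder for the anticipated workload
    peak_users = 1000     # placeholder close to the known limit
    cycle = 3600          # one full regular+peak cycle, in seconds
    peak_length = 600     # duration of the peak within each cycle

    def tick(self):
        position = self.get_run_time() % self.cycle
        if position < self.peak_length:
            return (self.peak_users, 100)   # ramp up quickly into the peak
        return (self.regular_users, 10)     # settle back to the regular level
```

Recovery time can then be read from how long the response times stay degraded after each peak ends.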

***
Another common test that a distributed enterprise application may require is one that determines whether scaling up or scaling out the hardware on which the application runs will be more beneficial for handling an increasing workload. Scaling up (also called vertical scaling) adds hardware resources to existing machines (e.g. more memory, a faster hard drive, etc.), while scaling out (also called horizontal scaling) adds more machines and distributes the work between them. Here’s an example of such testing performed by Pentaho.
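The comparison itself usually boils down to running the same load scenario against each candidate configuration and normalising the results. The numbers below are invented placeholders, only meant to show the shape of such a comparison.

```python
# Hypothetical results of running the identical load scenario against
# three configurations; every number below is a placeholder.
results = {
    "baseline  (1 x 4 cores, 16 GB)":  {"max_rps": 450,  "monthly_cost": 300},
    "scale up  (1 x 16 cores, 64 GB)": {"max_rps": 1100, "monthly_cost": 900},
    "scale out (4 x 4 cores, 16 GB)":  {"max_rps": 1500, "monthly_cost": 1200},
}

for name, r in results.items():
    rps_per_dollar = r["max_rps"] / r["monthly_cost"]
    print(f"{name}: {r['max_rps']:>5} req/s at the limit, "
          f"{rps_per_dollar:.2f} req/s per monthly dollar")
```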

***
Another interesting area is the size and growth rate of the data in the application’s back-end storage (e.g. a database or a file system). Naturally this testing is part of all of the tests above, as all of them produce large quantities of data: resolving many bottlenecks will require estimating and adapting to data growth, and we may want to run tests against a large, well-populated database rather than an empty one. However, this testing can also be done separately, with the goal of producing recommendations specifically targeting the DBAs / system administrators, who may not be involved with, or familiar with, the application itself.
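One minimal way to measure the growth rate, assuming a PostgreSQL back end and the psycopg2 driver (both assumptions; the database name and the scenario hook are placeholders), is to compare the database size before and after a known number of business transactions and extrapolate from there.

```python
import psycopg2  # third-party driver; assumes a PostgreSQL back end

DB_NAME = "appdb"             # hypothetical database of the application under test
TRANSACTIONS_IN_RUN = 10_000  # how many business transactions the scenario drives

def db_size_bytes(conn):
    with conn.cursor() as cur:
        cur.execute("SELECT pg_database_size(%s)", (DB_NAME,))
        return cur.fetchone()[0]

def run_load_scenario():
    """Placeholder: drive TRANSACTIONS_IN_RUN business transactions here,
    e.g. by launching the same load scenario used for the first test."""

conn = psycopg2.connect(dbname=DB_NAME)
before = db_size_bytes(conn)
run_load_scenario()
after = db_size_bytes(conn)

bytes_per_txn = (after - before) / TRANSACTIONS_IN_RUN
print(f"~{bytes_per_txn:.0f} bytes of storage per transaction; "
      f"multiply by the expected daily volume for DBA sizing estimates.")
```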

Did I forget anything? I most surely did.
