Performance testing: how to survive terminology and start thinking about the goals

When it comes to the types of testing where we measure how an application performs under different conditions and compare its metrics against other applications or standards, almost every testing culture has its own definitions for the same terms, which is very confusing. Even performance testing itself may mean completely different things to different people. Commonly there are three groups of perceptions:

  • Performance testing seen as an umbrella term for any type of testing related to an application's quantifiable limits, stability, throughput, etc. It thus includes load, stress, volume, endurance, soak, peak-rest, spike, storm and other types of testing. Examples of this view are the Software Performance Testing article on Wikipedia and Microsoft’s Fundamentals of Web Application Performance Testing.
  • Performance testing seen as a specific type of testing, separate from other types such as load, stress, etc. One of the most ardent defenders of this point of view explains his position in his blog post.
  • And in the middle there are all those who believe that performance testing is an umbrella term for some types of testing but does not include others. For example, I once had a lengthy discussion with a colleague who believed performance testing to be any type of testing that deals with operations on the application (for example load, stress, endurance), but not with operations on data (e.g. volume).

Advocates of each approach have their supporting arguments, of course, and references to literature. So when working in a specific company, it’s always a good idea to either learn the existing common vocabulary or establish a new one. But beyond that, does it really matter what you call it? Not at all. Even though I do have my preferences: I would prefer to have one common umbrella term (and performance testing seems perfect for this role), and I would like to see agreement on what each of the other terms means. Still, I am ready to give up on any definition if it shifts focus from reaching the testing goals to fighting over terms.

And unlike the definitions, the goals of performance testing are usually quite clear.

***
The first goal is to make sure the anticipated workload can be supported: the application will not “break”, and its performance characteristics (time, throughput, or any other measurements relevant for the application) will not degrade below acceptable limits. Such testing is 80% planning and 20% execution, as the environment, transaction/event distribution, number of operations/events, and volumes of data must be carefully planned to represent typical production environments as closely as possible. This involves two important preparatory steps:

  • Understanding the performance goals, finding the appropriate and relevant metrics, defining the typical distribution of operations, the typical environment, etc. This step involves interviewing multiple stakeholders and getting confusing “wish lists” or the answer “I don’t know”… all of which could be a topic for a separate post. A small sketch of turning such answers into a workload model follows this list.
  • Testing each transaction or operation by itself (including a concurrency test) to eliminate obvious problems (e.g. functional bugs, memory leaks, or other inefficiencies) within the transaction itself.
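
Before any tool is involved, the transaction/event distribution gathered in the first step can be captured as plain data, so that everyone agrees on the workload being simulated. Below is a minimal Python sketch of such a workload model; the transaction names, percentages, and hourly total are hypothetical placeholders, not real requirements.

```python
# Minimal workload model: hypothetical transaction mix for one peak hour.
# Names and numbers are placeholders to be replaced with stakeholder input.
WORKLOAD_MODEL = {
    "login":        0.20,   # 20% of all operations
    "search":       0.50,
    "create_order": 0.25,
    "admin_report": 0.05,
}

PEAK_OPERATIONS_PER_HOUR = 36_000  # assumed figure from stakeholder interviews

def operations_per_second(mix, total_per_hour):
    """Translate the relative mix into an absolute rate per transaction."""
    assert abs(sum(mix.values()) - 1.0) < 1e-9, "distribution must sum to 100%"
    return {name: share * total_per_hour / 3600 for name, share in mix.items()}

if __name__ == "__main__":
    for name, rate in operations_per_second(WORKLOAD_MODEL, PEAK_OPERATIONS_PER_HOUR).items():
        print(f"{name:>14}: {rate:.2f} ops/sec")
```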

Usually the first outcome of this type of testing is the need to deal with resolvable bottlenecks, i.e. inefficiencies within the environment and the application itself. At later stages some further fine-tuning may be required (e.g. database/application maintenance procedures and policies, hardware recommendations, etc.). This type of testing is complete when:

  • The application is able to support all anticipated workload scenarios: it doesn’t break, and its performance characteristics are acceptable.
  • We have collected the application’s performance characteristics for each of those scenarios. These characteristics can be used as a benchmark for other tests or for other versions of the application; a sketch of such a benchmark comparison follows this list.
  • We can specify the hardware requirements and maintenance procedures needed to support the anticipated workload.
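
To make the collected characteristics usable as a benchmark, later runs have to be compared against them in a repeatable way. The following is a minimal, illustrative Python sketch of such a comparison; the threshold values and sample measurements are assumptions.

```python
import statistics

def percentile(samples, p):
    """Return the p-th percentile (0-100) of a list of response times."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1))))
    return ordered[k]

def check_against_benchmark(samples, benchmark):
    """Compare measured response times (seconds) against benchmark limits."""
    results = {
        "median_s": statistics.median(samples),
        "p95_s": percentile(samples, 95),
        "max_s": max(samples),
    }
    failures = {k: v for k, v in results.items() if v > benchmark[k]}
    return results, failures

if __name__ == "__main__":
    # Hypothetical measurements from one scenario run, in seconds.
    measured = [0.21, 0.25, 0.19, 0.31, 0.28, 0.95, 0.24, 0.22, 0.27, 0.26]
    benchmark = {"median_s": 0.30, "p95_s": 0.80, "max_s": 2.00}
    results, failures = check_against_benchmark(measured, benchmark)
    print("measured:", results)
    print("over the limit:", failures or "none")
```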

***
Once we know that the anticipated workload can be properly supported, the workload on the application is increased to and beyond its limits. This can be subdivided into two sub-goals:

  • Finding the workload limit: the maximal workload the application can handle with acceptable performance characteristics. This is the point just before the application breaks or its performance significantly degrades. At this point performance may be non-optimal, but still acceptable.
  • Finding out what happens when the workload exceeds that limit: the application breaks, its performance characteristics degrade to unacceptable levels, or one of the “unresolvable” bottlenecks is reached.

This testing can use the same environment and transaction/event distribution as the load testing, but the number of operations and volumes of data must be increased gradually to reach and exceed the limit. One of the most important goals of this testing is to make sure that the anticipated workload is not dangerously close to the workload limit, and that the application fails predictably and somewhat gracefully.
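
One rough way to search for the workload limit is to step the load up until an agreed error-rate or response-time threshold is crossed. The sketch below illustrates the idea using Python's standard library plus the requests package; the URL, step sizes, and thresholds are assumptions, and in practice a dedicated load-testing tool would normally drive this.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests  # assumed to be installed; any HTTP client would do

TARGET_URL = "http://localhost:8080/health"   # hypothetical endpoint
MAX_ACCEPTABLE_P95_S = 1.0
MAX_ACCEPTABLE_ERROR_RATE = 0.01

def one_request(url):
    start = time.perf_counter()
    try:
        ok = requests.get(url, timeout=5).status_code < 500
    except requests.RequestException:
        ok = False
    return ok, time.perf_counter() - start

def run_step(concurrency, requests_per_worker=20):
    """Fire a burst of requests at the given concurrency and summarize it."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        outcomes = list(pool.map(
            one_request, [TARGET_URL] * concurrency * requests_per_worker))
    times = sorted(t for _, t in outcomes)
    p95 = times[int(0.95 * (len(times) - 1))]
    error_rate = sum(1 for ok, _ in outcomes if not ok) / len(outcomes)
    return p95, error_rate

if __name__ == "__main__":
    for concurrency in (5, 10, 20, 40, 80):      # gradually increasing load
        p95, errors = run_step(concurrency)
        print(f"{concurrency:>3} workers: p95={p95:.2f}s errors={errors:.1%}")
        if p95 > MAX_ACCEPTABLE_P95_S or errors > MAX_ACCEPTABLE_ERROR_RATE:
            print("limit reached around", concurrency, "concurrent workers")
            break
```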

Sometimes it is impossible to tell whether the application performs well under the anticipated load, because nobody seems to know what the anticipated load is. In such cases, working backwards (i.e. finding the application’s limit and understanding whether it’s acceptable and how close it is to the typical workload) can help.

***
At the same time we can start the lengthiest test of all: the one that verifies whether the anticipated workload can be sustained for a long period of time. This type of testing is commonly called endurance testing. “Long” is defined differently for different types of applications, but in an enterprise environment we should be talking about at least a few weeks. As a basis we can still use the same scenarios as in the first test, but it’s good to extend them with some additional “real environment” features, such as periodic issues with the underlying infrastructure (e.g. lost network connections) and erroneous inputs, if those are not already part of the original tests. This testing can reveal “hidden” issues, like memory corruption caused by multiple failures, or insignificant memory leaks that turn into a real problem over time. It also allows us to find that magical “number of hours the application can run without a failure” measure, so popular with managers.
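
At its core an endurance run is just a workload that keeps going while resource usage and error counts are recorded over time, so that slow drifts (growing memory, accumulating errors) become visible. Here is a minimal sketch of such a loop; the placeholder transaction, the duration, and the use of the optional psutil package for memory readings are all assumptions.

```python
import csv
import time

import psutil  # optional dependency, assumed installed, used only to read RSS

RUN_SECONDS = 14 * 24 * 3600          # "long" assumed here to mean two weeks
SAMPLE_EVERY_SECONDS = 300            # record a data point every 5 minutes

def perform_one_transaction():
    """Placeholder for a single scripted business transaction."""
    time.sleep(0.1)                    # stand-in for real work
    return True                        # pretend it succeeded

def soak(log_path="soak_log.csv"):
    # Here we watch the driver process itself; in practice you would pass
    # the application's PID to psutil.Process() instead.
    process = psutil.Process()
    started = time.time()
    errors = 0
    with open(log_path, "w", newline="") as f:
        log = csv.writer(f)
        log.writerow(["elapsed_s", "rss_bytes", "errors_so_far"])
        next_sample = started
        while time.time() - started < RUN_SECONDS:
            if not perform_one_transaction():
                errors += 1
            if time.time() >= next_sample:
                log.writerow([int(time.time() - started),
                              process.memory_info().rss, errors])
                f.flush()
                next_sample += SAMPLE_EVERY_SECONDS

if __name__ == "__main__":
    soak()
```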

***
When it’s already known how the anticipated workload is handled and what the application’s limits are, the workload can be sharply or slowly increased from regular to maximum and then decreased back to regular, as if it were going through a rush hour, peak, or spike. Another version of this test takes the workload from none to maximum and then back to idle. The goal of both tests is to observe how the application behaves and how long it takes to recover after a peak. This test may be more important for certain applications, where such waves of workload should be expected on a regular basis. In such cases it might be a good idea to combine this type of testing with testing over a long period of time, by creating alternating anticipated-maximal-anticipated-low-… workloads on the application for an extended period. Also, the distribution of transactions or operations during the spike might differ from the distribution under regular workload (for example: in the morning many people try to log in, while very few have started to do anything else yet).
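
The wave-shaped workload described above can be expressed as a simple schedule that a load driver steps through: ramp from the regular rate up to the peak, hold it, ramp back down, go quiet, and repeat. The sketch below only generates such a schedule; the rates and durations are made-up placeholders.

```python
from itertools import chain

def ramp(start_rate, end_rate, steps):
    """Linear sequence of target request rates from start_rate to end_rate."""
    return [start_rate + (end_rate - start_rate) * i / (steps - 1)
            for i in range(steps)]

def spike_cycle(regular=50, peak=400):
    """One rush-hour style cycle: regular -> peak -> regular -> quiet."""
    return list(chain(
        ramp(regular, peak, 6),   # climb to the peak
        [peak] * 3,               # hold the peak
        ramp(peak, regular, 6),   # come back down
        [regular * 0.2] * 3,      # quiet period before the next wave
    ))

if __name__ == "__main__":
    # Alternating cycles for an extended run, as suggested above;
    # each schedule entry is assumed to last 5 minutes.
    schedule = spike_cycle() * 4
    for i, rate in enumerate(schedule):
        print(f"t={i * 5:>4} min  target rate = {rate:.0f} req/s")
```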

***
Another common test that a distributed enterprise application may require is one that determines whether scaling up or scaling out the hardware on which the application runs will be more beneficial for handling an increasing workload. Scaling up (also called vertical scaling) adds more hardware resources to existing machines (e.g. more memory, a faster hard drive, etc.), while scaling out (also called horizontal scaling) adds more machines and distributes the work between them. Here’s an example of such testing performed by Pentaho.
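
Comparing the two options usually comes down to repeating the same load test on each configuration and relating the measured throughput to the cost of the hardware. The sketch below shows one way to summarize such results; every configuration, cost, and throughput figure in it is a hypothetical placeholder.

```python
# Hypothetical measurements from repeating the same load test on different
# hardware configurations; every number below is a placeholder.
CONFIGURATIONS = [
    # (name, nodes, monthly_cost, max_throughput_req_per_s)
    ("baseline: 1 x 8-core / 32 GB",   1,  400,  900),
    ("scale up: 1 x 16-core / 64 GB",  1,  750, 1500),
    ("scale out: 3 x 8-core / 32 GB",  3, 1200, 2400),
]

def summarize(configs):
    """Print throughput gain over the baseline and throughput per cost unit."""
    baseline_throughput = configs[0][3]
    for name, nodes, cost, throughput in configs:
        gain = throughput / baseline_throughput
        per_cost = throughput / cost
        print(f"{name:<32} x{gain:.1f} throughput, "
              f"{per_cost:.2f} req/s per cost unit")

if __name__ == "__main__":
    summarize(CONFIGURATIONS)
```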

***
Another interesting area is the size and growth rate of the data in the application’s back-end storage (e.g. database or file system). Naturally this testing is part of all of the above tests, as all of them produce large quantities of data, so resolving many bottlenecks will require estimating and adapting for data growth, or we may want to run tests on a large, well-populated database rather than on an empty one. However, this testing can also be done separately, with the goal of providing recommendations specifically targeting DBAs and system administrators, who may not be involved with, or familiar with, the application itself.
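
A back-of-the-envelope estimate of storage growth can be derived directly from the workload model, which is often enough to start the conversation with the DBAs. The per-transaction footprints and daily volumes in the sketch below are assumptions.

```python
# Rough storage-growth estimate from assumed per-transaction footprints.
BYTES_PER_TRANSACTION = {          # average rows + index overhead, assumed
    "create_order": 6_000,
    "add_attachment": 250_000,
    "audit_event": 800,
}
TRANSACTIONS_PER_DAY = {           # taken from the (assumed) workload model
    "create_order": 20_000,
    "add_attachment": 1_500,
    "audit_event": 150_000,
}

def daily_growth_bytes():
    """Sum the assumed footprint of every transaction executed in one day."""
    return sum(BYTES_PER_TRANSACTION[name] * count
               for name, count in TRANSACTIONS_PER_DAY.items())

if __name__ == "__main__":
    per_day = daily_growth_bytes()
    print(f"~{per_day / 1e9:.2f} GB/day, ~{per_day * 365 / 1e12:.2f} TB/year")
```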

Did I forget anything? I most surely did.


“End User Experience” Testing

Sometimes running load, usability, functional, and UI testing separately is not enough, as each operates on a certain subset of variables, assuming the others to be static. It’s like projecting a cube into 2D. This is why one of the tests I like to do is “End User Experience” testing: simulating a real user performing a real set of tasks.

Preparation:

1. Choose a few transactions or scenarios most commonly performed by users. Say, if I did this type of testing for WordPress.com, I would probably choose “Add New Post”, “Search on site”, and a few more.

2. Define an overall goal for each transaction. It’s best if the goal is close to what a typical real user would do. For example: if the average post length on WordPress is about 240 words, the tested transaction “Add New Post” may have an overall goal of creating a post with 240 words.

3. Break transactions into steps, and define data for each step: what exactly will you do during the transaction? How will you navigate from step to step? Which options, features, and shortcuts will you use? And so on. Since there’s usually more than one way to accomplish the same task, defining those actions is very helpful for the analysis: it takes away the guessing game of “how did I actually accomplish it?” and it also allows you to concentrate later on the transactions that seem problematic. For example: in order to add a new post, I may go to the Dashboard, or I may just click the “New Post” button in the top menu. My final results may differ depending on how I accomplished the task, so it’s important to remember which way it was done.

At this step, we have something like the following table:

Transaction: Add New Post
Goal: Create post with 240 words
Actions & Data:
  1. Navigate to Dashboard
  2. Click Add New in the left-side menu
  3. Provide a title (4 words)
  4. Type 100 words
  5. Provide a link with 10 words
  6. Type 130 more words
  7. Click Publish

(and so on for each remaining transaction)
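
The same table can also be kept in a machine-readable form, so that timings and notes can later be attached to individual steps. Here is a minimal sketch of such a structure in Python; it mirrors the table above, and the field names are my own invention.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Step:
    description: str
    seconds_taken: Optional[float] = None   # filled in during the test run
    notes: str = ""                          # inconveniences, issues, etc.

@dataclass
class Transaction:
    name: str
    goal: str
    steps: List[Step] = field(default_factory=list)

# The "Add New Post" transaction from the table above.
add_new_post = Transaction(
    name="Add New Post",
    goal="Create post with 240 words",
    steps=[
        Step("Navigate to Dashboard"),
        Step("Click Add New in the left-side menu"),
        Step("Provide a title (4 words)"),
        Step("Type 100 words"),
        Step("Provide a link with 10 words"),
        Step("Type 130 more words"),
        Step("Click Publish"),
    ],
)
```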

Testing

There are many ways to perform this test, for example:

  • A single “experienced” user runs the designed test at a normal (not too fast, not too slow) pace, noting how long it took him or her to accomplish the different steps, along with any inconveniences (was the scrollbar present? was the font too small?) and issues.
  • The same “experienced” user runs the same test, but this time an automated load test is running in the background; a rough sketch of this setup follows the list.
  • Same as above, but this time a “novice” user runs the test (how quickly will he or she discover how to accomplish the steps? How much time will this user’s mistakes cost? Will those mistakes cause any additional problems?)
  • and so on.
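
For the second variant, the background load does not require anything elaborate: a few threads steadily hitting the application are enough while the tester walks through the scripted steps and records timings by hand. A rough sketch, with a hypothetical target URL:

```python
import threading
import time

import requests  # assumed available; any HTTP client or load tool would do

STOP = threading.Event()
TARGET_URL = "http://localhost/wp-login.php"   # hypothetical background target

def background_load(workers=10):
    """Keep a constant stream of requests going until STOP is set."""
    def worker():
        while not STOP.is_set():
            try:
                requests.get(TARGET_URL, timeout=5)
            except requests.RequestException:
                pass                      # errors are expected under load
    threads = [threading.Thread(target=worker, daemon=True)
               for _ in range(workers)]
    for t in threads:
        t.start()
    return threads

if __name__ == "__main__":
    background_load()
    input("Background load running. Walk through the scripted steps, "
          "note your timings, then press Enter to stop... ")
    STOP.set()
```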