This post is not about the types of performance issues a clustered file storage system runs into; rather, it delves into the process of quickly finding those issues and removing them.

That said, many of the anecdotes and procedures here come from working on distributed systems. Applying this approach to embedded systems or other drastically different realms may not prove meaningful.

Maintaining Performance

Writing performant code should always be a goal in software, but it often isn’t the first priority. For many developers, code is correct as long as it passes all of the tests, and that’s that. Because of this, one of the first and most critical steps toward good performance is baking performance benchmarks into ‘correctness’ testing. This helps prevent performance regressions, and, combined with detailed profiling, these test runs provide the data needed to investigate performance. Although ‘maintaining’ performance implies that performance is already good, putting this infrastructure in place should be the first priority. Without it, changes that regress performance are much more difficult to diagnose and prevent.
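
As a concrete sketch of what baking performance into ‘correctness’ testing can look like, the snippet below (Python, purely illustrative) runs a throughput assertion alongside an ordinary functional test. The write_blocks function and the 100 MB/s floor are hypothetical; a real suite would compare against a recorded baseline rather than a hard-coded number.

```python
import time

def write_blocks(data, block_size=4096):
    # Hypothetical stand-in for the storage path under test.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def test_write_blocks_is_correct():
    # Ordinary correctness check: every byte lands in exactly one block.
    data = b"x" * 1_000_000
    blocks = write_blocks(data)
    assert b"".join(blocks) == data

def test_write_blocks_meets_throughput_floor():
    # Performance check run alongside correctness tests, so a regression
    # fails the build instead of slipping through unnoticed.
    data = b"x" * 50_000_000
    start = time.perf_counter()
    write_blocks(data)
    elapsed = time.perf_counter() - start
    throughput_mb_s = len(data) / (1024 * 1024) / elapsed
    # Threshold is illustrative; in practice it comes from a recorded baseline.
    assert throughput_mb_s > 100, f"throughput regressed to {throughput_mb_s:.1f} MB/s"
```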

Obtaining Performance

The first step in improving performance is settling on a specific scenario that should improve. That means writing a performance test that characterizes the scenario, if one doesn’t already exist, and getting a baseline. To collect concrete, useful data, new measurement systems may need to be created or employed. Tools like performance counters and system tracing are great starting places, and the ideas behind them can easily be extended to measure usage in the codebase being improved.
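
As an illustration of extending the performance-counter idea into application code, here is a minimal sketch of named timers that a test run can dump afterwards to establish a baseline. The timed helper and the counter name are invented for this example.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Simple in-process counters: name -> [total seconds, number of calls].
_counters = defaultdict(lambda: [0.0, 0])

@contextmanager
def timed(name):
    # Wrap any suspect code path to accumulate time spent and call count.
    start = time.perf_counter()
    try:
        yield
    finally:
        entry = _counters[name]
        entry[0] += time.perf_counter() - start
        entry[1] += 1

def dump_counters():
    # Emit a baseline report that later experiments can be compared against.
    for name, (total, calls) in sorted(_counters.items()):
        print(f"{name}: {calls} calls, {total:.3f}s total, {total / calls * 1000:.2f}ms avg")

if __name__ == "__main__":
    for _ in range(1000):
        with timed("metadata_lookup"):  # hypothetical hot path
            sum(range(10_000))
    dump_counters()
```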

Now that the prerequisites are covered, we can start iterating. With systems in place to measure both overall and fine-grained performance, the best place to start is often whichever component is consuming a large amount of one or more system resources. Depending on the product, the bottleneck could be in many different places. The main symptom of a bottleneck is work stuck in a queue somewhere in the system. This can manifest as a component running at or near maximum capacity, or as many jobs idling at the same point. Once we know which component is being overused, it is time to make a hypothesis about why it is hot. Ideally the hypothesis is as specific as possible and simple to prove or disprove with a quick experiment.
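
To make ‘work stuck in a queue somewhere’ visible, one option is to sample the depth of each inter-stage queue in a pipeline and watch where items pile up. The three-stage pipeline below is a toy stand-in; the queue that keeps growing points directly at the slow stage consuming from it.

```python
import queue
import threading
import time

# Hypothetical three-stage pipeline: parse -> compress -> write.
parse_to_compress = queue.Queue()
compress_to_write = queue.Queue()

def parse_stage():
    for i in range(200):
        parse_to_compress.put(i)   # fast producer
        time.sleep(0.001)

def compress_stage():
    while True:
        item = parse_to_compress.get()
        time.sleep(0.01)           # deliberately slow: the bottleneck
        compress_to_write.put(item)

def write_stage():
    while True:
        compress_to_write.get()
        time.sleep(0.001)

def sample_queue_depths(samples=10, interval=0.2):
    # The queue that keeps growing sits directly upstream of the bottleneck.
    for _ in range(samples):
        print(f"parse->compress: {parse_to_compress.qsize():4d}  "
              f"compress->write: {compress_to_write.qsize():4d}")
        time.sleep(interval)

if __name__ == "__main__":
    for target in (parse_stage, compress_stage, write_stage):
        threading.Thread(target=target, daemon=True).start()
    sample_queue_depths()
```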

It is often best to create experiments that rely on an oversimplification of the system being tested. Unit testing is powerful because it focuses on one piece at a time; likewise, in performance testing it is important to change only one factor at a time when testing prototypes. Stacks of prototypes that seem to show incremental improvements may hide red herrings and are generally inconclusive.
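
A rough sketch of that one-factor-at-a-time discipline, assuming a hypothetical run_workload with a handful of tunables: every experiment copies a fixed baseline configuration and changes exactly one parameter, so any shift in the measurement can be attributed to that parameter alone.

```python
import time

def run_workload(block_size=4096, compression=False, threads=4):
    # Trivial placeholder; stands in for the real system under test.
    data = bytes(block_size) * 1000
    work = data * (2 if compression else 1)
    return sum(work)

BASELINE = {"block_size": 4096, "compression": False, "threads": 4}

def experiment(factor, value):
    # Copy the baseline configuration and change exactly one factor.
    config = dict(BASELINE, **{factor: value})
    start = time.perf_counter()
    run_workload(**config)
    return time.perf_counter() - start

if __name__ == "__main__":
    print("baseline:      ", experiment("block_size", BASELINE["block_size"]))
    # Each run below differs from the baseline in a single dimension.
    print("64 KiB blocks: ", experiment("block_size", 65536))
    print("compression on:", experiment("compression", True))
    print("16 threads:    ", experiment("threads", 16))
```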

Instead of focusing on writing thread-safe, correct code, the focus at this stage should be on prototypes that give the greatest performance gains regardless of correctness. If removing the protection around a data structure gives a huge performance gain, it’s probably worth optimizing that data structure or making it safe for concurrent access. Optimizing it without evidence that it is underperforming has a high cost and unclear value.
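
As an example of the kind of throwaway prototype described above, this sketch compares a shared counter updated under a lock against the same update with the lock removed. The unlocked version is deliberately incorrect under contention; the only goal is to bound how much the lock could be costing before investing in a sharded or lock-free design. All names here are illustrative.

```python
import threading
import time

N_THREADS = 4
ITERATIONS = 200_000

def bench(update):
    # Run the same increment workload on several threads and time it.
    counter = {"value": 0}
    def worker():
        for _ in range(ITERATIONS):
            update(counter)
    threads = [threading.Thread(target=worker) for _ in range(N_THREADS)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

lock = threading.Lock()

def locked_update(counter):
    with lock:
        counter["value"] += 1

def unlocked_update(counter):
    # Deliberately unsafe prototype: races are acceptable here because we only
    # want to know what the lock itself costs, not produce a correct count.
    counter["value"] += 1

if __name__ == "__main__":
    print(f"with lock:    {bench(locked_update):.3f}s")
    print(f"without lock: {bench(unlocked_update):.3f}s")
```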

Once an experiment is performed, it should either support or contradict the hypothesis. Often many hypotheses will be contradicted before a performance bottleneck is found. Eventually, it should be possible to narrow the problem down to something specific that can be improved. At that point it is time to switch from investigation to implementation, which can be difficult since it is a significant change of pace and style. Additionally, what needs to change could easily be in a part of the system where familiarity is minimal or nonexistent, which again is why prototypes are an important part of the process. After a solution is built, the process restarts, beginning again with collecting new baseline data.

To see this strategy in action, check out this blog post by another member of my team: Improving Lock Efficiency

Tips and Tricks

  • All investigation is useful. Document negative results to avoid duplicating effort.
  • Go after questions that can be answered quickly
  • Collect as much data as possible
  • Use multiple sources of evidence
  • Be confident in your tools or rebuild them until you are
  • Seriously, collect more data
  • Use both micro and macro benchmarking (see the sketch after this list)
  • Make certain that the environment you’re testing in is constant across tests (unless this delta is the one factor being tested)
  • When coming up with a hypothesis, it’s often best to generate a few.
  • If there’s more than one hypothesis, order them using whatever heuristic search you want; some factors to consider are estimated likelihood that the hypothesis will lead to improved performance and how difficult the hypothesis will be to test.
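
On the micro versus macro point above, a rough sketch of the distinction: a microbenchmark times one operation in isolation, while a macrobenchmark times a more realistic end-to-end workload that exercises the same operation among many others. The checksum function and workload sizes are placeholders.

```python
import time

def checksum(block):
    # Hypothetical single operation worth microbenchmarking.
    return sum(block) & 0xFFFFFFFF

def micro_benchmark(runs=10_000):
    # Micro: isolate one operation to measure it precisely.
    block = bytes(4096)
    start = time.perf_counter()
    for _ in range(runs):
        checksum(block)
    return (time.perf_counter() - start) / runs

def macro_benchmark(files=100, blocks_per_file=256):
    # Macro: a workload where the operation appears in context, alongside
    # allocation, copying, and bookkeeping.
    start = time.perf_counter()
    for _ in range(files):
        data = bytes(4096 * blocks_per_file)
        total = 0
        for i in range(blocks_per_file):
            total += checksum(data[i * 4096:(i + 1) * 4096])
    return time.perf_counter() - start

if __name__ == "__main__":
    print(f"micro: {micro_benchmark() * 1e6:.2f} µs per checksum")
    print(f"macro: {macro_benchmark():.3f} s for the whole workload")
```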
