Sciagraph: Performance observability for Python data pipelines

Speeding up your software is tricky when you don’t know why it’s slow

You’re running a data processing batch job written in Python, and you need it to be faster. A whole lot faster.

And that means you need to identify the bottleneck. After all, you’ll need a very different fix depending on whether the bottleneck is an algorithm, a slow database query, lack of parallelism, or one of the many other potential causes.

If you don’t know where to look, you’re left stumbling around, trying random changes in the hopes that something will catch. And that means days of precious time wasted on failed experiments, plausible-but-wrong guesses, and other dead ends.

Can you even reproduce the problem?

If you can reproduce the problem, identifying the cause will be a lot easier, and it’ll also help you test whether your proposed solution actually helps. But reproducing unknown production performance problems on your development laptop may be impossible. Production is different from your laptop: CPUs, memory, disk speed, network latency, the size of the data, and much more besides.

  • Is S3 latency a problem?
  • Are you swapping?
  • Is your program hitting concurrency bottlenecks on a machine with more CPUs?
  • Are your data inputs in production different, or is your database doing a table scan?

Testing on your laptop, it’s very hard to know.

So you end up wasting time guessing at causes, adding logging to production in the hopes of spotting the culprit, waiting for an already slow job to give you a hint… It’s a slow, painful process. And it’s even worse when the performance problem is intermittent.

How can you get accurate performance diagnostics, and quickly, when you can’t even reproduce the problem?

The most accurate performance data comes from production

Imagine if every time you discovered your code was too slow, you already had a report available that pointed you straight at the cause. Immediate insights would give you a fast feedback loop: you’d know exactly what to do next to speed up your code. It’d be like having X-ray vision into your code’s performance in production.

How do you gain this level of insight? By changing your approach: instead of using an unrealistic environment to reproduce a problem you don’t understand, you observe the running software in its natural habitat. Then, once you understand the cause, reproducing the problem and then fixing it will be much easier.

In short, you need to measure performance in production, continuously:

  1. Measure performance. Performance bottlenecks are often in unexpected places, which is why measurement is critical to finding the real culprit.
  2. In production. Production isn’t just a realistic environment, it’s the real environment.
  3. Continuously. Performance measurement should be on by default, so the data is already there when a problem shows up.

Combine these three elements, and you will always have accurate performance information immediately available, whenever you need it.
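To make the idea concrete, here is a minimal sketch of always-on measurement using only the Python standard library. It is not Sciagraph’s API: the `measure` helper, the `measurements` dictionary, and the stage names are hypothetical, and a real profiler records far more detail. The sketch only shows the principle of recording both wall-clock time and CPU time for every stage, every run, so you can tell computing apart from waiting.

```python
import time
from contextlib import contextmanager

# Hypothetical helper, not part of Sciagraph: records wall-clock and CPU
# time for each named stage of a batch job.
measurements = {}

@contextmanager
def measure(stage):
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    try:
        yield
    finally:
        wall = time.perf_counter() - wall_start
        cpu = time.process_time() - cpu_start
        # Wall time not spent on the CPU is a rough proxy for waiting
        # (I/O, locks, sleeping).
        measurements[stage] = {
            "wall": wall,
            "cpu": cpu,
            "waiting": max(wall - cpu, 0.0),
        }

with measure("load"):
    time.sleep(0.2)  # stands in for a slow network or disk read

with measure("compute"):
    total = sum(i * i for i in range(200_000))  # stands in for a CPU-bound step

for stage, m in measurements.items():
    print(f"{stage}: wall={m['wall']:.3f}s cpu={m['cpu']:.3f}s waiting={m['waiting']:.3f}s")
```

In this sketch the "load" stage should show almost all of its time as waiting, while "compute" should be mostly CPU. That difference is exactly the kind of signal that tells you which fix to reach for.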

How you implement this will depend on your programming language, toolchain, and application domain. If you’re running Python data pipelines in production, and you want to find performance and memory bottlenecks, you want Sciagraph.

Sciagraph is a continuous profiler designed specifically for finding performance and memory problems in Python data pipelines running in production. It’s designed top-to-bottom to be fast and robust, so you can leave it on by default. And it’s designed specifically for this problem domain: long-running, resource-intensive batch jobs.

Let’s see what Sciagraph can tell you.

Ready to speed up your code? Try out Sciagraph for free today

Identify performance bottlenecks in calculations, data loading, and more

Sciagraph gives you a timeline showing where your threads spent their time: both on the CPU and waiting (for locks, I/O, etc.). Here you can see two threads fighting over the Python Global Interpreter Lock. Mouse over a frame to get more details.

You’ll have an easier time viewing this on a computer, with the window maximized; the output is not designed for phones!

Discover where and why you’re using too much memory

Sciagraph also reports peak memory usage. Here is the memory usage report for a different program; wider and redder means more memory usage. You can click on a frame to get a traceback.
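Why peak memory rather than memory at the end of the run? The following sketch uses the standard library’s `tracemalloc` module to illustrate; it is not how Sciagraph measures memory, and `build_table` is a hypothetical stand-in for a pipeline stage. The stage allocates a large temporary list and frees it before returning, so memory use after the call looks small even though the peak was high.

```python
import tracemalloc

def build_table(rows):
    # Hypothetical stage: allocates a temporary list of 1000-byte objects
    # that is freed as soon as the function returns.
    temp = [bytes(1000) for _ in range(rows)]
    return len(temp)

tracemalloc.start()
count = build_table(10_000)
# current: bytes still allocated now; peak: highest value seen since start().
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"rows={count} current={current / 1e6:.2f} MB peak={peak / 1e6:.2f} MB")
```

The peak is what determines whether your job fits in RAM or starts swapping, which is why a report focused on peak usage is the one that matters.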
