Understanding Sciagraph's output

Getting started with Sciagraph:

  1. Installation
  2. Trying it out
  3. Understanding Sciagraph’s output
  4. What’s changed? Comparing two profiling reports

Sciagraph gives you both performance profiling and peak memory profiling information. As a reminder, we’re looking at the profiling output for the following program:

import json, sys, time, os
from threading import Thread


def deepcopy_list(l):
    return [i[:] for i in l]  # copy each sub-list (pure-Python CPU work)


def main(length):
    start = time.time()
    l = []
    for i in range(length):
        l.append(list(range(i)))  # build a list of increasingly long lists
    with open("output.json", "w") as f:
        s = json.dumps(l)  # serialize the whole list into one JSON string
        f.write(s)
        f.flush()
        os.fsync(f.fileno())  # force the data all the way to disk
    threads = [Thread(target=lambda: deepcopy_list(l)) for _ in range(3)]  # three copying threads
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("All done, elapsed: {:.2f} seconds".format(time.time() - start))


main(int(sys.argv[1]))

Understanding your job’s performance

In order to make your code faster, you need to understand where it’s spending its time. Sciagraph has two different views showing this information: a flamegraph, and a timeline.

The flamegraph: for single-threaded programs and a big-picture summary

Here’s the flamegraph output we get by profiling our example:

A flamegraph combines all the time spent in multiple threads and sub-processes into one big graph. Here’s how to interpret the flamegraph:

  • The wider or redder the frame, the higher the percentage of time that function was responsible for. Each line is an additional call in the callstack. (Redder frames being more meaningful is a Sciagraph-specific feature; it isn’t true of flamegraphs in general.)
  • A single callstack may combine the runtime of multiple threads, and some of that code may have run in parallel; you can’t tell from a flamegraph.
  • You can see if time was spent on computation or waiting (for disk, network I/O, locks, and so on), based on the bottom of each stack.
    • In this example, most of the computation time was spent in json.dumps(l).
    • Forcing a write to disk with os.fsync(f.fileno()) caused the program to wait as data got written to disk.
    • There’s also a bunch of waiting during deepcopy_list(), for reasons we’ll talk about below.
  • Click on a frame to see a zoomed in view of that part of the callstack. You can then click “Reset zoom” in the upper left corner to get back to the main overview.
  • Hover over a frame with your mouse to get additional details.

The benefits of flamegraphs:

  1. You get a big-picture overview of the whole program in one graph.
  2. If you’ve used other profilers like py-spy or austin, you’re already familiar with flamegraphs.

Flamegraphs have some downsides, however:

  1. You can’t see which code was running concurrently.
  2. Left/right order doesn’t mean anything, so you can’t tell what order the code ran in.

For single-threaded programs, the downsides usually don’t matter. In our case, however, the flamegraph can be a little misleading: because it hides concurrency, it’s easy to be confused about why this code is slower than it should be. In particular, it’s unclear why there’s so much waiting in deepcopy_list(), because the visualization combines three parallel threads into a single callstack.

The timeline: visualizing concurrency and order of execution

The timeline view has a different set of tradeoffs than the flamegraph: it’s more complex, but you get to see concurrency and order of execution.

If you ran the previous program, the resulting report includes a profiling result that looks like this, with timelines for 4 different threads. Unlike a flamegraph, order does matter—execution flows from left to right. And you can see what was happening in different threads (or subprocesses, where relevant) at the same time.


In this visualization, each thread has its own timeline, showing the different callstacks running as time passed from the start of the job to its finish. We can now make a better guess at why there was so much waiting in deepcopy_list(), as we see it switching back and forth across threads—it’s probably waiting on the GIL.
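If you want to sanity-check that guess outside of Sciagraph, a quick experiment (our own sketch, not part of Sciagraph’s report) is to time the three copies run sequentially versus in threads. With CPython’s GIL, only one thread can run Python bytecode at a time, so the threaded version shouldn’t be meaningfully faster:

# Standalone sketch (not from Sciagraph): time the same pure-Python copying
# work done sequentially and in three threads.
import time
from threading import Thread


def deepcopy_list(l):
    return [i[:] for i in l]


data = [list(range(i)) for i in range(2000)]

start = time.time()
for _ in range(3):
    deepcopy_list(data)
print("Sequential: {:.2f} seconds".format(time.time() - start))

start = time.time()
threads = [Thread(target=lambda: deepcopy_list(data)) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("Threaded:   {:.2f} seconds".format(time.time() - start))

On CPython the two timings come out roughly the same, which matches what the timeline shows: the threads spend much of their time waiting for the GIL rather than copying in parallel.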

More details:

  • The wider or redder the frame, the higher the percentage of time that function was responsible for. Each line is an additional call in the callstack.
  • The graph is interactive. Hover the mouse pointer over a frame to see what percentage of time was spent there, and the untruncated text.
  • To optimize your code, focus on the wider and redder frames. These are the frames where most of the time was spent.

The result is sampled: Sciagraph picks up to 500-1000 samples, spread more or less evenly across the job’s execution. That means the profile may be a little inaccurate, and the elapsed time might not match reality exactly. For the purpose of finding slow parts of the code, those inaccuracies shouldn’t matter: the longer a piece of code ran, the more samples it got and the more accurate its data, and the longer-running lines of code are the ones we care about.
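To get an intuition for why longer-running code is measured more accurately, here’s some back-of-the-envelope arithmetic with made-up numbers (Sciagraph’s actual sampling schedule may differ): with about 1,000 samples over a 60-second job, each sample stands in for roughly 60 ms of runtime, so a frame that appears in only a couple of samples has a large relative error, while a frame covering hundreds of samples is measured quite precisely.

# Back-of-the-envelope arithmetic with assumed numbers (a 60-second job and
# ~1000 samples); Sciagraph's real sampling behavior may differ.
job_seconds = 60
total_samples = 1000
ms_per_sample = job_seconds * 1000 / total_samples
print("Each sample represents about {:.0f} ms of runtime".format(ms_per_sample))

for samples_in_frame in (2, 50, 500):
    share = samples_in_frame / total_samples
    print("{:4d} samples -> ~{:.1%} of the run (~{:.1f} seconds)".format(
        samples_in_frame, share, share * job_seconds))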

Having found the source of slowness, you can then go and speed up your code.
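For example, once the timeline shows that the GIL keeps the three copies from running in parallel, one option (a sketch of ours, assuming the copies don’t need to overlap with any other work) is to drop the threads and do the copies in a plain loop; the wall-clock time stays roughly the same and the program gets simpler:

# Possible simplification (a sketch, not a Sciagraph recommendation): since the
# GIL serializes the pure-Python copies anyway, run them without threads.
copies = [deepcopy_list(l) for _ in range(3)]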

Understanding your program’s memory usage

Sciagraph also measures your job’s peak memory, since that is what determines how much RAM you need available. Here’s what the memory profile looks like:

What does this mean? This is a slightly different flamegraph: instead of showing elapsed time, it’s showing peak memory for the whole process (and any subprocesses). Specifically, it shows the callstacks responsible for memory allocations at the point in time when memory usage was highest. The wider or redder the frame, the higher the percentage of memory that function was responsible for.

Remember, in a flamegraph, the order of frames is not meaningful. Unlike the timeline, which showed time moving from left to right, in this visualization left vs. right is arbitrary (at the moment it sorts alphabetically). Even if the order happens to match runtime order, that is accidental.

In this case, most of the memory was allocated by l.append(list(range(i))). Smaller amounts of peak memory were due to other code paths.

Once you know the source of the memory allocations at the moment of peak memory usage, you can then go and reduce memory usage.
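For instance, in this program json.dumps(l) builds the entire JSON document as one string (s) before writing it; one possible tweak (our own sketch, assuming that string is a meaningful part of the peak) is to stream the output with json.dump() instead, so the intermediate string is never created:

# Sketch of one possible reduction (ours, not from the Sciagraph docs): replace
# the json.dumps() + f.write() block inside main() with a streaming json.dump().
import json, os


def write_output(l, path="output.json"):  # hypothetical helper, not in the original program
    with open(path, "w") as f:
        json.dump(l, f)  # encodes and writes in chunks; no full string in memory
        f.flush()
        os.fsync(f.fileno())

Whether this helps much depends on where the peak actually is; in this report the list itself dominates, so trimming the serialization only shaves off one of the smaller contributions.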

Next steps

Read one of the following to get going in production: