Speed up your Python data processing workflows with the Sciagraph profiler
Slow-running jobs waste your time during development, impede your users, and increase your compute costs.
Speed up your code and you’ll iterate faster, have happier users, and stick to your budget—but first you need to identify the cause of the problem. That’s where profilers come in: tools that will help you find speed and memory bottlenecks in your code.
But if you’re working on data science, scientific computing, or other data processing jobs, most profilers just aren’t a good fit.
You’re not writing a web application
Profilers that work well for web applications don’t necessarily work as well for data processing. A different domain requires different approaches:
- Measurement: Web apps can suffer from memory leaks, so measuring change in memory usage is sufficient. But when your memory usage is a result of data processing, you need a different measure: sources of peak memory use.
- Visualization: Industry-standard flamegraph visualizations can’t display concurrency or order of execution. That’s fine for web applications, but for batch jobs that’s not enough. Are your workers idle? In which phase of the program is your bottleneck? You need a timeline visualization broken down by thread and process.
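To make the measurement point concrete, here is a minimal sketch using Python’s standard-library `tracemalloc` module. It is not how Sciagraph measures memory, and the `process` and `measure` functions are hypothetical names made up for illustration. The job builds a large temporary list and frees it before returning, so a leak-style measurement (memory still held afterwards) sees almost nothing, while the peak is far higher.

```python
import tracemalloc


def process(n: int) -> int:
    """Hypothetical batch step: build a large temporary list, then reduce it."""
    values = [i * 2 for i in range(n)]  # large temporary allocation
    return sum(values)  # the list is freed once the function returns


def measure(n: int = 200_000) -> tuple[int, int, int]:
    """Return (result, bytes still allocated afterwards, peak bytes)."""
    tracemalloc.start()
    try:
        result = process(n)
        # get_traced_memory() returns (current, peak) in bytes.
        current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, current, peak


if __name__ == "__main__":
    result, leftover, peak = measure()
    print(f"still allocated after the job: {leftover:,} bytes")
    print(f"peak during the job:           {peak:,} bytes")
```

If this job ran out of memory, the leftover number would tell you nothing useful; the peak is the number that decides whether the job fits on the machine.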
Sciagraph is a profiler that gives you deep visibility into your Python code’s speed and memory usage—with a focus on data processing workflows, in both measurement and visualizations.
Profile on your laptop—and in production!
Reproducing production performance problems on your laptop is often difficult, and sometimes impossible. And even if you can measure in production, reproducing a problem after the fact is not easy.
Unfortunately, many profilers are simply not designed to run in production. And those that are designed to run in production focus on—you guessed it!—web applications.
That’s why you can use Sciagraph both during development, and optionally in production, with always-on continuous profiling:
- Runs with low overhead, so you can leave it always on by default. No more after-the-fact scrambling to reproduce the problem!
- Designed to be easy to set up and highly reliable with real workloads.
- Securely store profiling reports in the cloud so you don’t need to configure storage yourself.
Ready to speed up your code? Get started with Sciagraph’s free plan!
Identify performance bottlenecks in calculations, data loading, and more
Sciagraph gives you a timeline showing where your threads spent their time: both running on the CPU and waiting for locks, network communication, filesystem reads and writes, and so on.
Note: You’ll have an easier time viewing this on a computer, with the window maximized; the output is not designed for phones!
Above you can see the profiling report for a program that reads in some text, splits it into words, filters out certain words, and writes the result to JSON. This is a typical structure for data processing workflows: load the inputs → process the data → write out the output.
Wider and redder frames mean more time was spent in that part of the program. Hover your mouse over a frame to see the text; real reports also include zoom functionality.
In this example, you can see that:
- Reading the data was fast enough not to show up in the profiling.
- Processing the data, in this case filtering the words, was pretty CPU-intensive.
- Writing the data to disk was slow, but not because of CPU: it involved a lot of waiting. In this case, it’s because the program was writing to a remote filesystem.
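If you want a rough feel for the difference between CPU time and waiting, here is a small sketch of a program with the same load → process → write shape. It is not the program behind the report, and it is no substitute for a profiler: the `phase` and `run` helpers, the `STOPWORDS` set, and the `write_delay` parameter (a stand-in for a slow remote filesystem) are all assumptions made up for illustration. Each phase records wall-clock time and CPU time; a phase whose wall-clock time is much larger than its CPU time was mostly waiting.

```python
import json
import time
from contextlib import contextmanager

STOPWORDS = {"the", "a", "of"}  # hypothetical list of words to filter out


@contextmanager
def phase(name, timings):
    """Record wall-clock and CPU seconds spent inside the with-block."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    try:
        yield
    finally:
        timings[name] = {
            "wall": time.perf_counter() - wall_start,
            "cpu": time.process_time() - cpu_start,
        }


def run(in_path, out_path, write_delay=0.0):
    """Load text, filter words, write JSON; return (kept_words, timings)."""
    timings = {}
    with phase("load", timings):
        with open(in_path) as f:
            words = f.read().split()
    with phase("process", timings):
        kept = [w for w in words if w.lower() not in STOPWORDS]
    with phase("write", timings):
        time.sleep(write_delay)  # simulated wait; uses almost no CPU
        with open(out_path, "w") as f:
            json.dump(kept, f)
    return kept, timings


if __name__ == "__main__":
    import os
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "input.txt")
        with open(src, "w") as f:
            f.write("the quick brown fox of a story " * 10_000)
        _, timings = run(src, os.path.join(tmp, "out.json"), write_delay=0.5)
        for name, t in timings.items():
            print(f"{name:8s} wall={t['wall']:.3f}s cpu={t['cpu']:.3f}s")
```

Note that `time.process_time()` counts CPU time for the whole process, so this sketch only makes sense for single-threaded code. It also can’t tell you which line of code is responsible, and that is the gap a timeline broken down by thread is meant to fill.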
Discover where and why you’re using too much memory
Sciagraph also reports peak memory usage, the high water mark:
You can click on a frame to zoom in on a stack trace; wider and redder frames mean more memory usage.
This example shows the memory usage report for the same program. There are three main sources of memory usage, the most significant being parsing the input file into words.
- Multiprocessing support: Profile single-threaded, multi-threaded, and multiprocessing workflows.
- Fast setup: For simple Python processes, using Sciagraph may be as easy as setting an environment variable. And with MLflow and Celery integration built in, and other framework support planned, Sciagraph is designed to work out of the box with the frameworks you use. Need support for another framework? Send me an email to get it prioritized.
Optional support for continuous, always-on profiling in production:
- Fast and robust enough to run in production: Sciagraph is designed to minimize impact on your job’s performance, so you won’t even notice it’s there until you need it.
- Cloud storage for reports: No need to spend time figuring out how to store profiling reports: they can be automatically and securely stored in the cloud. That means you can easily access performance reports even if your runtime environment is ephemeral, like a container.
- Data privacy: Profiling reports never include user data, and when they’re uploaded to Sciagraph’s cloud storage they are encrypted end-to-end, so no one but you can access them.