Sciagraph: Performance observability for Python data pipelines
Your Python data processing pipeline is too slow—now what?
You’re running a data processing batch job written in Python, and you need it to be faster. A whole lot faster.
- Your users need results, but you can’t deliver on time.
- Your jobs time out—or run out of memory—and have to be re-run.
- Your cloud computing bill is in the stratosphere.
- You’re iterating on the implementation—but your experiments take too long, because your code is too slow.
The first step to speeding up your code: identify your code’s performance bottlenecks. But that’s harder than you’d like.
Production is different from your laptop, from CPUs to memory to disk speed to network latency. Is loading from S3 a problem? Are you swapping? Is your program running with less parallelism even though more CPUs are available? Plus, performance problems with production data won’t necessarily show up when using test data.
How do you get accurate performance diagnostics, and quickly?
The most accurate performance data comes from production
The best place to get an accurate understanding of performance bottlenecks is by observing production. But how do you get that data?
By running all your production jobs with performance observability enabled from the start, by default. Unfortunately, existing alternatives are not sufficient:
- Tracing tools like Honeycomb and OpenTelemetry are useful, but require you to manually instrument your code; they can’t give you the detailed performance coverage of all your code—for that you need sampling-based performance profiling.
- Most performance and memory profilers are not designed to run constantly in production.
- Those that are designed for production are usually intended for web applications, not data pipelines.
You need a system designed for batch jobs, with low performance overhead, robust enough to run in production, and integrated with the frameworks you use. And that’s where Sciagraph comes in.
Ready to speed up your code? Try out Sciagraph today
Identify performance bottlenecks in calculations, data loading, and more
Sciagraph gives you a timeline showing where your threads spent their time: both CPU and waiting (for locks, I/O, etc.). Here you can see two threads fighting over the Python Global Interpreter Lock—mouseover a frame to get more details.
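As a sketch of the kind of contention that timeline can surface (this toy program is my own illustration, not from Sciagraph’s documentation): two CPU-bound threads in CPython spend much of their wall-clock time waiting for the Global Interpreter Lock instead of computing in parallel.

```python
import threading
import time

def cpu_bound(n):
    # Pure-Python arithmetic holds the GIL, so two such threads
    # mostly take turns running rather than executing in parallel.
    total = 0
    for i in range(n):
        total += i * i
    return total

results = []
threads = [
    threading.Thread(target=lambda: results.append(cpu_bound(2_000_000)))
    for _ in range(2)
]

start = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start

print(f"Two CPU-bound threads took {elapsed:.2f}s")
```

Run under a profiler like Sciagraph, both threads would show up as spending large stretches blocked on the GIL rather than on the CPU.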
You’ll have an easier time viewing this on a computer, with the window maximized; the output is not designed for phones!
Discover where and why you’re using too much memory
Sciagraph also reports peak memory usage. Here is the memory usage report for a different program; wider and redder means more memory usage. You can click on a frame to get a traceback.
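For a rough feel of what “peak memory” means here, independent of Sciagraph, the standard library’s `tracemalloc` reports a single peak number; Sciagraph’s report adds the per-callstack attribution on top of that kind of measurement. This is a sketch using stdlib tooling only, not Sciagraph’s API:

```python
import tracemalloc

def build_table(rows):
    # Every row is a list of floats, and the whole table is
    # alive at once, which is what drives up peak memory.
    return [[float(i) for i in range(100)] for _ in range(rows)]

tracemalloc.start()
table = build_table(10_000)
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"peak traced memory: {peak / 1e6:.1f} MB")
```

In a report like Sciagraph’s, the allocation inside `build_table` would appear as a wide, red frame with a clickable traceback.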
As easy to use as setting an environment variable
Just two extra lines in your Dockerfile (plus passing in the access token at runtime) may be all you need to start getting speed and memory use insights:
```dockerfile
FROM python:3.10-slim
COPY . .
RUN pip install -r requirements.txt

# Start profiling with Sciagraph!
RUN pip install sciagraph
ENV SCIAGRAPH_MODE=process

ENTRYPOINT ["python", "yourprogram.py"]
```
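Outside of Docker, the same setup is an install plus the environment variable; the access-token configuration mentioned above is omitted here, so check the Sciagraph docs for that step:

```shell
pip install sciagraph
SCIAGRAPH_MODE=process python yourprogram.py
```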
Performance observability for data science, scientific computing, and other data processing tasks
- Integrates with your tools: With MLFlow and Celery integration built-in, and other framework support planned, Sciagraph is designed to work with the frameworks you use. (Need support for a specific framework? Send me an email to get it prioritized.)
- For batch jobs: You’re not running a web application: you’re running a long-running data processing job that has a beginning, middle, and end. That means you need performance visualizations that match how your software runs.
- Fast and robust enough to run in production: Sciagraph is designed to minimize impact on your job’s performance.
- Cloud storage for reports: Profiling reports can be securely stored in the cloud so you can easily access performance reports even if your runtime environment is ephemeral, like a container.
- Data privacy: Profiling reports never include user data, and when they’re uploaded to Sciagraph’s cloud storage they are encrypted end-to-end, so no one but you can access them.