Search, browse, and track profiling reports
Sciagraph currently does not have a built-in way to search or browse profiling reports. Either you store the reports yourself, or you rely on Sciagraph’s cloud storage, in which case you can download a specific report using the logged download and decryption keys.
What if you want to find which jobs were extra slow, so you can look at their profiling reports, or compare a report from last week to the latest one? You still have options.
- Already using OpenTelemetry to record tracing information about your job? You can use Sciagraph’s OpenTelemetry integration to include the download instructions in those traces.
For more options, see below.
When you only have a few jobs a day
If you’re only running a handful of jobs a day, you could for example send information about each job to an internal Slack channel. Then you just have to scroll back through the Slack channel’s history to find a particular job.
Customizing where download instructions go is covered later in this document, but eventually I expect to include some built-in integrations. If you want a specific integration, e.g. Slack, let me know.
When you have many jobs: using a logging/tracing storage and search backend
Once you’re running many jobs, a Slack message or email per job won’t scale. You need some way to store and search job information. Thing is, you probably should also be logging more general information about your jobs to help with debugging and performance analysis. Logging can give you information that profiling can’t, so you want to do both, ideally with tracing-based logging (see below).
For example, you could log:
- The job’s elapsed time.
- The version of the code that was used, and the versions of dependencies and libraries.
- The inputs to the job, or at least metadata that will help you reproduce them.
- Intermediate states of your code, to help debug problems.
- The outputs of the job, or metadata to help retrieve them.
Finally, Sciagraph will also log information about how to download a profiling report.
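As a minimal sketch using only the standard library (the logger name and the metadata fields are illustrative, not anything Sciagraph requires), you might log that information as a structured record:

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("myjob")  # hypothetical logger name

start = time.monotonic()
# ... the job itself would run here ...
metadata = {
    "elapsed_seconds": time.monotonic() - start,
    "code_version": "v1.2.3",          # e.g. a git tag or commit hash
    "input_path": "s3://bucket/data",  # or metadata to reproduce the inputs
}
# Logging the metadata as JSON makes it easy to index in a search backend:
logger.info("job finished: %s", json.dumps(metadata, sort_keys=True))
```

Emitting the metadata as a single JSON payload, rather than free-form text, is what makes it searchable later.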
If you log this information into some sort of searchable backend, you can then find specific jobs based on some criteria (“a job from last week that used an old version of the code”), look at the logs, and also look at the download instructions in those same logs.
Tracing
Tracing is an improved form of logging that lets you store information as a tree of spans or actions; see this article on profiling vs. logging.
OpenTelemetry is a standard API for, among other things, tracing, with a Python library available. Once you’re using OpenTelemetry to record information about your job, you can send it to many different open source tools or commercial backends, like Jaeger, Honeycomb, DataDog, and many others. These backends will typically both store the tracing logs and provide a GUI for searching and browsing, with varying degrees of sophistication.
Once you are storing tracing info into such a backend, you can search through the information to find relevant jobs, for example outlier jobs that are unexpectedly slow.
By using Sciagraph’s OpenTelemetry integration, when you find a particular job, the logs will also tell you how to download the Sciagraph profiling report.
Not using tracing yet?
Setting it up is pretty easy. For example, if you want to use the Honeycomb SaaS to store traces, you can:
- Sign up for an account at https://honeycomb.io; their free tier may well suffice.
- Look at their Python integration guide to see the basics of how it’s integrated. You don’t want to use sampling; that’s only necessary when sending huge volumes of tracing data, which is much more common when scaling web applications with many users sending many requests.
- Follow this tutorial for basic integration with batch jobs, and how to find slow outliers.
- Add two lines of code to also include Sciagraph download instructions in the tracing, by using the OpenTelemetry integration.
This may take as little as 30 minutes; it’s all pretty quick.
At this point you can also start recording more information in your traces to help you debug problems and identify issues (performance or otherwise).
Customizing where the download instructions go
Sciagraph uses the Python logging library to record the download instructions. logging is highly customizable, so you can override where Sciagraph’s download instructions log message goes. For example, you can write a custom logging.Handler that sends messages to Slack.

The download instructions are logged to a "sciagraph" logging.Logger. The actual log message includes a sciagraph.api.ReportResult object with details on how to download the report.

Here’s a custom logging.Handler that logs the report download instructions to a JSON file:
```python
import logging
import json


class SciagraphReportHandler(logging.Handler):
    def __init__(self, path):
        logging.Handler.__init__(self)
        self._path = path

    def emit(self, record):
        from sciagraph.api import ReportResult

        if isinstance(record.msg, ReportResult):
            with open(self._path, "w") as f:
                json.dump(
                    {
                        "download_instructions": record.getMessage(),
                        "job_id": record.msg.job_id,
                    },
                    f,
                )
        else:
            print(record.getMessage())


logging.getLogger("sciagraph").addHandler(
    SciagraphReportHandler("./result.json")
)
```
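In the same vein, here’s a sketch of a handler that posts the download instructions to a Slack incoming webhook. The webhook URL is a placeholder you’d create yourself in Slack; the payload shape (a JSON body with a "text" field) follows Slack’s incoming-webhook format:

```python
import json
import logging
import urllib.request


class SlackReportHandler(logging.Handler):
    """Sketch: post Sciagraph download instructions to a Slack webhook."""

    def __init__(self, webhook_url):
        logging.Handler.__init__(self)
        self._webhook_url = webhook_url

    def format_payload(self, record):
        # Slack incoming webhooks accept a JSON body with a "text" field:
        return json.dumps({"text": record.getMessage()}).encode("utf-8")

    def emit(self, record):
        request = urllib.request.Request(
            self._webhook_url,
            data=self.format_payload(record),
            headers={"Content-Type": "application/json"},
        )
        try:
            urllib.request.urlopen(request)
        except OSError:
            self.handleError(record)


# Register it just like the JSON handler, with your own webhook URL:
# logging.getLogger("sciagraph").addHandler(
#     SlackReportHandler("https://hooks.slack.com/services/...")
# )
```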
The sciagraph.api.ReportResult object has the following fields, all of them strings unless noted otherwise:
- job_time: The time the job was started.
- job_id: The job ID.
- download_key: The first argument to python -m sciagraph_report download.
- decryption_key: The second argument to python -m sciagraph_report download.
- peak_memory_kb: The peak allocated memory for the job, in kibibytes (1 KiB == 1024 bytes), as an integer.
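For instance, a handler could pull these fields off the object and reconstruct the exact download command. A sketch; the field names come from the list above, but the stand-in dataclass and sample values are just for illustration:

```python
from dataclasses import dataclass


# Stand-in with the same fields as sciagraph.api.ReportResult, for illustration:
@dataclass
class ReportResult:
    job_time: str
    job_id: str
    download_key: str
    decryption_key: str
    peak_memory_kb: int


def download_command(result: ReportResult) -> str:
    # The download key is the first argument, the decryption key the second:
    return (
        f"python -m sciagraph_report download "
        f"{result.download_key} {result.decryption_key}"
    )


example = ReportResult("2023-01-01T00:00:00", "job-1", "DKEY", "EKEY", 2048)
print(download_command(example))
print(example.peak_memory_kb / 1024)  # peak memory in mebibytes
```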