Known limitations and future plans
Design constraints: unlikely to change
Sciagraph is intended to be low-overhead enough to run in production, and focuses on data processing applications using realistic amounts of data. That means the information it provides will be less helpful when used outside of that domain.
Sciagraph uses sampling for memory allocations:
- Applications that only allocate small amounts of memory will not get useful results from Sciagraph. Try Fil or Memray instead.
- Small, one-off allocations will not be profiled.
- There may be some attribution errors or inaccuracies at the margin.
In general, these limitations shouldn’t be noticeable or meaningful for applications that allocate a few hundred MB or more. For example, lots of small allocations in one call location will be profiled if they happen sufficiently often.
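To build intuition for why many small allocations at one location still get profiled, here is a toy model of size-proportional allocation sampling. This is my own illustrative sketch, not Sciagraph's actual implementation; the sampling threshold and scaling scheme are assumptions.

```python
import random

# Toy model (not Sciagraph's real internals): record each allocation with
# probability proportional to its size, then scale recorded sizes back up
# so totals remain unbiased estimates.
SAMPLE_EVERY_BYTES = 1024 * 1024  # assumed: sample ~1 allocation per MB

class SamplingTracker:
    def __init__(self):
        self.recorded = {}  # call location -> estimated bytes

    def on_alloc(self, location, size):
        # Small one-off allocations are usually skipped, but many small
        # allocations at one location will reliably get sampled.
        p = min(1.0, size / SAMPLE_EVERY_BYTES)
        if random.random() < p:
            # Scale up by 1/p to estimate the true total for this location.
            self.recorded[location] = self.recorded.get(location, 0) + size / p

tracker = SamplingTracker()
# A single 1KB allocation: roughly 0.1% chance of being recorded at all.
tracker.on_alloc("tiny_once", 1024)
# 100,000 x 4KB allocations (~400MB) at one location: reliably estimated.
for _ in range(100_000):
    tracker.on_alloc("hot_loop", 4096)
print(tracker.recorded.get("hot_loop", 0))  # near 409,600,000, with noise
```

The scale-up by 1/p is what keeps the per-location totals accurate on average, even though individual small allocations are mostly skipped.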
Sciagraph uses sampling for performance profiling:

- Sampling is currently done about 20 times a second. Fast-running function calls won’t show up in the results.
- The profiling results may occasionally be inaccurate at the margins.
In general, jobs that run for more than a few seconds shouldn’t have any issues.
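You can estimate what 20 Hz sampling will and won't catch with a little arithmetic. The model below is my own back-of-the-envelope approximation, not a description of Sciagraph's scheduler: it assumes each call is seen with probability proportional to how long it is on the stack.

```python
# Rough model of what 20 Hz sampling can observe (an assumption for
# illustration, not Sciagraph internals).
SAMPLES_PER_SECOND = 20

def chance_seen_at_least_once(call_duration_s, num_calls):
    # Probability one call is hit by a sample, capped at certainty,
    # compounded over all calls (assumes short, independent calls).
    p_per_call = min(1.0, call_duration_s * SAMPLES_PER_SECOND)
    return 1 - (1 - p_per_call) ** num_calls

# A 1ms function called once: only a ~2% chance of appearing at all.
print(chance_seen_at_least_once(0.001, 1))
# The same function called 1,000 times (1 second of total runtime):
# virtually certain to appear, with its time attributed proportionally.
print(chance_seen_at_least_once(0.001, 1000))
```

This is why fast one-off calls vanish from the results, while anything that accounts for a meaningful share of a multi-second job shows up reliably.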
Memory profiling limits that might eventually be fixed
OpenBLAS allocates 32MB of memory per thread it starts, and its thread pool is sized based on the number of CPU cores, so this problem is worse on CPUs with many cores. If you don't actually use BLAS, e.g. via NumPy's linear algebra APIs, that memory won't be used in practice, and therefore shouldn't really count. Since the results are so misleading by default, Sciagraph omits memory allocated by OpenBLAS for now, until I can come up with a better solution.
If you are using linear algebra with NumPy or some other library that uses OpenBLAS, your only real way to control the resulting memory usage is to control how many threads OpenBLAS uses, e.g. with the OPENBLAS_NUM_THREADS environment variable or the threadpoolctl library.
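As a minimal sketch of the environment-variable approach: the variable must be set before OpenBLAS is loaded, i.e. before the first import of NumPy, because OpenBLAS only reads it at startup. Whether your NumPy build actually links against OpenBLAS depends on how it was installed.

```python
import os

# Cap OpenBLAS's thread pool *before* importing NumPy; the variable is
# only consulted when the OpenBLAS library initializes.
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import numpy as np

# Linear algebra now runs single-threaded, so OpenBLAS allocates its
# ~32MB per-thread buffer once, rather than once per CPU core.
a = np.ones((500, 500))
b = a @ a
print(b[0, 0])  # each entry is a sum of 500 ones, i.e. 500.0
```

The threadpoolctl library offers a more flexible alternative: it can limit thread pools at runtime and scope the limit to a `with` block, at the cost of an extra dependency.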
Other known issues:
- mremap() calls are not tracked yet. This API resizes existing memory allocations, and it is used less often than other memory allocation APIs; at worst this will result in inaccurate output.
- mmap()s and large calloc()s are counted as fully allocated even though they don't actually use RAM until written to; this is the underlying cause of the OpenBLAS issue mentioned above.
- Non-Python threads don’t show callstacks.
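The lazy-commit behavior behind the mmap()/calloc() issue above can be observed directly from Python using the standard library's mmap module. This sketch only illustrates the principle; measuring actual resident memory is platform-specific and omitted here.

```python
import mmap

# Reserve 256MB of address space with an anonymous mapping. The OS hands
# back virtual pages only; no physical RAM is committed yet, even though
# a profiler counting the mmap() call would see the full 256MB.
SIZE = 256 * 1024 * 1024
m = mmap.mmap(-1, SIZE)

# Touch only the first 1MB. Roughly that much RAM becomes resident,
# because the kernel backs pages with physical memory on first write.
m[: 1024 * 1024] = b"x" * (1024 * 1024)

m.close()
```

This gap between "bytes requested" and "bytes actually resident" is exactly why counting mmap()/calloc() at allocation time overstates OpenBLAS's real memory usage.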