strace — Read the Syscalls Your Logs Won't Show

Your logs tell you what your code thinks it did. Your stack trace tells you where the code believed it was. strace tells you what the kernel actually did — every openat(), every read(), every connect() that returned -1 while your application swallowed the error and kept lying to you. When the gap between intent and reality is the bug, re-reading the source will not close it. You have to observe the boundary where userspace ends and the kernel begins.

Most engineers know strace -p <pid> as a panic button. Few actually read the output, and fewer reach for -f, -c, or -k when it counts. That gap is the difference between staring at a wall of syscalls and extracting a root cause in ninety seconds. Treating strace as a first-class instrument is a direct extension of hypothesis-driven debugging methodology: you form a precise claim about what the process is doing and let the evidence confirm or kill it.

What strace actually observes

strace is built on ptrace(2), the same kernel facility debuggers use. It attaches to a target and stops it on every transition across the syscall boundary — once on entry, once on exit — reading the registers to decode the syscall number, its arguments, and its return value. That is the whole model, and it explains the cost: every syscall now pays for two extra context switches into the tracer. On a syscall-heavy workload the target can slow by an order of magnitude. This is not a footnote — it is the reason strace belongs in diagnosis, never in a hot path or a latency-sensitive production service.
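The slowdown is easy to measure rather than take on faith. The following is a minimal sketch, not a benchmark harness: it times a command once untraced and once under strace and reports the ratio. The helper names (wall_seconds, tracing_slowdown) and the stat-heavy workload are illustrative assumptions, and -o /dev/null is used only to discard the trace so that printing it does not dominate the measurement.

```python
from __future__ import annotations

import shutil
import subprocess
import sys
import time


def wall_seconds(argv: list[str]) -> float:
    """Run argv to completion and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(
        argv,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,  # a non-zero exit still gives a valid timing
    )
    return time.perf_counter() - start


def tracing_slowdown(argv: list[str]) -> float | None:
    """Return traced time divided by untraced time, or None without strace."""
    if shutil.which("strace") is None:
        return None
    baseline = wall_seconds(argv)
    # -f follows children; -o /dev/null throws the trace away.
    traced = wall_seconds(["strace", "-f", "-o", "/dev/null", "--", *argv])
    return traced / baseline if baseline > 0 else None


if __name__ == "__main__":
    # Hypothetical syscall-heavy workload: a few thousand stat() calls.
    workload = [sys.executable, "-c", "import os\nfor _ in range(5000): os.stat('.')"]
    ratio = tracing_slowdown(workload)
    print("strace not found" if ratio is None else f"slowdown: {ratio:.1f}x")
```

A single run is noisy; repeating the measurement and comparing medians gives a steadier number, and the ratio will vary widely with how syscall-bound the workload is.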

Reading a trace line — the part nobody teaches

Every line has the same shape: syscall(arguments) = return_value. The return value is where the truth lives. A successful call returns a file descriptor, a byte count, or 0; a failure returns -1 followed by the symbolic errno and its description. That errno is the signal you came for.

openat(AT_FDCWD, "/etc/app/config.yaml", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/app/config.yaml", O_RDONLY) = 3
read(3, "host: 127.0.0.1\nport: 6379\n", 4096) = 27
connect(4, {sa_family=AF_INET, sin_port=htons(6379), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 ECONNREFUSED (Connection refused)

Four lines, two findings, zero guesswork: the app probed a config path that does not exist before falling back to the real one, and the Redis connection was refused outright. No log statement told you either fact; the syscalls did. One errno that does not appear above is still worth memorising: EFAULT. It means a syscall was handed a pointer to an address outside the process's accessible address space. The kernel catches the bad address and returns an error; the same bad address dereferenced directly in userspace faults in the MMU, and the kernel delivers SIGSEGV, a segmentation fault.
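The syscall(arguments) = return_value shape is regular enough to pull apart mechanically. Below is a rough sketch, assuming the default single-line output format; the names TraceLine, parse_line, and failures are made up for illustration. It deliberately ignores multi-line constructs such as unfinished and resumed calls, and a string argument that itself contains ") = " could confuse it.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

_LINE = re.compile(
    r"^(?:\[pid\s+\d+\]\s+)?"                 # optional prefix added by -f
    r"(?P<name>\w+)\((?P<args>.*)\)"          # syscall name and raw argument text
    r"\s+=\s+(?P<ret>0x[0-9a-f]+|-?\d+|\?)"   # return value (hex, decimal, or ?)
    r"(?:\s+(?P<errno>E[A-Z0-9]+)\s+\((?P<desc>[^)]*)\))?"  # errno on failure
)


@dataclass(frozen=True)
class TraceLine:
    name: str
    args: str
    ret: str
    errno: str | None

    @property
    def failed(self) -> bool:
        return self.errno is not None


def parse_line(line: str) -> TraceLine | None:
    """Parse one strace output line; return None for anything unrecognised."""
    m = _LINE.match(line.strip())
    if m is None:
        return None  # signals, exit markers, unfinished/resumed calls, etc.
    return TraceLine(m["name"], m["args"], m["ret"], m["errno"])


def failures(trace: str) -> list[TraceLine]:
    """Return only the calls that came back with an errno."""
    parsed = (parse_line(ln) for ln in trace.splitlines())
    return [t for t in parsed if t is not None and t.failed]


# The four lines from the example above, kept verbatim in a raw string.
SAMPLE = r"""openat(AT_FDCWD, "/etc/app/config.yaml", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/app/config.yaml", O_RDONLY) = 3
read(3, "host: 127.0.0.1\nport: 6379\n", 4096) = 27
connect(4, {sa_family=AF_INET, sin_port=htons(6379), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 ECONNREFUSED (Connection refused)
"""

if __name__ == "__main__":
    for call in failures(SAMPLE):
        print(f"{call.name} failed with {call.errno}")
```

Filtering on the errno first, and reading the arguments second, mirrors how the trace is best read by eye: find the failures, then ask what was being attempted.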

The flags that separate signal from noise

-f: follow forks and threads. The moment a process calls fork() or spawns a worker pool, an unadorned trace goes blind to the children. Almost every real service needs this flag.
-e trace=…: filter by class instead of drowning. -e trace=%network, -e trace=%file, and -e trace=%memory cut the stream to the syscalls you care about.
-c: aggregate instead of stream. Produces a summary table of time, call count, and errors per syscall; the time column is system CPU time by default, and adding -w switches it to wall-clock time. This is your profiler the moment you suspect a process is syscall-bound.
-T / -tt: time spent inside each syscall / absolute timestamps with microsecond precision. Together they turn “it’s slow” into “it spends 600 ms in fsync”.
-y / -yy: decode descriptors, so read(3, …) becomes read(3</usr/share/app/config.yaml>, …) and sockets show their endpoints.
-k: attach a userspace stack trace to each syscall, bridging “which syscall” to “which line of code”. It works only if your strace build includes stack-unwinding support.
-s N: raise the printed string length (the default of 32 characters truncates payloads).
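In scripts it can help to assemble these flags in one place instead of retyping them. The helper below is a small sketch under that assumption: strace_argv and its keyword names are hypothetical, and it only builds an argument list from the flags listed above (plus -p for attaching to a PID), without running anything.

```python
from __future__ import annotations

import shlex


def strace_argv(
    target: list[str] | int,
    *,
    follow: bool = True,
    trace: str | None = None,
    timing: bool = False,
    decode_fds: bool = False,
    stacks: bool = False,
    strsize: int | None = None,
) -> list[str]:
    """Build an strace command line for a command (list) or a PID (int)."""
    argv = ["strace"]
    if follow:
        argv.append("-f")                 # follow forks and threads
    if trace:
        argv += ["-e", f"trace={trace}"]  # e.g. "%network" or "openat,read"
    if timing:
        argv += ["-T", "-tt"]             # per-call duration plus timestamps
    if decode_fds:
        argv.append("-yy")                # paths and socket endpoints for fds
    if stacks:
        argv.append("-k")                 # needs a build with unwinding support
    if strsize is not None:
        argv += ["-s", str(strsize)]      # printed string length
    if isinstance(target, int):
        argv += ["-p", str(target)]       # attach to a running process
    else:
        if not target:
            raise ValueError("empty command")
        argv += ["--", *target]           # run the command under strace
    return argv


if __name__ == "__main__":
    # Hypothetical invocation: network syscalls of a curl run, with timings.
    cmd = strace_argv(["curl", "example.com"], trace="%network", timing=True)
    print(shlex.join(cmd))
```

Returning a list rather than a string keeps the result safe to hand to subprocess.run without a shell; shlex.join is used only for display.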

The summary mode also answers “why is this slow?” without a dedicated profiler. Run any command under strace -f -c and the table ranks syscalls by cumulative time — the same technique that turns a sluggish terminal into a fixable list of offending stat() and openat() calls when you profile shell startup and .bashrc performance.

From raw trace to an actionable summary

Reading -c output by eye is fine once. When you want it in CI, in a regression gate, or aggregated across runs, parse it. The wrapper below runs a command under strace -f -c, captures the summary from stderr, parses it into typed records, and returns the syscalls that dominate time. It handles the obvious failure modes — strace missing, the target hanging, a malformed table — instead of assuming the happy path.

from __future__ import annotations

import shutil
import subprocess
from dataclasses import dataclass
from typing import Final

_HEADER_TOKENS: Final[tuple[str, ...]] = ("seconds", "calls", "syscall")


@dataclass(frozen=True, slots=True)
class SyscallStat:
    name: str
    seconds: float
    calls: int
    errors: int

    @property
    def usec_per_call(self) -> float:
        return (self.seconds / self.calls) * 1_000_000 if self.calls else 0.0


class StraceError(RuntimeError):
    """Raised when strace cannot be run or its summary cannot be parsed."""


def profile_syscalls(cmd: list[str], *, timeout: float = 30.0) -> list[SyscallStat]:
    """Run *cmd* under ``strace -f -c`` and return per-syscall stats, busiest first.

    Raises:
        StraceError: strace is missing, the command times out, or the summary
            table cannot be located in strace's stderr.
    """
    if not cmd:
        raise StraceError("empty command")
    if shutil.which("strace") is None:
        raise StraceError("strace is not installed or not on PATH")

    try:
        proc = subprocess.run(
            ["strace", "-f", "-c", "--", *cmd],
            capture_output=True,
            text=True,
            timeout=timeout,
            check=False,  # the traced command may legitimately exit non-zero
        )
    except subprocess.TimeoutExpired as exc:
        raise StraceError(f"traced command exceeded {timeout:.0f}s") from exc

    return _parse_summary(proc.stderr)


def _parse_summary(stderr: str) -> list[SyscallStat]:
    lines = stderr.splitlines()
    header = next(
        (i for i, ln in enumerate(lines) if all(t in ln for t in _HEADER_TOKENS)),
        None,
    )
    if header is None:
        raise StraceError("no summary table found; was -c passed to strace?")

    stats: list[SyscallStat] = []
    for line in lines[header + 1:]:
        if not line.strip() or line.lstrip().startswith("-"):
            continue
        cols = line.split()
        if cols[-1] == "total":
            break  # the total row terminates the table
        if len(cols) < 5:
            continue
        try:
            # cols: %time seconds usecs/call calls [errors] syscall
            seconds = float(cols[1])
            calls = int(cols[3])
            errors = int(cols[4]) if len(cols) >= 6 and cols[4].isdigit() else 0
        except (ValueError, IndexError):
            continue  # skip anything that is not a data row
        stats.append(SyscallStat(name=cols[-1], seconds=seconds, calls=calls, errors=errors))

    if not stats:
        raise StraceError("summary table present but no rows could be parsed")
    return sorted(stats, key=lambda s: s.seconds, reverse=True)


if __name__ == "__main__":
    import sys

    if len(sys.argv) < 2:
        raise SystemExit("usage: profile.py COMMAND [ARGS...]")

    for stat in profile_syscalls(sys.argv[1:]):
        marker = "  (has errors)" if stat.errors else ""
        name = stat.name.ljust(18)
        calls = str(stat.calls).rjust(8)
        print(f"{name}{stat.seconds:9.4f}s {calls} calls  {stat.usec_per_call:8.1f} us/call{marker}")

It is deliberately defensive: a missing binary, a timeout, and an unparseable table each raise a typed StraceError rather than returning silently wrong data. The check=False is intentional — the program you are tracing is allowed to exit non-zero; that is frequently the entire reason you are tracing it.
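For the regression-gate use case, the parsed stats still need something to be compared against. One possible shape, sketched here with made-up budget numbers and a hypothetical budget_violations helper, is a per-syscall time budget: feed it a mapping built from the stats (for example {s.name: s.seconds for s in profile_syscalls(cmd)}) and fail the build if anything comes back.

```python
from __future__ import annotations

from collections.abc import Mapping


def budget_violations(
    measured: Mapping[str, float],
    budget: Mapping[str, float],
) -> list[str]:
    """Describe every budgeted syscall whose measured seconds exceed its limit.

    Syscalls absent from `measured` count as 0.0 seconds; syscalls absent
    from `budget` are not checked at all.
    """
    violations: list[str] = []
    for name, limit in sorted(budget.items()):
        spent = measured.get(name, 0.0)
        if spent > limit:
            violations.append(f"{name}: {spent:.4f}s > budget {limit:.4f}s")
    return violations


if __name__ == "__main__":
    # Illustrative numbers only; real budgets come from your own baselines.
    measured = {"fsync": 0.6, "openat": 0.01, "read": 0.002}
    budget = {"fsync": 0.1, "openat": 0.05}
    problems = budget_violations(measured, budget)
    for line in problems:
        print(line)
    raise SystemExit(1 if problems else 0)
```

Timings from a traced run are inflated by the tracer and vary between machines, so budgets of this kind are best set generously and compared on the same runner.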

strace vs ltrace vs gdb vs perf — pick the right lens

strace is one lens, not the only one. Each tool observes a different layer at a different cost, and reaching for the wrong one wastes time or perturbs the very behaviour you are chasing.

Tool	Observes	Overhead	Best question	Production-safe?
strace	Syscalls (kernel boundary)	High (2 ctx switches/call)	What is the process asking the kernel to do?	No — diagnosis only
ltrace	Library / function calls (userspace)	High	Which library calls fire, with what arguments?	No
gdb	Full program state, memory, breakpoints	Very high (stops the world)	What is the exact state at this line?	No (interactive)
perf trace	Syscalls via perf_events	Low	How do syscalls behave under real load?	Yes
bpftrace / eBPF	Arbitrary kernel + user probes	Very low	Custom, low-overhead production tracing	Yes

When not to reach for strace

The overhead is not academic, and it is not the only constraint. On hardened systems, Yama's /proc/sys/kernel/yama/ptrace_scope restricts attaching: at level 1 you can attach only to your own descendants unless you hold CAP_SYS_PTRACE, at level 2 attaching requires CAP_SYS_PTRACE, and at level 3 attaching is disabled outright, so you will get EPERM no matter how root you feel. Inside containers, a restrictive seccomp profile or a missing CAP_SYS_PTRACE capability may block ptrace entirely. And on a service handling live traffic, multiplying the cost of every syscall is not a debugging session, it is an incident.
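Before blaming strace for an EPERM, it is worth checking what the Yama setting actually is. The snippet below is a minimal sketch: the function names are hypothetical, and the descriptions paraphrase the four levels defined in the kernel's Yama documentation. On kernels built without Yama the file is simply absent, which the reader treats as "no restriction from Yama".

```python
from __future__ import annotations

from pathlib import Path

YAMA_PATH = Path("/proc/sys/kernel/yama/ptrace_scope")

# Paraphrased from the kernel's Yama documentation.
_MEANING = {
    0: "classic: a process may attach to any dumpable process with the same uid",
    1: "restricted: only descendants may be attached without CAP_SYS_PTRACE",
    2: "admin-only: using ptrace requires CAP_SYS_PTRACE",
    3: "disabled: no process may attach, regardless of privilege",
}


def read_ptrace_scope(path: Path = YAMA_PATH) -> int | None:
    """Return the current ptrace_scope, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def describe(scope: int | None) -> str:
    """Turn a ptrace_scope value into a one-line explanation."""
    if scope is None:
        return "Yama ptrace_scope not available on this system"
    return _MEANING.get(scope, f"unknown ptrace_scope value {scope}")


if __name__ == "__main__":
    print(describe(read_ptrace_scope()))
```

At level 1, launching the program under strace still works because the target is then a descendant of the tracer; it is attaching with -p to an unrelated process that fails.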

For anything that must run against production load, reach for perf trace or an eBPF tool such as bpftrace, which observe the same syscall boundary at a fraction of the cost. But on a box you control, with a process you can afford to slow down, strace remains the fastest way to answer the only question that matters when intent and reality diverge: what is this process actually doing? Once you can read its output fluently, that question stops being a mystery and becomes a lookup.
