These days I play a lot with CPU profiling tools (i.e., the tools that help to optimize the execution time of a program) and spent some time explaining to myself terminology and approaches to interpretation of profiling results. From a naive point of view, in order to speed up a program it is enough to glance at a flat list of function execution times, pick the heaviest function and rewrite it. But as we’ll see this is the most trivial use case of optimization, not always available (e.g. what if a library function appears to be the heaviest?) and actually there are more possibilities for optimization.

### How programs are profiled

If you ever tried to go beyond recording current time before and after some lengthy process and logging the delta, then you certainly turned your sight toward special tools called *profilers.* The task of a profiler is to measure resources usage by a program (in our case, to measure CPU time consumption). In an ideal world results are not affected by a method of measurement used, but in reality they do. This is why several measurement approaches exist.

A profiler needs to record what functions were invoked and how many times it took to execute a function. The simplest way of obtaining this data is *sampling.* When using this method, a profiler interrupts program execution at specified intervals and logs the state of program’s call stack. The problem with sampling is that even with hardware support it’s not 100% accurate. Some function calls can fall down through `holes’ of a sampling grid, and will not be seen in a profile. But this usually means that these functions are not worth optimizing—CPU spent very little time executing them. The great benefit of the sampling approach is that it almost doesn’t affect program execution, so it can be successfully used in production environment.

*Instrumentation* is often considered as a more `precise’ approach to profiling. Doing an instrumentation means inserting a special code that performs calls counting or calculation of execution time into prologs and epilogs of functions. The problem with this approach is that it can seriously skew the results of profiling. Consider a function which takes just a dozen CPU cycles to execute and add another dozen of cycles due to instrumentation—now the instrumented function runs twice as long as the original version! And getting the current time is a system call which means that it will take a lot more CPU cycles to burn. That’s why some profilers (e.g. GNU gprof) usually instrument code only to record function’s call count—this is a relatively cheap operation, and measure function execution time using sampling. The other solution is to restrict instrumentation only to a selected subset of program’s functions.

### Presenting profiling results

Raw data collected during a profiling session needs to be presented in some compact and useful way. Recall that during a session call stacks (stack traces) are logged. A call stack is a sequence of function calls: from the currently executed function up to the program’s entry point.

The most complete and at the same time compact representation for profiling data is a *call graph.* Each node of a call graph represents a function, and each directed edge between nodes represents a call of one function by another. Both nodes and edges of a call graph have associated weights, according to their CPU usage. The weight of an edge is the number of stack traces where this call can be found. For a node, two weights are concerned: *self weight* is the number of stack traces where the corresponding function was currently executed, *total weight* (also known as *cumulative weight*) is the number of stack traces where the corresponding function can be found. Let’s assume for now that every program’s function appears in a stack trace no more than once, it means that we disallow recursive or mutually recursive functions. This is only for simplicity, we’ll deal with them later.

The weights defined are related in the following way:

- total weight of a node is equal to the sum of weights of its incoming edges;
- total weight is equal to the node’s self weight plus the sum of weights of outgoing edges;
- thus, self weight of a node is always less or equal than its total weight, and they are equal for leaf nodes (having no outgoing edges).

When producing an output, normalized values of weight values are also used. Node’s weight can be related either to the total number of samples collected, or to its parent’s total weight.

To construct a call graph, one needs to process all stack samples gathered. For each function from a sample, a node corresponding to the function need to be found or created, and its total weight increased by one. For the node corresponding to the currently executed function, self weight also needs to be increased. Edges corresponding to calls between functions need to be established as well, and their weights to be increased.

The problem with call graphs is that it is not so easy to visualize them using standard GUI controls, and almost impossible using a text output. That’s why other representations are often used: a *flat view* and a *tree view*. A flat view historically appeared the first and is produced from a call graph by simply discarding all the edges and listing only nodes. For each list item self and total weights (or ratios) are printed. An algorithm for building a flat view is the same as for a call graph, except that relations between functions are not taken into account.

In contrast to a flat view, a tree view retains relations between nodes. Tree’s top-level node can either correspond to the program’s entry point—a *top down tree,* or top-level nodes can correspond to all functions ever being currently executed—this is called a *bottom up tree.* Usefulness of the latter view is that it starts from functions that were direct CPU consumers, not from their callers. The problem with a bottom up view is that it is counterintuitive: node’s children are indeed callers of the function and this makes working with self and total weights somewhat tricky.

A top down tree can be constructed from a list of gathered call stacks using the following algorithm. A stack sample is viewed as a route from tree’s root to some leaf. The route starts from program’s entry point. Along the way, non-existent nodes are created and total weights of nodes passed through are increased by one. When a node corresponding to stack sample’s last function is reached, its self weight is also increased by one.

One significant difference between a tree and a graph is that in a tree every node has only a single parent node. Thus, when mapping a call graph onto the corresponding top down tree, every node that has multiple incoming edges is being split into several `instances’ with weights split accordingly. That is, the sum of self weights of tree nodes corresponding to the same function must be equal to self weight of a call graph tree node, the same is true for total weight.

A bottom up tree is more interesting. It can be constructed from call stacks using the same algorithm as for building a top down tree, but when iterating through a call stack, one must start with the opposite end—the function currently executed. In the resulting bottom up tree call graph’s nodes are also split among different branches. E.g. the program’s root function is a leaf node of every bottom up tree branch. An interesting property of a bottom up tree is that a node can appear at the top level of a tree if and only if it has a non-zero self weight, so self weights of all nodes are concentrated in top-level nodes of a bottom up tree, and total weights are distributed among non-top-level nodes, which only can have zero self weight. Thus, it is enough only to have a single weight value for each node. For top-level nodes, this will be the self weight of the corresponding call graph node, and for the rest it will be a piece of the total weight. Similarly to a top down tree, summing up weights of all call graph node’s instances will give us its total weight.

### Recursive functions

Here comes the fun part. If you have a recursive function, or a set of mutually recursive functions, trying to calculate nodes’ and edges’ weights for them will result in a paradox, because node’s incoming edges can in the same time appear as its outgoing edges. This ruins our relations of weights.

To deal with this problem, let’s recall that recursion is equivalent to iteration. That means, a recursive function can be rewritten using a loop. In this case, calculating its weights isn’t a problem anymore. Thus, for a recursive function we can contribute the weight of its self-call edge to node’s self time, and don’t assign any weight to such edge.

While processing stack samples, it is important to distinguish between time spent in function calls (i.e. the number of stack samples) and the number of function calls themselves. A stack sample can contain multiple occurrences of a recursive function (and thus, multiple self-calls), but this only contributes as a single unit of a call graph node’s and edge’s weights. Thus, when constructing a flat view, every function from a stack sample needs to be counted only once. Self weight is always calculated right because in every stack sample there is only one currently executed function.

During a tree view construction, the right decision is not to perform cycles detection at all. This preserves a relation between weights of parent and child nodes: the total weight of a parent node equals to a sum of its self weight with the sum of child nodes total weights. But when it is needed to gather up several nodes in order to calculate weights of a corresponding call graph node, total weights of recursive function calls must be ignored. In a tree view, recursive calls are easy to detect: a node with the same label exists in node’s chain of parents.

### Using profiles to detect bottlenecks

Armed with the knowledge of how profiling data is collected and visualized, let’s figure out how to use it for speeding up a program. I can enumerate the following cases that are detectable by profiling paired with source code analyzis:

- an individual heavy function (e.g. a sorting function);
- excessive usage of a library function (e.g. of a memory copying function);
- repeatedly appearing sequence of function calls.

For finding an individual heavy function, a flat view is enough. Sort it by self weight in decreasing order, and such functions will bubble up to the top. Especially if their total weight is close to self weight (this means they do all the work by themselves), they are good candidates for optimization. This method works well both for recursive and non-recursive functions.

If a hot function can’t be optimized by itself (e.g. this is a library function), a bottom up view is helpful to look at its callers. Get the caller with the most weight (that is, the one that calls the function most frequently), and look up in the source code why it makes an excessive usage of the function. It often appears, that a result of function’s call can be cached, or maybe there is a more effective way of doing the same thing that doesn’t require calling the function at all.

Sorting a flat view by total weight in decreasing order and secondary by self weight can also reveal functions that inadequately cause heavy processing. Consider an example: let’s say we see a function `top5′ which has big total weight, but small self weight. Now we are looking at its callees in the top down view, and see that `top5’s time is mostly occupied by a call to `qsort’. Looking at `top5′ source we see that this call is redundant, because finding top 5 array items can be done using a single linear scan, without a need to sort an array.

A call graph view can actually be used in all of these cases, especially if it can be adjusted to highlight nodes and edges based on their weights. The only problem of a call graph, as I said, is that it can be big, so good scalable approach to viewing it is required.

But a call graph really shines when one wants to find a repeatedly appearing sequence of function calls. As every program’s function has exactly one corresponding node (unlike a tree view), and edges have weights, it is relatively easy to detect expensive sequences. After finding them and looking through functions’ code it is often seems that data calculated by a function is often used only partially or even thrown away by a caller (this happens when programs are built from big reusable `universal’ functions). Crafting a specialized version of a function that doesn’t perform unneeded calculations can speed up a program dramatically.