Skip navigation

CPU Profiling Explained

These days I play a lot with CPU profiling tools (i.e., the tools that help to optimize the execution time of a program) and spent some time explaining to myself terminology and approaches to interpretation of profiling results. From a naive point of view, in order to speed up a program it is enough to glance at a flat list of function execution times, pick the heaviest function and rewrite it. But as we’ll see this is the most trivial use case of optimization, not always available (e.g. what if a library function appears to be the heaviest?) and actually there are more possibilities for optimization.

How programs are profiled

If you ever tried to go beyond recording current time before and after some lengthy process and logging the delta, then you certainly turned your sight toward special tools called profilers. The task of a profiler is to measure resources usage by a program (in our case, to measure CPU time consumption). In an ideal world results are not affected by a method of measurement used, but in reality they do. This is why several measurement approaches exist.

A profiler needs to record what functions were invoked and how many times it took to execute a function. The simplest way of obtaining this data is sampling. When using this method, a profiler interrupts program execution at specified intervals and logs the state of program’s call stack. The problem with sampling is that even with hardware support it’s not 100% accurate. Some function calls can fall down through `holes’ of a sampling grid, and will not be seen in a profile. But this usually means that these functions are not worth optimizing—CPU spent very little time executing them. The great benefit of the sampling approach is that it almost doesn’t affect program execution, so it can be successfully used in production environment.

Instrumentation is often considered as a more `precise’ approach to profiling. Doing an instrumentation means inserting a special code that performs calls counting or calculation of execution time into prologs and epilogs of functions. The problem with this approach is that it can seriously skew the results of profiling. Consider a function which takes just a dozen CPU cycles to execute and add another dozen of cycles due to instrumentation—now the instrumented function runs twice as long as the original version! And getting the current time is a system call which means that it will take a lot more CPU cycles to burn. That’s why some profilers (e.g. GNU gprof) usually instrument code only to record function’s call count—this is a relatively cheap operation, and measure function execution time using sampling. The other solution is to restrict instrumentation only to a selected subset of program’s functions.

Presenting profiling results

Stack SamplesRaw data collected during a profiling session needs to be presented in some compact and useful way. Recall that during a session call stacks (stack traces) are logged. A call stack is a sequence of function calls: from the currently executed function up to the program’s entry point.

The most complete and at the same time compact representation for profiling data is a call graph. Each node of a call graph represents a function, and each directed edge between nodes represents a call of one function by another. Both nodes and edges of a call graph have associated weights, according to their CPU usage. Call graphThe weight of an edge is the number of stack traces where this call can be found. For a node, two weights are concerned: self weight is the number of stack traces where the corresponding function was currently executed, total weight (also known as cumulative weight) is the number of stack traces where the corresponding function can be found. Let’s assume for now that every program’s function appears in a stack trace no more than once, it means that we disallow recursive or mutually recursive functions. This is only for simplicity, we’ll deal with them later.

Node weightsThe weights defined are related in the following way:

  • total weight of a node is equal to the sum of weights of its incoming edges;
  • total weight is equal to the node’s self weight plus the sum of weights of outgoing edges;
  • thus, self weight of a node is always less or equal than its total weight, and they are equal for leaf nodes (having no outgoing edges).

When producing an output, normalized values of weight values are also used. Node’s weight can be related either to the total number of samples collected, or to its parent’s total weight.

To construct a call graph, one needs to process all stack samples gathered. For each function from a sample, a node corresponding to the function need to be found or created, and its total weight increased by one. For the node corresponding to the currently executed function, self weight also needs to be increased. Edges corresponding to calls between functions need to be established as well, and their weights to be increased.

The problem with call graphs is that it is not so easy to visualize them using standard GUI controls, and almost impossible using a text output. That’s why other representations are often used: a flat view and a tree view. A flat view historically appeared the first and is produced from a call graph by simply discarding all the edges and listing only nodes. For each list item self and total weights (or ratios) are printed. An algorithm for building a flat view is the same as for a call graph, except that relations between functions are not taken into account.

In contrast to a flat view, a tree view retains relations between nodes. Tree’s top-level node can either correspond to the program’s entry point—a top down tree, or top-level nodes can correspond to all functions ever being currently executed—this is called a bottom up tree. Usefulness of the latter view is that it starts from functions that were direct CPU consumers, not from their callers. The problem with a bottom up view is that it is counterintuitive: node’s children are indeed callers of the function and this makes working with self and total weights somewhat tricky.

Top-down treeA top down tree can be constructed from a list of gathered call stacks using the following algorithm. A stack sample is viewed as a route from tree’s root to some leaf. The route starts from program’s entry point. Along the way, non-existent nodes are created and total weights of nodes passed through are increased by one. When a node corresponding to stack sample’s last function is reached, its self weight is also increased by one.

One significant difference between a tree and a graph is that in a tree every node has only a single parent node. Thus, when mapping a call graph onto the corresponding top down tree, every node that has multiple incoming edges is being split into several `instances’ with weights split accordingly. That is, the sum of self weights of tree nodes corresponding to the same function must be equal to self weight of a call graph tree node, the same is true for total weight.

Bottom-up treeA bottom up tree is more interesting. It can be constructed from call stacks using the same algorithm as for building a top down tree, but when iterating through a call stack, one must start with the opposite end—the function currently executed. In the resulting bottom up tree call graph’s nodes are also split among different branches. E.g. the program’s root function is a leaf node of every bottom up tree branch. An interesting property of a bottom up tree is that a node can appear at the top level of a tree if and only if it has a non-zero self weight, so self weights of all nodes are concentrated in top-level nodes of a bottom up tree, and total weights are distributed among non-top-level nodes, which only can have zero self weight. Thus, it is enough only to have a single weight value for each node. For top-level nodes, this will be the self weight of the corresponding call graph node, and for the rest it will be a piece of the total weight. Similarly to a top down tree, summing up weights of all call graph node’s instances will give us its total weight.

Recursive functions

Recursive functionHere comes the fun part. If you have a recursive function, or a set of mutually recursive functions, trying to calculate nodes’ and edges’ weights for them will result in a paradox, because node’s incoming edges can in the same time appear as its outgoing edges. This ruins our relations of weights.

To deal with this problem, let’s recall that recursion is equivalent to iteration. That means, a recursive function can be rewritten using a loop. In this case, calculating its weights isn’t a problem anymore. Thus, for a recursive function we can contribute the weight of its self-call edge to node’s self time, and don’t assign any weight to such edge.

While processing stack samples, it is important to distinguish between time spent in function calls (i.e. the number of stack samples) and the number of function calls themselves. A stack sample can contain multiple occurrences of a recursive function (and thus, multiple self-calls), but this only contributes as a single unit of a call graph node’s and edge’s weights. Thus, when constructing a flat view, every function from a stack sample needs to be counted only once. Self weight is always calculated right because in every stack sample there is only one currently executed function.

Detecting recursionDuring a tree view construction, the right decision is not to perform cycles detection at all. This preserves a relation between weights of parent and child nodes: the total weight of a parent node equals to a sum of its self weight with the sum of child nodes total weights. But when it is needed to gather up several nodes in order to calculate weights of a corresponding call graph node, total weights of recursive function calls must be ignored. In a tree view, recursive calls are easy to detect: a node with the same label exists in node’s chain of parents.

Using profiles to detect bottlenecks

Armed with the knowledge of how profiling data is collected and visualized, let’s figure out how to use it for speeding up a program. I can enumerate the following cases that are detectable by profiling paired with source code analyzis:

  • an individual heavy function (e.g. a sorting function);
  • excessive usage of a library function (e.g. of a memory copying function);
  • repeatedly appearing sequence of function calls.

For finding an individual heavy function, a flat view is enough. Sort it by self weight in decreasing order, and such functions will bubble up to the top. Especially if their total weight is close to self weight (this means they do all the work by themselves), they are good candidates for optimization. This method works well both for recursive and non-recursive functions.

If a hot function can’t be optimized by itself (e.g. this is a library function), a bottom up view is helpful to look at its callers. Get the caller with the most weight (that is, the one that calls the function most frequently), and look up in the source code why it makes an excessive usage of the function. It often appears, that a result of function’s call can be cached, or maybe there is a more effective way of doing the same thing that doesn’t require calling the function at all.

Sorting a flat view by total weight in decreasing order and secondary by self weight can also reveal functions that inadequately cause heavy processing. Consider an example: let’s say we see a function `top5′ which has big total weight, but small self weight. Now we are looking at its callees in the top down view, and see that `top5’s time is mostly occupied by a call to `qsort’. Looking at `top5′ source we see that this call is redundant, because finding top 5 array items can be done using a single linear scan, without a need to sort an array.

A call graph view can actually be used in all of these cases, especially if it can be adjusted to highlight nodes and edges based on their weights. The only problem of a call graph, as I said, is that it can be big, so good scalable approach to viewing it is required.

But a call graph really shines when one wants to find a repeatedly appearing sequence of function calls. As every program’s function has exactly one corresponding node (unlike a tree view), and edges have weights, it is relatively easy to detect expensive sequences. After finding them and looking through functions’ code it is often seems that data calculated by a function is often used only partially or even thrown away by a caller (this happens when programs are built from big reusable `universal’ functions). Crafting a specialized version of a function that doesn’t perform unneeded calculations can speed up a program dramatically.

Image manipulation in Java

Created a small and simple utility class for myself to be able to perform common operations on images in Java.
Using this class you can load and save images, you’re able to resize images maintaining aspect ratio, you’re able to save jpeg images with specified compression, you can rotate images to any given angle, you can also watermark them.
Hope it’ll save you some time.
Read More »

On the Edge—Trying Chromium on Mac

about-chromiumImmediately after Google Chrome announce, it was known that it will be on Linux and Mac. Some day… And this day is coming. Right now it is possible to compile Chromium on Mac and it will be a working browser!

Well, almost. I’m trying to use it and it works—this post is composed using Mac Chromium. But it’s definitely far from being a polished consumer product yet. Major problems that I found so far: the Omnibox (navigation edit box) only accepts URLs; downloads doesn’t work, and tabs crash from time to time. Though the last isn’t a big problem because Chromium is a multi-process browser, so a tab crash doesn’t affect other tabs. That’s pretty cool comparing to Safari 4 Beta.

Chromium

What is really exciting is that Mac Chromium will become better and better every day. And any developer can help by contributing to a project.

Moving Editor Configuration into the Cloud

I know, I know—everybody does it in our days. But the results are so sweet that I can’t resist a temptation to make a post myself. So, if your editor configuration is already uploaded to some hosting, and is version-controlled, please go away, you won’t learn anything new here. If not, go on reading—it should be this way.

My migration was driven by circumstances. Currently I simultaneously develop for 3 platforms—this means 3 developer hosts, and of course I want to have the same editor configuration under all of them. Previously I copied my .emacs file all around, and soon it became annoying. Especially if you’re subscribed to tips&tricks feeds and always experiment with new packages.

So the decision was made to move all the configuration under publicly hosted version control in order to be able to retrieve it wherever I may roam. Now the details: which version control and what hosting. As I can’t live without git, I chosen it. This made github a quite natural choice for hosting.

Moving all the setup to a repository and integrating current setups from different machines was the next task at hand. I decided to be as conservative as possible and move on in baby steps. Step one was to move current config file into a repository and load it from the “main” .emacs file—done, works. Great, move to the next step—introduce variables in order to obtain some flexibility—done, works. And so on. Pretty soon I gathered all the packages in the repository and uploaded it into github. The next step—clone it on the second machine, integrate changes and push them back. Then sync the repository on the first machine and test that it still works. After some iterations, all the stuff was in place.

After several days of using the new setup, I discovered another advantage of having editor configuration under a version control. Now I can experiment freely with new stuff and have no more commented out blobs of configuration code. If a package or macro doesn’t work, but looks promising and I don’t have time to deal with it now—just move it into a separate branch! Also if some setting feels like not so useful as it seemed initially—just revert it with a single command. Boy, now I’m agile! ;)

Writing a SQL parser

Well… not exactly writing a parser but generating one.
For the task I’ve been working on recently I needed to have (inside my java code) a data model representing database tables, so I could perform data normalization according to it.

“CREATE TABLE” sql statement is a natural fit for representing the data model. The only thing I had to decide on is parser. How do I parse SQL? Can I use something off-the-shelf? I asked myself and it turns out yes I can!
I took a look at Javacc and then at Antlr. Antlr seemed more, I would say, mature, more documentation, tutorials, sample grammars, etc.

Antlr (ANother Tool for Language Recognition)

is a language tool that provides a framework for constructing recognizers, compilers, and translators from grammatical descriptions containing actions in a variety of target languages.

BTW, the data I was going to parse looked like this:


CREATE TABLE table_one (
    customer_number integer NOT NULL,
    address character varying(30)
);

CREATE TABLE table_two (
    id integer NOT NULL,
    city character varying(50)
);

I wrote following grammar for it:


grammar CreateTable;

/*------------------------------------------------------------------
 * PARSER RULES
 *------------------------------------------------------------------*/

table_list :
    (table_def)*;

table_def :
    'create' 'table' table_name table_element_list SEMICOLON;

table_name :
    (schema DOT)? table;

table_element_list :
    LEFT_PAREN table_element (COMMA table_element)* RIGHT_PAREN;

table_element :
    column_name data_type_def (column_constraint)?;

data_type_def :
    data_type lenght_constraint?;

schema : ID;

table : ID;

column_name :ID;

data_type : ID:

length_constraint :
    LEFT_PAREN NUMBER RIGHT_PAREN;

/*------------------------------------------------------------------
 * LEXER RULES
 *------------------------------------------------------------------*/

LEFT_PAREN : '(';
RIGHT_PAREN : ')';
COMMA : ',';
SEMICOLON : ';';
DOT	:	 '.';
NUMBER 	:	(DIGIT)+;
ID  : (('a'..'z'|'A'..'Z' | '_') ((DIGIT)*))+ ;
NEWLINE:'\r'? '\n' ;
WS : ( '\t' | ' ' | '\r' | '\n' | '\u000C' )+ 	{ $channel = HIDDEN; } ;
fragment DIGIT :   '0'..'9' ;

This is all fine but I want my parser to return data right away like:


List<Table> tables = parser.table_list();
List<Column> cols = table.getColumns();
Column col = table.getColumn("id");
col.getName();
col.getType();
col.getConstriant();
col.getLength();

So we have to made a few changes to our parser rules in the grammar.


grammar CreateTable;

@header {
import java.util.ArrayList;
import java.util.List;

}

table_list returns [List tables]
	: {$tables = new ArrayList();}	(table_def { $tables.add($table_def.tbl); } )*  ;

table_def returns [Table tbl]
	: 'create' 'table' table_name
		table_element_list {
			$tbl = new Table(String.valueOf($table_name.text), $table_element_list.cols);
		}
		SEMICOLON ;

table_name
	:	(schema DOT)? table  ;

table_element_list returns [List cols]
	: {$cols = new ArrayList();} LEFT_PAREN te=table_element {$cols.add($te.col);} (COMMA te=table_element {$cols.add($te.col);})*   RIGHT_PAREN ;

table_element returns [Column col]
	:  column_name data_type_def (column_constraint)? {$col = new Column($column_name.cn, $data_type_def.tpe, $column_constraint.text, $data_type_def.len);};

column_name returns [String cn]
	:	ID {$cn = String.valueOf($ID.text); };

column_constraint
	:	'not' 'null';

data_type_def returns [String tpe, Integer len]
	: data_type {$tpe = $data_type.tp;} length_constraint? {$len = $length_constraint.len;}
	;

data_type returns [String tp]
	:
	((('character' 'varying')  | 'varchar') {$tp = "varchar";}
		| ('bit' 'varying' | 'varbit') {$tp = "varbit";}
		| ('double' 'precision' | 'float8') {$tp = "float8"; }
		| ('character' | 'char') {$tp = "char";}
		| ('integer' | 'int' | 'int4') {$tp = "integer";}
		| ID {$tp = $ID.text;}
		);

length_constraint returns [Integer len]
	: LEFT_PAREN NUMBER RIGHT_PAREN {$len = Integer.valueOf($NUMBER.text);} 	;

schema : ID;
table :	ID;

/*------------------------------------------------------------------
 * LEXER RULES
 *------------------------------------------------------------------*/

LEFT_PAREN : '(';
RIGHT_PAREN : ')';
COMMA : ',';
SEMICOLON : ';';
DOT	:	 '.';
NUMBER 	:	(DIGIT)+;
ID  : (('a'..'z'|'A'..'Z' | '_') ((DIGIT)*))+ ;
NEWLINE:'\r'? '\n' ;
WS : ( '\t' | ' ' | '\r' | '\n' | '\u000C' )+ 	{ $channel = HIDDEN; } ;
fragment DIGIT :   '0'..'9' ;

That’s basically it. I have a parser that I wanted and I think it took me less than I would have spent writing my own.
There are many sample grammars. Have a look maybe there is existing one for the language you’re about to parse. :)

P.S. The exact “CREATE TABLE” SQL statement is slightly more complex that I assume in the grammar.