Archive for the ‘C’ Category


Many command-line tools need to parse command-line arguments. In C, the most widely used functions for this purpose are getopt() and its GNU extension getopt_long(). However, these functions have two major issues. First, they are not portable: getopt is part of the POSIX standard but not the C standard; getopt_long is not part of any standard; and getopt may behave differently depending on whether the GNU extension is enabled. Using these functions can be tricky. Second, both functions rely on global variables, which may interfere with more complex use cases (e.g. sub-commands).

These limitations motivated the development of several other argument parsing libraries. While these libraries often have cleaner APIs and more functionality, most of them lack some getopt_long features. This blog post reviews several argument parsing libraries in C/C++ and introduces my own getopt replacement at the end.

Argument parsing libraries in C/C++

The following table lists common features of argument parsing libraries. Stars indicate getopt_long features.

Feature   Explanation
post*     Parse options after non-option/positional arguments
compact*  When appropriate, “-a -b foo” can be written as “-abfoo”
mulocc*   Keep track of an option occurring multiple times
order*    Keep track of the order of options
oparg*    A long option may optionally take an argument
type      Built-in argument type checking and parsing
fmt       Print formatted help messages
wchar     Support multi-byte characters

The table below shows the feature sets of several command-line argument parsing libraries. Only libraries supporting both short and long options are considered (stars indicate 1- or 2-file libraries):

library      lang    post  compact  mulocc  order  oparg  type  fmt  wchar
getopt_long  C/C++   Y     Y        Y       Y      Y      N     N    maybe
argh*        C++11   semi  N        N       N      N      N     N    ?
argp         C/C++   Y     Y        Y       Y      ?      N     Y    ?
argparse*    C/C++   Y     Y        N       N      ?      Y     Y    ?
args*        C++11   Y     Y        Y       N      ?      Y     Y    ?
argtable*    C/C++   Y     Y        Y       N      ?      Y     Y    ?
cxxopts*     C++11   Y     Y        Y       N      ?      Y     Y    ?
CLI11        C++11   Y     Y        switch  N      N      Y     Y    ?
gopt*        C/C++   Y     Y        switch  N      Y      N     N    N
ketopt*      C/C++   Y     Y        Y       Y      Y      N     N    N
tclap        C++     ?     N        N       N      ?      Y     Y    ?

Notably, many libraries discard the relative order of options, arguably the least important getopt feature. They often add type checking and automatic help-message formatting instead. I think type checking comes in handy, but message formatting is not as valuable because I prefer my own format over theirs.

The list in the table is of course incomplete. Some important libraries that are missing include Boost’s Program_options and Google’s gflags, both of which are much heavier; I haven’t spent enough time on them. If you have relevant information on them or on your favorite library that is missing, or you think the table is wrong, please help me improve it. Thanks in advance!

Ketopt: my single-header argument parsing library

I occasionally care about the order of options, a feature missing from most non-getopt libraries (argp has it but is not portable). In the end, I developed my own library, ketopt (examples here, including one on sub-commands). It is implemented in ANSI C and performs no heap allocations. Ketopt has an API similar to getopt_long’s, except that 1) ketopt doesn’t use any global variables and 2) ketopt takes an explicit function argument to allow options after non-option arguments. Developers familiar with getopt_long should be able to learn ketopt quickly.


Command-line argument parsing is relatively simple (ketopt is under 100 lines of code), but implementing it yourself is tricky, in particular if you want to match the features of getopt_long. My ketopt is largely a portable getopt_long without global variables. Besides mine, you may consider gopt in C: it is small, easy to use and supports the key getopt_long features. For C++ programmers, cxxopts is a decent choice: it is feature rich, close to getopt_long, and has an API similar to Boost’s Program_options and Python’s argparse.

I strongly discourage the use of libraries that deviate too much from getopt (e.g. argh and tclap). Most end users expect getopt behavior; when your tool acts differently, it will confuse them. The command-line interface is one of the first things users experience. Please get it right.


Vector and matrix arithmetic (e.g. vector dot products and matrix multiplication) is fundamental to linear algebra and is also widely used in other fields such as deep learning. It is easy to implement vector/matrix arithmetic, but when performance is needed, we often resort to a highly optimized BLAS implementation, such as ATLAS or OpenBLAS. Are these libraries much faster than our own implementations? Is it worth introducing a dependency on BLAS if you only need basic vector/matrix arithmetic? The following post may give you some hints.


In this GitHub repository, I implemented matrix multiplication in seven different ways, including a naive implementation, several optimized implementations with cache-miss reduction, SSE and loop blocking, and two implementations on top of OpenBLAS. The following table shows the timing of multiplying two 2000×2000 or 4000×4000 random matrices on my personal Mac laptop and a remote Linux server (please see the source code repo for details):
Implementation    Linux, 2000×2000   Linux, 4000×4000   Mac, 2000×2000
Naive             7.53 sec           188.85 sec         77.45 sec
Transposed        6.66 sec           55.48 sec          9.73 sec
sdot w/o hints    6.66 sec           55.04 sec          9.70 sec
sdot with hints   2.41 sec           29.47 sec          2.92 sec
SSE sdot          1.36 sec           21.79 sec          2.92 sec
SSE+tiling sdot   1.11 sec           10.84 sec          1.90 sec
OpenBLAS sdot     2.69 sec           28.87 sec          5.61 sec
OpenBLAS sgemm    0.63 sec           4.91 sec           0.86 sec
uBLAS             7.43 sec           165.74 sec
Eigen             0.61 sec           4.76 sec

You can see that a naive implementation of matrix multiplication is quite slow. Simply transposing the second matrix may greatly improve performance when that matrix does not fit in the CPU cache (the Linux server has a 35MB cache, which can hold a 2000×2000 float matrix but not a 4000×4000 one). Transposing also enables vectorization of the inner loop, which leads to a significant performance boost (SSE sdot vs. Transposed). Loop blocking further reduces cache misses and the time for large matrices. However, OpenBLAS’ matrix multiplication (sgemm) is still the king of performance, twice as fast as my best hand-written implementation and tens of times faster than a naive one. OpenBLAS is fast mostly due to its advanced techniques for minimizing cache misses.
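The transposition trick above can be sketched in a few lines (a simplified illustration, not the benchmarked code; see the linked repo for that). The naive loop strides down the columns of the second matrix, while transposing it first turns the inner loop into a sequential dot product:

```c
#include <stdlib.h>

/* Naive: C[i][j] = sum_k A[i][k] * B[k][j]; B is read column-wise,
 * so each inner-loop step touches a different cache line of B. */
void mat_mul_naive(int n, const float *a, const float *b, float *c)
{
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            float s = 0.0f;
            for (int k = 0; k < n; ++k)
                s += a[i*n + k] * b[k*n + j];
            c[i*n + j] = s;
        }
}

/* Transpose B first; the inner loop then reads both operands sequentially,
 * which is cache-friendly and easy for the compiler to vectorize. */
void mat_mul_transposed(int n, const float *a, const float *b, float *c)
{
    float *bt = malloc((size_t)n * n * sizeof(float));
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            bt[j*n + i] = b[i*n + j];
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            float s = 0.0f;
            for (int k = 0; k < n; ++k)
                s += a[i*n + k] * bt[j*n + k];
            c[i*n + j] = s;
        }
    free(bt);
}
```

The one-off O(n²) transpose is negligible next to the O(n³) multiplication, which is why it pays off for any non-trivial matrix size.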

As a side note, “sdot with hints” partially unrolls the inner loop, which gives the compiler a hint that the loop may be vectorized. Clang on Mac can fully vectorize this loop, achieving the same speed as explicit vectorization; GCC 4.4 seems not as good. The Intel compiler vectorizes the loop even without this hint (see the full table in the README). Interestingly, the OpenBLAS sdot implementation is slower than my explicit vectorization on both Linux and Mac. I haven’t figured out the reason; I speculate it may be related to cache optimization.
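A partially unrolled sdot of this kind might look as follows (a sketch with a made-up function name, not the repo’s exact code). The four independent accumulators tell the compiler that the products can be computed in parallel, so it is free to map them onto SIMD lanes:

```c
/* Dot product with a 4-way partially unrolled inner loop.  The four
 * independent accumulators remove the serial dependency chain, hinting
 * to the compiler that the loop can be vectorized. */
float sdot_unrolled(int n, const float *x, const float *y)
{
    int i, n4 = n / 4 * 4;
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f, s = 0.0f;
    for (i = 0; i < n4; i += 4) {
        s0 += x[i]   * y[i];
        s1 += x[i+1] * y[i+1];
        s2 += x[i+2] * y[i+2];
        s3 += x[i+3] * y[i+3];
    }
    for (; i < n; ++i)      /* scalar remainder when n is not a multiple of 4 */
        s += x[i] * y[i];
    return s + s0 + s1 + s2 + s3;
}
```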

As for C++ libraries, Eigen has performance similar to OpenBLAS’. The native uBLAS implementation in Boost is quite primitive, nearly as slow as the most naive implementation. Boost should ditch uBLAS; even in the old days, it was badly implemented.


  • For multiplying two large matrices, sophisticated BLAS libraries, such as OpenBLAS, are tens of times faster than the most naive implementation.
  • With transposing, SSE (x86 only) and loop blocking, we can achieve half of the speed of OpenBLAS’ sgemm while still maintaining relatively simple code. If you want to avoid a BLAS dependency, this is the way to go.
  • For BLAS level-1 routines (vector arithmetic), an implementation with SSE vectorization may match or sometimes exceed the performance of OpenBLAS.
  • If you prefer a C++ interface and are serious about performance, don’t use uBLAS; use Eigen instead.


Getopt for Lua

When I switch to a new programming language, one of the first things I do is to find or implement a getopt() that is compatible with the elegant Berkeley getopt.c.

When I started to actually use Lua two months ago, I also spent some time looking for a getopt function, but none was satisfactory. PosixGetOpt seems to bind the POSIX C library and may have compatibility issues. CommandLineModule is powerful but seems like overkill. AlternativeGetOpt tries to mimic the Berkeley getopt, but its functionality is very limited in comparison to the C version. There is a getopt module in lua-stdlib, but it has massive dependencies and is not Berkeley compatible. lua-alt-getopt is the closest I could find, but I need a lightweight version without the getopt_long support, such that I can copy-paste a single function without worrying about dependencies.

In the end I implemented my own getopt for Lua. It is a single function in 50 lines. The following example shows how to use this function:

for opt, optarg in os.getopt(arg, 'a:b') do
    print(opt, optarg)
end

BTW, I have also started to build my own Lua library. The goal is still: free, efficient and independent. If you want to use a function, you may just copy and paste one or a few relevant functions. The length of a dependency chain is at most three right now.


OOP in C? Don’t go too far.

I was reading some interesting articles about achieving object-oriented programming in ANSI C. It seems that most people commenting on these articles think this is a good idea in general. I want to say something different, though. In my view, it is fine to implement some basic OOP bits such as encapsulation and constructors, but we should not go too far.

In fact, most well-formed C projects contain some basic OOP bits. To avoid using too many global variables or a long argument list for a C function, we usually put related variables in a struct and pass a pointer to the struct between functions. Frequently we define functions to allocate and deallocate memory for the struct. Occasionally we even put the definition of the struct in the .c file rather than the .h, to completely hide its details. This is basic encapsulation with constructors and destructors. We also frequently use “static” functions inside a source file; these are private functions. We should stop here, though.
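These basic bits look like the following schematic example (the `point_t` type is made up for illustration; in a real project the struct definition would live in the .c file, with only a forward declaration in the header):

```c
#include <stdlib.h>

/* In a real project, the header would contain only
 * "typedef struct point_s point_t;" so callers never see the fields. */
typedef struct point_s { double x, y; } point_t;

/* constructor: allocate and initialize */
point_t *point_create(double x, double y)
{
    point_t *p = malloc(sizeof(point_t));
    if (p) { p->x = x; p->y = y; }
    return p;
}

/* destructor: release resources */
void point_destroy(point_t *p) { free(p); }

/* a "private" helper, invisible outside this source file */
static double sq(double v) { return v * v; }

/* a public accessor working through the encapsulated struct */
double point_norm2(const point_t *p) { return sq(p->x) + sq(p->y); }
```

This costs nothing at run time: the calls are ordinary direct function calls that the compiler can inline, which is exactly why stopping here is cheap while going further is not.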

Most of these OOP-in-C articles go further and mimic methods, inheritance, messaging and other OOP bits. However, all these things come at a cost in speed and/or space. For example, although we may use function pointers to mimic methods, the pointers take memory and prevent the compiler from inlining simple functions. If we really want to follow the C++ methodology and make everything an object, the overhead of these bits is huge.
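For illustration, here is a toy `shape_t` (a made-up example) that mimics one virtual method with a function pointer. Every object carries an extra pointer, and every call goes through an indirect branch the compiler usually cannot inline:

```c
#include <stddef.h>

/* A C++-style virtual method mimicked with a per-object function pointer */
typedef struct shape_s {
    double w, h;
    double (*area)(const struct shape_s *);  /* extra pointer per object: space cost */
} shape_t;

/* two "subclasses" sharing the same interface */
double rect_area(const shape_t *s) { return s->w * s->h; }
double tri_area(const shape_t *s)  { return 0.5 * s->w * s->h; }

/* The call target is only known at run time, so the compiler cannot
 * inline these trivial one-line functions: a speed cost on every call. */
double total_area(const shape_t *shapes, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += shapes[i].area(&shapes[i]);
    return sum;
}
```

With a real C++ virtual function the per-object cost is one vtable pointer; with this hand-rolled scheme it grows with every “method” added to the struct, which is part of the overhead the paragraph above refers to.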

The most frequent motivation for using OOP in C is that the programmer needs portability but only knows OOP, or thinks OOP is better. I do not want to argue whether OOP is better than procedural programming, but I really think it is a big mistake to mimic all the OOP bits in C in an unnecessarily complicated way, given all the performance overhead. If you have to use C in your project, learn and become good at procedural programming, which has proved to be at least as good as OOP in a lot of practical applications.
