Archive for the ‘Uncategorized’ Category

What is KANN?

See the GitHub repo page. In short, KANN is a flexible 4-file deep learning library, supporting convolutional neural networks (CNNs), recurrent neural networks (RNNs) and non-standard topologies addressable with differentiable computation graphs.

Why a new library?

The initial motivation is that I wanted to understand how deep learning frameworks work, down to the very details. The best way is to implement one myself. After I got the basics working, I realized the code may be of use to other C/C++ programmers who prefer an efficient and flexible library without carrying all the non-trivial dependencies of mainstream frameworks. So, here we go.

Comparison to other deep learning frameworks

Theano and TensorFlow, with code bases many times larger than KANN, are definitely more powerful than KANN. Importantly, they can take advantage of GPUs and even distributed computing, while KANN cannot. On the other hand, KANN comes close in flexibility and can be faster in the multi-threading mode for CPU-only training. KANN also has no extra dependencies by default, which makes it easy to deploy.

Tiny-dnn is a popular lightweight framework in C++. Importing pre-trained Caffe models is its particular strength, which KANN lacks. However, tiny-dnn does not support RNNs and has difficulties in constructing non-standard models (e.g. variational autoencoders). It is several times slower than KANN and mainstream frameworks. Tiny-dnn also requires a C++11 compiler, which is not available everywhere yet (e.g. on CentOS 6).

Limitations

KANN does not support GPUs right now. For MLPs and RNNs with no more than a couple of hundred hidden neurons, multi-threaded KANN is actually no slower than GPU-based implementations, because such small matrix multiplications do not saturate the capacity of a GPU. However, for CNNs and large RNNs, I have seen GPU-based implementations outperforming KANN by a factor of 5. The performance gap is probably larger with bigger networks.

KANN lacks some important operators, such as batch normalization (BN). A direct implementation of the original BN method is tricky as training needs an extra step different from normal training. It seems that Caffe et al. implement a variant of BN with a running average, but I am not so sure.

KANN does not support bidirectional RNNs and seq2seq models out of the box. In principle, these models can be constructed with KANN by manually chaining RNN blocks, but I have not tried.

Conclusion

If you are looking for a tiny, standalone, performant, open source library in C/C++ that supports common components including MLP, CNN and RNN, and has the flexibility and extensibility close to mainstream deep learning frameworks, KANN might be your only viable choice as of now.

Read Full Post »

The best solution is pdftops from Poppler, a successor of sorts to xpdf (see also this article). It preserves the fonts in the PDF and produces a small, proper vector graphic. To compile Poppler on OS X 10.9, I needed to edit “configure” and remove the compile option “-fno-check-new”, as clang does not support it.

Following the answer from this page, I also tried a few other options. Inkscape generates a small vector EPS, but it loses some features. Convert from ImageMagick outputs a bitmap EPS, which defeats the purpose of a vector graphic.

Interestingly, directly using the “gs” command from GhostScript seems to generate a vector EPS, but using the pdf2ps script produces an EPS with bitmap fonts. It turns out that the difference is caused by “-dNOCACHE”, which is surprising. Anyway, even though “gs” works, it generates a much larger EPS in comparison to pdftops. The winner is still pdftops from xpdf/poppler, at least in my case.

Read Full Post »

Gv apparently calls pkg-config during configuration. When pkg-config or the pkg-config file for Xaw3D is not found, it will fall back to another configuration which does not work on Mac.

As Mac does not come with pkg-config by default, you need to first install it. You also need to specify where to find the pkg-config file for Xaw3D:

export PKG_CONFIG_PATH=/usr/X11/lib/pkgconfig/
./configure --x-includes=/usr/X11/include/ --x-libraries=/usr/X11/lib/ --enable-SIGCHLD-fallback

Read Full Post »

Several years ago I implemented knetfile for accessing remote files on ftp and http as if they were local (see also this blog post). I have been using the implementation for a while and end users like the feature. However, with the increasing use of https among file sharing and cloud computing providers, supporting secure connections has become more important. Several users have requested this feature. In response, I implemented a new library, kurl, on top of libcurl.

Kurl is inspired by and learns from fopen.c, an example from the curl source code package. It supports random access and uses a fixed-length buffer. It also fixes an issue where we may be waiting too long for select(). The APIs largely resemble knetfile, zlib and stdio. The following is a small example:

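#include <stdio.h>
#include "kurl.h"
int main(void) {
  kurl_t *fp;
  unsigned char buf[256];
  fp = kurl_open("https://github.com", 0);
  if (fp == 0) return 1;          /* the URL could not be opened */
  kurl_seek(fp, 100, SEEK_SET);   /* jump to byte offset 100 */
  kurl_read(fp, buf, 256);        /* read up to 256 bytes into buf */
  kurl_close(fp);
  return 0;
}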

In addition, kurl.c also comes with a simple main() function to achieve the basic curl functionality, which can be compiled with:

gcc -g -Wall -O2 -DKURL_MAIN kurl.c -o kurl -lcurl

Here are a few more details about kurl:

  • Two-file library. No installation.
  • The only dependency is libcurl, though libcurl may further depend on other libraries: e.g. openssl for https; libssh2 for sftp.
  • Directly accesses files in S3 with
    kurl_open("s3://bucket/object", 0)

    AWS credentials are either provided to kurl_open(), or by default read from ~/.awssecret (AccessKeyId and SecretKey on two lines; see Tim Kay’s aws tool for details).

  • Compilable with C++ compilers.
  • Buffered reading with a fixed buffer length. No potential buffer bloat.

Read Full Post »

I implemented a heap-free, lock-free and wait-free(?) scheduler for parallelizing simple independent “for” loops. For example, if we have a piece of code

data_type *data;
for (int i = 0; i < N; ++i)
    do_work(data, i);

where each cycle is largely independent of other cycles, we can process the loop with 4 threads:

data_type *data;
kt_for(4, do_work, data, N);

The 4 threads will end at about the same time even if each cycle takes a very different amount of time to process.

The scheduler uses a simplified task stealing algorithm to balance the load of each thread. Initially, given m threads, kt_for() assigns the i-th task/cycle to thread i%m. If a thread finishes earlier than other threads, the thread will steal a task from the most loaded thread. Thus as long as there remain enough tasks, no threads will be idle.

The original task stealing algorithm uses deques, but in our simpler case, the deque can be implicit. Task pushing and stealing can be achieved in a wait-free manner with the atomic fetch-and-add operation, making the scheduler highly scalable to many threads with little overhead.
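The scheme described above can be sketched in C11 roughly as follows. This is a minimal illustration and not the actual kt_for() source: the names sketch_for(), sketch_steal() and sketch_demo() are made up for this example, it assumes C11 atomics and pthreads, and unlike the real library it allocates its bookkeeping arrays with malloc(). Each worker owns an implicit queue i, i+m, i+2m, ... that is represented by a single counter, so both taking the next task and stealing one reduce to a fetch-and-add.

```c
#include <assert.h>
#include <limits.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

struct sketch_for;

typedef struct {
	struct sketch_for *t; /* shared loop state */
	atomic_long next;     /* next index in this worker's implicit queue */
} sketch_worker_t;

typedef struct sketch_for {
	int n_threads;
	long n;
	sketch_worker_t *w;
	void (*func)(void *data, int i, int tid);
	void *data;
} sketch_for_t;

/* Steal one task from the worker that has advanced the least (the most loaded one). */
static long sketch_steal(sketch_for_t *t)
{
	int j, min_j = 0;
	long k, min = LONG_MAX;
	for (j = 0; j < t->n_threads; ++j) {
		long cur = atomic_load(&t->w[j].next);
		if (cur < min) min = cur, min_j = j;
	}
	k = atomic_fetch_add(&t->w[min_j].next, t->n_threads);
	return k >= t->n ? -1 : k;
}

static void *sketch_worker(void *arg)
{
	sketch_worker_t *w = (sketch_worker_t*)arg;
	sketch_for_t *t = w->t;
	int tid = (int)(w - t->w);
	long i;
	/* First drain our own queue, then help the others. */
	while ((i = atomic_fetch_add(&w->next, t->n_threads)) < t->n)
		t->func(t->data, (int)i, tid);
	while ((i = sketch_steal(t)) >= 0)
		t->func(t->data, (int)i, tid);
	return NULL;
}

/* Run func(data, i, tid) for i in [0, n) on n_threads workers; returns 0 on success. */
static int sketch_for(int n_threads, void (*func)(void*, int, int), void *data, int n)
{
	sketch_for_t t;
	pthread_t *th;
	int j, n_started;
	assert(func != NULL);
	if (n_threads < 1) n_threads = 1;
	t.n_threads = n_threads, t.n = n, t.func = func, t.data = data;
	t.w = malloc(n_threads * sizeof(*t.w));
	th = malloc(n_threads * sizeof(*th));
	if (t.w == NULL || th == NULL) { free(t.w); free(th); return -1; }
	for (j = 0; j < n_threads; ++j) {
		t.w[j].t = &t;
		atomic_init(&t.w[j].next, j); /* worker j starts at task j */
	}
	for (j = 0; j < n_threads; ++j)
		if (pthread_create(&th[j], NULL, sketch_worker, &t.w[j]) != 0) break;
	n_started = j;
	for (; j < n_threads; ++j) sketch_worker(&t.w[j]); /* thread creation failed: run these queues here */
	for (j = 0; j < n_started; ++j) pthread_join(th[j], NULL);
	free(t.w); free(th);
	return 0;
}

/* Demo task: count how often index i is processed. */
static void sketch_count(void *data, int i, int tid)
{
	(void)tid;
	((int*)data)[i] += 1;
}

/* Returns the number of indices that were processed exactly once. */
static int sketch_demo(int n_threads, int n)
{
	int i, ok = 0, *cnt = calloc(n > 0 ? n : 1, sizeof(int));
	assert(cnt != NULL);
	sketch_for(n_threads, sketch_count, cnt, n);
	for (i = 0; i < n; ++i) ok += (cnt[i] == 1);
	free(cnt);
	return ok;
}
```

Calling sketch_demo(4, 1000) runs 1000 tasks on 4 threads and should report that every index was handled exactly once, because each index is handed out by exactly one fetch-and-add.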

To evaluate the efficiency of kt_for(), I parallelize the “for” loop in main() (line 32) of the following code, which essentially computes the color of the Mandelbrot set on an 800×600 canvas:

#include <stdlib.h>

typedef struct {
	int max_iter, w, h;
	double xmin, xmax, ymin, ymax;
	int *k;
} global_t;

static void compute(void *_g, int i, int tid)
{
	global_t *g = (global_t*)_g;
	double x, x0 = g->xmin + (g->xmax - g->xmin) * (i%g->w) / g->w;
	double y, y0 = g->ymin + (g->ymax - g->ymin) * (i/g->w) / g->h;
	int k;
	x = x0, y = y0;
	for (k = 0; k < g->max_iter; ++k) {
		double z = x * y;
		x *= x; y *= y;
		if (x + y >= 4) break;
		x = x - y + x0;
		y = z + z + y0; 
	}
	g->k[i] = k;
}

int main(int argc, char *argv[])
{
	int i, tot, n_threads = 2;
	global_t global = { 10240*100, 800, 600, -2., -1.2, -1.2, 1.2, 0 };
	tot = global.w * global.h;
	global.k = calloc(tot, sizeof(int));
	for (i = 0; i < tot; ++i) compute(&global, i, 0);
	free(global.k);
	return 0;
}

The complete source code is on GitHub. Here is the wall-clock time (gcc-4.7.2 and icc-13.1.3 on machine1; gcc-4.3.2 on machine2):

Schedulers compared: kt_run, Cilk, OpenMP (each row lists only the measured values)

1 CPU, machine1, gcc:    29.4
2 CPU, machine1, gcc:    16.0, 17.5
4 CPU, machine1, gcc:    8.6
1 CPU, machine1, icc:    26.8
2 CPU, machine1, icc:    14.7, 16.3
4 CPU, machine1, icc:    8.3, 9.5
1 CPU, machine2, gcc:    60.9
2 CPU, machine2, gcc:    30.7
31 CPU, machine2, gcc:   2.2

On this example, kt_for() is faster than both Cilk+ and OpenMP, and also scales well to tens of CPU cores. Nonetheless, it should be noted that Cilk+ and OpenMP are much more versatile than my 50-line library; the microbenchmark may also over-emphasize the scheduler overhead. Please take the result with a grain of salt.

Read Full Post »

The Mandelbrot set is THE most popular example of a fractal. There are thousands of implementations that plot the Mandelbrot set in different languages using different techniques. I re-implemented the Mandelbrot set mainly for fun and for learning OpenGL, GLSL and the HTML5 canvas. Here are the four implementations:

  1. glfractal.c in C and plain OpenGL. This program demonstrates how to use basic OpenGL. It has an interesting feature: it dumps the coordinates as a short HEX string so that we can come back to the position later. When you find a beautiful area in the Mandelbrot set, you can hardly find it again without this feature.
  2. glslfractal.c in C and GLSL using single-precision floating point numbers. This program is modified from the source code on this page. It runs much faster than the CPU-only version. However, you cannot zoom in too much due to the loss of precision.
  3. glslfractale.c in C and GLSL using emulated double-precision numbers. This program aims to alleviate the precision problem, but the result is not so satisfactory.
  4. HTML5+JavaScript (a live web page!). It also supports HEX dump/restore. You can click the following links to see a few areas I like. You may need to wait for a few seconds as your browser computes the image on the fly: image 1, image 2, image 3 and image 4. Enjoy.

There is also an unfinished WebGL implementation. It is supposed to be as fast as the C+GLSL version.
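The third implementation relies on emulated double precision. One common way to do this, sometimes called double-single arithmetic, keeps each value as the sum of two floats. The C sketch below illustrates the idea for addition only. It is an assumption for illustration rather than code taken from glslfractale.c, the ds_* names are hypothetical, and it only works if the compiler does not reassociate floating-point operations (e.g. no -ffast-math).

```c
#include <assert.h>

/* A value stored as the unevaluated sum hi + lo of two floats. */
typedef struct { float hi, lo; } ds_t;

static ds_t ds_from_double(double d)
{
	ds_t r;
	r.hi = (float)d;                  /* leading part */
	r.lo = (float)(d - (double)r.hi); /* what the leading part missed */
	return r;
}

static double ds_to_double(ds_t x)
{
	return (double)x.hi + (double)x.lo;
}

/* Knuth's two-sum: hi is the rounded sum of a and b, lo its rounding error. */
static ds_t ds_two_sum(float a, float b)
{
	ds_t r;
	float bb;
	r.hi = a + b;
	bb = r.hi - a;
	r.lo = (a - (r.hi - bb)) + (b - bb);
	return r;
}

/* Add two double-single values and renormalize the result. */
static ds_t ds_add(ds_t x, ds_t y)
{
	ds_t s = ds_two_sum(x.hi, y.hi), r;
	float lo = s.lo + (x.lo + y.lo);
	r.hi = s.hi + lo;
	r.lo = lo - (r.hi - s.hi);
	return r;
}

/* Self-check: 1e-9 is below float resolution near 1, but the pair keeps it. */
static void ds_self_check(void)
{
	ds_t sum = ds_add(ds_from_double(1.0), ds_from_double(1e-9));
	double err = ds_to_double(sum) - (1.0 + 1e-9);
	assert(sum.hi == 1.0f);              /* the high part alone cannot hold it */
	assert(err > -1e-12 && err < 1e-12); /* the pair recovers the small term */
}
```

The same pattern extends to multiplication, which the Mandelbrot iteration also needs, at the cost of several extra float operations per step.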

Read Full Post »

Most C programmers know that in a C struct, members have to be aligned in memory. Take the following struct as an example:

typedef struct {
  unsigned key;
  unsigned char val;
} UnpackedStruct;

The two members of this struct take 5 bytes in total. However, because the struct is padded so that the 4-byte “key” stays aligned in an array, “sizeof(UnpackedStruct)” returns 8, wasting 3 bytes per element. This waste of memory is the key reason why my khash library uses two separate arrays to keep keys and values, even though this leads to more cache misses.

Khash was initially written about 10 years ago when I was young and foolish. I later learned that with gcc/clang, it is possible to byte-pack the struct:

typedef struct {
  unsigned key;
  unsigned char val;
}  __attribute__ ((__packed__)) PackedStruct;

With this, “sizeof(PackedStruct)” returns 5. Then why does gcc not do this by default? Is it because unaligned memory access hurts performance? A Google search pointed me to this question on StackOverflow. There was a discussion, but no clear conclusion.
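The two layouts can be checked directly with sizeof and offsetof. The following is a small sketch that assumes gcc or clang on a platform where unsigned is 4 bytes; the helper names bytes_saved() and check_layout() are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
  unsigned key;
  unsigned char val;
} UnpackedStruct;

typedef struct {
  unsigned key;
  unsigned char val;
} __attribute__ ((__packed__)) PackedStruct;

/* Bytes saved by packing an array of n elements. */
static size_t bytes_saved(size_t n)
{
  return n * (sizeof(UnpackedStruct) - sizeof(PackedStruct));
}

/* In both layouts "val" follows "key" at offset 4; only the tail padding differs. */
static void check_layout(void)
{
  assert(offsetof(UnpackedStruct, val) == 4);
  assert(offsetof(PackedStruct, val) == 4);
  assert(sizeof(UnpackedStruct) == 8); /* 3 bytes of tail padding */
  assert(sizeof(PackedStruct) == 5);   /* no padding */
}
```

Under these assumptions an array of 50 million packed elements is 150 million bytes smaller than the unpacked one.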

Hash tables have become the bottleneck of my recent work, so I decided to revisit the question: do packed structs hurt performance on x86_64 CPUs? As usual, I did a very simple benchmark: with khash, I insert/delete 50 million (uint32_t,uint8_t) integer pairs stored in either the packed or the unpacked struct shown above and see if the performance differs. The following table shows the CPU time on my x86_64 laptop:

Key type         Value type   Size per elem (bytes)   CPU seconds
unsigned         uint8_t      5                       10.249
UnpackedStruct   N/A          8                       9.429
PackedStruct     N/A          5                       9.287

The table says it all: on x86_64 CPUs, a packed struct array does not hurt performance in comparison to an unpacked struct array. With both gcc and clang, the packed struct is consistently faster, perhaps because it takes less space, which might help cache performance. The source code can be found here.

Finally, it should be noted that x86 CPUs have been optimized for unaligned memory access. On other CPUs, the results may be very different. Perhaps that is why gcc does not pack structs by default.

Read Full Post »
