In C programming, the main difference between the low-level I/O functions (open/close/read/write) and the stream-level I/O functions (fopen/fclose/fread/fwrite) is that the stream-level functions are buffered. A low-level read() may incur a disk operation on each call; although the kernel may cache reads, we cannot rely too much on that. Disk operations are expensive, which is why low-level I/O provides no fgetc() equivalent.
Stream-level I/O functions keep a buffer. On reading, they load a block of data from disk into memory; if the data an fgetc() call needs are already in memory, no disk operation is incurred, which greatly improves efficiency.
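To make the buffering idea concrete, here is a minimal sketch of a buffered getc built directly on read(2). The names `buf_reader_t` and `br_getc` are made up for illustration; this is not code from any library:

```c
#include <unistd.h>

#define BR_BUFSIZE 4096

typedef struct {
    int fd;                        /* underlying file descriptor */
    unsigned char buf[BR_BUFSIZE]; /* the buffer itself          */
    int begin, end;                /* valid window within buf    */
} buf_reader_t;

/* Return the next byte, or -1 on EOF/error. read() is only called
   once the window is exhausted, i.e. at most once per BR_BUFSIZE
   bytes, instead of once per byte. */
static int br_getc(buf_reader_t *r)
{
    if (r->begin >= r->end) {
        ssize_t n = read(r->fd, r->buf, BR_BUFSIZE);
        if (n <= 0) return -1;
        r->begin = 0;
        r->end = (int)n;
    }
    return r->buf[r->begin++];
}
```

This is essentially what stdio does behind fgetc(), amortizing one system call over many byte reads.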
Stream-level I/O functions are part of the standard C library, so why do we need a new wrapper? Three reasons. First, when you work with an alternative I/O library (such as zlib or libbzip2) that does not come with buffered I/O routines, you probably need a buffered wrapper to make your code efficient. Second, a generic wrapper makes your code more flexible when you want to change the type of input stream. For example, you may want to write a parser that works on a normal stream, on a zlib-compressed stream and on a C string; a unified stream wrapper simplifies the coding. Third, my feeling is that most stream-level I/O functions in stdio.h are not convenient, given that they cannot enlarge a string automatically. In a lot of cases I need to read one line, but I do not know in advance how long a line can be. Handling this is not hard, but doing it again and again is tedious.
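As an aside, the "line of unknown length" chore looks something like this with plain stdio: grow the buffer with realloc() as fgets() fills it. `read_long_line` here is a hypothetical helper written for illustration (on POSIX systems, getline(3) does much the same job):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read one line of arbitrary length from fp; the caller free()s the
   result. Returns NULL on EOF with nothing read, or on allocation
   failure. The trailing '\n', if present, is kept. */
static char *read_long_line(FILE *fp)
{
    size_t cap = 64, len = 0;
    char *s = malloc(cap);
    if (s == NULL) return NULL;
    while (fgets(s + len, (int)(cap - len), fp)) {
        len += strlen(s + len);
        if (len > 0 && s[len - 1] == '\n') break; /* got a full line */
        cap *= 2;                                 /* otherwise, grow */
        char *t = realloc(s, cap);
        if (t == NULL) { free(s); return NULL; }
        s = t;
    }
    if (len == 0) { free(s); return NULL; }
    return s;
}
```

Writing this loop once is fine; sprinkling it through every parser is the boring part the post complains about.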
In the end, I came up with my own buffered wrapper for input streams. It is generic in that it works on any type of I/O stream with a read() call (or equivalent), and even on a C string. I show an example here without much explanation; I may expand this post in the future. The source code can be found on my programs page.
```c
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "kstream.h"

// arguments: type of the stream handler,
// function to read a block, size of the buffer
KSTREAM_INIT(int, read, 10)

int main()
{
    int fd;
    kstream_t *ks;
    kstring_t str;
    memset(&str, 0, sizeof(kstring_t));
    fd = open("kstream.h", O_RDONLY);
    ks = ks_init(fd);
    while (ks_getuntil(ks, '\n', &str, 0) >= 0)
        printf("%s\n", str.s);
    ks_destroy(ks);
    free(str.s);
    close(fd);
    return 0;
}
```
I’ve been trying for a while to figure out how to use it to read/buffer C strings. I’m guessing it requires me to write a read() drop-in replacement, but I can’t get much further. Would you mind providing an example?
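One way this could work is a read(2)-style callback over an in-memory string, which is then passed to KSTREAM_INIT in place of read. The sketch below uses made-up names (`str_src_t`, `str_read`) and has not been tested against the posted kstream.h; the function's signature simply mirrors `read(fd, buf, count)`:

```c
#include <string.h>

typedef struct {
    const char *s;   /* the string being "read"        */
    size_t len, off; /* total length, current position */
} str_src_t;

/* Copy up to nbytes from the string into buf; return the number of
   bytes copied, or 0 at end of string -- the same contract as read(2). */
static int str_read(str_src_t *src, void *buf, int nbytes)
{
    size_t rest = src->len - src->off;
    size_t n = rest < (size_t)nbytes ? rest : (size_t)nbytes;
    memcpy(buf, src->s + src->off, n);
    src->off += n;
    return (int)n;
}
```

With that in place, something like `KSTREAM_INIT(str_src_t *, str_read, 4096)` followed by `ks_init(&src)` should parallel the file-descriptor example above, though I have not verified this against the library.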
[…] In the following table, gzread/gzgetc/gzgets are direct zlib function calls. ks_getc/ks_getuntil are from my generic buffered wrapper. […]
If you’re /just/ looking for efficiency, use mmap(2):
A benchmark: `time -p` of printing the contents of the blender binary (that's 112 MB).
Your Program (above):
real 2.50
user 0.65
sys 1.85
`awk '{ print $0 }'`:
real 0.18
user 0.12
sys 0.05
System `cat`:
real 0.03
user 0.00
sys 0.03
My program (mmap):
real 0.00
user 0.00
sys 0.00
So… awk is faster than that.
I was /very/ surprised. I thought awk would be insanely slow, doing it line-by-line.
As you can see, though, I beat cat ;).
That example only uses a buffer of 10 bytes. In practice, you would normally use something like 4KB or 64KB. Many text files are line based; with mmap, you still need to break lines, and breaking lines actually takes a significant amount of time. In addition, my library trivially works with gzip'd files, which is nontrivial with mmap().
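The line-breaking cost is easy to demonstrate: even with the whole file mmap'd, every newline still has to be found by scanning memory. A minimal sketch using memchr(3) (`count_lines` is a made-up helper, not from the library):

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* mmap the whole file and count '\n' bytes; return -1 on error.
   Mapping avoids read() calls and copies, but each line break must
   still be located by scanning memory -- here memchr() does it. */
static long count_lines(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return -1; }
    if (st.st_size == 0) { close(fd); return 0; }
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    if (p == MAP_FAILED) return -1;
    long n = 0;
    for (const char *q = p, *end = p + st.st_size;
         (q = memchr(q, '\n', end - q)) != NULL; q++)
        n++;
    munmap(p, st.st_size);
    return n;
}
```

So mmap removes the copying, not the scanning; the per-byte work of finding delimiters remains with either approach.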
Yeah… most programs aren’t really held up by reading from a file anyway. It’s more important that the API is good.
If you’re /only/ interested in performance, use mmap(2).
I did some benchmarks, and the above isn’t very fast.
The benchmarks were done with the following command:
`time -p prog ~/Programs/blender-2.76-rc3-linux-glibc211-x86_64/blender`
I compiled all the programs with:
gcc -O3 -static -flto -finline-functions -ffast-math prog.c
I’m not entirely sure -flto does anything if I set -static…
`cat` and `awk` are from debian repos.
So…
Above Program:
real 2.18
user 0.34
sys 1.82
`awk '{ print $0 }'`:
real 0.18
user 0.12
sys 0.05
`cat`
real 0.03
user 0.00
sys 0.03
My Program (mmap):
real 0.00
user 0.00
sys 0.00
I was very impressed by awk’s performance.
And… even when I went up to a 1.3GB file, the mmap version stays at 0.
These results may differ from yours, especially if you use an HDD rather than an SSD like I do. In fact, this is probably a large part of the difficulty in writing a good solution to this problem. When running your program, CPU0 sat at 100% the whole time, showing that it was probably bottlenecked by the CPU.
It is likely the CPU drain is in `ks_getuntil`. It looks like it scans through char by char until it finds the delimiter; that's lines 97 and 98 of kstream.h. Consider using strchr?
Or… just read input line-by-line instead of char-by-char.
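A note on the strchr suggestion: the buffered window is not NUL-terminated, so memchr(3), which takes an explicit length, is the safer fit for this kind of scan. A sketch of what the delimiter search could look like (`find_delim` is a made-up name for illustration, not the actual kstream.h code):

```c
#include <string.h>

/* Find the next occurrence of delim in the buffered window
   buf[begin, end). Return its offset within buf, or -1 if the window
   holds no delimiter and the caller should refill the buffer.
   memchr() is typically a vectorized scan in libc, which tends to be
   much faster than a hand-written char-by-char loop. */
static int find_delim(const unsigned char *buf, int begin, int end, int delim)
{
    const unsigned char *p = memchr(buf + begin, delim, end - begin);
    return p ? (int)(p - buf) : -1;
}
```

The caller still copies the span into the output string and refills the buffer on -1, so the surrounding logic is unchanged; only the inner scan is swapped out.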
oops. sorry for posting twice … feel free to delete