Debugging the Kernel

Today I’m chasing down a bug in the kernel. Although you don’t need to worry about kernel debugging, it’s always wise to understand the history and strengths of any kernel your data depends on. As it turns out, the history of kernels, and how to debug them, is pretty interesting. (For those unfamiliar with the ins and outs of software, the kernel is the core of an operating system: the part that controls the hardware and everything running on it.)

In the case of Coraid’s operating system, EthOS, the kernel is the culmination of a lifetime of work by the people who invented Unix, C, and scripting languages, and who brought you curly brackets.

I’m referring to Bell Labs, specifically Ken Thompson and Dennis Ritchie, and their groundbreaking Plan 9 operating system. With EthOS, I’ve tailored their work to serve block storage. It’s a compact, powerful kernel, as the services it provides make evident. Among them is clean multicore support, built on sleep and wakeup mechanisms.
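To make the sleep and wakeup idea concrete, here is a user-space sketch in C. It is not EthOS or Plan 9 kernel code. It assumes POSIX threads, and the names Rendez, rendez_sleep, rendez_wakeup, and the Mailbox demo are hypothetical. It borrows one idea from Plan 9's sleep and wakeup, which take a condition function: the sleeper tests its condition while holding the rendezvous lock, so a wakeup cannot slip in unnoticed between the test and the sleep.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space analogue of a kernel rendezvous point. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
} Rendez;

static void rendez_init(Rendez *r) {
    pthread_mutex_init(&r->lock, NULL);
    pthread_cond_init(&r->cond, NULL);
}

/* Sleep until cond(arg) holds. The condition is checked with the lock held,
   and pthread_cond_wait releases the lock atomically, so a wakeup issued
   between the check and the wait cannot be lost. The loop also absorbs
   spurious wakeups. */
static void rendez_sleep(Rendez *r, bool (*cond)(void *), void *arg) {
    pthread_mutex_lock(&r->lock);
    while (!cond(arg))
        pthread_cond_wait(&r->cond, &r->lock);
    pthread_mutex_unlock(&r->lock);
}

/* Apply update(arg) under the lock, then wake every sleeper so each one
   re-checks its own condition. */
static void rendez_wakeup(Rendez *r, void (*update)(void *), void *arg) {
    pthread_mutex_lock(&r->lock);
    update(arg);
    pthread_cond_broadcast(&r->cond);
    pthread_mutex_unlock(&r->lock);
}

/* Demo: one thread sleeps until a value is published. */
typedef struct {
    Rendez r;
    bool   ready;
    int    value;
    int    seen;
} Mailbox;

static bool mailbox_ready(void *arg) { return ((Mailbox *)arg)->ready; }

static void mailbox_publish(void *arg) {
    Mailbox *m = arg;
    m->value = 42;
    m->ready = true;
}

static void *consumer(void *arg) {
    Mailbox *m = arg;
    rendez_sleep(&m->r, mailbox_ready, m);
    /* Safe to read: value was written under the lock before ready was set. */
    m->seen = m->value;
    return NULL;
}

static int run_demo(void) {
    Mailbox m = { .ready = false, .value = 0, .seen = -1 };
    rendez_init(&m.r);
    pthread_t t;
    pthread_create(&t, NULL, consumer, &m);
    rendez_wakeup(&m.r, mailbox_publish, &m);
    pthread_join(t, NULL);
    assert(m.ready);
    return m.seen;
}
```

run_demo starts a consumer that sleeps until the mailbox is ready, publishes the value 42, and returns what the consumer saw. Broadcasting rather than signaling lets every sleeper re-check its own condition, at the cost of some needless wakeups.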

There was a painful time in Unix’s history when people had to convert uniprocessor kernels into multiprocessor ones. There weren’t even multicore chips back then, just machines with multiple CPU chips. As you might imagine, this transition was bug-filled and bloated.

Unix made clever use of the fact that in kernel mode, no other process could preempt what was going on. With the advent of symmetric multiprocessors (SMP), however, that simplification no longer held. The work of retrofitting SMP onto a kernel that assumed non-preemption did considerable damage.
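Here is a minimal sketch of why the non-preemption assumption breaks. On a uniprocessor kernel that was never preempted, code like counter++ could run without a lock because no other process could get in the middle of it (interrupt handlers aside). On an SMP machine, two CPUs can execute that read-modify-write at the same time and lose updates. This sketch uses POSIX threads as stand-ins for CPUs and a hypothetical spinlock (spin_lock and spin_unlock, built on C11 atomic_flag) around the shared counter. It is an illustration, not EthOS's locking code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical spinlock built on C11 atomic_flag. */
typedef struct { atomic_flag held; } Lock;

static void spin_lock(Lock *l) {
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ;  /* spin: another "CPU" holds the lock */
}

static void spin_unlock(Lock *l) {
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

enum { NTHREADS = 4, NINCR = 100000 };

static Lock counter_lock = { ATOMIC_FLAG_INIT };
static long counter;  /* shared state a uniprocessor kernel might leave unprotected */

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < NINCR; i++) {
        spin_lock(&counter_lock);
        counter++;  /* read-modify-write: not atomic without the lock */
        spin_unlock(&counter_lock);
    }
    return NULL;
}

static long run_workers(void) {
    pthread_t t[NTHREADS];
    counter = 0;
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return counter;
}
```

With the lock in place, run_workers returns NTHREADS * NINCR (400,000) every time. Remove the spin_lock and spin_unlock calls and the total will usually come up short, which is exactly the kind of bug an SMP conversion has to hunt down.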

So the folks at Bell Labs, the people who invented Unix in the first place, decided to scrap that approach and start from scratch.

Over the following years, they created an operating system with not only multicore support but also local area networking, graphics, distributed namespaces, network-based authentication, and many other cutting-edge features.

In short, they invented a cloud operating system before “cloud” was even part of our computer jargon. Each process group has its own mutable namespace, which you can think of as a file tree. Some of these files live on the file server. Others are resources in the operating system, such as processes or network connections.
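A toy model may help with the idea of per-process-group namespaces. This is not how Plan 9 implements them. It is a hypothetical sketch in C (Namespace, ns_bind, and ns_resolve are invented names) that treats a namespace as a table of bindings and resolves a path by rewriting its longest bound prefix. Two process groups with different tables see different things at the same path.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum { MAXBIND = 16, MAXPATH = 256 };

/* One binding: paths under `onto` actually refer to `src`. */
typedef struct {
    char onto[MAXPATH];  /* path as the process group sees it, e.g. "/net" */
    char src[MAXPATH];   /* what that path really refers to */
} Binding;

/* Hypothetical per-process-group namespace: a small table of bindings. */
typedef struct {
    Binding b[MAXBIND];
    int     n;
} Namespace;

/* Make `onto` refer to `src` in this namespace only. Returns 0 on success. */
static int ns_bind(Namespace *ns, const char *src, const char *onto) {
    if (ns->n == MAXBIND || strlen(src) >= MAXPATH || strlen(onto) >= MAXPATH)
        return -1;
    strcpy(ns->b[ns->n].src, src);
    strcpy(ns->b[ns->n].onto, onto);
    ns->n++;
    return 0;
}

/* Rewrite `path` using the longest matching binding prefix. */
static void ns_resolve(const Namespace *ns, const char *path, char *out, size_t outlen) {
    const Binding *best = NULL;
    size_t bestlen = 0;
    for (int i = 0; i < ns->n; i++) {
        size_t len = strlen(ns->b[i].onto);
        /* Match whole components: "/net" matches "/net" and "/net/tcp", not "/network". */
        if (strncmp(path, ns->b[i].onto, len) == 0 &&
            (path[len] == '\0' || path[len] == '/') && len > bestlen) {
            best = &ns->b[i];
            bestlen = len;
        }
    }
    if (best)
        snprintf(out, outlen, "%s%s", best->src, path + bestlen);
    else
        snprintf(out, outlen, "%s", path);
}
```

For example, a process group that has bound /n/storage/net onto /net resolves /net/tcp to /n/storage/net/tcp, while a group with an empty table still sees /net/tcp unchanged. These paths are made up for illustration.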

You can also import resources from another system, further evidence that this kernel is genuinely built for the cloud. In the lab, we debug the Coraid SRX Media Array by importing its process directory and running the debugger on our own machine. We control the SRX and VSX by writing into these files, whether locally on the appliance or remotely.
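Controlling a device by writing text into a file needs nothing more exotic than open, write, and read. The following C sketch assumes a POSIX system. ctl_write and ctl_read are hypothetical helpers, and any control-file path or command string you pass them (such as "reset") is illustrative, not an actual SRX or VSX interface. On a Plan 9-style system, the kernel or a file server interprets the write; on an ordinary file, the bytes are simply stored.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Send one text command to a control file. Returns 0 on success, -1 on error. */
static int ctl_write(const char *path, const char *cmd) {
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    size_t len = strlen(cmd);
    ssize_t n = write(fd, cmd, len);
    int rc = (n == (ssize_t)len) ? 0 : -1;
    if (close(fd) < 0)
        rc = -1;
    return rc;
}

/* Read up to len-1 bytes of status text and NUL-terminate it.
   Returns the number of bytes read, or -1 on error. */
static ssize_t ctl_read(const char *path, char *buf, size_t len) {
    if (len == 0)
        return -1;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, buf, len - 1);
    close(fd);
    if (n < 0)
        return -1;
    buf[n] = '\0';
    return n;
}
```

Because the interface is just files, the same helpers would work unchanged on a system where a remote machine's control files have been imported into the caller's namespace.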

Not only is the kernel powerful, but it’s also tiny by modern standards. The current image contains only 776,306 bytes of instructions. Leaving out device drivers and protocol code, EthOS is only 40,000 lines of C. Linux, drivers and all, is pushing past 15 million.

The size of any kernel is worth noting because what you want is as much functionality as possible per line of code. As the loc/func ratio (lines of code per unit of functionality) falls, bugs go down and performance goes up. More efficient is, well, more efficient.

How do I debug a kernel? Well, the best way is to insert print statements in the code, recompile, and reboot the machine. Some people like debuggers, but they’ve never done much for me. Ken and Dennis just used prints to create all this stuff anyway.
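For print-statement debugging, a small macro keeps the prints cheap to leave in and easy to switch off. This is a hypothetical user-space sketch in C, not EthOS's own print routines: DBG, dbg_emit, debug, dbg_out, and the NO_DBG compile-time switch are invented names. Each message is tagged with its source file and line so you can find the print that fired.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

static int   debug = 0;       /* nonzero turns tracing on at run time */
static FILE *dbg_out = NULL;  /* NULL means stderr */

/* Print "file:line: message" followed by a newline. */
static void dbg_emit(const char *file, int line, const char *fmt, ...) {
    FILE *out = dbg_out ? dbg_out : stderr;
    va_list ap;
    fprintf(out, "%s:%d: ", file, line);
    va_start(ap, fmt);
    vfprintf(out, fmt, ap);
    va_end(ap);
    fputc('\n', out);
}

#ifdef NO_DBG
#define DBG(...) ((void)0)  /* compiled out entirely */
#else
/* Costs only a test of `debug` when tracing is off. */
#define DBG(...) do { if (debug) dbg_emit(__FILE__, __LINE__, __VA_ARGS__); } while (0)
#endif
```

A call such as DBG("ether%d: link %s", ctlrno, up ? "up" : "down") prints only while debug is nonzero, and building with -DNO_DBG removes the calls entirely. The variable names in that call are illustrative.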

Debugging like this only works if the kernel is quick to build and boot. Compiling the EthOS kernel after I’ve changed a single file takes about two-thirds of a second of real time.

That’s right: 0.65r, as the time command reports it. I can reboot with the new kernel in less than a minute. So the edit, compile, test cycle runs many times an hour.

If you think this is primitive, you should read the Ken Thompson interview in Peter Seibel’s book Coders at Work.

Enough of this. I need to get back to tracking down my bug. I think it’s in pc/devether.c.
