Jeff wandered into Allen’s cubicle, right after lunch and soon after Allen’s arrival at work. Allen was one of those systems programmers, hair a bit long, the beard came and went, stayed until 3 AM and wandered back in around noon. Management talked about him in a sort of hushed, reverent tone. He was the kernel guy.
Jeff’s question was simple. What was this kernel thingy he had heard so much about? It was a question that only a new user interface programmer, just coming up to speed on VB.NET, could ask in a way that was perfectly innocent. Jeff was young, curious, and, most of all, sincere.
Allen, still a bit sleepy and leaning back in his Aeron chair, was delighted with Jeff’s question in his laid-back way. And he gave an enthusiastic answer, going on for about 45 minutes, a talk that included words and phrases that Jeff had only heard over the cubicle walls as he went past the systems programmers’ cube village on his way to the break room for a flavored soda water. This time, no more definitions of the words or explanation of the phrases were offered than they had been when overheard.
Back in the cubes of user interface land, Bud poked his head out and asked, “Well, did you find out what a kernel was?”
“Yes, but now I don’t understand what I know about it.”
Kernels and Hyperkernels
This is my fourth in a series of posts about virtual machine hypervisors. My intent in this series is to give you an intuitive feel for what virtual machines are, to provide a mental model of how it all works. In this installment, I will try to answer Jeff’s question. Hopefully, you’ll understand what you will know about it. Unlike what our example above might suggest, it really isn’t that hard.
Why learn what a kernel is? Because a virtual machine hypervisor is a kernel of kernels. But what does that mean? To understand that, one has to understand what the plain old, ordinary kernel is and does. Then we can see what the kernel of kernels, the hypervisor, does.
They weren’t there originally, these kernels. The first machines lacked anything like them. You had a big box with a lot of blinking lights that read in cards and punched cards and printed on big printers that looked like the IBM 407 card tabulator the shop had been using before. When you started the machine and hit the go button, your program would read from the selected card reader right into memory. Your job was the first, and only, software in the machine. You would use the read instruction to read your data cards into a fixed place in memory. When you wanted to print a line or punch a new card in the card punch, you would move the data into a different fixed location and use the write instruction and the data would be sent to the output device. The instruction paused the machine until the input-output was complete. Nothing that we would think of as an operating system was anywhere in sight. You didn’t need it. You just did what you needed to do.
But spending a lot of time waiting for a card to read from a slow card reader was a waste of time for a machine that cost you a million dollars. (That’s $25M in 2019 dollars!) But that wasn’t the biggest slow-down in the beginning.
So Much Iron Just Waiting Around
At first operators would load up a deck of cards that constituted the job to run. The deck would have some program cards, some data cards, and a few cards that would delimit the data from the code. The operator would load each job in the computer, start it, and wait for it to finish. When getting back from coffee, the operator would notice that the job was done and would take out the results and set up for the next job. This seemed natural at first, like setting up the new electronic lab thingy for a job and running it, but, given the amazing speed of the massive calculator, management realized that we could get more done faster if we could cut out that business of doing nothing while the jobs were being set up.
So first, a tiny bit of software was written that hid up in high memory, that would read in one job at a time from a stack of jobs. When a job finished, it would jump to the tiny code in high memory that would read the next job from the stack. Now, the operator didn’t have to set up each individual job, but could batch them in a whole bunch. The tiny bit of code was called a “monitor,” and these monitors were the first embryonic operating systems.
This wasn’t perfect, but it was a great improvement. There was still the possibility that a program with a bug would mess up the operation. If the job didn’t return to the monitor, or overwrote the monitor because of a bug that erroneously stored its data over the monitor, the machine could hang and the operator would have to manually restart the machine, inserting the monitor deck in front of the remaining jobs in the card reader.
This was the way systems worked for many years.
But, as is often the issue in the history of the computer, removing one bottleneck just makes another seem all the worse. Now, waiting on all the slow I/O from the card reader and printer was taking 90% of the wall clock time. Wouldn’t it be great if we could run more than one program at a time? If we loaded ten jobs in memory at once, we could keep the devices and the processor busy. One job could run while the others were waiting on their I/O to finish. There is more to it than that, of course, and memories had to get larger, and more paths to more I/O devices were needed.
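To put a rough number on that intuition, here is a back-of-the-envelope sketch. It assumes a simple textbook approximation that is not from the history above: each job independently spends the same fraction of its time waiting on I/O, so the processor sits idle only when every resident job is waiting at once. The function name is made up for illustration.

```python
def cpu_utilization(io_wait_fraction: float, jobs: int) -> float:
    """Chance that at least one of `jobs` resident jobs is ready to compute.

    Assumes each job waits on I/O independently of the others, which real
    workloads only approximate.
    """
    return 1.0 - io_wait_fraction ** jobs


# 90% I/O wait, as in the text, with a growing number of jobs in memory.
for n in (1, 2, 5, 10):
    print(f"{n:2d} jobs in memory -> processor busy {cpu_utilization(0.9, n):.0%}")
```

Under that assumption, one job keeps the processor busy about 10% of the time, while ten jobs push it to roughly 65%. The model ignores memory limits and contention for the same devices, so treat it as a motivation for multiprogramming, not a prediction.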
Clearly, there were better things than cards to use. The Univac I had included 1,200 foot 1/2" reels of nickel-bronze tape. Dropping one on your foot would send you straight to the podiatrist. IBM developed an iron oxide coated plastic, replacing the heavy dumbbells of Univac data tapes. The card jobs would be read and written onto 2,400’ reels of tape on a satellite machine, usually a machine like the IBM 1401, and the big expensive machine would read the jobs from the tape, going much faster than slinging cardboard slabs.
As the manufacturing costs of mainframes kept dropping, one could see there was enough memory and enough I/O devices to make running multiple jobs feasible. But by the late 1950s, it was clear that just growing the monitor wasn’t going to cut it. As the size of the jobs increased so did the opportunity for bugs. Buggy software was slamming that particular piece of memory the monitor was hidden in, causing the machine to have to be restarted often. Even the large, high speed mainframes that worked on the number crunching jobs that put man into space had this problem.
And it wasn’t just the monitoring code that was at risk. The other jobs could be clobbered by a buggy job on a tirade as well. When IBM started defining a new product line of computers that would encompass all their current product line, they gave a great deal of thought to how to allow multiple jobs to safely run in memory at the same time. How would they protect the monitor and the other jobs from getting eaten by bugs?
There was another issue as well as memory safety. When multiple jobs were running at the same time they were all accessing I/O devices at the same time. Each had to be told what device it was to use, which tape drives had its data tapes mounted. A simple bug in one job could overwrite the tape of another. This had to be taken care of as well as protecting memory.
Function Calls From Nowhere
While I’m talking about the input-output, I need to mention something that had evolved in the development of computers that was tied to I/O. Interrupts. A program doing I/O might be able to do useful work if the input and output instructions didn’t block while the I/O was being performed. The job could be doing something while the device was performing the I/O operation. But if the operation was indeed asynchronous, the machine would have to have a way to let the job know that the I/O had completed. This was done with a mechanism called an interrupt.
Sometimes called a trap, an interrupt is like a sudden subroutine call coming out of nowhere. New I/O instructions caused the machine to start the I/O but not to wait for it to finish, continuing the fetch-decode-execute cycle of the following instructions. Then, when the data was safely transferred, the machine would save the current value of the program counter into a fixed memory location and branch to another fixed memory location, usually the one right after the save location. Maybe the current program counter would be saved at word 10 and the program would branch to 11. This meant that instruction processing would start at that location, which usually was a jump to the code that did something with the data. When finished, a special RETURN-FROM-TRAP instruction would reverse the process, loading the saved program counter from word 10 into the program counter, and the original sequence of instructions that had been interrupted would continue.
Different sets of locations would be defined for different sources of interrupts. I/O channel A of the IBM 7094, to take just one example, would save the old program counter at location 42 and branch to location 43. (In case you’re ever programming a 7094, remember that those values are octal.) Channel B used 44 and 45, and so on through channel H.
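If it helps to see the save-and-branch dance in motion, here is a toy simulation. It is a sketch of the idea only: the opcodes (WORK, JUMP, RETURN_FROM_TRAP, HALT), the addresses 100 and 200, and the `run` function are all invented for illustration and do not correspond to any real machine’s instruction set. Only the word 10 and word 11 convention comes from the description above.

```python
SAVE_SLOT = 10    # word where the interrupted program counter is saved
TRAP_ENTRY = 11   # word where execution continues after the interrupt


def run(memory, interrupt_at_step):
    """Run until HALT, delivering one interrupt just before the given step."""
    pc, step, trace = 100, 0, []
    while True:
        if step == interrupt_at_step:
            memory[SAVE_SLOT] = pc          # hardware saves the old PC...
            pc = TRAP_ENTRY                 # ...and branches to the fixed word
        op, arg = memory[pc]
        trace.append((pc, op))
        if op == "HALT":
            return trace
        if op == "JUMP":
            pc = arg
        elif op == "RETURN_FROM_TRAP":
            pc = memory[SAVE_SLOT]          # resume the interrupted sequence
        else:                               # "WORK": any ordinary instruction
            pc += 1
        step += 1


memory = {
    TRAP_ENTRY: ("JUMP", 200),              # word 11 jumps to the real handler
    100: ("WORK", None), 101: ("WORK", None), 102: ("HALT", None),
    200: ("WORK", None),                    # handler does something with the data
    201: ("RETURN_FROM_TRAP", None),
}
trace = run(memory, interrupt_at_step=1)
print([pc for pc, _ in trace])
```

The printed addresses show the main program starting at 100, being yanked to word 11 and then the handler at 200, and finally picking up again at 101 as if nothing had happened.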
Obviously, a dozen jobs all messing about with I/O devices would not be a good idea for a reliable system. Each interrupt needs to go to a particular job that’s doing that particular I/O. Each job would have its own set of tapes mounted, and other jobs would need to keep their paws off of them. Hardware systems had evolved to use something called channels to access I/O devices instead of the simple I/O instructions previously used. Each device was connected by a cable, or bus, to I/O logic in the mainframe. That logic was referred to as a channel. There were a number of them in each system. There would be a sequence of channel instructions, collected into a channel program, that would be fetched-decoded-executed by the channel, just as if it were a kind of CPU, which, in a way, it was. When finished, the channel would generate an interrupt as described above. To kick off these channel programs an instruction called execute channel program or EXCP would be executed.
Like protecting the memory of the monitor and the other jobs, I/O channels had to be protected from the jobs.
The ingenious solution to this problem that Gene Amdahl, Gerry Blaauw, and Fred Brooks came up with while designing the System/360 was twofold. First, they created something they called a memory key. Each program would have a four-bit value used to uniquely identify that job in the system. Likewise, there was a key associated with every 2048 bytes of memory. To load or store values from or to memory, your job key had to match the memory key of the locations you wanted to access. Job 12 would have one of sixteen possible key values and its memory pages would have the same value. If the job decided to go “walk about” it would quickly run afoul of the storage keys of some other job and get bounced out on its ear with an ABEND 0C4.
So much for the memory protection problem. The multiple I/O problem was a bit harder to figure out. The obvious answer was to only let the monitor do I/O, but how to (1) prevent the job from doing so, and (2) communicate to the monitor the I/O that is wanted?
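Here is a minimal sketch of that key check, just to make the mechanism concrete. The class and method names are invented for illustration, and the model is deliberately simplified: it checks only stores, and it leaves out the special treatment real hardware gives to key zero.

```python
BLOCK_SIZE = 2048   # bytes covered by one storage key, as in the text


class ProtectionException(Exception):
    """Stands in for the protection abend described above."""


class Memory:
    def __init__(self, size):
        self.data = bytearray(size)
        self.keys = [0] * (size // BLOCK_SIZE)   # one four-bit key per block

    def assign(self, block, key):
        self.keys[block] = key & 0xF             # keep only four bits

    def store(self, job_key, address, value):
        # The access is allowed only when the job's key matches the block's key.
        if self.keys[address // BLOCK_SIZE] != job_key:
            raise ProtectionException(f"key {job_key} may not touch {address}")
        self.data[address] = value


mem = Memory(8 * BLOCK_SIZE)
mem.assign(2, key=5)                             # block 2 belongs to the key-5 job
mem.assign(3, key=9)                             # block 3 belongs to someone else

mem.store(5, 2 * BLOCK_SIZE + 10, 0xFF)          # inside its own block: fine
try:
    mem.store(5, 3 * BLOCK_SIZE, 0xFF)           # goes "walk about"
    strayed = True
except ProtectionException:
    strayed = False
print("stray store succeeded:", strayed)
```

The point of the design is that the check costs almost nothing: one small comparison per access, done by the hardware, with no cooperation needed from the job being policed.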
“If I can’t do it myself, will you do it for me?”
To prevent user jobs from running off and willy-nilly executing channel programs, Amdahl, Blaauw, and Brooks added a CPU state bit, a single bit in the system that determined how the CPU would operate. When this bit was on, the CPU was in what they called the problem state, executing normal user jobs. When the bit was clear, the CPU was executing in the supervisor state and had special privileges. In particular, only in the supervisor state could it execute the EXCP, execute channel program, instruction. If the problem bit was set, and a job tried to do an EXCP instruction, the system would interrupt with a privileged instruction trap. Jobs that tried to do their own I/O got ABENDed.
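A toy model of the state bit might look like the following. Everything here, the `CPU` class, the `execute` method, and the string opcodes, is hypothetical scaffolding; the only things taken from the description above are the single state bit and the rule that the execute channel program operation traps when that bit is set.

```python
class PrivilegedInstructionTrap(Exception):
    """Raised when a problem-state program issues a supervisor-only instruction."""


class CPU:
    PRIVILEGED = {"EXCP"}                  # per the text, supervisor state only

    def __init__(self):
        self.problem_state = True          # bit set: running an ordinary user job
        self.channel_programs = []         # I/O the CPU was allowed to start

    def execute(self, opcode, operand=None):
        if opcode in self.PRIVILEGED and self.problem_state:
            raise PrivilegedInstructionTrap(opcode)
        if opcode == "EXCP":
            self.channel_programs.append(operand)


cpu = CPU()
try:
    cpu.execute("EXCP", "read tape 3")     # user job tries its own I/O
    trapped = False
except PrivilegedInstructionTrap:
    trapped = True

cpu.problem_state = False                  # bit clear: supervisor state
cpu.execute("EXCP", "read tape 3")         # now permitted
print(trapped, cpu.channel_programs)
```

Note that the same instruction with the same operand succeeds or traps depending on nothing but the one bit, which is exactly what makes it cheap for hardware to enforce.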
This is the kernel state that we know of today. We tend to think of the kernel state as the special thing, but in reality, at the time the problem state was the special thing. Before the 360, any program could execute a channel program, as we had done on the 7090 series. The problem state was the new thing, a state where some instructions would trap if attempted.
Where were these bits kept, the key and the problem bit? We will need to save them on interrupt and restore them on return, and the monitor needs the problem bit cleared. Along with the program counter, the storage key and the P bit were saved in something called the program status word, or PSW for short. The condition code we talked about last week was also stored in the PSW. When an interrupt occurred, the entire 64 bit PSW would be stored at fixed locations and a new PSW for the interrupt handler would be loaded from different locations. A load PSW instruction would return from an interrupt.
But how could the user program get this privileged code to execute the I/O it needed?
A new instruction was provided that caused a new interrupt. This new instruction was called the supervisor call. When this instruction was executed, the current PSW would be stored at location 32 (decimal, if you’re going to be writing a kernel for the System/360), and a new PSW would be loaded from location 96. This new PSW had its P bit clear, had the monitor’s memory key, and had the program counter set for the entry to the supervisor. The user job would fill out a data structure, put a pointer to it in a specific register, and execute the supervisor call instruction. Once running, the monitor code would take a look at the parameters filled out by the job and see that the system call was to execute a channel program. The monitor checked the channel program for validity and then issued the EXCP instruction on behalf of the job.
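The PSW swap is easier to follow with something to poke at, so here is a small sketch. The `PSW` and `Machine` classes are invented for illustration, the PSW carries only three of its fields, and the supervisor key of 0 and the addresses 0x1000 and 0x8004 are assumptions picked for the example. The locations 32 and 96 are the ones given above.

```python
from dataclasses import dataclass

SVC_OLD_PSW = 32   # where the interrupted PSW is stored
SVC_NEW_PSW = 96   # where the supervisor's PSW is loaded from


@dataclass(frozen=True)
class PSW:
    key: int        # storage protection key
    problem: bool   # True = problem state, False = supervisor state
    pc: int         # address of the next instruction


class Machine:
    def __init__(self, supervisor_entry):
        # The new PSW is set up ahead of time: P bit clear, supervisor's key
        # (assumed to be 0 here), program counter at the supervisor's entry.
        self.low_memory = {
            SVC_NEW_PSW: PSW(key=0, problem=False, pc=supervisor_entry)
        }
        self.psw = None

    def supervisor_call(self):
        self.low_memory[SVC_OLD_PSW] = self.psw     # store the current PSW
        self.psw = self.low_memory[SVC_NEW_PSW]     # load the new one

    def load_psw(self, saved):
        self.psw = saved                            # return to the saved context


m = Machine(supervisor_entry=0x1000)
m.psw = PSW(key=5, problem=True, pc=0x8004)         # user job about to make the call
m.supervisor_call()
in_supervisor = m.psw                               # what the supervisor runs with
m.load_psw(m.low_memory[SVC_OLD_PSW])               # hand control back to the job
print(in_supervisor, m.psw)
```

Because key, state bit, and program counter all travel together in one word, a single swap changes who is running, what they may touch, and what they may execute, all at once.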
While waiting on the I/O, the monitor would find some other job to run, a job whose I/O had completed. It would move the current job’s PSW from location 32 and save it in a table of running jobs. When the monitor resumed a job, it merely executed a load PSW instruction, right from the table.
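A rough sketch of that bookkeeping follows. The `Monitor` class, its method names, and the first-come, first-served choice of which job runs next are all assumptions made for illustration, and a saved PSW is reduced to a plain integer to keep the example short.

```python
from collections import deque


class Monitor:
    def __init__(self):
        self.job_table = {}     # job name -> PSW saved when the job was set aside
        self.ready = deque()    # jobs whose I/O has completed, oldest first

    def block_on_io(self, job, saved_psw):
        # Copy the job's PSW out of the fixed save location into the table.
        self.job_table[job] = saved_psw

    def io_complete(self, job):
        self.ready.append(job)

    def dispatch(self):
        """Pick the next runnable job and return the PSW to load for it."""
        if not self.ready:
            return None         # nothing runnable; a real system would wait
        job = self.ready.popleft()
        return job, self.job_table[job]


mon = Monitor()
mon.block_on_io("payroll", saved_psw=0x8004)
mon.block_on_io("inventory", saved_psw=0x9010)
mon.io_complete("inventory")    # inventory's tape read finishes first
print(mon.dispatch())           # the job to resume and the PSW to load for it
print(mon.dispatch())           # payroll is still waiting, so nothing to run
```

Resuming a job is nothing more than loading its saved PSW, so the table of PSWs is, in effect, the whole list of who is running.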
The same thing happens today with the de facto Intel x86 standard architecture. Registers are loaded with parameters to the kernel and an interrupt instruction makes the system call. Different bus adaptors have replaced the channels, but those adaptors work in a similar way, moving data autonomously from the I/O bus to main memory.
We’ve also kept some names of things. What was originally a user job later turned, with the advent of timesharing, into a user process, and the problem state it ran in we now just call user mode. The monitor was renamed the control program, also referred to as the supervisor. IBM’s MVS operating system called the central part of the supervisor a nucleus, which is just Latin for kernel.
Okay, now that you understand how an instruction works, and also the how and why of kernel mode, what about virtual machines? Next week we cover some interesting history of VMs. The virtual machine ideas will start to creep in as we do. Stay tuned.
Now, A Word from Our Sponsor
These blogs are made possible by my day job, developing products for and running SouthSuite, Inc., maker of the Coraid brand Ethernet-based storage area network hardware and software. As with the Cisco PIX Firewall and the Cisco LocalDirector load balancer, I used what I knew about networking to solve a particular problem. This time the problem was how to add fast, easy to deploy and use block storage to a network for cheap. The answer, in some ways, was just like the PIX and LocalDirector: start with custom configured commodity hardware and add really good software.
And to invent a new SAN protocol, ATA-over-Ethernet.
The Coraid EtherDrive SAN System was the result. I think it is the easiest, least expensive way to add unlimited storage to your hypervisor operation. My software runs on commodity hardware and turns it into easy to use block storage. Simple Ethernet cards, called EtherDrive HBAs, are placed in your hosts and make our network block storage simply look like a local SAS drive, no matter what you stick into the bays. Easier and faster than iSCSI. Easier and cheaper than Fibre Channel. You already know almost all you need to know.
And it gives your SAN cloud-like economics. You buy our hardware/software combos, put them in your network, and add drives as you need them. Fill up a media array? Just get an additional one and add it to the network. It all can be as cheap as the equivalent of $0.001 per GB per month. Only, you don’t pay by the month. Your investment in the inexpensive system keeps giving dividends for years.
So, please suggest us to a friend. Over 1,700 companies have used our SAN system in the past.
- Easy to deploy, easy to manage
- Start for a few thousand dollars and grow as you need
- Unlimited growth to many petabytes
- Cheaper than the cloud; $0.001 / GB / Mo
- Expandable like the cloud.
Or email me at firstname.lastname@example.org, or call us toll free at 844.461.8820 or +1 706 521 3048.