“I’m concerned,” admitted Anne, the newest hire at Adaptive. She had been snapped up from Hewlett-Packard to manage the software development on the STM/18, replacing Roger as my boss. “Brantley plans to write TCP/IP from scratch. Do you think he can do it?”
Roger smiled and replied, “I gave him the responsibility. He has to have the authority.”
I didn’t know of Anne’s doubts at the time. She was a great manager who was not only effective at coordinating a large group but genuinely pleasant to work for. Many years later, I would suggest hiring her during Coraid’s VC days in California.
The first time I met Anne was during an office move. At the time I was working mostly in the peace and quiet of my family’s apartment but was asked to come in and pack up what was left in my cubicle. One weekend, movers loaded up the entire office only to drive about 3000 yards to a larger building in the same office complex. We could all eat sandwiches from the park’s deli and still catch glimpses of Steve Jobs in the wild.
To my surprise, my decision to work from home affected my new cubicle assignment. In the back of the building was a small room, just large enough for four cubes, where they stuck John and me. By some miracle, it had a door. I could once again concentrate at the office. John called it the “Gun” room.
Early in my port of the Unix kernel, co-workers began to call it “Brantlix” or “Brantleyix.” The kernel was the work of Dennis Ritchie, Ken Thompson, and the rest of the bunch from Murray Hill, so naming it after myself didn’t sit right. I started to call it Generic Unix, the kind of brandless Unix you’d buy with a white label emblazoned with JUST UNIX across the front.
John started calling it Gun for short, so our office became the Gun room. It was a natural enough name because of our trips in his Mooney up to Gravelly Valley to enjoy some target practice. We even named a few early prototypes “Colt” and “Ruger.”
For the TCP/IP, though, I’d leave the Gun room and work from home.
The design followed from Dennis Ritchie’s STREAMS design. At my previous job, I had used a version of STREAMS that I’d written from scratch. Dennis had been kind enough to send me an include file with the data structures and a list of function names (those were the days before function prototypes). It was easy enough to work out what each function did from Dennis’ description in his paper on STREAMS.
At Adaptive, I built on that work. I’d written an Ethernet driver for AMD’s LANCE Ethernet chip. Unlike the network drivers in Berkeley Unix, which had no entries in /dev, it was a STREAMS driver with its own special files in the /dev directory.
Any of /dev/en0 through /dev/en7 could be opened and an ioctl call made to assign it an Ethernet type. Each packet arriving from the network was placed on one of the eight devices according to the two-byte Ethernet type field in its header and then sent upstream. A packet written to a device was transmitted on the Ethernet.
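The demultiplexing step can be sketched in a few lines of C. This is a minimal user-space illustration, not the original driver: the table, `en_settype` (a stand-in for the ioctl), and `en_demux` are hypothetical names, and STREAMS queues are left out. It relies only on the Ethernet frame layout, where the two-byte big-endian type sits at offset 12, after the destination and source addresses.

```c
#include <stddef.h>
#include <stdint.h>

#define NEN 8          /* /dev/en0 through /dev/en7 */
#define ETHER_HDR 14   /* dst (6) + src (6) + type (2) */

/* One entry per minor device (hypothetical layout). */
struct en_minor {
    uint16_t type;     /* Ethernet type bound by the ioctl stand-in */
    int      open;     /* nonzero once a type has been bound */
};

static struct en_minor en_table[NEN];

/* Hypothetical stand-in for the ioctl that binds a type to a minor. */
int en_settype(int minor, uint16_t type)
{
    if (minor < 0 || minor >= NEN)
        return -1;
    en_table[minor].type = type;
    en_table[minor].open = 1;
    return 0;
}

/* Read the big-endian type at offset 12 and return the minor device that
 * should receive the frame, or -1 if nobody asked for that type. */
int en_demux(const uint8_t *frame, size_t len)
{
    uint16_t type;
    int i;

    if (len < ETHER_HDR)
        return -1;
    type = (uint16_t)((frame[12] << 8) | frame[13]);
    for (i = 0; i < NEN; i++)
        if (en_table[i].open && en_table[i].type == type)
            return i;
    return -1;   /* drop: no device bound to this type */
}
```

With IP (0x0800) bound to one minor and ARP (0x0806) to another, each stack sees only its own traffic, which is why nothing like a socket layer was needed to share the interface.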
The first use of my Ethernet driver was a temporary protocol for using the port remotely over Ethernet. The protocol of choice in those days was Telnet, but that required TCP. Instead, I designed what has to be one of the simplest protocols ever invented.
The packet format was simply a single two-byte big-endian number containing a count of the bytes that followed. The Ethernet addresses were hardcoded, both in the client program and the server program in the Gun system. It had nothing: no resend, no acknowledgment, no login, no flow control, and no checksum. It worked great.
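A sketch of that framing in C, assuming nothing beyond the description above; the function names are hypothetical. The count matters because Ethernet pads short frames to its 60-byte minimum (46 bytes of payload), so without it the receiver could not tell real data from padding.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Build a frame: a two-byte big-endian count followed by the bytes.
 * Returns the frame length, or 0 if it will not fit in out. */
size_t frame_encode(const uint8_t *data, size_t n, uint8_t *out, size_t outsz)
{
    if (n > 0xFFFF || outsz < n + 2)
        return 0;
    out[0] = (uint8_t)(n >> 8);     /* high byte first: big-endian */
    out[1] = (uint8_t)(n & 0xFF);
    memcpy(out + 2, data, n);
    return n + 2;
}

/* Parse a received frame (possibly padded). Sets *data to the payload and
 * returns its length, or -1 if the count claims more than the frame holds. */
long frame_decode(const uint8_t *frame, size_t len, const uint8_t **data)
{
    size_t n;

    if (len < 2)
        return -1;
    n = ((size_t)frame[0] << 8) | frame[1];
    if (n > len - 2)
        return -1;
    *data = frame + 2;
    return (long)n;
}
```

Everything else (ordering, loss, flow control) was left to the wire, which is exactly what made it so short.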
It depended on Ethernet, which was much more reliable than people thought, even in the early days. In our Gun room we used thin “cheapernet” Ethernet over RG58 coax, and even that almost never dropped a packet. When two stations collided, the chips would retransmit after a random delay chosen by exponential backoff, giving up only after 16 attempts.
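The retry rule the chips followed is IEEE 802.3’s truncated binary exponential backoff: after the nth collision, wait a random number of slot times in the range [0, 2^k), where k = min(n, 10), and abandon the frame after 16 attempts. A minimal C sketch of that rule; the function name is hypothetical and the caller supplies the random value, as a test harness or hardware LFSR might.

```c
#include <stdint.h>

#define ATTEMPT_LIMIT 16   /* 802.3: give up after 16 transmission attempts */
#define BACKOFF_LIMIT 10   /* exponent stops growing after 10 collisions */
#define SLOT_NS 51200      /* 512 bit times at 10 Mb/s = 51.2 microseconds */

/* Delay in nanoseconds before retrying after the nth collision (n >= 1),
 * or -1 when the frame should be abandoned. rnd is a caller-supplied
 * random value; masking it gives a uniform count if rnd is uniform. */
long backoff_ns(int n, uint32_t rnd)
{
    int k;
    uint32_t slots;

    if (n >= ATTEMPT_LIMIT)
        return -1;
    k = n < BACKOFF_LIMIT ? n : BACKOFF_LIMIT;
    slots = rnd & ((1u << k) - 1);   /* 0 .. 2^k - 1 slot times */
    return (long)slots * SLOT_NS;
}
```

The doubling range is what let a busy segment settle down on its own: each repeat collision makes it less likely the same stations pick the same slot again.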
The rest of the building was even better. Mike Clair, the husband of Adaptive CEO Audrey MacLean, worked for SynOptics, so we had one of their cutting-edge hubs that reduced collisions even further. Our only problem area was a particularly cheap piece of equipment used to bridge the 10BaseT UTP network to the RG58 network in the test room. We did see some funnies there. I don’t remember how I fixed it, but it didn’t take much.
For the TCP/IP work, I started with IP, and I started in user space. The Address Resolution Protocol (ARP), which lets a host learn the Ethernet address that goes with another host’s IP address, has its own RFC, as do IP and TCP. The ARP and IP RFCs are still current today, and the original TCP RFC served for more than four decades before RFC 9293 consolidated it in 2022. Later RFCs add a good deal about how to avoid gotchas in the protocols. In 1990, the main one was RFC 1122.
TCP/IP has to be in the kernel to be of any real use. Protocols need performance, both to deliver high bit rates from the box and to avoid spending all the machine’s cycles doing so. Developing in the kernel slows things down, if only because every change means compiling, loading the kernel onto the test machine, and rebooting, and rebooting takes time.
So, in the name of efficiency, I started in user space, writing the code so that I could move it into the kernel later. I did this in a few ways.
First, I wrote a few simple functions that simulated the kernel features I needed. In STREAMS, data is held in message blocks, which routines like allocb and freeb allocate and free. I used the actual block structure, but my allocb and freeb were simple fakes.
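What such fakes might look like, as a minimal sketch: the field names follow the STREAMS convention (`b_rptr`, `b_wptr`, `b_cont`), but the real kernel splits each message into a `msgb` header and a separately reference-counted data block, which this user-space version collapses into a single `malloc`. The `pri` argument is accepted and ignored, and the layout is an assumption, not the original code.

```c
#include <stdlib.h>

/* Simplified message block. Real STREAMS pairs msgb with a shared,
 * reference-counted datab; this fake keeps one buffer per block. */
typedef struct msgb {
    struct msgb   *b_next;   /* next message on a queue */
    struct msgb   *b_cont;   /* next block of the same message */
    unsigned char *b_rptr;   /* first unread byte */
    unsigned char *b_wptr;   /* first unwritten byte */
    unsigned char *b_base;   /* start of the buffer */
    unsigned char *b_lim;    /* one past the end of the buffer */
} mblk_t;

/* Fake allocb: header and buffer in one allocation; pri is ignored. */
mblk_t *allocb(int size, unsigned int pri)
{
    mblk_t *mp;

    (void)pri;
    if (size < 0)
        return NULL;
    mp = malloc(sizeof *mp + (size_t)size);
    if (mp == NULL)
        return NULL;
    mp->b_next = mp->b_cont = NULL;
    mp->b_base = (unsigned char *)(mp + 1);   /* buffer follows header */
    mp->b_lim = mp->b_base + size;
    mp->b_rptr = mp->b_wptr = mp->b_base;     /* empty: nothing written */
    return mp;
}

/* Fake freeb: releases one block. */
void freeb(mblk_t *mp)
{
    free(mp);
}

/* Free every block chained through b_cont. */
void freemsg(mblk_t *mp)
{
    while (mp != NULL) {
        mblk_t *next = mp->b_cont;
        freeb(mp);
        mp = next;
    }
}
```

Because the protocol code only touches the pointers, swapping these fakes for the kernel’s real routines later requires no changes to the protocol code itself.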
Next, I decided not to implement all the code that’d go into the kernel, just the main parts, the bits that were the core of the protocol.
Lastly, I used the real Ethernet driver: the user-space routines opened /dev/en1 and used it to send and receive IP packets on our real company network. This saved a lot of time because I didn’t have to write any test scaffolding. I only had to use my Sun to try to make a TCP connection to the VME bucket.
It all worked very well. I wrote the ARP code to answer the query from the Sun. I then read in the IP packets and displayed them on the VME bucket.
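Answering an ARP query amounts to checking the request and swapping fields, per RFC 826. A minimal sketch for Ethernet and IPv4, operating on the 28-byte ARP body that follows the 14-byte Ethernet header; the function name and calling convention are hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define ARP_LEN 28   /* Ethernet/IPv4 ARP body, after the Ethernet header */

/* Layout: htype(2) ptype(2) hlen(1) plen(1) op(2)
 *         sha(6) @8  spa(4) @14  tha(6) @18  tpa(4) @24 */

/* If req asks for my_ip, build the answer in reply and return 1. */
int arp_answer(const uint8_t req[ARP_LEN], const uint8_t my_mac[6],
               const uint8_t my_ip[4], uint8_t reply[ARP_LEN])
{
    /* htype 1 (Ethernet), ptype 0x0800 (IP), hlen 6, plen 4, op 1 */
    static const uint8_t hdr[8] = { 0, 1, 0x08, 0x00, 6, 4, 0, 1 };

    if (memcmp(req, hdr, 8) != 0)
        return 0;                       /* not an Ethernet/IP request */
    if (memcmp(req + 24, my_ip, 4) != 0)
        return 0;                       /* asking about someone else */
    memcpy(reply, hdr, 6);
    reply[6] = 0;
    reply[7] = 2;                       /* op 2: reply */
    memcpy(reply + 8, my_mac, 6);       /* sender hardware = us */
    memcpy(reply + 14, my_ip, 4);       /* sender IP = us */
    memcpy(reply + 18, req + 8, 6);     /* target hardware = requester */
    memcpy(reply + 24, req + 14, 4);    /* target IP = requester */
    return 1;
}
```

The reply then goes back out through the ARP device (Ethernet type 0x0806), addressed to the requester’s hardware address.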
To fully test IP, I next implemented UDP, which has its own RFC. This was easy because UDP is so simple. Its header has only four fields: source port, destination port, the length in bytes of the datagram (header plus data), and an optional checksum.
With UDP in place, I wrote a simple server on port 7 that echoed datagrams back to the sender. An uncomplicated program on the Sun tested it.
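The four-field header is easy to show in code, along with the Internet checksum from RFC 1071 that UDP (and the IP header) use. A minimal sketch with hypothetical names; note that the real UDP checksum also covers a pseudo-header of source and destination IP addresses, protocol, and length, which is omitted here.

```c
#include <stddef.h>
#include <stdint.h>

struct udp_hdr {
    uint16_t sport, dport;   /* source and destination ports */
    uint16_t len;            /* header plus data, in bytes (at least 8) */
    uint16_t cksum;          /* 0 means the sender computed none */
};

static uint16_t get16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);   /* network byte order */
}

/* Parse the header at the start of an IP payload of n bytes.
 * Returns 0 on success, -1 if the length field is inconsistent. */
int udp_parse(const uint8_t *p, size_t n, struct udp_hdr *h)
{
    if (n < 8)
        return -1;
    h->sport = get16(p);
    h->dport = get16(p + 2);
    h->len   = get16(p + 4);
    h->cksum = get16(p + 6);
    if (h->len < 8 || h->len > n)
        return -1;
    return 0;
}

/* RFC 1071 Internet checksum: ones'-complement sum of 16-bit words. */
uint16_t inet_cksum(const uint8_t *p, size_t n)
{
    uint32_t sum = 0;

    while (n > 1) {
        sum += get16(p);
        p += 2;
        n -= 2;
    }
    if (n == 1)
        sum += (uint32_t)p[0] << 8;          /* pad odd byte with zero */
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);  /* fold carries back in */
    return (uint16_t)~sum;
}
```

An echo server on port 7 then needs little more than parsing the header, swapping the two ports, and sending the data back to the source address.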
There is more to the IP layer than all this, though. There’s the question of where to send an outgoing packet when the destination IP address is not on the local segment. IP datagrams can also arrive fragmented and need reassembling. I implemented fragmentation and reassembly after TCP was in place, because fragmentation never happens on a local network using the standard maximum transmission unit (MTU) of 1,500 bytes. Later, we’d set up two STM/18s, configure the MTU between them to be about 600 bytes, and make a connection through them. This allowed us to test both fragmentation and reassembly.
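Both questions can be sketched in C under simplifying assumptions: a single subnet mask and a single default gateway instead of a routing table, and a fixed 20-byte IP header with no options. The names are hypothetical, and reassembly is not shown. Fragment payloads other than the last must be multiples of 8 bytes, because the offset field counts 8-byte units, so a 600-byte MTU carries 576 bytes per fragment.

```c
#include <stddef.h>
#include <stdint.h>

#define IP_HDR 20   /* IPv4 header without options */

/* Deliver directly if dst is on our subnet, otherwise via the gateway. */
uint32_t next_hop(uint32_t dst, uint32_t my_addr, uint32_t mask,
                  uint32_t gateway)
{
    return ((dst & mask) == (my_addr & mask)) ? dst : gateway;
}

struct frag {
    uint16_t offset8;   /* fragment offset field, in 8-byte units */
    uint16_t len;       /* payload bytes in this fragment */
    int      more;      /* MF (more fragments) flag */
};

/* Plan fragments for n payload bytes on a link with the given MTU.
 * Returns the fragment count, or -1 if out[] is too small or MTU tiny. */
int ip_fragment(size_t n, size_t mtu, struct frag *out, int max)
{
    size_t per, off = 0;
    int count = 0;

    if (mtu < IP_HDR + 8)
        return -1;
    per = (mtu - IP_HDR) & ~(size_t)7;   /* round down to a multiple of 8 */
    do {
        size_t len = n - off < per ? n - off : per;
        if (count == max)
            return -1;
        out[count].offset8 = (uint16_t)(off / 8);
        out[count].len = (uint16_t)len;
        out[count].more = (off + len < n);   /* set on all but the last */
        count++;
        off += len;
    } while (off < n);
    return count;
}
```

For example, a full 1,480-byte payload from a 1,500-byte Ethernet packet crossing a 600-byte link becomes three fragments of 576, 576, and 328 bytes at offsets 0, 72, and 144.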
Finally, I started on TCP itself. The core of the protocol is processing incoming segments, as TCP calls its packets, against a per-connection record, the transmission control block, that tracks the state of each connection. The TCP RFC included a diagram that followed a connection from sending a SYN (synchronize) segment to sending a FIN (finish) segment and closing the connection. I just translated the diagram into code.
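Translating the diagram into code can look like the sketch below: one case per state, one line per arrow. It is a simplified illustration, not the STM/18 code. Events are abstracted into a hypothetical enum (with `EV_ACK` meaning an ACK of our SYN or FIN), the `TS_` names are my own, and segment transmission, RST handling, and the direct FIN-WAIT-1 to TIME-WAIT arrow are omitted.

```c
enum tcp_state {
    TS_CLOSED, TS_LISTEN, TS_SYN_SENT, TS_SYN_RECEIVED, TS_ESTABLISHED,
    TS_FIN_WAIT_1, TS_FIN_WAIT_2, TS_CLOSE_WAIT, TS_CLOSING, TS_LAST_ACK,
    TS_TIME_WAIT
};

enum tcp_event {
    EV_PASSIVE_OPEN, EV_ACTIVE_OPEN, EV_CLOSE,
    EV_SYN, EV_SYN_ACK, EV_ACK,   /* EV_ACK: acknowledges our SYN or FIN */
    EV_FIN, EV_TIMEOUT            /* EV_TIMEOUT: the 2*MSL timer expired */
};

/* Follow one arrow of the RFC 793 state diagram; any event without an
 * arrow leaves the state unchanged. Comments note the segment to send. */
enum tcp_state tcp_next(enum tcp_state s, enum tcp_event e)
{
    switch (s) {
    case TS_CLOSED:
        if (e == EV_PASSIVE_OPEN) return TS_LISTEN;
        if (e == EV_ACTIVE_OPEN)  return TS_SYN_SENT;     /* send SYN */
        break;
    case TS_LISTEN:
        if (e == EV_SYN)   return TS_SYN_RECEIVED;        /* send SYN,ACK */
        if (e == EV_CLOSE) return TS_CLOSED;
        break;
    case TS_SYN_SENT:
        if (e == EV_SYN_ACK) return TS_ESTABLISHED;       /* send ACK */
        if (e == EV_SYN)     return TS_SYN_RECEIVED;      /* simultaneous open */
        if (e == EV_CLOSE)   return TS_CLOSED;
        break;
    case TS_SYN_RECEIVED:
        if (e == EV_ACK)   return TS_ESTABLISHED;
        if (e == EV_CLOSE) return TS_FIN_WAIT_1;          /* send FIN */
        break;
    case TS_ESTABLISHED:
        if (e == EV_CLOSE) return TS_FIN_WAIT_1;          /* send FIN */
        if (e == EV_FIN)   return TS_CLOSE_WAIT;          /* send ACK */
        break;
    case TS_FIN_WAIT_1:
        if (e == EV_ACK) return TS_FIN_WAIT_2;
        if (e == EV_FIN) return TS_CLOSING;               /* send ACK */
        break;
    case TS_FIN_WAIT_2:
        if (e == EV_FIN) return TS_TIME_WAIT;             /* send ACK */
        break;
    case TS_CLOSE_WAIT:
        if (e == EV_CLOSE) return TS_LAST_ACK;            /* send FIN */
        break;
    case TS_CLOSING:
        if (e == EV_ACK) return TS_TIME_WAIT;
        break;
    case TS_LAST_ACK:
        if (e == EV_ACK) return TS_CLOSED;
        break;
    case TS_TIME_WAIT:
        if (e == EV_TIMEOUT) return TS_CLOSED;
        break;
    }
    return s;
}
```

Keeping every transition in one function makes it easy to check the code against the RFC’s figure arrow by arrow.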
Starting a TCP connection takes a three-way handshake: each side sends a SYN carrying its initial sequence number, and each acknowledges the other’s. From then on, every byte sent is numbered, and the sequence number in a segment’s header is the number of its first byte. Acknowledgments from the receiver tell the sender which sequence number it expects next.
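The arithmetic behind this is small but easy to get wrong, because sequence numbers are 32 bits and wrap around. A minimal sketch with hypothetical names: comparisons use the sign of the 32-bit difference (the same trick as the BSD `SEQ_LT` macro, relying on two’s-complement conversion), SYN and FIN each consume one sequence number, and an incoming ACK is checked against RFC 793’s rule SND.UNA < SEG.ACK ≤ SND.NXT.

```c
#include <stdint.h>

/* a < b in sequence space; valid while a and b are within 2^31. */
static int seq_lt(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

/* Sequence number the peer should acknowledge after this segment:
 * SYN and FIN each take one number in addition to the data bytes. */
uint32_t seg_end(uint32_t seq, uint32_t datalen, int syn, int fin)
{
    return seq + datalen + (syn ? 1u : 0u) + (fin ? 1u : 0u);
}

/* RFC 793: an ACK is acceptable if SND.UNA < SEG.ACK =< SND.NXT. */
int ack_ok(uint32_t snd_una, uint32_t ack, uint32_t snd_nxt)
{
    return seq_lt(snd_una, ack) && !seq_lt(snd_nxt, ack);
}
```

For example, if one side’s initial sequence number is 1000, its SYN is acknowledged with 1001, and a following 100-byte segment starting at 1001 is acknowledged with 1101.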
Thanks to the well-written RFC, development breezed along with almost no surprises. Soon I could telnet into the STM/18 from the Sun, all still in user space.
No one else could log in, though. The single user process allowed only one connection. When I connected, it forked and executed the shell, then passed the data between my telnet session and the shell through pipes. It all worked amazingly well.
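The fork-and-pipes arrangement is standard POSIX. A minimal sketch of starting a program with its input and output attached to pipes; `spawn_piped` is a hypothetical name, and the loop that copies bytes between the TCP connection and these two descriptors is not shown.

```c
#include <sys/types.h>
#include <unistd.h>

/* Start argv[0] with stdin, stdout, and stderr on pipes. On success,
 * *to_child is the write end feeding the child's input, *from_child the
 * read end of its output, and the child's pid is returned; -1 on error. */
pid_t spawn_piped(char *const argv[], int *to_child, int *from_child)
{
    int in[2], out[2];
    pid_t pid;

    if (pipe(in) < 0)
        return -1;
    if (pipe(out) < 0) {
        close(in[0]);
        close(in[1]);
        return -1;
    }
    pid = fork();
    if (pid < 0) {
        close(in[0]); close(in[1]); close(out[0]); close(out[1]);
        return -1;
    }
    if (pid == 0) {                /* child: wire the pipes to fds 0-2 */
        dup2(in[0], 0);
        dup2(out[1], 1);
        dup2(out[1], 2);           /* shell errors reach the session too */
        close(in[0]); close(in[1]); close(out[0]); close(out[1]);
        execv(argv[0], argv);
        _exit(127);                /* only reached if exec failed */
    }
    close(in[0]);                  /* parent keeps the opposite ends */
    close(out[1]);
    *to_child = in[1];
    *from_child = out[0];
    return pid;
}
```

With `/bin/sh` as `argv[0]`, the user process would write bytes received on the connection to `*to_child` and send whatever it reads from `*from_child` back over TCP.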
Then I got a call from Adaptive that I needed to return to the Gun room to help the application team move their programs onto the STM/18. Their code had been written on the Suns, closely following the specifications of what Gun would provide. They were a good team, professional, and worked hard to make everything work.
While there, Anne mentioned that I should attend the daily build meetings. At first, I balked. My OS was a separate undertaking, so there was little reason for me to be there. I was also one of those people who avoided meetings with a passion, a sentiment I’ve found in many people who enjoy building things. In the end, though, I decided to attend.
The first topic of that day’s meeting was making sure we all used source code control to check in our work. Questions abounded. Do we check in whenever we make a change, or when it compiles? Or do we wait until it passes our unit tests? The argument that followed between different factions was enough of a distraction that I was able to bow out politely. I rarely went back.
Krish, one of the developers in that meeting, popped his head into the Gun room to let me know he’d found a bug. He described the problem, and I told him it sounded to me like an issue in his code. Krish demurred, insisting the issue was in Gun. Shrugging, I went back to my work.
Krish, being a sporting kind of guy, posted an email on the developer mailing list opening a friendly pool on where the bug was, in his code or mine. After a few days he didn’t have any takers, so he sent another email asking why. A co-worker replied, “I don’t know how he does it, but the bugs are never in Brantley’s code.” Sure enough, Krish soon found his bug.
What was my secret? I’d stopped using the Sun weeks earlier and had been living on the STM/18 prototypes almost exclusively. I had plenty of bugs. I just got a chance to fix them before anyone else could see them.
Once the application team’s port was finished, I returned to my apartment for a final push on the TCP. I moved everything down into the kernel, writing new glue to connect it with the rest of the system. It took about as long to do that as it did to write the user process version.
All in all, it took two man-months to write TCP/IP from scratch, three if you count the time I helped with the application port.
Anne breathed a sigh of relief. She admitted that she’d been on a team of twelve at HP who took 24 months to implement TCP/IP, hence her cautious attitude toward my work. Truth be told, if I’d been on a team of twelve, it would’ve taken me 24 months too.
There was still more work to do, of course, but not much. The system worked. It served its purpose in the STM/18 with distinction. It handled unexpected changes with grace, like the time the application outgrew its original 4 MB limit. I inserted a VME memory card into the system and it kept chugging along.
The last major thing I did was add a shared library. The application was 90% library and 10% program, the reverse of most programs on the system.
It took about a week to add a new memory segment for the library. First, I created a binary with the whole library linked into it, a sort of do-nothing a.out. That a.out was loaded to run in the new segment’s address space, which was read-only and shared by all processes.
I then wrote a quick C program to read the entry points of that a.out and create an assembly program of stubs for each function. This was the “library” that the user program linked to. When the user called the stub, it merely jumped to the location in the new memory segment that contained the do-nothing a.out where the code lived.
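The stub generator can be sketched as a small C function that turns (name, address) pairs into assembly text. This is an illustration under stated assumptions: the text does not say which CPU or assembler the STM/18 used, so the output uses a Motorola 68000-style `jmp` to an absolute address, Unix-assembler `.globl` directives, and the old a.out convention of prefixing C symbols with an underscore. Reading the symbol table itself is left out, and `gen_stubs` and `struct entry` are hypothetical names.

```c
#include <stdio.h>
#include <string.h>

struct entry {
    const char   *name;   /* C function name from the library a.out */
    unsigned long addr;   /* its absolute address in the shared segment */
};

/* Write one stub per entry into buf: a global label and an absolute jump.
 * A call to the stub pushes the return address, and the jump carries it
 * straight into the shared code. Returns bytes written, or -1 if full. */
int gen_stubs(const struct entry *e, int n, char *buf, size_t size)
{
    size_t used = 0;
    int i, k;

    if (size > 0)
        buf[0] = '\0';
    for (i = 0; i < n; i++) {
        k = snprintf(buf + used, size - used,
                     "\t.globl\t_%s\n_%s:\tjmp\t0x%lx\n",
                     e[i].name, e[i].name, e[i].addr);
        if (k < 0 || (size_t)k >= size - used)
            return -1;
        used += (size_t)k;
    }
    return (int)used;
}
```

Because the stubs jump rather than call, the library function returns directly to the user’s code, and only the stub file needs regenerating when the library is rebuilt at new addresses.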
The work was a little more challenging because I was in the process of moving my family back home. The next week I drove across the country to find a place for my family to live. I flew back to San Fran on a round-trip ticket whose return leg Betsy would use to bring herself and Charlie back to Georgia.
Once back in the office, I found that the simple shared library had a problem. I’d only included subroutines with no data in them, which made implementation easier but didn’t free up enough space. That week I modified the shared library segment to have a data section as well. For a new process, it wasn’t mapped initially. On first touch, the process got its own copy of the library’s data segment. This saved space while sparing processes that never called the library the overhead of data they wouldn’t use.
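The first-touch idea can be modeled in user space. In the kernel this would happen in the page-fault path by mapping fresh pages initialized from the library’s data; the sketch below stands in `malloc` and `memcpy` for that, and `struct proc`, `struct libdata_image`, and `libdata_touch` are hypothetical names. Zero-filled (BSS) data is not modeled.

```c
#include <stdlib.h>
#include <string.h>

/* The library's initialized data as linked into the do-nothing a.out.
 * Hypothetical: in the kernel this is a range of pages, not an object. */
struct libdata_image {
    const void *init;   /* template of initialized data */
    size_t      size;
};

struct proc {
    void *libdata;      /* NULL until the process first touches it */
};

/* Model of the first-touch fault: give the process a private copy of the
 * library's data the first time, and return the same copy afterward. */
void *libdata_touch(struct proc *p, const struct libdata_image *img)
{
    if (p->libdata == NULL) {
        p->libdata = malloc(img->size);
        if (p->libdata == NULL)
            return NULL;
        memcpy(p->libdata, img->init, img->size);
    }
    return p->libdata;
}
```

A process that never touches the library never pays for the copy, while each process that does gets private, writable data behind the shared, read-only code.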
I finished this, packed up the moving van, and headed across the country just as the Los Angeles riots began. Betsy and Charlie would soon follow by plane, and I looked forward to picking them up at the Atlanta airport.