EOS – so what is it?

The EOS architecture is unique in the world of networking, although it should be fairly easy to understand for anyone with a background in IT systems architectures and the UNIX operating system. Many parts of it will be extremely comfortable to the Linux community and to users of Cisco’s classic IOS. What makes EOS unique, though, is how these parts come together, akin to how the Reese’s Peanut Butter Cup brings chocolate and peanut butter together to create something unique and wonderful.

EOS has several core parts:

Lessons learned from Unix
Lessons learned from IT
Lessons learned from Networking
Lessons learned as a Historian

The Lesson of Unix

The Unix lesson we learned is a very simple one: if you want something to be stable, keep it simple. The Unix kernel adopts this philosophy, moving as much as possible into user space so that the kernel can stay simple and keep the processes running on it stable. EOS uses an unmodified Linux kernel, currently from Fedora Core 12, although we are moving to FC14 shortly.

By not modifying the kernel we gain several key advantages over companies that have modified, in some cases extensively, the kernel of either BSD or Linux:

We can stay current with the latest in kernel capabilities around memory management, schedulers, and stacks
If there is a security issue in TCP or another kernel-resident component, the open source community fixes it within hours
Our kernel is extremely familiar to all developers
We can provide a single image across all Arista products because the kernel contains no device-specific drivers

The Lesson of IT

The RESTful web architectures that developed in the late 1990s and have carried through to today embraced the value of a multi-tier design in which an end-user presentation layer (web servers), application logic (app servers), and stateful databases (DB servers) together make up the application. This architecture scaled very well because no application had to integrate directly with another; they communicated in prescribed ways through the database tier using a publish and subscribe model.

This architecture has been implemented within EOS itself: we have broken down the applications within the networking stack so that each operates independently and all communication goes through a centralized database. The database in question, SysDB, is a NoSQL, in-memory, machine-generated database, with optimized C++ and client code auto-generated from the requirements of every application to publish and subscribe to data objects.

The CLI and the APIs into EOS are equivalent to the web tier
The processes such as BGP, PIM, STP, and SNMP are equivalent to the application tier
They all communicate through SysDB – the database tier
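
To make the publish and subscribe idea concrete, here is a minimal Python sketch of agents that talk only through a shared in-memory store. It is not SysDB and does not reflect its API: the StateStore class, its method names, and the path string are all made up for illustration.

```python
from collections import defaultdict


class StateStore:
    """Toy in-memory store standing in for the database tier (hypothetical)."""

    def __init__(self):
        self._data = {}
        self._subscribers = defaultdict(list)

    def subscribe(self, path, callback):
        # Register interest in a path; callback(path, value) runs on each write.
        self._subscribers[path].append(callback)

    def publish(self, path, value):
        # Record the new state, then notify everyone subscribed to that path.
        self._data[path] = value
        for callback in self._subscribers[path]:
            callback(path, value)

    def read(self, path):
        return self._data.get(path)


store = StateStore()
seen = []

# A "CLI" agent reacts to state changes without knowing who wrote them.
store.subscribe("routing/bgp/peer_count", lambda path, value: seen.append((path, value)))

# A "BGP" agent publishes its state; it never calls the CLI agent directly.
store.publish("routing/bgp/peer_count", 4)
```

The point of the design is that the writer and the reader share nothing except the store, so either one can be replaced or restarted without the other changing.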

Lessons from Networking

We all learned a lot working at and with other networking companies in the past. We learned that a consistent CLI as the primary user-to-machine method of programming a networking device was the accepted norm, even if the CLI itself was a throwback to 1960s terminal models. We also learned that keeping the CLI as consistent as possible with existing user expectations sharply lowers the learning curve. So the first order of business was to stick with a familiar, industry-standard CLI that most network administrators could pick up and be useful on in seconds. Commands such as:

Write Memory
Show config
write t
sho ip acc
are all accepted.
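
Abbreviations like these work because each typed word only has to be an unambiguous prefix of a real keyword. The Python sketch below shows one simple way to do that matching. It is an illustration only, not the EOS parser: the expand function and the small COMMANDS table are assumptions made for the example.

```python
def expand(typed, commands):
    """Return the one command whose words each start with the typed words."""
    tokens = typed.lower().split()
    matches = []
    for command in commands:
        words = command.split()
        # Same number of words, and every typed token is a prefix of its word.
        if len(words) == len(tokens) and all(
            word.startswith(token) for word, token in zip(words, tokens)
        ):
            matches.append(command)
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise ValueError(f"unknown command: {typed!r}")
    raise ValueError(f"ambiguous command: {typed!r} matches {matches}")


# Illustrative command table (an assumption, not the real EOS command set).
COMMANDS = [
    "show ip access-lists",
    "show ip arp",
    "show interfaces",
    "write memory",
    "write terminal",
]

print(expand("sho ip acc", COMMANDS))    # show ip access-lists
print(expand("write t", COMMANDS))       # write terminal
print(expand("Write Memory", COMMANDS))  # write memory
```

With this table, "show ip a" is rejected as ambiguous because it could mean either access-lists or arp, which is the same behavior administrators expect from a familiar CLI.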

The other lesson learned from networking is that stability is paramount. Historically, though, feature velocity suffered as code bases grew larger, and monolithic code made tracing a fault take weeks or months. To address this we did several things differently:

We created an auto-build system that ensures no developer can ‘break the build’, which makes our engineers extremely productive
SysDB is auto-generated code based on the communication requirements of all the processes, which improves both performance and stability
We built an auto-test system that runs around the clock and fully regression tests the OS against all platforms; it has run over 1 million instances of our test plan to date
We learned how hard it was to find the right code for a given platform, so we moved all device drivers to user space and load only the drivers needed for a given platform at boot time. Kind of like a real OS should!

Historical Lessons

The last lesson learned was that of a technologist, or a historian: open operating environments have never lost. Even people who decry the proprietary nature of some current smartphones will admit that they are far more open development platforms than their predecessors, enabling tens of thousands of third-party applications. The devices that created a community won. The devices that embraced open software and architectures won. To that end we are building this community site, to show how to develop into the network, how to run your code on a switch, and how to solve your operational problems with human-to-machine and machine-to-machine interfaces.

The network has been the ‘voodoo magic’ part of IT, where no one wanted to change anything because the systems have historically been so brittle. We had to fix all of that baggage first. Now that we are well down that path and no longer standing on a brittle foundation, it is time to open up the OS to a new wave of innovation and investment from individuals, the Linux community, management systems, academia, and corporations.

When we use open source, we contribute back: you will see Arista forking and then contributing on GitHub and other sites, contributing to Linux kernel development, and participating in the communities where the real work gets done.

So what benefits does all of this bring? In EOS each process is separate, and each process communicates only with SysDB. This means that a single process can fail, or be deliberately restarted, and in most cases the topology will never know. The process is restarted, reads its last state back in from SysDB, and returns to production.
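
The restart behavior is easier to see in a small sketch. In the Python below, a plain dictionary stands in for the central database and the CounterAgent class is a made-up example agent; neither reflects how EOS agents or SysDB are actually written. The idea it illustrates is only that an agent which keeps its state in the store can be thrown away and recreated without losing its place.

```python
class CounterAgent:
    """Toy agent that keeps all of its state in a shared store (hypothetical)."""

    KEY = "counter/packets_seen"

    def __init__(self, store):
        self.store = store
        # On start or restart, resume from whatever the store last recorded.
        self.packets_seen = store.get(self.KEY, 0)

    def handle_packet(self):
        self.packets_seen += 1
        # Write through to the store so a replacement instance can pick up here.
        self.store[self.KEY] = self.packets_seen


store = {}                        # stands in for the central database
agent = CounterAgent(store)
for _ in range(3):
    agent.handle_packet()

del agent                         # simulate the process being killed
restarted = CounterAgent(store)   # a supervisor would start a fresh instance
restarted.handle_packet()

print(restarted.packets_seen)     # 4
```

Because nothing lived only inside the first instance, the replacement continues counting from 3 rather than from 0.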

One of our engineers likes to demonstrate this at conferences: on our production development network he drops to the bash shell and runs kill -9 on the spanning tree PID. Without missing a beat, by the time he can show the process listing again, the process has restarted and carried on. No flapping, no listening/learning/blocking/forwarding, no TCNs, just continued service. Because all the processes run in user space and are separate from one another, we can patch them individually if needed. We actively ship patches to customers when we identify critical bugs, and in almost every case these patches can be applied without restarting the switch. Because device drivers are in user space, they can also be patched and upgraded. That does not reload the switch, but it does cause a few seconds of no forwarding while the updated firmware is pushed and the ASICs restart.

The biggest area, though, is openness and extensibility. Over the past ten years there has not been much operational innovation in networking, yet the Linux community flourished in that time, producing many capabilities that would make sense as part of the network: tcpdump, Wireshark, MRTG, Puppet, Chef, Nagios, Cobbler, and so on. With the opening up of EOS, the forthcoming EOS API, and our innovations in CloudVision, a scalable XMPP framework for managing hundreds and thousands of switches from a single CLI, we are creating a strong foundation for open, automated, programmatic control over the network through simple, scalable interfaces that both humans and machines can read.

We look forward to working with you.

 
