Wednesday, March 21, 2018

Source of Truth

"Imagine walking in the park with your wife and suddenly seeing your ex. Your wife talks automation; the ex agrees. Your wife says intent; the ex does the same. Your wife talks containers... and now they are best friends forever."

Since Cisco and Google announced a partnership last year to deliver a hybrid cloud solution, I have been following what my ex is doing in the software space. During my time at Cisco it was a hardware-first company, or a "software solutions must run on our own hardware"-first company, so it was interesting to hear the recent announcement of the Kubernetes-based Cisco Container Platform. It is also great to see new material from Cisco DevNet helping Network Engineers build software and automation skills, like this awesome Network Programmability Basics video course.

One blog post by Hank Preston about "Network as Code" caught my attention. He laid out three principles of Network as Code: 
  • Store Network Configurations in Source Control
  • Source Control is the Single Source of Truth
  • Deploy Configurations with Programmatic APIs
and I would like to expand on this Source of Truth in the context of network device config generation.

A Source of Truth is the authoritative data source for a piece of information (it is often contrasted with a Source of Record, but let's not go into that discussion). In a network config generation pipeline, the Source of Truth is the place we look for the information needed to generate the config. And I agree with Hank: even though many organizations today use the current running device configuration in the production network as the Source of Truth for network configuration, this is NOT the way to build a reliable system.

One important idea in Site Reliability Engineering is that in order to have a reliable system, you need to make it out of interchangeable and replaceable parts that can fail at any time. We need to treat network devices as cattle, not pets: we look at the network infrastructure as a fleet, where any device can fail and be re-spawned automatically, returning to its state before the failure. If the current running device configuration in the production network is the Source of Truth and a device fails, we cannot use it as the source of information to generate the configuration for the replacement device. You can certainly take a backup of the configuration and keep it offline somewhere, but if the active network device fails before the configuration can be backed up, will you use the previous backup as the Source of Truth?

Now, we can use the configuration captured from the current running production network as the Source of Truth IF, and only if, every subsequent change to a network device is made first in that offline configuration. So let's say you have a production network, and you capture all the config from the active devices to start creating the Source of Truth. You keep those device configurations in a repository with version control enabled (an example is taken from this blog post):

If you want to change the configuration in the network, you have to follow the change process (if you have one) for the configuration you put in the repository: create a branch, make the change, and ask for peer review before your branch is merged back into master.

But it is not always practical to use vendor-specific (sometimes even platform-specific) device configuration as the Source of Truth. Say your production network runs on one device model from a certain vendor. For some reason, during a failure or otherwise, you want to auto-generate the same config for a different device model, or even for a new device from a different vendor. Or perhaps you run a virtualized environment and want to scale your network devices horizontally, for example by spinning up a new virtual router to handle more load, where the new router shares most of its configuration with the current one except for unique values such as hostname, IP address, and so on.

Network device configuration has two components: configuration syntax, which is specific to a vendor or platform, and data variables, which are consistent regardless of the syntax. Data variables can be shared by all devices (e.g. SNMP configuration, NTP servers) or unique to each device (e.g. hostname, IP address). If we use Ansible as the automation platform, for example, we need several kinds of information as data sources to generate configuration: nodes, data variables, and Jinja templates.

The inventory file (an INI file) lists the nodes where we want to perform the change. It can be as simple as a list of IP addresses or hostnames of network devices. Data variables can be assigned to a group of devices if they are generic, like the NTP server configuration, or to a specific node if they are unique, such as a loopback IP address. Those variables can be stored in the same INI file or in a set of group variable files. Jinja2 templates provide the configuration syntax per device vendor, stored in a separate file for each vendor.
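As a minimal sketch (the hostnames, addresses, and group name below are made up for illustration), such an inventory could combine group-wide and host-unique variables in one INI file:

```ini
# inventory.ini -- nodes grouped by role; host-unique variables inline,
# group-wide variables under [routers:vars]
[routers]
r1  ansible_host=192.0.2.11  loopback_ip=10.0.0.1
r2  ansible_host=192.0.2.12  loopback_ip=10.0.0.2

[routers:vars]
ntp_server=10.0.0.123
```

In practice you would likely move the variables into `group_vars/` and `host_vars/` files as the inventory grows, but the idea is the same: generic data at the group level, unique data per host.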

hostname {{ system.hostname }}
interface loopback 0
 description Management Interface
 ip address {{ system.ipaddr }} {{ system.netmask }}

The Ansible playbook then uses the template module with those Jinja template files as the source, rendering them to generate the device configurations in a selected destination folder. The configuration files in the destination folder are created automatically by inserting the proper data variables into the respective Jinja templates.
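A minimal sketch of such a playbook might look like the following (the `vendor` variable and the file paths are illustrative assumptions, not taken from a real repository):

```yaml
---
- name: Generate device configurations from Jinja2 templates
  hosts: routers
  gather_facts: no
  tasks:
    - name: Render vendor-specific config into the destination folder
      template:
        src: "templates/{{ vendor }}.j2"
        dest: "configs/{{ inventory_hostname }}.conf"
      delegate_to: localhost
```

Running this against the inventory produces one rendered config file per node, with each host's unique variables filled into the vendor's template.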

As you can see, all of the configuration artifacts in Ansible, such as the inventory file, group variable files, and even the Jinja template files, can be kept in a repository under a version control system. If you want to modify the configuration of a device in production, you have to update those files (and follow the change process), generate the new config, and then push the new config to the production device (you may have to push it to a staging device first, depending on your release process). Hence, those files are the Source of Truth in this example.

But what if you want to grow bigger than that example? What if you have more data that is needed to generate the network configuration? And what if you want to store the data in different locations beyond some simple files?

Below is my attempt to draw the system for network config generation pipeline to answer those requirements:

I put a human icon at the far left of the drawing to make an argument: we humans are still the ultimate Source of Truth. When a network architect or engineer designs a network, he or she already has an "intent" of how the final design will look, and has already thought about the intended state of the network when it runs. However, we need the designer to describe the network to be built in a data format and structure that a computer can understand. This means even a detailed document such as a Low Level Design document is no longer sufficient.

The data required to generate the network config is distributed across different locations and software systems, for example:

1. Inventory Database
It has the list of all hardware (and software) in the organization, whether operational or not. The inventory could be maintained by operations engineers, or even by a procurement team focused on ensuring the hardware/software still has a valid support contract from the vendor, for example.

2. Design Rules
This is usually the main content of a Low Level Design document: from the physical design (how ports are allocated, e.g. the first port of router 1 is always connected to router 2 in a pair) to the logical design (e.g. how VLANs are assigned) and traffic policy (e.g. BGP peers and any traffic manipulation for each peer), and so on.

3. IP Database
It is common for a large organization to use a dedicated IP address management tool. The tool makes IP allocation planning and auditing easier, ensuring there are no mistakes such as duplication. The same tool may be used to manage VLAN assignments, VRFs, or to track DHCP pool allocation.

4. Site Information
Information about physical location, site naming, cabling layout, MDF and IDF locations, rack configuration, and so on is stored in drawing format, or in another format that can be understood by those who need to work on or maintain the physical facilities. It may even contain information about the environment, such as power and cooling.

5. Capacity Planning
Any design has a scaling factor (e.g. a pair of aggregation switches can handle up to 20 access switches; more than that means a new pair of aggregation switches is required). Capacity planning is also required to forecast future demand based on organic growth, for example a calculation based on the pattern of traffic utilization growth over time.

Again, all of the data above can be kept in repositories with version control, so they are the Source of Truth (or System of Record, for some people). Our automation tool can then access them through APIs to get the data needed to generate the network configuration.

But what if the configuration generation tool is not the only tool that requires this information? What if we have other tools, such as Build Planning or Network Analytics tools, that are needed for a successful config change to the production network and that need information from the data sources listed above? Such a tool could certainly consume information from each data source directly, but as we add more data sources and more consumers we introduce a many-to-many relationship, and any small change in one component may impact many relationships. We need a single Source of Truth that gives the complete view of the network information, as the only authoritative data source for all consumers. And that single Source of Truth is a model.
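To make the many-to-many point concrete, here is a small sketch of a single model that aggregates the distributed data sources so that every consumer queries one place. All class, field, and device names are illustrative assumptions, not a real system:

```python
# Sketch: a single "network model" sitting between the data sources
# (inventory DB, IPAM, site info) and the consumers (config generator,
# build planning, analytics). Consumers query the model, not the sources.

class NetworkModel:
    """Single Source of Truth aggregating per-domain data sources."""

    def __init__(self, inventory, ip_db, site_info):
        self._inventory = inventory   # e.g. fetched from inventory DB API
        self._ip_db = ip_db           # e.g. fetched from IPAM tool API
        self._site_info = site_info   # e.g. fetched from site documentation

    def device_view(self, hostname):
        """Return the complete, consolidated view of one device."""
        return {
            "hostname": hostname,
            "hardware": self._inventory[hostname],
            "loopback": self._ip_db[hostname],
            "site": self._site_info[hostname],
        }

# Any consumer now needs only one relationship: to the model.
model = NetworkModel(
    inventory={"r1": "model-X"},
    ip_db={"r1": "10.0.0.1/32"},
    site_info={"r1": "site-A"},
)
print(model.device_view("r1")["loopback"])  # 10.0.0.1/32
```

If a data source changes its API, only the model's loader has to change; the N consumers are untouched, which is exactly what breaks the many-to-many coupling.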

A model is a representation of the actual thing. The picture above shows the model of the Internet. For network automation system, we need several models:

1. Topology Model 
describes the structure and represents Layers 1 to 3 of the real network, using a graph whose edges represent abstract links connecting the nodes on which packets flow. The model can describe low-level information such as the composition of an individual node (e.g. a multi-linecard switch), up to higher-level abstractions such as tunnels and BGP sessions

2. Configuration Model
describes the configuration data structure and content, representing both configuration intent and generated configuration. The model should be generic, i.e. vendor-neutral data conforming to the OpenConfig YANG data models where possible. OpenConfig is a collection of industry-standard YANG models for configuration and management, supported natively on networking hardware and software platforms

3. Operational Model
represents the state of the network, and is used to describe the monitoring data structure and attributes. Model-Driven Telemetry is a newer approach to network monitoring in which data is streamed from network devices continuously using a push model, providing near real-time access to operational statistics
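As a toy illustration of the Topology Model above, the graph of nodes and edges can be sketched with plain data structures. The node attributes, layers, and names here are made up for illustration and do not follow any standard schema:

```python
# Sketch: a vendor-neutral topology model as a graph. The same edge
# structure can carry a physical link or a higher-level abstraction
# such as a BGP session, as described in the Topology Model.

nodes = {
    "r1": {"role": "edge", "linecards": 2},   # node composition detail
    "r2": {"role": "core", "linecards": 4},
}

edges = [
    {"a": "r1", "z": "r2", "layer": "physical"},
    {"a": "r1", "z": "r2", "layer": "bgp-session"},  # higher abstraction
]

def neighbors(node):
    """All nodes adjacent to `node` at any layer of the model."""
    return sorted({e["z"] if e["a"] == node else e["a"]
                   for e in edges if node in (e["a"], e["z"])})

print(neighbors("r1"))  # ['r2']
```

A real system would of course back this with a proper graph store and schema, but consumers asking "what is adjacent to r1, and at which layer?" is the essence of what the model provides.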

Some may argue that we could have a single model for all of the above (and thus truly have a single Source of Truth). The decision is really up to the designer of the model; for example, combining configuration information into the Topology Model risks bloating the model, consequently making its curation and change control even harder. And although the Operational Model seems to serve a specific purpose, all three may be inter-related: for example, the operational state of the network may become the input that updates the Topology and Configuration Models.

If we go back to the network config generation pipeline, the configuration tool should derive information from the model (and from additional policy and template representations) to auto-generate the configuration to be pushed to the production network. The config generation tool should have both unit tests and integration tests to ensure the new configuration can be integrated successfully. There should also be a closed-loop mechanism to provide feedback if the new configuration pushed to production does not bring the network to its intended state. But let's keep the more detailed discussion of how the generated config gets pushed to the device, and how the closed-loop feedback mechanism works, for some other time.

Sounds too good to be true? The system is too hard to develop? Just another smoke and mirrors? Well, some large organizations in the world have built it and operate such a system every day, due to the scale of the networking they have to deal with (and I'm only discussing it at a very high level here). Your organization may not have similar requirements at that scale, but at minimum any organization should try to reach Level 2 as described in my Autonomous Network post, using an available tool like Ansible.

If you have read this far and found some of this post difficult to understand, or feel there are gaps and would like more practical examples, I highly recommend reading the new Network Programmability and Automation book. In fact, I highly recommend that any Network Engineer read this book to learn the skills required to become a next-generation Network Engineer.

And if you are someone who wakes up every morning thinking about all the details required to build a real vendor-agnostic, model-driven network automation platform, with a closed loop from streaming telemetry, the ability to roll back or improve automatically based on feedback, and running in the Cloud, please let me know.

It looks like we share the same Source of Truth.

Friday, February 16, 2018

Network Engineer Certification in 2018

Last week I was in Mountain View, in a room full of senior Network Engineers, and we were talking about the skills that need to be developed by more junior Network Engineers. Suddenly someone shouted from the back "CCIE!" and the whole room started laughing.

So CCIE is a laughing stock now?

No need to get offended. You have to understand the context here:
This group of people has been working for the best company in the world. They have been working on the most advanced network infrastructure. The company's undersea cables connect all continents and deliver 25% of worldwide Internet traffic.

These people didn't develop their skills through certification. They developed them by building the real stuff. When these Network Engineers realized the network capacity in the company's data centers had grown so fast that conventional routers and switches couldn't keep up with the requirements of its distributed systems, they decided to build their own instead. These Network Engineers built and operated software-defined networking before the world invented the terminology. They've been automating network operations in the Data Center, WAN, Internet Peering, all the way to Wifi and Enterprise networking, supporting the company's 7 products with more than a billion users each.

But think about my situation 18 years ago when I started. I was jobless. I did not graduate from Computer Science. There was no clear guideline on how to become a Network Engineer, and no opportunity to develop my skills. Pursuing certification, from CCNA to CCIE, was the most logical and best choice at that time.

Having said that, it's 2018. And if any of you think your current situation is similar to mine 18 years ago, and that makes you want to repeat my experience with certification today, you should think again.

Remember the most important principle here: use certification as a means to learn the knowledge. A certification program is good because it puts structure into your learning path, and a certification exam is usually a good way to measure your progress. So if you believe your certificate alone will get you a job, that's up to you. If you still like to read "top paying" or "hottest IT certification" articles, be my guest. But I can tell you straight away that no certification will put you in that room in Mountain View.

However, if you agree with my point about using certification as a guideline for study, here are the top 10 things I think every Network Engineer should pursue in 2018:

(Please note I'm only including certifications that I have personally taken and hold, to walk the talk)

1. Treat Network as Cattle, not Pet

This comes from one important idea in Google's Site Reliability Engineering: in order to have a reliable system, you need to make it out of interchangeable and replaceable parts that can fail at any time. Bikash Koley, CTO at Juniper Networks, reviews the challenges of networking within large-scale infrastructure, including the shift needed from treating network devices like pets toward fleet management.

This first point is not about certification. It's about mindset.

2. Vendor-Agnostic Networking Skills

Just as shown in the example of a Google Network Engineer job ad that I posted several months ago, network engineering is here to stay. We still need people with in-depth networking knowledge. You still need to know IGP, BGP, and traffic engineering in detail. That knowledge is owned by the Network Engineer (NE), not the Software Engineer (SWE), Site Reliability Engineer (SRE), or Security Engineer.

And you may use certification to build networking expertise. My advice is to reach at minimum the CCNP/JNCIP level. You are welcome to continue to the Expert level, but there is a risk that your knowledge becomes too dependent on one vendor's implementation of the concepts. This also means taking only one: either CCNP or JNCIP (or an equivalent from another vendor). They all teach the same concepts; the only difference is in how each vendor implements them. And you can pursue multiple tracks, such as Routing & Switching, Data Center, Service Provider, and Security, depending on how much of an end-to-end network you want to cover.

3. Linux is the New English 

Many tools for network engineers run on Linux, so it makes sense for any Network Engineer to know how to use it. I believe you should have, at minimum, System Admin-level knowledge. If you can go deeper and learn about hypervisors, Kubernetes pods, and Linux virtual networking, even better. Application workloads running in Virtual Machines or Containers sit on top of this OS as the underlay. Today's network engineer must know how to connect them through virtual switches and virtual networks, using one of several overlay protocols.

To develop Linux skills you can use something like the RHCSA or an equivalent.
(Note: I don't want to get into the Linux vs. BSD debate here. Just look at the tools you are using as a Network Engineer, check which OS they run on, and study it)

4. Speak API not CLI

Arista Networks CEO Jayshree Ullal once said, “CLI is the way real men build real networks today.” In a large-scale network this is definitely not the way to go. Instead of connecting to network devices manually using the CLI, our management tools and software must connect to the device using an API. Understanding what the API supports can help in developing, and even troubleshooting, any issue between our software and the device.

I don't think there is any certification specifically covering APIs (and I haven't taken any that covers this). But I found the Network Programmability Basics learning program from Cisco DevNet to be really good at explaining APIs.
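To show the difference in spirit between CLI and API access, here is a tiny sketch that builds (without sending) a RESTCONF request for interface data. The device address is a documentation IP and the YANG path is an assumption based on the standard ietf-interfaces model; check your platform's API reference for the real paths it supports:

```python
# Sketch: constructing a RESTCONF request instead of typing CLI
# show commands. Nothing is sent on the wire here; we only build
# the request a management tool would issue.
from urllib.request import Request

url = ("https://192.0.2.1/restconf/data/"
       "ietf-interfaces:interfaces/interface=Loopback0")
req = Request(url, headers={
    "Accept": "application/yang-data+json",   # RESTCONF media type
})

# A tool would now send `req` with an HTTP client and parse the
# structured JSON reply, rather than screen-scraping CLI output.
print(req.get_header("Accept"))
```

The point is that the reply comes back as structured data keyed to a model, which software can consume reliably, something screen-scraped CLI output can never guarantee.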

5. Controller and Orchestrator 

The network used to be treated as a group of devices running autonomously, with distributed intelligence, each device deciding where to forward packets. If we treat the network as one fleet, the decision should be made from a central location. This central Controller or Orchestrator must know what the network looks like and its current state, and in an Intent-Based Networking System it can even translate business intent into specific instructions sent to the network devices.

In an end-to-end environment, all physical and virtual resources are managed by a Controller and Orchestrator consisting of network control, compute and storage control, and service control, with cross-domain orchestration managing all of them. The Controller and Orchestrator provide a northbound API for applications, and use various southbound APIs from the control layers down to the resources. The southbound protocol from controller to device does not have to be OpenFlow; however, if you want to learn that protocol in more detail, you can use the certification from ONS.

6. Automate or Die

Running network infrastructure as code is not a cliché anymore; it's real and necessary. When you have more devices in the network, automation is the only way to avoid human error. However, automation can bring complexity: one mistake in the CLI may bring down only one device, while one mistake in an automation platform can propagate quickly to the entire network.

My advice is to build your automation skills gradually. Start with Level 1, task-specific automation, where you write simple code that communicates with network devices over various APIs to execute a specific task. Then move up to Level 2 by using a platform like Ansible and its playbooks to execute a series of tasks that complete one workflow. Continue until you reach Level 5 automation, where you only need to define the policy between users or components in the network by providing declarative requirements, and the system executes without any human interaction: zero-human-touch networking. This is the level of an Intent-Based Networking System.

7. Cloud, more Cloud, and Multi-Cloud

According to PwC research, virtually all mid- and large-sized enterprises expect to move some workloads to the cloud in the next 1-3 years. Google has spent over $30 billion to significantly improve its Cloud infrastructure. Alibaba now offers even more features than before in an attempt to take on the might of Amazon. Oracle is making massive investments in its cloud infrastructure, adding 12 new data center locations around the world to join the cloud wars against IBM and Microsoft.

If the paragraph above does not encourage you to learn about the Cloud, it should! Enterprise IT in the future will have to connect its premises to the Cloud, in fact to multiple Cloud providers, and as a Network Engineer you must design that interconnection. You need to learn at least one Cloud provider, and you can use a certification like Google Cloud Architect or the equivalent for AWS.

8. Model Driven and Data Structure

A model is a simplified representation of a system. When we send commands to a device using a specific protocol directly over its API, this is called the Stove Pipe approach. We need an abstraction layer, or a model, in the middle of the communication between all those protocols and the network devices. Think of its function as a mechanism to “normalize” device configurations into one standard data model, then push that configuration to the devices using one standard protocol.

A company like Google has been using abstraction with a model-driven approach to provide the network topology view, configuration data structure and content, and telemetry data structure and attributes. A data structure is a particular way of organizing and storing data in a computer so that it can be accessed and modified efficiently: a collection of data values, the relationships among them, and the functions or operations that can be applied to the data.

Again, I believe the videos from DevNet's Hank Preston are the best place to start learning about this.

9. Analyze Users' Behaviors

Many Network Engineers are busy every day firefighting problems in the network. They are the kings of troubleshooting, sometimes troubleshooting problems caused by manual deployment and provisioning. When we start using automation and controllers to deploy and operate the network, Network Engineers do not go away. Instead, their work moves closer to the users: they need to understand who the users are, what they do in the network, which applications they access, how they behave, and so on. In this way the Network Engineer becomes a network analyst, collecting that information and performing analysis in order to predict future problems and prevent them before they happen, ultimately providing a better experience to the users.

I don't know if there is any certification that teaches you to do this, but recently I took Coursera's From Data to Insights with GCP (even though the analysis is not related to networking) and I found it very interesting.

10. Software Engineering Principles

Remember, a Network Engineer is not a Software Engineer. However, in order to treat the network as a fleet, using controllers and workflow automation connected to network devices via APIs, it really helps if a Network Engineer understands Software Engineering principles.

Network Engineers produce architectures and designs. Those architectures and designs should incorporate software thinking. How can software implement the architecture at hand? Which primitives do we need, and in which order, to implement and operate the design? You don't need to write all the code yourself, but it helps if you can specify it as a set of requirements to a Software Engineer.

In my opinion, any Network Engineer should at least take CS50, Harvard's Introduction to Computer Science. And you should know at least one Agile software development framework, such as Scrum. You can take this certification if you want.

The top 10 above should prepare you to become the Network Engineer of the Future. Or, as I have mentioned before, you can choose to spend more time closer to the business and become a Solutions or Enterprise Architect. An Architect must translate business requirements into technical specifications and provide integrated solutions to answer them. You may want to pursue a business-related certification (TOGAF?) or even an MBA.

And if you have the chance to develop your skills by building something real, just like those Network Engineers in Mountain View, forget the certifications altogether.
Just start building.