Wednesday, March 21, 2018

Source of Truth

"Imagine walking down the park with your wife, and suddenly seeing your ex. Wife talks automation, she agrees. Wife says intent, she does the same. Wife talks container... and now they are best friends forever."

Since Cisco and Google announced a partnership to deliver a hybrid cloud solution last year, I started following back to see what my ex is doing in software space. During my time in Cisco it used to be a hardware-first company, or a "software solution that must run in own hardware"-first company, so it is interesting to hear about the announcement of Kubernetes-based Cisco Container Platform recently. It is great to see new materials from Cisco DevNet to transform the skills for Network Engineer towards software-based and automation, just like this awesome Network Programmability Basics video course.

One blog post by Hank Preston about "Network as Code" caught my attention. He laid the three principles of Network as Code: 
  • Store Network Configurations in Source Control
  • Source Control is the Single Source of Truth
  • Deploy Configurations with Programmatic APIs
and now I would like to expand more about this Source of Truth, in the context of network device config generation.

Source of Truth is the authoritative data source for a piece of information (it is usually compared with Source of Record, but let's not go into that discussion). In a network config generation pipeline, Source of Truth is the place we look for information needed to generate the config. And I agree with Hank, even though many organizations today use the current running device configuration that is active in the production network as the Source of Truth for network configuration, this is NOT the way to go to have a reliable system.

One important idea in Site Reliability Engineering is that in order to have a reliable system, you need to make it out of interchangeable and replaceable parts that can fail at any time. We need to treat network device as cattle, not pet, where we look at network infrastructure as fleet and any network device can fail and re-spawn automatically to return to previous state before the failure occurs. If current running device configuration in production network is the source of truth, and a device fails, we cannot use it as the source of information to generate the configuration for the replacement device. You surely can take the backup of the configuration and keep it offline somewhere, but if the active network device fails before the configuration can be backed up, will you use the previous backup as Source of Truth?

Now, we can use the configuration that we capture from current running production network as Source of Truth IF, and only if, the next changes to network device are done first in that offline configuration. So let's say you have a production network, and you capture all the config from active devices to start creating Source of Truth. You keep those device configurations in repository where you can enable version control (example below is taken from this blog post):

If you want to change the configuration in the network, you have to follow the change process (if you have any) for the configuration you put in the repository such as branching, do the change, and ask for peer review before your branch is merged back into the master.

But it is not always practical to use device configuration that is vendor-specific, sometimes it is even platform-specific, as Source of Truth. Let's say your current production network is running using one device model from certain vendor. For some reason, either during failure or not, you want to auto generate the same config for different device model or even for new device from different vendor. Or perhaps you run virtualized environment and you want to do horizontal scaling to your network device, for example to spin up new virtual router to handle more load, and the new virtual router contains mostly the same configuration like the current virtual router except some unique configuration such as hostname, IP address and so on.

Network device configuration has two components: configuration syntax, that is specific to vendor or platform, and data variables, that are consistent regardless of the syntax. And data variables can be the same for all devices (e.g. SNMP configuration, NTP server etc.) or unique for every device e.g. hostname, IP address etc. If we use Ansible as the automation platform as example, we need different information as data source to generate configuration: nodes, data variables and jinja templates.

The inventory file (INI file) contains the information of nodes where we want to perform the change. It can be as simple as a list of IP address or hostname of network devices. Data variables of the configuration can be assigned to group of devices if they are generic, like NTP server configuration, or assigned to specific node if the configuration is unique such as loopback IP address. And those variables can be stored in the same INI file, or within a set of group variable files. Jinja2 templating is used to provide the configuration syntax per device vendor, that is stored in different file for each vendor.

hostname {{ system.hostname }}
interface loopback 0
 description Management Interface
 ip address {{ system.ipaddr }} {{ system.netmask }}

Ansible playbook then uses template module with those Jinja template files as the source to render the template to generate the device configuration in selected destination folder. The configuration files in destination folder are automatically created by inserting the proper data variables into the respective Jinja templates.

As you can see, all configuration artifacts in Ansible such as inventory file, group variable files, and even the Jinja template files can be kept in repository with version control system. If you want to modify the configuration of the device in production, you have to update those files (and follow the change process), generate the new config, then the new config can be pushed into production device (you may have to push the new config to staging device first, depending on your release process). Hence, those files are the Source of Truth in this example.

But what if you want to grow bigger than that example? What if you have more data that is needed to generate the network configuration? And what if you want to store the data in different locations beyond some simple files?

Below is my attempt to draw the system for network config generation pipeline to answer those requirements:

I put human icon in the most left of the drawing to put the argument: we, human, are actually still the ultimate Source of Truth. When a network architect or engineer designs a network, he or she has already an "intent" of how the final design will look like. Designer has already thought about the intended state of the network when it runs. However, we need the designer to describe the network to be built in a data format and structure that computer can understand. This means even a detailed document such as Low Level Design document is no longer sufficient.

The data required to generate network config are distributed in different location or software system, for example:

1. Inventory Database
It has the list of all hardware (and software) in the organization, whether they are operational or not. The inventory could be maintained by operation engineer or even procurement team who put focus on ensuring the hardware/software has still valid support contract from the vendor, for example

2. Design Rules
This is usually the main content of Low Level Design document: from physical design (how port is allocated e.g. first port of router 1 is always connected to router 2 in one pair) to logical (e.g. how VLAN is assigned) and traffic policy (e.g. BGP peers and any traffic manipulation for each peer) and so on

3. IP Database
It is common for large organization to use dedicated IP address management tool. The tool can make it easier to do IP allocation planning and auditing to ensure there is no mistake such as duplication. The same tool may be used to manage VLAN assignment, VRFs, or tracking DHCP pool allocation

4. Site Information
Information about physical location, site naming, cabling layout, MDF and IDF locations, rack configuration and so on are stored in drawing format, or other format that can be understood by those who need to work or maintain the physical facilities. It may even contain the information about the environment such as power and cooling

5. Capacity Planning
Any design has a scaling factor (e.g. a pair of aggregation switches can handle up to 20 access switches, more than that means new pair of aggregation switches is required). Capacity planning is also required to forecast future demand based on organic growth, for example a calculation based on the pattern of traffic utilization growth over time

Again, all the data above can be kept in repository that has version control. So they are the Source of Truth (or System of Record for some people). And our automation tool can access them through API to get the data needed to generate network configuration.

But what if configuration generation tool is not the only tool that requires the information? What if we have another tools, such as a Build Planning or Network Analytic tools that are needed for successful config change to production network, and they need to get information from any data source listed above? Surely such tool can consume the information from the data source directly, however when we have more data sources and more consumers we introduce many-to-many relationship, and any small change in any component may impact many relationship. We need a single Source of Truth that gives the complete view of the network information, as the only authoritative data source for all consumers. And that single Source of Truth is a model.

A model is a representation of the actual thing. The picture above shows the model of the Internet. For network automation system, we need several models:

1. Topology Model 
describes the structure and represent Layer 1 to Layer 3 of the real network, using a graph with edges representing abstract links to connect between different nodes on which packets flow. The model can describe low level information for individual node composition such as multi-linecard switch, and to even higher-level abstractions such as tunnels and BGP sessions

2. Configuration Model
describes configuration data structure and content, to represent configuration intent and generated configuration. The model should be generic, i.e. vendor-neutral data conforming to OpenConfig YANG data models where possible. OpenConfig is a collection of industry standard YANG models for configuration and management that will be supported natively on networking hardware and software platforms

3. Operational Model
represents the state of the network, and uses to describe monitoring data structure and attributes. Model-Driven Telemetry is a new approach for network monitoring in which data is streamed from network devices continuously using a push model and provides near real-time access to operational statistics

Some may argue that we can have a single model for all the above (and to truly have a single Source of Truth). The decision is really up to the designer of the model, for example combining configuration information to Topology model may run the risk of adding bloat to the model, and consequently making it curation and change control even harder. And even Operational Model seems to serve specific purpose, but all three may be inter-related for example the operational state of the network may become the input to update the Topology and Configuration model.

If we go back to network config generation pipeline, configuration tool should derive information from the model (and from additional policy and template representations) to auto generate the configuration to be pushed to production network. The config generation tool should have both unit test and integration test to ensure the new configuration can be integrated successfully. There should be a close loop mechanism to provide feedback if the new configuration pushed to production does not make the network achieve the intended state. But let's keep more detailed discussion about how the generated config get pushed to the device, and how the close loop system or feedback mechanism works, for some other time.

Sounds too good to be true? The system is too hard to develop? It seems to be just another smoke and mirror? Well, some large organizations in the world have built it and they operate such system everyday due to the scale of networking they have to deal with (and I'm just discussing from a very high level here). Yours may not have similar requirement to build the automation platform for that scale, but at minimum any organization should try to reach Level 2 as described in my Autonomous Network, by using available tool like Ansible.

If you have read this far and face some difficulty to understand this post, or may feel there are some gaps and would like to see more practical example, I highly recommend you to read this new Network Programmability and Automation book. In fact, I highly recommend any Network Engineer to read this book to learn the skills required to become the next-generation Network Engineer.

And if you are someone who wakes up every morning and keep thinking about all the details required to build a real vendor-agnostic model-driven network automation platform, with close loop from streaming telemetry, with ability to rollback or to improve automatically based on the feedback, and make it run in the Cloud, please let me know.

It looks like we share the same Source of Truth.

Friday, February 16, 2018

Network Engineer Certification in 2018

Last week I was in Mountain View, in a room full of senior Network Engineers, and we were talking about the skills that need to be developed by more junior Network Engineers. Suddenly someone shouted from the back "CCIE!" and the whole room started laughing.

So CCIE is a laughing stock now?

No need to get offended. You have to understand the context here:
These group of people have been working for the best company in the world. They have been working on the most advanced network infrastructure. The company's undersea cables connect all contingents, to delivers 25% of worldwide Internet traffic.

These people didn't develop their skill through certification. They developed their skills by building the real stuff. When these group of Network Engineers realized the network capacity in the company's data centers has grown so fast that conventional routers and switches can't keep up to meet the requirements of its distributed systems, they decided to build its own instead. These Network Engineers build and operate software-defined networking, before the world invented that terminology. They've been automating network operation in Data Center, WAN, Internet Peering, all the way to Wifi and Enterprise networking, to support 7 company's products with more than billion users.

But think about my situation 18 years ago when I started. I was jobless. I graduated not from Computer Science. There was not any clear guideline available on how to become a Network Engineer. There was not any opportunity to develop my skills. Pursuing certification, from CCNA to CCIE, was the most logical and the best choice at that time.

Having said that, it's 2018. And if any of you think your current situation is similar with me 18 years ago, and that makes you try to repeat my experience with certification today, you should think again.

Remember the most important principle here: use certification as a mean to learn the knowledge. Certification program is good since it puts structure to your learning path. And certification exam, is usually a good way to measure your progress. So if you believe your certificate will get you a job, it's up to you. If you still like to read "top paying" or "hottest IT certification" article, be my guest. I can tell you straight away no certification will be able to put you in that room in Mountain View.

However, if you agree with my point to use certification as guideline to study, here are the Top 10 that I think every Network Engineer should pursue in 2018:

(Please note I'm putting only the certifications that I have personally taken and possessed, to walk the talk)

1. Treat Network as Cattle, not Pet

This comes from one important idea in Google Site Reliability Engineering: that in order to have a reliable system, you need to make it out of interchangeable and replaceable parts that can fail at any time. Bikash Koley, CTO at Juniper Networks, reviews the challenges of networking within large scale infrastructure, reviewing the change needed from treating networking less like pets, and more with fleet management in mind.

This first point is not about certification. It's about mindset.

2. Vendor-Agnostic Networking Skills

Just like shown in one example of Google Network Engineer job ads that I posted several months ago, network engineering is here to stay. We still need someone with in-depth networking knowledge. You still need to know IGP and BGP and traffic engineering in details. Those knowledge are owned by Network Engineer (NE), not Software Engineer (SWE), Site Reliability Engineer (SRE) nor Security Engineer.

And you may use certification to build networking expertise. My advise is to reach at minimum CCNP/JNCIP level. You are welcome to continue to Expert level, but there is a risk for your knowledge to become too vendor-dependent for the implementation of the concept. And this also means take only one: either CCNP or JNCIP (or any equivalent from another vendor). They all teach the same concept, the only different is in the way to implement it. And you can go to multiple tracks to learn Routing & Switching, Data Center, Service Provider, Security and so on depending on how much you want to cover from an end-to-end network.

3. Linux is the New English 

Many tools for network engineer run on Linux, so it makes sense for any Network Engineer to know how to use it. I believe at minimum you should have a System Admin level knowledge. If you can go deeper and learn about hypervisor, kubernetes pods and Linux virtual networking, it is even better. Application workloads run on Virtual Machine or Containers are running on top of this OS as underlay. Today's network engineer must know how to connect them through virtual switch and virtual network, using several options of overlay protocols.

To develop Linux skill you can use something like RHCSA or equivalent.
(Note: I don't want to get into the debate of Linux vs. BSD here. Just look at the tools that you are using as Network Engineer, and check which OS they run and study it)

4. Speak API not CLI

Arista Networks CEO Jayshree Ullal once said “CLI is the way real men build real networks today.” In large-scale network this is definitely not the way to go. Instead of connecting to network device manually using CLI, our management tool or software must connect to the device using API. Understanding what is supported by the API can help to develop or even troubleshoot any issue between our software and the device.

I don't think there is any certification specifically covering API (and I haven't taken any that covers this). But I found this Network Programmability Basic learning program from Cisco DevNet is really good in explaining APIs.

5. Controller and Orchestrator 

Network used to be treated as group of devices running autonomously, with distributed intelligent, and each device is making the decision where to forward the packet. If we treat network as one fleet, the decision should be done from central location. This central Controller or Orchestrator must know how the network looks like, the current state of the network, and in Intent Based Networking System it can even translate business intent into specific instruction to be sent to network device.

In an end-to-end environment, all physical and virtual resources are managed by Controller and Orchestrator, that consists of network control, compute and storage control and service control, with cross-domain orchestration to manage all of them. This Controller and Orchestrator provide northbound API for the application, and use various southbound API from all the control layer to the resources. Southbound protocol from controller to the device does not have to be OpenFlow. However, if somehow you want to learn this protocol in more detail you can use the certification from ONS.

6. Automate or Die

Running network infrastructure as code is not something cliché anymore; it's real and necessary. When you have more devices in the network, automation is the only way to avoid human error. However, automation can bring complexity. And one mistake in CLI may bring down only one device, while one mistake in automation platform can be propagated quickly to the entire network.

My advice is to build your automation skills slowly: starting with Level 1, task specific automation, where you can write simple code to communicate to network devices using various APIs to execute certain task. Then move up to Level 2 by using platform like Ansible and its playbook to execute series of task to complete one workflow. Continue doing this until you reach Level 5 automation when you just need to define the policy between users or components in the network, by providing declarative requirements, and the system will execute without any human interaction. Zero human touch networking. This is the level for Intent-Based Networking System.

7. Cloud, more Cloud, and Multi-Cloud

According to PwC research, virtually all mid-and large-sized enterprises expect to move some workloads to the cloud in the next 1-3 years. Google spent over $30 billion in an effort to significantly improve its Cloud infrastructure. Alibaba now offers even more features than before in an attempt to take on the might of Amazon. Oracle is making massive investments in its cloud infrastructure with the addition of 12 new data center locations around the world, to join the cloud wars against IBM and Microsoft.

If the paragraph above does not encourage you to learn about Cloud, then you should! Enterprise IT in the future will have to connect their premise to Cloud, to multiple Cloud providers in fact, and as Network Engineer you must design the interconnection. At minimum you need to learn at least one Cloud provider, and you can use certification like Google Cloud Architect or equivalent for AWS.

8. Model Driven and Data Structure

A model is a simplified representation of a system. When we send the command using certain protocol to the device over API directly, this is called Stove Pipe approach. We need an abstraction layer, or a model, in the middle of the communication between all those protocols with the network devices. Think its function as mechanism to “normalize” devices configuration into one standard data model then push that configuration into devices using one standard protocol.

Company like Google has been using abstraction with model-driven approach to provide network topology view, configuration data structure and content, and telemetry data structure and attributes. A data structure is a particular way of organizing and storing data in a computer so that it can be accessed and modified efficiently. It is a collection of data values, the relationships among them, and the functions or operations that can be applied to the data.

Again, I believe the videos from Devnet's Hank Preston is the best place to start learning about this.

9. Analyze Users' Behaviors

Many Network Engineers are busy everyday firefighting the problems in the network. They are the King of troubleshooting. Sometimes they troubleshoot problems that happen due to manual deployment and provisioning in the network. When we start using automation and controller to do deployment and operation of the network, Network Engineers are not going away. Now they need to do work that is closer to the users. They need to understand more who the users are, what they do in the network, what application they are accessing, how the users behave, and so on. In such a away Network Engineer needs to move to become network analyst, to collect those information and perform the analysis in order to predict any problem in the future and prevent it before it happens. Network Engineer will then provide better user experience to the users.

I don't know if there is any certification to teach you to do this, but recently I took Coursera's From Data to Insights with GCP (even the analysis is not related to networking) and I found it very interesting.

10. Software Engineering Principles

Remember, Network Engineer is not a Software Engineer. However, in order to treat a network as a fleet, using controller and workflow automation, that connects to network device using APIs, it will be really helpful if any Network Engineer understands Software Engineering principles.

Network Engineers produce architectures and designs. Those architecture and designs should incorporate software thinking. How can software implement the architecture at hand? Which primitives do we need, and in which order, to implement and operate the design? You don't need to write code it all yourself; But it helps if you can specify it as a set of requirements to a Software Engineer.

In my opinion, any Network Engineer should at least take CS50 class: Introduction to Computer Science from Harvard. And you should know at least Agile software development framework such as Scrum. You can take this certification if you want.

The top 10 above should prepare you to become the Network Engineer of the Future. Or, as I mentioned it before, you also have a choice to spend more time closer to the business and start becoming Solutions or Enterprise Architect. Architect must translate business requirements into technical specifications, and provide integrated solutions to answer the requirements. You may want to pursue business-related certification (Togaf?) or even an MBA.

And if somehow you have a better chance to develop your skill by building something real, just like those Network Engineers in Mountain View, forget the certifications all together.
Just start building.

Saturday, September 16, 2017

Network Engineer Jobs

So you want to work for Google as Network Engineer? Check out one of the job ads here. I pasted the screenshot below just in case the ad is removed once the position is no longer available.

"You'll build software for distributed services, abstractions and the components of the system that operates and powers Google." OK, even this is not common in Network Engineer job description, it makes sense since Google is running one of the world's largest networks to connect its data centers that are scattered all around world. As minimum requirement, you must have experience in software development in one or more modern programming languages e.g. C++, Java, Python, Go, etc. And learn how to code using "Teach yourself Python in 24 hours" won't be enough since it is expected for you to have experience in data structures, complexity analysis and software design.

Is Google really looking for Network Engineer (NE), and not Software Engineer (SWE)? Yup, you still need to have expertise in networking protocols and technologies, including end-to-end packet flow, forwarding and routing. Google knows that a world class distributed computing infrastructure must run on world class networking infrastructure that is operated reliably and at scale. When network capacity in the company's data centers has grown so fast that conventional routers and switches can't keep up, Google could not buy, for any price, a data-center network that would meet the requirements of its distributed systems. So the engineers decided to build its own instead.

And someone like me who relies on 3 CCIEs and CCDE only won't be qualified to apply. Before you ask if it is still needed to pursue certification or not, let me say it again: you still need in-depth networking knowledge. You still need to know OSPF, ISIS, and BGP in details. And you may use that kind of certification to build the expertise. But don't turn into certification junkies just like a younger me once! Especially if your target is to only pass the exam, it won't get you to Google for sure. Once you understand network engineering, learn software engineering and how to design, analyze and troubleshoot large-scale distributed systems. In this company, and similar companies that build and maintain large-scale networking like Facebook, Amazon and others, Network Engineer is expected to write software and tools that interact with networking systems, to support Software Defined Networking, zero touch networking, to automate network operations, and to develop advanced monitoring systems.
Google is definitely looking for someone who is at minimum already in Phase 5 as per my Network Engineer evolution, that will progress to Phase 6 someday. And during my past 440 days in Google, I'm so lucky to be surrounded by these guys.

Wednesday, August 09, 2017

Building Intent Based Networking System

I've been unhappy with my creation-to-consumption ratio lately, which is the amount of time spent creating compared to amount of time spent consuming. Yes I spend time creating design documents, business proposals, system architecture, slides for both technical and non-technical content, product requirement documents, blog posts, and occasionally write simple codes, but much of my free time is spent consuming for Netflix, newspapers, Twitter, televised sports, Facebook, blogs, Medium, TV series, online courses and others.

You may say we need consumption as an input prior to creating. And I agree, consuming is fine if it is part of learning or research in order to create something. But creation must come first. So if I commit to create something, let's say a system design or even this blog post, I must start by starting the work first and whenever I feel some information is needed to add or validate the work only then I will consume new inputs to mix with the old ones and fuel creativity.

Tonight I'm sitting in front of my macbook, in an attempt to increase my creation-to-consumption ratio, by writing about building Intent-Based Networking System (IBNS). Let's start with problem definition.

The end customer is a Small-to-Medium Business (SMB) owner who wishes to expand his business to multiple offices. Some of the owner's requirements are below:

"I will have three different size of offices: small for max 20 employees, medium for max 50 employees, and large or main branch with max 200 employees."
"In every office all employees who work in back office must use company-provided wired PC, while employees who work in sales may use company-provided wireless laptop or their personal computing device."
"Anyone can use company's internal collaboration application to chat anytime, however the use of video conference application must be using scheduling system."
"Those who work in sales can access our customer data using web portal, however only those who work in back office can access the database to update the entry."

As you can see, all the above are described using high-level human language, driven by business requirements, policy based, and focus to the applications. These are business intents, and usually in normal network operation we will need an architect or engineer to translate such requirements into technical specification all the way until the method to do the implementation. In the near future, this problem will be solved with no human interaction using Intent-Based Networking.

According to Andrew Lerner from Gartner, Intent-Based Networking is a piece of networking software that helps to plan, design and implement/operate networks that can improve network availability and agility. IBNS incorporates four key things:
  • Translation and Validation– Converts higher-level business policy (what) as input from end users and converts it to the necessary network configuration (how) 
  • Automated Implementation – Uses network automation and/or network orchestration to configure the appropriate network changes (how) across existing network infrastructure 
  • Awareness of Network State – Ingests real-time network status for systems under its administrative control, and is protocol- and transport-agnostic 
  • Assurance and Dynamic Optimization/Remediation– Continuously validates in real time that the original business intent is being met, and can take corrective actions when it is not met 
So company executives or managers define a high-level business policy they want to be enforced in the network. The IBNS verifies that the policy can be executed, then manipulates network resources to create the desired state and enforce policies using fully automated operations. IBNS is gathering data to constantly monitor the state of the network, to ensure the desired state of the network is maintained, and can take automated corrective action to maintain state.

Wait. Is this Intent-Based Networking just another name for network automation and orchestration?

Based on comparison chart created by Big Switch above, Intent defines what is the goal (declarative) while automation or orchestration provides explicit method of how to do the implementation (imperative). This creates layer of abstraction since the input now is more high level to describe business needs, and not implementation specific. And since the Intent is declarative, it requires validation of the Intent to ensure it can be translated by the system into series of tasks that need to be done. Automation and orchestration do not deal with telemetry or monitoring of network state. Real-time network state monitoring is one key element since the system needs to validate in real time that the original business intent is being met. When it is not met, the system can take dynamic actions to correct it.

Where is SDN here? SDN is an architecture for networks with original idea to separate the control plane in SDN controller, from data plane in networking device. This separation makes abstraction between application developer and network device: anyone who wants to create new application related to networking now does not have to understand how specific network device works, instead she can make her application to communicate to SDN controller using northbound API. And if there is any changes required to be done on the device, the controller will do so using southbound API. IBNS can work in both SDN-based and non-SDN based network infrastructure.

A while ago I built the five levels of Autonomous Network, mimicking the levels in Autonomous Vehicle:

Level 0 means no automation at all, and engineer configures network device manually using CLI. As pointed out by Gartner, even if in 2016 there are still 85% operations teams using CLI as their primary interface, the number will go down to 30% in 2020. So some form of automation will surely happen. In Level 1, which is the task specific automation, engineer can write code to communicate to network devices using various APIs to execute certain task e.g. reconfigure BGP configuration, get network state information from the device, redirect the traffic by shutting down certain interfaces or doing routing protocol manipulation, and so on.

Level 2 is when we want to execute series of task to complete one workflow, let's say to deploy the configuration to all network devices for one new medium-size office based on the requirements above. If we assume all required hardware are already in place, the system will start by pulling out the definition of medium-size office: how many devices, what are the types, what kind of configuration needs to be done, and perhaps the current device configuration so the system can make decision if the new config will only be appended to the current one, or will add/replace the full config completely. Then a sequence of tasks following a playbook or recipe, using either an open-source or purposely build software, will be executed to automate the end-to-end workflow to complete the deployment.

Orchestration is needed in Level 3 when there are more components involved in the process. To address some of the business requirements in this exercise will need to orchestrate the work between different controllers or managers for networks (both physical and virtual), servers, storage, security policy as well as management system. For example, we need to make sure the new office will be added into the inventory database. The device configuration and security policy like user segmentation will be enforced by network controller. Then perhaps new virtual machines need to be started on physical servers to host the applications or virtualized network functions. All information, device config, network state until application data must be stored in the storage somewhere, that needs to be orchestrated too. 

Thing gets more interesting from Level 4. If we have already used SDN-based network, we can just use the northbound API to inform the controller what we want it to change from the network. But with non SDN-based infrastructure, it means we still can connect directly to each network device to push the configuration changes. Even if we use automation platform or orchestrator to connects to the network device using API like Netconf and REST, and no longer through manual CLI or SNMP, this approach is called stove pipe as explained in diagram from Tail-F above. The main disadvantage of this approach other than scalability issue, is the communication between the platform to each device (or Managed Object as per Tail-F) is implementation specific depending on the device vendor. This means if we change the device vendor we may have to change the method of implementation from our platform to the new vendor's devices.

Introducing Model Driven networking. Instead of the platform to connect directly to each network device, we can build a model to represent the device and the device config, so now any automation and orchestration happens on the model first. Model based networking provides abstraction since even the network devices are changed to different vendor, the model will remain the same. Another benefit is: any planned changes to the network can be simulated and validated in the model first, and only gets implemented in the real device when the changes are considered safe.

And finally, Level 5 is the target for network intelligence for any infrastructure. We specify business intent, we define the policy between users or components in the network, we provide declarative requirements, and the system will execute without any human interaction. Zero human touch networking. This is the level for Intent-Based Networking System.

Now we know the definition and characteristic of IBNS, how to build one?

I'm using bottom up approach here even though top down will work too and I would argue it is a better approach:

First, start by building the infrastructure. No kidding, we still need the network. Some vendors may call it Network Fabric, and it may consist both physical and virtual networks. We still need physical cables to connect between physical network devices or at least the servers running network functions. In later case we can use overlay protocols to connect between different network functions.

Second, automate and orchestrate the infrastructure. As mentioned earlier, if it's SDN-based infrastructure we will have a controller to deal with network control plane then push the desired state to the device. And in non-SDN infrastructure we still can use a controller to automate any configuration changes to network device (or the model of network device). We need to use an orchestrator to combine this controller with another physical or virtual infrastructure managers that manages the servers, storages and virtual machines.

Third, build the telemetry and monitoring system. At minimum we need a mechanism to measure the state of the network from the status of the service, topology view, both previous and latest configurations, logs for any state change, and error checking mechanism for any failed changes.

Forth, create a mean to translate business intent. This could be in the simple web portal or mobile app that provides service catalogue offering packages for user to select from, with some degree of possible customization. Some day this may turn into a form of virtual assistant that will listen to our voice command, and will translate the captured information into a series of workflows need to be executed by the system.

Obviously the building exercise I write here is a very simple one that should work in principle, event though the devil is always in the details.

Let's see how Google does it, as taken from their public presentation.

Google has been using abstraction with model-driven approach to provide network topology view, configuration data structure and content, and telemetry data structure and attributes. Imagine a vendor-agnostic network topology. The information we need from such topology is a representation of all network devices as nodes, and link to connect the nodes to each other. It can have both node and link attributes such as node identification, port information (e.g. Node A's first port is connected to Node B, second port to Node C and so on) and link speed. We can also have the information to map the node to the current actual network device, for example Node A is currently representing Cisco hardware type X with specific hardware and port configuration, that obviously can be changed when needed. Such information is required for the system to know how to map the model to the actual network device e.g. Node A's first port means interface Gi0/1 on Cisco type X.

Configuration and monitoring information must be described in vendor-agnostic way so it is not bound to specific configuration line or monitoring attribute from the vendor. Any network device configuration will be described as models of interfaces, routing protocols, routing table, routing policy, ACL etc. Each configuration model such as BGP protocol later can be mapped into specific implementation for different vendor's device configuration. And the state of the network like routing table information can be retrieved from the device and populated into the model as well.

Google implements telemetry system using publish–subscribe messaging pattern where senders of messages (publishers) categorize published messages into classes without knowledge of specific receivers (subscribers), and subscribers express interest in one or more classes and only receive messages that are of interest. Using gRPC protocol, it is possible to have a continuous time-series data stream from the device with incremental updates. And the device can provide asynchronous event-driven reporting that does not require to get any response from the servers/collectors (think about device logging or SNMP trap). Obviously it is possible for the collectors to run ad-hoc request to collect data from the devices, that could be a synchronous RPC call.

Once we have all the components in place, what we need now is to connect all pieces together to get the system up and running. The users, or operators, of the system use application to describe configuration intent. An example of this is a web portal to provide the operators to select the option of one use case: "drain the traffic from link X" let's say because we want to do maintenance of the link or migrate it to another link. The instruction is sent using declarative API to both configuration and topology model. Once the requirement is translated into changes to the model, the system will analyze the current configuration to understand what changes are needed to generate required configuration instance. This configuration will be mapped into specific vendor configuration line that will be pushed to the device using different option of southbound protocols depending on which device. Telemetry data is used to monitor configuration changes and the system will provide feedback to the operators when the intent has been implemented successfully.

As closing remark, Intent-Based Networking is considered as "one of the most significant breakthroughs in enterprise networking". Cisco CEO claims Intent Based Networking will redefine the network for the next 30 years. Gartner made prediction that 10% of enterprises will use intent-driven network design and operation tools, reducing their network outages by 65%, by 2020.

And after reading this post, I believe you will agree with such statements because it makes sense. It makes sense not because of the amount of technologies involved in the system, but because the system can provide the answers to the requirements from the business. And that is just what needed from any innovation in this space: to solve real business problem.

Disclaimer: This post represents my own personal view. All the sources of information are available online and accessible to public. No confidential information belong to my current employer is disclosed in this post

Saturday, July 15, 2017

Network Engineer Evolution

About two years ago I made a learning roadmap for network engineers who want to transform their skills towards Software Defined Networking. I presented it at various events including Cisco Live. It was good, but it looks like I didn't provide the full story. So let's discuss it again, and we will start from the very beginning.

Any network engineer who just starts his or her career today will begin in Phase 1: as the User of networking products where the engineer only knows how to configure the product, hopefully by reading the documentation from the vendor's website first. This type of engineer is what I call "Config Monkey" (sorry, monkey!). If you think you are still in this phase, please don't get offended: I started my career here too. There is no innovation at all, only follow the manual to make the products run.

Then we will move to Phase 2: as Advanced User of networking products. This is the phase where the engineer understands how networking protocols work in detail. He is a domain expert now and can start fine tuning the protocols to optimize the infrastructure. IGP timers, fast re-route, BGP attributes etc. and the engineer should go back and forth between the protocols standard and how they get implemented in vendor's products. So all the fine tunes are based on the 'knobs' provided by the vendor. And by nature, phase 2 network engineer possesses the skill to do troubleshooting as well.

Phase 3 is when the engineer starting to become System Integrator. Even it's similar like Advanced User, but now the engineer must deal with different network functions from wireless access and top of rack switches to security devices, firewalls, domain names, caching, network-based storage, content delivery, application load balancer and so on to provide end-to-end services to end users. He is aware about design trade-offs of various choices. There is still no innovation yet, however by now the engineer has possessed the skill to design, integrate and fine tune complex system all the way to application layer.

SDN and network virtualization comes to the picture in Phase 4: Advanced System Integrator. The system now consists both physical and virtual components. Overlay network runs above the underlying physical infrastructure. Virtual infrastructure has multiple controllers and managers that need to be integrated between each other. Network services have life cycle from initiation until depletion so it must be monitored. Phase 4 engineers talk about APIs when integrating different components. Both physical and virtual networks must run in harmony to provide end-to-end connectivity for the users to access the applications and services.

Once the engineer passes Phase 4, this is the point where he can decide to take either one of different technical paths: first, is to move towards the business and start becoming Solutions Architect. Architect must translate business requirements into technical specifications, and provide integrated solutions to answer the requirements. We can live happily ever after here. I know it because I used to be in this phase for many years when I work for Cisco.

The second path is like what Morpheus described as the red pill: stay in Wonderland and learn how deep the rabbit hole goes. We can choose to stay as engineer to go even deeper in the next two phases.

Phase 5: Contributor. Phase 4 engineer assumes all components will just work when they are integrated, just like playing Lego. Yes, she still needs to understand how one component consumes the API of the other component. But in reality the integration is not, or never, that straight forward. Engineer moves to phase 5 once she starts developing few components to make the system works smoothly. It could be as simple as making automation script using SDK provided by the product's vendor. Or create new driver for an open source platform to connect to specific network device. Or customize the current module of one software to make the system runs. Engineer writes code, understands software development workflow, and fills up the missing ingredient to build one solid system.

Phase 6 is the phase served for Creator. This is the God-mode in Network Engineering. Engineer can look at the current network protocol and decide to invent the new and (hopefully) the better one. When building a complex system with multiple products from different vendors, engineer can assess if it is required to build new software component to have a successful integration. Phase 6 engineer thinks about scalability all the time, centralized vs. distributed model, and about the workflow from beginning of user's request until the service is provided. She generates ideas require to solve complex and open-ended problems. She thinks agile and runs iteration to optimize the system. Engineer in this phase is the one who translates business intent into automated workflow execution to deliver the service.

So let's look at my T-shape SDN Skill Transformation path and try to relate it to the 6-phase of Network Engineer Evolution above.

Obviously you need to be at least a Phase 3 engineer before looking at this path. To start the journey in Phase 4, you need to learn virtualized infrastructure for network, compute and storage as well as the managers and controllers to manage them. Learn abstraction and modeling. Then learn software architecture and engineering. Get involved in software development. Start writing code or optimize existing code to move to become Contributor in Phase 5 and beyond. Or if you decide to take Solution Architect path instead, switch the mindset and learn business skills.

I won't mention any vendor's product, or any vendor's certification, anymore in the evolution. Understand the expectation of what engineer in each phase has to deliver, skills that must be possessed, then make your own judgement to decide which vendor or which certification program (if needed) you want to use in your learning process.

And please don't be mistaken to think I self-proclaimed myself as a Phase 6 engineer. I made decision a while ago to become Product Manager instead, that provides me opportunities to work with many creators to build the next cool things in networking.

Final thought: it's okay to start as monkey once. But knowing where we are, and where we want to be, can surely help to plan how to get evolved.

Tuesday, July 04, 2017

One Year Ago Today

One year ago today, fourth of July, was my first day at Google Zürich. It’s been a very interesting journey so far, and from the beginning I spent most of my time to focus on three things: switch to Product Management to learn how to build great product, work on scalable Enterprise networking solution from cloud-based SDN to intent-driven automation, and learn data analysis in-depth from data visualization all the way to Machine Learning, to be used in product development.

As you notice, I rarely post new blog since I joined the company last year. And I find it quite difficult to find any active blog from other Googlers too. Just like any tech company, when we joined all of us signed an agreement containing various obligations including the requirement to hold proprietary information and trade secrets in strictest confidence. But I believe there should be some non-confidential things that we can share in our personal blog.

So why can’t we blog?

First, we are very busy here. And not because we have to, but we choose to.

I mean, there are just too many interesting things to do and to learn at Google. If you work for the best company in the world that empower every employee to innovate, in everything we do, you surely want to spend time the best you can. We write a lot, like writing product requirement document, design specification, or execution plan, but then we will be busy building the product and getting things done.

Second, most of us feel that what we do is not new.

There are so many talented people in Google with great ideas and executing them everyday. So unless we innovate something completely new, or improve something to make it 10x better, most Googlers think what we do is not new, it’s common, and we assume everyone must has known this already so it’s not worth sharing. That could be true for within Google but some of the ways we do things here (again, the non-confidential things) could be very useful for people outside.

Third, we are trusted with so many confidential information, we don’t want to unintentionally share them.

Google culture is very open. Every Noogler, new hire, usually get access to Google codebase within the first week in the company. Employees share their salary and bonus in Google sheet. There is a weekly company-wide all-hands meeting called TGIF where top management to various teams present about a product Google has been working on, and then take any questions from the audience. Any questions from old timer to new hire and even intern. And we are all trusted not to leak the information to outside the company.

This has created the culture of trust, that make us believe we are truly part of the family. And as family member, you don’t want to break the trust by sharing confidential information even unintentionally to outside the family.

(Read here about the impression of company culture from an intern)

Having said those, I will still try to continue blogging here.
Watch this space.

Thursday, January 19, 2017

2016 Year in Review

Every beginning of the year I usually review what I have done the past one year, make notes, and build the plan for the upcoming year. I made many mistakes in the past, did things I’m not proud of, however I use them as opportunity to learn and try to be better next time.

Early 2016 I found that my startup company was competing directly against Cisco (that was still my employer at that time). That was quite surprising. I founded that company in 2012 initially as my pet project, the lab for my MBA, where I can practice whatever I learned from the business school. My pitch for the startup was simple: we do what Cisco (or Cisco Services) will not do. We built online learning platform to learn Cisco certification using group mentoring system. We run physical network audit. We did system integration projects to interoperate Cisco products with any other vendors.

However, since late 2014 the engineering team in my company have evolved. They grew skills in network programming. The team put more focus on Software Defined Networking (SDN). They built lab to validate Network Function Virtualization (NFV). And then the team started to develop our own SDN Controller and Network Automation platform.

Then customers started to come. Customers wanted SDN solution, NFV infrastructure and network automation, but the ones that are vendor-agnostic. They came to my company. They asked the team to bid in the project. That’s when finally Cisco started to notice because they were bidding too.

Early April I decided to resign from Cisco to run my own company as full time CEO.

Mid 2016 I received an offer from Google to join them in Zürich, Switzerland. From April I have built company vision for my startup and laid multi-year strategy, and I knew they can be executed under the current leadership team even without me. I also have personal reason to move my family to Europe. So I agreed to leave Dubai and started working at Google from July.

Even before I joined Google, I already made a plan of what I will learn in the company. Google is the right place to learn so many interesting things, but for 2016 I just wanted to focus on three things:

1. Learn how to build great product

“Behind every great product, there is a great product manager” - Marty Cagan

Google has created 7 great products with more than a billion users using each. And as Ben Horowitz wrote: a good Product Manager is the CEO of the product. A Product Manager combines business, technology, and design in order to discover a product that is valuable, feasible, and usable.

Product Management is above all else a business function, focused on maximising business value from a product. A Product Manager understands the technology stack from the product, and most importantly understanding the level of effort involved is crucial to making the right decisions. And Product Manager is the voice of the user inside the business and must be passionate about the user experience.

2. Continue to learn about SDN, but the scalable ones

Deep down inside I’m still a network engineer. I’ve been focusing on SDN & NFV since 2014 when I was in Cisco. Google has been using software-based solution in its network infrastructure even before the world called it SDN. However, I’m currently interested with highly scaled SDN solution using cloud based platform.

And I’m very interested with transformation path for any Enterprise company to evolve towards a fully automated network operation. I even built the five levels of Autonomous Network, mimicking the levels in Autonomous Vehicle, and currently working on the fifth level: intent-based, policy-driven, zero touch networking.

3. Learn Data Analysis to Machine Learning

Google is the best place to learn Data Science. Period. With Google Brain and DeepMind as part of the Alphabet group, this is the only company I know that puts Machine Learning first in every aspect of its products. Currently I'm focusing to learn about data analysis, data vizualisation and predictive analysis using machine learning.

The three things above are still my valid learning plan for 2017.
How about you? What is your learning plan this year?

Build great product.
Cloud based SDN solution.
With data analytics and machine learning.
“Building the network of the future”. Got it?