Friday, March 06, 2009

Deep Diving Router Architecture, Part II

In the first part I explained the basics of the internal packet switching process inside a router. Normally we look at a router as just a node with multiple interfaces, and our focus is on how the router communicates with other routers to build the routing table. Once the table has been built, we assume a packet comes in on one interface and goes out of another, depending on the destination. In the case of a multicast packet, one packet comes in on one interface and goes out of several other interfaces, depending on the join requests for the multicast group. And even when there are features such as filters and Quality of Service, if we see a router as a simple node with an ingress interface and an egress interface, we normally think the features are applied in either the ingress or the egress direction relative to the router, and that they just work like magic.

Now you can see that there are several other tasks that are as important as building the routing table. The first is to build the forwarding table based on the routing table. The forwarding table contains the next hop information and next hop interface for each destination, just like the routing table, with the addition of the layer 2 information of the next hop. The packet must be sent out with a new layer 2 header, so it is important to re-write this information into the packet. The next task is the lookup process that matches the destination against the entries in the forwarding table. The packet must be stored somewhere while waiting for the lookup to complete. Then the packet must be moved to a different location (in an old router the actual packet may stay in the same physical memory location, with different pointers used to distinguish its state before and after the lookup). Last but not least is applying the features or policy to the packet inside the router. It’s really crucial to understand what these tasks are, as well as where exactly they are done.
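To make the first two tasks concrete, here is a minimal sketch in Python of building a forwarding table from a routing table and then doing a longest-prefix-match lookup. The prefixes, interface names, MAC addresses and the `build_fib` and `lookup` helpers are all made up for illustration; a real router does this in specialized hardware and data structures, not with a sorted list.

```python
import ipaddress

# Hypothetical routing table: prefix -> (next hop IP, egress interface).
RIB = {
    "10.0.0.0/8": ("192.0.2.1", "if1"),
    "10.1.0.0/16": ("192.0.2.5", "if2"),
    "0.0.0.0/0": ("192.0.2.9", "if3"),
}

# Hypothetical adjacency table: next hop IP -> layer 2 (MAC) address.
ADJACENCY = {
    "192.0.2.1": "00:11:22:33:44:01",
    "192.0.2.5": "00:11:22:33:44:05",
    "192.0.2.9": "00:11:22:33:44:09",
}


def build_fib(rib, adjacency):
    """Combine routing entries with the layer 2 rewrite data of each next hop."""
    fib = []
    for prefix, (next_hop, interface) in rib.items():
        fib.append({
            "prefix": ipaddress.ip_network(prefix),
            "next_hop": next_hop,
            "interface": interface,
            "l2_rewrite": adjacency[next_hop],
        })
    # Most specific prefixes first, so the first match is the longest match.
    fib.sort(key=lambda entry: entry["prefix"].prefixlen, reverse=True)
    return fib


def lookup(fib, destination):
    """Return the most specific forwarding entry for a destination, or None."""
    address = ipaddress.ip_address(destination)
    for entry in fib:
        if address in entry["prefix"]:
            return entry
    return None


FIB = build_fib(RIB, ADJACENCY)
```

With this sketch, `lookup(FIB, "10.1.2.3")` matches the more specific 10.1.0.0/16 entry rather than 10.0.0.0/8, and the returned entry already carries the layer 2 address to re-write into the packet.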

First, let’s understand the concept of separating the router into two planes: control, and forwarding or data. There is actually a third one, the management plane, which is used to connect to, interact with and manage the router itself, but let’s focus on the first two. The control plane is where all communication between routers using routing protocols happens, in order to build the routing table and forwarding table that are used to switch packets from the ingress interface to the egress interface. The process of switching packets between interfaces in the same router is part of the data or forwarding plane.

Let’s look at a brief architecture of a next generation, carrier-class router in the picture below.

The architecture uses a modular concept where most of the important tasks are performed in different locations by different components. This is in sharp contrast with the simple architecture in Part I, where there is only a single main board, a central route processor and memory, and a PCI bus to move the packet from the network card to the processor and back to the network card. The route processor is still the main brain of the system, but the function of switching packets, including the lookup, can be done by different hardware altogether. The network card or line card may have its own processor to do the lookup and dedicated hardware to do the actual packet switching. To connect the different line cards we use a module called the switch fabric, also known as the backplane of the router. The modular approach is chosen to address the challenge of scalability and to avoid an all-in-one design where one module can become a single point of failure for the whole system.

So the central route processor can now be considered as just another line card. It is still required to perform the control plane function: running the routing protocols with other routers to build the routing and forwarding tables, which can then be pushed to the network processor in the line card. Once the line card has this information, it is able to do the lookup and re-write the layer 2 information into the packet. To increase switching performance, or to apply features such as packet filters, we can have dedicated hardware that is built to execute specific instructions only, called an Application Specific Integrated Circuit (ASIC).

The picture below describes how the forwarding information is built by the central route processor and then pushed to the network processor in the line card.

The route processor uses routing protocols such as IS-IS, OSPF and BGP to build the Routing Information Base (RIB), the database known as the routing table. In next generation networks it is common to switch packets based not on the IP header but on MPLS label information. The MPLS label for a specific route or destination IP prefix is communicated and agreed among the routers using one of several label distribution protocols: LDP, RSVP or even BGP. Obviously the label distribution protocols depend on the underlying routing protocol for the routers to be able to reach each other. The routing table is then used along with the label database to build the Label Forwarding Information Base (LFIB). Whereas the forwarding information base is derived from the routing table and contains the next hop IP address, the next hop interface and the layer 2 information to be re-written into the packet, the LFIB contains the next hop IP address together with the MPLS label that needs to be popped from or pushed onto the packet before it can be sent out of the egress interface.
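As a rough illustration of how the routing table and the label database combine into an LFIB, here is a small Python sketch. All prefixes, label values and the `build_lfib` helper are hypothetical. The sketch models a transit router, so besides the pop mentioned above it uses the standard MPLS swap operation, and it uses `None` as a stand-in for "the neighbor asked us to pop the label", which is a simplification of my own.

```python
# Hypothetical routing table: prefix -> (next hop IP, egress interface).
RIB = {
    "10.1.0.0/16": ("192.0.2.5", "if2"),
    "10.2.0.0/16": ("192.0.2.1", "if1"),
}

# Labels this router advertised to its neighbors: prefix -> incoming label.
LOCAL_LABELS = {"10.1.0.0/16": 100, "10.2.0.0/16": 101}

# Labels learned from neighbors: (prefix, neighbor) -> outgoing label.
# None is an assumption of this sketch meaning "pop the label".
REMOTE_LABELS = {
    ("10.1.0.0/16", "192.0.2.5"): 200,
    ("10.2.0.0/16", "192.0.2.1"): None,
}


def build_lfib(rib, local_labels, remote_labels):
    """Build incoming label -> action, keeping only the binding of the RIB next hop."""
    lfib = {}
    for prefix, (next_hop, interface) in rib.items():
        in_label = local_labels.get(prefix)
        if in_label is None or (prefix, next_hop) not in remote_labels:
            continue  # no complete label binding for this prefix
        out_label = remote_labels[(prefix, next_hop)]
        lfib[in_label] = {
            "action": "pop" if out_label is None else "swap",
            "out_label": out_label,
            "next_hop": next_hop,
            "interface": interface,
        }
    return lfib


LFIB = build_lfib(RIB, LOCAL_LABELS, REMOTE_LABELS)
```

The point to notice is that the routing table decides which neighbor's label is used: the label database may hold bindings from many neighbors, but only the one from the RIB next hop ends up in the LFIB.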

Both the forwarding table and the label forwarding table can be pushed to the network processor in the line card using the Inter Process Communication (IPC) interface. If all incoming packets must be processed by the network processor, then we have only moved the processing challenge from a central processor to a distributed model. Going a bit further, the network processor can build specific instructions that define what action needs to be taken on a packet arriving at the line card, and push this information to hardware built to run specific instructions, such as an ASIC. ASICs nowadays can process packets not only at layer 2 but at layer 3 and layer 4 as well, to deploy features such as packet filters. The processing of layer 2 and of layers 3 and 4 can also be split across two different ASICs for performance reasons.

Everything up to the point where the forwarding information is pushed to the line card and to the specific hardware is part of the control plane. The actual switching of packets by the hardware or ASIC from one line card to another is part of the forwarding plane.

The carrier-class router from Cisco extends the modular concept even further by introducing the Modular Services Card (MSC). The line card is separated into two components: the physical part and the intelligent part. The physical part, known as the PLIM (Physical Layer Interface Module), deals with everything at layer 1 of the TCP/IP stack, including providing the physical ports where we plug in the cables. The MSC does the upper layer processing once the PLIM has reconstructed the bits or digital signal from the network media into a single TCP/IP packet. The purpose is obviously to address the scalability issue: the physical part can be replaced or upgraded while the MSC remains the same, or if one day we want to upgrade the MSC capacity we can do so without removing the physical cabling from the ports.

Let’s look a bit closer at how the packet gets processed inside the line card, using the MSC architecture above. This is a very famous discussion known as the Life of a Packet.

From the PLIM the packet is sent to the MSC through the midplane (you can also think of this as happening in a single line card without the separation into PLIM, midplane and MSC). The packet is then processed by the Ingress Packet Engine, which has all the information and instructions received from the line card processor to decide what to do with each incoming packet. Once it has been decided to send the packet to another line card or to the route processor (in cases where the packet is destined to the router’s own IP address, or is a control packet to manage the router), the packet needs to be sent to the backplane or switch fabric, with an additional internal header to ensure only the destination line card will receive it. In some architectures a packet travelling through the fabric must be normalized to a fixed size or length, because it is easier and faster for the hardware to process packets of the same size. In some architectures the packet is converted to a different format (such as fixed-size cells with a new header) when it travels across the backplane. So there must be a buffer where the packets are queued before they can be transmitted into the backplane. The backplane itself is another module, or set of cards, designed specifically to connect all the other line cards. We will discuss the backplane or switch fabric later.

From the backplane the packet is transmitted to the destination line card, which obviously requires another buffer or set of queues to convert the packet back to its original format or length. Then there is another processing step in the egress packet engine in case some features need to be applied. In specific cases, MPLS label imposition to push the label can happen here. In most cases, the layer 2 re-write or MPLS label imposition can be done in the ingress engine, so the egress engine doesn’t need to do any lookup or further processing other than applying additional features in the egress direction. In a carrier-class router the egress engine can be the same packet engine as the ingress one, or it can be different hardware to ensure performance. Before the packet is sent out through the physical interface, there must be another queue where packets wait their turn before they are processed and moved onto the network media.

When you look at the physical layout of the card, as in the picture below, it is really easy to understand each component. You can see there are several different chips doing different tasks. The forwarding path from the ingress physical port to the switch fabric and back to the egress physical port can be seen clearly.

In some other architectures, the hardware that processes the incoming packet cannot do the lookup itself, so it has to consult the route processor. It can, however, send the packet to the destination line card directly over the backplane. So when the ingress line card receives the packet, it may put it in a buffer or queue while waiting for the route processor to do the lookup and tell it where to send the packet. This means the ingress line card doesn’t have to send the whole packet to the route processor; instead it can copy only the layer 3 header and send that to the route processor. Once the ingress line card knows the destination line card, it can send the packet to the egress line card directly.
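The idea of consulting the route processor with only the layer 3 header can be sketched like this in Python. Everything here is a toy model of my own: the class names, the exact-match table standing in for a real lookup, and the assumption of a 20-byte IPv4 header without options.

```python
IPV4_HEADER_LEN = 20  # assumes an IPv4 header without options


class RouteProcessor:
    def __init__(self, table):
        self.table = table        # destination address bytes -> egress slot
        self.bytes_received = 0   # how much data the line cards sent us

    def lookup(self, header):
        self.bytes_received += len(header)
        destination = header[16:20]  # IPv4 destination address field
        return self.table[destination]


class IngressLineCard:
    def __init__(self, route_processor, fabric):
        self.route_processor = route_processor
        self.fabric = fabric      # egress slot -> list of delivered packets

    def receive(self, packet):
        buffered = packet  # the full packet stays on the line card
        # Only a copy of the layer 3 header goes to the route processor.
        slot = self.route_processor.lookup(buffered[:IPV4_HEADER_LEN])
        # The whole packet then goes straight to the egress line card.
        self.fabric.setdefault(slot, []).append(buffered)
        return slot


header = bytes(16) + bytes([10, 1, 0, 5])  # zeroed fields, destination 10.1.0.5
packet = header + bytes(1000)              # 1000 bytes of payload
rp = RouteProcessor({bytes([10, 1, 0, 5]): 3})
fabric = {}
card = IngressLineCard(rp, fabric)
slot = card.receive(packet)
```

In this run the route processor only ever sees 20 bytes of the 1020-byte packet, which is the whole point of the design.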

When it comes to the switch fabric or backplane, basic and mid-range routers may still use a bus architecture, as shown in the picture below. Even if the line card has its own processor to do the lookup, with a bus backplane the packet sent from the ingress line card will be received by all the other line cards.

A bus uses a concept similar to shared Ethernet: the ingress line card puts the packet onto the bus, all the other line cards can receive it, and only the switching engine or the destination egress line card takes the packet to process it further. You can see immediately that the bottleneck of this system is the capacity of the backplane.
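A toy Python model makes the cost of the bus visible. The `LineCard` and `BusBackplane` classes and the slot numbers are invented for this sketch; the only behavior modeled is that every card sees every frame and only the addressed card keeps it.

```python
class LineCard:
    def __init__(self, slot):
        self.slot = slot
        self.seen = 0        # frames observed on the shared bus
        self.accepted = []   # frames actually addressed to this card

    def on_bus_frame(self, dest_slot, packet):
        self.seen += 1
        if dest_slot == self.slot:
            self.accepted.append(packet)


class BusBackplane:
    def __init__(self, cards):
        self.cards = cards

    def transmit(self, src_slot, dest_slot, packet):
        # One transmission at a time: every other card receives the frame.
        for card in self.cards:
            if card.slot != src_slot:
                card.on_bus_frame(dest_slot, packet)


cards = [LineCard(slot) for slot in range(4)]
bus = BusBackplane(cards)
bus.transmit(0, 2, "pkt-A")
bus.transmit(1, 2, "pkt-B")
```

After the two transmissions, the card in slot 3 has seen both frames and accepted none of them, so bus capacity was spent delivering traffic to cards that simply discard it.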

A better backplane architecture is the crossbar, shown below.

With a crossbar, each ingress line card can send a packet to any other line card at any given time. But since an egress line card can only receive from one ingress line card at any point in time, there must be a controller or scheduler to ensure only one ingress line card is connected to a given egress line card. The controller can be integrated into the switch fabric, or it can be a separate external module to offer scalability and redundancy.
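Here is a minimal Python sketch of what such a scheduler has to do. It uses a naive first-come-first-served rule, which is my own simplification (real schedulers use fairer matching algorithms), and it also assumes an ingress card can only drive one connection per time slot. The `schedule_slot` and `run` helpers are hypothetical.

```python
def schedule_slot(requests):
    """Grant at most one ingress per egress (and one egress per ingress).

    requests is a list of (ingress, egress) pairs wanting to send in this
    time slot. Returns (granted, deferred).
    """
    busy_ingress, busy_egress = set(), set()
    granted, deferred = [], []
    for ingress, egress in requests:
        if ingress in busy_ingress or egress in busy_egress:
            deferred.append((ingress, egress))  # must wait for a later slot
        else:
            busy_ingress.add(ingress)
            busy_egress.add(egress)
            granted.append((ingress, egress))
    return granted, deferred


def run(requests):
    """Schedule all requests, returning the crossbar connections per time slot."""
    slots = []
    pending = list(requests)
    while pending:
        granted, pending = schedule_slot(pending)
        slots.append(granted)
    return slots


# Ingress cards 0 and 1 both want egress card 2 at the same time.
slots = run([(0, 2), (1, 2), (3, 1), (1, 0)])
```

In the example, cards 0 and 1 contend for egress card 2, so the request from card 1 to card 2 is pushed to a second time slot while the non-conflicting transfers all proceed in parallel in the first one.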

There are router architectures that still have both a crossbar fabric and a bus. The bus is still required, perhaps for backward compatibility. For instance, old line cards may not have the new fabric connection to the backplane, so they have to use the bus. A newer line card that already has a fabric connection still needs a connection to the bus if it has to send packets to a bus-only line card. And in some cases, the bus is still used by the line cards to send packets to the central route processor.

The latest switch fabric technology is very intelligent, as it can do lookup and packet replication within the fabric and provide full line rate connectivity to the egress. For example, if each line card is connected to the fabric with an X Gbps connection, then at any given point in time, as long as the total traffic sent to an egress line card is less than or equal to X Gbps, the traffic will flow without congestion even if the packets come from multiple ingress line cards. In a carrier-class router, the capacity to receive packets from the fabric is normally double or more the capacity to send to the fabric. This means that if each line card can send X Gbps to the backplane, each line card can receive 2 to 2.5X Gbps from the backplane, to accommodate multiple ingress line cards sending packets to the same egress line card at the same time.
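The arithmetic is simple enough to put in a few lines of Python. The `egress_congested` helper, the value X = 10 Gbps and the offered rates are made-up numbers for illustration only.

```python
def egress_congested(offered_gbps, link_gbps, receive_speedup=1.0):
    """True if the traffic aimed at one egress card exceeds what it can receive.

    offered_gbps: rates that each ingress card sends toward this egress card.
    link_gbps: the X Gbps each card can send into the fabric.
    receive_speedup: how many times X the card can receive from the fabric.
    """
    receive_capacity = link_gbps * receive_speedup
    return sum(offered_gbps) > receive_capacity


X = 10  # hypothetical fabric link speed in Gbps

no_speedup = egress_congested([8, 8], X)          # 16 Gbps offered, 10 Gbps capacity
with_speedup = egress_congested([8, 8], X, 2.0)   # 16 Gbps offered, 20 Gbps capacity
at_line_rate = egress_congested([4, 6], X)        # exactly 10 Gbps offered
```

Two ingress cards each sending 8 Gbps to the same egress card congest a fabric without speedup, while the same load fits comfortably once the egress card can receive 2X.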

In this type of fabric, there can be a bypass link between the ingress line card and the egress line card. This bypass link should not be used to forward the actual packets. Usually the link is used by the egress line card to inform the ingress line card that there is congestion, so the ingress line card can slow down the rate at which it sends packets to the switch fabric.
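A rough Python sketch of this congestion signalling is below. The classes, the rates and especially the "halve the rate on every signal" reaction are assumptions of mine, picked only to show the feedback loop; the actual mechanism depends on the hardware.

```python
class IngressCard:
    def __init__(self, rate_gbps, min_rate_gbps=1.0):
        self.rate_gbps = rate_gbps
        self.min_rate_gbps = min_rate_gbps

    def on_congestion_signal(self):
        # Assumed reaction: halve the sending rate, but never below the floor.
        self.rate_gbps = max(self.min_rate_gbps, self.rate_gbps / 2)


class EgressCard:
    def __init__(self, capacity_gbps):
        self.capacity_gbps = capacity_gbps

    def check(self, ingress_cards):
        """Signal every sender over the bypass link if we are congested."""
        offered = sum(card.rate_gbps for card in ingress_cards)
        if offered > self.capacity_gbps:
            for card in ingress_cards:
                card.on_congestion_signal()  # travels on the bypass link
            return True
        return False


a, b = IngressCard(8.0), IngressCard(8.0)
egress = EgressCard(10.0)
first = egress.check([a, b])    # 16 Gbps offered: congested, senders slow down
second = egress.check([a, b])   # 8 Gbps offered: fits within 10 Gbps
```

Note that only the small signal crosses the bypass link; the packets themselves still go through the switch fabric.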

There are other things to discuss about the packet while it is in the fabric. As I mentioned before, the packet can be normalized into fixed-size units (by fragmenting the packet if it is larger than the threshold and adding padding if it is smaller). By converting the packet into an internal format, such as fixed-size cells with an internal header, the processing inside the switch fabric can be faster. In a carrier-class router the fabric has several stages, so even inside the fabric a lookup needs to be done to ensure the packet is sent only to the right egress line card. Obviously this is why there is a new internal header: the lookup in the fabric may not be the same as the lookup in the ingress line card processor, which is based on the IP or MPLS label forwarding table. If the fabric doesn’t do any lookup, it is up to the ingress line card to add the internal header that identifies the destination egress line card. With the internal header on the packet, the controller of a crossbar fabric can determine which egress line card the ingress line card should be connected to, so the packet reaches the destination line card. This internal header is additional overhead on the packet inside the fabric.
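The fragment-and-pad step, and the reverse step on the egress line card, can be sketched in Python as follows. The 64-byte cell payload and the header fields (`dest_slot`, `seq`, `last`, `length`) are hypothetical; real fabrics define their own cell size and header layout.

```python
CELL_PAYLOAD = 64  # assumed fixed payload size of one fabric cell, in bytes


def segment(packet, dest_slot):
    """Split a packet into fixed-size cells, padding the last one with zeros."""
    total = max(1, -(-len(packet) // CELL_PAYLOAD))  # ceiling division
    cells = []
    for seq in range(total):
        chunk = packet[seq * CELL_PAYLOAD:(seq + 1) * CELL_PAYLOAD]
        header = {
            "dest_slot": dest_slot,   # lets the fabric pick the egress card
            "seq": seq,               # position of this cell in the packet
            "last": seq == total - 1,
            "length": len(chunk),     # real bytes in this cell, before padding
        }
        padding = bytes(CELL_PAYLOAD - len(chunk))
        cells.append((header, chunk + padding))
    return cells


def reassemble(cells):
    """Rebuild the original packet on the egress card, dropping the padding."""
    ordered = sorted(cells, key=lambda cell: cell[0]["seq"])
    return b"".join(payload[:header["length"]] for header, payload in ordered)


packet = bytes(range(150))
cells = segment(packet, dest_slot=5)
restored = reassemble(cells)
```

A 150-byte packet becomes three 64-byte cells here, so 42 bytes of padding plus three internal headers travel through the fabric on top of the original data, which is exactly the overhead mentioned above.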

Up to this point, do you still think knowledge of internal packet switching is not important? Well, my friend, it seems like you really want to push your luck. So please continue reading the next part, where I will try to explain the implications of the hardware architecture for the features and applications running on top of it.

End of part two.


Anonymous said...

Really good post that explains alot of things about router functionality in the nitty gritty details!

Thanks alot for the info!

Unknown said...

Great articles, Himawan!

Stephen Hsiao said...

It's Good explains !!

niroshan said...

Excellent article, Really helped me out in these aspects, Thank you again