Saturday, March 07, 2009

Deep Diving Router Architecture, Part III

In the previous two parts we discussed the hardware architecture at length. So where do we go from here? Let’s now discuss the features and applications that run on top of that hardware. I’m running out of pictures that can be found on Google to explain this topic, and obviously I can’t use pictures from my company’s internal documents. So let this part be a picture-less discussion.

The following are a few sample features and applications required of a modern, next generation router:

High Availability (HA) and Fast Convergence
Routers fail eventually. The failure may happen in the route processor module, the power supply, the switch fabric, a line card, or somehow the whole chassis. The key point here is not how to avoid the failure, but how to manage it so as to minimize the time required to switch the traffic to a redundant path or module.

For most of us, who like to see a network as a collection of nodes connected to each other, a failure is either a link failure or a node failure. For these two cases, router vendors have introduced Fast Convergence (FC) features such as IGP FC and MPLS TE Fast Re-Route (FRR) to reduce the network convergence time to a minimum. The key point for this type of failure is to detect it as soon as possible. If the nodes are connected by a direct link, Loss of Signal (LoS) may be used to report the failure to an upper layer protocol such as the IGP. If the link is not direct, we may use a feature called Bidirectional Forwarding Detection (BFD), which basically sends hello packets from one end to the other.
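To make the detection point concrete, here is a minimal Python sketch of hello-based failure detection. It assumes the usual BFD rule of thumb that a session is declared down after a number of consecutive hello intervals (the detect multiplier) pass without a packet. The function names and the 50 ms / 3 values are only illustrative, not taken from any particular platform.

```python
def bfd_detection_time_ms(tx_interval_ms: int, detect_multiplier: int) -> int:
    """Approximate detection time: the hello interval times the number of
    consecutive hellos that may be missed before declaring the neighbor down."""
    return tx_interval_ms * detect_multiplier


def is_neighbor_down(last_rx_ms: int, now_ms: int,
                     tx_interval_ms: int, detect_multiplier: int) -> bool:
    """True once no hello has been received for longer than the detection time."""
    silence_ms = now_ms - last_rx_ms
    return silence_ms > bfd_detection_time_ms(tx_interval_ms, detect_multiplier)


# Illustrative values: 50 ms hellos, 3 missed hellos tolerated.
detection_ms = bfd_detection_time_ms(50, 3)
```

With these example values the neighbor is declared down after roughly 150 ms of silence, and only then is the upper layer protocol told to start converging.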

When hardware fails, we expect to see packet loss for a fraction of time. In most cases this is inevitable, and the only thing we can do is minimize the packet loss or reduce the convergence time. Take a router with redundant route processors: if the primary route processor fails and the router has to switch over to the secondary one, it can use a feature called Non-Stop Forwarding (NSF) to avoid packet loss during the switchover, until the secondary route processor is ready to take over completely. NSF offers some degree of transparency, since the failing node can inform its neighbors that it’s going down :) but promises that it will come back online, so would all neighbors please not flush its routes from their routing tables for a certain period of time, and please keep forwarding traffic to the failing node.

The failing node itself must use the modular concept explained in the previous discussion, so forwarding must be done somewhere other than the route processor, for example in the line cards. Before the failure, the router must be running the Stateful Switchover (SSO) feature to ensure that the redundant route processor is synchronized with the primary route processor, the fabric and the line cards. During the switchover, while the secondary route processor is still initializing, packets are still forwarded by the line cards using the last state of the local forwarding table before the failure. So if the failing node can still forward packets to its neighbors, even though it uses the last forwarding table state before the failure, and the neighbors are willing to continue forwarding packets to the failing node because they have been told it will be back online soon, then we should not have any packet loss at all. Later, the SSO/NSF feature should be able to bring the forwarding table up to the current state once the secondary route processor has taken over completely.
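The switchover sequence described above can be sketched in a few lines of Python. This is a toy model under my own assumptions: the class names, the `sso_sync` helper, and the prefix and port values are all hypothetical, and a real router synchronizes far more state than a route dictionary.

```python
class LineCard:
    """Keeps a local copy of the forwarding table, so it can keep
    forwarding while no route processor is available."""

    def __init__(self):
        self.fib = {}

    def install(self, table):
        self.fib = dict(table)

    def forward(self, prefix):
        # Returns the egress port, or None if there is no entry (drop).
        return self.fib.get(prefix)


class RouteProcessor:
    def __init__(self):
        self.routes = {}


def sso_sync(primary, secondary):
    """Stateful switchover: copy the primary's state to the standby."""
    secondary.routes = dict(primary.routes)


primary, secondary, card = RouteProcessor(), RouteProcessor(), LineCard()
primary.routes = {"10.0.0.0/8": "port1"}
card.install(primary.routes)
sso_sync(primary, secondary)

primary = None  # the primary route processor fails

# NSF: the line card still forwards using the last known table.
during_switchover = card.forward("10.0.0.0/8")

# The secondary finishes taking over, learns a newer route, and
# refreshes the line card with the current state.
secondary.routes["10.0.0.0/8"] = "port2"
card.install(secondary.routes)
after_switchover = card.forward("10.0.0.0/8")
```

The point of the sketch is only that forwarding never depends on the failed route processor: the line card answers from its local table before, during and after the switchover.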

The newest HA feature, pushed recently, is Non-Stop Routing (NSR). NSR is expected to offer full transparency to the neighbors. With NSF, the IGP relationship is torn down during the failure, even though the neighbors continue using the routes from the failing node for the agreed period of time. With NSR, the IGP relationship should remain up during the switchover.

If we go back to the hardware design and architecture, we can now see that the first requirement is to keep the secondary route processor always synchronized with the primary route processor, the fabric and the line cards. If this cannot be achieved, we should expect packet loss during the switchover. Obviously, if the failure is in a line card or the fabric while traffic is passing through it, we should expect packet loss regardless of any HA features we have enabled. And in a modular switch fabric architecture, there should be several separate fabric modules, and the failure of one module should not affect the total packet forwarding capacity of the whole switch fabric.

Quality of Service
Quality of Service (QoS) features, which give differentiated treatment to packets, are a must-have requirement, especially during network congestion. Where exactly may the congestion occur?

If we use the carrier-class router architecture from Part II, we can see that congestion may happen in the following places:
- Egress queue, a queue in the egress line card before the physical interface: packets wait here to be transmitted onto the physical media
- Fabric queue, a queue in the egress line card that receives packets from the switch fabric: it has to normalize the packets received from the fabric, for example if they were converted to fixed-size cells, or it becomes congested because the egress queue is congested
- Ingress queue, a queue in the ingress line card before packets are sent to the switch fabric: as a consequence of congestion in the fabric queue or in the fabric, this queue can become congested as well

Congestion may also happen in the switch fabric itself. But a carrier-class router normally has a huge forwarding capacity inside the switch fabric, enough to accommodate a chassis fully loaded with line cards, unless the switch fabric is modular and a failure in some of the fabric modules reduces the capacity.

So the key here is that we should be able to differentiate services at many points inside the router. For example, if the egress physical ports are congested, we should be able to ensure that high priority packets in the egress queue are transmitted first. The same goes for the fabric queue. Even inside the fabric we should be able to prioritize some packets in case the fabric queue or the fabric itself is congested. And when there is congestion in the egress queue, it should inform the fabric queue, which in turn informs the ingress queue to slow down sending packets to the fabric. This mechanism is known as back pressure. The communication from the fabric queue to the ingress queue normally goes through a bypass link and not through the fabric, since the intelligent fabric described in Part II works in one direction only, from ingress to egress. Slowing down the packets sent to the fabric actually means that the ingress packet engine should start dropping low priority packets, so it sends a lower rate of traffic to the ingress queue.
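A rough Python sketch of these two ideas, strict priority at the egress queue and back pressure toward the ingress side, might look like this. The two priority levels, the depth threshold and the function names are assumptions made for illustration; real platforms use more classes and hardware signalling over the bypass link.

```python
HIGH, LOW = 0, 1  # hypothetical internal priorities, 0 is served first


def back_pressure(queue_depth, threshold):
    """The downstream queue signals congestion once its depth reaches the threshold."""
    return queue_depth >= threshold


def ingress_admit(packets, pressured):
    """Under back pressure the ingress packet engine drops low priority
    packets, so a lower rate is sent toward the fabric."""
    if not pressured:
        return list(packets)
    return [p for p in packets if p[1] == HIGH]


def egress_dequeue_order(packets):
    """Strict priority: high priority first, arrival order kept within a class
    (Python's sort is stable)."""
    return sorted(packets, key=lambda p: p[1])


arrivals = [("a", LOW), ("b", HIGH), ("c", LOW)]
admitted_when_congested = ingress_admit(arrivals, back_pressure(8, 8))
admitted_when_idle = ingress_admit(arrivals, back_pressure(3, 8))
```

In this example only packet "b" survives at the ingress side while the downstream queue is at its threshold, and everything is admitted again once the queue drains.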

It is now clear where we can deploy QoS tools at different points inside the router. Policing, for example, should be done in the ingress packet engine. The egress queue can use shaping or queuing mechanisms and congestion avoidance tools. The fabric queue may only need to be able to inform the ingress queue when there is congestion.

Btw, the QoS marking used inside the router is normally derived from the marking already set on the packet, such as CoS, DSCP or EXP. When the packet enters the router, the external marking is used to create an internal marking that is used along the forwarding path until the packet leaves the router. It should be the task of the ingress packet engine to do the conversion.
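As a small illustration of that conversion, the Python sketch below maps an external marking to an internal class from 0 to 7. The mapping itself is an assumption of mine: it takes the three most significant bits of the 6-bit DSCP value and uses the 3-bit CoS or EXP value as-is. Real platforms usually let you configure this table.

```python
def internal_class(marking_type: str, value: int) -> int:
    """Derive a hypothetical 3-bit internal class from the external marking."""
    if marking_type == "dscp":
        if not 0 <= value <= 63:
            raise ValueError("DSCP is a 6-bit field")
        return value >> 3  # keep the three most significant bits
    if marking_type in ("cos", "exp"):
        if not 0 <= value <= 7:
            raise ValueError("CoS and EXP are 3-bit fields")
        return value
    raise ValueError(f"unknown marking type: {marking_type}")


# DSCP 46 (binary 101110) and EXP 5 land in the same internal class.
ef_class = internal_class("dscp", 46)
exp_class = internal_class("exp", 5)
```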

One other important point about the QoS feature is support for the recent hierarchical QoS model. In a normal network, a packet that comes to the router has only one tag or identification to distinguish the priority of the packet for a given source or flow. In an MPLS network, the tag is the EXP bits. In a normal IP network, the identification can be CoS or DSCP. They are all associated with only one type of source or flow, so there is only one QoS action that needs to be applied. But what if there are multiple tags, and different QoS tools are required for different tags? Let’s say that in a Carrier Ethernet environment the packet that reaches the router comes with two 802.1q tags: the S-tag, which identifies the provider’s aggregation point for example, and the C-tag, which identifies different customer VLANs (this is known as Q-in-Q). We may want to apply a QoS action to the packet as a unit, which means applying QoS based on the S-tag, but we also want to apply QoS based on the different C-tags. This means the router must support a hierarchical QoS model, where the main (parent) QoS class affects the whole packet, while the child classes can be specific to the customer tag.
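One way to picture the parent and child relationship is a parent rate per S-tag that is divided among C-tag child classes. The Python sketch below is only that picture: the policy layout, the tag numbers, the 100 Mbps rate and the weight-based split are all hypothetical, not a description of any vendor's H-QoS implementation.

```python
def child_rates(parent_rate_mbps, child_weights):
    """Split the parent (S-tag) rate among child (C-tag) classes by weight."""
    total = sum(child_weights.values())
    return {c_tag: parent_rate_mbps * weight / total
            for c_tag, weight in child_weights.items()}


# Hypothetical policy: traffic with S-tag 100 is shaped to 100 Mbps as a
# whole, and customer VLANs 10 and 20 share that rate 3:1.
policy = {100: {"parent_rate_mbps": 100, "child_weights": {10: 3, 20: 1}}}


def rate_for(s_tag, c_tag):
    """Rate available to one customer VLAN inside its parent class."""
    entry = policy[s_tag]
    return child_rates(entry["parent_rate_mbps"], entry["child_weights"])[c_tag]
```

Here `rate_for(100, 10)` gives 75.0 and `rate_for(100, 20)` gives 25.0, and together the children never exceed what the parent class allows.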

Multicast
In a network of multiple nodes, multicast traffic means that a single packet coming from one source gets replicated to multiple nodes, depending on the requests to join the multicast group. Now it’s time to look in more detail and ask the question: who is doing the replication inside the router?

A multicast packet can be distinguished easily by its destination multicast group address. Inside the router, the replication can be done in the ingress line card, called ingress replication, or in the egress line card, called egress replication. Using a multicast control protocol such as PIM, the ingress line card should be able to know the destination line cards for any multicast group address. Let’s say we have two ports in the ingress line card, and a multicast packet (S,G) is received on one port. From the lookup, the ingress packet engine or network processor finds out that the other port in the same line card is interested in the multicast group, as well as some other line cards. The ingress line card may do ingress replication: it replicates the packet into multiple copies and sends them to the other port in the same line card as well as to the other line cards.

Now, if we always do ingress replication there is a huge drawback in terms of performance. Let’s say the rate of multicast packets received by the ingress line card is X Gbps, and there are 10 egress ports on other line cards that are interested in the multicast group. If ingress replication is done, the ingress card must multiply the packets by 10, meaning the total rate is now 10X Gbps, and this is the rate that is sent from the ingress line card to the switch fabric. In this scenario it’s better to use egress replication, since the ingress line card just needs to send a single packet to each egress line card that is interested. If there are multiple ports on an egress card that are interested in the same multicast group, the replication can be done by the egress line card in order to send the same packet to all those ports. Egress replication avoids the unnecessarily huge amount of traffic in the ingress queue and the fabric that ingress replication would have caused.

In a carrier-class router, the switch fabric is more intelligent: it can replicate multicast packets inside the fabric. So again, the ingress line card just needs to send a single packet to the fabric. Based on the interested egress line cards, the fabric replicates this packet and sends it to those egress cards, and each egress line card can then do another replication if more than one of its ports is interested in the multicast group.
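The arithmetic behind the three replication options can be written down as a short Python sketch. The function name and the port layout are made up for illustration; the rules follow the text above: ingress replication sends one copy per interested port, egress replication one copy per interested line card, and fabric replication a single copy.

```python
def ingress_to_fabric_gbps(rate_gbps, interested_ports_per_card, mode):
    """Traffic the ingress line card must send toward the switch fabric
    for one multicast stream received at rate_gbps."""
    ports = [n for n in interested_ports_per_card.values() if n > 0]
    if mode == "ingress":   # one copy per interested egress port
        return rate_gbps * sum(ports)
    if mode == "egress":    # one copy per interested egress line card
        return rate_gbps * len(ports)
    if mode == "fabric":    # a single copy, the fabric replicates it
        return rate_gbps
    raise ValueError(f"unknown replication mode: {mode}")


# Hypothetical layout: 10 interested ports spread over two egress line cards.
layout = {"lc1": 4, "lc2": 6}
loads = {mode: ingress_to_fabric_gbps(2, layout, mode)
         for mode in ("ingress", "egress", "fabric")}
```

For a 2 Gbps stream and this layout, the ingress card sends 20 Gbps with ingress replication, 4 Gbps with egress replication, and 2 Gbps when the fabric replicates.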

Performance and Scalability
Once you have reached this point, I guess you have started asking yourself questions about any feature or protocol: is it done in hardware or software? Is it done by the central CPU or distributed in the line cards? Is it done in the ingress line card or the egress one? If yes, then good, we are finally making progress here.

Before I continue, I would like to mention one critical hardware component of the forwarding plane: Ternary Content Addressable Memory (TCAM). In simple words, TCAM is a high speed memory used to store the entries of the forwarding table, or of other features such as access control lists, in order to do high performance hardware switching. Remember the concept of pushing the forwarding table to the line card processor, and then from the line card processor to the hardware? TCAM is used to store that information. So now you know that we must ensure there is enough space there to keep the information; in other words, the TCAM is one limiting point in the forwarding path. If the route processor pushes more forwarding entries than the TCAM can handle, we may end up with an inconsistent forwarding table between the route processor and the line card. This means that even though the route processor knows what to do with the packet, the hardware may not have the entry and will just drop it.
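To show both the ternary matching and the capacity limit, here is a simplified software model of a TCAM in Python. It is a sketch under assumptions: entries are (value, mask, result) triples checked in priority order, whereas a real TCAM compares all entries in parallel in hardware, and the `Tcam` class and its methods are hypothetical names.

```python
class Tcam:
    """Toy TCAM: fixed capacity, entries matched in priority order."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = []  # (value, mask, result); mask bits set to 0 are "don't care"

    def install(self, value, mask, result):
        """Return False when the table is full and the entry cannot be stored."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append((value, mask, result))
        return True

    def lookup(self, key):
        """Return the result of the first matching entry, or None (drop)."""
        for value, mask, result in self.entries:
            if (key & mask) == (value & mask):
                return result
        return None


tcam = Tcam(capacity=1)
installed_first = tcam.install(0x0A000000, 0xFF000000, "port1")   # 10.0.0.0/8
installed_second = tcam.install(0xC0A80000, 0xFFFF0000, "port2")  # 192.168.0.0/16
```

The second install fails because the table is full, so a later lookup for 192.168.0.1 (`0xC0A80001`) returns None even though the route processor knows the route. That is exactly the inconsistency described above.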

Looking at the modular architecture of a next generation router, it is clear that in order to achieve non-blocking or line rate packet switching performance, every component in the forwarding path must support line rate. It means that if we want to forward X Gbps of traffic without any congestion, then every component, from the ingress processor and queue in the ingress line card, through the fabric and the fabric queue, to the egress processor and egress queue in the egress line card, should be able to process X Gbps or more. So if you want to know where the bottleneck inside the router is, check the processing capacity of each component. If you know the capacity from the ingress line card to the fabric is only X Gbps, but you put ports with a total capacity of more than X in the ingress line card, you are doing oversubscription. And by knowing the congestion point you can figure out which QoS tools to apply and exactly where you need to apply them. In this example, using egress QoS won’t help, since the congestion is in the queue to the fabric and not at the egress.
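The bottleneck check is simple enough to express in a few lines of Python. The stage names and capacities below are invented for the example. The only logic is that the slowest stage in the path limits the whole path, and that oversubscription is the ratio of total port capacity to the capacity toward the fabric.

```python
def bottleneck(stages):
    """stages: ordered list of (name, capacity_gbps) along the forwarding path.
    The path can go no faster than its slowest stage."""
    return min(stages, key=lambda stage: stage[1])


def oversubscription_ratio(port_speeds_gbps, to_fabric_gbps):
    """Total front-panel capacity divided by the capacity toward the fabric.
    A value above 1.0 means the line card is oversubscribed."""
    return sum(port_speeds_gbps) / to_fabric_gbps


# Hypothetical forwarding path and line card.
path = [("ingress engine", 40), ("ingress queue to fabric", 20),
        ("fabric", 80), ("fabric queue", 40), ("egress queue", 40)]
slowest = bottleneck(path)
ratio = oversubscription_ratio([10, 10, 10, 10], 20)
```

With these numbers the slowest stage is the ingress queue to the fabric at 20 Gbps and the card is oversubscribed 2:1, so QoS has to be applied at the ingress side.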

Now, why bother to keep increasing route processor performance if we know the actual forwarding performance comes from the line cards? Well, because we still need the route processor for the control plane functions. You need a good CPU to process a large number of IGP or BGP control packets. You still need a lot of memory to store the routes received from the neighbors before they can be pushed down to the hardware. You also need good storage capacity to keep the router software image as well as any system logging and crash dump information.

NGN Multi-Service Features and Applications

It is common for a next generation network to carry multiple different services. The common applications, other than multicast for IPTV, are MPLS L3VPN for business customers, Internet, L2VPN point-to-point and multipoint with VPLS, and so on. The complexity comes when we have to combine the features and run them at the same time.

For example, in an MPLS-based network, the label imposition for the next hop is done in the ingress line card. But what if we run another feature, such as a type of L2VPN that may be software based or performed in the route processor? For that reason we may need to do the label imposition in the egress line card.

And what if we have to do multiple lookups? For example, we have to remove two MPLS labels on the last label switch router when Penultimate Hop Popping (PHP) is not being used in an MPLS L3VPN network. First of all, we need to do a lookup to know what to do with the first or topmost MPLS label. Most probably we want to keep the topmost label to get the EXP bits for QoS. Then we have to do another lookup on the second label, the VPN label, to associate it with the VRF. Last, after all the MPLS labels have been stripped off, we still need to do another lookup in the IP forwarding table to know which egress interface to send the packet to. Doing several lookups in the same location, such as the ingress line card, introduces the concept of recirculation, where the packet is looped inside the ingress line card. After the first lookup the packet is not sent to the fabric. Instead, its layer 2 information is rewritten with the ingress line card itself as the destination, and the packet is sent back to the first hardware component that processes incoming packets. To the line card it looks like just the next packet that needs to be processed.
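The three lookups can be traced with a small Python sketch. Everything in it is hypothetical: the label values, the table layouts, the VRF name and the interface name. The IP lookup is an exact match on the prefix rather than a real longest-prefix match. Each loop iteration stands for one pass of the packet through the lookup hardware, so the pass counter shows how many times the packet is processed when recirculation is used.

```python
def egress_pe_lookups(label_stack, dest_prefix, lfib, vrf_fibs):
    """label_stack: list of (label, exp) pairs, topmost label first.
    Returns (egress interface, EXP kept for QoS, number of lookup passes)."""
    passes = 0
    qos_exp = None
    vrf = None
    for label, exp in label_stack:
        passes += 1                   # one lookup per label
        action, argument = lfib[label]
        if qos_exp is None:
            qos_exp = exp             # keep EXP from the topmost label
        if action == "vrf":
            vrf = argument            # the VPN label selects the VRF
    passes += 1                       # final IP lookup inside the VRF
    egress_interface = vrf_fibs[vrf][dest_prefix]
    return egress_interface, qos_exp, passes


# Hypothetical tables on the last label switch router (no PHP).
lfib = {100: ("pop", None), 200: ("vrf", "CUST_A")}
vrf_fibs = {"CUST_A": {"172.16.1.0/24": "ge0/1"}}
result = egress_pe_lookups([(100, 5), (200, 0)], "172.16.1.0/24", lfib, vrf_fibs)
```

For this packet the result is interface ge0/1, EXP 5 from the topmost label, and three passes in total.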

Multicast VPN can give us a different challenge. But just to summarize: by knowing how the protocols and features work, and which component inside the router does the specific task related to each feature, we can foresee whether any issues may occur during the implementation of the design. And we may be able to find a workaround to overcome them.

Frankly speaking, I really can’t go into a more detailed discussion, for various reasons. First, it’s already 4 am now. I have been awake for almost 48 hours to write this Deep Diving trilogy and do some other things at the same time, so I’ve got to sleep. Have I mentioned how grateful I am to whoever invented Red Bull? But for now, even the strongest energy drink won’t make me last forever.

Second, although I want to write more on this subject, I may not be able to do so. It’s really difficult to discuss it in more detail while still avoiding using or discussing confidential information from my company. Oh well, let’s see how it goes. I may have a fresh idea after getting a proper sleep.

Good night.
End of the trilogy.


Anonymous said...

For Fast Convergence, you should discuss the hardest part of it, the BGP PIC features. It is the longest one compared to IGP and not many vendors can do that. I mean, there are cases when FRR can't help. The more interesting case is also how the NE connected to the PE reacts to that kind of change, inbound and outbound. I use this a lot in designing systems with 3GPP standards in mind.

Cheers, #8930

Himawan Nugroho said...

Hi #8930,
there are two reasons why I didn't cover BGP Prefix Independent Convergence, even though it is a cool FC feature. First, because I want to focus on platform or hardware architecture, and PIC is a control plane function related to the FIB data structure. Second, frankly I have neither tested this feature nor deployed it myself. I would like to try it first before writing about it. But your comment does give me the idea to write a specific chapter only about Fast Convergence. Maybe later. Thanks

Gustavo Rodrigues Ramos said...

Excellent trilogy! Thank you.


houston Kenna said...

Your summaries are always top-notch. Thanks for keeping us apprised. I’m reading every word here.

stev2711 said...

Very nice...this information is hard to get...good and structured explanation too...great job !