Friday, February 26, 2010

The Evolution of Communication Processors

In response to the previous posting, I got a couple of requests asking me to explain what this new ACP processor means and how it compares to older devices. The evolution of communication processors and router devices would be a good topic for a lecture or a whiteboard discussion. In lieu of that, here is a longer description for anyone interested. Apologies to my friends in the IT area for the first couple of paragraphs, which may sound rudimentary. :-)

In the 1960s and 70s, when microprocessors were developed, the initial goal was to get them to do general math operations like addition, subtraction, multiplication and division, combined with memory access operations like loading/storing a value in a memory location, fetching the next instruction, etc. By combining these basic operations, one could do more complicated things like matrix manipulation, industrial instrumentation tasks, and so forth. As microprocessor development evolved, it led to GP (general purpose) CPUs (Central Processing Units) that were more powerful and did a lot more basic operations, allowing the development of various types of computers with these GP-CPUs serving as their brain. As software evolved to abstract out the underlying hardware architecture, letting software engineers focus on the applications they wanted to develop, a lot of sophisticated applications started showing up, from personal productivity applications like word processors and spreadsheets to bigger server applications like weather prediction, image processing, missile guidance and so on. In the 1980s, when Internet routing requirements initially came up, they were addressed purely via software running on standard general purpose CPUs.

The Intel Pentium processors (as well as their older predecessors, the 286, 386 and 486) are well-known examples of general purpose CPUs these days. There are several other GP-CPU families such as ARM and PowerPC. Since Intel and AMD processors are used in most of the PCs sold all over the world, these are the most famous ones, sold in millions of units every year. For a while Macintosh machines used PowerPC processors as their CPU, but about 5 years back they also switched over to Intel processors. The general software development flow for these GP-CPUs is to write programs in languages such as C, C++ and Java that keep the focus on what we want to achieve from the end usage point of view, and then compile the written code into binary code that only a specific CPU can understand and run very fast. Operating systems we hear about such as Windows, Linux and MacOS provide a platform to write such programs, compile them and run them, making the application development process easier. When we install a program from a CD or from the internet, it is usually this compiled binary code, meant to run on the Intel Pentium CPU architecture under the Windows operating system. The world of telecom mostly uses Linux or one of several more specialized operating systems called RTOS (Real Time Operating Systems).

These GP-CPUs are meant to run any kind of application reasonably well. Now, if we want to route traffic on the internet, we can write programs that will run on a router PC that has one of these GP-CPUs. As you know, information exchange on the internet takes place using small packets. They are similar to the paper mail we send via the ordinary postal service, in the sense that each packet contains an address (in a part called the header) indicating the destination it has to reach and a body (in a part called the payload) with a little bit of information. When you access a website from your home computer, you may receive 10,000 such packets that are opened up and put together by your PC to show you one web page. A router PC sitting on the internet will receive such individual packets from the web server, look up the address where each one needs to go (say to your home PC) and forward it to the correct cable among the multiple cables it may be attached to, so that the packet eventually reaches your home computer. Now, if we want the router to not only route the traffic but also inspect it for viruses and block any infected traffic, we can enhance the software running on the router PC to do this, recompile the code and run it. It will start doing virus filtering. Thus, using a general purpose CPU provides enormous flexibility in what we can do by developing appropriate applications. This is great but for one serious caveat. If the CPU is simply looking up packet headers and routing packets and does nothing else, it can handle an enormous volume of traffic. Let us say one CPU can handle the traffic for 10,000 users simultaneously. If we then ask it to scan for viruses as well, that will take up so much CPU capacity that it can serve only 100 users now..!
If you force it to do one more task (for example, encrypt all the traffic with a secret password so that no one other than the destination computer can understand the information being transferred), it can slow down to such an extent that it can serve only 10 users. This drop from 10,000 to 10 is not an exaggeration. So, you can imagine how many general purpose CPUs you may need to handle a given number of users, depending upon the workload. The frustrating part is that in real life the load on the processor can vary arbitrarily. To give a simple example, if a lot of people work from home due to a snow storm, connecting to their company computers via encrypted connections, the encryption/decryption load will increase well beyond what it is on other days..!
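
For readers who write software, the "look up the address and pick the right cable" step can be sketched in a few lines of Python. This is only an illustration, not code from any real router: the `Packet` class, the table entries and the port names are all made up here, and the sketch assumes the usual rule that the most specific matching entry (the longest prefix) wins.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class Packet:
    destination: str  # header: the address the packet has to reach
    payload: bytes    # body: a little bit of the information being sent


# Hypothetical forwarding table: destination network -> outgoing cable (port).
FORWARDING_TABLE = {
    ipaddress.ip_network("10.1.0.0/16"): "port-1",
    ipaddress.ip_network("10.1.2.0/24"): "port-2",
    ipaddress.ip_network("0.0.0.0/0"): "port-0",  # default route
}


def lookup_port(packet: Packet) -> str:
    """Return the port of the most specific (longest prefix) matching entry."""
    address = ipaddress.ip_address(packet.destination)
    matches = [net for net in FORWARDING_TABLE if address in net]
    best = max(matches, key=lambda net: net.prefixlen)
    return FORWARDING_TABLE[best]


print(lookup_port(Packet("10.1.2.7", b"part of a web page")))
```

A software router runs something like this loop for every packet, and every extra job (virus scanning, encryption) is more code executed per packet on the same CPU, which is where the throughput drop described above comes from.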

In the nineties, as the volume of internet traffic kept rising, these GP-CPU based routers routing packets using software proved inadequate, and so companies like ours developed special processors that are meant to do just one thing very well (rather than all things reasonably well, like GP-CPUs). That one thing is simply looking up the address in the header of individual packets and sending them out on the right cable purely in hardware, without much dependence on software. This is very similar to the address sorting machines the US postal service uses to sort mail. If you use a human being to sort the mail, he/she will be capable of doing a lot of other things and, like a GP-CPU, can easily be taught to do new things. But a human can never reach the speed of a sorting machine that consistently sorts hundreds of mail pieces each minute, even though the sorting machine cannot easily be taught to do other things. Internet routers in the 90s started using such network processors and managed to speed up the routing process enormously. This is often referred to as the 'fast-path' approach, while traffic going through the GP-CPU is referred to as the 'slow-path' or 'control plane' approach. Our company sold a lot of these fast-path based processors. To understand how these network processors are used, think of Dell or HP, which sell PCs containing an Intel Pentium processor inside. The exact mapping is that our customers (equivalent to Dell and HP) sell routers (equivalent to PCs) that contain our network processor (equivalent to the Pentium processor in a PC) inside.

By the late 90s there were additional requirements such as Quality of Service (QoS) that had to be managed by the routers. A good example of QoS management is a router close to the home/customer premises. Nowadays the differences between telephone and cable TV companies are starting to disappear, since both are able to provide TV, internet access and telephone service via one cable brought into your home. This is usually referred to as a "triple play" of bundled services. Using such a service, you can be talking to someone on your home phone while someone else is watching TV and a third family member is surfing the web on your home computer. The voice, video and data traffic is all converted into packets that go out/come in via the router. Among these three services, the telephone traffic takes up very little bandwidth but is extremely sensitive to delay, since even a fraction of a second of delay on the phone line can be pretty annoying while you are talking to someone. The video traffic takes up the most bandwidth (really high if it is HD TV), but since it is unidirectional, you can buffer up the traffic a little on the router or TV so that an occasional half-second traffic stoppage can be completely hidden from the viewer. The data traffic meant for the internet user is the least affected by delays (it is ok to take a few more seconds to open an email), and it takes more bandwidth than the phone but less than the video. Routers handling all three types of traffic throughout the internet need the intelligence to identify voice traffic and give it VIP treatment, routing it without any delay anywhere, while giving second and third level priority to video and data respectively. We can throw as many wrinkles into this scenario as we like.
For example, if the internet user in your home is playing a live action game online or using Skype for a video conference call instead of simply reading email, then that traffic needs to be given the same priority as telephone traffic. But the idea should be clear. Our network processors can identify all these traffic variations dynamically and accord appropriate priority using hardware alone, at enormous speeds, consistently. This was great.
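
The priority idea above can be sketched in software as a strict-priority queue. This is a simplified illustration, not how the network processor hardware is built: the class names, the `PRIORITY` table and the `QosScheduler` class are assumptions made for this example, and real schedulers typically add safeguards so that lower-priority traffic is not starved.

```python
import heapq
from itertools import count

# Hypothetical traffic classes: a lower number is served first.
PRIORITY = {"voice": 0, "video": 1, "data": 2}


class QosScheduler:
    """Strict-priority scheduler; packets of equal priority leave in arrival order."""

    def __init__(self):
        self._heap = []
        self._arrival = count()  # tie-breaker that preserves arrival order

    def enqueue(self, traffic_class, packet):
        entry = (PRIORITY[traffic_class], next(self._arrival), packet)
        heapq.heappush(self._heap, entry)

    def dequeue(self):
        _, _, packet = heapq.heappop(self._heap)
        return packet

    def __len__(self):
        return len(self._heap)


scheduler = QosScheduler()
scheduler.enqueue("data", "email")
scheduler.enqueue("video", "tv-frame")
scheduler.enqueue("voice", "call-sample")
scheduler.enqueue("data", "web-page")

sent = [scheduler.dequeue() for _ in range(len(scheduler))]
print(sent)  # ['call-sample', 'tv-frame', 'email', 'web-page']
```

In this sketch, the online game or Skype call from the example above would simply be classified as "voice" before being queued.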

These routers that used hardware based network processors as the main brain also used to have a small GP-CPU inside the box, which was used to initially bring up the network processor and download the addresses into a table so that it could do the sorting. The GP-CPU is like the human being needed to set up and turn on the sorting machine in the post office so that the automated machine can sort things at an enormous speed. In the 90s, router boxes could command a lot of money since the technology was new. So, it was ok to have an additional small GP-CPU inside each router despite its additional cost. As the price pressure increased in the early part of the last decade, we added a small GP-CPU inside our network processor ASIC itself. So, when our customers designed their router boxes, they could save some money by eliminating the GP-CPU from their design and using the one inside our processor to set up the processor. This worked very well when our processors were used to design DSLAM (Digital Subscriber Line Access Multiplexer) boxes, which sit in the telephone company's rack providing service to your home DSL connection, since those boxes needed only minimal help from the GP-CPU to turn the network processor on.

You are still reading this..? Great..! :-)

As always happens, marketing teams started promoting these network processors with tiny built-in GP-CPUs to other application domains. A business office gateway is one such application. These gateways sit at the entrance to a small branch office serving about 50 employees working in that branch. The employees connect to a LAN (Local Area Network) and use their local printers, email servers, etc. Whenever they access the internet or the main office server, or make a phone call outside the office, the traffic gets routed via the gateway. Similarly, any traffic coming into the branch office comes in via this gateway. These gateways are required to do a lot more than simply sort and send out the traffic. Let us look at four examples.

- They hide the addresses of the individual employee computers and send the traffic from all 50 computers as if it originated from one computer, which is the router itself. When replies are received back, they sort them out first and direct each one to the appropriate employee computer. You can think of the employee computers as individual occupants of a hotel with 50 rooms (or 50 post box numbers in a post office). Incoming mail could be addressed just to the hotel (or post office) but still reaches the individual users. This is easy to do in the network processor itself in fast-path, but it requires initial setup by the GP-CPU each time a computer connected to the network is turned on or off (i.e. some initial packets need to go through the 'control path'), thus adding a little bit of load to the GP-CPU.

- They typically encrypt the traffic that goes out from the branch office to the main office, and decrypt the traffic received from the main office, using secret codes so that anyone tapping the traffic between the main and branch offices cannot steal any information. This cannot easily be done in the sorting hardware engine (i.e. fast-path) itself. It could be done in the GP-CPU. But that would mean sending all the main/branch office traffic to the tiny GP-CPU (i.e. control path), adding enormous load to it and negating the consistent performance advantage we had while using only the fast-path based network processor. One solution we have implemented for this problem is to have another special fast-path chip that does only encryption/decryption, using hardware alone. So, all the traffic requiring encryption/decryption is first sent from the network processor to that additional chip on the same router board and then received back before it is sent out to the main office/LAN respectively. But this adds cost to the gateway box due to the extra chip, and complicates the design.

- They set up and tear down telephone connections as and when phone calls are made to the outside world. This task cannot easily be done on the network processor fast-path alone and so adds load to the GP-CPU.

- They often perform content inspection, which means inspecting the entire packet including the payload to make sure the contents are clean. You can think of this as the mail sorting office in the post office reading the entire body of the letter, or checking the photos and other material enclosed in the envelope, to make sure there is nothing objectionable. While this may not be acceptable in the post office due to privacy issues, in the business office it is perfectly acceptable, since the gateway needs to filter out traffic with viruses, spam email, etc. In addition, it may also have to block YouTube video and other such content that the management may not want employees accessing during office hours. You can imagine how difficult it would be to ask the sorting machine in the post office to inspect the contents of each piece of mail and assess whether it is good or bad to be delivered to the customer..! But a human being there could do it well and could even quickly learn the ever changing rules on what is acceptable/objectionable. Along similar lines, the GP-CPU is flexible enough to handle this task any way we want (block YouTube from 9am to 5pm but allow it after business hours, etc.). However, just like the human being in the post office, the GP-CPU will slow down the processing speed considerably, since this takes a lot of time and effort. The tiny GP-CPU originally included in the network processor just to boot it up certainly will not have the horsepower to do this kind of detailed inspection of all the packets.
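
The address hiding in the first example above is commonly called network address translation (NAT). A minimal sketch of the bookkeeping is below. The `NatGateway` class, the starting port number and the example addresses are assumptions made for illustration, not the behavior of any particular product. Creating a new table entry stands in for the 'control path' setup work, while translating packets of an already known flow stands in for the fast-path.

```python
class NatGateway:
    """Toy address translation table for a branch office gateway."""

    def __init__(self, public_ip, first_port=40000):
        self.public_ip = public_ip
        self._next_port = first_port
        self._outbound = {}  # (private_ip, private_port) -> public_port
        self._inbound = {}   # public_port -> (private_ip, private_port)

    def translate_outbound(self, private_ip, private_port):
        """Rewrite the source of an outgoing packet to the gateway's own address."""
        key = (private_ip, private_port)
        if key not in self._outbound:
            # First packet of a new flow: one-time setup of the mapping.
            self._outbound[key] = self._next_port
            self._inbound[self._next_port] = key
            self._next_port += 1
        return self.public_ip, self._outbound[key]

    def translate_inbound(self, public_port):
        """Find which employee computer a reply belongs to (None if unknown)."""
        return self._inbound.get(public_port)


gateway = NatGateway("203.0.113.5")
first = gateway.translate_outbound("192.168.1.10", 5000)
second = gateway.translate_outbound("192.168.1.11", 5000)
print(first, second)
print(gateway.translate_inbound(second[1]))
```

To the outside world both employee computers appear as 203.0.113.5; only the port number tells the gateway which "hotel room" a reply should be delivered to.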

There are additional cases where flexibility is needed in the gateway or router, and a GP-CPU architecture would provide all that flexibility if only the performance could somehow be made consistent (i.e. the throughput should not fall dramatically when we turn on all these services). Another big advantage of a GP-CPU is the ease with which an application developed for one GP-CPU can be ported to another GP-CPU. If it is a very simple application (say a simple calculator program) designed to work on a Pentium processor that needs to be ported to an AMD or PowerPC processor, it could be as simple as recompiling the software for the other processor and then running it. Much more complicated applications can usually be ported from one GP-CPU to another within days, at least to show that they work, while optimizing them to run faster can take some more time. But moving a GP-CPU application to a fast-path based network processor architecture may take considerably more effort. Going back to our post office example, you can think of porting an application from one GP-CPU to another as replacing one human being with another in the sorting section of the post office and teaching him/her to sort the incoming mail. That is fairly easy compared to replacing a human being with a sorting machine for the first time, which may take a lot of time to set up and verify that it works properly. I have been in discussions for two or three days with a potential customer who was quite excited about the sorting speed of the hardware based processor and wanted to adopt the technology, only to see them reluctantly walk away simply because they were afraid of the migration effort required. We have addressed this issue by providing software packages that make this transition easier. Some other companies' network processors, unlike ours, are so hard to program that they have their own cottage industry..!
Customers using such processors simply hire these contract companies to do the porting work, since doing it themselves would be impossible in any reasonable timeframe..! So the Holy Grail is to bring in the flexibility of the GP-CPU while ensuring the consistent throughput of the hardware based communication processors. This is what ACP does..! :-)

We start out with not one GP-CPU but a 4 core PowerPC processor running at a maximum speed of 1.8GHz sitting inside ACP..! This means you can take any standard GP-CPU application, publicly available software or protocol stack, compile it for the PowerPC GP-CPU and run it on ACP quickly. This will prove that any existing customer application can be easily migrated to this device. Of course, it also brings the old problem back into the picture: adding more services will slow down the throughput. To address that, we have added 12 different hardware acceleration engines inside ACP, each of which does just one thing very well and very fast. Thus, for example, there is one engine that can perform all the encryption/decryption you need purely in hardware. There is another engine that does just hardware based content inspection to weed out malicious, virus or spam laden traffic. Additional engines perform other such tasks without software slowdown.

Once you compile and run any code you want on the PowerPC processor to convince yourself that the solution works, you can then identify the part of the traffic that will require encryption/decryption and route it to the encryption/decryption hardware engine. Now you can remove that part of the code from the GP-CPU completely since this task is outsourced to the hardware block which is inside the same ACP device..! No more traffic slowdown due to encryption/decryption hogging GP-CPU capacity and no big additional cost due to addition of a separate chip to do this on the router board.

Similarly, next we can identify all the WAN (Wide Area Network) traffic coming from the outside network that needs to be inspected for viruses, spam, etc. and route it to the content inspection hardware engine, which is inside the same ACP device. Cleaned up traffic can come back to the GP-CPU for further processing or be transferred to the LAN or WAN as needed, thus releasing the GP-CPU from this onerous task. There are other engines for other tasks, such as checking the integrity of the packets to make sure they were not corrupted in transmission (think of identifying damaged mail in the post office), modifying the packets (think of the post office forwarding a piece of mail to a customer's new address) and so on. By diverting traffic to every engine possible, the 4 core PowerPC GP-CPU complex can reduce its load and overcome the throughput issue. Software packages are being put together to ease the effort required to program the hardware engines. The extent to which these engines are used can be adjusted to retain the required flexibility, since using the engines is not mandatory for the router box to function properly. Another neat thing is the software simulator we have developed, which runs on PC or Unix machines and mimics the entire ACP. So, even before customers have the real device, they can fire up the simulator on any PC and simulate traffic going in and being routed to different blocks, to see whether the application works as per the design intent and how well the device is expected to perform. After sorting out all the issues, the simulator can generate one configuration file that can then be downloaded into the actual chip to configure it exactly as it was set up in the simulator.

After implementing this hybrid architecture, we went one step further with a technology we named Virtual Pipeline (which is a trademark now), using which you can route traffic from one hardware acceleration engine to the next with or without going back to the GP-CPU. Thus, for example, incoming encrypted traffic can first be sent to the decryption engine and then routed directly to the content inspection engine so that it can be checked for viruses, spam, etc. After that, if it is safe, it can be sent back to the GP-CPU or directly to the LAN side ports without having to go to the GP-CPU at all. In fact, incoming traffic need not go to the PowerPC GP-CPU at any point. From the input port it can go directly to the decryption engine, then to the content inspection engine and out to the LAN port. Traffic originating from the LAN side and meant to go out to the main office can enter the device, skip the GP-CPU and content inspection (since the content is expected to be safe), head directly to the encryption engine and leave for the WAN side. Depending upon the various classification decisions made on each packet, every packet or traffic stream can travel through any combination of engines and head out of ACP. The term virtual pipeline may make sense now. This flexibility should be quite powerful and handy in designing the traffic flow through the device.
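
The idea of classifying each packet and then chaining engines can be sketched in software as follows. To be clear, this is not the ACP programming interface: the engine functions, the toy XOR "encryption", the `VIRUS` signature and the `classify` rules are all invented for this illustration, and in the real device these steps run in hardware blocks rather than Python functions.

```python
from typing import Callable, Dict, List, Optional

Packet = dict  # keys: "source" ("wan" or "lan"), "encrypted" (bool), "payload" (bytes)
TOY_KEY = 0x5A  # stand-in for a real cipher key


def xor_bytes(data: bytes) -> bytes:
    """Toy reversible transform standing in for real encryption/decryption."""
    return bytes(b ^ TOY_KEY for b in data)


def decrypt_engine(p: Packet) -> Optional[Packet]:
    return {**p, "payload": xor_bytes(p["payload"]), "encrypted": False}


def encrypt_engine(p: Packet) -> Optional[Packet]:
    return {**p, "payload": xor_bytes(p["payload"]), "encrypted": True}


def inspect_engine(p: Packet) -> Optional[Packet]:
    # Drop the packet (return None) if a known bad signature is found.
    return None if b"VIRUS" in p["payload"] else p


ENGINES: Dict[str, Callable[[Packet], Optional[Packet]]] = {
    "decrypt": decrypt_engine,
    "encrypt": encrypt_engine,
    "inspect": inspect_engine,
}


def classify(p: Packet) -> List[str]:
    """Decide, per packet, which engines it visits and in what order."""
    if p["source"] == "wan":
        return ["decrypt", "inspect"] if p["encrypted"] else ["inspect"]
    return ["encrypt"]  # LAN traffic heading to the main office skips inspection


def run_pipeline(p: Packet) -> Optional[Packet]:
    for name in classify(p):
        p = ENGINES[name](p)
        if p is None:  # an engine dropped the packet
            return None
    return p


incoming = {"source": "wan", "encrypted": True, "payload": xor_bytes(b"hello")}
print(run_pipeline(incoming)["payload"])  # b'hello'
```

Each packet gets its own list of engines, so different traffic streams follow different routes through the same set of blocks without any general purpose processing in between.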

As is standard these days, it has support for USB, flash and various other memory devices, several input/output port configurations, etc. Compared to the 555 ICs I used 25 years back, which had a total of 8 pins, ACP comes with 1295 pins that get soldered onto the circuit board. :-) Though this pin count may sound high, it is becoming typical for this class of devices, called SoC (System on Chip). This is because one SoC includes multiple hardware modules inside one chip, so that an entire system (like a router box) can be built using one chip and very few additional components instead of multiple devices with smaller pin counts.

Depending upon what the customer needs, there are variations of the device within the ACP family that can have just two or all four PowerPC cores activated, support different system clock speeds (200MHz to 400MHz), provide different throughput capacities (5 Gigabits per second to 20 Gbps), etc. Just to give you an idea, 1 Gbps is about 500 to 1000 times faster than the DSL connection speed we get at home. So, in the near future this communication processor is meant for handling huge volumes of mobile phone traffic, delivering broadband (3G & 4G) connectivity to thousands of smartphones. Who knows in how many different ways it could be used in the future..? :-)

Interestingly, in an article I published in 2005, posted at http://www.eetimes.com/showArticle.jhtml?articleID=173600898, I wrote that security related features should be integrated inside the communication processor. Though it was not an epiphany but only a logical thought, I am glad to see that idea being realized via this processor now. :-) I am just a cog in one wheel of a huge machine that developed this device. Still, it is good to see it being announced and covered in the technical press. I know our competitors are not asleep at the wheel and will be bringing in similar or different devices that will provide stiff competition. Nevertheless, I sure hope we sell a ton of these devices all over the world.

Did you really read this far..? Thanks, please email me back so that I can send you the quiz questions next. :-) Just kidding. Do drop me a line so that I know at least someone read through the whole write-up. :-)

-sundar.

1 comment:

  1. Nice explanation! As a matter of fact, your writing is excellent. It's quite interesting
