Netwatcher

May 1998 Volume 16.5


Netwatcher (ISSN 0890-5800) is a monthly publication of CIMI Corporation. Subscription information is available from CIMI Corporation. Copyright © 1998, CIMI Corporation. All rights reserved. No publication or reproduction of this document is permitted without the express written consent of CIMI Corporation.


Management Briefing


The failure of AT&T's frame relay network in April created a lot of news activity, and also a lot of soul-searching on the part of network operations managers and planners. Big data network failures, at least in the commercial area, have been rare occurrences – up to now.

Voice networks are rated by our survey base as being reliable in excess of 99%, meaning that the conditions under which they do not meet their mission occur less than 1% of the time. Data network reliability, again as reported by users, ranges from a low of below 80% (for the Internet) to a high of 98%, a lofty plateau reserved for SNA networks and public frame relay. One would therefore think that the AT&T failure was a one-in-a-million shot. Statistically, that's true. The question is whether there's a fundamental change taking place, and if so, what can be done about it.
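To put those percentages in perspective, here is a minimal sketch that converts a reliability figure into hours per year of not meeting the mission. The function name and the 8,760-hour year are our own assumptions for illustration, and the survey figures are bounds ("in excess of 99%", "below 80%"), so the results are rough limits rather than measurements.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year of 8,760 hours


def downtime_hours_per_year(reliability_percent: float) -> float:
    """Hours per year in which a network fails its mission at a given reliability."""
    return HOURS_PER_YEAR * (100.0 - reliability_percent) / 100.0


# Survey figures quoted above, treated here as exact values for illustration.
for label, pct in [
    ("voice (99%)", 99.0),
    ("SNA / public frame relay (98%)", 98.0),
    ("Internet (80%)", 80.0),
]:
    print(f"{label}: about {downtime_hours_per_year(pct):.0f} hours per year")
```

Even the "lofty plateau" of 98% leaves room for roughly a week of trouble per year, which is why the definition of "failure" matters so much.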

Data networks have a problem in a "reliability" duel with voice. Most users consider a voice network to have failed when calls cannot be completed. A lost call that is reconnected, or a busy signal that persists for a minute or so, is often not remembered as a failure at all. The nature of our use of voice networks limits our expectations; short, random calls are the rule. In data networks, where application relationships over the network often last hours and sometimes are continuous, anything that causes the application to burp is called a "failure".

One logical way to create a failure-free network is to build one out of components that don't break. In other words, hardware reliability is an obvious differentiator for network equipment vendors. That, of course, is something we probably have known all along. In particular, we must ask whether there is any benefit to connectionless networks over connection-oriented frame relay, because many will say the AT&T outage proves there is.

A good place to start is the definition of "failure" in a data network as any condition that creates an application-level-visible loss of communications.

But is a network that is "up" but not serving the application really "reliable"? Not according to most users. For them, network reliability, then, is a measure of the percentage of time that the network fills its mission.

What definition to use? Let's start with the concept that reliability is a measure of the up-time of the network, the percentage of time when it offers service. OK, then what conditions cause a failure, and how can they be minimized?

Application-level relationships are usually called "sessions", and they exist at OSI Level 4 or higher. These relationships are mapped onto the network (at Level 3) in a way that depends on the type of network we're talking about.

Connectionless networks have an unreliable Level 3 service set, one that doesn't provide delivery assurance or specific indication of information loss. IP is connectionless, and so the TCP Level 4 function set includes the capability of detecting lost data and discovering the loss of a path between partners.

Connection-oriented networks have some ability to discover the loss of connections or data. Some, like frame relay, provide no specific error recovery but signal a connection loss. Some, like X.25, provide error detection and correction in addition to virtual circuit status monitoring. We note here that the common denominator is the ability to sense connection status.

Connection status is important in considering data network reliability, because a report of a connection failure is likely to be interpreted by the higher protocol layers as a "my partner is not reachable" condition. That, of course, is one of those burps we talked about.

So where are we? In any network, a failure becomes significant if it's reported to the user. To avoid reporting a failure to the user, we must either not fail, or recover from a failure in the network without creating a condition that the user would interpret as a loss of reachability.

Let's first dispose of the hardware reliability issues. It's true that individual vendors may build gear that is more or less reliable than the average, but it is not true that the equipment used for one protocol is necessarily any more or less reliable than that used for another. Thus, in the first sense of failure avoidance – the ability of the network to maintain failure-free operation throughout – all networks are potentially equal.

It's the second sense of failure avoidance – recovery – that separates the network sheep from the network goats, so to speak. Thus, it's that area that we must consider in more detail.

The ability of a network to recover from a failure is dependent on two basic factors:

The protocols and algorithms used for topology management and routing are pretty much standard in both the connectionless and connection-oriented world. While routing takes place on a per-packet basis in the former case, and on a per-session basis in the latter, there is no reason to believe that the sophistication of path recovery would be intrinsically better in either case. In other words, if we can draw on essentially the same tools to discover routes through a network regardless of protocol or architecture, we'll find alternate paths about as easily in one network as in another, providing that both networks have sufficient route diversity to create such paths in the first place.

It is true that the standards for frame/cell networks don't require alternate routing at all, which may lead some to believe that it isn't provided in such networks. In fact, most vendors provide a means of alternate routing in virtual circuit networks, so the real world conforms reasonably well to our idealistic model.

The differences in data network reliability, then, must come either from individual differences in hardware reliability, or from differences in the way a particular architecture or protocol can handle the process of insulating a service user from the fact that a network session has been reconnected.

TCP/IP, the standard-bearer for connectionlessness in today's world, was designed as a reliable service over an unreliable subnetwork. This basic assumption meant that TCP had to expect to routinely lose packets, face delay variations, and suffer rerouting of traffic around failures. Since such conditions were considered the norm, the TCP protocol can tolerate network failures that result in an "unreachable partner" for seconds at a time, and some TCP applications will tolerate even longer failures.

Frame relay, the most pervasive of the connection-oriented public data network protocols, has a tougher time. The strict interpretation of the frame relay standards would call for the generation of a "PVC-down" local management message if any node or trunk that carried the PVC sustained any hard error, or created any condition that might alter the sequence of the data packets. This is because frame relay is assumed to be "unreliable" in that it doesn't provide error detection and correction, but "reliable" in the sense that it provides path status notification and does not expose the application to a risk of re-sequencing data.

To illustrate the difference, assume a four-node network, fully interconnected, with the nodes arranged as the corners of a square. The sides and diagonals of the square represent the trunks in the network. Now assume that there are a pair of users, one connected to the node in the upper left and another connected to the node in the lower right, communicating through this network. The direct path is the diagonal from upper left to lower right, so let's assume that's where their session data is going. Let's further assume that during their relationship that path breaks.

No matter what the architecture, we can assume that the network can figure out the fact that there are two equally valid and optimal alternate routes that will still connect the partners, each following a path through one of the other corners of our square (upper-left-to-upper-right-to-lower-right and upper-left-to-lower-left-to-lower-right). Thus, we have an available reconnect path. What happens?
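The four-node example can be sketched in a few lines. This is only an illustration under our own assumptions: the node labels (UL, UR, LL, LR), the helper names, and the use of a hop-count breadth-first search are ours, not a description of any real routing protocol or vendor algorithm.

```python
from collections import deque
from itertools import combinations

NODES = ["UL", "UR", "LL", "LR"]  # corners of the square (labels are assumptions)


def full_mesh(nodes):
    """Every pair of nodes joined by a trunk: the four sides plus two diagonals."""
    return {frozenset(pair) for pair in combinations(nodes, 2)}


def shortest_paths(trunks, src, dst):
    """Return all minimum-hop paths from src to dst over the working trunks."""
    neighbors = {}
    for trunk in trunks:
        a, b = tuple(trunk)
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    best, found = None, []
    queue = deque([[src]])  # breadth-first, so paths come out shortest first
    while queue:
        path = queue.popleft()
        if best is not None and len(path) > best:
            break  # everything left in the queue is longer than the best found
        if path[-1] == dst:
            best = len(path)
            found.append(path)
            continue
        for nxt in sorted(neighbors.get(path[-1], ())):
            if nxt not in path:
                queue.append(path + [nxt])
    return found


trunks = full_mesh(NODES)
print(shortest_paths(trunks, "UL", "LR"))          # the direct diagonal
trunks_after_failure = trunks - {frozenset(("UL", "LR"))}
print(shortest_paths(trunks_after_failure, "UL", "LR"))  # the two two-hop detours
```

With the diagonal removed, the search returns the two equal-cost detours through the other corners, which is the "available reconnect path" described above. What differs between architectures is not finding those paths but what the network tells the end systems while it switches to one.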

In an IP network, the network re-converges on a new topology that leaves out the failed diagonal route. Packets start to flow along the new path. If a few get to the destination ahead of the last packets to successfully transit the diagonal route that failed, it doesn't matter because TCP re-sequences them. It's therefore likely that no failure will be noticed by the application.
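The re-sequencing idea can be illustrated with a toy reorder buffer. This is a deliberately simplified sketch: real TCP numbers bytes rather than packets and also handles acknowledgment and retransmission, none of which is modeled here. The function name and the packet numbering starting at 0 are our assumptions.

```python
def resequence(arrivals):
    """Deliver (sequence_number, payload) pairs to the application in order.

    Packets that arrive ahead of a gap are held until the gap is filled,
    so the application never sees the reordering caused by a reroute.
    """
    held = {}
    expected = 0  # assumption: numbering starts at 0 and counts packets
    delivered = []
    for seq, payload in arrivals:
        if seq < expected:
            continue  # duplicate of something already delivered
        held[seq] = payload
        while expected in held:
            delivered.append(held.pop(expected))
            expected += 1
    return delivered


# Packets 3 and 4 took the new path and overtook packet 2 on the failing one.
print(resequence([(0, "a"), (1, "b"), (3, "d"), (4, "e"), (2, "c")]))
```

The application receives the payloads in their original order, which is why the reroute is likely to go unnoticed.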

In a frame relay network, the problem is that we can't be sure that we can reconnect the path because we can't be sure that information won't be provided to the partners out of sequence. Frame relay, remember, can drop frames but can't change the order of arrival of those that do arrive. Thus, our frame relay network would normally report a PVC-down condition for at least a transient interval, to signal the applications that something may have gone awry, throw away the data in flight, and start rerouting the PVC. When that's done, a PVC-up message would be generated.

Is this a hard failure? It depends on what the applications will do when the PVC status change messages are generated. If the session-level protocol reports an error to the application, then we've failed. If not, the application won't be impacted here either.

Another way that the application won't be impacted is if the frame relay network doesn't report a failure automatically, but instead reconnects and resynchronizes the data flow – makes sure that all the in-flight data is forwarded to the destination in order. If that's done, no application impact is visible.

To recapitulate, TCP probably won't create a visible failure in this situation. Frame relay might, depending on whether the end-systems' software reports a failure to the application when it gets a PVC-down message, and depending on whether the equipment used in the network gives such a message in the first place when a re-route occurs.

What this means is that TCP/IP networks are not intrinsically more reliable than frame relay or other virtual circuit networks, but the practices of nodal equipment vendors and end-system software vendors may make connection-oriented networks report failures in conditions that a TCP/IP network would not.

Now, what about the other kind of "reliability" we talked about earlier? The difference between "up" and "useful" sums up the issue here. A frame relay network that reports so much congestion it throttles users back to a tenth of their CIR, or that discards a third of all the data presented, isn't useful. Likewise an IP intranet that introduces major delays in application dialogs, impacting worker performance, isn't useful. The problem is that there are few objective measurements of how often these problems occur or what could be done about them.

Surveys are a good way to establish the rules in situations like this, and so we'll cite our own research of last year on the matter. Remember that the term "reliable" here was taken to mean "useful in the application mission":

Buyers indicate that IP or connectionless networks are more able to perform alternate routing, but also indicate that cost constraints limit their ability to exploit this capability. Hence, the higher reliability of private IP isn't realized.

In the public network sense, buyers report that the public services fail hard in a small number of cases, but fail "soft" by degrading seriously in a much larger number of cases. It is this soft failure that creates the largest number of application problems, and thus the major reason for perceiving such services as "unreliable."

There are a number of lessons to be learned here, folks.

First, don't expect connectionless networks to be a panacea for data network failures. The vulnerability to hard failure is really the same for connectionless networks, but the way the failure is handled by the software inside and outside the network may be different. You may or may not see an objective difference at the application level. In addition, connectionless networks are more prone to performance-degradation outages.

Second, know how your software reacts to a virtual circuit down message if you are using a connection-oriented protocol like frame relay. It is difficult to get good data here, but it appears as though about half of all frame relay failures could be accommodated by simply waiting three to five seconds before declaring a failure; the PVC would come back up.
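The "wait before declaring a failure" idea amounts to a hold-down timer on PVC status messages. The sketch below is hypothetical: the event format, the function name, and the default five-second hold are our assumptions for illustration, not the behavior of any particular frame relay access device or end-system software.

```python
def declared_failures(events, hold_seconds=5.0):
    """Return the times at which a failure is declared to the application.

    events: time-ordered (timestamp_seconds, status) pairs, where status is
    "down" or "up". A PVC-down is declared only if no PVC-up follows it
    within hold_seconds.
    """
    failures = []
    down_since = None
    for timestamp, status in events:
        if status == "down" and down_since is None:
            down_since = timestamp
        elif status == "up" and down_since is not None:
            if timestamp - down_since > hold_seconds:
                failures.append(down_since + hold_seconds)
            down_since = None
    if down_since is not None:
        # Assumption: a PVC still down when the trace ends counts as a failure.
        failures.append(down_since + hold_seconds)
    return failures


# A two-second reroute blip, then a thirty-second outage.
trace = [(0.0, "down"), (2.0, "up"), (100.0, "down"), (130.0, "up")]
print(declared_failures(trace))                    # only the long outage
print(declared_failures(trace, hold_seconds=1.0))  # both events are declared
```

With a five-second hold the short reroute is absorbed and only the long outage reaches the application. The right hold value depends on how quickly your vendor's network reroutes a PVC, so it should be chosen with that rerouting time in mind.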

Third, know how your connection-oriented vendor provides for rerouting of traffic within the network and how that strategy will impact your own application error recovery. The speed of rerouting compared to the timeout delays before a connection is declared to have failed is the critical issue; be sure the software gives the network time to respond.

Finally, remember that "failure" is in the eye of the beholder. We've seen network gurus and end user managers arguing whether a network was "up" or "down", with each having his own definition of the state. Networks are user service tools, and the only objective measure of their performance is their support of their application mission. This definition of reliability is broader than one based on simple hard failure, but applying it makes sense out of the fact that SNA networks (always the networks with the highest user availability expectations) paid more heed to performance stability than to alternate routing in most configurations. Why? Because unstable performance was the biggest problem in the eyes of the users.

Network reliability isn't automatic in any protocol or service architecture. It's up to network managers to plan to make reliability as high as possible. The alternative is to accept whatever happens, however bad it may be.


In the Know


We continue our MPLS tutorial here with a discussion of the role of MPLS in creating VPNs. This is critical, because VPNs represent the largest revenue opportunity in the public IP space. If MPLS can be made to provide very effective VPN support, then it can draw on this large revenue stream to fund development and deployment of MPLS gear, and it will be an enormous market success. If not, it will not present any special capabilities in a financial sense, and its success will depend on service provider network traffic management policies. Sorry, but this feature is for subscribers only!


Strategies


It's probably no surprise to most of our readers that there are rumbles of discontent in the financial community regarding the performance of many networking vendors. The technology sector has been the traditional high flyer of the stock market, with enormous price-earnings multiples to fuel big capital gains. When earnings growth doesn't seem to justify those multiples, the Street gets antsy.

We don't do stock analysis in Netwatcher, except peripherally in our Annual Technology Forecast issue in December, but we do analysis of the vendors and their performance. This month, we want to take a look at the megatrend of the market, and the vendor strategies for addressing it.

What's Happening?

In a word, commoditization. The computing industry went through it in the 1980s, as PCs made processing power available directly to line organizations. The industry changed forever as a result. We're now commoditizing networking, for both deliberate and reactive reasons, and that will change our industry as well.

The deliberate reason is that the growth of networking depends on its simplification in a technical sense, and its more direct application to business needs. PCs were successful because they let line organizations apply computing directly, and in doing so opened a new market. So it would be with networking.

The reactive reason is pricing. Vendors, striving to compete in a market that is increasingly unable to come to terms with feature-based differentiation, have cut the cost of switched LAN products sharply. This encourages dispersal of purchasing authority by bringing the product pricing within the budget authority of line department buyers.

Commodity networking does a couple of things to vendors:

In short, the heady days of networking based on a few large sales of expensive, high-margin, products are over. Now the question is what vendors will do about it.

The impact of this will be to shift the LAN market sharply toward low-cost, feature-minimalist, switching products. The WAN market will shift over time to architected intelligent carrier services to displace low-level services that require customer integration of facilities and CPE.

For Every Action – Reaction

The logical thing to do in a changing market is to quickly move to address the market's ultimate direction, while bolstering profits in the areas where high margins can still be sustained. In terms of our current marketplace, that advice translates to taking a position in the low-end workgroup switch and WAN access space with products that can be made and sold cheaply, while focusing in the near term on moving larger WAN products to the service providers, whose needs are more specialized and will sustain higher margins.

That having been said, we can review the vendors in the marketplace and rate their responses to the market change.

Cisco obviously has the catbird seat in the current market. Routers are WAN products, and Cisco has them. The company's profits are still largely dependent on these higher-end boxes, but Cisco has also used the last couple of years to promote its LAN products. So far, it hasn't let its own switch port pricing dive to the levels of competitors, but it's clear that it will do so when competitive pressure dictates. In the meantime, it's getting a per-LAN-switch margin about a fifth better than most competitors.

Cisco has also done a good job in the WAN access space, linking the sale of its access products to the carrier channel to ensure that the margins can be sustained. It is also working to improve distribution in the LAN product area, and if it succeeds it would have indirect sales relationships to cover both LAN and WAN in the next decade.

Where Cisco is vulnerable is in the WAN infrastructure sense. Because its ability to sustain high margins is dependent on account control, it is reluctant to let a paradigm shift from routing to switching take place; such a shift would devalue the thing Cisco is most known for. Carriers buy about 40% of Cisco routers, but most of them are resold and not installed in infrastructure. Only AT&T and US West, in the domestic account space, have committed to Cisco switches for WAN services.

Lucent is obviously the vendor squaring off against Cisco, and the company could prove formidable. Lucent has, via acquisition, broadened its product line to incorporate high-end LAN products. It has its own ATM switches, and has recently acquired Yurie to buttress the low end and access side of the ATM space. It also resells Bay products.

The strength of Lucent lies in its relationship with the RBOCs, most of whom are relatively un-invested in data services and even more so in IP-based services. Since it is impossible to imagine a massive business commitment to tactical data services that wouldn't mirror, in traffic distribution terms, voice service distribution, we can assume that much of the tactical data opportunity will be intra-region. RBOCs should be the big players, and Lucent could be a big winner if that's true.

Lucent's vulnerability is in the marketing angles. The company is still focused on selling to techno-nerds rather than marketing to the masses. That would put it at a major disadvantage when it comes to selling LAN products, even though Lucent does store-front resale of telephony products and could presumably tap similar channels for LAN gear. LANs, for the near future, won't sell themselves the way phones do.

The new Lucent IP product, the PacketStar IP switch that Lucent previewed in a recent customer event as the "RS4000", is the weathervane of Lucent's fortune, in our view. If this product is effectively merchandized, it could launch Lucent into the IP services space and exploit its position with the RBOCs. If it isn't, it will create a very public failure in a very key market area that can't help but discredit Lucent at least for a time.

Nortel has perhaps the strongest private ATM position in the marketplace, and an enviable position with the carrier community as well. The company has taken an early and effective position in the "data-over-ATM" space, providing access devices to link Nortel ATM products to virtually every major data architecture, including IP.

Nortel's acquisition of Micom provided it with a fairly good distribution channel and a potentially strong access product set, though the company has not leveraged this opportunity as well as it might have hoped. Despite acquisitions, Nortel is lacking in the LAN space in terms of having a recognized brand and a broad product set.

If ATM drives carrier provisioning activities, Nortel stands to gain substantial market share. In particular, it would be credible to those carriers who want to deploy business-quality services based on a common architecture. If IP drives the market, however, Nortel has no clear position. It seems to have a variety of IP relationships with other vendors, but no really effective IP strategic position.

The opposite position is held by Ascend, whose ATM switches have the best IP-integration story in the marketplace. Ascend has a good position with RBOC data networks, though its control of these accounts has slipped since it acquired Cascade (which had those accounts, and the switches, to start with). Recent wins at Williams and AT&T make it clear that Ascend is now trying to get its strategic act together in the ATM space.

The recent Ascend announcement of integrated ATM and optical networking, based in part on a Ciena relationship, may be seen by some as a reaction to Cisco's optical networking announcement, but it's really a statement by Ascend that they want to be a player in the major infrastructure league – along with Cisco, Lucent, and Nortel.

Ascend, however, is totally lacking in the premises/LAN space, and also lacks distribution for low-end products. The question is whether this will cripple its end-user position, and whether that will contaminate its credibility with carriers – who want vendors who are recognized by users.

That same problem may plague Newbridge, but the firm has other problems as well. Until late 1997, Newbridge seemed to be in a fight with Ascend and even Cisco as the ATM and frame relay infrastructure player of choice for the RBOCs. In the fall, they inexplicably fell into some kind of marketing funk, and they have lost major strategic ground ever since.

Newbridge almost owned the TDM/T1 market, but the performance of that market has fallen off faster than other technologies could make up. The company's attempts to enter the premises LAN market (first with VIVID, an ATM LAN, and then with the UB Networks acquisition) have failed. Newbridge still has distributors and still has a name in some user sectors, but the firm is in strategic difficulties and has perhaps a year to get straight.

Another old-time firm, Bay Networks, may be looking up. Bay has a broad product line in both the LAN and WAN space – broad enough to contend with Cisco on a box-for-box basis except in the ATM switch space. The new management team has straightened out its distribution problems, improved profit margins, and developed a pretty good overall strategy for both carrier and enterprise networking. There is still a lack of marketing savvy, not so much in PR or marketing communications as in higher-level strategic marketing.

Bay's biggest problem may be the impression that somebody will be buying it within six months. Nortel and Lucent have both been mentioned as suitors, and it's very possible that both are in fact interested. Recent securities analysts' reports have suggested that if the stock market were to discount the issue of amortizing goodwill, the acquisition might be non-diluting for either Nortel or Lucent.

Bay's technical challenge lies in its carrier position. While Bay's high-end router has some following, it doesn't have the big ATM switch products of the other major WAN competitors, and thus may not play in an integrated service provisioning market – the kind facility-based carriers seeking IP service revenues would likely create.

Cabletron is another enterprise stalwart, and another one with little carrier market position. The company has depended on account control to sustain high margins, and is thus doubly hit by the commoditization trend. Cabletron acquired Yago, a new-age router vendor, but this hasn't made it especially effective in the WAN space, and price pressure is eroding its margins in the LAN space.

In an age where marketing is king, Cabletron has been somewhat inept. Its recent CEO change was clearly a PR disaster, and nothing strategically significant has come out of the new regime. This is another company that has to get itself straight in about a year, or face the consequences.

A LAN vendor with a bit brighter future is 3Com, whose acquisition of US Robotics seemed to position it to challenge Ascend and Cisco. It now seems that competitive pressure at the low end of the LAN line is going to make things very tough for 3Com for at least three more quarters. The company has no high-margin WAN play to exploit, so despite the fact that it has perhaps the preeminent position in the distribution side of networking, it may suffer near-term financial disadvantages.

3Com may have hoped its alliance with Newbridge would save its WAN position, but the latter company's fall from PR grace has discredited that option.

Another vendor worthy of consideration is Fore Systems, whose new "networks of steel" advertising campaign and new CEO would raise at least a hope the company is getting a bit more market-realistic. But if Fore realizes that ATM isn't the be-all, end-all of networking, they must also realize that there are only two options left: IP-and-ATM marriage at the service layer, or ATM as a multiplexing option in the access layer and core. The latter mission is clearly an infrastructure play, one that a Lucent or Nortel could be expected to win. That leaves Fore the mission of service ATM.

Whether Fore sees it this way or not, it seems to be doing little or nothing to make an IP/ATM marriage more realistic. The company has promoted no new ideas here, which puts them in the MPOA or RFC 1577 camp by default. To quote an old adage, "those dogs won't hunt" and the market has already decided that. Fore needs to change dogs, so to speak.

For the rest of the market, the course of action is clear. In a commoditizing market, it's hard to survive unless you are either a top-tier player (and probably nobody not named here already can hope to be one) or have a niche of compelling value. Even that kind of niche probably cannot be protected for more than a couple of years, so an up-and-coming firm would need to quickly leverage any success to move into another niche, and "hop from stone to stone" until total company size and sales let it go after the major players. Or … be acquired.

We're entering a new age of networking – vendors and users alike. The top computing companies of today still include some of the top players of the 1980s, but there are a host of companies who didn't even exist then who are big guys today. Who would have predicted Compaq's acquisitions, for example?

Strange things will happen in networking, too. Hang on to your hats.


Down the Line


In the next issue, we'll complete our MPLS tutorial with an examination of MPLS and ATM. We're also going to take a look at service provisioning, DWDM, and carrier ATM switches over the rest of this year.

