Netwatcher

November 1998 Volume 16.11


Netwatcher (ISSN 0890-5800) is a monthly publication of CIMI Corporation. Subscription information is available here. Copyright © 1998, CIMI Corporation. All rights reserved. No publication or reproduction of this document is permitted without the express written consent of CIMI Corporation.


Management Briefing

In this issue, In the Know will be dealing with the potential impact of QoS features in WANs on the way we do premises networking. That’s an important concern for most buyers, or at least it should be.

Another issue related to QoS is the "free bandwidth" question. It’s popular in the media and on the trade show circuit to say that "bandwidth is free", or at least becoming so. Some events in the present market, such as Sprint’s unlimited fixed-price weekend calling, are cited as proof of the free bandwidth assertion.

Managers need to understand the reason behind this view, the chances of its being true, and the consequences of that truth on overall networking. Providing some insight here is our topic for this month.

The foundation of the free bandwidth philosophy, if we can call it that, rests with the radical improvements in fiber-based transport. A glass fiber used to be limited to T3 rates (it was so as recently as the early 1990s, at least in practical deployment), but now OC-192 and DWDM have radically increased the capacity of a fiber strand. While DWDM and high-rate SONET gear are more expensive than lower-rate optical transport equipment, the ratio of capacity improvement is much better than the ratio of cost increase. The result is a plummeting unit cost of transport.

That unit transport costs are falling would seem to validate the whole "free bandwidth" paradigm, but that’s the deadly trap that many fall into. Reduction in transport bandwidth costs only impacts those applications where transport bandwidth cost is a large part of the total service cost. Those applications, it turns out, are less numerous than we think.

Network costs can be broken down into four main categories:

    1. The cost of the capital equipment used to build the network. This capital cost is depreciated over time, and the current year charge is the "cost" applied. This would include the costs of installing a transport network—equipment, real estate or right-of-way, fiber installation, etc.
    2. The cost of support and management for the technical infrastructure. This is the cost of the "craft personnel" who service the network, install things, fix things, etc. It is also the cost of the management centers that monitor network conditions.
    3. The administrative cost of the business, which includes billing, order processing, advertising, facilities, management, marketing, and all the rest of the usual business overhead items.
    4. The profit the business expects to earn, and return to its shareholders.

The only one of these four areas that would clearly be impacted by improvements in fiber transport technology is the first. Even there, it’s clear that there is a lot more involved than simply lighting glass more efficiently. A telephone system, for example, involves a complex series of switches linked by transport connections. The cost of the switches isn’t changed by the improved SONET or DWDM technology.

Even on the transport connection, the improvements in overall cost of bandwidth aren’t a sure thing. Suppose we can do OC-3 over a given fiber for a cost of 10 cents per megabit per second, and OC-192 at a cost of only a tenth of a cent per megabit per second. We’ve improved our cost structure by a hundred times, right? Not unless we can actually use the OC-192. All of the improvements in transport cost rely primarily on economy of scale. That’s valuable only when the larger-scale bandwidth is consumed, which tends to mean only within the dense part of the network core. The average edge office switch today doesn’t even justify an OC-3.
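
The arithmetic behind that caution can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it reads the OC-192 figure as a tenth of a cent per megabit per second (which is what the hundredfold claim implies), uses the nominal SONET line rates, and assumes capacity must be bought in whole links. The function names are hypothetical.

```python
# Nominal SONET line rates, in kbps so that the link count is exact integer math.
OC3_KBPS = 155_520
OC192_KBPS = 9_953_280


def cost_per_carried_mbps(load_kbps, link_kbps, cents_per_mbps_of_capacity):
    """Cents per Mbps actually carried, when capacity is bought in whole links."""
    if load_kbps <= 0:
        raise ValueError("load must be positive")
    links = -(-load_kbps // link_kbps)  # ceiling division on integers
    total_cents = links * (link_kbps / 1000) * cents_per_mbps_of_capacity
    return total_cents / (load_kbps / 1000)


def realized_improvement(load_kbps):
    """How many times cheaper OC-192 is than OC-3 for a given carried load.

    Unit costs are the article's illustrative figures: 10 cents per Mbps of
    OC-3 capacity, 0.1 cent per Mbps of OC-192 capacity (assumed reading).
    """
    oc3 = cost_per_carried_mbps(load_kbps, OC3_KBPS, 10.0)
    oc192 = cost_per_carried_mbps(load_kbps, OC192_KBPS, 0.1)
    return oc3 / oc192
```

With a full OC-192 of traffic, realized_improvement(9_953_280) comes out at about 100. With only 100 Mbps to carry, realized_improvement(100_000) is closer to 1.6, because most of the big pipe is paid for and left idle.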

What this means is that the long-distance carriers are the ones who are most subject to changes in unit cost of transport bandwidth. They have a larger portion of their total cost represented by transport, and their networks are dense enough to justify higher-bandwidth technology that offers lower unit pricing. But even for them, the truth is that total cost is probably not impacted by more than 10% to 20%.

Why, then, does Sprint seem to be saying that bandwidth is really free? The answer is that they aren’t. What they’re saying is that weekend calling levels are far lower because businesses are usually closed. Their networks, which must be sized to support calling during the busy weekday period, are largely idle on weekends. There isn’t any reduction in any of our four cost factors during that period, just a reduction in usage. If Sprint, or anyone else, can create a pricing plan that earns more revenue per user by selling weekend time a different way, the "more revenue" earned isn’t accompanied by any increase in cost because the network’s there already. More revenue therefore means more profit.

OK, we’ve (hopefully) proved something here. Bandwidth may be free, but services aren’t.

We’ve made the point in the past that service providers must transition to a new model of revenue generation, one based on sale of service and not sale of bandwidth. The reason is that, in attempting to sell more bandwidth, providers have passed along economies of scale to the buyer. T3 costs a lot less than 28 T1s, and a fraction of what 672 DS0 voice-grade lines would cost. That was supposed to make it more attractive to business buyers, but in fact all that’s happened is that other service providers (like ISPs) have bought the bandwidth and turned it into revenue-generating services.

This is why it’s "hard to get T3s", as is often reported. It’s not that we lack capacity to provide them (after all, it’s inconsistent to say we’re exploding with capacity, but can’t generate 45 Mbps of it to sell); it’s that the facility-based carriers don’t see any benefit in subsidizing competitors. It’s also why the cost of high-bandwidth trunks isn’t falling any more; it’s rising, in many cases.

This is also why it’s not possible to translate the concept of free bandwidth into what seems the logical next step, which is superfast Internet or some other service. Providers can’t protect their high-capacity-low-cost transport from arbitrage by other service providers in such a market. The average user can’t consume raw bandwidth—they need a service envelope to surround it—so user sales are not optimum. Creating that envelope is really the most costly thing about networking, because it’s the most human-intensive task.

The immediate consequence of the trends we’re seeing here is the gradual shift toward "architected" services rather than sale of bandwidth. You can turn a T3 line into anything, but a VPN with specific QoS parameters can’t be arbitraged into something other than the parameters that define it. A service provider can sell a buyer a piece of low-cost bandwidth wrapped in a service envelope like "VPN" and pass along the lower unit cost of capacity, because the service features added make the bandwidth useful only in the narrow application context for which it is intended. This lets the service provider sell the service based on the lower bandwidth cost.

Pricing on "bandwidth" won’t improve over time, but special services (QoS options on frame relay, ATM, and IP services) will provide special deals whose pricing won’t depend strictly on the peak or average bandwidth. The price advantages these services will offer network users will be the drivers in the VPN revolution, and will also be the thing that enables new applications like collaborative computing.

The changes to the way services are sold, and the reduction in unit cost of long-haul bandwidth, could have an impact on service provider strategies for equipment features that could roll over into the user space. The latter will be covered in the In the Know feature this month.

Service provider equipment designed for packet transport over low-cost bandwidth might elect to employ forward error correction facilities to recover from bit errors, because the cost of the extra bits would be small compared to the QoS impact of a dropped packet. This is particularly true if SLAs guarantee loss rate, a practice that isn’t prevalent today but which will become so in the future.

Service providers might even consider the more extreme concept of "flooding" or duplicating packets on a limited number of paths to ensure delivery. Today, with bandwidth-based pricing, this would not be feasible. In the future, it probably will be. Again, the motivation would be to ensure QoS.

Ensuring QoS, of course, can also be done through simple oversupply. As service-mode pricing and special service interfaces replace raw bandwidth sales, service providers are freer to rely on oversupply. That is probably the intention of carriers like Qwest; no special provisions for QoS management are required if the network can be equipped with essentially unlimited bandwidth.

Note that this unlimited bandwidth doesn’t percolate through to provide you the consumer with low-cost services. This is another place where classical wisdom has it wrong. The purpose of business is to make money, so the deployment of unlimited capacity makes sense only if capacity isn’t what you’re selling. Otherwise, the supply so far outstrips the demand that prices and profits fall.

The special service relationships created by the buyer, which will surely be priced on a QoS basis, will impose the pricing rules of limited capacity on us all, whether we have limited capacity or not. There is simply no business alternative, because too much of the cost of being a service provider has nothing to do with the cost of long-haul bandwidth.

Well, we have about covered the promised issue set. We can summarize our points as follows:

    1. Fiber technology improvements will result in a sharp reduction in unit bandwidth costs, but the impact will be felt only where sufficient traffic can be aggregated to use the higher-scale facilities efficiently. For the near term, at least, this means the deep core of long-haul networks.
    2. The pressure on bandwidth cost, and other collateral business changes of the decade, are forcing carriers to adopt a service and pricing model that doesn’t expose raw capacity to the buyer, because of adverse arbitrage by competitors.
    3. The new service paradigm will let sellers architect services that consume more bandwidth and rely on low unit bandwidth costs, by protecting the bandwidth from that arbitrage. These new services (like VPNs) will be much cheaper as a result, and will displace leased-line networking.
    4. Equipment and network planning changes will accompany the explosion in bandwidth availability, but the impact of these on the buyer will be limited by the services layer that is overlaid to permit consumption by applications.

It sounds like we’re saying that nothing much will change, but that’s not really true. A shift to VPN buying would be a major change for everyone. What probably will not change is the basic nature of the service provider relationship with the buyer of services. That’s something we all must learn, but that Internet aficionados must take particular care to understand.

Networking is a business, and that’s the most important conclusion we can draw from any discussion.


In the Know

All the discussion of QoS in the WAN raises the question of whether the same issues, or related issues, will develop for the LAN or campus network. If so, what might the effects be on equipment, software and applications?

We’ve said in the past that the general rule for the future is to take advantage of the low capital cost of LAN switching to oversupply networks with bandwidth at the local level. By this means, we reduce or eliminate the congestion that creates the QoS problems we’re trying to solve.

But there are some limits to this strategy, as you might expect. One obvious one is the possibility that WAN QoS objectives would be met more easily with some cooperation from the premises devices. Certainly we’d expect the gadget at the DMARK to participate in any WAN QoS strategy, but it might be desirable to make QoS policy awareness extend further inward.

Another issue is whether there are any assumptions that have to be enforced to ensure that "infinite capacity" on the premises really stays infinite, meaning that high-bandwidth LANs don’t somehow become congested anyway. If the LAN can’t be made congestion-less, then we can’t assume that it won’t have to be equipped with its own QoS management system.

The WAN QoS Linkage

There are two reasons why QoS management has to be applied at the WAN level: congestion and service policy.

Congestion is the easiest to understand. The point where resources become scarce is the point that gets congested in any network. Usually, the WAN has lower-capacity connections than the LAN and so it’s the WAN that’s likely to become congested. When it does, applications compete for resources in a way that might not optimize the performance of those that are most QoS-dependent, or most business-critical.

Service policy QoS is more complex. Here, the issue is not whether the network is congested per se, but whether the service parameters that the buyer has purchased (usually expressed as the boundaries of the traffic offered load curve, meaning peak and average throughput and a means of deciding how the averaging is done) are met. "You bought this service, mister buyer, and now you must police your traffic to its limits or I the provider am not bound by my SLA."

Since any explicit SLA will have to be issued under specific flow assumptions (at least for now; we’ll come back to that), the CPE at the DMARK will have to shape traffic to meet the buyer’s commitment in that area. The shaping algorithm must match the policing done by the service provider, or the network will be driven out of the service contract range. If that happens, the service provider can discard traffic, break the SLA, or whatever the agreement provides.

Shaping traffic in a simple sense is a buffering function. The traffic goes into a buffer at whatever rate it’s arriving at the service DMARK. It is metered out of the buffer at the rate the flow envelope permits, which in turn is the rate specified in the service agreement. The buffer level rises and falls according to the relative rate of these two activities.

Here’s where premises involvement comes in. If the shaping buffer overfills, it will overflow and packets will be lost. Not only is this a bad thing from a performance perspective, it can cause problems between the service provider and buyer, because the loss might be mistaken for a dropping of packets in the network—something the seller is responsible for under the SLA. To prevent overflow of the shaping buffer, we need to signal earlier devices along the path to buffer, and/or flow-control the application itself.
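
The shaping buffer described above, including its overflow behavior, can be sketched as a short simulation. This is a minimal illustration and not any vendor's algorithm; the tick-based timing, the byte units, and the names shape, rate_per_tick, and buffer_limit are all assumptions made for the example.

```python
def shape(arrivals, rate_per_tick, buffer_limit):
    """Simulate a shaping buffer one tick at a time.

    arrivals:      bytes arriving at the DMARK in each tick
    rate_per_tick: bytes the service agreement lets out per tick
    buffer_limit:  bytes the buffer can hold (hypothetical figure)

    Returns (sent_per_tick, level_per_tick, dropped_bytes).
    """
    level = 0
    dropped = 0
    sent = []
    levels = []
    for arriving in arrivals:
        # Accept what fits; anything beyond the buffer limit overflows and is lost.
        accepted = min(arriving, buffer_limit - level)
        dropped += arriving - accepted
        level += accepted
        # Meter traffic out at no more than the contracted rate.
        out = min(level, rate_per_tick)
        level -= out
        sent.append(out)
        levels.append(level)
    return sent, levels, dropped
```

A 300-byte burst against a 100-byte-per-tick contract is smoothed over three ticks if the buffer is large enough. Shrink the buffer to 250 bytes and 50 bytes of the burst are lost at the DMARK, never having reached the provider's network.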

Distributing the buffering responsibility seems a logical thing, but there are barriers to the concept. One is the OSI model, and the other the multi-vendor problem.

Under the OSI model, flow control is a Level 4 function. Level 4, in turn, is a layer that resides only in end-systems, not in network nodes. Since asking an upstream device to buffer is invoking flow control on that device, the concept of distributing the buffering violates the OSI precepts.

The multi-vendor problem is more complex. Few users have single-vendor premises networks. Few LAN switches are designed to buffer traffic. Signaling to an upstream partner to do something it’s not required to do by OSI and not equipped to do in hardware terms is really appealing if the partner isn’t made by the same company as the DMARK device. In fact, why not signal that guy to do all the shaping, raising his cost and cutting yours? This is finger-pointing heaven, in other words.

Unfortunately, flow control to the application isn’t much more appealing, and for the same reasons. While the application might have a Level 4 process running (TCP is one), there is no provision for signaling it from a network node. TCP is a user-to-user procedure. In addition, there’s no guarantee the application would be prepared to exercise flow control.

Some of the OSI-level problems would be solved by lower-level flow control standards like those provided for ATM in the ABR (Available Bit Rate) class of service. ABR uses feedback from points downstream, toward the destination, to control the ingress of traffic at the source. In theory, any credit- or rate-based system of flow management could be applied at OSI Level 2, below the IP and TCP packet level, to regulate the higher-level flow.

In theory. In practice, there’s a problem because many of the DMARKs will be multi-service in nature. A single DMARK might support "best-efforts" IP service and a half-dozen VPNs with their own independent QoS levels. Presumably each of these would have a specific policy governing flow, so each would have to be shaped independently. Furthermore, the relative priority of one versus the other or the absolute QoS each represented would have to be reflected in the way the shaping buffer was emptied.

The sophistication of the shaping buffer management system depends on the sophistication of the QoS guarantee at the DMARK. If there is no explicit QoS (meaning a DiffServ-like prioritization of traffic), the only requirement is that the relative priority given to traffic in the network matches the relative priority given to each traffic type as the shaping queue is serviced. Here, it is unlikely that real policing is done by the service provider, so the "shaping" queue is only a buffer controlling access to the WAN.

Where explicit, absolute, QoS is being promised in the WAN, the division of resources at the access point must reflect that guarantee. Each VPN division with absolute QoS must be guaranteed a share of the access line sufficient to match the QoS promised by the network. To do more is to oversupply the network with traffic, a violation of the flow agreement that is likely to accompany the explicit SLA. To do less is to ensure that the service provider never really has to carry the traffic you’ve paid for.
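
One way to picture that division of the access line is sketched below. It is an illustration only: the function and parameter names are hypothetical, and two design choices are assumptions of ours rather than anything a particular product does. Guaranteed services are capped at their contracted rate, and whatever capacity is left (including guaranteed capacity that is idle at the moment) is shared among best-efforts services in proportion to what they offer.

```python
def allocate_access_line(line_kbps, guaranteed_kbps, offered_kbps):
    """Divide access-line capacity among the services sharing one DMARK.

    line_kbps:       capacity of the access line
    guaranteed_kbps: service name -> rate promised under absolute QoS
    offered_kbps:    service name -> traffic currently waiting to go out

    Services absent from guaranteed_kbps are treated as best efforts.
    """
    if sum(guaranteed_kbps.values()) > line_kbps:
        raise ValueError("guarantees exceed the access line")

    grants = {}
    # Guaranteed services get what they offer, but never more than was promised:
    # sending more would violate the flow agreement behind the SLA.
    for name, promised in guaranteed_kbps.items():
        grants[name] = min(offered_kbps.get(name, 0), promised)

    leftover = line_kbps - sum(grants.values())
    best_effort = {n: o for n, o in offered_kbps.items() if n not in guaranteed_kbps}
    demand = sum(best_effort.values())
    for name, offered in best_effort.items():
        if demand <= leftover:
            grants[name] = offered
        else:
            # Scale best-efforts traffic down to fit the remaining capacity.
            grants[name] = offered * leftover / demand
    return grants
```

On a 1,536 kbps line with 512 and 256 kbps guarantees, a VPN offering 800 kbps is held to 512, a VPN offering 100 kbps gets its 100, and best-efforts Internet traffic gets the remaining 924.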

The explicit system, at least, almost demands that if the shaping buffer can overflow, there be some means of backpressure on the application to limit the flow. Two approaches are possible: implicit and explicit.

Implicit backpressure takes advantage of the fact that most applications use a window-based flow control system. In such a system, the sender has a specific number of packets or bytes that can be sent at any time (the "window"). The receiver can return acknowledgements to increase this value, and any sending decreases it. When the value goes to zero, the sender must stop.

Holding data in a shaping buffer prevents its delivery, and thus prevents its acknowledgement by the receiver. As data accumulates without acknowledgement, the sender’s window is reduced, eventually to its zero or "closed" state. The sender stops, and buffer overflow is prevented.
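
The mechanism can be sketched with a toy sender. This is a deliberately simplified model of window-based flow control, not an implementation of TCP; the class and function names are hypothetical, and the window is counted in bytes outstanding without acknowledgement.

```python
class WindowedSender:
    """Toy window-based sender: it may have at most window_bytes unacknowledged."""

    def __init__(self, window_bytes):
        self.window_bytes = window_bytes
        self.unacked = 0

    def send(self, nbytes):
        """Try to send; return False if the window is closed."""
        if self.unacked + nbytes > self.window_bytes:
            return False
        self.unacked += nbytes
        return True

    def ack(self, nbytes):
        """An acknowledgement from the receiver reopens the window."""
        self.unacked = max(0, self.unacked - nbytes)


def segments_sent_before_stall(window_bytes, segment_bytes):
    """Count segments a sender gets out while a shaper withholds all delivery."""
    sender = WindowedSender(window_bytes)
    count = 0
    while sender.send(segment_bytes):
        count += 1  # the data sits in the shaping buffer, so no ack comes back
    return count
```

With a 4,000-byte window and 1,000-byte segments the sender stalls after four segments. It can send again only when the shaper releases data and the receiver's acknowledgement arrives.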

This scheme, while lacking in elegance, is really very effective if there aren’t too many senders with independent windows ganging up on a single DMARK, and if the windows aren’t too big. A thousand users with a 65 kilobyte TCP window size would require 65 megabytes of buffer space to prevent overflow, assuming that full window sizes were negotiated.
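
The sizing arithmetic is simple enough to write down. The sketch below assumes the worst case described above, in which every sender has negotiated a full window and all of it lands in the shaping buffer at once; the function name is hypothetical.

```python
def worst_case_buffer_bytes(senders, window_bytes):
    """Buffer needed if every sender's full window arrives at the DMARK together."""
    return senders * window_bytes


# A thousand senders, each with a 65-kilobyte window (taken here as 65,000 bytes).
needed = worst_case_buffer_bytes(1000, 65_000)
needed_megabytes = needed / 1_000_000
```

Halving either the number of senders or the window size halves the requirement, which is why the scheme works best for small sites and modest windows.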

A variation on this scheme is to selectively drop packets to force a window adjustment and reduce the sender’s average flow rate. This process is employed by some of the packet-flow software designed for use on the Internet. The idea is to regulate the sender so that TCP isn’t forced into a discard-induced window size reduction too often.
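
A rough sketch of selective dropping follows. It is written in the spirit of Random Early Detection (RED), which is our choice of illustration and is not named above. It is also simplified: it works from the instantaneous buffer level rather than an averaged one, and the thresholds and names are hypothetical.

```python
import random


def drop_probability(level, min_level, max_level, max_prob):
    """Chance of dropping an arriving packet, given the current buffer level.

    Below min_level nothing is dropped; at or above max_level everything is.
    In between, the probability climbs linearly from 0 up to max_prob.
    """
    if level <= min_level:
        return 0.0
    if level >= max_level:
        return 1.0
    return max_prob * (level - min_level) / (max_level - min_level)


def should_drop(level, min_level, max_level, max_prob, rng=random.random):
    """Decide the fate of one packet; rng can be replaced for repeatable tests."""
    return rng() < drop_probability(level, min_level, max_level, max_prob)
```

The point of dropping a few packets early is that only some senders slow down at any one time, which is gentler than waiting for the buffer to overflow and hitting every sender at once.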

The ultimate solution is to involve the application in the process by coupling the buffer/shaping process to the application’s rate of data generation. This is beyond the state of current APIs and client/server software for TCP/IP, but it does work with applications based on SNA because SNA has per-node flow control ("path pacing").

Adding explicit flow control either at the application level or the node level would involve one of three approaches:

    1. A general extension to standards at Level 4 to permit intermediate nodes to introduce feedback on buffer states. Since this wouldn’t be compatible with existing implementations, it’s doubtful it could be done.
    2. Generation of buffer-level feedback at a lower OSI level, or on a parallel control channel. This would then have to be "coupled up" to the application level by client/server software elements. Again, there could be a compatibility problem.
    3. Proxying the Level 4 processes at the DMARK so that the "receiver" the LAN user talks with is not the end system partner but the proxy process running as a part of the DMARK shaping algorithm. Since this proxy would control the rate that acknowledgements were generated, it would control the flow of the application directly. A similar proxy, talking across the network to the partner’s side, would manage the flow over the WAN.
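
The third approach, in which the proxy controls the rate of acknowledgement, can be sketched in idealized form. The model below assumes the LAN sender is much faster than the WAN, so every segment is treated as already waiting at the proxy, and the proxy releases each acknowledgement only when the shaper could have drained that segment. The function name and units are hypothetical, and a real proxy would have far more to do.

```python
def ack_schedule(segment_sizes, drain_bytes_per_sec):
    """Times (seconds from start) at which a pacing proxy acknowledges each segment.

    segment_sizes:       bytes in each segment, in arrival order
    drain_bytes_per_sec: rate at which the shaper empties onto the WAN
    """
    if drain_bytes_per_sec <= 0:
        raise ValueError("drain rate must be positive")
    elapsed = 0.0
    times = []
    for size in segment_sizes:
        # Acknowledge a segment only once the WAN could have carried it.
        elapsed += size / drain_bytes_per_sec
        times.append(elapsed)
    return times
```

Because the sender's window reopens only as fast as the acknowledgements arrive, its long-run sending rate settles at the drain rate without any change to the sender's software.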

The only disadvantage of the last approach is cost; the processing performance needed to provide this level of active intervention at the DMARK wouldn’t come cheap. However, today’s trend to combine more and more functions at the top of the switching hierarchy (firewalls, etc.) would permit adding the proxy capability fairly easily. Proxy firewalls, in fact, perform many of the needed functions already. This is probably the strategy that will prevail over time; for now, implicit control will be the rule.

Assuring the Infinite Premises

Some of the rules for building high-capacity premises networks have already been noted, but we’ll repeat them here in summary:

    1. Constrain the client and server NIC speeds to a level based on the real application requirements. Too high a data rate on client or server systems lets too much data into the network, and can cause congestion at the DMARK.
    2. Be sure that inter-switch trunks are ten times the speed of the feeder trunks into the switch, to avoid exit port collision and discarding.
    3. Be wary of extending Level 2 domains across switches (VLANs that span switches). It’s tempting to do this, but the multicast traffic can cause congestion at the connections between switches.
    4. Watch file transfers, disk copies, or other activities that don’t have a natural limit on information rates. Most transactional applications won’t congest a network because the human users can’t operate fast enough to do so. Where data is moving in bulk, there are no human limits.
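
The second rule in the list above lends itself to a quick audit. The sketch below assumes a hypothetical inventory that records, for each switch, the speed of its feeder ports and of its inter-switch trunk; the names and data layout are ours.

```python
def undersized_trunks(switches, ratio=10):
    """Names of switches whose trunk is less than ratio times the feeder speed.

    switches: switch name -> (feeder_mbps, trunk_mbps)
    """
    return sorted(
        name
        for name, (feeder_mbps, trunk_mbps) in switches.items()
        if trunk_mbps < ratio * feeder_mbps
    )


# Hypothetical inventory: speeds in Mbps.
inventory = {
    "floor1": (10, 100),
    "floor2": (100, 100),
    "core": (100, 1000),
}
flagged = undersized_trunks(inventory)
```

Here only floor2 is flagged: its 100 Mbps feeders share a 100 Mbps trunk, so a handful of busy ports can collide at the exit.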

Probably the two things on this list that get most users in trouble are the first and last items, so we’ll spend a bit more time on them.

There’s a huge temptation to move everybody up to faster LAN adapters, and an even greater temptation to do so at the server level. The use of fast adapters, however, will increase the rate at which a system can inject data into the network. It may also cause speed-matching problems in connections involving slower partners. Either can cause switch or local router buffers to overfill, creating delays and eventually data loss. This contaminates the QoS on the LAN even though the theoretical LAN capacity may far exceed demand.

Most client systems don’t really need even 100 Mbps connections, so hand out the faster connections only where there’s clearly a value. If you do so, keep an eye on the interactions between these people and their slower brethren; it may be necessary to create such speed-matched linkages via a device capable of a lot of buffering, and also to try to steer enterprise WAN traffic around places where these exchanges could congest.

The bulk transfer applications are usually what exacerbates the speed-match problem. If a fast user starts to copy a file to a slow one, the data will avalanche into the network faster than it can be delivered, and everything will back up. It doesn’t take much to overfill switching buffers. One way to prevent this is to forbid file transfers via disk copy redirection except to on-switch partners. The rest of the world should e-mail a copy of the file, which delivers it via a protocol with flow control limits. Some organizations make client-to-client networking for file or print sharing verboten under all conditions.
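
How little it takes can be estimated with one line of arithmetic. The sketch below assumes a sender and receiver running flat out at their adapter speeds and a single switch buffer between them; the 1-megabyte buffer in the example is a hypothetical figure, as are the names.

```python
def seconds_until_overflow(buffer_bytes, in_mbps, out_mbps):
    """Time for a buffer to fill when data arrives faster than it can leave.

    Returns None if the outbound side keeps up and the buffer never fills.
    """
    if in_mbps <= out_mbps:
        return None
    excess_bytes_per_sec = (in_mbps - out_mbps) * 1_000_000 / 8
    return buffer_bytes / excess_bytes_per_sec


# A 100 Mbps sender copying to a 10 Mbps receiver through a 1-megabyte buffer.
fill_time = seconds_until_overflow(1_000_000, 100, 10)
```

Under these assumptions the buffer is full in under a tenth of a second, long before any human notices the copy has started.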

Conclusion

For now, a combination approach to the problem of coupling QoS policies is in order. Users should take steps to build LANs whose congestion potential is minimal, and take further steps as outlined above to make sure they stay that way. That will control the extent to which LAN delivery interferes with WAN QoS.

It probably won’t solve all of the problems, though. If you buy a service with an explicit SLA, you will need a special device at the DMARK to manage the connection—that should be the baseline assumption. Standard routers or FRADs aren’t the answer; look for shaping capability that matches the type of QoS guarantee you’ve received.

Finally, you’ll need a management system that can differentiate between packets lost on the WAN and those lost somewhere end-to-end. The latter may not be the service provider’s responsibility, and if they aren’t, you’ll need to do something about them yourself.
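
The bookkeeping such a system has to do can be sketched as follows. It assumes four packet counters are available for the same flow over the same interval, which is an assumption about instrumentation and not a description of any product; all names are hypothetical.

```python
def attribute_loss(sent_by_source, offered_to_wan, delivered_by_wan, received_by_destination):
    """Split end-to-end packet loss into premises and WAN portions.

    sent_by_source:          packets the sending system put on its LAN
    offered_to_wan:          packets the local DMARK handed to the provider
    delivered_by_wan:        packets the provider handed to the remote DMARK
    received_by_destination: packets the receiving system actually got
    """
    return {
        "lost_before_wan": sent_by_source - offered_to_wan,   # local LAN or shaper
        "lost_in_wan": offered_to_wan - delivered_by_wan,     # provider's side of the SLA
        "lost_after_wan": delivered_by_wan - received_by_destination,  # remote premises
    }
```

Only the middle figure is something to take up with the service provider; the other two are the buyer's own problem.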


Strategies

Everyone wants to talk about SS7 and its use in newly evolving IP-oriented voice products. While all this discussion is going on, there is an effort underway to define a new architecture for voice services. This architecture wouldn’t necessarily displace SS7, but it would tend to relegate it to the protocol used to interface with legacy PSTN equipment.

The new architecture is a mixture of international and Internet standards, part of which is under the auspices of the MMUSIC (Multiparty Multimedia Session Control) group within the IETF. The aspect of the architecture that will be the focus of this section is the Media Gateway Control Protocol, or MGCP.

Sorry, Internet users. This section is for subscribers only!


Down the Line

Our next issue is the Annual Technology Forecast. This will not be posted to the Internet, and no back issues or special subscription starts will be accepted relating to this issue.

