Netwatcher

October 1997 Volume 15.10


Netwatcher (ISSN 0890-5800) is a monthly publication of CIMI Corporation. Subscription information is available here. Copyright © 1997, CIMI Corporation. All rights reserved. No publication or reproduction of this document is permitted without the express written consent of CIMI Corporation.


Management Briefing

In a recent seminar, we commented on future trends and issues, and used as an example the concept of "lambda switching". Some attendees wondered if the reference was a joking comment on the hype-ridden nature of our industry – if photons aren't fast enough, how about neutrinos?

Lambda switching isn't a joke, though some of the forecasts made about it certainly are. It's a term associated with an up-and-coming issue in digital transport, wavelength division multiplexing or WDM.

To understand what WDM and lambdas have to do with your networks, you've got to start with the way that optical information is usually formatted. Since the early 1990s, the rule in digital fiber-optic transport has been SONET, the Synchronous Optical Network standard first promulgated by Bellcore and later adopted (with some changes, as usual) as an international standard called the Synchronous Digital Hierarchy (SDH).

SONET is a form of time-division multiplexing, the same stuff that makes up T1 and T3. As in each of these standards, the basic unit of transmission is a frame of information, a structured set of bits that include both header information to help steer and decode the user data, and the user information itself.

SONET frames are usually described as a kind of matrix of rows and columns, like a spreadsheet, but the details of the structure aren't important for this little briefing. What is important is that the SONET frames are repeated every 125 microseconds, and that each frame's payload is divided among information sources.

When you want to add something to SONET or remove something from it, you have to assemble a frame and synchronize its timing with the other frames, then convert it to optical (via a laser or laser diode) for fiber transport. When you want to switch a particular information flow from one trunk to another, at a carrier node for example, you have to assemble the incoming frame, find the particular SONET payload component ("virtual tributary") associated with the flow you want, remove it, and then assemble it into the output frame on another trunk.

This process of SONET handling, called "add-drop multiplexing", is certainly more efficient than the dissection of other forms of high-level digital traffic (T3 comes to mind), but it's still handling. A similar process occurs when two SONET trunks are to be combined to form a higher-speed trunk. The two inputs are given space in the output payload, which has to be large enough to hold both. This means a larger frame, and a higher bit rate because the frame timing has to be 125 microseconds throughout.
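The link between frame size and bit rate can be sketched in a few lines. This is a rough illustration, assuming the standard STS-1 frame layout of 9 rows by 90 columns of bytes (a figure not given above); the function name is ours, purely for illustration.

```python
# Every SONET frame occupies the same 125 microsecond slot,
# so there are 8000 frames per second regardless of frame size.
FRAME_PERIOD_US = 125
FRAMES_PER_SECOND = 1_000_000 // FRAME_PERIOD_US  # 8000

# Assumption: the standard STS-1 frame is 9 rows x 90 columns of bytes.
STS1_FRAME_BYTES = 9 * 90


def line_rate_mbps(frame_bytes: int) -> float:
    """Bit rate needed to send one frame of frame_bytes every 125 us."""
    return frame_bytes * 8 * FRAMES_PER_SECOND / 1_000_000


single = line_rate_mbps(STS1_FRAME_BYTES)        # one STS-1 stream
combined = line_rate_mbps(3 * STS1_FRAME_BYTES)  # three streams in one frame

print(f"one stream:    {single:.2f} Mbps")
print(f"three streams: {combined:.2f} Mbps")
```

Because the frame time cannot change, tripling the frame size triples the line rate; that is the whole story behind the OC-n speed ladder.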

Time division multiplexing isn't the only kind of multiplexing. In the 1970s and 1980s, there were a few devices that performed frequency-division multiplexing (FDM). In this kind of multiplexing, information streams were separated from one another not in a time-synchronized frame structure, but by frequency – the way radio stations are separated. A carrier wave is created for each flow, and modulated with the digital bits of the flow. At the other end, the carrier is demodulated. The same process of modulation takes place in SONET, using a single "carrier wave" – it's how bits get converted to photons in the laser diode.

FDM fell out of favor in non-optical times, because it wasn't efficient in its use of bandwidth, but it survived into LAN days in CATV-based LAN products of the late 1980s. What resurrected it was fiber optics.

In fiber systems, we have bandwidth to burn. If we took two SONET streams and converted them from frames to photons of a particular and different frequency, they'd be independent streams and could pass over the same fiber (the two frequencies of light have to be close together or the fiber isn't optimally transparent to both).

One question you've probably already had is what this has to do with wavelengths – so far, we're talking only about FDM. The answer is that there are two ways to describe an electromagnetic or optical signal: by the frequency at which it repeats (measured in cycles per second, or Hertz); and by the distance between consecutive peaks (measured in meters or millimeters). If you divide the speed of wave travel (the speed of light) by the frequency, you get the wavelength. The Greek symbol for wavelength is λ, called "lambda". Hence, "lambda multiplexing" is WDM.
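The frequency-to-wavelength division is easy to try for yourself. The sketch below is only an illustration: the two optical carrier frequencies are values we picked as examples, and the calculation uses the vacuum speed of light, which is how optical wavelengths are conventionally quoted.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # speed of light in a vacuum


def wavelength_nm(frequency_hz: float) -> float:
    """Wavelength in nanometers: speed of light divided by frequency."""
    return SPEED_OF_LIGHT_M_PER_S / frequency_hz * 1e9


# Assumption: two illustrative optical carriers 100 GHz apart.
lambda_a = wavelength_nm(193.1e12)
lambda_b = wavelength_nm(193.2e12)

print(f"carrier A: {lambda_a:.2f} nm")
print(f"carrier B: {lambda_b:.2f} nm")
print(f"spacing:   {lambda_a - lambda_b:.2f} nm")
```

Two carriers that differ by a small fraction of their frequency end up less than a nanometer apart in wavelength, which is why the "lambdas" on one fiber sit so close together.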

Well, that was interesting. What difference does it make? OK, here's one. If we have a given fiber trunk, our capacity is limited by the SONET frame size we can create. Everybody has OC-3 (155 Mbps) and OC-12 (622 Mbps). Many now have OC-48 (2.4 Gbps) or higher (to get the approximate gigabits per second of anything OC-24 or higher, divide the number – 24 in this case – by 20). Who has OC-384? With WDM, we can take a bunch (from three to 10 or so, at present) of SONET streams and combine them onto one fiber. Each is independent, and if we can go as high as OC-192 and have 10 WDM streams, we have OC-1920, in effect.
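For readers who like to check the arithmetic, here is a small sketch comparing the divide-by-20 shortcut with the exact rate. It assumes the standard OC-1 rate of 51.84 Mbps; the function names are ours.

```python
OC1_MBPS = 51.84  # assumption: standard OC-1 line rate


def oc_rate_gbps(n: int) -> float:
    """Exact line rate of an OC-n trunk in Gbps."""
    return n * OC1_MBPS / 1000


def rule_of_thumb_gbps(n: int) -> float:
    """The quick estimate: divide the OC number by 20."""
    return n / 20


def wdm_equivalent_oc(n: int, wavelengths: int) -> int:
    """OC level that a fiber carrying several OC-n lambdas adds up to."""
    return n * wavelengths


for level in (24, 48, 192):
    print(f"OC-{level}: exact {oc_rate_gbps(level):.2f} Gbps, "
          f"shortcut {rule_of_thumb_gbps(level):.2f} Gbps")

print(f"10 lambdas of OC-192 = OC-{wdm_equivalent_oc(192, 10)}")
```

The shortcut runs a few percent low, which is close enough for conversation.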

Great! But most of us aren't really taxing OC-192s at present. Capacity improvement on fiber trunks might eventually be a justification for WDM on a large scale, but it probably wouldn't make much difference in the near term. What makes WDM interesting today is lambda switching.

Remember those add-drop multiplexers? They represented electro-optical conversions that were expensive in both cost and handling. With WDM, streams of independent frames enter a node. Those that really have terminating conversations there are passed through the standard SONET process. Those that are simply heading elsewhere via this node are switched through at the optical level. No handling, delay, etc.

The goal of lambda switching is to make transit connections at the optical layer, before any electro-optical conversion, cost, and delay have been incurred. On multi-hop networks, it would potentially reduce the end-to-end delay of the path, since each time we frame out a SONET payload for cross-switching, we generate a minimum of a 125 microsecond delay for serialization (collection of the data from the input trunk).
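The delay argument can be put in numbers with a short sketch. It counts only the minimum 125 microsecond framing delay at each node that handles the flow at the SONET level, and ignores propagation and equipment processing time; the six-node path is a made-up example.

```python
FRAME_TIME_US = 125  # minimum serialization delay per SONET-handled node


def min_framing_delay_us(sonet_handled_nodes: int) -> int:
    """Minimum added delay when a flow is framed out at this many nodes."""
    return sonet_handled_nodes * FRAME_TIME_US


# Assumption: a hypothetical path that transits six intermediate nodes.
transit_nodes = 6

all_sonet = min_framing_delay_us(transit_nodes)  # every node frames the flow
lambda_switched = min_framing_delay_us(0)        # transit handled optically

print(f"SONET handling at every node: {all_sonet} microseconds")
print(f"lambda-switched transit:      {lambda_switched} microseconds")
```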

The whole WDM/lambda switching concept is interesting to carriers and vendors for a number of reasons:

  1. It would allow the integration of TDM voice networks and IP or ATM networks onto a common optical infrastructure, without mingling the traffic and without performing electro-optical conversions every time traffic transited a node site, even when that site wasn't involved with any of the conversations the flow contained. This is important because ATM and IP nodes would probably be scarce relative to TDM node sites, so their traffic would be subject to a lot of unnecessary handling with SONET processing alone, and this would raise network costs for all traffic types.
  2. Two different carrier jurisdictions could share a fiber pipe as "ships passing in the night", having no ability to influence each other or see each other's information. This is useful for dark fiber providers who want to leverage strands between key locations, like LA and San Francisco.
  3. "Wavelength virtual networks" could be developed, a kind of tributary system of SONET pipelines separated by WDM, collecting traffic in regional centers for hand-off to an inter-region trunk. But these virtual networks could share the same fiber as the feeders that assembled that regional center's traffic, just flowing over a different wavelength.

Some of these benefits are available from WDM alone. What lambda switching would do is secure their economic viability by reducing the cost of creating a whole new virtual network infrastructure above physical facilities but below the frame/cell layer where virtual networks are traditionally created.

This idea of another virtual layer has many people interested, as our third point above suggests. Some believe that as bandwidth cost falls through lambda switching and WDM, conservation of trunk bandwidth would become less a priority, and conservation through multiplexing – the core of the frame/ATM value proposition – would no longer be required.

Maybe so, in the far future. In the present, the idea that lambda switching and WDM would somehow cause bandwidth cost to plummet is highly destructive to the planning process. We're all adults here, folks. Nobody is going to invest in capital technology to lose money, which is what happens to carriers if bandwidth is free. In any case, the cost of bandwidth is increasingly the cost of access and switching, so making transport bandwidth cheaper probably wouldn't impact overall costs significantly.

What WDM and its companion concepts will do is help contain the cost of bandwidth as our demands for it increase in a quantitative sense, and diversify in a burstiness sense. That's a worthy benefit, but one that we probably won't be sensitive to until early in the next decade.

If you own glass, and if you plan to exploit it commercially, you may want to consider WDM technology as a means of allowing multiple customers to share a fiber span without casting you into the role of carrier and integrator at the SONET level. You can assign a wavelength (lambda) to each customer, and as long as their gear is compatible with the WDM equipment you select, each can provide SONET equipment to suit their needs.

The "as long as their gear is compatible" theme is important here. Where optical devices emit their own photons, the wavelength is fixed by the device. That may limit your ability to WDM, because you can't control all the lambda values used.

If WDM is being promoted to you in any way, you'll need to be cautious with the benefit case!


In the Know

The developing interest in video and voice over the Internet has brought new focus to a set of standards that will probably govern both areas: H.323. Even as the definitive specification for packet video, H.323 is not well known. The fact that it is also a key ingredient in the packet voice area was discussed in an earlier Netwatcher issue on voice compression, and that truth is even less widely known.

Chances are, whatever your company will do with voice over IP or frame, and whatever it will do with IP-based collaborative or video applications, will be influenced by H.323. This being true, it might be a good idea to understand a little about what it is, and how it works.

The Structure of H.323

Many of today's standards are largely collections of other standards (the standards bodies believing, apparently, that this will help them flood us with more stuff to read). H.323 is such a standard, calling out lower-level standards in a structure designed to support multimedia packet communications.

At the highest level, H.323 defines a series of four components, representing elements of a multimedia system, with a fifth component (the network) implied. The four components are terminals, multipoint control units, gateways, and gatekeepers. We'll talk about each of these later.

The elements above communicate over our implied fifth component, the network. H.323 mandates a series of communications functions and protocols that can be generally divided into "control plane" and "information plane" areas.

The control plane functions of H.323 include three signaling procedures: H.245 facilities control, Q.931 call signaling and control, and Registration/Admission/Status control (RAS).

Information plane functions in H.323 can be divided into two segments: coding and transport. The coding elements cover how various information forms are mapped to digital packet streams – voice coding, for example. The transport segment covers how information is mapped to the network protocol being used to support information exchange.

The H.323 architecture supports "multimedia conferences". These conferences can be any of the following:

  1. Point-to-point relationships between pairs of users, similar to phone calls.
  2. Point-to-multipoint relationships based on a central conference management point, a multipoint control unit or MCU. This creates a "star" conference topology.
  3. Point-to-multipoint relationships based on network multicast services. This emulates an MBONE-like conference. In this configuration, an MCU is still required for conference management.

These architectures can be combined, in theory, to form hybrid forms of conferencing. This is likely to occur where conference participants are not all supported on a single transport network, and where optimum cost management or convenience dictates a different structure within each network type.

In theory, the "multimedia" part of H.323 is also negotiable. While the standard supports voice/data/video integration as a collaborative framework, it also supports applications where one or all of the stations have limited multimedia capability. Thus, a "conference" could be data-only, voice-data, etc.

H.323 Components

As we noted above, there are four explicit components of an H.323 system. Some of these are optional, and virtually all have mandatory and optional elements.

Component number one, which is pretty obviously mandatory, is the terminal component, which is the definition given to the H.323 client. Terminals contain both control and information plane functions, with the latter being defined optionally for each media type the terminal supports.

Another option of H.323 terminals is T.120 support. T.120 is the collaborative data application standard, one we've covered in some detail in the past. The use of T.120 (and thus of data conferencing) is not mandatory.

The second component of H.323 is the MCU. MCUs contain Multipoint Controller and Multipoint Processor functions, with the former being the H.245 negotiation master point and the latter being the specialized bridging and stream processing functions associated with each of the "medias" in "multimedia".

In many implementations of H.323, MCU functions are likely to be incorporated in other components as well. This makes the MCU function suitable for arbitrating distributed conferences where no mandated central point exists. In this case, some party (usually the one who starts the process) inherits MCU responsibility. This type of conference, by the way, is most likely to be based on multicast services, since the MCU Multipoint Processor function would probably load a client system significantly.

Component number three for H.323 is the gateway, which in H.323 provides translation between H.323 conferencing systems and conferencing systems based on another standard, like the isochronous H.320. In effect, a gateway is a proxy terminal in both networks, responsible for signaling plane and information plane translation as required to support cross-network communications. The function is optional for the obvious reason that not every H.323 system has a need to communicate outside the packet realm.

In public services based on H.323, gateways are probably a key component, because they would enable packet video and voice stations to interwork with the circuit world. Gateways would also play a role (in theory) in packet voice applications that were linked to the public switched telephone world.

The final component is the gatekeeper. This confusing name covers an essential function set: address management and conference policy management. The gatekeeper coordinates activity via the RAS protocol mentioned earlier; in networks without gatekeepers, there is no address translation, admission control, or resource management (and, thus, no RAS is used).

Address management is needed to translate from the logical or user names of stations to their address in the H.323 network domain. This function is optional in that if terminals know the addresses of their partners, the translation function is not required.

Gatekeepers also perform translation between out-of-network addresses and H.323 addresses. This is needed if there is a gateway on the network supporting a community of off-H.323 users. Voice over IP applications that provide for dial-in or dial-out via PSTN must provide a gatekeeper function (though if the applications are not H.323-compliant, the function may not be called "gatekeeper" and may not function as we've described here).

Conference policy management is required if the resources used by H.323 conferences could load the transport network that supports the users. The central gatekeeper function can be used to limit the number of conferences, thus controlling the bandwidth that conferencing can consume in an indirect sort of way.

Explicit bandwidth management is also supported by the gatekeeper as an optional capability, for networks that can provide allocated bandwidth to specific applications. Likewise, the policy management component can enforce connection permission policies that establish who is allowed to talk to whom.
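The gatekeeper duties described so far (name-to-address translation, a cap on the number of conferences, and an optional bandwidth budget) can be pictured with a toy model. This is a hypothetical sketch of the logic only: the class, its methods, and the sample addresses are our own inventions, and it does not represent the RAS message formats or any real product.

```python
from typing import Dict, Optional


class Gatekeeper:
    """Toy model of gatekeeper decisions; hypothetical, not the RAS protocol."""

    def __init__(self, aliases: Dict[str, str], max_conferences: int,
                 bandwidth_budget_kbps: int) -> None:
        self.aliases = aliases                  # user name -> network address
        self.max_conferences = max_conferences  # conference-count policy
        self.bandwidth_budget_kbps = bandwidth_budget_kbps
        self.active_conferences = 0
        self.bandwidth_in_use_kbps = 0

    def resolve(self, alias: str) -> Optional[str]:
        """Address management: translate a user name to an address."""
        return self.aliases.get(alias)

    def admit(self, callee_alias: str, requested_kbps: int) -> Optional[str]:
        """Return the callee's address if policy allows the conference."""
        address = self.resolve(callee_alias)
        if address is None:
            return None  # unknown station
        if self.active_conferences >= self.max_conferences:
            return None  # too many conferences already running
        if self.bandwidth_in_use_kbps + requested_kbps > self.bandwidth_budget_kbps:
            return None  # would exceed the bandwidth set aside for conferencing
        self.active_conferences += 1
        self.bandwidth_in_use_kbps += requested_kbps
        return address


# Made-up example: one registered station, 500 kbps reserved for conferencing.
gk = Gatekeeper({"alice": "10.0.0.5"}, max_conferences=2,
                bandwidth_budget_kbps=500)
print(gk.admit("alice", 384))  # admitted, address returned
print(gk.admit("alice", 384))  # refused, the budget would be exceeded
```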

The policy management and "centricity" role of the gatekeeper makes it the point of coordination for conferences that change their nature (from point-to-point to multipoint, for example) and thus must consume a new resource (an MCU, in our example).

H.323 and the Network

The first issue in H.323 networking is the form of information that is going to be exchanged—the coding convention for the media types in use.

Audio information is compressed in a variety of ways (hopefully compatible among the terminals in use). The baseline standard is the standard PCM coding (G.711) used by voice networks worldwide. A more desirable variant, G.723, is probably the de facto standard for compression today. Even more effective compression strategies like G.729 are likely to be added in the future, and their combination of bit efficiency and audio quality will probably make them odds-on favorites by the next decade.

Video compression is based on an augmented form of H.261 coding found in H.320. The augmentation is the introduction of a requirement for support of motion-coded video, which improves video quality, and better compression algorithms to support the lower bit rate desirable in packet networks.

A variety of standard image sizes and resolutions are provided with H.323, from a simple 128x96 to a sexy 1408x1152. Only one of the formats (176x144) is assured to be compatible with H.261.

The data standard, as we've noted, is T.120. This standard provides for file transfer and whiteboard application support. The assumption with T.120 support is that both the H.323 terminals and MCUs would be T.120-enabled as well, since both elements also play compatible roles in the latter standard.

The use of control plane connection facilities in H.323 is not exactly intuitive. The H.245 functions for the highest level of resource management (resource coordination among conference participants and with other components) take place on LANs over TCP/IP sessions. Sessions also carry T.120 stuff (if any) and call signaling between elements, arising out of their having used Q.931 call procedures to request a conference connection.

The mapping of what are basically connection-oriented H.323 procedures to TCP is extended with the mapping of the transport of information flows to the RTP extension of UDP. We've described RTP before as well, so we won't get into it here.

Where We're Going With H.323

It's clear that packet video is an H.323 function, even in today's implementations. It's also clear that multimedia collaboration is going to be an H.323 domain, to the extent that the application develops at all.

What is also true, but not necessarily clear, is that H.323 will probably set the standard for packet voice eventually, at least to a degree. Where there is a community of "in-network" users with voice capability and a need or desire to interwork the network with the PSTN, it's only logical that the strategy for interworking would support the facilities of the in-net users. In other words, if I'm an H.323 user on an IP network supporting voice over IP, why wouldn't somebody calling in to the IP network eventually want to call me?

Another "truth-not-so-clear" is that H.323 is only a standard, not a need. Standards facilitate, they don't justify. The business value of H.323 depends on the business value of packet voice and video relationships, and that value is only now being tested in the marketplace.

IP voice today is largely a tariff issue, as we've said many times. It's doubtful that such a transient incentive would command a large quantity of users to invest in VoIP technology. In fact, it may well be that the eventual value of voice over IP will depend on the extent to which H.323 network components deploy, and the extent to which VoIP conforms to H.323 standards.

H.323 and its related issues are an example of a couple of trends that exert a negative influence on our industry, and on the acceptance of network technology for business purposes. We should therefore close our section on that note.

First, H.323 is driving multiple interdependent market activities, and yet its role in this process is not recognized or explained. We could well develop a consumeristic affinity for an IP voice architecture that wasn't particularly compatible with H.323, which would force us to either abandon H.323 as a collaborative standard or reconcile ourselves to incompatible transport of the same media on the same network by two different application classes.

Second, the standard itself is being developed with less than perfect attention to market priorities. Sure, it's desirable to have packet video collaboration gateway to circuit video collaboration networks to extend the latter and assure interworking. But if our goal is to extend the business value of collaboration, we have to put features and issues that support that goal on the top of our priority list. In some key areas, like the requirements for voice coding in H.323, it's clear we've put compatibility above utility, and that's dumb in the long term.

There are a lot of good things happening in the packet video space, but some dumb things as well. Everybody who is interested in this critical area should hammer providers to be sure that the real issues are addressed, and quickly.


Strategies

In every project that advances our business goals through the use of telecommunications, we spawn a demon with every blessing – the demon of dependence.

What is peripheral to a process can usually be removed without creating a major impact on that process. What is central to a process cannot. As network technology is incorporated into business activities in mission-critical ways, the assurance that the activity can be executed can be given only if accompanied by the assurance that the network will perform.

How do we do that? Network devices are multiplying, and so are network trunks. Speeds are increasing, and the labor pool of qualified personnel is being emptied by startup technology firms who lure all network gurus to Silicon Valley.

The problem is exacerbated by the fact that there are really two networks being built today – LANs and WANs – and that technology choices in both areas are broadening. Given all of this, it's not surprising that buyers often simply set this issue aside and hope for the best.

Well, we really can't do that, and so we will undertake the mission to at least outline the strategies for network availability enhancement in this segment.

To Fix, or Not to Break?

The first issue in network availability management is the most fundamental (and, unfortunately, perhaps the most difficult). If we can't survive a network problem, do we work to reduce the rate of problems, or work to fix the ones that happen? Putting this in semi-technical terms, do we increase the Mean Time Between Failures (MTBF) or reduce the Mean Time To Repair (MTTR)?

MTBF is enhanced in two very high-level ways: by contract and by design practices. Both are good approaches, in their place.

Contractual enhancement of MTBF is essentially a legal form of extortion, where the network buyer asserts a reliability requirement on the provider as a part of the sales agreement, as a condition of sale. The provider must then make good on this figure or face some form of redress in law.

Design-based enhancement of MTBF is a technology function. Each network component is dissected into its parts, and each part is optimized to present the lowest risk of failure. Where risk levels are still deemed too high after such a process, internal redundancy and "hot switchover" to standby components may be used to ensure that a low-level element failure does not result in a high-level network failure.

It may seem illogical to separate this kind of internal MTBF management from MTTR response. After all, aren't we considering a faster response to a failure by having (for example) a hot standby power supply? But remember, failure is in the eye of the beholder. Our goal is to control network failures, and device or even circuit reactions that do not actually cause a failure are MTBF enhancements. The key is to say that network MTBF is enhanced by any measure that reacts to a problem in a way that is transparent to the network's applications. MTTR is enhanced by restoring normal interaction after an impact has been created.

The question of whether to enhance MTTR or MTBF is a knotty one. The answer has to come from a buyer's examination of the consequences of failure. If simply breaking is going to create all kinds of problems, then MTBF has to be as high as possible. The total elimination of visible network problems is expensive, though, so most users will accept short-duration, visible failures if they can be corrected effectively. Hence, MTTR reduction is usually the objective of availability planning.
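The trade-off is easier to see with the usual steady-state availability formula, availability = MTBF / (MTBF + MTTR). The sketch below applies it to made-up figures; the hours used are illustrative assumptions, not measurements.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the network is up: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Assumption: a failure every 2000 hours, four hours to repair each one.
baseline = availability(2000, 4)
double_mtbf = availability(4000, 4)  # fail half as often
halve_mttr = availability(2000, 2)   # fix twice as fast

print(f"baseline:     {baseline:.4%}")
print(f"double MTBF:  {double_mtbf:.4%}")
print(f"halve MTTR:   {halve_mttr:.4%}")
```

Doubling MTBF and halving MTTR produce the same availability figure, so the choice between them comes down to cost and to whether your applications can tolerate a visible failure at all.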

This doesn't mean that you can't impact MTBF effectively. For parts of your networks in the critical path, it's a good idea to look for the following:

  1. Devices that have a high MTBF rating. Don't be afraid to ask your vendors for their MTBF levels, as certified in their government contracts.
  2. Redundant features in key areas of hardware design. Usually power supplies and magnetic media devices can be provided in a hot standby mode that will ensure that a failure doesn't impact the device functionality. Load-sharing power supplies should be considered a must for high-availability applications, and disk mirroring is important where there are disk drives used for parameter or software storage.
  3. Control module redundancy, also sometimes called "redundant common control", may be a good idea for devices that have a lot of ports and trunks, because a single failure there will take down the whole box. Beware, though. Some control redundancy strategies will not switch over transparently, so they don't really augment MTBF.
  4. I/O module redundancy. Most high-capacity network products have redundant I/O as an option, but most will not switch "hot" from one to another. If you are really looking to increase MTBF, you must have hot switchover.
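To get a feel for what hot-standby redundancy buys, here is a rough sketch. It assumes the redundant units fail independently and that switchover is instantaneous and transparent, which is exactly the condition the list above warns you to verify; the 99% figure is invented for illustration.

```python
def parallel_availability(unit_availability: float, units: int = 2) -> float:
    """Availability of redundant units where any one surviving unit suffices.

    Assumes independent failures and a perfectly transparent hot switchover.
    """
    return 1 - (1 - unit_availability) ** units


single_supply = 0.99  # assumption: one power supply is up 99% of the time
redundant_pair = parallel_availability(single_supply, units=2)

print(f"single supply:  {single_supply:.2%}")
print(f"redundant pair: {redundant_pair:.2%}")
```

If the switchover is not transparent, the pair still helps repair time, but the failure is visible to applications and the gain shown here does not apply.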

OK, let's suppose you've considered the whole MTBF augmenting approach and you're now ready to work on MTTR. The issues and options you'll have will depend on whether you are in the LAN or WAN, but there is a general set of rules for MTTR augmentation.

First, you can offload the network MTTR problem on a third party through an outsourcing contract. This doesn't mean that technical measures to ensure MTTR levels are low won't be required, just that you won't be required to take them.

Second, you can focus on measures to restore service by fixing the thing that broke. This may mean actually repairing something, or it may mean swapping something out for a like component so that the original and broken gadget can be repaired.

Third, you can bypass the failed process completely, routing traffic around the failed network. This is what dial backup does in most cases.

Most organizations should consider all three of these approaches in optimizing their response to network problems. The level of outsourcing you do is dependent on how well, in both economic and technical/labor/skill terms, you think you can do on your own with the other two areas of choice.

Outsourcing: Consider It Done!

Most network users today should absolutely be using some level of outsourcing in their problem management strategy. In the WAN, it's very likely that you can alleviate most or all of your problems with an outsourcing agreement.

Service Level Agreements (SLAs) are the best form of network insurance you can buy. Any time you get a service from a carrier, you should insist on a service level agreement that outlines the objectives of the service in performance and availability, and specifies responses and penalties should a problem occur.

There are several kinds of SLAs. "Baseline" SLAs are service objectives set by the carrier for a given type of service as a standard part of the service offering. With a baseline SLA, you usually don't have to negotiate anything about the technical or quantitative issues, but you may have to inspect just what happens if the SLA is not met to ensure the response is adequate. A number of carriers provide baseline frame relay SLAs, for example, but don't back the objectives with any specific response. That's not an SLA at all, folks, it's a wish list! The other type of SLA, a negotiated SLA, has both specific criteria and specific contractual guarantees, including some form of rebates or liquidated damages. That's a real SLA – one with teeth.

Beyond SLAs, you get into the issue of outsourcing equipment support and managed services. A managed service is an extension to a carrier service where the carrier extends influence inward toward the desktop, accepting responsibility for diagnosing and correcting problems on the premises.

Premises management outsource agreements will usually have to confine themselves to providing guarantees of "fix it or replace it" within a given period. This is because there are simply too many variables to permit an outsourcer to guarantee availability or performance levels. Thus, for premises network elements, it may be necessary to provide some form of alternative network path should a really critical application set be cut off.

In the WAN: TDM and The Rest

If you have a WAN network to assure, and you have an SLA in place, the next step is to consider what to do if the carrier doesn't respond in time and you aren't happy with just collecting the penalty.

To restore WAN services, you have to consider what might have broken. Two problems could occur: an access failure in your connection to the carrier network, and a transport network failure. You should try to deal with transport failures through SLAs and carrier selection; use networks that have enough internal redundancy that the operator is willing to guarantee availability. That leaves access failures to handle.

One way of handling access failures is through redundant access. You can get two links to the carrier, preferably diversely routed to two different POPs so no one failure takes both out. The redundancy will probably cost you something, though, so you may want to consider that cost before you commit.

A second strategy is to create a whole new network path from end to end, bypassing not only the access connections on both sides but the network connection as well. This is a good approach where the bandwidth involved is within the range that dial-up service can cover, but as the capacity of the primary link increases, the likelihood that dial backup is practical reduces. Remember, if you have anything more than 56/64 kbps bandwidth, you'll need to use inverse multiplexing to combine DS0s into a fatter pipe.
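The inverse multiplexing arithmetic is simple enough to sketch: divide the bandwidth you need to back up by the rate of one dial channel and round up. The target rates below are examples we chose, and the function name is ours.

```python
import math


def channels_needed(target_kbps: float, channel_kbps: int = 64) -> int:
    """Dial channels an inverse multiplexer must combine to reach target_kbps."""
    return math.ceil(target_kbps / channel_kbps)


# Assumption: illustrative primary link speeds to be backed up.
for target in (128, 384, 1536):
    print(f"{target} kbps: {channels_needed(target, 64)} channels at 64 kbps, "
          f"{channels_needed(target, 56)} channels at 56 kbps")
```

Backing up a full T1 payload this way takes two dozen simultaneous calls or more, which shows why dial backup gets less practical as the primary link gets faster.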

A third approach is to use some form of private transport like microwave or dark fiber to link each site with a neighbor site whose network connection is diverse from that of the primary. This private shunt will then provide an alternate path to the network. For this to work, the hardware in each shunt site must be capable of transit routing from the shunt into the network, so there are device limits to consider.

For all these approaches, the buyer has to consider the network technology implications. TDM network backup is pretty well understood, but frame/cell networks introduce new issues.

In traditional TDM networking, network services are based on connections, which are reservations of fixed bandwidth resources to support user activity. A connection can be permanent (a leased line) or transient (a dial-up call), but while the connection exists, there is a fixed path for it through the network and a fixed level of resources dedicated to its traffic.

The access portion of traditional TDM networks mirrors this structure. Many voice and data connections are accessed via a dedicated loop, analog or digital. This dedicated access is not optimum where many connections are made to a single site, and since the early 1980s buyers have deployed wideband trunks (T1 and fractional T1) to access a set of connections. This trend of TDM integrated access continues today, with users employing T1, T3, and SONET technology to make a connection to their carriers' points of presence.

In cell relay networks, the same access/network model applies, but the resource relationships are different. With cell relay networks, the network services are based on virtual connections, which represent fixed paths through the network. Unlike TDM network paths, these virtual connections represent not specific resource allocations, but rather the right of the connection to draw a specified resource level, on demand, from a resource pool. If no resource is required, none is drawn, and the pool is larger for other applications to consume.

A similar concept exists with cell relay access. The "resource" is the maximum access bandwidth, and virtual connections draw on this resource as they would on any other network resource. This allows for statistical multiplexing of traffic: peak traffic values from one connection can slip into idle intervals on another. In effect, if we consider the peak traffic level as the "bandwidth" allocated to each connection in both the cell relay and TDM networks, the cell relay network will permit oversubscription. This means that the sum of the resource commitments made by the network can exceed the capacity of the network at a given point.
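The oversubscription idea can be put in numbers with a short Python sketch. The connection rates and the access capacity here are made-up values, and the function names are ours; the point is only the comparison between the sum of the peaks and the capacity.

```python
def oversubscription_ratio(peak_rates_kbps, capacity_kbps):
    """Sum of the committed peak rates divided by the shared capacity."""
    if capacity_kbps <= 0:
        raise ValueError("capacity must be positive")
    return sum(peak_rates_kbps) / capacity_kbps


def fits_tdm(peak_rates_kbps, capacity_kbps):
    """A TDM network reserves the peak for every connection, so the sum must fit."""
    return sum(peak_rates_kbps) <= capacity_kbps


# Hypothetical example: four virtual connections sharing one access line
access_kbps = 1536
peaks = [512, 512, 512, 512]

ratio = oversubscription_ratio(peaks, access_kbps)
print("committed peaks:", sum(peaks), "kbps on", access_kbps, "kbps access")
print("oversubscription ratio: %.2f" % ratio)
print("would fit as fixed TDM reservations:", fits_tdm(peaks, access_kbps))
```

A ratio above 1.0 means the commitments exceed the capacity. A cell relay network can accept that because the peaks rarely coincide; a TDM network, which reserves the peak for each connection, cannot.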

If we assume that a cell relay network and a TDM network have the same nodal points, the same number and routing of digital trunks, the same access bandwidth, and the same MTBF on physical components, the two networks would present an identical risk of failure in both the access and network components. If we also assume that both networks support the same mix of switched and permanent connections, the cell relay network would enjoy a reliability advantage over the TDM network, because the bandwidth elasticity of the cell relay network would permit some number of connections to be rerouted onto trunks and through nodes where the TDM network could not, owing to the latter's inability to oversubscribe.

This exercise demonstrates that cell relay networks are not automatically less reliable than their TDM counterparts, and in fact may be more reliable. But any network's reliability must be measured not against its competitive network concepts' reliability, but against application requirements. If a given level of TDM reliability is unacceptable to users, that same reliability will be unacceptable for cell switched networks.

In fact, the network service buyer problem with cell relay is not the reliability of the service per se, but the fact that addressing reliability concerns in cell relay networks is beyond the experience of most buyers. However, the basic approaches to disaster recovery for cell relay are the same as for TDM.

These approaches start by dividing the risk of failure according to the simple service model of "access" and "network" described above. A failure may occur in either component, and a disaster recovery plan may back up one or both components using a similar service or an alternative service.

This seems a trivial conclusion, but it is important considering the fact that many cell relay "services" are actually only cell relay integrated access connections to legacy network resources. A buyer may elect to purchase an ATM access line to a carrier service point and deliver TDM-formatted voice and data over that connection. At the carrier's point of presence, the ATM access line is demultiplexed and the original TDM-based service components are recovered for delivery to the traditional TDM switching and cross-connect facilities. In this model, the only cell relay component is the access component, and no special attention need be paid to the reliability issues of the network portion of the service.

Where cell relay network services are consumed, the access connection may be a leased TDM connection. This is a common approach for interexchange service node access; the local exchange carrier provides a simple digital trunk over which the cell relay traffic flows transparently. In this configuration, there is no additional risk in the access space versus TDM, other than any differences in equipment MTBF for the equipment terminating the connection at the subscriber and carrier locations.

In short, to address disaster planning, the service consumer must consider the topology of the cell relay service and the specific incremental risk that topology generates versus a TDM service. The consumer must also consider which special risks must actually be managed in cell relay terms, and which can be addressed using the same type of disaster planning as would be applied in traditional leased TDM networks.

LAN Availability Assurance

On the LAN, as we've noted, the outsource option provides a cloudier benefit set, largely because there are just too many variables. That means that you, the buyer, will have to do something creative in designing the LAN to ensure that failures are rare and quickly addressed.

The primary vehicle for availability control on the premises is buying a lot of capacity. At each level of the network hierarchy, the trunk speeds out of a device should be six to ten times the port speed in (10 Mbps ports call for 100 Mbps trunks, and so forth). Constrained resources will not only create problems, they'll reduce your options in dealing with problems created by other conditions.
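The six-to-ten-times rule of thumb is easy to turn into a quick design check. The Python sketch below uses function names of our own invention and treats the low end of the range (six times) as the minimum acceptable multiple, which is our reading of the guideline rather than a hard rule.

```python
def trunk_speed_range(port_mbps, low=6, high=10):
    """Return the (minimum, maximum) trunk speed suggested for a given port speed."""
    return port_mbps * low, port_mbps * high


def trunk_is_adequate(port_mbps, trunk_mbps, min_multiple=6):
    """True if the trunk is at least min_multiple times the port speed."""
    return trunk_mbps >= port_mbps * min_multiple


# 10 Mbps ports feeding a 100 Mbps trunk sit at the top of the range
print(trunk_speed_range(10))
print(trunk_is_adequate(10, 100))

# Hypothetical case: 100 Mbps ports feeding a 155 Mbps trunk fall well short
print(trunk_is_adequate(100, 155))
```

Running the check at each level of the hierarchy (workgroup, department, core) shows where capacity is thin before a failure does.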

The second thing to apply is multi-homing of devices. It's a good idea to link each device at a given network level (workgroup, department, core) with two devices at the next level. When you do this, you'll have to consider the way that such a dual pipe will be used by the protocols involved.

Routers handle alternate routes with a fair degree of verve and flair, though the loss of a connection will result in a period where topology updates are exchanged so the routers can "converge" on a new view of the network's structure. Bridges will respond to failures in a variety of ways, depending on the bridge type (source routing or spanning tree) and the specific details of the implementation. Some management systems will also allow you to commission a trunk through commands when an error is reported, and these commands can be incorporated into a script run when the fault event is recognized.

A third thing to look for is load sharing on the internal trunk connections. Some switches and routers will support parallel links and load sharing, so that if one is lost the other will simply take up the load. This is a good way to recover from failures of physical media, where that is a likely cause of problems. Load sharing also helps your network add capacity where congestion is occurring without forcing you to move to the next higher level of trunk (to gigabit Ethernet, for example).

If your network can't absorb all of the traffic in the various "after-the-failure" configurations your backup options will create, you may have to add policy management for resource allocation to the network to ensure that the limited resources are parceled out in conformance with your overall priorities.
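As a rough illustration of what "parceling out" might mean, here is a small Python sketch of a strict-priority allocation. This is only one possible policy, not a description of any vendor's policy manager, and the application names, priorities, and rates are hypothetical.

```python
def allocate_by_priority(capacity_kbps, demands):
    """Grant bandwidth in priority order until the capacity runs out.

    demands is a list of (name, priority, kbps) tuples; a lower priority
    number means a more important application. Returns {name: granted_kbps}.
    """
    remaining = capacity_kbps
    grants = {}
    for name, _priority, kbps in sorted(demands, key=lambda d: d[1]):
        granted = min(kbps, remaining)  # never grant more than is left
        grants[name] = granted
        remaining -= granted
    return grants


# Hypothetical post-failure case: 1000 kbps left, 1300 kbps wanted
demands = [
    ("file transfer", 3, 400),
    ("order entry", 1, 600),
    ("e-mail", 2, 300),
]
print(allocate_by_priority(1000, demands))
```

With strict priority the least important traffic absorbs the whole shortfall. A weighted scheme that trims every application proportionally is an equally valid choice; which one fits depends on your overall priorities.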

Coordinated Responses

Yes, there's a LAN. Yes, there's a WAN. But to most of the users of your network, there's only one network. That means that you'll have to take care to ensure that the recovery strategies associated with LAN and WAN don't collide with one another.

One key step in doing that is to establish clear lines of demarcation for service agreements, both from each other and from your own responsibility space. A "clear line" doesn't just mean an agreed dividing point, either. It means a fence and a way to be sure just what's happening on both sides. At each responsibility demarcation, you need to have a set of rules that govern the interaction and a set of procedures to monitor those rules. Trust, but verify.

A second thing to consider in the coordination area is the protocol visibility of the remedial measures to be taken. If something fails in a router network, topology updates are generated. Those updates propagate to other routers. If both LAN and WAN are based on routers, and if the LAN and WAN fall under different service agreements, there is a chance that problems and recovery measures could migrate from one to the other, with nobody clearly responsible for the result.

A final word: review the maintenance reactions and administrative responses of all your outsourced and internal procedures, to ensure that nothing collides there. One user had an internal policy of switching to a backup dial connection, and an outsource contract that required that the original connection be maintained in service for analysis!

Failures are a bad thing, and keeping them from becoming a disaster is a tough job. With proper attention to detail, however, almost any kind of network can be maintained at acceptable levels of availability. Planning and execution in the early phases can save you serious problems later.


Down the Line

We have closed our nominations for infrastructure switch vendor review. Candidates are Ascend, Cisco, Fore, Lucent, Newbridge, and Nortel. To be fair to all, we'll start the reviews in our January issue, so we can run them without long gaps. December is the Annual Technology Forecast Issue. Remember, that issue is for subscribers only and will not be posted to the Internet.

