Can Second-Tier Vendors Win in a DCI-Centric Model of Infrastructure Evolution?

Juniper had a Wall Street event earlier this week and analysts used terms like “constructive” and “realistic” to describe what the company said. The central focus in a technical sense was SDN and the cloud, not separately as much as in combination. Juniper’s estimates for growth through 2019 were slightly ahead of Street consensus, so the question is whether the characterizations of Street analysts are justified, not only for Juniper but for other wannabe network vendors still chasing the market leaders.

Juniper isn’t alone in saying that the cloud, and the SDN changes likely to accompany it, are going to drive more revenue. The view of most network vendors is that sure, “old” model spending is under pressure, but “new” model spending will make up for it. There are of course variations in what constitutes the old and new, but I think it’s fair to say that most vendors think the cloud and SDN will be a growth engine. How much of an engine it will be, IMHO, depends on how effectively vendors address the drivers that would have to underpin the change.

Let’s start with the optimistic view. If my model is correct, carrier cloud (driven by SDN and NFV) would add over 100,000 new data centers globally. All of these data centers would be highly connected via fiber, and obviously they’d be distributed in areas of heavy population and traffic. If we were to see these data centers as deployed purely for cloud computing, that in itself would generate a decent opportunity for data center interconnect (DCI).

If we presumed these data centers were driven more by SDN and NFV it could get even better. For example, if all wireline broadband were based on cloud-deployed vCPE, then all wireline access traffic would be aggregated into a data center, which means that it would be logical to assume that almost everything in wireline metro aggregation would become a DCI application. And given that mobile infrastructure, meaning IMS and EPC, would also be SDN/NFV-based, the same would be true on the mobile side. All of that would combine to create the granddaddy of all DCI opportunities, to the point where most other transport fiber missions except access would be unimportant.

If I were a vendor like Juniper with a commitment to SDN and the cloud, this is the opportunity I’d salivate over, and shortly after having cleaned myself up, I’d be looking at ways to promote this outcome in infrastructure/market evolution terms. It’s here that problems arise, for Juniper and anyone else who wants to see a DCI-centric future.

The media loves the cloud, but the fact is that even Amazon’s cloud growth hasn’t been able to get cloud computing much above the level of statistical significance in terms of total global IT spending. We still have a lot of headroom for growth, but if we assume that enterprises’ own estimates of cloud penetration are accurate, we would probably not see cloud computing generating even a fifth of that hundred thousand data centers. Most significantly, cloud computing doesn’t drive the edge-focused deployment of data centers that SDN and NFV do, and thus doesn’t compel the same level of interconnection. You get fewer, bigger data centers instead.

There is nothing a network equipment vendor can do to promote “traditional” enterprise cloud computing either. That kind of cloud growth arises from moving applications that fit the cloud profile out of the enterprise data center, and how a network vendor could influence that is unclear, to say the least. For network vendors, in fact, the best way to promote cloud computing growth would be to get behind a cloud-centric (versus mobile-connection-centric) vision of IoT. Network vendors don’t seem psychologically capable of doing that, so I think we’d have to take encouraging cloud computing as the driver for our DCI explosion off the table.

SDN as a driver, then? Certainly SDN and the cloud seem to go together, but the connection isn’t causal in both directions. If you have cloud you will absolutely have SDN, but you can’t promote cloud computing just by deploying SDN unless you use SDN to frame a more agile virtual-network model.

This is a place where Juniper could do something. Nokia’s Nuage SDN architecture is in my view the best in the industry as a unified SDN-connection-layer model, but Juniper’s Contrail could be the second-best. Juniper even has controller federation capability to allow for domain interconnection. The problem for both vendors seems to be that SDN used this way would transform networks away from traditional switching/routing, and so it could hasten the demise of legacy network revenues. Would SDN revenue make up the difference? Perhaps it would change the market leaders, but it’s hard to say why operators would adopt SDN on a large scale as a replacement for traditional L2/L3 if it were more expensive.

Which gets us to NFV. NFV as a means of creating agile mobile infrastructure is the most credible of the evolutionary-NFV applications. The challenge is whether a vendor who isn’t a mobile-infrastructure player can drive the deployment, especially given that Ericsson, Huawei, and Nokia all have NFV stories to tell. Obviously, any major mobile-infrastructure NFV deployment could create an explosion in the number of cloud data centers and drive DCI, but fortune would likely favor vendors who were actually driving the deployment.

The big thing about NFV data centers is the potential that they’d be widely distributed and that they’d be natural focus points for service traffic. That, as I said up front, is what would make them revolutionary in DCI terms. The obvious question is whether the mobile-infrastructure players who could drive the change would benefit enough from it—data centers would house servers after all, and DCI replacement of traditional metro infrastructure would impact most of the big vendors by cutting switching/routing spending even faster (and further).

Ericsson and Cisco would seem to have an edge here because they have a server and data center strategy that would give them an opportunity to gain revenue from a shift to hosted, DCI-centric, metro infrastructure. Ericsson has also been a strong player in professional services, and Cisco’s quarterly call this week showed they had a significant gain in professional services and that they are stressing data center (UCS and the cloud) infrastructure in their profit planning. In fact, Cisco is making a point of saying they are shifting to a software and subscription revenue model even for security.

Conceptually, smaller players in an industry should have first-mover advantages, but in networking in general (and with Juniper in particular) the smaller players have been at least as resistant to change as the giants. Juniper actually launched a software-centric strategy at a time when Cisco was in denial with respect to just about every network change—they recognized transformation and the cloud at least two years earlier than the industry at large, and they had some product features (like separation of control and data plane) that could have given them an edge. They just didn’t have the market mass or insight to make good on their own thought leadership.

That’s what will make the DCI opportunity difficult for any second-tier vendor. The drivers of the opportunity are massive market shifts, shifts that will take positioning skill, product planning, and just plain guts to address. Especially now, because the giants in the space have awoken to the same opportunity.

Netcracker’s AVP: Is This the Right Approach to SDN and NFV?

I had an opportunity this week to look over some material from Netcracker on their notion of a “digital service provider”, part of the documentation that relates to their Agile Virtualization Platform concept. I also reviewed what was available on the technology and architecture of AVP. I find the technology fascinating and the research and even terminology a little confusing.

Netcracker is an OSS/BSS player, of course, and as such they represent an interesting perspective on the transformation game. My surveys say that the OSS/BSS vendors are obviously more engaged with the CIO, but they are also better-connected with the CFO and CEO and less connected with the COO and CTO. That makes them almost the opposite of the network equipment vendors, and that alone means that their vision could be helpful in understanding what’s going on. It’s also a chance to compare their views with what I’ve learned in my own contacts with operators, so let’s start at the top.

What exactly is a “Digital Service Provider” or “digital transformation” to use the TMF term? Obviously not just a provider of digital services because “digital” in a strict sense is all there is these days. I think what both concepts are getting at is that operators need to be able to create a wider variety of services more efficiently and quicker, which means that software and IT have to play a larger role—perhaps even the dominant role. So the notion of AVP is to facilitate that.

What drives operators to want a digital transformation, says the material, is almost completely reactive. Customers demand it, revenue gains depend on it, competition is doing it…these are indications of a market driven by outside forces rather than one trying to get ahead of the curve. It’s not that operators are being dragged kicking and screaming into the transformation, perhaps, but they are surely not romping off of their own accord.

The barriers to achieving the transformation are equally interesting, or at least one point is. Operators told Netcracker that technical factors like operations and integration were the most important inhibitor only about a quarter of the time. Factors like staffing and skills and culture were far more important in the survey, and perhaps most interesting of all was the fact that only about 15% of operators seemed to be groping for solutions—the rest said they either had transformed or were well on their way.

I have to confess I have a bit of a problem with these points, for two reasons. First, it would seem the survey shows that AVP is too late and doesn’t address the main issue set, which is skills and culture and not technology. Second, it’s hard to see how Netcracker or anyone else would have much of a shot at solving market problems if 85% of the buyers don’t need a new approach.

My own surveys have yielded different responses. The overwhelming majority of operators tell me that their driver for change is profit compression for connection-oriented services. Only a small percentage (and almost all of them Tier Ones or MSPs) have an approach lined up, and an even smaller percentage says they’ve made substantial progress implementing one. Thus, my own data seems to make Netcracker’s case for opportunity more strongly.

Interestingly, a different Netcracker document, the AVP brochure, frames it differently. There the big problem is network resource and configuration, staff and culture second and third, with cost and operations processes and systems trailing. This brochure also lays out three reasons for the “slow process” (recall that the other one says only 15% are lagging). These are commercialization uncertainty, operational complexity, and organizational misalignment. The last of these corresponds to the staff/culture point and I’d say that the other two are different perspectives on the “resources and configuration”, “cost”, and “operations processes and systems”. I don’t think the inconsistencies here are fatal issues, but they do create a bit of confusion.

My surveys say that operators are generally committed to a two-prong approach. In the near term, they believe that they have to make operations processes more efficient and agile, and they believe this has to be done by introducing a lot of software-driven automation. In the longer term, they believe that they need to find revenue beyond connection-based services.

AVP is interesting from a technology perspective, perhaps even compelling. Netcracker says it’s made up of composable microservices, and that sounds like the approach that I think is essential to making OSS/BSS “event-driven”. Unfortunately, there aren’t enough details provided in any of the material for me to assess the approach or speculate on how complete it might be. For the record, I requested a complete slide deck from them and I’ve not received one.

AVP is a Netcracker roadmap that has many of the characteristics (and probably all of the goals) that operators’ own architectures (AT&T and Verizon’s recent announcements for example) embody. Their chart seems to show four primary elements—a professional services and knowledge-exchange practice, enhanced event-driven operations processes, a cloud deployment framework that would host both operations/management software and service elements, and the more-or-less expected SDN/NFV operations processes. Netcracker does have its own E2E orchestration product, but the details on the modeling it uses and how it links to the rest of the operations/management framework aren’t in the online material.

If operators’ visions of a next-gen architecture are valid (and after all the operators should be the benchmark for validity) then the Netcracker model is likewise, but it does have some challenges when it’s presented by a vendor and without specific reference to support for operator models. My surveys say that the big problems are the state of SDN/NFV and the political gap that’s inherent in the model itself.

Remember who OSS/BSS vendors call on? The CIO is surely a critical player in the network of the future, and might even be the critical player in both the operations-efficiency and revenue-agility goals. However, they aren’t the ones that have been pushing SDN and NFV—that’s been primarily the CTO gang. Operators are generally of the view that if there is any such thing as a “digital transformation” of infrastructure, SDN and NFV are the roots of it. Interestingly they are also of the view that the standards for SDN and NFV don’t cover the space needed to make a business case—meaning fulfill either the cost-control or revenue goal I’ve already cited. So we have CIOs who have the potential to be the play-makers in both benefits, the OSS/BSS vendors (including Netcracker) who could engage them…and then across a gap the CTOs who are driving infrastructure change.

Properly framed, the Netcracker model could not only link the layers of humans and technology that have to be linked to produce a unified vision of the network of the future, it could even harmonize SDN and NFV management from within, and then with operations management. It’s easier for me to see this being done from the top, from the OSS/BSS side, than from the bottom. But it’s not going to happen by itself. Vendors, operators, and even bodies like the TMF, at whose event Netcracker made its presentation, need to take the process a little more seriously. Absent a unified, credible approach from benefits to networks, operator budgets will just continue their slow decline under profit pressure.

I think OSS/BSS vendors have a great opportunity. My research and modeling shows that an operations-centric evolution of network services could produce most of the gains in efficiency and agility that have been claimed for SDN and NFV. Without, of course, the fork-lift change in infrastructure. That story should be very appealing to operators and of course to the OSS/BSS types, but what seems to be happening is a kind of stratification in messaging and in project management. Operations vendors sing to CIOs, network equipment vendors to CTOs, and nobody coordinates. Maybe they all need to take an orchestration lesson themselves.

Service Assurance in the Network of the Future

One of the persistent questions with both SDN and NFV is how the service management or lifecycle management processes would work. Any time that a network service requires cooperative behavior among functional elements, the presumption is that all the elements have to be functioning. Even with standard services, meaning services over legacy networks, that can be a challenge. It’s even more complicated with SDN and NFV.

Today’s networks are multi-tenant in nature, meaning that they share transmission/connection facilities to at least some degree. Further, today’s networks are based on protocols that discover state and topology through adaptive exchanges, so routing is dynamic and it’s often not possible to know just where a particular user’s flows are going. In most cases these days, the functional state of the network is determined by the adaptive processes—users “see” in some way the results of the status/topology exchanges and can determine if a connection has been lost. Or they simply don’t see connectivity.

QoS is particularly fuzzy. Unless you have a mechanism for measuring it end-to-end, there’s little chance that you can determine exactly what’s happening with respect to delay or packet loss. Most operator guarantees of QoS are based on performance management through traffic engineering, and on capacity planning. You design a network to offer users a given QoS, and you assume that if nothing is reporting a fault the users are getting it.

It’s tempting to look at this process as being incredibly disorderly, particularly when you contrast it with TDM services, which, because they dedicated resources to the user, could define state and QoS with great precision at any point. However, it’s not fair to SDN or NFV to expect that they will do better than the current state of management, particularly when users expect lower prices down the line and operators expect lower opex.

The basic challenge SDN poses in at least replicating current management knowledge is that, by design, adaptive exchanges don’t determine routes and in fact don’t happen at all. If that’s the case, then there is no way of knowing what the state of the devices is unless the central controller or some other central management element knows it. Which, of course, means that the devices have to provide that state. An SDN controller has to know network topology and has to know the state of the nodes and trunks under its control. If this is true, then the controller can construct the same knowledge of overall network conditions that the network acquired through adaptive exchanges, and you could replicate management data and SLAs.
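To make that concrete, here’s a minimal sketch of the idea (purely my own illustration, not any controller product’s API) in which devices report trunk state to a central view and the controller derives the reachability that adaptive routing would otherwise discover. All class and node names are invented.

```python
from collections import defaultdict, deque

class ControllerView:
    """Hypothetical central view of topology and trunk state."""
    def __init__(self):
        self.links = defaultdict(dict)   # node -> {neighbor: 'up'/'down'}

    def add_trunk(self, a, b, state="up"):
        self.links[a][b] = state
        self.links[b][a] = state

    def set_trunk_state(self, a, b, state):
        # Devices report state changes to the controller instead of flooding them.
        self.links[a][b] = state
        self.links[b][a] = state

    def reachable(self, src, dst):
        """Derive reachability from reported state, replacing adaptive discovery."""
        seen, queue = {src}, deque([src])
        while queue:
            node = queue.popleft()
            if node == dst:
                return True
            for nbr, state in self.links[node].items():
                if state == "up" and nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        return False

# Example: a three-node metro ring with one trunk failed.
view = ControllerView()
view.add_trunk("A", "B")
view.add_trunk("B", "C")
view.add_trunk("A", "C")
view.set_trunk_state("A", "B", "down")
print(view.reachable("A", "B"))   # True, via C; the controller can derive what adaptive routing would find
```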

NFV creates a different set of problems. With NFV the service depends in part on functions hosted on resource pools, and these are expected to offer at least some level of “automatic recovery” from faults, whether that happens by instantiating a replacement copy, moving something, reconnecting something, or scaling something under load. This self-repair means that a fault might exist at the atomic function level but you don’t want to recover from it at the service level till whatever’s happening internally has been completed.

The self-remediation model of NFV has, in the NFV ISG and IMHO, led to a presumption that lifecycle management is the responsibility of the individual virtual network functions. The functions contain a local instance of a VNF management process and this would presumably act as a bridge between the state of resources and their management and the state of the VNFs. The problem of course is that the service consists of stuff other than that single VNF, and the state of the service still has to be composited.

The operators’ architectures for NFV and SDN deployment, now emerging in some detail, illustrate that operators are presuming that there is in the network (or at least in every domain) a centralized service assurance function. This function collects management information from the real stuff, and also provides a means of correlating the data with service state and generating (in some way) the notifications of faults to the service processes. It seems that this approach is going to dominate real SDN and NFV deployment, but the exact structure and features of service assurance aren’t fully described yet.

What seems to have emerged is that service assurance is a combination of three functional elements: aggregation of resource status, service correlation, and event generation. In the first of these, management data is collected from the things that directly generate it, and in some cases at least the data is stored/cached. An analytics process operates on this data to drive what are essentially two parallel processes—resource management and service management. The resource management process is aimed at remedying the problems with physical elements like devices, servers, and trunks. The service management process is designed to address SLA faults, and so it could just as easily replace a resource in a service as require it be fixed—in fact, that would be the normal course.
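Here’s a small illustrative sketch of that three-part split, with all names invented: raw resource status is aggregated and cached, correlated against a map of which services depend on which resources, and turned into service-level events that the resource and service management processes can each act on.

```python
# Hypothetical illustration of aggregation -> correlation -> event generation.
resource_status = {}                      # aggregated/cached status
service_map = {                           # which services depend on which resources
    "vpn-101": ["server-7", "trunk-3"],
    "vcpe-55": ["server-7"],
}

def aggregate(resource, status):
    """Aggregation: collect and cache raw management data."""
    resource_status[resource] = status
    return correlate(resource)

def correlate(resource):
    """Service correlation: find services whose SLA may be affected."""
    impacted = [svc for svc, deps in service_map.items() if resource in deps]
    return [generate_event(svc, resource) for svc in impacted]

def generate_event(service, resource):
    """Event generation: notify the service processes, which may reconfigure
    (service management) while resource management dispatches a repair."""
    return {"service": service, "cause": resource, "type": "SLA_AT_RISK"}

print(aggregate("server-7", "down"))
```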

Service management in both SDN and NFV is analogous to end-to-end adaptive recovery as found in legacy networks. You are going to “fix” a problem by reconfiguration of the service topology and not by actually repairing something. If something is broken, that becomes a job for the resource management processes.

Resource management doesn’t appear to present any unusual challenges. You have device state for everything, and so if something breaks you can fix it. It’s service management that poses a problem because you have to know what to reconfigure and how to reconfigure it.

The easiest way to determine whether a service has faulted is to presume that something internal to the service is detecting and reporting it, or that the service users are reporting it. Again this may seem primitive but it’s not really a major shift from what happens now. If this approach is taken, then the only requirement is that there be a problem analysis process to establish not what specifically has happened but what can be done to remedy the fault by reconfiguration. The alternative is to assume that the service assurance function can identify the thing that’s broken and the services that are impacted.

Both these options seem to end up in the same place. We need to have some way of knowing when a virtual function or SDN route has failed. We need to have a recovery process that’s aimed at the replacement of that which has broken (and perhaps a dispatch task to send a tech to fix a real problem). We need a notification process that gives the user a signal of conditions comparable to what they’d get in a legacy network service. That frames the reality of service assurance.

I think that the failing of both SDN and NFV management to date lies in this requirement set. How, if internal network behavior is not determined by adaptive exchange, does the service user find out about reachability and state? If SDN replaces a switch/router network, who generates the management data that each device would normally exchange? In NFV how do we reflect a virtual function failure when the user may not be directly connected to the function, but somewhere earlier/later in the service chain?

The big question, though, is one of service configuration and reconfiguration. We cannot assume that every failed white box or server hosting a VNF can be recovered locally. What happens when we have to change the configuration of the service enough that the elements outside the failing domain have to be changed to reflect the reconfiguration? If we move a VNF to another data center, don’t we have to reconnect the WAN paths? Same with SDN domains. This is why the issue of recovery is more than one of event generation or standardization. You have to be able to interpret faults, yes, but you also have to be able to direct the event to a point where knowledge of the service topology exists, so that automated processes can reconnect everything. Where is that point?

In the service model, or it’s not anywhere. Lifecycle management is really a form of DevOps, and in particular of the declarative model where the end-state of a service is maintained and compared with the current state. This is why we need to focus quickly on how a service is modeled end-to-end and integrate that model with service assurance, for both legacy and “new” technologies.
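As a rough sketch of that declarative, DevOps-style idea (the element names and structure are mine, not any operator’s model), lifecycle management becomes a reconciliation loop: compare the modeled end-state of the service with the observed state and derive the reconfiguration actions, including the WAN-path changes a VNF move would force.

```python
def reconcile(desired, current):
    """Compare the service model's end-state with the observed state and
    return the actions needed to converge (the declarative-DevOps pattern)."""
    actions = []
    for element, target in desired.items():
        actual = current.get(element)
        if actual is None:
            actions.append(("deploy", element, target))
        elif actual != target:
            actions.append(("reconfigure", element, target))
    for element in current:
        if element not in desired:
            actions.append(("remove", element))
    return actions

# Hypothetical service: moving the VNF means the WAN path must also be redirected,
# which is why the event has to reach a point that knows the whole service topology.
desired = {"vFirewall": {"dc": "metro-east"}, "wan-path": {"to": "metro-east"}}
current = {"vFirewall": {"dc": "metro-west"}, "wan-path": {"to": "metro-west"}}
print(reconcile(desired, current))
```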

Overlay/Underlay Networking and the Future of Services

Overlay networks have been a topic for this blog fairly often recently, but given that more operators (including, recently, Comcast) have come out in favor of them, I think it’s time to look at how overlay technology might impact network investment overall. After all, if overlay networking becomes mainstream, something of that magnitude would have to impact what these networks get overlaid onto.

Overlay networks are virtual networks built by adding what’s essentially another connection layer on top of prevailing L2/L3 technology. Unlike traditional “virtual networks” the overlay networks are invisible to the lower layers; devices down there treat them as traffic. That could radically simplify the creation of virtual networks by eliminating the need to manage the connectivity in a “real” network device, but there are other impacts that could be even more important. To understand them we should start at the top.

There are two basic models of overlay network—the nodal model and the mesh model. In the nodal model, the overlay includes interior elements that perform the same functions that network nodes normally perform in real networks—switching/routing. In the mesh model, there are no interior nodes to act as concentrators/distributors of traffic. Instead each edge element is connected to all the others via some sort of tunnel or lower-level service.

The determinant in the “best” model will in most cases be simply the number of endpoints. Both endpoints and nodes have “routing tables”, and as is the case with traditional routing, the tables don’t have to include every distinct endpoint address, but rather only the portion of an address needed to make a forwarding decision. However, if the endpoints are meshed then the forwarding decision has to be made for each, which means the endpoint routing tables get large and expensive to process.

Interior node points can simplify the routing tables, particularly since the address space used in an overlay network need not in any way relate to the underlying network address space. A geographic/hierarchical addressing scheme could be used to divide a network into areas, each of which might have a collecting/distributing node. Node points can also be used to force traffic along certain paths by putting a node there, and that would be helpful for traffic management.
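The routing-table argument is easy to illustrate. In the sketch below, which assumes an invented geographic/hierarchical address scheme, a nodal overlay forwards on an address prefix, so one table entry can cover an entire region; a fully meshed edge would instead need a tunnel entry per remote endpoint.

```python
def longest_prefix_match(table, address):
    """Pick the most specific matching prefix, as a conventional router would."""
    best = None
    for prefix, next_hop in table.items():
        if address.startswith(prefix) and (best is None or len(prefix) > len(best)):
            best = prefix
    return table.get(best)

# Hypothetical overlay addresses of the form region.metro.site.endpoint.
nodal_table = {
    "us-east.": "node-nyc",      # one entry covers every endpoint in the region
    "us-west.": "node-sfo",
    "eu.":      "node-fra",
}
print(longest_prefix_match(nodal_table, "us-east.nyc.site12.ep4"))  # -> node-nyc

# A meshed endpoint, by contrast, would need an entry (a tunnel) per remote endpoint,
# so its table grows with the number of endpoints rather than the number of regions.
```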

The notion of an overlay-based virtual network service clearly empowers endpoints, and if the optimization of nodal locations is based on sophisticated traffic and geography factors, it would also favor virtual-node deployments in the network interior. Thus, overlay networks could directly promote (or be promoted by) NFV. One of the two “revolutionary elements” of future networking is thus a player here.

So is the other. If tunnels are the goal, then SDN is a logical way to fulfill that goal. The advantage SDN offers is that the forwarding chain created through OpenFlow by central command could pass wherever it’s best assigned, and each flow supported by such a chain is truly a ship in the night relative to others in terms of addressability. If central management can provide proper traffic planning and thus QoS, then all the SDN flows are pretty darn independent.

The big question for SDN has always been domain federation. We know that SDN controllers work, but we can be pretty sure that a single enormous controller could never hope to control a global network. Instead we have to be able to meld SDN domains, to provide a means for those forwarded flows to cross a domain boundary without being elevated and reconstituted. If that capability existed, it would make SDN a better platform for overlay networks than even Ethernet with all its enhancements.

The nature of the overlay process and the nature of the underlayment combine to create a whole series of potential service models. SD-WAN, for example, is an edge-steered tunnel process that often provides multiple parallel connection options for some or even all of the service points. Virtual switching (vSwitch) provides what’s normally an Ethernet-like overlay on top of an Ethernet underlayment, but still separates the connection plane from the transport process, which is why it’s a good multi-tenant approach for the cloud. It’s fair to say that there is neither a need to standardize on a single overlay protocol or architecture, nor even a value to doing so. If service-specific overlay competition arises and enriches the market, so much the better.

Where there obviously is a need for some logic and order is in the underlayment. Here, we can define some basic truths that would have a major impact on the efficiency of traffic management and operations.

The first point is that the more overlays you have the more important it is to control traffic and availability below the overlay. You don’t want to recover from a million service faults when one common trunk/tunnel has failed. This is why the notion of virtual wires is so important, though I want to stress that any of the three major connection models (LINE, LAN, TREE) would be fine as a tunnel model. The point is that you want all possible management directed here. This is where agile optics, SDN pipes, and so forth, would live, and where augmentation of current network infrastructure to be more overlay-efficient could be very helpful.

The second point, which I hinted at above, is that you need to define domain gateways that can carry the overlays among domains without forcing you to terminate and reestablish the overlays, meaning host a bunch of nodes at the boundary. Ideally, the same overlay connection models should be valid for all the interconnected domains so a single process could define all the underlayment pathways. As I noted earlier, this means domain federation has to be provided no matter what technology you use for the underlayment.

The third point is that the underlay network has to expose QoS or class of service capabilities as options to the overlay. You can’t create QoS or manage traffic in an overlay, so you have to be able to communicate between the overlay and underlay with respect to the SLA you need, and you have to then enforce it below.

The final point is universality and evolution. The overlay/underlay relationship should never depend on technology/implementation of either layer. The old OSI model was right; the layers have to see each other only as a set of exposed services. In modern terms, that means that both layers are intent models with regard to the other, and the overlay is an intent model to its user. The evolution point means that it’s important to map network capabilities in the overlay to legacy underlayment implementations, because otherwise you probably won’t get the scope of implementation you need.
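One way to picture that mutual intent-model relationship is as a minimal service interface the underlay exposes to the overlay: a connection model, endpoints, and an SLA, with nothing about the implementation visible. The sketch below is illustrative only; the class and parameter names are my own assumptions.

```python
from abc import ABC, abstractmethod

class UnderlayIntent(ABC):
    """The overlay sees only exposed services (LINE/LAN/TREE) plus an SLA,
    never the underlay's technology or implementation."""
    @abstractmethod
    def request(self, model: str, endpoints: list, sla: dict) -> str:
        ...

class SdnUnderlay(UnderlayIntent):
    def request(self, model, endpoints, sla):
        # A real implementation would program forwarding via a controller;
        # here we just hand back an opaque tunnel identifier.
        return f"tunnel:{model}:{'-'.join(endpoints)}:{sla.get('class', 'best-effort')}"

class LegacyEthernetUnderlay(UnderlayIntent):
    def request(self, model, endpoints, sla):
        return f"evc:{model}:{'-'.join(endpoints)}:{sla.get('class', 'best-effort')}"

# The same overlay request maps onto either implementation, which is the
# evolution point: a legacy underlayment can live behind the same intent.
for underlay in (SdnUnderlay(), LegacyEthernetUnderlay()):
    print(underlay.request("LINE", ["siteA", "siteB"], {"class": "low-latency"}))
```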

You might wonder at this point why, if overlay networking is so powerful a concept, operators haven’t fallen over themselves to implement it. One reason, I think, is that the concept of overlay networks is explicitly an OTT concept. It establishes the notion of a network service in a different way, a way that could admit new competitors. If this is the primary reason, though, it may be losing steam because SD-WAN technology is already creating OTT competition without any formal overlay/underlay structures. The fact that anyone can do an overlay means nobody can really suppress the concept. If it’s good, powerful, then it will catch on.

Can We Apply the Lessons of NFV to the Emerging IoT Opportunity?

I blogged yesterday about the OPNFV project for Event Streams and the need to take a broad view of event-driven software as a precursor to exploring the best way to standardize event coding and exchange. It occurred to me that we’re facing the same sort of problem with IoT, focusing on things that would matter more if we had a broader conception of the top-down requirements of the space. Let me use the same method to examine IoT as I used for the Event Streams announcement—examples.

Let’s suppose that I have a city that’s been equipped with those nice new IoT sensors, directly on the Internet using some sort of cellular or microcellular technology. It’s 4:40 PM and I left work early to get a jump on traffic. So did a half-million others. I decide that I’m going to use my nice IoT app to find me a path to home that’s off the beaten path, so to speak. I activate my app, and what happens?

What I’m hoping for, remember, is a route to my destination that’s not already crowded with others, or will shortly become crowded. That information, the IoT advocates would say, is exactly what IoT can provide me. But how, exactly? If the sensors count cars, I could assume that car counts would be a measure of traffic, but a car counter would count not cars but cars passing it. If the traffic is at a standstill, how many cars are passing? Zero, so I have a bad route choice.

However, it may not be that bad because I may never see the data in the first place. Remember, I have a half-million sharing the road with me, and most of them probably want to get home early too. So what are they doing? Answer: hitting their IoT app to find a route. If that app is querying those sensors, then I’ve got a half-million apps vying for access to a device that might be the size of a paperback book. We have websites taken down by DDoS attacks of that size or smaller, and those sites are supported by big pipes and powerful servers. My little sensor is going to weather the storm? Not likely.

But even if I got through, would I understand the data? I could presume that the sensors would be based on a basic HTTP exchange like the one that would fetch a web page. Certainly I could get something like an XML or JSON payload delivered that way, but what’s the format? Does the car sensor give me the number of cars in the last second, or minute, or hour, or what? Interpreting the data starts with understanding what data is actually being presented after all.
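To make the format question concrete, here’s a hypothetical car-counter payload; every field name is invented, which is exactly the problem. Without an agreed schema, an app has no way to know whether a count covers a second, a minute, or an hour, or what a zero even means.

```python
import json

# A hypothetical sensor response; nothing here is a standard.
raw = json.dumps({
    "sensor_id": "elm-and-3rd",
    "count": 0,                 # zero cars passed... empty road, or gridlock?
    "window_seconds": 60,       # without this field the count is meaningless
    "timestamp": "2016-06-10T16:40:00Z",
})

reading = json.loads(raw)
rate_per_minute = reading["count"] * 60 / reading["window_seconds"]
print(f"{reading['sensor_id']}: {rate_per_minute} cars/minute")
```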

But suppose somehow all this worked out. I’ve made the first turn on my route, and so have my half-million road-sharing companions. Every second, conditions change. How do I know about the changes? Does my app have to keep running the same decision process over and over, or does the sensor somehow signal me? If the latter is the case, how would a specific sensor know 1) who I was, 2) what I wanted and 3) what I had to know about to get what I wanted?

OK, you say, this is stupid. What I’d really do is go to my “iIoT service” on my iPhone, where Apple would cheerfully absorb all this sensor data and give me answers without all these risks. Well, OK, but that raises the question of why a city-full of those IoT sensors got deployed when they’re nothing but a resource for Apple to exploit. Did Apple pay for them? Ask Tim Cook that on the next shareholder call. If Apple is just accessing them on “the Internet” because after all this is IoT, then are Apple and others expecting to pay for the access? If not, why did they ever get deployed? If so, how does that cheap little sensor know who Apple is versus some shameless exploiter of its data?

Maybe, you speculate, we solve some of our problems with the device that started them, our phone. Instead of counting cars, we sense the phones that are nearby. Now we know the difference between an empty street and gridlock. Great. But now we have thousands of low-lifes tracking women and children. Prevent that with access controls and policies, you say? OK, but remember this is a cheap little sensor that you’ve already had to give the horsepower of a superserver to. Now we have to analyze policies and detect bad intent?

Or how about this. A bunch of Black Hats says, gee we could have fun by deploying a couple hundred “sensors” of our own, giving false data, and getting a bad traffic situation to become gridlocked enough that even emergency fire and rescue can’t get through. Or we’re a gang of jewel thieves with an IoT getaway strategy. How do these false-flag sensors get detected?

Sometimes insight comes in small steps. For example, the Event Stream project talks about Agents that get events and Collectors that store them in a database. This kind of structure is logical to keep primary event generators from being swamped by all the processes that need to know the state of resources. Isn’t it logical to assume that this same sort of decoupling would be done in IoT? The project seeks to harmonize the structure of event records; isn’t it logical to assume that sensor outputs would similarly have to be harmonized? Resource information in NFV and sensor data in IoT both require what are essentially highly variable and disorderly sources to be loosely coupled with highly variable and disorderly process sets that interpret the stuff. The issues raised by each would then be comparable.

Once we presume that we need to have common coding for event analysis and some sort of database buffering to decouple the sensors in IoT from the processes, we can resolve most of these other questions because we don’t have a sensor network problem anymore, we have a database problem, and we know how to address all the concerns raised above if we presume that context. But just as Event Streams have to trigger an awareness of the need for contextual event processing, so the existence of a database where sensor data is collected and from which it’s distributed begs the question of what apps do and how public policy goals are maintained.
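A sketch of that decoupling, with all names invented, might look like the following: sensors push readings to a collector that stores them, and the half-million apps query the store (where access policy can actually be enforced) rather than hammering the sensors themselves.

```python
from collections import defaultdict

class Collector:
    """Hypothetical collector: the only thing that ever talks to sensors."""
    def __init__(self):
        self.store = defaultdict(list)     # stand-in for a real database

    def ingest(self, sensor_id, reading):
        # Sensors push here once, no matter how many apps want the data.
        self.store[sensor_id].append(reading)

    def query(self, sensor_id, requester):
        # Policy enforcement lives here, not in the cheap sensor.
        if not requester.get("authorized"):
            raise PermissionError("access policy denied")
        return self.store[sensor_id][-1] if self.store[sensor_id] else None

collector = Collector()
collector.ingest("elm-and-3rd", {"count": 0, "window_seconds": 60})
print(collector.query("elm-and-3rd", {"app": "route-finder", "authorized": True}))
```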

We’re not there yet with IoT. Even IT vendors who make IoT announcements are still kissing low-power protocols and transmitters and not worrying about any of the real issues. And these are the vendors who already sell the databases and analytics products and clouds of servers, who have the technology to present a realistic model.

Way back in 2013, in CloudNFV, I outlined the set of issues that NFV would have to address, and everyone who was involved in that process knows how hard I tried to convince both vendors and operators that key issues were being ignored. It’s now 2016, and we’re now just starting to address them. Could we have today a complete NFV implementation to deploy if we’d accepted those issues in 2013 when they were first raised? My point isn’t to aggrandize my own judgment; plenty of others said much the same thing. Many are saying it now about IoT. Will we insist on following the same myopic path, overlooking the same kind of issues, for that technology? If so, we’re throwing out a lot of opportunity.

Is the New OPNFV Event Streams Project the Start of the Right Management Model?

One of those who comment regularly on my blog brought a news item to my attention. The OPNFV project has a new activity, introduced by AT&T, called “Event Streams” and defined HERE. The purpose of the project is to create a standard format for sending event data from the Service Assurance component of NFV to the management process for lifecycle management. I’ve been very critical of NFV management, so the question now is whether Event Streams will address my concerns. The short answer is “possibly, partly.”

The notion of events and event processing goes way back. All protocol handlers treat messages as events, for example, and you can argue that even transaction processing is about “events” that represent things like bank deposits or inventory changes. At the software level, the notion of an “event” is the basis for one form of exchanging information between processes, something sometimes called a “trigger” process. The other popular form is called a “polled” process because in that form a software element isn’t signaled something is happening, it checks to see if it is.

Many of the traditional management and operations activities of networks have been more polled than triggered because provisioning was considered to be a linear process. As networks got more complicated, more and more experts started talking about “event-driven” operations, meaning something that was triggered by conditions rather than written as a flow that checked on stuff. So Event Streams could be a step in that direction.

A step far enough? There are actually three things you need to make event-driven management work. One, obviously, is the events. The second is the concept of state and the third is a way to address the natural hierarchy of the service itself. If we can find all those things in NFV, we can be event-driven. Events we now have, but what about the rest?

Let’s start with “state”. State is an indication of context. Suppose you and I are conversing, and I’m asking you questions that you answer. If there’s a delay or if you don’t hear me, you might miss a question and I might ask the next one. Your answer, correct in the context you had, is now incorrect. But if you and I each have a recognized “state” like “Asking”, “ConfirmHearing”, and “Answering” then we can synchronize through difficulties.

In network operations and management, state defines where we are in a lifecycle. We might be “Ordered”, or “Activating” or “Operating”, and events mean different things in each state. If I get an “Activate” in the “Ordered” state, it’s the trigger for the normal next step of deployment. If I get one in the “Operating” state, it’s an indication of a lack of synchronicity between the OSS/BSS and the NFV processes. It is, that is, if I have a state defined.

Let’s look now at a simple “service” consisting of a “VPN” component and a series of “Access” components. The service won’t work if all the components aren’t working, so we could say that the service is in the “Operating” state when all the components are. Logically, what should happen then is that when all the components are in the “Ordered” state, we’d send an “Activate” to the top-level “Service object”, and it would in turn generate an event to the subordinates to “Activate”. When each had reported it was “Operating”, the service would enter the “Operating” state and generate an event to the OSS/BSS.
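A minimal sketch of that hierarchy (the states, events, and names are illustrative, not drawn from any specification) might look like this: each component runs its own little state/event logic, and the parent service object changes state only when events from all its subordinates permit it.

```python
class Component:
    def __init__(self, name):
        self.name, self.state = name, "Ordered"

    def handle(self, event):
        # A tiny state/event table: the same event means different things
        # (or nothing) depending on the current state.
        transitions = {("Ordered", "Activate"): "Activating",
                       ("Activating", "Up"): "Operating"}
        self.state = transitions.get((self.state, event), self.state)
        return self.state

class Service:
    def __init__(self, components):
        self.components, self.state = components, "Ordered"

    def handle(self, event):
        if event == "Activate":
            for c in self.components:          # fan the event out to subordinates
                c.handle("Activate")
            self.state = "Activating"
        elif event == "ComponentUp":
            if all(c.state == "Operating" for c in self.components):
                self.state = "Operating"       # report to the OSS/BSS here
        return self.state

svc = Service([Component("VPN"), Component("Access-1"), Component("Access-2")])
svc.handle("Activate")
for c in svc.components:
    c.handle("Up")
print(svc.handle("ComponentUp"))   # -> Operating, once every subordinate is
```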

So what we have here is a whole series of event-driven elements, contextualized (state and relationship) by some sort of object model that defines how stuff is related. It’s not just one state/event process (what software nerds call “finite-state machines”) but a whole collection of such processes, event-coupled so that the behaviors are synchronized.

This concept is incredibly important, but it’s not always obvious why. Here’s an example. Suppose that a single VNF inside an Access element fails and is going to re-deploy. That access element would have to enter a new state, let’s call it “Recovering,” and the VNF that failed would have to signal the condition with an event. Does that access element go non-operational immediately or does it give the VNF some time? Does it report even the recovery attempt to the service level via an event, or does it wait till it determines that the failure can’t be remedied? All of this stuff would normally be defined in state/event tables for each service element. In the real world of SDN and NFV, every VNF deployed and every set of connections could be an element, so the model we’re talking about could be multiple layers deep.

This has implications for building services. If you have a three- or four-layer service model you’re building, every element in the model has to be able to communicate with the stuff above and below it through events, which means that they have to understand the same events and have to be able to respond as expected. So what we really have to know about service elements in SDN or NFV is how their state/event processing works.

Obviously we don’t know that today, because we didn’t have even a consistent model of event exchange, which Event Streams would define. But the project doesn’t define states, nor does it define state/event tables or standardized responses. Without those definitions an architect couldn’t assemble a service from pieces because they couldn’t be sure that all the pieces would talk the same event language or interpret the context of lifecycles the same way.

The net of this is that Event Streams are enormously important to NFV, but they’re a necessary condition and not a sufficient condition. We still don’t have the right framework for service modeling, a framework in which every functional component of a service is represented by a model “object” that stores its state and the table that relates to event-handling in every possible state.

The question is whether we need that, or whether we could make VNF Managers perform the function. Could we send them events? There’s no current mandate that a VNFM process events at all, much less process some standard set of events. If a VNFM contains state/event knowledge, then the “place” of the associated VNF in a service would have to be consistent or the state/event interpretation wouldn’t be right. That means that our VNF inside an access element might not be portable to another access element because that element wanted to report “Recovering” or “Faulting” under different conditions. IMHO, this stuff has to be in the model, not in the software, or the software won’t be truly composable.

I’m not trying to minimize the value of Event Streams here. It’s very important, providing that it provokes a complete discussion of state/event handling in network operations. If it doesn’t, then it’s going to lead to a dead end.

Will Operators Avoid the Same Mistakes they Say Vendors Make in Transformation?

Operators want open source software and they want OCP hardware, or so they say. It would seem that the trend overall is to stamp out vendors, but of course neither of these things really stamps out vendor relationships. They might have an impact on the buyer/seller relationship, though, and on the way that operators buy and build networks. If the model crosses over into the enterprise space, which is very likely, then it could have a profound impact on the market overall.

If there are multiple stages of grief, operators have experienced their own flavor in their relationship with their vendors. Twenty years ago, operators tended to favor big vendors who could provide complete solutions because it eliminated integration issues and finger-pointing. Ten years ago, operators were starting to feel they were being taken advantage of, and many started creating “procurement zones” to compartmentalize vendors and prevent one from owning the whole network. Now, as I said, they’re trying to do what’s really commodity or vendor-independent procurement. That’s a pretty dramatic evolution, and one driven (say almost all operators) by growing concern that vendors aren’t supporting operator goals, but their own.

What created the divergence of goals can be charted. Up until 2001, technology spending by both network operators and enterprises was driven by a cyclical transformation of opportunity that, roughly every fifteen years, introduced a new set of benefits that drove spending growth. New opportunities will normally demand new features from technology products, new differentiators appear, and operators have new revenue to offset costs. Thus, they tend to be looking for ways to realize that new opportunity quickly, and vendors are their friends.

In 2001, the cycle stalled and has not restarted. If you go back to both surveys and articles you can see that this inaugurated a new kind of tech market, where the only thing that mattered was reducing costs. That’s a logical response to a static benefit environment—you can improve financials only by realizing those benefits more cheaply. But cost management usually starts with price pressure on purchases, and that’s what launches an “us-versus-them” mindset among operators and vendors.

The old saying “we have met the enemy and they are us” might then apply now. Both open source software and “COTS” hardware represent a very different kind of purchase because they are commodities. They present operators with a new problem, which is that vendor support for innovation depends on profit from that support. Absent a differentiable opportunity, nobody will support transformation unless operators pay for professional services. Ericsson probably foresaw this shift, and as a result focused more on professional services in its own business model. While realizing the benefits of a shift to commodity network elements has been slower to develop than Ericsson may have hoped, it’s still clearly underway.

But OK, operators want open-source and COTS. Will they be able to get it, and if so who will win out?

If you look at transformation as operators do, you see that their primary goal is what could be called “top-end engagement”. They have benefits—opex and capex reduction and revenue augmentation—and they need to engage those benefits directly. Traditional technology specifications and standards, starting as they normally do at the bottom, don’t even get close to the business layer where all these benefits have to be realized. That’s why the operator approaches seem to focus “above” the standards we know.

So the most important point here is to somehow get architectures to tie back to the benefits that, while they were expanding, drove industry innovation. A friend of mine, John Reilly, did a nice book (available from Amazon in hard copy or Kindle form) called “The Value Fabric: A Guide to Doing Business in the Digital World” that might be helpful in framing business benefits for multiple stakeholders. It’s a way of describing how a series of “digital bridges” can establish the relationship among stakeholders in a complex market. It’s based on a TMF notion, which means it would be applicable to operator ecosystems, and vendors could explore the notion of providing digital bridges to link the stakeholders in an operator ecosystem together. Advertising, content providers, operator partners, hosting providers, and even software providers who license their stuff, are examples of stakeholders that could be linked into a value fabric.

But all of this good stuff is likely to be software, and however valuable software is in the end, it’s not going to cost as much as a couple hundred thousand routers. In fact, it’s reducing that hardware cost that’s the goal for operators. Network vendors are not going to embrace being cost-reduced till they vanish to a point. And if COTS servers and open-source software are the vehicle for diminishing network vendor influence, who’s incented to take their place in the innovation game? In the architectures that operators are promoting, no major vendor is a long-term winner.

I’m not saying this is bad, or even that vendors don’t deserve some angst. I’ve certainly told enough of them where things are heading, and tried to get them to address the issues while there was still time, to have lost sympathy. What I am concerned about is how the industry progresses. Is networking to become purely a game of advertising/marketing at the operator level, and of cheap fabrication of white boxes at the hardware level? If so, are the operators now prepared to drive the bus by investing some of their savings in technologists who can do the deep thinking that will be needed even more in the future?

The network of the future is going somewhere other than the simple substitution game that operators envision. You can see, for example, that personal agents are going to transform networks. So will IoT, when we finally accept it’s not about making every sensor into a fifty-buck-a-month LTE subscriber. The limitation in vendor SDN/NFV architectures is that they try to conserve the legacy structure that progress demands we sacrifice. The limitation in operator architectures is that they constrain their vision of services to the current network, and so forego the longer-term revenue-driven benefits that have funded all our innovations so far.

What’s above commoditizing services? Revolutionary services, we would hope. So let’s see some operators step up and show more determination to innovate than the vendors they’re spurning. Let’s find values to build a fabric around.

Network Feature Composition, Decomposition, and Microservices

At the TMF event in Nice, Verizon opened yet another discussion, or perhaps I should say “reopened” because the topic came up way back in April 2013 and it was just as divisive then. It’s the topic of “microservices” or breaking down virtual functions into very small components. NetCracker also had some things to say about microservices, and so it’s a good thing to be talking about.

If we harken back to April of 2013, we’re at a point where the NFV ISG had just opened its activity. There was still plenty of room to discuss scope and architecture, and there was plenty of discussion on both. This was the meeting where I launched the CloudNFV project, and it was also the meeting where a very specific discussion on “decomposition” came up.

Everyone knows that the purpose of NFV was to compose services from virtual functions. Anything that composes a whole from some parts will be sensitive to just how granular the parts are. We know, for example, that if you compose virtual CPE from four or five functional elements (firewall, NAT, etc.) you get some benefits. If you had a virtual function that consisted of all of these things rolled into one and that was as granular as you got, it’s hard to see how a physical appliance wouldn’t serve better. Granularity equals agility.

The “decomposition” theme relates to this granularity. Here, the suggestion was that operators require that virtual functions be decomposed not only into little feature granules, but even further into what today we’d call “microservices”. There are a lot of common elements in things like firewall, VPN, NAT, and so forth, or so the decomposition camp says. Why not break things down into smaller elements to allow even totally new stuff to be built from the building blocks of the old? It carries service composition downward to function composition.

The operators really liked this, and so did some vendors (Connectem introduced it in a preso I heard), but the major vendors really hated it. They still do, because this sort of decomposition not of services but of functions threatens their ability to promote their own VNFs. But the fact that buyers and sellers are in conflict here is no surprise. The question is whether decomposition is practical, and if it is whether microservices are a viable approach.

Virtually all software that’s written today is already decomposed, in that it’s made up of classes or modules or functions or some other internal component set. My memory of programming techniques goes back to the ‘60s, and I can honestly say that even then there was tremendous pressure from development management to employ modular structures. Even in programming languages like assembler, or machine language, there were features to support “subroutines” or modular elements that called directly on the computer’s instruction set (for those interested, look up “Branch and Link”).

One might think that this long history of support for modularity would mean that it would be no big thing to decompose functions. Not necessarily. Then, as today, the big problem is less dividing software into modules than it is in assembling those modules in any way other than the original way.

Most software that’s composable is really designed to be composed at development time. There are frequently no convenient means provided to determine what data elements are needed and what format they’re expected to be in. Worse yet, the flow of control among the components may implicitly depend on efficient coupling—local passing of parameters and execution. For something to be a “service” or “microservice” in today’s terms, it would have to accept loose coupling through a network connection. That’s something that adds complexity to the software (how do you know where the component is and whether it’s available?) and also can create enormous performance issues through introduction of network delays into frequently used execution paths.

The point is that it’s an oversimplification to say that everything has to be decomposed and recomposed. There are plenty of examples of things that shouldn’t or couldn’t be. However, there are also examples of vendor intransigence and a desire to lock in customers, and quite a few of the functions that could be deployed for NFV could be decomposed further. Even more could be designed to be far more modular than they are. We have to strike a balance somehow.

NetCracker’s concept of making more of NFV and operations modernization about microservices is an example of how that could be done. If there’s a service whose lifecycle events are so frequent that they are almost data-plane functions, that service has a serious problem no matter how you deploy it. Generally, management and operations processes have relatively few “events” to handle. State/event tables are the most common way to represent lifecycle process phases and their response to events, and the intersection of the states and events defines a component, a “microservice” if you like, and one that’s activated infrequently enough that network coupling wouldn’t be a problem. I’ve advocated this approach from the first, back to that 2013 meeting of the ISG.
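As a sketch of what I mean (all names invented), the state/event table itself becomes the composition mechanism: each state/event intersection names a handler, and because each handler is small and fires only when its event arrives, it’s a natural microservice that could just as easily be network-hosted as local.

```python
# Each cell of the state/event table names a handler; in a microservice
# implementation the handler could be a local function or a remote endpoint.
def start_deployment(ctx):  return "Activating"
def resync_with_oss(ctx):   return "Operating"
def begin_recovery(ctx):    return "Recovering"

STATE_EVENT_TABLE = {
    ("Ordered",   "Activate"):  start_deployment,
    ("Operating", "Activate"):  resync_with_oss,     # same event, different meaning
    ("Operating", "VNF_Fault"): begin_recovery,
}

def dispatch(state, event, context=None):
    handler = STATE_EVENT_TABLE.get((state, event))
    if handler is None:
        raise ValueError(f"no handler for event {event!r} in state {state!r}")
    return handler(context)

print(dispatch("Operating", "VNF_Fault"))   # -> Recovering
```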

Event-driven OSS/BSS is one way of stating a goal for operations evolution—another is “agile”. Whatever the name, the goal is to make operations systems respond directly to events rather than imposing a flow as many systems do. This goal was accepted by the TMF almost a decade ago, but most operations systems don’t achieve it. A microservice-based process set inside a state/event lifecycle structure would be exactly what the doctor (well, the operator) ordered.

If we want to go further than this, into something composable even when the components have to stay local to each other, then we need to define the composition/execution platform much more rigorously. An example, for those who want more detail, is Java’s OSGi (the Open Services Gateway initiative), which has both a local and a remote service capability. Relatively few network functions now residing in physical network devices conform to this kind of architecture, which means you’d have to rewrite stuff or apply the microservices-and-decomposition model to new functions only.

It’s hard for me to see this stuff and not think of something like CHILL or Erlang or Scala—all of these are specialized languages that could be applied to aspects of virtual-function development. If you’re going to develop for a compositional deployment that ranges from local to network-coupled, you might want to make the location and binding of components more abstract. If you want to be able to do this in any old language you may need to define a PaaS in which stuff runs and make binding of components an element of that, so you can adapt to the demands of the application or to how its owners want to deploy it.
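Purely as an illustration (the class and function names here are mine, not from OSGi or any operator architecture), a binding layer in such a PaaS might hide whether a component is co-resident or network-coupled:

```python
import json
import urllib.request

class LocalBinding:
    """Component in the same process: an ordinary, cheap function call."""
    def __init__(self, func):
        self.func = func

    def invoke(self, **params):
        return self.func(**params)

class RemoteBinding:
    """Component reached over the network: same interface, very different cost."""
    def __init__(self, url):
        self.url = url

    def invoke(self, **params):
        req = urllib.request.Request(
            self.url,
            data=json.dumps(params).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())

def bind(component_name, registry):
    """The composition layer picks the binding at deployment time;
    the caller neither knows nor cares which one it got."""
    return registry[component_name]
```

The caller writes bind("nat", registry).invoke(...) either way (the registry here is a hypothetical deployment-time artifact); the local-versus-remote decision moves out of the code and into the deployment, which is exactly what a compositional PaaS would have to provide.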

Microservices, composable operations, and “decomposition” of network functions are all good things, but there’s a lot more to this topic than meets the eye. Software agility at the level that operators like Verizon or vendors like NetCracker want demands different middleware and different programming practices. The big challenge isn’t going to be accepting the value of this stuff, or even getting “vendor support” for the concept. It’s going to be finding a way to advance something this broad and complex as a complete architecture and business case. We’ve not figured that out for something relatively simple, like SDN or NFV.

Vendors Aren’t Driving SDN/NFV Anymore, so What Now?

There is an inescapable conclusion to be drawn from recent industry announcements: Vendors have lost control of SDN and NFV, which means they’ve lost control of the evolution of networking. Operators, in a state of self-described frustration with their vendors’ support for transformation goals, have taken matters into their own hands. I’ve gotten emails over the last ten days from strategists and sales types in the vendor community, and they’re all asking the same question, which is “What now?” It’s a good question in one sense, and it’s too late to ask it in another—or at least too late to have the full set of choices on the table. But there are always paths forward, some better than others, so we need to look at them.

In a prior blog I made the point that commoditization of connection services was inevitable, and that it was also inevitable that operators will spend less on capital equipment at L2/L3 than they have in the past. Accepting this truth, I’ve said, is critical to vendors who have historically depended on these layers for their revenue and profits.

The up-front truth for this blog is that it is no longer possible for vendors to control the SDN and NFV revolution even if they were to step up now and do what should have been done all along. I’ve noted what should have been done many times, and in any case it’s too late to do it. Buyers have taken their own path now, and vendors need to fit into the operators’ programs rather than try to define their own. I’m not saying they don’t need to pay attention to the focus on opex or to the need to develop a holistic SDN/NFV business case, only that doing those things won’t give them control of the game anymore.

The key to accommodating operator initiatives seems to start with sophisticated service modeling. All SDN and NFV modeling, and the associated APIs and orchestration, derive from the software concept of “DevOps”, which defined a way of describing the deployment of software elements and their connection into systems we’d call applications. There have always been two models of DevOps, one that describes the steps to take (the “prescriptive” model) and one that describes the desired end-state (initially called “declarative” but increasingly called the “intent model”). The critical first step vendors need to take in modeling is to adopt declarative/intent modeling.

“How-to” modeling cannot be general—it has to be a process description that naturally depends on what you’re doing and where you’re doing it. If you describe a system of VNFs in terms of its intent, you can deploy it on any convenient platform. If you say how to deploy it, you can deploy only on the target upon which your instructions were based. All the emerging operator architectures make it clear that a wide variety of platforms, including legacy “physical network functions” or PNFs, have to be supported for any feature. The thing, as they say, speaks for itself (for Latin/legal fans, “res ipsa loquitur”).
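To put the contrast in concrete terms (the fields and values below are invented for illustration, not taken from any standard), a prescriptive recipe and an intent model for the same firewall feature might look like this:

```python
# Prescriptive ("how-to"): only valid on the exact target it was written for.
prescriptive_steps = [
    "create VM on host cluster-east-7",
    "attach vNIC to VLAN 1204",
    "load firewall image v3.2",
    "push config template fw-smallbiz.cfg",
]

# Declarative/intent: what the result must be. Any platform that can satisfy
# it -- a VNF, a legacy PNF, or something else entirely -- is acceptable.
firewall_intent = {
    "function": "firewall",
    "sla": {"availability": 0.9999, "max_latency_ms": 5},
    "ports": {"inside": "customer-lan", "outside": "access-uplink"},
    "policy": "fw-smallbiz",
}
```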

I personally think an intent model approach would be ideal across the board, meaning everywhere from top to bottom in an implementation. However, it is essential only at certain key points in the structure of an SDN/NFV package:

  • At the top, where SDN/NFV software interfaces with current OSS/BSS systems.
  • Underneath “End-to-End Orchestration” or EEO, to define the way that infrastructure-based behaviors are collected into functional units.
  • At the “Infrastructure Manager” boundary, to describe how a given behavior is actually deployed and managed for one or more of its hosting options.

Each of these points represents a hand-off that operators are insisting be open, which means that the implementation below has to be represented to the implementation above. Intent modeling makes that mutual representation practical.

The second point that vendors have to enforce in their implementations is the notion of a VNF PaaS. All of the APIs that a VNF presents as an interface have to be connected with a logical paired function, and all of the SDN/NFV and management APIs that a VNF would be expected to use have to be offered in a uniform way to the “virtual space” in which VNFs run. This same requirement exists in a slightly different form for SDN, but in my view it would be met there by supporting an intent model “above” the SDN controller.

This is going to be the most important issue for NFV, I think. Absent a PaaS-like framework, there is no meaningful portability/onboarding, and no way to contain integration cost and risk. Commercial VNF vendors are likely to tie up with NFV partners (as they have already) and integrate only with these partners, which opens a risk of each setting licensing terms that operators will find offensive because there’s little or no competition. Open-source could be totally excluded from the picture.

A “VNF” is a system, a black box that provides a feature or features, asserts an explicit SLA, and contains a range of deployment options that could adapt to conditions by scaling or replacement. All of this good stuff should happen inside the box, with specific contained APIs to link the functionality with the rest of the service ecosystem. Absent that, we have no reliable integration, and we are absent that now.
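Here’s one way to sketch that black box in code; the interface names are hypothetical and a real packaging would carry far more, but the shape is the point: the SLA and a few contained APIs are visible, the scaling and replacement machinery is not.

```python
class VNFBlackBox:
    """A feature packaged as a black box: it asserts an SLA, exposes a small
    set of contained APIs, and keeps scaling/replacement entirely internal."""

    def __init__(self, feature, sla):
        self.feature = feature
        self.sla = sla              # the SLA the box asserts to the outside
        self._instances = []        # hidden: placement, scaling, replacement

    # Contained APIs -- the only places the rest of the service touches it.
    def deploy(self):
        self._instances.append({"replica": len(self._instances) + 1})

    def status(self):
        # Outsiders see SLA conformance, never internal topology.
        return {"feature": self.feature, "sla_met": bool(self._instances)}

    def scale_to(self, replicas):
        while len(self._instances) < replicas:
            self.deploy()
```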

The next point is perhaps the largest problem, and one that would have to be solved in order to solve the VNF PaaS challenge. Management, meaning lifecycle management at all levels, has to be defined explicitly or nothing can be integrated at all—no VNFs, no NFVI, nothing. The current model is kind of like the software equivalent of the universal constant (“that number which, when multiplied by my answer, yields the correct answer”). We have the VNF Manager, which might be integrated with each VNF, might be centralized, or might be a combination of both. What is integrated with a VNF is part of a tenant service, and what is centralized is part of the management system. You can’t float between these two environments, because it’s neither secure nor reliable to do so, any more than you can let applications change the operating system.

The really big problem here is that the industry approached all this from the bottom, and you can’t really do management right except from the top. You manage services against the SLA. You manage service components against the behavior that you set for them to secure the SLA, and you manage resources to the standards required to make those component-level behaviors work. Management should be linked to modeling, so that every model layer has appropriate SLAs and management definitions. That way you have management of the system of functions that make up a service, down to the system of resources that support the functions.
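A sketch of what “management linked to modeling” could mean structurally (all names and numbers here are invented): every layer of the model carries the SLA it’s managed against, so a management process can walk the hierarchy from the service down to the resources.

```python
# Hypothetical layered model: each layer carries its own SLA.
service_model = {
    "name": "business-vpn",
    "sla": {"availability": 0.9999},
    "functions": [{
        "name": "vpn-core",
        "sla": {"max_packet_loss": 0.0001},
        "resources": [{
            "name": "edge-hosting-east",
            "sla": {"min_cpu_headroom": 0.20},
        }],
    }],
}

def layers(node):
    """Walk the model top-down, yielding each layer with the SLA it is
    managed against: service first, then functions, then resources."""
    yield node["name"], node.get("sla", {})
    for child in node.get("functions", []) + node.get("resources", []):
        yield from layers(child)
```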

The final point for SDN/NFV vendors is to focus strongly on federation, not only across operator boundaries but across implementations of SDN and NFV at the lower level. “Federation” in my context means supporting an autonomous implementation at some level by representing it as an opaque model to the level above.

A good modeling approach will take you a long way toward federation support of this sort, because an intent model makes the “who” and “how” opaque to the higher-level orchestration process. However, there are a number of commercial relationships possible among operators, and there are always going to be a number of different approaches to sharing management data.

Accommodating the commercial relationship is an implementation issue with intent modeling. The decomposition of a model representing a federated lower-level (or partner) element just means activating whatever that lower-level process might be, at any appropriate level. So you could have a “treaty federation” where billing data didn’t have to be exchanged, or one where the order process in one domain was activated by the orchestration in another.

The management stuff could be more complicated, depending on how good the management model is to start with. If you presume my preferred approach, which is a repository in which all “raw” management data is collected and from which management APIs present query interfaces, then there’s no real issue in controlling what a partner sees or how it should be interpreted.
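As a tiny illustration (the record fields and policy structure are my own, not any operator’s), the repository-and-query approach makes controlling a partner’s view almost mechanical:

```python
import time

repository = []   # all "raw" management data lands here, whatever the source

def collect(record):
    record["collected_at"] = time.time()
    repository.append(record)

# Each partner sees the repository only through a policy-shaped query API.
PARTNER_POLICY = {
    "partner-a": {"fields": ["element", "status"], "domains": ["east-metro"]},
}

def partner_query(partner, domain):
    policy = PARTNER_POLICY.get(partner, {"fields": [], "domains": []})
    if domain not in policy["domains"]:
        return []
    return [{k: r[k] for k in policy["fields"] if k in r}
            for r in repository if r.get("domain") == domain]
```

Collect a record like {"element": "vFW-12", "status": "up", "domain": "east-metro", "cpu": 0.41} and partner-a’s query returns only the element and its status; the raw CPU reading never crosses the federation boundary.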

In some respects, operator architectures could make it easier on vendors. If they fit in the architecture they don’t have to offer a complete solution or sell the entire SDN/NFV ecosystem, which could create focused procurements and shorter sales cycles. It certainly will facilitate more limited, service-specific applications of SDN and NFV, as long as they can be fit into the operator’s holistic model. It’s also surely an indication that the SDN/NFV space is maturing, moving from media hype to the real world. It’s just important to remember that doesn’t mean media hype becomes the real world. Operator architectures are the proof of that.

The Critical Open-Source VNF: How We Could Still Get There

One of the most logical places for operator interest in open-source software to focus is in the area of virtual network functions (VNFs). Most of the popular functions are available in at least one open-source implementation, and operators have been grousing over the license terms for commercial VNFs. It would seem that an open-source model for VNFs would be perfect, but we seem to have barriers to address in making the approach work.

VNFs are the functional key to NFV because they’re the stuff that all the rest of the NFV specifications are aimed at deploying and sustaining. Despite this, VNFs have in some sense been the poor stepchild of the process. From the first, everyone has ignored the fundamental truth that defines VNFs—they’re programs.

Virtually all software today is written to run on a specific platform, with hardware and network services provided through application program interfaces (APIs) presented either by an operating system or by what’s called “middleware”, system software that performs a special set of useful functions to simplify development. In some cases, the platform (and in particular the middleware) is independent of the programming language, and in others it’s tightly integrated. Open-source software is no exception.

A convenient way to visualize this is to draw a box representing the program/component, and then show a bunch of “plugs” coming out of the box. These plugs represent the APIs the program uses, APIs that have to be somehow connected to services when it’s run. Let’s presume these plugs are blue.

When something like NFV comes along, it introduces an implicit need for “new” middleware because it introduces at least a few interfaces that aren’t present in “normal” applications. If you look at the ETSI diagrams you see some of these reference interfaces. These new APIs add new plugs to the diagram, and if you envision them in a different color like red, you can see the challenge that NFV poses. You have to satisfy both the red and blue APIs or the software doesn’t run.

A piece of network software of the sort that could be turned into a virtual function also has implicit external network connections to satisfy. A typical software component might have several network ports—one for management access, one as an input port and another as an output port. Each of these ports has an associated protocol—for example, a management port might support SNMP or a web API (port 80). Data ports might have IP, Ethernet, or some other network interface (to connect to a tunnel, for example).

Then there are what we might call “implicit” plugs and sockets. Virtual functions have a lifecycle process set, meaning that they have to be parameterized, activated, sustained in operation, perhaps scaled in or out—you get the picture. This lifecycle process set may or may not be recognized by the software. Scaling, for example, could be done using load balancing and control of software instances even if the software doesn’t know about it. But something has to know, because the framework has to connect all the elements and make them work, even when there are many components with many plugs and sockets to deal with.

What this means is that when a piece of open-source software is viewed as a virtual function, it will have to be deployed in such a way that all the plugs from the software align with sockets in the platform it runs on, and all the sockets presented by NFV interfaces line up with some appropriate plug. How that might happen depends on how the software was developed.

If we presume that somebody built an open-source component specifically for NFV, we could presume that the software itself would harmonize all the plugs and sockets for all the features. The same thing could be true if the software was transplanted from a physical appliance and altered to work as a VNF. Operators tell me that there is very little truly customized VNF software out there in any form, much less open-source.

The second possibility is to adopt what might be considered a variation on the “VNF-specific VNF Manager (VNFM).” You start with a virtual function component that provides the feature logic, and you combine it with custom stuff that harmonizes the natural plugs and sockets and connectivity expected by the function with the stuff needed by NFV. This combination of functional component and management stub then forms the “VNF” that gets deployed. Operators tell me that most of the VNFs they are offered use this approach, but also that only a very few open-source functions have been so modified.

The final possibility is that you define a generic lifecycle management service that talks to whatever plugs are available from the function component, and makes the necessary connections inside NFV to do deployment and lifecycle management. I’ve proposed this approach for both the original CloudNFV project and my ExperiaSphere model, but operators tell me that they don’t see any signs of adoption by vendors so far.

All of these options for open-source virtual functions raise two very specific issue sets—deployment (the NFV Orchestrator function) and lifecycle management (VNFM). For each, current trials and tests have exposed a single most-significant challenge.

In deployment, the problem is that open-source software’s network connection expectations are quite diverse. In some cases the software uses one or more Ethernet ports, and in others it expects to run on an IP subnet, sometimes alongside other components, and nearly always with the aid of things like DNS and DHCP services. One challenge this presents is that “forwarding graphs” showing the logical flow relationship of a set of VNFs may do little or nothing to describe how the actual network connectivity would have to be set up.
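The gap is easy to see if you put the two side by side; the values below are invented, but the shape of the problem is real:

```python
# What a logical forwarding graph actually says:
forwarding_graph = ["classifier", "firewall", "nat", "router"]

# What the deployment actually has to build (illustrative values only):
connection_plan = {
    "subnet": "10.40.7.0/24",
    "shared_services": ["dns", "dhcp"],
    "attachments": {
        "firewall": {"in": "eth0", "out": "eth1", "mgmt": "eth2"},
        "nat":      {"in": "eth0", "out": "eth1"},
    },
    "tunnels": [{"from": "nat:out", "to": "router:in", "type": "vxlan"}],
}
```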

In the lifecycle management case, there are two challenges. One is to present some coherent management view of the VNF status. In the ETSI model this is the responsibility of the VNFM, which is often integrated with the VNF, but I don’t think that’s workable, because horizontal scaling means the VNF may be instantiated in multiple places. The other challenge is getting the VNF information on its own resources. You can’t have a tenant service element accessing real resource management data, particularly if it plans to then change variables to control behavior.

I’ve said in prior blogs that VNF deployment should be viewed as platform-as-a-service (PaaS) cloud deployment, where the platform APIs come from a combination of operating system and middleware tools deployed underneath the VNFs, and connectivity and control management tools deployed alongside. We have never defined this space properly, which means that there is no consistent way of porting software to become a VNF and no consistent way to onboard it for use.

What’s needed here is a simple plug-and-socket diagram that defines the specific way that VNFs talk to NFV elements, underlying resources, and management systems. The diagram has to show all of the plugs and sockets, not only for the base configuration of the VNF but also for any horizontally scaled versions, including any load balancers needed.
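A sketch of what such a descriptor might contain follows; the keys are invented for illustration and are neither ETSI’s VNFD nor any vendor’s format, but they show the plugs, the sockets, and the scaled configuration living in one place.

```python
vnf_descriptor = {
    "function": "open-source-firewall",
    "plugs": {                      # what the software expects to call
        "platform": "linux-x86_64",
        "middleware": ["dns", "dhcp"],
        "nfv": ["parameter-injection", "heartbeat"],
    },
    "sockets": {                    # what the software exposes to be called
        "management": {"protocol": "http", "port": 8080},
        "data_in":  {"protocol": "ethernet"},
        "data_out": {"protocol": "ethernet"},
    },
    "scaling": {
        "horizontal": True,
        "load_balancer": {"required": True, "attach_to": "data_in"},
        "max_instances": 8,
    },
}

def unmatched_plugs(descriptor, platform_services):
    """Onboarding check: every plug has to line up with a platform socket."""
    wanted = descriptor["plugs"]["middleware"] + descriptor["plugs"]["nfv"]
    return [p for p in wanted if p not in platform_services]
```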

Open source is not the answer to this problem; like any other software, it has to run inside some platform. In fact, the lack of a defined platform puts the application of open-source software to VNFs at particular risk, because adapting the software takes resources, and in the open-source world the commercial interest in covering that cost is diminished.

Operator initiatives like the recent architecture announcements from AT&T and Verizon take a step in the right direction, but they’re not there yet. I’d love to see these operators step up and define that VNF PaaS framework now, so we can start to think about the enormous opportunity that open-source VNFs could open for them all.