The increasing use of computer systems in the military and aerospace world and the escalating complexity of applications are placing new demands for continuous operation on next-generation network infrastructures. These go beyond traditional requirements, such as tolerance of shock, vibration, extreme temperatures and environmental hazards, to include the network-centric concept of service availability. Service availability implies a service is always available, regardless of hardware, software or user fault, and is often taken for granted until downtime occurs. It can easily be overlooked or discounted, even when it is crucial to the successful deployment of mil/aero applications in the field.
There are both real-world and theoretical examples of what can happen when service availability is overlooked or not fully specified. Over the last three years, the FAA’s aging National Airspace Data Interchange Network (NADIN) system, which tracks more than 1.5 million flight schedules daily, has experienced a number of major service outages. These have resulted in delays of up to 6 hours at as many as 100 airports at a time on a nationwide basis. Perhaps most troubling of all is that in many of these incidents, the FAA was unable to trace the cause of the problem in the 24-year-old system.
Need for Service Availability
In modern guidance systems, a lack of service availability can have deadly consequences. For example, in ship-to-air systems, missiles are guided by shipborne radar until they come within self-guidance range of their target (Figure 1). Until then, the shipborne control systems must continuously take radar measurements of the target, compute the missile guidance commands and send them to the missile. Concurrently, the command and control systems must collect the missile’s health status and determine whether it is within permissible parameters. If it is not, the missile must be destroyed. A failure in any part of this process, whether a missed control loop closure time, a computer hardware failure, a software failure or a network failure, can have disastrous consequences, including missing the target or losing the missile, either of which could lead to the loss of the ship.
Figure 1
Service availability technology is particularly suited for complex ship-borne radar systems like those aboard Aegis guided-missile destroyers (USS Roosevelt DDG 80 shown). There, computer hardware failures, software failures or network failures can have disastrous consequences—including missing the target or losing the missile.
These examples highlight the importance of service availability in the daily operation of military and aerospace systems. Further, the unpredictable and often unclear causes of problems reveal the need for a transparent and reliable approach to service availability that prevents outages.
Proprietary Problems
Service availability concepts are nothing new in the military and aerospace world, but they have generally been addressed on a system-by-system, application-by-application basis. While this enables operation to be tailored to specific requirements, it results in highly proprietary systems, which become less flexible over time and, as a result, much more costly to maintain. The challenges of upgrading the FAA’s aging NADIN system are a prime illustration of this.
An alternative approach is to leverage commercial systems and service availability (SA) middleware to provide a common infrastructure basis for service availability. The benefits of infrastructure options based on open specifications can be substantial.
Since the early 2000s, two complementary organizations have developed open specifications to help create commercial ecosystems around five-nines (99.999% uptime, or five minutes and 15 seconds of downtime per year) SA-based systems for markets including mil/aero. These specifications include the AdvancedTCA specifications developed by the PCI Industrial Computer Manufacturers Group (PICMG), and the Hardware Platform Interface (HPI) and Application Interface Specification (AIS) defined by the Service Availability Forum.
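The five-nines figure quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming an average year of 365.25 days; the function names are illustrative and not part of any specification.

```python
# Downtime budget per year for a given availability level.
# Assumption: an average year of 365.25 days (leap years included).
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def annual_downtime_seconds(availability: float) -> float:
    """Return the allowed downtime in seconds per year.

    `availability` is a fraction between 0 and 1, e.g. 0.99999 for five nines.
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * SECONDS_PER_YEAR


def format_minutes_seconds(seconds: float) -> str:
    """Format a duration as whole minutes and whole seconds (truncated)."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes} min {secs} s"


if __name__ == "__main__":
    levels = [("three nines", 0.999), ("four nines", 0.9999), ("five nines", 0.99999)]
    for label, availability in levels:
        budget = annual_downtime_seconds(availability)
        print(f"{label}: {format_minutes_seconds(budget)} per year")
```

For 0.99999 this yields 5 min 15 s per year, which agrees with the figure given above. Each additional nine cuts the downtime budget by a factor of ten.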
No Single Point of Failure
AdvancedTCA specifications are based on a no-single-point-of-failure concept, and they give the mil/aero equipment provider considerable flexibility in the systems that can be built. These systems are defined around a chassis with multiple switch fabric options, with large systems generally accommodating 14 card slots. ATCA guidelines on power and cooling include a 200W per slot maximum and a 48V DC power requirement. Although it is not required, systems generally have multiple power and cooling units, allowing for field replacement without taking a system out of service during deployment.
A fundamental concept is the base fabric, which provides a dual star Gigabit Ethernet interconnection mechanism between all blades in a system (Figure 2). This enables two switch modules to be placed in a system, with up to 12 payload blades and with every blade interconnected via each of the switch blades, providing an inherently redundant system. The base fabric is typically used for control traffic within the system itself.
Figure 2
A typical ATCA system will separate fabrics into a base fabric and a data fabric. This enables two switch modules to be placed in a system, with up to 12 payload blades and with every blade interconnected via each of the switch blades, providing an inherently redundant system.
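The redundancy argument behind the dual star base fabric can be illustrated with a small model. This is only a sketch of the topology described above, not an implementation of any PICMG interface; the blade and switch names are hypothetical, and the model assumes that two blades can communicate whenever they share at least one working switch module.

```python
from itertools import combinations


def build_dual_star(num_payload_blades: int = 12) -> dict:
    """Map each payload blade to the set of switch modules it is wired to.

    In a dual star, every payload blade has a link to both switch modules.
    The names "switch_a" and "switch_b" are hypothetical.
    """
    switches = {"switch_a", "switch_b"}
    return {f"blade_{i}": set(switches) for i in range(1, num_payload_blades + 1)}


def can_communicate(links: dict, blade_x: str, blade_y: str, failed=frozenset()) -> bool:
    """Two blades can talk if they share at least one switch that has not failed."""
    shared_working = (links[blade_x] & links[blade_y]) - set(failed)
    return bool(shared_working)


def all_pairs_connected(links: dict, failed=frozenset()) -> bool:
    """Check every pair of payload blades against the given set of failed switches."""
    return all(
        can_communicate(links, x, y, failed) for x, y in combinations(links, 2)
    )


if __name__ == "__main__":
    fabric = build_dual_star()
    print("no failures:", all_pairs_connected(fabric))
    print("switch_a failed:", all_pairs_connected(fabric, {"switch_a"}))
    print("both switches failed:", all_pairs_connected(fabric, {"switch_a", "switch_b"}))
```

In this model the loss of either switch module leaves every blade pair connected, and only the loss of both isolates the blades, which is the sense in which the dual star is inherently redundant.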
Payload traffic is generally carried on the data fabric. The PICMG specifications allow for a range of options in this area, including a dual star approach and full mesh interconnectivity, with a broad range of transport technology options. The industry has consolidated around a dual star topology with Ethernet as the preferred interconnection mechanism, although some special applications use alternatives, such as Serial RapidIO. The commercial industry is now in its third and fourth generations of systems; thus, a robust ecosystem now exists where many of the interworking issues among multiple manufacturers have been resolved. The current data fabric standard is 10 Gbit/s, with 40 Gbit/s data fabric systems starting to appear.
Separate Fabrics for Control and Payload
The use of separate fabrics for internal control and payload applications enables system designers to provide increased predictability of performance. Configuration, management and SA monitoring can be carried over the base fabric and there is no fear that sudden high-payload activity on the data fabric will inadvertently disrupt internal operation or accidentally trigger a service availability failover event. Similarly, the data fabric traffic is not subject to sudden spikes in system control traffic, which may affect application performance. In the case of the NADIN system and the missile control example, such factors could be critical.
However, a robust open hardware system only addresses part of the service availability issue. With the capability to install multiple applications with a wide variety of requirements in a single ATCA system, it is really the service availability middleware that ties the system together, enabling systems designers to take full advantage of the underlying hardware flexibility.
The Service Availability Forum (SA Forum) was founded in 2001 to address the issues of defining services and application programming interfaces for an open specification approach to continuous hardware and software operation. The concept of the SA Forum specifications is to create abstractions between the underlying hardware and the application environment to support standards-based SA middleware implementations. Since inception, the SA Forum has created a critical mass of specifications for service availability. The robust offerings include the Hardware Platform Interface (HPI), which abstracts the hardware from management middleware and makes each independent of the other, as well as the Application Interface Specification (AIS), which standardizes the interface between SA Forum-compliant, high-availability (HA) middleware and service applications (Figure 3).
Figure 3
The SA Forum’s Hardware Platform Interface (HPI) is used to abstract low-level information from hardware so that it can be accessed and programmed through common interfaces.
Abstracting Low-Level Hardware Info
HPI is used to abstract low-level information from hardware so that it can be accessed and programmed through common interfaces. This enables applications directly accessing hardware functions and receiving hardware events to run on multiple platforms with minimal modification. Indeed, HPI is now implemented in many commercial and proprietary platforms and is viewed as a market success. HPI exposes a set of platform-defined management instruments and through the HPI interface, the various instruments can be read and configured. Common application triggers, such as voltage drops or watchdog timer expirations, constitute failure “events,” which serve as inputs to AIS high-availability middleware. The specifications also allow for instrument grouping to create resource records that can then be further subdivided into domains with a common set of capabilities.
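The way an instrument reading becomes a failure event can be sketched as follows. This is a simplified illustration and does not use the actual HPI API: the `Sensor` and `FailureEvent` types, the function names and the threshold values are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Sensor:
    """Hypothetical management instrument with critical thresholds."""
    name: str
    lower_critical: float
    upper_critical: float


@dataclass
class FailureEvent:
    """Hypothetical event record passed up to the HA middleware."""
    sensor: str
    reading: float
    reason: str


def check_reading(sensor: Sensor, reading: float) -> Optional[FailureEvent]:
    """Return a FailureEvent if the reading crosses a threshold, else None."""
    if reading < sensor.lower_critical:
        return FailureEvent(sensor.name, reading, "below lower critical threshold")
    if reading > sensor.upper_critical:
        return FailureEvent(sensor.name, reading, "above upper critical threshold")
    return None


def collect_events(sensor: Sensor, readings: Iterable[float]) -> List[FailureEvent]:
    """Scan a series of readings and keep only those that raise an event."""
    events = []
    for reading in readings:
        event = check_reading(sensor, reading)
        if event is not None:
            events.append(event)
    return events


if __name__ == "__main__":
    # Threshold values are illustrative only, not taken from any specification.
    rail = Sensor(name="dc_supply_rail", lower_critical=40.0, upper_critical=57.0)
    for event in collect_events(rail, [48.1, 47.9, 38.5]):
        print(f"{event.sensor}: {event.reading} V {event.reason}")
```

The point of the abstraction is that the middleware only ever sees records like `FailureEvent`, so the same failover logic can run unchanged on different hardware.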
AIS is significantly more sophisticated as it provides the set of services necessary to support highly available software applications. All HA middleware implements most or all of these services, as they are fundamentally necessary for an “always on” system. What is different is the layered approach and open forum collaboration to create application- and platform-agnostic architectural models with a rich set of APIs. The AIS specification includes a set of core services, such as checkpointing, cluster management, event handling, etc., which are the necessary underpinnings for any system. Additionally, and perhaps where the SA Forum has brought the most value, services and frameworks provide standardized mechanisms for managing an integrated SA environment through an availability management framework, along with a standard mechanism to manage an overall environment, including both hardware, through HPI, and software. The most recent additions complete the core services with a software management framework, enabling seamless upgrade and downgrade campaigns to be implemented.
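To make the role of checkpointing concrete, the sketch below shows a standby instance resuming from the state that an active instance saved before it failed. It is a toy model, not the AIS Checkpoint Service or Availability Management Framework API: the `CheckpointStore` class is an in-memory stand-in for a replicated checkpoint service, and every name in it is hypothetical.

```python
class CheckpointStore:
    """In-memory stand-in for a replicated checkpoint service (hypothetical)."""

    def __init__(self):
        self._sections = {}

    def write(self, section: str, data: dict) -> None:
        # Store a copy so later changes by the caller do not alter the checkpoint.
        self._sections[section] = dict(data)

    def read(self, section: str) -> dict:
        return dict(self._sections.get(section, {}))


class MessageCounter:
    """Toy application instance that counts the messages it has processed."""

    SECTION = "message_counter"

    def __init__(self, name: str, store: CheckpointStore):
        self.name = name
        self.store = store
        self.role = "standby"
        self.processed = 0

    def activate(self) -> None:
        """Take the active role, restoring state from the last checkpoint."""
        state = self.store.read(self.SECTION)
        self.processed = state.get("processed", 0)
        self.role = "active"

    def handle_message(self) -> None:
        """Process one message and checkpoint the new state."""
        if self.role != "active":
            raise RuntimeError(f"{self.name} is not active")
        self.processed += 1
        self.store.write(self.SECTION, {"processed": self.processed})


def fail_over(failed: MessageCounter, standby: MessageCounter) -> MessageCounter:
    """Mark the failed instance and promote the standby in its place."""
    failed.role = "failed"
    standby.activate()
    return standby


if __name__ == "__main__":
    store = CheckpointStore()
    primary = MessageCounter("primary", store)
    backup = MessageCounter("backup", store)

    primary.activate()
    for _ in range(3):
        primary.handle_message()

    active = fail_over(primary, backup)
    active.handle_message()
    print(active.name, "has processed", active.processed, "messages in total")
```

Because the standby restores the counter from the checkpoint, the service continues from where the failed instance stopped rather than from zero. In a real deployment the decision to fail over would come from the availability management framework, not from application code.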
Portability across Multiple Systems
It is important to remember that AIS does not dictate how an application should be written; rather, it provides a set of interfaces and capabilities to create applications in a service available system. AIS is driven and configured by its application environment, and it is the common approach to the middleware that enables rapid portability across multiple systems and between multiple applications.
With the increasing use of AdvancedTCA and the rise of SA Forum middleware implementations in mil/aero applications, the HPI and AIS specifications are in turn increasingly being adopted by the mil/aero industry. A number of defense implementations using AIS have appeared in the last 2-3 years. The U.S. Navy has adopted the SA Forum AIS specification as the core of its high-availability requirements for its objective architecture efforts to create commonality and reuse across combat systems.
In early 2010, SA Forum member company GoAhead Software announced a partnership with Global Technical Systems (GTS) and Northrop Grumman (NGC) to support the Navy’s Common Processing System program. The SA aspects of the system are based on SA Forum specifications and integrate GoAhead’s SAFfire solution, ensuring continuous service of warfighter systems without loss of service or data. Pre-SA Forum versions of GoAhead’s software, Self-Reliant, have already been deployed through Lockheed Martin’s work on the LCS COMBATSS-21 combat system, and with the Royal Australian Navy for installation of the Aegis system aboard the Australian Hobart-class air warfare destroyers.
The next generation of SA middleware is taking shape through the OpenSAF open source project. This project was formally launched in early 2008 and has since gathered significant industry backing. GoAhead Software has embraced this industry move and announced a commercial distribution of OpenSAF: OpenSAFfire. This includes contribution of many of the key concepts of its SAFfire and Self-Reliant products to the project to enhance the code base. The upcoming release 4.0 of OpenSAF incorporates the latest specifications from the SA Forum in a highly modular architecture, which is expected to form the basis for next-generation implementations, and the migration of existing applications into an open SA environment.
Building Block Ecosystem Approach
Open specifications provide key benefits in building service available systems across the diverse application areas of military and aerospace. The building block ecosystem approach of AdvancedTCA and the SA Forum specifications enables system designers to leverage the commercial ecosystem, yet provide their own value-add and expertise in specific areas without having to design the whole system. This not only leads to more cost-effective and more rapid development, but also enables system upgrades as new generations of blades appear based on the latest technology. With a common SA infrastructure based on SA Forum specifications, applications can be ported, combined and updated in a much more seamless manner. With a common system infrastructure, the management of many aspects of these systems can be aligned, saving time and money and reducing the need for specialized training on all aspects of each individual system.
As network-centric defense operations continue to evolve and increase in importance, the complexity of next-generation systems could produce a myriad of unforeseen outages across diverse applications. It is expected that the requirements for five-nines service availability and higher, along with increased usage of commercial systems and open specifications building blocks, will continue to grow in the military and aerospace segments. AdvancedTCA and the SA Forum specifications are key elements in the effective use of commercial ecosystems for service availability. The resulting enhancement of system transparency and reliability will eliminate the dramatic outages and confusion about their causes and offer cost and time saving benefits.
SA Forum
Beaverton, OR.
(503) 619-0576.
[www.saforum.org].


