RE-DISTRIBUTABLE COMPUTER ARCHITECTURE

by

William H. Bytheway

June 22, 1997


FOREWORD

For years the space launch and satellite ground support systems have been evolving from the basic punched card batch processing computer systems into distributed multi-processor systems.  Many systems during their life cycle outgrow the computer processing capabilities requiring a re-design of the hardware and software architecture.  This paper will define a concept for an architecture that easily adapts to a changing physical and functional environment at a minimal cost.

 

INTRODUCTION

A typical system architecture for aerospace ground equipment (AGE) in today’s environments must be flexible enough to grow and meet new requirements.  Early requirements definitions are usually enhancements of a prototype system and are not adequate to accommodate growth and new customer demands on the hardware and software.  A typical design usually starts with a combination of physical and functional definitions for the system and evolves from there.

 

The intent of this paper is not to define a specific AGE architecture but to identify a re-distributable processing architecture.  The benefit is realized most fully over a life cycle that spans 10 or more years of operation.

 

In addition, the benefits of maintaining an open systems architecture will be emphasized since during the life cycle of the space program, hardware and software vendors will continue to develop new products and discontinue support on existing products.

 

TYPICAL PROBLEMS

Following are a few examples to help you better understand the kinds of problems experienced in some current systems.

 

The Air Force Satellite Control Facility (AFSCF) Data Systems Modernization System (DSM) was implemented in the early 1980s to replace the older CDC-3800 batch computers and sneaker network for management of a series of Department of Defense (DOD) satellite programs.  The generic DSM system was capable of basic command and control of a spacecraft, ephemeris management, telemetry display and analysis.  The system was designed for this support as well as a Mission Unique Software (MUS) capability to be supplied by the respective satellite program for support related to payload command and control, housekeeping and trending.  Early requirements for the MUS system were not well defined, causing many implementation problems for some programs.  This caused a functional re-partitioning of the software onto two separate processors (On-line command and control, Off-line planning and MUS support) utilizing shared disk storage.  Problems were discovered in the Off-line software performance during several hours of the day when CPU utilization peaked and the system performance went to zero.  This caused even more functional partitioning and rescheduling to other physical processors.  Hence the concept of distributed processing architecture was formed in the space and defense community.

 

The DSM solution is a prehistoric example of how we evolved into today’s network architecture using UNIX, VMS and communication protocols like TCP/IP and DECnet.  Most of the DSM processes were file I/O event driven, i.e. one application produced files that were used by the next, and so on.  With linked disk drives, sharing of disks and data between computers allowed these functions to be partitioned easily to additional computer processors.  This works well for stand-alone applications as in an Off-line system, but what about real time applications that require constant communications between software applications and modules?

 

The Harris CORE Electronic System is being developed for NASA to be the next generation Shuttle Launch AGE system.  Harris handled this problem by creating an open system architecture that physically partitioned the processing stages into pre-processing, processing, command & control and display.  Separate computers were utilized for each process and connected via Ethernet.  Growth can easily be handled by adding computers to the system, upgrading Ethernet to FDDI, upgrading to faster computers, etc.  But the system still functionally partitioned software to specific computers.  Figure 1 shows an example of a typical ground station architecture.


Figure 1:  Ground Station Architecture


The Enhanced Tactical Radar Correlator (ETRAC) is being developed for the joint military services as a mobile system capable of being deployed on the battlefield to support the processing of radar imagery data.  This system is capable of collecting, processing and distributing the finished product to the customer.  The ETRAC system uses a combination of centrally located processors and software feeding a distributed set of GUI workstations.  This system supports a distributed processing architecture but went beyond the concept of physical partitioning early in the development of the program.  Each functional process of the system communicates with the others using a mail concept.  The overall software system consists of many functional software modules, each communicating with the others using a standard communications protocol.  This allows complete freedom for physical partitioning of a functional software module to meet growing changes in the overall system.  As requirements change on any one given piece of computer hardware, additional hardware can be added, requiring only the software module to be relocated with a new mail address.  As more processor power is added to the next generation of computer hardware, processes can be re-located.

 

The problem definition is now clear:  a program provides a hardware system with the capability to support preliminary software functional requirements with sufficient margin to spare.  The software architecture continues growing to meet new requirements until the system can no longer meet any requirements.  Typical solutions are to either add more processor power or re-distribute the software function to additional computers.  To add to the problem, new operating systems and run-time environments cause new problems to appear; old software becomes obsolete and must either be replaced by new commercial off-the-shelf (COS) products or be modified, upgraded or completely re-designed.

 

METHODOLOGY

Early requirements definitions for a new program should be careful not to partition specific functionality to specific hardware.  A DEC Alpha computer may be capable of handling all processing for a single spacecraft in the early requirements definition phase of a space program, but new and additional requirement changes will impact system performance later.  Systems engineers and customers should resist the temptation to perform a physical allocation before the functional design is complete in a distributed processing architecture.  The concept of an open system architecture must be kept as the highest priority.  Very quickly, the cost of software development and maintenance will outstrip the cost of the hardware.  The software design must choose software development standards very carefully.

 

1.        A good choice for several programs was to develop all software to ANSI C++ standards, use a POSIX-compliant UNIX operating system, and develop all Graphical User Interfaces (GUI) to meet X11R5 capability.  This opens the doors to using many Xview, Motif and OpenWindows tools.

2.        On the other hand, another program chose to develop software using SunView tools, a non-ANSI C compiler and a closed architecture.  Over time, the hardware vendors stopped supporting these ancient products and the SunView graphical user interface lost support.  Next the customer directed a change to a new open systems architecture compliant operating system and run-time environment.

 

Using the above examples, a newly designed system would not limit itself to only Sun, Silicon Graphics or DEC OSF workstations, but attempt to develop software that can be adapted to any hardware display platform.  The choice of software standards (Motif, XVIEW, SunView, SunOS, Solaris, OSF, UNIX, SOS, System-7, VMS, FORTRAN, K&R C, ANSI C/C++, Intel 80xxx, Motorola 68xxx, Pentium...) should be made with the requirement that these products adhere to an open systems architecture that will have continued support from multiple vendors for the entire duration of the program’s life cycle.

 

The systems mentioned above have many advantages and almost carry religious followings, but once a space program commits to a particular architecture, changing to another system can be very costly.

 

Note:  Those programs that choose NOT to develop their software and hardware using Open System Architecture concepts should not plan on upgrading their computers or operating systems within the life cycle of the program. 

 

 

THE RE-DISTRIBUTABLE PROCESSOR ARCHITECTURE

A truly distributed processing architecture allows physical re-allocation of software functional modules within the same computer or on different computers on the network.  The lowest level of a software program usually consists of a software module.  These modules are grouped into Computer Program Components (CPCs), which make up a Computer Program Configuration Item (CPCI).  The concept of a re-distributable processor architecture can be applied at any of these levels, making for a very interesting architecture.  To further explore this concept, a few definitions are needed:

 

1.        Distributed Processor Architecture - The concept of a distributed processor architecture requires multiple processors and/or workstations to be connected via a network and requires all of the features of Internet (TCP/IP) communications and Network File System (NFS) file sharing.  The concept of this architecture allows for functional processes (modules) to be partitioned between various processors.  Applications can be controlled manually or automatically using various methods:  dedicated socket-to-socket TCP/IP data connections between modules, timed events, or events that are keyed from some trigger (file existence, signal, etc.).

2.        Re-Distributable Processor Architecture - This concept is the same as above except that the concept of a dedicated socket-to-socket connection is replaced with a mail concept.  One design would have a software module register its address with a Master Control Mail Router (MCMR) that establishes a socket-to-socket connection with each software module.

 

A communications protocol would define an “envelope” that includes such information as (1) from:  processor/module, (2) to:  processor/module, (3) message ID, (4) sizeof message.  The Master Control Mail Router (MCMR) would be responsible for routing this “envelope” to the destination software module.  Wild cards would be defined for broadcast modes to selected groups of modules or processors.

Figure 2:  Example of Master Control Mail Router

3.        Software Architecture Management - A standard library object module could be used on the front end of all code requiring mail communications, thus making it easy for changes in the communications to be updated for all functional modules.  The purpose of this front end would be to initialize communication with the MCMR and provide functions for sending and receiving message “envelopes”.  Figure 2 shows a typical example of an MCMR function.

4.        Remote host (XHOST) Applications - While the capability to log onto a remote machine, launch and control software on the remote machine is a distributed processor feature, it is not a part of the Re-Distributable Processor Architecture.  This architecture only deals with using the MCMR to tie distributed software modules together into a single processing system.

Figure 3:  File Sharing

5.        Network File System (NFS) Sharing - This feature can be used as a method for sharing large data messages between software modules.  It is recognized that mailing a large (for example, one gigabyte) message between modules could incur a performance hit in the MCMR.  One design would be to create a data file in a shared directory as shown in Figure 3 and send a mail message to the receiving application notifying it of the new data availability.

6.        Shared Memory / Ethernet - The implementation of the MCMR could use a combination of shared memory and Ethernet to transfer data between different processors.

7.        Pros and Cons

a)       The main advantage is that an end user could communicate with many software modules located on any processor without the overhead of establishing a separate socket-to-socket TCP/IP connection.  Communications with any process is simple to perform.

b)       The disadvantage is that communications require an MCMR, which slows throughput.  In addition, a socket-to-socket connection provides the illusion of a continuous data stream, whereas “mail” must have a finite size.

8.        Re-Distributing - Should there be a requirement to re-distribute software functions to a different processor, there needs to be a method for updating all software modules that communicate with the moved module.  This can easily be managed using either a configuration control file read during initialization or by modifying header (.h) files and rebuilding all modules. 

9.        Inter-Module Communication Protocol -  A message identification (ID) parameter needs to be created for each software module.  This ID should not be reused for any other software module and is used for determining the passed data structure.  This makes monitoring and debugging of the process much easier and allows for global broadcasts of messages to all modules and processors (like terminate/shutdown).  The “sizeof” parameter allows variable-length argument lists to be passed (assuming the sender knows the size of the message).  Whether message acknowledgments (ACKs) are required depends on the design requirements.

10.     Open System Architecture - The open system architecture is required for this concept to work because software modules must be reallocatable to different hardware vendors’ products.  Software must meet strict standards in order to make it easy for software modules to communicate.

11.     Data Conversion between Differing Systems - It is possible for a method to be developed to pass mail between a DEC VMS and UNIX operating system using the UCX TCP/IP transport layers available with the DEC products.  Issues concerning data format conversions would have to be addressed. 
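The envelope and mail routing ideas in the definitions above can be sketched in a few lines of Python.  This is only an illustration under stated assumptions, not the author's design:  the fixed 16-byte address fields, the 32-bit message ID and size fields, the shell-style wild cards, and the names pack_envelope, unpack_envelope and MailRouter are all hypothetical, and an in-memory table stands in for the socket connections a real MCMR would hold.

```python
import fnmatch
import struct
from collections import defaultdict

# Hypothetical wire layout: 16-byte sender, 16-byte recipient,
# unsigned 32-bit message ID, unsigned 32-bit payload size (network byte order).
# Addresses longer than 16 bytes would be truncated by this layout.
HEADER = struct.Struct("!16s16sII")
MAX_PAYLOAD = 64 * 1024  # the paper's rough upper bound for "small" mail


def pack_envelope(sender, recipient, message_id, payload):
    """Build header + payload bytes for one mail message."""
    if len(payload) > MAX_PAYLOAD:
        raise ValueError("payload too large for mail; share it as a file instead")
    header = HEADER.pack(sender.encode(), recipient.encode(),
                         message_id, len(payload))
    return header + payload


def unpack_envelope(data):
    """Split raw bytes back into (sender, recipient, message_id, payload)."""
    sender, recipient, message_id, size = HEADER.unpack_from(data)
    payload = data[HEADER.size:HEADER.size + size]
    if len(payload) != size:
        raise ValueError("truncated message")
    return (sender.rstrip(b"\0").decode(), recipient.rstrip(b"\0").decode(),
            message_id, payload)


class MailRouter:
    """In-memory stand-in for the MCMR: modules register, the router delivers."""

    def __init__(self):
        self.inboxes = defaultdict(list)  # address -> delivered messages

    def register(self, address):
        self.inboxes[address]  # touching the key creates an empty inbox

    def route(self, data):
        """Deliver one envelope and return the list of addresses reached."""
        sender, recipient, message_id, payload = unpack_envelope(data)
        # fnmatchcase gives shell-style wild cards such as "online/*" or "*".
        targets = [address for address in self.inboxes
                   if address != sender
                   and fnmatch.fnmatchcase(address, recipient)]
        for address in targets:
            self.inboxes[address].append((sender, message_id, payload))
        return targets
```

With addresses written as processor/module, moving a module to another processor only changes the address it registers; senders that address it by wild card (for example "*/history") would not need to change at all, while senders that use the full address would pick up the new one from the configuration file described in item 8.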

 

CONCEPT OF SERVER AND CLIENT

The MCMR is responsible for routing function calls between applications.  Normally the MCMR (acting as the client) will establish a connection to the applications (acting as both servers and clients), making them ready at any time for function call processing.

 

Server:  The server typically listens and waits to be called, accepts the call and reacts to the data.  An example would be a history module responding to a request for data between a start and end date.  It would perform error checking on the input, process the request and return the response.  Lastly it would wait and listen for the next request.

Client:   The client normally initiates the request for data, establishes the connection, makes a request and waits for the returned data.  If the data does not come, it may look elsewhere.
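The two roles can be illustrated with a small Python sketch.  It assumes a hypothetical in-memory history store and dictionary-shaped requests and responses; the names HISTORY, history_server and trending_client are invented for the example, and plain function calls stand in for the network connection.

```python
from datetime import date

# Hypothetical in-memory history store: (day, telemetry value) records.
HISTORY = [
    (date(1997, 6, 1), 10.5),
    (date(1997, 6, 2), 11.0),
    (date(1997, 6, 3), 9.8),
]


def history_server(request):
    """Server role: check the input, process the request, return a response."""
    try:
        start = date.fromisoformat(request["start"])
        end = date.fromisoformat(request["end"])
    except (KeyError, TypeError, ValueError):
        return {"ok": False, "error": "start and end must be ISO dates"}
    if start > end:
        return {"ok": False, "error": "start is after end"}
    rows = [(d.isoformat(), v) for d, v in HISTORY if start <= d <= end]
    return {"ok": True, "rows": rows}


def trending_client(servers, start, end):
    """Client role: ask each known server in turn until one answers."""
    for server in servers:
        response = server({"start": start, "end": end})
        if response["ok"]:
            return response["rows"]
    return None  # no server could satisfy the request
```

The client loop over several servers mirrors the remark that a client may look elsewhere when the data does not come.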

 

Referring to Figure 4, if the trending module needs to retrieve data from the history module, it will make a function call request to the history module via the MCMR.  Because the connection is already established, the data is passed almost immediately from one application to another.  This could be thought of as a programming function call where you are passing values and pointers to data items or structures; the only difference is the time required for the network calls.  But the trending module

 

FAULT TOLERANCE

Fault tolerance can be built into a re-distributable system quite easily.  The TCP/IP domain name server, queried using the UDP protocol, will typically return multiple IP addresses for any one given domain name.  Typically the client application attempting to make a connection will try the first IP address, and if a successful connection isn’t made, the next one in the list is tried.  There isn’t any limit to the number of entries in the list.  This would allow the distribution of the same software functionality on different computers at the same time.  Load sharing of the use of these functions could be scoped for the nominal case, and in the event of a failure the backup systems would fill in.
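The try-the-next-address behavior can be sketched as follows.  The function names connect_with_failover and resolve are hypothetical; socket.getaddrinfo and socket.create_connection are standard Python library calls.  The connect argument is an assumption added so the logic can be exercised without a live network.

```python
import socket


def connect_with_failover(addresses, connect=socket.create_connection, timeout=2.0):
    """Try each (host, port) in order; return (address, connection) for the first success."""
    errors = []
    for address in addresses:
        try:
            return address, connect(address, timeout)
        except OSError as exc:
            # Connection refused, timed out, unreachable, etc.: try the next entry.
            errors.append((address, exc))
    raise ConnectionError(f"no server reachable: {errors}")


def resolve(name, port):
    """Ask the resolver for every IPv4 address registered under one name."""
    infos = socket.getaddrinfo(name, port, socket.AF_INET, socket.SOCK_STREAM)
    return [info[4] for info in infos]  # info[4] is the (host, port) sockaddr
```

A caller would pass resolve(name, port) as the address list, where the name and port are whatever the program has assigned to the replicated function.  Python's socket.create_connection already walks the resolved addresses when given a host name; the explicit loop here is kept so that the caller learns which server was selected.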

 

The MCMR and the server applications typically establish a connection early in the session, and health management handshaking takes place periodically to ensure this connection is still valid.  In the event that the connection is lost, the MCMR will first attempt to re-establish communications with the primary application (in case of a network hiccup), then work through the list of IP addresses returned by the domain server until a new server application can be found.

 

So what if the MCMR crashes or the computer fails?  This is the purpose of the System Health Management Module discussed below.

 

 

EXPANSION INTO LARGER SYSTEMS

Expanding this concept into a larger system is straightforward, as shown in Figure 4.  The connection between MCMR routers allows for the functional separation of responsibilities between groups.  In this example the Online system is primarily responsible for the command and control of a satellite system.  The Offline system is responsible for planning and scheduling tasks, and because ephemeris generation is a CPU hog it can be separated into its own cluster of computers.  As the system and requirements grow, so can the architecture.

 

 

SYSTEM HEALTH MANAGEMENT

The complexity of large systems in support of many spacecraft programs makes it extremely difficult to perform fault prediction, fault detection and fault isolation.  Some current systems have no way to indicate if a software process or module has failed, or has limited performance.  The MCMR could be easily used by a System Health Monitor Module and, by some clever Human Machine Interface (HMI), notify a computer operator for corrective action.  The Health Monitor Module could periodically poll each process within the system and request a health status response message from that module.

 

The “Vehicle Health Management Module” would normally act in a server role, but it would also have a client role in which it is responsible for notifying the operator through the HMI and offering the option of starting up a new MCMR.  Depending on the robustness of your health monitor processes, it could even autonomously start up a new MCMR process.
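One polling pass of such a health monitor might look like the sketch below.  The function name poll_health, the "OK" status string, and the rule of declaring a module failed after a number of consecutive missed polls are assumptions made for illustration; the paper does not specify them.

```python
def poll_health(modules, request_status, max_missed=3, missed=None):
    """One polling pass: ask every module for status and count the silences.

    modules        -- iterable of module addresses (hypothetical names)
    request_status -- callable(address) -> status string, or raises TimeoutError
    missed         -- dict of consecutive missed polls, carried between passes
    Returns (missed, failed): updated counters and the modules to report.
    """
    missed = dict(missed or {})
    failed = []
    for address in modules:
        try:
            status = request_status(address)
        except TimeoutError:
            status = None  # no health response arrived in time
        if status == "OK":
            missed[address] = 0
        else:
            missed[address] = missed.get(address, 0) + 1
            if missed[address] >= max_missed:
                failed.append(address)
    return missed, failed
```

The returned failed list is what would be handed to the HMI for operator notification, or to whatever logic restarts a module or a new MCMR.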


Figure 4:  Example of Distributed MCMR System


COMPUTER SECURITY AND PERFORMANCE ISSUES

A single MCMR could easily be overloaded by too many large routing requests from a number of user workstations.  An example of this would be requests for real time telemetry by vehicle analysts desiring to observe a spacecraft event or monitor a booster anomaly.  Current distributed processor architectures limit the number of socket-to-socket connections to the real time telemetry server or respond only to selected IP addresses.  Similar limits on the number of users could also be used in the MCMR in addition to a user priority scheme.
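A user limit combined with a priority scheme could be sketched as below.  The class name PriorityAdmission, the convention that a lower number means higher priority, and the first-in-first-out ordering within a priority level are all assumptions for the example.

```python
import heapq
import itertools


class PriorityAdmission:
    """Cap concurrent users and serve queued requests highest priority first."""

    def __init__(self, max_users):
        self.max_users = max_users
        self.active = set()
        self._queue = []                     # heap of (priority, sequence, user, request)
        self._sequence = itertools.count()   # tie-breaker keeps FIFO order

    def connect(self, user):
        """Admit a user unless the limit has been reached."""
        if user in self.active:
            return True
        if len(self.active) >= self.max_users:
            return False
        self.active.add(user)
        return True

    def submit(self, user, priority, request):
        """Queue a routing request; a lower number means higher priority."""
        if user not in self.active:
            raise PermissionError(f"{user} is not connected")
        heapq.heappush(self._queue, (priority, next(self._sequence), user, request))

    def next_request(self):
        """Return the next (user, request) to route, or None if idle."""
        if not self._queue:
            return None
        _, _, user, request = heapq.heappop(self._queue)
        return user, request
```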

 

This concept may not be the best solution for a continuous stream of real time telemetry data at a high data rate.  While TCP/IP will still chop up the continuous data stream into 1 Kbyte (or so) chunks for transmission, the route is fairly direct between software processes/modules.  The overhead of a “mail” system may be too great for adequate performance in the telemetry high data rate application. 

 

The transfer of small packets, structures or data items less than 64 Kbytes (a rough estimate) between modules should prove no problem for an MCMR mail system.
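Combining this size guideline with the shared-directory design of Figure 3, a sender could choose between mailing data directly and mailing only a file notification.  The sketch below assumes a hypothetical 64 Kbyte cutoff, dictionary-shaped messages, and the invented names send_data and receive_data; any directory visible to both modules (such as an NFS mount) can serve as shared_dir.

```python
import os
import tempfile

MAIL_LIMIT = 64 * 1024  # bytes; the paper's rough estimate, not a measured value


def send_data(payload, shared_dir, send_mail):
    """Mail small payloads directly; share large ones as a file plus a notice."""
    if len(payload) <= MAIL_LIMIT:
        send_mail({"kind": "inline", "data": payload})
        return None
    # Write the bulk data where the receiver can read it (e.g. an NFS mount).
    fd, path = tempfile.mkstemp(dir=shared_dir, suffix=".dat")
    with os.fdopen(fd, "wb") as handle:
        handle.write(payload)
    # The mail message carries only the location and size of the data.
    send_mail({"kind": "file", "path": path, "size": len(payload)})
    return path


def receive_data(message):
    """Return the payload whether it arrived inline or by reference."""
    if message["kind"] == "inline":
        return message["data"]
    with open(message["path"], "rb") as handle:
        return handle.read()
```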

 

SUMMARY

I have introduced you to the concept of a Re-Distributable Processor Architecture.  This concept targets the new long-term space or launch booster program that desires the maximum flexibility to adapt to new software and architecture systems.  No metrics have been given as to the ideal selection of COS products, since we live in a rapidly evolving computer world.  A good rule of thumb is that if 95% of the industry is using it, then there is a good chance that it will be around for a while (which is why I favored the UNIX/XWindows/ANSI C/C++ choices).