By John O. Schoenbeck
When NASA's Jet Propulsion Laboratory recently began building its third-generation hypercube parallel supercomputer, the facility continued its commitment to a new standard of computer technology. The Mark-3 is now fully operational, and earlier versions have been in use for several years.
Tasks assigned to computers have become extremely complex, and they now demand greater processing power than ever before. Research on the fluid flow of air around the wing section of an aircraft, on weather patterns, or on the behavior of subatomic particles can take days to run on conventional computers.
Larger computers offer a partial solution, but the real answer is to build radically different designs that connect multiple processors working together in parallel. Such computers are still experimental, but they are already available from companies like Bolt Beranek and Newman, Cray Research, DEC, IBM and others.
“The Mark-3 will be used for much more than just satellite tracking,” says JPL's Moshe Pniel, a team leader on the project. “It will be used for research in chemistry, biology, high-energy physics, and even for image processing.”
Conventional computers are limited by their sequential operation. The central processing unit, or CPU, is instructed to carry out a program one step at a time, and at each step it gets information from memory, performs the required operations, and returns the new information to memory. This design works well with relatively simple programs and small amounts of data but is inefficient at higher levels of complexity.
In contrast, the principle of parallel processing is to divide a complex job among a number of processors in order to do many operations at the same time. The CPUs may be relatively few in number or, in a massively parallel design, there may be thousands of them. Each CPU may be slow, but taken collectively they are fast and flexible, just as many people working together can build a house faster than one person alone.
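As a rough illustration of dividing one job among several workers, the following Python sketch splits a list into chunks, lets each worker sum the squares of its own chunk, and then combines the partial results. The function names and the choice of four workers are assumptions made for this example, not anything taken from the Mark-3. Threads are used only to show the structure; in standard Python they do not speed up arithmetic like this.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, n_workers):
    """Split data into n_workers nearly equal contiguous chunks."""
    size, extra = divmod(len(data), n_workers)
    chunks, start = [], 0
    for i in range(n_workers):
        # The first `extra` chunks each take one leftover item.
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    """The work done by a single worker on its own share of the data."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, n_workers=4):
    """Divide the job, run the pieces side by side, then combine the results."""
    chunks = split_into_chunks(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(sum_of_squares, chunks))
    return sum(partial_sums)
```

Calling `parallel_sum_of_squares(list(range(10)), 3)` gives the same answer as a sequential loop; only the way the work is organized changes.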
Designing a parallel system raises many problems. For example, the system needs to break the task into separate operations and then put it back together, and it must provide ways for the processors to work together smoothly. The designer must decide how many processors will be used, how they will be connected together, and whether they should be specialized for different tasks or be all alike.
Next, the CPUs must be connected to each other and to memory. Each processor and the memory assigned to it form a “node.” The two basic designs are shared memory and local memory.
Shared memory systems connect all processors to a common memory. These systems must be designed so that no processor reads a memory location before the appropriate value has been stored there. Usually, an area of memory is reserved to keep track of what has been written where.
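The rule that no processor may read a location before its value has been stored can be sketched with two Python threads and a flag. Here a dictionary stands in for the common memory and a `threading.Event` stands in for the reserved bookkeeping area; both are assumptions for illustration, and real shared memory hardware enforces the rule quite differently.

```python
import threading


def run_demo():
    """One thread writes a value; another waits for the flag before reading it."""
    shared_memory = {}               # stands in for the common memory
    value_ready = threading.Event()  # records that location "x" has been written
    results = []

    def producer():
        shared_memory["x"] = 42      # store the value first...
        value_ready.set()            # ...then announce that it is there

    def consumer():
        value_ready.wait()           # block until the value has been stored
        results.append(shared_memory["x"] + 1)

    # The reader is started first on purpose: it still cannot run ahead of the writer.
    reader = threading.Thread(target=consumer)
    writer = threading.Thread(target=producer)
    reader.start()
    writer.start()
    reader.join()
    writer.join()
    return results[0]
```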
The disadvantage of the shared memory system is that making memory serve many processors simultaneously is difficult and expensive. Also, a bottleneck results when several processors try to use the memory bus or access the same memory location. One solution is a switching system, like a telephone exchange, that will connect any two processors upon request.
The second way to connect the processors with memory is with local or distributed memory. In such a “multicomputer,” each processor has its own memory attached to it. These systems need fewer support chips and less control logic, so they are cheaper than shared memory systems. However, data must be distributed to the appropriate nodes at the start of computation, and processors must still get data from other nodes. Depending on how the nodes are connected, a request can be sent either directly or through intervening nodes.
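A toy model of local memory might look like the sketch below. Each node owns a private dictionary, data is dealt out to the nodes before computation starts, and a node that needs a value it does not hold has to ask the owner for it. The `Node` class, the round-robin distribution, and the `owner` directory are all hypothetical choices made for this example.

```python
class Node:
    """A processor together with its own private memory."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.local_memory = {}

    def handle_request(self, key):
        """Answer another node's request for a value held here."""
        return self.local_memory[key]


def scatter(data, nodes):
    """Distribute items round-robin before computation; return who owns each key."""
    owner = {}
    for i, (key, value) in enumerate(data.items()):
        node = nodes[i % len(nodes)]
        node.local_memory[key] = value
        owner[key] = node.node_id
    return owner


def fetch(requester, key, nodes, owner):
    """Read locally when possible; otherwise send a request to the owning node."""
    if key in requester.local_memory:
        return requester.local_memory[key], "local"
    return nodes[owner[key]].handle_request(key), "remote"


nodes = [Node(i) for i in range(2)]
owner = scatter({"a": 1, "b": 2, "c": 3}, nodes)
```

With this setup, node 0 reads `"a"` from its own memory, but it must go through node 1 to obtain `"b"`.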
Thus, the need to share data creates another design problem: how should the nodes be connected? Communications must be fast, so no node can have too many connections. On the other hand, if data must be relayed by many nodes there will be delays and processors will be idle.
The most promising answer is to connect the nodes as if they were at the corners of a multi-dimensional cube, called an n-cube or hypercube. For example, a 3-cube has eight nodes, each connected to three others, while a 4-cube has 16 nodes, each connected to four others. JPL's Mark-3 is a 7-cube, integrating 128 nodes. The Connection Machine, a parallel computer from Thinking Machines Corp., is an even larger 12-cube, consisting of 4096 “corners,” one for each chip.
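One common way to picture a hypercube is to number the nodes in binary, so that two nodes are neighbors exactly when their numbers differ in a single bit. The sketch below uses that numbering to list a node's neighbors and to relay a message by correcting one differing bit per hop. The function names and the lowest-bit-first routing order are assumptions chosen for this example; they are not a description of how the Mark-3 or the Connection Machine actually routes messages.

```python
def neighbors(node, dimension):
    """Nodes directly connected to `node` in an n-cube: flip each bit in turn."""
    return [node ^ (1 << bit) for bit in range(dimension)]


def route(source, target, dimension):
    """Relay path from source to target, fixing one differing bit per hop."""
    path = [source]
    current = source
    for bit in range(dimension):
        mask = 1 << bit
        if (current ^ target) & mask:  # this bit still differs from the target
            current ^= mask            # hop to the neighbor that fixes it
            path.append(current)
    return path


def node_count(dimension):
    """An n-cube has 2**n corners."""
    return 2 ** dimension
```

In this model, a message between any two nodes of a 7-cube never needs more than seven hops, even though each of the 128 nodes has only seven links.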
Once these design decisions have been made, the next step is to determine the way for the program to control the computer. With such new and radical hardware, radical approaches are necessary to make it run most efficiently. There are two basic ways for the computer to get its instructions. The more complex way is to provide each processor with its own unique instructions. In this case, the task is divided into operations different from one another but executed at the same time. This is like having a carpenter, a plumber and a painter all working on your house at the same time.
The second way, which is much simpler, is to provide all processors with the same set of instructions. Each processor obeys the instruction if it applies to the processor's job, or else it waits for another instruction that does apply. This is similar to sequential computing, except that many operations are performed simultaneously. It can be compared to having several carpenters working on your house – getting things done faster than one would by oneself – then having several painters come in when the carpenters are done.
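The second approach can be sketched as a single stream of instructions sent to every processor, each carrying a test that decides whether it applies. In the Python sketch below, the pairing of a test with an operation and the name `run_program` are assumptions for illustration. The loop over processors stands in for what the hardware would do on all of them at once.

```python
def run_program(values, program):
    """Send every instruction to every processor; each acts only if it applies."""
    processors = list(values)  # one data item per simulated processor
    for applies, operation in program:
        for i, value in enumerate(processors):
            if applies(value):
                processors[i] = operation(value)
            # Otherwise this processor sits idle until the next instruction.
    return processors


PROGRAM = [
    (lambda v: v < 0, lambda v: -v),     # only processors holding a negative value obey
    (lambda v: True, lambda v: v * 2),   # every processor obeys
]

doubled_magnitudes = run_program([3, -1, 4, -5], PROGRAM)
```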
The workflow can also be controlled in several ways. One way, called “dataflow,” does not specify a sequence of operations. Instead, each node performs its operation as soon as it has all the data it needs and immediately sends the result to the next node. Dataflow helps to prevent the bottleneck that arises when many processors use the communication channels at the same time or try to read the same memory location.
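A minimal dataflow scheduler, under the assumption that each node is described by a function and the names of its inputs, could look like this. Nodes fire as soon as all their inputs exist, regardless of the order in which they were listed. The graph format and the name `run_dataflow` are hypothetical.

```python
import operator


def run_dataflow(graph, initial_values):
    """Fire each node once all of its inputs are available.

    graph maps a node name to (function, [input names]).
    Returns the computed values and the order in which nodes fired.
    """
    values = dict(initial_values)
    pending = dict(graph)
    fired = []
    while pending:
        # Every node in `ready` could run at the same time on separate processors.
        ready = [name for name, (_, inputs) in pending.items()
                 if all(i in values for i in inputs)]
        if not ready:
            raise ValueError("some nodes can never receive their inputs")
        for name in ready:
            func, inputs = pending.pop(name)
            values[name] = func(*(values[i] for i in inputs))
            fired.append(name)
    return values, fired


# "product" is listed first but has to wait for "sum" to produce its result.
GRAPH = {
    "product": (operator.mul, ["sum", "c"]),
    "sum": (operator.add, ["a", "b"]),
    "diff": (operator.sub, ["a", "c"]),
}
values, fired = run_dataflow(GRAPH, {"a": 2, "b": 3, "c": 4})
```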
A second way to control the workflow, called “demand-driven,” is the opposite of dataflow because each operation takes place only when it is needed and requested, rather than as soon as it becomes possible.
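The demand-driven idea resembles what programmers call lazy evaluation: nothing is computed until some result asks for it. The small `Lazy` class below is a hypothetical sketch of that behavior, not a model of any particular machine.

```python
import operator


class Lazy:
    """An operation that runs only when its result is requested."""

    def __init__(self, func, *inputs):
        self.func = func
        self.inputs = inputs
        self.evaluated = False
        self.value = None

    def get(self):
        if not self.evaluated:
            # Requesting this result in turn requests the inputs it depends on.
            args = [i.get() if isinstance(i, Lazy) else i for i in self.inputs]
            self.value = self.func(*args)
            self.evaluated = True
        return self.value


total = Lazy(operator.add, 2, 3)
unused = Lazy(operator.mul, 10, 10)  # nobody ever asks for this one
result = Lazy(operator.mul, total, 4)
answer = result.get()
```

After `result.get()` runs, `total` has been evaluated because `result` needed it, while `unused` has never been computed.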
Desktops and supercomputers
Parallel processing is also finding its way to desktop computing. Even now, it is available for PCs in the form of an expansion board or software simulation. For example, the INMOS T414 CPU can pass information on to multiple processors while working on a problem, and Orchid Technology's OC Turbo 286e can run simultaneous applications in the computer’s memory and in the 286e’s own memory.
The development of parallel hardware is ahead of the software to use it, but languages that can accommodate the capabilities of parallelism are being written. These include new forms of C and FORTRAN as well as languages designed specifically for parallel processing, like Linda (from Yale University) and Occam (for the INMOS T414).
The future of parallel processing, whether on desktops or in supercomputers, will depend on the further development of hardware and software. It is still too early to know which of the various architectures will predominate. But eventually the standardization of design and programming languages will ensure that, like sequential computers, parallel computers can work the same way for all programmers and for a wide range of applications. When this happens, the new generation of computers will be able to accomplish tasks never before thought possible.
Cleveland Institute of Electronics