# A FAST 32-BIT COMPLEX VECTOR PROCESSING ENGINE

A. J. Kerr and T. E. Curtis

Admiralty Research Establishment, Portland.

## 1. INTRODUCTION.

Current generation sonar signal processors need to operate with throughputs of several hundred million arithmetic operations per second to achieve acceptable operational performance. The digital processing load imposed by this processing requirement is beyond the capability of currently available individual "off-the-shelf" DSP components. Distributed networks of processing engines, based on a signal flow architecture [1], have been employed in some applications and are capable of supporting throughputs in excess of 500 million arithmetic operations per second.

Future generation systems require a major increase in throughput [2], three orders of magnitude, if they are to keep pace with the perceived improvements in threat performance. This processing power is required to handle the data from larger aperture arrays, to perform more complex adaptive processing at the front end of the systems and to reduce operator/analyst load at the back end of the system and, although the throughput of standard DSP parts has increased significantly over the past few years, the improvements have nowhere near matched those needed to maintain a leading edge in system design.

System throughputs can obviously be increased by using extended multi-processor distributed networks, but not without some problems in the areas of inter-processor communication, algorithm decomposition and scheduling, system software, etc. These areas are being addressed in a number of technology programmes [3,4] world-wide that are developing the tools and techniques for massively parallel systems but, from a practical systems viewpoint, these are still some way from useful results.
Practical experience with current generation systems (backed up by detailed computer simulation of future systems [5]) leads to the obvious conclusion that the fewer the processor nodes required for a particular application, the more tractable the system engineering and software interactions, provided, of course, that adequate inter-processor communication bandwidths can be maintained. Hence there remains a requirement to develop individual processing engines with very high throughput, even with distributed architectures.

Any major increase in processing power requires parallel advances in a number of related fields, viz. VLSI semiconductor technology, system architecture and processing algorithms. The technology choices available to achieve this speed are limited: high performance processors require higher levels of integration (typically in excess of 40,000 gates) than can currently be provided on the ultra high speed DSP technologies, i.e. UHS bipolar or GaAs (currently less than 10,000 gates). Small geometry CMOS processes, on the other hand, support complexities in excess of several hundred thousand gates now, with the prospect of giga-scale integration within the next ten years, but with limited clock speeds. Consequently, for programmable, high performance DSP processors, small geometry CMOS highly integrated processors are the main contender, with device complexity traded for clock speed to maximise throughput.

To achieve useful gains in throughput, advances often need to be exploited in an interactive manner; that is, improvements are obtained by combining modest advances in a number of areas, rather than a major advance in just one. Such a synergism has become possible over the past few years in the development of application specific integrated circuits (ASICs) for digital processing systems.
The combined advances in the areas of semiconductor device technology, CAD support software, and device packaging and testing methods now allow ASICs with complexities in the range 10 to 200 thousand gates [6] to be developed quickly and at moderate cost. Whilst this complexity can be bettered using full custom designs, the integrated tools available for ASIC design [7], verified by a large user base and available on the system engineer's workbench rather than in the semiconductor specialist's clean-room, make ASICs an attractive system component and allow system designers to produce competitive performance trade-offs in critical DSP applications.

Most available "off-the-shelf" DSP parts appear to have been designed for scalar calculations. Signal processing algorithms are, however, characterised by the need to perform mathematically intensive vector processing operations on large complex data blocks. Data block sizes are typically in the range 1K to 16K complex points and the number of mathematical operations per data point is high, i.e. there is a high data to procedure ratio. Consequently, one useful way to improve node processor throughput is to utilise ASIC advances to produce a fully-ported complex vector processing engine for such operations. Fully-porting requires individual parallel input, coefficient, and output data busses (all complex) and a separate control bus: for a 32-bit system, this requires packages with in excess of 200 pins, but has the advantage that very high data I/O bandwidths can be supported using wide, but relatively slow, memory architectures.

The following sections outline the application of such a device, the Complex Vector Processor (CVP), designed at the Advanced Technology Laboratory of Marconi Defence Systems at Borehamwood using LSI Logic's proprietary LDS software tools [8], in a jointly funded MOD/GEC programme.
The CVP was fabricated in the UK (at LSI Logic Ltd., Footscray, Kent) on the LCA10000 (129,000 gate), 1.2 micron DLM CMOS process.

## 2. CVP DEVICE ARCHITECTURE [9].

The CVP implements full 32-bit complex multiplication on chip in a single clock cycle (25 MHz over commercial voltage/temperature ranges, 20 MHz over military) and provides in addition four 40-bit programmable complex accumulators to facilitate operations such as radix-2 and radix-4 FFT butterflies, complex FIRs and complex linear vector operations: the key features of the chip are listed in Table 1.

- 32 x 32 bit Complex Multiply
- Four 40 bit Complex Accumulators
- Fractional or Integer Two's Complement Arithmetic
- Dual channel Real Operation
- Versatile Pipelined Accumulator Operations
- Modulus Extraction and Weighting
- Radix 2 and Radix 4 FFT Butterflies
- High Speed: 25 MHz (Commercial), 20 MHz (Military)
- 1024 point Complex FFT, Window and Modulus: 205 $\mu$s (Commercial), 256 $\mu$s (Military)
- Designed and Fabricated in the UK

TABLE 1 - Key Features of Complex Vector Processor.

The processor architecture is shown schematically in Figure 1. It is both pipelined and fully ported to maximise data throughput, with four 32-bit input and two 32-bit output busses. The complex multiplier can perform one complex or two real products every clock cycle using two's complement arithmetic. The multiplier output feeds a four stage pipeline which in turn feeds the programmable accumulator block. This contains eight separate 40-bit accumulators, usually configured as four complex units. In a conventional multiplier/accumulator design only a limited set of micro-code commands (for example, load, add and subtract) is supported: in the CVP a comprehensive range of data modification functions, for data rotation and conjugation, is implemented as well.
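The multiplier/accumulator datapath described above can be illustrated with a small behavioural sketch. It is not the device's micro-code or the authors' Pascal model: the function names (`wrap`, `cmul`, `accumulate`, `dot`, `radix2_butterfly`) are hypothetical, only a subset of operations is shown (named after the mnemonics of Table 2), rotation is assumed to mean multiplication by j, and the 40-bit wrap-around on overflow is an assumption, since the text does not describe the device's scaling or overflow behaviour.

```python
ACC_BITS = 40  # accumulator width quoted in the text


def wrap(value, bits=ACC_BITS):
    """Wrap an integer into signed two's complement range (assumed overflow behaviour)."""
    mask = (1 << bits) - 1
    value &= mask
    if value >> (bits - 1):
        value -= 1 << bits
    return value


def cmul(a, b):
    """Complex product of two (real, imag) integer pairs."""
    ar, ai = a
    br, bi = b
    return (ar * br - ai * bi, ar * bi + ai * br)


# Each entry maps (previous accumulator s, pipeline data t) to the new accumulator value.
OPS = {
    "clear": lambda s, t: (0, 0),
    "hold": lambda s, t: s,
    "load": lambda s, t: t,
    "loadj": lambda s, t: (-t[1], t[0]),            # j * t
    "conj": lambda s, t: (t[0], -t[1]),             # conjugate of t
    "add": lambda s, t: (s[0] + t[0], s[1] + t[1]),
    "sub": lambda s, t: (s[0] - t[0], s[1] - t[1]),
    "addj": lambda s, t: (s[0] - t[1], s[1] + t[0]),  # s + j * t
}


def accumulate(s, t, op):
    """Apply one accumulator operation and wrap both parts to the accumulator width."""
    re, im = OPS[op](s, t)
    return (wrap(re), wrap(im))


def dot(xs, ws):
    """Complex multiply/accumulate over two vectors: load the first product, add the rest."""
    s = (0, 0)
    for k, (x, w) in enumerate(zip(xs, ws)):
        s = accumulate(s, cmul(x, w), "load" if k == 0 else "add")
    return s


def radix2_butterfly(a, b, w):
    """Radix-2 butterfly on two accumulators: a + w*b and a - w*b."""
    passed = cmul(a, (1, 0))   # a passed through the multiplier unchanged
    product = cmul(b, w)
    top = accumulate(accumulate((0, 0), passed, "load"), product, "add")
    bottom = accumulate(accumulate((0, 0), passed, "load"), product, "sub")
    return top, bottom
```

For example, `radix2_butterfly((1, 1), (2, 0), (0, -1))` feeds the same product to two accumulators programmed with different operations, which is the kind of independent per-accumulator programming the text describes.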
A full list of accumulator operations is included in Table 2. These operations can be programmed independently on the four complex accumulator blocks and provide considerable flexibility in algorithm design.

| Complex Accumulator Operation | Mnemonic |
|-------------------------------|----------|
| $S = (0),\ (0)$ | clear |
| $S = (s_r),\ (s_i)$ | hold |
| $S = (t_r),\ (t_i)$ | load |
| $S = (-t_i),\ (t_r)$ | loadj |
| $S = (t_r),\ (-t_i)$ | conj |
| $S = (t_i),\ (t_r)$ | jconj |
| $S = (-t_r),\ (-t_i)$ | neg |
| $S = (t_i),\ (-t_r)$ | negj |
| $S = (-t_r),\ (t_i)$ | negconj |
| $S = (-t_i),\ (-t_r)$ | negjconj |
| $S = (s_r + t_r),\ (s_i + t_i)$ | add |
| $S = (s_r - t_i),\ (s_i + t_r)$ | addj |
| $S = (s_r + t_r),\ (s_i - t_i)$ | addconj |
| $S = (s_r + t_i),\ (s_i + t_r)$ | addjconj |
| $S = (s_r - t_r),\ (s_i - t_i)$ | sub |
| $S = (s_r + t_i),\ (s_i - t_r)$ | subj |
| $S = (s_r - t_r),\ (s_i + t_i)$ | subconj |
| $S = (s_r - t_i),\ (s_i - t_r)$ | subjconj |

TABLE 2 - CVP Complex Accumulator Operations and Mnemonics.

$S$ = complex value currently being accumulated

The device also includes circuit blocks to select and scale accumulator data for output, a modulus extraction block, and gain monitor blocks for scaling in block floating point systems. The CVP is packaged in a 299-pin PGA, Figure 2: it is marketed by LSI Logic Ltd., Bracknell, as device number L5A0873.

## 3. SINGLE CVP PROCESSOR BOARD.
The CVP device has been used in a number of board level architectures: a block schematic of the lowest performance system is shown in Figure 3; the I/O queue control and process scheduling logic is shown in Figure 4.

Notation for Table 2: $s_r$ = previous real accumulator value; $s_i$ = previous imaginary accumulator value; $t_r$ = current real data from pipeline; $t_i$ = current imaginary data from pipeline.

The writeable control store (WCS) is down-loaded with code from a remote 80286-based micro-processor as part of the system power up sequence: it can be partitioned to handle chained vector operations. Process order is selected via a hash table containing interrupt addresses for the code procedures in the WCS: the process scheduler monitors the I/O queue stores and, when sufficient data is available at the input and sufficient space for results is available at the output, processing commences. The first operation is to read the data header word that defines the vector processes required on that data block. This block header is a channel identification tag, but it also contains other system information, such as the data source. The tag is used to address the process order hash table, which in turn generates the interrupt address to the WCS to start vector processing. The hash table and the micro-code stores are multi-ported and can be modified during normal processor node operation, allowing the code primitives and process order to be dynamically re-configured.

The CVP, together with local dual port scratch memory, input/output queue stores, a 64K x 144 bit writeable control store and a process scheduler, is contained on the printed circuit card shown in Figure 5. This operates at a clock rate of 25 MHz and supports a self scheduling signal flow architecture by means of the I/O data queue stores and controls outlined above: input and output data transfers at rates up to 200 Mbytes/second can be performed simultaneously and asynchronously.
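The self scheduling behaviour outlined above can be sketched in software. This is an illustrative model only, not the board's control logic: the class and method names (`NodeScheduler`, `push_block`, `step`) are hypothetical, a Python dictionary stands in for the process order hash table, callables stand in for micro-code procedures in the WCS, and each queued block is assumed to carry just a channel tag and its data.

```python
from collections import deque


class NodeScheduler:
    """Toy model of a self scheduling node driven by input and output queues."""

    def __init__(self, hash_table, wcs, out_capacity):
        self.hash_table = hash_table      # channel tag -> WCS start address
        self.wcs = wcs                    # start address -> procedure (stands in for micro-code)
        self.out_capacity = out_capacity  # number of result blocks the output queue can hold
        self.in_queue = deque()
        self.out_queue = deque()

    def push_block(self, tag, data):
        """Queue an input block; the tag plays the role of the data header word."""
        self.in_queue.append((tag, list(data)))

    def step(self):
        """Process one block if input data and output space are both available."""
        if not self.in_queue or len(self.out_queue) >= self.out_capacity:
            return False
        tag, data = self.in_queue.popleft()
        address = self.hash_table[tag]    # the header tag addresses the hash table
        result = self.wcs[address](data)  # the address selects the vector procedure
        self.out_queue.append((tag, result))
        return True


# Two hypothetical procedures at arbitrary illustrative addresses.
wcs = {0x100: lambda d: [2 * x for x in d], 0x200: lambda d: [sum(d)]}
scheduler = NodeScheduler({1: 0x100, 2: 0x200}, wcs, out_capacity=1)
scheduler.push_block(1, [1, 2, 3])
scheduler.step()
```

Because the table is an ordinary mutable mapping in this sketch, rebinding `scheduler.hash_table[1]` to a different address between blocks mimics the dynamic re-configuration of process order mentioned in the text.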
The complete board supports processing rates in excess of 25 million 32-bit complex multiply/accumulate operations per second, sufficient for a 1024-point complex correlation to be calculated in 41 $\mu$s and 1024-point complex FFTs (with windowing and modulus extraction) in 205 $\mu$s. At this speed, the board power requirements are around 3 amps at 5 volts (about 15 watts), of which around 1.2 watts is dissipated by the CVP itself.

CVP programs are generated using a translation program, written in Turbo Pascal: this assembles the required processing primitives into in-line micro-code and verifies that the code operation is correct using a Pascal model of the CVP before it is down-loaded to a target board. A library of primitives is currently being assembled. The writeable control store (implemented in 64Kx4 video DRAM) limits the code sequence length on the current board to less than 64K complex vector operations, but since code primitives for the CVP are compact (the 1K-point complex FFT is only 5K CVP operations, for example) this has not been found to be a limit in practice. If necessary, the code store length could be extended to 256K or 1M by upgrading to current DRAM parts.

## 4. THE SYSTOLIC NODE PROCESSOR BOARD.

A more powerful processor, using a systolic architecture, is also being developed: this is shown schematically in Figure 6. The processor contains four CVP nodes in a pipeline architecture, communicating through separate I/O double-banked memory: the systolic pipeline length can be extended by cascading processors if required. A photograph of the system is shown in Figure 7. It has been designed to support sustained data rates up to 32 million complex samples per second: typical applications include, for example, radix 4 pipelined FFT algorithms.
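The timing figures quoted above follow from simple rate arithmetic, which the sketch below reproduces. The helper names are hypothetical, and several assumptions are made that the text does not state explicitly: one CVP operation completes per 25 MHz clock, "5K" operations means 5 x 1024, a pipelined transform is limited only by the 32 million samples per second input rate, and each radix 4 node handles one FFT stage.

```python
def single_board_time_us(operations, clock_hz=25e6):
    """Time in microseconds for a sequence of CVP operations at one per clock (assumed)."""
    return operations / clock_hz * 1e6


def pipelined_time_us(n_points, sample_rate_hz=32e6):
    """Time in microseconds to stream one block through the pipeline at the sustained rate."""
    return n_points / sample_rate_hz * 1e6


def radix4_stages(n_points):
    """Number of radix 4 stages for a transform whose length is a power of four."""
    stages = 0
    while n_points > 1:
        if n_points % 4:
            raise ValueError("length must be a power of four")
        n_points //= 4
        stages += 1
    return stages


# 1024 multiply/accumulates for a 1024-point correlation: about 41 microseconds.
correlation_us = single_board_time_us(1024)
# 5 x 1024 operations for the 1K-point FFT primitive: about 205 microseconds.
fft_us = single_board_time_us(5 * 1024)
# A 256-point block at 32 million complex samples per second: about 8 microseconds.
systolic_256_us = pipelined_time_us(256)
```

Under these assumptions a 256-point transform needs `radix4_stages(256)` = 4 radix 4 stages, which matches the four CVP nodes on one systolic board.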
A 256-point FFT can be realised on a single board in around 8 $\mu$s, whilst a 4K-point FFT, using two boards to provide an eight stage pipeline (an input node, six radix 4 nodes and an output node), takes 128 $\mu$s. This equates to in excess of 1,000 million 32-bit multiply/accumulate operations per second. The high precision and compact nature of this design, compared with other implementations [10], makes such a module attractive in many real time radar and communications spectrum analysis applications.

## 5. DISCUSSION.

The above sections outline the development and application of a commercially available ASIC technology in high throughput processor architectures. The device design and production were carried out in the UK and illustrate the level of performance that can be achieved on-shore with readily accessible design tools and technology. The device design, down to verified netlist level, was carried out by signal processing engineers using LDS design tools to develop the architecture from a "middle-out" systems viewpoint. The design cycle for the CVP was short, less than 18 months, and the ASIC design methodology produced a complex design that was "right first time" and operated to specification without the functional hacking that is inherent in full custom designs.

The board level designs utilising the CVP provide performance levels that are currently "state-of-the-art" and indicate that compact, distributed signal flow processor networks, with node throughputs in excess of 1,000 million operations per second, are possible in current UK-based technology.

## 6. ACKNOWLEDGEMENTS.

The authors gratefully acknowledge the considerable input to this paper, and the work outlined in it, from the Advanced Technology Laboratory of Marconi Defence Systems at Borehamwood, from GEC (Avionics) and from LSI Logic Ltd.

## 7. REFERENCES.
- [1] T E CURTIS, A G CONSTANTINIDES and J T WICKENDEN, 'Control Ordered Sonar Hardware - COSH: A Distributed Processor Network for Acoustic Signal Processing', IEE Proc, Part F, 131, 1984.
- [2] T E CURTIS, 'High Throughput Sonar Processors', in ADVANCED SIGNAL PROCESSING, Edited by D J CREASEY, IEE Telecommunications Series 13, Pub. Peter Peregrinus Ltd, 1985.
- [3] D J MARSHALL, 'ASICs for Military Applications', ERA Seminar on Digital Signal Processing, London, 1988, ERA Report 88-0386.
- [4] see for example: VHSIC ANNUAL REPORT FOR 1988, VHSIC Program Office, 1988.
- [5] T A LANFEAR et al, 'Simulation of Signal Flow Graphs for Signal Processing Systems', Proc IEEE ICASSP 85, Tampa, Florida, 1985.
- [6] see for example: B G COLE, 'Getting to the Market On Time', Electronics, April, 1988.
- [7] M J CASEY, 'A DSP Designer's Toolkit', ERA Seminar on Digital Signal Processing, London, 1988, ERA Report 88-0386.
- [8] see for example: W CARNEY and I CURTIN, '50k Gate Array Meets Tomorrow's Design Challenges', Digital Design, 1986. LSI LOGIC LCA10000, LCA100000 Compacted Array Series Data Sheets.
- [9] M PRICE, 'CVP A 32-bit Complex Processor Chip', ERA Seminar on Digital Signal Processing, London, 1988, ERA Report 88-0386.
- [10] see for example: E J SWARTZLANDER, 'Systolic FFT Processors', Proc International Workshop on Systolic Arrays, Oxford, 1986. M P QUIRK et al, 'A Wide-band High-resolution Spectrum Analyzer', IEEE Trans ASSP, 36, 1988. Honeywell HDSAP66110/66210 Digital Signal Processing Chip Set, Honeywell Inc., 1988.

Copyright HMSO, 1989.

Figure 1 - Overall Schematic of CVP Architecture

Figure 2 - CVP Device Photograph

SINGLE CVP FFT ENGINE - 1024 PTS, 205 uSEC

Figure 3 - Single CVP Node Schematic

Figure 4 - CVP Node I/O and Control Logic Schematic

Figure 5 - Single CVP Node Card

Transform speed - 1024 pts complex, 41 microsecs.
Figure 6 - Four Node Systolic CVP Schematic

Figure 7 - Four Node Board Photograph