# A HIGH SPEED PIPELINED FFT CROSS CORRELATOR FOR SONAR APPLICATIONS

L Heystek

UEC Projects (Pty) Ltd Durban South Africa

#### 1.INTRODUCTION

This paper describes the architecture and performance of a cross correlator capable of digitally processing three channels, each with bandwidths up to 20 kHz, in applications where source localization is obtained by means of time—delay (TDOA) techniques. The distribution of signal processing functions over a pipeline of modules is described and some results of real—time hardware simulations of source localization is presented.

Section 2 of this paper presents an overview of passive source localization by a cross correlation method. Section 3 describes the hardware architecture of the FFT cross correlator. Sections 4, 5, and 6 describe the processing algorithms for each DSP module in the correlator. Section 7 describes the derivation of range estimates from the information obtained from the correlator output and range estimates obtained from real—time simulations.

### 2.SOURCE LOCALIZATION

Fig.1 depicts the geometry for a line array with three sensors or sub-arrays.  $T_{12}$  is the time delay for wavefront propagation between  $S_1$  and  $S_2$ , and  $T_{23}$  is the delay between  $S_2$  and  $S_3$ . D is the inter-sensor distance.

For far-field ranging, estimates for bearing B, and range R are:

$$cos(B) = \frac{C(T_{12}+T_{23})}{2D}$$
 (1)

$$R = \frac{D\sin^{2}(B)}{C(T_{23}-T_{12})}$$
 (2)

C is wavefront propagation speed. Variance of the time delay estimates for the generalised cross correlator has been derived by Knapp and Carter [1], and Quazi [2] has derived simpler expressions for certain conditions. Carter has also derived statistics for range and bearing estimates [3].

TRANCE

B

TO SEE

TO

Figure 1

For passive Sonar systems to obtain time delay estimates with minimum variance, signal processing systems must be developed to minimise the influence of parameters that increase

### FFT CROSS CORRELATOR

these variances. Paramaters that may be practically controlled in real—time hardware systems are: (1) Bandwidth— since delay variance is inversely proportional to bandwidth, the system must accomodate wide bandwidths— this implies high sampling rates in digital systems; (2) Integration time—the variance is in fact (non—linearly) inversely proportional to the time—bandwidth product, BT— this implies long integration times or processing of long data sequences. A trade—off exists between doing coherent processing on long sequences or sub—dividing the sequence and performing incoherent processing on these shorter sequences [4]. The selection depends on whether the signals may be considered stationary or non—stationary over the integration period; (3) Signal to noise ratio—this parameter may be controlled by employing low noise analog circuits in the signal conditioning and sampling circuits, and by introducing as little quantization noise as possible in the signal processing elements; (4) Spectral shaping—many filters have been suggested for the generalised cross correlator [1]. These filters generally modify the cross—spectral densities obtained from two correlated signals.

Delay resolution directly affects range and bearing accuracies. If the correlator resolves time delays into large intervals, range and bearing estimates exhibit large 'dead bands' and consequently large errors. Delay resolution is therefore a parameter that can and should be controlled in the hardware design.

# 3.HARDWARE ARCHITECTURE

The cross correlation of two finite length sequences  $x_1(n)$  and  $x_2(n)$  is defined as

$$Rx_1x_2(m) = \sum_{n=0}^{k} x_1(n)x_2(m+n), \quad k=N-1$$
 (3)

For m=0...N-1, the cross correlation  $Rx_1x_2(m)$  is computed more effeciently with FFT algorithms:

$$Rx_1x_2(m) = F^{-1}\{S_1(m)^*S_2(m)\},$$
 (4)

where  $S_1$  is the spectral density of sequence  $x_1$  and  $S_2$  that of  $x_2$ . \* denotes complex conjugate, and  $F^{-1}$  denotes inverse FFT. The latter method has been chosen for the hardware implementation of the cross correlator described here. Fig. 2 is a block diagram of the hardware architecture. Generic hardware modules have been used for the DSP processors, and generic FFT modules have been used for the DFT blocks. The DSP processors are TMS320C25 based, clock speeds are 35 MHz. Memory is organised as 48 kwords of no wait state RAM and 16 kwords of no wait state ROM/EPROM.

The FFT modules are bit—slice designs with a basic performance of 900 nanoseconds per complex butterfly. Transform times vary from 900 microseconds for a 256 point complex FFT to 20 milliseconds for a 4096 point complex FFT. Data is 16 bits, scaled down after each butterfly operation to prevent overflow conditions within the radix—2 decimation — in — time algorithm. The FFT modules are structured for pipelined systems, having seperate input and output memories as well as an internal 'working' memory. Input data may be loaded as soon as the first pass of the previous FFT is complete. Similarly, output data

### FFT CROSS CORRELATOR

may be operated upon until the last pass of the next result is available. Data flow between modules is via a 16 bit, 10 MHz asynchronous bus and all modules comply with DIN Euro standard.

One DSP processor is designated the master and the remaining two are designated slaves. The master processor communicates with the slave processors via a 5 MHz common serial bus. Commands are encoded in the 16 bit serial data, with two of the bits representing slave address. Thus each slave responds only to commands intended for itself. Six bits represent encoded commands and the remaining eight bits are for data values required with some commands. Some commands require responses from the slave, and in this case slaves are allowed a latency time to respond else the master assumes the slave is 'down'. In this way many diagnostics are incorporated in the system. In fact 70% of all commands are diagnostic test commands, thus greatly increasing maintainability of the system. Firmware on the DSP processors is also generic in the sense that a 'superset' of each processor's firmware resides on each board. Each board identifies it's function on reset by reading it's card slot address. Inputs and outputs to and from each module are initiated with interrupt structures but may also be polled.



Figure 2

The flow structure and supervisory elements of the DSP software were coded in Pascal, while the signal processing elements were coded in assembler. A useful feature of this mixing of language levels is that most code can be contained in Pascal CASE blocks. Each CASE block then corresponds to a command requested by the master processor.

### 4.INPUT PROCESSOR

The input processor is interrupted whenever input data  $x_1(n)$ ,  $x_2(n)$ ,  $x_3(n)$  is available. The three data samples are read from three input port addresses. Data sampling up to 40 kHz is possible, leaving 25 microseconds for processing data between samples. Data processing requires:

### FFT CROSS CORRELATOR

- 1) Unless input data is 16 bits, it is generally required to sign extend such data to represent 16 bit two's complement data.
- 2)  $x_1(n)$ ,  $x_2(n)$ ,  $x_3(n)$ , n=0...N/2 -1 is stored in memory buffers  $M_1(n)$ ,  $M_2(n)$ ,  $M_3(n)$ , n=0...N/2 -1.
- 3) When three N/2 sample sequences are available it is necessary to zero pad each sequence with N/2 zeroes to prevent circular correlation [5].

$$M_1(n) = x_1(n)$$
  
 $M_2(n) = x_2(n)$   
 $M_3(n) = x_3(n)$ ,  $n = 0...N/2-1$  (5a)

$$M_1(n) = 0$$
  
 $M_2(n) = 0$   
 $M_3(n) = 0$ ,  $n = N/2...N-1$  (5b)

These three sequences are then written into the FFT real memory, FFTR(n), n = 0...N-1. The FFT imaginary input is initialised once per transform size to all zeroes. The FFT module has status outputs indicating completion of the first pass. The input processor monitors this status (via interrupt), and writes the data, after which it may continue sampling input data. Thus the first stage of concurrent activity in the pipeline is achieved.

In application tests, BT products of over 2000 were realised. The three signal bandwidths were limited to 8kHz. The sampling period was 25 microseconds, and 128 sequences each of length N=128 were processed, obtaining total sequence lengths of 16384, and BT=3276. Coherent processing of the input sequences can be obtained by providing two input buffers. One buffer is filled on input interrupt while the other is delivered to the FFT input.

# **5.CROSS SPECTRUM PROCESSOR**

This DSP processor is located between the two FFT modules and has the largest suite of signal processing firmware. It performs the following signal processing:

1) For complex FFT output data  $X_1(n)$ , and  $X_2(n)$ , n=0...N-1 the complex cross spectrum is obtained from the data read in bit-reversed form

$$G_{12}(n) = X_1(n)^* X_2(n), n = 0...N-1$$
 (6a)

and for FFT outputs  $X_2(n)$  and  $X_3(n)$ , n = 0...N-1

$$G_{23}(n) = X_2(n)^* X_3(n), n = 0...N-1$$
 (6b)

2) Form complex spectral density estimates

$$S_{12}(n) = \sum_{k=0}^{K-1} G_{12}(n), n = 0...N-1$$
 (7a)

#### FFT CROSS CORRELATOR

$$S_{23}(n) = \sum_{k=0}^{K-1} G_{23}(n), n = 0...N-1$$
 (7b)

are formed.

3) To reduce the effect of interfering tonal signals, a weighting function [1]

$$W^{p}(f) = |S_{12}(f)| / |S_{12}(f)|$$
(8)

or PHAT (phase transform) is applied to the cross spectral densities so that the prefiltered cross correlations become

$$R_{12}(m) = F^{-1}\{W^{p}(f)e^{j\theta_{12}(f)}\}$$

$$R_{23}(M) = F^{-1}\{W^{p}(f)e^{j\theta_{23}(f)}\}$$
(9a)
(9b)

$$R_{23}(M) = F^{-1}\{W^{p}(f)e^{i\theta_{23}(f)}\}$$
(9b)

where  $\theta_{12}(f)$  is the phase function of the cross-PDS.

- 4) The cross spectral densities are computed with 32 bit accuracies, and must finally be normalised to 16 bits so that the real and imaginary components may be written to the inverse FFT module. The result available status from the forward FFT controls when this processor may unload results. As soon as results are read, the forward FFT may continue with it's last pass of the next FFT in the pipeline. Similar to the input processor, Cross spectrum data may be written to the inverse FFT as soon as the first pass of the preceeding inverse FFT is complete. In this way triple concurrency occurs in the pipeline: all passes except the first and last of both the forward FFT and the inverse FFT, as well as cross spectrum computations proceed together.
- 5) To obtain an increase in delay resolution, the cross spectral densities are upsampled by a factor U prior to inverse transforming. This is achieved by zero padding N points of the spectral densities with NU-N zeroes. This process may be viewed as ideal lowpass filtering in the frequency domain, by multiplying with a rectangular window or as convolution in the time domain by convolving the cross correlation function with an equivalent sinc function Thus an NU-point transform is performed after the N-point transform.

Fig. 3 shows the operations described and depicts the overlap and timing of them. Here N is 256 and NU is 4096, thereby interpolating by a factor 16.

# 6.PEAK DETECT PROCESSOR

The final DSP processor in the pipeline performs the following signal processing:

1) Results available from the inverse FFT are the complex cross correlations

$$R_{12}(t) = F^{-1}\{S_{12}(f)\}$$
 (10a)

#### FFT CROSS CORRELATOR

$$R_{23}(t) = F^{-1}\{S_{23}(f)\}$$
 (10b)

These values are read from the FFT in bit-reversed order. The real and imaginary components are converted to the real magnitude squared cross correlations

$$|R_{12}(t)|^2 = (Real R_{12}(t))^2 + (Imag R_{12}(t))^2$$
 (11a)

$$|R_{23}(t)|^2 = (Real R_{23}(t))^2 + (Imag R_{23}(t))^2$$
 (11b)

2) The peak value is searched for and the index recorded for each of the magnitude squared correlograms

$$|\hat{\mathbf{R}}_{12}(t)|^2 = \max(|\mathbf{R}_{12}(t)|^2), \ \mathbf{w}^1 \leq t \leq \mathbf{w}^{\mathbf{u}}_1$$
 (12a)

$$|\hat{R}_{23}(t)|^2 = \max(|R_{23}(t)|^2), w_2 \le t \le w_2$$
 (12b)

where  $w^1$  and  $w^u$  are index limits (windows) supplied to the peak search algorithm by a prediction algorithm.



Figure 3

3) An N-length sequence is read from the NU length inverse FFT results. The sequence is selected in the region  $\{\hat{R}-N/2...\hat{R}...\hat{R}+N/2\}$ . Then steps 1) and 2) above are performed for the interpolated region to find the interpolated delay peaks' indeces,  $\hat{I}_{12}(t)$  and  $\hat{I}_{23}(t)$ , within this region. The final index values are:

$$\hat{\mathbf{I}}_{12} = \mathbf{U}(\hat{\mathbf{R}}_{12} - \mathbf{N}/2) + \hat{\mathbf{I}}_{12} \tag{13a}$$

$$\hat{\mathbf{I}}_{23} = \mathbf{U}(\hat{\mathbf{R}}_{23} - \mathbf{N}/2) + \hat{\mathbf{I}}_{23} \tag{13b}$$

#### FFT CROSS CORRELATOR

The delay values  $\hat{\mathbf{I}}_{12}$  and  $\hat{\mathbf{I}}_{23}$  are related to the sampling period  $\mathbf{T}_s$  as

$$\hat{\mathbf{I}} = \mathbf{T_s}/\mathbf{U} \tag{14}$$

The total correlation lag times directly available are

$$T_{total} = -(N/2)T_s...+(N/2)T_s$$
 (15)

### 7. RANGE ESTIMATION

Equations (1) and (2) are formulae for estimation of range and bearing obtained directly from time delay estimates provided by the cross correlation processes described in sections 4 to 6. Delay measurements are noisy, and in small S/N ratio cases, range estimates become almost valueless unless some from of range filtering is performed. These predictive filters generally require accurate and large dynamic range number representation. For this reason a microprocessor with floating point coprocessor is included at the end of the pipeline. At this point data rates have been reduced sufficiently for this type of processor to compute filter iterations and perform high level command, control and communications functions.

A non-biasing 'Kalman' filter has been derived [6] and implemented on this processor module in order to run real-time simulations of a range-bearing estimator which obtains the delay estimates from the cross correlator. The filter is modelled such that the source position is described by the second order equations

$$dz_1/dt = z_2$$
 (16a)  
 $dz_2/dt = a_2$  (16b)  
 $z = z_1$  (16c)

The acceleration  $a_z$  in the z-direction is considered random with zero mean and standard deviation  $\sigma a_z$ . Also, since the measurement equation

$$y = [T_{23} - T_{12}] = (D^2/C)(\sin^2 B)/R$$
 (17)

is non-linear, the 'inverse range'  $x_1 = 1/R$  is used as the unknown filter state. Similarly to equation (16) the 'inverse range' dynamics is modelled by

$$\begin{array}{l} dx_1/dt = x_2 \\ dx_2/dt = w \\ y = [(D^2/C)(\sin^2B)]x_1 + v \end{array} \tag{18a}$$

A simulation scenario was prepared where a source moves South at 20 knots and the 'correlator' moves North at 10 knots; initial range is 10 Km, initial bearing is 80 deg. 200 iterations of range estimation were performed. The simulator hardware produced three signals with delay resolution of .5 microseconds, and in this case 0dB S/N ratio. The results are shown in the graphs below. Both graphs have time in seconds as ordinate and range

### FFT CROSS CORRELATOR

error in meters as abscissa. The lower trace is the unfiltered range error, while the top trace is the filtered range error. Ts was 25 usecs., N=256 and U=16, resulting in correlator delay resolutions of 1.56 usecs.

### 8.REFERENCES

[1] C.Y. Knapp and G.C. Carter, "The generalized correlation method for estimation of time delay," IEEE Trans. Acoust., Speech, Signal Proc., vol ASSP-24, pp. 320-327, Aug.

[2] A.H. Quazi, "An overview on the time delay estimate in active and passive systems for target localization," IEEE Trans. Acoust., Speech, Signal Proc., vol ASSP-29, pp. 527-533, June 1981.

[3] G.C. Carter, "Passive sonar signal processing," L. Bjorno (ed.), Underwater Acoustics

and Signal Processing, pp. 499-508.

[4] K. Scarbrough, R.J. Tremblay, G.C. Carter, "Performance predictions for coherent and incoherent processing techniques of time delay estimation," IEEE Trans. Acoust., Speech, Sig. Proc., vol ASSP-31, pp 1191-1196, Oct. 1983.

[5] A.V Oppenheim, R.W. Schafer, "Digital Signal Processing," Englewood Cliffs, New

Jersey: Prentice Hall Inc.

[6] E. Eitelberg, L. Heystek, "On bias in non linear Kalman filtering," IEEE COMSIG

conference, Somerset West S. Africa, 1989.

