Architectural Synthesis of Multi-SIMD Dataflow Accelerators for FPGA

Field Programmable Gate Array (FPGA) boast abundant resources with which to realise high-performance accelerators for computationally demanding operations. Highly efficient accelerators may be automatically derived from Signal Flow Graph (SFG) models by using architectural synthesis techniques, but in practical design scenarios, these currently operate under two important limitations: they cannot efficiently harness the programmable datapath components which make up an increasing proportion of the computational capacity of modern FPGA, and they are unable to automatically derive accelerators to meet a prescribed throughput or latency requirement. This paper addresses these limitations. An SFG synthesis approach is presented which derives software-programmable multicore single-instruction, multiple-data (SIMD) accelerators which, via combined offline characterisation of multicore performance and compile-time program analysis, meet prescribed throughput requirements. The effectiveness of these techniques is demonstrated on tree-search and linear algebraic accelerators for 802.11n WiFi transceivers, an application for which satisfying real-time performance requirements has, to this point, proven challenging for even manually-derived architectures.


INTRODUCTION
Field Programmable Gate Array (FPGA) offer enormous computational capacity and distributed memory resources for high-performance, low-cost realisation of signal, image and data processing [1], high performance computing and big data analytics [2] and industrial control [3] operations. As custom computing devices, FPGA typically host accelerators: components whose circuit architecture is tuned to realise a specific function with performance and cost well beyond that available via software-programmable devices, such as multicore processors or graphics processing units.
To achieve these benefits, accelerators have traditionally been developed manually at Register Transfer Level (RTL) [4]. This low level of design abstraction enables highly efficient results, but imposes a heavy development load made increasingly unproductive as the scale of modern FPGA devices increases. Architectural Synthesis (AS) eases this burden by automating the derivation of pipelined accelerators from Signal Flow Graph (SFG) models and has proven highly successful in producing high-performance, efficient results [1], [5].
In the context of modern FPGA, current SFG-AS approaches have two shortcomings. First, the accelerators they produce are networks of fixed-function components, such as adders, multipliers or dividers. However, modern FPGA increasingly rely on multi-functional or programmable components, such as the DSP48E1 slice in Xilinx FPGA [6], to provide computational capacity. When these are used to realise fixed-function components, several are required to effect different operations where otherwise one would suffice, leading potentially to increased resource cost. No current SFG-AS approach can harness these components' programmability. Additionally, when designing for an industrial operating context or for standards-based systems, accelerators need to meet a prescribed throughput or latency. No current SFG-AS approach can automatically derive accelerators to meet such requirements and, as a result, iterative cycles of FPGA place-and-route are needed to refine results to meet a required performance. This is a highly time-consuming process and a major barrier to high-productivity accelerator design. This paper addresses these shortcomings. Specifically, three principal contributions are made:
1) A novel SFG-AS approach is presented which derives accelerators composed of custom multicore SIMD processor architectures utilising the programmable datapaths on modern FPGA.
2) It is shown how, via off-line estimation of multicore performance and compile-time application analysis, accelerators may be automatically produced which meet a pre-defined throughput requirement.
3) Automatic AS of accelerators with demanding real-time requirements is demonstrated by application to the design of transceivers for 802.11n WiFi.
The remainder of this paper is structured as follows. Sections 2 and 3 outline the multi-SIMD accelerator design problem, before Sections 4 and 5 describe the synthesis approach and Section 6 applies this to the design of 802.11n transceiver accelerators.
Modern FPGA boast enormous on-chip computation, distributed memory and communications resources. For example, Xilinx's Virtex-7 FPGA family offers up to $7 \times 10^{12}$ multiply-accumulate (MAC) operations per second and $40 \times 10^{12}$ bits/s of memory bandwidth via programmable DSP48E1 [6], Look-Up Table (LUT) and Block RAM (BRAM) [7] resources. Along with the abundance of on-chip registers on modern FPGA, these resources are ideal for creating high-throughput, deeply-pipelined accelerators [1], [2], [8]. However, to date accelerators have been designed at RTL, a low-level, manual process made increasingly unproductive as the scale of modern FPGA increases.
Currently, two popular classes of approach address this productivity problem by adopting more abstract design entry points. High-Level Synthesis (HLS) translates programs written in popular software languages, such as C, C++ or OpenCL, to accelerators [9], [10], [11]. These tools derive RTL circuit architectures from the input and allow a designer to manipulate performance and cost by transforming the C source via, for example, loop unrolling, or by issuing synthesis directives. They support bit-true hardware data-types for arithmetic and automatically generate RTL code, frequently via the use of advanced scheduling and resource sharing in order to affect performance and cost.
An alternative approach uses AS techniques to derive accelerators from SFG models described in tools such as MATLAB Simulink. Exemplified by tools such as Xilinx's System Generator, Altera's DSP Builder and Synopsys' Synplify DSP, these empower a designer to specify the behaviour of an RTL component before generating code in the form of VHDL or Verilog. They support bit-true hardware datatypes, automatic RTL code generation, close integration with vendor synthesis and place-and-route tools and hardware-in-the-loop emulation. The SFG models created are also ideal for application of AS transformations such as automatic pipelining/retiming and graph folding or unfolding to trade the performance and cost of the accelerator [1], [5].
Regardless of the approach chosen, however, some restrictions are apparent. Consider the designer's concern: to realise a given function, on a given FPGA device, with a throughput or latency which is prescribed either by an industrial operating context, or by standards to which the equipment of which it is a part must comply. In this scenario, SFG-AS and HLS tools' capabilities are currently lacking in two important ways. Whilst transformation techniques such as retiming [12] and folding or unfolding [5], [13] allow accelerator performance to be traded with resource or energy cost [1], [13], [14], current approaches provide only partial support for deriving accelerators with a given performance. This is because they measure performance and cost in terms of abstract units such as clock cycles (latency), samples/cycle (throughput) or number of arithmetic components [13]. This puts the effect on actual performance and cost in doubt, since facets such as number of LUTs, or the length of each clock period cannot be accurately estimated until the entire, highly time-consuming FPGA synthesis toolchain, including place-and-route, has been traversed. Hence creating an accelerator of a given performance is a very unproductive, manual process.
Furthermore, all of these techniques derive RTL circuits composed of fixed-function components, such as arithmetic components, buffers or switches. However a substantial proportion of the computational resource on modern FPGA is increasingly made up of components which are multifunctional or even programmable, such as the DSP48E1 on Xilinx Virtex-7 FPGA. If these are restricted to performing only one operation each, accelerators of greater cost may result than otherwise necessary. To the best of the authors' knowledge, no current AS approach addresses either of these shortcomings.
The programmable components which these processes should target require both control logic and memory resource to store and manage delivery of instructions and operands. These structures evoke the notion of software-programmable processors, the use of which on FPGA has been growing in recent times and has evolved into intermediate fabrics or overlays [15], [16], [17]. These take a wide variety of forms, including vector processors [18], [19], GPU-like structures [20], [21], [22] or domain-specific processors [16], [23]. These are all founded on components such as the DSP48E1 but impose large resource and performance overheads to enable program and data control. To enable efficient accelerators, these overheads must be minimised via a soft design approach which customises their structure to the workload at hand. An approach which adheres to this philosophy is described in [24], [25]. Here, very fine-grained processors are used as building blocks for large-scale multi-SIMD structures whose architectures are tuned to the workload. This approach has been shown to support accelerators with performance and cost which are highly competitive with those from libraries such as Xilinx's Core Generator or Spiral [26]. However, there is no technology to automate their generation.
This paper devises an SFG-AS approach which derives custom multi-SIMD processors built around programmable on-chip computation units. Section 3 introduces the target architectures in more detail.

FPGA Processing Elements
An approach to the realisation of FPGA accelerators for signal, image and data processing is described in [24], [25], using the template architecture shown in Fig. 1. Accelerators are realised using networks of Processing Elements (PEs): software-programmable SIMD soft processors. The execution of each SIMD is decoupled from all others and communication is via point-to-point links. The structure of the network, the communications links and the widths of each PE are customisable at design time to maximise performance and minimise cost for the workload at hand. To enable highly efficient accelerators, two key features are demanded of the PEs. They must be lean, incurring very low resource cost, to enable scalability to many hundreds of units for complex accelerators; associated with this requirement is the need for standalone operation, i.e., the ability to process data, access and manage memory and communicate externally without the need for a host processor. The FPGA PE (FPE) [24] is a RISC load-store PE which fulfils these requirements; SIMD and SISD (i.e., single-lane SIMD) variants of the FPE are shown in Fig. 2. The FPE includes only vital components: a program counter, program memory, instruction decoder, register file, branch detection, data memory, immediate memory and an arithmetic logic unit based on the DSP48E1 in Xilinx FPGA [6]. A COMM module allows direct insertion/extraction of data into and out of the FPE pipeline. In addition, the FPE's architecture is highly configurable for tuning to a specific workload [24].
By ensuring absolute lowest cost, economies of scale enable significant multicore resource cost savings. The price for this efficiency, however, is flexibility: the FPE is not a run-time general-purpose component because its architecture is highly tuned to the application at hand. In addition, it is domain-specific, enabling very high performance for certain types of operations, with performance degradation for others [25]. The benefit, however, is very high performance; a 16-bit SISD FPE on Xilinx Virtex 5 VLX110T supports 480 MMACs/s requiring 90 LUTs; this is just 14 percent of the cost of a general-purpose Xilinx Microblaze processor and 35 percent of that of the iDEA processor [23] on the same device. This efficiency enables processor-based accelerators for a range of applications whose performance and cost are highly competitive with hand-crafted accelerators. The structures which result are heterogeneous, as illustrated for an example symbol detector for $4 \times 4$ 16-QAM Multiple-Input, Multiple-Output (MIMO) 802.11n transceivers in Fig. 3. This architecture includes clusters of MIMD structures (4-FPE$_1$) and nine 16-lane SIMD structures (FPE$_{16}$). This accounts for a total of 288 processing lanes, communicating via point-to-point data queues to realise the functionality required. The performance and cost of this architecture have been shown to be highly competitive with hand-crafted realisations of the same behaviour for this and a range of other functions [24], [25].

Synthesis of FPE-Based Accelerators
The goal, in this paper, is to generate a multi-FPE accelerator architecture from an application such that a prescribed throughput, expressed as a number of iterations n of the application per second, is satisfied. Accelerators based on the FPE promise high performance and efficiency, but a series of substantial design challenges must be overcome to automate their derivation to meet a given real-time performance. These are summarised in Fig. 4. A series of key subtasks are involved [4]: Allocation of a set of SIMDs to realise the application; Partitioning & Binding of application tasks to SIMDs and insertion of point-to-point communication links; Scheduling of the operations on each SIMD; Estimation of the performance of the result in order to ensure requirements are satisfied; and Code Generation of source for each PE.
Partitioning, scheduling and estimating the performance of application workloads for programmable multicore and GPU devices is an active topic of research [27], [28]. However, these works differ from that described here. For instance, in many cases they can reduce the scheduling load on the compiler by employing hardware scheduling circuitry [27], [28], [29]; this is not available in the FPE. In all cases, they do not have to allocate a processing resource, as is required for the FPE, and their performance estimation, e.g., [30], does not extend to estimating the physical length of a clock cycle, as is required for FPGA.
The SFG modelling entry point is well-suited to synthesis problems such as this and has already been adopted in numerous FPGA AS tools as detailed in Section 2. It is a highly restricted form of dataflow, a domain of modelling languages which have been shown well-suited to rapid synthesis of digital signal processing operations for both multicore and FPGA [5], [31], [32], [33]. Specifically, SFG is a sub-class of synchronous dataflow [34], a decidable dataflow dialect [35] which has consistently demonstrated outstanding support for compile-time analysis and generation of efficient code. Its popularity has led to a considerable body of work in the areas of partitioning and binding, scheduling and code generation of dataflow applications for multiprocessors [32], [36], [37]. Hence, this paper focuses on the key novel aspects of this work: allocation of multi-SIMD PE architectures and mapping and scheduling of SFG tasks across PEs to meet a given real-time performance.
An SFG $G = (N, E)$ describes a set of nodes or actors $N$ and a set of edges $E \subseteq N \times N$, directed First-In, First-Out (FIFO) queues of data tokens. A node is said to fire, consuming/producing a pre-specified number of tokens, known as the rate, from each incoming/outgoing edge. In an SFG, all rates are 1 and are not quoted.
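The firing rule for an SFG can be illustrated with a small Python sketch. This is only an illustration under assumptions: the class name `SFG`, its methods and the three-actor example are hypothetical and not taken from the paper, and each firing is assumed to place the same result token on every outgoing edge.

```python
from collections import deque


class SFG:
    """Signal flow graph: actors joined by FIFO edges, every rate fixed at 1."""

    def __init__(self):
        self.actors = {}  # actor name -> function applied to the input tokens
        self.edges = {}   # (src, dst) -> FIFO queue of tokens

    def add_actor(self, name, func):
        self.actors[name] = func

    def add_edge(self, src, dst):
        self.edges[(src, dst)] = deque()

    def inputs(self, name):
        return [e for e in self.edges if e[1] == name]

    def outputs(self, name):
        return [e for e in self.edges if e[0] == name]

    def can_fire(self, name):
        # An actor may fire once every incoming edge holds at least one token.
        return all(self.edges[e] for e in self.inputs(name))

    def fire(self, name):
        if not self.can_fire(name):
            raise RuntimeError(f"actor {name!r} is not ready to fire")
        # Consume exactly one token per incoming edge (rate 1).
        args = [self.edges[e].popleft() for e in self.inputs(name)]
        result = self.actors[name](*args)
        # Produce exactly one token per outgoing edge (assumed: same token).
        for e in self.outputs(name):
            self.edges[e].append(result)
        return result


# Hypothetical three-actor chain: src -> dbl -> sink.
g = SFG()
g.add_actor("src", lambda: 3)
g.add_actor("dbl", lambda x: 2 * x)
g.add_actor("sink", lambda x: x)
g.add_edge("src", "dbl")
g.add_edge("dbl", "sink")
g.fire("src")
g.fire("dbl")
out = g.fire("sink")
```

Because every rate is 1, readiness reduces to checking that no incoming queue is empty, which is what makes compile-time scheduling of such graphs straightforward.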

ARCHITECTURAL SYNTHESIS
In Fig. 5, the SFG is composed of 108 subactors, each of which processes a single Orthogonal Frequency Division Multiplexing (OFDM) subdivision of the allocated frequency band. For each sub-band $i$, two pieces of data are input: a channel matrix $H_i$ and a received symbol vector $y_i$. In the cases considered in this paper, $H \in \mathbb{C}^{4 \times 4}$ and $y \in \mathbb{C}^{4 \times 1}$. Preprocessing is applied by pp, ordering the entries of both $y$ and $H$ according to the distortion on each path through the wireless channel, before an equalised version of $y$ ($y_{eq}$) is produced. The resulting data are then refined to an estimate $\hat{s}$ of the transmitted symbol vector $s \in \mathbb{C}^{4 \times 1}$ by a sequence of Euclidean distance cost functions, $a1_i$ to $a4_i$ in Fig. 5a ($e1_i$ to $e4_i$ in Fig. 5b). Further details of these algorithms are available in [38], [39].
Each of the SFGs in Fig. 5 contains many instances of actors of a restricted range of classes, replicated in a very regular data/task parallel fashion. For instance, in FSD the sequence of actors $\{a4, a3, a2, a1\}$ forms all 16 branches of the SD tree, replicated 108 times to constitute 1,728 instances of this same sequence. Similarly, there are 16 data parallel instances of min, one per tree. These repeated parallel sequences are well suited to SIMD realisation, and we propose to exploit this feature to derive multi-SIMD realisations via a two-step process illustrated in Fig. 6.
As shown on the left of Fig. 6, the process commences with the SFG model and the definition of a set $K$ of kernels. Kernels are considered the fundamental units of 'work', with the accelerator created to realise these kernels. This approach is in keeping with that of common heterogeneous computing languages such as OpenCL [40], where they are known as work-items, and CUDA. A kernel $k \in K$ is a subgraph of $G$ such that $G$ may be subdivided into a set of partitions in which every actor in $G$ is a member of precisely one kernel instance. Kernels may take any form and are defined by the designer. However, in order to realise the most effective multi-SIMD realisations, these should be chosen to expose large numbers of similar, data parallel operations. In the FSD application, for example, both the FSD tree branches and the min actors, highlighted in Fig. 6 as $k_1$ and $k_2$ respectively, are ideal kernels.
From the SFG and kernel definitions, a workload is derived. A workload is a sequence of kernel batches, with a multi-SIMD workgroup synthesised to process a prescribed number, $n$, of batches per second. Since all SIMDs in each workgroup execute the same kernel, a distinct workgroup is required per kernel class; this paper illustrates the synthesis process for $k_1$ in Fig. 6.

Workload Synthesis
Consider the FSD model in Fig. 5a. Each OFDM sub-band contains multiple instances of the kernel $k_1$, all of which depend on data emanating from a pp node, $y_{eq}$ and $H$, which are local to that sub-band. Hence, when realising the set of $k_1$ kernels, if two instances from the same sub-band are realised on different SIMDs or different lanes of the same SIMD, this local data will have to be stored in multiple different memories, increasing the total memory capacity required and the total FPGA resource cost. Conversely, if both are realised on the same SIMD lane, a substantial resource saving may result. This data locality is ensured by the SFG model's hierarchy and, as a result, it is important that this is maintained and exploited to guide the workgroup synthesis process to realise these kernels on the same SIMD resource. Accordingly, the workload is described as a batch of kernel clusters, formed to emphasise local communication and memory storage.
Realising this feature requires two capabilities. The graph $G$ must be reformulated to express its behaviour in terms of the kernel set $K$, with any instance of a kernel $k \in K$ in $G$ replaced by a single actor $k$, whilst kernels of the same class within the same composite node need to be clustered for batch formation. The effect of this clustering on the FSD and SSFE-[1, 1, 2, 4] SFGs is shown in Fig. 7. Note that there are 108 disjoint FSD subgraphs, $Q_1$ to $Q_{108}$, each representing an OFDM subcarrier. Two kernels are identified: $k_2$ designates the min actor as a kernel, whilst $k_1$ identifies a branch of the FSD tree as a kernel. The SFG is factored to replace the subgraphs represented by each of these kernels with a single 'kernel' actor. In addition, the similar kernels in each disparate subgraph $Q_i$ are composed into clusters and hence two clusters arise, $C_1$ and $C_2$, composed respectively of all instances of $k_1$ and $k_2$. Similarly, the three kernels identified in Fig. 7b result in three clusters for each $Q_i$. The SFG model reformulation process is performed as described in Algorithm 1.
[Algorithm 1: SFG model reformulation. Only a fragment of the listing survives: "while $j \le |K|$ do".]
Input to this process are the set of kernels $K$ and the SFG $G$. The goal is to derive $G'$, an SFG of equivalent behaviour to $G$ whose child actors are all kernels and members of $K$. In the process the set of kernel clusters $C$ is also derived. The reformulation finds every instance of every kernel in $G$ (line 6) by isolating its disjoint subgraphs $Q \subseteq G$ (line 3). All instances of each kernel $k$ in $Q$ ($Q_k$) are replaced with a single actor representing the kernel (line 7), with the set of kernels for each subgraph appended to the cluster definition (line 8). This process is repeated for every disjoint subgraph and every kernel type, with $G'$ and $C$ returned.
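The reformulation and clustering described above can be sketched in Python as follows. This is a simplified illustration rather than the paper's Algorithm 1: it assumes every actor has already been labelled with the kernel instance to which it belongs (the hypothetical `kernel_of` map), so kernel matching itself is not shown. The function name, argument names and toy graph are all assumptions made for the example.

```python
from collections import defaultdict


def reformulate(actors, edges, kernel_of):
    """Collapse kernel instances into single actors and cluster them.

    actors:    list of actor names in G
    edges:     list of (src, dst) pairs in G
    kernel_of: actor name -> (kernel_class, instance_id)   [assumed given]
    Returns (actors of G', edges of G', clusters), where clusters maps
    (subgraph, kernel_class) to the set of kernel instances it contains.
    """
    # Union-find isolates the disjoint subgraphs Q of G.
    parent = {a: a for a in actors}

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for src, dst in edges:
        parent[find(src)] = find(dst)

    # G': one actor per kernel instance; edges internal to an instance vanish.
    new_actors = {kernel_of[a] for a in actors}
    new_edges = {(kernel_of[s], kernel_of[d])
                 for s, d in edges if kernel_of[s] != kernel_of[d]}

    # One cluster per (disjoint subgraph, kernel class) pair.
    clusters = defaultdict(set)
    for a in actors:
        clusters[(find(a), kernel_of[a][0])].add(kernel_of[a])
    return new_actors, new_edges, dict(clusters)


# Toy graph: two disjoint subgraphs, each with two (a2 -> a1) branches
# feeding one min actor. Branches are kernel k1, min actors are kernel k2.
actors, edges, kernel_of = [], [], {}
for q in range(2):
    m = f"min_{q}"
    actors.append(m)
    kernel_of[m] = ("k2", m)
    for b in range(2):
        a2, a1 = f"a2_{q}_{b}", f"a1_{q}_{b}"
        actors += [a2, a1]
        edges += [(a2, a1), (a1, m)]
        kernel_of[a2] = kernel_of[a1] = ("k1", f"branch_{q}_{b}")

g_actors, g_edges, clusters = reformulate(actors, edges, kernel_of)
```

In the toy graph each subgraph yields one cluster of two $k_1$ instances and one cluster holding a single $k_2$ instance, mirroring on a small scale the per-subcarrier clustering described for FSD.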

Design Space Scaffolding
The workgroup synthesis strategy adopted is illustrated in Fig. 8a. A workgroup is synthesised for each class of kernel. It executes all instances of the kernel batches, where each batch is a set of clusters (as determined in Section 4.2) of sufficient size to meet the system throughput requirement. In deriving the workgroup, there are three key challenges: determining the batch size, the number of SIMDs and the width of each. To aid this process, a template workgroup structure is assumed, illustrated in Fig. 8b.
A workgroup is a two-dimensional SIMD structure, the rows of which are composed of SIMD units, with column $i$ ($i = 1, \ldots, l$) formed by the composite of lanes $(j, i)$, $j = 1, \ldots, d$. To derive such a structure, $d$ and $l$ must be determined to execute a batch with a given throughput. The dimensions $(d, l)$ can vary between: $(1, 1)$: one single-lane SIMD is employed to process $w_i$ kernels sequentially, [...]
To guide the selection of the appropriate combination of rows/columns, two key observations may be made:
1) To achieve highest efficiency, the kernel load of each column should be balanced so that no lane is idle awaiting others to finish. This implies that the number of kernels executed per column is an integer factor of the number contained in the batch.
2) FPE performance scales linearly with number of lanes up to a width of 16, after which clock period constraints resulting from wide instruction broadcast impose increasingly sublinear scaling [41].
A batch describes a subset of the clusters associated with each class of kernel; the set of viable batch sizes is $S = \{s \in \mathbb{Z} : 1 \le s \le |C|\}$. For each viable batch size $s_i \in S$, a multi-phase workload may be defined as a sequence $W$, with each $w_i \in W$ determining the number of kernels executed during that phase of the sequence. Specifically, for [...]
Given these observations, a set $L$ of candidate workgroup widths (i.e., number of columns) can be enumerated as the integer factors of the workload size, up to a limit of 16. Given this set of widths, the viable depths $d$ can be determined by subdividing the batch across the workgroup columns and determining the number of SIMDs required by comparing estimates of the iteration rate to the requirement. This is achieved via Algorithm 2.
[...] (1) and (2) (line 6) respectively. The workload is executed in $|W|$ phases and hence the iteration rate $n$ is scaled accordingly (line 7).
Then, for each candidate workgroup width, viable depths and workload mappings are derived (line 10, as described in Section 4.4) and appended to the set of candidate solutions $M_c$ (line 11). The final result $M$ is selected from this set (line 12).
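The enumeration of workload phases and candidate widths lends itself to a short Python sketch. The function names are hypothetical, and two points are assumptions rather than statements from the paper: a final partial phase is assumed to process whatever clusters remain, and candidate widths are taken as the integer factors of the number of clusters in a batch, capped at 16, which is how the paper's FSD example (108 clusters of 16 kernels, batch size 54) is worked.

```python
MAX_WIDTH = 16  # lane count beyond which FPE scaling becomes sublinear


def workload_phases(num_clusters, kernels_per_cluster, batch_size):
    """Kernels executed in each phase when batch_size clusters run per phase.

    Assumption: if batch_size does not divide num_clusters, the last phase
    simply processes the clusters left over.
    """
    phases = []
    remaining = num_clusters
    while remaining > 0:
        taken = min(batch_size, remaining)
        phases.append(taken * kernels_per_cluster)
        remaining -= taken
    return phases


def candidate_widths(batch_size, max_width=MAX_WIDTH):
    """Integer factors of the batch size (in clusters), up to max_width."""
    return [l for l in range(1, max_width + 1) if batch_size % l == 0]


# FSD figures quoted in the paper: 108 clusters of 16 kernels, batch of 54.
fsd_phases = workload_phases(108, 16, 54)
fsd_widths = candidate_widths(54)
```

Restricting widths to factors keeps every column equally loaded, which is the balance condition stated in the first observation above.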

Workgroup Derivation
The Deploy process maps a batch $w$ of kernels onto a workgroup of a given width $l$ such that a number of iterations per second of the batch $n$ is achieved. The behaviour of this process is described in Algorithm 3. There are two key steps: the batch is subdivided across the columns which make up the width of the workgroup (ShapeWorkload in line 3, described in Section 4.5), before the resulting arrangement, denoted by the set $A$, is used to determine the number of SIMDs required in order to execute the kernels assigned to each column in satisfaction of $n$ (line 4, described in Section 4.6). The cardinality of the resulting two-dimensional set $M'$ defines the dimensions of the SIMD array, and its entries define the sequence of kernels executed on each lane of each SIMD.

Workload Shaping
Workload shaping assigns kernels for execution on the columns of the two-dimensional workgroup; at the point of entry only the width of the workgroup (i.e., the number of lanes in each SIMD) is defined, with the number of SIMDs initially assumed to be one. The goal of this process is to subdivide the batch across a given number of columns (i.e., workgroup lanes), with the number of rows (i.e., SIMDs) to be later derived. In order to ensure that kernels sharing local data variables are assigned to the same SIMD lane, they must be assigned to the same workgroup column and hence the mapping of kernels to columns is guided by $C$ according to Algorithm 4.
[Algorithm 4: workload shaping. Only fragments of the listing survive: line 7 extracts from cluster $s_j$ a set $q$ with $|q| = \min(|s_j|,\, c_l - |s_j|)$, and line 16 is "return $A$".]
The ultimate aim is to derive a mapping of kernels from the batch $w$ to workgroup lanes, deriving a set $A$, each element $a_i \in A$ of which defines the kernels assigned to that lane. To derive this subdivision from $w$, the number of kernels per lane is calculated (line 2) and the kernels for each lane isolated (line 6). The assignment maintains column-local communication, i.e., kernels from the same cluster are assigned to the same column, as far as is possible. From each cluster are extracted kernels which number the lower of either the number of unmapped kernels in the cluster or the number required in order to fully load the current lane (line 7). These kernels are assigned to the current lane (line 8) and removed from the cluster (line 9). When a lane is fully loaded the next is considered (lines 10, 11); otherwise, if all kernels in the cluster have been mapped, the process repeats for the next cluster (lines 12, 13).
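The shaping procedure described above can be sketched in Python as below. This follows the prose description rather than the original listing, which is not available here; the function name `shape_workload` and the representation of a kernel as a `(cluster, index)` pair are assumptions made for illustration.

```python
import math


def shape_workload(clusters, width):
    """Assign kernels to `width` lanes, keeping clusters lane-local if possible.

    clusters: list of lists of kernels (one inner list per cluster)
    Returns a list A of `width` lists; A[i] holds the kernels for lane i.
    """
    pending = [list(c) for c in clusters]          # work on copies
    total = sum(len(c) for c in pending)
    per_lane = math.ceil(total / width)            # kernels per lane
    lanes = [[] for _ in range(width)]
    lane = 0
    for cluster in pending:
        while cluster:
            # Take the lower of: kernels left in this cluster, or the
            # number still needed to fully load the current lane.
            room = per_lane - len(lanes[lane])
            take = min(len(cluster), room)
            lanes[lane].extend(cluster[:take])
            del cluster[:take]
            # Move to the next lane once the current one is fully loaded.
            if len(lanes[lane]) == per_lane and lane < width - 1:
                lane += 1
    return lanes


# 54 clusters of 16 kernels shaped onto a 9-lane workgroup.
fsd_clusters = [[(c, i) for i in range(16)] for c in range(54)]
fsd_lanes = shape_workload(fsd_clusters, 9)
```

With 54 clusters and 9 lanes each lane receives six whole clusters, so no cluster's local data needs to be duplicated across lanes; a cluster is only split when the per-lane load does not align with cluster boundaries.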

Allocation, Mapping and Scheduling
Given the mapping of kernels to workgroup lanes, it remains to determine the number of SIMDs required in order to execute each lane's load to meet the system's throughput requirements. This is performed by a joint allocation/mapping/scheduling process which has three main objectives:
1) Determine the number of SIMDs.
2) Assign each kernel to a specific lane of a specific SIMD.
3) Order the execution of kernels on each SIMD.
This requires a procedure with two inputs: a definition of the kernels assigned to each workgroup lane, $A$, and the required number of iterations per second, $n$. The resulting two-dimensional set $M'$ describes the sequence of kernels executed on each lane of each SIMD, derived via Algorithm 5.
[Algorithm 5: allocation, mapping and scheduling. Only fragments of the listing survive: line 8, "if $n_e > n$ then", and line 9, "$M' \leftarrow$ Map($A$, $d$)".]
There are, potentially, a significant number of options for the number of SIMDs (any integer number up to a maximum of $|a_i|$) and a greedy design space pruning process is used to determine the appropriate value. Upper and lower bounds on the number of SIMDs, $d_{max}$ and $d_{min}$, are defined (line 2), with the range between these limits successively halved over multiple iterations. In each iteration, the performance of the mid-point of the range is estimated. In the case where it is too low, the upper half is chosen on the next iteration or, in case it exceeds the requirement, the lower half is chosen. In each iteration the number of kernels assigned to each workgroup row is determined (line 6) and its throughput estimated (line 7, see Section 5). If the estimated throughput exceeds the requirement (line 8), the allocation is valid and the kernel load is mapped across the $d$ SIMDs (line 9) before the upper bound on the search space is lowered to $d - 1$ (line 10) and the process repeated to determine potentially lower-cost solutions. When performance is not sufficient, the lower bound $d_{min}$ is increased to $d + 1$ (line 11) and the process repeats until either the lower or upper bounds exceed their viable ranges. The result is the last workgroup derived whose estimated performance exceeds the requirement.
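The range-halving search over the number of SIMDs can be sketched in Python as follows. This is an illustrative reading of the description, not the original listing: the names `allocate` and `map_rows` and the `estimate_rate` callback are hypothetical, and the sketch assumes estimated throughput does not fall as SIMDs are added, which is what makes halving the range safe.

```python
import math


def map_rows(lanes, d):
    """Split each lane's kernel list into d rows of length ceil(|a| / d).

    Returns M' as a list of d rows; each row holds one kernel sequence
    per workgroup lane.
    """
    rows = []
    for r in range(d):
        row = []
        for a in lanes:
            n = math.ceil(len(a) / d)
            row.append(a[r * n:(r + 1) * n])
        rows.append(row)
    return rows


def allocate(lanes, required_rate, estimate_rate):
    """Search for the fewest SIMDs d whose estimated rate exceeds the target.

    estimate_rate(kernels_per_row, d) -> estimated iterations per second.
    Returns the mapping M' for the smallest viable d, or None if none found.
    """
    longest = max(len(a) for a in lanes)
    d_min, d_max = 1, longest
    best = None
    while d_min <= d_max:
        d = (d_min + d_max) // 2
        per_row = math.ceil(longest / d)       # kernels per workgroup row
        if estimate_rate(per_row, d) > required_rate:
            best = map_rows(lanes, d)          # valid: remember, try fewer
            d_max = d - 1
        else:
            d_min = d + 1                      # too slow: need more SIMDs
    return best


# Toy estimator (hypothetical): rate inversely proportional to row load.
toy_lanes = [list(range(96)), list(range(96))]
toy_map = allocate(toy_lanes, 25.0, lambda per_row, d: 1000.0 / per_row)
```

In practice the clock rate drops as the workgroup grows, so the estimator is not guaranteed to be monotonic in the number of SIMDs; a production implementation might need to verify neighbouring candidates.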
The process of mapping kernels to SIMD rows is a trivial subdivision of each $a \in A$ into $d$ subsequences each of length $\lceil |a| / d \rceil$, where each subsequence describes the kernels to be executed on each row of the workgroup. This process is not described further here.

FSD Example
Fig. 9 illustrates the process of synthesising a workgroup to realise $k_1$ for FSD. As shown, there are 108 clusters, each containing 16 instances of $k_1$. Accordingly $S = \{1, 2, \ldots, 108\}$. For each $s_i \in S$, a workload can be derived. In the case where $s_i = 54$, a two-phase workload results, each phase of which executes 864 kernels. Given the processing of 54 clusters per batch, the viable widths of workgroup, i.e., the integer factors of the number of clusters, are given by $L = \{1, 2, 3, 6, 9\}$. For each $l_i \in L$ the number of workgroup rows may then be derived. Fig. 9 illustrates the final arrangement when $d = 2$ and the kernel load for each workgroup column subdivided thereon.
Key to this process is its ability to estimate the throughput of a given workload, on a given workgroup and to account for potential estimation inaccuracies. Techniques to facilitate both these objectives are described in Section 5.

Throughput Estimation
To estimate throughput, two key metrics are required: the number of cycles required to execute the workload and length of each cycle, i.e., the clock period of the architecture. Each SIMD executes a sequence of kernels and hence one prominent component of the throughput estimation problem is determining the number of instructions and cycles required to execute a given number of identical kernels. Given this information, the estimation process can employ any scheduling approach desired. Since the accelerator architecture exploits numerous copies of a single component (the FPE) in various SIMD configurations, the instruction stream for a kernel will be identical, regardless of on which component of the final architecture it is deployed. This allows pre-synthesis characterisation of the performance and cost of a kernel, a characterisation which may be used to enable the allocation process.
To reduce resource cost, forwarding hardware has been omitted from the FPE. To avoid data hazards, NOPs must therefore be inserted in the instruction stream realising a kernel to synchronise operand accesses. This leads to kernel instruction sequences such as that in Fig. 10a. Consider the resulting effect on the execution of a sequence of similar kernels by the FPE. Fig. 10 shows two example two-kernel workloads.
In both these cases, a series of Effective Instructions (EIs) is interspersed with NOPs for the purposes of data synchronisation. Assume that each kernel also requires $r$ register file (RF) locations. In Fig. 10a, the two kernels may be executed sequentially, requiring only $r$ RF locations; however, they may also be interleaved, with the EIs from one kernel occupying the NOPs from the other as in Fig. 10b. The interleaved version enables higher efficiency and throughput, but has increased RF cost. Hence each SIMD should interleave kernels as much as possible, so long as RF capacity constraints allow. The estimation problem is to determine the number of cycles required to execute a given multi-kernel workload, within a given constraint on $r$. This is determined by profiling the kernels, deriving instruction-level statistics of their computational operations and NOPs and combining these into a single cost metric.
Assuming an RF occupancy per kernel of $r$ registers, then given a constraint on the number of RF locations $r_c$ (for the remainder of this paper, a maximum RF size of 64 locations is assumed), the maximum number of interleaved kernels $f$ is given by
$$f = \left\lfloor \frac{r_c}{r} \right\rfloor.$$
The program memory (PM) cost effect of interleaving successive kernels can be estimated by considering each kernel to be a sequence of instructions subdivided into a sequence of blocks demarcated at $\{NOP, EI\}$ sequence boundaries, i.e., the first EI following a NOP represents the start of a new block. Each block consists of a set of EIs $EI$ followed by a set of NOPs $NOP$ and may then be represented by a coefficient $g$ [...]
Letting $P_{IL}$ denote the maximum pipeline stage length, kernel instruction statistics are categorised into $P_{IL}$ catalogue sets depending on $(0 \le g \le 1)$, $(1 \le g \le 2)$, $\ldots$, $(g = P_{IL} - 1)$. By defining two cost vectors, $a$ and $b$, where $a_i$ and $b_i$ indicate respectively the number of EIs and total instructions of a block in the $i$th catalogue ($i \in [1, P_{IL}]$), then the PM size increment $\Delta p$ of adding a further interleaved kernel is given by [42]
$$\Delta p = \sum_{i=1}^{k-2} a_i + k \cdot a_{k-1} - b_{k-1}.$$
Hence, for $k$ kernels mapped to an FPE, the total PM cost is given by [...] (6)
The two additive terms in (6) respectively represent the total PM cost of the $\lfloor k/f \rfloor$ full interleaves and the final interleave, which may or may not be fully occupied. The value $p$ denotes the number of cycles required to execute the multi-kernel workload for each FPE.
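The two quantities that can be stated from the text, the interleave limit and the PM size increment, are easy to evaluate in Python. The sketch below is illustrative only: the function names are hypothetical, the interleave limit is assumed to be the integer quotient of RF capacity and per-kernel RF occupancy, and the cost vectors in the example are made-up numbers, not profiling data from the paper. The total PM cost is not implemented because its expression is not available here.

```python
def max_interleave(rf_capacity, regs_per_kernel):
    """Maximum number of kernels that can be interleaved on one FPE.

    Assumption: each interleaved kernel needs its own regs_per_kernel
    register file locations, so the limit is the integer quotient.
    """
    return rf_capacity // regs_per_kernel


def pm_increment(a, b, k):
    """PM size increment: sum_{i=1}^{k-2} a_i + k * a_{k-1} - b_{k-1}.

    a[i-1], b[i-1] hold the EI count and total instruction count of the
    i-th catalogue (the formula is 1-indexed, the lists are 0-indexed).
    """
    if k < 2 or k - 1 > len(a) or len(a) != len(b):
        raise ValueError("need 2 <= k <= len(a) + 1 and len(a) == len(b)")
    return sum(a[:k - 2]) + k * a[k - 2] - b[k - 2]


# Hypothetical figures: 64-entry RF, 10 registers per kernel.
f = max_interleave(64, 10)
# Made-up cost vectors for a three-catalogue profile.
dp = pm_increment([5, 2, 1], [6, 4, 4], 3)
```

With these made-up vectors the increment for `k = 3` works out as 5 + 3·2 − 4 = 7 instructions.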
This analysis allows compile-time evaluation of the number of cycles required to execute a given set of kernels. The final performance in real-world terms depends not only on the number of cycles, but also on the length of each, as dictated by the clock period of the synthesised architecture. This period is determined by vendor place-and-route tools, such as Xilinx ISE or Vivado, and the quality of the final result can be optimised by using additional intelligence to guide the process [43]. Here, however, the primary concern is the ability to estimate the resulting clock period accurately without incurring the long delays associated with executing these tools. This is achieved by pre-profiling, via RTL synthesis, varying numbers of SIMDs of varying width. Fig. 11 illustrates this profile for 1 to 20 SIMDs, each of which has 1 to 16 lanes, on Xilinx Virtex-5.
As this shows, the highest clock rate, approximately 370 MHz, is achieved by a single SISD processor, with the lowest experienced for 20 SIMD processors with 16 FPEs. The anticipated clock rate trends are observed in Fig. 11: as the total resource realised on the device increases (represented by points towards the front left-hand corner), clock rate reduces, a natural result of the optimization algorithms executed by Xilinx ISE increasingly struggling to find low-cost/high-performance design space points as the scale of the gate-level netlist being mapped increases. Given this profiling and the estimation of the number of cycles required for workload execution, the throughput of a realisation, in iterations per second, may be estimated. Letting $c_e$ denote the estimated clock frequency, the estimated number of iterations $n_e$, used in Algorithm 5 to determine the viability of a realisation, is obtained from $c_e$ and the cycle count $p$.
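The throughput estimate can be sketched as follows. This is a minimal illustration that assumes the simplest possible relation, one iteration per workload execution, so that the estimated iteration rate is the estimated clock frequency divided by the cycle count; the paper's own expression may include further factors. Both function names are hypothetical.

```python
def estimate_iterations_per_second(c_e_hz: float, cycles: int) -> float:
    """Assumed relation: n_e = c_e / p (one iteration per p cycles)."""
    if cycles <= 0:
        raise ValueError("cycle count must be positive")
    return c_e_hz / cycles


def is_viable(c_e_hz: float, cycles: int, n_required: float) -> bool:
    """True if the estimated throughput meets the requirement n."""
    return estimate_iterations_per_second(c_e_hz, cycles) >= n_required


# Example: the roughly 370 MHz single-SISD clock from the profile,
# with a hypothetical 10-cycle workload.
n_e = estimate_iterations_per_second(370e6, 10)
print(n_e, is_viable(370e6, 10, n_required=30e6))
```

In a full flow the clock frequency would be looked up from the pre-profiled table for the chosen number of SIMDs and lanes rather than passed in directly.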

Self-Correction
At the point of estimation the number of instructions can, in fact, be measured rather than estimated. However, the clock frequency is a true estimate: the precise value cannot be known until after FPGA place-and-route is complete. At the proposed pre-synthesis point of estimation there is likely to be some error between the estimated and actual clock frequencies. Since this estimate is an intrinsic part of the design process the inherent inaccuracy may preclude the result from meeting the intended real-time performance. Suppose that the estimated clock frequency is higher than the post-place-and-route actual frequency; this reduced clock frequency will lead to a reduced real-time performance, which in turn may be below the threshold performance target. In this case, allocation needs to be repeated to account for the discrepancy. To automatically derive a viable accelerator whilst accounting for the estimation discrepancy, the multi-phase synthesis process in Fig. 12 is employed.
As this shows, an iterative process adjusts the throughput target to account for inaccuracies in the estimated clock frequency $c_e$. If $c_e$ exceeds the actual clock rate $c_a$, and $c_a$ is sufficiently low that the actual number of iterations $n_a < n$, where $n$ is the throughput requirement, then the threshold is increased to account for the differential, scaling by the ratio of the estimated and actual clock periods.
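The self-correcting loop can be sketched as below. This is an illustration under stated assumptions rather than the flow of Fig. 12 itself: the target is raised by the factor c_e / c_a (taken here as the scaling that increases the threshold), and the three callbacks standing in for allocation, place-and-route and throughput measurement are hypothetical, as is the toy model used to exercise them.

```python
import math
from typing import Callable, Tuple


def self_correct(n_required: float,
                 allocate: Callable[[float], Tuple[int, float]],
                 actual_clock: Callable[[int], float],
                 throughput: Callable[[int, float], float],
                 max_rounds: int = 8) -> Tuple[int, int]:
    """Repeat allocation with a raised target until the actual rate meets n.

    Returns (allocation size, number of rounds used).
    """
    target = n_required
    for round_no in range(1, max_rounds + 1):
        size, c_e = allocate(target)        # estimate-driven allocation
        c_a = actual_clock(size)            # known only after place-and-route
        if throughput(size, c_a) >= n_required:
            return size, round_no
        target *= c_e / c_a                 # assumed scaling of the threshold
    raise RuntimeError("no viable allocation within max_rounds")


# Toy model (hypothetical): each SIMD completes one iteration every
# CYCLES cycles, the estimate is 300 MHz and the routed design reaches 270 MHz.
CYCLES = 10
C_E = 300e6
C_A = 270e6


def toy_allocate(target: float) -> Tuple[int, float]:
    return math.ceil(target * CYCLES / C_E), C_E


def toy_actual_clock(size: int) -> float:
    return C_A


def toy_throughput(size: int, clock_hz: float) -> float:
    return size * clock_hz / CYCLES


print(self_correct(120e6, toy_allocate, toy_actual_clock, toy_throughput))
```

In the toy run the first allocation of four SIMDs falls short once the slower actual clock is known, so the target is scaled up and a second round allocates five.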

EXPERIMENTS
To illustrate the capability of the proposed synthesis process, seven exemplar accelerators are addressed for $4 \times 4$ MIMO transceivers:
1) FSD, SSFE-[1, 1, 1, 4] and SSFE-[1, 1, 2, 4] tree-search SD
2) Zero-Forcing (ZF) and Minimum Mean Square Error (MMSE) equalisation
3) Sorted QR Decomposition (SQRD) pre-processing
The FPE AS approach is evaluated in the context of 802.11n, which demands 480 Mbps for FSD and SSFE detection and ZF/MMSE equalisation, and $30 \times 10^6$ iterations/second for SQRD: demanding requirements for even hand-crafted accelerators [24], [44]. This application has been chosen because it requires a range of operation types typical in signal, image and data processing, namely linear algebraic (matrix decomposition, matrix-vector and matrix-matrix multiplication) and tree-search operations, in a demanding real-time setting.

² For the remainder of this paper, $P_{IL} = 6$.
The SFG-AS process described in Sections 4 and 5 has been realised in a prototype, the behaviour of which is described in Fig. 13. In the main, XML is used for all input and intermediate data exchange, with the final result being VHDL and C sources describing the respective structure and executables for the multi-FPE accelerator derived. The intermediate processing stages match those in Sections 4 and 5 and are realised using Java. The C source for each FPE is compiled using a custom LLVM-based compiler to produce assembly. The RTL source is translated to Xilinx Virtex-5 XC5VSX240T via ISE 14.2. In line with standard practice, to permit objective analysis of the performance and cost of the accelerators produced and comparison with existing and future approaches in the areas of HLS [10], [11], [13], [26], [43], FPGA-based processors [15], [16], [18], [19], [20], [22], [23], [24], [25], [43] and accelerators [1], [2], [13], [26], all performance and cost metrics are measured post-place and route, independent of a specific hardware platform.

Tree Search: FSD & SSFE
In order to realise FSD and SSFE-[1, 1, 2, 4], the SFG application models and corresponding kernels are respectively shown in Figs. 5 and 7. The kernels for SSFE-[1, 1, 1, 4] are illustrated in Fig. 14. The key features of the synthesis process are evident in the SIMD structures in Fig. 14. For FSD (Fig. 7a), two workgroups are created, one each for realisation of $k_1$ and $k_2$ in Fig. 6, with point-to-point FIFOs realising the dependencies between the two. For $k_1$ a workgroup of twelve 16-way SIMDs is realised, whilst for $k_2$ the workgroup consists of two 12-way SIMDs. Similarly, three workgroups are created for the three kernels which describe SSFE-[1, 1, 1, 4]: a 9-way SIMD for $k_1$, three 12-way SIMDs for $k_2$ and a 6-way SIMD for $k_3$.
There are a number of notable aspects of the results in Tables 1 and 2. Immediately obvious is that the real-time performance requirements have been satisfied; to the best of the authors' knowledge, this is the first record of automatic derivation of multicore accelerators in satisfaction of a predefined performance requirement. In addition, it is worth noting the effectiveness of the proposed process in guiding the creation of each accelerator. In no case was the relative error in the estimated clock rate greater than 2 percent. This indicates that the pre-synthesis clock rate estimates were very accurate. Indeed, it is perhaps notable that in a number of instances, clock rate was underestimated.

Equalisation: ZF & MMSE
In MIMO communications, ZF or MMSE equalisation forms an estimate $\hat{x}$ of the transmitted symbol vector $x$ by forming the product of the received symbol vector $y$ and an equalisation matrix $W$,

$$\hat{x} = W y,$$

where $y \in \mathbb{C}^{4 \times 1}$ and $W \in \mathbb{C}^{4 \times 4}$. The equalisation matrix $W$ takes different forms depending on whether a ZF or MMSE equalisation strategy is to be employed. For ZF,

$$W = (H^H H)^{-1} H^H,$$

where $H \in \mathbb{C}^{4 \times 4}$ is the channel matrix and $H^H$ denotes the Hermitian transpose of $H$. For MMSE equalisation, the matrix to be inverted additionally includes a noise-dependent multiple of $I_M$, where $I_M$ is an identity matrix of order 4. The very high complexity of matrix inversion is the major challenge presented by this operation. To address this issue, QR decomposition is applied to $H$ to produce

$$H = Q R,$$

where both $Q, R \in \mathbb{C}^{4 \times 4}$.

According to this reformulation, the SDF application model for ZF and MMSE equalisation is shown in Fig. 15. As this shows, multiple operations are invoked, including QR decomposition of the channel matrix $H$, followed by back-substitution to derive the inverse $R^{-1}$ of the $R$ matrix produced. Subsequently the products of $R^{-1}$, its Hermitian and $H^H y$ are formed to derive $\hat{y}$. Real-time operation for $4 \times 4$ 802.11n MIMO requires 480 Mbps throughput; the throughput and cost metrics obtained are described in Table 3.

It is again notable that, in both cases, real-time accelerators automatically result; to the best of the authors' knowledge, this is the first time this capability has been demonstrated for algebraic operations such as matrix triangularisation, inversion and multiplication. In addition, note again the effectiveness of the design process in estimating and refining the accelerator architecture. In the case of the MMSE accelerator, the estimated clock rate is only 6.7 percent in error. The situation is slightly worse for ZF, where an 11.5 percent error in the initial estimate is encountered; whilst this is higher than any other estimate, it is still mild in absolute terms.
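The role of the QR decomposition in avoiding a general matrix inversion can be made explicit with a short derivation. It assumes the standard ZF form of the equalisation matrix, $W = (H^H H)^{-1} H^H$, and a decomposition $H = QR$ with unitary $Q$ (so $Q^H Q = I$) and invertible upper-triangular $R$; these are textbook assumptions and are not quoted from the paper.

```latex
\begin{align}
H^H H &= (QR)^H (QR) = R^H Q^H Q R = R^H R, \\
(H^H H)^{-1} &= (R^H R)^{-1} = R^{-1} \left(R^{-1}\right)^H, \\
W y &= (H^H H)^{-1} H^H y = R^{-1} \left(R^{-1}\right)^H H^H y .
\end{align}
```

Only the triangular matrix $R$ needs to be inverted, which back-substitution handles, and the final line is the product of $R^{-1}$, its Hermitian and $H^H y$ described in the text.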
Once again, it is notable that these automatically derived accelerators meet the real-time performance requirement and that the estimation-based design process has been highly effective. For MMSE, the estimated clock rate and throughput are only 6.3 percent in error. Similarly, the ZF estimates are actually underestimates.

CONCLUSION & FUTURE WORK
This paper has presented an approach for AS of accelerators for modern FPGA which achieves two unique capabilities. By deriving custom multi-SIMD processors it can harness the programmable datapath resources which increasingly make up a substantial portion of the computational capacity of modern FPGA. Furthermore, it automates the generation of accelerators satisfying real-time performance requirements prescribed by their industrial operating context or arising in standards-based equipment. This process is facilitated by offline characterisation of the performance of multi-SIMD topologies, compile-time evaluation of the cycle cost of SIMD programs and a self-correcting synthesis strategy which adapts to account for errors in the estimation process. When applied to the design of large-scale linear-algebraic (matrix triangularisation and multiplication) and tree-search operations, it automatically produces a series of accelerators capable of supporting real-time performance for $4 \times 4$ 802.11n MIMO employing either 16-QAM or 64-QAM. This is a notable achievement since, on the same FPGA technology, implementations of the same operations have had to be enabled by hand-crafted RTL design, if indeed these previously existed; the authors are unaware of any work which enables real-time FSD employing 64-QAM, for instance. Furthermore, this paper targets Virtex-5 FPGA, but the techniques presented are applicable to later generations since the FPE 'virtualizes' the FPGA: the process derives networks of FPEs and instructions for execution on each FPE, and not the FPGA device architecture.
Despite the effectiveness of this approach, a series of further improvements could be made. For instance, it does not consider the resource cost of inter-processor communication and does not explore the potential for cost reduction via different mappings of the same application on an allocation. Similarly, automatically tuning the FPE RTL architecture to its functionality is not considered. Since the DSP48E slices targeted natively only support fixed-point arithmetic, the only way to support floating-point is via emulation, addition of floating-point co-processors next to, or in place of, the DSP48E in Fig. 2, or by combining this work with standard AS techniques which derive networks of fixed-function floating-point components. In addition, previous work [43] has shown the benefit of considering the nature of the processing architecture being realised when optimizing its mapping to the FPGA, and it is likely that a similar approach could yield increased performance and/or lower cost FPE-based accelerators. There is considerable performance/cost benefit to all of these considerations.