DOCTOR OF PHILOSOPHY

FPGA-based programmable embedded platform for image processing applications

Siddiqui, Fahad Manzoor

Award date:
2018

Awarding institution:
Queen's University Belfast

FPGA-based Programmable Embedded Platform for Image Processing Applications

Fahad Manzoor Siddiqui
School of Electronics, Electrical Engineering and Computer Science
Queen’s University Belfast

A thesis submitted for the degree of

Doctor of Philosophy

September 11, 2018
Abstract

The vast majority of electronic systems, including medical, surveillance and critical infrastructure systems, employ image processing to provide intelligent analysis. They use on-board pre-processing to reduce data bandwidth and memory requirements before sending information to the central system. Field Programmable Gate Arrays (FPGAs) represent a strong platform as they permit reconfigurability and pipelining for streaming applications. However, rapid advances and changes in these application use cases demand adaptable hardware architectures that can process dynamic data workloads and be easily programmed to achieve efficient solutions in terms of area, time and power.

FPGA-based development requires iterative design cycles, hardware synthesis and place-and-route times, which are alien to software developers. This work proposes an FPGA-based programmable hardware acceleration approach to reduce design effort and time. It allows developers to use FPGAs to profile, optimise and quickly prototype algorithms using a more familiar software-centric, edit-compile-run design flow, in which the platform is programmed through software rather than through high-level synthesis (HLS) engineering principles.

Central to the work has been the development of an optimised FPGA-based processor called Image Processing Processor (IPPro) which efficiently uses the underlying resources and presents a programmable environment to the programmer using a dataflow design principle. This gives superior performance when compared to competing alternatives. From this, a three-layered platform has been created which enables the realisation of parallel computing skeletons on FPGA which are used to efficiently express designs in high-level programming languages. From bottom-up, these layers represent programming (actor, multiple actors and parallel skeletons) and hardware (IPPro core, multicore IPPro, system infrastructure) abstraction. The platform allows acceleration of parallel and non-parallel dataflow applications.

A set of point and area image pre-processing functions is implemented on the Avnet Zedboard platform to evaluate performance. The point function achieved 2.53 times better performance than the area functions, and the point and area functions achieved performance improvements of 7.80 and 5.27 times, respectively, over a single-core IPPro by exploiting data parallelism. The pipelined execution of multiple stages revealed that a dataflow graph can be decomposed into balanced actors to deliver maximum performance by hiding data transfer and processing time through task parallelism; otherwise, the maximum achievable performance is limited by the slowest actor due to the ripple effect caused by unbalanced actors. The platform delivered better performance in terms of fps/Watt/Area than an embedded Graphics Processing Unit (GPU), considering that both technologies allow a software-centric design flow.
Acknowledgements

I would like to express my profound gratitude to my supervisor, Prof. Roger Woods, for giving me the opportunity to undertake part-time research and for providing me with continuous advice, supervision and encouragement throughout my research. I am grateful for his systematic guidance, comprehensive reviews and critical feedback to improve this thesis. In addition, I am grateful to Dr. Karen Rafferty for providing additional support, reasoning and constructive criticism for my research. I want to thank Prof. Sakir Sezer for supporting me in finishing the thesis.

I would like to thank my colleagues at Queen’s University Belfast with whom I worked during my PhD including Dr. Burak Bardak and Dr. Moslem Amiri with whom I worked on the Rathlin project, for sharing their ideas and knowledge to improve my research activities. Particular thanks go to Dr. Matthew Milford, Dr. Colm Kelly, Umar Ibrahim Minhas and Tiantai Deng for sharing application use case results to optimise and improve the platform architecture.

A warm thanks to Margarita Magdenko for her continuous moral support, encouragement and affection. Above all, I would like to thank my parents who have given me the strength and wisdom to be sincere in my work, for setting high moral standards, supporting me through their hard work and their unconditional love and affection.
# Table of Contents

## List of Tables

x

## List of Figures

xv

## 1 Introduction

1.1 Research problem ........................................ 2

1.2 Rathlin Project ........................................ 4

1.3 Proposed approach ...................................... 5

1.4 Thesis Contributions .................................... 8

1.5 Thesis Outline .......................................... 10

## 2 Background

2.1 Parallel embedded architectures ......................... 14

2.1.1 FPGA multiprocessor system-on-chip ................. 14

2.1.2 FPGA hardware accelerator design approaches ........ 15

2.1.3 Need for adaptable hardware architectures ........... 19

2.1.4 FPGA memory and computation resources ............ 20

2.1.5 DSP block .................................. 22

2.2 Dataflow model of computation .......................... 23
  2.2.1 Notion of parallelism in dataflow graphs ............ 24
  2.2.2 Dataflow transformation .......................... 25

2.3 Parallel computing skeletons .......................... 25
  2.3.1 Pipeline .................................... 26
  2.3.2 Split, compute and merge .......................... 26
  2.3.3 Farm ........................................ 27

2.4 Related work on FPGA soft processors ................. 28
  2.4.1 Scalar Processors ............................... 29
  2.4.2 Multicore Processors ............................. 30
  2.4.3 DSP Slice Processors ............................. 35

2.5 Summary ........................................ 36

3 Rathlin Project ..................................... 39
  3.1 Rathlin Objectives .................................. 40
  3.2 Programming workflow ................................ 41
  3.3 Cal Actor Language (CAL) ............................ 42
    3.3.1 Semantics and execution model .................... 43
  3.4 Producer-consumer computing .......................... 45
  3.5 Summary .......................................... 46

4 Image Processing Processor (IPPro) ..................... 48
  4.1 Introduction ...................................... 48
  4.2 Algorithmic characteristics of image processing algorithms ... 51
  4.3 Exploration of efficient FPGA soft-core processor .... 52
    4.3.1 Balance between compute and memory resources ........ 53
    4.3.2 FPGA-based soft-core processor functionality vs performance trade-off ........ 56
  4.4 Image Processing Processor (IPPro) ........ 61
    4.4.1 Datapath ........ 63
    4.4.2 Branch and conditional execution ........ 64
    4.4.3 Instruction set architecture ........ 65
    4.4.4 Pipelined stream processing ........ 65
    4.4.5 Dataforwarding ........ 66
    4.4.6 Implementation results ........ 68
  4.5 IPPro Optimisations ........ 69
    4.5.1 Minimum and maximum instructions ........ 70
    4.5.2 Coprocessor extension ........ 71
  4.6 Comparison of IPPro results ........ 74
  4.7 Application use cases ........ 75
    4.7.1 System architecture ........ 79
    4.7.2 Comparison of IPPro with HLS approach ........ 82
    4.7.3 Comparison of IPPro against programmable FPGA-based architecture ........ 83
    4.7.4 Comparison of IPPro with MicroBlaze ........ 84
  4.8 Summary ........ 85

5 IPPro-based acceleration of dataflow actor ........ 88
  5.1 Introduction ........ 88
  5.2 IPPro: A dataflow processor ........ 90
    5.2.1 Notion of firing an actor ........ 92
    5.2.2 Producer-consumer computing model ........ 94
    5.2.3 Evaluation of FIFO configurations ........ 95
    5.2.4 Mapping and execution of static dataflow actor ........ 97
    5.2.5 Supporting multi-port dataflow actor ........ 99
    5.2.6 Discussion on hardware acceleration using IPPro over HLS ........ 101
  5.3 Management and provisioning of IPPro hardware accelerators ........ 103
  5.4 Dataflow parallelism and multiple IPPro ........ 108
    5.4.1 Configurable data distribution and collection architecture ........ 111
  5.5 Case Study: $k$-means clustering ........ 115
    5.5.1 MPSoC-based heterogeneous system architecture ........ 117
    5.5.2 IPPro hardware accelerator designs ........ 118
    5.5.3 Acceleration results ........ 120
    5.5.4 Comparison against GPU implementations ........ 122
  5.6 Summary ........ 126

6 FPGA-based programmable hardware acceleration platform ........ 128
  6.1 Introduction ........ 128
  6.2 Programmable realisation of parallel skeletons on FPGAs ........ 131
  6.3 IPPro core architectural optimisations ........ 132
    6.3.1 Dataflow actor firing rule optimisation ........ 133
    6.3.2 Scratchpad memory to access non-streaming data ........ 136
    6.3.3 Host management of IPPro core using AMBA-AXI4 ........ 138
    6.3.4 Implementation results of optimised IPPro core ........ 140
  6.4 Multicore IPPro ........ 141
    6.4.1 Exploration of multicore interconnect architecture ........ 142
    6.4.2 Impact of interconnect’s core connectivity and core utilisation on area and performance ........ 145
    6.4.3 Multicore IPPro architecture ........ 150
    6.4.4 Example: Mapping of dataflow graph onto multicore architecture ........ 151
  6.5 FPGA-based programmable hardware acceleration platform ........ 153
    6.5.1 Parallel distribution and collection of data streams ........ 153
    6.5.2 Implementation results ........ 159
  6.6 Parallel implementation of image pre-processing functions ........ 160
    6.6.1 Performance analysis ........ 163
  6.7 Summary ........ 170

7 Conclusion and Future Work ........ 172
  7.1 Summary ........ 172
  7.2 Thesis Contributions ........ 173
  7.3 Suggestions for further work ........ 177

A Author’s Publications ........ 180
B IPPro: Technical details ........ 182

Bibliography ........ 185
List of Tables

2.1 High-level Synthesis (HLS) tools for FPGAs. .......................... 18

3.1 Dataflow semantics and their functional requirements to implement on a hardware architecture. .......................... 44

4.1 Categorisation of image processing operations based on their memory and execution patterns. .................................. 51

4.2 Memory and compute resources in 28nm Xilinx FPGA technology. .......................... 55

4.3 Correlation of FPGA-based soft-core datapath and dataflow models with increasing functionality and memory. .......................... 58

4.4 Details of supported dataflow features and processor datapath memory elements in each presented model. .......................... 58

4.5 IPPro instruction frame structure. .................................. 65

4.6 IPPro supported addressing modes and instructions. ................ 65

4.7 IPPro code to implement func with and without dataforwarding. .......................... 67

4.8 IPPro implementation results on selected Xilinx development boards. .......................... 68

4.9 Implementation of Min/Max using native and optimised IPPro instructions. .......................... 71
4.10 Implementation results of optimised IPPro datapath to support coprocessor extension on ZC706 (Kintex-7). .......................... 73
4.11 Comparison of IPPro against other FPGA-based soft-core processor architectures. ...................................................... 75
4.12 Mathematical representation of image pre-processing functions. .............................................................................. 76
4.13 Area utilisation results of IPPro hardware accelerator. ............... 81
4.14 Comparison of hardware acceleration results obtained from HLS and IPPro using Avnet Zedboard (Artix-7). .................. 82
4.15 Comparison of IPPro performance results against programmable FPGA-based architecture. ................................................ 83
4.16 Area comparison of IPPro against programmable FPGA-based architecture. The normalised per core resource utilisation are reported in the brackets. ......................................................... 84
4.17 Comparison of micro-benchmarks on IPPro and MicroBlaze. .... 84
4.18 Area comparison of IPPro and MicroBlaze processors. ............. 85

5.1 One-to-one mapping of dataflow semantics onto IPPro datapath. . 91
5.2 IPPro code implementing dataflow actor firing rule. ................... 93
5.3 Implementation results of processor datapath using different FIFO configurations on Artix-7 FPGA fabric. ......................... 97
5.4 Hardware resource and control requirements to map multi-port actors onto IPPro core. ................................................... 100
5.5 Impact of accelerator classes on IPPro-based core, multicore and system requirements. .................................................... 104
5.6 IPPro-based multiple core architectures and their impact on system requirements and inter-core communication. ........ 106
5.7 Impact on area utilisation of different accelerator configurations. ........ 108
5.8 Output signals of FSM for each state. ........ 114
5.9 Summary of the C functions running on the host processor to program and control the underlying architecture. ........ 118
5.10 Dataflow actor mapping and supported parallelism of IPPro hardware accelerator design presented in Figure 5.15. ........ 120
5.11 Performance measurements for design 1 and 2 of Figure 5.15. ........ 120
5.12 FPGA area utilisation of various designs shown in Figure 5.15. The relative Zedboard area utilisation is also reported. ........ 121
5.13 Performance with task-level parallelism using designs in Figure 5.15. ........ 121
5.14 Power, resource and combined efficiency comparisons of IPPro-based k-means implementations on Zedboard. ........ 124
5.15 Power, resource and combined efficiency comparisons for k-means using Xilinx Zynq XC7Z045 Kintex-7 FPGA and GPU NVIDIA GTX980. ........ 124

6.1 IPPro instructions to access scratchpad memory. ........ 137
6.2 Implementation results of the optimised IPPro on Kintex-7 fabric. ........ 140
6.3 Comparison of IPPro against other FPGA-based soft-core processors. ........ 141
6.4 Implementation results to evaluate scaling of 4x4 and stream interconnect architectures on area and core utilisation to realise data (vertical) and task (horizontal) parallel implementations. ........ 148
6.5 Normalised area utilisation numbers of 4x4 with respect to stream interconnect realising parallel implementations. ........................................ 148
6.6 Implementation results of scaled-up stream interconnect designs with increasing core-connectivity on Artix-7 and Kintex-7 fabrics. The normalised area utilisation numbers of each design with respect to single-core IPPro are reported within the brackets. .......... 148
6.7 The AXI4-Lite (control) register map of platform hardware modules. ................................. 157
6.8 Area utilisation results of the system infrastructure. ........................................ 159
6.9 Estimation of number of multicore IPPro on Xilinx Zynq MPSoCs. ........................................... 160
6.10 Formal mathematical representation of chosen image pre-processing functions. ............................................................................. 161
6.11 Data parallel performance results of point and area functions using IPPro on Artix-7 (Zedboard). ........................................ 165
6.12 Comparison of data parallel implementation of point functions using IPPro against ARM (-O2,-O3). ........................................ 165
6.13 Comparison of data parallel implementation of area functions using IPPro against ARM (-O2,-O3). ........................................ 165
6.14 Implementation results of HLS generated IPs on Kintex-7 fabric. (Normalised area and performance results of multicore IPPro to HLS). ............................................................................. 167
6.15 Performance results of task parallel implementations of multiple dataflow actors on multicore IPPro. ........................................ 168
6.16 Performance results of heterogeneous decomposed compute functions using multicore IPPro. ........................................ 169
B.1  IPPro supported instruction set and their corresponding DSP48E1 control signals. .............................................. 182

B.2  IPPro instruction set. .............................................. 183

B.3  The AXI4-Lite control register map. ............................ 184
List of Figures

1.1 Hierarchical illustration of hardware and software abstraction supported by each layer of the proposed programmable hardware acceleration architecture. ........................................ 6

2.1 FPGA-based hardware accelerator design compilation approaches. 16

2.2 Trend of hardware resources, their raw-computation (GMACs) and memory across different families of Xilinx FPGAs. ................. 21

2.3 FPGA memory and bandwidth hierarchy of Xilinx Virtex-7 FPGA. 21

2.4 Block diagram of Xilinx dedicated DSP block (DSP48E1). ...... 23

2.5 Illustration of pipeline, task and data parallelism in dataflow graphs. 24

2.6 Illustration of parallel computing skeletons using dataflow actors. 27

2.7 The layered block diagram of Silicon Hive architecture illustrating Processing Storage Element (PSE), cell and streaming array of cores. 30

2.8 The block diagram of PicoArray processors organised in a two dimensional grid connected together using a deterministic picoBus interconnect. ......................................................... 31

2.9 Datapath of a basic pipelined processing node used in GraphSoC. 33

2.10 Datapath of FlexGrip Streaming Multiprocessor (SM). .......... 34

3.1 Rathlin workflow of RIPL to IPPro-based platform with alternative compilation paths. ........ 41
3.2 Block diagram of a CAL dataflow actor and its components. ........ 43
3.3 Producer-consumer driven data exchange patterns. ........ 45

4.1 Impact of DSP48E1 configurations on maximum achievable clock frequency ($f_{\text{Max}}$) using different speed grades of Kintex-7 FPGAs. The DSP48E1 configuration used are: fully pipelined datapath with no pattern detector (NOPAT), with pattern detector (PAT-DET), multiply with no output register MREG (MULT_NOMREG) and pattern detector (MULT_NOMREG_PATDET) and a Multiply, pre-adder, no ADREG (PREADD_MULT_NOADREG). ........ 54
4.2 Impact of BRAM configurations on the maximum achievable clock frequency ($f_{\text{Max}}$) of Artix-7, Kintex-7 and Virtex-7 FPGAs for single and true-dual port RAM configurations. ........ 55
4.3 Dataflow models (a) DFG node without internal storage 1 (b) DFG actor without internal storage t1 and constant i 2 (c) Programmable DFG actor with internal storage t1, t2 and t3 and constants i and j 3. ........ 57
4.4 FPGA datapath models (a) Programmable ALU 1 (b) Fine-grained processor 2 (c) Coarse-grained processor 3. ........ 57
4.5 Impact of datapath models 1, 2, 3 on $f_{\text{Max}}$ across FPGA fabrics. ........ 60
4.6 Block diagram of FPGA-based soft-core processor IPPro datapath. ........ 62
4.7 Implementation of dataforwarding exploiting MACC functionality of DSP48E1. ........ 66
4.8 Optimisation of IPPro datapath to support dedicated minimum and maximum instructions. ........ 70
4.9 (a) Input/output interfaces of division coprocessor (b) Coprocessor extended IPPro datapath. ........ 72
4.10 Pipelined execution of division coprocessor. ........ 73
4.11 Block diagram of programmable video processing platform to implement case-studies using single-core IPPro. ........ 80

5.1 (a) Representation of a CAL dataflow actor (b) Mapping of dataflow actor onto IPPro datapath. ........ 91
5.2 IPPro datapath supporting firing of dataflow actor. ........ 93
5.3 Producer-consumer data-driven execution using IPPro core. ........ 94
5.4 Impact on $f_{Max}$ of realising FIFOs using different resources and configurations. ........ 96
5.5 Mapping of dataflow execution patterns on IPPro core. ........ 98
5.6 Pseudo IPPro code to implement dataflow execution patterns. ........ 98
5.7 Block diagram of multi-port input data interface of IPPro datapath. ........ 99
5.8 Impact of multi-port IPPro datapath on execution time (in clock cycles) of dataflow actor. ........ 101
5.9 Multiple IPPro core-based hardware accelerator designs (a) Design A (b) Design B (c) Design C (d) Design D. ........ 105
5.10 Multiple IPPro cores as dataflow accelerators deploying dataflow optimisations (a) One-to-one actor-core mapping (b) 2-way SIMD mapping per actor. ........ 110
5.11 Cyclic row-wise image/video pixel distribution. ........ 112
5.12 System level data distribution and control architecture. . . . . . . 113
5.13 FSM used to control the architecture of Fig. 5.14. . . . . . . . . . 114
5.14 Block diagram of implemented system architecture for case study. 116
5.15 IPPro hardware accelerator designs to explore and analyse the impact of parallelism on area and performance. (1) Single core IPPro, (2) 8-way SIMD IPPro, (3) Dual core IPPro, (4) Dual core 8-way SIMD IPPro. . . . . . . . . 119

6.1 Software and hardware abstraction of the platform. . . . . . . . . . 132
6.2 Block diagram of hardware dataflow actor firing module. . . . . . . 135
6.3 Data processing paths of the IPPro using scratchpad. . . . . . . . . 137
6.4 AMBA-AXI4 compliant management interfaces of the IPPro. . . . 139
6.5 Theoretical mapping of data exchange patterns on IPPro cores. . . 143
6.6 Realisation of data exchange patterns using stream interconnect. . 144
6.7 Stream interconnect architectures with increasing core connectivity. . . . . . . . . 146
6.8 A dataflow graph example that covers pipelining of multiple data parallel actors. . . . . . . . . 151
6.9 Flat illustration of mapping and execution of pipelined multiple data parallel actors exploiting parallelism using multicore IPPro. The listed IPPro code shows the read, write and tagging of tokens for each actor. These tags are used by the interconnect to route token among cores of the multicore IPPro. . . . . . . . . 152
6.10 Parallel distribution of row-wise cyclic image pixels. . . . . . . . . 154
6.11 Generation and distribution of the point or window pixels. . . . . 155
6.12 Block diagram of programmable hardware acceleration platform. The diagram only shows a single multicore IPPro due to space limitations. Cascading of multiple multicore IPPro cores is possible, as permitted by FPGA area resources. ........................................... 156

6.13 Video processing system architecture using FPGA-based programmable hardware acceleration platform. ........................................... 163
Chapter 1

Introduction

Image processing has been a field of academic research for several decades and is extensively employed to interpret meaning from images or video. The vast majority of electronic systems, from the automotive industry to factory automation, medical and surveillance, employ image processing to provide intelligent analysis of their systems and improve productivity. The processing demands of such workloads often surpass the capacity of traditional computing architectures.

Video analytics is the branch of embedded vision that analyses human activity and extracts information from video content that is meaningful as perceived by the human eye. It is gaining traction in a diverse set of application markets including retail, transportation, consumer, smart cities, critical infrastructure and enterprise, among others. These systems use smart cameras with on-board image pre-processing to reduce data bandwidth and memory requirements before sending the data to centralised, server-based software platforms [1], [2]. These platforms are aided by advanced algorithms that interpret and analyse an ever-growing volume of video content.

There is significant investment in industrial and educational research, which is expected to grow considerably in the coming years. The *Embedded Vision Alliance* has estimated that revenue from video analytics hardware, software and services will increase from $858 million to nearly $3 billion by 2022, representing a compound annual growth rate (CAGR) of 19.6% [3]. This growth brings significant challenges in exploring new parallel computing architectures in general, and image processing architectures in particular, that are portable, efficient and easier to use for a wide range of application developers.

## 1.1 Research problem

The increasing demands for computation and bandwidth of existing and next-generation image processing applications pose severe challenges to both hardware and software solutions. While special-purpose hardware such as the *Graphics processing unit* (GPU) can handle the increasing computational demands of these data-intensive applications, it comes at the expense of higher power consumption, longer design times and significant programming effort. Moreover, rapid advances and changes in state-of-the-art technology for these applications quickly make a dedicated accelerator or chip obsolete. This obsolescence is especially true in the case of the *Application-specific integrated circuit* (ASIC).

*Field-programmable gate array* (FPGA) technology has evolved significantly over the years from simple regular arrangements of *configurable logic* blocks and routing to a heterogeneous *system-on-chip* (SoC). Much of this improvement has inevitably been driven by market segments, with FPGAs particularly prevalent in signal processing due to the pipelining and parallelism that they offer. While the technology gap between ASIC and FPGA is widening, most new ASIC designs lag behind due to overall design effort, time and cost, making FPGAs more attractive. FPGAs are proven computing platforms that offer reconfigurability, concurrency and pipelining. GPUs seem a viable, highly programmable platform, but current energy requirements and the limitations of Dennard scaling have acted to limit clock scaling and thus processing capabilities [4].

Although FPGAs are a high-performance and power-efficient computing technology, they have not been accepted as a mainstream computing platform. The primary inhibitors are the need to use specialist programming tools, the need to describe algorithms in a hardware description language (HDL) and a lack of adaptability. Silicon vendors have started to alleviate this issue by introducing high-level programming tools such as Xilinx’s Vivado High-level Synthesis (HLS) and Intel’s (Altera’s) compiler for OpenCL. While the level of abstraction has been raised, a gap still exists between adaptability, performance and efficient utilisation of FPGA resources. Moreover, the FPGA design flow still requires design synthesis and place-and-route, which can be time-consuming depending on the complexity and size of the design; this is alien to software and algorithm developers. The development of algorithms is usually an experimental process and may require many design iterations involving quick profiling, design exploration and prototyping. In such circumstances, an FPGA design flow that requires synthesis and place-and-route is not comparable to the more familiar software-centric edit-compile-run design flow. Therefore, iterative development of different applications on FPGAs is a complicated and time-consuming process, which inhibits widespread use of the technology.

The changing technology landscape and fast evolution of new application use-cases make it imperative that the underlying hardware architecture be *adaptable*. Such platforms are a significant part of some major research initiatives where both quick prototyping and reduced design time are of prime importance. Moreover, the computing platform should allow design exploration possibilities including decomposition and mapping to optimise applications.

## 1.2 Rathlin Project

The *Rathlin* research project was undertaken to address these research problems [5]. The scope of this project was to investigate the rapid developments in image acquisition/interpretation and intelligent algorithms, which have not been matched by sound software engineering principles for generating time-, memory- and power-efficient hardware solutions.

A domain-specific image processing language for FPGAs, the *Rathlin Image Processing Language* (RIPL), was introduced [6]. RIPL supports algorithmic skeletons to express image processing components, which functionally inherit a dataflow *model of computation*. A RIPL description is converted into an intermediate dataflow language (CAL), which is mapped onto the FPGA as a network of stream processing units [7]. One of the project objectives was to facilitate iterative development of different applications by replacing the FPGA design flow with a software-centric flow. Therefore, an adaptable *FPGA-based hardware acceleration platform* architecture was developed that efficiently maps and executes parallel CAL dataflow descriptions. This platform aimed to unleash the potential of state-of-the-art FPGAs in close synergy with a suitable software representation. The *Rathlin* programming workflow and relevant background work are discussed further in Chapter 3.

1.3 Proposed approach

FPGA heterogeneous system-on-chip (SoC) architectures have addressed some of the hardware and software programming challenges [8], [9], [10]. However, fitting different parallel computational tasks onto the underlying FPGA hardware resources by using more processing nodes integrated into a single chip remains a challenge. The need for architecture-specific skills to port and optimise applications for the underlying FPGA hardware resources, which includes managing and exploiting parallelism and system heterogeneity, is also challenging. This problem is directly related to the optimal exploration of the type and degree of parallelism among multiple processing nodes available within a heterogeneous system. Realising parallel applications on these heterogeneous platforms often involves the design and development of processing nodes or hardware accelerators. They can comprise fixed, reconfigurable or software programmable processors, or combinations thereof. The adaptability of the underlying platform depends on the flexibility and programmability of its processing nodes. This adaptability can be present in the device, in the circuit, in the micro-architecture, in the system, in the runtime software layer, or in all of these.

This research work proposes an FPGA-based programmable hardware acceleration platform. It is a system architecture that takes advantage of heterogeneous computing. The FPGA glue logic can be used as a programmable hardware acceleration architecture that replaces the traditional FPGA design flow (synthesis and place-and-route) with a software-centric edit-compile-run design flow [11].

FPGA-based soft-core processor architectures have been used [12], [13], [14], [15], [16] as they offer better software controlled functionalities, system flexibility/portability, and partitioning of hardware-software co-design over other approaches [17]. The *programmable hardware acceleration architecture* is a three-layer architecture as illustrated in Figure 1.1 and outlined below:

- The bottom layer is comprised of a novel FPGA-based soft-core *Image Processing Processor (IPPro)* architecture tailored to accelerate image pre-processing applications. It supports both shared memory and message passing data processing models. The IPPro core is an independent, self-managed, programmable hardware accelerator that handles the exchange of data among multiple producers and consumers by executing stream instructions. It is used as a basic computational unit of the proposed platform as shown in Figure 1.1.

- The middle layer is composed of multiple IPPro cores connected with an interconnect called **multicore IPPro** as shown in Figure 1.1. It extends both shared memory and stream processing and is supported by the lower layer to realise parallel computing models. The interconnect provides a deterministic, self-synchronising programmable inter-core communication mechanism to facilitate implementation of graph modelling various kinds of parallel/concurrent activities. The shared memory model offers programmable explicit synchronisation mechanism between each IPPro core and host processor to realise distributed computing and coprocessor activities.

![Diagram](image)

Figure 1.1: Hierarchical illustration of hardware and software abstraction supported by each layer of the proposed programmable hardware acceleration architecture.

- The top layer provides **system infrastructure** that distributes data to, and collects data from, the lower layers. These mechanisms are necessary for efficient implementation of different parallel applications exploiting data and task parallelism as shown in Figure 1.1. It also provides parametric/software-configurable and dynamic data and control mechanisms to use common parallel algorithmic skeletons (split, compute and merge, farm and pipeline) and image processing operations (point and area), utilising the architectural features and processing capability provided by the bottom two layers.

The proposed approach provides a hierarchical abstraction to hardware computing resources, and the relevant communication and data access mechanisms that help to address the challenges faced by algorithm and software developers to adopt FPGAs. This approach also enables parallel exploration, profiling and implementation of different image processing algorithms to achieve the required goals.
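
As a rough illustration of the split, compute and merge skeleton named above, the following Python sketch applies a point operation to an image held as a list of rows. It is a minimal software model under stated assumptions and not part of the IPPro tool-flow: the names `split_rows`, `compute`, `merge` and `invert`, the four-stream configuration and the test image are all hypothetical, and each call to `compute` merely stands in for one processing core.

```python
from typing import Callable, List

Row = List[int]
Image = List[Row]

def split_rows(image: Image, n_streams: int) -> List[Image]:
    """Split an image into n_streams contiguous bands of rows."""
    base, extra = divmod(len(image), n_streams)
    bands, start = [], 0
    for i in range(n_streams):
        size = base + (1 if i < extra else 0)  # spread the remainder rows
        bands.append(image[start:start + size])
        start += size
    return bands

def compute(band: Image, point_op: Callable[[int], int]) -> Image:
    """Apply a point operation to every pixel of one band (one worker)."""
    return [[point_op(p) for p in row] for row in band]

def merge(bands: List[Image]) -> Image:
    """Concatenate the processed bands back into a single image."""
    return [row for band in bands for row in band]

def invert(pixel: int) -> int:
    """Hypothetical 8-bit point operation used as the workload."""
    return 255 - pixel

# Hypothetical 6x8 test image and a four-way split.
image = [[(r * 8 + c) % 256 for c in range(8)] for r in range(6)]
bands = split_rows(image, n_streams=4)
result = merge([compute(band, invert) for band in bands])
```

Because a point operation has no dependency between pixels, the merged result is identical to processing the whole image in one pass, whatever the number of streams.
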
1.4 Thesis Contributions

The following are the notable contributions presented in this thesis work:

1. Design and development of a novel FPGA-based Image Processing Processor (IPPro) soft-core architecture tailored for acceleration of image pre-processing applications. The architecture is carefully designed to support the functional computing requirements of image processing while maintaining efficient utilisation of FPGA compute and memory resources. The architecture supports both message passing and shared data models, enabling stream and batch processing of uniform and non-uniform distributed data. These data processing paths provide architectural features to facilitate implementation of the split, compute and merge, pipeline and farm parallel computing skeletons. Using IPPro as a fundamental computing element makes the FPGA-based platform flexible and adaptable. It allows deployment of an edit-compile-run flow that avoids design synthesis and place-and-route, which reduces design time.

2. Design and development of IPPro-based hardware accelerator models to identify the architectural requirements of the accelerator’s management and provisioning policies, and their impact on the timing results of the processor. IPPro is designed as an independent, self-managed, programmable dataflow accelerator. The program code embeds both the actor’s functional description and its interaction with multiple producers and consumers. It avoids the need for external control mechanisms necessary to synchronise interaction between actors while exchanging data tokens and minimises IPPro core management and control overheads. It gives better controllability on the actor’s token production and consumption rate and implements different data exchange patterns (split and merge). Besides, it enables fine and coarse-grained mapping and execution of data and control flow graphs which are commonly found in image processing applications.

3. Development of a *multicore IPPro* architecture that provides flexible connectivity among multiple IPPro cores and enhances the platform’s programmable computing and mapping capabilities for dataflow applications. The architecture complements the supported features of the IPPro core and provides dynamic routing of dataflow streams among multiple IPPro cores. The connectivity among cores allows adaptable implementations of one-to-many, many-to-one and many-to-many producer-consumer dataflow graphs utilising the same hardware resources. These architectural features offer application profiling and optimisation options to the software and algorithm developer by exploiting data, task and pipeline parallelism.

4. Design and development of FPGA-based software controlled data distribution and collection architecture supporting different image resolutions. It divides an image stream into a variable number of parallel data streams that can be fed across multiple IPPro cores to realise a parallel computing paradigm. The architecture is independent, self-managed and can be integrated with both direct and buffered video processing pipelines to distribute data across multiple processing elements which are fixed in *High-level Synthesis* (HLS) system architectures. It facilitates parallel implementation of a split, compute, merge computing skeleton using multicore IPPro.

5. Design and development of an adaptable *FPGA-based hardware accelera-
1.5 Thesis Outline

The presented work is a part of a larger research project called Rathlin and covers the underlying FPGA-based hardware architecture. Chapter 3 gives an overview of the project’s scope and programming workflow. It will help the reader to understand the bigger picture of the presented research, the reasons for the adopted approach, and some of the design choices made in designing IPPro, multicore IPPro and the platform architecture.

Chapter 4 presents an FPGA-based soft-core *Image Processing Processor* (IPPro) architecture tailored to accelerate image pre-processing operations. The processor datapath has been developed after a detailed analysis of FPGA resources, processor functionality and dataflow models. It exploits the FPGA’s dedicated computing and memory resources to achieve the best balance between performance and area utilisation, and enables software recompilation for the FPGA, thereby avoiding synthesis and place-and-route times. The processor datapath implements dedicated minimum and maximum instructions for optimised implementation of specific image pre-processing functions. A coprocessor extension is also implemented to integrate dedicated processing units and offload complex arithmetic operations transparently. At the end of the chapter, the performance and area results achieved by a single-core IPPro are compared against a fixed *high-level synthesis* (HLS) implementation, an FPGA-based programmable processor architecture and the well-established MicroBlaze soft-core processor. The IPPro core is viable for use as the basic processing element of a *programmable hardware acceleration architecture*.

Chapter 5 presents IPPro as a *programmable dataflow accelerator* architecture that can map and execute fine and coarse-grained dataflow actors using a producer-consumer computing model. The execution patterns supported by the architecture provide flexible mapping options to the user and software framework to explore and deploy different dataflow graph optimisations. It also presents a detailed analysis of the management and provisioning of hardware accelerators when used in a heterogeneous system architecture, and their impact on the system’s architectural requirements and resource utilisation.

Chapter 6 presents a heterogeneous *FPGA-based programmable hardware acceleration platform* architecture that supports a software-controlled implementation of parallel skeletons on hardware. The platform is composed of a host processor and tightly-coupled homogeneous FPGA-based programmable hardware accelerators (IPPro cores). The platform facilitates the implementation of the split, compute and merge, pipeline and farm parallel skeletons by providing software-abstraction to make it easy to use for the software developer. The platform covers three hardware and software abstraction layers as indicated in Figure 1.1. At the end of the chapter, the acceleration results of a set of image pre-processing micro-benchmarks and functions, covering data and task parallel balanced and unbalanced dataflow actors are presented. This allows the mapping flexibility and the system’s adaptability to implement different applications and computing paradigms to be evaluated.
Chapter 2

Background

The changing technology landscape and fast evolution of new application use-cases raise the need for adaptable and efficient hardware architectures. These architectures must handle the processing of dynamic data workloads and at the same time provide the adaptability to implement different applications. This research problem motivated an investigation of different FPGA-based design approaches and programmable architectures. This chapter covers the multidisciplinary concepts related to FPGA-based hardware design approaches, the dataflow model of computation and parallel computing, and reviews the background and related work relevant to the thesis.

Section 2.1 covers the background on parallel embedded architectures, focusing on FPGA-based hardware acceleration approaches, and details the pros and cons of the existing approaches. It discusses the benefits of FPGA technology for realising efficient hardware acceleration and developing programmable/adaptable architectures. Section 2.2 covers the basic concepts of the dataflow model of computation and presents the notion of parallelism and the dataflow transformations used to achieve optimised implementations. This is followed in Section 2.3 by a discussion of parallel computing skeletons, which provide high-level programming constructs suitable for software and algorithm developers. Section 2.4 reviews the related work on FPGA soft-core and multicore processor architectures.

2.1 Parallel embedded architectures

During the last decade, multiprocessor architectures have emerged as an important computing paradigm for parallel computing [18], [19], [20], [21]. They have driven the development of advanced parallel embedded architectures [22], [23]. The trend of integrating homogeneous and heterogeneous processing units has opened up various hardware-based parallel application decomposition, mapping and design exploration possibilities [24], [25], [26], [27]. Hardware architectures composed of tens or hundreds of light-weight compute units have become commonplace as a means of optimising performance [12], [13], [14], [15], [16]. At the same time, these hardware architectures present several challenges, such as the architecture-specific skills needed to port and optimise applications for the underlying architecture, which includes managing and exploiting parallelism and system heterogeneity. This section covers the background necessary to understand these challenges and the FPGA-based hardware design approaches taken by the research community.

2.1.1 FPGA multiprocessor system-on-chip

Emerging heterogeneous multiprocessor system-on-chip (MPSoC) architectures such as the Xilinx Zynq-7000 and Altera Arria-V SoC integrate the software programmability of a general purpose processor (ARM) with the hardware programmability of an FPGA. This integration of hardware and software makes MPSoC architectures a suitable computing platform for implementing mixed functionality on a single device and for developing adaptable embedded architectures [15], [28], [29].

Although these heterogeneous MPSoC platforms have addressed some of the hardware and software programming challenges, fitting parallel computational tasks to the underlying hardware resources by using more processing nodes integrated into a single chip is still a challenge. This problem is directly related to the optimal exploration of the type and degree of parallelism among multiple processing nodes available within the heterogeneous system [12], [13], [14], [30]. Besides, the optimised realisation of parallel applications on these heterogeneous platforms often involves the design and development of hardware accelerators to meet application requirements. The architecture of these hardware accelerators can span a range of flexibility, from fixed through reconfigurable to software programmable, or combinations thereof. They reside on the FPGA fabric and are usually managed by a general purpose processor such as an ARM Cortex-A processor [28], [29]. There are different FPGA hardware design approaches to realise such hardware accelerators, which are discussed in further detail in Section 2.1.2.

2.1.2 FPGA hardware accelerator design approaches

The silicon vendors and the research community have developed and proposed different architectures, design tools and software frameworks that ease the development of hardware accelerators. The silicon vendors’ tools provide a cohesive heterogeneous hardware-software co-design solution to develop and integrate custom FPGA-based hardware accelerators. However, realising a different application use case requires architectural changes, design synthesis and place-and-route [20], [21], [31], [32], [33], [34]. These design tools cover both the hardware and software design space, which can be divided into front-end software compilation and back-end hardware compilation tasks as illustrated in Figure 2.1 [35].

The front-end software compilation includes application description and accelerator architecture layers. The application description can be a domain or target specific, while the accelerator architecture can encompass a wide range of hardware accelerator architectures. On the other hand, the physical mapping layer uses silicon vendor tools to physically map the chosen hardware accelerator architecture onto the FPGA resources to achieve back-end hardware compilation.

To provide programming abstraction, the application can be described in a high-level language such as C/C++ or OpenCL, or in a domain-specific language. This application description is translated, optimised and compiled into an intermediate representation that can be mapped onto the target-hardware-accelerator. A wide range of target-hardware-accelerator approaches can be adopted, ranging from a highly optimised application-specific processor, through a flexible and programmable soft-core processor, to an overlay architecture, or a combination thereof, as illustrated in Figure 2.1. Each of these approaches has its pros and cons regarding design flexibility, area and performance [12], [20], [21], [23]. Based on the chosen target-hardware-accelerator architecture, the intermediate representation can be converted into a set of dedicated domain-specific instructions, program code consisting of a mix of general purpose instructions, a hardware description language (HDL) description, or a combination thereof. The physical layer takes the HDL description of the target-hardware-accelerator design and converts it into an FPGA-mappable form, i.e. onto the physical resources of an FPGA (flip-flops, lookup tables, dedicated DSP and memory blocks). This task requires technology-dependent optimisation and routing mechanisms, which are carried out by automated silicon vendor tools. The tasks involve design synthesis, place-and-route and bit-stream generation. These steps can be significantly time-consuming for iterative algorithm development, depending on the complexity and size of the hardware design [8], [9], [10], [37].

**High-level synthesis (HLS)**

*High-level synthesis* (HLS) tools take an application description and use different analysis techniques to profile and explore the design space. The majority of these tools use a dataflow model of computation; Table 2.1 lists both academic and commercial HLS tools that are widely reported in the open literature [38], [39]. These tools support different high-level languages such as C, C++, OpenCL or domain-specific languages to describe an application. They profile, explore, optimise and compile the high-level description into an intermediate representation, which is translated into a hardware description language [7], [40], [41] such as VHDL, Verilog or SystemC as listed in Table 2.1.

These tools take advantage of FPGA deep pipelining to exploit parallelism and explore performance and resource optimisations by tuning the size of the first-in-first-out (FIFO) buffers [41], [42]. An oversized buffer uses more resources than needed, while an undersized buffer can cause additional delays, stalls and deadlocks during execution of the application [40], [43]. However, all HLS tools generate a fixed hardware architecture tailored to accelerate a specific application or part of an algorithm, which is not adaptable. To implement a different application, the only possibility is to rewrite the description and go through all the front-end and back-end tasks illustrated in Figure 2.1. The back-end tasks can significantly increase the design time [8], [9], [21], [23], [37], which is unappealing to software and algorithm developers because iterative algorithm development requires design exploration and fast prototyping. Section 2.1.3 discusses this problem in detail.
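
The buffer-sizing trade-off described above can be made concrete with a small cycle-level model. The sketch below is illustrative only and does not reproduce the behaviour of any particular HLS tool: the function `simulate_fifo`, the bursty producer and the periodic consumer are assumptions chosen to show how a FIFO that is too shallow stalls the producer, while a deeper FIFO absorbs the bursts at the cost of more storage. Deadlock is not modelled.

```python
def simulate_fifo(depth, burst, idle, consume_period, n_tokens):
    """Cycle-level model of one producer and one consumer joined by a FIFO.

    The producer tries to write `burst` tokens back to back and then idles for
    `idle` cycles; the consumer reads one token every `consume_period` cycles.
    Returns (total_cycles, producer_stall_cycles).
    """
    occupancy = produced = consumed = stalls = cycle = phase = 0
    while consumed < n_tokens:
        # Consumer side: read one token on its active cycles if one is waiting.
        if cycle % consume_period == 0 and occupancy > 0:
            occupancy -= 1
            consumed += 1
        # Producer side: write during the burst phase, retry when the FIFO is full.
        if produced < n_tokens:
            if phase < burst:
                if occupancy < depth:
                    occupancy += 1
                    produced += 1
                    phase += 1
                else:
                    stalls += 1  # FIFO full: the producer loses this cycle
            else:
                phase += 1       # idle part of the producer schedule
            if phase == burst + idle:
                phase = 0
        cycle += 1
    return cycle, stalls

# Hypothetical workload: 4-token bursts, 4 idle cycles, one read every 2 cycles.
shallow = simulate_fifo(depth=1, burst=4, idle=4, consume_period=2, n_tokens=16)
deep = simulate_fifo(depth=4, burst=4, idle=4, consume_period=2, n_tokens=16)
```

For this workload the average production and consumption rates match, so the deeper FIFO removes the producer stalls entirely, whereas the single-entry FIFO lengthens the run; increasing the depth beyond what the burst requires would only cost storage.
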

Table 2.1: High-level synthesis (HLS) tools for FPGAs [38], [39].

<table>
<thead>
<tr>
<th>HLS Tool</th>
<th>License</th>
<th>Input</th>
<th>Output</th>
<th>Data flow</th>
<th>Control Flow</th>
</tr>
</thead>
<tbody>
<tr>
<td>Catapult-C</td>
<td>Commercial</td>
<td>C/C++/SystemC</td>
<td>VHDL/Verilog/SystemC</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>Bluespec</td>
<td>Commercial</td>
<td>BSV</td>
<td>SystemVerilog</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>C-to-Silicon</td>
<td>Commercial</td>
<td>SystemC/C++</td>
<td>Verilog/SystemC</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>MaxCompiler</td>
<td>Commercial</td>
<td>MaxJ</td>
<td>RTL</td>
<td>✓</td>
<td>x</td>
</tr>
<tr>
<td>ROCCC</td>
<td>Commercial</td>
<td>C subset</td>
<td>VHDL</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>GAUT</td>
<td>Academic</td>
<td>C/C++</td>
<td>VHDL</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>Symphony C</td>
<td>Commercial</td>
<td>BDL</td>
<td>VHDL/Verilog</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>LegUp</td>
<td>Academic</td>
<td>C</td>
<td>Verilog</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>Vivado HLS</td>
<td>Commercial</td>
<td>C/C++/SystemC</td>
<td>VHDL/Verilog/SystemC</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>Altera SDK</td>
<td>Commercial</td>
<td>C/OpenCL</td>
<td>VHDL/Verilog</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>HIPAcc</td>
<td>Academic</td>
<td>C++ Embedded DSL</td>
<td>C++</td>
<td>✓</td>
<td>x</td>
</tr>
<tr>
<td>Merlin Compiler</td>
<td>Commercial</td>
<td>C/C++</td>
<td>C/OpenCL</td>
<td>✓</td>
<td>✓</td>
</tr>
</tbody>
</table>

2.1.3 Need for adaptable hardware architectures

The emerging versatile application markets raise the demand for high-performance and efficient FPGA architectures that can handle the processing of dynamic data workloads and at the same time are adaptable enough to accelerate different applications. One way to approach this research problem is to develop an adaptable FPGA hardware architecture that enables the edit-compile-run flow familiar to software and algorithm developers instead of hardware synthesis and place-and-route. This can be achieved by populating the FPGA logic with light-weight, high-performance soft-core processors used for programmable hardware acceleration. The underlying architecture will be adaptable and can be programmed using conventional software development approaches as illustrated in Figure 2.1. This approach does not require hardware design synthesis and place-and-route. Instead, it needs software re-compilation that generates binary code to run on the underlying soft-core processors.

HLS-based designs are optimised for a use case because the application is known before the underlying hardware is realised. In the processor-based approach, by contrast, the underlying hardware architecture is designed, synthesised, and placed and routed in advance. Therefore, the overall area is expected to be larger and the performance lower than with HLS; this is the price paid for adaptability and a reduction in design time.

This approach provides hardware abstraction of the underlying FPGA programmable resources by allowing them to be reconfigured using traditional software approaches, and exposes them to the software developer. It inherits software benefits such as portability, partitioning of complex hardware-software co-design, and decomposition and mapping options to achieve the desired area and performance goals. Besides, avoiding the iterative process of synthesis and place-and-route reduces design time, improves productivity and allows software-controlled design exploration. Jain, Rigamonti and Liu have reported order-of-magnitude improvements from compiling applications onto processor architectures over HDL and partial reconfiguration approaches [23], [44], [45]. Nevertheless, one of the significant challenges is to efficiently compile, map and execute parallel applications onto the underlying programmable hardware accelerator architecture.

2.1.4 FPGA memory and computation resources

The FPGA fabric provides the essential digital components necessary to build any digital circuit. It has logic blocks, dedicated memory and DSP blocks, clock management circuitry and routing resources to connect these digital components. In an FPGA, the locations of these components are fixed and cannot be changed, which makes it essential to consider their layout to obtain an area-efficient and high-performance hardware architecture. Figure 2.2 shows the available hardware resources, their raw-computation (GMACs) and the memory resources across different families of Xilinx FPGAs. The raw-computation (GMACs) is directly proportional to the number of DSP blocks. Dinechen et al. show how to map different arithmetic operators onto the FPGA fabric using different approaches, including LUTs and DSP blocks [46]. Similarly, others have presented mappings of mathematical expressions to these DSP blocks and achieved performance improvements [47], [48], [49].
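
The proportionality between raw computation and DSP block count can be written as a one-line estimate. The sketch below assumes that every DSP block completes one multiply-accumulate per clock cycle at a common clock frequency; the function name `peak_gmacs` and the device figures are hypothetical and are not taken from Figure 2.2 or any datasheet.

```python
def peak_gmacs(num_dsp_blocks: int, clock_mhz: float) -> float:
    """Peak multiply-accumulate rate in GMAC/s.

    Assumes one MAC per DSP block per clock cycle, which is an upper bound
    that ignores routing, memory bandwidth and pipeline fill.
    """
    return num_dsp_blocks * clock_mhz * 1e6 / 1e9

# Hypothetical device figures, chosen only to illustrate the proportionality.
small = peak_gmacs(num_dsp_blocks=200, clock_mhz=500.0)
large = peak_gmacs(num_dsp_blocks=400, clock_mhz=500.0)
```

Doubling the number of DSP blocks at a fixed clock doubles the estimate, which is the linear relationship stated above.
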

Figure 2.2: Trend of hardware resources, their raw-computation (GMACs) and memory across different families of Xilinx FPGAs [50], [51].

While the computing resources and bandwidth are high, the memory in an FPGA is limited compared to other computing technologies. Figure 2.3 shows the distribution of on-chip memory and bandwidth on the Virtex-7 FPGA. Moving away from the datapath, the memory size increases while the bandwidth becomes more limited. On-chip memory consists of LUT-based distributed RAMs, which are small and close to the datapath and can provide faster access to data at higher bandwidth. Block RAM, on the other hand, is comparatively larger but limited in bandwidth. This shows that there is a trade-off between memory size and bandwidth.

Figure 2.3: FPGA memory and bandwidth hierarchy of Xilinx Virtex-7 FPGA.

Focusing on the FPGA technology, the 7 series Xilinx FPGAs come in three different families. The 7 series combines a power-reducing process, design techniques and architectural enhancements to deliver the lowest-in-class power consumption compared to the previous generation of Xilinx FPGAs. It covers the low-cost Artix-7 family, the midrange Kintex-7 family and the high-end Virtex-7 family of FPGAs. All three families use the same 28 nm silicon process technology and share the basic FPGA building blocks of logic cells, DSP blocks and Block RAM, making it simpler to migrate designs across FPGA families. The Kintex-7 device family balances FPGA fabric clock rate performance against power consumption, high-speed I/O, capacity and reliability. Artix-7 uses the same FPGA resources as Kintex-7 but is optimised for even lower power consumption and smaller packages, delivering similar advantages at a lower chip price and lower performance.

2.1.5 DSP block

Most digital signal processing applications make extensive use of multiply and accumulate operations, which can be performed efficiently using the DSP blocks. These blocks are uniformly distributed inside the FPGA fabric. They are capable of performing basic arithmetic and logic operations on data, which makes them suitable for designing an efficient, high-performance arithmetic logic unit (ALU) for a processor. The Xilinx DSP blocks (DSP48E1 and DSP48E2) support these operations and, in contrast to Altera’s, can be dynamically configured. Figure 2.4 shows the simplified functional block diagram of the DSP48E1. It has four main arithmetic blocks:

1. 25-bit pre-adder
2. 25 × 18-bit multiplier
3. 48-bit adder, subtractor and logic unit
4. Comparator and pattern detector

The DSP48E1 is capable of multiply, multiply-accumulate, add, subtract and other operations. It also has a set of control registers that allow the internal datapath to be controlled on a cycle-by-cycle basis (for details see Appendix B, Table B.1). Pipeline registers enable or disable the internal pipelining of the DSP48E1 block; when enabled, they improve the timing of the block by reducing the critical path [52]. Three internal multiplexers allow mapping of input and output operands to the multiplier and adder/subtractor.
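
To make the role of the four arithmetic blocks concrete, the following Python sketch gives a simplified behavioural model of the datapath P = C ± (D + A) × B using the bit widths listed above. It is an illustration under stated assumptions, not a cycle-accurate or complete model of the DSP48E1: pipeline registers, the control registers and the operand multiplexers are omitted, and the function names `wrap_signed`, `dsp_slice` and `pattern_detect` are hypothetical.

```python
def wrap_signed(value: int, bits: int) -> int:
    """Wrap an integer to a two's-complement value of the given width."""
    mask = (1 << bits) - 1
    value &= mask
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def dsp_slice(a: int, b: int, c: int, d: int, subtract: bool = False) -> int:
    """Simplified datapath: P = C + (D + A) * B, or C - (D + A) * B."""
    pre = wrap_signed(d + a, 25)               # 25-bit pre-adder
    product = pre * wrap_signed(b, 18)         # 25 x 18-bit signed multiplier
    acc = c - product if subtract else c + product
    return wrap_signed(acc, 48)                # 48-bit adder/subtractor

def pattern_detect(p: int, pattern: int, mask: int = 0) -> bool:
    """True when P equals the pattern on every bit that is not masked out."""
    width = (1 << 48) - 1
    return ((p ^ pattern) & ~mask & width) == 0

# Hypothetical operands: (2 + 3) * 4 accumulated onto 10.
p_add = dsp_slice(a=3, b=4, c=10, d=2)
p_sub = dsp_slice(a=3, b=4, c=10, d=2, subtract=True)
```

A multiply-accumulate is obtained by feeding the previous result back as the `c` operand on each call, which is the operation that makes the block a natural basis for a processor ALU.
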

2.2 Dataflow model of computation

In the early 1970s, various classes of model of computation (MoC) were introduced that model architecture-independent functional requirements through semantics and interfaces and provide synergy between processing units [53]. Choosing a suitable MoC is one of the key hardware design decisions. The dataflow MoC can serve as both an expressive programming model and an efficient execution model. It can express applications as networks of processes, which offer parallelism, scalability, modularity, portability and adaptivity. These characteristics are vital to unify the system-level design of heterogeneous platforms. Moreover, it follows the principle of stream processing [22], which is suitable for FPGA-based hardware architectures [54], [55], [56], [57].

2.2.1 Notion of parallelism in dataflow graphs

Stream and dataflow driven programming models allow efficient implementation of different types of parallelism [30], [58], [59]:

**Pipeline parallelism** A pipeline is a chain of actors $a_1, \ldots, a_n$ that are directly connected in the stream graph. Each pair $(a_i, a_{i+1}), i \in \{1, \ldots, n - 1\}$ has a producer/consumer relationship, that is, $a_i$ consumes items produced by $a_{i-1}$ and produces items that serve as input for $a_{i+1}$. Figure 2.5 shows a pipelined execution of functions A and B. It is important to note that the throughput of a pipeline is limited by its slowest actor or group of actors [60].

**Task parallelism** Two actors $a_1, a_2$ are task parallel if they are on different branches of the stream graph. In contrast to pipelines, there are no input/output dependencies between $a_1$ and $a_2$. Figure 2.5 shows the task-parallel actors D and E.

**Data parallelism** is the property of an actor to have no dependencies between one execution and the next. Such an actor can be replicated by using multiple instances of it; for example, G is replicated twice as shown in Figure 2.5.

![Diagram](image)

(a) Pipeline-parallel $A \parallel B$. (b) Task-parallel $D \parallel E$. (c) Data-parallel $G \parallel G$.

Figure 2.5: Illustration of pipeline, task and data parallelism in dataflow graphs [58].
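As a rough numerical illustration of the pipeline and data parallelism described above, the sketch below compares the throughput of two actors run sequentially, pipelined, and pipelined with the slower actor replicated. The per-item times are made-up values, the helper names are hypothetical, and split/merge and communication overheads are ignored.

```python
def pipeline_throughput(stage_times):
    """Items per time unit of a pipeline, limited by its slowest stage."""
    return 1.0 / max(stage_times)


def replicate(stage_time, copies):
    """Effective per-item time of a stateless actor replicated `copies` times."""
    return stage_time / copies


# Assumed per-item processing times of actors A and B (arbitrary time units)
stages = {"A": 2.0, "B": 8.0}

# No parallelism: every item passes through A and then B before the next starts
sequential = 1.0 / sum(stages.values())

# Pipeline parallelism: A and B overlap, so B (the slowest actor) sets the rate
pipelined = pipeline_throughput(stages.values())

# Data parallelism on top: B is replicated four times to balance the pipeline
balanced = pipeline_throughput([stages["A"], replicate(stages["B"], 4)])
```

With these assumed numbers, replication only helps until the replicated actor is as fast as its neighbour; after that, the other stage becomes the limit.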

2.2.2 Dataflow transformation

Dataflow transformations are frequently used to enhance system performance by improving the performance of slower dataflow nodes or parts of the graph [24], [26]. These transformations maintain the functionality of the original dataflow graph, but increase the throughput or decrease the latency [26], [30]. Dataflow graphs are amenable to coarse-grained transformations that exploit data, task and pipeline parallelism and can be efficiently implemented on FPGAs [24]. Single instruction multiple data (SIMD) based hardware architectures have been used to accelerate applications, including image pre-processing, because of the massive amount of pixel processing involved [61], [62], [63]. Dataflow-specific optimisations (decomposition, mapping and scheduling) and transformations (fission, fusion, etc.) can be exploited to improve performance [24], [30], [64]. These transformations open up decomposition and design space exploration possibilities to achieve the desired application goals. The application can be mapped onto a multicore architecture, which enables data and task parallelism to be exploited while supporting an edit-compile-run design flow.
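To make the idea of a functionality-preserving transformation concrete, the following sketch applies fission to a single stateless actor: the stream is distributed round-robin over several replicas and joined again in the same order. The actor `actor_g` and both runner functions are hypothetical illustrations, not part of any cited tool.

```python
def actor_g(x):
    """A stateless actor: each output depends only on the current token."""
    return 2 * x + 1


def run_original(stream):
    """Reference behaviour: one instance of the actor processes every token."""
    return [actor_g(x) for x in stream]


def run_fissioned(stream, copies=2):
    """Fission: split round-robin, process per replica, join round-robin."""
    stream = list(stream)
    # Split: token i goes to replica i % copies
    lanes = [stream[i::copies] for i in range(copies)]
    # Compute: each replica works on its own lane (independent, parallelisable)
    results = [[actor_g(x) for x in lane] for lane in lanes]
    # Merge: interleave the lanes to restore the original token order
    merged = [None] * len(stream)
    for i, lane in enumerate(results):
        merged[i::copies] = lane
    return merged
```

Because the actor holds no state between firings, the fissioned graph produces the same output stream as the original; an actor with state would need a different treatment.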

2.3 Parallel computing skeletons

Parallel computing skeletons capture common parallel-programming paradigms and present them to the programmer as high-level programming constructs equipped with well-defined functional semantics [12], [65], [66], [67], [68]. Each skeleton models a precise parallel pattern and hides the pattern implementation details from the programmer, as shown in Figure 2.6. These patterns are parametric and can be re-used in different applications. This approach is adopted by several parallel programming frameworks [31], [69], [70].

2.3.1 Pipeline

The basic idea of the pipeline skeleton is to split processing into a series of sequential steps, with storage at the end of each step, as shown in Figure 2.6. This is done by decomposing a sequential application into multiple independent but sequential tasks, where each task feeds data to the following task. It enables concurrency, since tasks can execute in parallel as soon as data is available at the processing node. The computational load of the tasks may vary and is not known before run-time unless a static model of computation, such as static dataflow, is used to define the parallel application. The maximum achievable processing rate is set by the slowest task, whose processing time is nevertheless shorter than the time needed to perform all the steps at once. By static profiling of the application in hand, it is possible to find an efficient decomposition that leads to balanced tasks with bounded memory requirements.
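A software analogue of the pipeline skeleton can be sketched with one thread per step and a bounded queue as the storage between steps. This is only an illustration under assumed names (`run_pipeline`, `_stage`, `_STOP`); it does not represent how the skeleton is realised in hardware.

```python
import queue
import threading

_STOP = object()  # sentinel that tells each stage the stream has ended


def _stage(func, inbox, outbox):
    """One pipeline step: take an item, process it, pass it on."""
    while True:
        item = inbox.get()
        if item is _STOP:
            outbox.put(_STOP)  # propagate end-of-stream to the next step
            return
        outbox.put(func(item))


def run_pipeline(stages, items, depth=4):
    """Run `items` through `stages`, one thread per stage, bounded queues."""
    queues = [queue.Queue(maxsize=depth) for _ in range(len(stages) + 1)]

    def feed():
        for item in items:
            queues[0].put(item)
        queues[0].put(_STOP)

    threads = [threading.Thread(target=feed)]
    threads += [threading.Thread(target=_stage,
                                 args=(func, queues[i], queues[i + 1]))
                for i, func in enumerate(stages)]
    for thread in threads:
        thread.start()

    # Drain the last queue while the stages are still running
    results = []
    while True:
        out = queues[-1].get()
        if out is _STOP:
            break
        results.append(out)
    for thread in threads:
        thread.join()
    return results
```

For example, `run_pipeline([lambda x: x + 1, lambda x: x * 2], range(5))` pushes each item through both steps in order, with the second step able to start as soon as the first has produced an item.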

2.3.2 Split, compute and merge

This skeleton is used to process regularly distributed data based on static decomposition. The data is divided into a number of equal sized blocks (row-based, column-based or block-based), where the number of parallel data blocks defines the level of exploitable data parallelism. In architectural terms, it is known as the scatter-gather or split-compute-merge parallel programming model, as shown in Figure 2.6. Moreover, it can be extended to implement different derived multi-stage pipelined skeletons that exploit both data and task parallelism using pipeline, split, compute, communication, compute and merge stages, or a combination thereof, to achieve better performance. The benefit of pipelining multiple stages is that it reduces data transfer overhead, improves data bandwidth and avoids memory bottlenecks, in contrast to the shared memory model-based acceleration approach, where bandwidth and cache coherency significantly degrade the performance.
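The split, compute and merge steps can be illustrated with a row-based decomposition of a small image held as a list of rows. The thresholding kernel and all function names below are assumptions chosen for the example; a thread pool stands in for the parallel processing nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def split_rows(image, blocks):
    """Split: divide the image into contiguous row blocks of near-equal size."""
    size = max(1, -(-len(image) // blocks))  # ceiling division, at least 1
    return [image[i:i + size] for i in range(0, len(image), size)]


def compute(block):
    """Compute: example per-pixel kernel (binary threshold at 127)."""
    return [[255 if pixel > 127 else 0 for pixel in row] for row in block]


def merge(blocks):
    """Merge: concatenate the processed blocks back in their original order."""
    return [row for block in blocks for row in block]


def split_compute_merge(image, blocks=2):
    """Process each row block independently, then reassemble the image."""
    with ThreadPoolExecutor(max_workers=blocks) as pool:
        # pool.map returns results in input order, so merge preserves layout
        return merge(pool.map(compute, split_rows(image, blocks)))


image = [[0, 200], [128, 127], [255, 1], [90, 130]]
result = split_compute_merge(image, blocks=2)
```

The number of blocks plays the role of the level of data parallelism; a per-pixel kernel like this one needs no communication between blocks, which is what makes the static decomposition straightforward.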

2.3.3 Farm

This skeleton is used to process irregular data. The farmer (host/master processor) allocates tasks to the workers until none are left, as shown in Figure 2.6. The farmer then waits for a result from a worker and immediately sends another work item to it. Each worker receives a work packet, processes it, and returns the result to the farmer, until it gets a stop condition from the farmer. The advantage of this approach is that the farmer knows which workers have returned the results of their tasks and are hence idle. Thus, the farmer can forward incoming tasks to the idle workers.

However, this approach has its disadvantages. It causes substantial overhead due to the exchange of messages between the farmer and the workers [72]. Moreover, the farmer might become a bottleneck if the number of workers is large. In this case, the farmer will not be able to keep all workers busy, leading to wasted workers. The number of workers that the farmer can keep occupied depends on the sizes of the tasks and the sizes of the messages the farmer has to propagate. In addition, since the process of assigning tasks is cyclic, deadlocks can occur for data-dependent tasks, where the computation of certain workers depends on the results of others. Finally, one has to make sure that the farmer reacts as quickly as possible to newly arriving tasks and to workers delivering their results.
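The farmer/worker interaction described in this section can be sketched in software as follows. The function `farm` and its parameters are hypothetical; threads and queues stand in for workers and message channels, and tasks are assumed to be independent (no data dependencies between workers).

```python
import queue
import threading


def farm(worker_fn, tasks, n_workers=3):
    """Farmer hands out one task per worker, then refills on each result."""
    tasks = list(tasks)
    task_q, result_q = queue.Queue(), queue.Queue()

    def worker():
        while True:
            packet = task_q.get()
            if packet is None:  # stop condition sent by the farmer
                return
            index, payload = packet
            result_q.put((index, worker_fn(payload)))

    workers = [threading.Thread(target=worker) for _ in range(n_workers)]
    for w in workers:
        w.start()

    pending = iter(enumerate(tasks))
    in_flight = 0
    # Initial allocation: at most one work packet per worker
    for _ in range(n_workers):
        packet = next(pending, None)
        if packet is None:
            break
        task_q.put(packet)
        in_flight += 1

    results = [None] * len(tasks)
    # Each returned result frees a worker, so send the next packet immediately
    while in_flight:
        index, value = result_q.get()
        results[index] = value
        in_flight -= 1
        packet = next(pending, None)
        if packet is not None:
            task_q.put(packet)
            in_flight += 1

    for _ in workers:
        task_q.put(None)
    for w in workers:
        w.join()
    return results
```

Every task costs two messages through the farmer (one packet out, one result back), which is the message-exchange overhead noted above and the reason the farmer can become the bottleneck as the worker count grows.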

This section has discussed the concept of parallel computing skeletons, which will be used to address the lack of hardware abstraction in FPGA-based architectures. Application designers find it difficult to utilise the available resources efficiently without hardware knowledge, since this involves handling low-level core, inter-core and system communication, system interfaces, etc. Therefore, Chapter 6 will present a detailed multicore and system-level architecture that supports these parallel computing skeletons.

2.4 Related work on FPGA soft processors

This section investigates different FPGA-based soft processor architectures. Emphasis is placed on word size, maximum clock frequency and resource usage to evaluate these processors. *Word size* is an essential parameter because at least 16-bit words are required to accurately represent the pixel data in different colour spaces with some redundancy. *Clock frequency* directly influences the maximum throughput of the design, which in turn affects the observed speed-up. Processors with lower resource usage leave more logic available for multicore architectures, which can achieve superior performance.

2.4.1 Scalar Processors

The commercial off-the-shelf offerings from the leading FPGA vendors Xilinx and Altera are the *MicroBlaze* and the *Nios II* processors respectively. Both are 32-bit soft processors based on a RISC architecture and are accompanied by their respective software development tool-chains. The performance-optimised MicroBlaze has a 5-stage pipeline and is capable of delivering up to 262 DMIPS, while Altera’s Nios II is capable of delivering 30 DMIPS. Both the Nios II and MicroBlaze are highly configurable, with options including a floating point unit, memory management, and interfacing to custom hardware accelerators. These commercial soft-core processors have been investigated and modified in several papers [73], [74], but the modified designs only achieved maximum operating frequencies ranging from 77 to 112 MHz.

Other processors are made available under open source licenses, such as the OpenRISC. It is an open source RISC-based processor with 32 and 64-bit modes and optional vector support [75]. The LEON3 is a 32-bit SPARC V8 compliant processor described in VHDL, which is available under the GNU GPL [76]. It uses a 7-stage pipeline, incorporates a floating point unit, supports symmetric multiprocessing and operates at up to 125 MHz.

2.4.2 Multicore Processors

In the open literature, many FPGA multicore processor architectures have been presented to accelerate different applications. The Silicon Hive [77] accelerator architecture replaced ASIC accelerators with reconfigurable cores, making accelerators fully programmable after fabrication and flexible to maintain throughout the product life-cycle. The basic component of the Silicon Hive architecture is the Processing and Storage Element (PSE), which consists of multiple functional units (FU) connected via an interconnect network (IN), as shown in Figure 2.7. It has one or more operation-issue slots (IS) associated with the FUs, distributed register files (RF) and an optional local memory storage (MEM). The PSE was designed in such a way that it ensures easy and clean datapaths for a compiler to handle, guaranteeing a high level of programmability. A matrix of one or more PSEs, together with a controller (CTRL) and configuration memory (CONFIG. MEM), makes up a cell. The PSEs within a cell can communicate with each other via data communication lines (CL). An array of one or more cells connected via a data-driven communication mechanism forms a streaming array, as shown in Figure 2.7. The communication across cells takes place through blocking FIFOs accessed from load/store (LD/ST) units within the cells, allowing multiple functions to be concurrently mapped onto the streaming array. Dan et al. extended the Silicon Hive approach and proposed the HiveFlex Moustique-IC2 processor [78] as a synthesisable soft-RTL core with an I/O subsystem specifically designed for image processing applications. The Moustique-IC2 was a Single-Instruction-Multiple-Data (SIMD) machine, which means that the same program operates on multiple pixels simultaneously. By increasing the SIMD factor, the same program can be used to process more pixels at once, thereby increasing the throughput. The 24-way SIMD processor achieved an operating frequency of 200 MHz on a 90 nm technology.

Figure 2.7: The layered block diagram of Silicon Hive architecture illustrating Processing Storage Element (PSE), cell and streaming array of cores [77].

Figure 2.8: The block diagram of PicoArray processors organised in a two-dimensional grid connected together using a deterministic picoBus interconnect [79].

Duller et al. have proposed the PicoArray [79], a massively parallel architecture designed as an alternative to ASIC designs, which are complex to design, expensive and require more design time and effort. The PicoArray is a tiled processor architecture composed of a large number of heterogeneous processing cores. The architecture was primarily designed for wireless infrastructure applications. The processors are organised in a two-dimensional grid and connected together using a deterministic picoBus interconnect, as shown in Figure 2.8. The inter-processor communication protocol is based on a time division multiplexing (TDM) scheme, where data transfers between processor ports occur during time slots that are scheduled automatically by the tool and controlled by the bus switches. The communication between the processors is fixed at compile time and cannot be changed dynamically. The PicoArray processor is designed as a 16-bit, 3-way VLIW RISC processor with a Harvard memory architecture. Four different processor variants are supported (standard, multiply-accumulate, memory and control). Each variant was designed for a mixture of DSP, stream and block-based processing and therefore has a different internal memory distribution. All four variants use the same RISC instruction set, except the MAC instruction, which can only be executed on the standard processor. With the exception of loads and branches, all instructions execute in a single cycle. Each processor can only access its own internal memory (between 1KB and 32KB) and communicates with other processors using input/output data ports. Each processor is initialised using a special configuration bus and programmed using assembly language. The PicoChip PC102 runs at 160 MHz on a Xilinx Virtex-4 FPGA [80].

Classical vector processing involves sending a stream of values into pipelined functional units [81]. Since then, several architectures have been proposed, including [82] and [83]. Some optimisations have been made to improve performance and reduce execution time by incorporating vector chaining, control flow execution, a banked register file, etc. The processor runs at a maximum operating frequency of 200 MHz [83]. Others have proposed multiprocessor architectures that exploit the hidden parallelism in parts of streaming applications for efficient implementation; for example, VENICE and VectorBlox MXP are designed to exploit data level parallelism (DLP) by processing vectors [83], [84].

Nachiket et al. have proposed GraphSoC, a custom soft processor [16] for accelerating graph algorithms using the Xilinx Zynq SoC. It is a 3-stage pipelined processor that supports graph semantics (node and edge operations). A single FPGA can fit multiple instances of these processors interconnected using a network-on-chip (NoC). The functional description of the graph is stored in the on-chip BRAM for fast local access. Larger graphs can be partitioned into sub-graphs and loaded one-by-one or split across multiple processors. The execute stage of the processor is customisable and supports four graph-specific instructions (send, receive, accumulate and update), which are implemented as a micro-coded datapath, shown in Figure 2.9. The processor datapath has no register file; instead, it has special purpose registers to hold edge and node information. The reported timing results shows

![Datapath of a basic pipelined processing node used in GraphSoC [16].](image)

Figure 2.9: Datapath of a basic pipelined processing node used in GraphSoC [16].

Andryc et al. have proposed FlexGrip [36], a customisable soft-core architecture that allows general-purpose graphics processing unit (GPGPU) code to be executed on an FPGA without the need for design synthesis and place-and-route. FlexGrip is a 32-bit scalable, configurable multicore processor architecture based on a single instruction, multiple-thread (SIMT) model, in which an instruction is fetched and mapped simultaneously onto multiple scalar processors (SPs), as shown in Figure 2.10. A streaming multiprocessor (SM) is composed of multiple SPs that enable multi-threaded execution. The number of threads is equal to the number of scalar processors inside an SM. The SM is a five-stage pipelined architecture consisting of fetch, decode, read, execute and write stages, as shown in Figure 2.10. The execute stage consists of multiple scalar processors and a single control flow unit, which operates on control flow instructions such as branch and synchronisation instructions. Each thread is mapped to one scalar processor, enabling parallel execution of threads. The write stage stores intermediate data in the vector register file, memory addresses in the address register file, and predicate flags in the predicate register file. Final results are stored in the global memory. A design with a single SM and 8 SPs implemented on a Xilinx Virtex-7 device achieved a maximum operating frequency of 100 MHz.

![Figure 2.10: Datapath of FlexGrip Streaming Multiprocessor (SM) [36].](image-url)

2.4.3 DSP Slice Processors

In 2009, the concept of using the DSP slice on Xilinx FPGAs as the basis for a soft-core processor was presented by Milford and McAllister [85]. In this paper, the authors design the FPGA Streaming Element (fSE), which is 8-stage pipelined and uses device primitives to maximise the efficiency of the processor. The instruction width is 22 bits, of which two bits are used for the opcode; the data word is 32 bits, with 16 bits each for the real and imaginary components. They implemented a 16-point FFT and compared it against the Xilinx dedicated IP core implementation. The processor not only runs at a higher operating clock speed (430 MHz) but also uses fewer LUTs (145) and requires fewer cycles to complete. The same authors adopt this fSE processor as the basis for a 16-way SIMD processor architecture [47]. They also include custom units for minimum and switch operations to decrease the instruction count for their chosen application, a sphere decoder. They achieved real-time performance for the 802.11n standard with a clock speed of 265 MHz on a Xilinx Virtex-5 FPGA.

Cheah et al. have proposed the iDEA processor, which is based on the DSP48E1 primitive blocks [86]. It is based on a RISC load/store architecture and executes 32-bit instruction words on 32-bit data. They investigated a range of pipelining configurations and achieved a maximum operating frequency of 407 MHz with a
2.5 Summary

In this chapter, we have discussed in Section 2.1 the significance of heterogeneous MPSoC platforms, which provide hardware acceleration opportunities through FPGA programmable logic that can be used to accelerate the computation-intensive portions of an application. However, a major inhibitor to using this technology to realise adaptable solutions is the lack of hardware abstraction and the complexity of the FPGA design flow, especially for software and algorithm developers. Both the commercial and academic research communities have developed high-level synthesis (HLS) tools that allow FPGAs to be programmed in high-level programming languages familiar to software and algorithm developers, such as C, C++, SystemC and OpenCL. However, these tools generate an application description that still requires synthesis and place-and-route, which can be significantly time-consuming for an iterative application development process because of the lack of adaptability.

To approach this problem, we propose a multicore processor approach that replaces the traditional hardware synthesis and place-and-route flow with an edit-compile-run design flow. This approach provides hardware abstraction of the underlying FPGA resources and provides adaptability by programming the underlying architecture using conventional software development approaches. Section 2.4 reviewed a range of soft-core processor architectures and showed that they are either not area-efficient or do not deliver the high raw computation, evaluated in terms of their maximum operating frequency \( f_{\text{Max}} \), that is essential for hardware acceleration. It is vital that the soft-core processor is light-weight and high-performance, and that it utilises the FPGA resources efficiently. Therefore, Chapter 4 will present the novel Image Processing Processor (IPPro) architecture, which will be used as the basic computational unit to realise the flexible and adaptable multicore architecture.

Section 2.2 has briefly covered the concepts and related work necessary for the novel FPGA-based soft-core Image Processing Processor (IPPro) architecture, presented in Chapter 5, to map and execute dataflow actors. In addition, the notions of parallelism and transformations were covered to set the background for the multicore IPPro architecture presented in Chapter 6.

Section 2.3 has discussed the concept of modelling parallel patterns to exploit parallelism. Skeletons hide the pattern implementation details and underlying hardware peculiarities from the programmer and give a clean abstraction for exploiting parallelism. These patterns are portable and reusable, and shall be supported by the underlying hardware architecture. Implementation of these parallel patterns is central to realising our proposed hardware acceleration approach. Chapter 6 will present a detailed multicore IPPro and system architecture that supports the adaptable hardware implementation of the discussed parallel skeletons.
Chapter 3

Rathlin Project

To approach the outlined research problems, a collaborative research project called Rathlin was undertaken between Queen’s University Belfast and Heriot-Watt University, funded by the Engineering and Physical Sciences Research Council (EPSRC) [5]. The scope of the project is to investigate the rapid developments in image acquisition/interpretation and intelligent algorithms, because FPGA-based hardware architecture development has not been matched by sound software engineering principles for generating time, memory and power efficient hardware solutions.

One of the primary objectives of the Rathlin project was to design and develop an FPGA-based hardware acceleration platform architecture for image processing applications, which was my contribution to the project. The aim was to unleash the potential of state-of-the-art FPGAs in close synergy with a suitable software representation. This representation, together with a programming environment, facilitates exploration, profiling, optimisation and parallel implementation of image processing applications using conventional programming approaches. Therefore, some design decisions and choices in the presented work have been driven by the scope of Rathlin to complement its aim and objectives.

This chapter presents the key objectives of Rathlin, its programming workflow and the model of computation. It gives a bigger picture of the performed research and helps to explain some of the design decisions behind the Image Processing Processor (IPPro), the multicore IPPro and the FPGA-based hardware acceleration platform.

3.1 Rathlin Objectives

The primary project objectives are:

- Creation of a dataflow model of computation representation that captures the processing and data organisation needs of image processing algorithms.

- Design and development of an adaptable FPGA-based hardware acceleration platform using the IPPro soft-core processor and focusing on the efficient utilisation of FPGA resources while matching the computational and memory requirements of the algorithms/applications.

- Development of a programming environment for a Domain-Specific Language (DSL), optimally compiled to the platform using dataflow techniques and integrated with a standard Application Programming Interface (API) to execute on the underlying hardware platform.

- An adaptable realisation of a set of image processing algorithms to evaluate the performance and adaptability of the platform to accelerate different image processing applications.
3.2 Programming workflow

One of the project objectives is the adaptable realisation of image processing algorithms on an FPGA-based hardware platform. Such a realisation consists of various stages, as illustrated in Figure 3.1. From top to bottom, it involves algorithm development in the Rathlin Image Processing Language (RIPL) [6], which was being developed by Heriot-Watt University, a dataflow language, the Cal Actor Language (CAL) [87], and the IPPro-based hardware platform.

The programming workflow consists of the RIPL DSL, an intermediate representation and a compiler framework to profile and optimise the generation of IPPro code that can execute on the platform. CAL has been chosen as the intermediate dataflow language between RIPL and the compiler framework that generates the IPPro code, because the CAL compiler (Orcc [89]) can generate application-specific implementations for different target platforms (C/Java for the CPU, VHDL for the FPGA) using the available runtime libraries. This flexibility gave the project team members alternative design routes to carry on their research activities by implementing, verifying and benchmarking different applications without depending on the IPPro design route. The interaction and connectivity details between dataflow actors are described in XML DataFlow (XDF). From the user perspective, the compiled IPPro code runs on the FPGA-based hardware platform as executable binary code, which avoids the need for synthesis, place-and-route and bit-stream generation.

![Figure 3.1: Rathlin workflow of RIPL to IPPro-based platform with alternative compilation paths [88].](image)

3.3 Cal Actor Language (CAL)

CAL is a programming language based on the dataflow MoC, in which an actor executes a sequence of discrete computational steps known as actor firings. In each step, an actor may (a) consume a finite number of input tokens, (b) produce a finite number of output tokens, and (c) modify its internal state, if it has any. In CAL, an actor is specified as one or more actions. Each action describes the conditions under which it may be fired, which include the availability and values of input tokens and the actor’s state, and what happens when the action is triggered, i.e. how many tokens are consumed and produced at each port, the values of the output tokens, and how the actor state is modified. The execution of such an actor consists of two alternating phases: determining whether the actor’s firing conditions are fulfilled, and executing the actor itself. In this work, a single action per dataflow actor has been considered, which scopes the research to static dataflow rather than dynamic dataflow graphs.
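The two alternating phases described above (check the firing condition, then fire) can be mimicked with a small sketch. It is written in Python rather than CAL, models a single action per actor as in this work, and only checks token availability; guards on token values or state are left out. The class `Actor` and the helper `run` are hypothetical names.

```python
from collections import deque


class Actor:
    """Single-action actor with a fixed token consumption per input port."""

    def __init__(self, consume, action, state=None):
        self.consume = consume  # {port name: tokens required per firing}
        self.action = action    # (inputs, state) -> (outputs, new state)
        self.state = state

    def can_fire(self, fifos):
        """Phase 1: firing rule, enough tokens waiting on every input port."""
        return all(len(fifos[port]) >= n for port, n in self.consume.items())

    def fire(self, fifos):
        """Phase 2: consume inputs, run the action, produce outputs."""
        inputs = {port: [fifos[port].popleft() for _ in range(n)]
                  for port, n in self.consume.items()}
        outputs, self.state = self.action(inputs, self.state)
        for port, tokens in outputs.items():
            fifos[port].extend(tokens)


def run(actor, fifos):
    """Alternate between the two phases until the firing rule fails."""
    firings = 0
    while actor.can_fire(fifos):
        actor.fire(fifos)
        firings += 1
    return firings


def add_pairs(inputs, total):
    """Example action: emit the sum of two tokens, keep a running total."""
    pair_sum = sum(inputs["in"])
    return {"out": [pair_sum]}, total + pair_sum


fifos = {"in": deque([1, 2, 3, 4, 5]), "out": deque()}
adder = Actor({"in": 2}, add_pairs, state=0)
firings = run(adder, fifos)
```

With five input tokens and a consumption rate of two, the actor fires twice and leaves one token waiting, since the firing rule is not satisfied a third time.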

3.3.1 Semantics and execution model

A CAL actor is defined by a set of input ports \((P_{in})\), output ports \((P_{out})\), actions, internal variables and a dataflow network. The dataflow network is composed of a set of dataflow actors \(A, B\) and \(C\), and a set of FIFOs, as depicted in Figure 3.2. An action is activated according to its input patterns, known as the *actor firing rule*. The patterns are determined by the amount of data that must be present in the input sequences to enable the execution of an action.

The CAL execution model consists of four stages. The execution starts by checking the *actor firing rule*, which defines the number of input tokens expected from each port and the number of output tokens produced by an actor. Once the *actor firing rule* is satisfied, the CAL actor execution proceeds sequentially by reading the input tokens, executing the actor and storing the produced output tokens into the output FIFO queues.

Table 3.1: Dataflow semantics and their functional requirements to implement on a hardware architecture [7], [54].

<table>
<thead>
<tr>
<th>CAL semantics</th>
<th>Functional req.</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Components</strong></td>
<td>Input/output FIFOs, control instructions</td>
<td>A component is a node in a dataflow graph that describes a module and its firing rules.</td>
</tr>
<tr>
<td><strong>Operations</strong></td>
<td>Arithmetic and logical instructions</td>
<td>An operation carries out arithmetic or boolean operations and generates a single value.</td>
</tr>
<tr>
<td><strong>Memory elements</strong></td>
<td>Read/write access to local memory</td>
<td>A memory element is a symbolic representation of a memory space in a design. Its two primary attributes are the allocation of and access to memory elements.</td>
</tr>
</tbody>
</table>

The following are the key advantages of adopting a dataflow model:

- Intuitive and easily understood by programmers especially in DSP.
- Ability to express concurrency without complex synchronisation.
- Explicitly exposes the natural parallelism (pipeline, task, data).
- Modular programming allows reusability, reconfigurability and hierarchical composition of processing blocks.

The CAL semantics are listed in Table 3.1, categorised into *components*, *operations* and *memory elements*, along with the identified functional requirements to map each onto the FPGA-based hardware architecture. It can be observed that the underlying hardware architecture shall have input/output FIFOs, an ALU and memory to support the execution of CAL programs. These functional require-
3.4 Producer-consumer computing

The producer-consumer model is a data-driven data exchange mechanism that is widely used by dataflow-based hardware architectures to pipeline multiple actors and computing stages [55], [56], [57]. Generally, a FIFO data structure is used to ensure deterministic and deadlock-free access to data tokens [90]. The FIFO holds data tokens in the order in which they are received and provides access to them using a first-in-first-out access policy. It also isolates the execution boundaries, which enables concurrent execution of producer and consumer actors.
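The role of the FIFO between a producer and a consumer can be illustrated with a cycle-stepped sketch in which the two sides run at different rates. The function `simulate` and its parameters are hypothetical; the producer is assumed to offer one token per cycle and to stall when the FIFO is full, while the consumer takes one token every `consumer_period` cycles.

```python
from collections import deque


def simulate(tokens, depth, consumer_period):
    """Producer and consumer joined by a bounded FIFO of `depth` entries.

    Returns the tokens in the order the consumer received them, the highest
    FIFO occupancy seen, and the number of simulated cycles.
    """
    fifo = deque()
    pending = deque(tokens)
    received = []
    max_occupancy = 0
    cycle = 0
    while pending or fifo:
        # Consumer side: reads only on its own cycles and only if data waits
        if cycle % consumer_period == 0 and fifo:
            received.append(fifo.popleft())
        # Producer side: stalls (back-pressure) while the FIFO is full
        if pending and len(fifo) < depth:
            fifo.append(pending.popleft())
        max_occupancy = max(max_occupancy, len(fifo))
        cycle += 1
    return received, max_occupancy, cycle


received, occupancy, cycles = simulate(list(range(10)), depth=2,
                                       consumer_period=3)
```

The token order is preserved regardless of the rate mismatch, and the occupancy never exceeds the chosen depth; the price is that the faster side spends cycles stalled.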

There are different possible data passing patterns among dataflow nodes, depending on the number of producer and consumer nodes that are directly connected. Figure 3.3 illustrates multiple-actor (many-to-one, one-to-many and many-to-many) data passing patterns [91], [92], [93]. These can be used to derive further patterns, such as merge-pipeline-split, to implement tree reduction and expansion dataflow graphs. A split and a merge can be expressed as Single-Producer-Multiple-Consumers (SPMC) and Multiple-Producers-Single-Consumer (MPSC) patterns respectively in the producer-consumer model, and can be used to implement data-parallel computation. Similarly, a feed-forward connection can be represented as a Single-Producer-Single-Consumer (SPSC) pattern, and can be used to achieve pipelining or task-parallel computation. Since these patterns are reusable, different nested data passing patterns can be derived, such as merge-pipeline-split or split-pipeline-merge, as shown in Figure 3.3.

Figure 3.3: Producer-consumer driven data exchange patterns [91].

To support fine and coarse-grained mapping and execution of dataflow applications on the proposed FPGA-based hardware acceleration platform, these data exchange patterns between dataflow actors must be supported by the multicore IPPro. This would enable the application exploration, profiling and optimisation opportunities that are discussed in Chapters 5 and 6.

3.5 Summary

This chapter outlined the scope and programming workflow of the Rathlin project to differentiate and characterise the novelty of the presented research work beyond the FPGA-based hardware acceleration platform architecture itself. Since the platform architecture is one part of the project, some of the architectural choices have been driven by the programming workflow. It shows the need for:

- Hardware and software abstraction to design and develop an adaptable FPGA-based hardware architecture that utilises FPGA resources efficiently using sound software engineering principles. This will be discussed in Chapter 4 by developing an FPGA-based soft-core processor architecture.

- Support for the familiar software-driven *edit-compile-run* flow, which would facilitate profiling, optimisation and fast prototyping of different image processing applications by avoiding synthesis and place-and-route times.

- Support for the dataflow model of computation as one of the data processing models of the IPPro and multicore IPPro architectures. Both architectures shall allow flexible mapping and granular execution of CAL algorithmic descriptions. Architectural details of these dataflow semantics and data passing patterns will be discussed in Chapters 4 and 5.

- Supporting of parallel computing skeletons (*split-compute-merge*, *farm*, *pipeline*) to efficiently map high-level domain specific parallel descriptions on the platform. Architectural details will be discussed in Chapter 6.
Chapter 4

Image Processing Processor (IPPro)

4.1 Introduction

The integration of hardware and software has made system-on-chip (SoC) architectures, such as the Xilinx Zynq-7000 SoC, suitable as computing platforms to meet the computing demands of a wide range of applications. In these architectures, the FPGA programmable logic is tightly coupled with a general-purpose processor, which can be used efficiently by off-loading compute-intensive tasks to the logic. Nevertheless, the changing technology landscape and the fast evolution of new application use cases make it imperative that the underlying hardware architecture is adaptable. Silicon vendors have alleviated this issue by introducing high-level programming tools such as Xilinx Vivado high-level synthesis (HLS). While this raises the level of abstraction, part of the FPGA design tool-flow still requires lengthy FPGA synthesis and place-and-route times \[10, 37, 44, 45\], which are alien to software and algorithm developers.

This research proposes a programmable approach that replaces specialised HDL-based hardware accelerator design with software-like recompilation of FPGA resources. This is achieved by populating the underlying FPGA architecture with multiple light-weight, high-performance soft-core processors. The user applications are compiled and mapped onto the soft-core processors as binary code rather than as an FPGA bit-stream. This avoids synthesis and place-and-route and provides software developers with the familiar edit-compile-run flow, which reduces design time and effort. Compared to the HDL approach, it is more straightforward, easier to debug and profile, and enables better application optimisation possibilities. The first step in realising this approach is to develop an efficient, light-weight, programmable processing node in the form of a soft-core processor tailored to accelerate image pre-processing functions. The novelties and contributions of this chapter are as follows:

- Exploration of dataflow and FPGA-based soft-core datapath models to identify the best balance among dataflow graph mapping possibilities, processor datapath functionalities and performance. The outcome laid down the architectural design choices for a high-performance and area efficient FPGA-based soft-core processor architecture.

- A novel FPGA-based soft-core Image Processing Processor (IPPro) architecture tailored to accelerate image pre-processing applications. The architecture provides a balance between efficient utilisation of FPGA resources and performance while enabling deployment of the edit-compile-run design flow. These features make IPPro suitable for use as the basic computational unit of many-core and multicore hardware acceleration architectures.

- Optimisation of the IPPro datapath to support additional instructions. These include a coprocessor extension and dedicated minimum/maximum instructions to improve hardware acceleration results. The optimised datapath supports parallel execution of variable-latency custom coprocessors.

- Acceleration of chosen point and area-based image pre-processing functions on an Avnet Zedboard using a single-core IPPro. The results of the proposed adaptable hardware acceleration approach are compared against two programmable approaches, including a well-established soft-core processor, and against a fixed high-level synthesis approach.

In this chapter, Section 4.2 presents the most suitable class of image pre-processing algorithms, considering minimal data dependency and efficient utilisation of dedicated FPGA hardware resources. Section 4.3 presents a detailed evaluation of different dataflow and soft-core processor models to find the best balance between dataflow mapping possibilities and achievable performance. Section 4.4 introduces a novel FPGA-based soft-core Image Processing Processor (IPPro) architecture tailored to accelerate point and area image processing operations. Section 4.5 covers the IPPro datapath optimisations, which support dedicated instructions and a coprocessor extension to off-load instructions that are computationally expensive to implement using the native IPPro instruction set. At the end of this chapter, chosen point and area image processing functions are accelerated using the proposed approach and compared against a hand-coded HLS implementation. In addition, comparisons are made against two programmable FPGA-based approaches, including the well-established MicroBlaze soft-core processor.
4.2 Algorithmic characteristics of image processing algorithms

In an image processing pipeline, each stage may, depending on its intended use, have predominant tasks and corresponding pre-processing operations [94], [95], [96]. These operations sit at the beginning of the processing pipeline and are therefore computationally and data intensive due to heavy pixel processing, which makes them suitable candidates for hardware acceleration. Table 4.1 shows the categorisation of these image pre-processing operations, where each class has a distinctive data dependency, memory access and execution pattern. These algorithmic characteristics provide the functional hardware requirements that shall be supported by the *Image Processing Processor* (IPPro) architecture to accelerate this class of image pre-processing applications.

To achieve improved acceleration and efficient utilisation of FPGA compute and memory resources, it is crucial to select the most suitable class of image processing operations. FPGAs deliver the best performance for streaming applications with spatial locality and minimal data dependency, which are common in *point* and *area* image processing operations. They require basic arithmetic, logic

Table 4.1: Categorisation of image processing operations based on their memory and execution patterns [65].

<table>
<thead>
<tr>
<th>Operation type</th>
<th>Output depends on</th>
<th>Memory Pattern</th>
<th>Execution Pattern</th>
<th>Examples</th>
</tr>
</thead>
<tbody>
<tr>
<td>Point</td>
<td>Single input pixel</td>
<td>Pipelined</td>
<td>One-one</td>
<td>Intensity change by factor, Negative image inversion</td>
</tr>
<tr>
<td>Area/Local</td>
<td>Neighbouring pixels</td>
<td>Coalesced</td>
<td>Tree</td>
<td>Convolution functions: Sobel, Sharpen, Emboss, Morphology</td>
</tr>
<tr>
<td>Geometric</td>
<td>Whole frame</td>
<td>Recursive non-coalesced</td>
<td>Large reduction tree</td>
<td>Rotate, Scale, Translate, Reflect, Perspective and Affine</td>
</tr>
</tbody>
</table>

and condition operations, which can be implemented efficiently using FPGA logic; this makes point and area image pre-processing operations suitable for hardware acceleration. In addition, the development of a high-performance and area-efficient soft-core processor requires analysis of the functional configurations of the dedicated DSP and memory blocks, as they impact the maximum achievable operating frequency $f_{Max}$ and the balance between FPGA compute and memory resources. Moreover, identifying soft-core processor design choices requires detailed datapath analysis, because optimising the design for one design goal very often reduces the possibility of achieving some of the others.
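The difference between the point and area classes of Table 4.1 can be made concrete with a short software sketch. The function names and the test image are illustrative assumptions; the area operation is written as a plain 3x3 multiply-accumulate over the window (the kernel is not flipped) and skips the border pixels.

```python
def negative(image, max_value=255):
    """Point operation: each output pixel depends on one input pixel."""
    return [[max_value - p for p in row] for row in image]

def convolve3x3(image, kernel):
    """Area operation: each output pixel depends on a 3x3 neighbourhood.

    Border pixels are skipped, so the output is (H-2) x (W-2).
    """
    h, w = len(image), len(image[0])
    out = []
    for r in range(1, h - 1):
        row = []
        for c in range(1, w - 1):
            acc = 0
            for i in range(3):
                for j in range(3):
                    # multiply-accumulate over the window
                    acc += kernel[i][j] * image[r + i - 1][c + j - 1]
            row.append(acc)
        out.append(row)
    return out

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
image = [[0, 0, 255, 255] for _ in range(4)]   # vertical edge, 4x4 pixels
inverted = negative(image)
edges = convolve3x3(image, SOBEL_X)
```

The point operation needs only basic arithmetic per pixel, whereas the area operation performs nine multiply-accumulate steps per output pixel over neighbouring data.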

4.3 Exploration of efficient FPGA soft-core processor

In the FPGA fabric, dedicated DSP and memory blocks are hardware-optimised computation and storage blocks. Image processing applications make extensive use of the multiply-accumulate operation for image segmentation and filtering tasks, which maps efficiently onto the DSP block. The dedicated memory blocks are placed next to the DSP blocks to minimise timing penalties. Although the FPGA has these optimised hardware blocks, the maximum operating frequency ($f_{Max}$) of a design depends on the length of its critical path. In the case of soft-core processors, $f_{Max}$ represents the raw computation rate of the processor. This is one of the reasons why current many-core and multicore architectures favour simple, light-weight processing datapaths over complex and large out-of-order processors. However, maintaining a balance among soft processor functionality, scalability, performance and efficient utilisation of FPGA resources remains an open challenge.

4.3.1 Balance between compute and memory resources

The goal is to build a soft-core processor that implements arithmetic and logic functions by exploiting DSP and memory blocks, where the raw compute performance of the soft-core processor is expressed by \( f_{\text{Max}} \). Therefore, this section evaluates different configurations of the Xilinx DSP block (DSP48E1) and Block RAM (BRAM), and their impact on \( f_{\text{Max}} \), using different FPGAs. The DSP48E1 has six configurations that offer different functionalities (multiplier, accumulator, pre-adder and pattern detector) based on different internal pipeline configurations [52]. Each configuration therefore directly impacts the \( f_{\text{Max}} \) of the DSP48E1, which is central to realising a high-performance processor architecture. Each DSP48E1 configuration was implemented using the Xilinx Vivado Design Suite v2015.2, and the obtained \( f_{\text{Max}} \) trend is reported in Figure 4.1.

It can be observed that a drastic variation of \( \approx 15 - 52\% \) was recorded within the same speed grade, and a reduction of \( \approx 12 - 20\% \) when the same design was ported from the -3 to the -1 speed grade. This analysis shows that the configuration of the DSP48E1 block significantly impacts \( f_{\text{Max}} \), so identifying the optimum configuration is essential. Therefore, a fully pipelined DSP48E1 block with a pattern detector (PATDET) is selected, as it gives fully pipelined multiply, accumulate, add, subtract and pattern detector functionality with a minimal \( f_{\text{Max}} \) penalty of \( \approx 12\% \) compared to a fully pipelined DSP48E1 block without a pattern detector (NOPAT). The built-in pattern detector allows the implementation of conditional statements and the execution of data-dependent instructions, which are commonly found in image processing functions. The results presented in Figure 4.1 show that a soft-core processor could run at 627, 549 and 463 MHz for speed grades -3, -2 and -1 respectively if the DSP48E1 is used as an ALU.

To analyse the impact of dedicated memory resources on $f_{Max}$, the BRAM is configured as single-port and true dual-port RAM [97]. Figure 4.2 shows the $f_{Max}$ trend across Artix-7, Kintex-7 and Virtex-7 to analyse the impact across different FPGA fabrics. The true dual-port RAM configuration results in an $f_{Max}$ reduction of $\approx 25\%$. On the other hand, an improvement of $\approx 16\%$ is possible by migrating the design from Artix-7 to Kintex-7 FPGA technology. FPGAs are limited in memory, and for an efficient design it is vital to find the balance between memory and performance. Table 4.2 shows the distribution of compute (DSP48E1)

Figure 4.1: Impact of DSP48E1 configurations on the maximum achievable clock frequency ($f_{Max}$) using different speed grades of Kintex-7 FPGAs. The DSP48E1 configurations used are: fully pipelined datapath with no pattern detector (NOPAT); fully pipelined datapath with pattern detector (PATDET); multiply with no MREG output register (MULT_NOMREG); multiply with no MREG and with pattern detector (MULT_NOMREG_PATDET); and multiply with pre-adder and no ADREG (PREADD_MULT_NOADREG).

Table 4.2: Memory and compute resources in 28nm Xilinx FPGA technology [98].

<table>
<thead>
<tr>
<th>Product</th>
<th>Family</th>
<th>Part Number</th>
<th>BRAM (18 Kb each)</th>
<th>DSP48E1</th>
<th>GMAC/s</th>
<th>BRAM/DSP</th>
</tr>
</thead>
<tbody>
<tr>
<td>Standalone</td>
<td>Artix-7</td>
<td>XC7A200T</td>
<td>730</td>
<td>740</td>
<td>929</td>
<td>0.99</td>
</tr>
<tr>
<td>Standalone</td>
<td>Kintex-7</td>
<td>XC7K480T</td>
<td>1,910</td>
<td>1,920</td>
<td>2,845</td>
<td>0.99</td>
</tr>
<tr>
<td>Standalone</td>
<td>Virtex-7</td>
<td>XC7VX980T</td>
<td>3,000</td>
<td>3,600</td>
<td>5,335</td>
<td>0.83</td>
</tr>
<tr>
<td>Zynq SoC</td>
<td>Artix-7</td>
<td>XC7Z020</td>
<td>280</td>
<td>220</td>
<td>276</td>
<td>1.27</td>
</tr>
<tr>
<td>Zynq SoC</td>
<td>Kintex-7</td>
<td>XC7Z045</td>
<td>1,090</td>
<td>900</td>
<td>1,334</td>
<td>1.21</td>
</tr>
</tbody>
</table>

and memory (BRAM) resources, and presents the raw performance in GMAC/s (giga multiply-accumulates per second) across the largest FPGA devices, covering both standalone and Zynq MPSoC chips [98]. A new metric, the BRAM/DSP ratio, is introduced to quantify the balance between compute and memory resources and is reported in Table 4.2. In Zynq MPSoC devices, the BRAM/DSP ratio is higher than in standalone devices, because more memory is required to implement the substantial data buffers that exchange data between the FPGA fabric and the host processor, while it is close to unity for standalone devices. This comparison shows that the BRAM/DSP ratio can be used to quantify the area efficiency of FPGA designs.

Figure 4.2: Impact of BRAM configurations on the maximum achievable clock frequency ($f_{Max}$) of Artix-7, Kintex-7 and Virtex-7 FPGAs for single and true-dual port RAM configurations.
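As a quick check of the BRAM/DSP metric, the ratio column of Table 4.2 can be recomputed from the block counts. The sketch below reads the BRAM counts as 730, 1,910 and 3,000 for the standalone parts; the dictionary layout and function name are illustrative only.

```python
# (number of 18 Kb BRAMs, number of DSP48E1 blocks), as read from Table 4.2
devices = {
    "XC7A200T":  (730, 740),
    "XC7K480T":  (1910, 1920),
    "XC7VX980T": (3000, 3600),
    "XC7Z020":   (280, 220),
    "XC7Z045":   (1090, 900),
}

def bram_dsp_ratio(bram, dsp):
    """BRAM/DSP ratio rounded to two decimals, as reported in the table."""
    return round(bram / dsp, 2)

ratios = {part: bram_dsp_ratio(*counts) for part, counts in devices.items()}
```

The recomputed values match the tabulated ratios: close to unity for the Artix-7 and Kintex-7 standalone parts, and above 1.2 for both Zynq parts.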

### 4.3.2 FPGA-based soft-core processor functionality vs performance trade-off

A system composed of light-weight and high-performance soft-core processor architecture that supports modular computation with fine and coarse-grained functional granularity is more feasible than fixed dedicated hardware accelerators. A light-weight soft processor shall allow populating more programmable hardware accelerators onto a single MPSoC chip which would lead to better acceleration possibilities by exploiting data and task level parallelism.

#### Evaluation of processor functionality and dataflow models

This section presents the design exploration approach used to analyse and evaluate the functional granularity of FPGA-based soft-core datapaths, correlating each datapath model with its corresponding dataflow model. Table 4.3 lists three dataflow models driven by previous work [35], [99], which functionally correspond to soft-core datapath models ①, ② and ③. These models are used to find a trade-off between the functionality of the soft-core processor and $f_{Max}$. They also lay the foundation for finding a suitable soft-core datapath to map and execute the dataflow specification. Gupta et al. have reported different dataflow graph models [99], as illustrated in Figure 4.3. The input/output interfaces are marked in red, while the grey box represents the functionality mapped onto the datapath models shown in Figure 4.4.

Figure 4.3: Dataflow models [35], [99]: (a) DFG node without internal storage ①; (b) DFG actor without internal storage \( t_1 \) and constant \( i \) ②; (c) programmable DFG actor with internal storage \( t_1, t_2 \) and \( t_3 \) and constants \( i \) and \( j \) ③.

Figure 4.4: FPGA datapath models: (a) programmable ALU ①; (b) fine-grained processor ②; (c) coarse-grained processor ③.

The first model ① exhibits the datapath of a programmable ALU, as shown in Figure 4.4(a). It has an instruction register (IR) that defines a DFG node (OP1) and is programmed at system initialisation. At each clock cycle, the datapath explicitly reads a token from the input FIFO, processes the token according to the programmed operation and stores the result in the output FIFO, from which it is consumed by the following dataflow node (OP3). This model only allows mapping of data-independent, fine-grained dataflow nodes, as shown in Figure 4.3(a), which limits its applicability due to the lack of control and data-dependent execution commonly found in image processing applications, where the output pixel depends on the input or neighbouring pixels. Table 4.4 lists the specific dataflow features supported by ①. This model is only suitable for mapping a single dataflow node.

Table 4.3: Correlation of FPGA-based soft-core datapath and dataflow models with increasing functionality and memory.

<table>
<thead>
<tr>
<th>Model#</th>
<th>Datapath model</th>
<th>Dataflow model</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Programmable ALU</td>
<td>Programmable node without memory</td>
</tr>
<tr>
<td>2</td>
<td>Fine-grained processor</td>
<td>Programmable actor without memory</td>
</tr>
<tr>
<td>3</td>
<td>Coarse-grained processor</td>
<td>Programmable actor with memory</td>
</tr>
</tbody>
</table>

The second model ② increases the datapath functionality to that of a fine-grained processor by including a BRAM-based instruction memory (IM), a program counter (PC) and a kernel memory (KM) to store constants, as shown in Figure 4.4(b). In contrast to ①, ② can support mapping of multiple data-independent dataflow nodes, as shown in Figure 4.3(b). The node (OP2) requires memory storage to hold the variable (t1) used to compute the output token (C); this value is fed back from the output of the ALU because it is needed by the next instruction in the following clock cycle. This model supports improved dataflow mapping functionality over ① by introducing the IM, which comes at the cost of a variable execution time and a throughput that depend on the number of instructions required to implement the dataflow actor. Table 4.4 lists the dataflow features supported by ②. This model is suitable for accelerating combinational logic computations.

Table 4.4: Details of supported dataflow features and processor datapath memory elements in each presented model.

<table>
<thead>
<tr>
<th>Model</th>
<th>Dataflow features</th>
<th>Datapath memory elements</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Control flow</td>
<td>Node mapping</td>
</tr>
<tr>
<td>1</td>
<td>✔</td>
<td>✔</td>
</tr>
<tr>
<td>2</td>
<td>✔</td>
<td>✔</td>
</tr>
<tr>
<td>3</td>
<td>✔</td>
<td>✔</td>
</tr>
</tbody>
</table>

The third model ③ increases the datapath functionality to map and execute data-dependent dataflow actors, as shown in Figure 4.3(c). The datapath has a memory element in the form of a register file (RF), which makes it a coarse-grained processor, as shown in Figure 4.4(c). The RF stores intermediate results to execute data-dependent operations, implements the feed-forward, split, merge and feedback dataflow execution patterns, and facilitates dataflow transformations (actor fusion/fission, pipelining, etc.), constrained by the size of the RF. It can implement modular computations which are not possible in ① and ②. In contrast to ① and ②, the token production/consumption (P/C) rate of ③ can be controlled through the soft-core program code, as listed in Table 4.4, which allows software-controlled scheduling and load balancing.
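The functional gap between models ① and ③ can be sketched in software. The following is only a behavioural illustration under simplifying assumptions: the function names are hypothetical, the register file is stood in for by a bounded `deque`, and the moving sum is an arbitrary example of an output that depends on several input tokens.

```python
from collections import deque

def model1_alu(op, in_fifo):
    """Model 1: one fixed operation; one token in, one token out."""
    return deque(op(tok) for tok in in_fifo)

def model3_actor(in_fifo, window=3):
    """Model 3: stored tokens let each output depend on several inputs."""
    rf = deque(maxlen=window)      # stands in for the register file
    out = deque()
    for tok in in_fifo:
        rf.append(tok)             # oldest token is dropped automatically
        if len(rf) == window:
            out.append(sum(rf))    # example: moving sum over the window
    return out

tokens = deque([1, 2, 3, 4, 5])
out1 = model1_alu(lambda t: t + 10, tokens)
out3 = model3_actor(tokens)
```

In this sketch model ① always produces one token per consumed token, whereas the production rate of model ③ is decided by the program, here three output tokens for five input tokens.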

Functionality vs Performance trade-off analysis

The presented models show that the processor datapath functionality significantly impacts the dataflow decomposition, mapping and optimisation possibilities, but at the same time increases the processor critical path length and affects $f_{Max}$ by incorporating more memory elements and control logic. Table 4.4 lists the datapath memory elements in each presented model, which incrementally allocate more memory resources (IM, KM, RF). Each presented model was coded in Verilog HDL, then synthesised, placed and routed using the Xilinx Vivado Design Suite v2015.2 for the Xilinx chips installed on widely available development kits, namely Artix-7 (Zedboard), Kintex-7 (ZC706) and Virtex-7 (VC707). The obtained $f_{Max}$ results are reported in Figure 4.5.

Figure 4.5: Impact of datapath models ①, ②, ③ on $f_{Max}$ across FPGA fabrics.

In this analysis, $f_{Max}$ is considered as the performance metric for each processor datapath model. The implementation results show that increasing the datapath functionality resulted in a reduction of $f_{Max}$ by a maximum of $\approx 8\%$ and $23\%$ for ② and ③ respectively, compared to ① on the same FPGA technology. For ②, the addition of memory elements, specifically the IM realised using dedicated BRAM, reduced $f_{Max}$ by $\approx 8\%$ compared to ①. Nevertheless, the instruction decoder (ID), which is a combinational part of the datapath, significantly increases the critical path length of the design. A further $15\%$ $f_{Max}$ degradation from ② to ③ resulted from adding the memory elements KM and RF to support control and data-dependent execution, which requires additional control logic and data multiplexers. Comparing across different FPGA fabrics, an $f_{Max}$ reduction of $\approx 14\%$ and $23\%$ is observed for Kintex-7 and Artix-7 respectively. When ③ is ported from Virtex-7 to Kintex-7 and Artix-7, a maximum $f_{Max}$ reduction of $\approx 5\%$ and $33\%$ respectively is observed.

This analysis has laid firm foundations by comparing different processor datapath and dataflow models and showing how they impact the raw computation rate ($f_{Max}$) of the resultant soft processor. The trade-off analysis shows that an area-efficient, high-performance soft-core processor architecture can be realised that supports the requirements for accelerating image pre-processing applications. Among the presented models, ③ provides the best balance among functionality, flexibility, dataflow mapping and optimisation possibilities, and performance. This model is used to develop the novel IPPro architecture in Section 4.4.

4.4 Image Processing Processor (IPPro)

This section presents the novel Image Processing Processor (IPPro) datapath and its mapping onto FPGA resources. Image pre-processing functions require a grey-level image, in which the value of each pixel represents the colour contrast. For specific functions, e.g. image filtering, that involve multiply and multiply-accumulate operations, it is essential to maintain precision. Therefore, IPPro is designed as a 16-bit, signed, pipelined, reduced instruction set computer (RISC) soft-core architecture, as shown in Figure 4.6.

Figure 4.6: Block diagram of FPGA-based soft-core processor IPPro datapath.

The IPPro datapath exploits DSP48E1 and BRAM blocks and supports stream processing using blocking input/output FIFOs that handle a stream of pixels. In contrast to out-of-order processor architectures, IPPro is designed as a five-stage, in-order pipelined processor because: 1) it consumes fewer area resources and can achieve better timing closure, leading to a higher processor operating frequency $f_{Max}$; and 2) in-order pipeline execution is predictable and simplifies scheduling and compiler development. Area-hungry out-of-order processor architectures are better suited to ASIC or custom designs, where chip resources are not technologically bounded. Based on the exploration of processor datapath and dataflow models and the functionality versus performance trade-off analysis presented in Section 4.3, the following memory areas are supported in the IPPro datapath:

- Instruction memory (IM) (512x32) to store the dataflow actor functional description in the form of IPPro program code.

- Register file (RF) (32x16) to map fine and coarse-grained dataflow actors by storing intermediate results and provide random access to a stream of tokens or window of pixels stored inside the RF, *e.g.* 3x3, 3x4, 4x4 *etc.*

- Kernel memory (KM) (32x16) to save the parameters that are reusable such as filter coefficients and constant values.

- The blocking input/output FIFOs to buffer data tokens between a producer and a consumer to realise pipelined processing stages.
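The memory dimensions listed above translate into the following storage capacities. This is a simple arithmetic sketch; the helper names are illustrative, and the window check deliberately ignores the registers a real program would also need for intermediate results.

```python
MEMORIES = {          # name: (depth, width in bits), from the list above
    "IM": (512, 32),
    "RF": (32, 16),
    "KM": (32, 16),
}

def capacity_bits(depth, width):
    """Total storage of a depth x width memory in bits."""
    return depth * width

bits = {name: capacity_bits(*shape) for name, shape in MEMORIES.items()}

def window_fits_rf(rows, cols, rf_depth=32):
    """True if a rows x cols pixel window fits in the register file."""
    return rows * cols <= rf_depth

windows = [(3, 3), (3, 4), (4, 4), (7, 7)]
fits = {w: window_fits_rf(*w) for w in windows}
```

The instruction memory therefore holds 16,384 bits, while the register file and kernel memory hold 512 bits each; a 4x4 window occupies half of the register file, whereas a 7x7 window would not fit.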

### 4.4.1 Datapath

A RISC architecture performs computation on register values, in contrast to a stack-based complex instruction set (CISC) architecture. RISC-based architectures have faster access to registers, which provides random access to variables rather than access to stacked operands [100]. Therefore, a *register file* (RF) of size 32x16 bits is implemented using the Xilinx RAM32M primitive, which uses look-up table (LUT) resources. It provides a quad-port RAM with one synchronous write port and three asynchronous read ports, compared to the dual-port RAM supported by the BRAM primitive. It thus supports three-operand operations such as multiply-add, commonly used for pixel processing. Figure 4.6 shows the detailed IPPro datapath. It has a BRAM-based instruction memory (IM), configured as true dual-port RAM, which stores the program code. IPPro has a dedicated KM that can store 32 16-bit constant values to accelerate area operations by maximising memory reuse and avoiding the reloading of filter coefficients. The input FIFO stores the incoming stream of data; the GET instruction reads a token from it and stores it in the RF. The PUSH instruction reads the processed data from a specified RF location and stores it in the output FIFO.

The IPPro datapath has no stack memory and therefore does not support recursive function calls, as these require context switching (which involves passing parameters between functions and storing the function state and variables). However, as long as the memory requirement of the calling function (the number of critical function variables) fits within the register file, limited recursion can be implemented using the branch instructions (JUMP and BZ). From an image processing perspective, the IPPro datapath has been designed to implement point and area image processing operations, which only require neighbouring pixels that can be stored in the register file.

4.4.2 Branch and conditional execution

IPPro supports branch instructions to handle the control flow graphs previously discussed in Table 4.4 and to implement commonly used constructs such as if-else and case statements. The DSP48E1 block has a pattern detector that compares the input operands or the generated output results, depending on the configuration, and sets/resets the PATTERNDETECT (PD) bit. The IPPro datapath uses the PD bit along with some additional control logic to generate four flags: zero (ZF), equal (EQF), greater than (GTF) and sign (SF). When IPPro encounters a branch instruction, the branch controller (BC) compares the flag status and the branch handler (BH) updates the PC, as shown in Figure 4.6.

Table 4.5: IPPro instruction frame structure.

<table>
<thead>
<tr>
<th>BITS</th>
</tr>
</thead>
<tbody>
<tr>
<td>31:30 | 29:25 | 24:20 | 19:15 | 14:10 | 9:5 | 4:0</td>
</tr>
<tr>
<td>INSTR_TYPE</td>
</tr>
</tbody>
</table>

4.4.3 Instruction set architecture

IPPro has a 32-bit instruction set architecture (ISA). Table 4.5 shows the simplified IPPro instruction frame structure, where $R_A, R_B, R_C, R_D$ and $K_n$ represent 5-bit address fields that point to a location in the RF or KM. $R_A, R_B, R_C$ and $K_n$ are source registers, while $R_D$ is the destination register. The 5-bit $OPCODE$ field represents a unique IPPro instruction. The 2-bit $INSTR\_TYPE$ field differentiates between the supported addressing modes listed in Table 4.6 (for details of the supported instruction set, see Appendix B, Table B.2).
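The field widths described above add up to exactly 32 bits (2 + 5 + 5 x 5). The following sketch packs and unpacks such a frame. The field widths come from the text, but the order of the fields within the word, the lower-case field names and the example opcode value are assumptions made for illustration; the actual layout is defined by Table 4.5 and Appendix B.

```python
FIELD_WIDTHS = [       # (name, width) from MSB to LSB; field ORDER is assumed
    ("instr_type", 2),
    ("opcode", 5),
    ("rd", 5), ("ra", 5), ("rb", 5), ("rc", 5), ("kn", 5),
]

def encode(**fields):
    """Pack the named fields into one 32-bit instruction word."""
    word = 0
    for name, width in FIELD_WIDTHS:
        value = fields.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word

def decode(word):
    """Unpack a 32-bit instruction word into its named fields."""
    fields = {}
    for name, width in reversed(FIELD_WIDTHS):
        fields[name] = word & ((1 << width) - 1)
        word >>= width
    return fields

# Hypothetical three-register instruction; opcode value 3 is made up.
word = encode(instr_type=1, opcode=3, rd=3, ra=1, rb=2)
```

Calling `decode(word)` recovers the original fields, with the unused `rc` and `kn` fields reading back as zero.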

4.4.4 Pipelined stream processing

The IPPro datapath is a five-stage pipelined soft-core processor composed of fetch, decode, execute#1 (EXE1), execute#2 (EXE2) and write-back (WB) stages, as shown in Figure 4.6. Execution starts by fetching an instruction from the instruction memory; the instruction decoder (ID) then decodes the fetched instruction and generates the control signals required to control the datapath. During this stage, based on the addressing mode (Table 4.6), IPPro reads the data operands from the input FIFO, RF or KM, stores them in the pipeline registers and forwards them to the DSP48E1 block in the EXE1 stage. The DSP48E1 is dynamically reconfigured on a cycle-by-cycle basis by the ID; the configuration of the DSP48E1 control signals used to implement the IPPro instructions is detailed in Appendix B, Table B.1. The DSP48E1 processes the data operands and the results are stored back in the register file in the WB stage. Both EXE1 and EXE2 are internal pipeline stages of the DSP48E1. The GET and PUSH modules shown in Figure 4.6 check that the input FIFO is not empty and the output FIFO is not full. If either condition occurs, IPPro stops processing and waits until the condition clears.

Table 4.6: IPPro supported addressing modes and instructions.

<table>
<thead>
<tr>
<th>Addressing Mode</th>
<th>Data abstraction</th>
<th>Instructions</th>
</tr>
</thead>
<tbody>
<tr>
<td>FIFO handling</td>
<td>Stream access</td>
<td>get, push</td>
</tr>
<tr>
<td>RF - RF</td>
<td>Randomly accessed data</td>
<td>str, add, sub, mul, mulacc, muladd, etc.</td>
</tr>
<tr>
<td>KM - FIFO</td>
<td>Stream and fixed data</td>
<td>addkm, subkm, mulkm, muladdkm, etc.</td>
</tr>
</tbody>
</table>
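The blocking behaviour of GET and PUSH can be illustrated with a small behavioural model. It is a sketch only: it ignores the pipeline stages, supports just three instructions, and reports a stall by returning early instead of waiting, and the tuple-based program format is an assumption rather than the IPPro binary format.

```python
from collections import deque

def run(program, in_fifo, out_fifo, out_capacity):
    """Execute GET/ADD/PUSH in order over a 32-entry register file.

    A GET on an empty input FIFO, or a PUSH to a full output FIFO,
    stalls the processor; the stall is reported with the blocked PC.
    """
    rf = [0] * 32
    for pc, (op, *args) in enumerate(program):
        if op == "GET":
            if not in_fifo:
                return ("stalled", pc)      # would wait for a token
            rf[args[0]] = in_fifo.popleft()
        elif op == "PUSH":
            if len(out_fifo) >= out_capacity:
                return ("stalled", pc)      # would wait for free space
            out_fifo.append(rf[args[0]])
        elif op == "ADD":
            dest, a, b = args
            rf[dest] = rf[a] + rf[b]
    return ("done", len(program))

program = [("GET", 1), ("GET", 2), ("ADD", 3, 1, 2), ("PUSH", 3)]
out = deque()
status = run(program, deque([5, 7]), out, out_capacity=4)
starved = run(program, deque([5]), deque(), out_capacity=4)
```

With two input tokens the program completes and pushes their sum; with a single token the second GET blocks, which corresponds to the processor waiting on its producer.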

4.4.5 Dataforwarding

Data hazards are common in pipelined processors. IPPro supports internal data forwarding by exploiting the multiply-accumulate (MACC) feature of the DSP48E1 to avoid pipeline stalls and NOP fillers. During instruction decoding, the datapath checks whether a source address of the next instruction is equal to the destination address of the decoded instruction. If so, the dataforwarding path is enabled by configuring a DSP48E1 control register, and the result of the DSP48E1 is forwarded to the next instruction, as shown in Figure 4.7.

Figure 4.7: Implementation of dataforwarding exploiting MACC functionality of DSP48E1.
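The address comparison performed during decoding can be expressed as a short check over adjacent instructions. The tuple representation and function name are illustrative assumptions; the register names follow the dataforwarding column of Table 4.7.

```python
def forwarding_needed(program):
    """For each adjacent pair, True when the next instruction reads the
    register written by the current one (a read-after-write dependency)."""
    flags = []
    for (_, dest, _), (_, _, next_srcs) in zip(program, program[1:]):
        flags.append(dest in next_srcs)
    return flags

# (mnemonic, destination, sources): the data-dependent instructions
program = [
    ("ADD", "R3", ("R1", "R2")),
    ("SUB", "R4", ("R3", "R9")),
    ("MULACC", "R5", ("R4", "R2")),
]
flags = forwarding_needed(program)
```

Both adjacent pairs are dependent in this example, so the forwarding path would be enabled twice in succession.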

To demonstrate the impact of dataforwarding on execution time (clock cycles) and the program code size, consider the following equation, and the corresponding IPPro code listed in Table 4.7.

\[ A = \text{func}(z - (x + y) + (y \ast z)) \] (4.1)

This function requires three data-dependent computations, as listed in Table 4.7. Without dataforwarding, NOP fillers are required to avoid data hazards, because there are no data-independent instructions available to fill the pipeline. With dataforwarding, the computed data can be forwarded directly to the next instruction, as highlighted in blue and red in Table 4.7. The IPPro processor takes 18 clock cycles to process the function without dataforwarding and 10 clock cycles with it. Mathematically, the saving can be expressed by Equation 4.2.

Table 4.7: IPPro code for Equation 4.1 without and with dataforwarding.

<table>
<thead>
<tr>
<th>Instr.</th>
<th>No Data Forwarding</th>
<th>Description</th>
<th>Data Forwarding</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>GET R1</td>
<td>R1=x</td>
<td>GET R1</td>
<td>R1=x</td>
</tr>
<tr>
<td>2</td>
<td>GET R2</td>
<td>R2=y</td>
<td>GET R2</td>
<td>R2=y</td>
</tr>
<tr>
<td>3</td>
<td>GET R9</td>
<td>R9=z</td>
<td>GET R9</td>
<td>R9=z</td>
</tr>
<tr>
<td>4</td>
<td>NOP</td>
<td></td>
<td>NOP</td>
<td></td>
</tr>
<tr>
<td>5</td>
<td>NOP</td>
<td></td>
<td>NOP</td>
<td></td>
</tr>
<tr>
<td>6</td>
<td>NOP</td>
<td></td>
<td>NOP</td>
<td></td>
</tr>
<tr>
<td>7</td>
<td>ADD R3,R1,R2</td>
<td>R1+R2</td>
<td>ADD R3,R1,R2</td>
<td>R1+R2</td>
</tr>
<tr>
<td>8</td>
<td>NOP</td>
<td></td>
<td>SUB R4,R3,R9</td>
<td>(R1+R2) - R9</td>
</tr>
<tr>
<td>9</td>
<td>NOP</td>
<td></td>
<td>MULACC R5,R4,R2</td>
<td>(R9*R2)+(R1+R2)-R9</td>
</tr>
<tr>
<td>10</td>
<td>NOP</td>
<td></td>
<td>PUSH R5</td>
<td></td>
</tr>
<tr>
<td>11</td>
<td>NOP</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>12</td>
<td>SUB R4,R9,R3</td>
<td>(R1+R2)-R9</td>
<td></td>
<td></td>
</tr>
<tr>
<td>13</td>
<td>NOP</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>14</td>
<td>NOP</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>15</td>
<td>NOP</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>16</td>
<td>NOP</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>17</td>
<td>MULADD R3,R4,R2,R9</td>
<td>(R9*R2)+(R1+R2)-R9</td>
<td></td>
<td></td>
</tr>
<tr>
<td>18</td>
<td>PUSH R10</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Table 4.8: IPPro implementation results on selected Xilinx development boards.

<table>
<thead>
<tr>
<th>Resources</th>
<th>VC707</th>
<th>ZC706</th>
<th>Zedboard</th>
</tr>
</thead>
<tbody>
<tr>
<td>FFs</td>
<td></td>
<td>447</td>
<td></td>
</tr>
<tr>
<td>LUTs</td>
<td></td>
<td>484</td>
<td></td>
</tr>
<tr>
<td>BRAMs</td>
<td></td>
<td>1</td>
<td></td>
</tr>
<tr>
<td>DSP48E1</td>
<td></td>
<td>1</td>
<td></td>
</tr>
<tr>
<td>Freq. (MHz)</td>
<td>372</td>
<td>337</td>
<td>187</td>
</tr>
</tbody>
</table>

\[ t = (n - 1) \times 4 \] (4.2)

where \( n \) is the number of consecutive data dependent instructions and \( t \) is the number of clock cycles saved per iteration. The impact of data forwarding becomes significant when processing images consisting of hundreds of thousands of pixels, where saving tens of clock cycles per pixel results in a significant saving of processing time. It also reduces the code size, by \( \approx 45\% \) for the presented case.
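As a minimal sketch, Equation 4.2 can be checked against the cycle counts quoted for Table 4.7. The names `NOPS_PER_DEPENDENCY` and `cycles_saved` are hypothetical and only label the constant and the expression in the equation; they are not part of the IPPro tool flow.

```python
# Sketch of Equation 4.2. NOPS_PER_DEPENDENCY is a hypothetical name for the
# constant 4 in the equation (NOP fillers between two dependent instructions).
NOPS_PER_DEPENDENCY = 4


def cycles_saved(n: int) -> int:
    """Clock cycles saved per iteration for n consecutive data dependent instructions."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return (n - 1) * NOPS_PER_DEPENDENCY


# Values quoted for the function in Table 4.7: three dependent computations,
# 18 clock cycles without dataforwarding.
cycles_without = 18
saved = cycles_saved(3)
cycles_with = cycles_without - saved
code_size_reduction = saved / cycles_without

print(saved, cycles_with, f"{code_size_reduction:.1%}")
```

For the presented case the sketch gives a saving of 8 clock cycles, which agrees with the difference between the 18 and 10 cycles reported above.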

4.4.6 Implementation results

The IPPro soft-core processor architecture has been written in Verilog, synthesised and implemented using Xilinx Vivado Design Suite v2015.2. Table 4.8 reports the implementation results obtained using the tool's default settings. The implementation results show that IPPro consumes \(< 1\%\) of Kintex-7 (ZC706) FPGA resources and delivers 337 MIPS while maintaining a BRAM/DSP ratio equal to unity. The IPPro design has been ported to various FPGA fabrics to analyse the potential performance, by implementing it on Xilinx development boards widely used by the research community, namely ZedBoard (XC7Z020CLG484-1), ZC706 (XC7Z045FFG900-2) and VC707 (XC7VX485T-2). Table 4.8 shows the maximum possible frequency \( f_{\text{Max}} \) on the selected Xilinx development boards.

IPPro running on Virtex-7 (VC707) and Kintex-7 (ZC706) can deliver $\approx 2.00$ and 1.80 times improved $f_{Max}$ respectively compared to Artix-7 (Zedboard) by porting it to a different FPGA fabric. The obtained results closely correspond to those reported in Table 4.5 in Section 4.3, where the expected values were $\approx 19\%$ and $48\%$ for ZC706 and Zedboard respectively.

4.5 IPPro Optimisations

To evaluate performance and identify limitations of the developed IPPro, two groups of students have accelerated colour and morphology operations [101] and two stages of the histogram of gradient (HOG) algorithm [102]. Russell et al. have reported a 9.6 times performance improvement for morphology operations using native IPPro instructions compared to an ARM processor-based implementation. They identified that supporting dedicated minimum and maximum instructions would improve performance. Kelly et al. have profiled and explicitly translated the first two stages of the HOG algorithm from mathematical expressions to native IPPro instructions. They reported that $77.3\%$ of the total instructions belong to the normalise overlapping spatial blocks function, out of which $72.2\%$ of the IPPro instructions belong to the division calculation. They indicated that the division function is the computational bottleneck and that off-loading division from IPPro to a dedicated coprocessor could significantly improve the acceleration results. To this end, this section presents IPPro optimisations that extend the IPPro datapath capabilities beyond the DSP48E1 supported instructions to enhance performance further.

4.5.1 Minimum and maximum instructions

In image processing applications, morphological operations are applied to the filtered image to clean up small holes in objects and remove small groups of pixels, which saves processing time in later stages. Morphology involves finding either the maximum (dilation) or minimum (erosion) value in a set of pixels contained within a masked region around the input pixel. Russell et al. have reported that an implementation using native IPPro instructions takes $\approx 48$ cycles for a 3x3 kernel or 81 cycles for a 5x5 kernel. To include dedicated MIN and MAX instructions, additional control logic and a 4-1 multiplexer to select the minimum or maximum result are added into the datapath as shown in Figure 4.8.

The MIN and MAX registers externally hold the operand values. The DSP48E1 block compares the operands and updates the sign flag (SF), which is used to select either the MIN or the MAX value and store it into the RF.

Table 4.9 shows the IPPro code used to compare the impact of the optimised MIN/MAX instructions on the execution time. It shows that the native implementation first compares the operands using subtraction, followed by branch evaluation to find the minimum and maximum value, which takes $\approx 13 - 20$ clock cycles per pixel depending

![Figure 4.8: Optimisation of IPPro datapath to support dedicated minimum and maximum instructions.](image)

Table 4.9: Implementation of Min/Max using native and optimised IPPro instructions.

<table>
<thead>
<tr>
<th>Instr.</th>
<th>Native</th>
<th>Description</th>
<th>Dedicated</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>GET R1</td>
<td>R1 = a</td>
<td>GET R1</td>
<td>R1 = a</td>
</tr>
<tr>
<td>2</td>
<td>GET R2</td>
<td>R2 = b</td>
<td>GET R2</td>
<td>R2 = b</td>
</tr>
<tr>
<td>3</td>
<td>NOP</td>
<td></td>
<td>NOP</td>
<td></td>
</tr>
<tr>
<td>4</td>
<td>NOP</td>
<td></td>
<td>NOP</td>
<td></td>
</tr>
<tr>
<td>5</td>
<td>NOP</td>
<td></td>
<td>NOP</td>
<td></td>
</tr>
<tr>
<td>6</td>
<td>NOP</td>
<td></td>
<td>NOP</td>
<td></td>
</tr>
<tr>
<td>7</td>
<td>SUB R3, R2, R1</td>
<td>R3 = a-b ? +ve/-ve</td>
<td>MIN R3, R2, R1</td>
<td>R3 = min(a,b)</td>
</tr>
<tr>
<td>8</td>
<td>BS MAX</td>
<td></td>
<td>MAX R4, R2, R1</td>
<td>R4 = max(a,b)</td>
</tr>
<tr>
<td>9</td>
<td>NOP</td>
<td>send minimum value</td>
<td>PUSH R3</td>
<td></td>
</tr>
<tr>
<td>10</td>
<td>NOP</td>
<td>JUMP FUNCTION</td>
<td>JUMP FUNCTION</td>
<td></td>
</tr>
<tr>
<td>11</td>
<td>NOP</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>12</td>
<td>NOP</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>13</td>
<td>PUSH R2</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>14</td>
<td>JUMP FUNCTION</td>
<td>send maximum value</td>
<td></td>
<td></td>
</tr>
<tr>
<td>15</td>
<td>...</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>16</td>
<td>...</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>17</td>
<td>...</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>18</td>
<td>MAX:</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>19</td>
<td>PUSH R1</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>20</td>
<td>JUMP FUNCTION</td>
<td>send maximum value</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

on whether the branch is taken or not. On the other hand, the optimised implementation takes ten clock cycles per pixel irrespective of the pixel value, resulting in an approx. 50% reduction in execution time, which is significant for pixel processing.
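The sign-flag based selection described above can be sketched in software as follows. This is only a behavioural illustration under stated assumptions: the function names (`sign_flag`, `ippro_min`, `ippro_max`, `erode`, `dilate`) and the sample window are hypothetical, and the model ignores the 16-bit datapath width and pipeline timing.

```python
from functools import reduce


def sign_flag(a: int, b: int) -> bool:
    """Behavioural stand-in for the sign flag (SF): set when a - b is negative."""
    return (a - b) < 0


def ippro_min(a: int, b: int) -> int:
    """Select the smaller operand from the sign flag, without a branch."""
    return a if sign_flag(a, b) else b


def ippro_max(a: int, b: int) -> int:
    """Select the larger operand from the sign flag, without a branch."""
    return b if sign_flag(a, b) else a


def erode(window):
    """Erosion: minimum value of the pixels in the masked region."""
    return reduce(ippro_min, window)


def dilate(window):
    """Dilation: maximum value of the pixels in the masked region."""
    return reduce(ippro_max, window)


# Hypothetical 3x3 window, flattened row by row.
window = [12, 7, 200, 45, 3, 90, 18, 64, 33]
print(erode(window), dilate(window))

# Cycle counts quoted in the text: 13-20 cycles native, 10 cycles dedicated.
reduction_worst_case = 1 - 10 / 20
```

Comparing the ten-cycle dedicated sequence with the 20-cycle native path gives the 50% reduction quoted above; against the 13-cycle path the reduction is smaller.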

#### 4.5.2 Coprocessor extension

In image processing, some algorithms require arithmetic operations which are not supported by the IPPro. For such applications, IPPro has a coprocessor interface that allows transparent integration of a custom coprocessor into the IPPro datapath. In this section, the example of a division coprocessor is discussed, as Kelly *et al.* have reported that it is appropriate to off-load computationally expensive functions to a coprocessor, which adds complexity to the processor architecture. In the case of IPPro, this involves adding a coprocessor interface into the datapath and balancing the pipelined execution while dispatching the operands, collecting the processed results and storing them into the RF, such that the coprocessor executes in parallel and does not stall the IPPro, to achieve the best possible improvement.

Figure 4.9(a) shows the block diagram of the pipelined division coprocessor and Figure 4.9(b) shows the IPPro coprocessor extension datapath. Four 16-bit registers (C_IR1, C_IR2, C_OR1 and C_OR2) are incorporated between the input/output interface of the coprocessor and the IPPro datapath. The coprocessor enable signal (C_ENABLE) is asserted by the instruction decoder once the IPPro encounters the dedicated coprocessor instruction and writes the input operands to the C_IR1 and C_IR2 registers. These registers isolate the coprocessor from the IPPro datapath and ensure transparent exchange of data and independent parallel execution of the coprocessor and IPPro. The coprocessor processes the input operands and stores the results into the output registers (C_OR1 and C_OR2). The IPPro reads the coprocessor generated results from these output registers by executing a particular coprocessor read instruction and stores them into the RF as illustrated in Figure 4.9(b).

Figure 4.9: (a) Input/output interfaces of division coprocessor (b) Coprocessor extended IPPro datapath.

A division coprocessor has been incorporated into the extended IPPro datapath to evaluate the coprocessor extension. The division coprocessor takes two input operands (numerator and denominator) and generates two results (quotient and remainder), which are mapped into the IPPro datapath via the C_IR1, C_IR2, C_OR1 and C_OR2 registers respectively, as shown in Figure 4.9(a). In this implementation, the coprocessor clock (CLK) is synchronised to the IPPro datapath. Figure 4.10 shows the timing diagram of the parallel execution of IPPro and the division coprocessor. The operands (C_IR1, C_IR2) are exchanged and become valid once C_ENABLE is asserted. The coprocessor takes a fixed number of clock cycles to process the input data, generate the results and store them into the output registers. These processed tokens are then collected using the process described earlier.
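The register-based hand-off can be illustrated with a small behavioural model. This is a sketch under stated assumptions rather than the implemented hardware: the class and method names are hypothetical, the latency of 8 cycles is a placeholder (the text only states that the latency is fixed), and only non-negative operands are exercised because Python's `divmod` rounds differently from a typical hardware divider for negative values.

```python
class DivisionCoprocessor:
    """Behavioural model of a fixed-latency divider behind four registers."""

    def __init__(self, latency: int = 8):  # latency value is an assumption
        self.latency = latency
        self.c_ir1 = 0  # input register: numerator
        self.c_ir2 = 0  # input register: denominator
        self.c_or1 = 0  # output register: quotient
        self.c_or2 = 0  # output register: remainder
        self._remaining = 0  # cycles left until the outputs are valid

    @property
    def busy(self) -> bool:
        return self._remaining > 0

    def write_operands(self, numerator: int, denominator: int) -> None:
        """Models the coprocessor instruction: C_ENABLE asserted, operands latched."""
        if denominator == 0:
            raise ZeroDivisionError("denominator must be non-zero")
        self.c_ir1, self.c_ir2 = numerator, denominator
        self._remaining = self.latency

    def tick(self) -> None:
        """Advance one clock cycle; results appear after the fixed latency."""
        if self._remaining > 0:
            self._remaining -= 1
            if self._remaining == 0:
                self.c_or1, self.c_or2 = divmod(self.c_ir1, self.c_ir2)

    def read_results(self):
        """Models the coprocessor read instruction that moves results to the RF."""
        if self.busy:
            raise RuntimeError("results are not valid yet")
        return self.c_or1, self.c_or2


cop = DivisionCoprocessor()
cop.write_operands(100, 7)
independent_cycles = 0  # cycles the processor can spend on unrelated instructions
while cop.busy:
    cop.tick()
    independent_cycles += 1
quotient, remainder = cop.read_results()
```

The loop stands in for the independent instructions the processor could issue while the division is in flight, which is the point of isolating the two datapaths with registers.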

The coprocessor extended datapath has implemented using Xilinx Vivado De-

Table 4.10: Implementation results of optimised IPPro datapath to support co-

<table>
<thead>
<tr>
<th>Resources</th>
<th>Standalone</th>
<th>Coprocessor extension</th>
</tr>
</thead>
<tbody>
<tr>
<td>FFs</td>
<td>447</td>
<td>481</td>
</tr>
<tr>
<td>LUTs</td>
<td>484</td>
<td>573</td>
</tr>
<tr>
<td>BRAMs</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>DSP48E1</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>Freq. (MHz)</td>
<td>337</td>
<td>302</td>
</tr>
</tbody>
</table>

Figure 4.10: Pipelined execution of division coprocessor.
4.6 Comparison of IPPro results

Kapre et al. have proposed GraphSoC, a custom soft processor for accelerating graph algorithms using a Zynq MPSoC [16]. It is a 3-stage pipelined processor that supports graph semantics (node and edge operations). The graphs are stored in on-chip BRAM for fast local access. A compilation framework, including an assembler, was developed to configure the processor instruction and data memories, where each core uses 9 BRAMs and operates at 200 MHz. Andryc et al. presented an FPGA-based FlexGrip architecture for compute-intensive streaming applications [36]. It is composed of an array of streaming multiprocessors (SMs), where each SM contains multiple 5-stage pipelined scalar processor (SP) cores connected in a SIMD computing paradigm. The framework maps pre-compiled CUDA kernels onto the SPs, which operate at 100 MHz.

Table 4.11: Comparison of IPPro against other FPGA-based soft-core processor architectures.

<table>
<thead>
<tr>
<th>Resources</th>
<th>IPPro</th>
<th>GraphSoC [16]</th>
<th>FlexGrip 8 SP [36]</th>
<th>MicroBlaze</th>
</tr>
</thead>
<tbody>
<tr>
<td>FFs</td>
<td>447</td>
<td>551</td>
<td>12,972=103,776/8</td>
<td>518</td>
</tr>
<tr>
<td>LUTs</td>
<td>484</td>
<td>974</td>
<td>8,915=71,323/8</td>
<td>897</td>
</tr>
<tr>
<td>BRAMs</td>
<td>1</td>
<td>9</td>
<td>15=120/8</td>
<td>4</td>
</tr>
<tr>
<td>DSP48E1s</td>
<td>1</td>
<td>1</td>
<td>19=156/8</td>
<td>3</td>
</tr>
<tr>
<td>No. of Stage</td>
<td>5</td>
<td>3</td>
<td>5</td>
<td>5</td>
</tr>
<tr>
<td>BRAM/DSP ratio</td>
<td>1.0</td>
<td>9.0</td>
<td>0.76</td>
<td>1.3</td>
</tr>
<tr>
<td>Freq. (MHz)</td>
<td>337</td>
<td>200</td>
<td>100</td>
<td>211</td>
</tr>
</tbody>
</table>

*Scaled to a single streaming processor.

Table 4.11 compares the implementation results of the IPPro processor against other processors. The reported area utilisation results of FlexGrip are normalised to a single processing core, as each SM is composed of 8 SP cores connected in SIMD. The results show that IPPro is compact and delivers $\approx 1.6 - 3.3$ times better performance, considering $f_{\text{Max}}$. The reported area results show that the FF utilisation is relatively similar, except that FlexGrip uses 18 times more FFs. Comparing LUTs, IPPro uses 50% fewer LUT resources than both MicroBlaze and GraphSoC. Analysing design area efficiency, a significant difference of 0.76 - 9.00 in BRAM/DSP ratio is observed, which makes IPPro an area-efficient design based on the proposed metric.
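The two metrics used in this comparison can be recomputed from the figures in Table 4.11. The sketch below is illustrative only; the dictionary layout and variable names are hypothetical, and the FlexGrip entry uses the unscaled totals listed in the table (120 BRAMs, 156 DSP48E1s).

```python
# BRAM count, DSP48E1 count and operating frequency as listed in Table 4.11.
table_4_11 = {
    "IPPro": {"brams": 1, "dsps": 1, "f_mhz": 337},
    "GraphSoC": {"brams": 9, "dsps": 1, "f_mhz": 200},
    "FlexGrip": {"brams": 120, "dsps": 156, "f_mhz": 100},
    "MicroBlaze": {"brams": 4, "dsps": 3, "f_mhz": 211},
}

# Proposed area-efficiency metric: BRAM/DSP ratio per processor.
bram_dsp_ratio = {
    name: row["brams"] / row["dsps"] for name, row in table_4_11.items()
}

# Frequency advantage of IPPro over each of the other processors.
ippro_f = table_4_11["IPPro"]["f_mhz"]
fmax_gain = {
    name: ippro_f / row["f_mhz"]
    for name, row in table_4_11.items()
    if name != "IPPro"
}

for name, gain in fmax_gain.items():
    print(f"{name}: {gain:.2f}x f_max, BRAM/DSP = {bram_dsp_ratio[name]:.2f}")
```

A ratio close to unity indicates that BRAM and DSP resources are consumed in balance, which is the sense in which the text describes IPPro as area-efficient.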

4.7 Application use cases

Three different comparison approaches are adopted to evaluate the area and performance of the IPPro architecture by comparing it against HLS, a programmable FPGA-based architecture and a softcore processor. Firstly, a set of chosen point and area image pre-processing functions are implemented using IPPro and compared against the hand-coded HLS implementations. Secondly, the chosen image pre-processing functions will be compared against programmable FPGA-based

Table 4.12: Mathematical representation of image pre-processing functions.

<table>
<thead>
<tr>
<th>Function</th>
<th>Mathematical representation</th>
</tr>
</thead>
<tbody>
<tr>
<td>Thresholding</td>
<td>$P_{\text{output}} = P_{\text{input}} &gt; P_{\text{threshold}} ? 255 : 0$</td>
</tr>
<tr>
<td>Gaussian</td>
<td>$P_{\text{output}} = \begin{bmatrix} P_1 &amp; P_2 &amp; P_3 \\ P_4 &amp; P_5 &amp; P_6 \\ P_7 &amp; P_8 &amp; P_9 \end{bmatrix} \ast \begin{bmatrix} K_1 &amp; K_2 &amp; K_3 \\ K_4 &amp; K_5 &amp; K_6 \\ K_7 &amp; K_8 &amp; K_9 \end{bmatrix} = \sum_{i=1}^{9} (P_i \ast K_i)$</td>
</tr>
<tr>
<td>Sobel</td>
<td>$P_{\text{output}} = \begin{bmatrix} P_1 &amp; P_2 &amp; P_3 \\ P_4 &amp; P_5 &amp; P_6 \\ P_7 &amp; P_8 &amp; P_9 \end{bmatrix} \ast \begin{bmatrix} 1 &amp; 0 &amp; -1 \\ 2 &amp; 0 &amp; -2 \\ 1 &amp; 0 &amp; -1 \end{bmatrix}$</td>
</tr>
<tr>
<td>Gradient calculation</td>
<td>$P_{\text{Gradient}} =</td>
</tr>
</tbody>
</table>

architecture and lastly, a set of micro-benchmarks is selected to analyse the IPPro performance against the well-established MicroBlaze softcore processor.

**Image pre-processing functions** Image pre-processing algorithms, which include image filtering functions, are used extensively for feature detection, image analysis and noise reduction [94], [95], [96]. The *convolution* operation is central to filtering algorithms that use *area* image processing operations, as previously identified in Table 4.1. On the other hand, *point* operations are commonly used for image segmentation. Functions from both classes (thresholding, Gaussian, Sobel and gradient calculation) are accelerated using the IPPro processor to evaluate the performance of the IPPro architecture. These functions are commonly used front-end image processing operations [104], [105], [106], [107], [108]. Table 4.12 presents the mathematical representation of the chosen functions. These operations are accelerated by developing a system that implements a real-time video pipeline composed of a camera and a VGA output. The obtained acceleration results are compared against the fixed HLS approach.
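The point and area operations of Table 4.12 can be sketched in a few lines. The following is a software illustration only, not the IPPro code: the function names are hypothetical, the Gaussian coefficients and the 4-bit normalising shift are assumed values (the table leaves $K_1 \dots K_9$ symbolic), and the Sobel kernel is the one listed in the table.

```python
def threshold(pixel: int, level: int) -> int:
    """Point operation: 255 if the pixel exceeds the threshold, otherwise 0."""
    return 255 if pixel > level else 0


def window_op(window, kernel) -> int:
    """Area operation: sum of element-wise products over a flattened 3x3 window."""
    if len(window) != 9 or len(kernel) != 9:
        raise ValueError("window and kernel must both hold 9 values")
    return sum(p * k for p, k in zip(window, kernel))


# Sobel kernel from Table 4.12, flattened row by row.
SOBEL = [1, 0, -1,
         2, 0, -2,
         1, 0, -1]

# Assumed integer Gaussian kernel; its coefficients sum to 16, so the result
# is normalised with a right shift by 4. These values are not from the thesis.
GAUSSIAN = [1, 2, 1,
            2, 4, 2,
            1, 2, 1]
GAUSSIAN_SHIFT = 4

window = [10, 20, 30,
          40, 50, 60,
          70, 80, 90]

edge = window_op(window, SOBEL)
smooth = window_op([100] * 9, GAUSSIAN) >> GAUSSIAN_SHIFT
binary = threshold(130, 128)
print(edge, smooth, binary)
```

Both area operations share the same multiply-accumulate structure and differ only in the kernel coefficients, which is why a single programmable core can cover them by changing the instruction stream.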

**Processor Micro-benchmarks** The performance of a processor can be measured in many ways; often it is reported in millions of instructions per second (MIPS). However, this is not always a good metric, as a single instruction on one processor may accomplish more work than an instruction on another processor, and the branch penalty may negatively impact performance. One commonly used performance metric is the time required to accomplish a defined task. Therefore, a set of commonly used micro-benchmarks [86], [109] has been chosen and implemented on the IPPro, and the obtained results are compared against the well-established MicroBlaze soft-core processor. Each of the chosen micro-benchmarks is a fundamental kernel of larger algorithms and often the core computation of more extensive practical applications. The following are the details of each chosen micro-benchmark and the architectural aspects tested by each:

- Digital filter is an important function that signal processors use to modify and improve signals. In image processing, digital filters are used to improve the appearance of an image by smoothing, blurring and removing noise. It allows the 1D stream processing capabilities of the IPPro architecture to be analysed. The implementation of the 5-tap FIR function reads an element from an input stream, computes the FIR output and pushes it to the FIFO, for 50 samples.

- **Convolution** is a stream processing micro-benchmark extensively used in image processing. It allows the 2D data processing capability of a processor to be analysed. For the IPPro architecture, it helps to analyse the impact of the single cycle MULACC optimisation. The implemented micro-benchmark reads 3 data elements per iteration, computes the convolution function and pushes the output to the FIFO.

- **Polynomials** are one of the most fundamental types of functions, generally used in mathematics as well as in image processing to realise non-linear filters for contrast enhancement, texture segmentation and edge extraction. Usually they are formed entirely by repeated multiplications and additions. For the IPPro architecture, it allows the impact of the dataforwarding optimisation to be analysed. The implementation of the **degree-2 polynomial** function reads an element from an input stream, computes \( y(x) = ax^2 + bx + c \) and pushes the output to the FIFO.

- **Matrix multiply** is a widely used operation in digital signal processing applications and its non-linear complexity is often the critical part of many algorithms. It is computationally expensive as it requires extensive data independent multiplications and data dependent additions. This micro-benchmark allows the computation capability (MULACC) and the memory limitations of the IPPro architecture to be analysed. The implementation of the matrix multiply function reads two matrices from the register file, computes the product and stores the resultant matrix into the local memory.

- In digital image processing, the **Sum Of Absolute Differences** (SAD) is a measure of the similarity between image blocks. It is calculated by taking the absolute difference between each pixel in the original block and the corresponding pixel in the block being used for comparison. It is used for object recognition, disparity map generation and motion estimation. This micro-benchmark allows the impact of the branch operations necessary to compute the absolute value to be analysed. The implementation of the SAD function reads a window of elements stored in local memory, computes the absolute difference and pushes the results to the output FIFO.

- The **Fibonacci** sequence requires adding the two preceding numbers to generate the output number, which makes it an extensively data dependent computation. It allows the impact of both the data dependent execution and the branch penalty on the IPPro architecture to be analysed. The implementation of the Fibonacci function calculates the first 50 numbers of the series and pushes them into the output FIFO.
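Four of the micro-benchmarks listed above can be sketched as plain reference functions. These are illustrative software models with hypothetical names, not the IPPro or MicroBlaze implementations; they ignore the 16-bit datapath (no overflow is modelled), and the Fibonacci start values of 0 and 1 are an assumption.

```python
def fir5(samples, taps):
    """5-tap FIR: each output is the dot product of the taps and the last 5 inputs."""
    if len(taps) != 5:
        raise ValueError("expected exactly 5 taps")
    history = [0] * 5
    outputs = []
    for x in samples:
        history = [x] + history[:-1]  # shift the newest sample into the delay line
        outputs.append(sum(h * t for h, t in zip(history, taps)))
    return outputs


def poly2(x, a, b, c):
    """Degree-2 polynomial a*x^2 + b*x + c in Horner form (a dependent chain)."""
    return (a * x + b) * x + c


def sad(block, reference):
    """Sum of absolute differences between two equally sized pixel blocks."""
    return sum(abs(p - q) for p, q in zip(block, reference))


def fibonacci(count):
    """First `count` Fibonacci numbers, assuming the series starts at 0, 1."""
    sequence = []
    a, b = 0, 1
    for _ in range(count):
        sequence.append(a)
        a, b = b, a + b
    return sequence


print(fir5([1, 2, 3], [1, 1, 1, 1, 1]))
print(poly2(3, 2, 1, 5))
print(sad([1, 5, 9], [4, 5, 6]))
print(fibonacci(10))
```

The Horner form makes the dependency chain in the polynomial explicit: each multiply-add needs the previous result, which is the pattern that dataforwarding targets, while `abs` in the SAD kernel corresponds to the branch the text refers to.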

Section 4.7.1 presents the system architecture used to accelerate the chosen image pre-processing operations.

### 4.7.1 System architecture

The system architecture is composed of an OV7670 camera (to capture the real-time video stream), a single-core IPPro (to process the incoming video stream) and a VGA output (to display the processed video stream). Figure 4.11 shows the developed system architecture used to accelerate the chosen *point* and *area* operations by feeding a pixel or window of pixels, as configured during system initialisation. This system architecture is implemented and tested on the Avnet Zedboard development board, which has an on-board Xilinx Zynq SoC (XC7Z020-CLG484-1). The Zynq heterogeneous MPSoC has an on-chip *processing system* (PS) tightly coupled with *programmable logic* (PL). The AMBA AXI communication protocol is supported between the PS and PL. The AXI-Lite interface is used to program the IPPro instruction memory and control register during system configuration.

Figure 4.11: Block diagram of programmable video processing platform to implement case-studies using single-core IPPro.

The OV7670 is a CMOS colour image sensor that supports configurable VGA and CIF video resolutions, and RGB 565/555, YUV(4:2:2) and YCbCr(4:2:2) pixel formats. The camera module is directly connected to the Zynq PL using the PMOD-A and PMOD-B interfaces on the Zedboard. The VGA resolution and YUV(4:2:2) pixel format are selected, where the (Y) grey-scale component is used to accelerate the chosen front-end image processing operations. A dedicated camera controller handles the camera initialisation sequence and configuration using the I²C protocol. It captures the incoming video stream and stores it into the input frame buffer. The input frame controller sequentially reads the video frame (starting from address 0 → 307200) from the input frame buffer and converts it into a stream of pixels based on the configured point or window (using line buffers) then

Table 4.13: Area utilisation results of IPPro hardware accelerator.

<table>
<thead>
<tr>
<th>Module</th>
<th>Resources</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>FFs</td>
</tr>
<tr>
<td>Datapath</td>
<td>695</td>
</tr>
<tr>
<td>Point/area</td>
<td>349</td>
</tr>
<tr>
<td>Total</td>
<td>1044</td>
</tr>
</tbody>
</table>

stores it into the input FIFO as shown in Figure 4.11.

The input FIFO isolates the camera and IPPro clock domains, which allows IPPro to run at a higher operating frequency (187 MHz) than the camera interface. It also provides a handshaking mechanism to propagate the ripple effect and halt the input frame controller, avoiding data corruption when IPPro executes an unbalanced actor. As soon as pixels are available in the input FIFO, IPPro reads the stream of pixels, processes them sequentially and stores them into the output FIFO. The output frame controller reads the processed pixels and converts them into a video frame by sequentially storing pixels (starting from address 0 → 307200). The VGA controller reads the processed video frame and generates the required VGA control signals (V-SYNC and H-SYNC) to display it on the VGA monitor.

Table 4.13 reports the implementation results of the datapath and the point/area module. The reported datapath includes the necessary control logic, comprising the AXI-Lite control registers. The point/area module is composed of line buffers that organise the video data into a point or window of pixels. The point/area module uses three BRAMs to implement the three line buffers required to generate a 3x3 pixel window, as reported in Table 4.13. Additionally, to support the AXI4-Lite control and configuration registers, the IPPro datapath consumes 1.55 and 1.68 times more FFs and LUTs respectively compared to the results reported in Table 4.10.
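The role of the line buffers can be illustrated with a small generator that turns a raster-ordered pixel stream into 3x3 windows. This is a behavioural sketch with hypothetical names: it emits windows one completed line at a time rather than pixel by pixel as hardware would, and it simply skips the image border, since border handling is not described here.

```python
from collections import deque

FRAME_WIDTH, FRAME_HEIGHT = 640, 480  # VGA frame: 640 * 480 = 307200 pixels


def windows_3x3(pixels, width):
    """Yield flattened 3x3 windows from a raster-ordered pixel stream."""
    lines = deque(maxlen=3)  # stands in for the three line buffers
    row = []
    for pixel in pixels:
        row.append(pixel)
        if len(row) == width:  # a full video line has arrived
            lines.append(row)
            row = []
            if len(lines) == 3:
                top, middle, bottom = lines
                for x in range(width - 2):
                    yield top[x:x + 3] + middle[x:x + 3] + bottom[x:x + 3]


# Small 4x4 test frame with pixel values 0..15 in raster order.
frame = list(range(16))
windows = list(windows_3x3(frame, width=4))
print(len(windows), windows[0], windows[-1])
```

Each yielded window holds the nine values $P_1 \dots P_9$ in the row-by-row order used by the area operations.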
4.7.2 Comparison of IPPro with HLS approach

The acceleration results of the proposed IPPro-based programmable approach are compared against the high-level synthesis (HLS) approach. The chosen image processing operations are hand-coded in C++ and compiled using Xilinx Vivado HLS. The implementations exploit pipeline optimisation, and the designs are synthesised and implemented using Xilinx Vivado Design Suite v2015.2. Vivado HLS generates each operation as an intellectual property (IP) block with AXI4-Lite and AXI4-Stream interfaces for easy integration into the system architecture previously presented in Figure 4.11. In this system architecture, the IPPro core is replaced with the Vivado HLS generated IP.

The HLS implementations achieved 28 and 15 times better performance than IPPro due to a higher computation rate (MPixel/s), as reported in Table 4.14, at the cost of the software-centric edit-compile-run design flow. In the case of IPPro, the computation rate (MPixel/s) is inversely proportional to the number of cycles per pixel, which depends on the complexity of the function. Therefore, a further comparison of the proposed IPPro-based programmable approach against another programmable FPGA-based architecture is presented and analysed in Section 4.7.3.
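The relation between clock frequency, cycles per pixel, computation rate and frame rate can be sketched as follows. The function names are hypothetical, the frame size assumes the 640x480 VGA frame (307200 pixels) used by the system, and the cycles-per-pixel value in the last line is a placeholder rather than a measured figure.

```python
FRAME_PIXELS = 640 * 480  # VGA frame, 307200 pixels


def mpixels_per_second(f_mhz: float, cycles_per_pixel: float) -> float:
    """Computation rate, assuming one pixel is completed every cycles_per_pixel cycles."""
    return f_mhz / cycles_per_pixel


def frames_per_second(mpixel_rate: float) -> float:
    """Frame rate for a VGA frame at the given computation rate in MPixel/s."""
    return mpixel_rate * 1e6 / FRAME_PIXELS


# Dedicated accelerator rates listed in Table 4.14.
print(int(frames_per_second(200)), int(frames_per_second(150)))

# Placeholder example: a 187 MHz core that needs 10 cycles per pixel.
example_rate = mpixels_per_second(187, 10)
```

The 200 and 150 MPixel/s entries correspond to the 651 and 488 fps listed for the dedicated accelerators in Table 4.14.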

Table 4.14: Comparison of hardware acceleration results obtained from HLS and IPPro using Avnet Zedboard (Artix-7).

<table>
<thead>
<tr>
<th>Acceleration approach</th>
<th>Dedicated accel.</th>
<th>Proposed IPPro</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>fps</td>
<td>MPixel/s</td>
</tr>
<tr>
<td><strong>Thresholding</strong></td>
<td>651</td>
<td>200</td>
</tr>
<tr>
<td><strong>Gaussian (3x3)</strong></td>
<td>488</td>
<td>150</td>
</tr>
<tr>
<td><strong>Sobel (3x3)</strong></td>
<td>488</td>
<td>150</td>
</tr>
<tr>
<td><strong>Gradient calculation</strong></td>
<td>651</td>
<td>200</td>
</tr>
</tbody>
</table>

4.7.3 Comparison of IPPro against programmable FPGA-based architecture

Reichenbach et al. have presented a programmable image processing architecture for smart cameras [110]. The architecture is based on programmable coarse-grained application-specific processing elements (PEs) that enable fine-grained configurability to realise the algorithmic peculiarities of image processing applications. Each PE only supports a set of application-specific assembly instructions that can be used to compute a specific image processing function such as the Gaussian, Sobel and Gradient operations. The architecture was implemented on the heterogeneous Xilinx Zynq XC7Z020 SoC platform, where the programmable logic is used to populate the PEs and process video frames.

Table 4.15: Comparison of IPPro performance results against programmable FPGA-based architecture.

<table>
<thead>
<tr>
<th>Function</th>
<th>[110] # of cores</th>
<th>[110] total</th>
<th>[110] per core</th>
<th>IPPro</th>
<th>Speed-up</th>
</tr>
</thead>
<tbody>
<tr>
<td>Gaussian</td>
<td>12</td>
<td>295</td>
<td>24</td>
<td>46</td>
<td></td>
</tr>
<tr>
<td>Sobel</td>
<td>6</td>
<td>180</td>
<td>30</td>
<td>54</td>
<td>1.8</td>
</tr>
<tr>
<td>Gradient</td>
<td>20</td>
<td>120</td>
<td>6</td>
<td>35</td>
<td>5.8</td>
</tr>
</tbody>
</table>

To compare the performance and resource utilisation results of this architecture against IPPro, the performance and resource utilisation numbers reported in Table 4.15 and Table 4.16 have been normalised to a single core. Focusing on area utilisation, the PE implementing a Sobel filter consumed 2.8 and 2.6 times fewer FFs and LUTs respectively than the one implementing a Gaussian filter, by exploiting kernel coefficient optimisation. IPPro has achieved 5.8 and 1.8 times better performance, at the cost of an approximately equal number of FFs and 1.5 times fewer LUT resources, over [110] for gradient calculation and Sobel filter respectively. This performance improvement at reduced area cost by IPPro architecture has

Table 4.16: Area comparison of IPPro against programmable FPGA-based architecture. The normalised per core resource utilisation are reported in the brackets.

<table>
<thead>
<tr>
<th>Resources</th>
<th>Gaussian</th>
<th>Sobel</th>
<th>Gradient</th>
</tr>
</thead>
<tbody>
<tr>
<td>FFs</td>
<td>6177 (1029)</td>
<td>4360 (363)</td>
<td>592 (30)</td>
</tr>
<tr>
<td>LUTs</td>
<td>10017 (1669)</td>
<td>7718 (643)</td>
<td>1782 (90)</td>
</tr>
<tr>
<td>BRAMs</td>
<td>2</td>
<td>2</td>
<td>2</td>
</tr>
<tr>
<td>DSPs</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
</tbody>
</table>

been achieved by exploiting DSP block optimisation over [110] which can be clearly observed in Table 4.16.

4.7.4 Comparison of IPPro with MicroBlaze

The selected micro-benchmark results are compared against the well-established Xilinx MicroBlaze soft-core processor. The micro-benchmarks are written in standard C and implemented using Xilinx Vivado SDK v2015.1. MicroBlaze has been configured for performance with no debug module, instruction/data cache and a single AXI-Stream link enabled to stream data into the MicroBlaze using the getfsl and putfsl instructions in C, which are equivalent to get and put in assembly.

Table 4.17 reports the performance results of the micro-benchmarks implemented using the IPPro and MicroBlaze soft-core processors on Kintex-7 FPGA fabric. Table 4.18 shows the area utilisation of the proposed IPPro and MicroBlaze soft-core processors.

Table 4.17: Comparison of micro-benchmarks on IPPro and MicroBlaze.

<table>
<thead>
<tr>
<th>Processor</th>
<th>MicroBlaze</th>
<th>IPPro</th>
<th>Speed-up</th>
</tr>
</thead>
<tbody>
<tr>
<td>FPGA Fabric</td>
<td>Kintex-7</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Freq (MHz)</td>
<td>287</td>
<td>337</td>
<td></td>
</tr>
<tr>
<td>Micro-benchmarks</td>
<td>Exec. Time (us)</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Convolution</td>
<td>0.60</td>
<td>0.14</td>
<td>4.11</td>
</tr>
<tr>
<td>Degree-2 Polynomial</td>
<td>5.92</td>
<td>3.29</td>
<td>1.80</td>
</tr>
<tr>
<td>5-tap FIR</td>
<td>47.73</td>
<td>5.34</td>
<td>8.94</td>
</tr>
<tr>
<td>Matrix Multiply</td>
<td>0.67</td>
<td>0.10</td>
<td>6.7</td>
</tr>
<tr>
<td>Sum of Absolute Differences</td>
<td>0.73</td>
<td>0.77</td>
<td>0.95</td>
</tr>
<tr>
<td>Fibonacci</td>
<td>4.70</td>
<td>3.56</td>
<td>1.32</td>
</tr>
</tbody>
</table>

Table 4.18: Area comparison of IPPro and MicroBlaze processors.

<table>
<thead>
<tr>
<th>Processor</th>
<th>MicroBlaze</th>
<th>IPPro</th>
<th>Ratio</th>
</tr>
</thead>
<tbody>
<tr>
<td>FFs</td>
<td>746</td>
<td>422</td>
<td>1.77</td>
</tr>
<tr>
<td>LUTs</td>
<td>1114</td>
<td>478</td>
<td>2.33</td>
</tr>
<tr>
<td>BRAMs</td>
<td>4</td>
<td>2</td>
<td>2.67</td>
</tr>
<tr>
<td>DSP48E1</td>
<td>0</td>
<td>1</td>
<td>0.00</td>
</tr>
</tbody>
</table>

IPPro consumes $\approx 1.7$ and $2.3$ times fewer FFs and LUTs respectively. It can be observed that for streaming functions (Degree-2 Polynomial, 3x3 filter and 5-tap FIR), IPPro has achieved $1.80$, $4.41$ and $8.94$ times better performance compared to MicroBlaze, due to the support of single cycle multiply-accumulate with dataforwarding and get/push instructions in the IPPro processor. However, the IPPro datapath does not support branch prediction, which impacts IPPro performance when implementing data dependent or conditional functions (Fibonacci and Sum of Absolute Differences); the SAD implementation using IPPro resulted in a $5\%$ performance degradation compared to MicroBlaze. On the other hand, for memory-bounded functions such as Matrix Multiply, IPPro performed $6.7$ times better than MicroBlaze due to its higher operating frequency.

4.8 Summary

This chapter has presented an FPGA-based soft-core processor architecture to achieve programmable hardware acceleration of front-end image processing operations and compared the obtained performance and area results against the fixed HLS design approach. The proposed approach has achieved software recompilation of the FPGA by avoiding synthesis, place and route. This has been achieved by developing an FPGA-based soft-core Image Processing Processor (IPPro) architecture tailored to accelerate front-end image processing operations. The architecture was developed after a detailed analysis of FPGA resources, processor functionality and dataflow models. The architecture exploits the FPGA's dedicated computing and memory resources to achieve the best balance between performance ($f_{Max}$) and area utilisation.

The IPPro is a 16-bit signed, 5-stage pipelined RISC processor that supports basic arithmetic, logical and branch instructions with dataforwarding to implement data dependent point and area operations. It is a light-weight soft-core processor that consumes less than 1% of Kintex-7 (ZC706) FPGA fabric resources and delivers 337 MIPS. IPPro running on Virtex-7 (VC707) and Kintex-7 (ZC706) can deliver $\approx 2.00$ and $1.80$ times improved $f_{Max}$ respectively compared to Artix-7 (Zedboard) by porting IPPro to a different FPGA fabric. The area and performance results make it viable as a basic processing element for programmable many-core and multicore architectures.

To evaluate the performance and identify limitations of the developed IPPro architecture, Russell et al. and Kelly et al. have accelerated morphology filtering and the first two stages of histogram of gradient (HOG) using native IPPro supported instructions. They reported that significant performance improvements could be obtained by extending the datapath capabilities beyond the instructions offered purely by the DSP48E1 block. Two IPPro optimisations are implemented: support for MIN/MAX instructions, and a coprocessor extension, which resulted in an $\approx 82\%$ reduction in IPPro instructions for HOG.

In the end, three comparison approaches have been adopted to evaluate the performance and area of the IPPro architecture. The obtained results have been compared against HLS, an FPGA-based programmable architecture and the well-established MicroBlaze soft-core processor. The acceleration of point and area image pre-processing functions using HLS delivered significantly higher performance than IPPro, at the cost of programmability. IPPro has achieved 5.8 and 1.8 times better performance than the FPGA-based programmable architecture that uses dedicated programmable processing elements, by exploiting DSP block optimisation. On the other hand, IPPro delivered up to 8.94 times better performance, and used 1.7 and 2.3 times fewer FFs and LUTs respectively, compared to MicroBlaze. Analysing the micro-benchmarks, IPPro performs best when implementing data-independent streaming functions, due to the pipelined support of the single-cycle multiply-accumulate operation and dataforwarding. For data-dependent micro-benchmarks, the reduction in performance is due to the lack of branch prediction. Although IPPro delivered better performance and area results than MicroBlaze, the results presented in this chapter use a single-core IPPro. In Chapter 5, further investigation is carried out to explore the performance improvement obtained by exploiting data and task parallelism in streaming applications.
Chapter 5

IPPro-based acceleration of dataflow actor

5.1 Introduction

Chapter 4 presented the IPPro as an FPGA-based soft-core processor architecture to achieve programmable hardware acceleration of image pre-processing by exploiting the FPGA's dedicated computing and memory resources. This chapter extends this work by looking at the dataflow MoC and how dataflow actors can be accelerated effectively by supporting the MoC in the IPPro datapath. Initially, the chapter covers support of a dataflow actor at core level, focusing on firing actors, handling multi-port dataflow, the impact of FIFO implementation on the timing results ($f_{Max}$) and the hardware constraints of mapping a dataflow actor onto the IPPro core. It also presents the benefits of the IPPro-based programmable approach over HLS. Then it focuses on a system architecture that integrates multiple IPPro accelerators to exploit dataflow parallelism. To evaluate the performance of the discussed core and system level features, a detailed implementation of a $k$-means case study is presented and compared against equivalent implementations using an embedded CPU and GPU. The major contributions of this chapter are:

- Creation of an optimised IPPro core architecture which supports mapping and execution of static dataflow actor. The architecture is an independent, self-managed and area-efficient dataflow accelerator.

- Design and development of IPPro-based hardware accelerator models to analyse the management and provisioning policies of IPPro as a programmable dataflow accelerator and their impact on system design and control requirements to exploit parallelism.

- Design and implementation of a configurable system architecture that facilitates flexible decomposition and mapping of dataflow actors onto multiple IPPro cores using scatter-gather data distribution and a collection mechanism for image processing.

- Acceleration of the *distance calculation* and *averaging* stages of the $k$-means clustering algorithm using four different IPPro accelerators exhibiting different actor-core mappings on an Avnet Zedboard. Performance, power and resource efficiency have been compared against embedded CPU and GPU implementations.

Section 5.2 presents the IPPro core that supports dataflow components, execution patterns and a stream-based producer-consumer model while maintaining a balance between area and performance. Section 5.3 explores different IPPro management and provisioning possibilities when IPPro is incorporated in a heterogeneous system. It evaluates the impact on the host, inter-core communication and resource utilisation. Section 5.4 presents coarse and fine-grained mapping possibilities of dataflow actors onto multiple IPPro cores. It also presents a configurable system architecture tailored to accelerate image processing applications. Section 5.5 presents a case study that accelerates the computing stages of $k$-means clustering using IPPro accelerators. The solution uses data and task level parallelism by pipelining multiple stages. The results achieved with the IPPro accelerators are compared with the equivalent embedded CPU and GPU implementations in Table 5.15.

5.2 IPPro: A dataflow processor

A CAL dataflow application is a collection of computing units known as actors, which are composed of components, operations and memory elements, as discussed in Section 2.2 and listed in Table 3.1. Figure 5.1(a) shows a CAL actor representation consisting of an action, state variables and a finite-state-machine (FSM). An actor exchanges streams of tokens through unidirectional data buffers and starts execution as soon as its firing rule is satisfied. Once this happens, the actor reads tokens from the input buffer, processes them and stores the results in the output buffer. These functional requirements must be supported by the IPPro datapath in order to map and execute a dataflow actor. Table 5.1 lists the one-to-one mapping of dataflow semantics onto the IPPro datapath.

The IM stores the functional description of a dataflow actor, which contains the actor's description and its interaction with other actors, state variables and an FSM, in the form of IPPro program code. The IPPro instruction set architecture (ISA) implements the dataflow compute nodes defined within the action using arithmetic, logic and dedicated instructions (MUL, MULACC, MULADD, MULSUB, MIN, MAX, ADD, SUB, etc.). The branch instructions (BZ, BNZ, BS, BNS, etc.) implement the conditional, relational and data-dependent nodes of the actor. The RF is a memory element that stores state variables, intermediate tokens and the results of dependent nodes. One of the benefits of processor-based dataflow processing is *modularity*, as it allows fine and coarse-grained hierarchical decomposition and mapping of an actor onto an IPPro core [30]. Figure 5.1(b) illustrates the mapping of an actor onto the IPPro datapath.

Figure 5.1: (a) Representation of a CAL dataflow actor (b) Mapping of dataflow actor onto IPPro datapath.

Section 5.2.1 presents the support of actor firing in the IPPro datapath, while Section 5.2.2 extends it to support a data-driven computing model. Section 5.2.3 analyses the realisation of FIFOs using different FPGA memory resources and their impact on the overall timing ($f_{Max}$) of the IPPro datapath. Section 5.2.4 presents the implementation of basic dataflow execution patterns using IPPro. Dataflow actors usually support multiple data ports, which is not feasible for the IPPro architecture because it leads to inefficient utilisation of FPGA resources; this is covered in Section 5.2.5.

Table 5.1: One-to-one mapping of dataflow semantics onto IPPro datapath.

<table>
<thead>
<tr>
<th>No.</th>
<th>Dataflow semantics</th>
<th>IPPro datapath (component)</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>1)</td>
<td>Actor</td>
<td>Instruction memory (IM)</td>
<td>Functionality of dataflow actor</td>
</tr>
<tr>
<td>2)</td>
<td>State variable</td>
<td>Register file (RF)</td>
<td>Stores intermediate data for data dependent node</td>
</tr>
<tr>
<td>3)</td>
<td>Operator node</td>
<td>Instruction set (ALU)</td>
<td>Arithmetic, logical and conditional operations</td>
</tr>
<tr>
<td>4)</td>
<td>Input buffer</td>
<td>Input FIFO</td>
<td>Stores input tokens</td>
</tr>
<tr>
<td>5)</td>
<td>Output buffer</td>
<td>Output FIFO</td>
<td>Stores output tokens</td>
</tr>
</tbody>
</table>

### 5.2.1 Notion of firing an actor

The notion of firing an actor is essential for functional correctness due to the un-timed behaviour of the dataflow MoC. The token consumption and production rate depends on the functional description of an actor, and it is only known once the application use case has been chosen by the algorithm developer. Therefore, the IPPro must provide a flexible/programmable approach to handle actor firing and support a data-driven control mechanism to exchange data among actors. The initial IPPro datapath does not support the exchange of data tokens among multiple actors and is only suitable to map and execute an independent actor. It uses *GET* and *PUSH* instructions to read and write data tokens.

To this end, an *actor firing* module and a *TEST* instruction have been added to the IPPro datapath and instruction set, as shown in Figure 5.2. The *TEST* instruction allows the algorithm developer to specify the actor's consumption rate as part of the actor firing rule, defined inside the IPPro program code. This instruction checks the number of tokens available for consumption by reading the *TOKEN.COUNT* value of the input FIFO and comparing it with the expected consumption rate (passed as an argument with the instruction). The result of this comparison either grants or restricts the execution of the actor.

Table 5.2: IPPro code implementing dataflow actor firing rule.

<table>
<thead>
<tr>
<th>#</th>
<th>Instructions</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>MAIN:</td>
<td>MAIN routine to check actor firing rule</td>
</tr>
<tr>
<td>2</td>
<td>STR R1, 4</td>
<td>Set no. of tokens required to fire the actor</td>
</tr>
<tr>
<td>3</td>
<td>TEST R2, R1</td>
<td>Check whether FIFO has at least 4 (R1) tokens</td>
</tr>
<tr>
<td>4</td>
<td>BZ FIRE_ACTOR</td>
<td>If YES, fire actor</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>10</td>
<td>JMP MAIN</td>
<td>Else wait until firing rule is satisfied</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>15</td>
<td>FIRE_ACTOR:</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>30</td>
<td>JMP MAIN</td>
<td>The execution of actor is finished. Go back to MAIN to check firing rule again for next iteration</td>
</tr>
</tbody>
</table>

Table 5.2 presents IPPro code that implements the actor firing rule by initialising R1 (STR R1,4), where the value stored in R1 represents the actor's consumption rate. During program execution, the processor jumps between the MAIN and FIRE_ACTOR sub-routines. If the input FIFO holds four or more tokens, program execution jumps to FIRE_ACTOR, executes a single iteration and returns to the MAIN sub-routine. Otherwise, program execution returns to MAIN and checks the firing rule again. The TEST instruc-

Figure 5.2: IPPro datapath supporting firing of dataflow actor.
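The firing rule of Table 5.2 can be illustrated with a minimal Python sketch. It is a behavioural model only, not IPPro code: the function name `run_actor`, the use of a `deque` as the input FIFO and the choice of `sum` as the actor body are assumptions made for illustration. Unlike the IPPro program, which keeps polling in MAIN, the sketch simply returns once the firing rule is no longer satisfied.

```python
from collections import deque

def run_actor(input_fifo, consumption_rate, fire):
    """Fire the actor for as long as the input FIFO satisfies the firing rule."""
    outputs = []
    # TEST: compare the FIFO token count with the consumption rate.
    while len(input_fifo) >= consumption_rate:
        # GET: read exactly 'consumption_rate' tokens, in order.
        tokens = [input_fifo.popleft() for _ in range(consumption_rate)]
        # FIRE_ACTOR: one iteration of the actor, followed by a PUSH of the result.
        outputs.append(fire(tokens))
    return outputs

fifo = deque([1, 2, 3, 4, 5, 6, 7, 8, 9])
results = run_actor(fifo, 4, sum)   # the actor body here is a hypothetical sum
print(results, list(fifo))
```

With nine tokens and a consumption rate of four, the actor fires twice and the ninth token stays in the FIFO until more tokens arrive.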

Figure 5.3: Producer-consumer data-driven execution using IPPro core.

5.2.2 Producer-consumer computing model

Realising a programmable multicore architecture that uses IPPro as the basic programmable computation unit, where some cores are producers and others are consumers, requires a control/handshake mechanism. This control mechanism ensures a continuous flow of data tokens between pipelined processing stages. The input and output FIFOs provide isolation between stages, which can be used to exploit task level parallelism by minimising the maximum execution time over all stages of a pipeline, and improve acceleration by keeping the cores busy processing data.

For this purpose, dedicated GET and PUT hardware modules are included at the input and output data interfaces of the datapath respectively, as illustrated in Figure 5.3. They use the EMPTY and FULL signals to check the status of a FIFO and identify whether the input FIFO is empty or the output FIFO is full. If either happens, the core stops execution and only resumes once the output FIFO has empty space to store processed tokens and the input FIFO has tokens available for processing. Therefore, when executing unbalanced actors, the slowest actor of an algorithm defines the worst-case execution time due to the ripple effect. However, there are different dataflow optimisations that could improve results by exploiting data parallelism and choosing a suitable decomposition [24], [30], which will be discussed in Section 5.4.
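The ripple effect described above can be sketched with a small timing model of a linear pipeline of actors joined by bounded FIFOs. This is an illustrative abstraction rather than the IPPro hardware behaviour: the function name `pipeline_cycles`, the one-token-in, one-token-out actors, the fixed per-actor latencies and the FIFO depth are all assumptions.

```python
def pipeline_cycles(latencies, n_tokens, fifo_depth=1):
    """Cycles needed for n_tokens to leave a linear pipeline of actors.

    latencies[i] is the execution time (in cycles) of one firing of actor i.
    An actor stalls while its input FIFO is EMPTY or its output FIFO is FULL.
    """
    n = len(latencies)
    start = [[0] * n_tokens for _ in range(n)]
    depart = [[0] * n_tokens for _ in range(n)]
    for k in range(n_tokens):
        for i in range(n):
            token_ready = depart[i - 1][k] if i > 0 else 0   # input FIFO not EMPTY
            core_free = depart[i][k - 1] if k > 0 else 0     # previous firing pushed
            start[i][k] = max(token_ready, core_free)
            done = start[i][k] + latencies[i]
            # Output FIFO not FULL: token k - fifo_depth must already have been read.
            if i < n - 1 and k >= fifo_depth:
                done = max(done, start[i + 1][k - fifo_depth])
            depart[i][k] = done
    return depart[n - 1][n_tokens - 1]

balanced = pipeline_cycles([3, 3, 3], n_tokens=10)
unbalanced = pipeline_cycles([2, 5, 2], n_tokens=10)
print(balanced, unbalanced)
```

Both pipelines perform nine cycles of work per token, but in this model the unbalanced one needs 54 cycles for ten tokens against 36 for the balanced one, because every token has to wait for the five-cycle actor.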

### 5.2.3 Evaluation of FIFO configurations

In an FPGA, a FIFO can be realised using *Block RAM* (BRAM), *lookup-table* (DistRAM) or *shift register* (SR) resources. BRAM is suitable for realising a large FIFO structure, such as a line buffer that stores a line of pixels, while DistRAM and SR are efficient for smaller FIFO realisations [111]. Realisation of a FIFO using a shift register exploits the LUT resources of a *configurable logic block* (CLB) as a shift register instead of a dual-port RAM. The CLB can be configured either as distributed 64-bit RAM, as a 32-bit shift register (SRL32) or as two 16-bit shift registers (SRL16). From a hardware perspective, FIFOs isolate processing elements running at different clock frequencies, hence they are available in two configurations: *common-clock* (CC) or *independent-clock* (IC), depending on the write and read clock sources. Thus, different FIFO configurations have been implemented on different FPGA fabrics using Xilinx Vivado v2015.2, and the results are reported in Figure 5.4.

Comparing common-clock (CC) implementations, DistRAM delivers the best $f_{Max}$, followed by SR and BRAM, for which degradations of 8% and 17% have been observed on the Artix-7 FPGA fabric. Moreover, comparing independent-clock (IC) implementations, DistRAM delivered the best $f_{Max}$, with BRAM showing a degradation of $\approx 27\%$. Realisation of a FIFO using DistRAM is only feasible when it is deployed in the middle of the processing pipeline to store intermediate data tokens. On the other hand, BRAM-based FIFOs are suitable and resource efficient for larger memory data structures such as line buffers (640, 1024, 2048, etc.). The results reported in Figure 5.4 show the impact of FIFO configurations across FPGA technologies and can be used to find a suitable FIFO configuration for the IPPro datapath.

Figure 5.4: Impact on $f_{Max}$ of realising FIFOs using different resources and configurations.

For this purpose, the input and output FIFOs of the processor are realised using BRAM, DistRAM and SR. These designs are implemented using Xilinx Vivado v2015.2, and the area and timing results are reported in Table 5.3. In the case of DistRAM, the processor datapath can operate at up to 242 MHz, giving a raw computation rate of 242 MIPS while utilising 8% more LUTs than the BRAM design. Reductions of 3% and 19% in processor operating frequency have been observed for the SR and BRAM designs respectively, at the cost of 18% more FFs and 10% fewer LUTs. The presented processor datapath results show the impact of different FIFO configurations on timing and area utilisation. The design choice for realising a FIFO depends on the deployment scenario and the application use case. DistRAM is efficient for small data buffers, usually in the middle of an image processing pipeline. On the other hand, BRAM is resource efficient for the large data buffers commonly found at the beginning or end of the image pipeline.

Table 5.3: Implementation results of processor datapath using different FIFO configurations on Artix-7 FPGA fabric.

<table>
<thead>
<tr>
<th>FIFO (size)</th>
<th>FF</th>
<th>LUT</th>
<th>LUTRAM</th>
<th>BRAM</th>
<th>DSP48E1</th>
<th>Frequency (MHz)</th>
</tr>
</thead>
<tbody>
<tr>
<td>BRAM (512x16)</td>
<td>478</td>
<td>422</td>
<td>66</td>
<td>1.5</td>
<td>1</td>
<td>195</td>
</tr>
<tr>
<td>Shift Register (64x16)</td>
<td>510</td>
<td>411</td>
<td>90</td>
<td>1</td>
<td>1</td>
<td>237</td>
</tr>
<tr>
<td>DistRAM (64x16)</td>
<td>416</td>
<td>459</td>
<td>119</td>
<td>1</td>
<td>1</td>
<td>242</td>
</tr>
</tbody>
</table>

5.2.4 Mapping and execution of static dataflow actor

A static dataflow actor could represent a single operation node, a set of multiple operation nodes or a complex dataflow graph, depending on the chosen decomposition. Each dataflow node can also have different execution patterns [90], [17], [25], [112]. These execution patterns include feed-forward, split, merge and feedback, as illustrated in Figure 5.5 using dataflow nodes A, B, C and D. Figure 5.6 presents the pseudo IPPro program code to implement each execution pattern using an IPPro core.

In feed-forward, GET reads the data tokens and stores them into the $R1$ and $R2$ registers of the RF; function A is executed, its results are stored into $R3$ and $R4$, and PUSH writes the results to the output FIFO. In the case of a split, the tokens produced by function A ($R3$ and $R4$) are fed to B and C. In the case of a merge, A and B produce tokens ($R1$ and $R2$) and ($R3$ and $R4$) respectively, which are fed to C, which computes output tokens $R5$ and $R6$. The benefit of supporting these execution patterns with the help of the RF in the IPPro core is that it not only allows the implementation of a dataflow actor but also provides flexible decomposition and mapping options to the user and the software framework, to explore and exploit dataflow optimisations.

Figure 5.5: Mapping of dataflow execution patterns on IPPro core.

Figure 5.6: Pseudo IPPro code to implement dataflow execution patterns.
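The feed-forward, split and merge patterns can be sketched in Python, with local variables standing in for the RF registers and `deque` objects standing in for the input and output FIFOs. The node functions `node_a`, `node_b` and `node_c` are hypothetical placeholders, since the thesis does not define what A, B and C compute, and reading the inputs of A and B from the input FIFO in the merge case is likewise an assumption.

```python
from collections import deque

# Hypothetical node functions; the thesis does not define what A, B and C compute.
def node_a(x, y):
    return x + y, x - y

def node_b(x, y):
    return 2 * x, 2 * y

def node_c(*tokens):
    return sum(tokens), max(tokens)

def feed_forward(fifo_in, fifo_out):
    r1, r2 = fifo_in.popleft(), fifo_in.popleft()   # GET R1 ; GET R2
    r3, r4 = node_a(r1, r2)                         # A: (R1, R2) -> (R3, R4)
    fifo_out.extend((r3, r4))                       # PUSH R3 ; PUSH R4

def split(fifo_in, fifo_out):
    r1, r2 = fifo_in.popleft(), fifo_in.popleft()
    r3, r4 = node_a(r1, r2)                         # A produces R3 and R4
    fifo_out.extend(node_b(r3, r4))                 # B consumes R3 and R4
    fifo_out.extend(node_c(r3, r4))                 # C consumes the same R3 and R4

def merge(fifo_in, fifo_out):
    r1, r2 = node_a(fifo_in.popleft(), fifo_in.popleft())   # A -> R1, R2
    r3, r4 = node_b(fifo_in.popleft(), fifo_in.popleft())   # B -> R3, R4
    r5, r6 = node_c(r1, r2, r3, r4)                         # C -> R5, R6
    fifo_out.extend((r5, r6))

def run(pattern, tokens):
    """Run one iteration of a pattern and return the output FIFO contents."""
    fifo_in, fifo_out = deque(tokens), deque()
    pattern(fifo_in, fifo_out)
    return list(fifo_out)

print(run(feed_forward, [5, 3]), run(split, [5, 3]), run(merge, [5, 3, 1, 2]))
```

Because the intermediate tokens stay in registers, the split and merge cases need no extra FIFO traffic between A, B and C, which is the flexibility the RF provides.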

5.2.5 Supporting multi-port dataflow actor

Dataflow programming languages support multi-port actors. HLS-driven hardware architectures support an input *interface* for each computation block, where each data port is directly translated into a FIFO structure [43], [17]. However, in that case the application use case or algorithm to be implemented is known in advance, before the hardware design is synthesised and implemented. This allows the HLS tool to profile the chosen application and find its optimal memory requirements.

On the contrary, in a processor-based approach, the underlying hardware architecture is pre-implemented using the generic processing and memory requirements of a class of applications. Because of this, the number of input/output interfaces supported by the IPPro datapath must be fixed. A higher number of ports could lead to inefficient utilisation of resources, while a small number of ports could limit the actor mapping possibilities. This section therefore examines this design problem by increasing the number of ports and analysing the impact on resource requirements and the execution time of an actor.

Figure 5.7 depicts an increasing number of input data interfaces, used to identify the architectural requirements and theoretically estimate their impact on the actor execution time. The datapath can be composed of single, dual, triple and quad input ports (A, B, C, and D). Each input port can receive tokens produced by different producers via ports $A_n$, $B_n$, $C_n$ and $D_n$, where $n$ distinguishes each unique producer. For functional correctness, the order of tokens is important. Therefore, a dedicated FIFO channel is required for each producer core to avoid the token re-ordering problem.

Figure 5.7: Block diagram of multi-port input data interface of IPPro datapath.

Table 5.4 lists the architectural and control requirements for the input interface illustrated in Figure 5.7. A number of FIFO channels and multiplexers are required to connect the cores and receive data produced by the connected cores. It can be observed that the required number of FIFO channels is the product of the number of producers and the number of input ports, and that the number of multiplexers required equals the number of input ports. In FPGA design, a multiplexer is implemented using combinational logic, which increases the critical path length of the design and therefore affects the timing results. From both a resource utilisation and a timing point of view, a multi-port IPPro datapath is therefore not a suitable design choice.
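The scaling of the input interface can be written down as a short Python sketch that reproduces the FIFO channel and multiplexer rows of Table 5.4. The function name `input_interface_resources` and the dictionary layout are illustrative choices, not part of the IPPro design.

```python
def input_interface_resources(producers, ports):
    """Resources needed by an input interface with the given number of ports."""
    return {
        "fifo_channels": producers * ports,  # one FIFO channel per producer, per port
        "multiplexers": ports,               # one multiplexer per input port
    }

# Four producer cores, as in Table 5.4, for single to quad port datapaths.
table = {ports: input_interface_resources(4, ports) for ports in (1, 2, 3, 4)}
for ports, resources in table.items():
    print(ports, resources)
```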

Figure 5.8 depicts the cycle-based execution of $func(X)$ using single, dual, triple and quad input ports. A single-port design reads tokens sequentially from the input FIFO, in contrast to the dual, triple and quad port designs. The DFG node processing time $t_x$ (execution time of a single iteration) is greater than the time to read/write tokens $t_1$, $t_2$, $t_3$, $t_4$ and $t_{out}$ (a single clock cycle each), so the token transfers have a negligible impact on the total execution time of an actor. In the best-case scenario, the multi-port designs could save a maximum of two or three clock cycles, as illustrated in Figure 5.8, at the cost of using more resources. Therefore, a single input port datapath is selected that can handle multiple operands using time multiplexing.

Table 5.4: Hardware resource and control requirements to map multi-port actors onto IPPro core.

<table>
<thead>
<tr>
<th>Resource</th>
<th>Single-port</th>
<th>Dual-port</th>
<th>Triple-port</th>
<th>Quad-port</th>
</tr>
</thead>
<tbody>
<tr>
<td>Producer cores</td>
<td>4</td>
<td>4</td>
<td>4</td>
<td>4</td>
</tr>
<tr>
<td>FIFO channels</td>
<td>4</td>
<td>8</td>
<td>12</td>
<td>16</td>
</tr>
<tr>
<td>No. of multiplexers</td>
<td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
</tr>
<tr>
<td>Source port addressing</td>
<td>No</td>
<td>1 bit</td>
<td>2 bits</td>
<td>3 bits</td>
</tr>
</tbody>
</table>

Figure 5.8: Impact of multi-port IPPro datapath on execution time (in clock cycles) of dataflow actor.
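The cycle saving can be estimated with a simple model, sketched below in Python. It assumes four input tokens, one token read per port per clock cycle and a single cycle to write the output; the function name `actor_cycles` and the value $t_x = 20$ cycles are hypothetical and chosen only for illustration.

```python
import math

def actor_cycles(n_tokens, ports, t_x, t_out=1):
    """Cycles for one actor iteration: token reads, processing and one output write."""
    read_cycles = math.ceil(n_tokens / ports)  # assumed: one token per port per cycle
    return read_cycles + t_x + t_out

T_X = 20  # hypothetical processing time of func(X), in clock cycles
single = actor_cycles(4, 1, T_X)
savings = {ports: single - actor_cycles(4, ports, T_X) for ports in (2, 3, 4)}
print(single, savings)
```

Under these assumptions the dual and triple port designs save two cycles and the quad port design saves three, out of 25 cycles for the single-port case, which is consistent with the two or three cycles quoted above.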

5.2.6 Discussion on hardware acceleration using IPPro over HLS

Usually, FPGA-based dataflow programming frameworks and HLS-based tools take a dataflow description and use static timing analysis techniques to profile it and find a suitable decomposition that meets the application requirements. After a proper decomposition has been found, further FPGA/hardware specific optimisations are carried out and then an equivalent HDL circuit is generated. In contrast, in the IPPro approach, the dataflow specification is statically profiled based on the IPPro mapping constraints. The following are the major IPPro mapping constraints:

- The number of instructions to implement a dataflow actor - \( \text{inst}_{(\text{actor})} \): The IPPro has an IM of 512x32 bits, which efficiently exploits the distribution of BRAM resources by using exactly half of a BRAM block (18 Kb), and allows a maximum of 512 instructions. This metric drives the level of decomposition of an actor. The framework must describe the actor operations within 512 or fewer IPPro instructions.

- Actor execution time - \( \text{t}_{(\text{exec.})} \): It is a measure of the time needed for IPPro to execute a single iteration of an actor, as given by Equation 5.1, where \( f_{\text{Max}} \) is the maximum clock frequency of the IPPro system and each instruction takes one clock cycle to complete. This metric helps the framework to balance actors during decomposition and to avoid blocking.

\[
\text{t}_{(\text{exec.})} = \frac{\text{inst}_{(\text{actor})}}{f_{\text{Max}}} \quad (5.1)
\]

- Register utilisation - \( \text{RF}_{(\text{util})} \): It is the measure of registers used by a single execution of an actor. It covers storage of the input, intermediate and output variables used in a single iteration. This metric can aid the algorithm developer in finding a suitable actor decomposition.
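The three constraints above can be combined into a small profiling check, sketched here in Python. The helper names `execution_time` and `fits_on_core` are hypothetical, the register file size is passed in as a parameter because it is not stated in this section (the value 32 below is an assumption), and 242 MHz is taken from the DistRAM row of Table 5.3.

```python
IM_DEPTH = 512  # maximum number of instructions held by the IPPro instruction memory

def execution_time(inst_actor, f_max_hz):
    """Equation 5.1: actor execution time, at one clock cycle per instruction."""
    return inst_actor / f_max_hz

def fits_on_core(inst_actor, regs_used, rf_size):
    """True when the actor respects the instruction and register constraints."""
    return inst_actor <= IM_DEPTH and regs_used <= rf_size

F_MAX = 242e6   # Hz, DistRAM-based datapath in Table 5.3
RF_SIZE = 32    # assumed register file size, for illustration only

t_exec = execution_time(120, F_MAX)           # a hypothetical 120-instruction actor
print(f"{t_exec * 1e6:.3f} us", fits_on_core(120, 10, RF_SIZE))
```

An actor that exceeds either limit would have to be decomposed further before it can be mapped, and comparing `execution_time` across the resulting actors gives a first indication of how well balanced the decomposition is.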

This section has presented the IPPro core as an FPGA-based soft-core dataflow accelerator supporting flexible mapping and execution of static multi-port dataflow actors. In addition, IPPro-specific mapping constraints have been outlined that are essential for software profiling, mapping and compilation of dataflow actors onto IPPro. Section 5.3 investigates the IPPro accelerator from a system level perspective, where multiple IPPro cores are connected and exchange tokens. The focus is to identify the system level management and control requirements, the inter-core communication mechanisms and their impact on resource utilisation.

5.3 Management and provisioning of IPPro hardware accelerators

Hardware accelerators are used in data-intensive computing systems, including many-core and multicore processor architectures [22], [23], [12], [43], [17]. Generally, in MPSoC-based systems, hardware accelerators are managed by a host/master processor. The host handles system configuration, communication, data and control among accelerators, which impacts performance [113], [114]. To achieve better acceleration, it is vital to minimise host intervention not only in managing control and data transfer but also in managing the hardware accelerators themselves. Hardware accelerators can be classified by management policy into three classes [113]:

- Class I: Host managed dependent accelerator
- Class II: Host managed independent accelerator
- Class III: Self-managed independent accelerator

Table 5.5 lists the core, multicore and system level control and management requirements of each class of accelerator. To identify the desired synchronisation and inter-core communication mechanisms and analyse their impact on the area of each class of accelerator, four multi-core IPPro designs (A, B, C and D) have been implemented, as shown in Figure 5.9.

Table 5.5: Impact of accelerator classes on IPPro-based core, multicore and system requirements [113].

<table>
<thead>
<tr>
<th>Accelerator</th>
<th>Parallel skeleton</th>
<th>Management: core level</th>
<th>Management: multicore level</th>
<th>Memory model</th>
<th>Control management: code</th>
<th>Control management: sync.</th>
<th>Data</th>
<th>Task</th>
</tr>
</thead>
<tbody>
<tr>
<td>I) Dependent host managed</td>
<td>Pipeline</td>
<td>Host manages both data and control mechanisms</td>
<td>All accelerators are directly connected and managed by host processor (No inter-core communication)</td>
<td></td>
<td>Host controlled (complex host application)</td>
<td>Yes (host shall synchronize order of execution)</td>
<td>No</td>
<td>Yes</td>
</tr>
<tr>
<td>II) Independent host managed</td>
<td>Pipeline, Split-compute-merge</td>
<td>Instruction driven data mechanism TEST, GET, PUSH</td>
<td>Programmable inter-core communication controller</td>
<td></td>
<td>Separate code for inter-core communication controller</td>
<td>Yes (between inter-core controller and each actor)</td>
<td>Yes</td>
<td>Yes</td>
</tr>
<tr>
<td>III) Independent self-managed</td>
<td>Pipeline, Split-compute-merge</td>
<td>Instruction driven data and control mechanism TEST, GET, CH#, PUSH</td>
<td>Self-managed inter-core communication</td>
<td>Message passing</td>
<td>Embedded within IPPro code</td>
<td>No</td>
<td>Yes</td>
<td>Yes</td>
</tr>
</tbody>
</table>

The *Class I* accelerator represents a "host managed dependent accelerator", which has been common in hardware solutions where the compute-intensive part of the application is off-loaded onto dedicated IPs. The host assigns a job to the worker and is solely responsible for managing data distribution via shared memory using an appropriate control mechanism. IPPro is a stream accelerator that uses GET and PUSH instructions and does not require explicit data management by the host; therefore, the *Class I* accelerator is not relevant. Designs A, B and C functionally exhibit *Class II* accelerators as per Table 5.5. The difference between A and B is the level of inter-core connectivity between producer and consumer cores, which is 2x2 for A and 4x4 for B, as shown in Figure 5.9(a) and (b). In A and B, the host processor statically configures the multiplexers during system configuration by setting a configuration word, which remains fixed for the rest of the system operation. To map a tree-based dataflow actor, B requires $4^N$ computing stages compared to A, where $N$ is the level of connectivity between cores, which is 2x2 for A and 4x4 for B. When multiple cores are exchanging data simultaneously, both designs need collision avoidance mechanisms.

Table 5.6: IPPro-based multiple core architectures and their impact on system requirements and inter-core communication.

<table>
<thead>
<tr>
<th>Design</th>
<th>No. of cores</th>
<th>Host-core synchronisation</th>
<th>Communication management</th>
<th>Programmable inter-core connectivity</th>
<th>Token re-ordering needed</th>
<th>Token deterministic</th>
</tr>
</thead>
<tbody>
<tr>
<td>A</td>
<td>8</td>
<td>Yes</td>
<td>Static configuration</td>
<td>No</td>
<td>Yes</td>
<td>No</td>
</tr>
<tr>
<td>B</td>
<td>8</td>
<td>Yes</td>
<td>Static configuration</td>
<td>No</td>
<td>Yes</td>
<td>No</td>
</tr>
<tr>
<td>C</td>
<td>8</td>
<td>Yes</td>
<td>Inter-core controller</td>
<td>Yes</td>
<td>Yes</td>
<td>No</td>
</tr>
<tr>
<td>D</td>
<td>8</td>
<td>No</td>
<td>Self-managed</td>
<td>Yes</td>
<td>No</td>
<td>Yes</td>
</tr>
</tbody>
</table>

Kelly et al. have proposed a solution to address this issue by scheduling actors with a fixed offset [60]. This mechanism is common in HLS-based fine-grained architectures where the connectivity of dataflow actors is identified at compile time before realising the hardware [22], [25], [43], [115]. On the other hand, C supports dynamically configurable inter-core connectivity of 4x4, managed by an external inter-core controller as shown in Figure 5.9(c), which allows runtime configuration of the inter-core communication using a routing program produced by the compiler from the XDF. However, this increases hardware complexity, as it requires synchronisation between the IPPro cores and the controller. Lastly, design D illustrates a Class III hardware accelerator, where each IPPro core itself manages the inter-core communication. At the input interface, each core has FIFO queues (equal in number to the producers), which ensure deterministic token order, resolve token re-ordering and avoid collisions. The design minimises host intervention and system level control compared to the previous designs, due to the absence of an external controller, as shown in Figure 5.9(d). This is achieved by attaching additional information (the FIFO channel number) to each data token forwarded to the interconnect, where each FIFO channel number represents the producer of the data token.
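The channel-tagging idea of design D can be illustrated with a short Python sketch. The class `ConsumerInput` and its method names are hypothetical; the sketch only models the ordering behaviour of one FIFO queue per producer and says nothing about the interconnect hardware.

```python
from collections import deque

class ConsumerInput:
    """Input interface of a consumer core: one FIFO queue per producer."""

    def __init__(self, n_producers):
        self.channels = [deque() for _ in range(n_producers)]

    def receive(self, channel, token):
        # The channel number travels with the token and selects the FIFO queue.
        self.channels[channel].append(token)

    def get(self, channel):
        # The consumer program names the channel (producer) it wants to read from.
        return self.channels[channel].popleft()

# Tokens from producers 0 and 1 arrive interleaved in an arbitrary order.
consumer = ConsumerInput(n_producers=2)
for channel, token in [(1, "b0"), (0, "a0"), (1, "b1"), (0, "a1")]:
    consumer.receive(channel, token)

order = [consumer.get(0), consumer.get(0), consumer.get(1), consumer.get(1)]
print(order)
```

Whatever the arrival order across producers, each producer's tokens are read back in the order they were sent, so the consumer program sees a deterministic token sequence without an external controller.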

This solution simplifies the system architecture by avoiding distributed control and data mechanisms and integrating them into a single point of control. It has been achieved by making IPPro an independent self-managed dataflow accelerator. It provides the flexibility to explore and implement applications by changing only the IPPro program code, which contains the information related to data processing, the control/synchronisation mechanism and the exchange of tokens among multiple producers and consumers. Therefore, the application developer or software compiler has to generate only the IPPro code, rather than additional code for the inter-core controller as required by C.

Implementation Results

Table 5.7 reports the implementation results obtained from Xilinx Vivado Suite v2015.2. Statically managed inter-core communication designs B consumes 1.25 and 1.94 times more FFs and LUTs compared to A by increasing the level of core connectivity from 2x2 to 4x4. On the other hand, using an inter-core controller to dynamically manage the inter-core communication further increases the FFs and LUTs utilisation by 1.07 and 1.20 times compared to B in addition
5.4 Dataflow parallelism and multiple IPPro

Table 5.7: Impact on area utilisation of different accelerator configurations.

<table>
<thead>
<tr>
<th>Design</th>
<th>FF</th>
<th>LUT</th>
<th>DSP</th>
<th>BRAM</th>
</tr>
</thead>
<tbody>
<tr>
<td>A</td>
<td>1902</td>
<td>709</td>
<td>8</td>
<td>8</td>
</tr>
<tr>
<td>B</td>
<td>2381</td>
<td>1376</td>
<td>8</td>
<td>8</td>
</tr>
<tr>
<td>C</td>
<td>2549</td>
<td>1632</td>
<td>8</td>
<td>8</td>
</tr>
<tr>
<td>D</td>
<td>7616</td>
<td>5989</td>
<td>8</td>
<td>8</td>
</tr>
</tbody>
</table>

to a complex system-level synchronisation mechanism. Besides, design D results in an increase of approximately 2.30 and 3.67 times in FFs and LUTs. However, comparing D with the IPPro core results previously reported in Table 5.3, the presented functionalities and reduced management overhead come at a maximum cost of approximately 2.88 and 1.54 times more FFs and LUTs.

The area results presented in Table 5.7 show that increasing the level of connectivity and off-loading management tasks from the host come at the cost of higher resource utilisation, while the BRAM and DSP utilisation remains constant.

5.4 Dataflow parallelism and multiple IPPro

Dataflow is a stream-driven MoC that allows data and task level parallelism to be exploited using different parallel computing paradigms, as previously discussed in Section 2.2.1. IPPro is a light-weight programmable architecture that can be used to realise a programmable parallel dataflow computing system by connecting multiple IPPro cores to exploit parallelism. In contrast to pipelined parallel architectures, the iterative execution of a dataflow actor is a sequential operation which can take a variable number of clock cycles depending on the complexity of the actor. Therefore, to achieve acceleration, the computation load and the data transfer load are chosen as application constraints, defined as the actor execution time and the token production-consumption rate respectively. These application constraints shall be used by the compiler framework to find a suitable application decomposition and mapping on the IPPro cores for the user. Frames per second (\( fps \)) has been chosen as the performance metric for image processing applications because it will be used as the input parameter to the compiler framework to start profiling and optimising the application. Mathematically, it can be represented using Equation 5.2.

\[
fps = \frac{f_{(IPPro)}}{t_{(actor)} \times \frac{N_{\text{total\_pixels}}}{N_{\text{pixel\_consumption}}}} \tag{5.2}
\]

where \( f_{(IPPro)} \) is the IPPro operating frequency (discussed extensively as a performance metric in the core-level analysis and development of IPPro), \( t_{(actor)} \) is the execution time (in clock cycles) of the slowest dataflow actor, \( N_{\text{total\_pixels}} \) is the number of pixels in a frame and \( N_{\text{pixel\_consumption}} \) is the number of pixels consumed by an actor in each iteration. To improve the \( fps \), the following options are possible, as depicted in Figure 5.10:

- **Reducing the actor’s execution time** by decomposing it into multiple pipelined stages, thus reducing \( t_{(actor)} \) to improve \( fps \). Shorter actors can be merged sequentially to minimise the data transfer overhead by localising data in FIFOs between processing stages.

- **Vertical scaling to exploit data parallelism** by mapping an actor onto \( n \) IPPro cores, thus reducing the number of iterations executed by each core from \( \frac{N_{\text{total\_pixels}}}{N_{\text{pixel\_consumption}}} \) to \( \frac{N_{\text{total\_pixels}}}{n \times N_{\text{pixel\_consumption}}} \). However, this requires additional system-level data distribution, control and collection mechanisms.
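Equation 5.2 and the effect of vertical scaling can be checked with a short Python sketch. The function name, the `n_cores` parameter and all numeric values below are illustrative assumptions; the n-way case models an ideal split with no data distribution overhead, which a real system would not achieve.

```python
def frames_per_second(f_ippro_hz, t_actor_cycles, total_pixels,
                      pixels_per_iteration, n_cores=1):
    """Equation 5.2, extended with an ideal n-way split of the iterations."""
    iterations_per_core = total_pixels / (pixels_per_iteration * n_cores)
    return f_ippro_hz / (t_actor_cycles * iterations_per_core)

# Illustrative values only: 100 MHz clock, 512x512 frame,
# an actor that takes 90 cycles and consumes 2 pixels per iteration.
single = frames_per_second(100e6, 90, 512 * 512, 2)
ideal_8way = frames_per_second(100e6, 90, 512 * 512, 2, n_cores=8)
print(f"single core: {single:.2f} fps, ideal 8-way: {ideal_8way:.2f} fps")
```

Halving \( t_{(actor)} \) or doubling the number of cores has the same effect on the ideal figure, which is why both options appear side by side in the list above.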

Figure 5.10 shows two actor-core mapping examples to elaborate both optimisations. The first example focuses on the pipelined one-to-one actor-core mapping of dataflow actors, as shown in Figure 5.10(a), where the individual actors $A - G$ are mapped onto separate IPPro cores. The actors are unbalanced and have different execution times, represented by $t_{(actor)}$. The inter-core communication architecture is used to exchange data among the cores. This example illustrates the pipelined mapping of dataflow actors using IPPro cores. It enables the dataflow optimisation of dividing a complex actor into multiple smaller actors to reduce the actor execution time.

Figure 5.10: Multiple IPPro cores as dataflow accelerators deploying dataflow optimisations (a) One-to-one actor-core mapping (b) 2-way SIMD mapping per actor.

The second example focuses on exploiting parallelism using vertical scaling of IPPro cores, as shown in Figure 5.10(b). An actor is replicated onto multiple IPPro cores to exploit data parallelism. The level of connectivity supported by the interconnect defines the exploitable degree of data parallelism. This issue will be further discussed in Chapter 6.

5.4.1 Configurable data distribution and collection architecture

To realise parallel computing paradigms, scatter-gather is used to exploit data and task level parallelism [116], [71]. It uses a static decomposition in which the data is divided into many equal-sized parts, each of which can be processed by a separate processing core, as shown in Figure 5.11. The research community has reported various image data distribution patterns driven by row, column and block-based static decomposition, resulting in row-strip, column-strip, row-cyclic, column-cyclic, block-wise and window-wise distributions [40], [117], [118]. In this thesis, the row-cyclic data distribution has been chosen because it allows pixels to be buffered in a pattern suitable for point and area operations after storing them in the line buffers. It also simplifies the process of reading pixels from the image buffer. The system-level architecture is composed of line buffers, a scatter module to distribute the buffered pixels, a gather module to collect the processed pixels and a finite state machine (FSM) to manage and synchronise these modules, as shown in Figure 5.12.
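A minimal software model of row-cyclic scatter and gather is sketched below. It assumes that row r of the image is assigned to core r mod n; the function names and the toy image are hypothetical, and the per-core "processing" is a simple point operation standing in for an IPPro actor.

```python
def scatter_row_cyclic(image_rows, n_cores):
    """Assign row r of the image to core r % n_cores."""
    per_core = [[] for _ in range(n_cores)]
    for r, row in enumerate(image_rows):
        per_core[r % n_cores].append(row)
    return per_core

def gather_row_cyclic(per_core, n_rows):
    """Collect processed rows back into their original order."""
    n_cores = len(per_core)
    return [per_core[r % n_cores][r // n_cores] for r in range(n_rows)]

image = [[10 * r + c for c in range(4)] for r in range(6)]   # 6 rows x 4 pixels
parts = scatter_row_cyclic(image, n_cores=3)
# Point operation applied independently by each core.
processed = [[[p + 1 for p in row] for row in rows] for rows in parts]
result = gather_row_cyclic(processed, n_rows=len(image))
print(result[0], result[5])
```

Gathering in the same cyclic order restores the original row sequence, so the host sees the processed image in raster order.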

Figure 5.11: Cyclic row-wise image/video pixel distribution.

The host processor uses control and data interfaces to configure, manage and distribute pixels through a programmable host application. The host sequentially feeds the pixels into the line buffers using the IN interface, as shown in Figure 5.12. The width of the line buffer is configurable by loading a suitable value into the LINE_WIDTH register using the AXI4-Lite interface, which makes the system infrastructure adaptable to various image sizes. As soon as the line buffers are full, the Scatter module starts feeding data to the cores by storing it in the input FIFOs. The cores begin to process data as soon as the actor firing rule is satisfied and push the processed data into the output FIFOs. The Gather module reads the processed data and feeds it back to the host processor using the OUT interface. Figure 5.12 also shows the Control interface, which the host processor uses to control the FSM presented in Figure 5.13; the relevant output control signals for each state are listed in Table 5.8. The FSM states are as follows:

- **RESET** resets the programmable logic, i.e. IPPro cores, multicore interconnect and data distribution and collection mechanisms.

- **CONFIGURE_SYSTEM** enables the system (SYS_EN) and assigns a user-defined value to the LINE_WIDTH register (as defined in the host application), which configures the line buffer, scatter and gather modules. The value stored in the LINE_WIDTH register specifies the number of pixels stored in each line buffer.

Figure 5.12: System level data distribution and control architecture.

- **IDLE** waits for the host program to dispatch data by asserting $FILL\_LINES$.

- **FILL\_BUFFERS** initiates the filling of the line buffers by asserting $start\_fill$ and waits until all line buffers are filled and $Finish\_fill$ is asserted.

- **SCATTER** asserts the $start\_scatter$ and IPPro core $en$ signals. The scatter module reads the line buffers and loads data into the input FIFOs. The cores process the data in parallel and store the processed data in their respective output FIFOs. The gather module compares the FIFO token count with the LINE\_WIDTH value and asserts the $DAvailable$ signal, which triggers the $GATHER$ state.

- **GATHER** asserts the $start\_read$ signal and starts reading the output FIFOs of each core. It controls the multiplexer by checking the FIFO token count against the defined LINE\_WIDTH value. The host reads the processed data via the OUT interface.
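The state sequence described above can be modelled as a small transition table. This is a sketch under stated assumptions rather than the FSM of Figure 5.13: the events `FILL_LINES`, `Finish_fill` and `DAvailable` come from the text, while `host_configure`, `configured` and `lines_read`, and the return from GATHER to IDLE, are hypothetical names and transitions introduced for illustration.

```python
# Transition table: (state, event) -> next state.
# "host_configure", "configured" and "lines_read" are hypothetical event
# names for steps that the text leaves implicit.
TRANSITIONS = {
    ("RESET", "host_configure"): "CONFIGURE_SYSTEM",
    ("CONFIGURE_SYSTEM", "configured"): "IDLE",
    ("IDLE", "FILL_LINES"): "FILL_BUFFERS",
    ("FILL_BUFFERS", "Finish_fill"): "SCATTER",
    ("SCATTER", "DAvailable"): "GATHER",
    ("GATHER", "lines_read"): "IDLE",   # assumed: wait for the next batch of lines
}

def run_fsm(events, state="RESET"):
    """Return the states visited; events with no matching transition are ignored."""
    visited = [state]
    for event in events:
        state = TRANSITIONS.get((state, event), state)
        visited.append(state)
    return visited

trace = run_fsm(["host_configure", "configured", "FILL_LINES",
                 "Finish_fill", "DAvailable", "lines_read"])
print(" -> ".join(trace))
```

A table-driven model like this makes it easy to check that every state has an exit and that an unexpected event leaves the state unchanged.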

The presented stream-based data distribution and collection architecture abstracts the low-level hardware implementation details from the user and simplifies the application development process by providing the underlying functionality via a control register. This approach provides task-level optimisations by pipelining multiple computing stages and localising data within the programmable logic before sending it back to the host processor, thus reducing the data transfer overhead.

Table 5.8: Output signals of FSM for each state.

<table>
<thead>
<tr>
<th>FSM output</th>
<th>RESET</th>
<th>CONFIGURE_SYSTEM</th>
<th>IDLE</th>
<th>FILL_BUFFERS</th>
<th>SCATTER</th>
<th>GATHER</th>
</tr>
</thead>
<tbody>
<tr>
<td>sel_LINE_W</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>start_fill</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
</tr>
<tr>
<td>start_scatter</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>start_read</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>coreN_rst</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>coreN_en</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>1</td>
</tr>
</tbody>
</table>

To evaluate IPPro as a dataflow accelerator and implement some of the discussed dataflow optimisations (data and task level parallelism), Section 5.5 presents the acceleration of k-means clustering.

5.5 Case Study: k-means clustering

Image segmentation is the process of partitioning an image into multiple segments. Three widely recognised methods are image thresholding, edge detection and clustering [119]. k-means is a clustering method, i.e. an unsupervised image segmentation method that classifies the image into a finite number of clusters. It has been chosen because of its simple control flow, data-dependent execution and inherent fine-grained parallelism, which make it suitable for FPGA-based hardware acceleration [120]. It involves two stages, *distance calculation* and *averaging*. The distance calculation is mathematically represented as:

\[ Y = \sum_{i=1}^{n} \sum_{j=1}^{k} \left( \lVert P_i - C_j \rVert \right)^2 \tag{5.3} \]

where \( \lVert P_i - C_j \rVert \) is the Euclidean distance between a data point (pixel) \( P_i \) and a centroid value \( C_j \), iterated over the \( n \) points in the cluster for all \( k \) clusters. Averaging is used to calculate the updated centroid values for the next iteration by finding the average of the clustered data (pixels) in each dimension. In this case study, images with a resolution of 512x512 have been clustered by accelerating both stages of the k-means algorithm. To explore different data and task parallelism and actor-core mapping possibilities, four IPPro hardware accelerator designs have been implemented. These designs cover single-core, dual-core, 8-way SIMD and dual 8-way SIMD-based IPPro acceleration architectures and allow the impact of exploiting data and task parallelism on area and performance to be evaluated. The system has been implemented on the Avnet Zedboard (XC7Z020CLG484-1), and the same \( k \)-means implementation has been realised on the desktop NVIDIA GTX980 GPU, the embedded ARM Mali-T628 GPU and the ARM Cortex-A7 CPU to compare the technologies.

Figure 5.14: Block diagram of implemented system architecture for case study.
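The two stages named above, distance calculation and averaging, can be sketched in plain Python as follows. This is a software illustration only, assuming scalar grey-level pixels and four clusters; the function names, the toy pixel values, the initial centroids and the fixed iteration count are all hypothetical and do not reflect the IPPro program code.

```python
def distance_calculation(pixels, centroids):
    """Label each pixel with the index of its nearest centroid (squared distance)."""
    return [min(range(len(centroids)), key=lambda j: (p - centroids[j]) ** 2)
            for p in pixels]

def averaging(pixels, labels, centroids):
    """Return the mean of each cluster; an empty cluster keeps its old centroid."""
    updated = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(pixels, labels) if label == j]
        updated.append(sum(members) / len(members) if members else old)
    return updated

pixels = [12, 15, 100, 110, 200, 210, 250, 14]   # toy grey-level values
centroids = [0.0, 85.0, 170.0, 255.0]            # four clusters
for _ in range(5):                               # fixed iteration count for the sketch
    labels = distance_calculation(pixels, centroids)
    centroids = averaging(pixels, labels, centroids)
print(labels, centroids)
```

Each pixel's label depends only on that pixel and the current centroids, which is the point-based property that makes the distance calculation stage a natural candidate for data-parallel mapping.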
5.5.1 MPSoC-based heterogeneous system architecture

The Xilinx Zynq MPSoC is composed of a host processor, known as the processing system (PS), and FPGA programmable logic (PL). The system architecture is used to accelerate the *distance calculation* and *averaging* stages using IPPro, as shown in Figure 5.14. The PS configures and controls the underlying architecture, while the PL is used to implement the image processing pipeline and the *IPPro hardware accelerator*. The AMBA-AXI bus transfers the data between PS and PL using the AXI-DMA protocol. The Xillybus IP core [121] is deployed as a bridge between PS and PL to feed data into the image processing pipeline. It provides an intuitive DMA-based end-to-end turnkey solution for transporting data between PL and PS while running the Linux Operating System (OS) on the ARM host processor, thus reducing the engineering and device driver development effort [121]. The *IPPro hardware accelerator* interacts with the Xillybus IP core via FIFOs. The Linux application running on the PS streams data between the FIFO and the file handler opened by the host application. The Xillybus-Lite interface allows the user-space program running on Linux to access the control registers that manage the underlying hardware architecture.

Figure 5.14 shows the implemented system architecture, which consists of the necessary control and data infrastructure. The data interfaces comprise stream interfaces (Xillybus-Send and Xillybus-Read), a uni-directional memory-mapped interface (Xillybus-Write) to program the IPPro cores, and Xillybus-Lite to manage the line buffer, scatter, gather, IPPro cores and the FSM. Xillybus Linux device drivers are used to access each of these data and control interfaces. An additional layer of C functions has been developed on top of the Xillybus device drivers to configure and manage the system architecture, program the IPPro cores and exchange pixels between PS and PL. Table 5.9 presents the developed C functions that the host application uses to program the IPPro cores and control the system architecture presented in Figure 5.14.

The Linux host application uses these C functions to feed image pixels into the line buffer module. These functions make it possible to control and manage the data distribution and collection architecture and to program the IPPro cores using the process discussed in Section 5.4.1.

### 5.5.2 IPPro hardware accelerator designs

The case study is implemented to explore the different acceleration possibilities of distance calculation and averaging. Therefore, both stages are first accelerated individually as independent dataflow actors using single and multiple IPPro cores, realised in designs 1 and 2 as shown in Figure 5.15. Later, both stages are accelerated together as pipelined dataflow actors using dual and multiple dual-IPPro cores, realised in designs 3 and 4. Figure 5.15 illustrates the block diagram of all four designs and their data and control interfaces. Each design is used as the IPPro hardware accelerator illustrated earlier in Figure 5.14 and incorporated into the presented IPPro-based heterogeneous system architecture. These designs are selected because they enable different acceleration paradigms, dataflow actor mapping possibilities and parallelism options, as listed in Table 5.10. Moreover, they allow the analysis of different algorithmic decompositions and their impact on the execution time and area utilisation.

Table 5.9: C functions used by the host application to program the IPPro cores and control the system architecture.

<table>
<thead>
<tr>
<th>C function</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>int open_system (void);</td>
<td>The host uses the <em>Linux User I/O</em> (UIO) interface to access the IPPro core as a device file through its memory map. This function is used to get the address of the PL hardware blocks from the OS.</td>
</tr>
<tr>
<td>int close_system (int fd);</td>
<td></td>
</tr>
<tr>
<td>int system_reset (int fd, int addr);</td>
<td>It sets the SYS_RST bit.</td>
</tr>
<tr>
<td>int system_enable (int fd, int addr);</td>
<td>It clears the SYS_RST bit and sets the SYS_EN bit.</td>
</tr>
<tr>
<td>int set_line_size (int fd, int addr, short int value);</td>
<td>It sets the size of the line buffer.</td>
</tr>
<tr>
<td>int fill_lines (int fd, int addr);</td>
<td>It sets the FILL_LINES bit.</td>
</tr>
<tr>
<td>int program_core (FILE *fp);</td>
<td>It programs the IPPro core by reading a .hex file using the AXI-MM interface.</td>
</tr>
<tr>
<td>int send_stream (short int *sdata, int len);</td>
<td>They are used to send/receive a stream of data between the host and the PL using the Xillybus-Send and Xillybus-Read interfaces.</td>
</tr>
<tr>
<td>int read_stream (short int *rdata, int len);</td>
<td></td>
</tr>
</tbody>
</table>

Figure 5.15: IPPro hardware accelerator designs to explore and analyse the impact of parallelism on area and performance. ① Single core IPPro, ② 8-way SIMD IPPro, ③ Dual core IPPro, ④ Dual core 8-way SIMD IPPro.

Table 5.10: Dataflow actor mapping and supported parallelism of the IPPro hardware accelerator designs presented in Figure 5.15.

<table>
<thead>
<tr>
<th>Design</th>
<th>Acceleration Paradigm</th>
<th>Dataflow mapping</th>
<th>Parallelism</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Single core IPPro</td>
<td>Single actor</td>
<td>No</td>
</tr>
<tr>
<td>2</td>
<td>8-way SIMD IPPro</td>
<td>Single actor</td>
<td>Yes</td>
</tr>
<tr>
<td>3</td>
<td>Dual core IPPro</td>
<td>Dual actor</td>
<td>No</td>
</tr>
<tr>
<td>4</td>
<td>Dual 8-way SIMD IPPro</td>
<td>Dual actor</td>
<td>Yes</td>
</tr>
</tbody>
</table>

5.5.3 Acceleration results

The presented IPPro hardware accelerator designs have been evaluated using different sample images for classification, because of the data-dependent characteristics of the clustering algorithm. Table 5.11 and Table 5.13 report the average execution time and fps numbers, while the area utilisation results are reported in Table 5.12.

Table 5.11 reports the results obtained by individually accelerating the stages of k-means clustering using designs 1 and 2. In each iteration, distance calculation takes two pixels and classifies them into one of the four clusters, which takes an average of 45 clock cycles/pixel. Classifying the whole image takes 118.2 ms, which corresponds to 8.45 fps. On the other hand, averaging takes four tokens and produces four new cluster values, which takes an average of 55 clock cycles/pixel and results in 145 ms or 6.88 fps. Both stages involve point-based pixel processing. Therefore, design 2 is developed and used to exploit data-level parallelism. As a result, the execution time is reduced to 23.32 ms and 27.02 ms for distance calculation and averaging respectively. This is an improvement of 5.06 and 5.37 times over design 1. It comes at the cost of 4.1, 2.3 and 8.0 times more BRAMs, LUTs and DSP blocks respectively, as reported in Table 5.12. The major contributor to the increased area utilisation is the data distribution and control infrastructure. Theoretically, the scaled-up design was expected to give an eight-fold increase in performance, which is not achieved in design 2 because the data transfer overhead involved in filling the line buffers, collecting the processed pixels and sending them back to the host is not negligible.

Table 5.11: Performance measurements for design 1 and 2 of Figure 5.15.

Table 5.12: FPGA area utilisation of various designs shown in Figure 5.15. The relative Zedboard area utilisation (%) is reported in parentheses.

<table>
<thead>
<tr>
<th>Design</th>
<th>FF</th>
<th>LUT</th>
<th>BRAM</th>
<th>DSP</th>
</tr>
</thead>
<tbody>
<tr>
<td>1 Single-core IPPro</td>
<td>5197 (4.89)</td>
<td>4736 (8.90)</td>
<td>4.5 (3.21)</td>
<td>1 (0.45)</td>
</tr>
<tr>
<td>2 8-way SIMD IPPro</td>
<td>12279 (11.54)</td>
<td>10941 (20.57)</td>
<td>18.5 (13.21)</td>
<td>8 (3.63)</td>
</tr>
<tr>
<td>3 Dual-core IPPro</td>
<td>5737 (5.19)</td>
<td>5215 (9.80)</td>
<td>7.5 (5.36)</td>
<td>2 (0.90)</td>
</tr>
<tr>
<td>4 Dual 8-way SIMD IPPro</td>
<td>16106 (15.14)</td>
<td>13864 (26.06)</td>
<td>34 (24.29)</td>
<td>16 (7.27)</td>
</tr>
</tbody>
</table>
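The quoted cycle counts and execution times can be cross-checked with a few lines of Python. The 100 MHz clock is an assumption taken from the Freq. column of Table 5.14; the cycle counts and measured times are those quoted in the text, and the function name is hypothetical.

```python
PIXELS = 512 * 512          # frame size used in the case study
F_CLK_HZ = 100e6            # assumed clock, taken from the Freq. column of Table 5.14

def exec_time_ms(cycles_per_pixel):
    """Estimated frame time in ms for a given average cycle count per pixel."""
    return cycles_per_pixel * PIXELS / F_CLK_HZ * 1e3

# Cycle counts and measured times as quoted in the text.
print(f"distance: {exec_time_ms(45):.1f} ms estimated, 118.2 ms reported")
print(f"averaging: {exec_time_ms(55):.1f} ms estimated, 145 ms reported")

# Speed-up of the 8-way SIMD design over the single core, against the ideal factor of 8.
for stage, t1, t2 in [("distance", 118.2, 23.32), ("averaging", 145.0, 27.02)]:
    print(f"{stage}: {t1 / t2:.2f}x of an ideal 8x")
```

The gap between the measured speed-up and the ideal factor of eight is the data transfer overhead discussed above.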

Table 5.13 reports the execution time and performance (fps) of both stages accelerated together to exploit task-level parallelism using designs 3 and 4. The reported results for designs 1 and 2 are obtained by combining the execution times of both stages previously reported in Table 5.11. Using design 3, task-level parallelism implemented via an *intermediate FIFO* results in an average of 63 clock cycles/pixel, which corresponds to 163 ms or 6 fps. By pipelining both actors, design 3 achieves 1.6 times better performance than design 1 at the cost of 1.6 and 2.0 times more BRAM and DSP blocks, using the same Xillybus IP infrastructure as design 1. The reason for the improvement is the localisation of intermediate data within the FPGA fabric using an *intermediate FIFO*, which hides the data transfer overhead to and from the host processor, as shown in Figure 5.15.

Table 5.13: Performance with task-level parallelism using designs in Figure 5.15.

Exploiting both task and data level parallelism using design 4 results in an average of 14 clock cycles/pixel and an execution time of 35.9 ms, or 28 fps. This is 1.4, 4.5 and 7.3 times better than designs 2, 3 and 1 respectively. For comparison, both stages are coded in C and executed on an embedded ARM Cortex-A7 processor, which achieved an execution time of 286 ms and 3.49 fps, 8 times slower than the performance achieved by design 4.
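The speed-up factors quoted above follow directly from the reported execution times, as the short Python check below shows. It assumes, as stated in the text, that the figures for designs 1 and 2 are the sums of their two stage times; the variable and function names are hypothetical.

```python
# Execution times in ms, as quoted in the text.
design1 = 118.2 + 145.0     # single core: both stages run one after the other
design2 = 23.32 + 27.02     # 8-way SIMD: both stages run one after the other
design3 = 163.0             # dual core, pipelined
design4 = 35.9              # dual 8-way SIMD, pipelined
cortex_a7 = 286.0           # C implementation on the ARM Cortex-A7

def speedup(baseline_ms, improved_ms):
    """Ratio of baseline to improved execution time."""
    return baseline_ms / improved_ms

for name, t in [("design 2", design2), ("design 3", design3),
                ("design 1", design1), ("Cortex-A7", cortex_a7)]:
    print(f"design 4 vs {name}: {speedup(t, design4):.1f}x")
print(f"design 4: {1000 / design4:.1f} fps")
```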

5.5.4 Comparison against GPU implementations

This section presents the details of the adopted power measurement methods and compares the IPPro-based implementation to the equivalent *k*-means GPU implementations. The IPPro power measurements are obtained by running a post-implementation timing simulation. A *Switching Activity Interchange Format* (SAIF) file is used to record the switching activity of the data and control signals of each presented IPPro design. The Xilinx Power Estimator (XPE) takes the SAIF file and reports the power consumption. At *Queen's University Belfast* (QUB), Minhas, a research student working on big data computing, has coded an equivalent version of *k*-means in CUDA and OpenCL, which is implemented and profiled on the nVIDIA GeForce GTX980 and the ODROID-XU3, due to the in-house availability of both GPU platforms.

The nVIDIA desktop GPU card supports 2048 CUDA cores running at a base frequency of 1126 MHz. OpenCL and CUDA have been used for programming the GPU, and both stages are merged into a single kernel. For performance measurement, OpenCL’s profiling function `clGetEventProfilingInfo` is used, which reports the kernel start and end timestamps in nanoseconds, from which the execution time of the kernel is obtained. The power consumption during kernel execution was logged using the nVIDIA System Management Interface (nvidia-smi), which allows the power consumed by the GPU and the host processor to be measured separately. It is a command line utility, built on top of the nVIDIA Management Library (NVML), intended to aid the management and monitoring of nVIDIA GPUs.

To set the baseline figures and allow a fair comparison of the FPGA against the GPU technology, embedded CPU (ARM Cortex-A7) and embedded GPU (ARM Mali-T628) implementations have been carried out on the ODROID-XU3 platform. This is a heterogeneous multi-processing platform that hosts the 28nm Samsung Exynos 5422 application processor, which has on-chip ARM Cortex-A7 CPUs and an ARM Mali-T628 embedded GPU. The platform is suitable for power-constrained application use cases, where the ARM Cortex-A7 CPU and the mid-range ARM Mali-T628 GPU run at 1.2 GHz and 600 MHz respectively. The platform has separate current sensors to measure the power consumption of the ARM Cortex-A7 and the ARM Mali-T628, thus allowing component-level power measurement.

Table 5.14 shows the results of the IPPro-based accelerator designs running on the Zedboard, where the combined data and task parallel implementation achieved 4.6 times better performance than the task-only implementation at the cost of 1.57 times higher power consumption. Table 5.15 shows the performance results of the $k$-means implementation on the Kintex-7 FPGA and compares them against the equivalent embedded CPU (ARM Cortex-A7), embedded GPU (ARM Mali-T628) and desktop GPU (nVIDIA GeForce GTX980) implementations. The presented embedded CPU results have been considered as the baseline figures for the comparison.
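The power figures for the two Zedboard designs can be related to each other with a small Python check. The total power and fps values are those listed in Table 5.14; the fps/W values printed here are derived from those rounded figures for illustration and are not quoted from the thesis. The function and variable names are hypothetical.

```python
# (total power in mW, fps) for the two Zedboard designs, as listed in Table 5.14.
designs = {
    "Dual-core IPPro": (136, 6),
    "Dual 8-way SIMD IPPro": (214, 28),
}

def fps_per_watt(power_mw, fps):
    """Power efficiency in frames per second per watt."""
    return fps / (power_mw / 1000.0)

for name, (power_mw, fps) in designs.items():
    print(f"{name}: {fps_per_watt(power_mw, fps):.1f} fps/W")

power_ratio = designs["Dual 8-way SIMD IPPro"][0] / designs["Dual-core IPPro"][0]
print(f"power ratio: {power_ratio:.2f}x")
```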
Table 5.14: Power, resource and combined efficiency comparisons of IPPro-based $k$-means implementations on Zedboard.

<table>
<thead>
<tr>
<th rowspan="2">Implementation</th>
<th colspan="3">Power (mW)</th>
<th rowspan="2">Freq. (MHz)</th>
<th rowspan="2">Exec. (ms)</th>
<th rowspan="2">fps</th>
<th rowspan="2">Power efficiency (fps/W)</th>
<th rowspan="2">Approx. transistor utilised (TU) (x10$^{-9}$)</th>
<th rowspan="2">Resource efficiency (fps/TU) (x10$^{-8}$)</th>
<th rowspan="2">Combined efficiency (fps/W/TU) (x10$^{-9}$)</th>
</tr>
<tr>
<th>Static</th>
<th>Dynamic</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<td>Dual-core IPPro</td>
<td>118</td>
<td>18</td>
<td>136</td>
<td>100</td>
<td>163.2</td>
<td>6</td>
<td></td>
<td>591 (9%)</td>
<td>1.0</td>
<td></td>
</tr>
<tr>
<td>Dual 8-way SIMD IPPro</td>
<td>122</td>
<td>92</td>
<td>214</td>
<td>100</td>
<td>35.9</td>
<td>28</td>
<td></td>
<td>1564 (23%)</td>
<td>1.8</td>
<td></td>
</tr>
</tbody>
</table>

Table 5.15: Power, resource and combined efficiency comparisons for $k$-means using Xilinx Zynq XC7Z045 Kintex-7 FPGA and GPU NVIDIA GTX980.

<table>
<thead>
<tr>
<th>Platform</th>
<th>Implementation</th>
<th>Power (mW)</th>
<th>Freq. (MHz)</th>
<th>Exec. (ms)</th>
<th>fps</th>
<th>Power efficiency (fps/W)</th>
<th>Approx. transistor utilised (TU) (x10$^{-9}$)</th>
<th>Resource efficiency (fps/TU) (x10$^{-8}$)</th>
<th>Combined efficiency (fps/W/TU) (x10$^{-9}$)</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td>Static</td>
<td>Dynamic</td>
<td>Total</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>FPGA</td>
<td>Dual-core IPPro</td>
<td>158</td>
<td>26</td>
<td>184</td>
<td>337</td>
<td>48.43</td>
<td>114.1</td>
<td>3.6</td>
<td>193.1</td>
</tr>
<tr>
<td></td>
<td></td>
<td>160</td>
<td>153</td>
<td>313</td>
<td>337</td>
<td>10.65</td>
<td>300.3</td>
<td>6.0</td>
<td>192.0</td>
</tr>
<tr>
<td></td>
<td>Dual 8-way SIMD IPPro</td>
<td>160</td>
<td>37000</td>
<td>60000</td>
<td>1127</td>
<td>1.19</td>
<td>840</td>
<td>13.1</td>
<td>63.1</td>
</tr>
<tr>
<td></td>
<td></td>
<td>1560</td>
<td>20000</td>
<td>59000</td>
<td>1127</td>
<td>1.58</td>
<td>632</td>
<td>10.7</td>
<td>51.5</td>
</tr>
<tr>
<td></td>
<td>OpenCL</td>
<td>37000</td>
<td>1127</td>
<td>632</td>
<td>1127</td>
<td>1.58</td>
<td>1331</td>
<td>1227</td>
<td>9.8</td>
</tr>
<tr>
<td></td>
<td>CUDA</td>
<td>37000</td>
<td>12700</td>
<td>632</td>
<td>1127</td>
<td>1.58</td>
<td>1227</td>
<td>1227</td>
<td>8.7</td>
</tr>
<tr>
<td>GPU</td>
<td>ARM Mali-T628</td>
<td>120</td>
<td>-</td>
<td>1500</td>
<td>600</td>
<td>3.69</td>
<td>173</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td></td>
<td>ARM Cortex-A7</td>
<td>250</td>
<td>-</td>
<td>670</td>
<td>1200</td>
<td>3.49</td>
<td>5.2</td>
<td>-</td>
<td>-</td>
</tr>
</tbody>
</table>

Comparing the performance results (fps), both FPGA implementations achieved 6 and 27 times better performance than the embedded CPU, while the embedded GPU delivered 6.7 times better performance than the FPGA by exploiting parallelism and a higher operating frequency. Focusing on the power consumption results, the FPGA consumed 2.1 and 4.9 times less power than the embedded CPU and the embedded GPU respectively. This shows that the FPGA technology delivers a power-optimised solution, while the GPU technology provides a performance-optimised solution. However, considering performance and power together, the power efficiency (fps/W) numbers show that the FPGA and embedded GPU implementations are 57 and 33 times more power efficient than the embedded CPU. These results show that the FPGA implementation is 24 times more power efficient than the embedded GPU. Moreover, this power efficiency edge can be further improved by applying dataflow transformations and increasing the number of IPPro cores.

Table 5.15 also compares the FPGA results against the desktop GPU and reports resource efficiency as a metric, due to the significant difference in the power consumption numbers. The resource efficiency is presented in terms of frames-per-second-per-Transistor-Utilisation (fps/TU), which is 6 and 63 for the 28nm FPGA and GPU technologies respectively. For the embedded CPU and GPU, these results are not reported because transistor counts are not available from ARM. The reported resource efficiency results show that the GPU utilises area resources more efficiently than the FPGA when power is kept out of the equation. Combining all three metrics (fps/W/TU) shows that the advantage gained from the FPGA designs is significant: they are 22 times more efficient than the GPU. This advantage becomes more valuable considering that the presented FPGA-based MPSoC design is adaptable and allows the exploration, profiling and implementation of different dataflow transformation possibilities, unlike dedicated FPGA approaches, to accelerate image processing applications where energy is vital.

5.6 Summary

This chapter presented IPPro as a programmable dataflow accelerator that supports the dataflow MoC. The presented IPPro architecture implements multi-port static dataflow actors, supported by the notion of actor firing and execution patterns using the producer-consumer computing model. These execution patterns provide flexible mapping options to the user and the software framework to explore and deploy dataflow optimisations. The input and output FIFOs of IPPro are realised and implemented using BRAM, DistRAM and SR. The IPPro implementation using the DistRAM-based FIFO achieved a clock frequency of $\approx 242$ MHz, utilising $\approx 8\%$ more LUTs than the BRAM-based implementation. On the other hand, a degradation of 3\% and 19\% in operating frequency has been observed for the SR and BRAM-based FIFO implementations respectively.

To use IPPro as the basic dataflow computation unit in a heterogeneous MPSoC-based system architecture, communication between the accelerator and the host is required. Four multiple-IPPro-core designs have been implemented to evaluate the impact of the host-accelerator and inter-core communication mechanisms on area utilisation. These designs cover different levels of connectivity between producer and consumer cores, as well as static and dynamic handling of inter-core connectivity managed either by the host or by the core itself. The last design, D, offers the desired functionalities with reduced management overhead at a maximum cost of 2.88 and 1.54 times more FFs and LUTs.

To deploy dataflow optimisations (decomposition, mapping and scheduling) using multiple IPPro cores, different actor-core mapping possibilities supported by inter-core communication are discussed. A configurable stream-based data distribution and collection system architecture has been proposed to deploy and realise the selected optimisations. The architecture abstracts the low-level hardware implementation details from the user and simplifies the application development process by providing the underlying functionality via C-APIs. The design facilitates the exploitation of the discussed dataflow optimisations, including multi-stage pipelined and parallel computing models (split, compute and merge).

To evaluate the proposed architecture deploying dataflow optimisations, the distance calculation and averaging stages have been implemented on the Avnet Zedboard. Four IPPro hardware accelerator designs have been realised that cover single-actor, dual-actor, data and task level parallelism. The obtained results show that by exploiting both data and task level parallelism, it is possible to achieve 7.3 times better performance than the single-core design and 4.5 times better performance than task parallelism alone. Comparing against other technologies, the FPGA achieved 27 times better performance than the embedded CPU by exploiting parallelism and consumes 4.9 times less power than the embedded GPU. Moreover, the power efficiency (fps/W) numbers show that the FPGA implementation is 57 and 24 times more power efficient than the embedded CPU and GPU respectively.
Chapter 6

FPGA-based programmable hardware acceleration platform

6.1 Introduction

Many-core and multicore hardware accelerators have been used in data-intensive computing systems [22], [23]. Despite the efficiency of heterogeneous systems, designers and system architects face challenges in quickly implementing tailored applications on FPGA-based platforms to meet design goals [12], [43], [17]. One of the shortfalls of these parallel architectures is the scarcity of hardware abstraction, which makes it difficult for application designers to use the available FPGA compute resources efficiently [17], [122]. Maximising the efficiency and reusability of the available parallel architecture requires a level of hardware knowledge that software and application developers lack, as it involves handling low-level core, inter-core and system communication, system interfaces, etc. This problem is approached by designing an FPGA-based multicore processor using an IPPro core that allows an efficient, high-performance, fine and coarse-grained mapping and execution of dataflow actors. This multicore processor extends the flexibility provided by the IPPro core and allows both pipelined and parallel execution of dataflow actors to realise *programmable streaming networks*. Multiple instances of these multicore processors can be cascaded together to achieve an *FPGA-based programmable hardware acceleration platform*. This platform facilitates the exploration, profiling and acceleration of image processing applications for software and algorithm developers using a software-centric *edit-compile-run* flow, avoiding the *synthesis and place-and-route* design flow. In addition, it supports the implementation of parallel computing skeletons that provide a higher programming abstraction of parallel structures, which can be efficiently realised on the underlying architecture to implement parallel applications. The resulting platform allows software-controlled adaptable execution of parallel skeletons by abstracting the underlying hardware architecture from the developer, which gives better granularity to the application programmer realising parallel applications using FPGA technology. The following are the main contributions:

- Creation of an optimised IPPro core architecture which supports *message passing* and *shared* data models to process uniform and non-uniform distributed data. These data models enable realisation of *split, compute and merge, pipeline and farm* parallel skeletons.

- A novel multicore IPPro architecture that supports dynamic routing of data streams among cores, exploiting parallelism using horizontal and vertical scaling. It includes one-to-many, many-to-one and many-to-many producer-consumer data passing patterns for flexible actor-core mapping possibilities.

- A software configurable data distribution and collection architecture to realise parallel implementation on heterogeneous architecture. It handles different image resolutions, provides flexible control on data stream generation and distribution and can be integrated in direct and buffered video pipelines.

- Software abstraction of the proposed programmable platform and its hardware supported features to realise software driven parallel implementations.

This chapter presents IPPro core-level optimisations in Section 6.3. It covers incorporation of data and control mechanisms required to implement parallel skeletons, hardware-optimised implementation of dataflow actor firing rule to minimise control overhead, and implementation results of the optimised IPPro core architecture. Section 6.4 presents the multicore IPPro architecture focusing on the identification of multicore architectural features, exploration of a suitable stream-based multicore interconnect design and their impact on performance and core utilisation. Section 6.5 presents the FPGA-based programmable hardware acceleration platform with focus on dynamic data distribution and collection requirements for parallel implementations. Section 6.6 discusses the performance results of the chosen image processing functions exploiting data/task parallelism, and heterogeneous computing to evaluate the flexibility of the platform. Each IPPro acceleration result is compared against the equivalent optimised ARM implementation.
6.2 Programmable realisation of parallel skeletons on FPGAs

Parallel skeletons are pre-defined generic components derived from higher-order functions which can be parametrised in sequential problem-specific code and can be efficiently implemented on hardware architectures [116], [123]. In this research, a data-driven producer-consumer computing paradigm has been adopted which can be used to exploit data and task parallelism. Therefore, the underlying architecture must support the desired data exchange and synchronisation mechanisms and the functional requirements of skeletons. For this purpose, Figure 6.1 presents three-layer programming (actor, parallel actors and parallel skeletons) and hardware (IPPro core, Multicore IPPro, System Infrastructure) abstraction.

From the bottom up, a *programmable streaming unit* supports the functional requirements of a dataflow actor, and a *programmable streaming network* supports dataflow-driven data exchange patterns across multiple actors to enable flexible mapping possibilities to implement *parallel actors*. The top layer allows parametric implementation of a *parallel skeleton* by supporting the stream and non-stream data access and control mechanisms that are necessary to exploit parallelism.

Figure 6.1 shows the *hardware abstraction* used to realise the *programmable hardware acceleration platform*. The IPPro core is used to implement a programmable dataflow actor, while the multicore IPPro gives algorithm exploration possibilities using different actor-core mappings of multiple actors. The *system infrastructure* provides the necessary software configurable data distribution and collection mechanisms to support the control and data requirements of parallel skeletons.

Figure 6.1: Software and hardware abstraction of the platform.

To this end, the IPPro core already supports some architectural features as presented in Chapters 4 and 5. The architectural features which are required and not supported by the existing IPPro datapath are highlighted in Figure 6.1. Section 6.3 presents IPPro core optimisations focusing on the data and control mechanisms needed to implement parallel skeletons and the hardware-optimised implementation of the dataflow actor firing rule.

### 6.3 IPPro core architectural optimisations

The existing IPPro datapath supports a message-passing data communication model, which is only suitable for stream and uniformly distributed data processing to realise *split*, *compute*, *merge* and *pipeline* skeletons. On the other hand, the farm skeleton requires access to non-uniformly distributed data, which needs a data memory. This memory serves as a data exchange path between the master (host) and the worker (IPPro), as abstracted in Figure 6.1. It facilitates the implementation of global functions (subject to the size of the data memory) using IPPro cores. Based on the functional requirements, the following optimisations have been identified:

1. Optimisation of dataflow actor firing rule minimising control overhead to implement multiple-consumer and multiple-producer dataflow actors (Section 6.3.1).

2. IPPro scratchpad memory to exchange data between IPPro and the host processor to realise a farm skeleton (Section 6.3.2).

3. IPPro core interfaces compliance with industry standard MPSoC communication protocols for easy integration and portability within SoC and other systems as IP (Section 6.3.3).

6.3.1 Dataflow actor firing rule optimisation

Chapter 5 presented a programmable software solution for handling the actor firing rule, which is not suitable for multi-port actors (MPMC and MPSC). It adds an execution overhead directly proportional to the number of producer nodes, as shown in Listing 6.1. During code execution, the core iteratively checks the firing rule dedicated to each producer node using branch instructions. As a result, the actor's execution time depends on the number of producer nodes.

A hardware actor firing module has been designed and integrated into the IPPro datapath to reduce execution overhead, as shown in Figure 6.2. The control interface allows configuration of eight set-token-count registers (STC_Q0 - STC_Q7) that are used to check the number of tokens available from each producer. From the multicore architecture perspective, this allows actor mapping opportunities in which up to eight producers feed an actor by storing tokens into their appropriate FIFO queues. In addition, an actor firing mask register (AFMR) holds the information about the number of producer nodes connected to the actor, while the values stored in the (STC_Q0 - STC_Q7) registers define the number of tokens expected from each producer.

Listing 6.1: IPPro code of un-optimised actor firing rule.

```plaintext
; Check if the expected number of tokens (1, 2, 1) in FIFO queues coming from
; source nodes (0, 1, 2) are available? If yes, fire the actor
# Store expected number of tokens from each producer node
STR R1, 1;
STR R2, 2;
STR R3, 1;
...
# Check producer#1 rule
CHECK_RULE1:
TEST R20, R1, #0
BNZ CHECK_RULE1
...
# Check producer#2 rule
CHECK_RULE2:
TEST R20, R2, #1
BNZ CHECK_RULE2
...
# Check producer#3 rule
CHECK_RULE3:
TEST R20, R3, #0
BNZ CHECK_RULE3
...
ACTOR_FIRED:
...
```

These registers are initialised by the host. During actor execution, the actor firing module concurrently reads the token counts of the input FIFO queues, compares them against (STC_Q0 - STC_Q7), masks the result with AFMR, and updates the firing status register (FSR), as shown in Figure 6.2. This allows software integration of an actor firing rule into the IPPro code using the TEST instruction. Individual bits of the FSR show the availability of the expected number of tokens from each producer. The IPPro code compares the value of the FSR against the ACTOR_FIRING_MASK that it defines, as shown in Listing 6.2.

Figure 6.2: Block diagram of hardware dataflow actor firing module.
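The behaviour of the firing module can be illustrated with a small behavioural model. This is a minimal sketch under stated assumptions: the function names are hypothetical, an FSR bit is assumed to be set when the corresponding queue holds at least the expected number of tokens, and the actor is assumed to fire when every bit selected by the firing mask is set. The polarity actually used by the TEST instruction is not specified in the text.

```python
def firing_status(token_counts, stc, afmr):
    """Model of the FSR update (assumed polarity: 1 = enough tokens).

    token_counts: tokens currently held in input FIFO queues 0..7
    stc:          expected token counts, as held in STC_Q0..STC_Q7
    afmr:         8-bit mask, bit q set when a producer feeds queue q
    """
    fsr = 0
    for q in range(8):
        connected = (afmr >> q) & 1
        if connected and token_counts[q] >= stc[q]:
            fsr |= 1 << q
    return fsr


def actor_can_fire(fsr, firing_mask):
    """The actor fires when all producers selected by the mask are ready."""
    return (fsr & firing_mask) == firing_mask


# Three producers expecting (1, 2, 1) tokens, as in Listing 6.1.
stc = [1, 2, 1, 0, 0, 0, 0, 0]
ready = firing_status([1, 2, 1, 0, 0, 0, 0, 0], stc, afmr=0b111)
waiting = firing_status([1, 1, 1, 0, 0, 0, 0, 0], stc, afmr=0b111)
```

Because all eight comparisons are evaluated together, the check costs the same regardless of how many producers are connected, which is the property the hardware module is designed to provide.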

Once an actor has fired, execution of *GET Rx, CHANNEL#* reads the token from the addressed FIFO queue and stores it in the addressed location of the register file. Similarly, *PUSH Rx, CHANNEL#* reads a token from the register file and forwards it to the output FIFO. The *output FIFO controller* shown in Figure 6.2 encodes the SRC_ID and DEST_ID tags required to re-order and route tokens to the different consumer nodes. SRC_ID and DEST_ID specify the source node (producer) and the destination node (consumer) of the token.

Comparing the execution time of the IPPro code in Listings 6.1 and 6.2 shows that the optimised implementation takes a fixed number of clock cycles, which is independent of the number of producer nodes. The proposed hardware actor firing module enables programmable implementation of both fixed and multi-rate actor firing rules using IPPro by merely changing the program code. In contrast, high-level synthesis approaches generate a fixed architecture [25], [43], [124] that needs design recompilation, synthesis and place-and-route to deploy small changes such as a modified actor firing rule.

Listing 6.2: IPPro code of optimised actor firing rule.

```
; Check if the expected number of tokens (set by the host in STC_Qx) from
; source nodes 0, 1, 2, 3, 4, 5 are available in the respective FIFO queues.
; If yes, fire the actor
CHECK_FIRING_RULE:
    STR R15, #0000_0000_0011_1111 ; Set ACTOR_FIRING_MASK
    TEST R30, R15 ; Check FSR
    BNZ CHECK_FIRING_RULE

ACTOR_FIRING:
    GET R10,#0 ; Read token from node # 0
    PUSH R20,#1 ; Send token to node # 1

JMP CHECK_FIRING_RULE
```

6.3.2 Scratchpad memory to access non-streaming data

Section 6.3 outlined the importance of data memory in the IPPro datapath, providing a path between the host processor and the IPPro core to implement the farm parallel computing skeleton. For this purpose, a scratchpad memory of size 512x16 bits, configured as true dual-port RAM, has been added to the IPPro datapath. This design choice has been made to utilise the BRAM resources efficiently, as the 512x16-bit size maps well onto an 18 Kb BRAM block (half of the BRAM). The other half of the BRAM is used for the instruction memory. One of the ports is connected to the host processor via an AXI4 interface, and the other to the datapath using a native interface, as shown in Figure 6.3.
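The sizing argument can be checked with a few lines of arithmetic. This is a sketch that reads the block size as 18 kilobits including parity bits (an assumption based on how Xilinx 7-series block RAM is organised, with 16 data bits and 2 parity bits per 18-bit word); the variable names are illustrative only.

```python
SCRATCHPAD_WORDS = 512
SCRATCHPAD_WORD_BITS = 16
BRAM18_BITS = 18 * 1024          # assumed: 18 Kb block, parity included
BRAM_WORD_BITS_WITH_PARITY = 18  # assumed: 16 data bits + 2 parity bits

scratchpad_bits = SCRATCHPAD_WORDS * SCRATCHPAD_WORD_BITS
half_block_bits = BRAM18_BITS // 2
# Number of 18-bit words that fit in half a block
half_block_words = half_block_bits // BRAM_WORD_BITS_WITH_PARITY
```

Under these assumptions, half a block holds 512 words, so the 512x16 scratchpad occupies exactly one half and leaves the other half free for the instruction memory.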

Figure 6.3: Data processing paths of the IPPro using scratchpad.

To maintain a balance among functionality, area and timing, IPPro supports six additional instructions to access the scratchpad memory, as listed in Table 6.1. These instructions allow reading and writing data in the scratchpad memory and return the assigned task status to the host processor. *Direct* (LDSP, STSP) and *indirect* (LDSPI, STSPI) addressing modes facilitate iterative access to memory locations using loops and offsets, as commonly practised by software programmers. Listing 6.3 shows example code using direct and indirect addressing modes to access the scratchpad memory.

<table>
<thead>
<tr>
<th>IPPro Instruction</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>TASK_FINISHED</td>
<td>Inform host that task is completed</td>
</tr>
<tr>
<td>SP_VALID</td>
<td>Inform host that scratchpad is valid</td>
</tr>
<tr>
<td>LDSP, STSP</td>
<td>Load/store data to/from directly addressed location</td>
</tr>
<tr>
<td>LDSPI, STSPI</td>
<td>Load/store data to/from indirectly addressed location</td>
</tr>
</tbody>
</table>

This optimisation has also improved the data processing capabilities of the IPPro core by allowing stream and non-streamed data to be processed simultaneously using four supported data execution paths, as highlighted in Figure 6.3. These data execution paths facilitate flexible dataflow actor to IPPro core decomposition and mapping options. The design ensures that the stream execution path has minimal data transfer overhead compared to the non-stream execution path. This is because FIFO-based transfers exploit pipelining, unlike memory-based transfers via a host processor, where cache coherency latencies can be significant.

Listing 6.3: Code demonstrating direct and in-direct access to the scratchpad.

```
# Indirect access to scratchpad memory using loop
INIT:
    STR R31,#1 ; Loop initial value / indirect address pointer
    STR R1, #1 ; Loop increment constant value
    STR R20, #10 ; Loop terminate count value
...
LOOP:
    LDSPI R21, R31 ; R21 <= SP[R31]
    ADD R31, R31, R1 ; Increment loop count
    SUB R22, R31, R20 ; Check whether Loop condition
    BNZ LOOP
...
# IPPro core as a hardware accelerator (farm worker)
# c = FUNC(a * b)
# It is pre-defined that SP(1) = a ; SP(2) = b ; SP(3) = c
FUNC:
    LDSP R1, #1 ; Load a
    LDSP R2, #2 ; Load b
...
    MUL R3, R1, R2
    STSP R3, #3 ; Store c
    SP_VALID
...
JMP FUNC
```
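The farm-worker exchange in the second half of Listing 6.3 can be mirrored by a host-side behavioural model. This is a minimal sketch, not the thesis C-API: the class and function names are hypothetical, the scratchpad addresses follow the operands used by the LDSP/STSP instructions in the listing, and truncating results to 16 bits is an assumption based on the 512x16 scratchpad size.

```python
class Scratchpad:
    """Behavioural model of the 512x16 scratchpad shared by host and IPPro."""
    DEPTH = 512
    WORD_MASK = 0xFFFF  # assumed: values are truncated to 16 bits

    def __init__(self):
        self.mem = [0] * self.DEPTH
        self.valid = False  # models the status raised by SP_VALID

    def write(self, addr, value):
        self.mem[addr] = value & self.WORD_MASK

    def read(self, addr):
        return self.mem[addr]


def worker_func(sp):
    """Worker side: mirrors LDSP #1, LDSP #2, MUL, STSP #3, SP_VALID."""
    a = sp.read(1)
    b = sp.read(2)
    sp.write(3, a * b)
    sp.valid = True


def host_offload(sp, a, b):
    """Host side: place operands, run the worker, then collect the result."""
    sp.valid = False
    sp.write(1, a)
    sp.write(2, b)
    worker_func(sp)  # in hardware the worker runs concurrently on the IPPro
    if not sp.valid:
        raise RuntimeError("scratchpad result not valid")
    return sp.read(3)


sp = Scratchpad()
result = host_offload(sp, 6, 7)
```

In the real system the host would poll or be notified of the scratchpad status rather than call the worker directly; the direct call here only keeps the sketch self-contained.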

### 6.3.3 Host management of IPPro core using AMBA-AXI4

A vital aspect of any SoC solution is not only the hardware components it houses, but also the way these components are connected. The ARM Advanced Microcontroller Bus Architecture (AMBA) is an open-standard on-chip interconnect specification. Most leading SoC chips support the fourth generation, AMBA-AXI4. In these systems, a host processor configures, manages and in some cases

The AMBA-AXI4 specification supports three protocols: 1) AXI4-Lite to provide register-based control mechanisms, 2) AXI4-Stream to feed a stream of data, and 3) AXI4 memory-mapped to exchange random access data between the host and the underlying architecture.

IPPro supports all three AMBA-AXI4 protocols. The AXI4-Lite interface is used to configure the actor firing module, SRC_ID decoder and DEST_ID encoder using nine AXI4-Lite IPPro registers (for details see Appendix B, Table B.3). Two AXI4 memory-mapped interfaces allow the host processor to program the instruction memory and access the scratchpad memory. Two AXI4-Stream interfaces allow a data stream to be sent into and received from the core; the other endpoint can be either the host processor via direct memory transfer or the system architecture. AXI4 Slave and Master wrapper modules are added to the IPPro datapath, as shown in Figure 6.4, to convert native FIFO handshaking to the AXI4-Stream interface. They use the native EMPTY and FULL handshake signals to generate the respective AXI4 master and slave handshake signals (TREADY and TVALID). The modules also handle separation of the data payload (TDATA) and routing tags (TDEST), and the generation of read and write control signals for the native DIN and DOUT ports. C-APIs have been developed to abstract control and management of the core (for details see Appendix B).

Table 6.2: Implementation results of the optimised IPPro on Kintex-7 fabric.

<table>
<thead>
<tr>
<th>Resources</th>
<th>Initial IPPro</th>
<th>Optimised IPPro</th>
</tr>
</thead>
<tbody>
<tr>
<td>Flip Flops</td>
<td>447</td>
<td>884</td>
</tr>
<tr>
<td>LUTs</td>
<td>484</td>
<td>755</td>
</tr>
<tr>
<td>BRAMs</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>DSP48E1</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>Freq. (MHz)</td>
<td>337</td>
<td>300</td>
</tr>
</tbody>
</table>
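The conversion between native FIFO flags and the AXI4-Stream handshake performed by the wrapper modules of Section 6.3.3 can be sketched as a one-cycle behavioural model. The mapping used here (TVALID asserted when the source FIFO is not EMPTY, TREADY asserted when the destination FIFO is not FULL) is an assumption about how the wrappers derive the signals; the rule that a transfer happens only when TVALID and TREADY are both high is standard AXI4-Stream behaviour. Class and function names are hypothetical.

```python
from collections import deque


class Fifo:
    """Native FIFO exposing EMPTY and FULL flags."""

    def __init__(self, depth):
        self.depth = depth
        self.items = deque()

    @property
    def empty(self):
        return len(self.items) == 0

    @property
    def full(self):
        return len(self.items) >= self.depth


def stream_cycle(src, dst):
    """Simulate one clock cycle of an AXI4-Stream link between two FIFOs."""
    tvalid = not src.empty   # master wrapper: data available to send
    tready = not dst.full    # slave wrapper: space available to receive
    if tvalid and tready:
        dst.items.append(src.items.popleft())
        return True          # a token was transferred this cycle
    return False


src = Fifo(depth=4)
src.items.extend([1, 2, 3])
dst = Fifo(depth=2)
transfers = [stream_cycle(src, dst) for _ in range(3)]
```

The third cycle stalls because the destination is full, which shows how back-pressure propagates through the FULL flag without any token being lost.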

6.3.4 Implementation results of optimised IPPro core

The optimised IPPro datapath is synthesised and implemented using the Xilinx Vivado v2016.4 design suite. Table 6.2 summarises the results and compares them against the initial IPPro core indicated in Table 1.12. The critical path has increased by approx. 11%, resulting in an operating frequency of 300 MHz. This reduction is the price paid for a fixed actor firing execution time (Section 6.3.1) and the capability to compute on both stream and non-streamed data (Section 6.3.2). The optimised datapath consumes 1.9 and 1.5 times more FFs and LUTs respectively, while the BRAM/DSP ratio remains constant. Generally, an FPGA fabric has twice as many FFs as LUTs; therefore, the maximum number of cores that can be populated on the chip will be affected by the FF/LUT ratio. Regarding mapping possibilities, an actor can have up to eight producer nodes, which is reflected in the reported LUT utilisation. This increase in LUT utilisation is caused by the eight 16x32 FIFO queues used to re-order data tokens received from multiple producers. Similarly, the increase in FF utilisation is due to the FIFO count registers used by the hardware actor firing module and the AXI4-Lite registers, which were absent in the initial IPPro. However, the area utilisation represents < 1% of
Table 6.3: Comparison of IPPro against other FPGA-based soft-core processors.

<table>
<thead>
<tr>
<th>Resource</th>
<th>IPPro</th>
<th>Graph-SoC [16]</th>
<th>FlexGrip [36]*</th>
<th>MicroBlaze</th>
</tr>
</thead>
<tbody>
<tr>
<td>Flip-flops</td>
<td>884</td>
<td>551</td>
<td>12972</td>
<td>318</td>
</tr>
<tr>
<td>LUTs</td>
<td>755</td>
<td>974</td>
<td>8916</td>
<td>897</td>
</tr>
<tr>
<td>BRAMs</td>
<td>1</td>
<td>9</td>
<td>15</td>
<td>-</td>
</tr>
<tr>
<td>DSP48E1</td>
<td>1</td>
<td>1</td>
<td>19.5</td>
<td>3</td>
</tr>
<tr>
<td>Stages</td>
<td>5</td>
<td>3</td>
<td>5</td>
<td>5</td>
</tr>
<tr>
<td>Freq.</td>
<td>300</td>
<td>200</td>
<td>100</td>
<td>211</td>
</tr>
</tbody>
</table>

* Scaled to a single streaming processor.

Table 6.3 compares the results of the optimised IPPro core against other FPGA-based soft-core processors. The optimised IPPro delivers 1.4 - 3.0 times better $f_{Max}$ than the other processors. Comparing area utilisation numbers, IPPro uses 37% and 41% more FFs than GraphSoC and MicroBlaze but fewer than FlexGrip. On the other hand, IPPro consumes $\approx 15\%$ and 22% fewer LUTs than MicroBlaze and GraphSoC.
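The frequency and LUT comparisons can be recomputed from the values in Table 6.3. The sketch below only restates the table entries; the dictionary layout and helper names are illustrative.

```python
# Values copied from Table 6.3 (frequency in MHz)
ippro = {"lut": 755, "fmax": 300}
others = {
    "GraphSoC": {"lut": 974, "fmax": 200},
    "FlexGrip": {"lut": 8916, "fmax": 100},
    "MicroBlaze": {"lut": 897, "fmax": 211},
}


def fmax_ratio(name):
    """How many times higher the IPPro clock is than the named processor's."""
    return ippro["fmax"] / others[name]["fmax"]


def lut_saving_percent(name):
    """Percentage of LUTs saved by IPPro relative to the named processor."""
    other = others[name]["lut"]
    return 100.0 * (other - ippro["lut"]) / other


fmax_ratios = {name: round(fmax_ratio(name), 2) for name in others}
lut_savings = {name: round(lut_saving_percent(name), 1)
               for name in ("MicroBlaze", "GraphSoC")}
```

The frequency ratios span roughly 1.4 to 3.0, and the LUT savings against MicroBlaze and GraphSoC come out at about 15.8% and 22.5%, in line with the figures quoted above.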

Section 6.3 has presented IPPro datapath optimisations that minimise the execution overhead of implementing multi-port actors and provide the essential data and control mechanisms to map and execute stream and non-stream data processing. The control of the supported mechanisms is abstracted through C-APIs to maintain flexibility.

### 6.4 Multicore IPPro

The low-level communication and synchronisation mechanisms must be managed by the multicore architecture itself, which creates the need for a flexible *multicore interconnect*. It facilitates adaptability to exploit different dataflow transformations and provides a flexible level of connectivity and the essential data exchange patterns among cores to map parallel dataflow actors. It helps not only to map different pipelined dataflow graphs onto the multicore architecture but also to exploit data and task parallel implementations by adopting a horizontal and vertical scaling approach. Considering these architectural features, the following design requirements are identified:

- Software controlled connectivity among cores of multicore IPPro to realise one-to-many, many-to-one, many-to-many consumer-producer data passing patterns, to have flexible actor-core mapping possibilities (Section 6.4.1).

- Dynamic routing of data streams among IPPro cores to achieve area-efficient horizontal and vertical scaling of the architecture (Section 6.4.1).

### 6.4.1 Exploration of multicore interconnect architecture

In a multicore architecture, the *multicore interconnect* defines connectivity across IPPro cores. In the open literature, the research community has proposed and analysed different types of interconnect architectures such as *bus, crossbar and network-on-chip (NoC)* [125]. Each interconnect architecture has pros and cons based on the supported connectivity, flexibility, area and performance [112], [125]. From an application mapping point-of-view, the chosen level of connectivity can limit data exchange possibilities among cores, leading to a restrained actor-core mapping and realising parallel possibilities. From a hardware design point-of-view, it could significantly impact performance and area utilisation.

This section discusses the dataflow data passing patterns and highlights their importance in achieving better actor-core mapping possibilities and horizontal and vertical scaling. It also presents the dynamic routing of data streams used to achieve flexible mapping possibilities onto the multicore IPPro.

Section 6.2 emphasised the significance of the dataflow data passing patterns that must be supported by the multicore architecture to realise adaptable implementations. These include the multiple-actor (many-to-one, one-to-many and many-to-many) data passing patterns (MPSC, SPMC, MPMC) [92], [93], [91].

Figure 6.5 models the required connectivity among cores that shall be supported by the multicore interconnect to map and execute different dataflow graphs. This architecture will provide both horizontal and vertical connectivity among cores which was absent in the 4x4 interconnect architecture. A split and merge can be expressed by SPMC and MPSC in producer-consumer model, or used to implement data parallel computation. Similarly, a feed-forward can be represented by SPSC in producer-consumer model, or used to achieve pipelining or task parallel computation. Since these patterns are reusable, different nested data passing patterns can be derived such as merge-pipeline-split or split-pipeline-merge as shown in Figure 3.3.

It shows that the multicore interconnect should support the identified data exchange patterns to improve actor mapping possibilities and maximise core utilisation for parallel implementations.

Dynamic routing of data streams

One of the design requirements set for the multicore interconnect is the dynamic routing of data streams across multiple cores by sharing resources. This requires data-channel arbitration to avoid data collision and resource starvation and to ensure balanced bandwidth distribution across cores. For this purpose, the Xilinx AXI4-Stream switch IP is chosen as the multicore interconnect; it supports M x N crossbar connectivity between AXI4-Stream master and slave channels. It uses an address control signal (TDEST) to route a stream of data between a master and a slave. It supports slave decoding and master arbitration mechanisms (fixed and round-robin), where each master is statically assigned a TDEST value.
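TDEST-based routing with round-robin arbitration can be illustrated with a behavioural sketch. This is not a model of the Xilinx IP's internal implementation: the function name, the one-token-per-source-per-cycle limit and the per-destination round-robin pointer are assumptions chosen to show how arbitration avoids collisions and starvation when several producers target the same consumer.

```python
from collections import deque


def route(sources, n_dest, cycles):
    """Route (tdest, data) tokens from source queues to destinations.

    Each cycle, every destination grants at most one source, scanning
    sources in round-robin order from the one after its last grant.
    """
    received = [[] for _ in range(n_dest)]
    pointer = [0] * n_dest           # next source to consider, per destination
    n_src = len(sources)
    for _ in range(cycles):
        granted = set()              # a source sends at most one token per cycle
        for dest in range(n_dest):
            for offset in range(n_src):
                src = (pointer[dest] + offset) % n_src
                if src in granted or not sources[src]:
                    continue
                tdest, data = sources[src][0]
                if tdest == dest:
                    sources[src].popleft()
                    received[dest].append((src, data))
                    pointer[dest] = (src + 1) % n_src
                    granted.add(src)
                    break
    return received


# Two producers merging into destination 0 (an MPSC pattern), with one
# token from producer 1 addressed to destination 1.
sources = [
    deque([(0, "a0"), (0, "a1")]),
    deque([(0, "b0"), (1, "b1")]),
]
received = route(sources, n_dest=2, cycles=3)
```

Destination 0 alternates between the two producers instead of draining one first, which is the fairness property that round-robin arbitration is meant to provide.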

Figure 6.6 illustrates the realisation of the dataflow data passing patterns identified in Section 6.4.1 using the TDEST signal. To maintain the balance between area utilisation and the level of connectivity among cores, a maximum of eight cores (IPPro#0 - IPPro#7) is considered, as shown in Figure 6.6. This configuration allows parallel implementations of up to a 7-way split, 7-way merge, 8-way SIMD, 8-stage pipeline or a combination thereof. The arrows show the flow of data from the producer to the consumer core. Section 6.3.1 detailed the process of tagging tokens with *DEST_ID* whenever IPPro encounters a PUSH CHANNEL# instruction. This tag specifies the destination core (consumer) and is used as TDEST.

Figure 6.6: Realisation of data exchange patterns using stream interconnect.

This multicore interconnect architecture complements the features supported by IPPro and extends actor-core mapping possibilities using dynamic routing of data streams. The level of core connectivity supported by the interconnect defines the granularity of parallelism exploitable by the resultant multicore IPPro, which in this case is 8-way SIMD.

6.4.2 Impact of interconnect’s core connectivity and core utilisation on area and performance

Three designs have been selected using 4x4, 8x8 and 16x16 cross-bar configurations to accommodate 2, 4 and 8 IPPro cores as illustrated in Figure 6.7. These designs express an increasing level of core connectivity allowing better actor-core mapping possibilities by providing both horizontal and vertical connectivity necessary to realise tree expansion and reduction while maximising core utilisation (CU) as illustrated in Figure 6.6.

Each design has AXI4-Stream master and slave interfaces ($M_x$ and $S_x$). Half of the interfaces of each design are assigned to the supported IPPro cores, while the remaining interfaces are used to feed data in and out of the multicore IPPro. Each input and output interface has an internal 32x24-bit FIFO, realised using the FPGA’s LUT resources, that buffers data locally to avoid congestion during channel arbitration while the interconnect is serving other cores. The data payload of each channel is a three-byte TDATA (2-byte data token, 1-byte source/destination tag). The interconnect uses round-robin scheduling to avoid resource starvation and provide equal bandwidth to all cores. The size of TDEST is fixed to 2, 3 and 4 bits for the 4x4, 8x8 and 16x16 designs respectively, to uniquely address each slave channel (input interface of an IPPro core). The designs have been synthesised and implemented using Vivado v2016.4 for Artix-7 and Kintex-7 FPGAs. The area and timing results are reported in Table 6.6.
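The TDEST widths and the three-byte payload can be checked with a short sketch. The helper names are hypothetical, and placing the tag in the most significant byte of TDATA is an assumption, since the text gives only the field sizes and not their order.

```python
def tdest_width(n_channels):
    """Minimum number of bits needed to address n_channels uniquely."""
    return max(1, (n_channels - 1).bit_length())


def pack_token(data, tag):
    """Pack a 16-bit data token and an 8-bit tag into a 24-bit TDATA word."""
    if not 0 <= data <= 0xFFFF:
        raise ValueError("data token must fit in 16 bits")
    if not 0 <= tag <= 0xFF:
        raise ValueError("tag must fit in 8 bits")
    return (tag << 16) | data   # assumed layout: tag in the top byte


def unpack_token(word):
    """Split a 24-bit TDATA word back into (data, tag)."""
    return word & 0xFFFF, (word >> 16) & 0xFF


widths = [tdest_width(n) for n in (4, 8, 16)]
word = pack_token(0xBEEF, 0x05)
```

The computed widths of 2, 3 and 4 bits match the values fixed for the 4x4, 8x8 and 16x16 designs.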

**Impact of scaling on the interconnect architecture** Table 6.4 details area and CU of stream interconnect and compares it against 4x4 interconnect. The stream interconnect provides a software controlled implementation of data passing patterns as illustrated in Figure 6.6.

Table 6.4 presents resource utilisation, where both data parallel mappings have achieved 100% core utilisation (CU). The task parallel mappings on the 4x4 interconnect have achieved 25% CU due to the lack of vertical connectivity among cores. This shows that the stream interconnect provides flexible actor-core mapping options to exploit both data and task parallelism using the same underlying architecture.

Table 6.5 presents the normalised area results of Table 6.4. For data parallel implementations, the normalised FF and LUT utilisation is close to unity, while the 4x4 interconnect consumes twice the number of BRAMs and DSP48E1s. On the other hand, for task parallel implementations a significant difference of approx. 1.67 to 2.19 times in LUT and FF utilisation is observed, along with four times the number of BRAMs and DSP48E1s. The results show that the stream interconnect architecture is flexible, supports better actor-core mapping possibilities suitable for data and task parallel implementations, and is more area efficient than the 4x4 interconnect architecture.

**Performance Analysis** Table 6.6 compares the implementation results of the stream interconnect on Artix-7 and Kintex-7. On Artix-7, 4x4 connectivity results in an $f_{\text{Max}}$ of 200 MHz, which reduces by $\approx 1.33$ and 1.66 times when connectivity is scaled up to 8x8 and 16x16 respectively, due to the larger crossbar connections implemented using multiplexers. When the same 4x4 connectivity is ported to Kintex-7, the design achieves an $f_{\text{Max}}$ of 285 MHz, which becomes 1.09 and 1.29 times lower when scaled up to 8x8 and 16x16 respectively. Kintex-7 delivers 1.45 times better timing than Artix-7 because the Kintex-7 FPGA technology is optimised for performance, which comes at a higher chip cost.
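The scaling factors can be reproduced from the quoted frequencies. In the sketch below the Artix-7 values are those reported for the three designs, while the Kintex-7 frequencies for 8x8 and 16x16 are derived from the 285 MHz figure and the stated degradation factors, so they are estimates rather than reported measurements.

```python
# Artix-7 f_Max values (MHz) for the three crossbar sizes
artix = {"4x4": 200, "8x8": 150, "16x16": 120}
artix_degradation = {
    size: round(artix["4x4"] / f, 2) for size, f in artix.items()
}

# Kintex-7: 4x4 value as stated; larger designs estimated from the
# degradation factors quoted in the text
kintex_4x4 = 285
kintex_estimate = {
    "8x8": round(kintex_4x4 / 1.09),
    "16x16": round(kintex_4x4 / 1.29),
}

kintex_vs_artix_4x4 = round(kintex_4x4 / artix["4x4"], 2)
```

The estimated 16x16 Kintex-7 frequency of about 221 MHz is close to the 220 MHz interconnect clock quoted later for the multicore IPPro.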

Table 6.4: Implementation results to evaluate scaling of 4x4 and stream interconnect architectures on area and core utilisation to realise data (vertical) and task (horizontal) parallel implementations.

<table>
<thead>
<tr>
<th>Impl.</th>
<th>4x4 Interconnect</th>
<th>Stream Interconnect</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Cores</td>
<td>Conn.</td>
</tr>
<tr>
<td>4-stage</td>
<td>16</td>
<td>4x4</td>
</tr>
<tr>
<td>8-stage</td>
<td>32</td>
<td>4x4</td>
</tr>
<tr>
<td>4-way</td>
<td>8</td>
<td>4x4</td>
</tr>
<tr>
<td>8-way</td>
<td>16</td>
<td>4x4</td>
</tr>
</tbody>
</table>

Table 6.5: Normalised area utilisation numbers of 4x4 with respect to stream interconnect realising parallel implementations.

<table>
<thead>
<tr>
<th>Resource</th>
<th>Task Parallel</th>
<th>Data Parallel</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>4-stage</td>
<td>8-stage</td>
</tr>
<tr>
<td>FF</td>
<td>2.19</td>
<td>1.99</td>
</tr>
<tr>
<td>LUTs</td>
<td>1.86</td>
<td>1.67</td>
</tr>
<tr>
<td>BRAM</td>
<td>4</td>
<td>4</td>
</tr>
<tr>
<td>DSP48E1</td>
<td>4</td>
<td>4</td>
</tr>
</tbody>
</table>

Table 6.6: Implementation results of scaled-up stream interconnect designs with increasing core-connectivity on Artix-7 and Kintex-7 fabrics. The normalised area utilisation numbers of each design with respect to single-core IPPro are reported within the brackets.

<table>
<thead>
<tr>
<th>Connectivity</th>
<th>FFs</th>
<th>LUTs</th>
<th>LUTRAM</th>
<th>$f_{Max}$ (MHz)</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Artix-7</td>
<td>Kintex-7</td>
<td></td>
<td></td>
</tr>
<tr>
<td>4x4</td>
<td>1524 (1.7)</td>
<td>1092 (1.9)</td>
<td>160 (0.9)</td>
<td>200</td>
</tr>
<tr>
<td>8x8</td>
<td>3414 (3.9)</td>
<td>2840 (5.0)</td>
<td>320 (1.7)</td>
<td>150</td>
</tr>
<tr>
<td>16x16</td>
<td>8266 (9.4)</td>
<td>8335 (14.6)</td>
<td>768 (4.2)</td>
<td>120</td>
</tr>
</tbody>
</table>

It is important to note that the cores do not require the full bandwidth of the interconnect, as they process data sequentially and their execution time is directly proportional to the complexity of the actor. The implementation of a simple dataflow actor on IPPro requires at least approx. 12 instructions. The stream interconnect arbitrates data channels and routes data from source to destination in a round-robin fashion on a cycle-to-cycle basis. For this reason, the bandwidth requirement per core is lower than what is usually expected in fully pipelined FPGA architectures (where a slower data transfer rate could limit the performance). Moreover, the deployment of input/output FIFOs at the interconnect boundaries allows data buffering and isolates the clock boundaries, which allows the interconnect and the IPPro cores to operate at different frequencies. Therefore, the operating frequency of the multicore interconnect ($f_{\text{Interconnect}}$) does not need to equal the operating frequency of the IPPro core ($f_{\text{IPPro}}$). On this basis, accepting a maximum $f_{\text{Max}}$ degradation of 1.66 times (Artix-7) and 1.29 times (Kintex-7) in exchange for flexible connectivity among cores is a viable choice.
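The bandwidth argument can be made concrete with a back-of-envelope estimate. This is a sketch under stated assumptions that are not taken from the thesis: each instruction takes one clock cycle, an actor consumes one token per firing, and an interconnect channel can move at most one token per interconnect clock cycle. The 12-instruction minimum, the 300 MHz core clock and the 220 MHz interconnect clock are the figures quoted in this chapter.

```python
def core_token_rate(f_core_hz, instructions_per_token):
    """Upper bound on tokens per second that one core can consume."""
    return f_core_hz / instructions_per_token


F_IPPRO_HZ = 300e6          # IPPro core clock quoted for Kintex-7
F_INTERCONNECT_HZ = 220e6   # interconnect clock quoted for the 16x16 design
MIN_INSTRUCTIONS = 12       # approximate minimum for a simple actor

per_core_demand = core_token_rate(F_IPPRO_HZ, MIN_INSTRUCTIONS)
# Assumed channel capacity: one token per interconnect cycle
channel_capacity = F_INTERCONNECT_HZ
headroom = channel_capacity / per_core_demand
```

Under these assumptions a core asks for at most 25 million tokens per second, well below what a single channel could deliver, which is consistent with the conclusion that the interconnect does not need to match the core clock.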

**Area Analysis** Table 6.6 reports the area utilisation of the 4x4, 8x8 and 16x16 designs. The margin between LUT and FF utilisation (5.00 and 3.19) is higher because the FIFO buffers are realised using LUT resources. The normalised area utilisation of the 4x4, 8x8 and 16x16 interconnects with respect to a single-core IPPro is also reported in Table 6.6: they consume 1.7, 3.9 and 9.4 times more FFs, and 1.9, 5.0 and 14.6 times more LUTs, respectively. This shows that the stream interconnect fulfils the requirements of the multicore interconnect identified in Section 6.4 and provides a balance between area and performance.

6.4.3 Multicore IPPro architecture

Considering the performance and area analysis results of the multicore interconnect, the multicore IPPro is composed of eight IPPro cores connected through a 16x16 stream interconnect, as shown in Figure 6.12. The AXI4-Lite interface is used for management and the AXI-MM interface for programming dataflow actors onto the IPPro cores. The interfaces (S8 - S15) and (M8 - M15) carry data into and out of the multicore IPPro. Depending on the TDEST value, the incoming data stream is dynamically routed to the destination core, realising multi-level split and merge implementations using the same underlying hardware architecture.

The interconnect interfaces (S0 - S7) and (M0 - M7) connected to the IPPro cores have 32x24-bit FIFO buffers. These buffers serve three purposes: 1) they temporarily store data tokens produced/consumed by the cores, which keeps the cores busy processing; 2) they give the interconnect the necessary time to arbitrate and route data streams among cores; 3) they isolate the clock domain boundaries, allowing the IPPro cores and the multicore interconnect to run at independent clock frequencies. Buffering hides the data transfer time between cores by storing data tokens at the input and output interfaces of the cores. It is possible to run the IPPro cores ($f_{\text{IPPro}}$) at a maximum of 300 MHz while the multicore interconnect ($f_{\text{Interconnect}}$) can run at up to 220 MHz, which is 1.83 and 1.90 times higher than on Artix-7 respectively, as reported in Table 6.2.
6.4.4 Example: Mapping of dataflow graph onto multicore architecture

The chosen dataflow graphs cover the parallel and pipeline dataflow transformations. Consider an example dataflow graph composed of actors (A, B, C, D, E, F and G), as shown in Figure 6.8. The graph is decomposed such that A, D, E and F are mapped onto separate cores, whereas actors B, C and G require different data parallel granularities (3-way and 4-way SIMD) and are implemented as (B1, B2, B3), (C1, C2, C3) and (G1, G2, G3, G4), which need split and merge. Figure 6.9 shows the mapping onto the multicore IPPro; the interconnect interfaces used by each core are shown explicitly for a clear understanding of the data execution flow. The dataflow graph is decomposed and mapped onto two multicore IPPro instances to demonstrate scalability and parallel implementation of actors.

A receives the input data stream at M0, routed from input interface S8. The stream is processed by core#0 as defined by A and fed to B1, B2, B3 when encountering the (PUSH Rx, 1, PUSH Rx, 2 and PUSH Rx, 2) instructions. Each core has a dedicated FIFO queue to receive tokens from other cores (Section 6.3.1 and Figure 6.2) residing within the multicore IPPro. B1, B2, B3 can concurrently read tokens (processed by A) from CHANNEL#0 of their respective FIFO queues using (GET Rx, 0). This process continues until D pushes the processed tokens to the M8 output interface of the multicore interconnect. This interface is statically connected to S8 of the following multicore IPPro, as indicated in Figure 6.9. Therefore, the tokens processed by D are received by E at S8 and routed to M0 by the interconnect. Execution continues until it reaches the split (G1, G2, G3, G4), where the cores concurrently process the tokens and send the processed tokens out of the multicore IPPro through the output interfaces (M8, M9, M10, M11).

Figure 6.8: A dataflow graph example that covers pipelining of multiple data parallel actors.

Figure 6.9: Flat illustration of the mapping and execution of pipelined multiple data parallel actors exploiting parallelism using the multicore IPPro. The listed IPPro code shows the read, write and tagging of tokens for each actor. These tags are used by the interconnect to route tokens among the cores of the multicore IPPro.
6.5 FPGA-based programmable hardware acceleration platform

The data distribution and collection requirements depend on the application in hand and on the adopted decomposition and mapping, which are not known at design time [32], [40]. A flexible hardware-based data distribution and collection architecture is therefore needed so that the system infrastructure supports the following design requirements of the parallel skeletons:

- *Split, compute and merge*, and *pipeline* skeletons require parallel streams, which raises the need for parametrised distribution of multiple data streams.

- The *farm* skeleton requires access to parallel data blocks, which needs programmable distribution of data blocks into the cores' scratchpad memories.

6.5.1 Parallel distribution and collection of data streams

Scatter-gather has been widely adopted as a parallel data distribution and collection paradigm for regularly distributed data, which makes it suitable for pixel processing [116], [126]. It uses static decomposition and divides data into multiple equal-sized blocks, as illustrated in Figure 6.10, for parallel processing using multiple cores. In the open literature, various image processing data distribution patterns driven by row, column and block-based static data decomposition are reported [32], [40], [117], [118]. However, these hardware architectures handle fixed image sizes and parallel distribution of streams. The software or application developer needs granular control over both stream generation and distribution using software APIs, without dealing with low-level data and control mechanisms.

Besides, each parallel data stream can be converted into a form necessary for point and area processing.

**Parallel point and window generation**

Image pre-processing functions are composed of point and window/area operations. Figure 6.10 shows the row-wise scatter and gather process of an image with a maximum parallel granularity of eight for both operations. Compared to point operations, window operations require overlapping of multiple lines of pixels (two lines in the case of a 3x3 window). Therefore, a dedicated software-configurable point and area-based data distribution architecture has been proposed in Figure 6.11. The value of the programmable P_A_REG register defines whether a stream or a window of pixels is fed to the core. Buffering three incoming lines of pixels into LINE_BUFFER#1, LINE_BUFFER#2 and LINE_BUFFER#3 allows generation of a 3x3 window. The window controller iteratively reads the line buffers and generates a stream of window pixels, which can be fed to the cores of the multicore IPPro through the (S8 - S15) input interfaces, as shown in Figure 6.12.
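The window generation step can be sketched in software as follows. This is an illustrative model only: the three pointer arguments stand in for LINE_BUFFER#1 to LINE_BUFFER#3, the function name `window3x3` and the row-major output order are assumptions, and border columns are left to the caller because the text does not state how image borders are handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Emits the 3x3 window centred on column `x` from three buffered lines.
 * `top`, `mid` and `bot` model the three line buffers; `width` is the
 * line width. The caller must keep x within 1 .. width - 2. */
void window3x3(const uint8_t *top, const uint8_t *mid, const uint8_t *bot,
               size_t width, size_t x, uint8_t out[9])
{
    assert(width >= 3 && x >= 1 && x + 1 < width);
    const uint8_t *rows[3] = { top, mid, bot };
    for (size_t r = 0; r < 3; ++r) {
        for (size_t c = 0; c < 3; ++c) {
            out[r * 3 + c] = rows[r][x - 1 + c]; /* row-major window order */
        }
    }
}
```

Sliding `x` across a line produces the stream of window pixels; each window reuses two of the three columns from the previous one, which is why line buffering avoids re-reading the frame.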

This approach gives software control over data generation and the mapping of point and area operations onto IPPro cores. In contrast, HLS-based hardware architectures require code rewriting, verification, synthesis and place-and-route.

**Configurable scattering and gathering of data streams**

Software-configurable *scatter* and *gather* hardware blocks have been designed with a FIFO interface so that they integrate easily with other image processing systems [127]:

1. **Direct video streaming** An incoming video stream is stored in an on-chip frame buffer. A controller sequentially reads pixels from the frame buffer and stores them in the data FIFO.

2. **Buffered video streaming** An incoming video stream is stored in an off-chip frame buffer. The host processor initiates a *direct-memory-access* (DMA) transfer to read pixels from the frame buffer and store them in the data FIFO.

Both hardware blocks have AXI4-Lite registers to provide controllability over data distribution, as shown in Figure 6.12 and listed in Table 6.7. The host processor must configure these registers during the platform initialisation process. The *Scatter* and *Gather* blocks have five and three programmable registers respectively.

Figure 6.12: Block diagram of the programmable hardware acceleration platform. The diagram shows only a single multicore IPPro due to space limitations. Cascading multiple multicore IPPro instances is possible, subject to the available FPGA area resources.

Table 6.7: The AXI4-Lite (control) register map of platform hardware modules.

<table>
<thead>
<tr>
<th>AXI4-Lite Registers</th>
<th>Bits</th>
<th>Addr.</th>
<th>31 - 28</th>
<th>27 - 24</th>
<th>23 - 20</th>
<th>19 - 16</th>
<th>15 - 12</th>
<th>11 - 8</th>
<th>7 - 4</th>
<th>3 - 0</th>
</tr>
</thead>
<tbody>
<tr>
<td>Scatter Module</td>
<td></td>
<td>0x00</td>
<td>CONTROL</td>
<td>xxx</td>
<td>PAREG</td>
<td>RST</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>0x04</td>
<td>SRC_ID_REG</td>
<td>LINE7</td>
<td>LINE6</td>
<td>LINE5</td>
<td>LINE4</td>
<td>LINE3</td>
<td>LINE2</td>
<td>LINE1</td>
<td>LINE0</td>
</tr>
<tr>
<td></td>
<td>0x0C</td>
<td>DEST_ID_REG</td>
<td>LINE7</td>
<td>LINE6</td>
<td>LINE5</td>
<td>LINE4</td>
<td>LINE3</td>
<td>LINE2</td>
<td>LINE1</td>
<td>LINE0</td>
</tr>
<tr>
<td>Gather Module</td>
<td></td>
<td>0x10</td>
<td>SCA_MASK</td>
<td>xxx</td>
<td>LINE_WIDTH</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>0x00</td>
<td>CONTROL</td>
<td>xxx</td>
<td>RST</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>0x04</td>
<td>LINE_REG</td>
<td>xxx</td>
<td>LINE_WIDTH</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>0x08</td>
<td>GAT_MASK</td>
<td>xxx</td>
<td>MASK</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

The CONTROL, LINE_REG and MASK registers are common to both blocks. LINE_REG defines the width of the line buffers, up to 2048, to support different image/video resolutions, e.g. (640x480, 800x600). MASK defines the data distribution granularity used to generate parallel streams that exploit data parallelism. The SRC_ID_REG and DEST_ID_REG registers store the control tags for the (line buffer - FIFO queue) and (line buffer - core) mappings respectively. Listings 6.4 and 6.5 present the C-APIs developed to configure the scatter and gather blocks. These C-APIs hide the underlying implementation details of the scatter and gather modules and provide a transparent software interface (driver) that shall be used by the compiler framework to deploy different data and task parallel optimisations, hidden from the user.

Listing 6.4: C-APIs to manage scatter and gather blocks.

```c
// Split , compute and merge skeleton

// Scatter functions
int initScatter(Scatter* inst, uint32_t ScatterBase);
int ScatterWrite(Scatter* inst, ScatterAddr addr, uint32_t command);

// Gather functions
int initGather(Gather* inst, uint32_t GatherBase);
int GatherWrite(Gather* inst, GatherAddr addr, uint32_t command);

// Farm skeleton

// Scratchpad read/write functions
int writeSP(Core* inst, uint32_t *data, uint32_t n);
int readSP(Core* inst, uint32_t addr, uint32_t n);
```
```c
// Set video horizontal resolution (640)
ScatterWrite(&Scatter, LINE_WIDTH, 640);
GatherWrite(&Gather, LINE_WIDTH, 640);
// One to one (line buffer - core) mapping
ScatterWrite(&Scatter, SRC_ID_REG, 0x000000);
GatherWrite(&Gather, GATHER_MASK, 0x01);
// Single-core, single active line buffer and no SIMD
ScatterWrite(&Scatter, DEST_ID_REG, 0x00000000);
GatherWrite(&Gather, DEST_ID_REG, 0x00000000);
ScatterWrite(&Scatter, SRC_ID_REG, 0x000010);
GatherWrite(&Gather, SRC_ID_REG, 0x000010);
// Dual-core, 2-way SIMD
ScatterWrite(&Scatter, DEST_ID_REG, 0x00000010);
GatherWrite(&Gather, DEST_ID_REG, 0x00000010);
ScatterWrite(&Scatter, SRC_ID_REG, 0x00000000);
GatherWrite(&Gather, SRC_ID_REG, 0x00000000);
// 3-way SIMD
ScatterWrite(&Scatter, DEST_ID_REG, 0x00000021);
GatherWrite(&Gather, DEST_ID_REG, 0x00000021);
ScatterWrite(&Scatter, SRC_ID_REG, 0x00000000);
GatherWrite(&Gather, SRC_ID_REG, 0x00000000);
// 7-way SIMD
ScatterWrite(&Scatter, DEST_ID_REG, 0x06543210);
GatherWrite(&Gather, DEST_ID_REG, 0x06543210);
ScatterWrite(&Scatter, SRC_ID_REG, 0x07);
GatherWrite(&Gather, SRC_ID_REG, 0x07);
// 8-way SIMD
ScatterWrite(&Scatter, DEST_ID_REG, 0x76543210);
GatherWrite(&Gather, DEST_ID_REG, 0x76543210);
ScatterWrite(&Scatter, DEST_ID_REG, 0xFF);
GatherWrite(&Gather, DEST_ID_REG, 0xFF);
```
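Table 6.7 shows SRC_ID_REG and DEST_ID_REG as eight 4-bit fields (LINE7 down to LINE0), so a register value can be assembled from one ID per line buffer. The helper below is a hedged sketch of that packing: `pack_line_ids` and `MAX_LINES` are hypothetical names that are not part of the thesis API, and placing LINE0 in the lowest nibble is an assumption read from the column order of the table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_LINES 8 /* LINE0..LINE7, one 4-bit field each (Table 6.7) */

/* Packs one 4-bit ID per line buffer into a 32-bit register value.
 * ids[i] is placed in bits [4*i+3 : 4*i], so LINE0 uses the low nibble.
 * Fields for unused line buffers (i >= n) are left as zero. */
uint32_t pack_line_ids(const uint8_t *ids, size_t n)
{
    assert(ids != NULL && n <= MAX_LINES);
    uint32_t reg = 0;
    for (size_t i = 0; i < n; ++i) {
        assert(ids[i] <= 0xF);
        reg |= (uint32_t)ids[i] << (4 * i);
    }
    return reg;
}
```

With an identity mapping (line buffer i to core i), this reproduces the 2-way, 7-way and 8-way DEST_ID_REG values used in the listing, and the result could be passed as the `command` argument of `ScatterWrite` or `GatherWrite`.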

During execution, the scatter block sequentially reads the data stream from the input data FIFO, divides it into equal blocks (defined by LINE_WIDTH), and consecutively stores them into the line buffers (LINE_BUFFER#0 - LINE_BUFFER#7) depending on the SCA_MASK value. Each bit of SCA_MASK corresponds to an individual line buffer; the asserted bits specify which line buffers shall be filled during scatter. Once data is available in the line buffers, it is ready for consumption by a point or window operation based on the value of PAREG, as discussed in Section 6.5.1. The cores concurrently process data while scatter refills the line buffers as soon as space becomes available in them. The gather block reads a stream of processed pixels from the output line buffers, with the block size defined by LINE_WIDTH. GAT_MASK specifies how many output line buffers shall be read consecutively to reconstruct the output video stream. These are the C-APIs that the programmer shall use in the host application to adjust the underlying architecture depending on the requirements of the application. The user does not have to deal with the underlying hardware mechanisms.

Table 6.8: Area utilisation results of the system infrastructure.

<table>
<thead>
<tr>
<th>Module</th>
<th>FFs</th>
<th>LUT</th>
<th>BRAM</th>
</tr>
</thead>
<tbody>
<tr>
<td>Multicore Interconnect</td>
<td>9085</td>
<td>9965</td>
<td>0</td>
</tr>
<tr>
<td>AXI-Lite scatter control</td>
<td>576</td>
<td>1163</td>
<td>0</td>
</tr>
<tr>
<td>Scatter point only</td>
<td>594</td>
<td>980</td>
<td>8</td>
</tr>
<tr>
<td>Scatter point and window</td>
<td>2665</td>
<td>2300</td>
<td>20</td>
</tr>
<tr>
<td>AXI-Lite gather control</td>
<td>169</td>
<td>208</td>
<td>0</td>
</tr>
<tr>
<td>Gather</td>
<td>559</td>
<td>717</td>
<td>8</td>
</tr>
<tr>
<td>AXI-Interconnect</td>
<td>221</td>
<td>221</td>
<td>0</td>
</tr>
<tr>
<td>Reset processing system 1</td>
<td>48</td>
<td>30</td>
<td>0</td>
</tr>
<tr>
<td>Reset processing system 2</td>
<td>48</td>
<td>31</td>
<td>0</td>
</tr>
</tbody>
</table>

6.5.2 Implementation results

Table 6.8 presents the area results of the *system infrastructure* implemented on the Avnet Zedboard using Xilinx Vivado v2016.4. The multicore interconnect uses 10.27 and 13.19 times more FFs and LUTs than a single IPPro core, and 1.28 and 1.64 times more FFs and LUTs than eight IPPro cores. The cost of the flexible multicore interconnect is therefore close to that of a programmable pipelined implementation of eight dataflow actors. The AXI-Lite control modules consume 1.53 and 1.54 times fewer FFs and LUTs respectively than a single IPPro core. Thus, the cost of incorporating software-driven control and management is marginal.

The cost of scattering parallel windows (area) is 4.48 and 2.34 times more FFs and LUTs compared to scattering parallel point lines. The impact of the triple-buffered line buffers used to generate pixel windows is evident in the reported BRAM utilisation. Point *scatter* and *gather* show consistent area usage, as the process of scattering line buffers is similar to the gathering of processed pixels. The AXI-interconnect and reset processing system blocks are mandatory system components. They receive data from the host processor and route it to the addressed slave devices. The *system infrastructure* has two clock domains (AXI4 bus and IPPro clock), which require two reset processing systems to ensure synchronous reset of the slaves. These are the costs of making the *FPGA-based hardware acceleration platform* adaptable, which abstracts the FPGA resources and improves design time by avoiding synthesis and place-and-route.
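As a quick consistency check, the window-versus-point ratios quoted above can be recomputed from the *Scatter point only* and *Scatter point and window* rows of Table 6.8. The quoted 4.48 and 2.34 correspond to the first two quotients truncated to two decimal places, and the third quotient is the BRAM ratio between the two scatter variants:

$$\frac{2665}{594} \approx 4.487, \qquad \frac{2300}{980} \approx 2.347, \qquad \frac{20}{8} = 2.5$$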

So far, the proposed platform has been considered to comprise a single multicore IPPro. However, Zynq chips with Kintex-7 fabric could accommodate more instances of the multicore IPPro. Table 6.9 reports the available area resources of the Zynq XC7Z045 and XC7Z100 chips, with the numbers normalised to a single multicore IPPro given in brackets. These normalised numbers give an estimate that these Zynq chips could potentially accommodate up to $\approx 16$ to 25 instances of the multicore IPPro.

### 6.6 Parallel implementation of image pre-processing functions

Table 6.10 lists the mathematical representation of the chosen functions, which are fundamental kernels of larger algorithms and often represent the core computation of more extensive practical image processing applications [104], [105], [106], [107], [108], [128].

Table 6.10: Formal mathematical representation of chosen image pre-processing functions.

<table>
<thead>
<tr>
<th>Cat.</th>
<th>Functions</th>
<th>Mathematical representation</th>
<th>Actor-core mapping</th>
</tr>
</thead>
<tbody>
<tr>
<td>Point</td>
<td>• Contrast</td>
<td>$P_{\text{output}} = P_{\text{input}} + \text{Contrast}_{\text{val}}$</td>
<td></td>
</tr>
<tr>
<td></td>
<td>• Thresholding</td>
<td>$P_{\text{output}} = P_{\text{input}} &gt; \text{Threshold}_{\text{val}} ? 255 : 0$</td>
<td></td>
</tr>
<tr>
<td></td>
<td>• Gradient calc.</td>
<td>$P_{\text{gradient}} = |P_x| + |P_y|$</td>
<td></td>
</tr>
<tr>
<td></td>
<td>• Histogram</td>
<td>$\text{Image}_{\text{histogram}} = \sum_{i=0}^{n} \text{Bin}(P_i)$</td>
<td></td>
</tr>
<tr>
<td>Area</td>
<td>• Gaussian</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>• Sobel</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>• Morphology</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Task</td>
<td>• Sobel edge</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>• Wavelet</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Hetero.</td>
<td>• Adaptive Threshold</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>• Sum-of-absolute difference</td>
<td>$P(S.A.D) = \sum_{i=1}^{9} |P_i - R_i|$</td>
<td></td>
</tr>
</tbody>
</table>

The adopted actor-core mapping of each function onto the multicore IPPro is detailed below:

**Data parallel - balanced actor** Point and area functions are individually mapped onto the cores to realise $2 \rightarrow 8$-way data parallel implementations, as shown in Table 6.10. This actor-core mapping mimics the split, compute and merge parallel skeleton, as the scatter-gather modules distribute and collect lines of pixels.

**Task parallel - unbalanced actors** These implementations pipeline the point and area functions, where each core maps and executes a separate actor, as shown in Table 6.10. Sobel edge uses CORE#0-2 to perform area operations and CORE#3 to perform point operations. Wavelet transform uses six cores (CORE#0-5) for a pipelined implementation of area-based Gaussian low- and high-pass filters. Each core has a window generation module, as previously presented in Section 6.5.1.

**Data parallel - heterogeneous computing** The chosen heterogeneous functions demonstrate the stream and non-stream computing possibilities necessary to realise the farm parallel skeleton on the proposed platform. Adaptive threshold requires the image histogram to compute the new threshold value, which involves floating-point calculation that is more suitably implemented on the host processor. The memory-mapped input and output execution paths of IPPro (Figure 6.3) have been used to pass the image histogram to, and receive the new threshold value from, the host processor, as shown in Table 6.10. Similarly, for the SAD implementation, the host processor writes the 3x3 kernel values into the scratchpad memory of each IPPro during platform configuration. This kind of decomposition and execution mimics a realisation of the farm parallel skeleton.
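The SAD computation over a 3x3 window, with the reference values held in a core's scratchpad, can be written as a short reference model. This is a plain C sketch of the formula in Table 6.10 and not IPPro code: the function name `sad3x3` is hypothetical and the `reference` array stands in for the kernel values that the host writes into the scratchpad.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Sum of absolute differences between a 3x3 pixel window and a 3x3
 * reference kernel: sum over i = 1..9 of |P_i - R_i|. */
uint32_t sad3x3(const uint8_t window[9], const uint8_t reference[9])
{
    assert(window != NULL && reference != NULL);
    uint32_t sum = 0;
    for (int i = 0; i < 9; ++i) {
        sum += (uint32_t)abs((int)window[i] - (int)reference[i]);
    }
    return sum;
}
```

Each absolute value is data dependent, which is why an implementation built from branch instructions has a variable execution time per pixel.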

The discussed functions have been implemented on the Avnet Zedboard, which has a Xilinx Zynq SoC (XC7Z020-CLG484-1). Figure 6.13 shows the simplified block diagram of the realised video processing system, which is similar to the one presented in Chapter 4, except that the middle processing block has been replaced with the programmable hardware acceleration platform. An FPS Monitor module has been implemented in FPGA logic to measure the time between the start and end of a frame, from which the achieved VGA (640x480) frame processing rate in frames/second (fps) is calculated.
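The conversion from a measured frame interval to a frame rate can be sketched as follows. This is an illustrative host-side model which assumes that the monitor exposes the number of clock cycles counted between start and end of frame; the function names and this cycle-count interface are hypothetical and are not taken from the thesis.

```c
#include <assert.h>
#include <stdint.h>

/* Frame rate = monitor clock frequency / cycles counted per frame. */
double frames_per_second(uint64_t clock_hz, uint64_t cycles_per_frame)
{
    assert(cycles_per_frame > 0);
    return (double)clock_hz / (double)cycles_per_frame;
}

/* Frame processing time in milliseconds for the same measurement. */
double frame_time_ms(uint64_t clock_hz, uint64_t cycles_per_frame)
{
    assert(clock_hz > 0);
    return 1000.0 * (double)cycles_per_frame / (double)clock_hz;
}
```

For example, with an assumed 100 MHz monitor clock, a count of 4,000,000 cycles per frame corresponds to 40 ms per frame, i.e. 25 fps.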

6.6.1 Performance analysis

**Point functions** Table 6.11 reports the acceleration results of point functions using the multicore IPPro on the Avnet Zedboard. The single-core results affirm a direct relationship between the average cycles/pixel and the actor’s execution time, which signifies that a smaller (decomposed) dataflow actor delivers better performance. Both gradient and threshold are data dependent functions and require branch instructions, in contrast to the data independent histogram and contrast functions. This is evident in the reported results, as histogram and contrast achieved 1.18 and 1.90 times better performance than threshold and gradient due to the absence of branch execution, which leads to a fixed execution time per pixel.
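The difference between a data independent and a data dependent point function can be seen in a plain C rendering of two entries of Table 6.10. This is a reference sketch and not IPPro assembly: the function names are hypothetical, and the wrap-around behaviour of `contrast` on overflow is an assumption, since the table does not specify overflow handling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Contrast: P_out = P_in + contrast_val. No branch is needed, so the
 * cost per pixel is fixed. Overflow wraps modulo 256 (assumption). */
uint8_t contrast(uint8_t p, uint8_t contrast_val)
{
    return (uint8_t)(p + contrast_val);
}

/* Threshold: P_out = (P_in > threshold_val) ? 255 : 0. The result
 * depends on a comparison of the pixel value, i.e. it is data dependent. */
uint8_t threshold(uint8_t p, uint8_t threshold_val)
{
    return (p > threshold_val) ? 255 : 0;
}

/* Applies the threshold to one line of n pixels. */
void threshold_line(const uint8_t *src, uint8_t *dst, size_t n,
                    uint8_t threshold_val)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; ++i) {
        dst[i] = threshold(src[i], threshold_val);
    }
}
```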

The point functions have been implemented with increasing data parallel granularity from 2 → 8-way SIMD using 2 - 8 cores, and the results are reported in Table 6.11 to analyse the performance improvements. A maximum improvement of 7.8 times over the single-core implementation was achieved because the direct streaming video pipeline avoids host-to-accelerator data transfer times, giving a maximum of 75 and 149 fps for gradient and contrast.

**Area functions** Table 6.11 reports the acceleration results of area functions on the Avnet Zedboard. In contrast to the point functions, all three area functions are data independent. Morphology uses min and max instructions to compute the dilate and erode image operations. As min and max do not support data forwarding, they take more execution time than Gaussian and Sobel. The implementations of both the Gaussian and Sobel filters take advantage of single-cycle multiply-accumulate; moreover, the zero kernel values further optimise Sobel, saving four clock cycles per pixel compared with Gaussian. Therefore, Sobel achieved 1.12 and 1.20 times better performance than the Morphology and Gaussian filters.

Each function has been implemented with increasing data parallel granularity from 2 → 6-way using 2 - 6 cores, and the results are reported in Table 6.11. A maximum performance improvement of 5.27 times was achieved, which is lower than that of the point functions by 2.53 (7.8 versus 5.27) due to the parallel scattering of windows. The direct streaming video pipeline delivered approx. 84 and 95 fps for *Morphology* and *Sobel* respectively using six cores.

Table 6.11: Data parallel performance results of point and area functions using IPPro on Artix-7 (Zedboard).

<table>
<thead>
<tr>
<th>Functions</th>
<th>Point</th>
<th></th>
<th></th>
<th></th>
<th>Area</th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td>Contrast</td>
<td>Threshold</td>
<td>Gradient</td>
<td>Histogram</td>
<td>Gaussian</td>
<td>Sobel</td>
<td>Morphology</td>
</tr>
<tr>
<td>Avg. Cycles/ Pixel</td>
<td>53</td>
<td>116</td>
<td>49</td>
<td>56</td>
<td>33</td>
<td>39</td>
<td>25</td>
<td>44</td>
</tr>
<tr>
<td>Execution time (ms)</td>
<td>73</td>
<td>123</td>
<td>61</td>
<td>100</td>
<td>52</td>
<td>61</td>
<td>57</td>
<td>66</td>
</tr>
</tbody>
</table>

Table 6.12: Comparison of data parallel implementation of point functions using IPPro against ARM (-O2,-O3).

<table>
<thead>
<tr>
<th>Point Functions</th>
<th>Architecture</th>
<th>IPPro</th>
<th>ARM</th>
<th>IPPro</th>
<th>ARM</th>
<th>IPPro</th>
<th>ARM</th>
<th>IPPro</th>
<th>ARM</th>
<th>IPPro</th>
<th>ARM</th>
<th>IPPro</th>
<th>ARM</th>
<th>IPPro</th>
<th>ARM</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>-O2</td>
<td>-O3</td>
<td>-O2</td>
<td>-O3</td>
<td>-O2</td>
<td>-O3</td>
<td>-O2</td>
<td>-O3</td>
<td>-O2</td>
<td>-O3</td>
<td>-O2</td>
<td>-O3</td>
<td>-O2</td>
<td>-O3</td>
<td></td>
</tr>
<tr>
<td>Exec. time (ms)</td>
<td>33.20</td>
<td>45.90</td>
<td>45.71</td>
<td>38.30</td>
<td>45.67</td>
<td>62.20</td>
<td>49.41</td>
<td>48.47</td>
<td>33.20</td>
<td>53.96</td>
<td>49.44</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Table 6.13: Comparison of data parallel implementation of area functions using IPPro against ARM (-O2,-O3).

<table>
<thead>
<tr>
<th>Area Functions</th>
<th>Gaussian</th>
<th>Sobel</th>
<th>Morphology</th>
</tr>
</thead>
<tbody>
<tr>
<td>Architecture</td>
<td>IPPro</td>
<td>ARM</td>
<td>IPPro</td>
</tr>
<tr>
<td>Optimisation</td>
<td>-O2</td>
<td>-O3</td>
<td>-O2</td>
</tr>
<tr>
<td>Exec. time (ms)</td>
<td>61.00</td>
<td>71.61</td>
<td>70.69</td>
</tr>
</tbody>
</table>

By porting the platform to Kintex-7 fabric, as reported in Table 6.12 and Table 6.13, further improvements of up to \( \approx 1.60 \) times are possible for both *point* and *area* functions due to the higher operating frequencies of the IPPro cores and the multicore interconnect, 300 MHz and 220 MHz respectively (Table 6.3 and Table 6.6).

**Performance comparison of IPPro against embedded ARM Cortex-A9 CPU implementation**

To set the baseline figures and compare the IPPro performance, both *point* and *area* image processing functions have been implemented on the embedded ARM Cortex-A9 CPU operating at 667 MHz. Two compiler optimisation levels, -O2 (high) and -O3 (maximum), have been used, which are supported by the ARM GCC compiler available in the *Xilinx Vivado Software Development Kit* (SDK).

The detailed results are reported in Table 6.12 and Table 6.13 respectively. For *point* and *area* functions, average performance of 20 and 14 fps has been achieved, despite the ARM CPU operating 2.23 times faster than the IPPro core. The performance is limited because the ARM uses AXI4-DMA to read and write pixels, which takes a maximum of 40 ms of data transfer time for a 640x480 video frame when configured with the maximum burst size of 256x32 bits per DMA transfer. By moving the ARM compiler optimisation from -O2 to -O3, a maximum performance improvement (excluding the data transfer times) of 1.47 times has been observed. However, this performance improvement becomes insignificant, as the fixed data transfer overhead is \( \approx 5.4 \) times larger than the function’s processing time, which limits the best achievable theoretical performance to 25 fps.

Table 6.14: Implementation results of HLS generated IPs on Kintex-7 fabric. (Normalised area and performance results of multicore IPPro to HLS).

<table>
<thead>
<tr>
<th>IP</th>
<th>Operations</th>
<th>Freq. (MHz)</th>
<th>Area Utilisation</th>
<th>Exec. (ms)</th>
<th>fps</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td>FF</td>
<td>LUT</td>
<td>BRAM</td>
<td>DSP48E1</td>
</tr>
<tr>
<td>Point</td>
<td>Add, Subtract, And, Or, Xor, Mul, Min, Max</td>
<td>250</td>
<td>1526 (8.04)</td>
<td>1266 (8.64)</td>
<td>0 (18.5)</td>
</tr>
<tr>
<td>Area</td>
<td>Convolution, Morphology</td>
<td>222</td>
<td>3444 (2.08)</td>
<td>3350 (2.94)</td>
<td>2.5 (13)</td>
</tr>
</tbody>
</table>
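The 25 fps bound follows directly from the 40 ms transfer time quoted above: even if the processing time were zero, a frame could not complete faster than its DMA transfer. The symbol $t_{\text{transfer}}$ is introduced here only for illustration:

$$\text{fps}_{\max} = \frac{1}{t_{\text{transfer}}} = \frac{1}{40 \times 10^{-3}\,\text{s}} = 25\ \text{fps}$$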

By comparing the obtained ARM CPU results against IPPro, the single-core IPPro implementation achieved a maximum of 1.48 times better performance than the ARM, even though IPPro operates at a 2.23 times lower frequency, because IPPro exploits stream processing and avoids the data transfer overheads of reading and writing. In addition, the data parallel IPPro implementations achieved further performance improvements of 11.47 and 6.43 times using 8 and 6 cores for point and area functions respectively.

**Cost analysis of proposed adaptable approach against HLS** The point and area IPs have been developed by Deng (a research student) using Xilinx Vivado HLS v2016.4. Both IPs have an AXI4-Lite interface to select the operations listed in Table 6.14 and are fully pipelined. The area IP is 1.12 times slower than the point IP due to the necessary line buffering, which reduced performance from 423 to 361 fps. The area IP used 3.56 and 2.64 times more FFs and LUTs than the point IP.

Table 6.14 also presents the normalised performance and area numbers. The IPPro implementation is 1.8 and 2.36 times slower than the HLS-developed IPs, at the cost of 8.04 and 8.64 times more FFs and LUTs respectively. This increase is the cost of a flexible and programmable architecture that not only allows the software programmer to map and execute multiple dataflow actors, but also provides software-controlled granularity to exploit the desired data and task parallel implementations using skeletons. The area cost narrows to ≈ 2.08 and 2.94 times for the area IP, and the memory utilisation gap is reduced by approx. 1.43 times. The performance gap can be reduced using multiple multicore IPPro instances, as estimated and reported in Table 6.9.

**Pipelining multiple tasks**  Table 6.15 reports the performance results of pipelining multiple dataflow actors exploiting task parallelism. The *Wavelet transform* consists of the pipelined execution of balanced area actors, i.e. high- and low-pass filters, as illustrated in Table 6.10, while *Sobel edge* represents the pipelined execution of an unbalanced gradient actor.

During execution of the *Wavelet Transform*, the first-stage cores pass processed pixels to the second-stage cores. As the actors are balanced, no ripple effect has been observed, because balanced execution hides the data transfer times to the second-stage cores. Therefore, the average cycles/pixel is close to that of the *Gaussian*. Although the computation requirement of the *Wavelet transform* is six times that of the *Gaussian*, similar performance has been achieved by exploiting task parallelism, which can be further improved by exploiting data parallelism.

In the case of *Sobel Edge*, the *gradient* takes an average of 56 cycles/pixel due to data dependent operations, which is 1.43 and 1.60 times higher than *Gaussian* and Sobel, as reported in Table 6.11. Therefore, during execution the gradient stage forces a backward ripple effect that propagates to Sobel and Gaussian, limiting the overall performance to 8 fps. The pipelined implementation of unbalanced actors delivered a 2.03 times improvement over the non-pipelined implementation, which suggests that decomposing the dataflow graph into balanced actors is vital to gain the maximum advantage of task parallelism.

<table>
<thead>
<tr>
<th>Pipelined Tasks</th>
<th>Artix-7 (Zedboard)</th>
<th>Kintex-7 (ZC706)</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Wavelet</td>
<td>Sobel Edge</td>
</tr>
<tr>
<td>No. of cores</td>
<td>6</td>
<td>6</td>
</tr>
<tr>
<td>Avg. Cycles/Pixel</td>
<td>38</td>
<td>64</td>
</tr>
<tr>
<td>Execution Time (ms)</td>
<td>62</td>
<td>133</td>
</tr>
<tr>
<td>Performance (fps)</td>
<td>16</td>
<td>8</td>
</tr>
</tbody>
</table>

Table 6.16: Performance results of heterogeneous decomposed compute functions using multicore IPPro.

<table>
<thead>
<tr>
<th>Heterogeneous Operations</th>
<th>Artix-7 (Zedboard)</th>
<th>Kintex-7 (ZC706)</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Adaptive threshold</td>
<td>Adaptive threshold</td>
</tr>
<tr>
<td>Avg. Cycles/Pixel</td>
<td>49</td>
<td>260</td>
</tr>
<tr>
<td>Execution time</td>
<td>61</td>
<td>515</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Performance</th>
<th>Frame per second (fps)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Single-core</td>
<td>16</td>
</tr>
<tr>
<td>2-way</td>
<td>30</td>
</tr>
<tr>
<td>3-way</td>
<td>45</td>
</tr>
<tr>
<td>4-way</td>
<td>61</td>
</tr>
<tr>
<td>5-way</td>
<td>76</td>
</tr>
<tr>
<td>6-way</td>
<td>92</td>
</tr>
<tr>
<td>7-way</td>
<td>108</td>
</tr>
<tr>
<td>8-way</td>
<td>124</td>
</tr>
</tbody>
</table>
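The effect of an unbalanced stage on a pipeline can be illustrated with an idealised model in which the steady-state cost per pixel equals that of the slowest stage. This is a simplified sketch: the function name `bottleneck_cycles` is hypothetical, and the model ignores the inter-core transfer and ripple overheads that the measured results include.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the cycles/pixel of the slowest stage, which bounds the
 * steady-state throughput of an ideal pipeline of n_stages stages. */
unsigned bottleneck_cycles(const unsigned *stage_cycles, size_t n_stages)
{
    assert(stage_cycles != NULL && n_stages > 0);
    unsigned worst = stage_cycles[0];
    for (size_t i = 1; i < n_stages; ++i) {
        if (stage_cycles[i] > worst) {
            worst = stage_cycles[i];
        }
    }
    return worst;
}
```

In a balanced pipeline every stage matches this bound, so no stage waits on another; in an unbalanced one, the faster stages idle while the slowest stage drains its input, which is the ripple effect described above.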

**Heterogeneous computing tasks** Table 6.16 presents the results of the heterogeneously decomposed Adaptive Threshold and Sum of absolute difference (SAD) functions illustrated in Table 6.10. SAD takes 8.44 times longer than adaptive threshold due to the nested execution of data dependent branch instructions necessary to compute the absolute values of $G_x$ and $G_y$ produced by the Sobel filter. The data parallel implementation of SAD achieved a maximum improvement of 5.50 times using six cores of the multicore IPPro.

In Adaptive threshold, the host processor takes 10 µs to read the image histogram bins from the scratchpad memory and 25 µs to compute the new threshold value. Since the host processor and the multicore IPPro execute concurrently, a maximum performance improvement of 7.6 times has been achieved using eight cores, which can be improved by a further 1.60 times using Kintex-7.
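The scaling figures above can be reproduced from the frame rates in Table 6.16. The following is a minimal sketch, not code from the thesis: the function names `speedup`, `parallel_efficiency` and `speedup_table` are hypothetical, and the only inputs are the tabulated fps values. Because those values are rounded, ratios derived from them need not match the quoted improvement factors exactly.

```c
#include <stddef.h>

/* Speed-up of a multicore run over the single-core frame rate. */
double speedup(double fps_single, double fps_multi)
{
    return fps_multi / fps_single;
}

/* Parallel efficiency: achieved speed-up divided by the cores used. */
double parallel_efficiency(double fps_single, double fps_multi, unsigned cores)
{
    return speedup(fps_single, fps_multi) / (double)cores;
}

/* Fill out[i] with the speed-up of the (i+1)-way run; fps[0] is single-core. */
void speedup_table(const double *fps, size_t n, double *out)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = speedup(fps[0], fps[i]);
}
```

Called with the tabulated 16 fps (single-core) and 124 fps (8-way), `speedup` returns 7.75 and `parallel_efficiency` returns roughly 0.97, which indicates near-linear scaling for this function.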

6.7 Summary

This chapter presented an *FPGA-based programmable hardware acceleration platform* that supports a software-controlled implementation of parallel skeletons. The platform provides three layers of software programming abstraction to software and algorithm developers. Each layer complements the adaptable features supported by the following layer. These layers allow developers to explore, optimise, map and implement parallel dataflow applications on the FPGA using the IPPro core, the multicore IPPro and the system infrastructure. The platform enables a software-centric *edit-compile-run* flow that reduces design time.

The IPPro core sits at the bottom layer and implements a programmable dataflow actor. The multicore IPPro lies in the middle layer and implements multiple programmable dataflow actors; it supports producer-consumer data exchange patterns to explore and exploit parallelism. The top layer provides the software-controlled data distribution and collection mechanisms necessary to support the functional requirements of the layers below it.

The implementation results show that the platform's adaptability and flexibility come at the cost of area, a significant amount of which is consumed by the interconnect, followed by the software-controlled distribution of windows of pixels. The interconnect used 10.27 and 13.19 times more FFs and LUTs than a single IPPro core, and 1.28 and 1.64 times more FFs and LUTs than eight IPPro cores. Similarly, the scattering of pixels consumed $\approx 3$ times more FFs and LUTs than a single IPPro core. The platform operates in two separate clock domains, and the maximum $f_{IPPro}$ and $f_{Interconnect}$ are 300 MHz and 220 MHz respectively.

A set of point and area image pre-processing functions was implemented on the platform using the Avnet Zedboard (Artix-7) to evaluate and analyse its flexibility and performance. The decomposition and mapping possibilities cover the acceleration of balanced and unbalanced actors exploiting both data and task parallelism. The implementation results show that data-independent functions deliver better performance than data-dependent functions because of the non-linearity introduced by branches. The point functions map better onto the platform and provide $\approx 2.53$ times better acceleration than area functions, owing to the absence of the line buffering that is mandatory to obtain a window of pixels. Performance can be improved further by a data-parallel implementation, which delivers a maximum improvement of 7.80 and 5.27 times for point and area functions respectively. Comparison with the embedded ARM Cortex-A9 CPU shows that a single-core IPPro achieved up to 10 times better performance while operating at a 2.23 times lower frequency, by avoiding data transfer overheads. In addition, exploiting data parallelism gave maximum performance improvements over the ARM CPU of 11.47 and 6.44 times using 8 and 6 cores for point and area functions respectively.

The results of pipelined execution show that the balanced-actor implementation achieved the maximum performance, as it hides the data transfer and processing time of the following stages. In the case of unbalanced actors, the maximum achievable performance is limited by the slowest actor due to the ripple effect. These results suggest that it is essential to decompose the dataflow graph into balanced actors to achieve the maximum benefit of task parallelism and avoid the ripple effect.
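The bottleneck behaviour summarised above can be illustrated with a simple steady-state throughput model. This is a sketch under stated assumptions rather than the thesis's measurement method: the names `pipeline_fps` and `sequential_fps` are hypothetical, each stage is assumed to have a fixed per-frame time in milliseconds, and inter-stage transfer costs are ignored.

```c
#include <stddef.h>

/* Steady-state frame rate of a pipeline: bounded by its slowest stage. */
double pipeline_fps(const double *stage_ms, size_t n)
{
    double slowest = 0.0;
    for (size_t i = 0; i < n; ++i)
        if (stage_ms[i] > slowest)
            slowest = stage_ms[i];
    return slowest > 0.0 ? 1000.0 / slowest : 0.0;
}

/* Frame rate when the stages run one after another on each frame. */
double sequential_fps(const double *stage_ms, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; ++i)
        total += stage_ms[i];
    return total > 0.0 ? 1000.0 / total : 0.0;
}
```

With illustrative (not measured) stage times of 40, 60 and 125 ms, the pipelined rate is 8 fps, set entirely by the 125 ms stage. If the same 225 ms of work were split into three balanced 75 ms stages, the model gives about 13.3 fps, which is why balanced decomposition matters.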
Chapter 7

Conclusion and Future Work

7.1 Summary

FPGAs have not been accepted as a mainstream computing platform because of longer design times and the need for specialist programming tools, which can be challenging for algorithm and software developers to use. As existing FPGA-based design approaches struggle to address these challenges while providing a balance between adaptability and performance, this thesis has proposed an FPGA-based programmable hardware acceleration platform architecture and used it to implement different image pre-processing applications. It is maintained that the approach offers a balance between performance and efficient resource utilisation while reducing design time. The platform can be programmed using conventional software development approaches. It enables software and algorithm developers to accelerate applications on an FPGA using an edit-compile-run flow rather than a time-consuming synthesis and place-and-route design flow, thus reducing design time.

The major architectural challenge has been to find a balance between the supported hardware and software abstraction while maintaining the concurrency and pipelining benefits of FPGA technology. Hierarchical hardware and software abstraction layers have been used to achieve flexibility, where each layer provides unique features that allow the hardware platform to implement different high-level application descriptions down to low-level FPGA resources. This allows fine- and coarse-grained mapping and exploitation of data and task parallel realisations on the platform.

7.2 Thesis Contributions

This work has presented an approach that makes FPGA-based hardware acceleration easier for software and algorithm developers by using a software-centric edit-compile-run flow with reduced design time.

1. Design and development of a novel FPGA-based Image Processing Processor (IPPro) soft-core architecture tailored for the acceleration of image processing applications. The architecture has been carefully designed to support the functional computing requirements while using FPGA compute and memory resources efficiently. It comprises a 16-bit signed, 5-stage pipelined RISC processor that supports basic arithmetic, logical and branch instructions with data forwarding, and implements data-dependent point and area image processing operations. It is then used as the basic programmable processing element of the proposed FPGA-based hardware acceleration platform. The IPPro operates at up to 300 MHz and, taking operating frequency into account, delivers up to three times better raw computation than other soft-core processor architectures by exploiting the dedicated DSP block while minimising the use of FPGA resources. Results show that the IPPro has achieved up to 5.8 times better performance while utilising approximately the same amount of FPGA resources as other FPGA-based programmable architectures. In addition, a comparison of chosen micro-benchmarks shows that the IPPro has achieved up to 8.94 times better performance than the well-established MicroBlaze soft-core processor while consuming fewer resources.

2. The processing capabilities of the IPPro datapath have been extended beyond those supported purely by the dedicated DSP48E1 block. Specialised min, max and coprocessor instructions are included in the datapath, where the coprocessor extension allows complex arithmetic operations to be off-loaded to the coprocessor. The coprocessor executes in parallel and does not stall the execution of the IPPro datapath, which maximises performance. This optimisation has increased the length of the critical path, which reduced the maximum operating frequency of the datapath by 11%, and consumed 89 LUTs and 34 FFs.

3. Creation of the IPPro as an independent, self-managed, programmable dataflow accelerator that receives tokens from multiple producers and sends the processed tokens to multiple consumers by executing stream instructions. The architecture supports fine- and coarse-grained mapping and execution of dataflow nodes using a producer-consumer computing model. The actor firing rule is software programmable, as the IPPro code contains both the actor's functional description and its control (firing rule). This avoids the need for the external controller, token re-ordering and synchronisation mechanisms that are necessary in high-level synthesis (HLS) and HDL-based design approaches. In addition, stream instructions based on data and control mechanisms avoid data-transfer overheads and simplify multicore synchronisation problems by avoiding the intervention of both the host processor and the communication controller.

4. The IPPro datapath supports both message-passing and shared memory data models which allows for processing of both uniform and non-uniform distributed data or combinations thereof. These data processing paths allow the IPPro to implement split, compute, merge, pipeline and farm parallel computing skeletons on the FPGA. It facilitates better programming abstraction which can be used to explore, profile, optimise and evaluate different mapping possibilities and deploy them on the underlying architecture using software-centric edit-compile-run flow to find a suitable solution to the problem.

5. Creation of a multicore IPPro architecture that allows mapping and execution of multiple dataflow actors using dynamic routing of data streams among IPPro cores. The architecture facilitates both the data and control mechanisms supported by the underlying IPPro cores. It uses the stream routing information issued by the producer core to forward data tokens to the consumer cores. The supported data passing patterns are one-to-many, many-to-one and many-to-many, which are essential to map tree reduction and expansion structures effectively. This flexible connectivity among cores enables the adaptable realisation of a pipelined dataflow graph exploiting task parallelism, vertical scaling of a dataflow actor to exploit data parallelism, or combinations thereof, in order to maximise resource re-use. It enables a wide range of application profiling, exploration and optimisation options for the user. It comes at the cost of 10.27 and 13.19 times more FFs and LUTs than a single IPPro core. The multicore IPPro has two separate clock domains, where the maximum frequencies of the IPPro and the interconnect are 300 MHz and 220 MHz respectively.

6. Implementation of the $k$-means algorithm using multiple IPPro cores on the Avnet Zedboard, which allows exploration of actor-core mapping possibilities and their evaluation in terms of area, performance, power and resource efficiency. Four IPPro-based hardware accelerator designs composed of single, dual, 8-way-SIMD and dual 8-way-SIMD cores have been realised. The results have been compared against equivalent HLS and GPU implementations. The results show that a performance improvement of up to 7.3 times over a single core is possible by exploiting both data and task parallelism, at the cost of increased area. Compared with other technologies, the FPGA achieved 27 times better performance than the embedded CPU by exploiting parallelism and consumed 4.9 times less power than the embedded GPU. Moreover, the power efficiency (fps/W) figures show that the FPGA implementation is 57 and 24 times more power efficient than the embedded CPU and GPU respectively.

7. Point and area image pre-processing functions are implemented on the Avnet Zedboard to evaluate the performance and analyse the flexibility of the platform. The selected decomposition and mapping possibilities cover the acceleration of both balanced and unbalanced, data-independent and data-dependent dataflow actors exploiting data and task parallel implementations. They demonstrate the implementation of split, compute, merge, pipeline and farm parallel skeletons on FPGA technology. The results show that data-independent functions deliver better performance than data-dependent functions because of the non-linearity introduced by branches. Comparison with the embedded ARM Cortex-A9 CPU shows that a single-core IPPro achieved up to 10 times better performance while operating at a 2.23 times lower frequency, by avoiding data transfer overheads. In addition, exploiting data parallelism gave maximum performance improvements over the ARM CPU of 11.47 and 6.44 times using 8 and 6 cores for point and area functions respectively.
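The software-programmable firing rule described in contribution 3 can be pictured with a small host-side model. The sketch below is illustrative only and is not the IPPro implementation: the `Fifo` type, its depth, and the functions `fifo_test`, `fifo_get`, `fifo_push` and `actor_fire_absdiff` are hypothetical names that loosely mirror the TEST, GET and PUSH stream instructions. The point it shows is that the actor checks token availability in its own code before consuming, so no external controller is involved.

```c
#include <stdint.h>

#define FIFO_DEPTH 32u  /* hypothetical depth, chosen for illustration */

typedef struct {
    int16_t  data[FIFO_DEPTH];
    unsigned head;   /* index of the oldest token */
    unsigned count;  /* number of tokens currently stored */
} Fifo;

/* TEST-like query: how many tokens are waiting. */
unsigned fifo_test(const Fifo *f) { return f->count; }

/* PUSH-like write: returns 0 on success, -1 if the FIFO is full. */
int fifo_push(Fifo *f, int16_t v)
{
    if (f->count == FIFO_DEPTH)
        return -1;
    f->data[(f->head + f->count) % FIFO_DEPTH] = v;
    f->count++;
    return 0;
}

/* GET-like read: the caller must have checked fifo_test() first. */
int16_t fifo_get(Fifo *f)
{
    int16_t v = f->data[f->head];
    f->head = (f->head + 1u) % FIFO_DEPTH;
    f->count--;
    return v;
}

/* One firing attempt of a two-input actor computing |a - b|.
 * Fires only when two input tokens and one output slot are available.
 * Returns 1 if the actor fired, 0 otherwise. */
int actor_fire_absdiff(Fifo *in, Fifo *out)
{
    if (fifo_test(in) < 2u || fifo_test(out) == FIFO_DEPTH)
        return 0;
    int16_t a = fifo_get(in);
    int16_t b = fifo_get(in);
    fifo_push(out, (int16_t)(a > b ? a - b : b - a));
    return 1;
}
```

Changing the consumption count in the guard (here two tokens) is all that is needed to model a different firing rule, which is the flexibility the contribution refers to.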

7.3 Suggestions for further work

The presented work proposed a novel *FPGA-based programmable hardware acceleration platform* and soft processor that is adaptable and facilitates fast prototyping and exploration for software and algorithm developers using a software-centric edit-compile-run flow. Some suggested future directions to extend this work are:

1. **Extension of the IPPro datapath to support execution of dynamic dataflow graphs** where an actor can produce and consume a different number of tokens in each firing. One possible solution is to extend the data payload (ACTION, TRIGGER, SRC_ID, DEST_ID, DATA) and include an additional instruction, similar to TEST, to decode control information generated by preceding actor nodes.

2. **Syntactic extension of a high-level programming language** to exploit the supported parallel skeletons effectively. It can be an extension of well-established high-level languages such as OpenCL and OpenMP.

3. **Software-based profiling framework** that uses static analysis techniques to profile the execution of, and interaction among, the actors of a dataflow application. This profiling information can be used to optimise different processing stages. It should also provide a data-dependent analysis capability to profile and analyse the impact of control and data-dependent dataflow nodes on performance. This can be achieved by determining their computational load, data transfers and storage load. The computational load can be determined by recording the execution of control statements. The data-transfer and storage load can be determined from the rate of token production/consumption and the inter-stage buffer utilisation.

4. **Extending the IPPro platform to modern FPGA architectures** such as the Xilinx Zynq UltraScale+ MPSoC. The *programmable hardware acceleration platform* uses the IPPro soft-core processor, which exploits the hardened DSP block, to implement different applications. In terms of performance, the modern DSP48E2 block in the Zynq UltraScale+ delivers $\approx 16\%$ better timing ($f_{\text{Max}}$) compared to the DSP48E1 block, and the device provides $\approx 20\%$ more blocks. This would allow the IPPro cores not only to operate at a higher frequency (improving raw computation capacity) but also to accommodate more IPPro cores within the FPGA fabric. A high-density UltraRAM has been introduced in the Zynq UltraScale+ memory hierarchy to extend the on-chip memory capabilities. It enables up to 500 Mb of total on-chip storage, which is equivalent to a 6 times increase in on-chip memory resource compared to Zynq-7000. It is a dual-port synchronous memory block similar to the true dual-port Block RAM, with higher memory density. The scratchpad memory in the IPPro architecture can be realised using UltraRAM, which would allow large images to be stored or buffered and global image processing functions to be implemented within the FPGA fabric. In terms of power, the Zynq UltraScale+ provides 3.5 times better performance/Watt compared to the Zynq-7000. It supports clock gating, frequency scaling and the ability to assign different computational units to multiple power domains, i.e. full power, low power and battery power. These features give the designer better opportunities to realise power-optimised, domain-specific applications.
Appendix A

Author’s Publications


Appendix B

IPPro: Technical details

Implementation of DSP48E1-based ALU

The IPPro datapath uses a dedicated DSP48E1 block to implement arithmetic and logical instructions. The DSP48E1 can be dynamically configured using the OPMODE, INMODE and ALUMODE control signals and the CEA2, CEB2, CEC, CEM and CEP clock enables of its pipeline registers. Table B.1 shows the detailed configuration used by the IPPro.

Table B.1: IPPro supported instruction set and their corresponding DSP48E1 control signals.

<table>
<thead>
<tr>
<th>Instruction</th>
<th>INMODE</th>
<th>OPMODE</th>
<th>ALUMODE</th>
<th>CEA2</th>
<th>CEB2</th>
<th>CEC</th>
<th>CEM</th>
<th>CEP</th>
</tr>
</thead>
<tbody>
<tr>
<td>Add</td>
<td>00000</td>
<td>110011</td>
<td>0000</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>Sub, Min, Max</td>
<td>00000</td>
<td>110011</td>
<td>0011</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>Mul</td>
<td>10001</td>
<td>000101</td>
<td>0000</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>0</td>
</tr>
<tr>
<td>Muladd</td>
<td>10001</td>
<td>110101</td>
<td>0000</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>Mulsub</td>
<td>10001</td>
<td>110101</td>
<td>0011</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>Mulacc</td>
<td>10001</td>
<td>100101</td>
<td>0000</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>land</td>
<td>00000</td>
<td>110011</td>
<td>1100</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>lxor</td>
<td>00000</td>
<td>110011</td>
<td>0100</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>lxnr</td>
<td>00000</td>
<td>110011</td>
<td>0101</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>lor</td>
<td>00000</td>
<td>110011</td>
<td>1100</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>lnor</td>
<td>00000</td>
<td>110011</td>
<td>1110</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>lnot</td>
<td>00000</td>
<td>110011</td>
<td>1111</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>lnand</td>
<td>00000</td>
<td>110011</td>
<td>1110</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>lal, lar</td>
<td>10001</td>
<td>000101</td>
<td>0000</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>1</td>
</tr>
</tbody>
</table>
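Table B.1 can be read as a lookup from instruction to DSP48E1 control word. The following sketch transcribes the first four rows (Add, Sub, Mul, Muladd) into a constant table. The type and function names (`Op`, `DspCtrl`, `dsp_ctrl_lookup`) are hypothetical and the layout is an assumption made for illustration; it is not the decoder used in the IPPro RTL. The bit patterns are written in hexadecimal exactly as printed in the table (for example, `110011` is `0x33`).

```c
#include <stdint.h>
#include <stddef.h>

typedef enum { OP_ADD, OP_SUB, OP_MUL, OP_MULADD, OP_COUNT } Op;

typedef struct {
    uint8_t inmode;                    /* INMODE bits as printed in Table B.1  */
    uint8_t opmode;                    /* OPMODE bits as printed in Table B.1  */
    uint8_t alumode;                   /* ALUMODE bits as printed in Table B.1 */
    uint8_t cea2, ceb2, cec, cem, cep; /* pipeline register clock enables      */
} DspCtrl;

/* Rows transcribed from Table B.1. */
static const DspCtrl dsp_ctrl[OP_COUNT] = {
    [OP_ADD]    = { 0x00, 0x33, 0x0, 1, 1, 1, 0, 1 },
    [OP_SUB]    = { 0x00, 0x33, 0x3, 1, 1, 1, 0, 1 },
    [OP_MUL]    = { 0x11, 0x05, 0x0, 0, 0, 1, 1, 0 },
    [OP_MULADD] = { 0x11, 0x35, 0x0, 0, 0, 1, 1, 1 },
};

/* Return the control word for an instruction, or NULL if out of range. */
const DspCtrl *dsp_ctrl_lookup(Op op)
{
    return ((unsigned)op < (unsigned)OP_COUNT) ? &dsp_ctrl[op] : NULL;
}
```

In this form, Add and Sub differ only in ALUMODE, while the multiply-based instructions enable the multiplier register (CEM) instead of the A and B input registers, which matches the pattern visible in the table.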

Instruction Set

IPPro supports a 32-bit instruction set architecture (ISA) to process stream and non-stream data. Table B.2 lists the supported instruction set.

Table B.2: IPPro instruction set.

<table>
<thead>
<tr>
<th>Addressing mode</th>
<th>Instruction</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>NOP</td>
<td>No Operation</td>
</tr>
<tr>
<td>Register File</td>
<td>ADD</td>
<td>RD = RB + RC</td>
</tr>
<tr>
<td></td>
<td>SUB</td>
<td>RD = RC - RB</td>
</tr>
<tr>
<td></td>
<td>MUL</td>
<td>RD = RA * RB</td>
</tr>
<tr>
<td></td>
<td>MULADD</td>
<td>RD = RC + (RA * RB)</td>
</tr>
<tr>
<td></td>
<td>MULSUB</td>
<td>RD = RC - (RA * RB)</td>
</tr>
<tr>
<td></td>
<td>MULACC</td>
<td>RD = (RA * RB) + RD-1</td>
</tr>
<tr>
<td></td>
<td>LAND</td>
<td>RD = RB &amp; RC</td>
</tr>
<tr>
<td></td>
<td>LXOR</td>
<td>RD = RB xor RC</td>
</tr>
<tr>
<td></td>
<td>LXNR</td>
<td>RD = ~ (RB xor RC)</td>
</tr>
<tr>
<td></td>
<td>LOR</td>
<td>RD = RB | RC</td>
</tr>
<tr>
<td></td>
<td>LNOR</td>
<td>RD = ~(RB | RC)</td>
</tr>
<tr>
<td></td>
<td>LNOT</td>
<td>RD = ~ RB</td>
</tr>
<tr>
<td></td>
<td>LNAND</td>
<td>RD = ~(RB &amp; RC)</td>
</tr>
<tr>
<td></td>
<td>MIN</td>
<td>RD = MIN(RB, RC)</td>
</tr>
<tr>
<td></td>
<td>MAX</td>
<td>RD = MAX(RB, RC)</td>
</tr>
<tr>
<td>Data Handling</td>
<td>LDWM</td>
<td>RD = WMn(ADDRESS)</td>
</tr>
<tr>
<td></td>
<td>STWM</td>
<td>WMn(ADDRESS) = RC</td>
</tr>
<tr>
<td></td>
<td>LDWMI</td>
<td>RD = WM(R31)</td>
</tr>
<tr>
<td></td>
<td>STWMI</td>
<td>WM(R31) = RC</td>
</tr>
<tr>
<td></td>
<td>PUSH</td>
<td>FIFO(output) = RA</td>
</tr>
<tr>
<td></td>
<td>GET</td>
<td>RD = FIFO (input)</td>
</tr>
<tr>
<td></td>
<td>TEST</td>
<td>Checks no. of input tokens available in FIFO</td>
</tr>
<tr>
<td></td>
<td>STR</td>
<td>RD = IMM (16-bit signed value)</td>
</tr>
<tr>
<td>BRANCH</td>
<td>JMP</td>
<td>16-bit code memory address</td>
</tr>
<tr>
<td></td>
<td>BNEQ*</td>
<td>Branch if equal flag is clear</td>
</tr>
<tr>
<td></td>
<td>BEQ*</td>
<td>Branch if equal flag is set</td>
</tr>
<tr>
<td></td>
<td>BZ*</td>
<td>Branch if zero flag is set</td>
</tr>
<tr>
<td></td>
<td>BNZ*</td>
<td>Branch if zero flag is clear</td>
</tr>
<tr>
<td></td>
<td>BS*</td>
<td>Branch if Sign flag is set</td>
</tr>
<tr>
<td></td>
<td>BNS*</td>
<td>Branch if Sign flag is clear</td>
</tr>
</tbody>
</table>

* The branch instructions have been added at no extra cost and included in the IPPro instruction set, as the IPPro flags (zero, sign and equal) have been generated using a pattern-detect and a sign-bit produced by the embedded DSP48E1 block.
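
The operand order of some instructions in Table B.2 is easy to misread (for example, SUB computes RC - RB). The following is a minimal reference model of a few register-file instructions, written as a sketch under stated assumptions rather than as the IPPro's actual behaviour: the function names are hypothetical, results are assumed to wrap to 16-bit two's complement, and only the zero and sign flags of a result are modelled.

```c
#include <stdint.h>

/* Wrap a 32-bit intermediate to 16-bit two's complement (assumed behaviour). */
static int16_t wrap16(int32_t v)
{
    uint16_t u = (uint16_t)v;  /* modular conversion, well defined in C */
    return (int16_t)(u >= 0x8000u ? (int32_t)u - 0x10000 : (int32_t)u);
}

typedef struct { int zero; int sign; } Flags;

/* ADD: RD = RB + RC */
int16_t ippro_add(int16_t rb, int16_t rc) { return wrap16((int32_t)rb + rc); }

/* SUB: RD = RC - RB (note the operand order) */
int16_t ippro_sub(int16_t rb, int16_t rc) { return wrap16((int32_t)rc - rb); }

/* MUL: RD = RA * RB */
int16_t ippro_mul(int16_t ra, int16_t rb) { return wrap16((int32_t)ra * rb); }

/* MULADD: RD = RC + (RA * RB) */
int16_t ippro_muladd(int16_t ra, int16_t rb, int16_t rc)
{
    return wrap16((int32_t)rc + (int32_t)ra * rb);
}

/* MIN / MAX: RD = MIN(RB, RC), RD = MAX(RB, RC) */
int16_t ippro_min(int16_t rb, int16_t rc) { return rb < rc ? rb : rc; }
int16_t ippro_max(int16_t rb, int16_t rc) { return rb > rc ? rb : rc; }

/* Zero and sign flags derived from a result value. */
Flags ippro_flags(int16_t rd)
{
    Flags f = { rd == 0, rd < 0 };
    return f;
}
```

For instance, `ippro_sub(3, 10)` yields 7 because RB is subtracted from RC, and `ippro_muladd(3, 4, 5)` yields 17.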
AXI4 Control Registers

The IPPro has an AMBA AXI4-Lite interface that allows configuration of the actor firing, source-ID and destination-ID encoder modules necessary to implement multi-rate dataflow actors. The datapath has nine registers, listed in Table B.3, that store the configuration.

Table B.3: The AXI4-Lite control register map.

<table>
<thead>
<tr>
<th>AXI4-Lite Registers</th>
<th>Bits</th>
</tr>
</thead>
<tbody>
<tr>
<td>Addr.</td>
<td>IPPro Core</td>
</tr>
<tr>
<td></td>
<td>31 - 28</td>
</tr>
<tr>
<td>CONTROL</td>
<td>xxx</td>
</tr>
<tr>
<td>FIRING MASK</td>
<td>xxx</td>
</tr>
<tr>
<td>Tk_CONSUMPTION</td>
<td>Q7</td>
</tr>
<tr>
<td>Tk_PRODUCTION</td>
<td>xxx</td>
</tr>
<tr>
<td>Tk</td>
<td>IM_ADDR</td>
</tr>
<tr>
<td>IM_ADDRESS</td>
<td>WR_EN</td>
</tr>
<tr>
<td>SP_ADDRESS</td>
<td>WR_EN</td>
</tr>
<tr>
<td>SP_DATA_IN</td>
<td>SP_DATA_IN (16-bits)</td>
</tr>
<tr>
<td>SP_DATA_OUT</td>
<td>SP_DATA_OUT</td>
</tr>
</tbody>
</table>

Software-based control interface

C APIs have been developed to ease programming and control of the supported features. The IPPro datapath has nine AXI4-Lite control registers. Listing B.1 shows these functions:

Listing B.1: C-APIs to control and manage IPPro core.

```c
// Core Register read/write functions
int IPProWrite(IPPro* inst, IPProRegAddr addr, uint32_t command);
uint32_t IPProRead(IPPro* inst, IPProRegAddr addr);
int IPProSetTokenConsumption(IPPro* inst, IPProRegAddr addr, uint32_t tCount);
int IPProSetTokenProduction(IPPro* inst, IPProRegAddr addr, uint32_t command);
int IPProSetFiringMask(IPPro* inst, IPProRegAddr addr, uint32_t command);

// Instruction memory functions
int IPProIMWrite(IPPro* inst, uint32_t addr, uint32_t program);
int IPProIMInit(IPPro* inst, uint32_t code, uint32_t n);

// Scratchpad memory functions
int IPProSPWrite(IPPro* inst, uint32_t addr, uint32_t data);
uint32_t IPProSPRead(IPPro* inst, uint32_t addr);
int IPProSPInit(IPPro* inst, uint32_t data, uint32_t n);
```
Bibliography


and operating systems (ASPLOS XII), vol. 41, no. 11, pp. 151 – 162, Oct. 2006.


