

# A soft coprocessor approach for developing image and video processing applications on FPGAs

Deng, T., Crookes, D., Woods, R., & Siddiqui, F. (2022). A soft coprocessor approach for developing image and video processing applications on FPGAs. *Journal of Imaging*, *8*. https://doi.org/10.3390/jimaging8020042

# Published in: Journal of Imaging

**Document Version:** Publisher's PDF, also known as Version of record

# Queen's University Belfast - Research Portal: Link to publication record in Queen's University Belfast Research Portal

#### Publisher rights

© 2022 The Authors.

This is an open access article published under a Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium, provided the author and source are cited.







# A soft coprocessor approach for developing image and video processing applications on FPGAs

Tiantai Deng <sup>1</sup>, Danny Crookes <sup>2,</sup>\*, Roger Woods <sup>2</sup> and Fahad Siddiqui <sup>2</sup>

- <sup>1</sup> Department of Electronics and Electrical Engineering, The University of Sheffield

- <sup>2</sup> School of Electronics, Electrical Engineering and Computer Science, Queen's University Belfast
- \* Correspondence: d.crookes@qub.ac.uk

Abstract: Developing Field Programmable Gate Array (FPGA)-based applications is typically a slow and multi-skilled task. Research in tools to support application development has gradually become more high-level. This paper describes an approach which aims to raise further the level at which an application developer works in developing FPGA-based implementations of image and video processing applications. The starting concept is a system of streamed soft coprocessors. We present a set of soft coprocessors which implement some of the key abstractions of Image Algebra. Our soft coprocessors are designed for easy chaining, and allow users to describe their application as a dataflow graph. A prototype implementation of a development environment, called SCoPeS, is presented. An application can be modified even during execution without requiring re-synthesis. The paper concludes with some performance and resource utilization results for different implementations of a sample algorithm. We conclude that the soft coprocessor approach has the potential to deliver better performance than the soft processor approach, and can improve programmability over dedicated HDL cores for domain specific applications while achieving competitive real time performance and utilization.

Keywords: Image Processing, FPGA, Soft coprocessor

# 1. Introduction

Image processing algorithms are used in many applications, such as image classification, medical image processing, video surveillance and target detection and tracking [1-3]. These applications have been embedded in more and more devices such as smartphones, unmanned autonomous vehicles and surveillance cameras [4-6]. Safety critical image processing applications require the processing system to be accurate, and often fast [7]. With the rapid development of image sensors, the resolution of images and videos is becoming higher than ever. Traditional processors struggle to keep up with increasing resolutions [8]. It may not be possible to process very large images in real-time using conventional CPUs. Thus, it is necessary to consider ways of accelerating the most time-consuming computing tasks of the application in these cases. Commonly, there are four approaches to accelerating image processing algorithms: multi-core clusters of CPUs, GPUs, FPGAs and ASICs. CPUs and GPUs are instruction-based processors, and so they operate on the normal fetch-execute cycle model. This means that it can take several clock cycles to execute one instruction. They are also relatively high power compared to ASICs and FPGAs when implementing the same application [9]. ASICs usually have the best performance and lowest power, but they are not programmable and are very expensive to produce. FPGAs are somewhere between GPUs and ASICs. They are capable of producing low power, low cost but high-performance solutions. However, the design time for custom cores can be much longer than for GPUs [10]. In the field of image processing, because of the independence of pixels, FPGAs can produce good speedup, particularly when used as a coprocessor for low-level image processing operations [11]. However, the key challenge is to speed up the process of producing an FPGA-based solution to image processing application problems.

The need to accelerate the application development process is generally acknowledged. Although vendors and researchers have been putting effort into creating higher-level design environments for building hardware accelerators on FPGAs, some problems still remain [10,12-14]. Hardware designers tend to use 'high-level' in the sense that the syntax is at a higher level than Very high-speed integrated circuit Hardware Description Language (VHDL) or Verilog HDL [15]. But for application developers and software programmers, 'high-level' means that hardware design issues can be practically ignored, and the coding focuses on the application alone, as though the developer were coding for a PC. To application developers, the above tools remain low level, even if they use C syntax [15]. If application developers use the tools naively, without taking hardware design issues into account, the result is likely to be very inefficient hardware. Also, although these High-level Synthesis (HLS) tools are described as high-level, there are some features of the input language that cannot be synthesized. For example, Xilinx Vitis HLS does not support dynamic memory allocation and restricts the use of pointers in C [13,16].

Since the result of the HLS tools above is still HDL, users typically face the usual long re-synthesis time when they make changes to the algorithm or application [14]. This hinders the experimental nature of image processing application development, which is one of the targets of this paper.

Xilinx recently aimed to shorten the synthesis time with their new product family, the Adaptive Compute Acceleration Platform (ACAP), and released an early product of the ACAP family, Versal [17]. The main advantage of the ACAP family is the ability to do re-synthesis rapidly (within milliseconds). Xilinx also provides their AI Engine to accelerate the deployment of AI applications on selected Xilinx devices. Combined with Xilinx Vitis, the development of AI and image processing applications on some Xilinx devices can be significantly accelerated [18]. Unfortunately, only some of the latest Xilinx devices support this feature.

Thus, the current challenges of using FPGAs to accelerate an image processing system can be summarized as follows:

1) It is hard to achieve both programmability and performance on FPGAs across all devices.

2) Current vendors' HLS tools still require users to be knowledgeable about hardware and the limitations of the tools.

3) Lengthy synthesis time is a hindrance during experimental and iterative image processing system development.

In this paper, a higher-level approach for image processing system development is proposed to address, to some extent, the above-mentioned challenges. We will present a number of concepts, which are integrated into a prototype Soft Coprocessor System (SCoPeS), to support the development of FPGA-based image processing applications. Detailed contributions of the paper are as follows:

1) We propose the concept of customizable Soft Co-Processors (SCPs) as the basic building block for stream-based applications. We allow users to chain SCPs together so they can communicate directly with each other and not merely with the host. We use the AXI-Stream Interconnect to connect the SCPs in a system. In this way, we provide users with a flexible system which can be programmed as a dataflow graph (DFG). Users do not normally need to re-synthesize when they change the DFG.

2) We provide a set of customizable Soft Co-Processors based on the key concepts of Image Algebra (IA), including a range of point, neighborhood, and global operations.

3) We provide a set of efficient hardware skeletons for defining new IA-like operations, where users need only supply their own C-based pixel-level function. This enables the creation of very efficient function-specific SCPs.

Our prototype SCoPeS environment includes several tools to support the SCP approach. A hardware configuration generator tool enables users to specify the number of each type of SCP to be available for the current application project. Provided the application uses only this pool of SCPs, no re-synthesis is required. A code generator enables users to define their applications in terms of a text-based DFG (this is in place of special tools for editing a graphical view of the DFG). Users can edit the textual DFG description, normally without requiring re-synthesis.

The rest of this paper is organized as follows. In Section 2, we introduce the background and related work in terms of Image Algebra, high-level programming models and different FPGA implementations of image processing algorithms and systems. Section 3 provides a user's view of our design approach. In Section 4, we describe the architectures and underlying implementations, including our generic and function-specific Image Algebra-based SCPs and how they connect and communicate with each other. In Section 5, we demonstrate how we create a new image processing system using our SCoPeS environment. In Section 6, a comparison between different design approaches for a simple image processing operation is presented for evaluation purposes.

# 2. Background and Related Work

#### 2.1. Current Tools for Designing FPGA Custom Cores in a High-level Environment

Modern FPGAs are no longer thought of as arrays of gates, but as collections of larger-scale functional blocks, integrated using programmable logic. They are still programmable but are not restricted to programmable logic (PL), and sometimes come equipped with on-chip ARM processors or embedded GPUs. When implementing an image processing system on FPGAs, the design effort is a critical project requirement. Very large image processing systems are difficult to design efficiently and require very detailed hardware knowledge. To address this challenge, vendors have released their HLS tools to reduce the design time. The syntax of design description languages has moved up from VHDL/Verilog HDL to C/C++ level because of HLS tools like Vivado HLS and the Intel HLS compiler [19,20]. Applications are becoming more complex. System-on-chip solutions are achievable through the hybrid architecture of ARM+PL and the HLS design approach.

There are also some HLS tools from academia, such as LegUp [21], CyberWorkBench [22], AutoBridge [23] and LeFlow [24]. AutoBridge is an HLS tool specifically for floor planning and pipelining high-frequency designs on multi-die FPGAs. LeFlow is an HLS tool designed specifically for deep learning inference implementation. LegUp can generate a hybrid system of custom cores and soft processors; the other tools only generate custom cores. In currently available HLS tools, users need to rely on the vendor's tools to integrate the RTL design into a whole system, which is a non-trivial task. After the HLS stage, there is generally no additional help for users to integrate their resulting system.

#### 2.2. Soft processors

As an alternative to the inflexible custom core approach, it has become popular to provide cores for simple programmable processors. These allow users to program in high-level languages. A soft processor is achieved by configuring FPGA hardware resources as a processor. Soft processors can reduce the design time through using a high-level language. They also reduce the hardware knowledge required to design a full system. However, the single core performance of a soft processor is usually poor, since soft processors go through the standard fetch-execute cycle for each instruction, and they cannot run at as high a clock rate as normal hard-core processors. For example, Xilinx MicroBlaze usually runs under 400 MHz, while Intel and ARM processors can run at well over 1 GHz [25-29].

When users program these soft processor systems, they do not normally have to think in terms of the hardware but at a relatively high level, and potentially get decent performance. Unfortunately, there are no soft processors optimized directly for image processing from vendors like Intel (Altera) and Xilinx. Two soft processors developed specifically for image processing are, for example, IPPro [30] and a RISC-V soft processor [31]. These processors require fewer resources than Nios II and MicroBlaze.

# 2.3. Image Algebra and Pixel Level Abstractions

Image Algebra (IA) [32] is a mathematical theory concerned with the transformation and analysis of digital images at the whole image (rather than pixel) level. The main goal is the establishment of a comprehensive and unifying theory of image transformations, image analysis, and image understanding. Basic IA operations can be classified as: point operations, neighborhood operations, and global operations.

In point operations (P2P), the same operation is applied at every input pixel position using only pixels at that position. Operations can be binary or unary; they include relational (e.g. '>', '='), arithmetic (e.g. '+', '×'), and logical (e.g. 'and', 'or') operations. Normally one output pixel is generated for each corresponding input pixel position.

A neighborhood operation (N2P) is applied to each (potentially overlapping) region of an image. It is most common to use a 3×3 or 5×5 window. A new pixel value will be generated for each window position. The user specifies the matrix of weights for the window which is used in calculating the result value.

A global operation is a reduction operation which is applied to the whole image and produces a scalar (R2S) or a vector (R2V). For example, the global maximum will produce one scalar value, whereas a histogram will produce a 256-element vector (for standard grey level images).

# 2.4. FPGA-based image processing

In embedded systems, FPGAs are powerful tools for accelerating image processing algorithms, especially for real-time embedded applications, where latency and power are important considerations. FPGAs can be embedded in a camera to directly provide pre-processed image streams. In this way, the sensor will provide an output data stream rather than merely a sequence of images [33]. FPGAs can achieve both data parallelism and task parallelism within many image processing tasks. Unfortunately, simply putting a PC-based algorithm onto an FPGA usually gives disappointing results [34]. Also, many image processing algorithms have been optimized for scalar processors. Thus it is usually necessary to optimize the algorithm specifically for an FPGA before implementing it.

There have typically been three approaches to implementing an image processing algorithm/system on FPGAs:

1) Custom hardware designed using Verilog HDL or VHDL and combined with the vendor's IPs.

2) Use high-level synthesis tools to convert a C-based representation of the algorithm to hardware.

3) Map the algorithm onto one soft processor or a network of soft processors.

When users need to implement an algorithm on FPGAs using custom cores, they need to consider the memory mapping, architecture, and algorithmic optimizations. When users instead try to use soft processors to implement a complex algorithm, they will usually be limited by the poor single core performance on the one hand, and the resource utilization of a multi-core architecture on the other. Thus, balancing programmability, resource utilization and performance is a key challenge for implementing algorithms on FPGAs.

# 2.5. Summary

Currently, HLS tools are the key to rapidly implementing FPGA-based image processing algorithms or systems. HLS tools can even accept different input languages, such as C/C++, Java, Python and LabVIEW. Users need to use Xilinx Vivado or Intel Quartus Prime to do the integration, and this stage usually requires detailed hardware knowledge.

In terms of the efficiency of implementing image processing algorithms and systems on FPGAs, custom cores have better performance than soft processors, but require users to have detailed hardware knowledge to design efficient accelerators. Soft processors keep the high-level programming model, but single core performance is poor. Users need to use multiple soft processors in order to meet the performance requirements. Figure 1 indicates informally the programmability (ease of use) vs performance (throughput) of the different approaches. Our goal is to move a step closer to achieving both performance and programmability at the same time. For suitable applications, our soft coprocessor approach seeks to have performance approaching HLS products and HDL custom cores, even if it is not as programmable or as general purpose as soft processors.



Figure 1. Qualitative representation of programmability vs performance

To address these challenges and problems, in this paper we present our approach: the soft coprocessor (SCP) approach. This aims to achieve performance closer to custom cores while providing users with a higher-level programming model than the current Vivado toolchain.

# 3. User's View of the Soft Co-Processor Approach

# 3.1. The Concept of Soft Coprocessors

For FPGAs, performance and programmability are in conflict with each other. For specific applications such as image and signal processing, it is sometimes possible to present a higher-level programming model which is less general purpose but can exploit common data access patterns. One of the first uses of coprocessors was in the early days of microprocessor design. For example, the Intel 8086 processor could use a separate, closely integrated 8087 coprocessor chip to increase the speed of floating-point calculations [35]. A coprocessor does not have the usual overhead of the fetch-execute cycle, which is significant for soft processors. We therefore propose the concept of soft coprocessors to try to gain many of the benefits of an application-specific processor but with the efficiency of a coprocessor. All our soft coprocessors have the following basic properties:

• A standard interface for data transfer between soft coprocessors, allowing developers to add a soft coprocessor to a system without having to design custom I/O hardware.

• Each soft coprocessor can be parameterized, allowing a degree of programmability and functional flexibility, but without requiring re-synthesis.

• The soft coprocessors should be able to interact with each other, and to be formed into a DFG arrangement, to reduce communication and buffering overheads. This assumes a stream-based approach.

• Each soft coprocessor should be able to interact with the background control and communication system which manages the operation of the whole FPGA-based system.

FPGAs have a lot of computing resources but a more restrictive on-chip memory model. The efficient use of memory resources is crucial to system performance. Skilled developers can choose the optimal memory management approach from a range of possibilities. However, for application developers, it is difficult to fully exploit the limited on-chip memory resources using HLS tools. To provide optimized memory allocation for point, neighborhood and global operations, we provide three fundamental types of soft coprocessor based on the core Image Algebra (IA) operations.

#### 3.2. Soft coprocessors for stream-based image processing

Stream-based processing using on-chip memory is preferred where possible, since simultaneous access to off-chip memory by multiple coprocessors would be a bottleneck.

For a specific application domain such as image processing, we would like a set of SCPs which can be instantiated, and which cover the common domain operations. A good way to identify such a set is to find an existing algebra for the domain and build on the abstractions which have been identified and used at a mathematical level. In the case of image processing, we have chosen some of the core concepts of Image Algebra (IA).

# 3.3. Single Image Algebra-based SCPs

We provide a built-in library of core SCPs which carry out the core operations of Image Algebra. There are four core classes of IA SCPs, plus a fifth type for compound operations:

(i) Point operations. We provide two types of SCP which apply a point function to every pair of pixels in the two streamed input images (or to each pixel and a scalar parameter), and generate an output pixel stream. The actual function applied is a parameter. The range of point functions includes all the standard (integer) arithmetic, logical and relational functions. For example, a threshold operation would use the image-scalar SCP with the two parameters ($\geq$, threshold value). Image stream pixels are held as 8-bit integers, and intermediate values created during addition, subtraction and multiplication are held in higher precision as necessary.
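As an illustration only, the image-scalar point operation can be modelled in plain software as a function that is selected at run time and applied to every pixel of a stream. The C++ sketch below is not the SCoPeS hardware or its API: the `PointOp` enumeration, the function names, the encoding of relational results as 0 or 255, and the saturation back to 8 bits are all assumptions made for this example. It does reflect the stated idea that pixels are 8-bit while intermediate results are computed at higher precision.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical set of selectable point functions (not the paper's encoding).
enum class PointOp { Add, Sub, Mul, And, Or, Ge, Eq };

// Apply one point function to a pixel and a second operand. The arithmetic is
// done in 32 bits, then clamped to the 8-bit pixel range (assumed behaviour).
inline std::uint8_t apply_point(PointOp op, std::uint8_t a, std::uint8_t b) {
    std::int32_t r = 0;                        // wide enough for 255 * 255
    switch (op) {
        case PointOp::Add: r = a + b; break;
        case PointOp::Sub: r = a - b; break;
        case PointOp::Mul: r = a * b; break;
        case PointOp::And: r = a & b; break;
        case PointOp::Or:  r = a | b; break;
        case PointOp::Ge:  r = (a >= b) ? 255 : 0; break;   // relational -> 0/255
        case PointOp::Eq:  r = (a == b) ? 255 : 0; break;
    }
    if (r < 0) r = 0;
    if (r > 255) r = 255;
    return static_cast<std::uint8_t>(r);
}

// Image-scalar point operation over a pixel stream (modelled as a vector).
std::vector<std::uint8_t> point_image_scalar(const std::vector<std::uint8_t>& in,
                                             PointOp op, std::uint8_t scalar) {
    std::vector<std::uint8_t> out;
    out.reserve(in.size());
    for (std::uint8_t p : in) out.push_back(apply_point(op, p, scalar));
    return out;
}
```

With this model, a threshold at 200 is simply `point_image_scalar(stream, PointOp::Ge, 200)`; switching to another function changes only a parameter, not the code that walks the stream.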

(ii) Neighborhood operations. We provide an SCP for each common size of neighborhood (3×3, 5×5, etc.). The N×N matrix of weights is supplied as a parameter. A standard neighborhood operation has two functions: the point function, which is applied to each pixel-weight pair in the window; then the reduction operation, which reduces the N×N intermediate results to a single pixel result. For example, for a standard convolution, the two function parameters are ($\times$, $\Sigma$). Using this type of SCP, a range of common image processing functions is possible, such as dilation, erosion, convolution-based edge detection, and image filtering.

For example, a simple dilation SCP on a binary image would be an instance of the 3×3 SCP with the kernel weights [1,1,1,1,1,1,1,1,1] and the functions ($\times$, OR) (effectively just a neighborhood OR). An erode SCP would have AND instead of OR as a parameter.

For some operations (perhaps involving image reduction), the window can step by more than one pixel: for example, in the convolution layer of a Convolutional Neural Network (CNN) [36]. This is achieved by having a stride parameter as part of the neighborhood operation SCP. The default stride is 1×1.
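The two-function view of a neighborhood operation, together with the stride parameter, can be sketched in software as follows. This is a plain C++ model written for this explanation, not the SCP implementation: the names `neighborhood3x3`, `dilate`, `erode` and `box_sum` are hypothetical, a single stride value is used for both directions, and only windows that lie fully inside the image are evaluated (border handling is not specified in the text, so that choice is an assumption).

```cpp
#include <cassert>
#include <functional>
#include <vector>

using Image = std::vector<std::vector<int>>;      // rows of pixel values
using BinaryFn = std::function<int(int, int)>;

// Generic 3x3 neighborhood operation with two function parameters.
// 'point' combines each pixel with its weight; 'reduce' folds the nine
// intermediate values into one output pixel.
Image neighborhood3x3(const Image& in, const int (&weights)[3][3],
                      const BinaryFn& point, const BinaryFn& reduce,
                      int stride = 1) {
    assert(stride >= 1);
    const int height = static_cast<int>(in.size());
    const int width = height ? static_cast<int>(in[0].size()) : 0;
    Image out;
    for (int r = 0; r + 3 <= height; r += stride) {
        std::vector<int> row;
        for (int c = 0; c + 3 <= width; c += stride) {
            int acc = point(in[r][c], weights[0][0]);
            for (int k = 1; k < 9; ++k)           // remaining eight positions
                acc = reduce(acc, point(in[r + k / 3][c + k % 3],
                                        weights[k / 3][k % 3]));
            row.push_back(acc);
        }
        out.push_back(row);
    }
    return out;
}

const int kOnes[3][3] = {{1, 1, 1}, {1, 1, 1}, {1, 1, 1}};

// Dilation and erosion of a binary (0/1) image, and a windowed sum, expressed
// purely as parameter choices of the same generic operation.
Image dilate(const Image& in) {
    return neighborhood3x3(in, kOnes, [](int p, int w) { return p * w; },
                           [](int a, int b) { return a | b; });
}
Image erode(const Image& in) {
    return neighborhood3x3(in, kOnes, [](int p, int w) { return p * w; },
                           [](int a, int b) { return a & b; });
}
Image box_sum(const Image& in, int stride) {
    return neighborhood3x3(in, kOnes, [](int p, int w) { return p * w; },
                           [](int a, int b) { return a + b; }, stride);
}
```

Under these assumptions a 5×5 input gives a 3×3 output at stride 1 and a 2×2 output at stride 2, which is the image-reduction effect the stride parameter is meant to provide.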

(iii) Global operations. We provide an SCP which performs a reduction operation on a streamed image. The result is a single value. The available reduction functions include $\Sigma$, $|\Sigma|$, max, min, count, and average. A second global SCP produces a vector as a result (typically used for finding the image histogram).

(iv) Block operations. Sometimes, we need to divide an image into multiple smaller blocks and then apply the same algorithm to each block. For example, for the Histogram of Oriented Gradients (HOG) algorithm, we find a histogram of edge gradients for each block. Thus, we provide a Block-based SCP which provides a neighborhood operation, or other function, for each block separately.

(v) Common complex operations. Although the above basic SCPs can be chained together to perform a compound IA-based algorithm, in practice there are certain common patterns of operations which can be more efficiently implemented as a single operation. We therefore provide a number of pattern-specific SCPs. For example, edge-finding and morphological operations sometimes apply a window in several rotated orientations, and have a final reduction stage to produce a single result. We provide a Cycle Neighborhood SCP which takes as its parameters the weight matrix, the number and step-angle of rotations, the two functions for the neighborhood operation, and the final reduction operation.

For example, suppose we want a complete Sobel edge detection operation using a single complex neighborhood SCP. We supply the kernel (the vertical one, say) and specify two orientations, with a rotation step-angle of 90°. The two neighborhood function parameters are "$\times$" and "$|\Sigma|$" and the vector of kernel weights is [-1, 0, 1, -2, 0, 2, -1, 0, 1]. The final operation to combine the two window outputs (the vertical and horizontal edge strengths) is '+'. (Adding the absolute edge strengths is a common approximation to avoid squaring and adding.) The code to create an instance of the complex neighborhood SCP with all these parameters is shown in Figure 4.
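To make the rotation idea concrete, the following C++ sketch evaluates one 3×3 window in the way just described: the kernel is applied, rotated by 90°, applied again, and the absolute sums are added. It is a software model under stated assumptions, not the Cycle Neighborhood SCP itself: `Mat3`, `rotate90`, `abs_weighted_sum` and `cycle_neighborhood` are hypothetical names, only 90° steps are supported, and the functions ($\times$, $|\Sigma|$, +) are fixed rather than passed as parameters.

```cpp
#include <array>
#include <cstdlib>

using Mat3 = std::array<std::array<int, 3>, 3>;

// Rotate a 3x3 kernel by 90 degrees clockwise.
Mat3 rotate90(const Mat3& k) {
    Mat3 r{};
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            r[i][j] = k[2 - j][i];
    return r;
}

// |sum of pixel x weight| over one window: the (x, |Sigma|) function pair.
int abs_weighted_sum(const Mat3& window, const Mat3& kernel) {
    int sum = 0;
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            sum += window[i][j] * kernel[i][j];
    return std::abs(sum);
}

// Apply the kernel in 'orientations' successive 90-degree rotations and
// combine the per-orientation results with '+'.
int cycle_neighborhood(const Mat3& window, Mat3 kernel, int orientations) {
    int result = 0;
    for (int n = 0; n < orientations; ++n) {
        result += abs_weighted_sum(window, kernel);
        kernel = rotate90(kernel);
    }
    return result;
}

// Kernel weights from the text: [-1, 0, 1, -2, 0, 2, -1, 0, 1].
const Mat3 kSobelVertical = {{ {{-1, 0, 1}}, {{-2, 0, 2}}, {{-1, 0, 1}} }};
```

Calling `cycle_neighborhood(window, kSobelVertical, 2)` gives the approximate Sobel edge strength for that window; the second orientation is derived from the first, so only one kernel has to be supplied.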

#### 3.4. Chaining Multiple Core-SCPs in a Data Flow Graph

Multiple instances of the above generic SCPs can be chained together to implement a compound algebraic expression. The output stream of one SCP is fed directly as the input to the next, without buffering the complete intermediate image and without involving the host processor. Synchronization is handled automatically by the SCP framework. This chaining can be represented by a simple Data Flow Graph (DFG).
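The point about not buffering intermediate images can be illustrated with a very small software model of a chain of point-style stages. The sketch below is an assumption-laden analogy in C++, not the SCP framework: `Stage` and `run_chain` are hypothetical names, and the stages run one after another for each pixel, whereas hardware SCPs would operate concurrently as a pipeline.

```cpp
#include <functional>
#include <vector>

// One point-style stage: pixel in, pixel out.
using Stage = std::function<int(int)>;

// Push each pixel through every stage before reading the next pixel, so no
// intermediate image is ever stored between stages.
std::vector<int> run_chain(const std::vector<int>& stream,
                           const std::vector<Stage>& stages) {
    std::vector<int> out;
    out.reserve(stream.size());
    for (int pixel : stream) {
        for (const Stage& s : stages) pixel = s(pixel);
        out.push_back(pixel);
    }
    return out;
}
```

For example, a chain made of "add 10" followed by "threshold at 100" maps the stream 95, 80, 90 to 255, 0, 255, and changing the algorithm means editing the list of stages rather than the code that moves the data.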

For example, the above Sobel edge detector could have been created using two basic 3x3 neighborhood SCPs feeding their results into a third point SCP.

# 3.5. Skeleton SCPs for Function Specific Coprocessors

Using generic SCPs is useful during the algorithm experimentation stage, because the hardware does not need to be changed even if different functions are selected. However, once the algorithm is finalized, more efficient function-specific coprocessors for compound operations can be created. To make this convenient without requiring hardware knowledge, we provide a set of SCP skeletons. These are effectively 'hollow' codings of the above four classes of SCP (point, neighborhood, global and block). The skeletons contain HLS code to manage the dataflow patterns of each type of operation. In this way, users need only supply the core pixel-level function in the form of a simple C/C++ function. It is in this C function that the user specifies the arbitrarily complex operation. Users can code detailed optimizations, for example, by embedding constant kernel coefficients. An example we will see later is an SCP specifically for a more efficient implementation of the Sobel edge detector.
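As a rough picture of what a neighborhood skeleton has to manage, the following C++ sketch forms 3×3 windows from a raster-order pixel stream using two line buffers, and hands each complete window to a user-supplied function. It is a plain software model, not the HLS skeleton: the line-buffer scheme is a common streaming technique assumed here for illustration, and `neighborhood_skeleton` and `WindowFn` are hypothetical names. Only windows fully inside the image produce an output.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// The user-supplied pixel-level function: one 3x3 window in, one pixel out.
using WindowFn = std::function<int(const int (&)[3][3])>;

std::vector<int> neighborhood_skeleton(const std::vector<int>& stream,
                                       int width, int height,
                                       const WindowFn& user_fn) {
    assert(stream.size() == static_cast<std::size_t>(width) * height);
    std::vector<int> line0(width, 0);   // row r-2, indexed by column
    std::vector<int> line1(width, 0);   // row r-1, indexed by column
    int win[3][3] = {};
    std::vector<int> out;
    for (int r = 0; r < height; ++r) {
        for (int c = 0; c < width; ++c) {
            const int px = stream[r * width + c];
            // Shift the window one column to the left.
            for (int i = 0; i < 3; ++i) {
                win[i][0] = win[i][1];
                win[i][1] = win[i][2];
            }
            // New right-hand column: two buffered rows plus the new pixel.
            win[0][2] = line0[c];
            win[1][2] = line1[c];
            win[2][2] = px;
            // Age the line buffers for this column.
            line0[c] = line1[c];
            line1[c] = px;
            // The window is complete once three rows and columns were seen.
            if (r >= 2 && c >= 2) out.push_back(user_fn(win));
        }
    }
    return out;
}
```

The user function sees only the window, so all buffering and stream handling stays inside the skeleton; a dilation, a Sobel operator or any other 3×3 function can be plugged in without touching that code.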

A new SCP created using our skeletons will need to be synthesized the first time. Once it is added to the SCP library, it is available thereafter.

Function-specific SCPs are commonly used to replace a chain of SCPs, or they can replace a generic SCP with one which is optimized for the specific purpose. For example, a more efficient dilation SCP could be created using the 3×3 neighborhood skeleton, and encoding a simple OR function which avoids the need to apply the redundant '$\times$1' step.

Function-specific SCPs will be more area-efficient than their generic counterparts. Each generic SCP must retain the hardware for all the available functions, in case the user wishes to experiment with different functions during development, without re-synthesis. Of course, the function-specific SCPs are not as functionally flexible. There are also several coding conventions which must be followed for accessing the parameters. This is one of the necessary trade-offs when working with FPGAs.

We now give an example of using a neighborhood skeleton SCP to implement a Sobel operation as a single, efficient function-specific SCP. The code of the Sobel function, including thresholding, is given in Figure 2.

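    _PIXEL_TYPE(PIXEL_SIZE) user_defined_function(_PIXEL_TYPE(PIXEL_SIZE) window[K_X][K_Y])
    {
    #pragma HLS INLINE
        _PIXEL_TYPE(PIXEL_SIZE) valOutput;
        _INT(16) temp1 = 0;
        _INT(16) temp2 = 0;
        _INT(16) tempResult = 0;

        // Horizontal gradient (vertical edges): right column minus left column,
        // with the middle row weighted by 2.
        temp1 = window[0][2] - window[0][0] + window[2][2] - window[2][0];
        temp1 = temp1 + (window[1][2] << 1) - (window[1][0] << 1);

        // Vertical gradient (horizontal edges): bottom row minus top row,
        // with the middle column weighted by 2.
        temp2 = window[2][0] - window[0][0] + window[2][2] - window[0][2];
        temp2 = temp2 + (window[2][1] << 1) - (window[0][1] << 1);

        // Approximate edge strength as the sum of absolute gradients, then threshold.
        tempResult = abs(temp1) + abs(temp2);
        if (tempResult >= 200)
            valOutput = 255;
        else
            valOutput = 0;
        return valOutput;
    }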

Figure 2. The core function for the Sobel Operation when using a skeleton SCP

# 3.6. Generating SCP Configurations

We distinguish between the application program and the hardware configuration it runs on. To avoid frequent re-synthesis, our model is that a (pre-synthesized) configuration contains the set of SCPs which are available to the application developer. Provided the application makes use of only these SCPs, changes to the application can be made without any re-synthesis. There are separate tools for defining the configurations and the application.
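The rule that an application avoids re-synthesis as long as it stays within the configured pool of SCPs amounts to a simple count comparison. The C++ sketch below expresses that check; it is an illustration only, and `ScpCounts`, `fits_configuration` and the SCP class labels used with it are hypothetical rather than part of the SCoPeS tools.

```cpp
#include <map>
#include <string>

// SCP class name -> number of instances (labels are illustrative only).
using ScpCounts = std::map<std::string, int>;

// True if every SCP class the application needs is present in the
// pre-synthesized configuration in at least the required quantity.
bool fits_configuration(const ScpCounts& required, const ScpCounts& available) {
    for (const auto& need : required) {
        if (need.second <= 0) continue;               // nothing requested
        const auto have = available.find(need.first);
        if (have == available.end() || have->second < need.second)
            return false;                             // would need re-synthesis
    }
    return true;
}
```

If the check fails for every configuration in the library, a new configuration has to be generated and synthesized once; after that, further application changes within the new pool are again free of re-synthesis.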

To speed up the process of getting a runnable FPGA configuration, our SCoPeS environment maintains a library of FPGA configurations which contain different mixes of SCPs from the SCP library. The need for this arises because the developer may not know in advance exactly how many instances of each type of SCP will be needed. If the Configuration library does not have the necessary mix for the current project, then we provide a tool which enables the user to create a new SCP configuration. The user can specify the number of each class of SCP, and the Hardware Configuration Generator (HCG) tool will then generate the complete FPGA bitstream, and add it to the Configuration library, as shown in Figure 3. The hardware resources required by the defined configuration must, of course, fit on the target FPGA.



Figure 3. The GUI for creating a new project configuration
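The resource constraint on a configuration can be illustrated with a small C++ sketch. It is not part of the SCoPeS tools: the type and function names are hypothetical, the per-SCP figures are the Min Area values reported later in Table 1, and the device totals are the published figures for the XC7Z020 (106,400 flip-flops, 53,200 LUTs, 140 36 Kb block RAMs, 220 DSP slices). The sum ignores the streamer, the AXI-Stream Interconnect and other infrastructure, so it is an optimistic first check only.

```cpp
#include <cassert>
#include <vector>

// Resource vector: flip-flops, LUTs, block RAMs and DSP slices.
struct Resources {
    double ffs, luts, brams, dsps;
};

// One line of a configuration request: an SCP class and an instance count.
struct ScpRequest {
    Resources per_instance;
    int count;
};

// Min Area figures taken from Table 1 (hypothetical constant names).
const Resources kPointMinArea{1659, 2015, 0, 3};
const Resources kNeighborhoodMinArea{1104, 1404, 5, 9};
const Resources kComplexMinArea{4963, 7141, 5, 72};
const Resources kGlobalMinArea{998, 622, 0, 0};

// Published XC7Z020 programmable-logic totals.
const Resources kXc7z020{106400, 53200, 140, 220};

// Sum the demand of every requested SCP instance.
Resources total_demand(const std::vector<ScpRequest>& mix) {
    Resources t{0, 0, 0, 0};
    for (const ScpRequest& r : mix) {
        t.ffs   += r.per_instance.ffs   * r.count;
        t.luts  += r.per_instance.luts  * r.count;
        t.brams += r.per_instance.brams * r.count;
        t.dsps  += r.per_instance.dsps  * r.count;
    }
    return t;
}

// True if the demand stays within the device capacity in every category.
bool fits(const Resources& demand, const Resources& device) {
    return demand.ffs <= device.ffs && demand.luts <= device.luts &&
           demand.brams <= device.brams && demand.dsps <= device.dsps;
}
```

Under these assumptions the DSP slices are the first limit to be reached: three complex neighborhood SCPs need 216 of the 220 slices, while a fourth would not fit.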


# 3.7. Text-based DFG Code Generator (TCG)

Normally, users could use the default Xilinx SDK to program the Zynq-based hardware platform in bare-metal mode, or use PetaLinux with the Xilinx SDK to build a Linux-based application for more complex applications. At this stage there is no hardware-level design; users can develop their application in C/C++. Users need to use the HLS-exported driver to create their own initialization function, set all the parameters individually, and invoke them during execution. We use the AXI-Stream Interconnect interface for connecting all our coprocessors (see later) to match the user-supplied DFG. This textual DFG specifies the coprocessor instances, their parameters, and their interconnection channels. Our Textual Code Generator (TCG) tool takes the text representation of the DFG and generates the executable C code for the Xilinx SDK. This simplifies and speeds up the development of the final application.

As an example, Figure 4a shows the developer's code (the textual DFG) for an automatic thresholding system using the Otsu method after an Open operation (assuming we have already written the final Otsu SCP to select and apply the threshold using our skeletons). We fix the entry point of the system as the 'Streamer', a block which is directly connected to the camera and which generates a stream containing all the parameters and the image data. There is no need to define the input source of each SCP because in this system we cannot split the stream. We use 'Streamer' to set the first output channel to channel 3. We then do the dilation and erosion through neighborhood operations. After that, we do the edge detection, histogram finding and Otsu thresholding. The resulting image stream is returned through channel 2. Figure 4b outlines the code generated for the Xilinx SDK from the DFG in Figure 4a.

| Streamer(3);                                        |
|-----------------------------------------------------|
| NeighborhoodOP([1,1,1,1,1,1,1,1,1], "*", "or", 4);  |
| NeighborhoodOP([1,1,1,1,1,1,1,1,1], "*", "and", 5); |
| Sobel(6);                                           |
| GlobalOP("Histogram", 7);                           |
| Otsu(2);                                            |

Figure 4a. Example textual description of a DFG for Otsu after an Open operation
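To make the structure of the textual DFG concrete, the following C++ sketch parses one statement of the form shown in Figure 4a into an SCP name, its raw parameter text and its output channel. It is only an illustration of the first step such a generator might take, not the TCG tool itself; `DfgNode` and `parse_dfg_line` are hypothetical names, and the sketch assumes the output channel is always the last argument, as it is in every line of Figure 4a.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// One parsed statement of the textual DFG.
struct DfgNode {
    bool ok = false;        // false if the line could not be parsed
    std::string name;       // SCP type, e.g. "NeighborhoodOP"
    std::string params;     // raw text of the arguments before the channel
    int out_channel = -1;   // last argument: the output channel
};

// Parse a line such as: NeighborhoodOP([1,1,1], "*", "or", 4);
DfgNode parse_dfg_line(const std::string& line) {
    DfgNode node;
    const std::size_t open = line.find('(');
    const std::size_t close = line.rfind(')');
    if (open == std::string::npos || close == std::string::npos || close < open) {
        return node;
    }
    node.name = line.substr(0, open);
    const std::string args = line.substr(open + 1, close - open - 1);
    // The channel follows the last comma; commas inside a [...] kernel
    // always appear earlier, so searching from the right is sufficient.
    const std::size_t comma = args.rfind(',');
    const std::string channel =
        (comma == std::string::npos) ? args : args.substr(comma + 1);
    node.params = (comma == std::string::npos) ? "" : args.substr(0, comma);
    try {
        node.out_channel = std::stoi(channel);  // tolerates leading spaces
    } catch (const std::exception&) {
        return node;  // channel missing or not a number
    }
    node.ok = true;
    return node;
}
```

For the first line of Figure 4a, `parse_dfg_line("Streamer(3);")` yields the name `Streamer`, an empty parameter string and output channel 3.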




Figure 4b. Example of the code generated by the TCG tool from the DFG in Figure 4a

In Figure 4b, the first block on the right is the output from the first pass of our TCG tool through the text description, which generates all the necessary header files based on the names of the functions. The second block shows the generated initialization functions. The main body of the program is then generated based on the text-based DFG.

# 3.8. Using the SCoPeS Development Environment

Our SCoPeS development environment includes the tools necessary to build an application using the SCP library, as mentioned above. It is currently a prototype IDE. The typical design flow for a new project/application is thus as follows:

1. Decompose the desired algorithms into IA expressions.
2. Select (or create) a suitable configuration from the Configuration library. (We can select a different one later if we run out of instances of a certain type of SCP.)
3. Define each algorithm as a Data Flow Graph (DFG) and use the TCG tool to set up the system defined by the DFG.
4. Experiment with the system until the functions and parameters are finalised.
5. If necessary, design function-specific coprocessors to replace some of the IA-based SCPs selected in step 2.
6. If step 5 was utilized, import the function-specific coprocessors into the system and re-synthesize the system configuration.

# 4. Architectures and Implementations of Coprocessors

In this section, we discuss some key implementation aspects of the SCP approach, including the architecture for single IA-based SCPs, hardware skeletons and hardware configurations.

# 4.1. SCP Architectures for Image Algebra Operation Types


When implementing the SCPs on FPGAs, the use of the internal memory depends on the type of operation. Point operations usually do not need image buffers; neighborhood operations require line buffers to hold the relevant pixels within the window, depending on the size of the convolution kernel. Some global operations do not require any buffering, but some function-specific global SCPs, such as Otsu adaptive thresholding [30], may need a whole frame buffer to hold the frame until the end of the frame has been processed. When creating an instance of one type of SCP, the optimized data handling then comes for free. Figure 5 shows how we handle the data flow and buffering in different types of SCPs. Since we are using HLS to implement these SCPs, the details of the architectures are hidden from us, and we only have control over the data flow and buffering.

The Point operation SCP reads the next pixel from the input stream and performs the calculation before pushing the result to the output stream. With pipelining, one pixel is output every clock cycle.

For the neighborhood operation SCP (e.g. convolution), the example architecture of a generic 3×3 neighborhood operation is shown in Figure 5. As the streamed pixels arrive, we use a BRAM-based line buffer to hold two lines and two pixels. When the third pixel of the third line arrives, we have the whole window ready for a neighborhood operation to produce one single output pixel. Then we increment the window position, read one more pixel, and do the next neighborhood operation. The neighborhood calculation in our generic operator is divided into two stages. In the first stage, each image pixel in the window is combined pairwise with the corresponding value in the kernel (the matrix of window weights supplied by the user). These intermediate results are then reduced in the second stage. (For convolution, this would be an accumulation operation.)
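The line buffering and the two-stage calculation can be sketched as a software model in C++. This is a behavioural illustration only, not the HLS source of the SCP: the names are hypothetical, the buffers are ordinary vectors rather than BRAM, and the model emits output only for positions where the full 3×3 window lies inside the image, so a W×H input yields (W-2)×(H-2) results.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

using Pixel = int;
using Kernel = std::array<std::array<int, 3>, 3>;

// All-ones kernel, as used in the DFG example of Figure 4a.
const Kernel kOnes = {{ {{1, 1, 1}}, {{1, 1, 1}}, {{1, 1, 1}} }};

// Streamed 3x3 neighborhood operation: stage 1 combines each window pixel
// with its kernel weight, stage 2 reduces the nine intermediate results.
std::vector<Pixel> neighborhood3x3(
    const std::vector<Pixel>& stream, int width, int height,
    const Kernel& kernel,
    const std::function<Pixel(Pixel, int)>& combine,
    const std::function<Pixel(Pixel, Pixel)>& reduce) {
    // line0 holds row y-2 and line1 holds row y-1 at each column.
    std::vector<Pixel> line0(width, 0), line1(width, 0);
    Pixel win[3][3] = {};
    std::vector<Pixel> out;
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            const Pixel p = stream[static_cast<std::size_t>(y) * width + x];
            // Slide the window one column to the left.
            for (int r = 0; r < 3; ++r) {
                win[r][0] = win[r][1];
                win[r][1] = win[r][2];
            }
            // New right-hand column: two buffered rows plus the new pixel.
            win[0][2] = line0[x];
            win[1][2] = line1[x];
            win[2][2] = p;
            // Age the line buffers at this column.
            line0[x] = line1[x];
            line1[x] = p;
            if (y >= 2 && x >= 2) {  // window is completely filled
                Pixel acc = combine(win[0][0], kernel[0][0]);
                for (int i = 1; i < 9; ++i) {
                    acc = reduce(acc, combine(win[i / 3][i % 3], kernel[i / 3][i % 3]));
                }
                out.push_back(acc);
            }
        }
    }
    return out;
}
```

With `std::multiplies<int>()` as the combine stage and `std::plus<int>()` as the reduce stage this model performs a convolution-style accumulation; replacing the reduce stage with `std::logical_or<int>()` or `std::logical_and<int>()` on a binary image gives the dilation-like and erosion-like behaviour suggested by the `"or"` and `"and"` arguments in Figure 4a.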

As a global operation can reduce a streamed input image to either a scalar result or a vector result, two versions of the global SCP, R2S and R2V, are available. Sometimes the result of a global operation is subsequently used to process the same image (e.g. to threshold an image based on its average pixel value). In this case, it is necessary to buffer the whole input image in an image buffer. Thus, in the architecture for a global operation SCP (Figure 5), when a streamed image comes from a camera, another SCP or a file, users can choose whether they need a built-in frame buffer before pushing the result pixel. The calculation for the global operation can be done while the input frame is being buffered or streamed, since the global SCPs are fully pipelined. Supported operations include Min, Max, Σ, |Σ|, Count and Global Average, and are applied to give either a scalar or a vector result. An image histogram can be obtained by selecting the R2V SCP and specifying the address in BRAM where the vector will be stored, so that subsequent SCPs can access the result directly. However, when internal memory allocation such as a frame buffer is needed, re-synthesis may be required.
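The two reduction styles can be sketched in C++ as follows. This is a software illustration under stated assumptions, not the SCP implementation: the function names are hypothetical, the histogram is returned by value rather than written to a BRAM address, and the second function models the buffered case by accumulating the sum while the frame is stored and then thresholding the stored frame at its average.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// R2V-style reduction: a streamed 8-bit image is reduced to a
// 256-bin histogram vector.
std::array<std::uint32_t, 256> histogram_r2v(const std::vector<std::uint8_t>& stream) {
    std::array<std::uint32_t, 256> bins{};
    for (std::uint8_t p : stream) {
        ++bins[p];
    }
    return bins;
}

// R2S-style reduction with a frame buffer: the sum is accumulated while
// the frame is buffered, then the buffered pixels are thresholded at the
// global (integer) average.
std::vector<std::uint8_t> threshold_at_average(const std::vector<std::uint8_t>& stream) {
    std::vector<std::uint8_t> frame_buffer;
    frame_buffer.reserve(stream.size());
    std::uint64_t sum = 0;
    for (std::uint8_t p : stream) {
        sum += p;                   // global operation runs during buffering
        frame_buffer.push_back(p);
    }
    if (frame_buffer.empty()) {
        return frame_buffer;
    }
    const std::uint64_t average = sum / frame_buffer.size();
    for (std::uint8_t& p : frame_buffer) {
        p = (p >= average) ? 255 : 0;
    }
    return frame_buffer;
}
```

For the four-pixel stream {10, 20, 30, 200} the integer average is 65, so only the last pixel is set to 255.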



Figure 5. Data flow and buffering for the four different operation types (clockwise: Global, Neighborhood, Block and Point operations)


The Block Operation can be regarded as a special neighborhood operation which operates on a stream of blocks. This requires an outer level of processing to extract blocks in order, and to stream each block to the neighborhood operation. For each block, we can do any neighborhood-based operation. When performing a neighborhood operation (e.g. 3×3) on a block, we must allow for the edge effect at block boundaries. Therefore, the block buffer is one column larger (for a 3×3 operation) than the original block (see Figure 5). Also, the buffering hardware in SCPs handles any block stride length dynamically, as it is sometimes useful to experiment with different block strides at runtime.

Complex SCPs, which perform a neighborhood operation with a kernel in multiple orientations, avoid the need to replicate the line buffer. Using the complex neighborhood SCP, and supplying the appropriate kernel plus the rotation parameters, we can do these operations in a single pass of the stream. This solution uses only a single line buffer.
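The idea of applying one kernel in several orientations to the same window can be sketched in C++. The sketch assumes rotation in 90° steps and combines the two responses by summing their absolute values, as in the Sobel example; how the rotation parameters of the actual complex SCP are encoded is not shown here, and all names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstdlib>

using Mat3 = std::array<std::array<int, 3>, 3>;

// One Sobel kernel and its 90-degree (clockwise) rotation.
const Mat3 kSobelX = {{ {{-1, 0, 1}}, {{-2, 0, 2}}, {{-1, 0, 1}} }};
const Mat3 kSobelY = {{ {{-1, -2, -1}}, {{0, 0, 0}}, {{1, 2, 1}} }};

// Rotate a 3x3 kernel clockwise by 90 degrees.
Mat3 rotate90(const Mat3& k) {
    Mat3 r{};
    for (int i = 0; i < 3; ++i) {
        for (int j = 0; j < 3; ++j) {
            r[i][j] = k[2 - j][i];
        }
    }
    return r;
}

// Multiply-accumulate of a window with a kernel.
int apply(const Mat3& window, const Mat3& k) {
    int acc = 0;
    for (int i = 0; i < 3; ++i) {
        for (int j = 0; j < 3; ++j) {
            acc += window[i][j] * k[i][j];
        }
    }
    return acc;
}

// Both orientations are evaluated on the same window, so the window
// (and hence the line buffer feeding it) is needed only once.
int two_orientation_response(const Mat3& window, const Mat3& k) {
    return std::abs(apply(window, k)) + std::abs(apply(window, rotate90(k)));
}
```

Rotating `kSobelX` once gives `kSobelY`, so a single supplied kernel plus a rotation count is enough to describe both gradient components.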

# 4.2. Communication between Coprocessors

To allow users to change the DFG interconnections between SCPs without re-synthesis, we use AXI-Stream Interconnect (a Xilinx-provided IP core) to connect SCPs instead of using naïve FIFOs. Each SCP has a 'TDEST' input to indicate where its output stream goes in the AXI-Stream Interconnect system.

When there are many SCPs in the application, there will be many parameters to be sent to the various SCPs, so it is crucial to find an efficient way of distributing these parameters. We would also like parameter distribution to be dynamic (in the sense that parameters can be changed while the program is running). Our solution is to send the parameters as part of the header package for every new frame. It would be possible to send them using the ARM processor through the AXI bus using the AXI-Lite interface [32] by enabling the data stream [33], but the ARM would have to work sequentially in sending all the parameters every frame, which is time-consuming when there are many SCPs involved. This is why our approach is to group the command and data together by appending the parameters to the front of each frame in the image data stream.

The parameter stream is illustrated in Figure 6. The parameter stream comprises, for each SCP, the ID of the SCP, its various parameters, and the output channel (TDEST). Because we fix the entry point of the system to be the streamer, in this particular case we only need to define the output channel of each SCP. (More generally, of course, both the input and output channels would be defined.) Each SCP receives the complete parameter stream for all SCPs; it extracts only those parameters relevant to it, passes the parameter stream on to the output channel (the next SCP), and then starts processing the image data which follows the parameter section.
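How an SCP might pick its own record out of the shared parameter header can be sketched in C++. The word-level layout used here is an assumption made purely for illustration (a record count, then for each SCP its ID, a parameter count, the parameters and the TDEST value); the actual format is the one shown in Figure 6, and the names below are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// What one SCP learns from the header of a frame.
struct ScpConfig {
    bool found = false;                 // true if a record for this SCP exists
    std::vector<std::uint32_t> params;  // the parameters of this SCP
    std::uint32_t tdest = 0;            // output channel of this SCP
    std::size_t header_words = 0;       // image data starts at this index
};

// Example stream under the assumed layout:
// [2 records] [id 7, 1 param, 128, tdest 3] [id 9, 0 params, tdest 2] [pixels 42, 43]
const std::vector<std::uint32_t> kExampleStream = {2, 7, 1, 128, 3, 9, 0, 2, 42, 43};

// Walk the whole header, keep only the record whose ID matches my_id,
// and report where the image data begins. A truncated header yields an
// empty result with found == false.
ScpConfig extract_params(const std::vector<std::uint32_t>& stream, std::uint32_t my_id) {
    ScpConfig cfg;
    if (stream.empty()) {
        return cfg;
    }
    std::size_t pos = 1;
    for (std::uint32_t rec = 0; rec < stream[0]; ++rec) {
        if (pos + 2 > stream.size()) {
            return ScpConfig{};
        }
        const std::uint32_t id = stream[pos];
        const std::size_t n = stream[pos + 1];
        if (pos + 2 + n + 1 > stream.size()) {
            return ScpConfig{};
        }
        if (id == my_id) {
            cfg.found = true;
            cfg.params.assign(stream.begin() + pos + 2, stream.begin() + pos + 2 + n);
            cfg.tdest = stream[pos + 2 + n];
        }
        pos += 2 + n + 1;  // skip to the next record
    }
    cfg.header_words = pos;
    return cfg;
}
```

In the example stream, the SCP with ID 7 reads one parameter (128) and output channel 3, the SCP with ID 9 reads output channel 2, and both find the first pixel at index 8 after forwarding the unchanged header.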

# 4.3. Coding SCPs behind the Scenes

We created the Image Algebra-based soft coprocessors using Xilinx Vivado HLS. For interoperability of SCPs, the way of interfacing any coprocessor to the rest of the system is always the same.

When the developer introduces a new SCP instance in the textual DFG description, behind the scenes one of the free instances of the SCP is acquired from those still available in the user-selected configuration. The parameters in the DFG are used by the TCG tool to generate and set the various properties of the SCP in an object-oriented fashion. Code is also generated to form the connections via the channels in the AXI interconnection scheme described above. This code is for the Xilinx SDK, after the hardware platform has already been defined and synthesized. For example, Figure 7 shows the TCG-generated code for the Xilinx SDK to set up a complex SCP (of type NeighOP2) followed by a thresholding SCP (of type PointOP) for the Sobel operation outlined previously, based on a two-step rotating kernel.


When implementing designs using Xilinx Vivado HLS, directive settings (or pragmas) can have a significant effect on hardware utilization and performance. A design optimized with well-chosen directives can be several times more efficient than an unoptimized design. Mastering these directive settings takes a lot of time and requires a deeper understanding of how the hardware works. We therefore developed our own internal library of reusable macros and reserved variables, which we used to simplify and standardize the HLS coding of all the IA SCPs. These macros are also available to the developer when creating skeleton-based function-specific SCPs and when writing the low-level C function. This library is not normally required to be visible to the developer, but we mention it as a valuable approach to simplify the retargeting of our HLS coding of SCPs and skeletons to another type of FPGA. This internal library includes:

- Interface settings
- Pipelining directives
- Buffer settings
- Special data types and hardware-level signal handling



Figure 7. From Text-based DFG to Hardware Platform through Xilinx SDK

# 5. Evaluation and Comparisons

In this section, we present some details of the performance and hardware utilization of the SCPs. We use the Xilinx Zedboard with an I2C OV7670 camera module as the test platform. The OV7670 camera can produce a 640×480 8-bit greyscale video stream and can be connected to the Zedboard. The Zedboard is equipped with an XC-7Z020 FPGA, which has programmable logic (PL) and an ARM processor. We implement our designs on the Zedboard and evaluate two different versions of our IA-based SCPs: the Minimum Area mode and the Maximum Performance mode (these have to be synthesized separately). We compare example operations using SCPs with equivalent implementations using the image processing soft processor, IPPro. Finally, we also compare the use of a generic (complex) single SCP formulation of a Sobel operator with an equivalent function-specific SCP created using a neighborhood skeleton SCP.

# 5.1. Performance and Hardware Utilization


Table 1 shows the SCPs' hardware utilization and performance (in frames per second) on a Virtex FPGA running at 150 MHz in Minimum Area mode. This is compared with the utilization and performance of the soft processor-based solution using a multi-core IPPro. The comparison is for four basic SCP operations (point, neighborhood, complex and global). Table 2 shows the equivalent figures using Max Performance mode for the SCPs. In both cases, the image size is 512×512 and, in the neighborhood operation SCP, the kernel is a 3×3 matrix.

**Table 1** Comparison Between SCP (in Min Area Mode) and IPPro Approach in Utilization and Performance

| SCPs                        | FFs   | LUTs  | BRAMs | DSPs | FPS |
|-----------------------------|-------|-------|-------|------|-----|
| Point                       | 1659  | 2015  | 0     | 3    | 186 |
| Neighborhood Basic          | 1104  | 1404  | 5     | 9    | 127 |
| Neighborhood Complex        | 4963  | 7141  | 5     | 72   | 125 |
| Global                      | 998   | 622   | 0     | 0    | 189 |
| **IPPro [15]**              | **FFs** | **LUTs** | **BRAMs** | **DSPs** | **FPS** |
| Point (8-core)              | 12279 | 10941 | 18.5  | 8    | 120 |
| Neighborhood Basic (6-core) | 13202 | 11826 | 32.5  | 6    | 76  |

**Table 2** SCP Utilization and Performance (in Max Performance Mode)

| SCPs                 | FFs  | LUTs  | BRAMs | DSPs | FPS |
|----------------------|------|-------|-------|------|-----|
| Point                | 3346 | 2965  | 0     | 3    | 556 |
| Neighborhood Basic   | 2309 | 1963  | 5     | 9    | 380 |
| Neighborhood Complex | 9862 | 12368 | 5     | 72   | 374 |
| Global               | 1432 | 1353  | 0     | 0    | 568 |

The first observation is on the difference between Min Area mode and Max Performance mode. Max Performance mode is roughly three times as fast as Min Area mode, but takes twice as much area. However, in practice there may be no advantage in being able to process at nearly 400 FPS, and so the Min Area mode is often to be preferred.
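As a rough cross-check of these frame rates, the upper bound for a fully streamed design can be computed from the clock frequency and the image size. The C++ helper below is a back-of-the-envelope sketch (the function name is hypothetical), and the cycles-per-pixel values are our reading of the tables rather than figures stated in the evaluation.

```cpp
#include <cassert>
#include <cmath>

// Upper bound on frames per second when every pixel of a width x height
// frame costs a fixed number of clock cycles and nothing else stalls.
double ideal_fps(double clock_hz, int width, int height, double cycles_per_pixel) {
    return clock_hz / (static_cast<double>(width) * height * cycles_per_pixel);
}
```

At 150 MHz and one pixel per clock cycle, a 512×512 frame gives a bound of about 572 FPS, which is close to the 556 FPS and 568 FPS reported for the point and global SCPs in Table 2. The Min Area figures of 186 to 189 FPS would be consistent with roughly three cycles per pixel (a bound of about 191 FPS), although that interpretation is an inference.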

To make comparison with various IPPro configurations easier, Table 3 shows the normalized inverse ratios of performance and resources (to one decimal place) based on the data in Tables 1 and 2 (first for Min Area mode and then for Max Performance mode). Note that, in the performance ratio, a value greater than 1 in the IPPro rows indicates the degree to which IPPro is worse than SCP. Likewise, in the utilization columns, a value greater than 1 in the IPPro rows indicates the degree to which IPPro uses more resources than SCP. Thus in Max Performance mode SCPs process 4.63 times faster than IPPro in point operations and 7.31 times faster in neighborhood operations, while using less hardware than IPPro. This is partly because the IPPro has to go through the standard fetch-execute cycle. In Min Area mode, the SCP is a little faster than the IPPro, yet uses only 20% of the resources (apart from DSPs), as Table 3 shows.
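As a worked check of how these ratios are formed, the point-operation entries can be recomputed from the figures in Tables 1 and 2, taking the SCP frame rate over the IPPro frame rate for performance and the IPPro resource count over the SCP resource count for utilization:

$$
\begin{aligned}
\text{FPS, Max Performance:} &\quad 556 / 120 \approx 4.63 \\
\text{FPS, Min Area:} &\quad 186 / 120 \approx 1.55 \\
\text{FFs, Min Area:} &\quad 12279 / 1659 \approx 7.4 \\
\text{LUTs, Min Area:} &\quad 10941 / 2015 \approx 5.4 \\
\text{FFs, Max Performance:} &\quad 12279 / 3346 \approx 3.7 \\
\text{LUTs, Max Performance:} &\quad 10941 / 2965 \approx 3.7 \\
\text{DSPs, both modes:} &\quad 8 / 3 \approx 2.7
\end{aligned}
$$

No BRAM ratio can be formed for the point operation because the point SCP uses no BRAM.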


**Table 3** Inverse Ratios for SCP to IPPro for Performance and Utilization (>1 is worse)


| Mode            | Operation    | Approach       | Freq    | FPS | FFs | LUTs | BRAMs | DSPs |
|-----------------|--------------|----------------|---------|-----|-----|------|-------|------|
| Min Area        | Point        | SCP            | 150 MHz | 1   | 1   | 1    | 1     | 1    |
|                 |              | IPPro (8 core) | 150 MHz | 1.5 | 7.4 | 5.4  |       | 2.7  |
|                 | Neighborhood | SCP            | 150 MHz | 1   | 1   | 1    | 1     | 1    |
|                 |              | IPPro (6 core) | 150 MHz | 2.4 | 8.0 | 5.9  |       | 2.0  |
| Max Performance | Point        | SCP            | 150 MHz | 1   | 1   | 1    | 1     | 1    |
|                 |              | IPPro (8 core) | 150 MHz | 4.6 | 3.7 | 3.7  |       | 2.7  |
|                 | Neighborhood | SCP            | 150 MHz | 1   | 1   | 1    | 1     | 1    |
|                 |              | IPPro (6 core) | 150 MHz | 7.3 | 5.7 | 6.0  |       | 0.7  |

To illustrate the benefit of using a function-specific SCP, we choose Sobel for our final comparison. We compare the generic complex SCP with a function-specific SCP in doing a Sobel operation in Table 4.

Table 4. Comparison between a Generic and a Function-specific SCP

|                   | FFs  | LUTs  | BRAMs | DSPs | FPS |
|-------------------|------|-------|-------|------|-----|
| Generic           | 9862 | 12368 | 5     | 72   | 125 |
| Function-specific | 932  | 1107  | 2     | 3    | 128 |

Interestingly, the generic SCP approach and the function-specific SCP have very similar performance (around 125 FPS for a 640×480 video stream). However, the skeleton approach is clearly much more area efficient (by a factor of approximately 10), because it removes all the unused function logic which is part of the generic SCP.

# 6. Conclusion

In this paper, we have presented several concepts and tools which are intended to make it easier for application developers to achieve FPGA-based acceleration of image and video processing systems while designing at a high level. By 'high-level', we do not mean merely using the syntax of a high-level language; we mean designing systems with no, or as little as possible, hardware knowledge. Where it becomes necessary to drop down into hardware design, we have introduced approaches and customizable components intended to abstract away many of the hardware-aware details.

Our main specific conclusions are as follows:

1) We propose the concept of soft coprocessors, which are single-instruction processors which can be parameterized to support a range of different functions. SCPs can be assembled into a DFG for efficient stream-based processing.

2) The SCPs allow users to conveniently design and experiment with an image processing application by chaining SCPs together. We use AXI-Stream Interconnect to connect all the SCPs in the system in a way which reflects the algorithm's Dataflow Graph (DFG). In this way, we provide users with a flexible system which can be programmed as a textual DFG. Users do not need to re-synthesize when they change the DFG.

3) We provide reusable SCP skeletons to allow developers to create efficient function-specific soft coprocessors without needing to know (much) about hardware structures.


4) We have provided a set of generator tools which comprise the SCoPeS environment, a prototype IDE to support the SCP concept.

5) Overall, we conclude that the soft coprocessor approach has the potential to deliver better performance than the soft processor approach, and can improve programmability over dedicated HDL cores for domain-specific applications while achieving competitive real-time performance and utilization.

However, our work also has the following main limitations:

1) Our current work is designed only for image and video processing development, and is not a general-purpose tool. However, as a general rule, the coprocessor approach is suited to any application area which has an associated underpinning algebra.

2) Our implementation currently only supports relatively simple DFGs.

3) Our tools do not yet support image partitioning for greater parallelism, which can be a useful additional technique for accelerating image processing applications. Updating our tools to include this option of a multi-core approach is a promising future development.

# Acknowledgement

This work was sponsored by the China Scholarship Council.

# References

|     |     |
|-----|-----|
| 1.  | Hong, D.; Han, Z.; Yao, J.; Gao, L.; Zhang, B.; Plaza, A.; Chanussot, J. SpectralFormer: Rethinking hyperspectral image classification with transformers. IEEE Transactions on Geoscience and Remote Sensing 2021. |
| 2.  | Wu, T.; Yang, Z. Animal tumor medical image analysis based on image processing techniques and embedded system. Microprocessors and Microsystems 2021, 81, 103671. |
| 3.  | Khasanova, A.; Makhmutova, A.; Anikin, I. Image Denoising for Video Surveillance Cameras Based on Deep Learning Techniques. In Proceedings of the 2021 International Conference on Industrial Engineering, Applications and Manufacturing (ICIEAM), 2021; pp. 713-718. |
| 4.  | Kalinowska, K.; Wojnowski, W.; Tobiszewski, M. Smartphones as tools for equitable food quality assessment. Trends in Food Science & Technology 2021. |
| 5.  | Nguyen, M.T.; Truong, L.H.; Le, T.T. Video surveillance processing algorithms utilizing artificial intelligent (AI) for unmanned autonomous vehicles (UAVs). MethodsX 2021, 8, 101472. |
| 6.  | Aslan, S.; Güdükbay, U.; Töreyin, B.U.; Çetin, A.E. Deep convolutional generative adversarial networks based flame detection in video. arXiv preprint arXiv:1902.01824 2019. |
| 7.  | Arvin, R.; Khattak, A.J.; Qi, H. Safety critical event prediction through unified analysis of driver and vehicle volatilities: Application of deep learning methods. Accident Analysis & Prevention 2021, 151, 105949. |
| 8.  | Siska, J.; Jaeschke, T.; Wagner, J.; Pohl, N. FPGA-Accelerated Multispectral Ultra-High Resolution SAR-Imaging with Wideband FMCW Radars. In Proceedings of the 2019 IEEE Radio and Wireless Symposium (RWS), 2019; pp. 1-4. |
| 9.  | Attaran, N.; Puranik, A.; Brooks, J.; Mohsenin, T. Embedded low-power processor for personalized stress detection. IEEE Transactions on Circuits and Systems II: Express Briefs 2018, 65, 2032-2036. |
| 10. | Chen, X.; Tan, H.; Chen, Y.; He, B.; Wong, W.-F.; Chen, D. ThunderGP: HLS-based graph processing framework on FPGAs. In Proceedings of the 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2021; pp. 69-80. |
| 11. | Yuan, H.; Ding, D.; Fan, Z.; Sun, Z. A Real-time Image Processing Hardware Acceleration Method based on FPGA. In Proceedings of the 2021 6th International Conference on Computational Intelligence and Applications (ICCIA), 2021; pp. 200-205. |
| 12. | Xiao, Z.; Chamberlain, R.D.; Cabrera, A.M. HLS Portability from Intel to Xilinx: A Case Study. In Proceedings of the 2021 IEEE High Performance Extreme Computing Conference (HPEC), 2021; pp. 1-8. |
| 13. | Winterstein, F.; Bayliss, S.; Constantinides, G.A. High-level synthesis of dynamic data structures: A case study using Vivado HLS. In Proceedings of the 2013 International Conference on Field-Programmable Technology (FPT), 2013; pp. 362-365. |
| 14. | Liu, S.; Lau, F.C.; Schafer, B.C. Accelerating FPGA prototyping through predictive model-based HLS design space exploration. In Proceedings of the 2019 56th ACM/IEEE Design Automation Conference (DAC), 2019; pp. 1-6. |
| 15. | Coussy, P.; Gajski, D.D.; Meredith, M.; Takach, A. An introduction to high-level synthesis. IEEE Design & Test of Computers 2009, 26, 8-17. |
| 16. | O'Loughlin, D.; Coffey, A.; Callaly, F.; Lyons, D.; Morgan, F. Xilinx Vivado high level synthesis: Case studies. 2014. |
| 17. | Gaide, B.; Gaitonde, D.; Ravishankar, C.; Bauer, T. Xilinx adaptive compute acceleration platform: Versal™ architecture. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2019; pp. 84-93. |
| 18. | Chatarasi, P.; Neuendorffer, S.; Bayliss, S.; Vissers, K.; Sarkar, V. Vyasa: A high-performance vectorizing compiler for tensor convolutions on the Xilinx AI Engine. In Proceedings of the 2020 IEEE High Performance Extreme Computing Conference (HPEC), 2020; pp. 1-10. |
| 19. | Kathail, V.; Hwang, J.; Sun, W.; Chobe, Y.; Shui, T.; Carrillo, J. SDSoC: A higher-level programming environment for Zynq SoC and Ultrascale+ MPSoC. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2016; pp. 4-4. |
| 20. | Domingo, R.; Salvador, R.; Fabelo, H.; Madronal, D.; Ortega, S.; Lazcano, R.; Juárez, E.; Callicó, G.; Sanz, C. High-level                                                                                                | 684        |
|     | design using Intel FPGA OpenCL: A hyperspectral imaging spatial-spectral classifier. In Proceedings of the 2017 12th In-                                                                                                  | 685        |
| 01  | ternational Symposium on Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC), 2017; pp. 1-8.                                                                                                                   | 686        |
| 21. | Canis, A.; Choi, J.; Fort, B.; Syrowik, B.; Lian, R.L.; Chen, Y.T.; Hsiao, H.; Goeders, J.; Brown, S.; Anderson, J. Legup high-                                                                                           | 687<br>682 |
| 22. | level synthesis. In FPGAs for Software Programmers; Springer: 2016; pp. 175-190.<br>Wakabayashi, K. CyberWorkBench: integrated design environment based on C-based behavior synthesis and verification.                   | 688<br>689 |
| ∠∠. | In Proceedings of the 2005 IEEE VLSI-TSA International Symposium on VLSI Design, Automation and Test, 2005.(VLSI-                                                                                                         | 689<br>690 |
|     | TSA-DAT). 2005; pp. 173-176.                                                                                                                                                                                              | 691        |
| 23. | Guo, L.; Chi, Y.; Wang, J.; Lau, J.; Qiao, W.; Ustun, E.; Zhang, Z.; Cong, J. AutoBridge: Coupling Coarse-Grained Floorplan-                                                                                              | 692        |
| -   | ning and Pipelining for High-Frequency HLS Design on Multi-Die FPGAs. In Proceedings of the The 2021 ACM/SIGDA                                                                                                            | 693        |
|     | International Symposium on Field-Programmable Gate Arrays, 2021; pp. 81-92.                                                                                                                                               | 694        |
|     |                                                                                                                                                                                                                           |            |

- 24. Noronha, D.H.; Salehpour, B.; Wilton, S.J. LeFlow: Enabling flexible FPGA high-level synthesis of tensorflow deep neural networks. In Proceedings of the FSP Workshop 2018; Fifth International Workshop on FPGAs for Software Programmers, 2018; pp. 1-8.
- 25. Hebbar SR, R.; Milenković, A. SPEC CPU2017: Performance, event, and energy characterization on the core i7-8700K. In Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering, 2019; pp. 111-118.
- 26. Beutel, J.; Trüb, R.; Forno, R.D.; Wegmann, M.; Gsell, T.; Jacob, R.; Keller, M.; Sutton, F.; Thiele, L. The dual processor platform architecture: demo abstract. In Proceedings of the 18th International Conference on Information Processing in Sensor Networks, 2019; pp. 335-336.
- 27. Bellemou, A.; Benblidia, N.; Anane, M.; Issad, M. Microblaze-based multiprocessor embedded cryptosystem on FPGA for elliptic curve scalar multiplication over Fp. Journal of Circuits, Systems and Computers 2019, 28, 1950037.
- 28. Shamseldin, A.; Soubra, H.; ElNabawy, R. Performance of DSP operations implemented using a soft microprocessor: a case study based on Nios II. In Proceedings of the 2021 International Conference on Microelectronics (ICM), 2021; pp. 66-69.
- 29. Mplemenos, G.-G.; Papaefstathiou, I. Mplem: An 80-processor fpga based multiprocessor system. In Proceedings of the 2008 16th International Symposium on Field-Programmable Custom Computing Machines, 2008; pp. 273-274.
- 30. Siddiqui, F.; Amiri, S.; Minhas, U.I.; Deng, T.; Woods, R.; Rafferty, K.; Crookes, D. Fpga-based processor acceleration for image processing applications. Journal of Imaging 2019, 5, 16.
- 31. Kimura, Y.; Kikuchi, T.; Ootsu, K.; Yokota, T. Proposal of Scalable Vector Extension for Embedded RISC-V Soft-Core Processor. In Proceedings of the 2019 Seventh International Symposium on Computing and Networking Workshops (CANDARW), 2019; pp. 435-439.
- 32. Wilson, J.N.; Ritter, G.X. Handbook of computer vision algorithms in image algebra; CRC press: 2000.
- 33. Liu, G.; Luo, Q.; Liu, B.; Lu, B.; Guo, P. Embedded intelligent camera algorithm based on hardware IP. In Proceedings of the Tenth International Symposium on Precision Engineering Measurements and Instrumentation, 2019; p. 110533T.
- 34. Bailey, D.G. Image processing using FPGAs. Journal of Imaging 2019, 5, 53.
- 35. Palmer, J.F. The Intel® 8087 numeric data processor. In Proceedings of the May 19-22, 1980, national computer conference, 1980; pp. 887-893.
- 36. Li, Z.; Liu, F.; Yang, W.; Peng, S.; Zhou, J. A survey of convolutional neural networks: analysis, applications, and prospects. IEEE Transactions on Neural Networks and Learning Systems 2021.

#### Bio

**Tiantai Deng** received his PhD from Queen's University Belfast, MSc from the University of Manchester and BEng from Harbin Institute of Technology. He is currently a lecturer at the University of Sheffield. Prior to his career as an academic, he was a senior engineer at HiSilicon, Huawei. His main research focus is on hardware acceleration for image processing, deep learning and high-level design environments.

**Danny Crookes** received the BSc and PhD degrees from Queen's University Belfast in 1977 and 1980 respectively. He was appointed to the Chair of Computer Engineering at Queen's University Belfast in 1993, where he was the Head of Computer Science from 1993 to 2002. He has published over 260 scientific papers in journals and international conferences. His current research interests include medical image processing, hardware acceleration, and speech enhancement and separation.

**Roger Woods** received the BSc and PhD degrees from Queen's University Belfast in 1985 and 1990 respectively, and is currently a professor and Dean of Research within the university. He has also formed Analytics Engines Ltd., and acts as their chief scientist. His research interests include heterogeneous programmable systems and design tools for data, signal and image processing, and telecommunications.

**Fahad Siddiqui** received the BSc degree in Electronic Engineering from Sir Syed University of Engineering and Technology, Pakistan in 2007, the MSc degree in Electronic Engineering from the Polytechnic University of Turin, Italy in 2012, and the PhD degree from Queen's University Belfast in 2018. His research interests focus on FPGA-based programmable architectures with an emphasis on hardware acceleration. He is currently Senior Hardware Security Architect at NVIDIA, Belfast, UK.
