Hardware Acceleration Design Tutorials
The Big Picture: Teaching Hardware Acceleration by Example
The Hardware_Acceleration_Design_Tutorials module is a curated collection of reference designs that bridge the gap between algorithmic intent and high-performance FPGA implementation. Think of it as a "cookbook" for hardware acceleration — each tutorial demonstrates how to transform a compute-intensive workload from a CPU-bound reference implementation into an efficient FPGA-accelerated solution using Vitis HLS (High-Level Synthesis) and the Vitis unified software platform.
Why This Module Exists
Modern data-center and edge workloads — from real-time video processing to combinatorial optimization — increasingly bump against the "CPU memory wall." While CPUs excel at sequential, branch-heavy code, they struggle with data-parallel "throughput kernels" where the same operation applies to massive data streams.
The traditional path to FPGA acceleration was daunting: write RTL (Verilog/VHDL), verify in simulation, handle timing closure manually — a months-long effort requiring specialized hardware design expertise. This module demonstrates the Vitis HLS approach: express algorithms in C/C++, add pragmas to guide hardware generation, and let the compiler build the pipeline. The goal is to enable software engineers to achieve RTL-level performance without writing RTL.
The Mental Model: A Pipeline Factory
To understand how these tutorials work, imagine a modern automated factory floor:
- The Raw Materials (Global Memory) — Large batches of unprocessed data (images, city coordinates, network packets) sit in off-chip DDR memory, waiting for processing.
- The Loading Dock (Memory-Mapped AXI4) — A specialized loader (`ReadFromMem`) pulls raw materials from the warehouse into the factory in optimal-sized batches (burst transfers), converting wide memory bus transactions into a steady stream of individual items.
- The Assembly Line (Dataflow Pipeline) — Inside the factory, materials move through specialized workstations connected by conveyor belts (`hls::stream` FIFOs):
  - Window Formation Station (`Window2D`): Assembles individual pixels into 2D neighborhoods (convolution windows) using line buffers — like organizing parts into assemblies.
  - Processing Station (`Filter2D`): Applies the actual computation (multiply-accumulate for convolution, distance calculation for TSP) to each assembled unit.
- The Shipping Dock (AXI4 Write) — Finished products flow to the unloading station (`WriteToMem`), which repackages the stream into burst writes back to global memory.
- The Dispatcher (Host Orchestration) — A central coordinator (the `Filter2DDispatcher` class) manages multiple "delivery requests" to the factory. It implements software pipelining: while one batch is being processed on the FPGA, the next batch's data is being transferred from host memory to FPGA memory, overlapping communication with computation.
The key insight: Hardware acceleration isn't just about making one operation faster; it's about keeping the pipeline full. The tutorials demonstrate how to structure both the hardware kernel (the factory floor) and the host software (the logistics coordinator) to maintain maximum throughput.
Architecture Overview
The module is organized into three distinct tutorial tracks, each targeting different problem domains and optimization techniques:
1. Convolution Tutorial: The Stream Processing Pattern
The Convolution Tutorial is the "hello world" of image processing acceleration. It implements a 2D convolution filter (blur, sharpen, edge detection) using the classic stream processing pattern.
Key architectural insights:
- Line Buffer Pattern: The `Window2D` function demonstrates the canonical FPGA image processing architecture — using on-chip BRAM to buffer image lines, enabling efficient 2D window extraction with minimal external memory bandwidth.
- DATAFLOW Architecture: The kernel uses `#pragma HLS DATAFLOW` to create a 4-stage pipeline (Read → Window → Filter → Write) where all stages execute concurrently, connected by `hls::stream` FIFOs.
- Software Pipelining on Host: The `Filter2DDispatcher` class implements triple-buffering at the system level — while the FPGA processes batch N, the CPU prepares data for batch N+1 and reads results from batch N-1.
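The 4-stage structure can be sketched as a plain-C++ behavioral model. This is a sketch, not the tutorial's kernel: `Stream` is a minimal stand-in for `hls::stream`, the `Window2D` stage is omitted, and byte inversion is a placeholder for the real convolution. In HLS, `#pragma HLS DATAFLOW` would run the stages concurrently instead of back-to-back.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <queue>
#include <vector>

using U8 = std::uint8_t;

template <typename T> struct Stream {  // stand-in for hls::stream<T>
    std::queue<T> q;
    void write(T v) { q.push(v); }
    T read() { T v = q.front(); q.pop(); return v; }
};

static void ReadFromMem(const std::vector<U8>& mem, Stream<U8>& out) {
    for (U8 px : mem) out.write(px);    // models AXI4 burst reads
}
static void Filter2D(Stream<U8>& in, Stream<U8>& out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out.write(static_cast<U8>(255 - in.read()));  // placeholder compute
}
static void WriteToMem(Stream<U8>& in, std::vector<U8>& mem) {
    for (U8& px : mem) px = in.read();  // models AXI4 burst writes
}

std::vector<U8> Filter2DKernel(const std::vector<U8>& src) {
    Stream<U8> s0, s1;                  // the "conveyor belt" FIFOs
    std::vector<U8> dst(src.size());
    ReadFromMem(src, s0);
    Filter2D(s0, s1, src.size());
    WriteToMem(s1, dst);
    return dst;
}
```

The point of the sketch is the topology: each stage touches only its streams, never global state, which is exactly what lets HLS schedule the real stages as concurrent tasks.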
2. Traveling Salesperson: Algorithmic Optimization
The Traveling Salesperson Tutorial tackles combinatorial optimization — a domain where the algorithmic approach matters more than raw data bandwidth. It implements a brute-force TSP solver that evaluates all city permutations to find the shortest route.
Key architectural insights:
- Reference vs. Optimized Flow: The tutorial provides both a baseline HLS implementation (`build/hls.tcl`) and an optimized version (`build/hls_opt.tcl`), demonstrating the progression from "functional but slow" to "pipeline-optimized."
- Memory-Bound vs. Compute-Bound: Unlike the convolution tutorial (memory bandwidth limited), TSP is compute-bound — the challenge is efficiently generating permutations and accumulating distances without pipeline stalls.
- Fixed-Point Arithmetic: The kernel uses `uint16_t` distances (scaled integers) rather than floating-point, dramatically reducing DSP48 usage and enabling higher clock frequencies.
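As a sketch of the scaled-integer idea, a distance-table entry might be precomputed on the host like this. The scale factor (16, i.e. 4 fractional bits) and the function name are illustrative assumptions, not the tutorial's actual encoding:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Precompute a Euclidean distance as a 16-bit scaled integer so the kernel
// only ever adds integers (no floating-point DSP usage on the FPGA).
std::uint16_t scaled_dist(int x1, int y1, int x2, int y2) {
    double d = std::sqrt(double((x1 - x2) * (x1 - x2) + (y1 - y2) * (y1 - y2)));
    return static_cast<std::uint16_t>(d * 16.0 + 0.5);  // round to nearest
}
```

Because the kernel only accumulates these integers, the accumulation loop pipelines at II=1 without floating-point adder latency in the critical path.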
3. Alveo Aurora: High-Speed Serial Communication
The Alveo Aurora Tutorial demonstrates high-speed serial communication using the Aurora protocol over QSFP interfaces on Alveo cards. Unlike the previous tutorials (which focus on computation), this focuses on data movement at the edge of the FPGA — connecting the device to external networks or sensors.
Key architectural insights:
- GT Transceiver Integration: The configuration file shows how to connect HLS kernels to hardened GT (Gigabit Transceiver) blocks — the physical layer for high-speed serial.
- Stream-Based Datapaths: The `strm_issue` and `strm_dump` kernels generate and consume streaming data, demonstrating how to test high-bandwidth links without external equipment.
- Clock Domain Crossing: The configuration shows connections between the Aurora core's clock domain and the user logic clock domain — a common source of subtle bugs in high-speed designs.
Cross-Module Dependencies
This module sits at the intersection of several larger ecosystems:
- `Vitis_HLS_Tutorials`: Shares the HLS toolchain but focuses on language features and pragmas rather than system integration. The convolution tutorial here uses techniques demonstrated there, but adds the host-kernel integration layer.
- `Hardware_Acceleration_Feature_Tutorials`: Explores specific Vitis features (debugging, RTL kernel integration, multi-CU dispatch). The convolution tutorial's `Filter2DDispatcher` demonstrates a production-ready version of the multi-CU dispatch pattern described there.
- `AI_Engine_Development/AIE`: For workloads that don't fit the HLS model (especially DSP-heavy signal processing with complex dataflow), the AIE (AI Engine) offers a different paradigm. The convolution tutorial here is a "pure HLS" approach; for very large filters or multi-channel video streams, an AIE implementation might be more efficient.
Design Tradeoffs and Philosophy
Throughout these tutorials, several recurring design tensions appear:
- Abstraction vs. Control: HLS provides high-level C++ abstraction, but achieving optimal performance requires understanding the underlying hardware (pipeline stages, memory ports, DSP48s). The tutorials show the "pragma-augmented" middle path — C++ with hardware hints rather than raw RTL.
- Host-Visible vs. Kernel-Autonomous: The convolution tutorial keeps the host deeply involved (dispatching individual frames), while the Aurora tutorial is more autonomous (streams flow without per-packet host intervention). This reflects the fundamental difference between "accelerator" (host-driven) and "smart NIC" (autonomous) architectures.
- Throughput vs. Latency: The TSP tutorial sacrifices latency (it takes time to evaluate all permutations) for throughput (evaluating many permutations in parallel via pipelining). The convolution tutorial optimizes for sustained throughput of video frames. Understanding which metric matters for your use case is critical — these tutorials demonstrate both strategies.
- Portability vs. Optimization: The `host_randomized.cpp` variant of the convolution tutorial exists precisely for portability — it removes the OpenCV dependency at the cost of less realistic input data. This is a common pattern: provide a "full featured" path and a "minimal dependency" path.
What New Contributors Should Watch Out For
If you're joining the team to work on these tutorials or extend them, here are the non-obvious gotchas:
HLS Kernel Development:
- DATAFLOW vs. PIPELINE: `DATAFLOW` enables task-level parallelism (concurrent functions), while `PIPELINE` enables loop-level parallelism (overlapping loop iterations). Mixing them incorrectly causes "stalled pipeline" warnings in the HLS console. The convolution tutorial uses `DATAFLOW` at the top level and `PIPELINE` inside the window processing loops.
- Stream Depth Matters: The `hls::stream` template has a default depth that may be too shallow for the rate mismatch between your producer and consumer. If a producer writes faster than a consumer reads, and the stream depth is insufficient, the producer stalls, killing throughput. The convolution tutorial sets explicit depths (`hls::stream<char,2>`, `hls::stream<U8,64>`) based on the producer-consumer rate ratios.
- Alignment Assertions: Notice the `assert(stride%64 == 0)` in `ReadFromMem`. This isn't just defensive programming — it ensures the AXI4 interface can use 512-bit (64-byte) bursts, maximizing memory bandwidth. Violating this silently degrades performance by 8x or more.
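The alignment arithmetic behind that assertion can be made concrete. This helper (the name `beats_per_line` is hypothetical, not from the tutorial) shows why the stride must divide evenly into 64-byte beats:

```cpp
#include <cassert>
#include <cstddef>

// One 512-bit AXI4 beat carries 64 bytes, so a full image line maps cleanly
// onto burst beats only when the stride is a multiple of 64 bytes.
std::size_t beats_per_line(std::size_t stride_bytes) {
    assert(stride_bytes % 64 == 0 && "misaligned stride defeats wide bursts");
    return stride_bytes / 64;  // one 512-bit beat per 64 bytes
}
```

A misaligned stride forces the interface down to narrow transfers, which is where the 8x-or-worse bandwidth loss comes from (8 bytes per beat instead of 64).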
Host Application Development:
- Buffer Pinning vs. Migration: The `Filter2DRequest` constructor uses `enqueueMigrateMemObjects` with `CL_MIGRATE_MEM_OBJECT_CONTENT_UNDEFINED` after `setArg` calls. This is a subtle optimization: `setArg` binds buffers to specific memory banks (pinning), then migration makes them resident in those banks without a copy (since content is undefined/irrelevant). Getting this order wrong causes extra data copies.
- Out-of-Order Queues: The host code creates the command queue with `cl::QueueProperties::OutOfOrder`. This allows the runtime to overlap data transfers and kernel execution — essential for the software pipelining pattern. Using an in-order queue would serialize these operations, destroying throughput.
- Event Chaining: Notice how `events` vectors are passed to `enqueueWriteBuffer`, `enqueueTask`, and `enqueueReadBuffer`. This creates explicit dependencies: the kernel can't start until writes complete, and the read can't start until the kernel completes. The `Filter2DDispatcher` relies on this for correctness when issuing overlapping requests.
System Integration:
- XCLBIN Compatibility: The host code checks `getenv("XCL_EMULATION_MODE")` to conditionally print timing info. This matters because emulation flows don't report accurate hardware timing. Missing this check causes confusing "0 MB/s" throughput reports in emulation.
- OpenCV Dependency Management: There are two host variants — one with OpenCV (image I/O) and one randomized (no dependencies). When building on headless servers without OpenCV, you must use the randomized version or the build fails with missing headers.
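The emulation guard is small enough to sketch in full. XRT sets `XCL_EMULATION_MODE` (to `sw_emu` or `hw_emu`) in emulation flows; the helper name here is hypothetical, but the check itself is the pattern the host code uses:

```cpp
#include <cassert>
#include <cstdlib>

// Report wall-clock throughput only on real hardware runs, where the
// measured transfer and kernel times are meaningful.
bool report_timing() {
    return std::getenv("XCL_EMULATION_MODE") == nullptr;
}
```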
With this context established, let's dive into the detailed sub-module documentation to understand the specific implementation patterns in each tutorial track.
Sub-Module Documentation
For detailed implementation specifics of each tutorial track, refer to the sub-module documentation:
- Convolution Tutorial: 2D Filter Pipeline — Complete image processing pipeline demonstrating HLS DATAFLOW, line-buffer architecture, and multi-CU host dispatch.
- Traveling Salesperson: Optimization Flow — Combinatorial optimization tutorial showing progression from reference C++ to optimized HLS, highlighting algorithm-specific hardware tradeoffs.
- Alveo Aurora: High-Speed Serial — High-speed serial communication tutorial demonstrating GT transceiver integration and Aurora protocol configuration.
Data Flow Architecture
The following diagram illustrates the end-to-end data flow for the convolution tutorial (the most complex of the three), showing how data moves from host memory through the FPGA and back:
```mermaid
flowchart LR
    subgraph "Host Memory"
        A[Source Image<br/>Y/U/V Planes]
        B[Filter Coefficients]
        C[Output Image]
    end
    subgraph "PCIe/XRT"
        D[OpenCL Buffer<br/>cl::Buffer]
        E[Command Queue<br/>Out-of-Order]
        F[Event Dependencies]
    end
    subgraph "FPGA Global Memory"
        G[DDR Bank 0<br/>src_buffer]
        H[DDR Bank 1<br/>coef_buffer]
        I[DDR Bank 2<br/>dst_buffer]
    end
    subgraph "Filter2DKernel<br/>HLS DATAFLOW"
        J[ReadFromMem<br/>AXI4-Full]
        K[Window2D<br/>Line Buffers]
        L[Filter2D<br/>Convolution MAC]
        M[WriteToMem<br/>AXI4-Full]
    end
    A -->|enqueueWriteBuffer| D
    B -->|enqueueWriteBuffer| D
    D -->|Migrate| G
    D -->|Migrate| H
    G -->|AXI4-Full<br/>burst=512b| J
    H -->|coeff_stream| J
    J -->|pixel_stream| K
    K -->|window_stream| L
    L -->|pixel_stream| M
    M -->|AXI4-Full<br/>burst=512b| I
    I -->|Migrate| D
    D -->|enqueueReadBuffer| C
    E -->|Event Chaining| F
    F -.->|Wait for| J
    F -.->|Wait for| K
    F -.->|Wait for| L
    F -.->|Wait for| M
```
This architecture embodies several key design principles demonstrated across all tutorials:
- Decoupled Producer-Consumer Stages: Each stage in the DATAFLOW region operates independently, pulling data from input streams and pushing to output streams. This decouples timing between stages — a slow memory read doesn't stall the compute unit if the stream FIFO has depth.
- Burst-Based Memory Access: The `ReadFromMem` and `WriteToMem` functions are designed to issue wide, aligned burst transfers (512-bit / 64-byte). This amortizes the ~100ns latency of DDR access over hundreds of bytes, achieving near-peak bandwidth. The `assert(stride%64 == 0)` enforces alignment requirements.
- Software Pipelining at System Level: The host `Filter2DDispatcher` maintains multiple `Filter2DRequest` objects (typically 3), each representing an in-flight transaction. While Request 0's kernel executes, Request 1's input data is being transferred to the FPGA, and Request 2's output data is being transferred back to the host — the classic "double buffering" or "triple buffering" pattern.
- Event-Driven Synchronization: OpenCL events (`cl::Event`) explicitly encode dependencies between operations. The kernel execution event depends on the write completion events; the read event depends on the kernel event. This allows the runtime to optimize scheduling without sequential host-side waiting.
Key Design Decisions
1. HLS DATAFLOW vs. Sequential Execution
Decision: Use `#pragma HLS DATAFLOW` to enable task-level parallelism across the four sub-functions (`ReadFromMem`, `Window2D`, `Filter2D`, `WriteToMem`).
Rationale:
- In a sequential implementation, `ReadFromMem` would read the entire image before `Window2D` starts. With large images (1080p), this requires buffering the entire frame on-chip (impossible) or in external memory (bandwidth waste).
- With DATAFLOW, `Window2D` starts processing as soon as the first few lines are read. The four functions execute concurrently in a pipeline, with `hls::stream` FIFOs decoupling their execution rates.
Tradeoff: DATAFLOW requires careful management of stream depths. If the producer writes faster than the consumer reads, and the stream depth is insufficient, the producer stalls. The tutorial sets explicit depths (`hls::stream<char,2>`, `hls::stream<U8,64>`) based on the rate mismatch between memory access (burst) and compute (sample-by-sample).
2. Line Buffer Architecture for 2D Convolution
Decision: Implement `Window2D` using line buffers (BRAM arrays storing `FILTER_V_SIZE-1` complete lines) to assemble 2D convolution windows from a 1D input stream.
Rationale:
- 2D convolution requires accessing pixels from a neighborhood (e.g., 3x3 or 5x5) around each pixel. In a row-major stream, these pixels arrive at different times (the row above arrived `width` cycles ago).
- The line buffer acts as a delay line: as pixels stream through, the buffer holds the previous `FILTER_V_SIZE-1` rows. When a new pixel arrives, it can be combined with buffered pixels from the lines above to form the complete window.
Tradeoff: Line buffers consume significant BRAM. For a 1080p image with 3 line buffers, each holding 1920 bytes, that's ~6KB per color plane — acceptable for small filters, but scaling to larger kernels or 4K video requires careful BRAM budgeting. The tutorial uses `#pragma HLS ARRAY_PARTITION` on the line buffer dimension to ensure parallel access to all lines for window formation.
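The delay-line mechanics can be sketched as a plain-C++ behavioral model that runs without the Vitis tools. A fixed 3x3 window is assumed for illustration, and `window2d` is a hypothetical name (the real `Window2D` streams windows out one per cycle rather than collecting them in a vector):

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

constexpr int V = 3, H = 3;  // 3x3 window (assumption)

std::vector<std::array<std::uint8_t, V * H>>
window2d(const std::vector<std::uint8_t>& img, int width, int height) {
    // V-1 line buffers: in HLS these map to BRAM and act as a delay line.
    std::vector<std::vector<std::uint8_t>> lines(
        V - 1, std::vector<std::uint8_t>(width, 0));
    std::array<std::array<std::uint8_t, H>, V> win{};  // window registers
    std::vector<std::array<std::uint8_t, V * H>> out;
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            std::uint8_t px = img[y * width + x];
            // Shift the sliding window one column to the left.
            for (int r = 0; r < V; ++r)
                for (int c = 0; c + 1 < H; ++c) win[r][c] = win[r][c + 1];
            // New rightmost column: buffered rows above + current pixel.
            for (int r = 0; r + 1 < V; ++r) win[r][H - 1] = lines[r][x];
            win[V - 1][H - 1] = px;
            // Rotate the line buffers at this column position.
            for (int r = 0; r + 2 < V; ++r) lines[r][x] = lines[r + 1][x];
            lines[V - 2][x] = px;
            if (y >= V - 1 && x >= H - 1) {  // window fully populated
                std::array<std::uint8_t, V * H> flat{};
                for (int r = 0; r < V; ++r)
                    for (int c = 0; c < H; ++c) flat[r * H + c] = win[r][c];
                out.push_back(flat);
            }
        }
    }
    return out;
}
```

Each input pixel is touched exactly once, which is the property that keeps external memory bandwidth at one read per pixel regardless of filter size.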
3. Software Pipelining via Filter2DDispatcher
Decision: Implement the host-side `Filter2DDispatcher` class to maintain multiple in-flight requests (`Filter2DRequest` objects), enabling software pipelining where data transfers overlap with kernel execution.
Rationale:
- Without pipelining, the sequence is: Write input → Run kernel → Read output → (repeat). The FPGA sits idle during host→FPGA transfers and FPGA→host transfers.
- With `maxReqs=3`, the dispatcher maintains three request slots. While Request 0 runs on the FPGA, Request 1's input data is being written, and Request 2's previous output is being read. This triple-buffering pattern keeps the FPGA continuously busy.
Tradeoff: Increased host memory usage (3× the buffering) and code complexity. The dispatcher must track which request slot is available using round-robin allocation (`cnt%max`). Additionally, the out-of-order OpenCL queue is required for this to work — an in-order queue would serialize the operations regardless of the dispatcher's logic.
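The round-robin slot reuse can be sketched in a few lines. This is a minimal model, not the tutorial's class: the `Request` type and the boolean "wait" stand in for OpenCL buffers and event waits:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Request { int batchId = -1; bool inFlight = false; };

class Dispatcher {
    std::vector<Request> slots;
    std::size_t cnt = 0;
public:
    explicit Dispatcher(std::size_t maxReqs = 3) : slots(maxReqs) {}
    // Issue batch `id` into slot cnt % maxReqs. If the slot still holds an
    // in-flight request, the real code first waits on its completion event.
    std::size_t dispatch(int id) {
        std::size_t s = cnt++ % slots.size();
        if (slots[s].inFlight) slots[s].inFlight = false;  // "wait" for old request
        slots[s] = Request{id, true};
        return s;
    }
};
```

With three slots, the wait in slot `k` only triggers once batch `k+3` arrives, by which time batch `k` has normally finished — that is the overlap the dispatcher exploits.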
4. HLS Stream Depth Sizing
Decision: Size `hls::stream` depths based on the producer-consumer rate mismatch: small depths (2-3) for balanced rates, large depths (64) for bursty producers.
Rationale:
- Between `Filter2D` (the compute stage) and `WriteToMem` (the memory write stage), the rates differ. `Filter2D` produces one pixel per cycle (II=1), but `WriteToMem` issues 64-byte bursts to memory, consuming 64 pixels every N cycles (where N depends on memory latency).
- A deep FIFO (64 entries) absorbs this burstiness, allowing `Filter2D` to run continuously even when `WriteToMem` is occasionally stalled waiting for DRAM arbitration.
Tradeoff: Each `hls::stream` maps to a FIFO implemented in BRAM or LUTRAM. Deep FIFOs consume significant resources. The tutorial carefully sizes streams: the coefficient stream (read once, used many times) is shallow (2), while the output stream (bursty) is deep (64).
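The effect of depth on stalls can be demonstrated with a toy cycle model: a producer emits one item per cycle into a FIFO drained in 64-item bursts every 100 cycles. The numbers are illustrative assumptions, not measured tutorial latencies, and the function is hypothetical:

```cpp
#include <cassert>
#include <cstddef>

// Count producer stall cycles for a given FIFO depth under a bursty consumer.
std::size_t producer_stalls(std::size_t depth, std::size_t total_items) {
    std::size_t fifo = 0, produced = 0, stalls = 0, cycle = 0;
    while (produced < total_items) {
        if (cycle % 100 == 99) fifo -= (fifo < 64 ? fifo : 64);  // burst drain
        if (fifo < depth) { ++fifo; ++produced; }                // write accepted
        else ++stalls;                                           // FIFO full: stall
        ++cycle;
    }
    return stalls;
}
```

Running this with depth 2 versus depth 64 shows the shallow FIFO forcing the producer to spend most cycles stalled, which is exactly the throughput collapse the tutorial's explicit depths avoid.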
5. Build System: TCL-Based HLS Workflows
Decision: Use TCL scripting (`build.tcl`, `hls.tcl`, `hls_opt.tcl`) to orchestrate the HLS synthesis flow, enabling reproducible builds and easy exploration of design space parameters (clock period, target part, optimization directives).
Rationale:
- HLS design space exploration requires iterating on pragmas, clock constraints, and data types. Manual GUI-based iteration is error-prone and non-reproducible.
- TCL scripts encode the "golden" build procedure: `open_project`, `set_top`, `add_files`, `csynth_design`, `cosim_design`. They can be version-controlled and run in CI/CD pipelines.
- The tutorial provides both baseline (`hls.tcl`) and optimized (`hls_opt.tcl`) versions, showing the progression from functional prototype to production-optimized kernel.
Tradeoff: TCL is less readable than Python-based build systems (like Vitis's newer Python APIs). Error messages from TCL scripts can be cryptic, and debugging HLS synthesis failures often requires reading detailed logs. The tutorials include extensive comments in the TCL files to mitigate this.
Sub-Module Documentation
1. Convolution Tutorial: 2D Filter Pipeline
The convolution_tutorial_filter2d_pipeline sub-module is the flagship tutorial, demonstrating a complete image processing pipeline from host code to HLS kernel. It covers:
- Hardware Architecture: The line-buffer-based `Window2D` function and the MAC-based `Filter2D` function, orchestrated by `Filter2DKernel` using DATAFLOW.
- Host Software Architecture: The `Filter2DRequest` class (single transaction management) and `Filter2DDispatcher` class (multi-transaction pipelining), showing how to overlap data transfers with computation.
- Build System: The TCL-based HLS build flow (`build.tcl`) and the Vitis system integration flow.
2. Traveling Salesperson: Algorithmic Optimization
The traveling_salesperson_hls_and_reference_flow sub-module focuses on combinatorial optimization and demonstrates the progression from reference C++ to optimized HLS:
- CPU Reference: The `main_gold.cpp` provides a functional, unoptimized reference for correctness checking.
- Baseline HLS: The `hls.tcl` flow synthesizes a naive implementation, establishing a performance baseline.
- Optimized HLS: The `hls_opt.tcl` flow demonstrates optimization pragmas (PIPELINE, UNROLL, ARRAY_PARTITION) to achieve target throughput.
This sub-module is essential for understanding how to optimize control-heavy, irregular algorithms (unlike the regular dataflow of image processing).
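A miniature brute-force reference in the spirit of `main_gold.cpp` looks like this. This version, its `std::next_permutation` driver, and the tiny instance size are illustrative, not the tutorial's code:

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>
#include <numeric>

// Evaluate every visiting order over a precomputed uint16_t distance table
// and return the shortest closed-tour length.
template <std::size_t N>
std::uint32_t shortest_tour(
        const std::array<std::array<std::uint16_t, N>, N>& dist) {
    std::array<std::uint8_t, N> order{};
    std::iota(order.begin(), order.end(), std::uint8_t{0});
    std::uint32_t best = UINT32_MAX;
    do {  // all N! permutations of the visiting order
        std::uint32_t len = dist[order[N - 1]][order[0]];  // close the loop
        for (std::size_t i = 0; i + 1 < N; ++i)
            len += dist[order[i]][order[i + 1]];
        best = std::min(best, len);
    } while (std::next_permutation(order.begin(), order.end()));
    return best;
}
```

The HLS challenge the sub-module addresses is generating this permutation sequence in hardware without the data-dependent control flow of `std::next_permutation` stalling the accumulation pipeline.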
3. Alveo Aurora: High-Speed Serial Communication
The alveo_aurora_kernel_stream_config sub-module demonstrates network-attached acceleration using the Aurora protocol:
- GT Transceiver Integration: Shows how to connect HLS kernels to hardened serial transceivers (GTs) for 10/25/100G networking.
- Stream Connectivity: The `krnl_aurora_test.cfg` file defines stream connections between `strm_issue` (traffic generator), `krnl_aurora` (the network core), and `strm_dump` (traffic checker).
- Clock Domain Management: Demonstrates proper handling of Aurora reference clocks (`gt_refclk`) and init clocks for transceiver stability.
This sub-module is crucial for developers building network-attached accelerators or chip-to-chip communication systems.
Conclusion
The Hardware_Acceleration_Design_Tutorials module is more than a collection of example programs — it's a structured curriculum for learning FPGA acceleration. By progressing through the convolution (stream processing), TSP (algorithmic optimization), and Aurora (network integration) tutorials, developers gain a holistic understanding of the hardware acceleration design space.
The key takeaway is that efficient hardware acceleration requires co-design: the HLS kernel architecture (line buffers, DATAFLOW), the host dispatch strategy (pipelining, event management), and the system connectivity (memory banks, stream depths) must be designed together. These tutorials provide the template for that co-design process.