
Traveling Salesperson HLS and Reference Flow

Opening: What This Module Does

Imagine you are a traveling salesperson who must visit exactly 13 cities scattered across the United States, returning to your starting point, and you want to drive the shortest possible total distance. The brute-force approach is to calculate the total distance for every possible ordering (permutation) of the cities and pick the minimum. For 13 cities, that's \(13! \approx 6.2\) billion permutations—computationally expensive but feasible on a modern CPU given enough time.

This module is a complete tutorial demonstrating how to migrate an exact, brute-force combinatorial solver from software (CPU) to hardware (FPGA via Vitis HLS). It provides:

  1. A CPU reference implementation (main_gold.cpp) that solves a 13-city instance using standard C++ library algorithms, serving as the functional "golden" reference.
  2. An HLS kernel interface (tsp.h) defining the hardware-accelerated kernel with streaming data inputs, designed for a smaller problem size (\(N=5\)) due to hardware resource constraints.
  3. Build automation (hls.tcl and hls_opt.tcl) showing baseline and optimized synthesis flows.

The module answers the question: "How do I take a complex, data-dependent algorithm with irregular memory access patterns (permutation enumeration) and express it efficiently in reconfigurable hardware?"


Architecture and Data Flow

System Topology

The module presents two distinct implementations of the same algorithmic core, bridged by a shared mathematical foundation (the distance matrix) but diverging in their execution models:

graph TB
    subgraph "CPU Reference Flow (Software)"
        A["City Coordinates<br/>float x,y"] --> B["Distance Matrix Calculation<br/>float preciseDistances"]
        B --> C["Normalization & Quantization<br/>uint16_t scaled distances"]
        C --> D["Permutation Enumeration<br/>std::next_permutation"]
        D --> E["Distance Accumulation<br/>Sum path lengths"]
        E --> F["Minimum Reduction<br/>std::min"]
    end
    subgraph "HLS Kernel Flow (Hardware)"
        G["Host Memory<br/>Distance Matrix"] --> H["AXI4-Stream Interface<br/>hls::stream"]
        H --> I["Pipeline Stages<br/>#pragma HLS PIPELINE"]
        I --> J["Permutation Engine<br/>Unrolled loop nests"]
        J --> K["Reduction Tree<br/>Min-finding logic"]
        K --> L["Scalar Result<br/>unsigned int& shortestDistance"]
    end
    style A fill:#e1f5ff
    style G fill:#fff4e1
    style F fill:#e8f5e9
    style L fill:#e8f5e9

Component Roles

CPU Reference (main_gold.cpp)

This file implements the computational ground truth. Its architectural role is to provide an unambiguous, easy-to-verify reference result against which the hardware implementation is validated.

Key Design Patterns:

  • Immutable Configuration: The city coordinates are constexpr or const static data, fixed at compile time for reproducibility.
  • Value Semantics: Heavy use of std::array rather than raw pointers ensures stack-allocated, cache-friendly storage, with bounds checking available via .at() when needed.
  • Lazy Initialization: The distance matrix is computed using an Immediately Invoked Function Expression (IIFE) lambda, ensuring the matrix is fully initialized before main logic begins, without requiring a separate initialization function.

The Permutation Algorithm: The core uses std::next_permutation which generates permutations in lexicographic order. For each permutation (representing an ordering of cities), it sums the distances between consecutive cities using the precomputed distance matrix. The minimum across all permutations is retained.
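As a concrete illustration, here is a minimal sketch of that loop on a hypothetical 4-city instance. The tutorial's actual code uses \(N=13\) with real coordinates; the matrix values and the name bruteForceShortest are invented for this example:

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>
#include <numeric>

// Hypothetical downsized instance; the tutorial's reference uses N = 13.
constexpr int N = 4;

// Quantized distance matrix, row-major: dist[i*N + j].
// (In main_gold.cpp this is built from float coordinates by an IIFE lambda.)
constexpr std::array<uint16_t, N * N> dist = {
     0, 10, 15, 20,
    10,  0, 35, 25,
    15, 35,  0, 30,
    20, 25, 30,  0};

// Enumerate every ordering with std::next_permutation and keep the minimum
// open-route length (no return leg, matching the tutorial's CPU code).
unsigned int bruteForceShortest() {
    std::array<uint8_t, N> perm{};
    std::iota(perm.begin(), perm.end(), 0);  // 0,1,...,N-1: first lexicographic order
    unsigned int best = ~0u;
    do {
        unsigned int len = 0;
        for (int i = 0; i + 1 < N; ++i)
            len += dist[perm[i] * N + perm[i + 1]];
        best = std::min(best, len);
    } while (std::next_permutation(perm.begin(), perm.end()));
    return best;
}
```

Because std::next_permutation returns false after the last lexicographic ordering, the do/while loop visits each of the \(N!\) orderings exactly once.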

Performance Characteristics:

  • Time Complexity: \(O(N! \cdot N)\) — factorial growth dominates.
  • Space Complexity: \(O(N^2)\) for the distance matrix plus \(O(N)\) for the current permutation.
  • Observability: Includes a "hacky cheap progress indicator" that prints percentage completion to stdout, useful for long-running CPU verification runs.

HLS Kernel Interface (tsp.h)

This is the hardware contract. It defines the boundary between the software host and the hardware accelerator, specifying exactly how data moves across the PCIe/AXI boundary and what computational guarantees the hardware provides.

Interface Design Philosophy: The interface uses streaming semantics (hls::stream<uint16_t>) rather than memory-mapped arrays. Think of this like a conveyor belt in a factory: data arrives one item at a time, is processed immediately, and passed on. This contrasts with the CPU's "warehouse" model where all data is stored in memory and retrieved randomly.

Key Parameters:

  • N = 5: The hardware is synthesized for exactly 5 cities. Unlike the CPU version where N could theoretically be a template parameter (though implemented as a constant), the HLS hardware has fixed-size pipelines and unrolled loops specifically sized for this value.
  • uint16_t stream: Distances are quantized to 16-bit unsigned integers. This matches the CPU's normalization step where distances are scaled to uint16_t range for hardware efficiency.

Function Signature:

void tsp(hls::stream<uint16_t>& streamDistances, unsigned int& shortestDistance);

Hardware Mapping:

  • streamDistances: Maps to an AXI4-Stream interface (axis). The host writes distance values sequentially; the kernel reads them as needed.
  • shortestDistance: Maps to an AXI4-Lite interface (s_axilite) or a scalar output. This is the final result register that the host polls or receives as an interrupt.
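To make the contract concrete, here is a hedged behavioral model of such a kernel, with std::queue standing in for hls::stream (it offers the same read-in-order semantics for C simulation purposes). The pragma names in the comments follow Vitis HLS conventions, but the body is an illustrative sketch, not the tutorial's actual tsp.cpp:

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>
#include <numeric>
#include <queue>

constexpr int N = 5;

// Behavioral model: std::queue stands in for hls::stream<uint16_t>.
// In the real kernel the ports would carry INTERFACE pragmas, e.g.:
//   #pragma HLS INTERFACE axis      port=streamDistances
//   #pragma HLS INTERFACE s_axilite port=shortestDistance
void tsp_model(std::queue<uint16_t>& streamDistances, unsigned int& shortestDistance) {
    // Drain the N*N streamed words into a local buffer (BRAM in hardware).
    std::array<uint16_t, N * N> dist{};
    for (auto& d : dist) { d = streamDistances.front(); streamDistances.pop(); }

    // Same exhaustive search as the CPU reference, over the local copy.
    std::array<uint8_t, N> perm{};
    std::iota(perm.begin(), perm.end(), 0);
    unsigned int best = ~0u;
    do {
        unsigned int len = 0;
        for (int i = 0; i + 1 < N; ++i)
            len += dist[perm[i] * N + perm[i + 1]];
        best = std::min(best, len);
    } while (std::next_permutation(perm.begin(), perm.end()));
    shortestDistance = best;
}
```

The host-side pattern mirrors this: push all \(N \times N\) matrix words into the stream, start the kernel, then read the scalar result once the kernel signals completion.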

Build Flows (hls.tcl vs hls_opt.tcl)

These scripts represent two points on the Pareto frontier of hardware design: functional correctness versus optimized performance.

Baseline Flow (hls.tcl):

  • Source: tsp.cpp (implied standard implementation)
  • Goal: Prove functional correctness and establish baseline resource/performance metrics.
  • Clock: 3.0ns period (~333 MHz target)
  • Flow: C simulation → Synthesis → Co-simulation → Export to Vivado IP catalog

Optimized Flow (hls_opt.tcl):

  • Source: tsp_opt.cpp (optimized variant)
  • Goal: Maximize throughput via aggressive pipelining, unrolling, and dataflow optimization.
  • Same Constraints: Same part (xcvu9p) and clock, but source code contains pragmas like #pragma HLS PIPELINE II=1, #pragma HLS UNROLL, or #pragma HLS ARRAY_PARTITION.

Why Two Flows Matter: In hardware acceleration, "optimization" isn't just compiler flags—it requires algorithmic restructuring to expose parallelism. The tsp_opt.cpp likely restructures the nested permutation loops into a pipeline-friendly form, perhaps using a space-time tradeoff (e.g., generating permutations in parallel using a specific ordering like Johnson-Trotter, or unrolling the inner distance accumulation).


Design Decisions and Tradeoffs

1. Exactness vs. Scalability: The Brute-Force Choice

The Decision: Use exact brute-force enumeration (\(O(N!)\) complexity) rather than approximate heuristics (e.g., genetic algorithms, simulated annealing, or Lin-Kernighan).

The Rationale: This is an algorithmic correctness tutorial, not a production logistics solver. The goal is to demonstrate hardware acceleration of a "known correct" baseline. Approximate algorithms introduce non-determinism and convergence complexity that would obscure the hardware/software interface lessons.

The Tradeoff:

  • Pros: Guaranteed optimal result; deterministic execution time (makes hardware timing analysis tractable); easy to verify correctness.
  • Cons: Useless for \(N > 15\) (13! ≈ 6B, 20! ≈ 2.4 quintillion). The CPU reference takes minutes to hours; the hardware version is limited to \(N=5\) due to resource constraints.

2. Fixed-Point vs. Floating-Point: The Quantization Strategy

The Decision: Normalize floating-point distances to uint16_t fixed-point for the hardware kernel, while the CPU uses float for internal computation but validates against the same quantized values.

The Rationale: FPGAs have limited Floating-Point Units (FPUs) compared to integer DSP slices. Fixed-point arithmetic consumes fewer resources, achieves higher clock frequencies, and has deterministic latency (no denormalized number handling). The uint16_t range (65535 levels) provides sufficient precision for the TSP distance comparison—since we're only comparing relative path lengths, absolute precision matters less than monotonicity.

The Tradeoff:

  • Pros: 3-5x resource reduction vs. float32; enables pipelining at 300MHz+; predictable timing.
  • Cons: Introduces quantization error (~0.0015% relative error max); requires careful scaling to avoid overflow in accumulation (a path sums up to \(N\) distances, each up to 65535, so the accumulator needs \(16 + \lceil\log_2 N\rceil\) bits).
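The accumulator-width claim can be checked with a small compile-time helper (a sketch; the function names are hypothetical):

```cpp
#include <cassert>
#include <cstdint>

// Smallest number of bits needed to represent maxValue.
constexpr unsigned bitsFor(uint64_t maxValue) {
    unsigned bits = 0;
    while (maxValue) { ++bits; maxValue >>= 1; }
    return bits;
}

// Worst-case accumulator width for summing nEdges distances,
// each at most 65535: equals 16 + ceil(log2(nEdges)).
constexpr unsigned accumulatorBits(unsigned nEdges) {
    return bitsFor(uint64_t(nEdges) * 65535u);
}
```

For example, a 13-edge sum peaks at 13 × 65535 = 851955, which needs 20 bits, matching 16 + ⌈log₂ 13⌉ = 20.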

3. Streaming vs. Memory-Mapped: Interface Architecture

The Decision: Use hls::stream<uint16_t> (AXI4-Stream) for the distance matrix input instead of an m_axi (AXI4-Full) memory-mapped pointer.

The Rationale: The TSP algorithm accesses the distance matrix in a predictable, sequential pattern during the inner loop (for a given permutation, it reads \((perm[0], perm[1])\), then \((perm[1], perm[2])\), etc.). However, the permutation order is irregular across the search space. Streaming decouples the memory access pattern from the compute pipeline: the host can burst-transfer the entire matrix into the kernel's input FIFO, and the kernel consumes it at its own rate, stalling only if the FIFO empties. This is simpler than managing cache coherence or burst length optimization with m_axi.

The Tradeoff:

  • Pros: Simpler control logic (no address generation); natural back-pressure handling via FIFO full/empty signals; decouples host memory system from kernel clock domain.
  • Cons: Requires the entire distance matrix to be streamed in (no random access to skip entries); if the matrix is large, the FIFO depth must be sized to prevent stalling, consuming BRAM resources.

4. Generic C++ vs. HLS-Optimized: The Implementation Split

The Decision: Maintain two separate implementations: main_gold.cpp (generic C++17) and tsp.h/tsp.cpp (HLS-synthesizable with pragmas).

The Rationale: Generic C++ code written for software (e.g., using std::next_permutation, dynamic memory allocation, or recursive templates) often synthesizes poorly to hardware. HLS requires specific pragmas (#pragma HLS PIPELINE, UNROLL, ARRAY_PARTITION) and coding patterns (no dynamic memory, bounded loops, explicit streaming). Attempting to write a single source file that is both optimal software and optimal hardware is usually impossible—the optimization goals are contradictory (software favors cache-friendly sequential access, hardware favors parallel pipelined access).

The Tradeoff:

  • Pros: Each implementation can be optimized for its target (CPU cache hierarchy vs. FPGA DSP/BRAM resources); easier to debug and verify separately; clear separation of concerns.
  • Cons: Code duplication (same algorithm logic in two places); risk of divergence (bug fixes must be applied to both versions); verification burden to prove equivalence.

Usage, Extension Points, and Pitfalls

How to Build and Run

CPU Reference:

cd Hardware_Acceleration/Design_Tutorials/04-traveling-salesperson/CPU_POC
g++ -O3 -std=c++17 -o tsp_cpu main_gold.cpp
./tsp_cpu
# Expect: Progress indicator output followed by the shortest distance integer

HLS Kernel (Baseline):

cd Hardware_Acceleration/Design_Tutorials/04-traveling-salesperson/build
vitis_hls -f hls.tcl
# Generates: proj/solution1/syn/verilog/tsp.v (and other IP files)

HLS Kernel (Optimized):

vitis_hls -f hls_opt.tcl
# Generates: proj_opt/solution1/syn/verilog/tsp.v

Extending the Module

Increasing N for the CPU Version: To solve for more cities (if you have time to wait):

  1. Change constexpr int N = 13; to your desired value.
  2. Add additional city coordinates to the cities[] array.
  3. Ensure the loop counter i and array indices can handle the range (use uint64_t for the loop counter if \(N!\) exceeds 4 billion).

Modifying the HLS Kernel for Different N: This requires significant changes:

  1. Update const int N = 5; in tsp.h (and tsp.cpp if it exists).
  2. Resize the streamDistances consumption logic to expect \(N \times N\) elements.
  3. Critical: Modify the permutation generation logic. The current implementation likely uses fully unrolled loops for \(N=5\); changing \(N\) requires regenerating these pragmas or restructuring the code to use parameterized loops with ARRAY_PARTITION directives.
  4. Re-synthesize and verify timing closure—the resource utilization will change non-linearly with \(N\).

Adding New Distance Metrics: To use Manhattan distance instead of Euclidean:

  1. In main_gold.cpp, replace the sqrt(pow(...)) calculation with abs(cities[i].x-cities[j].x) + abs(cities[i].y-cities[j].y).
  2. The normalization step remains the same.
  3. For the HLS kernel, update the distance calculation logic similarly, ensuring the arithmetic maps to available DSP resources (absolute value is cheaper than square root).
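A sketch of the two metrics side by side (the City struct and function names are illustrative, not the tutorial's exact declarations):

```cpp
#include <cassert>
#include <cmath>

struct City { float x, y; };

// Euclidean metric used by the tutorial; sqrt maps to a comparatively
// expensive floating-point core in HLS.
float euclidean(const City& a, const City& b) {
    return std::sqrt((a.x - b.x) * (a.x - b.x) + (a.y - b.y) * (a.y - b.y));
}

// Manhattan replacement: two subtractions, two absolute values, one add,
// all cheap in LUT/DSP terms. The downstream normalization step is unchanged.
float manhattan(const City& a, const City& b) {
    return std::fabs(a.x - b.x) + std::fabs(a.y - b.y);
}
```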

Edge Cases and Gotchas

1. Factorial Overflow in Progress Indicator:

if (i%(factorial(N)/500) == 0) ...

For \(N=13\), factorial(13) is ~6.2 billion, which fits in uint64_t. However, if you increase \(N\) to 21 or higher, \(N!\) exceeds \(2^{64}-1\) and overflows, causing the progress indicator to malfunction (division by zero or incorrect frequency). Fix: Add a static assertion or compile-time check ensuring factorial(N) fits in uint64_t.
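One way to express that check, sketched as a constexpr guard (names hypothetical):

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Plain constexpr factorial, as the progress indicator needs it.
constexpr uint64_t factorial(unsigned n) {
    uint64_t f = 1;
    for (unsigned k = 2; k <= n; ++k) f *= k;
    return f;
}

// Compile-time guard: does n! fit in uint64_t without wrapping?
// f * k overflows exactly when f > UINT64_MAX / k (integer division).
constexpr bool factorialFits(unsigned n) {
    uint64_t f = 1;
    for (unsigned k = 2; k <= n; ++k) {
        if (f > std::numeric_limits<uint64_t>::max() / k) return false;
        f *= k;
    }
    return true;
}

// Refuse to compile a build whose progress divisor would overflow.
static_assert(factorialFits(13), "N! must fit in uint64_t");
```

With this guard, bumping N past 20 fails at compile time instead of silently wrapping at run time (20! still fits in uint64_t; 21! does not).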

2. Missing Return-to-Start Distance: The current CPU implementation calculates the sum of distances between perm[0]→perm[1]→...→perm[N-1], but it does not add the return leg from perm[N-1] back to perm[0] (the starting city). This computes the path length of an open route, not a true traveling salesperson tour (which must be a cycle).

Impact: For the purpose of comparing relative path lengths (finding the minimum), this is acceptable as long as all permutations are evaluated consistently (all lack the final leg). However, the absolute distance value reported is not the true tour length.

Fix for True TSP: Add distance += distances[perm[N-1]*N+perm[0]]; after the inner loop.
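The difference between the two route definitions, sketched on a toy 4-city matrix (the matrix values and helper names are invented for illustration):

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int N = 4;

// Hypothetical symmetric distance matrix, row-major.
constexpr std::array<uint16_t, N * N> demo = {
     0, 10, 15, 20,
    10,  0, 35, 25,
    15, 35,  0, 30,
    20, 25, 30,  0};

// Open-route length: what the tutorial's CPU code computes (N-1 legs).
unsigned int openRoute(const std::array<uint16_t, N * N>& d,
                       const std::array<uint8_t, N>& perm) {
    unsigned int len = 0;
    for (int i = 0; i + 1 < N; ++i) len += d[perm[i] * N + perm[i + 1]];
    return len;
}

// True TSP tour: the open route plus the return leg perm[N-1] -> perm[0].
unsigned int closedTour(const std::array<uint16_t, N * N>& d,
                        const std::array<uint8_t, N>& perm) {
    return openRoute(d, perm) + d[perm[N - 1] * N + perm[0]];
}
```

Note that the two objectives can select different minimizers, so adding the return leg may change not only the reported distance but also which permutation wins.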

3. HLS Interface Synchronization: The HLS kernel function signature uses a non-const reference for shortestDistance:

void tsp(hls::stream<uint16_t>& streamDistances, unsigned int& shortestDistance);

In HLS, scalar outputs passed by reference typically map to s_axilite interfaces (AXI4-Lite control registers). However, the host code must ensure proper synchronization: the kernel starts, processes the stream, writes the result, and sets a "done" bit. If the host polls shortestDistance before the kernel completes, it reads stale data.

Gotcha: Unlike memory-mapped interfaces where coherency is handled by the platform, scalar s_axilite outputs require explicit synchronization (usually via XRT APIs xrt::run::wait() or polling the IP interrupt).

4. Quantization Error Accumulation: When converting floating-point distances to uint16_t, distances are rounded to the nearest integer after scaling. For path length calculations, this quantization error accumulates:

  • Single edge error: \(\pm 0.5\) LSB (Least Significant Bit) after rounding.
  • Total path error for \(N\) cities: approximately \(\pm 0.5 \cdot N\) LSBs.

For \(N=13\), this could mean a path length deviation of up to \(\pm 6.5\) units in the scaled integer domain. When comparing two paths with very similar lengths (differing by less than the quantization error), the hardware may return a different minimum than the exact floating-point calculation.

Mitigation: The CPU reference uses the same quantization, ensuring consistency between software and hardware results. For higher precision, use uint32_t or fixed-point types with more fractional bits (at the cost of DSP resources).
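A small sketch of the scaling and the worst-case error bound (the SCALE constant is an assumption for illustration; main_gold.cpp derives its own scale factor from the data):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Assumed scale: map the largest expected distance (here 50.0 units)
// onto the full uint16_t range.
constexpr float SCALE = 65535.0f / 50.0f;

// Round-to-nearest quantization: at most 0.5 LSB error per edge.
uint16_t quantize(float d) {
    return static_cast<uint16_t>(std::lround(d * SCALE));
}

// Worst-case accumulated path error, in LSBs, for a path of nEdges edges.
float worstCasePathErrorLsb(int nEdges) {
    return 0.5f * nEdges;
}
```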

5. Stream Buffer Depth and Deadlock: The hls::stream<uint16_t> interface has a default depth (usually 2 or inferred). If the host writes the entire distance matrix (\(N \times N\) elements) in a burst, but the kernel consumes them slowly (e.g., due to pipeline stalls), the stream FIFO may overflow.

However, in typical Vitis flows, the DMA engine handles back-pressure: it pauses writing when the stream FIFO is full. The real deadlock risk is on the output side: if the host doesn't read shortestDistance promptly after the kernel signals completion, the kernel may stall on the final write, though this is less common with scalar outputs.

Best Practice: Ensure the kernel's stream interface depth (set via #pragma HLS stream variable=streamDistances depth=N*N) can buffer the expected burst size, or ensure the producer respects back-pressure.


Mental Model: The Algorithm as a Search Tree

To understand both implementations, imagine the TSP solution space as a tree where:

  • The root is the starting city (city 0).
  • Each level represents choosing the next city to visit.
  • Each leaf at depth \(N\) represents a complete tour (a permutation of all cities).
  • The path weight from root to leaf is the total distance of that tour.

The CPU Approach (Exhaustive Depth-First): The std::next_permutation algorithm effectively performs a lexicographic traversal of all leaf nodes. It maintains the current path (the perm array) and incrementally updates it to the next lexicographic ordering. The distance calculation is an \(O(N)\) accumulation along the current path. This is simple, sequential, and memory-efficient, but it explores every single leaf—there is no pruning (since we seek the exact optimum, not an approximation).

The HLS Approach (Parallel Pipeline): Hardware implementation transforms this tree search into a pipeline. Instead of exploring one path at a time, the hardware can:

  1. Generate multiple candidate paths or path segments in parallel using unrolled counters.
  2. Fetch distances from the matrix for multiple cities simultaneously using partitioned BRAMs (multi-port memory).
  3. Accumulate path lengths in dedicated adder trees.
  4. Reduce the results to a minimum using comparator trees.

The "cost" of this parallelism is spatial area: each parallel path requires physical hardware resources (DSPs for arithmetic, BRAMs for storage, LUTs for control logic). Hence, the HLS version uses \(N=5\) (120 permutations) rather than \(N=13\) (6.2 billion permutations)—the hardware to handle \(N=13\) with the same parallelism would be prohibitively large.
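Step 4's comparator tree can be sketched in plain C++; when a loop like this is fully unrolled, HLS maps each stride level to a rank of parallel comparators, resolving 8 candidates in 3 stages instead of 7 sequential compares (an illustrative model, not the tutorial's code):

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// Log-depth min-reduction over a power-of-two block of candidate path lengths.
template <int W>
unsigned int minTree(std::array<unsigned int, W> v) {
    static_assert((W & (W - 1)) == 0, "W must be a power of two");
    for (int stride = W / 2; stride > 0; stride /= 2)  // one comparator rank per level
        for (int i = 0; i < stride; ++i)               // comparators within a rank
            v[i] = std::min(v[i], v[i + stride]);
    return v[0];
}
```

In hardware, the outer loop becomes log₂(W) pipeline stages and the inner loop becomes W/2 parallel comparators at each stage, which is exactly the spatial-area cost the paragraph above describes.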



References and Related Modules

This module is part of the broader Hardware Acceleration Design Tutorials ecosystem. Key relationships:

  • Parent Context: Hardware_Acceleration_Design_Tutorials — The overarching tutorial collection covering convolution, FFT, and other acceleration patterns.
  • Sibling Tutorials: Hardware_Acceleration-Design_Tutorials-convolution_tutorial_filter2d_pipeline — A related tutorial showing 2D convolution acceleration, useful for comparing dataflow patterns versus this TSP's combinatorial approach.
  • External Dependencies: The build scripts reference AI_Engine_Development.AIE.Design_Tutorials.02-super_sampling_rate_fir.DualSSR16_hw.sw.Makefile.aie_control_xrt.cpp, indicating this tutorial may be integrated with AI Engine (AIE) flows in advanced configurations.

For deeper understanding of the HLS pragmas mentioned (PIPELINE, UNROLL, ARRAY_PARTITION), consult the Vitis HLS User Guide (UG1399). For XRT host application development patterns referenced in the dependency comments, see the XRT Native API Documentation.
