Building an Open Source Toolchain for an FPGA

25 Feb 2020 - Ahmed Sanaullah

Field Programmable Gate Arrays (FPGAs) are rapidly becoming first-class citizens in the datacenter, instead of niche components. This is because FPGAs i) can achieve high throughput and low latency by implementing a specialized architecture that eliminates a number of bottlenecks and overheads of general purpose computing, ii) consume little power and, by extension, have a high power-performance ratio, iii) have high-speed interconnects and can tightly couple computation with communication to mask the latency of data movement, and iv) can be configured so that each design is tuned for individual use cases.

Figure 1 illustrates the different configurations in which FPGAs are being deployed in a datacenter. Bump-in-the-Wire (BitW) FPGAs process all traffic between a server and a switch to perform application and system function acceleration. Coprocessor FPGAs provide a traditional accelerator configuration, like GPUs, with an optional back-end secondary network for direct connectivity between accelerators. Storage-attached FPGAs process data locally on storage servers to avoid memory copies to compute servers. Stand-alone FPGAs provide a pool of reconfigurable accelerators that can be programmed and interfaced with directly over the network. Smart network interface controllers (NICs) contain embedded FPGAs which perform custom packet processing alongside a NIC ASIC (application-specific integrated circuit). Finally, network switches can also contain embedded FPGAs that process data as the data moves through the datacenter network (e.g., collective operations such as broadcast and all-reduce).

Figure 1: Different configurations for deploying FPGAs in Data Centers

Research problem

FPGAs have traditionally lacked the clean, coherent, compatible, and consistent support for code generation and deployment that is typically available for central processing units (CPUs). For the most part, previous efforts to address this have been ad hoc and limited in scope. Developers who attempt it end up building one-off solutions because the tooling is poor, and the tooling that does exist is insufficient, especially for datacenter and high-performance computing (HPC) applications. This is because of the heavy reliance on proprietary, vendor-specific tools for core operations. These tools can change frequently and significantly, so even an effort to avoid ad hoc solutions requires a large investment of work that could be wiped out by a new generation of FPGAs. Moreover, these tools are not necessarily aimed at providing the most efficient solution, since they limit the flexibility offered to users. For example, these tools i) do not allow modifications to the algorithms for core operations (such as logic optimization and place & route), ii) hide details of (and access to) the underlying device architecture, which prevents implementation of important functions (such as logic relocation without recompilation), and iii) are designed to be generic, with limited opportunities for customization (such as vendor IP blocks). This is not a problem unique to FPGAs, however. Similar issues already exist for software.
Therefore, similar to free software in the software world, we must be able to code and deploy custom architectures using transparent, open, end-to-end frameworks that i) are not tied to any vendor, that is, do not use IP blocks or tools that are only compatible with the FPGA boards of a particular vendor, ii) provide opportunities for customization across all levels of the development stack, and iii) can be upstreamed, that is, can be easily and reliably integrated into downstream projects in order to build more complex and intricate systems.

Hardware as a reconfigurable, elastic, and specialized service

We refer to our framework for providing upstream support for datacenter FPGAs as Hardware as a RecoNfigurable, Elastic and Specialized Service (HaaRNESS). HaaRNESS is built as a high-level synthesis (HLS) tool, which creates and deploys high-quality hardware from algorithms expressed in high-level languages (HLLs) such as OpenMP or OpenCL. Developers only specify the algorithm, with minimal use of pragmas and low-level constructs, and hence require virtually no prior expertise in hardware development. This expertise covers both hardware description languages (HDLs) and the code structures (in HDL and HLL) needed to effectively map design patterns to hardware. A preprocessor transforms this simple HLL code into FPGA-centric HLL code, which removes hardware optimization blockers and helps infer opportunities for parallelism. Then, an HLS compiler converts this FPGA-centric HLL code into HDL, the hardware analogue of assembly language. The resulting HDL is run through a system generator, which can perform one of two operations: i) cycle-accurate simulation, or ii) deployment of application logic on the physical FPGA system. In the latter case, a bitstream compiler maps the HDL code onto the FPGA fabric using Synthesis and Place & Route. Then, a software runtime is used to program the application onto the board and interface with it. Finally, similar to the OS on CPUs, a hardware operating system (OS) is provisioned on the FPGA in order to share the FPGA fabric amongst multiple independent entities.
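The flow described above can be read as a chain of stages. The following is a minimal sketch of that chain, not HaaRNESS itself: every name in it (`Artifact`, `preprocess`, `hls_compile`, `system_generate`, `build`) is hypothetical, and each stage is a placeholder that only tracks what kind of code it receives and produces.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Target(Enum):
    SIMULATE = "simulate"
    DEPLOY = "deploy"


@dataclass
class Artifact:
    kind: str      # "hll", "fpga-hll", or "hdl"
    payload: str   # stand-in for the real source or generated code


def preprocess(src: Artifact) -> Artifact:
    # Hypothetical stage: rewrite plain HLL into FPGA-centric HLL.
    if src.kind != "hll":
        raise ValueError("preprocessor expects plain HLL code")
    return Artifact("fpga-hll", src.payload)


def hls_compile(src: Artifact) -> Artifact:
    # Hypothetical stage: lower FPGA-centric HLL to HDL.
    if src.kind != "fpga-hll":
        raise ValueError("HLS compiler expects FPGA-centric HLL code")
    return Artifact("hdl", src.payload)


def system_generate(hdl: Artifact, target: Target) -> List[str]:
    # Returns the ordered steps the system generator would run.
    if hdl.kind != "hdl":
        raise ValueError("system generator expects HDL")
    if target is Target.SIMULATE:
        return ["cycle-accurate simulation"]
    return ["synthesis", "place & route", "program board via software runtime"]


def build(source: str, target: Target) -> List[str]:
    hdl = hls_compile(preprocess(Artifact("hll", source)))
    return system_generate(hdl, target)
```

Calling `build("kernel source", Target.SIMULATE)` stops before any bitstream work, while `Target.DEPLOY` adds the synthesis, Place & Route, and programming steps.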

HLS code preprocessor

Current HLS tools can require developers to explicitly identify opportunities (and constraints) for parallelism, as well as manually implement a number of important design features such as caches, loop coalescing, function inlining, floating point accumulators, and data hazard elimination. This substantially increases the complexity of HLS code that developers need to provide. Our HLS code preprocessor reduces this complexity by automatically identifying optimization blockers in an HLS compiler through compiler instrumentation, and then addressing them using a series of system code transformations. Optimization blockers occur when a compiler is prevented from inferring an optimization. An optimizing transform may be blocked if it: i) modifies code functionality, instead of structure only, ii) can result in a failure to compile, iii) is based on information available only at run-time, iv) requires a global view of the computation, and/or v) is based on implicit code behavior that may be visible to the developer, but cannot be reliably extracted by the compiler.

Figure 2 illustrates our approach. To identify optimization blockers, we first built a logical model for FPGAs by identifying a set of core design patterns that an HLS compiler should be able to infer and implement efficiently in order to achieve high-quality code generation. Examples of these design patterns include single instruction, multiple data (SIMD), pipelining, caching, logic inlining, and loop structures. Then, we instrumented the HLS compiler (OpenCL in our proof-of-concept) to determine what it has inferred for a given input code. This requires analyzing the IR at compile time (static profiler) after all optimizer passes have been run (i.e., the output of the front-end HLS compiler). We then built a set of probes which contain individual design patterns in relative isolation, so that we can determine compiler effectiveness for each. By running these probes through the compiler and looking at the instrumentation report, we can tell which optimizations are blocked. The process is done once for a given version of an HLS compiler. Along with the set of probes, we also provide a set of HLL-to-HLL code transforms that can remove the optimization blocker for each probe. Examples of these transforms include loop unrolling for SIMD and generating register caches for read-after-write hazards for on-chip memories. These transforms are only applied if an optimization is blocked. Finally, this set of code transforms and the probe report are fed into the preprocessor.
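The rule that a transform is applied only when its optimization is blocked can be sketched as follows. This is an illustration under stated assumptions: the probe report is modelled as a plain dictionary from design pattern to "was it inferred", the pattern names are made up, and the two transforms are placeholders that only tag the code rather than rewrite it.

```python
from typing import Callable, Dict

# Hypothetical probe report: design pattern -> True if the compiler inferred it.
ProbeReport = Dict[str, bool]
Transform = Callable[[str], str]


def unroll_for_simd(code: str) -> str:
    # Placeholder: a real transform would rewrite the loop body.
    return "/* transformed: loop unrolled for SIMD */\n" + code


def add_register_cache(code: str) -> str:
    # Placeholder: a real transform would insert a register cache to
    # break read-after-write hazards on on-chip memory.
    return "/* transformed: register cache added */\n" + code


# Hypothetical mapping from design pattern to its unblocking transform.
TRANSFORMS: Dict[str, Transform] = {
    "simd": unroll_for_simd,
    "raw_hazard": add_register_cache,
}


def preprocess(code: str, report: ProbeReport) -> str:
    for pattern, inferred in report.items():
        # A transform runs only when its optimization is blocked.
        if not inferred and pattern in TRANSFORMS:
            code = TRANSFORMS[pattern](code)
    return code
```

With a report such as `{"simd": False, "pipelining": True, "raw_hazard": True}`, only the SIMD transform fires. Because the report is produced once per compiler version, the same report can be reused for every input program.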

Figure 2: Framework for automatically removing optimization blockers using compiler instrumentation

Advancing HLS compilers

Current HLS compilers have two major drawbacks. First, since existing HLS tools map code fragments to vendor IP blocks in order to generate HDL from HLL code, a large library of such blocks is typically needed. Such libraries consume a large amount of CPU memory, have high-overhead non-trivial lookup operations, and provide limited opportunities for optimization since they are proprietary. Only a limited set of parameters can be modified and that, too, is within predefined bounds. Second, it is likely that a significant fraction of the code translated to these IP blocks does not need to be implemented in HDL at all. Application logic can have a spectrum of performance requirements for different components, not all of which require execution on custom hardware, e.g., control plane versus data plane.

Our goal is to advance HLS compilers by addressing the above two drawbacks. The first enhancement is to reduce the size of code sequences being translated to hardware, and perform this translation using only basic vendor-agnostic and transparent hardware building blocks like registers and gates. This enables faster compilation times and allows the design to be tuned for each individual application.

The second, perhaps more critical, improvement is to identify, at compile time, the best approach for implementing the algorithm. For the code generation itself, we have three different pieces: i) the part which must always be executed on the host and cannot be placed on the FPGA (e.g., due to I/O), ii) the part which is translated into softcores on the FPGA, and iii) the rest of the code, which is translated into HDL. These parts can be either inferred automatically by the compiler (directly or through profiling executables), or marked up using OpenMP primitives. The split of (ii) and (iii), in particular, is important, since functions implemented using softcores consume negligible resources (logic/memory/DSP blocks) and can achieve asynchronous operation with respect to HDL. Moreover, for part (ii), we eliminate Place & Route (a process that can take hours or days) and achieve turnaround times similar to CPU-only software development, because the HDL does not change. If the HDL itself is as small as possible for computation kernels and/or is relocatable to other parts of the FPGA fabric, the need to run Place & Route is further reduced.
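The three-way split and its effect on rebuild cost can be sketched as below. The classification rule is an assumption made for illustration (host I/O forces the host, data-plane code goes to HDL, everything else goes to a softcore), and the names `Function`, `place`, and `needs_place_and_route` are hypothetical; the text above leaves the actual inference to the compiler or to OpenMP markup.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable


class Placement(Enum):
    HOST = "host"
    SOFTCORE = "softcore"
    HDL = "hdl"


@dataclass(frozen=True)
class Function:
    name: str
    needs_host_io: bool   # assumed attribute: must stay on the host
    data_plane: bool      # assumed attribute: performance-critical path


def place(fn: Function) -> Placement:
    if fn.needs_host_io:
        return Placement.HOST
    if fn.data_plane:
        return Placement.HDL
    return Placement.SOFTCORE


def needs_place_and_route(changed: Iterable[str],
                          placement: Dict[str, Placement]) -> bool:
    # Only edits to functions implemented in HDL alter the HDL, so only
    # those force a new Place & Route run.
    return any(placement[name] is Placement.HDL for name in changed)


functions = [
    Function("read_input_file", needs_host_io=True, data_plane=False),
    Function("configure_kernel", needs_host_io=False, data_plane=False),
    Function("stream_filter", needs_host_io=False, data_plane=True),
]
placement = {fn.name: place(fn) for fn in functions}
```

Here an edit to `configure_kernel` would be a softcore-only change, so `needs_place_and_route(["configure_kernel"], placement)` is false, while an edit to `stream_filter` is not.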

System generator: Cycle accurate simulation

Another major component of our research is building a cycle-accurate simulation framework. The framework estimates performance directly from HDL code without compiling to actual hardware. This matters because rapid and reliable design space exploration substantially reduces turnaround times for building high-quality hardware. While this feature, called RTL simulation, is certainly not novel, we provide significantly more control over what can be evaluated and how.

Using our framework, developers have the flexibility of testing both the application logic and its interaction with the world around it. The latter involves testing application logic after connecting it to intra-FPGA wrappers, operating systems, external devices, etc. This is important since testing the application logic only, without modelling the environment around it, can result in developers converging on a design that gives worse performance than naive code when actually implemented on an FPGA. Worse, a design is likely to fail to execute altogether if deadlocks were not properly identified beforehand.
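To see why the environment matters, consider a toy cycle-stepped model (far simpler than RTL simulation, and not part of the framework described here). A kernel pushes one item per cycle into an output FIFO whenever there is space, and the surrounding system drains one item every `drain_period` cycles. The function name and all parameters are assumptions made for this sketch.

```python
from collections import deque


def simulate(cycles: int, fifo_depth: int, drain_period: int) -> int:
    """Return the number of items delivered after `cycles` cycles."""
    fifo = deque()
    delivered = 0
    for cycle in range(cycles):
        # Environment side: pops one item every `drain_period` cycles.
        if fifo and cycle % drain_period == 0:
            fifo.popleft()
            delivered += 1
        # Kernel side: produces one item per cycle unless back-pressured.
        if len(fifo) < fifo_depth:
            fifo.append(cycle)
    return delivered


# Idealized environment that drains every cycle.
ideal = simulate(cycles=100, fifo_depth=4, drain_period=1)
# Slower environment that drains once every four cycles.
realistic = simulate(cycles=100, fifo_depth=4, drain_period=4)
```

Comparing `ideal` with `realistic` shows the kernel stalling on a full FIFO once the environment is modelled, which is the kind of effect that testing the application logic in isolation would hide.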

System generator: Deployment

Bitstream compiler

The mapping of HDL to the FPGA fabric is typically done exclusively by FPGA vendors. This is because it requires knowledge of low-level details of the underlying FPGA hardware, which vendors typically do not disclose publicly in order to protect intellectual property. Lack of these low-level hardware details means that we cannot determine how designs map to FPGAs and thus guarantees of security and performance cannot be reliably provided.

Our research focuses on inferring low-level hardware details by reverse engineering FPGA bitstreams. The goal here is to obtain key insights into the compilation process, which we can then use to build an open bitstream compiler. This allows us both to reduce the limitations of proprietary bitstream compilers and to implement important features that are currently not supported, e.g., FPGA fabric attestation.
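One common starting point for this kind of reverse engineering is differential analysis: compile two designs that differ in a single known feature and see which bitstream bits change. The sketch below illustrates that general idea only. It is not a description of the method used in this work, and the helper name `changed_bits` and the sample byte strings are made up.

```python
from typing import List


def changed_bits(a: bytes, b: bytes) -> List[int]:
    """Return the bit offsets at which two equal-length bitstreams differ."""
    if len(a) != len(b):
        raise ValueError("bitstreams must have the same length")
    offsets = []
    for index, (x, y) in enumerate(zip(a, b)):
        diff = x ^ y
        for bit in range(8):
            if diff & (1 << bit):
                # Bit 0 is the least significant bit of each byte.
                offsets.append(index * 8 + bit)
    return offsets


# Made-up data standing in for two bitstreams that differ in one setting.
baseline = bytes([0x00, 0x01, 0xF0])
variant = bytes([0x00, 0x03, 0xF0])
candidates = changed_bits(baseline, variant)
```

Repeating the comparison across many small design changes is what allows individual bits to be associated with specific fabric features.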

Software runtime

With regard to the software runtime, our research is primarily focused on building vendor-agnostic tools, such as drivers and runtime libraries. Similar to how the Linux kernel is built, our goal is to separate the software stack for FPGA tools into architecture/configuration dependent and independent components. This will enable us to maintain a uniform and reusable structure for the software runtime across all types of FPGA configurations and boards in the datacenter. It will also reduce the complexity of adding and removing features, since well-defined APIs will ensure that changes are compatible with existing code. Having these APIs map well to a broad set of vendors is possible since FPGAs talk to host machines over standard buses, such as PCIe (peripheral component interconnect express). These standard buses, and associated protocols, constrain the behaviour of both the software (drivers) and hardware (PCIe logic blocks) built around the buses in a similar manner for all FPGAs. Any subtle differences, such as vendor or device IDs, can be supplied as compile/load/run-time values. As a result, it is possible to implement reliable and effective uniformity, at least for the lower levels of the hardware and software stacks, and expose a consistent interface to applications on the host and device.
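The separation into dependent and independent components might look like the sketch below. Everything here is hypothetical: the `Transport` interface, the register offsets, and the ID values are placeholders chosen for illustration, and `FakeTransport` is an in-memory stand-in so the example runs without hardware.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class BoardId:
    # Supplied at compile/load/run time; the values used below are placeholders.
    vendor_id: int
    device_id: int


class Transport(ABC):
    """Architecture-dependent part: how registers are reached on one board."""

    @abstractmethod
    def read32(self, offset: int) -> int: ...

    @abstractmethod
    def write32(self, offset: int, value: int) -> None: ...


class FakeTransport(Transport):
    """In-memory register file standing in for a real bus mapping."""

    def __init__(self) -> None:
        self.regs: Dict[int, int] = {}

    def read32(self, offset: int) -> int:
        return self.regs.get(offset, 0)

    def write32(self, offset: int, value: int) -> None:
        self.regs[offset] = value & 0xFFFFFFFF


class Runtime:
    """Architecture-independent part: the same API for every board."""

    CTRL_REG = 0x0    # hypothetical register map
    START_BIT = 0x1

    def __init__(self, board: BoardId, transport: Transport) -> None:
        self.board = board
        self.transport = transport

    def start(self) -> None:
        ctrl = self.transport.read32(self.CTRL_REG)
        self.transport.write32(self.CTRL_REG, ctrl | self.START_BIT)

    def is_started(self) -> bool:
        return bool(self.transport.read32(self.CTRL_REG) & self.START_BIT)


runtime = Runtime(BoardId(vendor_id=0x1234, device_id=0x0001), FakeTransport())
runtime.start()
```

Supporting a new board in this arrangement means writing a new `Transport` and passing different IDs, while application code that calls `Runtime` stays unchanged.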

Hardware OS

Hardware operating systems are effectively any logic on the FPGA that is not part of the application. They are responsible for partitioning the device fabric between multiple entities, data flow management and interfaces, and hardware modifications. They also manage the flow of data between different components in the FPGAs by defining a number of specifications such as APIs, protocols, bus widths, clock domains, FIFO depths, etc. Figure 3 shows our hardware operating system for Bump-in-the-Wire FPGAs, called Morpheus. Morpheus supports the sharing of the FPGA fabric between developers and the system administrator. Administrator functionality offloads are particularly useful for accelerating a large number of critical workloads such as encryption, SDN, Key Value Store, database operations, etc.

Figure 3: Design of our prototype Hardware OS for Bump-in-the-Wire FPGAs

Since APIs are well defined, Morpheus can be easily modified to support other deployment configurations of FPGAs in the datacenter. This is critical to ensuring compatibility across the stack, enabling portability across FPGAs, and reducing developer effort in integrating their designs into the hardware OS.
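As a rough illustration of what partitioning the fabric between entities involves, the sketch below tracks ownership of a fixed number of fabric slots. It does not describe the actual design of Morpheus: the slot model, the class name `FabricPartitions`, and the entity names are all assumptions.

```python
from typing import List, Optional


class FabricPartitions:
    """Tracks which entity owns each (hypothetical) fabric slot."""

    def __init__(self, num_slots: int) -> None:
        self.owner: List[Optional[str]] = [None] * num_slots

    def allocate(self, entity: str) -> int:
        # Hand out the first free slot.
        for slot, current in enumerate(self.owner):
            if current is None:
                self.owner[slot] = entity
                return slot
        raise RuntimeError("no free fabric slot")

    def release(self, slot: int, entity: str) -> None:
        # Only the current owner may give a slot back.
        if self.owner[slot] != entity:
            raise PermissionError(f"{entity} does not own slot {slot}")
        self.owner[slot] = None


fabric = FabricPartitions(num_slots=2)
admin_slot = fabric.allocate("administrator")
dev_slot = fabric.allocate("developer")
```

The ownership check in `release` is the software analogue of the isolation a hardware OS has to provide when a developer's logic and an administrator's offloads share one device.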