North America Research Interest Group Meeting [October 2022]
Date: October 4, 2022
Open Hardware Initiative Talks
- Reinforcement Learning based HLS Compiler Tuning
Speaker: Hafsah Shahzad, Boston University
Abstract: Despite the proliferation of Field Programmable Gate Arrays (FPGAs) in both the cloud and at the edge, the complexity of hardware development has limited their accessibility to developers. High Level Synthesis (HLS) offers a possible solution by automatically compiling CPU code to custom circuits, but currently delivers far lower hardware quality than circuits written in Hardware Description Languages (HDLs). This is because the standard set of code optimizations used by CPU compilers, such as LLVM, is not suited to an FPGA backend. To bridge the gap between hand-tuned and automatically generated hardware, it is thus important to determine the optimal pass ordering for HLS compilations, which can vary substantially across workloads. Since there are dozens of possible passes and virtually infinite combinations of them, manually discovering the optimal pass ordering is not practical. Instead, we use reinforcement learning to automatically learn how to best optimize a given workload (or a class of workloads) for FPGAs. Specifically, we investigate the use of reinforcement learning to discover the optimal set of optimization passes (including their ordering and frequency of application) for LLVM-based HLS – a compiler-tuning technique that has been shown to be effective for CPU workloads. In this talk, we will present the results of our experiments exploring how HLS compiler tuning is affected by different reinforcement learning strategies, including but not limited to: i) selection of features, ii) methods for reward calculation, iii) selection of agent, iv) action space, and v) training parameters. Our goal is to identify strategies that converge to the best possible solution, take the least time to do so, and provide results that can be applied to a class of workloads rather than individual ones (to avoid retraining the model).
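The pass-ordering search described above can be illustrated with a minimal sketch. Note that the pass names, toy cost model, and epsilon-greedy search below are illustrative assumptions, not the speakers' actual agent, reward function, or HLS toolchain:

```python
import random

# Hypothetical subset of LLVM pass names; the real action space is much larger.
PASSES = ["mem2reg", "loop-unroll", "gvn", "licm", "sroa", "instcombine"]

def hls_cost(sequence):
    # Stub reward model: in a real setup this would run the HLS backend and
    # measure circuit latency/area. This toy version simply rewards distinct
    # passes and running "mem2reg" first.
    cost = -1.0 * len(set(sequence))
    if sequence and sequence[0] == "mem2reg":
        cost -= 2.0
    return cost

def epsilon_greedy_search(episodes=200, seq_len=4, eps=0.3, seed=0):
    # Explore random pass sequences with probability eps per position,
    # otherwise exploit the best sequence found so far.
    rng = random.Random(seed)
    best_seq, best_cost = None, float("inf")
    for _ in range(episodes):
        seq = []
        for pos in range(seq_len):
            if best_seq is None or rng.random() < eps:
                seq.append(rng.choice(PASSES))
            else:
                seq.append(best_seq[pos])
        cost = hls_cost(seq)
        if cost < best_cost:
            best_seq, best_cost = list(seq), cost
    return best_seq, best_cost
```

Swapping the stub cost model for an actual compile-and-measure step is what makes reward calculation (item ii above) a key design decision: each episode then costs a full HLS run.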
- Dynamic Infrastructure Services Layer for FPGAs
Speakers: Sahan Bandara and Zaid Tahir, Boston University
Abstract: FPGAs have long filled crucial niches in networking and at the edge by combining powerful computing/communication, hardware flexibility, and energy efficiency. However, there are challenges in development and design portability for FPGAs: the entire hardware stack is commonly rebuilt for each deployment. Operating System-like abstractions, referred to as Shells or hardware Operating Systems (hOS), can help reduce the development complexity of FPGA workloads by connecting the IP blocks needed to support core functionality, e.g., memory, network, and I/O controllers. However, existing hOS have a number of limitations, such as the use of IP blocks which cannot be modified, fixed resource overhead, tightly coupled IP blocks, and unique interfaces which reduce design portability. As a result, existing hOS are typically only useful for specific workloads, interfaces, vendors, and hardware deployed in a specific infrastructure configuration (e.g., a SmartNIC). In this work, we present the Dynamic Infrastructure Services Layer (DISL) for FPGAs as a solution to the above limitations. DISL is a framework that allows developers to generate hOS that can be either generic or customized based on user requirements such as the target workload, FPGA size, FPGA vendor, available peripherals, etc. DISL does so through a number of features, such as: i) use of open source, heavily parameterized, and vendor-agnostic IP blocks, ii) a modular layout and configurable interconnect, iii) standard Application Programming Interfaces (APIs) at both the inter- and intra-device level, iv) automatic detection of an application’s hOS requirements for components and connectivity (both compile-time and run-time) during compilation, and v) a DISL software development kit (SDK) which is integrated into the Linux kernel and gives users access to tools for configuring, monitoring, debugging, and various other utilities that reduce the complexity of developing, deploying, and interfacing FPGA workloads.
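To give a feel for the requirement-driven shell generation described above, here is a highly simplified sketch. The component names and field layout are hypothetical stand-ins, not DISL's actual configuration format:

```python
# Hypothetical catalog of parameterized, vendor-agnostic IP blocks,
# keyed by the core service each one provides.
AVAILABLE_IP = {
    "ddr4_controller": {"provides": "memory"},
    "udp_stack": {"provides": "network"},
    "uart": {"provides": "io"},
}

def resolve_shell(requirements):
    """Pick one IP block per detected application requirement, mimicking
    how a framework could assemble a minimal hOS instead of a fixed,
    one-size-fits-all shell."""
    chosen = []
    for need in requirements:
        match = next((name for name, ip in AVAILABLE_IP.items()
                      if ip["provides"] == need), None)
        if match is None:
            raise ValueError(f"no IP block provides '{need}'")
        chosen.append(match)
    return chosen
```

For example, an application that only touches memory and the network would get a shell containing just the memory controller and network stack, leaving the rest of the fabric free for the workload.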
- Optimizing open source tooling for FPGA bitstream generation
Speaker: Shachi Vaman Khadilkar, University of Massachusetts-Lowell
Abstract: The flexibility, high performance, and power efficiency of Field Programmable Gate Arrays (FPGAs) have made them increasingly ubiquitous in both cloud and edge environments. However, the existing state-of-the-art vendor tooling for FPGA bitstream generation lacks a number of features that are critical for high productivity, which in turn results in long turnaround times (hours to days) and substantially limits the manner in which FPGAs can be used. Since this tooling is also closed source, it cannot be modified to incorporate additional functionality. On the other hand, while there are a number of open source alternatives, these tools currently deliver only a fraction of the hardware quality of vendor tooling – making their use impractical for most workloads. Our work aims to bridge this gap between open-source and vendor tooling for FPGA bitstream generation, in order to make the former a viable solution to the low productivity of FPGA development. To do so, we first build a synthetic benchmark set which can be used to identify and analyze policy decisions made by tools that impact generated hardware quality. Next, we apply these benchmarks to open source tools in order to determine bottlenecks or suboptimal policies. Finally, we optimize the identified policies – either manually or through reinforcement learning (to automatically determine the best strategy for a given design). To demonstrate the effectiveness of our approach, we apply it to packing – a critical step in the bitstream generation process which impacts device resource utilization. The open source tool we have used is Versatile Place and Route (VPR). In this talk, we will look at the details of packing policies, the synthetic benchmarks that we built, and the metrics we developed to determine packing quality.
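As a toy illustration of the kind of packing-quality metric discussed above, the sketch below computes how densely a packer filled its clusters. The capacity number and the metric itself are illustrative assumptions, not VPR's actual cost model:

```python
def packing_density(clusters, lut_capacity=8):
    """Toy packing-quality metric: the fraction of LUT slots actually
    occupied across the clusters a packer produced. A low value means
    the packer spread logic thinly, wasting device resources.

    clusters: list of clusters, each a list of packed primitives.
    lut_capacity: assumed slots per cluster (hypothetical figure).
    """
    used = sum(len(c) for c in clusters)
    capacity = len(clusters) * lut_capacity
    return used / capacity if capacity else 0.0
```

Comparing such a metric across tools and benchmarks is one way to surface the suboptimal policies the abstract refers to: two packers can produce functionally identical netlists while consuming very different numbers of clusters.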
- Relational Memory: Native In-Memory Stride Access
Speaker: Ju Hyoung Mun, Boston University
Abstract: Over the past few years, large-scale real-time data analytics has soared in popularity as the demand for analyzing fresh data has grown. Hence, modern systems must support both transactional and analytical workloads, often referred to as Hybrid Transactional/Analytical Processing (HTAP). Analytical systems typically use a columnar layout to access only the desired fields. In contrast, storing data row-first works well for accessing, inserting, or updating entire rows. But transforming rows to columns at runtime is expensive, so many analytical systems ingest row-major data and eventually load it into a columnar system or in-memory accelerator for future analytical queries. However, these systems generally suffer from high complexity, high materialization cost, and heavy book-keeping overheads. How would this design change if the optimal layout were always available? We present a radically new approach, termed Relational Memory (RM), that converts rows into columns at runtime. We rely on a hardware accelerator that sits between the CPU and main memory and transparently converts base data to any group of columns with minimal overhead. To support different layouts over the same base data, we introduce ephemeral variables, a special type of variable that is never instantiated in main memory. Instead, upon accessing them, the underlying machinery generates a projection of the requested columns in the format that maximizes data locality. We implement and deploy RM on a commercially available platform that includes CPUs and an FPGA. We demonstrate that RM provides a significant performance advantage: accessing the desired columns up to 1.63x faster than the row-wise counterpart, while matching the performance of columnar access for low projectivity, and outperforming it by up to 1.87x as projectivity increases.
Our next steps include supporting selection in hardware to reduce unnecessary data movements and integrating the proposed design within a DDR4 memory controller.
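The row-to-column projection described above can be modeled in software, keeping in mind that RM performs this gather in hardware between the CPU and memory. The function below is purely an illustrative emulation, not RM's interface:

```python
import array

def project_columns(rows_flat, row_width, cols):
    """Software emulation of RM-style stride access: gather only the
    requested columns from row-major base data into a dense buffer,
    so later scans touch no unwanted fields.

    rows_flat: row-major base data as a flat array.
    row_width: number of columns per row.
    cols: indices of the columns to project.
    """
    n_rows = len(rows_flat) // row_width
    out = array.array(rows_flat.typecode)
    for r in range(n_rows):
        base = r * row_width
        for c in cols:
            out.append(rows_flat[base + c])
    return out

# Two 3-column rows: (1, 2, 3) and (4, 5, 6); project columns 0 and 2.
table = array.array("i", [1, 2, 3, 4, 5, 6])
projected = project_columns(table, 3, [0, 2])
```

In software this copy is exactly the materialization cost the abstract calls expensive; RM's point is that a hardware unit on the memory path can produce the same dense projection transparently, with the ephemeral variable never occupying main memory.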