Andrew David Gunter
Doctor of Philosophy in Electrical and Computer Engineering (PhD)
Research Topic
Reducing wasted time in Field-Programmable Gate Array circuit design compilation
Debug for FPGAs, FPGA Architectures, CAD for FPGAs
Dr. Steve Wilton is definitely more than a #GreatSupervisor. He is an outstanding mentor and I'm excited to keep learning from him every day. #UBC
Dissertations completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest dissertations.
Recent years have seen a dramatic increase in the use of hardware accelerators to perform machine learning computations. Designing these circuits is challenging, especially due to bugs that may only manifest after long run times, and interactions between hardware and software that are complex to understand. As a result, debugging the machine learning accelerator and ensuring that the system is delivering acceptable performance are very time-consuming processes that significantly limit productivity. This dissertation focuses on investigating how additional circuitry added to machine learning hardware designs may allow for the effective debugging of those systems, and on gathering insights on how to better debug those systems. More specifically, we focus on techniques that are suitable for accelerators implemented using Field-Programmable Gate Arrays (FPGAs), since many accelerators are prototyped using this type of reconfigurable fabric. This dissertation comprises four major contributions towards this goal. First, we present a debugging framework that allows the designer to observe domain-specific information (e.g. sparsity and other statistics) about the machine learning workload running on the accelerator, rather than raw information that is expensive to trace. This includes the creation of custom circuitry that allows information to be recorded for at least 21.8x longer than in previous debugging architectures. Second, we create a technique to reduce the time between debug iterations by creating an overlay that enables designers to change which signals are being traced and select how those signals are being aggregated without the need to resynthesize the design. Third, we investigate how to debug underperforming training jobs, resulting in a novel programmable debug architecture that allows designers to create custom ways of aggregating data at debug time, instead of being constrained by a few options selected at compile time. Finally, we investigate the impacts of hardware acceleration on the optimization landscape of training systems and how this information may be used to accelerate debug. We anticipate that the concepts presented in this thesis will be used to allow designers to aggregate domain-specific information in future commercial debugging tools and to motivate future work on improving the training performance of hardware accelerators.
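To illustrate the flavour of domain-specific tracing described above, the sketch below records a few per-layer statistics (such as sparsity) instead of raw activation values. The function names, choice of statistics, and buffer size are assumptions for illustration only, not the dissertation's instrumentation.

```python
# Minimal sketch: trace a per-layer summary (sparsity and simple statistics)
# instead of the raw activation values. Names and statistics are illustrative
# assumptions, not the instrumentation described in the dissertation.
from collections import deque

def summarize_activations(values):
    """Reduce a layer's activations to a handful of numbers worth tracing."""
    n = len(values)
    zeros = sum(1 for v in values if v == 0)
    return {
        "sparsity": zeros / n if n else 0.0,
        "min": min(values) if values else None,
        "max": max(values) if values else None,
        "mean": (sum(values) / n) if n else None,
    }

# A bounded stand-in for an on-chip trace buffer: storing one summary per layer
# invocation lasts far longer than storing every raw element would.
trace_buffer = deque(maxlen=1024)

def on_layer_output(layer_name, values):
    trace_buffer.append((layer_name, summarize_activations(values)))

on_layer_output("conv1", [0, 0, 3, 0, 7, 0, 0, 1])
print(trace_buffer[-1])  # ('conv1', {'sparsity': 0.625, ...})
```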
Field-Programmable Gate Array (FPGA) technology is rapidly gaining traction in a wide range of applications. Nonetheless, FPGA debug productivity is a key challenge. For FPGAs to become mainstream, a debug ecosystem which provides the ability to rapidly debug and understand designs implemented on FPGAs is essential. Although simulation is valuable, many of the most elusive and troublesome bugs can only be found by running the design on an actual FPGA. However, debugging at the hardware level is challenging due to limited visibility. To gain observability, on-chip instrumentation is required. In this thesis, we propose methods which can be used to support rapid and efficient implementation of on-chip instruments such as triggers and coverage monitoring. We seek techniques that avoid large area overhead and slow recompilation of the user circuit between debug iterations. First, we explore the feasibility of implementation of debug instrumentation into FPGA circuits by applying incremental compilation techniques to reduce the time required to insert trigger circuitry. We show that incremental implementation of triggers is constrained by the mapping of the user circuits. Second, we propose a rapid triggering solution through the use of a virtual overlay fabric and mapping algorithms that enables fast debug iterations. The overlay is built from leftover resources not used by the user circuit, reducing the area overhead. At debug time, the overlay fabric can quickly be configured to implement desired trigger functionalities. Experimental results show that the proposed approach can speed up debug iteration runtimes by an order of magnitude compared to circuit recompilation. Third, to support rapid and efficient implementation of complex triggering capabilities, we design and evaluate an overlay fabric and mapping tools specialized for trigger-type circuits. Experimental evaluation shows that the specialized overlay can be reconfigured to implement complex triggering scenarios in less than 40 seconds, enabling rapid FPGA debug. The final contribution is a scalable coverage instrumentation framework based on overlays that enables runtime coverage monitoring during post-silicon validation. Our experiments show that using this framework to gather branch coverage data is up to 23X faster compared to compile-time instrumentation with a negligible impact on the user circuit.
High-level synthesis (HLS) is a rapidly growing design methodology that allows designers to create digital circuits using a software-like specification language. HLS promises to increase the productivity of hardware designers in the face of steadily increasing circuit sizes, and broaden the availability of hardware acceleration, allowing software designers to reap the benefits of hardware implementation. One roadblock to HLS adoption is the lack of an in-system debugging infrastructure. Existing debug technologies are limited to software emulation and cannot be used to find bugs that only occur in the final operating environment. This dissertation investigates techniques for observing HLS circuits, allowing designers to debug the circuit in the context of the original source code, while it executes at-speed in the normal operating environment. This dissertation comprises four major contributions toward this goal. First, we develop a debugging framework that provides users with a basic software-like debug experience, including single-stepping, breakpoints and variable inspection. This is accomplished by automatically inserting specialized debug instrumentation into the user's circuit, allowing an external debugger to observe the circuit. Debugging at-speed is made possible by recording circuit execution in on-chip memories and retrieving the data for offline analysis. The second contribution contains several techniques to optimize this data capture logic. Program analysis is performed to develop circuitry that is tailored to the user's individual design, capturing a 127x longer execution trace than an embedded logic analyzer. The third contribution presents debugging techniques for multithreaded HLS systems. We develop a technique to observe only user-selected points in the program, allowing the designer to sacrifice complete observability in order to observe specific points over a longer period of execution. We present an algorithm to allow hardware threads to share signal-tracing resources, increasing the captured execution trace by 4x for an eight-thread system. The final contribution is a metric to measure observability in HLS circuits. We use the metric to explore trade-offs introduced by recent in-system debugging techniques, and show how different approaches affect the likelihood that variable values will be available to the user, and the duration of execution that can be captured.
A soft vector processor (SVP) is an overlay on top of FPGAs that allows data-parallel algorithms to be written in software rather than hardware, and yet still achieve hardware-like performance. This ease of use comes at an area and speed penalty, however. Also, since the internal design of SVPs is based largely on custom CMOS vector processors, there is additional opportunity for FPGA-specific optimizations and enhancements. This thesis investigates and measures the effects of FPGA-specific changes to SVPs that improve performance, reduce area, and improve ease-of-use, thereby expanding their useful range of applications. First, we address applications needing only moderate performance, such as audio filtering, where SVPs need only a small number (one to four) of parallel ALUs. We make implementation and ISA design decisions around the goals of producing a compact SVP that effectively utilizes existing BRAMs and DSP Blocks. The resulting VENICE SVP has 2x better performance per logic block than previous designs. Next, we address performance issues with algorithms where some vector elements ‘exit early’ while others need additional processing. Simple vector predication causes all elements to exhibit ‘worst case’ performance. Density-time masking (DTM) improves performance of such algorithms by skipping the completed elements when possible, but traditional implementations of DTM are coarse-grained and do not map well to the FPGA fabric. We introduce a BRAM-based implementation that achieves 3.2x higher performance over the base SVP with less than 5% area overhead. Finally, we identify a way to exploit the raw performance of the underlying FPGA fabric by attaching wide, deeply pipelined computational units to SVPs through a custom instruction interface. We support multiple inputs and outputs, arbitrary-length pipelines, and heterogeneous lanes to allow streaming of data through large operator graphs. As an example, on an n-body simulation problem, we show that custom instructions achieve 120x better performance per area than the base SVP.
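The density-time masking idea can be illustrated in software: a plainly predicated loop touches every element on every pass, while a DTM-style loop skips elements that have already exited. This is a toy model under assumed step/done functions, not the BRAM-based hardware implementation described above.

```python
# Toy model of the benefit of density-time masking (DTM) for an "early exit"
# vector loop. This is an illustrative sketch, not the VENICE/SVP hardware.

def run_predicated(data, step, done):
    """Simple predication: every element is visited on every pass."""
    work = 0
    while not all(done(x) for x in data):
        for i, x in enumerate(data):
            work += 1                      # every lane does (possibly wasted) work
            if not done(x):
                data[i] = step(x)
    return work

def run_dtm(data, step, done):
    """DTM-style execution: completed elements are skipped on later passes."""
    work = 0
    active = list(range(len(data)))
    while active:
        still_active = []
        for i in active:
            work += 1                      # only active lanes do work
            data[i] = step(data[i])
            if not done(data[i]):
                still_active.append(i)
        active = still_active
    return work

step = lambda x: x - 1
done = lambda x: x <= 0
print(run_predicated([1, 5, 2, 9], step, done))  # 36 element-operations
print(run_dtm([1, 5, 2, 9], step, done))         # 17 element-operations
```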
As CMOS technology scales down, the cost of fabricating an integrated circuit increases. Field-programmable gate arrays (FPGAs) are becoming attractive because their fabrication costs are amortized among many customers. A major challenge facing the adoption of FPGAs in many low-power applications is their high power consumption due to their reconfigurability. A significant portion of FPGAs' power is dissipated in the form of static power during applications' idle periods. To address the power challenge in FPGAs, this dissertation proposes architecture enhancements and algorithm support to enable powering down portions of a chip during their idle times using power gating. This leads to significant energy savings, especially for applications with long idle periods, such as mobile devices. This dissertation presents three contributions that address the major challenges in adopting power gating in FPGAs. The first contribution proposes architectural support for power gating in FPGAs. This architecture enables powering down unused FPGA resources at configuration time, and powering down inactive resources during their idle times with the help of a controller. The controller can be placed on the general fabric of an FPGA device. The proposed architecture provides flexibility in realizing varying numbers and structures of modules that can be powered down at run-time when inactive. The second contribution proposes an architecture to appropriately handle the wakeup phase of power-gated modules. During a wakeup phase, a large current is drawn from the power supply to recharge the internal capacitances of a power-gated module, leading to reduced noise margins and degraded performance for neighbouring logic. The proposed architecture staggers the wakeup phase to limit this current to a safe level. This architecture is configurable and flexible enough to realize different user applications, while having negligible area and power overheads and fast power-up times. The third contribution proposes a CAD flow that supports mapping users' circuits to the proposed architecture. Enhancements to the algorithms in this flow that reduce power consumption are studied and evaluated. Furthermore, the CAD flow is used to study the granularity of the architecture proposed in this dissertation when mapping application circuits.
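The staggering idea in the second contribution can be pictured with a toy scheduler: wake-ups are grouped into waves so the combined inrush current stays under a limit. This is a minimal sketch with invented module names and current values; it is not the wake-up architecture proposed in the dissertation.

```python
# Toy sketch of staggered wake-up scheduling: modules are powered up in waves
# so that the combined inrush current never exceeds a safety limit. The module
# names, current values, and limit are illustrative assumptions only.

def stagger_wakeup(inrush_current, limit):
    """Greedily pack module wake-ups into successive waves under a current cap."""
    waves, remaining = [], dict(inrush_current)
    while remaining:
        wave, budget = [], limit
        for name, amps in sorted(remaining.items(), key=lambda kv: -kv[1]):
            if amps <= budget:
                wave.append(name)
                budget -= amps
        if not wave:
            raise ValueError("a single module exceeds the current limit")
        for name in wave:
            del remaining[name]
        waves.append(wave)
    return waves

modules = {"dsp": 0.8, "mem_ctrl": 0.5, "accel": 1.2, "io": 0.3}
print(stagger_wakeup(modules, limit=1.5))
# [['accel', 'io'], ['dsp', 'mem_ctrl']] -- wake-ups spread over two waves
```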
Electronic devices have come to permeate every aspect of our daily lives, and at the heart of each device is one or more integrated circuits. State-of-the-art circuits now contain several billion transistors. However, designing and verifying that these circuits function correctly under all expected (and unexpected) operating conditions is extremely challenging, with many studies finding that verification can consume over half of the total design effort. Due to the slow speed of logic simulation software, designers increasingly turn to circuit prototypes implemented using field-programmable gate array (FPGA) technology. Whilst these prototypes can be operated many orders of magnitude faster than simulation, on-chip instruments are required to expose internal signal data so that designers can root-cause any erroneous behaviour. This thesis presents four contributions to enable rapid and effective circuit debug when using FPGAs, in particular, by harnessing the reconfigurable and prefabricated nature of this technology. The first contribution presents a post-silicon debug metric to quantify the effectiveness of trace-buffer based debugging instruments, and three algorithms to determine new signal selections for these instruments. Our most scalable algorithm can determine the most influential signals in a large 50,000 flip-flop circuit in less than 90 seconds. The second contribution of this thesis proposes that debug instruments be speculatively inserted into the spare capacity of FPGAs, without any user intervention, and shows this to be feasible. This proposal allows designers to extract more trace data from their circuit on every debug turn, ultimately leading to fewer debug iterations. The third contribution presents techniques to enable faster debug turnaround, by using incremental-compilation methods to accelerate the process of inserting debug instruments. Specifically, our incremental optimizations can speed up this procedure by almost 100X over recompiling the FPGA from scratch. Finally, the fourth contribution describes how a virtual overlay network can be embedded into the unused resources of the FPGA device, allowing debug instruments to be modified without any form of recompilation. Experimental results show that a new configuration for a debug instrument with 17,000 trace connections can be made in 50 seconds, thus enabling rapid circuit debug.
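As a rough illustration of the signal-selection problem addressed by the first contribution, the sketch below greedily picks flip-flops by how much unobserved state their fan-out reaches. The metric and the tiny circuit here are invented stand-ins, not the thesis's debug metric or algorithms.

```python
# Illustrative sketch of greedy trace-signal selection: repeatedly pick the
# flip-flop whose fan-out reaches the most still-unobserved flip-flops. This
# is a toy stand-in for the thesis's metric and algorithms.

def select_trace_signals(fanout, budget):
    """fanout: {signal: set of signals it drives}; budget: trace-buffer width."""
    observed, selected = set(), []
    for _ in range(budget):
        best = max(fanout, key=lambda s: len(fanout[s] - observed) if s not in selected else -1)
        if best in selected:
            break
        selected.append(best)
        observed |= fanout[best] | {best}
    return selected

# Hypothetical 6-flip-flop circuit.
fanout = {
    "a": {"b", "c", "d"},
    "b": {"c"},
    "c": {"e"},
    "d": {"e", "f"},
    "e": set(),
    "f": set(),
}
print(select_trace_signals(fanout, budget=2))  # ['a', 'd'] covers the most state
```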
Field-Programmable Gate Arrays (FPGAs) are widely used to implement logic without going through an expensive fabrication process. Current-generation FPGAs still suffer from area and power overheads, making them unsuitable for mainstream adoption for large-volume systems. FPGA companies constantly design new architectures to provide higher density, lower power consumption, and faster implementation. An experimental approach is typically followed for new architecture design, which is a very slow and computationally expensive process. This dissertation presents an alternative, faster approach to FPGA architecture design. We use analytical-model-based design techniques, where the models consist of a set of equations that relate the effectiveness of FPGA architectures to the parameters describing these architectures. During early stage architecture investigation, FPGA architects and vendors can use our equations to quickly short-list a limited number of architectures from a range of architectures under investigation. Only the short-listed architectures need then be investigated using an expensive experimental approach. This dissertation presents three contributions towards the formulation of analytical models and the investigation of capabilities and limitations of these models. First, we develop models that relate key FPGA architectural parameters to the depth along the critical path and the post-placement wirelength. We detail how these models can be used to estimate the expected area of implementation and critical path delay for user-circuits mapped on FPGAs. Second, we develop a model that relates key parameters of the FPGA routing fabric to the fabric's routability, assuming that a modern one-step global/detailed router is used. We show that our model can capture the effects of the architectural parameters on routability. Third, we investigate capabilities and limitations of analytical models in answering design questions that are posed by FPGA architects. Compared with two experimental approaches, we demonstrate that analytical models can better optimize FPGA architectures while requiring significantly less design effort. However, we also demonstrate that the analytical models, due to their continuous nature, should not be used to answer architecture design questions related to applications having 'discrete effects'.
As process technology scales, the design effort and Non-Recurring Engineering (NRE) costs associated with the development of Integrated Circuits (ICs) are becoming extremely high. One of the main reasons is the high cost of preparing and processing IC fabrication masks. The design effort and cost can be reduced by employing Structured Application Specific ICs (Structured ASICs). Structured ASICs are partially fabricated ICs that require only a subset of design-specific custom masks for their completion. In this dissertation, we investigate the impact of design-specific masks on the area, delay, power, and die-cost of Structured ASICs. We divide Structured ASICs into two categories depending on the types of masks (metal and/or via masks) needed for customization: Metal-Programmable Structured ASICs (MPSAs) that require custom metal and via masks; and Via-Programmable Structured ASICs (VPSAs) that only require via masks for customization. We define the metal layers used for routing that can be configured by one or more via, or metal-and-via, masks as configurable layers. We then investigate the area, delay, power, and cost trends for MPSAs and VPSAs as a function of configurable layers. The results show that the number of custom masks has a significant impact on die-cost. In small MPSAs (area 100 sq.mm), the lowest cost is achieved with three or four configurable layers. The lowest cost in VPSAs is also obtained with four configurable layers. The best delay and power in MPSAs is achieved with three or four configurable layers. In VPSAs, four configurable layers lead to the lowest power and delay, except when logic blocks have a large layout area. MPSAs have up to 5x, 10x, and 3.5x better area, delay, and power than VPSAs, respectively. However, VPSAs can be up to 50% less expensive than MPSAs. The results also demonstrate that VPSAs are very sensitive to the architecture of fixed-metal segments; VPSAs with different fixed-metal architectures can have a gap of up to 60% in area, 89% in delay, 85% in power, and 36% in die-cost.
Field-Programmable Gate Arrays (FPGAs) are pre-fabricated integrated circuits that can be configured as any digital system. Configuration is done by Computer-Aided Design (CAD) tools. The demand placed on these tools continues to grow as advances in process technology continue to allow for more complex FPGAs. Already, for large designs, compile-times of an entire work day are common and memory requirements that exceed what would be found in a typical workstation are the norm. This thesis presents three contributions toward improving FPGA CAD tool scalability. First, we derive an analytical model that relates key FPGA architectural parameters to the CAD place-and-route (P&R) run-times. Validation shows that the model can accurately capture the trends in run-time when architecture parameters are changed. Next, we focus on the CAD tool's storage requirements. A significant portion of memory usage is in representing the FPGA. We propose a novel scheme for this representation that reduces memory usage by 5x-13x at the expense of a 2.26x increase in routing run-time. Storage is also required to track metrics used during routing. We propose three simple memory management schemes for this component that further reduce memory usage by 24%, 34%, and 43% while incurring a routing run-time increase of 4.5%, 6.5%, and 144% respectively. We also propose a design-adaptive scheme that reduces memory usage by 41% while increasing routing run-time by only 10.4%. Finally, we focus on the issue of long CAD run-times by investigating the design of FPGA architectures amenable to fast CAD. Specifically, we investigate the CAD run-time and area/delay trade-offs when using high-capacity logic blocks (LBs). Two LB architectures are studied: traditional and multi-level. Results show that for the considered architectures, CAD run-time can be reduced at the expense of area, while speed is improved. For example, CAD run-time could be reduced by 25% with an area increase of 5%. We also show that the run-time trade-offs through these architectural changes can be complementary with many previously published algorithmic speed-ups.
Many-core architectures are the most recent shift in multi-processor design. This processor design paradigm replaces large monolithic processing units with thousands of simple processing elements on a single chip. With such a large number of processing units, it promises a significant throughput advantage over traditional multi-core platforms. Furthermore, it enables better localized control of power consumption and thermal gradients. This is achieved by selective reduction of a core's supply voltage, or by switching some of the cores off to reduce power consumption and heat dissipation. This dissertation proposes an energy optimization flow to implement applications on many-core architectures taking into account the impact of Process, Voltage, and Temperature (PVT) variations. The flow utilizes multi-supply voltage techniques, namely voltage island design, to reduce power consumption in the implementation. We propose a novel approach to voltage island formation, called Voltage Island Clouds, that reduces the impact of on-chip or intra-die PVT variations. The islands are created by balancing the shape constraints imposed by intra- and inter-island communication with the desire to limit the spatial extent of each island to minimize PVT impact. We propose an algorithm to build islands for Static Voltage Scaling (SVS) and Multiple Voltage Scaling (MVS) design approaches. The optimization initially allows for a large number of islands, each with its unique voltage level. Next, the number of islands is reduced to a small practical number, e.g., four voltages. We then propose an efficient voltage selection approach, called the Removal Cost Method (RCM), that provides near-optimal results with more than a 10X speedup compared to the best-known previous methods. Finally, we present an evaluation platform considering pre- and post-fabrication PVT scenarios, where multiple applications with hundreds to thousands of tasks are mapped onto many-core platforms with hundreds to thousands of cores to evaluate the proposed techniques. Results show that the geometric average energy savings for 33 test cases using the proposed methods are 25% better than with previous methods.
With the continued trend in device scaling and the ever-increasing popularity of hand-held mobile devices, power has become a major bottleneck for the development of future-generation System-on-Chip (SoC) devices. As the number of transistors on the SoC and the associated leakage current increases with every technology generation, methods for reducing both active and static power have been aggressively pursued. Starting with the application for which the SoC is to be designed, the proposed design flow considers numerous design constraints at different steps of the design process and produces a final floorplanned solution of the cores. Voltage island design is a popular method for implementing multiple supply voltages in a SoC. Use of multiple threshold voltages with power gating of cores is an attractive method for leakage power reduction. This thesis addresses the design challenges of implementing multiple supply and threshold voltages on the same chip holistically, with the ultimate goal of maximum power reduction. Specifically, given the power-state machine (PSM) of an application, the high-power and low-power cores are identified first. Based on the activity of the cores, a threshold voltage is assigned to each core. The next step is to identify the suitable range of supply voltages for each core, followed by voltage island generation. A methodology for reducing the large number of available choices to a useful set using the application PSM is developed. The cores are partitioned into islands using a cost function that gradually shifts from a power-based assignment to a connectivity-based one. Additional design constraints such as power supply noise and floorplan constraints can offset the possible power savings and thus are considered early in the design phase. Experimental results on benchmark circuits prove the effectiveness of the proposed methodology. On average, the use of multiple VT and power gating can reduce power by almost 20% compared to a single VT. A proper choice of supply voltages leads to another 4% reduction in power. Compared to previous methods, the proposed floorplanning technique on average offers an additional 10% power savings, 9% area improvement, and a 2.4X reduction in runtime.
As the level of integrated circuit (IC) complexity continues to increase, the post-silicon validation stage is becoming a large component of the overall development cost. To address this, we propose a reconfigurable post-silicon debug infrastructure that enhances the post-silicon validation process by enabling the observation and control of signals that are internal to the manufactured device. The infrastructure is composed of dedicated programmable logic and programmable access networks. Our reconfigurable infrastructure enables not only the diagnosis of bugs, but also the detection and potential correction of errors in normal operation. In this thesis we describe the architecture, implementation, and operation of our new infrastructure. Furthermore, we identify and address three key challenges arising from the implementation of this infrastructure. Our results demonstrate that it is possible to implement an effective reconfigurable post-silicon infrastructure that is able to observe and control circuits operating at full speed, with an area overhead of between 5% and 10% for many of our target ICs.
Theses completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest theses.
Acceleration of machine learning models is proving to be an important application for FPGAs. Unfortunately, debugging such models during training or inference is difficult. Software simulations of a machine learning system may be of insufficient detail to provide meaningful debug insight, or may require infeasibly long run-times. Thus, it is often desirable to debug the accelerated model while it is running on real hardware. Effective on-chip debug often requires instrumenting a design with additional circuitry to store run-time data, consuming valuable chip resources. Previous work has developed methods to perform lossy compression of signals by exploiting machine learning specific knowledge, thereby increasing the amount of debug context that can be stored in an on-chip trace buffer. However, all prior work compresses each successive element in a signal of interest independently. Since debug signals may have temporal similarity in many machine learning applications, there is an opportunity to further increase trace buffer utilization. To that end, this thesis presents two major research contributions. The first contribution is an architecture to perform lossless temporal compression in addition to the existing lossy element-wise compression. Further, it is shown that, when applied to typical machine learning algorithms in realistic debug scenarios, approximately twice as much information can be stored in an on-chip buffer while increasing the total area of the debug instrument by approximately 25%. The impact is that, for a given instrumentation budget, a significantly larger trace window is available during debug, possibly allowing a designer to narrow down the root cause of a bug faster. The second contribution is an evaluation of the margin for compression performance improvement. An attempt was made to determine the entropy at the input of the proposed encoder using information theory, but this was mathematically intractable. Instead, a comparison was made to a best-in-class software compression algorithm. It was demonstrated that, while not superior to the software algorithm, the proposed encoder performs well in the memory-scarce context of FPGA debug.
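The two-stage idea above can be sketched as a lossy element-wise stage followed by a lossless temporal stage. Here, coarse quantization stands in for the element-wise compression and run-length coding stands in for the temporal encoder; both choices are assumptions for illustration, not the encoder architecture evaluated in the thesis.

```python
# Minimal sketch of the two-stage idea: a lossy element-wise stage (coarse
# quantization) followed by a lossless temporal stage (run-length coding of
# repeated quantized values). Parameters and encoding are illustrative
# assumptions, not the thesis's encoder.

def quantize(sample, step=0.25):
    """Lossy element-wise stage: keep only a coarse version of each value."""
    return round(sample / step)

def rle_encode(symbols):
    """Lossless temporal stage: collapse runs of identical quantized values."""
    encoded = []
    for s in symbols:
        if encoded and encoded[-1][0] == s:
            encoded[-1] = (s, encoded[-1][1] + 1)
        else:
            encoded.append((s, 1))
    return encoded

# A debug signal with strong temporal similarity (e.g. a slowly changing
# activation statistic) compresses far better after the temporal stage.
trace = [0.51, 0.50, 0.52, 0.49, 0.75, 0.76, 0.74, 1.02, 1.01]
quantized = [quantize(v) for v in trace]
print(quantized)              # [2, 2, 2, 2, 3, 3, 3, 4, 4]
print(rle_encode(quantized))  # [(2, 4), (3, 3), (4, 2)] -- 3 entries instead of 9
```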
Recent years have seen an explosion of machine learning applications implemented on Field-Programmable Gate Arrays (FPGAs). FPGA vendors and researchers have responded by updating and optimizing their fabrics to more efficiently implement machine learning accelerators, including innovations such as enhanced Digital Signal Processing (DSP) blocks and hardened systolic arrays. Evaluating these architectural proposals is difficult, however, due to the lack of publicly available benchmark circuits. This thesis presents an open-source benchmark circuit generator that maps Deep Neural Network (DNN) layers onto a proposed FPGA architecture to generate circuits that are appropriate for use in FPGA architecture studies. Our circuits are constructed based on a set of nested loops that is characteristic of DNN and other machine learning applications, but differ in the size, shape, and unrolling factors for various loops. Unlike previous generators, which create circuits that are agnostic of the underlying FPGA fabric, our circuits contain explicit instantiations of embedded computation blocks, allowing for meaningful comparison of recent architectural proposals without the need for a complete inference computer-aided design (CAD) flow. Our circuits are compatible with the VTR experimental CAD suite, allowing for architecture studies that investigate routing congestion, impact on place and route, and other low-level architectural implications. The framework also contains two levels of simulation support, allowing for validation of the generated circuits. Our benchmark circuit generator is demonstrated through three case studies which show how realistic benchmark circuits can be generated to target different embedded blocks. We use these benchmark circuits to examine how FPGA architecture decisions affect DNN accelerator performance, and how different types of DNNs have different performance bottlenecks.
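The "set of nested loops" such benchmark circuits are built around can be sketched as follows. In the generator, iterations of an unrolled loop would become parallel multiply-accumulate units; here they are only grouped to show the structure, and the parameter names and unrolling factor are illustrative assumptions.

```python
# Illustrative sketch of the loop nest underlying a dense/matrix-vector layer,
# parameterized by an unrolling factor. Iterations inside the unrolled body
# would map to parallel MAC units in generated hardware; here they are just
# grouped to show the structure. Parameter names are assumptions.

def matvec_loop_nest(weights, x, unroll_in=4):
    """y[o] = sum_i w[o][i] * x[i], with the inner loop 'unrolled' by a factor."""
    n_out, n_in = len(weights), len(x)
    y = [0.0] * n_out
    for o in range(n_out):                      # output-channel loop (kept sequential)
        for i0 in range(0, n_in, unroll_in):    # input loop, stepped by unroll factor
            # The body below corresponds to `unroll_in` parallel MAC units.
            for i in range(i0, min(i0 + unroll_in, n_in)):
                y[o] += weights[o][i] * x[i]
    return y

w = [[1, 2, 3, 4, 5, 6], [6, 5, 4, 3, 2, 1]]
x = [1, 0, 1, 0, 1, 0]
print(matvec_loop_nest(w, x, unroll_in=4))  # [9.0, 12.0]
```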
High-Level Synthesis (HLS) tools improve hardware designer productivity by enabling software design techniques to be used during hardware development. While HLS tools are effective at abstracting the complexity of hardware design away from the designer, producing high-performance HLS-generated circuits still generally requires awareness of hardware design principles. Designers must often understand and employ pragma statements at the software level or have the capability to make adjustments to the design in Register-Transfer Level (RTL) code. Even with designer hardware expertise, the HLS-generated circuits can be limited by the algorithms themselves. For example, during the HLS flow the delay of paths can only be estimated, meaning the resulting circuit may suffer from unbalanced computational distribution across clock cycles. Since the maximum operating frequency of synchronous circuits is determined statically using the worst-case timing path, this may lead to circuits with reduced performance compared to circuits designed at a lower level of abstraction. In this thesis, we address this limitation using Syncopation, a performance-boosting fine-grained timing analysis and adaptive clock management technique for HLS-generated circuits. Syncopation instrumentation is implemented entirely in soft logic without requiring alterations to the HLS-synthesis toolchain or changes to the FPGA, and has been validated on real hardware. The key idea is to use the HLS scheduling information along with the placement and routing results to determine the worst-case timing path for individual clock cycles. By adjusting the clock period on a cycle-by-cycle basis, we can increase performance of an HLS-generated circuit. Our experiments show that Syncopation improves performance by 3.2% (geomean) across all benchmarks (up to 47%). In addition, by employing targeted synthesis techniques called Enhanced Synthesis along with Syncopation we can achieve 10.3% performance improvement (geomean) across all benchmarks (up to 50%).
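The benefit of cycle-by-cycle clock adjustment can be illustrated with made-up per-state delays: a fixed clock must always run at the worst-case state's period, while an adaptive schedule charges each cycle only its own state's delay. The numbers below are invented for illustration and do not come from the Syncopation experiments.

```python
# Toy illustration of cycle-by-cycle clock adjustment: if the worst-case path
# delay of each HLS schedule state is known after place-and-route, the clock
# can be stretched only for the slow states instead of penalizing every cycle.
# The state delays and schedule below are made-up numbers for illustration.

def runtime_fixed_clock(state_delays_ns, schedule):
    period = max(state_delays_ns.values())           # single worst-case period
    return period * len(schedule)

def runtime_adaptive_clock(state_delays_ns, schedule):
    return sum(state_delays_ns[s] for s in schedule)  # per-cycle period

state_delays_ns = {"S0": 3.1, "S1": 4.8, "S2": 2.6}    # worst path per FSM state
schedule = ["S0", "S1", "S2", "S2", "S2", "S0", "S2"]  # states executed over time

fixed = runtime_fixed_clock(state_delays_ns, schedule)
adaptive = runtime_adaptive_clock(state_delays_ns, schedule)
print(f"{fixed:.1f} ns vs {adaptive:.1f} ns -> {100 * (fixed - adaptive) / fixed:.1f}% faster")
```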
High-Level Synthesis (HLS) promises improved designer productivity by allowing designers to create digital circuits targeting Field-Programmable Gate Arrays (FPGAs) using a software program. Widespread adoption of HLS tools is limited by the lack of an on-chip debug ecosystem that bridges the software to the generated hardware, and that addresses the challenge of long FPGA compile times. Recent work has presented an in-system debug framework that provides a software-like debug experience by allowing the designer to debug in the context of the original source code. However, like commercial on-chip debug tools, any modification to the on-chip debug instrumentation requires a system recompile that can take several hours or even days, severely limiting debug productivity. This work proposes a flexible debug overlay family that provides software-like debug turn-around times for HLS-generated circuits (on the order of hundreds of milliseconds). This overlay is added to the design at compile time, and at debug time can be configured many times to implement specific debug scenarios without a recompilation. We propose two sets of debug capabilities, and their required architectural and CAD support. The first set forms a passive overlay, the purpose of which is to provide observability into the underlying circuit and not change it. In this category, the cheapest overlay variant allows selective variable tracing with only a 1.7% increase in area overhead over the baseline debug instrumentation, while the deluxe variant offers a 2x-7x improvement in trace buffer memory utilization with conditional buffer freeze support. The second set of capabilities is control-based, where the overlay is leveraged to make rapid functional changes to the design. Supported functional changes include applying small deviations in the control flow of the circuit, or the ability to override signal assignments to perform efficient "what if" tests. Our overlay is specifically optimized for designs created using an HLS flow; by taking advantage of information from the HLS tool, the overhead of the overlay can be kept low. Additionally, all the proposed capabilities require the designer to only interact with their original source code.
The performance and capacity of Field-Programmable Gate Arrays (FPGAs) have dramatically improved in recent years. Today these devices are emerging as massively reconfigurable and parallel hardware computation engines in data centers and cloud computing infrastructures. These emerging application domains require better and faster FPGAs. Designing such FPGAs requires realistic benchmark circuits to evaluate new architectural proposals. However, the available benchmark circuits are few in number, outdated, and rarely representative of realistic circuits. A potential method to obtain more benchmark circuits is to design a generator that is capable of producing as many realistic circuits with specific characteristics as desired. Previous work has focused on generating benchmark circuits at the netlist level. This limits the usefulness of these circuits in evaluating FPGA Computer Aided Design (CAD) algorithms since it does not allow for the evaluation of synthesis or related mapping algorithms. In addition, these netlist-level circuit generators were calibrated using specific synthesis tools, which may no longer be state of the art. In this thesis, we introduce a Register Transfer Level (RTL) circuit generator that can automatically create benchmark circuits that can be used for FPGA architecture studies and for evaluating CAD tools. Our generator can operate in two modes: as a random circuit generator or as a clone circuit generator. The clone circuit generator works by first analyzing an input RTL circuit and then generating a new circuit based on the analysis results. The outcome of this phase is evaluated by measuring the distance between certain post-synthesis characteristics of the generated clone circuit and those of the original circuit. In this study, we generated a clone circuit for each of the VTR set of Verilog benchmark circuits. We generate clones with post-synthesis characteristics that are within 25% of the corresponding characteristics of the original circuits. In the other mode, the random circuit generator extracts analysis results from a set of RTL circuits and uses that data to generate a random circuit with post-synthesis characteristics in an acceptable range.
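A simple way to picture the clone-quality check described above is to compare each post-synthesis characteristic of the clone against the original and flag anything outside a tolerance (the thesis reports clones within 25%). The characteristic names and values below are invented for illustration.

```python
# Illustrative check that a clone circuit's post-synthesis characteristics are
# within a tolerance of the original's (the thesis reports clones within 25%).
# The characteristic names and numbers below are made up for illustration.

def within_tolerance(original, clone, tol=0.25):
    report = {}
    for key, ref in original.items():
        rel = abs(clone[key] - ref) / ref
        report[key] = (rel, rel <= tol)
    return report

original = {"luts": 12000, "ffs": 9500, "dsps": 24, "crit_path_ns": 8.2}
clone    = {"luts": 13100, "ffs": 8900, "dsps": 28, "crit_path_ns": 9.0}
for k, (rel, ok) in within_tolerance(original, clone).items():
    print(f"{k}: {rel:.1%} {'OK' if ok else 'out of range'}")
```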
High-Level Synthesis (HLS) has emerged as a promising technology that allows designers to create a digital hardware circuit using a high-level language like C, allowing even software developers to obtain the benefits of hardware implementation. HLS will only be successful if it is accompanied by a suitable debug ecosystem. There are existing debugging methodologies based on software simulation; however, these are not suitable for finding bugs which occur only during the actual execution of the circuit. Recent efforts have presented in-system debug techniques which allow a designer to debug an implementation, running on a Field-Programmable Gate Array (FPGA) at its actual speed, in the context of the original source code. These techniques typically add instrumentation to store a history of all user variables in a design on-chip. To maximize the effectiveness of the limited on-chip memory and to simplify the debug instrumentation logic, it is desirable to store only selected user variables. Unfortunately, this may lead to multiple debug runs. In existing frameworks, changing the variables to be stored between runs changes the debug instrumentation circuitry. This requires a complete recompilation of the design before reprogramming it on an FPGA. In this thesis, we quantify the benefits of recording fewer variables and address the problem of lengthy full compilations in each debug run using incremental compilation techniques present in commercial FPGA CAD tools. We propose two promising debug flows that use this technology to reduce the debug turn-around time for an in-system debug framework. In the first flow, the user circuit and instrumentation are co-optimized during compilation, giving the fastest debug clock speeds, but user circuit performance suffers once the debug instrumentation is removed. In the second flow, the optimization of the user circuit is sacrosanct: it is placed and routed first, without any constraints, and the debug instrumentation is added later, leading to the fastest user circuit clock speeds, but performance suffers slightly during debug. Using either flow, we achieve a 40% reduction in debug turn-around times, on average.
High-Level Synthesis (HLS) has emerged as a promising technology to reduce the time and complexity that is associated with the design of digital logic circuits. HLS tools are capable of allocating resources and scheduling operations from a software-like behavioral specification. In order to maintain the productivity promised by HLS, it is important that the designer can debug the system in the context of the high-level code. Currently, software simulations offer a quick and familiar method to target logic and syntax bugs, while software/hardware co-simulations are useful for synthesis verification. However, to analyze the behaviour of the circuit as it is running, the user is forced to understand waveforms from the synthesized design. Debugging a system as it is running requires inserting instrumentation circuitry that gathers data regarding the operation of the circuit, and a database that maps the record entries to the original high-level variables. Previous work has proposed adding this instrumentation at the Register Transfer Level (RTL) or in the high-level source code. Source-level instrumentation provides advantages in portability, transparency, and customization. However, previous work using source-level transformations has focused on the ability to expose signals for observation rather than the construction of the instrumentation itself, thereby limiting these advantages by requiring lower-level code manipulation. This work shows how trace buffers and related circuitry can be inserted by automatically modifying the source-level specification of the design. The transformed code can then be synthesized using the regular HLS flow to generate the instrumented hardware description. The portability of the instrumentation is shown with synthesis results for Vivado HLS and LegUp, compiled for Xilinx and Altera devices respectively. Using these HLS tools, the impact on circuit size varies from 15.3% to 52.5% and the impact on circuit speed ranges from 5.8% to 30%. We also introduce a low-overhead technique named Array Duplicate Minimization (ADM) to improve trace memory efficiency. ADM improves overall debug observability by removing up to 31.7% of the data duplication created between the trace memory and the circuit's memory structures.
Field-Programmable Gate Arrays (FPGAs) consume roughly 14 times more dynamic power than Application Specific Integrated Circuits (ASICs), making it challenging to incorporate FPGAs in low-power applications. To bridge the gap, power consumption in FPGAs needs to be addressed at the application, Computer-Aided Design (CAD) tool, architecture, and circuit levels. The ability to properly evaluate proposals to reduce the power dissipation of FPGAs requires a realistic and accurate experimental framework. Mature FPGA power models are flexible, but can suffer from poor accuracy due to estimations of signal activity and simplifications. Additionally, run-time increases with the size of the design. Other techniques use unrealistic assumptions while physically measuring the power of a circuit running on an FPGA. Neither of these techniques can accurately portray the power consumption of FPGA circuits. We propose a framework to allow FPGA researchers to evaluate the impact of proposals for the reduction of power in FPGAs. The framework consists of a real-world System-on-Chip (SoC) and can be used to explore algorithmic and CAD techniques, by providing the ability to measure the power at run-time. High-level access to common low-level power-management techniques, such as clock gating, Dynamic Frequency Scaling (DFS), and Dynamic Partial Reconfiguration (DPR), is provided. We demonstrate our framework by evaluating the effects of pipelining and DPR on power. We also reason why our framework is necessary by showing that it provides different conclusions than that of previous work.
A major task in post-silicon validation is timing validation: it can be incredibly difficult to ensure a new chip meets timing goals. Post-silicon validation is the first opportunity to check timing with real silicon under actual operating conditions and workloads. However, post-silicon tests suffer from low observability, making it difficult to properly quantify test quality for the long-running random and directed system-level tests that are typical in post-silicon. In this thesis, we propose a technique for measuring the quality of long-running system-level tests used for timing coverage through the use of on-chip path monitors to be used with FPGA emulation. We demonstrate our technique on a non-trivial SoC, measuring the coverage of 2048 paths (selected as most critical by static timing analysis) achieved by some pre-silicon system-level tests, a number of well-known benchmarks, booting Linux, and executing randomly generated programs. The results show that the technique is feasible, with area and timing overheads acceptable for pre-silicon FPGA emulation.
This thesis presents a new power model, which is capable of modelling the power usage of many different field-programmable gate array (FPGA) architectures. FPGA power models have been developed in the past; however, they were designed for a single, simple architecture with known circuitry. This work explores a method for estimating power usage for many different user-created architectures. This requires a fundamentally new technique. Although the user specifies the functionality of the FPGA architecture, the physical circuitry is not specified. Central to this work is an algorithm which translates these functional descriptions into physical circuits. After this translation to circuit components, standard methods can be used to estimate power dissipation. In addition to enlarged architecture support, this model also provides support for modern FPGA features such as fracturable look-up tables and hard blocks. Compared to past models, this work provides substantially more detailed static power estimations, which is increasingly relevant as CMOS is scaled to smaller technologies. The model is designed to operate with modern CMOS technologies, and is validated against SPICE using 22 nm, 45 nm, and 130 nm technologies. Results show that for common architectures, roughly 73% of power consumption is due to the routing fabric, 21% from logic blocks, and 3% from the clock network. Architectures supporting fracturable look-up tables require 3.5-14% more power, as each logic block has additional I/O pins, increasing both local and global routing resources.
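For context, power models of this kind typically build on the standard dynamic-power relation, P = activity x C x Vdd^2 x f, summed over resources. The sketch below uses invented capacitances and activities purely to show how a routing/logic/clock breakdown like the one reported above is obtained; it is not the model developed in the thesis and covers only dynamic power.

```python
# Minimal sketch of the standard dynamic-power estimate used by such models:
# P_dyn = sum over nodes of (activity * C * Vdd^2 * f). Capacitances and
# activities below are invented numbers purely to illustrate the breakdown
# style of result (e.g. routing dominating logic and clock).

VDD, FREQ = 0.8, 200e6  # volts, hertz (assumed operating point)

def dynamic_power(nodes):
    """nodes: iterable of (switching activity, capacitance in farads)."""
    return sum(a * c * VDD**2 * FREQ for a, c in nodes)

groups = {
    "routing": [(0.10, 25e-15)] * 50000,   # many high-capacitance routing nodes
    "logic":   [(0.12, 6e-15)] * 40000,
    "clock":   [(1.00, 4e-15)] * 2000,     # clock toggles every cycle
}
total = sum(dynamic_power(n) for n in groups.values())
for name, nodes in groups.items():
    print(f"{name}: {100 * dynamic_power(nodes) / total:.0f}% of dynamic power")
```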
Designers constantly strive to improve Field-Programmable Gate Array (FPGA) performance through innovative architecture design. To evaluate performance, an understanding of the effects of modifying logic block structures and routing fabrics on performance is needed. Current architectures are evaluated via computer-aided design (CAD) simulations, which are laborious and computationally expensive experiments to perform. A more scientific method, based on understanding the relationships between architectural parameters and performance, will enable the rapid evaluation of new architectures, even before the development of a CAD tool. This thesis presents an analytical model that describes such relationships and is based principally on Rent's Rule. Specifically, it relates logic architectural parameters to the area efficiency of an FPGA. Comparison to experimental results shows that our model is accurate. This accuracy, combined with the simple form of the model's equations, makes it a powerful tool for FPGA architects to better understand and guide the development of future FPGA architectures.
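As a rough illustration of the Rent's-Rule-based modelling style, the sketch below uses the textbook relation T = t * N^p to estimate the terminal (routing) demand, and hence a crude tile area, for different cluster sizes. The parameter values and the simple area expression are assumptions for illustration, not the model derived in the thesis.

```python
# Toy illustration of an analytical, Rent's-Rule-based area estimate. The
# relation T = t * N^p is the textbook form of Rent's Rule; the parameter
# values and the simple area expression below are illustrative assumptions,
# not the thesis's derived model.

def rent_terminals(n_blocks, t=4.0, p=0.6):
    """External terminals of a region containing n_blocks logic elements."""
    return t * n_blocks ** p

def estimated_tile_area(cluster_size, lut_area=1.0, routing_area_per_terminal=0.5):
    """Crude area of one logic cluster plus the routing its terminals imply."""
    logic = cluster_size * lut_area
    routing = routing_area_per_terminal * rent_terminals(cluster_size)
    return logic + routing

for n in (4, 8, 16):
    print(f"cluster of {n} LUTs -> estimated tile area {estimated_tile_area(n):.1f}")
```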