We aim to create hardware for state-of-the-art applications that combine image processing and machine learning. We are looking at ways to change the programming model to address the challenges involved in lowering to hardware. We use Halide, a data-parallel DSL, to create algorithms and schedule them for different hardware backends. We modify the Halide compiler to extract memories and compute from loop nests. Then, we use our polyhedral analysis tool, Clockwork, to analyze memory operations for scheduling and hardware mapping. Additionally, we are exploring auto-scheduling under resource constraints in Halide and a new IR, Aetherling.
New image and vision applications
Our applications focus on image processing and vision. We have created implementations and schedules of basic image applications, such as 3x3 convolution, Harris corner detector, and a simple camera pipeline. Our application suite continues to grow as we handle programs with upsampling, pyramid computations, and multi-level memory hierarchies. Some of the newer features of the compiler also mean that machine learning applications, such as MobileNet and ResNet, are now feasible. We are also looking to create state-of-the-art applications that combine imaging and machine learning kernels, including HDR-Net.
The Halide compiler takes algorithms and generates scheduled implementations. The original Halide compiler uses scheduling to help explore how to optimally run CPU and GPU implementations. We modified the compiler to also support hardware generation. The accelerator that is produced uses streaming inputs and outputs, and is later mapped to our CGRA using our toolchain.
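The split between an algorithm and its schedule can be illustrated without Halide itself. The sketch below is plain Python, not the Halide API: `blur_algorithm`, `schedule_row_major`, `schedule_tiled`, and `realize` are hypothetical names chosen for illustration. It shows one algorithm evaluated under two different loop orders that produce the same result.

```python
def blur_algorithm(img, x, y):
    """Algorithm: defines WHAT each output pixel is (a 3-wide horizontal sum)."""
    return img[y][x] + img[y][x + 1] + img[y][x + 2]


def schedule_row_major(width, height):
    """Schedule 1: visit output pixels row by row."""
    for y in range(height):
        for x in range(width):
            yield x, y


def schedule_tiled(width, height, tile=2):
    """Schedule 2: visit output pixels tile by tile (tile size is an assumption)."""
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            for y in range(ty, min(ty + tile, height)):
                for x in range(tx, min(tx + tile, width)):
                    yield x, y


def realize(img, schedule):
    """Evaluate the algorithm in the order chosen by the schedule."""
    height = len(img)
    width = len(img[0]) - 2  # the 3-wide window shrinks the output by 2 columns
    out = [[None] * width for _ in range(height)]
    for x, y in schedule(width, height):
        out[y][x] = blur_algorithm(img, x, y)
    return out


image = [[row * 10 + col for col in range(6)] for row in range(4)]
row_major_result = realize(image, schedule_row_major)
tiled_result = realize(image, schedule_tiled)
```

Changing the schedule changes only the order of evaluation, so `row_major_result` and `tiled_result` are equal. In the real compiler the schedule also determines storage and parallelism, which this sketch does not model.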
There are several specific passes for hardware generation. The Halide IR is modified for hardware-specific properties (such as custom buffers, streaming interfaces, and simultaneous computation), and the final output is then separated into computation kernels and memories. The computation kernels are converted into CoreIR, our hardware definition representation. The memories are fed into Clockwork, which performs passes using polyhedral analysis. These passes determine when each memory store and load should occur, and they determine the parameters for all buffers. Our unified buffer abstraction encapsulates both the line buffers used in image processing and the double buffers used in machine learning. In addition, the compute graph is modified to use a set of streaming interfaces between the kernels.
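To make the line-buffer case concrete, here is a minimal hand-written sketch of a streaming 3x3 stencil in plain Python. It is not Clockwork output and does not use the unified buffer interface. The function name `stream_conv3x3` and the buffer sizing are assumptions for illustration. The buffer holds the last two rows plus three pixels, which is the span a 3x3 window covers in raster order.

```python
from collections import deque


def stream_conv3x3(pixels, width, kernel):
    """Consume pixels in raster order and yield one result per full 3x3 window.

    kernel[r][c] weights the pixel at row r, column c of the window,
    where (0, 0) is the window's top-left corner.
    """
    # Two full rows plus three pixels: the oldest entry is the window's top-left.
    line_buffer = deque(maxlen=2 * width + 3)
    for i, pixel in enumerate(pixels):
        line_buffer.append(pixel)  # one store per input pixel
        x, y = i % width, i // width
        if x >= 2 and y >= 2:  # a full window ends at (x, y)
            acc = 0
            for dy in range(3):
                for dx in range(3):
                    # Pixel (x - dx, y - dy) arrived dy * width + dx steps ago.
                    value = line_buffer[-1 - (dy * width + dx)]
                    acc += kernel[2 - dy][2 - dx] * value
            yield acc


WIDTH, HEIGHT = 5, 4
stream = list(range(WIDTH * HEIGHT))  # pixel value equals its raster index
box = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
box_outputs = list(stream_conv3x3(stream, WIDTH, box))
```

Each load offset here (`dy * width + dx`) is a fixed delay relative to the store. A polyhedral analysis can derive store and load timings of this kind automatically rather than by hand.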
Aetherling is a system for compiling image processing pipelines expressed in a standard, data-parallel DSL to streaming hardware accelerators. Aetherling’s auto-scheduler performs design-space exploration to trade off throughput against compute and memory utilization within the constraints of the target hardware. The auto-scheduler can efficiently explore the set of valid schedules by lowering programs into a performance-transparent IR. Since every operator in the IR has an unambiguous implementation in hardware, the resource utilization of any program in the IR can be approximated accurately and much faster than by performing place-and-route. Additionally, the auto-scheduler can quickly cull invalid schedules by using the IR’s type system. Each operator in the IR has a dependent type signature that describes its throughput. Chains of operators with mismatched throughputs do not compile to streaming data accelerators, so the search algorithm can ignore them.
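The culling idea can be sketched in a few lines of plain Python. This is an illustration only: Aetherling's IR encodes throughput in its type system, whereas this sketch checks rates at run time, and the `Op` class, the stage names, and the area numbers are all hypothetical.

```python
from dataclasses import dataclass
from fractions import Fraction
from itertools import product


@dataclass(frozen=True)
class Op:
    name: str
    in_rate: Fraction   # elements consumed per clock
    out_rate: Fraction  # elements produced per clock
    area: int           # made-up resource cost, known without place-and-route


def variants(name, base_area, parallelisms):
    """Build one variant of a map-like operator per parallelism factor."""
    return [Op(f"{name}_x{p}", Fraction(p), Fraction(p), base_area * p)
            for p in parallelisms]


def chain_is_valid(chain):
    """A chain streams only if each producer rate matches its consumer rate."""
    return all(a.out_rate == b.in_rate for a, b in zip(chain, chain[1:]))


def explore(stage_variants, area_budget):
    """Return the highest-throughput valid chain that fits the area budget."""
    best = None
    for chain in product(*stage_variants):
        if not chain_is_valid(chain):
            continue  # culled by the rate check, no costing needed
        if sum(op.area for op in chain) > area_budget:
            continue
        if best is None or chain[-1].out_rate > best[-1].out_rate:
            best = chain
    return best


stages = [variants("blur", 3, [1, 2, 4]), variants("sharpen", 2, [1, 2, 4])]
valid_chains = [c for c in product(*stages) if chain_is_valid(c)]
best_chain = explore(stages, area_budget=12)
```

Of the nine candidate chains, only the three with matching rates survive the check. With the assumed budget of 12, the search selects the variants with parallelism 2.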