Improving performance estimation of convolutional neural network accelerators
18 May 2020
Martin Ferianc, a machine learning researcher, was about to attend the 16th International Symposium on Applied Reconfigurable Computing 2020 to present his paper on FPGA-based Accelerators for Convolutional Neural Networks, when the pandemic started.
Martin's work introduces a novel method for fast and accurate estimation of latency based on a Gaussian process parametrised by an analytic approximation and coupled with runtime data. The conference was postponed until further notice, but we sent Martin a few questions to learn more about his work and wider impact.
Before we jump into talking about your method, how would you explain your research to someone outside of the electronic engineering industry?
Machine learning has recently populated news stories: smart object recognition through our phones’ cameras, facial recognition, autonomous vehicles, even robotic surgeons that can operate on a single grape. None of these tasks could run in real time, on the spot, without some kind of dedicated computer hardware that can quickly process the underlying machine learning algorithms. Nowadays, many are experimenting with field-programmable gate arrays (FPGAs): reconfigurable hardware platforms that are especially well suited to rapidly changing machine learning algorithms that aim for real-time processing. My research tries to predict, given an algorithm and an FPGA type, what their joint performance will be, and thus help both hardware and software engineers adapt the algorithm or the FPGA configuration to increase their joint efficacy. In particular, together with researchers from UCL, Imperial and ETH Zurich, we focused on a class of machine learning algorithms known as convolutional neural networks and their reconfigurable FPGA-based accelerators.
How does this compare with previous approaches and how is your approach novel?
In comparison to the standard heuristic approach, which tries to predict performance purely from information extracted from a datasheet, our method encapsulates that heuristic within a non-parametric Bayesian model called a Gaussian process. We still need to collect some performance measurements, which refine the estimate. The advantage of this method, in comparison to other machine-learning-inspired methods, is that by building the heuristic into the estimation it avoids relying entirely on collected measurements. Additionally, this method does not need any features beyond the ones already used in the standard heuristic estimation.
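The idea can be sketched as Gaussian process regression whose prior mean is the analytic heuristic, so that with no measurements the prediction falls back to the datasheet estimate, and measurements only correct the residual. Everything below (the MAC-count feature, the clock and parallelism figures, the kernel hyperparameters) is an illustrative assumption, not the paper's actual model:

```python
import numpy as np

def heuristic_latency(macs_millions, clock_mhz=200.0, parallel_macs=512):
    """Datasheet-style analytic estimate (illustrative numbers):
    cycles = MACs / parallel units; returns latency in milliseconds."""
    cycles = macs_millions * 1e6 / parallel_macs
    return cycles / (clock_mhz * 1e3)  # clock_mhz * 1e3 cycles per ms

def rbf(a, b, length=1.0, var=0.01):
    """Squared-exponential kernel over a 1-D feature."""
    d = a[:, None] - b[None, :]
    return var * np.exp(-0.5 * (d / length) ** 2)

def gp_predict(x_train, y_train, x_test, noise=1e-6):
    """Posterior mean of a GP whose prior mean is the heuristic.
    Measurements (x_train, y_train) correct the heuristic's residual."""
    m_train = heuristic_latency(x_train)
    m_test = heuristic_latency(x_test)
    K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    alpha = np.linalg.solve(K, y_train - m_train)
    return m_test + rbf(x_test, x_train) @ alpha
```

For example, if every measured latency sits a constant offset above the heuristic, the posterior mean learns that offset from only a handful of measurements, while predictions far from any measurement revert to the analytic estimate.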
What is it about CNNs (over, say, RNNs, GANs, etc.) that makes them a prime fit for an FPGA?
Neural networks have come to dominate the machine learning and deep learning community, in particular in the field of computer vision. Application areas such as autonomous vehicles or medical imaging present interesting use cases for convolutional neural networks (CNNs) in object tracking, instance segmentation, and even depth approximation.
Despite this impressive practical progress, a CNN is a computationally intensive model, and that is its biggest drawback. Extracting features and processing them requires an enormous amount of hardware and electrical power. State-of-the-art CNN research often demonstrates GPU implementations with favourable processing times; however, the power consumption makes them unsuitable for many real-time applications. Taking this into account, an FPGA is a better fit, as it is a low-power device that provides a degree of acceleration similar to a GPU.
Do you have (personal/outside of this research on perf estimation) opinions about performance, scalability, accuracy, etc. of FPGAs for CNNs over GPUs or pure CPU?
On a CPU, CNNs run on hardware designed for general-purpose applications, which, relatively speaking, offers a few very fast processing cores. A GPU offers many more processing units, each slightly slower. Both are designed for a fixed data format, which may not always be optimal for the CNN used.
FPGAs allow for the implementation of exactly the necessary design, meaning that the CNN can make use of optimal processing units. These processing units can be interfaced together more efficiently than on a GPU or CPU, since the interconnect is also purpose-built for the particular CNN. And by allowing the CNN to work with any desired data width, the accuracy can be tuned to the desired level.
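As a toy illustration of the data-width point, here is a minimal sketch of quantising values to an arbitrary signed fixed-point format, the kind of precision choice an FPGA design can make per layer; the bit widths and weight values are made up:

```python
import numpy as np

def quantize(x, bits, frac_bits):
    """Round x to signed fixed-point with `bits` total bits and
    `frac_bits` fractional bits, saturating at the representable range."""
    scale = 2 ** frac_bits
    lo = -(2 ** (bits - 1)) / scale          # most negative code
    hi = (2 ** (bits - 1) - 1) / scale       # most positive code
    return np.clip(np.round(x * scale) / scale, lo, hi)

# Example: an 8-bit format with 6 fractional bits rounds a weight of
# 0.731 to 47/64 = 0.734375; values outside [-2, 1.984375] saturate.
w = np.array([0.731, -0.052, 0.418])
w_q = quantize(w, bits=8, frac_bits=6)
```

Narrowing `bits` shrinks multipliers and memory on the fabric at the cost of larger rounding error, which is exactly the accuracy/resource trade-off being described.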
What does this mean for the industry and ultimately for how we design programmes/hardware in the future?
Hardware and software are going to be more interwoven than ever before. We will constantly think not only about how to accelerate an algorithm through software modifications, but also about how to co-adapt its hardware for an even bigger boost in performance. Co-adapting hardware has traditionally been much more difficult, but given the growing interest in FPGAs and their application as hardware accelerators for machine learning algorithms, that might no longer hold.
What are the next steps for your project?
Our project focused on estimating a single performance factor, latency, and we experimented with a single accelerator. In the future, we would therefore like to target other metrics, for example power consumption, and other types of accelerators than the one on which we performed our experiments. A later step could go further still: not only telling users about the performance of their hardware or algorithm, but also suggesting how to change it so that they can get closer to their desired goal.
Since producing this piece of work, you mentioned that your research interests have changed.
As I mentioned above, it is important not only to say that the performance could potentially be improved, but also to suggest how. Previously, I was mostly working on ways to improve the hardware side; in my PhD, I want to focus on the software side, namely how to improve the design of neural networks themselves. Re-designing a neural network usually requires consulting a specialist for each unique use case. Neural architecture search removes this dependency by searching for the neural network design automatically, through an algorithm. In my PhD with Prof Miguel Rodrigues at UCL, I am focusing on finding new ways to improve these search algorithms so that they are better suited to computer vision applications. With successful search algorithms, neural networks can become more accessible to professionals outside computer science, which can lead to interesting applications for the good of us all.
You can find Martin’s paper and detailed plans here. The paper, titled Improving Performance Estimation for FPGA-based Accelerators for Convolutional Neural Networks, was written in collaboration with Hongxiang Fan, Ringo S. W. Chu, Jakub Stano and Wayne Luk.