"It's Much More Interesting to Live not Knowing than to "have Answers" which might be Wrong" - Richard Feynman
Model compression is a ubiquitous tool that brings the power of modern deep learning to edge devices with power and latency constraints. The goal of model compression is to take a large reference neural network and output a smaller and less expensive compressed network that is functionally equivalent to the reference. Compression typically involves pruning and/or quantization, followed by retraining to maintain the reference accuracy. However, it has been observed that compression can lead to a considerable mismatch in the labels produced by the reference and the compressed models, resulting in bias and unreliability. To combat this, we present a framework that uses a teacher-student learning paradigm to better preserve labels. We investigate the role of additional terms to the loss function and show how to automatically tune the associated parameters. We demonstrate the effectiveness of our approach both quantitatively and qualitatively on multiple compression schemes and accuracy recovery algorithms using a set of 8 different real-world network architectures. We obtain a significant reduction of up to 4.1× in the number of mismatches between the compressed and reference models, and up to 5.7× in cases where the reference model makes the correct prediction.
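The two mismatch counts quoted in the abstract above can be made concrete with a small sketch. It assumes predictions are available as plain label sequences; the function name `label_mismatches` and its arguments are hypothetical and not taken from the paper.

```python
from typing import Sequence, Tuple


def label_mismatches(
    reference: Sequence[int],
    compressed: Sequence[int],
    truth: Sequence[int],
) -> Tuple[int, int]:
    """Count label disagreements between a reference and a compressed model.

    Returns (total, ref_correct), where `total` is the number of inputs on
    which the two models predict different labels and `ref_correct` is the
    subset of those on which the reference prediction matches the truth.
    """
    if not (len(reference) == len(compressed) == len(truth)):
        raise ValueError("all label sequences must have the same length")
    total = 0
    ref_correct = 0
    for ref, comp, true in zip(reference, compressed, truth):
        if ref != comp:
            total += 1
            if ref == true:
                ref_correct += 1
    return total, ref_correct
```

Two compressed models can have the same top-1 accuracy and still differ widely in these counts, which is why mismatches are tracked separately from accuracy.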
Deep neural networks frequently contain far more weights, represented at a higher precision, than are required for the specific task which they are trained to perform. Consequently, they can often be compressed using techniques such as weight pruning and quantization that reduce both model size and inference time without appreciable loss in accuracy. Compressing models before they are deployed can therefore result in significantly more efficient systems. However, while the results are desirable, finding the best compression strategy for a given neural network, target platform, and optimization objective often requires extensive experimentation. Moreover, finding optimal hyperparameters for a given compression strategy typically results in even more expensive, frequently manual, trial-and-error exploration. In this paper, we introduce a programmable system for model compression called CONDENSA. Users programmatically compose simple operators, in Python, to build complex compression strategies. Given a strategy and a user-provided objective, such as minimization of running time, CONDENSA uses a novel sample-efficient constrained Bayesian optimization algorithm to automatically infer desirable sparsity ratios. Our experiments on three real-world image classification and language modeling tasks demonstrate memory footprint reductions of up to 65× and runtime throughput improvements of up to 2.22x using at most 10 samples per search. We have released a reference implementation of CONDENSA at https://github.com/NVlabs/condensa.
Under review SysML2020
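As a rough illustration of what a sparsity ratio means in weight pruning, the sketch below zeroes the smallest-magnitude fraction of a flat list of weights. It is a generic magnitude-pruning example in plain Python: the name `magnitude_prune` is hypothetical, and this is neither CONDENSA's API nor its Bayesian search.

```python
from typing import List, Sequence


def magnitude_prune(weights: Sequence[float], sparsity: float) -> List[float]:
    """Zero out the `sparsity` fraction of weights with the smallest magnitude.

    `sparsity` is a ratio in [0, 1]; 0.5 means half of the weights become 0.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must lie in [0, 1]")
    n_drop = int(len(weights) * sparsity)  # number of weights to zero
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:n_drop])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]
```

A system that searches over sparsity ratios would call a routine like this repeatedly with different ratios and measure the resulting accuracy and cost.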
A new area is emerging at the intersection of artificial intelligence, machine learning, and systems design. This birth is driven by the explosive growth of diverse applications of ML in production, the continued growth in data volume, and the complexity of large-scale learning systems. The goal of this workshop is to bring together experts working at the crossroads of machine learning, system design and software engineering to explore the challenges faced when building practical large-scale ML systems. In particular, we aim to elicit new connections among these diverse fields, and identify tools, best practices and design principles. We also want to think about how to do research in this area and properly evaluate it. The workshop will cover ML and AI platforms and algorithm toolkits, as well as dive into machine learning-focused developments in distributed learning platforms, programming languages, data structures, GPU processing, and other topics.
NeurIPS 2019: Workshop on Systems for Machine Learning
Message Scheduling for Performant, Many-Core Belief Propagation
Belief Propagation (BP) is a message-passing algorithm for approximate inference over Probabilistic Graphical Models (PGMs), with applications in areas such as computer vision, error-correcting codes, and protein folding. While the algorithm is general, its convergence behavior and speed have limited its practical use on difficult inference problems. Because BP is highly amenable to parallelization, many-core Graphics Processing Units (GPUs) could significantly improve its performance. Improving BP through many-core systems is non-trivial, however: the scheduling of messages in the algorithm strongly affects performance. We present a study of message scheduling for BP on GPUs. We demonstrate that BP exhibits a tradeoff between speed and convergence based on parallelism and show that existing message schedulings are not able to exploit this tradeoff. To this end, we present a novel randomized message scheduling approach, Randomized BP (RnBP), which outperforms existing methods on the GPU.
Index Terms—General Purpose GPU Computing, Randomized Algorithms, Message-Passing Algorithms.
HPDC - Finalist Best Student Paper award
GTC 2018 - Poster
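To make the idea of a message schedule in the Belief Propagation abstract above concrete, here is a minimal sum-product sketch in which each round updates only a random subset of the directed messages. The selection rule (each message chosen independently with probability `p`) is an assumption made for illustration; the abstract does not spell out RnBP's actual rule, and the function name and data layout are hypothetical.

```python
import random


def randomized_bp(unary, pairwise, edges, p=0.5, rounds=200, seed=0):
    """Sum-product BP on a pairwise model with a random-subset schedule.

    unary:    dict node -> list of positive potentials, one per state
    pairwise: dict (i, j) -> matrix with pairwise[(i, j)][xi][xj] > 0
    edges:    list of undirected edges (i, j), each a key of `pairwise`
    Returns a dict node -> normalized belief (approximate marginal).
    """
    rng = random.Random(seed)
    n_states = {v: len(phi) for v, phi in unary.items()}
    neighbors = {v: [] for v in unary}
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)

    def psi(i, j, xi, xj):
        # Look up the edge potential regardless of edge orientation.
        if (i, j) in pairwise:
            return pairwise[(i, j)][xi][xj]
        return pairwise[(j, i)][xj][xi]

    # msgs[(i, j)] is the message from i to j, initialized uniform.
    msgs = {}
    for i, j in edges:
        msgs[(i, j)] = [1.0 / n_states[j]] * n_states[j]
        msgs[(j, i)] = [1.0 / n_states[i]] * n_states[i]

    for _ in range(rounds):
        # Schedule: pick each directed message with probability p.
        chosen = [e for e in msgs if rng.random() < p]
        new = {}
        for i, j in chosen:
            out = []
            for xj in range(n_states[j]):
                total = 0.0
                for xi in range(n_states[i]):
                    term = unary[i][xi] * psi(i, j, xi, xj)
                    for k in neighbors[i]:
                        if k != j:
                            term *= msgs[(k, i)][xi]
                    total += term
                out.append(total)
            z = sum(out)
            new[(i, j)] = [v / z for v in out]
        msgs.update(new)  # chosen messages are updated in parallel

    beliefs = {}
    for v in unary:
        b = list(unary[v])
        for k in neighbors[v]:
            b = [bx * m for bx, m in zip(b, msgs[(k, v)])]
        z = sum(b)
        beliefs[v] = [bx / z for bx in b]
    return beliefs
```

With `p = 1.0` every message is updated in every round, which is the fully parallel schedule. Lowering `p` does less work per round, and that is the kind of speed versus convergence knob the abstract refers to.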
Predicting Reproducibility of Floating-Point Code
Results of programs that perform floating-point arithmetic can vary depending on the platform, the compiler, and its flags. In this paper, we provide a novel model-based machine learning approach to predict floating-point reproducibility. The binary instructions and other related features are modeled as a linear-chain conditional Markov random field (CRF). The reproducibility of the binary is broken down to instruction-level granularity and modeled as latent variables. As it is not tractable to label them, we use Latent Structure Support Vector Machines (LS-SVM) as the learning algorithm. Inference is done efficiently by iterating through a relevant subset of the possible assignments to the latent variables. The results in this paper are meant to inform communities that seek higher performance through IEEE-unsafe optimizations but want an estimate of result reproducibility, as well as compiler developers who choose efficient transformations for higher performance.
CONCEPTS Floating Point Arithmetic, Compiler Transformations, Model-based Machine Learning, Structured Prediction
KEYWORDS High-Performance Computing, Programming Environment, Reproducibility, Compiler Flags.
MADONNA: A Framework for Energy Measurements and Assistance in designing Low Power Deep Neural Networks.
The recent success of Deep Neural Networks (DNNs) in classification tasks has sparked a trend of accelerating their execution with specialized hardware. Since tuned designs easily give an order of magnitude improvement over general-purpose hardware, many architects look beyond an MVP implementation. This project presents Madonna v1.0, a direction toward an automated co-design approach across numerical precision to optimize DNN hardware accelerators. Compared to a 64-bit floating-point accelerator baseline, we show that 32-bit floating-point accelerators reduce energy by 1.5, improve training time by 1.22x, and give an observable improvement in inference as well. Across three datasets, these power and energy measurements show a collective average of 0.5W power reduction and 2x energy reduction over the accelerator baseline, with almost no compromise in DNN model accuracy. Madonna enables accurate, low-power DNN accelerators, making it feasible to deploy DNNs in power-constrained IoT and mobile devices.
Nvidia’s Volta architecture is a second generation GPU that offers FP16 support. However, when actual products shipped, CUDA programmers realized that a naive replacement of float with half leads to disappointing results, even for error-prone (approximation-based) GPU algorithms. In this project we empirically study the impact of reducing the floating-point precision of k-means, an unsupervised learning algorithm, on a synthetic data set.
We successfully converted this benchmark from single precision arithmetic to its half precision equivalent and achieved an improvement in execution time for datasets of varying sizes.
We also discuss some new issues and opportunities that Volta GPUs provide for floating point users.
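One well-known way a naive float-to-half replacement can go wrong is accumulation: sums such as the ones k-means uses to update centroids stop growing once the running total is large compared with the addends. The sketch below emulates IEEE 754 half precision with Python's `struct` format `'e'`. It illustrates this general effect only and is not a claim about what the project measured; the helper names `to_half` and `half_sum` are hypothetical.

```python
import struct
from typing import Iterable


def to_half(x: float) -> float:
    """Round x to the nearest IEEE 754 binary16 value.

    The result is returned as an ordinary Python float. Values too large
    for half precision raise OverflowError from struct.pack.
    """
    return struct.unpack("<e", struct.pack("<e", x))[0]


def half_sum(values: Iterable[float]) -> float:
    """Sum values while rounding the accumulator to half after every add."""
    acc = 0.0
    for v in values:
        acc = to_half(acc + to_half(v))
    return acc


if __name__ == "__main__":
    ones = [1.0] * 3000
    print("exact sum:", sum(ones))
    print("half-precision accumulator:", half_sum(ones))
```

Half precision has an 11-bit significand, so above 2048 consecutive integers are no longer representable and adding 1.0 to the accumulator has no effect. Keeping the accumulator in single precision while storing the data in half is a common way around this.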