
“He would often play his violin in his kitchen late at night, improvising melodies while he pondered complicated problems. Then, suddenly in the middle of playing, he would announce excitedly, ‘I’ve got it!’” - About Albert Einstein

SYSTEMS FOR AI

Condensa : A Programming System for Deep Neural Network Model Compression

Condensa is a framework for programmable model compression in Python. It comes with a set of built-in compression operators which may be used to compose complex compression schemes targeting specific combinations of DNN architecture, hardware platform, and optimization objective. To recover any accuracy lost during compression, Condensa uses a constrained optimization formulation of model compression and employs an Augmented Lagrangian-based algorithm as the optimizer.

Status: Condensa is under active development, and bug reports, pull requests, and other feedback are all highly appreciated. See the contributions section below for more details on how to contribute.

Supported Operators and Schemes

Condensa provides a set of pre-built compression schemes.

These schemes are built using one or more compression operators, which may be combined in various ways to define your own custom schemes.

Please refer to the documentation for a detailed description of available operators and schemes.
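
As a quick illustration, composing operators into a scheme looks roughly like the sketch below; the class names (Compose, Prune, Quantize) and their arguments reflect our reading of the Condensa API, so please check the documentation for the exact signatures and semantics.

import condensa
from condensa import schemes

# Sketch: combine unstructured pruning with FP16 quantization into one scheme.
# The density passed to Prune and the dtype passed to Quantize are illustrative.
prune = schemes.Prune(0.02)                      # keep ~2% of the weights
quantize = schemes.Quantize(condensa.float16)    # quantize weights to FP16
scheme = schemes.Compose([prune, quantize])      # apply pruning, then quantization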

Motivation: Deep neural networks frequently contain far more weights, represented at a higher precision, than are required for the specific task which they are trained to perform. Consequently, they can often be compressed using techniques such as weight pruning and quantization that reduce both model size and inference time without appreciable loss in accuracy.

Compressing models before they are deployed can therefore result in significantly more efficient systems. However, while the results are desirable, finding the best compression strategy for a given neural network, target platform, and optimization objective often requires extensive experimentation. Moreover, finding optimal hyperparameters for a given compression strategy typically results in even more expensive, frequently manual, trial-and-error exploration.

In this project, we introduce a programmable system for model compression called CONDENSA.

Users programmatically compose simple operators, in Python, to build complex compression strategies. Given a strategy and a user-provided objective, such as minimization of running time, CONDENSA uses a novel sample-efficient constrained Bayesian optimization algorithm to automatically infer desirable sparsity ratios.

Our experiments on three real-world image classification and language modeling tasks demonstrate memory footprint reductions of up to 65× and runtime throughput improvements of up to 2.22× using at most 10 samples per search.

We have released a reference implementation of CONDENSA at https://github.com/NVlabs/condensa.

Madonna : A Framework for Energy Measurements and Assistance in Designing Low-Power Deep Neural Networks (Fall 2016)

Goal: How can we automatically adapt a pre-trained deep neural network to a mobile platform, given a resource budget?

Motivation: Many existing algorithms simplify networks based on the number of MACs or weights; however, optimizing these indirect metrics does not necessarily reduce direct metrics such as latency and energy consumption.

Madonna incorporates direct metrics into its adaptation algorithm. These direct metrics are evaluated using empirical measurements, so detailed knowledge of the platform and toolchain is not required. We propose to automatically and progressively simplify a pre-trained network until the resource budget is met, while maximizing accuracy, as sketched below.
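
A high-level Python sketch of this progressive adaptation loop follows. All the callables (measure_latency, the candidate simplifications, short_finetune, evaluate_accuracy) are placeholders for platform- and project-specific code, not Madonna's actual API.

def adapt(model, budget, measure_latency, candidate_simplifications,
          short_finetune, evaluate_accuracy):
    """Progressively simplify `model` until the empirically measured
    resource budget is met, keeping the most accurate candidate each round."""
    while measure_latency(model) > budget:          # direct metric, measured on-device
        best = None
        for simplify in candidate_simplifications:  # e.g. prune one layer a bit further
            candidate = short_finetune(simplify(model))
            accuracy = evaluate_accuracy(candidate)
            if best is None or accuracy > best[0]:
                best = (accuracy, candidate)
        model = best[1]                             # keep the best candidate and repeat
    return model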

Learn-Fast : Resource-Efficient Machine Learning to Effectively Utilize New GPU Architecture Features (Spring 2018)

Motivation : With the growing importance of deep learning and energy-saving approximate computing, half-precision floating-point arithmetic (FP16) is fast gaining popularity. NVIDIA's recent Pascal architecture was the first GPU to offer native FP16 arithmetic support.

Problem: When actual products shipped, programmers soon realized that a naive replacement of single-precision (FP32) code with half precision led to disappointing performance results, even when they were willing to tolerate the increase in error that precision reduction brings.
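
As a quick, purely illustrative Python experiment (not taken from the project), the accuracy side of this trade-off is easy to see: naively accumulating in FP16 can go badly wrong even when every individual value is representable.

import numpy as np

x = np.float16(0.01)
print(np.float32(x))                # ~0.01000214: 0.01 is not exactly representable in FP16

acc16 = np.float16(0.0)
for _ in range(10000):
    acc16 = np.float16(acc16 + x)   # accumulate entirely in FP16
print(acc16)                        # stalls far below the exact value of 100

acc32 = np.float32(0.0)
for _ in range(10000):
    acc32 += np.float32(x)          # accumulate in FP32 instead
print(acc32)                        # close to 100, as expected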

Proposal: Automate the conversion to help users migrate their CUDA code to better exploit Pascal’s half precision capability.

There are two approaches to using half precision on the NVIDIA Pascal P100: using half merely as a storage datatype, or using half as the arithmetic datatype as well. When the half datatype is used for arithmetic, the FPU takes the same amount of time to execute FP16 and FP32 instructions. This approach is simple for code migration, but it fails to take full advantage of the new FPU.

The performance gain is only slightly better than when half is used merely as a storage datatype, because the data is no longer implicitly converted to and from float during each operation.

We noticed that the majority of programs showed only marginal speedups, and most actually slowed down, when using this approach. Real gains require the vectorized half2 (FP16x2) datatype, which packs two FP16 values into each instruction; to use it, a developer needs to refer to the NVIDIA CUDA Math API Reference Manual for Half Precision and carefully rewrite the kernel, as shown below.

Naive FP32 : Processing 100 points on 1 blocks x 1024 threads Takes: 1.80609s

Naive FP16x2 : Processing 100 points on 1 blocks x 1024 threads Takes: 6.97532s

Challenges

Correctness

  • What is the impact of FP precision reduction on the accuracy of the prediction?
  • Manual rewriting is error-prone because the APIs are not user-friendly (similar to MPFR for C++).

Potential Performance Penalty

  • Casting from FP64/32 to FP16, CPU or GPU side
  • Other misc casting or conversions in Kernel code __float2half(0)
  • Correctness Considerations (How much degradation is Ok?)
    • Manual Conversion : Setup correctness measure, and efficient debugging for numerical ops.
    • Automated Conversion : the research tool cuda-half2 (LLVM LibTooling based)
  • Performance Bottlenecks (Is it really possible to get 2x benefits?)
    • Is it always possible to vectorize half data into half2 for 2x performance?
      • shared_data[x] = __hadd2(shared_data[x],shared_data[x + stride]);
      • shared_data[y] = __hadd2(shared_data[x],shared_data[y + stride]);
    • Conversions are penalizing
      • new_sums_x[cluster_index] = __half2float(__high2half(shared_data[x]));
      • new_sums_y[cluster_index] = __half2float(__high2half(shared_data[y]));
      • counts[cluster_index] = __half_as_short(__high2half(shared_data[count]));
      • shared_data[x] = __half2half2(__int2half_rn(0));
      • Floating point constants
    • CUDA Restrictions for FP16x2
      • SFU is FP32 data only
      • atomic* operations on FP32 only
      • Address misalignment reported by cuda-memcheck

Perf: Actual speed-up on Pascal P100; analysis of performance bottlenecks and correctness results (right).

AI FOR SYSTEMS

System Resilience : Machine Learning Techniques For Robust HPC Kernels (Fall 2015 - Fall 2016)

Back Then, Helicorder Drums

Today, It's Stencil Kernels

And we need RTM Algorithms to be robust to errors!

Background : Recorders like the ones at the top (left) operated continuously at the University of Utah from the mid-1970s to March 30, 2009. Each helicorder recorded data transmitted from one or two seismic stations, depending on the number of pens on the helicorder. Vertical ground motion at the station caused the pen to move from side to side. As the drum rotated slowly beneath the pen, a heated tip on the pen traced a record of ground motion at the station onto heat-sensitive paper fastened to the drum. The currently operating digital equivalents of these helicorder records can be seen on the University of Utah Seismograph Stations video wall (right).

Motivation : Today we no longer use helicorder drums like the ones shown above; instead, we use stencil computations, which form the basis of the Reverse Time Migration algorithm in seismic computing. Reverse Time Migration (RTM) has become the standard for high-quality imaging in the seismic industry. RTM relies on PDE solutions using stencils that are 8th order or larger, which require large-scale HPC clusters to meet the computational demands. The underlying mathematical problem is to solve the wave equation using a finite difference method. In our project we compute a 3-D 25-point stencil; the computation contains four nested loops, one over the time steps and one for each spatial dimension.
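
For concreteness, a minimal NumPy sketch of such an update follows; it is not the project's HPC kernel, and the grid spacing, velocity, and time step are illustrative. The coefficients are the standard 8th-order central-difference weights, giving the 25-point stencil (the centre plus four neighbours in each direction along each axis).

import numpy as np

# 8th-order central-difference coefficients for the second derivative.
C = np.array([-205.0 / 72, 8.0 / 5, -1.0 / 5, 8.0 / 315, -1.0 / 560])

def laplacian_25pt(u, h=1.0):
    """8th-order, 25-point Laplacian of a 3-D field u (interior points only)."""
    lap = 3 * C[0] * u[4:-4, 4:-4, 4:-4]
    for k in range(1, 5):
        lap += C[k] * (u[4 + k:u.shape[0] - 4 + k, 4:-4, 4:-4] + u[4 - k:-4 - k, 4:-4, 4:-4])
        lap += C[k] * (u[4:-4, 4 + k:u.shape[1] - 4 + k, 4:-4] + u[4:-4, 4 - k:-4 - k, 4:-4])
        lap += C[k] * (u[4:-4, 4:-4, 4 + k:u.shape[2] - 4 + k] + u[4:-4, 4:-4, 4 - k:-4 - k])
    return lap / h ** 2

def step(u_prev, u_curr, v=1.0, dt=1e-3, h=1.0):
    """One explicit finite-difference time step of the scalar wave equation."""
    u_next = u_curr.copy()
    u_next[4:-4, 4:-4, 4:-4] = (2 * u_curr[4:-4, 4:-4, 4:-4]
                                - u_prev[4:-4, 4:-4, 4:-4]
                                + (v * dt) ** 2 * laplacian_25pt(u_curr, h))
    return u_next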

Problem : In electronics and computing, an error is a signal or datum that is wrong. A hard error is caused by a defect, usually understood to be either a mistake in design or construction or a broken component. A soft error is also a wrong signal or datum, but it does not imply such a mistake or breakage: after observing a soft error, there is no implication that the system is any less reliable than before. One cause of soft errors is single-event upsets from cosmic rays.

In a computer's memory system, a soft error changes an instruction in a program or a data value. Soft errors typically can be remedied by cold booting the computer. A soft error will not damage a system's hardware; the only damage is to the data that is being processed.

Proposal: The idea is to use machine learning techniques to train a cost-effective regression model that predicts the output of the target stencil kernel given its input. The model will be trained on values observed in real stencil executions and will declare an error when its predictions significantly disagree with the value computed by the stencil, as in the sketch below.
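
A minimal sketch of this idea, using a linear model from scikit-learn as the stand-in regressor, might look like the following; the feature extraction, the model choice, and the tolerance are illustrative placeholders rather than the project's tuned configuration.

import numpy as np
from sklearn.linear_model import Ridge

def neighbourhood_features(u, i, j, k):
    """Flatten the 25-point neighbourhood of u[i, j, k] into a feature vector."""
    feats = [u[i, j, k]]
    for d in range(1, 5):
        feats += [u[i + d, j, k], u[i - d, j, k],
                  u[i, j + d, k], u[i, j - d, k],
                  u[i, j, k + d], u[i, j, k - d]]
    return np.array(feats)

# X holds neighbourhood features collected from fault-free stencil runs, and
# y holds the outputs the real kernel computed for those neighbourhoods.
model = Ridge()
# model.fit(X, y)

def suspect_soft_error(model, u, i, j, k, computed, tol=1e-3):
    """Flag a potential soft error when the surrogate's prediction and the
    value computed by the stencil kernel disagree by more than `tol`."""
    predicted = model.predict(neighbourhood_features(u, i, j, k)[None, :])[0]
    return abs(predicted - computed) > tol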

Comprender : Understanding Floating Point Errors (Fall 2019)

"floats" have a serious drawback: They are inaccurate.

Anybody can try this out on a pocket calculator: Punch in ⅓ and you get 0.333333.

You wonder, of course, how close an approximation this is. Now multiply by 3. Most likely you will see 0.999999 and not 1.0. If the result is 1.0, subtract 1.0 from it, which will probably give you something like −1E−10. This is a perfectly simple example—why can't computers get this right?
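
The calculator example above is decimal; binary floating point, which computers actually use, shows the same kind of surprise. A quick (and purely illustrative) Python session makes the point:

>>> 0.1 + 0.2
0.30000000000000004
>>> 0.1 + 0.2 == 0.3
False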

Motivation : Avoiding the badlands. The main task in debugging floating-point math is figuring out when we have wandered into the badlands with an approximation, and then rewriting parts of our code so we don't do that. Formulas and plots of the badlands for common 64-bit floating-point functions are shown above.

"God created the integers, all else is the work of man." -- Leopold Kronecker

Research question: Can we train DNNs to learn such functions? These trained DNNs could then be used to track error propagation in important scientific computation kernels, for example to detect soft errors in CPU or GPU systems.
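
One hypothetical instantiation of this question, purely as a sketch: train a small network to predict the error that a float32 evaluation of a cancellation-prone expression incurs relative to a float64 reference. The expression, network size, and training schedule below are illustrative choices only.

import numpy as np
import torch
import torch.nn as nn

# Ground truth: the error of evaluating f(x) = x*x - 1 in float32 near x = 1,
# a classic cancellation-prone "badlands" region, relative to float64.
x64 = np.random.uniform(0.999, 1.001, size=(100_000, 1))
reference = x64 * x64 - 1.0
approx = x64.astype(np.float32) ** 2 - np.float32(1.0)
error = (reference - approx.astype(np.float64)) * 1e7    # rescale the tiny targets

X = torch.tensor(x64, dtype=torch.float32)
y = torch.tensor(error, dtype=torch.float32)

# Small MLP regressor that learns x -> error(x).
model = nn.Sequential(nn.Linear(1, 64), nn.ReLU(),
                      nn.Linear(64, 64), nn.ReLU(),
                      nn.Linear(64, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for _ in range(200):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(X), y)
    loss.backward()
    optimizer.step()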

Reference: https://www.cs.umd.edu/~ntoronto/papers/toronto-2014cise-floating-point.pdf

ROBUST AI

"If we focus on endowing machines with common sense and deep understanding, rather than simply focusing on statistical analysis and gathering ever larger collections of data, we will be able to create an AI we can trust—in our homes, our cars, and our doctors' offices." - Gary Marcus

A cat

A guacamole!

OK, what is at stake here?

Despite the hype surrounding AI, creating an intelligence that rivals or exceeds human levels is far more complicated than we have been led to believe. Researchers have spent their careers at the forefront of AI research and have witnessed some of the greatest milestones in the field, but we argue that a computer beating a human in Jeopardy! does not signal that we are on the doorstep of fully autonomous cars or superintelligent machines.

The achievements in the field thus far have occurred in closed systems with fixed sets of rules, and these approaches are too narrow to achieve genuine intelligence.

The real world, in contrast, is wildly complex and open-ended.

  • How can we bridge this gap?
  • What will the consequences be when we do?

Active Project (Spring 2020)

Taking inspiration from the human mind, we want to explain what we need to advance AI to the next level, and we suggest that if we are wise along the way, we won't need to worry about a future of machine overlords.

AI FOR SOCIAL GOOD

"Every 2 minutes, a child dies of malaria. And each year, more than 200 million new cases of the disease are reported. Although countries have dramatically reduced the total number of malaria cases and deaths since 2000, progress in recent years has stalled. Worryingly, in some countries, malaria is on the rise" -- WHO (2020)

Nearly half the world’s population is at risk from malaria and there are over 200 million malaria cases and approximately 400,000 deaths due to malaria every year. This gives us all the more motivation to make malaria detection and diagnosis fast, easy and effective.

In these projects, we focus on how Artificial Intelligence (AI), coupled with the popular open-source tools, technologies, and frameworks we develop, can be used for the development and betterment of our society.

Project CINCHONA: Using Condensa-Compressed DNNs for Malaria Detection in Resource-Constrained Regions. Active Project (Spring 2020)

Malaria is deadly: it kills children, drains money, and strains family budgets.

A simple net! Or a neural net?!

An insecticide-treated mosquito net (above). For the Condensa-compressed MosquitoNet.pth, we will improve inference throughput via optimal filter pruning, as sketched after the model definition below.

class MosquitoNet(nn.Module):
    
    def __init__(self):
        super(MosquitoNet, self).__init__()
        
        self.layer1 = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        ...
        ...
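
Below is a hedged sketch (not a tested recipe) of how we expect to drive Condensa's filter pruning over MosquitoNet. The scheme, optimizer, and Compressor names follow the Condensa project, but the density, the LC arguments, and the data loaders are placeholders that should be checked against the Condensa documentation.

import torch
import torch.nn as nn
import condensa
from condensa import schemes

# Placeholder density: keep roughly half of the filters in each pruned layer.
scheme = schemes.FilterPrune(0.5)

# Augmented Lagrangian-based (LC) optimizer; the arguments here are illustrative.
lc = condensa.opt.LC(steps=35, lr=0.01)

# trainloader / testloader / valloader stand in for the MosquitoNet data loaders.
compressor = condensa.Compressor(lc, scheme, MosquitoNet(),
                                 trainloader, testloader, valloader,
                                 nn.CrossEntropyLoss())
compressed = compressor.run()
torch.save(compressed.state_dict(), 'MosquitoNet_compressed.pth')
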
There is a correlation between areas of poverty and areas affected by malaria.

Challenges

Infected cells that were misclassified (false negatives) by an interpretable model. Notice that the middle cell (below) seems to contain a poorly stained parasite at the bottom right corner, making it difficult to classify correctly.

Artificial intelligence combined with open source tools improves diagnosis of the fatal disease malaria.

Data-prep, Train, Deploy, Repeat.