NVIDIA Investor Presentation Deck


Creator: NVIDIA
Category: Technology
Published: December 2020

Transcriptions

#1 NVIDIA

#2 Accelerating Intelligence
Bill Dally, Chief Scientist and Senior Vice President of Research
GTC China, December 14, 2020

#3 SAFE HARBOR
Forward-Looking Statements
Except for the historical information contained herein, certain matters in this presentation including, but not limited to, statements as to: the performance, benefits, abilities and impact of our products and technology; the availability of our products and technology; our partnerships and customers; model complexity; the number of users of internet technologies and their markets; our markets; and our growth and growth drivers are forward-looking statements within the meaning of the Private Securities Litigation Reform Act of 1995. These forward-looking statements and any other forward-looking statements that go beyond historical facts that are made in this presentation are subject to risks and uncertainties that may cause actual results to differ materially. Important factors that could cause actual results to differ materially include: global economic conditions; our reliance on third parties to manufacture, assemble, package and test our products; the impact of technological development and competition; development of new products and technologies or enhancements to our existing product and technologies; market acceptance of our products or our partners' products; design, manufacturing or software defects; changes in consumer preferences and demands; changes in industry standards and interfaces; unexpected loss of performance of our products or technologies when integrated into systems and other factors. NVIDIA has based these forward-looking statements largely on its current expectations and projections about future events and trends that it believes may affect its financial condition, results of operations, business strategy, short-term and long-term business operations and objectives, and financial needs. These forward-looking statements are subject to a number of risks and uncertainties, and you should not rely upon the forward-looking statements as predictions of future events. The future events and trends discussed in this presentation may not occur and actual results could differ materially and adversely from those anticipated or implied in the forward-looking statements. Although NVIDIA believes that the expectations reflected in the forward-looking statements are reasonable, the company cannot guarantee that future results, levels of activity, performance, achievements or events and circumstances reflected in the forward-looking statements will occur. Except as required by law, NVIDIA disclaims any obligation to update these forward-looking statements to reflect future events or circumstances. For a complete discussion of factors that could materially affect our financial results and operations, please refer to the reports we file from time to time with the SEC, including our Annual Report on Form 10-K and quarterly reports on Form 10-Q. Copies of reports we file with the SEC are posted on our website and are available from NVIDIA without charge.
#4 CUDA LIBRARIES: CUDA, cuDNN, cuBLAS, cuSPARSE, cuTENSOR, cuSOLVER, cuFFT, cuRAND, AmgX, NVSHMEM, NCCL, TensorRT
Platforms: GRAPHICS, IVA, AI, DRIVE, CLARA, ROBOTICS

#5 NVIDIA A100
NVIDIA Ampere Architecture: world's largest 7 nm chip, 54B transistors, HBM2.
New sparsity acceleration: harness sparsity in AI models for 2x AI performance.
3rd-gen Tensor Cores: faster, flexible, easier to use; 20x AI performance with TF32.
New Multi-Instance GPU: optimal utilization with right-sized GPUs; up to 7 simultaneous instances per GPU.
3rd-gen NVLink and NVSwitch: efficient scaling to enable a super GPU; 2x more bandwidth.

#6 NVIDIA A100 Detailed Specs
Transistor count: 54 billion
Die size: 826 mm²
FP64 CUDA cores: 3,456
FP32 CUDA cores: 6,912
Tensor cores: 432
Streaming multiprocessors: 108
FP64: 9.7 TeraFLOPS
FP64 Tensor Core: 19.5 TeraFLOPS
FP32: 19.5 TeraFLOPS
TF32 Tensor Core: 156 TeraFLOPS | 312 TeraFLOPS*
BFLOAT16 Tensor Core: 312 TeraFLOPS | 624 TeraFLOPS*
FP16 Tensor Core: 312 TeraFLOPS | 624 TeraFLOPS*
INT8 Tensor Core: 624 TOPS | 1,248 TOPS*
INT4 Tensor Core: 1,248 TOPS | 2,496 TOPS*
GPU memory: 40 GB
Interconnect: NVLink 600 GB/s | PCIe Gen4 64 GB/s
Multi-Instance GPU: various instance sizes, up to 7 MIGs at 5 GB
Form factor: 4/8/16 SXM GPUs in HGX A100
Max power: 400 W (SXM)
* Includes sparsity

#7 Structural Sparsity Brings Additional Speedups
Dense matrix → sparse matrix: 2x faster execution on the A100 Tensor Core.
Structured sparsity: half the values are zero, so half of the compute and memory fetches can be skipped, for up to a 2x compute rate vs. non-sparse.
BERT Large inference: 1x on A100 vs. 1.5x on A100 with sparsity. (Precision INT8 with and without sparsity; batch sizes: 256 without sparsity, 49 with sparsity; A100 with 7 MIGs.)
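To make the structured-sparsity pattern on slide #7 concrete, here is a minimal NumPy sketch (illustrative only, not NVIDIA's tooling; in practice libraries such as TensorRT or the APEX automatic-sparsity utilities handle this) that keeps the two largest-magnitude weights in every group of four and zeroes the rest, which is the 2:4 structure the A100 Tensor Cores exploit to skip half the math and weight fetches:

    import numpy as np

    def prune_2_of_4(weights):
        """Zero the 2 smallest-magnitude values in every group of 4 weights."""
        w = weights.reshape(-1, 4).copy()
        drop = np.argsort(np.abs(w), axis=1)[:, :2]   # 2 smallest per group
        np.put_along_axis(w, drop, 0.0, axis=1)
        return w.reshape(weights.shape)

    w = np.random.randn(8, 16).astype(np.float32)
    w_sparse = prune_2_of_4(w)
    print("fraction of zero weights:", np.mean(w_sparse == 0.0))   # 0.5

In a real workflow the pruned model is typically fine-tuned afterward to recover accuracy before the sparse Tensor Core path is used.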
#8 DGX A100: The Universal AI System
Data analytics, training, and inference in one system.
8x NVIDIA A100 GPUs; 6x NVIDIA NVSwitches with 4.8 TB/s bi-directional bandwidth and 600 GB/s GPU-to-GPU bandwidth; 9x Mellanox ConnectX-6 VPI 200 Gb/s network interfaces; dual 64-core AMD Rome CPUs with 1 TB RAM; 15 TB Gen4 NVMe SSD.

#9 Rack-Scale Infrastructure: Building an AI Center of Excellence with DGX POD
DGX POD, built on DGX A100, is more attainable than ever. Get a faster start building flexible AI infrastructure with proven architectures and leading storage partners. Up to 40 PetaFLOPS of computing power in just 2 racks, and 700 PetaFLOPS to train the previously impossible. 4-node and 8-node DGX POD configurations. Complete AI infrastructure solutions: DGX, storage, networking, services, software.

#10 The Selene Supercomputer: NVIDIA's DGX SuperPOD Deployment
#5 on TOP500 (63 PetaFLOPS HPL @ 24 GF/W); #5 on Green500 (20.5 GigaFLOPS/watt). Fastest industrial system in the U.S., with 1+ ExaFLOPS of AI. One of the fastest and most efficient supercomputers on the planet, built in under a month.

#11 TOP500 Highlights
#5 on TOP500: Selene. 8 of the top 10 systems: NVIDIA + Mellanox. #1 in the USA: Summit. #5 on Green500: Selene. #1 in academia: Frontera. #1 in China: TaihuLight. #1 on Green500: a DGX SuperPOD. #1 industrial system: Selene. #1 in Europe: Jülich.

#12 Specialized Instructions Amortize Overhead
Operation | Energy¹ (pJ) | Overhead²
HFMA | 1.5 | 2000%
HDP4A | 6.0 | 500%
HMMA | 110 | 22%
IMMA | 160 | 16%
1 - Energy numbers are from a 45 nm process.
2 - Overhead is instruction fetch, decode, and operand fetch: ~30 pJ.
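As a back-of-envelope check on slide #12, the sketch below divides the footnote's ~30 pJ of instruction fetch, decode, and operand-fetch overhead by each operation's math energy. The larger the matrix operation an instruction performs, the smaller the relative overhead, which is the amortization argument the slide makes; the ratios for HMMA and IMMA come out a little above the slide's 22% and 16%, suggesting a slightly different accounting, but the trend is the same.

    # Energy per operation (pJ, 45 nm numbers from the slide) vs. a fixed
    # ~30 pJ of instruction fetch, decode, and operand fetch.
    ops = {"HFMA": 1.5, "HDP4A": 6.0, "HMMA": 110.0, "IMMA": 160.0}
    overhead_pj = 30.0

    for name, energy_pj in ops.items():
        ratio = overhead_pj / energy_pj
        print(f"{name:6s} math {energy_pj:6.1f} pJ   overhead {ratio:7.1%}")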
#13 Single-chip Inference Performance: 317x in 8 Years
Chart (Int8 TOPS, April 2012 to June 2020): K20X 3.94, M40 6.84, P100 21.2, V100 125, Q8000 261, A100 1,248.

#14 MLPerf Training Benchmarks: Relative Speedup
Commercially available solutions, per-chip speedup over V100 (X = no result submitted). ResNet-50 v1.5 image classification: Huawei Ascend 1.2x, TPUv3 0.7x, V100 1x, A100 1.5x. BERT NLP: TPUv3 0.9x, V100 1x, A100 1.6x. A100 over V100 on the remaining benchmarks: Mask R-CNN (heavy-weight object detection) 1.9x, MiniGo (reinforcement learning) 2x, SSD (light-weight object detection) 2x, GNMT (recurrent translation) 2.4x, Transformer (non-recurrent translation) 2.4x, DLRM (recommendation) 2.5x.
Per-chip performance arrived at by comparing performance at the same scale when possible and normalizing it to a single chip. 8-chip scale: V100 and A100 for Mask R-CNN, MiniGo, SSD, GNMT, Transformer. 16-chip scale: V100, A100, TPUv3 for ResNet-50 v1.5 and BERT. 512-chip scale: Huawei Ascend 910 for ResNet-50. DLRM compared 8x A100 and 16x V100. Submission IDs: ResNet-50 v1.5: 0.7-3, 0.7-1, 0.7-44, 0.7-18, 0.7-21, 0.7-15; BERT: 0.7-1, 0.7-45, 0.7-22; Mask R-CNN: 0.7-40, 0.7-19; MiniGo: 0.7-41, 0.7-20; SSD: 0.7-40, 0.7-19; GNMT: 0.7-40, 0.7-19; Transformer: 0.7-40, 0.7-19; DLRM: 0.7-43, 0.7-17. MLPerf name and logo are trademarks. See www.mlperf.org for more information.

#15 NVIDIA Tops MLPerf Data Center Inference Benchmarks
A100 is up to 237x faster than the CPU. Offline and Server scenarios, per-accelerator performance normalized to T4, comparing Xilinx U250 (available), NVIDIA T4 (available), Intel Cooper Lake (preview), and NVIDIA A100 (available) across medical imaging (3D U-Net), image classification (ResNet-50), speech recognition (RNN-T), object detection (SSD-Large), recommendation (DLRM), and NLP (BERT). X = no result submitted.
MLPerf v0.7 Inference Closed; per-accelerator performance derived from the best MLPerf results for the respective submissions using reported accelerator count in Data Center Offline and Server. 3D U-Net 99.9%: 0.7-125, 0.7-113, 0.7-111; ResNet-50: 0.7-119, 0.7-124, 0.7-113, 0.7-111; SSD-Large: 0.7-123, 0.7-113, 0.7-111; DLRM 99.9%: 0.7-126, 0.7-113, 0.7-111; RNN-T, BERT 99.9%: 0.7-111, 0.7-113. MLPerf name and logo are trademarks. See www.mlperf.org for more information.

#16 NVIDIA Tops MLPerf Edge Inference Benchmarks
Leadership performance for edge servers with T4 and A100, and for edge devices with Jetson. Single-Stream and Multi-Stream scenarios, per-accelerator performance normalized to Jetson AGX Xavier, comparing NVIDIA Jetson AGX Xavier, NVIDIA T4, NVIDIA A100-PCIe (and Centaur, Single-Stream only) across object detection (SSD-Small, SSD-Large), image classification (ResNet-50), speech recognition (RNN-T), medical imaging (3D U-Net), and NLP (BERT). X = no result submitted.
MLPerf v0.7 Inference Closed; Edge Single-Stream derived from reported latencies and Multi-Stream. SSD-Small, ResNet-50 (Single-Stream): 0.7-131, 0.7-152, 0.7-153, 0.7-146; RNN-T, 3D U-Net 99.9%, SSD-Large, BERT 99% (Single-Stream): 0.7-152, 0.7-153, 0.7-146; SSD-Small, ResNet-50, SSD-Large (Multi-Stream): 0.7-152, 0.7-153, 0.7-146. MLPerf name and logo are trademarks. See www.mlperf.org for more information.

#17 (image only)

#18 Demo scene, RTXDI OFF.

#19 Demo scene, RTXDI ON: 66 light primitives, with IES profiles.

#20 RTXGI ON vs. RTXGI OFF comparison.

#21 RTXGI ON (lighting only) vs. RTXGI OFF comparison.

#22 NVIDIA DLSS 2.0: A New Era of AI-powered Computer Graphics
Ray tracing at 1440p upscaled to 4K, compared against supercomputer-rendered 16K ground truth.

#23 Side-by-side comparison: Native 4K, 108 vs. DLSS 4K, 141.

#24 Zoomed comparison: Native 4K, 108 vs. DLSS 4K, 141.

#25 Real-time Path Tracing: Opportunity for New Fixed-function Hardware
Roadmap: many-light sampling; efficient long specular paths through glass, mirrors, etc.; adaptive sampling; BSDF sampling or path guiding; random numbers; terminating paths into a lighting approximation (RTXGI, neural light cache, PSTF, ...); DL denoising.
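The "terminate paths into a lighting approximation" item on slide #25 can be illustrated with a generic toy path tracer; this is a sketch of the general technique under stated assumptions, not NVIDIA's hardware or RTXGI itself. After a few bounces the rest of the light transport is replaced by a cached lighting lookup, and Russian roulette kills low-contribution paths without biasing the estimate. The scene, BSDF sampler, and cache below are stand-in stubs so the example runs on its own.

    import random

    def cached_lighting(depth):
        # Stand-in for an irradiance / light-cache lookup (e.g. a probe grid).
        return 0.5

    def sample_bounce():
        # Stand-in for BSDF sampling: returns (reflectance, emitted radiance).
        return random.uniform(0.2, 0.9), random.uniform(0.0, 0.1)

    def trace_path(max_depth=8, terminate_into_cache_after=3):
        radiance, throughput = 0.0, 1.0
        for depth in range(max_depth):
            refl, emitted = sample_bounce()
            radiance += throughput * emitted
            throughput *= refl
            if depth >= terminate_into_cache_after:
                # Approximate the remaining bounces with the cached lighting
                # term instead of tracing further, then stop.
                radiance += throughput * cached_lighting(depth)
                break
            # Russian roulette: probabilistically kill low-throughput paths
            # and reweight the survivors so the estimate stays unbiased.
            p_continue = min(1.0, throughput)
            if random.random() > p_continue:
                break
            throughput /= p_continue
        return radiance

    estimate = sum(trace_path() for _ in range(10_000)) / 10_000
    print(f"toy path-traced estimate: {estimate:.3f}")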
#26 Direct Neural Image Synthesis
Random numbers in → image out. (Director: David Luebke)

#27 Exploding Model Complexity: 30,000x in 5 Years | Now Doubling Every 2 Months
Chart (training compute in PetaFLOPS-days, log scale from 10⁻³ to 10⁴, 2012 to 2020): AlexNet (2012), ResNet (2016), BERT and GPT-2 (2018), Megatron-GPT2, Megatron-BERT, Turing NLG, GPT-3 (2020).

#28 Single-chip Inference Performance: 317x in 8 Years (repeat of the slide #13 chart).

#29 GAN diagram: latent variable → generator → fake image; real images and the fake image feed the discriminator, which decides "real or fake?"

#30 Style-based Generator
(a) Traditional generator: latent z → normalize → fully connected → PixelNorm → Conv 3x3 → PixelNorm, then upsample → Conv 3x3 → PixelNorm → Conv 3x3 → PixelNorm per resolution (4x4, 8x8, ...).
(b) Style-based generator: latent z → normalize → mapping network f (8 fully connected layers) → w; learned transforms (A) inject styles into the synthesis network g through AdaIN, with per-layer noise inputs (B). The synthesis network starts from a constant 4x4x512 tensor and applies Conv 3x3 + AdaIN blocks per resolution (4x4, 8x8, ...).

#31 NVIDIA AI Video Compression
Sender: webcam frame → keyframe plus keypoint extraction. Only the keyframe and per-frame keypoints are transmitted. Receiver: a neural network reconstructs the output video from the keyframe and keypoints.

#32 NVIDIA AI Video Compression (same pipeline, demo).

#33 (video only)

#34 NVIDIA Jarvis: Multimodal Conversational AI Services Framework
Pre-trained models from NVIDIA GPU Cloud are re-trained with transfer learning using the NVIDIA AI Toolkit (NeMo, Service Maker). Jarvis services cover multi-speaker transcription, NLU, vision, speech recognition and synthesis, language and acoustic models, dialog management, chatbots, gesture recognition, recommenders, look-to-talk, feature extraction, decoding, and voice encoding, all served through the Triton Inference Server. Example transcription: "JESSICA: What will you have ready for Wednesday? DOUGLAS: I expect to have early designs of the packaging. JESSICA: Great." Sign up for the Jarvis Beta: developer.nvidia.com/nvidia-jarvis

#35 Jarvis-driven Painting Demo
Voice commands ("Jarvis, snow", "current sky + 5", "an open grassland on a cloudy day") drive a painting UI with material brushes (clouds, hill, mountain, dirt, tree, water, grass, road, sea, flower, river, stone, rock, bush, plant, wood), a brush-size control, and aspect-ratio presets (3:1, 4:3, 16:9, 2:1, 1:1, 1:2).

#36 Megatron-CNTRL Controllable Story Generator (terminal demo)
"Welcome to Megatron-CNTRL, a controllable story generator. Please give a start sentence. Press Enter to exit generation."
Start sentence: "[FEMALE] was on a road trip."
Predicted keywords: sudden; user control keyword: driving → generated: "she was driving down the road."
Predicted keywords: sudden → generated: "all of a sudden she smelled smoke."
Predicted keywords: pulled, check → generated: "she checked the air conditioner."
Predicted keywords: blown; user control keyword: help → ...

#37 Recommenders: The Personalization Engine of the Internet
Items, O(10⁹), are mapped to item embeddings and users to user embeddings; candidate generation narrows the items to O(10²), and ranking selects the O(10) recommended items.
Digital content: 2.7 billion monthly active users. Social media: 3.8 billion active users. E-commerce: 2 billion digital shoppers. Digital advertising: 4.7 billion internet users.
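The funnel on slide #37 is easy to see in code. Below is a toy NumPy sketch of the two stages with sizes scaled down so it runs instantly; in production the candidate generator would be an approximate nearest-neighbor index over billions of items and the ranker a learned model such as DLRM, so every name here is illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    n_items, dim = 100_000, 64                 # stand-in for O(10^9) items
    item_emb = rng.standard_normal((n_items, dim)).astype(np.float32)
    user_emb = rng.standard_normal(dim).astype(np.float32)

    # Stage 1: candidate generation via embedding similarity -> O(10^2) items.
    scores = item_emb @ user_emb
    candidates = np.argpartition(-scores, 100)[:100]

    # Stage 2: ranking -- a real system scores user/item/context features with
    # a learned model; here we simply re-sort the candidates and keep 10.
    recommended = candidates[np.argsort(-scores[candidates])][:10]
    print("recommended item ids:", recommended)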
#38 NVIDIA Merlin: Democratizing Large-scale Deep Learning Recommenders
1 TB Criteo dataset: ETL 3 hr / 2 min; training 1 hr / 3 min.
Pipeline: data lake → ETL and data loading with NVTabular (RAPIDS) → training with HugeCTR, TensorFlow, or PyTorch (cuDNN) → inference with Triton, covering candidate generation from O(billions) of items to O(1000) with embeddings and ranking to O(10) for a user query. developer.nvidia.com/nvidia-merlin

#39 Single-chip Inference Performance: 317x in 8 Years (repeat of the slide #13 chart).

#40 AI Deep Learning Inference Workflow: Two-part Process Implemented by Multiple Personas
The app developer's AI application sends a query and receives a result. The data scientist / ML engineer delivers trained models. The ML engineer performs model optimization (optimizing for multiple constraints for high-performance inference) into a model store, and MLOps/DevOps run inference serving (scaled, multi-framework inference serving for high performance and utilization on GPU/CPU).

#41 Triton Inference Server: Open-source Software for Scalable, Simplified Inference Serving
The AI application sends queries and receives results over standard HTTP/gRPC. Features: dynamic batching (real-time, batch, stream); multiple GPU and CPU backends (TensorFlow, PyTorch, TensorRT, ONNX, custom); per-model scheduler queues; utilization, throughput, and latency metrics; flexible model loading (all or selective) from a model store; Kubernetes and Prometheus integration.

#42 Triton Model Ensemble for Conversational AI
Audio in → feature extraction → acoustic model → decoder and language model → NLP (Q&A) → speech synthesis → waveform generator → audio and text out. Triton orchestrates the models and the data movement in the pipeline, transparent to the AI application.
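To show what the "standard HTTP/gRPC" interface on the Triton slide looks like from the application side, here is a minimal client sketch using Triton's HTTP/REST inference protocol (KServe v2 style). The server address, model name, and tensor name are placeholders, and the sketch assumes a Triton server is already running with that model loaded.

    import requests

    # One request body per the v2 inference protocol: named, typed, shaped tensors.
    payload = {
        "inputs": [{
            "name": "input__0",                    # placeholder tensor name
            "shape": [1, 3, 224, 224],
            "datatype": "FP32",
            "data": [0.0] * (3 * 224 * 224),       # flattened input values
        }]
    }
    resp = requests.post(
        "http://localhost:8000/v2/models/my_model/infer",   # placeholder model
        json=payload,
    )
    resp.raise_for_status()
    out = resp.json()["outputs"][0]
    print("output tensor:", out["name"], "shape:", out["shape"])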
#43 NVIDIA Clara Discovery
Genomics: Clara Parabricks, RAPIDS. Structure: CryoSPARC, Relion, AlphaFold. Search: O(10⁶⁰) chemical compounds narrowed with RAPIDS to O(10⁹). Docking: AutoDock, RAPIDS. NLP over literature and real-world data: BioMegatron, BioBERT, down to O(10²) candidates. Simulation: Schrödinger, NAMD, VMD, OpenMM, MELD. Imaging: Clara Imaging, MONAI. Output: drug candidates. Available now in NVIDIA NGC Collections.

#44 AI Turbocharging Structure, Screening, and Simulation: Turning Months into Minutes
Structural biology: Folding@Home (HPC), 30x, 1.5 ExaFLOPS for COVID; CryoSPARC (HPC + AI), 12x, 12 days vs. 5 months for the COVID spike protein; AlphaFold (AI), new deep learning protein structure prediction.
Computational chemistry: AutoDock (HPC) and TorchANI (HPC + AI), 100x and 33x, screening 10 million drugs in 8 minutes vs. 240 days and 2 billion compounds in 1 day vs. 3 months; DeepChem (AI), new deep learning virtual drug screening.

#45 NVIDIA Clara Parabricks: Improving Accuracy, Speed, and Scale of Computational Genomics
Pipelines for germline, somatic, and RNA-seq analysis: alignment, preprocessing, variant calling, joint genotyping, variant processing, and quality checking, built on CUDA-X and deployable on EGX, DGX, and cloud.
Accelerated best practices: industry-standard GATK pipelines (germline, somatic, RNA-seq); 30x faster WGS, 12x faster WES.
GPU-accelerated variant callers: execution time in hours, CPU vs. GPU, for MuSE, SomaticSniper, VarScan, and Mutect.
Detecting diversity of variants: DeepVariant and multiple somatic callers, days → 1 hour.
Analytics at scale: GWAS and single-cell analysis with NVIDIA RAPIDS GPU data science.
Clara Parabricks NGC Collection.

#46 (image only)

#47 (image only)

#48 Manipulating Unknown Objects
Most existing work assumes 3D models of the objects to be manipulated. Instead, deep networks are trained in simulation to generate a variety of successful grasps on unknown objects. [Mousavian-Eppner-Fox: ICCV-19], [Danielczuk-Mousavian-Eppner-Fox: 2021], [Murali-Mousavian-Eppner-Paxton-Fox: ICRA-20, Best Student Paper Finalist]

#49 (demo image: household objects for grasping)

#50 (image only)

#51 End-to-End Platform for AV + IX Development
Collect data → train models → simulate → drive (DRIVE AV, DRIVE IX).

#52 World-Class Neural Networks
Perception, mapping, planning, and interior-sensing DNNs: obstacles, paths, intersections, distance, signs, traffic lights, time to collision, map, gestures/pose, free space, high beam, gaze, lanes, parking, prediction (RNN), LiDAR, camera blindness, radar.

#53 NVIDIA DRIVE with New Orin and Ampere: 5 W to 2,000 TOPS, One Programmable Architecture
ADAS windshield NCAP: 10 TOPS, 5 W. L2+ autopilot: 200 TOPS, 45 W. L5 robotaxi: 2,000 TOPS, 800 W. A single scalable, software-compatible architecture.

#54 Single-chip Inference Performance: 317x in 8 Years (repeat of the slide #13 chart).

#55 Efficient DL Inference Accelerators
RC18: a 9 TOPS/W, 0.32-128 TOPS DL inference accelerator built as a 36-die MCM in 16 nm [VLSI 2019, HotChips 2019, JSSC 2020, MICRO 2019]. MAGNet: a DL inference accelerator generator that maps deep learning models to PE-based hardware, reaching 29 TOPS/W in 7 nm [ICCAD 2019].

#56 MAGNet Architecture
A global controller, global buffer, and DRAM/AXI interface feed a mesh of processing elements connected by routers. Each PE contains weight, input, and accumulation buffers with address generators and buffer managers, an N-lane vector MAC with a weight collector and an accumulation collector, and a post-processing unit (pooling, ReLU, bias addition, scaling, rounding); an arbitrated crossbar handles cross-PE accumulation. Venkatesan, et al., "MAGNet: A Modular Accelerator Generator for Neural Networks," ICCAD 2019.

#57 Dataflow Options
Weight stationary (WS): temporal weight reuse; weights are held in a weight collector next to the vector MAC while partial sums stream to and from the accumulation buffer. Output stationary (OS): temporal partial-sum reuse; partial sums are held in an accumulation collector next to the vector MAC while weights stream from the weight buffer. Energy-breakdown bars compare weight buffer, accumulation buffer, input buffer, and datapath energy for the two options.

#58 Multi-level Dataflows
Combining both ideas: weight stationary - local output stationary (WS-LOS) starts from weight stationary and adds a local accumulation collector to reduce accumulation-buffer accesses; output stationary - local weight stationary (OS-LWS) starts from output stationary and adds a local weight collector to reduce weight-buffer accesses. The most frequent accesses hit the small local collectors, and the larger buffers are accessed less often.

#59 Multi-level Dataflow Energy
Energy breakdown (input buffer, weight buffer, accumulation buffer, datapath) for WS, OS, WS-LOS, and OS-LWS at VectorSize=16 with 8-bit activations and weights: the multi-level dataflows reach about 70 fJ/MAC (35 fJ/op), or roughly 29 TOPS/W.
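To make the weight-stationary vs. output-stationary distinction on slides #57-#59 concrete, here is a toy Python sketch of the two loop orders for a small fully connected layer with a batch dimension. Both loops compute the same result; they differ in which value stays in a local register (the "collector") and which buffer is touched on every multiply-accumulate, which is exactly what the energy breakdowns compare.

    import numpy as np

    N, C, K = 4, 8, 3                       # batch, input channels, output channels
    x = np.random.rand(N, C).astype(np.float32)
    w = np.random.rand(K, C).astype(np.float32)

    # Weight stationary: each weight is loaded once into a local register and
    # reused across the batch; partial sums hit the accumulation buffer per MAC.
    out_ws = np.zeros((N, K), dtype=np.float32)
    for k in range(K):
        for c in range(C):
            wv = w[k, c]                    # held in the weight collector
            for n in range(N):
                out_ws[n, k] += wv * x[n, c]    # accumulation-buffer traffic

    # Output stationary: each partial sum stays in a local accumulator until the
    # output is done; weights stream from the weight buffer on every MAC.
    out_os = np.zeros((N, K), dtype=np.float32)
    for n in range(N):
        for k in range(K):
            acc = np.float32(0.0)           # held in the accumulation collector
            for c in range(C):
                acc += w[k, c] * x[n, c]    # weight-buffer traffic
            out_os[n, k] = acc

    assert np.allclose(out_ws, out_os, atol=1e-5)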
#60 Long-Reach Signaling: Electrical vs. Optical
Electrical DGX: NVHS at 8 pJ/bit, 0.3 m reach. Optical DGX: silicon photonics at 4 pJ/bit, 20 m to 100 m reach.

#61 Photonics Link
A laser comb source places 8-10 laser lines, separated by 100 GHz, on a single fiber. On the transmit side, electrical drivers feed micro-rings tuned to the laser lines, on-off modulated at 25 Gbps initially, for 200 Gbps per fiber. On the receive side, micro-rings tuned to the laser lines couple each channel to a drop port, where the modulated light is detected by a photodetector.

#62 Co-packaged Photonics
An optical engine (OE, roughly 5 mm x 10 mm) pairs an electrical IC (EIC) with a photonic IC (PIC). The GPU and its HBM stacks sit on a TSMC interposer on an organic package (microbump, bump, and ball attach down to the PCB), with OEs co-packaged alongside the GPU and switch packages. 24 NVLinks provide 4.8 Tbps per direction over 24 laser fibers, 24 TX fibers, and 24 RX fibers.
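A quick arithmetic check ties the photonics numbers on slides #61 and #62 together: 8 wavelengths per fiber modulated at 25 Gb/s each gives the quoted 200 Gb/s per fiber, and 24 such TX fibers carry the quoted 4.8 Tb/s per direction.

    lines_per_fiber = 8          # laser comb lines, 100 GHz apart
    gbps_per_line = 25           # initial on-off modulation rate per micro-ring
    fibers_per_direction = 24    # one fiber per NVLink

    gbps_per_fiber = lines_per_fiber * gbps_per_line            # 200 Gb/s
    tbps_per_direction = gbps_per_fiber * fibers_per_direction / 1000
    print(gbps_per_fiber, "Gb/s per fiber;", tbps_per_direction, "Tb/s per direction")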
#63 Co-packaged Photonics System Concept
GPU tray and NVSwitch tray.

#64 Legate Programming System
Couple convenient notation with accelerated libraries via an advanced runtime system: familiar domain-specific interfaces (Legate) on a data-driven task runtime (Legion) over accelerated libraries (cuBLAS, cuSPARSE, cuDNN, cuDF, cuGraph, cuIO, ...). Early access release: http://developer.nvidia.com/legate

    try:
        import legate.numpy as np
    except ImportError:
        import numpy as np

    def cg_solve(A, b):
        x = np.zeros(A.shape[1])
        r = b - A.dot(x)
        p = r
        rsold = r.dot(r)
        for i in range(b.shape[0]):
            Ap = A.dot(p)
            alpha = rsold / p.dot(Ap)
            x = x + alpha * p
            r = r - alpha * Ap
            rsnew = r.dot(r)
            if np.sqrt(rsnew) < 1e-10:
                break
            beta = rsnew / rsold
            p = r + beta * p
            rsold = rsnew
        return x

#65 Scalable Execution
Achieve high-performance execution at any scale with sufficient data: Jetson Nano → DGX A100 → DGX SuperPOD.

#66 Example: Jacobi Iteration
Chart: throughput (iterations/s, log scale) vs. matrix dimension, from a 1.5 GB matrix (13,860 on 1 socket) up to a 400 GB matrix (222,784 on 256 sockets), comparing Legate CPU, Legate GPU, Dask Array, tuned NumPy, Intel (MKL) NumPy, and CuPy.

    A = np.random.rand(N, N)
    b = np.random.rand(N)
    x = np.zeros(b.shape)
    d = np.diag(A)
    R = A - np.diag(d)
    for i in range(n):
        x = (b - np.dot(R, x)) / d

#67 Example: Jacobi Iteration (repeat of the slide #66 chart).

#68 CUDA libraries and platform stacks (repeat of slide #4).

#69 NVIDIA

