Ampere (microarchitecture)

GPU microarchitecture by Nvidia

Ampere
Product Series
Launched	May 14, 2020
Designed by	Nvidia
Manufactured by	TSMC, Samsung
Fabrication process	TSMC N7 (professional), Samsung 8N (consumer)
Codename(s)	GA10x
Desktop	GeForce RTX 30 series
Professional/workstation	RTX A series
Server/datacenter	A100
Specifications
L1 cache	192 KB per SM (professional), 128 KB per SM (consumer)
L2 cache	2 MB to 6 MB
Memory support	GDDR6, GDDR6X, HBM2
PCIe support	PCIe 4.0
Supported Graphics APIs
DirectX	DirectX 12 Ultimate (Feature Level 12_2)
Direct3D	Direct3D 12.0
Shader Model	Shader Model 6.8
OpenCL	OpenCL 3.0
OpenGL	OpenGL 4.6
CUDA	Compute Capability 8.6
Vulkan	Vulkan 1.3
Media Engine
Encode codecs	H.264, H.265
Decode codecs	H.264, H.265, AV1
Color bit-depth	8-bit, 10-bit
Encoder(s) supported	NVENC
Display outputs	DisplayPort 1.4a, HDMI 2.1
History
Predecessor	Turing (consumer), Volta (professional)
Successor	Ada Lovelace (consumer), Hopper (datacenter)
Support status
Supported

Ampere is the codename for a graphics processing unit (GPU) microarchitecture developed by Nvidia as the successor to both the Volta and Turing architectures. It was officially announced on May 14, 2020 and is named after French mathematician and physicist André-Marie Ampère.

Nvidia announced the Ampere architecture GeForce 30 series consumer GPUs at a GeForce Special Event on September 1, 2020. Nvidia announced the A100 80 GB GPU at SC20 on November 16, 2020. Mobile RTX graphics cards and the RTX 3060 based on the Ampere architecture were revealed on January 12, 2021.

At GPU Technology Conference 2021, Nvidia announced "Ampere Next Next" (Blackwell) for a 2024 release. It announced Ampere's successor, Hopper, at GTC 2022.

Details

Architectural improvements of the Ampere architecture include the following:

CUDA Compute Capability 8.0 for A100 and 8.6 for the GeForce 30 series
TSMC's 7 nm FinFET process for A100
Custom version of Samsung's 8 nm process (8N) for the GeForce 30 series
Third-generation Tensor Cores with FP16, bfloat16, TensorFloat-32 (TF32) and FP64 support and sparsity acceleration. Each Tensor Core performs 256 FP16 FMA operations per clock, giving 4x the processing power of previous Tensor Core generations (GA100 only; 2x on GA10x); the Tensor Core count is reduced to four per SM (one per SM partition).
Second-generation ray tracing cores; concurrent ray tracing, shading, and compute for the GeForce 30 series
High Bandwidth Memory 2 (HBM2) on A100 40 GB & A100 80 GB
GDDR6X memory for GeForce RTX 3090, RTX 3080 Ti, RTX 3080, RTX 3070 Ti
Double FP32 cores per SM on GA10x GPUs
NVLink 3.0 with a 50 Gbit/s per pair throughput
PCI Express 4.0 with SR-IOV support (SR-IOV is available only on A100)
Multi-instance GPU (MIG) virtualization and GPU partitioning feature in A100 supporting up to seven instances
PureVideo feature set K hardware video decoding with AV1 hardware decoding for the GeForce 30 series and feature set J for A100
Five NVDEC units on A100
New hardware-based 5-core JPEG decoder (NVJPG) supporting YUV420, YUV422, YUV444, YUV400 and RGBA; not to be confused with Nvidia NVJPEG, a GPU-accelerated library for JPEG encoding and decoding
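
The per-core Tensor figure in the list above can be cross-checked against the A100's headline dense FP16 Tensor throughput. The following is a minimal sketch, not Nvidia's own method. It assumes four Tensor Cores per SM (the die table in this article lists 512 Tensor cores across 128 SMs), the 108 enabled SMs and 1410 MHz boost clock given in the accelerator comparison table, and the usual convention that one FMA counts as two floating-point operations. The function name is illustrative.

```python
def dense_tensor_tflops(fma_per_clock_per_core, cores_per_sm, sm_count, clock_mhz):
    """Peak dense Tensor throughput in TFLOPS (one FMA = two FLOPs)."""
    flops_per_clock = 2 * fma_per_clock_per_core * cores_per_sm * sm_count
    return flops_per_clock * clock_mhz * 1e6 / 1e12


# A100: 256 FP16 FMA per clock per Tensor Core, 4 cores per SM (assumed),
# 108 enabled SMs, 1410 MHz boost clock.
a100_fp16 = dense_tensor_tflops(256, 4, 108, 1410)
print(f"A100 dense FP16 Tensor peak: {a100_fp16:.0f} TFLOPS")
```

Under these assumptions the result rounds to the 312 TFLOPS listed for the A100 in the accelerator comparison table.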

Chips

GA100
GA102
GA103
GA104
GA106
GA107
GA10B

Comparison of Compute Capability: GP100 vs GV100 vs GA100

GPU features	Nvidia Tesla P100	Nvidia Tesla V100	Nvidia A100
GPU codename	GP100	GV100	GA100
GPU architecture	Pascal	Volta	Ampere
Compute capability	6.0	7.0	8.0
Threads / warp	32	32	32
Max warps / SM	64	64	64
Max threads / SM	2048	2048	2048
Max thread blocks / SM	32	32	32
Max 32-bit registers / SM	65536	65536	65536
Max registers / block	65536	65536	65536
Max registers / thread	255	255	255
Max thread block size	1024	1024	1024
FP32 cores / SM	64	64	64
Ratio of SM registers to FP32 cores	1024	1024	1024
Shared Memory Size / SM	64 KB	Configurable up to 96 KB	Configurable up to 164 KB
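
Several rows of this table are related by simple arithmetic. The sketch below derives the thread limit from the warp limit, reproduces the register-to-core ratio, and estimates how many threads can be resident on one SM for a given register budget per thread. It is a simplified model: it ignores register allocation granularity, shared memory use and the thread-block limits, all of which also constrain occupancy on real hardware. The function names are illustrative.

```python
# Per-SM limits shared by GP100, GV100 and GA100 in the table above.
WARP_SIZE = 32
MAX_WARPS_PER_SM = 64
REGISTERS_PER_SM = 65536
FP32_CORES_PER_SM = 64


def max_threads_per_sm():
    """Thread limit implied by the warp limit."""
    return WARP_SIZE * MAX_WARPS_PER_SM


def resident_threads(registers_per_thread):
    """Threads that fit in the register file, in whole warps (simplified)."""
    warps_by_registers = REGISTERS_PER_SM // (registers_per_thread * WARP_SIZE)
    return min(warps_by_registers, MAX_WARPS_PER_SM) * WARP_SIZE


print("max threads / SM:", max_threads_per_sm())
print("registers per FP32 core:", REGISTERS_PER_SM // FP32_CORES_PER_SM)
for regs in (32, 64, 128, 255):
    print(f"{regs:3d} registers/thread -> {resident_threads(regs)} resident threads")
```

In this model a kernel must use at most 32 registers per thread to reach the full 2048 threads per SM, while a kernel at the 255-register maximum fits only 8 warps (256 threads).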

Comparison of Precision Support Matrix

	Supported CUDA Core Precisions								Supported Tensor Core Precisions
GPU	FP16	FP32	FP64	INT1	INT4	INT8	TF32	BF16	FP16	FP32	FP64	INT1	INT4	INT8	TF32	BF16
Nvidia Tesla P4	No	Yes	Yes	No	No	Yes	No	No	No	No	No	No	No	No	No	No
Nvidia P100	Yes	Yes	Yes	No	No	No	No	No	No	No	No	No	No	No	No	No
Nvidia Volta	Yes	Yes	Yes	No	No	Yes	No	No	Yes	No	No	No	No	No	No	No
Nvidia Turing	Yes	Yes	Yes	No	No	No	No	No	Yes	No	No	Yes	Yes	Yes	No	No
Nvidia A100	Yes	Yes	Yes	No	No	Yes	No	Yes	Yes	No	Yes	Yes	Yes	Yes	Yes	Yes

Legend:

FPnn: floating point with nn bits
INTn: integer with n bits
INT1: binary
TF32: TensorFloat32
BF16: bfloat16
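
To make the TF32 and BF16 entries concrete, the sketch below emulates both formats by masking the mantissa of an IEEE 754 single-precision value. It relies on the commonly documented bit layouts, which are not stated in this article: FP32 has 8 exponent and 23 mantissa bits, TF32 keeps the 8 exponent bits with 10 mantissa bits, and BF16 keeps the 8 exponent bits with 7 mantissa bits. The sketch truncates (rounds toward zero) for simplicity; the rounding performed by real hardware may differ, and the helper names are illustrative.

```python
import struct


def keep_mantissa_bits(x, mantissa_bits):
    """Truncate a float32 value to the given number of mantissa bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    drop = 23 - mantissa_bits              # low mantissa bits to clear
    mask = (0xFFFFFFFF >> drop) << drop
    return struct.unpack("<f", struct.pack("<I", bits & mask))[0]


def to_tf32(x):
    return keep_mantissa_bits(x, 10)       # assumed layout: 1 sign, 8 exp, 10 mantissa


def to_bf16(x):
    return keep_mantissa_bits(x, 7)        # assumed layout: 1 sign, 8 exp, 7 mantissa


x = 1 + 2**-10                             # needs 10 mantissa bits
print(to_tf32(x), to_bf16(x))
```

Both formats keep the FP32 exponent range, so they trade precision rather than range: the value above survives conversion to TF32 but collapses to 1.0 in BF16.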

Comparison of Decode Performance

Concurrent streams	H.264 decode (1080p30)	H.265 (HEVC) decode (1080p30)	VP9 decode (1080p30)
V100	16	22	22
A100	75	157	108
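
The generational gain is easier to read as a ratio. This short sketch divides the A100 stream counts by the V100 counts from the table above; the dictionary layout is just for illustration.

```python
# Concurrent 1080p30 decode streams: (V100, A100), taken from the table above.
streams = {
    "H.264": (16, 75),
    "H.265 (HEVC)": (22, 157),
    "VP9": (22, 108),
}

speedups = {codec: a100 / v100 for codec, (v100, a100) in streams.items()}

for codec, ratio in speedups.items():
    print(f"{codec}: {ratio:.1f}x more concurrent streams on A100")
```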

Ampere dies

Die	GA100	GA102	GA103	GA104	GA106	GA107	GA10B	GA10F
Die size	826 mm²	628 mm²	496 mm²	392 mm²	276 mm²	200 mm²	448 mm²	?
Transistors	54.2B	28.3B	22B	17.4B	12B	8.7B	21B	?
Transistor density	65.6 MTr/mm²	45.1 MTr/mm²	44.4 MTr/mm²	44.4 MTr/mm²	43.5 MTr/mm²	43.5 MTr/mm²	46.9 MTr/mm²	?
Graphics processing clusters	8	7	6	6	3	2	2	1
Streaming multiprocessors	128	84	60	48	30	20	16	12
CUDA cores	12288	10752	7680	6144	3840	2560	2048	1536
Texture mapping units	512	336	240	192	120	80	64	48
Render output units	192	112	96	96	48	32	32	16
Tensor cores	512	336	240	192	120	80	64	48
RT cores	N/A	84	60	48	30	20	8	12
L1 cache (total)	24 MB	10.5 MB	7.5 MB	6 MB	3.75 MB	2.5 MB	3 MB	1.5 MB
L1 cache (per SM)	192 KB	128 KB	128 KB	128 KB	128 KB	128 KB	192 KB	128 KB
L2 cache	40 MB	6 MB	4 MB	4 MB	3 MB	2 MB	4 MB	?
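
Two rows of the die table are derived quantities: transistor density is the transistor count divided by die area (in mm²), and total L1 cache is the per-SM L1 size multiplied by the number of SMs on the full die. The sketch below recomputes both from the other rows. The tuple layout and function names are illustrative, and GA10F is omitted because its die size and transistor count are not listed.

```python
# die: (area in mm^2, transistors in billions, SMs on the full die, L1 KB per SM)
dies = {
    "GA100": (826, 54.2, 128, 192),
    "GA102": (628, 28.3, 84, 128),
    "GA103": (496, 22.0, 60, 128),
    "GA104": (392, 17.4, 48, 128),
    "GA106": (276, 12.0, 30, 128),
    "GA107": (200, 8.7, 20, 128),
    "GA10B": (448, 21.0, 16, 192),
}


def density_mtr_per_mm2(area_mm2, transistors_billion):
    """Million transistors per square millimetre."""
    return transistors_billion * 1000 / area_mm2


def l1_total_mb(sm_count, l1_kb_per_sm):
    """Aggregate L1 cache across all SMs, in MB (1 MB = 1024 KB)."""
    return sm_count * l1_kb_per_sm / 1024


for name, (area, transistors, sms, l1_kb) in dies.items():
    print(f"{name}: {density_mtr_per_mm2(area, transistors):.1f} MTr/mm^2, "
          f"{l1_total_mb(sms, l1_kb):.2f} MB L1")
```

The computed densities round to the values in the transistor density row. For GA106, 30 SMs at 128 KB each give 3.75 MB of L1.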

A100 accelerator and DGX A100

The Ampere-based A100 accelerator was announced and released on May 14, 2020. The A100 features 19.5 teraflops of FP32 performance, 6912 FP32/INT32 CUDA cores, 3456 FP64 CUDA cores, 40 GB of graphics memory, and 1.6 TB/s of graphics memory bandwidth. The A100 accelerator was initially available only in the 3rd generation of DGX server, including 8 A100s. Also included in the DGX A100 is 15 TB of PCIe gen 4 NVMe storage, two 64-core AMD Rome 7742 CPUs, 1 TB of RAM, and Mellanox-powered HDR InfiniBand interconnect. The initial price for the DGX A100 was $199,000.
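
The 19.5 teraflops figure follows from the core count and clock speed. The sketch below assumes the 1410 MHz boost clock listed for the A100 in the comparison table in this article, and the usual convention that each core retires one fused multiply-add (two floating-point operations) per clock. These are theoretical peaks, not measured performance, and the function name is illustrative.

```python
def peak_tflops(cores, clock_mhz, flops_per_core_per_clock=2):
    """Theoretical peak throughput in TFLOPS (one FMA = two FLOPs)."""
    return cores * flops_per_core_per_clock * clock_mhz * 1e6 / 1e12


fp32 = peak_tflops(6912, 1410)   # FP32 CUDA cores
fp64 = peak_tflops(3456, 1410)   # FP64 CUDA cores
dgx_fp32 = 8 * fp32              # eight A100s in one DGX A100

print(f"A100 FP32: {fp32:.1f} TFLOPS")
print(f"A100 FP64: {fp64:.1f} TFLOPS")
print(f"DGX A100 FP32 (8 GPUs): {dgx_fp32:.0f} TFLOPS")
```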

Comparison of accelerators used in DGX:

Model	Architecture	Socket	FP32 CUDA cores	FP64 cores (excl. tensor)	Mixed INT32/FP32 cores	INT32 cores	Boost clock	Memory clock	Memory bus width	Memory bandwidth	VRAM	Single precision (FP32)	Double precision (FP64)	INT8 (non-tensor)	INT8 dense tensor	INT32	FP4 dense tensor	FP16	FP16 dense tensor	bfloat16 dense tensor	TensorFloat-32 (TF32) dense tensor	FP64 dense tensor	Interconnect (NVLink)	GPU	L1 Cache	L2 Cache	TDP	Die size	Transistor count	Process	Launched
B200	Blackwell	SXM6	N/A	N/A	N/A	N/A	N/A	8 Gbit/s HBM3e	8192-bit	8 TB/sec	192 GB HBM3e	N/A	N/A	N/A	4.5 POPS	N/A	9 PFLOPS	N/A	2.25 PFLOPS	2.25 PFLOPS	1.2 PFLOPS	40 TFLOPS	1.8 TB/sec	GB100	N/A	N/A	1000 W	N/A	208 B	TSMC 4NP	Q4 2024 (expected)
B100	Blackwell	SXM6	N/A	N/A	N/A	N/A	N/A	8 Gbit/s HBM3e	8192-bit	8 TB/sec	192 GB HBM3e	N/A	N/A	N/A	3.5 POPS	N/A	7 PFLOPS	N/A	1.98 PFLOPS	1.98 PFLOPS	989 TFLOPS	30 TFLOPS	1.8 TB/sec	GB100	N/A	N/A	700 W	N/A	208 B	TSMC 4NP	Q4 2024 (expected)
H200	Hopper	SXM5	16896	4608	16896	N/A	1980 MHz	6.3 Gbit/s HBM3e	6144-bit	4.8 TB/sec	141 GB HBM3e	67 TFLOPS	34 TFLOPS	N/A	1.98 POPS	N/A	N/A	N/A	990 TFLOPS	990 TFLOPS	495 TFLOPS	67 TFLOPS	900 GB/sec	GH100	25344 KB (192 KB × 132)	51200 KB	1000 W	814 mm²	80 B	TSMC 4N	Q3 2023
H100	Hopper	SXM5	16896	4608	16896	N/A	1980 MHz	5.2 Gbit/s HBM3	5120-bit	3.35 TB/sec	80 GB HBM3	67 TFLOPS	34 TFLOPS	N/A	1.98 POPS	N/A	N/A	N/A	990 TFLOPS	990 TFLOPS	495 TFLOPS	67 TFLOPS	900 GB/sec	GH100	25344 KB (192 KB × 132)	51200 KB	700 W	814 mm²	80 B	TSMC 4N	Q3 2022
A100 80GB	Ampere	SXM4	6912	3456	6912	N/A	1410 MHz	3.2 Gbit/s HBM2e	5120-bit	2.04 TB/sec	80 GB HBM2e	19.5 TFLOPS	9.7 TFLOPS	N/A	624 TOPS	19.5 TOPS	N/A	78 TFLOPS	312 TFLOPS	312 TFLOPS	156 TFLOPS	19.5 TFLOPS	600 GB/sec	GA100	20736 KB (192 KB × 108)	40960 KB	400 W	826 mm²	54.2 B	TSMC N7	Q4 2020
A100 40GB	Ampere	SXM4	6912	3456	6912	N/A	1410 MHz	2.4 Gbit/s HBM2	5120-bit	1.52 TB/sec	40 GB HBM2	19.5 TFLOPS	9.7 TFLOPS	N/A	624 TOPS	19.5 TOPS	N/A	78 TFLOPS	312 TFLOPS	312 TFLOPS	156 TFLOPS	19.5 TFLOPS	600 GB/sec	GA100	20736 KB (192 KB × 108)	40960 KB	400 W	826 mm²	54.2 B	TSMC N7	Q2 2020
V100 32GB	Volta	SXM3	5120	2560	N/A	5120	1530 MHz	1.75 Gbit/s HBM2	4096-bit	900 GB/sec	32 GB HBM2	15.7 TFLOPS	7.8 TFLOPS	62 TOPS	N/A	15.7 TOPS	N/A	31.4 TFLOPS	125 TFLOPS	N/A	N/A	N/A	300 GB/sec	GV100	10240 KB (128 KB × 80)	6144 KB	350 W	815 mm²	21.1 B	TSMC 12FFN	Q3 2017
V100 16GB	Volta	SXM2	5120	2560	N/A	5120	1530 MHz	1.75 Gbit/s HBM2	4096-bit	900 GB/sec	16 GB HBM2	15.7 TFLOPS	7.8 TFLOPS	62 TOPS	N/A	15.7 TOPS	N/A	31.4 TFLOPS	125 TFLOPS	N/A	N/A	N/A	300 GB/sec	GV100	10240 KB (128 KB × 80)	6144 KB	300 W	815 mm²	21.1 B	TSMC 12FFN	Q3 2017
P100	Pascal	SXM/SXM2	N/A	1792	3584	N/A	1480 MHz	1.4 Gbit/s HBM2	4096-bit	720 GB/sec	16 GB HBM2	10.6 TFLOPS	5.3 TFLOPS	N/A	N/A	N/A	N/A	21.2 TFLOPS	N/A	N/A	N/A	N/A	160 GB/sec	GP100	1344 KB (24 KB × 56)	4096 KB	300 W	610 mm²	15.3 B	TSMC 16FF+	Q2 2016
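
The memory bandwidth column can be approximated from the memory clock and memory bus width columns: the per-pin data rate in Gbit/s times the bus width in bits, divided by 8 bits per byte, gives GB/s. The sketch below applies this to several rows. Because the listed data rates are rounded, the results are estimates rather than exact vendor figures, and the function name is illustrative.

```python
def memory_bandwidth_gb_s(data_rate_gbit_s, bus_width_bits):
    """Approximate peak memory bandwidth in GB/s."""
    return data_rate_gbit_s * bus_width_bits / 8


# (per-pin data rate in Gbit/s, bus width in bits), taken from the table above.
accelerators = {
    "P100": (1.4, 4096),
    "V100": (1.75, 4096),
    "A100 40GB": (2.4, 5120),
    "A100 80GB": (3.2, 5120),
    "H100": (5.2, 5120),
    "H200": (6.3, 6144),
    "B200": (8.0, 8192),
}

for name, (rate, width) in accelerators.items():
    print(f"{name}: about {memory_bandwidth_gb_s(rate, width):.0f} GB/s")
```

For the A100 80GB, the listed 3.2 Gbit/s data rate on a 5120-bit bus works out to roughly 2.05 TB/s.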

Products using Ampere

GeForce MX series
- GeForce MX570 (mobile) (GA107)
GeForce 20 series
- GeForce RTX 2050 (mobile) (GA107)
GeForce 30 series
- GeForce RTX 3050 Laptop GPU (GA107)
- GeForce RTX 3050 (GA106 or GA107)
- GeForce RTX 3050 Ti Laptop GPU (GA107)
- GeForce RTX 3060 Laptop GPU (GA106)
- GeForce RTX 3060 (GA106 or GA104)
- GeForce RTX 3060 Ti (GA104 or GA103)
- GeForce RTX 3070 Laptop GPU (GA104)
- GeForce RTX 3070 (GA104)
- GeForce RTX 3070 Ti Laptop GPU (GA104)
- GeForce RTX 3070 Ti (GA104 or GA102)
- GeForce RTX 3080 Laptop GPU (GA104)
- GeForce RTX 3080 (GA102)
- GeForce RTX 3080 12 GB (GA102)
- GeForce RTX 3080 Ti Laptop GPU (GA103)
- GeForce RTX 3080 Ti (GA102)
- GeForce RTX 3090 (GA102)
- GeForce RTX 3090 Ti (GA102)
Nvidia Workstation GPUs (formerly Quadro)
- RTX A1000 (mobile) (GA107)
- RTX A2000 (mobile) (GA106)
- RTX A2000 (GA106)
- RTX A3000 (mobile) (GA104)
- RTX A4000 (mobile) (GA104)
- RTX A4000 (GA104)
- RTX A5000 (mobile) (GA104)
- RTX A5500 (mobile) (GA103)
- RTX A4500 (GA102)
- RTX A5000 (GA102)
- RTX A5500 (GA102)
- RTX A6000 (GA102)
- A800 Active

Nvidia Data Center GPUs (formerly Tesla)
- Nvidia A2 (GA107)
- Nvidia A10 (GA102)
- Nvidia A16 (4 × GA107)
- Nvidia A30 (GA100)
- Nvidia A40 (GA102)
- Nvidia A100 (GA100)
- Nvidia A100 80 GB (GA100)
- Nvidia A100X
- Nvidia A30X

Tegra SoCs
- AGX Orin (GA10B)
- Orin NX (GA10B)
- Orin Nano (GA10B)

Products using Ampere (per Chip)
Type	GA10B	GA107	GA106	GA104	GA103	GA102	GA100
GeForce MX series	—	GeForce MX570 (mobile)	—	—	—	—	—
GeForce 20 series	—	GeForce RTX 2050 (mobile)	—	—	—	—	—
GeForce 30 series	—	GeForce RTX 3050 Laptop, GeForce RTX 3050, GeForce RTX 3050 Ti Laptop	GeForce RTX 3050, GeForce RTX 3060 Laptop, GeForce RTX 3060	GeForce RTX 3060, GeForce RTX 3060 Ti, GeForce RTX 3070 Laptop, GeForce RTX 3070, GeForce RTX 3070 Ti Laptop, GeForce RTX 3070 Ti, GeForce RTX 3080 Laptop	GeForce RTX 3060 Ti, GeForce RTX 3080 Ti Laptop	GeForce RTX 3070 Ti, GeForce RTX 3080, GeForce RTX 3080 Ti, GeForce RTX 3090, GeForce RTX 3090 Ti	—
Nvidia Workstation GPUs	—	RTX A1000 (mobile)	RTX A2000 (mobile), RTX A2000	RTX A3000 (mobile), RTX A4000 (mobile), RTX A4000, RTX A5000 (mobile)	RTX A5500 (mobile)	RTX A4500, RTX A5000, RTX A5500, RTX A6000	—
Nvidia Data Center GPUs	—	Nvidia A2, Nvidia A16	—	—	—	Nvidia A10, Nvidia A40	Nvidia A30, Nvidia A100
Tegra SoCs	AGX Orin, Orin NX, Orin Nano	—	—	—	—	—	—
