
NVIDIA's Popular GPUs Compared: H100, A6000, L40S, A100


In artificial intelligence and deep learning, GPU performance directly determines how fast models train and how efficiently they serve inference. As the technology has advanced rapidly, a number of high-performance GPUs have come to market, with NVIDIA's flagship products leading the pack.

This article compares four NVIDIA cards built on post-2020 architectures: the H100, A100, A6000, and L40S. By digging into their performance metrics, we will examine where each fits in model training and inference tasks, so that you can make an informed choice. We will also point out which well-known companies and projects actually use these GPUs.

Which Mainstream GPUs Suit Inference, and Which Suit Training?

Let's compare the numbers and analyze which of the NVIDIA H100, A100, A6000, and L40S are better suited to model training and which to inference. The table below lists the key performance specifications of the four cards (figures are NVIDIA's published specs; H100 and A100 figures are for the SXM variants, and FP16 Tensor Core numbers are dense, without sparsity):

| GPU | Architecture | FP32 | FP16 Tensor Core | VRAM | VRAM Type | Memory Bandwidth |
| --- | --- | --- | --- | --- | --- | --- |
| H100 | Hopper | 67 TFLOPS | 989 TFLOPS | 80 GB | HBM3 | 3.35 TB/s |
| A100 | Ampere | 19.5 TFLOPS | 312 TFLOPS | 80 GB | HBM2e | 2.0 TB/s |
| RTX A6000 | Ampere | 38.7 TFLOPS | 155 TFLOPS | 48 GB | GDDR6 | 768 GB/s |
| L40S | Ada Lovelace | 91.6 TFLOPS | 362 TFLOPS | 48 GB | GDDR6 | 864 GB/s |

This table summarizes each GPU's architecture, FP16/FP32 compute performance, Tensor Core performance, VRAM size, VRAM type, and memory bandwidth, making it easy to compare their suitability for different workloads. As a rule, a newer architecture generally means better performance. From oldest to newest, the architectures are:

  1. Ampere (released 2020)
  2. Ada Lovelace (released 2022)
  3. Hopper (released 2022)

When choosing a GPU for large language model (LLM) training and inference, each card has its own characteristics and target scenarios. Below we analyze each GPU's strengths and weaknesses for training and inference tasks, to clarify where each one fits.

1. NVIDIA H100

Applicable Scenarios

Model training: The H100 is currently NVIDIA's most advanced GPU, designed specifically for large-scale AI training. With formidable compute, large VRAM, and extremely high bandwidth, it can churn through massive datasets and is especially well suited to training large language models such as GPT and BERT. Its Tensor Core performance is outstanding and dramatically accelerates training.

Inference: The H100 handles inference with ease as well, particularly for very large models. However, given its high power draw and cost, it is generally reserved for inference workloads with extreme concurrency or strict real-time requirements.

Practical Use Cases

Inflection AI: With backing from Microsoft and NVIDIA, Inflection AI plans to build a supercomputer cluster using 22,000 NVIDIA H100 compute GPUs, potentially rivaling the Frontier supercomputer in performance. The cluster marks a strategic investment in scaling the speed and capability of its products, particularly its AI chatbot, Pi.

Meta: In support of its open-source artificial general intelligence (AGI) initiative, Meta plans to purchase 350,000 NVIDIA H100 GPUs by the end of 2024. The investment stems from its ambition to build out infrastructure for advanced AI capabilities and wearable AR technology.

2. NVIDIA A100

Applicable Scenarios

Model training: The A100 is the workhorse GPU for data-center AI training and is especially strong at mixed-precision training. Its generous VRAM and bandwidth make it excellent for large models and large-batch training jobs.
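Mixed precision here means computing in FP16 on the Tensor Cores while guarding against FP16's narrow range: very small gradients underflow to zero, which is why frameworks pair FP16 with loss scaling. The stdlib-only sketch below illustrates the core idea; the gradient and scale values are purely illustrative (Python's `struct` "e" format rounds a value through IEEE 754 half precision):

```python
import struct

def fp16(x: float) -> float:
    """Round a Python float through IEEE 754 half precision."""
    return struct.unpack("e", struct.pack("e", x))[0]

grad = 1e-8      # an illustrative, very small gradient
scale = 1024.0   # loss-scaling factor, as used in mixed-precision training

underflowed = fp16(grad)         # stored directly in FP16: flushes to zero
scaled = fp16(grad * scale)      # scale the loss first, then store in FP16
recovered = scaled / scale       # unscale in FP32 before the optimizer step

print(underflowed)   # 0.0 -- the weight update would be lost
print(recovered)     # ~1e-8 -- the update survives thanks to loss scaling
```

This is exactly why frameworks expose a gradient scaler alongside FP16 autocasting: the heavy math runs at Tensor Core speed, but small gradients are kept out of FP16's underflow zone.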

Inference: The A100's high compute power and VRAM also make it well suited to inference, particularly for complex neural networks and large-scale concurrent requests.

Practical Use Cases

Microsoft Azure: Microsoft Azure integrates A100 GPUs into its services to deliver high-performance computing and AI scalability in the public cloud, supporting applications from natural language processing to complex data analytics.

NVIDIA Selene supercomputer: Selene, an NVIDIA DGX SuperPOD system built on A100 GPUs, has played an important role in AI research and high-performance computing (HPC). Notably, it set records for scientific simulations and AI model training times, and ranked fifth on the Top500 list of the fastest industrial supercomputers.

3. NVIDIA A6000

Applicable Scenarios

Model training: The A6000 is a very reasonable choice for workstation environments, especially where large VRAM is needed. Although its raw compute falls short of the A100 or H100, it is more than adequate for training small to medium-sized models, and its VRAM can accommodate fairly large ones.
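Whether a model fits a card's VRAM comes down to simple arithmetic: weights take parameters times bytes per value, and full training adds gradients, an FP32 master copy, and optimizer state (Adam keeps two FP32 moments per parameter). A back-of-the-envelope estimator follows; the 7B example is purely illustrative, and real usage also needs headroom for activations and framework overhead:

```python
def estimate_vram_gb(params_billions: float, dtype_bytes: int = 2,
                     training: bool = False) -> float:
    """Rough VRAM estimate in GB: weights only for inference;
    weights + gradients + FP32 master copy + Adam moments for training."""
    params = params_billions * 1e9
    weights = params * dtype_bytes
    if not training:
        total = weights
    else:
        grads = params * dtype_bytes       # one gradient per parameter
        master_weights = params * 4        # FP32 master copy (mixed precision)
        adam_moments = params * 4 * 2      # two FP32 moments per parameter
        total = weights + grads + master_weights + adam_moments
    return total / 1e9

# Illustrative: a 7B-parameter model in FP16
print(estimate_vram_gb(7))                 # 14.0 GB -- fits a 48 GB A6000
print(estimate_vram_gb(7, training=True))  # 112.0 GB -- exceeds one 80 GB card
```

The asymmetry is the point: a card whose VRAM comfortably serves a model for inference can still be far too small to train that same model on a single GPU.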

Inference: The A6000's VRAM and performance make it a strong inference choice, offering a good balance of compute and memory in scenarios with larger inputs or high-concurrency serving.

Practical Use Cases

Sphere in Las Vegas: The Sphere in Las Vegas uses 150 NVIDIA A6000 GPUs to process and render the animated content displayed across its dome.

4. NVIDIA L40S

Applicable Scenarios

Model training: The L40S is designed for workstation-class workloads and brings a sizable step up in compute and VRAM, making it suitable for training medium to large models, especially where strong graphics processing must be combined with AI training capability.

Inference: The L40S's strong performance and large VRAM make it very well suited to high-performance inference, especially complex inference workloads in a workstation environment. As the chart below shows, although the L40S is priced lower than the A100, it delivered 1.2x the A100's performance in a text-to-image model benchmark, thanks largely to its Ada Lovelace Tensor Cores and FP8 precision support.
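The price-performance argument can be made concrete with a tiny calculation. The hourly prices below are hypothetical placeholders (check your provider's actual rates); the 1.2x figure is the relative performance reported above:

```python
def perf_per_dollar(relative_perf: float, hourly_price: float) -> float:
    """Relative throughput per dollar-hour of rental; higher is better."""
    return relative_perf / hourly_price

# Hypothetical hourly prices, for illustration only.
a100 = perf_per_dollar(relative_perf=1.0, hourly_price=3.0)  # baseline card
l40s = perf_per_dollar(relative_perf=1.2, hourly_price=2.0)  # faster AND cheaper here

print(l40s > a100)  # True: at these prices the L40S wins on perf per dollar
```

A card that is both cheaper per hour and faster on your workload dominates on this metric; the calculation only becomes interesting when one card is faster but more expensive.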

Practical Use Cases

Animation studios: The NVIDIA L40S is widely used in animation studios for 3D rendering and complex visual effects. Its ability to handle high-resolution graphics and large volumes of data makes it ideal for media and gaming companies producing detailed animation and visual content.

Healthcare and life sciences: Healthcare organizations use the L40S for genomic analysis and medical imaging. The GPU's efficiency with large datasets is accelerating genetics research and improving diagnostic accuracy through enhanced imaging techniques.

Conclusions

  • GPUs recommended for model training:

  • The H100 and A100 are currently the best choices for training large-scale models (e.g., GPT-3 and GPT-4), with top-tier compute, VRAM, and bandwidth. The H100 outperforms the A100, but the A100 remains the workhorse of large-scale AI training today.

  • The A6000 can handle training of small to medium-sized models in a workstation environment.

  • The L40S offers balanced performance, with excellent FP32 and Tensor Core capability, but for model training the H100 and A100 remain the stronger options.

  • GPUs recommended for inference:

  • The A6000 and L40S are ideal for inference, offering strong performance and ample VRAM to serve large models efficiently.

  • The A100 and H100 excel at hyper-scale concurrent or real-time inference, but given their higher cost, dedicating them solely to inference leaves much of their capability unused.

Beyond single-card specs, training a large model inevitably requires multiple GPUs, which is where NVIDIA's NVLink interconnect comes in. NVLink is typically found on high-end, data-center-class GPUs, but professional cards such as the L40S do not support it, so the L40S is a poor fit for training relatively complex large models; at most we would suggest training small models on a single card, and we recommend using the L40S primarily for inference. The H100 remains the most cutting-edge card here: NVIDIA has since announced the B200, but that GPU has yet to see large-scale market deployment. A card like the H100 is genuinely suitable for both training and inference, but its high cost and performance make it overkill for inference-only workloads.

The conclusions above are based on specifications combined with real-world use cases; when selecting hardware you should also weigh cost. Compared with buying GPUs and building your own servers, we recommend GPU cloud services: they are cheaper than purchasing hardware outright, a GPU instance can be running within minutes, and some GPU cloud platforms also provide cloud environments suited to collaborative team development, including Jupyter notebooks and model deployment. See DigitalOcean's GPU cloud server pricing for reference; for some models, such as the H100, DigitalOcean offers both single-card and 8-card configurations.
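You can check whether the GPUs in a machine are actually connected over NVLink with `nvidia-smi topo -m`, whose connectivity matrix shows `NV#` cells for NVLink links and entries such as `PHB` or `SYS` for plain PCIe paths. A small parser sketch over that output follows; the sample matrix is illustrative, not captured from a real machine:

```python
def has_nvlink(topo_matrix: str) -> bool:
    """Return True if any GPU-to-GPU link in `nvidia-smi topo -m` output
    is an NVLink connection (cells like NV1, NV2, ...)."""
    for line in topo_matrix.splitlines():
        if line.startswith("GPU"):
            cells = line.split()[1:]
            if any(c.startswith("NV") and c[2:].isdigit() for c in cells):
                return True
    return False

# Illustrative sample in the shape of `nvidia-smi topo -m` output:
sample = """\
        GPU0    GPU1
GPU0    X       NV2
GPU1    NV2     X
"""
print(has_nvlink(sample))  # True: GPU0 and GPU1 are linked over NVLink
```

On a multi-card instance, running this check before launching a distributed training job tells you whether gradient all-reduce traffic will ride NVLink or fall back to the much slower PCIe path.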

 

As a starting point, consider single-card GPU instance pricing: DigitalOcean's GPU cloud service is a cloud GPU server rental platform focused on AI model training, offering powerful GPU and IPU instances including the A5000, A6000, and H100, with transparent pricing that can cut compute costs by up to 70% compared with other public clouds.