
Discussion: training time on personal GPUs #60

Open
jingyaogong opened this issue Oct 5, 2024 · 6 comments

Labels
good first issue Good for newcomers

Comments

jingyaogong (Owner) commented Oct 5, 2024

I rented two machines on a cloud platform, keeping memory, CPU, and the other variables identical, and measured the training time of different GPUs.

[images: specs and benchmark results of the two rented machines]

In my view, the [3060 ~ 2080Ti ~ 3090 ~ 4090] range covers the GPUs most AI practitioners have on hand, so it is quite representative.

For other desktop GPUs (for example, a 3060 is slightly weaker than a 2080Ti), you can extrapolate from the chart above.


  • Single 2080Ti (11 GB VRAM)

    pretrain, batchsize=48, estimated 7 hours per epoch

    root@autodl-container-908d479a1c-1697cfd8:~/autodl-tmp/minimind# python 1-pretrain.py 
    LLM总参数量:26.878 百万
    Epoch:[0/20](0/111769) loss:8.879 lr:0.0002000 epoch_Time:2618.0min:
    Epoch:[0/20](100/111769) loss:7.438 lr:0.0002000 epoch_Time:442.0min:
    Epoch:[0/20](200/111769) loss:6.899 lr:0.0002000 epoch_Time:431.0min:
    Epoch:[0/20](300/111769) loss:6.576 lr:0.0002000 epoch_Time:426.0min:
    

    full_sft, batchsize=48, estimated 5.4 hours per epoch

    root@autodl-container-908d479a1c-1697cfd8:~/autodl-tmp/minimind# python 3-full_sft.py 
    LLM总参数量:26.878 百万
    Epoch:[0/19](0/82267) loss:8.876 lr:0.0001000 epoch_Time:2011.0min:
    Epoch:[0/19](100/82267) loss:6.302 lr:0.0001000 epoch_Time:335.0min:
    Epoch:[0/19](200/82267) loss:5.667 lr:0.0001000 epoch_Time:327.0min:
    Epoch:[0/19](300/82267) loss:5.193 lr:0.0001000 epoch_Time:324.0min:
    
  • Single 4090 (24 GB VRAM)

    pretrain, batchsize=96, estimated 3.3 hours per epoch

    root@autodl-container-36164ea3cd-3ac722f7:~/autodl-tmp/minimind# python 1-pretrain.py 
    LLM总参数量:26.878 百万
    Epoch:[0/20](0/55885) loss:8.876 lr:0.0002000 epoch_Time:2049.0min:
    Epoch:[0/20](100/55885) loss:7.401 lr:0.0002000 epoch_Time:212.0min:
    Epoch:[0/20](200/55885) loss:6.958 lr:0.0002000 epoch_Time:201.0min:
    Epoch:[0/20](300/55885) loss:6.460 lr:0.0002000 epoch_Time:197.0min:
    

    full_sft, batchsize=96, estimated 2.5 hours per epoch

    root@autodl-container-36164ea3cd-3ac722f7:~/autodl-tmp/minimind# python 3-full_sft.py 
    LLM总参数量:26.878 百万
    Epoch:[0/19](0/41134) loss:5.676 lr:0.0001000 epoch_Time:1086.0min:
    Epoch:[0/19](100/41134) loss:4.872 lr:0.0001000 epoch_Time:156.0min:
    Epoch:[0/19](200/41134) loss:4.446 lr:0.0001000 epoch_Time:152.0min:
    

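As a sanity check on the two pretrain logs above: the per-epoch step counts should scale inversely with batch size, since both machines iterate over the same corpus. A minimal sketch, using only the step counts and batch sizes printed in the logs:

```python
# Steps per epoch and batch sizes, read from the two pretrain logs above
steps_2080ti, bs_2080ti = 111769, 48
steps_4090, bs_4090 = 55885, 96

# Each product estimates the dataset size in samples; they should agree
samples_2080ti = steps_2080ti * bs_2080ti
samples_4090 = steps_4090 * bs_4090

rel_diff = abs(samples_2080ti - samples_4090) / samples_4090
print(f"{samples_2080ti} vs {samples_4090} samples, relative difference {rel_diff:.4%}")
```

The two products agree to well under 0.01% (the remainder is a final partial batch), confirming the two runs cover the same data.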
If your machine is below a 3060 in compute, or you only have a CPU, an Intel GPU, a Mac, etc., you can of course still train minimind; that is not a problem. But the time cost may run to tens of hours or even several days, which defeats the purpose of picking up minimind in the first place and is probably hard to accept.

For those users, I instead recommend renting a GPU server on a cloud platform to run the training.

  • Taking a 4090 as an example: at 1.88 RMB per hour, 1 epoch of pretrain plus 1 epoch of full_sft takes about 6 hours in total, for a total cost of 11.28 RMB.
  • Taking a 2080Ti as an example: at 0.88 RMB per hour, the same takes about 12.4 hours, for a total cost of 10.91 RMB.

If you need more epochs for better results, the price of a cup of bubble tea still covers it, and containerized instances spare you the hassle of setting up the environment.
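The cost figures above are simply hours times the hourly price; as a quick arithmetic check:

```python
# Hourly rental price (RMB) and estimated hours for 1 epoch pretrain + 1 epoch full_sft
price_rmb_per_hour = {"4090": 1.88, "2080Ti": 0.88}
hours = {"4090": 6.0, "2080Ti": 12.4}

for gpu in price_rmb_per_hour:
    cost = price_rmb_per_hour[gpu] * hours[gpu]
    print(f"{gpu}: {hours[gpu]} h x {price_rmb_per_hour[gpu]} RMB/h = {cost:.2f} RMB")
```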

Recommended compute platforms (no sponsorship or commercial intent, purely a personal share):


If you have trained MiniMind on other GPU models, please share your training times as a reference for others; that would be very helpful, thank you.

Also, if you can recommend other, better compute platforms, feel free to add them in the comments. Thanks.

jingyaogong added the good first issue label Oct 5, 2024
jingyaogong (Owner, Author) commented Oct 5, 2024

Addendum:

My own setup is a single machine with dual 3090s: ≈3 hours for 1 epoch of pretrain plus 1 epoch of full_sft.

All of the discussion above is based on the 26M-parameter version of the model.

Caizkk commented Oct 7, 2024

[screenshot: training log, 2024-10-07]
Why is my single 4060 Ti 16G so much slower than your 2080Ti? Is my training time normal?

jingyaogong (Owner, Author) commented

> Why is my single 4060 Ti 16G so much slower than your 2080Ti? Is my training time normal?

[image: half-precision benchmark comparison]
In half precision it is indeed slower than the 2080Ti; if these numbers are accurate, the 4060 Ti delivers roughly 77% of the 2080Ti's performance.

So at the same batch size, the 4060 Ti's theoretical epoch time works out to roughly 550~580 min.

The actual run is another 100+ seconds slower than the theoretical value; the difference is probably down to CPU and memory.

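The 550~580 min estimate above is simple proportional scaling of the measured epoch time by relative half-precision throughput. A minimal sketch (the ~426 min 2080Ti epoch time and the 77% ratio are the figures quoted in this thread; the model deliberately ignores CPU and memory differences):

```python
# Simple proportional model: epoch time scales inversely with GPU throughput
t_2080ti_min = 426.0  # 2080Ti pretrain epoch time from the logs above (min)
ratio_4060ti = 0.77   # 4060 Ti at ~77% of 2080Ti half-precision performance

t_4060ti_min = t_2080ti_min / ratio_4060ti
print(f"estimated 4060 Ti epoch time: {t_4060ti_min:.0f} min")
```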

chuanzhubin (Contributor) commented

4× 2080Ti 22G

CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node 4 1-pretrain.py
python 3-full_sft.py 
LLM总参数量:26.878 百万
Epoch:[0/19](0/49361) loss:8.855 lr:0.00020000 epoch_Time:1566.0min:
Epoch:[0/19](100/49361) loss:5.565 lr:0.00020000 epoch_Time:326.0min:
Epoch:[0/19](200/49361) loss:4.950 lr:0.00020000 epoch_Time:320.0min:
Epoch:[0/19](300/49361) loss:4.622 lr:0.00020000 epoch_Time:319.0min:
Epoch:[0/19](400/49361) loss:4.381 lr:0.00020000 epoch_Time:317.0min:
Epoch:[0/19](500/49361) loss:4.183 lr:0.00020000 epoch_Time:315.0min:
Epoch:[0/19](600/49361) loss:4.053 lr:0.00020000 epoch_Time:315.0min:
Epoch:[0/19](700/49361) loss:3.734 lr:0.00020000 epoch_Time:314.0min:

dblab0 commented Nov 22, 2024

> Why is my single 4060 Ti 16G so much slower than your 2080Ti? Is my training time normal?

My 4060 Ti 16G at batch size 64 takes 678 min per epoch; I tried batch size 32 earlier and it was the same. It is probably just a sizable compute gap.

jiaohuix commented Jan 3, 2025

You can measure your GPU's compute with this script:

import torch
import torch.cuda.amp as amp

def benchmark_with_cuda_events(size, dtype=torch.float16, iterations=100):
    torch.cuda.init()
    torch.backends.cudnn.benchmark = True

    # Create CUDA events for on-device timing
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)

    a = torch.randn(size, size, dtype=dtype, device='cuda').contiguous()
    b = torch.randn(size, size, dtype=dtype, device='cuda').contiguous()

    # Warm-up, so one-time kernel selection cost is excluded from the timing
    with amp.autocast():
        for _ in range(10):
            c = torch.matmul(a, b)

    # Timed section
    start_event.record()
    with amp.autocast():
        for _ in range(iterations):
            c = torch.matmul(a, b)
    end_event.record()

    # Wait for all queued kernels to finish
    torch.cuda.synchronize()

    # elapsed_time reports milliseconds; convert to seconds
    elapsed = start_event.elapsed_time(end_event) / 1000.0

    # An NxN matmul costs ~2 * N^3 FLOPs
    flops = 2 * size * size * size * iterations
    tflops = flops / (elapsed * 1e12)

    return tflops

def main():
    print(f"Testing on: {torch.cuda.get_device_name()}\n")
    print("Running optimized benchmark with CUDA events...")

    sizes = [1024, 2048, 4096, 8192, 16384]
    for size in sizes:
        try:
            tflops = benchmark_with_cuda_events(size)
            print(f"Matrix size: {size}x{size}, TFLOPS: {tflops:.2f}")
        except RuntimeError as e:
            print(f"Size {size} failed: {e}")

if __name__ == "__main__":
    main()

A certain cloud platform's B1.gpu.large has compute roughly on par with a 3090:

[images: B1.gpu.large and 3090 compute specs]

I recently came across an article on estimating model compute and predicting training time (模型计算量估计,训练时间预测), so I estimated the FP16 compute of B1.gpu.large and the 4090 and tried to predict training time with the formula 6ND/S. The calculation and results follow.

minimind's pretraining data is roughly 10B tokens:
[image: token count of the pretraining data]

  1. B1.gpu.large GPU
    • FP16 compute: ~75 TFLOPS
    • Parameter count (N): 26M
    • Token count (D): 10B
    • Formula:

$$ \text{training time} = \frac{6 \times N \times D}{S} = \frac{6 \times 26 \times 10^6 \times 10 \times 10^9}{75 \times 10^{12}} \approx 20800 \text{ secs} \approx 346 \text{ min} $$

  • Actual pretrain time per epoch: about 300 minutes.

[image: B1.gpu.large training time]

  2. 4090 GPU
    • FP16 compute: ~165 TFLOPS
    • Formula:

$$ \text{training time} = \frac{6 \times N \times D}{S} = \frac{6 \times 26 \times 10^6 \times 10 \times 10^9}{165 \times 10^{12}} \approx 9455 \text{ secs} \approx 157 \text{ min} $$

  • Actual training time: about 200 minutes.

[image: 4090 training time]
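The two estimates above follow directly from the 6ND/S rule of thumb; a minimal sketch using the N, D, and TFLOPS figures quoted above:

```python
def estimated_epoch_minutes(n_params, n_tokens, tflops):
    """Training time estimate from the rule of thumb: time = 6*N*D / S."""
    seconds = 6 * n_params * n_tokens / (tflops * 1e12)
    return seconds / 60

N = 26e6  # parameter count
D = 10e9  # pretraining tokens

for name, tflops in [("B1.gpu.large", 75), ("4090", 165)]:
    print(f"{name}: ~{estimated_epoch_minutes(N, D, tflops):.0f} min per epoch")
```

In practice the formula assumes near-full utilization of peak FP16 throughput, which is why the measured times land above (B1.gpu.large is below only because its real sustained throughput was estimated, not its spec sheet peak).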

The actual results differ from the estimates, possibly because of differences in GPU utilization.
Feel free to share your own observations and experience, thanks.
