
Discussion: training time on personal GPUs #60

Open
jingyaogong opened this issue Oct 5, 2024 · 6 comments

Labels
good first issue Good for newcomers

Comments

jingyaogong (Owner) commented Oct 5, 2024

I rented two machines on a cloud platform, keeping memory, CPU, and the other variables identical, and measured the training time of different GPUs.

[images: specs and benchmark results of the two rented machines]

In my view, the [3060 ~ 2080Ti ~ 3090 ~ 4090] range covers the GPUs most AI practitioners have on hand, so it is quite representative.

For other desktop GPUs (for example, a 3060 is slightly weaker than a 2080Ti), you can extrapolate from the chart above.


  • Single 2080Ti (11 GB VRAM)

    pretrain, batchsize=48, estimated 7 hours per epoch

    root@autodl-container-908d479a1c-1697cfd8:~/autodl-tmp/minimind# python 1-pretrain.py 
    LLM总参数量:26.878 百万
    Epoch:[0/20](0/111769) loss:8.879 lr:0.0002000 epoch_Time:2618.0min:
    Epoch:[0/20](100/111769) loss:7.438 lr:0.0002000 epoch_Time:442.0min:
    Epoch:[0/20](200/111769) loss:6.899 lr:0.0002000 epoch_Time:431.0min:
    Epoch:[0/20](300/111769) loss:6.576 lr:0.0002000 epoch_Time:426.0min:
    

    full_sft, batchsize=48, estimated 5.4 hours per epoch

    root@autodl-container-908d479a1c-1697cfd8:~/autodl-tmp/minimind# python 3-full_sft.py 
    LLM总参数量:26.878 百万
    Epoch:[0/19](0/82267) loss:8.876 lr:0.0001000 epoch_Time:2011.0min:
    Epoch:[0/19](100/82267) loss:6.302 lr:0.0001000 epoch_Time:335.0min:
    Epoch:[0/19](200/82267) loss:5.667 lr:0.0001000 epoch_Time:327.0min:
    Epoch:[0/19](300/82267) loss:5.193 lr:0.0001000 epoch_Time:324.0min:
    
  • Single 4090 (24 GB VRAM)

    pretrain, batchsize=96, estimated 3.3 hours per epoch

    root@autodl-container-36164ea3cd-3ac722f7:~/autodl-tmp/minimind# python 1-pretrain.py 
    LLM总参数量:26.878 百万
    Epoch:[0/20](0/55885) loss:8.876 lr:0.0002000 epoch_Time:2049.0min:
    Epoch:[0/20](100/55885) loss:7.401 lr:0.0002000 epoch_Time:212.0min:
    Epoch:[0/20](200/55885) loss:6.958 lr:0.0002000 epoch_Time:201.0min:
    Epoch:[0/20](300/55885) loss:6.460 lr:0.0002000 epoch_Time:197.0min:
    

    full_sft, batchsize=96, estimated 2.5 hours per epoch

    root@autodl-container-36164ea3cd-3ac722f7:~/autodl-tmp/minimind# python 3-full_sft.py 
    LLM总参数量:26.878 百万
    Epoch:[0/19](0/41134) loss:5.676 lr:0.0001000 epoch_Time:1086.0min:
    Epoch:[0/19](100/41134) loss:4.872 lr:0.0001000 epoch_Time:156.0min:
    Epoch:[0/19](200/41134) loss:4.446 lr:0.0001000 epoch_Time:152.0min:
    

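As a sanity check on the two pretrain logs above: the per-epoch step counts should scale inversely with batch size, since both machines iterate over the same corpus. A minimal sketch, using only the step counts and batch sizes printed in the logs:

```python
# Steps per epoch and batch sizes, read from the two pretrain logs above
steps_2080ti, bs_2080ti = 111769, 48
steps_4090, bs_4090 = 55885, 96

# Each product estimates the dataset size in samples; they should agree
samples_2080ti = steps_2080ti * bs_2080ti
samples_4090 = steps_4090 * bs_4090

rel_diff = abs(samples_2080ti - samples_4090) / samples_4090
print(f"{samples_2080ti} vs {samples_4090} samples, relative difference {rel_diff:.4%}")
```

The two products agree to well under 0.01% (the remainder is a final partial batch), confirming the two runs cover the same data.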
If your machine is below a 3060 in compute, or you only have a CPU, an Intel GPU, a Mac, etc., you can of course still train minimind; that is not a problem. But the time cost may run to tens of hours or even several days, which defeats the purpose of picking up minimind in the first place and is probably hard to accept.

For those users, I instead recommend renting a GPU server on a cloud platform to run the training.

  • Taking a 4090 as an example: at 1.88 RMB per hour, 1 epoch of pretrain plus 1 epoch of full_sft takes about 6 hours in total, for a total cost of 11.28 RMB.
  • Taking a 2080Ti as an example: at 0.88 RMB per hour, the same takes about 12.4 hours, for a total cost of 10.91 RMB.

If you need more epochs for better results, the price of a cup of bubble tea still covers it, and containerized instances spare you the hassle of setting up the environment.
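The cost figures above are simply hours times the hourly price; as a quick arithmetic check:

```python
# Hourly rental price (RMB) and estimated hours for 1 epoch pretrain + 1 epoch full_sft
price_rmb_per_hour = {"4090": 1.88, "2080Ti": 0.88}
hours = {"4090": 6.0, "2080Ti": 12.4}

for gpu in price_rmb_per_hour:
    cost = price_rmb_per_hour[gpu] * hours[gpu]
    print(f"{gpu}: {hours[gpu]} h x {price_rmb_per_hour[gpu]} RMB/h = {cost:.2f} RMB")
```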

Recommended compute platforms (no sponsorship or commercial intent, purely a personal share):


If you have trained MiniMind on other GPU models, please share your training times as a reference for others; that would be very helpful, thank you.

Also, if you can recommend other, better compute platforms, feel free to add them in the comments. Thanks.

jingyaogong added the good first issue label Oct 5, 2024
jingyaogong (Owner, Author) commented Oct 5, 2024

Addendum:

My own setup is a single machine with dual 3090s: ≈3 hours for 1 epoch of pretrain plus 1 epoch of full_sft.

All of the discussion above is based on the 26M-parameter version of the model.

Caizkk commented Oct 7, 2024

[screenshot: training log, 2024-10-07]
Why is my single 4060 Ti 16G so much slower than your 2080Ti? Is my training time normal?

jingyaogong (Owner, Author) commented

> Why is my single 4060 Ti 16G so much slower than your 2080Ti? Is my training time normal?

[image: half-precision benchmark comparison]
In half precision it is indeed slower than the 2080Ti; if these numbers are accurate, the 4060 Ti delivers roughly 77% of the 2080Ti's performance.

So at the same batch size, the 4060 Ti's theoretical epoch time works out to roughly 550~580 min.

The actual run is another 100+ seconds slower than the theoretical value; the difference is probably down to CPU and memory.

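The 550~580 min estimate above is simple proportional scaling of the measured epoch time by relative half-precision throughput. A minimal sketch (the ~426 min 2080Ti epoch time and the 77% ratio are the figures quoted in this thread; the model deliberately ignores CPU and memory differences):

```python
# Simple proportional model: epoch time scales inversely with GPU throughput
t_2080ti_min = 426.0  # 2080Ti pretrain epoch time from the logs above (min)
ratio_4060ti = 0.77   # 4060 Ti at ~77% of 2080Ti half-precision performance

t_4060ti_min = t_2080ti_min / ratio_4060ti
print(f"estimated 4060 Ti epoch time: {t_4060ti_min:.0f} min")
```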

chuanzhubin (Contributor) commented

4× 2080Ti 22G

CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node 4 1-pretrain.py
python 3-full_sft.py 
LLM总参数量:26.878 百万
Epoch:[0/19](0/49361) loss:8.855 lr:0.00020000 epoch_Time:1566.0min:
Epoch:[0/19](100/49361) loss:5.565 lr:0.00020000 epoch_Time:326.0min:
Epoch:[0/19](200/49361) loss:4.950 lr:0.00020000 epoch_Time:320.0min:
Epoch:[0/19](300/49361) loss:4.622 lr:0.00020000 epoch_Time:319.0min:
Epoch:[0/19](400/49361) loss:4.381 lr:0.00020000 epoch_Time:317.0min:
Epoch:[0/19](500/49361) loss:4.183 lr:0.00020000 epoch_Time:315.0min:
Epoch:[0/19](600/49361) loss:4.053 lr:0.00020000 epoch_Time:315.0min:
Epoch:[0/19](700/49361) loss:3.734 lr:0.00020000 epoch_Time:314.0min:

dblab0 commented Nov 22, 2024

> Why is my single 4060 Ti 16G so much slower than your 2080Ti? Is my training time normal?

My 4060 Ti 16G at batch size 64 takes 678 min per epoch; I tried batch size 32 earlier and it was the same. It is probably just a sizable compute gap.

jiaohuix commented Jan 3, 2025

You can measure your GPU's compute with this script:

import torch
import torch.cuda.amp as amp

def benchmark_with_cuda_events(size, dtype=torch.float16, iterations=100):
    torch.cuda.init()
    torch.backends.cudnn.benchmark = True

    # Create CUDA events for on-device timing
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)

    a = torch.randn(size, size, dtype=dtype, device='cuda').contiguous()
    b = torch.randn(size, size, dtype=dtype, device='cuda').contiguous()

    # Warm-up, so one-time kernel selection cost is excluded from the timing
    with amp.autocast():
        for _ in range(10):
            c = torch.matmul(a, b)

    # Timed section
    start_event.record()
    with amp.autocast():
        for _ in range(iterations):
            c = torch.matmul(a, b)
    end_event.record()

    # Wait for all queued kernels to finish
    torch.cuda.synchronize()

    # elapsed_time reports milliseconds; convert to seconds
    elapsed = start_event.elapsed_time(end_event) / 1000.0

    # An NxN matmul costs ~2 * N^3 FLOPs
    flops = 2 * size * size * size * iterations
    tflops = flops / (elapsed * 1e12)

    return tflops

def main():
    print(f"Testing on: {torch.cuda.get_device_name()}\n")
    print("Running optimized benchmark with CUDA events...")

    sizes = [1024, 2048, 4096, 8192, 16384]
    for size in sizes:
        try:
            tflops = benchmark_with_cuda_events(size)
            print(f"Matrix size: {size}x{size}, TFLOPS: {tflops:.2f}")
        except RuntimeError as e:
            print(f"Size {size} failed: {e}")

if __name__ == "__main__":
    main()

A certain cloud platform's B1.gpu.large has compute roughly on par with a 3090:

[images: B1.gpu.large and 3090 compute specs]

I recently came across an article on estimating model compute and predicting training time (模型计算量估计,训练时间预测), so I estimated the FP16 compute of B1.gpu.large and the 4090 and tried to predict training time with the formula 6ND/S. The calculation and results follow.

minimind's pretraining data is roughly 10B tokens:
[image: token count of the pretraining data]

  1. B1.gpu.large GPU
    • FP16 compute: ~75 TFLOPS
    • Parameter count (N): 26M
    • Token count (D): 10B
    • Formula:

$$ \text{training time} = \frac{6 \times N \times D}{S} = \frac{6 \times 26 \times 10^6 \times 10 \times 10^9}{75 \times 10^{12}} \approx 20800 \text{ secs} \approx 346 \text{ min} $$

  • Actual pretrain time per epoch: about 300 minutes.

[image: B1.gpu.large training time]

  2. 4090 GPU
    • FP16 compute: ~165 TFLOPS
    • Formula:

$$ \text{training time} = \frac{6 \times N \times D}{S} = \frac{6 \times 26 \times 10^6 \times 10 \times 10^9}{165 \times 10^{12}} \approx 9455 \text{ secs} \approx 157 \text{ min} $$

  • Actual training time: about 200 minutes.

[image: 4090 training time]
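The two estimates above follow directly from the 6ND/S rule of thumb; a minimal sketch using the N, D, and TFLOPS figures quoted above:

```python
def estimated_epoch_minutes(n_params, n_tokens, tflops):
    """Training time estimate from the rule of thumb: time = 6*N*D / S."""
    seconds = 6 * n_params * n_tokens / (tflops * 1e12)
    return seconds / 60

N = 26e6  # parameter count
D = 10e9  # pretraining tokens

for name, tflops in [("B1.gpu.large", 75), ("4090", 165)]:
    print(f"{name}: ~{estimated_epoch_minutes(N, D, tflops):.0f} min per epoch")
```

In practice the formula assumes near-full utilization of peak FP16 throughput, which is why the measured times land above (B1.gpu.large is below only because its real sustained throughput was estimated, not its spec sheet peak).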

The actual results differ from the estimates, possibly because of differences in GPU utilization.
Feel free to share your own observations and experience, thanks.
