Straggler-Aware Weighted Synchronization for Distributed Deep Learning

Oral Session B-3: Biomedical Applications

Publication

Korean Institute of Next Generation Computing (한국차세대컴퓨팅학회) Conference
Volume/Issue (Year)

ICNGC 2025 The 11th International Conference on Next Generation Computing 2025 (2025.12)
Pages

pp.241-243
Authors

HyungJun Kim, Joon-Min Gil, Heonchang Yu
Language

English (ENG)
URL

https://www.earticle.net/Article/A478503

Full-text information

Abstract

English: Synchronous ring all-reduce is widely adopted for multi-GPU training due to its simplicity and scalability. However, its convergence-time advantage collapses in heterogeneous or unstable environments, where a single slow worker (straggler) throttles overall progress. We present SAWS (Straggler-Aware Weighted Synchronization), a lightweight technique that (i) detects transient stragglers via adaptive, profile-driven timeouts, (ii) isolates them from the synchronous fast path without job aborts, and (iii) merges their partial progress through weighted model averaging proportional to the processed-data ratio. In experiments on ResNet-18/CIFAR-10 with injected 5× slowdowns, SAWS improves wall-clock training time by up to 3.2× over vanilla Horovod while keeping final accuracy within 1% of fully synchronous baselines. Compared to a straggler-drop variant, SAWS achieves competitive time-to-accuracy and consistently higher final validation accuracy.
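The abstract names two mechanisms, an adaptive timeout for spotting stragglers and a merge weighted by processed-data ratio, without giving formulas. The following is a minimal sketch of how such mechanisms might look, not the authors' implementation. The function names, the "mean plus k standard deviations" timeout rule, the default k = 3.0, and the flat-list model representation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def adaptive_timeout(recent_step_times, k=3.0):
    """Assumed rule: timeout = mean + k * std of recently profiled step times."""
    if not recent_step_times:
        raise ValueError("need at least one profiled step time")
    return mean(recent_step_times) + k * pstdev(recent_step_times)


def is_straggler(step_time, recent_step_times, k=3.0):
    """Flag a worker whose current step exceeds the (hypothetical) timeout."""
    return step_time > adaptive_timeout(recent_step_times, k)


def weighted_merge(models, samples_processed):
    """Average parameter vectors, weighting each by its share of processed samples.

    models: list of equal-length lists of floats (one flat vector per group).
    samples_processed: number of training samples each group has consumed.
    """
    if len(models) != len(samples_processed):
        raise ValueError("one sample count is required per model")
    total = sum(samples_processed)
    if total <= 0:
        raise ValueError("at least one sample must have been processed")
    weights = [n / total for n in samples_processed]
    dim = len(models[0])
    return [sum(w * m[i] for w, m in zip(weights, models)) for i in range(dim)]


if __name__ == "__main__":
    history = [1.0, 1.1, 0.9, 1.0]          # seconds per step, made-up profile
    print(is_straggler(5.0, history))        # a 5x-slow step is flagged
    fast_group = [1.0, 2.0]                  # toy two-parameter "model"
    straggler = [0.0, 0.0]
    print(weighted_merge([fast_group, straggler], [800, 200]))
```

In this toy run the fast group has processed 800 of 1000 samples, so its parameters receive weight 0.8 and the straggler's receive 0.2. A real system would apply the same weighting to framework tensors rather than Python lists.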

Abstract
I. INTRODUCTION
II. BACKGROUND & RELATED WORK
A. Synchronous approaches
B. Asynchronous approaches
C. Hybrid and mitigation approaches
D. Frameworks and toolchains
III. SAWS: DESIGN OVERVIEW
A. Adaptive straggler detection
B. Worker group management
C. Weighted merging of models
D. Advantages
IV. EXPERIMENTAL SETUP
A. Environment
B. Workload
C. Baselines
D. Reported results
V. EVALUATION
A. Time-to-Train (one 5× straggler)
B. Final Accuracy (train/validation)
VI. DISCUSSION
A. Why not just stop?
B. Overheads
C. Scope
VII. CONCLUSION
ACKNOWLEDGMENT
REFERENCES

Authors

HyungJun Kim [ Department of Computer Science and Engineering Korea University Seoul, Republic of Korea ]
Joon-Min Gil [ Department of Computer Engineering Jeju National University Seoul, Republic of Korea ]
Heonchang Yu [ Department of Computer Science and Engineering Korea University Seoul, Republic of Korea ]

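Publication information

Publication

Korean Institute of Next Generation Computing (한국차세대컴퓨팅학회) Conference
Frequency
Semiannual
Coverage period
2021-2025
Decimal classification
KDC 566 DDC 004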