Torch MPS Basic Speedup Test
PyTorch MPS Performance Tips
Conducted quick experiments to find what actually speeds up PyTorch code on Apple Silicon (MPS).
Based on this video.
tl;dr
Keep tensors on GPU from the start, vectorize everything, avoid dynamic shapes. Code can be found here.
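The "keep tensors on GPU from the start" advice boils down to a device-selection pattern like this (a minimal sketch; the CPU fallback check is my addition, not from the original experiments):

```python
import torch

# Pick the MPS device when available, fall back to CPU otherwise.
device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")

# Passing device= creates the tensor in place, avoiding a CPU allocation + copy.
x = torch.randn(4, 4, device=device)
print(x.device)
```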
What’s tested
- Creating tensors: CPU to GPU vs creating directly on GPU
- Runtime constants: Dynamic tensor creation vs `register_buffer`
- Masking: Different masking approaches (indexing vs vectorized ops)
- DataLoader: `pin_memory` and `non_blocking` flags
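Experiments like these can be reproduced with a minimal timing harness; on MPS, `torch.mps.synchronize()` is needed so queued GPU work is actually counted. The `timed` helper, warmup count, and tensor size below are illustrative choices, not from the original experiments:

```python
import time
import torch

device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")

def timed(fn, iters=100):
    """Average wall-clock time of fn, synchronizing the MPS queue first."""
    for _ in range(3):  # warmup
        fn()
    if device.type == "mps":
        torch.mps.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if device.type == "mps":
        torch.mps.synchronize()
    return (time.perf_counter() - start) / iters

size = 1024
t_slow = timed(lambda: torch.randn(size, size).to(device))       # CPU alloc + copy
t_fast = timed(lambda: torch.randn(size, size, device=device))   # direct on device
print(f"to(device): {t_slow*1e3:.3f} ms, device=: {t_fast*1e3:.3f} ms")
```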
Quick findings
- Create tensors directly on device - Up to 80x speedup on creation and basic ops for larger tensors.

```python
# Slow
tensor = torch.randn(size, size).to(device)

# Fast
tensor = torch.randn(size, size, device=device)
```

- Use `register_buffer` for constants - Up to 8x speedup.

```python
self.register_buffer('constant', torch.tensor(..., device=device))
```

- Vectorize masking operations - Massive speedup (50-90x).
```python
# Slow - creates irregular tensor
result = attention_map[attention_map > .5].mean()

# Fast - stays vectorized
mask = (attention_map > .5).float()
result = (attention_map * mask).sum() / mask.sum().clamp(min=1e-9)
```

- Use `non_blocking=True` - Small improvement (~5%).

```python
data = data.to(device, non_blocking=True)
```

Note: `pin_memory` doesn't help on MPS (not supported yet).
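To see the `register_buffer` tip in context, here is a minimal sketch of a module that registers its constant once instead of re-creating it every forward pass (the `ScaledModel` name and the 0.5 constant are illustrative, not from the original benchmarks):

```python
import torch
import torch.nn as nn

device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")

class ScaledModel(nn.Module):  # hypothetical module for illustration
    def __init__(self):
        super().__init__()
        # Registered once; the buffer moves with the module on .to(device)
        # and is saved in state_dict, instead of being rebuilt per forward.
        self.register_buffer("scale", torch.tensor(0.5))

    def forward(self, x):
        return x * self.scale

model = ScaledModel().to(device)
out = model(torch.ones(2, 2, device=device))
print(out)
```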