Torch MPS Basic Speedup Test
PyTorch MPS Performance Tips
Conducted quick experiments to find what actually speeds up PyTorch code on Apple Silicon (MPS).
Based on this video.
tl;dr
Keep tensors on GPU from the start, vectorize everything, avoid dynamic shapes. Code can be found here.
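The "keep tensors on GPU from the start" advice boils down to a device-selection pattern like this (a minimal sketch; the CPU fallback check is my addition, not from the original experiments):

```python
import torch

# Pick the MPS device when available, fall back to CPU otherwise.
device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")

# Passing device= creates the tensor in place, avoiding a CPU allocation + copy.
x = torch.randn(4, 4, device=device)
print(x.device)
```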
What’s tested
- Creating tensors: CPU to GPU vs creating directly on GPU
- Runtime constants: Dynamic tensor creation vs `register_buffer`
- Masking: Different masking approaches (indexing vs vectorized ops)
- DataLoader: `pin_memory` and `non_blocking` flags
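Experiments like these can be reproduced with a minimal timing harness; on MPS, `torch.mps.synchronize()` is needed so queued GPU work is actually counted. The `timed` helper, warmup count, and tensor size below are illustrative choices, not from the original experiments:

```python
import time
import torch

device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")

def timed(fn, iters=100):
    """Average wall-clock time of fn, synchronizing the MPS queue first."""
    for _ in range(3):  # warmup
        fn()
    if device.type == "mps":
        torch.mps.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if device.type == "mps":
        torch.mps.synchronize()
    return (time.perf_counter() - start) / iters

size = 1024
t_slow = timed(lambda: torch.randn(size, size).to(device))       # CPU alloc + copy
t_fast = timed(lambda: torch.randn(size, size, device=device))   # direct on device
print(f"to(device): {t_slow*1e3:.3f} ms, device=: {t_fast*1e3:.3f} ms")
```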
Quick findings
- Create tensors directly on device - Up to 80x speedup on creation and basic ops for larger tensors.

```python
# Slow
tensor = torch.randn(size, size).to(device)

# Fast
tensor = torch.randn(size, size, device=device)
```

- Use `register_buffer` for constants - Up to 8x speedup.

```python
self.register_buffer('constant', torch.tensor(..., device=device))
```

- Vectorize masking operations - Massive speedup (50-90x).
```python
# Slow - creates irregular tensor
result = attention_map[attention_map > .5].mean()

# Fast - stays vectorized
mask = (attention_map > .5).float()
result = (attention_map * mask).sum() / mask.sum().clamp(min=1e-9)
```

- Use `non_blocking=True` - Small improvement (~5%).

```python
data = data.to(device, non_blocking=True)
```

Note: `pin_memory` doesn't help on MPS (not supported yet).
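To see the `register_buffer` tip in context, here is a minimal sketch of a module that registers its constant once instead of re-creating it every forward pass (the `ScaledModel` name and the 0.5 constant are illustrative, not from the original benchmarks):

```python
import torch
import torch.nn as nn

device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")

class ScaledModel(nn.Module):  # hypothetical module for illustration
    def __init__(self):
        super().__init__()
        # Registered once; the buffer moves with the module on .to(device)
        # and is saved in state_dict, instead of being rebuilt per forward.
        self.register_buffer("scale", torch.tensor(0.5))

    def forward(self, x):
        return x * self.scale

model = ScaledModel().to(device)
out = model(torch.ones(2, 2, device=device))
print(out)
```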