Time required: 10-15 minutes • Difficulty: Beginner
System Requirements
Minimum Requirements
- 2× NVIDIA GPUs with CUDA support (for RL training)
- 1× GPU minimum for inference only
- 32GB+ system RAM
- 100GB+ disk space
- Python 3.10 or newer
Recommended Setup
- 4× or 8× H100 GPUs (40GB+ VRAM each)
- 128GB+ system RAM
- 200GB+ NVMe storage
- Ubuntu 22.04 LTS
Prerequisites
1. CUDA Setup
Ensure NVIDIA drivers and CUDA are installed and compatible with PyTorch 2.6.0:
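A quick way to confirm the driver and toolkit are visible, and that your PyTorch build targets a compatible CUDA (these are standard NVIDIA/PyTorch checks, not Atlas-specific commands):

```shell
# Driver and GPU visibility
nvidia-smi

# Toolkit version (if the CUDA toolkit is installed)
nvcc --version

# CUDA version PyTorch was compiled against, and whether a GPU is usable
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
```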
2. Python Environment
Verify Python version (3.10 or newer required):
3. HuggingFace Authentication
Authenticate for model and dataset access:
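A typical login flow (the token value is a placeholder; generate a real one in your HuggingFace account settings):

```shell
# Interactive login (stores the token locally)
huggingface-cli login

# Non-interactive alternative for CI or remote machines
export HF_TOKEN=hf_your_token_here   # placeholder value

# Verify the credential works
huggingface-cli whoami
```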
Installation Methods
- Runtime SDK (Minimal)
- Automated Training Setup (Recommended)
- Manual Training Installation
- Conda Environment
Keep credentials such as OPENAI_API_KEY in a .env file and load them before orchestrating runs. Running atlas env init creates .atlas/discover.json, optional factory scaffolds, and metadata snapshots while automatically loading .env and extending PYTHONPATH. Re-run atlas env init --scaffold-config-full whenever you want a fresh runtime configuration derived from discovery output.
Environment Configuration
API Keys and Tracking
Configure authentication for the services you use. The training script automatically sets HF_HUB_ENABLE_HF_TRANSFER=1 to speed up model downloads.
Keep provider keys, DATABASE_URL, and other secrets in .env. The Atlas CLI family (atlas env, atlas run, atlas train, arc-atlas export) loads .env automatically and adds your project root plus src/ to PYTHONPATH, so custom adapters resolve without manual sys.path tweaks.
Disable Tracking
To disable Weights & Biases tracking, set WANDB_MODE=disabled (or WANDB_DISABLED=true) in your environment before launching a run.
Verification
After installation, verify your setup:
3-Minute Smoke Test
Run this once to confirm CUDA, vLLM, and model downloads are working before you invest in longer training jobs.
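A minimal sketch of such a check, assuming PyTorch and vLLM are already installed; it fails fast if CUDA or the vLLM install is broken:

```shell
python - <<'EOF'
import torch
assert torch.cuda.is_available(), "CUDA not visible to PyTorch"
print("GPUs:", torch.cuda.device_count())

import vllm  # import alone catches most broken installs
print("vLLM:", vllm.__version__)
EOF
```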
GPU Memory Management
For different GPU configurations:
Single GPU Setup
Single GPU is supported for inference only. For RL training, use model offloading:
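The offloading flags themselves live in the Atlas training config and are not reproduced here; pinning an inference process to a single device looks like this (device index 0 is an example):

```shell
# Expose only the first GPU to the process
export CUDA_VISIBLE_DEVICES=0
```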
Multi-GPU Setup
For distributed training across multiple GPUs:
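One common launch pattern is torchrun; the script name below is hypothetical, so substitute your actual training entry point:

```shell
# Launch 4 worker processes, one per GPU, on a single node
torchrun --nproc_per_node=4 train.py

# Restrict which GPUs participate in the job
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 train.py
```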
Memory Optimization
Reduce memory usage with these settings:
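One generally applicable knob is PyTorch's allocator configuration (a documented PyTorch option, not an Atlas-specific flag):

```shell
# Let the CUDA caching allocator grow segments instead of fragmenting
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```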
Security Best Practices
Follow these security guidelines to protect sensitive information:
- Never commit secrets: Keep tokens, .env files, and API keys out of version control
- Use environment variables: Store HF_TOKEN, WANDB_API_KEY, etc. as environment variables
- Gitignore protection: Ensure results/, logs/, and wandb/ remain in .gitignore
- Least privilege: Restrict dataset access permissions
- Logout on shared machines: Run huggingface-cli logout after use
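A sketch of a .env file consistent with the guidelines above (every value is a placeholder; only use the keys your setup actually needs):

```shell
# .env — keep this file out of version control
HF_TOKEN=hf_your_token_here
WANDB_API_KEY=your_wandb_key
OPENAI_API_KEY=sk-your_openai_key
DATABASE_URL=postgresql://user:pass@localhost:5432/atlas
```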
Platform-Specific Notes
- Linux
- macOS
- Windows WSL2
Tested on Ubuntu 20.04/22.04 LTS:
- Ensure CUDA toolkit matches PyTorch requirements
- May need sudo for system package installations
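For example, common build prerequisites on Ubuntu (this package list is a general assumption, not an Atlas-specific requirement):

```shell
sudo apt-get update
sudo apt-get install -y build-essential git python3-venv
```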
Troubleshooting
CUDA Version Mismatch
If you see CUDA errors:
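First compare the CUDA version the driver supports with the one your PyTorch wheel was built against; the cu124 index below is an example, so pick the wheel index that matches your driver:

```shell
# Highest CUDA version the installed driver supports
nvidia-smi

# CUDA version your PyTorch wheel was built against
python -c "import torch; print(torch.version.cuda)"

# Reinstall a matching wheel if they disagree (cu124 is an example index)
pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124
```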
Out of Memory Errors
Reduce memory usage:
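For the inference side, vLLM exposes engine options to cap VRAM use; whether your training config forwards them is install-specific, so treat this as a generic vLLM example:

```shell
# Cap how much VRAM vLLM reserves and shrink the context window
vllm serve <model> --gpu-memory-utilization 0.80 --max-model-len 4096
```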
HuggingFace Access Denied
Ensure proper authentication:
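Two quick checks (note that gated models also require accepting the license on the model page in a browser, which no CLI command can do for you):

```shell
# Confirm which account the CLI is using
huggingface-cli whoami

# Re-authenticate if the token is missing or expired
huggingface-cli login
```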
vLLM Installation Fails
Common vLLM issues:
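A clean-environment reinstall resolves many of them; the version pin is omitted deliberately, so match it to your PyTorch/CUDA combination:

```shell
# A fresh virtual environment avoids conflicting torch/vllm wheels
python -m venv .venv && source .venv/bin/activate
pip install -U pip
pip install vllm

# Confirm the import works
python -c "import vllm; print(vllm.__version__)"
```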