How to run DeepSeek-R1 on an AMD MI300X

A step-by-step guide to running DeepSeek-R1-Distill on an AMD MI300X using the Shark-AI toolkit.

The rise of open-source large language models (LLMs) like DeepSeek-R1 has democratized access to cutting-edge AI capabilities, but deploying these models efficiently on modern hardware remains a nuanced challenge. For organizations invested in AMD’s Instinct™ MI300X accelerators, the Shark-AI toolkit offers a promising pathway—blending ROCm’s raw compute power with LLM-specific optimizations. Below, we dissect the advantages and limitations of this setup, drawing from hands-on implementation experiences.

The Pros: Why This Setup Shines

While the guide focuses on single-node deployment, Shark-AI’s underlying IREE compiler supports distributed execution—a future-proof feature for scaling to multi-MI300X clusters.

Cost-Effective AMD Hardware Utilization

The MI300X’s 192GB HBM3 memory and high memory bandwidth (5.3TB/s) make it ideal for serving large models like DeepSeek-R1-Distill-32B without aggressive quantization. Shark-AI’s ROCm integration ensures these specs translate into tangible throughput gains.

Compared to NVIDIA’s H200, MI300X offers competitive performance at potentially lower TCO for AMD-centric data centers.

Open-Source Flexibility

Shark-AI’s pipeline—from GGUF conversion via llama.cpp to serving via shortfin—avoids vendor lock-in. Users retain full control over quantization, batch sizes, and compilation flags.

Compatibility with non-native architectures (e.g., running DeepSeek-R1 via Qwen templates) demonstrates adaptability in the absence of official Hugging Face support.

Quantization-Friendly Workflow

The GGUF-centric approach allows easy experimentation with FP16, INT8, or even 4-bit quantization, balancing accuracy and speed. For example, a 4-bit quantized 32B model requires just ~20GB VRAM, freeing resources for concurrent workloads.
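
As a rough illustration of the 4-bit path, llama.cpp's quantization tool can shrink the FP16 GGUF produced later in Step 3. Treat this as a sketch rather than a tested recipe: the binary name (llama-quantize in recent llama.cpp releases) and the Q4_K_M preset are assumptions tied to your llama.cpp version, and whether a given quant format flows through the sharktank exporter end-to-end is worth verifying for your shark-ai release.

Bash
# From inside the llama.cpp checkout created in Step 3
cmake -B build && cmake --build build --target llama-quantize -j

# Turn the FP16 export into a ~20GB 4-bit (Q4_K_M) GGUF
./build/bin/llama-quantize \
  $EXPORT_DIR/deepseek-r1-distill-qwen-32b.f16.gguf \
  $EXPORT_DIR/deepseek-r1-distill-qwen-32b.q4_k_m.gguf \
  Q4_K_M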

Performance Tuning Levers

Shark-AI exposes granular controls: batch sizes (EXPORT_BATCH_SIZES), HIP compilation targets (gfx942), and concurrency isolation strategies. These levers let users optimize for latency-sensitive or throughput-first scenarios.
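
As a concrete (illustrative, not benchmarked) example of these levers, the batch-size setting alone shifts the trade-off between latency and throughput; the values below are assumptions to adapt to your workload.

Bash
# Latency-sensitive serving: export only batch size 1
export EXPORT_BATCH_SIZES=1

# Throughput-first serving: export several batch sizes so concurrent requests can be packed
export EXPORT_BATCH_SIZES=1,4,8

# Either value feeds the --bs flag in Step 5; the gfx942 HIP target in Step 6 stays the same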

Running DeepSeek-R1 on AMD MI300X: A Cost-Effective Alternative to OpenAI’s O-Series Models

Large Language Models (LLMs) like OpenAI’s O1 series have set benchmarks in reasoning and coding tasks, but their API costs can quickly escalate for high-volume use cases. Enter DeepSeek-R1, an open-source reasoning-optimized LLM that matches OpenAI’s performance at a fraction of the cost. In this guide, we’ll walk through setting up DeepSeek-R1 on a single AMD MI300X GPU node, complete with performance benchmarks and cost comparisons.


Why DeepSeek-R1 on AMD MI300X?

The AMD MI300X accelerator, with 192GB of HBM3 memory and ROCm support, is well suited to running large models like the DeepSeek-R1 distills (up to 70B parameters). That memory capacity lets these models fit on a single GPU node, offering performance competitive with NVIDIA's H100 at lower cost and making the MI300X a cost-efficient choice for enterprise reasoning inference.
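
A quick back-of-envelope check (FP16 weights only, ignoring KV cache and activations) shows why a single MI300X is enough:

Bash
# 2 bytes per parameter at FP16
python3 -c "print(f'32B: {32e9*2/1e9:.0f} GB  70B: {70e9*2/1e9:.0f} GB')"
# Both fit inside the MI300X's 192GB of HBM3, with more headroom left for KV cache on the 32B distill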


DeepSeek-R1 Deployment Guide

Step 1. Set Up the Environment

Bash
# Create and activate virtual environment
python -m venv --prompt shark-ai .venv
source .venv/bin/activate

# Install Shark-AI with ROCm support
pip install "shark-ai[apps]" --extra-index-url https://download.pytorch.org/whl/rocm6.0

# Install additional dependencies
pip install torch==2.3.1+rocm6.0 --extra-index-url https://download.pytorch.org/whl/rocm6.0
pip install transformers huggingface-hub

# Configure working directory
export EXPORT_DIR=$PWD/deepseek-export
mkdir -p $EXPORT_DIR
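
Before moving on, it's worth a quick sanity check that the accelerator is visible and the packages resolved. This assumes rocm-smi ships with your ROCm install and that the shark-ai meta-package pulls in both sharktank and shortfin.

Bash
# Confirm the MI300X shows up under ROCm
rocm-smi --showproductname

# Confirm the export (sharktank) and serving (shortfin) packages import cleanly
python -c "import sharktank, shortfin; print('shark-ai environment OK')"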

Step 2. Obtain Model Weights

Bash
# Download DeepSeek-R1-Distill-Qwen-32B (HF format)
huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
  --local-dir $EXPORT_DIR/DeepSeek-R1-Distill-Qwen-32B \
  --local-dir-use-symlinks False
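
A quick listing confirms the weights and tokenizer landed where the later steps expect them:

Bash
# The safetensors shards, config.json, and tokenizer.json should all be present
ls -lh $EXPORT_DIR/DeepSeek-R1-Distill-Qwen-32B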

Step 3. Convert to GGUF Format

Bash
# Clone conversion tools
git clone --depth 1 https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Extra Python dependencies the GGUF converter may need (torch and transformers are already installed)
pip install gguf sentencepiece protobuf

# Convert HF weights to GGUF (the converter is named convert_hf_to_gguf.py in current llama.cpp)
python3 convert_hf_to_gguf.py \
  $EXPORT_DIR/DeepSeek-R1-Distill-Qwen-32B \
  --outtype f16 \
  --outfile $EXPORT_DIR/deepseek-r1-distill-qwen-32b.f16.gguf

cd ..
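
The FP16 GGUF for a 32B model should come out at roughly 60-65GB; a quick size check catches a truncated or failed conversion early:

Bash
# Verify the converted GGUF exists and has a plausible size
ls -lh $EXPORT_DIR/deepseek-r1-distill-qwen-32b.f16.gguf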

Step 4. Configure Environment Variables

Bash
export MODEL_PARAMS_PATH=$EXPORT_DIR/deepseek-r1-distill-qwen-32b.f16.gguf
export TOKENIZER_PATH=$EXPORT_DIR/DeepSeek-R1-Distill-Qwen-32B/tokenizer.json
export MLIR_PATH=$EXPORT_DIR/model.mlir
export OUTPUT_CONFIG_PATH=$EXPORT_DIR/config.json
export VMFB_PATH=$EXPORT_DIR/model.vmfb
export EXPORT_BATCH_SIZES=1,4
export ROCR_VISIBLE_DEVICES=1  # GPU index of the MI300X to use (adjust for your system)
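
Both the device index used for ROCR_VISIBLE_DEVICES and the gfx942 target used in Step 6 can be double-checked with rocminfo (installed with ROCm):

Bash
# MI300X reports the gfx942 architecture; the agent list also reveals the device ordering
rocminfo | grep -i gfx942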

Step 5. Export Model to MLIR

Bash
python -m sharktank.examples.export_paged_llm_v1 \
  --gguf-file=$MODEL_PARAMS_PATH \
  --output-mlir=$MLIR_PATH \
  --output-config=$OUTPUT_CONFIG_PATH \
  --bs=$EXPORT_BATCH_SIZES \
  --model-family=qwen  # Use Qwen architecture template
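
The export writes a large textual MLIR file plus a small JSON config that the shortfin server consumes in Step 7; a quick look confirms both were produced:

Bash
# The MLIR file is large; the config is a small JSON read by the server at launch
ls -lh $MLIR_PATH
cat $OUTPUT_CONFIG_PATH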

Step 6. Compile for MI300X

Bash
iree-compile $MLIR_PATH \
  --iree-hal-target-backends=rocm \
  --iree-hip-target=gfx942 \
  --iree-rocm-bc-dir=/opt/rocm/amdgcn/bitcode \
  -o $VMFB_PATH
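
Compiling a 32B model can take a while. When it finishes, the VMFB artifact should exist; note that the GGUF weights are not baked in and are supplied separately to the server in Step 7.

Bash
# The compiled module; weights stay external and are passed via --parameters at serve time
ls -lh $VMFB_PATH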

Step 7. Launch Shortfin Server

Bash
python -m shortfin_apps.llm.server \
  --tokenizer_json=$TOKENIZER_PATH \
  --model_config=$OUTPUT_CONFIG_PATH \
  --vmfb=$VMFB_PATH \
  --parameters=$MODEL_PARAMS_PATH \
  --device=hip \
  --host=0.0.0.0 \
  --port=8000 > server.log 2>&1 &
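
Loading 32B of weights takes a little while, so it helps to watch the log until the server reports it is listening before running the health check in Step 8:

Bash
# Follow startup progress; Ctrl+C once the server reports it is listening on port 8000
tail -f server.log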

Step 8. Test Inference

Bash
# Health check
curl -i http://localhost:8000/health

# Sample request
curl http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Explain quantum computing in simple terms",
    "sampling_params": {"max_completion_tokens": 150}
  }'
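
Since the model was exported with batch sizes 1 and 4, a rough way to exercise batching is to fire several requests concurrently (illustrative only, not a benchmark):

Bash
# Four concurrent requests to exercise the batch-size-4 path
for i in 1 2 3 4; do
  curl -s http://localhost:8000/generate \
    -H "Content-Type: application/json" \
    -d '{"text": "List three practical uses of reinforcement learning", "sampling_params": {"max_completion_tokens": 100}}' &
done
wait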

Start developing with open-source reasoning AI, powered by the Awesome Engine.

Awesome Cloud provides enterprise custom inference, distillation, and fine-tuning, powered by next-generation AI accelerators.

DeepSeek-R1 on an AMD MI300X delivers fully private, state-of-the-art AI at a fraction of the cost, with performance on par with proprietary API models. By leveraging open-source models and AMD's cost-efficient hardware, enterprises can reduce reliance on expensive APIs while maintaining high reasoning accuracy.


Contact Awesome if you'd like help running reasoning inference infrastructure, optimizing multi-GPU setups, or scaling up large inference workloads! 🚀