How to run DeepSeek-R1 on an AMD MI300X
A step-by-step guide to running DeepSeek-R1-Distill on an AMD MI300X using the Shark-AI toolkit.

The rise of open-source large language models (LLMs) like DeepSeek-R1 has democratized access to cutting-edge AI capabilities, but deploying these models efficiently on modern hardware remains a nuanced challenge. For organizations invested in AMD’s Instinct™ MI300X accelerators, the Shark-AI toolkit offers a promising pathway—blending ROCm’s raw compute power with LLM-specific optimizations. Below, we dissect the advantages and limitations of this setup, drawing from hands-on implementation experiences.
The Pros: Why This Setup Shines
Scalability Headroom
While the guide focuses on single-node deployment, Shark-AI’s underlying IREE compiler supports distributed execution, a future-proof feature for scaling to multi-MI300X clusters.
Cost-Effective AMD Hardware Utilization
The MI300X’s 192GB of HBM3 memory and 5.3TB/s of memory bandwidth make it well suited to serving large models like DeepSeek-R1-Distill-Qwen-32B in FP16 without aggressive quantization. Shark-AI’s ROCm integration helps translate these specs into tangible throughput gains.
Compared to NVIDIA’s H200, MI300X offers competitive performance at potentially lower TCO for AMD-centric data centers.
Open-Source Flexibility
Shark-AI’s pipeline, from GGUF conversion via llama.cpp to serving via shortfin, avoids vendor lock-in. Users retain full control over quantization, batch sizes, and compilation flags.
Compatibility with non-native architectures (e.g., exporting DeepSeek-R1-Distill through sharktank’s Qwen model family, as in Step 5) demonstrates the toolkit’s adaptability even before official support for a given architecture lands.
Quantization-Friendly Workflow
The GGUF-centric approach allows easy experimentation with FP16, INT8, or even 4-bit quantization, balancing accuracy and speed. For example, a 4-bit quantized 32B model requires just ~20GB VRAM, freeing resources for concurrent workloads.
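For example, here is a rough sketch of producing a 4-bit variant with llama.cpp's quantization tool; the binary name and build paths are assumptions that vary by checkout (recent versions call it llama-quantize, older ones quantize), and the FP16 GGUF referenced here is the one produced in Step 3 below.
# Build llama.cpp's quantizer (target/binary names vary by version)
cmake -S llama.cpp -B llama.cpp/build
cmake --build llama.cpp/build --target llama-quantize -j
# Squeeze the FP16 GGUF down to 4-bit Q4_K_M (~20GB instead of ~65GB)
./llama.cpp/build/bin/llama-quantize \
  $EXPORT_DIR/deepseek-r1-distill-qwen-32b.f16.gguf \
  $EXPORT_DIR/deepseek-r1-distill-qwen-32b.q4_k_m.gguf \
  Q4_K_M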
Performance Tuning Levers
Shark-AI exposes granular controls: batch sizes (EXPORT_BATCH_SIZES), HIP compilation targets (gfx942), and concurrency isolation strategies. These levers let users optimize for latency-sensitive or throughput-first scenarios.
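As a concrete illustration, a hedged starting point (the values are illustrative, not benchmarked): export a single batch size when per-request latency matters most, or several batch sizes when aggregate throughput does, while keeping the HIP target pinned to the MI300X’s gfx942 ISA as in Step 6.
# Latency-sensitive serving: export/compile only batch size 1
export EXPORT_BATCH_SIZES=1
# Throughput-first serving: export several batch sizes so requests can be batched
# export EXPORT_BATCH_SIZES=1,4,8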
Running DeepSeek-R1 on AMD MI300X: A Cost-Effective Alternative to OpenAI’s O-Series Models
Large Language Models (LLMs) like OpenAI’s O1 series have set benchmarks in reasoning and coding tasks, but their API costs can quickly escalate for high-volume use cases. Enter DeepSeek-R1, an open-source reasoning-optimized LLM that matches OpenAI’s performance at a fraction of the cost. In this guide, we’ll walk through setting up DeepSeek-R1 on a single AMD MI300X GPU node, complete with performance benchmarks and cost comparisons.
Why DeepSeek-R1 on AMD MI300X?
The AMD MI300X accelerator, with 192GB of HBM3 memory and ROCm support, is well suited to running large models like the DeepSeek-R1 distills (up to 70B parameters). Where an 80GB NVIDIA H100 would need multiple GPUs to hold a model of that size in FP16, a single MI300X node can host it, offering competitive performance at lower cost and making it a cost-efficient choice for enterprise reasoning inference.
DeepSeek-R1 Deployment Guide
Step 1: Set Up the Environment
# Create and activate virtual environment
python -m venv --prompt shark-ai .venv
source .venv/bin/activate
# Install Shark-AI with ROCm support
pip install "shark-ai[apps]" --extra-index-url https://download.pytorch.org/whl/rocm6.0
# Install additional dependencies
pip install torch==2.3.1+rocm6.0 --extra-index-url https://download.pytorch.org/whl/rocm6.0
pip install transformers huggingface-hub
# Configure working directory
export EXPORT_DIR=$PWD/deepseek-export
mkdir -p $EXPORT_DIR
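Before moving on, it is worth a quick sanity check that ROCm sees the accelerator and that the Shark-AI packages landed in the virtual environment; rocminfo ships with ROCm, and the package names below may differ slightly across shark-ai releases.
# An MI300X should report the gfx942 ISA
rocminfo | grep -o 'gfx[0-9a-f]*' | sort -u
# Confirm the Shark-AI components installed into the venv
pip show shark-ai sharktank shortfin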
Step 2: Obtain Model Weights
# Download DeepSeek-R1-Distill-Qwen-32B (HF format)
huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
--local-dir $EXPORT_DIR/DeepSeek-R1-Distill-Qwen-32B \
--local-dir-use-symlinks False
Step 3: Convert to GGUF Format
# Clone conversion tools
git clone --depth 1 https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Convert HF weights to GGUF (older llama.cpp checkouts name this script convert-hf-to-gguf.py)
python3 convert_hf_to_gguf.py \
$EXPORT_DIR/DeepSeek-R1-Distill-Qwen-32B \
--outtype f16 \
--outfile $EXPORT_DIR/deepseek-r1-distill-qwen-32b.f16.gguf
cd ..
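The conversion should leave a single FP16 GGUF in the export directory; at two bytes per parameter, a 32B model works out to roughly 65GB on disk, so make sure the volume has room.
# Verify the converted model exists and has a plausible size (~65GB at FP16)
ls -lh $EXPORT_DIR/deepseek-r1-distill-qwen-32b.f16.gguf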
Step 4: Configure Environment Variables
export MODEL_PARAMS_PATH=$EXPORT_DIR/deepseek-r1-distill-qwen-32b.f16.gguf
export TOKENIZER_PATH=$EXPORT_DIR/DeepSeek-R1-Distill-Qwen-32B/tokenizer.json
export MLIR_PATH=$EXPORT_DIR/model.mlir
export OUTPUT_CONFIG_PATH=$EXPORT_DIR/config.json
export VMFB_PATH=$EXPORT_DIR/model.vmfb
export EXPORT_BATCH_SIZES=1,4
export ROCR_VISIBLE_DEVICES=1 # Select the GPU index to use; adjust for your node
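ROCR_VISIBLE_DEVICES takes a GPU index, so on a multi-GPU node it helps to list the devices first and adjust the value accordingly; rocm-smi ships with ROCm.
# Show GPU indices, VRAM usage, and utilization; pick the index of an idle MI300X
rocm-smi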
Step 5: Export Model to MLIR
python -m sharktank.examples.export_paged_llm_v1 \
--gguf-file=$MODEL_PARAMS_PATH \
--output-mlir=$MLIR_PATH \
--output-config=$OUTPUT_CONFIG_PATH \
--bs=$EXPORT_BATCH_SIZES \
--model-family=qwen # Use Qwen architecture template
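The export writes both the MLIR module and a JSON config. A quick look at the config is a useful sanity check: the batch sizes exported above should appear in it, though field names can vary between sharktank versions.
# Confirm the export artifacts and inspect the generated config
ls -lh $MLIR_PATH
cat $OUTPUT_CONFIG_PATH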
Step 6: Compile for MI300X
iree-compile $MLIR_PATH \
--iree-hal-target-backends=rocm \
--iree-hip-target=gfx942 \
--iree-rocm-bc-dir=/opt/rocm/amdgcn/bitcode \
-o $VMFB_PATH
Step 7: Launch Shortfin Server
python -m shortfin_apps.llm.server \
--tokenizer_json=$TOKENIZER_PATH \
--model_config=$OUTPUT_CONFIG_PATH \
--vmfb=$VMFB_PATH \
--parameters=$MODEL_PARAMS_PATH \
--device=hip \
--host=0.0.0.0 \
--port=8000 > server.log 2>&1 &
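Loading a 32B model takes a while, so before sending requests it helps to watch the log and poll the health endpoint (used again in Step 8) until the server responds; a simple sketch:
# Check startup progress
tail -n 20 server.log
# Poll the health endpoint until the server is up
until curl -sf http://localhost:8000/health > /dev/null; do
  echo "waiting for shortfin server..."
  sleep 5
done
echo "server is ready"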
Step 8: Test Inference
# Health check
curl -i http://localhost:8000/health
# Sample request
curl http://localhost:8000/generate \
-H "Content-Type: application/json" \
-d '{
"text": "Explain quantum computing in simple terms",
"sampling_params": {"max_completion_tokens": 150}
}'
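Since the model was exported with batch sizes 1 and 4, a quick way to exercise batched decoding is to fire several requests concurrently; the sketch below reuses the request shape above with arbitrary prompts and token limits.
# Send four concurrent requests to exercise the bs=4 path
for i in 1 2 3 4; do
  curl -s http://localhost:8000/generate \
    -H "Content-Type: application/json" \
    -d "{\"text\": \"Explain idea $i for using reasoning models in production\", \"sampling_params\": {\"max_completion_tokens\": 64}}" &
done
wait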
Start building with open-source reasoning AI, powered by the Awesome Engine.
Awesome Cloud provides enterprise custom inference, distillation, and fine-tuning, powered by next-generation AI accelerators.
DeepSeek-R1 on AMD MI300X delivers fully private, state-of-the-art AI at a fraction of the cost, with performance on par with proprietary API models. By leveraging open-source models and AMD’s cost-efficient hardware, enterprises can reduce reliance on expensive APIs while maintaining high reasoning accuracy.
Contact Awesome if you’d like help running reasoning inference infrastructure, optimizing multi-GPU setups, or scaling inference workloads! 🚀