
⚖️ LAGE-tool

Local AI Inference GPU Economics

Full Comparison Table · NVIDIA RTX · AMD ROCm · Intel ARC | Consumer & Data Center | LLM Inference Performance

Consumer: Marktplaats.nl • Cloud: Modal.com / Lambda Labs • DGX Systems • March 2026



GPU Model | VRAM | Bandwidth | FP32 TFLOPS | FP8 TFLOPS | INT4 TFLOPS | FP4 TFLOPS | Purchase/Rental | Hours for €500 | Llama 8B tok/s | FP4 tok/s | INT8 HW | INT4 HW
🟢 NVIDIA RTX 30 Series (Ampere) — INT8 via Tensor Cores, no FP8/INT4
GPU Model | VRAM | Mem BW | FP32 TF | FP8 TF | INT4 TF | FP4 TF | Purchase Price | Hrs/€500 | Llama 8B tok/s | FP4 tok/s | INT8 HW | INT4 HW
RTX 3050 8GB 224 GB/s 9.1 €200–270 ~25–35 21 🔄
RTX 3060 Best Mid-Budget 12GB 360 GB/s 12.7 €200–250 ~35–45 34 🔄
RTX 3060 Ti 8GB 448 GB/s 16.2 €300–380 ~45–55 42 🔄
RTX 3070 8GB 448 GB/s 20.3 €300–350 ~48–58 42 🔄
RTX 3070 Ti 8GB 608 GB/s 21.8 €380–460 ~50–62 57 🔄
RTX 3080 10GB 760 GB/s 29.8 €350–400 ~60–75 71 🔄
RTX 3080 12GB 12GB 912 GB/s 30.6 €550–650 ~65–80 85 🔄
RTX 3080 Ti 12GB 912 GB/s 34.1 €550–600 ~68–83 85 🔄
RTX 3090 Best Value 24GB 936 GB/s 35.6 €650–700 ~70–85 87 🔄
RTX 3090 Ti 24GB 1008 GB/s 40.0 €1000–1050 ~75–90 94 🔄
🟢 NVIDIA RTX 40 Series (Ada Lovelace) — FP8 + INT4 via 4th-gen Tensor Cores
GPU Model | VRAM | Mem BW | FP32 TF | FP8 TF | INT4 TF | FP4 TF | Purchase Price | Hrs/€500 | Llama 8B tok/s | FP4 tok/s | INT8 HW | INT4 HW
RTX 4060 8GB 8GB 272 GB/s 15.1 121 242 €300–380 ~40–50 25 🔄
RTX 4060 16GB Best INT4 Budget 16GB 272 GB/s 15.1 121 242 €400–480 ~40–50 25 🔄
RTX 4060 Ti 8GB 8GB 288 GB/s 22.1 177 354 €380–460 ~45–55 27 🔄
RTX 4060 Ti 16GB 16GB 288 GB/s 22.1 177 354 €480–560 ~45–55 27 🔄
RTX 4070 12GB 504 GB/s 29.1 233 466 €470–520 ~65–80 47 🔄
RTX 4070 Super 12GB 504 GB/s 35.5 284 568 €600–650 ~70–85 47 🔄
RTX 4070 Ti 12GB 504 GB/s 40.1 321 642 €700–800 ~75–90 47 🔄
RTX 4070 Ti Super 16GB 672 GB/s 44.1 353 706 €800–900 ~85–100 63 🔄
RTX 4080 16GB 716 GB/s 48.7 390 780 €850–900 ~95–115 67 🔄
RTX 4080 Super 16GB 736 GB/s 52.2 418 836 €830–880 ~100–120 69 🔄
RTX 4090 Best Performance 24GB 1008 GB/s 82.6 661 1322 €2200–2250 ~120–145 94 🔄
🟢 NVIDIA RTX 50 Series (Blackwell, Jan 2025) — FP8 + INT4 + FP4 via 5th-gen Tensor Cores NEW!
GPU Model | VRAM | Mem BW | FP32 TF | FP8 TF | INT4 TF | FP4 TF | Purchase Price | Hrs/€500 | Llama 8B tok/s | FP4 tok/s | INT8 HW | INT4 HW
RTX 5060 8GB 448 GB/s 19.2 154 307 307 €320–400* ~55–70 67 ⚡
RTX 5060 Ti 8GB 8GB 448 GB/s 28.0 224 448 448 €400–480* ~60–75 67 ⚡
RTX 5060 Ti 16GB 16GB 448 GB/s 28.0 224 448 448 €450–550* ~60–75 82 ⚡
RTX 5070 12GB 672 GB/s 45.0 360 720 720 €600–750* ~85–105 100 ⚡
RTX 5070 Ti 16GB 896 GB/s 58.5 468 936 936 €850–1000* ~100–125 134 ⚡
RTX 5080 16GB 960 GB/s 65.0 520 1040 1040 €1250–1300* ~115–140 144 ⚡
RTX 5090 Cutting Edge 32GB 1792 GB/s 125.0 1000 2000 2000 €2500–2600* ~160–200 268 ⚡
☁️ NVIDIA Data Center GPUs (Modal Cloud Pricing) Cloud Rental
GPU Model | VRAM | Mem BW | FP32 TF | FP8 TF | INT4 TF | FP4 TF | Rental / Purchase | Hrs/€500 | Llama 8B tok/s | FP4 tok/s | INT8 HW | INT4 HW
Nvidia T4 16GB 320 GB/s 8.1 $0.59/hr • €450–550 used 898 hrs ~40–55
Nvidia L4 24GB 300 GB/s 30.3 $0.80/hr • €1,500–2,000 used 663 hrs ~60–80
Nvidia A10 24GB 600 GB/s 31.2 $1.10/hr • €1,700–2,600 used 481 hrs ~70–90
Nvidia L40S 48GB 864 GB/s 91.6 733 1466 $1.95/hr • €7,500–10,200 used 272 hrs ~140–180
Nvidia A100 40GB 40GB 1555 GB/s 19.5 312 624 $2.10/hr • €7,000–9,000 used 253 hrs ~180–220
Nvidia A100 80GB 80GB 2039 GB/s 19.5 312 624 $2.50/hr • €14,000–20,500 used 212 hrs ~200–250
Nvidia H100 80GB 3350 GB/s 67 1979 3958 $3.95/hr • €27,500+ used 134 hrs ~350–450
Nvidia H200 141GB 4800 GB/s 67 1979 3958 $4.54/hr • €37,500+ used 117 hrs ~400–500
Nvidia B200 Latest! 192GB 8000 GB/s 4500 9000 18000 18000 $6.25/hr • rare used 85 hrs ~600–800
🖥️ NVIDIA DGX Spark — Compact AI system (GB10 Grace Blackwell, 128GB unified) ~$4.7k / €2.8k–4.2k
System | VRAM | Mem BW | FP32 TF | FP8 TF | INT4 TF | FP4 TF | Purchase Price | Hrs/€500 | Llama 8B tok/s | FP4 tok/s | INT8 HW | INT4 HW
DGX Spark 128GB • FP4 128GB 273 GB/s ~50 ~500 ~1000 ~1000 ~$4,700 / €2,800–4,200 ~200–300 41 ⚡
🔴 AMD ROCm — RDNA 2 (RX 6000 Series) — Community ROCm support, shader INT8 only ROCm 5.x
GPU Model | VRAM | Mem BW | FP32 TF | FP8 TF | INT4 TF† | FP4 TF | Purchase Price | Hrs/€500 | Llama 8B tok/s | FP4 tok/s | INT8 HW | INT4 HW
RX 6800 XT 16GB 512 GB/s 20.7 €280–350 ~55–70 48 🔄
RX 6900 XT 16GB 512 GB/s 23.1 €350–430 ~60–75 48 🔄
🔴 AMD ROCm — RDNA 3 (RX 7000 Series) — Official ROCm support, WMMA INT8+INT4 via AI Accelerators ROCm 5.7+
GPU Model | VRAM | Mem BW | FP32 TF | FP8 TF | INT4 TF (WMMA) | FP4 TF | Purchase Price | Hrs/€500 | Llama 8B tok/s | FP4 tok/s | INT8 HW | INT4 HW
RX 7700 XT 12GB 432 GB/s 35.7 ~143 €280–340 ~65–80 40 🔄
RX 7800 XT 16GB 576 GB/s 37.0 ~148 €350–430 ~75–95 54 🔄
RX 7900 GRE 16GB 576 GB/s 45.9 ~184 €400–500 ~75–95 54 🔄
RX 7900 XT 20GB 800 GB/s 53.4 ~214 €550–650 ~90–110 75 🔄
RX 7900 XTX Best AMD Value 24GB 960 GB/s 61.4 ~246 €650–800 ~100–130 90 🔄
🔴 AMD ROCm — RDNA 4 (RX 9000 Series, Mar 2025) — Full FP8 + INT4 AI Accelerators NEW!
GPU Model | VRAM | Mem BW | FP32 TF | FP8 TF (AI Acc) | INT4 TF (AI Acc) | FP4 TF | Purchase Price | Hrs/€500 | Llama 8B tok/s | FP4 tok/s | INT8 HW | INT4 HW
RX 9070 16GB 512 GB/s ~40.0 ~320 ~320 €450–550 ~80–100 48 🔄
RX 9070 XT Best AMD New 16GB 640 GB/s ~54.0 ~432 ~432 €550–650 ~100–130 60 🔄
🔴 AMD Instinct — Data Center GPUs (ROCm Cloud, Lambda Labs pricing) Cloud
GPU Model | VRAM | Mem BW | FP32 TF (CU) | FP8 TF (Matrix) | INT4 TF (Matrix) | FP4 TF | Rental Price/hr | Hrs/€500 | Llama 8B tok/s | FP4 tok/s | INT8 HW | INT4 HW
AMD MI300X Flagship 192GB 5300 GB/s 163 2614 5228 ~$4.00/hr ~133 hrs ~600–800
🔵 Intel ARC Alchemist (A-Series) — XMX INT8 + INT4, SYCL/llama.cpp backend oneAPI
GPU Model | VRAM | Mem BW | FP32 TF | FP8 TF | INT4 TF (XMX) | FP4 TF | Purchase Price | Hrs/€500 | Llama 8B tok/s | FP4 tok/s | INT8 HW | INT4 HW
Arc A380 6GB 188 GB/s 7.0 ~112 €80–130 ~15–25 18 🔄
Arc A580 8GB 512 GB/s 12.4 ~198 €130–170 ~25–35 48 🔄
Arc A750 8GB 512 GB/s 17.6 ~282 €180–240 ~30–40 48 🔄
Arc A770 8GB 8GB 560 GB/s 19.7 ~315 €230–290 ~30–45 52 🔄
Arc A770 16GB Best Intel Value 16GB 560 GB/s 19.7 ~315 €290–360 ~30–45 52 🔄
🔵 Intel ARC Battlemage (B-Series) — Xe2 XMX with improved matrix throughput 2024–2025
GPU Model | VRAM | Mem BW | FP32 TF | FP8 TF | INT4 TF (XMX Xe2) | FP4 TF | Purchase Price | Hrs/€500 | Llama 8B tok/s | FP4 tok/s | INT8 HW | INT4 HW
Arc B50 Entry 8GB 224 GB/s ~7.0 ~224 €100–140* ~18–28 21 🔄
Arc B60 Mid 8GB 320 GB/s ~11.2 ~358 €150–190* ~25–38 30 🔄
Arc B580 Best Intel Budget 12GB 456 GB/s 14.0 ~448 €230–280 ~35–50 43 🔄
Arc B770 New 2025 16GB 616 GB/s ~24.0 ~768 €350–450* ~50–70 58 🔄

📖 Reference Guides

💡 Quick Picks

  • 💛 Best Under €500 (NVIDIA):
  • → RTX 3060 12GB (€200–250) — 12GB VRAM + INT8, 12.7 FP32 TFLOPS
  • → RTX 4060 16GB (€400–480) — 16GB + INT4, 15.1 FP32 / 121 FP8 / 242 INT4 TFLOPS
  • 💛 Best Under €500 (Intel ARC):
  • → Arc B580 (€230–280) — 12GB + Xe2 XMX INT4, ~448 INT4 TOPS via llama.cpp SYCL
  • → Arc A770 16GB (€290–360) — 16GB + XMX INT4, great for llama.cpp SYCL
  • 💛 Best Under €500 (AMD ROCm):
  • → RX 7800 XT (€350–430) — 16GB + fast 576 GB/s, excellent ROCm support
  • → RX 7900 GRE (€400–500) — 16GB, 576 GB/s, full WMMA INT4
  • 💚 Best Value Overall:
  • → RTX 3090 24GB (€650–700) — Huge VRAM, 35.6 FP32 TFLOPS
  • → RTX 4090 (€2200–2250) — 83 FP32 / 661 FP8 / 1322 INT4 TFLOPS
  • → RX 7900 XTX (€650–800) — 24GB, 960 GB/s, fully ROCm-supported
  • → RTX 5090 (€2500–2600) — 32GB VRAM, 125 FP32 / 2000 FP4 TFLOPS 🚀

⚡ Understanding FLOPS & Precision

  • TFLOPS: Tera Floating Point Operations Per Second (trillions of calculations/sec)
  • FP32 (all GPUs): 32-bit precision, standard shader benchmark — useful as a comparison baseline
  • FP8 (RTX 40/50, RDNA 4, MI300X): 8-bit float via matrix cores, ~8× faster than FP32 for AI
  • INT4 (RTX 40/50, RDNA 3+, Intel XMX): 4-bit integer via accelerators, ~16× faster than FP32 🚀
  • FP4 (Blackwell only: RTX 50, B200, DGX Spark): 4-bit float via 5th-gen Tensor Cores — same speed as INT4 but more accurate 🚀
  • ⚠️ BANDWIDTH NEVER CHANGES: GB/s is a fixed hardware spec regardless of precision!
  • Example RTX 4090: 83 FP32 / 661 FP8 / 1,322 INT4 TFLOPS — same 1,008 GB/s bandwidth always (these ~8×/~16× ratios are sketched in code after this list)
  • Example RX 7900 XTX: 61.4 FP32 / ~246 INT4 TFLOPS (WMMA) — bandwidth always 960 GB/s
  • Example Arc A770 16GB: 19.7 FP32 / ~315 INT4 TFLOPS (XMX) — bandwidth always 560 GB/s
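
As a sanity check on those multipliers, here is a minimal Python sketch. The ~8×/~16× factors are this guide's rules of thumb, not per-SKU datasheet values; the example reproduces the RTX 4090 row from the table:

```python
# Rule-of-thumb multipliers used in this guide (assumptions, not datasheet specs):
# FP8 matrix throughput ~= 8x the FP32 shader figure, INT4/FP4 ~= 16x.
PRECISION_MULTIPLIER = {"fp32": 1, "fp8": 8, "int4": 16, "fp4": 16}

def estimated_tflops(fp32_tflops: float, precision: str) -> float:
    """Estimate tensor/matrix-core TFLOPS at a given precision from the FP32 figure."""
    return fp32_tflops * PRECISION_MULTIPLIER[precision]

# RTX 4090 from the table: 82.6 FP32 -> ~661 FP8 -> ~1322 INT4
print(estimated_tflops(82.6, "fp8"))   # 660.8
print(estimated_tflops(82.6, "int4"))  # 1321.6
```

The factor is architecture-dependent: RDNA 3's WMMA INT4 path is only ~4× FP32 (61.4 → ~246 on the RX 7900 XTX), so treat the multipliers as per-family assumptions rather than universal constants.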

🔴 CRITICAL: Bandwidth vs FLOPS for LLM Inference

  • ⚠️ BANDWIDTH (GB/s) = FIXED HARDWARE SPEC — determines LLM token speed!
  • → RTX 5090: 1,792 GB/s | RTX 4090: 1,008 GB/s | RX 7900 XTX: 960 GB/s | Arc A770: 560 GB/s
  • → Bandwidth = how fast model weights are streamed from VRAM to compute cores
  • ✅ FLOPS = VARIES BY PRECISION — determines batch throughput and training speed
  • → LLM inference (batch=1) is bandwidth-bound: higher FLOPS won't linearly increase tok/s (see the roofline sketch after this list)
  • → High FLOPS matters for: training, large batches, or when compute is the bottleneck
  • Why AMD RX 7900 XTX can match some RTX 40 cards:
  • → Similar memory bandwidth (960 vs 1008 GB/s) means similar token/s on bandwidth-bound LLMs
  • Why Intel ARC is slower despite high XMX INT4 TFLOPS:
  • → SYCL backend less mature than CUDA; kernel optimisation gap vs llama.cpp CUDA/HIP
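
To make the bandwidth-bound point concrete, here is a minimal roofline sketch in Python. The ~4.5 GB weight size for a 4-bit Llama 8B is an assumed round figure, and the ceiling ignores KV-cache traffic and kernel efficiency, which is why the measured ranges in the table sit well below it:

```python
def peak_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Bandwidth roofline for batch=1 decoding: each generated token has to stream
    roughly the full weight set from VRAM, so tok/s <= bandwidth / model size."""
    return bandwidth_gb_s / model_size_gb

# Assumed ~4.5 GB of weights for a 4-bit-quantised Llama 8B:
for name, bw_gb_s in [("RTX 5090", 1792), ("RTX 4090", 1008),
                      ("RX 7900 XTX", 960), ("Arc A770", 560)]:
    print(f"{name}: <= {peak_tokens_per_second(bw_gb_s, 4.5):.0f} tok/s ceiling")
```

The resulting ceilings (~398 / 224 / 213 / 124 tok/s) follow the same ordering as the measured ranges, and they show why the RX 7900 XTX keeps pace with RTX 40 cards of similar bandwidth despite very different TFLOPS.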

📊 VRAM Requirements & Performance Notes

  • VRAM for Llama 8B: ~16GB (FP16), ~8GB (FP8/INT8), ~4GB (FP4/INT4)
  • VRAM for Llama 70B: ~140GB (FP16), ~70GB (FP8), ~35GB (INT4) — needs multiple GPUs! (estimator sketched after this list)
  • RTX 5090 with FP4: can exceed 400 tokens/sec on Llama 8B
  • RX 7900 XTX ROCm INT4: ~100–130 tok/s via llama.cpp HIP, on par with RTX 4080
  • Intel ARC SYCL: Good INT4 TFLOPS via XMX, but llama.cpp SYCL backend is less optimised — 30–50% slower than bandwidth would suggest
  • AMD ROCm maturity: RDNA 3/4 excellent with ROCm 5.7+. RDNA 2 community-supported. Works well with llama.cpp HIP backend
  • INT4 vs FP4: FP4 (RTX 50) is slightly more accurate but same speed. RDNA 3 WMMA INT4 ≈ 4× FP32 throughput
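
A minimal sketch of the VRAM arithmetic behind those numbers, assuming the weights dominate (parameter count × bytes per parameter) with an optional overhead factor for KV cache and activations (the overhead margin is an assumption, not a measurement):

```python
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5, "fp4": 0.5}

def weight_vram_gb(params_billions: float, precision: str, overhead: float = 1.0) -> float:
    """Rough VRAM for the weights alone; pass e.g. overhead=1.15 to budget for
    KV cache and activations on top."""
    return params_billions * BYTES_PER_PARAM[precision] * overhead

print(weight_vram_gb(8, "fp16"))   # ~16 GB -> matches the Llama 8B FP16 note above
print(weight_vram_gb(70, "int4"))  # ~35 GB -> Llama 70B INT4 still needs 2x 24GB cards or better
```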

🧩 Software Ecosystem Comparison

  • NVIDIA CUDA (best): RTX 30/40/50 — llama.cpp, Ollama, vLLM, ExLlamaV2, all frameworks ✅✅✅
  • AMD ROCm / HIP (good): RX 7000/9000, MI300X — llama.cpp, Ollama, vLLM (limited), PyTorch ROCm ✅✅
  • → RX 6000 series: community ROCm support only, not all versions work ⚠️
  • Intel SYCL / oneAPI (developing): Arc A/B-series — llama.cpp SYCL, Intel Extension for PyTorch ✅
  • → Setup requires Intel oneAPI toolkit; fewer pre-built binaries than CUDA/ROCm ⚠️
  • INT4 TFLOPS disclaimer for AMD/Intel: Values marked ~ are estimates based on architecture ratios; actual throughput depends on software optimisation

☁️ Cloud vs Own Hardware

  • Consumer GPUs: One-time purchase from Marktplaats.nl (used market prices)
  • NVIDIA Data Center: Hourly rental from Modal.com
  • NVIDIA DGX Spark: Compact AI system (~$4.7k / €2.8k–4.2k) — 128GB unified, 1 petaFLOP FP4, runs models up to 200B params
  • AMD MI300X: Hourly rental from Lambda Labs (~$4/hr)
  • Break-even Example: RTX 4090 @ €2,200 vs H100 @ $3.95/hr ≈ 557 hours of H100 use (computed in the sketch after this list)
  • Choose Cloud If: Testing, burst workloads, need H100/B200/MI300X performance, no hardware maintenance
  • Choose Purchase If: Daily heavy use (>2–3 hrs/day), long-term projects, privacy needs, air-gapped
  • €500 Budget hours: Green = great value (600+ hrs), Yellow = moderate (200–600 hrs), Red = expensive (<200 hrs)
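
Both money columns reduce to one-line formulas. A minimal sketch, assuming the ~1.06 USD/EUR rate implied by the table's Hrs/€500 values (the break-even example above keeps € and $ at parity):

```python
USD_PER_EUR = 1.06  # assumed exchange rate implied by the Hrs/€500 column

def cloud_hours_for_budget(budget_eur: float, rate_usd_per_hr: float) -> float:
    """Rental hours a fixed euro budget buys (the Hrs/€500 column)."""
    return budget_eur * USD_PER_EUR / rate_usd_per_hr

def break_even_hours(purchase_price: float, rate_per_hr: float) -> float:
    """Rental hours that cost as much as buying the card outright (same currency)."""
    return purchase_price / rate_per_hr

print(round(cloud_hours_for_budget(500, 0.59)))  # ~898 hrs on a T4, matching the table
print(round(break_even_hours(2200, 3.95)))       # ~557 hrs of H100 vs the RTX 4090 example
```

At 2–3 hours of use per day, 557 hours works out to roughly 6–9 months, which is where the "daily heavy use" purchase guideline above comes from.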