
⚖️ LAGE-tool

Local AI Inference GPU Economics

Full Comparison Table · NVIDIA RTX · AMD ROCm · Intel ARC | Consumer & Data Center | LLM Inference Performance

Consumer: Marktplaats.nl • Cloud: Modal.com / Lambda Labs • DGX Systems • March 2026



GPU Model | VRAM | Bandwidth | FP32 TFLOPS | FP8 TFLOPS | INT4 TFLOPS | FP4 TFLOPS | Purchase/Rental | Hours for €500 | Llama 8B tok/s | FP4 tok/s | INT8 HW | INT4 HW
🟢 NVIDIA RTX 30 Series (Ampere) — INT8 via Tensor Cores, no FP8/INT4
GPU Model | VRAM | Mem BW | FP32 TF | FP8 TF | INT4 TF | FP4 TF | Purchase Price | Hrs/€500 | Llama 8B tok/s | FP4 tok/s | INT8 HW | INT4 HW
RTX 3050 8GB 224 GB/s 9.1 €200–270 ~25–35 21 🔄
RTX 3060 Best Mid-Budget 12GB 360 GB/s 12.7 €200–250 ~35–45 34 🔄
RTX 3060 Ti 8GB 448 GB/s 16.2 €300–380 ~45–55 42 🔄
RTX 3070 8GB 448 GB/s 20.3 €300–350 ~48–58 42 🔄
RTX 3070 Ti 8GB 608 GB/s 21.8 €380–460 ~50–62 57 🔄
RTX 3080 10GB 760 GB/s 29.8 €350–400 ~60–75 71 🔄
RTX 3080 12GB 12GB 912 GB/s 30.6 €550–650 ~65–80 85 🔄
RTX 3080 Ti 12GB 912 GB/s 34.1 €550–600 ~68–83 85 🔄
RTX 3090 Best Value 24GB 936 GB/s 35.6 €650–700 ~70–85 87 🔄
RTX 3090 Ti 24GB 1008 GB/s 40.0 €1000–1050 ~75–90 94 🔄
🟢 NVIDIA RTX 40 Series (Ada Lovelace) — FP8 + INT4 via 4th-gen Tensor Cores
GPU Model | VRAM | Mem BW | FP32 TF | FP8 TF | INT4 TF | FP4 TF | Purchase Price | Hrs/€500 | Llama 8B tok/s | FP4 tok/s | INT8 HW | INT4 HW
RTX 4060 8GB 8GB 272 GB/s 15.1 121 242 €300–380 ~40–50 25 🔄
RTX 4060 16GB Best INT4 Budget 16GB 272 GB/s 15.1 121 242 €400–480 ~40–50 25 🔄
RTX 4060 Ti 8GB 8GB 288 GB/s 22.1 177 354 €380–460 ~45–55 27 🔄
RTX 4060 Ti 16GB 16GB 288 GB/s 22.1 177 354 €480–560 ~45–55 27 🔄
RTX 4070 12GB 504 GB/s 29.1 233 466 €470–520 ~65–80 47 🔄
RTX 4070 Super 12GB 504 GB/s 35.5 284 568 €600–650 ~70–85 47 🔄
RTX 4070 Ti 12GB 504 GB/s 40.1 321 642 €700–800 ~75–90 47 🔄
RTX 4070 Ti Super 16GB 672 GB/s 44.1 353 706 €800–900 ~85–100 63 🔄
RTX 4080 16GB 716 GB/s 48.7 390 780 €850–900 ~95–115 67 🔄
RTX 4080 Super 16GB 736 GB/s 52.2 418 836 €830–880 ~100–120 69 🔄
RTX 4090 Best Performance 24GB 1008 GB/s 82.6 661 1322 €2200–2250 ~120–145 94 🔄
🟢 NVIDIA RTX 50 Series (Blackwell, Jan 2025) — FP8 + INT4 + FP4 via 5th-gen Tensor Cores NEW!
GPU Model | VRAM | Mem BW | FP32 TF | FP8 TF | INT4 TF | FP4 TF | Purchase Price | Hrs/€500 | Llama 8B tok/s | FP4 tok/s | INT8 HW | INT4 HW
RTX 5060 8GB 448 GB/s 19.2 154 307 307 €320–400* ~55–70 67 ⚡
RTX 5060 Ti 8GB 8GB 448 GB/s 28.0 224 448 448 €400–480* ~60–75 67 ⚡
RTX 5060 Ti 16GB 16GB 448 GB/s 28.0 224 448 448 €450–550* ~60–75 82 ⚡
RTX 5070 12GB 672 GB/s 45.0 360 720 720 €600–750* ~85–105 100 ⚡
RTX 5070 Ti 16GB 896 GB/s 58.5 468 936 936 €850–1000* ~100–125 134 ⚡
RTX 5080 16GB 960 GB/s 65.0 520 1040 1040 €1250–1300* ~115–140 144 ⚡
RTX 5090 Cutting Edge 32GB 1792 GB/s 125.0 1000 2000 2000 €2500–2600* ~160–200 268 ⚡
☁️ NVIDIA Data Center GPUs (Modal Cloud Pricing) Cloud Rental
GPU Model | VRAM | Mem BW | FP32 TF | FP8 TF | INT4 TF | FP4 TF | Rental / Purchase | Hrs/€500 | Llama 8B tok/s | FP4 tok/s | INT8 HW | INT4 HW
Nvidia T4 16GB 320 GB/s 8.1 $0.59/hr • €450–550 used 898 hrs ~40–55
Nvidia L4 24GB 300 GB/s 30.3 $0.80/hr • €1,500–2,000 used 663 hrs ~60–80
Nvidia A10 24GB 600 GB/s 31.2 $1.10/hr • €1,700–2,600 used 481 hrs ~70–90
Nvidia L40S 48GB 864 GB/s 91.6 733 1466 $1.95/hr • €7,500–10,200 used 272 hrs ~140–180
Nvidia A100 40GB 40GB 1555 GB/s 19.5 312 624 $2.10/hr • €7,000–9,000 used 253 hrs ~180–220
Nvidia A100 80GB 80GB 2039 GB/s 19.5 312 624 $2.50/hr • €14,000–20,500 used 212 hrs ~200–250
Nvidia H100 80GB 3350 GB/s 67 1979 3958 $3.95/hr • €27,500+ used 134 hrs ~350–450
Nvidia H200 141GB 4800 GB/s 67 1979 3958 $4.54/hr • €37,500+ used 117 hrs ~400–500
Nvidia B200 Latest! 192GB 8000 GB/s 4500 9000 18000 18000 $6.25/hr • rare used 85 hrs ~600–800
🖥️ NVIDIA DGX Spark — Compact AI system (GB10 Grace Blackwell, 128GB unified) ~$4.7k / €2.8k–4.2k
System | VRAM | Mem BW | FP32 TF | FP8 TF | INT4 TF | FP4 TF | Purchase Price | Hrs/€500 | Llama 8B tok/s | FP4 tok/s | INT8 HW | INT4 HW
DGX Spark 128GB • FP4 128GB 273 GB/s ~50 ~500 ~1000 ~1000 ~$4,700 / €2,800–4,200 ~200–300 41 ⚡
🔴 AMD ROCm — RDNA 2 (RX 6000 Series) — Community ROCm support, shader INT8 only ROCm 5.x
GPU Model | VRAM | Mem BW | FP32 TF | FP8 TF | INT4 TF† | FP4 TF | Purchase Price | Hrs/€500 | Llama 8B tok/s | FP4 tok/s | INT8 HW | INT4 HW
RX 6800 XT 16GB 512 GB/s 20.7 €280–350 ~55–70 48 🔄
RX 6900 XT 16GB 512 GB/s 23.1 €350–430 ~60–75 48 🔄
🔴 AMD ROCm — RDNA 3 (RX 7000 Series) — Official ROCm support, WMMA INT8+INT4 via AI Accelerators ROCm 5.7+
GPU Model | VRAM | Mem BW | FP32 TF | FP8 TF | INT4 TF (WMMA) | FP4 TF | Purchase Price | Hrs/€500 | Llama 8B tok/s | FP4 tok/s | INT8 HW | INT4 HW
RX 7700 XT 12GB 432 GB/s 35.7 ~143 €280–340 ~65–80 40 🔄
RX 7800 XT 16GB 576 GB/s 37.0 ~148 €350–430 ~75–95 54 🔄
RX 7900 GRE 16GB 576 GB/s 45.9 ~184 €400–500 ~75–95 54 🔄
RX 7900 XT 20GB 800 GB/s 53.4 ~214 €550–650 ~90–110 75 🔄
RX 7900 XTX Best AMD Value 24GB 960 GB/s 61.4 ~246 €650–800 ~100–130 90 🔄
🔴 AMD ROCm — RDNA 4 (RX 9000 Series, Mar 2025) — Full FP8 + INT4 AI Accelerators NEW!
GPU Model | VRAM | Mem BW | FP32 TF | FP8 TF (AI Acc) | INT4 TF (AI Acc) | FP4 TF | Purchase Price | Hrs/€500 | Llama 8B tok/s | FP4 tok/s | INT8 HW | INT4 HW
RX 9070 16GB 512 GB/s ~40.0 ~320 ~320 €450–550 ~80–100 48 🔄
RX 9070 XT Best AMD New 16GB 640 GB/s ~54.0 ~432 ~432 €550–650 ~100–130 60 🔄
🔴 AMD Instinct — Data Center GPUs (ROCm Cloud, Lambda Labs pricing) Cloud
GPU Model | VRAM | Mem BW | FP32 TF (CU) | FP8 TF (Matrix) | INT4 TF (Matrix) | FP4 TF | Rental Price/hr | Hrs/€500 | Llama 8B tok/s | FP4 tok/s | INT8 HW | INT4 HW
AMD MI300X Flagship 192GB 5300 GB/s 163 2614 5228 ~$4.00/hr ~133 hrs ~600–800
🔵 Intel ARC Alchemist (A-Series) — XMX INT8 + INT4, SYCL/llama.cpp backend oneAPI
GPU Model | VRAM | Mem BW | FP32 TF | FP8 TF | INT4 TF (XMX) | FP4 TF | Purchase Price | Hrs/€500 | Llama 8B tok/s | FP4 tok/s | INT8 HW | INT4 HW
Arc A380 6GB 188 GB/s 7.0 ~112 €80–130 ~15–25 18 🔄
Arc A580 8GB 512 GB/s 12.4 ~198 €130–170 ~25–35 48 🔄
Arc A750 8GB 512 GB/s 17.6 ~282 €180–240 ~30–40 48 🔄
Arc A770 8GB 8GB 560 GB/s 19.7 ~315 €230–290 ~30–45 52 🔄
Arc A770 16GB Best Intel Value 16GB 560 GB/s 19.7 ~315 €290–360 ~30–45 52 🔄
🔵 Intel ARC Battlemage (B-Series) — Xe2 XMX with improved matrix throughput 2024–2025
GPU Model | VRAM | Mem BW | FP32 TF | FP8 TF | INT4 TF (XMX Xe2) | FP4 TF | Purchase Price | Hrs/€500 | Llama 8B tok/s | FP4 tok/s | INT8 HW | INT4 HW
Arc B50 Entry 8GB 224 GB/s ~7.0 ~224 €100–140* ~18–28 21 🔄
Arc B60 Mid 8GB 320 GB/s ~11.2 ~358 €150–190* ~25–38 30 🔄
Arc B580 Best Intel Budget 12GB 456 GB/s 14.0 ~448 €230–280 ~35–50 43 🔄
Arc B770 New 2025 16GB 616 GB/s ~24.0 ~768 €350–450* ~50–70 58 🔄

📖 Reference Guides

💡 Quick Picks

  • 💛 Best Under €500 (NVIDIA):
  • → RTX 3060 12GB (€200–250) — 12GB VRAM + INT8, 12.7 FP32 TFLOPS
  • → RTX 4060 16GB (€400–480) — 16GB + INT4, 15.1 FP32 / 121 FP8 / 242 INT4 TFLOPS
  • 💛 Best Under €500 (Intel ARC):
  • → Arc B580 (€230–280) — 12GB + Xe2 XMX INT4, ~448 INT4 TOPS via llama.cpp SYCL
  • → Arc A770 16GB (€290–360) — 16GB + XMX INT4, great for llama.cpp SYCL
  • 💛 Best Under €500 (AMD ROCm):
  • → RX 7800 XT (€350–430) — 16GB + fast 576 GB/s, excellent ROCm support
  • → RX 7900 GRE (€400–500) — 16GB, 576 GB/s, full WMMA INT4
  • 💚 Best Value Overall:
  • → RTX 3090 24GB (€650–700) — Huge VRAM, 35.6 FP32 TFLOPS
  • → RTX 4090 (€2200–2250) — 83 FP32 / 661 FP8 / 1322 INT4 TFLOPS
  • → RX 7900 XTX (€650–800) — 24GB, 960 GB/s, fully ROCm-supported
  • → RTX 5090 (€2500–2600) — 32GB VRAM, 125 FP32 / 2000 FP4 TFLOPS 🚀

⚡ Understanding FLOPS & Precision

  • TFLOPS: Tera Floating Point Operations Per Second (trillions of calculations/sec)
  • FP32 (all GPUs): 32-bit precision, standard shader benchmark — useful as a comparison baseline
  • FP8 (RTX 40/50, RDNA 4, MI300X): 8-bit float via matrix cores, ~8× faster than FP32 for AI
  • INT4 (RTX 40/50, RDNA 3+, Intel XMX): 4-bit integer via accelerators, ~16× faster than FP32 🚀
  • FP4 (Blackwell only: RTX 50, B200, DGX Spark): 4-bit float via 5th-gen Tensor Cores — same speed as INT4 but more accurate 🚀
  • ⚠️ BANDWIDTH NEVER CHANGES: GB/s is a fixed hardware spec regardless of precision!
  • Example RTX 4090: 83 FP32 / 661 FP8 / 1,322 INT4 TFLOPS — same 1,008 GB/s bandwidth always (these ~8×/~16× ratios are sketched in code after this list)
  • Example RX 7900 XTX: 61.4 FP32 / ~246 INT4 TFLOPS (WMMA) — bandwidth always 960 GB/s
  • Example Arc A770 16GB: 19.7 FP32 / ~315 INT4 TFLOPS (XMX) — bandwidth always 560 GB/s
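
As a sanity check on those multipliers, here is a minimal Python sketch. The ~8×/~16× factors are this guide's rules of thumb, not per-SKU datasheet values; the example reproduces the RTX 4090 row from the table:

```python
# Rule-of-thumb multipliers used in this guide (assumptions, not datasheet specs):
# FP8 matrix throughput ~= 8x the FP32 shader figure, INT4/FP4 ~= 16x.
PRECISION_MULTIPLIER = {"fp32": 1, "fp8": 8, "int4": 16, "fp4": 16}

def estimated_tflops(fp32_tflops: float, precision: str) -> float:
    """Estimate tensor/matrix-core TFLOPS at a given precision from the FP32 figure."""
    return fp32_tflops * PRECISION_MULTIPLIER[precision]

# RTX 4090 from the table: 82.6 FP32 -> ~661 FP8 -> ~1322 INT4
print(estimated_tflops(82.6, "fp8"))   # 660.8
print(estimated_tflops(82.6, "int4"))  # 1321.6
```

The factor is architecture-dependent: RDNA 3's WMMA INT4 path is only ~4× FP32 (61.4 → ~246 on the RX 7900 XTX), so treat the multipliers as per-family assumptions rather than universal constants.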

🔴 CRITICAL: Bandwidth vs FLOPS for LLM Inference

  • ⚠️ BANDWIDTH (GB/s) = FIXED HARDWARE SPEC — determines LLM token speed!
  • → RTX 5090: 1,792 GB/s | RTX 4090: 1,008 GB/s | RX 7900 XTX: 960 GB/s | Arc A770: 560 GB/s
  • → Bandwidth = how fast model weights are streamed from VRAM to compute cores
  • ✅ FLOPS = VARIES BY PRECISION — determines batch throughput and training speed
  • → LLM inference (batch=1) is bandwidth-bound: higher FLOPS won't linearly increase tok/s (see the roofline sketch after this list)
  • → High FLOPS matters for: training, large batches, or when compute is the bottleneck
  • Why AMD RX 7900 XTX can match some RTX 40 cards:
  • → Similar memory bandwidth (960 vs 1008 GB/s) means similar token/s on bandwidth-bound LLMs
  • Why Intel ARC is slower despite high XMX INT4 TFLOPS:
  • → SYCL backend less mature than CUDA; kernel optimisation gap vs llama.cpp CUDA/HIP
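
To make the bandwidth-bound point concrete, here is a minimal roofline sketch in Python. The ~4.5 GB weight size for a 4-bit Llama 8B is an assumed round figure, and the ceiling ignores KV-cache traffic and kernel efficiency, which is why the measured ranges in the table sit well below it:

```python
def peak_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Bandwidth roofline for batch=1 decoding: each generated token has to stream
    roughly the full weight set from VRAM, so tok/s <= bandwidth / model size."""
    return bandwidth_gb_s / model_size_gb

# Assumed ~4.5 GB of weights for a 4-bit-quantised Llama 8B:
for name, bw_gb_s in [("RTX 5090", 1792), ("RTX 4090", 1008),
                      ("RX 7900 XTX", 960), ("Arc A770", 560)]:
    print(f"{name}: <= {peak_tokens_per_second(bw_gb_s, 4.5):.0f} tok/s ceiling")
```

The resulting ceilings (~398 / 224 / 213 / 124 tok/s) follow the same ordering as the measured ranges, and they show why the RX 7900 XTX keeps pace with RTX 40 cards of similar bandwidth despite very different TFLOPS.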

📊 VRAM Requirements & Performance Notes

  • VRAM for Llama 8B: ~16GB (FP16), ~8GB (FP8/INT8), ~4GB (FP4/INT4)
  • VRAM for Llama 70B: ~140GB (FP16), ~70GB (FP8), ~35GB (INT4) — needs multiple GPUs! (estimator sketched after this list)
  • RTX 5090 with FP4: can exceed 400 tokens/sec on Llama 8B
  • RX 7900 XTX ROCm INT4: ~100–130 tok/s via llama.cpp HIP, on par with RTX 4080
  • Intel ARC SYCL: Good INT4 TFLOPS via XMX, but llama.cpp SYCL backend is less optimised — 30–50% slower than bandwidth would suggest
  • AMD ROCm maturity: RDNA 3/4 excellent with ROCm 5.7+. RDNA 2 community-supported. Works well with llama.cpp HIP backend
  • INT4 vs FP4: FP4 (RTX 50) is slightly more accurate but same speed. RDNA 3 WMMA INT4 ≈ 4× FP32 throughput
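
A minimal sketch of the VRAM arithmetic behind those numbers, assuming the weights dominate (parameter count × bytes per parameter) with an optional overhead factor for KV cache and activations (the overhead margin is an assumption, not a measurement):

```python
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5, "fp4": 0.5}

def weight_vram_gb(params_billions: float, precision: str, overhead: float = 1.0) -> float:
    """Rough VRAM for the weights alone; pass e.g. overhead=1.15 to budget for
    KV cache and activations on top."""
    return params_billions * BYTES_PER_PARAM[precision] * overhead

print(weight_vram_gb(8, "fp16"))   # ~16 GB -> matches the Llama 8B FP16 note above
print(weight_vram_gb(70, "int4"))  # ~35 GB -> Llama 70B INT4 still needs 2x 24GB cards or better
```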

🧩 Software Ecosystem Comparison

  • NVIDIA CUDA (best): RTX 30/40/50 — llama.cpp, Ollama, vLLM, ExLlamaV2, all frameworks ✅✅✅
  • AMD ROCm / HIP (good): RX 7000/9000, MI300X — llama.cpp, Ollama, vLLM (limited), PyTorch ROCm ✅✅
  • → RX 6000 series: community ROCm support only, not all versions work ⚠️
  • Intel SYCL / oneAPI (developing): Arc A/B-series — llama.cpp SYCL, Intel Extension for PyTorch ✅
  • → Setup requires Intel oneAPI toolkit; fewer pre-built binaries than CUDA/ROCm ⚠️
  • INT4 TFLOPS disclaimer for AMD/Intel: Values marked ~ are estimates based on architecture ratios; actual throughput depends on software optimisation

☁️ Cloud vs Own Hardware

  • Consumer GPUs: One-time purchase from Marktplaats.nl (used market prices)
  • NVIDIA Data Center: Hourly rental from Modal.com
  • NVIDIA DGX Spark: Compact AI system (~$4.7k / €2.8k–4.2k) — 128GB unified, 1 petaFLOP FP4, runs models up to 200B params
  • AMD MI300X: Hourly rental from Lambda Labs (~$4/hr)
  • Break-even Example: RTX 4090 @ €2,200 vs H100 @ $3.95/hr ≈ 557 hours of H100 use (computed in the sketch after this list)
  • Choose Cloud If: Testing, burst workloads, need H100/B200/MI300X performance, no hardware maintenance
  • Choose Purchase If: Daily heavy use (>2–3 hrs/day), long-term projects, privacy needs, air-gapped
  • €500 Budget hours: Green = great value (600+ hrs), Yellow = moderate (200–600 hrs), Red = expensive (<200 hrs)
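
Both money columns reduce to one-line formulas. A minimal sketch, assuming the ~1.06 USD/EUR rate implied by the table's Hrs/€500 values (the break-even example above keeps € and $ at parity):

```python
USD_PER_EUR = 1.06  # assumed exchange rate implied by the Hrs/€500 column

def cloud_hours_for_budget(budget_eur: float, rate_usd_per_hr: float) -> float:
    """Rental hours a fixed euro budget buys (the Hrs/€500 column)."""
    return budget_eur * USD_PER_EUR / rate_usd_per_hr

def break_even_hours(purchase_price: float, rate_per_hr: float) -> float:
    """Rental hours that cost as much as buying the card outright (same currency)."""
    return purchase_price / rate_per_hr

print(round(cloud_hours_for_budget(500, 0.59)))  # ~898 hrs on a T4, matching the table
print(round(break_even_hours(2200, 3.95)))       # ~557 hrs of H100 vs the RTX 4090 example
```

At 2–3 hours of use per day, 557 hours works out to roughly 6–9 months, which is where the "daily heavy use" purchase guideline above comes from.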