diff --git a/README.md b/README.md
index 1e714b77..07871443 100644
--- a/README.md
+++ b/README.md
@@ -51,42 +51,6 @@ GPULlama3ChatModel model = GPULlama3ChatModel.builder()
#### **[Interactive-mode]** Running on a RTX 5090 with nvtop on bottom to track GPU utilization and memory usage.

------------
-#### **[Instruct-mode]** Running on a RTX 5090
-
-
-----------
-
-### TornadoVM-Accelerated Inference Performance and Optimization Status
-
-We are at the early stages of Java entering the AI world with features added to the JVM that enable faster execution such as GPU acceleration, Vector acceleration, high-performance access to off-heap memory and others.
-
This repository provides the first Java-native implementation of Llama3 that automatically compiles and executes Java code on GPUs via TornadoVM.
-The baseline numbers presented below provide a solid starting point for achieving more competitive performance compared to llama.cpp or native CUDA implementations.
-[Our roadmap](https://github.com/beehive-lab/GPULlama3.java/blob/main/docs/GPULlama3_ROADMAP.md) provides the upcoming set of features that will dramatically improve the numbers below with the clear target being to achieve performance parity with the fastest implementations.
-
-If you achieve additional performance data points (e.g. new hardware or platforms) please let us know to add them below.
-
-In addition, if you are interested to learn more about the challenges of managed programming languages and GPU acceleration, you can read [our book](https://link.springer.com/book/10.1007/978-3-031-49559-5) or consult the [TornadoVM educational pages](https://www.tornadovm.org/resources).
-
-
-| Vendor / Backend | Hardware | Llama-3.2-1B-Instruct | Llama-3.2-3B-Instruct | Optimizations |
-|:----------------------------:|:------------:|:---------------------:|:---------------------:|:-------------:|
-| | | **FP16** | **FP16** | **Support** |
-| **NVIDIA / OpenCL-PTX** | RTX 3070 | 52 tokens/s | 22.96 tokens/s | ✅ |
-| | RTX 4090 | 66.07 tokens/s | 35.51 tokens/s | ✅ |
-| | RTX 5090 | 96.65 tokens/s | 47.68 tokens/s | ✅ |
-| | L4 Tensor | 52.96 tokens/s | 22.68 tokens/s | ✅ |
-| **Intel / OpenCL** | Arc A770 | 15.65 tokens/s | 7.02 tokens/s | (WIP) |
-| **Apple Silicon / OpenCL** | M3 Pro | 14.04 tokens/s | 6.78 tokens/s | (WIP) |
-| | M4 Pro | 16.77 tokens/s | 8.56 tokens/s | (WIP) |
-| **AMD / OpenCL** | Radeon RX | (WIP) | (WIP) | (WIP) |
-
-##### ⚠️ Note on Apple Silicon Performance
-
-TornadoVM currently runs on Apple Silicon via [OpenCL](https://developer.apple.com/opencl/), which has been officially deprecated since macOS 10.14.
-
-Despite being deprecated, OpenCL can still run on Apple Silicon; albeit, with older drivers which do not support all optimizations of TornadoVM. Therefore, the performance is not optimal since TornadoVM does not have a Metal backend yet (it currently has OpenCL, PTX, and SPIR-V backends). We recommend using Apple silicon for development and for performance testing to use OpenCL/PTX compatible Nvidia GPUs for the time being (until we add a Metal backend to TornadoVM and start optimizing it).
-
-----------
@@ -159,6 +123,39 @@ make
./llama-tornado --gpu --verbose-init --opencl --model beehive-llama-3.2-1b-instruct-fp16.gguf --prompt "tell me a joke"
```
+
+----------
+
+### TornadoVM-Accelerated Inference Performance and Optimization Status
+
+We are at the early stages of Java entering the AI world, with features added to the JVM that enable faster execution, such as GPU acceleration, vector acceleration, and high-performance access to off-heap memory.
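+
+As a rough illustration of two of these JVM features, below is a minimal, hypothetical sketch (not code from this repository) that uses the incubating Vector API for SIMD arithmetic and the Foreign Function & Memory API for off-heap buffers; the class and method names are ours.
+
+```java
+import java.lang.foreign.Arena;
+import java.lang.foreign.MemorySegment;
+import java.lang.foreign.ValueLayout;
+import jdk.incubator.vector.FloatVector;
+import jdk.incubator.vector.VectorOperators;
+import jdk.incubator.vector.VectorSpecies;
+
+// Hypothetical example class, not part of GPULlama3.java.
+public class JvmFeatureSketch {
+    private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;
+
+    // SIMD dot product using the Vector API (run with --add-modules jdk.incubator.vector).
+    static float dot(float[] a, float[] b) {
+        float sum = 0f;
+        int i = 0;
+        for (; i < SPECIES.loopBound(a.length); i += SPECIES.length()) {
+            FloatVector va = FloatVector.fromArray(SPECIES, a, i);
+            FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
+            sum += va.mul(vb).reduceLanes(VectorOperators.ADD);
+        }
+        for (; i < a.length; i++) {   // scalar tail
+            sum += a[i] * b[i];
+        }
+        return sum;
+    }
+
+    // Off-heap allocation of a weight buffer via the Foreign Memory API.
+    static MemorySegment allocateWeights(Arena arena, long numFloats) {
+        return arena.allocate(ValueLayout.JAVA_FLOAT, numFloats);
+    }
+}
+```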
+
This repository provides the first Java-native implementation of Llama3 that automatically compiles and executes Java code on GPUs via TornadoVM.
+The baseline numbers presented below provide a solid starting point toward more competitive performance relative to llama.cpp or native CUDA implementations.
+[Our roadmap](https://github.com/beehive-lab/GPULlama3.java/blob/main/docs/GPULlama3_ROADMAP.md) outlines the upcoming features that will dramatically improve the numbers below, with the clear target of reaching performance parity with the fastest implementations.
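+
+To illustrate how TornadoVM executes plain Java on the GPU, here is a minimal, hypothetical sketch (not code from this repository), assuming the TaskGraph / TornadoExecutionPlan API of TornadoVM 1.x; names such as `TornadoSketch` and `scale` are ours.
+
+```java
+import uk.ac.manchester.tornado.api.ImmutableTaskGraph;
+import uk.ac.manchester.tornado.api.TaskGraph;
+import uk.ac.manchester.tornado.api.TornadoExecutionPlan;
+import uk.ac.manchester.tornado.api.annotations.Parallel;
+import uk.ac.manchester.tornado.api.enums.DataTransferMode;
+import uk.ac.manchester.tornado.api.types.arrays.FloatArray;
+
+// Hypothetical example class, not part of GPULlama3.java.
+public class TornadoSketch {
+    // Kernel written as ordinary Java; @Parallel marks the loop TornadoVM parallelizes on the GPU.
+    static void scale(FloatArray in, FloatArray out, float alpha) {
+        for (@Parallel int i = 0; i < in.getSize(); i++) {
+            out.set(i, alpha * in.get(i));
+        }
+    }
+
+    public static void main(String[] args) throws Exception {
+        FloatArray in = new FloatArray(1024);
+        FloatArray out = new FloatArray(1024);
+        in.init(1.0f);
+
+        TaskGraph graph = new TaskGraph("s0")
+                .transferToDevice(DataTransferMode.FIRST_EXECUTION, in)
+                .task("scale", TornadoSketch::scale, in, out, 2.0f)
+                .transferToHost(DataTransferMode.EVERY_EXECUTION, out);
+
+        ImmutableTaskGraph itg = graph.snapshot();
+        new TornadoExecutionPlan(itg).execute();   // compiled to OpenCL/PTX/SPIR-V at run time
+    }
+}
+```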
+
+If you obtain additional performance data points (e.g., on new hardware or platforms), please let us know so we can add them below.
+
+In addition, if you are interested in learning more about the challenges of combining managed programming languages with GPU acceleration, you can read [our book](https://link.springer.com/book/10.1007/978-3-031-49559-5) or consult the [TornadoVM educational pages](https://www.tornadovm.org/resources).
+
+
+| Vendor / Backend | Hardware | Llama-3.2-1B-Instruct (FP16) | Llama-3.2-3B-Instruct (FP16) | Optimization Support |
+|:----------------------------:|:------------:|:----------------------------:|:----------------------------:|:--------------------:|
+| **NVIDIA / OpenCL-PTX** | RTX 3070 | 66 tokens/s | 55.46 tokens/s | ✅ |
+| | RTX 4090 | 86.11 tokens/s | 75.32 tokens/s | ✅ |
+| | RTX 5090 | 117.65 tokens/s | 112.68 tokens/s | ✅ |
+| | L4 Tensor | 52.96 tokens/s | 22.68 tokens/s | ✅ |
+| **Intel / OpenCL** | Arc A770 | 15.65 tokens/s | 7.02 tokens/s | (WIP) |
+| **Apple Silicon / OpenCL** | M3 Pro | 14.04 tokens/s | 6.78 tokens/s | (WIP) |
+| | M4 Pro | 16.77 tokens/s | 8.56 tokens/s | (WIP) |
+| **AMD / OpenCL** | Radeon RX | (WIP) | (WIP) | (WIP) |
+
+##### ⚠️ Note on Apple Silicon Performance
+
+TornadoVM currently runs on Apple Silicon via [OpenCL](https://developer.apple.com/opencl/), which has been officially deprecated since macOS 10.14.
+
+Despite the deprecation, OpenCL still runs on Apple Silicon, albeit through older drivers that do not support all of TornadoVM's optimizations. Performance is therefore not optimal, since TornadoVM does not yet have a Metal backend (it currently provides OpenCL, PTX, and SPIR-V backends). For the time being, we recommend using Apple Silicon for development and OpenCL/PTX-compatible NVIDIA GPUs for performance testing, until we add a Metal backend to TornadoVM and start optimizing it.
+
-----------
## 📦 Maven Dependency