Glossary¶
A comprehensive guide to terms, acronyms, and concepts used in Backend.AI GO, the Lablup ecosystem, and the broader world of AI infrastructure.
A¶
Agent¶
An AI system that goes beyond simple text generation. It creates plans, uses tools (like web search or code execution), and acts autonomously to achieve a user's goal. See ReAct.
Alignment¶
The process of tweaking an AI model so its behaviors and outputs match human values and intentions (e.g., being helpful, harmless, and honest).
API (Application Programming Interface)¶
A set of rules that allows different software applications to talk to each other. Backend.AI GO provides an OpenAI-compatible API, allowing other apps to use your local models.
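As a minimal sketch, assuming the app exposes its OpenAI-compatible endpoint on a local port (the base URL, API key, and model name below are placeholders; check the app's settings for the real values), any standard OpenAI client can talk to your local models:

```python
# Hypothetical example: base_url, api_key, and model name are placeholders,
# not the actual values used by Backend.AI GO.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="local-model",  # replace with the model name shown in the app
    messages=[{"role": "user", "content": "Summarize what a context window is."}],
)
print(response.choices[0].message.content)
```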
Attention Mechanism¶
The core innovation of the Transformer architecture. It allows the model to "pay attention" to different parts of the input text with varying degrees of focus when generating each new word, capturing long-range dependencies.
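The standard formulation is scaled dot-product attention, where Q, K, and V are the query, key, and value matrices derived from the input and d_k is the key dimension:

```
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V
```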
Auto-Regressive¶
A property of LLMs where they generate text one token at a time, using the previously generated tokens as context for the next one.
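A minimal sketch of the idea; `model` and `sample` are hypothetical placeholders, not a specific library's API:

```python
# Illustrative pseudocode of auto-regressive generation; `model` and `sample`
# are hypothetical placeholders, not a real library's API.
def generate(model, prompt_tokens, max_new_tokens, sample):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = model(tokens)          # score every possible next token
        next_token = sample(logits)     # pick one (greedy, top-p, ...)
        tokens.append(next_token)       # the new token becomes context for the next step
    return tokens
```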
B¶
Backend.AI¶
Lablup's flagship AI Infrastructure Operating System. It orchestrates massive GPU clusters, managing resource allocation, multi-tenancy, and scheduling for enterprise-scale AI training and serving.
Backend.AI FastTrack¶
An MLOps platform built on top of Backend.AI. It automates the entire lifecycle of AI development, from data processing and training to deployment and monitoring, using intuitive pipelines.
Backend.AI GO¶
The desktop application you are using. It serves as a personal AI runtime and a client for Backend.AI clusters, bringing LLM capabilities to consumer hardware.
Batch Size¶
The number of distinct prompts (or data samples) processed by the GPU at the exact same time. Higher batch sizes improve throughput but require more VRAM.
Beam Search¶
A text generation strategy that explores multiple possible paths (sentences) simultaneously and keeps the most promising ones, rather than just picking the single best next word at each step.
C¶
Chain of Thought (CoT)¶
A prompting technique where the model is encouraged to "show its work" by generating intermediate reasoning steps before giving the final answer. This significantly improves performance on logic and math problems.
ChatML¶
A popular prompt format (template) used by many open-source models (like Qwen, Yi) to structure conversations between "User", "Assistant", and "System".
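A conversation rendered in the ChatML template looks like this (the system message text is just an example):

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is a token?<|im_end|>
<|im_start|>assistant
```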
Checkpoint¶
A snapshot of a model's weights at a specific point during training. It allows training to be resumed or the model to be used for inference.
Cloud Integration¶
A feature in Backend.AI GO that allows you to connect to external API providers (OpenAI, Anthropic, remote vLLM) and use them alongside your local models.
Context Window¶
The "short-term memory" of an LLM. It limits how much text (measured in Tokens) the model can process at once. If the conversation exceeds this limit, the model forgets the beginning.
Continuous Batching¶
An advanced serving technique (used in vLLM and PALI) that inserts new requests into the GPU processing queue the moment a previous request finishes, rather than waiting for the entire batch to complete. This drastically reduces latency.
CUDA (Compute Unified Device Architecture)¶
NVIDIA's parallel computing platform and programming model. It is the industry standard for running AI workloads on NVIDIA GPUs.
D¶
Decoding Strategy¶
The method used to select the next token during text generation. Common strategies include Greedy, Temperature Sampling, Top-K, and Top-P.
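A toy sketch of how these strategies pick from the model's next-token probabilities (made-up logits, NumPy only):

```python
# Illustrative sketch of common decoding strategies over a toy next-token distribution.
import numpy as np

logits = np.array([2.0, 1.0, 0.5, -1.0])        # model scores for 4 candidate tokens
temperature = 0.8                                # <1 sharpens, >1 flattens (see Temperature)
probs = np.exp(logits / temperature)
probs /= probs.sum()

greedy_id = int(np.argmax(probs))                # Greedy: always take the top token

k = 2
top_k_ids = np.argsort(probs)[-k:]               # Top-K: sample only among the K best

p = 0.9
order = np.argsort(probs)[::-1]                  # Top-P (nucleus): smallest set of tokens
cutoff = int(np.searchsorted(np.cumsum(probs[order]), p)) + 1
top_p_ids = order[:cutoff]                       # whose probabilities sum to at least p
```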
DeepSeek¶
A research organization and model family known for high-performance open-source models, often rivaling proprietary ones in coding and reasoning tasks.
Docker¶
A platform for developing, shipping, and running applications in containers. Backend.AI uses Docker containers to isolate users and environments in a cluster.
E¶
Embedding¶
A way of representing text (words, sentences, or documents) as a list of numbers (a vector). Text with similar meanings will have similar embedding vectors. This is crucial for RAG systems.
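For example, similarity between embeddings is commonly measured with cosine similarity; the 3-dimensional vectors below are made up for illustration (real embeddings have hundreds or thousands of dimensions):

```python
# Toy example: cosine similarity between embedding vectors.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

cat = np.array([0.9, 0.1, 0.3])    # hypothetical embedding for "cat"
dog = np.array([0.8, 0.2, 0.35])   # hypothetical embedding for "dog"
car = np.array([0.1, 0.9, 0.2])    # hypothetical embedding for "car"

print(cosine_similarity(cat, dog))  # high: related meanings
print(cosine_similarity(cat, car))  # lower: unrelated meanings
```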
Epoch¶
One complete pass through the entire training dataset during the machine learning training process.
F¶
Fine-tuning¶
The process of taking a pre-trained model (base model) and training it further on a specific dataset to improve its performance on a particular task or domain.
Flash Attention¶
An algorithm that makes the attention mechanism much faster and more memory-efficient by minimizing read/write traffic between the GPU's fast on-chip SRAM and its comparatively slow main memory (HBM).
Floating Point (FP16, FP32, BF16)¶
Data types used to represent model weights; their size determines the model's memory footprint, as the estimate after this list shows:
- FP32 (Full Precision): Standard 32-bit floating point.
- FP16 (Half Precision): 16-bit, uses half the memory.
- BF16 (Bfloat16): A 16-bit format optimized for machine learning that preserves the dynamic range of FP32 but with less precision.
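Weight memory is roughly the parameter count times the bytes per parameter (the KV cache and activations come on top of this):

```
7B parameters × 4 bytes (FP32)         ≈ 28 GB
7B parameters × 2 bytes (FP16 or BF16) ≈ 14 GB
```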
Foundation Model¶
A large-scale model trained on a vast amount of data that can be adapted (e.g., via fine-tuning) to a wide range of downstream tasks.
G¶
GGUF (GPT-Generated Unified Format)¶
A binary file format designed by Georgi Gerganov for storing model weights. It supports memory mapping (mmap) for fast loading and CPU/GPU hybrid inference. It is the standard for local AI.
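As an illustrative sketch using the open-source llama-cpp-python bindings (the file path is a placeholder; Backend.AI GO handles this kind of loading for you):

```python
# Illustrative: loading a GGUF file with the llama-cpp-python bindings.
# The model path is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/example-7b-q4_k_m.gguf",  # quantized weights in a single file
    n_gpu_layers=-1,   # offload all layers to the GPU if possible
    n_ctx=4096,        # context window to allocate
)
out = llm("Q: What is GGUF?\nA:", max_tokens=64)
print(out["choices"][0]["text"])
```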
GPU (Graphics Processing Unit)¶
A specialized processor originally designed to accelerate the rendering of images. Its massively parallel structure makes it ideal for AI computations.
Gradient Descent¶
The optimization algorithm used to train neural networks. It iteratively adjusts the model's weights to minimize the error (loss) in its predictions.
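The basic update rule, where θ denotes the weights, η the learning rate, and ∇L(θ) the gradient of the loss with respect to the weights:

```
\theta \leftarrow \theta - \eta \, \nabla_{\theta} L(\theta)
```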
H¶
Hallucination¶
A phenomenon where an LLM generates text that is grammatically correct and confident but factually incorrect or nonsensical.
HBM (High Bandwidth Memory)¶
A type of high-speed memory interface used in modern GPUs (like NVIDIA H100). The speed of HBM is often the bottleneck for LLM inference (bandwidth-bound).
Hugging Face¶
The "GitHub of AI". A platform where the community shares models, datasets, and demos. Backend.AI GO integrates with Hugging Face to let you download models easily.
I¶
Inference¶
The phase where a trained model is used to make predictions (generate text, recognize images, etc.). This is what Backend.AI GO does when you chat with a model.
Instruction Tuning¶
A type of fine-tuning where the model is trained on a dataset of (instruction, output) pairs to learn how to follow user commands and act as an assistant.
K¶
KV Cache (Key-Value Cache)¶
A memory optimization technique used during inference. The model caches the calculated attention keys and values for past tokens so it doesn't have to re-compute them for every new token generated. Managing this cache efficiently is key to performance.
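A rough back-of-the-envelope estimate of its size (the model dimensions below are hypothetical, and implementation overhead is ignored):

```
bytes ≈ 2 (K and V) × n_layers × n_kv_heads × head_dim × bytes_per_value × n_tokens

e.g. 32 layers, 8 KV heads, head_dim 128, FP16 (2 bytes), 4,096 tokens:
2 × 32 × 8 × 128 × 2 × 4096 ≈ 0.5 GB per sequence
```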
L¶
Latency¶
The time it takes for the model to start generating the first token after you send a prompt (Time to First Token, TTFT).
Layer¶
A building block of a neural network. LLMs consist of many stacked layers (e.g., 32, 80 layers).
Llama (Large Language Model Meta AI)¶
A series of open-source foundation models released by Meta. Llama 2 and Llama 3 set the standard for open-weight models.
llama.cpp¶
An open-source project by Georgi Gerganov that allows running LLMs on consumer hardware (Macs, PCs) with high efficiency, using techniques like quantization and CPU inference.
LLM (Large Language Model)¶
A deep learning model with billions of parameters trained on massive text datasets to understand and generate human language. Examples include Gemma 3, Qwen3, gpt-oss, GPT-5.2, and Claude 4.5.
LoRA (Low-Rank Adaptation)¶
A parameter-efficient fine-tuning technique. Instead of updating all model weights (which is expensive), it injects small, trainable rank decomposition matrices into the model, making fine-tuning much faster and lighter.
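Formally, a frozen weight matrix W is augmented with a trainable low-rank update, so only the small matrices A and B are trained (the paper also scales the update by α/r, omitted here):

```
h = W x + B A x, \qquad B \in \mathbb{R}^{d \times r},\ A \in \mathbb{R}^{r \times k},\ r \ll \min(d, k)
```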
M¶
MLOps (Machine Learning Operations)¶
A set of practices that aims to deploy and maintain machine learning models in production reliably and efficiently. Backend.AI FastTrack is an MLOps platform.
MLX¶
Apple's array framework for machine learning on Apple Silicon. It allows for highly optimized model execution on M-series Macs using unified memory.
Model Parallelism¶
A technique used to run models that are too large for a single GPU. The model is split across multiple GPUs (Tensor Parallelism or Pipeline Parallelism).
Multi-Modal¶
The ability of an AI model to process and generate different types of media, such as text, images, audio, and video, simultaneously.
N¶
Neuron¶
The basic unit of a neural network, inspired by biological neurons. It receives inputs, applies a weight, adds a bias, and passes the result through an activation function.
NPU (Neural Processing Unit)¶
A specialized processor designed specifically for accelerating machine learning operations (e.g., the Neural Engine in Apple Silicon).
O¶
Overfitting¶
A modeling error where the model learns the training data too closely (including noise), negatively impacting its ability to generalize to new, unseen data.
P¶
PALI (Performant AI Launcher for Inference)¶
Lablup's total suite for building inference services based on Backend.AI. It is not just an engine but a comprehensive platform that combines:
- Engine: Backend.AI Core.
- Serving: Diverse inference engines managed by Backend.AI Deployment.
- Model Management: Reservoir AI for serving and updating AI models in on-premise environments.
- Routing: Continuum Router for efficient traffic management.
- Interfaces: Web-based Generative AI service UI (AI:DOL) and the desktop local inference app (AI:GO).
PALANG¶
An extended package built upon PALI. It adds Backend.AI FastTrack for MLOps pipelines, curated datasets for fine-tuning, and specialized fine-tuning services. It provides everything needed to go from raw data to a deployed, domain-specific AI service.
PagedAttention¶
A memory management algorithm introduced by vLLM (and used in PALI). It breaks the KV cache into non-contiguous blocks (pages), virtually eliminating memory fragmentation and allowing much higher batch sizes.
Parameter¶
The internal variables (weights and biases) of a model that are learned during training. The number of parameters (e.g., 7B, 70B, 120B, 235B) is a rough proxy for the model's capacity and intelligence.
Perplexity¶
A metric used to evaluate how well a language model predicts a sample of text. A lower perplexity score indicates the model is less "surprised" by the text (i.e., better prediction).
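For a sequence of N tokens, it is the exponential of the average negative log-likelihood the model assigns to each token given the preceding ones:

```
\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N} \log p(x_i \mid x_{<i})\right)
```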
Prompt Engineering¶
The art of crafting inputs (prompts) to guide the LLM to generate the desired output.
Q¶
Quantization¶
The process of reducing the precision of model weights (e.g., from 16-bit floats to 4-bit integers). This significantly reduces memory usage and increases inference speed with minimal loss in quality. Q4_K_M is a popular format.
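A toy sketch of the core idea, symmetric round-to-nearest quantization of one block of weights to 4-bit integers (real schemes such as Q4_K_M use per-block scales and more elaborate grouping):

```python
# Toy illustration of symmetric 4-bit quantization of one block of weights.
import numpy as np

weights = np.array([0.21, -0.53, 0.08, 0.91, -0.34], dtype=np.float32)

scale = np.abs(weights).max() / 7                              # map the largest weight to 7
q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)  # values now fit in 4 bits
dequant = q * scale                                            # reconstructed at inference time

print(q)                                 # [ 2 -4  1  7 -3]
print(np.abs(weights - dequant).max())   # quantization error introduced
```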
R¶
RAG (Retrieval-Augmented Generation)¶
A technique that enhances LLMs by retrieving relevant information from an external knowledge base (your documents) before generating a response. This reduces hallucinations and allows the model to use private data.
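A minimal sketch of the flow; `embed`, `vector_store`, and `llm` are hypothetical stand-ins for a real embedding model, vector database, and chat model:

```python
# Pseudocode sketch of a RAG pipeline; embed(), vector_store, and llm() are
# hypothetical helpers, not a specific library's API.
def answer_with_rag(question, vector_store, embed, llm, k=3):
    query_vector = embed(question)                     # 1. embed the question
    passages = vector_store.search(query_vector, k=k)  # 2. retrieve the top-k similar chunks
    context = "\n\n".join(passages)
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm(prompt)                                 # 3. generate a grounded answer
```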
ReAct (Reasoning + Acting)¶
A paradigm for building AI agents. The model alternates between "Reasoning" (thinking about what to do) and "Acting" (using a tool), allowing it to solve multi-step problems.
RLHF (Reinforcement Learning from Human Feedback)¶
A training technique used to align LLMs with human preferences. Humans rate model outputs, and a reward model is trained to guide the LLM towards generating better responses.
ROCm (Radeon Open Compute)¶
AMD's open software platform for GPU computing, the counterpart to NVIDIA's CUDA.
S¶
Sampling¶
The process of randomly selecting the next token from the probability distribution generated by the model.
Seed¶
A number used to initialize the random number generator. Using the same seed with the same settings will result in the same output from the model (deterministic behavior).
Speculative Decoding¶
An optimization technique where a small, fast "draft" model generates a few tokens ahead, and the large "target" model verifies them in parallel. This can speed up inference by 2-3x without losing quality.
System Prompt¶
The initial instruction given to the model that defines its persona, behavior, and constraints (e.g., "You are a helpful coding assistant").
T¶
Temperature¶
A parameter that controls the randomness of the model's output. High temperature (e.g., 1.0) makes the output more creative and diverse; low temperature (e.g., 0.1) makes it more focused and deterministic.
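Mechanically, the logits z are divided by T before the softmax, so a higher T flattens the distribution and a lower T sharpens it:

```
p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}
```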
Tensor¶
A multi-dimensional array of numbers. It is the fundamental data structure used in deep learning.
Throughput¶
The rate at which the system can process tokens, usually measured in Tokens Per Second (TPS). High throughput is essential for serving many users.
Token¶
The atomic unit of text processing for an LLM. A token can be a word, part of a word, or a character. Roughly, 1,000 tokens ≈ 750 English words.
Tool Calling¶
A capability where an LLM can generate structured output (like JSON) to invoke external functions (tools), enabling it to perform actions like web searching or file manipulation.
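For example, instead of answering in prose, the model might emit a structured call like this (the tool name, arguments, and exact schema are made up for illustration and vary by model and API):

```json
{
  "tool_call": {
    "name": "web_search",
    "arguments": { "query": "current weather in Seoul" }
  }
}
```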
Transformer¶
The deep learning architecture introduced by Google in 2017 ("Attention Is All You Need") that revolutionized NLP. It relies on the self-attention mechanism to process sequences of data.
U¶
Unified Memory¶
An architecture (notably in Apple Silicon) where the CPU and GPU share the same pool of high-speed memory. This avoids data copying and allows loading massive models that wouldn't fit in dedicated VRAM on a PC.
V¶
vLLM¶
A high-throughput, memory-efficient open-source LLM serving engine. It pioneered PagedAttention and is widely used for production serving.
VRAM (Video RAM)¶
The dedicated memory on a graphics card. The amount of VRAM is the primary bottleneck for running large models locally.
W¶
Weights¶
The learnable parameters of a neural network. When you download a model file (like .gguf), you are essentially downloading these weights.
Z¶
Zero-shot Learning¶
The ability of a model to perform a task it hasn't been explicitly trained to do, simply by understanding the instruction in the prompt.