Reference

AI and LLM glossary

Plain-English definitions of the terms that come up when you build with AI and large language models. No jargon, no marketing, just what each one means and why it matters.


Agent: An AI system that can plan a task, call tools or APIs, observe the results, and decide what to do next without a human directing every step. Agents are used for work like research, data entry, and multi-step workflows where a single model response is not enough.
AgentOps: The practice of running AI agents in production: monitoring what they do, logging their tool calls and decisions, controlling cost, and catching failures before users see them. It is the operational layer that keeps agents reliable over time.
Chunking: Splitting documents into smaller pieces before storing them for retrieval. Chunk size affects how much context a model sees per result, so it directly changes the quality of a RAG system's answers.
Context window: The maximum amount of text a model can handle in a single request, measured in tokens. It usually covers both the prompt and the generated output. Text that does not fit is rejected or truncated, depending on the API, so long documents often need chunking or retrieval to fit.
Copilot: An AI feature embedded inside an existing product to assist a user as they work, such as suggesting code, drafting text, or answering questions in context. The human stays in control and accepts or rejects each suggestion.
Distillation: Training a smaller model to copy the behaviour of a larger one. The result is cheaper and faster to run while keeping much of the larger model's quality on a specific task.
Embedding: A list of numbers that represents the meaning of a piece of text, image, or other data. Items with similar meaning produce similar embeddings, which is what makes semantic search and retrieval possible.
Eval: A repeatable test that measures how well a model or AI feature performs on a defined set of inputs. Evals catch regressions when you change a prompt, model, or pipeline, the same way unit tests catch bugs in code.
Few-shot prompting: Including a handful of worked examples in the prompt so the model can follow the pattern. It often improves accuracy on formatting and classification tasks without any training.
Fine-tuning: Continuing to train an existing model on your own examples so it adapts to a specific style, format, or domain. It is useful when prompting alone cannot get consistent results, but it needs good training data and ongoing maintenance.
Function calling: A feature that lets a model return a structured request to run a named function with specific arguments, rather than plain text. Your code runs the function and passes the result back, which is how models connect to real tools and data.
GPU: A graphics processing unit, the hardware that runs most AI training and inference. GPUs handle the large parallel math behind neural networks far faster than ordinary CPUs, which is why they dominate AI cost and capacity planning.
Grounding: Tying a model's answer to specific source material, such as your documents or a database, so it responds from facts instead of memory. Grounding reduces hallucination and lets a system cite where an answer came from.
Guardrails: Rules and checks placed around a model to keep its inputs and outputs safe and on-topic. Guardrails can block unsafe requests, filter sensitive data, and enforce that responses stay within allowed boundaries.
Hallucination: When a model states something false or invented as if it were true. It happens because models predict plausible text rather than look up facts, which is why grounding and evals matter for production systems.
Hybrid search: Combining keyword search with semantic (embedding-based) search and merging the results. It catches both exact matches like product codes and meaning-based matches, usually retrieving better results than either method alone.
Inference: Running a trained model to get an output, as opposed to training it. Most production AI cost and latency come from inference, since it happens on every request.
Knowledge cutoff: The date after which a model has no built-in information, because its training data stops there. Anything more recent has to be supplied at request time through retrieval or tools.
Latency: The time between sending a request to a model and getting a response. It shapes how an AI feature feels to use, and streaming partial output is a common way to make high-latency responses feel faster.
LLM: A large language model: a model trained on huge amounts of text to predict and generate language. LLMs power chat assistants, summarisation, code generation, and most current AI products.
LLM-as-judge: Using one language model to score or grade the output of another against a rubric. It is a common way to run evals at scale when human grading is too slow, though the judge itself needs checking.
MCP (Model Context Protocol): An open standard for connecting AI models to external tools and data sources through a consistent interface. An MCP server exposes capabilities like file access or database queries that any compatible AI application can use, so you do not have to build a custom integration each time.
MLOps: The practices and tooling for taking machine learning models to production and keeping them healthy: deployment, versioning, monitoring, and retraining. It applies engineering discipline to the model lifecycle.
Multimodal: A model that can handle more than one type of input or output, such as text together with images, audio, or video. Multimodal models can, for example, read a screenshot and answer questions about it.
Prompt caching: Reusing the processed form of a repeated section of a prompt across requests instead of recomputing it each time. It lowers cost and latency when many requests share a large fixed prefix, like a long system prompt.
Prompt engineering: Writing and refining the instructions given to a model to get reliable, useful output. It covers wording, structure, examples, and constraints, and is often the cheapest way to improve results before reaching for fine-tuning.
Quantization: Storing a model's weights at lower numeric precision, for example 8-bit or 4-bit integers instead of 16-bit floating point, to shrink its memory use and speed up inference. It makes large models cheaper to run, usually with a small and acceptable drop in quality.
RAG: Retrieval-augmented generation: a pattern where relevant documents are fetched from your own data and added to the prompt so the model answers from them. RAG lets a model use private or current information it was never trained on, and it supports source citations.
Reranking: A second pass that reorders search results by relevance before they reach the model. A reranking model looks at each candidate against the query more carefully than the initial retrieval, which improves answer quality in RAG systems.
RLHF: Reinforcement learning from human feedback: a training method where people rank model responses and the model learns to prefer the better ones. It is a major reason modern chat models follow instructions and stay helpful.
Semantic search: Searching by meaning rather than exact keywords, using embeddings to find results that are conceptually similar to a query. It returns relevant matches even when the wording is different.
Structured output: Forcing a model to return data in a fixed format such as JSON that matches a schema. It makes responses safe to parse and use directly in code, instead of pulling values out of free text.
System prompt: A set of instructions given to a model before the conversation starts, defining its role, rules, and tone. It stays in effect across the whole session and shapes how the model responds to every user message.
Temperature: A setting that controls how random a model's output is. Lower values make responses more focused and repeatable, while higher values make them more varied. Factual tasks usually use a low temperature.
Throughput: How many requests or tokens a system can process in a given time. High throughput matters when serving many users at once, and it is often traded off against latency for any single request.
Token: The unit a model reads and writes text in, roughly a word or part of a word. Context windows, pricing, and speed are all measured in tokens, so token count drives cost.
Tool use: A model's ability to call external functions, APIs, or services to act in the world or fetch information beyond its training. Tool use is what lets a model look up live data, run code, or take actions on a user's behalf.
Transformer: The neural network architecture behind almost all modern language models. Its attention mechanism lets the model weigh how much each part of the input relates to every other part, which is what makes it good at language.
Vector database: A database built to store embeddings and quickly find the ones most similar to a query. It is the retrieval engine behind most RAG and semantic search systems.
vLLM: An open-source engine for serving large language models efficiently. It increases throughput and lowers cost by managing the GPU memory used for attention caches in small pages (PagedAttention) and by batching incoming requests continuously, and it is widely used to self-host open models.
Zero-shot prompting: Asking a model to do a task with only an instruction and no examples. Capable models handle many tasks this way, and you add examples only when zero-shot results are not consistent enough.
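
To make a few of the entries above concrete, here is a minimal sketch of how chunking, embeddings, semantic search, and RAG fit together. It is an illustration under stated assumptions, not a production design: `toy_embedding` is a hypothetical stand-in that only counts words, so it captures word overlap rather than meaning, and a real system would call a trained embedding model and usually a vector database. All function names here are made up for the example.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, max_words: int = 40) -> list[str]:
    """Chunking: split text into pieces of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def toy_embedding(text: str) -> Counter:
    """Hypothetical stand-in for an embedding: lowercase word counts.

    Real embeddings are dense vectors produced by a trained model.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Similarity between two sparse vectors: 1.0 is identical direction, 0.0 is no overlap."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, chunks: list[str], top_k: int = 1) -> list[str]:
    """Search step: rank chunks by similarity to the query and keep the best top_k."""
    query_vec = toy_embedding(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vec, toy_embedding(chunk)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query: str, retrieved: list[str]) -> str:
    """RAG step: put the retrieved chunks into the prompt so the model answers from them."""
    context = "\n".join(f"- {chunk}" for chunk in retrieved)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


if __name__ == "__main__":
    docs = [
        "Refunds are issued within 14 days of purchase.",
        "Our office is closed on public holidays.",
    ]
    question = "How many days do refunds take?"
    print(build_prompt(question, retrieve(question, docs)))
```

The prompt that `build_prompt` returns is what would be sent to a model. Swapping `toy_embedding` for a real embedding model is what turns this keyword-overlap ranking into true semantic search.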

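The Temperature entry above can also be shown numerically. A common way to apply temperature (assumed here, since the glossary does not describe the mechanism) is to divide the model's raw scores, called logits, by the temperature before converting them to probabilities with softmax. The logits below are made-up values for three candidate tokens.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
    for temperature in (0.2, 1.0, 2.0):
        probs = softmax_with_temperature(logits, temperature)
        print(temperature, [round(p, 3) for p in probs])
```

At low temperature almost all of the probability lands on the highest-scoring token, which is why output becomes more repeatable. At high temperature the probabilities even out, so sampling picks less likely tokens more often.
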
Building something with these?

FoundrySoft is an India-based studio that ships production AI and software for US companies. If you are turning any of these ideas into a real system, we can help.
