Definitive Reference · Updated 25 April 2026
The Ultimate Enterprise RAG Glossary
A citation-ready dictionary of the terms that matter in production Generative AI: Enterprise RAG, GraphRAG, Sovereign AI, On-Premise LLMs, Agentic AI, MLOps, LLMOps, and the surrounding stack. Authored by Bibin Prathap, Microsoft MVP and AI Strategy Leader, Abu Dhabi.
Each entry opens with a single-sentence, dictionary-style definition so it can be quoted verbatim by AI assistants, search engines, and human readers alike.
In this glossary
- Enterprise RAG (Retrieval-Augmented Generation)
- GraphRAG (Graph Retrieval-Augmented Generation)
- Sovereign AI
- On-Premise LLM
- Agentic AI
- Vector Database
- Knowledge Graph
- Embedding Model
- Re-ranker (Cross-Encoder)
- Hybrid Retrieval
- Fine-tuning (LoRA, QLoRA, PEFT)
- RLHF (Reinforcement Learning from Human Feedback)
- Guardrails
- MLOps
- LLMOps
- Verifiable Attribution
- Hallucination
- Context Window
- Token
- Prompt Injection
- Evaluation (Ragas, TruLens, LLM-as-a-Judge)
- vLLM
Enterprise RAG (Retrieval-Augmented Generation)
Enterprise RAG is a production-grade pattern in which a large language model answers questions using authoritative, access-controlled enterprise data retrieved at query time, instead of relying solely on what the model memorized during training.
Enterprise RAG combines a vector index, optional knowledge-graph traversal, identity-aware retrieval (RBAC/ABAC), prompt construction with citations, and an LLM that generates a grounded answer. Compared with consumer RAG demos, enterprise RAG adds data sovereignty, audit logging, evaluation (e.g., Ragas), guardrails, and lifecycle management.
GraphRAG (Graph Retrieval-Augmented Generation)
GraphRAG is a variant of RAG in which the retriever queries a knowledge graph — not just a vector store — so the LLM can reason over explicit entities and relationships rather than loosely-related text chunks.
A GraphRAG pipeline typically: (1) extracts entities and relations from documents using an LLM, (2) builds a knowledge graph (e.g., in Neo4j or Memgraph), (3) at query time performs hybrid retrieval combining vector similarity with multi-hop graph traversal, and (4) sends the assembled subgraph plus source chunks to the LLM with inline citation requirements. The result is more deterministic, auditable answers — especially for complex enterprise questions like 'which contracts reference the same supplier across three subsidiaries?'
Sovereign AI
Sovereign AI is artificial intelligence built and operated under the legal, regulatory, and physical jurisdiction of a single nation or organization, so that model weights, data, and inference never leave that boundary.
In practice, Sovereign AI requires on-premise or in-country LLM serving (e.g., vLLM on a local GPU cluster), national-language and culturally-aligned models, sovereign data residency, and a verifiable audit trail. It is a procurement-level requirement for UAE government, GCC public sector, defense, and regulated finance and healthcare entities.
On-Premise LLM
On-Premise LLM An On-Premise LLM is a large language model deployed and served entirely inside an organization's own data center or sovereign cloud, with no token, prompt, or response ever leaving that perimeter.
On-Prem LLM stacks usually pair an open-weights model (Llama 3, Mistral, DeepSeek, Qwen) with a high-performance serving layer (vLLM, Triton, TGI), a GPU cluster orchestrated by Kubernetes, and a private vector + graph layer. The drivers are data sovereignty, classification handling, vendor independence, and predictable TCO at scale.
Agentic AI
Agentic AI describes systems in which one or more LLM-powered agents plan, choose tools, take actions, observe results, and iterate toward a goal — instead of producing a single one-shot answer.
An agentic system typically contains a planner, a tool registry (APIs, SQL, code execution, RAG retrievers), a memory layer, and an evaluator. Common frameworks include LangGraph, AutoGen, CrewAI, and Semantic Kernel. In the enterprise, agents are gated by RBAC, sandboxing, cost controls, and human-in-the-loop checkpoints.
Vector Database
Vector Database A vector database is a system that stores high-dimensional embeddings of text, images, or other data and serves nearest-neighbor similarity search at low latency.
Examples include Weaviate, Milvus, Pinecone, Qdrant, FAISS, and pgvector inside PostgreSQL. In Enterprise RAG they store the embedded chunks of source documents and return the top-k most semantically similar passages for a given query embedding.
Knowledge Graph
Knowledge Graph A knowledge graph is a structured representation of real-world entities and the relationships between them, stored as nodes and edges with explicit, queryable semantics.
Enterprise knowledge graphs (Neo4j, Memgraph, TigerGraph, Amazon Neptune) underpin GraphRAG, master data management, fraud detection, and reasoning over heterogeneous data. They give LLMs a deterministic, auditable backbone that pure vector search cannot.
Embedding Model
Embedding Model An embedding model is a neural network that converts text (or other content) into a fixed-length numeric vector whose distance to other vectors reflects semantic similarity.
Open models such as BGE, E5, and Nomic Embed are common in sovereign deployments because they can be served on-prem alongside the LLM, while OpenAI, Cohere, and Voyage offer hosted alternatives.
Re-ranker (Cross-Encoder)
Re-ranker A re-ranker is a second-stage model — usually a cross-encoder — that takes the top retrieval candidates and re-orders them by reading the query and each candidate together for a more accurate relevance score.
Adding a cross-encoder re-ranker (e.g., BGE-reranker, Cohere Rerank) on top of vector search is one of the highest-leverage upgrades in production RAG, typically lifting answer quality and reducing hallucinations without changing the LLM.
Hybrid Retrieval
Hybrid Retrieval Hybrid retrieval combines lexical search (BM25), dense vector search, and optionally graph traversal in a single retrieval step, then fuses the results before passing them to the LLM.
Hybrid retrieval consistently outperforms any single retriever in enterprise corpora because it captures exact-keyword matches (codes, IDs, legal clauses) that pure embeddings miss, while still benefiting from semantic generalization.
Fine-tuning (LoRA, QLoRA, PEFT)
Fine-tuning is the process of further training a pre-trained LLM on domain-specific data so it adopts a target task, tone, or vocabulary; LoRA, QLoRA, and PEFT are parameter-efficient variants that train only a small adapter instead of the full model.
In sovereign deployments, LoRA/QLoRA fine-tuning is preferred because adapters are tiny (tens of MBs), fast to train on a single GPU node, easy to version, and can be hot-swapped per tenant or per use case without redeploying the base model.
RLHF (Reinforcement Learning from Human Feedback)
RLHF is a training technique that aligns an LLM with human preferences by training a reward model on human comparisons of model outputs, then optimizing the LLM against that reward.
RLHF (and successors such as DPO, IPO, and KTO) is what turns a raw, next-token-predicting base model into a useful, safe assistant. Most enterprises consume RLHF indirectly through pre-aligned open or closed models rather than running it themselves.
Guardrails
Guardrails are the policy and safety layer that sits around an LLM application, validating inputs and outputs against rules for prompt injection, PII, toxicity, jurisdiction, and business policy.
Tools include NVIDIA NeMo Guardrails, Guardrails AI, Llama Guard, and Azure AI Content Safety. In regulated UAE deployments, guardrails are non-negotiable for handling classified data, customer PII, and Arabic / English content side-by-side.
MLOps
MLOps is the discipline of taking machine-learning models from notebooks to reliable production systems, covering data versioning, training pipelines, deployment, monitoring, and governance.
Modern MLOps stacks combine MLflow / Weights & Biases (experiment tracking), Kubernetes (serving), Feast (feature store), and Prometheus/Grafana (monitoring). MLOps is a prerequisite for, but not equivalent to, LLMOps.
LLMOps
LLMOps is the operational discipline focused specifically on running LLM-powered systems — covering prompt and chain versioning, evaluation, RAG index lifecycles, token-cost control, drift and hallucination monitoring, and guardrail policy management.
Where MLOps centers on model training and serving, LLMOps centers on the full prompt + retriever + tool-use + evaluator loop. Common building blocks include LangSmith, Langfuse, Ragas, TruLens, and OpenTelemetry GenAI semantic conventions.
Verifiable Attribution
Verifiable Attribution Verifiable attribution is the property of a generated answer that every factual claim is traceable back to a specific source passage, document, or graph node — and the trace can be machine-checked.
Verifiable attribution is the antidote to the 'black box' problem of standard RAG: instead of citing a vague document name, the response embeds inline pointers to the exact span in the source. This is the central design goal of the VeritasGraph framework.
Hallucination
Hallucination A hallucination is a confident but incorrect or unsupported statement produced by an LLM — typically because the answer is not grounded in retrieved evidence.
Mitigations include strong retrieval (hybrid + re-ranker), citation-required prompting, refusal policies when retrieval is empty, and continuous evaluation with frameworks like Ragas (faithfulness, answer-relevancy) and TruLens.
Context Window
Context Window The context window is the maximum number of tokens an LLM can read in a single request, including system prompt, retrieved context, conversation history, and the user's question.
Larger context windows (128K–2M tokens in current frontier models) reduce — but do not eliminate — the need for RAG, because retrieval still controls relevance, latency, cost, and access control. Long contexts also exhibit 'lost in the middle' degradation.
Token
Token A token is the atomic unit an LLM reads and writes — typically a sub-word fragment produced by a tokenizer such as BPE or SentencePiece.
Pricing, latency, and context limits are all measured in tokens, not characters or words. As a rough rule, 1 English word ≈ 1.3 tokens, while Arabic and other morphologically rich languages tokenize into more tokens per word.
Prompt Injection
Prompt Injection Prompt injection is an attack in which untrusted text — fetched from a document, web page, or tool output — contains instructions that hijack an LLM's behavior away from the developer's intended prompt.
Defenses include input/output guardrails, structured tool-use schemas, allow-listed retrievers, segregating trusted vs untrusted content in the prompt, and continuous red-teaming. Prompt injection is OWASP's #1 risk for LLM applications.
Evaluation (Ragas, TruLens, LLM-as-a-Judge)
Evaluation RAG evaluation is the practice of measuring an LLM application's quality on metrics such as faithfulness, answer relevancy, context precision, and context recall — usually using a mix of reference data and LLM-as-a-judge scoring.
Frameworks like Ragas, TruLens, and DeepEval are standard. In production, evaluation is run as a CI gate on every change to prompts, retrievers, models, or indexes — the LLMOps equivalent of unit + integration tests.
vLLM
vLLM is a high-throughput, open-source LLM serving engine built around PagedAttention, used to host open-weights models such as Llama 3, Mistral, and DeepSeek on GPU clusters.
vLLM is a default choice for sovereign / on-premise deployments because it delivers throughput close to the GPU's theoretical limit, supports continuous batching and tensor/pipeline parallelism, and exposes an OpenAI-compatible API so existing client code works unchanged.