n8n RAG Workflow: Build a Retrieval-Augmented Agent (2026)

TL;DR: A RAG workflow in n8n is two separate flows, not one. The first ingests documents: a Default Data Loader, with a Recursive Character Text Splitter attached, turns documents into chunks; an embeddings model (OpenAI, Cohere, Gemini) turns chunks into vectors; and a vector store node in Insert mode writes them. The second answers questions: the same vector store node, this time exposed as a retriever to a Question and Answer Chain or as a tool to an AI Agent. The non-negotiable rule: the embedding model must be identical on both sides. Default to PGVector if you self-host (you likely already run Postgres) and Pinecone or Supabase if you want a managed store.

Most broken n8n RAG builds come from treating retrieval as a single pipeline. It is two. You ingest once (or on a schedule when documents change), and you query continuously. The only things they share are the vector store and the embedding model.

The ingestion flow runs left to right: a trigger (manual, webhook, or a schedule that watches a folder or Google Drive), the document source, then the vector store node configured in Insert Documents mode. Hanging off that vector store node as sub-node inputs are the embeddings model and the document data. The query flow is the inference side: a chat trigger or webhook, then either a Question and Answer Chain or an AI Agent, with the vector store attached as a retriever or a tool.

If you have not set up the chat side of n8n's AI features before, the connection model (root nodes with sub-node inputs hanging below them) is the same one covered in our AI Agent skill files guide. The cluster-node wiring there carries straight over.

The ingestion flow: loader, splitter, embeddings, store

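Build ingestion in this order, starting from the root node and attaching each sub-node to its parent: the loader and the embeddings model hang off the vector store, and the splitter hangs off the loader.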

Vector store node, Insert Documents mode. This is the root node of the ingestion cluster. Pick PGVector, Pinecone, Qdrant, Supabase, or the Simple Vector Store (in-memory) for testing. It has two sub-node inputs: a document and an embeddings model.
Default Data Loader connects to the document input. It takes binary or JSON data from earlier in the flow and normalizes it into LangChain document objects. Set whether you are loading binary (PDFs, files) or JSON, and map any fields you want kept as metadata.
Recursive Character Text Splitter connects to the Default Data Loader. It splits recursively, trying paragraph breaks first, then line breaks, then spaces, and only then individual characters, which keeps semantically related text together far better than the plain Character Text Splitter. Use it as the default. The Token Splitter is the alternative when you need to respect an exact token budget.
Embeddings model (OpenAI Embeddings, Cohere, Google Gemini, etc.) connects to the embeddings input on the vector store node.
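
To make the recursive splitting idea concrete, here is a minimal sketch in plain JavaScript. It is a simplified illustration, not the splitter n8n ships: the function name recursiveSplit and the separator list are assumptions chosen for the example, and overlap is left out to keep the recursion readable.

```javascript
// Simplified illustration of recursive character splitting (not n8n's code).
// Tries coarse separators first and only falls back to finer ones when a
// piece is still longer than chunkSize. Overlap is deliberately omitted.
function recursiveSplit(text, chunkSize, separators = ["\n\n", "\n", " ", ""]) {
  if (chunkSize < 1) throw new Error("chunkSize must be at least 1");
  if (text.length <= chunkSize) return text.length > 0 ? [text] : [];

  const [sep, ...finer] = separators;
  // An empty separator means "split into individual characters".
  const parts = sep === "" ? Array.from(text) : text.split(sep);

  const chunks = [];
  let current = "";
  for (const part of parts) {
    const candidate = current === "" ? part : current + sep + part;
    if (candidate.length <= chunkSize) {
      current = candidate; // keep packing pieces into the current chunk
      continue;
    }
    if (current !== "") chunks.push(current);
    if (part.length <= chunkSize) {
      current = part;
    } else {
      // This piece is too long on its own: split it with finer separators.
      chunks.push(...recursiveSplit(part, chunkSize, finer));
      current = "";
    }
  }
  if (current !== "") chunks.push(current);
  return chunks;
}

console.log(recursiveSplit("aaa bbb\n\nccc ddd", 7)); // [ 'aaa bbb', 'ccc ddd' ]
console.log(recursiveSplit("aaa bbb ccc", 7));        // [ 'aaa bbb', 'ccc' ]
```

The first call splits cleanly on the paragraph break; the second has no paragraph or line break, so the sketch falls through to splitting on spaces.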

Chunk size and overlap live on the splitter and they are the single biggest lever on retrieval quality. There is no universal best, but the tradeoff is concrete: large chunks preserve context but dilute the embedding (the vector averages over too much text, so similarity search gets fuzzy), while tiny chunks retrieve precisely but lose the surrounding context the model needs to answer. A practical starting point for prose documentation is roughly 800-1200 characters per chunk with about 10-15% overlap, then adjust based on what your retrieval actually returns. Overlap exists so a fact that straddles a chunk boundary still appears whole in at least one chunk.
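
The role of overlap is easiest to see with a naive fixed-width chunker. The sketch below is an illustration only (chunkWithOverlap is a hypothetical helper, and it cuts at fixed character offsets rather than at separators as the recursive splitter does), but the boundary behaviour is the same idea.

```javascript
// Naive fixed-width chunker, for illustration only (not n8n's splitter).
// Each chunk starts (chunkSize - chunkOverlap) characters after the previous one.
function chunkWithOverlap(text, chunkSize, chunkOverlap) {
  if (chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error("chunkOverlap must be >= 0 and smaller than chunkSize");
  }
  const step = chunkSize - chunkOverlap;
  const chunks = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // this chunk reached the end
  }
  return chunks;
}

const text = "abcdefghij";
const noOverlap = chunkWithOverlap(text, 4, 0);   // [ 'abcd', 'efgh', 'ij' ]
const withOverlap = chunkWithOverlap(text, 4, 2); // [ 'abcd', 'cdef', 'efgh', 'ghij' ]

// "def" straddles the first boundary: it is split without overlap,
// and appears whole in one chunk once overlap is added.
console.log(noOverlap.some((c) => c.includes("def")));   // false
console.log(withOverlap.some((c) => c.includes("def"))); // true
```

The cost is visible too: the overlapping run produces four chunks instead of three, so more overlap means more vectors to embed and store.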

For structured or messy source data, normalize before the loader rather than after. A Code node is the right place to strip boilerplate, attach metadata, or merge fields. Our Code node JavaScript and Python examples cover the $input patterns for this kind of pre-processing.

The query flow: retriever chain or agent tool

Two ways to read from the store, and they are not interchangeable.

Question and Answer Chain

Use this when the job is strictly "answer from these documents." The Question and Answer Chain is a root node. You attach a chat model to it and a Vector Store Retriever sub-node, and the retriever in turn points at your vector store node (now in Retrieve / get-documents mode). The flow is linear: chat input goes in, the retriever pulls the top matching chunks, the chain stuffs them into the prompt, the model answers. It is predictable and cheap because there is no agent loop deciding whether to search.
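
The "retrieve, then stuff" step can be sketched in a few lines. This is a rough illustration rather than the chain's actual prompt or code: topK and buildStuffedPrompt are hypothetical helpers, the prompt wording is an assumption, and the sample chunks and scores are invented.

```javascript
// Hypothetical helper: keep the k highest-scoring chunks.
function topK(scoredChunks, k) {
  return [...scoredChunks].sort((a, b) => b.score - a.score).slice(0, k);
}

// Hypothetical helper: "stuff" the retrieved chunks into one prompt.
function buildStuffedPrompt(question, chunks) {
  const context = chunks
    .map((chunk, index) => `[${index + 1}] ${chunk.text}`)
    .join("\n\n");
  return [
    "Answer the question using only the context below.",
    "If the context does not contain the answer, say so.",
    "",
    "Context:",
    context,
    "",
    `Question: ${question}`,
  ].join("\n");
}

// Invented sample data standing in for similarity-search results.
const retrieved = topK(
  [
    { text: "Exports are capped at 10,000 rows.", score: 0.82 },
    { text: "Invoices are sent monthly.", score: 0.31 },
    { text: "Exports run as background jobs.", score: 0.77 },
  ],
  2
);
const prompt = buildStuffedPrompt("What is the export row limit?", retrieved);
console.log(prompt);
```

Because every retrieved chunk goes into a single prompt, the number of chunks and the chunk size together determine how much context the model receives on each question.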

AI Agent with a Vector Store tool

Use this when the assistant does more than answer document questions: it also calls other tools, holds a conversation, or decides when retrieval is even needed. Attach the vector store to the AI Agent through the tools input. The cleaner option is the Vector Store Question Answer Tool, which wraps the store with its own summarizing model so the agent gets a synthesized answer back instead of raw chunks. The agent's model decides whether to call the tool based on the tool's name and description, so write that description deliberately, for example "Search internal product documentation for features, limits, and configuration steps."

Default to the Q&A Chain when retrieval is the whole job, and to the Agent when retrieval is one capability among several. Reaching for an agent "to be safe" just adds latency, token cost, and a layer that can decide not to search at all. If you are also exposing other systems to the agent, an MCP server is usually a better way to add those than bolting on more individual tool nodes.

The gotcha that breaks most builds: embedding model mismatch

The embedding model must be identical between ingestion and retrieval. Different models place text in different vector spaces, often with different dimensions, so a query vectorized with text-embedding-3-large cannot be meaningfully compared against documents stored with text-embedding-3-small. When the dimensions differ, stores such as PGVector and Pinecone typically reject the query with a dimension error, which is at least loud. The brutal case is two models, or two configurations of one model, that happen to share a dimension: no error, the workflow runs green, and retrieval just returns irrelevant chunks or nothing. If your agent suddenly answers "I don't have information on that" for content you know is indexed, check the two embedding nodes match before you touch anything else.

The same rule covers re-indexing. If you re-index with a different model or output dimension, the existing vectors in the store are now garbage and need to be wiped and rebuilt, not appended to.
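
Similarity search comes down to a vector comparison such as cosine similarity, which is only defined for two vectors of the same length. The sketch below is a generic illustration (cosineSimilarity is a hypothetical helper, not a function from n8n or any vector store) with an explicit guard for the dimension case.

```javascript
// Generic cosine similarity with an explicit dimension guard (illustration only).
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity([1, 0], [1, 0])); // 1
console.log(cosineSimilarity([1, 0], [0, 1])); // 0

try {
  cosineSimilarity([1, 0, 0], [1, 0]);
} catch (err) {
  console.log(err.message); // Dimension mismatch: 3 vs 2
}
```

The guard only catches a difference in length. Two different models with the same dimension pass it and still produce a number, which is why that case has to be caught by checking the embedding nodes rather than by waiting for an error.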

Picking a vector store: default to PGVector self-hosted, Pinecone or Supabase managed

The choice is driven by what infrastructure you already run, not by benchmark leaderboards.

PGVector is the default for self-hosted n8n. A production n8n install usually runs on Postgres already (the out-of-the-box default is SQLite), so PGVector typically means no new service: enable the pgvector extension, and your vectors sit next to your relational data, where you can filter on metadata with plain SQL. For most agency and internal-docs use cases this is enough, and it is the lowest-operations option.
Qdrant is the move when PGVector's recall or latency stops keeping up at scale, and you still want to self-host. It is a purpose-built vector database with strong filtering, and it runs as a single container next to your n8n stack.
Pinecone or Supabase are the managed picks. Pinecone if you want a fully hosted vector database and zero ops; Supabase if your app data already lives there (it is PGVector under the hood with a managed layer on top).
Simple Vector Store (in-memory) is for prototyping only. It lives in memory and clears when the workflow or instance restarts. Build against it to validate your chunking and prompts, then swap in a persistent store.

Metadata filtering is the feature that separates a toy from something usable. Attach fields like source, tenant_id, doc_type, or updated_at at the Default Data Loader stage, then filter on them at query time so the agent only searches the relevant subset. For a multi-client agency, filtering by client at retrieval is the difference between a scoped assistant and one that leaks one client's documents into another's answers. PGVector, Qdrant, and Pinecone all support this; the in-memory store effectively does not.
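
As a rough sketch of what scoping by metadata means, the example below filters candidate chunks before ranking them. It is an in-memory illustration with invented data; matchesFilter and scopedSearch are hypothetical helpers, and a real store applies the filter inside its own query rather than in JavaScript.

```javascript
// Hypothetical helper: true when every filter field equals the chunk's metadata.
function matchesFilter(metadata, filter) {
  return Object.entries(filter).every(([key, value]) => metadata[key] === value);
}

// Hypothetical helper: filter first, then rank, so out-of-scope chunks
// can never reach the model no matter how well they score.
function scopedSearch(chunks, filter, k) {
  return chunks
    .filter((chunk) => matchesFilter(chunk.metadata ?? {}, filter))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

// Invented sample data: two clients sharing one store.
const candidates = [
  { text: "Acme SLA is 99.9%.", score: 0.71, metadata: { tenant_id: "acme", doc_type: "contract" } },
  { text: "Globex SLA is 99.5%.", score: 0.93, metadata: { tenant_id: "globex", doc_type: "contract" } },
  { text: "Acme onboarding steps.", score: 0.4, metadata: { tenant_id: "acme", doc_type: "help-article" } },
];

const results = scopedSearch(candidates, { tenant_id: "acme" }, 2);
console.log(results.map((r) => r.text));
// [ 'Acme SLA is 99.9%.', 'Acme onboarding steps.' ]
```

Note that the Globex chunk has the highest score in the sample and is still excluded, which is the behaviour a multi-client setup depends on.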

A realistic chunking config

If you ingest mixed-length Markdown docs, normalize and tag them in a Code node before the loader so retrieval has metadata to filter on:

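// Code node: clean text and attach metadata before the Default Data Loader
return $input.all().map((item) => {
  // Coerce to a string so an item with a missing body does not throw
  const text = String(item.json.body ?? "")
    .replace(/\r\n/g, "\n")      // normalize Windows line endings
    .replace(/\n{3,}/g, "\n\n")  // collapse runaway blank lines
    .trim();

  return {
    json: {
      text,
      // Fields the loader can map into document metadata for filtering
      metadata: {
        source: item.json.url,
        doc_type: "help-article",
        updated_at: item.json.modified_at,
      },
    },
  };
});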

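Then on the Recursive Character Text Splitter, a sane starting config for this kind of content, written out as JSON for readability (in the node itself you set these as Chunk Size and Chunk Overlap):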

{
  "chunkSize": 1000,
  "chunkOverlap": 150,
  "splitter": "recursiveCharacterTextSplitter"
}

Re-run ingestion whenever source documents change, and wipe-and-rebuild (not append) if you ever change the embedding model or chunk strategy, so you never mix vector generations in one store.

What to verify before you call it done

Three checks catch nearly every silent RAG failure. First, confirm the embedding node is identical on the ingestion and query sides. Second, open the vector store after ingestion and confirm chunk count looks right (a single PDF producing two chunks means the splitter or loader is misconfigured). Third, run a question whose answer you know is in exactly one document and inspect the retrieved chunks, not just the final answer, to confirm the right text is coming back. If retrieval is good but the answer is wrong, that is a prompt problem; if retrieval is wrong, it is an embedding, chunking, or filtering problem. Diagnosing in that order saves hours.
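
For the second check, a back-of-the-envelope estimate helps you judge whether a chunk count "looks right". The sketch below assumes fixed-width chunking, so treat the result as a rough floor: a separator-aware splitter usually produces somewhat more chunks. Both function names and the 0.5 tolerance are assumptions made for the example.

```javascript
// Rough expected chunk count under fixed-width chunking (illustration only).
function estimateChunkCount(textLength, chunkSize, chunkOverlap) {
  if (chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error("chunkOverlap must be >= 0 and smaller than chunkSize");
  }
  if (textLength <= 0) return 0;
  if (textLength <= chunkSize) return 1;
  // Each chunk after the first advances by (chunkSize - chunkOverlap) characters.
  return Math.ceil((textLength - chunkOverlap) / (chunkSize - chunkOverlap));
}

// Flags an ingestion run whose chunk count is far below the estimate.
// The 0.5 tolerance is an arbitrary starting point, not a recommendation.
function looksUnderChunked(actualCount, textLength, chunkSize, chunkOverlap, tolerance = 0.5) {
  return actualCount < estimateChunkCount(textLength, chunkSize, chunkOverlap) * tolerance;
}

// A 50,000-character document with the 1000 / 150 settings used earlier.
console.log(estimateChunkCount(50000, 1000, 150));   // 59
console.log(looksUnderChunked(2, 50000, 1000, 150));  // true
console.log(looksUnderChunked(64, 50000, 1000, 150)); // false
```

A document of that size landing in the store as two chunks is the misconfiguration signal described above; a count in the region of the estimate or moderately higher is what a healthy run tends to look like.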

If you want a production RAG agent built and tuned on your own document set, with the chunking, metadata schema, and store choice fitted to your data rather than a tutorial default, n8n Logic does exactly this kind of build. The patterns above are the starting point, not the finish line.

FAQ

What is RAG in n8n? RAG (retrieval-augmented generation) in n8n means giving an AI model access to your own documents at answer time. You ingest documents into a vector store as embeddings, then at query time the relevant chunks are retrieved and fed to the model so it answers from your data instead of only its training.

Which vector store should I use in n8n? Default to PGVector if you self-host, because you likely already run Postgres for n8n and it needs no extra service. Use Qdrant when you outgrow PGVector and still want to self-host, and Pinecone or Supabase when you want a managed option. Use the Simple (in-memory) store only for prototyping.

Why is my n8n RAG agent returning irrelevant or empty results? The most common cause is an embedding model mismatch between ingestion and retrieval, which can fail silently, with no error, when the two models share a vector dimension. Confirm both embedding nodes use the exact same model and dimension. After that, check chunk size and overlap, and verify documents actually got inserted into the store.

What chunk size and overlap should I use? There is no universal value, but roughly 800-1200 characters with 10-15% overlap is a reasonable starting point for prose. Larger chunks keep context but blur the embedding; smaller chunks retrieve precisely but lose surrounding context. Tune by inspecting what retrieval actually returns.

Should I use the Question and Answer Chain or the AI Agent? Use the Question and Answer Chain when answering from documents is the entire job; it is simpler, cheaper, and predictable. Use the AI Agent with a Vector Store tool when retrieval is one capability among several, such as combining document search with other tools or multi-turn conversation.

Do I need to re-run ingestion when documents change? Yes. Ingestion is a separate flow from querying, so re-run it (manually or on a schedule) whenever source documents change. If you change the embedding model or chunking strategy, wipe and rebuild the store rather than appending, so you never mix incompatible vector generations.
