If you have spent any time on tech Twitter, you have seen the diagram: user question → embed → vector DB → top-k → LLM → answer. It looks clean. In real projects, the diagram is a lie, but a useful one, like a map that skips the construction zones.
What RAG is really “retrieving”
RAG (retrieval-augmented generation) sounds fancy. At its heart you are shoving your own text into a box so the model can quote something that was never in its training data. The magic is not the vector store logo you picked. The magic is whether your chunks make any sense to begin with.
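To make that concrete, here is the whole diagram in one runnable sketch. The bag-of-words "embedding" is a toy stand-in so the example runs without a model or an API key; the shape is the point, not the parts.

```python
# Toy end-to-end RAG: embed -> search -> prompt. Everything here is
# illustrative; a real system calls an embedding model and an LLM.
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy "embedding": lowercase word counts. Swap in a real model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

chunks = [
    "The rate limit is 100 requests per minute per API key.",
    "Refunds are processed within 5 business days.",
]
index = [(c, embed(c)) for c in chunks]

def answer(question: str, k: int = 1) -> str:
    q = embed(question)
    top = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)[:k]
    context = "\n".join(c for c, _ in top)
    # In a real pipeline this prompt goes to an LLM; here we just return it.
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(answer("How fast can I call the API?"))
```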
I have seen teams spend a week tuning temperature while their documents were split mid-sentence. The model was not “hallucinating” so much as doing its best with homework that was torn in half.
Chunking is a product decision, not a config knob
A fixed 512-token window is a starting point, not a religion. If your docs are API references, you might want one endpoint per chunk. If you are ingesting long legal PDFs, you are in a different world, where a clause may only make sense next to a definition three pages earlier. I usually ask: if a human had only this paragraph, could they answer a reasonable question about it? If not, the embedding will not save you.
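As a sketch of what "boundaries first, budget second" looks like, here is a chunker that packs whole paragraphs under a token budget instead of slicing at exactly 512. Counting whitespace-split words as tokens is an approximation; swap in a real tokenizer if counts matter.

```python
# Pack whole paragraphs into chunks under a token budget instead of
# cutting mid-sentence. A single paragraph longer than the budget still
# becomes its own oversized chunk, which is usually the failure mode
# you want.
def chunk_by_paragraph(text: str, budget: int = 512) -> list[str]:
    chunks, current, used = [], [], 0
    for para in (p.strip() for p in text.split("\n\n") if p.strip()):
        size = len(para.split())  # rough token estimate
        if current and used + size > budget:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(para)
        used += size
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```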
Overlap helps, but it is not free. You pay in storage, in index size, and in duplicate noise when two chunks say almost the same thing. I aim for just enough overlap that a boundary does not eat a definition.
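A sliding window makes the cost visible: with window size w and overlap o, you store roughly w / (w - o) times the original text. A minimal sketch:

```python
# Sliding window with overlap: each window starts (size - overlap)
# tokens after the last one, so overlap 64 on windows of 512 stores
# about 512/448 ~= 1.14x the original text.
def sliding_chunks(tokens: list[str], size: int = 512,
                   overlap: int = 64) -> list[list[str]]:
    stride = size - overlap
    assert stride > 0, "overlap must be smaller than the window"
    return [tokens[i : i + size]
            for i in range(0, max(len(tokens) - overlap, 1), stride)]
```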
The retrieval step is allowed to be dumb (at first)
Top-k nearest neighbors from a single query embedding works surprisingly often, until it does not. When users ask vague questions, or when your data uses different words than they do (the classic vocabulary mismatch), you will get misses that feel random.
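For what it is worth, the dumb baseline fits in a few lines. This sketch assumes you already have an embedding matrix for your chunks:

```python
import numpy as np

def top_k(query_vec: np.ndarray, doc_matrix: np.ndarray,
          k: int = 5) -> np.ndarray:
    # Cosine similarity is a dot product on L2-normalized vectors.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q
    return np.argsort(scores)[::-1][:k]  # indices of the k nearest chunks
```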
That is when people reach for re-rankers, hybrid search, query expansion, the whole kitchen sink. You will probably need some of that eventually. But I like to get a baseline with plain vector search and a few honest eval questions first. If you cannot tell whether things improved, you are just rearranging deck chairs.
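Those honest eval questions can be as crude as a list of (question, id of the chunk that should come back) pairs. The ids and the `retrieve` callable below are placeholders:

```python
# Hit rate at k: did the chunk we know is right show up in the top k?
# `retrieve` is any function mapping a question to ranked chunk ids.
def hit_rate(retrieve, labeled: list[tuple[str, str]], k: int = 5) -> float:
    hits = sum(1 for question, gold_id in labeled
               if gold_id in retrieve(question)[:k])
    return hits / len(labeled)

labeled = [
    ("How do I rotate an API key?", "auth-doc-03"),   # made-up pairs
    ("What is the refund window?", "billing-doc-12"),
]
# print(f"hit@5: {hit_rate(my_retriever, labeled):.0%}")
```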
A word on “grounding”
When the app finally returns an answer with citations, non-technical stakeholders relax. That is good. Just remember: a citation is only as honest as the chunk behind it. The model can still paraphrase in a way that technically points at a source yet misleads. I have started keeping a human in the loop for anything high stakes, or at least logging enough that we can replay what was retrieved when someone complains.
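The replay log does not need infrastructure. One JSON line per request is enough to reconstruct what the user saw; the field names here are just one way to slice it:

```python
# Append-only JSON lines: the question, what was retrieved (ids and
# scores), and the final answer, so a complaint can be replayed later.
import json
import time

def log_interaction(path: str, question: str,
                    retrieved: list[tuple[str, float]], answer: str) -> None:
    record = {
        "ts": time.time(),
        "question": question,
        "retrieved": [{"chunk_id": cid, "score": s} for cid, s in retrieved],
        "answer": answer,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```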
If there is one takeaway: respect the plumbing. The embedding model and the database are fun to argue about, but most of the wins I have seen came from better text going in—cleaner HTML, better titles, and chunk boundaries that a tired engineer at 6pm can still follow.