
Retrieval Augmented Generation (RAG)

  • RAG for accessing external information
  • Retrieval Augmented Item Generation (RAIG)
  • Step 1: Indexing
  • Step 2: Retrieval
  • Step 3: Generation
  • RAG demonstration
  • Specify knowledge-base
  • Build FAISS index
  • Test retrieval pipeline
  • Generated item quality check
  • References

RAG for accessing external information

Large language models (LLMs) have vast amounts of information compressed into their parameters. However, it isn't viable to store all the information an LLM will ever need in its parameters. For one thing, it would make models too large. Information also becomes outdated. LLMs need a way to access new information that wasn't in their training data; otherwise, the expensive process of pre-training would need to be repeated frequently.

Retrieval Augmented Generation (RAG) was proposed by researchers at Facebook (Lewis et al., 2020) to address these challenges. RAG is also useful for reducing hallucinations and when users require references. RAG was specifically designed to improve performance on “knowledge-intensive” tasks by combining information in model weights with information from documents that weren’t in the model training data.

The information in LLM memory is called parametric because it is stored in the model’s parameters. External information is considered non-parametric because it’s stored in documents rather than in the LLM parameters. The aim in RAG is to inject relevant content into the user prompt by appending content retrieved from a search of an external knowledge base. This grounds the generator’s response in relevant contextual information.

RAG involves two models. The first is a small encoder, part of a retrieval module, that encodes the knowledge base as well as the user prompt at inference time and conducts similarity searches. The second is a generator that produces a response from the user query and the retrieved content. The original RAG used an encoder-decoder for generation, and the two models were jointly trained to be optimized for each other. Today decoder-only models are typically used for generation and joint training is less common.

Retrieval Augmented Item Generation (RAIG)

In this section we explain the steps in RAG and a common extension called RAG with re-ranking. We’ll share Python code that shows an end-to-end demonstration of both RAG and RAG with re-ranking for item generation. We use the full IPIP item database to inject context into user prompts, an approach we call RAIG (Retrieval Augmented Item Generation).

In the example, items generated with RAIG are shown to improve on items generated by prompting alone, and RAIG items are more closely aligned with their construct definitions. RAG with re-ranking is shown to improve item alignment further. The work is preliminary and Monte Carlo studies are required. The key steps we now cover are indexing, retrieval, generation, and re-ranking.

Step 1: Indexing

In step one, a database of documents is created to serve as the knowledge base that the RAG model will search for query-relevant information to ground its response. Each of these documents (items) is embedded as a numerical vector in a database. A key decision at this stage is the chunking strategy: documents can be split into sentences, into a fixed number of tokens, or along semantic boundaries before they’re embedded.

If the chunk size is too small, the model may miss important context. If the chunk size is too large, it might retrieve irrelevant context. Sometimes developers will specify deliberate overlap in chunking to avoid context loss when partial chunks are retrieved. Decisions must also be made about whether to store metadata and whether to clean up the data.
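
As a minimal sketch of fixed-size chunking with overlap (the chunk and overlap sizes below are arbitrary placeholders, not recommendations):

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size word chunks that overlap to avoid context loss at boundaries."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks
```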

A database to store the knowledge embeddings must be chosen. Two popular options are FAISS from Facebook and ChromaDB from Chroma. FAISS is the industry standard and is fast at scale. ChromaDB is common for fast prototyping by start-ups. The embedding model must also be chosen. The chunks are then encoded and stored as an index. Index is a term for any searchable structure such as the knowledge embeddings we refer to here.

Step 2: Retrieval

The retrieval pipeline works as follows. The user first asks a question. An embedding model encodes the query using the same method that was used to encode the document chunks in the database. The retriever then compares the query embedding to the stored chunk embeddings to find the top-k most similar chunks using dense passage retrieval methods (Karpukhin et al., 2020). These retrieved chunks are appended to the user’s prompt before the decoder generates a final answer.

Choosing k can be seen as a kind of hyper-parameter tuning. More context can lead to better answers but costs more tokens and increases the risk of irrelevant context. How the retrieved context is presented in the prompt has a big impact on effectiveness, so best practice is to report what was used as context along with your prompt engineering strategy. Newer directions involve query rewriting by the LLM before the FAISS search.
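
A minimal sketch of this retrieval step, assuming a FAISS index and an embedding function like those built later in the demonstration (the function and variable names here are illustrative):

```python
import numpy as np

def retrieve(query, index, chunks, embed, k=5):
    """Embed the query and return the top-k most similar chunks from a FAISS index."""
    query_vec = np.array([embed(query)], dtype="float32")
    distances, ids = index.search(query_vec, k)  # FAISS returns distances and row ids
    return [chunks[i] for i in ids[0]]

def build_prompt(query, retrieved_chunks):
    """Append the retrieved chunks to the user's query before generation."""
    context = "\n".join(f"- {chunk}" for chunk in retrieved_chunks)
    return f"Context:\n{context}\n\nQuestion: {query}"
```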

Step 3: Generation

Generation combines retrieved documents with the model’s reasoning to produce grounded responses to the user query. Irrelevant or conflicting passages can be minimised by re-ranking retrieved text by its similarity to the query using neural scoring methods such as BERT-based re-rankers (Nogueira & Cho, 2019) and by filtering unreliable sources. Long documents can be automatically summarised after retrieval so only concise facts enter the prompt.

Precision can be further improved by lowering the generation temperature of the LLM. To reduce hallucinations, separate verification LLMs can check whether claims are supported by retrieved evidence (Manakul et al., 2023). Because language models attend more to nearby text, placing retrieved content directly before or after the question improves grounding. Including retrieval-time relevance scores or estimated confidence can guide the generator to rely more heavily on high-quality evidence (Nogueira et al., 2020).
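
A minimal sketch of a grounded generation call using the OpenAI Python client; the model name, temperature, and prompt wording are illustrative choices rather than the book's settings:

```python
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

def generate_answer(query, retrieved_chunks, scores=None):
    """Generate a grounded answer with retrieved evidence placed directly before the question."""
    lines = []
    for i, chunk in enumerate(retrieved_chunks):
        score_note = f" (relevance: {scores[i]:.2f})" if scores is not None else ""
        lines.append(f"- {chunk}{score_note}")
    evidence = "\n".join(lines)
    prompt = (
        "Answer using only the evidence below. If the evidence is insufficient, say so.\n\n"
        f"Evidence:\n{evidence}\n\nQuestion: {query}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,       # lower temperature for more precise, grounded output
    )
    return response.choices[0].message.content
```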

RAG demonstration

To improve item quality in psychometrics it's useful to consider what you can do before item generation, during item generation, and after generation. Before item generation, you can fine-tune a decoder or improve your prompt engineering strategy. During item generation itself you can use Retrieval Augmented Generation (RAG). Because items are generated using RAG, we call the approach RAIG. Download the full Python notebook and the IPIP item file required to run the demonstration.

Specify knowledge-base

RAG requires a knowledge base, so our first step is to load all items from the IPIP database into a dataframe in Jupyter using Python. We will use OpenAI's text-embedding-3-small model for our embeddings. The code requires setting an environment variable; see the previous section on generating items via an API for how to do this. The code then sends the items to OpenAI in batches for speed to obtain their embeddings.
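
A minimal sketch of this step, assuming the IPIP items sit in a CSV file with an item-text column (the file name, column name, and batch size are placeholders rather than the notebook's actual values):

```python
import pandas as pd
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

# Load the IPIP items into a dataframe (file and column names are placeholders).
items = pd.read_csv("ipip_items.csv")
texts = items["item_text"].tolist()

def embed_batch(batch):
    """Request embeddings for a batch of item texts from text-embedding-3-small."""
    response = client.embeddings.create(model="text-embedding-3-small", input=batch)
    return [record.embedding for record in response.data]

# Embed in batches for speed rather than one request per item.
embeddings = []
batch_size = 100  # placeholder batch size
for start in range(0, len(texts), batch_size):
    embeddings.extend(embed_batch(texts[start:start + batch_size]))
```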

Build FAISS index

We next need to build a FAISS index for similarity search. The notebook converts the text embeddings into a NumPy array with float32 dtype. It initializes a FAISS IndexFlatL2 index using Euclidean distance to enable efficient similarity searches across the 3805 item vectors. The index is populated with these vectors, allowing fast retrieval of items most similar to a query embedding.
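
A minimal sketch of this step, continuing from the embeddings above and assuming the faiss-cpu package is installed:

```python
import faiss
import numpy as np

# FAISS expects a float32 NumPy array of shape (n_items, embedding_dim).
embedding_matrix = np.array(embeddings, dtype="float32")

# Flat L2 index: exact Euclidean-distance search over all item vectors.
index = faiss.IndexFlatL2(embedding_matrix.shape[1])
index.add(embedding_matrix)
print(index.ntotal)  # should equal the number of IPIP items in the dataframe
```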

Test retrieval pipeline

We tested the retrieval pipeline by prompting for 20 Achievement-striving items. The pipeline embedded the query using the text-embedding-3-small model and searched the FAISS index for the most similar items. In the IPIP database, many items align with multiple scales. The low precision, recall, and F1 results in the figure may reflect this, pointing to fuzzy semantic boundaries between labels like Achievement-striving and related constructs (e.g., Ambition/Drive).
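
A minimal sketch of such a test, reusing the index and embedding helper from the earlier sketches; the query wording and the label column name are illustrative:

```python
import numpy as np

query = "Achievement-striving: works hard, sets high standards, and persists toward goals"
query_vec = np.array(embed_batch([query]), dtype="float32")

k = 20
distances, ids = index.search(query_vec, k)
retrieved = items.iloc[ids[0]]

# Precision and recall against the scale labels stored with the items
# (the 'label' column name is a placeholder for whatever the item file uses).
relevant = items["label"] == "Achievement-striving"
hits = (retrieved["label"] == "Achievement-striving").sum()
precision = hits / k
recall = hits / relevant.sum()
print(f"precision={precision:.2f}, recall={recall:.2f}")
```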

These retrieval results may still be reasonable for IPIP item retrieval, where items often have overlapping meanings; the face relevance of the retrieved items suggests adequate embedding quality (see the notebook). The text-embedding-3-small model, trained on general web data rather than psychometric-specific data, may limit performance, and fine-tuning could improve results. Additionally, the uneven distribution of items across scales may affect retrieval metrics.

[Figure: precision, recall, and F1 for the Achievement-striving retrieval test]

Generated item quality check

Usability of the results will ultimately depend on item performance with empirical data. However, we can check semantic alignment with construct definitions and even conduct pseudo-factor analysis before item trialling. To do so, we generated 100 items for the Care/Harm dimension of the Moral Foundations model using prompt engineering alone, prompt engineering with context injection from RAG, and prompt engineering with RAG and re-ranking.

Re-ranking rescores an initial FAISS retrieval subset for a better semantic match. A cross-encoder model processes the query and item text together to produce a single similarity score, indicating how semantically relevant the item is to the query. This allows items to be reordered by relevance. The notebook uses the cross-encoder/ms-marco-MiniLM-L-6-v2 model from sentence-transformers as the re-ranker. Only the top k (10) items with the highest cross-encoder scores are kept after re-ranking, to improve relevance.
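
A minimal sketch of this re-ranking step using the CrossEncoder class from sentence-transformers; the candidate items are assumed to come from an initial FAISS retrieval like the one above:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidate_items, top_k=10):
    """Score each (query, item) pair jointly and keep the top_k items with the highest scores."""
    pairs = [(query, item) for item in candidate_items]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidate_items, scores), key=lambda pair: pair[1], reverse=True)
    return [item for item, score in ranked[:top_k]]
```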

The retrieval pipeline testing is not relevant in this case, because the database does not contain items labelled Care/Harm. What is relevant is the distribution of cosine similarities between the item text and the construct definition across prompting alone, prompting with RAG, and prompting with RAG and re-ranking. The results show that RAG marginally outperforms prompting alone, and RAG with re-ranking outperforms RAG.
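
A minimal sketch of this alignment check, reusing the embed_batch helper from the earlier sketch; the construct definition and generated items shown are illustrative placeholders, not the book's wording:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder construct definition and generated items (illustrative only).
care_harm_definition = ("Care/Harm: sensitivity to the suffering of others, "
                        "including compassion and a desire to protect them from harm.")
generated_items = ["I go out of my way to comfort people who are upset.",
                   "I feel distressed when I see someone being treated cruelly."]

definition_vec = embed_batch([care_harm_definition])[0]
item_vecs = embed_batch(generated_items)
similarities = [cosine_similarity(vec, definition_vec) for vec in item_vecs]
print(np.mean(similarities))
```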

[Figure: cosine similarity distributions between generated items and the Care/Harm construct definition for prompting alone, prompting with RAG, and prompting with RAG and re-ranking]

Overall, these results are encouraging. The compute cost is negligible, at less than US$0.05. In addition, the method is easily implemented using the notebooks that are shared, which may be adapted to use RAG with any other context data you have on hand for item writing, and potentially for other psychometric tasks. I emphasise that the ultimate quality marker of your psychometric items is their performance in live use, so empirical data is always required. Given the low cost and easy implementation, and that empirical data will be collected anyway, using RAG in psychometrics can make good sense when appropriate knowledge data is available.

References

Karpukhin, V., Oguz, B., Min, S., Lewis, P. S., Wu, L., Edunov, S., et al. (2020, November). Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 6769–6781).

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., … Riedel, S. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33, 9459–9474.

Manakul, P., Liusie, A., & Gales, M. J. F. (2023). SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. arXiv preprint arXiv:2303.08896.

Nogueira, R., & Cho, K. (2019). Passage re-ranking with BERT. arXiv preprint arXiv:1901.04085.

Nogueira, R., Jiang, Z., & Lin, J. (2020). Document ranking with a pretrained sequence-to-sequence model. arXiv preprint arXiv:2003.06713.
