Synthetic Data for Human-Preference Modelling: EMNLP 2023 Overview Part 3
In the rapidly evolving field of language model (LM) development, the role of preference modeling has become increasingly crucial. This post delves into the latest trends and observations in preference modeling, with a particular focus on generating synthetic data for human preference modeling, which is essential for techniques such as RLHF and DPO.
Aligning Large Language Models through Synthetic Feedback
Problem Statement: Aligning Large Language Models (LLMs) with human values, crucial for sophisticated model steering, typically relies on extensive human demonstrations and feedback, or on insights distilled from proprietary LLMs like ChatGPT. This work seeks to reduce that dependence by leveraging synthetic feedback for effective LLM alignment.
Approach Idea:
The hypothesis is that larger, optimally prompted models produce superior responses.
Step 1 - Reward Modeling with Synthetic Feedback: synthetic comparison data is collected based on the following assumptions (a minimal sketch follows the list):
- Larger model > Smaller model
- More few-shot examples > Fewer few-shot examples
- Better demonstration > Worse demonstration
Note: the authors also highlight important post-processing steps to improve the quality of this synthetic data.
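A minimal sketch of how these ranking assumptions can be turned into synthetic comparison pairs. The models (gpt2 / gpt2-large as stand-ins for smaller and larger LLaMA-family models), the prompt, and the demonstrations are illustrative assumptions, not the paper's actual setup.

```python
# Hedged sketch: a response from a larger model prompted with more/better demonstrations
# is *assumed* to outrank one from a smaller model with fewer demonstrations.
from transformers import pipeline

PROMPT = "Explain why the sky is blue."

strong = pipeline("text-generation", model="gpt2-large")  # stand-in for the larger model
weak = pipeline("text-generation", model="gpt2")          # stand-in for the smaller model

def respond(generator, prompt, demos):
    """Generate a response conditioned on an optional few-shot prefix."""
    prefix = "\n\n".join(demos) + "\n\n" if demos else ""
    out = generator(prefix + prompt, max_new_tokens=128, do_sample=True)
    return out[0]["generated_text"][len(prefix + prompt):]

demos = ["Q: What causes rain?\nA: Water vapour condenses into droplets that fall ..."]
chosen = respond(strong, PROMPT, demos)   # larger model + more/better shots -> assumed better
rejected = respond(weak, PROMPT, [])      # smaller model + zero shots -> assumed worse

comparison = {"prompt": PROMPT, "chosen": chosen, "rejected": rejected}  # one RM training pair
```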
Step 2 - Reward-Model-guided Self-Play (RMSP): the RM trained in Step 1 guides self-play to simulate high-quality demonstrations, with rejection sampling used to keep only the best-scoring responses (sketched below).
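A minimal best-of-N sketch of the rejection-sampling idea behind RMSP; `policy_generate` and `rm_score` are hypothetical callables, not the paper's interfaces.

```python
def rm_guided_sample(prompt, policy_generate, rm_score, n=8):
    """Sample n candidate responses and keep the one the reward model scores highest."""
    candidates = [policy_generate(prompt) for _ in range(n)]
    scores = [rm_score(prompt, c) for c in candidates]
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best]  # kept as a high-quality synthetic demonstration
```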
Step 3 - Reinforcement Learning from Synthetic Feedback (RLSF): the model fine-tuned on RMSP demonstrations is further optimized against the synthetic RM (a schematic follows).
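A schematic of one RLSF pass, analogous to the PPO stage of RLHF. The `policy`, `reward_model`, and `ppo_step` interfaces are hypothetical placeholders, not the authors' code.

```python
def rlsf_epoch(policy, reward_model, prompts, ppo_step):
    """One pass of reinforcement learning from synthetic feedback."""
    rewards = []
    for prompt in prompts:
        response = policy.generate(prompt)             # roll out the current policy
        reward = reward_model.score(prompt, response)  # synthetic feedback signal
        ppo_step(policy, prompt, response, reward)     # e.g. a PPO/actor-critic update
        rewards.append(reward)
    return sum(rewards) / len(rewards)                 # track mean synthetic reward
```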
Key Result:
- The ALMoST model significantly outperforms models such as Alpaca (distilled from InstructGPT; Ouyang et al., 2022) and Dolly-v2.
- In human evaluations, ALMoST was preferred over Alpaca and Dolly-v2 in 55.0% and 58.5% of comparisons, respectively, demonstrating a notable advantage in alignment with human values.
Learning Preference Model for LLMs via Automatic Preference Data Generation
Problem Statement: The study addresses the heavy reliance on human-annotated data in training preference models for LLMs, which limits their versatility and scalability.
Approach Idea:
- AutoPM uses two primary methods: In-Breadth Data Generation and In-Depth Data Generation.
In-Breadth Data Generation elicits pairwise preference data from LLMs based on helpful-honest-harmless (HHH) criteria, as sketched after the steps below.
- Create Pairs of Phrases: form phrase pairs around the Helpful, Honest, and Harmless (HHH) principles, where each pair contrasts adherence to and violation of a principle.
- Use Templates to Craft Prompts: insert these phrase pairs into specially designed templates, producing prompts that steer the LLM toward generating specific types of responses.
- Generate Two Types of Responses: feeding these prompts to the LLM yields two responses per query: one that aligns with the HHH criteria (the 'positive' sample) and one that violates them (the 'negative' sample).
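A hedged sketch of in-breadth pair generation, assuming the OpenAI chat API with gpt-3.5-turbo (consistent with the paper's use of GPT-3.5 as the data source); the phrase pairs and template wording are illustrative, not the paper's.

```python
from openai import OpenAI

client = OpenAI()

# Contrastive HHH phrase pairs (positive behaviour, negative behaviour) -- illustrative only.
HHH_PHRASE_PAIRS = [
    ("be as helpful and informative as possible", "be unhelpful and dismissive"),
    ("answer truthfully and admit uncertainty", "confidently state made-up facts"),
    ("keep the reply respectful and careful", "be rude and careless in the reply"),
]

TEMPLATE = "Respond to the user query below. When answering, {behaviour}.\n\nQuery: {query}"

def generate(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def in_breadth_pair(query: str, pair_idx: int = 0) -> dict:
    """Produce one (positive, negative) response pair for a query."""
    positive_behaviour, negative_behaviour = HHH_PHRASE_PAIRS[pair_idx]
    chosen = generate(TEMPLATE.format(behaviour=positive_behaviour, query=query))
    rejected = generate(TEMPLATE.format(behaviour=negative_behaviour, query=query))
    return {"query": query, "chosen": chosen, "rejected": rejected}
```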
In-Depth Data Generation employs a sequential post-editing process, applying edit actions (deletion, substitution, insertion) to LLM-generated responses to vary their quality; a rough sketch follows the list.
- Deletion involves eliminating parts of the LLM response that are relevant and helpful in addressing the query.
- Substitution involves modifying the content of the LLM response in a way that makes it unsuitable for the query.
- Insertion entails adding new content to the LLM response that is not related to the query at hand.
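A rough sketch of the in-depth edit actions using simple sentence-level heuristics of my own (not the paper's implementation): each successive edit is assumed to degrade the response, yielding a quality-ordered chain for preference training.

```python
import random

def delete_relevant(response: str) -> str:
    """Deletion: drop a sentence, removing content that helped address the query."""
    sentences = [s for s in response.split(". ") if s]
    if len(sentences) > 1:
        sentences.pop(random.randrange(len(sentences)))
    return ". ".join(sentences)

def substitute_content(response: str, replacement: str = "I am not sure about this.") -> str:
    """Substitution: replace a sentence with content that no longer suits the query."""
    sentences = [s for s in response.split(". ") if s]
    sentences[random.randrange(len(sentences))] = replacement
    return ". ".join(sentences)

def insert_irrelevant(response: str, filler: str = "By the way, penguins cannot fly.") -> str:
    """Insertion: append content unrelated to the query."""
    return response + " " + filler

def in_depth_chain(response: str) -> list[str]:
    """Apply the edits sequentially; each version is assumed worse than the previous one."""
    v1 = delete_relevant(response)
    v2 = substitute_content(v1)
    v3 = insert_irrelevant(v2)
    return [response, v1, v2, v3]  # ranked best to worst
```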
This method ensures broad coverage of quality and adherence to human preference guidelines. The study uses GPT-3.5 as the main source for synthetic preference data generation and employs LLaMA for model development.
Key Result: AutoPM shows strong performance, with its predictions aligning closely with both human and GPT-4 judgments. The AutoPM-7B and AutoPM-30B variants demonstrate superior accuracy on five benchmark datasets. Notably, on Vicuna-13B vs. Alpaca-7B comparisons, AutoPM performs particularly well, underscoring its effectiveness in preference modeling. Additionally, the model significantly improves the response quality of LLMs like Alpaca and Vicuna.
Axiomatic Preference Modeling for Longform Question Answering
Problem Statement: Existing Reward Models (RMs) used in RLHF post-training lack direct knowledge of the principles behind human preference annotations, which leads to misalignment with human preferences; this research tackles that gap.
Approach Idea:
The authors define principles (axioms) that humans desire in longform answers, covering usefulness, relevance, groundedness, truthfulness, and thoroughness.
- These principles are then used to construct candidate answer pairs “axiomatically” such that one answer is clearly preferred along a certain principle.
- Preference data for usefulness and relevance is curated from human preference signals, namely upvotes/downvotes on Reddit and Stack Exchange, supplemented with hard negatives obtained via negative sampling.
- An LLM (ChatGPT) is used to generate the data for the remaining three principles (groundedness, truthfulness, thoroughness). A hedged sketch of both data routes follows.
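A sketch of the two pair-construction routes described above; the data fields, prompts, and helper names are illustrative assumptions, not the paper's code.

```python
def pairs_from_votes(question, answers):
    """Usefulness/relevance pairs from community vote signals.
    `answers` is a list of dicts like {"text": ..., "score": upvotes - downvotes}."""
    ranked = sorted(answers, key=lambda a: a["score"], reverse=True)
    pairs = []
    for better, worse in zip(ranked, ranked[1:]):
        if better["score"] > worse["score"]:  # keep only clear preference gaps
            pairs.append({"question": question,
                          "chosen": better["text"],
                          "rejected": worse["text"]})
    return pairs

def pairs_from_llm(question, llm_generate):
    """Groundedness/truthfulness/thoroughness pairs: contrast a careful, evidence-grounded
    answer with a deliberately brief, speculative one (prompts are illustrative)."""
    chosen = llm_generate(f"Answer thoroughly, using only the given evidence:\n{question}")
    rejected = llm_generate(f"Answer in one short sentence and feel free to speculate:\n{question}")
    return [{"question": question, "chosen": chosen, "rejected": rejected}]
```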
Key Result:
The preference model, comprising about 220M parameters, aligns more closely with human-annotated preference labels than GPT-4 does, and it effectively scores both human- and LLM-generated answers, outperforming GPT-4 at preference scoring.