Reinforcement Learning from Human Feedback
Reinforcement Learning from Human Feedback (RLHF) is a technique that fine-tunes language models using human-generated preference data to align model outputs with desired behaviors. vLLM can be used to generate the completions for RLHF.
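As a minimal sketch of the generation side, the example below samples candidate completions with vLLM's offline `LLM` API; the model name, prompt, and sampling settings are illustrative assumptions, not part of any particular RLHF recipe:

```python
from vllm import LLM, SamplingParams

# Illustrative model and sampling settings; an RLHF pipeline would plug in
# its own policy checkpoint and decoding configuration.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")
sampling_params = SamplingParams(n=4, temperature=1.0, top_p=0.95, max_tokens=256)

prompts = ["Explain the difference between supervised fine-tuning and RLHF."]
outputs = llm.generate(prompts, sampling_params)

for request_output in outputs:
    for completion in request_output.outputs:
        # Each completion is one sampled rollout for the corresponding prompt.
        print(completion.text)
```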
The following open-source RL libraries use vLLM for fast rollouts (listed alphabetically; non-exhaustive):
If you don't want to use an existing library, see the following basic examples to get started:
- Training and inference processes are located on separate GPUs (inspired by OpenRLHF)
- Training and inference processes are colocated on the same GPUs using Ray (a simplified, Ray-free sketch of the colocated pattern follows this list)
- Utilities for performing RLHF with vLLM
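As a rough illustration of the colocated setup, the sketch below alternates generation and training on the same GPU, using vLLM's sleep mode to release GPU memory between phases. It omits Ray and real weight synchronization; the model name and the `train_step` stub are assumptions:

```python
from vllm import LLM, SamplingParams

def train_step(rollouts):
    # Hypothetical placeholder for one optimizer step on the policy;
    # a real setup would also push the updated weights back into vLLM.
    pass

# enable_sleep_mode lets vLLM offload weights and discard the KV cache
# so the training framework can use the freed GPU memory.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", enable_sleep_mode=True)
sampling_params = SamplingParams(n=4, temperature=1.0, max_tokens=128)
prompts = ["Write a short poem about gradient descent."]

for _ in range(3):  # a few illustrative RL iterations
    rollouts = llm.generate(prompts, sampling_params)  # generation phase
    llm.sleep(level=1)    # release vLLM's GPU memory for training
    train_step(rollouts)  # training phase on the same GPU
    llm.wake_up()         # restore vLLM before the next generation phase
```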
See the following notebooks showing how to use vLLM for GRPO:
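As a taste of what such notebooks cover, here is a minimal sketch of GRPO's group-relative advantage computation over vLLM samples; the model name and the rule-based reward function are illustrative assumptions:

```python
from vllm import LLM, SamplingParams

def reward(text: str) -> float:
    # Hypothetical rule-based reward; GRPO pipelines typically use a
    # verifier or a reward model here.
    return 1.0 if "42" in text else 0.0

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")  # assumed policy checkpoint
group_size = 8  # number of completions sampled per prompt
sampling_params = SamplingParams(n=group_size, temperature=1.0, max_tokens=128)

outputs = llm.generate(["What is 6 times 7?"], sampling_params)

for request_output in outputs:
    rewards = [reward(c.text) for c in request_output.outputs]
    mean_r = sum(rewards) / len(rewards)
    std_r = (sum((r - mean_r) ** 2 for r in rewards) / len(rewards)) ** 0.5
    # GRPO normalizes each reward against its own group of completions,
    # so no separate value network is needed to estimate advantages.
    advantages = [(r - mean_r) / (std_r + 1e-6) for r in rewards]
    print(advantages)
```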