Offline Regularised Reinforcement Learning for Large Language...
The dominant framework for aligning large language models (LLMs), whether through reinforcement learning from human feedback or direct preference optimisation, is to learn from preference data....