Thought Preference Optimization (TPO): The model is prompted to generate a thought process before its final response. TPO shows significant performance gains on non-reasoning categories such as translation, marketing, and health; reasoning categories such as math and analysis also improve.
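The thought-then-response prompting described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the `Thoughts:` / `Response:` markers, and the helper names are all hypothetical. The pairing step assumes that a judge has already scored only the response part of each sample and that the best and worst samples form a preference pair for training.

```python
from dataclasses import dataclass

# Hypothetical prompt template; the wording and markers are assumptions,
# not taken from the source.
THOUGHT_MARKER = "Thoughts:"
RESPONSE_MARKER = "Response:"
THOUGHT_PROMPT = (
    "Respond to the following user query. First write your internal thoughts "
    "after the marker 'Thoughts:', then write your final answer after the "
    "marker 'Response:'.\n\nUser query: {query}"
)


@dataclass
class ThoughtOutput:
    thought: str
    response: str


def build_prompt(query: str) -> str:
    """Wrap a user query in the thought-then-response instruction."""
    return THOUGHT_PROMPT.format(query=query)


def split_output(text: str) -> ThoughtOutput:
    """Split a raw model output into its thought and response parts."""
    thought, sep, response = text.partition(RESPONSE_MARKER)
    if not sep:
        # No response marker found: treat the whole text as the response.
        return ThoughtOutput(thought="", response=text.strip())
    thought = thought.replace(THOUGHT_MARKER, "", 1).strip()
    return ThoughtOutput(thought=thought, response=response.strip())


def make_preference_pair(outputs: list[str], scores: list[float]) -> tuple[str, str]:
    """Return (chosen, rejected) full outputs given one score per output.

    Assumption: each score was computed on the response part only, so the
    thought is optimized indirectly through the quality of the response.
    """
    if len(outputs) != len(scores) or len(outputs) < 2:
        raise ValueError("need at least two outputs with one score each")
    best = max(range(len(scores)), key=scores.__getitem__)
    worst = min(range(len(scores)), key=scores.__getitem__)
    return outputs[best], outputs[worst]
```

In use, one would sample several outputs per prompt from `build_prompt(query)`, pass only `split_output(o).response` to the judge, and feed the resulting pairs to a preference-optimization trainer.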
RLAIF