The OpenAI team built a new benchmark dataset, SimpleQA, that evaluates large language models' (LLMs) ability to answer factual questions. A particularly intriguing aspect of the paper is how, in this era of LLMs, the researchers leverage LLMs in their own workflow to design, iterate on, and analyze a new dataset.
Read More

Thought Preference Optimization (TPO)
Thought Preference Optimization (TPO): prompt the model to generate an internal thought process before its final response. TPO delivers significant performance gains on non-reasoning categories, including translation, marketing, and health; reasoning categories such as math and analysis also improve.
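The thought-then-response format can be sketched as follows. This is a minimal illustration, not the paper's exact prompt: the tag names, template wording, and parser are assumptions made for the example.

```python
# Hypothetical TPO-style template: the model writes its reasoning
# between <thought> tags, then the user-visible response after them.
THOUGHT_PROMPT = (
    "Respond to the query below. First write your internal reasoning "
    "between <thought> and </thought> tags, then write the final "
    "response after the closing tag.\n\nQuery: {query}"
)

def split_thought(model_output: str) -> tuple[str, str]:
    """Separate the hidden thought from the user-visible response."""
    open_tag, close_tag = "<thought>", "</thought>"
    start = model_output.find(open_tag)
    end = model_output.find(close_tag)
    if start == -1 or end == -1:
        # No thought section found; treat everything as the response.
        return "", model_output.strip()
    thought = model_output[start + len(open_tag):end].strip()
    response = model_output[end + len(close_tag):].strip()
    return thought, response

thought, response = split_thought(
    "<thought>Keep the idiom natural in French.</thought>Bonne chance !"
)
print(thought)    # → Keep the idiom natural in French.
print(response)   # → Bonne chance !
```

Only the response part would be shown to the user (or to a judge model), which is what lets the thought process improve outputs without cluttering them.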
Read More