LLM-generated labels for topic classification
In a recent post, I trained a model to classify news articles by topic using just their URLs as input features. This approach, I argued, demonstrated the power of a lightweight dataset, provided the signal in the data was strong enough.
But even with this kind of model, there's a limit to how lightweight the data can be: it still requires relevant labels. You either train on a pre-existing dataset and hope it transfers to your use case, or you generate your own labels, and the latter adds a costly prerequisite.
Maybe there's a way to reduce that cost, though. A recent HuggingFace blog post discusses synthetic data annotation via large language models:
In 2023, one development fundamentally changed the machine-learning landscape: LLMs started reaching parity with human data annotators. There is now ample evidence showing that the best LLMs outperform crowd workers and are reaching parity with experts in creating quality (synthetic) data (e.g. Zheng et al. 2023, Gilardi et al. 2023, He et al. 2023). It is hard to overstate the importance of this development. The key bottleneck for creating tailored models was the money, time, and expertise required to recruit and coordinate human workers to create tailored training data. With LLMs starting to…
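To make the idea concrete, here is a minimal sketch of what LLM-generated labels could look like for a URL-based topic classifier. Everything here is an assumption for illustration: the OpenAI Python client, the `gpt-4o-mini` model name, the topic list, and the example URLs are mine, not the setup from the HuggingFace post.

```python
import json
from openai import OpenAI  # assumes openai>=1.0 and an OPENAI_API_KEY in the environment

# Hypothetical label set; swap in whatever taxonomy your classifier needs
TOPICS = ["politics", "sports", "business", "technology", "entertainment"]

client = OpenAI()

def label_url(url: str) -> str:
    """Ask the LLM to assign exactly one topic label to a news article URL."""
    prompt = (
        f"Classify the news article at this URL into exactly one of {TOPICS}. "
        f"Respond with the label only.\n\nURL: {url}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any capable chat model should work
        temperature=0,        # keep labels as deterministic as possible
        messages=[{"role": "user", "content": prompt}],
    )
    label = response.choices[0].message.content.strip().lower()
    # Fall back to a sentinel so off-format generations don't pollute the dataset
    return label if label in TOPICS else "unknown"

# Example usage with made-up URLs
urls = [
    "https://example.com/sports/2024/01/05/team-wins-title",
    "https://example.com/politics/election-results-analysis",
]
labels = {u: label_url(u) for u in urls}
print(json.dumps(labels, indent=2))
```

The zero-temperature call and the fallback label are small hygiene choices: a synthetic annotation pipeline is only useful if malformed generations can't silently leak into the training set.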