Snake eating its tail: how can synthetic data possibly work for training AI?
LLM training is rapidly exhausting the stock of existing human-written data. So we're now training LLMs on synthetic datasets, i.e., data produced by the LLM itself.
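To make that loop concrete, here's a minimal sketch of what generating synthetic training data from a model can look like, using the Hugging Face transformers library. The model name, prompts, and quality filter are illustrative assumptions on my part, not a description of any lab's actual pipeline.

```python
# Minimal sketch of a synthetic-data loop: a model generates text,
# the text is filtered, and the survivors become new training examples.
# The model name, prompts, and filter below are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for whatever base model you'd actually use
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompts = [
    "Explain why the sky is blue.",
    "Write a short proof that the sum of two even numbers is even.",
]

synthetic_dataset = []
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    # Sample several variations per prompt: the "imaginative variation" step.
    outputs = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.9,
        max_new_tokens=128,
        num_return_sequences=4,
        pad_token_id=tokenizer.eos_token_id,
    )
    for seq in outputs:
        text = tokenizer.decode(seq, skip_special_tokens=True)
        # Crude quality filter; a real pipeline would use a reward model,
        # a verifier, or human review here.
        if len(text.split()) > 30:
            synthetic_dataset.append(text)

# The texts in synthetic_dataset would then be fed back into ordinary
# causal-LM fine-tuning, closing the loop this post is about.
print(f"Kept {len(synthetic_dataset)} synthetic examples")
```

The filter step matters: synthetic-data pipelines typically keep only generations that pass some verifier, reward model, or human check before feeding them back into training.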
The strange thing is that training on synthetic data actually works for LLMs.
To state the strangeness plainly: training an LLM on data it produced itself yields a better LLM. But how can this work?
Isn't this the snake eating its tail? Isn't this like a dung beetle growing by eating its own feces? It almost seems to go against conservation of energy.
After all, LLMs are just interpolation, aren't they? This process seems like it should be a closed loop of self-cannibalism in which quality inevitably degrades.
My hypothesis is that training a model with synthetic data is less like a snake eating its own tail and more like learning from imaginative variations of things you already know. I'll describe this with some analogies.
Hypothesis: Thinking in an Empty Room
Let's start with a thought experiment. We, as humans, explore ideas just by thinking more.
If you were put in an empty room with only a blank notepad and a pen, could you come out of that room with significantly more knowledge than when you went in, given time and no other external stimulation?
This is a situation with almost no new data about the world, yet my answer is yes: I'd still be able to create new knowledge.
If you were left in this room forever, would you run out of things to think about?
I think the answer is no. You would never run out of things to think about, even though there's no new external stimulus. (I'd go mad from the lack of social stimulation, but that's beside the point. I'd keep thinking in my madness.)
Assuming I came out of the room before I lost my sanity, I imagine I'd have gained knowledge while in the room.
The analogy here is that the LLM is the thinker in that empty room: it can keep thinking forever, using just its existing thoughts and building on them.
Like us in the room, it creates new knowledge by combining and varying existing data.
Dreams as Varied Training Data
This is similar to how I think of dreams. When you sleep, you tend to wake up smarter, having assimilated the day's events, particularly if you're sleeping enough (e.g., over 7.5 hours).
Dreams are often strange, yet relatable. Your brain is making up situations that are subtly different but still somehow connected to the data it already knows as reality.
In this sense, dreams can be like data augmentation.
Analogy: Data Augmentation
Here's my most direct technical analogy. In machine learning, you can improve the performance of image classifiers (e.g., detecting cats and dogs) simply by doing data augmentation.
Quick explanation of data augmentation for people who don't know what it is:
Data augmentation is a standard ML technique used during training: you transform the data to vary it, without changing its essence, which improves the final model's ability to generalize.
Data augmentation can be very simple. Taking an example from the Fast.ai course, you can take a set of cat photos and apply some basic transforms. Just by rotating the cats 90 degrees, skewing them, zooming in on them, and so on - making them different while keeping them distinctly "catty" - you can significantly improve your model's ability to generalize, i.e., to understand what "catness" is.
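To make this concrete, here's a minimal sketch of that kind of augmentation using torchvision's transforms. The image path and the specific transform parameters are illustrative assumptions, not the exact recipe from the Fast.ai course.

```python
# Minimal image-augmentation sketch using torchvision.
# The image path and parameters are illustrative, not the Fast.ai recipe.
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=90),                # rotate the cat
    transforms.RandomAffine(degrees=0, shear=15),         # skew it
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),  # zoom/crop it
    transforms.RandomHorizontalFlip(),                     # mirror it
])

cat = Image.open("cat.jpg")  # hypothetical input photo

# Each call produces a different variation of the same photo,
# still distinctly "catty", but never pixel-identical.
variations = [augment(cat) for _ in range(8)]
```

Each pass through the pipeline yields a slightly different cat: variation without changing the essence.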
This is data augmentation. Variation to produce better understanding.
Conclusion: Contemplation, not Self-Cannibalism
My hypothesis, restated: training a model with synthetic data is less like a snake eating its own tail and more like learning from imaginative variations of things you already know.
Modelling potential new situations in your mind is useful. You're not just repeating the same information; you're creating new knowledge by varying what you already know - without the need for external stimuli.
Therefore, the learning process for LLMs using synthetic data isn't a closed loop of degradation. It seems more like a form of internal contemplation, an exploration of the latent space.