Discussion about this post

Alfredo Zorrilla, MBA

Synthetic data sounds like an interesting and useful concept, but it's important to remember that the "biases" and "inconsistencies" it tries to remove can be exactly what makes such data "real".

Why are synthetic diamonds cheap and natural ones expensive?

The beauty of natural diamonds comes from the impurities present in the carbon crystal, which don't occur in a synthetic one (too "perfect" because it is created by deposition).

If you are creating a new type of nuclear weapon, simulations (synthetic) can only take you so far... but at some point you need to produce a real detonation to get the real data that your simulations couldn't predict.

My take is: synthetic data needs to be intertwined with real data, or we risk creating models that are dangerously blind to real situations our normalized data cannot predict.

Dominika Michalska

Thanks for sharing your thoughts, Alfredo! You make a compelling point, and the diamond analogy is a powerful one. I do think there's truth in what you're saying: some of the messiness in real-world data holds meaning that synthetic data might smooth over. Those imperfections can carry important context, especially in edge cases.

That said, I’d push back a little. Not all bias or inconsistency in real data is valuable—some of it reflects systemic issues, poor measurement, or unbalanced sampling that skews models in ways we may not want. In those cases, replicating that “reality” isn’t always the right goal.

I agree that synthetic data isn’t a silver bullet, and models built on it alone risk being disconnected from the world they’re meant to operate in. But I’m not sure the answer is just to preserve all the noise in real data, either. It might be that the real challenge is learning to tell the difference between a meaningful imperfection and a harmful one...
