The Synthetic Data Solution: Bypassing the Headache of Data Privacy

By Mostly AI & Siqi Lin

From predictive analytics to machine learning, data remains a foundational element in the success of the AI-driven technologies fueling the growth of today’s enterprise-level businesses. Though we know processing and sharing big data assets drives analytical progress, fosters innovation, and enables smarter products and services, it also poses one of the larger challenges facing today’s digitized world: data privacy. As enterprises struggle with how to reconcile data-driven innovation with data protection, the use of synthetic data in lieu of actual, privacy-sensitive data has emerged as a powerful potential solution. 


What is Synthetic Data? 

Synthetic data is information that has been artificially manufactured based on real-world data using an AI algorithm. Synthetic data retains many of the same attributes and correlations as its source, regulated data. 


Why Synthetic Data?

While other methods have been developed to bypass data privacy constraints, including data anonymization and manual generation of synthetic data, these solutions are overly simplistic, prone to human bias and error, and destroy much of the value in the data. 

By contrast, the field of artificial intelligence has made tremendous progress in recent years. Generative models are at the forefront of academic research in this domain, because they can train on available (unlabeled) data and then generate new synthetic data points that follow the same distribution. The synthetic data generated by an AI engine can reproduce the correlations and patterns of the original data in a way that no other current solution can achieve, especially within a similar time frame.
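To make the idea of "train on real data, then generate new data points" concrete, here is a deliberately tiny sketch. It is not how Nvidia or Mostly AI build their models: a simple two-column Gaussian stands in for the deep generative model, and the age and income figures are invented for illustration. The point is only to show the workflow of fitting a model to real records, sampling brand-new records from it, and checking that the correlation survives.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def fit_gaussian(xs, ys):
    """Estimate means, standard deviations and correlation from the real data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return mx, my, sx, sy, pearson(xs, ys)

def sample_synthetic(params, n, rng):
    """Draw n new (x, y) records from the fitted bivariate Gaussian."""
    mx, my, sx, sy, rho = params
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * z1
        # Mixing z1 into y is what carries the correlation over.
        y = my + sy * (rho * z1 + math.sqrt(1 - rho ** 2) * z2)
        rows.append((x, y))
    return rows

rng = random.Random(0)
# Hypothetical "real" data: age, and an income that depends on age.
ages = [rng.uniform(20, 65) for _ in range(5000)]
incomes = [1000 * a + rng.gauss(0, 8000) for a in ages]

params = fit_gaussian(ages, incomes)
synthetic = sample_synthetic(params, 5000, rng)
syn_ages, syn_incomes = zip(*synthetic)

real_corr = pearson(ages, incomes)
syn_corr = pearson(syn_ages, syn_incomes)
print(f"real correlation:      {real_corr:.3f}")
print(f"synthetic correlation: {syn_corr:.3f}")
```

None of the synthetic rows is a copy of a real row, yet the two printed correlations should come out close to each other. A production engine replaces the Gaussian with a far more expressive model so that non-linear patterns, categorical columns and sequences are preserved as well.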

On top of accuracy, efficiency is another benefit that a synthetic data solution brings. Getting access to privacy-sensitive data to test an innovative idea usually takes weeks or even months. With AI-generated synthetic data, that wait can be shortened to days.



Who is Pioneering Synthetic Data?

Nvidia - Synthetic Image Generation

[Image 1] (Source: Nvidia)

Synthetic image generation has shown impressive results, helped by the large-scale image datasets available for research. The images released by Nvidia look like real people but are in fact synthetic, created by a deep generative model trained on a dataset of 30,000 celebrity photos.

Apart from using synthetic images just for fun, the same methods can open new perspectives for enhancing AI for autonomous driving. Headquartered in Silicon Valley, Nvidia can easily train its self-driving cars in sunny California weather, but not all states are like California. What happens when a car runs into rain or snow? Nvidia generates synthetic images of more extreme weather to train its self-driving cars so they can adapt to different environments.

[Image 2] (Source: Nvidia)

Synthetic images provide important information for self-driving cars, and the technology could be applied in many other situations. The results on images are inspiring, and the methods can be transferred to other domains, to both structured and unstructured data. 


Mostly AI: Synthetic Data Engine (PNP Enterprise 2.0 Batch Startup)

A group of top AI scientists from Austria experienced first-hand how painful it is for large enterprises to access data in privacy-sensitive domains. They decided to create a synthetic data engine and founded Mostly AI in 2017. With their Synthetic Data Engine, synthetic versions of privacy-sensitive data can be generated within a short time frame while retaining the properties, structure and correlations of the real data. These synthetic datasets can then be used as a drop-in replacement for real data in all data workflows with no loss in accuracy.

[Image 3] (Source: Mostly AI)

With the derived synthetic data having no direct relationship to actual persons within the original dataset, there remains no possibility for re-identification: the generated output is truly anonymous and thus exempt from privacy regulations.


How to Use Synthetic Data?

With synthetic data, organizations are free to use, share and monetize the data without being restricted by privacy constraints or regulations. This opens up the way for a broad range of use cases and opportunities to be safely and quickly explored.


Testing & Development

Provide realistic and representative data, instead of privacy-sensitive production data, to your internal developers as well as your external partner network. Instantly sharing a rich and accurate representation of the actual data will boost the development process, help design smart UX, improve testing, and make complex end-to-end integrations less error-prone.


AI Training & Analytics

Running analytical queries and building machine learning models both require access to large amounts of record-level data. Differentially private synthetic data allows the unrestricted provision of such data without the risk of exposing individuals' privacy. Computation can take place in less restrictive environments and be carried out by a larger group of people.
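For readers wondering what "differentially private" means in practice, the sketch below shows one textbook approach, and not necessarily the one any vendor mentioned here uses: count the real records per bin, add Laplace noise to each count, and sample synthetic values from the noisy histogram only. The function names, the age bins and the epsilon value are all assumptions made for illustration.

```python
import random

def laplace_noise(scale, rng):
    """Laplace(0, scale) sample as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def dp_synthetic_ages(real_ages, bins, epsilon, n_out, rng):
    """Sample synthetic values from a Laplace-noised histogram of real_ages."""
    # Count how many real records fall into each [lo, hi) bin.
    counts = [sum(lo <= a < hi for a in real_ages) for lo, hi in bins]
    # Adding or removing one person changes one count by 1 (sensitivity 1),
    # so each count gets Laplace noise with scale 1/epsilon.
    noisy = [max(0.0, c + laplace_noise(1 / epsilon, rng)) for c in counts]
    # Pick bins in proportion to the noisy counts, then a uniform value inside.
    chosen = rng.choices(bins, weights=noisy, k=n_out)
    return [rng.uniform(lo, hi) for lo, hi in chosen]

rng = random.Random(42)
# Hypothetical "real" ages of 2,000 customers.
real_ages = [rng.uniform(20, 70) for _ in range(2000)]
bins = [(20, 30), (30, 40), (40, 50), (50, 60), (60, 70)]

synthetic_ages = dp_synthetic_ages(real_ages, bins, epsilon=1.0, n_out=1000, rng=rng)
print(f"{len(synthetic_ages)} synthetic ages, e.g. {synthetic_ages[0]:.1f}")
```

The synthetic values are drawn from the noisy counts alone, never from individual records, which is what limits how much any single person can influence the output. A smaller epsilon means more noise: stronger privacy, but a less faithful histogram.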


Open Big Data & Data Monetization

Openly share and/or monetize your big data assets without putting privacy at risk, either directly or via emerging data marketplaces. Data providers can offer data at its fullest detail, with a high share of the original information retained, and can thus provide much higher value to their data consumers than existing anonymization techniques allow.


Marketing Optimization

Leverage your historical data to optimize your own or your partners' sales and marketing activities by realistically simulating customer behavior, while fully respecting individuals' privacy. The patterns and sequences of ad displays, price promotions and actual conversions provide valuable information for serving relevant offers to new and existing customers.


The arrival of synthetic data has brought enterprises to an exciting turning point in how they leverage big data assets. It can relieve the headache large corporations face when data assets are locked up by privacy restrictions and are therefore inaccessible to a broader group of smart and innovative people. Synthetic data enables large enterprises to collaborate more closely with startups, innovation partners, research institutions and machine learning experts alike, and to become data-driven in the real sense.