Microsoft AI Research Exposes Terabytes of Sensitive Data on GitHub

Microsoft AI researchers unintentionally exposed tens of terabytes of sensitive data while publishing open-source training data on GitHub, with a link that pointed to a misconfigured Azure storage bucket. This accidental exposure has raised significant concerns regarding data security at one of the world’s tech giants.


Discovery by Cloud Security Startup Wiz


Cloud security startup Wiz uncovered this security lapse during its ongoing investigation into cloud-hosted data exposures. According to reports, Wiz stumbled upon a GitHub repository belonging to Microsoft’s AI research division. What they found was deeply concerning.

The repository was intended to provide open-source code and AI models for image recognition, and users were directed to download these models from an Azure Storage URL. However, an oversight led to the misconfiguration of this URL, granting access not only to the intended model files but to the entire storage account, revealing a host of sensitive information.


The Extent of the Data Leak


Among the exposed data, a staggering 38 terabytes of sensitive information came to light. This included personal backups from the computers of two Microsoft employees, laying bare personal data and potentially compromising their privacy.

Furthermore, the data breach exposed passwords to various Microsoft services, secret keys, and over 30,000 internal Microsoft Teams messages from hundreds of employees.


Configuration Errors Amplify the Risk


The gravity of the situation was exacerbated by configuration errors. The exposed URL was not set to “read-only” permissions but rather to “full control,” opening the door for malicious activities.

Those aware of the issue could potentially delete, replace, or inject malicious content into the exposed data, further heightening the risks associated with this incident.



Shared Access Signature Token (SAS) Oversight


Notably, the storage account itself was not directly exposed. Instead, the oversight came from an overly permissive shared access signature (SAS) token embedded within the URL. SAS tokens are a standard mechanism used by Azure to create shareable links that grant access to an Azure Storage account’s data.
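Azure’s real SAS format is more involved, but the core idea — an HMAC-signed URL whose query string encodes the granted permissions and an expiry time — can be sketched in plain Python. This is a simplified illustration only: the function name, parameter names, and string-to-sign below are hypothetical and do not match Azure’s actual signing scheme.

```python
import base64
import hashlib
import hmac
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

def make_sas_url(base_url: str, account_key: bytes,
                 permissions: str, valid_hours: int) -> str:
    """Build a SAS-style signed URL.

    Simplified illustration only -- Azure's real string-to-sign,
    parameter set, and versioning differ.
    """
    expiry = (datetime.now(timezone.utc)
              + timedelta(hours=valid_hours)).strftime("%Y-%m-%dT%H:%M:%SZ")
    # Sign the URL together with the permissions and expiry, so neither
    # can be altered without invalidating the signature.
    string_to_sign = f"{base_url}\n{permissions}\n{expiry}"
    signature = base64.b64encode(
        hmac.new(account_key, string_to_sign.encode(), hashlib.sha256).digest()
    ).decode()
    # "sp" = permissions, "se" = expiry, "sig" = signature
    # (query parameter names mirror Azure's conventions)
    query = urlencode({"sp": permissions, "se": expiry, "sig": signature})
    return f"{base_url}?{query}"

# A safer token grants read-only ("r") access for a limited window;
# the leaked token effectively granted full control over the account.
url = make_sas_url(
    "https://example.blob.core.windows.net/models/model.bin",
    b"hypothetical-account-key",  # placeholder, not a real key
    permissions="r",
    valid_hours=24,
)
```

The key point for this incident: whoever mints the token chooses both the permission scope and the expiry, and nothing in the URL itself advertises how broad those choices were.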

This lax oversight of SAS tokens raises concerns about the security protocols in place within Microsoft’s AI development teams.


Microsoft Responds and Expands GitHub’s Security Measures


Upon discovering the extent of the data exposure, Wiz promptly notified Microsoft on June 22, and Microsoft took immediate action by revoking the SAS token on June 24. Subsequently, the tech giant conducted an investigation into the potential organisational impact, which concluded on August 16.

Microsoft responded to the incident in a blog post, saying that “no customer data was exposed, and no other internal services were put at risk.” As a direct response to the incident, Microsoft announced the expansion of GitHub’s secret scanning service. This enhancement will monitor all public open-source code changes for the inadvertent exposure of credentials and other secrets, including SAS tokens that may have overly permissive expirations or privileges.
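A scanner of this kind essentially pattern-matches committed text for credential-shaped strings. The sketch below uses hypothetical, much-simplified regular expressions — GitHub’s real secret scanning relies on far more precise, provider-supplied detectors — but it shows the shape of the check, including flagging a far-future expiry like the 2051 date on the leaked token:

```python
import re

# Hypothetical, simplified patterns; real detectors are far stricter.
SAS_SIGNATURE = re.compile(r"[?&]sig=[A-Za-z0-9%+/=]{20,}")
EXPIRY_YEAR = re.compile(r"[?&]se=(\d{4})-")

def find_sas_leaks(text: str, current_year: int = 2023) -> list[str]:
    """Flag lines carrying a SAS-style signature, noting long expirations."""
    findings = []
    for line in text.splitlines():
        if SAS_SIGNATURE.search(line):
            msg = "possible SAS token"
            m = EXPIRY_YEAR.search(line)
            if m and int(m.group(1)) - current_year > 1:
                msg += f" with far-future expiry ({m.group(1)})"
            findings.append(msg)
    return findings

sample = ("MODEL_URL=https://acct.blob.core.windows.net/c/m.bin"
          "?sp=r&se=2051-10-05&sig=abc123DEF456ghi789JKLmno")
print(find_sas_leaks(sample))
```

Running this on the sample line reports a possible SAS token with a 2051 expiry; a line with no `sig=` parameter produces no findings.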


The Broader Implications


This incident serves as a reminder of the challenges tech companies face as they harness the power of AI and handle vast amounts of data. As data scientists and engineers rush to develop cutting-edge AI solutions, extensive security checks and safeguards are more important than ever. The incident at Microsoft shows the growing difficulty in monitoring and preventing data breaches as data manipulation, sharing, and collaboration become integral aspects of AI development.

In conclusion, the accidental data exposure by Microsoft’s AI research division raises critical questions about data security in the AI era and serves as a cautionary tale for organisations worldwide. It underscores the importance of robust security measures and vigilant oversight in an age where data is not just an asset but also a potential liability.