Microsoft Copilot Reveals Confidential GitHub Repositories

The Dangers of Zombie Data: Implications for GitHub Security
In August 2024, a post on LinkedIn raised concerns that Microsoft Copilot — and, by extension, ChatGPT — could access the contents of private GitHub repositories. If the claims were true, this could pose major risks for data security and privacy.
Investigating the Claims
The team at Lasso, a digital security company, began investigating the claims made in the LinkedIn post. They discovered a phenomenon they referred to as “Zombie Data”: data that was cached by search engines while publicly accessible and that remains retrievable even after the source is switched to private or deleted altogether.
The inquiry started when Lasso searched for the GitHub repository named in the post. They found that the repository had been indexed by Bing while it was still public, but it was no longer available directly on GitHub. When asked about the repository, ChatGPT wasn’t pulling data from direct access. Instead, it relied on indexed content, which can still appear in search results due to Bing’s caching system. This led Lasso to realize that repositories that transitioned from public to private could still have portions of their data accessible via cached content.
Discovering the Impact on Internal Repositories
As part of their research, Lasso examined one of their own GitHub repositories and found it still indexed by Bing despite being set to private. Internal audits revealed that this repository had briefly been made public, which allowed cached data to be retrievable through Bing’s indexing.
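Organizations can run a similar check on their own repositories. As a rough sketch — assuming access to Bing’s Web Search API v7 (the endpoint and subscription-key header are Microsoft’s documented ones, but the organization and repository names below are placeholders, and key handling and error handling are simplified) — the following Python snippet asks Bing whether a repository’s URL still appears in its index:

```python
import json
import os
import urllib.parse
import urllib.request

BING_ENDPOINT = "https://api.bing.microsoft.com/v7.0/search"

def build_query(org: str, repo: str) -> str:
    """Restrict the search to the repository's own GitHub URL."""
    return f"site:github.com/{org}/{repo}"

def is_indexed_by_bing(org: str, repo: str, api_key: str) -> bool:
    """Return True if Bing still reports web results for the repository."""
    params = urllib.parse.urlencode({"q": build_query(org, repo)})
    req = urllib.request.Request(
        f"{BING_ENDPOINT}?{params}",
        headers={"Ocp-Apim-Subscription-Key": api_key},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    # The "webPages" field is absent when Bing has no indexed results.
    return "webPages" in data

if __name__ == "__main__":
    key = os.environ.get("BING_API_KEY")
    if key:
        # Placeholder names -- substitute your own organization and repo.
        print(is_indexed_by_bing("example-org", "example-repo", key))
```

If a repository that is now private still returns results for such a query, its cached content should be treated as exposed.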
Using ChatGPT for further analysis produced limited results in terms of actionable information. However, when tested with Microsoft Copilot, the AI tool was able to extract actual data from the repository as it existed during its public phase, indicating that Copilot had access to what Lasso called the “Zombie Data.” This raised significant alarms about the potential risks to data privacy, especially for organizations that believe their sensitive information is secure.
Addressing the Risks of Zombie Data
The findings raised notable concerns regarding data security on platforms like GitHub. Lasso identified several critical issues:
- Persistence of Zombie Data: Data that was once public can linger indefinitely in caches like Bing’s, making it retrievable even after a repository is switched to private. Lasso highlighted that any information previously public could be exploited by tools like Microsoft Copilot.
- Sensitive Information Exposure: Private code repositories that are accidentally made public can reveal critical data, such as credentials and access tokens. This poses a serious risk if the information falls into the wrong hands.
- Microsoft’s Role: The situation was worsened by Microsoft Copilot’s ability to access cached data through Bing, raising questions about whether Microsoft’s interconnected tools adequately protect user data.
Widespread Data Exposure
To assess the scope of the issue, Lasso used Google BigQuery to analyze GitHub activity and compiled a list of repositories that had transitioned from public to private in 2024. Their methodology included:
- Identifying Public Activity: They pinpointed repositories that had shown public activity during 2024 but had since gone private or been deleted.
- Exploring Bing’s Cache: Searches were conducted to find cached versions of these missing repositories in Bing.
- Scanning for Sensitive Information: The cached data was analyzed to identify any exposed sensitive information, including credentials and security tokens.
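Lasso has not published its scanning code, but the third step can be illustrated with a minimal Python sketch. The GitHub (`ghp_`/`gho_`/`ghs_`/`ghr_`) and AWS (`AKIA…`) prefixes follow those services’ documented token formats; the generic rule is a loose heuristic, and real scanners ship far larger rule sets:

```python
import re

# Illustrative patterns only -- production scanners (e.g. gitleaks)
# use much more extensive rule sets.
PATTERNS = {
    # Classic GitHub tokens: a documented 4-char prefix + 36 alphanumerics.
    "github_token": re.compile(r"\bgh[pors]_[A-Za-z0-9]{36}\b"),
    # AWS access key IDs: documented "AKIA" prefix + 16 uppercase chars.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # Loose heuristic for key/secret/token assignments.
    "generic_secret": re.compile(
        r"(?i)\b(?:api[_-]?key|secret|token)\b\s*[:=]\s*['\"]?"
        r"([A-Za-z0-9/+_-]{16,})"
    ),
}

def scan_text(text: str) -> list[tuple[str, str]]:
    """Return (rule_name, matched_string) pairs found in cached page text."""
    findings = []
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append((name, match.group(0)))
    return findings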
The results were alarming:
- Over 20,580 GitHub repositories could still be accessed through Bing’s cache despite having been set to private or deleted.
- 16,290 organizations were affected, including major corporations like Microsoft, Google, and IBM.
- More than 100 vulnerable packages and over 300 exposed credentials linked to various platforms were found.
Microsoft’s Response to the Findings
After being informed about the vulnerabilities, Microsoft acknowledged the issue but classified it as “low severity.” Despite this assessment, they acted quickly by removing the Bing cached link feature and disabling the domain used for storing cached pages. Unfortunately, this addressed only the surface of the problem: while the cached pages were no longer visible to human users, Copilot quietly retained access to the underlying data.
Further testing in early 2025 revealed Copilot’s continued ability to retrieve data from a GitHub repository that had been deleted, reinforcing concerns about the effectiveness of safeguards in place.
Moving Forward
The emergence of Large Language Models (LLMs) like Microsoft Copilot introduces a new challenge to organizational data security. Their capacity to uncover cached information creates vulnerabilities that many organizations may not be prepared to handle.
Lasso outlined several recommendations for companies to secure their data:
- Treat Public Data as Compromised: Any data that becomes public should be assumed at risk, as it can be accessed later by AI and indexing systems.
- Adapt Security Monitoring: Companies need to focus on monitoring the interactions between AI tools and sensitive data to prevent unintended exposure.
- Enforce Strict Access Controls: Organizations must ensure that AI systems respect data access permissions to prevent over-sharing.
- Maintain Cyber Hygiene: Basic cybersecurity practices, such as keeping sensitive information private and avoiding hardcoded tokens, remain critical.
Lasso’s findings and Microsoft’s muted response underscore the ongoing challenges posed by Zombie Data and highlight the need for organizations to be vigilant about how their data is managed, particularly as advanced AI technologies become widespread.