Bluesky Users Discuss Strategies for User Data and AI Training

Recent Bluesky Proposal on Data Use
Bluesky, a rising social networking platform, has recently put forward a significant proposal on GitHub. This initiative aims to give users options regarding the scraping of their posts and data, so they can control whether their content may be used for purposes such as training generative AI models or public archiving.
Overview of the Proposal
The CEO of Bluesky, Jay Graber, shared insights about this proposal during a talk at the South by Southwest festival and later expanded on the topic in a post on Bluesky. The initiative has prompted diverse reactions from users, with some expressing concern that it signals a departure from Bluesky’s initial commitment to user privacy. Bluesky initially stated that it would not sell user data to advertisers or use it for AI training, and this new proposal has left some users feeling uneasy.
One user, Sketchette, shared a strong reaction, stating, “The beauty of this platform was the NOT sharing of information. Especially gen AI. Don’t you cave now.” This sentiment reflects a widespread anxiety among users about the implications of this new direction.
The Rationale Behind the Proposal
In response to user concerns, Graber explained that generative AI companies are already using public data from various sources, including Bluesky. She pointed out that the information shared on Bluesky is publicly available, similar to content on a typical website. Hence, the platform’s proposal seeks to establish a “new standard” for managing how data can be scraped, similar to the well-known robots.txt file employed by websites to guide web crawlers.
Understanding the Robots.txt File
The robots.txt file is a simple text file that webmasters use to communicate with web crawlers about which parts of their site should not be accessed. While this provides a mechanism for setting permissions, it is worth noting that it is not legally enforceable. The proposed standard by Bluesky aims to create a similar system that offers ethical guidelines rather than legal restrictions.
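As an illustration, a minimal robots.txt file placed at a site's root might look like the following (the crawler name `ExampleAIBot` is a placeholder, not a real crawler):

```text
# Block a hypothetical AI-training crawler from the entire site
User-agent: ExampleAIBot
Disallow: /

# All other crawlers may access everything except /private/
User-agent: *
Disallow: /private/
```

Compliance is voluntary: a well-behaved crawler fetches this file and honors the rules, but nothing technically prevents a crawler from ignoring them, which is the same limitation Bluesky's proposed standard would face.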
User Control Over Data Scraping
According to the proposal, users on the Bluesky platform—or other applications built on the underlying AT Protocol—could adjust their settings to control data usage in several categories. These categories include:
- Generative AI: Users can opt out of having their data used for AI training.
- Protocol Bridging: Controlling whether posts can be carried over to other social networks via bridging services.
- Bulk Datasets: Allowing or disallowing the creation of bulk datasets from their data.
- Web Archiving: Deciding if their posts can be archived by organizations like the Internet Archive.
If a user chooses to prevent their data from being utilized in generative AI training, the proposal anticipates that companies and research teams will respect this preference during data scraping or bulk transfers.
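The proposal's exact record format is not described in this article, but the expected behavior can be sketched. Under the assumption that each account exposes a set of boolean intent flags matching the four categories above, a compliant scraper might filter posts like this (the `UserIntents` type, its field names, and the default values are all hypothetical, not part of the actual proposal):

```python
# Hypothetical sketch of a scraper honoring declared user intents.
# The real proposal's schema and defaults may differ.
from dataclasses import dataclass

@dataclass
class UserIntents:
    # One flag per category from the proposal; defaults are assumptions.
    allow_generative_ai: bool = True
    allow_protocol_bridging: bool = True
    allow_bulk_datasets: bool = True
    allow_web_archiving: bool = True

def filter_for_ai_training(posts, intents_by_author):
    """Keep only posts whose authors have not opted out of AI training.

    posts: list of (author, text) pairs.
    intents_by_author: mapping from author to UserIntents. Authors with
    no declared intents fall back to the defaults above; a more
    conservative scraper might instead treat missing intents as opt-out.
    """
    kept = []
    for author, text in posts:
        intents = intents_by_author.get(author, UserIntents())
        if intents.allow_generative_ai:
            kept.append((author, text))
    return kept

# Example: bob has opted out, so only alice's post is retained.
posts = [("alice", "hello"), ("bob", "hi")]
intents = {"bob": UserIntents(allow_generative_ai=False)}
print(filter_for_ai_training(posts, intents))  # [('alice', 'hello')]
```

The key design point is that enforcement happens on the scraper's side: the user merely publishes a preference, and compliance depends on the scraper checking it, just as with robots.txt.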
Responses from the Community
Molly White, known for her "Citation Needed" newsletter and the blog "Web3 is Going Just Great," offered a different perspective. She called it “a good proposal” and argued that it is not about inviting more AI scraping, but about letting users express their preferences regarding scraping that is already happening.
However, White noted a significant limitation: the proposal relies on scrapers voluntarily honoring these preferences, which could undermine its effectiveness. Past experience has shown that some companies simply ignore similar signals, such as those set out in robots.txt files.
Summary
Bluesky’s recent proposal is an attempt to address user concerns about data scraping while striving to set a new standard for data usage in the digital landscape. As the discussion unfolds, it remains critical to evaluate how effectively these measures can protect user preferences in an evolving technological environment.