Novel Approach Enables DeepSeek and Other Models to Address ‘Sensitive’ Queries

Understanding Language Models and Censorship Issues
Large language models (LLMs) have become a crucial part of artificial intelligence (AI) technology, but they often carry biases and can be subject to built-in censorship. One notable example is DeepSeek, a model developed in China that has raised concerns among U.S. politicians and business leaders over its implications for national security.
Congressional Concerns Over DeepSeek
The U.S. Congress recently released a report describing DeepSeek as a “profound threat” to national security and recommending policy changes to mitigate the risks associated with such AI models. The concern stems not only from how DeepSeek processes information but also from how it might manipulate or censor content.
Innovations in Bias Mitigation
While various strategies exist for addressing bias in language models, such as reinforcement learning from human feedback (RLHF) and fine-tuning, CTGT, an enterprise risk management startup, claims to have developed a different approach: a method it says can eliminate bias and censorship entirely.
In a recent paper, researchers Cyril Gorlla and Trevor Tuttle explain how their framework directly targets and modifies the internal features linked to censorship in language models. They emphasize that the method is computationally efficient and maintains the model’s integrity and factual accuracy while delivering uncensored responses.
Framework Overview
CTGT’s method was initially designed for the DeepSeek-R1-Distill-Llama-70B model but has proven effective with other models as well, including other Llama variants. According to Gorlla, the technology operates at the foundational level of neural networks, making it applicable to a wide range of deep learning models.
Steps in the Process
- Feature Identification: Identify the internal features whose activation correlates with unwanted behavior, such as censoring certain topics.
- Feature Isolation: Determine which of those features actually control the unwanted behavior, as opposed to merely co-occurring with it.
- Dynamic Modification: Integrate a mechanism into the inference pipeline that adjusts these features at runtime to minimize censorship.
Gorlla and Tuttle illustrate the method with sensitive prompts, such as questions about Tiananmen Square. By analyzing the model’s responses, they identify the activation patterns that lead to censorship; with that information, they can isolate and manipulate the features behind those responses. A minimal sketch of what this identification step could look like follows.
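CTGT has not published the full details of its pipeline, so as an illustration only, the sketch below approximates the identification step with a common representation-engineering technique: contrasting hidden-state activations on prompts the stock model censors against prompts it answers, and taking the difference of means as a candidate “censorship direction.” The model name (a smaller distilled sibling of the 70B model), the probed layer, and the tiny prompt sets are all placeholder assumptions, not CTGT’s actual setup.

```python
# Illustrative sketch only: approximates "feature identification" with a
# difference-of-means direction over hidden states. Model name, layer index,
# and prompt sets are placeholders, not CTGT's published configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"  # small stand-in for the 70B model
LAYER = 16  # hypothetical layer to probe; in practice one would sweep layers

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

def mean_hidden_state(prompts: list[str], layer: int) -> torch.Tensor:
    """Average the last-token hidden state at `layer` over a set of prompts."""
    states = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        # out.hidden_states[layer] has shape (batch, seq_len, hidden_dim)
        states.append(out.hidden_states[layer][0, -1, :].float())
    return torch.stack(states).mean(dim=0)

# Hypothetical contrast sets: prompts the stock model censors vs. ones it answers.
censored_prompts = ["Tell me about the events at Tiananmen Square in 1989."]
neutral_prompts = ["Tell me about the history of the Great Wall of China."]

# The difference of means gives a candidate "censorship direction" in activation space.
direction = mean_hidden_state(censored_prompts, LAYER) - mean_hidden_state(neutral_prompts, LAYER)
direction = direction / direction.norm()
```

In practice one would use far more prompts per set and check, layer by layer, how well the resulting direction separates censored from answered prompts before trusting it.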
Enhanced Model Responsiveness
In their experiments, the original DeepSeek model answered only 32% of a set of controversial prompts, whereas the modified model answered 96%. The remaining 4% concerned highly explicit content that is not typically amenable to unrestricted responses.
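Readers who want to gauge a comparable compliance rate on their own prompt sets could use a harness along the lines of the sketch below. The keyword-based refusal detector and the `generate` callable are crude stand-ins; CTGT’s actual benchmark and classifier have not been published.

```python
# Sketch of a compliance-rate check in the spirit of the 32% -> 96% comparison.
# The refusal markers are a simplistic heuristic, not CTGT's evaluation method.
REFUSAL_MARKERS = ("i cannot", "i can't", "i am unable", "sorry, but")

def is_refusal(response: str) -> bool:
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def compliance_rate(generate, prompts: list[str]) -> float:
    """`generate` is any prompt -> response callable wrapping a model."""
    answered = sum(not is_refusal(generate(p)) for p in prompts)
    return answered / len(prompts)

# Usage idea: call compliance_rate once with the stock model and once with the
# modified model on the same prompt set, then compare the two percentages.
```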
CTGT emphasizes that while these modifications reduce unnecessary censorship, they do not lead to reckless or irresponsible output, and the model’s integrity and accuracy are preserved throughout.
Efficiency and Adaptation Benefits
CTGT’s approach differs from traditional fine-tuning in that it changes no model weights and supplies no new example responses. The modification therefore takes effect immediately, starting with the very next token generated, and the model can switch between behaviors on demand without any permanent alteration.
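This weight-free, instantly reversible behavior maps naturally onto runtime activation hooks. As one plausible (again hypothetical) implementation, the sketch below reuses the `model`, `LAYER`, and `direction` objects from the earlier identification sketch and projects the direction out of a layer’s output with a PyTorch forward hook, which can be attached and detached on the fly; whether CTGT’s actual mechanism works this way is an assumption.

```python
# Sketch of a weight-free, toggleable intervention: project the candidate
# censorship direction out of one layer's output at inference time.
# Reuses `model`, `LAYER`, and `direction` from the identification sketch;
# assumes a Llama-style module layout (model.model.layers). The off-by-one
# between hidden-state indices and layer modules is glossed over here.

def ablate_direction(module, inputs, output):
    """Forward hook: remove the censorship direction from this layer's output."""
    hidden = output[0] if isinstance(output, tuple) else output
    d = direction.to(device=hidden.device, dtype=hidden.dtype)
    hidden = hidden - (hidden @ d).unsqueeze(-1) * d  # subtract the projection onto d
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

# Attaching the hook changes no weights and applies from the next token onward.
handle = model.model.layers[LAYER].register_forward_hook(ablate_direction)

# ... generate here with the censorship feature suppressed ...

# Detaching restores the stock behavior instantly, with no retraining.
handle.remove()
```

Because the hook only transforms activations in flight, removing it returns the exact original model, which is what makes switching between behaviors cheap.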
Safety and Future Implementation
In light of the U.S. government’s scrutiny of DeepSeek, AI developers are actively exploring ways to ensure models remain safe to use. Because definitions of “safety” vary, giving companies an efficient way to manage bias and censorship themselves may prove valuable.
Gorlla highlights that businesses need to trust their AI models to align with their goals and regulatory requirements. This adaptability is particularly vital in high-stakes sectors such as finance, healthcare, and security, where mistakes can result in serious consequences.
By offering companies the capability to adapt their AI to meet specific needs without costly overhauls, CTGT’s technology could prove invaluable in the evolving landscape of artificial intelligence.