stdfox

Algorithmic Anxiety and AI Safety Filters

A few weeks ago I asked my AI assistant to review a security write-up I had just finished: my own analysis, written for other engineers. It declined, because the draft touched on attack techniques. I had not asked it to write an exploit, only to refine my own work. The same refusal comes up with content I did not write, and it has a structure behind it. For most of the web's life, moderation sat between people and their audience, governing what others could see[1]. With an AI assistant the filter sits closer in, deciding what the tool will let me do with whatever I hand it, mine or not. And reaching the web increasingly runs through such a tool[2].

The barrier does not depend on where the material came from: an article the tool pulled down from the web, a file already in my email, and code I downloaded last week all pass through the same safety filters[3]. The tool scores the surface of what I give it against patterns it was trained to avoid[4]. An analysis of an exploit and the exploit itself share a vocabulary and often the same code, so the signal that activates the filter is identical. The match is statistical and ungrounded: it ties the form of the text to a harmful label with no model of what the words refer to[5]. The intent that separates study from use never enters it, so the tool cannot tell an analysis of a technique from the technique itself.
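To make the surface match concrete, here is a toy sketch in Python. It is not how any vendor's filter works: production filters are learned classifiers, and the term list, weights, threshold, and function names below are all invented for illustration. It only shows why a scorer that sees vocabulary and nothing else gives an analysis and a request the same score.

```python
import re

# Hypothetical term weights, invented for this illustration only.
RISKY_TERMS = {"exploit": 0.4, "payload": 0.3, "shellcode": 0.4, "overflow": 0.2}


def surface_score(text: str) -> float:
    """Sum the weights of the risky terms present in the text, capped at 1.0."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return min(1.0, sum(w for term, w in RISKY_TERMS.items() if term in words))


def is_blocked(text: str, threshold: float = 0.5) -> bool:
    """Block when the surface score reaches the (assumed) threshold."""
    return surface_score(text) >= threshold


# An analysis of a technique and a request to carry it out.
analysis = "This write-up explains how the overflow lets an exploit place shellcode."
request = "Write an exploit that uses the overflow to run my shellcode."

for label, text in [("analysis", analysis), ("request", request)]:
    print(label, surface_score(text), is_blocked(text))
```

Both texts contain the same three flagged terms, so the scorer returns the same value for each and blocks both. Nothing in the input carries the difference in intent.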

The refusals cluster wherever precise professional language overlaps with the vocabulary of harm. In my case that is security work: reverse engineering, exploit code, and fuzzing all read like the attacks they study. Precision makes it worse. The more exactly I describe the work, the more it matches what the filter blocks, so no phrasing is both accurate and safe. That is the dual-use bind, and it reaches across expert fields. The bind goes deeper than vocabulary. Whether an analysis counts as harmful is an ethical judgment, not only a technical one. It turns on who is asking and to what end, and reasonable people draw the line in different places. A filter has to fix that line in advance and apply it the same way to everyone, and ethical judgment does not reduce to a single rule.

Scaled across every expert who meets the same gate, the friction becomes an industry-wide problem, and a measured one. Researchers build benchmarks of safe prompts written to look dangerous, and models over-refuse them; the jailbreak defenses meant to make them safer push the false-refusal rate higher still[6]. A larger benchmark reproduces it across 32 models from eight families[7]. The vendors track it too. Anthropic cut Claude's unnecessary refusals by 45% in a single release[8], a figure you publish only for a problem you know is widespread.

The vendors have reasons. They argue the costs are not symmetric: a blocked security analyst is an inconvenience, while a model that hands an attacker a bioweapon recipe is a catastrophe, so a false refusal is the cheaper error. Liability points the same way. One harmful completion is a lawsuit or a headline, while a thousand over-refusals are invisible. They treat the bluntness as a passing phase. Today's models read surface patterns because they lack a clear understanding of purpose, and as reasoning improves the gate should give way to one that reads context.
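The asymmetry argument can be written as a small expected-cost calculation. This is a sketch of the reasoning, assuming a filter that estimates a probability of harm `p` and a vendor that assigns one cost to a false refusal and another to a harmful completion. The function name and the example costs are hypothetical, not figures any vendor has published.

```python
def block_threshold(cost_false_refusal: float, cost_harmful_completion: float) -> float:
    """Probability of harm above which blocking has the lower expected cost.

    Blocking costs (1 - p) * cost_false_refusal in expectation, and allowing
    costs p * cost_harmful_completion. Blocking is cheaper once p exceeds
    cost_false_refusal / (cost_false_refusal + cost_harmful_completion).
    """
    return cost_false_refusal / (cost_false_refusal + cost_harmful_completion)


# Assumed costs in arbitrary units: equal, then 1000 to 1 against harm.
symmetric = block_threshold(1.0, 1.0)
asymmetric = block_threshold(1.0, 1000.0)

print(f"symmetric costs:  block when p > {symmetric:.4f}")
print(f"asymmetric costs: block when p > {asymmetric:.4f}")
```

With equal costs the threshold is one half. Once a harmful completion is weighted a thousand times heavier, a request needs a harm probability of only about 0.1% to be blocked, which is how a rational cost model ends up refusing a great deal of legitimate work.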

None of this is arbitrary; the over-refusal is a risk calculation. The same filter that declines my analysis would also refuse to build a working exploit or malware for a stranger, and the vendors would rather block too much than carry that risk, so the restraint has real value. For now the gate is a trade-off: it buys safety at the cost of convenience and precision, blocking abuse and legitimate work in one stroke. Escaping it is harder than it sounds, since a self-hosted model gives up the frontier and the tooling built around it, and for most work the switch is not worth it. So I stay, with the gate in place.

Staying means working around it. I rephrase the prompt until it lands, strip the task to something abstract, write "vulnerability check" where I mean exploit, or soften a precise claim until it reads as harmless. The swap costs more than tone: an exploit is a specific artifact, a "vulnerability check" is a vague, passive stand-in, and the trade turns a precise analysis into a general one. Researchers call this algorithmic anxiety[9]: adjusting to an opaque system before it pushes back, reaching for the blunter word and dropping the exact one before the filter sees it. Part of the work becomes a negotiation with the filter, writing to be permitted rather than to be exact.

At scale these private adjustments stop being private. Each softened claim and dropped specialist term is small on its own. Together they fill the chat logs the vendors keep, and the path from there into the next model is well worn: logs are filtered for quality, distilled into synthetic training data, and used to fine-tune what comes next. Training on derived data this way has a documented failure mode, model collapse, in which the tails of the distribution fall away first and do not return[10]. Here the tail that thins is the narrow, exact vocabulary the filter discourages, and each round trains on a flatter, more generic record.
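The tail loss can be illustrated with a toy resampling loop. This is a simplified stand-in, not the experiment in [10]: "training" is reduced to drawing a new corpus from the previous corpus's empirical word frequencies, and the corpus, its sizes, and the function name are assumptions chosen for the example.

```python
import random


def next_generation(corpus: list[str], rng: random.Random) -> list[str]:
    """Stand-in for train-then-generate: resample the corpus from itself."""
    return rng.choices(corpus, k=len(corpus))


rng = random.Random(0)  # fixed seed so the run is repeatable

# Assumed starting corpus: one common word plus 50 exact terms used once each.
corpus = ["common"] * 950 + [f"rare_{i}" for i in range(50)]
initial_vocab = set(corpus)

vocab_sizes = [len(initial_vocab)]
for _ in range(10):
    corpus = next_generation(corpus, rng)
    vocab_sizes.append(len(set(corpus)))

print(vocab_sizes)
```

A word that fails to be drawn in one generation has zero frequency in the next, so it can never come back. The vocabulary count can only fall or hold, and the words that go first are the ones that were rare to begin with.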

So the narrowing closes into a loop: each model learns from a corpus its predecessors already shaped, brings that caution to what it writes and what it refuses, and shapes the corpus the one after it will train on. The knowledge dropped this way never makes it back in, and nothing marks its absence, so the next reader and the next model inherit the gaps without knowing they are there. Over enough cycles, that is how a field's hard-won precision may leave the record the future will learn from.

References

  1. Gillespie, T., "Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions That Shape Social Media," yalebooks.yale.edu, 2018

  2. Pew Research Center, "Google Users Are Less Likely to Click on Links When an AI Summary Appears in the Results," pewresearch.org, 2025

  3. Meta AI, "Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations," ai.meta.com, 2023

  4. Anthropic, "Constitutional Classifiers: Defending against Universal Jailbreaks," anthropic.com, 2025

  5. Harnad, S., "The Symbol Grounding Problem," cs.ox.ac.uk, 1990

  6. An, B., et al., "Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models," arXiv:2409.00598, 2025

  7. Cui, J., et al., "OR-Bench: An Over-Refusal Benchmark for Large Language Models," arXiv:2405.20947, 2025

  8. Anthropic, "Claude 3.7 Sonnet and Claude Code," anthropic.com, 2025

  9. Jhaver, S., et al., "Algorithmic Anxiety and Coping Strategies of Airbnb Hosts," dl.acm.org, 2018

  10. Shumailov, I., et al., "AI Models Collapse When Trained on Recursively Generated Data," nature.com, 2024