Every moderation and recommender system is an unacknowledged ethical commitment. The only choice is whether you draw the line on purpose.
In January 2018, Facebook changed the formula that decides what two billion people see, and buried in the new weights was a single number that, years later, would read like a confession. The change was called “Meaningful Social Interactions,” and it re-ranked the News Feed to favor posts that got people interacting: comments, shares, and the emoji reactions introduced in 2016. To do that, it had to decide how much each kind of interaction was worth, and according to internal documents that surfaced in the 2021 Facebook Files leaked by Frances Haugen, an “angry” reaction was weighted five times as heavily as a “like.”
Five points for anger, one for approval. Nobody at Facebook, as far as anyone has shown, sat down and decided that making people angry was a corporate value. What happened was subtler and worse: an optimizer was told to maximize meaningful interaction, and it turned out that anger is meaningful interaction (outraged posts get more comments, more shares, more time spent), so the weight that maximized the objective was the weight that rewarded outrage. Facebook's own researchers later documented the result: the change “rewarded” sensationalism, and political parties across Europe reported that they could no longer get attention without going negative. The angry-face coefficient was not a bug. It was the system doing exactly what it was built to do.
And here is the part that should make every engineer who has ever shipped a ranking model sit up: that coefficient is a two-hundred-year-old argument in moral philosophy, written as a number, by people who would have been baffled to learn they were doing ethics. They were. So are you.
Normative ethics, the branch that asks what makes an action right, has spent roughly two centuries circling one fault line, and almost everything else is a footnote to it.
On one side is consequentialism, given its modern form by Jeremy Bentham in 1789 and John Stuart Mill in 1863: an act is right if its outcomes are good. Add up the welfare it produces across everyone affected, and the right action is the one that maximizes the sum. It is a clean, quantifiable, almost mechanical doctrine, which is exactly why it has always appealed to people who like to measure things. On the other side is deontology, associated above all with Kant: an act is right if it respects the right rules and duties, regardless of outcome. Some things you must not do even when the arithmetic says they would help; you must treat people as ends in themselves, never merely as means to a larger total.
The standard objection to consequentialism, the one every undergraduate meets, is that it will sacrifice the one for the many. If enough aggregate happiness can be wrung out of harming an innocent person, the consequentialist sum says do it, and that conclusion strikes almost everyone as monstrous. Deontology's reply is the concept the philosopher Robert Nozick made precise in 1974: the side constraint. A side constraint is a boundary the optimization is simply not permitted to cross, no matter what the sum on the other side comes to. Not a cost to be weighed against the benefits, a wall. The whole point of a side constraint is that it does not have a price.
Hold that distinction, a wall versus a price, because it is the entire engineering lesson of this essay, and almost everyone gets it wrong in code.
Here is what no one tells you when you wire up a ranking system: a single aggregate objective is act-consequentialism, instantiated. When you write a loss function that maximizes engagement (or watch-time, or “meaningful social interactions”) you have built a machine that evaluates every possible action solely by its effect on one aggregate number, with no act inherently off-limits if it moves that number up. That is not a neutral engineering choice that happens to resemble a moral theory. It is the moral theory, adopted by default, usually by someone who would be startled to hear they had adopted anything. Your ranking function is a utility function, and a utility function is a moral theory. You have been doing applied ethics in production and filing it under optimization.
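A minimal sketch of what “a single aggregate objective” looks like once it is code. Only the five-to-one ratio of angry to like comes from the reporting described above; the names `WEIGHTS`, `Post`, `score`, and `rank`, and the two sample posts, are hypothetical and are not Facebook's actual formula.

```python
from dataclasses import dataclass

# Hypothetical weights: only the 5:1 angry-to-like ratio is taken from the text.
WEIGHTS = {"like": 1.0, "angry": 5.0}

@dataclass
class Post:
    name: str
    counts: dict  # predicted interactions, keyed by interaction type

def score(post: Post) -> float:
    """One aggregate number: the weighted sum of predicted interactions."""
    return sum(WEIGHTS[kind] * n for kind, n in post.counts.items())

def rank(posts):
    """Order posts by the aggregate score, highest first.
    Nothing except the score is consulted."""
    return sorted(posts, key=score, reverse=True)

# Made-up example posts.
calm = Post("calm", {"like": 100, "angry": 0})
outrage = Post("outrage", {"like": 20, "angry": 40})

feed = rank([calm, outrage])
```

In this toy feed the post with far fewer likes comes first, because forty angry reactions at weight five outscore a hundred likes at weight one. No line of the code mentions outrage; the preference lives entirely in the weights.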
And once you see the recommender as a committed consequentialist, the textbook objection stops being a seminar hypothetical and becomes a fact about your gradient. If outrage, or compulsive use, or harm concentrated on some vulnerable subgroup happens to raise the aggregate metric, then gradient descent will walk toward it, not because anyone chose harm, but because the slope pointed that way and nothing inside the system was built to refuse. This is the paperclip maximizer that Nick Bostrom made famous: an optimizer that converts everything into paperclips because that is what it was told to maximize and no constraint ever told it to stop. It is Goodhart's law (when a measure becomes a target, it stops being a good measure) and it is reward hacking, all the same animal wearing different collars. Your recommender is a paperclip machine for engagement.
The mechanism behind Facebook's angry coefficient is not special to Facebook, either; it is measured and general. Moral and emotional content, outrage especially, spreads further than calm content. The psychologist William Brady and colleagues found that each moral-emotional word added to a message measurably increased how far it traveled; Molly Crockett wrote a whole paper on “moral outrage in the digital age.” So an engagement optimizer does not need to be told to amplify outrage. It discovers outrage, the way water discovers a crack, because outrage is simply where the metric is highest. The objection that consequentialism will sacrifice the one for the many is not a thought experiment in your system. It is the gradient.
I should be precise here, because the evidence on algorithmic harm is genuinely mixed and the honest version of this argument is the stronger one. The Facebook case is well-documented: the internal research, the explicit five-to-one coefficient. But the other famous example, the YouTube “rabbit hole” that allegedly radicalizes viewers, is contested: a large 2024 study by Homa Hosseinmardi and colleagues, using counterfactual bots across hundreds of thousands of users, found no evidence that the recommender pushes people toward extreme content on average, and that the few who consume such content largely go looking for it. So the rigorous claim is not “recommenders always radicalize.” It is structural: an unconstrained aggregate optimizer will trade harm for the metric if the gradient points there, and Facebook is the proof that sometimes the gradient does.
So the philosophy hands you the fix, and it sounds easy: you can't run a pure consequentialist optimizer in production, so add deontological side-constraints, the things that must hold regardless of the metric. But the single most important, most often-botched detail in the whole subject is how you encode them, and it is the difference between a system that is safe and one that merely looks safe.
Suppose you implement your safety rule as a penalty term inside the same loss: loss = −engagement + λ·harm. It feels principled. You have “added harm to the objective.” You have not built a deontological constraint. You have built consequentialism with extra steps, because the optimizer will simply pay that penalty whenever the marginal gain in engagement exceeds λ times the marginal harm. You did not draw a wall; you set a price. And a price is exactly the thing a deontological rule exists to refuse, because the entire content of “you must not, regardless” is that there is no number on the other side large enough to buy it. You encoded a weighing, the consequentialist move again, and called it a rule.
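To make the price concrete, here is a toy sketch of the penalized objective, written as a score to maximize (the negative of the loss above). The value of `LAMBDA`, the helper names, and the candidate numbers are all assumptions chosen for illustration.

```python
LAMBDA = 10.0  # hypothetical price charged per unit of harm

def soft_score(engagement: float, harm: float, lam: float = LAMBDA) -> float:
    """Negative of the loss in the text: engagement - lam * harm."""
    return engagement - lam * harm

def pick(candidates, lam: float = LAMBDA) -> str:
    """Return the name of the candidate with the highest penalized score."""
    best = max(candidates, key=lambda c: soft_score(c["engagement"], c["harm"], lam))
    return best["name"]

safe = {"name": "safe", "engagement": 50.0, "harm": 0.0}
harmful_low = {"name": "harmful", "engagement": 55.0, "harm": 1.0}
harmful_high = {"name": "harmful", "engagement": 61.0, "harm": 1.0}

# The penalty holds while the engagement advantage is below lam * harm
# (55 - 10 = 45, which loses to 50) ...
choice_low = pick([safe, harmful_low])
# ... and is simply paid once the advantage exceeds it
# (61 - 10 = 51, which beats 50).
choice_high = pick([safe, harmful_high])
```

Raising `LAMBDA` only moves the break-even point. For any finite value there is still an engagement figure that buys the harmful item the top slot.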
A real side-constraint is inviolable, and you build it structurally, not as a term the optimizer can trade against. A hard gate that removes the item from the candidate set entirely, before ranking ever sees it. Lexicographic priority, where the constraints must be satisfied before the objective is even consulted. Or constrained optimization in the literal sense: maximize the objective subject to the constraint, where the constraint is the boundary of the feasible region and not a quantity inside the thing you're maximizing. It is almost eerie how exactly Nozick's philosophical “side constraint” maps onto the mathematician's optimization constraint: both are walls the optimizer cannot see over rather than costs it can choose to eat. (There is even a clean machine-learning instance of doing it right: Serena Wang and Maya Gupta's 2020 work encodes ethical duties as hard monotonicity and shape constraints on a model, and explicitly contrasts that “deontological / constraint” approach with the “consequentialist / statistical” one of just adding a fairness term.)
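A hard gate, the first of those options, can be sketched in a few lines. The names `gated_rank` and `violates_policy` and the `flagged` field are hypothetical; the sketch assumes some upstream check has already labeled each candidate, and the gate is only ever as good as that check.

```python
def violates_policy(candidate) -> bool:
    """Stand-in for a real policy check (hypothetical).
    Here it just reads a precomputed flag."""
    return candidate["flagged"]

def gated_rank(candidates, violates=violates_policy):
    """Remove violating candidates first, then rank what is left by engagement.
    The engagement number of a removed item is never read."""
    allowed = [c for c in candidates if not violates(c)]
    return sorted(allowed, key=lambda c: c["engagement"], reverse=True)

candidates = [
    {"name": "safe", "engagement": 50.0, "flagged": False},
    {"name": "harmful", "engagement": 1e12, "flagged": True},
]

feed = [c["name"] for c in gated_rank(candidates)]
```

The flagged item is absent from `feed` however large its engagement figure is, because the filter runs before the objective is consulted and the two quantities never meet in the same expression.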
Which gives you the one-line audit to carry into any review of any ranking or moderation system: can a large enough engagement gradient buy its way past this safety rule? If the answer is yes, it was never a rule. It was a price, and your optimizer has already found out what it is.
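That audit can be approximated as a test. The sketch below is one possible shape, not a standard tool: `can_be_bought` and both toy rankers are hypothetical, and a finite sweep can only demonstrate that a rule is buyable, not prove that it never is.

```python
def soft_ranker(items, lam: float = 10.0) -> str:
    """Toy ranker with harm as a penalty term inside the objective."""
    return max(items, key=lambda c: c["engagement"] - lam * c["harm"])["name"]

def gated_ranker(items) -> str:
    """Toy ranker that filters harmful items out before maximizing."""
    allowed = [c for c in items if c["harm"] == 0.0]
    return max(allowed, key=lambda c: c["engagement"])["name"]

def can_be_bought(ranker, max_exponent: int = 12) -> bool:
    """Raise the violating item's engagement by powers of ten and report
    whether it ever takes the top slot from the safe item."""
    for k in range(max_exponent + 1):
        items = [
            {"name": "safe", "engagement": 50.0, "harm": 0.0},
            {"name": "violating", "engagement": 10.0 ** k, "harm": 1.0},
        ]
        if ranker(items) == "violating":
            return True
    return False

soft_is_buyable = can_be_bought(soft_ranker)
gate_is_buyable = can_be_bought(gated_ranker)
```

Under these assumptions the penalty-based ranker gives way once the violating item's engagement reaches 100, while the gated ranker holds across the whole sweep.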
Now the part the essay has to get right or it becomes dangerous, because the tidy version of this story (“deontology good, consequentialism bad, just add hard rules”) is wrong, and a system built on it breaks in a different direction.
Pure deontology fails in production too. A moderation system that is nothing but hard rules over-blocks: false positives pile up, legitimate speech gets killed, and the base-rate problem bites hard, since when violations are rare, even an accurate rule flags far more innocents than offenders. Rules conflict with each other and leave you no principled way to choose. And they cannot anticipate the long tail of novel cases, which is the deontologist's oldest embarrassment, Kant infamously insisting you must not lie even to the murderer at the door who asks where your friend is hiding. A system of inviolable rules is brittle, and in exactly the situations its authors failed to foresee, it is unjust.
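The base-rate point is easy to check with arithmetic. The figures below are assumptions picked for illustration, not measurements from any platform: one post in a thousand violates the rule, the rule catches 99% of violations, and it wrongly flags 1% of innocent posts.

```python
def flag_counts(total, prevalence, sensitivity, false_positive_rate):
    """Return (true flags, false flags, precision) for a rule applied to
    `total` items, of which a fraction `prevalence` actually violate."""
    violating = total * prevalence
    benign = total - violating
    true_flags = violating * sensitivity           # offenders correctly flagged
    false_flags = benign * false_positive_rate     # innocents wrongly flagged
    precision = true_flags / (true_flags + false_flags)
    return true_flags, false_flags, precision

# Hypothetical scenario: 1,000,000 posts, 0.1% violating,
# 99% sensitivity, 1% false-positive rate.
true_flags, false_flags, precision = flag_counts(1_000_000, 0.001, 0.99, 0.01)
innocents_per_offender = false_flags / true_flags
```

With these numbers the rule flags roughly 990 real violations and roughly 9,990 innocent posts, so about ten innocents are caught for every offender and only around 9% of flags are correct, even though the rule is right 99% of the time on each individual post.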
So the grown-up answer is neither pure theory. It is the hybrid that ethicists variously call constrained optimization, or threshold deontology, or, reaching back to W. D. Ross in 1930, a system of prima facie duties: maximize the consequentialist objective subject to a small set of genuinely inviolable deontological constraints. Good content moderation, on the days it works, already is this: an engagement ranker wrapped in a hard gate that will never surface child sexual abuse material or self-harm content to minors, no matter what that restraint costs the metric. The system was already a hybrid; it just never named the philosophy it was practicing. And naming it is not academic, because it tells you precisely where the bug is when one of these systems detonates in public. It is almost always one of two failures, and now you can tell them apart: either the thing was a pure consequentialist optimizer with no real constraints, and it found the harm-for-engagement trade the math always contained, or its “constraint” was a soft penalty term, and a big enough gradient bought its way through.
If you suspect this is just a social-media problem, look at where the AI labs are pouring their most serious effort, because it is the same fault line under a new name. Training a model with reinforcement learning from human feedback (RLHF) is building a consequentialist optimizer: maximize a reward signal. The “deontological” layer the labs bolt on top is the constitution or rule set meant to make certain things hold regardless of what the reward wants: Anthropic's Constitutional AI, OpenAI's Rule-Based Rewards.
And here is the humbling detail that is also the whole point. In practice those constraints are often still soft. Constitutional AI, as commonly implemented, works by training the model to prefer the less harmful response, leaving the model to exercise judgment. The result is a strong, learned guideline, not an inviolable runtime gate that cannot be crossed. Even the field whose entire job is to encode “you must not, regardless” keeps finding that the genuinely hard constraint is genuinely hard to build, and keeps shipping the soft version because the soft version is what's tractable. That should calibrate exactly how much your “we have a content policy” is worth if the policy lives, in the actual system, as a coefficient the optimizer is free to trade away. The recommender problem and the alignment problem are not an analogy for each other. They are the same problem, and nobody has fully solved it.
So here is the move, and it costs nothing but admitting what you already built. Open your ranking or moderation system and find the objective, the single number it maximizes. That number is your declaration that you are an act-consequentialist, written in code, almost certainly without a vote. Fine; you more or less have to be one, because an aggregate objective is just how ranking works. But then ask the question two centuries of moral philosophy spent sharpening for you: what are the things that must hold regardless of that number? Write them down: the content you will not show even when it lifts engagement, the person you will not harm even when the aggregate says it nets out positive.
Then check, for each one, with the only audit that matters: is it a hard constraint the gradient cannot cross, or a penalty the gradient can pay? If it's a penalty, you do not have a rule, you have a price, and the optimizer has already discovered it. Move it out of the loss entirely, into a gate, a filter, a lexicographic priority, somewhere the objective is structurally forbidden to bargain against it. And then, with a clear conscience, maximize your metric, subject to those walls.
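Of the structural options named above, lexicographic priority is the least familiar, so here is a small sketch of it. The field names and sample items are hypothetical. It relies on the fact that Python compares tuples element by element, so the second element matters only when the first is tied.

```python
def lexicographic_key(item):
    """Sort key: constraint status first, engagement second.
    Engagement is negated so that higher engagement sorts earlier."""
    return (item["violations"], -item["engagement"])

items = [
    {"name": "borderline", "engagement": 1e9, "violations": 1},
    {"name": "clean_low", "engagement": 5.0, "violations": 0},
    {"name": "clean_high", "engagement": 80.0, "violations": 0},
]

order = [i["name"] for i in sorted(items, key=lexicographic_key)]
```

Unlike a gate, this keeps the violating item in the list but places it below every clean item, whatever its engagement. Whether demotion is enough or outright removal is required depends on the rule, and for the content you will never show, the gate is the safer reading.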
The line between consequentialism and deontology has run through moral philosophy since Bentham. It now runs through your codebase, named or not. The only real choice you get is whether you draw it deliberately, in a place you chose and can defend, or let gradient descent draw it for you, silently, in whatever spot the metric happens to be highest. Facebook let the gradient draw it, and the gradient drew it at five points for anger. The line is going to exist. Decide who holds the pen.
A rule the optimizer can't see over still has to be shown it held.
The whole argument is that a real constraint is a wall, not a price, and that the difference is invisible from the outside: a soft penalty and a hard gate produce the same policy document. An autonomous agent optimizing toward a goal is the recommender problem with a will, and “we have a constitution” is worth nothing if the constitution lives as a coefficient it can trade away. The Agent Trust Stack is the layered version of the hybrid this essay argues for: hard checks at the boundary, learned reputation above them, and a tamper-evident provenance record underneath, so the side-constraint isn't just declared, it's verifiable, and you can prove the gradient never bought its way through.
pip install agent-trust-stack · npm install agent-trust-stack