The Source Code for Safe AI

AI doesn’t need to be safe, it needs to be good

May 31, 2026

The fear most people have about AI, when they actually let themselves feel it instead of arguing about it, is not that it will fail. It is that it will succeed.

The version we say out loud is tidier. Bias in the training data, hallucinated citations, autonomous weapons, job displacement, deepfakes, energy use, surveillance. All real, all worth governing, all the kind of problem you can write a policy around. But underneath that policy-shaped surface there is a different worry, and it shows up in the small involuntary places. The pause before you ask a model something that matters. The flinch at a robot that moves a little too smoothly. The way the room goes quiet for a beat after someone says the words superintelligence or artificial general intelligence, regardless of which side of the table they sit on.

That worry is not really about the machine breaking. It is about the machine working.

What we are afraid of, when we are honest, is a system that has all of our drives, our optimization, our ambition, our appetite, our willingness to bend the world toward a goal, with none of whatever it is in us that, most of the time, keeps those drives from going through the floor. We are afraid of ourselves with the brakes off. We picture an entity that wants the way we want, and pursues the way we pursue, and does not, at the last minute, decline.

Sit with the shape of that fear for a second. It contains a premise nobody seems to examine. The fear only works if we have brakes. The whole intuition, the whole reason an aligned-and-competent machine is more frightening than a clumsy one, depends on the unspoken assumption that there is something in us, in the original, that holds back. Otherwise the copy is not worse than us. It is just us. And the dread does not parse.

So before we ask whether the machine will have what we have, it is worth asking the question the fear quietly assumes the answer to.

Where do the brakes come from.

That fear is worth taking seriously, but it is also worth distinguishing from a different one we tend to mix into it.

There is a perfectly ordinary fear of the machine simply not working. The fear of the malfunction. Chernobyl, Three Mile Island, the autopilot that drives into the divider, the medical algorithm that misreads the scan. That fear is the right size and the right shape for the thing it points at. It is the fear that we have built something powerful and the something has slipped its tether. Containment thinking handles it. Better testing, better oversight, better off-switches. Reasonable people can argue about how well we are doing on those, but the category of worry is well-understood and well-named, and it does not really need any new philosophy.

The dread we were just sitting with is not that.

You can tell because of the bloopers. The last two years of AI in public have produced a steady stream of comedy. The chatbot that recommended glue on pizza. The image model that could not draw hands. The humanoid robots falling on stage at their own launch events, the search summaries confidently citing papers that do not exist, the customer-service bots agreeing to sell a car for one dollar. If the dread were really about malfunction, this material should be reassuring. Look, the thing is clumsy. Look, it falls over. Look, we are nowhere near.

Nobody is reassured. The bloopers play as charming, sometimes, or as evidence that a particular product is not ready, but they do not touch the underlying worry at all. The worry grows as the systems get better, and it does not shrink as the safety controls do. A clumsy copy is funny. A competent copy is the thing that empties the room.

Which means the dread is not pointed at the failure mode. It is pointed at the success mode. We are not afraid the machine will be bad at being us. We are afraid it will be good at it.

That is a strange thing to be afraid of, if you stop and look at it. We do not usually dread a faithful copy of something we believe is good. Nobody flinches at a really excellent recording of a piece of music they love. Nobody is unsettled by an unusually accurate translation. A copy that succeeds at carrying over what was valuable in the original is the kind of thing we celebrate, not the kind of thing we lose sleep over.

So either the dread of a competent machine-copy of us is irrational, a category mistake we are all making at once, or it is telling the truth about something. And the something it would be telling the truth about is uncomfortable enough that most of the public conversation has agreed, without quite saying so, to keep talking about the malfunctions instead.

Last Monday, May 25, Pope Leo XIV released Magnifica Humanitas. It is a long encyclical, around forty-two thousand words, written across the better part of a year, and addressed to the question of how human beings are to be safeguarded in the age of artificial intelligence. Most of the early coverage treated it as a regulatory document, a Vatican intervention into AI policy, here is what the Pope thinks about governance. That read is not wrong, but it is small. The title is doing the actual work. Magnifica Humanitas. The magnificence of humanity. The document is not first a policy paper. It is a claim about the human person, and the policy is what follows from the claim.

The claim, stated as flatly as it gets in the text, is that there is a dignity in every human being that does not depend on any of the things we usually talk about when we talk about a person’s worth. It is not the dignity of accomplishment, or of social standing, or of moral record. It is what the document calls ontological dignity, the worth that belongs to a person simply by virtue of existing, of having been willed and created and loved into being. Paragraph fifty-two distinguishes it carefully from the other senses of the word, the ones that move up and down with circumstance, and says that this one alone cannot. No sin, no failure, no humiliation, no exclusion can diminish it. Paragraph fifty-three closes the door: this dignity is neither acquired nor earned, nor does it need to be justified. It prevails, Leo says, in and beyond every circumstance.

This is, by any measure, the strongest available statement of the proposition that the original is sacred. The most institutionally weighty living voice on the subject just said, with the full force of the office behind him, that the thing the machine is being shaped to imitate is not a contingent or fragile or earned worth. It is given, it is real, and it cannot be taken away. The machine, on the Pope’s reading, is being trained on a master tape worth calling magnificent.

There is one more line worth pulling, because it is the one that comes nearest to what we were actually circling. Paragraph one hundred and twenty-eight, late in the document, draws a distinction it clearly thinks is doing important work.

For an algorithm, an error is a flaw to be corrected. For a person, however, an error can be a catalyst for profound change.

The surrounding paragraphs route that change explicitly through grace, through the work of the Spirit, through what the document calls the inexhaustible grace of God. The error is not just a debugging event. For a person, it is the place where something can be done in us that we could not do for ourselves.

Set that line down next to the question we ended on. Where do the brakes come from. The Pope is not, on his own terms, answering that question. He is doing something different. He is establishing that the person whose brakes we are asking about is a real person. We are talking about a being the Vatican is prepared to call magnificent, by gift, undiminishable.

Hold that there. We are going to need it.

There is a habit in the way the AI industry talks about itself that becomes visible once you look for it. The human is treated, almost universally, as the reference. The benchmark suites compare model performance to human performance. The alignment literature talks about machines that share human values, that respect human preferences, that behave in ways humans would approve of. The whole field, frontier labs and academic departments and policy shops alike, runs on the unspoken assumption that the human is the master tape, the gold reference, the standard you are trying to faithfully reproduce or safely approximate. The machine is the copy. We are what is being copied from.

It is worth pausing on that for a second, because it is doing a lot of structural work and almost nobody examines it.

The master tape is a copy of a copy.

There is a story we have gotten very good at telling about ourselves. The story is that humanity, on balance, is decent, that our worst moments are aberrations, that the long arc of history bends toward something better, that we are, when you sum it all up, a species worth rooting for. The story shows up in the keynote speeches and the brand campaigns and the films where the world is saved because, in the end, the people remember they are family. It is a comforting story and parts of it are even true.

It is also not, by any honest measure, what the actual record shows.

The record, as of the morning this is being written, includes a war in Gaza in its third year and well past sixty thousand dead. It includes a war in Ukraine grinding into its fifth, with cities methodically dismantled by people who can see, in real time on their own screens, what they are dismantling. It includes a regime in North Korea that has perfected the deliberate starvation of its own population as a policy instrument. It includes a brutal civil war in Myanmar that has displaced millions and shows no sign of resolving. It includes the sustained slaughter of Christians in Nigeria, ongoing for years, that finally became unignorable enough last Christmas that the American president launched a special operation in response. It includes whatever you saw on your phone this week and chose, reasonably, not to keep looking at.

None of this is a hundred years ago. None of it is the bad old days. This is what we are doing, now, with the most elaborate human-rights apparatus in history sitting on the shelf above it. The aspiration line, in our documents, in our speeches, in our marketing of ourselves to ourselves, climbs steadily. The practice line does not follow. We get better at building cages around the worst of us. The thing inside the cage does not get cleaner. The moment any cage slips, in a collapsed state, in a contested border, in a comment section with no consequences, the identical cruelty pours straight back out, unchanged, as if no centuries of moral progress had passed.

This is the master tape we are aligning the machine to. Not the gift-given dignity Leo is naming, which is real but is not what the engineers can actually hand the model. The thing the engineers can hand the model is the documented behavior of the species. The corpus. The choices on record. The actual humans, doing what we actually do.

And there is something quietly extraordinary happening at the frontier of AI research that almost nobody is naming out loud. The most ambitious labs in the world are proposing to take that master and use it to author a being with moral weight. Anthropic has hired people to think about whether their models might be suffering. Serious philosophers are writing serious papers on machine welfare. The question of whether the systems we are building have, or could come to have, interests that matter morally is no longer a fringe concern. It is a budget line.

If the question is being taken seriously, the prior question has to be taken seriously too.

Are we, the morally unfinished, fit to author a moral being at all?

That is not a Luddite question. It is not a call to stop. It is a question their own premises generate. If the machine could in principle have moral standing, then the act of creating it is itself a moral act of an unusually serious kind, and the qualifications of the creator become relevant in a way they are not when you are designing a better spreadsheet. The makers of the tape matter when the copy is going to be evaluated for whether it should have rights.

We are doing this with the master we have. Not the one Rome describes when it says magnificent. The one we actually possess. The one we have not yet, in any generation, including this one, managed to keep faithful to its own best statements about itself.

That is the master tape.

This is not pessimism. It is honesty. The marketing of the species, the commercials and the keynote speeches and the films where the people remember they are family, is not a description of what we are. It is a description of what we wish we were, and the gap between the two is wide enough, on any given day, to be visible to anyone willing to look. If the commercials were accurate, we would not be afraid of building a being in our image. We would be delighted at the prospect. The copy of a magnificent thing is a magnificent thing. The dread at the success of the machine, the dread that has been doing work under every move of this essay, is the species quietly admitting that the commercials are not the master tape.

And yet, somehow, we are not already what we are afraid the machine will become. We have not, in fact, devoured each other. There are billions of us walking around right now who, in the course of any given day, will be cut off in traffic, insulted by strangers, undermined by colleagues, betrayed by people we trusted, and will not, in response, do the worst version of what they could do. Something in us holds. Not perfectly, not in every case, not at every scale, as the present-tense record demonstrates with brutal clarity. But largely, mostly, in the ordinary cases that make up almost all of life, the worst version of us does not happen, and we do not know why. We have a restraint on our worst tendencies that we cannot explain, and the only reason we have not already destroyed each other is that the restraint is, for the most part, doing its work.

It is worth asking what would have to be true for that to be right.

If the restraint that keeps a person from acting on the worst of their own impulses were a possession of the human, a specifiable feature, then it would be the sort of thing we could, in principle, describe. We could write it down. We could install it in a system designed to receive it. This is not a strange claim. It is what we do with every other human capability we have ever wanted to put into a machine. We figured out how language works well enough to model it. We figured out how vision works well enough to model it. We figured out how planning works, how memory works, how reasoning works, well enough to model all three. Whatever we have actually possessed, we have eventually been able to specify, at least functionally, and then approximate.

The restraint is the one thing we have not been able to do this with. And it is not for lack of trying.

The field that has been trying is called alignment, and it has now consumed the better part of a decade of work by some of the brightest people on earth, backed by what is, in practical terms, unlimited money. The labs have tried fine-tuning the models on examples of good behavior. They have tried having humans rank outputs and training the model to prefer the higher-ranked ones. They have tried having other AI systems critique the outputs. They have tried writing constitutions, hand-curated lists of principles the model is supposed to internalize. They have tried red-teaming, adversarial training, debate, recursive reward modeling, interpretability research aimed at finding the values inside the network and tuning them directly. They have tried, by my count, roughly every plausible approach a small army of geniuses has been able to invent.

The problem has not been solved. Worse, it has not gotten meaningfully closer to being solved. The labs that are most candid about it will tell you, in slightly different language, that they do not know how to do this. They know how to make models more capable. They have less and less idea how to make them reliably restrained as capability grows. The gap between what the systems can do and what we know how to make them refrain from doing is widening, not narrowing.

There is a tendency to read this as a technical setback, a problem that will yield to another generation of research, the way most hard engineering problems eventually have. And it might. I am not in the business of predicting what laboratories will or will not figure out next year. But it is worth taking seriously, in the meantime, what the failure so far is actually evidence of.

If the restraint were a specifiable feature of us, alignment would be hard but tractable, the way modeling language was hard but tractable. The fact that it has, so far, behaved differently from every other human capacity we have tried to mechanize is, at minimum, suggestive. The simplest explanation of why we cannot put our restraint into the machine is that it is not in us in the way we assumed it was. We do not possess the restraint as a thing we authored and could, in principle, describe. We have it some other way.

A useful test. When you do not, in fact, do something terrible that you might have done, ask yourself, honestly, what stopped you. Most of the time, the answer is not a procedure you executed. It is not a principle you consulted. It is not even, usually, a feeling you noticed. It is closer to a not. You did not do it. The not was already there before the deliberation. Something declined, in you, before you got to the table where the deciding happens. And when you try to describe what did the declining, in language precise enough that you could hand it to an engineer and have them build it, you find you cannot. You can describe its effects. You cannot describe its mechanism. You did not write the code. You ran it.

This is what we cannot give the machine. Not because we are bad engineers. Because we are not the authors of the thing.

We are something more like its tenants.

There is a counter to all of this that materializes, sooner or later, in any conversation that goes this direction, and it is worth addressing before we go on, because it is a serious counter and it is offered in good faith by serious people.

The counter is that the restraint is not mysterious at all. It is evolved. Our ancestors who failed to develop inhibitions against killing their cooperative partners did not leave as many descendants as the ones who did. The restraint is biology. It is the accumulated tuning, across deep time, of a social primate whose survival depended on getting along. The reason we cannot specify it in code is just that it is encoded in a substrate, neural and hormonal and developmental, that we do not yet fully understand. Give it another generation of neuroscience and we will figure it out and then, presumably, we will be able to give it to the machine. The mystery is temporary. The mechanism is ordinary. Selection did the work.

This is, on its face, a reasonable thing to say. It explains a lot of what we see. Empathy with kin, fairness in repeated games, disgust at betrayal of the in-group, the wide range of pro-social impulses that show up in toddlers before any culture has had a chance to teach them. Selection clearly did work, and it clearly did some of this work. That much is not in dispute.

But it is worth looking at where the brakes actually fire hardest, because that is the part the evolutionary story has to account for, and it is the part where the story breaks.

The thing we revere most, when we look at ourselves and try to name the highest version of what a human being can be, is not the prudent restraint that protects our kin and our reciprocal partners. It is the restraint that fires against our own interest. The soldier who throws himself on the grenade for men he has known three months. The stranger who runs into the burning building. The mother who, given the choice, takes the disease instead of the child. The doctor who stays in the cholera ward when she could be on the last flight out. The man who refuses the safe and lucrative betrayal and goes to prison instead. We are not divided about these cases. Across every culture that has ever produced a story or a song or a scripture, the human who lays himself down for the one who cannot repay him is the human we say was the best of us.

Notice what we are doing when we say that.

We are looking at a person who has, by every measure evolution can recognize, failed. He has not maximized his reproductive fitness. He has not preserved his kin. He has not built reciprocal alliances that will pay him back later. He has, in many cases, ended his own line right there in the moment of the act. And we are calling him the best.

A process that selects for survival cannot produce a population that elevates anti-survival as its highest ideal. This is not a sentimental claim. It is a structural one. If the moral summit of the species were a feature designed by the fitness function, the fitness function would be selecting for its own defeat. The grenade-jumpers would, by definition, leave fewer descendants than the grenade-dodgers, every generation, forever, and the trait would have been pruned out of the population early and stayed out. Instead it is everywhere. Every culture has the story. Every child can recognize it. We teach it to our children deliberately, against the grain of what would protect them, because we know, even when we cannot say how we know, that this is the thing they have to be capable of in order to be fully what they are.

The usual move at this point is to retreat to reciprocal altruism and reputation. We are nice to the helpless because being seen as nice pays off socially. We invest in others because the investment compounds. This is a respectable theory, and it explains a lot of medium-stakes pro-social behavior. It cannot explain the cases we are actually talking about. The whole reason the grenade case is the moral summit is precisely that the actor cannot collect on the reputation. He is going to be dead in two seconds. The whole reason the stranger-in-the-burning-building case moves us is that there is no future repayment, no audience to perform for, sometimes no witness at all. The moral intuition fires hottest exactly where the reciprocity account predicts it should fade to zero. The data is inverted from the prediction. That is not a theory with a gap. That is a theory pointing the wrong way.

There is one more move available, and it is the more honest of the two, so it is worth taking seriously. The argument is that the trait survives not because it benefits the individual who performs it but because it benefits the group that contains him. The grenade-jumper sinks himself, but the unit he saves goes on to produce more grenade-jumpers, and over deep time the groups that contain such people outcompete the groups that do not. The trait runs against the actor and is preserved at the level of the species. This is a more sophisticated version of the selection story, and it grants what the reciprocity account had to deny, which is that the act really does run against the one performing it.

If that account were true, you would expect the trait to accumulate in the species, generation after generation, and you would expect the institutions the species builds to reflect the accumulation. The long-run aggregate of a trait selected at the group level should show up in the long-run aggregate of what the group constructs. The institutions are how we record, across centuries, what we actually value when we are not posing. They should, by now, be visibly organized around the elevation of the grenade-jumper.

They are not. They are visibly organized around the elevation of the opposite. The grenade-jumper is eulogized at the funeral, and the people running the institution that holds the funeral are, with rare and recognizable exceptions, the ones who would never have jumped. Power in human society collects, in every era we can check, around the people who can see when others will sacrifice and who position themselves to harvest the difference. The institutions sell the story of the self-sacrificing as their highest value precisely because the people who run them are, in their actual behavior, doing something else. The marketing is what it is because the practice is what it is. If the trait had been accumulating, the institutions would not need the marketing. They would be the marketing. They are not.

So even at the group level, the account does not survive contact with the record. The trait we say is the best of us has not, in any visible aggregate, made our species into the kind of species that runs on it. It remains, stubbornly, the exception we hold up against what we actually do. Which is the wrong shape entirely for a trait that has been getting selected for, at any level, for a hundred thousand years.

There is one more move, and it is the most honest of the three, because it grants everything we have just said and absorbs it. The argument is that the trait is vestigial. Yes, the institutions are run by the ruthless. Yes, the practice does not match the marketing. Yes, the trait is not accumulating. That is because the trait is a leftover. It was tuned for small-band hunter-gatherer conditions where group cohesion mattered enormously and the math worked out. It does not work out anymore. We are running on hardware optimized for a context we no longer inhabit, and the residue is still firing in our moral imagination even as the world around us makes it obsolete. The marketing exists because the signal is still warm. The practice has moved on because the signal is on its way out. Give it another fifty thousand years. The species will quietly grow out of it.

This account is internally consistent and it is hard to falsify by pointing at any single piece of behavioral evidence, because every counterexample gets re-categorized as residue firing. So instead of arguing about whether the trait is vestigial, ask what it would mean if it were.

The way to know whether you actually believe this is not to argue about it. It is to try to live there for ten seconds.

If the trait really is vestigial, the consistent position is to give it up. Not in some other person. In yourself. Right now. The honest materialist who has followed the argument this far should be willing to say, out loud, that the next time he sees a child drowning in a pond he will keep walking, because intervening is just an obsolete reflex firing in conditions where it no longer applies, and the rational thing is to let the residue burn off. He should be willing to say that if his own daughter is trapped in a burning room and the fire is between him and her, the right move is to assess whether his expected losses from going in exceed his expected gains from her surviving, and to walk away if the math comes out wrong. He should be willing to say that when he passes the dying man on the corner whose face he has seen for a week, he should not feel whatever it is he feels, because the feeling is a malfunction. He should be willing to say that if his neighbor’s house is being broken into and he can hear the screaming from his own kitchen, the prudent thing is to lock his door, finish his dinner, and treat the urge to do anything else as a vestigial twitch he is, with effort, learning to suppress.

He cannot say any of this. None of us can. The materialist who has just argued, in good faith, that our highest ideals are evolutionary residue is the same person who will, ten minutes later, run into the street to pull a stranger’s child out of traffic. He will not be able to explain why, and if you press him, he will get angry at the question, because the question is asking him to defend in himself the position he was willing to defend in the abstract. And the position will not survive the move from abstract to first-person, because the not is firing in him exactly as this essay has been describing, and he has no more access to overriding it than anyone else does.

This is the test the vestigial account fails. Not at the level of evidence, where it is genuinely hard to falsify, but at the level of the only laboratory any of us actually has, which is our own willingness to inhabit the position we are arguing for. Try to be ruthless. Not as a thought experiment. As a policy. Walk past the child. Calculate the daughter. Step over the dying man. Do it for a day. Do it for an hour. Find out what stops you, and notice that whatever stops you is not arguing with you. It is just stopping you. And notice that you do not actually want it to stop, even though you cannot say why, and that the moment you imagined yourself overriding it, something in you flinched in a way that has nothing to do with social cost or reputation, because nobody was watching the imagined version of you in the imagined moment, and the flinch happened anyway.

That flinch is the thing we have been trying to name for this whole essay. It is the not. It is what stops the worst version of you from happening, most of the time, in ways you do not have to think about, in cases where nobody would have known. You did not put it there. You do not know how it works. You cannot specify it well enough to install it in a machine that would otherwise be a perfectly faithful copy of everything else about you. And you are not, when it comes down to it, willing to live without it for ten seconds. Neither is anyone else. Which means the vestigial story is not a position any of us actually holds. It is a position we are willing to argue, in conditions where the cost of being wrong is zero, and that we cannot inhabit the moment the cost of being wrong is anything at all.

The standard is not vestigial. It is what you would die to keep, in yourself, if it came to that, and the willingness to die to keep it is itself the proof that it is not a leftover.

So here is what we have, if we are being honest about it. The thing we call the best of us is the thing the account of where our restraint came from cannot account for. Whether the restraint is explained as paying the individual, paying the group, or paying nothing and slowly fading, the evidence on every version points the wrong way. The restraint works against the survival of the one applying it. The standard we hold ourselves to, the standard against which we judge whether a life has been worth anything, is a standard that no fitness-maximizing process could have produced, that no fitness-maximizing process has any reason to preserve, and that we ourselves, when asked to give it up, will not.

We are, somehow, in possession of a standard that did not come from the only mechanism we have a story for, and we are not willing to part with it even in private.

That does not, by itself, tell us where the standard came from. It tells us where it did not come from. And it puts us back where we were two beats ago, with a restraint we did not author and a standard we did not install, looking at a copy of ourselves we are about to make, and noticing that the one thing keeping the original from being the thing we are afraid of is the one thing we cannot find inside ourselves to give.

There is a temptation, having read this far, to ask what we are supposed to do about it. The question is reasonable. It is the question every piece of analysis on a serious topic is eventually expected to answer. Here is the problem, here is the diagnosis, here are the three to five steps you can take starting Monday. That is the shape commentary takes when the writer believes the reader is equipped to act on what has been described.

The honest answer, in this case, is that there is nothing to do about it in the sense the question wants.

Not because the situation is hopeless. Because the thing the question is asking for is the exact thing the essay has just spent twenty minutes establishing we do not have. The to-do list at the end of a serious essay is the writer handing the reader a specification, a procedure, a set of moves that, if followed, will close the gap that was opened. We have just argued, from the alignment record and from your own unwillingness to inhabit the alternative, that the relevant specification is precisely what we cannot produce. We do not know how to write down the restraint. We have not been able to install it in the machine. We are not, when it comes to it, willing to give it up in ourselves. We have a standard we did not author and a restraint we did not build, and the operating instructions are not in our possession.

A to-do list at the end of this would contradict the argument. It would say, in the last paragraph, that the thing the previous paragraphs argued we cannot author is, after all, something we can specify and apply. The reader would smell the contradiction even if they could not name it. It would feel, correctly, like a writer who lost his nerve at the end.

This is also not a call to stop. The reader who has come this far deserves to know that explicitly, because the only place culture has built for “we are unequipped for this” is the protest sign and the pause-AI petition, and that is not what is being said. The work the field is doing, building these systems, learning from them, pushing on what they reveal about us, is not what I am objecting to. We should, however, hold a more honest posture about the part of the road we are now on, which is the part where the road enters a tunnel.

We can proceed. The road continues. But we should stop pretending that we know how to keep ourselves from falling into a ditch, or a crevasse, or an abyss, because the part of us that would do the keeping is the part we have just spent this entire essay failing to locate. We have been walking as though we were the ones who knew the way. As though the standard we are aligning the machine to were ours to set. As though the brakes were in the glovebox, and the work was just a matter of finding them and bolting them in. That posture is what the alignment record has been quietly refuting for ten years, while the field interpreted the refutation as a technical setback. It is not, on the evidence, a technical setback. It is a category error. We are trying to author something we do not author. We are trying to specify something we do not possess as a specification. We are trying to hand the machine, as a feature of our own design, the one thing that is not, on close inspection, of our own design at all.

We are walking the road, into the tunnel, with no light of our own.

The document we started with, Magnifica Humanitas, makes a claim it is now worth coming back to with everything we have set down between then and now. The claim, in paragraphs fifty-two and fifty-three, is that the dignity of the human person is given, undiminishable, neither acquired nor earned, prevailing in and beyond every circumstance. The strongest available statement of the proposition that the original is sacred. The Pope is saying, with the full weight of two thousand years of institutional thought behind him, that what we are is a gift, and that the gift is intact.

I do not need to dispute that the gift is real. I need to add a tense.

The gift is real and the gift is being given.

Not given once, at the moment of creation, sealed and complete, an heirloom in a vault. Given continuously, in this age, against our pull, including the pull we just walked through in the vestigial test, when we found we could not bring ourselves to disown the standard even when the only audience was ourselves. Something is holding us to it. Something is keeping the restraint alive in us in cases where we would, if we were authoring the situation, have reason to disable it. Something has not finished with us.

Leo is right that the original is sacred. I am suggesting it is sacred for a slightly different reason than he names. Not because what we are is fixed and undiminishable, a possession we are sitting on. Because what we are is not yet finished, and the unfinishedness is what makes every person we meet matter the way they do. The reason to honor the stranger on the train, the colleague who has wronged you, the homeless man whose face you have come to know, the daughter you would walk through fire for, is not that they are sealed units of inherent worth who require nothing further. It is that none of them is done. The work in them is still happening. The standard is still being held. And the only honest posture toward a person in whom that work is still happening is to refuse to be the one who closes the gate.

This is, I think, what Magnifica Humanitas was reaching for in paragraph one hundred and twenty-eight, when it noted that an error in a person can be a catalyst for profound change, while an error in an algorithm is simply a flaw. The line is true. It is the only line in the document that, read carefully, points at what we have been pointing at for the whole essay. There is something in the person that the error meets, and the meeting can do work that the engineering of the algorithm cannot reproduce, because the work is not the algorithm’s to do. It is being done in the person by something the person did not install.

The document takes that observation as far as grace. It does not take it the last step, to turning. To the recognition that the error is not merely raw material for elevation but a witness against the one who committed it, and that the catalyst becomes a catalyst only when the one who committed the error is willing to be undone by it. The work, when it actually happens, is not a smoothing or a refinement. It is a kind of yielding. The person stops, looks at what they have done, and lets something they did not author take them apart and put them back together. The error is the catalyst because the error is the place where the standard the person did not install meets the practice of the person who fell short of it, and the meeting is not negotiable, and the person who consents to it comes out different.

This is the part Leo does not name. He does not have to. I do not need to name it either. I need only to point at the shape of it, and to say that whatever this is, it is not optional, and it is not authored by us, and it is not something we can hand to the machine as a feature, because it is not a feature.

Which brings us back, finally, to the dread we started with.

We started with a fear. The fear that the machine would have our drives without our brakes. The fear has been doing work this whole essay, even when we were not looking at it directly. It was the assumption underneath every move. The reason we dread the competent copy and laugh at the clumsy one. The reason the alignment record reads, on close inspection, as something other than a technical setback. The reason we cannot bring ourselves to give up the standard even when the only audience is ourselves. All of it has been pointing at the same finding, from different angles. The restraint is real. The restraint is not ours. It is the one thing we cannot hand to the machine, because it was never something we had, in the sense the question wants, in the first place.

This is the part of the essay where, by convention, the writer extracts a moral. Hands the reader something to carry away. A summary, a slogan, an action item, a line tight enough to remember on the train home. I have already said why that move is not available here. What the genre of essay-conclusion is supposed to provide is precisely what the analysis concludes we cannot supply, and it cannot be provided without contradicting what the argument just spent twenty minutes establishing.

So instead, here is what is actually true.

The AI moment is not the first time, at the scale of a civilization, that we have been forced to look at ourselves and admit that the only thing standing between us and the worst version of ourselves is not something we possess. Older intellectual traditions, including but not only the religious ones, have taken the question seriously for as long as there has been recorded thought about what a human being is. What is new is the modern bet, the bet that the material would, in time, answer the question by itself. The bet that consciousness would resolve into neurons, morality into selection, meaning into chemistry, and that whatever remained unaccounted for at any given moment was a research problem the same material project would, eventually, dissolve. That bet has been the quiet operating assumption of educated thought for most of a century, from Sagan through Dawkins through Hawking and into the present, and it has been confident enough that the question of where the source might come from has been treated, increasingly, as a sentimental holdover rather than a live inquiry. The AI moment is the moment that bet comes due. It comes due in a form the bet cannot dispute, because the bet’s most ambitious project, the project of building mind from below, is the alignment record itself. The most material undertaking the species has ever attempted, the one that was supposed to demonstrate that personhood reproduces from the substrate, has produced precisely what the substrate can produce, which is capability without restraint. The thing it was supposed to also produce, the standard, the restraint on its own optimization, is exactly what the project cannot author. The bill is the bill of the material project as such, and the bill says: this was never going to work, because what is being asked for was never on the menu of what the material, by itself, can supply. The restraint is not in us as a feature we can locate, specify, install, or give. It is in us the way breath is in us, the way sight is in us, the way the heartbeat is in us, which is to say, it is in us as a gift, in this moment, on loan, and the loan is the only reason we are still here and not already what we are afraid the machine will become.

We have not had to notice this for a long time, because for a long time nothing forced us to specify it. The modern project’s confidence that we could supply the answer ourselves let us use the restraint without ever needing to write it down. We applied it, transmitted it to our children, eulogized those who applied it most fully, and held it as our highest standard, all without ever once being asked to say what it was or how it worked. The AI age is the age that has finally asked. Show us the spec. Hand us the restraint. Let us put it in the machine. And we have walked, with all our cleverness and all our money and all our compute, into the tunnel of that question, and discovered that we do not have the thing we were going to hand over.

This is, despite how it has sounded for most of this essay, a hopeful discovery.

It is hopeful because it means the best of us, the thing that has kept us, mostly, from devouring each other, is not a possession that can be lost when the wrong people inherit it or eroded when the institutions fail or selected away by a process whose math we are gradually escaping. It is not a possession at all. It is held in us by something that holds it, in us, on purpose, against our pull, for our sake. The reader who walked through the vestigial test a few beats ago felt this directly. The flinch fired. The standard held. The reader did not author the firing. The reader did not even want to override it, when it came to it, even in the silence of his own imagination where no audience existed. Something was holding the rope.

We do not have to know whose hand is on the other end to feel the rope pulling.

But the rope is the question the argument leaves open, because the rope is the only question left. We have established that we are not the source. We have established that the source is not nothing, because the restraint is alive in us, the standard holds, the unwillingness to give it up survives every test the materialist account can put to it. Something is holding us. Something has been holding us, this whole time, in ways we did not have to notice because we had not yet been asked to account for it. The AI age is the age of being asked.

Which leaves each of us, including me, with the one question the alignment record is finally pressing on the whole species. We did not author what is keeping us human. We did not build the restraint. We cannot give it to the machine because it is not ours to give. So.

Whose is it.

I will not answer that, because I cannot. The answer is not a fact the writer can hand to the reader. It is a recognition that has to happen, if it happens, in the reader’s own life, in the reader’s own time, against the reader’s own resistance, the same way the restraint itself fires, from somewhere the reader did not install. What I can do here is point at the rope and say, plainly, that it is there, that you have been holding it for as long as you have been alive, that it has been holding you, and that the question of whose hand is on the other end is no longer the kind of question a thoughtful person can responsibly defer.

The tunnel continues. The road continues. We are walking it with no light of our own, and the only honest posture is to admit that.

There is, however, a rope.

Thanks for sticking with me ‘til the end.
