How to Choose the Best Decision Agent

Imagine you have several decision-making agents, and you want to find out which one is the best. A simple idea is to test them all on the same task and keep the one that performs best. For example, you could ask each agent to predict a series of coin flips: heads or tails.

You give all agents the same number of coin tosses under the same conditions. Then you measure how many they get right, or you look at who gets the longest streak of correct predictions in a row. The agent that makes the fewest mistakes, or has the longest streak, might look like the best decision-maker.

It is tempting to conclude that this agent is better at making decisions, because it “proves” itself on the test by getting more predictions right. From this point of view, the agent with the most correct answers, or the longest run of correct predictions, is simply the best.

But there is a crucial question: is this agent really better at making decisions—or did it just have the most luck?

If all agents are effectively guessing on a fair coin, then someone will, just by chance, get a long streak of correct answers. If you test many agents, one of them will almost always stand out with an impressive result, even if none of them has any real skill. In that case, you have not found the best decision-maker; you have found the luckiest one.
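The size of this effect can be worked out directly. The sketch below assumes fair coin flips and fully independent guessers; the test length of 20 flips, the threshold of 15 correct, and the pool of 100 agents are illustrative numbers, not values from any real experiment.

```python
from math import comb


def prob_at_least(k: int, n: int) -> float:
    """Probability that one pure guesser gets at least k of n fair coin flips right."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n


def prob_some_agent_at_least(k: int, n: int, agents: int) -> float:
    """Probability that at least one of `agents` independent guessers reaches k of n."""
    return 1 - (1 - prob_at_least(k, n)) ** agents


# Illustrative numbers (assumptions): 20 flips, 15 or more correct, 100 agents.
single = prob_at_least(15, 20)
any_of_100 = prob_some_agent_at_least(15, 20, 100)
print(f"one guesser scores 15+ of 20: {single:.3f}")
print(f"at least one of 100 guessers scores 15+ of 20: {any_of_100:.3f}")
```

Under these assumptions a single guesser reaches 15 of 20 only about 2% of the time, yet with 100 guessers the chance that someone does is close to 88%. An impressive top score is therefore the expected outcome of the test, not evidence of skill.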

This matters in practice. If you run a single test, pick the apparent winner, and trust it as “the best”, you may be basing your decision on randomness. You might over-trust an agent that got lucky in one experiment and ignore others that would do better over time.

To choose a genuinely good decision-making agent, you need more than one short test. You should look for performance that is consistent across repeated trials and different tasks, not just one nice streak. You should also compare against simple baselines, like random guessing or basic rules, to see whether the agent is actually doing better than chance.
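One way to make the comparison against random guessing concrete is an exact binomial check. This is a minimal sketch, assuming the task is predicting fair coin flips; the function names, the 5% threshold, and the choice to divide that threshold by the number of agents tested (a Bonferroni correction, which is deliberately conservative) are assumptions made for illustration.

```python
from math import comb


def chance_p_value(correct: int, trials: int) -> float:
    """Probability that a coin-flip guesser scores at least `correct` out of `trials`."""
    return sum(comb(trials, i) for i in range(correct, trials + 1)) / 2**trials


def beats_chance(correct: int, trials: int, agents_tested: int, alpha: float = 0.05) -> bool:
    """True if the score is unlikely under guessing, even after testing many agents.

    Dividing alpha by the number of agents is the Bonferroni correction,
    chosen here only because it is simple and conservative.
    """
    return chance_p_value(correct, trials) < alpha / agents_tested


print(beats_chance(15, 20, agents_tested=1))    # True
print(beats_chance(15, 20, agents_tested=100))  # False
print(beats_chance(19, 20, agents_tested=100))  # True
```

The same score of 15 out of 20 passes when one agent was tested and fails when it is the best of 100, which is the difference between a result and a lucky winner.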

The simple coin-flip example shows the core idea: testing agents on the same task and picking the one with the best streak does not automatically mean you found the best decision-maker. It might just be the one that had the most luck.

Is Everything Just a .md File with a Prompt?

When you start working with language model–based systems, a simple pattern often appears: almost everything seems to be a .md (Markdown) file with some text instructions. The agent is a .md file, skills are .md files, instructions are .md files, and policies are .md files. That quickly raises the question: is everything in this universe just a .md file with a prompt?

There are practical reasons for using Markdown. It is human-readable, easy to version-control, and easy to review in tools like Git. Putting the “prompt” in a file separates behavior from code, which means you can adjust how an agent behaves without redeploying anything. It also lets non-developers participate by editing the .md files directly.

An agent, in this setup, is usually defined as a .md file that describes who the agent is and what it is supposed to do. You can think of it as a role description for the language model. The file typically contains the agent’s identity (“You are a customer support assistant”), its responsibilities (“Help users solve product issues”), its tone (“Be clear and calm”), and its boundaries (“Do not give legal or medical advice”).

A skill is also a .md file with a prompt, but it represents a specific capability instead of a role. A skill describes what the system can do in a reusable way. For example, a “summarize ticket” skill file might explain that when given a long support ticket, the model should return a short bullet-point summary. The file usually defines the purpose of the skill, when it should be used, what the input looks like, and what format the output must have.

An instruction is again a .md file with text, but with a narrower focus. It describes how responses should look in a particular context. Instructions might set formatting rules (Markdown, JSON, plain text), length limits, language and tone preferences, or other local constraints. For example, an instruction file might say “keep answers under 200 words, write in English, and respond in Markdown.”

A policy is a .md file that defines rules and constraints the system must always follow. It covers safety, compliance, and domain-specific restrictions. A policy file might say “do not output personal data,” “do not provide medical or financial advice,” or “refuse to help with illegal activities.” Policies typically override anything else: if an agent or skill would violate a policy, the policy should win.

So from one angle, yes: agents, skills, instructions, and policies can all be “just” .md files with prompts in them. But they are not the same thing. In practice, it helps to treat them as different concepts: agents as roles, skills as capabilities, instructions as local guidelines, and policies as global rules.

A simple project structure might mirror this way of thinking, for example:

  • agents/support-agent.md
  • skills/summarize-ticket.md
  • instructions/chat-formatting.md
  • policies/safety.md

In a typical interaction, the system might load the support agent, apply the summarize skill for long tickets, format the answer according to the chat instructions, and enforce the safety policy. All of that behavior is driven by separate .md files, each with a clear purpose.
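The interaction described above could be assembled roughly as follows. This is a sketch under stated assumptions: the function names are hypothetical, the files are simply concatenated into one prompt, and placing policies first is only a convention that signals priority. It does not by itself guarantee that a model obeys them.

```python
from pathlib import Path


def load_section(root: Path, relative_path: str) -> str:
    """Read one Markdown file and return its text without surrounding whitespace."""
    return (root / relative_path).read_text(encoding="utf-8").strip()


def build_system_prompt(root: Path, agent: str, skills: list[str],
                        instructions: list[str], policies: list[str]) -> str:
    """Concatenate the files into one prompt, policies first."""
    groups = [
        ("Policies (always apply)", policies),
        ("Agent", [agent]),
        ("Skills", skills),
        ("Instructions", instructions),
    ]
    parts = []
    for title, paths in groups:
        body = "\n\n".join(load_section(root, path) for path in paths)
        parts.append(f"# {title}\n\n{body}")
    return "\n\n".join(parts)
```

With the example structure, a call might look like `build_system_prompt(Path("."), "agents/support-agent.md", ["skills/summarize-ticket.md"], ["instructions/chat-formatting.md"], ["policies/safety.md"])`.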

There are limits to this .md-file view. Tools, APIs, code execution, state, and workflows usually live outside Markdown. Still, as a mental model, “a universe of .md files with prompts” is useful, as long as you remember that not everything is literally just a file. The important part is not Markdown itself, but the structure: clear definitions of agents, skills, instructions, and policies that you can read, review, and evolve over time.

Agents and Priorities

When we build systems of agents based on language models, we often start with a simple idea: split a big problem into smaller parts, give each agent a clear task, and let them work together. But each agent will still optimize for something, and what it optimizes for matters a lot. If goals are set wrong or unbalanced, the overall system will miss its main purpose, even if each agent “does its job.”

Agents have different objectives, and they make choices and evaluations based on them. Some agents have very local goals that are limited to their own specific task. They focus on short-term, concrete outcomes like “summarize this document,” “classify this ticket,” or “extract these fields.” Other agents have broader goals and consider a larger whole, not just a single subtask. They might care about things like “improve user satisfaction” or “help the user solve their problem effectively.”

For a system of agents to function and reach an overarching goal, you need a combination of these local, constrained goals and more composite, system-level goals. The local goals give clarity and focus, while the global goals ensure that the system is moving in the right direction as a whole. Together, they should form the basis for the evaluations and decisions the agents make.

If you only have agents with narrow, local, short-term goals, the system easily becomes unbalanced. Those agents will tend to reach their local goals: tasks will be completed, short-term metrics will look good, and each agent can claim success. But the overall outcome is often worse. You can end up with fast but unhelpful answers, lots of extracted data that is not actually useful, or content that is technically correct but misses what the user really needs. The main goal of the system is not reached, even though each agent hits its own target.

The opposite imbalance also creates problems. If you only prioritize global, overarching goals like “maximize user value” or “ensure project success,” agents may make poor evaluations and decisions in practice. Global goals are often abstract and not clearly connected to local, short-term realities. When an agent has only a broad mission, it may not know how to act in a specific situation: Should it be brief or detailed? Strict or flexible? Conservative or creative? Different agents might interpret the same global goal in different ways, and decisions become inconsistent and hard to control.

The key is to connect local and global goals explicitly. Each agent should have a clear local objective that defines its own task: what it is responsible for, when its task is “done,” and under what constraints. At the same time, that local goal should be designed so it supports the overarching purpose of the system, and is limited by constraints that come from the global goal.

For example, in a support system, a triage agent might have the local goal “classify and route tickets accurately,” while a response agent has “provide a clear, actionable answer.” Both of these local goals should be grounded in a higher-level goal such as “resolve user issues effectively without unnecessary delay.” That global goal can add constraints: routing should prioritize correctness over speed when in doubt, and responses should prioritize resolving the issue over being as short as possible.
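The support example can be written down as data, so that every local goal carries the global goal and its constraints with it. This is only a sketch: the class names, the `brief` method, and the "done when" wording are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemGoal:
    statement: str
    constraints: tuple[str, ...] = ()


@dataclass(frozen=True)
class AgentGoal:
    agent: str
    objective: str
    done_when: str           # assumed completion criterion, not from the text
    system_goal: SystemGoal  # the global goal this local goal must serve

    def brief(self) -> str:
        """Render the local goal together with the global goal that bounds it."""
        lines = [
            f"Agent: {self.agent}",
            f"Local objective: {self.objective}",
            f"Done when: {self.done_when}",
            f"Serves system goal: {self.system_goal.statement}",
        ]
        lines += [f"Constraint: {c}" for c in self.system_goal.constraints]
        return "\n".join(lines)


system_goal = SystemGoal(
    statement="Resolve user issues effectively without unnecessary delay",
    constraints=(
        "When in doubt, prefer correct routing over fast routing",
        "Prefer resolving the issue over keeping the answer short",
    ),
)
triage = AgentGoal("triage", "Classify and route tickets accurately",
                   "The ticket has a category and a destination queue", system_goal)
print(triage.brief())
```

Because the system goal is a required field, it is not possible to define an agent with a local objective alone, which is the imbalance described earlier.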

If the system becomes unbalanced, you get predictable patterns. With too much focus on local, short-term goals, every part looks fine, but the overall result is poor: the system reaches local targets but fails the main purpose. With too much focus on global, abstract goals, decisions become vague and ungrounded: agents struggle to translate the overarching aim into good local decisions, and the connection between what they do now and what the system should achieve later is unclear.

Designing a good agent system means thinking about both levels at the same time. You define the overarching goal of the system, and then design local goals for each agent that clearly contribute to this goal and are consistent with it. This balance between local and global goals is what allows many agents with different responsibilities to work together and move the whole system toward its intended outcome.

Text Versus Knowledge

Most of us have seen text that sounds impressive but doesn’t really say much. Buzzword-heavy emails. Corporate reports. Tool-generated paragraphs that look polished but don’t help you make a better decision. This is the core difference: handling text is one thing, understanding knowledge is something else entirely. Language models and other tools have made it incredibly cheap and fast to produce and manipulate text. But turning that text into real understanding and action is still the hard part.

When we say “text is easy”, we mean handling textual information: recognizing patterns, rearranging sentences, and guessing connections between words. It includes things like rephrasing, summarizing, translating, or completing a sentence. Even children manage this in word games and simple puzzles. They spot patterns, guess missing words, and play with rhymes. This is surface-level pattern matching. You don’t need to deeply understand the content; you just need to see what “fits” with what came before. That is why it is relatively easy to generate text that sounds plausible or to shuffle existing information into a new format.

“Knowledge is hard” points to something deeper: understanding the knowledge in the information. It’s about seeing the connections that are not directly written in the text, and recognizing the patterns that live in the concepts behind the words and in the context where the information is used. You need to understand what the text really means, what follows from it, and how it connects to other things you know. You need to infer what is implied but not written, see assumptions and consequences, and place ideas into a larger picture. This is much harder than manipulating text, because it demands background knowledge, experience, and reasoning.

The difference becomes clear in simple examples. Summarizing an article into a few bullet points is a text task: compressing and reorganizing sentences. Asking “What should I do differently in my team meeting on Monday based on this article?” is a knowledge task: you must understand the ideas, judge what is relevant for your situation, and turn that into action. Giving a definition of a concept is a text task. Using that concept to diagnose a real problem in your team or project is a knowledge task. One stays at the level of words; the other must connect to reality.

This distinction matters because we often confuse good-looking text with real understanding. At work, we produce reports, slide decks, and documentation. But the value lies in the decisions they inform, the problems they help solve, and the changes they lead to. In organizations, there might be a lot of documents, but that doesn’t guarantee shared understanding. Shared text is not the same as shared knowledge.

It also matters for how we use language models and agents. These systems are very good at handling text: drafting emails, summarizing documents, rephrasing content, generating examples. They are not a replacement for human judgment, domain expertise, or responsibility. If you treat a model as a text tool, you are aligned with what it does well. If you treat it as a source of guaranteed truth or deep understanding, you blur the line between text and knowledge.

A practical way to work with this distinction is to let models handle the text, while you own the knowledge. Use them to draft, rewrite, explore ideas, and structure notes. Treat the output as a draft or suggestion, not as an answer. Keep the harder part—understanding, evaluating, and deciding—in your own hands.

Since knowledge is hard, it helps to approach it actively. Don’t just read; summarize in your own words, ask what the core claim is, and consider when it might be wrong. Connect new information to what you already know, and test ideas in real situations. Discussing, applying, and reflecting turns information into knowledge.

Text is easy. Knowledge is hard. Tools and language models can make the text part faster and more convenient. Our job is to do the knowledge part: see the deeper connections, understand the concepts and context, and make better decisions based on them.

Click and Wait

When language models, agents, and scripts automate most of the actual work, something strange happens: your job quietly turns into a sequence of “click and wait”.

You start your day, open your tools, and on paper you should be highly productive. But in practice, your day looks like this: click “Run”, wait for a script to finish. Click “Open”, wait for an app to start. Click “Refresh”, wait for information to load. Click “Retry”, wait because something hung and needs to be restarted or run again. Doing a task becomes a long chain of click, wait, click, wait.

Once you start noticing it, you see different kinds of waiting everywhere. You wait for information to load. You wait for applications and services to start. You wait for scripts to execute, for calls to finish, for background jobs to complete. And when something goes wrong, you wait even more: you wait to see if it’s really stuck, you wait while things restart, you wait while a process is run again from scratch.

Individually, each wait is small. A few seconds for a page to load. Half a minute for a script. A minute for a restart. None of it feels dramatic in isolation, but across a day it adds up. You’re no longer spending time doing the core work; you’re spending time sitting in between automated steps, supervising.

There’s also a cognitive cost. These small waits constantly break your focus. You start a task, wait a little, get distracted, then try to pick up where you left off. You begin something else while a script runs, then you have to switch back when it’s done. Your workday gets chopped into tiny fragments by all these little pauses.

And then there’s the emotional side. It’s frustrating to feel blocked by your tools. It’s tiring to sit and watch a spinner, unsure if something is just slow or actually stuck. It doesn’t feel like real work, but it takes real time and real attention. You end up with the sense that you’re working all day, yet not moving as much as you should.

You can start by simply observing how often this happens. For one day, pay attention every time you: wait for information to load, wait for an app to start, wait for a script, call, or job to finish, or wait because something has hung and needs to be restarted or run again. Write down roughly how long you waited and what you did in that time. You will probably see how much of your day is actually “click and wait”.
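If you keep those notes in a simple list, a few lines of code can total them. The entries below are made-up placeholders showing the shape of the log, not measurements.

```python
from collections import defaultdict

# Hypothetical notes from one day: (kind of wait, seconds waited).
entries = [("load", 8), ("script", 30), ("restart", 60), ("load", 12), ("script", 45)]


def total_by_kind(entries):
    """Return {kind: (number of waits, total seconds)} for a hand-written wait log."""
    totals = defaultdict(lambda: [0, 0])
    for kind, seconds in entries:
        totals[kind][0] += 1
        totals[kind][1] += seconds
    return {kind: (count, seconds) for kind, (count, seconds) in totals.items()}


for kind, (count, seconds) in total_by_kind(entries).items():
    print(f"{kind}: {count} waits, {seconds} seconds")
```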

Some of this waiting is unavoidable, but not all of it needs to be so painful. You can reduce the number of times you have to click by batching tasks and letting scripts or agents chain steps automatically instead of requiring you to manually start each one. You can reduce some of the waiting time by optimizing common scripts and calls, or by fixing obvious bottlenecks that slow you down again and again. And for the waiting that remains, you can plan small “micro-tasks” that fit into those 30–90 second gaps, so you’re not just staring at a loading screen.

Better feedback also helps. If your tools give you clear progress indicators and send notifications when something is done or has failed, you don’t have to sit and watch. You can safely turn your attention to something else and only come back when your input is actually needed.

At a higher level, the goal is to change your role from “the person who keeps clicking and checking” to “the person who sets intent and lets systems run”. Ideally, you describe what you want to happen, trigger it once, and the rest is handled end to end: run the scripts, handle retries, recover from common errors, and notify you only when a real decision is needed.
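A minimal sketch of that idea follows: steps run in sequence, failures are retried, and a human is notified only on completion or when retries are exhausted. The function name, the fixed delay between attempts, and the `notify` callback are assumptions; a real setup would likely distinguish retryable errors from permanent ones.

```python
import time
from typing import Callable, Sequence


def run_pipeline(steps: Sequence[tuple[str, Callable[[], None]]],
                 notify: Callable[[str], None],
                 max_attempts: int = 3,
                 delay_seconds: float = 1.0) -> bool:
    """Run steps in order, retrying failures; notify only at the end or on final failure."""
    for name, step in steps:
        for attempt in range(1, max_attempts + 1):
            try:
                step()
                break  # step succeeded, move on to the next one
            except Exception as error:
                if attempt == max_attempts:
                    notify(f"{name} failed after {attempt} attempts: {error}")
                    return False
                time.sleep(delay_seconds)  # wait before retrying the same step
    notify("all steps finished")
    return True
```

Called as `run_pipeline([("export", export), ("upload", upload)], notify=print)`, where `export` and `upload` stand for your own functions, this replaces a series of manual click, wait, and retry cycles with a single trigger.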

“Click and wait” is a symptom of half-finished automation: the work is automated, but the flow is not. If your days are full of waiting for things to load, start, run, hang, restart, and run again, it’s a sign that your next productivity gain is not a new tool, but redesigning how your existing tools work together—so that you spend less time waiting and more time actually working.

Using Language Models to Appear

Many people now let a language model write for them: the polished LinkedIn post, the perfect client email, the impressive application or essay. In a few minutes, you can get something that sounds confident, competent, and professional.

When you use a language model mainly to appear as something—to seem more knowledgeable, more authentic, or more expert than you really are—you are leaning on a property of the technology that is, in many ways, a weakness. The ability to “sound right” is not the real strength of this technology. It is a side-effect of what it actually does.

What a language model really does is predict plausible text. Given some input, it produces the words that statistically fit best. Because it has been trained on very large amounts of text, it is good at sounding fluent, correct, and even wise. But that does not mean it understands, cares, or believes anything. What it produces can look authentic without being authentic, and look correct without being correct.

The core strength of this technology is not its ability to appear correct, appear authentic, or imitate human expression. Those are just consequences of pattern-matching. Treating “appearing like something” as the main feature means we focus on the facade, not on what we actually think or know.

Still, this side-effect is exactly what many people want to use. It is attractive to let a model make you look more professional than you feel, more engaged than you are, or more experienced than you truly are. It is tempting for organizations to let a model generate values, strategies, and mission statements that read well, even if nobody really stands behind them.

The problem is that this use goes against what the technology is really good at. Language models are strong at drafting, summarizing, rephrasing, and exploring options. They can help you work faster when you already have ideas, knowledge, and viewpoints. They are much weaker when you ask them to be your identity, your authenticity, or your expertise.

When you rely on a model to appear as something, you risk confusing its output with your own thinking. You risk building communication on something that only looks real. Over time, this can weaken your own skills in writing and reflection, and it can create a gap between how you present yourself and who you actually are.

The same is true for organizations. If they mainly use this technology to mass-produce polished language, their voice becomes generic and hollow. Content may look good at first glance, but it is not rooted in real conviction or understanding. The surface improves, while the substance is left untouched.

A better way to use this technology is to treat it as a tool that supports your own work, not as a mask you wear. Start with your own thoughts, even if they are unclear or incomplete. Let the model help you structure, clarify, and refine. Use it to suggest alternatives and questions that can deepen your understanding. Keep the responsibility for what is being said.

The main point is simple: the ability to appear correct, authentic, or human is a side-effect of how these systems generate language. It is not where their true strength lies. If we use that side-effect to construct a facade, we become more dependent on it and less grounded in our own thinking. If we instead use the technology to sharpen and express what is already ours, we keep authenticity and judgment on the human side, where they belong.

How Systems Develop Over Time

Systems develop gradually, step by step, bit by bit. They change through small improvements, small adaptations, and also small deteriorations. A process is adjusted slightly. A tool gets a small fix. A routine is modified to handle an exception “just this once”. None of these feel dramatic on their own, but over time they add up and can make a system very different from how it started.

This gradual evolution doesn’t happen in just one place. It happens in many different systems at the same time. Inside a company, several processes, tools, and ways of working are all changing in parallel. Across a market or industry, multiple products and solutions are also evolving at the same time. Even among similar systems trying to solve the same problem, each one develops slightly differently as people make different choices and apply different small adjustments.

These systems also compete. They compete for users, attention, resources, and trust. Some of the similar systems that evolve in parallel turn out to be much, much better than others. Small differences in decisions and adaptations accumulate, and over time a few variants become clearly superior, while others stagnate or quietly get worse through layers of workarounds and compromises.

From the outside, it does not always look gradual. For a long time, differences between systems may seem small or invisible. Then, suddenly, development no longer looks step by step. It looks like a jump, a revolution, a quantum leap. This apparent leap happens when one system replaces another. The “new” system is usually not new in an absolute sense; it has been developing in parallel with others, but at some point it crosses a threshold where it is so much better that it rapidly displaces the alternatives.

The similar systems that were evolving at the same time did not all evolve in the same way. Some became much better and suddenly replaced the others. From a distance, this looks like sudden change. From up close, it is the result of many small improvements, adaptations, and deteriorations interacting over time, until one system wins the competition and the shift becomes visible.

High-Level Standard Components

Adding login to an app or an API is something developers keep doing over and over. You pick an identity provider, set up configs, implement a login flow, validate tokens, handle errors, add logging, and integrate it all into an existing system. None of this is especially unique per project, but it’s still a lot of work each time.

To get a login flow working, a developer typically needs to choose an identity/auth provider, register an app or API with that provider to get credentials, set up separate dev and prod configuration, and make sure the app and server can read that configuration correctly. This means wiring up environment variables, config files, and secrets, and making sure everything is consistent between environments.
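The configuration part is often the same few lines in every project. Below is a sketch that reads settings from environment variables and fails early when one is missing. The variable names (`AUTH_ISSUER` and so on) and the `AuthConfig` fields are assumptions for illustration, not names required by any particular provider.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Hypothetical variable names; dev and prod would set different values for them.
REQUIRED = ["AUTH_ISSUER", "AUTH_CLIENT_ID", "AUTH_CLIENT_SECRET", "AUTH_REDIRECT_URI"]


@dataclass(frozen=True)
class AuthConfig:
    issuer: str
    client_id: str
    client_secret: str
    redirect_uri: str


def load_auth_config(env: Optional[Mapping[str, str]] = None) -> AuthConfig:
    """Build the config from environment variables and fail early if one is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing auth settings: {', '.join(missing)}")
    return AuthConfig(*(env[name] for name in REQUIRED))
```

Failing at startup with the list of missing names tends to be easier to debug than a login that breaks halfway through a redirect.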

Then you have to program the actual login flow. That usually includes a login page or button, redirecting the user to the provider, handling the callback endpoint, dealing with state, and setting up sessions or tokens. On the server side, you need components or endpoints to receive and process the login response and connect it to how your app represents users.

On top of that come all the cross-cutting concerns. You have to code token validation (for example verifying JWT signatures, issuer, audience, expiry and scopes). You have to handle error situations such as invalid tokens, expired sessions, misconfiguration, or provider errors. You need logging and observability so you can see when login fails and why. Finally, you must integrate everything into the existing application or system, protecting endpoints, checking roles or permissions, and mapping identities to whatever domain model you already have.
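The claim checks mentioned above look much the same in every project. The sketch below covers issuer, audience, expiry, and scopes, and assumes the signature has already been verified by a vetted JWT library. It also assumes scopes arrive as a space-separated `scope` string; some providers use a different claim name or a list, so treat that part as an assumption to adapt.

```python
import time
from typing import Any, Iterable, Mapping, Optional


class TokenError(Exception):
    """Raised when a token's claims do not satisfy the application's requirements."""


def check_claims(claims: Mapping[str, Any], issuer: str, audience: str,
                 required_scopes: Iterable[str] = (),
                 now: Optional[float] = None) -> None:
    """Validate decoded JWT claims. Signature verification must happen before this."""
    now = time.time() if now is None else now
    if claims.get("iss") != issuer:
        raise TokenError("unexpected issuer")
    aud = claims.get("aud")  # may be a single string or a list of strings
    audiences = [aud] if isinstance(aud, str) else list(aud or [])
    if audience not in audiences:
        raise TokenError("token not intended for this audience")
    exp = claims.get("exp")  # expiry as seconds since the epoch
    if not isinstance(exp, (int, float)) or exp <= now:
        raise TokenError("token expired or missing exp")
    granted = set(str(claims.get("scope", "")).split())
    missing = set(required_scopes) - granted
    if missing:
        raise TokenError(f"missing scopes: {', '.join(sorted(missing))}")
```

Raising one exception type with a specific message also gives the error handling and logging described above a single place to hook in.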

Most of these tasks are very similar from project to project. Developers keep deciding and implementing the same things repeatedly. The real choice in many cases is only which identity/auth provider to use and which flows to support. The rest is largely the same glue code and setup. There is little reason to solve all of this from scratch every time.

Instead of re-implementing everything, you could use a high-level standard component for login. In such a model, the developer only needs to decide which identity/auth provider to use and supply the minimal configuration for it. The component would handle dev and prod configuration, reading config in the app and on the server, the login flow (pages and server parts), token validation, error handling, logging, and simple hooks for integrating with existing applications.

The goal is to avoid manually setting up and coding all of these pieces in each new project. By treating login as a reusable, high-level standard component, you reduce the work to what is actually necessary: choosing a provider and specifying the few details that are truly specific to the application. Everything else becomes implementation detail that the component takes care of, instead of a chore to be redone by hand for every single app.

Will a Language Model Tool Say No?

When we ask a language model tool to do something, it usually just does it.
“Improve this text.”
“Rewrite this email.”
“Refactor this code.”

But will it ever say: “No, that’s not necessary” or “What you have is already good enough”?

If you ask a model to improve some code or polish a paragraph, it will almost always produce a new version. It rarely stops to say: “This is fine as it is.” It rarely tells you that the code you want improved is already clean, or that the text you want generated is unnecessary. On its own, the tool does not take a position on whether the task needs to be done.

Today, these tools are designed to follow instructions. They do the task that is requested: generate, rewrite, expand, refine. They treat the request as a given. The logic is simple: you asked for something, so they try to deliver it. They do not usually question if the code is already good enough, or if the text you are asking for adds any real value.

There are some situations where a model will say no, but those are mostly about safety or policy. It may refuse to answer because something is not allowed. That is different from saying “You don’t need this” or “This is unnecessary.” The refusal is about what it is allowed to do, not about whether your request makes sense or is worth doing.

If you want the tool to act differently, you have to ask for it. You can say: “First, check if this text actually needs improvement. If it is already good enough, just tell me that and don’t rewrite it.” Or: “Review this code. Only suggest changes if there is a clear benefit. If not, say no changes are needed.” In other words, you have to explicitly invite the model to evaluate whether the task is necessary, not just to perform it.
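In a tool, that invitation can be built into the prompt and the reply handling. The sketch below makes no call to any real model API: the marker string `NO_CHANGES_NEEDED` and both function names are hypothetical, and `model_reply` stands for whatever text your model returns.

```python
NO_CHANGE = "NO_CHANGES_NEEDED"  # hypothetical marker the model is asked to return


def build_review_prompt(text: str) -> str:
    """Ask the model to judge necessity first and only then rewrite."""
    return (
        "First decide whether the text below actually needs improvement.\n"
        f"If it is already good enough, reply with exactly {NO_CHANGE} and nothing else.\n"
        "Otherwise, reply with the improved text only.\n\n"
        f"Text:\n{text}"
    )


def apply_review(original: str, model_reply: str) -> str:
    """Keep the original when the model declines to change it."""
    reply = model_reply.strip()
    return original if reply == NO_CHANGE else reply
```

Giving the "no" an explicit, machine-checkable form means the calling code can treat "nothing to do" as a normal outcome rather than a failure.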

By default, a language model tool does not decide whether your request is needed. It runs the task you give it. If you want a tool that sometimes says “No, this is already good enough,” that behavior has to be part of your prompt or the way the tool is set up—not something it will do on its own.

The Importance of Access Control on Information

Access control on information is becoming much more important now that we are using all these new language-model–based tools. This is especially true when the results are not just for your own use, but are going to other people. Imagine using a copilot to summarise what the board of a housing association has done in 2025, to inform all the co‑owners. The copilot has access to everything stored in the association’s documentation system, plus email, Vibbo messages, Vibbo posts, and other sources. You ask it for a summary, it gives you a nice text, you skim it, think “good enough,” and send it out. It’s just the housing association, how bad can it be?

Then you discover that the summary includes a paragraph saying that the association has been plagued by a lot of noise from a named co‑owner, and that measures for forced sale of the apartment have been initiated. In reality, nothing like that has been formally decided or made official. Maybe there are complaints, maybe someone has mentioned forced sale as a possibility in an internal email, but the tool has mixed together drafts, discussions, and documents and presented it as if it were a fact ready to be communicated to everyone.

This kind of obvious mistake might be caught if someone reads carefully. But it becomes much harder when it is not so clear what information can be given to whom, or when it is unclear what is considered official and what is not. A language model does not understand the difference between internal discussion and official communication. It just sees text and tries to answer the question you asked.

Some organisations try to handle this by forcing all information to be classified. For example, every document must be marked as open, internal, or restricted. In theory this should help, but in practice it is very easy to mix things up. People mislabel or forget to label. Different teams use the labels in different ways. Over time the classification becomes something you click past, not something that is actually trusted. For tool developers and system administrators, it can be even harder, because they often have technical access to everything and must somehow make sure the tools respect labels that are not consistently used.

What really matters is control over which context information belongs to, how it is used, and when it is allowed to be used. Information that belongs in a board context should not automatically be available in a public context. Internal complaints and draft legal assessments should not suddenly show up in a summary meant for all residents. The same piece of information can be appropriate in one context and completely wrong in another.

To handle this, we need more than simple labels. We need to think in terms of contexts and audiences. Is this for the board only, for internal staff, for all residents, or for the general public? Is this a draft or an approved decision? Tools should not have free access to everything just because the data is technically stored in the same system. There should be technical “watertight compartments” between different types of information and different uses. When a user asks for a text to be sent to all residents, the tool should only be allowed to use information that is safe for that audience, and exclude sources that belong to internal discussions.
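One way to picture such a compartment is a filter that runs before any text reaches the model. This sketch makes several assumptions: the audience names, the idea that audiences form a simple ordering from widest to narrowest, and the `approved` flag are all illustrative, and real contexts may not line up this neatly.

```python
from dataclasses import dataclass
from enum import IntEnum


class Audience(IntEnum):
    """Widest audience a document is cleared for (lower value means wider)."""
    PUBLIC = 0
    RESIDENTS = 1
    STAFF = 2
    BOARD = 3


@dataclass(frozen=True)
class Document:
    title: str
    audience: Audience
    approved: bool  # False for drafts, discussions, and unapproved material
    text: str


def sources_for(docs, target: Audience):
    """Return only approved documents cleared for the target audience or a wider one."""
    return [d for d in docs if d.approved and d.audience <= target]


docs = [
    Document("Annual accounts 2025", Audience.RESIDENTS, True, "(text omitted)"),
    Document("Noise complaint emails", Audience.BOARD, False, "(text omitted)"),
    Document("Draft legal assessment", Audience.BOARD, False, "(text omitted)"),
]
print([d.title for d in sources_for(docs, Audience.RESIDENTS)])
# prints ['Annual accounts 2025']
```

The point of filtering first is that the model never sees the excluded material, so it cannot blend it into the summary, however the question is phrased.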

This is not just a technical problem. It is also about culture and routines. People need to understand that these tools are powerful and can mix information from many places, and that they will not automatically know what is sensitive or unofficial. Generated texts should be treated as drafts that must be read with the same care as anything else you send out. If we combine clear thinking about context and audience with technical separation of data, we can use new tools without accidentally leaking, inventing, or exposing information that should never have left its original context.