
How fast is agentic AI moving, and what does that mean for your adoption strategy?

The pace question matters more than people think

By Paul Shotton, Advocacy Strategy

Most discussions about AI adoption in public affairs focus on capability: can it do this task, is it good enough, should we try it? That is the right starting point. But there is a second question that follows immediately, and it is less often asked: how quickly is the underlying capability changing, and what does that mean for the tools, services, and internal workflows we are building around it?


The answer has real strategic implications — for how you evaluate software providers, how you commission internal tools, and how you structure your team's experimentation.


The pace of change is not gradual

The numbers from Stanford HAI's 2025 AI Index set the strategic context for everything that follows.


The first trend is model size. AI models are measured partly by the number of parameters they contain — the variables the system uses to process and generate language. Stanford's data shows that the smallest model capable of scoring above 60% on a standard academic reasoning benchmark fell from 540 billion parameters in 2022 to 3.8 billion in 2024 — a 142-fold reduction for roughly comparable performance. By 2025-26, models under 10 billion parameters routinely match GPT-3.5 performance on standard tasks and run on a laptop. What once required the infrastructure of a large technology company now runs on consumer hardware.


The second trend is cost. Organisations accessing AI through software or building their own tools typically pay per query — a small charge each time the model is asked to do something, measured in cost per million tokens, where a token is roughly three-quarters of a word. Comparing like with like across capability tiers, the trajectory is steep and consistent.

At GPT-3.5 level — the capability that first made AI genuinely useful to mainstream business users — the cost fell from $20 per million tokens in late 2022, to $0.07 by October 2024, to around $0.03 by early 2026. That is a 667-fold reduction in three years. At GPT-4 level — roughly three to five times stronger on complex reasoning tasks — the cost fell from around $400 per million tokens in early 2023 to $0.40 by early 2026. A 1,000-fold reduction in three years.


To make that concrete: a monitoring workflow generating a structured policy briefing now runs at cents per report using GPT-4 class capability. The same task would have cost orders of magnitude more in 2023. Frontier AI now costs less than 0.1% of what it did three years ago. That changes what is viable to build, what can be run continuously, and what pricing models for AI-enabled services can realistically look like.
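As a back-of-envelope check on those figures (the token counts below are illustrative assumptions, not from the Stanford data; the per-million-token prices are the ones cited above):

```python
# Rough cost arithmetic for an AI-generated policy briefing.
# Token counts are illustrative assumptions; prices are the
# cited per-million-token figures for GPT-4-class capability.

def report_cost(input_tokens: int, output_tokens: int,
                price_per_million: float) -> float:
    """Cost in dollars for one report at a flat per-token price."""
    return (input_tokens + output_tokens) / 1_000_000 * price_per_million

# Suppose a briefing reads 40,000 tokens of source material
# and writes a 2,000-token structured summary.
cost_2026 = report_cost(40_000, 2_000, 0.40)    # early-2026 pricing
cost_2023 = report_cost(40_000, 2_000, 400.0)   # early-2023 pricing

print(f"2026: ${cost_2026:.4f} per report")          # a few cents
print(f"2023: ${cost_2023:.2f} per report")          # ~$16.80
print(f"reduction: {cost_2023 / cost_2026:.0f}x")    # 1000x
```

At these assumed volumes, a continuous monitoring service producing hundreds of briefings a week moves from a meaningful line item to a rounding error.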


For public affairs managers, the practical implication is direct: a tool that was genuinely impressive twelve months ago may already be behind the current capability frontier. The question is not whether the stack will improve. It is whether your tools, vendor relationships, and internal workflows are built to absorb that improvement rather than be disrupted by it.


What this means when evaluating software providers

When a software or monitoring provider presents you with an AI-powered solution — legislative tracking, stakeholder intelligence, policy summarisation, horizon scanning — the question is not only "does it work?" It is "how is it built, and how will it keep pace?"

There are four things worth probing.


Model dependency. A tool built as a thin wrapper around a single model's API — passing your request to an AI service and returning the result, with minimal workflow logic of its own — is inherently fragile. When the underlying model changes, the tool changes with it, often unpredictably. A more defensible architecture is one where the provider has invested in workflow design, retrieval infrastructure, and structured data that can absorb model improvements rather than being disrupted by them. OpenAI's agent design guidance makes exactly this point: the durable assets in an agentic system are the workflow structure, the tool interfaces, and the evaluation framework — not the model itself.


Data quality. In public affairs monitoring, leading platforms like Quorum and FiscalNote have spent years building structured, classified, and connected policy datasets. That data asset is not easily replicated by a generic AI tool. AI models perform materially better when working with well-organised, consistently structured information than when applied to raw, unstructured sources. Providers who can demonstrate deep, structured domain data alongside capable models are offering something more durable than those leading with model capability alone.


Evaluation rigour. How does the provider measure whether the AI output is actually good? In simple single-step tasks, informal judgement can get you some of the way. But in agentic workflows involving retrieval, comparison, drafting, and escalation across multiple steps, errors can compound in ways that are not always visible in the final output. A model might retrieve the wrong document, draw a subtly incorrect inference, and produce a confident-sounding but flawed summary — and a reviewer looking only at the end result may not catch it. Anthropic's engineering guidance is explicit: multi-turn agentic systems require rigorous step-level evaluation, not just final-output review. A provider who cannot explain their evaluation methodology in concrete terms should prompt scepticism.


Upgrade architecture. A provider's ability to absorb model improvements matters as much as where their product sits today. Providers with modular architectures — where the data layer, the model layer, and the orchestration layer can evolve independently — are better placed to deliver compounding value over time. In practical terms, ask them: if a significantly better or cheaper model became available tomorrow, how long would it take to integrate it, and what would break?


What this means for in-house experimentation and tool-building

Many public affairs teams are now experimenting with their own workflows — custom AI assistants, internally built agents, connected data pipelines pulling from monitoring services, parliamentary databases, or internal document libraries. That is a healthy development. But the pace of change creates specific risks worth being aware of.

The most common mistake is building something brittle: a tool tightly coupled to a specific prompt, model version, or configuration that made sense at build time but becomes a maintenance burden as the model evolves. Both OpenAI's and Anthropic's agent guidance converge on the same principle: effective agentic systems rely on simple, composable building blocks, and are designed to be testable and upgradeable from the outset.

Five practical principles follow from this.


Design around the workflow, not the prompt. A well-designed monitoring or briefing workflow — with clear task decomposition, defined inputs and outputs at each stage, and explicit checkpoints for human review — will outlast any particular model configuration. Build the workflow first, and treat the prompt as something that will evolve.
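A minimal sketch of that principle: the workflow is a sequence of named stages with explicit inputs, outputs, and human-review checkpoints, while the prompt text is just one replaceable detail inside a stage. All names and stage functions here are hypothetical stand-ins:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Stage:
    """One step in a monitoring workflow: a name, a function from
    input text to output text, and whether a human must sign off."""
    name: str
    run: Callable[[str], str]
    needs_review: bool = False

def run_workflow(stages: list[Stage], source: str) -> str:
    """Pass text through each stage in order, flagging checkpoints."""
    text = source
    for stage in stages:
        text = stage.run(text)
        if stage.needs_review:
            # In production this would pause for a human check;
            # here we just record that a checkpoint was reached.
            print(f"[checkpoint] review output of '{stage.name}'")
    return text

# Prompts live inside stages and can change without touching the workflow.
briefing = run_workflow(
    [
        Stage("retrieve", lambda t: t.upper()),           # stand-in for retrieval
        Stage("summarise", lambda t: t[:20], needs_review=True),
    ],
    "new committee amendment on data governance",
)
```

The point of the sketch is what survives a model change: the stage boundaries and checkpoints stay fixed while any individual `run` function is rewritten.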


Keep the architecture modular. The retrieval layer, the model layer, and the output layer should be separable enough that they can be improved or replaced independently. If commissioning internal development, specify this explicitly: changes to one component should not require rebuilding the others.
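In code, that separation can be as simple as putting the model layer behind an interface, so swapping providers changes one class rather than the whole tool (the `ModelClient` name and `complete` signature are hypothetical, not any vendor's API):

```python
from typing import Protocol

class ModelClient(Protocol):
    """The only contract the rest of the tool depends on."""
    def complete(self, prompt: str) -> str: ...

class StubModelA:
    def complete(self, prompt: str) -> str:
        return f"[model A] {prompt}"

class StubModelB:  # a newer or cheaper model, dropped in later
    def complete(self, prompt: str) -> str:
        return f"[model B] {prompt}"

def summarise(model: ModelClient, document: str) -> str:
    # Retrieval and output formatting would live in their own layers;
    # this function only orchestrates.
    return model.complete(f"Summarise: {document}")

print(summarise(StubModelA(), "draft regulation"))
print(summarise(StubModelB(), "draft regulation"))  # no other code changes
```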


Start simple and build from evidence. McKinsey's 2025 AI survey found that organisations redesigning workflows around AI — rather than layering AI onto existing processes — are the ones seeing the most meaningful impact. A sophisticated tool in a poorly designed process will underperform a simpler tool in a well-structured one.


Invest in evaluation early. Evaluation does not mean formal academic testing. It means having a consistent way of answering: is this output actually good, and is it getting better or worse over time? Anthropic's guidance on agent evaluation argues that knowing whether your tool is performing well at each stage — not just at the final output — is what allows you to improve systematically rather than by intuition.
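A lightweight version of step-level evaluation is just a fixed set of checks run against each stage's outputs, with the pass rate tracked over time. Everything here is an illustrative assumption, not a description of Anthropic's methodology:

```python
def evaluate_stage(outputs: list[str], checks: list) -> float:
    """Fraction of outputs that pass every check for one workflow stage."""
    passed = sum(all(check(o) for check in checks) for o in outputs)
    return passed / len(outputs)

# Hypothetical checks for a 'summarise' stage of a monitoring workflow.
checks = [
    lambda o: len(o.split()) <= 150,        # stays within the length budget
    lambda o: "committee" in o.lower(),     # names the relevant committee
]

this_week = ["The committee agreed the amendment.", "Vote deferred."]
score = evaluate_stage(this_week, checks)
print(f"summarise stage pass rate: {score:.0%}")  # compare week on week
```

Even a harness this crude answers the question the paragraph poses: run it weekly and you know whether the stage is getting better or worse, rather than relying on intuition.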


Assume the stack will improve. If some part of your current setup will become materially better or cheaper in the near term — and it will — that changes how you build. Avoid locking in decisions that are hard to reverse, and build in enough modularity to take advantage of the next wave of improvements rather than being disrupted by them.


The broader strategic picture

McKinsey's 2025 survey found that 88% of organisations report using AI in at least one business function, but most are still in pilot or experimentation mode. Only 23% are scaling an agentic AI system anywhere in the enterprise. McKinsey's 2026 AI Trust Maturity Survey adds an important nuance: governance and agentic controls — how AI decisions are reviewed, escalated, and corrected — are lagging behind technical capability in most organisations, with average responsible-AI maturity still below 2.5 on a five-point scale.


For public affairs managers, that picture is both reassuring and clarifying. Reassuring because most organisations are still in the same learning phase — there is no dramatic first-mover advantage being lost by moving carefully. Clarifying because the priority right now is not to deploy the most ambitious possible system. It is to develop the organisational capability — workflow design, evaluation practice, governance literacy — that will allow you to absorb better tools effectively as they arrive.


The teams most likely to benefit from the pace of AI development are not those who move fastest. They are those who build most deliberately — with architectures designed to improve, workflows designed to be tested, and expectations calibrated to what the technology can actually deliver today versus what it will likely deliver soon. The organisations that develop that judgement now will be far better placed to evaluate providers, commission tools, and scale what works when the next wave of improvement arrives.

