I have been building in the AI domain for the last seven years, and for the last two and a half I have been knee-deep in generative AI. This is my first public blog-style write-up in a while, and I have decided to try something new with monologues. This post is a more concise version of the monologue above, on where the field is racing and where it is quietly stalling.
The pain I keep running into
Let me walk you through a typical day using AI-powered workflows. I move between agentic coding tools and chat models: I run research in one system, simulate and prototype in another, and review in a third. And every single time, I find myself doing an enormous amount of copy/paste.
Most agentic tools and chat apps have added memory toggles and longer context windows, but the moment I cross product boundaries, my work evaporates. There is no real portability of context. The result is a fragile workflow stitched together by the clipboard. That is not intelligence; that is unconscious adaptive behaviour.
The inference tax
Let’s be honest: we are being oversold at the app layer. To get the best of each product, I’m nudged into multiple subscriptions and stacked usage fees. Meanwhile we hear that token costs are trending toward zero, but invoices don’t reflect that reality. There is a lot of profiteering around inference, while the fundamentals that would bend the cost curve (architecture, democratisation, engineering efficiency) aren’t being pushed hard enough. The big winners in that gap are the large labs and hyperscalers.
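To make the "inference tax" concrete, here is a back-of-envelope sketch comparing raw per-token cost against a stack of app-layer subscriptions. Every number below is a hypothetical placeholder for illustration, not a real price from any provider.

```python
# Back-of-envelope sketch of the "inference tax": raw token cost vs a stack
# of app-layer subscriptions. All numbers are hypothetical placeholders,
# not real prices from any provider.

def monthly_token_cost(tokens_per_day: int, price_per_million: float, days: int = 30) -> float:
    """Raw inference cost if you paid per token at an assumed API rate."""
    return tokens_per_day * days / 1_000_000 * price_per_million

# Assumed usage: a heavy individual workflow.
raw = monthly_token_cost(tokens_per_day=500_000, price_per_million=3.0)

# Assumed app-layer spend: three subscriptions plus stacked usage fees.
subscriptions = [20.0, 20.0, 30.0]
usage_fees = 25.0
app_layer = sum(subscriptions) + usage_fees

print(f"raw token cost:  ${raw:.2f}/month")     # $45.00/month
print(f"app-layer spend: ${app_layer:.2f}/month")  # $95.00/month
print(f"markup factor:   {app_layer / raw:.1f}x")  # 2.1x
```

Even with generous assumptions, the gap between what inference costs and what the app layer charges is the tax I’m describing.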
Progress without memory isn’t progress
We have made real advances: vector databases, graph databases, better retrieval, better models, and clever caching mechanisms. But continual learning across workflows is still mostly theatre. Even with in-app memory, I can’t carry today’s research from ChatGPT to an agentic coding tool to a different workspace and expect it to just work. I fall back to screenshots and pasted notes, and start over. It’s productivity cosplay.
High urgency, high bar
If there is a theme to what I’m arguing for, it’s this:
High urgency: Treat inference affordability and context portability like P0 bugs for the entire ecosystem. They block real adoption and compound hidden costs.
High bar: Measure progress by fundamentals, not vibes. Prioritise interoperability, learning continuity, engineering efficiency, and customer‑validated outcomes. Don’t declare victory without these standards.
What I want to see (and help build)
Context portability by default
Your research, notes, and working state should be addressable and importable across providers, securely and with your consent, without manual glue. Memory must be a user primitive, not a product lock‑in lever.

Real continual learning for workflows

Beyond long prompts: models that retain and refine task‑specific knowledge across sessions and tools, with auditability and controls. This is the difference between a good chat and compounding value. Pre‑fill caching, in‑context learning with verifiable feedback, and adapter tuning should be complementary paths here while we keep researching more algorithmic breakthroughs.

Inference economics that reflect reality

Push on architecture, scheduling, and caching strategies that lower effective cost.

Engineering efficacy over vibe coding

Natural language to code is powerful, but when it becomes vibe coding, we lose rigour. Ship systems that shorten the path to reliable software (tests, tracing, reproducibility), not just generated programs without engineering fundamentals.

Research → product loops everywhere
One thing we’ve nailed is compressing the cycle from paper to product. But we need to double down on putting products in users’ hands faster, with enough feedback rails that the product becomes an instrument for truth, then feed that signal back into model and system design. Let customers set the scoreboard.
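The caching point above can be made concrete with arithmetic. Repeated workflow runs often share a long static prefix (system prompt, retrieved context), and pre‑fill caching lets that prefix be recomputed or billed at a discount after the first run. The token counts and discount rate below are assumptions for illustration, not any provider’s actual pricing.

```python
# Minimal sketch of why prefix (pre-fill) caching bends effective cost:
# repeated runs share a long static prefix, which can be charged at a
# discount after the first run. Numbers are illustrative assumptions.

def effective_tokens(prefix: int, suffix: int, runs: int, cache_discount: float = 0.1) -> float:
    """Full-price token-equivalents across `runs`, with the shared prefix
    charged at `cache_discount` of full price after the first run."""
    first = prefix + suffix
    rest = (runs - 1) * (prefix * cache_discount + suffix)
    return first + rest

no_cache = (8_000 + 500) * 20            # 20 runs, no prefix reuse
with_cache = effective_tokens(8_000, 500, 20)
print(f"no cache:   {no_cache:,.0f} token-equivalents")   # 170,000
print(f"with cache: {with_cache:,.0f} token-equivalents") # 33,200
print(f"savings:    {1 - with_cache / no_cache:.0%}")     # 80%
```

When the prefix dominates the prompt, the savings dominate the bill. That is the kind of engineering lever I mean by economics that reflect reality.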
Why the bar matters
Most of the time we optimise for local maxima: one model, one app, one neat demo. But the real value shows up only when intelligence compounds across systems. If I can hand work off between tools without losing context, and if the cost of doing that is sane, then AI stops being a novelty and starts behaving like infrastructure.
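As a thought experiment, handing work off between tools without losing context could look like exporting a provider‑neutral "context bundle". Everything here, the schema, the field names, the integrity hash, is a hypothetical sketch I invented for illustration; no such standard exists across these products today.

```python
# Hypothetical sketch of a provider-neutral "context bundle" that one tool
# could export and another import. The schema is invented for illustration;
# it is not an existing standard or any product's actual format.
import hashlib
import json
from datetime import datetime, timezone

def make_context_bundle(task: str, notes: list, artifacts: dict) -> dict:
    """Bundle working state as plain, portable JSON with an integrity hash."""
    payload = {
        "version": "0.1",
        "exported_at": datetime.now(timezone.utc).isoformat(),
        "task": task,
        "notes": notes,          # research findings, decisions, open questions
        "artifacts": artifacts,  # name -> content (code, specs, drafts)
    }
    digest = hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
    return {"payload": payload, "sha256": digest}

bundle = make_context_bundle(
    task="Prototype retrieval pipeline",
    notes=["Chose hybrid BM25+dense retrieval", "Latency budget: 200ms"],
    artifacts={"pipeline.py": "def retrieve(query): ..."},
)
print(bundle["payload"]["task"], bundle["sha256"][:12])
```

The point is not this particular schema; it is that portable working state is plain engineering, not a research problem.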
Until then, we’ll keep paying the inference tax while living in a copy/paste culture that pretends to be memory.
I’m optimistic because the path isn’t mystical; it is engineering. Set a higher bar. Move faster on the parts that actually unlock compounding value: context portability, continual learning, and honest economics. That’s the urgent work.
Look out for my next monologue: why every piece of software should come with a reinforcement‑learning environment. The age of experience means products get to improve with us.