The Platform Underneath the Intelligence

If you have spent enough time building with machine learning, you know what comes next. Something works, genuinely works, and the business case feels almost self-evident. Then you try to put it into production.
What follows is its own particular kind of disillusionment. The pipeline that ran cleanly on your laptop strains and fractures under real data volumes. Your model begins drifting in ways nobody anticipated, and the data contract you believed you had with the upstream team proves to have been a handshake at best, a polite fiction that dissolves the moment any real weight rests on it. Compliance surfaces questions you have not yet thought to ask. Costs metastasize. Other teams want to build on what you made, but there is nothing to hand them, just a tangle of scripts, a notebook, and tacit knowledge that has never been written down.
The instinct, when you hit this wall, is to reach for more tooling. A better orchestration framework, a richer feature store, another monitoring dashboard. I understand the reflex. But the wall is not a tooling problem, and it is not even primarily a technology problem. It is a platform problem. And platforms are not assembled from better components. They are thought through and designed.
A platform is a group of technologies used as a base upon which other applications, processes, or technologies are developed.
I have been sitting with this realization for a long time, through work at IBM, Red Hat, LTK, and Disney, across systems spanning enterprise mobile platforms, distributed cloud infrastructure, and machine learning at scale. The thread running through all of it has been consistent: the models are rarely where things fall apart. What lives underneath the intelligence, the data pipelines and infrastructure, the model lifecycle and observability, the governance and cost accountability, the team structures that hold it all together, that is what determines whether AI delivers on its potential or quietly recedes into the archaeology of failed initiatives.
That distance between potential and production is what this book tries to close.
Platform Engineering for AI is the book I kept reaching for and not finding. It does not open with a model. It opens with the ground underneath one, out of the conviction that most AI projects do not fail because the model was wrong. They fail because the foundation was never constructed. The scaffolding that should have been there, repeatable, governed, observable, cost-aware, was treated as something to figure out after the demo had already impressed someone with budget authority.
The premise may sound contained, but its implications run far. If you want AI that is secure, operationally sound, and built to outlast a single product cycle, you have to approach it as a platform discipline. Not purely a machine learning exercise, not merely an infrastructure concern, but a platform discipline, one that binds data engineering, ML operations, infrastructure, governance, and team design into something that actually holds together.
In practice, this is less about the tools you choose than about the way you think. Treat data pipelines as products with genuine owners, defined service levels, and governance woven in from the beginning, not as disposable scripts that somehow become load-bearing walls. Compose ML pipelines from discrete, testable components that can be versioned and replaced without requiring someone to hold their breath, so the system can actually evolve. Bring to model operations the same engineering discipline you would apply to any serious production service: structured deployment patterns, statistical drift monitoring, automated retraining, staged rollout and rollback. Not because it is aesthetically cleaner, but because production AI without these properties is fragile by construction.
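To make the drift-monitoring part of that discipline concrete, here is a minimal sketch using the population stability index, one common statistical drift measure. The function names, binning choices, and alert thresholds are illustrative, not prescribed by the book:

```python
import math

def population_stability_index(baseline, current, bins=10):
    """Compare two samples of a numeric feature by binning the baseline
    distribution and measuring how far the current one has shifted."""
    lo, hi = min(baseline), max(baseline)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            counts[sum(1 for e in edges if x >= e)] += 1
        # Smooth empty bins so the log term below stays defined.
        return [(c + 0.5) / (len(sample) + 0.5 * bins) for c in counts]

    p, q = proportions(baseline), proportions(current)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

# A common rule of thumb: PSI < 0.1 stable, 0.1-0.25 watch, > 0.25 act.
baseline = [float(i % 100) for i in range(1000)]
shifted = [float(i % 100) + 30.0 for i in range(1000)]
print(population_stability_index(baseline, baseline) < 0.1)   # same data: stable
print(population_stability_index(baseline, shifted) > 0.25)   # shifted data: alert
```

Wired into a scheduled pipeline and pointed at a model's live input features, a check like this is what turns "the model is drifting" from an anecdote into a signal that can trigger staged retraining.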
And then there are the things that rarely attract much attention but tend to be quietly decisive. Financial governance for AI workloads, for one, because unconstrained accelerator spend has a way of ending programs before they find their footing. Observability matters too, but only if conceived at the outset rather than bolted on after an incident. Infrastructure as Code applied to AI environments means the compute substrate is as reproducible and auditable as the software it carries. And then there is team composition, perhaps the least technical subject in the book but, in my experience, among the most consequential. The most thoughtfully engineered platform will still underperform if the team around it is misaligned, too narrowly drawn, or organized in ways that quietly recreate the very silos it was meant to dissolve.
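As one small illustration of what financial governance can mean in practice, a platform can gate accelerator jobs on their projected cost before they are ever scheduled. The rates, names, and single-budget policy below are invented for the sketch; real programs would pull live pricing and per-team budgets:

```python
# Hypothetical hourly accelerator rates, in dollars; illustrative only.
GPU_HOURLY_RATE = {"a100": 3.50, "t4": 0.60}

def approve_job(gpu_type: str, gpu_count: int, est_hours: float,
                spent_this_month: float, monthly_budget: float) -> bool:
    """Gate a training job on its projected spend against a team budget."""
    projected = GPU_HOURLY_RATE[gpu_type] * gpu_count * est_hours
    return spent_this_month + projected <= monthly_budget

# 4 x A100 for 10h projects to $140; with $300 already spent of a
# $500 budget, the job fits and is approved.
print(approve_job("a100", 4, 10, spent_this_month=300.0, monthly_budget=500.0))
# 8 x A100 for 20h projects to $560 and blows the budget: rejected.
print(approve_job("a100", 8, 20, spent_this_month=300.0, monthly_budget=500.0))
```

The point is not the arithmetic; it is that the check runs before the spend happens, at the same choke point where the platform already admits workloads.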
The chapter I spent the most time on covers generative AI and agentic systems, because the architectural shift there is genuine and the demands it places on the platform are qualitatively different from classical ML. Moving from prediction to generation enlarges the surface area you have to govern: prompt templates, retrieval configurations, vector indexes, and model adapters all become versioned artifacts requiring the same lifecycle discipline as any other code. Agentic workflows, where models orchestrate their own actions through tool-use and agent-to-agent protocols, introduce failure modes that conventional ML monitoring was simply not conceived to catch. Retrieval-Augmented Generation presents itself as a retrieval problem until you are operating it at scale, at which point you discover it is an infrastructure problem, a latency problem, a freshness problem, and a content safety problem, all at once. A platform designed with foresight absorbs this complexity. One assembled without it amplifies the complexity instead.
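One way to treat prompt templates, retrieval settings, and index versions as versioned artifacts is to bundle them into a single release unit with a content-addressed identity, so any change to any field produces a new, pinnable version. The schema and field names here are hypothetical, a sketch of the idea rather than the book's prescribed design:

```python
import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class GenerationArtifact:
    """A generative-AI release unit: everything that changes model
    behavior is captured and versioned together, like any other code."""
    prompt_template: str
    model_id: str
    retrieval_top_k: int
    vector_index_version: str

    def content_hash(self) -> str:
        # A stable digest of the full configuration gives each
        # combination an auditable identity a deployment can pin.
        payload = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()[:12]

v1 = GenerationArtifact(
    prompt_template="Answer only from the context below:\n{context}\n\nQ: {question}",
    model_id="llm-base-2024",
    retrieval_top_k=5,
    vector_index_version="idx-2024-06-01",
)
# Bumping retrieval depth alone yields a distinct, traceable version.
v2 = GenerationArtifact(**{**asdict(v1), "retrieval_top_k": 8})
print(v1.content_hash() != v2.content_hash())
```

With an identity like this, a production incident can be traced to "the rollout of artifact abc123" rather than to a prompt edit nobody recorded, which is exactly the lifecycle discipline the chapter argues for.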
What I hope comes through, beneath the technical content, is something I have arrived at not as a theoretical position but as a conclusion drawn from years of watching both outcomes: platform engineering is not the overhead you accept in order to do AI. It is the condition under which AI becomes worth doing. A well-engineered platform does not constrain teams; it liberates them. The governed path becomes the expedient one. Isolated experiments become composable, reusable capabilities. Governance stops being a source of friction and starts functioning as the architecture of trust. And the people doing the real model-level work finally have a stable enough foundation to actually take risks.
I did not write this as a universal blueprint, because no such thing survives contact with real organizations, their distributed systems, their inherited complexity, their political constraints, their unrelenting pressure. What I tried to write instead is the kind of reasoning you accumulate from working alongside engineers who have navigated these situations and lived to reflect honestly on what they learned. The kind of book that rewards a second reading more than a first.
If you are a platform engineer, an ML engineer, an architect, or a technical leader trying to make AI function not in a notebook or a controlled demo but in production, at scale, under real governance, this was written with you in mind. I hope it compresses some of the distance between where you are and where you are trying to arrive.
It took a long time to write. But then, the problems it concerns took even longer to understand. If any of this resonates, you can grab a copy here.
And life goes on…