
AI Agents Are Users Too: Rethinking Research for Multi-Actor Systems 

The first time an AI assistant rescheduled a meeting without human input, it felt like a novelty. Now it happens daily. Agents draft documents, route tickets, manage workflows, and interact on our behalf. They are no longer hidden in the background. They have stepped into the front lines, shaping experiences as actively as the people they serve. 

For research leaders, that changes the question. We have always studied humans. But when an agent performs half the task, who is the user? 

Agents on the Front Lines of Experience 

AI agents reveal truths that interviews cannot. Their activity logs expose where systems succeed and where they stumble. A rerouted request highlights a friction point. A repeated error marks a design flaw. Escalations and overrides surface moments where human judgment still needs to intervene. These are not anecdotes filtered through memory. They are live records of system behavior. 

And that’s why we need to treat agents as participants in their own right. 

A New Kind of Participant 

Treating agents as research participants reframes what discovery looks like. Interaction data becomes a continuous feed, showing failure rates, repeated queries, and usage patterns at scale. Humans remain the primary source of insight: the frustrations, the context, and the emotional weight. Agent activity adds another layer, highlighting recurring points of friction within the workflow and offering evidence that supports and extends what people share. Together, they create a more complete picture than either could alone. 
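As a rough illustration of what that continuous feed might look like, here is a minimal sketch in Python. The log schema (task_id, step, outcome) and the outcome labels are assumptions made for this example, not a format any particular agent platform emits.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentEvent:
    """One record from an agent activity log (hypothetical schema)."""
    task_id: str
    step: str      # workflow step the agent was working on
    outcome: str   # assumed labels: "ok", "error", "rerouted", "escalated"


def friction_report(events):
    """Count non-ok outcomes per (step, outcome) and compute the error rate."""
    friction = Counter()
    for event in events:
        if event.outcome != "ok":
            friction[(event.step, event.outcome)] += 1
    total = len(events)
    errors = sum(1 for event in events if event.outcome == "error")
    failure_rate = errors / total if total else 0.0
    return friction, failure_rate


# Illustrative events only; real logs would be far larger and noisier.
events = [
    AgentEvent("t1", "lookup_order", "ok"),
    AgentEvent("t2", "lookup_order", "error"),
    AgentEvent("t3", "lookup_order", "error"),
    AgentEvent("t4", "issue_refund", "escalated"),
]
friction, failure_rate = friction_report(events)
```

A count like this only says where friction clusters. Why it happens there is still a question for the people involved.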

Methodology That Respects the Signal 

Of course, agent data is not self-explanatory. Logs are noisy. Bias can creep in if models were trained on narrow datasets. Privacy concerns must be addressed with care. The job of the researcher remains critical: separating signal from noise, validating patterns, and weaving human context into machine traces. Instead of replacing human perspective, agent data can enrich and ground it, adding evidence that makes qualitative insight even stronger. This reframing affects more than research practice; it also changes how we think about design. 

Designing for Multi-Actor Systems 

Products are no longer built for humans alone. They must work for the people who use them and the agents that increasingly mediate their experience. A customer may never touch a form field if their AI assistant fills it in. An employee may never interact directly with a dashboard if their agent retrieves the results. Design must account for both participants. 

Organizations that learn to research this new ecosystem will see problems sooner, adapt faster, and scale more effectively. Those that continue to study humans alone risk optimizing for only half the journey. 

The New Research Frontier 

Research has always been about listening closely. Today, listening means more than interviews and surveys. It means learning from the digital actors working beside us, the agents carrying out tasks, flagging failures, and amplifying our actions. 

The user is no longer singular. It is human and machine together. Understanding both is the only way to design systems that reflect the reality of work today. 

This piece expands the very definition of the user. For the other shifts redefining research, see our earlier explorations of format (how to move beyond static deliverables) and scope (how AI dissolves the depth vs. breadth tradeoff). 

The pace of AI change can feel relentless with tools, processes, and practices evolving almost weekly. We help organizations navigate this landscape with clarity, balancing experimentation with governance, and turning AI’s potential into practical, measurable outcomes. If you’re looking to explore how AI can work inside your organization—not just in theory, but in practice—we’d love to be a partner in that journey. Request an AI briefing.   




FAQs

Why consider AI agents as research participants?
AI agents actively shape workflows and user experiences. Their activity logs reveal friction points, errors, and escalations that human feedback alone may miss. Including them as research participants offers a more complete picture of how systems actually perform.

Do AI agents replace human participants in research?
No. Humans remain the primary source of context, emotion, and motivation. Agent data adds a complementary layer of evidence, enriching and grounding what people already share.

What types of insight can AI agents provide?
Agents surface recurring points of friction, repeated errors, and escalation patterns. These signals highlight where workflows break down, offering evidence to support and extend human feedback.

What role do researchers play when analyzing agent data?
Researchers remain critical. They filter noise, validate patterns, address bias, and ensure agent activity is interpreted with proper human context. The shift broadens qualitative practice rather than replacing it.

What is a multi-actor system in research?
A multi-actor system is one where both humans and AI agents interact to complete tasks. Designing for these systems means studying the interplay between people and machines, ensuring both participants are accounted for.

How does including agents in research improve design?
By listening to both humans and agents, organizations can spot problems sooner, adapt faster, and create systems that reflect the true complexity of modern workflows.

How AI Ends the Depth vs. Breadth Research Tradeoff 

The transcripts pile up fast. Ten conversations yield sticky notes that cover the wall, each quote circled, each theme debated. By twenty, the clusters blur. At thirty, the team is saturated, sifting through repetition in search of clarity. The insights are still valuable, but the effort to make sense of them begins to outweigh the return. 

This has always been the tradeoff: go deep with a few voices, or broaden the scope and risk losing nuance. Leaders accepted that limitation as the cost of qualitative research. 

That ceiling is gone. 

The Ceiling Was Human Labor 

Generative research has always promised what numbers cannot capture: the story beneath the metric. But human synthesis is slow. Each new transcript multiplies complexity, until the process itself becomes the limiter. Teams stopped at 20 or 30 conversations not because curiosity ended, but because the hours to make sense of them did. Nuance gave way to saturation. 

Executives signed off on smaller studies and called it pragmatism. In truth, it was constraint. 

AI Opens the Door to Scale 

Large language models change the equation. Instead of weeks of sticky notes and clustering, AI can surface themes in hours. It highlights recurring ideas, connects outliers, and organizes insights without exhausting the team. The researcher’s role remains. Judgment still matters, but the ceiling imposed by human-only synthesis disappears. 

Instead of losing clarity as the number grows, each additional conversation now sharpens the signal, strengthening patterns, surfacing weak signals earlier, and giving leaders the confidence to act with richer evidence. 

Discovery Becomes Active 

The real breakthrough is not only scale, but also timing. With AI-enabled synthesis, insights emerge as the study unfolds. After the first dozen conversations, early themes are visible. Gaps in demographics or use cases show up while there is still time to adjust. By week two, the research is already feeding product decisions. 

Instead of waiting for a final report, teams get a living stream of discovery. Research shifts from retrospective artifact to active driver of strategy. 

Nuance at Speed 

For organizations, this ends the false binary. Depth and breadth no longer compete. A bank exploring new digital features can capture voices across demographics in weeks, not months. A health-tech team can fold dozens of patient experiences into the design cycle in real time. A software platform can test adoption signals across continents without sacrificing cultural nuance. 

The payoff is more than efficiency. It is confidence. When executives see both scale and nuance in the evidence, they act faster and with greater conviction. 

The New Standard 

The era of choosing between depth or breadth is behind us. AI frees research leaders from the constraints of small samples or limited perspectives. With AI as a synthesis partner, the standard shifts: hundreds of voices, interpreted with clarity, delivered at speed. 

For teams still focused on fixing the format problem, our previous piece, The $150K PDF That Nobody Reads, explores how static reports constrain research. Our next article examines an even bigger shift: what happens when your users are no longer only people.





FAQs

What is the depth vs. breadth tradeoff in qualitative research?
The depth vs. breadth tradeoff refers to the long-standing belief that teams must choose between conducting a small number of interviews with rich nuance (depth) or a larger sample with less detail (breadth). Human synthesis struggles to handle both simultaneously, forcing this choice.

How does AI change the depth vs. breadth tradeoff?
AI dissolves the tradeoff by enabling researchers to process hundreds of conversations quickly while still preserving nuance. Instead of diluting insight, scale strengthens pattern recognition and surfaces weak signals earlier.

Why has qualitative research been constrained to small sample sizes?
Human synthesis is time-consuming. After 20–30 interviews, transcripts become overwhelming, and important signals get lost in the noise. This labor bottleneck led leaders to view small samples as “pragmatic,” even though it was really a constraint of capacity.

Does AI replace the role of the researcher?
No. AI accelerates synthesis, but the researcher remains critical for judgment, interpretation, and ensuring context and nuance are applied correctly. AI acts as a partner that expands capacity rather than a replacement.

What is the impact of AI-enabled synthesis on decision-making?
With faster synthesis and preserved nuance, research insights emerge in real time rather than only in final reports. Leaders gain richer evidence earlier, which supports faster, more confident decisions.

What does this mean for the future of qualitative research?
The old tradeoff between depth and breadth is over. AI makes it possible to achieve both simultaneously, shifting the standard for research to hundreds of voices interpreted with clarity and delivered at speed.

Jeff Kirk Named Executive Vice President of Applied AI at Robots & Pencils 

From Alexa to Emma, Kirk brings two decades of AI breakthroughs that have reshaped industries. Now he’s powering Robots & Pencils’ rise in the intelligence age. 

Robots & Pencils, an AI-first, global digital innovation firm specializing in cloud-native web, mobile, and app modernization, today announced the executive appointment of Jeff Kirk as Executive Vice President of Applied AI. A seasoned technology leader with a career spanning global agencies, startups, and Fortune 100 enterprises, Kirk steps into this newly created role to accelerate the firm’s AI-first vision and unlock transformative outcomes for clients. As EVP of Applied AI, Kirk will lead the firm’s strategy and delivery of AI-powered and enterprise AI solutions across industries. 

Explore how Robots & Pencils blends science and design to build market leaders. 

Kirk’s track record speaks for itself, with AI breakthroughs that fueled customer engagement and business growth. He founded and scaled Moonshot, an intelligent digital products company later acquired by Pactera, where he spearheaded next-generation experiences in voice, augmented reality, and enterprise digitalization. At Amazon, he served as International Product & Technology Lead for Alexa, driving AI-powered personal assistant expansion to millions of households and users worldwide. Most recently, at bswift, Kirk led AI & Data as VP, delivering conversational AI breakthroughs with the award-winning Emma assistant and GenAI-powered EnrollPro decision support system. 

Across each of these roles runs a common thread. Kirk builds and scales innovations that transform how industries work, creating technologies that move from experimental to essential at breathtaking speed. 

“Jeff has been at the frontier of every major shift in digital innovation,” said Len Pagon, CEO of Robots & Pencils. “From shaping the future of eCommerce and mobile platforms at Brulant and Rosetta, to pioneering global voice AI at Amazon, to launching AI-driven customer experiences at bswift, Jeff has consistently delivered what’s next. He doesn’t just talk about AI. He builds products that millions use every day. With Jeff at the helm of Applied AI, Robots & Pencils is sharpening its challenger edge, helping clients leap ahead while legacy consultancies struggle to catch up. I’m energized by what this means for our clients and inspired by what it means for our people.” 

Across two decades, Kirk has built a reputation for translating complex business requirements into enterprise-grade AI and technology solutions that scale, stick, and generate measurable results. His entrepreneurial mindset and hands-on leadership style uniquely position him to help clients experiment, activate, and operate AI across their businesses. 

“Organizations and their workers are under pressure to innovate on behalf of customers while simultaneously learning to work with a new type of co-worker: artificial intelligence,” said Kirk. “The steps we take together to learn to work differently will lead to the most outsized innovation in our industries. I’m thrilled to join Robots & Pencils to push the boundaries of what’s possible with AI, to deliver outcomes that matter for our clients and their customers, and to create opportunities for our teams to do the most meaningful work of their careers.” 

Kirk began his career at Brulant and Rosetta, where he worked alongside Pagon and other members of the Robots & Pencils executive team, leading engineering and solutions architecture across content, commerce, mobile, and social platforms. His return to the fold marks both a reunion and a reinvention, positioning Robots & Pencils as a leader in applied AI at scale. 


The $150K PDF That Nobody Reads: From Research Deliverables to Living Systems 

A product executive slides open her desk drawer. Tucked between old cables and outdated business cards is a thick, glossy report. The binding is pristine, the typography immaculate, the insights meticulously crafted. Six figures well spent, at least according to the invoice. Dust motes catch the light as she lifts it out: a monument to research that shaped… nothing, influenced… no one, and expired the day it was delivered. 

It’s every researcher’s quiet fear. The initiative they poured months of work, a chunk of their sanity, and about a thousand sticky notes into becomes shelf-ware. Just another artifact joining strategy decks and persona posters that never found their way into real decisions. 

This is the way research has been delivered for decades, by global consultancies, boutique agencies, and yes, even by me. At $150K a report, it sounds extravagant. But when you consider the sheer effort, the rarity of the talent involved, and the stakes of anchoring business decisions in real customer insight, it’s not hard to see why leaders sign the check. 

The issue isn’t the value of the research. It’s the belief that insights should live in documents at all. 

Research as a Living System 

Now picture a different moment. The same executive doesn’t reach for a drawer. She opens her laptop and types: “What causes the most friction when ordering internationally?” 

Within seconds she’s reviewing tagged quotes from dozens of interviews, seeing patterns of friction emerge, even testing new messaging against synthesized persona responses. The research isn’t locked in a PDF. It’s alive, queryable, and in motion. 
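To make "queryable" concrete, here is a deliberately small sketch in Python. It assumes quotes have already been tagged by a researcher or a model, and it answers a question by plain tag matching. The Quote fields, tag names, and sample quotes are invented for illustration; a production system would more likely sit behind semantic search and natural-language Q&A.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quote:
    """A tagged interview excerpt (hypothetical structure)."""
    participant: str
    text: str
    tags: frozenset


def find_quotes(quotes, *required_tags):
    """Return the quotes that carry every requested tag."""
    wanted = set(required_tags)
    return [quote for quote in quotes if wanted <= quote.tags]


# Invented sample data standing in for a tagged research repository.
quotes = [
    Quote("P07", "Customs fees showed up only at checkout.",
          frozenset({"friction", "international", "pricing"})),
    Quote("P12", "Delivery estimates were clear and accurate.",
          frozenset({"international", "delivery"})),
    Quote("P03", "I could not find the reorder button.",
          frozenset({"friction", "navigation"})),
]
matches = find_quotes(quotes, "friction", "international")
```

The point of the sketch is the shape of the interaction: the evidence stays attached to the participant and the tags, so an answer can always be traced back to what someone actually said.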

This isn’t a fantasy. It’s the natural evolution of how research should work: not as one-time deliverables, but as a living system.

The numbers show why change is overdue. Eighty percent of Research Ops & UX professionals use some form of research repository, but over half reported fair or poor adoption. The tools are frustrating, time-consuming to maintain, and lack ownership. Instead of mining the insights they already have, teams commission new studies, resulting in an expensive cycle of creating artifacts that sit idle while decisions move on without them. 

It’s a Usability Problem 

Research hasn’t failed because of weak insights. It’s been constrained by the static format of reports. Once findings are bound in a PDF or slide deck, the deliverable has to serve multiple audiences at once, and it starts to bend under its own weight. 

For executives, the executive summary provides a clean snapshot of findings. But when the time comes to make a concrete decision, the summary isn’t enough. They have to dive into the hundred-page appendix to trace back the evidence, which slows down the moment of action. 

On the other hand, product teams don’t need summaries; they need detailed insights for the feature they’re building right now. In long static reports, those details are often buried or disconnected from their workflow. Sometimes they don’t even realize the answer exists at all, so the research goes unused, or even gets repeated. An insight that can’t be surfaced when it’s needed might as well not exist. 

The constraint isn’t the quality of the research. It’s the format. Static deliverables fracture usability across audiences and leave each group working harder than they should to put insights into play. 

Research as a Product 

While we usually view research as an input into products, research itself is a product too. And with a product mindset, there is no “final deliverable,” only an evolving body of user knowledge that grows in value over time. 

In this model, the researcher acts as a knowledge steward of the user insight “product,” curating, refining, and continuously delivering customer insights to their users: the executives, product managers, designers, and engineers who need insights in different forms and at different moments. 

Like any product, research needs a roadmap. It has gaps to fill, like user groups not yet heard from or behaviors not yet explored. It has features to maintain, like transcripts, coded data, and tagged insights. And it has adoption goals, because insights only create value when people use them. 

This approach transforms reports too. A static deck becomes just a temporary framing of the knowledge that already exists in the system. With AI, you can auto-generate the right “version” of research for the right audience, such as an executive summary for the C-suite, annotations on backlog items for product teams, or a user-centered evaluation for design reviews. 

Treating research as a product also opens the door to continuous improvement. A research backlog can track unanswered questions, emerging themes, and opportunities for deeper exploration. Researchers can measure not just delivery (“did we produce quality insights?”) but usage (“did the insights influence a decision?”). Over time, the research “product” compounds in value, becoming a living, evolving system rather than a series of static outputs. 

This new model requires a new generation of tools. AI can now cluster themes, surface patterns, simulate persona responses, and expose insights through natural Q&A. AI makes the recomposition of insights into deliverables cheap. That allows us to focus on how our users get the insights they need in the way they need them. 

From Deliverable to Product 

Treating research as a product changes the central question. It’s no longer, “What should this report contain?” but “What questions might stakeholders need to answer, and how do we make those answers immediately accessible?” 

When research is built for inquiry, every transcript, survey, and usability session becomes part of a living knowledge base that compounds in value over time. Success shifts too: not in the number of reports delivered, but in how often insights are pulled into decisions. A six-figure investment should inform hundreds of critical choices, not one presentation that fades into archives. 

And here’s the irony: the product mindset actually produces better reports as well. When purpose-built reports focus as much on their usage as the information they contain, they become invaluable components of the software production machine. 

Research itself isn’t broken. It just needs a product mindset and AI-based qualitative analysis tools that turn insights into a living system, not a slide deck. 

Next in the series, we look at two more shifts: AI removing the depth vs. breadth constraint, and the rise of agents as research participants.





FAQs

What is the problem with traditional research reports?
Traditional reports often serve as static artifacts. Once published, they struggle to meet the needs of multiple audiences and quickly become outdated, limiting their impact on real decisions.

Why is research often underutilized in organizations?
Research is underutilized because its insights are locked in formats like PDFs or decks. Executives, product teams, and designers often cannot access the right detail at the right time, so findings go unused or studies are repeated.

What does it mean to treat research as a product?
Treating research as a product means building a continuously evolving knowledge base rather than one-time deliverables. Insights are curated, updated, and delivered in forms that align with the needs of different stakeholders.

How does AI support this new model?
AI makes it possible to cluster themes, surface weak signals, and generate audience-specific deliverables on demand. This reduces maintenance overhead and ensures insights are always accessible when needed.

What role do researchers play in this model?
Researchers become knowledge stewards, ensuring the insight “product” is accurate, relevant, and continuously improved. Their work shifts from producing final reports to curating and delivering insights that compound in value over time.

How does this benefit organizations?
Organizations gain faster, more confident decision-making. A six-figure research investment can inform hundreds of decisions, rather than fading after a single presentation.

How Agentic AI Is Rewiring Higher Education 

A University Without a Nervous System 

Walk through the back offices of most universities, and you will see the challenge. Admissions runs on one platform, advising on another, learning management on a third, and academic affairs on a fourth. Each system functions, yet little connects them. Students feel the gaps when financial aid processing is delayed, academic records are incomplete, and support processes remain confusing and slow. Leaders feel it in the cost of complexity and the weight of compliance. 

Higher education institutions typically manage dozens of disconnected systems, with IT leaders facing persistent integration challenges that consume substantial staff time and budget resources while creating operational bottlenecks that affect both student services and institutional agility. 

For decades, CIOs and CTOs have been tasked with stitching these systems together. Progress came in patches, with integrations here and dashboards there. What emerged looked more like scar tissue than connective tissue. Patchwork technology blocks digital transformation in higher education, and leaders now seek infrastructure that can unify rather than just connect. 

The Rise of Agentic AI as Connective Tissue 

Agentic AI wires the university together. Acting like a nervous system, it routes information and triggers actions throughout the institution, coordinating workflows through intelligent routing and contextual decision-making. Unlike traditional automation that follows rigid rules, agentic AI systems can make contextual decisions, learn from outcomes, and coordinate across multiple platforms without constant human oversight. 

In practice, this means a transfer request automatically verifies transcripts through the National Student Clearinghouse, cross-references degree requirements in the SIS, flags discrepancies for staff to review, and updates student records, typically reducing processing time from 5-7 days to under 24 hours while maintaining accuracy. It means an advising system can recognize a retention risk, trigger outreach, and log the interaction without human staff piecing the puzzle together by hand. 
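The transfer flow described above can be sketched as a small pipeline. This is an illustration under stated assumptions, not a real integration: the transcript verification result is passed in as a boolean rather than fetched from any clearinghouse service, the degree requirements are a plain set rather than an SIS lookup, and every name here (TransferRequest, process_transfer, the status strings) is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TransferRequest:
    """A transfer request with pre-fetched inputs (hypothetical structure)."""
    student_id: str
    transcript_verified: bool   # stand-in for an external verification result
    completed_courses: set
    required_courses: set       # stand-in for degree requirements held in the SIS
    audit_log: list = field(default_factory=list)


def process_transfer(request):
    """Verify, cross-reference, then either flag for staff or update the record."""
    if not request.transcript_verified:
        request.audit_log.append("transcript not verified: routed to staff")
        return "needs_review"
    request.audit_log.append("transcript verified")

    missing = request.required_courses - request.completed_courses
    if missing:
        # Discrepancies are never resolved automatically; a person reviews them.
        request.audit_log.append(f"missing requirements: {sorted(missing)}")
        return "needs_review"

    request.audit_log.append("requirements satisfied: record updated")
    return "record_updated"


clean = TransferRequest("s-001", True, {"MATH101", "ENG102"}, {"MATH101"})
gap = TransferRequest("s-002", True, {"MATH101"}, {"MATH101", "BIO110"})
clean_status = process_transfer(clean)
gap_status = process_transfer(gap)
```

The design choice worth noticing is that the agent completes only the unambiguous path. Anything uncertain ends in a review state with a written reason attached.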

Agentic AI needs a strong foundation. That foundation is cloud-native infrastructure for universities that’s built to scale during peak demand, enforce compliance, and keep every action visible. With this base in place, universities move from pilot projects to production systems. The result is infrastructure that holds under pressure and adapts when conditions change. 

The Brain Still Decides 

A nervous system does not think on its own. It carries signals to the brain, where decisions are made. In the university context the brain is still human, made up of faculty, advisors, administrators, and executives. 

This is where the design philosophy matters. Agentic AI should amplify human capacity, not replace it. Advisors can spend more time in meaningful conversations with students because degree audits and schedule planning run on their own. CIOs can focus on strategic alignment because monitoring and audit logs are captured automatically. The architecture creates space for judgment, and it also creates space for human connection that strengthens the student experience. 

However, this transition requires careful change management. Faculty often express concerns about AI decision-making transparency, while staff worry about job displacement. Successful implementations address these concerns through clear governance frameworks, explainable AI requirements, and retraining programs that position staff as supervisors of AI rather than as people it replaces. 

What Happens When Signals Flow Freely 

When agentic systems begin to carry the load, universities see a different rhythm. Transcript processing moves with speed. Advising interactions trigger at the right time. Students find support without friction. Leaders gain resilience as workflows carry themselves from start to finish. What emerges is more than efficiency. It is an institution that thinks and acts as one, with every part working in concert to support the student journey. 

Designing for Resilience and Trust 

CIOs and CTOs recognize that orchestration brings new responsibility. Data must be structured and governed, with student information requiring FERPA-compliant handling throughout all automated processes. Agents must be observable and auditable. Compliance cannot live as a separate checklist; it must be a property of the system itself. AWS-native controls, from encryption to identity management, provide the levers to design with security as a default rather than a bolt-on. 

At the same time, leaders must design for operational trust. A nervous system functions only when signals are reliable. This requires real-time monitoring dashboards, clear escalation protocols when agents encounter exceptions, and audit trails that document every automated decision. 
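One way to picture an escalation protocol paired with an audit trail is a thin wrapper around each automated action. The sketch below is a simplified assumption of how that could look: the in-memory list stands in for durable, access-controlled storage, and the agent names, action names, and handlers are invented for the example.

```python
from datetime import datetime, timezone

audit_trail = []  # in-memory stand-in for a durable audit store


def record_decision(agent, action, status, detail=""):
    """Append one timestamped entry describing what an agent did."""
    entry = {
        "at": datetime.now(timezone.utc).isoformat(),
        "agent": agent,
        "action": action,
        "status": status,
        "detail": detail,
    }
    audit_trail.append(entry)
    return entry


def run_with_escalation(agent, action, handler, payload):
    """Run an automated step; on any exception, log it and hand off to staff."""
    try:
        result = handler(payload)
    except Exception as exc:  # broad on purpose: every failure escalates
        record_decision(agent, action, "escalated", repr(exc))
        return None
    record_decision(agent, action, "completed")
    return result


def flaky_handler(payload):
    """Hypothetical handler that hits an exception the agent cannot resolve."""
    raise ValueError("missing advisor assignment")


ok = run_with_escalation("advising-agent", "send_outreach",
                         lambda payload: "sent", {"student": "s-001"})
failed = run_with_escalation("advising-agent", "send_outreach",
                             flaky_handler, {"student": "s-002"})
```

Every call leaves exactly one entry behind, whether it succeeded or escalated, which is the property a monitoring dashboard would build on.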

The Next Chapter of Higher Education Infrastructure 

What is happening now is less about another wave of apps and more about a shift in the foundation of the institution. Agentic AI is beginning to operate as infrastructure. It connects the university’s digital systems into something coordinated and adaptive. 

The role of leadership is to decide how that nervous system will function, and what kind of human judgment it will amplify. Presidents, provosts, CIOs, and CTOs who recognize this shift will shape not only the student experience but the operational resilience of their institutions for years to come. 

For leaders evaluating agentic AI initiatives, three factors determine readiness.  

Institutions strong in all three areas see faster implementation and higher adoption rates. 

The institutions that succeed will be those that view agentic AI not as a technology project, but as an organizational transformation requiring new governance models, staff capabilities, and student engagement strategies. 

When the nervous system works, the signals move freely, and people do their best work. Students find support when they need it. Advisors focus on real conversations. Leaders see further ahead. That is the promise of agentic AI in higher education, not machines in charge, but machines carrying the load so people can do what only people can do. 

Join Us

Join us at ASU’s Agentic AI and the Student Experience conference. Contact us to book time with our leaders and explore how agentic AI can strengthen your institution. 

Request an AI Briefing.  


Beyond Wrappers: What Protocols Leave Unsolved in AI Systems 

I recently built a Model Context Protocol (MCP) integration for my Oura Ring. Not because I needed MCP, but because I wanted to test the hype: Could an AI agent make sense of my sleep and recovery data? 

It worked. But halfway through I realized something. I could have just used the Oura REST API directly with a simple wrapper. What I ended up building was basically the same thing, just with extra ceremony. 

As someone who has architected enterprise AI systems, I understand the appeal. Reliability isn’t optional, and protocols like MCP promise standardization. To be clear, MCP wasn’t designed to fix hallucinations or context drift. It’s a coordination protocol. But the experiment left me wondering: Are we solving the real problems or just adding layers? 

The Wrapper Pattern That Won’t Go Away 

MCP joins a long list of frameworks like LangChain, LangGraph, SmolAgents, and LlamaIndex, each offering a slightly different spin on coordination. But at heart, they’re all wrappers around the same issue: getting LLMs to use tools consistently. 

Take CrewAI. On paper, it looked elegant, with agents organized into “crews,” each with roles and tools. The demos showed frictionless orchestration. In practice? The agents ignored instructions, produced invalid JSON even after careful prompting, and burned days in debugging loops. When I dropped down to a lower-level tool like LangGraph, the problems vanished. CrewAI’s middleware hadn’t added resilience; it had hidden the bugs. 
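One way to keep invalid JSON from hiding inside a framework is to validate every model reply in plain code and feed the error back on failure. The sketch below is a minimal illustration, not how CrewAI or LangGraph work internally. The `ask_model` callable, the `tool` and `arguments` keys, and the retry wording are all assumptions made for the example.

```python
import json
from typing import Callable

# Hypothetical shape of a tool call; a real system would define its own schema.
REQUIRED_KEYS = {"tool", "arguments"}


def parse_tool_call(raw: str) -> dict:
    """Parse a model reply and check its shape, raising ValueError if it is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        raise ValueError(f"expected an object with keys {sorted(REQUIRED_KEYS)}")
    if not isinstance(data["arguments"], dict):
        raise ValueError("'arguments' must be an object")
    return data


def call_with_validation(
    ask_model: Callable[[str], str], prompt: str, max_attempts: int = 3
) -> dict:
    """Ask the model, validate the reply, and retry with the error appended."""
    errors = []
    for _ in range(max_attempts):
        raw = ask_model(prompt)
        try:
            return parse_tool_call(raw)
        except ValueError as exc:
            errors.append(str(exc))
            prompt = f"{prompt}\n\nYour last reply was rejected ({exc}). Reply with JSON only."
    # Fail loudly so the bug is visible instead of buried in middleware.
    raise RuntimeError(f"no valid tool call after {max_attempts} attempts: {errors}")
```

The point of the design is visibility: a reply either passes the check or produces an explicit error, so a debugging loop starts from a concrete message rather than a silent misfire.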

This isn’t an isolated frustration. Billions of dollars are flowing into frameworks while fundamentals like building reliable agentic systems remain unsettled. MCP risks following the same path. Standardizing communication may sound mature, but without solving hallucinations and context loss, it’s just more scaffolding on shaky foundations. 

What We’re Not Solving 

The industry has been busy launching integration frameworks, yet the harder challenges remain stubbornly in place: 

As CData notes, these aren’t just implementation gaps. They’re fundamental challenges. 

What the Experiments Actually Reveal 

Working with MCP brought a sharper lesson. The difficulty isn’t about APIs or data formats. It’s about reliability and security. 

When I connected my Oura data, I was effectively giving an AI agent access to intimate health information. MCP’s “standardization” amounted to JSON-RPC endpoints. That doesn’t address the deeper issue: How do you enforce “don’t share my health data” in a system that reasons probabilistically? 
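One partial answer is to keep the rule out of the prompt entirely and enforce it in ordinary, deterministic code that sits between the data and the model. The sketch below assumes a hypothetical `SharingPolicy` with invented field and destination names. None of it comes from the Oura API or the MCP specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SharingPolicy:
    """User-declared rules, checked in code rather than left to the model."""
    blocked_fields: frozenset = frozenset()
    allowed_destinations: frozenset = frozenset()


def redact_for_model(record: dict, policy: SharingPolicy) -> dict:
    """Drop blocked fields before the record is ever placed in model context."""
    return {k: v for k, v in record.items() if k not in policy.blocked_fields}


def authorize_outbound(destination: str, policy: SharingPolicy) -> None:
    """Refuse any tool call that would send data outside the allow-list."""
    if destination not in policy.allowed_destinations:
        raise PermissionError(f"sharing with {destination!r} is not permitted")


# Hypothetical field names, for illustration only.
policy = SharingPolicy(
    blocked_fields=frozenset({"heart_rate", "body_temperature"}),
    allowed_destinations=frozenset({"local_summary"}),
)
night = {"date": "2025-01-01", "sleep_score": 82, "heart_rate": 54}
safe_view = redact_for_model(night, policy)  # heart_rate never reaches the model
```

This does not make the model itself trustworthy. It only narrows what the model can see and where its tool calls can send data, which is why the deeper architectural question remains open.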

To be fair, there’s progress. Auth0 has rolled out authentication updates, and Anthropic has improved Claude’s function-calling reliability. But these are incremental fixes. They don’t resolve the architectural gap that protocols alone can’t bridge. 

The Evidence Is Piling Up 

The risks aren’t theoretical anymore. Security researchers keep uncovering cracks.

Meanwhile, fragmentation accelerates. Merge.dev lists half a dozen MCP alternatives. Zilliz documents the “Great AI Agent Protocol Race.” Every new protocol claims to patch what the last one missed. 

Why This Goes Deeper Than Protocol Wars 

The adoption curve is steep. Academic analysis shows MCP servers grew from around 1,000 early this year to over 14,000 by mid-2025. With $50B+ in AI funding at stake, we’re not just tinkering with middleware; we’re building infrastructure on unsettled ground. 

Protocols like MCP can be valuable scaffolding. Enterprises with many tools and models do need coordination layers. But the real breakthroughs come from facing harder questions head-on: 

These problems exist no matter the protocol. And until they’re addressed, standardization risks becoming a distraction. 

The question isn’t whether MCP is useful; it’s whether the focus on protocol standardization is proportional to the underlying challenges. 

So Where Does That Leave Us? 

There’s nothing wrong with building integration frameworks. They smooth edges and create shared patterns. But we should be honest about what they don’t solve. 

For many use cases, native function calling or simple REST wrappers get the job done with less overhead. MCP helps in larger enterprise contexts. Yet the core challenges, reliability and security, remain active research problems. 

That’s where the true opportunity lies. Not in racing to the next protocol, but in tackling the questions that sit at the heart of agentic systems. 

Protocols are scaffolding. They’re not the main event. 

Learn more about Agentic AI. 

Stop Measuring AI Success by Lines of Code: The Real ROI is in the Boring Stuff 

The headlines are hard to miss: “AI-powered code generation boosting developer velocity by 30%.” Lines of code written per hour skyrocketing. Teams shipping features faster than ever. 

Yet the most significant returns aren’t showing up in those flashy metrics. The real ROI is emerging in places far less glamorous: the work that usually gets postponed, rushed, or quietly skipped. 

The Quality Underground 

While much attention is placed on code generation speed, something more consequential is happening behind the scenes. AI is proving most valuable when it tackles the tedious but essential work developers often deprioritize. 

Test creation. Documentation updates. Boilerplate scaffolding. The quiet foundations of reliable software. 

When testing becomes easier, teams actually do it. When documentation updates itself, it actually stays current. Organizations using AI-augmented testing report 50% lower costs and 60% faster test cycles¹. That’s more than efficiency. It’s a shift in quality assurance discipline. 

A clear pattern is emerging: the less exciting the task, the greater the AI payoff. 

The Multiplier Effect 

This is where traditional measurements fall short. Counting lines of code tells us little about stability. Shipping features faster is less impressive if those features fail in production. 

By contrast, metrics like test coverage and documentation completeness tell a different story. They reveal AI as a speed accelerator and a quality multiplier. 

Some organizations are already seeing dramatic improvements, with test coverage climbing from 60% to 85%, documentation kept current for the first time in years, and edge cases automatically captured. 

The takeaway is straightforward. AI makes developers quicker, and it makes the software they build more reliable. 

The Tasks That Actually Matter 

Consider the flow of software development. Writing business logic is often the easy part. The heavier lift comes in the margins: building robust test suites, maintaining documentation, handling edge cases thoroughly. 

These are the tasks that are critical for quality, slow to complete, and frequently sacrificed under pressure. They are also the exact tasks where AI thrives. 

Take test generation. Creating comprehensive tests often takes longer than the code itself, demanding developers think through failures and integration scenarios. AI can analyze code patterns, detect gaps, and generate tests that human teams might overlook. The result is not just faster coverage, but broader and more consistent coverage. 

The Measurement Revolution 

This shift creates an opening to rethink how AI success is measured. Instead of tracking raw velocity, organizations are following quality indicators: 

These indicators surface AI’s true value: not simply producing more code but producing better software. 

The Compound Returns 

Quality improvements have a different kind of payoff: they compound. 

Faster code generation saves time today. Stronger test coverage prevents costly failures tomorrow. Automated documentation will reduce onboarding time next quarter. Better quality controls fuel faster iteration next year. 

Measured through this lens, AI’s impact becomes clearer. A 50% drop in production bugs delivers far greater financial benefit than a 50% increase in code generation speed. 
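Whether that comparison holds for a given team depends on its own numbers, so it is worth running the arithmetic. The sketch below uses invented inputs throughout: team cost, the share of time spent writing code, bug counts, and cost per bug are assumptions for illustration, not figures from any study.

```python
def speedup_savings(team_cost: float, coding_share: float, speedup: float) -> float:
    """Annual cost saved when code is produced (1 + speedup) times faster.

    Only the coding share of the team's time shrinks; everything else is unchanged.
    """
    coding_cost = team_cost * coding_share
    return coding_cost * (1 - 1 / (1 + speedup))


def bug_savings(bugs_per_year: int, cost_per_bug: float, reduction: float) -> float:
    """Annual cost avoided when production bugs fall by the given fraction."""
    return bugs_per_year * cost_per_bug * reduction


# Illustrative assumptions only: a $2M team that spends 25% of its time writing
# code, and 200 production bugs a year at $5,000 each.
faster_code = speedup_savings(team_cost=2_000_000, coding_share=0.25, speedup=0.5)
fewer_bugs = bug_savings(bugs_per_year=200, cost_per_bug=5_000, reduction=0.5)
print(f"50% faster coding: ${faster_code:,.0f} saved")
print(f"50% fewer bugs:    ${fewer_bugs:,.0f} saved")
```

Under these assumptions the speedup saves about a third of the coding budget, because coding is only part of the job, while the bug reduction halves the full failure cost. A team with cheap bugs and a high coding share would see a different balance, which is the reason to plug in real figures.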

The Quality Advantage 

Teams focusing here are building something rare: systematic quality improvement woven into the development process itself. 

Others may continue to compete on speed, but organizations that compete on reliability are building resilience. They’re lowering technical debt instead of accumulating it. They’re creating the conditions for sustainable experimentation. 

Over time, that advantage compounds into a moat that’s hard to cross. 

Reframing Success 

When the next report touts impressive AI coding velocity, a different question is worth asking: “What is happening to quality?” 

Because real AI transformation isn’t about developers typing faster. It’s about software that’s more dependable, because the unglamorous work is finally being done. 

Organizations that see this are measuring the right outcomes. They’re finding that the “boring” tasks create the most durable advantages. Those are often the ones that matter most when customers decide whose product they trust. 

Sources: 

  1. Unisys, ROI of Generative AI in Software Testing, 2024 

Beyond Story Points: Rethinking Software Engineering Productivity in the Age of AI 

Why traditional metrics fall short, and how modern frameworks like DORA and SPACE can guide better outcomes 

For years, engineering leaders have relied on familiar metrics to gauge developer performance: story points, bug counts, and lines of code. These measures offered a shared baseline, especially in Agile environments where estimation and output needed a common language. 

But in today’s AI-assisted world, those numbers no longer tell the full story. Performance isn’t just about volume or velocity. It’s about outcomes. Did the developer deliver the expected functionality, with the right quality, on time? That’s how we compensate today, and that’s still what matters. But how we measure those things must evolve.  

With tools like GitHub Copilot, Claude Code, and Cursor generating entire functions, tests, and documentation quickly, output is becoming less about what a developer types and more about what they model, validate, and evolve. 

The challenge for CIOs, CTOs, and SVPs of Engineering isn’t just adopting new tools. It’s rethinking how to measure effectiveness in a world where productivity is amplified by AI and complexity often hides behind automation. 

Why Traditional Metrics Break Down 

The future of measurement hinges on three categories: productivity, quality, and functionality. These have always been essential to evaluating engineering work. But in the AI era, we must measure them differently. That shift doesn’t mean abandoning objectivity; it means updating our tools. 

The problem isn’t that legacy metrics are useless. It’s that they’re easily gamed, misinterpreted, or disconnected from business value. 

At best, these metrics create noise. At worst, they drive harmful incentives, like rewarding speed over safety, or activity over alignment. 

Today’s AI-assisted workflows lack mature solutions for tracking whether functionality requirements, like epics and user stories, have been fully met. But new approaches, like multi-domain linking (MDL), are emerging to close that gap. Measurement is getting smarter, and more connected, because it has to. 

The Rise of Directional Metrics 

Modern frameworks like DORA and SPACE were built to address these gaps. 

DORA (DevOps Research and Assessment) focuses on: 

These measure delivery health, not just effort. They’re useful for understanding how efficiently and safely value reaches users. 
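As a rough illustration of delivery-health measurement, the sketch below derives the four commonly cited DORA metrics (deployment frequency, lead time for changes, change failure rate, and time to restore service) from a list of deployment records. The `Deployment` record and its field names are assumptions made for this example. Real pipelines would pull these timestamps from CI/CD and incident tooling.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime                  # first commit in the change
    deployed_at: datetime                   # when the change reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_summary(deployments: list[Deployment], window_days: int) -> dict:
    """Summarize delivery health over a reporting window of window_days."""
    if not deployments or window_days <= 0:
        raise ValueError("need at least one deployment and a positive window")
    failures = [d for d in deployments if d.failed]
    restore_times = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deployments_per_day": len(deployments) / window_days,
        "median_lead_time": median(d.deployed_at - d.committed_at for d in deployments),
        "change_failure_rate": len(failures) / len(deployments),
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }
```

Medians are used here instead of means so that one unusually slow change or long outage does not dominate the summary. That is a design choice for this sketch, not a requirement of the framework.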

SPACE (developed by researchers at GitHub, Microsoft Research, and the University of Victoria) considers: 

SPACE offers a more holistic view, especially in cross-functional and AI-assisted teams. It acknowledges that psychological safety, cross-team communication, and real flow states often impact long-term output more than individual commits. 

AI Complicates the Picture 

AI tools don’t eliminate the need for metrics; they demand smarter ones. When an LLM can write 80% of the code for a feature, how do we credit the developer? By the number of keystrokes? Or by their judgment in prompting, curating, and validating what the tool produced? 

But here’s the deeper challenge: What if that feature doesn’t do what it was supposed to? 

In AI-assisted workflows: 

Productivity isn’t just about output; it’s about fitness to purpose. Without strong traceability between code, tests, user stories, and epics, it’s easy for teams to ship fast but fall short of the business goal. 

Many organizations today struggle to answer a basic question: Did this delivery actually fulfill the intended functionality? 

This is where multi-domain linking (MDL) and AI-powered traceability show promise. By connecting user stories, requirements, test cases, design artifacts, and even user feedback within a unified graph, teams can use LLMs to assess whether the output truly matches the input. 
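To make the idea of a unified graph concrete, here is a minimal sketch that links stories, code, and tests, then flags stories with missing links. It assumes a hypothetical `kind:id` naming scheme for nodes, and it only checks that links exist. Judging whether the linked code actually fulfills the story is the harder step the paragraph above assigns to LLM-based assessment.

```python
from collections import defaultdict


def build_links(edges):
    """Build an undirected graph from (node, node) pairs such as
    ("story:101", "test:test_login"). The "kind:id" format is an assumption."""
    graph = defaultdict(set)
    for src, dst in edges:
        graph[src].add(dst)
        graph[dst].add(src)
    return graph


def coverage_gaps(stories, graph, required_kinds=("code", "test")):
    """Return {story: [missing kinds]} for stories lacking a required link."""
    gaps = {}
    for story in stories:
        linked_kinds = {node.split(":", 1)[0] for node in graph.get(story, ())}
        missing = sorted(set(required_kinds) - linked_kinds)
        if missing:
            gaps[story] = missing
    return gaps


edges = [
    ("story:101", "code:auth.py"),
    ("story:101", "test:test_login"),
    ("story:102", "code:export.py"),
]
graph = build_links(edges)
print(coverage_gaps(["story:101", "story:102", "story:103"], graph))
# {'story:102': ['test'], 'story:103': ['code', 'test']}
```

Even this crude check answers a question many teams cannot: which stories shipped with no linked test, and which have no linked code at all.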

And this capability unlocks more than just better alignment; it opens the door to innovation. AI-assisted development enables organizations to build more complex, interconnected, and adaptive systems than ever before. As those capabilities expand, so too must our ability to measure their economic value. What applications can we now build that we couldn’t before? And what is that worth to the business? 

That’s not a theoretical exercise. It’s the next frontier in engineering measurement. 

Productivity as a System, Not a Score 

The best engineering organizations treat productivity like instrumentation. No single number can tell you what’s working, but the right mix of signals can guide better decisions. That system must account for both delivery efficiency and functional alignment. High velocity is meaningless if the outcome doesn’t meet the requirements it was designed to fulfill. 

That means: 

Most importantly, it means aligning measurement to what matters: Did the product deliver value? Did it meet its intended function? Was the effort worth the outcome? Those are the questions that still define success and the ones our measurement frameworks must help answer. 

How to Start Rethinking Measurement 

If your metrics haven’t evolved alongside your tooling, here’s how to get started: 

AI is reshaping how software gets built. That doesn’t mean productivity can’t be measured. It means it must be measured differently. The leaders who shift from tracking motion to monitoring momentum will build faster, healthier, and more resilient engineering teams. 

Robots & Pencils: Measuring What Matters in an AI-Driven World 

At Robots & Pencils, we believe productivity isn’t a score; it’s a system. A system that must measure not just speed, but alignment. Did the output meet the requirements? Did it fulfill the epic? Was the intended functionality delivered? 

We help clients extend traditional measurement approaches to fit an AI-first world. That means combining DORA and SPACE metrics with functional traceability, such as linking code to requirements, outcomes to epics, and user stories to business results. 

Our secure, AWS-native platforms are already instrumented for this kind of visibility. And our teams are actively designing multi-domain models that give leaders better answers to the questions they care about most. 

As AI opens the door to applications we never thought were possible, our job is to help you measure what matters, including what’s newly possible. We don’t just help teams move faster. We help them build with confidence and prove it. 

Pilot, Protect, Produce: A CIO’s Guide to Adopting AI Code Tools 

How to responsibly explore tools like GitHub Copilot, Claude Code, and Cursor—without compromising privacy, security, or developer trust 

AI-assisted development isn’t a future state. It’s already here. Tools like GitHub Copilot, Claude Code, and Cursor are transforming how software gets built, accelerating boilerplate, surfacing better patterns, and enabling developers to focus on architecture and logic over syntax and scaffolding. 

The productivity upside is real. But so are the risks. 

For CIOs, CTOs, and senior engineering leaders, the challenge isn’t whether to adopt these tools—it’s how. Because without the right strategy, what starts as a quick productivity gain can turn into a long-term governance problem. 

Here’s how to think about piloting, protecting, and operationalizing AI code tools so you move fast, without breaking what matters. 

Why This Matters Now 

In a recent survey of more than 1,000 developers, 81% of engineers reported using AI assistance in some form, and 49% reported using AI-powered coding assistants daily. Adoption is happening organically, often before leadership even signs off. The longer organizations wait to establish usage policies, the more likely they are to lose visibility and control. 

On the other hand, overly restrictive mandates risk boxing teams into tools that may not deliver the best results and limit experimentation that could surface new ways of working. 

This isn’t just a tooling decision. It’s a cultural inflection point. 

Understand the Risk Landscape 

Before you scale any AI-assisted development program, it’s essential to map the risks: 

These aren’t reasons to avoid adoption. But they are reasons to move intentionally with the right boundaries in place. 

Protect First: Establish Clear Guardrails 

A successful AI coding tool rollout begins with protection, not just productivity. As developers begin experimenting with tools like Copilot, Claude, and Cursor, organizations must ensure that underlying architectures and usage policies are built for scale, compliance, and security. 

Consider: 

For teams ready to push further, Bedrock AgentCore offers a secure, modular foundation for building scalable agents with memory, identity, sandboxed execution, and full observability, all inside AWS. Combined with S3 Vector Storage, which brings native embedding storage and cost-effective context management, these tools unlock a secure pathway to more advanced agentic systems. 

Most importantly, create an internal AI use policy tailored to software development. It should define tool approval workflows, prompt hygiene best practices, acceptable use policies, and escalation procedures when unexpected behavior occurs. 

These aren’t just technical recommendations; they’re prerequisites for building trust and control into your AI adoption journey. 

Pilot Intentionally 

Start with champion teams who can balance experimentation with critical evaluation. Identify low-risk use cases that reflect a variety of workflows: bug fixes, test generation, internal tooling, and documentation. 

Track results across three dimensions: 

Encourage developers to contribute usage insights and prompt examples. This creates the foundation for internal education and tooling norms. 

Don’t Just Test—Teach 

AI coding tools don’t replace development skills; they shift where those skills are applied. Prompt engineering, semantic intent, and architectural awareness become more valuable than line-by-line syntax. 

That means education can’t stop with the pilot. To operationalize safely: 

When used well, these tools amplify good developers. When used poorly, they obscure problems and inflate false productivity. Training is what makes the difference. 

Produce with Confidence 

Once you’ve piloted responsibly and educated your teams, you’re ready to operationalize with confidence. That means: 

Organizations that do this well won’t just accelerate development; they’ll build more resilient software teams, teams that understand both what to build and how to orchestrate the right tools to do it. The best engineering leaders won’t mandate one AI tool or ban them altogether. They’ll create systems that empower teams to explore safely, evaluate critically, and build smarter together. 

Robots & Pencils: Secure by Design, Built to Scale 

At Robots & Pencils, we help enterprise engineering teams pilot AI-assisted development with the right mix of speed, structure, and security. Our preferred LLM provider, Anthropic, was chosen precisely because we prioritize data privacy, source integrity, and ethical model design, values we know matter to our clients as much as productivity gains. 

We’ve been building secure, AWS-native solutions for over a decade, earning recognition as an AWS Partner with a Qualified Software distinction. That means we meet AWS’s highest standards for reliability, security, and operational excellence while helping clients adopt tools like Copilot, Claude Code, and Cursor safely and strategically. 

We don’t just plug in AI; we help you govern it, contain it, and make it work in your world. From guardrails to guidance, we bring the technical and organizational design to ensure your AI tooling journey delivers impact without compromise. 

The Changing Role of the Computer Programmer 

How generative AI, cloud-native services, and intelligent orchestration are redefining the developer role and what it means for modern engineering teams 

In the early days of computing, programmers were indispensable because they were the only ones who could speak the language of machines. From punch cards to assembly language, software development was hands-on and highly specialized. Even as languages evolved, from COBOL and C to Java and C#, one thing stayed constant: developers wrote every line themselves. 

But that’s no longer true. And it hasn’t been for a while. 

Today, enterprise developers have access to an entirely new class of tools: generative AI, intelligent agents, and secure, cloud-native building blocks that reduce the need to write, or even see, large amounts of code. This shift isn’t superficial. It’s redefining the nature of software development itself. 

A recent Cornell University study reports that AI now generates at least 30% of Python code in major repositories in the U.S. And in enterprise environments at Google and Microsoft, 30–40% of new code is reported as AI-generated. That’s not a tweak in tooling. That’s a turning point in how software gets built. 

From Code to Composition 

For decades, the dominant paradigm in programming was one of writing: the developer’s job was to build logic from scratch, test it for accuracy, and ensure it could scale. As complexity grew, so did the stack of tools, including IDEs, frameworks, QA platforms, and versioning systems to support that work. 

But in the last few years, the developer toolbox has changed dramatically. Tools like GitHub Copilot, Claude Code, and Cursor now generate reliable code in real time. Entire modules can be scaffolded with a few prompts. Meanwhile, cloud platforms like AWS offer modular services that handle everything from authentication to observability out of the box. 

The result? Developers are shifting from authors to orchestrators. The value isn’t in how much code they can write; it’s in how well they can assemble, adapt, and govern systems that are increasingly AI-enabled, cloud-native, and composable. 

Productivity and Quality Are Improving, but Are We Building the Right Thing? 

AI-assisted development produces measurable gains. Code is being written faster. Boilerplate is disappearing. Bugs are easier to catch early. Even tests can be autogenerated. And yet, one challenge persists: verifying that the right thing is being built. 

It’s relatively straightforward to measure productivity (lines of code, lead time) and quality (bug rates, test coverage). But ensuring correct functionality, such as matching what’s shipped to product requirements, user stories, and epics, is harder than ever. Code generation tools accelerate output, but they don’t always ensure alignment with intent. 

That’s why the developer’s role is expanding. Understanding product vision, aligning technical architecture with business goals, and managing evolving requirements are becoming just as critical as technical skill. 

What Should Engineering Leaders Expect from Modern Developers? 

The pace of innovation in AI development tools is relentless. What a developer learns today may be outdated in a few months. This puts enormous pressure on engineering leaders to balance experimentation with sustainability. 

The safest path forward? Anchor learning and experimentation within robust cloud ecosystems. AWS, for instance, offers stable development trajectories, strong security guardrails, and continuous improvements that minimize disruption. The goal isn’t to chase every new tool; it’s to build foundational fluency and adapt deliberately. 

To succeed in this new environment, developers must think differently: 

Code Isn’t Dead, but It’s Being Delegated 

Let’s be clear: programming isn’t going away. But its role is evolving. The most impactful developers won’t be those who write the most lines of code; they’ll be the ones who know how to compose, configure, and coordinate intelligent systems with speed and confidence. 

They’ll use prompts, ontologies, and models as naturally as they once used loops and conditionals. They’ll know when to generate, when to review, and when to intervene. And they’ll be deeply embedded in outcome-oriented thinking. 

What Should Engineering Leaders Do Next? 

As the role of the programmer changes, so too must the systems that support them. This means: 

The ground is shifting. But for organizations willing to embrace this change, the opportunity is enormous: faster iteration, stronger alignment, and more resilient systems—built by developers who think in outcomes, not just code. 

Robots & Pencils: Redefining the Role, Rebuilding the Foundation 

At Robots & Pencils, we’ve spent over a decade helping organizations adapt to shifts in software architecture and engineering practice. As developers move from coding line-by-line to orchestrating intelligent, cloud-native systems, our role is to help them and their leaders make that leap with confidence. 

We design secure, cloud-native environments that empower developers to compose, not just code. With Anthropic as our preferred LLM provider and a track record of building modular, scalable solutions, we give teams the foundation they need to experiment responsibly, build faster, and deliver more value without compromising on security or quality. 

For teams rethinking what it means to “write software,” we bring the expertise, architecture, and systems design to make the next role of the developer a strength, not a risk.