22 Oct 2025
How to Set Realistic GenAI Expectations and Measure Real Impact in Your Engineering Team
CTOs are stuck between board GenAI expectations and engineering team scepticism
The board wants to know why your engineering productivity hasn’t increased by 25-30% after implementing GitHub Copilot. Your CFO is questioning the ROI on AI tool licenses. Meanwhile, you’re caught between vendor promises of transformational gains and the reality of modest improvements that are hard to quantify.
This may be the most challenging time in history to be a CTO or CIO.
For decades, the expectation hierarchy was predictable: engineers believed more was possible than CTOs, who in turn felt more was possible than boards and CEOs. GenAI has completely reversed this dynamic. Now CTOs find themselves squeezed between boards with inflated expectations fueled by vendor marketing, and engineering teams who are concerned about unrealistic expectations and, in some cases, outright sceptical.
As GM Engineering at Shine Solutions, I’ve seen this scenario play out across many Australian organisations. Engineering leaders know GenAI has potential, but they lack the frameworks to set appropriate expectations with stakeholders and measure genuine impact in their specific context. They want to capitalise on the promise of GenAI safely and pragmatically. How do you harness GenAI’s potential without falling victim to unrealistic expectations? Where do you target first? Where are the early wins and the most significant productivity improvements?
The reality is more nuanced than the headlines suggest. While GenAI can deliver meaningful productivity improvements, the impact varies dramatically based on your specific engineering context. More importantly, measuring and optimising this impact requires a fundamental shift in how we think about engineering productivity.
Your GenAI results will differ from everyone else’s – because context changes everything
Here’s what the productivity studies don’t tell you: the impact of GenAI on your engineering team depends entirely on your specific context. Comparing your results to a Silicon Valley unicorn or a European enterprise makes about as much sense as comparing Sydney house prices to those in Adelaide—the fundamentals are completely different.
Consider two scenarios:
Scenario A: A principal engineer at an established fintech company, seven years deep in the same codebase. They architected the core systems, understand every domain nuance, and work primarily with proprietary frameworks and legacy integrations. When they adopt tools like GitHub Copilot, the productivity gain might be modest because the AI can struggle with the company’s unique patterns and domain-specific logic (although progress is being made with approaches like SDD). Most of their work involves “What” problems: deciding on architectural approaches, handling edge cases, and making complex tradeoffs within the business and domain, rather than figuring out how to code something.
Scenario B: That same principal engineer joins a startup building a modern SaaS platform with React, Node.js, and standard AWS services. Suddenly, GenAI becomes dramatically more effective. The AI has encountered thousands of similar patterns in its training data, and productivity gains can be substantially higher for certain tasks. Here, the engineer is tackling more “How” problems: implementing familiar patterns in unfamiliar territory, where the AI can guide the implementation.
The difference isn’t the engineer’s skill; it’s the context. As we’ve explored at Shine, GenAI helps engineers more with “How” problems (clear goals, unclear implementation) and less with “What” problems (unclear goals, straightforward implementation). Yet I regularly see Australian tech leaders benchmarking against organisations whose engineers are working on entirely different types of problems. Even teams within the same organisation will see very different results.
Coding time is only ~30% of the engineering productivity equation
Birgitta Böckeler, Distinguished Engineer at Thoughtworks and their full-time expert on AI-assisted software delivery for the past two years, provides the most comprehensive analysis I’ve seen of why the productivity gains from AI coding assistants are often overstated.
In her detailed exploration, published in The Pragmatic Engineer newsletter, Böckeler breaks down the mathematics that reveal why 50% productivity headlines don’t translate to real-world team performance. Her framework is based on GitHub’s research, which shows that developers spend only about 30% of their time actually coding—the rest is spent on collaboration, requirements analysis, debugging production issues, code reviews, and learning new technologies.

Böckeler’s calculation framework multiplies three key variables together: the share of time engineers spend coding, the proportion of coding work the assistant can meaningfully help with, and the time saved on that work.

The true productivity gain is often much closer to 10-15%, and can vary greatly based on your specific environment. Each part of the equation will be different depending on the business context, and each part can be addressed independently to maximise the benefits of GenAI tools.
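The arithmetic above can be sketched in a few lines. The 30% coding share comes from the GitHub research cited earlier; the other two inputs are illustrative assumptions (not Böckeler’s published figures), chosen to show how a “50% faster coding” headline shrinks to a team-level figure in the 10-15% range:

```python
# Back-of-the-envelope sketch of the team-level maths described above.
# coding_share (0.30) is from the GitHub research cited in the text;
# ai_applicable and speedup are illustrative assumptions only.

def net_productivity_gain(coding_share, ai_applicable, speedup):
    """Net team-level gain from an AI coding assistant.

    coding_share:  fraction of total engineering time spent coding
    ai_applicable: fraction of coding work the assistant meaningfully helps
    speedup:       fractional time saved on that AI-assisted work
    """
    return coding_share * ai_applicable * speedup

# A "50% faster coding" headline, applied to a realistic working week:
gain = net_productivity_gain(coding_share=0.30, ai_applicable=0.70, speedup=0.50)
print(f"Net team-level gain: {gain:.1%}")  # roughly 10%, not 50%
```

Changing any one input (for example, a team that spends 50% of its time coding, or a codebase where the assistant applies to most tasks) moves the result noticeably, which is exactly why context-free benchmarks mislead.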
Böckeler’s framework helps engineering leaders set realistic expectations by understanding their specific context rather than relying on generic productivity claims.
Many leaders conflate using GenAI for development with building GenAI features
Before diving into measurement, we need to distinguish between two fundamentally different applications of GenAI in engineering:
1. GenAI Tools for Engineering Velocity
These include tools such as GitHub Copilot, ChatGPT for debugging, and AI-powered code review assistants. They help engineers complete existing tasks more efficiently, but the end product is the same as it would have been without GenAI; it simply arrives sooner.
2. Building AI-Powered Features
This involves integrating GenAI capabilities into your products, e.g. workflow automation, chatbots, recommendation engines, or automated content generation.
The measurement approaches, investment requirements, and expected outcomes differ significantly. Many organisations confuse the two, leading to muddled strategies and unclear success metrics.
You can’t measure the impact of AI on engineering until you’ve first measured engineering itself
“AI Lines of Code” will fail for the same reasons “Lines of Code” failed
I’ve seen Australian companies implement KPIs that track the “percentage of code written by AI” or “AI-generated pull requests”. These metrics fail for the same reasons that counting lines of code has never been an effective measure of developer productivity: some of the best engineers remove code, refactor for simplicity, or solve problems with elegant, minimal solutions. AI code metrics create the same perverse incentives. For example:
- Engineers inflate AI usage by applying it to tasks that could be completed faster manually
- Teams avoid complex refactoring or architectural improvements that reduce AI assistance
- Code quality degrades as engineers prioritise AI-generated volume over thoughtful solutions

Every metric should have a complementary measure to catch adverse effects.
As with the “Lines of Code” example, this is crucial because focusing on a single metric in isolation can inadvertently cause harm elsewhere, and this applies to any metric. For example, I can easily increase your deployment frequency by asking your developers to work weekends. By pairing primary metrics with complementary ones, organisations gain a holistic view of performance and prevent unintended side effects from a narrow focus.
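One lightweight way to enforce this pairing is to make it impossible to report a primary metric without its guardrail. This is a minimal sketch; the metric names, pairings, and values are illustrative assumptions, not a standard:

```python
# A minimal sketch of pairing each primary metric with a complementary
# "guardrail" metric, so a gain in one can't quietly hide damage in another.
# All names and values below are illustrative, not prescribed by any framework.

GUARDRAILS = {
    "deployment_frequency": "change_failure_rate",  # shipping faster shouldn't break more
    "ai_assisted_commits": "post_merge_defects",    # AI volume shouldn't degrade quality
    "cycle_time": "weekend_work_hours",             # speed shouldn't come from burnout
}

def review_metric(primary, metrics):
    """Report a primary metric only alongside its guardrail value."""
    guardrail = GUARDRAILS[primary]
    return f"{primary}={metrics[primary]} (guardrail {guardrail}={metrics[guardrail]})"

snapshot = {
    "deployment_frequency": 12, "change_failure_rate": 0.08,
    "ai_assisted_commits": 40, "post_merge_defects": 3,
    "cycle_time": 2.5, "weekend_work_hours": 0,
}
print(review_metric("deployment_frequency", snapshot))
```

The design point is that the pairing lives in the reporting layer, so a dashboard or monthly review physically cannot show deployment frequency without change failure rate beside it.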
Use proven industry frameworks – don’t reinvent engineering measurement
The DX Core 4 framework is an excellent starting point, providing “a unified framework for measuring developer productivity that encapsulates DORA, SPACE, and DevEx” through four key dimensions:
- Speed: How quickly can teams deliver value?
- Effectiveness: Developer Experience Index (DXI) – how well do processes support engineers?
- Quality: System stability and change failure rates
- Impact: Time spent on new capabilities vs. maintenance
This framework is particularly relevant for Australian organisations because it:
- Provides industry benchmarks for meaningful comparison
- Balances output metrics with developer experience
- Avoids the gamification pitfalls of individual-level measurement
- Translates technical metrics into business language
You can start today, and you can start small
You don’t need to implement industry best-practice measures from day one. Most organisations are very immature in this area; even a basic set of measures will put you ahead of the majority, provided they represent a balanced scorecard. Pick a subset of these measures to get started – you can collect them in a very low-cost way, such as a monthly developer survey.
The path forward: Realistic expectations, better measurement, sustained improvement
The GenAI productivity revolution isn’t arriving as a single transformative moment—it’s emerging as a series of incremental improvements that compound over time. Australian engineering leaders who succeed will be those who resist the pressure to chase headline-grabbing productivity gains and instead focus on understanding their specific context, measuring systematically, and optimising continuously.
Start by establishing baseline measurements using proven frameworks, then introduce AI tools thoughtfully while tracking their actual impact on your team’s productivity dimensions. Set realistic near-term goals whilst laying the right foundations to exploit the longer-term potential of AI. Remember that a 13% improvement in cycle time (for example), sustained across your entire engineering organisation, represents significant business value—even if it doesn’t match the vendor marketing materials.
The goal isn’t to prove GenAI delivers miraculous productivity gains. The goal is to understand how AI tools can genuinely help your team in your specific context, measure that impact accurately, and communicate those results transparently to stakeholders. In doing so, you’ll not only maximise the value of your AI investments but also build the credibility needed to navigate future technology adoption decisions with confidence.
Your board may have started with unrealistic expectations, but with the right measurement approach, you can turn that initial enthusiasm into sustained support for genuine engineering improvements.
What’s your experience with GenAI productivity measurement? I’d love to hear from other Australian engineering leaders who are navigating this challenge. Connect with me on LinkedIn or reach out to discuss how we can share learnings across the community.
Navigating this landscape is complex, and you don’t have to do it alone. If you’d like expert advice on establishing a measurement framework and turning the promise of GenAI into real, quantifiable impact for your team, please get in touch with us at Shine Solutions. We can help.