Data analysis reasoning at your fingertips
Xia Song, CVP, Microsoft 365 Engineering
As large language models (LLMs) and multimodal systems revolutionize information work by seamlessly navigating language, code, vision, and voice, a vast domain of structured, tabular data remains underutilized: Excel sheets, databases, CSV files, and Power BI reports often lack the natural intuitiveness of text or images. Picture a project manager urgently seeking quarterly performance insights scattered across multiple Excel worksheets and a badly formatted table inside a presentation. Some metrics are hidden in the middle of a worksheet, and some TSV files use commas instead of tabs, leaving little guidance on which data matters or how it connects. For those unskilled in data wrangling, this scenario can devolve into hours of frustration or missed insights. Yet armed with the know-how to manipulate data and harness code as a tool, one can swiftly unravel such complexity, extracting pivotal information and gaining a critical competitive edge.
But what if everyone had this capability readily available? That’s precisely the motivation behind the launch of Analyst, one of the first reasoning agents of its kind in M365 Copilot. Powered by our advanced reasoning model, post-trained from OpenAI’s o3-mini on analytic tasks, Analyst acts as your “virtual data scientist”. This reasoning-powered agent is built directly into Microsoft 365, placing sophisticated data analytics capabilities right at your fingertips.
The Era of Progressive Reasoning and Problem Solving
Traditional LLMs have historically jumped too quickly from problem to proposed solution, often failing to adjust to new complexities or to recover gracefully from mistakes. The advanced reasoning model behind Analyst changes this with a reasoning-driven, chain-of-thought (CoT) architecture derived from OpenAI’s o3-mini. Instead of providing quick answers, it progresses through problems iteratively by hypothesizing, testing, refining, and adapting. Analyst takes as many steps as necessary, adjusts to each complexity it encounters, and mirrors human analytical thinking.
With the capability to generate and execute code at every step within its reasoning trajectory, this model excels at incremental information gathering, constructing hypotheses, course correction, and automatic recovery from errors.
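The generate-execute-recover loop can be pictured with a minimal sketch. Here a plain `exec` call stands in for the model's controlled execution environment, and the steps and the error are fabricated for illustration; the real system's internals are not public:

```python
import traceback

def run_step(code: str, state: dict) -> tuple[bool, str]:
    """Execute one generated code step against shared state; report success or the traceback."""
    try:
        exec(code, state)  # the real agent runs generated code in a sandboxed environment
        return True, ""
    except Exception:
        return False, traceback.format_exc()

state = {}
steps = [
    "total = sum([12, 7, 23])",
    "avg = total / count",              # fails: 'count' was never defined
    "count = 3\navg = total / count",   # corrected step issued after seeing the error
]
for code in steps:
    ok, err = run_step(code, state)
    if not ok:
        print("step failed; issuing a corrected step")

print(state["avg"])  # 14.0
```

The key property this toy loop shares with the agent is that a failed step produces an error signal the next step can react to, rather than terminating the analysis.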
Real-World Data is Messy: A Case Study
Real-world data is messy. To illustrate the tangible benefits of the model’s reasoning capabilities, let's consider a practical challenge. Imagine you're presented with two datasets:
- Dataset A: An Excel file with multiple sheets containing data on world internet usage, where the critical data isn’t conveniently located at the top left but located somewhere in the middle of the second sheet.
- Dataset B: A .tsv file containing data on country statistics, presumably tab-delimited, but mis-formatted with commas as delimiters due to an export error.
The task at hand? Vague at best—something like, “Help identify and visualize interesting insights and connections between these two datasets”. Most traditional tools and existing models struggle here, either stalling entirely or delivering incomplete or incorrect analyses.
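Both pitfalls can be reproduced in miniature. This stdlib-only sketch locates a table that is offset from the top-left corner and sniffs the true delimiter of a mislabeled .tsv; all data, names, and layouts here are fabricated for illustration:

```python
import csv
import io

# Dataset A's pitfall, in miniature: the table starts mid-"sheet",
# below and to the right of unrelated header text.
sheet = [
    ["Quarterly report", "", ""],
    ["", "", ""],
    ["", "Country", "InternetPct"],   # real header, offset from cell A1
    ["", "Iceland", "99.0"],
    ["", "India", "46.0"],
]
# Locate the header row and column instead of assuming the data starts at (0, 0).
hdr_row = next(i for i, row in enumerate(sheet) if "Country" in row)
hdr_col = sheet[hdr_row].index("Country")
table = [row[hdr_col:] for row in sheet[hdr_row:]]

# Dataset B's pitfall: a ".tsv" that actually uses commas. Sniff the
# real delimiter rather than trusting the file extension.
raw = "Country,Population\nIceland,372000\nIndia,1400000000\n"
dialect = csv.Sniffer().sniff(raw, delimiters=",\t;")
stats = list(csv.reader(io.StringIO(raw), dialect))

print(table[0])           # ['Country', 'InternetPct']
print(dialect.delimiter)  # ','
```

The point is not the specific heuristics but that each assumption (where the table starts, what the delimiter is) is checked against the data instead of taken on faith.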
However, when faced with precisely this scenario, Analyst demonstrates remarkable resilience:
- It quickly identifies and navigates directly to the relevant data hidden in the middle of an Excel sheet.
- It explores the workbook first, discovering and listing the sheet names before diving in.
- It gracefully detects and corrects the delimiter issue in the second dataset.
- It progressively explores the data through iterative hypothesis-testing steps, constructing actionable insights without explicit guidance.
Through this progressive problem solving, the model handles these complexities smoothly and produces observations, insights, and visualizations on its own, demonstrating transformative potential in real-world analytic tasks.
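A toy version of that hypothesis-test-refine loop might look like the following: propose a relationship between two country metrics, test it numerically, and keep it only if the evidence is strong. The data, the metrics, and the 0.7 threshold are all fabricated for illustration:

```python
# Fabricated country metrics for a toy hypothesis test.
internet_pct = {"Iceland": 99.0, "India": 46.0, "Nigeria": 55.0, "Norway": 98.0}
gdp_per_capita = {"Iceland": 69000, "India": 2400, "Nigeria": 2100, "Norway": 89000}

countries = sorted(internet_pct)
x = [internet_pct[c] for c in countries]
y = [gdp_per_capita[c] for c in countries]

def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient, computed from first principles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothesis: internet adoption tracks wealth. Keep it only if the
# correlation is strong; otherwise go back and refine the hypothesis.
r = pearson(x, y)
verdict = "supported" if abs(r) > 0.7 else "refine hypothesis"
print(f"correlation = {r:.2f}: {verdict}")
```

Each pass through a loop like this either confirms a candidate insight or sends the analysis back for another hypothesis, which is the behavior the bullets above describe.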
How It Learns: Reinforcement Learning, Structured Reasoning, and Dynamic Code Execution
The effectiveness of the advanced reasoning model behind Analyst lies largely in reinforcement learning (RL). Built by post-training OpenAI’s o3-mini model, it employs advanced RL coupled with rule-based rewards to handle extensive reasoning paths, incremental information discovery, and dynamic code execution. We’ve observed that model performance consistently improves with more reinforcement learning compute during training, as well as more deliberate thinking during inference.
Analyst takes advantage of STEM and analytical reasoning optimizations introduced by models like o3-mini, excelling in structured data scenarios. It dynamically writes, executes, and verifies Python code within a controlled execution environment. This iterative cycle enables the model to continually adjust its strategy through course corrections and effective recovery from errors, emulating human problem-solving behavior closely.
Data Diversity and Robust Reward
Training data diversity is fundamental to post-training effectiveness. We built extensive datasets that encompass a wide range of real-world enterprise scenarios and structured data types:
- File types: Excel, CSV, TSV, JSON, JSONL, XML, SQLite databases, PowerPoint presentations, etc.
- Task types: straightforward numerical computations and visualizations through exploratory hypothesis construction and prediction.
The data points used in training were carefully constructed and selected to represent authentic complexity, preventing our model from overfitting to any particular task or benchmark. Recognizing that the “reward hacking” behavior often observed in reinforcement learning systems can erode model capability, we refined our reward systems by adopting more advanced and robust graders. This meticulous data selection, combined with rigorous task design, ensures genuine reasoning by incentivizing authentic exploration and accurate outcomes.
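As a rough illustration of what a rule-based grader can protect against, the sketch below rewards only a numerically correct final answer, so fluent or well-formatted but wrong responses earn nothing. The function, regex, and tolerance are hypothetical, not the production grader:

```python
import re

def rule_based_reward(answer: str, expected: float, tol: float = 1e-6) -> float:
    """Toy rule-based grader: 1.0 only if the final number in the answer
    matches the reference within tolerance; fluent prose alone earns 0."""
    # Strip thousands separators, then pull out every number in the text.
    nums = re.findall(r"-?\d+(?:\.\d+)?", answer.replace(",", ""))
    if not nums:
        return 0.0
    return 1.0 if abs(float(nums[-1]) - expected) <= tol else 0.0

print(rule_based_reward("After cleaning, the total is 1,234.5", 1234.5))  # 1.0
print(rule_based_reward("The answer is clearly correct.", 1234.5))        # 0.0
```

Grading the outcome rather than the style of the response is one simple way to deny reward to answers that merely look plausible.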
Results
The following benchmark results further underscore our model’s strengths on rigorous analytics-focused tasks, drawn from the public DABStep benchmark and internal M365 Copilot comparisons.
DABStep (Data Agent Benchmark for Multi-step Reasoning)
DABStep is a rigorous evaluation suite designed to test AI agents on real-world data analysis and reasoning tasks. It consists of 450+ structured and unstructured tasks, split into Easy and Hard categories. The Easy set involves simpler data extraction and aggregation, while the Hard set requires multi-step reasoning, integration of diverse datasets, and domain knowledge.
When benchmarked against DABStep, our model demonstrated overall state-of-the-art performance among known baselines. It showed excellent capability on both simple and complex tasks, with a substantial lead in the latter category.
Note: The M365 Copilot Analyst Model currently appears in the real-time leaderboard as an unvalidated anonymous submission labeled "Test1". We have contacted the DABStep team to update the submission and reflect this as the Analyst model coming from Microsoft.
Product Benchmarks
While academic benchmarks provide valuable insights, the true measure of a model’s value lies in its practical application within real-world scenarios. We benchmark our model’s performance on enterprise data analysis tasks across diverse business documents, including Excel spreadsheets, CSVs, PDFs, XMLs, and PowerPoint files, reflecting common analytical workflows within the M365 suite. We compare the specialized Analyst agent against the mainline M365 Copilot Chat (without deep reasoning), evaluating their accuracy in insight generation, data interpretation, and structured query execution across various enterprise file formats.
Analyst is powered by the advanced reasoning model which consistently outperforms existing approaches, demonstrating not only incremental but transformative improvements in real-world analytic reasoning.
The Road Ahead: Opportunities and Acknowledgements
We are genuinely excited about what Analyst can unlock by making advanced data analytics accessible to every Microsoft 365 user. Yet we remain conscious of current limitations, recognizing plenty of room for further improvement. Opportunities remain for more seamless integration across applications, improved interaction paradigms, and expanded model capabilities to handle an even broader spectrum of analytic scenarios.
We are committed to continuous improvement of Analyst and the underlying model, listening closely to user feedback, and refining our model and its integration with other products. Our ultimate goal remains clear: to empower users and organizations to achieve more, turning everyday information workers into empowered analysts with a “virtual data scientist” at their fingertips.
For additional details on Analyst, including rollout and availability for customers, please also check out our blog post highlighting reasoning agents within M365 Copilot and more.
Updated Mar 25, 2025