What Is the 10 20 70 Rule for AI? A Clear Breakdown for 2026

The 10 20 70 rule for AI explains where the real work of building and deploying artificial intelligence actually happens, and the answer surprises most people who are new to the field. The rule breaks down AI project effort into three categories: 10% for algorithms and models, 20% for data pipelines and infrastructure, and 70% for the messy, unglamorous work of collecting, cleaning, and labeling data. Understanding this rule changes how you plan AI projects, allocate resources, and set realistic expectations.

This breakdown is a rule of thumb drawn from practical experience with enterprise AI deployments, and versions of it are referenced by data scientists and ML engineers at large technology companies. It's not a mathematical law, and the exact percentages vary from project to project, but it reflects a pattern that repeats itself across industries and team sizes.

Breaking Down the 10 20 70 Rule

The 10%: Algorithms and Model Building

The part most people romanticize — choosing a neural network architecture, tuning hyperparameters, writing the training loop — accounts for just 10% of the total effort in a real AI project. That number shocks beginners who spent weeks learning PyTorch or TensorFlow expecting model building to be the bulk of the work.

This doesn't mean the algorithm work is easy or unimportant. Getting a model to perform well still requires real skill. But pre-trained models, open-source frameworks, and AutoML tools have dramatically reduced the time this step takes. With a modest dataset, you can often fine-tune a pre-trained language model in an afternoon using tools that didn't exist five years ago.

What falls into the 10%:

Selecting model architecture (transformer, CNN, random forest, etc.)
Training and fine-tuning
Hyperparameter optimization
Evaluating model performance on test sets
Running experiments and comparing approaches

The point isn't that this work is trivial — it's that it rarely dominates the timeline the way beginners expect.
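To make the "evaluating model performance on test sets" item concrete, here is a minimal sketch of computing accuracy, precision, and recall for a binary classifier using only the standard library. The labels and predictions are made-up illustrative values, and `binary_metrics` is a hypothetical helper name, not part of any framework.

```python
from collections import Counter

def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, and recall for 0/1 labels."""
    counts = Counter(zip(y_true, y_pred))
    tp = counts[(1, 1)]  # predicted positive, actually positive
    fp = counts[(0, 1)]  # predicted positive, actually negative
    fn = counts[(1, 0)]  # predicted negative, actually positive
    tn = counts[(0, 0)]  # predicted negative, actually negative
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Hypothetical held-out labels and model predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In a real project these numbers would usually come from a library such as scikit-learn, but the point stands: the evaluation code itself is short, which is part of why this layer takes a small share of the schedule.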

The 20%: Data Infrastructure and Pipelines

The middle 20% covers the engineering work that connects raw data sources to your model and connects your model's outputs to real applications. This is the plumbing layer — not glamorous, but everything breaks without it.

What falls into the 20%:

Building data ingestion pipelines
Setting up feature stores
Writing ETL (extract, transform, load) processes
Creating model serving infrastructure
Building monitoring and alerting systems
Managing compute environments and deployment pipelines

This layer requires solid software engineering skills on top of ML knowledge. A data scientist who understands infrastructure is significantly more effective than one who can only work with clean, pre-loaded datasets. The 20% is also where a lot of production AI projects stall — the model works fine in a notebook but falls apart when someone tries to plug it into a real system.
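As a rough illustration of the ETL item above, the following sketch extracts rows from raw CSV text, transforms them, and loads them into an in-memory SQLite table. The `orders` schema, the sample rows, and the function names are all assumptions made for this example; a production pipeline would add scheduling, logging, and error handling.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a file or an API.
RAW_CSV = """order_id,amount,currency
1001, 19.99 ,usd
1002,5.00,USD
1003,,usd
"""

def extract(text):
    """Read raw CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Drop rows without an amount and normalize types and casing."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip incomplete records rather than guessing a value
        cleaned.append(
            (int(row["order_id"]), float(amount), row["currency"].strip().upper())
        )
    return cleaned

def load(rows, conn):
    """Write transformed rows into a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, amount REAL, currency TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute(
    "SELECT order_id, amount, currency FROM orders ORDER BY order_id"
).fetchall()
print(loaded)
```

Keeping extract, transform, and load as separate functions is one common design choice: each step can be tested on its own, which helps when the notebook version has to be plugged into a real system.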

The 70%: Data Work — Collection, Cleaning, and Labeling

The 70% is where most of the actual time goes, and it's the part most overlooked in courses, tutorials, and AI headlines. Real-world data is messy, inconsistent, incomplete, and often collected for purposes entirely different from what your AI project needs.

What falls into the 70%:

Finding and collecting relevant data
Cleaning and deduplicating records
Handling missing values
Labeling or annotating data for supervised learning
Standardizing formats across different data sources
Auditing data for bias and quality issues
Building and maintaining data documentation

The reason this dominates is straightforward: a model trained on bad data gives bad results, no matter how sophisticated the algorithm. Garbage in, garbage out is one of the oldest principles in computing, and it applies more forcefully to AI than almost anywhere else.
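A small sketch of what several of these tasks look like in code may help: deduplicating records, handling missing values, and standardizing formats across sources. The records, field names, and the two date formats are hypothetical, and real projects would more likely use pandas for this; the standard library is used here only to keep the example self-contained.

```python
from datetime import datetime

# Hypothetical records merged from two sources with different conventions.
records = [
    {"email": "Ana@Example.com ", "signup": "2024-03-01", "plan": "pro"},
    {"email": "ana@example.com", "signup": "03/01/2024", "plan": "pro"},
    {"email": "bo@example.com", "signup": "2024-04-15", "plan": None},
    {"email": "", "signup": "2024-05-02", "plan": "free"},
]

# Assumed date formats seen in the sources (ISO and US month/day/year).
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")

def parse_date(value):
    """Try each known format and return an ISO date string, or None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return None

def clean(rows, default_plan="unknown"):
    """Normalize keys, drop keyless and duplicate rows, fill missing plans."""
    seen = set()
    cleaned = []
    for row in rows:
        email = row["email"].strip().lower()
        if not email or email in seen:
            continue  # no usable key, or a duplicate of an earlier row
        seen.add(email)
        cleaned.append({
            "email": email,
            "signup": parse_date(row["signup"]),
            "plan": row["plan"] or default_plan,  # mark missing values explicitly
        })
    return cleaned

cleaned = clean(records)
for row in cleaned:
    print(row)
```

Even in this toy case, four input rows shrink to two usable ones, and every rule (which key identifies a record, which date formats to accept, how to mark a missing plan) is a decision someone has to make and document.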

Why This Rule Matters for AI Practitioners

Understanding the 10 20 70 rule shifts your mindset in three important ways.

It reframes what skills matter most. If you're learning AI to get hired or to build real products, spending 100% of your learning time on model architecture means you're prepared for 10% of the job. Learning data wrangling, SQL, data pipelines, and infrastructure work puts you in a much stronger position for actual project work.

It sets realistic timelines. AI projects consistently run over schedule because stakeholders underestimate the data work. A team that plans three weeks for data preparation and needs ten is not failing — they're running into the 70% problem without having anticipated it. Teams that know this rule plan accordingly.

It helps you ask better questions before starting. Before building a model, experienced practitioners ask: Do we have enough labeled data? Is it clean? Where does it live? How often does it change? These questions reflect an understanding that the data side determines whether the project succeeds.
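The timeline point above can be turned into quick arithmetic. This sketch assumes the 10/20/70 split holds exactly, which real projects will not, and scales a data-preparation estimate up to a full project plan. `plan_from_data_weeks` is a hypothetical helper written for this illustration.

```python
# The 10/20/70 split from the rule, expressed as fractions of total effort.
SHARES = {"algorithms": 0.10, "infrastructure": 0.20, "data": 0.70}

def plan_from_data_weeks(data_weeks):
    """Scale a data-work estimate up to a full project plan, in weeks."""
    total = data_weeks / SHARES["data"]
    plan = {layer: round(total * share, 1) for layer, share in SHARES.items()}
    plan["total"] = round(total, 1)
    return plan

planned = plan_from_data_weeks(3)   # what the team budgeted for data work
actual = plan_from_data_weeks(10)   # what the data work really took

print("planned:", planned)
print("actual: ", actual)
```

Under that assumption, three weeks of data work implies a project of a little over four weeks, while ten weeks of data work implies one of roughly fourteen. The gap between those two totals is the schedule overrun stakeholders end up seeing.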

How the 10 20 70 Rule Applies to Generative AI Specifically

Generative AI has shifted some of the percentages, but the core principle holds. Large language models like GPT-4 or Claude reduce the algorithm effort even further — you often skip training entirely and work with an existing model through an API. But the data and integration work expands in new ways.

In generative AI projects, the 70% now often includes:

Curating high-quality documents for retrieval-augmented generation (RAG) systems
Cleaning and formatting training data for fine-tuning
Evaluating model outputs for accuracy, tone, and safety
Building feedback loops to improve response quality over time
Writing and testing system prompts at scale

A company building a customer support chatbot powered by an LLM will spend a small slice of time choosing which model to use and a much larger slice preparing their knowledge base, testing edge cases, and integrating the system with their existing tools. The rule adapts, but the spirit stays the same.
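Testing edge cases for a chatbot like this usually means running a fixed set of questions through the system and checking the answers. The sketch below shows one simple way to do that with keyword checks. `fake_support_bot` is a stand-in for a real LLM call, and the questions, canned answers, and required phrases are all hypothetical.

```python
def fake_support_bot(question):
    """Stand-in for a real LLM call; the canned answers are hypothetical."""
    canned = {
        "How do I reset my password?":
            "Go to Settings > Security and choose Reset password.",
        "What is your refund window?":
            "Refunds are available within 30 days of purchase.",
    }
    return canned.get(
        question, "I'm not sure. Let me connect you with a human agent."
    )

# Each case lists phrases that a correct answer is expected to contain.
TEST_CASES = [
    {"question": "How do I reset my password?",
     "must_include": ["settings", "reset"]},
    {"question": "What is your refund window?",
     "must_include": ["30 days", "receipt"]},
    {"question": "Can I pay with cryptocurrency?",
     "must_include": ["human agent"]},
]

def evaluate(bot, cases):
    """Run every case through the bot and record which phrases are missing."""
    results = []
    for case in cases:
        answer = bot(case["question"]).lower()
        missing = [kw for kw in case["must_include"] if kw.lower() not in answer]
        results.append({
            "question": case["question"],
            "passed": not missing,
            "missing": missing,
        })
    return results

results = evaluate(fake_support_bot, TEST_CASES)
pass_rate = sum(r["passed"] for r in results) / len(results)

for r in results:
    print(r)
print(f"pass rate: {pass_rate:.2f}")
```

Keyword matching is a deliberately crude check; teams often layer human review or model-graded scoring on top. Either way, writing and maintaining the test cases is data work, not model work.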

Common Mistakes People Make When They Ignore This Rule

Mistake 1: Jumping straight to model selection

Teams that start by debating whether to use XGBoost or a neural network before understanding their data quality are solving the wrong problem first. The data determines which approach is even viable.

Mistake 2: Underbudgeting data labeling

Labeling data manually is time-consuming and expensive. A dataset of 50,000 labeled images for a computer vision project can take weeks of human effort. Teams that don't factor this in blow their budgets and timelines early.
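A back-of-the-envelope estimate shows why. The sketch below uses the 50,000-image figure from above; the 30 seconds per image, three annotators, six productive hours per day, and 10% review overhead are assumptions chosen for illustration, so substitute your own numbers.

```python
def labeling_estimate(n_items, seconds_per_item, annotators,
                      hours_per_day=6.0, review_fraction=0.1):
    """Estimate total labeling hours and calendar working days."""
    label_hours = n_items * seconds_per_item / 3600
    review_hours = label_hours * review_fraction  # spot checks and rework
    total_hours = label_hours + review_hours
    working_days = total_hours / (annotators * hours_per_day)
    return {
        "total_hours": round(total_hours, 1),
        "working_days": round(working_days, 1),
    }

# 50,000 images comes from the text; every other input is an assumption.
estimate = labeling_estimate(n_items=50_000, seconds_per_item=30, annotators=3)
print(estimate)
```

With these inputs the job comes to roughly 458 person-hours, or about 25 working days for a three-person team. That is five calendar weeks before any model training starts.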

Mistake 3: Treating data pipelines as an afterthought

Building a model in isolation and then figuring out how to connect it to real systems creates months of rework. The 20% infrastructure layer should be designed alongside the model, not after it.

Mistake 4: Assuming public datasets will substitute for domain-specific data

Public datasets like ImageNet or Common Crawl are useful for pre-training, but they rarely match the specific characteristics of real business data. A fraud detection model trained only on public data will likely miss the patterns specific to your transaction types.

What the 10 20 70 Rule Means for AI Education

Most AI courses — even good ones — spend the majority of their time on the 10%. You learn model architectures, loss functions, backpropagation, and training loops. These are genuinely important concepts. But learners who finish these courses and step into real AI roles often feel underprepared because the data and infrastructure work looks nothing like what they practiced.

The most effective AI learning paths deliberately cover all three layers:

Layer                | Skills to Build
10% (Algorithms)     | ML fundamentals, model training, evaluation metrics
20% (Infrastructure) | Data pipelines, MLOps, model deployment, monitoring
70% (Data)           | SQL, data cleaning, feature engineering, data labeling, EDA

If your current learning path only covers the top row, you're building for a fraction of the job.

For people who want structured, code-first learning that actually covers all three layers, Educative is worth a serious look. Their interactive courses let you write real code in the browser across ML fundamentals, data engineering, and practical AI applications, without the setup friction that slows most self-taught learners down.

How Teams Structure Work Around the 10 20 70 Rule

In well-run AI teams, the rule influences how roles are defined and how projects are staffed.

Data engineers own the 70% and 20%. They build pipelines, maintain data quality, and make sure the infrastructure is ready before a data scientist ever writes a training loop.

ML engineers straddle the 20% and 10%. They handle model training and deployment infrastructure, and they're the bridge between data science and production systems.

Data scientists focus on the 10% — experimentation, model selection, and performance analysis — but the best ones understand enough of the other layers to communicate clearly across the team.

Startups often have one or two people trying to cover all three layers, which is why small AI projects regularly run into trouble. The 70% doesn't shrink just because the team is smaller.

Applying the Rule to Your Own AI Projects

Before starting any AI project, work through these questions in order:

Data first:

What data do you have, and how was it collected?
How clean is it, and what preprocessing does it need?
Do you have labels, or do you need to create them?
How much data do you have, and is it enough for your use case?

Infrastructure second:

Where will the model run once it's trained?
How will predictions be served to users or downstream systems?
How will you monitor performance after deployment?

Model last:

Given your data and infrastructure constraints, which model type fits?
What performance metrics actually matter for this use case?
How will you validate the model before releasing it?

This order feels backward to people trained primarily on academic ML content, but it reflects how successful practitioners approach real projects.
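The "data first" questions can often be answered with a quick profile before any modeling starts. This is a minimal sketch, assuming the data fits in memory as a list of dicts; the sample rows, field names, and the `profile` helper are hypothetical.

```python
from collections import Counter

def profile(rows, label_field):
    """Summarize size, missing values, label balance, and exact duplicates."""
    if not rows:
        raise ValueError("no rows to profile")
    n = len(rows)
    fields = sorted({key for row in rows for key in row})
    # Share of rows where each field is absent, None, or an empty string.
    missing = {
        f: sum(1 for r in rows if r.get(f) in (None, "")) / n for f in fields
    }
    labels = Counter(
        r[label_field] for r in rows if r.get(label_field) not in (None, "")
    )
    # Rows are duplicates when every field and value matches exactly.
    unique_rows = {tuple(sorted(r.items())) for r in rows}
    return {
        "rows": n,
        "missing_rate": missing,
        "label_counts": dict(labels),
        "duplicate_rows": n - len(unique_rows),
    }

# Hypothetical support-ticket sample, for illustration only.
tickets = [
    {"text": "card declined twice", "label": "billing"},
    {"text": "card declined twice", "label": "billing"},
    {"text": "app crashes on login", "label": "bug"},
    {"text": "how do I export data", "label": None},
    {"text": "", "label": "bug"},
]

report = profile(tickets, label_field="label")
print(report)
```

A report like this answers "how much data," "how clean," and "do we have labels" in one pass, and it gives the infrastructure and model discussions something concrete to start from.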

Learning Resources That Cover All Three Layers

For the 70% (data work):

Kaggle's free data cleaning and feature engineering micro-courses
"Python for Data Analysis" by Wes McKinney (the pandas creator)
dbt documentation for understanding data transformation at scale

For the 20% (infrastructure):

Full Stack Deep Learning (free online course)
MLOps Zoomcamp by DataTalks.Club (free cohort-based course)
Google Cloud's free MLOps learning path

For the 10% (algorithms):

DeepLearning.AI's free short courses
Fast.ai practical deep learning course
Hugging Face's free NLP course

Combining these free resources with a structured platform covers a lot of ground. Educative's code-interactive format works particularly well for the infrastructure and data engineering layer, where you need to actually run and modify code to understand what's happening.

The Bottom Line on the 10 20 70 Rule for AI

The 10 20 70 rule for AI is a grounding principle that cuts through the hype around algorithms and reminds practitioners where work actually concentrates. Models are the visible face of AI, but data quality and infrastructure are the foundation everything else rests on.

If you're learning AI, let this rule shape your curriculum. If you're planning an AI project, let it shape your timeline and resourcing. And if you're building a team, let it shape how you hire and what skills you prioritize.

The teams that succeed with AI aren't always the ones with the most sophisticated models — they're the ones who got the data and infrastructure right first.