Machine Learning & Data Analysis Project

IPL Data Intelligence

A comprehensive deep-dive into Indian Premier League deliveries data — uncovering scoring patterns, phase dynamics, and strategic insights through statistical modelling and machine learning.

260K+ Deliveries Analysed
10 Analytical Objectives
11 Visualisations

What This Project Is

This project applies the full data science pipeline — from raw cleaning to statistical inference — on the IPL deliveries dataset. It answers questions teams, analysts, and fans have always debated: which phase of an innings matters most? Do death overs really score more? Who are the true match-winners?

Every chart, model, and test is backed by the actual ball-by-ball data, making the conclusions statistically grounded rather than anecdotal.

🐼 Pandas Data Wrangling
🔢 NumPy Numerical Computing
📊 Matplotlib Visualisation
🎨 Seaborn Statistical Plots
🤖 Scikit-learn Machine Learning
🧮 SciPy Statistical Testing

The Data Behind the Numbers

Dataset
deliveries.csv

Ball-by-ball IPL delivery records spanning multiple seasons. Every row represents a single delivery — capturing runs scored, wickets, extras, dismissal types, batsmen, bowlers, and match context.

Records
260,920

Individual deliveries across all IPL matches in the dataset

Features
17

Columns: match_id, inning, over, ball, batter, bowler, runs, extras, wickets & more

Innings Analysed
1 & 2

Super overs excluded to keep stats clean and accurate

Powerplay · Overs 1–6 · 15.40 runs/over
Middle · Overs 7–15 · Consolidation Phase
Death · Overs 16–20 · 16.77 runs/over

Analytical Objectives

01

Data Cleaning & EDA

Loading, inspecting, cleaning and preprocessing the raw deliveries data before any analysis begins.

📥 Load CSV & inspect shape
🔍 Head, Info, Describe
🩹 Handle missing values
🗑️ Remove duplicates
🔧 Fix data types & over index
🏷️ Add phase labels
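
The cleaning steps above could be sketched roughly as follows. This is a minimal illustration on a tiny synthetic frame rather than the project's actual code: the column names (`over`, `total_runs`, `dismissal_kind`), the 0-indexed raw `over` values, the "not out" fill label, and the helper name `clean` are all assumptions.

```python
import pandas as pd

# Tiny synthetic stand-in for deliveries.csv; column names are assumptions.
raw = pd.DataFrame({
    "match_id":       [1, 1, 1, 1, 1, 1],
    "inning":         [1, 1, 1, 1, 1, 3],   # inning 3 stands in for a super over
    "over":           [0, 0, 6, 15, 19, 0], # assumed 0-indexed in the raw file
    "ball":           [1, 1, 2, 3, 4, 1],
    "total_runs":     [4, 4, 1, 6, 2, 1],
    "dismissal_kind": [None, None, "caught", None, None, None],
})

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper mirroring the cleaning checklist."""
    out = df.copy()
    # Handle missing values: no dismissal recorded means no wicket fell.
    out["dismissal_kind"] = out["dismissal_kind"].fillna("not out")
    # Remove exact duplicate rows.
    out = out.drop_duplicates()
    # Keep regular innings only (super overs excluded).
    out = out[out["inning"].isin([1, 2])].copy()
    # Fix data type and shift the over index to 1..20.
    out["over"] = out["over"].astype(int) + 1
    # Add phase labels: overs 1-6, 7-15, 16-20.
    out["phase"] = pd.cut(
        out["over"],
        bins=[0, 6, 15, 20],
        labels=["Powerplay", "Middle", "Death"],
    )
    return out.reset_index(drop=True)

deliveries = clean(raw)
print(deliveries[["over", "total_runs", "phase"]])
```

`pd.cut` uses right-closed bins by default, so over 6 falls in Powerplay and over 16 in Death, matching the phase boundaries used on this page.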
02

Univariate Analysis

Examining individual variables in isolation — run distribution per delivery and breakdown of dismissal types across the dataset.

Histogram of total_runs & Dismissal Countplot
The vast majority of deliveries yield 0 or 1 run, which underlines how much dot balls matter in T20. "Caught" dominates dismissals (8,053), followed by "Bowled" (2,204), emphasising the value of attacking field placements and tight bowling lines.
03

Bivariate Analysis

Exploring relationships between two variables — how run rate evolves over the innings and which batsmen dominate across all IPL matches.

Line Plot: Over vs Average Runs & Top 10 Batsmen
Run rate dips in the early middle overs (overs 7–8), likely as the field spreads after the Powerplay and new batsmen settle, before climbing steadily through the death. V Kohli leads all-time runs at 8,004, nearly 1,300 runs clear of second-placed S Dhawan.
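
One plausible way to build the over-by-over average is to total the runs in each over first and then average those totals by over number. The sketch below assumes columns named `match_id`, `inning`, `over` and `total_runs`, and groups per innings; the page does not state whether its figures are aggregated per innings or per match, so treat that choice as an assumption.

```python
import pandas as pd

# Synthetic ball-by-ball rows; column names are assumptions.
balls = pd.DataFrame({
    "match_id":   [1, 1, 1, 1, 2, 2, 2, 2],
    "inning":     [1, 1, 1, 1, 1, 1, 1, 1],
    "over":       [1, 1, 2, 2, 1, 1, 2, 2],
    "total_runs": [4, 2, 0, 1, 1, 1, 6, 3],
})

# Step 1: runs scored in each individual over.
over_totals = (
    balls.groupby(["match_id", "inning", "over"], as_index=False)["total_runs"].sum()
)

# Step 2: average those totals by over number across all innings.
avg_by_over = over_totals.groupby("over")["total_runs"].mean()
print(avg_by_over)
```

Averaging ball-level `total_runs` directly would instead give runs per delivery, which is a different quantity from runs per over.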
04

Multivariate Analysis

Analysing correlations across multiple numeric features simultaneously, and visualising the relationship between balls faced and total runs for elite batsmen.

Correlation Heatmap & Balls Faced vs Total Runs Scatter
total_runs and batsman_runs show near-perfect correlation (0.98). This is expected, since total_runs is batsman_runs plus extras, and it confirms that batting dominates scoring. The scatter plot clearly separates elite volume scorers: V Kohli sits alone in the top-right, combining maximum balls faced with maximum runs.
05

Outlier Detection & Removal

Applying the IQR method on runs-per-over (not raw ball-level) data to identify and remove extreme overs before regression modelling.

Boxplot Before/After + Distribution Comparison
369 outlier overs were removed: overs with anomalously high or low run totals, for example injury interruptions or extraordinary hitting. Post-removal, the distribution is cleaner and the regression model trains on representative data.
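
The IQR rule on per-over totals can be sketched as below. The multiplier `k = 1.5` is the conventional default and an assumption here, since the page does not state the fence used; the function name `remove_iqr_outliers` and the sample values are illustrative only.

```python
import pandas as pd

def remove_iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is an assumed default."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values >= lower) & (values <= upper)]

# Synthetic runs-per-over totals, with one extreme over (36 runs).
runs_per_over = pd.Series([6, 7, 8, 8, 9, 10, 11, 36])
kept = remove_iqr_outliers(runs_per_over)
print(f"removed {len(runs_per_over) - len(kept)} of {len(runs_per_over)} overs")
```

Applying the rule to per-over totals rather than individual deliveries matters: at ball level most values are 0 or 1, so nearly every boundary would be flagged as an outlier.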
06

Linear Regression

Modelling the relationship between over number and runs scored per over using simple linear regression with train-test evaluation.

Model: Linear Regression
Coefficient: +0.1198
MSE: 40.98
R² Score: 0.0106
Regression Line + Residual Plot
The positive coefficient (0.12) indicates that runs per over increase as the innings progresses. The low R² (0.0106, about 1% of variance explained) reflects that "over number" alone can't predict scoring: pitch conditions, batting lineup, and match situation all add variance that shows up in the residuals.
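
A simple regression of runs per over on over number, with a train-test split, might look like the sketch below. The data is synthetic (a weak upward trend buried in noise) and does not reproduce the page's numbers; the 80/20 split, the random seeds, and the noise level are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: weak upward trend in runs per over plus heavy noise.
rng = np.random.default_rng(0)
overs = rng.integers(1, 21, size=2000)               # over numbers 1..20
runs = 7.0 + 0.12 * overs + rng.normal(0.0, 6.0, size=2000)

X = overs.reshape(-1, 1)                             # sklearn expects a 2-D feature matrix
X_train, X_test, y_train, y_test = train_test_split(
    X, runs, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

slope = float(model.coef_[0])
mse = mean_squared_error(y_test, pred)
r2 = r2_score(y_test, pred)
print(f"slope={slope:.4f}  MSE={mse:.2f}  R2={r2:.4f}")
```

A small positive slope together with an R² near zero is what this kind of data produces: the trend exists, but most of the over-to-over variation is unrelated to over number.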
07

Hypothesis Testing

Using Welch's t-test (independent samples, unequal variances) to test whether Death overs score significantly more runs per over than Powerplay overs.

H₀: Mean runs/over in Powerplay = Mean runs/over in Death overs
Verdict: REJECT H₀  ·  p < 0.000001  ·  t = −10.707
Boxplot: Powerplay vs Death Overs
Death overs show a higher median and wider spread, reflecting aggressive hitting and higher risk.
Average Runs per Over: Phase Comparison
Death (16.77) clearly outscores Powerplay (15.40) on average, a difference the t-test finds statistically significant.
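
Welch's test is available in SciPy as `scipy.stats.ttest_ind` with `equal_var=False`. The sketch below uses small made-up samples, not the project's data, and derives a one-sided p-value by hand for the alternative "Powerplay mean is lower than Death mean".

```python
import numpy as np
from scipy import stats

# Made-up runs-per-over samples for the two phases (illustrative only).
powerplay = np.array([12, 15, 14, 16, 13, 17, 15, 14], dtype=float)
death = np.array([18, 10, 22, 16, 25, 12, 20, 19], dtype=float)

# equal_var=False selects Welch's t-test, which does not assume equal variances.
t_stat, p_two_sided = stats.ttest_ind(powerplay, death, equal_var=False)

# One-sided p-value for H1: mean(powerplay) < mean(death).
p_one_sided = p_two_sided / 2 if t_stat < 0 else 1 - p_two_sided / 2

print(f"t={t_stat:.3f}  two-sided p={p_two_sided:.4f}  one-sided p={p_one_sided:.4f}")
```

The sign of t depends on argument order: with Powerplay passed first, a negative statistic means the Death sample has the higher mean, which is consistent with the negative t reported above.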
08

Advanced Analysis

Four deep-dive charts: elite batsmen by runs, top wicket-takers, most aggressive boundary hitters, and the most economical bowlers in IPL history.

Top Batsmen · Top Bowlers
V Kohli leads total runs (8,004), while YS Chahal leads wicket-takers with 213 wickets. These charts highlight the most prolific performers in IPL history.
Boundaries · Economy Rates
V Kohli also tops the boundary count (979). Bowlers like Sohail Tanvir (6.23 runs per over) and A Chandila (6.28) show exceptional economy in a high-scoring format.
+

Phase & Distribution Extras

Additional visualisations exploring run distribution across match phases, boundary contribution to total scoring, and phase-wise run averages.

Runs Distribution by Match Phase
Middle overs produce the highest aggregate runs, as expected given that the phase spans 9 overs versus 6 for the powerplay and 5 for the death.
Boundary Contribution to Total Runs
59.9% of all runs come from boundaries, underscoring T20's boundary-or-bust nature.
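
The boundary share is a simple ratio: runs from 4s and 6s divided by all runs. The sketch below assumes a `batsman_runs` column and treats every 4 or 6 off the bat as a boundary; whether the page's 59.9% uses batsman runs or total runs as the denominator is not stated, so that choice is an assumption.

```python
import pandas as pd

# Synthetic deliveries; the column name is an assumption.
balls = pd.DataFrame({"batsman_runs": [0, 1, 4, 6, 2, 4, 0, 1, 1, 1]})

# Treat every 4 or 6 off the bat as a boundary (all-run fours are not separated out).
is_boundary = balls["batsman_runs"].isin([4, 6])

boundary_runs = balls.loc[is_boundary, "batsman_runs"].sum()
all_runs = balls["batsman_runs"].sum()
boundary_share = 100 * boundary_runs / all_runs

print(f"{boundary_share:.1f}% of runs came from boundaries")
```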
Average Runs by Match Phase
Middle overs peak at an average of 137.8 runs per match, driven by the number of overs in the phase rather than by a higher scoring rate.

Key Findings

01
📈

Death Overs Dominate Scoring

Overs 16–20 average 16.77 runs/over vs 15.40 in Powerplay, a statistically significant difference under Welch's t-test (p < 0.000001). On average, the final five overs are the highest-scoring stretch of an IPL innings.

02
🏏

V Kohli is the Undisputed Run Machine

Kohli leads all IPL batsmen with 8,004 total runs, nearly 1,300 ahead of S Dhawan. He also leads in boundaries (979), making him the most prolific volume scorer in the dataset.

03
🎯

Caught is King of Dismissals

Of roughly 13,000 wickets in the dataset, 62% are catches. Bowled accounts for just 17%. In T20 cricket, wickets come far more often from field placement and induced edges than from clean-bowling batsmen.

04
💥

Boundaries Drive 59.9% of All Runs

Nearly 60% of all runs scored in the dataset come from 4s and 6s. Teams with boundary-hitting specialists have a structural advantage that running between the wickets alone cannot offset.

05
📉

Run Rate Dips in Early Middle Overs

Overs 7–8 consistently show a run-rate dip as new batsmen settle after the Powerplay. This points to a window for economical spin bowling; leg-spinner YS Chahal leads all IPL bowlers with 213 wickets, although that total is not broken down by phase.

06
🔗

Runs and Balls Faced are Linearly Correlated

The scatter of top-30 batsmen shows a strong positive trend between balls faced and total runs. The heatmap's Pearson r ≈ 0.98 describes batsman_runs vs total_runs, a separate relationship from this scatter. Longevity at the crease is the clearest predictor of overall output.

Practical Recommendations

Powerplay · Overs 1–6
🏏 Batting

Deploy attacking openers under fielding restrictions. Target the boundary — powerplay avg is 15.40 runs/over, setting the match's run-rate foundation.

🎳 Bowling

Bowl your best swing/seam bowlers with the new ball. Target tight lines; early breakthroughs dramatically shift the match momentum.

Middle · Overs 7–15
🏏 Batting

Consolidate and rotate strike. Avoid reckless shots during this highest wicket-rate phase — preserve your power hitters for the death.

🎳 Bowling

Introduce your best spinners. Build dot-ball pressure and exploit the transition period — this phase offers the highest wicket probability.

Death · Overs 16–20
🏏 Batting

Save your cleanest boundary hitters. At 16.77 avg runs/over, maximising this window with specialist finishers is the single biggest scoring lever.

🎳 Bowling

Invest in yorker specialists. Slower balls and wide yorkers suppress boundaries. Avoid missing the yorker length: full tosses and half-volleys are costly.

Data-Driven Extras
Dismissal Strategy

62% of wickets are catches. Prioritise attacking field placements and edge-inducing bowling over line-and-length containment.

Boundary Planning

With 59.9% of runs from 4s & 6s, teams should specifically recruit boundary-hitting specialists — not just high-average batsmen.

Regression Insight

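The regression slope is about +0.12 runs per over for each over that passes, though over number alone explains only about 1% of the variance (R² = 0.0106). Hold back your best bowlers for the death (overs 16–20), where scoring is highest.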