Most Streamed Spotify Songs: Data Cleaning

Python

This is a simple data cleaning + EDA project. I used this data set from Kaggle.

import numpy as np
import pandas as pd
import re

Data Cleaning

Correct encoding
Data Types
Missing values
Duplicates
Outliers
Formating Data (Date)

1. Correct encoding

sdf = pd.read_csv("Most Streamed Spotify Songs 2024.csv", encoding='ISO-8859-1') # not encoded in utf-8
sdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4600 entries, 0 to 4599
Data columns (total 29 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Track                       4600 non-null   object 
 1   Album Name                  4600 non-null   object 
 2   Artist                      4595 non-null   object 
 3   Release Date                4600 non-null   object 
 4   ISRC                        4600 non-null   object 
 5   All Time Rank               4600 non-null   object 
 6   Track Score                 4600 non-null   float64
 7   Spotify Streams             4487 non-null   object 
 8   Spotify Playlist Count      4530 non-null   object 
 9   Spotify Playlist Reach      4528 non-null   object 
 10  Spotify Popularity          3796 non-null   float64
 11  YouTube Views               4292 non-null   object 
 12  YouTube Likes               4285 non-null   object 
 13  TikTok Posts                3427 non-null   object 
 14  TikTok Likes                3620 non-null   object 
 15  TikTok Views                3619 non-null   object 
 16  YouTube Playlist Reach      3591 non-null   object 
 17  Apple Music Playlist Count  4039 non-null   float64
 18  AirPlay Spins               4102 non-null   object 
 19  SiriusXM Spins              2477 non-null   object 
 20  Deezer Playlist Count       3679 non-null   float64
 21  Deezer Playlist Reach       3672 non-null   object 
 22  Amazon Playlist Count       3545 non-null   float64
 23  Pandora Streams             3494 non-null   object 
 24  Pandora Track Stations      3332 non-null   object 
 25  Soundcloud Streams          1267 non-null   object 
 26  Shazam Counts               4023 non-null   object 
 27  TIDAL Popularity            0 non-null      float64
 28  Explicit Track              4600 non-null   int64  
dtypes: float64(6), int64(1), object(22)
memory usage: 1.0+ MB

2. Data Types

Most numerical variables were erroneously encoded as strings. Correcting this here.

# selecting all numerical variables erroneously encoded as a string object.
objCol = sdf.select_dtypes(include=[object]).drop(['Track','Album Name','Artist','Release Date','ISRC'], axis=1).columns

def convert_to_int(value, default=None):
    try:
        # Use regex to remove non-numeric characters
        cleaned_value = re.sub(r'[^\d]', '', str(value))
        result = int(cleaned_value)
        return result
    except (ValueError, TypeError):
        return default
sdf[objCol] = sdf[objCol].map(convert_to_int)

sdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4600 entries, 0 to 4599
Data columns (total 29 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Track                       4600 non-null   object 
 1   Album Name                  4600 non-null   object 
 2   Artist                      4595 non-null   object 
 3   Release Date                4600 non-null   object 
 4   ISRC                        4600 non-null   object 
 5   All Time Rank               4600 non-null   int64  
 6   Track Score                 4600 non-null   float64
 7   Spotify Streams             4487 non-null   float64
 8   Spotify Playlist Count      4530 non-null   float64
 9   Spotify Playlist Reach      4528 non-null   float64
 10  Spotify Popularity          3796 non-null   float64
 11  YouTube Views               4292 non-null   float64
 12  YouTube Likes               4285 non-null   float64
 13  TikTok Posts                3427 non-null   float64
 14  TikTok Likes                3620 non-null   float64
 15  TikTok Views                3619 non-null   float64
 16  YouTube Playlist Reach      3591 non-null   float64
 17  Apple Music Playlist Count  4039 non-null   float64
 18  AirPlay Spins               4102 non-null   float64
 19  SiriusXM Spins              2477 non-null   float64
 20  Deezer Playlist Count       3679 non-null   float64
 21  Deezer Playlist Reach       3672 non-null   float64
 22  Amazon Playlist Count       3545 non-null   float64
 23  Pandora Streams             3494 non-null   float64
 24  Pandora Track Stations      3332 non-null   float64
 25  Soundcloud Streams          1267 non-null   float64
 26  Shazam Counts               4023 non-null   float64
 27  TIDAL Popularity            0 non-null      float64
 28  Explicit Track              4600 non-null   int64  
dtypes: float64(22), int64(2), object(5)
memory usage: 1.0+ MB

3. Missing Values

Get a list of the count of missing values in each column, and drop the ones with too many.
Impute missing numerical values with mean

sdfNAs =  pd.DataFrame(sdf.isnull().sum()).reset_index().rename(columns={'index':'Track', 0 :'Missing Values'})
sdfNAs[sdfNAs['Missing Values']>0].sort_values(by='Missing Values', ascending=False)

	Track	Missing Values
27	TIDAL Popularity	4600
25	Soundcloud Streams	3333
19	SiriusXM Spins	2123
24	Pandora Track Stations	1268
13	TikTok Posts	1173
23	Pandora Streams	1106
22	Amazon Playlist Count	1055
16	YouTube Playlist Reach	1009
15	TikTok Views	981
14	TikTok Likes	980
21	Deezer Playlist Reach	928
20	Deezer Playlist Count	921
10	Spotify Popularity	804
26	Shazam Counts	577
17	Apple Music Playlist Count	561
18	AirPlay Spins	498
12	YouTube Likes	315
11	YouTube Views	308
7	Spotify Streams	113
9	Spotify Playlist Reach	72
8	Spotify Playlist Count	70
2	Artist	5

Choosing to drop “TIDAL Popularity”, “Soundcloud Streams”, “SiriusXM Spins” because they are: 1. missing too many values, and; 2. not incredibly interesting.

sdf.drop(["TIDAL Popularity", "Soundcloud Streams", "SiriusXM Spins"], axis=1, inplace=True)
sdf.head()

	Track	Album Name	Artist	Release Date	ISRC	All Time Rank	Track Score	Spotify Streams	Spotify Playlist Count	Spotify Playlist Reach	...	YouTube Playlist Reach	Apple Music Playlist Count	AirPlay Spins	Deezer Playlist Count	Deezer Playlist Reach	Amazon Playlist Count	Pandora Streams	Pandora Track Stations	Shazam Counts	Explicit Track
0	MILLION DOLLAR BABY	Million Dollar Baby - Single	Tommy Richman	4/26/2024	QM24S2402528	1	725.4	3.904709e+08	30716.0	196631588.0	...	1.505970e+08	210.0	40975.0	62.0	17598718.0	114.0	18004655.0	22931.0	2669262.0	0
1	Not Like Us	Not Like Us	Kendrick Lamar	5/4/2024	USUG12400910	2	545.9	3.237039e+08	28113.0	174597137.0	...	1.563804e+08	188.0	40778.0	67.0	10422430.0	111.0	7780028.0	28444.0	1118279.0	1
2	i like the way you kiss me	I like the way you kiss me	Artemas	3/19/2024	QZJ842400387	3	538.4	6.013093e+08	54331.0	211607669.0	...	3.737850e+08	190.0	74333.0	136.0	36321847.0	172.0	5022621.0	5639.0	5285340.0	0
3	Flowers	Flowers - Single	Miley Cyrus	1/12/2023	USSM12209777	4	444.9	2.031281e+09	269802.0	136569078.0	...	3.351189e+09	394.0	1474799.0	264.0	24684248.0	210.0	190260277.0	203384.0	11822942.0	0
4	Houdini	Houdini	Eminem	5/31/2024	USUG12403398	5	423.3	1.070349e+08	7223.0	151469874.0	...	1.127639e+08	182.0	12185.0	82.0	17660624.0	105.0	4493884.0	7006.0	457017.0	1

5 rows × 26 columns

Imputing missing numerical values by mean.

intCol = sdf.select_dtypes(include=[np.number]).columns

sdf[intCol] = sdf[intCol].fillna(sdf[intCol].mean())
intCol

Index(['All Time Rank', 'Track Score', 'Spotify Streams',
       'Spotify Playlist Count', 'Spotify Playlist Reach',
       'Spotify Popularity', 'YouTube Views', 'YouTube Likes', 'TikTok Posts',
       'TikTok Likes', 'TikTok Views', 'YouTube Playlist Reach',
       'Apple Music Playlist Count', 'AirPlay Spins', 'Deezer Playlist Count',
       'Deezer Playlist Reach', 'Amazon Playlist Count', 'Pandora Streams',
       'Pandora Track Stations', 'Shazam Counts', 'Explicit Track'],
      dtype='object')

sdfNAs =  pd.DataFrame(sdf.isnull().sum()).reset_index().rename(columns={'index':'Track', 0 :'Missing Values'})
sdfNAs[sdfNAs['Missing Values']>0].sort_values(by='Missing Values', ascending=False)

	Track	Missing Values
2	Artist	5

4. Duplicates

sdfDupes = pd.DataFrame(sdf.duplicated()).rename(columns={0:'duplicate'})
sdfDupes[sdfDupes['duplicate']== True]

	duplicate
2450	True
3450	True

sdf.iloc[2449:2451]

	Track	Album Name	Artist	Release Date	ISRC	All Time Rank	Track Score	Spotify Streams	Spotify Playlist Count	Spotify Playlist Reach	...	YouTube Playlist Reach	Apple Music Playlist Count	AirPlay Spins	Deezer Playlist Count	Deezer Playlist Reach	Amazon Playlist Count	Pandora Streams	Pandora Track Stations	Shazam Counts	Explicit Track
2449	Tennessee Orange	Tennessee Orange	Megan Moroney	9/2/2022	TCAGJ2289254	2424	28.9	227893586.0	28139.0	12480714.0	...	238206228.0	33.0	129172.0	5.0	1370.0	49.0	56972562.0	26968.0	708143.0	0
2450	Tennessee Orange	Tennessee Orange	Megan Moroney	9/2/2022	TCAGJ2289254	2424	28.9	227893586.0	28139.0	12480714.0	...	238206228.0	33.0	129172.0	5.0	1370.0	49.0	56972562.0	26968.0	708143.0	0

2 rows × 26 columns

sdf = sdf.drop(sdf.index[2449], axis=0).reset_index()
sdf.iloc[2448:2452]

	index	Track	Album Name	Artist	Release Date	ISRC	All Time Rank	Track Score	Spotify Streams	Spotify Playlist Count	...	YouTube Playlist Reach	Apple Music Playlist Count	AirPlay Spins	Deezer Playlist Count	Deezer Playlist Reach	Amazon Playlist Count	Pandora Streams	Pandora Track Stations	Shazam Counts	Explicit Track
2448	2448	Am I Wrong	More Music 2013	Nico & Vinz	1/1/2013	USWB11304681	2440	28.9	965827790.0	160031.0	...	2936860.0	89.0	31268.0	64.0	111728.0	13.0	598857243.0	319806.0	19231875.0	0
2449	2450	Tennessee Orange	Tennessee Orange	Megan Moroney	9/2/2022	TCAGJ2289254	2424	28.9	227893586.0	28139.0	...	238206228.0	33.0	129172.0	5.0	1370.0	49.0	56972562.0	26968.0	708143.0	0
2450	2451	Still Trappin' (with King Von)	The Voice	Lil Durk	12/24/2020	USUM72024651	2448	28.8	176140377.0	76081.0	...	35450506.0	36.0	10226.0	7.0	31768.0	13.0	27955730.0	8002.0	494233.0	1
2451	2452	De Fresa y Coco	De Fresa y Coco	Luis R Conriquez	11/10/2023	QMANG2226771	2445	28.8	81186913.0	7818.0	...	242836946.0	3.0	3.0	1.0	2878.0	1.0	340543.0	2169.0	147308.0	0

4 rows × 27 columns

5. Outliers: Z-scores

def zscoring(df, columns, threshold =3):
    z_scores_dict = {}
    for col in columns:
        z_scores = np.abs((df[col] - df[col].mean()) / df[col].std())
        z_scores_dict[col + '_zscore'] = z_scores
    z_scores_df = pd.DataFrame(z_scores_dict)
    return z_scores_df

z_scores_sdf = zscoring(sdf, intCol)
z_scores_sdf

	All Time Rank_zscore	Track Score_zscore	Spotify Streams_zscore	Spotify Playlist Count_zscore	Spotify Playlist Reach_zscore	Spotify Popularity_zscore	YouTube Views_zscore	YouTube Likes_zscore	TikTok Posts_zscore	TikTok Likes_zscore	...	YouTube Playlist Reach_zscore	Apple Music Playlist Count_zscore	AirPlay Spins_zscore	Deezer Playlist Count_zscore	Deezer Playlist Reach_zscore	Amazon Playlist Count_zscore	Pandora Streams_zscore	Pandora Track Stations_zscore	Shazam Counts_zscore	Explicit Track_zscore
0	1.730622	17.732756	0.107108	0.406289	5.882628	1.938203	0.469887	0.274599	2.288098	1.104820	...	0.326421	2.315517	0.117118	0.611515	5.131862	3.885922	0.462468	0.292580	0.024738	0.748277
1	1.729866	13.076162	0.232649	0.443162	5.134597	1.938203	0.422586	0.125419	0.127990	0.158752	...	0.316659	1.987692	0.118749	0.714522	2.872978	3.754428	0.532345	0.267749	0.249878	1.336112
2	1.729111	12.881597	0.289327	0.071771	6.391039	1.938203	0.413365	0.158310	0.987168	0.333133	...	0.050303	2.017495	0.159020	2.136021	11.025354	6.428142	0.551190	0.370464	0.487939	0.748277
3	1.728355	10.456017	2.978073	2.980478	3.843613	1.462161	1.022392	1.736456	2.962738	1.980611	...	5.075927	5.057322	11.752106	4.773004	7.362179	8.093734	0.714773	0.520190	1.645482	0.748277
4	1.727599	9.895669	0.640047	0.739078	4.349468	1.666179	0.480065	0.166794	0.440283	0.000044	...	0.390280	1.898286	0.355442	1.023543	5.151349	3.491440	0.554804	0.364307	0.366960	1.336112
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
4594	1.734176	0.582317	0.267724	0.090268	0.619423	0.510078	0.374063	0.345765	0.425117	0.189547	...	0.580527	0.769014	0.456260	0.624571	0.403222	0.000225	0.448120	0.336481	0.331669	1.336112
4595	1.726618	0.582317	0.742998	0.776206	0.743446	0.510012	0.797950	0.545153	0.000089	0.230960	...	0.538464	0.798816	0.452899	0.645173	0.407405	0.000225	0.000043	0.000060	0.413602	0.748277
4596	1.723594	0.582317	0.484101	0.179454	0.565047	0.102042	0.297696	0.417077	0.444460	0.220121	...	0.531240	0.530596	0.454621	0.645173	0.407674	0.847865	0.008521	0.265249	0.246740	1.336112
4597	1.738711	0.582317	0.567784	0.642555	0.541885	0.170048	1.055567	0.211595	0.442639	0.225767	...	0.538164	0.798816	0.446376	0.000122	0.000089	0.804034	0.538921	0.000060	0.368505	0.748277
4598	1.740223	0.582317	0.360439	0.386146	0.315129	0.034036	0.683474	0.202123	0.411060	0.210349	...	0.283827	0.649805	0.439067	0.583369	0.367571	0.935528	0.113906	0.344876	0.312074	1.336112

4599 rows × 21 columns

def remove_outliers(df, z_scores_sdf, threshold=3):
    # Identify rows with any Z-score greater than threshold
    outliers_mask = (z_scores_sdf > threshold).any(axis=1)
    
    # Remove outliers from main DataFrame
    df_clean = df[~outliers_mask]
    
    return df_clean

sdfClean = remove_outliers(sdf, z_scores_sdf)

sdf = sdfClean.reset_index()
sdf

	level_0	index	Track	Album Name	Artist	Release Date	ISRC	All Time Rank	Track Score	Spotify Streams	...	YouTube Playlist Reach	Apple Music Playlist Count	AirPlay Spins	Deezer Playlist Count	Deezer Playlist Reach	Amazon Playlist Count	Pandora Streams	Pandora Track Stations	Shazam Counts	Explicit Track
0	98	98	Agora Hills	Scarlet	Doja Cat	9/22/2023	USRC12301954	99	152.8	509653100.0	...	1.561457e+09	80.0	400206.000000	78.000000	9.341155e+06	39.000000	2.311027e+07	19559.000000	4295361.0	1
1	100	100	Montagem Rave Eterno	Montagem Rave Eterno	Dj Samir	5/2/2024	GXD7G2413058	101	152.3	5157486.0	...	3.439601e+08	3.0	55139.156753	2.000000	3.936600e+04	25.348942	8.566735e+07	87876.965786	47658.0	0
2	101	101	Lil Boo Thang	Lil Boo Thang	Paul Russell	8/18/2023	USAR12300323	102	152.1	242070373.0	...	1.559373e+07	106.0	370635.000000	69.000000	1.792014e+06	79.000000	8.442572e+07	30713.000000	2219984.0	0
3	106	106	Type Shit	WE DON'T TRUST YOU	Future	3/22/2024	USSM12402033	107	148.1	205206688.0	...	2.324409e+08	99.0	3432.000000	41.000000	2.943724e+06	47.000000	5.897303e+06	4527.000000	900933.0	1
4	107	107	Armageddon	Armageddon - The 1st Album	aespa	5/27/2024	KRA302400093	108	147.4	32942304.0	...	2.951077e+07	37.0	290.000000	5.000000	5.308310e+05	33.000000	2.186900e+04	48.000000	29736.0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
3689	4594	4595	For the Last Time	For the Last Time	$uicideboy$	9/5/2017	QM8DG1703420	4585	19.4	305049963.0	...	5.301600e+04	3.0	6.000000	2.000000	1.421700e+04	25.348942	2.010407e+07	13184.000000	656337.0	1
3690	4595	4596	Dil Meri Na Sune	Dil Meri Na Sune (From "Genius")	Atif Aslam	7/27/2018	INT101800122	4575	19.4	52282360.0	...	2.497305e+07	1.0	412.000000	1.000000	9.270000e+02	25.348942	8.566735e+07	87876.965786	193590.0	0
3691	4596	4597	Grace (feat. 42 Dugg)	My Turn	Lil Baby	2/28/2020	USUG12000043	4571	19.4	189972685.0	...	2.925315e+07	19.0	204.000000	1.000000	7.400000e+01	6.000000	8.442674e+07	28999.000000	1135998.0	1
3692	4597	4598	Nashe Si Chadh Gayi	November Top 10 Songs	Arijit Singh	11/8/2016	INY091600067	4591	19.4	145467020.0	...	2.515052e+07	1.0	1200.000000	32.310954	1.294939e+06	7.000000	6.817840e+06	87876.965786	448292.0	0
3693	4598	4599	Me Acostumbre (feat. Bad Bunny)	Me Acostumbre (feat. Bad Bunny)	Arcï¿½ï¿½	4/11/2017	USB271700107	4593	19.4	255740653.0	...	1.758314e+08	11.0	2083.000000	4.000000	1.274790e+05	4.000000	6.900674e+07	11320.000000	767006.0	1

3694 rows × 28 columns

6. Formating Date

sdf['Release Date'] = pd.to_datetime(sdf['Release Date'])

sdf['Release Month'] = sdf['Release Date'].dt.month
sdf['Release Year'] = sdf['Release Date'].dt.year
sdf

	level_0	index	Track	Album Name	Artist	Release Date	ISRC	All Time Rank	Track Score	Spotify Streams	...	AirPlay Spins	Deezer Playlist Count	Deezer Playlist Reach	Amazon Playlist Count	Pandora Streams	Pandora Track Stations	Shazam Counts	Explicit Track	Release Month	Release Year
0	98	98	Agora Hills	Scarlet	Doja Cat	2023-09-22	USRC12301954	99	152.8	509653100.0	...	400206.000000	78.000000	9.341155e+06	39.000000	2.311027e+07	19559.000000	4295361.0	1	9	2023
1	100	100	Montagem Rave Eterno	Montagem Rave Eterno	Dj Samir	2024-05-02	GXD7G2413058	101	152.3	5157486.0	...	55139.156753	2.000000	3.936600e+04	25.348942	8.566735e+07	87876.965786	47658.0	0	5	2024
2	101	101	Lil Boo Thang	Lil Boo Thang	Paul Russell	2023-08-18	USAR12300323	102	152.1	242070373.0	...	370635.000000	69.000000	1.792014e+06	79.000000	8.442572e+07	30713.000000	2219984.0	0	8	2023
3	106	106	Type Shit	WE DON'T TRUST YOU	Future	2024-03-22	USSM12402033	107	148.1	205206688.0	...	3432.000000	41.000000	2.943724e+06	47.000000	5.897303e+06	4527.000000	900933.0	1	3	2024
4	107	107	Armageddon	Armageddon - The 1st Album	aespa	2024-05-27	KRA302400093	108	147.4	32942304.0	...	290.000000	5.000000	5.308310e+05	33.000000	2.186900e+04	48.000000	29736.0	0	5	2024
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
3689	4594	4595	For the Last Time	For the Last Time	$uicideboy$	2017-09-05	QM8DG1703420	4585	19.4	305049963.0	...	6.000000	2.000000	1.421700e+04	25.348942	2.010407e+07	13184.000000	656337.0	1	9	2017
3690	4595	4596	Dil Meri Na Sune	Dil Meri Na Sune (From "Genius")	Atif Aslam	2018-07-27	INT101800122	4575	19.4	52282360.0	...	412.000000	1.000000	9.270000e+02	25.348942	8.566735e+07	87876.965786	193590.0	0	7	2018
3691	4596	4597	Grace (feat. 42 Dugg)	My Turn	Lil Baby	2020-02-28	USUG12000043	4571	19.4	189972685.0	...	204.000000	1.000000	7.400000e+01	6.000000	8.442674e+07	28999.000000	1135998.0	1	2	2020
3692	4597	4598	Nashe Si Chadh Gayi	November Top 10 Songs	Arijit Singh	2016-11-08	INY091600067	4591	19.4	145467020.0	...	1200.000000	32.310954	1.294939e+06	7.000000	6.817840e+06	87876.965786	448292.0	0	11	2016
3693	4598	4599	Me Acostumbre (feat. Bad Bunny)	Me Acostumbre (feat. Bad Bunny)	Arcï¿½ï¿½	2017-04-11	USB271700107	4593	19.4	255740653.0	...	2083.000000	4.000000	1.274790e+05	4.000000	6.900674e+07	11320.000000	767006.0	1	4	2017

3694 rows × 30 columns