Most Streamed Spotify Songs: Data Cleaning

Python


This is a simple data cleaning + EDA project. I used this data set from Kaggle.

import numpy as np
import pandas as pd
import re

Data Cleaning

  1. Correct encoding
  2. Data Types
  3. Missing values
  4. Duplicates
  5. Outliers
  6. Formating Data (Date)

1. Correct encoding

sdf = pd.read_csv("Most Streamed Spotify Songs 2024.csv", encoding='ISO-8859-1') # not encoded in utf-8
sdf.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4600 entries, 0 to 4599
Data columns (total 29 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Track                       4600 non-null   object 
 1   Album Name                  4600 non-null   object 
 2   Artist                      4595 non-null   object 
 3   Release Date                4600 non-null   object 
 4   ISRC                        4600 non-null   object 
 5   All Time Rank               4600 non-null   object 
 6   Track Score                 4600 non-null   float64
 7   Spotify Streams             4487 non-null   object 
 8   Spotify Playlist Count      4530 non-null   object 
 9   Spotify Playlist Reach      4528 non-null   object 
 10  Spotify Popularity          3796 non-null   float64
 11  YouTube Views               4292 non-null   object 
 12  YouTube Likes               4285 non-null   object 
 13  TikTok Posts                3427 non-null   object 
 14  TikTok Likes                3620 non-null   object 
 15  TikTok Views                3619 non-null   object 
 16  YouTube Playlist Reach      3591 non-null   object 
 17  Apple Music Playlist Count  4039 non-null   float64
 18  AirPlay Spins               4102 non-null   object 
 19  SiriusXM Spins              2477 non-null   object 
 20  Deezer Playlist Count       3679 non-null   float64
 21  Deezer Playlist Reach       3672 non-null   object 
 22  Amazon Playlist Count       3545 non-null   float64
 23  Pandora Streams             3494 non-null   object 
 24  Pandora Track Stations      3332 non-null   object 
 25  Soundcloud Streams          1267 non-null   object 
 26  Shazam Counts               4023 non-null   object 
 27  TIDAL Popularity            0 non-null      float64
 28  Explicit Track              4600 non-null   int64  
dtypes: float64(6), int64(1), object(22)
memory usage: 1.0+ MB

2. Data Types

Most numerical variables were erroneously encoded as strings. Correcting this here.

# selecting all numerical variables erroneously encoded as a string object.
objCol = sdf.select_dtypes(include=[object]).drop(['Track','Album Name','Artist','Release Date','ISRC'], axis=1).columns
def convert_to_int(value, default=None):
    try:
        # Use regex to remove non-numeric characters
        cleaned_value = re.sub(r'[^\d]', '', str(value))
        result = int(cleaned_value)
        return result
    except (ValueError, TypeError):
        return default
sdf[objCol] = sdf[objCol].map(convert_to_int)
sdf.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4600 entries, 0 to 4599
Data columns (total 29 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Track                       4600 non-null   object 
 1   Album Name                  4600 non-null   object 
 2   Artist                      4595 non-null   object 
 3   Release Date                4600 non-null   object 
 4   ISRC                        4600 non-null   object 
 5   All Time Rank               4600 non-null   int64  
 6   Track Score                 4600 non-null   float64
 7   Spotify Streams             4487 non-null   float64
 8   Spotify Playlist Count      4530 non-null   float64
 9   Spotify Playlist Reach      4528 non-null   float64
 10  Spotify Popularity          3796 non-null   float64
 11  YouTube Views               4292 non-null   float64
 12  YouTube Likes               4285 non-null   float64
 13  TikTok Posts                3427 non-null   float64
 14  TikTok Likes                3620 non-null   float64
 15  TikTok Views                3619 non-null   float64
 16  YouTube Playlist Reach      3591 non-null   float64
 17  Apple Music Playlist Count  4039 non-null   float64
 18  AirPlay Spins               4102 non-null   float64
 19  SiriusXM Spins              2477 non-null   float64
 20  Deezer Playlist Count       3679 non-null   float64
 21  Deezer Playlist Reach       3672 non-null   float64
 22  Amazon Playlist Count       3545 non-null   float64
 23  Pandora Streams             3494 non-null   float64
 24  Pandora Track Stations      3332 non-null   float64
 25  Soundcloud Streams          1267 non-null   float64
 26  Shazam Counts               4023 non-null   float64
 27  TIDAL Popularity            0 non-null      float64
 28  Explicit Track              4600 non-null   int64  
dtypes: float64(22), int64(2), object(5)
memory usage: 1.0+ MB

3. Missing Values

  1. Get a list of the count of missing values in each column, and drop the ones with too many.
  2. Impute missing numerical values with mean
sdfNAs =  pd.DataFrame(sdf.isnull().sum()).reset_index().rename(columns={'index':'Track', 0 :'Missing Values'})
sdfNAs[sdfNAs['Missing Values']>0].sort_values(by='Missing Values', ascending=False)
Track Missing Values
27 TIDAL Popularity 4600
25 Soundcloud Streams 3333
19 SiriusXM Spins 2123
24 Pandora Track Stations 1268
13 TikTok Posts 1173
23 Pandora Streams 1106
22 Amazon Playlist Count 1055
16 YouTube Playlist Reach 1009
15 TikTok Views 981
14 TikTok Likes 980
21 Deezer Playlist Reach 928
20 Deezer Playlist Count 921
10 Spotify Popularity 804
26 Shazam Counts 577
17 Apple Music Playlist Count 561
18 AirPlay Spins 498
12 YouTube Likes 315
11 YouTube Views 308
7 Spotify Streams 113
9 Spotify Playlist Reach 72
8 Spotify Playlist Count 70
2 Artist 5

Choosing to drop “TIDAL Popularity”, “Soundcloud Streams”, “SiriusXM Spins” because they are: 1. missing too many values, and; 2. not incredibly interesting.

sdf.drop(["TIDAL Popularity", "Soundcloud Streams", "SiriusXM Spins"], axis=1, inplace=True)
sdf.head()
Track Album Name Artist Release Date ISRC All Time Rank Track Score Spotify Streams Spotify Playlist Count Spotify Playlist Reach ... YouTube Playlist Reach Apple Music Playlist Count AirPlay Spins Deezer Playlist Count Deezer Playlist Reach Amazon Playlist Count Pandora Streams Pandora Track Stations Shazam Counts Explicit Track
0 MILLION DOLLAR BABY Million Dollar Baby - Single Tommy Richman 4/26/2024 QM24S2402528 1 725.4 3.904709e+08 30716.0 196631588.0 ... 1.505970e+08 210.0 40975.0 62.0 17598718.0 114.0 18004655.0 22931.0 2669262.0 0
1 Not Like Us Not Like Us Kendrick Lamar 5/4/2024 USUG12400910 2 545.9 3.237039e+08 28113.0 174597137.0 ... 1.563804e+08 188.0 40778.0 67.0 10422430.0 111.0 7780028.0 28444.0 1118279.0 1
2 i like the way you kiss me I like the way you kiss me Artemas 3/19/2024 QZJ842400387 3 538.4 6.013093e+08 54331.0 211607669.0 ... 3.737850e+08 190.0 74333.0 136.0 36321847.0 172.0 5022621.0 5639.0 5285340.0 0
3 Flowers Flowers - Single Miley Cyrus 1/12/2023 USSM12209777 4 444.9 2.031281e+09 269802.0 136569078.0 ... 3.351189e+09 394.0 1474799.0 264.0 24684248.0 210.0 190260277.0 203384.0 11822942.0 0
4 Houdini Houdini Eminem 5/31/2024 USUG12403398 5 423.3 1.070349e+08 7223.0 151469874.0 ... 1.127639e+08 182.0 12185.0 82.0 17660624.0 105.0 4493884.0 7006.0 457017.0 1

5 rows × 26 columns

Imputing missing numerical values by mean.

intCol = sdf.select_dtypes(include=[np.number]).columns

sdf[intCol] = sdf[intCol].fillna(sdf[intCol].mean())
intCol
Index(['All Time Rank', 'Track Score', 'Spotify Streams',
       'Spotify Playlist Count', 'Spotify Playlist Reach',
       'Spotify Popularity', 'YouTube Views', 'YouTube Likes', 'TikTok Posts',
       'TikTok Likes', 'TikTok Views', 'YouTube Playlist Reach',
       'Apple Music Playlist Count', 'AirPlay Spins', 'Deezer Playlist Count',
       'Deezer Playlist Reach', 'Amazon Playlist Count', 'Pandora Streams',
       'Pandora Track Stations', 'Shazam Counts', 'Explicit Track'],
      dtype='object')
sdfNAs =  pd.DataFrame(sdf.isnull().sum()).reset_index().rename(columns={'index':'Track', 0 :'Missing Values'})
sdfNAs[sdfNAs['Missing Values']>0].sort_values(by='Missing Values', ascending=False)
Track Missing Values
2 Artist 5

4. Duplicates

sdfDupes = pd.DataFrame(sdf.duplicated()).rename(columns={0:'duplicate'})
sdfDupes[sdfDupes['duplicate']== True]
duplicate
2450 True
3450 True
sdf.iloc[2449:2451]
Track Album Name Artist Release Date ISRC All Time Rank Track Score Spotify Streams Spotify Playlist Count Spotify Playlist Reach ... YouTube Playlist Reach Apple Music Playlist Count AirPlay Spins Deezer Playlist Count Deezer Playlist Reach Amazon Playlist Count Pandora Streams Pandora Track Stations Shazam Counts Explicit Track
2449 Tennessee Orange Tennessee Orange Megan Moroney 9/2/2022 TCAGJ2289254 2424 28.9 227893586.0 28139.0 12480714.0 ... 238206228.0 33.0 129172.0 5.0 1370.0 49.0 56972562.0 26968.0 708143.0 0
2450 Tennessee Orange Tennessee Orange Megan Moroney 9/2/2022 TCAGJ2289254 2424 28.9 227893586.0 28139.0 12480714.0 ... 238206228.0 33.0 129172.0 5.0 1370.0 49.0 56972562.0 26968.0 708143.0 0

2 rows × 26 columns

sdf = sdf.drop(sdf.index[2449], axis=0).reset_index()
sdf.iloc[2448:2452]
index Track Album Name Artist Release Date ISRC All Time Rank Track Score Spotify Streams Spotify Playlist Count ... YouTube Playlist Reach Apple Music Playlist Count AirPlay Spins Deezer Playlist Count Deezer Playlist Reach Amazon Playlist Count Pandora Streams Pandora Track Stations Shazam Counts Explicit Track
2448 2448 Am I Wrong More Music 2013 Nico & Vinz 1/1/2013 USWB11304681 2440 28.9 965827790.0 160031.0 ... 2936860.0 89.0 31268.0 64.0 111728.0 13.0 598857243.0 319806.0 19231875.0 0
2449 2450 Tennessee Orange Tennessee Orange Megan Moroney 9/2/2022 TCAGJ2289254 2424 28.9 227893586.0 28139.0 ... 238206228.0 33.0 129172.0 5.0 1370.0 49.0 56972562.0 26968.0 708143.0 0
2450 2451 Still Trappin' (with King Von) The Voice Lil Durk 12/24/2020 USUM72024651 2448 28.8 176140377.0 76081.0 ... 35450506.0 36.0 10226.0 7.0 31768.0 13.0 27955730.0 8002.0 494233.0 1
2451 2452 De Fresa y Coco De Fresa y Coco Luis R Conriquez 11/10/2023 QMANG2226771 2445 28.8 81186913.0 7818.0 ... 242836946.0 3.0 3.0 1.0 2878.0 1.0 340543.0 2169.0 147308.0 0

4 rows × 27 columns

5. Outliers: Z-scores

def zscoring(df, columns, threshold =3):
    z_scores_dict = {}
    for col in columns:
        z_scores = np.abs((df[col] - df[col].mean()) / df[col].std())
        z_scores_dict[col + '_zscore'] = z_scores
    z_scores_df = pd.DataFrame(z_scores_dict)
    return z_scores_df

z_scores_sdf = zscoring(sdf, intCol)
z_scores_sdf
All Time Rank_zscore Track Score_zscore Spotify Streams_zscore Spotify Playlist Count_zscore Spotify Playlist Reach_zscore Spotify Popularity_zscore YouTube Views_zscore YouTube Likes_zscore TikTok Posts_zscore TikTok Likes_zscore ... YouTube Playlist Reach_zscore Apple Music Playlist Count_zscore AirPlay Spins_zscore Deezer Playlist Count_zscore Deezer Playlist Reach_zscore Amazon Playlist Count_zscore Pandora Streams_zscore Pandora Track Stations_zscore Shazam Counts_zscore Explicit Track_zscore
0 1.730622 17.732756 0.107108 0.406289 5.882628 1.938203 0.469887 0.274599 2.288098 1.104820 ... 0.326421 2.315517 0.117118 0.611515 5.131862 3.885922 0.462468 0.292580 0.024738 0.748277
1 1.729866 13.076162 0.232649 0.443162 5.134597 1.938203 0.422586 0.125419 0.127990 0.158752 ... 0.316659 1.987692 0.118749 0.714522 2.872978 3.754428 0.532345 0.267749 0.249878 1.336112
2 1.729111 12.881597 0.289327 0.071771 6.391039 1.938203 0.413365 0.158310 0.987168 0.333133 ... 0.050303 2.017495 0.159020 2.136021 11.025354 6.428142 0.551190 0.370464 0.487939 0.748277
3 1.728355 10.456017 2.978073 2.980478 3.843613 1.462161 1.022392 1.736456 2.962738 1.980611 ... 5.075927 5.057322 11.752106 4.773004 7.362179 8.093734 0.714773 0.520190 1.645482 0.748277
4 1.727599 9.895669 0.640047 0.739078 4.349468 1.666179 0.480065 0.166794 0.440283 0.000044 ... 0.390280 1.898286 0.355442 1.023543 5.151349 3.491440 0.554804 0.364307 0.366960 1.336112
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4594 1.734176 0.582317 0.267724 0.090268 0.619423 0.510078 0.374063 0.345765 0.425117 0.189547 ... 0.580527 0.769014 0.456260 0.624571 0.403222 0.000225 0.448120 0.336481 0.331669 1.336112
4595 1.726618 0.582317 0.742998 0.776206 0.743446 0.510012 0.797950 0.545153 0.000089 0.230960 ... 0.538464 0.798816 0.452899 0.645173 0.407405 0.000225 0.000043 0.000060 0.413602 0.748277
4596 1.723594 0.582317 0.484101 0.179454 0.565047 0.102042 0.297696 0.417077 0.444460 0.220121 ... 0.531240 0.530596 0.454621 0.645173 0.407674 0.847865 0.008521 0.265249 0.246740 1.336112
4597 1.738711 0.582317 0.567784 0.642555 0.541885 0.170048 1.055567 0.211595 0.442639 0.225767 ... 0.538164 0.798816 0.446376 0.000122 0.000089 0.804034 0.538921 0.000060 0.368505 0.748277
4598 1.740223 0.582317 0.360439 0.386146 0.315129 0.034036 0.683474 0.202123 0.411060 0.210349 ... 0.283827 0.649805 0.439067 0.583369 0.367571 0.935528 0.113906 0.344876 0.312074 1.336112

4599 rows × 21 columns

def remove_outliers(df, z_scores_sdf, threshold=3):
    # Identify rows with any Z-score greater than threshold
    outliers_mask = (z_scores_sdf > threshold).any(axis=1)
    
    # Remove outliers from main DataFrame
    df_clean = df[~outliers_mask]
    
    return df_clean

sdfClean = remove_outliers(sdf, z_scores_sdf)
sdf = sdfClean.reset_index()
sdf
level_0 index Track Album Name Artist Release Date ISRC All Time Rank Track Score Spotify Streams ... YouTube Playlist Reach Apple Music Playlist Count AirPlay Spins Deezer Playlist Count Deezer Playlist Reach Amazon Playlist Count Pandora Streams Pandora Track Stations Shazam Counts Explicit Track
0 98 98 Agora Hills Scarlet Doja Cat 9/22/2023 USRC12301954 99 152.8 509653100.0 ... 1.561457e+09 80.0 400206.000000 78.000000 9.341155e+06 39.000000 2.311027e+07 19559.000000 4295361.0 1
1 100 100 Montagem Rave Eterno Montagem Rave Eterno Dj Samir 5/2/2024 GXD7G2413058 101 152.3 5157486.0 ... 3.439601e+08 3.0 55139.156753 2.000000 3.936600e+04 25.348942 8.566735e+07 87876.965786 47658.0 0
2 101 101 Lil Boo Thang Lil Boo Thang Paul Russell 8/18/2023 USAR12300323 102 152.1 242070373.0 ... 1.559373e+07 106.0 370635.000000 69.000000 1.792014e+06 79.000000 8.442572e+07 30713.000000 2219984.0 0
3 106 106 Type Shit WE DON'T TRUST YOU Future 3/22/2024 USSM12402033 107 148.1 205206688.0 ... 2.324409e+08 99.0 3432.000000 41.000000 2.943724e+06 47.000000 5.897303e+06 4527.000000 900933.0 1
4 107 107 Armageddon Armageddon - The 1st Album aespa 5/27/2024 KRA302400093 108 147.4 32942304.0 ... 2.951077e+07 37.0 290.000000 5.000000 5.308310e+05 33.000000 2.186900e+04 48.000000 29736.0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3689 4594 4595 For the Last Time For the Last Time $uicideboy$ 9/5/2017 QM8DG1703420 4585 19.4 305049963.0 ... 5.301600e+04 3.0 6.000000 2.000000 1.421700e+04 25.348942 2.010407e+07 13184.000000 656337.0 1
3690 4595 4596 Dil Meri Na Sune Dil Meri Na Sune (From "Genius") Atif Aslam 7/27/2018 INT101800122 4575 19.4 52282360.0 ... 2.497305e+07 1.0 412.000000 1.000000 9.270000e+02 25.348942 8.566735e+07 87876.965786 193590.0 0
3691 4596 4597 Grace (feat. 42 Dugg) My Turn Lil Baby 2/28/2020 USUG12000043 4571 19.4 189972685.0 ... 2.925315e+07 19.0 204.000000 1.000000 7.400000e+01 6.000000 8.442674e+07 28999.000000 1135998.0 1
3692 4597 4598 Nashe Si Chadh Gayi November Top 10 Songs Arijit Singh 11/8/2016 INY091600067 4591 19.4 145467020.0 ... 2.515052e+07 1.0 1200.000000 32.310954 1.294939e+06 7.000000 6.817840e+06 87876.965786 448292.0 0
3693 4598 4599 Me Acostumbre (feat. Bad Bunny) Me Acostumbre (feat. Bad Bunny) Arc�� 4/11/2017 USB271700107 4593 19.4 255740653.0 ... 1.758314e+08 11.0 2083.000000 4.000000 1.274790e+05 4.000000 6.900674e+07 11320.000000 767006.0 1

3694 rows × 28 columns

6. Formating Date

sdf['Release Date'] = pd.to_datetime(sdf['Release Date'])
sdf['Release Month'] = sdf['Release Date'].dt.month
sdf['Release Year'] = sdf['Release Date'].dt.year
sdf
level_0 index Track Album Name Artist Release Date ISRC All Time Rank Track Score Spotify Streams ... AirPlay Spins Deezer Playlist Count Deezer Playlist Reach Amazon Playlist Count Pandora Streams Pandora Track Stations Shazam Counts Explicit Track Release Month Release Year
0 98 98 Agora Hills Scarlet Doja Cat 2023-09-22 USRC12301954 99 152.8 509653100.0 ... 400206.000000 78.000000 9.341155e+06 39.000000 2.311027e+07 19559.000000 4295361.0 1 9 2023
1 100 100 Montagem Rave Eterno Montagem Rave Eterno Dj Samir 2024-05-02 GXD7G2413058 101 152.3 5157486.0 ... 55139.156753 2.000000 3.936600e+04 25.348942 8.566735e+07 87876.965786 47658.0 0 5 2024
2 101 101 Lil Boo Thang Lil Boo Thang Paul Russell 2023-08-18 USAR12300323 102 152.1 242070373.0 ... 370635.000000 69.000000 1.792014e+06 79.000000 8.442572e+07 30713.000000 2219984.0 0 8 2023
3 106 106 Type Shit WE DON'T TRUST YOU Future 2024-03-22 USSM12402033 107 148.1 205206688.0 ... 3432.000000 41.000000 2.943724e+06 47.000000 5.897303e+06 4527.000000 900933.0 1 3 2024
4 107 107 Armageddon Armageddon - The 1st Album aespa 2024-05-27 KRA302400093 108 147.4 32942304.0 ... 290.000000 5.000000 5.308310e+05 33.000000 2.186900e+04 48.000000 29736.0 0 5 2024
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3689 4594 4595 For the Last Time For the Last Time $uicideboy$ 2017-09-05 QM8DG1703420 4585 19.4 305049963.0 ... 6.000000 2.000000 1.421700e+04 25.348942 2.010407e+07 13184.000000 656337.0 1 9 2017
3690 4595 4596 Dil Meri Na Sune Dil Meri Na Sune (From "Genius") Atif Aslam 2018-07-27 INT101800122 4575 19.4 52282360.0 ... 412.000000 1.000000 9.270000e+02 25.348942 8.566735e+07 87876.965786 193590.0 0 7 2018
3691 4596 4597 Grace (feat. 42 Dugg) My Turn Lil Baby 2020-02-28 USUG12000043 4571 19.4 189972685.0 ... 204.000000 1.000000 7.400000e+01 6.000000 8.442674e+07 28999.000000 1135998.0 1 2 2020
3692 4597 4598 Nashe Si Chadh Gayi November Top 10 Songs Arijit Singh 2016-11-08 INY091600067 4591 19.4 145467020.0 ... 1200.000000 32.310954 1.294939e+06 7.000000 6.817840e+06 87876.965786 448292.0 0 11 2016
3693 4598 4599 Me Acostumbre (feat. Bad Bunny) Me Acostumbre (feat. Bad Bunny) Arc�� 2017-04-11 USB271700107 4593 19.4 255740653.0 ... 2083.000000 4.000000 1.274790e+05 4.000000 6.900674e+07 11320.000000 767006.0 1 4 2017

3694 rows × 30 columns