Coding Exercise: Visualise Netflix Viewing History with Python

Assumed audience: You are a TechLabs student on the Data Science track and a macOS user. If you're not, you can still follow this guide, but you'll have to adapt it to your context.

Even though it's still early in the track, I believe in pushing boundaries as early as possible. This exercise shouldn't be out of reach for most.

Scope

  • Simple data only. We'll be using Netflix Viewing History which outputs a CSV file with just a title and a date.
  • Simple transformations only. We'll be counting the titles watched in any given month. Then we add the counts for each month across the entire data set. We expect month: count as output.
  • Visualising the result. This may get tricky because we haven't reached this stage of the course yet.
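As a rough sketch of the transformation described above, here is the month: count idea on a tiny made-up table. The four rows are invented for illustration; only the Title and Date column names are taken from the real export.

```python
import pandas as pd  # conventional alias; the code later in this post uses ps

# made-up sample rows with the same two columns as the Netflix export
sample = pd.DataFrame({
    "Title": ["Show A", "Show B", "Film C", "Show D"],
    "Date": pd.to_datetime(["2020-03-01", "2021-03-15", "2021-09-02", "2021-09-20"]),
})

# count titles per calendar month (1-12), pooled across all years
month_counts = sample.groupby(sample["Date"].dt.month)["Title"].count()
print(month_counts)
```

Note that March 2020 and March 2021 land in the same bucket, which is exactly what "adding the counts for each month across the entire data set" means.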

Thought process

This is my first experience with pandas. It took me an hour to install pandas. That was ... fun. /s

I was forced to learn about Python environments, so I guess it was a valuable experience.

Python Environments

The VS Code Interactive window for Jupyter requires you to select a Python kernel (that is, an environment) before you can work with it.
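If you're unsure which environment a kernel actually points to, a minimal check like the one below can help. It only uses the standard library; run it in a cell after selecting the kernel. This is a sketch of one way to check, not the only one.

```python
import sys
import importlib.util

# path of the Python interpreter behind the currently selected kernel
print(sys.executable)

# True if pandas can be imported from this environment, False otherwise
pandas_available = importlib.util.find_spec("pandas") is not None
print(pandas_available)
```

If the second line prints False, pandas was probably installed into a different environment than the one the kernel is using.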

What do I need?

I looked at a pandas cheat sheet to get a vague idea of what I might need and referred to StackOverflow with questions that arose as I tried things out. I think it took me about 3 hours total to produce a visual.

Read the code.

import pandas as ps

# read the CSV data 
df = ps.read_csv('./data/NetflixViewingHistory.csv', parse_dates=['Date'], dayfirst=True)

# set the index of the DataFrame to the Date column
# why? because we need the x-axis to show months,
# and this is the easiest way I could think of to accomplish that
df2 = df.set_index(['Date'])

# make sure the index is a datetime object
# (parse_dates above should already have done this, so this is a safety net)
df2.index = ps.to_datetime(df2.index)

# count the titles per calendar month (1-12), pooled across all years
df3 = df2.groupby([df2.index.month]).count()

# rename the column and index
df3.columns = ['Count']
df3.index.names = ['Month']

# check result in Jupyter 
print(df3)

# we don't want numbers to display months
import calendar

def mapper(month):
    return calendar.month_abbr[month]

# assign names to the index
df3.index = df3.index.map(mapper)

# plot a bar chart and give it a colour
df3.plot(kind='bar', color='#c7522a')

[Figure: bar chart of the number of titles watched per calendar month]

It's ... unsightly.

The data goes back to March 2015.
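To check the range your own export covers, something like the following should work. The sample rows here are made up (the earliest one simply mirrors my March 2015 start); with the real file you would run the same two lines on the Date column of the DataFrame read from the CSV.

```python
import pandas as pd  # conventional alias; the code in this post uses ps

# made-up sample rows standing in for the real viewing history
sample = pd.DataFrame({
    "Title": ["Show A", "Film B", "Show C"],
    "Date": pd.to_datetime(["2015-03-14", "2018-07-01", "2021-11-30"]),
})

# earliest and latest viewing dates in the data
earliest = sample["Date"].min()
latest = sample["Date"].max()
print(earliest.strftime("%B %Y"), "to", latest.strftime("%B %Y"))
```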

We're only counting titles.

We don't account for abandoned viewings. If you click on a title even if you didn't mean to watch it or only watched a few minutes of it, it'll still be added to your history.

I don't always 'watch' Netflix but rather use it as background noise.

Sometimes I fall asleep with Netflix still playing.

That means the numbers are inflated but I can't say by how much.

It's fun to see that, added up over all the years, I watch relatively few titles in March and a lot in September, but it might be more interesting to see the average number of titles watched per month for each year.

So let's do that.

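# use the Date column as the index so we can resample by time
df_blank = df.set_index(['Date'])
print(df_blank)

# count the titles watched in each calendar month
# (on pandas 2.2+, 'M' and 'Y' are deprecated in favour of 'ME' and 'YE')
df_count = df_blank.resample('M').count()
print(df_count)

# average the monthly counts within each year
df_avg = df_count.resample('Y').mean()

# spot-check a single year in Jupyter
df_count.loc['2021']
df_avg.loc['2021']

# show only the year on the x-axis
def mapper(timestamp):
    return timestamp.year

df_avg.index = df_avg.index.map(mapper)
print(df_avg)
df_avg.columns = ['Count']

# switch the plotting backend from matplotlib to plotly
ps.options.plotting.backend = 'plotly'

fig = df_avg.plot.bar(width=700, title='Average number of Netflix titles watched per month within any given year', labels=dict(Date='Year', Title='Titles watched'))
fig.update_xaxes(type='category')
fig.update_layout(showlegend=False)
fig.update_yaxes(title='')
fig.update_xaxes(title='')
fig.show()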

And the plot.

[Figure: bar chart titled "Average number of Netflix titles watched per month within any given year"]

That's better but still awful and it's not interactive.
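One thing worth knowing about the averaging step: resample also creates rows for months with no viewing at all (count 0), and those zeros pull the average down. Below is a small sketch with made-up dates that compares it with a groupby on the year-month period, which only sees months that appear in the data. I use month-start bins ('MS') here instead of the month-end bins from my code above; the counts per month are the same, only the bin labels differ.

```python
import pandas as pd  # conventional alias; the code in this post uses ps

# made-up sample: two titles in January, none in February, one in March
sample = pd.DataFrame(
    {"Title": ["Show A", "Show B", "Film C"]},
    index=pd.to_datetime(["2021-01-05", "2021-01-20", "2021-03-10"]),
)

# resample fills the gap, so February shows up with a count of 0
by_resample = sample["Title"].resample("MS").count()

# groupby only produces rows for months that are present in the data
by_groupby = sample["Title"].groupby(sample.index.to_period("M")).count()

print(by_resample.mean())  # averaged over three months
print(by_groupby.mean())   # averaged over two months
```

Which of the two is "right" depends on the question: the first answers "how much did I watch per month overall", the second "how much did I watch in the months I watched anything".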
