Pixal: The Music Video Classifier

CSCE 489: Data Science and Analytics - Fall 2016
Christopher Foy, Clayton Petty, Dalton Harris, Wesley Moncrief

Overview and Motivation

Have you ever been watching a music video and stopped to ask yourself: "Man, I wish I knew which genre this song is"? Just weeks ago, this was us: four college students with no sense of harmonic direction, just trying to find some musical fusion in our lives. We set out to find a solution to this problem and to build a tool that analyzes music videos from YouTube and classifies them by genre.

We wanted to find out if it was possible to intelligently determine the genre of a video by its YouTube metadata (likes, dislikes, view count, etc.) and its average frame colors throughout the video.

After thorough analysis, Pixal emerged: a web interface that takes a YouTube URL as input and outputs a predicted genre. It does this with a combination of OpenCV color analysis and scikit-learn machine learning, making informed assumptions about our data to produce its best genre classification for the input video.

Our motivation primarily comes from GitHub user Sacert's 'Colors of Film' analysis, in which he analyzed popular Hollywood films and extracted average frame colors. He used OpenCV and some built-in Python libraries to generate his output: an image composed of the colors of each frame of the movie. (Sacert's average-color image for Harry Potter and the Prisoner of Azkaban appeared here.)

Initial Questions

When we began our analysis, the question we asked was, "Can we accurately predict music genre by analyzing various aspects of a music video?" This question was extremely broad, and we quickly realized that we needed to narrow the scope and focus on a few important, relevant features.

Thus, we revised our question to, "Can we accurately predict music genre by analyzing average frame color and YouTube's video metadata (likes, dislikes, view count, etc.)?" This question was much more manageable, but it was geared more toward a final product than toward the actual data science process.

We then divided our project into two separate questions with connected but different goals: first, "What are the most defining and relevant features of a music video when it comes to genre?", and second, "Is it possible to leverage these features to accurately predict the genre of a music video?" These are the two questions our project really sought to answer.

Data Acquisition and Cleaning

We employed several techniques to gather and aggregate the data we needed.

Data Acquisition
First, we chose five of the most common genres to focus on and found a YouTube playlist for each. We used a Chrome extension called Scraper to download the links to the top 200 videos in each playlist. Then we used Pafy, a Python library, to download each of the videos; Pafy also gave us access to the metadata (view count, rating, etc.) that we needed.
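For reference, here is a minimal sketch of the Pafy calls involved; the helper name is our own, and the metadata fields are Pafy's documented attributes:

import pafy

def download_with_metadata(url):
    # Fetch the video object; Pafy exposes YouTube metadata as attributes.
    video = pafy.new(url)
    meta = {'author': video.author, 'viewcount': video.viewcount,
            'rating': video.rating, 'likes': video.likes,
            'dislikes': video.dislikes, 'length': video.length,
            'keywords': video.keywords}
    # Download the best-quality available stream to the working directory.
    video.getbest().download()
    return meta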

Compressing Videos
In order to speed up our frame-by-frame analysis of the videos, we wrote a script around 'ffmpeg', a video conversion tool, to lower the quality of each video and thereby reduce the amount of frame data to analyze.
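Here is a minimal sketch of the kind of ffmpeg invocation we scripted (the 240-pixel height and 10 fps cap are illustrative assumptions, not our exact settings):

import subprocess

def compress_video(in_path, out_path):
    # Downscale to 240 pixels tall and cap the frame rate at 10 fps,
    # shrinking the amount of pixel data the color analysis must touch.
    subprocess.call(['ffmpeg', '-i', in_path,
                     '-vf', 'scale=-2:240', '-r', '10', out_path])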

Analyzing Frame Colors
With skvideo, we iterated over the frames of each video. Originally we took the average color of all the pixels in a frame, but we quickly found that almost every frame averaged out to gray. Because of this, we switched to the most commonly appearing (mode) color of each frame, and then took the 10 most common frame colors across the whole video. These became the RGB values that we used in our analysis.
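Here is a sketch of that per-frame computation; the 10-unit color binning is an assumption on our part (though the RGB values in the data below are all multiples of 10):

import numpy as np
import skvideo.io
from collections import Counter

def top_frame_colors(video_path, n_colors=10):
    frame_modes = Counter()
    for frame in skvideo.io.vreader(video_path):
        # Quantize each pixel into 10-unit bins so near-identical colors pool together.
        pixels = (frame.reshape(-1, 3) // 10) * 10
        colors, counts = np.unique(pixels, axis=0, return_counts=True)
        # The frame's mode color is its most frequent quantized pixel.
        frame_modes[tuple(colors[counts.argmax()])] += 1
    # The n most common frame colors across the whole video.
    return [color for color, _ in frame_modes.most_common(n_colors)]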

Here is the data we started with, after pulling down the YouTube metadata and running our color analysis script:

In [1]:
import pandas as pd
from os import path
from sklearn.ensemble import RandomForestClassifier
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
import sklearn

# Edit path if need be (shouldn't need to b/c we all have the same folder structure)
CSV_PATH_1 = '../Videos/all_data'
CSV_PATH_2 = '../Videos2/all_data2'
FILE_EXTENSION = '_all.csv'
GENRES = ['country', 'edm', 'pop', 'rap', 'rock']

# Containers for the data frames
genre_dfs = {}
all_genres = None

# Read in the CSV files for each of the 5 genres
for genre in GENRES:
    genre_csv_path_1 = path.join(CSV_PATH_1, genre) + FILE_EXTENSION
    genre_csv_path_2 = path.join(CSV_PATH_2, genre) + FILE_EXTENSION
    df_1 = pd.read_csv(genre_csv_path_1)
    df_2 = pd.read_csv(genre_csv_path_2)
    df_1 = df_1.drop('Unnamed: 0', axis=1)  # drop the stray index column
    df_2 = df_2.drop('Unnamed: 0', axis=1)
    df_combined = pd.concat([df_1,df_2],ignore_index=True)
    genre_dfs[genre] = df_combined

all_genres = pd.concat(genre_dfs.values())
all_genres.head(3)

# genre_dfs is now a dictionary that contains the 5 different data frames
# all_genres is a dataframe that contains all of the data
Out[1]:
filename author description viewcount rating likes dislikes duration length keywords ... colors_8_red colors_8_blue colors_8_green colors_9_red colors_9_blue colors_9_green colors_10_red colors_10_blue colors_10_green genre
0 Luke Bryan - Roller Coaster.mp4 LukeBryanVEVO Luke Bryan - Crash My Party\nPurchase now on i... 28948653 4.840108 127866 5324 00:04:23 263 [Luke, Bryan, Roller, Coaster, Capitol, Record... ... 230 210 190 90 70 70 240 220 200 country
1 Dierks Bentley - Drunk On A Plane.mp4 DierksBentleyVEVO Purchase Dierks Bentley’s latest music: http:/... 41548786 4.763639 140682 8835 00:04:51 291 [Dierks, Bentley, Drunk, On, Plane, Capitol, R... ... 70 50 50 100 110 120 90 70 70 country
2 Thomas Rhett - Get Me Some Of That.mp4 ThomasRhettVEVO Music video by Thomas Rhett performing Get Me ... 43868160 4.826069 128488 5841 00:03:13 193 [Thomas, Rhett, Get, Me, Some, Of, That, The, ... ... 40 50 30 50 70 50 40 60 50 country

3 rows × 42 columns

Next, we formatted the data for our analysis:

To fit scikit-learn's random forest classifiers, we had to encode the genres as ordinal values. We add a new column to our dataframe for this, writing a function to populate it and applying it across the dataframe. We also create a binary column per genre for additional analysis.

We create our training and test sets by splitting each genre in all_genres in half: about 80 videos of each genre for training and 80 for testing. We then concatenate across genres to form a full training set and a full test set of roughly 400 records each.

In [2]:
def genre_to_ordinal(genre_in):
    if(genre_in == "country"):
        return 0
    elif(genre_in == "pop"):
        return 1
    elif(genre_in == "rock"):
        return 2
    elif(genre_in == "edm"):
        return 3
    elif(genre_in == "rap"):
        return 4
    else:
        return genre_in
    
all_genres['genre_ordinal'] = all_genres.genre.apply(genre_to_ordinal)

# Adding is_country flag
def is_country(genre_in):
    if(genre_in == "country"):
        return 1
    else:
        return 0
    
all_genres['is_country'] = all_genres.genre.apply(is_country)

# Adding is_rock flag
def is_rock(genre_in):
    if(genre_in == "rock"):
        return 1
    else:
        return 0
    
all_genres['is_rock'] = all_genres.genre.apply(is_rock)

# Adding is_edm flag
def is_edm(genre_in):
    if(genre_in == "edm"):
        return 1
    else:
        return 0
    
all_genres['is_edm'] = all_genres.genre.apply(is_edm)

# Adding is_rap flag
def is_rap(genre_in):
    if(genre_in == "rap"):
        return 1
    else:
        return 0
    
all_genres['is_rap'] = all_genres.genre.apply(is_rap)

# Adding is_pop flag
def is_pop(genre_in):
    if(genre_in == "pop"):
        return 1
    else:
        return 0
    
all_genres['is_pop'] = all_genres.genre.apply(is_pop)

# Subset all_genres to group by individual genres
country_records  = all_genres[all_genres["genre"] == "country"]
rock_records     = all_genres[all_genres["genre"] == "rock"]
pop_records      = all_genres[all_genres["genre"] == "pop"]
edm_records      = all_genres[all_genres["genre"] == "edm"]
rap_records      = all_genres[all_genres["genre"] == "rap"]

# From the subsets above, create train and test sets from each
country_train = country_records.head(len(country_records) / 2)
country_test  = country_records.tail(len(country_records) / 2)
rock_train    = rock_records.head(len(rock_records) / 2)
rock_test     = rock_records.tail(len(rock_records) / 2)
pop_train     = pop_records.head(len(pop_records) / 2)
pop_test      = pop_records.tail(len(pop_records) / 2)
edm_train     = edm_records.head(len(edm_records) / 2)
edm_test      = edm_records.tail(len(edm_records) / 2)
rap_train     = rap_records.head(len(rap_records) / 2)
rap_test      = rap_records.tail(len(rap_records) / 2)

# Create big training and big test set for analysis
training_set = pd.concat([country_train,rock_train,pop_train,edm_train,rap_train])
test_set     = pd.concat([country_test,rock_test,pop_test,edm_test,rap_test])

training_set = training_set.fillna(0)
test_set = test_set.fillna(0)

print "Training Records:\t" , len(training_set)
print "Test Records:\t\t" , len(test_set)

training_set.head()
Training Records:	405
Test Records:		405
Out[2]:
filename author description viewcount rating likes dislikes duration length keywords ... colors_10_red colors_10_blue colors_10_green genre genre_ordinal is_country is_rock is_edm is_rap is_pop
0 Luke Bryan - Roller Coaster.mp4 LukeBryanVEVO Luke Bryan - Crash My Party\nPurchase now on i... 28948653 4.840108 127866 5324 00:04:23 263 [Luke, Bryan, Roller, Coaster, Capitol, Record... ... 240 220 200 country 0 1 0 0 0 0
1 Dierks Bentley - Drunk On A Plane.mp4 DierksBentleyVEVO Purchase Dierks Bentley’s latest music: http:/... 41548786 4.763639 140682 8835 00:04:51 291 [Dierks, Bentley, Drunk, On, Plane, Capitol, R... ... 90 70 70 country 0 1 0 0 0 0
2 Thomas Rhett - Get Me Some Of That.mp4 ThomasRhettVEVO Music video by Thomas Rhett performing Get Me ... 43868160 4.826069 128488 5841 00:03:13 193 [Thomas, Rhett, Get, Me, Some, Of, That, The, ... ... 40 60 50 country 0 1 0 0 0 0
3 David Nail - Whatever She's Got.mp4 DavidNailVEVO Purchase David Nail’s latest music: http://umg... 48648247 4.826632 141108 6393 00:04:01 241 [David, Nail, Whatever, She's, Got, MCA, Nashv... ... 70 60 50 country 0 1 0 0 0 0
4 Joe Nichols - Yeah.mp4 JoeNicholsVEVO Joe Nichols - Yeah\n“Yeah” from Joe Nichol’s C... 11397694 4.815725 33255 1606 00:03:52 232 [Joe Nichols, Red Bow Records, Country, Yeah] ... 20 30 50 country 0 1 0 0 0 0

5 rows × 48 columns

The training_set and test_set above are the two dataframes we used for analysis over the course of the project.

Exploratory Data Analysis

What follows is our initial gathering and analysis of our data.

Generating Random Forest - Viewer Statistics

We start generating our random forests, outputting an accuracy score and a confusion matrix for each. In this first one, we use only the non-color variables (rating, likes, dislikes, length, and view count), and run it across all records to predict the ordinal genre value.

As you will see, this method yields relatively poor results: the random forest finds no distinct clusters, and simple viewer statistics tell us little about what kind of video we're watching. However, we see that country, rap, and pop are initially somewhat distinct, while rock and edm get mistaken for one another. Let's see if we can't make something of this.

In [3]:
# Predicting based solely on non-color features, using RF
clf = RandomForestClassifier(n_estimators=11)
meta_data_features = ['rating', 'likes','dislikes','length','viewcount']
y, _ = pd.factorize(training_set['genre_ordinal'])
clf = clf.fit(training_set[meta_data_features], y)

z, _ = pd.factorize(test_set['genre_ordinal'])
print clf.score(test_set[meta_data_features],z)
pd.crosstab(test_set.genre_ordinal, clf.predict(test_set[meta_data_features]),rownames=["Actual"], colnames=["Predicted"])
0.43950617284
Out[3]:
Predicted 0 1 2 3 4
Actual
0 48 6 1 18 8
1 3 24 40 7 7
2 26 18 5 22 4
3 6 17 24 25 11
4 6 20 3 9 47

Random Forest - Only Color Statistics

Below, we do the same random forest as above, but strictly off of average frame color for the video.

As shown below, this actually yields worse results than the viewer statistics alone, because the colors of a video by themselves do not determine its genre. If rappers used only red in their videos and rockers only black, this might be somewhat accurate, but that's just not the case. But what if we pair these color findings with our initial viewer statistics?

In [4]:
def gen_new_headers(old_headers):
    headers = ['colors_' + str(x+1) + '_' for x in range(10)]
    h = []
    for x in headers:
        h.append(x + 'red')
        h.append(x + 'blue')
        h.append(x + 'green')
    return old_headers + h + ['genre']

clf = RandomForestClassifier(n_estimators=11)
color_features = gen_new_headers([])[:-1]

# Predicting based solely on colors
y, _ = pd.factorize(training_set['genre_ordinal'])
clf = clf.fit(training_set[color_features], y)

z, _ = pd.factorize(test_set['genre_ordinal'])
print clf.score(test_set[color_features],z)
pd.crosstab(test_set.genre_ordinal, clf.predict(test_set[color_features]),rownames=["Actual"], colnames=["Predicted"])
0.212345679012
Out[4]:
Predicted 0 1 2 3 4
Actual
0 30 20 11 13 7
1 24 13 11 21 12
2 23 20 10 13 9
3 17 14 30 17 5
4 24 11 20 22 8

Random Forest - All Features

Here we use all of the metadata and color data gathered. Unsurprisingly, this approach yields a success rate between those of the color-only and metadata-only classifiers.

In [5]:
clf = RandomForestClassifier(n_estimators=11)
all_features = meta_data_features + color_features

# Predicting based on colors and non-color features
y, _ = pd.factorize(training_set['genre_ordinal'])
clf = clf.fit(training_set[all_features], y)

z, _ = pd.factorize(test_set['genre_ordinal'])
print clf.score(test_set[all_features],z)
pd.crosstab(test_set.genre_ordinal, clf.predict(test_set[all_features]),rownames=["Actual"], colnames=["Predicted"])
0.4
Out[5]:
Predicted 0 1 2 3 4
Actual
0 38 11 1 19 12
1 4 20 34 19 4
2 20 30 4 14 7
3 8 20 17 26 12
4 19 13 3 16 34

Binary Classifiers

Scores were low; morale was down. It seemed we were making the classifier do far too much work by forcing it to choose among 5 possible genres. The next step, then, was to use the binary features that we created initially.

First, we try Pop.

In [6]:
clf = RandomForestClassifier(n_estimators=11)
all_features = meta_data_features + color_features

# Predicting based on colors and non-color features
y, _ = pd.factorize(training_set['is_pop'])
clf = clf.fit(training_set[all_features], y)

z, _ = pd.factorize(test_set['is_pop'])
print clf.score(test_set[all_features],z)
pd.crosstab(test_set.is_pop, clf.predict(test_set[all_features]),rownames=["Actual"], colnames=["Predicted"])
0.807407407407
Out[6]:
Predicted 0 1
Actual
0 305 19
1 59 22
In [7]:
clf = RandomForestClassifier(n_estimators=11)
all_features = meta_data_features + color_features

# Predicting based on colors and non-color features
y, _ = pd.factorize(training_set['is_rap'])
clf = clf.fit(training_set[all_features], y)

z, _ = pd.factorize(test_set['is_rap'])
print clf.score(test_set[all_features],z)
pd.crosstab(test_set.is_rap, clf.predict(test_set[all_features]),rownames=["Actual"], colnames=["Predicted"])
0.785185185185
Out[7]:
Predicted 0 1
Actual
0 299 21
1 66 19

What we're seeing above are confusion matrices that, based on our training data, predict whether or not each video in the test set is pop (Out[6]) or rap (Out[7]). Reading the pop matrix, the model predicted 305 + 22 = 327 videos correctly, versus 19 + 59 = 78 incorrectly.

These confusion matrices are our first effort at utilizing the binary classifiers. The pop classifier does a relatively good job, with a success rate of roughly 80%. However, it could use some improvement in the realm of "false negatives", where the model classifies a video as not pop when it actually is.

We attempt this same test for each genre, and run the test multiple times to see what our average success rate is for each genre.

In [8]:
training_set = pd.concat([country_train,rock_train,pop_train,edm_train,rap_train])
test_set     = pd.concat([country_test,rock_test,pop_test,edm_test,rap_test])

def multi_RF_averages(is_genre,num_iterations):
    clf = RandomForestClassifier(n_estimators=11)
    loop_indices = range(0,num_iterations)
    cumsum = 0

    for i in loop_indices:
        y, _ = pd.factorize(training_set[is_genre])
        clf = clf.fit(training_set[all_features], y)

        z, _ = pd.factorize(test_set[is_genre])
        cumsum = cumsum + clf.score(test_set[all_features],z)
    
    
    print "Average Score for",len(loop_indices),is_genre,"iterations:", cumsum/len(loop_indices)

NUM_TIMES = 20
multi_RF_averages("is_pop", NUM_TIMES)
multi_RF_averages("is_rap", NUM_TIMES)
multi_RF_averages("is_rock", NUM_TIMES)
multi_RF_averages("is_edm", NUM_TIMES)
multi_RF_averages("is_country", NUM_TIMES)
Average Score for 20 is_pop iterations: 0.812469135802
Average Score for 20 is_rap iterations: 0.785679012346
Average Score for 20 is_rock iterations: 0.818148148148
Average Score for 20 is_edm iterations: 0.755679012346
Average Score for 20 is_country iterations: 0.792592592593

Final Analysis

Now let's look at how well certain pairs of genres can be distinguished from each other.

In [9]:
# Removing EDM for better analysis - makes is_pop and is_rap much more accurate
# training_set = pd.concat([rock_train,edm_train,country_train,pop_train])
# test_set     = pd.concat([rock_test,edm_test,country_test,pop_test])

training_set = pd.concat([country_train,pop_train])
test_set     = pd.concat([country_test,pop_test])
multi_RF_averages("is_country",50)
multi_RF_averages("is_pop",50)

print '--------------------------------'

training_set = pd.concat([edm_train,rock_train])
test_set     = pd.concat([edm_test,rock_test])
multi_RF_averages("is_rock",50)
multi_RF_averages("is_edm",50)
Average Score for 50 is_country iterations: 0.850864197531
Average Score for 50 is_pop iterations: 0.857037037037
--------------------------------
Average Score for 50 is_rock iterations: 0.583164556962
Average Score for 50 is_edm iterations: 0.589240506329

So, what does this tell us? It seems that country and pop are easy to distinguish from one another, while rock and edm are significantly harder to differentiate.

Next let's look at which genres seem to be the best at identifying themselves.

In [10]:
training_set = pd.concat([country_train,rock_train,edm_train,rap_train,pop_train])

test_set     = pd.concat([rock_test])
multi_RF_averages("is_rock",50)

test_set     = pd.concat([rap_test])
multi_RF_averages("is_rap",50)

test_set     = pd.concat([country_test])
multi_RF_averages("is_country",50)

test_set     = pd.concat([pop_test])
multi_RF_averages("is_pop",50)

test_set     = pd.concat([edm_test])
multi_RF_averages("is_edm",50)
Average Score for 50 is_rock iterations: 0.825066666667
Average Score for 50 is_rap iterations: 0.726823529412
Average Score for 50 is_country iterations: 0.221975308642
Average Score for 50 is_pop iterations: 0.733827160494
Average Score for 50 is_edm iterations: 0.918313253012

Something seems to be going on with country. Let's also take a look at EDM, which seems to be very successful at differentiating itself from the other genres.

In [11]:
training_set = pd.concat([country_train,rock_train,pop_train,edm_train,rap_train])
test_set     = pd.concat([country_test,rock_test,pop_test,edm_test,rap_test])

clf = RandomForestClassifier(n_estimators=11)
all_features = meta_data_features + color_features
y, _ = pd.factorize(training_set['is_edm'])
clf = clf.fit(training_set[all_features], y)
z, _ = pd.factorize(test_set['is_edm'])
edm_cross = pd.crosstab(test_set.is_edm, clf.predict(test_set[all_features]),rownames=["Actual"], colnames=["Predicted"])
edm_cross
Out[11]:
Predicted 0 1
Actual
0 295 27
1 74 9
In [12]:
training_set = pd.concat([country_train,rock_train,pop_train,edm_train,rap_train])
test_set     = pd.concat([country_test,rock_test,pop_test,edm_test,rap_test])

clf = RandomForestClassifier(n_estimators=11)
all_features = meta_data_features + color_features
y, _ = pd.factorize(training_set['is_pop'])
clf = clf.fit(training_set[all_features], y)
z, _ = pd.factorize(test_set['is_pop'])
pop_cross = pd.crosstab(test_set.is_pop, clf.predict(test_set[all_features]),rownames=["Actual"], colnames=["Predicted"])
pop_cross
Out[12]:
Predicted 0 1
Actual
0 304 20
1 64 17
In [13]:
training_set = pd.concat([country_train,rock_train,pop_train,edm_train,rap_train])
test_set     = pd.concat([country_test,rock_test,pop_test,edm_test,rap_test])

clf = RandomForestClassifier(n_estimators=11)
all_features = meta_data_features + color_features
y, _ = pd.factorize(training_set['is_rock'])
clf = clf.fit(training_set[all_features], y)
z, _ = pd.factorize(test_set['is_rock'])
rock_cross = pd.crosstab(test_set.is_rock, clf.predict(test_set[all_features]),rownames=["Actual"], colnames=["Predicted"])
rock_cross
Out[13]:
Predicted 0 1
Actual
0 324 6
1 60 15
In [14]:
training_set = pd.concat([country_train,rock_train,pop_train,edm_train,rap_train])
test_set     = pd.concat([country_test,rock_test,pop_test,edm_test,rap_test])

clf = RandomForestClassifier(n_estimators=11)
all_features = meta_data_features + color_features
y, _ = pd.factorize(training_set['is_country'])
clf = clf.fit(training_set[all_features], y)
z, _ = pd.factorize(test_set['is_country'])
country_cross = pd.crosstab(test_set.is_country, clf.predict(test_set[all_features]),rownames=["Actual"], colnames=["Predicted"])
country_cross
Out[14]:
Predicted 0 1
Actual
0 24 300
1 19 62
In [15]:
clf = RandomForestClassifier(n_estimators=11)
all_features = meta_data_features + color_features
y, _ = pd.factorize(training_set['is_rap'])
clf = clf.fit(training_set[all_features], y)
z, _ = pd.factorize(test_set['is_rap'])
rap_cross = pd.crosstab(test_set.is_rap, clf.predict(test_set[all_features]),rownames=["Actual"], colnames=["Predicted"])
rap_cross
Out[15]:
Predicted 0 1
Actual
0 296 24
1 63 22

Let's throw all of the above crosstabs into a bar graph.

In [16]:
rap_cross = rap_cross.values.tolist()
rock_cross = rock_cross.values.tolist()
country_cross = country_cross.values.tolist()
pop_cross = pop_cross.values.tolist()
edm_cross = edm_cross.values.tolist()

def flatten_l_of_ls(lol):
    return [val for sublist in lol for val in sublist]

rap_cross = flatten_l_of_ls(rap_cross)
rock_cross = flatten_l_of_ls(rock_cross)
country_cross = flatten_l_of_ls(country_cross)
pop_cross = flatten_l_of_ls(pop_cross)
edm_cross = flatten_l_of_ls(edm_cross)
In [17]:
# *_cross[0] = true negatives
# *_cross[1] = false positives
# *_cross[2] = false negatives
# *_cross[3] = true positives

def normalize(cross_list):
    soln = [float(cross_list[0])/(float(cross_list[0])+float(cross_list[1])),
            float(cross_list[1])/(float(cross_list[0])+float(cross_list[1])),
            float(cross_list[2])/(float(cross_list[2])+float(cross_list[3])),
            float(cross_list[3])/(float(cross_list[2])+float(cross_list[3]))]
    return soln

rap_cross = normalize(rap_cross)
rock_cross = normalize(rock_cross)
country_cross = normalize(country_cross)
pop_cross = normalize(pop_cross)
edm_cross = normalize(edm_cross)
In [18]:
import plotly.offline as py
import plotly.graph_objs as go

py.init_notebook_mode()


x_vals = ['True Negatives', 'False Positives', 'False Negatives', 'True Positives']

rap = go.Bar(
    x=x_vals,
    y=rap_cross,
    name='rap'
)
edm = go.Bar(
    x=x_vals,
    y=edm_cross,
    name='edm'
)
pop = go.Bar(
    x=x_vals,
    y=pop_cross,
    name='pop'
)
rock = go.Bar(
    x=x_vals,
    y=rock_cross,
    name='rock'
)
country = go.Bar(
    x=x_vals,
    y=country_cross,
    name='country'
)
data = [rap,edm,pop,rock,country]
layout = go.Layout(
    barmode='group',
    title='False Positives/Negatives'
)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='grouped-bar')

Takeaways from individual genre analysis

Here we see some interesting insights. When it comes to classifying country, everything looks like country: the country classifier incorrectly labels a large number (~300) of non-country videos as country. This matches our experience with the production system we built after this analysis, which categorizes videos as country extremely often.

The other genres suffer the opposite problem: they have a hard time correctly identifying videos of their own genre. For example, any given input video will likely be classified as 'not EDM', even if it actually is EDM.

Now let's look at which features seem to be the most helpful in classifying videos.

Selecting the Most Valuable Features per Genre

Here we identify the most important features for classifying each genre, using the feature importances of an ExtraTreesClassifier.

In [19]:
model = ExtraTreesClassifier()

training_set = pd.concat([country_train,pop_train,rap_train,rock_train,edm_train])

df = pd.DataFrame()
df['index'] = all_features

y, _ = pd.factorize(training_set['is_rap'])
model.fit(training_set[all_features], y)
        
df['rap'] = model.feature_importances_

y, _ = pd.factorize(training_set['is_rock'])
model.fit(training_set[all_features], y)

df['rock'] = model.feature_importances_

y, _ = pd.factorize(training_set['is_country'])
model.fit(training_set[all_features], y)

df['country'] = model.feature_importances_

y, _ = pd.factorize(training_set['is_edm'])
model.fit(training_set[all_features], y)

df['edm'] = model.feature_importances_

y, _ = pd.factorize(training_set['is_pop'])
model.fit(training_set[all_features], y)

df['pop'] = model.feature_importances_

df.head()
Out[19]:
index rap rock country edm pop
0 rating 0.041921 0.035323 0.037033 0.036475 0.049065
1 likes 0.035218 0.030522 0.053931 0.033370 0.113991
2 dislikes 0.022170 0.037736 0.027769 0.032252 0.101643
3 length 0.037189 0.046866 0.039282 0.018612 0.016624
4 viewcount 0.041743 0.023423 0.050578 0.036979 0.192936
In [20]:
import plotly.offline as py
import plotly.graph_objs as go

py.init_notebook_mode()

df = df.set_index('index')
df = df.transpose()

title = 'Feature Importance By Genre'

labels = ['rap','rock','country','edm','pop']

cols = []
for x in df.columns:
    cols.append(x)
x_data = cols

y_data = df.values.tolist()

traces = []

for i in range(0, 5):
    traces.append(go.Scatter(
        x=x_data,
        y=y_data[i],
        mode='lines',
        connectgaps=True,
        name = labels[i]
    ))


layout = dict(title = 'Feature Importance by Genre',
                  xaxis = dict(title = 'Feature'),
                  yaxis = dict(title = 'Percent Importance (All Features Sum to 1.0)',
                               showgrid=False),
                  margin=go.Margin(
                    l=80,
                    r=50,
                    b=170,
                    t=100,
                    pad=8
                  ),
              )

fig = go.Figure(data=traces, layout=layout)
py.iplot(fig, filename='news-source')

Takeaways from feature analysis

Here we notice a few things. Perhaps unsurprisingly, pop classification places heavy emphasis on the likes and viewcount features. This seems logical, as pop videos often have far more likes and views than other videos.

The other takeaway is that, aside from a small bump in the importance of viewcount across all genres, none of the features, whether metadata or color data, stands out as particularly more important than the others. This could mean that frame color data and view/like data are not the right information for classifying genre; however, more analysis would be needed to confirm that.

Final Analysis Takeaways

Unfortunately, it appears that we failed in our mission to classify videos by metadata and frame color data. We discovered three main issues when analyzing the data.

  1. Looking at the individual genre results, the binary classifiers tend to label either everything or nothing as their genre. This suggests that none of the genres are very distinct from the others.
  2. As a result, all of our predictions show a high number of false positives or false negatives.
  3. No single feature stands out as important for classifying genre. We would normally expect a few features to emerge as the most indicative of a genre; since none do, it may be that none of our features are particularly useful.

Even though we did not create a strong classifier, we still count our project as successful. We learned that despite the differences in the sounds of music, the quantifiable traits (color data, metadata) that we examined are very similar.

Future Improvements

Since we do know there are differences between genres, we could search further into the data to try to find them. We would consider the following ideas in future exploration.

  • Use audio data. This could include average pitch frequency, musical key, chord progressions used, presence of consonance/dissonance, and more
  • Examine lyrics and song title, using something like LDA or Watson's tonal analysis
  • Expand our test and train sets