Preparing Fanfiction Data

Cara Marta Messina
Northeastern University
messina [dot] c [at] husky [dot] neu [dot] edu

This notebook takes data collected from Archive of Our Own, a popular fanfiction repository, and sets it up to be analyzed. The data was collected using this AO3 Python scraper. The corpus consists of The Legend of Korra and Game of Thrones fanfics, from the first one published on AO3 through 2019. Specifically, I am preparing the data to be analyzed using computational temporal analysis methods, which focus on trends over time. Read more about this method in my article "Tracing Fan Uptakes: Tagging, Language, and Ideological Practices in The Legend of Korra Fanfiction," to be published in The Journal of Writing Analytics. The code for this article is published on my GitHub.

This notebook:

  • creates a 'month' column based on published dates of the fanfics
  • saves new datasets, reads them in, and merges them
  • preps the data by selecting only the columns of interest and replacing empty values with empty strings
  • groups each corpus by 'months'
  • saves the new corpora

This notebook is part of the Critical Fan Toolkit, Cara Marta Messina's public + digital dissertation

In [1]:
#pandas for working with dataframes
import pandas as pd

#regular expression library
import re

#numpy for numerical operations (used here for handling NaN values)
import numpy as np

#matplotlib creates visualizations
import matplotlib.pyplot as plt
%matplotlib inline

#nltk libraries
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import *
import string

#for making a string of elements separated by commas into a list
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktLanguageVars

#has the nice counter feature for counting tags
import collections
from collections import Counter

import warnings
warnings.filterwarnings("ignore")

Reading in & Prepping the Data

I have to read in multiple CSVs of the same dataset (the Game of Thrones fanfics published on AO3) because the original CSV was almost 2GB and my Python kernels kept crashing.
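
An alternative to splitting the file by hand is to stream it in chunks with pandas; this is only a rough sketch, assuming a hypothetical combined file and keeping just a few metadata columns to stay within memory:

#read the large CSV in 10,000-row pieces and keep only a few columns per piece
#(the path, chunk size, and column choices here are placeholders)
pieces = []
for chunk in pd.read_csv('./data/got_all.csv', chunksize=10000):
    pieces.append(chunk[['work_id', 'published', 'additional tags']])
got_small = pd.concat(pieces, ignore_index=True)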

First, I load in my CSVs. Then a function replaces any empty body cells with empty strings. It also appends a space to the end of each body string so that, when the bodies are merged together in a groupby, adjacent texts do not run into each other.

Second, I have created a function that uses a regular expression to pull the published date from one column and write a shortened version into a new column. The published date is structured as 0000-00-00 (YEAR-MONTH-DAY). I only want to keep the year and month, so the new column holds 0000-00 (YEAR-MONTH).

Finally, I'm going to save my new CSVs!

In [2]:
korra = pd.read_csv('./data/allkorra.csv')
In [4]:
got = pd.read_csv('./data/got_data_original/got0.csv')
got1 = pd.read_csv('./data/got_data_original/got1.csv')
got2 = pd.read_csv('./data/got_data_original/got2.csv')
got3 = pd.read_csv('./data/got_data_original/got3.csv')
In [5]:
merged_got = pd.concat([got, got1, got2, got3])
merged_got.count()
Out[5]:
Unnamed: 0         29897
work_id            29897
title              29897
rating             29896
category           28472
fandom             29897
relationship       27733
character          28913
additional tags    27118
language           29897
published          29897
status             29897
status date        29897
words              29892
chapters           29897
comments           26793
kudos              29670
bookmarks          26948
hits               29293
body               29897
month              29897
dtype: int64

Replacing Empty Values

I only need to do this with the GoT fanfics because I've already done it with TLoK fanfics.

This prevents errors later on when empty (NaN) values are concatenated or grouped.

In [6]:
def prepare_columns(df):
    '''
    Description: Takes a dataframe collected from AO3 (so the column names are the same)
    and prepares it to be grouped by:
    -lowercasing all the text
    -adding a trailing space after the body column
    -adding a comma after particular columns
    -replacing NaN values with empty strings
    
    Input: A dataframe from AO3 with the same column names
    Output: A similar dataframe, except the text is lower-cased and there are commas or spaces after particular columns
    '''

    #copying the dataframe in case I accidentally run this cell multiple times; otherwise there could be multiple commas.
    newDF = df.copy()

    # make all the text lowercased. The "applymap" function applies a function to each element in a df.
    newDF = newDF[['work_id','title','published', 'rating', 'character','relationship','additional tags','category','body']]
    newDF = newDF.applymap(lambda s:s.lower() if type(s) == str else s)

    #adding a space after each "body" value so texts stay separated when concatenated
    newDF['body'] = (newDF['body'] + ' ')
    newDF['body'] = newDF['body'].replace(np.nan,'',regex=True)
    newDF = newDF.dropna(how='all')

    #adding commas after the ratings, relationship, and additional tags columns to make sure they are separated properly
    newDF['rating'] = (newDF['rating'] + ', ')
    newDF['rating'] = newDF['rating'].replace(np.nan,'',regex=True)
    newDF['title'] = (newDF['title'] + ', ')
    newDF['title'] = newDF['title'].replace(np.nan,'',regex=True)
    newDF['character'] = (newDF['character'] + ', ')
    newDF['character'] = newDF['character'].replace(np.nan,'',regex=True)
    newDF['relationship'] = (newDF['relationship'] + ', ')
    newDF['relationship'] = newDF['relationship'].replace(np.nan,'',regex=True)
    newDF['additional tags'] = (newDF['additional tags'] + ', ')
    newDF['additional tags'] = newDF['additional tags'].replace(np.nan,'',regex=True)
    newDF['category'] = (newDF['category'] + ', ')
    newDF['category'] = newDF['category'].replace(np.nan,'',regex=True)


    return newDF
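
For what it's worth, the repeated pairs of lines above could be condensed; this is a minimal sketch of an equivalent loop over the same six columns (appending ', ' leaves NaN values as NaN, and fillna('') then turns those into empty strings, matching the behavior above):

    #compact equivalent of the comma-appending block inside prepare_columns
    for col in ['title', 'rating', 'character', 'relationship', 'additional tags', 'category']:
        newDF[col] = (newDF[col] + ', ').fillna('')
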
In [7]:
tlok_prepped = prepare_columns(korra)
tlok_prepped[:1]
Out[7]:
work_id title published rating character relationship additional tags category body
0 6388009 a more perfect union, 2016-03-28 general audiences, noatak (avatar), tarrlok (avatar), amon (avata... alternate universe, gen, he's forgotten how to be warm. the thought wou...
In [8]:
got_prepped = prepare_columns(got)
got1_prepped = prepare_columns(got1)
got2_prepped = prepare_columns(got2)
got3_prepped = prepare_columns(got3)
In [9]:
got_prepped[:2]
Out[9]:
work_id title published rating character relationship additional tags category body
0 19289563 game of thrones, 2019-06-20 explicit, arya stark, bella, gendry baratheon - characte... gendry baratheon/arya stark, gendry baratheon/... multi, authors note: this is really short but i promi...
1 17179712 game of thrones, 2018-12-27 teen and up audiences, jon snow | aegon targaryen, arya stark, sansa ... jon snow/daenerys targaryen, arya stark/gendry... armies and allies, war, romance, eventual happ... f/m, arya's chambers still felt different.\n\n \n\n...

Adding Month Column

Again, I've already done this with the TLoK fanfics, so I only need to do it for the GoT ones.

In [10]:
def add_month_column(df, newcolumn, originalcolumn):
    '''
    Description: Takes a column that uses the style 0000-00-00 (YEAR-MONTH-DAY) and adds a new column with 0000-00 (YEAR-MONTH)
    
    Input: dataframe with a date column structured like 0000-00-00
    Output: the same dataframe with a new column holding just the year and month (0000-00)
    '''
    #using a regular expression to keep only the year, hyphen, and month in the new "month" column
    df[newcolumn] = df[originalcolumn].replace(r'(\d{4})(\-)(\d{2})(\-)(\d{2})', r'\g<1>\g<2>\g<3>', regex=True)

    return df
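
The regular expression above works because the dates are zero-padded and fixed-width, which also means the same year-month value could be produced more simply; either line in this rough sketch would be an alternative to the replace call inside add_month_column:

#option 1: slice the first seven characters ('0000-00') of the published string
df['month'] = df['published'].str[:7]

#option 2: parse the dates and format them as year-month periods
df['month'] = pd.to_datetime(df['published']).dt.to_period('M').astype(str)
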
In [11]:
tlok_new = add_month_column(tlok_prepped,'month','published')
tlok_new.head(2)
Out[11]:
work_id title published rating character relationship additional tags category body month
0 6388009 a more perfect union, 2016-03-28 general audiences, noatak (avatar), tarrlok (avatar), amon (avata... alternate universe, gen, he's forgotten how to be warm. the thought wou... 2016-03
1 13974048 let go, 2018-03-14 teen and up audiences, korra (avatar), lin beifong, lin beifong/korra, just a quick one-shot i never posted properly,... f/f, "korra." somewhere distant. someone holding h... 2018-03
In [12]:
got_new = add_month_column(got_prepped, 'month','published')
got1_new = add_month_column(got1_prepped, 'month','published')
got2_new = add_month_column(got2_prepped, 'month','published')
got3_new = add_month_column(got3_prepped, 'month','published')
In [13]:
#checking that the month column has been added
got_new[:2]
Out[13]:
work_id title published rating character relationship additional tags category body month
0 19289563 game of thrones, 2019-06-20 explicit, arya stark, bella, gendry baratheon - characte... gendry baratheon/arya stark, gendry baratheon/... multi, authors note: this is really short but i promi... 2019-06
1 17179712 game of thrones, 2018-12-27 teen and up audiences, jon snow | aegon targaryen, arya stark, sansa ... jon snow/daenerys targaryen, arya stark/gendry... armies and allies, war, romance, eventual happ... f/m, arya's chambers still felt different.\n\n \n\n... 2018-12

Saving!

Let's save the new dataframes as CSVs so I can use them later. After running this cell once, I comment it out so I don't accidentally overwrite the files. I am saving them with the same file names as the original CSVs, just in a different folder. This is not a great practice, because ideally you want to keep every stage of your data in case something happens. However, I already have the raw data backed up externally, so I would prefer not to save the same data over and over on my hard drive.

In [ ]:
got_new.to_csv(r'./data/got_data_clean/got0.csv')
got1_new.to_csv(r'./data/got_data_clean/got1.csv')
got2_new.to_csv(r'./data/got_data_clean/got2.csv')
got3_new.to_csv(r'./data/got_data_clean/got3.csv')

Merging Dataframes

While it may seem counterintuitive to read in a bunch of split CSVs and then merge them again, the one large CSV kept crashing my Python kernel. This means I will probably need to keep the CSVs separate on disk and merge them each time in my notebook. So far, this approach has worked.

In [ ]:
#loading these in so I no longer have to run all the code above to access these
got0New = pd.read_csv('./data/got_data_clean/got0.csv')
got1New = pd.read_csv('./data/got_data_clean/got1.csv')
got2New = pd.read_csv('./data/got_data_clean/got2.csv')
got3New = pd.read_csv('./data/got_data_clean/got3.csv')

merged_got1 = pd.concat([got0New, got1New, got2New, got3New])
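
One thing to note about pd.concat here: each split file carries its own 0-based index, so the merged dataframe ends up with duplicate index labels. That does not affect anything below, but passing ignore_index=True would rebuild a single continuous index if it ever matters:

merged_got1 = pd.concat([got0New, got1New, got2New, got3New], ignore_index=True)
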
In [ ]:
got_metadata = merged_got1.drop(['body'], axis=1)
got_metadata.head(1)
In [ ]:
got_textual = merged_got1[['work_id','body']]
got_textual.head(4)
In [ ]:
#saving the metadata files and body of text files

got_textual.to_csv(r'./data/got_data_clean/got_body.csv')
got_metadata.to_csv(r'./data/got_data_clean/got_metadata.csv')

Visualizing Month: Checking

I wanted to quickly check which months had the most fanfics published (and make a quick bar chart). This graph function can be reused for a lot of datasets and is fairly simple.

In [14]:
def viz_months(df, column, name_of_show):
    '''
    Description: This function takes a dataframe, counts the 10 most frequent values in a column, and then visualizes them. I am specifically using this function for visualizing the published dates of fanfics, but the labels below can be changed.
    
    Input: the dataframe, the column whose highest values you want to count, and the name of the show
    Output: the 10 months with the most published fanfics in that set & a cute bar chart
    '''
    monthcount = df[column].value_counts().head(10)
    print(monthcount)

    #plot the counts and label the chart
    monthCountGraph = monthcount.plot.bar()
    plt.title('Highest Months of Published Fanfics of '+name_of_show)
    plt.xlabel('Month and Year')
    plt.ylabel('Number of Fanfics')
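
If the charts need to be kept rather than just displayed inline, matplotlib can also write them to disk; a one-line sketch that could be added as the last line of viz_months (the file path is hypothetical):

    plt.savefig('./figures/months_published.png', bbox_inches='tight')
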
In [15]:
viz_months(merged_got, 'month', 'Game of Thrones')
2019-05    2498
2019-06    1261
2019-04     971
2019-08     915
2019-07     852
2017-08     834
2017-09     803
2017-10     616
2016-07     493
2018-03     473
Name: month, dtype: int64
In [16]:
viz_months(korra, 'month', 'The Legend of Korra')
2015-01    400
2014-12    359
2015-02    311
2015-03    236
2014-10    225
2015-06    221
2015-08    201
2015-04    201
2012-06    197
2015-05    197
Name: month, dtype: int64

Grouping By Month

Next, I will demonstrate how I prepared the data for computational temporal analysis, or tracing trends over time. I use two corpora: The Legend of Korra and Game of Thrones fanfiction published on Archive of Our Own. This notebook works with four columns of textual data: the "relationship" tags column, the "additional tags" column, the "category" column, which provides the gender pairing of particular relationships (such as M/M, F/F, M/F, etc.), and the "body" column, which contains the full text of each fanfic.

Since I do not need to load in the data again, I will just show the beginning of the data files. Next, I will need to define my functions.

In [19]:
def prepare_columns_for_groupby(df):
    '''
    Takes a dataframe already run through prepare_columns (so the separating commas are in place) and prepares it to be grouped by: lowercasing all the text, replacing NaN values with empty strings, and converting the published dates into datetime objects.
    
    Input: A dataframe from AO3 with the same column names
    Output: A similar dataframe with lower-cased text, no NaN values, and readable published dates
    '''

    #copying the dataframe in case I accidentally run this cell multiple times
    newDF = df.copy()

    # make all the text lowercased. The "applymap" function applies a function to each element in a df.
    newDF = newDF.applymap(lambda s:s.lower() if type(s) == str else s)
    newDF = newDF[['published','rating','relationship','additional tags','character','category','month','body']]

    #replacing NaN values with empty strings so the groupby sums do not fail on missing text
    for col in ['body','rating','relationship','character','additional tags','category']:
        newDF[col] = newDF[col].replace(np.nan,'',regex=True)
    newDF = newDF.dropna(how='all')

    #make published dates into readable dates
    newDF['published'] = pd.to_datetime(newDF['published'])


    return newDF
In [20]:
def group_by(df):
    '''
    This function groups a dataframe by the 'month' column and aggregates the textual columns for each month.
    
    Input: a pandas dataframe with a 'month' column
    
    Output: a new dataframe with the month as the index, the aggregated text columns, and a count of fanfics per month
    '''
    #first, group the dataframe by month and count. This shows how many rows (fanfics) there are for each month.
    month_count = df.groupby('month').count()

    #in the new dataframe, use a column that has been counted and rename it 'count'
    month_count['count'] = month_count['relationship']
    month_count_new = month_count['count']

    #create another new dataframe that aggregates all the text columns by month
    new_group = df.groupby('month').agg({'rating':'sum','additional tags':'sum','category':'sum','character':'sum','relationship':'sum','body':'sum'})

    #join both dataframes to include the count & sort ascending so the earliest months are on top
    join_df = pd.concat([new_group, month_count_new], axis=1)
    join_df = join_df.sort_index(ascending=True)

    return join_df

The Legend of Korra Group By Month

Before I group the dataframe, I want to first prepare my data.

The steps to do this are:

  • make sure a comma or space follows particular values so that they stay separated properly when merged
  • make sure all the text is lower-cased to avoid capitalization inconsistencies
In [21]:
allkorra_prepped = prepare_columns_for_groupby(tlok_new)
allkorra_prepped.head(2)
Out[21]:
published rating relationship additional tags character category month body
0 2016-03-28 general audiences, alternate universe, noatak (avatar), tarrlok (avatar), amon (avata... gen, 2016-03 he's forgotten how to be warm. the thought wou...
1 2018-03-14 teen and up audiences, lin beifong/korra, just a quick one-shot i never posted properly,... korra (avatar), lin beifong, f/f, 2018-03 "korra." somewhere distant. someone holding h...
In [22]:
allkorra_month = group_by(allkorra_prepped)
allkorra_month.head(5)
Out[22]:
rating additional tags category character relationship body count
month
2011-02 not rated, original characters - freeform, multi, zuko (avatar), mai (avatar), sokka (avatar), s... mai/zuko (avatar), sokka/suki (avatar), aang (... when kato listens to his father's war stories,... 1
2011-04 general audiences, family, angst, one shot, gen, tenzin (avatar), aang (avatar), katara (avatar... aang (avatar)/katara, his father shows tenzin where the flowers grow... 1
2011-05 teen and up audiences, completely au, written pre-canon, rated for la... gen, korra (avatar), toph beifong, \n \nthe earthbender's answer was not what s... 1
2011-07 general audiences, teen and up audiences, humor, family, sweet, drama, dysfunctional fam... gen, gen, korra (avatar), tenzin (avatar), jinora, ikki,... "i'm here for my lesson," korra shouted, leadi... 2
2011-08 teen and up audiences, teen and up audiences, ... speculation, drama, action, character study, f... gen, f/m, gen, f/f, f/f, korra (avatar), mako (avatar), bolin (avatar),... korra/mako (avatar), azula (avatar)/toph bei f... nothing had ever been given to her. korra had ... 4
In [ ]:
#save the new dataframe to be used later (commenting out so I don't resave)

allkorra_month.to_csv(r'./data/group_month/allkorra_months.csv')
In [23]:
#preKorrasami
tlok01 = allkorra_month.loc['2011-02':'2014-05']

#subtextual Korrasami
tlok02 = allkorra_month.loc['2014-06':'2014-11']

#postKorrasami
tlok03 = allkorra_month.loc['2014-12':'2015-07']

Game of Thrones Group By Month

Before I group the dataframe, I want to first prepare my data.

The steps to do this are:

  • make sure a comma or space follows particular values so that they stay separated properly when merged
  • make sure all the text is lower-cased to avoid capitalization inconsistencies

I am saving the grouped dataframe as one .csv, and then I will use a CSV splitter created by Jordi Rivero to break it into smaller files. This way, I can load the split CSVs without killing my kernels.
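
For reference, a rough sketch of how the same split could also be done with pandas and numpy alone, using the allgot_months dataframe created below (the number of pieces and the file name pattern are hypothetical, not the splitter mentioned above):

#split the grouped dataframe into roughly equal row-wise pieces and save each one
for i, piece in enumerate(np.array_split(allgot_months, 4)):
    piece.to_csv(r'./data/group_month/allgot_months_part{}.csv'.format(i))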

In [24]:
allgot_prepped = prepare_columns_for_groupby(got_new)
allgot_prepped.head(2)
Out[24]:
published rating relationship additional tags character category month body
0 2019-06-20 explicit, gendry baratheon/arya stark, gendry baratheon/... arya stark, bella, gendry baratheon - characte... multi, 2019-06 authors note: this is really short but i promi...
1 2018-12-27 teen and up audiences, jon snow/daenerys targaryen, arya stark/gendry... armies and allies, war, romance, eventual happ... jon snow | aegon targaryen, arya stark, sansa ... f/m, 2018-12 arya's chambers still felt different.\n\n \n\n...
In [25]:
allgot_months = group_by(allgot_prepped)
In [26]:
#making individual dataframes for each season 

#Season 1: Beginning of data to March 2012 (Season 2 airs April 1st, 2012)
gots1 = allgot_months.loc['2007-02':'2012-03']

#Season 2: April 2012-March 2013 (season 3 airs March 31, 2013)
gots2 = allgot_months.loc['2012-04':'2013-03']

#Season 3: April 2013-March 2014 (season 4 airs April 6, 2014)
gots3 = allgot_months.loc['2013-04':'2014-03']

#Season 4: April 2014 to March 2015 (season 5 airs April 12, 2015)
gots4 = allgot_months.loc['2014-04':'2015-03']

#Season 5: April 2015-March 2016 (season 6 airs April 24, 2016)
gots5 = allgot_months.loc['2015-04':'2016-03']

#Season 6: April 2016-June 2017 (season 7 airs July 16, 2017)
gots6 = allgot_months.loc['2016-04':'2017-06']

#Season 7: July 2017-March 2019 (season 8 airs April 14, 2019)
gots7 = allgot_months.loc['2017-07':'2019-03']

#Season 8: April 2019-end of data
gots8 = allgot_months.loc['2019-04':'2019-09']

# allgot_months.to_csv(r'./data/group_month/allgot_months.csv')
In [27]:
gots8
Out[27]:
rating additional tags category character relationship body count
month
2019-04 teen and up audiences, not rated, general audi... alternate universe - fantasy, game of thrones ... m/m, f/m, f/m, multi, other, f/f, f/m, f/m, f/... midoriya izuku, todoroki shouto, bakugou katsu... midoriya izuku/todoroki shouto, bakugou katsuk... izuku midoriya is a green, curly-haired advent... 247
2019-05 mature, teen and up audiences, not rated, expl... game of thrones alternate universe, game of th... f/f, f/m, m/m, f/m, f/f, f/m, m/m, f/f, multi,... all characters from got (season 1-8), jayden r... jon snow/daenerys targaryen, gilly (asoiaf)/sa... \n\nprologue\n\n\n\n \n\n \n\n\nharry was figh... 726
2019-06 explicit, mature, not rated, explicit, not rat... a better rewrite, no dragon death, dialogue he... multi, multi, f/m, gen, multi, other, f/m, mul... arya stark, bella, gendry baratheon - characte... gendry baratheon/arya stark, gendry baratheon/... authors note: this is really short but i promi... 274
2019-07 mature, mature, general audiences, teen and up... mpreg, grahpic birth, male lactation, secret p... f/m, gen, m/m, m/m, f/m, gen, f/f, f/m, gen, m... robert baratheon, cersei lannister, joffrey ba... robert baratheon/cersei lannister (one sided),... game of thrones daenerys lannister\n\nby 4quie... 169
2019-08 teen and up audiences, not rated, teen and up ... show rewrite, alternate universe, war of five ... f/m, f/m, multi, other, f/f, f/m, m/m, gen, f/... sandor clegane, jon snow, klaus mikaelson, reb... ramsay bolton/ kol mikaelson, rebekah mikaelso... annie's pov:\n\n"ugh." i groaned as i sat up f... 166
2019-09 general audiences, mature, mature, teen and up... established relationship, game of thrones refe... m/m, f/m, other, gen, m/m, f/m, multi, f/m, f/... sidney crosby, evgeni malkin, jon snow | aegon... sidney crosby/evgeni malkin, jon snow/daenerys... the sound of his phone ringing pulled zhenya f... 103
In [ ]:
# save data

gots1.to_csv(r'./data/group_month/got_s1.csv')
gots2.to_csv(r'./data/group_month/got_s2.csv')
gots3.to_csv(r'./data/group_month/got_s3.csv')
gots4.to_csv(r'./data/group_month/got_s4.csv')
gots5.to_csv(r'./data/group_month/got_s5.csv')
gots6.to_csv(r'./data/group_month/got_s6.csv')
gots7.to_csv(r'./data/group_month/got_s7.csv')
gots8.to_csv(r'./data/group_month/got_s8.csv')

Metadata Count

For this next step, I want to create a few different dataframes that count the metadata values in each corpus:

TLoK

  • relationships
  • relationship categories
  • character count
  • additional tags
  • rating

GoT

  • relationships
  • relationship categories
  • character count
  • additional tags
  • rating
In [28]:
def column_to_list(df, columnName):
    '''
    this function takes all the values from a specific column, replaces any NaN values with empty strings, and joins everything into one long string.
    input: the name of the dataframe and the column name
    output: a single string of all the column's values, which gets tokenized on commas in the next function
    '''
    df[columnName] = df[columnName].replace(np.nan,'',regex=True)
    joined = ' '.join(df[columnName].tolist())
    return joined
In [29]:
def MetadataToDF(df, columnName, NewSeasonColumnName):
    '''
    input: the dataframe to work with, the column to count, and the name of the count column in the new dataframe
    output: a new dataframe with each metadata value and its count in the newly named column
    '''

    #replace empty values & join all the column's values into one long string
    joined = column_to_list(df, columnName)

    #a tokenizer that splits on commas instead of sentence-ending punctuation
    class CommaPoint(PunktLanguageVars):
        sent_end_chars = (',')
    tokenizer = PunktSentenceTokenizer(lang_vars = CommaPoint())

    #tokenizing the string on the COMMA, not on white space (as set in CommaPoint above)
    ListOfTags = tokenizer.tokenize(joined)

    #the "Counter" function is from the collections library
    allCounter = collections.Counter(ListOfTags)
    most = allCounter.most_common()

    newDF = pd.DataFrame(most, columns=[columnName, NewSeasonColumnName])

    return newDF
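
Since every tag in these columns already ends in a comma, roughly the same counts can be produced without the Punkt tokenizer; this is just a sketch of that simpler alternative (the function name is mine, and unlike MetadataToDF it strips the trailing comma from each value):

#count tag values by splitting the joined string on commas
def count_tags_simple(df, columnName, countColumnName):
    joined = column_to_list(df, columnName)
    tags = [t.strip() for t in joined.split(',') if t.strip()]
    return pd.DataFrame(Counter(tags).most_common(), columns=[columnName, countColumnName])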

TLoK Metadata

I will be creating dataframes that count all the metadata categories across the three different periods from TLoK: the preKorrasami period, the subtextual Korrasami period, and the postKorrasami period.

Character

In [30]:
tlokALLcharacter = MetadataToDF(allkorra_month, 'character','allCount')
tlok01character = MetadataToDF(tlok01, 'character','01Count')
tlok02character = MetadataToDF(tlok02, 'character','02Count')
tlok03character = MetadataToDF(tlok03, 'character','03Count')
In [31]:
tlokALLcharacter[:5]
Out[31]:
character allCount
0 korra (avatar), 4570
1 asami sato, 3899
2 mako (avatar), 2024
3 bolin (avatar), 1842
4 tenzin (avatar), 940
In [32]:
tlokCharMerge01 = pd.merge(tlok01character, tlok02character, on='character', how='outer')
tlokCharMerge02 = pd.merge(tlokCharMerge01, tlok03character, on='character', how='outer')
tlokCharMerge = pd.merge(tlokCharMerge02, tlokALLcharacter, on='character', how='outer')
tlokCharMerge[:5]
Out[32]:
character 01Count 02Count 03Count allCount
0 korra (avatar), 842.0 434.0 1394.0 4570
1 mako (avatar), 509.0 198.0 556.0 2024
2 bolin (avatar), 455.0 146.0 492.0 1842
3 asami sato, 396.0 346.0 1302.0 3899
4 lin bei fong, 243.0 86.0 97.0 426

Relationships

In [33]:
tlokALLrel = MetadataToDF(allkorra_month, 'relationship','allCount')
tlok01rel = MetadataToDF(tlok01, 'relationship','01Count')
tlok02rel = MetadataToDF(tlok02, 'relationship','02Count')
tlok03rel = MetadataToDF(tlok03, 'relationship','03Count')
In [34]:
tlokALLrel[:5]
Out[34]:
relationship allCount
0 korra/asami sato, 3432
1 korra/mako (avatar), 572
2 bolin/opal (avatar), 364
3 lin beifong/kya ii, 168
4 korrasami, 150
In [36]:
tlokRelMerge01 = pd.merge(tlok01rel, tlok02rel, on='relationship', how='outer')
tlokRelMerge02 = pd.merge(tlokRelMerge01, tlok03rel, on='relationship', how='outer')
tlokRelMerge = pd.merge(tlokRelMerge02, tlokALLrel, on='relationship', how='outer')
tlokRelMerge[:5]
Out[36]:
relationship 01Count 02Count 03Count allCount
0 korra/mako (avatar), 305.0 63.0 96.0 572
1 bolin/korra (avatar), 91.0 8.0 9.0 127
2 korra/asami sato, 90.0 281.0 1250.0 3432
3 amon/lieutenant (avatar), 68.0 6.0 10.0 90
4 amon/korra (avatar), 48.0 17.0 6.0 80

Categories

In [37]:
tlokALLcat = MetadataToDF(allkorra_month, 'category','allCount')
tlok01cat = MetadataToDF(tlok01, 'category','01Count')
tlok02cat = MetadataToDF(tlok02, 'category','02Count')
tlok03cat = MetadataToDF(tlok03, 'category','03Count')
In [38]:
tlokALLcat
Out[38]:
category allCount
0 f/f, 3896
1 f/m, 2154
2 gen, 1612
3 m/m, 590
4 multi, 354
5 other, 146
In [39]:
tlokCatMerge01 = pd.merge(tlok01cat, tlok02cat, on='category', how='outer')
tlokCatMerge02 = pd.merge(tlokCatMerge01, tlok03cat, on='category', how='outer')
tlokCatMerge = pd.merge(tlokCatMerge02, tlokALLcat, on='category', how='outer')
tlokCatMerge[:5]
Out[39]:
category 01Count 02Count 03Count allCount
0 f/m, 820 180 490 2154
1 gen, 617 170 346 1612
2 m/m, 243 50 120 590
3 f/f, 200 326 1351 3896
4 multi, 91 20 83 354

Additional Tags

In [40]:
tlokALLtags = MetadataToDF(allkorra_month, 'additional tags','allCount')
tlok01tags = MetadataToDF(tlok01, 'additional tags','01Count')
tlok02tags = MetadataToDF(tlok02, 'additional tags','02Count')
tlok03tags = MetadataToDF(tlok03, 'additional tags','03Count')
In [41]:
tlokALLtags[:5]
Out[41]:
additional tags allCount
0 fluff, 847
1 romance, 667
2 angst, 512
3 alternate universe - modern setting, 438
4 korrasami - freeform, 341
In [42]:
tlokTagsMerge01 = pd.merge(tlok01tags, tlok02tags, on='additional tags', how='outer')
tlokTagsMerge02 = pd.merge(tlokTagsMerge01, tlok03tags, on='additional tags', how='outer')
tlokTagsMerge = pd.merge(tlokTagsMerge02, tlokALLtags, on='additional tags', how='outer')
tlokTagsMerge[:5]
Out[42]:
additional tags 01Count 02Count 03Count allCount
0 romance, 145.0 54.0 213.0 667
1 angst, 109.0 43.0 145.0 512
2 friendship, 92.0 24.0 74.0 263
3 family, 72.0 9.0 47.0 204
4 fluff, 67.0 63.0 229.0 847

Ratings

In [43]:
tlokALLrating = MetadataToDF(allkorra_month, 'rating','allCount')
tlok01rating = MetadataToDF(tlok01, 'rating','01Count')
tlok02rating = MetadataToDF(tlok02, 'rating','02Count')
tlok03rating = MetadataToDF(tlok03, 'rating','03Count')
In [44]:
tlokALLrating
Out[44]:
rating allCount
0 general audiences, 2703
1 teen and up audiences, 2455
2 mature, 1047
3 explicit, 730
4 not rated, 551
In [45]:
tlokRatingMerge01 = pd.merge(tlok01rating, tlok02rating, on='rating', how='outer')
tlokRatingMerge02 = pd.merge(tlokRatingMerge01, tlok03rating, on='rating', how='outer')
tlokRatingMerge = pd.merge(tlokRatingMerge02, tlokALLrating, on='rating', how='outer')
tlokRatingMerge[:5]
Out[45]:
rating 01Count 02Count 03Count allCount
0 general audiences, 672 281 762 2703
1 teen and up audiences, 568 258 723 2455
2 mature, 185 78 323 1047
3 explicit, 182 56 151 730
4 not rated, 126 47 117 551

Saving CSVs

In [46]:
tlokCharMerge.to_csv(r'./data/metadata/TLoK/tlok_metadata_character.csv')
tlokRelMerge.to_csv(r'./data/metadata/TLoK/tlok_metadata_relationship.csv')
tlokCatMerge.to_csv(r'./data/metadata/TLoK/tlok_metadata_categories.csv')
tlokTagsMerge.to_csv(r'./data/metadata/TLoK/tlok_metadata_tags.csv')
tlokRatingMerge.to_csv(r'./data/metadata/TLoK/tlok_metadata_rating.csv')

GoT Metadata

As I did with TLoK, I will be creating dataframes that count all the metadata categories across different Game of Thrones periods. Here I use the show's seasons as date markers, so the periods are seasons 1-8 (as grouped above).

Character

In [47]:
gots0character = MetadataToDF(allgot_months, 'character','AllCount')
gots1character = MetadataToDF(gots1, 'character','S1Count')
gots2character = MetadataToDF(gots2, 'character','S2Count')
gots3character = MetadataToDF(gots3, 'character','S3Count')
gots4character = MetadataToDF(gots4, 'character','S4Count')
gots5character = MetadataToDF(gots5, 'character','S5Count')
gots6character = MetadataToDF(gots6, 'character','S6Count')
gots7character = MetadataToDF(gots7, 'character','S7Count')
gots8character = MetadataToDF(gots8, 'character','S8Count')
In [48]:
gots0character[:5]
Out[48]:
character AllCount
0 sansa stark, 2552
1 jon snow, 1904
2 arya stark, 1412
3 jaime lannister, 1337
4 daenerys targaryen, 1160
In [49]:
gotMerge1Chara = pd.merge(gots1character, gots2character, on='character', how='outer')
gotMerge2Chara = pd.merge(gotMerge1Chara, gots3character, on='character', how='outer')
gotMerge3Chara = pd.merge(gotMerge2Chara, gots4character, on='character', how='outer')
gotMerge4Chara = pd.merge(gotMerge3Chara, gots5character, on='character', how='outer')
gotMerge5Chara = pd.merge(gotMerge4Chara, gots6character, on='character', how='outer')
gotMerge6Chara = pd.merge(gotMerge5Chara, gots7character, on='character', how='outer')
gotMerge7Chara = pd.merge(gotMerge6Chara, gots8character, on='character', how='outer')
gotMergeChara = pd.merge(gotMerge7Chara, gots0character, on='character', how='outer')
gotMergeChara[:5]
Out[49]:
character S1Count S2Count S3Count S4Count S5Count S6Count S7Count S8Count AllCount
0 sansa stark, 42.0 94.0 217.0 223.0 232.0 445.0 727.0 572.0 2552
1 jon snow, 39.0 77.0 89.0 94.0 107.0 346.0 703.0 449.0 1904
2 robb stark, 33.0 64.0 75.0 64.0 55.0 102.0 188.0 92.0 673
3 jaime lannister, 32.0 53.0 107.0 96.0 105.0 187.0 344.0 413.0 1337
4 cersei lannister, 29.0 35.0 66.0 67.0 58.0 96.0 201.0 184.0 736

Relationship

In [50]:
gots0rel = MetadataToDF(allgot_months, 'relationship','AllCount')
gots1rel = MetadataToDF(gots1, 'relationship','S1Count')
gots2rel = MetadataToDF(gots2, 'relationship','S2Count')
gots3rel = MetadataToDF(gots3, 'relationship','S3Count')
gots4rel = MetadataToDF(gots4, 'relationship','S4Count')
gots5rel = MetadataToDF(gots5, 'relationship','S5Count')
gots6rel = MetadataToDF(gots6, 'relationship','S6Count')
gots7rel = MetadataToDF(gots7, 'relationship','S7Count')
gots8rel = MetadataToDF(gots8, 'relationship','S8Count')
In [51]:
gots0rel[:5]
Out[51]:
relationship AllCount
0 jaime lannister/brienne of tarth, 923
1 jon snow/sansa stark, 526
2 arya stark/gendry waters, 518
3 jon snow/daenerys targaryen, 511
4 sandor clegane/sansa stark, 379
In [53]:
gotMerge1rel = pd.merge(gots1rel, gots2rel, on='relationship', how='outer')
gotMerge2rel = pd.merge(gotMerge1rel, gots3rel, on='relationship', how='outer')
gotMerge3rel = pd.merge(gotMerge2rel, gots4rel, on='relationship', how='outer')
gotMerge4rel = pd.merge(gotMerge3rel, gots5rel, on='relationship', how='outer')
gotMerge5rel = pd.merge(gotMerge4rel, gots6rel, on='relationship', how='outer')
gotMerge6rel = pd.merge(gotMerge5rel, gots7rel, on='relationship', how='outer')
gotMerge7Rel = pd.merge(gotMerge6rel, gots8rel, on='relationship', how='outer')
gotMergeRel = pd.merge(gotMerge7Rel, gots0rel, on='relationship', how='outer')
gotMergeRel[:5]
Out[53]:
relationship S1Count S2Count S3Count S4Count S5Count S6Count S7Count S8Count AllCount
0 jaime lannister/brienne of tarth, 37.0 103.0 78.0 45.0 61.0 120.0 178.0 301.0 923
1 cersei lannister/jaime lannister, 18.0 16.0 29.0 22.0 16.0 21.0 53.0 51.0 226
2 jon snow/robb stark, 10.0 17.0 7.0 10.0 8.0 8.0 11.0 4.0 75
3 theon greyjoy/robb stark, 10.0 20.0 15.0 14.0 5.0 13.0 18.0 11.0 106
4 petyr baelish/sansa stark, 8.0 1.0 26.0 47.0 43.0 57.0 75.0 15.0 272

Category

In [54]:
gots0cat = MetadataToDF(allgot_months, 'category','AllCount')
gots1cat = MetadataToDF(gots1, 'category','S1Count')
gots2cat = MetadataToDF(gots2, 'category','S2Count')
gots3cat = MetadataToDF(gots3, 'category','S3Count')
gots4cat = MetadataToDF(gots4, 'category','S4Count')
gots5cat = MetadataToDF(gots5, 'category','S5Count')
gots6cat = MetadataToDF(gots6, 'category','S6Count')
gots7cat = MetadataToDF(gots7, 'category','S7Count')
gots8cat = MetadataToDF(gots8, 'category','S8Count')
In [55]:
gots0cat
Out[55]:
category AllCount
0 f/m, 4722
1 gen, 1326
2 m/m, 1171
3 f/f, 647
4 multi, 312
5 other, 189
6 f/m, 1
In [56]:
gotMerge1cat = pd.merge(gots1cat, gots2cat, on='category', how='outer')
gotMerge2cat = pd.merge(gotMerge1cat, gots3cat, on='category', how='outer')
gotMerge3cat = pd.merge(gotMerge2cat, gots4cat, on='category', how='outer')
gotMerge4cat = pd.merge(gotMerge3cat, gots5cat, on='category', how='outer')
gotMerge5cat = pd.merge(gotMerge4cat, gots6cat, on='category', how='outer')
gotMerge6cat = pd.merge(gotMerge5cat, gots7cat, on='category', how='outer')
gotMerge7Cat = pd.merge(gotMerge6cat, gots8cat, on='category', how='outer')
gotMergeCat = pd.merge(gotMerge7Cat, gots0cat, on='category', how='outer')
gotMergeCat[:5]
Out[56]:
category S1Count S2Count S3Count S4Count S5Count S6Count S7Count S8Count AllCount
0 f/m, 125.0 300.0 384.0 372.0 370.0 719.0 1352.0 1100.0 4722
1 gen, 48.0 107.0 183.0 167.0 119.0 187.0 297.0 218.0 1326
2 m/m, 36.0 112.0 112.0 169.0 138.0 167.0 265.0 172.0 1171
3 f/f, 8.0 17.0 54.0 84.0 62.0 98.0 194.0 130.0 647
4 multi, 3.0 6.0 20.0 35.0 34.0 40.0 106.0 68.0 312

Additional Tags

In [57]:
gots0tags = MetadataToDF(allgot_months, 'additional tags','AllCount')
gots1tags = MetadataToDF(gots1, 'additional tags','S1Count')
gots2tags = MetadataToDF(gots2, 'additional tags','S2Count')
gots3tags = MetadataToDF(gots3, 'additional tags','S3Count')
gots4tags = MetadataToDF(gots4, 'additional tags','S4Count')
gots5tags = MetadataToDF(gots5, 'additional tags','S5Count')
gots6tags = MetadataToDF(gots6, 'additional tags','S6Count')
gots7tags = MetadataToDF(gots7, 'additional tags','S7Count')
gots8tags = MetadataToDF(gots8, 'additional tags','S8Count')
In [58]:
gots0tags[:5]
Out[58]:
additional tags AllCount
0 fluff, 571
1 angst, 502
2 alternate universe - modern setting, 417
3 romance, 393
4 alternate universe, 270
In [59]:
gotMerge1tags = pd.merge(gots1tags, gots2tags, on='additional tags', how='outer')
gotMerge2tags = pd.merge(gotMerge1tags, gots3tags, on='additional tags', how='outer')
gotMerge3tags = pd.merge(gotMerge2tags, gots4tags, on='additional tags', how='outer')
gotMerge4tags = pd.merge(gotMerge3tags, gots5tags, on='additional tags', how='outer')
gotMerge5tags = pd.merge(gotMerge4tags, gots6tags, on='additional tags', how='outer')
gotMerge6tags = pd.merge(gotMerge5tags, gots7tags, on='additional tags', how='outer')
gotMerge7Tags = pd.merge(gotMerge6tags, gots8tags, on='additional tags', how='outer')
gotMergeTags = pd.merge(gotMerge7Tags, gots0tags, on='additional tags', how='outer')
gotMergeTags[:5]
Out[59]:
additional tags S1Count S2Count S3Count S4Count S5Count S6Count S7Count S8Count AllCount
0 no english, 32.0 80.0 8.0 1.0 NaN NaN NaN NaN 121
1 somali, 25.0 62.0 8.0 1.0 NaN NaN NaN NaN 96
2 alternate universe, 13.0 19.0 30.0 32.0 27.0 29.0 81.0 39.0 270
3 incest, 12.0 6.0 7.0 11.0 4.0 12.0 34.0 15.0 101
4 sibling incest, 11.0 13.0 26.0 8.0 3.0 10.0 24.0 17.0 112

Rating

In [60]:
gots0rating = MetadataToDF(allgot_months, 'rating','AllCount')
gots1rating = MetadataToDF(gots1, 'rating','S1Count')
gots2rating = MetadataToDF(gots2, 'rating','S2Count')
gots3rating = MetadataToDF(gots3, 'rating','S3Count')
gots4rating = MetadataToDF(gots4, 'rating','S4Count')
gots5rating = MetadataToDF(gots5, 'rating','S5Count')
gots6rating = MetadataToDF(gots6, 'rating','S6Count')
gots7rating = MetadataToDF(gots7, 'rating','S7Count')
gots8rating = MetadataToDF(gots8, 'rating','S8Count')
In [61]:
gots0rating
Out[61]:
rating AllCount
0 teen and up audiences, 2048
1 general audiences, 2021
2 mature, 1643
3 explicit, 1154
4 not rated, 1151
In [62]:
gotMerge1rat = pd.merge(gots1rating, gots2rating, on='rating', how='outer')
gotMerge2rat = pd.merge(gotMerge1rat, gots3rating, on='rating', how='outer')
gotMerge3rat = pd.merge(gotMerge2rat, gots4rating, on='rating', how='outer')
gotMerge4rat = pd.merge(gotMerge3rat, gots5rating, on='rating', how='outer')
gotMerge5rat = pd.merge(gotMerge4rat, gots6rating, on='rating', how='outer')
gotMerge6rat = pd.merge(gotMerge5rat, gots7rating, on='rating', how='outer')
gotMerge7Rat = pd.merge(gotMerge6rat, gots8rating, on='rating', how='outer')
gotMergeRat = pd.merge(gotMerge7Rat, gots0rating, on='rating', how='outer')
gotMergeRat[:5]
Out[62]:
rating S1Count S2Count S3Count S4Count S5Count S6Count S7Count S8Count AllCount
0 teen and up audiences, 65 113 204 197 176 317 541 435 2048
1 not rated, 48 132 97 112 79 174 262 247 1151
2 general audiences, 40 102 177 236 189 297 544 436 2021
3 mature, 36 103 137 144 154 253 460 356 1643
4 explicit, 27 79 85 114 128 193 317 211 1154
In [63]:
gotMergeChara.to_csv(r'./data/metadata/GoT/got_metadata_character.csv')
gotMergeRel.to_csv(r'./data/metadata/GoT/got_metadata_relationship.csv')
gotMergeCat.to_csv(r'./data/metadata/GoT/got_metadata_categories.csv')
gotMergeTags.to_csv(r'./data/metadata/GoT/got_metadata_tags.csv')
gotMergeRat.to_csv(r'./data/metadata/GoT/got_metadata_rating.csv')