Cara Marta Messina
Northeastern University
messina [dot] c [at] husky [dot] neu [dot] edu
This notebook takes data collected from Archive of Our Own, a popular fanfiction repository, and sets it up to be analyzed. The data was collected using this AO3 python scraper. The corpus consists of The Legend of Korra and Game of Thrones fanfics, from the first one published on AO3 through 2019. Specifically, I am preparing the data to be analyzed using computational temporal analysis methods, which focus on trends over time. Read more about this method in my article "Tracing Fan Uptakes: Tagging, Language, and Ideological Practices in The Legend of Korra Fanfiction," to be published in The Journal of Writing Analytics. The code for this article is published on my GitHub.
This notebook is part of the Critical Fan Toolkit, Cara Marta Messina's public + digital dissertation.
#pandas for working with dataframes
import pandas as pd
#regular expression library
import re
#numpy specifically works with numbers
import numpy as np
#matplot library creates visualizations
import matplotlib.pyplot as plt
%matplotlib inline
#nltk libraries
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import *
import string
#for making a string of elements separated by commas into a list
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktLanguageVars
#has the nice counter feature for counting tags
import collections
from collections import Counter
import warnings
warnings.filterwarnings("ignore")
I have to read in multiple CSVs of the same dataset (the Game of Thrones fanfics published on AO3) because the original CSV was almost 2GB and my Python kernels kept crashing.
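As a side note, pandas can also stream a large CSV in pieces with `read_csv`'s `chunksize` parameter, which is another way to avoid loading ~2GB at once. This is a minimal sketch with a small in-memory stand-in for the large file, not the loading code used below.

```python
import io
import pandas as pd

# a small in-memory "file" standing in for the large GoT CSV
csv_text = "work_id,body\n" + "\n".join(f"{i},text{i}" for i in range(10))

# read the file four rows at a time instead of all at once
chunks = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    chunks.append(chunk)

merged = pd.concat(chunks, ignore_index=True)
```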
First, I load in my CSVs. Then a function replaces any empty body values with empty strings and appends a space to the end of each body string, so that if I merge all the bodies together in a groupby, they stay separated.
Second, another function uses a regular expression to take the published date information from one column and put part of it into a new column. The published date is structured as 0000-00-00 (YEAR-MONTH-DAY); I only want to keep the year and month, so the new column holds 0000-00 (YEAR-MONTH).
Finally, I'm going to save my new CSVs!
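The year-month extraction described above can be demonstrated on a single toy date string: the pattern captures the YEAR, hyphen, and MONTH groups and drops the day.

```python
import re

# toy example: keep only the year-month part of a YEAR-MONTH-DAY date
published = '2014-12-05'
month = re.sub(r'(\d{4})(\-)(\d{2})(\-)(\d{2})', r'\g<1>\g<2>\g<3>', published)
# month is now '2014-12'
```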
korra = pd.read_csv('./data/allkorra.csv')
got = pd.read_csv('./data/got_data_original/got0.csv')
got1 = pd.read_csv('./data/got_data_original/got1.csv')
got2 = pd.read_csv('./data/got_data_original/got2.csv')
got3 = pd.read_csv('./data/got_data_original/got3.csv')
merged_got = pd.concat([got, got1, got2, got3])
merged_got.count()
I only need to do this with the GoT fanfics because I've already done it with TLoK fanfics.
Replacing NaN values with empty strings prevents future errors from potentially empty cells.
def prepare_columns(df):
    '''
    Description: Takes a dataframe collected from AO3 (so the column names are the same).
    Prepares it to be grouped by:
    -lowercasing all the letters
    -adding separators after particular columns
    -replacing NaN values with empty strings
    Input: A dataframe from AO3 with the same column names
    Output: A similar dataframe, except the words are lower-cased and there are commas or spaces after particular columns
    '''
    #copy the dataframe so accidentally running this cell multiple times doesn't add multiple commas to the original
    newDF = df[['work_id','title','published','rating','character','relationship','additional tags','category','body']].copy()
    #make all the text lowercased. The "applymap" function applies a function to each element in a df.
    newDF = newDF.applymap(lambda s: s.lower() if type(s) == str else s)
    #adding a space after each "body" value so merged bodies stay separated, then replacing NaN with empty strings
    newDF['body'] = newDF['body'] + ' '
    newDF['body'] = newDF['body'].replace(np.nan, '', regex=True)
    #adding commas after the tag-style columns to make sure their values are separated properly, then replacing NaN with empty strings
    for col in ['title', 'rating', 'character', 'relationship', 'additional tags', 'category']:
        newDF[col] = newDF[col] + ', '
        newDF[col] = newDF[col].replace(np.nan, '', regex=True)
    return newDF
tlok_prepped = prepare_columns(korra)
tlok_prepped[:1]
got_prepped = prepare_columns(got)
got1_prepped = prepare_columns(got1)
got2_prepped = prepare_columns(got2)
got3_prepped = prepare_columns(got3)
got_prepped[:2]
Again, I've already done this with the TLoK fanfics, so I only need to do it for the GoT ones.
def add_month_column(df, newcolumn, originalcolumn):
    '''
    Description: Takes a column that uses the style 2000-11-22 (YEAR-MONTH-DAY) and adds a new column with 0000-00 (year + month)
    Input: dataframe with a column structured like 0000-11-22
    Output: same dataframe with a new column containing 0000-11 (year + month in this case)
    '''
    #using a regular expression to keep only the year and month groups in the new "month" column
    df[newcolumn] = df[originalcolumn].replace(r'(\d{4})(\-)(\d{2})(\-)(\d{2})', r'\g<1>\g<2>\g<3>', regex=True)
    return df
tlok_new = add_month_column(tlok_prepped,'month','published')
tlok_new.head(2)
got_new = add_month_column(got_prepped, 'month','published')
got1_new = add_month_column(got1_prepped, 'month','published')
got2_new = add_month_column(got2_prepped, 'month','published')
got3_new = add_month_column(got3_prepped, 'month','published')
#checking that the month column has been added
got_new[:2]
Let's save the new dataframes as CSVs so I can use them! I am saving them with the same names as the original CSVs. This is not a great practice, because you want to keep all stages of your data in case something happens. However, I already have the original data saved on an external drive, so I would prefer not to save the same data over and over on my hard drive.
got_new.to_csv(r'./data/got_data_clean/got0.csv')
got1_new.to_csv(r'./data/got_data_clean/got1.csv')
got2_new.to_csv(r'./data/got_data_clean/got2.csv')
got3_new.to_csv(r'./data/got_data_clean/got3.csv')
While it may seem counterintuitive to read in a bunch of split dataframes and then merge them again, the one large CSV kept crashing my Python kernel. This means I will probably need to keep all the CSVs separate when I load them in, and then merge them each time in my notebook. So far, this approach has worked.
#loading these in so I no longer have to run all the code above to access these
got0New = pd.read_csv('./data/got_data_clean/got0.csv')
got1New = pd.read_csv('./data/got_data_clean/got1.csv')
got2New = pd.read_csv('./data/got_data_clean/got2.csv')
got3New = pd.read_csv('./data/got_data_clean/got3.csv')
merged_got1 = pd.concat([got0New, got1New, got2New, got3New])
got_metadata = merged_got1.drop(['body'], axis=1)
got_metadata.head(1)
got_textual = merged_got1[['work_id','body']]
got_textual.head(4)
#saving the metadata files and body of text files
got_textual.to_csv(r'./data/got_data_clean/got_body.csv')
got_metadata.to_csv(r'./data/got_data_clean/got_metadata.csv')
I wanted to check which months the most fanfics were published in, so I made a quick bar chart. This graphing function can be reused for many datasets and is fairly easy.
def viz_months(df, column, name_of_show):
    '''
    Description: This function takes a dataframe, counts the 10 most frequent values in a column, and then visualizes them. I am specifically using this function for visualizing published dates of fanfictions, but the labels below can be changed.
    Input: the dataframe, the column whose highest values you want to count, and the name of the show
    Output: the top 10 highest months published in that set & a cute graph
    '''
    monthcount = df[column].value_counts().head(10)
    print(monthcount)
    monthcount.plot.bar()
    plt.title('Highest Months of Published Fanfics of ' + name_of_show)
    plt.xlabel('Month and Year')
    plt.ylabel('Number of Fanfics')
viz_months(merged_got1, 'month', 'Game of Thrones')
viz_months(tlok_new, 'month', 'The Legend of Korra')
Next, I will demonstrate how I prepared the data for computational temporal analysis, or tracing trends over time. I use two corpora: The Legend of Korra and Game of Thrones fanfiction published on Archive of Our Own. This notebook works with four columns of textual data: the "relationship" tags column, the "additional tags" column, the "category" column, which provides the gender pairing of particular relationships (such as M/M, F/F, M/F, etc.), and the "body" column, which contains the entire text of each fanfic.
Since I do not need to load in the data again, I will just show the beginning of the data files. Next, I will need to define my functions.
def prepare_columns_for_groupby(df):
    '''
    Takes a dataframe collected from AO3 (so the column names are the same) and prepares it to be grouped by lowercasing all the letters and replacing NaN values in particular columns.
    Input: A dataframe from AO3 with the same column names
    Output: A similar dataframe, except the words are lower-cased and NaN values are replaced with empty strings
    '''
    #copy the dataframe so accidentally running this cell multiple times doesn't alter the original
    newDF = df.copy()
    #make all the text lowercased. The "applymap" function applies a function to each element in a df.
    newDF = newDF.applymap(lambda s: s.lower() if type(s) == str else s)
    newDF = newDF[['published','rating','relationship','additional tags','character','category','month','body']]
    #replace NaN values with empty strings so the string columns can be summed in a groupby
    for col in ['body', 'rating', 'relationship', 'character', 'additional tags', 'category']:
        newDF[col] = newDF[col].replace(np.nan, '', regex=True)
    #make published dates into readable dates
    newDF['published'] = pd.to_datetime(newDF['published'])
    return newDF
def group_by(df):
    '''
    This function will group a dataframe by the 'month' column. This can also be used in a later function to group by particular months.
    Input: a pandas dataframe with a 'month' column
    Output: a new dataframe with the month as the index
    '''
    #first, group the dataframe by month and count. This creates a count of how many rows fall in each month.
    month_count = df.groupby('month').count()
    #in the new dataframe, use a column that has been counted and rename it 'count'
    month_count['count'] = month_count['relationship']
    month_count_new = month_count['count']
    #create another new dataframe that aggregates all the proper columns
    new_group = df.groupby('month').agg({'rating':'sum','additional tags':'sum','category':'sum','character':'sum','relationship':'sum','body':'sum'})
    #join both dataframes to include the count & sort ascending so the earliest fanfics are on top
    join_df = pd.concat([new_group, month_count_new], axis=1)
    join_df = join_df.sort_index(ascending=True)
    return join_df
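The grouping step above relies on a pandas behavior worth making explicit: `agg` with `'sum'` on a string column concatenates the values within each group. A toy illustration with made-up tag values:

```python
import pandas as pd

# made-up rows standing in for fanfic metadata
toy = pd.DataFrame({
    'month': ['2014-06', '2014-06', '2014-07'],
    'relationship': ['korra/asami, ', 'mako/korra, ', 'korra/asami, '],
})

# 'sum' on string columns concatenates, so each month's tags merge into one string
grouped = toy.groupby('month').agg({'relationship': 'sum'})
```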
Before I group the dataframe, I first prepare my data with prepare_columns_for_groupby, then group it by month with group_by.
allkorra_prepped = prepare_columns_for_groupby(tlok_new)
allkorra_prepped.head(2)
allkorra_month = group_by(allkorra_prepped)
allkorra_month.head(5)
#save the new dataframe to be used later (commenting out so I don't resave)
allkorra_month.to_csv(r'./data/group_month/allkorra_months.csv')
#preKorrasami
tlok01 = allkorra_month.loc['2011-02':'2014-05']
#subtextual Korrasami
tlok02 = allkorra_month.loc['2014-06':'2014-11']
#postKorrasami
tlok03 = allkorra_month.loc['2014-12':'2015-07']
Before I group the GoT dataframe, I again prepare the data with prepare_columns_for_groupby and then group it by month with group_by.
I am currently saving the dataframe as one .csv, but then I will use a CSV splitter created by Jordi Rivero. This way, I can upload the split CSVs without killing my kernels.
allgot_prepped = prepare_columns_for_groupby(got_new)
allgot_prepped.head(2)
allgot_months = group_by(allgot_prepped)
#making individual dataframes for each season
#Season 1: Beginning of data to March 2012 (Season 2 airs April 1st, 2012)
gots1 = allgot_months.loc['2007-02':'2012-03']
#Season 2: April 2012-March 2013 (season 3 airs March 31, 2013)
gots2 = allgot_months.loc['2012-04':'2013-03']
#Season 3: April 2013-March 2014 (season 4 airs April 6, 2014)
gots3 = allgot_months.loc['2013-04':'2014-03']
#Season 4: April 2014 to March 2015 (season 5 airs April 12, 2015)
gots4 = allgot_months.loc['2014-04':'2015-03']
#Season 5: April 2015-March 2016 (season 6 airs April 24, 2016)
gots5 = allgot_months.loc['2015-04':'2016-03']
#Season 6: April 2016-June 2017 (season 7 airs July 16, 2017)
gots6 = allgot_months.loc['2016-04':'2017-06']
#Season 7: July 2017-March 2019 (season 8 airs April 14, 2019)
gots7 = allgot_months.loc['2017-07':'2019-03']
#Season 8: April 2019-end of data
gots8 = allgot_months.loc['2019-04':'2019-09']
# allgot_prepped.to_csv(r'./data/group_month/allgot_months.csv')
gots8
# save data
gots1.to_csv(r'./data/group_month/got_s1.csv')
gots2.to_csv(r'./data/group_month/got_s2.csv')
gots3.to_csv(r'./data/group_month/got_s3.csv')
gots4.to_csv(r'./data/group_month/got_s4.csv')
gots5.to_csv(r'./data/group_month/got_s5.csv')
gots6.to_csv(r'./data/group_month/got_s6.csv')
gots7.to_csv(r'./data/group_month/got_s7.csv')
gots8.to_csv(r'./data/group_month/got_s8.csv')
def column_to_list(df, columnName):
    '''
    This function takes all the values from a specific column and joins them into one string.
    input: the name of the dataframe and the column name
    output: a single string containing every value in that column, separated by spaces
    '''
    df[columnName] = df[columnName].replace(np.nan, '', regex=True)
    #renamed from "string" so it doesn't shadow the imported string module
    joined = ' '.join(df[columnName].tolist())
    return joined
def MetadataToDF(df, columnName, NewSeasonColumnName):
    '''
    Input: the dataframe you will work with, the column to count, and the name of the new count column
    Output: a new dataframe with the metadata values and their counts in the newly named column
    '''
    #replace empty values & join all the values in the column into one string
    joined = column_to_list(df, columnName)
    #subclass PunktLanguageVars so the tokenizer treats the comma as the "sentence" boundary
    class CommaPoint(PunktLanguageVars):
        sent_end_chars = (',',)
    tokenizer = PunktSentenceTokenizer(lang_vars = CommaPoint())
    #tokenizing the string on the COMMA, not the white space (as set up in CommaPoint above)
    ListOfTags = tokenizer.tokenize(joined)
    #the "Counter" class is from the collections library
    allCounter = collections.Counter(ListOfTags)
    most = allCounter.most_common()
    newDF = pd.DataFrame(most, columns = [columnName, NewSeasonColumnName])
    return newDF
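The comma-splitting-and-counting at the heart of this function can be sketched more simply with `str.split` and `Counter` (the tag string below is made up):

```python
from collections import Counter

# toy comma-separated tag string, like the joined "additional tags" column
tag_string = 'fluff, angst, fluff, slow burn, fluff, angst'

# split on commas, strip whitespace, and count each tag's occurrences
tags = [t.strip() for t in tag_string.split(',')]
counts = Counter(tags).most_common()
# counts == [('fluff', 3), ('angst', 2), ('slow burn', 1)]
```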
tlokALLcharacter = MetadataToDF(allkorra_month, 'character','allCount')
tlok01character = MetadataToDF(tlok01, 'character','01Count')
tlok02character = MetadataToDF(tlok02, 'character','02Count')
tlok03character = MetadataToDF(tlok03, 'character','03Count')
tlokALLcharacter[:5]
tlokCharMerge01 = pd.merge(tlok01character, tlok02character, on='character', how='outer')
tlokCharMerge02 = pd.merge(tlokCharMerge01, tlok03character, on='character', how='outer')
tlokCharMerge = pd.merge(tlokCharMerge02, tlokALLcharacter, on='character', how='outer')
tlokCharMerge[:5]
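One thing to watch with these outer merges: a tag that appears in one era but not another comes through as NaN in that era's count column. A hedged sketch with made-up counts, showing how `fillna(0)` makes the columns directly comparable (this step is optional and not applied above):

```python
import pandas as pd

# made-up per-era character counts
era1 = pd.DataFrame({'character': ['korra', 'asami'], '01Count': [5, 3]})
era2 = pd.DataFrame({'character': ['korra', 'mako'], '02Count': [4, 2]})

# an outer merge keeps every character; fillna(0) turns the gaps into zeros
merged = pd.merge(era1, era2, on='character', how='outer').fillna(0)
```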
tlokALLrel = MetadataToDF(allkorra_month, 'relationship','allCount')
tlok01rel = MetadataToDF(tlok01, 'relationship','01Count')
tlok02rel = MetadataToDF(tlok02, 'relationship','02Count')
tlok03rel = MetadataToDF(tlok03, 'relationship','03Count')
tlokALLrel[:5]
tlokRelMerge01 = pd.merge(tlok01rel, tlok02rel, on='relationship', how='outer')
tlokRelMerge02 = pd.merge(tlokRelMerge01, tlok03rel, on='relationship', how='outer')
tlokRelMerge = pd.merge(tlokRelMerge02, tlokALLrel, on='relationship', how='outer')
tlokRelMerge[:5]
tlokALLcat = MetadataToDF(allkorra_month, 'category','allCount')
tlok01cat = MetadataToDF(tlok01, 'category','01Count')
tlok02cat = MetadataToDF(tlok02, 'category','02Count')
tlok03cat = MetadataToDF(tlok03, 'category','03Count')
tlokALLcat
tlokCatMerge01 = pd.merge(tlok01cat, tlok02cat, on='category', how='outer')
tlokCatMerge02 = pd.merge(tlokCatMerge01, tlok03cat, on='category', how='outer')
tlokCatMerge = pd.merge(tlokCatMerge02, tlokALLcat, on='category', how='outer')
tlokCatMerge[:5]
tlokALLtags = MetadataToDF(allkorra_month, 'additional tags','allCount')
tlok01tags = MetadataToDF(tlok01, 'additional tags','01Count')
tlok02tags = MetadataToDF(tlok02, 'additional tags','02Count')
tlok03tags = MetadataToDF(tlok03, 'additional tags','03Count')
tlokALLtags[:5]
tlokTagsMerge01 = pd.merge(tlok01tags, tlok02tags, on='additional tags', how='outer')
tlokTagsMerge02 = pd.merge(tlokTagsMerge01, tlok03tags, on='additional tags', how='outer')
tlokTagsMerge = pd.merge(tlokTagsMerge02, tlokALLtags, on='additional tags', how='outer')
tlokTagsMerge[:5]
tlokALLrating = MetadataToDF(allkorra_month, 'rating','allCount')
tlok01rating = MetadataToDF(tlok01, 'rating','01Count')
tlok02rating = MetadataToDF(tlok02, 'rating','02Count')
tlok03rating = MetadataToDF(tlok03, 'rating','03Count')
tlokALLrating
tlokRatingMerge01 = pd.merge(tlok01rating, tlok02rating, on='rating', how='outer')
tlokRatingMerge02 = pd.merge(tlokRatingMerge01, tlok03rating, on='rating', how='outer')
tlokRatingMerge = pd.merge(tlokRatingMerge02, tlokALLrating, on='rating', how='outer')
tlokRatingMerge[:5]
tlokCharMerge.to_csv(r'./data/metadata/TLoK/tlok_metadata_character.csv')
tlokRelMerge.to_csv(r'./data/metadata/TLoK/tlok_metadata_relationship.csv')
tlokCatMerge.to_csv(r'./data/metadata/TLoK/tlok_metadata_categories.csv')
tlokTagsMerge.to_csv(r'./data/metadata/TLoK/tlok_metadata_tags.csv')
tlokRatingMerge.to_csv(r'./data/metadata/TLoK/tlok_metadata_rating.csv')
gots0character = MetadataToDF(allgot_months, 'character','AllCount')
gots1character = MetadataToDF(gots1, 'character','S1Count')
gots2character = MetadataToDF(gots2, 'character','S2Count')
gots3character = MetadataToDF(gots3, 'character','S3Count')
gots4character = MetadataToDF(gots4, 'character','S4Count')
gots5character = MetadataToDF(gots5, 'character','S5Count')
gots6character = MetadataToDF(gots6, 'character','S6Count')
gots7character = MetadataToDF(gots7, 'character','S7Count')
gots8character = MetadataToDF(gots8, 'character','S8Count')
gots0character[:5]
gotMerge1Chara = pd.merge(gots1character, gots2character, on='character', how='outer')
gotMerge2Chara = pd.merge(gotMerge1Chara, gots3character, on='character', how='outer')
gotMerge3Chara = pd.merge(gotMerge2Chara, gots4character, on='character', how='outer')
gotMerge4Chara = pd.merge(gotMerge3Chara, gots5character, on='character', how='outer')
gotMerge5Chara = pd.merge(gotMerge4Chara, gots6character, on='character', how='outer')
gotMerge6Chara = pd.merge(gotMerge5Chara, gots7character, on='character', how='outer')
gotMerge7Chara = pd.merge(gotMerge6Chara, gots8character, on='character', how='outer')
gotMergeChara = pd.merge(gotMerge7Chara, gots0character, on='character', how='outer')
gotMergeChara[:5]
gots0rel = MetadataToDF(allgot_months, 'relationship','AllCount')
gots1rel = MetadataToDF(gots1, 'relationship','S1Count')
gots2rel = MetadataToDF(gots2, 'relationship','S2Count')
gots3rel = MetadataToDF(gots3, 'relationship','S3Count')
gots4rel = MetadataToDF(gots4, 'relationship','S4Count')
gots5rel = MetadataToDF(gots5, 'relationship','S5Count')
gots6rel = MetadataToDF(gots6, 'relationship','S6Count')
gots7rel = MetadataToDF(gots7, 'relationship','S7Count')
gots8rel = MetadataToDF(gots8, 'relationship','S8Count')
gots0rel[:5]
gotMerge1rel = pd.merge(gots1rel, gots2rel, on='relationship', how='outer')
gotMerge2rel = pd.merge(gotMerge1rel, gots3rel, on='relationship', how='outer')
gotMerge3rel = pd.merge(gotMerge2rel, gots4rel, on='relationship', how='outer')
gotMerge4rel = pd.merge(gotMerge3rel, gots5rel, on='relationship', how='outer')
gotMerge5rel = pd.merge(gotMerge4rel, gots6rel, on='relationship', how='outer')
gotMerge6rel = pd.merge(gotMerge5rel, gots7rel, on='relationship', how='outer')
gotMerge7Rel = pd.merge(gotMerge6rel, gots8rel, on='relationship', how='outer')
gotMergeRel = pd.merge(gotMerge7Rel, gots0rel, on='relationship', how='outer')
gotMergeRel[:5]
gots0cat = MetadataToDF(allgot_months, 'category','AllCount')
gots1cat = MetadataToDF(gots1, 'category','S1Count')
gots2cat = MetadataToDF(gots2, 'category','S2Count')
gots3cat = MetadataToDF(gots3, 'category','S3Count')
gots4cat = MetadataToDF(gots4, 'category','S4Count')
gots5cat = MetadataToDF(gots5, 'category','S5Count')
gots6cat = MetadataToDF(gots6, 'category','S6Count')
gots7cat = MetadataToDF(gots7, 'category','S7Count')
gots8cat = MetadataToDF(gots8, 'category','S8Count')
gots0cat
gotMerge1cat = pd.merge(gots1cat, gots2cat, on='category', how='outer')
gotMerge2cat = pd.merge(gotMerge1cat, gots3cat, on='category', how='outer')
gotMerge3cat = pd.merge(gotMerge2cat, gots4cat, on='category', how='outer')
gotMerge4cat = pd.merge(gotMerge3cat, gots5cat, on='category', how='outer')
gotMerge5cat = pd.merge(gotMerge4cat, gots6cat, on='category', how='outer')
gotMerge6cat = pd.merge(gotMerge5cat, gots7cat, on='category', how='outer')
gotMerge7Cat = pd.merge(gotMerge6cat, gots8cat, on='category', how='outer')
gotMergeCat = pd.merge(gotMerge7Cat, gots0cat, on='category', how='outer')
gotMergeCat[:5]
gots0tags = MetadataToDF(allgot_months, 'additional tags','AllCount')
gots1tags = MetadataToDF(gots1, 'additional tags','S1Count')
gots2tags = MetadataToDF(gots2, 'additional tags','S2Count')
gots3tags = MetadataToDF(gots3, 'additional tags','S3Count')
gots4tags = MetadataToDF(gots4, 'additional tags','S4Count')
gots5tags = MetadataToDF(gots5, 'additional tags','S5Count')
gots6tags = MetadataToDF(gots6, 'additional tags','S6Count')
gots7tags = MetadataToDF(gots7, 'additional tags','S7Count')
gots8tags = MetadataToDF(gots8, 'additional tags','S8Count')
gots0tags[:5]
gotMerge1tags = pd.merge(gots1tags, gots2tags, on='additional tags', how='outer')
gotMerge2tags = pd.merge(gotMerge1tags, gots3tags, on='additional tags', how='outer')
gotMerge3tags = pd.merge(gotMerge2tags, gots4tags, on='additional tags', how='outer')
gotMerge4tags = pd.merge(gotMerge3tags, gots5tags, on='additional tags', how='outer')
gotMerge5tags = pd.merge(gotMerge4tags, gots6tags, on='additional tags', how='outer')
gotMerge6tags = pd.merge(gotMerge5tags, gots7tags, on='additional tags', how='outer')
gotMerge7Tags = pd.merge(gotMerge6tags, gots8tags, on='additional tags', how='outer')
gotMergeTags = pd.merge(gotMerge7Tags, gots0tags, on='additional tags', how='outer')
gotMergeTags[:5]
gots0rating = MetadataToDF(allgot_months, 'rating','AllCount')
gots1rating = MetadataToDF(gots1, 'rating','S1Count')
gots2rating = MetadataToDF(gots2, 'rating','S2Count')
gots3rating = MetadataToDF(gots3, 'rating','S3Count')
gots4rating = MetadataToDF(gots4, 'rating','S4Count')
gots5rating = MetadataToDF(gots5, 'rating','S5Count')
gots6rating = MetadataToDF(gots6, 'rating','S6Count')
gots7rating = MetadataToDF(gots7, 'rating','S7Count')
gots8rating = MetadataToDF(gots8, 'rating','S8Count')
gots0rating
gotMerge1rat = pd.merge(gots1rating, gots2rating, on='rating', how='outer')
gotMerge2rat = pd.merge(gotMerge1rat, gots3rating, on='rating', how='outer')
gotMerge3rat = pd.merge(gotMerge2rat, gots4rating, on='rating', how='outer')
gotMerge4rat = pd.merge(gotMerge3rat, gots5rating, on='rating', how='outer')
gotMerge5rat = pd.merge(gotMerge4rat, gots6rating, on='rating', how='outer')
gotMerge6rat = pd.merge(gotMerge5rat, gots7rating, on='rating', how='outer')
gotMerge7Rat = pd.merge(gotMerge6rat, gots8rating, on='rating', how='outer')
gotMergeRat = pd.merge(gotMerge7Rat, gots0rating, on='rating', how='outer')
gotMergeRat[:5]
gotMergeChara.to_csv(r'./data/metadata/GoT/got_metadata_character.csv')
gotMergeRel.to_csv(r'./data/metadata/GoT/got_metadata_relationship.csv')
gotMergeCat.to_csv(r'./data/metadata/GoT/got_metadata_categories.csv')
gotMergeTags.to_csv(r'./data/metadata/GoT/got_metadata_tags.csv')
gotMergeRat.to_csv(r'./data/metadata/GoT/got_metadata_rating.csv')