This notebook analyzes particular tags (character, additional, relationship, and others) and trends in the fanfic texts centering Missandei of Naath. It is the computational essay paired with "'Missandei Deserved Better': Fan Uptakes of Missandei of Naath," part of the Critical Fan Toolkit.
The Critical Fan Toolkit is a dissertation by Cara Marta Messina.
When I asked WriteGirl about the lack of use of Missandei's character, she said that even when Missandei appears in a fic, she is often written as "window dressing" and merely there to affirm Dany's arguments.
To test this claim, I will pull all the fanfics that use the Missandei character tag, identify the most-used character tags in that smaller corpus, and count how often the fanfic texts themselves mention Missandei and other characters. To do this, I am using pandas' "string contains" method (str.contains), because Missandei's tag appears as "Missandei (ASoIaF)," "Missandei," and in other forms; str.contains allows me to pull every variant that mentions Missandei.
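A small sketch of how str.contains catches every variant of the tag; the tag strings here are invented for illustration:

```python
import pandas as pd

# hypothetical tag variants; na=False treats missing tags as non-matches
tags = pd.Series(["Missandei (ASoIaF)", "Missandei", "Daenerys Targaryen", None])
mask = tags.str.contains("Missandei", na=False)
print(mask.tolist())  # [True, True, False, False]
```

Because str.contains does substring matching, both "Missandei" and "Missandei (ASoIaF)" match, while the missing value is safely excluded by na=False.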
There are 29,897 fanfics in total, and only 1,018 (3.4%) use Missandei in the character tag.
#pandas for working with dataframes
import pandas as pd
import numpy as np
#visualizing
import seaborn as sns
#nltk libraries
import nltk
from nltk import word_tokenize
from nltk.util import ngrams
from nltk.stem.porter import *
import string
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktLanguageVars
#has the nice counter feature for counting tags
import collections
from collections import Counter
got0 = pd.read_csv('data/got_data_original/got0.csv')
got1 = pd.read_csv('data/got_data_original/got1.csv')
got2 = pd.read_csv('data/got_data_original/got2.csv')
got3 = pd.read_csv('data/got_data_original/got3.csv')
got_all = pd.concat([got0, got1, got2, got3])
got_all[:2]
len(got_all.index)
gotMiss = got_all[got_all['character'].str.contains("Missandei", na=False)]
len(gotMiss.index)
def character_percentage(df, characterName):
    '''
    Prints the percentage of fics in df whose character tag mentions characterName.
    '''
    dfcount = len(df.index)
    df1 = df[df['character'].str.contains(characterName, na=False)]
    df1count = len(df1.index)
    percentage = (df1count * 100)/dfcount
    print(percentage)
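As a toy check, the same percentage logic can be sketched inline on a made-up dataframe (the rows below are invented; 2 of the 4 mention "Missandei"):

```python
import pandas as pd

# hypothetical character-tag column, including a missing value
toy = pd.DataFrame({"character": ["Missandei, Grey Worm", "Arya Stark", None, "Missandei (ASoIaF)"]})
sub = toy[toy["character"].str.contains("Missandei", na=False)]
pct = len(sub.index) * 100 / len(toy.index)
print(pct)  # 50.0
```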
character_percentage(got_all, 'Brienne')
character_percentage(got_all, 'Gendry')
character_percentage(got_all, 'Tormund')
character_percentage(got_all, 'Lyanna')
character_percentage(got_all, 'Podrick')
character_percentage(got_all, 'Gilly')
I created functions to pull tags from the dataframe and count the top tags. I will use these functions on the Missandei corpus to see which characters and relationships are the most popular in this corpus.
def column_to_string(df, columnName):
    '''
    this function takes all the values from a specific column and joins them into a single string
    input: the dataframe and the column name
    output: one string containing every value in the column, with missing values replaced by empty strings
    '''
    df[columnName] = df[columnName].replace(np.nan, '', regex=True)
    joined = ' '.join(df[columnName].tolist())
    return joined
def TagsAnalyzer(df, columnName):
    '''
    Input: the dataframe and the column you want to analyze
    Output: a list of (tag, count) tuples, sorted from most to least common
    Description: this joins the column into one string, separates tags by commas, and counts the most frequent tags
    '''
    #replace empty values & join the column into one string using the column_to_string function
    tagString = column_to_string(df, columnName)
    #tell the Punkt tokenizer to treat commas (not periods) as boundaries
    class CommaPoint(PunktLanguageVars):
        sent_end_chars = (',',)
    tokenizer = PunktSentenceTokenizer(lang_vars=CommaPoint())
    #tokenizing the string based on the COMMA, not the white space (as set in CommaPoint above)
    ListOfTags = tokenizer.tokenize(tagString)
    #the "Counter" class is from the collections library
    allCounter = collections.Counter(ListOfTags)
    #returning a list of (tag, count) tuples, most common first
    return allCounter.most_common()
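The core idea here, splitting one comma-separated tag string and counting with Counter, can be sketched without NLTK using plain string methods; the tag string below is invented for illustration:

```python
from collections import Counter

# hypothetical comma-separated tag string, as stored in one dataframe cell
tag_string = "Daenerys Targaryen, Missandei, Missandei (ASoIaF), Daenerys Targaryen"
tags = [t.strip() for t in tag_string.split(",")]
print(Counter(tags).most_common(2))  # [('Daenerys Targaryen', 2), ('Missandei', 1)]
```

Note that the Punkt-based tokenizer in the function above may keep the trailing comma attached to each tag, so its keys can look slightly different from this cleaned version.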
Missandei is, of course, the most popular tag, which makes sense. Then it's Dany, Tyrion, Sansa, Arya, then Grey Worm. I'm unsurprised, yet still disappointed to see Grey Worm so far down.
Because Dany appears second, this may support Ebony Elizabeth Thomas's observation that Missandei is often depicted only as Dany's friend and confidant, not as her own person.
After I created the list of tuples ("MissCharacters"), I also wanted to visualize the results. I put the list of tuples into a dataframe, since Seaborn (the graphing library) works best with dataframes. Then I visualized the top 5 character tags used.
MissCharacters = TagsAnalyzer(gotMiss,'character')
MissCharacters[:100]
MissCharactersDF = pd.DataFrame(MissCharacters, columns=["Character", "Count"])
MissCharactersDFtop5 = MissCharactersDF[:5]
MissCharactersDFtop5
MissCharaViz = sns.barplot(x="Count", y="Character", data=MissCharactersDFtop5, color="gray").set_title("Top Character Tags Used in Missandei Fanfics")
MissCharaViz.figure.savefig("./images/MissCharacterTags.png", bbox_inches="tight")
The top relationship is Arya/Gendry, with Grey Worm/Missandei coming in next. In the top 20 relationships, Missandei only appears once and is paired with Grey Worm.
MissRelations = TagsAnalyzer(gotMiss,'relationship')
MissRelations[:20]
Of course, looking at metadata does not provide the entire picture. How often does Missandei actually appear in the fanfics that use her in the character tags? How does that compare to other characters? What types of words are used to describe Missandei and other characters?
To analyze this, I will begin with simple frequency counts of specific character names across the whole Missandei corpus.
First, I will grab all the fanfic texts and join them into a single string using the column_to_string function. Then, I will use the len function to measure the length of that string. Note that len on a string counts characters, not words, so this gives the total number of characters in the corpus; it still works as a consistent denominator for comparing how often particular names appear.
Number of characters in the Missandei fanfic corpus: 194,504,308
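One caveat worth keeping in mind: Python's len on a string counts characters, not words, so the figure above is a character count. A quick illustration:

```python
text = "Missandei of Naath"
print(len(text))          # 18 characters
print(len(text.split()))  # 3 words
```

The name frequencies below are therefore occurrences per 100 characters; since every name is measured against the same denominator, the comparisons between characters still hold.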
MissText = column_to_string(gotMiss,"body")
MissText[:1000]
length = len(MissText)
length
def WordPercentage(word, text):
    '''
    Description: This function finds how often a word appears relative to the length of a string. It's really just a simple substring search.
    Input: a word and the string you would like to search.
    Output: the number of times the word appears per 100 characters of the string.
    '''
    #first, find the length of the string in characters using the len() function
    length = len(text)
    #using the built-in str.count() method, count how often the word appears
    count = text.count(word)
    #calculate the percentage
    percent = (count * 100)/length
    return(percent)
print(WordPercentage("Missandei",MissText) + WordPercentage("Missy",MissText))
WordPercentage("Arya",MissText)
print(WordPercentage("Dany",MissText) + WordPercentage("Daenerys",MissText))
WordPercentage("Sansa",MissText)
WordPercentage("Tyrion",MissText)
WordPercentage("Grey",MissText)
WordPercentage("Westeros",MissText)
WordPercentage("Naath",MissText)
print((MissText.count("Missandei")) + MissText.count("Missy"))
print((MissText.count("Daenerys")) + MissText.count("Dany"))
print(MissText.count("Arya"))
print(MissText.count("Sansa"))
print(MissText.count("Grey"))
useData = {'Character': ["Missandei", "Dany", "Arya", "Sansa", "Grey Worm","Tyrion"], 'Frequency': [0.01741658081938216, 0.08329070017307791, 0.033672261901777516, 0.051355160729910414, 0.011530850000504873, 0.028398342724624895]}
dfUseData = pd.DataFrame.from_dict(useData).sort_values("Frequency", ascending=False)
dfUseData
MissCharaUse = sns.barplot(x="Frequency", y="Character", data=dfUseData, color="gray").set_title("Character Name Frequency in Missandei Fanfics (total characters, n = 194,504,308)")
MissCharaUse.figure.savefig("./images/MissCharacterFrequency.png", bbox_inches="tight")