Cara Marta Messina
PhD Candidate in English, Writing & Rhetoric
Northeastern University
Published for The Journal of Writing Analytics, Volume 3 and The Critical Fan Toolkit.
This notebook demonstrates the code used to prepare the "body" of The Legend of Korra fanfiction texts published on AO3. The next step is to use the saved strings to perform computational text analysis on them.
#pandas for working with dataframes
import pandas as pd
#regular expression library
import re
#numpy for working with numbers
import numpy as np
#nltk libraries
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import string
#I am not saving the data in this GitHub folder
allkorra = pd.read_csv('../../../data/korra/korra2018/allkorra.csv')
#providing a preview of the data.
allkorra.head(1)
In this section, I will demonstrate how to prepare the "Body" column of the data, which contains all the text for each published fanfiction saved in the data.
This section first defines a custom stop word list, then several functions: column_to_string, column_to_token, cleaning, save_txt, and save_string_txt.
#custom stop word list: pronouns, auxiliaries, contractions, and stray punctuation tokens
stopwords_ = ['``','`','--','...',"'m","''","'re","'ve",'i','me', 'my', 'myself', "“", "”", 'we', 'our', '’', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her','hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', "he's","she's","they're","they've","i've", 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'do',"n't","don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", "'t", "'s", 'wasn', "wasn't", 'weren', "weren't", "would", "could", 'won', "won't", 'wouldn', "wouldn't"]
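As a quick aside, the sketch below (assuming the NLTK stopwords corpus has already been downloaded with nltk.download('stopwords')) compares the custom list above against NLTK's built-in English stop word list to show which entries were added on top of it, such as the curly quotes and the contraction fragments produced by word_tokenize.
#sketch: compare the custom list with NLTK's built-in English stop words
#assumes nltk.download('stopwords') has already been run
nltk_stops = set(stopwords.words('english'))
added_by_hand = [w for w in stopwords_ if w not in nltk_stops]
print(added_by_hand[:20])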
def column_to_string(dataframe,columnName):
    '''
    this function takes all the text from a specific column and joins it into a single string. This is NOT for cleaning the text, just for saving the string.
    Input: the dataframe and the column name
    Output: a string of all the words/characters from that column
    '''
    #named corpus_string so it does not shadow the imported string module
    corpus_string = ' '.join(dataframe[columnName].tolist())
    return corpus_string
def column_to_token(dataframe,columnName):
    '''
    this function takes all the text from a specific column, joins it into a string, and then tokenizes that string.
    Input: the dataframe and the column name
    Output: a tokenized list of all the words/characters from that specific column
    '''
    corpus_string = ' '.join(dataframe[columnName].tolist())
    corpus_token = word_tokenize(corpus_string)
    return corpus_token
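To illustrate what these two functions return, here is a small sketch on a made-up two-row dataframe (hypothetical data, not the corpus itself):
#hypothetical two-row dataframe, for illustration only
demo = pd.DataFrame({'body': ['Korra trains.', 'Asami builds.']})
print(column_to_string(demo, 'body'))
#-> Korra trains. Asami builds.
print(column_to_token(demo, 'body')[:3])
#-> ['Korra', 'trains', '.']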
def cleaning(token_text):
    '''
    This function takes a tokenized list of words/punctuation and applies several "cleaning" steps:
    it lowercases everything, removes the stop words listed above, removes punctuation (using the punctuation list from Python's built-in string module, in my "imports"), and stems each remaining word.
    Input: the tokenized list of the text
    Output: the tokenized list, lowercased, with punctuation and stop words removed, and stemmed
    '''
    stemmer = PorterStemmer()
    #lowercase every token
    text_lc = [word.lower() for word in token_text]
    #remove stop words
    text_tokens_clean = [word for word in text_lc if word not in stopwords_]
    #remove punctuation tokens
    text_tokens_clean_ = [word for word in text_tokens_clean if word not in string.punctuation]
    #stem what remains
    tokens_stemmed = [stemmer.stem(w) for w in text_tokens_clean_]
    return tokens_stemmed
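A quick illustration of the cleaning pipeline on a single made-up sentence (hypothetical text, not drawn from the corpus):
#hypothetical sentence showing lowercasing, stop word/punctuation removal, and stemming
sample_tokens = word_tokenize("Korra and Asami were training at the Air Temple!")
print(cleaning(sample_tokens))
#-> ['korra', 'asami', 'train', 'air', 'templ']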
def save_txt(cleanToken,filePath):
    '''
    take the tokenized list of words, convert it to a string, and save it
    input: the tokenized list of words and your new filepath
    output: a saved file of the full corpus' string
    '''
    clean_string = " ".join(cleanToken)
    #utf-8 so curly quotes and other non-ASCII characters save cleanly
    with open(filePath, "w", encoding="utf-8") as file2:
        file2.write(clean_string)
def save_string_txt(corpus_string,filePath):
    '''
    take the string of words and save it
    input: the string of words and your new filepath
    output: a saved file of the full corpus' string
    '''
    with open(filePath, "w", encoding="utf-8") as file2:
        file2.write(corpus_string)
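As a round-trip check, here is a sketch that writes a tiny token list to a made-up scratch path (not one of the project's data paths) and reads it back:
#hypothetical scratch file; the real corpora are saved further below
save_txt(['korra', 'asami', 'train'], 'scratch_example.txt')
with open('scratch_example.txt', encoding='utf-8') as f:
    print(f.read())
#-> korra asami train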
While I will not be using the full corpus for this particular project, I will be using it in future projects. I am saving the full corpus now so I do not have to do it later.
#tokenize and clean the full corpus (the same pipeline used for the time-based corpora below)
full_corpus_token = column_to_token(allkorra,'body')
full_corpus = cleaning(full_corpus_token)
full_corpus[:5]
# SAVE THE FULL CORPUS
# save_txt(full_corpus,'../../../data/korra/korra2018/allkorra.txt')
Based on the computational temporal analysis conducted in previous notebooks, I will separate the overall corpus into three smaller corpora: preKorrasami (February 2011 through July 2014), subtextKorrasami (August 2014 through November 2014), and postKorrasami (December 2014 through March 2015).
I decided to stop at March 2015 because this is when the publishing peak begins to die down.
To create the smaller corpora out of the larger corpus, I will follow the same steps as above, but first create three separate dataframes that reflect these spans of time. As the cell below shows, I set the month as the index and sort it in ascending order (earlier dates to later dates).
month_ff = allkorra.set_index('month')
month_ff = month_ff.sort_index(ascending=True)
month_ff[:1]
preKorrasami = month_ff.loc['2011-02':'2014-07']
subtextKorrasami = month_ff.loc['2014-08':'2014-11']
postKorrasami = month_ff.loc['2014-12':'2015-03']
full = month_ff.loc['2011-02':'2015-03']
len(full.index)
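As a sanity check (a small sketch, assuming the month index holds 'YYYY-MM' strings as in the slices above), the three time slices should together account for the same number of fanfictions as the full February 2011 to March 2015 window:
#sketch: the three slices should sum to the full window
print(len(preKorrasami), len(subtextKorrasami), len(postKorrasami))
print(len(preKorrasami) + len(subtextKorrasami) + len(postKorrasami) == len(full))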
Next, I check the dataframes to make sure the months are all correct by looking at the first and last fanfiction for each dataframe.
preKorrasami[:1]
preKorrasami[-1:]
subtextKorrasami[:1]
subtextKorrasami[-1:]
postKorrasami[:1]
postKorrasami[-1:]
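A more compact version of the same check (a sketch) prints the earliest and latest month in each slice:
#sketch: print the index bounds of each time slice
for name, frame in [('preKorrasami', preKorrasami), ('subtextKorrasami', subtextKorrasami), ('postKorrasami', postKorrasami)]:
    print(name, frame.index.min(), 'to', frame.index.max())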
Since the dataframes are all confirmed to be within the proper timeframes, I will now use the functions to take the body of text from each published fanfiction, tokenize it, remove stopwords/punctuation, and stem it. Then, I will use the "save_txt" function to save each corpus as a text file so they can be used in the computational text analysis notebook.
First, I will save the three corpora as their uncleaned versions.
preKorra_string = column_to_string(preKorrasami,'body')
subtextKorra_string = column_to_string(subtextKorrasami,'body')
postKorra_string = column_to_string(postKorrasami,'body')
print(preKorra_string[:20])
print(subtextKorra_string[:20])
print(postKorra_string[:20])
save_string_txt(preKorra_string,'../../../data/korra/korra2018/time/preKorrasami_unclean.txt')
save_string_txt(subtextKorra_string,'../../../data/korra/korra2018/time/subtextKorrasami_unclean.txt')
save_string_txt(postKorra_string,'../../../data/korra/korra2018/time/postKorrasami_unclean.txt')
Next, I will create and save the three cleaned corpora.
preKorra_token = column_to_token(preKorrasami,'body')
preKorra_clean = cleaning(preKorra_token)
preKorra_clean[:20]
# SAVING THIS
save_txt(preKorra_clean,'../../../data/korra/korra2018/time/preKorrasami.txt')
subtextKorra_token = column_to_token(subtextKorrasami,'body')
subtextKorra_clean = cleaning(subtextKorra_token)
subtextKorra_clean[:20]
save_txt(subtextKorra_clean,'../../../data/korra/korra2018/time/subtextKorrasami.txt')
postKorra_token = column_to_token(postKorrasami,'body')
postKorra_clean = cleaning(postKorra_token)
postKorra_clean[:20]
# SAVING THIS
save_txt(postKorra_clean,'../../../data/korra/korra2018/time/postKorrasami.txt')