Cara Marta Messina
PhD Candidate in English, Writing & Rhetoric
Northeastern University
Published for The Journal of Writing Analytics, Volume 3 and The Critical Fan Toolkit.
This notebook demonstrates the code used to prepare the "body" of The Legend of Korra fanfiction texts published on AO3. The next step is to use the saved strings to perform computational text analysis on them.
#pandas for working with dataframes
import pandas as pd
#regular expression library
import re
#numpy for working with numbers
import numpy as np
#nltk libraries
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import string
#I am not saving the data in this GitHub folder
allkorra = pd.read_csv('../../../data/korra/korra2018/allkorra.csv')
#providing a preview of the data.
allkorra.head(1)
In this section, I will demonstrate how to prepare the "Body" column of the data, which contains all the text for each published fanfiction saved in the data.
This section first defines a custom stop word list, then several functions: column_to_string, column_to_token, cleaning, save_txt, and save_string_txt.
#custom stop word list: pronouns, auxiliaries, contractions, and stray punctuation tokens
stopwords_ = ['``','`','--','...',"'m","''","'re","'ve",'i','me', 'my', 'myself', "“", "”", 'we', 'our', '’', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her','hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', "he's","she's","they're","they've","i've", 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'do',"n't","don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", "'t", "'s", 'wasn', "wasn't", 'weren', "weren't", "would", "could", 'won', "won't", 'wouldn', "wouldn't"]
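As a quick aside, the sketch below (assuming the NLTK stopwords corpus has already been downloaded with nltk.download('stopwords')) compares the custom list above against NLTK's built-in English stop word list to show which entries were added on top of it, such as the curly quotes and the contraction fragments produced by word_tokenize.
#sketch: compare the custom list with NLTK's built-in English stop words
#assumes nltk.download('stopwords') has already been run
nltk_stops = set(stopwords.words('english'))
added_by_hand = [w for w in stopwords_ if w not in nltk_stops]
print(added_by_hand[:20])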
def column_to_string(dataframe,columnName):
    '''
    this function takes all the text from a specific column and joins it into a single string. This is NOT for cleaning the text, just for saving the string.
    Input: the dataframe and the column name
    Output: a string of all the words/characters from that column
    '''
    #named corpus_string so it does not shadow the imported string module
    corpus_string = ' '.join(dataframe[columnName].tolist())
    return corpus_string
def column_to_token(dataframe,columnName):
    '''
    this function takes all the text from a specific column, joins it into a string, and then tokenizes that string.
    Input: the dataframe and the column name
    Output: a tokenized list of all the words/characters from that specific column
    '''
    corpus_string = ' '.join(dataframe[columnName].tolist())
    corpus_token = word_tokenize(corpus_string)
    return corpus_token
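To illustrate what these two functions return, here is a small sketch on a made-up two-row dataframe (hypothetical data, not the corpus itself):
#hypothetical two-row dataframe, for illustration only
demo = pd.DataFrame({'body': ['Korra trains.', 'Asami builds.']})
print(column_to_string(demo, 'body'))
#-> Korra trains. Asami builds.
print(column_to_token(demo, 'body')[:3])
#-> ['Korra', 'trains', '.']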
def cleaning(token_text):
    '''
    This function takes a tokenized list of words/punctuation and applies several "cleaning" steps:
    it lowercases everything, removes the stop words listed above, removes punctuation (using the punctuation list from Python's built-in string module, in my "imports"), and stems each remaining word.
    Input: the tokenized list of the text
    Output: the tokenized list, lowercased, with punctuation and stop words removed, and stemmed
    '''
    stemmer = PorterStemmer()
    #lowercase every token
    text_lc = [word.lower() for word in token_text]
    #remove stop words
    text_tokens_clean = [word for word in text_lc if word not in stopwords_]
    #remove punctuation tokens
    text_tokens_clean_ = [word for word in text_tokens_clean if word not in string.punctuation]
    #stem what remains
    tokens_stemmed = [stemmer.stem(w) for w in text_tokens_clean_]
    return tokens_stemmed
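A quick illustration of the cleaning pipeline on a single made-up sentence (hypothetical text, not drawn from the corpus):
#hypothetical sentence showing lowercasing, stop word/punctuation removal, and stemming
sample_tokens = word_tokenize("Korra and Asami were training at the Air Temple!")
print(cleaning(sample_tokens))
#-> ['korra', 'asami', 'train', 'air', 'templ']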
def save_txt(cleanToken,filePath):
    '''
    take the tokenized list of words, convert it to a string, and save it
    input: the tokenized list of words and your new filepath
    output: a saved file of the full corpus' string
    '''
    clean_string = " ".join(cleanToken)
    #utf-8 so curly quotes and other non-ASCII characters save cleanly
    with open(filePath, "w", encoding="utf-8") as file2:
        file2.write(clean_string)
def save_string_txt(corpus_string,filePath):
    '''
    take the string of words and save it
    input: the string of words and your new filepath
    output: a saved file of the full corpus' string
    '''
    with open(filePath, "w", encoding="utf-8") as file2:
        file2.write(corpus_string)
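As a round-trip check, here is a sketch that writes a tiny token list to a made-up scratch path (not one of the project's data paths) and reads it back:
#hypothetical scratch file; the real corpora are saved further below
save_txt(['korra', 'asami', 'train'], 'scratch_example.txt')
with open('scratch_example.txt', encoding='utf-8') as f:
    print(f.read())
#-> korra asami train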
While I will not be using the full corpus for this particular project, I will be using it in future projects. I am saving the full corpus now so I do not have to do it later.
#tokenize and clean the full corpus (the same pipeline used for the time-based corpora below)
full_corpus_token = column_to_token(allkorra,'body')
full_corpus = cleaning(full_corpus_token)
full_corpus[:5]
# SAVE THE FULL CORPUS
# save_txt(full_corpus,'../../../data/korra/korra2018/allkorra.txt')
Based on the computational temporal analysis conducted in previous notebooks, I will separate the overall corpus into three smaller corpora: preKorrasami (February 2011 through July 2014), subtextKorrasami (August 2014 through November 2014), and postKorrasami (December 2014 through March 2015).
I decided to stop at March 2015 because this is when the publishing peak begins to die down.
To create the smaller corpora out of the larger corpus, I will follow the same steps as above, but first create three separate dataframes that reflect these spans of time. As the cell below shows, I set the month as the index and sort it in ascending order (earlier dates to later dates).
month_ff = allkorra.set_index('month')
month_ff = month_ff.sort_index(ascending=True)
month_ff[:1]
preKorrasami = month_ff.loc['2011-02':'2014-07']
subtextKorrasami = month_ff.loc['2014-08':'2014-11']
postKorrasami = month_ff.loc['2014-12':'2015-03']
full = month_ff.loc['2011-02':'2015-03']
len(full.index)
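As a sanity check (a small sketch, assuming the month index holds 'YYYY-MM' strings as in the slices above), the three time slices should together account for the same number of fanfictions as the full February 2011 to March 2015 window:
#sketch: the three slices should sum to the full window
print(len(preKorrasami), len(subtextKorrasami), len(postKorrasami))
print(len(preKorrasami) + len(subtextKorrasami) + len(postKorrasami) == len(full))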
Next, I check the dataframes to make sure the months are all correct by looking at the first and last fanfiction for each dataframe.
preKorrasami[:1]
preKorrasami[-1:]
subtextKorrasami[:1]
subtextKorrasami[-1:]
postKorrasami[:1]
postKorrasami[-1:]
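A more compact version of the same check (a sketch) prints the earliest and latest month in each slice:
#sketch: print the index bounds of each time slice
for name, frame in [('preKorrasami', preKorrasami), ('subtextKorrasami', subtextKorrasami), ('postKorrasami', postKorrasami)]:
    print(name, frame.index.min(), 'to', frame.index.max())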
Since the dataframes are all confirmed to be within the proper timeframes, I will now use the functions to take the body of text from each published fanfiction, tokenize it, remove stopwords/punctuation, and stem it. Then, I will use the "save_txt" function to save each corpus as a text file so they can be used in the computational text analysis notebook.
First, I will save the three corpora as their uncleaned versions.
preKorra_string = column_to_string(preKorrasami,'body')
subtextKorra_string = column_to_string(subtextKorrasami,'body')
postKorra_string = column_to_string(postKorrasami,'body')
print(preKorra_string[:20])
print(subtextKorra_string[:20])
print(postKorra_string[:20])
save_string_txt(preKorra_string,'../../../data/korra/korra2018/time/preKorrasami_unclean.txt')
save_string_txt(subtextKorra_string,'../../../data/korra/korra2018/time/subtextKorrasami_unclean.txt')
save_string_txt(postKorra_string,'../../../data/korra/korra2018/time/postKorrasami_unclean.txt')
Next, I will create and save the three cleaned corpora.
preKorra_token = column_to_token(preKorrasami,'body')
preKorra_clean = cleaning(preKorra_token)
preKorra_clean[:20]
# SAVING THIS
save_txt(preKorra_clean,'../../../data/korra/korra2018/time/preKorrasami.txt')
subtextKorra_token = column_to_token(subtextKorrasami,'body')
subtextKorra_clean = cleaning(subtextKorra_token)
subtextKorra_clean[:20]
save_txt(subtextKorra_clean,'../../../data/korra/korra2018/time/subtextKorrasami.txt')
postKorra_token = column_to_token(postKorrasami,'body')
postKorra_clean = cleaning(postKorra_token)
postKorra_clean[:20]
# SAVING THIS
save_txt(postKorra_clean,'../../../data/korra/korra2018/time/postKorrasami.txt')