Text Preparation for Computational Text Analysis

Cara Marta Messina
PhD Candidate in English, Writing & Rhetoric
Northeastern University
Published for The Journal of Writing Analytics, Volume 3 and The Critical Fan Toolkit.

This notebook will demonstrate the code used to prepare the "body" of The Legend of Korra fanfiction texts published on AO3. The next step is to use the saved strings to perform computational text analysis.

In [75]:
#pandas for working with dataframes
import pandas as pd

#regular expression library
import re

#numpy specifically works with numbers
import numpy as np

#nltk libraries
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import string
In [76]:
#I am not saving the data in this GitHub folder
allkorra = pd.read_csv('../../../data/korra/korra2018/allkorra.csv')

#providing a preview of the data. 
allkorra.head(1)
Out[76]:
Unnamed: 0 work_id title rating category fandom relationship character additional tags language ... status status date words chapters comments kudos bookmarks hits body month
0 0 6388009 A More Perfect Union General Audiences Gen Avatar: Legend of Korra NaN Noatak (Avatar), Tarrlok (Avatar), Amon (Avatar) Alternate Universe English ... Updated 2018-03-14 8139 4/? 11 27 4 286 He's forgotten how to be warm. The thought wou... 2016-03

1 rows × 21 columns

Text Preparation

In this section, I will demonstrate how to prepare the "body" column of the data, which contains the full text of each published fanfiction in the dataset.

There are several different functions that this section defines:

  • "column_to_token" takes the text from a column, joins it into a single string, and then tokenizes that string. Tokenizing splits a string of words/characters into a list of individual word and punctuation tokens
  • the "cleaning" function takes the tokenized list and does several things: lowercases all the words, removes all the stopwords saved in the cell below, removes punctuation, and stems the words using NLTK's Porter stemmer
    • I used NLTK's list of stopwords, but copied it into my own list because the NLTK stopwords and punctuation lists missed a few key characters, especially for this corpus. For example, the double hyphen "--" is used quite often in fanfictions, so I added it to the stopwords below
  • "save_txt" takes the tokenized, cleaned list of words from the corpus, turns the list back into a string, and then saves that string as a file
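As a minimal sketch of the tokenize → lowercase → de-stopword → stem pipeline these functions implement, the toy below runs the same steps on one sentence. The `str.split()` call and the `strip_suffix()` helper are crude stand-ins for NLTK's `word_tokenize` and `PorterStemmer`, used here only so the sketch runs without NLTK's data files:

```python
# Toy walk-through of the pipeline: tokenize -> lowercase ->
# drop stopwords/punctuation -> stem.
import string

stopwords_ = {'the', 'a', 'to', 'and', 'he', 'his', '--'}

def strip_suffix(word):
    # crude stand-in for PorterStemmer: drop a few common endings
    for suffix in ('ing', 'ed', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

text = "He listened to the war stories -- and enjoyed them"
tokens = text.split()                                  # tokenize
tokens = [w.lower() for w in tokens]                   # lowercase
tokens = [w for w in tokens if w not in stopwords_]    # drop stopwords
tokens = [w for w in tokens if w not in string.punctuation]
tokens = [strip_suffix(w) for w in tokens]             # "stem"
print(tokens)  # ['listen', 'war', 'storie', 'enjoy', 'them']
```

Note that a real stemmer produces non-word stems like "storie"; this is expected, since stemming only needs to map related word forms to a shared token.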
In [63]:
stopwords_ = ['``','`','--','...',"'m","''","'re","'ve",'i','me', 'my',  'myself', "“", "”", 'we', 'our', '’', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her','hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', "he's","she's","they're","they've","i've", 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'do',"n't","don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", "'t", "'s", 'wasn', "wasn't", 'weren', "weren't", "would", "could", 'won', "won't", 'wouldn', "wouldn't"]
In [89]:
def column_to_string(dataframe,columnName):
    '''
    this function takes all the information from a specific column and joins it into a string. This is NOT for cleaning the text, just for saving the string.
    Input: the name of the dataframe and the column name
    Output: a string of all the words/characters from a column
    '''
    corpus_string = ' '.join(dataframe[columnName].tolist())
    return corpus_string

def column_to_token(dataframe,columnName):
    '''
    this function takes all the information from a specific column, joins it into a string, and then tokenizes that string.
    Input: the name of the dataframe and the column name
    Output: a tokenized list of all the characters from that specific column
    '''
    corpus_string = ' '.join(dataframe[columnName].tolist())
    corpus_token = word_tokenize(corpus_string)
    return corpus_token

def cleaning(token_text):
    '''
    This function takes a tokenized list of words/punctuation and applies several "cleaning" methods.
    It first lowercases everything, then removes the stop words and punctuation, and finally stems each word. The stop words are listed above, and the punctuation list comes from Python's string module (in my "imports")
    Input: the tokenized list of the text
    Output: the tokenized list of the text, lowercased, with stop words and punctuation removed, and stemmed
    '''
    stemmer = PorterStemmer()

    text_lc = [word.lower() for word in token_text]
    text_tokens_clean = [word for word in text_lc if word not in stopwords_]
    text_tokens_clean_ = [word for word in text_tokens_clean if word not in string.punctuation]
    tokens_stemmed = [stemmer.stem(w) for w in text_tokens_clean_]
    return tokens_stemmed

def save_txt(cleanToken,filePath):
    '''
    take the tokenized list of words, convert it to a string, and save it
    input: the tokenized list of words and your new filepath
    output: a saved file of the full corpus' string
    '''
    clean_string = " ".join(cleanToken)
    with open(filePath, "w") as outfile:
        outfile.write(clean_string)

def save_string_txt(corpus_string,filePath):
    '''
    take the string of words and save it
    input: the string of words and your new filepath
    output: a saved file of the full corpus' string
    '''
    with open(filePath, "w") as outfile:
        outfile.write(corpus_string)

Full Corpus

While I will not be using the full corpus for this particular project, I will be using it in future projects. I am saving the full corpus now so I do not have to do it later.

In [82]:
full_corpus = cleaning(column_to_token(allkorra,'body'))
full_corpus[:5]
Out[82]:
['forgotten', 'warm', 'thought', 'sent', 'noatak']
In [98]:
# SAVE THE FULL CORPUS 
# save_txt(full_corpus,'../../../data/korra/korra2018/allkorra.txt')

Creating Three Corpora

Based on the computational temporal analysis conducted in previous notebooks, I will separate the overall corpus into three smaller corpora:

  • pre-Korrasami: from 2011-02 to 2014-07, or February 2011 to July 2014
  • subtext-Korrasami: from 2014-08 to 2014-11, or August 2014 to November 2014
  • post-Korrasami: from 2014-12 to 2015-03, or December 2014 to March 2015

I decided to stop at March 2015 because this is when the publishing peak begins to die down.

To create the smaller corpora out of the larger corpus, I follow the same steps as above, but first create three separate dataframes that reflect these spans of time. As the cell below shows, I set the month as the index and then sorted it in ascending order (earlier dates to later dates).
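The index-and-slice approach can be sketched on a tiny made-up dataframe (the rows and months below are hypothetical): set 'month' as the index, sort it, then slice with .loc using string labels, which on a sorted index is inclusive of both endpoints.

```python
import pandas as pd

# hypothetical rows standing in for the fanfiction metadata
df = pd.DataFrame({
    'month': ['2014-12', '2011-02', '2014-08'],
    'body':  ['fic c', 'fic a', 'fic b'],
})

# set the month as the index and sort ascending, as in the cell below
month_ff = df.set_index('month').sort_index(ascending=True)

# .loc label slicing includes both endpoints; '2014-08' falls outside
# '2011-02':'2014-07', so only the first row is selected
pre = month_ff.loc['2011-02':'2014-07']
print(pre['body'].tolist())  # ['fic a']
```

Because the months are zero-padded YYYY-MM strings, lexicographic comparison on the sorted index matches chronological order, which is what makes string-label slicing safe here.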

In [37]:
month_ff = allkorra.set_index('month')
month_ff = month_ff.sort_index(ascending=True)
month_ff[:1]
Out[37]:
Unnamed: 0 work_id title rating category fandom relationship character additional tags language published status status date words chapters comments kudos bookmarks hits body
month
2011-02 7485 165877 Kato of the Fire Nation Not Rated Multi Avatar: The Last Airbender, Avatar: Legend of ... Mai/Zuko (Avatar), Sokka/Suki (Avatar), Aang (... Zuko (Avatar), Mai (Avatar), Sokka (Avatar), S... Original Characters - Freeform English 2011-02-26 Completed 2011-02-26 4180 1/1 3 107 17 2396 When Kato listens to his father's war stories,...
In [57]:
preKorrasami = month_ff.loc['2011-02':'2014-07']
subtextKorrasami = month_ff.loc['2014-08':'2014-11']
postKorrasami = month_ff.loc['2014-12':'2015-03']
In [99]:
full = month_ff.loc['2011-02':'2015-03']
len(full.index)
Out[99]:
3759

Next, I check the dataframes to make sure the months are all correct by looking at the first and last fanfiction for each dataframe.

PreKorrasami

In [18]:
preKorrasami[:1]
Out[18]:
Unnamed: 0 work_id title rating category fandom relationship character additional tags language published status status date words chapters comments kudos bookmarks hits body
month
2011-02 7485 165877 Kato of the Fire Nation Not Rated Multi Avatar: The Last Airbender, Avatar: Legend of ... Mai/Zuko (Avatar), Sokka/Suki (Avatar), Aang (... Zuko (Avatar), Mai (Avatar), Sokka (Avatar), S... Original Characters - Freeform English 2011-02-26 Completed 2011-02-26 4180 1/1 3 107 17 2396 When Kato listens to his father's war stories,...
In [17]:
preKorrasami[-1:]
Out[17]:
Unnamed: 0 work_id title rating category fandom relationship character additional tags language published status status date words chapters comments kudos bookmarks hits body
month
2014-07 5729 1948089 maquette General Audiences Gen Avatar: Legend of Korra NaN Huan (Avatar), Toph Bei Fong NaN English 2014-07-13 Completed 2014-07-13 1627 1/1 4 99 17 790 One afternoon, his mother rounds up the entire...

Subtext Korrasami

In [20]:
subtextKorrasami[:1]
Out[20]:
Unnamed: 0 work_id title rating category fandom relationship character additional tags language published status status date words chapters comments kudos bookmarks hits body
month
2014-08 5618 2186247 Vengeance...Is Not So Sweet Teen And Up Audiences NaN Avatar: Legend of Korra NaN Lieutenant (Avatar), Amon | Noatak, Amon (Avat... Ghazan is Lieutenant's older brother, Headcano... English 2014-08-23 Completed 2014-08-23 5157 1/1 2 3 null 245 He remembered the day as though it were yester...
In [21]:
subtextKorrasami[-1:]
Out[21]:
Unnamed: 0 work_id title rating category fandom relationship character additional tags language published status status date words chapters comments kudos bookmarks hits body
month
2014-11 5222 2595557 95% Cotton Teen And Up Audiences F/F Avatar (TV), Avatar: Legend of Korra Korra/Kuvira, One-Sided - Relationship, Korra/... Korra, Kuvira, Bolin, Baatar Jr, Asami Sato LaundryVerse, Alternate Universe - Modern Setting English 2014-11-10 Completed 2014-11-10 4128 1/1 25 289 33 5940 In a moment of horror Korra realises that this...

PostKorrasami

In [22]:
postKorrasami[:1]
Out[22]:
Unnamed: 0 work_id title rating category fandom relationship character additional tags language published status status date words chapters comments kudos bookmarks hits body
month
2014-12 4872 2828432 The Things You Leave Behind Not Rated F/F Avatar: Legend of Korra Korra/Asami Sato Korra (Avatar), Asami Sato, Mako (Avatar), Bol... Canon Queer Relationship, Canon Queer Characte... English 2014-12-22 Updated 2014-12-27 5076 2/? 12 159 19 3629 \n (See the end of the chapter for ...
In [23]:
postKorrasami[-1:]
Out[23]:
Unnamed: 0 work_id title rating category fandom relationship character additional tags language published status status date words chapters comments kudos bookmarks hits body
month
2015-03 3597 3499004 Humanity Mature F/F Avatar: Legend of Korra Korra/Asami Sato Korra (Avatar), Asami Sato Romance, Alternate Universe - Modern Setting English 2015-03-07 Completed 2015-05-29 95215 37/37 160 837 107 20079 "More!" Desperate, panicked, a voice recognizi...

Transforming Dataframes to Text Files

Since the dataframes are all confirmed to be within the proper timeframes, I will now use the functions to take the body of text from each published fanfiction, tokenize it, remove stopwords/punctuation, and stem it. Then, I will use the "save_txt" function to save each corpus as a text file so they can be used in the computational text analysis notebook.
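The per-corpus pattern just described (tokenize, clean, save) can be sketched as one loop over named corpora. The simplified `clean()` helper and the temporary output directory below are stand-ins so the sketch is self-contained and runs without NLTK; the notebook itself uses the real `column_to_token`, `cleaning`, and `save_txt` functions on each dataframe's 'body' column.

```python
import os
import string
import tempfile

def clean(text):
    # stand-in for column_to_token + cleaning: split on whitespace,
    # lowercase, and drop bare punctuation tokens (no stemming here)
    return [w.lower() for w in text.split() if w not in string.punctuation]

# hypothetical corpus texts keyed by corpus name
corpora = {
    'preKorrasami': 'When Kato listens',
    'subtextKorrasami': 'He remembered the day',
    'postKorrasami': 'Everyone in the room',
}

outdir = tempfile.mkdtemp()
for name, text in corpora.items():
    # rejoin the cleaned tokens and write one .txt file per corpus
    path = os.path.join(outdir, name + '.txt')
    with open(path, 'w') as outfile:
        outfile.write(' '.join(clean(text)))

print(sorted(os.listdir(outdir)))
```

Writing each corpus inside the loop keeps the three near-identical save cells below from drifting apart if the cleaning steps ever change.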

Saving the Unclean Corpora

First, I will save the three corpora as their uncleaned versions.

In [86]:
preKorra_string = column_to_string(preKorrasami,'body')
subtextKorra_string = column_to_string(subtextKorrasami,'body')
postKorra_string = column_to_string(postKorrasami,'body')
In [87]:
print(preKorra_string[:20])
print(subtextKorra_string[:20])
print(postKorra_string[:20])
When Kato listens to
He remembered the da

          (See the
In [90]:
save_string_txt(preKorra_string,'../../../data/korra/korra2018/time/preKorrasami_unclean.txt')
save_string_txt(subtextKorra_string,'../../../data/korra/korra2018/time/subtextKorrasami_unclean.txt')
save_string_txt(postKorra_string,'../../../data/korra/korra2018/time/postKorrasami_unclean.txt')

Saving the Cleaned Corpora

Next, I will create and save the three cleaned corpora.

In [67]:
preKorra_token = column_to_token(preKorrasami,'body')
preKorra_clean = cleaning(preKorra_token)
preKorra_clean[:20]
Out[67]:
['kato',
 'listen',
 'father',
 'war',
 'stori',
 'listen',
 'great',
 'raptur',
 'especi',
 'enjoy',
 'hear',
 'tale',
 'master',
 'waterbend',
 'father',
 'travel',
 'enjoy',
 'listen',
 'father',
 'fought']
In [68]:
# SAVING THIS 
save_txt(preKorra_clean,'../../../data/korra/korra2018/time/preKorrasami.txt')
In [69]:
subtextKorra_token = column_to_token(subtextKorrasami,'body')
subtextKorra_clean = cleaning(subtextKorra_token)
subtextKorra_clean[:20]
Out[69]:
['rememb',
 'day',
 'though',
 'yesterday',
 'age',
 'fourteen',
 'fate',
 'dark',
 'day',
 'day',
 'hatr',
 'bender',
 'came',
 'day',
 'especi',
 'moment',
 'stare',
 'bender',
 'resent',
 'everi']
In [70]:
save_txt(subtextKorra_clean,'../../../data/korra/korra2018/time/subtextKorrasami.txt')
In [71]:
postKorra_token = column_to_token(postKorrasami,'body')
postKorra_clean = cleaning(postKorra_token)
postKorra_clean[:20]
Out[71]:
['see',
 'end',
 'chapter',
 'note',
 'everyon',
 'room',
 'explod',
 'bright',
 'blind',
 'purpl',
 'flash',
 'crumbl',
 'around',
 'like',
 'shadow',
 'melt',
 'away',
 'light',
 'like',
 'time']
In [72]:
# SAVING THIS 
save_txt(postKorra_clean,'../../../data/korra/korra2018/time/postKorrasami.txt')