Computational Text Analysis: Word Embedding Models & Concordances

Cara Marta Messina
PhD Candidate in English, Writing & Rhetoric
Northeastern University
To be published in The Journal of Writing Analytics, Volume 3

Using the corpora created in the text preparation notebooks, this notebook applies several computational text analysis methods, including basic natural language processing (NLP) and word embedding models. The fourth notebook then uses concordances to "fold" the computational models back to the text (William Reed Quinn, forthcoming).

In [170]:
#pandas for working with dataframes
import pandas as pd

#nltk libraries
import nltk
from nltk import word_tokenize
from nltk.util import ngrams

#word2vec models
import gensim
from gensim.models.word2vec import LineSentence
from gensim.models import Word2Vec as wv

First, I created a function that reads a text file containing a string of words, tokenizes that string, and returns the tokens as a list ready for analysis. I then applied this function to each of the text corpora I created.

In [29]:
def read_txt(filePath):
    '''
    This function reads a text file and tokenizes its contents
    Input: the filepath of a .txt file containing a string of words
    Output: a tokenized list of words
    '''
    #the with statement closes the file even if reading fails
    with open(filePath, "r") as file:
        new_string = file.read()
    corpus_token = word_tokenize(new_string)
    return corpus_token
In [183]:
preKorrasami = read_txt('../../../data/korra/korra2018/time/preKorrasami.txt')
subtextKorrasami = read_txt('../../../data/korra/korra2018/time/subtextKorrasami.txt')
postKorrasami = read_txt('../../../data/korra/korra2018/time/postKorrasami.txt')
In [31]:
type(preKorrasami)
Out[31]:
list
In [32]:
print(preKorrasami[:10])
print(subtextKorrasami[:10])
print(postKorrasami[:10])
['kato', 'listen', 'father', 'war', 'stori', 'listen', 'great', 'raptur', 'especi', 'enjoy']
['rememb', 'day', 'though', 'yesterday', 'age', 'fourteen', 'fate', 'dark', 'day', 'day']
['see', 'end', 'chapter', 'note', 'everyon', 'room', 'explod', 'bright', 'blind', 'purpl']

Exploring The Corpora

Using basic Natural Language Processing methods (parts of speech and word counters), I will use the functions below to explore trends in each corpus as a place to begin pulling together results.

The function below (POS_tag) does several things:

  • counts all the words in the tokenized text, which serves as the denominator for each ratio
  • creates frequency distributions of the words, bigrams, and trigrams
  • tags the text using NLTK's part-of-speech tagger
  • counts the nouns, verbs, adjectives, and prepositions used (although most prepositions will probably be missing because the stop words have been removed)
  • using the printWordRatio function, prints the ratios of the most frequent words, bigrams, trigrams, nouns, verbs, adjectives, and prepositions
    • each ratio is calculated by dividing the item's count by the overall word count
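
As a rough illustration of the ratio described above, the sketch below counts words and bigrams in a toy token list and divides each count by the total number of tokens. It uses `collections.Counter` (which `nltk.FreqDist` subclasses) and a `zip`-based n-gram helper so that it runs without NLTK. The token list and the helper names `make_ngrams` and `top_ratios` are made up for illustration and are not part of the notebook.

```python
from collections import Counter

def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple.
    Hypothetical stand-in for nltk.util.ngrams."""
    return list(zip(*(tokens[i:] for i in range(n))))

def top_ratios(items, wordcount, num):
    """Return (item, count / wordcount) pairs for the num most common items."""
    counts = Counter(items)
    return [(item, count / wordcount) for item, count in counts.most_common(num)]

# Toy, already-stemmed tokens (illustrative only, not taken from the corpora)
tokens = ['korra', 'look', 'asami', 'korra', 'look', 'korra', 'smile', 'asami']
wordcount = len(tokens)

# Both ratios use the overall word count as the denominator
word_ratios = top_ratios(tokens, wordcount, 2)
bigram_ratios = top_ratios(make_ngrams(tokens, 2), wordcount, 1)

print(word_ratios)
print(bigram_ratios)
```

Note that the bigram and trigram counts are divided by the number of words rather than by the number of n-grams, so the ratios for different n-gram sizes share one denominator within a corpus.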

Although I did not end up using this function or these results in the forthcoming Journal of Writing Analytics article, I am still publishing the function so that others may use it. I have also used this function in other research.

In [33]:
def printWordRatio(freqDist,wordcount,stringName,num):
    '''
    This takes a frequency distribution, which holds basic frequency counts, and prints the most common words/nouns/nGrams/verbs/etc
    input: the frequency distribution, the wordcount of the overall text, a string that labels the output, the number of most common items to print
    output: a printed list of the most common words/nGrams and the ratio of their appearance (found by dividing the number of times each appears by the overall wordcount of the corpus)
    '''
    print("Word count for text:")
    print(wordcount)
    print("________________________")
    print(stringName)
    for tup0,tup1 in freqDist.most_common(num):
        print(tup0, tup1/wordcount)

def POS_tag(text,num):
    '''
    This takes a tokenized text, tags it with parts of speech, and then counts the most frequent words used for particular parts of speech. 
    input: a tokenized text (could be clean or not)
    output: a printed list of the most frequent words tagged in different parts of speech
    '''
    
    #word count!
    wordcount=(len(text))
    
    text_word_frequency = nltk.FreqDist(text)
    
    #do some bigram stuff
    bigram = list(ngrams(text,2))
    biGramFreq = nltk.FreqDist(bigram)
    
    #do some trigram stuff
    triGram = list(ngrams(text,3))
    triGramFreq = nltk.FreqDist(triGram)
    
    #then use parts of speech tag!
    text_tagged = nltk.pos_tag(text)
    
    #count the number of nouns
    text_nouns = [word for word,pos in text_tagged if pos in ('NN','NNS')]
    text_freq_nouns=nltk.FreqDist(text_nouns)
    
    #count the verbs
    text_verbs = [word for word,pos in text_tagged if pos in ('VB','VBD','VBG','VBN','VBP','VBZ')]
    text_freq_verbs=nltk.FreqDist(text_verbs)
    
    #count the adjectives
    text_adj = [word for word,pos in text_tagged if pos in ('JJ','JJR','JJS')]
    text_freq_adj=nltk.FreqDist(text_adj)
    
    #count the prepositions
    text_prep = [word for word,pos in text_tagged if pos == 'IN']
    text_freq_prep=nltk.FreqDist(text_prep)
    
    printWordRatio(text_word_frequency,wordcount,"Most frequent words:",num)
    printWordRatio(biGramFreq,wordcount,"Most frequent bigrams:",num)
    printWordRatio(triGramFreq,wordcount,"Most frequent trigrams:",num)
    printWordRatio(text_freq_nouns,wordcount,"Most frequent nouns:",num)
    printWordRatio(text_freq_verbs,wordcount,"Most frequent verbs:",num)
    printWordRatio(text_freq_adj,wordcount,"Most frequent adjectives:",num)
    printWordRatio(text_freq_prep,wordcount,"Most frequent prepositions:",num)
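
Because POS_tag prints its results and returns nothing, the ratios cannot be reused in later cells. One possible alternative, sketched below under stated assumptions, groups already-tagged (word, tag) pairs by part of speech and returns the ratios as a dictionary. The function name `pos_ratios`, the `POS_GROUPS` mapping, and the hand-tagged sample are hypothetical; in this notebook the tagged pairs would instead come from `nltk.pos_tag(text)`.

```python
from collections import Counter

# Penn Treebank tags, grouped the same way POS_tag groups them
POS_GROUPS = {
    'nouns': {'NN', 'NNS'},
    'verbs': {'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'},
    'adjectives': {'JJ', 'JJR', 'JJS'},
    'prepositions': {'IN'},
}

def pos_ratios(tagged, num):
    """Return {group: [(word, count / wordcount), ...]} for the num most
    common words in each part-of-speech group."""
    wordcount = len(tagged)
    results = {}
    for group, tags in POS_GROUPS.items():
        counts = Counter(word for word, pos in tagged if pos in tags)
        results[group] = [(word, count / wordcount)
                          for word, count in counts.most_common(num)]
    return results

# Hand-tagged toy sample (illustrative only, not output from nltk.pos_tag)
tagged = [('korra', 'NN'), ('said', 'VBD'), ('good', 'JJ'), ('like', 'IN'),
          ('korra', 'NN'), ('hand', 'NN'), ('said', 'VBD'), ('look', 'VB')]

ratios = pos_ratios(tagged, 2)
print(ratios['nouns'])
```

Returning a dictionary would make it possible to place the three corpora side by side, for example in a pandas dataframe, rather than reading the printed lists one after another.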

Exploratory Results

In [34]:
#POS_tag prints its results and returns None, so there is nothing to assign
POS_tag(preKorrasami,40)
Word count for text:
4148808
________________________
Most frequent words:
korra 0.01296468768860839
back 0.007504806199756653
like 0.007289322619894678
look 0.006783153136997422
hand 0.006463061197336681
eye 0.005899043773536881
mako 0.0058226362849281045
said 0.005750085325712831
one 0.005730320612571129
go 0.005248013405296172
know 0.0052393362141607905
asami 0.004947204112602945
bolin 0.004761608635540618
get 0.004529734805756256
say 0.004497918438259856
time 0.004491169511821227
want 0.004260500847472334
face 0.00392184936010536
head 0.0038264002576161634
around 0.003810733106955058
'd 0.0037847015335489135
even 0.0036991347876305677
think 0.0035653614242934354
see 0.0033927817339341805
still 0.0033346927599445433
tri 0.0033231231717640343
way 0.0032845578778290052
feel 0.003279255149912939
make 0.003229361300884495
take 0.003177298154072206
right 0.0030664229340089974
away 0.0029854358167454363
turn 0.0029039666333076876
ask 0.0027839321559349093
arm 0.0027663367405770527
smile 0.0027562133509191076
amon 0.0027125863621551057
let 0.0026783596637877673
come 0.0026747441674813584
thing 0.0026438919323333353
Word count for text:
4148808
________________________
Most frequent bigrams:
('republ', 'citi') 0.000739730544291276
('look', 'like') 0.0004974922917618747
('shook', 'head') 0.0003866170716986662
('let', 'go') 0.0003764936820407211
('korra', 'said') 0.00036709339164405776
('first', 'time') 0.0003649240938602124
('water', 'tribe') 0.0003535955387667976
('arm', 'around') 0.0003499800424603886
('close', 'eye') 0.0003290101638832166
('feel', 'like') 0.0003261177668380894
('gon', 'na') 0.0002971937963868176
('fire', 'nation') 0.0002969527632997237
('roll', 'eye') 0.0002841780096837453
('air', 'templ') 0.0002709211898935791
('shake', 'head') 0.00026899292519682764
('korra', 'say') 0.0002497102782293131
('make', 'sure') 0.00024802304661965556
('felt', 'like') 0.0002453716826616223
('korra', 'look') 0.00024344341796487086
('even', 'though') 0.0002422382525294012
('deep', 'breath') 0.00023862275622299223
('turn', 'around') 0.00022994556508761072
('one', 'hand') 0.00022898143273923497
('avatar', 'korra') 0.00022199147321351096
('come', 'back') 0.0002207863077780413
('look', 'back') 0.00021885804308128986
('go', 'back') 0.00021042188503300226
('turn', 'back') 0.00020945775268462654
('pull', 'away') 0.00020487812402984182
('sound', 'like') 0.00020029849537505712
('templ', 'island') 0.00019837023067830568
('seem', 'like') 0.00019523680054608455
('mako', 'said') 0.00018993407263001808
('asami', 'said') 0.0001863185763236091
('look', 'away') 0.00018511341088813945
('take', 'care') 0.00017908758371079116
('korra', 'ask') 0.00017595415357857003
('wrap', 'around') 0.00017523105431728825
('head', 'back') 0.00017330278962053678
('open', 'eye') 0.00017016935948831568
Word count for text:
4148808
________________________
Most frequent trigrams:
('air', 'templ', 'island') 0.00019716506524283602
('wrap', 'arm', 'around') 0.00014028125668866817
('took', 'deep', 'breath') 0.00010099286349235733
('ba', 'sing', 'se') 9.520806940210296e-05
('polar', 'bear', 'dog') 8.773604370219109e-05
('take', 'deep', 'breath') 8.315641504740638e-05
('southern', 'water', 'tribe') 6.628409895083118e-05
('see', 'end', 'chapter') 6.48379004282676e-05
('end', 'chapter', 'note') 6.48379004282676e-05
('northern', 'water', 'tribe') 6.122240412185862e-05
('.*.*', '.', '*') 5.6642775467073914e-05
('arm', 'around', 'waist') 4.242182332853195e-05
('close', 'door', 'behind') 3.9770459370498705e-05
('arm', 'wrap', 'around') 3.9288393196310845e-05
('arm', 'around', 'shoulder') 3.9047360109216915e-05
('chief', 'bei', 'fong') 3.832426084793512e-05
('back', 'republ', 'citi') 3.832426084793512e-05
('korra', 'shook', 'head') 3.181636749639897e-05
('arm', 'around', 'neck') 3.109326823511717e-05
('cross', 'arm', 'chest') 2.9888102799647513e-05
('korra', 'roll', 'eye') 2.9888102799647513e-05
('put', 'arm', 'around') 2.916500353836572e-05
('one', 'last', 'time') 2.795983810289606e-05
('tripl', 'threat', 'triad') 2.7718805015802128e-05
('long', 'time', 'ago') 2.7236738841614268e-05
('lean', 'back', 'chair') 2.6754672667426404e-05
('take', 'step', 'back') 2.627260649323854e-05
('back', 'air', 'templ') 2.5308474144862813e-05
('put', 'hand', 'shoulder') 2.5067441057768882e-05
('hand', 'behind', 'back') 2.5067441057768882e-05
('make', 'feel', 'better') 2.434434179648709e-05
('lin', 'bei', 'fong') 2.313917636101743e-05
('hundr', 'year', 'war') 2.2657110186829567e-05
('white', 'lotu', 'guard') 2.2416077099735633e-05
('korra', 'close', 'eye') 2.2416077099735633e-05
('want', 'make', 'sure') 2.169297783845384e-05
('spread', 'across', 'face') 2.048781240298418e-05
('la', 'la', 'la') 2.024677931589025e-05
('fire', 'lord', 'zuko') 1.9523680054608457e-05
('turn', 'around', 'face') 1.9041613880420594e-05
Word count for text:
4148808
________________________
Most frequent nouns:
korra 0.008216335872857939
hand 0.006463061197336681
eye 0.005899043773536881
look 0.004599152334839308
time 0.004491169511821227
mako 0.004187708855169967
bolin 0.0038760530735575133
face 0.0036774418097921136
head 0.003347708546647615
way 0.0032845578778290052
asami 0.003110049922772999
thing 0.0026438919323333353
arm 0.002482640797067495
man 0.0022382332467542486
tri 0.0021671284860615386
water 0.0021008443871107074
room 0.0018952431638195838
avatar 0.0018928328329486446
day 0.0018448672486169521
lin 0.0018009992267658567
moment 0.0017525515762599763
feel 0.0017318227307698982
ask 0.0016144396173551536
shoulder 0.0016115472203100264
breath 0.0015703305624169641
door 0.0015469503529688528
finger 0.0015312832023077472
air 0.0015293549376109957
place 0.0015240522096949293
hair 0.0015192315479530506
work 0.00151489295238536
realli 0.0014951282392436575
word 0.0014710249305342643
smile 0.0014628298055730706
tenzin 0.0014452343902152135
side 0.001412453890370439
fire 0.00140883839406403
help 0.0014081152948027481
night 0.0014032946330608696
voic 0.001394376408838394
Word count for text:
4148808
________________________
Most frequent verbs:
said 0.005750085325712831
go 0.005088931567814177
know 0.004484420585382597
say 0.004435972934876717
want 0.0036422509790764
get 0.003465332693149454
think 0.0030266524746384985
make 0.0028412980306632652
take 0.002840815964489077
see 0.002584838825995322
korra 0.002297045320005168
look 0.0020622790931756783
thought 0.0020029849537505713
come 0.001996236027311941
made 0.0019275415974901707
got 0.0017513464108245068
felt 0.0016503535473321495
seem 0.0016276964371453198
took 0.0016038341615230206
turn 0.0015874439116006333
feel 0.001447162654911965
left 0.0013724423979128463
need 0.0013377336333713203
keep 0.001289044949778346
find 0.001278198460859119
let 0.0011670822077088166
put 0.0011653949760991591
came 0.0011475585276542082
knew 0.0010289702488039938
told 0.0010002873114398159
found 0.0009337621794018909
tell 0.0009024278780796797
gave 0.0008826631649379773
went 0.0008763963046735352
done 0.0008684422127994354
began 0.0008327693159095336
run 0.0008055325770679192
give 0.0007995067498905709
set 0.0007821523676198079
stand 0.0007652800515232327
Word count for text:
4148808
________________________
Most frequent adjectives:
good 0.00228065507008278
much 0.0018395645207008858
open 0.0018091943517270503
sure 0.0017725573224887726
right 0.0017265200028538318
last 0.0013442415267228562
korra 0.0012073347332535033
amon 0.0011791338620635132
smile 0.0011641898106636893
asami 0.0011608153474443744
small 0.001158646049660529
final 0.0011405685681284842
sigh 0.0010810333956162831
first 0.0010598224839520171
hard 0.0010289702488039938
mako 0.0009952256166108433
next 0.000993297351914092
close 0.0009732916056852956
littl 0.0009470189991920571
lean 0.0009340032124889848
second 0.0008520519628770481
new 0.0008481954334835451
old 0.0008060146432421071
enough 0.0007934809227132227
long 0.0007874550955358744
actual 0.0007626286875651994
someth 0.0007433460405976849
tri 0.0007423819082493092
least 0.0007332226509397398
notic 0.000711529673101286
kiss 0.0006970676878756501
live 0.0006777850409081356
best 0.0006563330961567756
give 0.0006459686734117365
young 0.0006404249124085761
bad 0.0006392197469731065
light 0.0006223474308765313
noatak 0.0006211422654410616
deep 0.0006146343720895255
wrong 0.0005922182949897898
Word count for text:
4148808
________________________
Most frequent prepositions:
like 0.007046843334278183
around 0.0032874502748741325
though 0.0016966319000541843
behind 0.0015611713051073947
toward 0.0013095327621813302
without 0.001137676171083357
across 0.0010928440168838857
onto 0.0009701581755530745
laugh 0.0007018883496175287
along 0.0006327118536215703
past 0.0004439829464270219
upon 0.0003875812040470419
within 0.0003237074359671501
de 0.00030948648382860813
whisper 0.00029381933316750255
near 0.00028008044720314847
except 0.00022994556508761072
although 0.00018535444397523336
oh 0.0001841492785397637
asami 0.00018222101384301226
whether 0.00017547208740438218
beyond 0.00016824109479156424
next 0.0001465481169531104
toss 0.00014413778608217108
love 0.0001405222897757621
accept 0.00013714782655644705
ago 0.00013642472729516526
allow 0.00011955241119859005
unless 0.00011931137811149612
among 0.000109187988453551
beifong 0.0001053314590600481
wind 9.978769805688767e-05
bolin 9.520806940210296e-05
aang 9.062844074731827e-05
anyth 8.966430839894253e-05
okay 8.677191135381536e-05
blind 8.653087826672143e-05
tear 8.195124961193673e-05
wound 7.520232317330664e-05
amon 7.447922391202486e-05
In [35]:
#POS_tag prints its results and returns None, so there is nothing to assign
POS_tag(subtextKorrasami,40)
Word count for text:
1506803
________________________
Most frequent words:
korra 0.022512564681647168
asami 0.017359269924469223
back 0.0070898451887871205
look 0.006833010021880763
said 0.006798499870255103
like 0.006744743672530517
hand 0.006082414224022649
one 0.005538879335918498
go 0.005427384999897133
eye 0.005426721343135101
know 0.005179841027659223
time 0.004944242877137887
want 0.004815493465303693
get 0.0047298817430015735
mako 0.004438536424469556
bolin 0.003800098619394838
around 0.003790143767964359
even 0.003702541075376144
'd 0.0035731280067799174
face 0.003559191214777247
see 0.003527999346961746
kuvira 0.003472252178951064
smile 0.0034709248654270003
head 0.0033873041134109766
ask 0.0033109835857773047
think 0.003287091942344155
avatar 0.0032625366421489737
opal 0.003170288352226535
still 0.0031663064116543437
way 0.003133787230314779
let 0.0031278143194564917
tri 0.0031191867815500767
feel 0.003095958794878959
make 0.0030714034946837773
turn 0.003009019759052776
say 0.0028683245255020067
lin 0.0028557150470234
realli 0.0028411145982586974
away 0.0028324870603522825
take 0.0028119137007292924
Word count for text:
1506803
________________________
Most frequent bigrams:
('republ', 'citi') 0.0008050156523447326
('korra', 'said') 0.000761214306050625
('asami', 'said') 0.0006517109403153564
('look', 'like') 0.000511679363526619
('shook', 'head') 0.0004658870469464157
('let', 'go') 0.0004559321955159367
('zhu', 'li') 0.00045394122522984093
('korra', 'asami') 0.00042208570065230825
('arm', 'around') 0.0004134581627458931
('water', 'tribe') 0.00039885771398119065
('first', 'time') 0.000395539430171031
('korra', 'look') 0.00036501121911756214
('asami', 'smile') 0.0003490834568287958
('felt', 'like') 0.00034443785949457226
('roll', 'eye') 0.0003338193513020614
('korra', 'ask') 0.00032386449987158243
('feel', 'like') 0.00031855524577532694
('gon', 'na') 0.0003132459916790715
('fire', 'nation') 0.00028802703472185815
('air', 'templ') 0.00028669972119779426
('even', 'though') 0.0002860360644357623
('earth', 'kingdom') 0.0002780721832913792
('deep', 'breath') 0.0002767448697673153
('korra', 'smile') 0.0002720992724330918
('spirit', 'world') 0.00026745367509886826
('close', 'eye') 0.0002661263615748044
('asami', 'korra') 0.00026479904805074055
('asami', 'look') 0.00026347173452667666
('make', 'sure') 0.0002628080777646447
('turn', 'around') 0.0002555078533822935
('asami', 'ask') 0.0002555078533822935
('korra', 'hand') 0.00025019859928603804
('rais', 'eyebrow') 0.00024422568842775066
('avatar', 'korra') 0.00023493449375930364
('look', 'korra') 0.0002342708369972717
('turn', 'back') 0.00022829792613898432
('come', 'back') 0.0002256432990908566
('hand', 'korra') 0.00022497964232882467
('korra', 'eye') 0.000221661358518665
('eye', 'widen') 0.0002209977017566331
Word count for text:
1506803
________________________
Most frequent trigrams:
('air', 'templ', 'island') 0.00019046949070316424
('wrap', 'arm', 'around') 0.00018980583394113233
('took', 'deep', 'breath') 0.0001572866526015677
('see', 'end', 'chapter') 0.0001320676956443543
('end', 'chapter', 'note') 0.0001320676956443543
('ba', 'sing', 'se') 0.00012476747126200307
('southern', 'water', 'tribe') 0.0001128216495454283
('arm', 'around', 'asami') 6.636567620319312e-05
('asami', 'shook', 'head') 6.636567620319312e-05
('korra', 'shook', 'head') 6.304739239303347e-05
('arm', 'around', 'korra') 6.105642210693767e-05
('polar', 'bear', 'dog') 6.039276534490574e-05
('asami', 'roll', 'eye') 5.972910858287381e-05
('back', 'republ', 'citi') 5.972910858287381e-05
('take', 'deep', 'breath') 5.508351124865029e-05
('korra', 'roll', 'eye') 5.3756197724586424e-05
('5', 'first', 'time') 4.8446943628330974e-05
('hair', 'behind', 'ear') 4.7783286866299046e-05
('arm', 'wrap', 'around') 4.579231658020325e-05
('korra', 'rais', 'eyebrow') 4.446500305613939e-05
('asami', 'bit', 'lip') 4.446500305613939e-05
('asami', 'rais', 'eyebrow') 4.380134629410746e-05
('3', 'crack', 'fic') 4.3137689532075525e-05
('asami', 'eye', 'widen') 3.915574895988394e-05
('ran', 'hand', 'hair') 3.849209219785201e-05
('northern', 'water', 'tribe') 3.650112191175622e-05
('took', 'step', 'back') 3.583746514972428e-05
('korra', 'eye', 'widen') 3.5173808387692354e-05
('korra', 'look', 'asami') 3.384649486362849e-05
('korra', 'close', 'eye') 3.384649486362849e-05
('lean', 'back', 'chair') 3.318283810159656e-05
('korra', 'said', 'softli') 3.318283810159656e-05
('water', 'tribe', 'girl') 3.2519181339564626e-05
('close', 'door', 'behind') 3.2519181339564626e-05
('asami', 'look', 'korra') 3.2519181339564626e-05
('three', 'year', 'ago') 3.18555245775327e-05
('make', 'feel', 'better') 3.119186781550077e-05
('gon', 'na', 'go') 3.052821105346883e-05
('long', 'time', 'ago') 3.052821105346883e-05
('arm', 'around', 'waist') 2.9200897529404972e-05
Word count for text:
1506803
________________________
Most frequent nouns:
korra 0.014269947697210584
asami 0.011050548744593686
hand 0.006082414224022649
eye 0.005426721343135101
time 0.004944242877137887
look 0.004678116515563083
face 0.0033620851564537635
mako 0.003190198055087493
way 0.003133787230314779
bolin 0.0030800310325901927
head 0.002952608934280062
thing 0.00260485279097533
avatar 0.002478758006189263
arm 0.0023977918812213674
kuvira 0.002121710668216084
room 0.002092509770686679
tri 0.0020400808864861563
ask 0.001988979315809698
day 0.001978360807617187
lin 0.0019425233424674626
woman 0.001880803263598493
smile 0.001880803263598493
moment 0.001783245719579799
realli 0.0017726272113872882
feel 0.0016843608620370413
water 0.0016744060106065623
help 0.0016551599645076363
let 0.0016292773507883911
shoulder 0.001580166750398028
door 0.0015682209286814533
girl 0.0015655663016333257
place 0.001533710777055793
work 0.0015303924932456334
hair 0.0015071645065745157
man 0.0014441171141814822
move 0.0014388078600852268
air 0.00142619838160662
year 0.0014215527842723966
world 0.001417570843700205
tenzin 0.0014135889031280135
Word count for text:
1506803
________________________
Most frequent verbs:
said 0.006798499870255103
go 0.005264125436437278
know 0.004401371645795768
want 0.004139890881555187
korra 0.003974640347809236
get 0.0036872769698494097
say 0.002826514149493995
think 0.0027734216085314404
see 0.0027236473513790457
make 0.0026931191403255766
take 0.002486721887333646
made 0.0023393900861625573
felt 0.00227833366405562
took 0.0020872005165904236
look 0.0020301260350556777
got 0.001988315659047666
thought 0.0019119951314139937
come 0.0018535933363551839
turn 0.0016372412319327742
seem 0.0014447807709435142
left 0.0013558507648312355
let 0.001320676956443543
feel 0.0013186859861574474
need 0.0013180223293954153
came 0.0012947943427242977
knew 0.0012894850886280423
put 0.0011448079145050813
keep 0.0011414896306949217
began 0.0010552142516307705
went 0.001051895967820611
find 0.0010392864893420041
found 0.0010339772352457488
gave 0.0010306589514355892
told 0.0010160585026708867
love 0.0010067673080024395
asami 0.0009238102127484482
tell 0.0009039005098874903
saw 0.0008893000611227878
held 0.0008508079689249358
stood 0.0007924061738661259
Word count for text:
1506803
________________________
Most frequent adjectives:
asami 0.004001186618290513
opal 0.00264268122641115
good 0.0021376384305048504
korra 0.0020122073024808154
much 0.0019624330453284207
open 0.0017507265382402345
sure 0.001706261535184095
right 0.001446108084467578
smile 0.001437480546561163
last 0.0012662571019569247
final 0.0012284286665211045
next 0.0012191374718526576
sigh 0.001177327095844646
small 0.0011521081388874326
first 0.001108970449355357
lean 0.0010757876112537604
new 0.001031322608197621
second 0.0009404016317992465
aita 0.0009390743182751826
littl 0.000925801183034544
kiss 0.0009204919289382885
close 0.000916509988366097
hard 0.0008713813285479257
notic 0.0008441714013046165
actual 0.0008348802066361694
long 0.0008322255795880418
best 0.0008070066226308283
old 0.0007658599033848486
mako 0.0007413046031896671
young 0.0007366590058554436
someth 0.0007346680355693478
enough 0.0007207312435666773
live 0.0007147583327083899
tri 0.0006955122866094638
great 0.0006729479567003782
least 0.0006709569864142824
deep 0.0006696296728902185
bad 0.000665647732318027
amon 0.0006231736995479834
lin 0.0006231736995479834
Word count for text:
1506803
________________________
Most frequent prepositions:
like 0.006471317086573361
around 0.003267182239483197
though 0.0014998642821921644
toward 0.0013047491941547767
behind 0.0012981126265344575
without 0.001021367756767142
across 0.0008999185693152987
onto 0.0008023610252966048
laugh 0.0007997063982484771
asami 0.0006145461616415683
along 0.0004619051063742241
de 0.00043602249265497876
past 0.0004327042088448191
whisper 0.0003431105459705084
within 0.00031789158901329506
near 0.00031191867815500763
upon 0.00023360718023523978
oh 0.0002190067314705373
except 0.00019246046098926004
love 0.00018250560955878107
although 0.00017520538517642984
ago 0.00017321441489033404
next 0.00017188710136627018
whether 0.00016923247431814244
okay 0.00014202254707483327
beyond 0.0001274220983101308
throughout 0.00011680359011761989
allow 0.00011680359011761989
accept 0.00010618508192510899
unless 0.00010552142516307706
anyth 9.423926020853423e-05
bolin 9.357560344650229e-05
aita 9.357560344650229e-05
beifong 9.291194668447036e-05
toss 9.092097639837458e-05
among 8.893000611227878e-05
wind 8.826634935024685e-05
tear 8.229343849195946e-05
blind 7.565687087164015e-05
opal 6.902030325132084e-05
In [36]:
#POS_tag prints its results and returns None, so there is nothing to assign
POS_tag(postKorrasami,40)
Word count for text:
6156530
________________________
Most frequent words:
korra 0.024658046009684027
asami 0.020027028212320903
said 0.007534276613611888
look 0.007326854575548239
back 0.00732344356317601
like 0.006636855501394454
hand 0.005912746303518378
one 0.005779067104359111
go 0.005484420607062745
know 0.0053780295068813115
eye 0.005280572010531907
time 0.004998107700279216
get 0.004762260559113656
want 0.004560685970830971
even 0.003775990696057682
kuvira 0.003775665837736517
around 0.0037415557140142255
mako 0.003593582748723713
head 0.0035844867157311016
think 0.0035710050954027676
see 0.0035687310871546146
face 0.0034881662235057734
smile 0.003372191802849982
still 0.0033172907465731506
tri 0.0032677498525955366
way 0.0032633642652598136
avatar 0.0032594659654058373
say 0.003249557786610315
ask 0.0032175592419755933
'd 0.0031863728431437838
make 0.003169805068764385
feel 0.0031628206148593446
bolin 0.0031207514622685183
turn 0.0030617896769771285
take 0.003036288298765701
right 0.0028646006760301664
thing 0.00284819533081135
let 0.0028236685275634162
someth 0.0028209072318335165
well 0.0028105117655562468
Word count for text:
6156530
________________________
Most frequent bigrams:
('korra', 'said') 0.0008266019982035334
('republ', 'citi') 0.0007832334123280485
('asami', 'said') 0.0006802533245188442
('korra', 'asami') 0.000612520364556008
('look', 'like') 0.0005212351763087324
('shook', 'head') 0.0004842013276959586
('zhu', 'li') 0.0004141943594849696
('let', 'go') 0.00040867176802517004
('water', 'tribe') 0.0004075347639010936
('korra', 'look') 0.0003982763017479002
('korra', 'ask') 0.0003947028602150887
('spirit', 'world') 0.000379596948280931
('arm', 'around') 0.0003680644778795848
('close', 'eye') 0.00036172974061687347
('first', 'time') 0.0003417509538652455
('asami', 'smile') 0.0003362283624054459
('asami', 'korra') 0.0003360659332448636
('asami', 'look') 0.00033314220835438147
('deep', 'breath') 0.0003276196168945818
('feel', 'like') 0.0003203103046683765
('asami', 'ask') 0.0003065038260188775
('look', 'korra') 0.00030536682189480114
('earth', 'kingdom') 0.00029838236798976044
('make', 'sure') 0.00029643321806277235
('korra', 'smile') 0.0002949713556175313
('roll', 'eye') 0.00029237248904821387
('avatar', 'korra') 0.0002913979140847198
('gon', 'na') 0.00028928633499714936
('fire', 'nation') 0.0002855504643037555
('felt', 'like') 0.00028327645605560274
('even', 'though') 0.00027547985634765037
('look', 'asami') 0.00027450528138415635
('futur', 'industri') 0.0002736931355812446
('korra', 'hand') 0.0002637849567857218
('turn', 'around') 0.0002621606651798984
('look', 'back') 0.00025939936944999864
('look', 'around') 0.00025436406547194604
('see', 'end') 0.00025306463218728733
('turn', 'back') 0.0002512779114208816
('end', 'chapter') 0.00025079062393913455
Word count for text:
6156530
________________________
Most frequent trigrams:
('see', 'end', 'chapter') 0.000246242607442829
('end', 'chapter', 'note') 0.000246242607442829
('took', 'deep', 'breath') 0.00019199126780832708
('air', 'templ', 'island') 0.00016632746043631722
('wrap', 'arm', 'around') 0.00015885571904952954
('ba', 'sing', 'se') 0.00013903936145848392
('southern', 'water', 'tribe') 0.00011142640415948595
('korra', 'shook', 'head') 8.283887189699393e-05
('polar', 'bear', 'dog') 8.024000532767647e-05
('asami', 'shook', 'head') 7.585441799195326e-05
('take', 'deep', 'breath') 7.504227218904156e-05
('water', 'tribe', 'girl') 6.643352667817748e-05
('back', 'republ', 'citi') 5.9286643612554475e-05
('arm', 'around', 'korra') 5.831206864906043e-05
('arm', 'around', 'asami') 5.1652473065184444e-05
('arm', 'wrap', 'around') 4.8728748174702305e-05
('korra', 'roll', 'eye') 4.742931489004358e-05
('asami', 'roll', 'eye') 4.71044565688789e-05
('northern', 'water', 'tribe') 4.548016496305549e-05
('put', 'arm', 'around') 4.3855873357232074e-05
('korra', 'close', 'eye') 4.0120002663838234e-05
('korra', 'look', 'asami') 3.9957573503255896e-05
('rub', 'back', 'neck') 3.833328189743248e-05
('took', 'step', 'back') 3.7358706933938434e-05
('varrick', 'zhu', 'li') 3.7196277773356096e-05
('asami', 'look', 'korra') 3.670899029160907e-05
('wan', 'shi', 'tong') 3.3622836240544595e-05
('close', 'door', 'behind') 3.2973119598215226e-05
('asami', 'rais', 'eyebrow') 3.281069043763289e-05
('korra', 'asami', 'said') 3.24858321164682e-05
('make', 'feel', 'better') 3.24858321164682e-05
('pinch', 'bridg', 'nose') 3.134882799239182e-05
('korra', 'said', 'asami') 3.1023969671227134e-05
('cross', 'arm', 'chest') 3.1023969671227134e-05
('korra', 'rais', 'eyebrow') 3.0699111350062456e-05
('long', 'time', 'ago') 3.0374253028897772e-05
('hand', 'korra', 'shoulder') 2.9562107225986068e-05
('place', 'hand', 'shoulder') 2.9399678065403726e-05
('arm', 'around', 'waist') 2.9074819744239045e-05
('korra', 'took', 'deep') 2.874996142307436e-05
Word count for text:
6156530
________________________
Most frequent nouns:
korra 0.015561200871270017
asami 0.012508344798124918
hand 0.005912746303518378
eye 0.005280572010531907
look 0.005005254583344838
time 0.004998107700279216
face 0.0032758713106246537
way 0.0032633642652598136
head 0.003124649762122494
thing 0.00284819533081135
mako 0.002589932965485428
bolin 0.0025631321539893413
avatar 0.0025010842146468873
arm 0.0023388174832251283
kuvira 0.0022759573980797626
tri 0.002162906702314453
room 0.0019187756739591946
ask 0.0018911627166601965
day 0.0018750822297625448
moment 0.0018172574485952313
smile 0.0017997150992523387
water 0.0017860710497634219
place 0.0017630061089607294
work 0.0017048564694722515
feel 0.0016913748491439171
help 0.0016049625357141117
world 0.0016012266650207178
door 0.0015674413996195908
realli 0.0015549343542547508
lin 0.001540803017284087
woman 0.00153755443407244
shoulder 0.0015058807477588837
let 0.0014543907038542816
spirit 0.0014413963710076944
word 0.0014379853586354652
air 0.0014353864920661477
breath 0.001432625196336248
side 0.0013619685114829294
talk 0.0013471874578699365
move 0.0013424770122130486
Word count for text:
6156530
________________________
Most frequent verbs:
said 0.007534276613611888
go 0.005317118571662933
know 0.004565721274809024
korra 0.004315742796672801
want 0.0039276995320415885
get 0.0036523821048545205
say 0.003198230171866295
think 0.0030371004445686124
make 0.002769579617089497
see 0.0027146785608126657
take 0.0026867407451925027
look 0.0021888953680076276
made 0.0021229491288111972
thought 0.002025004344980046
come 0.0019878080672066896
took 0.001961332114031768
got 0.0019181259573168652
felt 0.0018926245791054377
turn 0.0016882886950928527
seem 0.0015947294985974242
need 0.0014238540216648015
feel 0.0013705772569937936
left 0.0013376041373955783
put 0.0012898499641843702
came 0.0012723076148414773
let 0.00121074696298077
keep 0.001125471653675041
find 0.0011236849329086351
began 0.0011162131915218474
asami 0.0011032188586752602
knew 0.001091199100792167
told 0.0010471807982743526
found 0.0010384096236029062
gave 0.0009307190901368141
went 0.0009183744739325562
walk 0.0008821527711226941
love 0.0008761428921811475
tell 0.0008733815964512477
done 0.0008537276680207844
happen 0.0008118209445905404
Word count for text:
6156530
________________________
Most frequent adjectives:
asami 0.004773468171193838
korra 0.0023243612879333
good 0.002083316413629106
much 0.00207665681804523
open 0.0018133591487412553
sure 0.0017284086977566908
right 0.0015978156526484887
opal 0.0014852522443649264
smile 0.0014128088387452022
last 0.0013605066490376884
small 0.0012066862339662114
final 0.0011984023467765121
next 0.00119547862188603
sigh 0.0010976962672154607
first 0.0010780423387849974
new 0.0010028376374353735
actual 0.0009978023334573209
close 0.0009893560171070391
lean 0.0009596314807204708
second 0.0009414394147352486
littl 0.0008712700173636773
notic 0.0008701330132396009
hard 0.000844469205867591
someth 0.0008350483145538152
long 0.0007931415911235712
best 0.000767315354590979
least 0.0007658534921457379
red 0.0007385653931679046
tri 0.0007338549475110167
deep 0.0007231346229125823
great 0.0007075414234966775
old 0.0007057547027302717
enough 0.0006997448237887251
kiss 0.0006987702488252311
young 0.0006856134868180615
live 0.0006661219875481806
light 0.0006591375336431398
mako 0.0006178805268552253
mean 0.0006146319436435785
give 0.0006066729147750437
Word count for text:
6156530
________________________
Most frequent prepositions:
like 0.006396297914572007
around 0.003207326204858906
though 0.001781847891588281
toward 0.001633225209655439
behind 0.0013929924811541567
without 0.0010858389384929498
across 0.0009307190901368141
laugh 0.0008624988426922308
onto 0.0008249777065977101
asami 0.0007224849062702529
along 0.0005288693468561024
past 0.0004630855368202543
within 0.0003459741120403864
near 0.0003186860130625531
whisper 0.00028084001864686766
upon 0.0002803527311651206
although 0.00021521863777160187
except 0.00021310705868403143
ago 0.00018630624718794515
love 0.00018468195558212175
whether 0.0001841946681003747
oh 0.000183382522297463
next 0.00018029636824639853
beyond 0.00017087547693262276
okay 0.00013611563656800178
accept 0.00012295887456083214
unless 0.00011581199149520916
de 0.0001120761208018153
allow 0.00010898996675075083
toss 0.0001076905334660921
throughout 9.956907543697505e-05
anyth 9.77823546705693e-05
beifong 9.22597632107697e-05
wind 8.949846748086991e-05
pout 8.787417587504649e-05
among 8.787417587504649e-05
tear 8.706203007213479e-05
blind 7.504227218904156e-05
alreadi 6.464680591177173e-05
wound 6.3347372627113e-05
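
The figures in these lists appear to be relative frequencies: a stem's count under a given part-of-speech tag divided by the word count printed with each list. A minimal sketch, assuming that reading is correct, converts a few of the noun figures back into approximate raw counts. The helper name `approx_count` is hypothetical and is not part of the notebook.

```python
# Total tokens, as printed under "Word count for text:" above.
TOTAL_WORDS = 6156530

# A few relative frequencies copied from the noun list above.
noun_freqs = {
    "korra": 0.015561200871270017,
    "asami": 0.012508344798124918,
    "hand": 0.005912746303518378,
}

def approx_count(rel_freq, total=TOTAL_WORDS):
    """Convert a relative frequency back to an approximate raw count.

    Assumes rel_freq was computed as count / total, which is an
    interpretation of the output, not something shown in this section.
    """
    return round(rel_freq * total)

for word, freq in noun_freqs.items():
    print(word, approx_count(freq), f"({freq:.2%} of all tokens)")
```

Reading the figures as shares of the whole corpus also makes the lists comparable across part-of-speech categories, since every list uses the same denominator.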

Word2Vec

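Using gensim's LineSentence class, which streams a text file one line at a time and splits each line into tokens on whitespace, I read in each of my corpus files and then trained a separate word2vec model on each. LineSentence does no further cleaning; the corpora were already cleaned in the text preparation notebooks.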

In [41]:
sent_preKorra = LineSentence('../../../data/korra/korra2018/time/preKorrasami.txt')
sent_subKorra = LineSentence('../../../data/korra/korra2018/time/subtextKorrasami.txt')
sent_postKorra = LineSentence('../../../data/korra/korra2018/time/postKorrasami.txt')
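
To make concrete what the models are fed, here is a rough pure-Python stand-in for the iteration LineSentence performs: one list of whitespace-separated tokens per line, with very long lines cut into chunks. The function name `iter_line_sentences` is hypothetical, the sample text is made up, and the 10,000-token cap is assumed to mirror gensim's default `max_sentence_length`; this is a sketch of the behavior, not gensim's actual code.

```python
import io

def iter_line_sentences(lines, max_sentence_length=10000):
    """Yield lists of tokens, one per line, splitting on whitespace.

    Lines longer than max_sentence_length tokens are cut into
    consecutive chunks so that no single "sentence" grows too long.
    """
    for line in lines:
        tokens = line.split()
        for start in range(0, len(tokens), max_sentence_length):
            yield tokens[start:start + max_sentence_length]

# A tiny in-memory stand-in for one of the corpus files (made-up content).
fake_file = io.StringIO("kato listen father war stori\nrememb day though yesterday\n")

for sentence in iter_line_sentences(fake_file):
    print(sentence)
```

One consequence worth checking: if a corpus file stores all of its text on a single line, the "sentences" the model sees are fixed-size chunks of tokens, not real sentences.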

Pre-Korrasami Corpus

In [63]:
#train the model: 20-token context window, drop stems seen fewer than 10 times, 4 worker threads
preKorra_w2v = wv(sent_preKorra, window=20, min_count=10, workers=4)
In [68]:
preKorra_w2v.wv.most_similar(['asami'], topn=20)
/Users/caramessina/anaconda3/lib/python3.6/site-packages/gensim/matutils.py:737: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.
  if np.issubdtype(vec.dtype, np.int):
Out[68]:
[('cheerlead', 0.6560064554214478),
 ('quarterback', 0.6462293267250061),
 ('heiress', 0.6236462593078613),
 ("'sami", 0.6234376430511475),
 ('girlfriend', 0.515772819519043),
 ('korra', 0.5117356777191162),
 ('footbal', 0.49044904112815857),
 ('hanok', 0.4889821410179138),
 ('babe', 0.46558764576911926),
 ('sato', 0.46101731061935425),
 ('li-dha', 0.4536556303501129),
 ('car', 0.4393694996833801),
 ('girl', 0.43191707134246826),
 ('viya', 0.4230498671531677),
 ('sanya', 0.42139139771461487),
 ('hulan', 0.4182591140270233),
 ('softli', 0.40957459807395935),
 ('stutter', 0.4036216139793396),
 ('haruhi', 0.39520490169525146),
 ('miss', 0.39332282543182373)]
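
The scores returned by most_similar are cosine similarities between word vectors, so values closer to 1 mean two stems tend to appear in more similar contexts. A minimal sketch with made-up three-dimensional vectors shows the calculation and the ranking step; real models use many more dimensions, and these numbers are not taken from the trained model. The names `cosine_similarity`, `rank_neighbors`, and `toy_vectors` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_neighbors(word, vectors, topn=3):
    """Return the topn other words ranked by cosine similarity to word."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Invented toy vectors, for illustration only.
toy_vectors = {
    "asami": [0.9, 0.1, 0.3],
    "heiress": [0.8, 0.2, 0.4],
    "korra": [0.5, 0.7, 0.1],
    "door": [-0.2, 0.1, 0.9],
}

for neighbor, score in rank_neighbors("asami", toy_vectors):
    print(neighbor, round(score, 3))
```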
In [67]:
preKorra_w2v.wv.most_similar(['girlfriend'], topn=20)
Out[67]:
[('boyfriend', 0.8421016931533813),
 ('date', 0.77755206823349),
 ('jealou', 0.7470348477363586),
 ('cute', 0.6559523940086365),
 ('total', 0.6552077531814575),
 ('hanok', 0.6398621797561646),
 ('romant', 0.6346932053565979),
 ('quarterback', 0.631979763507843),
 ('joke', 0.6248250603675842),
 ('cheerlead', 0.6194512248039246),
 ("'well", 0.6132126450538635),
 ('drama', 0.6126875877380371),
 ('ador', 0.5971378684043884),
 ('ex-boyfriend', 0.5900479555130005),
 ('girl', 0.5899331569671631),
 ('geez', 0.5846318602561951),
 ("'sami", 0.580193042755127),
 ('crazi', 0.5799494981765747),
 ('kinda', 0.573547899723053),
 ('footbal', 0.569688081741333)]
In [74]:
preKorra_w2v.wv.most_similar(['korra'], topn=20)
Out[74]:
[('asami', 0.5117356181144714),
 ('mako', 0.49496975541114807),
 ('cheerlead', 0.4835444390773773),
 ('quarterback', 0.4768112599849701),
 ('hanok', 0.4364459216594696),
 ('nervous', 0.42786461114883423),
 ('bolin', 0.42751795053482056),
 ('shuchun', 0.42369264364242554),
 ("'sami", 0.4183400869369507),
 ('li-dha', 0.4138311445713043),
 ('eska', 0.40702885389328003),
 ('girlfriend', 0.3953183591365814),
 ('sheepish', 0.38773053884506226),
 ('embarrass', 0.3870137929916382),
 ('heiress', 0.38474151492118835),
 ('naga', 0.38408178091049194),
 ('respond', 0.37803637981414795),
 ('stutter', 0.36565929651260376),
 ('nervou', 0.3590152859687805),
 ('okay', 0.35591644048690796)]
In [84]:
preKorra_w2v.wv.most_similar(['cheerlead'], topn=20)
Out[84]:
[('quarterback', 0.9433119893074036),
 ('footbal', 0.8608570694923401),
 ('hanok', 0.7478722333908081),
 ('li-dha', 0.7405582070350647),
 ("'sami", 0.7339423298835754),
 ('blush', 0.6835891008377075),
 ('pleasantli', 0.6664596796035767),
 ('babe', 0.6577969789505005),
 ('asami', 0.6560064554214478),
 ('viya', 0.6487923264503479),
 ('gorgeou', 0.638032853603363),
 ('wink', 0.6341010928153992),
 ('girlfriend', 0.6194513440132141),
 ('cute', 0.6056030988693237),
 ('chuckl', 0.6010433435440063),
 ('snicker', 0.5905696153640747),
 ('amaz', 0.5841995477676392),
 ('teasingli', 0.5779315233230591),
 ('hulan', 0.5650051236152649),
 ('stutter', 0.5603939294815063)]
In [90]:
preKorra_w2v.wv.most_similar(['heiress'], topn=20)
Out[90]:
[('gorgeou', 0.6264893412590027),
 ('asami', 0.6236463785171509),
 ('sato', 0.6088579297065735),
 ('girl', 0.5857400894165039),
 ('quarterback', 0.5717872381210327),
 ('beauti', 0.5587961673736572),
 ('businesswoman', 0.5566233396530151),
 ('girlfriend', 0.5515145063400269),
 ('cheerlead', 0.540593147277832),
 ('green-ey', 0.5291796326637268),
 ('tula', 0.5224825143814087),
 ('salaci', 0.521117091178894),
 ('memoris', 0.5210148692131042),
 ('ms', 0.5171505212783813),
 ('eyeshadow', 0.5109556913375854),
 ('hairstyl', 0.5050166249275208),
 ('feminin', 0.5038310289382935),
 ("'sami", 0.5028831362724304),
 ('lusciou', 0.49360984563827515),
 ('flatter', 0.49308541417121887)]
In [94]:
preKorra_w2v.wv.most_similar(['woman'], topn=20)
Out[94]:
[('young', 0.6998170614242554),
 ('man', 0.6571545004844666),
 ('women', 0.5473827719688416),
 ('ladi', 0.5403400659561157),
 ('older', 0.4757651686668396),
 ('elderli', 0.4643067717552185),
 ('tall', 0.4623173475265503),
 ('men', 0.46191948652267456),
 ('wizen', 0.44976165890693665),
 ('girl', 0.4437141716480255),
 ('husband', 0.4388071298599243),
 ('male', 0.43597882986068726),
 ('hana', 0.4241524636745453),
 ('wife', 0.42195069789886475),
 ('stranger', 0.4186514616012573),
 ('nia', 0.4169045686721802),
 ('midwif', 0.4082654118537903),
 ('statur', 0.40812942385673523),
 ('child', 0.3999372720718384),
 ('age', 0.39916473627090454)]
In [212]:
preKorra_w2v.wv.most_similar(['man'], topn=20)
Out[212]:
[('woman', 0.6571544408798218),
 ('young', 0.520764946937561),
 ('kwan', 0.518744170665741),
 ('men', 0.5075659155845642),
 ('urahara', 0.49407991766929626),
 ('ishida', 0.4933697581291199),
 ('zolt', 0.4825531244277954),
 ('hugh', 0.4617175757884979),
 ('jagur', 0.45737224817276),
 ('mask', 0.45695123076438904),
 ('sneer', 0.4542919993400574),
 ('soldier', 0.45304712653160095),
 ('roman', 0.4509067237377167),
 ('kuchiki', 0.43662548065185547),
 ('rukia', 0.43548718094825745),
 ('lieuten', 0.433056503534317),
 ('kilaun', 0.4327288568019867),
 ('menac', 0.4323747158050537),
 ('stranger', 0.4258544445037842),
 ('male', 0.42425400018692017)]
In [98]:
preKorra_w2v.wv.most_similar(['muscular'], topn=20)
Out[98]:
[('slender', 0.8226813077926636),
 ('tan', 0.8156390190124512),
 ('lith', 0.7800117135047913),
 ('wiri', 0.776208221912384),
 ('curvi', 0.770183801651001),
 ('baggi', 0.7607766389846802),
 ('broad', 0.7517485022544861),
 ('musculatur', 0.7298089265823364),
 ('cleavag', 0.7227165699005127),
 ('taller', 0.7075567841529846),
 ('mould', 0.6963127851486206),
 ('sinew', 0.6940883994102478),
 ('shorter', 0.6925441026687622),
 ('stocki', 0.6922821998596191),
 ('angular', 0.689975380897522),
 ('expos', 0.6879399418830872),
 ('silki', 0.6861910820007324),
 ('accentu', 0.6851731538772583),
 ('physiqu', 0.6839115619659424),
 ('clad', 0.6810009479522705)]
In [103]:
preKorra_w2v.wv.most_similar(['feminin'], topn=20)
Out[103]:
[('eleg', 0.7760363817214966),
 ('accent', 0.7643482685089111),
 ('accentu', 0.7594501376152039),
 ('complement', 0.751521110534668),
 ('facial', 0.7389618754386902),
 ('tan', 0.7328726649284363),
 ('contrast', 0.723888099193573),
 ('decidedli', 0.7234376668930054),
 ('highlight', 0.7176596522331238),
 ('creami', 0.712325930595398),
 ('ensembl', 0.7110721468925476),
 ('hue', 0.7000831961631775),
 ('allur', 0.6996458172798157),
 ('hairstyl', 0.698199450969696),
 ('textur', 0.6962378025054932),
 ('paler', 0.6957379579544067),
 ('pearl', 0.6943222284317017),
 ('darker', 0.6939800381660461),
 ('mocha', 0.6930546164512634),
 ('sideburn', 0.6809380054473877)]
In [106]:
preKorra_w2v.wv.most_similar(['masculin'], topn=20)
Out[106]:
[('undeni', 0.7111497521400452),
 ('mixtur', 0.7092783451080322),
 ('echapp', 0.6977778673171997),
 ('throati', 0.6919038891792297),
 ('jeun', 0.683685302734375),
 ('enchant', 0.6823471188545227),
 ('incomprehens', 0.6816055178642273),
 ('blith', 0.6812763810157776),
 ('labia', 0.6803156733512878),
 ('ng', 0.679475724697113),
 ('douleur', 0.6747332811355591),
 ('korrak', 0.6683462858200073),
 ('gen', 0.6678991317749023),
 ('suppl', 0.6645246744155884),
 ('visag', 0.6644068360328674),
 ('du', 0.6642574667930603),
 ("s'en", 0.6627510786056519),
 ("l'earthbend", 0.6623719930648804),
 ('san', 0.6609686613082886),
 ('ting', 0.660069465637207)]
In [107]:
preKorra_w2v.wv.most_similar(['gender'], topn=20)
Out[107]:
[('inspir', 0.7662824988365173),
 ('interpret', 0.7536213994026184),
 ('writer', 0.7420913577079773),
 ('genr', 0.7362577319145203),
 ('charact', 0.7339369654655457),
 ('fiction', 0.7296613454818726),
 ('creativ', 0.725774347782135),
 ('common', 0.723240077495575),
 ('clueless', 0.7187334895133972),
 ('japanes', 0.7186621427536011),
 ('supposedli', 0.7159520387649536),
 ('random', 0.7155631184577942),
 ('fame', 0.7129549384117126),
 ('artist', 0.7111949324607849),
 ('bryke', 0.7096675634384155),
 ('theme', 0.7081787586212158),
 ('factor', 0.7048810720443726),
 ('titl', 0.7030603885650635),
 ('model', 0.7000835537910461),
 ('equat', 0.6995790600776672)]
In [151]:
preKorra_w2v.wv.most_similar(['marri'], topn=20)
Out[151]:
[('marriag', 0.7519068717956543),
 ('wed', 0.704032301902771),
 ('wife', 0.6999097466468811),
 ('pregnant', 0.6946468949317932),
 ('sixteen', 0.6815783381462097),
 ('grandfath', 0.6669420599937439),
 ('eighteen', 0.6628835797309875),
 ('age', 0.6503024101257324),
 ('born', 0.6490695476531982),
 ('oldest', 0.6464373469352722),
 ('anniversari', 0.6340847015380859),
 ('birthday', 0.6301572918891907),
 ('fourteen', 0.6154270172119141),
 ('husband', 0.6101883053779602),
 ('dote', 0.6096040606498718),
 ('adult', 0.6045727133750916),
 ('grandpar', 0.6043574213981628),
 ('thirteen', 0.5979408025741577),
 ('adventur', 0.5970245003700256),
 ('famili', 0.5940732955932617)]
In [210]:
preKorra_w2v.wv.most_similar(['pregnant'], topn=20)
Out[210]:
[('marri', 0.6946468949317932),
 ('pregnanc', 0.6559553742408752),
 ('babi', 0.6346104145050049),
 ('fourteen-year-old', 0.6060411930084229),
 ('wife', 0.5976323485374451),
 ('mum', 0.5938212871551514),
 ('child', 0.5782669186592102),
 ('parent', 0.577325701713562),
 ('precoci', 0.5690056681632996),
 ('toddler', 0.5680645704269409),
 ('marriag', 0.5655030012130737),
 ('husband', 0.5637431740760803),
 ('unhappi', 0.5569580793380737),
 ('confess', 0.5528407692909241),
 ('adult', 0.5495383739471436),
 ('sibl', 0.5408585667610168),
 ('gran-gran', 0.5324091911315918),
 ('rohan', 0.5316610932350159),
 ('motherli', 0.5239673256874084),
 ('sixteen', 0.5235865116119385)]

Subtextual Korrasami Corpus

In [61]:
#same training settings as the pre-Korrasami model
subKorra_w2v = wv(sent_subKorra, window=20, min_count=10, workers=4)
In [203]:
subKorra_w2v.wv.most_similar(['asami'], topn=20)
Out[203]:
[('engin', 0.7523688673973083),
 ('korra', 0.740334153175354),
 ('softli', 0.62079918384552),
 ('babe', 0.5745729804039001),
 ('mumbl', 0.5618496537208557),
 ('girlfriend', 0.5595892071723938),
 ('breathlessli', 0.554783284664154),
 ('shila', 0.5525208711624146),
 ('heiress', 0.5449532270431519),
 ("'sami", 0.5323458909988403),
 ('lean', 0.4970782399177551),
 ('wink', 0.4914647936820984),
 ('ok.', 0.48380032181739807),
 ('nervous', 0.4816775619983673),
 ('ok', 0.47638535499572754),
 ('closer', 0.46714597940444946),
 ('grin', 0.462266206741333),
 ('tenderli', 0.45690199732780457),
 ('comfortingli', 0.45532649755477905),
 ('sweeti', 0.45292192697525024)]
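
Because the same query is run against each corpus, one way to read the lists side by side is to check which neighbors persist and which are new. The sketch below does this for the ten nearest neighbors of 'asami' in the pre-Korrasami and subtextual models, with the stems copied from the outputs above. The helper name `compare_neighbors` is hypothetical, and this comparison is an illustration, not a step in the original notebook.

```python
# Ten nearest neighbors of 'asami', copied from the outputs above.
pre_asami = ["cheerlead", "quarterback", "heiress", "'sami", "girlfriend",
             "korra", "footbal", "hanok", "babe", "sato"]
sub_asami = ["engin", "korra", "softli", "babe", "mumbl",
             "girlfriend", "breathlessli", "shila", "heiress", "'sami"]

def compare_neighbors(first, second):
    """Split two neighbor lists into shared and corpus-specific stems."""
    first_set, second_set = set(first), set(second)
    return {
        "shared": sorted(first_set & second_set),
        "only_first": sorted(first_set - second_set),
        "only_second": sorted(second_set - first_set),
    }

result = compare_neighbors(pre_asami, sub_asami)
for label, stems in result.items():
    print(label, stems)
```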
In [69]:
subKorra_w2v.wv.most_similar(['girlfriend'], topn=20)
Out[69]:
[('cute', 0.8325461149215698),
 ('chuckl', 0.8270843029022217),
 ('date', 0.8218371868133545),
 ('wink', 0.8067721128463745),
 ('babe', 0.7899971604347229),
 ('blush', 0.777965784072876),
 ('grin', 0.7705411911010742),
 ('wow', 0.7702885866165161),
 ('amaz', 0.7698530554771423),
 ('flirt', 0.7596547603607178),
 ('awkward', 0.7505388259887695),
 ('embarrass', 0.7356637120246887),
 ('huh', 0.7268903851509094),
 ('giggl', 0.725811779499054),
 ('ador', 0.7232818007469177),
 ('sheepishli', 0.7199611663818359),
 ('yep', 0.71490877866745),
 ('boyfriend', 0.7123020887374878),
 ('nice', 0.7092101573944092),
 ('fluster', 0.698421061038971)]
In [213]:
subKorra_w2v.wv.most_similar(['korra'], topn=20)
Out[213]:
[('asami', 0.7403340935707092),
 ('engin', 0.7342380285263062),
 ('babe', 0.6112719774246216),
 ('softli', 0.6020066738128662),
 ('breathlessli', 0.5999543070793152),
 ('shila', 0.574532151222229),
 ('girlfriend', 0.5721224546432495),
 ('mumbl', 0.5694355368614197),
 ('heiress', 0.5671155452728271),
 ("'sami", 0.558881402015686),
 ('ok.', 0.5278437733650208),
 ('nervous', 0.5046630501747131),
 ('ok', 0.502859354019165),
 ('sweeti', 0.498298317193985),
 ('chuckl', 0.4930209815502167),
 ('grin', 0.4865219295024872),
 ('closer', 0.48427778482437134),
 ('tenderli', 0.48180460929870605),
 ('wink', 0.47795185446739197),
 ('comfortingli', 0.47723710536956787)]
In [85]:
subKorra_w2v.wv.most_similar(['cheerlead'], topn=20)
Out[85]:
[('oop', 0.9346482753753662),
 ('li-dha', 0.8943625688552856),
 ('hipbon', 0.8943070769309998),
 ('clark', 0.8928951025009155),
 ('darkli', 0.8794020414352417),
 ('quarterback', 0.8741528391838074),
 ('hanok', 0.8739421367645264),
 ('lexa', 0.873622715473175),
 ('condescend', 0.864906370639801),
 ('yasha', 0.8637580275535583),
 ('dame', 0.859812319278717),
 ('hypothet', 0.8597378730773926),
 ('unreal', 0.8575913906097412),
 ('clich', 0.855718731880188),
 ('millimet', 0.8555583953857422),
 ('viya', 0.8478541374206543),
 ('plow', 0.8464254140853882),
 ('mean-', 0.8451640009880066),
 ('aisl', 0.8434481620788574),
 ('flirtati', 0.8423473238945007)]
In [91]:
subKorra_w2v.wv.most_similar(['heiress'], topn=20)
Out[91]:
[('breathlessli', 0.7743628621101379),
 ('wider', 0.7459204792976379),
 ('flush', 0.7384091019630432),
 ('arch', 0.7338321805000305),
 ('a-asami', 0.717422366142273),
 ('nervou', 0.7121500372886658),
 ('lace', 0.7069330215454102),
 ('devilishli', 0.7031633853912354),
 ('stammer', 0.6989263892173767),
 ('deepen', 0.6959136724472046),
 ('fluster', 0.6903343796730042),
 ('stiffen', 0.6890716552734375),
 ('babe', 0.6837170720100403),
 ('lopsid', 0.682559072971344),
 ('nervous', 0.6820570826530457),
 ('ab', 0.6731303930282593),
 ('slight', 0.6720433235168457),
 ('brush', 0.6694374084472656),
 ('azur', 0.6662962436676025),
 ('thickli', 0.6599745154380798)]
In [92]:
subKorra_w2v.wv.most_similar(['woman'], topn=20)
Out[92]:
[('older', 0.7941965460777283),
 ('younger', 0.7170405387878418),
 ('taller', 0.6838128566741943),
 ('young', 0.6511724591255188),
 ('women', 0.6392420530319214),
 ('pregnant', 0.6206789612770081),
 ('slightli', 0.6049616932868958),
 ('shorter', 0.5875957608222961),
 ('wife', 0.579010009765625),
 ('drew', 0.574859082698822),
 ('warmli', 0.5709928870201111),
 ('daughter', 0.5529758930206299),
 ('clearli', 0.5469895005226135),
 ('korro', 0.5428802371025085),
 ('gaze', 0.5410525798797607),
 ('hesit', 0.5338935256004333),
 ('husband', 0.5256304144859314),
 ('stun', 0.522885799407959),
 ('gown', 0.5206611156463623),
 ('wider', 0.5098415017127991)]
In [102]:
subKorra_w2v.wv.most_similar(['feminin'], topn=20)
Out[102]:
[('brunett', 0.9074066281318665),
 ('creami', 0.8784425854682922),
 ('repeatedli', 0.8638737797737122),
 ('chestnut', 0.8565282821655273),
 ('probe', 0.8523907661437988),
 ('vagina', 0.8495334386825562),
 ('peau', 0.8468303680419922),
 ('flawless', 0.8445796966552734),
 ('seul', 0.8445689082145691),
 ('likewis', 0.8444569706916809),
 ('flail', 0.8433421850204468),
 ('naughti', 0.8432282209396362),
 ('thereaft', 0.842755138874054),
 ('sensual', 0.8407859802246094),
 ('cunt', 0.8404106497764587),
 ('womanhood', 0.8363346457481384),
 ('elicit', 0.8347560167312622),
 ('abdomen', 0.8338559865951538),
 ('taut', 0.8336033821105957),
 ('allur', 0.8335332870483398)]
In [211]:
subKorra_w2v.wv.most_similar(['man'], topn=20)
Out[211]:
[('lee', 0.7369656562805176),
 ('ann', 0.6511579155921936),
 ('men', 0.6344538331031799),
 ('sneer', 0.5720396637916565),
 ('young', 0.5554693341255188),
 ('rozu', 0.5554302334785461),
 ('lieu', 0.5537645220756531),
 ('shang', 0.5429895520210266),
 ('lowli', 0.5403271913528442),
 ('rin', 0.5399022698402405),
 ('tozen', 0.5395759344100952),
 ('torq', 0.522463321685791),
 ('azula', 0.5150624513626099),
 ('glare', 0.5136635899543762),
 ('sir', 0.513161838054657),
 ('ty', 0.5131481289863586),
 ('whore', 0.5121971964836121),
 ('chau', 0.5085482001304626),
 ('comrad', 0.5025727152824402),
 ('pompou', 0.49624136090278625)]
In [105]:
subKorra_w2v.wv.most_similar(['masculin'], topn=20)
Out[105]:
[('porn', 0.9184079170227051),
 ('inexperienc', 0.9115799069404602),
 ('pervert', 0.9065878987312317),
 ('envi', 0.9055134654045105),
 ('swoon', 0.9023139476776123),
 ('oop', 0.9021033644676208),
 ('cont', 0.900875985622406),
 ('nativ', 0.8963315486907959),
 ('clich', 0.8945992588996887),
 ('persuas', 0.894284725189209),
 ('witti', 0.893638014793396),
 ('rabbit', 0.8925989270210266),
 ('intox', 0.8907217383384705),
 ('astound', 0.8906519412994385),
 ('bold', 0.8896130919456482),
 ('deed', 0.8887728452682495),
 ('hone', 0.8873229026794434),
 ('millimet', 0.886099100112915),
 ('glossi', 0.8843542337417603),
 ('strictli', 0.8843110203742981)]
In [120]:
subKorra_w2v.wv.most_similar(['gender'], topn=20)
Out[120]:
[('reput', 0.9274627566337585),
 ('wage', 0.9266194701194763),
 ('specul', 0.9165371060371399),
 ('inventor', 0.9161173105239868),
 ('exclus', 0.9067599177360535),
 ('spectacular', 0.9064094424247742),
 ('sophist', 0.9052457213401794),
 ('outcast', 0.9051941633224487),
 ('13', 0.9046571850776672),
 ('organis', 0.9034374356269836),
 ('commun', 0.9010162949562073),
 ('graduat', 0.9005753993988037),
 ('therefor', 0.9002510905265808),
 ('farmer', 0.9000048041343689),
 ('hobbi', 0.8985443115234375),
 ('backstori', 0.8981733322143555),
 ('factor', 0.8975718021392822),
 ('folk', 0.8952187299728394),
 ('suitabl', 0.8939134478569031),
 ('associ', 0.8934099078178406)]
In [131]:
subKorra_w2v.wv.most_similar(['lesbian'], topn=20)
Out[131]:
[('but-', 0.9141560196876526),
 ('peg', 0.8600756525993347),
 ('smarter', 0.8549880385398865),
 ('sorta', 0.8404135704040527),
 ('rambl', 0.8402645587921143),
 ('hah', 0.8382165431976318),
 ('duh', 0.8349829316139221),
 ('impli', 0.8345717191696167),
 ('nerd', 0.8309779763221741),
 ('inappropri', 0.8273021578788757),
 ('immatur', 0.8241227269172668),
 ('wiser', 0.8215035200119019),
 ('sober', 0.8112272620201111),
 ('unhappi', 0.8083264231681824),
 ('fangirl', 0.805036187171936),
 ('asshol', 0.802947998046875),
 ('trivial', 0.8026454448699951),
 ('scari', 0.8025082349777222),
 ('ex-girlfriend', 0.7974963784217834),
 ('breakup', 0.7949498295783997)]
In [152]:
subKorra_w2v.wv.most_similar(['marri'], topn=20)
Out[152]:
[('proud', 0.7784759402275085),
 ('marriag', 0.7744129300117493),
 ('daughter', 0.7685775756835938),
 ('husband', 0.7571151852607727),
 ('lucki', 0.7524663805961609),
 ('refrain', 0.7516282200813293),
 ('parent', 0.7490397095680237),
 ('grandmoth', 0.7444103956222534),
 ('famili', 0.7367498874664307),
 ('wife', 0.7358691096305847),
 ('honor', 0.7280506491661072),
 ('birthday', 0.7257238626480103),
 ('dear', 0.7253580093383789),
 ('accept', 0.7243698239326477),
 ('fondli', 0.7237311601638794),
 ('adopt', 0.7139495611190796),
 ('nicknam', 0.7135270833969116),
 ('told', 0.7100114226341248),
 ('wed', 0.708044707775116),
 ('yumi', 0.7009112238883972)]
In [157]:
subKorra_w2v.wv.most_similar(['pregnant'], topn=20)
Out[157]:
[('husband', 0.8924795985221863),
 ('wife', 0.8594977259635925),
 ('korro', 0.7888396978378296),
 ('daughter', 0.7444596886634827),
 ('warmli', 0.7282865643501282),
 ('senna', 0.7272575497627258),
 ('dear', 0.7227210402488708),
 ('older', 0.7166091799736023),
 ('younger', 0.7136698961257935),
 ('child', 0.7121751308441162),
 ('women', 0.7119937539100647),
 ('mother', 0.710296630859375),
 ('fondli', 0.7090214490890503),
 ('lover', 0.6949121952056885),
 ('wive', 0.6912193298339844),
 ('kindli', 0.6842564344406128),
 ('delight', 0.676864743232727),
 ('marri', 0.6670804023742676),
 ('affection', 0.6641011834144592),
 ('izumi', 0.6531662344932556)]

Post-Korrasami Corpus

In [65]:
#same training settings as the pre-Korrasami model
postKorra_w2v = wv(sent_postKorra, window=20, min_count=10, workers=4)
In [204]:
postKorra_w2v.wv.most_similar(['asami'], topn=20)
Out[204]:
[('korra', 0.8552396297454834),
 ('girlfriend', 0.6660484671592712),
 ('heiress', 0.5773499608039856),
 ('sami', 0.5623942613601685),
 ('babe', 0.5372836589813232),
 ('dork', 0.5250464081764221),
 ('lightli', 0.5144843459129333),
 ('mmm', 0.5117138028144836),
 ('breathlessli', 0.504883348941803),
 ("'sami", 0.4988788366317749),
 ('engin', 0.4957447946071625),
 ('shyli', 0.4895719289779663),
 ('mm', 0.48589250445365906),
 ('blush', 0.4811437427997589),
 ('hmm', 0.47896090149879456),
 ('okay', 0.47339361906051636),
 ('laura', 0.4697876572608948),
 ('softli', 0.46250495314598083),
 ('breathless', 0.4566752016544342),
 ('lovingli', 0.4543737471103668)]
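
The similarity between 'asami' and 'korra' can be tracked across the three models using the scores reported in the outputs above. The short sketch below copies those values and prints the change from one corpus to the next. Scores from separately trained models are not strictly comparable, so the trend should be read as suggestive and not as a precise measurement; the variable name `asami_korra` is hypothetical.

```python
# Cosine similarity between 'asami' and 'korra' in each model,
# copied from the most_similar(['asami']) outputs above.
asami_korra = [
    ("pre-Korrasami", 0.5117356777191162),
    ("subtextual Korrasami", 0.740334153175354),
    ("post-Korrasami", 0.8552396297454834),
]

previous = None
for corpus, score in asami_korra:
    if previous is None:
        print(f"{corpus}: {score:.3f}")
    else:
        print(f"{corpus}: {score:.3f} (change of {score - previous:+.3f})")
    previous = score
```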
In [70]:
postKorra_w2v.wv.most_similar(['girlfriend'], topn=20)
Out[70]:
[('korra', 0.6986696124076843),
 ('asami', 0.6660485863685608),
 ('dork', 0.6309612393379211),
 ('ador', 0.6270508766174316),
 ('heiress', 0.6221494078636169),
 ('babe', 0.6041741967201233),
 ('fianc', 0.5740950107574463),
 ('pout', 0.5674326419830322),
 ('cute', 0.5636457204818726),
 ('teas', 0.5542226433753967),
 ('ceo', 0.5346977710723877),
 ('lover', 0.5226887464523315),
 ('lovingli', 0.5208807587623596),
 ('fluster', 0.5186401009559631),
 ('blush', 0.5139248371124268),
 ('sexi', 0.5102159976959229),
 ('fiance', 0.508668839931488),
 ('giggl', 0.5068516135215759),
 ('ex-boyfriend', 0.5025683641433716),
 ('seduct', 0.5015711784362793)]
In [77]:
postKorra_w2v.wv.most_similar(['korra'], topn=20)
Out[77]:
[('asami', 0.855239748954773),
 ('girlfriend', 0.6986696124076843),
 ('breathlessli', 0.5904357433319092),
 ('heiress', 0.5890895128250122),
 ('babe', 0.5677245259284973),
 ('sami', 0.559546172618866),
 ('lightli', 0.5382922291755676),
 ("'sami", 0.5264853835105896),
 ('mmm', 0.5210567116737366),
 ('mm', 0.519873857498169),
 ('dork', 0.5108180046081543),
 ('breathless', 0.5032798051834106),
 ('teasingli', 0.4974740147590637),
 ('blush', 0.49296054244041443),
 ('shyli', 0.47768986225128174),
 ('pout', 0.4743771255016327),
 ('gentli', 0.4739275276660919),
 ('hmm', 0.46916264295578003),
 ('okay', 0.4683947265148163),
 ('softli', 0.4628128409385681)]
In [89]:
postKorra_w2v.wv.most_similar(['heiress'], topn=20)
Out[89]:
[('omega', 0.6915645599365234),
 ('alpha', 0.6567217707633972),
 ('inventor', 0.6470993757247925),
 ('squirm', 0.624725878238678),
 ('girlfriend', 0.6221494078636169),
 ('korra', 0.5890895128250122),
 ('brunett', 0.583236813545227),
 ('asami', 0.5773500204086304),
 ('arous', 0.5766016244888306),
 ('raven-hair', 0.5765764117240906),
 ('whimper', 0.5752760171890259),
 ('straddl', 0.5624386668205261),
 ('tan', 0.5509368181228638),
 ('breast', 0.5499779582023621),
 ('ceo', 0.547355055809021),
 ('pur', 0.539905309677124),
 ('fondl', 0.5350602269172668),
 ('pussi', 0.5292314291000366),
 ('hip', 0.5262117385864258),
 ('lust', 0.5254981517791748)]
In [206]:
postKorra_w2v.wv.most_similar(['woman'], topn=20)
Out[206]:
[('girl', 0.6128966212272644),
 ('tan', 0.5491816997528076),
 ('teen', 0.51688152551651),
 ('man', 0.5110265016555786),
 ('raven', 0.5066137313842773),
 ('heiress', 0.4812253415584564),
 ('dark-skin', 0.48040083050727844),
 ('raven-hair', 0.47327181696891785),
 ('inventor', 0.4690433740615845),
 ('tren', 0.46807485818862915),
 ('young', 0.4651164710521698),
 ('older', 0.4608103632926941),
 ('taller', 0.45827239751815796),
 ('women', 0.4442558288574219),
 ('prodigi', 0.43734028935432434),
 ('muscular', 0.4348944425582886),
 ('younger', 0.43127143383026123),
 ('stranger', 0.428732305765152),
 ('pale-skin', 0.4274151921272278),
 ('dark-hair', 0.4234382212162018)]
In [101]:
postKorra_w2v.wv.most_similar(['feminin'], topn=20)
Out[101]:
[('allur', 0.7315315008163452),
 ('contrast', 0.7279200553894043),
 ('eleg', 0.7141945958137512),
 ('accentu', 0.7073886394500732),
 ('physiqu', 0.706202507019043),
 ('angular', 0.6974437236785889),
 ('masculin', 0.6968234777450562),
 ('entic', 0.6965820789337158),
 ('musculatur', 0.6892468333244324),
 ('sculpt', 0.6783508658409119),
 ('flawless', 0.6769590973854065),
 ('distinct', 0.6709342002868652),
 ('muscular', 0.6657618284225464),
 ('mesmer', 0.6578235626220703),
 ('undeni', 0.6561500430107117),
 ('exquisit', 0.6532119512557983),
 ('complexion', 0.6444883942604065),
 ('marvel', 0.6425728797912598),
 ('chisel', 0.6417080163955688),
 ('delic', 0.640998125076294)]
In [104]:
postKorra_w2v.wv.most_similar(['masculin'], topn=20)
Out[104]:
[('epitom', 0.7003445029258728),
 ('feminin', 0.6968234777450562),
 ('qualiti', 0.6952031850814819),
 ('gender', 0.6914891004562378),
 ('angular', 0.6889472007751465),
 ('allur', 0.6764456033706665),
 ('worship', 0.6762538552284241),
 ('musculatur', 0.675748884677887),
 ('manli', 0.6754828691482544),
 ('trademark', 0.6736838221549988),
 ('complement', 0.6731098294258118),
 ('flashi', 0.6717349290847778),
 ('undeni', 0.6714355945587158),
 ('erot', 0.6695486903190613),
 ('contemporari', 0.6690717935562134),
 ('stereotyp', 0.6660759449005127),
 ('superfici', 0.6659425497055054),
 ('queer', 0.6608322858810425),
 ('japanes', 0.653177797794342),
 ('user', 0.6527581214904785)]
In [109]:
postKorra_w2v.wv.most_similar(['gender'], topn=20)
Out[109]:
[('biolog', 0.7380633354187012),
 ('common', 0.7290197014808655),
 ('stereotyp', 0.728571355342865),
 ('renown', 0.7237654328346252),
 ('heritag', 0.7177746295928955),
 ('categori', 0.7151609659194946),
 ('matur', 0.7136565446853638),
 ('exampl', 0.703478991985321),
 ('dislik', 0.7019073963165283),
 ('queer', 0.7015695571899414),
 ('genet', 0.7014217972755432),
 ('averag', 0.6998161673545837),
 ('masculin', 0.6914891004562378),
 ('concept', 0.6876851916313171),
 ('suprem', 0.6873180866241455),
 ('interact', 0.6867102384567261),
 ('worship', 0.6861883401870728),
 ('obsess', 0.6861700415611267),
 ('trait', 0.6847718954086304),
 ('speci', 0.6835721731185913)]
In [114]:
postKorra_w2v.wv.most_similar(['queer'], topn=20)
Out[114]:
[('gene', 0.8113480806350708),
 ('categor', 0.8026371598243713),
 ('homophob', 0.7778434157371521),
 ('oversight', 0.7768436670303345),
 ('synonym', 0.7708051800727844),
 ('poser', 0.7690443396568298),
 ('open-mind', 0.768737256526947),
 ("'new", 0.759117066860199),
 ('innumer', 0.7587401270866394),
 ('genocid', 0.7576499581336975),
 ('pouvoir', 0.7570508122444153),
 ('standpoint', 0.7549765110015869),
 ('high-profil', 0.7549502849578857),
 ('mot', 0.7545825839042664),
 ('unaccustom', 0.7538706660270691),
 ('esprit', 0.7527821660041809),
 ('tomboy', 0.746514618396759),
 ('juvenil', 0.7424350380897522),
 ('eu', 0.7423019409179688),
 ('superfici', 0.740811824798584)]
In [133]:
postKorra_w2v.wv.most_similar(['bisexu'], topn=20)
Out[133]:
[('lesbian', 0.7358127236366272),
 ('fangirl', 0.6794250011444092),
 ('gay', 0.6769630312919617),
 ('heterosexu', 0.6747435331344604),
 ('romanc', 0.6363551020622253),
 ('nerd', 0.6221931576728821),
 ('ami', 0.6101471185684204),
 ('secondli', 0.6081928610801697),
 ('corni', 0.6047763228416443),
 ('woo', 0.6042453646659851),
 ('maker', 0.6038182377815247),
 ('oke', 0.6038076281547546),
 ('hockey', 0.6025301814079285),
 ('concert', 0.5989912748336792),
 ('19', 0.5936625599861145),
 ('bae', 0.5933998227119446),
 ('categori', 0.5922248959541321),
 ('famou', 0.5869439244270325),
 ('scholarship', 0.5863643884658813),
 ('ex', 0.5847569108009338)]
In [147]:
postKorra_w2v.wv.most_similar(['racist'], topn=20)
Out[147]:
[('queer', 0.6622498631477356),
 ('homophob', 0.6554720401763916),
 ('lesbian', 0.6422443985939026),
 ('aristocrat', 0.6391816139221191),
 ('vindict', 0.6317988038063049),
 ('snooti', 0.6288875937461853),
 ('stupidest', 0.6279544234275818),
 ('stereotyp', 0.6253094673156738),
 ('rich', 0.6236247420310974),
 ('gene', 0.6233779191970825),
 ('gig', 0.6204466223716736),
 ('suitor', 0.6162865161895752),
 ('.that', 0.6161808371543884),
 ('nobl', 0.6161414384841919),
 ('prom', 0.6159921288490295),
 ('ex-wif', 0.6135737299919128),
 ('juvenil', 0.6128740906715393),
 ('imbecil', 0.6123341917991638),
 ('dislik', 0.6114457845687866),
 ('pervert', 0.6084153652191162)]
In [155]:
postKorra_w2v.wv.most_similar(['marri'], topn=20)
Out[155]:
[('marriag', 0.7341805100440979),
 ('wed', 0.6527568101882935),
 ('propos', 0.6325207352638245),
 ('wife', 0.6007057428359985),
 ('spous', 0.5760090351104736),
 ('honeymoon', 0.5428836345672607),
 ('bisexu', 0.5397241115570068),
 ('elop', 0.5394207239151001),
 ('birthday', 0.5360244512557983),
 ('grandmoth', 0.5315588712692261),
 ('bride', 0.5304710865020752),
 ('adopt', 0.5283628702163696),
 ('gay', 0.5264289975166321),
 ('husband', 0.5150956511497498),
 ('hero', 0.5113064646720886),
 ('niec', 0.4979856312274933),
 ('ex', 0.49780890345573425),
 ("'and", 0.4883483350276947),
 ('love', 0.4878637492656708),
 ('pregnant', 0.48736679553985596)]
In [167]:
postKorra_w2v.wv.most_similar(['pregnant'], topn=20)
Out[167]:
[('pregnanc', 0.6898149251937866),
 ('birth', 0.6444066166877747),
 ('husband', 0.6266837120056152),
 ('sire', 0.6051164865493774),
 ('wife', 0.600002110004425),
 ('babi', 0.5909838080406189),
 ('newborn', 0.5690059661865234),
 ('omega', 0.5682777166366577),
 ('senna', 0.5609380602836609),
 ('daughter', 0.5432636141777039),
 ('child', 0.5412541031837463),
 ('parent', 0.5409191846847534),
 ('tought', 0.5388841032981873),
 ('born', 0.5347464680671692),
 ('mother', 0.5282257199287415),
 ('grandchild', 0.5247682332992554),
 ('motherli', 0.5243872404098511),
 ('tonraq', 0.5184610486030579),
 ('marriag', 0.5181713700294495),
 ('katara', 0.5169015526771545)]

Potential Results Overall

Good things to look at:

  • Asami:
    • pre: heiress, cheerlead, girlfriend, football, girl, babe
    • subtext: engine, mumble, babe, breathlessli, nervous, grin, lean
    • post: girlfriend, heiress, dork, babe, lightli, 'mmm,' breathlessli, engine
  • "girlfriend"
    • pre: boyfriend, jealous, date, quarterback, drama (more focused on the external relationships)
    • sub: cute, chuckl, date, wink, babe, blush (more about the actions, the feelings, the emotions)
    • post: asami, korra, dork, adore, lovingli, lover, fianc, etc. Much more about the interactions, building the feelings, and korra/asami (CEO, heiress, etc)
  • Korra:
    • pre: asami, cheerleader, quarterback, nervous, sheepish, embarrass
    • sub: asami, babe, softli, nervous,
    • post: girlfriend, heiress, teasingli,
  • cheerleader (cheerlead)/quarterback
    • pre: quarterback, blush, football, 'sami, etc.
    • sub/post: not important
  • heiress:
    • In fan fiction (FF), characters are given "nicknames," especially characters of the same gender, because relying on pronouns can make the prose confusing. For example, "the heiress" may stand in for Asami.
    • pre: gorgeous, asami, girl, beautiful, cheerlead, businesswoman, eyeshadow, hairstyle
    • sub: breathlessli, wider, flush, arch, a-asami
    • post: omega/alpha, inventor, squirm, girlfriend,
  • woman:
    • pre: young, man, ladi, elderli, husband (relationships)
    • sub: older, younger, taller, etc (descriptions)
    • post: descriptions, more specifically related to Korra and Asami (raven-haired, dark-skin, muscular, etc.)
  • feminin (feminine):

    • pre: eleg, accent, accentu, complement, facial, tan, contrast, decideli, highlight, creami
    • sub: brunett (specific character), creami, chestnut (hair, skin), probe, vagina, explicitly about sex
    • post: allur, contrast, eleg, accentu, physiqu, angular, masculin – about shifting bodies and roles
  • masculin (masculine):

    • pre: undeni (undeniable), mixtur, enchat, incomprehens, labia
    • sub: porn, inexperienc, pervert, envi, (much more negative)
    • post: epitom, feminin, qualiti, gender, angular, allur, worship
  • gender:

    • pre: inspir, interpret, writer, genr, charact, fiction, creativ
    • sub: reput, wage, specul, inventor, exclus, spectacular, sophist, outcast, commun.
    • post: biolog, common, stereotyp, renown, heritag, categori, matur, queer,
  • queer:

    • pre: not in model
    • sub: not in model
    • post: gene, categor, homophob, oversight, synonym, poser, open-minded, new,
  • bisexual:

    • pre: not in model
    • sub: not in model
    • post: lesbian, gay, heterosex, romanc,
  • lesbian:

    • pre: not in model
    • sub: but, peg, smarter, sorta, rambl, hah, duh, impli, nerd, inappropriate, wiser, sober, unhappi, fangirl
    • post: gay, bisexual, yeah, ex, fangirl, fuckin, racist
  • heterosexu (heterosexual, heterosexuality, etc) :

    • pre: not in model
    • sub: not in model
    • post: vocab around sexuality
  • racist:

    • pre: not in model
    • sub: not in model
    • post: queer, homophob, aristocrat, – aware of social issues
  • marri (marry, etc):

    • pre: marriag, wed, wife, pregnant, sixteen, eighteen, age
    • sub: proud, marriag, daughter, husband, luki, refrain, parent, famili
    • post: marriag, wed, propos, wife, spous, honeymoon
  • pregnant:

    • pre: marri, pregnanc, babi, fourteen-year-old, wife, mum, child
    • sub: husband, wife, korro (masc korra? child), daughter, warmli, dear, older
    • post: pregnanc, birth, husband, sire, wife, babi, newborn, omega
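
The "not in model" entries above mark words that are absent from a model's vocabulary (for example, because they occur fewer than min_count times in that corpus). The sketch below shows one way the three models could be queried side by side so that a missing word is recorded instead of stopping the cell. The `compare_neighbors` function and the `models` dictionary are hypothetical names that are not part of the original notebook, and the sketch assumes gensim raises a KeyError when a word is not in the vocabulary.

```python
def compare_neighbors(models, word, topn=10):
    """Return the topn nearest neighbors of `word` for each model.

    `models` maps a label (e.g. 'pre') to a trained Word2Vec model.
    A label maps to None when the word is not in that model's vocabulary.
    """
    results = {}
    for label, model in models.items():
        try:
            neighbors = model.wv.most_similar([word], topn=topn)
            # Keep only the tokens; drop the similarity scores
            results[label] = [token for token, score in neighbors]
        except KeyError:
            # Assumption: a missing word surfaces as a KeyError
            results[label] = None
    return results

# Hypothetical usage; the pre and subtext model names are assumed
# to follow the naming of postKorra_w2v:
# models = {'pre': preKorra_w2v, 'sub': subtextKorra_w2v, 'post': postKorra_w2v}
# compare_neighbors(models, 'queer')
```

Returning None for a missing word keeps the three periods aligned in one dictionary, which makes the pre/sub/post comparison in the notes above easier to reproduce.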

Concordances

Although NLTK has a concordance function, it shows only the first 25 results by default. I instead used the "makeConc" function from Geoffrey Rockwell, which returns every match and lets me set how many words of context to show on either side.

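The "makeConc" function requires a tokenized list, so I will still be using the "read_txt" function, but I will run the concordances on the uncleaned versions of the corpora so the context is clearer. Because the uncleaned corpora are not stemmed, the search terms below are full words (for example, "feminine" rather than "feminin").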

I have chosen to keep the output results hidden because these excerpts and texts do not belong to me, so I would prefer not to publish someone else's writing and language unless I have their permission.

In [226]:
preKorra_string = read_txt('../../../data/korra/korra2018/time/preKorrasami_unclean.txt')
subtextKorra_string = read_txt('../../../data/korra/korra2018/time/subtextKorrasami_unclean.txt')
postKorra_string = read_txt('../../../data/korra/korra2018/time/postKorrasami_unclean.txt')
In [222]:
def makeConc(word2conc,list2FindIn,context2Use,concList):
    '''
    This function finds every exact match of a word in a tokenized list and
    appends a concordance line (the match plus its surrounding context) to concList
    Input: the word to find, a tokenized list, the number of context tokens on each side, and a list to append to
    Output: none; concList is modified in place
    '''
    end = len(list2FindIn)
    for location in range(end):
        if list2FindIn[location] == word2conc:
            # Clamp the context window so it does not run past the start or end of the list
            beginCon = max(0, location - context2Use)
            endCon = min(end, location + context2Use + 1)

            theContext = list2FindIn[beginCon:endCon]
            concordanceLine = ' '.join(theContext)
            # Each entry is prefixed with the token position of the match
            concList.append(str(location) + ": " + concordanceLine)
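
Because "makeConc" compares tokens with `==`, a search for 'gender' will miss capitalized forms such as 'Gender' if the uncleaned corpora retain capitalization. The following is a minimal sketch of a case-insensitive variant. `make_conc_ci` is a hypothetical helper, not part of Rockwell's function or the original notebook, and it returns a new list instead of appending to one passed in. The sample tokens are invented for illustration and do not come from the corpora.

```python
def make_conc_ci(word, tokens, context):
    """Return concordance lines for every case-insensitive match of `word`."""
    target = word.lower()
    lines = []
    for location, token in enumerate(tokens):
        if token.lower() == target:
            # Clamp the window to the bounds of the token list
            begin = max(0, location - context)
            end = min(len(tokens), location + context + 1)
            lines.append(str(location) + ": " + ' '.join(tokens[begin:end]))
    return lines

# Invented sample tokens, for illustration only
sample_tokens = ['Gender', 'is', 'not', 'the', 'same', 'as', 'gender', 'roles', '.']
sample_lines = make_conc_ci('gender', sample_tokens, 2)
print(sample_lines)
```

Both the capitalized and the lowercase token are matched here, and each line still begins with the token position, as in "makeConc".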

Gender

In [ ]:
gender1 = []
makeConc('gender',preKorra_string,5,gender1)
gender1
In [ ]:
gender2 = []
makeConc('gender',subtextKorra_string,5,gender2)
gender2
In [ ]:
gender3 = []
makeConc('gender',postKorra_string,5,gender3)
gender3

Feminine/Masculine

In [ ]:
fem1 = []
makeConc('feminine',preKorra_string,6,fem1)
fem1
In [ ]:
masc1 = []
makeConc('masculine',preKorra_string,6,masc1)
masc1
In [ ]:
fem2 = []
makeConc('feminine',subtextKorra_string,7,fem2)
fem2
In [ ]:
masc2 = []
makeConc('masculine',subtextKorra_string,7,masc2)
masc2
In [ ]:
fem3 = []
makeConc('feminine',postKorra_string,6,fem3)
fem3
In [ ]:
masc3 = []
makeConc('masculine',postKorra_string,6,masc3)
masc3

Sexuality Markers: Gay, Bi, & Lesbian

In [ ]:
gay1 = []
makeConc('gay',preKorra_string,9,gay1)
gay1
In [ ]:
les1 = []
makeConc('lesbian',preKorra_string,9,les1)
les1
In [ ]:
bi1 = []
makeConc('bi',preKorra_string,7,bi1)
bi1
In [ ]:
gay2 = []
makeConc('gay',subtextKorra_string,9,gay2)
gay2
In [ ]:
les2 = []
makeConc('lesbian',subtextKorra_string,9,les2)
les2
In [ ]:
bi2 = []
makeConc('bisexual',subtextKorra_string,9,bi2)
bi2
In [ ]:
gay3 = []
makeConc('gay',postKorra_string,7,gay3)
gay3
In [ ]:
les3 = []
makeConc('lesbian',postKorra_string,7,les3)
les3

Girlfriend

In [ ]:
gf1 = []
makeConc('girlfriend',preKorra_string,6,gf1)
gf1
In [ ]:
gf2 = []
makeConc('girlfriend',subtextKorra_string,8,gf2)
gf2
In [ ]:
gf3 = []
makeConc('girlfriend',postKorra_string,5,gf3)
gf3

Biological

In [ ]:
bio1 = []
makeConc('biological',preKorra_string,6,bio1)
bio1
In [ ]:
bio2 = []
makeConc('biological',subtextKorra_string,6,bio2)
bio2
In [ ]:
bio3 = []
makeConc('biological',postKorra_string,6,bio3)
bio3
In [ ]:
bi3 = []
makeConc('bi',postKorra_string,8,bi3)
bi3