This computational notebook demonstrates how to transform Archive of Our Own fanfiction metadata for network analysis. To help with this, I used the Programming Historian's "Exploring and Analyzing Network Data with Python". William Reed Quinn also helped with transforming the additional tags into a matrix.
This notebook takes each fanfic's list of additional tags, counts how often particular tags appear together, and then creates nodes and edges lists from those counts. This approach can be applied to the "Additional Tags" field as well as other metadata like "Character" or "Relationship."
The goal of my network analysis is to better trace the creation and curation of fandom communities through additional tags. Which additional tags are used together? Why might this be?
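As a quick preview of the core idea (with made-up tags, not data from my corpus), counting tag co-occurrence boils down to tallying every pair of tags that appears in the same work. The pandas pipeline below does this at scale, but a minimal sketch looks like this:
from itertools import combinations
from collections import Counter
# Hypothetical tag lists for three works, for illustration only
works = [["fluff", "angst", "slow burn"], ["fluff", "slow burn"], ["angst", "hurt/comfort"]]
pair_counts = Counter()
for tags in works:
    # tally each unordered pair of tags that co-occurs in one work
    for pair in combinations(sorted(tags), 2):
        pair_counts[pair] += 1
print(pair_counts)
# Counter({('fluff', 'slow burn'): 2, ('angst', 'fluff'): 1, ...})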
import pandas as pd
from operator import itemgetter
import networkx as nx
from networkx.algorithms import community
import csv
import numpy as np
import plotly.graph_objects as go
korra_all = pd.read_csv('./data/allkorra.csv')
korra_all.head(3)
Since this is going to be a massive file, I want to start with a small sample first to make sure my code works correctly. I will use the .sample() function to grab 15 random fanfics. Then, I will lower-case all the additional tags and split each fanfic's additional tags into its own list.
My new dataframe, which now has the additional tags lower-cased and in an individual list for each fanfic, will then be "exploded." Exploding takes a list within a dataframe and makes each item in that list its own row, while still keeping all of the row's original information.
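For instance, on a tiny made-up frame (not the real data), explode turns one row per fanfic into one row per tag:
# Hypothetical illustration of explode, for illustration only
demo = pd.DataFrame({"work_id": [1, 2], "additional tags": [["fluff", "angst"], ["slow burn"]]})
demo.explode("additional tags")
#    work_id additional tags
# 0        1           fluff
# 0        1           angst
# 1        2       slow burn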
practice = korra_all.sample(n=15, random_state=1).reset_index()
practice[:1]
practice["additional tags"]= practice["additional tags"].str.lower().str.split(",")
practice[:1]
practice = practice.drop(columns = ["index","Unnamed: 0"])
practice[:1]
practice = practice[['additional tags', 'work_id']]
practiceNewdf = practice.explode('additional tags')
practiceNewdf[:1]
Now that I have used the explode() function above to create a dataframe where each additional tag has its own row (with the work_id as a unique ID), I will add a "count" column recording how many times each tag appears for each work, which I can then use to pivot this long dataframe into a wider one.
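As a quick made-up illustration (not the real data), transform("count") computes the size of each (work_id, tag) group and broadcasts it back onto every row:
# Hypothetical illustration of groupby/transform, for illustration only
demo = pd.DataFrame({"work_id": [1, 1, 1], "additional tags": ["fluff", "fluff", "angst"]})
demo["count"] = demo.groupby(["work_id", "additional tags"])["additional tags"].transform("count")
#    work_id additional tags  count
# 0        1           fluff      2
# 1        1           fluff      2
# 2        1           angst      1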
practiceNewdf["count"] = practiceNewdf.groupby(["work_id","additional tags"])["additional tags"].transform("count").fillna(0).astype(int)
practiceNewdf[:1]
After creating a dataframe that has three columns (additional tags, work_id, and count), I will use the "pivot" function to make this dataframe wider. As you can see in the dataframe below, pivot makes "work_id" the new index, turns each unique additional tag into its own column, and fills each cell with how often that tag appears for that work.
I probably do not need to keep the work_id, but I think it is better to be safe than sorry.
The function "df.columns.name = None" removes any weird metadata that names all the columns "additional tags." One issue I was facing is my matrix (see below) was not being properly made into an edges list because the columns were being labeled as "additional tags" and throwing everything off.
Finally, "df.columns.fillna('none')" replaces the NaN (blank) column name with the name "none." Then, I deleted the "none" column. This way, I don't have any blank tags, and they're not important anyway!
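Before running it on the sample, here is pivot on a tiny made-up frame (not the real data): each unique tag becomes a column, work_id becomes the index, and the count fills the cells.
# Hypothetical illustration of pivot, for illustration only
long_df = pd.DataFrame({"work_id": [1, 1, 2],
                        "additional tags": ["fluff", "angst", "fluff"],
                        "count": [1, 1, 1]})
long_df.pivot(index="work_id", columns="additional tags", values="count").fillna(0)
#          angst  fluff
# work_id
# 1          1.0    1.0
# 2          0.0    1.0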
practiceWorkID = practiceNewdf.pivot(index="work_id", columns="additional tags", values="count").fillna(0)
#removing column metadata
practiceWorkID.columns.name = None
#renaming "Nan" into "none" so it actually appears instead of being a null value
practiceWorkID.columns = practiceWorkID.columns.fillna('none')
practiceWorkID = practiceWorkID.drop(columns = 'none')
practiceWorkID
adj_matrix = practiceWorkID.T.dot(practiceWorkID).astype(int)
adj_matrix
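Why does multiplying the transposed matrix by itself produce an adjacency matrix? Entry (i, j) of the product sums, across all works, tag i's count times tag j's count, so with 0/1 counts it is exactly the number of works in which the two tags co-occur (and the diagonal counts how many works use each tag). A tiny made-up check:
# Hypothetical check of the co-occurrence logic, for illustration only
X = pd.DataFrame({"fluff": [1, 1], "angst": [1, 0]}, index=[1, 2])
X.T.dot(X)
#        fluff  angst
# fluff      2      1
# angst      1      1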
In this final section, I will create both the edges and nodes csv files.
To create the edges list, I used the "stack" function on the matrix I created above, followed by reset_index. The edges list now has three columns: one additional tag, another additional tag, and the weight of their relationship. I also renamed all the columns.
To create the nodes list, I took the unique additional tags from the exploded dataframe.
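Before the real code, here is what stack().reset_index() produces on a tiny made-up matrix (hence the default level_0, level_1, and 0 column names that get renamed below):
# Hypothetical illustration of stack, for illustration only
m = pd.DataFrame({"fluff": [2, 1], "angst": [1, 1]}, index=["fluff", "angst"])
m.stack().reset_index()
#   level_0 level_1  0
# 0   fluff   fluff  2
# 1   fluff   angst  1
# 2   angst   fluff  1
# 3   angst   angst  1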
#using .stack() and .reset_index()
practice_edges_list = adj_matrix.stack().reset_index()
#renaming columns for clarity purposes
practice_edges_list = practice_edges_list.rename({'level_0':'source', 'level_1': 'target', 0:'weight'}, axis='columns')
#yay looks good!
practice_edges_list[:2]
# nodes list: unique additional tags from the exploded dataframe
practice_nodes_list = practiceNewdf['additional tags'].drop_duplicates().reset_index()
practice_nodes_list = practice_nodes_list.drop(columns="index")
practice_nodes_list[:2]
practice_edges_list.to_csv('./data/practice_edges.csv', index=False)
practice_nodes_list.to_csv('./data/practice_nodes.csv', index=False)
Code taken from Programming Historian!
with open('./data/practice_nodes.csv', 'r') as nodecsv:
    nodereader = csv.reader(nodecsv)
    nodes = [n for n in nodereader][1:]
# Get a list of just the node names (the first item in each row)
node_names = [n[0] for n in nodes]
# Read in the edgelist file
with open('./data/practice_edges.csv', 'r') as edgecsv:
    edgereader = csv.reader(edgecsv)
    next(edgereader)  # skip the header row
    # cast the weight from string to float so edge weights are numeric
    edges = [(e[0], e[1], float(e[2])) for e in edgereader]
# Print the number of nodes and edges in our two lists
print(len(node_names))
print(len(edges))
G = nx.Graph() # Initialize a Graph object
G.add_nodes_from(node_names) # Add nodes to the Graph
G.add_weighted_edges_from(edges) # Add edges to the Graph
print(nx.info(G)) # Print information about the Graph
density = nx.density(G)
print("Network density:", density)
nx.draw(G, with_labels = True)
Let's try it with the full corpus! I am following the exact same steps as above, just applied to the entire corpus.
Before I save the edges list, I will also keep only pairs with a weight of 15 or above, meaning two additional tags that appear together at least 15 times, as this is a better indicator of larger community practices.
korra_all["additional tags"]= korra_all["additional tags"].str.lower().str.split(",")
korra_all[:1]
korra_all = korra_all.drop(columns = ["Unnamed: 0"])
korra_all[:1]
korra_all = korra_all[['additional tags', 'work_id']]
korraDFexploded = korra_all.explode('additional tags')
korraDFexploded[:5]
korraDFexploded["count"] = korraDFexploded.groupby(["work_id","additional tags"])["additional tags"].transform("count").fillna(0).astype(int)
korraDFexploded[:2]
korraWorkID = korraDFexploded.pivot(index="work_id", columns="additional tags", values="count").fillna(0)
#removing column metadata
korraWorkID.columns.name = None
#renaming "Nan" into "none" so it actually appears instead of being a null value
korraWorkID.columns = korraWorkID.columns.fillna('none')
korraWorkID = korraWorkID.drop(columns = 'none')
korraWorkID
korraMatrix = korraWorkID.T.dot(korraWorkID).astype(int)
korraMatrix
#using .stack() and .reset_index()
korra_edges = korraMatrix.stack().reset_index()
#renaming columns for clarity purposes
korra_edges = korra_edges.rename({'level_0':'source', 'level_1': 'target', 0:'weight'}, axis='columns')
#yay looks good!
korra_edges[:2]
#The "Weight" >= 20 means the tags must appear at least 15 times to be incorporated.
#Also, the "query" makes sure to remove any duplicate sources/targets
korra_edges15 = korra_edges[ (korra_edges['weight'] >= 15)]
korra_edges15 = korra_edges15.query("source != target")
korra_edges15
# nodes list: unique sources from the filtered edges list
korra_nodes = korra_edges15['source'].drop_duplicates().reset_index()
korra_nodes = korra_nodes.drop(columns="index")
korra_nodes[:2]
korra_edges15.to_csv('./data/korra_edges.csv', index=False)
korra_nodes.to_csv('./data/korra_nodes.csv', index=False)
with open('./data/korra_nodes.csv', 'r') as nodecsv:
    nodereader = csv.reader(nodecsv)
    nodes = [n for n in nodereader][1:]
# Get a list of just the node names (the first item in each row)
node_names = [n[0] for n in nodes]
# Read in the edgelist file
with open('./data/korra_edges.csv', 'r') as edgecsv:
    edgereader = csv.reader(edgecsv)
    next(edgereader)  # skip the header row
    # cast the weight from string to float so edge weights are numeric
    edges = [(e[0], e[1], float(e[2])) for e in edgereader]
# Print the number of nodes and edges in our two lists
print(len(node_names))
print(len(edges))
G = nx.Graph() # Initialize a Graph object
G.add_nodes_from(node_names) # Add nodes to the Graph
G.add_weighted_edges_from(edges) # Add edges to the Graph
print(nx.info(G)) # Print information about the Graph
density = nx.density(G)
print("Network density:", density)
nx.draw(G, with_labels = True)
nx.write_gexf(G, './data/korra_network.gexf')
I created the visualization by first reading in my nodes and edges files using the Programming Historian's code. Then, I used Plotly's network graph documentation to create the graph.
The main code I added comes a few cells below: it creates a "pos" (position) attribute for each node that Plotly can then use to map the node onto the graph. Position means the literal x,y coordinates of a node. NetworkX has different layouts to choose from for positioning your nodes.
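For example, swapping in a different layout only changes the one pos line; these are a few of NetworkX's built-in options (I use spring_layout below):
# A few alternative NetworkX layouts, shown for illustration only
# pos = nx.circular_layout(G)      # nodes evenly spaced on a circle
# pos = nx.kamada_kawai_layout(G)  # force-directed, often tidier but slower
# pos = nx.random_layout(G)        # uniformly random positions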
with open('./data/korra_nodes.csv', 'r') as nodecsv:
    nodereader = csv.reader(nodecsv)
    nodes = [n for n in nodereader][1:]
# Get a list of just the node names (the first item in each row)
node_names = [n[0] for n in nodes]
# Read in the edgelist file
with open('./data/korra_edges.csv', 'r') as edgecsv:
    edgereader = csv.reader(edgecsv)
    next(edgereader)  # skip the header row
    # cast the weight from string to float so edge weights are numeric
    edges = [(e[0], e[1], float(e[2])) for e in edgereader]
print(len(node_names))
print(len(edges))
G = nx.Graph() # Initialize a Graph object
G.add_nodes_from(node_names) # Add nodes to the Graph
G.add_weighted_edges_from(edges) # Add edges to the Graph
print(nx.info(G)) # Print information about the Graph
# Adding positions in my data
pos = nx.spring_layout(G, k=0.6, iterations=50)
for n, p in pos.items():
    G.nodes[n]['pos'] = p
edge_x = []
edge_y = []
for edge in G.edges():
    x0, y0 = G.nodes[edge[0]]['pos']
    x1, y1 = G.nodes[edge[1]]['pos']
    edge_x.append(x0)
    edge_x.append(x1)
    # None creates a gap so Plotly draws each edge as a separate segment
    edge_x.append(None)
    edge_y.append(y0)
    edge_y.append(y1)
    edge_y.append(None)
edge_trace = go.Scatter(
    x=edge_x, y=edge_y,
    line=dict(width=0.9, color='#888'),
    hoverinfo='none',
    mode='lines')
node_x = []
node_y = []
node_label = []
for node in G.nodes():
    x, y = G.nodes[node]['pos']
    node_x.append(x)
    node_y.append(y)
    node_label.append(node)
node_trace = go.Scatter(
    x=node_x, y=node_y,
    mode='markers',
    text=node_label, textposition='top center',
    hoverinfo='text',
    marker=dict(
        showscale=True,
        # colorscale options
        # 'Greys' | 'YlGnBu' | 'Greens' | 'YlOrRd' | 'Bluered' | 'RdBu' |
        # 'Reds' | 'Blues' | 'Picnic' | 'Rainbow' | 'Portland' | 'Jet' |
        # 'Hot' | 'Blackbody' | 'Earth' | 'Electric' | 'Viridis' |
        colorscale='purd',
        reversescale=True,
        color=[],
        size=10,
        colorbar=dict(
            thickness=20,
            title='Node Connections',
            xanchor='left',
            titleside='right'
        ),
        line_width=2))
node_adjacencies = []
# node_text = []
for node, adjacencies in enumerate(G.adjacency()):
    node_adjacencies.append(len(adjacencies[1]))
    # node_text.append('Tag:' + str(node))
node_trace.marker.color = node_adjacencies
# node_trace.text = node_text
fig = go.Figure(data=[edge_trace, node_trace],
                layout=go.Layout(
                    # title='The Legend of Korra Fanfiction Additional Tags',
                    titlefont_size=16,
                    showlegend=False,
                    hovermode='closest',
                    margin=dict(b=20, l=5, r=5, t=40),
                    annotations=[dict(
                        text="Python code: <a href='https://plotly.com/ipython-notebooks/network-graphs/'> https://plotly.com/ipython-notebooks/network-graphs/</a>",
                        showarrow=False,
                        xref="paper", yref="paper",
                        x=0.005, y=-0.002)],
                    xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                    yaxis=dict(showgrid=False, zeroline=False, showticklabels=False))
                )
fig.show()
fig.write_html("./models/network_graph_korra.html")