This computational notebook can be used to parse XML files, particularly to pull out attribute values and see how often particular attribute values are used together. I used this XML parser for the Critical Fan Toolkit interviews, which I qualitatively coded using an XML schema I created in RelaxNG.
Information about the interviews and about the qualitative coding process is available on the interview portion of the Critical Fan Toolkit website.
This computational notebook was created by Cara Marta Messina and William Reed Quinn. Cara wrote the documentation, while Bill helped with transforming the attribute values dictionary into a dataframe. This notebook is part of Cara's documentation process for her dissertation.
import xml.etree.ElementTree as ET
import os
import pandas as pd
import numpy as np
import re
import plotly.graph_objs as go
import plotly.express as px
import plotly
First, I created two functions. The first function I found on Stack Overflow, posted by ponayz. The second is a basic function that reads an XML file and grabs all of the coded content beginning with the root element; for the interviews I coded, the root element is "interview." Then I read in the XML data for all six interviews. There is probably a more automatic way to do this (see the sketch after the code below), but because I only had six files, I simply called the get_root_data function on each filepath.
path = '../critical-fan-toolkit-XML/interviews-encoded/v2/'
def get_xml_files(path):
    '''
    Description: This function retrieves a list of all the XML files in a particular folder.
    Input: A path to a folder with XML files.
    Output: A list of all the XML files in that folder.
    '''
    xml_list = []
    for filename in os.listdir(path):
        if filename.endswith(".xml"):
            xml_list.append(os.path.join(path, filename))
    return xml_list
def get_root_data(path):
    '''
    Description: This function retrieves the root element and all of its child nodes from an XML file. There is probably an easier way to do this with a for loop, but I was failing miserably.
    Input: An XML file path.
    Output: The root of the parsed XML document, which can now be traversed.
    '''
    tree = ET.parse(path)
    root = tree.getroot()
    return root
xml_list = get_xml_files(path)
xml_list
aria = get_root_data('../critical-fan-toolkit-XML/interviews-encoded/v2/aria-interview-transcription.xml')
dia = get_root_data('../critical-fan-toolkit-XML/interviews-encoded/v2/dialux-interview-transcription.xml')
gill = get_root_data('../critical-fan-toolkit-XML/interviews-encoded/v2/gillywulf-interview-transcription.xml')
kitt = get_root_data('../critical-fan-toolkit-XML/interviews-encoded/v2/kittya-interview-transcription.xml')
valk = get_root_data('../critical-fan-toolkit-XML/interviews-encoded/v2/valk-interview-transcription.xml')
wg = get_root_data('../critical-fan-toolkit-XML/interviews-encoded/v2/writegirl-interview-transcription.xml')
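The more automatic approach mentioned above might look like the following sketch, which loops over xml_list and keys each root by the interviewee name; it assumes the text before the first hyphen in each filename (e.g. "aria") works as a dictionary key.
#a minimal sketch of the "more automatic way" mentioned above:
#build a dictionary of roots keyed by the name in each filename
#(assumes filenames look like "aria-interview-transcription.xml")
roots = {}
for filepath in xml_list:
    name = os.path.basename(filepath).split('-')[0]
    roots[name] = get_root_data(filepath)
#roots['aria'] then holds the same tree as the aria variable above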
In this next section, I created a list of dictionaries with all the attribute values used in each "code" element. The way my XML is set up, the only time I mark up text explicitly is with the "code" element or the "power-identity" element. However, for this portion, I only focus on the "code" element.
Within the "code" element, there are several attributes, inclulding "fan-communities," "writing-agency," and "rgs-genre." For each of these attributes, there are a series of values. To read my codebook, visit the "Qualitative Coding" webpage on my dissertation.
For each "code" across all six interviews, I created a dictionary with the attribute as the key and the attribute value as the value.
attribute_values = []
def append_to_dict(dictionary, rootXML):
    #find every "code" element nested in transcription/dialogue
    for item in rootXML.findall('./transcription/dialogue/code'):
        #copy the element's attributes so we don't mutate the tree itself
        code_list = dict(item.attrib)
        #fold in the attributes of any child elements
        #(getchildren() was removed in Python 3.9, so iterate directly)
        for child in item:
            code_list.update(child.attrib)
        dictionary.append(code_list)
    return dictionary
#adding every interview to the dictionary to make a list of dictionaries of ALL the attribute values
append_to_dict(attribute_values, aria)
append_to_dict(attribute_values, dia)
append_to_dict(attribute_values, gill)
append_to_dict(attribute_values, kitt)
append_to_dict(attribute_values, valk)
append_to_dict(attribute_values, wg)
#remove the "id" attribute, since it is not one of the qualitative codes
for code in attribute_values:
    del code['id']
#a list of dictionaries
attribute_values[:20]
In my XML code, a few attributes have multiple attribute values. For example, in the "rgs-genre" attribute, I might include both "angst" and "fluff," separating the values with whitespace.
This next code cell simply goes through each value in each dictionary and splits it into a list of values. This way, any attribute with two attribute values becomes a list of two values (['angst', 'fluff']) instead of a single string that Python would not be able to recognize as two separate codes.
copy_att = attribute_values.copy()
new_list = []
for i in copy_att:
    #split each whitespace-separated string into a list of values
    new = {key: value.split(" ") for key, value in i.items()}
    new_list.append(new)
#a list of dictionaries, and each dictionary's value is a list. PHEW
new_list[:2]
This next portion creates a dataframe and then uses a for loop to go through the newly created list from above. The for loop goes through each item in the list (that is, each dictionary of attributes and attribute values), breaks down the dictionary's key/value structure, and keeps only the values.
So {'rgs-genre': ['fluff', 'angst']} becomes just ['fluff', 'angst'] and gets added to an empty list. That list and an id for each "code" are added to the dataframe. This way, the dataframe can maintain the integrity of each "code."
#collect each code's id and values as a row, then build the dataframe in one step
rows = []
#create an id for each "code"
n = 0
for i in new_list:
    #create empty list
    empty_list = []
    #add new id for each new code (the +1 increases the number)
    n = n + 1
    #for loop for dictionary items: k is the key and v is the list of values
    for k, v in i.items():
        #append the values onto the new list
        empty_list = empty_list + v
    #each row holds the id and the full list of values for that code
    rows.append({'id': n, "values": empty_list})
df = pd.DataFrame(rows, columns=['id', 'values'])
The "explode" function takes a list within a dataframe that has a unique identifier AND a list within a dataframe column/row and creates a longer dataframe. The attribute values ["fluff","angst"] now each have their own row along with their original code's unique identifier.
Then I will add a "count" column and "pivot" the dataframe to make it into the matrix.
The last cell in this section is the final dataframe, which has more of the matrix feel. There is a unique ID (the index) for each of the original codes and then a number of how many times each attribute value appears in that code. As the dataframe shows, most codes only have one or two attribute values, so there are mostly zeros.
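Here is a toy run of the same explode, count, and pivot pipeline on two made-up codes, just to show the shape of the transformation (the ids and values below are illustrative, not from the interviews):
#toy demonstration of explode -> count -> pivot on made-up data
toy_df = pd.DataFrame({'id': [1, 2], 'values': [['fluff', 'angst'], ['fluff']]})
toy_df = toy_df.explode('values')  #one row per (id, value) pair
toy_df['count'] = toy_df.groupby(['id', 'values'])['values'].transform('count')
print(toy_df.pivot(index='id', columns='values', values='count').fillna(0))
#values  angst  fluff
#id
#1         1.0    1.0
#2         0.0    1.0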
#explode gives each item in a list its own row, keeping the original code's id
df = df.explode('values')
df
#adding a "count" columnn
df["count"] = df.groupby(["id","values"])["values"].transform("count")
df
df = df.pivot(index="id", columns="values", values="count").fillna(0)
#dropping the "important-quote" column, which is not one of the coded attribute values
df = df.drop(columns=["important-quote"])
df
Now that we have this large dataframe with the 424 unique code elements and the number of times each attribute value appeared in each element, we can create matrices and better capture how often particular attribute values appeared together!
Specifically, the adjacency matrix (df.T.dot(df)) shows a raw count of how often particular attribute values were used together, while the correlation matrix normalizes these relationships into values between -1 and 1. This will be especially useful when talking about the relationships between writing, motivation, uptake, genre, and how fans interpret the canonical text.
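To see why df.T.dot(df) counts co-occurrences, consider a toy matrix with two codes (rows) and two attribute values (columns): the dot product of the transpose with itself sums, for each pair of columns, the rows where both values appear. The numbers below are made up for illustration.
#toy demonstration: df.T.dot(df) counts how often two columns are both 1 in the same row
toy = pd.DataFrame({'angst': [1, 0], 'fluff': [1, 1]}, index=[1, 2])
print(toy.T.dot(toy))
#       angst  fluff
#angst      1      1
#fluff      1      2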
#correlation matrix
corr = df.corr()
corr
#keep only the correlations stronger than +/-0.10; weaker pairs become NaN
strong_pairs = corr[abs(corr) > 0.10]
print(strong_pairs)
#Adjacency/co-occurrence matrix!!!
coocc = df.T.dot(df)
coocc
#saving the co-occurrence matrix
coocc.to_csv('./data/cooccurance_matrix_interviews.csv')
This next section is just playing with some visualizations. The last cell is the interactive chart, which can be found in the "Visualization" section of the Critical Fan Toolkit. I mainly relied on the plotly and plotly express documentation to create these visualizations.
import matplotlib.pyplot as plt
plt.imshow(coocc, cmap='hot', interpolation='nearest')
plt.show()
fig = px.imshow(coocc, color_continuous_scale='purd')
fig.show()
fig = px.imshow(corr, color_continuous_scale='purd')
fig.show()
plotly.offline.plot(fig, filename='./images/correlation.html')
fig = go.Figure()
#adding the correlation map (add_trace returns the figure, so no need to assign it)
fig.add_trace(go.Heatmap(z=corr, x=corr.index, y=corr.index, colorscale="purd"))
#adding the co-occurrence map
fig.add_trace(go.Heatmap(z=coocc, x=coocc.index, y=coocc.index, colorscale="purd"))
#making it a square
fig.update_layout(
    width=800,
    height=800,
    template="plotly_white",
)
fig.update_scenes(
    aspectratio=dict(x=1, y=1, z=0.7),
    aspectmode="manual"
)
#dropdown menu to toggle between the two heatmaps
fig.update_layout(
    updatemenus=[
        dict(
            buttons=list([
                dict(
                    args=[{"visible": [False, True]}],
                    label="Adjacency Matrix",
                    method="restyle"
                ),
                dict(
                    args=[{"visible": [True, False]}],
                    label="Correlation Matrix",
                    method="restyle"
                )
            ]),
            direction="down",
            pad={"r": 10, "t": 5},
            showactive=True,
            x=0.1,
            xanchor="right",
            y=1.1,
            yanchor="top"
        ),
    ]
)
plotly.offline.plot(fig, filename='./images/correlation_and_cooccurrence.html')