Graph Analysis and Relationship Extraction from Dark Netflix
Decoding the Time Travel Puzzle: A Graph-based Exploration of Relationships in Dark
Introduction
Calling all data scientists and enthusiasts! Prepare to explore the intricacies of Dark, the acclaimed TV series that offers a data-rich playground for analysis. In my project, I unraveled Dark’s complex relationships using graph analysis, natural language processing (NLP), and web scraping.
Dark’s interwoven timelines and interconnected characters make it a good challenge for data scientists. Web scraping supplies the raw text, NLP pulls the character names out of that text, and graph analysis maps how the characters relate to one another.
Join me on this data-driven adventure into Dark’s heart. We’ll navigate graphs, NLP, and web scraping, encountering challenges and making captivating discoveries along the way.
Are you ready to unlock Dark’s secrets? Let’s embark on this data-driven exploration together.
What Are Graphs?
Now, let’s take a moment to explore the concept of graphs in a gentle and easy-to-understand manner. In the context of data science and analysis, graphs provide a powerful framework for representing and analyzing relationships between entities.
Imagine you have a social network like Facebook or LinkedIn. Each person in the network can be represented as a node, while the connections between individuals form the edges of the graph. This graphical representation allows us to visualize the social connections, identify clusters of friends, and understand the overall structure of the network.
Similarly, consider a transportation network like a city’s road system. Each road intersection or junction can be a node, and the roads themselves become the edges connecting those nodes. By analyzing this graph, we can uncover insights about traffic patterns, identify key routes, and optimize transportation logistics.
Graphs are also used in recommendation systems, where each user and item can be represented as nodes, and the connections reflect user-item interactions. By analyzing this graph, we can generate personalized recommendations, discover similar users or items, and enhance the user experience.
In data science and analysis, graphs serve as a versatile tool for understanding relationships, detecting patterns, and making informed decisions. They provide a visual representation that simplifies complex networks and unlocks valuable insights.
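To make the node-and-edge idea concrete, here is a minimal sketch using NetworkX, the library used later in this article. The names and friendships are invented for illustration only.

```python
import networkx as nx

# Toy social network: people are nodes, friendships are edges.
G = nx.Graph()
G.add_edges_from([
    ("Ana", "Ben"),
    ("Ana", "Cara"),
    ("Ben", "Cara"),
    ("Cara", "Dev"),
])

print(G.number_of_nodes())          # 4
print(G.number_of_edges())          # 4
print(sorted(G.neighbors("Cara")))  # ['Ana', 'Ben', 'Dev']
```

Adding an edge creates any missing nodes automatically, so an edge list is all that is needed to describe a graph.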
Now that we understand the fundamental concept of graphs and their relevance, let’s apply this knowledge to unravel the relationships within the Dark TV series. By leveraging graph analysis techniques, we can uncover the hidden connections and gain a deeper understanding of the complex web that defines this captivating show.
Where to Get the Data?
The first challenge was finding data. As always, I started with online resources such as Kaggle and Google Dataset Search, but I didn’t find anything, so I started collecting it myself. Yes, it’s time for web scraping.
I found a website called Dark Netflix Fandom that had all the data I needed: biographies of the main characters, with details such as their relationships, place of birth, and role in the series. Think of it as a Wikipedia for the Dark TV series.
What Data Did I Scrape?
With no ready-made dataset available, the character biographies on Dark Netflix Fandom became my data source. Each biography covers a character’s relationships, place of birth, and role within the series, which is exactly the material needed for relationship extraction.
To accomplish the scraping task, I used Selenium, a browser-automation library used mainly for testing. Because the site’s pages are rendered dynamically, driving a real browser let me navigate the pages and extract the text I needed.
My scraper targeted the character biographies specifically. It extracted the biographical text for each character and stored each biography in its own .txt file, using the character’s name as the filename for easy reference and further analysis.
With this meticulous approach, I acquired the necessary dataset, laying the foundation for a more in-depth exploration of the relationships within the Dark TV series.
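The scraping and storage steps could look roughly like the sketch below. It is not the original scraper: the function names, the CSS selector, and the folder layout are all assumptions made for illustration, and the selector in particular would need to be checked against the real page markup.

```python
from pathlib import Path


def fetch_biography(url: str) -> str:
    """Return the paragraph text of one character page (hypothetical helper)."""
    # Imported here so the file helpers below work without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Assumed selector: MediaWiki-based wikis usually wrap article
        # text in an element with this class.
        paragraphs = driver.find_elements(By.CSS_SELECTOR, "div.mw-parser-output p")
        return "\n".join(p.text for p in paragraphs if p.text.strip())
    finally:
        driver.quit()


def save_biography(name: str, text: str, out_dir: Path) -> Path:
    """Write one biography to <out_dir>/<name>.txt and return the path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{name}.txt"
    path.write_text(text, encoding="utf-8")
    return path


def load_combined_text(out_dir: Path) -> str:
    """Concatenate every saved biography into one string."""
    files = sorted(out_dir.glob("*.txt"))
    return "\n".join(f.read_text(encoding="utf-8") for f in files)
```

Under these assumptions, calling `load_combined_text` on the output folder would produce the single `combined_text` string that the extraction step later in the article expects.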
Cleaning Character Names
While exploring the dataset, I discovered that some character names appeared in both German and English spellings. To keep them consistent, I replaced the German spellings with their English counterparts using Python’s `str.replace` method. The same function also strips stray text, such as citation markers, and keeps only the first name.
def clean_name(txt):
    """
    Normalize a name extracted from combined_text.

    Some characters appear under several names, and some extracted
    names carry stray text such as citation markers. After cleaning,
    only the character's first name remains.
    """
    txt = txt.replace(".[2]After Jonas's", "")
    txt = txt.replace("sex.[4]", "")
    txt = txt.replace("Michael/Mikkel", "Mikkel")
    txt = txt.replace("Michael", "Mikkel")
    txt = txt.replace("Aleksander", "Alexander")
    txt = txt.replace("added", "")
    txt = txt.replace("Hannoh", "Noah")
    # keep only the first word, i.e. drop the last name
    txt = txt.split(' ')[0]
    return txt
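As the list of substitutions grows, a table-driven variant can be easier to maintain. The sketch below applies the same replacements in the same order; `REPLACEMENTS` and `clean_name_table` are names introduced here for illustration, not part of the original project.

```python
# Same substitutions as clean_name, kept as ordered (old, new) pairs.
# Order matters: "Michael/Mikkel" must be handled before plain "Michael".
REPLACEMENTS = [
    (".[2]After Jonas's", ""),
    ("sex.[4]", ""),
    ("Michael/Mikkel", "Mikkel"),
    ("Michael", "Mikkel"),
    ("Aleksander", "Alexander"),
    ("added", ""),
    ("Hannoh", "Noah"),
]


def clean_name_table(txt: str) -> str:
    """Apply every replacement in order, then keep only the first word."""
    for old, new in REPLACEMENTS:
        txt = txt.replace(old, new)
    return txt.split(" ")[0]


print(clean_name_table("Aleksander Tiedemann"))    # Alexander
print(clean_name_table("Michael/Mikkel Nielsen"))  # Mikkel
print(clean_name_table("Jonas Kahnwald"))          # Jonas
```

Adding a new spelling fix then means adding one pair to the list rather than another line of code.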
Extracting Relationships: Applying NER and Graph Analysis
To uncover the intricate relationships within the Dark TV series, I employed a combination of Named Entity Recognition (NER) and graph analysis. Here’s a breakdown of the code I used for extracting the relationships:
import networkx as nx
import spacy

nlp = spacy.load('en_core_web_sm')

def extract_relationships(text):
    """Build a weighted co-occurrence graph of PERSON entities."""
    doc = nlp(text)
    G = nx.Graph()
    for sent in doc.sents:
        # get named entities in the sentence
        entities = [ent.text for ent in sent.ents if ent.label_ == 'PERSON']
        # apply cleaning function
        entities = [clean_name(s) for s in entities]
        # add edges between named entities in the sentence
        for i, source in enumerate(entities):
            # start after i: G is undirected, so visiting both (i, j)
            # and (j, i) would count every co-occurrence twice
            for target in entities[i + 1:]:
                # skip empty names and repeated mentions of one character
                if not source or not target or source == target:
                    continue
                # check if edge already exists in the graph
                if G.has_edge(source, target):
                    # increment the weight of the edge
                    G[source][target]['weight'] += 1
                else:
                    # add a new edge with weight of 1
                    G.add_edge(source, target, weight=1)
    return G

G = extract_relationships(combined_text)
# Convert the graph to a pandas edge list (columns: source, target, weight)
df = nx.to_pandas_edgelist(G)
In this code snippet, I leveraged the power of spaCy’s NER module (`nlp`) to extract named entities, specifically persons, from the combined biographies of all the characters. I then constructed a graph using the NetworkX library `G = nx.Graph()`.
For each sentence in the text, I retrieved the named entities and applied a cleaning function (`clean_name`) to ensure consistency. This function handles any necessary formatting or standardization to enhance the quality of the extracted names.
Next, I established edges between the named entities within each sentence. Using nested loops, I iterated through the entities and created edges connecting different entities while avoiding self-connections. To capture the frequency or importance of each relationship, I incorporated weighted edges. If an edge already existed, I incremented its weight, reflecting its relevance. Otherwise, I added a new edge with an initial weight of 1.
Finally, the function returns the constructed graph (`G`), which represents the relationships between the characters. I then converted it to an edge-list DataFrame with `nx.to_pandas_edgelist`.
Now armed with the extracted relationships, we can proceed to unravel the intriguing web of connections and unveil the narrative dynamics of Dark.
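The weighting logic is easier to see without the NLP step. The sketch below assumes the names in each sentence have already been extracted and cleaned, and counts how often each pair shares a sentence. The sentences are invented, and unlike the function above it counts a pair at most once per sentence.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(sentences):
    """Count, for each pair of names, how many sentences mention both."""
    weights = Counter()
    for names in sentences:
        # sorted(set(...)) drops repeated mentions and gives every pair
        # one canonical order, so (a, b) and (b, a) share a counter
        for a, b in combinations(sorted(set(names)), 2):
            weights[(a, b)] += 1
    return weights


# Invented input: one list of cleaned names per sentence.
sentences = [
    ["Jonas", "Martha"],
    ["Martha", "Jonas", "Mikkel"],
    ["Ulrich"],
]
weights = count_cooccurrences(sentences)
print(weights[("Jonas", "Martha")])   # 2
print(weights[("Martha", "Mikkel")])  # 1
```

A sentence with a single name contributes nothing, which is why "Ulrich" never appears in the result.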
Constructing the Graph: Converting a Pandas DataFrame
To further analyze the relationships extracted from the Dark TV series, I utilized the NetworkX library to create a graph from a Pandas DataFrame. The following code snippet demonstrates this process:
Graph = nx.from_pandas_edgelist(df,
                                source='source',
                                target='target',
                                edge_attr='weight',
                                create_using=nx.Graph())
In this code, the Pandas DataFrame (`df`) serves as the foundation for building the graph. By utilizing the `from_pandas_edgelist()` function from NetworkX, I seamlessly transformed the DataFrame into a graph representation.
The `source` and `target` parameters indicate the columns in the DataFrame that specify the source and target nodes for each edge in the graph. This allows for the identification of the relationship between different characters.
Additionally, the `edge_attr` parameter specifies the column that contains the edge weights. This enables the incorporation of weighted edges, reflecting the significance or frequency of the relationships between characters.
By specifying `create_using=nx.Graph()`, I ensured that a new instance of an undirected graph is created to represent the Dark TV series relationships.
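A small round trip shows what these parameters do. The edge list below is invented for illustration (the weights are not results from the project); only the column names match the ones used above.

```python
import networkx as nx
import pandas as pd

# Made-up edge list with the same column names as the article's DataFrame.
edges = pd.DataFrame({
    "source": ["Jonas", "Jonas", "Martha"],
    "target": ["Martha", "Mikkel", "Ulrich"],
    "weight": [3, 1, 2],
})

graph = nx.from_pandas_edgelist(
    edges,
    source="source",
    target="target",
    edge_attr="weight",
    create_using=nx.Graph(),
)

# The weight column becomes an attribute on each edge.
print(graph["Jonas"]["Martha"]["weight"])  # 3

# Converting back yields the same three columns.
round_trip = nx.to_pandas_edgelist(graph)
print(sorted(round_trip.columns))  # ['source', 'target', 'weight']
```

Because the graph is undirected, `graph["Martha"]["Jonas"]` refers to the same edge and carries the same weight.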
Identifying the Most Important Characters: Introducing Degree Centrality
To gain deeper insights into the Dark TV series, I turned to a fundamental concept in graph analysis: degree centrality. Degree centrality measures the importance or prominence of a node within a graph based on the number of edges it has. In simpler terms, it quantifies how connected a character is to others in the series.
Let’s consider a real-world analogy to understand degree centrality. Imagine you’re attending a social gathering where connections and interactions are abundant. If you find yourself at the center of many conversations, linking different groups of people together, you possess a high degree centrality within the social network. Your influence and ability to disseminate information are considerable due to the numerous connections you maintain.
Applying this concept to the Dark TV series, degree centrality allows us to identify the characters who hold the most pivotal positions within the narrative. By analyzing the number of relationships each character has, we gain insights into their significance and impact on the storyline.
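As a minimal sketch of the idea, the toy graph below compares the raw degree of each node with NetworkX's normalized degree centrality, which divides the degree by the number of other nodes. The node names are placeholders, not characters.

```python
import networkx as nx

# Toy graph: "hub" is connected to everyone, "c" only to the hub.
G = nx.Graph()
G.add_edges_from([("hub", "a"), ("hub", "b"), ("hub", "c"), ("a", "b")])

degree = dict(G.degree)               # raw number of edges per node
centrality = nx.degree_centrality(G)  # degree divided by (n - 1)

# Rank nodes from most to least connected and keep the top entries.
ranking = sorted(degree, key=degree.get, reverse=True)
top_nodes = ranking[:10]

print(degree["hub"], degree["c"])  # 3 1
print(top_nodes[0])                # hub
```

The two measures always produce the same ranking within one graph. Normalizing only matters when comparing graphs of different sizes.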
Now, visualizing this analysis became a priority. Although Matplotlib, a popular data visualization library, provides useful functionality, I wanted a more dynamic and interactive solution. Enter `pyvis.network`, a library for creating interactive network visualizations.
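from pyvis.network import Network

net = Network(notebook=True, height="800px", width="100%", bgcolor='#222222', font_color='white')
net.toggle_hide_edges_on_drag(True)
net.barnes_hut(gravity=-500)
# number of connections per character, used below as the node size
node_degree = dict(Graph.degree)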
With `pyvis.network`, I was able to bring the graph to life. I sized each node by its degree (the raw count behind degree centrality) and colored it according to the degree band it falls into, so that the characters’ relative importance is visible at a glance. This made exploring the network more engaging and intuitive.
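# store each node's degree as its 'size' so pyvis scales the node by it
nx.set_node_attributes(Graph, node_degree, 'size')
net.from_nx(Graph)
# color nodes by degree band, checking the highest threshold first
for n in net.nodes:
    if n['size'] > 30:
        n['color'] = 'red'
    elif n['size'] > 25:
        n['color'] = 'purple'
    elif n['size'] > 20:
        n['color'] = 'yellow'
    elif n['size'] > 10:
        n['color'] = 'green'
    else:
        # everything else, i.e. a degree of 10 or fewer
        n['color'] = 'blue'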
Graph Color Documentation
- 🔴 Most Important
- 🟣 Semi-important
- 🟡 Moderately important
- 🟢 Secondary characters
- 🔵 NPCs
Top 10 Most Important Characters in the Series, Based on Degree Centrality
Detecting Communities Between the Characters: Unveiling Relationship Patterns
To delve even deeper into the relationships within the Dark TV series, I employed advanced graph analysis techniques to detect communities or clusters among the characters. These communities represent groups of characters who are closely connected to one another, often sharing similar roles or storylines.
Alongside community detection, I computed two further centrality measures, betweenness centrality and closeness centrality, which describe a character’s position in the network from different angles. Let’s explore these concepts with real-life examples to better grasp their significance.
Betweenness Centrality
Betweenness centrality measures the extent to which a character serves as a bridge or intermediary between other characters within the network. In a social context, consider a person who acts as a liaison, connecting different groups and facilitating communication. This person holds a high betweenness centrality within the network due to their crucial role in maintaining connections. Similarly, in the Dark TV series, characters with high betweenness centrality serve as vital links between different groups or storylines, playing pivotal roles in the overall narrative.
Closeness Centrality
Closeness centrality, on the other hand, gauges how quickly a character can reach, and be reached by, everyone else in the network. It is based on the average shortest-path distance from a node to all other nodes: the shorter those paths, the higher the score. In a social scenario, think of an individual who is only a few introductions away from everybody and can therefore spread information or influence decisions quickly. In the Dark TV series, characters with high closeness centrality are a short chain of relationships away from most of the cast.
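Both measures are available directly in NetworkX. The sketch below uses a toy graph, two triangles joined through a single "bridge" node, to show that the connecting node scores highest on both. The node names are placeholders, not characters from the series.

```python
import networkx as nx

# Toy graph: two triangles linked only through the "bridge" node.
G = nx.Graph()
G.add_edges_from([
    ("a1", "a2"), ("a2", "a3"), ("a1", "a3"),
    ("b1", "b2"), ("b2", "b3"), ("b1", "b3"),
    ("a3", "bridge"), ("bridge", "b1"),
])

# Share of shortest paths between other nodes that pass through each node.
betweenness = nx.betweenness_centrality(G)
# Inverse of the average shortest-path distance to all other nodes.
closeness = nx.closeness_centrality(G)

print(max(betweenness, key=betweenness.get))  # bridge
print(max(closeness, key=closeness.get))      # bridge
```

On a real character network the two rankings can differ: a character may sit close to everyone without being the only link between groups.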
Community Detection
To perform the community detection, I used the `community_louvain` module from the python-louvain package, which implements the Louvain method.
The Louvain method repeats two phases. It starts by assigning each node to its own community. In the first phase, it moves individual nodes into neighboring communities whenever the move increases the modularity score, which measures how much denser the connections inside communities are than would be expected by chance. In the second phase, it collapses each community into a single node and runs the first phase again on the smaller graph.
The algorithm repeats these phases until no move improves modularity any further. Because it is a greedy method, the result is a good partition rather than a guaranteed optimum. The outcome is a partitioning of the network into distinct communities, with nodes grouped together based on the strength of their connections within the community compared to connections outside the community. By applying the Louvain method, I was able to identify the distinct communities among the characters, highlighting the groups that share the strongest connections.
To visualize the detected communities, I utilized NetworkX’s ‘set_node_attributes’ function to assign each character their respective community membership. This allowed for clear visual representation of the communities within the graph.
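The detection and labeling steps can be sketched as follows. To stay self-contained, this example uses the Louvain implementation that ships with NetworkX (version 2.8 or later) rather than python-louvain, where the equivalent call is `community_louvain.best_partition(G)`. The toy graph and the attribute name `group` are assumptions for illustration.

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities, modularity

# Toy graph: two tightly knit groups of four, joined by a single edge.
G = nx.Graph()
group_a = ["a1", "a2", "a3", "a4"]
group_b = ["b1", "b2", "b3", "b4"]
for group in (group_a, group_b):
    for i, u in enumerate(group):
        for v in group[i + 1:]:
            G.add_edge(u, v, weight=1)
G.add_edge("a1", "b1", weight=1)

# seed fixes the random node order so repeated runs give the same partition
communities = louvain_communities(G, weight="weight", seed=42)

# Map each node to the index of its community and store it on the node.
# 'group' is an assumed attribute name chosen for this sketch.
membership = {node: idx for idx, nodes in enumerate(communities) for node in nodes}
nx.set_node_attributes(G, membership, "group")

print(communities)
print(round(modularity(G, communities, weight="weight"), 3))
```

Once the membership is stored as a node attribute, it travels with the graph into whatever visualization step comes next.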
With the communities detected and visualized, I gained deeper insights into the intricate relationship patterns within the Dark TV series. The integration of betweenness centrality, closeness centrality, and community detection techniques provided a comprehensive understanding of the characters’ roles and interconnections.
Top 10 Communicators in the Series
Conclusion
In this data-driven journey through the world of the Dark TV series, we have used web scraping, natural language processing (NLP), in particular Named Entity Recognition (NER), and graph analysis to unravel complex relationships and uncover the underlying narrative dynamics.
Through NER, we extracted valuable information from the character biographies, enabling us to identify and analyze key entities within the Dark universe. Leveraging advanced graph analysis techniques, such as degree centrality, community detection using the Louvain method, and centrality measures like betweenness and closeness centrality, we gained profound insights into the characters’ importance, relationship patterns, and community structures.
I would like to express my heartfelt gratitude for taking the time to read and engage with my article. Your interest and enthusiasm are truly appreciated. It is your support that motivates me to share these data-driven explorations.
I would also like to extend special thanks to Thu Vu, whose insightful YouTube video ‘Network of The Witcher’ served as the inspiration for this project. Thu Vu’s clear explanation of the entire process was instrumental in guiding me through the intricacies of network analysis.