Star Wars Universe Social Network

Original author: Evelina Gabasova

Some people are waiting for Christmas, others for the new Star Wars film, The Force Awakens. In the meantime, I decided to analyze the whole six-film cycle quantitatively and extract the social networks it contains, both from each film individually and from the entire Star Wars universe taken together. A close look at the social networks reveals interesting differences between the original trilogy and the prequels.

Below is the social network built from all six films combined.



An interactive version of the visualization is also available, where you can drag individual nodes with the mouse. Hovering over a node shows the character's name.

Nodes are characters. A line connecting two characters means that they speak in the same scene. The more they talk, the thicker the line. The size of each node is proportional to the number of scenes in which the character appears. I had to make several difficult decisions: for example, Anakin and Darth Vader are obviously the same character, but they are represented by separate nodes in the visualization because the distinction matters for the plot. Conversely, I deliberately merged Palpatine with Darth Sidious and Amidala with Padmé.

The characters of the original trilogy sit predominantly on the right and are almost separate from the prequel characters, since most characters appear in only one of the trilogies. The main nodes connecting the two networks are Obi-Wan Kenobi, R2-D2 and C-3PO. The droids are clearly important for the plot, because they appear in the films more often than anyone else. The two subnetworks also differ in structure. The original trilogy has fewer important nodes (Luke, Han, Leia, Chewbacca and Darth Vader), and they are tightly interconnected. The prequels show more nodes and more connecting lines.

Character Timelines

Since the same characters are found in different films, I created a comparative timeline, divided into episodes.


It contains all mentions of the characters, including mentions of their names in other characters' conversations. Anakin appears alongside Darth Vader in Episode III, and then Darth Vader takes over. Anakin reappears at the end of Episode VI, when Darth Vader turns away from the Dark Side.

The characters who appear consistently in all the films sit at the center of the social network: Obi-Wan, C-3PO and R2-D2. Yoda and the Emperor also appear in all the films, but they talk to only a few people.

Networks for individual episodes

Now consider the episodes separately. Notice how the number of nodes and the complexity of the networks change from the prequels to the original episodes. Again, the prequels have more characters and more interactions among them. The original films have fewer characters, but they interact more often. George Lucas once said:







In fact, this is the story of the tragedy of Darth Vader, it begins when he is nine years old, and ends with his death.

But was Darth Vader/Anakin really the central character? Let's apply the methods of network analysis to identify the central characters and their social structure. I calculated two measures of a character's importance in the network.

  • degree: the number of edges attached to the node in the network, that is, the total number of scenes in which the character talks.
  • betweenness: the number of shortest paths that pass through the node. For example, if you are Leia and want to send a message to Greedo, the shortest route goes through Han Solo. To send a message to Luke, you don't need to go through Han, since Leia knows Luke personally. Han's betweenness is computed as the number of shortest paths between all the other characters that pass through him.

The first measure shows how many characters a character is in contact with, and the second shows how important the character is for the story as a whole. Characters with high betweenness connect different parts of the social network.
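
To make the two measures concrete, here is a minimal F# sketch that computes both on a toy network loosely based on the Leia, Han and Greedo example. The four edges are an assumption made purely for illustration, not data from the films, and the helper names (neighbours, shortestPaths, betweenness) are hypothetical. The real analysis, described further down, relies on the igraph package instead.

```fsharp
// Hypothetical toy network (assumed edges, not taken from the real data)
let edges = [ "LEIA", "HAN"; "LEIA", "LUKE"; "HAN", "LUKE"; "HAN", "GREEDO" ]

let nodes = edges |> List.collect (fun (a, b) -> [a; b]) |> List.distinct

// Characters directly connected to n
let neighbours n =
    edges |> List.collect (fun (a, b) ->
        if a = n then [b] elif b = n then [a] else [])

// Degree: number of edges attached to a node
let degree n = neighbours n |> List.length

// Breadth-first search from a source. For every reachable node it returns
// (distance from the source, number of distinct shortest paths from the source)
let shortestPaths source =
    let rec loop dist frontier (found: Map<string, int * int>) =
        let next =
            frontier
            |> List.collect (fun f ->
                neighbours f
                |> List.filter (fun nb -> not (found.ContainsKey nb))
                |> List.map (fun nb -> nb, snd found.[f]))
            |> List.groupBy fst
            |> List.map (fun (nb, hits) -> nb, (dist, List.sumBy snd hits))
        if List.isEmpty next then found
        else
            let found' =
                next |> List.fold (fun (m: Map<string, int * int>) (k, v) -> m.Add(k, v)) found
            loop (dist + 1) (List.map fst next) found'
    loop 1 [source] (Map.ofList [ source, (0, 1) ])

let allPaths = nodes |> List.map (fun n -> n, shortestPaths n) |> Map.ofList

// Betweenness of v: for every pair (s, t) of other nodes, the fraction of
// shortest s-t paths that pass through v, summed over all pairs
let betweenness v =
    [ for s in nodes do
        for t in nodes do
            if s < t && s <> v && t <> v then
                match Map.tryFind t allPaths.[s], Map.tryFind v allPaths.[s], Map.tryFind t allPaths.[v] with
                | Some (dst, nst), Some (dsv, nsv), Some (dvt, nvt) when dsv + dvt = dst ->
                    yield float (nsv * nvt) / float nst
                | _ -> () ]
    |> List.sum

let hanDegree = degree "HAN"
let hanBetweenness = betweenness "HAN"
let leiaBetweenness = betweenness "LEIA"
```

In this toy network Han has the highest degree and is the only node with non-zero betweenness, because Greedo can be reached only through him, while Leia and Luke talk to each other directly.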

The larger the value, the more important the character. Below are the top five characters for each film, ranked by each measure.


In the first three episodes, Anakin turns out to be the most connected character. Yet he hardly takes part in holding the network together: his betweenness is so low that he does not even make the top five. In other words, the other characters communicate with each other directly rather than through him. And what does the original trilogy look like?


The centrality analysis puts numbers on the impression we got from visualizing the social networks. In the prequels, the social structure is more complex and there are more characters. Anakin is not a central figure: some storylines develop in parallel or relate to him only indirectly. The original trilogy, on the other hand, looks more coherent, with fewer characters tying the story together.

Perhaps this is why the original trilogy is more popular. Its plots are more consistent and are driven by the main characters. The structure of the prequels is less centralized, and there is no central character.

And what do these measures look like when applied to all the films at once? I computed two variants: one with Anakin and Darth Vader as separate characters, and one with the two merged.

On the left are the results with two separate characters, on the right with the characters merged:


In the first case, Anakin remains the most connected character, but not the central one. When merged with Darth Vader, he becomes the third most important character in the betweenness ranking. Either way, it turns out that the character who really holds the films together is Obi-Wan Kenobi.


How it was done

For the most part I used F#, combined with D3.js to visualize the social networks and R to analyze network centrality. All the sources are available on GitHub. Here I will go through only the most interesting parts of the code.


I downloaded all the scripts from The Internet Movie Script Database (IMSDb) (example: the script of Episode IV: A New Hope). Admittedly, these are mostly drafts, which often differ from the final versions.

The first step is to parse the scripts. It turned out that the files differ slightly in format. They are all HTML, but the script text sits inside different tags depending on the file. I used the HTML parser from the F# Data library, which lets you access individual tags with queries like:

open FSharp.Data
let url = ""

The code is available in the parseScripts.fs file.

The next step is to extract the necessary information from the scripts. They usually look like this: each scene begins with a location heading marked INT. (interior) or EXT. (exterior). Explanatory text may follow. In dialogue, character names are written in capital letters and in bold. The bold INT. and EXT. markers can therefore be used as scene separators.


Luke moves along the railing and up to the control room.

LUKE
He told me enough! It was you
who killed him.

VADER
No. I am your father.

Shocked, Luke looks at Vader in utter disbelief.

LUKE
No. No. That's not true!
That's impossible!

open System.Text.RegularExpressions

// split the script by scene
// each scene starts with either INT. or EXT.
// (scenes are accumulated in reverse order)
let rec splitByScene (script : string[]) scenes =
    let scenePattern = "[ 0-9]*(INT.|EXT.)"
    let idx =
        script
        |> Seq.tryFindIndex (fun line -> Regex.Match(line, scenePattern).Success)
    match idx with
    | Some i ->
        let remainingScenes = script.[i+1 ..]
        let currentScene = script.[0..i-1]
        splitByScene remainingScenes (currentScene :: scenes)
    | None -> script :: scenes

This recursive function takes the entire script and searches for the scene pattern: a bold EXT. or INT., optionally preceded by a scene number. It splits the lines into the current scene and the rest of the text, and then repeats the procedure recursively on the rest.
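
As a quick illustration, the sketch below runs the same splitting idea on a few made-up plain-text lines. The function splitLines and the sample input are hypothetical; the pattern is a simplified variant that anchors at the start of the line and escapes the dots, and it ignores the HTML markup of the real scripts.

```fsharp
open System.Text.RegularExpressions

// Simplified variant of the scene splitter for plain-text lines (assumption:
// a scene heading starts the line, optionally preceded by a scene number)
let rec splitLines (script: string[]) (scenes: string[] list) =
    let isHeading (line: string) = Regex.IsMatch(line, @"^[ 0-9]*(INT\.|EXT\.)")
    match Array.tryFindIndex isHeading script with
    | Some i -> splitLines script.[i + 1 ..] (script.[0 .. i - 1] :: scenes)
    | None -> script :: scenes

// Made-up sample input
let sample =
    [| "STAR WARS"
       "1 INT. CLOUD CITY"
       "LUKE: He told me enough!"
       "EXT. DAGOBAH"
       "YODA: Hmm." |]

// The accumulator collects scenes back to front, so reverse it at the end
let sampleScenes = splitLines sample [] |> List.rev
```

The result has three parts: the text before the first heading, then one array of lines per scene. The heading lines themselves are dropped, and the final List.rev restores script order.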

Get a list of characters

In some scenes the character names use the format I described earlier. Others use only the name followed by a colon. Both formats can even occur on a single line. The only common feature was that names are written in capital letters.

I had to resort to regular expressions.

// Extract names of characters that speak in scenes.
// A) Extract names of characters in the format "[name]:"
let getFormat1Names (text : string) =
    let matches = Regex.Matches(text, "[/A-Z0-9 -]+ *:")
    let names =
        seq { for m in matches -> m.Value }
        |> Seq.map (fun name -> name.Trim([|' '; ':'|]))
        |> Array.ofSeq
    names
// B) Extract names of characters in the format "<b> [name] </b>"
// (the bold tags were lost from this listing and are assumed here,
//  since the text says that names are written in bold)
let getFormat2Names (text : string) =
    let m = Regex.Match(text, "<b>[ ]*[/A-Z0-9 -]+[ ]*</b>")
    if m.Success then
        let name = m.Value.Replace("<b>","").Replace("</b>","").Trim()
        [| name |]
    else [||]

Each regular expression matches not only capital letters but also digits, dashes, spaces and slashes, because character names vary: "R2-D2" or even "FODE/BEED".
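
A small, self-contained check of the colon format shows why the character class is that broad. The helper name speakers and the sample lines are made up for this sketch; only the pattern itself comes from the code above.

```fsharp
open System.Text.RegularExpressions

// Same pattern as in the "[name]:" format: capitals, digits, slash, space, dash
let speakerPattern = "[/A-Z0-9 -]+ *:"

// Hypothetical helper: all speaker names found on one line
let speakers (line: string) =
    Regex.Matches(line, speakerPattern)
    |> Seq.cast<Match>
    |> Seq.map (fun m -> m.Value.Trim([| ' '; ':' |]))
    |> List.ofSeq

let twoOnOneLine = speakers "LUKE: Hello there. HAN : Hi"
let droid = speakers "R2-D2: (beeps)"
let twoHeaded = speakers "FODE/BEED: And they're off!"
```

Ordinary capitalised words such as "Hello" do not match, because the lower-case letters break the run before a colon is reached.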

I also had to take into account that some characters go by several names: Palpatine, Darth Sidious and the Emperor; Amidala and Padmé. I created an alias file, aliases.csv, that lists the names to be merged.

let [<Literal>] aliasFile = __SOURCE_DIRECTORY__ + "/data/aliases.csv"
// Use csv type provider to access the csv file with aliases
type Aliases = CsvProvider<aliasFile>
/// Dictionary for translating character names between aliases
let aliasDict =
    // the start of this pipeline was lost from the listing;
    // loading the rows of the alias file is assumed
    Aliases.Load(aliasFile).Rows
    |> Seq.map (fun row -> row.Alias, row.Name)
    |> dict
/// Map character names onto unique set of names
let mapName name = if aliasDict.ContainsKey(name) then aliasDict.[name] else name
/// Extract character names from the given scene
let getCharacterNames (scene: string []) =
    let names1 = scene |> Seq.collect getFormat1Names
    let names2 = scene |> Seq.collect getFormat2Names
    Seq.append names1 names2
    |> Seq.map mapName
    |> Seq.distinct
    |> Array.ofSeq

Now, finally, the character names can be extracted from the scenes. The following function retrieves all character names from all the scripts whose URLs are specified.

let allNames =
  // scriptUrls is an assumed name for the list of (episode, url) pairs;
  // the start of this pipeline was lost from the listing
  scriptUrls
  |> List.map (fun (episode, url) ->
    let script = getScript url
    let scriptParts = script.Elements()
    let mainScript =
        scriptParts
        |> Seq.map (fun element -> element.ToString())
        |> Seq.toArray
    // Now every element of the list is a single scene
    let scenes = splitByScene mainScript []
    // Extract names appearing in each scene
    scenes |> List.map getCharacterNames |> Array.concat )
  |> Array.concat
  |> Seq.countBy id
  |> Seq.filter (snd >> (<) 1)  // filter out characters that speak in only one scene

One problem remained: some of the character names were not really names. They were labels like "Pilot", "Officer" or "Captain". I had to filter out the real names by hand. That is how the characters.csv list came about.

Character Interaction

To build the networks, I needed to identify every occasion on which characters talk to each other. I treat characters as talking when they speak in the same scene (I left out cases where people talk over an intercom or comlink and are therefore in different scenes).

let characters = 
    File.ReadAllLines(__SOURCE_DIRECTORY__ + "/data/characters.csv") 
    |> Array.append (Seq.append aliasDict.Keys aliasDict.Values |> Array.ofSeq)
    |> set

This builds a set of all character names and their aliases for searching and filtering. I then used it to look for characters in each of the scenes.

let scenes = splitByScene mainScript [] |> List.rev
let namesInScenes =
    scenes
    |> List.map getCharacterNames
    |> List.map (fun names -> names |> Array.filter (fun n -> characters.Contains n))

Next, I used the filtered lists of characters to build the social network.

// Create weighted network
let nodes =
    namesInScenes
    |> Seq.collect id
    |> Seq.countBy id
    // optional threshold on minimum number of mentions
    |> Seq.filter (fun (name, count) -> count >= countThreshold)
let nodeLookup = nodes |> Seq.map fst |> set
let links =
    namesInScenes
    |> List.collect (fun names ->
        [ for i in 0..names.Length - 1 do
            for j in i+1..names.Length - 1 do
                let n1 = names.[i]
                let n2 = names.[j]
                if nodeLookup.Contains(n1) && nodeLookup.Contains(n2) then
                    // order nodes alphabetically
                    yield min n1 n2, max n1 n2 ])
    |> Seq.countBy id

This gives a list of nodes together with the number of scenes in which each character speaks throughout the script; this count determines the size of the node. I then created a link between every two characters who speak in the same scene and counted how many scenes each pair shares. Together, the nodes and links define the entire social network.
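
A tiny worked example may help here. The three scenes below are invented, and the threshold on the minimum number of mentions is left out to keep the sketch short; otherwise the counting follows the same steps.

```fsharp
// Hypothetical input: the characters who speak in each of three scenes
let toyScenes =
    [ [| "LUKE"; "HAN"; "LEIA" |]
      [| "HAN"; "LEIA" |]
      [| "LUKE"; "VADER" |] ]

// Node weight: number of scenes in which a character speaks
let toyNodes =
    toyScenes |> Seq.collect id |> Seq.countBy id |> List.ofSeq

// Link weight: number of scenes shared by a pair of characters.
// Ordering each pair alphabetically makes (HAN, LEIA) and (LEIA, HAN) one link.
let toyLinks =
    toyScenes
    |> List.collect (fun names ->
        [ for i in 0 .. names.Length - 1 do
            for j in i + 1 .. names.Length - 1 do
                yield min names.[i] names.[j], max names.[i] names.[j] ])
    |> List.countBy id
```

Here Han and Leia share two scenes, so their link gets weight 2, while every other pair that meets gets weight 1. Vader speaks in a single scene and therefore has the smallest node.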

Finally, I wrote this data out in JSON format. All the social networks, both the global one and those for individual episodes, can be found on my GitHub. The full code for this step is in the getInteractions.fsx file.

Character Mentions

I also decided to find every mention of each character in order to build the timeline. The code turned out to be similar to the code that extracts the dialogue, except that here I looked for all mentions, not only those in dialogue. I also kept track of scene numbers. The following code returns a list of scene numbers and the characters mentioned in them.

let scenes =
    splitByScene mainScript [] |> List.rev
let totalScenes = scenes.Length
// the sources of the pipelines below were lost from the listing and are
// assumed: scenes, characters (names in lower case) and lscene
scenes
|> List.mapi (fun sceneIdx scene ->
    // some names contain typos with lower-case characters
    let lscene = scene |> Array.map (fun s -> s.ToLower())
    characters
    |> Seq.map (fun name ->
        lscene
        |> Array.map (fun contents -> if containsName contents name then Some name else None )
        |> Array.choose id)
    |> Array.concat
    |> Array.map (fun name -> mapName (name.ToUpper()))
    |> Seq.distinct
    |> Seq.map (fun name -> sceneIdx, name)
    |> List.ofSeq)
|> List.collect id,
totalScenes

To build the timelines, I used the scene numbering to map each episode onto the interval [episode number − 1, episode number]. This gave me a relative scale for the characters' appearances across the episodes. Times in the interval [0,1] belong to Episode I, those in [1,2] to Episode II, and so on.

// extract timelines
[0 .. 5]
|> List.map (fun episodeIdx -> getSceneAppearances episodeIdx)
|> List.mapi (fun episodeIdx (sceneAppearances, total) ->
    sceneAppearances
    |> List.map (fun (scene, name) ->
        float episodeIdx + float scene / float total, name))
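
The mapping onto the timeline is plain arithmetic, shown here in isolation. The function name timelinePosition and the numbers are made up for illustration.

```fsharp
// episodeIdx is zero-based (0 = Episode I), scene is the index of the scene
// within the episode and total is the number of scenes in that episode
let timelinePosition (episodeIdx: int) (scene: int) (total: int) =
    float episodeIdx + float scene / float total

// Hypothetical example: scene 50 of 200 in Episode V (zero-based index 4)
let position = timelinePosition 4 50 200
```

The example evaluates to 4.25, a quarter of the way through the interval [4,5] that belongs to Episode V.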

I saved this to a CSV file in which each line contains a character's name followed by the exact times at which the character appears in the films, separated by commas. The full code is available in the getMentions.fsx file.

Adding characters without dialogue

Looking through the per-character conversation statistics, I noticed that R2-D2 and Chewbacca were missing. The Wookiee not only did not get a medal, he also vanished from all the dialogue. Both are mentioned in the scripts, of course, but only as characters without lines.

Ignoring them was not an option, so I decided to insert them into the dialogue-based social network.

I took the node sizes and the links of the two missing characters from the network built from mentions. To turn these into links within the dialogue network, I scaled the values in proportion to those of similar characters who do have lines in the script. I chose C-3PO as a proxy for R2-D2 and Han as a proxy for Chewie, assuming that their interactions would be similar. I applied the following formula to calculate the strength of the links in the conversational social network:



After putting Chewbacca and R2-D2 back by hand, I had a complete set of social networks for the individual films and for the entire franchise. To visualize the social networks I used the Force... well, actually, a force-directed network layout from the D3.js library. This method uses a physical simulation of charged particles. The most important part of the code is:

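d3.json("starwars-episode-1-interactions-allCharacters.json", function(error, graph) {
  /* More code here */
  // one line per link; thickness grows with the square root of the link weight
  var link = svg.selectAll(".link")
      .data(graph.links)
    .enter().append("line")
      .attr("class", "link")
      .style("stroke-width", function(d) { return Math.sqrt(d.value); });
  // one circle per character; radius grows with the square root of the node weight
  var node = svg.selectAll(".node")
      .data(graph.nodes)
    .enter().append("circle")
      .attr("class", "node")
      .style("fill", function (d) { return d.colour; })
      .attr("r", function (d) { return 2*Math.sqrt(d.value) + 2; });
  /* More code here */
});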

In the previous steps I saved all the networks as JSON. Here I load them and define the nodes and links. Each node carries its own colour and a value denoting its importance (the number of the character's lines). This value determines the radius r, so all nodes are scaled by importance. The same goes for the links: the weight of each link was stored in the JSON and is shown here as the line width.

Centrality analysis

Finally, I analyzed the centrality of each character. To do this I used RProvider together with the R igraph package to analyze the networks from F#. First I loaded the network from JSON via FSharp.Data:

open FSharp.Data
open RProvider
open RProvider.igraph
let [<Literal>] linkFile = __SOURCE_DIRECTORY__ + "/networks/starwars-episode-1-interactions.json"
type Network = JsonProvider<linkFile>
let file = __SOURCE_DIRECTORY__ + "/networks/starwars-full-interactions-allCharacters.json"
let nodes = Network.Load(file).Nodes |> Seq.map (fun node -> node.Name)
let links = Network.Load(file).Links

The links variable contains all the links in the network, with the nodes identified by their indices. To simplify the work, I mapped the indices back to character names:

let nodeLookup = nodes |> Seq.mapi (fun i name -> i, name) |> dict
let edges =
    links
    |> Array.collect (fun link ->
        let n1 = nodeLookup.[link.Source]
        let n2 = nodeLookup.[link.Target]
        [| n1 ; n2 |] )

Then I created a graph object using the igraph library:

let graph =
    namedParams["edges", box edges; "dir", box "undirected"]
    |> R.graph

Computing betweenness and degree centrality:

let centrality = R.betweenness(graph)
// the right-hand side was cut off in the listing; R.degree is assumed by analogy
let degreeCentrality = R.degree(graph)

The entire code can be found here.


As always in scientific work, the hardest part is getting the data into a usable form. Since the Star Wars scripts differed slightly in format, I spent most of the time working out the properties the documents have in common so that a single function could process them all. After that, the only remaining trouble was the Wookiee and the droid, who have no lines. The networks in JSON format can be downloaded from GitHub.

