Source: http://jexp.de/blog/2017/03/academy-awards-oscars-from-kaggle-to-neo4j/
This is part 1, in the next part, we’ll look at using import.io to scrape IMDB and the Academy Awards Database. You can query the imported data in this instance (user/pass:oscars) of the brand new Neo4j Sandbox . |
I came across the tweet from @LynnLangit about first step with mxnet, which I really liked.
So I wanted to do the same for Neo4j and was looking for a good dataset.
Then I realized that the 89th Academy Awards (Oscars) ceremony was the next day. I was really looking forward to it, hoping it would come with some strong statements towards the current administration. And then him rage tweeting about it on Monday morning.
But instead we got a fun Jimmy Kimmel performance and the well know Moonlight and La-La-Land mess-up by the (ex)-PWC people.
So I found the data and imported it and had this post ready to go.
But then got distracted trying to scrape IMDB with import.io and missed the date.
But as it is a nice dataset interestingly not as widely available as you’d think, I feel it’s still worth publishing.
So enjoy my struggles with data (quality).
Install Neo4j
Find the data as CSV
It was not that easy to find a full dataset for the Oscar nominations and award winners.
Most CSVs you could find were limited in terms of detail or timespan.
The offical Academy Awards Database had no CSV output format from their search, I’ll skip it for now, it also misses the Nominations for the 89th awards. It would be nice to scrape it with http://import.io, which is probably what others have done.
Kaggle has a CSV with data to 1927 to 2015 but it is a bit messed up, I try to import it in the second part of this post.
I found the data for the Oscars from 1927 to 2010 here at AggData which is a good start.
Especially as it had a directly downloadable URL of the CSV: https://www.aggdata.com/download_sample.php?file=academy_awards.csv
Importing the Data
I wanted to create nodes for Nominations (optionally with an Award label), Nominees, Movies, Categories and Years.
create constraint on (c:Category) assert c.name is unique;
create constraint on (n:Nominee) assert n.name is unique;
create constraint on (m:Movie) assert m.title is unique;
create constraint on (y:Year) assert y.value is unique;
create constraint on (g:Genre) assert g.name is unique;
So the import statement is pretty straightforward.
Only as the AggData CSV had empty (null) columns in the header I couldn’t use LOAD CSV WITH HEADERS
, but had to create the column to name mapping myself.
// load the CVS as stream of row records
LOAD CSV FROM 'https://www.aggdata.com/download_sample.php?file=academy_awards.csv' AS row
// create a map from each row
WITH { year: row[0], category: row[1], nominee: row[2],
info: row[3], won: row[4]} AS data
// skip the broken header
SKIP 1
// get-or-create nodes for Year, Category and Nominee
MERGE (y:Year {value:data.year})
MERGE (c:Category {name: data.category})
MERGE (n:Nominee {name: data.nominee})
// create the Nomination node, set the full data as attributes
CREATE (a:Nomination) SET a = data
// create relationships from the nomination to the other nodes
CREATE (c)<-[:IN_CATEGORY]-(a)-[:IN_YEAR]->(y)
CREATE (a)-[:NOMINATED]->(n)
// if there is a YES in the won column, also set the :Award label
WITH * WHERE data.won = "YES" SET a:Award;
Some Fun Queries
Nominees who got the most nominations over the years (note our dataset is limited to 2010).
Some suprises here for me. I was for instance not aware that Mary Poppins or Jack Nicholson got that many Oscars.
MATCH (n:Nominee)<-[:NOMINATED]-(a)-[:IN_YEAR]->(y)
RETURN n.name, count(*), collect(distinct y.value)
ORDER BY count(*) DESC LIMIT 10;
“n.name” | “count(*)” | “collect(distinct y.value)” |
---|---|---|
“Meryl Streep”
|
16
|
["2008 (81st)","2006 (79th)","2002 (75th)","1998 (71st)","1999 (72nd)","2009 (82nd)","1990 (63rd)","
"Titanic"
|
14
|
["1953 (26th)","1997 (70th)"]
“A Star Is Born”
|
13
|
["1937 (10th)","1976 (49th)","1954 (27th)"]
“Cleopatra”
|
13
|
["1934 (7th)","1963 (36th)"]
“Mutiny on the Bounty”
|
12
|
["1935 (8th)","1962 (35th)"]
“Jack Nicholson”
|
12
|
["2002 (75th)","1969 (42nd)","1970 (43rd)","1973 (46th)","1974 (47th)","1975 (48th)","1992 (65th)","
"Moulin Rouge"
|
12
|
["2001 (74th)","1952 (25th)"]
“Katharine Hepburn”
|
12
|
["1932/33 (6th)","1951 (24th)","1935 (8th)","1942 (15th)","1940 (13th)","1967 (40th)","1968 (41st)",
"The Lord of the Rings The Fellowship of the Ring"
|
12
|
["2001 (74th)"]
“Mary Poppins”
|
12
|
["1964 (37th)"]
|
And it seems Meryl Streep is not an failing actress after all :)
Let’s have a look at her achievements.
MATCH path = (n:Nominee {name:"Meryl Streep"})<-[:NOMINATED]-()-->()
RETURN path
Part 2: Full Import from Kaggle
Now as that dataset was a bit limited, let’s import the Kaggle Data instead (with a lot of special cases).
And then add the 2016 nominations and winners, which I manually scraped from oscar.go.com and turned into the Kaggle format.
The CSVs are here:
- https://dl.dropboxusercontent.com/u/14493611/blog/data/oscars-1926-2015.csv
- https://dl.dropboxusercontent.com/u/14493611/blog/data/oscars-2017.csv
The Kaggle file stated it has this structure:
- Year
- Ceremony (Number)
- Award (Type of Award)
- Winner (1 or empty)
- Name
- Film
In recent years, the last two columns were only correctly populated for the Actress/Actor awards Otherwise the movie was in the “Name” column and some inconsistent text in the “Film” column
Let’s look at the Winners of 2015.
LOAD CSV WITH HEADERS FROM "https://dl.dropboxusercontent.com/u/14493611/blog/data/oscars-1926-2015.csv" AS row
WITH * WHERE row.Winner="1" and row.Year = "2015"
RETURN row.Award, row.Name, row.Film
row.Award | row.Name | row.Film |
---|---|---|
Actor in a Leading Role
|
Leonardo DiCaprio
|
The Revenant
Actor in a Supporting Role
|
Mark Rylance
|
Bridge of Spies
Actress in a Leading Role
|
Brie Larson
|
Room
Actress in a Supporting Role
|
Alicia Vikander
|
The Danish Girl
Animated Feature Film
|
Inside Out
|
Pete Docter and Jonas Rivera
Costume Design
|
Mad Max: Fury Road
|
Jenny Beavan
Directing
|
The Revenant
|
Alejandro G. Iñárritu
Documentary (Feature)
|
Amy
|
Asif Kapadia and James Gay-Rees
|
So we saw, that for these Awards, the column content was not correct, and we had take that into account. To simplify our import statement we did a double pass.
We could basically reuse our import statement from the beginning with some minor changes.
LOAD CSV WITH HEADERS FROM "https://dl.dropboxusercontent.com/u/14493611/blog/data/oscars-1926-2015.csv" AS data
// only use the Actress and Actor awards
WITH data WHERE data.Award STARTS WITH "Act"
MERGE (y:Year {value:data.Year})
MERGE (c:Category {name: data.Award})
MERGE (n:Nominee {name: data.Name})
MERGE (m:Movie {title: data.Film})
// create the Nomination node, set the full data as attributes
CREATE (a:Nomination) SET a = data
// create relationships from the nomination to the other nodes
CREATE (c)<-[:IN_CATEGORY]-(a)-[:IN_YEAR]->(y)
CREATE (a)-[:NOMINATED]->(n),(a)-[:FOR_MOVIE]->(m)
MERGE (n)-[:ACTED_IN]->(m)
// if there is a "1" in the Winner column, also set the :Award label
WITH * WHERE data.Winner = "1" SET a:Award;
For the other awards, we had to treat the “Name” column as movie and run some text processing on the “Film” column to extract the clean(ish) names of the relevant people.
For reduced complexity we just replace a bunch of regular expressions with apoc’s regreplace function.
It was quite annoying because the “free-form” text was quite varied but I got most weeded out, using the query below as my check.
LOAD CSV WITH HEADERS FROM "https://dl.dropboxusercontent.com/u/14493611/blog/data/oscars-1926-2015.csv" AS data
// do not use the Actress and Actor awards
WITH data WHERE NOT data.Award STARTS WITH "Act" AND data.Film contains "by"
WITH *, apoc.text.regreplace(apoc.text.regreplace(apoc.text.regreplace(data.Film,"(?i)([;&]|\\b(and|aka)\\b)",","),"[^A-Za-z0-9, ]?",""),
"(?i)\\b(Third|Second|Producers?|Story|Written|Adaptation|Art|Score|Direction|Production|Set|Decoration|Sound|Department|"+
"Co-Producer|Design|Director|(English )?Lyrics?|Music|Special|Audible|Musical|Screenplay|Dialogue|by|Interior|Head of|"+
"Score|Photographic|Effects|in collaboration|with|Visual|Ballet|Made by|with( the)? cooperation of|"+
"Photography|Thematic|Adapted|Song|Orchestral|for the screen|Stories|Jr|Based on a|Original|Screen)\\b","") as cleanFilm
WITH *,[s IN split(cleanFilm,",") WHERE trim(s) <> "" | trim(s)] as nominees
WHERE any(n in nominees WHERE size(split(n," ")) <> 2)
RETURN data.Film, nominees;
Then we could use our cleaned up data to not only create the movie from the “Name” column but also all nominees from that free-form “Film” column. We used the name of the award as relationship-type between nominee and movie.
LOAD CSV WITH HEADERS FROM "https://dl.dropboxusercontent.com/u/14493611/blog/data/oscars-1926-2015.csv" AS data
// do not use the Actress and Actor awards
WITH data WHERE NOT data.Award STARTS WITH "Act"
MERGE (y:Year {value:data.Year})
MERGE (c:Category {name: data.Award})
MERGE (m:Movie {title: data.Name})
// create the Nomination node, set the full data as attributes
CREATE (a:Nomination) SET a = data
// create relationships from the nomination to the other nodes
CREATE (c)<-[:IN_CATEGORY]-(a)-[:IN_YEAR]->(y)
CREATE (a)-[:FOR_MOVIE]->(m)
FOREACH (winner IN CASE data.Winner WHEN "1" THEN [a] ELSE [] END | SET winner:Award)
WITH *, apoc.text.regreplace(apoc.text.regreplace(apoc.text.regreplace(data.Film,"(?i)([;&]|\\b(and|aka)\\b)",","),"[^A-Za-z0-9, ]?",""),
"(?i)\\b(Third|Second|Producers?|Story|Written|Adaptation|Art|Score|Direction|Production|Set|Decoration|Sound|Department|"+
"Co-Producer|Design|Director|(English )?Lyrics?|Music|Special|Audible|Musical|Screenplay|Dialogue|by|Interior|Head of|"+
"Score|Photographic|Effects|in collaboration|with|Visual|Ballet|Made by|with( the)? cooperation of|"+
"Photography|Thematic|Adapted|Song|Orchestral|for the screen|Stories|Jr|Based on a|Original|Screen)\\b","") as cleanFilm
WITH *,[s IN split(cleanFilm,",") WHERE trim(s) <> "" | trim(s)] as nominees
UNWIND nominees as nominee
MERGE (n:Nominee {name: trim(nominee)})
CREATE (a)-[:NOMINATED]->(n)
WITH * WHERE NOT EXISTS ((n)-->(m))
CALL apoc.create.relationship(n,toUpper(data.Award),{},m) YIELD rel
RETURN count(*);
Now the meta-model of our graph looks like this:
As we created the most recent academy data ourselves, it’s less detailed but cleaner, so we can just use one statement:
LOAD CSV WITH HEADERS FROM "https://dl.dropboxusercontent.com/u/14493611/blog/data/oscars-2017.csv" AS data
MERGE (y:Year {value:data.Year})
MERGE (c:Category {name: data.Award})
// create the Nomination node, set the full data as attributes
CREATE (a:Nomination) SET a = data
// create relationships from the nomination to the other nodes
CREATE (c)<-[:IN_CATEGORY]-(a)-[:IN_YEAR]->(y)
WITH *
UNWIND split(data.Name,";") as name
MERGE (n:Nominee {name: name})
MERGE (a)-[:NOMINATED]->(n)
WITH *
UNWIND split(data.Film,";") as film
MERGE (m:Movie {title: film})
MERGE (a)-[:FOR_MOVIE]->(m)
// if there is a "1" in the Winner column, also set the :Award label
FOREACH (winner IN CASE data.Winner WHEN "1" THEN [a] ELSE [] END | SET winner:Award)
WITH * WHERE NOT EXISTS ((n)-->(m))
call apoc.create.relationship(n,toUpper(data.Award),{},m) yield rel
RETURN count(*);
Now we have the full data in our graph, which you can access on this instance (user/pass:oscars) of the brand new Neo4j Sandbox .
More Queries
- Most different categories
- Which generes win most oscars
- Which countries win most oscars (except the US)
- prediction for 2017
Most Oscars
MATCH (n:Nominee)<-[:NOMINATED]-(a:Award)
RETURN n.name as winner, count(*) as count, collect(distinct a.Award) as awards
ORDER BY count DESC LIMIT 10;
“winner” | “count” | “awards” |
---|---|---|
“Walt Disney”
|
22
|
["Short Subject (Cartoon)","Short Subject (Two Reel)","Documentary (Feature)","Documentary (Short Subject)","Short Subject (Live Action)"]
“Italy”
|
11
|
["Foreign Language Film"]
“Metro-Goldwyn-Mayer”
|
11
|
["Outstanding Production","Short Subject (Two Reel)","Short Subject (One Reel)","Short Subject (Cartoon)","Outstanding Motion Picture","Special Effects"]
“Cedric Gibbons”
|
10
|
["Art Direction","Art Direction (Black and White)","Art Direction (Color)"]
“Alfred Newman”
|
9
|
["Music (Scoring)","Music (Music Score of a Dramatic or Comedy Picture)","Music (Scoring of a Musical Picture)","Music (Scoring of Music, Adaptation or Treatment)"]
“France”
|
9
|
["Foreign Language Film"]
“Dennis Muren”
|
8
|
["Special Achievement Award (Visual Effects)","Visual Effects"]
“Edith Head”
|
8
|
["Costume Design (Black and White)","Costume Design (Color)","Costume Design"]
“Edwin B. Willis”
|
8
|
["Art Direction (Color)","Art Direction (Black and White)"]
“Metro-Goldwyn-Mayer Studio”
|
8
|
["Sound Recording","Sound"]
|
MATCH (n:Nominee)<-[:NOMINATED]-(a:Nomination)-[:IN_YEAR]->(y)
WHERE y.value > "1950"
RETURN n.name as winner, count(*) as count, collect(distinct a.Award) as awards
ORDER BY count DESC LIMIT 10;
Most different Awards
MATCH (n:Nominee)<-[:NOMINATED]-(a:Award)
RETURN n.name as winner, collect(distinct split(a.Award," (")[0]) as awards
ORDER BY size(awards) DESC LIMIT 5;
“winner” | “awards” |
---|---|
“Metro-Goldwyn-Mayer”
|
["Outstanding Production","Short Subject","Outstanding Motion Picture","Special Effects"]
“Billy Wilder”
|
["Directing","Writing","Best Motion Picture"]
“James Cameron”
|
["Directing","Film Editing","Best Picture"]
“Ethan Coen”
|
["Writing","Directing","Best Picture"]
“Robert Wise”
|
["Directing","Best Motion Picture","Best Picture"
|
Which Countries got the most Oscars
MATCH (a:Award)-[:FOR_MOVIE]->(m:Movie)
UNWIND split(m.Country,“, “) as country
RETURN country, count(*) as c
ORDER BY c DESC LIMIT 5;
"country" | "c" |
---|---|
"USA"
|
1673
"UK"
|
366
"France"
|
122
"Germany"
|
72
"Italy"
|
51
|
Which Genres won most
MATCH (a:Award)-[:FOR_MOVIE]->(m:Movie)-[:IN_GENRE]->(g:Genre)
RETURN g.name, count(*) as c
ORDER BY c DESC LIMIT 5;
"g.name" | "c" |
---|---|
"Drama"
|
1316
"Romance"
|
414
"Comedy"
|
376
"Biography"
|
347
"Adventure"
|
279
|
How often did someone win directly or indirectly
E.g. via the movie they participated in.
MATCH (y:Year)<-[:IN_YEAR]-(a:Award)-[*1..2]-(n:Nominee)
WHERE y.value > "1950"
AND none(term IN ["Studio","N/A","Metro-Goldwyn-Mayer"] WHERE n.name contains term)
RETURN n.name, count(distinct a) as c, collect(distinct a.Award)
ORDER BY c DESC LIMIT 10;
"n.name" | "c" | "collect(distinct a.Award)" |
---|---|---|
"John Williams"
|
"54"
|
["Costume Design (Black and White)","Cinematography","Music (Scoring: Adaptation and Original Song Score)","Sound","Film Editing","Music (Original Score)","Special Achievement Award (Visual Effects)","Costume Design","Art Direction","Visual Effects","Special Achievement Award (Sound Effects Editing)","Sound Effects Editing","Directing","Writing (Screenplay Based on Material Previously Produced or Published)","Best Picture","Production Design"]
"Christopher Boyes"
|
"36"
|
["Writing (Story and Screenplay)","Special Achievement Award (Visual Effects)","Art Direction","Cinematography","Costume Design","Visual Effects","Sound","Sound Effects Editing","Best Picture","Directing","Film Editing","Music (Original Dramatic Score)","Sound Editing","Sound Mixing","Makeup","Music (Original Score)","Writing (Adapted Screenplay)"]
"Andy Nelson"
|
"30"
|
["Art Direction (Color)","Costume Design (Color)","Writing (Screenplay Based on Material Previously Produced or Published)","Sound Effects Editing","Sound","Film Editing","Cinematography","Directing","Art Direction","Best Picture","Music (Original Score)","Makeup","Costume Design","Sound Mixing","Production Design","Makeup and Hairstyling","Actor in a Supporting Role","Visual Effects"]
|
…
Directors directly and indirectly
MATCH (y:Year)<-[:IN_YEAR]-(a:Award)-[*1..2]-(n:Nominee)
WHERE y.value > "1950"
AND none(term IN ["Studio","N/A","Metro-Goldwyn-Mayer"] WHERE n.name contains term)
AND exists( (n)-[:DIRECTED]->() )
RETURN n.name, count(distinct a) as c, collect(distinct a.Award)
ORDER BY c DESC LIMIT 10;
"n.name" | "c" |
---|---|
"Steven Spielberg"
|
37
"Robert Redford"
|
30
"Gary Rydstrom"
|
26
"Fred Zinnemann"
|
25
"David Lean"
|
25
"Ralph Fiennes"
|
24
"Paul Newman"
|
24
"Marlon Brando"
|
24
"James Cameron"
|
21
"Peter Jackson"
|
21
|
The Unlucky ones, most often nominated with least wins
MATCH (a:Nomination)-[:NOMINATED]->(n:Nominee)
WITH n.name as name, sum(case when a:Award then 1 else 0 end) as won, count(*) as total
RETURN name, won, total, 100*won/total as percent
ORDER BY percent asc, total desc LIMIT 10;
"name" | "won" | "total" | "percent" |
---|---|---|---|
"Kevin O’Connell"
|
0
|
20
|
0
"Greg P. Russell"
|
0
|
16
|
0
"Alex North"
|
0
|
15
|
0
"Roland Anderson"
|
0
|
15
|
0
"Thomas Newman"
|
0
|
13
|
0
"Republic Studio"
|
0
|
12
|
0
"Roger Deakins"
|
0
|
12
|
0
"George Folsey"
|
0
|
12
|
0
"Rick Kline"
|
0
|
11
|
0
"Walter Lantz"
|
0
|
10
|
0
|
"name" | "won" | "total" | "percent" |
---|---|---|---|
"Metro-Goldwyn-Mayer"
|
11
|
61
|
18
"Walt Disney"
|
22
|
60
|
36
"John Williams"
|
5
|
50
|
10
"Alfred Newman"
|
9
|
46
|
19
"Warner Brothers"
|
7
|
43
|
16
"Cedric Gibbons"
|
10
|
38
|
26
"France"
|
9
|
37
|
24
"Edith Head"
|
8
|
35
|
22
"Edwin B. Willis"
|
8
|
32
|
25
"Max Steiner"
|
4
|
29
|
13
|
Next time we’ll look at using http://import.io to scrape IMDB and the Academy Awards Database.
Other Resources
- https://github.com/bencxs/oscars
- https://projects.fivethirtyeight.com/oscar-predictions-2017/
- https://www.kaggle.com/sebask/d/theacademy/academy-awards/the-oscars-dataset-preparing-for-2017
- https://www.kaggle.com/uvcelokesh/d/deepmatrix/imdb-5000-movie-dataset/and-the-oscars-go-to
- Academy Awards Database
- https://www.kaggle.com/theacademy/academy-awards
- AggData Oscars Data