Academy Awards (Oscars) from Kaggle to Neo4j

Source: http://jexp.de/blog/2017/03/academy-awards-oscars-from-kaggle-to-neo4j/

This is part 1, in the next part, we’ll look at using import.io to scrape IMDB and the Academy Awards Database. You can query the imported data in this instance (user/pass:oscars) of the brand new Neo4j Sandbox .

I came across the tweet from @LynnLangit about first step with mxnet, which I really liked.

lynn langit mxnet.jpg

So I wanted to do the same for Neo4j and was looking for a good dataset.

Then I realized that the 89th Academy Awards (Oscars) ceremony was the next day. I was really looking forward to it, hoping it would come with some strong statements towards the current administration. And then him rage tweeting about it on Monday morning.

But instead we got a fun Jimmy Kimmel performance and the well know Moonlight and La-La-Land mess-up by the (ex)-PWC people.

So I found the data and imported it and had this post ready to go.

But then got distracted trying to scrape IMDB with import.io and missed the date.

But as it is a nice dataset interestingly not as widely available as you’d think, I feel it’s still worth publishing.

So enjoy my struggles with data (quality).

Install Neo4j

brew install neo4j.jpg

Find the data as CSV

It was not that easy to find a full dataset for the Oscar nominations and award winners.

Most CSVs you could find were limited in terms of detail or timespan.

The offical Academy Awards Database had no CSV output format from their search, I’ll skip it for now, it also misses the Nominations for the 89th awards. It would be nice to scrape it with http://import.io, which is probably what others have done.

Kaggle has a CSV with data to 1927 to 2015 but it is a bit messed up, I try to import it in the second part of this post.

I found the data for the Oscars from 1927 to 2010 here at AggData which is a good start.

Especially as it had a directly downloadable URL of the CSV: https://www.aggdata.com/download_sample.php?file=academy_awards.csv

Importing the Data

I wanted to create nodes for Nominations (optionally with an Award label), Nominees, Movies, Categories and Years.

create constraint on (c:Category) assert c.name is unique;
create constraint on (n:Nominee) assert n.name is unique;
create constraint on (m:Movie) assert m.title is unique;
create constraint on (y:Year) assert y.value is unique;
create constraint on (g:Genre) assert g.name is unique;

oscars schema.jpg

So the import statement is pretty straightforward.

Only as the AggData CSV had empty (null) columns in the header I couldn’t use LOAD CSV WITH HEADERS, but had to create the column to name mapping myself.

// load the CVS as stream of row records
LOAD CSV FROM 'https://www.aggdata.com/download_sample.php?file=academy_awards.csv' AS row

// create a map from each row
WITH { year: row[0], category: row[1], nominee: row[2],
       info: row[3], won: row[4]} AS data
// skip the broken header
SKIP 1

// get-or-create nodes for Year, Category and Nominee
MERGE (y:Year {value:data.year})
MERGE (c:Category {name: data.category})
MERGE (n:Nominee {name: data.nominee})

// create the Nomination node, set the full data as attributes
CREATE (a:Nomination) SET a = data

// create relationships from the nomination to the other nodes
CREATE (c)<-[:IN_CATEGORY]-(a)-[:IN_YEAR]->(y)
CREATE (a)-[:NOMINATED]->(n)

// if there is a YES in the won column, also set the :Award label
WITH * WHERE data.won = "YES" SET a:Award;

neo4j academy awards.jpg

Some Fun Queries

Nominees who got the most nominations over the years (note our dataset is limited to 2010).

Some suprises here for me. I was for instance not aware that Mary Poppins or Jack Nicholson got that many Oscars.

MATCH (n:Nominee)<-[:NOMINATED]-(a)-[:IN_YEAR]->(y)
RETURN n.name, count(*), collect(distinct y.value)
ORDER BY count(*) DESC LIMIT 10;

“n.name”	“count(*)”	“collect(distinct y.value)”

“Meryl Streep”

["2008 (81st)","2006 (79th)","2002 (75th)","1998 (71st)","1999 (72nd)","2009 (82nd)","1990 (63rd)","

"Titanic"

["1953 (26th)","1997 (70th)"]

“A Star Is Born”

["1937 (10th)","1976 (49th)","1954 (27th)"]

“Cleopatra”

["1934 (7th)","1963 (36th)"]

“Mutiny on the Bounty”

["1935 (8th)","1962 (35th)"]

“Jack Nicholson”

["2002 (75th)","1969 (42nd)","1970 (43rd)","1973 (46th)","1974 (47th)","1975 (48th)","1992 (65th)","

"Moulin Rouge"

["2001 (74th)","1952 (25th)"]

“Katharine Hepburn”

["1932/33 (6th)","1951 (24th)","1935 (8th)","1942 (15th)","1940 (13th)","1967 (40th)","1968 (41st)",

"The Lord of the Rings The Fellowship of the Ring"

["2001 (74th)"]

“Mary Poppins”

["1964 (37th)"]

oscars top.jpg

And it seems Meryl Streep is not an failing actress after all :)

Let’s have a look at her achievements.

MATCH  path = (n:Nominee {name:"Meryl Streep"})<-[:NOMINATED]-()-->()
RETURN path

oscars meryl.jpg

Part 2: Full Import from Kaggle

Now as that dataset was a bit limited, let’s import the Kaggle Data instead (with a lot of special cases).

And then add the 2016 nominations and winners, which I manually scraped from oscar.go.com and turned into the Kaggle format.

The CSVs are here:

The Kaggle file stated it has this structure:

Year
Ceremony (Number)
Award (Type of Award)
Winner (1 or empty)
Name
Film

In recent years, the last two columns were only correctly populated for the Actress/Actor awards Otherwise the movie was in the “Name” column and some inconsistent text in the “Film” column

Let’s look at the Winners of 2015.

LOAD CSV WITH HEADERS FROM "https://dl.dropboxusercontent.com/u/14493611/blog/data/oscars-1926-2015.csv" AS row
WITH * WHERE row.Winner="1" and row.Year = "2015"
RETURN row.Award, row.Name, row.Film

row.Award	row.Name	row.Film

Actor in a Leading Role

Leonardo DiCaprio

The Revenant

Actor in a Supporting Role

Mark Rylance

Bridge of Spies

Actress in a Leading Role

Brie Larson

Room

Actress in a Supporting Role

Alicia Vikander

The Danish Girl

Animated Feature Film

Inside Out

Pete Docter and Jonas Rivera

Costume Design

Mad Max: Fury Road

Jenny Beavan

Directing

The Revenant

Alejandro G. Iñárritu

Documentary (Feature)

Amy

Asif Kapadia and James Gay-Rees

So we saw, that for these Awards, the column content was not correct, and we had take that into account. To simplify our import statement we did a double pass.

We could basically reuse our import statement from the beginning with some minor changes.

LOAD CSV WITH HEADERS FROM "https://dl.dropboxusercontent.com/u/14493611/blog/data/oscars-1926-2015.csv" AS data

// only use the Actress and Actor awards
WITH data WHERE data.Award STARTS WITH "Act"

MERGE (y:Year {value:data.Year})
MERGE (c:Category {name: data.Award})
MERGE (n:Nominee {name: data.Name})
MERGE (m:Movie {title: data.Film})

// create the Nomination node, set the full data as attributes
CREATE (a:Nomination) SET a = data

// create relationships from the nomination to the other nodes
CREATE (c)<-[:IN_CATEGORY]-(a)-[:IN_YEAR]->(y)
CREATE (a)-[:NOMINATED]->(n),(a)-[:FOR_MOVIE]->(m)
MERGE (n)-[:ACTED_IN]->(m)
// if there is a "1" in the Winner column, also set the :Award label
WITH * WHERE data.Winner = "1" SET a:Award;

For the other awards, we had to treat the “Name” column as movie and run some text processing on the “Film” column to extract the clean(ish) names of the relevant people.

For reduced complexity we just replace a bunch of regular expressions with apoc’s regreplace function.

It was quite annoying because the “free-form” text was quite varied but I got most weeded out, using the query below as my check.

LOAD CSV WITH HEADERS FROM "https://dl.dropboxusercontent.com/u/14493611/blog/data/oscars-1926-2015.csv" AS data

// do not use the Actress and Actor awards
WITH data WHERE NOT data.Award STARTS WITH "Act" AND data.Film contains "by"

WITH *, apoc.text.regreplace(apoc.text.regreplace(apoc.text.regreplace(data.Film,"(?i)([;&]|\\b(and|aka)\\b)",","),"[^A-Za-z0-9, ]?",""),
     "(?i)\\b(Third|Second|Producers?|Story|Written|Adaptation|Art|Score|Direction|Production|Set|Decoration|Sound|Department|"+
     "Co-Producer|Design|Director|(English )?Lyrics?|Music|Special|Audible|Musical|Screenplay|Dialogue|by|Interior|Head of|"+
     "Score|Photographic|Effects|in collaboration|with|Visual|Ballet|Made by|with( the)? cooperation of|"+
     "Photography|Thematic|Adapted|Song|Orchestral|for the screen|Stories|Jr|Based on a|Original|Screen)\\b","") as cleanFilm
WITH *,[s IN split(cleanFilm,",") WHERE trim(s) <> "" | trim(s)] as nominees
WHERE any(n in nominees WHERE size(split(n," ")) <> 2)
RETURN data.Film, nominees;

Then we could use our cleaned up data to not only create the movie from the “Name” column but also all nominees from that free-form “Film” column. We used the name of the award as relationship-type between nominee and movie.

LOAD CSV WITH HEADERS FROM "https://dl.dropboxusercontent.com/u/14493611/blog/data/oscars-1926-2015.csv" AS data

// do not use the Actress and Actor awards
WITH data WHERE NOT data.Award STARTS WITH "Act"

MERGE (y:Year {value:data.Year})
MERGE (c:Category {name: data.Award})
MERGE (m:Movie {title: data.Name})

// create the Nomination node, set the full data as attributes
CREATE (a:Nomination) SET a = data

// create relationships from the nomination to the other nodes
CREATE (c)<-[:IN_CATEGORY]-(a)-[:IN_YEAR]->(y)
CREATE (a)-[:FOR_MOVIE]->(m)

FOREACH (winner IN CASE data.Winner WHEN "1" THEN [a] ELSE [] END | SET winner:Award)

WITH *, apoc.text.regreplace(apoc.text.regreplace(apoc.text.regreplace(data.Film,"(?i)([;&]|\\b(and|aka)\\b)",","),"[^A-Za-z0-9, ]?",""),
     "(?i)\\b(Third|Second|Producers?|Story|Written|Adaptation|Art|Score|Direction|Production|Set|Decoration|Sound|Department|"+
     "Co-Producer|Design|Director|(English )?Lyrics?|Music|Special|Audible|Musical|Screenplay|Dialogue|by|Interior|Head of|"+
     "Score|Photographic|Effects|in collaboration|with|Visual|Ballet|Made by|with( the)? cooperation of|"+
     "Photography|Thematic|Adapted|Song|Orchestral|for the screen|Stories|Jr|Based on a|Original|Screen)\\b","") as cleanFilm
WITH *,[s IN split(cleanFilm,",") WHERE trim(s) <> "" | trim(s)] as nominees

UNWIND nominees as nominee

MERGE (n:Nominee {name: trim(nominee)})
CREATE (a)-[:NOMINATED]->(n)
WITH * WHERE NOT EXISTS ((n)-->(m))
CALL apoc.create.relationship(n,toUpper(data.Award),{},m) YIELD rel
RETURN count(*);

Now the meta-model of our graph looks like this:

oscars kaggle meta model.jpg

As we created the most recent academy data ourselves, it’s less detailed but cleaner, so we can just use one statement:

LOAD CSV WITH HEADERS FROM "https://dl.dropboxusercontent.com/u/14493611/blog/data/oscars-2017.csv" AS data

MERGE (y:Year {value:data.Year})
MERGE (c:Category {name: data.Award})

// create the Nomination node, set the full data as attributes
CREATE (a:Nomination) SET a = data

// create relationships from the nomination to the other nodes
CREATE (c)<-[:IN_CATEGORY]-(a)-[:IN_YEAR]->(y)
WITH *
UNWIND split(data.Name,";") as name
MERGE (n:Nominee {name: name})
MERGE (a)-[:NOMINATED]->(n)
WITH *
UNWIND split(data.Film,";") as film

MERGE (m:Movie {title: film})
MERGE (a)-[:FOR_MOVIE]->(m)

// if there is a "1" in the Winner column, also set the :Award label
FOREACH (winner IN CASE data.Winner WHEN "1" THEN [a] ELSE [] END | SET winner:Award)
WITH * WHERE NOT EXISTS ((n)-->(m))
call apoc.create.relationship(n,toUpper(data.Award),{},m) yield rel
RETURN count(*);

Now we have the full data in our graph, which you can access on this instance (user/pass:oscars) of the brand new Neo4j Sandbox .

More Queries

Most different categories
Which generes win most oscars
Which countries win most oscars (except the US)
prediction for 2017

Most Oscars

MATCH (n:Nominee)<-[:NOMINATED]-(a:Award)
RETURN n.name as winner, count(*) as count, collect(distinct a.Award) as awards
ORDER BY count DESC LIMIT 10;

“winner”	“count”	“awards”

“Walt Disney”

["Short Subject (Cartoon)","Short Subject (Two Reel)","Documentary (Feature)","Documentary (Short Subject)","Short Subject (Live Action)"]

“Italy”

["Foreign Language Film"]

“Metro-Goldwyn-Mayer”

["Outstanding Production","Short Subject (Two Reel)","Short Subject (One Reel)","Short Subject (Cartoon)","Outstanding Motion Picture","Special Effects"]

“Cedric Gibbons”

["Art Direction","Art Direction (Black and White)","Art Direction (Color)"]

“Alfred Newman”

["Music (Scoring)","Music (Music Score of a Dramatic or Comedy Picture)","Music (Scoring of a Musical Picture)","Music (Scoring of Music, Adaptation or Treatment)"]

“France”

["Foreign Language Film"]

“Dennis Muren”

["Special Achievement Award (Visual Effects)","Visual Effects"]

“Edith Head”

["Costume Design (Black and White)","Costume Design (Color)","Costume Design"]

“Edwin B. Willis”

["Art Direction (Color)","Art Direction (Black and White)"]

“Metro-Goldwyn-Mayer Studio”

["Sound Recording","Sound"]

MATCH (n:Nominee)<-[:NOMINATED]-(a:Nomination)-[:IN_YEAR]->(y)
WHERE y.value > "1950"
RETURN n.name as winner, count(*) as count, collect(distinct a.Award) as awards
ORDER BY count DESC LIMIT 10;

Most different Awards

MATCH (n:Nominee)<-[:NOMINATED]-(a:Award)
RETURN n.name as winner, collect(distinct split(a.Award," (")[0]) as awards
ORDER BY size(awards) DESC LIMIT 5;

“winner”	“awards”

“Metro-Goldwyn-Mayer”

["Outstanding Production","Short Subject","Outstanding Motion Picture","Special Effects"]

“Billy Wilder”

["Directing","Writing","Best Motion Picture"]

“James Cameron”

["Directing","Film Editing","Best Picture"]

“Ethan Coen”

["Writing","Directing","Best Picture"]

“Robert Wise”

["Directing","Best Motion Picture","Best Picture"

Which Countries got the most Oscars

MATCH (a:Award)-[:FOR_MOVIE]->(m:Movie)
UNWIND split(m.Country,“, “) as country
RETURN country, count(*) as c
ORDER BY c DESC LIMIT 5;

"country"	"c"

"USA"

1673

"UK"

366

"France"

122

"Germany"

"Italy"

Which Genres won most

MATCH (a:Award)-[:FOR_MOVIE]->(m:Movie)-[:IN_GENRE]->(g:Genre)
RETURN g.name, count(*) as c
ORDER BY c DESC LIMIT 5;

"g.name"	"c"

"Drama"

1316

"Romance"

414

"Comedy"

376

"Biography"

347

"Adventure"

279

oscars genres.jpg

How often did someone win directly or indirectly

E.g. via the movie they participated in.

MATCH (y:Year)<-[:IN_YEAR]-(a:Award)-[*1..2]-(n:Nominee)
WHERE y.value > "1950"
AND none(term IN ["Studio","N/A","Metro-Goldwyn-Mayer"] WHERE n.name contains term)
RETURN n.name, count(distinct a) as c, collect(distinct a.Award)
ORDER BY c DESC LIMIT 10;

"n.name"	"c"	"collect(distinct a.Award)"

"John Williams"

"54"

["Costume Design (Black and White)","Cinematography","Music (Scoring: Adaptation and Original Song Score)","Sound","Film Editing","Music (Original Score)","Special Achievement Award (Visual Effects)","Costume Design","Art Direction","Visual Effects","Special Achievement Award (Sound Effects Editing)","Sound Effects Editing","Directing","Writing (Screenplay Based on Material Previously Produced or Published)","Best Picture","Production Design"]

"Christopher Boyes"

"36"

["Writing (Story and Screenplay)","Special Achievement Award (Visual Effects)","Art Direction","Cinematography","Costume Design","Visual Effects","Sound","Sound Effects Editing","Best Picture","Directing","Film Editing","Music (Original Dramatic Score)","Sound Editing","Sound Mixing","Makeup","Music (Original Score)","Writing (Adapted Screenplay)"]

"Andy Nelson"

"30"

["Art Direction (Color)","Costume Design (Color)","Writing (Screenplay Based on Material Previously Produced or Published)","Sound Effects Editing","Sound","Film Editing","Cinematography","Directing","Art Direction","Best Picture","Music (Original Score)","Makeup","Costume Design","Sound Mixing","Production Design","Makeup and Hairstyling","Actor in a Supporting Role","Visual Effects"]

…

Directors directly and indirectly

MATCH (y:Year)<-[:IN_YEAR]-(a:Award)-[*1..2]-(n:Nominee)
WHERE y.value > "1950"
AND none(term IN ["Studio","N/A","Metro-Goldwyn-Mayer"] WHERE n.name contains term)
AND exists( (n)-[:DIRECTED]->() )
RETURN n.name, count(distinct a) as c, collect(distinct a.Award)
ORDER BY c DESC LIMIT 10;

"n.name"	"c"

"Steven Spielberg"

"Robert Redford"

"Gary Rydstrom"

"Fred Zinnemann"

"David Lean"

"Ralph Fiennes"

"Paul Newman"

"Marlon Brando"

"James Cameron"

"Peter Jackson"

The Unlucky ones, most often nominated with least wins

MATCH (a:Nomination)-[:NOMINATED]->(n:Nominee)
WITH n.name as name, sum(case when a:Award then 1 else 0 end) as won, count(*) as total
RETURN name, won, total, 100*won/total as percent
ORDER BY percent asc, total desc LIMIT 10;

"name"	"won"	"total"	"percent"

"Kevin O’Connell"

"Greg P. Russell"

"Alex North"

"Roland Anderson"

"Thomas Newman"

"Republic Studio"

"Roger Deakins"

"George Folsey"

"Rick Kline"

"Walter Lantz"

"name"	"won"	"total"	"percent"

"Metro-Goldwyn-Mayer"

"Walt Disney"

"John Williams"

"Alfred Newman"

"Warner Brothers"

"Cedric Gibbons"