Scraping movie data from The Movie Database and other sources on the web
Published by Kaustubh Saha on July 17th, 2019
The Movie Database (TMDb) :
The Movie Database (TMDb) is a popular, user editable database for movies and TV shows. They have a REST based discovery API which allows us to pass a bunch of parameters and get a JSON array of movies matching those parameters as the response.
To access the API, we need to register for an API key and pass it as a request param with every request that we send to TMDb.
The discovery API currently returns at most 20 results per request. However, it supports pagination: we can pass a page parameter with every request. If we need more than 20 results, we can send another request with the page parameter incremented by 1 and TMDb will return the next 20 results.
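The page-by-page retrieval described above can be sketched as a small loop. This is only an illustration under stated assumptions: collect_results and fetch_page are hypothetical names, fetch_page stands for any function that takes a page number and returns the parsed JSON for that page, and the sketch assumes each response carries a 'results' list and a 'total_pages' count.

```python
def collect_results(fetch_page, max_pages=5):
    """Gather results page by page until the last page or max_pages is reached."""
    movies = []
    page = 1
    while page <= max_pages:
        payload = fetch_page(page)
        movies.extend(payload.get('results', []))
        # Stop once the last page is consumed (or if 'total_pages' is missing).
        if page >= payload.get('total_pages', page):
            break
        page += 1
    return movies


# Stand-in for a real HTTP call (hypothetical): three pages of two movies each.
def fake_fetch(page):
    return {'page': page, 'total_pages': 3,
            'results': [{'id': page * 10 + i} for i in range(2)]}


movies = collect_results(fake_fetch)
print(len(movies))
```

In real use, fetch_page would wrap the requests.get call and add the page number to the query string; max_pages keeps a broad query from consuming the whole result set.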
To access the TMDb API we can use the tmdbsimple module, which is available on PyPI and can be installed with pip. It's not compulsory to use tmdbsimple, but it makes the client-side code easier to write, as the module takes care of parsing the JSON responses for us.
!pip install tmdbsimple
Instead of having to send the API key every time as a request param, we can set it once at the client level if we are using the tmdbsimple client. Once a key is associated with the tmdb client, it is automatically sent as a param in every subsequent request sent to TMDb using that client:
import locale
import requests
import tmdbsimple as tmdb

tmdb.API_KEY = 'API KEY HERE'
The following request returns the top 20 movies for 2019 in descending order of revenue :
url = 'https://api.themoviedb.org/3/discover/movie?api_key=8341de77e3a24593226752d83720f88b&primary_release_year=2019&sort_by=revenue.desc&include_all_movies=true'
response = requests.get(url)
response_json = response.json()
If we look at each movie representation in the response, it looks like this :
{'vote_count': 7720,'id': 299534,'video': False,'vote_average': 8.4,'title': 'Avengers: Endgame','popularity': 102.517,'poster_path': '/or06FN3Dka5tukK1e9sl16pB3iy.jpg','original_language': 'en','original_title': 'Avengers: Endgame','genre_ids': [12, 878, 28],'backdrop_path': '/7RyHsO4yDXtBv1zUU3mTpHeQ0d5.jpg','adult': False,'overview': "After the devastating events of Avengers: Infinity War, the universe is in ruins due to the efforts of the Mad Titan, Thanos. With the help of remaining allies, the Avengers must assemble once more in order to undo Thanos' actions and restore order to the universe once and for all, no matter what consequences may be in store.",'release_date': '2019-04-24'}
This is just very basic information about the movie. To get more, we need to extract the id from this response and send another request to TMDb for that movie's details :
results = response_json['results']
result = results[0]
id = result['id']
url = 'https://api.themoviedb.org/3/movie/' + str(id) + '?api_key=8341de77e3a24593226752d83720f88b'
response = requests.get(url)
response_json = response.json()
If we look at a sample response, it looks like this :
{'adult': False,'backdrop_path': '/7RyHsO4yDXtBv1zUU3mTpHeQ0d5.jpg','belongs_to_collection': {'id': 86311,'name': 'The Avengers Collection','poster_path': '/yFSIUVTCvgYrpalUktulvk3Gi5Y.jpg','backdrop_path': '/zuW6fOiusv4X9nnW3paHGfXcSll.jpg'},'budget': 356000000,'genres': [{'id': 12,'name': 'Adventure'},{'id': 878,'name': 'Science Fiction'},{'id': 28,'name': 'Action'}],'homepage': 'https://www.marvel.com/movies/avengers-endgame','id': 299534,'imdb_id': 'tt4154796','original_language': 'en','original_title': 'Avengers: Endgame','overview': "After the devastating events of Avengers: Infinity War, the universe is in ruins due to the efforts of the Mad Titan, Thanos. With the help of remaining allies, the Avengers must assemble once more in order to undo Thanos' actions and restore order to the universe once and for all, no matter what consequences may be in store.",'popularity': 102.517,'poster_path': '/or06FN3Dka5tukK1e9sl16pB3iy.jpg','production_companies': [{'id': 420,'logo_path': '/hUzeosd33nzE5MCNsZxCGEKTXaQ.png','name': 'Marvel Studios','origin_country': 'US'}],'production_countries': [{'iso_3166_1': 'US','name': 'United States of America'}],'release_date': '2019-04-24','revenue': 2781545151,'runtime': 181,'spoken_languages': [{'iso_639_1': 'en','name': 'English'},{'iso_639_1': 'ja','name': '日本語'}],'status': 'Released','tagline': 'Part of the journey is the end.','title': 'Avengers: Endgame','video': False,'vote_average': 8.4,'vote_count': 7726}
From this response we can get a lot of important information, like whether the movie belongs to a franchise, the budget, the revenue, the runtime, the genres etc. The imdb_id field holds the unique id for that movie in IMDb. This allows us to extract data separately from IMDb and TMDb and merge them easily.
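Merging on the shared id could look like the following minimal sketch. The function name merge_by_imdb_id is hypothetical, the records are plain dicts, and the IMDb-side record is a placeholder made up for illustration rather than real IMDb output.

```python
def merge_by_imdb_id(tmdb_records, imdb_records):
    """Join two lists of dicts on their shared 'imdb_id' value."""
    imdb_index = {rec['imdb_id']: rec for rec in imdb_records}
    merged = []
    for rec in tmdb_records:
        extra = imdb_index.get(rec.get('imdb_id'), {})
        # TMDb values win when both sides define the same key.
        merged.append({**extra, **rec})
    return merged


tmdb_records = [{'imdb_id': 'tt4154796', 'title': 'Avengers: Endgame', 'runtime': 181}]
# Placeholder IMDb-side record, for illustration only.
imdb_records = [{'imdb_id': 'tt4154796', 'keywords': ['sequel']}]

merged = merge_by_imdb_id(tmdb_records, imdb_records)
print(merged[0])
```

TMDb records without a matching IMDb record are kept unchanged, so no movie is lost if one of the sources is missing it.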
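This still doesn't tell us anything about the movie's director or cast. For that we need to make another call to the TMDb credits API using the movie id :
# Assumed values for the constants below (they are not defined in the original snippet)
TMDB_MOVIE_URL_BASE = 'https://api.themoviedb.org/3/movie'
CREDITS = 'credits'
API_KEY_URL_PARAM = 'api_key=' + tmdb.API_KEY
url = TMDB_MOVIE_URL_BASE + '/' + str(id) + '/' + CREDITS + '?' + API_KEY_URL_PARAM
response = requests.get(url)
response_json = response.json()
response_json['cast']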
Here's a sample response for cast :
[ {'cast_id': 55,'character': 'Tony Stark / Iron Man','credit_id': '58700eee9251412ae400238b','gender': 2,'id': 3223,'name': 'Robert Downey Jr.','order': 0,'profile_path': '/1YjdSym1jTG7xjHSI0yGGWEsw5i.jpg'},{'cast_id': 22,'character': 'Steve Rogers / Captain America','credit_id': '585d55afc3a368408600c438','gender': 2,'id': 16828,'name': 'Chris Evans','order': 1,'profile_path': '/7dUkkq1lK593XvOjunlUB11lKm1.jpg'},{'cast_id': 23,'character': 'Thor Odinson','credit_id': '585d55c1c3a368409300cdfd','gender': 2,'id': 74568,'name': 'Chris Hemsworth','order': 2,'profile_path': '/lrhth7yK9p3vy6p7AabDUM1THKl.jpg'},{'cast_id': 24,'character': 'Bruce Banner / The Hulk','credit_id': '585d55ce92514123b300c1f9','gender': 2,'id': 103,'name': 'Mark Ruffalo','order': 3,'profile_path': '/z3dvKqMNDQWk3QLxzumloQVR0pv.jpg'},{'cast_id': 50,'character': 'Natasha Romanoff / Black Widow','credit_id': '58700e6b9251412aef0023df','gender': 1,'id': 1245,'name': 'Scarlett Johansson','order': 4,'profile_path': '/tHMgW7Pg0Fg6HmB8Kh8Ixk6yxZw.jpg'},{'cast_id': 51,'character': 'Clint Barton / Hawkeye','credit_id': '58700e789251412af30023ec','gender': 2,'id': 17604,'name': 'Jeremy Renner','order': 5,'profile_path': '/g8gheNEdPSXWH5SnjfjTYWj5ziU.jpg'},{'cast_id': 71,'character': 'James "Rhodey" Rhodes / War Machine','credit_id': '59ee7ac9925141245f031612','gender': 2,'id': 1896,'name': 'Don Cheadle','order': 6,'profile_path': '/b1EVJWdFn7a75qVYJgwO87W2TJU.jpg'},...{'cast_id': 464,'character': 'Student','credit_id': '5d2c92df3acd207ca3d57fb4','gender': 1,'id': 1653352,'name': 'Faith Logan','order': 93,'profile_path': '/c9sZ4TGDSIRQPEIGqUG23odJriS.jpg'}]
From the same JSON response, we can extract the film crew information as well :
response_json['crew']
Here's a sample response for film crew :
[{'credit_id': '544fe892c3a36802360024e1','department': 'Production','gender': 2,'id': 10850,'job': 'Producer','name': 'Kevin Feige','profile_path': '/kCBqXZ5PT5udYGEj2wfTSFbLMvT.jpg'},{'credit_id': '552523f292514172760015d5','department': 'Directing','gender': 2,'id': 19272,'job': 'Director','name': 'Joe Russo','profile_path': '/679Os4tbY1YsU01KdLhM1NPXNWu.jpg'},{'credit_id': '552523fbc3a3687e080013f1','department': 'Directing','gender': 2,'id': 19271,'job': 'Director','name': 'Anthony Russo','profile_path': '/fIa5wXK7MHAquhefTr3TcnZiYy8.jpg'},{'credit_id': '552f5d719251413f9c0033fb','department': 'Writing','gender': 2,'id': 5551,'job': 'Writer','name': 'Christopher Markus','profile_path': '/tyeKi52yruPdxOkEMKhBKCnkp5V.jpg'},{'credit_id': '552f5d7cc3a3686be20056ec','department': 'Writing','gender': 2,'id': 5552,'job': 'Writer','name': 'Stephen McFeely','profile_path': '/fa8DAGpANcBTTXO4bbNMrCFufmV.jpg'},{'credit_id': '573fc15592514177ef00018b','department': 'Production','gender': 2,'id': 15277,'job': 'Executive Producer','name': 'Jon Favreau','profile_path': '/rOVBKURoR7TrG8MYxTuNUFj3E68.jpg'},...{'credit_id': '5cd59826c3a36869dcfdc04e','department': 'Sound','gender': 0,'id': 2002539,'job': 'Musician','name': 'Victor Pesavento','profile_path': None}]
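To pull out, say, the directors or the top-billed actors from these lists, a small filter is enough. This is a minimal sketch: names_by_job and top_billed are hypothetical helper names, and the sample lists are trimmed copies of the responses shown above.

```python
def names_by_job(crew, job):
    """Return the names of all crew members holding the given job."""
    return [member['name'] for member in crew if member.get('job') == job]


def top_billed(cast, n=3):
    """Return the first n cast names according to the 'order' field."""
    ordered = sorted(cast, key=lambda member: member['order'])
    return [member['name'] for member in ordered[:n]]


# Trimmed copies of the sample responses (only the fields used here).
crew = [
    {'job': 'Producer', 'name': 'Kevin Feige'},
    {'job': 'Director', 'name': 'Joe Russo'},
    {'job': 'Director', 'name': 'Anthony Russo'},
    {'job': 'Writer', 'name': 'Christopher Markus'},
]
cast = [
    {'name': 'Robert Downey Jr.', 'order': 0},
    {'name': 'Chris Evans', 'order': 1},
    {'name': 'Chris Hemsworth', 'order': 2},
]

print(names_by_job(crew, 'Director'))
print(top_billed(cast, n=2))
```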
Currently TMDb imposes a rate limit of 40 requests per 10 seconds per API key, beyond which requests fail with an HTTP 429 status. To avoid crossing the rate limit, it's a good idea to introduce a forced delay after sending every request.
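One way to add such a delay is a small throttle object that sleeps only as long as needed. This is a sketch under assumptions, not part of tmdbsimple or requests: the Throttle class is a hypothetical helper, and the interval is derived from the 40 requests per 10 seconds figure quoted above.

```python
import time


class Throttle:
    """Sleep as needed so that calls to wait() are at least min_interval seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable, so the class can be tested without waiting
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


# 40 requests per 10 seconds works out to one request every 0.25 seconds.
throttle = Throttle(min_interval=10 / 40)
```

Calling throttle.wait() right before each requests.get keeps the request rate under the limit without sleeping longer than necessary when the surrounding code is already slow.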
IMDb :
IMDb is probably the world's most popular and authoritative source for movie, TV and celebrity content. Just as tmdbsimple does for TMDb, the IMDbPY Python module lets us fetch data from IMDb. The module is available on PyPI and can be installed with a pip install :
!pip install IMDbPY
To access information from IMDb, we need to create a client first:
from imdb import IMDb
imdb_client = IMDb()
Every movie listed in IMDb has a unique id beginning with 'tt'. The id also determines the home page url for the movie. For example, the id for the movie 'John Wick : Chapter 3' is tt6146586.
IMDbPY, however, expects us to strip the 'tt' at the beginning when querying data for a movie based on the movie id :
imdb_id = 6146586
imdb_movie = imdb_client.get_movie(imdb_id)
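Since TMDb returns the IMDb id with its 'tt' prefix (for example 'tt4154796'), a small helper can do the stripping before the id is handed to get_movie. This is a minimal sketch; strip_tt is a hypothetical name, and the result is kept as a string so that ids with leading zeros, such as 'tt0105236', are not altered.

```python
def strip_tt(imdb_id):
    """Return the digits of an IMDb id, dropping a leading 'tt' if present."""
    imdb_id = str(imdb_id)
    return imdb_id[2:] if imdb_id.startswith('tt') else imdb_id


print(strip_tt('tt6146586'))
print(strip_tt('tt0105236'))
```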
To extract an attribute value, we simply index the movie object with the attribute name, like this :
imdb_movie['director']
To find the list of valid information sets for a movie (the keys accepted by update), we can use :
print(imdb_client.get_movie_infoset())
Just as in the case of TMDb, by default IMDbPY fetches only a set of basic attributes. If we need extra information, we have to make an additional call to fetch it. For example, if we want the list of plot keywords for the movie 'John Wick : Chapter 3', we have to tell IMDbPY to load the 'keywords' information (for multiple attributes, use a list of names) :
imdb_client.update(imdb_movie, 'keywords')
print(imdb_movie['keywords'])
It returns a list of keywords :
['dog', 'sequel', 'third-part', 'chase', 'assassin', 'desert', 'past-coming-back-to-haunt', 'excommunicado', 'new-york-city-new-york', 'kicked-by-a-horse', 'horseback-riding', 'neo-noir', 'bulletproof-clothing', 'honor', 'library', 'betrayal', 'race-against-time', 'organized-crime', 'hotel', 'survival', 'horse', 'ninja', 'sword', 'character-name-in-title', 'muslim-man', 'arab-man', 'head-scarf', 'favor-returned', 'wanted-man', 'north-africa', 'punishment', 'arabian-girl', 'arabian-desert', 'asian-man', 'extreme-violence', 'digit-in-title', 'colon-in-title', 'open-ended', 'action-hero', 'anti-hero', 'one-man-army', 'tough-guy', 'warrior', 'dark-hero', 'photograph', 'improvised-weapon', 'subtitled-scene', 'bilingualism', 'shot-in-the-head', 'shot-in-the-face', 'shot-in-the-eye', 'shot-in-the-forehead', 'shot-in-the-throat', 'shot-in-the-back', 'shot-in-the-chest', 'shot-in-the-leg', 'shot-to-death', 'stabbed-in-the-head', 'stabbed-in-the-eye', 'stabbed-in-the-face', 'stabbed-in-the-throat', 'stabbed-in-the-neck', 'stabbed-in-the-shoulder', 'stabbed-in-the-arm', 'stabbed-in-the-hand', 'stabbed-in-the-chest', 'stabbed-in-the-back', 'stabbed-in-the-leg', 'stabbed-to-death', 'stabbed-through-the-hand', 'stabbed-through-the-head', 'stabbed-through-the-chest', 'murder', 'death', 'violence', 'brutality', 'mercilessness', 'brutal-violence', 'gory-violence', 'bloody-violence', 'ultraviolence', 'hostage', 'held-at-gunpoint', 'rescue', 'escape', 'evacuation', 'deception', 'double-cross', 'blood', 'blood-splatter', 'gore', 'blood-on-camera-lens', 'blood-on-shirt', 'statue-of-liberty-new-york-city', 'flatiron-building-manhattan-new-york-city', 'new-york-city', 'aerial-shot', '2010s', 'no-opening-credits', 'heavy-rain', 'rainstorm', 'lightning', 'taxi', 'taxi-driver', 'coin', 'necklace', 'engagement-ring', 'severed-finger', 'self-mutilation', 'cauterization', 'raised-middle-finger', 'obscene-finger-gesture', 'fire-poker', 'fireplace', 'branding', 'tattoo', 'f-word', 'profanity', 
'reference-to-elvis-presley', 'shot-in-the-shoulder', 'surgery', 'doctor', 'russian', 'russian-gangster', 'russian-mafia', 'mafia', 'italian', 'man-with-a-ponytail', 'japanese', 'crime-boss', 'mob-boss', 'mafia-boss', 'gangster', 'mobster', 'secret-society', 'casablanca-morocco', 'umbrella', 'times-square-manhattan-new-york-city', 'manhattan-new-york-city', 'bullet-ballet', 'brooklyn-bridge', 'pigeon', 'scar', 'disfigurement', 'neck-breaking', 'throat-slitting', 'gunfight', 'shootout', 'battle', 'gun-battle', 'grand-central-station-manhattan-new-york-city', 'black-comedy', 'old-flame', 'hotel-manager', 'gun-fu', 'cat', 'animal-attack', 'animal-killing', 'moral-dilemma', 'anarchy', 'hall-of-records', 'telephone', 'vault', 'armory', 'rooftop', 'furnace', 'corpse', 'falling-from-height', 'faked-death', 'homeless-man', 'homeless-person', 'cook', 'stylized-violence', 'slide-locked-back', 'bullet-time', 'slow-motion-scene', 'swimming-pool', 'underwater-scene', 'underwater-fight', 'fistfight', 'fight', 'brawl', 'martial-arts', 'mixed-martial-arts', 'hand-to-hand-combat', 'punched-in-the-face', 'punched-in-the-chest', 'kicked-in-the-face', 'kicked-in-the-stomach', 'fighting', 'beating', 'beaten-to-death', 'hit-in-the-crotch', 'fight-to-the-death', 'head-butt', 'opening-action-scene', 'one-against-many', 'long-take', 'mexican-standoff', 'showdown', 'final-showdown', 'machismo', 'lens-flare', 'theater', 'ballerina', 'henchman', 'bodyguard', 'thug', 'two-against-one', 'on-the-run', 'reference-to-baba-yaga', 'neon', 'fragments-of-glass', 'vinyl', 'army', 'crime-lord', 'camel', "reference-to-dante's-inferno", 'suit-and-tie', 'spiral-staircase', 'mercenary', 'massacre', 'body-count', 'woman-fights-a-man', 'woman-kills-a-man', 'hit-by-a-car', 'smoke-grenade', 'power-outage', 'concierge', 'knife', 'hitman', 'contract-killer', 'hired-killer', 'female-assassin', 'female-killer', 'assassination-attempt', 'attempted-murder', 'near-death-experience', 'teleportation', 'disappearance', 
'train-station', 'subway', 'subway-station', 'eccentric', 'gang', 'cell-phone', 'wristwatch', 'broken-arm', 'broken-hand', 'knife-throwing', 'antique-store', 'antique-gun', 'six-shooter', 'revolver', 'pistol', 'machine-gun', 'silencer', 'shotgun', 'flashlight', 'commando-raid', 'assault-rifle', 'pistol-whip', 'knocked-out-with-a-gun-butt', 'knocked-out', 'disarming-someone', 'whisky', 'impalement', 'butterfly-knife', 'dagger', 'threatened-with-a-knife', 'knife-fight', 'axe', 'axe-fight', 'axe-throwing', 'axe-in-the-head', 'sword-fight', 'motorcycle', 'motorcycle-chase', 'foot-chase', 'stable', 'bridge', 'motorcycle-accident', 'product-placement', 'constellation', 'water-bottle', 'spit-take', 'arab', 'revenge', 'redemption', 'professional-hit', 'gang-violence', 'disobeying-orders', 'bus', 'hotel-room', 'ambush', 'statue', 'man-with-glasses', 'shopping-cart', "character-repeating-someone-else's-dialogue", 'hoodie', 'kung-fu', 'karate', 'guard-dog', 'book', 'raining', 'gun', 'throwing-a-knife', 'throwing-knife', 'throwing-a-knife-at-someone', 'horse-motorcycle-chase', 'ballet-dancer', 'tattoo-on-head', 'tattoo-on-back', 'tattooed-man', 'adjudicator', 'dog-shot', 'dog-shot-by-gun', 'dog-attack', 'bottle-of-water', 'severed-ring-finger', 'phone-ringing', 'ringing-phone', 'telephone-call', 'belt', 'male-protagonist', 'gun-violence', 'attacked-by-a-dog', 'finger-cut-off', 'horse-riding', 'travel', 'cutting-off-own-finger', 'bespectacled-male', 'motor-vehicle', 'night-time', 'bladed-weapon', 'actor-reprises-previous-role', 'canine', 'bespectacled-man', 'weapon', 'number-in-title', 'title-spoken-by-character', 'surprise-ending', 'muslim-woman', 'muslim-girl', 'hijab', 'over-the-top', 'american-abroad', 'ninja-magic']
Apart from the information that we get from TMDb, IMDb also stores the certification/rating for the movie in various countries. Each rating is stored as a string in the following format :
country : rating. (Note that the TV rating and the movie screen rating appear as two different entries.) If we want the rating for a particular country, we can extract it this way :
try:
    for c in imdb_movie['certificates']:
        if c.startswith('United States'):
            cert = c.split(":")[1]
            # Skip TV ratings; print only the movie screen rating
            if cert.startswith('TV'):
                continue
            print(cert)
except (KeyError, IndexError):
    # No usable certificate information for this movie
    pass
IMDbPY also gives us information about how different user groups (by gender and age range) rated the same movie. This is available through the 'demographics' attribute :
imdb_client.update(imdb_movie, 'vote details')
imdb_movie['demographics']
(Note that the update key and the attribute name aren't always the same. To know what new attributes were made available through an update key, use imdb_movie.infoset2keys['vote details'] )
Output:
{'imdb users': {'votes': 108786, 'rating': 7.9},'aged under 18': {'votes': 495, 'rating': 8.5},'aged 18 29': {'votes': 31134, 'rating': 8.0},'aged 30 44': {'votes': 29105, 'rating': 7.8},'aged 45 plus': {'votes': 6836, 'rating': 7.6},'males': {'votes': 65052, 'rating': 7.9},'males aged under 18': {'votes': 339, 'rating': 8.5},'males aged 18 29': {'votes': 25926, 'rating': 8.0},'males aged 30 44': {'votes': 24932, 'rating': 7.8},'males aged 45 plus': {'votes': 5690, 'rating': 7.5},'females': {'votes': 8360, 'rating': 7.8},'females aged under 18': {'votes': 28, 'rating': 7.8},'females aged 18 29': {'votes': 3151, 'rating': 7.8},'females aged 30 44': {'votes': 3175, 'rating': 7.8},'females aged 45 plus': {'votes': 903, 'rating': 7.9},'top 1000 voters': {'votes': 200, 'rating': 7.6},'us users': {'votes': 13957, 'rating': 8.0},'non us users': {'votes': 36132, 'rating': 7.7}}
We can also access technical information (like sound mix, camera type, aspect ratio etc) about a movie by using the 'tech' attribute key :
imdb_client.update(imdb_movie, ['technical'])
imdb_movie['tech']
Output:
{'runtime': ['2 hr 11 min (131 min)'],'sound mix': ['Dolby Surround 7.1', 'Dolby Atmos'],'color': ['Color'],'aspect ratio': ['2.39 : 1'],'camera': ['Arri Alexa SXT Plus, Zeiss Master Anamorphic and Master Prime Lenses','Arri Alexa Mini, Zeiss Master Anamorphic and Master Prime Lenses'],'laboratory': ['Company 3 (digital intermediate)'],'negative format': ['Codex ARRIRAW (3.2K)'],'cinematographic process': ['Digital Intermediate (2K) (master format)','Dolby Vision','Master Scope (anamorphic) (source format)'],'printed film format': ['D-Cinema']}
Rotten Tomatoes :
Rotten Tomatoes is predominantly a review aggregation site and doesn't provide much information about a movie apart from reviews. However, Rotten Tomatoes provides a meterScore which, like the IMDb rating, is a widely used indicator of a movie's popularity.
To access information about a movie from the Rotten Tomatoes website, we need to install the rotten_tomatoes_client module :
!pip install rotten_tomatoes_client
RottenTomatoesClient allows us to search for a movie title and returns a list of matches. Here's some sample code (the second param lets us specify the max number of matches returned) :
from rotten_tomatoes_client import RottenTomatoesClient
rt_movies = RottenTomatoesClient.search(term='Inglorious Basterds', limit=5)
Here's the corresponding sample response:
{'actorCount': 0,'actors': [],'criticCount': 0,'critics': [],'franchiseCount': 0,'franchises': [],'movieCount': 1,'movies': [{'name': 'Inglourious Basterds','year': 2009,'url': '/m/inglourious_basterds','image': 'https://resizing.flixster.com/jkLbeolMgD65c28jPnMtOZS-c3I=/fit-in/80x80/v1.bTsxMjk4MjE0OTtqOzE4MTUwOzEyMDA7MTE0NjsxNTI4','meterClass': 'certified_fresh','meterScore': 88,'castItems': [{'name': 'Brad Pitt','url': '/celebrity/brad_pitt'},{'name': 'Mélanie Laurent','url': '/celebrity/melanie-laurent'},{'name': 'Christoph Waltz','url': '/celebrity/christoph_waltz'}],'subline': 'Brad Pitt, Mélanie Laurent, Christoph Waltz, '}],'tvCount': 0,'tvSeries': []}
The only additional information that we have here (and which isn't available in IMDb or TMDb) is the meterScore and the meterClass.
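Picking the meterScore and meterClass out of a search response might look like the sketch below. The helper find_meter_score is a hypothetical name, and rt_result is a trimmed copy of the sample response above. Note that the match is made on the title as Rotten Tomatoes spells it ('Inglourious'), not on the search term.

```python
def find_meter_score(search_result, name, year=None):
    """Return (meterScore, meterClass) for the first movie matching name and year."""
    for movie in search_result.get('movies', []):
        if movie.get('name', '').lower() != name.lower():
            continue
        if year is not None and movie.get('year') != year:
            continue
        return movie.get('meterScore'), movie.get('meterClass')
    return None


# Trimmed copy of the sample response (only the fields used here).
rt_result = {
    'movieCount': 1,
    'movies': [{'name': 'Inglourious Basterds', 'year': 2009,
                'meterClass': 'certified_fresh', 'meterScore': 88}],
}

print(find_meter_score(rt_result, 'Inglourious Basterds', year=2009))
```

Passing the release year helps when several movies share a title, for example remakes.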
The Open Movie Database (OMDb) :
The OMDb API is a RESTful web service to obtain information about a movie. Just like TMDb and Rotten Tomatoes, the responses are in JSON format. OMDb imposes a rate limit per API key just like TMDb, and hence it is generally a good idea to add some intentional delay after every request so that we don't cross the threshold. For non-paying users, the number of hits is limited to 1000 per day per API key.
OMDb has a lot of useful information that we couldn't find in the TMDb dataset. For instance, OMDb contains the movie's ratings, e.g. the Rotten Tomatoes rating, the Metacritic score and the IMDb rating. OMDb also records the box office performance of the movie.
So OMDb clearly has better quality data. However, the 1000 hits per day limit makes it almost unusable for scraping purposes.
To use OMDb we need to install the omdb Python module. It's available on PyPI and can be installed through pip :
!pip install omdb
Here's some sample code for fetching a movie's details from OMDb :
import omdb
from omdb import OMDBClient
omdb_API_KEY = '87c2f7e1'
omdb.set_default('apikey', omdb_API_KEY)
omdb_movie = omdb.get(title='Reservoir Dogs', timeout=1)
Here's the corresponding JSON response from OMDb :
{'title': 'Reservoir Dogs','year': '1992','rated': 'R','released': '02 Sep 1992','runtime': '99 min','genre': 'Crime, Drama, Thriller','director': 'Quentin Tarantino','writer': 'Quentin Tarantino, Quentin Tarantino (background radio dialogue written by), Roger Avary (background radio dialogue written by)','actors': 'Harvey Keitel, Tim Roth, Michael Madsen, Chris Penn','plot': 'When a simple jewelry heist goes horribly wrong, the surviving criminals begin to suspect that one of them is a police informant.','language': 'English','country': 'USA','awards': '12 wins & 22 nominations.','poster': 'https://m.media-amazon.com/images/M/MV5BZmExNmEwYWItYmQzOS00YjA5LTk2MjktZjEyZDE1Y2QxNjA1XkEyXkFqcGdeQXVyMTQxNzMzNDI@._V1_SX300.jpg','ratings': [{'source': 'Internet Movie Database','value': '8.3/10'},{'source': 'Rotten Tomatoes','value': '91%'},{'source': 'Metacritic','value': '79/100'}],'metascore': '79','imdb_rating': '8.3','imdb_votes': '827,608','imdb_id': 'tt0105236','type': 'movie','dvd': '05 Nov 2002','box_office': 'N/A','production': 'Miramax Films','website': 'N/A','response': 'True'}
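The three entries under 'ratings' use different scales ('8.3/10', '91%', '79/100'), so comparing them requires a common scale. The following is a minimal sketch that maps each value to a 0-100 range; rating_to_percent is a hypothetical helper, and the ratings list is copied from the response above.

```python
def rating_to_percent(value):
    """Convert '8.3/10', '79/100' or '91%' to a float on a 0-100 scale, else None."""
    value = value.strip()
    try:
        if value.endswith('%'):
            return float(value[:-1])
        if '/' in value:
            score, scale = value.split('/', 1)
            return round(float(score) / float(scale) * 100, 2)
    except ValueError:
        # Covers placeholders such as 'N/A'
        pass
    return None


ratings = [
    {'source': 'Internet Movie Database', 'value': '8.3/10'},
    {'source': 'Rotten Tomatoes', 'value': '91%'},
    {'source': 'Metacritic', 'value': '79/100'},
]
normalized = {r['source']: rating_to_percent(r['value']) for r in ratings}
print(normalized)
```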
One piece of information that's present only in the OMDb response is award wins and nominations. Unfortunately it's present as free text. However, we can look for keywords like win/wins or nomination/nominations and check the previous word; if the previous word is a numeric value, we treat it as the number of award wins/nominations. Something like this :
def parse_awards(text, suffix):
    # Return the number that precedes `suffix` in the text, or None
    try:
        words = text.lower().split()
        if suffix in words[1:]:
            return int(words[words.index(suffix) - 1])
    except (ValueError, AttributeError):
        pass
    return None

awards = omdb_movie['awards']
award_wins = parse_awards(awards, 'wins')
if award_wins is None:
    award_wins = parse_awards(awards, 'win')
award_nominations = parse_awards(awards, 'nominations')
if award_nominations is None:
    award_nominations = parse_awards(awards, 'nominations.')
if award_nominations is None:
    award_nominations = parse_awards(awards, 'nomination')
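As an alternative to splitting on whitespace, the same idea can be expressed with a regular expression, which handles singular/plural forms and trailing punctuation in one pattern. This is a sketch of a different technique than the one above, not a drop-in copy of it; count_awards is a hypothetical name, and the sample string is the 'awards' value from the response shown earlier.

```python
import re


def count_awards(text, kind):
    """Return the number directly preceding `kind` ('win' or 'nomination'), or None."""
    match = re.search(r'(\d+)\s+' + re.escape(kind) + r's?\b', text.lower())
    return int(match.group(1)) if match else None


awards = '12 wins & 22 nominations.'
print(count_awards(awards, 'win'), count_awards(awards, 'nomination'))
```

Because the pattern requires digits immediately before the keyword, placeholder values such as 'N/A' simply yield None.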