pymdb.scraper module

Module containing the PyMDbScraper class.

PyMDbScraper

class pymdb.scraper.PyMDbScraper(rate_limit=1000)

Scrapes various information from IMDb web pages.

Contains functions for various IMDb pages and scrapes information into Python classes.

The rate limit defaults to 1000 ms.
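
The documentation above does not say how the rate limit is enforced. As a minimal sketch, assuming it means a minimum delay between consecutive requests, a limiter could look like the following. `MinIntervalLimiter`, its `wait` method, and the injectable `clock` and `sleep` parameters are hypothetical names used for illustration only; they are not part of pymdb.

```python
import time


class MinIntervalLimiter:
    """Hypothetical helper: enforce a minimum gap between calls.

    Assumption: `rate_limit_ms` is the minimum number of milliseconds
    that must pass between two consecutive requests.
    """

    def __init__(self, rate_limit_ms=1000, clock=time.monotonic, sleep=time.sleep):
        self.interval = rate_limit_ms / 1000.0  # convert ms to seconds
        self._clock = clock    # injectable so the class can be tested
        self._sleep = sleep    # without really sleeping
        self._last = None      # time of the previous call, if any

    def wait(self):
        """Block until at least `interval` seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A caller would invoke `wait()` immediately before each HTTP request, so the first request goes out at once and later ones are spaced apart.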

_get_tree(request)

Get the selectolax HTML tree given a request.

Parameters:request (str) – The HTTP GET request.
Returns:The HTML tree from the GET request.
Return type:HTMLTree
Raises:HTTPError – If an unsuccessful response was returned.
get_company(company_id)

Scrapes all titles a company is credited for on IMDb.

Scrapes all titles listed under a company on IMDb by going through each page in IMDb’s company search. The search only provides the year(s) the company was involved with each title and any notes IMDb lists for that title.

Parameters:

company_id (str) – The company’s ID used by IMDb prefixed with co.

Yields:

CompanyScrape – An object for each title the company is credited for.

Raises:
  • HTTPError – If a request failed.
  • InvalidCompanyId – If an invalid company ID was given.
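
The page-by-page behavior described for get_company can be sketched as a generator. This is an illustration under stated assumptions, not the library's implementation: `iter_company_titles` and the `fetch_page` callable are hypothetical, an empty page is assumed to mark the end of the results, and `ValueError` stands in for InvalidCompanyId.

```python
def iter_company_titles(company_id, fetch_page):
    """Yield every item from consecutive result pages until a page is empty.

    `fetch_page(company_id, page_number)` is a hypothetical callable that
    returns the list of results found on one page.
    """
    if not company_id.startswith("co"):
        # Stand-in for the InvalidCompanyId error named above.
        raise ValueError(f"invalid company ID: {company_id!r}")
    page = 1
    while True:
        results = fetch_page(company_id, page)
        if not results:
            return  # assumption: an empty page means there are no more results
        yield from results
        page += 1
```

Because this is a generator, nothing runs until the caller starts iterating, so the ID check and any request errors surface on the first `next()` call rather than when the function is called.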
get_company_credits(title_id)

Gets all companies credited for a title.

Scrapes a title’s company credits page on IMDb to find information for each company that was credited. Each company creates a new CompanyCreditScrape object.

Parameters:title_id (str) – The title’s ID used by IMDb prefixed with tt.
Yields:CompanyCreditScrape – An object for each company.
Raises:HTTPError – If the request failed.
get_full_cast(title_id, include_episodes=False)

Scrapes the full cast of actors for a specified title.

Will scrape the full cast of actors for a title, each into its own CreditScrape object. If the optional argument include_episodes is set and the title is a TV series, each episode an actor is in is scraped as well.

Parameters:
  • title_id (str) – The title’s ID used by IMDb prefixed with tt.
  • include_episodes (bool, optional) – Specify if individual episodes of a TV series should also be scraped.
Yields:

CreditScrape – An object for each cast member in the title.

Raises:

HTTPError – If a request failed.

get_full_credits(title_id, include_episodes=False)

Scrapes the full list of credited people for a title.

Will scrape all the cast and crew for a title by combining the results of get_full_cast and get_full_crew into a single generator. If the optional argument include_episodes is set and the title is a TV series, each episode an actor is in is scraped as well.

Parameters:
  • title_id (str) – The title’s ID used by IMDb prefixed with tt.
  • include_episodes (bool, optional) – Specify if individual episodes of a TV series should also be scraped.
Yields:

CreditScrape – An object for each credited cast or crew member in the title.

Raises:

HTTPError – If the request failed.
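
Combining two generators into one can be sketched with `itertools.chain`. Whether the library uses `chain` internally is an assumption; the `cast`, `crew`, and `full_credits` functions and the sample tuples below are hypothetical stand-ins for get_full_cast, get_full_crew, and get_full_credits.

```python
from itertools import chain


def cast(title_id):
    """Stand-in for get_full_cast: yields made-up cast credits."""
    yield ("actor", "Person A")
    yield ("actress", "Person B")


def crew(title_id):
    """Stand-in for get_full_crew: yields made-up crew credits."""
    yield ("director", "Person C")


def full_credits(title_id):
    """Yield all cast credits first, then all crew credits, lazily."""
    return chain(cast(title_id), crew(title_id))


credits = list(full_credits("tt0000000"))
```

With `chain`, the crew generator is not started until the cast generator is exhausted, so no work is done for credits the caller never consumes.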

get_full_credits_as_dict(title_id, include_episodes=False)

Scrapes the full list of credited people for a title into a dictionary.

Builds a dictionary that maps each job_title to a list of CreditScrape objects, using the results of the get_full_credits method to gather the objects. If the optional argument include_episodes is set and the title is a TV series, each episode an actor is in is scraped as well.

Parameters:
  • title_id (str) – The title’s ID used by IMDb prefixed with tt.
  • include_episodes (bool, optional) – Specify if individual episodes of a TV series should also be scraped.
Returns:

A dictionary where each key is a str of a job_title and the value is a list of CreditScrape objects whose job_title value is the same as the key.

Return type:

dict mapping str to list of CreditScrape

Raises:

HTTPError – If the request failed.
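
The grouping described above can be sketched with `collections.defaultdict`. The `Credit` dataclass is a hypothetical stand-in for CreditScrape that carries only the job_title attribute mentioned in the text plus a made-up `name` field; `group_by_job_title` is likewise illustrative and not the library's code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Credit:
    """Hypothetical stand-in for CreditScrape."""
    name: str
    job_title: str


def group_by_job_title(credits):
    """Return a dict mapping each job_title to the list of credits that have it."""
    grouped = defaultdict(list)
    for credit in credits:
        grouped[credit.job_title].append(credit)
    return dict(grouped)  # plain dict, so missing keys raise KeyError


sample = [
    Credit("Person A", "actor"),
    Credit("Person B", "director"),
    Credit("Person C", "actor"),
]
by_job = group_by_job_title(sample)
```

Here `by_job["actor"]` holds the two actor credits in their original order, and `by_job["director"]` holds the single director credit.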

get_full_crew(title_id)

Scrapes the full list of credited crew members for a title, not including actors.

Will scrape all the credited crew members of a title, without the actors. For example, this will include all directors, writers, producers, cinematographers, etc.

Parameters:title_id (str) – The title’s ID used by IMDb prefixed with tt.
Yields:CreditScrape – An object for each credited crew member in the title.
Raises:HTTPError – If the request failed.
get_name(name_id, include_known_for_titles=False)

Scrapes detailed information from a person’s personal IMDb web page.

Will scrape detailed information on a person’s IMDb bio page into a new NameScrape object.

Parameters:
  • name_id (str) – The person’s ID used by IMDb prefixed with nm.
  • include_known_for_titles (bool, optional) – Determines if a second request should be sent to get the known-for titles on a person’s default IMDb page.
Returns:

An object with the person’s information.

Return type:

NameScrape

Raises:

HTTPError – If the request failed.
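
The methods in this module take IDs prefixed with tt (titles), nm (names), or co (companies). A caller may want to check an ID before sending a request. The sketch below is a hypothetical pre-check and not part of pymdb; it assumes an ID is the two-letter prefix followed by one or more digits, which the documentation above does not state.

```python
import re

# Assumption: an ID is its two-letter prefix followed by one or more digits.
_ID_PATTERNS = {
    "title": re.compile(r"tt\d+"),
    "name": re.compile(r"nm\d+"),
    "company": re.compile(r"co\d+"),
}


def looks_like_id(value, kind):
    """Return True if `value` has the prefix-plus-digits shape for `kind`.

    `kind` must be one of "title", "name", or "company".
    """
    pattern = _ID_PATTERNS[kind]
    return pattern.fullmatch(value) is not None
```

A check like this only validates the shape of the string; it cannot tell whether the ID actually exists on IMDb.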

get_name_credits(name_id, include_episodes=False)

Scrapes all title credits a person is included in.

Scrapes the full filmography from a person’s IMDb page to get each title they are credited in and the category that credit is under. If the optional argument include_episodes is set and the title is a TV series, each episode an actor is in is scraped as well. Each credit is yielded as a new NameCreditScrape object.

Parameters:
  • name_id (str) – The person’s ID used by IMDb prefixed with nm.
  • include_episodes (bool, optional) – Specify if individual episodes of a TV series should also be scraped.
Yields:

NameCreditScrape – An object for each credit in the person’s filmography.

Raises:

HTTPError – If a request failed.

get_search_results(keyword)

Gets search results for a given keyword.

Uses IMDb’s GET requests for searches to retrieve a JSON response containing the search result information. A SearchResult object is created for each result that is either a name or a title, and the objects are collected into a list. If the result is a name, the object is a SearchResultName. If it is a title, the object is a SearchResultTitle.

Parameters:keyword (str) – The keyword to search for. IMDb caps keywords at 20 characters.
Returns:A list of SearchResultName and/or SearchResultTitle objects.
Return type:list of SearchResult
Raises:HTTPError – If the request failed.
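
Since the returned list can mix names and titles, a caller will often split it by type. The sketch below uses hypothetical stand-in classes named after the ones above; the subclass relationship to SearchResult and the `label` attribute are assumptions made for illustration.

```python
class SearchResult:
    """Hypothetical stand-in base class."""

    def __init__(self, label):
        self.label = label  # made-up attribute for the example


class SearchResultName(SearchResult):
    """Stand-in for a name (person) result."""


class SearchResultTitle(SearchResult):
    """Stand-in for a title result."""


def split_results(results):
    """Return (names, titles), preserving the order within each group."""
    names = [r for r in results if isinstance(r, SearchResultName)]
    titles = [r for r in results if isinstance(r, SearchResultTitle)]
    return names, titles


mixed = [
    SearchResultTitle("Title X"),
    SearchResultName("Person A"),
    SearchResultTitle("Title Y"),
]
names, titles = split_results(mixed)
```
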
get_tech_specs(title_id)

Gets information for all tech specs for a title.

Uses a title’s technical web page on IMDb to scrape all technical specifications listed. A new TitleTechSpecScrape object is created for the tech specs.

Parameters:title_id (str) – The title’s ID used by IMDb prefixed with tt.
Returns:An object containing the information.
Return type:TitleTechSpecScrape
Raises:HTTPError – If the request failed.
get_title(title_id, include_taglines=False)

Scrapes information from the IMDb web page for the specified title.

Uses the given title ID to request the IMDb page for the title and scrapes the page’s information into a new TitleScrape object. An optional argument include_taglines allows an additional request to be made to gather all taglines IMDb has for the title.

Parameters:
  • title_id (str) – The title’s ID used by IMDb prefixed with tt.
  • include_taglines (bool, optional) – Specify if an extra request should be made to get all the taglines for the title.
Returns:

An object containing the page’s information.

Return type:

TitleScrape

Raises:

HTTPError – If the request failed.