pymdb.utils module

Module containing various utility functions used within other PyMDb modules.

These functions are internal helpers and are not intended to be used outside of the PyMDb package.

append_filename_to_path

pymdb.utils.append_filename_to_path(path, filename)

Append a filename to a system file path.

This function appends the filename to the file path using the path separators appropriate for the path string.

Parameters:
  • path (str) – The system file path.
  • filename (str) – The filename to append.
Returns:

The filename correctly appended to the file path.

Return type:

str
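
As an illustration only, the behaviour described above can be approximated with the standard library. The name append_filename_sketch is hypothetical and this is not PyMDb's own implementation; it assumes os.path.join is an acceptable stand-in for the separator handling.

```python
import os


def append_filename_sketch(path, filename):
    """Hypothetical stand-in: join a filename onto a directory path."""
    # os.path.join adds the platform separator only when path lacks a trailing one.
    return os.path.join(path, filename)


joined = append_filename_sketch("datasets", "title.basics.tsv")
```

Delegating to os.path.join avoids hand-written checks for trailing separators.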

is_float

pymdb.utils.is_float(f)

Check if a variable can be converted to a float.

Parameters: f – The object to check.
Returns: True if the object can be converted to a float, False otherwise.
Return type: bool

is_int

pymdb.utils.is_int(i)

Check if a variable can be converted to an int.

Parameters: i – The object to check.
Returns: True if the object can be converted to an int, False otherwise.
Return type: bool
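
A minimal sketch of the convertibility checks described for is_float and is_int, assuming they amount to attempting the conversion and catching the failure. The function names below are hypothetical and the real implementations may differ.

```python
def is_float_sketch(f):
    """Hypothetical check: can f be converted with float()?"""
    try:
        float(f)
    except (TypeError, ValueError):
        # TypeError covers None and other unsupported types,
        # ValueError covers strings such as "abc".
        return False
    return True


def is_int_sketch(i):
    """Hypothetical check: can i be converted with int()?"""
    try:
        int(i)
    except (TypeError, ValueError):
        return False
    return True
```

Under this approach a string such as "3.5" passes the float check but fails the int check, because int() does not parse decimal strings.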

get_company_id

pymdb.utils.get_company_id(node)

Find the IMDb company ID within a selectolax Node.

Expects the ID to be within the Node’s “href” attribute.

Parameters: node (Node) – A Node containing the ID.
Returns: The IMDb company ID.
Return type: str

get_name_id

pymdb.utils.get_name_id(node)

Find the IMDb name ID within a selectolax Node.

Expects the ID to be within the Node’s “href” attribute.

Parameters: node (Node) – A Node containing the ID.
Returns: The IMDb name ID.
Return type: str

get_title_id

pymdb.utils.get_title_id(node)

Find the IMDb title ID within a selectolax Node.

Expects the ID to be within the Node’s “href” attribute.

Parameters: node (Node) – A Node containing the ID.
Returns: The IMDb title ID.
Return type: str
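
The three ID helpers above all read an ID out of an “href” value. The sketch below shows one way that lookup could work on the href string itself, assuming the usual IMDb ID prefixes (“co” for companies, “nm” for names, “tt” for titles). The function find_imdb_id_sketch, its kind argument, and the choice to return None when nothing matches are all hypothetical and not taken from PyMDb.

```python
import re

# Assumed ID shapes: a two-letter prefix followed by digits.
_ID_PATTERNS = {
    "company": re.compile(r"co\d+"),
    "name": re.compile(r"nm\d+"),
    "title": re.compile(r"tt\d+"),
}


def find_imdb_id_sketch(href, kind):
    """Hypothetical helper: pull an IMDb ID of the given kind out of an href."""
    if not href:
        return None
    match = _ID_PATTERNS[kind].search(href)
    return match.group(0) if match else None


title_id = find_imdb_id_sketch("/title/tt0111161/?ref_=nv_sr_1", "title")
```

With selectolax, the href string would typically be read from the Node's attributes mapping before being passed to a helper like this.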

get_category

pymdb.utils.get_category(node)

Gets the category value from a selectolax Node.

Grabs the value from the Node’s “onclick” attribute.

Parameters: node (Node) – A Node containing the “onclick” attribute.
Returns: The category.
Return type: str

get_ref_marker

pymdb.utils.get_ref_marker(node)

Gets the ref marker value from a selectolax Node.

Grabs the value from the Node’s “onclick” attribute.

Parameters: node (Node) – A Node containing the “onclick” attribute.
Returns: The ref marker.
Return type: str

get_episode_info

pymdb.utils.get_episode_info(node)

Gets the episode count, episode year start, and episode year end for an actor.

Gets the episode information for an actor’s credit within an IMDb TV series. The information is expected in the format “episode count episodes, episode year start-episode year end”. Credits with a single episode or a single year are also handled. For example:

  • 124 episodes, 1999-2013
  • 2 episodes, 2010
  • 1 episode
Parameters: node (Node) – A Node containing the episode information.
Returns: The episode count, episode start year, and episode end year, or None if a value is not found.
Return type: (int, int, int)
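
The expected text format can be parsed with a single regular expression. The following is a sketch under assumptions: it works on the already-extracted text rather than a Node, the name parse_episode_info_sketch is hypothetical, and returning None for each missing value (including the end year when only one year is given) is a guess at the behaviour described above.

```python
import re

_EPISODE_RE = re.compile(
    r"(?P<count>\d+)\s+episodes?"        # "124 episodes" or "1 episode"
    r"(?:,\s*(?P<start>\d{4})"           # optional ", 1999"
    r"(?:\s*-\s*(?P<end>\d{4}))?)?"      # optional "-2013"
)


def parse_episode_info_sketch(text):
    """Hypothetical parser returning (count, start_year, end_year)."""
    match = _EPISODE_RE.search(text or "")
    if match is None:
        return None, None, None
    # Convert each captured group to int, leaving missing groups as None.
    count, start, end = (
        int(value) if value is not None else None
        for value in match.group("count", "start", "end")
    )
    return count, start, end
```

Applied to the three examples listed above, this yields (124, 1999, 2013), (2, 2010, None), and (1, None, None).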

gunzip_file

pymdb.utils.gunzip_file(infile, outfile=None, delete_infile=False)

Unzips a gzip file and returns the unzipped filename.

Unzips the given gzipped file into the specified outfile, or into a default outfile name. If the infile’s filename ends with “.gz”, the outfile will be the same filename with the gzip extension removed. The function can also delete the gzipped infile afterwards.

Parameters:
  • infile (str) – The gzipped file’s filename.
  • outfile (str, optional) – The filename to unzip the infile to, or None to use the default filename.
  • delete_infile (bool, optional) – Determine if the gzipped infile should be deleted after it is unzipped to the outfile.
Returns:

The outfile’s filename, which is useful when the default filename was used.

Return type:

str
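
A rough sketch of the unzip flow described above, using only the standard library. The name gunzip_file_sketch is hypothetical, and the fallback suffix used when the infile does not end in “.gz” is an assumption, since the documentation does not say what happens in that case.

```python
import gzip
import os
import shutil
import tempfile


def gunzip_file_sketch(infile, outfile=None, delete_infile=False):
    """Hypothetical stand-in: decompress infile and return the outfile name."""
    if outfile is None:
        # Default name: drop a trailing ".gz"; the ".out" fallback is an assumption.
        outfile = infile[:-3] if infile.endswith(".gz") else infile + ".out"
    with gzip.open(infile, "rb") as src, open(outfile, "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream the data instead of loading it all
    if delete_infile:
        os.remove(infile)
    return outfile


# Round-trip demonstration inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    gz_path = os.path.join(tmp, "sample.tsv.gz")
    with gzip.open(gz_path, "wb") as fh:
        fh.write(b"tconst\ttitleType\n")
    out_path = gunzip_file_sketch(gz_path, delete_infile=True)
    with open(out_path, "rb") as fh:
        round_trip = fh.read()
    infile_removed = not os.path.exists(gz_path)
    out_name = os.path.basename(out_path)
```

Returning the outfile name lets a caller that relied on the default discover where the data was written.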

preprocess_list

pymdb.utils.preprocess_list(lst)

Process a row of data from the IMDb datasets.

Replaces every “\N” value in the row, which the IMDb datasets use to mark a missing field, with None.

Parameters: lst (list of str) – A list of strings to process.
Returns: The list of strings with every “\N” string replaced by None.
Return type: list of str
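
The replacement can be sketched as a list comprehension. The name preprocess_list_sketch and the sample row are illustrative only. Note that the marker must be written as "\\N" in Python source, because "\N" on its own starts a named Unicode escape.

```python
def preprocess_list_sketch(lst):
    """Hypothetical stand-in: map the literal string \\N to None."""
    return [None if value == "\\N" else value for value in lst]


# A made-up tab-separated row with one missing field.
row = "tt0000001\tshort\t\\N\t1894".split("\t")
cleaned = preprocess_list_sketch(row)
```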

remove_tags

pymdb.utils.remove_tags(s, tag)

Removes the specified opening and closing tags of the given type.

This function removes only the tags themselves, not the content between them. For example, pass “td” to remove all table cell tags.

Parameters:
  • s (str) – The HTML to parse.
  • tag (str) – The tag to be removed.
Returns:

A string with all of the given tags removed, but other HTML information intact.

Return type:

str
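
One way to strip tags while keeping their content is a regular expression that matches both the opening and closing forms. This is a sketch, not PyMDb's implementation; remove_tags_sketch is a hypothetical name, and matching tags with attributes and ignoring case are assumptions.

```python
import re


def remove_tags_sketch(s, tag):
    """Hypothetical stand-in: drop <tag ...> and </tag>, keep inner content."""
    # \b stops "td" from also matching longer tag names that start with "td".
    pattern = re.compile(r"</?{}\b[^>]*>".format(re.escape(tag)), re.IGNORECASE)
    return pattern.sub("", s)


stripped = remove_tags_sketch("<tr><td class='a'>Drama</td><td>1999</td></tr>", "td")
```

Here the table row tags survive and only the cell tags are removed.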

remove_tags_and_content

pymdb.utils.remove_tags_and_content(s, tag)

Removes all of the specified tags from the string including their children.

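Greedily finds an opening and a closing tag of the specified type and removes both, along with all content between the two. Note: because the match is greedy, this function is not intended for input containing multiple sibling nodes of that tag with content between them, as that content would be removed too.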

Parameters:
  • s (str) – The HTML to parse.
  • tag (str) – The tag to be removed.
Returns:

A string with all of the specified tags and their content removed.

Return type:

str
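
The greedy behaviour and its caveat can be illustrated with a short sketch. The name remove_tags_and_content_sketch is hypothetical, and the regular expression is an assumption about how such a greedy match might be written.

```python
import re


def remove_tags_and_content_sketch(s, tag):
    """Hypothetical stand-in: remove a tag pair and everything between."""
    name = re.escape(tag)
    # The greedy ".*" runs from the first opening tag to the LAST closing tag.
    pattern = re.compile(
        r"<{0}\b[^>]*>.*</{0}>".format(name), re.DOTALL | re.IGNORECASE
    )
    return pattern.sub("", s)


single = remove_tags_and_content_sketch("Tom Hanks <span>(voice)</span>", "span")
siblings = remove_tags_and_content_sketch("<i>a</i> keep <i>b</i> end", "i")
```

In the second call the text between the two sibling nodes is removed as well, which is the limitation the note above warns about.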

split_by_br

pymdb.utils.split_by_br(s)

Split a string by <br> tags.

Splits by replacing each <br> tag with a “\t” character and then splitting.

Parameters: s (str) – A string containing <br> tags.
Returns: A list of strings split around the <br> tags.
Return type: list
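
The replace-then-split approach described above might look like the following sketch. The name split_by_br_sketch is hypothetical, and accepting the <br/> and <br /> spellings is an assumption beyond what the documentation states.

```python
import re


def split_by_br_sketch(s):
    """Hypothetical stand-in: turn each <br> into a tab, then split on tabs."""
    with_tabs = re.sub(r"<br\s*/?>", "\t", s, flags=re.IGNORECASE)
    return with_tabs.split("\t")


parts = split_by_br_sketch("Line one<br>Line two<br/>Line three")
```

A side effect of using a tab as the intermediate delimiter is that any tab already present in the input also becomes a split point.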

trim_year

pymdb.utils.trim_year(year)

Trim Roman numerals from year values.

IMDb differentiates movies of the same title and the same year with the format: YYYY/<Roman numeral>. This function removes the Roman numeral and returns just the year value.

Parameters: year (str) – The year and Roman numeral combination.
Returns: The year with the Roman numeral removed, or None if year is None.
Return type: str
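
Given the YYYY/<Roman numeral> format, the trimming can be sketched by splitting on the slash. The name trim_year_sketch is hypothetical and the actual implementation may take a different route.

```python
def trim_year_sketch(year):
    """Hypothetical stand-in: keep only the part before the slash."""
    if year is None:
        return None
    return year.split("/", 1)[0].strip()
```

A value such as "2010/II" becomes "2010", while a plain "1999" is returned unchanged.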

is_money_string

pymdb.utils.is_money_string(s)

Determine if a string is in a money format.

Determines if the string represents a monetary value, for example: $123,456,789.

Parameters: s (str) – The monetary amount to check.
Returns: True if the string represents a monetary value, False otherwise.
Return type: bool

trim_money_string

pymdb.utils.trim_money_string(s)

Trims excess characters from a monetary value.

Only keeps the digits within a monetary value, such as trimming $123,456 to 123456. Trims dollar signs and commas.

Parameters: s (str) – The monetary amount to trim.
Returns: The same monetary amount with excess characters removed.
Return type: str

get_denomination

pymdb.utils.get_denomination(s)

Returns the monetary denomination for the given monetary value.

Checks if the monetary value has one of the supported denominations. In the case it is a US dollar ($), the dollar sign character is replaced with “USD”. Currently supported denominations:

  • GBP
  • USD ($)
Parameters: s (str) – The monetary amount to retrieve the denomination from.
Returns: The denomination type, or None if not a monetary value or supported denomination.
Return type: str
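
The three money helpers (is_money_string, trim_money_string, get_denomination) can be sketched together. This is illustrative only: the function names are hypothetical, and the sketch assumes a monetary string is a “$” or “GBP” prefix followed by digits with optional comma grouping, which may be narrower or wider than what PyMDb accepts.

```python
import re

# Assumed shape: "$" or "GBP", optional space, then digits with optional commas.
_MONEY_RE = re.compile(r"^(?P<denom>\$|GBP)\s?(?P<amount>\d{1,3}(?:,\d{3})*|\d+)$")


def is_money_string_sketch(s):
    """Hypothetical check for the assumed money format."""
    return bool(s) and _MONEY_RE.match(s.strip()) is not None


def trim_money_string_sketch(s):
    """Hypothetical trim: keep only the digits."""
    return re.sub(r"\D", "", s)


def get_denomination_sketch(s):
    """Hypothetical lookup: "$" maps to "USD", "GBP" is returned as-is."""
    match = _MONEY_RE.match(s.strip()) if s else None
    if match is None:
        return None
    denom = match.group("denom")
    return "USD" if denom == "$" else denom
```

Sharing one pattern between the check and the denomination lookup keeps the two functions in agreement about what counts as a monetary value.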

to_bool

pymdb.utils.to_bool(b)

Convert a variable to a boolean type.

Parameters: b – The object to convert.
Returns: The boolean representation of the object.
Return type: bool

to_datetime

pymdb.utils.to_datetime(d)

Convert a variable to a datetime object.

Checks various formats used in IMDb to convert the variable to a datetime object under those formats. The formats include:

  • %d %B %Y
  • %Y
  • %Y-%m-%d
Parameters: d (str) – A string to convert to a datetime object.
Returns: A datetime object that was represented by the string, or None if d is None.
Return type: datetime
Raises: ValueError – If the string could not be converted.
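
Trying each of the listed formats in turn can be sketched with datetime.strptime. The name to_datetime_sketch and the error message are hypothetical, and the order in which the formats are tried is an assumption.

```python
from datetime import datetime

# The three formats listed above.
_FORMATS = ("%d %B %Y", "%Y", "%Y-%m-%d")


def to_datetime_sketch(d):
    """Hypothetical stand-in: parse d with the first format that fits."""
    if d is None:
        return None
    for fmt in _FORMATS:
        try:
            return datetime.strptime(d, fmt)
        except ValueError:
            continue  # this format did not match, try the next one
    raise ValueError("Unrecognized date format: {!r}".format(d))


release = to_datetime_sketch("14 October 1994")
```

Note that %B matches month names in the current locale, so a string such as "14 October 1994" parses only when English month names are in effect.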