Generator

In Python, a generator is a function that returns an iterator that produces a sequence of values when iterated over.

Generators are useful when we want to produce a large sequence of values, but we don’t want to store all of them in memory at once.

Understanding Generators

Generator functions look and act just like regular functions, but with one defining characteristic. Generator functions use the Python yield keyword instead of return.

For example: generating infinite sequence:

def infinite_sequence():
    num = 0
    while True:
        yield num
        num += 1

This looks like a typical function definition, except for the Python yield statement and the code that follows it.

  • yield indicates where a value is sent back to the caller, but unlike return, you don’t exit the function afterward.
  • Instead, the state of the function is remembered. That way, when next() is called on a generator object (either explicitly or implicitly within a for loop), the previously yielded variable num is incremented, and then yielded again.

Building generators

Define as function

Similar to defining a normal function, we can define a generator function using the def keyword, but instead of the return statement we use the yield statement.

def generator_name(arg):
    # statements
    yield something

When the generator function is called, it does not execute the function body immediately. Instead, it returns a generator object that can be iterated over to produce the values.

Define using comprehension

Like list comprehensions, generator expressions allow you to quickly create a generator object in just a few lines of code.

(expression for item in iterable)

The generator expression creates a generator object that produces the values of expression for each item in the iterable, one at a time, when iterated over.

Generator comprehensions are also useful in the same cases where list comprehensions are used, with an added benefit: you can create them without building and holding the entire object in memory before iteration.

Example: squaring numbers

num_squared_list = [num ** 2 for num in range(5)] # list comprehension
num_squared_generator = (num ** 2 for num in range(5)) # generator comprehension
>>> num_squared_list
[0, 1, 4, 9, 16]

>>> num_squared_generator
<generator object <genexpr> at 0x7f9c604c8430>

Profiling generator performance

Memory:

>>> import sys
>>> nums_squared_lc = [i ** 2 for i in range(10000)]
>>> sys.getsizeof(nums_squared_lc) # in bytes
87624
>>> nums_squared_gc = (i ** 2 for i in range(10000))
>>> print(sys.getsizeof(nums_squared_gc)) # in bytes
120

The list is over 700 times larger than the generator object!

Speed:

>>> import cProfile
>>> cProfile.run('sum([i * 2 for i in range(10000)])')
         5 function calls in 0.001 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001    0.001    0.001 <string>:1(<listcomp>)
        1    0.000    0.000    0.001    0.001 <string>:1(<module>)
        1    0.000    0.000    0.001    0.001 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.sum}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}


>>> cProfile.run('sum((i * 2 for i in range(10000)))')
         10005 function calls in 0.003 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    10001    0.002    0.000    0.002    0.000 <string>:1(<genexpr>)
        1    0.000    0.000    0.003    0.003 <string>:1(<module>)
        1    0.000    0.000    0.003    0.003 {built-in method builtins.exec}
        1    0.001    0.001    0.003    0.003 {built-in method builtins.sum}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

Summing across all values in the list comprehension took about a third of the time as summing across the generator.

If the list is smaller than the running machine’s available memory, then list comprehensions can be faster to evaluate than the equivalent generator expression.

Memory matters → Use generator comprehension.

Runtime matters → Use list comprehension.

Understanding the yield Statement

On the whole, yield is a fairly simple statement. Its primary job is to control the flow of a generator function in a way that’s similar to return statements.

  • When the yield statement is hit/arrived, the program suspends function execution and returns the yielded value to the caller. (In contrast, return stops function execution completely.) When a function is suspended, the state of that function is saved (including variables binding local to the generator, the instruction point, the internal stack, and any exception handling).
  • This allows you to resume function execution whenever you call one of the generator’s methods. In this way, all function evaluation picks back up right after yield.

Example:

def infinite_sequence():
    num = 0
    while True:
        yield num
        num += 1
        yield f"Next number to yield: {num}"
>>> inf_seq_generator = infinite_sequence()
>>> inf_seq_generator
<generator object infinite_sequence at 0x7f9c604c8b30>

# The `inifinite_sequence` runs and arrives line 4,
# the execution is suspended by yield statement. 
# The `num` (0 in this case) is yielded.
>>> next(inf_seq_generator)
0

# The funtion execution picks back up right after `yield num`.
# I.e., the program continues from line 5, `num += 1`. `num` is incremented to 1.
# Then the `yield` statement at line 6 is arrived
# --> It suspends the program and yield "Next number to yield: 1"
>>> next(inf_seq_generator)
'Next number to yield: 1'

>>> next(inf_seq_generator)
1

>>> next(inf_seq_generator)
'Next number to yield: 2'

>>> next(inf_seq_generator)
2

Use of Python Generators

There are several reasons that make generators a powerful implementation.

Easy to Implement

Generators can be implemented in a clear and concise way as compared to their iterator class counterpart.

Memory Efficient

A normal function to return a sequence will create the entire sequence in memory before returning the result. This is an overkill, if the number of items in the sequence is very large.

In contrast, Generator implementation of such sequences is memory friendly and is preferred since it only produces one item at a time.

Represent Infinite Stream

Generators are excellent mediums to represent an infinite stream of data. Infinite streams cannot be stored in memory, and since generators produce only one item at a time, they can represent an infinite stream of data.

E.g., the following generator function can generate all the even numbers (at least in theory).

def all_even():
    n = 0
    while True:
        yield n
        n += 2

Pipelining Generators

Multiple generators can be used to pipeline a series of operations.

E.g., suppose we have a generator that produces the numbers in the Fibonacci series. And we have another generator for squaring numbers. If we want to find out the sum of squares of numbers in the Fibonacci series, we can do it in the following way by pipelining the output of generator functions together.

def fibonacci_numbers(nums):
    x, y = 0, 1
    for _ in range(nums):
        x, y = y, x+y
        yield x

def square(nums):
    for num in nums:
        yield num**2

print(sum(square(fibonacci_numbers(10))))

# Output: 4895

Reference

Previous
Next