Arun Tejasvi Chaganty about articles blog thoughts research

Excursions in Programming - Generators

“Lazy computation kicks the llama’s ass” - Somebody

The concept of a generator may be very foreign to those who haven’t really explored Python - though it is a general principle (like iterators). The crux of it is to generate the members of a set/relation one at time (which are subsequently returned). If you are not going to store the enumeration, this methods prevents needless use of memory (that may lead the program to be extremely slow). The trade of is the (minute) expense of storing some “state” in the function, and perhaps some additional function call overhead that is rarely a concern.

An Example - Lexicographic generation of strings

Let us begin by taking this example that is used to enumerate over all strings of length n in lexicographic order:

alphabet = ['a','b','c','d','e','f','g','h','i','j','k','l','m',
  'n','o','p','q','r','s','t','u','v','w','x','y','z'] 
def gen_str(n): 
  # Store the index into how "far" each character 
  # in the string of length n has gone 
  idx = [] # Initialise our word array 
  for i in xrange(n):
    idx.append(0) 
  lst = [] 
  while (idx[0] != len(alphabet)-1): 
    str = "" 
    for j in xrange(n): 
      str += alphabet[idx[j]] 
    idx[n-1] += 1 
    # Just like a clock - everytime the idx reaches the end 
    # of the alphabet lst, reset it and increment the 
    # next character 
    for j in xrange(n-1, 0, -1): 
      if (idx[j] == len(alphabet) - 1): 
        idx[j] = 0 
        idx[j-1] +=1 
        else: 
          break 
    lst += str 
  return lst

Now for some back of the envelope calculations: If you ran this code for strings of length 5, you would be generating 265 = 11881376 strings each of length 6 (because of the null char) which amounts to 67Mb. One character more and you’d need 1.8Gb. Considering that you are going to throw this away immediately, clearly this is not the right approach. The “usual” approach to solving this problem is to include your code in the middle of this loop like so:

  ... 
  str = "" 
  for j in xrange(n): 
    str += alphabet[idx[j]] 
    # Computation goes here 
    random_computation(str) 
  # Just like a clock - everytime the idx reaches the end 
      ... 

This is clunky, ugly and non-generic, in other words evil. Another method you might devise is to use a “static” variable - this is of course for those of you who think in C.

Python introduces a beautifully clean solution (as usual) - one word: yield. The yield keyword basically saves the state of this function, and returns to you the computed value. The first time you run the function you get a “generator object”, on which you call the “next()” function to step through (the generator object is also iterable by the way). This modification allows you to compute any length strings in a breeze (much thanks to Kirtika for catching that my original post had “lst += str” and “return lst” in it instead of the yield command. Talk about a screw-up…):

alphabet = ['a','b','c','d','e','f','g','h','i','j','k','l','m', 
  'n', 'o','p','q','r','s','t','u','v','w','x','y','z'] 
def gen_str(n): 
  # Store the index into how "far" each character 
  # in the string of length n has gone 
  idx = [] 
  # Initialise our word array 
  for i in xrange(n):
    idx.append(0) 
  list = [] 
  while (idx[0] != len(alphabet)-1): 
    str = "" 
    for j in xrange(n): 
      str += alphabet[idx[j]] 
    idx[n-1] += 1 
    # Just like a clock - everytime the idx reaches the end 
    # of the alphabet list, reset it and increment the 
    # next character 
    for j in xrange(n-1, 0, -1): 
      if (idx[j] == len(alphabet) - 1): 
        idx[j] = 0 
        idx[j-1] +=1 
        else: 
          break 
    # The line that does it all 
    yield str 
In [2]: g = gen_str(10) 
In [4]: for i in xrange(3): g.next()
Out[4]: 'aaaaaaaaaa' 
Out[4]: 'aaaaaaaaab' 
Out[4]: 'aaaaaaaaac' 

For comparison, this code was used to generate SQL files for an assignment in our databases course. The earlier version took greater than 10 min. to run (before exhausting all my memory mind you). The newer version took about 10 seconds.

Enumeration of Infinite Sets

A key result of the generator model is that it enables you to enumerate/work on potentially infinite sets! For example take this example to enumerate all even numbers (I agree this is a pointless example, but bear with me):

def EvenGenerator(): 
  x = 0 while(1): 
    yield x 
  x += 2
In [1]: g = EvenGenerator() 
In [2]: for i in xrange(10): g.next() 
Out[2]: 0 
Out[2]: 2 
Out[2]: 4 
Out[2]: 6
Out[2]: 8 
Out[2]: 10 
Out[2]: 12 
Out[2]: 14 
Out[2]: 16 
Out[2]: 18

This is something you just can’t do otherwise.

Generators in C++

For the sake for completeness, I’d like to touch upon how you could implement such a model in C++. Being object-oriented, C++ also lends an elegant, though not as succinct an implementation - Functors (aka function objects). This obscure functionality of C++ is terribly powerful, but I’m quite certain that few C++ coders have ever heard or considered it.

class EvenGenerator { 
  private: int val; 
  public: SinXGenerator() { val = 0; } 
  int operator()() { 
    int val_; 
    val_ = val;
    val += 2; 
    return val_; 
  } 
}

Notice the semantics of the operator() function. If you understand operator overloading, it’s pretty clear what’s happening here.

Now, as I mentioned earlier, the core behind a generator is storing the intermediate state of a computation. Python handles this internally. If you wish to do the same in C++, simply define a function object that modifies it’s private variables every time operator() is called, like so:

Every time you instantiate an EvenGenerator, you’ll have a shiny new object with an initial state, and each object is independent.

Generators in C

C does not lend itself naturally to such a paradigm, and any implementation is intrinsically clunky (as usual). While static functions “work”, they do not lead to re-entrant code. Before I move on to what I’ve thought of, an example of a very clunky implementation a “generator” in readline’s autocomplete example (perhaps more details in a future article):

const char* cmds[] { "SUBLIMINAL", "MESSAGE", "HERE" };
char* cmd_generator (const char* text, int state) {
  static int list_idx, len; 
  const char* name; 
  if(state == 0) { 
    list_idx = 0; 
    len = strlen(text); 
  } 
  while (name = cmds[list_idx]) { 
    list_idx++; 
    if (strncasecmp (name, text, len) == 0) 
      return (strdup(name)); 
  } 
  return ((char\*)NULL); 
}

This is still not re-entrant as only one person can use the generator at any given time. A possible implementation to solve this is to store the intermediate state of a computation in a struct, and pass this to the generator function each time:

typedef struct EvenGeneratorState_ { 
  int val; 
} EvenGeneratorState; 
int EvenGenerator(EvenGeneratorState* state) { 
  int val_; 
  val_ = state->val; 
  state->val += 2; 
  return val_; 
}

Sure it’s clunky, but that’s C for you.