BishopPhillips Consulting - Python Refactoring With Generators

Python Techniques - Refactoring With Generators

Refactoring With Generators.

Author: Jonathan Bishop
AI Revolution

This article is an illustration of techniques in advanced Python coding and a demonstration of a process of refactoring code. It was inspired by a question I was asked about a relatively simple problem and the little journey I embarked upon to answer that question, which I thought might be educational for other Python developers, or at least vaguely amusing. On this journey we collapse 18+ lines of code down to 1 and achieve greater efficiency & robustness in the process.

The problem presented was to write a simple python routine to scan a directory tree and locate every instance of a directory of a specific name within the tree. The directory name - we'll call "Test2" could appear anywhere and any number of times within the tree including nested within itself. The challenge was to make the code as brief as possible. Not an especially difficult task but one that supplied a surprising range of opportunities to demonstrate python techniques in a simple problem scope. You will see that we start with a cumbersome largish solution and end with essentially just a single line of code. Through the journey we will explore python operating system interfacing capabilities, loops, lists, comprehensions and generators and demonstrate the process of code refactoring to achieve a better and better solution.

1. Establishing the Context.

Python provides a number of libraries for interfacing with the operating system. Primarily os, platform and pathlib. In this example we will need os and pathlib. These provide an almost operating system agnostic view of a machine and particularly disk environment. I say "almost" because there are still some differences between say a Windows environment and a Linux environment which may necessitate the use of the platform library instead of the os library for some things ("uname" - I'm looking at you!), but they will not impact us here.

Ok. So, let's begin. For all these code snippets we will need to import from these libraries as follows:

Imports:


import os
from pathlib import Path

That's all for the imports. The rest of what we need is built into Python. The code I am demonstrating was written in Python 3.12.1, although it should work fine in anything over 3.8 (in fact probably earlier, but there is at least one thing used later that was introduced in 3.8 - the "walrus assignment"). The os library provides a bunch of directory and file management tools we need while the pathlib provides us with the Path object which is a much more convenient way to manipulate paths than string mangling.

In all the code snippets we are going to assume that the "find directory" function is being called from a code block as follows:

Standard calling block:


#Print the directory path for each instance of the "findTarget" directory
#found starting from the "findStartPath" directory

findStartPath = "."
findTarget = "Test2"    
for found in findDirXXX(findStartPath, findTarget):
    if found:
       print(found)

This code block essentially expects findDir() to return an iterable object (like a list) that can be stepped through by the for loop one item at a time and each item loaded into "found", which if not empty is printed on the screen.

In the example we will set the findStartPath directory to start at the current working directory "." and the target directory to find as "Test2", but there is nothing special about these decisions. In the test case there are several instances of Test2 both as leaf and branch directories, and embedded within each other. We want the findDirXXX routines to return a list (or other iterable) that contains a series of strings representing the path specification for each instance of "Test2" located in the directory tree below the "findStartPath". In each case below we replace the "findDirXXX() call with the appropriate version of the findDir() routine. The above code block will print an output to the screen similar to the following:


D:\Documents\VSCodeProj\PYTeaching2\Test2
D:\Documents\VSCodeProj\PYTeaching2\Jim\Alice\Test2
D:\Documents\VSCodeProj\PYTeaching2\Test\Test2
D:\Documents\VSCodeProj\PYTeaching2\Test2\Test2

OR

.\Test2
.\Jim\Alice\Test2
.\Test\Test2
.\Test2\Test2

2. Solution 1: Changing the Current Working Directory

So let's assume that you have just learned about the OS library, and you now know how to change the working directory "os.chdir()" and get the current working directory "os.getcwd()". You would also have learned about "os.listdir()" to list the contents of a directory as a list, but we will go a step further and say you have also looked into the pathlib and found "path.iterdir()" which iterates over a directory returning a pathobject for each file of directory found and path.is_dir() to test if a path object is a directory. You can now create a recursive directory tree walking solution thus:

First attempt:


#Recurse through the directory tree changing the CWD as necessary
def isDirInDir(findTarget):
    result = []
    p = Path()
    dirs = [str(x) for x in p.iterdir() if x.is_dir()]
    if not dirs:
        return result
    if findTarget in dirs:
        result.append( os.getcwd() + '\\' + findTarget )
    for dir in dirs:
        os.chdir(dir)
        result += isDirInDir(findTarget)
        os.chdir("..")
    return result

#Find dir by moving CWD
def findDirWDM(findStartPath, findTarget):
    if findStartPath not in ["","."]:
        changeDir(findStartPath)
    return isDirInDir(findTarget)

The starting point is the "findDirWDM() method which takes the starting path, changes the current working directory (CWD) to that place if it is not told to use the existing current directory indicated by setting the findStartPath param to "" or ".". The routine then just calls isDirInDir() with the target to start the recursive walk through the directory tree. Recursion means that the routine (in this case isDirInDir()) calls itself up again at some point in the body of the code of that routine. In this case it calls itself up again when it finds a directory in the list of directories that exist in the current directory being scanned.

isDirInDir() begins by creating a Path object set to the current working directory. Because the current working directory has been moved to the desired starting directory the path object "p" can be initialised with a Path() object initialiser with no arguments. The Path object provides a large number of cabilities for manipulating paths, including a directory iterator which generates a stream of path objects representing each file or directory in the current Path object.

isDirInDir() uses this to scan the current directory using p.iterdir() to build a list of directories contained in the current directory and then drills into each directory in the list by changing the current working directory to that directory and recursively calling itself up again and then changing back to the starting current directory for that instance of the isDirInDir() method when the recursive call finishes. As it travels through the tree it adds entries it finds to the list it is building.

So there are a couple of things to understand here. The first is what we call a "list comprehension". This is a special Python syntactical structure that builds a list [] using an embedded for loop:


dirs = [str(x) for x in p.iterdir() if x.is_dir()]

# Which "translates" to:

dirs = []                   # Make an empty list called "dirs"
for x in p.iterdir():       # for each item in the file/directory list
    if x.is_dir():          # if x is a directory
       dirs.append(str(x))  # add it to the list as a string

A list comprehension is designated by the [] surrounding the expression, and has the form:


list = [ thing to put in the list - this many times from this source - if this condition is true]

In English the list comprehension reads as: "make a list of strings called dirs where each string is a directory name and comes from the list of files & directories held at the path located by p."

As this list of directories is always within the current working directory, the list members will always be the "local" path objects (ie. just one directory in from the current working directory) and thus we can simply check if our target name is anywhere in the list of strings. If it is, we have found an instance of the findTarget directory and can add it to our result list.

We need, however, the full directory path for our list, or the list would just be a list of the directory name for which we were searching. Not a lot of use! We can then assemble a full path/directory string to add to our result list by getting the current working directory string and joining it to the target string. This full path string can then be appended to the result list.

Lastly the routine then just steps through every directory in the directory list, recursing into each directory (ie. calling itself up again with the next directory as the new working directory) and repeating the above process, making in turn each of these directories the current working directory.

The last technique issue to note here is the difference between these two list manipulations:


result.append( os.getcwd() + '\\' + findTarget )

and

result += isDirInDir( ... )

The first adds an item to the list (append) the second joins two lists. As isDirInDir() returns a list, if the result was appended to result we would get a list of lists, not one list.

3. Solution 2. Static Working Directory

Solution 1. is a little childish in that it moves the working directory around to generate a directory list to search, and in the event of a crash code it could leave the current working directory in an unpredictable place. We can do better by using the Path object directly and avoiding changing the current working directory.


#Find dir with absolute path spec, non moving
def findDirANM(startPath, findTarget):
    result = []
    try:
        p = Path(startPath)
        dirs = [x for x in p.iterdir() if x.is_dir()]
    except Exception as e:
        print(f"Error: {e}")
        return []

    for dir in dirs:
        if dir.name == findTarget:
            result.append(str(dir.resolve()))
        result += findDirANM(dir, findTarget)
    return result

This solution has the advantage of being considerably simpler. As we do not need to initialise the current working directory first we are able to use a single recursing routine. Instead of starting the Path object at the current working directory, we pass it a directory path as an argument (which can be just "." if we want the current working directory as the start) and then use a list comprehension as before to create a list of directories. There is one little additional difference here in that we do not convert that list to a list of strings (as in the first solution) but rather leave it as a list of path objects.

This well help avoid one of the potential pitfalls of this approach in that instead of dealing with single file/directory names in our generated directory list we are creating a list of partial or full directory paths and so some string wrangling might be required to split the directory names off the path in order to compare them with our target name.

Having the list of directories as a list of path objects allows us to make use of some extra capabilities, one of which is the path.name attribute which strips the last part of the path off from the path and allows comparison to our target string. This means we don't have to do string manipulation to strip off the last n characters from the string to test if it equals our target. We can also append the result of dir.resolve() which resolves the current partial path to a full path string, instead of manually assembling the path string from components.

So this solution is neater than Solution 1. and makes better use of the capabilities of the pathlib library and the Path object, but the solution is still a recursive walk through the directory tree.

There is another capability of the os library that we haven't yet considered that allows us to simplify the solution even further and eliminate the recursive call entirely.

4. Solution 3: Using the tree walk.

The os library has a method called "walk()" which generates a series of tuples representing each directory below a given starting directory. A tuple in is an ordered, immutable collection of items. It is represented by the structure (a, b, c) in Python, where a, b & c are some arbitrary data objects collected together. Think: "a row of a table". In the case of "os.walk()", the tuple created for each directory is (root, dirs, files) where root is the path spec for the directory being scanned, dirs is a list of strings representing the directories contained in that root directory and files is a list of strings representing the file names contained in that root directory. Note: here we are talking about a root of the tuple, not "THE" root of the disk. "Current directory" would be another way to refer to it.

The other important thing to note here is that os.walk() does not return a list, it is a particular kind of Python object called a "generator". A generator is an object that generates its results one item at a time. In this case that is one tuple at a time. It then yields control back to the calling routine so the calling routine can process that result and then call the generator again when it is ready for the next result. We describe this as "lazy evaluation" in that results are generated only when and if required, ie. on demand.


#Find dir using tree walk, non recursing
def findDirTree1(startPath, findTarget):
    result=[]
    try:
        for root, dirs, files in os.walk(top=startPath):
            for dir in dirs:
                full_path = Path(root) / dir
                if (full_path).name == findTarget:
                    result.append(str(full_path.resolve()))
    except Exception as e:
            print(f"Error: {e}")
            return []
    return result

Looking at the code snippet we note there is no recursive calling up of the findDirTree1() routine as a simple for - loop to process the generated results of the os.walk() routine is all that is needed. It will walk through all the nested directories below the startPath and generate a huge sequence of tuples: (root, dirs, files), returning one tuple at a time with each invocation. For each tuple at each directory we get a list of directories in dirs. We discard the list of files in files (as we are not interested in them in this problem domain). Note that the walk() routine takes a parameter top=startPath to set at which directory it should start.

With each tuple we then step through the members of the dirs list. The dirs list is a list of strings (not path objects), so to make use of the Path object capabilities we need to create a Path() object from the root and dir in the tuple:


full_path = Path(root) / dir

Now that expression is a little odd. The "/" is not a division in this case but an overridden "/" operator changed to work as a path string assembler so it essentially attaches the dir name to the directory root to make a single path "string" with the correct operating system specific path separator. The result is that "full_path" is now a Path object to which we can apply the various Path() methods we have already encountered:


path.name (to check the directory name)

path.resolve (to generate a complete path specification down to the disk root).

The Path object can be turned back into an ordinary string with the str() function.

So this solution is considerably simpler than our previous solutions, but we are not finished yet. Whenever you are dealing with nested "for loops" in Python it is worth considering whether you can turn that into a list comprehension as the language is set up to optimise comprehesions. That will be our next solution.

5. Solution 4: Using a list comprehension

The next solution is essentially the previous solution rewritten as a list comprehension. It is literally the previous "nested for loops" rewritten as :


#Find dir using tree walk, non recurse, list comprehension
def findDirTree2(startPath, findTarget):
    try:
        return [str(p.resolve()) for root, dirs, files in os.walk(top=startPath) for dir in dirs if (p := Path(root) / dir).name == findTarget ]
        
    except Exception as e:
            print(f"Error: {e}")
            return []

Most of this solution is code error trapping and handling as all the real work is now being done in a single line. That line is a list comprehension. We explained list comprehensions in the first section of this article. This one generates a list that is returned in the result.

The first for loop collects the tuples and feeds the directory list into the second for loop which is filtered by the "if ..." expression (called a guard statement) and the individual selected items are then resolved to full paths, converted to strings and added to the list.

The one thing to note here is that we are using a "walrus assignment" ":=" which is Python operator that allows us to assign a value to a variable within an expression (in this case p) and then use that variable's value elsewhere. A normal assignment "=" requires a full statement and can't be used in an exporession (a sub-part of a statement). There are two spots in this comprehension where we could (in theory) do that assignment:

str(p.resolve()) - which would then be written as: str((p := Path(root) / dir).resolve())
(p := Path(root) / dir).name

It might seem logical to do it in step 1, because we literally read that first when reading the comprehension, but Python processes the for loop and then the guard statement first then does the selector part of the comprehension last. The guard statement is the bit that begins "if ...." in a list comprehension. The walrus assignment expression must be executed before the variable assigned is used by the selector, so we need to but the expression in the second place since Python executes that first. If using Python < 3.8, the Walrus assignment can't be used but can be dropped and the "Path(root) / dir" expression repeated in each of step 1 & 2 (without the assignment).

We consider this to be a neat "Pythonic" solution, but it is not yet as efficient as it could be. Here we are generating a list using a generator (os.walk()), yet ultimately we are going to iterate over this list and we could just as easily iterate over a generator. Why convert to a list, and then proceed to generate (iterate) over each list member individually when we already have them individually at this stage? This is kind of stupid as we already have a generator in os.walk() and with one simple change we could preserve the advantages of generators. We will consider that in the next section.

6. Solution 5 & 6: Using Generators

In this section we will look at two ways of doing generators for this problem domain. There are two ways to create generators in Python. One is a generator function the specifically yields with each iteration, the other is a generator expression, which also yields on each iteration but does not have an explicit yield statement. A list comprehension can be converted into a generator expression by changing the [] to ().

You will recall that we noted that generators are "lazy" in that they only generate one iteration's worth of values with each invocation and then wait to be instructed to generate the next. Hence, compared with a list which consumes all the memory and processing needed to make a full list before handing the result back to the calling routine, a generator takes up only as much memory as is required for one list item - not the whole list - and can be abandoned at any point in the generation sequence. Other than that they look to the code as if they were lists (or tuples, or whatever other iterable is being generated) and can be iterated over just like a list.

We use that last point in the first generator solution:


def findDirTree3(startPath, findTarget):
    dirGen =  (str(p.resolve()) for root, dirs, files in os.walk(top=startPath) for dir in dirs if (p := Path(root) / dir).name == findTarget )
    try:
        for found in dirGen:
            yield found
        
    except Exception as e:
            print(f"Error: {e}")
            return

In this solution we assign what used to be the list comprehension to a generator expression and hold it in "dirGen". At this point all that is happening is that a generator object has been created.

We then use that object in a for loop to iterate over it and explicitly yield with each directory found. This makes the findTree3() function a generator function itself which can be used in a for-loop as an iterable with each yielded value being printed or otherwise processed.

The advantage of this approach over the next approach is error handling. In this solution we can trap errors at each iteration step inside the findDirTree3 generator. Note that the generator object creation step is NOT included in the try - except error handler.

The "for found in dirGen" loop will return to the calling routine with each "yield found" statement execution - so every time around the loop - returning the value of found as the iteration instance. When there are no more values the iteration-end event is automatically generated and the receiving routine's iteration loop ends.

The next solution is even simpler:


def findDirTree4(startPath, findTarget):
    yield from  (str(p.resolve()) for root, dirs, files in os.walk(top=startPath) for dir in dirs if (p := Path(root) / dir).name == findTarget )

Here the solution has literally been collapsed into one line of code. The same single generator expression as in the previous solution but now attached directly to a "yield from" expression. It works the same way as the previous solution, except that all error handling code must be in the calling routine because a "try - except" clause around the generator expression here would literally be about the formation of the generator object not its execution.

As a parting note, if you really wanted to have a list of the directories instead of the generator at the end of the day, you would simply cast the findTree4() result to a list as follows:


my_dir_list = list(findDirTree4(".", "Test2"))

7. Conclusion

We have travelled a little refactoring journey here from 18 lines of code down to 1 while solving the same basic problem with each version. The last solution is far better than the first:

It has no impact on the working directory
It is more memory efficient - using only as much as is required for each step of the process
It makes good use of the facilities in the standard libraries
Fewer lines of code means less opportunities for bugs.

This is a relatively simple problem that demonstrates both the power of Python as a language and the process and advantages of refactoring your code. In this journey we have covered importing and using libraries, for-loops, nested for-loops, function calls, parameters, lists, tuples, generators, generator expressions & yield statements, list comprehensions, in-expression assignment (using the walrus operator), guard statements, os & pathlib disk and path methods, lazy evaluation, error handling, the Path() "/" operator, the "in" membership operator and iteration.

Programming in Python

References

Programming in Python
Python Method & Function Overloading
Python Type Checking
Python List Comprehension
Python Dictionary Comprehension
Python Iterators
Python Generators
Python Closures
Python Maps and Filters
Python Refactoring With Generators
Python Decorators