Coming in Go 1.16: ReadDir and DirEntry

January 2021

As the primary author of Python’s os.scandir function and PEP 471 (the original proposal for scandir), I was very happy to see that Go is adding something similar in Go 1.16, which is coming out in late February 2021.

In Go it will be called os.ReadDir, and was proposed last September. After more than 100 comments and several tweaks to the design, it was committed by Russ Cox in October. A file system-agnostic version is also included in the new io/fs package as fs.ReadDir.

Why is ReadDir needed?

The short answer is: performance.

When you call the system functions to read directory entries, the OS typically returns the file name and its type (and on Windows, stat information such as file size and last modified time). However, the original Go and Python interfaces threw away this extra information, requiring you to make an additional stat call per entry. System calls aren’t cheap to begin with, and stat may read from disk, or at least the disk cache.

When recursively walking a directory tree, you need to know whether an entry is a file or directory so you know whether to recurse in. So even a simple directory tree traversal required reading the directory entries and stat-ing each entry. But if you use the file type information the OS provides, you can avoid those stat calls and traverse a directory several times as fast (even dozens of times as fast on network file systems). See some benchmarks for the Python version.

Both languages, unfortunately, started with a non-optimal design for reading directories that didn’t allow you to access the type information without extra calls to stat: os.listdir in Python, and ioutil.ReadDir in Go.

I first came up with the idea behind Python’s scandir in 2012, and implemented it for Python 3.5, which came out in 2015 (read more about that process). It’s been improved and added to since: for example, with statement handling and file descriptor support.

For Go, I didn’t have anything to do with the proposal or implementation, apart from a couple of comments suggesting improvements based on my experience with the Python version.

Python vs Go

Let’s have a look at the new “read directory” interfaces, particularly how similar they are in Python and Go.

In Python you call os.scandir(path), and it returns an iterator of os.DirEntry objects, which are as follows:

class DirEntry:
    # This entry's filename.
    name: str

    # This entry's full path: os.path.join(scandir_path, entry.name).
    path: str

    # Return inode or file ID for this entry.
    def inode(self) -> int: ...

    # Return True if this entry is a directory.
    def is_dir(self, follow_symlinks=True) -> bool: ...

    # Return True if this entry is a regular file.
    def is_file(self, follow_symlinks=True) -> bool: ...

    # Return True if this entry is a symbolic link.
    def is_symlink(self) -> bool: ...

    # Return stat information for this entry.
    def stat(self, follow_symlinks=True) -> stat_result: ...

Accessing the name and path attributes will never raise exceptions, but the method calls may raise OSError, depending on operating system and file system, and whether the entry is a symbolic link or not. For example, on Linux, stat always performs a system call, and hence may raise an exception, but the is_X methods usually do not.

In Go you call os.ReadDir(path), and it returns a slice of os.DirEntry objects, which look like this:

type DirEntry interface {
    // Returns the name of this entry's file (or subdirectory).
    Name() string

    // Reports whether the entry describes a directory.
    IsDir() bool

    // Returns the type bits for the entry (a subset of FileMode).
    Type() FileMode

    // Returns the FileInfo (stat information) for this entry.
    Info() (FileInfo, error)
}

You can see the similarities right away, though in true Go fashion, the Go version is somewhat simpler. In fact, if I were doing Python’s scandir again, I’d probably push for a slightly simpler interface – in particular, getting rid of the follow_symlinks parameter and making it not follow symbolic links by default.

Here’s an example that uses os.scandir – a function that calculates the total size of the files in a directory and its subdirectories, recursively:

def get_tree_size(path):
    total = 0
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                total += get_tree_size(entry.path)
            else:
                total += entry.stat(follow_symlinks=False).st_size
    return total

In Go (once 1.16 comes out) it would look like this:

func GetTreeSize(path string) (int64, error) {
    entries, err := os.ReadDir(path)
    if err != nil {
        return 0, err
    }
    var total int64
    for _, entry := range entries {
        if entry.IsDir() {
            size, err := GetTreeSize(filepath.Join(path, entry.Name()))
            if err != nil {
                return 0, err
            }
            total += size
        } else {
            info, err := entry.Info()
            if err != nil {
                return 0, err
            }
            total += info.Size()
        }
    }
    return total, nil
}

A similar high-level structure, though of course someone’s going to say, “see, look how much boilerplate Go’s error handling introduces!” And that’s true – the Python code is very neat. In a little script that would be fine, and that’s where Python excels.

However, in production code, or in a hardened command-line utility, you’d want to catch errors around the stat call, and perhaps ignore permission errors, or log them. The Go code makes explicit the fact that errors can occur, and would easily allow you to add logging or nicer error messages.

Higher-level tree walking

In addition, both languages have higher-level functions for recursively walking a directory tree. In Python, that’s os.walk. The beauty of scandir in Python is that the signature of os.walk didn’t need to change, so all existing users of os.walk (of which there are many) got the speed-up automatically.

For example, to print all the non-dot file paths in a directory tree using os.walk:

def list_non_dot(path):
    paths = []
    for root, dirs, files in os.walk(path):
        # Modify dirs to skip directories starting with '.'
        dirs[:] = [d for d in dirs if not d.startswith('.')]
        for f in files:
            if f.startswith('.'):
                continue
            paths.append(os.path.join(root, f))
    return sorted(paths)

As of Python 3.5, where os.walk uses scandir instead of listdir under the hood, this will magically be from 1.5 to 20 times as fast, depending on operating system and file system.

Go (pre-1.16) has a similar function, filepath.Walk, but unfortunately the FileInfo interface wasn’t designed to allow errors to be reported from its various method calls. As we’ve seen, these can sometimes perform system calls – for example, the stat information like Size will always require a system call on Linux. So in Go, the methods need to return an error (in Python they raise an exception).

Is was tempting to wave error handling away to try to reuse the FileInfo interface, so that existing code would get a magical speed-up. In fact, issue 41188 is a proposal from Russ Cox suggesting just that (with some data to show that it’s not as terrible an idea as it sounds). However, stat can and does return errors, so there was potential for things like a file size being returned as 0 on error. As a result, there was significant push-back against trying to wedge it into the existing API, and Russ eventually acknowledged the lack of consensus and proposed the DirEntry interface instead.

What this means is that, to get the performance gain, filepath.Walk calls need to be changed to filepath.WalkDir – very similar, but the walk function receives a DirEntry instead of a FileInfo.

Here’s what a Go version of list_non_dot would look like with the existing filepath.Walk function:

func ListNonDot(path string) ([]string, error) {
    var paths []string
    err := filepath.Walk(path, func(p string, info os.FileInfo,
                                    err error) error {
        if strings.HasPrefix(info.Name(), ".") {
            if info.IsDir() {
                return filepath.SkipDir
            }
            return err
        }
        if !info.IsDir() {
            paths = append(paths, p)
        }
        return err
    })
    return paths, err
}

This will keep working in Go 1.16, of course, but if you want the performance benefits you’ll have to make some very small changes – in this case just changing Walk to WalkDir, and changing os.FileInfo to os.DirEntry:

    err := filepath.WalkDir(path, func(p string, info os.DirEntry,

For what it’s worth, running the first function on my home directory on Linux, once cached, takes about 580ms. The new version using Go 1.16 takes about 370ms – roughly 1.5x as fast. Not a huge difference, but worth it – and you get much larger speed-ups on networked file systems and on Windows.

Summary

The new ReadDir API is easy to use, and integrates nicely with the new file system interface via fs.ReadDir. And to speed up your existing Walk calls, the tweaks you’ll need to make to switch to WalkDir are trivial.

API design is hard. Cross-platform, OS-related API design is even harder. Be sure to get this right when designing your next programming language’s standard library! :-)

In any case, I’m glad that Go’s support for reading directories will no longer be lagging behind – or walking behind – Python.