
Readers

Background

Before we proceed, we need to zoom out and look at the moving parts involved in serving a page of HTML or a blob of JSON data. Roughly speaking, in a standard Django application, there are three parts to this story:

  1. The query. This is built using a queryset, which is usually defined in the view. It controls which rows and columns are requested from the database. It applies filters, as well as building joins (either in the database with select_related or in the application with prefetch_related) and any other database-level operations that are needed, such as annotations, aggregations and optimisations. Some of the business logic for building some parts of the query may be defined in a custom queryset, but that business logic will be assembled and called from the view.
  2. The values. After a query is executed, code needs to run to produce the values that are going to be sent to the client. Django's built-in field descriptors produce the values for each column via attribute access of the model fields themselves. Often, some more complex transformation is required: for example, if you use Django's get_absolute_url mechanism, you're defining a method which is used to produce a value (the canonical URL for an object) by combining some data from the database (usually the object's ID or slug) with the result of some Python business logic (string interpolation or URL reversing). Most business logic in complex Django applications tends to live here, and as we've already discussed, this code usually lives in model methods.
  3. The projection. I'm using the word "projection" to mean the process of deciding exactly what subset of data to send to the client as part of a particular request. In a JSON API endpoint, this is roughly the process that DRF (incorrectly) refers to as serialization: converting a model instance (or iterable of model instances) into a possibly-nested dictionary mapping names to primitive values (or a list of such dictionaries). In an HTML-on-the-server application, it's the process of combining an HTML template with a set of values via Django's template system. In both cases, the values are obtained by calling into the business logic that is responsible for producing those values, as described in point 2 above.

Given these three definitions, I'm going to make two assertions:

  1. Almost all structural problems in Django applications are caused by the fact that these three types of activity (query-building, value-producing, data-projecting) are defined in different places in the codebase, and interact with each other implicitly rather than explicitly.
  2. Almost all performance problems in Django applications are caused by a very specific type of leakage between these layers: as part of assembling a projection, code that produces values is able to execute queries. Furthermore, Django's ORM (by design) hides this leakage by allowing queries to be executed from anywhere, and by automatically following relationships by lazy-loading data on demand.

In a typical Django application, the three kinds of logic described above usually happen in separate places in the codebase. Manipulating the queryset usually happens in a view. Deriving values from instances often happens in model methods. Describing how those values are assembled in the final response usually happens in a template or serializer.

This introduces invisible coupling between seemingly-unrelated parts of a codebase. Changes to any one of the three can introduce unexpected results in either of the other two.

This violates the principle of Locality of Behaviour:

The behaviour of a unit of code should be as obvious as possible by looking only at that unit of code

By having the code that uses the results of database queries in a totally different place from where the database query is constructed, the behaviour of any individual unit is often very much not obvious by looking at it in isolation. This pattern, repeated across a codebase, creates an environment in which overall codebase comprehension and maintenance can be difficult.

There's also an extremely common practical problem: performance. In a typical codebase, database queries are executed inside model methods (or triggered by attribute access that traverses relationships); the optimisations that make those methods efficient are added to a queryset in a view; and the methods are finally called from a serializer or template. With the logic spread across three places like this, the overall performance profile of the application becomes almost impossible to reason about. A tweak to a queryset method to optimise one endpoint might unexpectedly blow up the performance of another endpoint. A seemingly-innocuous change to how a value is calculated in a model method may impact multiple downstream consumers of that method. In extreme cases, just accessing an attribute in a template could cause a page to go from executing one query to executing thousands.

It's worth reiterating that the above problems are by no means terminal. There are many successful and mature Django apps that persist despite them. But in my experience, most codebases suffer from some or all of the problems described above, and they are significantly more impactful as the product increases in complexity and the codebase becomes larger. By changing the way that we structure this code to express dependencies between values and queries, and by controlling where queries are executed, it's quite possible to avoid this mess entirely. There's a learning curve, of course, but the benefits are worth the investment.

Reader functions

Let's start with some terminology. Reader functions, or just "readers", are functions that encapsulate any aspect of a Django application related to reading data from the database via the Django ORM. They replace bespoke query-building code (that would usually go inside a custom QuerySet), as well as data transformation or derivation code (that would usually go inside a model method).

The HackSoft Django Styleguide leans some way towards this idea via the concept of "selectors". This is very much a "...maybe just stuff that code in a function somewhere" approach. The guide under-defines exactly what the responsibilities of a selector are. It's really only talking about the query-building part: model methods and serializers don't change. It suggests strict type annotations but doesn't define a framework protocol: in other words, selectors can have any type signature as long as it's annotated. Encapsulating this logic in a function is certainly an improvement on the standard approach, but I think we can do better.

A note on django-readers

For the most part, the Django RAPID architecture is a collection of patterns, suggested structures and best practices, intended to be used to organise the code you write and the way you think about it. However, the section on reader functions is the only part which also suggests the use of an open-source library: django-readers, created by the same authors as this document.

It's not strictly necessary to use django-readers. But if you do choose to adopt this pattern, I'd strongly suggest considering the library. It's a relatively small set of building blocks that implement a lot of the patterns you'd probably end up building yourself anyway.

Following the three stages of serving a read request discussed above, there are conceptually three types of reader function:

  • Queryset functions, which encapsulate queryset construction, filtering, annotation, prefetching, etc.

  • Producer functions, which are responsible for "producing" values from model instances. These could be direct values (the values of fields or annotations) or derived values (those for which some Python business logic has been applied).

  • Projector functions, which are responsible for assembling a set of values, giving them names, and building a possibly-nested dictionary to be used to transfer those values to another part of the system (via a template, or directly sent out as a JSON response).

The huge benefit of encapsulating each type of process in a function is that they can be composed together like building blocks to build complete end-to-end flows. But first, let's explain each part.

Queryset functions

Queryset functions are intended to encapsulate code that would traditionally go in custom queryset methods. We're going to start with the following example from the Django documentation to illustrate:

class PersonQuerySet(models.QuerySet):
    def authors(self):
        return self.filter(role="A")

    def editors(self):
        return self.filter(role="E")

As discussed previously, querysets present a fluent interface, meaning method calls can be chained. This is implemented by having each method return a copy of self. The methods in the example above already take self as an argument (as do, by convention, all instance methods on Python objects).

We can now "detach" this concept from the queryset class, and apply it to a standalone function, which we can define as: a function that takes a queryset and returns a queryset.

This is the framework protocol for a queryset function. As soon as anyone says the words "queryset function", you know they're talking about a function that takes a queryset and returns a queryset (just like whenever anyone says "view" you know they're talking about a function that takes an HttpRequest and returns an HttpResponse).

def authors(queryset):
    return queryset.filter(role="A")

def editors(queryset):
    return queryset.filter(role="E")

To use these, we simply call them, passing in a new queryset:

all_authors = authors(Person.objects.all())

Now, not all queryset methods take self as their only argument. Consider this example:

class PersonQuerySet(models.QuerySet):
    ...

    def last_name_starts_with(self, prefix):
        return self.filter(last_name__startswith=prefix)

So how do we parameterise queryset functions like this? Remember, we need all queryset functions to conform to our framework protocol: they should take a queryset and return a queryset.

Higher-order functions

To achieve this, we can use a concept from functional programming: higher-order functions, or functions that return other functions:

def last_name_starts_with(prefix):
    def queryset_function(queryset):
        return queryset.filter(last_name__startswith=prefix)
    return queryset_function

To use:

m_people = last_name_starts_with("M")(Person.objects.all())

This approach allows us to create queryset functions that conform to our protocol, by closing over other parameters.

Composition

Now, how do we compose these functions? If they were implemented as custom queryset methods, we could chain them:

m_authors = Person.objects.authors().last_name_starts_with("M")

This works because authors returns a copy of the queryset (with the role filter applied) which then becomes the object upon which last_name_starts_with is called.

To achieve this with our function-oriented approach, we need to call one function, then call the other function with the result of the first:

m_authors = last_name_starts_with("M")(authors(Person.objects.all()))

You might notice two problems with this:

  1. It's quite verbose
  2. It's "backwards" - the first function to be applied is the "inner" one, and the last function to be applied is the "outer" one.

As you can imagine, with a complex set of filters or transformations, this could get very difficult to read.

We can easily improve this, though, by using Python's ability to treat functions as first-class objects and put them inside other data structures, such as lists:

queryset = Person.objects.all()
functions_to_apply = [authors, last_name_starts_with("M")]
for function in functions_to_apply:
    queryset = function(queryset)

If you're familiar with functional programming, you might recognise this: it's a pipe. A value is passed ("piped") through a list of functions, with the return value of each used as the argument passed to the next. This smells a lot like framework code! You could write an implementation of this in your own codebase, or you could use the version that comes with django-readers.

The django-readers pipe implementation is subtly different to the code shown above: it's actually another higher-order function, so rather than piping the value through each function itself, it creates a function that when called will pipe the value through each function. The neat thing about this is that the function it creates is itself also a queryset function!
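To make that concrete, a minimal pipe along these lines can be sketched in plain Python. This is an illustration of the idea only, not the library's actual implementation:

```python
from functools import reduce

def pipe(*functions):
    # Build a single function that threads a value through each of the
    # given functions, left to right.
    def piped(value):
        return reduce(lambda acc, fn: fn(acc), functions, value)
    return piped

# The sketch works on any value, not just querysets:
add_one_then_double = pipe(lambda x: x + 1, lambda x: x * 2)
# add_one_then_double(3) == 8
```

Because the returned function has the same shape as the functions it wraps, the result of one pipe can itself be passed into another.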

from django_readers import qs

m_authors_filter = qs.pipe(authors, last_name_starts_with("M"))

# m_authors_filter is now a queryset function!

m_authors = m_authors_filter(Person.objects.all())

So we now have a way of defining our own queryset functions to implement our bespoke queryset filtering logic. And we have a way of composing them to form arbitrarily complex filters. But this approach is pretty verbose: to encapsulate even basic filtering, we need to define a function. Can we make this neater?

What we need is a generic higher-order function to create filter functions. Consider:

def filter(*args, **kwargs):  # note: deliberately shadows the builtin filter
    def queryset_function(queryset):
        return queryset.filter(*args, **kwargs)
    return queryset_function

This short piece of code makes all of our previous implementations much neater:

authors = filter(role="A")
editors = filter(role="E")

def last_name_starts_with(prefix):
    return filter(last_name__startswith=prefix)

Function-oriented queryset operations

Hopefully now you can see the direction in which this is going. This doesn't just apply to filter: all of the built-in queryset methods could be wrapped in higher-order functions like this. And that's exactly what the qs module in django-readers provides. Arbitrarily complex queryset transformations can be assembled by using these lower-level primitives and combining them with pipe.

What's the point?

Now, this might strike you as a fun and interesting way to manipulate querysets, but what do we gain by representing filters and other operations as functions like this? The answer is that these functions can be put in data structures and "connected" to other functions to express dependencies between them. But before we get to that, we need to move on to the next kind of reader function: producers.

Producer functions

We've covered functions that encapsulate queryset-level business logic. What about instance-level business logic?

Considering only reads (in other words, data retrieved from the database), model instance methods in a "traditional" Django app usually perform some sort of transformation or derivation from one or more fields on the model (or related models):

class Person(models.Model):
    first_name = models.CharField(max_length=50)
    last_name = models.CharField(max_length=50)
    # ... other fields

    def full_name(self):
        return f"{self.first_name} {self.last_name}"

To convert the full_name method to a function, we need to define a framework protocol for this kind of function, which is simply: a function that takes a model instance and returns any value. In the method above, self is already a model instance, so we just need to "slide the method to the left" to convert it to a standalone function:

def produce_full_name(instance):
    return f"{instance.first_name} {instance.last_name}"

We call this kind of function a producer function. In simple terms, a producer function is a function that produces a value from a model instance.

The very simplest type of producer function contains no bespoke business logic and just produces the value of a model attribute:

def produce_first_name(instance):
    return instance.first_name

When we have simple, generic functions like this, we can again create higher-order functions to abstract over them.

def attr(name):
    def producer(instance):
        return getattr(instance, name)
    return producer


produce_first_name = attr("first_name")

This is a simplified version of django-readers' producers.attr function.
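django-readers provides a similar helper, producers.method, for values that come from calling a model method rather than reading an attribute. A simplified sketch of the zero-argument case:

```python
def method(name):
    # Produce a value by calling a (zero-argument) method on the instance.
    def producer(instance):
        return getattr(instance, name)()
    return producer

produce_full_name = method("full_name")
```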

Projector functions

Projectors build on top of producers. The task of a projector is to return a dictionary that maps one or more names on to one or more values. I think of these dictionaries as lightweight "data transfer objects": simple bags of data that can cross abstraction boundaries, either from view code into a template, or from our backend code to the frontend via JSON. The framework protocol for a projector function is: a function that takes a model instance and returns a dictionary. Again, by using functions here, we can compose simple building blocks to create complex behaviour.

The simplest example of a projector just returns a single name -> value mapping:

def project_full_name(instance):
    return {"full_name": f"{instance.first_name} {instance.last_name}"}

The first thing you'll spot is that the actual business logic here is duplicated from the producer function above. That's because, in almost all circumstances, projector functions should depend on producer functions to do the actual work:

def project_full_name(instance):
    return {"full_name": produce_full_name(instance)}

Notice there's no actual business logic left in this function. This smells a lot like framework code:

def producer_to_projector(key, producer):
    def projector(instance):
        return {key: producer(instance)}
    return projector

project_full_name = producer_to_projector("full_name", produce_full_name)

This implementation is lifted directly from django-readers.

Mapping a single key to a single value is the building block, but in the real world we need to create mappings to represent multiple values. Conceptually we need to do this:

def project_first_and_last_name(instance):
    return {
        "first_name": instance.first_name,
        "last_name": instance.last_name,
    }

For reasons that will become clear soon, it's better to use framework code to assemble these sorts of projections from lower-level building blocks. We can use projectors.combine to do this:

from django_readers import producers, projectors

project_first_and_last_name = projectors.combine(
    producer_to_projector("first_name", producers.attr("first_name")),
    producer_to_projector("last_name", producers.attr("last_name")),
)

Pairs

As discussed above, one of the biggest problems in a traditional Django application is the inability to specify dependencies between code that retrieves data from the database, and code that consumes that data. The key to untangling this mess is to be able to decompose our data access patterns into small, atomic pieces - and then express the dependencies between them. We now have all the building blocks to allow us to do this.

To illustrate, let's use the example of an annotation to count the number of authors associated with a book (again, this example is taken from the Django docs).

The annotation is applied to the queryset as follows:

from django.db.models import Count

q = Book.objects.annotate(num_authors=Count("authors"))

Following standard Django conventions, we might choose to make this reusable by defining a custom QuerySet and adding a method with_author_count to encapsulate the call to .annotate. We'd then apply this annotation to the queryset in a view, by adding the call to that method to the queryset we use: Book.objects.with_author_count(). Finally, in a template, we'd display the count: {{ book.num_authors }}.

Note what happens here:

  1. If we change or remove the annotation code in with_author_count, we'd likely break both the view and the template.
  2. If we change or remove the call to with_author_count in the view, we'd likely break the template.
  3. If we change or remove the call to the num_authors attribute in the template, we'd be left with a possibly-expensive query annotation to calculate a value that isn't actually being used (this is called "overfetching").

And this is just for one view. Imagine if the with_author_count functionality was shared and used in multiple places!

The main problem that we want to solve here is to conceptually define a dependency between the num_authors call and the Count("authors") annotation.

Let's start by reimplementing this functionality using the building blocks above. We'll use django-readers for this, but again this code could live in your codebase.

First, we need a queryset function to add the annotation:

from django_readers import qs

with_author_count = qs.annotate(num_authors=Count("authors"))

Now we need a producer function to retrieve the value:

from django_readers import producers

author_count_producer = producers.attr("num_authors")

We want to say the following: in order to call the producer function on a model instance, you must first have called the queryset function on the queryset that the instance came from.

The simplest way of expressing the dependency relationship between these two functions is to put them together in a data structure. In Python, the data structure that fits best is a simple two-tuple:

author_count_pair = (with_author_count, author_count_producer)

We call a two-tuple consisting of a queryset function in the first position and a producer function in the second position a reader pair (or just a "pair").

Technically there are two types of reader pairs: producer pairs where the second item in the tuple is a producer, as shown above, and projector pairs, where the second item is a projector. In practice, django-readers wraps up the process of converting a producer pair to a projector pair, so it's uncommon to need to think about projectors at all. Because we haven't got to the higher levels of abstraction in django-readers yet, I'll demonstrate how that works:

from django_readers import pairs

author_count_projector_pair = pairs.producer_to_projector(
    "author_count", author_count_pair
)

So now we have a two-tuple that can be used as follows:

# unpack the tuple
prepare, project = author_count_projector_pair

qs = prepare(models.Book.objects.all())
for instance in qs:
    print(project(instance))  # prints {"author_count": 2}

It's worth noting that we can also apply this concept to optimise simple attribute access as well as more complex annotations. By default, when you evaluate SomeModel.objects.all(), Django runs the equivalent of SELECT * on the table to return values for all the columns (it actually enumerates all of the columns defined on the model rather than using a wildcard, but the effect is the same in most cases).

Django does provide the .only(*fields) and .defer(*fields) methods on the queryset to control which columns are returned, but these can be quite dangerous: if an attribute is accessed which has been deferred, Django transparently runs an additional query to fetch it! The pairs approach allows us to surgically extract precisely the fields we need for a particular use case, by using a special queryset function include_fields to manipulate the list of deferred fields:

from django_readers import qs, producers

book_title_pair = qs.include_fields("title"), producers.attr("title")

And again, we can use higher-order functions that create pairs of reader functions! The above is equivalent to:

from django_readers import pairs

book_title_pair = pairs.field("title")

And in fact django-readers provides pair functions to create common annotations, too:

from django_readers import pairs

author_count_pair = pairs.count("authors")

So we can represent the dependencies between values and querysets by putting queryset functions and producer functions into simple two-tuples called pairs. The next piece of the puzzle is this: these pairs can be combined to allow us to construct arbitrarily complex projections.

from django_readers import pairs

prepare, project = pairs.combine(
    pairs.producer_to_projector("title", pairs.field("title")),
    pairs.producer_to_projector("author_count", pairs.count("authors")),
)

qs = prepare(models.Book.objects.all())
for instance in qs:
    print(project(instance))
    # prints {"title": "The Definitive Guide to Django", "author_count": 2}

Specs

So far, we've defined a set of framework protocols (signatures) for functions to encapsulate business logic that manipulates querysets, produces values, and assembles projections. We have a library of higher-order functions to allow us to quickly implement common functionality (attribute access, annotations, filtering). But what we have so far is quite verbose: assembling these structures of pairs, converting between producers and projectors, and combining them together requires quite a lot of boilerplate. In a real-world codebase, we need a quicker and more lightweight way to interact with our data. There are several ways you might choose to implement this layer, but in django-readers the concept is called a spec.

A spec (short for "specification") is nothing more than a straightforward layer of syntactic sugar (or a domain-specific language) for querying and projecting data. A spec is a list, and each item in the list relates to one or more key/value pairs in the resulting projection. The items in the spec are shortcuts to encapsulate the most common patterns in the examples seen above. There are only a few kinds of things that can be in a spec. The first couple are relevant to the discussion above:

  • A string, which is interpreted as the name of a model field. When the spec is processed by the library, a string like "title" is replaced directly with the equivalent line from the example above: pairs.producer_to_projector("title", pairs.field("title")).
  • A dictionary mapping a string key to a producer pair, like this: {"author_count": pairs.count("authors")}. This is replaced with a call to pairs.producer_to_projector, wrapping the producer pair in the given name.
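The translation of those two item kinds can be sketched as a simple normalisation step. The helper below is purely illustrative (the name and the tuple representation are hypothetical, not library code); in real django-readers code the library performs this translation for you and hands back a (prepare, project) pair.

```python
def normalise_spec(spec):
    # Hypothetical sketch: translate each spec item into an explicit
    # (key, pair_description) tuple, mirroring the two rules above.
    items = []
    for item in spec:
        if isinstance(item, str):
            # A bare string becomes a field pair keyed by its own name.
            items.append((item, ("field", item)))
        elif isinstance(item, dict):
            # A dict maps each key to a ready-made producer pair.
            for key, pair in item.items():
                items.append((key, pair))
    return items
```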

The full details of specs are in the django-readers docs so I won't go into any more detail here.

Specs are the standard day-to-day high-level developer interface for building endpoints with django-readers. The layered design explained above is well worth understanding for those relatively rare occasions when you need to do clever things, but almost all work in almost all endpoints involves simply listing out the data you need the endpoint to return in a spec, maybe writing a few custom producer pairs for any bespoke business logic, and you're done.

With a little practice, this approach makes it shockingly fast to build endpoints, while maintaining easy readability. The resulting code is minimalist (in terms of volume) while maintaining full and straightforward access to lower-level primitives that allow you to do literally anything that's possible with the Django ORM. And unlike alternative approaches which try to provide this extremely high-level interface to endpoint design and query optimisation, it remains entirely "magic-free".

It's also worth noting that I've not even attempted to describe another key feature of django-readers, which is its ability to automatically prefetch relationships. Specs can contain other specs: listing field names of related objects results in a nested dictionary containing data from those related objects, and those relationships are prefetched for efficiency, eliminating N+1 query problems.

If django-readers appeals to you, please do read the documentation carefully and give it a try in your project.

Code organisation

By identifying and describing the three kinds of business logic (queryset-building, value-producing and data-projecting) we're able to organise our codebase in a way that makes most sense for the project, rather than being forced to put everything in one place.

  • Straightforward logic that's only relevant to a particular endpoint can be assembled in the spec, without any attempt to make it reusable. Retrieving the values of model fields or relationships is just a question of putting the field names in a list. For API endpoints, this is much less work than defining a serializer.
  • When more complex logic is needed, we can start with a simple readers.py for smaller projects.
  • When there's too much code for a single file, it's easy to break out into domain-specific modules. For larger core models, it might make sense to put all the readers associated with that model together in their own file.
  • For complex apps with multiple user roles, it's possible to add another layer of hierarchy for each role (readers/guest/*.py and readers/admin/*.py, for example). This makes it easy to spot errors, and enables the use of static analysis tooling to catch potential security issues by (say) disallowing any imports from readers/admin/ in the area of the code dedicated to serving guests.

The point is, you can organise these functions in any way that makes sense for your project. You can start extremely simple and smoothly scale up to a more complex arrangement as needed. There's almost no cost to moving these functions around until you find a structure that fits.

Remember, if you do use django-readers, you can freely mix and match between custom business logic and library code. In my experience, thinking in this more function-oriented, higher-order style enables a significant reduction in the overall amount of code that needs to be written, and drastically improves readability and codebase comprehension.

Summary

This is probably the longest section in the entire RAPID documentation. That's to be expected: in the real world, most of the business logic in most applications involves database reads, and so this is where most of the problems tend to lie. I've argued that by adding a lightweight taxonomy for the three different kinds of business logic involved in most endpoints, we can define a set of straightforward patterns that unlock a far more comprehensible and flexible codebase structure. Although not essential, I'd highly recommend the use of the django-readers library to facilitate this pattern.