Architecture

Suppressing the "n+1 problem"

Let's learn how Gato GraphQL completely avoids the "n+1 problem" by architectural design.

What is the "n+1 problem"

The "n+1 problem" basically means that the amount of queries executed against the database can be as large as the amount of nodes in the graph.

What does this mean? Let's check it out with an example: say we want to retrieve a list of directors and, for each of them, his/her films, through the following query:

query {
  directors(first: 10) {
    name
    films(first: 10) {
      title
    }
  }
}

To be efficient, we would expect to execute only 2 queries to retrieve the data from the database: 1 to fetch the directors' data, and 1 to retrieve the data for all films by all directors.

However, in order to satisfy this query, GraphQL will need to execute "n+1" queries against the database: 1 query first to retrieve the list of n directors (10 in this case) and then, for each of the n directors, 1 query to retrieve his/her list of films. In our case, we must execute 1+10=11 queries.

This problem arises because GraphQL resolvers handle only 1 object at a time, not all the objects of the same kind at the same time. In our case, the resolver handling objects of the Query type (which is the root type) will be called once, to get the list of all the Director objects, and then the resolver handling the Director type will be called once for each Director object, to retrieve his/her list of films.

In other words: GraphQL resolvers see the tree, not the forest.
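
For instance, a naive resolver for the films field might look like the sketch below (the NaiveDirectorResolver class, FilmAPIFacade and getFilmsByDirector are made-up names for illustration, not code from Gato GraphQL): it issues one extra database query for every single Director object, which is exactly the "n+1" pattern.

class NaiveDirectorResolver
{
  public function resolveFilms($director, array $args = []): array
  {
    // Called once per Director object: 1 query to retrieve the directors,
    // plus n queries (one here for each director) to retrieve their films
    $filmAPI = FilmAPIFacade::getInstance();
    return $filmAPI->getFilmsByDirector($director->ID);
  }
}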

This problem is actually worse than it initially appears, because the number of nodes in a graph grows exponentially with the number of levels in the graph. Hence, the name "n+1" is valid only for a graph 2 levels deep. For a graph 3 levels deep, it should rather be called the "n²+n+1" problem! And so on...

For instance, following our example above, let's also add each film's list of actors/actresses to the query, like this:

query {
  directors(first: 10) {
    name
    films(first: 10) {
      title
      actors(first: 10) {
        name
      }
    }
  }
}

Then, the queries executed against the database are: 1 first to retrieve the list of the 10 directors, then 1 query to retrieve the list of films for each of the 10 directors, and finally 1 query to retrieve the list of actors/actresses for each of the 10 films of each of the 10 directors. This gives a total of 1+10+100=111 queries.
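
In general, assuming every level of the graph retrieves n nodes per parent node, a graph d levels deep requires:

1 + n + n^2 + ... + n^(d-1) = (n^d - 1) / (n - 1) queries

With n = 10 and d = 3, that gives (1000 - 1) / 9 = 111 queries, matching the count above.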

Once we notice this behaviour, the "n+1 problem" can easily be considered GraphQL's biggest performance hurdle: if left unchecked, querying graphs just a few levels deep may become so slow as to effectively render GraphQL pretty much useless.

General solution to the "n+1 problem"

The standard solution to the "n+1 problem" was first provided by the utility DataLoader. Its strategy is very simple: defer resolving segments of the query until a later stage, in which all of the objects of the same kind can be resolved all together, in a single query. This strategy, called "batching", effectively solves the "n+1" problem.

In addition, DataLoader caches objects after retrieving them, so that if a subsequent query needs to load an already-loaded object, it can skip execution and retrieve the object from the cache instead. This strategy, which is called "caching", is mostly an optimization on top of "batching".
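
To visualize the idea, below is a minimal sketch of a batching-and-caching loader (this only illustrates the strategy; it is not DataLoader's actual API, and the class and method names are invented). A resolver would call load() while the query is being traversed, and the server would call dispatch() once per level, firing a single query for all the accumulated IDs.

class BatchedLoader
{
  private array $queuedIDs = [];
  private array $cache = [];

  // "Batching": queue the ID now, resolve it at a later stage
  public function load($id): void
  {
    if (!isset($this->cache[$id])) {
      $this->queuedIDs[] = $id;
    }
  }

  // Executed at a later stage: retrieve all queued objects in a single query,
  // and cache them so that subsequent requests skip the database ("caching")
  public function dispatch(callable $batchLoadFn): void
  {
    if ($this->queuedIDs !== []) {
      foreach ($batchLoadFn(array_unique($this->queuedIDs)) as $id => $object) {
        $this->cache[$id] = $object;
      }
      $this->queuedIDs = [];
    }
  }

  public function get($id)
  {
    return $this->cache[$id] ?? null;
  }
}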

Problems with the "batching/deferred" solution

Technically speaking, there is no problem whatsoever with the "batching" or "deferred" strategy: it just works.

(From now on, let's refer to the strategy as "deferred" only.)

The problem, though, is that this strategy is an afterthought: the developer may first implement the server and then, after noticing how slow it is to resolve the queries, decide to introduce the deferring mechanism. Hence, implementing the resolvers may involve some false steps, adding friction to the development process. In addition, since the developer must understand how the "deferred" mechanism works, implementing the resolvers becomes more complex than it could otherwise be.

This problem doesn't lie in the strategy itself, but in the GraphQL server offering this functionality as an add-on when, without it, querying may be so slow as to render GraphQL pretty much useless.

The solution to this problem is, then, straightforward: the "deferred" strategy should not be an add-on, but baked into the GraphQL server itself. Instead of having 2 query execution strategies, "normal" and "deferred", there should be only 1: "deferred". And the GraphQL server must execute the "deferred" mechanism even when the developer implements the resolver the "normal" way (in other words, the GraphQL server takes care of the extra complexity, not the developer).

And that is exactly what Gato GraphQL does.

Making "deferred" the only strategy executed by the GraphQL server

The problem with most GraphQL servers is that resolving the composite types (object, union and interface) into their objects is done by the resolvers themselves when processing the parent node (e.g. resolving the films for each director), instead of being delegated to the data-loading engine.

Gato GraphQL transfers this responsibility away from the resolver and into the server's data-loading engine, like this:

  1. The resolvers return IDs, and not objects, when resolving a relationship between the parent and child nodes
  2. Given a list of IDs of a certain type, a DataLoader entity obtains the corresponding objects from that type
  3. The server's data-loading engine is the glue between these 2 parts: it first obtains the object IDs from the resolvers and, just before executing the nested query for the relationship (by which time it will have accumulated all the IDs to be resolved for the specific type), it retrieves the objects for those IDs through the DataLoader (which can efficiently include all the IDs into a single query).

This approach can be summarized as: "Deal with IDs, not with objects".

Let's use the same example from earlier on to visualize this new approach. The query below retrieves a list of directors and their films:

query {
  directors(first: 10) {
    name
    films(first: 10) {
      title
    }
  }
}

Pay attention to the 2 fields to retrieve from each director, name and films, and how they differ:

Field name is of scalar type. It is immediately resolvable, since we can expect the object of type Director to contain a property of type string called name, containing the director's name. Hence, once we have the Director object, there is no need to execute an extra query to resolve this property.

Field films, though, is a list of object type. It is normally not immediately resolvable, since it references a list of objects, of type Film, which must still be retrieved from the database through 1 or more extra queries. Hence, the developer would need to implement the "deferred" mechanism for it.

Now, let's change the behaviour, and have field films resolve to a list of IDs (instead of a list of objects). Since we can expect the Director object to contain a property called filmIDs, of type array of string (assuming that IDs are represented as strings), holding the IDs of all its films, this field can also be resolved immediately, without having to implement the "deferred" mechanism.

Finally, in addition to the ID, the resolver must provide an extra piece of information: the type of the expected object (in our example, it could be [(Film, 2), (Film, 5), (Film, 9)]). This information is internal, though: it is passed over to the engine, and does not need to be output in the response to the query.
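
Under this approach, a resolver for our directors/films example could be as simple as the sketch below (the DirectorFieldResolver class and the filmIDs property are hypothetical; the actual resolver contracts are introduced in the next section):

class DirectorFieldResolver
{
  public function resolveValue($director, string $field, array $args = [])
  {
    switch ($field) {
      case 'name':
        // Scalar field: immediately resolvable from the object's property
        return $director->name;
      case 'films':
        // Relational field: return the IDs only; the engine will later
        // load the corresponding Film objects, all in a single query
        return $director->filmIDs;
    }

    return null;
  }
}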

Implementing the adapted approach in code

Let's see how Gato GraphQL implements this approach in PHP code. The code below demonstrates the different resolvers (for the purpose of clarity, all code below has been edited).

FieldResolvers

FieldResolvers receive an object of a specific type, and resolve its fields. For relationships, they must also indicate the type of the object the field resolves to. This is their contract:

interface FieldResolverInterface
{
  public function resolveValue($object, string $field, array $args = []);
  public function resolveFieldTypeResolverClass(string $field, array $args = []): ?string;
}

Its implementation looks like this:

class PostFieldResolver implements FieldResolverInterface
{
  public function resolveValue($object, string $field, array $args = [])
  {
    $post = $object;
    switch ($field) {
      case 'title':
        return $post->title;
      case 'author':
        return $post->authorID; // This is an ID, not an object!
    }
 
    return null;
  }
 
  public function resolveFieldTypeResolverClass(string $field, array $args = []): ?string
  {
    switch ($field) {
      case 'author':
        return UserTypeResolver::class;
    }
 
    return null;
  }
}

Please notice how, by removing the logic dealing with promises/deferred objects, the code resolving field author has become very simple and concise.

TypeResolvers

TypeResolvers are objects which deal with a specific type: among other responsibilities, they know the type's name and which TypeDataLoader loads objects of that type.

The data-loading engine, when resolving fields, will be given IDs from a certain TypeResolver class. Then, when retrieving the objects for those IDs, the data-loading engine will ask the TypeResolver which TypeDataLoader object to use to load those objects.

Their contract is defined like this:

interface TypeResolverInterface
{
  public function getTypeName(): string;
  public function getTypeDataLoaderClass(): string;
}

In our example, class UserTypeResolver defines that type User must have its data loaded through class UserTypeDataLoader:

class UserTypeResolver implements TypeResolverInterface
{
  public function getTypeName(): string
  {
    return 'User';
  }
 
  public function getTypeDataLoaderClass(): string
  {
    return UserTypeDataLoader::class;
  }
}

TypeDataLoaders

TypeDataLoaders receive a list of IDs of a specific type, and return the corresponding objects of that type. This is their contract:

interface TypeDataLoaderInterface
{
  public function getObjects(array $ids): array;
}

Retrieving users is done like this:

class UserTypeDataLoader implements TypeDataLoaderInterface
{
  public function getObjects(array $ids): array
  {
    $userAPI = UserAPIFacade::getInstance();
    return $userAPI->getUsers($ids);
  }
}
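
Finally, to show how the pieces fit together, below is a rough sketch of the glue performed by the data-loading engine: it accumulates the IDs returned by the FieldResolvers for each type and, just before resolving the nested query, loads all the corresponding objects in a single batch through that type's TypeDataLoader (this is a simplification for illustration purposes, not the engine's actual code):

class DataLoadingEngine
{
  // IDs accumulated for each TypeResolver class
  private array $idsPerType = [];

  // Called whenever a FieldResolver returns an ID for a relational field
  public function enqueueID(string $typeResolverClass, $id): void
  {
    $this->idsPerType[$typeResolverClass][] = $id;
  }

  // Called just before resolving the nested query for a relationship:
  // by then, all the IDs for the type have been accumulated
  public function loadObjects(string $typeResolverClass): array
  {
    /** @var TypeResolverInterface $typeResolver */
    $typeResolver = new $typeResolverClass();
    $typeDataLoaderClass = $typeResolver->getTypeDataLoaderClass();
    /** @var TypeDataLoaderInterface $typeDataLoader */
    $typeDataLoader = new $typeDataLoaderClass();
    // A single query retrieves the objects for all the accumulated IDs
    $ids = array_unique($this->idsPerType[$typeResolverClass] ?? []);
    return $typeDataLoader->getObjects($ids);
  }
}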

Executing a (really) big query

Let's test that this strategy works. Go to the GraphiQL client in Gato GraphQL and execute the query below, which involves a graph 10 levels deep (posts => author => posts => tags => posts => comments => author => posts => comments => author) and could not be resolved in a decent amount of time if the "n+1 problem" were taking place.

query {
  posts(pagination:{ limit:10 }) {
    excerpt
    title
    url
    author {
      name
      url
      posts(pagination:{ limit:10 }) {
        title
        tags(pagination:{ limit:10 }) {
          slug
          url
          posts(pagination:{ limit:10 }) {
            title
            comments(pagination:{ limit:10 }) {
              content
              date
              author {
                name
                posts(pagination:{ limit:10 }) {
                  title
                  url
                  comments(pagination:{ limit:10 }) {
                    content
                    date
                    author {
                      name
                      username
                      url
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

Scrolling down the results, we can see how big the response is, how many entities it involves, and how many levels deep it goes; and yet the query was executed promptly, without any difficulty.