AVIO Consulting

What To Know About Pagination and MuleSoft

Jul 13, 2021 | Blogs, MuleSoft

If you have a large amount of data that needs to be processed, it might be time to consider pagination. Pagination is the process of presenting data from the data source via APIs in sizable chunks. Implementing paging can be good if you have a vast amount of data and want to process it for data unlocking purposes and/or syncing it across several systems. Because pagination breaks down data into multiple pages, it is easier to process data on both the server and client side without overwhelming the system with high memory utilization.

Before you go and implement pagination on your dataset, have you thought of streaming the data out as an alternative solution for paging? Below are some of the factors that are worth giving a thought to.

  1. Assess whether your consumers able to consume a stream and process
    • You can set up to stream data to your consumer when they are capable of receiving/accepting a stream. If your primary intention is to sync data between two systems that are in your control, you may get away with streaming the data between those systems. However, make sure you properly process the stream, especially with large datasets. For example, positional access to the objects inside a collection being streamed would cause the whole dataset to load into the memory which overwhelms the server memory.
  2. Consumer demands the paginated API to show in their UI
    • If you need to unlock your data for a UI client to display results page by page, streaming the data out may NOT work.
  3. Backend is unable to handle multiple calls to send out paginated data or have rate limiting and thresholds
    • Compared to streaming, pagination through the data could be slightly slower and can cause more API calls to the backend server. On the other hand, pagination is more robust and easy to consume without worrying about memory limits.

The below sections will discuss the paging API specification design process, its implementation, and how to consume pagination API.

Designing Paging API Specification

To design a paging resourceType in RAML to include all paging-related entities for your paging resource, you’ll want to publish this resourceType as a RAML fragment in your organization Anypoint Exchange to use it across all the API specs that need paging capability.

#%RAML 1.0

title: Folks API

- application/json

 # Below resourceType contains all the things needed for a paging resource
   description: Retrieves a summary of <<resourcePathName>> headers:
   description: Retrieves paged <<resourcePathName>>
     description: Number of records to fetch at a time
     type: number
     default: 100
     minimum: 1
     maximum: 1000
     description: Record ID to start to return the results from
     type: number
     default: 0
          example: "/api/v1/folks?page_size=100&start=600"
          example: "/api/v1/folks?page_size=100&start=400"
          example: "/api/v1/folks?page_size=100&start=500"
       results: <<actual-response>>

/folks: type: { pageableResource: { actual-response: !include get-resource-datatype.raml }}

When you use the above resource type for a REST resource, the query params, HEAD (to get row count), and GET verbs will be inherited to the resource without replicating across all the resources that need paging.

The above API spec is an example that may slightly change based on your backend system and its data, but the fundamental idea of paging remains the same. Justification for each paging-related item in the above API spec like a HEAD verb and page links is provided below.

Keep in mind — you may need to come up with some sensible defaults to the page_size query parameter based on the record size in your dataset. If your record size is small, you can comfortably return around 1000 rows in a single API call. Also, you can easily enforce the lower and upper limits by declaring minimum and maximum numbers (as shown above) to send in for the page_size query param. APIKit will throw an error if it receives a page size limit number outside of that bracket.

Based on your consumer requirements, you may set up additional query params like sortBy to accommodate data sorting requirements that influence the page results. Make sure you create an index for the sort_by columns for performance.

Paging API Implementation

Pagination Using Database Systems

Implementing a pagination API needs a sortable dataset in the database to retrieve pages. Below are a couple different approaches.

1. Paging Using Offset/Skip

This method is the simplest and most popular form of paging the data. Here, the server API will expose a couple of query parameters named limit and offset (A.K.A skip). The server API would need to build an SQL query like below when it receives a URL like GET /folks?limit=1000&offset=30000

select *
from folks
order by id
limit 1000
offset 30000;

The above SQL query returns the 30000th page with 1000 folks in the response.

You may also expose a query parameter named page instead of offset to accept a page number and calculate the offset value based on the given page number to build the SQL query.

Something to consider is that if you are implementing a paging API on an unordered dataset or don’t have an implicit order, there is a good chance you may receive duplicate records between page retrievals. This is because new records were inserted in the already retrieved page.

The consumer will be able to jump to any page within the dataset by providing the proper page/offset number in the URL. Therefore, the consumer can issue parallel requests to this paging API. This method is beneficial in the use-cases where you want to fetch and sync data to the target system page by page. It can also be used in other cases where you do not care about retrieving all the pages synchronously.

If you have an extensive data set (tens of millions of records or more) this approach is not preferred due to how the offset works in the database. When you specify a value for offset in your SQL query, the database loads all the previous records before that offset and starts returning the results after the offset. For example, if you provide an offset value of 900, the database would need to fetch the first 1000 records only to reach the offset until 900, ‘skip’ them and send the next 100 records starting from 900 to the consumer. Even though we only need 100 records, the database is examining 1000 records to reach that offset. Imagine if you issue an offset value of 1,000,000, the database needs to examine over a million records which is a performance hit due to numerous IO operations.

2. Paging Using a Unique Field

The idea here is to provide a unique field to the paging API and build an SQL query to return the rows that are greater than that unique field value. The database can directly fetch the records from the disk by looking up the address in the database index for the given unique identifier value. Compared to the offset implementation above, this type of underlying paging implementation would be performed with a large set of data given a proper index is maintained in the database for that unique identifier field. This eliminates the process of examining the previous records until the offset.

To fetch the results from a row, a server-side API needs to expose a query param named start to indicate the starting row. You may name this query parameter to represent the primary key field in your database and it’s purely your choice.

When a consumer hits a URL like /folks?limit=1000&start=30000, it needs to build the SQL query like below.

Select *
from folks
where folk_id >= 30000
order by folk_id
limit 1001

The above SQL query fetches 1001 folks from the database after #30000. The API would need to return 1000 folks in the response body and the 1001st record ID and the response to indicate it as a starting point to fetch the next page.

Every time a consumer fetches a page, the server needs to send this unique ID information and the dataset to use the unique ID field from the response to fetch subsequent pages. You can send this info and the response body under page links sections as per the RAML spec above.

You can keep this unique ID field opaque to the consumers by encoding it. As a result, you can easily change the underlying implementation of the unique field to use a database unique record ID from an already existing primary key-based one without affecting the consumers.

There is no way to jump to a specific page in this approach unless you know the unique field value to send in the URL to retrieve the page results starting from it. As we cannot jump to a specific page, a consumer is limited to fetch pages one by one in a synchronous manner only and there may not be a way to issue parallel requests to the database.

You may also use time-based page retrieval where the consumer can send the timestamp in the query params to expect the rows after that timestamp. With this approach, the timestamp may not be a unique value as some back-end jobs might update database records in a batch manner.

Pagination Using Non-Database Systems

Implementing a pagination API on a system that is not a database needs some construct at a minimum to paginate through its dataset. If they provide that paging interface, we can easily simplify the paging mechanism by implementing a paging API that follows the above API spec. You may also implement a custom connector using Mule SDK pagingProvider, which is discussed in other sections.

Suppose your consumers demand a pageable API on top of a system that is not a database and it doesn’t expose any interface to page through. In that case, we may need to consider other options such as replicating the data to another system like a database and implement paging on top of that.

Consuming a Paginated API in MuleSoft

You can easily retrieve pages from the pageable API implemented above by maintaining either offset/page/start(unique ID) values on the client-side. The client only needs to issue subsequent query parameter values(offset/page/start) to fetch the remaining pages.

Based on the above API spec, if you implement your paging API to return page links as part of the response, you may use the next page link from the API response to retrieve the subsequent pages. On the client side, this makes it easier to retrieve pages.

Looping through pages

As there is no straightforward while loop in Mule, you may either use an until-successful scope or the recursive flow call approach to loop through the pages.

  • When using an until successful scope, you may need to raise a dummy error to loop through the next pages.

When using a recursive flow-based approach, there is a good chance that you will get a “Too many child contexts“ error as Mule limits the number of nested flow spans. Thus, this approach is not recommended. Refer to this Mule support article to learn why.

As per the API spec in the first section, The primary reason for exposing a HEAD endpoint and the pageable GET resource is to simplify the process of looping through the pages. You may use the HEAD endpoint to retrieve the record count and, based on that, you can use MuleSoft for each scope to loop through the pages. This HEAD endpoint also serves for the needs where you want to send summary information to the consumer about your dataset.

If you have a use case where you need to retrieve and update the target system page by page without needing to retrieve the whole dataset, you may consider using the offset-based approach to boost overall application performance by leveraging parallel processing. As you can fetch any particular page with offset-based implementation, which enables us to issue parallel requests to the DB, you may use MuleSoft batch processor or a multi worker VM queue implementation to retrieve asynchronously and process pages to the target system one by one.

Mule SDK pagingProvider interface

You can easily implement a custom connector using Mule SDK pagingProvider interface to simplify the page looping mechanism. Therefore, you can use this connector across multiple Mule apps easily. This approach is also helpful when you have complex interfaces to communicate with the system and fetch pages from that system.

Streaming supported modules in Mule use this pagingProvider interface under the hood to move objects between the modules. This is how object streaming works in Mule. If you implement this pagingProvider interface in a custom connector, you are streaming pages out of the connector. If the object is big enough, the next component after this connector in your Mule flow will retrieve the results page by page. In other words, your paged data is being streamed out of the connector.

Ready for more - explore webinars