Feature stores are an essential part of the modern machine learning (ML) stack; in particular, they rationalize how the various roles involved in ML (from data engineers to data scientists) access the features associated with ML models. At its core, a feature store provides a layer atop a persistent data store (a database) that facilitates shared access to the features associated with the entities of a business domain, making it easy to retrieve them consistently for both training and prediction.
Out of several well-established feature stores available today, the most popular open-source solution is arguably Feast. With its active base of contributors and support for a growing list of backends to choose from, ML practitioners don’t have to worry about the boilerplate setup of their data system and can focus on delivering their product—all while retaining the freedom to choose the backend that best suits their needs.
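In Feast's Python SDK, for instance, online reads go through a single call to the store (`store.get_online_features(...)`). The toy sketch below mimics that access pattern with a plain in-memory dictionary; the entity and feature names are made up for illustration, and this is not Feast's actual implementation:

```python
# Toy illustration of the feature-store access pattern (not the real Feast
# API): the same lookup serves both training-set assembly and online inference.

class ToyOnlineStore:
    def __init__(self):
        # entity_key -> {feature_name: value}
        self._rows = {
            "driver_1001": {"avg_trips": 4.2, "rating": 4.9},
            "driver_1002": {"avg_trips": 1.7, "rating": 4.1},
        }

    def get_online_features(self, features, entity_rows):
        # Retrieve the requested features for each entity, consistently
        # with how they were written at materialization time.
        return [
            {f: self._rows[e].get(f) for f in features}
            for e in entity_rows
        ]

store = ToyOnlineStore()
print(store.get_online_features(["avg_trips"], ["driver_1001"]))
# [{'avg_trips': 4.2}]
```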
Last year, the Feast team published extensive benchmarks comparing the performance of the feature store when using different storage layers for retrieval of “online” features (that is, up-to-date reads to calculate inferences, as opposed to batch or historical “offline” reads). The storage backends used in the test, each powered by its own Feast plugin, were: Redis (running locally), Google Datastore, and Amazon DynamoDB—the latter on the same cloud region as the testing client. The main takeaways were:
- Redis yields the lowest response times (but at a cost; see below)
- Among the cloud DB vendors, DynamoDB is noticeably faster than Datastore
- Latencies increase with the number of features needed and, albeit less so, with the number of rows (“entities”)
Moreover, Feast offers SDKs for both Java and Python. Although choosing the Java stack for the feature server yields faster responses, the vast majority of Feast users work with a Python-centered stack, so our tests focus on end-to-end Python setups.
Surveys done by Feast also showed that more than 60% of the interviewees required that P99 latency stay below 50 ms. These ultra-low-latency ML use cases often fall in the fraud detection and recommender system categories.
Feature stores & Cassandra / Astra DB
The need for persistent data stores is ubiquitous in any ML application—and it comes in all sizes and shapes, of which the “feature store” pattern is but a certain, albeit very common, instance.
As is discussed at length in the Feast blog post, many factors influence the architectural choices for ML-based systems. Besides serving latency, there are considerations about fault tolerance and data redundancy, ease of use, pricing, ease of integration with the rest of the stack, and so forth. For example, in some cases, it may be convenient to employ an in-memory store such as Redis, trading data durability and ease of scaling for reduced response times.
In this recently published guide, the author highlights the fact that a feature store lies at the core of most ML-centered architectures, possibly (and, looking forward, increasingly) augmented with real-time capabilities owing to a combination of CDC (Change Data Capture), event-streaming technologies, and sometimes in-memory cache layers. The guide makes the case that Cassandra and DataStax's cloud-based DBaaS Astra DB (which is built on Cassandra) are great databases to build a feature store on top of, owing to the world-class fault tolerance, 100% uptime, and extremely low latencies they offer out of the box. As a side note, Astra DB offers a generous free tier that could well sustain even a moderate production workload (try it out!).
In this spirit, last year, we developed a plugin for Feast that enables usage of Cassandra and Astra DB as the database backend for the online feature store. Shortly thereafter, the plugin was accepted and included in the Feast codebase.
We then set out to extend the performance measurements to Astra DB, with the intent to provide hard data corroborating our claim that Cassandra and Astra DB are performant first-class choices for an online feature store. In other words, once the plugin made its way to Feast, we took the next logical step: running the very same testing already done for the other DBaaS choice, but this time on Astra DB. The next section reports on our findings.
Performance benchmarks for Feast on Astra DB
The Feast team published a GitHub repository with the code used for the benchmarks. We added coverage for Astra DB (plus a one-node Cassandra cluster running locally, serving the purpose of a functional test) and upgraded all benchmarks to consistently use Feast v0.26.
Note: The original tests used v0.20 for DynamoDB, v0.17 for Datastore and v0.21 for Redis. Because we reproduced all pre-existing benchmarks, finding most values to be in acceptable agreement (see below for more remarks on this point), we are confident that upgrading the Feast version does not significantly alter the performance.
The tests have been run on comparable AWS and GCP machines (respectively c5.4xlarge and c2-standard-16 instances) running in the same region as the cloud database (thereby mimicking the desired architecture for a production system). We did not change any benchmark parameters, in order to keep the comparison with prior results meaningful. As stated earlier, we focused on the Python feature server, which has a wider adoption among the Feast community and supports a broader ecosystem of plugins.
Here’s how we conducted the benchmarking. First, a moderate amount of synthetic “feature data” (10k entities with 250 integer features each, for a total of about 11 MB) was materialized to the online store. Then various one-minute test runs were performed, each with a certain choice of feature-retrieval parameters, all while collecting statistics (in particular, high percentiles) on the response time of these retrieval operations. The parameters that varied between runs were:
- batch size (1 to 100 entities per request)
- number of requested features (50 to 250 features per entity)
- requests per second (“RPS”, 10 to 100)
To better support concurrent requests, in all tests, 16 instances of the feature server were spawned (in as many containers running on the benchmark machine) and were queried in a round-robin fashion. For the results, the main focus has been on the P99 latency percentiles (other percentiles show qualitatively similar behavior). Also, as the findings do not exhibit significant dependency on the RPS, we will present results just for fixed RPS=10, in line with the methodology of the first Feast performance article.
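Concretely, each run boils down to collecting per-request response times and extracting high percentiles from them. A minimal sketch of that computation follows; since no real server is involved here, the latencies are drawn from an assumed synthetic (lognormal) distribution purely for illustration:

```python
import random

def p99(latencies):
    # Nearest-rank P99: the value below which ~99% of the samples fall.
    ordered = sorted(latencies)
    return ordered[min(len(ordered) - 1, int(0.99 * len(ordered)))]

# A simulated one-minute run at RPS=10 yields 600 response-time samples;
# the lognormal latency distribution is an assumption for illustration.
random.seed(42)
samples = [random.lognormvariate(5.0, 0.3) for _ in range(600)]
print(f"P99 over {len(samples)} samples: {p99(samples):.1f} ms")
```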
A remark is in order: one might be tempted to interpret “entities” and “features” as “database rows and columns,” respectively. However, this isn’t exactly right because each online store has a different way to map these abstractions to its actual table structure. On Cassandra (and Astra DB), for instance, each entity maps to a whole table partition, with one feature per database row—a choice that makes it very fast to retrieve many features for a given entity with a single query, thanks to the physical proximity of all data belonging to a single Cassandra partition.
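The practical consequence of this layout can be sketched in a few lines. The schema below is illustrative (not the plugin's actual DDL), but it captures the point: with a primary key shaped like `PRIMARY KEY ((entity_key), feature_name)`, the cost of a read is driven by the number of partitions touched, not by the number of feature rows scanned within each partition:

```python
# Illustrative cost model for the entity-per-partition, feature-per-row
# layout (schema sketch, not the plugin's actual DDL):
#
#   PRIMARY KEY ((entity_key), feature_name)

def read_cost(num_entities, num_features):
    # One single-partition query per entity; within a partition, all
    # feature rows are physically contiguous and cheap to scan.
    partitions_touched = num_entities
    rows_read = num_entities * num_features
    return partitions_touched, rows_read

# Many features of one entity: a single query.
print(read_cost(num_entities=1, num_features=250))   # (1, 250)
# Few features of many entities: one query per entity.
print(read_cost(num_entities=100, num_features=50))  # (100, 5000)
```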
As mentioned above, we reproduced the findings of the first Feast article, and then we added the data series pertaining to Astra DB. Let’s first look at the P99 response times, in milliseconds, as a function of the number of requested features, with a fixed batch size of 1 and RPS=10.
| Number of features | 50 | 100 | 150 | 200 | 250 |
|---|---|---|---|---|---|
| Astra DB | 152 ms | 134 ms | 204 ms | 201 ms | 227 ms |
| Datastore | 158 ms | 398 ms | 503 ms | 358 ms | 728 ms |
| DynamoDB | 133 ms | 167 ms | 202 ms | 241 ms | 316 ms |
| Redis | 62 ms | 104 ms | 80 ms | 95 ms | 147 ms |
Now let’s examine the behavior of the P99 response times (milliseconds) as a function of the batch size, keeping 50 features per entity and RPS=10 fixed:
The main takeaways from the above results are as follows:
- As already observed in the analysis by the Feast team, Datastore has noticeably higher response times in essentially all conditions.
- The novel finding is that, within the fluctuations observed, Astra DB and DynamoDB perform in a comparable way.
- Redis (the only examined backend that, far from being a cloud DBaaS, is an in-memory store running on the same client machine) is definitely faster across the whole spectrum of conditions. As argued elsewhere, choosing Redis is a matter of trading resilience and durability for speed (with all the associated pros and cons), plus, of course, there’s only so much data that can fit into memory.
- Still, compared to the Redis data in the original article, the values we find here match only some of the raw benchmark logs found in the repository, rather than the (noticeably lower) response times quoted in the text; the reason for this discrepancy is unclear.
- Closer inspection reveals that Astra DB fares marginally better when many features for few entities are involved, and slightly worse in the opposite case of a moderate number of features for many entities. Based on knowledge of how the features are mapped to the partitioned-table physical storage model employed by Cassandra (an entity is a partition and each feature is a row in it), we can translate this finding as follows: Cassandra is optimized to offer very fast retrieval of even many rows in the same partition (a single query for the DB layer), more so than for a scattered read from several partitions at once (which, in the plugin, is achieved with as many read requests, albeit with massive parallelization).
- The results for Datastore show a noticeable non-monotonicity as the independent variable grows. This might signal either that one-minute runs are barely sufficient for a statistically robust determination of P99 percentiles, or that Datastore is inherently more prone to random fluctuations in its performance (due, for example, to intensive background maintenance tasks triggering now and then). In the former case, with most tests being run at RPS=10 and for the duration of one minute, we are effectively estimating the P99 from 600 collected values, a methodology likely to introduce rather substantial fluctuations for purely statistical reasons.
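The statistical point can be made concrete with a quick simulation: drawing many independent "runs" of 600 samples from one fixed latency distribution (an assumed lognormal, chosen only for illustration) and computing the nearest-rank P99 of each shows how much the estimate wobbles even when nothing changes in the underlying system:

```python
import random

def p99(samples):
    # Nearest-rank 99th percentile.
    ordered = sorted(samples)
    return ordered[min(len(ordered) - 1, int(0.99 * len(ordered)))]

random.seed(0)
estimates = []
for _ in range(200):
    # One simulated "one-minute run at RPS=10": 600 latency samples
    # drawn from the same fixed lognormal distribution every time.
    run = [random.lognormvariate(5.0, 0.3) for _ in range(600)]
    estimates.append(p99(run))

mean = sum(estimates) / len(estimates)
spread = (max(estimates) - min(estimates)) / mean
print(f"P99 estimates range over ~{spread:.0%} of their mean "
      f"across identical simulated runs")
```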
When designing a feature store such as Feast, there is an inherent tradeoff between flexibility and performance: one must choose whether to favor interoperability (all backends implementing the same API in a clean, maintainable way) or performance (for example, tailoring the feature store to exploit patterns specific to a particular DB engine). Usually, these two choices are the opposite ends of a spectrum.
In most cases, there are very valid reasons to prefer interoperability/modularity, including:
- the ability to switch between different backend databases for Feast
- the fact that one does not need to learn a new API or DSL to start working with the store
- a cleaner separation of concerns
- the additional abstractions making it possible to add and remove features and define “views” on these
- the ease of starting right away with a ready-to-use feature store (not a small advantage in itself)
While this is often a reasonable choice that already meets the application’s latency requirements, there are cases where performance has to be pushed to its limits at the cost of tailoring the whole of the feature store to a specific database engine.
In this spirit, probing the other end of the spectrum (i.e., opposite to the “Feast+Astra DB” solution), we investigated an equivalent formulation for a feature store tailored to how Cassandra works. One possible implementation has each entity corresponding to a single row, with the features being as many columns of that row (remember that Cassandra employs a row-oriented storage model, not a column-oriented one). We’ll summarize those findings in an upcoming blog post. As detailed in the table below, looking at latencies for comparable setups in the two feature store implementations, the gain in performance is massive.
| | N. features | RPS | Batch size | Response time P99 [ms] |
|---|---|---|---|---|
| Direct to Cassandra (Astra DB) | 70 | 5000 | 1 | 10.1 |
We reiterate that this choice comes at a cost: this kind of latency reduction requires committing to a certain feature set at table-design time (to avoid later schema changes) and developing a custom feature store on top of the DB access layer. Still, for a certain subset of use cases, this is arguably the most reasonable thing to do.
Plugin implementation choices
Let’s go back to the Cassandra plugin for Feast and examine some properties of how it was structured.
First, one might notice that, regardless of which features are requested at runtime, the whole partition (i.e., all features for a given entity) is read. This was chosen to avoid using IN clauses when querying Cassandra; these are indeed discouraged unless the number of values is very, very small (as a rule of thumb, less than half a dozen). Moreover, since one does not know at write-time which features will be read together, there is no preferred way to arrange the clustering column(s) to have these features grouped together in the partition (as done, for example, with Facebook’s “feature re-ordering” which purportedly results in a 30%-70% latency reduction). A reasonable compromise was then to always read the whole partition and apply client-side post-query filtering to avoid burdening the query coordinators with additional work—at the cost, of course, of increased network throughput.
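This read-then-filter step can be sketched in a few lines (the names are illustrative, not the plugin's actual code):

```python
# Client-side post-query filtering: the whole partition (all features of
# one entity) is read with a single full-partition SELECT (no IN clause),
# then trimmed down to the requested feature names in Python.

def read_entity_features(partition_rows, requested_features):
    # partition_rows: all (feature_name, value) rows of one entity's
    # partition, as returned by the full-partition query.
    wanted = set(requested_features)
    return {name: value for name, value in partition_rows if name in wanted}

partition = [("feat_%d" % i, i * 1.0) for i in range(250)]
print(read_entity_features(partition, ["feat_3", "feat_7"]))
# {'feat_3': 3.0, 'feat_7': 7.0}
```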
Second, when features from multiple entities are needed, the plugin makes good use of the execute_concurrently_with_args primitive offered by the Cassandra Python driver, thereby spawning one thread per partition and firing all requests at once (up to a maximum concurrency threshold, which can be configured). This leverages the excellent support for concurrency by the Cassandra architecture, which accounts for the observed moderate dependency of latencies on the batch size. (Side note: Changing your code to using this multiple-read paradigm is often the quick one-line change that can reduce the time needed for long Cassandra I/O operations by even a couple of orders of magnitude!)
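The driver primitive (`cassandra.concurrent.execute_concurrently_with_args`) needs a live cluster to demonstrate, but the pattern it implements can be mimicked with the standard library: one single-partition read per entity, fired in parallel under a concurrency cap. In the sketch below, the query function is a stub and all names are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_partition(entity_key):
    # Stand-in for a single-partition SELECT against the database.
    return {"entity": entity_key, "features": {}}

def fetch_many(entity_keys, max_concurrency=100):
    # Fire one read per entity in parallel, capped at max_concurrency,
    # returning results in the same order as the input keys.
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(fetch_partition, entity_keys))

results = fetch_many(["entity_%d" % i for i in range(10)], max_concurrency=16)
print(len(results))  # 10
```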
We put the Cassandra plugin for Feast to the test in the same way as the other DBaaS plugins were tested; that is, using the Astra DB cloud database built on Cassandra, and running the same benchmarks that were applied to Redis, Datastore, and DynamoDB.
Besides broadly confirming the previous results published by the Feast team, our main finding is that the performance with Astra DB is on par with that of AWS DynamoDB and noticeably better than that of Google Datastore.
All these tests target the Python implementation. As mentioned in the Feast article, switching to a Java feature server greatly improves the performance, but requires a more convoluted architecture and overall more expertise, both for setup and for maintenance.
Other evidence points to the fact that, if one is mainly concerned about performance, replacing any feature store with a direct-to-DB implementation may be the best choice. In this regard, our extensive investigations clearly make the case that Cassandra is a good fit for ML applications, regardless of whether a feature store is involved or not.
Some results might be made statistically stronger by more extensive tests, which could be a task for a future iteration of these performance benchmarks. It is possible that longer runs and/or much larger amounts of stored data would better highlight the underlying patterns in how the response times behave as a function of batch size and/or number of requested features.
The author would like to thank Alan Ho, Scott Regan, and Jonathan Shook for a critical reading of this manuscript, and the Feast team for a pleasant and fruitful collaboration around the development (first) and the benchmarking (afterwards) of the Cassandra / Astra DB plugin for the namesake feature store.