Architecturally this is great. Introducing more infra in an ML stack is a huge pain in the short and long run and I really love that this doesn't do that (or at least, the new infra are things I understand).
In the past I've always opted for a feature store as a library that operates over an existing database/data warehouse/data lake in the offline case, and computing features on the fly in the online case. The internal feature cache for scaling an online service is nicely implemented here using Redis. Bravo, that's probably how I would do it too.
My one bit of feedback is the API. The code just doesn't look nice; right out of the gate there's a bunch of objects and methods I don't immediately understand the need for. I'm sure they're useful, but for starting out I'd expect a lot more from that interface. I'd suggest something higher-level that looks pretty and is easy to understand. That would be my one hesitation.
I'm Simba Khadder, Co-Founder & CEO of Featureform. I'm super stoked to be sharing our open-source feature store with you all. At my last company, we were building models that served close to 100M MAU. Most of our time was spent on feature engineering, using off-the-shelf model architectures. I remember having Google Docs full of useful SQL snippets that got shared around, and digging through my file system to find untitled_128.ipynb, which had a super useful transformation. We built Featureform so no one would ever have to deal with that again.
Featureform is a virtual feature store. It enables data scientists to define, manage, and serve their ML model's features. Featureform sits atop your existing infrastructure and orchestrates it to work like a traditional feature store.
By using Featureform, a data science team can solve the following organizational problems:
- Enhance Collaboration: Featureform ensures that transformations, features, labels, and training sets are defined in a standardized form, so they can easily be shared, re-used, and understood across the team.
- Organize Experimentation: The days of untitled_128.ipynb are over. Transformations, features, and training sets can be pushed from notebooks to a centralized feature repository with metadata like name, variant, lineage, and owner.
- Facilitate Deployment: Once a feature is ready to be deployed, Featureform will orchestrate your data infrastructure to make it ready in production. Using the Featureform API, you won't have to worry about the idiosyncrasies of your heterogeneous infrastructure (beyond their transformation language).
- Increase Reliability: Featureform enforces that all features, labels, and training sets are immutable. This allows them to safely be re-used among data scientists without worrying about logic changing. Furthermore, Featureform's orchestrator will handle retry logic and attempt to resolve other common distributed system problems automatically. Finally, Featureform will monitor and notify you of infrastructure problems and data drift.
- Preserve Compliance: With built-in role-based access control, audit logs, and dynamic serving rules, your compliance logic can be enforced directly by Featureform.
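To make the immutability point concrete, here is a minimal, hypothetical sketch (not Featureform's actual API) of a registry that refuses to redefine an existing (name, variant) pair, forcing changed logic into a new variant:

```python
# Hypothetical sketch of an immutable feature registry: once a
# (name, variant) pair is registered, its definition cannot change.
# New logic must be published under a new variant instead.

class ImmutableFeatureRegistry:
    def __init__(self):
        self._features = {}

    def register(self, name, variant, definition):
        key = (name, variant)
        if key in self._features:
            raise ValueError(
                f"{name}:{variant} is immutable; register a new variant instead"
            )
        self._features[key] = definition

    def get(self, name, variant):
        return self._features[(name, variant)]


registry = ImmutableFeatureRegistry()
registry.register("avg_txn_amount", "v1", "SELECT AVG(amount) FROM txns")

# Changing the logic requires a new variant; re-registering v1 would raise.
registry.register(
    "avg_txn_amount", "v2",
    "SELECT AVG(amount) FROM txns WHERE status = 'settled'",
)
```

The payoff is that a training set built against `avg_txn_amount:v1` keeps meaning the same thing forever, no matter who else is iterating on the feature.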
Seems like this can help with ensuring features used in training are exactly identical to features used in serving?
Recently I wanted to use “target encoding” of features with TF, but saw that TF does not have a “target encoding” layer that can be used in serving. Would this help with this scenario?
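For context: target encoding replaces a categorical value with a statistic of the label (usually the mean) computed over training data. Since that mapping is fit offline, serving needs the exact same table, which is the kind of artifact a feature store can hold. A minimal plain-Python sketch (not TF or Featureform code; the function name and smoothing parameter are illustrative):

```python
from collections import defaultdict

def fit_target_encoding(categories, targets, smoothing=0.0):
    """Map each category to the (optionally smoothed) mean of its targets."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    global_mean = sum(targets) / len(targets)
    return {
        c: (sums[c] + smoothing * global_mean) / (counts[c] + smoothing)
        for c in sums
    }

# Training time: fit the encoding once, then store it as a feature.
encoding = fit_target_encoding(["a", "a", "b"], [1.0, 0.0, 1.0])

# Serving time: look up the same precomputed value, guaranteeing
# train/serve consistency.
encoding["a"]   # 0.5
```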
A feature is an input to a machine learning model. You can think of a model as a black-box function that takes features and outputs a prediction:
prediction = model(features)
For example, if you're building a recommendation model at Spotify, you'll transform a stream of user listens into features like: user's top genre in the last 30 days.
Featureform orchestrates the transformations on your infrastructure, manages the metadata like versioning, and allows you to serve them for training and inference.
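As a concrete (purely illustrative) example of such a transformation, computing each user's top genre over a 30-day window from raw listen events might look like:

```python
from collections import Counter
from datetime import datetime, timedelta

def top_genre_per_user(listens, now, window_days=30):
    """listens: iterable of (user, genre, timestamp) tuples.
    Returns {user: most-listened genre within the window}."""
    cutoff = now - timedelta(days=window_days)
    per_user = {}
    for user, genre, ts in listens:
        if ts >= cutoff:
            per_user.setdefault(user, Counter())[genre] += 1
    return {user: counts.most_common(1)[0][0]
            for user, counts in per_user.items()}

now = datetime(2022, 6, 30)
listens = [
    ("u1", "jazz", datetime(2022, 6, 20)),
    ("u1", "jazz", datetime(2022, 6, 25)),
    ("u1", "rock", datetime(2022, 6, 26)),
    ("u1", "rock", datetime(2022, 1, 1)),   # outside the window, ignored
]
top_genre_per_user(listens, now)   # {"u1": "jazz"}
```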
A feature is a data point fed to your model.
It can be as simple as the amount of a transaction (for a fraud detection model) or as complex as the avg_number_of_transactions_in_the_past_7_days_with_over_1k_in_amount_that_were_pending_review.
Since you have many features, and they constantly change, evolve, and are consumed by many models, you need a way to store them - that's how feature stores came to be.
I've personally never used one.
Imagine you're Spotify and you have a stream of user-song-timestamp triplets, one per listen. You'll likely want to transform it into features such as: top genre per user in the last 30 days. As a data scientist, you'll write your transformations and run them yourself on something like Spark, then store the results in Redis for inference and S3 for training. You have to keep track of your versioning, jobs, and transformations. You also can't easily share them across data scientists.
Featureform's library allows you to define your transformations, features, and training sets. It will interface with Spark, Redis, etc. on your behalf to achieve your desired state. It'll also keep track of all the metadata for you and make everything easily shareable and re-usable.
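As an illustration of that workflow (a hypothetical, simplified decorator API; not Featureform's real interface), registering a transformation together with its metadata instead of leaving it in a notebook might look like:

```python
# Hypothetical sketch: a decorator that registers a transformation with
# metadata (name, variant, owner) in a shared registry.

REGISTRY = {}

def transformation(name, variant, owner):
    def wrap(fn):
        REGISTRY[(name, variant)] = {"fn": fn, "owner": owner}
        return fn
    return wrap

@transformation(name="top_genre_30d", variant="v1", owner="simba")
def top_genre(listens):
    # Dummy body for illustration; a real one would aggregate the
    # raw listen events per user.
    return {user: "jazz" for user, _, _ in listens}

# Anyone on the team can now discover and re-use it by name and variant.
REGISTRY[("top_genre_30d", "v1")]["owner"]   # "simba"
```

The point of the decorator shape is that the transformation's code and its metadata travel together, so lineage and ownership never live in a separate Google Doc.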
Looks like a nice lighter alternative to Feast. Feature stores still add a lot of complexity though, if you don’t have real time transformations they probably aren’t needed
Ours also has a local mode. The goal is to solve organizational problems as well, such as versioning. It solves the problem of having a Google Doc full of SQL snippets that people pass around, or of digging for a notebook called untitled_110.ipynb to copy some code from.
Yes, we essentially have three components: the serving layer, the orchestrator, and the metadata layer.
All three of these work across different infrastructure by design. We already have users who use Google Cloud services like BigQuery for experimentation and DynamoDB on AWS for serving in production.
The data mesh analogy is interesting. In a way, we're an applied form of data mesh, as opposed to a theoretical argument for it. By separating the abstraction/workflow layer from the data infra layer, you can put your data where it makes sense and access it all through Featureform's feature store abstraction.
The client libraries are all in Python. Much like Tensorflow, even though most of the heavy-lifting is done in a different language, the client feels like native Python. The internals of Featureform's deployed solution benefit from Go's native and lightweight networking and multithreading libraries.
Feast is a literal feature store: it exclusively stores features; it does not manage the transformations used to compute them. The pros and cons of Feast are most obvious when examining the process of changing a feature. It happens in three steps:
1. Write and run your new data transformation in your existing transformation pipeline. Note that this happens outside of Feast.
2. A new feature table must be created in Feast, since the old one cannot be directly overwritten. Once the new feature table is created, the transformation pipeline must be re-run to write all the features to the new table.
3. All the models that use this new feature should be updated to point at the new feature.
Feast also has other problems. For example, it can't copy your features from the offline to the online store; you have to download the features and upload them to the online store yourself using their CLI tool. You also have to manage retries and failures yourself.
Featureform treats the transformation lineage as part of the feature and orchestrates your infrastructure to create and change your features.
Hey! One of the maintainers of Feast here, mainly to make some friendly corrections on Feast’s functionality (e.g. Feast does in fact manage transformations) :)
Feast currently supports a couple of kinds of transformations: on-demand transformations and streaming transformations. We're adding batch transformations soon though (and have an RFC out already)!
I like to think that Feast's goal is to be more of a pluggable framework for platform teams to be able to build towards a platform like Tecton's (which fully orchestrates both batch and streaming transformations while abstracting complexity away from data scientists). We're being mindful of trying to keep things as simple as possible though, because our users have told us repeatedly they don't want to be forced to manage a complex system.
With moving features from the offline to online store (we call this materialization), users today can (and often quite successfully) use Feast’s CLI or SDK to trigger in memory materialization to the online store. We do have ongoing work to enable out of process materialization (e.g. using Ray, Spark, Bytewax, etc) that should be ready soon. Being able to manually trigger materialization via Airflow though has proven to be very useful for users in integrating with their existing workflows (such as triggering this when they detect changes to their raw data sources).
Simba’s correct though in calling out that a lot of the orchestration in Feast is left to the user. It hasn’t fully emerged as a key pain point users want addressed, but if that changes… well, we’re working to make our community happy :)