Architecturally this is great. Introducing more infra in an ML stack is a huge pain in the short and long run and I really love that this doesn't do that (or at least, the new infra are things I understand).
In the past I've always opted for a feature store as a library that operates over an existing database/data warehouse/data lake in the offline case, and computing features on the fly in the online case. The internal feature cache for scaling an online service is nicely implemented here using Redis. Bravo, that's probably how I would do it too.
My one bit of feedback is the API. The code just doesn't look nice; right out of the gate there's a bunch of objects and methods I don't immediately understand the need for. I'm sure they're useful, but for starting out I'd expect a lot more from that interface. I'd suggest something higher-level that looks pretty and is easy to understand. That would be my one hesitation.
I'm Simba Khadder, Co-Founder & CEO of Featureform. I'm super stoked to be sharing our open-source feature store with you all. At my last company, we were building models that served close to 100M MAU. Most of our time was spent on feature engineering, using off-the-shelf model architectures. I remember having Google Docs full of useful SQL snippets that got shared around, and digging through my file system to find untitled_128.ipynb, which had a super useful transformation. We built Featureform so no one would ever have to deal with that again.
Featureform is a virtual feature store. It enables data scientists to define, manage, and serve their ML model's features. Featureform sits atop your existing infrastructure and orchestrates it to work like a traditional feature store.
By using Featureform, a data science team can solve the following organizational problems:
- Enhance Collaboration: Featureform ensures that transformations, features, labels, and training sets are defined in a standardized form, so they can easily be shared, re-used, and understood across the team.
- Organize Experimentation: The days of untitled_128.ipynb are over. Transformations, features, and training sets can be pushed from notebooks to a centralized feature repository with metadata like name, variant, lineage, and owner.
- Facilitate Deployment: Once a feature is ready to be deployed, Featureform will orchestrate your data infrastructure to make it ready in production. Using the Featureform API, you won't have to worry about the idiosyncrasies of your heterogeneous infrastructure (beyond their transformation language).
- Increase Reliability: Featureform enforces that all features, labels, and training sets are immutable. This allows them to safely be re-used among data scientists without worrying about logic changing. Furthermore, Featureform's orchestrator will handle retry logic and attempt to resolve other common distributed system problems automatically. Finally, Featureform will monitor and notify you of infrastructure problems and data drift.
- Preserve Compliance: With built-in role-based access control, audit logs, and dynamic serving rules, your compliance logic can be enforced directly by Featureform.
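To make the immutability point concrete, here is a minimal, hypothetical sketch (not Featureform's actual API) of a registry that refuses to redefine an existing (name, variant) pair, forcing changed logic into a new variant:

```python
# Hypothetical sketch of an immutable feature registry: once a
# (name, variant) pair is registered, its definition cannot change.
# New logic must be published under a new variant instead.

class ImmutableFeatureRegistry:
    def __init__(self):
        self._features = {}

    def register(self, name, variant, definition):
        key = (name, variant)
        if key in self._features:
            raise ValueError(
                f"{name}:{variant} is immutable; register a new variant instead"
            )
        self._features[key] = definition

    def get(self, name, variant):
        return self._features[(name, variant)]


registry = ImmutableFeatureRegistry()
registry.register("avg_txn_amount", "v1", "SELECT AVG(amount) FROM txns")

# Changing the logic requires a new variant; re-registering v1 would raise.
registry.register(
    "avg_txn_amount", "v2",
    "SELECT AVG(amount) FROM txns WHERE status = 'settled'",
)
```

The payoff is that a training set built against `avg_txn_amount:v1` keeps meaning the same thing forever, no matter who else is iterating on the feature.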
Seems like this can help with ensuring features used in training are exactly identical to features used in serving?
Recently I wanted to use “target encoding” of features with TF, but saw that TF does not have a “target encoding” layer that can be used in serving. Would this help with this scenario?
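For context: target encoding replaces a categorical value with a statistic of the label (usually the mean) computed over training data. Since that mapping is fit offline, serving needs the exact same table, which is the kind of artifact a feature store can hold. A minimal plain-Python sketch (not TF or Featureform code; the function name and smoothing parameter are illustrative):

```python
from collections import defaultdict

def fit_target_encoding(categories, targets, smoothing=0.0):
    """Map each category to the (optionally smoothed) mean of its targets."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    global_mean = sum(targets) / len(targets)
    return {
        c: (sums[c] + smoothing * global_mean) / (counts[c] + smoothing)
        for c in sums
    }

# Training time: fit the encoding once, then store it as a feature.
encoding = fit_target_encoding(["a", "a", "b"], [1.0, 0.0, 1.0])

# Serving time: look up the same precomputed value, guaranteeing
# train/serve consistency.
encoding["a"]   # 0.5
```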
A feature is an input to a machine learning model. You can think of a model as a black-box function that takes features and outputs a prediction:
prediction = model(features)
For example, if you're building a recommendation model at Spotify, you'll transform a stream of user listens into features like: user's top genre in the last 30 days.
Featureform orchestrates the transformations on your infrastructure, manages the metadata like versioning, and allows you to serve them for training and inference.
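As a concrete (purely illustrative) example of such a transformation, computing each user's top genre over a 30-day window from raw listen events might look like:

```python
from collections import Counter
from datetime import datetime, timedelta

def top_genre_per_user(listens, now, window_days=30):
    """listens: iterable of (user, genre, timestamp) tuples.
    Returns {user: most-listened genre within the window}."""
    cutoff = now - timedelta(days=window_days)
    per_user = {}
    for user, genre, ts in listens:
        if ts >= cutoff:
            per_user.setdefault(user, Counter())[genre] += 1
    return {user: counts.most_common(1)[0][0]
            for user, counts in per_user.items()}

now = datetime(2022, 6, 30)
listens = [
    ("u1", "jazz", datetime(2022, 6, 20)),
    ("u1", "jazz", datetime(2022, 6, 25)),
    ("u1", "rock", datetime(2022, 6, 26)),
    ("u1", "rock", datetime(2022, 1, 1)),   # outside the window, ignored
]
top_genre_per_user(listens, now)   # {"u1": "jazz"}
```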
A feature is a data point fed to your model.
It can be as simple as the amount of a transaction (for a fraud detection model) or as complex as the avg_number_of_transactions_in_the_past_7_days_with_over_1k_in_amount_that_were_pending_review.
Since you have many features, and they constantly change, evolve, and are consumed by many models, you need a way to store them - that's how feature stores came to be.
I've personally never used one.
Imagine you're Spotify and you have a stream of user-song-timestamp triplets, one per listen. You'll likely want to transform it into features such as: top genre per user in the last 30 days. As a data scientist, you'll write your transformations and run them yourself on something like Spark, then store the results in Redis for inference and S3 for training. You have to keep track of your versioning, jobs, and transformations. You also can't easily share them across data scientists.
Featureform's library allows you to define your transformations, features, and training sets. It will interface with Spark, Redis, etc. on your behalf to achieve your desired state. It'll also keep track of all the metadata for you and make everything easily shareable and re-usable.
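As an illustration of that workflow (a hypothetical, simplified decorator API; not Featureform's real interface), registering a transformation together with its metadata instead of leaving it in a notebook might look like:

```python
# Hypothetical sketch: a decorator that registers a transformation with
# metadata (name, variant, owner) in a shared registry.

REGISTRY = {}

def transformation(name, variant, owner):
    def wrap(fn):
        REGISTRY[(name, variant)] = {"fn": fn, "owner": owner}
        return fn
    return wrap

@transformation(name="top_genre_30d", variant="v1", owner="simba")
def top_genre(listens):
    # Dummy body for illustration; a real one would aggregate the
    # raw listen events per user.
    return {user: "jazz" for user, _, _ in listens}

# Anyone on the team can now discover and re-use it by name and variant.
REGISTRY[("top_genre_30d", "v1")]["owner"]   # "simba"
```

The point of the decorator shape is that the transformation's code and its metadata travel together, so lineage and ownership never live in a separate Google Doc.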
Looks like a nice lighter alternative to Feast. Feature stores still add a lot of complexity though, if you don’t have real time transformations they probably aren’t needed
Ours also has a local mode. The goal is to solve organizational problems as well, such as versioning. It solves the problem of having a Google Doc full of SQL snippets that people pass around, or of digging for a notebook called untitled_110.ipynb to copy some code from.
Yes, we essentially have three components: the serving layer, the orchestrator, and the metadata layer.
All three of these work across different infrastructure by design. We already have users who use Google Cloud services like BigQuery for experimentation and DynamoDB on AWS for serving in production.
The data mesh analogy is interesting. In a way, we're an applied form of data mesh, as opposed to a theoretical argument for it. By separating the abstraction/workflow layer from the data infra layer, you can put your data where it makes sense and access it all through Featureform's feature store abstraction.
The client libraries are all in Python. Much like Tensorflow, even though most of the heavy-lifting is done in a different language, the client feels like native Python. The internals of Featureform's deployed solution benefit from Go's native and lightweight networking and multithreading libraries.
Feast is a literal feature store: it exclusively stores features; it does not manage the transformations used to compute them. The pros and cons of Feast are most obvious when examining the process of changing a feature. It happens in three steps:
1. Write and run your new data transformation in your existing transformation pipeline. Note that this happens outside of Feast.
2. A new feature table must be created in Feast, since the old one cannot be directly overwritten. Once the new feature table is created, the transformation pipeline must be re-run to write all the features to the new table.
3. All the models that use this new feature should be updated to point at the new feature.
Feast also has other problems. For example, it can't copy your features from the offline to the online store; you have to download the features and upload them to the online store yourself using their CLI tool. You also have to manage retries and failures yourself.
Featureform treats the transformation lineage as part of the feature and orchestrates your infrastructure to create and change your features.
Hey! One of the maintainers of Feast here, mainly to make some friendly corrections on Feast’s functionality (e.g. Feast does in fact manage transformations) :)
Feast currently supports a couple of kinds of transformations: on-demand transformations and streaming transformations. We're adding batch transformations soon though (and have an RFC out already)!
I like to think that Feast's goal is to be more of a pluggable framework for platform teams to be able to build towards a platform like Tecton's (which fully orchestrates both batch and streaming transformations while abstracting complexity away from data scientists). We're being mindful of trying to keep things as simple as possible though, because our users have told us repeatedly they don't want to be forced to manage a complex system.
With moving features from the offline to online store (we call this materialization), users today can (and often quite successfully) use Feast’s CLI or SDK to trigger in memory materialization to the online store. We do have ongoing work to enable out of process materialization (e.g. using Ray, Spark, Bytewax, etc) that should be ready soon. Being able to manually trigger materialization via Airflow though has proven to be very useful for users in integrating with their existing workflows (such as triggering this when they detect changes to their raw data sources).
Simba’s correct though in calling out that a lot of the orchestration in Feast is left to the user. It hasn’t fully emerged as a key pain point users want addressed, but if that changes… well, we’re working to make our community happy :)