These questions are meant to be constructively critical, but not hyper-critical: I'm genuinely interested and a big fan of open-source projects in this space:
* In terms of a high-performance AI-focused S3 competitor, how does this compare to NVIDIA's AIstore? https://aistore.nvidia.com/
* What's the clustering story? Is it complex like Ceph, does it require K8s like AIstore for full functionality, or is it more flexible like Garage, Minio, etc.?
* You spend a lot of time talking about performance; do you have any benchmarks?
* Obviously most of the page was written by ChatGPT: what percentage of the code was written by AI, and has it been reviewed by a human?
* How does the object storage itself work? How is it architected? Do you DHT, for example? What tradeoffs are there (CAP, for example) vs the 1.4 gazillion alternatives?
* Are there any front-end or admin tools (and screenshots)?
* Can a cluster scale horizontally or only vertically (i.e. MinIO)?
* Why not instead just fork a previous version of Minio and then put a high-speed metadata layer on top?
* Is there any telemetry?
* Although it doesn't matter as much for my use case as for others, what is the specific jurisdiction of origin?
* Is there a CLA and does that CLA involve assigning rights like copyright (helps prevent the 'rug-pull' closing-source scenario)?
* Is there a non-profit Foundation, goal for CNCF sponsorship or other trusted third-party to ensure that the software remains open source (although forks of prior versions mostly mitigates that concern)?
I wonder if that's why it's all over the place. Meta engine written in Zig, okay, do I need to care? Gateway in Rust... probably a smart choice, but why do I need to be able to pick between web frameworks?
> Most object stores use LSM-trees (good for writes, variable read latency) or B+ trees (predictable reads, write amplification). We chose a radix tree because it naturally mirrors a filesystem hierarchy
Okay, so are radix trees good for writes, good for reads, bad for both, or somewhere in between?
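For the curious, the "mirrors a filesystem hierarchy" point is easy to see with a toy path-keyed trie (Python, purely illustrative; a real radix tree also collapses single-child chains, and Fractal ART is an on-disk structure): shared path prefixes are stored once, and listing a "directory" is just a prefix traversal.

```python
# Toy path-keyed trie. Radix trees additionally collapse single-child
# chains; this sketch only shows why directory listings become cheap
# subtree walks when keys are full paths.

class PathTrie:
    def __init__(self):
        self.root = {}

    def insert(self, path):
        node = self.root
        for part in path.strip("/").split("/"):
            node = node.setdefault(part, {})

    def list_prefix(self, prefix):
        """Return all keys under a directory-like prefix."""
        node = self.root
        parts = [p for p in prefix.strip("/").split("/") if p]
        for part in parts:
            if part not in node:
                return []
            node = node[part]
        results = []

        def walk(n, acc):
            if not n:  # empty dict == leaf == a stored object key
                results.append("/".join(acc))
            for name, child in n.items():
                walk(child, acc + [name])

        walk(node, parts)
        return sorted(results)

trie = PathTrie()
for key in ["data/train/a.bin", "data/train/b.bin", "data/val/c.bin"]:
    trie.insert(key)

print(trie.list_prefix("data/train"))  # -> ['data/train/a.bin', 'data/train/b.bin']
```

Nothing here says anything about the write/read tradeoff the parent asks about, but it does show why a (folder) rename can be a subtree operation rather than a key-by-key rewrite.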
A hybrid of physical logging (recording page-by-page changes) and logical logging (recording the operation performed, at the level of intent). Do both and it's apparently "physiological", which I imagine was first conceived of as "physio-logical".
I could only find references to this in database systems course notes, which may indicate something.
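For anyone else who went looking: a toy sketch of the distinction, with made-up record shapes (not any real engine's format). Physical records carry a whole-page after-image, logical records carry pure intent, and physiological records carry intent scoped to a single physically-named page:

```python
# Hypothetical log-record shapes. Physical: big but trivial to redo.
# Logical: tiny but redo must re-run the operation. Physiological:
# a logical operation scoped to one named page -- the common compromise.

def redo(record, pages):
    kind = record["kind"]
    if kind == "physical":
        # replay by overwriting the entire page with the logged after-image
        pages[record["page"]] = dict(record["after_image"])
    elif kind == "logical":
        # must re-resolve which page the key lives on (toy placement rule)
        page = len(record["key"]) % 4
        pages.setdefault(page, {})[record["key"]] = record["value"]
    elif kind == "physiological":
        # the page is named physically; the change inside it is logical
        pages.setdefault(record["page"], {})[record["key"]] = record["value"]

pages = {}
log = [
    {"kind": "physical", "page": 0, "after_image": {"k1": "v1"}},
    {"kind": "physiological", "page": 0, "key": "k2", "value": "v2"},
]
for rec in log:
    redo(rec, pages)

print(pages)  # -> {0: {'k1': 'v1', 'k2': 'v2'}}
```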
Yes. Zig is moving to a new async I/O model, and we'd like to wait for it to be more stable before open-sourcing our metadata engine. You can check the underlying runtime we're using for now: https://codeberg.org/thomas-fractalbits/iofthetiger
Thanks for the thoughtful and thorough questions! Really appreciate the constructive engagement. Let me address each one:
> NVIDIA's AIstore
AFAIK, AIstore is designed as a caching/proxy layer, while FractalBits is a ground-up object storage implementation with a custom metadata engine (Fractal ART, an on-disk adaptive radix tree). AIstore also seems to focus more on the training data pipeline (prefetch, shuffle, ETL) while we focus on storage-layer performance itself. One thing I am aware of is the (folder) rename operation, which would be a little tricky for a caching/proxy layer. I'd like to do more research for a detailed comparison and update our website. Thanks for mentioning this.
> What's the clustering story? Is it complex like Ceph, does it require K8s like AIstore for full functionality, or is it more flexible like Garage, Minio, etc.?
You can check our arch doc [1] for clustering. Right now we are focusing on the cloud (AWS) environment: you can simply type `just deploy create-vpc` to deploy, which calls CDK underneath to set up all the clustering pieces. You can also run low-level CDK commands for customization. We are also working on a K8s deployment for on-prem environments.
> You spend a lot of time talking about performance; do you have any benchmarks?
Yes, published in the README[2] with reproducible setup:
- GET: 982K IOPS, 3.8 GiB/s throughput, 2.9ms avg latency, 5.3ms p99
- PUT: 248K IOPS, 970 MiB/s throughput, 6.6ms avg latency
Configuration: 4KB objects, 14x c8g.xlarge (API), 6x i8g.2xlarge (BSS), 1x m7gd.4xlarge (NSS), 3-way data replication. Cost: ~$8/hour on-demand. You can reproduce this yourself with `just deploy create-vpc --template perf_demo --with-bench` and run the included benchmark scripts.
> Obviously most of the page was written by ChatGPT: what percentage of the code was written by AI, and has it been reviewed by a human?
Yeah, some blog content was generated by Gemini. For code, I hadn't used AI until this September, and you can verify that from the git commit history. The core Fractal ART engine and the io_uring integration are hand-crafted for performance.
I am also learning to work with AI (spec-driven). One thing I've noticed is that AI makes performance-related experiments much more efficient than before (framework switch from axum to actix-web, io_uring-based RPC, a Rust arena allocator, etc.). I use my customized nvim setup to review every line of code written by AI.
> How does the object storage itself work? How is it architected? Do you DHT, for example? What tradeoffs are there (CAP, for example) vs the 1.4 gazillion alternatives?
Architecture[1]: Not DHT-based. We use a centralized metadata service (NSS) with a custom on-disk Adaptive Radix Tree called Fractal ART. Key design choices:
- Full-path naming: Avoids distributed transactions for directory operations
- Quorum replication: N/R/W configurable (default 3-way data, 6-way metadata)
- Two-tier storage: <1MB objects on local NVMe, larger objects tier to S3 backend
CAP tradeoffs: We prioritize CP (consistency + partition tolerance). Strong consistency via quorum writes. HA for NSS is under testing.
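For what it's worth, the "strong consistency via quorum writes" claim rests on the standard quorum-overlap condition: with N replicas, W write acks, and R read acks, every read quorum intersects every write quorum whenever R + W > N. A tiny sketch of that reasoning (mine, not FractalBits code):

```python
# Quorum-overlap check: reads are guaranteed to see the latest
# acknowledged write only when any R replicas must share at least
# one member with any W replicas, i.e. R + W > N.

def is_strongly_consistent(n, r, w):
    return r + w > n

# Worst-case number of replicas any read and write quorum share:
def min_overlap(n, r, w):
    return max(0, r + w - n)

assert is_strongly_consistent(3, 2, 2)       # classic majority quorums
assert is_strongly_consistent(6, 4, 4)
assert not is_strongly_consistent(3, 1, 1)   # R=W=1 can miss the latest write

print(min_overlap(3, 2, 2))  # -> 1 replica guaranteed in common
```

The replication factors above (3-way data, 6-way metadata) are the defaults mentioned in the reply; the specific R/W values used here are illustrative, not FractalBits' actual configuration.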
> Are there any front-end or admin tools (and screenshots)?
Admin UI: Yes, we have a web-based UI (React/Ant Design) for bucket browsing, object management, and access key configuration. Currently basic but functional. Screenshots aren't published yet, but it's bundled with deployments. CLI tooling via standard AWS CLI works out of the box since we're S3-compatible.
> Can a cluster scale horizontally or only vertically (i.e. MinIO)?
Scaling: horizontal for data nodes (BSS): add more instances and data distributes across them via quorum over data volumes and metadata volumes. API servers are stateless and horizontally scalable. NSS (metadata) is currently single-node but designed for future horizontal sharding (split/fractal); on the roadmap: "support hundreds of NSS and BSS instances". This differs from MinIO's per-server erasure sets: we have proper quorum-based distribution.
> Why not instead just fork a previous version of Minio and then put a high-speed metadata layer on top?
Two reasons:
1. License: MinIO switched to AGPL, then to a more restrictive license. Apache 2.0 was important to us.
2. Architecture: We have a fundamentally different architecture that solves MinIO's inherent scalability issues [3]. Bolting that onto MinIO would be a rewrite anyway.
> Is there any telemetry?
Yes, but it's not fully polished (the CloudWatch setup is currently commented out). We use Rust crates for tracing & metrics. We're currently working on distributed tracing, and we have embedded a trace_id in all our RPC headers. Once it stabilizes, we'll update the docs.
> Although it doesn't matter as much for my use case as for others, what is the specific jurisdiction of origin?
The company (FractalBits Labs) is incorporated in the United States. The BYOC model means your data stays in your cloud account in your chosen AWS region - we never see your data.
> Is there a CLA and does that CLA involve assigning rights like copyright (helps prevent the 'rug-pull' closing-source scenario)?
CLA: Currently no CLA - contributions are under Apache 2.0 via the standard "inbound = outbound" model. We don't require copyright assignment. This makes a rug-pull harder since we can't relicense without contributor consent. Happy to discuss a more formal CLA if the community prefers one.
> Is there a non-profit Foundation, goal for CNCF sponsorship or other trusted third-party to ensure that the software remains open source (although forks of prior versions mostly mitigates that concern)?
Foundation/CNCF: Not yet, but it's something we're open to as the project matures (Apache/CNCF/LF AI & Data Foundation). For now, the Apache 2.0 license provides baseline protection: any prior version remains open source regardless of future decisions. We'd welcome community input on governance structure as adoption grows. On the other hand, I (with a partner) have been working on this full-time for a whole year, and would also welcome conversations with anyone interested in supporting the project's growth.
Appreciate the mention. Just to clarify: AIStore is a reliable storage cluster that can natively operate on both in-cluster and remote data, without treating either as a cache.
Yes, it _can_ be configured to act as a cache in front of any of the four supported Clouds but (a) we've never done so at NVIDIA, and (b) its primary function always was - and remains - reliable, resilient storage for AI/ML workloads.
(By the way: NVIDIA AIstore is NOT a proxying/caching engine, although it can act as one, which is somewhat unique among these types of stores. AIstore is actually a full S3 engine in its own right, and it's extremely capable, although live cluster resizing and ETL require k8s :( )
No, the AI is not a script kiddie. The AI is a tool which anyone from script kiddies to professionals and nation state actors can use as leverage to increase their ability to do damage and the number of systems they can do damage to.
> Expectations for pay seem to be very high even for people only just out of college.
Darn kids get off my lawn!
But yes, you're right. Salary expectations are surprisingly high considering how little fresh grads bring to the table!
We're talking huge amounts of training just to get to a basic level of competency, especially since most CS degrees are focused on things that were cool five years ago (or much, much longer).
This is the age-old tale, though, but now time is compressed and we need people to hit the ground running much, much faster.
CA's have a lot of management and logistical issues and potential for misuse. The simplicity and TOFU design of the SSH key system (which obviously can bring some issues of its own) offers a lot of benefits, especially for people who don't want to introduce a CA or PKI.
(Obligatory disclaimer: I work at Userify and we have a server-side product that automates SSH key management and distribution. For example, the CA design doesn't kick someone out once their access is removed, but Userify's shim actually terminates all of their sessions instantly, like screen or tmux, across all of the servers they're logged into, and removes (but retains for historical record) their home directory.)
> the CA design doesn't kick someone out once their access is removed, but Userify's shim actually terminates all of sessions instantly
I was a bit confused at first, I thought you were saying ssh certificates couldn't be revoked - but I see you're talking about signing the user out from existing sessions.
That is a fair point. I guess removing/locking a local user (in /etc/passwd, /etc/shadow) would typically leave any console logins alone too - unless other action is taken.
We've thought about porting Userify to work with CA's too but haven't had many requests for that for some reason, even though I'm sure many companies do have CA's set up alongside their other PKI for SSH.
People should generate at least one SSH key per client device. (On Userify, rotating your key is just a matter of pasting the new public key into your keybox in your dashboard.) One key per client device lets you revoke/rotate only that key when it's compromised. This also helps keep you from copying the private key somewhere else (which you should never do).
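One common way to wire this up is one key file per device plus a matching ~/.ssh/config entry. A sketch (the host pattern and file name below are made up):

```
# ~/.ssh/config on the laptop -- the desktop gets its own key file,
# so either device's key can be rotated or revoked independently
Host prod-*
    User deploy
    IdentityFile ~/.ssh/id_ed25519_laptop
    IdentitiesOnly yes
```

Generate each device's key locally (e.g. `ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519_laptop`) so the private key never has to leave the device.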
It does look like this wants to be a replacement for ssh-agent/ssh-add; also check out GNU keychain by Daniel Robbins, which is in most distro repos.
(blatant plug - we actually developed Userify for these three use cases, especially on cloud instances with constantly changing IP's)
Right. We usually recommend a single key per client device (laptop, desktop, etc), because that way you can rotate that key if it gets lost/stolen without changing your other devices as well. This way, those private keys stay totally local to the device and never actually need to move, which is much safer. (I work at Userify.)
Very nice for a first project! (the clipboard integration is a nice touch)
Seeing this passion is great! TBH, some of the negative comments here might be warranted: SSH is an especially important and tricky area to tackle as your first GitHub project or while inexperienced in security.
However, we're always looking for people at Userify to help us build the next wave of SSH UX and who aren't afraid to put something out there. It's a lot of work to get things built and it's very exciting to see some new ideas. Hit me up if you want to talk!
I replied below about our homegrown solution, but this is very cool! I wish it'd existed when we wrote Userify, but we'll definitely plan on it for our next app! Kudos for solving this so elegantly!