This showdown benchmark was and still is great, but an enormous grain of salt should be applied to the results of any model that was released after the showdown benchmark itself.
Maybe everyone has a different dose of skepticism. Personally, I'm not even looking at results for models that were released after the benchmark; for all this tells us, they might as well be one-trick ponies that only do well on the benchmark.
It might be too much work, but one possible "correct" approach for this kind of benchmark would be to periodically release new benchmarks with new tests (broadly in the same categories) and only include models that predate each benchmark.
Yeah that’s a classic problem, and it's why good tests are such closely guarded secrets: to keep them from becoming training fodder for the next generation of models. Regarding the "model date" vs "benchmark date" - that's an interesting point... I'll definitely look into it!
I don't have any captcha systems in place, but I wonder if it might be worth putting up at least a few nominal roadblocks (such as Anubis [1]) to at least slow down the scrapers.
A few weeks ago I actually added some new, more challenging tests to the GenAI Text-to-Image section of the site (the “angelic forge” and “overcrowded flat earth”) just to keep pace with the latest SOTA models.
In the next few weeks, I'll be adding some new benchmarks to the Image Editing section as well.
The Blender previz reskin task [1] could be automated! New test cases could be randomly and procedurally generated (without AI).
Generate a novel previz scene programmatically in Blender or some 3D engine, then task the image model with rendering it in a style (or to style-transfer to a given image, e.g. something novel and unseen from Midjourney). Another test would be to replace stand-in mannequins with the identities of characters in reference images and make sure the poses and set blocking match.
Throw in a 250 object asset pack and some skeletal meshes that can conform to novel poses, and you've got a fairly robust test framework.
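Just to sketch the procedural generation step (a rough Blender/bpy sketch; the object counts, camera placement, and output path are all made up for illustration):

    # Sketch: procedurally generate a previz scene in Blender (bpy) and render it.
    # Counts, seed, camera placement, and paths are illustrative, not a real harness.
    import random
    import bpy

    random.seed(42)  # fixed seed so a given test case can be regenerated exactly

    # Scatter a handful of primitive stand-in objects around the origin.
    for i in range(12):
        x, y = random.uniform(-5, 5), random.uniform(-5, 5)
        if random.random() < 0.5:
            bpy.ops.mesh.primitive_cube_add(location=(x, y, 1))
        else:
            bpy.ops.mesh.primitive_uv_sphere_add(location=(x, y, 1))

    # Simple camera looking down at the scene.
    bpy.ops.object.camera_add(location=(0, -15, 8), rotation=(1.1, 0, 0))
    bpy.context.scene.camera = bpy.context.object

    # Render the previz frame that the image model will later be asked to restyle.
    bpy.context.scene.render.filepath = "/tmp/previz_case_042.png"
    bpy.ops.render.render(write_still=True)

Because the seed pins down the whole scene, each test case can be regenerated exactly, and a fresh unseen batch is cheap to produce whenever a new model ships.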
Furthermore, anything that succeeds from the previz rendering task can then be fed into another company's model and given a normal editing task, making it doubly useful for two entirely separate benchmarks. That is, successful previz generations can be reused as image edit test cases - and you a priori know the subject matter without needing to label a bunch of images or run a VLM, so you can create a large set of unseen tests.
You don't need skepticism, because even if you're acting in 100% good faith and building a new model, what's the first thing you're going to do? You're going to go look up as many benchmarks as you can find and see how it does on them. It gives you some easy feedback relative to your peers. The fact that your own model may end up being put up against these exact tests is just icing.
So I don't think there's even a question of whether or not newer models are going to be maximizing for benchmarks - they 100% are. The skepticism would be in how it's done. If something's not being run locally, then there's an endless array of ways to cheat - like dynamically loading certain LoRAs in response to certain queries, with some LoRAs trained precisely to maximize benchmark performance. Basically taking a page out of the car company playbook in response to emissions testing.
But I think maximizing the general model itself to perform well on benchmarks isn't really unethical or cheating at all. All you're really doing there is 'outsourcing' part of your quality control tests. But it simultaneously greatly devalues any benchmark, because that benchmark is now the goal.
I think training image models to pass these very specific tests correctly will be very difficult for any of these companies. How would they even do that?
Hire a professional Photoshop artist to manually create the "correct" images and then put the before and after photos into the training data. Or however they've been training these models thus far, I don't know.
And if that still doesn't get you there, hash the image inputs to detect if it's one of these test photos and then run your special test-passer algo.
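The detection step really is that cheap, which is part of why it's hard to rule out from the outside. A toy sketch of that routing (the hash list and both handlers are made up):

    # Toy sketch of the "special test-passer" routing: hash incoming images and
    # check against known benchmark inputs. The digest and handlers are made up.
    import hashlib

    KNOWN_BENCHMARK_HASHES = {
        "0" * 64,  # placeholder digest, not a real test image
    }

    def run_special_test_passer(image_bytes: bytes) -> str:
        return "hand-tuned output"      # stand-in for the cheat path

    def run_normal_model(image_bytes: bytes) -> str:
        return "ordinary model output"  # stand-in for the honest path

    def handle_request(image_bytes: bytes) -> str:
        digest = hashlib.sha256(image_bytes).hexdigest()
        if digest in KNOWN_BENCHMARK_HASHES:
            return run_special_test_passer(image_bytes)
        return run_normal_model(image_bytes)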
I'm sure there's a way for them to give these tests enough weight if they really cared. I don't think they should or would, but they could stuff the training data with thousands of slight variations if they wanted to, or manually give them more importance. This might adversely affect everything else, but that's another story.
Well, that was way cooler than the title makes it sound. I would rephrase it to "abstract algorithmic art generator" or something, otherwise it sounds like yet another prompt-to-slop app. Could really use an accompanying readme/blog post/description of the algorithms as well, I'm curious about how it works!
Personally, I would change the article to anonymize the actual plugin that was cracked. The plugin author seems to be a solo dev/musician, actually more a musician than a developer, which might explain the poorly implemented copy protection*. But they're good at crafting sounds, and that's what they're selling. Or trying to sell. Or taking donations for, by the way: https://ko-fi.com/bassbullyvst
* I highly doubt it was deliberate as some others are suggesting.
The language certainly looks nice! Is it open source? I think it makes sense for this kind of tool, since it's inherently "hackery". I mean people who want to write music with code also probably want the ability to understand and modify any part of the stack, it's the nature of the audience.
Thanks! I tried to make it as familiar as possible, inspired by JS. It's not yet open-source, mainly because the source is a bit of a mess, but it will be once I tidy things up. Follow me on GitHub[0] for updates. Also that sounds to me like Tech-House/Electro-House :D Very nice!
Both statements are true. We have a strong tendency for integer ratios in harmony, and just intonation often sounds out of tune.
Integer ratios are the base upon which harmony is built. Temperament is a subtle modification that sounds very close to integer ratios, but allows more complex harmonic structures where dissonance is evenly spread out across all the relationships between tones.
Way off what? Complex ratios are likely to be heard as out-of-tune simple ratios; that's why they sound off. This is a concept sometimes called tolerance in music cognition. Note that by "complexity" and "simplicity" I'm referring to harmonic distance here.
A 7/4 ratio should be simple, but it'll sound out of tune (over 30 cents off from the equal-tempered minor seventh) in a normal context. Many BP intervals are just as simple, and they'll sound very out of tune to people unused to them.
7/4 happens to be approximately 30 cents away from 16/9. It's hard to tell what's "simple" when looking at fractions, but 16/9 is indeed simple: divide by 3 twice and adjust the octave. If we assume octave equivalence, that means one step in the "7" direction is perceived as more complex than 2 steps in the "3" direction, so the second interpretation wins, but is perceived as out-of-tune.
That said, we're trying to isolate things that are typically not isolated. If you get to 7/4 by following the harmonic series, it will sound in tune. If you get to 16/9 by stacking 4/3 twice, that will also sound in tune. Unsurprisingly, the second option is more common in music.
Before the 7th harmonic, all you have is octaves, fifths, and major thirds. If you want to stick to making other pitches out of stacked fifths and major thirds, you'll end up with other compromises.
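For concreteness, the cents figures above are easy to check, since an interval's size in cents is 1200 * log2(ratio):

    # Check the cents arithmetic: an interval's size in cents is 1200 * log2(ratio).
    from math import log2

    def cents(ratio: float) -> float:
        return 1200 * log2(ratio)

    print(cents(7/4))                 # ~968.8 cents (harmonic seventh)
    print(cents(16/9))                # ~996.1 cents (Pythagorean minor seventh)
    print(cents(16/9) - cents(7/4))   # ~27.3 cents between the two
    print(1000 - cents(7/4))          # ~31.2 cents below the 12-TET minor seventh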
It would be unergonomic, if not painful, to use a western classical approach to rhythm in a programming environment. Alex McLean, the main author of Tidal/Strudel, is very much into Indian classical music, and this is reflected in the approach to rhythm. IMO this is a good choice, and people who know music theory and composition should feel right at home, assuming we're talking about the right theory.
When it comes to pitch (and I guess we agree on this) Strudel is firmly on the western traditional side. It generally assumes 12-tone equal temperament, uses ABC notation, has built-in facilities to express chords using their classical names...
Meanwhile I'm over here programming music where I express all frequencies as fractions or monzos. I find this better suited to a music programming environment, but this might be more personal.
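In case "monzo" is unfamiliar: it's just an interval written as a vector of prime exponents. A tiny sketch of how frequencies fall out of that representation (the reference pitch and example intervals are arbitrary):

    # Sketch: frequencies from just-intonation intervals written as monzos
    # (prime-exponent vectors). Reference pitch and examples are arbitrary.
    PRIMES = [2, 3, 5, 7]

    def monzo_to_ratio(monzo):
        ratio = 1.0
        for prime, exp in zip(PRIMES, monzo):
            ratio *= prime ** exp
        return ratio

    base = 220.0  # Hz, arbitrary reference pitch

    for name, monzo in [
        ("unison",            [0, 0, 0, 0]),
        ("perfect fifth 3/2", [-1, 1, 0, 0]),
        ("major third 5/4",   [-2, 0, 1, 0]),
        ("harmonic 7th 7/4",  [-2, 0, 0, 1]),
    ]:
        print(name, base * monzo_to_ratio(monzo), "Hz")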
Having done a fair amount of audio physical modeling, I'll just say a synthesized version that's both fast and realistic would be possible but difficult. The difficulty is at least "it would make an impressive presentation at DAFx [1]", though I might be underestimating it and it's more like "you could make it your master's thesis at CCRMA [2]".
Ideal springs are a common, simple element in this field, but this kind of spring is very much not that.
You're probably better off improving the sample-based version by fading out the audio when necessary and using different samples based on the way it's triggered. If you have "ultra-dry" samples (maybe taken with a contact mic), you can add a convolution effect with a well-chosen impulse response, this will allow you to sharply cut off or adjust the audio and still have a natural-sounding tail.
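Something in this spirit (file names are placeholders, scipy is just one convenient way to do it, and it assumes mono samples):

    # Sketch of the "dry sample + convolution" idea: cut or shape the dry hit
    # freely, then convolve with an impulse response to get a natural tail.
    # File names are placeholders; assumes mono; scipy is one of many options.
    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import fftconvolve

    sr, dry = wavfile.read("spring_hit_dry.wav")    # ultra-dry (contact mic) sample
    _, ir = wavfile.read("impulse_response.wav")    # chosen to taste

    dry = dry.astype(np.float64)
    ir = ir.astype(np.float64)

    # Example of the freedom this buys: fade the dry sample out early...
    cut = int(0.05 * sr)
    dry[cut:] *= np.linspace(1.0, 0.0, len(dry) - cut)

    # ...and the convolution still gives it a plausible decaying tail.
    wet = fftconvolve(dry, ir)
    wet /= np.max(np.abs(wet))  # normalize to avoid clipping

    wavfile.write("spring_hit_wet.wav", sr, (wet * 32767).astype(np.int16))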
I'm extremely grateful for this. My most deeply held secret is that I wish I could do this for a living - digitally modeling weird/beautiful objects/instruments and working on that forever, haha. (And maybe making pedals out of them, I don't know.)
If you don't mind humoring me (I'm quite the novice in this field), if I automated the recording of "all" possible positions for a spring (say I had a motor positioned in a way that would let me pull the spring in any polar direction), would that make modeling potentially easier?
There might be a "train an AI, here's 1000 recordings" angle, but I'm not necessarily interested in/asking about that.
Just strictly for modeling, would it help the R&D phase to have a lot of high sample rate recordings? Thanks a lot!
P.S. Also, if you have a good intro to DSP class/book, I'd love to hear it. I know about a few, but a recc is always appreciated
That's funny, I was trying to do other stuff after posting my comment, but my brain kept working in the background, against my will, looking for the best approach to actually model this. Honestly, I was probably being pessimistic about the difficulty of a synthesized version, but I still think your current approach (don't synthesize, use samples) is more reasonable and can be made more responsive.
I don't think that recording a large number of starting positions would help that much with creating a (non-ML) model, and I doubt a high sample rate would provide much useful information either. A more common approach would be to try getting separate sounds for the impulse and the resonant body, though they may be impossible to really separate, and the actual model may end up more complex than that.
You probably have a good starting point already with your code for the animated model. I think the sound mostly comes from the collision between coils (collisions not visible in your animated model), and almost entirely from the lowest couple of windings that are against the wall. This is your impulse. The resonant body might be in 2 parts: the wall and the long end of the spring. Your existing model can tell you when to trigger the impulses, and how much force to put into them.
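To make that structure concrete, here's a toy exciter/resonator sketch; the collision times, forces, resonant frequency, and decay are all invented, and a real model would want several resonators and a smarter excitation signal:

    # Toy exciter/resonator structure: impulses (coil collisions) drive a single
    # damped two-pole resonator. All numbers are invented for illustration.
    import numpy as np

    sr = 48000
    out = np.zeros(sr)  # one second of output

    # Excitation: the animated model would supply these collision times and forces.
    excitation = np.zeros_like(out)
    for t, force in [(0.010, 1.0), (0.013, 0.6), (0.021, 0.4)]:
        excitation[int(t * sr)] = force

    # Damped two-pole resonator: y[n] = x[n] + 2 r cos(w) y[n-1] - r^2 y[n-2]
    freq, r = 180.0, 0.9995  # resonant frequency (Hz) and pole radius
    w = 2 * np.pi * freq / sr
    a1, a2 = 2 * r * np.cos(w), -r ** 2

    y1 = y2 = 0.0
    for n in range(len(out)):
        y = excitation[n] + a1 * y1 + a2 * y2
        out[n], y1, y2 = y, y, y1

    out /= np.max(np.abs(out))  # normalize before writing/playing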
Julius O. Smith has an encyclopedic amount of content on the topic, though it's often condensed into math that can be hard to apply: https://ccrma.stanford.edu/~jos/
The death of VR has been greatly exaggerated. The only thing that died is the hype, and the hype will not be missed. It's just nice to have so much less bullshit.