One quick way to estimate a lower bound is to take the number of parameters and multiply it by the bits per parameter. So a model with 7 billion parameters running with float8 types would need ~7 GB at a minimum just to load the weights. The attention mechanism requires more on top of that, depending on the size of the context window.
You'll also need to load inputs (images in this case) onto the GPU memory, and that depends on the image resolution and batch size.
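The estimate above is just arithmetic; a minimal sketch (the helper name and the precision list are my own, purely illustrative):

```python
# Back-of-envelope VRAM lower bound: parameters x bits per parameter.
# This covers the weights only -- KV cache, activations, and input
# batches (which scale with context length / resolution) come on top.

def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Minimum memory just to hold the weights, in GB."""
    return n_params * bits_per_param / 8 / 1e9

# A 7B-parameter model at a few common precisions:
for bits in (32, 16, 8, 4):
    print(f"{bits}-bit: ~{weight_memory_gb(7e9, bits):.1f} GB")
# 8-bit weights: ~7.0 GB, matching the estimate above.
```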
If a person tries to communicate, but his stylistic choice of laziness (his own admission!) gets in the way of delivering his message, it is tangibly useful to point that out, so that the writing effort can be better optimized for effect.
I wasn't even demanding or telling him what to do. I simply shared my observation; it's up to him to decide if he wants to communicate better. Information and understanding are power.
Your choice. The worst thing is not knowing ("Why are posts with reasonable opinions being downvoted and not engaged with?"). Now you know (you're welcome), and it's your choice what to do with that information.
Telling where the boundary of competence lies for these models. It also shows that these models aren't doing what most people expect them to be doing, i.e. not actually counting legs, but instead inferring an answer from the overall image (dogs usually have four legs), to the detriment of fine-grained or out-of-distribution tasks.
Cosine similarity is the dot product of vectors that have been normalized to lie on the unit sphere. Normalization doesn't alter orthogonality, nor does it change the fact that most pairs of random high-dimensional vectors are nearly orthogonal.
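Both claims are easy to check numerically; a stdlib-only sketch (the vectors and dimension are arbitrary choices for illustration):

```python
# Cosine similarity as a dot product of unit-normalized vectors,
# plus a check that random high-dimensional vectors are nearly orthogonal.
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    # Dividing by the norms is exactly "project onto the unit sphere first".
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Scaling a vector doesn't change its cosine with anything,
# so normalization can't create or destroy orthogonality.
u, v = [1.0, 2.0, 3.0], [3.0, 0.0, -1.0]   # dot(u, v) == 0
assert abs(cosine(u, v)) < 1e-12
assert abs(cosine([10 * x for x in u], v)) < 1e-12

# Two random Gaussian vectors in high dimension: cosine is typically
# on the order of 1/sqrt(d), i.e. close to zero.
random.seed(0)
d = 10_000
a = [random.gauss(0, 1) for _ in range(d)]
b = [random.gauss(0, 1) for _ in range(d)]
print(cosine(a, b))
```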
They do.
It's just that orthogonality only requires the dot product to vanish, so two vectors don't need to differ everywhere to be orthogonal. For example, if two vectors agree on every coordinate except one, where one has value c and the other -c, then they _are already orthogonal_ whenever the shared part has squared norm c².
In d dimensions you can have at most d vectors that are mutually orthogonal.

Interestingly, this means that for sequence lengths up to d you can have precisely targeted positional attention. As soon as you go to longer sequences, that's no longer universally possible.
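A toy illustration of the "up to d positions" point, using the standard basis as the d mutually orthogonal keys (the scale factor and dimension are arbitrary assumptions):

```python
# With d mutually orthogonal keys, a query aligned with one key scores
# zero against every other key, so softmax attention can single out
# exactly one position.
import math

d = 4
keys = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

# Query "position 2" with a scaled copy of its key: all other scores are
# exactly zero, so the attention weights concentrate there as the scale grows.
query = [10.0 * x for x in keys[2]]
scores = [sum(q * k for q, k in zip(query, kv)) for kv in keys]
weights = softmax(scores)
print([round(w, 4) for w in weights])  # nearly one-hot at index 2

# With more than d keys in d dimensions, some pair must have nonzero
# overlap (you can't exceed rank d), so exact targeting of every
# position is no longer universally possible.
```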
Do you mean the text input box only allows one line? In that case, try pressing Alt+Enter for a new line. It's a little unintuitive; I would expect Shift+Enter for a new line.
Woah, it's Alt+Enter. I always did Shift+Enter and thought these tools were terrible. They still are terrible if they use Alt+Enter, what a waste of time lol.
> It is completely typical, but at the same time abnormal to have tools with such poor usability.
The main difference I see is that LLMs are flaky. They're getting better over time, but they're still flakier than traditional tooling like debuggers.
> Programming languages are notoriously full of unnecessary complexity. Personal pet peeve: Rust lifetime management. If this is what it takes, just use GC (and I am - golang).
Lifetime management is an inherently hard problem, especially if you need to be able to reason about it at compile time. I think there are some arguments to be made about tooling or syntax making reasoning about lifetimes easier, but not trivial. And in certain contexts (e.g., microcontrollers) garbage collectors are out of the question.