Running Stable Diffusion in 260MB of RAM (github.com/vitoplantamura)
293 points by Robin89 on July 20, 2023 | 64 comments


I like the use of a tiny device to generate the images. I was wondering whether the energy consumption per image would be lower, but I did the simple maths and it's not the case.

A Raspberry Pi Zero 2 W seems to use about 6W under load (source: https://www.cnx-software.com/2021/12/09/raspberry-pi-zero-2-... )

So if it takes 3 hours to generate one picture, that's about 18Wh per image.

An Nvidia Tesla or RTX GPU can generate a similar picture very quickly. Assuming one second per image and 350W under load for the whole system, it's on the order of 0.1Wh per image.

Of course, we could consider that a Raspberry Pi Zero uses a lot fewer resources and less energy to manufacture and transport.
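
Back-of-envelope, here's the same comparison as a tiny Python sketch; the wattages and timings are just the rough figures above, not measurements:

    def wh_per_image(power_watts, seconds_per_image):
        # Energy per image in watt-hours: power (W) x time (h)
        return power_watts * seconds_per_image / 3600

    pi_zero = wh_per_image(6, 3 * 3600)   # ~18 Wh per image
    gpu_box = wh_per_image(350, 1)        # ~0.1 Wh per image

    print(f"Pi Zero 2 W: {pi_zero:.1f} Wh/image")
    print(f"GPU system:  {gpu_box:.2f} Wh/image")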


Would an accelerator such as the Intel Neural Compute Stick 2 work with this? It can be plugged into a Pi, but I'm not clear on how VRAM works on the compute stick, or whether it's shared with the host.


For on-prem use, the up-front cost is a lot lower. The A100 that most serious outfits are using runs in the thousands to tens of thousands of dollars per unit, with very limited availability. The Pi is typically under $75 USD for any variant.


An RTX 4090 is a much better value for Stable Diffusion, but yes, if you start to think about cost, the Pi wins. If you think about availability, I'm not sure.


The big immediate plus here is that if you live somewhere with limited access to the internet, you can still generate imagery offline on a low-end laptop, like a protest group in far Eastern Europe or other areas. My personal travel laptop only has 8GB of memory, so it's exciting to be able to try out an idea even if I don't have high-end hardware.


An RTX 3090 hits the current sweet spot of price/performance for me. Half the throughput of the 4090, but at a third of the cost. (I needed the 24GB of VRAM for other LLM projects.)


Is this brand new or used?


Used is the only way to get a 3090 for ~$650-$750 (they're not hard to find on eBay in that general price range).


Used from ebay in my case.


Incredible! If only there were some cheap, hackable e-ink frame, you could make a fully self-contained artwork from an e-ink panel + RPi that's (slowly) continuously updating itself!


There definitely are some: https://shop.pimoroni.com/search?q=e-ink

And now I think I know what my next project is going to be. I'm sure I can find some desk space.


Yessss! I looked into building some self-contained "slow tech" generative art using e-ink a couple of years ago, but it was just impossible on my tiny budget. This is great, thanks!!

Edit: I'm so hyped about this; the example image in TFA takes 2+ hours to generate, but who cares?! I'd love to have a little display that churns away in the background and creates a new variation on my prompt every however-many hours, displaying the results on an unobtrusive e-ink screen.
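
The loop itself would be tiny. A rough sketch in Python (sd_generate and show_on_eink are hypothetical stand-ins for whatever Stable Diffusion build and e-ink driver you end up with):

    import subprocess, time
    from pathlib import Path

    OUT_DIR = Path("gallery")      # keep every image in case one turns out great
    PROMPT = "a quiet mountain lake at dawn, ukiyo-e style"
    INTERVAL_S = 6 * 3600          # a fresh picture every few hours is plenty

    def sd_generate(out_path):
        # Hypothetical: shell out to whatever SD binary/script runs on the Pi.
        subprocess.run(["./sd", "--prompt", PROMPT, "--output", str(out_path)],
                       check=True)

    def show_on_eink(path):
        # Hypothetical: replace with your e-ink panel's driver call.
        pass

    OUT_DIR.mkdir(exist_ok=True)
    while True:
        out = OUT_DIR / f"{int(time.time())}.png"
        sd_generate(out)
        show_on_eink(out)
        time.sleep(INTERVAL_S)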


Is it possible to incorporate a personalized "context" into the generator? Weather, market/news sentiment, calendar events, etc., to style the end result.


I love the idea.


Make sure you build in a capacity to save all the previous iterations in case you see something you really like.


Haha I like the idea of walking past, glancing now and then to see if there's something you really love...

but on the other hand I would also love the statement behind something unconnected to the internet that's slowly churning out unique, ephemeral pictures. Yours to enjoy, then gone forever.


You can make a digital sand mandala [1]

[1] https://en.m.wikipedia.org/wiki/Sand_mandala


I made one before (https://dheera.net/projects/einkframe/) that used ShanShui (https://github.com/LingDong-/shan-shui-inf)

I'm thinking of making a Stable Diffusion version of this, and preferably with a larger eInk screen.


You can use mine:

https://www.stavros.io/posts/making-the-timeframe/

You just put an image on some HTTP server and it shows it.


Listen to the speech in the room and, based on hot topics, generate a set of pictures for tomorrow.


Like a continuously updating wall-mounted newspaper[1]?

[1] https://imgur.io/a/NoTr8XX (no, I don’t know why anyone would use Imgur to write up a hack either)


Waveshare and Pimoroni have some that work well with Raspberry Pi, if they're in your budget. I built a Waveshare epaper display + Pi Zero into a photo frame for a totally different project. Your idea tempts me.


Two years ago people were already hacking updateable photo frames out of them:

https://www.youtube.com/watch?v=YawP9RjPcJA


Does this mean you could fit its whole working set in the cache hierarchy of a modern high-end GPU, getting near 100% ALU utilisation?


It streams the weights. This is going to be what limits performance, not ALU utilization.
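
Conceptually something like this (a sketch of the general weight-streaming idea, not the repo's actual implementation; the per-layer .npy files are hypothetical):

    import numpy as np

    def streamed_forward(layer_weight_files, x):
        # Only one layer's weights are ever resident in RAM, at the cost
        # of re-reading them from disk on every forward pass.
        for path in layer_weight_files:
            w = np.load(path)   # load just this layer's weights from disk
            x = x @ w
            del w               # free them before touching the next layer
        return x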


In 260 megs of RAM?!? I'm going to try this on my Amiga!

Check back in a few months for my results...


Look at moneybags over here with his "megs" of RAM. I think mine only had 256K available after the Kickstart disk was loaded.



I remember rendering a few ray-traced balls overnight on the Amiga, good times.


Incredible! The march to get more models running on the edge continues, much faster than I anticipated. The static quantization and slicing techniques here are pretty cool.


I’ve been amazed at how quickly the open source community has iterated on LLMs and Diffusion models. Goes to show how well open source can work.


Innovation in the tech world is spurred by open access.


Support the open companies. Avoid the closed ones, even if they are fantastic at marketing. ;)


Wait, are these inference times real? One second on a RasPi? Am I getting this right? This is faster than on my GPU. What's going on here?


Pretty sure that is just the text encoding step. Generating a complete image took 3h if I read correctly.

Update: "Tests were run on my development machine: Windows Server 2019, 16GB RAM, 8750H cpu (AVX2), 970 EVO Plus SSD, 8 virtual cores on VMWare."


I think it's the inference time per iteration.


ahh thanks


That's really cool! I always thought you needed a good amount of GPU VRAM to generate images using SD.

I wonder how fast a consumer PC with no GPU would generate an image with, say, 16GB of RAM?


On an Apple M1 with 16GB of RAM, without PyTorch compiled to take advantage of Metal, it could take 12 minutes to generate an image from a tweet-length prompt. With Metal, it takes less than 60 seconds.
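
For anyone wanting to try the Metal path: recent PyTorch builds expose it as the "mps" device, so with the diffusers pipeline it's roughly this (the v1.5 checkpoint is just an example; any SD 1.x checkpoint loads the same way):

    import torch
    from diffusers import StableDiffusionPipeline

    device = "mps" if torch.backends.mps.is_available() else "cpu"

    pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
    pipe = pipe.to(device)  # "mps" routes the UNet/VAE work through Metal

    image = pipe("a photo of an astronaut riding a horse").images[0]
    image.save("out.png")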


Prompt length shouldn't influence creation time, at least it didn't in any of the implementations I used.

What is the resolution of your images and number of steps?


Defaults from the Hugging Face repo, just copy-pasted. So, IIRC, 50 steps and the image is 512x512.

Edit: confirmed.

> Prompt length shouldn't influence creation time...

Yeah, checks out with my experience too. Longer prompts were truncated.


Some tools (e.g. Automatic1111) are able to feed in longer prompts, but then the prompt length does affect the speed of inference.

Albeit in 77-token increments.


And PyTorch on the M1 (without Metal) uses the fast AMX matrix multiplication units (through the Accelerate framework). Matrix multiplication on the M1 is on par with ~10 threads/cores of a Ryzen 5900X.

[1] https://github.com/danieldk/gemm-benchmark#example-results


Wtf, my four-year-old, $400, crappy low-wattage computer can generate a picture in a minute or two.

DDIM, 12 steps.


Metal is such an advantage, had no idea


I was using a roughly six-year-old AMD CPU with 16GB of RAM, and generating from a prompt would take about half an hour. Which is still massively impressive for what it is.


Use a free GPU from Google Colab and you can do the same in about 15 seconds...


Yes, and if he does it on a paid machine with a better GPU, it'll be even faster!

While true, neither your statement nor mine above is germane to the discussion. It wasn't about how long it takes. It's a discussion of how cool it is that it can be done on that machine at all.


Do you have a Google Colab link?


On 21 April 2023, Google blocked usage of Stable Diffusion with a free account on Colab. You need a paid plan to use it.

Apparently there are ways around it, but I just switched to runpod.io. It's very cheap (around $0.80/h for a 4090 including storage) and having a real terminal is worth it.


There is no shortage of Google Colab Stable Diffusion tutorials on the web.


Which is why asking for one high-quality starting point is such a useful question.


"It runs Stable Diffusion" is the new "It runs Doom".


Now I'm wondering: could a monkey hitting random keys on a keyboard for an infinite amount of time eventually come up with the right prompts to get GPT-4 to produce code that compiles to a faithful reproduction of Doom?


Probably more easily than you'd think. DOOM is open source[1], and as GP alludes, is probably the most frequently ported game in existence, so its source code almost certainly appears multiple times in GPT-4's training set, likely alongside multiple annotated explanations.

[1] https://github.com/id-Software/DOOM


Well, not the most ported; the Z-Machine, with tons of games (even ones legally available from the IF Archive with great programming, such as Curses!, Jigsaw, and Anchorhead), might be. It even runs on the Game Boy, up to v3 games. Z5 and Z8 games will run fine on a 68020 and beyond.


Now I'm wondering: if there were two monkeys hitting random keys on a keyboard for an infinite amount of time, one typing into the GPT-4 prompt and the other typing 0s and 1s straight, who would produce Doom code faster?


Yes. Infinity is weird.


No, because GPT-4 has finite memory (its context length), and its random number generator for output selection is probably pseudo-random with finite internal state.

If the random number generator is pseudo-random, this makes GPT-4 a deterministic finite-state machine, and the output sequence does not necessarily contain all possible subsequences no matter how many times the monkey types a new random key. Put differently, some output subsequences may be inaccessible no matter which keys are input. Same if the random number generator is truly random but its value cannot select among all possible output tokens, only a subset provided by the GPT at each step.


That's a good point, I hadn't considered the limits of GPT's memory.


3 hours for 8-bit. I wonder what it would be if you went further. Greyscale? Black & white?


This is really neat! Always cool to see what people can do with less.


Interesting. Which platform/PC config did you use?


Amazing work!



