On an Apple M1 with 16 GB of RAM, PyTorch built without Metal support can take 12 minutes to generate an image from a tweet-length prompt. With Metal, it takes less than 60 seconds.
And that's even though PyTorch on the M1 (without Metal) already uses the fast AMX matrix multiplication units (through the Accelerate framework). Matrix multiplication on the M1 is on par with ~10 threads/cores of a Ryzen 5900X.
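For anyone wanting to check which path they're on: recent PyTorch builds (1.12 and later) expose the Metal backend as the "mps" device. Here is a minimal sketch of picking it when available and falling back to the CPU otherwise. The helper name `pick_device` is made up for illustration, and this assumes nothing about how any particular image generation script selects its device.

```python
def pick_device() -> str:
    """Return "mps" if PyTorch reports a usable Metal backend, else "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed; nothing to accelerate.
        return "cpu"
    # torch.backends.mps only exists in PyTorch 1.12+, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
print(device)
```

If this prints "cpu" on an M1, the install is most likely either older than 1.12 or was not built with Metal support, and you'd be on the slow path described above.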