Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

The article sums up some valid arguments against UUIDv4 as PKs but the solution the author provides on how to obfuscate integers is probably not something I'd use in production. UUIDv7 still seems like a reasonable compromise for small-to-medium databases.




I tend to avoid UUIDv7 and use UUIDv4 because I don't want to leak the creation times of everything.

Now this doesn't work if you actually have enough data that the randomness of the UUIDv4 keys is a practical database performance issue, but I think you really have to think long and hard about every single use of identifiers in your application before concluding that v7 is the solution. Maybe v7 works well for some things (e.g identifiers for resources where creation times are visible to all with access to the resource) but not others (such as users or orgs which are publicly visible but without publicly visible creation times).


I'm also not a huge fan of leaking server-side information; I suspect UUIDv7 could still be used in statistical analysis of the keyspace (in a similar fashion to the german tank problem for integer IDs). Also, leaking data about user activity times (from your other comment) is a *really* good point that I hadn't considered.

I've read people suggest using a UUIDv7 as the primary key and a UUIDv4 as a user-visible one as a remedy.

My first thought when reading the suggestion was, "well but you'll still need an index on the v4 IDs, so what does this actually get you?" But the answer is that it makes joins less expensive; you only require the index once, when constructing the query from the user-supplied data, and everything else operates with the better-for-performance v7 IDs.

To be clear, in a practical sense, this is a bit of a micro-optimization; as far as I understand it, this really only helps you by improving the data locality of temporally-related items. So, for example, if you had an "order items" table, containing rows of a bunch of items in an order, it would speed up retrieval times because you wouldn't need to do as many index traversals to access all of the items in a particular order. But on, say, a users table (where you're unlikely to be querying for two different users who happen to have been created at approximately the same time), it's not going to help you much. Of course the exact same critique is applicable to integer IDs in those situations.

Although, come to think of it, another advantage of a user-visible v4 with v7 Pk is that you could use a different index type on the v4 ID. Specifically, I would think that a hash index for the user-visible v4 might be a halfway-decent way to go.

I'm still not sure either way if I like the idea, but it's certainly not the craziest thing I've ever heard.


I think a bigger benefit from doing that would be that inserts would be cheaper. Instead of an expensive insert into the middle of an index for every table that needs an index on that key, you can do a cheaper insert at the end of the index for all of them except for the one that uses uuid4.

But if you are doing that, why not just use an incrementing integer instead of a uuidv7?


Certainly for many applications, the autoint approach would be fine.

The benefit of uuid in this case is that it allows horizontally scalable app servers to construct PKs on their own without risk of collisions. In addition to just reducing database load by doing the ID generation on the app server (admittedly usually a minor benefit), this can be useful either to simplify insert queries that span multiple tables with FK relationships (potentially saving some round trips in the process) or in very niche situations where you have circular dependencies in non-nullable FKs (with the constraint deferred until the end of the transaction).


> I tend to avoid UUIDv7 and use UUIDv4 because I don't want to leak the creation times of everything.

See perhaps "UUIDv47 — UUIDv7-in / UUIDv4-out (SipHash‑masked timestamp)":

* https://github.com/stateless-me/uuidv47

* Sept 2025: https://news.ycombinator.com/item?id=45275973


If that kind of stuff is on the able you can also use boring 64bit integer keys and encrypt those (e.g. [1]). Which in the end is just a better thought out version of what the article author did.

UUIDv47 might have a space if you need keys generated on multiple backend servers without synchronization. But it feels very niche to me.

1: https://wiki.postgresql.org/wiki/XTEA_(crypt_64_bits)


Out of curiosity, why is it an issue if you leak creation time?

The issue will be very context specific. In other words to (reasonably) answer the question, we'd have to judge each application individually.

For one example, say you were making voting-booth software. You really don't want a (hidden) timestamp attached to each vote (much less an incrementing id) because that would break voter confidentiality.

More generally, it's more a underlying principle of data management. Not leaking ancillary data is easier to justify than "sure we leak the date and time of the record creation, but we can't think of a reason why that matters."

Personally I think the biggest issue are "clever" programmers who treat the uuid as data and start displaying the date and time. This leads to complications ("that which is displayed, the customer wants to change"). It's only a matter of time before someone declares the date "wrong" and it must be "fixed". Not to mention time zone or daylight savings conversions.


Well you're leaking user data. I'm sure you can imagine situations where "the defendant created an account on this site on this date" could come up. And the user could have created that account not knowing that the creation date is public, because it's not listed anywhere in the publicly viewable part of the profile other than the UUID in the URL.

Discord is doing fine.

Hacker news is also doing fine, even though I can just click your profile and see you joined in october 2024. It doesn't matter for every use case.

But there are cases where it matters. Using UUIDv7 for identifiers means you need to carefully consider the security and privacy implications every time you create a new table identified by a UUID, and you'll possibly end up with some tables where you use v4 and some where you use v7. Worst case, you'll end up with painful migrations from v7 to v4 as security review identifies timestamped identifiers as a security concern.


Pretty much every social media app has a "Member since X" visible on public profiles. I don't think it's an issue.

Who said I was talking about social media?

Well where else do users have public profiles?

The whole point though is that the ID itself leaks info, even if the profile is not public. There are many cases where you reference an object as a foreign key, even if you can't see the entire record of that foreign key.

I can't think of any.

Sending a friend request is an obvious example.

Ok but in that case, why not use the user ID as the primary key?

If your system (pseudo-) random number generator (RNG) is compromised to derive a portion of its entropy from things that are knowable by knowing the time when the function ran, then the search space for cracking keys created around the same time can be shrunken considerably.

This doesn’t even rely on your system’s built-in RNG being low quality. It could be audited and known to avoid such issues but you could have a compromised compiler or OS that injects a doctored RNG.


E.g, if your service users have timestamp as part of the key and this data is visible to other users, you would know when that account was created. This could be an issue.

There was a HN comment about competitors tracking how many new signups are happening and increasing the discounts/sales push based on that. Something like this.

In a business I once worked for, one of the users of the online ordering system represented over 50% of the business' income, something you wouldn't necessarily want them to know.

However, because the online ordering system assigned order numbers sequentially, it would have been trivial for that company to determine how important their business was.

For example, over the course of a month, they could order something at the start of the month and something at the end of the month. That would give them the total number of orders in that period. They already know how many orders they have placed during the month, so company_orders / total_orders = percentage_of_business

It doesn't even have to be accurate, just an approximation. I don't know if they figured out that they could do that but it wouldn't surprise me if they had.


This is also something that depends heavily on regulations. In my home country, invoice numbers have to be sequential by law, although you can restart the numbering every year.

Yes, even if it's not a legal requirement it's definitely best practice to have sequential invoice numbers. I thought about this at the time but these numbers aren't invoice numbers, only order numbers.

A global sequence, or sequence-per-account? I suspect the latter?

A sequence per "series", where a series can be a fiscal year, a department or category, etc. But I am not sure if you can have one series per customer, I only find conflicting information.

You can have more details here, in the section "Complete invoice":

https://sede.agenciatributaria.gob.es/Sede/en_gb/iva/factura...

https://www.boe.es/buscar/act.php?id=BOE-A-2012-14696#a6 (Spanish only)


That's happening everywhere. You can order industrial parts from a Fortune 500 and check some of the numbers on it too, if they're not careful about it.

Apart from all the other answers here: an external entity knowing the relative creation time for two different accounts, or just that the two accounts were created close in time to each other can represent a meaningful information leak.

Depends on the data. If you use a primary key in data about a person that shouldn't include their age (e.g. to remove age-based discrimination) then you are leaking an imperfect proxy to their age.

So the UUID could be used as an imperfect indicator of a records created time?

UUIDv7 but not UUIDv4.

I suppose timing attacks become an issue too.

UUIDv7 still have a lot of random bits. Most attacks around creating lots of ids are foiled by that

Admins, early users, founders, CEOs etc etc would have althe lowest creation time...

You shouldn't generally use PKs as public identifiers, least of all UUIDs, which are pretty user hostile.

I really don't see the issue with having a UUID in a URL.

If all you want is to obfuscate the fact that your social media site only has 200 users and 80 posts, simply use a permutation over the autoincrement primary key. E.g. IDEA or CAST-128, then encode in base64. If someone steps on your toes because somewhere in your codebase you're using a forbidden legacy cipher, just use AES-128. (This is sort of the degenerate/tautological base case of format-preserving encryption)

(What do you think Youtube video IDs are?)


The problem with this approach is that you now have to manage a secret key/secret for a (maybe) a very long time.

I shared this article a few weeks ago, discussing the problems with this kind of approach: https://notnotp.com/notes/do-not-encrypt-ids/

I believe it can make sense in some situations, but do you really want to implement such crypto-related complexity?


The article is self-contradictory in that it acts like that key is super-important ("Operations becomes a nightmare. You now have a cryptographic secret to manage. Where does this key live? Protected by a wrapping key living in a KMS or HSM? Do you use the same key across prod, staging, and dev? If dev needs to test with prod data, does it need access to prod encryption keys? What about CI pipelines? Local developer machines?") but then also acknowledges that we're talking about an obfuscation layer of stuff which is not actually sensitive ("to hide timestamps that aren't sensitive"). Don't get me wrong, it's a definitive drawback for scaling the approach, but most applications have to manage various secrets, most of which are actually important. E.g. session signing keys, API keys etc. It's still common for applications to use signed session with RCE data formats. The language from that article, while not wrong, is much more apt for those keys.

That being said, while fine for obfuscation, it should not be used for security for this purpose, e.g. hidden/unlisted links, confirmation links and so on. Those should use actual, long-ish random keys for access, because the inability to enumerate them is a security feature.


I always thought they are used and stored as they are because the kind of transformation you mention seems terribly expensive given the YT's scale, and I don't see a clear benefit of adding any kind of obfuscation here.

> What do you think Youtube video IDs are?

I actually haven no idea. What are they?

(Also what is the format of their `si=...` thing?)


Can’t recall where I heard this, but I’m pretty sure the si=… is tracking information that associates the link with the user who shared it.

Oh absolutely, I am just wondering _what_ does it contain.

YouTube video ids are just integers in a base-64 encoding, modified to be URL safe.

Interesting. Any examples? I mean, I can probably reverse-engineer something myself but just curious.

I am much more interested in the `si` parameter.. but I am fairly sure nobody outside of Google knows what it is exactly.


Why not use AES-128 by default? Your CPU has instructions to accelerate AES-128.

Can't you just change the starting value of your sequence?

In Postgres I often like to use a single sequence for everything. It leaks some information yes but in a busy system it tends to be "obscure enough".

It's not leaking that's the concern. It's that not having the names of objects be easily enumerable is a strongly security-enhancing feature of a system.

Yes of course everyone should check and unit test that every object is owned by the user or account loading it, but demanding more sophistication from an attacker than taking "/my_things/23" and loading "/my_things/24" is a big win.


With a single sequence and a busy system, the ids for most high-level tables/collection are extremely sparse. This doesn't mean they can't be enumerated, but you will probably notice if you suddenly start getting hammered with 404s or 410s or whatever your system generates on "not found".

Also, if most of your endpoints require auth, this is not typically a problem.

It really depends on your application. But yes, that's something to be aware of. If you need some ids to be unguessable, make sure they are not predictable :-)


If you have a busy system, a single sequence is going to be a pretty big performance bottleneck, since every resource creation will need to acquire a lock on that sequence.

> Also, if most of your endpoints require auth, this is not typically a problem.

Many systems are not sparse, and separately, that's simply wrong. Unguessable names is not a primary security measure, but a passive remediation for bugs or bad code. Broken access control remains an owasp top 10, and idor is a piece of that. Companies still get popped for this.

See, eg, google having a bug in 2019, made significantly less impactful by unguessable names https://infosecwriteups.com/google-did-an-oopsie-a-simple-id...




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: