
This doesn't feel like good-faith. There are leagues of difference between "what you typed out" when that's in a highly structured compiler-specific codified syntax *expressly designed* as the input to a compiler that produces computer programs, and "what you typed out" when that's an English-language prompt, sometimes vague and extremely high-level.

That difference - and the assumed delta in difficulty, training and therefore cost involved - is why the latter case is newsworthy.


> This doesn't feel like good-faith.

When has a semantic "argument" ever felt like good faith? All it can ever be is someone choosing what a term means to them and trying to beat down others until they adopt the same meaning. Which will never happen because nobody really cares.

They are hilarious, but pointless. You know that going into it.


Fastmail is the way. These are people for whom email is their job and focus and you get everything that comes with that, including good and responsive customer service.


But their servers are in the US.


So are the email servers used by the recipients of your emails, no? Almost everybody uses Gmail, so even if you don't, most of your email correspondence is going to end up on, or originate from, Gmail servers anyway.


Personally, I don't know the last time I wrote to a Gmail address, so depending on location and environment, avoiding US mail servers may be possible.


GDPR applies if you're in the EU regardless, but it would be nice to have it split like bitwarden[.eu].


Because we exist within a market, where the choices of others end up affecting us - if the market "votes" for a competing thing, that might affect the market for the things you care about.

Your car analogy isn't great, but we see a similar dynamic playing out with EV vs combustion, and we did with film-vs-digital cameras. "Don't buy a digital camera if you like film" sure didn't help the film photographers.


This is like "HTML isn't code" again. For non-technical readers, there is their own language, and there is "code" - a bespoke language used solely to instruct machines. If you can't type to the machine in your own language (eg like you can to a chatbot) then you're using code. "The machine" is the device on the desk.

"ls" is code. You type it into the machine's keyboard, and it understands your code and performs that instruction. The statement is not "radically" wrong, it's an oversimplification that both communicates correctly to the lay reader, and to the proficient reader who understands the nuances and why they're irrelevant here.


> Tesonet initially assisted Proton with HR, payroll, and local regulation

Entirely normal behaviour for a competitor to provide “HR assistance”.


I've been part of a European startup that added offices in Asia and the US, and we initially always partnered with local companies to do this. It's mutually beneficial. It allowed us to grow more quickly, and it allowed them to make relatively easy money (and, in our case, to dump some of their shittier employees on us without us knowing).

In Proton's case, they already knew each other because Tesonet had previously offered to provide infrastructure during a DDoS attack against Proton.

So maybe it's a conspiracy, or maybe it's just how things go. You can make up your own mind, but you should provide the facts when you make sinister insinuations.


You know an awful lot of detail about the inner workings of two separate private companies though.


Is it really that shocking that someone on HN would have worked at as many as 2 private companies?


Nor is it shocking that a company with a PR issue would be astroturfing our forum.

The point is: we don't know.


I would assume that if they were astroturfing, they would be smart enough to use more than one account. Given that, I'm inclined to believe that you are part of an astroturfing campaign.


The summary is: if you use someone’s VPN, Tor, etc. you’re just setting yourself up. There is no privacy, and if you act like you want privacy, they’re going to pay more attention to you.


That's what they want you to think.


LOL, now I'm part of the conspiracy. This is all public knowledge.


Then you could provide sources, please?


Here you go: https://www.reddit.com/r/ProtonVPN/comments/8ww4h2/protonvpn...

Here's the Handelsregisterauszug for Proton, which shows ownership: https://www.zefix.admin.ch/en/search/entity/list/firm/118926...

Proton's peering relationships: https://bgp.tools/as/62371#asinfo

I'm not sure what exactly you're looking for.


> Here's the Handelsregisterauszug for Proton, which shows ownership

It doesn't. It's a joint-stock corporation and while the shareholders are registered, the register is not public.


Proton discloses shareholder information here: https://proton.me/support/who-owns-protonmail

But I guess they could be lying.


Them providing information isn't the same as that information being publicly verifiable.


> Mails are superior in announcing to multiple people

People who are known at time of sending. A slack message can be searched by those joining the team much (much) later, those who move teams, in-house search bots, etc. Mailing lists bridge this gap to some extent, but then you're really not just using email, you're using some kind of external collaboration service. Which undermines the point of "just email".


> > Mails are superior in announcing to multiple people
>
> People who are known at time of sending. A slack message can be searched by those joining the team much (much) later, those who move teams, in-house search bots, etc.

People use slack search successfully? Its search has to be one of the worst search implementations I have come across. Unless you know the exact wording in the slack message, it is almost always easier to scroll back and find the relevant conversation just from memory. And that says something, because the slack engineers in their infinite wisdom (incompetence) decided that messages don't get stored on the client, but get reloaded from the server (wt*!!), so scrolling back to a conversation that happened some days ago becomes an exercise of repeated scroll and wait. Slack is good for instant messaging type conversations (and even for those it quickly becomes annoying because their threads are so crappy), not much else. I wish we would use something else.


How would you search mail threads you weren't CC'd on?


MS Exchange had sort-of solved that problem with Public Folders. Basically shared email folders across an organization.

The older solution is NNTP/Usenet. I wish we had a modern system like that.


> Mailing lists bridge this gap to some extent, but then you're really not just using email, you're using some kind of external collaboration service. Which undermines the point of "just email".

Mailing lists are just email. They simply add a group archiving system.


That's why online private archives like https://mailarchive.ietf.org/arch/browse/ exist. For a free version, use groups.google.com


you just use a shared inbox for the team


This is being blocked by my corp on the grounds of "newly seen domains". What a world.


Not sure the emotive language is warranted. Message appears to be “if you use robots.txt AND archive sites honor it AND you are dumb enough to delete your data without a backup THEN you won’t have a way to recover and you’ll be sorry”.

It also presumes that dealing with automated traffic is a solved problem, which with the volumes of LLM scraping going on, is simply not true for more hobbyist setups.


I just plain don't understand what they mean by "suicide note" in this case, and it doesn't seem to be explained in the text.

A better analogy would be "Robots.txt is a note saying your backdoor might be unlocked".


The meaning is reasonably clear to me: Robots.txt says "Don't archive this data. When the website dies, all the information dies with it." It's a kind of death pact.
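
For example, a robots.txt that is nothing but the two lines below tells every compliant crawler, archivers included, to stay out of the whole site -- so when the site dies, no public copy survives it:

    User-agent: *
    Disallow: /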


That's not a suicide note, though, in any way I understand it.


It's the inevitable suicide of the data.

Language gets weird when you anthropomorphize abstract things like "data", but I thought it was clever enough. YMMV.


The suicide of the data listed in robots.txt? How? The whole point of the article is they ignore what you have written in your robots.txt, so they'll archive it regardless of what you say.


Correct, they are challenging your written wish for data-suicide.


I also cannot figure out from context what part of this is "suicide".

I don't even think it's a note saying your back door is unlocked? As others and I shared in a sibling comment thread, we have worked at places that implemented robots.txt in order to prevent bots from getting into nearly-infinite tarpits of links that lead to nearly-identical pages.


> volumes of LLM scraping

FWIW I have not seen a reputable report on the % of web scraping in the past 3 years.

(Wikipedia being a notable exception...but I would guess Wikipedia to see a far larger increase than anything else.)


It's hard because of attribution, but it absolutely is happening at very high volume. I actually got an alert this morning when I woke up from our monitoring tools that some external sites were being scraped. Happens multiple times a day.

A lot of it is coming through compromised residential endpoint botnets.


Even without attribution…seeing bot traffic or general traffic increase


Wikipedia says their traffic increased roughly 50% [1] from AI bots, which is a lot, sure, but nowhere near the amount where you'd have to rearchitect your site or something. And this checks out: if it were actually debilitating, you'd notice Wikipedia's performance degrade. It hasn't. You'd see them taking some additional steps to combat this. They haven't. Their CDN handles it just fine. They don't even bother telling AI bots to just download the tarballs they specifically make available for this exact use case.

More importantly, Wikipedia almost certainly represents the ceiling of traffic increase. But luckily, we don't have to work with such coarse estimation, because according to Cloudflare, the total increase from combined search and AI bots in the last year (May 2024 - May 2025), has just been... 18% [2].

The way you hear people talk about it though, you'd think that servers are now receiving DDOS-levels of traffic or something. For the life of me I have not been able to find a single verifiable case of this. Which if you think about it makes sense... It's hard to generate that sort of traffic; that's one of the reasons people pay for botnets. You don't bring a site to its knees merely by accidentally "not making your scraper efficient". So the only other possible explanation would be such a large number of scrapers simultaneously but independently hitting sites. But this also doesn't check out. There aren't thousands of different AI scrapers out there that in aggregate are resulting in huge traffic spikes [2]. Again, the total combined increase is 18%.

The more you look into this accepted idea that we are in some sort of AI scraping traffic apocalypse, the less anything makes sense. You then look at this Anubis "AI scraping mitigator" and... I dunno. The author contends that one of its tricks is that it not only uses JavaScript, but "modern JavaScript like ES6 modules," and that this is one of the ways it detects/prevents AI scrapers [3]. No one is rolling their own JS engine for a scraper such that they are being blocked by their inability to keep up with the latest ECMAScript spec. You are just using an existing JS engine, all of which support all these features. It would actually be a challenge to find an old JS engine these days.

The entire thing seems to be built on the misconception that the "common" way to build a scraper is doing something curl-esque. This idea is entirely based on the Google scraper, which itself doesn't even work that way anymore, and only ever did because it was written in the 90s. Everyone that rolls their own scraper these days just uses Puppeteer. It is completely unrealistic to make a scraper that doesn't run JavaScript and wait for the page to "settle down" because so many pages, even blogs, are just entirely client-side rendered SPAs. If I were to write a quick and dirty scraper today I would trivially make it through Anubis' protections... by doing literally nothing and without even realizing Anubis exists. Just using standard scraping practices with Puppeteer. Meanwhile Anubis is absolutely blocking plenty of real humans, with the author for example telling people to turn on cookies so that Anubis can do its job [4]. I don't think Anubis is blocking anything other than humans and Messages' link preview generator.

I'm investigating further, but I think this entire thing may have started due to some confusion, but want to see if I can actually confirm this before speculating further.

1. https://www.techspot.com/news/107407-wikipedia-servers-strug... (notice the clickbait title vs. the actual contents)

2. https://blog.cloudflare.com/from-googlebot-to-gptbot-whos-cr...

3. https://codeberg.org/forgejo/discussions/issues/319#issuecom...

4. https://github.com/TecharoHQ/anubis/issues/964#issuecomment-...


> It is completely unrealistic to make a scraper that doesn't run JavaScript and wait for the page to "settle down" because so many pages, even blogs, are just entirely client-side rendered SPAs.

I specifically want a search engine that does not run JavaScript, so that it only finds documents that do not require JavaScripts to display the text being searched. (This is not the same as excluding everything that has JavaScripts; some web pages use JavaScripts but can still display the text even without it.)

> Meanwhile Anubis is absolutely blocking plenty of real humans, with the author for example telling people to turn on cookies so that Anubis can do its job [4]. I don't think Anubis is blocking anything other than humans and Messages' link preview generator.

These are some of the legitimate problems with Anubis (and this is not the only way that you can be blocked by Anubis). Cloudflare can have similar problems, although it works a bit differently, so it is not exactly the same.


> I specifically want a search engine that does not run JavaScript, so that it only finds documents that do not require JavaScripts to display the text being searched. (This is not the same as excluding everything that has JavaScripts; some web pages use JavaScripts but can still display the text even without it.)

Sure... but off-topic, right? AI companies are desperate for high quality data, and unlike search scrapers, are actually not supremely time sensitive. That is to say, they don't benefit from picking up on changes seconds after they are published. They essentially take a "snapshot" and then do a training run. There is no "real-time updating" of an AI model. So they have all the time in the world to wait for a page to reach an ideal state, as well as all the incentive in the world to wait for that too. Since the data effectively gets "baked into the model" and then is static for the entire lifetime of the model, you over-index on getting the data, not on getting it fast, or cheap, or whatever.


Hi, main author of Anubis here. How am I meant to store state like "user passed a check" without cookies? Please advise.


If the rest of my post is accurate, that's not the actual concern, right? Since I'm not sure if the check itself is meaningful. From what is described in the documentation [1], I think the practical effect of this system is to block users running old mobile browsers or running browsers like Opera Mini in third world countries where data usage is still prohibitively expensive. Again, the off-the-shelf scraping tools [2] will be unaffected by any of this, since they're all built on top of Puppeteer, and additionally are designed to deal with the modern SPA web which is (depressingly) more or less isomorphic to a "proof-of-work".
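
To make "off-the-shelf" concrete, here is roughly all a Puppeteer-based scraper amounts to -- a minimal sketch (it assumes Node.js plus the puppeteer npm package; the URL and timeouts are placeholders, nothing Anubis-specific):

    import puppeteer from "puppeteer";

    // A headless browser runs all page JavaScript by default, so a JS interstitial
    // (proof-of-work or otherwise) is just another page to wait out, like any slow SPA.
    async function fetchRendered(url: string): Promise<string> {
      const browser = await puppeteer.launch({ headless: true });
      const page = await browser.newPage();

      // "networkidle2" resolves once the page has mostly stopped making requests --
      // the same settling step any SPA scrape needs anyway.
      await page.goto(url, { waitUntil: "networkidle2", timeout: 60_000 });

      // If a challenge page navigates again once it finishes, one extra wait covers it.
      await page.waitForNavigation({ waitUntil: "networkidle2", timeout: 10_000 }).catch(() => {});

      const html = await page.content(); // fully rendered DOM, post-JavaScript
      await browser.close();
      return html;
    }

    fetchRendered("https://example.com/some-page").then((html) => console.log(html.length));

Nothing in that sketch knows or cares whether the first page it hits is an interstitial; it just waits for the network to settle, the same as it would for any SPA.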

If you are open to jumping on a call in the next week or two I'd love to discuss directly. Without going into a ton of detail, I originally started looking into this because the group I'm working with is exploring potentially funding a free CDN service for open source projects. Then this AI scraper stuff started popping up, and all of a sudden it looked like if these reports were true it might make such a project no longer economically realistic. So we started trying to collect data and concretely nail down what we'd be dealing with and what this "post-AI" traffic looks like.

As such, I think we're 100% aligned on our goals. I'm just trying to understand what's going on here since none of the second-order effects you'd expect from this sort of phenomenon seem to be present, and none of the places where we actually have direct data seem to show this taking place (and again, Cloudflare's data seems to also agree with this). But unless you already own a CDN, it's very hard to get a good sense of what's going on globally. So I am totally willing to believe this is happening, and am very incentivized to help if so.

EDIT: My email is my HN username at gmail.com if you want to schedule something.

1. https://anubis.techaro.lol/docs/design/how-anubis-works

2. https://apify.com/apify/puppeteer-scraper


Cloudflare Turnstile doesn't require cookies. It stores per-request "user passed a check" state using a query parameter. So disabling cookies will just cause you to get a challenge on every request, which is annoying but ultimately fair IMO.
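
The general shape of that cookieless approach is a short-lived signed token carried in the URL instead of a cookie. A minimal sketch of the pattern (not Turnstile's actual implementation; every name below is made up for illustration):

    import { createHmac, timingSafeEqual } from "node:crypto";

    const SECRET = process.env.CHALLENGE_SECRET ?? "dev-only-secret";

    // Issued once a challenge passes; carried on later requests as e.g. ?pass=<token>.
    function issuePassToken(clientIp: string, ttlSeconds = 300): string {
      const expires = Math.floor(Date.now() / 1000) + ttlSeconds;
      const sig = createHmac("sha256", SECRET).update(`${clientIp}:${expires}`).digest("hex");
      return `${expires}.${sig}`;
    }

    // Checked on every request: no cookie and no server-side session to store.
    function verifyPassToken(token: string, clientIp: string): boolean {
      const [expiresStr, sig = ""] = token.split(".");
      const expires = Number(expiresStr);
      if (!Number.isFinite(expires) || expires < Date.now() / 1000) return false;
      const expected = createHmac("sha256", SECRET).update(`${clientIp}:${expires}`).digest("hex");
      // Constant-time compare so the signature can't be guessed byte by byte.
      return sig.length === expected.length &&
        timingSafeEqual(Buffer.from(sig), Buffer.from(expected));
    }

The trade-off, as you say, is that the token has to be re-earned or re-attached per navigation, which is why it degrades to a challenge on every request without cookies.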


Doesn't Wikipedia offer full tarballs?

This would imaginably put some downward pressure on scraper volume.


From the first paragraph in my comment:

> You'd see them taking some additional steps to combat this. They haven't. Their CDN handles it just fine. They don't even bother telling AI bots to just download the tarballs they specifically make available for this exact use case.

Yes, they do. But they aren't in a rush to tell AI companies this, because again, this is not actually a super meaningful amount of traffic increase for them.


I don't think you understand the purpose of Anubis. If you did then you'd realize that running a web browser with JS enabled doesn't bypass anything.


By bypass I mean "successfully pass the challenge". Yes, I also have to sit through the Anubis interstitial pages, so I promise I know it's not being "bypassed". (I'll update the post to remove future confusion).

Do you disagree that a trivial usage of an off-the-shelf puppeteer scraper[1] has no problem doing the proof-of-work? As I mentioned in this comment [2], AI scrapers are not on some time crunch; they are happy to wait a second or two for the final content to load (there are plenty of normal pages that take longer than the Anubis proof of work does to complete), and also are unfazed by redirects. Again, these are issues you deal with in normal everyday scraping. And also, do you disagree with the traffic statistics from Cloudflare's site? If we're seeing anything close to that 18% increase then it would not seem to merit user-visible levels of mitigation. Even if it was 180% you wouldn't need to do this. nginx is not constantly on the verge of failing from a double-digit "traffic spike".

As I mentioned in my response to the Anubis author here [3], I don't want this to be misinterpreted as a "defense of AI scrapers" or something. Our goals are aligned. The response there goes into detail that my motivation is that a project I am working on will potentially not be possible if I am wrong and this AI scraper phenomenon is as described. I have every incentive in the world to just want to get to the bottom of this. Perhaps you're right, and I still don't understand the purpose of Anubis. I want to! Because currently neither the numbers nor the mitigations seem to line up.

BTW, my same request extends to you, if you have direct experience with this issue, I'd love to jump on a call to wrap my head around this.

My email is my HN username at gmail.com if you want to reach out, I'd greatly appreciate it!

1. https://apify.com/apify/puppeteer-scraper

2. https://news.ycombinator.com/item?id=44944761

3. https://news.ycombinator.com/item?id=44944886


Or major web properties for that matter.


I could really really use something that would OCR and classify all the screenshots I take of stuff to remember. Have an enormous folder of the damn things.


Have you tried “Keep It”?

Keep It is a notebook and document organizer for Mac, and is also available as a separate app for iPhone and iPad. Keep It can create and edit notes, rich text, plain text and Markdown files, scan documents, edit PDFs, archive emails, save web links in a variety of formats, preview and search just about any kind of file, and organize these in a variety of ways. All the files, folders and tags you store in Keep It are available in the Finder, and can be shared with other Keep It users via iCloud.

https://reinventedsoftware.com/keepit/


> I could really really use something that would OCR and classify all the screenshots I take of stuff to remember.

IMHO, a Spotlight Importer[0] would be the way to go. A quick search found the MacOS Vision OCR[1] project, which might be able to be incorporated as an importer.

In any event, whatever OCR approach you prefer, leveraging Spotlight would obviate the need for a service to index and then find screenshots.

0 - https://developer.apple.com/library/archive/documentation/Ca...

1 - https://github.com/bytefer/macos-vision-ocr


…Or scans that come off a page scanner. For years, I’ve done it deterministically by looking for key data in the OCR; but the process is fragile.


I think that it’s probably doable in DEVONthink. There are flows to automatically OCR and to organize files into folders based on content.


> Instead of tapping buttons to bold text or create headers, users could type *bold* or # Header directly into their notes.

Which will be more keystrokes, not fewer – it's faster to get to the formatting buttons than to the punctuation keyboard on iOS, and even on Mac the shortcut commands are often faster too.

Notes was a fantastic example of a rich-text environment, but if Markdown input helps the die-hards that is great, so long as I don't have to ever see, use or be aware of it.


I tend to copy Markdown content from other sources into Apple Notes. Being able to paste into Notes and have it format in the view is a big win.

It is possible reporting is getting this wrong and the Markdown feature is just to serve the use case above. As an example, Google Docs recently enabled "Paste from Markdown", which is also a huge convenience.


> I tend to copy Markdown content from other sources into Apple Notes. Being able to paste into Notes and have it format in the view is a big win.

AIUI it's only Markdown export support for now


Yeah, this would be huge for me; I often toss a bunch of notes into an Apple Notes note just to have it in my pocket, and everything I write with a keyboard is markdown.

This just makes it so I don't have to stare at a bunch of random characters and can have actual formatting. A win in my book!


Apple Notes is used not only with onscreen keyboards in iOS, but also with physical keyboards in iPadOS and macOS, where familiarity with Markdown input could make it faster than shortcut commands.


I think it’s user preference. For me it’s easier to get to the punctuation asterisk (basically muscle memory and <0.5s) than to tap the formatting menu, wait for the keyboard sheet to go down and the formatting sheet to go up, tap Bold, then tap the X and wait for animations to complete again (which I have no muscle memory for and need to look at every UI element that I’m tapping).


But formatting already-typed text on iOS is incredibly fiddly (as you have to select a text span first, and iOS fights you every step of the way when you do this — especially if the span starts or stops at something iOS doesn't consider a "whole token.")

Meanwhile, inserting punctuation representing formatting into already-typed text, merely requires placing the insertion caret, which is much less fiddly.


I was just typing a reply to gush about how much I agree with you and how awful iOS text editing is (I’m on my iPhone at the moment) and decided to play with editing text to get some example complaints, when I had an epiphany:

iOS lets you double tap to start a text selection now. I don’t know when this started. I’m 99% sure I used to long-press to start a text selection, and that it would start highlighting the word under the little preview bubble. My muscle memory is still to do this when I want to highlight text; it just never works and I always get frustrated.

Maybe if I start remembering to double tap to highlight text, the text editing experience might actually start to be passable? :shrug:

(Yes, I know about long pressing the keyboard to use it as a trackpad. I do that most of the time, but it’s still fiddly, it very very often misinterprets a tap and starts text selection wildly off from where I wanted it to, and the only fix is to tap around in the text area.)


How can you be 99% sure when long press in *editable text* like when writing notes doesn't select, but instead moves the caret with a magnifying glass?


I said I was 99% sure it used to work that way.

But yes, you’re right about editable text being the difference: my memory of long pressing to highlight/select is exactly how text selection works for noneditable text, like in regular web sites in safari.

That’s the big inconsistency, and why I’m always frustrated by iOS text editing. Long pressing normal text highlights it, but long pressing editable text does not.

So it’s not that they changed something, it’s that the behavior is different for editable vs noneditable text, and my brain keeps doing the wrong one. Maybe now that I know about double tapping my brain can finally have a complete picture of the behavior split and I can stop fucking it up each time.

(Although I’m still pretty certain that doing a brief long press, but not long enough for the magnifying glass to show up, used to select a word of text. I can’t prove this though. Maybe I’m remembering the Force Touch days when you used to be able to do a Force Touch while long pressing to expand selection. That would make sense with the timeline.)


hm, here is the original iPhone user guide for iOS 3.1

> When you're typing, you can also double-tap to select a word. In read-only documents, such as webpages, or email or text messages you've received, touch and hold to select a word.

Also, double tapping selects by words in editable notes vs by letter in read-only, so the OS will continue to fight your feeble attempts at trying to have a consistent experience!


I also remember the same thing as GP, and I believe I got it from the very first iPad.


I don’t know if it’s just me, but it feels like it’s gotten more fiddly with time. Either that or I notice it more, because I’m using the thing more as time goes on.

I can't understand people who use an iPad full time. My dad does this and I don't know how he doesn't drive himself mad with all the taps required to do basic things.


> iOS fights you every step of the way when you do this

oh, indeed, that's true even for simple movements: you tap somewhere, the cursor jumps there momentarily and then jumps back. You tap again, same thing. So the system knows what you want, but is just "competently" engineered in a way to ignore you...


Current

    - Tap Aa
    - Tap Italic
    - Tap close
    - Type
    - Tap Aa
    - Tap Italic (to reset)
    - Tap close

Future

    - Hold a key to insert asterisk
    - Type
    - Hold a key to insert asterisk

