> The LLM companies are not picking on me in particular, they are pounding every site on the net.
Why is this not a criminal offense? They are hurting businesses for profit (or for a higher valuation, as they probably have no profit at all).
Why are corporations allowed to do with impunity what could land even a teenager years in prison? Is there no rule of law anymore?
The five-year and ten-year penalties kick in only when the government can show the offense caused at least $5,000 in losses across all victims during a one-year period. https://legalclarity.org/what-are-the-punishments-for-a-ddos...
> Why are corporations allowed to do with impunity what could land even a teenager years in prison? Is there no rule of law anymore?
Those laws are intended to protect corporations. If corporations are the ones doing the scraping, it doesn't make sense for the same laws to affect them.
Normative vs. prerogative state [1]. See US v. Swartz compared to Meta's use of LibGen for Llama.
[1] https://en.wikipedia.org/wiki/Dual_state_(model)
So, I knew Aaron, and I definitely would not presume to predict what he would have thought, but I'd point out there is a sizeable state space where he should never have been prosecuted and scraping by others, including large commercial companies, should not be prosecutable on the same grounds.
I repeat what Aaron’s friends and lawyers said at the time: we were going to fight that case, and we were going to win.
Because they have more money.
I've had to deploy a combination of Cloudflare's bot protection and Anubis on over 200 domains across 8 different hosting environments in the last 2 months. I have small business clients who couldn't access their sales and support platforms because their websites, which normally see tens of thousands of unique sessions per day, were suddenly seeing over a million in an hour.
Anthropic and OpenAI were responsible for over 70% of that traffic.
Because might makes right and any entity with the power to legally put up a fight is in on the game (or wants to be)
We've already established that computer crime and IP laws apply to normies and not tech companies
I have added a DB replica server just to keep my website from succumbing to AI bot traffic.
adapt or die
waiting on the govt to do something is a path of failure
It's a bit more like a physical business with a "public welcome" policy like a coffee shop going viral and then having tens of thousands of people walking in and taking pictures but not buying coffee. It's disruptive, but not illegal.
Acme.com is welcome to require authentication for all pages but their home page, which would quickly cause the traffic to drop. They don't want to do this - like the coffee shop, they want to be open to the public, and for good reasons.
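For illustration, a minimal sketch of that setup using Flask (the framework choice, route names, and `PUBLIC_PATHS` set are assumptions for the example, not anything a real site runs):

```python
# Minimal sketch: require a logged-in session everywhere except the home page.
# Flask is an arbitrary choice; PUBLIC_PATHS and the secret are placeholders.
from flask import Flask, abort, request, session

app = Flask(__name__)
app.secret_key = "replace-me"  # placeholder for the sketch

PUBLIC_PATHS = {"/", "/login"}  # the only pages open to anonymous traffic

@app.before_request
def require_auth():
    # Anonymous scrapers can reach the home page but nothing deeper,
    # which is exactly the openness trade-off described above.
    if request.path not in PUBLIC_PATHS and "user" not in session:
        abort(401)

@app.route("/")
def home():
    return "public home page"
```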
Sometimes the usage profile changes dramatically in a short time. Fifteen years ago, Netflix created the video streaming market, and shared bandwidth capacity that had seemed excessive before suddenly wasn't enough. Fifteen years before that, Google did the same thing when it created search and started driving tremendous traffic to text-based websites that had previously spread through word of mouth.
Turns out the micro transaction people probably had the right idea.
Is what an offence lol? Bot scraper traffic?
How do you think search engines work?
They work because they offer ways to opt out: they honor crawl delays, respect your preferred scraping times, support IndexNow, etc.
And they give you real, valuable traffic in return.
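Concretely, a polite crawler checks robots.txt before every fetch. A minimal sketch with Python's stdlib parser (the URLs and user-agent string here are made up):

```python
# Minimal sketch of crawler politeness: honor robots.txt rules and
# Crawl-delay before fetching. URLs and the user-agent are hypothetical.
import time
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

url = "https://example.com/some/page"
if rp.can_fetch("PoliteBot", url):
    time.sleep(rp.crawl_delay("PoliteBot") or 1)  # default to 1s between hits
    # ... fetch url here ...
else:
    pass  # the site opted out; a polite crawler simply moves on
```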
Because the law deals with intent. The intent of a 12-year-old skiddie with a DDoS box is to harm someone else's internet; the intent of big scrapers is to collect data. If you want to make the latter illegal, then vote for that instead of loading it with the normative baggage of the former.
It's the same problem that made Occupy Wall Street fall apart: a bunch of losers who don't understand the system screeching about the system. Because they don't understand it, they can't offer any meaningful dialogue about how to fix it beyond screeching.
I suspect part of the issue is that people are still using things like `acme.com` and `demo.com` as example domains in their documentation and tests instead of relying on `example.com`, which is reserved exactly for this purpose [0].
[0]: https://www.iana.org/domains/reserved
A small part. On my server AI bots outnumber real visitors 300 to one.
I don't mean that users are following the links to `acme.com` and `demo.com` type domains in documentation; I mean that bots are likely finding and following many links to them because of their widespread use in documentation.
If you search Google for `site:github.com "acme.com"`, you'll find numerous instances of the domain being used in contrived documentation links as an example of how URLs might be structured on an arbitrary domain, and in issues to demonstrate a fully qualified URL without giving away the actual domain people were using.
This means numerous links point to non-existent paths on `acme.com`, simply because of how people use the domain in documentation and examples.
That's such an absolutely ludicrous thing to hear, in a "wtf are these people doing" kind of way. I can't imagine a non-social-media site generating enough new content that these bots would need to be scraping it essentially continuously. It's just gross to me that they're okay with that level of unsophisticated effort, doing the same thing over and over with zero gain.
Where from? And quite frankly, why? There are existing training data sets that are large enough for smaller models, and larger models have been focusing on data quality more than quantity. There's limited utility in further indiscriminate, widespread scraping.
Bot traffic is crazy even for smaller sites, but still manageable. I was getting 2,000 visitors a day on my infrequently updated website, but after I blocked all the bots via Cloudflare it went back to the normal double-digit visitor count.
I have 6M pages across 8 domains. I have 10 unique IP residential bots per second working hard to scrape every single page.
And how often are those 6M pages changing? How often are those bots finding anything new? Why are the bot makers not noticing that nothing has changed and slowing their requests down, given the content is essentially stale to them?
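Back-of-the-envelope, using the numbers from the two comments above (treating the 10 bots per second as 10 pages per second, which is an assumption):

```python
# Rough arithmetic: how long one full pass over the site takes at that rate.
pages = 6_000_000
pages_per_second = 10              # assuming each bot hit fetches one page
seconds = pages / pages_per_second
print(seconds / 86_400)            # ~6.9 days per complete scrape
```

So even a complete pass takes about a week, and on a site that isn't changing that fast, nearly all of it is re-fetching stale content.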
> Nearly all of them were for non-existent pages.
Do any webservers have a feature where they keep a list in memory of files/paths that exist?
Also, why are most requests for non-existent pages?
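On the first question, the idea itself is simple to sketch: load the set of real paths once at startup and reject everything else up front. A minimal version, assuming a WSGI app and a static docroot (both assumptions for illustration, not a feature of any particular server):

```python
# Sketch of an in-memory allowlist of paths that actually exist:
# build the set once at startup, reject everything else immediately.
import os

DOCROOT = "/var/www/site"  # hypothetical document root

VALID_PATHS = set()
for root, _dirs, files in os.walk(DOCROOT):
    for name in files:
        rel = os.path.relpath(os.path.join(root, name), DOCROOT)
        VALID_PATHS.add("/" + rel.replace(os.sep, "/"))

def app(environ, start_response):
    if environ.get("PATH_INFO", "/") not in VALID_PATHS:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found"]  # no disk or database work spent on probes
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"serve the file here"]
```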
Because they are hunting for vulnerable devices, and the paths they request are unique to a particular application, like a VoIP appliance for example.
They usually request something deep like /foo/bar/login.html as part of their reconnaissance.
I'm up to 4 pages of filter rules after the massive IP blacklist.
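For anyone maintaining similar rules, the general shape is just CIDR matching. A sketch with Python's stdlib (the networks listed are reserved documentation ranges, not a real blacklist):

```python
# Sketch of matching client IPs against a CIDR blacklist.
# The networks below are documentation ranges, not real offenders.
import ipaddress

BLACKLIST = [ipaddress.ip_network(n) for n in ("203.0.113.0/24", "198.51.100.0/24")]

def is_blocked(client_ip: str) -> bool:
    ip = ipaddress.ip_address(client_ip)
    return any(ip in net for net in BLACKLIST)

print(is_blocked("203.0.113.7"))  # True
print(is_blocked("192.0.2.1"))    # False
```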
The only real solution is to put Anubis in front. For me, just using Cloudflare in front suffices: it lets through only a few thousand requests per hour by default, and my homeserver can handle that quite well on its own.
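As a toy sketch of what that kind of per-IP hourly cap looks like (this is not how Cloudflare or Anubis actually work internally; the threshold just echoes "a few thousand per hour"):

```python
# Toy per-IP sliding-window rate limiter; purely illustrative.
import time
from collections import defaultdict

LIMIT_PER_HOUR = 3000     # assumed cap, echoing the comment above
hits = defaultdict(list)  # ip -> timestamps of recent requests

def allow(ip):
    now = time.time()
    hits[ip] = [t for t in hits[ip] if now - t < 3600]  # keep the last hour
    if len(hits[ip]) >= LIMIT_PER_HOUR:
        return False      # over the cap: challenge or drop the request
    hits[ip].append(now)
    return True
```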
A series of Chinese LLM scrapers kept PortableApps.com running slow and occasionally unresponsive for 2 weeks.
There are plenty of local LLMs out there run by humans that play nice. It's not the LLMs that are the problem. It's the corporations. That's the commonality. Human people aren't doing this. These corporate legal persons are a much more dangerous and capable form of non-human intelligence with non-human motives than LLMs (which are not doing the scraping or even calling the tools which are sending the HTTP requests). And they have lobbied their way to legal immunity from most of their crimes.
> Human people aren't doing this
Who do you think writes these scrapers? Well, I mean aside from the vibe-coded ones.
> Someone really ought to do something about it.
What is bro proposing here?