Astro - Hacker News

13 comments

bstsb an hour ago

> Get Outlook for Mac
this bit made me laugh. was the email drafted in Outlook? was it sent to some sort of forwarding mailbox, or did they just BCC every customer in?
jacobn an hour ago

I just complained to them the other day! They were scraping our weather website to no end, very much including the disallowed path prefixes.
Did end up just adding them to our WAF blocklist, which is weirdly ironic - hosting on their infra & using their services to block their AI scraper...
- BLKNSLVR 33 minutes ago
  
  I hope you leave it on the WAF. If they're only just deciding to respect robots.txt, which has been internet infrastructure forever, then it's probably still incredibly amateur software with 'Amazon-priorities' rather than 'responsible internet traffic' priorities.
TurdF3rguson 40 minutes ago

Why does Amazonbot even exist, can someone explain? I don't understand why an ecommerce play would be crawling other websites.
- input_sh 34 minutes ago
  
  To train AI. That's not even hyperbole; it is the only concrete example they list in their explanation: https://developer.amazon.com/amazonbot
  > Amazonbot is used to improve our products and services. This helps us provide more accurate information to customers and may be used to train Amazon AI models.
- tintor 34 minutes ago
  
  To ensure Amazon marketplace sellers aren't offering lower prices on other ecommerce websites. Also AI.
- embedding-shape 35 minutes ago
  
  Amazonbot is specifically the user agent they use when crawling to "provide more accurate information to customers" (whatever that means; it sounds like it could be anything) and also when they scrape data used for AI training, according to https://developer.amazon.com/amazonbot
- reaperducer 35 minutes ago
  
  AI. Gotta slurp the world.
arjie an hour ago

Huh, I get a lot of traffic from Amazonbot (relative to humans) and try as I might, it would get stuck in a tarpit of no creation because it would sit there and keep blasting every variation of my recent pages because Mediawiki lists many links. I have those links appropriately marked nofollow, and robots.txt warns the bot not to waste its time, but it just goes and sticks itself on nonsense internal pages.
The traffic isn't a problem. I've got Cloudflare in front and the machine itself is relatively overpowered, and downtime isn't critical. But I'd just like the thing to be able to spider me properly. Someone did point out to me that maybe I wasn't receiving actual Amazonbot but some other spider: https://news.ycombinator.com/item?id=46352723
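For anyone wanting to check whether a hit really comes from the advertised crawler, the usual approach is a reverse DNS lookup on the client IP followed by a forward lookup that must map back to the same IP. A rough Python sketch follows. The function name `is_verified_crawler` and the injectable `reverse`/`forward` parameters are made up for illustration, and the `crawl.amazonbot.amazon` suffix is my assumption about what Amazon documents, so confirm it against https://developer.amazon.com/amazonbot before relying on it.

```python
import socket


def is_verified_crawler(ip, suffix="crawl.amazonbot.amazon", reverse=None, forward=None):
    """Reverse-then-forward DNS check for a crawler IP (IPv4 only in this sketch).

    `suffix` is an assumed hostname suffix; `reverse` and `forward` are
    hypothetical hooks so the DNS lookups can be swapped out in tests.
    """
    if reverse is None:
        # gethostbyaddr returns (hostname, aliaslist, ipaddrlist)
        reverse = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward is None:
        # gethostbyname_ex returns (hostname, aliaslist, ipaddrlist)
        forward = lambda host: socket.gethostbyname_ex(host)[2]

    try:
        host = reverse(ip).rstrip(".").lower()
    except OSError:
        # No PTR record or lookup failure: cannot verify
        return False

    # The PTR hostname must be the suffix itself or a subdomain of it
    if not (host == suffix or host.endswith("." + suffix)):
        return False

    try:
        # The forward lookup must include the original IP
        return ip in forward(host)
    except OSError:
        return False
```

A user agent string alone proves nothing, since anyone can send "Amazonbot" in the header; the DNS round trip is what ties the request to the operator's own domain.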
namegulf an hour ago

Robots.txt is lame BTW, there is no way to enforce it. It is up to the bot to decide whether to crawl or not, and in most cases they don't care.
Cloudflare had a nice technique to address the bot problem (if you use their name servers). It'll respect and use your robots.txt while sending the remaining bots into a deep black hole.
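To make the "up to the bot" point concrete, here is a minimal sketch of what a well-behaved crawler is expected to do before each fetch, using Python's standard `urllib.robotparser`. The robots.txt contents, the `/private/` path, the `example.com` URLs, and the helper name `can_fetch` are all invented for illustration; nothing here reflects how any real crawler is implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block Amazonbot from /private/, allow everyone else.
ROBOTS_TXT = """\
User-agent: Amazonbot
Disallow: /private/

User-agent: *
Disallow:
"""


def can_fetch(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(can_fetch(ROBOTS_TXT, "Amazonbot", "https://example.com/private/page"))
    print(can_fetch(ROBOTS_TXT, "Amazonbot", "https://example.com/public/page"))
```

The check is entirely client-side: a crawler that never calls something like this, or ignores the answer, sees no difference on the wire, which is why blocking still has to happen at the server or WAF.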
- input_sh 27 minutes ago
  
  Yes, we know, its purpose is to guide the bots, not forcibly block them.
  That said, one of the biggest websites in the world not respecting it is definitely a noteworthy story. Hopefully another one of the biggest websites in the world (formerly known as Twitter) eventually respects it as well instead of not even disclosing itself via a user agent and pretending to be Safari running on iOS.
  - namegulf 14 minutes ago
    
    Why downvote a comment?
    You're talking about one (yes, the biggest), but millions of other bots not following it must be a bigger story.