The move comes at a time when artificial intelligence firms have
been accused of plagiarizing content from publishers to create
AI-generated summaries without giving credit or asking for
permission.
Reddit said it would update its Robots Exclusion Protocol file,
known as "robots.txt," a widely accepted standard that tells
automated crawlers which parts of a site they may access.
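For illustration, a hypothetical robots.txt file that admits one
named crawler and turns away all others might read as follows
(the user-agent names here are placeholders, not Reddit's actual
rules):

    User-agent: ApprovedResearchBot
    Allow: /

    User-agent: *
    Disallow: /

Crawlers that honor the standard read this file before fetching
pages, but compliance is voluntary, which is why Reddit is
pairing it with enforcement measures such as rate-limiting and
blocking.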
The company also said it will maintain rate-limiting, a
technique used to cap the number of requests from a single
entity, and will block unknown bots and crawlers from data
scraping (collecting and saving raw information) on its website.
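Reddit did not disclose how its rate-limiting works. As a rough
sketch of the general technique, a token-bucket limiter keyed by
client identity could be implemented along these lines in Python
(all names and limits here are illustrative assumptions, not
Reddit's implementation):

    import time
    from collections import defaultdict

    class TokenBucket:
        """Allows roughly `rate` requests per second, with bursts up to `capacity`."""
        def __init__(self, rate: float, capacity: float):
            self.rate = rate          # tokens replenished per second
            self.capacity = capacity  # maximum stored tokens (burst allowance)
            self.tokens = capacity
            self.last = time.monotonic()

        def allow(self) -> bool:
            now = time.monotonic()
            # Refill in proportion to elapsed time, capped at capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1  # spend one token on this request
                return True
            return False          # bucket empty: reject the request

    # One bucket per client, e.g. keyed by API token or IP address.
    # The limits are placeholder values for illustration only.
    buckets = defaultdict(lambda: TokenBucket(rate=1.0, capacity=5.0))

    def handle_request(client_id: str) -> str:
        if buckets[client_id].allow():
            return "200 OK"
        return "429 Too Many Requests"

A per-client bucket like this lets normal users burst briefly
while throttling any single entity that issues sustained
high-volume requests, which is the behavior characteristic of
scrapers.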
More recently, robots.txt has become a key tool that publishers
employ to prevent tech companies from using their content free
of charge to train AI algorithms and create summaries in
response to some search queries.
Last week, a letter to publishers from the content-licensing
startup TollBit said that several AI firms were circumventing
the web standard to scrape publisher sites.
This follows a Wired investigation that found AI search startup
Perplexity likely bypassed efforts to block its web crawler via
robots.txt.
Earlier in June, business media publisher Forbes accused
Perplexity of plagiarizing its investigative stories for use in
generative AI systems without giving credit.
Reddit said on Tuesday that researchers and organizations such
as the Internet Archive will continue to have access to its
content for non-commercial use.
(Reporting by Harshita Mary Varghese; Editing by Alan Barona)
© 2024 Thomson Reuters. All rights reserved. This material may
not be published, broadcast, rewritten or redistributed. Thomson
Reuters is solely responsible for this content.