We recently noticed that someone crawling from idris.fr was impersonating CCBot in their user-agent string. We contacted IDRIS, and they said they would stop doing it. It would be good if the img2dataset documentation said:

1) Use your own user-agent; don't impersonate anyone else.
2) Respect robots.txt and X-Robots-Tag headers.
3) Respect the CCBot robots.txt and X-Robots-Tag rules in addition to the rules for your own bot.
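
For illustration, here is a rough sketch of what points 2 and 3 could look like, using only the Python standard library. It is not how img2dataset works today. The bot name `my-research-crawler`, the helper names, and the set of directives treated as "do not use" are all assumptions made up for this example, and the X-Robots-Tag parsing is deliberately simplified.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical bot name for illustration only: use your own, never "CCBot".
MY_USER_AGENT = "my-research-crawler"
CCBOT_USER_AGENT = "CCBot"

# Assumed policy: which X-Robots-Tag directives mean "do not use this resource".
DISALLOWED_DIRECTIVES = ("noindex", "noimageindex", "none")


def parse_robots(robots_txt: str) -> RobotFileParser:
    """Build a parser from the text of an already-downloaded robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def robots_txt_allows(parser: RobotFileParser, url: str) -> bool:
    """Allow a URL only if both our own rules and the CCBot rules permit it."""
    agents = (MY_USER_AGENT, CCBOT_USER_AGENT)
    return all(parser.can_fetch(agent, url) for agent in agents)


def x_robots_allows(header_values, disallowed=DISALLOWED_DIRECTIVES) -> bool:
    """Check X-Robots-Tag values, honoring unscoped rules and rules scoped
    to our bot or to CCBot. Simplification: any other "name:" prefix is
    treated as a rule for a different bot and skipped."""
    ours = {MY_USER_AGENT.lower(), CCBOT_USER_AGENT.lower()}
    for value in header_values:
        scope, sep, rest = value.partition(":")
        scope = scope.strip().lower()
        if sep and "," not in scope:
            if scope not in ours:
                continue  # scoped to some other bot
            value = rest
        directives = {d.strip().lower() for d in value.split(",")}
        if directives & set(disallowed):
            return False
    return True


EXAMPLE_ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /private/

User-agent: *
Disallow: /tmp/
"""

if __name__ == "__main__":
    rules = parse_robots(EXAMPLE_ROBOTS_TXT)
    for path in ("/images/a.jpg", "/private/a.jpg", "/tmp/a.jpg"):
        url = "https://example.com" + path
        print(path, robots_txt_allows(rules, url))
    print(x_robots_allows(["ccbot: noindex"]))
```

In this sketch a URL is skipped if either rule set forbids it, so a site that blocked only CCBot is still respected even though the crawler identifies itself under its own name.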