Need way to securely verify Internet Archive crawler #507

mlissner · 2017-11-20T22:37:55Z

I could be mistaken, but I haven't been able to find the documentation on how to verify the IA crawler. The closest thing I've found is that you can check the User-Agent string of the crawler, but that's easily faked. My issue is that I want to invite the IA crawler to crawl my content, but I want to detect things like spammers and block them.

Google and Bing both handle this by using a reverse DNS request of the IP address of the crawler, followed by a regular DNS request checking the host returned by the reverse DNS.

Put another way:

$ host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.

$ host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1

So, since the second host command returns the same IP that we started with, and since the domain ends with googlebot.com, we're in business.
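The two-step check above can be sketched in Python. This is only an illustration of the reverse-then-forward lookup as described here, not anything IA has published: the function name `is_verified_crawler`, its parameters, and the injectable lookup helpers are hypothetical, and the `googlebot.com` suffix is taken from the example above.

```python
import ipaddress
import socket


def _reverse_lookup(ip):
    """PTR lookup: return the hostname the IP's reverse DNS record points to."""
    return socket.gethostbyaddr(ip)[0]


def _forward_lookup(hostname):
    """A/AAAA lookup: return every address the hostname resolves to."""
    return {info[4][0] for info in socket.getaddrinfo(hostname, None)}


def is_verified_crawler(ip, allowed_suffixes,
                        reverse_lookup=_reverse_lookup,
                        forward_lookup=_forward_lookup):
    """Return True only if `ip` passes both halves of the check.

    `allowed_suffixes` is an assumption: the domains the crawler operator
    has documented, e.g. ["googlebot.com"].
    """
    try:
        hostname = reverse_lookup(ip).rstrip(".").lower()
    except OSError:
        return False  # no PTR record (NXDOMAIN) or the lookup failed

    # Step 1: the PTR hostname must be the documented domain or a subdomain.
    if not any(hostname == s or hostname.endswith("." + s)
               for s in allowed_suffixes):
        return False

    # Step 2: the hostname must resolve back to the IP we started with.
    # Without this, anyone controlling their own reverse zone could claim
    # an arbitrary hostname.
    try:
        addresses = forward_lookup(hostname)
    except OSError:
        return False
    target = ipaddress.ip_address(ip)
    return any(ipaddress.ip_address(a.split("%")[0]) == target
               for a in addresses)


if __name__ == "__main__":
    # Performs live DNS queries; the result depends on current DNS data.
    print(is_verified_crawler("66.249.66.1", ["googlebot.com"]))
```

The lookups are passed in as parameters so the logic can be exercised without network access; in production you would probably also want to cache results, since this costs two DNS queries per new IP.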

Here are Google's docs: https://support.google.com/webmasters/answer/80553?hl=en
And Bing's: https://www.bing.com/webmaster/help/how-to-verify-bingbot-3905dc26

Could IA add this feature too? I think it would only require that you do some work with your DNS whenever you have a new IP address.

nlevitt · 2018-01-25T01:37:23Z

Sorry, seeing this for the first time.

I think your DNS technique may already work for our crawlers?

1 reply

ansuz Nov 23, 2025

I've received requests from the following IP addresses:

152.53.36.192
152.53.39.118
2607:5300:60:6dd4::
2a03:4000:5c:152::
94.16.31.119

identified by this user-agent string:

ArchiveTeam ArchiveBot/20250806.050c783 (wpull 2.0.3) and not Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36

A reverse DNS lookup yields the following results:

$ for ip in 152.53.36.192 152.53.39.118 2607:5300:60:6dd4:: 2a03:4000:5c:152:: 94.16.31.119; do echo "$ip"; host "$ip"; echo; done
152.53.36.192
192.36.53.152.in-addr.arpa domain name pointer v2202411233004298790.luckysrv.de.

152.53.39.118
118.39.53.152.in-addr.arpa domain name pointer v2202411243627299646.bestsrv.de.

2607:5300:60:6dd4::
Host 0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.4.d.d.6.0.6.0.0.0.0.3.5.7.0.6.2.ip6.arpa not found: 3(NXDOMAIN)

2a03:4000:5c:152::
Host 0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.2.5.1.0.c.5.0.0.0.0.0.4.3.0.a.2.ip6.arpa not found: 3(NXDOMAIN)

94.16.31.119
Host 119.31.16.94.in-addr.arpa. not found: 3(NXDOMAIN)

Can you confirm whether this is evidence of others spoofing their agent string to pass as ArchiveTeam bots?

If so, could you point to an IP address of a valid ArchiveBot crawler so I could validate the method against a legitimate example?
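For reference, the `in-addr.arpa` and `ip6.arpa` names queried in the output above can be reproduced with Python's standard `ipaddress` module, which may help when scripting the same lookups for both IPv4 and IPv6. This is a small sketch; the helper name `ptr_query_name` is hypothetical.

```python
import ipaddress


def ptr_query_name(ip):
    """Return the DNS name that a reverse (PTR) lookup for `ip` queries."""
    return ipaddress.ip_address(ip).reverse_pointer


if __name__ == "__main__":
    for ip in ["152.53.36.192", "2607:5300:60:6dd4::"]:
        print(ip, "->", ptr_query_name(ip))
```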

mlissner · 2018-01-25T01:40:18Z

Would it be possible to get this documented somewhere? I'd be afraid to start using this technique without some level of commitment from IA, for fear that it'd break your crawler.
