Wayback Machine whois Scanner
The post from archive.org on retroactive robots.txt and .gov .mil usage got me thinking about how data and sites, specifically URLs, change hands for any number of reasons. What constitutes a hostile takeover? What constitutes an acceptable change? Whilst reading the article, this quote struck me as something that could possibly be solved by internet magic:
Another problem is knowing when a domain name changes hands, so a current robots.txt file is not relevant to a different era.
By the end of the article I had thought of a possible solution: whois information. I may move DNS hosts and I may move where I host my website, but my whois information will stay the same until someone else buys the domain or I let it expire. I’ve personally put too much work into my domain to let it expire now, but what about others?
whois data as a … dictionary object?
So my current thinking is to take a sha512 hash of the whois data and store it in a dict (object? JSON?) of the form {url, date-of-registration, hash}, and to use those records to generate periods of ownership.
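A minimal sketch of such a record in Python, assuming the raw whois text is already available as a string. The function name `ownership_record` and the trimmed sample input are made up for illustration; only the three keys come from the idea above.

```python
import hashlib
import json


def ownership_record(url: str, registered_on: str, whois_text: str) -> dict:
    """Build a {url, date-of-registration, hash} record for one whois lookup."""
    # SHA-512 over the UTF-8 bytes of the whois text; the hex digest is 128 characters.
    digest = hashlib.sha512(whois_text.encode("utf-8")).hexdigest()
    return {
        "url": url,
        "date-of-registration": registered_on,
        "hash": digest,
    }


# Sample input: a shortened, hand-typed whois response (an assumption, not a live lookup).
record = ownership_record(
    "slowb.ro",
    "2013-06-04",
    "Domain Name: slowb.ro\nRegistered On: 2013-06-04\n",
)
print(json.dumps(record, indent=2))
```

Hashing the whole response like this is the naive version: any field that changes, even a nameserver, changes the hash.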
To do this we need enough entropy from the whois data to make sure we don’t generate any hash collisions. We are also only allowed to use whois data that stays static for each URL. With these two constraints we should, in theory, be able to avoid hash collisions and reliably determine when an owner changes, so that we do not retroactively remove content from the same URL when a new owner is detected.
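To make "determine when an owner changes" concrete, here is a rough sketch that collapses a series of dated hashes into runs of unchanged ownership. The `Snapshot` type, the `ownership_periods` function and the short placeholder fingerprints are all assumptions for illustration; a real fingerprint would be a full sha512 hex digest.

```python
from typing import List, NamedTuple, Tuple


class Snapshot(NamedTuple):
    date: str         # ISO date of the whois lookup, e.g. "2015-01-01"
    fingerprint: str  # hash of the static whois fields on that date


def ownership_periods(snapshots: List[Snapshot]) -> List[Tuple[str, str, str]]:
    """Collapse dated snapshots into (first_seen, last_seen, fingerprint) runs."""
    periods: List[Tuple[str, str, str]] = []
    for snap in sorted(snapshots):  # ISO dates sort chronologically as strings
        if periods and periods[-1][2] == snap.fingerprint:
            # Same fingerprint as the previous snapshot: extend the current run.
            first_seen, _, fingerprint = periods[-1]
            periods[-1] = (first_seen, snap.date, fingerprint)
        else:
            # Fingerprint changed (or first snapshot): start a new run.
            periods.append((snap.date, snap.date, snap.fingerprint))
    return periods


history = [
    Snapshot("2016-01-01", "bbb"),
    Snapshot("2014-01-01", "aaa"),
    Snapshot("2015-01-01", "aaa"),
]
print(ownership_periods(history))
# [('2014-01-01', '2015-01-01', 'aaa'), ('2016-01-01', '2016-01-01', 'bbb')]
```

Under this scheme a robots.txt seen in 2016 would only be applied to captures inside the second run, not to the 2014 to 2015 one.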
The following is the whois information for my domain: slowb.ro
Domain Name: slowb.ro
Registered On: 2013-06-04
Registrar: EPAG Domainservices GmbH
Referral URL: http://www.epag.de
DNSSEC: Inactive
Nameserver: ns1.he.net
Nameserver: ns2.he.net
Nameserver: ns3.he.net
Nameserver: ns5.he.net
Nameserver: ns4.he.net
Domain Status: OK
As you can see there is a serious lack of information to work with.
Note: at the time of writing, .ro was one of the very few TLDs that had still not implemented DNSSEC! See: .ro TLD DNSSEC News update
There are literally only two points of data that we can use. No wonder people love to use .ro domains for suspect activities:
Domain Name: slowb.ro
Registered On: 2013-06-04
These two bits of information are the smallest number of data points that we have to work with. But what about our maximum? Let’s have a look at a .org, which is required to not have a private whois service:
Domain Name:ARCHIVE.ORG
Domain ID: D2445039-LROR
Creation Date: 1995-12-14T05:00:00Z
Updated Date: 2013-03-04T00:20:16Z
Registry Expiry Date: 2022-12-13T05:00:00Z
Sponsoring Registrar:easyDNS Technologies Inc. (R1247-LROR)
Sponsoring Registrar IANA ID: 469
WHOIS Server:
Referral URL:
Domain Status: clientTransferProhibited
Domain Status: clientUpdateProhibited
------ snip ------
There is a serious amount of information compared to little old slowb.ro. The choice now is to decide what information we need, want, or require to fulfill our previous constraints.
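Before choosing fields, the raw text has to become something we can pick from. A small sketch of a parser, assuming every useful line looks like `Key: value` (or `Key:value`, as in the .org output) and that repeated keys such as Domain Status should be kept as lists. The name `parse_whois` is hypothetical, and real whois output varies far more than this handles.

```python
from collections import defaultdict
from typing import Dict, List


def parse_whois(text: str) -> Dict[str, List[str]]:
    """Turn 'Key: value' lines into {key: [values]}; other lines are skipped."""
    fields: Dict[str, List[str]] = defaultdict(list)
    for line in text.splitlines():
        # Split on the first colon only, so timestamps and URLs stay intact.
        key, sep, value = line.partition(":")
        if not sep or not key.strip():
            continue  # no colon at all (e.g. the snip marker) or an empty key
        fields[key.strip()].append(value.strip())
    return dict(fields)


sample = """Domain Name:ARCHIVE.ORG
Domain ID: D2445039-LROR
Creation Date: 1995-12-14T05:00:00Z
WHOIS Server:
Domain Status: clientTransferProhibited
Domain Status: clientUpdateProhibited
------ snip ------"""

parsed = parse_whois(sample)
for key, values in parsed.items():
    print(key, "->", values)
```

Empty fields such as WHOIS Server are kept with an empty string, so a later step can tell "present but blank" apart from "missing".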
My original choice for variable data was to include as much as possible to reduce the chances of hash collisions. However, any contact information such as Tech, Accounts or Registrant can change while the domain stays with the same company. It could be updated to a new department, or change hands to the new IT guy because he’s straight out of college and updated all the details into his name by accident.
Working out our data points
Domain Name:ARCHIVE.ORG
Domain ID: D2445039-LROR
Creation Date: 1995-12-14T05:00:00Z
Updated Date: 2013-03-04T00:20:16Z
Registry Expiry Date: 2022-12-13T05:00:00Z
Registrant Name:Internet Archive
I believe we can remove Updated Date and Expiry Date, as both of those can be changed with no indication of a change in ownership. But I will possibly have to add them back in again, as my plan is starting to unravel. I started asking all the questions that would haunt me for the next few days:
- Does the creation date update after it expires? (Possibly not?)
- What happens when people rebuy it, during the hold period?
- Does each TLD have its own implementation of information given to the public registry?
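Leaving those questions open, the field selection from this section can be sketched as follows: hash only the data points kept above, with Updated Date and Expiry Date left out. The `STABLE_FIELDS` tuple, the `fingerprint` function, the lowercasing step and the two modified records are all assumptions for illustration. The input is a flat dict of field name to a single value.

```python
import hashlib
import json

# Assumption: these are the fields treated as stable. The names follow the
# ARCHIVE.ORG record and would differ for other TLDs.
STABLE_FIELDS = ("Domain Name", "Domain ID", "Creation Date", "Registrant Name")


def fingerprint(fields: dict) -> str:
    """SHA-512 over the stable fields only, serialised as canonical JSON."""
    # Normalise whitespace and case so cosmetic differences do not change the hash.
    stable = {name: fields.get(name, "").strip().lower() for name in STABLE_FIELDS}
    canonical = json.dumps(stable, sort_keys=True, separators=(",", ":"))
    return hashlib.sha512(canonical.encode("utf-8")).hexdigest()


original = {
    "Domain Name": "ARCHIVE.ORG",
    "Domain ID": "D2445039-LROR",
    "Creation Date": "1995-12-14T05:00:00Z",
    "Updated Date": "2013-03-04T00:20:16Z",
    "Registry Expiry Date": "2022-12-13T05:00:00Z",
    "Registrant Name": "Internet Archive",
}
# Hypothetical variations, invented for illustration only.
renewed = {**original, "Updated Date": "2020-01-01T00:00:00Z"}
new_owner = {**original, "Registrant Name": "Someone Else"}

print(fingerprint(original) == fingerprint(renewed))    # True
print(fingerprint(original) == fingerprint(new_owner))  # False
```

Note that this still trips on the scenario described earlier, where the registrant name changes but the real owner does not, and it says nothing about what a registry does to the creation date after an expiry.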
Closing Thoughts
Whilst this was a fun mental exercise, without serious investigation of the whois format and a way to basically scrape whois data, which has always been known to be a hotspot of personal information, I believe it to be just a pipe dream. On further investigation, and since a certain president was elected, the Internet Archive has changed its stance on retroactively removing content based on robots.txt. Please read more here. If you feel like it, there is also a lot of information on this hot topic (mostly old, from before the 2017/04 post); see more here.
I cannot imagine the amount of data the Internet Archive actually saves for future generations, and I commend them on their efforts. In a day and age where Wikipedia is a source of truth that anyone can edit and democratically elected officials promote an oligarchic society, hopefully saving this content can save humankind from making the same mistakes.
Disclaimer: All whois information is provided in a fair use manner. I stripped emails because no-one likes spam. If you have issues with any other data, please let me know and I’ll redact it.