Leap second affected Cloudflare DNS
Cloudflare · Cloudflare DNS
On January 1, 2017, at 00:00 UTC, Cloudflare’s custom RRDNS software began failing as a result of a leap second. Some DNS resolutions failed as a consequence, specifically for customers using CNAME DNS records. Engineers identified the problem quickly and had escalated it by 00:10 UTC.
The root cause was an incorrect assumption in the Go-based RRDNS code that time would always move forward monotonically. The time.Now() function, used to calculate Round Trip Times (RTT) for upstream DNS resolvers, returned a time earlier than a previously recorded start time immediately after the leap second. This resulted in negative RTT values.
These negative RTT values were smoothed over time and eventually fed into Go’s rand.Int63n() function, which panics if its argument is not positive (n <= 0). This panic in the weighted selection algorithm for upstream resolvers caused the RRDNS failures and the resulting DNS resolution issues.
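The chain from a negative sample to a panic can be sketched as follows. The smoothing weights, the helper names smooth and pick, and the sample values are assumptions made for illustration; the source does not give the actual smoothing formula or selection code. The only library behavior relied on is that rand.Int63n panics when n <= 0.

```go
package main

import (
	"fmt"
	"math/rand"
)

// smooth is an illustrative exponentially weighted moving average.
// The 0.8/0.2 weights are an assumption, not RRDNS's real parameters.
func smooth(prev, sample float64) float64 {
	return 0.8*prev + 0.2*sample
}

// pick calls rand.Int63n with the given total weight and reports whether
// the call panicked, recovering so the program can continue.
func pick(total int64) (n int64, panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	return rand.Int63n(total), false
}

func main() {
	// Hypothetical values in milliseconds: a healthy 10 ms average
	// absorbs one -800 ms sample and becomes negative.
	avg := smooth(10, -800)
	fmt.Println("smoothed RTT is negative:", avg < 0)

	_, panicked := pick(int64(avg))
	fmt.Println("rand.Int63n panicked:", panicked)
}
```

In a server without a recover around the selection path, the same panic would terminate the process rather than being reported.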
The customer impact was limited, primarily affecting customers who use CNAME DNS records. At its peak, approximately 0.2% of DNS queries to Cloudflare were affected, and less than 1% of all HTTP requests to Cloudflare encountered an error. The issue was confined to a small number of machines across Cloudflare’s data centers.
A fix was developed and deployed within 90 minutes to the most affected machines. The remediation involved a code change that prevented the recording of negative RTT values, allowing RRDNS to normalize its performance metrics if time skipped backward. The fix was rolled out worldwide by 06:45 UTC on January 1, 2017, at which point the impact ended.