GitHub explains its recent outages


Microsoft-owned code-sharing service GitHub is making improvements to its “MySQL1” database cluster after repeated outages last week affected many of its 73 million users.

GitHub outages are a problem for developers because their code and projects are hosted on the service, and GitHub is also important to the operation of many enterprise applications.

Problems in the database

GitHub acknowledged that its service has been repeatedly disrupted over the past week by “database health” issues, which have resulted in a degraded experience for its users.

“We know this impacts the productivity of many of our customers and we take it very seriously,” Keith Ballinger, senior vice president of engineering at GitHub, said in a blog post.

“The origin of our issues over the past few weeks has to do with resource contention in our mysql1 cluster, which has impacted the performance of many of our services and features during periods of peak load,” he explained.

Repeated GitHub outages over the past week have resulted in numerous social media complaints. Incident reports on downdetector.com peaked on March 23, with most of them relating to failed push and pull requests for projects.

Repeated incidents

Keith Ballinger highlighted four incidents, which lasted between two and five hours each, on March 16, 17, 22 and 23.

The March 16 outage lasted 5 hours and 36 minutes, according to the company. GitHub’s MySQL1 database was overloaded, causing failures affecting git operations, webhooks, pull requests, API requests, issues, GitHub Packages, GitHub Codespaces, GitHub Actions, and GitHub Pages. “The incident appears to be related to a spike in load combined with poor query performance for specific sets of circumstances,” he noted.

GitHub has failover mechanisms, but those failed as well. On March 17, another outage occurred, lasting two hours and 28 minutes. “We were unable to identify and resolve query performance issues prior to this spike, and decided to proactively failover before the issue escalated. Unfortunately, this caused a new load pattern which introduced connectivity issues on the newly promoted primary server, and applications were once again unable to connect to mysql1 while we worked to reset those connections,” Ballinger said.

Further outages occurred on March 22 and 23, each lasting just under three hours. “In this third incident, we enabled memory profiling on our database proxy to take a closer look at performance characteristics during peak loads. At the same time, client connections to mysql1 started failing, and we had to do a primary failover again in order to recover,” he said of the March 22 incident.

Then, on March 23, the company reduced webhook traffic, and it intends to use this throttling control to mitigate future issues when its database cannot handle peak loads.

Fixes for the future

The company says it has taken measures to prevent its database cluster from being overwhelmed by traffic from its services: auditing load patterns, deploying several performance fixes for the affected database, shifting traffic to other databases, and working to reduce failover times.

“We sincerely apologize for the negative impacts these disruptions have caused. We understand the impact these types of outages have on the customers who rely on us to get their jobs done every day and are committed to making efforts to ensure we can manage disruptions and minimize downtime,” said Keith Ballinger.

GitHub will disclose more details in its March Availability Report, which will be released in a few weeks.

Source: ZDNet.com
