Brief CLE Outage on March 1, 2024
What Happened
Amazon Web Services, the cloud service provider that hosts the UCSF CLE's database, performed an automatic minor version upgrade of the database engine on Friday, March 1, 2024, at 8:09 p.m. PT.
This upgrade caused a two-minute outage of the CLE. During those two minutes, end users who tried to log in to the CLE received the error message "database connection failed." The Education IT team was alerted to the outage as it happened by its built-in alarm notification system.
When Did It Happen
Friday, March 1, 2024, at 8:09 p.m. PT.
Duration of Outage
Friday, March 1, 2024, for two minutes, from 8:09 p.m. to 8:11 p.m. PT.
Response to Outage
The outage resolved automatically when the version upgrade completed two minutes later, so the Education IT team did not have to take further action. The outage did not affect or alter any existing data or content in the UCSF CLE.
Since Friday, March 1, 2024, the Education IT team has been consulting with Amazon Web Services (AWS) to determine why the team’s built-in safeguards for minor upgrades failed and did not switch to the backup database as intended.
Technical Notes
- Both major and minor database engine version upgrades require downtime. Even for Amazon Relational Database Service (RDS) database instances in a Multi-AZ deployment, the primary and standby database instances are upgraded at the same time. Downtime duration varies based on the size of the database instance and engine.
- We can minimize downtime with blue/green deployments. The switchover still results in downtime, usually under one minute, but it can be longer depending on the workload.
- We can also use read replicas to minimize downtime, but the process is very involved. A read replica is promoted to a standalone database instance, which requires a new database endpoint and placing the original database in read-only mode so that no write operations are lost during the switchover.
- Failover times can further be reduced with Amazon RDS Proxy. This requires more research and testing before possible implementation.
- Reference: Amazon RDS Proxy now supports Amazon RDS
- Reference: Amazon RDS Proxy
- Note: A failover is an automatic switch from a primary system to a backup system in the event of a failure or disruption, so that the service continues as intended with minimal downtime for end users. (In our case, the UCSF CLE should have switched to a backup database during the database engine upgrade.)
Next Action Items
The Education IT team strives to ensure that upgrades on the back end never cause unexpected downtime or outages in the UCSF CLE. Therefore, the team will:
- Turn off automatic minor version upgrades for the database engine and apply upgrades manually instead.
- Move the maintenance window for database engine-related items to Wednesdays, from 5 a.m. to 6 a.m. PT.
- Conduct manual database failover tests and dry-runs in the UCSF CLE’s test environment.
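
As a rough illustration of how these action items might be scripted, the sketch below builds the settings involved. It is not the Education IT team's actual tooling. It assumes the AWS SDK for Python (boto3), whose RDS client provides `modify_db_instance` and `reboot_db_instance`; the helper names and the instance identifier are hypothetical. RDS expects the maintenance window in UTC, so 5 a.m. PT corresponds to 13:00 UTC during standard time and 12:00 UTC during daylight time.

```python
from datetime import datetime, timedelta, timezone

DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]


def maintenance_window_utc(local_start: datetime, hours: int = 1) -> str:
    """Format a timezone-aware start time as an RDS window string in UTC."""
    start = local_start.astimezone(timezone.utc)
    end = start + timedelta(hours=hours)

    def stamp(t: datetime) -> str:
        return f"{DAYS[t.weekday()]}:{t:%H:%M}"

    return f"{stamp(start)}-{stamp(end)}"


def apply_upgrade_policy(rds_client, instance_id: str, window: str) -> None:
    """Disable automatic minor upgrades and set the maintenance window."""
    rds_client.modify_db_instance(
        DBInstanceIdentifier=instance_id,
        AutoMinorVersionUpgrade=False,
        PreferredMaintenanceWindow=window,
    )


def run_failover_test(rds_client, instance_id: str) -> None:
    """Reboot a Multi-AZ instance with a forced failover to its standby."""
    rds_client.reboot_db_instance(
        DBInstanceIdentifier=instance_id,
        ForceFailover=True,
    )


# Wednesday, March 6, 2024, 5 a.m. Pacific Standard Time (UTC-8).
PST = timezone(timedelta(hours=-8))
window = maintenance_window_utc(datetime(2024, 3, 6, 5, 0, tzinfo=PST))
print(window)  # wed:13:00-wed:14:00
```

With boto3 installed and credentials configured, the helpers could be called as `apply_upgrade_policy(boto3.client("rds"), "cle-test-db", window)`, where `cle-test-db` is a placeholder identifier for a test-environment instance. Passing the client in as a parameter keeps the helpers easy to exercise against a stub before touching a real database.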