AO3 News

Post Header

Since the end of December 2024, AO3 has had numerous periods of slowness, downtime, and related issues such as missing kudos emails and delayed invitations. We've been taking some steps to improve the situation, but we are also working on some highly time-sensitive updates to our infrastructure, so we can't spend as much time as we'd like on performance improvements. We expect some slowness and downtime to continue until our new servers are delivered and installed in a few months.

We first noticed some strain on the servers we use for Elasticsearch (which powers searching and filtering) in the middle of last year. The new servers we wanted weren't available yet, so we repurposed some of our other servers to help with the load on Elasticsearch until we could get the hardware.

Unfortunately, the hardware wasn't available on its October release date, and our temporary fix couldn't hold up to the traffic increase we experience at the end of every year. This has led to periods of noticeable slowness over the last several weeks.

The servers we wanted finally became available in early January, and we completed the process of getting quotes and requisitioning them by January 15. Our purchase was confirmed on January 28, but it will take a few months for the servers to be delivered and installed.

We estimate the new Elasticsearch servers will be in place by early April. Until then, you might run into the following issues, especially during busy periods:

  • all pages loading more slowly
  • Elasticsearch-powered pages like search results and work and bookmark listings taking longer to update
  • error pages
  • automated checks from Cloudflare's Under Attack mode
  • stricter rate limiting
  • issues with services like the Wayback Machine or Tumblr RSS accounts that rely on bots, scrapers, or other automated tools, which we have deprioritized in favor of traffic from users

In addition to new Elasticsearch servers, we'll be purchasing five database servers to improve the capacity and resilience of our database cluster. We don't currently have enough database power to handle increased traffic and do certain types of maintenance at the same time. This means we sometimes have to take AO3 offline to resolve database issues, as we did for our February 7 maintenance. Additional hardware should help us avoid this situation in the future, but it will take some time for the purchase to be completed and the servers to be installed. We do not anticipate any database issues while we wait and there is no risk of data loss.

We're very sorry for the disruptions, and we appreciate your patience and your generous donations, which fund purchases like these.

For updates on slowness, downtime, or other issues, please follow @AO3_Status on Twitter/X or ao3org on Tumblr. We're also in the process of setting up a status account on Bluesky and a status page, but they're still works in progress and might not receive all updates just yet, so please make sure to check Twitter/X or Tumblr for a fully accurate list of updates.


Post Header

Published: 2019-11-15 23:54:27 UTC

Over the last few weeks, you may have noticed a few brief periods when the Archive has been slow to load (or refusing to load at all). This is because our Elasticsearch servers are under some strain. We've made some adjustments to our server setup, but because this is our busiest time of year, we expect the problems to continue until we're able to have new servers delivered and installed in a few months.

We've been planning this server purchase for a while now, but the machines we wanted -- AMD's Epyc Rome CPUs, which have an increased core count and are cheaper than the Intel equivalent -- didn't come on the market until August. Now that they've been released, we're working on finding the best price to help us make the most of your generous donations. We expect to order them very soon.

While we're waiting for our new servers, we plan to upgrade the Elasticsearch software to see if the newer version offers any performance improvements. We hope this upgrade and the changes to our server setup will keep things from getting much worse during our end-of-year traffic influx.

Thank you for your patience, and for all the donations that allow us to buy new hardware when these situations arise!

Update 27 November: The servers have been ordered, but it will still be a few months before they are delivered and installed.


Post Header

Published: 2018-04-27 19:29:58 UTC

For a while now, our Support team has been receiving reports from users who have been logged out of their accounts and are unable to log back in. While our coders have been unable to determine the exact cause of this issue, Support has found a workaround that should allow you to log in.

If you have been redirected to the Forced Logout page -- also known as the Lost Cookie page -- and are unable to log in using the Log In option at the top of the page, please go directly to the Log In page at archiveofourown.org/login. From there, you should be able to access your account.

We're very sorry if you've run into this issue! We have added this information to our Known Issues page and will be adding it to the Forced Logout page while our coders continue to look for a fix.


Post Header

Published: 2017-09-18 16:47:51 UTC

Shortly after we upgraded the Archive to Rails 4.2, users began reporting they were being redirected to the login page when submitting forms (e.g. bookmarking a work, or posting a comment). Our coders were unable to find the cause of this problem and hoped it would resolve itself when we upgraded to Rails 5.1.

Unfortunately, the upgrade did not fix the issue, and further research has revealed this is a bug within Rails itself. The bug mainly -- but not only -- affects iPhone Safari users, and is most likely to happen when submitting a form after closing and re-opening your browser, or after leaving a page open for a number of days.

There's currently no official fix for this issue, but you may be able to work around it by using your browser's "Back" button and submitting the form again. We'll also be implementing a temporary workaround on our end by making session cookies last two weeks. This means it is very important to log out of your account if you are using a public computer. If you simply close the browser and leave, you will still be logged in and the next person to use the computer will be able to access your account.

Once an official fix becomes available, we will apply it as soon as possible. There's no word on when this will be, but in the meantime, we'll keep looking for workarounds.

Update, 23 September 2017: If you have JavaScript disabled in your browser and were getting Session Expired errors when trying to log in, the problem should now be fixed!


Post Header

Published: 2017-03-21 19:17:26 UTC

Update, April 4: We successfully deployed an improved version of the code referenced in this post on March 29. It now takes considerably less time to add a work to the database.

-

You may have noticed the Archive has been slow or giving 502 errors when posting or editing works, particularly on weekends and during other popular posting times. Our development and Systems teams have been working to address this issue, but our March 17 attempt failed, leading to several hours of downtime and site-wide slowness.

Overview

Whenever a user posts or edits a work, the Archive updates how many times each tag on the work has been used across the site. While those counts are being updated, each tag's record is locked and the database cannot process other changes to that tag. This can result in slowness or even 502 errors when multiple people are trying to post works using the same tag. Because all works are required to use rating and warning tags, works' tags frequently overlap during busy posting times.
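
To illustrate the kind of contention involved, here is a minimal sketch (in Python, not the Archive's actual Ruby code) of a counter update inside a transaction. The table and column names are hypothetical, and the connection stands for any DB-API connection to MySQL.

    # Minimal sketch of the row-lock contention described above.
    # "connection" is assumed to be any DB-API 2.0 connection to MySQL;
    # the tags table and its uses_count column are hypothetical names.
    def increment_tag_counts(connection, tag_ids):
        """Bump the use count for every tag on a newly posted work.

        Each UPDATE takes a row lock on that tag until the surrounding
        transaction commits, so two works sharing a popular tag (for
        example, a rating tag) cannot update its count at the same time.
        """
        cursor = connection.cursor()
        try:
            for tag_id in tag_ids:
                cursor.execute(
                    "UPDATE tags SET uses_count = uses_count + 1 WHERE id = %s",
                    (tag_id,),
                )
            connection.commit()  # the row locks are only released here
        finally:
            cursor.close()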

Unfortunately, the only workaround currently available is to avoid posting, editing, or adding chapters to works at peak times, particularly Saturdays and Sundays (UTC). We strongly recommend saving your work elsewhere so changes won’t be lost if you receive a 502.

For several weeks, we’ve had temporary measures in place to decrease the number of 502 errors. However, posting is still slow and errors are still occurring, so we’ve been looking for more ways to use hardware and software to speed up the posting process.

Our Friday, March 17, downtime was scheduled so we could deploy a code change we hoped would help. The change would have allowed us to cache tag counts for large tags (e.g. ratings, common genres, and popular fandoms), updating them only periodically rather than every time a work was posted or edited. (We chose to cache only large tags because the difference between 1,456 and 1,464 is less significant than the difference between one and nine.) However, the change led to roughly nine hours of instability and slowness and had to be rolled back.

Fixing this is our top priority, and we are continuing to look for solutions. Meanwhile, we’re updating our version of the Rails framework, which is responsible for the slow counting process. While we don’t believe this upgrade will be a solution by itself, we are optimistic it will give us a slight performance boost.

March 17 incident report

The code deployed on March 17 allowed us to set a caching period for a tag's use count based on the size of the tag. While the caching period and tag sizes were adjusted throughout the day, the code used the following settings when it was deployed (a rough sketch of this tiering follows the list):

  • Small tags with less than 1,000 uses would not be cached.
  • Medium tags with 1,000-39,999 uses would be cached for 3-40 minutes, depending on the tag’s size.
  • Large tags with at least 40,000 uses would be cached for 40-60 minutes, but the cache would be refreshed every 30 minutes. Unlike small and medium tags, the counts for large tags would not update when a work was posted -- they would only update during browsing. Refreshing the cache every 30 minutes would prevent pages from loading slowly.
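
As a rough illustration of this tiering (a Python sketch, not the Ruby code we deployed), the cache lifetime can be derived from a tag's use count roughly as follows. The function name and the exact interpolation are our own simplification of the settings above.

    # Rough sketch of the tier logic described above. The thresholds match
    # the settings listed here, but the interpolation is a simplification.
    from typing import Optional

    def cache_period_minutes(uses: int) -> Optional[int]:
        """Return how long a tag's use count may be cached, in minutes.

        Returns None for small tags, whose counts are never cached.
        """
        if uses < 1_000:        # small tags: always use the live count
            return None
        if uses < 40_000:       # medium tags: 3-40 minutes, scaling with size
            fraction = (uses - 1_000) / (40_000 - 1_000)
            return round(3 + fraction * (40 - 3))
        # Large tags: cached for 40-60 minutes; a separate job refreshes the
        # cache about every 30 minutes so browsing never waits on a recount.
        return min(60, 40 + (uses - 40_000) // 100_000)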

We chose to deploy at a time of light system load so we would be able to fine-tune these settings before the heaviest weekend load. The deploy process itself went smoothly, beginning at 12:00 UTC and ending at 12:14 -- well within the 30 minutes we allotted for downtime.

By 12:40, we were under heavy load and had to restart one of our databases. We also updated the settings for the new code so tags with 250 or more uses would fall into the "medium" range and be cached. We increased the minimum caching period for medium tags from three minutes to ten.

At 12:50, we could see we had too many writes going to the database. To stabilize the site, we made it so only two of our seven app servers were writing cache counts to the database.

However, at 13:15, the number of writes overwhelmed MySQL. It was writing constantly, which made the service unavailable and eventually caused it to crash. We put the Archive into maintenance mode and began a full MySQL cluster restart. Because the writes had exceeded the databases' capabilities, the databases had become out of sync with each other. Resynchronizing the first two servers with the built-in method took about 65 minutes, starting at 13:25 and completing at 14:30. Using a different method to bring the third, recalcitrant server back into line allowed us to return the system to use sooner.

By 14:57, we had a working set of two out of three MySQL servers in a cluster and were able to bring the Archive back online. Before bringing the site back, we also updated the code for the tag autocomplete, replacing a call that could write to the database with a simple read instead.

At 17:48, we were able to bring the last MySQL server back and rebalance the load across all three servers. However, the database dealing with writes was sitting at 91% load rather than the more normal 4-6%.

At 18:07, we made it so only one app server wrote tags’ cache values to the database. This dropped the load on the write database to about 50%.

At 19:40, we began implementing a hotfix that significantly reduced writes to the database server, but having all seven app servers writing to the database once more pushed the load up to about 89%.

At 20:30, approximately half an hour after the hotfix was finished, we removed the writes from three of the seven machines. While this reduced the load, the reduction was not significant enough to resolve the issues the Archive was experiencing. Nevertheless, we let the system run for 30 minutes so we could monitor its performance.

Finally, at 21:07, we decided to take the Archive offline and revert the release. The Archive was back up and running the old code by 21:25.

We believe the issues with this caching change were caused by underestimating the number of small tags on the Archive and overestimating the accuracy of their existing counts. With the new code in place, the Archive began correcting the inaccurate counts for small tags, leading to many more writes than we anticipated. If we're able to get these writes under control, we believe this code might still be a viable solution. Unfortunately, this is made difficult by the fact that we can't simulate production-level load in our testing environment.

Going forward

We are currently considering five possible ways to improve posting speed, although other options might present themselves as we continue to study the situation.

  1. Continue with the caching approach from our March 17 deploy. Although we chose to revert the code due to the downtime it had already caused, we believe we were close to resolving the issue with database writes. We discovered that the writes overwhelming our database were largely secondary writes caused by our tag sweeper. These secondary writes could likely be reduced by putting checks in the sweeper to prevent unnecessary updates to tag counts.
  2. Use the rollout gem to alternate between the current code and the code from our March 17 deploy. This would allow us to deploy and troubleshoot the new caching code with minimal interruption to normal Archive function. We would be able to study the load caused by the new code and switch back to the old code before problems arose. However, it would also make the new code much more complex, which means it would not only be more error-prone, but would also take longer to write, and users would have to put up with the 502 errors for longer.
  3. Monkey patch the Rails code that updates tag counts. We could modify the default Rails code so it would still update the count for small tags, but not even try to update the count on large tags. We could then add a task that would periodically update the count on larger tags.
  4. Break work posting into smaller transactions. The current slowness comes from large transactions that are live for too long. Breaking the posting process into smaller parts would resolve that, but we would then run the risk of creating inconsistencies in the database. In other words, if something went wrong while a user was updating their work, only some of their changes might be saved.
  5. Completely redesign work posting. We currently have about 19,000 drafts and 95,000 works created in a month, and moving drafts to a separate table would allow us to only update the tag counts when a work was finally posted. We could then make posting from a draft the only option. Pressing the "Post" button on a draft would set a flag on the entry in the draft table and add a Resque job to post the work, allowing us to serialize updates to tag counts (see the sketch after this list). Because the user would only be making a minor change in the database, the web page would return instantly. However, there would be a wait before the work was actually posted.

Note: The unexpected downtime that occurred around noon UTC on Tuesday, March 21, was caused by an unusually high number of requests to Elasticsearch and is unrelated to the issues discussed in this post. A temporary fix is currently in place and we are looking for long-term solutions.
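
As a very rough sketch of option 5 (a Python illustration with in-memory stand-ins, not a design we have committed to or the Archive's actual code), posting would reduce to flagging the draft and enqueueing a job, with a single worker applying tag-count updates one at a time:

    # Hypothetical sketch of option 5: the web request only flags the draft
    # and enqueues a job; a single background worker then posts works one at
    # a time, so tag counts are updated serially instead of under concurrent
    # row locks. Everything here is an in-memory stand-in.
    import queue
    import threading
    from collections import Counter

    post_queue = queue.Queue()   # stand-in for a Resque-style job queue
    drafts = {}                  # draft_id -> {"tags": [...], "posting": bool}
    tag_counts = Counter()       # stand-in for the cached tag use counts

    def request_post(draft_id):
        """Called from the web request: a cheap flag and enqueue, so the page returns instantly."""
        drafts[draft_id]["posting"] = True
        post_queue.put(draft_id)

    def posting_worker():
        """Single worker: serializes all tag-count updates."""
        while True:
            draft_id = post_queue.get()
            for tag in drafts[draft_id]["tags"]:
                tag_counts[tag] += 1   # runs alone, so no lock contention
            post_queue.task_done()

    threading.Thread(target=posting_worker, daemon=True).start()

    # Example: drafts[1] = {"tags": ["General Audiences", "Fluff"], "posting": False}
    #          request_post(1)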


Post Header

To combat an influx of spam works, we are temporarily suspending the issuing of invitations from our automated queue. This will prevent spammers from getting invitations to create new accounts and give our all-volunteer teams time to clean up existing spam accounts and works. We will keep you updated about further developments on our Twitter account. Please read on for details.

The problem

We have been dealing with two issues affecting the Archive, both in terms of server health and user experience.

  • Spammers who sign up for accounts only to post thousands of fake "works" (various kinds of advertisements) with the help of automated scripts.
  • People who use bots to download works in bulk, to the point where it affects site speed and server uptime for everyone else.

Measures we've taken so far

We have been trying several things to keep both problems in check:

  • The Abuse team has been manually banning accounts that post spam.
  • We are also keeping an eye on the invitation queue for email addresses that follow discernible patterns and removing them from the queue. This is getting trickier as the spammers adjust.
  • We delete the bulk of spam works from the database directly, as individual work deletion would clearly be an overwhelming task for the Abuse team; however, this requires people with the necessary skills and access to be available.
  • Our volunteer sysadmin has been setting up various server scripts and settings aimed at catching spammers and download bots before they can do too much damage. This requires a lot of tweaking to adjust to new bots and prevent real users from being banned.

Much of this has cut into our volunteers' holiday time, and we extend heartfelt thanks to everyone who's been chipping in to keep the Archive going through our busiest days.

What we're doing now

Our Abuse team needs a chance to catch up on all reported spamming accounts and make sure that all spam works are deleted. Currently the spammers are creating new accounts faster than we can ban them. Our sysadmins and coders need some time to come up with a sustainable solution to prevent further bot attacks.

To that end, we're temporarily suspending issuing invites from our automated queue. Existing account holders can still request invite codes and share them with friends. You can use existing invites to sign up for an account; account creation itself will not be affected. (Please note: Requests for invite codes have to be manually approved by a site admin, so there might be a delay of two to three days before you receive them; challenge moderators can contact Support for invites if their project is about to open.)

We are working hard to get these problems under control, so the invite queue should be back in business soon! Thank you for your patience as we work through the issues.

What you can do

There are some things you can do to help:

  • When downloading multiple works, wait a few moments between each download. If you're downloading too many works at once, you will be taken to an error page warning you to slow down or risk being blocked from accessing the Archive for 24 hours.
  • Please don't report spam works. While we appreciate all the reports we've received so far, we now have a system in place that allows us to find spam quickly. Responding to reports of spam takes time away from dealing with it.
  • Keep an eye on our Twitter account, @AO3_Status, for updates!

Known problems with the automated download limit

We have been getting reports of users who run into a message about excessive downloads even if they were downloading only a few works, or none at all. This may happen for several reasons that are unfortunately beyond our control:

  • They pressed the download button once, but their device went on a rampage trying to download the file many times. A possible cause for this might be a download accelerator, so try disabling any relevant browser extensions or software, or try downloading works in another browser or on another device.
  • They share an IP address with a group of people, one of whom hit the current download limit and got everyone else with the same IP address banned as well. This can be caused by VPNs, Tor software, or an ISP that assigns the same IP address to a group of customers (more likely to happen on phones). Please try using a different device, if you can.

We apologize if you have to deal with any of these and we'll do our best to restore proper access for all users as soon as possible!


Post Header

Published: 2014-09-09 20:43:05 UTC

Credits

  • Coder: Elz
  • Code reviewers: Enigel, james_
  • Testers: Ariana, Lady Oscar, mumble, Ridicully, sarken

Overview

With today's deploy we're making some changes to our search index code, which we hope will solve some ongoing problems with suddenly "missing" works or bookmarks and inaccurate work counts.

In order to improve consistency and reduce the load on our search engine, we'll be sending updates to it on a more controlled schedule. The trade-off is that it may take a couple of minutes for new works, chapters, and bookmarks to appear on listing pages (e.g. for a fandom tag or in a collection), but those pages will ultimately be more consistent and our systems should function more reliably.

You can read on for technical details!

The Problem

We use a software package called Elasticsearch for most of our search and filtering needs. It's a powerful system for organizing and presenting all the information in our database and allows for all sorts of custom searches and tag combinations. To keep our search results up to date for everyone using the Archive, we need to ensure that freshly-posted works, new comments and kudos, edited bookmarks, new tags, etc. all make it into our search index practically in real time.

However, the volume of updates has grown considerably over the last couple of years, which has increased the time it takes to process them and slowed down the general functioning of the underlying system. That slowness has interacted badly with the way we cache data in our current code: works and bookmarks seem to occasionally appear and disappear from site listings, and the counts you see on different pages and sidebars may be significantly different from one another.

That's understandably alarming to anyone who encounters it, and fixing it has been our top priority.

The First Step

We are making some major changes to our various "re-indexing" processes, which take every relevant change that happens to works/bookmarks/tags and update our massive search index accordingly:

  • Instead of going directly into Elasticsearch, all indexing tasks will now be added to a queue that can be processed in a more orderly fashion. (We were queueing some updates before, but not all of them.)
  • The queued updates will then be sent to the search engine in batches to reduce the number of requests, which should help with performance (a rough sketch of this batching follows the list).
  • Cached pages will be expired (i.e., updated to reflect new data) not when the database says so, but when Elasticsearch is ready.
  • Updates concerning hit counts, kudos, comments, and bookmarks on a work (i.e. "stats" data) will be processed more efficiently but less frequently.
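
In Python terms (a sketch using the elasticsearch-py client rather than our actual Ruby indexing code, with illustrative index names, document shapes, and batch size), the queue-and-batch idea looks roughly like this:

    # Rough sketch of the queue-and-batch approach described above, using the
    # Python Elasticsearch client; index names, document shapes, and the
    # batch size are illustrative, not our production values.
    import queue
    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import bulk

    es = Elasticsearch()             # assumes a reachable Elasticsearch node
    reindex_queue = queue.Queue()    # every work/bookmark change lands here

    def enqueue_update(doc_id, source):
        """Called whenever a work or bookmark changes; cheap and non-blocking."""
        reindex_queue.put({"_index": "works", "_id": doc_id, "_source": source})

    def flush_batch(max_docs=500):
        """Run periodically: drain the queue and send one bulk request."""
        actions = []
        while not reindex_queue.empty() and len(actions) < max_docs:
            actions.append(reindex_queue.get())
        if actions:
            bulk(es, actions)        # one request instead of hundreds
            # only after this succeeds would the relevant cached pages be expired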

As a result, work updates will take a minute to affect search results and work listings, and background changes to tags (e.g. two tags being linked together) will take a few minutes longer to be reflected in listings. Stats data (hits, kudos, etc.) will be added to the search index only once an hour. The upside of this is that listings should be more consistent across the site!

(Please note that this affects only searching, sorting, and filtering! The kudos count in a work blurb, for example, is based on the database total, so you may notice slight inconsistencies between those numbers and the order you see when sorting by kudos.)

The Next Step

We're hoping that these changes will help to solve the immediate problems that we're facing, but we're also continuing to work on long-term plans and improvements. We're currently preparing to upgrade our Elasticsearch cluster from version 0.90 to 1.3 (which has better performance and backup tools), switch our code to a better client, and make some changes to the way we index data to continue to make the system more efficient.

One big improvement will be in the way we index bookmarks. When we set up our current system, we had a much smaller number of bookmarks relative to other content on the site. The old Elasticsearch client we were using also had some limitations on its functionality, so we ended up indexing the data for bookmarked works together with each of their individual bookmarks, which meant that a single update to a work required updates to dozens or hundreds of bookmark records. That's been a serious problem when changes are made to tags in particular, where a small change can potentially kick off a large cascade of re-indexes. It's also made it more difficult to keep up with regular changes to works, which led to problems with bookmark sorting by date. We're reorganizing that using Elasticsearch's parent-child index structure, and we hope that this will also have positive long-term effects on performance.
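
For illustration, here is what the parent-child layout looks like in a Python sketch against the Elasticsearch 1.x-era API we are moving to; the index, type, and field names are placeholders rather than our real schema.

    # Illustration of the parent-child structure described above, written
    # against the Elasticsearch 1.x-era API; all names are placeholders.
    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    # Bookmarks declare works as their parent type, so a bookmark document
    # only carries its own fields plus a pointer to the work it belongs to.
    es.indices.create(
        index="archive",
        body={
            "mappings": {
                "work": {"properties": {"title": {"type": "string"}}},
                "bookmark": {
                    "_parent": {"type": "work"},
                    "properties": {"notes": {"type": "string"}},
                },
            }
        },
    )

    # Re-indexing a work no longer touches any of its bookmark documents;
    # each bookmark is indexed once, with the work's id as its parent.
    es.index(index="archive", doc_type="work", id=42, body={"title": "Example Work"})
    es.index(index="archive", doc_type="bookmark", id=1, parent=42, body={"notes": "rec"})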

Overall, we're continuing to learn and look for better solutions as the Archive grows. We apologize for the bumpy ride lately, and we hope that the latest set of changes will make things run more smoothly. We should have more improvements for you in the coming months, and in the meantime, we thank you for your patience!


Post Header

Published: 2014-01-23 21:26:51 UTC

If you're a regular Archive visitor or if you follow our AO3_Status Twitter account, you may have noticed that we've experienced a number of short downtime incidents over the last few weeks. Here's a brief explanation of what's happening and what we're doing to fix things.

The issue

Every now and then, the volume of traffic we get and the amount of data we're hosting start to hit the ceiling of what our existing infrastructure can support. We try to plan ahead and start making improvements in advance, but sometimes things simply catch up to us a little too quickly, which is what's happening now.

The good news is that we do have fixes in the works: we've ordered some new servers, and we hope to have them up and running soon. We're making plans to upgrade our database system to a cluster setup that will handle failures better and support more traffic; however, this will take a little longer. And we're working on a number of significant code fixes to relieve bottlenecks and reduce server load - we hope to have the first of those out within the next two weeks.

One affected area is the counting of hits, kudos, comments, and bookmarks on works, so you may see delays in those numbers updating, which will also result in slightly inaccurate search and sort results. Issues with the "Date Updated" sorting on bookmark pages will persist until a larger code rewrite has been deployed.

Behind the scenes

We apologize to everyone who's been affected by these sudden outages, and we'll do our best to minimize the disruption as we work on making things better! We have an all-volunteer staff, so while we try to respond to server problems quickly, they sometimes happen when we're all either at work or asleep, and we can't always fix things as soon as we'd like to.

While we appreciate how patient and supportive most Archive users are, please keep in mind that tweets and support requests go to real people who may find threats of violence or repeated expletives aimed at them upsetting. Definitely let us know about problems, but try to keep it to language you wouldn't mind seeing in your own inbox, and please understand if we can't predict immediately how long a sudden downtime might take.

The future

Ultimately, we need to keep growing and making things work better because more and more people are using AO3 each year, and that's something to be excited about. December and January tend to bring a lot of activity to the site - holiday gift exchanges are posted or revealed, people are on vacation, and a number of fandoms have new source material.

We're looking forward to seeing all the new fanworks that people create this year, and we'll do our best to keep up with you! And if you're able to donate or volunteer your time, that's a huge help, and we're always thrilled to hear from you.


