What’s the deal with AWS Elasticsearch Service timeouts


I write this post after a pretty frustrating day of knob turning, querying logs, and Google searching, just so you don’t have to go down the same rabbit hole I did. In essence: I had an index hosted in Elasticsearch Service on AWS, and I could not reindex it in my staging environment (and therefore not in production) because of constraints AWS imposes on Elasticsearch.

tl;dr

Have you ever tried running an operation that takes ≥1 minute in AWS Elasticsearch Service, and run into:

gems/elasticsearch-transport-5.0.4/lib/elasticsearch/transport/transport/base.rb:202:in `__raise_transport_error': [504] (Elasticsearch::Transport::Transport::Errors::GatewayTimeout)

This is because AWS Elasticsearch Service imposes a non-configurable ELB timeout of 60 seconds. With that in mind, you should consider using the /_tasks API (currently in beta) to poll long-running operations on your cluster when the duration of the operation cannot be estimated.

The Problem

I had an Elasticsearch cluster hosted via AWS Elasticsearch Service with an index whose mapping I wanted to modify, and the /_reindex API kept timing out in my team’s staging environment.

In my case, I had a few fields whose property types I wanted to change, and the only way to achieve that is to reindex the whole collection into an intermediary index.
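Creating that intermediary index with the corrected mapping looks roughly like this with the elasticsearch-ruby client (the URL, index, type, and field names below are made-up placeholders, not the ones from my actual cluster):

require 'elasticsearch'

client = Elasticsearch::Client.new(url: ENV['ELASTICSEARCH_URL'])

# Create the intermediary index up front, with the corrected field types.
client.indices.create(
  index: 'listings_v2',
  body: {
    mappings: {
      listing: {
        properties: {
          status:     { type: 'keyword' },
          created_at: { type: 'date' }
        }
      }
    }
  }
)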

The script I used to perform the reindexing boiled down to something like the following (sketched here with the elasticsearch-ruby client; the index names are placeholders):
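require 'elasticsearch'

client = Elasticsearch::Client.new(url: ENV['ELASTICSEARCH_URL'])

# Copy every document from the original index into the intermediary one,
# blocking until the whole operation finishes.
client.reindex(
  wait_for_completion: true,
  body: {
    source: { index: 'listings' },
    dest:   { index: 'listings_v2' }
  }
)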

Straightforward, right?

Also, notice the wait_for_completion parameter: I essentially wanted to block until the operation completed before performing the next step in the script.

In my local dev environment I had no issues with the /_reindex API, even when the operation took longer than a minute to complete.

However, when I executed this in a staging environment against AWS Elasticsearch Service, it raised an exception:

gems/elasticsearch-transport-5.0.4/lib/elasticsearch/transport/transport/base.rb:202:in `__raise_transport_error': [504] (Elasticsearch::Transport::Transport::Errors::GatewayTimeout)

At first glance, it would seem that the server might be timing out the connection, hence the status 504.

Digging deeper, I found that the reindex operation was in fact still running on the cluster: the total number of documents in the temp index was steadily approaching the number of documents in the original.

The next thing I tried was adjusting the timeouts I was passing to /_reindex.
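Concretely, that meant something like bumping the client-side request timeout and the timeout on the reindex call itself (the values below are arbitrary examples, not recommendations):

require 'elasticsearch'

# Give the HTTP layer a much longer request timeout than the default...
client = Elasticsearch::Client.new(
  url: ENV['ELASTICSEARCH_URL'],
  transport_options: { request: { timeout: 30 * 60 } }
)

# ...and pass a generous timeout to the reindex request as well.
client.reindex(
  wait_for_completion: true,
  timeout: '30m',
  body: {
    source: { index: 'listings' },
    dest:   { index: 'listings_v2' }
  }
)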

No dice. I got the same exception 😔

After hours of digging (a.k.a. Google searching), I discovered that the gateway timeout was the result of AWS enforcing a non-configurable 60-second timeout on Elasticsearch Service.

OK — that makes sense. So what do you do next?

Solution

So obviously the problem was not my ES cluster; it lay with AWS and whatever is load balancing requests to my cluster.

Fortunately, Elasticsearch ships with a task management feature that proves useful in exactly these scenarios, where you have a long-running operation on your cluster and the time it will take to complete cannot be estimated.

Using the task management API, given the task ID of the reindex operation you can poll Elasticsearch to find out when the task has completed:
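(With the Ruby client that is a GET against /_tasks/<task_id>; the task ID below is a placeholder.)

require 'elasticsearch'

client = Elasticsearch::Client.new(url: ENV['ELASTICSEARCH_URL'])

# The response for a single task carries a 'completed' flag we can check.
task = client.tasks.get(task_id: 'A1b2C3d4RsT5:12345')
task['completed'] # => true once the reindex has finished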

And piecing it all together:
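(Again a sketch; the index names and polling interval are placeholders.)

require 'elasticsearch'

client = Elasticsearch::Client.new(url: ENV['ELASTICSEARCH_URL'])

# Kick off the reindex without blocking; Elasticsearch responds immediately
# with the ID of the task doing the work.
response = client.reindex(
  wait_for_completion: false,
  body: {
    source: { index: 'listings' },
    dest:   { index: 'listings_v2' }
  }
)

task_id = response['task']

# Poll the tasks API until the reindex reports itself as completed.
# Each individual request stays well under the 60-second gateway timeout.
loop do
  break if client.tasks.get(task_id: task_id)['completed']
  sleep 10
end

# Now it's safe to carry on with the next step (swap aliases, drop the old index, etc.).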

Also notice that the wait_for_completion parameter is now false. By setting it to false, we instruct Elasticsearch to kick off the request and immediately return a response containing the task ID of the operation. More on that here

Reflections

If your infrastructure is pretty locked into AWS, your budget is tight, and you expect to have a relatively small index, hosted Elasticsearch via AWS is right for you. However, be warned that AWS imposes certain restrictions on ES features that would otherwise ship out of the box (configurable timeouts being one example).

Thanks for reading! If you found this post useful, please hold down the clap button so others can discover it! You can also follow me on Twitter.



Posted from my blog with SteemPress: http://selfscroll.com/whats-the-deal-with-aws-elasticsearch-service-timeouts/