How to configure Elasticsearch
Even though Elasticsearch is ready to use right after you unpack it, there are some configuration parameters and system requirements you really should know about. Otherwise, you can end up facing a range of problems that Elasticsearch can cause at many levels. A good understanding of Elasticsearch internals is extremely useful here, so I would encourage you to read more about them on the official Elasticsearch site. Below, I have listed the topics I consider relevant, both as prerequisites and as mandatory settings.
There are a few system requirements for a healthy and stable Elasticsearch cluster: fast drives with enough space for storing Elasticsearch data, and a good network infrastructure.
- Always use SSD drives for Elasticsearch. Any other kind of drive is not recommended at all! Elasticsearch (and the Lucene library underneath it) performs heavy I/O operations, so drive speed is extremely relevant. Slower drives will give you degraded performance.
- Make sure you have a good network infrastructure. If multiple Elasticsearch nodes run in production, which is usually the case, the network becomes a highly important prerequisite for Elasticsearch performance. Nodes in an Elasticsearch cluster have various roles and synchronize with each other frequently, for many reasons. A slow network often leads to poor performance and cluster instability.
- Ensure you have enough drive space for storing Elasticsearch data. By default, Elasticsearch limits how much of a node's disk space may be used: the low watermark is 85% of disk space and the high watermark is 90%. This means that if the drive has 20 GB and 18.5 GB (92.5%) is already used, Elasticsearch will stop allocating new shards to that node and will most probably try to relocate shards away from it in order to spread data more evenly over the cluster. Therefore, make sure the drives hosting data nodes never reach the 85%-90% usage limits.
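The watermark thresholds can also be written out explicitly. The fragment below is a minimal sketch for elasticsearch.yml, assuming the setting names of the disk-based shard allocator; the values simply repeat the defaults described in the text, so adding them changes nothing until you edit them.

```yaml
# Disk-based shard allocation thresholds (values repeat the defaults).
# Low watermark: no new shards are allocated to a node above this disk usage.
cluster.routing.allocation.disk.watermark.low: "85%"
# High watermark: shards start being relocated away from a node above this disk usage.
cluster.routing.allocation.disk.watermark.high: "90%"
```

Raising these values buys a little room on a full disk, but it is usually safer to add capacity than to run closer to the limit.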
As I said before, Elasticsearch comes ready to use without any additional configuration by the end user. Still, I wouldn't recommend using it in that form. The default configuration is good enough for development, but even there it can cause issues, such as undesired nodes joining the cluster or unnecessary shard reallocation. Therefore, the settings listed below should always be set; their values depend entirely on the production requirements and the desired Elasticsearch cluster topology.
- Set meaningful Elasticsearch cluster and node names. By default, Elasticsearch assigns random node names, which make no sense in production and only cause confusion when reading log files. Therefore, explicitly set meaningful names for Elasticsearch nodes. In the elasticsearch.yml configuration file, find the "Node" section and set the node name there.
In addition to node names, give the entire cluster a useful name. In the elasticsearch.yml file, find the "Cluster" section and set the cluster name there.
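As an illustration, the two naming settings might look like the fragment below. Both names are hypothetical placeholders chosen for this sketch; pick values that match your own environment and naming scheme.

```yaml
# "Cluster" section: all nodes that should form one cluster share this name.
cluster.name: logs-production   # hypothetical cluster name

# "Node" section: a unique, human-readable name for this particular node.
node.name: es-data-01           # hypothetical node name
```

A distinct cluster name per environment also keeps a development node from accidentally joining a production cluster.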
- Set the Elasticsearch heap size. Elasticsearch requires lots of RAM for its internal structures and to perform well, but it's quite important to leave enough memory for the rest of the system (the OS and the Lucene library itself). It is usually recommended to set the Elasticsearch heap size to half of the available RAM, and never more than 32 GB! So, if the machine has 24 GB, at most 12 GB should be allocated to the Elasticsearch heap, and if the machine has 96 GB, 32 GB should be the upper limit.
Setting the correct heap size can be tricky, and it may take several tries before you find the right value. Mostly, this depends on the other services hosted on the same machine and how much memory they require, and on the general load on the Elasticsearch node itself. Ideally, a machine hosting an Elasticsearch node should be dedicated to Elasticsearch; it shouldn't be used for anything else unless it is a dedicated master node. How do you set the heap size? Go to the environment variables window and add the ES_HEAP_SIZE variable with the chosen value, for example 12g for the 24 GB machine mentioned above.
- For production, use at least 3 Elasticsearch nodes. Unless you are using Elasticsearch for testing or to hold a really small amount of data (in which case one node is sufficient), a minimum of 3 Elasticsearch nodes is recommended. This is a good starting point since it provides failover support and helps you avoid a split-brain scenario in the cluster. As your data volume grows, additional nodes will most likely be required at some point.
- Set the minimum number of master-eligible nodes. Once you have more than one node in the cluster, you need to set how many master-eligible nodes must be visible before a master can be elected. The formula for the correct value is N/2+1 (using integer division), where N is the number of master-eligible nodes; for 3 nodes this gives 2. To set this, go to the elasticsearch.yml configuration file, find the "Discovery" section and set the minimum number of master nodes there.
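A minimal sketch for a cluster with 3 master-eligible nodes follows. It assumes the zen discovery setting name used by older Elasticsearch releases; this setting no longer exists in releases that dropped zen discovery.

```yaml
# Quorum for 3 master-eligible nodes: 3 / 2 + 1 = 2 (integer division).
# With this value, an isolated single node cannot elect itself master.
discovery.zen.minimum_master_nodes: 2
```

Remember to recalculate the value whenever the number of master-eligible nodes changes.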
- Set the correct number of primary shards and replica shards. This depends heavily on the number of data nodes you have, the amount of data you store in Elasticsearch and the failover strategy you want to apply. Remember, the number of replicas is not the total number of replica shards, but the number of replicas per primary shard! In other words, if we have 3 Elasticsearch nodes and set 3 primary shards, then setting the number of replicas to 1 will produce one replica for each primary shard, and therefore 3 replica shards in total. So, for 3 nodes, one configuration could be 3 primary shards with 1 replica each.
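Written as settings, that 3-node example could be sketched like this. It assumes an older Elasticsearch release that accepts index defaults in elasticsearch.yml; newer releases expect these values per index or through index templates instead.

```yaml
# Default number of primary shards for newly created indices.
index.number_of_shards: 3
# Replicas PER primary shard: 3 primaries x 1 replica = 3 replica shards.
index.number_of_replicas: 1
```

Keep in mind that the number of primary shards is fixed when an index is created, while the number of replicas can be changed later.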
This will result in 6 shards over 3 nodes, so 2 shards per node. Setting the number of primary shards and replicas isn't as easy as it seems. It will also take a few rounds of trial and testing until you arrive at the right numbers. While having many shards could help you improve search performance, it can be extremely inefficient in terms of indexing performance. Since all update operations will require more time, you will need to find a good balance, so choose these values carefully.
- Disable zen multicast discovery and explicitly set a list of nodes for zen unicast discovery. Zen multicast is useful for development purposes, but it should never be used in production. Disable it in the elasticsearch.yml file.
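A minimal sketch, assuming the setting name from Elasticsearch 1.x, where multicast was part of the core distribution; later releases moved multicast into a plugin and then removed it, so the line is only needed on old versions.

```yaml
# Stop this node from discovering (and joining) other nodes via multicast.
discovery.zen.ping.multicast.enabled: false
```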
Instead of zen multicast, we will use zen unicast discovery, which instructs Elasticsearch to include only the specified nodes in the cluster. To do this, list the hosts in the unicast discovery setting in the elasticsearch.yml file.
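The unicast host list might look like the sketch below. The host names and the IP address are hypothetical, and the setting name assumes a release that still uses zen discovery.

```yaml
# Nodes this node contacts at startup to find the cluster.
# Entries may be host names or IP addresses, optionally with a transport port.
discovery.zen.ping.unicast.hosts: ["es-node-1", "es-node-2", "10.0.0.13:9300"]   # hypothetical hosts
```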
Put the relevant machine names or IP addresses in the host list. Remember, we only need to list a few nodes that the current node can talk to; we don't have to list all the nodes.
- Disable swapping (Unix/Linux platforms only). To avoid swapping data from memory to disk (which is a performance killer), enable memory locking in the elasticsearch.yml file.
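As a sketch, memory locking could be switched on as follows. The setting name is the one used by Elasticsearch 1.x and 2.x; later releases renamed it to bootstrap.memory_lock.

```yaml
# Lock the process address space into RAM so the heap is never swapped out.
bootstrap.mlockall: true
```

The user running Elasticsearch must also be allowed to lock memory (for example through the operating system's memlock limit), otherwise the setting has no effect.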
Here, I would like to list a few more useful recommendations for an advanced setup of an Elasticsearch cluster.
- Explicitly set which nodes are master, data or client nodes. For larger clusters, it's recommended to explicitly designate master, data and client nodes. This can optimize internal cluster communication, cluster recovery, master election and, in general, the operations clients perform on the cluster (search, add, delete document, etc.). A master-only node, which is master-eligible but holds no data, is specified in the elasticsearch.yml file by enabling the master role and disabling the data role.
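For a dedicated master node, a sketch using the role flags of older Elasticsearch releases (newer releases express roles differently) would be:

```yaml
# Dedicated master: may be elected master, stores no index data.
node.master: true
node.data: false
```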
A data node is specified by disabling the master role and enabling the data role.
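Under the same assumption about role flags, a data-only node could be sketched as:

```yaml
# Data-only node: stores and searches data, never becomes master.
node.master: false
node.data: true
```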
A client node is specified by disabling both the master and the data role.
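With both flags turned off, again assuming the older role settings, the node only routes requests and merges results:

```yaml
# Client node: neither master-eligible nor holding data.
node.master: false
node.data: false
```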
- Set the zen discovery timeout to a higher value. If you are stuck with a slow network environment, some default timeouts aren't good enough. In the elasticsearch.yml file, increase the zen discovery ping timeout.
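A hedged sketch of such a change is shown below. The spelling of the setting name follows the sample configuration shipped with Elasticsearch 1.x (later zen-based releases use discovery.zen.ping_timeout), and the 10s value is an arbitrary illustration, not a recommendation.

```yaml
# Wait longer for ping responses during discovery on a slow network.
discovery.zen.ping.timeout: 10s   # illustrative value; the default is shorter
```

A higher timeout makes discovery more tolerant of slow links, at the cost of slower master elections.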