Benchmarking OpenSearch and Elasticsearch

This post concludes a four-month performance study of OpenSearch and
Elasticsearch search engines across realistic scenarios using OpenSearch
Benchmark (OSB). Our
full report includes the detailed findings
and comparison results of several versions of these two applications. We have
not modified either codebase.

Organizations running search-driven applications—from product searches on
e-commerce sites to real-time market analysis at financial
institutions—depend heavily on search engine
performance. OpenSearch and
Elasticsearch enable fast, scalable, and efficient
data retrieval, making them essential for applications like website search,
time-series log analysis, business intelligence, and cybersecurity monitoring.
Both are increasingly used in generative AI, machine learning, and vector
applications as well.

When milliseconds of latency can impact user experience or business operations,
even small performance differences can have significant costs.
Amazon Web Services (AWS) requested that we conduct an independent
benchmark assessment comparing these two prominent search-and-analysis software
suites.

As a result of our independent assessment, we observed that OpenSearch v2.17.1
is 1.6x faster on the Big5 workload and 11% faster on the Vectorsearch workload
than Elasticsearch v8.15.4, when results are aggregated using the geometric mean
of per-query service times. However, benchmarking both
applications is a moving target because both OpenSearch and Elasticsearch have
frequent product release cycles.

Over the course of our testing, Elasticsearch
updated its product to version 8.17.0, and OpenSearch released version 2.18.0.
We developed reusable code
that automates repeatable testing and analysis for both
platforms on AWS cloud infrastructure.

This review compares the query latencies of OpenSearch and Elasticsearch on
OpenSearch Benchmark workloads.
OSB uses the client-side service time metric (measured in milliseconds) for
this purpose; it represents how long a request (i.e., query) takes to receive a
response, including overhead such as network latency, load balancer overhead,
and serialization/deserialization. OSB records the 90th percentile (p90) of
service times for each operation.
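
To make the metric concrete, here is a minimal sketch of how a p90 service time
can be computed from per-request measurements; the numbers are illustrative, not
measured values.

```python
import numpy as np

# Illustrative only: service times (ms) for one operation across many requests.
service_times_ms = np.array([12.1, 13.4, 11.8, 12.9, 45.2, 12.3, 13.0, 12.7, 14.1, 12.5])

# p90: 90% of requests completed at or below this service time.
p90 = np.percentile(service_times_ms, 90)
print(f"p90 service time: {p90:.2f} ms")
```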

The scope was limited to the Apache License 2.0 (OpenSearch) and Elastic
License 2.0 (Elasticsearch) versions of the two engines and did not include
proprietary systems.
The results can be used to direct future development of individual components
for each engine.

While six OSB workloads were evaluated in our full report, this blog post
highlights results from
Big5
(a comprehensive workload that performs text querying, sorting, date histogram,
range queries, and term aggregation) and
Vectorsearch
(a workload querying a dataset of 10 million 768-dimension vectors using
approximate K-Nearest Neighbor (KNN) search). We compare
recent versions of OpenSearch and Elasticsearch—v2.17.1
(released October 16, 2024) and v8.15.4
(released November 12, 2024), respectively.

Figures 1 and 2 illustrate our results comparing OpenSearch to Elasticsearch on
the Big5 and Vectorsearch workloads:

OpenSearch Benchmark Workloads
Figure 1: Ratio of the geometric mean of the median of p90 service times (in milliseconds)

OpenSearch Benchmark Vectorsearch
Figure 2: Ratio of the geometric mean of the median of p90 service times (in milliseconds). Note that since Elasticsearch only supports Lucene, the other engines (NMSLIB and FAISS) executed with OpenSearch are compared to Lucene executed with Elasticsearch

Observations and Impact

Methodology

We ran each workload against OpenSearch and Elasticsearch once per day, every
day for 15 days (except Vectorsearch, which we ran for 11 days), collecting
between 11 and 15 tests per workload per engine. We set up brand-new AWS
instances each day and executed each workload five times in a row, discarding
the first run to ensure that the hardware, operating system, and application
caches were warmed up for both engines. Each operation in the workload was executed
hundreds to thousands of times. This resulted in thousands to tens of thousands
of sample measurements per operation. We believe this is a large enough sample
size to draw reliable conclusions.
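
As an illustration of this daily procedure, the sketch below runs a workload
five times and keeps only the warmed-up runs. It assumes the standard OSB
execute-test command; the target host and flag values are placeholders that
would need to match a real deployment.

```python
import subprocess

TARGET = "https://search-node:9200"  # hypothetical cluster endpoint
RUNS_PER_DAY = 5                     # the first run is a warm-up and is discarded

kept_runs = []
for run in range(RUNS_PER_DAY):
    # Assumes the standard OSB CLI; adjust the flags for your environment.
    subprocess.run(
        [
            "opensearch-benchmark", "execute-test",
            "--workload=big5",
            "--pipeline=benchmark-only",
            f"--target-hosts={TARGET}",
        ],
        check=True,
    )
    if run > 0:  # keep only the warmed-up runs
        kept_runs.append(run)

print("Runs kept for analysis:", kept_runs)
```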

Results processing and statistical analysis

We observed that some operations (mainly those with service times below one
millisecond) had a statistical power lower than our chosen threshold of 90%
despite having a low p-value: a statistically significant difference was
detected, but the tests were underpowered to estimate its magnitude reliably.
Therefore, we executed additional runs of those
operations to increase their statistical power. After doing so, we confirmed
that any statistically significant difference between OpenSearch and
Elasticsearch performance characteristics in the tasks noted above was
inconsequential. However, users are likely more concerned about the performance
of longer-running queries than those that complete within a few milliseconds.
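
The sketch below illustrates this kind of check. It assumes a Mann-Whitney U
test for the p-value and a Cohen's d-based t-test power approximation from
statsmodels; the exact tests used in our analysis may differ, and the sample
data is synthetic.

```python
import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.stats.power import TTestIndPower

def compare_samples(a, b, alpha=0.05, power_threshold=0.90):
    """Compare two samples of p90 service times (ms) from two engines."""
    # p-value: is there a statistically significant difference?
    _, p_value = mannwhitneyu(a, b, alternative="two-sided")

    # Power approximation via Cohen's d and an independent-samples t-test.
    pooled_sd = np.sqrt((np.var(a, ddof=1) + np.var(b, ddof=1)) / 2)
    effect_size = abs(np.mean(a) - np.mean(b)) / pooled_sd
    power = TTestIndPower().solve_power(
        effect_size=effect_size, nobs1=len(a), ratio=len(b) / len(a), alpha=alpha
    )

    # Low p-value but low power: a difference was detected, yet more runs are
    # needed before its magnitude can be trusted.
    needs_more_runs = p_value < alpha and power < power_threshold
    return p_value, power, needs_more_runs

# Illustrative sub-millisecond samples, where power tends to be low.
rng = np.random.default_rng(0)
engine_a = rng.normal(0.60, 0.15, size=15)
engine_b = rng.normal(0.50, 0.15, size=15)
print(compare_samples(engine_a, engine_b))
```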

Outliers

We calculated the median of each workload operation’s p90 service time. We did
this because we saw non-trivial variations in performance in several runs in
both OpenSearch and Elasticsearch. These outliers can impact the arithmetic
average. We chose not to statistically exclude these outliers (e.g., using
standard deviation or quartiles as the exclusion criteria) because the results
do not necessarily follow a Gaussian (normal) distribution. Therefore, we
believe the median across this large number of independent data points is the
most representative summary statistic for OpenSearch and
Elasticsearch. For completeness, the full report includes sparklines that
visually indicate the degree of variance across queries.

With our methodology established, let’s examine what our extensive testing
revealed about the performance characteristics of these search engines.

Big5 workload overview

The Big5 workload comprises a set of carefully crafted queries that exercise all
the major capabilities available in OpenSearch and Elasticsearch. They fall into
the following categories. For this performance comparison, each query is
weighted the same.

Text Queries: Searching for text is fundamental to any search engine or
database. Entering natural language queries is intuitive for users and does not
require knowledge of the underlying schema, making it ideal for easily searching
through unstructured data.

Term Aggregation: This query type groups documents into buckets based on
specified aggregation values, which is essential for data analytics use cases.

Sorting: Evaluates arranging data alphabetically, numerically,
chronologically, etc. This capability is useful for organizing search results
based on specific criteria, ensuring that the most relevant results are
presented to users.

Date Histogram: This is useful for aggregating and analyzing time-based data
by dividing it into intervals. It allows users to visualize and better
understand trends, patterns, and anomalies over time.

Range Queries: This is useful for filtering search results based on a
specific range of values in a given field. This capability lets users quickly
narrow their search results and find more relevant information.
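
To make these categories concrete, the sketch below shows what a range query, a
date histogram, and a term aggregation look like in the query DSL shared by
OpenSearch and Elasticsearch; the field names are illustrative rather than the
exact Big5 definitions.

```python
# Illustrative request bodies in the query DSL common to OpenSearch and
# Elasticsearch; field names are hypothetical, not the exact Big5 operations.

range_query = {
    "query": {
        "range": {"@timestamp": {"gte": "2023-01-01", "lte": "2023-01-31"}}
    }
}

date_histogram_aggregation = {
    "size": 0,  # return only buckets, not the matching documents
    "aggs": {
        "events_per_day": {
            "date_histogram": {"field": "@timestamp", "calendar_interval": "day"}
        }
    }
}

term_aggregation = {
    "size": 0,
    "aggs": {
        "top_status_codes": {"terms": {"field": "http.response.status_code"}}
    }
}
```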

Big5 workload category results

Using the geometric mean of the median values of all query operations in the
Big5 workload, we observe that OpenSearch v2.17.1 is 1.6x faster than
Elasticsearch v8.15.4. Below, we provide more details about how we arrived at
this estimate.
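
The sketch below illustrates the aggregation pipeline behind this number (and
the median-across-runs step described under Outliers): take the median of each
operation's p90 service times across runs, take the geometric mean across
operations, and compare the two engines as a ratio. The data shown is
illustrative, not our measured results.

```python
import numpy as np
from scipy.stats import gmean

# Illustrative p90 service times (ms) per operation across runs, per engine.
runs = {
    "opensearch": {"term_query": [14.2, 13.9, 15.1], "date_hist": [120.4, 126.0, 124.8]},
    "elasticsearch": {"term_query": [13.8, 14.5, 14.1], "date_hist": [2010.2, 2090.7, 2064.6]},
}

def geomean_of_medians(per_operation_runs):
    # Median across runs for each operation, then geometric mean across operations.
    medians = [np.median(values) for values in per_operation_runs.values()]
    return gmean(medians)

os_score = geomean_of_medians(runs["opensearch"])
es_score = geomean_of_medians(runs["elasticsearch"])
print(f"Elasticsearch / OpenSearch ratio: {es_score / os_score:.2f}x")
```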

First, to ensure that our testing methodology was accurate, we referenced a
recent blog post
from the OpenSearch project on their performance measurements of v2.17. We
compare their reported performance (see the Results section of their post) to
ours in the table below. The comparison includes all operations in Big5 except
the match_all query (named default in the workload) and the scroll query,
following the same protocol as their post.

Geomean of Operation Category – Median of p90 Service Time (ms) | Our Results (v2.17.1) | OpenSearch Results (v2.17)
Text queries | 16.09 | 21.88
Sorting | 5.82 | 7.49
Term aggregations | 104.90 | 114.08
Range queries | 1.47 | 3.30
Date histograms | 124.79 | 164.03

Table 1: Establishing a baseline for OpenSearch

Note that the OpenSearch project publishes its performance numbers nightly at
https://opensearch.org/benchmarks

Since these values are reasonably close (albeit slightly different, most likely
due to different version numbers), we compare our results of running OpenSearch
v2.17.1 to Elasticsearch v8.15.4.

The original blog post above does not include two Big5 operations:
default and scroll. Both use the match_all query, which returns all
documents. Below, we add and categorize them as Text queries.

Geomean of Operation Category – Median of p90 Service Time (ms) | OpenSearch v2.17.1 | Elasticsearch v8.15.4 | OpenSearch is slower/faster than Elasticsearch
Text queries | 18.11 | 7.47 | 2.42x slower
Sorting | 5.82 | 6.14 | 1.05x faster
Term aggregations | 104.90 | 354.52 | 3.38x faster
Range queries | 1.47 | 1.49 | 1.02x faster
Date histograms | 124.79 | 2,064.61 | 16.55x faster
All Operations | 12.1 | 18.8 | 1.56x faster

Table 2: Big5 comparison between OpenSearch and Elasticsearch

Next, we assess the comprehensive Big5 workload testing results, which exercise
core search engine functionality, to understand these observations.

Big5 workload operation results

The following graphs show the differences in median p90 service times across the
individual query operations in the Big5 workload for each category. The y-axis
represents an operation, and the x-axis represents how many times faster one
engine (OpenSearch or Elasticsearch) is than the other.

Text Queries
Figure 3: Text query operations

Sorting
Figure 4: Sorting operations

Term Aggregations
Figure 5: Term aggregation operations

Range Queries
Figure 6: Range query operations

Date Histograms
Figure 7: Date histogram operations

Vectorsearch Workload Results

Having examined traditional search operations, we now turn to vector search
capabilities, an increasingly important feature for modern applications using
AI/ML techniques. Here, we discuss the Vectorsearch workload results.
Force-merge
is enabled by default.

OpenSearch supports three vector search
engines:
NMSLIB, FAISS, and Lucene. These engines cater to various algorithms (HNSW,
HNSW+PQ, IVF, and IVF+PQ) and quantization techniques (ranging from fp16, at 2x
compression, to binary, at 32x compression) based on different user workloads.
The default vector engine for OpenSearch 2.17.1 is NMSLIB; releases after 2.17
have switched the default to FAISS.
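
As an illustration of how a vector engine is selected, here is a sketch of an
OpenSearch k-NN index mapping created with the opensearch-py client; the
endpoint, index name, and HNSW parameters are assumptions rather than the
workload's exact settings.

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=["https://localhost:9200"])  # hypothetical endpoint

# Illustrative 768-dimension HNSW index; swap "faiss" for "nmslib" or "lucene"
# to select a different vector engine.
index_body = {
    "settings": {"index.knn": True},
    "mappings": {
        "properties": {
            "embedding": {
                "type": "knn_vector",
                "dimension": 768,
                "method": {
                    "name": "hnsw",
                    "space_type": "l2",
                    "engine": "faiss",
                    "parameters": {"ef_construction": 128, "m": 16},
                },
            }
        }
    },
}

client.indices.create(index="vectors", body=index_body)
```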

On the other hand, Elasticsearch supports only Lucene. Any reported values in
the charts below for Elasticsearch indicate test runs using the Lucene engine.
For brevity, we specify a search engine and the vector search engine used in
this format: search engine (vector engine).

The Vectorsearch workload consists of one primary query: prod-queries, a
vector search of the ingested data with a recall computation for the ANN search.
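
Recall here measures how many of the true nearest neighbors the approximate
search returns. A minimal, engine-independent sketch of recall@k:

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k neighbors that the ANN search returned."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

# Illustrative: the ANN search missed one of the true top-5 neighbors.
print(recall_at_k([3, 7, 1, 9, 4], [3, 7, 1, 2, 4], k=5))  # 0.8
```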

Similar to Big5, we compare the median p90 service time values. Focusing on an
out-of-the-box experience with the respective default configured engines (NMSLIB
for OpenSearch and Lucene for Elasticsearch), OpenSearch is 11% faster than
Elasticsearch for this metric, with similar recall and the same hyper-parameter
values.

Vectorsearch Performance Details

Comparing each OpenSearch v2.17.1 vector engine against Elasticsearch (Lucene)
v8.15.4 yielded the following findings:

  • OpenSearch (NMSLIB) was 11.3% faster.
  • OpenSearch (FAISS) was 13.8% faster.
  • OpenSearch (Lucene) was 258.2% slower.

The median values are as follows:

Vectorsearch Engine Comparison
Figure 8: Vectorsearch engine comparison

Below are sparklines comparing OpenSearch and Elasticsearch on the Vectorsearch
workload. The x-axis represents time, and the y-axis represents the p90 service
time (in milliseconds). The min and max values represent the minimum and maximum
values of the y-axis for each sparkline, respectively. Each pair of sparklines
in a row is plotted with the same y-axis. All Elasticsearch sparklines plot the
same data, but they appear different from each other due to different y-axis
minimum and maximum values.

Vectorsearch Sparklines
Figure 9: Sparkline comparison of OpenSearch and Elasticsearch on the Vectorsearch workload

As shown above, OpenSearch (Lucene) varies in its performance. While this paints a
clear picture of relative performance, our testing also revealed some important
caveats about consistency that users should consider.

Performance Inconsistencies

We observed slow outlier runs for p90 service times for OpenSearch and
Elasticsearch. We investigated these scenarios but could not identify the root
cause. For example, note the random spikes in Figure 9 above for OpenSearch (Lucene). While
these anomalies did not affect our overall conclusions, they warrant further
investigation. We still included outliers in the datasets when calculating
results, as there was no systematic way to remove them.

We can quantify how extreme outliers are by the ratio of the maximum service
time over the median service time. Using this ratio, we found that OpenSearch has outliers
that are more extreme than those of Elasticsearch.
The tasks with the most extreme
ratios for OpenSearch and Elasticsearch were:

  • OpenSearch 2.17.1: 1412x for composite-date_histogram-daily
  • Elasticsearch 8.15.4: 43x for query-string-on-message

We counted how many tasks have outlier runs using the criterion of a run with a
value that is more than twice as slow as the median. We found that
Elasticsearch has more outliers than OpenSearch:

  • OpenSearch 2.17.1: 11 outlier tasks out of 98
  • Elasticsearch 8.15.4: 19 outlier tasks out of 98
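
Both outlier measures can be expressed in a few lines; the sketch below uses
illustrative data, not our measured runs.

```python
import numpy as np

# Illustrative p90 service times (ms) per task across runs.
task_runs = {
    "composite-date_histogram-daily": [1.1, 1.0, 1.2, 880.0, 1.1],
    "query-string-on-message": [14.9, 15.2, 14.7, 31.0, 15.3],
}

for task, values in task_runs.items():
    values = np.array(values)
    median = np.median(values)
    extremity = values.max() / median          # how extreme the worst run is
    has_outlier = np.any(values > 2 * median)  # outlier criterion: > 2x the median
    print(f"{task}: max/median = {extremity:.0f}x, outlier run present: {has_outlier}")
```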

Repeatable, Open-Sourced Benchmarking

Based on these findings about both performance and consistency, we’ve developed
several key recommendations for conducting reliable search engine benchmarks:

  • Always run workloads on newly created instances. If not, variations in
    workload performance may not be observed, which would skew a user’s
    expectations.

  • After collecting data, measure both the p-value and the statistical power to ensure
    statistical reliability. Measuring p-values across runs with the same
    configuration helps detect anomalies where you expect high p-values (> 0.05)
    when comparing similar runs. Measure against different configurations (like
    different setups or engines) to confirm that changes produce statistically
    different results.

  • Benchmarking should use configurations that closely match the out-of-the-box
    experience. Sometimes, changes are needed for a fair benchmark. In these
    cases, document the changes and explain why they aren’t suitable for the
    default configuration.

  • A snapshot approach that may create more consistent results is to flush the
    index, refresh the index, and then wait for merges to complete before taking a
    snapshot, as sketched below. We found promising initial results in testing this
    approach with the Vectorsearch workload, but have not extensively tested this
    strategy.
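
Here is a sketch of that snapshot sequence using the opensearch-py client; the
endpoint, index, and repository names are hypothetical, and polling index stats
for in-flight merges is one possible way to wait for merges rather than a
verified recipe.

```python
import time
from opensearchpy import OpenSearch

client = OpenSearch(hosts=["https://localhost:9200"])  # hypothetical endpoint
INDEX, REPO = "vectorsearch", "benchmark-snapshots"    # hypothetical names

client.indices.flush(index=INDEX)
client.indices.refresh(index=INDEX)

# Poll index stats until no merges are in flight before snapshotting.
while client.indices.stats(index=INDEX)["_all"]["total"]["merges"]["current"] > 0:
    time.sleep(10)

client.snapshot.create(
    repository=REPO,
    snapshot="post-ingest",
    body={"indices": INDEX},
)
```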

Looking beyond our specific findings, we wanted to ensure that our work could serve
as a foundation for future benchmarking efforts. We focused on creating
repeatable and objective performance comparisons between OpenSearch and
Elasticsearch and used GitHub Actions to make our experiments easy to reproduce.
This enables ongoing performance comparisons in the future.

If you’re interested in how we can support your project, please
contact us.