
Migrate from Apache Solr to OpenSearch


OpenSearch is an open source, distributed search engine suitable for a wide range of use cases such as ecommerce search, enterprise search (content management search, document search, knowledge management search, and so on), site search, application search, and semantic search. It's also an analytics suite that you can use to perform interactive log analytics, real-time application monitoring, security analytics, and more. Like Apache Solr, OpenSearch provides search across document sets. OpenSearch also includes capabilities to ingest and analyze data. Amazon OpenSearch Service is a fully managed service that you can use to deploy, scale, and monitor OpenSearch in the AWS Cloud.

Many organizations are migrating their Apache Solr based search solutions to OpenSearch. The main driving factors include lower total cost of ownership, scalability, stability, improved ingestion connectors (such as Data Prepper, Fluent Bit, and OpenSearch Ingestion), elimination of external cluster managers like Zookeeper, enhanced reporting, and rich visualizations with OpenSearch Dashboards.

We recommend approaching a Solr to OpenSearch migration with a full refactor of your search solution to optimize it for OpenSearch. While both Solr and OpenSearch use Apache Lucene for core indexing and query processing, the systems exhibit different characteristics. By planning and running a proof of concept, you can ensure the best results from OpenSearch. This blog post dives into the strategic considerations and steps involved in migrating from Solr to OpenSearch.

Key differences

Solr and OpenSearch Service share fundamental capabilities delivered through Apache Lucene. However, there are some key differences in terminology and functionality between the two:

  • Collection and index: In OpenSearch, a collection is called an index.
  • Shard and replica: Both Solr and OpenSearch use the terms shard and replica.
  • API-driven interactions: All interactions in OpenSearch are API-driven, eliminating the need for manual file modifications or Zookeeper configuration. When creating an OpenSearch index, you define the mapping (equivalent to the schema) and the settings (equivalent to solrconfig) as part of the index creation API call.

Having set the stage with the basics, let's dive into the four key components and how each of them can be migrated from Solr to OpenSearch.

Collection to index

A collection in Solr is called an index in OpenSearch. Like a Solr collection, an index in OpenSearch also has shards and replicas.

Although the shard and replica concept is similar in both search engines, you can use this migration as a window to adopt a better sharding strategy. Size your OpenSearch shards, replicas, and index by following shard strategy best practices.

As part of the migration, rethink your data model. In examining your data model, you can find efficiencies that dramatically improve your search latencies and throughput. Poor data modeling doesn't only result in search performance problems but extends to other areas. For example, you might find it challenging to construct an effective query to implement a particular feature. In such cases, the solution often involves modifying the data model.

Differences: Solr allows primary shard and replica shard collocation on the same node. OpenSearch doesn't place the primary and replica on the same node. OpenSearch Service zone awareness can automatically distribute shards across different Availability Zones (data centers) to further improve resiliency.

The OpenSearch and Solr notions of replica are different. In OpenSearch, you define a primary shard count using number_of_shards, which determines the partitioning of your data. You then set a replica count using number_of_replicas. Each replica is a copy of all the primary shards. So, if you set number_of_shards to 5 and number_of_replicas to 1, you will have 10 shards (5 primary shards and 5 replica shards). Setting replicationFactor=1 in Solr yields one copy of the data (the primary).

For example, the following creates a collection called test with one shard and a replication factor of 1 (a single copy of the data, with no additional replicas).

http://localhost:8983/solr/admin/collections?
  action=CREATE
  &maxShardsPerNode=2
  &name=test
  &numShards=1
  &replicationFactor=1
  &wt=json

In OpenSearch, the following creates an index called test with five shards and one replica:

PUT test
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}
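After creating the index, you can verify how primaries and replicas were distributed across nodes with the CAT shards API. This is a quick, optional check; it assumes the test index created above:

```json
GET _cat/shards/test?v
```

The `v` flag adds column headers to the response, which the CAT APIs return as plain text by default.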

Schema to mapping

In Solr, schema.xml or managed-schema holds all the field definitions, dynamic fields, and copy fields, together with the field types (text analyzers, tokenizers, or filters). You use the Schema API to manage the schema, or you can run in schema-less mode.

OpenSearch has dynamic mapping, which behaves like Solr in schema-less mode. It's not necessary to create an index beforehand to ingest data. By indexing data with a new index name, you create the index with OpenSearch managed service default settings (for example: "number_of_shards": 5, "number_of_replicas": 1) and a mapping based on the data that's indexed (dynamic mapping).

We strongly recommend you opt for a pre-defined strict mapping. OpenSearch sets the schema based on the first value it sees in a field. If a stray numeric value is the first value for what is otherwise a string field, OpenSearch will incorrectly map the field as numeric (integer, for example). Subsequent indexing requests with string values for that field will then fail with a mapping exception. You know your data and your field types; you will benefit from setting the mapping directly.
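A minimal sketch of such a strict, pre-defined mapping (the index and field names here are illustrative, not from the original example): with "dynamic" set to "strict", a document containing an unmapped field is rejected instead of silently widening the schema.

```json
PUT strict_example
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "title": { "type": "text" },
      "views": { "type": "integer" }
    }
  }
}
```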

Tip: Consider performing a sample indexing run to generate an initial mapping, then refine and tidy up that mapping to accurately define the actual index. This approach helps you avoid manually building the mapping from scratch.
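One way to apply this tip, sketched here with an assumed scratch index named mapping_probe: index one representative document, read back the dynamically generated mapping, then copy and refine it for the real index.

```json
POST mapping_probe/_doc
{
  "name": "Jane Doe",
  "age": 43,
  "last_modified": "2023-11-02T10:00:00Z"
}

GET mapping_probe/_mapping
```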

For observability workloads, you should consider using Simple Schema for Observability. Simple Schema for Observability (also known as ss4o) is a standard for conforming to a common and unified observability schema. With the schema in place, observability tools can ingest, automatically extract, and aggregate data and create custom dashboards, making it easier to understand the system at a higher level.

Many of the field types (data types), tokenizers, and filters are the same in both Solr and OpenSearch. After all, both use Lucene's Java search library at their core.

Let's look at an example:

<!-- Solr schema.xml snippets -->
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="name" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="address" type="text_general" indexed="true" stored="true"/>
<field name="user_token" type="string" indexed="false" stored="true"/>
<field name="age" type="pint" indexed="true" stored="true"/>
<field name="last_modified" type="pdate" indexed="true" stored="true"/>
<field name="city" type="text_general" indexed="true" stored="true"/>

<uniqueKey>id</uniqueKey>

<copyField source="name" dest="text"/>
<copyField source="address" dest="text"/>

<fieldType name="string" class="solr.StrField" sortMissingLast="true" />
<fieldType name="pint" class="solr.IntPointField" docValues="true"/>
<fieldType name="pdate" class="solr.DatePointField" docValues="true"/>

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false" />
    <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false" />
    <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>

PUT index_from_solr
{
  "settings": {
    "analysis": {
      "analyzer": {
        "text_general": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "keyword",
        "copy_to": "text"
      },
      "address": {
        "type": "text",
        "analyzer": "text_general"
      },
      "user_token": {
        "type": "keyword",
        "index": false
      },
      "age": {
        "type": "integer"
      },
      "last_modified": {
        "type": "date"
      },
      "city": {
        "type": "text",
        "analyzer": "text_general"
      },
      "text": {
        "type": "text",
        "analyzer": "text_general"
      }
    }
  }
}

Notable things in OpenSearch compared to Solr:

  1. _id is always the uniqueKey and can't be defined explicitly, because it's always present.
  2. Explicitly enabling multivalued isn't necessary, because any OpenSearch field can contain zero or more values.
  3. The mapping and the analyzers are defined during index creation. New fields can be added and certain mapping parameters can be updated later. However, deleting a field isn't possible. The handy Reindex API can overcome this problem: you can use it to copy data from one index into another.
  4. By default, analyzers apply at both index and query time. For some less-common scenarios, you can change the query analyzer at search time (in the query itself), which overrides the analyzer defined in the index mapping and settings.
  5. Index templates are also a great way to initialize new indexes with predefined mappings and settings. For example, if you continuously index log data (or any time-series data), you can define an index template so that all of the indices have the same number of shards and replicas. Index templates can also be used for dynamic mapping control and component templates.
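A sketch of the Reindex API mentioned in point 3, with assumed source and destination index names: create the destination index with the corrected mapping first, then copy the documents across.

```json
POST _reindex
{
  "source": { "index": "old_index" },
  "dest":   { "index": "new_index" }
}
```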

Look for opportunities to optimize the search solution. For instance, if analysis reveals that the city field is only used for filtering rather than searching, consider changing its field type to keyword instead of text to eliminate unnecessary text processing. Another optimization might involve disabling doc_values for the user_token field if it's only intended for display purposes. doc_values are disabled by default for the text datatype.
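Those two optimizations could look like the following mapping fragment (a sketch; the index name is an assumption and only the changed fields are shown):

```json
PUT index_from_solr_optimized
{
  "mappings": {
    "properties": {
      "city": {
        "type": "keyword"
      },
      "user_token": {
        "type": "keyword",
        "index": false,
        "doc_values": false
      }
    }
  }
}
```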

SolrConfig to settings

In Solr, solrconfig.xml carries the collection configuration: all kinds of settings pertaining to everything from index location and format, caching, codec factory, circuit breakers, commits, and tlogs, all the way up to slow query configuration, request handlers, and the update processing chain.

Let's look at an example:

<codecFactory class="solr.SchemaCodecFactory">
<str name="compressionMode">BEST_COMPRESSION</str>
</codecFactory>

<autoCommit>
    <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
    <openSearcher>false</openSearcher>
</autoCommit>

<autoSoftCommit>
    <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
</autoSoftCommit>

<slowQueryThresholdMillis>1000</slowQueryThresholdMillis>

<maxBooleanClauses>${solr.max.booleanClauses:2048}</maxBooleanClauses>

<requestHandler name="/query" class="solr.SearchHandler">
    <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="wt">json</str>
    <str name="indent">true</str>
    <str name="df">text</str>
    </lst>
</requestHandler>

<searchComponent name="spellcheck" class="solr.SpellCheckComponent"/>
<searchComponent name="suggest" class="solr.SuggestComponent"/>
<searchComponent name="elevator" class="solr.QueryElevationComponent"/>
<searchComponent class="solr.HighlightComponent" name="highlight"/>

<queryResponseWriter name="json" class="solr.JSONResponseWriter"/>
<queryResponseWriter name="velocity" class="solr.VelocityResponseWriter" startup="lazy"/>
<queryResponseWriter name="xslt" class="solr.XSLTResponseWriter"/>

<updateRequestProcessorChain name="script"/>

Notable things in OpenSearch compared to Solr:

  1. Both OpenSearch and Solr use the BEST_SPEED codec by default (the LZ4 compression algorithm). Both offer BEST_COMPRESSION as an alternative. Additionally, OpenSearch offers zstd and zstd_no_dict. Benchmarks for the different compression codecs are also available.
  2. For near real-time search, refresh_interval needs to be set. The default is 1 second, which is good enough for most use cases. We recommend increasing refresh_interval to 30 or 60 seconds to improve indexing speed and throughput, especially for batch indexing.
  3. Max boolean clause is a static setting, set at the node level using the indices.query.bool.max_clause_count setting.
  4. You don't need an explicit requestHandler. All searches use the _search or _msearch endpoint. If you're used to using a requestHandler with default values, you can use search templates instead.
  5. If you're used to using the /sql requestHandler, OpenSearch also lets you query with SQL syntax and offers a Piped Processing Language (PPL).
  6. Spellcheck (also known as did-you-mean), query elevation (called pinned_query in OpenSearch), and highlighting are all supported at query time. You don't have to explicitly define search components.
  7. Most API responses are limited to JSON format, with the CAT APIs as the only exception. In cases where Velocity or XSLT is used in Solr, it must be handled at the application layer. The CAT APIs respond in JSON, YAML, or CBOR formats.
  8. For the updateRequestProcessorChain, OpenSearch provides the ingest pipeline, allowing the enrichment or transformation of data before indexing. Multiple processor stages can be chained to form a pipeline for data transformation. Processors include GrokProcessor, CSVParser, JSONProcessor, KeyValue, Rename, Split, HTMLStrip, Drop, ScriptProcessor, and more. However, it's strongly recommended to do data transformation outside OpenSearch. The best place to do that is OpenSearch Ingestion, which provides a proper framework and various out-of-the-box filters for data transformation. OpenSearch Ingestion is built on Data Prepper, a server-side data collector capable of filtering, enriching, transforming, normalizing, and aggregating data for downstream analytics and visualization.
  9. OpenSearch also introduced search pipelines, similar to ingest pipelines but tailored for search-time operations. Search pipelines make it easier for you to process search queries and search results within OpenSearch. Currently available search processors include filter query, neural query enricher, normalization, rename field, script processor, and personalize search ranking, with more to come.
  10. refresh_interval and slow logs are configured through the index settings API, alongside the other per-index settings.
  11. Slow logs can be set with much more precision than in Solr, with separate thresholds for the query and fetch phases.
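As a sketch of the last two points (the index name and threshold values are assumptions, not recommendations), refresh_interval is a dynamic index setting, and the search slow log takes separate warn thresholds for the query and fetch phases:

```json
PUT test/_settings
{
  "index": {
    "refresh_interval": "30s",
    "search": {
      "slowlog": {
        "threshold": {
          "query": { "warn": "1s" },
          "fetch": { "warn": "500ms" }
        }
      }
    }
  }
}
```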

Before migrating each configuration setting, assess whether the setting should be adjusted based on your current search system experience and best practices. For instance, in the preceding example, the slow query threshold of 1 second might be too aggressive for logging, so it can be revisited. In the same example, maxBooleanClauses might be another setting to look at and reduce.

Differences: Some settings are made at the cluster level or node level rather than at the index level, including settings such as the max boolean clause limit, circuit breaker settings, cache settings, and so on.
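For example, indices.query.bool.max_clause_count is a static node-level setting that goes into opensearch.yml on each node (a restart is required), while dynamic cluster-wide settings such as circuit breaker limits can be changed at runtime through the cluster settings API. The 70% value below is an illustrative assumption, not a recommendation:

```json
PUT _cluster/settings
{
  "persistent": {
    "indices.breaker.total.limit": "70%"
  }
}
```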

Rewriting queries

Rewriting queries deserves its own blog post; however, we want to at least showcase the autocomplete feature available in OpenSearch Dashboards, which helps ease query writing.

Similar to the Solr Admin UI, OpenSearch includes a UI called OpenSearch Dashboards. You can use OpenSearch Dashboards to manage and scale your OpenSearch clusters. Additionally, it provides capabilities for visualizing your OpenSearch data, exploring data, observability monitoring, running queries, and so on. The equivalent of the query tab in the Solr UI is Dev Tools in OpenSearch Dashboards. Dev Tools is a development environment that lets you set up your OpenSearch Dashboards environment, run queries, explore data, and debug problems.

Now, let's construct a query to accomplish the following:

  1. Search for shirt OR shoe in an index.
  2. Create a facet query to find the number of unique customers. Facet queries are called aggregation queries (also known as aggs queries) in OpenSearch.

The Solr query would look like this:

http://localhost:8983/solr/solr_sample_data_ecommerce/select?q=shirt OR shoe
  &facet=true
  &facet.field=customer_id
  &facet.limit=-1
  &facet.mincount=1
  &json.facet={
   unique_customer_count:"unique(customer_id)"
  }

The same Solr query can be rewritten as an OpenSearch query DSL request, using a query clause for the search terms and a cardinality aggregation for the unique customer count.
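A sketch of that rewrite, assuming the same index and field names as the Solr example above:

```json
GET solr_sample_data_ecommerce/_search
{
  "query": {
    "query_string": {
      "query": "shirt OR shoe"
    }
  },
  "aggs": {
    "unique_customer_count": {
      "cardinality": {
        "field": "customer_id"
      }
    }
  }
}
```

The cardinality aggregation returns an approximate distinct count, comparable to Solr's unique() facet function.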

Conclusion

OpenSearch covers a wide variety of use cases, including enterprise search, site search, application search, ecommerce search, semantic search, observability (log analytics, security analytics (SIEM), anomaly detection, trace analytics), and analytics. Migration from Solr to OpenSearch is becoming a common pattern. This blog post is designed to be a starting point for teams seeking guidance on such migrations.

You can try out OpenSearch with the OpenSearch Playground, and you can get started with Amazon OpenSearch Service, a managed implementation of OpenSearch in the AWS Cloud.


About the Authors

Aswath Srinivasan is a Senior Search Engine Architect at Amazon Web Services, currently based in Munich, Germany. With over 17 years of experience in various search technologies, Aswath currently focuses on OpenSearch. He's a search and open-source enthusiast and helps customers and the search community with their search problems.

Jon Handler is a Senior Principal Solutions Architect at Amazon Web Services, based in Palo Alto, CA. Jon works closely with OpenSearch and Amazon OpenSearch Service, providing help and guidance to a broad range of customers who have search and log analytics workloads that they want to move to the AWS Cloud. Prior to joining AWS, Jon's career as a software developer included four years of coding a large-scale, ecommerce search engine. Jon holds a Bachelor of the Arts from the University of Pennsylvania, and a Master of Science and a PhD in Computer Science and Artificial Intelligence from Northwestern University.
