Amazon OpenSearch Ingestion is a fully managed, serverless data pipeline that simplifies the process of ingesting data into Amazon OpenSearch Service and OpenSearch Serverless collections. Some key concepts include:
- Source – Input component that specifies how the pipeline ingests data. Each pipeline has a single source, which can be either push-based or pull-based.
- Processors – Intermediate processing units that can filter, transform, and enrich records before delivery.
- Sink – Output component that specifies the destination(s) to which the pipeline publishes data. It can publish records to multiple destinations.
- Buffer – The layer between the source and the sink. It serves as temporary storage for events, decoupling the source from the downstream processors and sinks. Amazon OpenSearch Ingestion also offers a persistent buffer option for push-based sources.
- Dead-letter queues (DLQs) – Configures Amazon Simple Storage Service (Amazon S3) to capture records that fail to write to the sink, enabling error handling and troubleshooting.
This end-to-end data ingestion service can help you collect, process, and deliver data to your OpenSearch environments without the need to manage the underlying infrastructure.
This post provides an in-depth look at setting up Amazon CloudWatch alarms for OpenSearch Ingestion pipelines. It goes beyond our recommended alarms to help identify bottlenecks in the pipeline, whether that's in the sink, the OpenSearch clusters data is being sent to, the processors, or the pipeline not pulling or accepting enough from the source. This post will help you proactively monitor and troubleshoot your OpenSearch Ingestion pipelines.
Overview
Monitoring your OpenSearch Ingestion pipelines is crucial for catching and addressing issues early. By understanding the key metrics and setting up the right alarms, you can proactively manage the health and performance of your data ingestion workflows. In the following sections, we provide details about alarm metrics for various sources, processors, and sinks. The exact values for the threshold, period, and datapoints to alarm can vary based on your individual use case and requirements.
Prerequisites
To create an OpenSearch Ingestion pipeline, refer to Creating Amazon OpenSearch Ingestion pipelines. To create CloudWatch alarms, refer to Create a CloudWatch alarm based on a static threshold.
You can enable logging for your OpenSearch Ingestion pipeline, which captures various log messages during pipeline operations and ingestion activity, including errors, warnings, and informational messages. For details on enabling and monitoring pipeline logs, refer to Monitoring pipeline logs.
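As a concrete starting point, the following sketch builds the parameters for one of the count-based static-threshold alarms described later in this post, using boto3. The pipeline name, sub-pipeline name, SNS topic ARN, and the `AWS/OSIS` namespace and `PipelineName` dimension are assumptions; verify them against the metrics your pipeline actually emits in the CloudWatch console.

```python
# Hypothetical sketch: parameters for a static-threshold alarm on a pipeline
# metric (opensearch.bulkRequestErrors.count: SUM > 0 over 5 minutes,
# 1 out of 1 datapoints). Namespace and dimension names are assumptions;
# confirm them in the CloudWatch console for your pipeline.
import json

def build_alarm_params(pipeline, sub_pipeline, sns_topic_arn):
    return {
        "AlarmName": f"{pipeline}-bulk-request-errors",
        "Namespace": "AWS/OSIS",  # assumed namespace for OpenSearch Ingestion
        "MetricName": f"{sub_pipeline}.opensearch.bulkRequestErrors.count",
        "Dimensions": [{"Name": "PipelineName", "Value": pipeline}],
        "Statistic": "Sum",
        "Period": 300,            # 5 minutes
        "EvaluationPeriods": 1,
        "DatapointsToAlarm": 1,   # 1 out of 1
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }

params = build_alarm_params("my-pipeline", "my-sub-pipeline",
                            "arn:aws:sns:us-east-1:123456789012:alerts")
print(json.dumps(params, indent=2))
# To create the alarm for real:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

The same helper can be reused for the other count-based alarms in the tables below by swapping the metric name.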
Sources
The entry point of your pipeline is often where monitoring should begin. By setting appropriate alarms for source components, you can quickly identify ingestion bottlenecks or connection issues. The following table summarizes key alarm metrics for various sources.
| Source | Alarm | Description | Recommended Action |
| --- | --- | --- | --- |
| HTTP / OpenTelemetry | requestsTooLarge.count<br>Threshold: >0<br>Statistic: SUM<br>Period: 5 minutes<br>Datapoints to alarm: 1 out of 1 | The request payload size from the client (data producer) is larger than the maximum request payload size, resulting in HTTP status code 413. The default maximum request payload size is 10 MB for HTTP sources and 4 MB for OpenTelemetry sources. The limit for HTTP sources can be increased for pipelines with persistent buffer enabled. | The chunk size for the client can be reduced so that the request payload doesn't exceed the maximum size. You can examine the distribution of payload sizes of incoming requests using the payloadSize.sum metric. |
| HTTP | requestsRejected.count<br>Threshold: >0<br>Statistic: SUM<br>Period: 5 minutes<br>Datapoints to alarm: 1 out of 1 | The request was sent to the HTTP endpoint of the OpenSearch Ingestion pipeline by the client (data producer), but the request wasn't accepted by the pipeline, which rejected it with status code 429 in the response. | For persistent issues, consider increasing the minimum OCUs for the pipeline to allocate more resources for request processing. |
| Amazon S3 | s3ObjectsFailed.count<br>Threshold: >0<br>Statistic: SUM<br>Period: 5 minutes<br>Datapoints to alarm: 1 out of 1 | The pipeline is unable to read some objects from the Amazon S3 source. | Refer to REF-003 in the Reference Guide below. |
| Amazon DynamoDB | Difference between totalOpenShards.max and activeShardsInProcessing.value<br>Threshold: >0<br>Statistic: Maximum (totalOpenShards.max) and Sum (activeShardsInProcessing.value)<br>Datapoints to alarm: 3 out of 3<br>Additional note: refer to REF-004 for more details on configuring this specific alarm. | It monitors alignment between the total open shards that should be processed by the pipeline and the active shards currently in processing. The activeShardsInProcessing.value will go down periodically as shards close, but should never misalign from totalOpenShards.max for longer than a few minutes. | If the alarm is triggered, you can consider stopping and starting the pipeline. This option resets the pipeline's state, and the pipeline will restart with a new full export. It's non-destructive, so it doesn't delete your index or any data in DynamoDB. If you don't create a fresh index before you do this, you might see a high number of errors from version conflicts, because the export tries to insert documents older than the current _version in the index. You can safely ignore these errors. For root cause analysis of the misalignment, you can reach out to AWS Support. |
| Amazon DynamoDB | dynamodb.changeEventsProcessingErrors.count<br>Threshold: >0<br>Statistic: SUM<br>Period: 5 minutes<br>Datapoints to alarm: 1 out of 1 | The number of processing errors for change events for a pipeline with stream processing for DynamoDB. | If the metric reports increasing values, refer to REF-002 in the Reference Guide below. |
| Amazon DocumentDB | documentdb.exportJobFailure.count<br>Threshold: >0<br>Statistic: SUM<br>Period: 5 minutes<br>Datapoints to alarm: 1 out of 1 | The attempt to trigger an export to Amazon S3 failed. | Review ERROR-level logs in the pipeline logs for entries beginning with "Received an exception during export from DocumentDB, backing off and retrying." These logs contain the complete exception details indicating the root cause of the failure. |
| Amazon DocumentDB | documentdb.changeEventsProcessingErrors.count<br>Threshold: >0<br>Statistic: SUM<br>Period: 5 minutes<br>Datapoints to alarm: 1 out of 1 | The number of processing errors for change events for a pipeline with stream processing for Amazon DocumentDB. | Refer to REF-002 in the Reference Guide below. |
| Kafka | kafka.numberOfDeserializationErrors.count<br>Threshold: >0<br>Statistic: SUM<br>Period: 5 minutes<br>Datapoints to alarm: 1 out of 1 | The OpenSearch Ingestion pipeline encountered deserialization errors while consuming a record from Kafka. | Review WARN-level logs in the pipeline logs and verify that serde_format is configured correctly in the pipeline configuration and that the pipeline role has access to the AWS Glue Schema Registry (if used). |
| OpenSearch | opensearch.processingErrors.count<br>Threshold: >0<br>Statistic: SUM<br>Period: 5 minutes<br>Datapoints to alarm: 1 out of 1 | Processing errors were encountered while reading from the index. Ideally, the OpenSearch Ingestion pipeline would retry automatically, but for unknown exceptions, it might skip processing. | Refer to REF-001 or REF-002 in the Reference Guide below to get the exception details that resulted in processing errors. |
| Amazon Kinesis Data Streams | kinesis_data_streams.recordProcessingErrors.count<br>Threshold: >0<br>Statistic: SUM<br>Period: 5 minutes<br>Datapoints to alarm: 1 out of 1 | The OpenSearch Ingestion pipeline encountered an error while processing the records. | If the metric reports increasing values, refer to REF-002 in the Reference Guide below, which can help in identifying the cause. |
| Amazon Kinesis Data Streams | kinesis_data_streams.acknowledgementSetFailures.count<br>Threshold: >0<br>Statistic: SUM<br>Period: 5 minutes<br>Datapoints to alarm: 1 out of 1 | The pipeline encountered a negative acknowledgment while processing the streams, causing it to reprocess the stream. | Refer to REF-001 or REF-002 in the Reference Guide below. |
| Confluence | confluence.searchRequestsFailed.count<br>Threshold: >0<br>Statistic: SUM<br>Period: 5 minutes<br>Datapoints to alarm: 1 out of 1 | While trying to fetch the content, the pipeline encountered an exception. | Review ERROR-level logs in the pipeline logs for entries beginning with "Error while fetching content." These logs contain the complete exception details indicating the root cause of the failure. |
| Confluence | confluence.authFailures.count<br>Threshold: >0<br>Statistic: SUM<br>Period: 5 minutes<br>Datapoints to alarm: 1 out of 1 | The number of UNAUTHORIZED exceptions received while establishing the connection. | Although the service should automatically renew tokens, if the metric shows an increasing value, review ERROR-level logs in the pipeline logs to identify why the token refresh is failing. |
| Jira | jira.ticketRequestsFailed.count<br>Threshold: >0<br>Statistic: SUM<br>Period: 5 minutes<br>Datapoints to alarm: 1 out of 1 | While trying to fetch the issue, the pipeline encountered an exception. | Review ERROR-level logs in the pipeline logs for entries beginning with "Error while fetching issue." These logs contain the complete exception details indicating the root cause of the failure. |
| Jira | jira.authFailures.count<br>Threshold: >0<br>Statistic: SUM<br>Period: 5 minutes<br>Datapoints to alarm: 1 out of 1 | The number of UNAUTHORIZED exceptions received while establishing the connection. | Although the service should automatically renew tokens, if the metric shows an increasing value, review ERROR-level logs in the pipeline logs to identify why the token refresh is failing. |
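The requestsTooLarge row suggests reducing the client's chunk size so each request stays under the payload limit. The following is a minimal client-side sketch, assuming the 10 MB default for HTTP sources and an approximate accounting of JSON list separators; the function and record shapes are illustrative, not part of any OpenSearch Ingestion SDK.

```python
# Sketch: split records into request bodies that stay under the HTTP-source
# payload limit. The 10 MB default and the +2 bytes per record (approximating
# the ", " list separator) are assumptions for illustration.
import json

MAX_BODY_BYTES = 10 * 1024 * 1024  # assumed 10 MB HTTP-source default

def chunk_records(records, max_bytes=MAX_BODY_BYTES):
    """Yield lists of records whose JSON-array body stays under max_bytes."""
    batch, size = [], 2  # 2 bytes for the surrounding "[]"
    for rec in records:
        encoded = len(json.dumps(rec).encode("utf-8")) + 2  # +2 for ", "
        if batch and size + encoded > max_bytes:
            yield batch
            batch, size = [], 2
        batch.append(rec)
        size += encoded
    if batch:
        yield batch

# Example: five ~111-byte records with a 250-byte cap fit in three batches.
batches = list(chunk_records([{"msg": "x" * 100} for _ in range(5)],
                             max_bytes=250))
```

Each yielded batch can then be POSTed to the pipeline's HTTP endpoint as one request body.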
Processors
The following table provides details about alarm metrics for various processors.
| Processor | Alarm | Description | Recommended Action |
| --- | --- | --- | --- |
| AWS Lambda | aws_lambda_processor.recordsFailedToSentLambda.count<br>Threshold: >0<br>Statistic: SUM<br>Period: 5 minutes<br>Datapoints to alarm: 1 out of 1 | Some of the records couldn't be sent to Lambda. | In the case of high values for this metric, refer to REF-002 in the Reference Guide below. |
| AWS Lambda | aws_lambda_processor.numberOfRequestsFailed.count<br>Threshold: >0<br>Statistic: SUM<br>Period: 5 minutes<br>Datapoints to alarm: 1 out of 1 | The pipeline was unable to invoke the Lambda function. | Although this situation shouldn't occur under normal conditions, if it does, review the Lambda logs and refer to REF-002 in the Reference Guide below. |
| AWS Lambda | aws_lambda_processor.requestPayloadSize.max<br>Threshold: >= 6292536<br>Statistic: MAXIMUM<br>Period: 5 minutes<br>Datapoints to alarm: 1 out of 1 | The payload size is exceeding the 6 MB limit, so the Lambda function can't be invoked. | Consider revisiting the batching thresholds in the pipeline configuration for the aws_lambda processor. |
| Grok | grok.grokProcessingMismatch.count<br>Threshold: >0<br>Statistic: SUM<br>Period: 5 minutes<br>Datapoints to alarm: 1 out of 1 | The incoming data doesn't match the Grok pattern defined in the pipeline configuration. | In the case of high values for this metric, review the Grok processor configuration and make sure the defined pattern matches the incoming data. |
| Grok | grok.grokProcessingErrors.count<br>Threshold: >0<br>Statistic: SUM<br>Period: 5 minutes<br>Datapoints to alarm: 1 out of 1 | The pipeline encountered an exception when extracting information from the incoming data according to the defined Grok pattern. | In the case of high values for this metric, refer to REF-002 in the Reference Guide below. |
| Grok | grok.grokProcessingTime.max<br>Threshold: >= 1000<br>Statistic: MAXIMUM<br>Period: 5 minutes<br>Datapoints to alarm: 1 out of 1 | The maximum amount of time that each individual record takes to match against patterns from the match configuration option. | If the time taken is equal to or higher than 1 second, examine the incoming data and the Grok pattern. The maximum amount of time during which matching occurs is 30,000 milliseconds, which is controlled by the timeout_millis parameter. |
Sinks and DLQs
The following table contains details about alarm metrics for various sinks and DLQs.
| Sink | Alarm | Description | Recommended Action |
| --- | --- | --- | --- |
| OpenSearch | opensearch.bulkRequestErrors.count<br>Threshold: >0<br>Statistic: SUM<br>Period: 5 minutes<br>Datapoints to alarm: 1 out of 1 | The number of errors encountered while sending a bulk request. | Refer to REF-002 in the Reference Guide below, which can help identify the exception details. |
| OpenSearch | opensearch.bulkRequestFailed.count<br>Threshold: >0<br>Statistic: SUM<br>Period: 5 minutes<br>Datapoints to alarm: 1 out of 1 | The number of errors received after sending the bulk request to the OpenSearch domain. | Refer to REF-001 in the Reference Guide below, which can help identify the exception details. |
| Amazon S3 | s3.s3SinkObjectsFailed.count<br>Threshold: >0<br>Statistic: SUM<br>Period: 5 minutes<br>Datapoints to alarm: 1 out of 1 | The OpenSearch Ingestion pipeline encountered a failure while writing the object to Amazon S3. | Verify that the pipeline role has the necessary permissions to write objects to the specified S3 key. Review the pipeline logs to identify the specific keys where failures occurred. Monitor the s3.s3SinkObjectsEventsFailed.count metric for granular details on the number of failed write operations. |
| Amazon S3 DLQ | s3.dlqS3RecordsFailed.count<br>Threshold: >0<br>Statistic: SUM<br>Period: 5 minutes<br>Datapoints to alarm: 1 out of 1 | For a pipeline with DLQ enabled, the records are either sent to the sink or to the DLQ (if they can't be delivered to the sink). This alarm indicates the pipeline was unable to send the records to the DLQ due to some error. | Refer to REF-002 in the Reference Guide below, which can help identify the exception details. |
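When the DLQ does receive records, a quick way to spot patterns is to tally entries by exception type. The following sketch assumes each DLQ object is a JSON array whose entries carry an error message under a nested field; those field names (`failedData`, `message`) are assumptions, so adapt them to what your DLQ objects actually contain.

```python
# Sketch: summarize failure reasons across DLQ objects fetched from S3.
# The entry layout is an assumption for illustration, not a documented schema.
import json
from collections import Counter

def summarize_dlq(raw_objects):
    """Count DLQ entries by the leading token of their error message."""
    reasons = Counter()
    for raw in raw_objects:
        for entry in json.loads(raw):
            message = entry.get("failedData", {}).get("message", "unknown")
            # Bucket by exception type, e.g. "mapper_parsing_exception".
            reasons[message.split(":")[0].strip() or "unknown"] += 1
    return reasons

# Hypothetical DLQ object body for demonstration:
sample = json.dumps([
    {"failedData": {"message": "mapper_parsing_exception: failed to parse field [price]"}},
    {"failedData": {"message": "mapper_parsing_exception: failed to parse field [qty]"}},
    {"failedData": {"message": "illegal_argument_exception: limit of total fields exceeded"}},
])
print(summarize_dlq([sample]))
```

In practice, `raw_objects` would be the bodies of objects listed from the DLQ bucket prefix (for example, via `boto3`'s S3 client).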
Buffer
The following table contains details about alarm metrics for buffers.
| Buffer | Alarm | Description | Recommended Action |
| --- | --- | --- | --- |
| BlockingBuffer | BlockingBuffer.bufferUsage.value<br>Threshold: >80<br>Statistic: AVERAGE<br>Period: 5 minutes<br>Datapoints to alarm: 1 out of 1 | The percent utilization, based on the number of records in the buffer. | To investigate further, check if the pipeline is bottlenecked due to processors or the sink by evaluating the timeElapsed.max metrics and analyzing bulkRequestLatency.max. |
| Persistent | persistentBufferRead.recordsLagMax.value<br>Threshold: > 5000<br>Statistic: AVERAGE<br>Period: 5 minutes<br>Datapoints to alarm: 1 out of 1 | The maximum lag in terms of the number of records stored in the persistent buffer. | If the value for bufferUsage is low, increase the maximum OCUs. If bufferUsage is also high (>80), check if the pipeline is bottlenecked by processors or the sink. |
Reference Guide
The following references provide guidance for resolving common pipeline issues.
REF-001: WARN-level Log Review
Review WARN-level logs in the pipeline logs to identify the exception details.
REF-002: ERROR-level Log Review
Review ERROR-level logs in the pipeline logs to identify the exception details.
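One way to perform this review programmatically is to filter the pipeline's CloudWatch log group for ERROR entries. The sketch below only builds the request parameters; the log group name pattern is an assumption, so copy the real name from your pipeline's logging configuration.

```python
# Sketch: parameters for pulling recent ERROR-level pipeline log entries.
# The log group name pattern is assumed; check your pipeline's log settings.
from datetime import datetime, timedelta, timezone

def error_log_query(pipeline_name, minutes=60):
    """Build filter_log_events parameters for recent ERROR entries."""
    start = datetime.now(timezone.utc) - timedelta(minutes=minutes)
    return {
        "logGroupName": f"/aws/vendedlogs/OpenSearchIngestion/{pipeline_name}",
        "filterPattern": "ERROR",
        "startTime": int(start.timestamp() * 1000),  # milliseconds since epoch
    }

params = error_log_query("my-pipeline")
# To run against your account:
#   import boto3
#   for event in boto3.client("logs").filter_log_events(**params)["events"]:
#       print(event["message"])
```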
REF-003: S3 Objects Failed
When troubleshooting increasing s3ObjectsFailed.count values, monitor these specific metrics to narrow down the root cause:
- s3ObjectsAccessDenied.count – This metric increments when the pipeline encounters Access Denied or Forbidden errors while reading S3 objects. Common causes include:
  - Insufficient permissions in the pipeline role.
  - A restrictive S3 bucket policy not allowing the pipeline role access.
  - For cross-account S3 buckets, an incorrectly configured bucket_owners mapping.
- s3ObjectsNotFound.count – This metric increments when the pipeline receives Not Found errors while attempting to read S3 objects.
For further assistance with the recommended actions, contact AWS Support.
REF-004: Configuring an alarm for the difference between totalOpenShards.max and activeShardsInProcessing.value for the Amazon DynamoDB source.
- Open the CloudWatch console at https://console.aws.amazon.com/cloudwatch/.
- In the navigation pane, choose Alarms, then All alarms.
- Choose Create alarm.
- Choose Select metric.
- Select Source.
- In Source, the following JSON can be used after updating the <sub-pipeline-name>, <pipeline-name>, and <region> placeholders.
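The original JSON is not reproduced here. As a hypothetical sketch under stated assumptions, a metric-math source for the console's Source tab could look like the following; the `AWS/OSIS` namespace, metric names, and `PipelineName` dimension are guesses that must be matched to the metrics your pipeline actually emits, and the placeholders are kept as in the steps above.

```json
{
  "metrics": [
    [ { "expression": "m1 - m2", "label": "Shard misalignment", "id": "e1" } ],
    [ "AWS/OSIS", "<sub-pipeline-name>.dynamodb.totalOpenShards",
      "PipelineName", "<pipeline-name>",
      { "id": "m1", "stat": "Maximum", "visible": false } ],
    [ "AWS/OSIS", "<sub-pipeline-name>.dynamodb.activeShardsInProcessing",
      "PipelineName", "<pipeline-name>",
      { "id": "m2", "stat": "Sum", "visible": false } ]
  ],
  "view": "timeSeries",
  "region": "<region>",
  "period": 300
}
```

The alarm is then created on the `e1` expression with a threshold of >0 and 3 out of 3 datapoints, as listed in the DynamoDB row of the sources table.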
Let's review a couple of scenarios based on the preceding metrics.
Scenario 1 – Understand and Lower Pipeline Latency
Latency within a pipeline is built up of three main components:
- The time it takes to send documents via bulk requests to OpenSearch,
- the time it takes for data to move through the pipeline processors, and
- the time that data sits in the pipeline buffer.
Bulk requests and processors (the first two items in the preceding list) are the root causes of why the buffer builds up and leads to latency.
To observe how much data is being stored in the buffer, monitor the bufferUsage.value metric. The only way to lower latency within the buffer is to optimize the pipeline processors and the sink bulk request latency, depending on which of those is the bottleneck.
The bulkRequestLatency metric measures the time taken to execute bulk requests, including retries, and can be used to monitor write performance to the OpenSearch sink. If this metric reports an unusually high value, it indicates that the OpenSearch sink may be overloaded, causing increased processing time. To troubleshoot further, review the bulkRequestNumberOfRetries.count metric to confirm whether the high latency is due to rejections from OpenSearch that are leading to retries, such as throttling (429 errors) or other causes. If document errors are present, examine the configured DLQ to identify the failed document details. Additionally, the max_retries parameter can be configured in the pipeline configuration to limit the number of retries. However, if the documentErrors metric reports zero, the bulkRequestNumberOfRetries.count is also zero, and the bulkRequestLatency remains high, it's likely an indicator that the OpenSearch sink is overloaded. In that case, review the destination metrics for more details.
If the bulkRequestLatency metric is low (for example, less than 1.5 seconds) and the bulkRequestNumberOfRetries metric is reported as 0, then the bottleneck is likely within the pipeline processors. To monitor the performance of the processors, review the <processorName>.timeElapsed.avg metric. This metric reports the time taken for the processor to complete processing a batch of records. For example, if a grok processor reports a much higher value than other processors for timeElapsed, it may be due to a slow grok pattern that can be optimized, or even replaced with a more performant processor, depending on the use case.
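The decision logic above can be sketched as a small triage helper. The 1.5 second bulk-latency guideline comes from the text; the function name, inputs, and output strings are illustrative.

```python
# Sketch: encode scenario-1 triage. Inputs are metric readings you would
# pull from CloudWatch; the thresholds follow the guidance in the text.
def diagnose_bottleneck(bulk_latency_s, retries, doc_errors, processor_time_s):
    """Return a best-guess bottleneck from scenario-1 metrics."""
    if retries > 0:
        return "sink rejections causing retries (throttling or other errors)"
    if doc_errors > 0:
        return "document errors present: inspect the DLQ"
    if bulk_latency_s >= 1.5:
        return "OpenSearch sink likely overloaded: check destination metrics"
    # Otherwise the slowest processor is the prime suspect.
    slowest = max(processor_time_s, key=processor_time_s.get)
    return f"pipeline processors: '{slowest}' has the highest timeElapsed"

print(diagnose_bottleneck(
    bulk_latency_s=0.4, retries=0, doc_errors=0,
    processor_time_s={"grok": 2.1, "date": 0.05},
))
# -> pipeline processors: 'grok' has the highest timeElapsed
```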
Scenario 2 – Understanding and Resolving Document Errors to OpenSearch
The documentErrors.count metric tracks the number of documents that failed to be sent by bulk requests. Failures can happen due to various causes such as mapping conflicts, invalid data formats, or schema mismatches. When this metric reports a non-zero value, it indicates that some documents are being rejected by OpenSearch. To identify the root cause, examine the configured dead-letter queue (DLQ), which captures the failed documents along with error details. The DLQ provides information about why specific documents failed, enabling you to identify patterns such as incorrect field types, missing required fields, or data that exceeds size limits. For example, see the sample DLQ objects for common issues below:
Mapper parsing exception:
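The original sample object is not reproduced here. A hypothetical DLQ entry for this class of failure might look like the following; all field names and values are invented for illustration and will differ from your actual DLQ output:

```json
[
  {
    "pipelineName": "my-pipeline",
    "failedData": {
      "index": "sample-index",
      "status": 400,
      "message": "mapper_parsing_exception: failed to parse field [price] of type [float], preview of field's value: 'N/A'",
      "document": { "item": "widget", "price": "N/A" }
    }
  }
]
```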
Here, OpenSearch cannot store the text string "N/A" in a field that accepts only numbers, so it rejects the document and stores it in the DLQ.
Limit of total fields exceeded:
The index.mapping.total_fields.limit setting controls the maximum number of fields allowed in an index mapping, and exceeding this limit will cause indexing operations to fail. You can check whether all of these fields are required, or leverage the various processors provided by OpenSearch Ingestion to transform the data.
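If raising the limit is the right call for your workload, it can be done with a PUT to the index `_settings` endpoint. The sketch below only constructs the request using the standard library; the endpoint, index name, and new limit are illustrative, and authentication (for example, SigV4 for Amazon OpenSearch Service) is omitted.

```python
# Sketch: build a PUT /<index>/_settings request to raise
# index.mapping.total_fields.limit. Values are illustrative; auth omitted.
import json
import urllib.request

def build_settings_request(endpoint, index, new_limit):
    body = json.dumps({"index.mapping.total_fields.limit": new_limit}).encode()
    return urllib.request.Request(
        url=f"{endpoint}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )

req = build_settings_request("https://my-domain.example.com", "sample-index", 2000)
# urllib.request.urlopen(req)  # would apply the change (needs network and auth)
```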
Once these issues are identified, you can correct the source data, modify the pipeline configuration to transform the data appropriately, or adjust the OpenSearch index mapping to accommodate the incoming data format.
Clean up
When setting up alarms for monitoring your OpenSearch Ingestion pipelines, it's important to be mindful of the potential costs involved. Each alarm you configure will incur charges based on the CloudWatch pricing model.
To avoid unnecessary expenses, we recommend carefully evaluating your alarm requirements and configuring alarms accordingly. Only set up the alarms that are essential for your use case, and regularly review your alarm configurations to identify and remove unused or redundant alarms.
Conclusion
In this post, we explored the comprehensive monitoring capabilities for OpenSearch Ingestion pipelines through CloudWatch alarms, covering key metrics across various sources, processors, and sinks. Although this post highlights the most important metrics, there's more to discover. For a deeper dive, refer to the following resources:
Effective monitoring through CloudWatch alarms is crucial for maintaining healthy ingestion pipelines and sustaining optimal data flow.
