Build a data pipeline from Google Search Console to Amazon Redshift using AWS Glue

Google Search Console (GSC) is a service provided by Google that helps you monitor, maintain, and troubleshoot your website's presence in Google Search results. It gives you unique insights directly from Google about how the search engine sees your website, helping you improve your performance in Search Engine Results Pages (SERPs).

When you need to merge Google Search Console data with multiple data sources or conduct complex performance analysis, traditional methods can become time-consuming and error-prone. This is where Amazon Redshift and AWS Glue offer a comprehensive data integration solution.

In this post, we explore how AWS Glue extract, transform, and load (ETL) capabilities connect Google applications and Amazon Redshift, helping you unlock deeper insights and drive data-informed decisions through automated data pipeline management. We walk you through the process of using AWS Glue to integrate data from Google Search Console and write it to Amazon Redshift.

Solution overview

AWS Glue is a serverless data integration service that helps discover, prepare, and combine data for analytics, machine learning (ML), and application development. You can use AWS Glue to create, run, and monitor data integration and ETL pipelines and catalog your assets across multiple data stores.

Amazon Redshift is a fast, scalable, and fully managed cloud data warehouse that lets you process and run complex SQL analytics workloads on structured and semi-structured data. It also helps you securely access your data in operational databases, data lakes, or third-party datasets with minimal movement or copying of data. Tens of thousands of customers use Amazon Redshift to process large amounts of data, modernize their data analytics workloads, and provide insights for their business users.

The following diagram illustrates the architecture that we implement in this post.

The workflow consists of an AWS Glue job reading data from Google Search Console for the three entities that Google Search Console supports (Search Analytics, Sites, and Sitemaps), and writing the data to a Redshift provisioned cluster. AWS Glue supports Google Search Console API v3.

In the following sections, we walk through the following steps to configure AWS Glue to set up a connection between Google Search Console and Amazon Redshift for data migration:

Create an OAuth client.
Create an IAM role for AWS Glue integration with Google Search Console, AWS Secrets Manager, and Amazon Redshift.
Create a secret in Secrets Manager to store the client secret generated with the OAuth client.
Create a connection to Google Search Console in AWS Glue.
Create a connection to Amazon Redshift in AWS Glue.
Set up tables and permissions in Amazon Redshift.
Create an ETL job in AWS Glue.

Prerequisites

Before starting this walkthrough, you must have the following prerequisites in place:

An AWS account.
A Google Cloud account and a Google Cloud project.
In your Google Cloud project, you must enable the Google Search Console API.

For instructions, see Enable and disable APIs in the API Console Help for Google Cloud Platform.
An Amazon Redshift provisioned cluster or Amazon Redshift Serverless.

In this post, we use a single-node ra3.large Redshift provisioned cluster deployed in a single Availability Zone. This configuration is used for demonstration purposes only. For production environments, we recommend using multi-node clusters with a minimum of two nodes deployed across multiple Availability Zones for high availability and better performance.
An Amazon Simple Storage Service (Amazon S3) bucket.
An AWS Identity and Access Management (IAM) role that grants AWS Glue and Amazon Redshift read-only access to Amazon S3. This role is attached to the Redshift cluster or Redshift Serverless namespace during creation, and is also used when running the AWS Glue job, along with permissions to read and write secrets in Secrets Manager. Refer to the Amazon Redshift Database Developer Guide for more details.

Create OAuth client

To connect to Google Search Console, AWS Glue requires OAuth 2.0 for authentication. You must create an OAuth 2.0 client ID, which AWS Glue uses when requesting an OAuth 2.0 access token. To create an OAuth 2.0 client ID in the Google Cloud Platform console, follow these steps:

On the Google Cloud Platform console, from the projects list, choose a project or create a new one.
If the APIs & Services page isn't already open, choose the menu icon on the upper left and choose APIs & Services.
In the navigation pane, choose Credentials.
Choose Create Credentials, then choose OAuth client ID.
Select Web application as the application type, enter NewClient as the name, and provide https://console.aws.amazon.com for Authorized JavaScript origins.
For Authorized redirect URIs, add https://us-east-1.console.aws.amazon.com/gluestudio/oauth. This example uses us-east-1 for setting up AWS Glue jobs; change the redirect URIs according to your AWS Region. You can also specify multiple redirect URIs.
Choose Create.
Open the details page for your new client.
Under Additional information, note down the client ID and client secret. You will need these details when configuring the secret in Secrets Manager.
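
The redirect URI registered above embeds the AWS Region. The following is a minimal sketch, not an official utility, that generates one redirect URI per Region. It assumes other Regions follow the same URL pattern as the us-east-1 example in this post; the helper names `glue_oauth_redirect_uri` and `redirect_uris` are hypothetical.

```python
def glue_oauth_redirect_uri(region: str) -> str:
    """Return the AWS Glue Studio OAuth redirect URI for one Region.

    Assumption: every Region follows the us-east-1 pattern shown in this post.
    """
    region = region.strip().lower()
    if not region:
        raise ValueError("region must not be empty")
    return f"https://{region}.console.aws.amazon.com/gluestudio/oauth"


def redirect_uris(regions):
    """Build one redirect URI per Region, keeping order and dropping duplicates."""
    uris = []
    for region in regions:
        uri = glue_oauth_redirect_uri(region)
        if uri not in uris:
            uris.append(uri)
    return uris
```

For example, `redirect_uris(["us-east-1", "eu-west-1"])` yields two URIs that can both be added under Authorized redirect URIs.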

Create IAM role for AWS Glue integration with Google Search Console, Secrets Manager, and Amazon Redshift

You can use AWS Glue to transfer data from supported sources into your Redshift databases. You need an IAM role because AWS Glue needs authorization to write to Redshift databases. To create a role, complete the following steps:

Sign in to the IAM console with sufficient access to create policies.
Choose Policies in the navigation pane.
Choose Create policy.

On the JSON tab, enter the following policy. AWS Glue needs the following permissions to access and run SQL statements in the Redshift database and to create and retrieve secrets with Secrets Manager:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "secretsmanager:DescribeSecret",
                "secretsmanager:GetSecretValue",
                "secretsmanager:PutSecretValue",
                "ec2:CreateNetworkInterface",
                "ec2:DescribeNetworkInterfaces",
                "ec2:DeleteNetworkInterface"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::aws-glue-studio-transforms-510798373988-prod-us-east-1/*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::aws-glue-assets-testbucket/*"
            ]
        },
        {
            "Sid": "DataAPIPermissions",
            "Effect": "Allow",
            "Action": [
                "redshift-data:ExecuteStatement",
                "redshift-data:GetStatementResult",
                "redshift-data:DescribeStatement"
            ],
            "Resource": "*"
        },
        {
            "Sid": "GetCredentialsForAPIUser",
            "Effect": "Allow",
            "Action": "redshift:GetClusterCredentials",
            "Resource": [
                "arn:aws:redshift:*:*:dbname:*/*",
                "arn:aws:redshift:*:*:dbuser:*/*"
            ]
        },
        {
            "Sid": "GetCredentialsForServerless",
            "Effect": "Allow",
            "Action": "redshift-serverless:GetCredentials",
            "Resource": "*"
        },
        {
            "Sid": "DenyCreateAPIUser",
            "Effect": "Deny",
            "Action": "redshift:CreateClusterUser",
            "Resource": [
                "arn:aws:redshift:*:*:dbuser:*/*"
            ]
        },
        {
            "Sid": "ServiceLinkedRole",
            "Effect": "Allow",
            "Action": "iam:CreateServiceLinkedRole",
            "Resource": "arn:aws:iam::*:role/aws-service-role/redshift-data.amazonaws.com/AWSServiceRoleForRedshift",
            "Condition": {
                "StringLike": {
                    "iam:AWSServiceName": "redshift-data.amazonaws.com"
                }
            }
        }
    ]
}

Replace the staging bucket name in the policy (aws-glue-assets-testbucket in this example) with the name of the S3 bucket you're using as the staging bucket. Additionally, AWS Glue must have access to specific AWS owned S3 buckets that host AWS Glue transforms. In this example, the IAM policy uses aws-glue-studio-transforms-510798373988-prod-us-east-1, which is the AWS owned bucket in the us-east-1 Region. Refer to Review IAM permissions needed for ETL jobs for the appropriate bucket name for your Region.

Choose Next.
For Policy name, enter a name (for this post, we use glue-redshift-gsc-policy).
Enter a description, then choose Create policy.
In the navigation pane, choose Roles and Create role.

Choose Custom trust policy and enter the following, then choose Next.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "glue.amazonaws.com"
                ]
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

Search for and select the policy glue-redshift-gsc-policy, then choose Next.
Provide the role name GlueIAMRoleRedshiftNew (or another name) and a relevant description, then choose Create role.
After the role is created, choose Add permissions and Attach policies.
Search for AWSGlueServiceRole and choose Add permissions. This policy is generally attached to roles specified when defining crawlers, jobs, and development endpoints.
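
The console steps above can also be approximated in code. The following is a minimal sketch using boto3, assuming boto3 is installed, your credentials allow IAM changes, and the custom policy (glue-redshift-gsc-policy) already exists. The function names are hypothetical; boto3 is imported lazily so the trust policy helper can be used without AWS access.

```python
import json

# AWS managed policy attached in the last console step.
GLUE_MANAGED_POLICY_ARN = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"


def glue_trust_policy() -> dict:
    """Trust policy that lets the AWS Glue service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": ["glue.amazonaws.com"]},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def create_glue_role(role_name: str, custom_policy_arn: str) -> str:
    """Create the role, attach the custom and managed policies, return the role ARN."""
    import boto3  # lazy import: only needed when actually calling AWS

    iam = boto3.client("iam")
    role = iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(glue_trust_policy()),
    )
    for policy_arn in (custom_policy_arn, GLUE_MANAGED_POLICY_ARN):
        iam.attach_role_policy(RoleName=role_name, PolicyArn=policy_arn)
    return role["Role"]["Arn"]
```

Here `custom_policy_arn` stands for the ARN of the glue-redshift-gsc-policy policy in your account, which you would look up after creating it.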

Create secret in Secrets Manager

Complete the following steps to create a Secrets Manager secret:

On the Secrets Manager console, choose Store a new secret.
Select Other type of secret.
For the customer-managed connected app, the secret should contain the connected app's consumer secret, with USER_MANAGED_CLIENT_APPLICATION_CLIENT_SECRET as the key and the client secret you noted when creating the OAuth client as the value.
Choose Next.
Enter a secret name and choose Next.
Choose Store.
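
As an illustration of the secret's shape, the following sketch builds the key-value payload described above and, optionally, stores it with boto3. It assumes boto3 is installed and that your credentials can call Secrets Manager; `build_secret_string` and `store_secret` are hypothetical helper names.

```python
import json

# Key name required for the customer-managed connected app, as described above.
SECRET_KEY = "USER_MANAGED_CLIENT_APPLICATION_CLIENT_SECRET"


def build_secret_string(client_secret: str) -> str:
    """Return the JSON payload holding the OAuth client secret."""
    if not client_secret:
        raise ValueError("client_secret must not be empty")
    return json.dumps({SECRET_KEY: client_secret})


def store_secret(secret_name: str, client_secret: str) -> str:
    """Create the secret in Secrets Manager and return its ARN."""
    import boto3  # lazy import: only needed when actually calling AWS

    client = boto3.client("secretsmanager")
    response = client.create_secret(
        Name=secret_name,
        SecretString=build_secret_string(client_secret),
    )
    return response["ARN"]
```

Avoid hardcoding the client secret in source files; read it from a prompt or an environment variable before passing it in.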

Create connection to Google Search Console in AWS Glue

To create a connection to Google Search Console in AWS Glue, follow these steps:

Sign in to the AWS Glue console. Have ready an authorized Google email ID that already has permissions in Google Search Console.
In the navigation pane, choose Data connections.
Under Connections, choose Create connection.
In Data sources, search for Google Search Console and choose Next.
For IAM Role ARN, choose the role created earlier.
For Token URL, use https://oauth2.googleapis.com/token, which is the default value.
For User Managed Client Application ClientId, enter the client ID created earlier while creating the OAuth client.
For AWS Secret, choose the secret created earlier.
If your AWS Glue jobs need to run in an Amazon virtual private cloud (VPC), provide the appropriate details. For more information, refer to Configure a VPC for your ETL job.
Choose Test connection, choose your Google ID, and choose Continue.
Choose Continue to trust the connection.

If the user has authorized access, the connection test will be successful.
Choose Next.
Provide a connection name and choose Create connection.

Create connection to Amazon Redshift in AWS Glue

Complete the following steps to set up an AWS Glue connection for Amazon Redshift. Refer to Redshift connections for more information.

On the AWS Glue console, in the navigation pane, choose Data connections.
Under Connections, choose Create connection.
In Data sources, search for JDBC and choose Next. For Amazon Redshift, you can also use Redshift connections. In this post, we use JDBC with a Redshift provisioned cluster.
Provide the Amazon Redshift JDBC URL and either use a Secrets Manager secret for storing credentials or provide the user name and password directly. As a best practice, we recommend using Secrets Manager.
Configure network options with Amazon VPC settings for running the AWS Glue job in a VPC. In this example, we use the same VPC, subnet, and security group where the Redshift cluster is provisioned. All JDBC data stores must be accessible from the VPC subnet. A VPC endpoint is required to access Amazon S3 from within your VPC. If your job needs to access both VPC resources and the public internet, configure a NAT gateway in the VPC.
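
The JDBC URL requested in the connection form follows the pattern jdbc:redshift://endpoint:port/database. The following small sketch assembles it; the helper name `redshift_jdbc_url` and the endpoint in the usage note are hypothetical, and 5439 is the default Redshift port, which your cluster may override.

```python
def redshift_jdbc_url(endpoint: str, database: str, port: int = 5439) -> str:
    """Assemble a Redshift JDBC URL from the cluster endpoint and database name."""
    endpoint = endpoint.strip()
    database = database.strip()
    if not endpoint or not database:
        raise ValueError("endpoint and database are required")
    if not 1 <= port <= 65535:
        raise ValueError("port must be between 1 and 65535")
    return f"jdbc:redshift://{endpoint}:{port}/{database}"
```

For instance, with a placeholder endpoint such as examplecluster.abc123.us-east-1.redshift.amazonaws.com and the test database used later in this post, the helper returns jdbc:redshift://examplecluster.abc123.us-east-1.redshift.amazonaws.com:5439/test.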

Set up tables and permissions in Amazon Redshift

To set up tables and permissions in Amazon Redshift, follow these steps:

On the Amazon Redshift console, choose Query editor v2.
Connect to your existing Redshift cluster.

Create the tables with the following DDL. For this post, we create a new database named test and create the following tables in the public schema of the test database:

-- Create the database
CREATE DATABASE test;

-- Connect to the test database before creating the tables below

-- Sitemap table creation
CREATE TABLE public.sitemap (
    path VARCHAR(4096) ENCODE lzo,
    type VARCHAR(255) ENCODE lzo,
    lastSubmitted TIMESTAMP ENCODE delta,
    isPending BOOLEAN NULL ENCODE raw,
    isSitemapsIndex BOOLEAN NULL ENCODE raw,
    lastDownloaded TIMESTAMP NULL ENCODE delta,
    warnings BIGINT NULL ENCODE delta,
    errors BIGINT NULL ENCODE delta,
    contents VARCHAR(65535) NULL ENCODE lzo
) DISTSTYLE AUTO;

-- Search Analytics table creation
CREATE TABLE public.search_analytics (
    keys character varying(2048) ENCODE lzo,
    clicks double precision ENCODE raw,
    impressions double precision ENCODE raw,
    ctr numeric(38, 18) ENCODE az64,
    position double precision ENCODE raw
) DISTSTYLE AUTO;

-- Sites table creation
CREATE TABLE public.sites (
    siteurl character varying(2048) ENCODE lzo,
    permissionLevel character varying(50) ENCODE lzo
) DISTSTYLE AUTO;
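
If you prefer to submit the DDL from a script instead of Query editor v2, the Redshift Data API permissions in the IAM policy make that possible. The following is a minimal sketch, assuming boto3 is installed and a provisioned cluster with a database user that can obtain temporary credentials. The cluster identifier, database user, and function names are hypothetical.

```python
def build_statement_request(sql: str, cluster_id: str, database: str, db_user: str) -> dict:
    """Build keyword arguments for the Data API execute_statement call."""
    sql = sql.strip()
    if not sql:
        raise ValueError("sql must not be empty")
    return {
        "ClusterIdentifier": cluster_id,
        "Database": database,
        "DbUser": db_user,
        "Sql": sql,
    }


def submit_statements(statements, cluster_id: str, database: str, db_user: str):
    """Submit each statement and return the statement IDs.

    The Data API is asynchronous: an ID only means the statement was accepted,
    so check describe_statement before relying on the result.
    """
    import boto3  # lazy import: only needed when actually calling AWS

    client = boto3.client("redshift-data")
    ids = []
    for sql in statements:
        request = build_statement_request(sql, cluster_id, database, db_user)
        ids.append(client.execute_statement(**request)["Id"])
    return ids
```

Run CREATE DATABASE against an existing database first, then submit the CREATE TABLE statements with `database="test"` so the tables land in the new database.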

Create ETL job in AWS Glue

To create a data flow in AWS Glue, follow these steps:

On the AWS Glue console, choose ETL jobs in the navigation pane.
Choose Visual ETL under Create job.

Each ETL job in AWS Glue is priced based on its duration.
For the source, choose Google Search Console, and for the target, choose Amazon Redshift.
Choose Source (Google Search Console) to configure the properties, which open in the right pane.
Choose the Google Search Console connection created in the previous sections, and provide the entity name. At the time of writing, there are three supported entities: Search Analytics, Sites, and Sitemaps, with multiple supported fields and operators for each entity. Choose the entity name and the corresponding fields; by default, the connector selects all fields. The example shows selecting the entity Site and the corresponding fields siteUrl and permissionLevel.
Choose Target (Amazon Redshift) to configure the properties, which open in the right pane.
Choose the Amazon Redshift connection, schema, and table name that were created in the previous steps. In this example, we use Append to target table as the method for handling the data. An S3 directory is provided for staging temporary files.
Navigate to Job details and provide a job name and IAM role (which the job will assume while running). This is the same role created earlier.
Choose Save and Run. For this example, we use AWS Glue version 5.0, keeping all other configuration values under Job details at their defaults. We have not implemented any schema mapping, so the columns in Amazon Redshift were created to match the output response for the Search entity.
After the job has completed successfully, navigate to Query Editor v2 in Amazon Redshift and query the sites table to preview the data.
If the job fails, validate the connections by doing a data preview, and refer to Troubleshooting AWS Glue.
Similar to the Site entity, you can load Sitemap entity data by changing the source properties and the destination table in the target Redshift cluster, then choosing Run.
Navigate to Query Editor v2 in Amazon Redshift and query the sitemap table to preview the data.
Similar to Sitemap, you can load Search Analytics entity data by changing the source properties and the destination table in the target Redshift cluster, then choosing Run.
Navigate to Query Editor v2 in Amazon Redshift and query the search_analytics table to preview the data.
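
After the job is saved, it can also be started and monitored outside the console. The following sketch uses the boto3 Glue client to start a run and poll until it reaches a terminal state. It assumes boto3 is installed and the job name matches the one you saved; the function names and the 30-second polling interval are illustrative choices.

```python
import time

# Job run states after which the run will not change any further.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"}


def is_terminal(state: str) -> bool:
    """Return True when a Glue job run state is final."""
    return state in TERMINAL_STATES


def run_job_and_wait(job_name: str, poll_seconds: int = 30) -> str:
    """Start the Glue job and block until it finishes; returns the final state."""
    import boto3  # lazy import: only needed when actually calling AWS

    glue = boto3.client("glue")
    run_id = glue.start_job_run(JobName=job_name)["JobRunId"]
    while True:
        run = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
        state = run["JobRunState"]
        if is_terminal(state):
            return state
        time.sleep(poll_seconds)
```

A return value other than SUCCEEDED is the cue to check the run's logs and the troubleshooting guidance mentioned above.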

Filter predicates with Search Analytics

The Search Analytics entity supports multiple filters that you can use to view the traffic data for your sites. The following examples show some of the filter predicates that Google Search Console connections support.

start_end_date – The default value for start_end_date is between <30 days ago from the current date> AND <yesterday>. To use a different date range, use the between operator. The following example displays search data from January through September 2025:
```
start_end_date between '2025-01-01' AND '2025-09-30'
```
device – The device filter restricts results to the specified device type, such as DESKTOP, MOBILE, or TABLET.
country – You can filter by country, specified as a three-letter country code (ISO 3166-1 alpha-3).
dimensions – Dimensions group results by zero or more fields, such as country or device. The following example displays search data grouped by country, filtered for a specific country and for mobile devices:
```
dimensions="country" AND country='ind' AND device="MOBILE"
```
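
To avoid typos when composing these predicates, you can generate them from typed values. The following sketch mirrors the two examples above and the default date range described for start_end_date. The helper names are hypothetical, and the quoting simply copies the examples rather than any documented grammar.

```python
from datetime import date, timedelta

DEVICE_TYPES = {"DESKTOP", "MOBILE", "TABLET"}


def date_range_predicate(start: date, end: date) -> str:
    """Build a start_end_date predicate in the form shown above."""
    if start > end:
        raise ValueError("start must not be after end")
    return f"start_end_date between '{start.isoformat()}' AND '{end.isoformat()}'"


def default_date_range(today: date):
    """Return the default range: 30 days before today through yesterday."""
    return today - timedelta(days=30), today - timedelta(days=1)


def country_device_predicate(country_code: str, device: str) -> str:
    """Group by country and filter on one country and one device type."""
    code = country_code.strip().lower()
    device = device.strip().upper()
    if len(code) != 3 or not code.isalpha():
        raise ValueError("country_code must be a three-letter ISO 3166-1 alpha-3 code")
    if device not in DEVICE_TYPES:
        raise ValueError(f"device must be one of {sorted(DEVICE_TYPES)}")
    return 'dimensions="country" AND country=\'{}\' AND device="{}"'.format(code, device)
```

Paste the resulting string into the filter predicate field of the Google Search Console source node.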

Run analytical queries on Amazon Redshift

In this section, we run analytical queries using aggregated data across different search entities.

List all countries where the site position is less than 10 and the device type is MOBILE:

SELECT * FROM search_analytics_device_country WHERE position < 10 AND keys LIKE '%MOBILE%';

List all countries where impressions are greater than 1 and position is less than 10:

SELECT * FROM "test"."public"."search_analytics_country" WHERE impressions > 1 AND position < 10;
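
The same queries can be run from a script through the Redshift Data API. The following sketch submits a query, waits for it to finish, and converts the typed result records into plain Python rows. It assumes boto3 is installed and a provisioned cluster; the function names are hypothetical, and only the first page of results is read.

```python
import time


def record_to_row(record):
    """Convert one Data API record (a list of typed field dicts) to plain values."""
    row = []
    for field in record:
        if field.get("isNull"):
            row.append(None)
        else:
            # Each field dict holds a single typed entry, such as stringValue.
            row.append(next(iter(field.values())))
    return row


def fetch_rows(sql: str, cluster_id: str, database: str, db_user: str, poll_seconds: int = 2):
    """Run a query and return the first page of results as lists of values."""
    import boto3  # lazy import: only needed when actually calling AWS

    client = boto3.client("redshift-data")
    statement_id = client.execute_statement(
        ClusterIdentifier=cluster_id, Database=database, DbUser=db_user, Sql=sql
    )["Id"]
    while True:
        description = client.describe_statement(Id=statement_id)
        status = description["Status"]
        if status == "FINISHED":
            break
        if status in ("FAILED", "ABORTED"):
            raise RuntimeError(description.get("Error", status))
        time.sleep(poll_seconds)
    result = client.get_statement_result(Id=statement_id)
    return [record_to_row(record) for record in result["Records"]]
```

For larger result sets, follow the NextToken value returned by get_statement_result to read the remaining pages.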

Clean up

To avoid incurring charges, clean up the resources in your AWS account by completing the following steps:

On the AWS Glue console, in the navigation pane, choose Job monitoring.
Stop any running jobs created for Google Search Console connections.
From the list of connections, select the connection you created and delete it.
Delete the Redshift provisioned cluster or the Redshift Serverless workgroup and namespace. Amazon Redshift pricing applies for as long as the cluster runs, based on the cluster configuration.
Clean up resources in your Google account by deleting the project that contains the Google Project resources. For instructions, refer to Delete your project.
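
The AWS part of the cleanup can be scripted as well. The following sketch stops active runs of the job, deletes the Glue connection, and deletes a provisioned cluster without a final snapshot. It assumes boto3 is installed and that all names refer to the resources created in this post; the function names are hypothetical, and the calls are destructive, so review them before running.

```python
# Job run states that indicate a run is still in progress.
ACTIVE_STATES = {"STARTING", "RUNNING", "WAITING"}


def active_run_ids(job_runs):
    """Return the IDs of job runs that are still in progress."""
    return [run["Id"] for run in job_runs if run.get("JobRunState") in ACTIVE_STATES]


def clean_up(job_name: str, connection_name: str, cluster_id: str) -> None:
    """Stop active job runs, then delete the connection and the cluster."""
    import boto3  # lazy import: only needed when actually calling AWS

    glue = boto3.client("glue")
    # Only the first page of runs is inspected in this sketch.
    runs = glue.get_job_runs(JobName=job_name)["JobRuns"]
    run_ids = active_run_ids(runs)
    if run_ids:
        glue.batch_stop_job_run(JobName=job_name, JobRunIds=run_ids)
    glue.delete_connection(ConnectionName=connection_name)

    # Skipping the final snapshot permanently discards the cluster's data.
    boto3.client("redshift").delete_cluster(
        ClusterIdentifier=cluster_id,
        SkipFinalClusterSnapshot=True,
    )
```

Redshift Serverless workgroups and namespaces, and the Google Cloud project, still need to be removed separately.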

Conclusion

In this post, we walked you through the process of using AWS Glue to integrate data from Google Search Console and write it to Amazon Redshift, a petabyte-scale data warehouse. Whether you're archiving historical data, performing complex analytics, or preparing data for machine learning, this connector streamlines the process and helps create an integrated data pipeline.

For more information, refer to AWS Glue support for Google Search Console.
