
Diagnostic Data Processing on Cloudera Altus


Feed: Cloud – Cloudera Engineering Blog.
Author: Shelby Khan.

Fig 1 – Cloudera Altus architecture

Introduction

Many of Cloudera’s customers set up Cloudera Manager to collect their clusters’ diagnostic data on a regular schedule and automatically send that data to Cloudera. Cloudera analyzes this data, identifies potential problems in the customer’s environment, and alerts customers. This reduces the back-and-forth when customers file a support case and gives Cloudera critical information for improving future versions of its software. If Cloudera discovers a serious issue, Cloudera searches this diagnostic data and proactively notifies customers who might encounter problems due to the issue. This blog post explains how Cloudera internally uses the Altus as-a-service platform in the cloud to perform these analyses. Offloading processing and ad hoc visualization to the cloud reduces costs because compute resources are used only when needed: transient ETL workloads process incoming data, and stateless data warehouse clusters serve ad hoc visualizations. The clusters share metadata via Altus SDX.

Overview

In Cloudera EDH, diagnostic bundles are ingested from customer environments and normalized using Apache NiFi. The bundles are stored in an AWS S3 bucket with date-based partitions (Step 1 in Fig 1). An AWS Lambda function is scheduled daily (Step 2 in Fig 1) to process the previous day’s data (Step 3 in Fig 1) by spinning up Altus Data Engineering clusters to execute the ETL processing and terminating the clusters on job completion. The jobs on these DE clusters produce a fact table and three dimension tables (a star schema). The extracted data is stored in a different S3 bucket (Step 4 in Fig 1), and the metadata produced by these processes, such as schemas and partitions, is stored in Altus SDX (Step 5 in Fig 1). Whenever data needs to be visualized, a stateless Altus Data Warehouse cluster is created, which provides easy access to the data via the built-in SQL Editor or the JDBC/ODBC Impala connector (Steps 6 and 7 in Fig 1).

Altus SDX

We create configured SDX namespaces for sharing metadata and fine-grained authorization between workloads that run on clusters that we create in Altus Data Engineering and Data Warehouse. Configured SDX namespaces are built on Hive metastore and Sentry databases. These databases are set up and managed by Altus customers.

First, we create two databases on an external database server: one for the Hive metastore and one for Sentry.

Fig 2 – Hive and Sentry external database creation

We will then create a configured namespace using those databases.

Fig 3 – Configured namespace creation

Altus initializes the schemas of the Hive metastore and Sentry databases when a cluster using a configured namespace is started. We chose to grant the namespace admin group all privileges so that the internal user ‘altus’, which executes the Spark job on Data Engineering clusters, has the necessary DDL permissions.

Altus Data Engineering

We need to process the bundles that were uploaded the previous day. To do this, we need a mechanism that periodically starts an Altus Data Engineering cluster, executes the job, and terminates the cluster once processing is done. We use an AWS Lambda function that is triggered periodically by an AWS CloudWatch rule. This Lambda function uses the Altus Java SDK to kick off an ephemeral Data Engineering cluster with a Spark job.
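As a rough illustration only (the original implementation uses the Altus Java SDK, whose API is not shown here), such a scheduled handler mainly has to work out the previous day’s partition and hand it to the job-submission code. In the Node.js sketch below, the bucket name and the createTransientClusterWithSparkJob helper are hypothetical placeholders, not real Altus SDK calls.

// Hypothetical sketch of the daily-scheduled Lambda (illustration only).
// createTransientClusterWithSparkJob stands in for the real Altus SDK call.
async function createTransientClusterWithSparkJob(jobConfig) {
  // Placeholder: in the real system this would call the Altus SDK to create
  // an ephemeral Data Engineering cluster, submit the Spark ETL job, and
  // terminate the cluster when the job completes.
  console.log('Submitting ETL job for', jobConfig.inputPath)
}

exports.handler = async () => {
  // Build the date-based partition for the previous day, e.g. 2019/04/01.
  const yesterday = new Date(Date.now() - 24 * 60 * 60 * 1000)
  const partition = [
    yesterday.getUTCFullYear(),
    String(yesterday.getUTCMonth() + 1).padStart(2, '0'),
    String(yesterday.getUTCDate()).padStart(2, '0')
  ].join('/')

  // Hypothetical bucket layout matching the date-based partitions in Step 1.
  const inputPath = `s3a://diagnostic-bundles/${partition}/`

  await createTransientClusterWithSparkJob({ inputPath })
}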

The job processes the diagnostic data stored in an S3 bucket and creates one fact table and three dimension tables. The data for those tables is written to a different S3 bucket, and the metadata is stored in Altus SDX. Because the job only runs for about 30 minutes, the cluster is deleted as soon as it finishes.

Altus Data Warehouse

Whenever we want to analyze the data, we spin up an Altus Data Warehouse cluster in the same environment, using the same namespace as the Altus Data Engineering cluster. Doing so allows the Altus Data Warehouse to have direct access to the data and metadata created by the Altus Data Engineering clusters.

Fig 4 – Data Warehouse cluster creation

Once the cluster is created we can use the built-in Query Editor to quickly analyze the data.

Fig 5 – Query Data Warehouse cluster using SQL Editor

If you need to generate custom dashboards, connect to the Altus Data Warehouse clusters using the Cloudera Impala JDBC/ODBC connector. Recently, we released a new Impala JDBC driver which can connect to an Altus Data Warehouse cluster by simply specifying the cluster’s name.

Fig 6 – Data Warehouse JDBC connection using cluster name

In the image above, we show how easy it is to connect Tableau to the tables in Altus Data Warehouse.

Fig 7 – Visualizing distribution of services in the Cloudera clusters

In the above visualization, you can see the distribution of services running in Cloudera clusters. Note that this is based on a random dataset.

What’s Next?

To get started with a 30-day free Altus trial, visit us at https://www.cloudera.com/products/altus.html. Send questions or feedback to the Altus community forum.


Separating queries and managing costs using Amazon Athena workgroups


Feed: AWS Big Data Blog.

Amazon Athena is a serverless query engine for data on Amazon S3. Many customers use Athena to query application and service logs, schedule automated reports, and integrate with their applications, enabling new analytics-based capabilities.

Different types of users rely on Athena, including business analysts, data scientists, security, and operations engineers. But how do you separate and manage these workloads so that users get the best experience while minimizing costs?

In this post, I show you how to use workgroups to do the following:

  • Separate workloads.
  • Control user access.
  • Manage query usage and costs.

Separate workloads

By default, all Athena queries execute in the primary workgroup.  As an administrator, you can create new workgroups to separate different types of workloads.  Administrators commonly turn to workgroups to separate analysts running ad hoc queries from automated reports.  Here’s how to build out that separation.

First create two workgroups, one for ad hoc users (ad-hoc-users) and another for automated reports (reporting).

Next, select a specific output location. All queries executed inside this workgroup save their results to this output location. Routing results to a single secure location helps make sure users only access data they are permitted to see. You can also enforce encryption of query results in S3 by selecting the appropriate encryption configuration.

Workgroups also help you simplify the onboarding of new users to Athena. By selecting override client-side settings, you enforce a predefined configuration on all queries within a workgroup. Users no longer have to configure a query results output location or S3 encryption keys. These settings default to the parameters defined for the workgroup where those queries execute. Additionally, each workgroup maintains a unique query history and saved query inventory, making queries easier for you to track down.

Finally, when creating a workgroup, you can add up to 50 key-value pair tags to help identify your workgroup resources. Tags are also useful when attempting to allocate Athena costs between groups of users. Create Name and Dept tags for the ad-hoc-users and reporting workgroups with their name and department association.
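If you prefer scripting to the console, a sketch like the following (using the AWS SDK for JavaScript) shows roughly how the ad-hoc-users workgroup could be created with an enforced output location and the two tags. The output location matches the bucket path used in the IAM policy below; the tag values are only examples.

// Sketch: create the ad-hoc-users workgroup programmatically.
const AWS = require('aws-sdk')
const athena = new AWS.Athena({ region: 'us-east-1' })

const params = {
  Name: 'ad-hoc-users',
  Description: 'Ad hoc analyst queries',
  Configuration: {
    ResultConfiguration: {
      // Same location referenced by the S3 permissions in the policy below.
      OutputLocation: 's3://demo/workgroups/adhocusers/'
    },
    // Equivalent of "override client-side settings" in the console.
    EnforceWorkGroupConfiguration: true,
    PublishCloudWatchMetricsEnabled: true
  },
  Tags: [
    { Key: 'Name', Value: 'ad-hoc-users' },
    { Key: 'Dept', Value: 'analytics' } // example department value
  ]
}

athena.createWorkGroup(params, (err) => {
  if (err) console.error(err)
  else console.log('Workgroup ad-hoc-users created')
})

Creating the reporting workgroup works the same way, with its own name, output location, and Dept tag.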

Control user access to workgroups

Now that you have two workgroups defined, ad-hoc-users and reporting, you must control who can use and update them.  Remember that workgroups are IAM resources and therefore have an ARN. You can use this ARN in the IAM policy that you associate with your users.  In this example, create a single IAM user representing the team of ad hoc users and add the individual to an IAM group. The group contains a policy that enforces what actions these users can perform.

Start by reviewing IAM Policies for Accessing Workgroups and Workgroup Example Policies to familiarize yourself with policy options. Use the following IAM policy to set up permissions for your analyst user. Grant this user only the permissions required for working in the ad-hoc-users workgroup. Make sure that you tweak this policy to match your exact needs:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "athena:ListWorkGroups"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "athena:StartQueryExecution",
                "athena:GetQueryResults",
                "athena:DeleteNamedQuery",
                "athena:GetNamedQuery",
                "athena:ListQueryExecutions",
                "athena:StopQueryExecution",
                "athena:GetQueryResultsStream",
                "athena:GetQueryExecutions",
                "athena:ListNamedQueries",
                "athena:CreateNamedQuery",
                "athena:GetQueryExecution",
                "athena:BatchGetNamedQuery",
                "athena:BatchGetQueryExecution",
                "athena:GetWorkGroup",
                "athena:ListTagsForResource"
            ],
            "Resource": "arn:aws:athena:us-east-1:112233445566:workgroup/ad-hoc-users"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObjectAcl",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:ListMultipartUploadParts"
            ],
            "Resource": "arn:aws:s3:::demo/workgroups/adhocusers/*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "glue:Get*"
            ],
            "Resource": [
                "arn:aws:glue:us-east-1:112233445566:catalog",
                "arn:aws:glue:us-east-1:112233445566:database/amazon",
                "arn:aws:glue:us-east-1:112233445566:table/amazon/*"
            ]
        }
    ]
}

Now your analyst user can execute queries only in the ad-hoc-users workgroup. The analyst user can switch to other workgroups, but they lose access when they try to perform any action. They are further restricted to list and query only those tables that belong to the Amazon database. For more information about controlling access to AWS Glue resources such as databases and tables, see AWS Glue Resource Policies for Access Control.

The following screenshot shows what the analyst user sees in the Athena console:

I’ve created a simple Node.js tool that executes SQL queries stored as files in a given directory. You can find my Athena test runner code in the athena_test_runner GitHub repo. You can use this code to simulate a reporting tool, after configuring it to use a workgroup. To do that, create an IAM role with permissions like those previously defined for the analyst user. This time, restrict access to the reporting workgroup.

The following JavaScript code example shows how to select a workgroup programmatically when executing queries:

// Region matches the workgroup examples above.
const AWS = require('aws-sdk')
const athena = new AWS.Athena({ region: 'us-east-1' })

function executeQueries(files) {
    let params =
    {
      "QueryString": "", 
      "ResultConfiguration": { 
        // With override client-side settings enabled, the workgroup's
        // output location is used for query results.
        "OutputLocation": ""
      },
      "QueryExecutionContext": {
        "Database": "default"
      },
      "WorkGroup":"reporting"
    }
 
    params.QueryString = "SELECT * FROM amazon.final_parquet LIMIT 10"
    return new Promise((resolve, reject) => {
        athena.startQueryExecution(params, (err, results) => {
            if (err) {
                reject(err.message)
            } else {
                resolve(results)
            }
        })
    })
}

Run sample automated reports under the reporting workgroup, with the following command:

node index.js testsuite/

Query histories remain isolated between workgroups. A user logging into the Athena console as an analyst using the ad-hoc-users workgroup doesn’t see any automated reports that you ran under the reporting workgroup.

Managing query usage and cost

You have two workgroups configured: one for ad hoc users and another for automated reports. Now, you must safeguard against bad queries. In this use case, two potential situations for query usage should be monitored and controlled:

  • Make sure that users don’t run queries that scan more data than allowed by their budget.
  • Safeguard against automated script bugs that could cause indefinite query retries.

First, configure data usage controls for your ad-hoc-users workgroup. There are two types of data usage controls: per-query and per-workgroup.

Set the per-query control for analysts to be 1 GB. This control cancels any query run in the ad-hoc-users workgroup that tries to scan more than 1 GB.
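The per-query limit is part of the workgroup configuration, so it can also be set programmatically; a minimal sketch with the AWS SDK for JavaScript might look like this (the limit is expressed in bytes).

// Sketch: set a 1 GB per-query scan limit on the ad-hoc-users workgroup.
const AWS = require('aws-sdk')
const athena = new AWS.Athena({ region: 'us-east-1' })

athena.updateWorkGroup({
  WorkGroup: 'ad-hoc-users',
  ConfigurationUpdates: {
    BytesScannedCutoffPerQuery: 1024 * 1024 * 1024 // 1 GB, in bytes
  }
}, (err) => {
  if (err) console.error(err)
  else console.log('Per-query data usage control set to 1 GB')
})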

To observe this limit in action, choose Update, return to the query editor, and run a query that would scan more than 1 GB. This query triggers the error message, “Query cancelled! : Bytes scanned limit was exceeded”. Remember that you incur charges for data the query scanned up to the point of cancellation. In this case, you incur charges for 1 GB of data.

Now, switch to your reporting workgroup. For this workload, you’re not worried about individual queries scanning too much data. However, you want to control the aggregate amount of data scanned by all queries in this workgroup.

Create a per-workgroup data usage control for the reporting workgroup. You can configure the maximum amount of data scanned by all queries in the workgroup during a specific period.

For the automated reporting workload, you probably have a good idea of how long the process should take and the total amount of data that queries scan during this time. You only have a few reports to run, so you can expect them to run in a few minutes, only scanning a few megabytes of data. Begin by setting up a low watermark alarm to notify you when your queries have scanned more data than you would expect in five minutes. The following example is for demo purposes only. In most cases, this period would be longer. I configured the alarm to send a notification to an Amazon SNS topic that I created.

To validate the alarm, I made a minor change to my test queries, causing them to scan more data. This change triggered the SNS alarm, shown in the following Amazon CloudWatch dashboard:

Next, create a high watermark alarm that is triggered when the queries in your reporting workgroup exceed 1 GB of data over 15 minutes. In this case, the alarm triggers an AWS Lambda function that disables the workgroup, making sure that no additional queries execute in it. This alarm protects you from costs incurred by faulty automation code or runaway queries.

Before creating the data usage control, create a Node.js Lambda function to disable the workgroup. Paste in the following code:

exports.handler = async (event) => {
    const AWS = require('aws-sdk')
    let athena = new AWS.Athena({region: 'us-east-1'})
 
    // The alarm message carries the workgroup name as a metric dimension.
    let msg = JSON.parse(event.Records[0].Sns.Message)
    let wgname = msg.Trigger.Dimensions.filter((i)=>i.name=='WorkGroup')[0].value
    
    // Await the API call so it completes before the handler returns.
    await athena.updateWorkGroup({WorkGroup: wgname, State: 'DISABLED'}).promise()
 
    const response = {
        statusCode: 200,
        body: JSON.stringify(`Workgroup ${wgname} has been disabled`),
    };
    return response;
}

This code grabs the workgroup name from the SNS message body and calls the UpdateWorkGroup API action with the name and the state of DISABLED. The Athena API requires the most recent version of the AWS SDK. When you create the Lambda bundle, include the latest AWS SDK version in that bundle.

Next, create a new SNS topic and a subscription. For Protocol, select AWS Lambda. Then, select the Lambda function that you created in the previous step.
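For reference, the same wiring can be scripted. The sketch below creates the topic, lets SNS invoke the function, and subscribes it; the topic name and the Lambda ARN are placeholders for the resources you created above.

// Sketch: create the SNS topic and subscribe the workgroup-disabling Lambda.
const AWS = require('aws-sdk')
const sns = new AWS.SNS({ region: 'us-east-1' })
const lambda = new AWS.Lambda({ region: 'us-east-1' })

// Placeholder ARN for the Lambda function defined in the previous step.
const functionArn = 'arn:aws:lambda:us-east-1:112233445566:function:disableWorkGroup'

async function wireUpAlarmTopic() {
  const { TopicArn } = await sns.createTopic({ Name: 'reporting-usage-alarm' }).promise()

  // Allow SNS to invoke the Lambda function.
  await lambda.addPermission({
    FunctionName: functionArn,
    StatementId: 'AllowSNSInvoke',
    Action: 'lambda:InvokeFunction',
    Principal: 'sns.amazonaws.com',
    SourceArn: TopicArn
  }).promise()

  // Subscribe the Lambda function to the topic.
  await sns.subscribe({ TopicArn, Protocol: 'lambda', Endpoint: functionArn }).promise()
  return TopicArn
}

wireUpAlarmTopic().then(arn => console.log(`Alarm topic ready: ${arn}`))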

In the Athena console, create the second alarm, 1 GB for 15 min., and point it to the SNS topic that you created earlier. When triggered, this SNS topic calls the Lambda function that disables the reporting workgroup. No more queries can execute in this workgroup. You see this error message in the console when a workgroup is disabled:

Athena exposes other aggregated metrics per workgroup under the AWS/Athena namespace in CloudWatch, such as the query status and the query type (DDL or DML) per workgroup. To learn more, see Monitoring Athena Queries with CloudWatch Metrics.

Cost allocation tags

When you created your ad-hoc-users and reporting workgroups, you added Name and Dept tags. These tags can be used in your Billing and Cost Management console to determine the usage per workgroup.

Summary

In this post, you learned how to use workgroups in Athena to isolate different query workloads, manage access, and define data usage controls to protect yourself from runaway queries. Metrics exposed to CloudWatch help you monitor query performance and make sure that your users are getting the best experience possible. For more details, see Using Workgroups to Control Query Access.

About the Author

Roy Hasson is a Global Business Development Manager for AWS Analytics. He works with customers around the globe to design solutions to meet their data processing, analytics, and business intelligence needs. Roy is a big Manchester United fan, cheering his team on and hanging out with his family.

Konstantin Evteev: Standby in production: scaling application in second largest classified site in the world.


Feed: Planet PostgreSQL.

Konstantin Evteev

Hi. My name is Konstantin Evteev, and I’m a DBA unit leader at Avito. Avito is the biggest Russian classified site, and the second largest classified site in the world (after Craigslist in the USA). Items offered for sale on Avito can be brand new or used. The website also publishes job vacancies and CVs.

Via its web and mobile apps, the platform serves more than 35 million users monthly. They add approximately a million new ads a day and close over 100,000 transactions per day. The back office has accumulated more than a billion ads. According to Yandex, in some Russian cities (for example, in Moscow), Avito is considered a high-load project in terms of page views. Some figures give a better idea of the project’s scale:

  • 600+ servers;
  • 4.5 Gbit/sec TX, 2 Gbit/sec RX without static;
  • about a million queries per minute to the backend;
  • 270TB of images;
  • >20 TB in Postgres on 100 nodes:
  • 7–8K TPS on most nodes;
  • the largest — 20k TPS, 5 TB.

At the same time, these volumes of data need not only to be accumulated and stored but also processed, filtered, classified and made searchable. Therefore, expertise in data processing is critical for our business processes.

The picture below shows the dynamics of pageview growth.

Our decision to store ads in PostgreSQL helps us meet the following scaling challenges: the growth of data volume and of the number of requests to it, the scaling and distribution of the load, the delivery of data to the DWH and the search subsystems, inter-database and inter-system data synchronization, and so on. PostgreSQL is the core component of our architecture: its rich set of features, legendary durability, built-in replication, and archive and reserve tools all find a use in our infrastructure, and a professional community helps us use these features effectively.

In this report, I would like to share Avito’s experience in different cases of standby usage in the following order:

– a few words about standby and its history in general;

– problems and solutions in replication-based horizontal scale-out;

– Avito’s solution for avoiding stale reads from replicas;

– possible pitfalls when using a standby with a high request rate, applying DDL, and receiving WAL files from the archive;

– handling these issues with a technique of using several standbys in production and routing queries between them;

– a logical-replication-based scaling example for comparison;

– conclusions and standby major upgrade features.

Standby can be used for the following purposes.

  • High availability: if the primary crashes, you need a hot reserve to fail over quickly.
  • Scaling: you can route part or all of your read queries to one or more standbys.

In the early 2000s, the PostgreSQL community didn’t think that replication should be a built-in feature. After implementing MVCC, Vadim Mikheev developed a replication solution based on MVCC’s internal mechanics: take the changes that were not visible in a previous snapshot and became visible in the current snapshot. Jan Wieck then developed an enterprise logical replication solution based on Vadim Mikheev’s idea and named it Slony (Russian for “elephants”) as a tribute to the idea’s author. Later, the team from Skype created SkyTools with PgQ (a transactional queue) and Londiste (a logical replication solution based on PgQ). PostgreSQL 8.3 added several txid_*() functions to query transaction IDs, which is useful for various replication solutions. Finally, logical replication became a built-in feature in PostgreSQL 10.

Meanwhile, built-in binary replication was evolving.

  • 2001, PostgreSQL 7.1: the write-ahead log is introduced to improve write performance.
  • 2005, PostgreSQL 8.0: point-in-time recovery makes it possible to write handmade replication scripts and keep a warm reserve.
  • 2008, PostgreSQL 8.3: administration of the previously released warm standby feature (2006) becomes easier. That’s why some still run PostgreSQL 8.3 in production.
  • 2010, PostgreSQL 9.0: since this version a PostgreSQL standby can serve read queries (Hot Standby); moreover, the release of streaming replication simplifies setting up replication and decreases the possible replication lag.
  • 2011, PostgreSQL 9.1: synchronous replication makes it easier to meet SLAs for applications working with critical data (for example, information about financial transactions). Previously, to avoid data loss, you had to wait until your data was successfully replicated to the standby or save the data in two databases manually.
  • 2013, PostgreSQL 9.3: a standby can follow a timeline switch, which makes failover procedures easier.
  • 2014, PostgreSQL 9.4: replication slots make it safer to set up a standby without an archive, but as a result a standby failure can lead to a primary failure. Logical decoding becomes the foundation for built-in logical replication.
  • 2016, PostgreSQL 9.6: multiple synchronous standbys give an opportunity to create more reliable clusters. remote_apply makes it possible to serve read queries from a standby without stale reads.

So we have discussed a few classifications of standby: logical and physical, built-in and standalone solutions, read-only and not only read-only, synchronous and asynchronous.

Now let’s look closer at the physical, built-in, read-only, asynchronous standby.

In streaming replication, three kinds of processes work together. A walsender process on the primary server sends WAL data to a standby server; then a walreceiver and a startup process on the standby server receive and replay this data. A walsender and a walreceiver communicate using a single TCP connection.

When you use asynchronous standby, it can fall behind the primary. That’s why different rules for routing read queries to the primary or to the standby match different types of applications.

  1. The first approach: don’t use any routing technique, just route every read query to the standby. For some applications this works well, for reasons such as:
  • low replication lag;
  • a query profile with few or no errors caused by stale reads;
  • and so on.

But I don’t like this approach; let’s look at better techniques for routing read queries to the standby.

2. Sometimes specific business logic allows you to exploit stale reads. For example, a few years ago on Avito there was a 30-minute interval before users could see new ads (time for anti-fraud checks and other restrictions). We successfully used this business rule to route read queries for other people’s ads to a standby, while routing read queries for your own ads to the primary. Thus, you may edit your own ad and work with its actual state. To make this routing approach work correctly, we need to keep replication lag below the 30-minute interval.

3. There can be situations when stale reads are not acceptable, and it gets even more interesting when you need several standbys to achieve a greater level of scale. To deal with these situations you need special techniques. Let’s look closer at our example and the technique we use to avoid stale reads.

We successfully used the logical routing technique (number 2: based on business-specific logic) when we faced the following spikes. The picture shows a load-average diagram for a 28-physical-core machine (2x Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz). When we started using Hyper-Threading, we benchmarked and found that most of our queries scaled well with it enabled. But then the distribution of queries changed, and the result can be seen in the diagram: when the load average got close to the number of physical cores, the spikes appeared.

The same spikes were observed on the CPU diagram below. On my side, I saw all queries starting to take dozens of times longer to execute.

So I ran into the capacity problem of the primary server. The capacity of one server is bounded by the most powerful offering on the market; if that does not fit your workload, you have a capacity problem. Changing the architecture of your application is normally not fast; it can be complicated and take a lot of time. Under these circumstances I had to find a quick solution. As you can see, the spikes are gone: we coped with the capacity problem by switching more read queries to the standby (the number of queries on the standby is shown on the following TPS diagram).

We distributed load among more nodes in the system by re-routing queries that don’t need to run on the primary. The main challenges in spreading read queries to replicas are stale reads and race conditions.

The links give a detailed description of the technique of routing reads based on replica WAL position. The main idea is to store the last committed LSN for the entity on which a mutating request was executed. Then, when we subsequently want to fulfill a read operation for the same entity, we check which replicas have replayed to that point or beyond it and randomly select one from the pool. If no replica is sufficiently advanced (i.e., a read operation is being run very soon after the initial write), we fall back to the master. Stale reads become impossible regardless of the state of any given replica.
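As a rough illustration of this LSN-based technique (not the code Avito runs, which, as explained below, uses timestamps instead), a minimal Node.js sketch with the node-postgres driver could look like the following. It assumes PostgreSQL 10+ function names and a hypothetical sessionStore that keeps the last-write LSN per entity.

// Minimal sketch of LSN-based read routing (assumes PostgreSQL 10+).
// `primary` and `replicas` are node-postgres Pools; `sessionStore` is a
// hypothetical key-value store holding the last-write LSN per entity.
const { Pool } = require('pg')

async function recordWrite(primary, sessionStore, entityId) {
  // Simplified: ideally the LSN is captured right after the write commits.
  const { rows } = await primary.query('SELECT pg_current_wal_lsn() AS lsn')
  await sessionStore.set(entityId, rows[0].lsn)
}

async function pickPoolForRead(primary, replicas, sessionStore, entityId) {
  const requiredLsn = await sessionStore.get(entityId)
  if (!requiredLsn) return replicas[Math.floor(Math.random() * replicas.length)]

  for (const replica of replicas) {
    // A replica is safe to read from once it has replayed past the stored LSN.
    const { rows } = await replica.query(
      'SELECT pg_wal_lsn_diff(pg_last_wal_replay_lsn(), $1) >= 0 AS caught_up',
      [requiredLsn]
    )
    if (rows[0].caught_up) return replica
  }
  // No replica is sufficiently advanced: fall back to the primary.
  return primary
}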

We simplified the technique above. We didn’t use exactly the same approach because we didn’t have enough time and resources to rewrite the application to implement sticky connections based on LSN tracking.

At the time we were facing the capacity problem of the primary server, we had a huge monolithic application (it was in the process of being split into a microservice architecture). It had a lot of complicated logic and deeply nested calls, which made a migration to sticky LSN-based sessions unreachable in the short term. On the other hand, we had some historical background: the user and item tables (our main entities) had a timestamp column that we filled with the PostgreSQL function now(), i.e. the current date and time at the start of the current transaction. So we decided to use a timestamp instead of an LSN. You probably know that timestamps can’t strictly be used to serialize data operations (or to track the position of a replica in relation to the primary) in PostgreSQL. To make this assumption safe we relied on the following arguments:

  • we didn’t have long transactions (all of them lasted several milliseconds);
  • we routed to the primary all read queries for an entity on which a mutating request had been executed within the last 7.5 minutes;
  • we routed read queries only to replicas that did not fall more than 5 minutes behind the primary.

Eventually, we came up with the following solution (we called it Avito Smart Cache). We replaced our Redis-based cache with one implemented in Tarantool. (Tarantool is a powerful, fast data platform that comes with an in-memory database and an integrated application server.)

We made two levels of cache. The first level, the hot cache, stores sticky sessions. Sticky connections are supported by storing the timestamp of the user’s last data mutation. Each user is given their own key so that the actions of one user do not affect all other users.

Let’s look closer at the approach we used to deliver the user’s timestamp information to Tarantool. We made some changes in our framework for working with the database: we added an «on commit hook». In this «hook» we send the data mutation’s timestamp to the first level of the cache. You may notice that it is possible to successfully commit changes in the PostgreSQL database but fail to deliver this information to the cache (for example, because of a network problem, an application error, or something else). To deal with such cases we built the Listener. All mutations on the database side are sent to the Listener using the PostgreSQL LISTEN and NOTIFY features. The Listener also tracks pg_last_xact_replay_timestamp() of the standby and exposes an HTTP API from which Tarantool takes this information.

The second level is the cache for our read data to minimize utilization of PostgreSQL resources (replacing the old cache in Redis).

Some special features:

  • In the schema there are 16 Tarantool shards, which helps with resilience: when one shard crashes, PostgreSQL continues processing 1/16 of the queries without the help of the cache.
  • Decreasing the time interval used for routing queries increases the number of queries sent to the standby. Meanwhile, it also increases the probability that the standby’s lag exceeds the routing interval. If this happens, all queries are sent to the primary server, and it’s highly likely that the primary won’t be able to process all these requests. This leads to one more observation: when you use two nodes in production (primary and standby), you need at least a third node (a second standby) for reserve purposes, and there should be a disaster recovery plan for all parts of this complex infrastructure.

The main logic of working with smart cache is as follows.

We try to find the data in the main level of the cache; if we find it, we are done. If we do not, we check the first-level hot cache. If we find an entry in the hot cache, it means there were recent changes to our data and we should read from the primary.

If the standby’s lag is greater than the routing interval, we should also read from the primary.

Otherwise, the data hasn’t been changed for a long time and we can route the read request to the standby.
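A schematic version of this decision, with illustrative names and the intervals mentioned earlier, might look like this:

// Sketch of the smart-cache routing decision (names and values illustrative).
function chooseDataSource({ foundInMainCache, lastMutationTs, standbyLagSec }) {
  const ROUTING_INTERVAL_SEC = 7.5 * 60 // route recently mutated entities to the primary
  const MAX_STANDBY_LAG_SEC = 5 * 60    // maximum tolerated replication lag

  if (foundInMainCache) return 'cache'            // second-level cache hit

  const ageSec = (Date.now() / 1000) - (lastMutationTs || 0)
  if (lastMutationTs && ageSec < ROUTING_INTERVAL_SEC) return 'primary' // hot-cache hit
  if (standbyLagSec > MAX_STANDBY_LAG_SEC) return 'primary'             // standby too far behind

  return 'standby'                                // data is old enough to read from standby
}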

You may notice different TTL values: they differ because there is a probability of races, and in cases where the probability is greater I use a smaller TTL to minimize losses.

Thus we successfully implemented one of the techniques for routing queries and built an eventually synchronized cache. But on the graph below you may notice spikes. These spikes appeared when the standby started falling behind for reasons such as various locks or a specific profile of hardware resource utilization. In the following part of my report, I would like to share Avito’s experience in solving those issues on the standbys.

Step 1:

The Infrastructure for this case is as follows:

  • primary (master);
  • standby;
  • two tables: the items and the options.

Step 2

Open transaction on the primary and alter the options table.

Step 3

On the standby execute the query to get data from the items table.

Step 4

On the primary side alter the items table.

Step 5

Execute the query to get data from the items table.

In PostgreSQL versions below 10 this produces a deadlock that is not detected. Such deadlocks have been successfully detected by the deadlock detector since PostgreSQL 10.

The infrastructure for this case is the same.

How do you apply DDL changes to a table that receives thousands of read requests per second? Of course, with the help of the statement_timeout setting (abort any statement that takes more than the specified number of milliseconds, starting from the time the command arrives at the server from the client).

Another setting you should use when applying DDL is deadlock_timeout. This is the amount of time, in milliseconds, to wait on a lock before checking to see if there is a deadlock condition. The check for deadlock is relatively expensive, so the server doesn’t run it every time it waits for a lock. We optimistically assume that deadlocks are not common in production applications and just wait on the lock for a while before checking for a deadlock. Increasing this value reduces the amount of time wasted in needless deadlock checks, but slows down reporting of real deadlock errors. The default is one second (1s), which is probably about the smallest value you would want in practice. On a heavily loaded server you might want to raise it. Ideally the setting should exceed your typical transaction time, so as to improve the odds that a lock will be released before the waiter decides to check for deadlock.

The deadlock_timeout setting also controls when a conflicting autovacuum is canceled; its default 1-second value then effectively becomes the duration of the lock while applying DDL, and that can lead to trouble in systems with thousands of requests per second.

There are some specifics in how the statement_timeout setting is scoped and when changes take effect (for example, changing statement_timeout inside a stored procedure has no effect on the current procedure call).

With the help of the settings above we repeatedly try to apply changes to the structure of our tables with a short lock (a few dozen milliseconds). Sometimes we can’t take the lock within dozens of milliseconds during the daytime, so we try again at night, when the traffic rate is lower. Usually we manage to do it; otherwise we increase the values of those settings (the «time window» for getting the lock).
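A minimal sketch of this retry loop, using the node-postgres driver and an illustrative ALTER TABLE, could look like the following; the table and column names are placeholders, and lock_timeout could be used instead of statement_timeout on modern versions.

// Sketch: retry an ALTER TABLE with a short statement_timeout so each
// attempt gives up quickly instead of blocking thousands of readers.
const { Client } = require('pg')

async function alterWithShortTimeout(connectionString, retries = 100) {
  const client = new Client({ connectionString })
  await client.connect()
  try {
    for (let attempt = 1; attempt <= retries; attempt++) {
      try {
        await client.query('BEGIN')
        // Abort the attempt if it cannot finish (including lock wait) in ~50 ms.
        await client.query("SET LOCAL statement_timeout = '50ms'")
        await client.query('ALTER TABLE items ADD COLUMN IF NOT EXISTS note text')
        await client.query('COMMIT')
        return attempt
      } catch (err) {
        await client.query('ROLLBACK')
        // 57014 = query_canceled (statement timeout); anything else is fatal.
        if (err.code !== '57014') throw err
        await new Promise(r => setTimeout(r, 1000)) // wait before retrying
      }
    }
    throw new Error('Could not acquire the lock; try again during a quieter period')
  } finally {
    await client.end()
  }
}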

After the DDL statement executes with timeouts on the primary, the changes are replicated to the standby. The timeouts themselves aren’t replicated, which is why there may be lock conflicts between the WAL replay process and read queries on the standby.

2018-01-12 16:54:40.208 MSK pid=20949,user=user_3,db=test,host=127.0.0.1:55763 LOG: process 20949 still waiting for AccessShareLock on relation 10000 of database 9000 after 5000.055 ms
2018-01-12 16:54:40.208 MSK pid=20949,user=user_3,db=test,host=127.0.0.1:55763 DETAIL: Process holding the lock: 46091. Wait queue: 18639, 20949, 53445, 20770, 10799, 47217, 37659, 6727, 37662, 25742, 20771,

We implemented a workaround to deal with the issue above. We use a proxy to route queries to one of two standbys, and when we want to apply a DDL command on the primary, we pause WAL replay on the standby where the queries are currently executed. Then:

  • Wait till the ALTER command is replayed on the second standby.
  • Then switch traffic to the second standby (which has successfully replayed the WAL with DDL statement).
  • Start the replication on the first standby and wait till the ALTER command has been replayed on it.

  • Return the first standby to the pool of active standbys.

Vacuum can truncate the end of a data file, and an exclusive lock is needed for this action. At that moment, long lock conflicts between read-only queries and the recovery process may occur on the standby. This happens because some unlock actions are not written to the WAL. The example below shows a few AccessExclusive locks in one xid, 920764961, and not a single unlock… The unlock happens much later, when the standby replays the corresponding commit record.

The solution to that issue can be as follows.

In our case (when we faced the problem above) we had a kind of log table, almost append-only: once a week we executed a delete query with a timestamp condition to clean records older than two weeks. This specific workload allows the autovacuum process to truncate the end of the data files. At the same time we actively use the standby for read queries; the consequences can be seen above. Eventually we solved that issue by executing the cleaning queries more frequently: every hour.

At Avito we actively use standbys with replication set up via a WAL archive. This choice was made so that we can recreate any crashed node (primary or standby) from the archive and then bring it to a consistent state in relation to the current primary.

In 2015, we faced the following situation: some of our services started generating too many WAL files. As a result our archive command hit its performance limits; you can see the spikes of WAL files pending shipment to the archive on the graph below.

The green arrow points to the area where we solved this issue: we implemented a multithreaded archive command.

You can take a deeper dive into the implementation by following the link. In short, the logic is: if the number of ready WAL files is lower than a threshold value, archive with one thread; otherwise switch to the parallel archive technique.

But then we faced the problem of CPU utilization on the standby. At first it was not obvious that, for our load profile, we were monitoring CPU utilization in the wrong way. Reading /proc/stat values gives just a snapshot of the current CPU utilization; archiving one WAL file in our case takes 60 ms, and on such snapshots we don’t see the whole picture. The details could be seen with the help of counter measurements.

For the above reasons it was a challenge to find the true answer why the standby was falling behind the primary (almost all of the standby’s CPU resources were consumed by archiving and unarchiving WAL files). The green arrow on the diagram above shows the result of the optimization of standby CPU usage. This optimization came with a new archive schema, which delegates execution of the archive command to the archive server.

First let’s look under the hood of the old archive schema. The archive is mounted to the standby with the help of NFS. The primary sends the WAL files to the archive through the standby (the same standby that serves users’ read queries). The archive command includes a compression step, which is carried out by the CPU of the standby server.

New archive schema:

  • The compression step is carried out by the archive server’s CPU.
  • The archive is not a single point of failure. WAL is normally archived to two archives, or at least to one (in case the other has crashed or become unavailable). Running a WAL file synchronization procedure after each backup is how we deal with temporary unavailability of one of the two archive servers.
  • With the help of the new archive schema we expanded the interval for PITR and the number of backups. We use the archives in turn, for example: 1st backup on the 1st archive, 2nd backup on the 2nd archive, 3rd backup on the 1st archive, 4th backup on the 2nd archive.

Both archive solutions were ideas of Michael Tyurin, further developed by Sergey Burladyan and implemented by Victor Yagofarov; you can get it via the link. The solutions above were made for PostgreSQL versions where pg_rewind and pg_receivexlog were not yet available. With their help it is easier to deal with crashes in a distributed PostgreSQL infrastructure based on asynchronous replication. You can see an example of a crash of the primary in the diagram below, and there are several variants:

  • some changes have been transferred neither to the standby nor to the archive;
  • all the data has been transferred either to the standby or to the archive. In other words, the archive and the standby are in different states.

So there can be different cases with many «IFs» in your Disaster Recovery Plan. On the other hand, when something goes wrong it is harder to concentrate and do everything right. That’s why the DRP must be clear to the responsible staff and be automated. There is a really beneficial report where you can find a good overview and description of the tooling and techniques for maintaining a warm standby.

Besides, I want to share one configuration example of PostgreSQL infrastructure from my experience. Even when using synchronous replication for the standby and the archive, it may happen that the standby and the archive are in different states after a crash of the primary. Synchronous replication works in the following manner: on the primary, changes are committed, but there is a data structure in memory (not written to the WAL) that indicates that clients can’t yet see these changes. That indicator is kept until the synchronous replicas acknowledge that they have successfully replayed the changes.

The possible lag of a standby and the other problems described above made us use a pool of standbys in Avito’s infrastructure.

With the help of HAProxy and a check function, we alternate between a few standbys. The idea is that when a standby’s lag is greater than the upper border, we close it to users’ queries until the lag falls below the lower border.

if master
    then false
if lag > max
    then create file and return false
if lag > min and file exists
    then return false
if lag < min and file exists
    then remove file and return true
else
    true

To implement this logic we need to write a record on the standby side (which is read-only), so it can be implemented with the help of a foreign data wrapper or untrusted languages (PL/Perl, PL/Python, etc.).
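An alternative way to implement the same check, avoiding writes on the standby entirely, is a small external health-check endpoint that HAProxy polls (option httpchk) and that keeps the hysteresis state in memory instead of in a file. A hedged Node.js sketch, assuming PostgreSQL 9.1+ functions and placeholder thresholds:

// Sketch of an HTTP health check implementing the logic above.
const http = require('http')
const { Pool } = require('pg')

const pool = new Pool({ /* connection settings for the local standby */ })
const MAX_LAG_SEC = 60   // upper border: close the standby to user queries
const MIN_LAG_SEC = 10   // lower border: reopen the standby
let closed = false

async function standbyIsUsable() {
  const { rows } = await pool.query(
    `SELECT pg_is_in_recovery() AS in_recovery,
            extract(epoch FROM (now() - pg_last_xact_replay_timestamp())) AS lag`)
  if (!rows[0].in_recovery) return false            // this node is the primary
  // lag is null until the standby has replayed at least one transaction
  const lag = rows[0].lag === null ? 0 : Number(rows[0].lag)
  if (lag > MAX_LAG_SEC) closed = true              // lag above the upper border
  else if (lag < MIN_LAG_SEC) closed = false        // lag back below the lower border
  return !closed
}

http.createServer(async (req, res) => {
  try {
    res.writeHead(await standbyIsUsable() ? 200 : 503)
  } catch (e) {
    res.writeHead(503)                              // fail closed on errors
  }
  res.end()
}).listen(8008)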

It is outside the current topic, but I want to note that logical replication exists as an alternative to binary replication. With its help, the following issues at Avito are successfully solved: the growth of data volume and of the number of requests to it, the scaling and distribution of the load, the delivery of data to the DWH and to the search subsystems, inter-database and inter-system data synchronization, etc. Michael Tyurin (ex chief architect) is the author of the majority of the core solutions mentioned above, implemented with the help of SkyTools.

Both kinds of replication solutions have their strengths and weaknesses. Since PostgreSQL 10, logical replication has been a built-in feature.

One successful example of logical replication usage at Avito is the organization of the search results preview (from 2009 to the 1st quarter of 2019). The idea is to build a trigger-based materialized view (www.pgcon.org/2008/schedule/attachments/64_BSDCan2008-MaterializedViews-paper.pdf, www.pgcon.org/2008/schedule/attachments/63_BSDCan2008-MaterializedViews.pdf) and then replicate it to another machine for read queries.

This machine is much cheaper than the main server, and as a result we serve all our search results from this node. On the graph below you can find information about TPS and CPU usage: it is a very effective pattern for scaling read load.

There are a lot more things to discuss, such as reserve for a logical replica, unlogged tables to increase performance, pros and cons, and so on. But that is a completely different story, and I will tell it next time. The point I want to highlight is that logical replication can be very useful in many cases. It doesn’t have the drawbacks of streaming replication, but it has its own operational challenges. As a result there is no «silver bullet»; knowledge of both types of replication helps us make the right choice in each specific case.

There are a few types of standby, and each fits a specific case. When I started preparing this report, I intended to draw the following conclusion: there are many operational challenges, which is why it is better to scale your application with sharding, while a standby handles these tasks without extra effort:

  • hot reserve;
  • analytical queries;
  • read queries when stale reads are not a problem.

Moreover, I was going to stop using binary standbys for read queries in Avito production. But circumstances have changed, and we are going to build a multi-datacenter architecture. Each service should be deployed in at least three datacenters. This is what the deployment scheme looks like: a group of instances of each microservice and one node of its database (primary or standby) should be located in each datacenter. Reads will be served locally, whereas writes will be routed to the primary.

So circumstances push us to keep using standbys in production.

Above, I have described several approaches and techniques for serving reads from a standby. There are some pitfalls, and a few of them have been found and successfully solved by us. Built-in solutions for the pitfalls mentioned in this report could make the standby operating experience better, for example, some kind of hint to write deadlock and statement timeouts to the WAL. If you are aware of other pitfalls, please share that knowledge with us and the community.

Another crucial point is to handle backups and reserve correctly. Create your disaster recovery plan and keep it up to date with regular drills. Moreover, to minimize the downtime period, automate your recovery. One more aspect of reserve and standby: if you use a standby in production to execute application queries, you should have another one for reserve purposes.

Suppose you have to do a major upgrade and can’t afford long downtime. With a primary and a standby in production, there is a high probability that one node alone can’t serve all queries (so after the upgrade you should also have two nodes). To do this quickly you can either use logical replication or upgrade the master and the standby with pg_upgrade. At Avito we use pg_upgrade. From the documentation it is not obvious how to do it right if your standby receives WAL files from the archive. If you use log-shipping standby servers (without streaming), the last WAL file, in which the shutdown checkpoint record is written, won’t be archived. To catch the standby servers up you need to copy that last WAL file from the primary to the standby servers and wait until it is applied. After that the standby servers can issue a restart point at the same location as in the stopped master. Alternatively, you can switch.

One more note from my upgrade experience.

In the production config, vacuum_defer_cleanup_age = 900000. With this setting pg_upgrade cannot freeze pg_catalog in the new cluster (it executes ‘vacuumdb --all --freeze’) while performing the upgrade, and the upgrade fails:

Performing Upgrade
------------------
Analyzing all rows in the new cluster                       ok
Freezing all rows on the new cluster                        ok
Deleting files from new pg_clog                             ok
Copying old pg_clog to new server                           ok
Setting next transaction ID and epoch for new cluster       ok
Deleting files from new pg_multixact/offsets                ok
Copying old pg_multixact/offsets to new server              ok
Deleting files from new pg_multixact/members                ok
Copying old pg_multixact/members to new server              ok
Setting next multixact ID and offset for new cluster        ok
Resetting WAL archives                                      ok
connection to database failed: FATAL: database "template1" does not exist
could not connect to new postmaster started with the command:
"/home/test/inst/pg9.6/bin/pg_ctl" -w -l "pg_upgrade_server.log" -D "new/" -o "-p 50432 -b -c synchronous_commit=off -c fsync=off -c full_page_writes=off -c listen_addresses='' -c unix_socket_permissions=0700 -c unix_socket_directories='/home/test/tmp/u'" start
Failure, exiting

#!/bin/bash
PGOLD=~/inst/pg9.4/bin
PGNEW=~/inst/pg9.6/bin
${PGOLD}/pg_ctl init -s -D old -o "--lc-messages=C -T pg_catalog.english"
${PGNEW}/pg_ctl init -s -D new -o "--lc-messages=C -T pg_catalog.english"
echo vacuum_defer_cleanup_age=10000 >> new/postgresql.auto.conf
# move txid to 3 000 000 000 in old cluster as in production
${PGOLD}/pg_ctl start -w -D old -o "--port=54321 --unix_socket_directories=/tmp"
${PGOLD}/vacuumdb -h /tmp -p 54321 --all --analyze
${PGOLD}/vacuumdb -h /tmp -p 54321 --all --freeze
${PGOLD}/pg_ctl stop -D old -m smart
#${PGOLD}/pg_resetxlog -x 3000000000 old
dd if=/dev/zero of=old/pg_clog/0B2D bs=262144 count=1
# # move txid in new cluster bigger than vacuum_defer_cleanup_age may fix problem
# ${PGNEW}/pg_ctl start -w -D new -o "--port=54321 --unix_socket_directories=/tmp"
# echo "select txid_current();" | ${PGNEW}/pgbench -h /tmp -p 54321 -n -P 5 -t 100000 -f- postgres
# ${PGNEW}/pg_ctl stop -D new -m smart
${PGNEW}/pg_upgrade -k -d old/ -D new/ -b ${PGOLD}/ -B ${PGNEW}/
# rm -r new old pg_upgrade_* analyze_new_cluster.sh delete_old_cluster.sh

I did not find any prohibition in the documentation on using the production config with pg_upgrade; maybe I am wrong and this is already mentioned in the documentation.

My colleagues and I hope to see improvements in replication in future PostgreSQL versions. Both replication approaches in PostgreSQL are a great achievement; it is very simple today to set up replication with high performance. This article was written to share Avito’s experience and highlight the main demands of the Russian PostgreSQL community (I suppose that members of the PostgreSQL community worldwide face the same problems).

Separate queries and managing costs using Amazon Athena workgroups

$
0
0

Feed: AWS Big Data Blog.

Amazon Athena is a serverless query engine for data on Amazon S3. Many customers use Athena to query application and service logs, schedule automated reports, and integrate with their applications, enabling new analytics-based capabilities.

Different types of users rely on Athena, including business analysts, data scientists, security, and operations engineers. But how do you separate and manage these workloads so that users get the best experience while minimizing costs?

In this post, I show you how to use workgroups to do the following:

  • Separate workloads.
  • Control user access.
  • Manage query usage and costs.

Separate workloads

By default, all Athena queries execute in the primary workgroup.  As an administrator, you can create new workgroups to separate different types of workloads.  Administrators commonly turn to workgroups to separate analysts running ad hoc queries from automated reports.  Here’s how to build out that separation.

First create two workgroups, one for ad hoc users (ad-hoc-users) and another for automated reports (reporting).

Next, select a specific output location. All queries executed inside this workgroup save their results to this output location. Routing results to a single secure location helps make sure users only access data they are permitted to see. You can also enforce encryption of query results in S3 by selecting the appropriate encryption configuration.

Workgroups also help you simplify the onboarding of new users to Athena. By selecting override client-side settings, you enforce a predefined configuration on all queries within a workgroup. Users no longer have to configure a query results output location or S3 encryption keys. These settings default to the parameters defined for the workgroup where those queries execute. Additionally, each workgroup maintains a unique query history and saved query inventory, making queries easier for you to track down.

Finally, when creating a workgroup, you can add up to 50 key-value pair tags to help identify your workgroup resources. Tags are also useful when attempting to allocate Athena costs between groups of users. Create Name and Dept tags for the ad-hoc-users and reporting workgroups with their name and department association.

Control user access to workgroups

Now that you have two workgroups defined, ad-hoc-users and reporting, you must control who can use and update them.  Remember that workgroups are IAM resources and therefore have an ARN. You can use this ARN in the IAM policy that you associate with your users.  In this example, create a single IAM user representing the team of ad hoc users and add the individual to an IAM group. The group contains a policy that enforces what actions these users can perform.

Start by reviewing IAM Policies for Accessing Workgroups and Workgroup Example Policies to familiarize yourself with policy options. Use the following IAM policy to set up permissions for your analyst user. Grant this user only the permissions required for working in the ad-hoc-users workgroup. Make sure that you tweak this policy to match your exact needs:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "athena:ListWorkGroups"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "athena:StartQueryExecution",
                "athena:GetQueryResults",
                "athena:DeleteNamedQuery",
                "athena:GetNamedQuery",
                "athena:ListQueryExecutions",
                "athena:StopQueryExecution",
                "athena:GetQueryResultsStream",
                "athena:GetQueryExecutions",
                "athena:ListNamedQueries",
                "athena:CreateNamedQuery",
                "athena:GetQueryExecution",
                "athena:BatchGetNamedQuery",
                "athena:BatchGetQueryExecution",
                "athena:GetWorkGroup",
                "athena:ListTagsForResource"
            ],
            "Resource": "arn:aws:athena:us-east-1:112233445566:workgroup/ad-hoc-users"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObjectAcl",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:ListMultipartUploadParts"
            ],
            "Resource": "arn:aws:s3:::demo/workgroups/adhocusers/*"
        },
{
            "Effect": "Allow",
            "Action": [
                "glue:Get*"
            ],
            "Resource": [
                "arn:aws:glue:us-east-1:112233445566:catalog",
                "arn:aws:glue:us-east-1:112233445566:database/amazon",
                "arn:aws:glue:us-east-1:112233445566:table/amazon/*"
            ]
        }
    ]
}

Now your analyst user can execute queries only in the ad-hoc-users workgroup. The analyst user can switch to other workgroups, but they lose access when they try to perform any action. They are further restricted to list and query only those tables that belong to the Amazon database. For more information about controlling access to AWS Glue resources such as databases and tables, see AWS Glue Resource Policies for Access Control.

The following screenshot shows what the analyst user sees in the Athena console:

I’ve created a simple Node.js tool that executes SQL queries stored as files in a given directory. You can find my Athena test runner code in the athena_test_runner GitHub repo. You can use this code to simulate a reporting tool, after configuring it to use a workgroup. To do that, create an IAM role with permissions like those previously defined for the analyst user. This time, restrict access to the reporting workgroup.

The following JavaScript code example shows how to select a workgroup programmatically when executing queries:

function executeQueries(files) {
    params = 
    {
      "QueryString": "", 
      "ResultConfiguration": { 
        "OutputLocation": ""
      },
      "QueryExecutionContext": {
        "Database": "default"
      },
      "WorkGroup":"reporting"
    }
 
    params.QueryString = "SELECT * FROM amazon.final_parquet LIMIT 10"
    return new Promise((resolve, reject) => {
        athena.startQueryExecution(params, (err, results) => {
            if (err) {
                reject(err.message)
            } else {
                resolve(results)
            }
        })
    })
}

Run sample automated reports under the reporting workgroup, with the following command:

node index.js testsuite/

Query histories remain isolated between workgroups. A user logging into the Athena console as an analyst using the ad-hoc-users workgroup doesn’t see any automated reports that you ran under the reporting workgroup.

Managing query usage and cost

You have two workgroups configured: one for ad hoc users and another for automated reports. Now, you must safeguard against bad queries. In this use case, two potential situations for query usage should be monitored and controlled:

  • Make sure that users don’t run queries that scan more data than allowed by their budget.
  • Safeguard against automated script bugs that could cause indefinite query retirement.

First, configure data usage controls for your ad-hoc-users workgroup. There are two types of data usage controls: per-query and per-workgroup.

Set the per-query control for analysts to be 1 GB. This control cancels any query run in the ad-hoc-users workgroup that tries to scan more than 1 GB.

To observe this limit in action, choose Update, return to the query editor, and run a query that would scan more than 1 GB. This query triggers the error message, “Query cancelled! : Bytes scanned limit was exceeded”. Remember that you incur charges for data the query scanned up to the point of cancellation. In this case, you incur charges for 1 GB of data.

Now, switch to your reporting workgroup. For this workload, you’re not worried about individual queries scanning too much data. However, you want to control the aggregate amount of data scanned of all queries in this workgroup.

Create a per-workgroup data usage control for the reporting workgroup. You can configure the maximum amount of data scanned by all queries in the workgroup during a specific period.

For the automated reporting workload, you probably have a good idea of how long the process should take and the total amount of data that queries scan during this time. You only have a few reports to run, so you can expect them to run in a few minutes, only scanning a few megabytes of data. Begin by setting up a low watermark alarm to notify you when your queries have scanned more data than you would expect in five minutes. The following example is for demo purposes only. In most cases, this period would be longer. I configured the alarm to send a notification to an Amazon SNS topic that I created.

To validate the alarm, I made a minor change to my test queries, causing them to scan more data. This change triggered the SNS alarm, shown in the following Amazon CloudWatch dashboard:

Next, create a high watermark alarm that is triggered when the queries in your reporting workgroup exceed 1 GB of data over 15 minutes. In this case, the alarm triggers an AWS Lambda function that disables the workgroup, making sure that no additional queries execute in it. This alarm protects you from the costs of faulty automation code or runaway queries.

Before creating the data usage control, create a Node.js Lambda function to disable the workgroup. Paste in the following code:

exports.handler = async (event) => {
    const AWS = require('aws-sdk')
    let athena = new AWS.Athena({region: 'us-east-1'})

    // The SNS message body carries the CloudWatch alarm that fired;
    // pull the workgroup name out of the alarm's metric dimensions.
    let msg = JSON.parse(event.Records[0].Sns.Message)
    let wgname = msg.Trigger.Dimensions.filter((i) => i.name == 'WorkGroup')[0].value

    // Await the call so the workgroup is actually disabled before the function returns.
    await athena.updateWorkGroup({WorkGroup: wgname, State: 'DISABLED'}).promise()

    const response = {
        statusCode: 200,
        body: JSON.stringify(`Workgroup ${wgname} has been disabled`),
    };
    return response;
}

This code grabs the workgroup name from the SNS message body and calls the UpdateWorkGroup API action with the name and the state of DISABLED. The Athena API requires the most recent version of the AWS SDK. When you create the Lambda bundle, include the latest AWS SDK version in that bundle.

Next, create a new SNS topic and a subscription. For Protocol, select AWS Lambda. Then, select the Lambda function that you created in the previous step.

In the Athena console, create the second alarm, 1 GB for 15 min., and point it to the SNS topic that you created earlier. When triggered, this SNS topic calls the Lambda function that disables the reporting workgroup. No more queries can execute in this workgroup. You see this error message in the console when a workgroup is disabled:

Athena exposes other aggregated metrics per workgroup under the AWS/Athena namespace in CloudWatch, such as the query status and the query type (DDL or DML) per workgroup. To learn more, see Monitoring Athena Queries with CloudWatch Metrics.

Cost allocation tags

When you created your ad-hoc-users and reporting workgroups, you added Name and Dept tags. These tags can be used in your Billing and Cost Management console to determine the usage per workgroup.

Summary

In this post, you learned how to use workgroups in Athena to isolate different query workloads, manage access, and define data usage controls to protect yourself from runaway queries. Metrics exposed to CloudWatch help you monitor query performance and make sure that your users are getting the best experience possible. For more details, see Using Workgroups to Control Query Access.

About the Author

Roy Hasson is a Global Business Development Manager for AWS Analytics. He works with customers around the globe to design solutions to meet their data processing, analytics and business intelligence needs. Roy is a big Manchester United fan, cheering his team on and hanging out with his family.

Bringing your stored procedures to Amazon Redshift

$
0
0

Feed: AWS Big Data Blog.

Amazon always works backwards from the customer’s needs. Customers have made strong requests that they want stored procedures in Amazon Redshift, to make it easier to migrate their existing workloads from legacy, on-premises data warehouses.

With that primary goal in mind, AWS chose to implement PL/pgSQL stored procedures to maximize compatibility with existing procedures and simplify migrations. In this post, we discuss how and where to use stored procedures to improve operational efficiency and security. We also explain how to use stored procedures with the AWS Schema Conversion Tool.

What is a stored procedure?

A stored procedure is a user-created object to perform a set of SQL queries and logical operations. The procedure is stored in the database and is available to users who have sufficient privileges to run it.

Unlike a user-defined function (UDF), a stored procedure can incorporate data definition language (DDL) and data manipulation language (DML) in addition to SELECT queries. A stored procedure doesn’t have to return a value. You can use the PL/pgSQL procedural language, including looping and conditional expressions, to control logical flow.

Stored procedures are commonly used to encapsulate logic for data transformation, data validation, and business-specific operations. By combining multiple SQL steps into a stored procedure, you can reduce round trips between your applications and the database.

You can also use stored procedures for delegated access control. For example, you can create stored procedures to perform functions without giving a user access to the underlying tables.

Why would you use stored procedures?

Many customers migrating to Amazon Redshift have complex data warehouse processing pipelines built with stored procedures on their legacy data warehouse platform. Complex transformations and important aggregations are defined with stored procedures and reused in many parts of their processing. Re-creating the logic of these processes using an external programming language or a new ETL platform could be a large project. Using Amazon Redshift stored procedures allows you to migrate to Amazon Redshift more quickly.

Other customers would like to tighten security and limit the permissions of their database users. Stored procedures offer new options to allow DBAs to perform necessary actions without having to give permissions too widely. With the security definer concepts in stored procedures, it is now possible to allow users to perform actions they otherwise would not have permissions to run.

Additionally, using stored procedures in this way helps reduce the operations burden. An experienced DBA is able to define a well-tested process for some administrative or maintenance action. They can then allow other, less experienced operators to execute the process without entrusting them with full superuser permissions on the cluster.

Finally, some customers prefer using stored procedures to manage their ETL/ELT operations as an alternative to shell scripting or complex orchestration tools. It can be difficult to ensure that shell scripts correctly retrieve and interpret the state of each operation in an ETL/ELT process. It can also be challenging to take on the operation and maintenance of an orchestration tool with a small data warehouse team.

Stored procedures allow the ETL/ELT logical steps to be fully enclosed in a master procedure that is written so that it either succeeds completely or fails cleanly with no side effects. The stored procedure can be called with confidence from a simple scheduler like cron.
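
As a rough sketch (the sub-procedure names here are hypothetical, not from this post), such a master procedure can simply chain the individual steps so that a scheduler only ever calls one entry point:

CREATE OR REPLACE PROCEDURE sp_nightly_etl()
AS $$
BEGIN
  -- each step is a separately tested stored procedure
  CALL sp_load_staging();
  CALL sp_transform_facts();
  CALL sp_refresh_aggregates();
END;
$$ LANGUAGE plpgsql;

-- invoked from cron or any simple scheduler as a single statement:
-- CALL sp_nightly_etl();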

Create a stored procedure

To create a stored procedure in Amazon Redshift, use the following syntax:

CREATE [ OR REPLACE ] PROCEDURE sp_procedure_name 
  ( [ [ argname ] [ argmode ] argtype [, ...] ] )
AS $$
  procedure_body
$$ LANGUAGE plpgsql 
[ { SECURITY INVOKER | SECURITY DEFINER } ]
[ SET configuration_parameter { TO value | = value } ]

When you design stored procedures, think about the encapsulated functionality, input and output parameters, and security level. As an example, here’s how you can write a stored procedure to check primary key violations, given names of the schema, table, and primary key column, using dynamic SQL:

CREATE OR REPLACE PROCEDURE check_primary_key(schema_name varchar(128), 
table_name varchar(128), col_name varchar(128)) LANGUAGE plpgsql
AS $$
DECLARE
  cnt_var integer := 0;
BEGIN
  SELECT INTO cnt_var count(*) from pg_table_def where schemaname = schema_name and
  tablename = table_name and "column" = col_name;
  IF cnt_var = 0 THEN
    RAISE EXCEPTION 'Input table or column does not exist.';
  END IF;

  DROP TABLE IF EXISTS duplicates;
  EXECUTE
    $_$ CREATE TEMP TABLE duplicates as
    SELECT $_$|| col_name ||$_$, count(*) as counter
    FROM $_$|| table_name ||$_$
    GROUP BY 1
    HAVING count(*) > 1
    ORDER BY counter desc $_$;
  SELECT INTO cnt_var COUNT(*) FROM duplicates;
  IF cnt_var = 0
    THEN RAISE INFO 'No duplicates found';
    DROP TABLE IF EXISTS duplicates;
  ELSE
    RAISE INFO 'Duplicates exist for % value(s) in column %', cnt_var, col_name;
    RAISE INFO 'Check tmp table "duplicates" for duplicated values';
  END IF;
END;
$$;

For details about the kinds of SQL queries and control flow logic that can be used inside a stored procedure, see Creating Stored Procedures in Amazon Redshift.

Invoke a stored procedure

Stored procedures must be invoked by the CALL command, which takes the procedure name and the input argument values. CALL can’t be part of any regular queries. As an example, here’s how to invoke the stored procedure created earlier:

db=# call check_primary_key('public', 'bar', 'b');
INFO:  Duplicates exist for 1 value(s) in column b
INFO:  Check tmp table "duplicates" for duplicated values

Amazon Redshift stored procedure calls can return results through output parameters or a result set. Nested and recursive calls are also supported. For details, see CALL command.

How to use security definer procedures

Now that you know how to create and invoke a stored procedure, here’s more about the security aspects. When you create a procedure, only you as the owner (creator) have the privilege to call or execute it. You can grant EXECUTE privilege to other users or groups, which enables them to execute the procedure. EXECUTE privileges do not automatically imply that the caller can access all database objects (tables, views, and so on) that are referenced in the stored procedure.

Take the example of a procedure, sp_insert_customers, created by user Mary. It has an INSERT statement that writes to table customers that is owned by Mary. If Mary grants EXECUTE privileges to user John, John still can’t INSERT into the table customers unless he has explicitly been granted INSERT privileges on customers.

However, it might make sense to allow John to call the stored procedure without giving him INSERT privileges on customers. To do this, Mary has to set the SECURITY attribute of the procedure to DEFINER when creating the procedure and then grant EXECUTE privileges to John. With this set, when John calls sp_insert_customers, it executes with the privileges of Mary and can insert into customers without him having been granted INSERT privileges on that table.

When the security attribute is not specified during procedure creation, its value is set to INVOKER by default. This means that the procedure executes with the privileges of the user that calls it. When the security attribute is explicitly set to DEFINER, the procedure executes with the privileges of the procedure owner.
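
A hedged sketch of that flow (the customers table structure and the argument list are illustrative):

-- Mary creates the procedure so that it runs with her privileges
CREATE OR REPLACE PROCEDURE sp_insert_customers(c_id int, c_name varchar)
AS $$
BEGIN
  INSERT INTO customers VALUES (c_id, c_name);
END;
$$ LANGUAGE plpgsql
SECURITY DEFINER;

-- John can now call the procedure even though he has no INSERT privilege on customers
GRANT EXECUTE ON PROCEDURE sp_insert_customers(int, varchar) TO john;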

Best practices with stored procedures in Amazon Redshift

Here are some best practices for using stored procedures.

Ensure that stored procedures are captured in your source control tool

If you plan to use stored procedures as a key element of your data processing, you should also establish a practice of committing all stored procedure changes to a source control system.

You could also consider defining a specific user who is the owner of important stored procedures and automating the process of creating and modifying procedures.

You can retrieve the source for existing stored procedures using the following command:

SHOW PROCEDURE procedure_name;

Consider the security scope of each procedure and who calls it

By default, stored procedures run with the permission of the user that calls them. Use the SECURITY DEFINER attribute to enable stored procedures to run with different permissions. For instance, explicitly revoke access to DELETE from an important table and define a stored procedure that executes the delete after checking a safe list.

When using SECURITY DEFINER, take care to do the following (see the sketch after this list):

  • Grant EXECUTE on the procedure to specific users, not to PUBLIC. This ensures that the procedure can’t be misused by general users.
  • Qualify all database objects that the procedure accesses with the schema names if possible. For example, use myschema.mytable instead of just mytable.
  • Set the search_path when creating the procedure by using the SET option. This prevents objects in other schemas with the same name from being affected by an important stored procedure.
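
A sketch applying these points to the earlier example (the schema and group names are assumptions, not from the original post):

CREATE OR REPLACE PROCEDURE sales.sp_insert_customers(c_id int, c_name varchar)
AS $$
BEGIN
  -- every referenced object is schema-qualified
  INSERT INTO sales.customers VALUES (c_id, c_name);
END;
$$ LANGUAGE plpgsql
SECURITY DEFINER
SET search_path TO sales;

-- grant to a specific group, not to PUBLIC
GRANT EXECUTE ON PROCEDURE sales.sp_insert_customers(int, varchar) TO GROUP etl_operators;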

Use set-based logic and avoid manually looping over large datasets

When manipulating data within your stored procedures, continue to use normal, set-based SQL as much as possible, for example, INSERT, UPDATE, DELETE.

Stored procedures provide new control structures such as FOR and WHILE loops. These are useful for iterating over a small number of items, such as a list of tables. However, you should avoid using the loop structures to replace a set-based SQL operation. For example, iterating over millions of values to update them one-by-one is inefficient and slow.
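
As a hedged illustration (the table names are hypothetical), a row-by-row loop such as the commented-out fragment below is usually better expressed as one set-based statement:

-- avoid: updating rows one at a time inside a FOR loop
-- FOR rec IN SELECT id FROM staging_orders LOOP
--   UPDATE orders SET status = 'processed' WHERE id = rec.id;
-- END LOOP;

-- prefer: a single set-based statement
UPDATE orders
SET status = 'processed'
WHERE id IN (SELECT id FROM staging_orders);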

Be aware of REFCURSOR limits and use temp tables for larger result sets

Result sets may be returned from a stored procedure either as a REFCURSOR or using temp tables.  REFCURSOR is an in-memory data structure and is the simplest option in many cases.

However, there is a limit of one REFCURSOR per stored procedure. You may want to return multiple result sets, interact with results from multiple sub-procedures, or return millions of result rows (or more). In those cases, we recommend directing results to a temp table and returning a reference to the temp table as the output of your stored procedure.
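
A minimal sketch of the REFCURSOR pattern (the procedure, table and cursor names are illustrative):

CREATE OR REPLACE PROCEDURE get_top_orders(rs_out INOUT refcursor)
AS $$
BEGIN
  -- open the cursor over the result set the caller should consume
  OPEN rs_out FOR SELECT order_id, order_total FROM orders ORDER BY order_total DESC LIMIT 100;
END;
$$ LANGUAGE plpgsql;

-- the cursor must be opened and fetched within the same transaction
BEGIN;
CALL get_top_orders('my_cursor');
FETCH ALL FROM my_cursor;
COMMIT;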

Keep procedures simple and nest procedures for complex processes

Try to keep the logic of each stored procedure as simple as possible. You maximize your flexibility and make your stored procedures more understandable by keeping them simple.

The code of your stored procedures can become complex as you refine and enhance them. When you encounter a long and complex stored procedure, you can often simplify by moving sub-elements into a separate procedure that is called from the original procedure.

Migrating a stored procedure with the AWS Schema Conversion Tool

With Amazon Redshift announcing the support for stored procedures, AWS also enhanced AWS Schema Conversion Tool to convert stored procedures from legacy data warehouses to Amazon Redshift.

With build 627, AWS SCT can now convert Microsoft SQL Server data warehouse stored procedures to Amazon Redshift. Here are the steps in AWS SCT:

  1. Create a new OLAP project for a SQL Server data warehouse (DW) to Amazon Redshift conversion.
  2. Connect to the SQL Server DW and Amazon Redshift endpoints.
  3. Uncheck all nodes in the source tree.
  4. Open the context (right-click) menu for Schemas.
  5. Open the context (right-click) menu for the Stored Procedures node and choose Convert Script (just like when you convert database objects).
  6. (Optional) You can also choose to review the assessment report and apply the conversion.

Here is an example of a SQL Server DW stored procedure conversion:

Conclusion

Stored procedure support in Amazon Redshift is now generally available in every AWS Region. We hope you are as excited about running stored procedures in Amazon Redshift as we are.

With stored procedure support in Amazon Redshift and AWS Schema Conversion Tool, you can now migrate your stored procedures to Amazon Redshift without having to encode them in another language or framework. This feature reduces migration efforts. We hope more on-premises customers can take advantage of Amazon Redshift and migrate to the cloud for database freedom.


About the Authors

Joe Harris is a senior Redshift database engineer at AWS, focusing on Redshift performance. He has been analyzing data and building data warehouses on a wide variety of platforms for two decades. Before joining AWS he was a Redshift customer from launch day in 2013 and was the top contributor to the Redshift forum.

Abhinav Singh is a database engineer at AWS. He works on the design and development of database migration projects, as well as with customers, providing guidance and technical assistance on database migration projects and helping them improve the value of their solutions when using AWS.

Entong Shen is a senior software engineer on the Amazon Redshift query processing team. He has been working on MPP databases for over 7 years and has focused on query optimization, statistics and SQL language features. In his spare time, he enjoys listening to music of all genres and working in his succulent garden.

Vinay is a principal product manager at Amazon Web Services for Amazon Redshift. Previously, he was a senior director of product at Teradata and a director of product at Hortonworks. At Hortonworks, he launched products in Data Science, Spark, Zeppelin, and Security. Outside of work, Vinay loves to be on a yoga mat or on a hiking trail.
 

Sushim Mitra is a software development engineer on the Amazon Redshift query processing team. He focuses on query optimization problems, SQL Language features and Database security. When not at work, he enjoys reading fiction from all over the world.

MySQL 8.0.17 Clone Plugin: How to Create a Slave from Scratch

$
0
0

Feed: Planet MySQL
;
Author: MySQL Performance Blog
;

In this blog post, we will discuss a new feature – the MySQL 8.0.17 clone plugin. Here I will demonstrate how easy it is to use to create the “classic” replication, building the standby replica from scratch.

The clone plugin permits cloning data locally or from a remote MySQL server instance. The cloned data is a physical snapshot of data stored in InnoDB, and this means, for example, that the data can be used to create a standby replica.

Let’s go to the hands-on and see how it works.

Installation & validation process of the MySQL 8.0.17 clone plugin

Installation is very easy and it works in the same way as installing other plugins. Below is the command line to install the clone plugin:
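
A minimal sketch of that statement (assuming a Linux build; the shared library name differs on Windows):

INSTALL PLUGIN clone SONAME 'mysql_clone.so';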

And how to check if the clone plugin is active:
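
One way to do that, sketched here, is to query the plugins table:

SELECT PLUGIN_NAME, PLUGIN_STATUS
FROM INFORMATION_SCHEMA.PLUGINS
WHERE PLUGIN_NAME = 'clone';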

Note that these steps need to be executed on the Donor (aka master) and on the Recipient (aka slave if the clone is being used to create a replica).

After executing the installation, the plugin will be loaded automatically across restarts, so you don’t need to worry about this anymore.

Next, we will create the user with the necessary privilege on the Donor, so we can connect to the instance remotely to clone it.
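
A hedged sketch of that step (the user name, host and password are illustrative; the donor-side clone user needs the BACKUP_ADMIN privilege):

-- on the Donor
CREATE USER 'clone_user'@'%' IDENTIFIED BY 'StrongPass1!';
GRANT BACKUP_ADMIN ON *.* TO 'clone_user'@'%';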

As a security measure, I recommend replacing the % with the IP/hostname or network mask of the Recipient, so that connections will be accepted only from the future replica server. Now, on the Recipient server, the clone user requires the CLONE_ADMIN privilege for replacing recipient data, blocking DDL during the cloning operation and automatically restarting the server.
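
A corresponding sketch for the Recipient side (again, the account details are illustrative):

-- on the Recipient
CREATE USER 'clone_user'@'localhost' IDENTIFIED BY 'StrongPass1!';
GRANT CLONE_ADMIN ON *.* TO 'clone_user'@'localhost';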

Next, with the plugin installed and validated, and users created on both Donor and Recipient servers, let’s proceed to the cloning process.

Cloning process

As mentioned, the cloning process can be executed locally or remotely.  Also, it supports replication, which means that the cloning operation extracts and transfers replication coordinates from the donor and applies them on the recipient. It can be used for GTID or non-GTID replication.

So, to begin the cloning process, first let's make sure that there's a valid donor. This is controlled by the clone_valid_donor_list parameter. As it is a dynamic parameter, you can change it while the server is running. Using the SHOW VARIABLES command will show whether the parameter has a valid donor:
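
A sketch of that check:

SHOW VARIABLES LIKE 'clone_valid_donor_list';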

In our case, we need to set it. So let’s change it:
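
A sketch, assuming the Donor listens on 192.168.1.10:3306 (replace it with your Donor's host and port):

SET GLOBAL clone_valid_donor_list = '192.168.1.10:3306';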

The next step is not mandatory, but with the default log_error_verbosity, the error log does not display much information about the cloning progress. So, for this example, I will adjust the verbosity to a higher level (on the Donor and the Recipient):
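
A sketch of that adjustment:

SET GLOBAL log_error_verbosity = 3;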

Now, let’s start the cloning process on the Recipient:

It is possible to observe the cloning progress in the error log of both servers. Below is the output of the Donor:

And the Recipient:

Note that the MySQL server on the Recipient will be restarted after the cloning process finishes. After this, the database is ready to be accessed and the final step is setting up the replica.

The replica process

Both binary log position (filename, offset) and GTID coordinates are extracted and transferred from the donor MySQL server instance.

The queries below can be executed on the cloned MySQL server instance to view the binary log position or the GTID of the last transaction that was applied:
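
As a sketch, the coordinates recorded by the clone operation can be read like this:

-- binary log coordinates captured from the Donor
SELECT BINLOG_FILE, BINLOG_POSITION FROM performance_schema.clone_status;

-- GTID set applied on the cloned instance
SELECT @@GLOBAL.gtid_executed;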

With the information in hand, we need to execute the CHANGE MASTER command accordingly:
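
A hedged sketch for a GTID-based setup (the host, replication user and password are illustrative):

CHANGE MASTER TO
  MASTER_HOST = '192.168.1.10',
  MASTER_PORT = 3306,
  MASTER_USER = 'repl',
  MASTER_PASSWORD = 'ReplPass1!',
  MASTER_AUTO_POSITION = 1;
START SLAVE;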

Limitations

The clone plugin has some limitations and they are described here. In my opinion, two major limitations will be faced by the community. The first is the ability to clone only InnoDB tables. This means that MyISAM and CSV tables stored in any schema, including the sys schema, will be cloned as empty tables.

The other limitation is regarding DDL, including TRUNCATE TABLE, which is not permitted during a cloning operation. Concurrent DML, however, is permitted. If a DDL is running, then the clone operation will wait for the lock:

Otherwise, if the clone operation is running, the DDL will wait for the lock:

Conclusion

Creating replicas has become much easier with the help of the MySQL 8.0.17 clone plugin. This feature can be used over SSL connections and with encrypted data as well. At the moment of publication of this blog post, the clone plugin can be used not only to set up asynchronous replicas but also to provision Group Replication members. Personally, I believe that in the near future this feature will also be used for Galera clusters. This is my two cents for what the future holds.

Useful Resources

Finally, you can reach us through the social networks, our forum or access our material using the links presented below:

Fun with Bugs #87 – On MySQL Bug Reports I am Subscribed to, Part XXI

$
0
0

Feed: Planet MySQL
;
Author: Valeriy Kravchuk
;

After a 3 months long break I’d like to continue reviewing MySQL bug reports that I am subscribed to. This issue is devoted to bug reports I’ve considered interesting to follow in May, 2019:

  • Bug #95215 – “Memory lifetime of variables between check and update incorrectly managed“. As demonstrated by Manuel Ung, there is a problem with all InnoDB MYSQL_SYSVAR_STR variables that can be dynamically updated. Valgrind allows to highlight it.
  • Bug #95218 – “Virtual generated column altered unexpectedly when table definition changed“. This weird bug (that does not seem to be repeatable on MariaDB 10.3.7 with proper test case modifications like removing NOT NULL and collation settings from virtual column) was reported by Joseph Choi. Unfortunately we do not see any documented attempt to check if MySQL 8.0.x is also affected. My quick test shows MySQL 8.0.17 is NOT affected, but I’d prefer to see check copy/pasted as a public comment to the bug.
  • Bug #95230 – “SELECT … FOR UPDATE on a gap in repeatable read should be exclusive lock“. There are more chances to get a deadlock with InnoDB than one might expect… I doubt this report from Domas Mituzas is a feature request. It took him some extra efforts to insist on the point and get it verified even as S4.
  • Bug #95231 – “LOCK=SHARED rejected contrary to specification“. This bug report from Monty Solomon ended up as a documentation request. The documentation and the implementation are not aligned, and it was decided NOT to change the parser to match documented syntax. But why it is still “Verified” then? Should it take months to correct the fine manual?
  • Bug #95232 – “The text of error message 1846 and the online DDL doc table should be updated”. Yet another bug report from Monty Solomon. Some (but not ALL) partition specific ALTER TABLE operations do not yet support LOCK clause.
  • Bug #95233 – “check constraint doesn’t consider IF function that returns boolean a boolean fun“. As pointed out by Daniel Black, IF() function in a check constraint isn’t considered a boolean type. He had contributed a patch to fix this, but based on comments it’s not clear if it’s going to be accepted and used “as is”. The following test shows that MariaDB 10.3 is not affected:

    C:\Program Files\MariaDB 10.3\bin>mysql -uroot -proot -P3316 test
    Welcome to the MariaDB monitor.  Commands end with ; or \g.
    Your MariaDB connection id is 9
    Server version: 10.3.7-MariaDB-log mariadb.org binary distribution

    Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

    Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
    MariaDB [test]> create table t1 (source enum('comment','post') NOT NULL, comment_id int unsigned, post_id int unsigned);
    Query OK, 0 rows affected (0.751 sec)

    MariaDB [test]> alter table t1 add check(IF(source = 'comment', comment_id IS NOT NULL AND post_id IS NULL, post_id IS NOT NULL AND comment_id IS NULL));
    Query OK, 0 rows affected (1.239 sec)
    Records: 0  Duplicates: 0  Warnings: 0

  • Bug #95235 – “ABRT:Can’t generate a unique log-filename binlog.(1-999), while rotating the bin“. Yet another bug report from Daniel Black. When MySQL 8.0.16 is built with gcc 9.0.x abort is triggered in the MTR suite on the binlog.binlog_restart_server_with_exhausted_index_value test.
  • Bug #95249 – “stop slave permanently blocked“. This bug was reported by Wei Zhao, who had contributed a patch.
  • Bug #95256 – “MySQL 8.0.16 SYSTEM USER can be changed by DML“. MySQL 8.0.16 had introduced an new privilege, SYSTEM_USER. MySQL manual actually says:

    The protection against modification by regular accounts that is afforded to system accounts by the SYSTEM_USER privilege does not apply to regular accounts that have privileges on the mysql system schema and thus can directly modify the grant tables in that schema. For full protection, do not grant mysql schema privileges to regular accounts.

    But this report from Zhao Jianwei, showing that a user with a privilege to execute DML on the mysql.GLOBAL_GRANTS table can change a SYSTEM_USER account, was accepted and verified. I hope Oracle engineers will finally make up their mind and decide either to fix this or to close this report as “Not a bug”. I’ve subscribed in a hope for some fun around this decision making.

  • Bug #95269 – “binlog_row_image=minimal causes assertion failure“. This assertion failure happens in a debug build when one of the standard MTR test cases, rpl.rpl_gis_ddl or rpl.rpl_window_functions, is executed with the --binlog-row-image=minimal option. In such cases I always wonder what is the reason for a failure NOT to be noted by Oracle MySQL QA and somehow fixed before Community users notice it? Either they don’t run tests on debug builds with all possible combinations, or do not care to fix such failures (and thus should suffer from known failures in other test runs). I do not like any of these options, honestly. The bug was reported by Song Libing.
  • Bug #95272 – “Potential InnoDB SPATIAL INDEX corruption during root page split“. This bug was reported by Albert Hu based on a Valgrind report when running the test innodb.instant_alter. Do they run MTR tests under Valgrind or on ASan builds in Oracle? I assume they do, but then why are Community users reporting such cases first? Note that the related MariaDB bug, MDEV-13942, is fixed in 10.2.24+ and 10.3.15+.
  • Bug #95285 – “InnoDB: Page [page id: space=1337, page number=39] still fixed or dirty“. This assertion failure that happens during normal shutdown was reported by LUCA TRUFFARELLI. There are chances that this is a regression bug (without a regression tag), as it does not happen for reporter on MySQL 5.7.21.
  • Bug #95319 – “SHOW SLAVE HOST coalesces certain server_id’s into one“. This bug was reported by Lalit Choudhary from Percona based on original findings by Glyn Astill.
  • Bug #95416 – “ZERO Date is both NULL and NOT NULL“. This funny bug report was submitted by Morgan Tocker. The manual actually explains that it’s intended behavior (MariaDB 10.3.7 works the same way as MySQL), but it’s still funny and unexpected, and the bug report remains “Verified”.
  • Bug #95478 – “CREATE TABLE LIKE does not honour ROW_FORMAT.” I’d like to add “…when it was not defined explicitly for the original table”. The problem was reported by Jean-François Gagné and ended up as a verified feature request. See also this my post on the details of where row_format is stored and is not stored for InnoDB tables…
  • Bug #95484 – “EXCHANGE PARTITION works wrong/werid with different ROW_FORMAT“. Another bug report by Jean-François Gagné related to the previous one. He had shown that it’s actually possible to get partitions with different row formats in the same InnoDB table in MySQL 5.7.26, but not in the most natural way. It seems the problem may be fixed in 5.7.27 (by the fix for another, internally reported bug), but the bug remains “Verified”.

There are some more bugs reported in May 2019 that I was interested in, but let me stop for now. Later in May I got a chance to spend some days off in Barcelona, without a single MySQL bug report opened for days.

I like this view of Barcelona way more than any MySQL bugs review, including this one.

To summarize:

  1. Oracle engineers who process bugs still sometimes do not care to check if all supported major versions are affected and/or share the results of such checks in public. Instead, some of them care to argue about severity of the bug report, test case details etc.
  2. We still see bug reports that originate from existing, well-known MTR test cases run under Valgrind or in debug builds with some non-default options set. I do not have any good reason in mind to explain why these are NOT reported by Oracle's internal QA first.
  3. Surely some regression bugs still get verified without the regression tag added.

I truly hope my talk “Problems with Oracle’s Way of MySQL Bugs Database Maintenance” will be accepted for the Percona Live Europe 2019 conference (at least as a lightning talk) and I’ll get another chance to speak about the problems highlighted above, and more. There are some “metabugs” in the way Oracle handles MySQL bug reports, and these should be discussed and fixed, for the benefit of MySQL quality and all MySQL users and customers.

Is Self-Service Analytics Sustainable?

$
0
0

Feed: Teradata Blog.

What is Self-Service Analytics?

Self-service analytics is a common buzzword for many organizations which desire to be more data driven and less dependent on IT for their data needs. Organizations are increasingly implementing self-service capabilities to enable and promote a data-driven culture within their organizations. Gartner, Inc. predicts that by 2019, the analytics output of business users with self-service capabilities will surpass that of professional data scientists. Self-service analytics sound great, but how sustainable is it? In a series of articles, we will dive into self-service, starting first with “What is self-service analytics?”.  

Every user wants to be self-service enabled, but not all users have the same skillset. Therefore, enabling self-service means different things to different users. Let’s first start by defining some basic user personas, their skillsets and what self-service means to each of them.
 

  • General Consumer – Analytic users who leverage reports and dashboards from trusted, governed data sources to run the day-to-day business. They work exclusively from an operational perspective. They leverage BI tools and dashboards to monitor and analyze the data.
  • Data Analyst – Analytic users who run reports, manipulate or pivot on prepared data products. Access rules are usually restricted to prepared data products from trusted, governed data sources, which may be optimized structures, sometimes referencing common summaries. They primarily work from an operational perspective but do some data discovery as well, when needed. They may leverage the same BI tools as the general consumer but are also savvy at writing SQL code. They can directly query data which has been curated to a tabular structure, e.g. a relational database schema, or a schema defined by Hive or some other interface that enables data queries. They are skilled spreadsheet users.
  • Citizen Data Scientist – Data analyst which can query a broad range of data products looking for answers to new business problems or discover new insights from existing data which is trusted and governed. They may also leverage new experimental data. They work primarily from an R&D perspective. These individuals typically have a strong, business subject area knowledge, SQL coding skills and an understanding of how to apply statistical and machine learning algorithms as a black box in solving business problems.
  • Data Scientist – Work with any data source (raw or curated) to look for answers to new business problems or discover new insights. They will bring their own data and augment it with existing data as needed. They most often work from a R&D perspective. They have the most diverse skill set, e.g. strong interpretive language skills (e.g. Python, R) and SQL skills, and solid mathematic skills for leveraging machine and deep learning algorithms (often a Ph.D. in science or mathematics). They might be a business subject matter expert or work with a business subject matter expert to better understand the business problem they are trying to solve.

These personas are not absolute, but the generalized categorizations enable us to group like users by their skillsets, needs and data reach within the analytic ecosystem. 

General consumers reach into the Access Layer (typically star schemas) via OLAP tools and dashboards. Data analysts leverage the Access Layer and reach further into the Integration Layer (typically a near 3rd normal form structure) to do ad-hoc analytics. Citizen data scientists leverage both the Access and Integration Layers, but also leverage standardized data (typically flattened structures, but not necessarily) in the Acquisition Layer. The data scientists, with their unique skillset, have full access to all forms of data within the entire analytics ecosystem. All users will leverage trusted and governed data, where possible, as it possesses the highest data quality and requires the least effort to utilize.

All personas, except the general consumers, need to provision resources for their self-service analytics. These terms can mean something different to each persona and are described below to provide a more consistent definition in support of managing self-service analytics.

Automated-Provisioning: The goal should be to enable users to request resources which can be systematically enabled within sanctioned areas, if they fall within standard defaults, e.g. storage, compute, memory. An exploration zone, such as Teradata Data Labs, is a good example of this capability in action.  

Self-Service Analytics: The ability for the user to access and utilize available resources (e.g. storage, compute, memory) so that they can acquire, profile, wrangle and analyze data (structured or unstructured) for some analytical purpose on their own. This could be a data scientist doing advanced analytics or a data analyst running SQL and NoSQL queries or connecting an OLAP tool for investigating new reports and dashboards. They may or may not require new data. By far the largest segment of self-service analytics are the general consumers, leveraging BI tools to slice and dice within the tool’s defined data domain.    

Individuals doing self-service analytics should be free to do what they need by themselves. They read, seek out online training, attend instructor led training classes, etc. They learn as they go, most often when they need to remove a roadblock. Publishing and maintaining standards and best practices further enhances self-paced learning.

Some users will need to load data into a relational database, which means writing DDL code, either by using command line or GUI tools. A data analyst is very familiar with SQL, so creating a table and loading it should not be a stretch. A citizen data scientist may also be able to load data into NoSQL structures and perhaps wrangle and conform some data into a usable format. The data scientist can do pretty much everything needed to load and prepare any type of data.  

At some point, exploration results and model definition techniques that both lead to reuse and faster time-to-value are important enough to be recalculated and reanalyzed on a regular basis and should become part of the organization’s critical production processing. Operationalization is the process by which data and/or analytic modules are moved from an exploration environment into a production environment. It entails adhering to data naming conventions, optimizing processes and utilizing standard services, thereby ensuring the data and analytics will be consistently available to the entire user community. Perhaps even more important, it frees up users to focus on other research needs.

Everyone wants to speed up self-service analytics. Depending on the choices made during exploration, the operationalization effort may take longer. A common example would be using tools for exploration, which are not authorized in production. The more aware the users are, the better they understand the trade-offs.  

There is a lot of hype around self-service analytics. The complexity goes up when a user decides to bring their own data; therefore, making data easier to find and access will greatly accelerate self-service analytics initiatives and minimize data sprawl.

In part 2 of this series we will discuss the value and challenges of self-service analytics within organizations today.


Fun with Bugs #88 – On MySQL Bug Reports I am Subscribed to, Part XXII

$
0
0

Feed: Planet MySQL
;
Author: Valeriy Kravchuk
;

It’s Saturday night. I have a week of vacation ahead that I am going to spend at home, working on things I usually do not have time for. I already did something useful (created a couple of test cases for MariaDB bugs here and there, among other things), so why not to get some fun and continue my previous review of recent interesting MySQL bug report…

So, here is the list of 15 more MySQL community bug reports that I’ve subscribed to back in May and June 2019:

  • Bug #95491 – “The unused wake up in simulator AIO of function reserve_slot“. I am impressed by the fact that these days many people just study InnoDB code, find problematic parts, improve them, test and prove the improvement, and submit patches to Oracle. Chen Zongzhi, who reported this bug, is one of such great people.
  • Bug #95496 – “Replication of memory tables“. As pointed out by Iwo P, the MySQL manual describes in detail what happens to MEMORY tables replicated from master to slave, including the case of master restart. But one detail is missing – when a slave with MEMORY tables restarts, even with super_read_only set, it also generates an event to delete data from the table into its own binary log, with GTID, and this leads to problems in case of failover.
  • Bug #95507 – “innodb_stats_method is not honored when innodb_stats_persistent=ON“. This bug was reported by my colleague Sergei Petrunia. See also related MDEV-19574  (still “Open”).
  • Bug #95547 – “MySQL restarts automatically after ABORT_SERVER due to disk full“. I am not sure this is a replication bug, more a problem of too primitive mysqld_safe or systemd startup scripts that do not check free space on disk or reason why mysqld process committed suicide. Anyway, this report by Przemyslaw Malkowski from Percona was accepted as a replication bug.
  • Bug #95582 – “DDL using bulk load is very slow under long flush_list“. As found and proved by Fungo Wang, due to optimized away redo log writing for bulk load, to make sure such DDL is crash safe, dirtied pages must be flushed to disk synchronously. Currently this is done inefficiently, as shown in this bug report. See also a year old (and still “Open”) Bug #92099 – “provide an option to record redo log for bulk loading” from the same reporter, where he asks for a way to get redo logging back (but still benefit from other “online” ALTER features). MariaDB has that option after the fix of MDEV-16809.
  • Bug #95612 – “replication breakage due to Can not lock user management caches for processing“. Simon Mudd noted that in some cases a simple DROP USER IF EXISTS … breaks parallel replication in MySQL 8.0.16. Good to know that START SLAVE allows replication to continue 🙂
  • Bug #95698 – “JSON columns are returned with binary charset“. This bug was reported by Daniel Book. Note that according to the original issue, DBD::MariaDB (I had no idea this exists!) works properly with MariaDB 10.3 or 10.4. I had not checked.
  • Bug #95734 – “Failing SET statement in a stored procedure changes variables“. This really weird and unexpected bug was reported by Przemysław Skibiński.
  • Bug #95863 – “MTS ordered commit may cause deadlock “. Great finding by Lalit Choudhary from Percona!
  • Bug #95895 – “Shutdown takes long time when table uses AUTOINC and FOREIGN KEY constraints“. It was reported by Satya Bodapati from Percona, who later contributed a patch. See also his related Bug #95898 – “InnoDB releases memory of table only on shutdown if table has FK constraint”. I truly hope to see these fixed in the next minor releases of MySQL. Note that the original bug (PS-5639) was reported by Sveta Smirnova.
  • Bug #95928 – “InnoDB crashes on compressed and encrypted table when changing FS block size“. Make sure to check and write original filesystem block size somewhere… This bug was reported by Sergei Glushchenko from Percona.
  • Bug #95929 – “Duplicate entry for key ‘PRIMARY’ when querying information_schema.TABLES”. This nice feature of new data dictionary of MySQL 8.0.16 under concurrent load was found by Manuel Rigger.
  • Bug #95934 – “innodb_flush_log_at_trx_commit = 0 get performance regression on 8 core machine“. If I had access to an 8-core box, I’d try to check this sysbench-based complete test case immediately. I should do this on my quadcore box at least, next week. But here we see some 12 days just wasted, and then the bug report from Chen Zongzhi ended up “Analyzing”, till today, with no further feedback.
  • Bug #95954 – “CAST of negative function return value to UNSIGNED malfunctions with BIGINT“. Yet another weird bug found by Manuel Rigger. There are patches (separate for 5.6, 5.7 and 8.0) contributed by Oleksandr Peresypkin.
Montjuïc is a place to enjoy nice view of Barcelona and forget about MySQL bugs. So, I could miss some interesting ones at the end of May, 2019…

To summarize:

  1. Percona engineers still contribute a lot of useful and interesting MySQL bug reports, often – with patches.
  2. I am impressed by quality of MySQL bug reports in 2019. Often we see detailed analysis, proper complete test cases and patches.
  3. There is a recent fork of DBD::mysql for MariaDB (and MySQL), DBD::MariaDB!
  4. I badly need some new hardware for running benchmarks and tests from recent bug reports…
  5. I should spend more time in May in Barcelona 🙂

ahsan hadi: Horizontal scalability with Sharding in PostgreSQL – Where it is going Part 3 of 3.

$
0
0

Feed: Planet PostgreSQL.

Built-in Sharding Architecture

The built-in sharding feature in PostgreSQL uses the FDW-based approach; the FDWs are based on the SQL/MED specification that defines how an external data source can be accessed from the PostgreSQL server. PostgreSQL provides a number of foreign data wrappers (FDWs) that are used for accessing external data sources: the postgres_fdw is used for accessing a Postgres database running on an external server, MySQL_fdw is used for accessing a MySQL database from PG, MongoDB_fdw is used for accessing MongoDB, and so on.

The diagram below explains the current approach of built-in sharding in PostgreSQL: the partitions are created on foreign servers, the PostgreSQL FDW is used for accessing the foreign servers, and using the partition pruning logic the planner decides which partitions to access and which partitions to exclude from the search.

Push Down Capabilities

Push down in this context is the ability to push parts of the foreign query to foreign servers in order to decrease the amount of data travelling from the foreign server to the parent node. The two basic push-down techniques that have been part of postgres_fdw from the start are select target-list pushdown and WHERE clause pushdown.
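
For illustration, a minimal sketch of such a setup (the table, column and server names are mine, not from the original post; shard_1 is assumed to be an already-defined postgres_fdw foreign server):

CREATE TABLE measurement (
    logdate   date NOT NULL,
    peaktemp  int
) PARTITION BY RANGE (logdate);

-- each partition is a foreign table living on a foreign server
CREATE FOREIGN TABLE measurement_y2019m01
    PARTITION OF measurement
    FOR VALUES FROM ('2019-01-01') TO ('2019-02-01')
    SERVER shard_1;

-- the WHERE clause on the partition key is pushed down to the matching shard
SELECT logdate, peaktemp
FROM measurement
WHERE logdate BETWEEN '2019-01-10' AND '2019-01-20';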

In the query above, the planner will decide which partition to access based on the partition key, i.e. logdate in this case. The WHERE clause will be pushed down to the foreign server that contains the respective partition. Those are the basic push-down capabilities available in postgres_fdw.

The sharding feature requires more advanced push-down capabilities in order to push as many operations as possible down to the foreign servers containing the partitions, minimising the data sent over the wire to the parent node.

Together with the more advanced push-down capabilities added to PostgreSQL in the last few major releases (such as join, sort and aggregate pushdown), this is a decent set of functionality. The good thing about these features is that they already benefit a number of use cases even when the entire sharding feature is not in place.

What’s remaining and associated challenges

There are still a number of important features remaining before we can say that we have a sharding feature in PostgreSQL. In this section we are going to discuss these features and the challenges associated with them. I am sure there are other features related to database cluster management, i.e. backup/failover or monitoring, that are not in this list.

1- 2PC for foreign data wrapper transactions

Currently FDW transactions don't support two-phase commit. This means that if you are using multiple foreign servers in a transaction and one part of the transaction fails on one foreign server, then the entire transaction on all foreign servers is supposed to fail. This feature is required in order to guarantee data consistency across the database cluster.

This feature is required in order to support OLTP workloads, hence it is very important for the sharding feature.

The design proposal and patches for this feature have been sent to hackers for the last several years, but they are not getting enough community interest, hence the design of this feature is still outstanding.

2- Parallel foreign scan

When a single query involves multiple foreign scans, all the foreign scans are currently executed in a sequential manner, one after another. The parallel foreign scan functionality executes multiple foreign scans in parallel. This feature is really important for OLAP use cases, for example if you are running an AVG query on a large partitioned table that is divided over a large number of partitions. The AVG operation will be sent to each foreign server sequentially, and the results from each foreign server are sent to the parent node, which aggregates them and sends the result back to the client. Once we have the parallel foreign scan functionality, all the average operations on all the foreign servers will be executed in parallel and the results sent to the parent node. The parent node will aggregate the data and send the results to the client.

This is a key piece needed for completing the sharding feature: we currently have aggregate pushdown that will send the aggregates down to the foreign server, but we don't have the functionality to run the aggregate operations on all the partitions in parallel.

This feature is particularly important for the OLAP use case: the idea of having a large number of foreign servers containing the partitions of a large partitioned table, with aggregate operations on the partitions running on all the foreign servers in parallel, is very powerful.

The infrastructure for the parallel foreign scan feature is asynchronous query execution, which is a major change in PostgreSQL. There has been some work done on this, but it feels like it is still a release or two away from being committed. Once asynchronous query execution is done, it will be easier to add the parallel foreign scan functionality.

3- Shard management

The partitions on foreign servers are currently not created automatically; as described in the “Sharding in PostgreSQL” section, the partitions need to be created manually on the foreign servers. This can be a very tedious task if you are creating a partitioned table with a large number of partitions and sub-partitions.

The shard management feature is supposed to provide the ability to auto-create the partitions and sub-partitions on the foreign servers. This will make the creation of sharded tables very easy.

Not intending to go into any design details of how this feature will be implemented, the basic idea is that the sharded table syntax will be built on top of the declarative partitioning syntax. The postgres_fdw will be used to push down the DDL to the foreign servers; note that FDWs are only meant to do SELECT or DML, and doing DDL on an external source is not part of the SQL/MED specification. Anyhow, we aren't supposed to discuss the design of this feature in this blog.

This feature has not yet been started in the community; the development team at HighGo is planning to work on it.

4- Global Transaction Manager / Snapshot Manager

This is another very important and difficult feature that is mandatory for the sharding feature. The purpose of the global transaction/snapshot manager is to provide global transactional consistency. The problem described in section 1 of this chapter, “2PC for foreign data wrapper transactions”, also ties in with the global transaction manager.

Let's suppose you have two concurrent clients that are using a sharded table: client #1 is trying to access a partition that is on server 1, and client #2 is also trying to access the partition that is on server 1. Client 2 should get a consistent view of the partition, i.e. any changes (updates etc.) made to the partition during client 1's transaction shouldn't be visible to client 2. Once client 1's transaction gets committed, the changes will be visible to all new transactions. The global transaction manager is supposed to ensure that all global transactions get a consistent view of the database cluster. All the concurrent clients using the database cluster (with tables sharded across multiple foreign servers) should see a consistent view of the database cluster.

This is a hard problem to solve, and companies like Postgres Professional have tried to solve it by using an external transaction manager. So far there doesn't seem to be any solution accepted by the community. Right now there is no visible concentrated effort trying to implement the global transaction manager in the core or even as an external component.

There is mention of using other approaches, like the Clock-SI (snapshot isolation for partitioned tables) approach that is followed by other successful projects like Google Cloud Spanner and YugaByte, for solving the same problem.

Conclusion

This is the conclusion of all 3 blogs of this series: horizontal scalability with sharding is imperative for PostgreSQL. It is possible that only some workloads need sharding today in order to solve their problems, but I am sure everyone wants to know that PostgreSQL has an answer to this problem. It is also important to note that sharding is not a solution for all big data or highly concurrent workloads; you need to pick workloads where a larger table can be logically partitioned across partitions and the queries benefit from the pushdown and other capabilities of the sharded cluster.

As I mentioned in the initial section of this blog, the first target for the sharding feature, once it is complete, is to be able to speed up a long-running complex query. This would be an OLAP query, though that is not to say that sharding wouldn't benefit OLTP workloads. The data would be partitioned across multiple servers instead of a single server.

Another important exercise that the sharding team should start soon is benchmarking using the capabilities already part of PostgreSQL. I know that without parallel foreign scan it is not possible to speed up a real OLAP query that uses multiple partitions. However, the process of benchmarking should begin soon: we need to identify the types of workload that should benefit from sharding, what the performance is without sharding, and what performance to expect with a sharded cluster. I don't think we can expect the performance to be linear as we add more shards to the cluster.

Another important point that I would like to mention here is that there have been critics of using the FDW machinery for implementing built-in sharding. There have been suggestions to go more low-level in order to efficiently handle cross-node communication etc. The answer given by a senior community member is a good one: we are using the FDW machinery to implement this feature because that's the quickest and least error-prone route for implementing it. The FDW functionality is already tried and tested; if we try to implement using an approach that's more complex and sophisticated, it will require a lot of resources and lots of time before we can produce something that we can call sharding.

It will take more than a few companies investing their resources to build this big feature. So more and more companies should come together on implementing this feature in the community, because it is worth it.

Asif Rehman: An Overview of Replication in PostgreSQL Context

$
0
0

Feed: Planet PostgreSQL.

Replication is a critical part of any database system that aims to provide high availability (HA) and effective disaster recovery (DR) strategy. This blog is aimed at establishing the role of replication in a database system. The blog will give a general overview of replication and its types as well as an introduction to replication options in PostgreSQL.

The term Replication is used to describe the process of sharing information between one or more software or hardware systems; to ensure reliability, availability, and fault-tolerance. These systems can be located in the same vicinity, could be on a single machine or perhaps connected over a wide network. The replication can be divided broadly into Hardware and Software categories. We’ll explore these categories briefly, however, the main focus of this blog is database replication. So, let’s first understand what constitutes a database replication system.

In a nutshell, database replication describes the process of copying data from one database instance to one or more database instances. Again, these instances can be in the same location or perhaps connected over a wide network.

Hardware-Based Replication

Let's start with hardware replication. Hardware-based replication keeps multiple connected systems in sync. The syncing is done at the storage level: as soon as an I/O is performed on the system, it is propagated to the configured storage modules/devices/systems. This type of replication can cover the entire storage or only selected partitions. The biggest advantage of this type of solution is that it is (generally) easier to set up and independent of the software stack. This usually makes it perform better, but it reduces flexibility and control over the replication. Here are a few pros of this type of replication:

Real-time – all the changes are applied to subsequent systems immediately.
Easier to Setup – no scripting or software configurations are required.
Independent of Application – replication happens at the storage layer and is independent of OS/software application.
Data Integrity and Consistency – since the mirroring happens at the storage layer and effectively makes an exact copy of the disk, data integrity and consistency are automatically ensured.

Although hardware replication has some very appealing advantages, it comes with its own limitations. It generally implies vendor lock-in (the same type of hardware has to be used), and it is often not very cost-effective.

Software-Based Replication

Software-based replication ranges from generic solutions to product-specific ones. The generic solutions tend to emulate hardware replication by copying data between systems at the software level: the software responsible for the replication copies each bit written to the source storage and propagates it to the destination system(s). Product-specific solutions, on the other hand, are tailored to the requirements of a particular product and are generally meant for that product only. Software-based replication has its pros and cons. On one hand, it provides flexibility and control over how the replication is done, is usually very cost-effective, and offers a richer feature set. On the other hand, it requires a lot of configuration as well as continuous monitoring and maintenance.

Database Replication

Having discussed replication and its different types, let's now turn our focus to database replication.

Before going into more detail, let's start with the terminology used to describe the components of a replication system in the database world. Primary-Standby, Master-Slave, Publisher-Subscriber, and Master-Master/Multi-master are the terms most often used to describe the database servers participating in a replication setup.

The terms Primary, Master, and Publisher describe the active node that propagates the changes it receives to the other nodes, whereas Standby, Slave, and Subscriber describe the passive nodes that receive the propagated changes from the active nodes. In this blog, we will use Primary for the active node and Standby for the passive node.

The database replication can be configured in Primary-Standby and Multi-master configurations.

In a Primary-Standby configuration, only one instance receives the data changes and then it distributes them among the standbys.

Primary-Standby Configuration

In a Multi-master system, every database instance can receive data changes and then propagates those changes to the other instances.

Multi-Master Configuration

Synchronous/Asynchronous Replication

At the core of the replication process is the primary node's ability to transmit data to the standbys. Much like any other data transfer strategy, this can happen synchronously, where the primary waits for all standbys to confirm that they have received and written the data to disk; the client is given a confirmation only once these acknowledgments arrive. Alternatively, the primary node can commit the data locally and transmit the transaction data to the standbys whenever possible, with the expectation that the standbys will receive and write the data to disk; in this case, the standbys do not send confirmations to the primary. The former strategy is called synchronous replication and the latter asynchronous replication. Both hardware- and software-based solutions support synchronous and asynchronous replication.

Following are diagrams that show both types of replication strategies in a graphical way.

Since the main difference between the two strategies is the acknowledgment of data written on the standby systems, each technique has advantages and disadvantages. Synchronous replication can be configured so that all systems are up to date and replication happens in real time, but it adversely impacts the performance of the system, since the primary node waits for the standbys' confirmations.

Asynchronous replication tends to give better performance, as there is no wait for confirmation messages from the standbys.
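In PostgreSQL, for example, the choice surfaces as a small piece of configuration on the primary. The following is only a sketch; 'standby1' is a hypothetical name that must match the application_name the standby uses in its primary_conninfo.

-- Require an acknowledgment from one named standby before a commit returns.
ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (standby1)';
ALTER SYSTEM SET synchronous_commit = on;
SELECT pg_reload_conf();

-- Setting synchronous_standby_names to '' reverts to asynchronous replication.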

Cascading Replication

So far we have seen that only the primary node transmits the transaction data to the standbys. This load can be distributed among multiple systems: a standby that has received the data and written it to disk can transmit it onwards to other standbys that are configured to receive data from it. This is called cascading replication, where replication is configured between the primary and standbys, and between standbys and further standbys. Following is a visual representation of cascading replication.

Cascading Replication

Standby Modes

Warm standby describes a standby server that receives changes from the primary but does not accept any client connections directly.

Hot standby describes a standby server that also accepts (read-only) client connections.

PostgreSQL Replication

In the end, I would like to share the replication options available for PostgreSQL.

PostgreSQL offers built-in log streaming and logical replication options. The built-in replication is only available in the primary-standby configuration.

Streaming Replication (SR)

Also known as physical or binary replication in Postgres, streaming replication streams binary data to the standbys using WAL (write-ahead log) records. Any change made on the primary server is first written to the WAL files before it is applied to the data files, and the database server can stream these records to any number of standbys. Since it is a binary copy of the data, there are certain limitations. This type of replication can only be used when all changes have to be replicated to the standbys; it cannot be used if only a subset of the changes is required. The standby servers either do not accept connections or, if they do, can only serve read-only queries. The other limitation is that it cannot be used between different versions of the database server: the whole setup, consisting of multiple database servers, has to run the same version.

Streaming replication is asynchronous by default; however, it is not difficult to turn on synchronous replication. This kind of setup is well suited for achieving high availability (HA): if the primary server fails, one of the standbys can take its place, since it is an almost exact copy. Here is the basic configuration:

# First create a user with replication role in the primary database, this user will be used by standbys to establish a connection with the primary:
CREATE USER foouser REPLICATION;

# Add it to the pg_hba.conf to allow authentication for this user:
# TYPE  DATABASE        USER       ADDRESS                 METHOD
host    replication     foouser    <standby_ip>/32         trust

# Take a base backup on the standby:
pg_basebackup -h <primary_host> -U foouser -D ~/standby

# In postgresql.conf following entries are needed
wal_level=replica # should be set to replica for streaming replication
max_wal_senders = 8 # number of standbys to allow connection at a time

# On the standby, create a recovery.conf file in its data directory with the following contents:
standby_mode=on
primary_conninfo='user=foouser host=<primary_host> port=<primary_port>'

Now start the servers, and basic streaming replication should work.

Logical Replication

Logical replication is considerably newer in PostgreSQL than streaming replication. Streaming replication is a byte-by-byte copy of the entire data set; it does not allow replicating a single table or a subset of the data from the primary server to standbys. Logical replication enables replicating particular objects in the database to the standbys instead of the whole database. It also allows replicating data between different versions of the database server, and it allows the standby to accept both read and write queries. However, be aware that there is no conflict resolution system: if both primary and standbys are allowed to write to the same table, there will more likely than not be data conflicts that stop the replication process, and the user must resolve them manually. If the primary and standbys write to completely disjoint sets of data, no conflict resolution is needed. Here is the basic configuration:

# First create a user with replication role in the primary database, this user will be used by standbys to establish a connection with the primary:
CREATE USER foouser REPLICATION;

# Add it to the pg_hba.conf to allow authentication for this user:
# TYPE  DATABASE        USER       ADDRESS                 METHOD
host    replication     foouser    <standby_ip>/32         trust

# For logical replication, a base backup is not required; either of the two database instances can be used to create this setup.

wal_level=logical # for logical replication, wal_level needs to be set to logical in postgresql.conf file.

# On the primary server, create some tables to replicate and a publication that lists these tables.
CREATE TABLE t1 (col1 int, col2 varchar);
CREATE TABLE t2 (col1 int, col2 varchar);
INSERT INTO t1 ....;
INSERT INTO t2 ....;
CREATE PUBLICATION foopub FOR TABLE t1, t2;

# On the standby, create the structure of the above tables, since DDL is not replicated.
CREATE TABLE t1 (col1 int, col2 varchar);
CREATE TABLE t2 (col1 int, col2 varchar);

# On the standby, create a subscription for the above publication.
CREATE SUBSCRIPTION foosub CONNECTION 'host=<primary_host> port=<primary_port> user=foouser' PUBLICATION foopub;

With this, basic logical replication is achieved.

I hope this blog helped you understand replication in general and, from a PostgreSQL perspective, the concepts of streaming and logical replication. Furthermore, I hope it helps you design a replication environment better suited to your needs.

Check Constraints Issues


Feed: Planet MySQL.
Author: Dave Stokes.
Earlier I wrote about check constraints when MySQL 8.0.16 was released. But this week I noticed two different folks having similar problems with them. And sadly it is ‘pilot error’.

The first was labeled "MYSQL 8.0.17 CHECK not working even though it has been implemented", and a cursory glance may make one wonder what is going on with the database.

The table is set up with two constraints. Old timers will probably mutter something under their breath about using ENUMs, but here they are:

JOB_TITLE varchar(20) CHECK(JOB_TITLE IN ('Lecturer', 'Professor', 'Asst. Professor', 'Sr. Lecturer')),

DEPARTMENT varchar(20) CHECK(DEPARTMENT IN ('Biotechnology', 'Computer Science', 'Nano Technology', 'Information Technology')),
And if you cut-n-paste the table definition into MySQL Workbench or MySQL Shell, it is perfectly valid DDL. 
So the table is good. 
What about the query?

INSERT INTO Employee (FNAME, MNAME, LNAME, BIRTHDATE, GENDER, SSN, JOB_TITLE, SALARY, HIREDATE, TAX, DEPARTMENT) VALUES
('Sangeet', 'R', 'Sharma', date '1965-11-08', 'M', '11MH456633', 'Prof', 1200900, date '1990-12-16', 120090, 'Computer');

At first glance the query looks good. But notice the use of 'Prof' instead of 'Professor' and 'Computer' instead of 'Computer Science'. The two respective constraints are working exactly as they are supposed to. That is why you see the error message ERROR 3819: Check constraint 'employee_chk_2' is violated.

So how to fix it? You can rewrite the DDL so that 'Prof' and 'Computer' become accepted values, or you can make the data match the specifications. If you go to the trouble of adding a constraint and then do things like this, you are sabotaging your own work.
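For example, here is a sketch of the second option, with the values adjusted to satisfy the constraints (the column list follows the table definition above):

INSERT INTO Employee (FNAME, MNAME, LNAME, BIRTHDATE, GENDER, SSN, JOB_TITLE, SALARY, HIREDATE, TAX, DEPARTMENT) VALUES
('Sangeet', 'R', 'Sharma', date '1965-11-08', 'M', '11MH456633', 'Professor', 1200900, date '1990-12-16', 120090, 'Computer Science');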

The Second Issue

In another Stack Overflow post, someone with the table CREATE TABLE Test (ID CHAR(4), CHECK (CHAR_LENGTH(ID) = 4)); was wondering why constraint checks were a problem with INSERT INTO Test(ID) VALUES ('12345');

And the error you get if you try the above? ERROR 1406: Data too long for column 'ID' at row 1!

Well, this is not a case where a constraint check is behaving badly. Look at the table definition: ID is a four (4) character CHAR column, and the length of '12345' is not four!

In the past, MySQL was lax: it would truncate the extra character and issue a warning, and those warnings were often ignored. MySQL earned a bad reputation for that truncation, and the server's default SQL mode was changed to one that does not allow it. Now the server tells you the data is too long for the column. The constraint check never comes into play, because the server already sees you trying to shove five characters into a four-character space.

So how to fix it? 1) Make ID a CHAR(5) and rewrite the constraint, 2) change the SQL mode to allow the server to truncate data, or 3) do not try to put five characters into a space you designed for four.
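A sketch of the first option might look like the following; the auto-generated constraint name (assumed here to be test_chk_1) may differ, so check SHOW CREATE TABLE Test first.

ALTER TABLE Test DROP CHECK test_chk_1;
ALTER TABLE Test MODIFY ID CHAR(5);
ALTER TABLE Test ADD CONSTRAINT test_chk_1 CHECK (CHAR_LENGTH(ID) = 5);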

My Gripe

It is frustrating to see a genuinely useful tool like constraint checks being abused. And it is frustrating that so many people search the web for answers with keywords and will not look at the manual. In both of the examples above, five minutes with the manual pages would have saved a lot of effort.

READ ONLY transactions in MySQL


Feed: Planet MySQL.
Author: Federico Razzoli.

Duke Humfrey’s Library – Bodleian Library, Oxford, Europe

Transactions are the least known RDBMS features. Everyone knows they exist, but few know how they work. For this reason, it is not obvious that they can be read only. This article explains what read only transactions are and why we should use them when appropriate.

What is a read only transaction?

An objection I’ve heard more than once is: transactions are for writes, SELECTs don’t even happen inside a transaction! But this is wrong. Transactions are not only about ROLLBACK and COMMIT. They are also about isolation.

As a general rule, this is what happens: when you start a transaction, you acquire a view, or snapshot, of the data. Until the end of the transaction you will see that snapshot, not the modifications made by other connections.

Reads are generally non-locking. After you've read a row, other users can still modify it, even if your transaction is still in progress. Internally, MySQL keeps the information needed to "rebuild" the data as it was when you acquired the snapshot, so you can still see it. This information is preserved in the transaction (undo) logs.

To be clear, we actually always use transactions. By default, we run queries in autocommit mode, and we don’t start or end transactions explicitly. But even in this case, every SQL statement we run is a transaction.

There is no such thing as reading objective data. We always read from (or write to) a snapshot acquired at a certain point in time, which includes all the changes made by committed transactions.

Why should we use read only transactions?

The answer should be obvious now. When we want to run more than one SELECT, and we want the results to be consistent, we should do it in a read only transaction. For example, we could do something like this:

SELECT COUNT(DISTINCT user_id)
    FROM product WHERE date = CURDATE();
SELECT COUNT(DISTINCT product_id)
    FROM product WHERE date = CURDATE();

If we want the two results to reflect the reality at a certain point in time, we need to do this in a transaction:

START TRANSACTION READ ONLY;
SELECT COUNT(DISTINCT user_id)
    FROM product WHERE date = CURDATE();
SELECT COUNT(DISTINCT product_id)
    FROM product WHERE date = CURDATE();
COMMIT;

The question implies another one: why use a read only transaction at all, when we could use a regular read write transaction? If for some reason we try to write something in a read only transaction, the statement will fail, so it might seem simpler to just always use read write transactions.

The reason for using read only transactions is mentioned in the MySQL manual:

InnoDB can avoid the overhead associated with setting up the transaction ID (TRX_ID field) for transactions that are known to be read-only. A transaction ID is only needed for a transaction that might perform write operations or locking reads such as SELECT ... FOR UPDATE. Eliminating unnecessary transaction IDs reduces the size of internal data structures that are consulted each time a query or data change statement constructs a read view.

8.5.3 Optimizing InnoDB Read-Only Transactions – 25 August 2019

Isolation levels

Isolation level is a setting that defines how much the current transaction is isolated from concurrent transactions. It can be changed at session level, before starting the next transaction. There is also a global setting, which takes effect for sessions that don’t change their isolation level. By default it is REPEATABLE READ.

Isolation levels are:

  • READ UNCOMMITTED;
  • READ COMMITTED;
  • REPEATABLE READ.

(We’re skipping SERIALIZABLE here, as it’s just a variation of REPEATABLE READ and it does not make sense with read only transactions)

This list is only valid for MySQL. Different DBMSs support different isolation levels.

The behavior described above refers to REPEATABLE READ. To recap, it acquires a snapshot of the data, which is used for the whole duration of the transaction.

Another isolation level is READ COMMITTED. It acquires a new snapshot for each query in the transaction. But this does not make sense for a read only transaction.

A usually underestimated isolation level is READ UNCOMMITTED. The reason why it is underestimated is that it reads data that are being written by other transactions, but are not yet committed – and possibly, will never be. For this reason, highly inconsistent data may be returned.

However, this can be an important optimisation. In my experience, it can make some queries much faster and greatly reduce CPU usage, especially when you are running expensive analytics SELECTs on write-intensive servers.

There are cases when result inconsistencies don't matter. A typical example is running aggregations over many rows: an average over one million values will not change appreciably. Another example is reading session-level data that no one else is modifying, or old data that no one is supposed to modify anymore.

If you want to use READ UNCOMMITTED, there is no reason to wrap multiple queries in a single transaction. For each query, you can run something like this:

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
START TRANSACTION;
SELECT COUNT(*) FROM product;
COMMIT;

MetaData Lock and the rollback trick

When accessing a table that uses a non-transactional engine (like MyISAM) inside a transaction, MySQL returns a warning. The developers seem to assume that there is no reason to access such tables from a transaction, which is false. MariaDB doesn't return such a warning.

In reality, transactions acquire metadata locks (MDL). This means that other connections will not be able to run an ALTER TABLE on those tables until the transaction finishes. Preventing such problems is very important and should be done every time it’s useful.

mysqldump rolls back

mysqldump --single-transaction uses default read write transactions, not read only ones. My guess is that they don’t consider it important for a single long transaction, but only for many concurrent transactions.

However, the MariaDB version of mysqldump does something interesting with rollbacks:

ROLLBACK TO SAVEPOINT in --single-transaction mode to release metadata lock on table which was already dumped. This allows to avoid blocking concurrent DDL on this table without sacrificing correctness, as we won't access table second time and dumps created by --single-transaction mode have validity point at the start of transaction anyway.

Note that this doesn't make --single-transaction mode with concurrent DDL safe in general case. It just improves situation for people for whom it might be working.

mysqldump in MariaDB repo, taken from commit 624dd71b9419555eca8baadc695e3376de72286f

In other words:

  • An MDL blocks some operations, so it can be a problem just like any other lock, especially in a long transaction;
  • An MDL is only useful for data that we are going to access again, just like any other lock;
  • Transactions support savepoints in the middle of a transaction, and we can roll back to a savepoint instead of rolling the whole transaction back;
  • Doing so releases the MDLs on objects that were only accessed after the savepoint, as the sketch below illustrates.
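A minimal sketch of the same trick in plain SQL (t1 and t2 are hypothetical tables; the MDL release on ROLLBACK TO SAVEPOINT is the MariaDB behaviour described in the commit message above):

START TRANSACTION WITH CONSISTENT SNAPSHOT;

SAVEPOINT before_t1;
SELECT * FROM t1;                 -- acquires a metadata lock on t1
ROLLBACK TO SAVEPOINT before_t1;  -- releases the MDL on t1; the snapshot is kept

SAVEPOINT before_t2;
SELECT * FROM t2;
ROLLBACK TO SAVEPOINT before_t2;

COMMIT;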

Conclusions

Here I wanted to advertise a feature that, in my opinion, should be used much more often. Actually, if you know from the beginning that a transaction is read only, there is no reason not to inform MySQL about that.

We discussed which isolation levels make sense for read only transactions, and the nice ROLLBACK TO SAVEPOINT trick used by mysqldump for MariaDB.

Did you notice any mistake? Do you have ideas to contribute?
Please comment!

As usual, I’ll be happy to fix errors and discuss your ideas.

Toodle pip,
Federico

Photo source: Wikipedia

Extract Oracle OLTP data in real time with GoldenGate and query from Amazon Athena


Feed: AWS Big Data Blog.

This post describes how you can improve performance and reduce costs by offloading reporting workloads from an online transaction processing (OLTP) database to Amazon Athena and Amazon S3. The architecture described allows you to implement a reporting system and have an understanding of the data that you receive by being able to query it on arrival. In this solution:

  • Oracle GoldenGate generates a new row on the target for every change on the source to create Slowly Changing Dimension Type 2 (SCD Type 2) data.
  • Athena allows you to run ad hoc queries on the SCD Type 2 data.

Principles of a modern reporting solution

Advanced database solutions use a set of principles to help them build cost-effective reporting solutions. Some of these principles are:

  • Separate the reporting activity from the OLTP. This approach provides resource isolation and enables databases to scale for their respective workloads.
  • Use query engines running on top of distributed file systems like Hadoop Distributed File System (HDFS) and cloud object stores, such as Amazon S3. The advent of query engines that can run on top of open-source HDFS and cloud object stores further reduces the cost of implementing dedicated reporting systems.

Furthermore, you can use these principles when building reporting solutions:

  • To reduce licensing costs of the commercial databases, move the reporting activity to an open-source database.
  • Use a log-based, real-time, change data capture (CDC), data-integration solution, which can replicate OLTP data from source systems, preferably in real-time mode, and provide a current view of the data. You can enable the data replication between the source and the target reporting systems using database CDC solutions. The transaction log-based CDC solutions capture database changes noninvasively from the source database and replicate them to the target datastore or file systems.

Prerequisites

If you use GoldenGate with Kafka and are considering cloud migration, you can benefit from this post. This post also assumes prior knowledge of GoldenGate and does not detail steps to install and configure GoldenGate. Knowledge of Java and Maven is also assumed. Ensure that a VPC with three subnets is available for manual deployment.

Understanding the architecture of this solution

The following workflow diagram (Figure 1) illustrates the solution that this post describes:

  1. Amazon RDS for Oracle acts as the source.
  2. A GoldenGate CDC solution produces data for Amazon Managed Streaming for Apache Kafka (Amazon MSK). GoldenGate streams the database CDC data to the consumer, and Kafka topics within the MSK cluster receive the data from GoldenGate.
  3. The Apache Flink application running on Amazon EMR consumes the data and sinks it into an S3 bucket.
  4. Athena analyzes the data through queries. You can optionally run queries from Amazon Redshift Spectrum.

Data Pipeline

Figure 1

Amazon MSK is a fully managed service for Apache Kafka that makes it easy to provision Kafka clusters with a few clicks, without the need to provision servers and storage or configure Apache ZooKeeper manually. Kafka is an open-source platform for building real-time streaming data pipelines and applications.

Amazon RDS for Oracle is a fully managed database that frees up your time to focus on application development. It manages time-consuming database administration tasks, including provisioning, backups, software patching, monitoring, and hardware scaling.

GoldenGate is a real-time, log-based, heterogeneous database CDC solution. GoldenGate supports data replication from any supported database to various target databases or big data platforms like Kafka. GoldenGate’s ability to write the transactional data captured from the source in different formats, including delimited text, JSON, and Avro, enables seamless integration with a variety of BI tools. Each row has additional metadata columns including database operation type (Insert/Update/Delete).

Flink is an open-source, stream-processing framework with a distributed streaming dataflow engine for stateful computations over unbounded and bounded data streams. EMR supports Flink, letting you create managed clusters from the AWS Management Console. Flink also supports exactly-once semantics with the checkpointing feature, which is vital to ensure data accuracy when processing database CDC data. You can also use Flink to transform the streaming data row by row or in batches using windowing capabilities.

S3 is an object storage service with high scalability, data availability, security, and performance. You can run big data analytics across your S3 objects with AWS query-in-place services like Athena.

Athena is a serverless query service that makes it easy to query and analyze data in S3. With Athena and S3 as a data source, you define the schema and start querying using standard SQL. There's no need for complex ETL jobs to prepare your data for analysis, which makes it easy for anyone with SQL skills to analyze large-scale datasets quickly.

The following diagram shows a more detailed view of the data pipeline:

  1. RDS for Oracle runs in a Single-AZ.
  2. GoldenGate runs on an Amazon EC2 instance.
  3. The MSK cluster spans across three Availability Zones.
  4. Kafka topic is set up in MSK.
  5. Flink runs on an EMR Cluster.
  6. Producer Security Group for Oracle DB and GoldenGate instance.
  7. Consumer Security Group for EMR with Flink.
  8. Gateway endpoint for S3 private access.
  9. NAT Gateway to download software components on GoldenGate instance.
  10. S3 bucket and Athena.

For simplicity, this setup uses a single VPC with multiple subnets to deploy resources.

Figure 2

Configuring single-click deployment using AWS CloudFormation

The AWS CloudFormation template included in this post automates the deployment of the end-to-end solution that this blog post describes. The template provisions all required resources including RDS for Oracle, MSK, EMR, S3 bucket, and also adds an EMR step with a JAR file to consume messages from Kafka topic on MSK. Here’s the list of steps to launch the template and test the solution:

  1. Launch the AWS CloudFormation template in the us-east-1 Region.
  2. After successful stack creation, obtain the GoldenGate hub server public IP from the Outputs tab of the CloudFormation stack.
  3. Log in to the GoldenGate hub server using the IP address from step 2 as ec2-user, and then switch to the oracle user: sudo su - oracle
  4. Connect to the source RDS for Oracle database using the sqlplus client and provide the password (source): [oracle@ip-10-0-1-170 ~]$ sqlplus source@prod
  5. Generate database transactions using SQL statements available in oracle user’s home directory.
    SQL> @s
    
     SQL> @s1
    
     SQL> @s2
  6. Query the STOCK_TRADES table from the Amazon Athena console. It takes a few seconds after committing transactions on the source database for the changes to become available to Athena for querying.

Manually deploying components

The following steps describe the configurations required to stream Oracle-changed data to MSK and sink it to an S3 bucket using Flink running on EMR. You can then query the S3 bucket using Athena. If you deployed the solution using AWS CloudFormation as described in the previous step, skip to the Testing the solution section.

  1. Prepare an RDS source database for CDC using GoldenGate. The RDS source database version is Enterprise Edition 12.1.0.2.14. For instructions on configuring the RDS database, see Using Oracle GoldenGate with Amazon RDS. This post does not consider capturing data definition language (DDL).
  2. Configure an EC2 instance for the GoldenGate hub server. Configure the GoldenGate hub server using the Oracle Linux server 7.6 (ami-b9c38ad3) image in the us-east-1 Region. The GoldenGate hub server runs the GoldenGate extract process that extracts changes in real time from the database transaction log files. The server also runs a replicat process that publishes database changes to MSK. The GoldenGate hub server requires the following software components:
  • Java JDK 1.8.0 (required for GoldenGate big data adapter).
  • GoldenGate for Oracle (12.3.0.1.4) and GoldenGate for big data adapter (12.3.0.1).
  • Kafka 1.1.1 binaries (required for GoldenGate big data adapter classpath).
  • An IAM role attached to the GoldenGate hub server to allow access to the MSK cluster for GoldenGate processes running on the hub server. Use the GoldenGate (12.3.0) documentation to install and configure the GoldenGate for Oracle database. The GoldenGate Integrated Extract parameter file is eora2msk.prm.
    EXTRACT eora2msk
    SETENV (NLSLANG=AL32UTF8)
    
    USERID ggadmin@ORCL, password ggadmin
    TRANLOGOPTIONS INTEGRATEDPARAMS (max_sga_size 256)
    EXTTRAIL /u01/app/oracle/product/ogg/dirdat/or
    LOGALLSUPCOLS
    
    TABLE SOURCE.STOCK_TRADES;

    The logallsupcols extract parameter ensures that a full database table row is generated for every DML operation on the source, including updates and deletes.

  1. Create a Kafka cluster using MSK and configure a Kafka topic. You can create the MSK cluster from the AWS Management Console, using the AWS CLI, or through an AWS CloudFormation template.
  • Use the list-clusters command to obtain a ClusterArn and a Zookeeper connection string after creating the cluster. You need this information to configure the GoldenGate big data adapter and Flink consumer. The following code illustrates the commands to run:
    $aws kafka list-clusters --region us-east-1
    {
        "ClusterInfoList": [
            {
                "EncryptionInfo": {
                    "EncryptionAtRest": {
                        "DataVolumeKMSKeyId": "arn:aws:kms:us-east-1:xxxxxxxxxxxx:key/717d53d8-9d08-4bbb-832e-de97fadcaf00"
                    }
                }, 
                "BrokerNodeGroupInfo": {
                    "BrokerAZDistribution": "DEFAULT", 
                    "ClientSubnets": [
                        "subnet-098210ac85a046999", 
                        "subnet-0c4b5ee5ff5ef70f2", 
                        "subnet-076c99d28d4ee87b4"
                    ], 
                    "StorageInfo": {
                        "EbsStorageInfo": {
                            "VolumeSize": 1000
                        }
                    }, 
                    "InstanceType": "kafka.m5.large"
                }, 
                "ClusterName": "mskcluster", 
                "CurrentBrokerSoftwareInfo": {
                    "KafkaVersion": "1.1.1"
                }, 
                "CreationTime": "2019-01-24T04:41:56.493Z", 
                "NumberOfBrokerNodes": 3, 
                "ZookeeperConnectString": "10.0.2.9:2181,10.0.0.4:2181,10.0.3.14:2181", 
                "State": "ACTIVE", 
                "CurrentVersion": "K13V1IB3VIYZZH", 
                "ClusterArn": "arn:aws:kafka:us-east-1:xxxxxxxxx:cluster/mskcluster/8920bb38-c227-4bef-9f6c-f5d6b01d2239-3", 
                "EnhancedMonitoring": "DEFAULT"
            }
        ]
    }
  • Obtain the IP addresses of the Kafka broker nodes by using the ClusterArn.
    $aws kafka get-bootstrap-brokers --region us-east-1 --cluster-arn arn:aws:kafka:us-east-1:xxxxxxxxxxxx:cluster/mskcluster/8920bb38-c227-4bef-9f6c-f5d6b01d2239-3
    {
        "BootstrapBrokerString": "10.0.3.6:9092,10.0.2.10:9092,10.0.0.5:9092"
    }
  • Create a Kafka topic. The solution in this post uses the same name as table name for Kafka topic.
    ./kafka-topics.sh --create --zookeeper 10.0.2.9:2181,10.0.0.4:2181,10.0.3.14:2181 --replication-factor 3 --partitions 1 --topic STOCK_TRADES
  1. Provision an EMR cluster with Flink. Create an EMR cluster 5.25 with Flink 1.8.0 (advanced option of the EMR cluster), and enable SSH access to the master node. Create and attach a role to the EMR master node so that Flink consumers can access the Kafka topic in the MSK cluster.
  2. Configure the Oracle GoldenGate big data adapter for Kafka on the GoldenGate hub server. Download and install the Oracle GoldenGate big data adapter (12.3.0.1.0) using the Oracle GoldenGate download link. For more information, see the Oracle GoldenGate 12c (12.3.0.1) installation documentation. The following is the GoldenGate producer property file for Kafka (custom_kafka_producer.properties):
    #Bootstrap broker string obtained from Step 3
    bootstrap.servers= 10.0.3.6:9092,10.0.2.10:9092,10.0.0.5:9092
    #bootstrap.servers=localhost:9092
    acks=1
    reconnect.backoff.ms=1000
    value.serializer=org.apache.kafka.common.serialization.ByteArraySerializer
    key.serializer=org.apache.kafka.common.serialization.ByteArraySerializer
    # 100KB per partition
    batch.size=16384
    linger.ms=0

    The following is the GoldenGate properties file for Kafka (Kafka.props):

    gg.handlerlist = kafkahandler
    gg.handler.kafkahandler.type=kafka
    gg.handler.kafkahandler.KafkaProducerConfigFile=custom_kafka_producer.properties
    #The following resolves the topic name using the short table name
    #gg.handler.kafkahandler.topicName=SOURCE
    gg.handler.kafkahandler.topicMappingTemplate=${tableName}
    #The following selects the message key using the concatenated primary keys
    gg.handler.kafkahandler.keyMappingTemplate=${primaryKeys}
    gg.handler.kafkahandler.format=json_row
    #gg.handler.kafkahandler.format=delimitedtext
    #gg.handler.kafkahandler.SchemaTopicName=mySchemaTopic
    #gg.handler.kafkahandler.SchemaTopicName=oratopic
    gg.handler.kafkahandler.BlockingSend =false
    gg.handler.kafkahandler.includeTokens=false
    gg.handler.kafkahandler.mode=op
    goldengate.userexit.writers=javawriter
    javawriter.stats.display=TRUE
    javawriter.stats.full=TRUE
    
    gg.log=log4j
    #gg.log.level=INFO
    gg.log.level=DEBUG
    gg.report.time=30sec
    gg.classpath=dirprm/:/home/oracle/kafka/kafka_2.11-1.1.1/libs/*
    
    javawriter.bootoptions=-Xmx512m -Xms32m -Djava.class.path=ggjava/ggjava.jar

    The following is the GoldenGate replicat parameter file (rkafka.prm):

    REPLICAT rkafka
    -- Trail file for this example is located in "AdapterExamples/trail" directory
    -- Command to add REPLICAT
    -- add replicat rkafka, exttrail AdapterExamples/trail/tr
    TARGETDB LIBFILE libggjava.so SET property=dirprm/kafka.props
    REPORTCOUNT EVERY 1 MINUTES, RATE
    GROUPTRANSOPS 10000
    MAP SOURCE.STOCK_TRADES, TARGET SOURCE.STOCK_TRADES;
  3. Create an S3 bucket and directory with a table name underneath for Flink to store (sink) Oracle CDC data.
  4. Configure a Flink consumer that reads from the Kafka topic and writes the CDC data to an S3 bucket. For instructions on setting up a Flink project using the Maven archetype, see Flink Project Build Setup. The following code example is the pom.xml file used with the Maven project. For more information, see Getting Started with Maven.
    
    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
      <modelVersion>4.0.0</modelVersion>

      <groupId>org.apache.flink</groupId>
      <artifactId>flink-quickstart-java</artifactId>
      <version>1.8.0</version>
      <packaging>jar</packaging>

      <name>flink-quickstart-java</name>
      <url>http://www.example.com</url>

      <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <slf4j.version>@slf4j.version@</slf4j.version>
        <log4j.version>@log4j.version@</log4j.version>
        <java.version>1.8</java.version>
        <maven.compiler.source>1.7</maven.compiler.source>
        <maven.compiler.target>1.7</maven.compiler.target>
      </properties>

      <dependencies>
        <dependency>
          <groupId>org.apache.flink</groupId>
          <artifactId>flink-java</artifactId>
          <version>1.8.0</version>
          <scope>compile</scope>
        </dependency>
        <dependency>
          <groupId>org.apache.flink</groupId>
          <artifactId>flink-hadoop-compatibility_2.11</artifactId>
          <version>1.8.0</version>
        </dependency>
        <dependency>
          <groupId>org.apache.flink</groupId>
          <artifactId>flink-connector-filesystem_2.11</artifactId>
          <version>1.8.0</version>
        </dependency>
        <dependency>
          <groupId>org.apache.flink</groupId>
          <artifactId>flink-streaming-java_2.11</artifactId>
          <version>1.8.0</version>
          <scope>compile</scope>
        </dependency>
        <dependency>
          <groupId>org.apache.flink</groupId>
          <artifactId>flink-s3-fs-presto</artifactId>
          <version>1.8.0</version>
        </dependency>
        <dependency>
          <groupId>org.apache.flink</groupId>
          <artifactId>flink-connector-kafka_2.11</artifactId>
          <version>1.8.0</version>
        </dependency>
        <dependency>
          <groupId>org.apache.flink</groupId>
          <artifactId>flink-clients_2.11</artifactId>
          <version>1.8.0</version>
        </dependency>
        <dependency>
          <groupId>org.apache.flink</groupId>
          <artifactId>flink-scala_2.11</artifactId>
          <version>1.8.0</version>
        </dependency>
        <dependency>
          <groupId>org.apache.flink</groupId>
          <artifactId>flink-streaming-scala_2.11</artifactId>
          <version>1.8.0</version>
        </dependency>
        <dependency>
          <groupId>com.typesafe.akka</groupId>
          <artifactId>akka-actor_2.11</artifactId>
          <version>2.4.20</version>
        </dependency>
        <dependency>
          <groupId>com.typesafe.akka</groupId>
          <artifactId>akka-protobuf_2.11</artifactId>
          <version>2.4.20</version>
        </dependency>
      </dependencies>

      <build>
        <plugins>
          <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>3.2.1</version>
            <executions>
              <execution>
                <phase>package</phase>
                <goals>
                  <goal>shade</goal>
                </goals>
                <configuration>
                  <artifactSet>
                    <excludes>
                      <exclude>org.apache.flink:*</exclude>
                    </excludes>
                  </artifactSet>
                  <transformers>
                    <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                      <mainClass>flinkconsumer.flinkconsumer</mainClass>
                    </transformer>
                    <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                      <resource>reference.conf</resource>
                    </transformer>
                  </transformers>
                  <relocations>
                    <relocation>
                      <pattern>org.codehaus.plexus.util</pattern>
                      <shadedPattern>org.shaded.plexus.util</shadedPattern>
                      <excludes>
                        <exclude>org.codehaus.plexus.util.xml.Xpp3Dom</exclude>
                        <exclude>org.codehaus.plexus.util.xml.pull.*</exclude>
                      </excludes>
                    </relocation>
                  </relocations>
                  <createDependencyReducedPom>false</createDependencyReducedPom>
                </configuration>
              </execution>
            </executions>
          </plugin>
          <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-jar-plugin</artifactId>
            <version>2.5</version>
            <configuration>
              <archive>
                <manifest>
                  <mainClass>flinkconsumer.flinkconsumer</mainClass>
                </manifest>
              </archive>
            </configuration>
          </plugin>
          <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.1</version>
            <configuration>
              <source>1.7</source>
              <target>1.7</target>
            </configuration>
          </plugin>
        </plugins>
      </build>

      <profiles>
        <profile>
          <id>build-jar</id>
          <activation>
            <activeByDefault>false</activeByDefault>
          </activation>
        </profile>
      </profiles>
    </project>


    Compile the following Java program using mvn clean install and generate the JAR file:

    package flinkconsumer;
    import org.apache.flink.api.common.typeinfo.TypeInformation;
    import org.apache.flink.api.java.typeutils.TypeExtractor;
    import org.apache.flink.api.java.utils.ParameterTool;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.source.SourceFunction;
    import org.apache.flink.streaming.util.serialization.DeserializationSchema;
    import org.apache.flink.streaming.util.serialization.SerializationSchema;
    import org.apache.flink.streaming.util.serialization.SimpleStringSchema;
    import org.apache.flink.api.common.functions.FlatMapFunction;
    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.streaming.api.windowing.time.Time;
    import org.apache.flink.util.Collector;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
    import org.slf4j.LoggerFactory;
    import org.apache.flink.runtime.state.filesystem.FsStateBackend;
    import akka.actor.ActorSystem;
    import akka.stream.ActorMaterializer;
    import akka.stream.Materializer;
    import com.typesafe.config.Config;
    import org.apache.flink.streaming.connectors.fs.*;
    import org.apache.flink.streaming.api.datastream.*;
    import org.apache.flink.runtime.fs.hdfs.HadoopFileSystem;
    import java.util.stream.Collectors;
    import java.util.Arrays;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Properties;
    import java.util.regex.Pattern;
    import java.io.*;
    import java.net.BindException;
    import java.util.*;
    import java.util.Map.*;
    import java.util.Arrays;
    
    public class flinkconsumer{
    
        public static void main(String[] args) throws Exception {
            // create Streaming execution environment
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.setBufferTimeout(1000);
            env.enableCheckpointing(5000);
            Properties properties = new Properties();
            properties.setProperty("bootstrap.servers", "10.0.3.6:9092,10.0.2.10:9092,10.0.0.5:9092");
            properties.setProperty("group.id", "flink");
            properties.setProperty("client.id", "demo1");
    
        DataStream<String> message = env.addSource(new FlinkKafkaConsumer<>("STOCK_TRADES", new SimpleStringSchema(), properties));
            env.enableCheckpointing(60_00);
            env.setStateBackend(new FsStateBackend("hdfs://ip-10-0-3-12.ec2.internal:8020/flink/checkpoints"));
    
        RollingSink<String> sink = new RollingSink<>("s3://flink-stream-demo/STOCK_TRADES");
        // sink.setBucketer(new DateTimeBucketer("yyyy-MM-dd-HHmm"));
        // The bucket part file size in bytes.
        sink.setBatchSize(400);
        message.map(new MapFunction<String, String>() {
                private static final long serialVersionUID = -6867736771747690202L;
                @Override
                public String map(String value) throws Exception {
                    //return " Value: " + value;
                    return value;
                }
            }).addSink(sink).setParallelism(1);
            env.execute();
        }
    }

    Log in as a Hadoop user to an EMR master node, start Flink, and execute the JAR file:

    $ /usr/bin/flink run ./flink-quickstart-java-1.7.0.jar

  5. Create the stock_trades table from the Athena console. Each JSON document must be on a new line.
    CREATE EXTERNAL TABLE `stock_trades`(
      `trade_id` string COMMENT 'from deserializer', 
      `ticker_symbol` string COMMENT 'from deserializer', 
      `units` int COMMENT 'from deserializer', 
      `unit_price` float COMMENT 'from deserializer', 
      `trade_date` timestamp COMMENT 'from deserializer', 
      `op_type` string COMMENT 'from deserializer')
    ROW FORMAT SERDE 
      'org.openx.data.jsonserde.JsonSerDe' 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.mapred.TextInputFormat' 
    OUTPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
    LOCATION
      's3://flink-cdc-demo/STOCK_TRADES'
    TBLPROPERTIES (
      'has_encrypted_data'='false', 
      'transient_lastDdlTime'='1561051196')

    For more information, see Hive JSON SerDe.

Testing the solution

To test that the solution works, complete the following steps:

  1. Log in to the source RDS instance from the GoldenGate hub server and perform insert, update, and delete operations on the stock_trades table:
    $sqlplus source@prod
    SQL> insert into stock_trades values(6,'NEW',29,75,sysdate);
    SQL> update stock_trades set units=999 where trade_id=6;
    SQL> insert into stock_trades values(7,'TEST',30,80,SYSDATE);
    SQL>insert into stock_trades values (8,'XYZC', 20, 1800,sysdate);
    SQL> commit;
  2. Monitor the GoldenGate capture from the source database using the following stats command:
    [oracle@ip-10-0-1-170 12.3.0]$ pwd
    /u02/app/oracle/product/ogg/12.3.0
    [oracle@ip-10-0-1-170 12.3.0]$ ./ggsci
    
    Oracle GoldenGate Command Interpreter for Oracle
    Version 12.3.0.1.4 OGGCORE_12.3.0.1.0_PLATFORMS_180415.0359_FBO
    Linux, x64, 64bit (optimized), Oracle 12c on Apr 16 2018 00:53:30
    Operating system character set identified as UTF-8.
    
    Copyright (C) 1995, 2018, Oracle and/or its affiliates. All rights reserved.
    
    
    
    GGSCI (ip-10-0-1-170.ec2.internal) 1> stats eora2msk
  3. Monitor the GoldenGate replicat to a Kafka topic with the following:
    [oracle@ip-10-0-1-170 12.3.0]$ pwd
    /u03/app/oracle/product/ogg/bdata/12.3.0
    [oracle@ip-10-0-1-170 12.3.0]$ ./ggsci
    
    Oracle GoldenGate for Big Data
    Version 12.3.2.1.1 (Build 005)
    
    Oracle GoldenGate Command Interpreter
    Version 12.3.0.1.2 OGGCORE_OGGADP.12.3.0.1.2_PLATFORMS_180712.2305
    Linux, x64, 64bit (optimized), Generic on Jul 13 2018 00:46:09
    Operating system character set identified as UTF-8.
    
    Copyright (C) 1995, 2018, Oracle and/or its affiliates. All rights reserved.
    
    
    
    GGSCI (ip-10-0-1-170.ec2.internal) 1> stats rkafka
  4. Query the stock_trades table using the Athena console, for example:
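    The following is only a sketch; the column names come from the Athena DDL above, and trade_id 6 comes from the test inserts in step 1.

    SELECT trade_id, ticker_symbol, units, unit_price, op_type
    FROM stock_trades
    WHERE trade_id = 6
    ORDER BY trade_date;

    Both the original INSERT and the subsequent UPDATE of trade_id 6 should appear as separate rows, each tagged with its op_type.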

Summary

This post illustrates how you can offload reporting activity to Athena with S3 to reduce reporting costs and improve OLTP performance on the source database. This post serves as a guide for setting up a solution in the staging environment.

Deploying this solution in a production environment may require additional considerations, for example, high availability of GoldenGate hub servers, different file encoding formats for optimal query performance, and security considerations. Additionally, you can achieve similar outcomes using technologies like AWS Database Migration Service instead of GoldenGate for database CDC and Kafka Connect for the S3 sink.


About the Authors

Sreekanth Krishnavajjala is a solutions architect at Amazon Web Services.

Vinod Kataria is a senior partner solutions architect at Amazon Web Services.

Troubleshooting Row Size Too Large Errors with InnoDB


Feed: MariaDB Knowledge Base Article Feed.

With InnoDB, it is somewhat common to see the following message as an error or warning:

ERROR 1118 (42000): Row size too large (> 8126). Changing some columns to 
TEXT or BLOB may help. In current row format, BLOB prefix of 0 bytes is stored inline.

This message is raised in the following cases:

  • If the innodb_strict_mode system variable is set to ON, then InnoDB will raise an error with this message if a user or application executes a DDL statement that attempts to create or alter a table and that table’s definition would allow rows that the table’s InnoDB row format can’t actually store.
  • If the innodb_strict_mode system variable is set to OFF, then InnoDB will raise a warning with this message during the DDL statement instead. In this case, if a user or application actually tries to write a row that the table’s InnoDB row format can’t store, then InnoDB will raise an error with this message during the DML statement that attempts to write the row.

The root cause of the problem is that InnoDB has a maximum row size that is roughly equivalent to half of the value of the innodb_page_size system variable. See InnoDB Row Formats Overview: Maximum Row Size for more information.

InnoDB's row formats work around this problem by storing certain kinds of variable-length columns on overflow pages. However, different row formats can store different types of data on overflow pages, and some row formats can store more data on overflow pages than others. For example, the DYNAMIC and COMPRESSED row formats can store the most data on overflow pages. To see how each InnoDB row format uses overflow pages, see the Knowledge Base page for that row format.
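As a quick way to see which row format an existing table currently uses (a minimal sketch; the schema name db1 is hypothetical):

SELECT TABLE_NAME, ROW_FORMAT
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'db1'
  AND ENGINE = 'InnoDB';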

Checking Existing Tables for the Problem

InnoDB does not currently have an easy way to check which existing tables have this problem. See MDEV-20400 for more information.

Solving the Problem

If you encounter this error, then there are some solutions available.

Disabling InnoDB Strict Mode

The unsafe option is to disable InnoDB strict mode.

InnoDB strict mode can be disabled by setting the innodb_strict_mode system variable to OFF.

For example, even though the following table schema is too large for most InnoDB row formats to store, it can still be created when InnoDB strict mode is disabled:

SET GLOBAL innodb_default_row_format='dynamic';
SET SESSION innodb_strict_mode=OFF;
CREATE OR REPLACE TABLE tab (
   col1 varchar(40) NOT NULL,
   col2 varchar(40) NOT NULL,
   col3 varchar(40) NOT NULL,
   col4 varchar(40) NOT NULL,
   col5 varchar(40) NOT NULL,
   col6 varchar(40) NOT NULL,
   col7 varchar(40) NOT NULL,
   col8 varchar(40) NOT NULL,
   col9 varchar(40) NOT NULL,
   col10 varchar(40) NOT NULL,
   col11 varchar(40) NOT NULL,
   col12 varchar(40) NOT NULL,
   col13 varchar(40) NOT NULL,
   col14 varchar(40) NOT NULL,
   col15 varchar(40) NOT NULL,
   col16 varchar(40) NOT NULL,
   col17 varchar(40) NOT NULL,
   col18 varchar(40) NOT NULL,
   col19 varchar(40) NOT NULL,
   col20 varchar(40) NOT NULL,
   col21 varchar(40) NOT NULL,
   col22 varchar(40) NOT NULL,
   col23 varchar(40) NOT NULL,
   col24 varchar(40) NOT NULL,
   col25 varchar(40) NOT NULL,
   col26 varchar(40) NOT NULL,
   col27 varchar(40) NOT NULL,
   col28 varchar(40) NOT NULL,
   col29 varchar(40) NOT NULL,
   col30 varchar(40) NOT NULL,
   col31 varchar(40) NOT NULL,
   col32 varchar(40) NOT NULL,
   col33 varchar(40) NOT NULL,
   col34 varchar(40) NOT NULL,
   col35 varchar(40) NOT NULL,
   col36 varchar(40) NOT NULL,
   col37 varchar(40) NOT NULL,
   col38 varchar(40) NOT NULL,
   col39 varchar(40) NOT NULL,
   col40 varchar(40) NOT NULL,
   col41 varchar(40) NOT NULL,
   col42 varchar(40) NOT NULL,
   col43 varchar(40) NOT NULL,
   col44 varchar(40) NOT NULL,
   col45 varchar(40) NOT NULL,
   col46 varchar(40) NOT NULL,
   col47 varchar(40) NOT NULL,
   col48 varchar(40) NOT NULL,
   col49 varchar(40) NOT NULL,
   col50 varchar(40) NOT NULL,
   col51 varchar(40) NOT NULL,
   col52 varchar(40) NOT NULL,
   col53 varchar(40) NOT NULL,
   col54 varchar(40) NOT NULL,
   col55 varchar(40) NOT NULL,
   col56 varchar(40) NOT NULL,
   col57 varchar(40) NOT NULL,
   col58 varchar(40) NOT NULL,
   col59 varchar(40) NOT NULL,
   col60 varchar(40) NOT NULL,
   col61 varchar(40) NOT NULL,
   col62 varchar(40) NOT NULL,
   col63 varchar(40) NOT NULL,
   col64 varchar(40) NOT NULL,
   col65 varchar(40) NOT NULL,
   col66 varchar(40) NOT NULL,
   col67 varchar(40) NOT NULL,
   col68 varchar(40) NOT NULL,
   col69 varchar(40) NOT NULL,
   col70 varchar(40) NOT NULL,
   col71 varchar(40) NOT NULL,
   col72 varchar(40) NOT NULL,
   col73 varchar(40) NOT NULL,
   col74 varchar(40) NOT NULL,
   col75 varchar(40) NOT NULL,
   col76 varchar(40) NOT NULL,
   col77 varchar(40) NOT NULL,
   col78 varchar(40) NOT NULL,
   col79 varchar(40) NOT NULL,
   col80 varchar(40) NOT NULL,
   col81 varchar(40) NOT NULL,
   col82 varchar(40) NOT NULL,
   col83 varchar(40) NOT NULL,
   col84 varchar(40) NOT NULL,
   col85 varchar(40) NOT NULL,
   col86 varchar(40) NOT NULL,
   col87 varchar(40) NOT NULL,
   col88 varchar(40) NOT NULL,
   col89 varchar(40) NOT NULL,
   col90 varchar(40) NOT NULL,
   col91 varchar(40) NOT NULL,
   col92 varchar(40) NOT NULL,
   col93 varchar(40) NOT NULL,
   col94 varchar(40) NOT NULL,
   col95 varchar(40) NOT NULL,
   col96 varchar(40) NOT NULL,
   col97 varchar(40) NOT NULL,
   col98 varchar(40) NOT NULL,
   col99 varchar(40) NOT NULL,
   col100 varchar(40) NOT NULL,
   col101 varchar(40) NOT NULL,
   col102 varchar(40) NOT NULL,
   col103 varchar(40) NOT NULL,
   col104 varchar(40) NOT NULL,
   col105 varchar(40) NOT NULL,
   col106 varchar(40) NOT NULL,
   col107 varchar(40) NOT NULL,
   col108 varchar(40) NOT NULL,
   col109 varchar(40) NOT NULL,
   col110 varchar(40) NOT NULL,
   col111 varchar(40) NOT NULL,
   col112 varchar(40) NOT NULL,
   col113 varchar(40) NOT NULL,
   col114 varchar(40) NOT NULL,
   col115 varchar(40) NOT NULL,
   col116 varchar(40) NOT NULL,
   col117 varchar(40) NOT NULL,
   col118 varchar(40) NOT NULL,
   col119 varchar(40) NOT NULL,
   col120 varchar(40) NOT NULL,
   col121 varchar(40) NOT NULL,
   col122 varchar(40) NOT NULL,
   col123 varchar(40) NOT NULL,
   col124 varchar(40) NOT NULL,
   col125 varchar(40) NOT NULL,
   col126 varchar(40) NOT NULL,
   col127 varchar(40) NOT NULL,
   col128 varchar(40) NOT NULL,
   col129 varchar(40) NOT NULL,
   col130 varchar(40) NOT NULL,
   col131 varchar(40) NOT NULL,
   col132 varchar(40) NOT NULL,
   col133 varchar(40) NOT NULL,
   col134 varchar(40) NOT NULL,
   col135 varchar(40) NOT NULL,
   col136 varchar(40) NOT NULL,
   col137 varchar(40) NOT NULL,
   col138 varchar(40) NOT NULL,
   col139 varchar(40) NOT NULL,
   col140 varchar(40) NOT NULL,
   col141 varchar(40) NOT NULL,
   col142 varchar(40) NOT NULL,
   col143 varchar(40) NOT NULL,
   col144 varchar(40) NOT NULL,
   col145 varchar(40) NOT NULL,
   col146 varchar(40) NOT NULL,
   col147 varchar(40) NOT NULL,
   col148 varchar(40) NOT NULL,
   col149 varchar(40) NOT NULL,
   col150 varchar(40) NOT NULL,
   col151 varchar(40) NOT NULL,
   col152 varchar(40) NOT NULL,
   col153 varchar(40) NOT NULL,
   col154 varchar(40) NOT NULL,
   col155 varchar(40) NOT NULL,
   col156 varchar(40) NOT NULL,
   col157 varchar(40) NOT NULL,
   col158 varchar(40) NOT NULL,
   col159 varchar(40) NOT NULL,
   col160 varchar(40) NOT NULL,
   col161 varchar(40) NOT NULL,
   col162 varchar(40) NOT NULL,
   col163 varchar(40) NOT NULL,
   col164 varchar(40) NOT NULL,
   col165 varchar(40) NOT NULL,
   col166 varchar(40) NOT NULL,
   col167 varchar(40) NOT NULL,
   col168 varchar(40) NOT NULL,
   col169 varchar(40) NOT NULL,
   col170 varchar(40) NOT NULL,
   col171 varchar(40) NOT NULL,
   col172 varchar(40) NOT NULL,
   col173 varchar(40) NOT NULL,
   col174 varchar(40) NOT NULL,
   col175 varchar(40) NOT NULL,
   col176 varchar(40) NOT NULL,
   col177 varchar(40) NOT NULL,
   col178 varchar(40) NOT NULL,
   col179 varchar(40) NOT NULL,
   col180 varchar(40) NOT NULL,
   col181 varchar(40) NOT NULL,
   col182 varchar(40) NOT NULL,
   col183 varchar(40) NOT NULL,
   col184 varchar(40) NOT NULL,
   col185 varchar(40) NOT NULL,
   col186 varchar(40) NOT NULL,
   col187 varchar(40) NOT NULL,
   col188 varchar(40) NOT NULL,
   col189 varchar(40) NOT NULL,
   col190 varchar(40) NOT NULL,
   col191 varchar(40) NOT NULL,
   col192 varchar(40) NOT NULL,
   col193 varchar(40) NOT NULL,
   col194 varchar(40) NOT NULL,
   col195 varchar(40) NOT NULL,
   col196 varchar(40) NOT NULL,
   col197 varchar(40) NOT NULL,
   col198 varchar(40) NOT NULL,
   PRIMARY KEY (col1)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

But if you look closely, you’ll see that InnoDB will still raise a warning with the same message:

MariaDB [db1]> SHOW WARNINGS;
+---------+------+----------------------------------------------------------------------------------------------------------------------------------------------+
| Level | Code | Message |
+---------+------+----------------------------------------------------------------------------------------------------------------------------------------------+
| Warning | 139 | Row size too large (> 8126). Changing some columns to TEXT or BLOB may help. In current row format, BLOB prefix of 0 bytes is stored inline. |
+---------+------+----------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.000 sec)

As mentioned above, even though InnoDB allows the table to be created, there is still an opportunity for errors later. If a user or application actually tries to write a row that the InnoDB row format cannot store, then InnoDB raises an error during the DML statement that attempts to write the row. This is somewhat unsafe, because the application can hit a row-size error at write time rather than at table creation time.
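To get a feel for why writes can fail, here is a rough back-of-the-envelope estimate for the utf8mb4 table above (a sketch only; it ignores per-row and per-page overhead). Each varchar(40) column can need up to 40 * 4 = 160 bytes of data plus a 1-byte length prefix, so a fully populated row can approach:

-- Worst-case data bytes for 198 fully populated varchar(40) utf8mb4 columns
SELECT 198 * (40 * 4 + 1) AS worst_case_row_bytes;  -- 31878, far above ~8126

A row only has to cross the ~8126-byte limit at write time for the INSERT or UPDATE to fail, even though the CREATE TABLE itself only produced a warning.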

Converting the Table to the DYNAMIC Row Format

If the table is using either the REDUNDANT or the COMPACT row format, then a safe option is to convert it to use the DYNAMIC row format instead. If your tables were originally created on an older version of MySQL or MariaDB, then they may be using one of these older row formats:

  • In MySQL 4.1 and before, the REDUNDANT row format was the default row format.
  • In MariaDB 10.1 and before, the COMPACT row format was the default row format.

The DYNAMIC row format can store more data on overflow pages than these older row formats, so converting your tables to use this row format can sometimes fix the problem.

For example:

ALTER TABLE tab ROW_FORMAT=DYNAMIC;

This workaround is a safe option, because the DYNAMIC row format may be able to store the data without hitting the row size limit, even with the same table schema. See InnoDB DYNAMIC Row Format: Overflow Pages with the DYNAMIC Row Format for more information.

In MariaDB 10.2 and later, the DYNAMIC row format is the default row format. If your tables were originally created on one of these newer versions, then they may already be using this row format. In that case, you may need to use the next workaround.
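If you are not sure which row format a table is currently using, one way to check is to query information_schema; a minimal sketch, reusing the db1 database and tab table names from the earlier examples:

SELECT TABLE_SCHEMA, TABLE_NAME, ROW_FORMAT
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'db1' AND TABLE_NAME = 'tab';

If this already reports DYNAMIC, converting the row format will not help, and the schema-level workaround in the next section is the one to reach for.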

Fitting More Columns on Overflow Pages

If the table is already using the DYNAMIC row format, then another safe option is to change the table schema, so that the row format can store more columns on overflow pages.

For VARCHAR columns, the DYNAMIC row format can only store these columns on overflow pages if the maximum length of the column is 256 bytes or longer. The original table schema is failing, because all of those small VARCHAR columns have to be stored on the row’s main data page. See InnoDB DYNAMIC Row Format: Overflow Pages with the DYNAMIC Row Format for more information.

Therefore, another safe workaround in many cases is to increase the maximum length of some variable-length columns, so that InnoDB's row format can store them on overflow pages.
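For an existing table, this can be done with ALTER TABLE ... MODIFY. A minimal sketch, reusing the tab table and its col2 column from above (widening a single column is shown only for illustration; in practice you would widen enough columns to bring the on-page portion of the row under the limit, and the length needed to cross the 256-byte threshold depends on the character set, as the examples below show):

-- Widen the column so its maximum length is at least 256 bytes,
-- allowing InnoDB to move its value to an overflow page when needed.
ALTER TABLE tab MODIFY col2 varchar(256) NOT NULL;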

For example, when using InnoDB’s DYNAMIC row format and a default character set of latin1 (which requires up to 1 byte per character), the 256 byte limit means that a VARCHAR column will only be stored on overflow pages if it is at least as large as a varchar(256):

SET GLOBAL innodb_default_row_format='dynamic';
SET SESSION innodb_strict_mode=ON;
CREATE OR REPLACE TABLE tab (
   col1 varchar(256) NOT NULL,
   col2 varchar(256) NOT NULL,
   col3 varchar(256) NOT NULL,
   col4 varchar(256) NOT NULL,
   col5 varchar(256) NOT NULL,
   col6 varchar(256) NOT NULL,
   col7 varchar(256) NOT NULL,
   col8 varchar(256) NOT NULL,
   col9 varchar(256) NOT NULL,
   col10 varchar(256) NOT NULL,
   col11 varchar(256) NOT NULL,
   col12 varchar(256) NOT NULL,
   col13 varchar(256) NOT NULL,
   col14 varchar(256) NOT NULL,
   col15 varchar(256) NOT NULL,
   col16 varchar(256) NOT NULL,
   col17 varchar(256) NOT NULL,
   col18 varchar(256) NOT NULL,
   col19 varchar(256) NOT NULL,
   col20 varchar(256) NOT NULL,
   col21 varchar(256) NOT NULL,
   col22 varchar(256) NOT NULL,
   col23 varchar(256) NOT NULL,
   col24 varchar(256) NOT NULL,
   col25 varchar(256) NOT NULL,
   col26 varchar(256) NOT NULL,
   col27 varchar(256) NOT NULL,
   col28 varchar(256) NOT NULL,
   col29 varchar(256) NOT NULL,
   col30 varchar(256) NOT NULL,
   col31 varchar(256) NOT NULL,
   col32 varchar(256) NOT NULL,
   col33 varchar(256) NOT NULL,
   col34 varchar(256) NOT NULL,
   col35 varchar(256) NOT NULL,
   col36 varchar(256) NOT NULL,
   col37 varchar(256) NOT NULL,
   col38 varchar(256) NOT NULL,
   col39 varchar(256) NOT NULL,
   col40 varchar(256) NOT NULL,
   col41 varchar(256) NOT NULL,
   col42 varchar(256) NOT NULL,
   col43 varchar(256) NOT NULL,
   col44 varchar(256) NOT NULL,
   col45 varchar(256) NOT NULL,
   col46 varchar(256) NOT NULL,
   col47 varchar(256) NOT NULL,
   col48 varchar(256) NOT NULL,
   col49 varchar(256) NOT NULL,
   col50 varchar(256) NOT NULL,
   col51 varchar(256) NOT NULL,
   col52 varchar(256) NOT NULL,
   col53 varchar(256) NOT NULL,
   col54 varchar(256) NOT NULL,
   col55 varchar(256) NOT NULL,
   col56 varchar(256) NOT NULL,
   col57 varchar(256) NOT NULL,
   col58 varchar(256) NOT NULL,
   col59 varchar(256) NOT NULL,
   col60 varchar(256) NOT NULL,
   col61 varchar(256) NOT NULL,
   col62 varchar(256) NOT NULL,
   col63 varchar(256) NOT NULL,
   col64 varchar(256) NOT NULL,
   col65 varchar(256) NOT NULL,
   col66 varchar(256) NOT NULL,
   col67 varchar(256) NOT NULL,
   col68 varchar(256) NOT NULL,
   col69 varchar(256) NOT NULL,
   col70 varchar(256) NOT NULL,
   col71 varchar(256) NOT NULL,
   col72 varchar(256) NOT NULL,
   col73 varchar(256) NOT NULL,
   col74 varchar(256) NOT NULL,
   col75 varchar(256) NOT NULL,
   col76 varchar(256) NOT NULL,
   col77 varchar(256) NOT NULL,
   col78 varchar(256) NOT NULL,
   col79 varchar(256) NOT NULL,
   col80 varchar(256) NOT NULL,
   col81 varchar(256) NOT NULL,
   col82 varchar(256) NOT NULL,
   col83 varchar(256) NOT NULL,
   col84 varchar(256) NOT NULL,
   col85 varchar(256) NOT NULL,
   col86 varchar(256) NOT NULL,
   col87 varchar(256) NOT NULL,
   col88 varchar(256) NOT NULL,
   col89 varchar(256) NOT NULL,
   col90 varchar(256) NOT NULL,
   col91 varchar(256) NOT NULL,
   col92 varchar(256) NOT NULL,
   col93 varchar(256) NOT NULL,
   col94 varchar(256) NOT NULL,
   col95 varchar(256) NOT NULL,
   col96 varchar(256) NOT NULL,
   col97 varchar(256) NOT NULL,
   col98 varchar(256) NOT NULL,
   col99 varchar(256) NOT NULL,
   col100 varchar(256) NOT NULL,
   col101 varchar(256) NOT NULL,
   col102 varchar(256) NOT NULL,
   col103 varchar(256) NOT NULL,
   col104 varchar(256) NOT NULL,
   col105 varchar(256) NOT NULL,
   col106 varchar(256) NOT NULL,
   col107 varchar(256) NOT NULL,
   col108 varchar(256) NOT NULL,
   col109 varchar(256) NOT NULL,
   col110 varchar(256) NOT NULL,
   col111 varchar(256) NOT NULL,
   col112 varchar(256) NOT NULL,
   col113 varchar(256) NOT NULL,
   col114 varchar(256) NOT NULL,
   col115 varchar(256) NOT NULL,
   col116 varchar(256) NOT NULL,
   col117 varchar(256) NOT NULL,
   col118 varchar(256) NOT NULL,
   col119 varchar(256) NOT NULL,
   col120 varchar(256) NOT NULL,
   col121 varchar(256) NOT NULL,
   col122 varchar(256) NOT NULL,
   col123 varchar(256) NOT NULL,
   col124 varchar(256) NOT NULL,
   col125 varchar(256) NOT NULL,
   col126 varchar(256) NOT NULL,
   col127 varchar(256) NOT NULL,
   col128 varchar(256) NOT NULL,
   col129 varchar(256) NOT NULL,
   col130 varchar(256) NOT NULL,
   col131 varchar(256) NOT NULL,
   col132 varchar(256) NOT NULL,
   col133 varchar(256) NOT NULL,
   col134 varchar(256) NOT NULL,
   col135 varchar(256) NOT NULL,
   col136 varchar(256) NOT NULL,
   col137 varchar(256) NOT NULL,
   col138 varchar(256) NOT NULL,
   col139 varchar(256) NOT NULL,
   col140 varchar(256) NOT NULL,
   col141 varchar(256) NOT NULL,
   col142 varchar(256) NOT NULL,
   col143 varchar(256) NOT NULL,
   col144 varchar(256) NOT NULL,
   col145 varchar(256) NOT NULL,
   col146 varchar(256) NOT NULL,
   col147 varchar(256) NOT NULL,
   col148 varchar(256) NOT NULL,
   col149 varchar(256) NOT NULL,
   col150 varchar(256) NOT NULL,
   col151 varchar(256) NOT NULL,
   col152 varchar(256) NOT NULL,
   col153 varchar(256) NOT NULL,
   col154 varchar(256) NOT NULL,
   col155 varchar(256) NOT NULL,
   col156 varchar(256) NOT NULL,
   col157 varchar(256) NOT NULL,
   col158 varchar(256) NOT NULL,
   col159 varchar(256) NOT NULL,
   col160 varchar(256) NOT NULL,
   col161 varchar(256) NOT NULL,
   col162 varchar(256) NOT NULL,
   col163 varchar(256) NOT NULL,
   col164 varchar(256) NOT NULL,
   col165 varchar(256) NOT NULL,
   col166 varchar(256) NOT NULL,
   col167 varchar(256) NOT NULL,
   col168 varchar(256) NOT NULL,
   col169 varchar(256) NOT NULL,
   col170 varchar(256) NOT NULL,
   col171 varchar(256) NOT NULL,
   col172 varchar(256) NOT NULL,
   col173 varchar(256) NOT NULL,
   col174 varchar(256) NOT NULL,
   col175 varchar(256) NOT NULL,
   col176 varchar(256) NOT NULL,
   col177 varchar(256) NOT NULL,
   col178 varchar(256) NOT NULL,
   col179 varchar(256) NOT NULL,
   col180 varchar(256) NOT NULL,
   col181 varchar(256) NOT NULL,
   col182 varchar(256) NOT NULL,
   col183 varchar(256) NOT NULL,
   col184 varchar(256) NOT NULL,
   col185 varchar(256) NOT NULL,
   col186 varchar(256) NOT NULL,
   col187 varchar(256) NOT NULL,
   col188 varchar(256) NOT NULL,
   col189 varchar(256) NOT NULL,
   col190 varchar(256) NOT NULL,
   col191 varchar(256) NOT NULL,
   col192 varchar(256) NOT NULL,
   col193 varchar(256) NOT NULL,
   col194 varchar(256) NOT NULL,
   col195 varchar(256) NOT NULL,
   col196 varchar(256) NOT NULL,
   col197 varchar(256) NOT NULL,
   col198 varchar(256) NOT NULL,
   PRIMARY KEY (col1)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

And when using InnoDB’s DYNAMIC row format and a default character set of utf8 (which requires up to 3 bytes per character), the 256 byte limit means that a VARCHAR column will only be stored on overflow pages if it is at least as large as a varchar(86):

SET GLOBAL innodb_default_row_format='dynamic';
SET SESSION innodb_strict_mode=ON;
CREATE OR REPLACE TABLE tab (
   col1 varchar(86) NOT NULL,
   col2 varchar(86) NOT NULL,
   col3 varchar(86) NOT NULL,
   col4 varchar(86) NOT NULL,
   col5 varchar(86) NOT NULL,
   col6 varchar(86) NOT NULL,
   col7 varchar(86) NOT NULL,
   col8 varchar(86) NOT NULL,
   col9 varchar(86) NOT NULL,
   col10 varchar(86) NOT NULL,
   col11 varchar(86) NOT NULL,
   col12 varchar(86) NOT NULL,
   col13 varchar(86) NOT NULL,
   col14 varchar(86) NOT NULL,
   col15 varchar(86) NOT NULL,
   col16 varchar(86) NOT NULL,
   col17 varchar(86) NOT NULL,
   col18 varchar(86) NOT NULL,
   col19 varchar(86) NOT NULL,
   col20 varchar(86) NOT NULL,
   col21 varchar(86) NOT NULL,
   col22 varchar(86) NOT NULL,
   col23 varchar(86) NOT NULL,
   col24 varchar(86) NOT NULL,
   col25 varchar(86) NOT NULL,
   col26 varchar(86) NOT NULL,
   col27 varchar(86) NOT NULL,
   col28 varchar(86) NOT NULL,
   col29 varchar(86) NOT NULL,
   col30 varchar(86) NOT NULL,
   col31 varchar(86) NOT NULL,
   col32 varchar(86) NOT NULL,
   col33 varchar(86) NOT NULL,
   col34 varchar(86) NOT NULL,
   col35 varchar(86) NOT NULL,
   col36 varchar(86) NOT NULL,
   col37 varchar(86) NOT NULL,
   col38 varchar(86) NOT NULL,
   col39 varchar(86) NOT NULL,
   col40 varchar(86) NOT NULL,
   col41 varchar(86) NOT NULL,
   col42 varchar(86) NOT NULL,
   col43 varchar(86) NOT NULL,
   col44 varchar(86) NOT NULL,
   col45 varchar(86) NOT NULL,
   col46 varchar(86) NOT NULL,
   col47 varchar(86) NOT NULL,
   col48 varchar(86) NOT NULL,
   col49 varchar(86) NOT NULL,
   col50 varchar(86) NOT NULL,
   col51 varchar(86) NOT NULL,
   col52 varchar(86) NOT NULL,
   col53 varchar(86) NOT NULL,
   col54 varchar(86) NOT NULL,
   col55 varchar(86) NOT NULL,
   col56 varchar(86) NOT NULL,
   col57 varchar(86) NOT NULL,
   col58 varchar(86) NOT NULL,
   col59 varchar(86) NOT NULL,
   col60 varchar(86) NOT NULL,
   col61 varchar(86) NOT NULL,
   col62 varchar(86) NOT NULL,
   col63 varchar(86) NOT NULL,
   col64 varchar(86) NOT NULL,
   col65 varchar(86) NOT NULL,
   col66 varchar(86) NOT NULL,
   col67 varchar(86) NOT NULL,
   col68 varchar(86) NOT NULL,
   col69 varchar(86) NOT NULL,
   col70 varchar(86) NOT NULL,
   col71 varchar(86) NOT NULL,
   col72 varchar(86) NOT NULL,
   col73 varchar(86) NOT NULL,
   col74 varchar(86) NOT NULL,
   col75 varchar(86) NOT NULL,
   col76 varchar(86) NOT NULL,
   col77 varchar(86) NOT NULL,
   col78 varchar(86) NOT NULL,
   col79 varchar(86) NOT NULL,
   col80 varchar(86) NOT NULL,
   col81 varchar(86) NOT NULL,
   col82 varchar(86) NOT NULL,
   col83 varchar(86) NOT NULL,
   col84 varchar(86) NOT NULL,
   col85 varchar(86) NOT NULL,
   col86 varchar(86) NOT NULL,
   col87 varchar(86) NOT NULL,
   col88 varchar(86) NOT NULL,
   col89 varchar(86) NOT NULL,
   col90 varchar(86) NOT NULL,
   col91 varchar(86) NOT NULL,
   col92 varchar(86) NOT NULL,
   col93 varchar(86) NOT NULL,
   col94 varchar(86) NOT NULL,
   col95 varchar(86) NOT NULL,
   col96 varchar(86) NOT NULL,
   col97 varchar(86) NOT NULL,
   col98 varchar(86) NOT NULL,
   col99 varchar(86) NOT NULL,
   col100 varchar(86) NOT NULL,
   col101 varchar(86) NOT NULL,
   col102 varchar(86) NOT NULL,
   col103 varchar(86) NOT NULL,
   col104 varchar(86) NOT NULL,
   col105 varchar(86) NOT NULL,
   col106 varchar(86) NOT NULL,
   col107 varchar(86) NOT NULL,
   col108 varchar(86) NOT NULL,
   col109 varchar(86) NOT NULL,
   col110 varchar(86) NOT NULL,
   col111 varchar(86) NOT NULL,
   col112 varchar(86) NOT NULL,
   col113 varchar(86) NOT NULL,
   col114 varchar(86) NOT NULL,
   col115 varchar(86) NOT NULL,
   col116 varchar(86) NOT NULL,
   col117 varchar(86) NOT NULL,
   col118 varchar(86) NOT NULL,
   col119 varchar(86) NOT NULL,
   col120 varchar(86) NOT NULL,
   col121 varchar(86) NOT NULL,
   col122 varchar(86) NOT NULL,
   col123 varchar(86) NOT NULL,
   col124 varchar(86) NOT NULL,
   col125 varchar(86) NOT NULL,
   col126 varchar(86) NOT NULL,
   col127 varchar(86) NOT NULL,
   col128 varchar(86) NOT NULL,
   col129 varchar(86) NOT NULL,
   col130 varchar(86) NOT NULL,
   col131 varchar(86) NOT NULL,
   col132 varchar(86) NOT NULL,
   col133 varchar(86) NOT NULL,
   col134 varchar(86) NOT NULL,
   col135 varchar(86) NOT NULL,
   col136 varchar(86) NOT NULL,
   col137 varchar(86) NOT NULL,
   col138 varchar(86) NOT NULL,
   col139 varchar(86) NOT NULL,
   col140 varchar(86) NOT NULL,
   col141 varchar(86) NOT NULL,
   col142 varchar(86) NOT NULL,
   col143 varchar(86) NOT NULL,
   col144 varchar(86) NOT NULL,
   col145 varchar(86) NOT NULL,
   col146 varchar(86) NOT NULL,
   col147 varchar(86) NOT NULL,
   col148 varchar(86) NOT NULL,
   col149 varchar(86) NOT NULL,
   col150 varchar(86) NOT NULL,
   col151 varchar(86) NOT NULL,
   col152 varchar(86) NOT NULL,
   col153 varchar(86) NOT NULL,
   col154 varchar(86) NOT NULL,
   col155 varchar(86) NOT NULL,
   col156 varchar(86) NOT NULL,
   col157 varchar(86) NOT NULL,
   col158 varchar(86) NOT NULL,
   col159 varchar(86) NOT NULL,
   col160 varchar(86) NOT NULL,
   col161 varchar(86) NOT NULL,
   col162 varchar(86) NOT NULL,
   col163 varchar(86) NOT NULL,
   col164 varchar(86) NOT NULL,
   col165 varchar(86) NOT NULL,
   col166 varchar(86) NOT NULL,
   col167 varchar(86) NOT NULL,
   col168 varchar(86) NOT NULL,
   col169 varchar(86) NOT NULL,
   col170 varchar(86) NOT NULL,
   col171 varchar(86) NOT NULL,
   col172 varchar(86) NOT NULL,
   col173 varchar(86) NOT NULL,
   col174 varchar(86) NOT NULL,
   col175 varchar(86) NOT NULL,
   col176 varchar(86) NOT NULL,
   col177 varchar(86) NOT NULL,
   col178 varchar(86) NOT NULL,
   col179 varchar(86) NOT NULL,
   col180 varchar(86) NOT NULL,
   col181 varchar(86) NOT NULL,
   col182 varchar(86) NOT NULL,
   col183 varchar(86) NOT NULL,
   col184 varchar(86) NOT NULL,
   col185 varchar(86) NOT NULL,
   col186 varchar(86) NOT NULL,
   col187 varchar(86) NOT NULL,
   col188 varchar(86) NOT NULL,
   col189 varchar(86) NOT NULL,
   col190 varchar(86) NOT NULL,
   col191 varchar(86) NOT NULL,
   col192 varchar(86) NOT NULL,
   col193 varchar(86) NOT NULL,
   col194 varchar(86) NOT NULL,
   col195 varchar(86) NOT NULL,
   col196 varchar(86) NOT NULL,
   col197 varchar(86) NOT NULL,
   col198 varchar(86) NOT NULL,
   PRIMARY KEY (col1)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

And when using InnoDB’s DYNAMIC row format and a default character set of utf8mb4 (which requires up to 4 bytes per character), the 256 byte limit means that a VARCHAR column will only be stored on overflow pages if it is at least as large as a varchar(64):

SET GLOBAL innodb_default_row_format='dynamic';
SET SESSION innodb_strict_mode=ON;
CREATE OR REPLACE TABLE tab (
   col1 varchar(64) NOT NULL,
   col2 varchar(64) NOT NULL,
   col3 varchar(64) NOT NULL,
   col4 varchar(64) NOT NULL,
   col5 varchar(64) NOT NULL,
   col6 varchar(64) NOT NULL,
   col7 varchar(64) NOT NULL,
   col8 varchar(64) NOT NULL,
   col9 varchar(64) NOT NULL,
   col10 varchar(64) NOT NULL,
   col11 varchar(64) NOT NULL,
   col12 varchar(64) NOT NULL,
   col13 varchar(64) NOT NULL,
   col14 varchar(64) NOT NULL,
   col15 varchar(64) NOT NULL,
   col16 varchar(64) NOT NULL,
   col17 varchar(64) NOT NULL,
   col18 varchar(64) NOT NULL,
   col19 varchar(64) NOT NULL,
   col20 varchar(64) NOT NULL,
   col21 varchar(64) NOT NULL,
   col22 varchar(64) NOT NULL,
   col23 varchar(64) NOT NULL,
   col24 varchar(64) NOT NULL,
   col25 varchar(64) NOT NULL,
   col26 varchar(64) NOT NULL,
   col27 varchar(64) NOT NULL,
   col28 varchar(64) NOT NULL,
   col29 varchar(64) NOT NULL,
   col30 varchar(64) NOT NULL,
   col31 varchar(64) NOT NULL,
   col32 varchar(64) NOT NULL,
   col33 varchar(64) NOT NULL,
   col34 varchar(64) NOT NULL,
   col35 varchar(64) NOT NULL,
   col36 varchar(64) NOT NULL,
   col37 varchar(64) NOT NULL,
   col38 varchar(64) NOT NULL,
   col39 varchar(64) NOT NULL,
   col40 varchar(64) NOT NULL,
   col41 varchar(64) NOT NULL,
   col42 varchar(64) NOT NULL,
   col43 varchar(64) NOT NULL,
   col44 varchar(64) NOT NULL,
   col45 varchar(64) NOT NULL,
   col46 varchar(64) NOT NULL,
   col47 varchar(64) NOT NULL,
   col48 varchar(64) NOT NULL,
   col49 varchar(64) NOT NULL,
   col50 varchar(64) NOT NULL,
   col51 varchar(64) NOT NULL,
   col52 varchar(64) NOT NULL,
   col53 varchar(64) NOT NULL,
   col54 varchar(64) NOT NULL,
   col55 varchar(64) NOT NULL,
   col56 varchar(64) NOT NULL,
   col57 varchar(64) NOT NULL,
   col58 varchar(64) NOT NULL,
   col59 varchar(64) NOT NULL,
   col60 varchar(64) NOT NULL,
   col61 varchar(64) NOT NULL,
   col62 varchar(64) NOT NULL,
   col63 varchar(64) NOT NULL,
   col64 varchar(64) NOT NULL,
   col65 varchar(64) NOT NULL,
   col66 varchar(64) NOT NULL,
   col67 varchar(64) NOT NULL,
   col68 varchar(64) NOT NULL,
   col69 varchar(64) NOT NULL,
   col70 varchar(64) NOT NULL,
   col71 varchar(64) NOT NULL,
   col72 varchar(64) NOT NULL,
   col73 varchar(64) NOT NULL,
   col74 varchar(64) NOT NULL,
   col75 varchar(64) NOT NULL,
   col76 varchar(64) NOT NULL,
   col77 varchar(64) NOT NULL,
   col78 varchar(64) NOT NULL,
   col79 varchar(64) NOT NULL,
   col80 varchar(64) NOT NULL,
   col81 varchar(64) NOT NULL,
   col82 varchar(64) NOT NULL,
   col83 varchar(64) NOT NULL,
   col84 varchar(64) NOT NULL,
   col85 varchar(64) NOT NULL,
   col86 varchar(64) NOT NULL,
   col87 varchar(64) NOT NULL,
   col88 varchar(64) NOT NULL,
   col89 varchar(64) NOT NULL,
   col90 varchar(64) NOT NULL,
   col91 varchar(64) NOT NULL,
   col92 varchar(64) NOT NULL,
   col93 varchar(64) NOT NULL,
   col94 varchar(64) NOT NULL,
   col95 varchar(64) NOT NULL,
   col96 varchar(64) NOT NULL,
   col97 varchar(64) NOT NULL,
   col98 varchar(64) NOT NULL,
   col99 varchar(64) NOT NULL,
   col100 varchar(64) NOT NULL,
   col101 varchar(64) NOT NULL,
   col102 varchar(64) NOT NULL,
   col103 varchar(64) NOT NULL,
   col104 varchar(64) NOT NULL,
   col105 varchar(64) NOT NULL,
   col106 varchar(64) NOT NULL,
   col107 varchar(64) NOT NULL,
   col108 varchar(64) NOT NULL,
   col109 varchar(64) NOT NULL,
   col110 varchar(64) NOT NULL,
   col111 varchar(64) NOT NULL,
   col112 varchar(64) NOT NULL,
   col113 varchar(64) NOT NULL,
   col114 varchar(64) NOT NULL,
   col115 varchar(64) NOT NULL,
   col116 varchar(64) NOT NULL,
   col117 varchar(64) NOT NULL,
   col118 varchar(64) NOT NULL,
   col119 varchar(64) NOT NULL,
   col120 varchar(64) NOT NULL,
   col121 varchar(64) NOT NULL,
   col122 varchar(64) NOT NULL,
   col123 varchar(64) NOT NULL,
   col124 varchar(64) NOT NULL,
   col125 varchar(64) NOT NULL,
   col126 varchar(64) NOT NULL,
   col127 varchar(64) NOT NULL,
   col128 varchar(64) NOT NULL,
   col129 varchar(64) NOT NULL,
   col130 varchar(64) NOT NULL,
   col131 varchar(64) NOT NULL,
   col132 varchar(64) NOT NULL,
   col133 varchar(64) NOT NULL,
   col134 varchar(64) NOT NULL,
   col135 varchar(64) NOT NULL,
   col136 varchar(64) NOT NULL,
   col137 varchar(64) NOT NULL,
   col138 varchar(64) NOT NULL,
   col139 varchar(64) NOT NULL,
   col140 varchar(64) NOT NULL,
   col141 varchar(64) NOT NULL,
   col142 varchar(64) NOT NULL,
   col143 varchar(64) NOT NULL,
   col144 varchar(64) NOT NULL,
   col145 varchar(64) NOT NULL,
   col146 varchar(64) NOT NULL,
   col147 varchar(64) NOT NULL,
   col148 varchar(64) NOT NULL,
   col149 varchar(64) NOT NULL,
   col150 varchar(64) NOT NULL,
   col151 varchar(64) NOT NULL,
   col152 varchar(64) NOT NULL,
   col153 varchar(64) NOT NULL,
   col154 varchar(64) NOT NULL,
   col155 varchar(64) NOT NULL,
   col156 varchar(64) NOT NULL,
   col157 varchar(64) NOT NULL,
   col158 varchar(64) NOT NULL,
   col159 varchar(64) NOT NULL,
   col160 varchar(64) NOT NULL,
   col161 varchar(64) NOT NULL,
   col162 varchar(64) NOT NULL,
   col163 varchar(64) NOT NULL,
   col164 varchar(64) NOT NULL,
   col165 varchar(64) NOT NULL,
   col166 varchar(64) NOT NULL,
   col167 varchar(64) NOT NULL,
   col168 varchar(64) NOT NULL,
   col169 varchar(64) NOT NULL,
   col170 varchar(64) NOT NULL,
   col171 varchar(64) NOT NULL,
   col172 varchar(64) NOT NULL,
   col173 varchar(64) NOT NULL,
   col174 varchar(64) NOT NULL,
   col175 varchar(64) NOT NULL,
   col176 varchar(64) NOT NULL,
   col177 varchar(64) NOT NULL,
   col178 varchar(64) NOT NULL,
   col179 varchar(64) NOT NULL,
   col180 varchar(64) NOT NULL,
   col181 varchar(64) NOT NULL,
   col182 varchar(64) NOT NULL,
   col183 varchar(64) NOT NULL,
   col184 varchar(64) NOT NULL,
   col185 varchar(64) NOT NULL,
   col186 varchar(64) NOT NULL,
   col187 varchar(64) NOT NULL,
   col188 varchar(64) NOT NULL,
   col189 varchar(64) NOT NULL,
   col190 varchar(64) NOT NULL,
   col191 varchar(64) NOT NULL,
   col192 varchar(64) NOT NULL,
   col193 varchar(64) NOT NULL,
   col194 varchar(64) NOT NULL,
   col195 varchar(64) NOT NULL,
   col196 varchar(64) NOT NULL,
   col197 varchar(64) NOT NULL,
   col198 varchar(64) NOT NULL,
   PRIMARY KEY (col1)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

A Technical Introduction to MemSQL


Feed: MemSQL Blog.
Author: John Sherwood.

John Sherwood, a senior engineer at MemSQL on the query optimizer team, presented at MemSQL’s Engineering Open House in our Seattle offices last month. He gave a technical introduction to the MemSQL database, including its support for in-memory rowstore tables and disk-backed columnstore tables, its SQL support and MySQL wire protocol compatibility, and how aggregator and leaf nodes interact to store data and answer queries simultaneously, scalably, and with low latencies. He also went into detail about code generation for queries and query execution. Following is a lightly edited transcript of John’s talk. – Ed.

This is a brief technical backgrounder on MemSQL, our features, our architecture, and so on. MemSQL: we exist. Very important first point. We have about 50 engineers scattered across our San Francisco and Seattle offices for the most part, but also various other offices across the rest of the country and the world.

With any company, and especially a database company, there is the question of why do we specifically exist? There’s absolutely no shortage of database products out there, as probably many of you could attest from your own companies.

Technical Introduction to MemSQL database 1

Scale-out is of course a bare minimum these days, but the primary feature of MemSQL has traditionally been the in-memory rowstore which allows us to circumvent many of the issues that arise with disk-based databases. Along the way, we’ve added columnstore, with several of its own unique features, and of course you’re presented all this functionality through a MySQL wire protocol-compatible interface.

Technical Introduction to MemSQL database 2

The rowstore requires that all the data fit in main memory. By completely avoiding disk IO, we are able to use a variety of techniques to speed up execution with minimal latencies. The columnstore is able to leverage encoding techniques that – with code generation and modern hardware – allow for incredibly fast scans.

The general market we find ourselves in is: companies who have large, shifting datasets, who are looking for very fast answers, ideally with minimal changes in latency, as well as those who have large historical data sets, who want very quick, efficient queries.

So, from 20,000 feet as mentioned, we scale out as well as up. At the very highest level, our cluster is made up of two kinds of nodes, leaves and aggregators. Leaves actually store data, while aggregators coordinate the data manipulation language (DML). There's a single aggregator which we call the master aggregator – actually, in our codebase, we call it the Supreme Leader – which is responsible for coordinating the data definition language (DDL) and is the closest thing we have to a Hadoop-style namenode, et cetera, that actually runs our cluster.

Technical Introduction to MemSQL database 3

As mentioned, the interface at MemSQL is MySQL compatible with extensions and our basic idiom remains the same: database, tables, rows. The most immediate nuance is that our underlying system will automatically break a logical database into multiple physical partitions, each of which is visible on the actual leaf. While we are provisionally willing to shard data without regard to what the user gives us, we much prefer it if you actually use a shard key which allows us to set up convenient joins, et cetera, for actual exploration of data.
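As a rough illustration of that last point, a MemSQL table can declare its shard key along these lines in the CREATE TABLE statement. This is only a sketch – the table and column names are invented for this example, not taken from the talk:

CREATE TABLE events (
    customer_id BIGINT NOT NULL,
    event_time DATETIME NOT NULL,
    payload VARCHAR(255),
    SHARD KEY (customer_id)  -- rows are hashed on customer_id across partitions
);

With the shard key declared, point lookups on customer_id can be routed to a single partition, and joins on the shard key can be performed locally on each leaf instead of reshuffling data across the cluster.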

The aggregator then is responsible for formulating query plans, bridging out across leaves as necessary to service the DML. Of particular note is that the engine that we use is able to have leaves perform computations with the same full amount of functionality that the aggregator itself can perform, which allows us to perform many worthwhile optimizations across the cluster.

Technical Introduction to MemSQL database 4

A quick, more visual example will better show what I'm talking about. Here we have an example cluster. We have a single master aggregator and three leaf nodes. A user has given us the very imaginatively named database "db" which we're supposed to create. Immediately the aggregator's job is to stripe this into multiple sub-databases, here shown as db_0 through db_2. In practice, we find that a database per physical core on the host works best – it allows parallelization and so on – but drawing out 48 of these boxes per host would probably be a little bit much.

Technical Introduction to MemSQL database 5

So beyond just creating the database, as mentioned, we have a job as a database to persist data, and running on a single host does not get you very far in the modern world. And so, we have replication. We do this by database partition, replicating data from each leaf to a chosen slave.
So as you can see here, we've created a cluster such that there is no single point of failure. If a node goes down, such as the leaf mastering db_2, the other leaf – the one that currently masters db_0 – will be promoted, step up, and start serving that data.

Technical Introduction to MemSQL database 6

I’d also note that while I’m kind of hand waving a lot of things, all this does take place under a very heavy, two phase commit sort of thing. Such that we do handle failures properly, but for hopefully obvious reasons, I’m not going to go there.

So in a very basic example, let's say a user is actually querying this cluster. As mentioned, they talk to the master aggregator, which presents the logical database db, which they treat like any other database. The master aggregator in this case is going to have to fan out across all the leaves, query them individually, and merge the results.

One thing that I will note here, is that I mentioned that we can actually perform computations on the leaves, in a way that allows us not to do so on the master. Here we have an order-by clause, which we actually push down to each leaf. Perhaps there was actually an index on A that we take advantage of.

Technical Introduction to MemSQL database 7

Here the master aggregator will simply combine, merge, and stream the results back. We can easily imagine that even for this trivial example, if each leaf is using its full storage for this table, the master aggregator (on homogeneous hardware at least) would not be able to do a full quicksort – or whatever sort you want to use – and actually sort all the data without spooling. And so even this trivial example shows how our distributed architecture allows faster speeds.

Before I move on, here's an example of inserts. Here, as with point lookups and so on in the DML, we're able to determine from the shard key the exact leaf that owns this row.

Technical Introduction to MemSQL database 8

So here we talk to a single leaf, transparently, without the master aggregator necessarily knowing about it. That leaf replicates the row down to db_1's slave on the other host, giving us durability, replication, all that good stuff.

Again, as a database, we are actually persisting all of the data that has been entrusted to us. We draw a nuance between durability – the actual persistence on a single host – and replication across multiple hosts.

Like many databases, the strategy that we use for this is a streaming write-ahead-log which allows us to rephrase the problem from, “How do I stream transactions across the cluster?” to simply, “How do I actually replicate pages in an ordered log across multiple hosts?” As mentioned, this works at the database level, which means that there’s no actual concept of a schema, of the actual transactions themselves, or the row data. All that happens is that this storage layer is responsible for replicating these pages, the contents of which it is entirely agnostic to.

Technical Introduction to MemSQL database 9

The other large feature of MemSQL is its code generation. Essentially, the classic way for a database to work is by injecting what we would call, in the C++ world, virtual functions. The idea is that in the common case, you might have an operator comparing a field of a row to a constant value.

Technical Introduction to MemSQL database 10

In a normal database you might inject an operator class that holds a constant value, do a virtual function lookup to actually check it, and we go on with our lives. The nuance here is that this is suboptimal in a couple of ways. The first is that if we're using a function pointer, a function call, the call is not inlined. And the second is simply that in making a function call, we're having to dynamically look it up. Code generation, on the other hand, allows us to make those decisions beforehand, well before anything actually executes. This allows us both to make these basic optimizations where we could say, "this common case any engine would have – just optimize around it," and to do very complex things outside of queries in a kind of precognitive way.

An impressive thing for most when they look through our code base is just the amount of metadata we collect. We have huge amounts of data on various columns, on the tables and databases, and everything else. And at runtime, if we were to attempt to read this, look at it, and make decisions on it, we would be hopelessly slow. But instead, by using code generation, we're able to make all the decisions up front, efficiently generate code, and go on with our lives without runtime costs. A huge lever for us is the fact that we use an LLVM toolchain under the hood, such that by generating IR – intermediate representation – for LLVM, we can take advantage of the entire toolchain they've built up. In fact, it's the same toolchain that we all love – or would love if we actually used it here for our main code base – in our day-to-day lives. We get all those advantages: function inlining, loop unrolling, vectorization, and so on.

And so between those two features we have the ability to build a next generation, amazing, streaming database.

CREATE PIPELINE: Real-Time Streaming and Exactly-Once Semantics with Kafka


Feed: MemSQL Blog.
Author: Floyd Smith.

In this presentation, recorded shortly after MemSQL introduced MemSQL Pipelines, two MemSQL engineers describe MemSQL's underlying architecture and how it matches up perfectly to Kafka, including in the areas of scalability and exactly-once updates. The discussion includes specific SQL commands used to interface MemSQL to Kafka, unleashing a great deal of processing power from both technologies. In the video, the MemSQL people go on to describe how to try this on your own laptop, with free-to-use MemSQL software.

Introduction to MemSQL: Carl Sverre

I want to start with a question. What is MemSQL? It's really important to understand the underpinnings of what makes Pipelines so great, which is our MemSQL distributed SQL engine.

There are three main areas of MemSQL that I want to talk about really briefly. The first area of MemSQL is that we’re a scalable SQL database. So if you’re familiar with MySQL, Postgres, Oracle, SQL Server, a lot of our really awesome competitors, we are really similar. If you’re used to their syntax, you can get up and running with MemSQL really easily, especially if you use MySQL. We actually have followed their syntax very similarly, and so if you already used MySQL, you can pretty much drop in MemSQL in place, and it just works.


So, familiar syntax, scalable SQL database. What makes us scalable, really briefly? Well, we’re a distributed system. We scale out on commodity hardware, which means you can run us in your favorite cloud provider, you can run us on-premises and it generally just works. It’s super, super fast, as you can see – really, really fast and it’s just fun.

So without further ado, I want to get into a little bit of the technical details that are behind what makes us such a great database. In MemSQL, we have two primary roles in the cluster. So if you think about a collection of Linux machines, we have some aggregators and some leaves. Aggregators are essentially responsible for the metadata-level concepts in the cluster. So we're talking about the data definition layer – DDL, for people who are familiar with SQL. We're responsible for CREATE TABLE statements, for CREATE PIPELINE statements, which we're going to get right into.

MemSQL stores data in leaf nodes and distributes operations across them.

In addition, master aggregators and child aggregators collectively handle things like failover, high availability, cluster management, and really importantly, query distribution. So when a SELECT query comes in, or an INSERT command comes in, you want to get some data, you want to insert some data. What we do is we take those queries and we shard those queries down onto leaf nodes. So leaf nodes are never connected to directly by your app. Instead you connect to the aggregators and we shard down those queries to the leaf nodes.

And what are the leaf nodes? Well, leaf nodes satisfy a couple of really, really powerful features inside the engine. One feature is storage. If you have leaf nodes, you can store data. If you have more leaf nodes, you can store more data. That's the general concept here. In addition, leaf nodes handle query execution. So the more leaf nodes you have, generally the faster your database goes. That's a great property to have in a distributed system. You want to go faster, add more leaf nodes. It generally scales up and we see some really amazing workloads that are satisfiable by simply increasing the size of the cluster. And that's exciting.

Finally you get a sort of natural parallelism. Because we shard these queries to all the leaf nodes, by scaling out you are taking advantage of many, many cores; everything you want performance-wise just works. And that's a simple system to get behind, and I'm really excited to always talk about it because I'm a performance geek.

So that’s the general idea of MemSQL. This Meetup is going to be really focused on Pipelines, so I just wanted to give you some basic ideas at the MemSQL level.

Introduction to MemSQL Pipelines: John Bowler

Pipelines are MemSQL’s solution to real-time streaming. For those of you who are familiar with what an analytics pipeline looks like, you might have your production database going into a Kafka stream or writing to S3 or writing to a Hadoop data lake. You might have a computation framework like Spark or Storm, and then you might have an analytics database farther downstream, such as RedShift, for example. And you might have some business intelligence (BI) tools that are hitting those analytics databases.

So MemSQL Pipelines are our attempt at taking a step back and solving the core problem, which is: how do you easily and robustly and scalably create this sort of streaming analytics workload, end to end? And we are able to leverage some unique properties of our existing system.

For example, since MemSQL is already an ACID-compliant SQL database, we get things like exactly-once semantics out of the box. Every micro-batch that you’re streaming through your system happens within a transaction. So you’re not going to duplicate micro-batches and you’re not going to drop them.
These pipelines – these streaming workloads – are also automatically distributed using exactly the same underlying machinery that we use to distribute our tables. In our database, they're just automatically sharded across your entire cluster.

And finally, for those of you who have made these sort of analytics workloads, there’s always going to be some sort of computation step, whether it’s Spark or any similar frameworks. We offer the ability to perform this computation, this transformation, written in whatever language you want, using whatever framework or whatever libraries you want. (This happens within the Pipeline, very fast. Accomplishing the same thing as an ETL step, but within a real-time streaming context. – Ed.) And we’ll explain in more detail how that works.

So this is how you create a pipeline – or, this is one of the ways that you create a pipeline. You’ll notice that it is very similar to a CREATE TABLE statement and you can also alter a pipeline and drop a pipeline. The fact is that pipelines exist as first-class entities within our database engine.

The MemSQL CREATE PIPELINE statement names a Kafka pipeline as its input source.

And underneath this CREATE PIPELINE line is a LOAD DATA statement that is familiar to anyone who's used a SQL database, except instead of loading data from a file, you're loading data from Kafka, specifically from this host name and the tweets topic. And then the destination of this stream is the tweets table. So in these three lines of SQL, you can declaratively describe the source of your stream and the sink of your stream, and everything related to managing it is automatically handled by the MemSQL engine.
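The slide itself is not reproduced here, but the statement being described has roughly this shape. The tweets topic and tweets table come from the talk; the pipeline name and the Kafka host are invented placeholders, so treat this as a sketch rather than the exact text of the slide:

CREATE PIPELINE tweets_pipeline AS
LOAD DATA KAFKA 'kafka-host.example.com/tweets'
INTO TABLE tweets;

Because pipelines are first-class objects, a pipeline defined this way is then started and managed with its own statements (for example, START PIPELINE), alongside the ALTER PIPELINE and DROP PIPELINE operations the speaker mentions.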

This is sort of a diagram of how it works. Kafka, for those of you who are unfamiliar, is a distributed message queue. When you’re building analytics pipelines, you very commonly have lots of bits of production code that are emitting events or emitting clicks or emitting sensor data and they all have to get aggregated into some sort of buffer somewhere for whatever part of your analytics system is consuming them. Kafka is one of those very commonly used types, and it’s one that we support.

So you have all different parts of your production system emitting events. They arrive in Kafka, maybe a few days worth of buffer. And when you create your pipeline in MemSQL, it automatically streams data in. Now, Kafka is a distributed system. You have data sharded across your entire cluster. MemSQL is also a distributed system. You have data sharded across your entire set of leaf nodes. So when you create this Kafka consumer, it happens in parallel, automatically.

Now, this is sufficient if you have data in Kafka and you just want to load it straight into a table. If you additionally want to run some sort of transform, or MapReduce, or RDD-like operation on it, then you can have a transform. And the transform is just a binary or just a program.

You can write it in whatever language you want. All it does is it reads records from stdin and it writes records to stdout, which means that you could write it as a Python script. You could write it as a Bash script. You could write it as a C program if you want. Amazing performance. You can do any sort of machine learning or data science work. You can even hit an external server if you want to.

Kafka distributed data maps to MemSQL leaf nodes.

So every record that gets received from Kafka passes through this transform and is loaded into the leaf, and all of this happens automatically in parallel. You create the code of this transform and MemSQL takes care of automatically deploying it for you across this entire cluster.

Conclusion

The video shows all of the above, plus a demonstration you can try yourself, using MemSQL to process tweet streams. You can download and run MemSQL for free or contact MemSQL.

movead li: An Overview of Logical Replication in PostgreSQL


Feed: Planet PostgreSQL.

Logical Replication appeared in Postgres 10, and it came along with a number of related terms like 'logical decoding', 'pglogical', 'wal2json', 'BDR', etc. These words puzzled me, so I decided to start this blog by explaining the terms and describing the relationships between them. I will also mention a new idea for a tool called 'logical2sql', designed to synchronize data from Postgres to databases of different kinds and architectures.

Difference Between Physical And Logical Replication

Replication (or Physical Replication) is the older, traditional data synchronization approach, and it plays an important role in Postgres. It copies the data in the WAL record using exact block addresses, byte by byte, so master and slave end up almost identical and even the 'ctid' that identifies the location of a data row is the same. This blog focuses on logical replication; physical replication is mentioned here only to highlight the difference between the two for the reader.

physical replication

Physical Replication Schematics

Logical Replication supports table-level data synchronization; in principle it is very different from physical replication. The analysis below will help in understanding this:

Logical Replication: when a WAL record is produced, the logical decoding module analyzes the record into a 'ReorderBufferChange' struct (you can think of it as HeapTupleData). The pgoutput plugin then filters and reorganizes the ReorderBufferChanges and sends them to the subscriber.

The subscriber receives the 'HeapTupleData' and re-executes the insert, delete, or update for the received data. Notice that Physical Replication copies data from the WAL record straight into data pages, while Logical Replication executes the insert, update, or delete again on the subscription side.

Logical Replication Schematics

Concepts for Understanding Logical Replication

A picture to understand everything about logical replication

Logical Decoding is the key technology behind the entire logical replication feature. The test_decoding plugin can be used to test logical decoding; it doesn't do a whole lot by itself, but it can serve as a starting point for developing other output plugins. The wal2json plugin appeared as a contrib module with more usable output than test_decoding. BDR and pglogical are contrib modules that provide much the same functionality as logical replication; BDR supports multi-master, DDL replication, and more. pglogical has almost the same features as the built-in Logical Replication, with different operational steps, and it can be used on older versions of Postgres, while the built-in Logical Replication is available from PostgreSQL 10 onward.

1. Key: Logical Decoding

Logical Decoding Schematics

The walsender process continuously reads WAL records through the replication slot. The records are filtered and turned into ReorderBufferChange structs (mainly the old tuple and new tuple for DML records). The changes are collected in the reorder buffer, grouped by transaction ID; when a transaction commits, the ReorderBufferChanges belonging to that transaction are handed to the configured output plugin for analysis.

2. test_decoding Plugin

test_decoding is a plugin that ships with Postgres alongside logical decoding; it turns WAL records into a human-readable format. We can use the plugin in the two ways shown below; see the documentation for details.

2.1 Used By SQL

postgres=# SELECT * FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
    slot_name    |    lsn    
-----------------+-----------
 regression_slot | 0/16569D8
(1 row)
​
postgres=# \d t1
                      Table "public.t1"
 Column |       Type        | Collation | Nullable | Default 
--------+-------------------+-----------+----------+---------
 i      | integer           |           |          | 
 j      | integer           |           |          | 
 k      | character varying |           |          | 
​
postgres=# insert into t1 values(1,1,'test_decoding use by SQL');
INSERT 0 1
postgres=# SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL);
    lsn    | xid |                                                data                                                
-----------+-----+----------------------------------------------------------------------------------------------------
 0/16569D8 | 492 | BEGIN 492
 0/16569D8 | 492 | table public.t1: INSERT: i[integer]:1 j[integer]:1 k[character varying]:'test_decoding use by SQL'
 0/1656A60 | 492 | COMMIT 492
(3 rows)
​
postgres=#

Its Schematics

2.2 Used By pg_recvlogical

movead@home:~/ld $ pg_recvlogical -d postgres --slot=test2 --create-slot
movead@home:~/ld $ pg_recvlogical -d postgres --slot=test2 --start -f ld.out &
[1] 19807
movead@home:~/ld $ psql -c "insert into t1 values(2,2,'test_decoding use by pg_recvlogical');"
INSERT 0 1
movead@home:~/ld $ vi ld.out
movead@home:~/ld $

Then you can see the output in file ld.out


BEGIN 496
table public.t1: INSERT: i[integer]:2 j[integer]:2 k[character varying]:'test_decoding use by pg_recvlogical'
COMMIT 496

Its Schematics

3. Wal2json

The wal2json plugin is an enhanced version of test_decoding; its output is easier to consume.

movead@home:~/ld $ pg_recvlogical -d postgres --slot test_slot_wal2json --create-slot -P wal2json
movead@home:~/ld $ pg_recvlogical -d postgres --slot test_slot_wal2json --start -o pretty-print=1 -f ld_wal2json.out
movead@home:~/ld $ vi ld_wal2json.out
movead@home:~/ld $

Now we can check the output in the ld_wal2json.out file:

{
  "change": [
    {
      "kind": "insert",
      "schema": "public",
      "table": "t1",
      "columnnames": ["i", "j", "k"],
      "columntypes": ["integer", "integer", "character varying"],
      "columnvalues": [1, 1, "wal2json use by pg_recvlogical"]
    }
  ]
}

4. Logical Replication

A Postgres user can use test_decoding to understand logical replication, but it is hard to use it directly without secondary development or a third-party contrib module. Logical Replication lets us use logical decoding through a built-in feature of Postgres. In other words, users cannot really use test_decoding in production unless they add third-party plug-ins to synchronize data, while Logical Replication makes logical decoding easy to use in production environments.

Here is a simple Logical Replication walkthrough; I will show what Postgres does at every step.

4.1 Prepare Test Environment

//Publication IP43
create user user_logical43 replication encrypted password '123';
create table t1(i int, j int ,k varchar);
alter table t1 add primary key(i);
grant select on t1 to user_logical43;

//Subscription IP89
create table t1(i int,j int,k varchar);

4.2 Logical Replication Deploy

IP43 create publication

We will use pub43 later; the others are shown just for illustration.

create publication pub43 for table t1;
create publication pub43_1 for all tables;
create publication pub43_2 for all tables with (publish='insert,delete');

Creating a publication modifies the catalog relations:

postgres=# select * from pg_publication ;
oid | pubname | pubowner | puballtables | pubinsert | pubupdate | pubdelete | pubtruncate
-------+---------+----------+--------------+-----------+-----------+-----------+-------------
16420 | pub43   |       10 | f           | t         | t         | t         | t
16422 | pub43_1 |       10 | t           | t         | t         | t         | t
16423 | pub43_2 |       10 | t           | t         | f         | t         | f
(3 rows)
postgres=#  select * from pg_publication_rel ;
oid | prpubid | prrelid
-------+---------+---------
16421 |   16420 |   16392
(1 row)
postgres=# select oid,relname from pg_class where relname ='t1' or relname ='t2';
oid | relname
-------+---------
16402 | t2
16392 | t1
(2 rows)

According to the test above, we can see that the publication name appears in the catalog pg_publication, and if the publication is not for all tables, then a mapping between the publication and its tables appears in the catalog pg_publication_rel. You can use the publish option to specify in more detail which operations are published (insert, delete, update, truncate).

IP89 create subscription

create subscription sub89 connection 'host=192.168.102.43 port=5432 dbname=postgres user=user_logical43 password=123' publication pub43;

The catalog relations change accordingly:

//For IP89
postgres=# select * from pg_subscription;
oid | subdbid | subname | subowner | subenabled |                                 subconninfo                                   | subslotname | subsync
commit | subpublications
-------+---------+---------+----------+------------+--------------------------------------------------------------------------------+-------------+--------
-------+-----------------
16394 |   13591 | sub89   |       10 | t         | host=192.168.102.43 port=5432 dbname=postgres user=user_logical43 password=123 | sub89       | off    
      | {pub43}
(1 row)

postgres=# select * from pg_subscription_rel;
srsubid | srrelid | srsubstate | srsublsn  
---------+---------+------------+-----------
  16394 |   16384 | r         | 0/16A2700
(1 row)

postgres=# select oid,relname from pg_class where relname ='t1';
oid | relname
-------+---------
16384 | t1
(1 row)

//For IP43
postgres=# select * from pg_replication_slots ;
slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | ca
talog_xmin | restart_lsn | confirmed_flush_lsn
-----------+----------+-----------+--------+----------+-----------+--------+------------+------+---
-----------+-------------+---------------------
sub89     | pgoutput | logical   |  13594 | postgres | f         | t     |       8940 |     |  
      524 | 0/16A26C8   | 0/16A2700
(1 row)

Also, a walsender process and a logical replication worker process are created for the logical replication service on IP43 and IP89, respectively.

4.3 Check Result

Insert a row on IP43

lichuancheng@IP43:~ $ psql
psql (13devel)
Type "help" for help.

postgres=# insert into t1 values(1,1,'data from ip43');
INSERT 0 1
postgres=#

Query in IP89

[lichuancheng@IP89 data]$ psql
psql (13devel)
Type "help" for help.

postgres=# select *from t1;
i | j |       k        
---+---+----------------
1 | 1 | data from ip43
(1 row)

postgres=#

Yes, the row has been applied on the subscriber; how this happens is described below.

4.4 How it works

Now we have completed a simple demo of logical replication, and I believe you are now familiar with the process. This post mainly explains the principle rather than going through many functional demonstrations. The principle of logical replication was described in detail in the earlier section Difference Between Physical And Logical Replication; below is how this simple insert statement flows through the Logical Replication Schematics.

Insert Logical Replication Schematics

5. BDR and Pglogical

There is only one version of BDR (Bi-Directional Replication) on GitHub, and it only supports Postgres 9.4. pglogical and Logical Replication are built on the same principle. pglogical supports PostgreSQL 9.4 and onwards, while Logical Replication is supported from PostgreSQL 10; pglogical can also handle data conflicts.

The New Concept: logical2sql

1. What And Why Logical2sql

test_decoding or wal2json can get data out of the database, but we also need a tool that can apply that data to other databases; this is where logical2sql comes into play. The process of logical2sql is described in the diagram below:

Implementation Schematics

However, this approach has a delay caused by the write and read between pg_recvlogical and the tool used for forwarding the data. Look at the red box in the picture: the pg_recvlogical program writes data to a temp file and the forwarding tool reads data from that temp file, which costs transmission efficiency. It also causes problems with escape characters. If we combine the two tools, we avoid the efficiency loss of reading and writing the data, and because there is no intermediate file, there is no need to worry about escape characters. This is logical2sql Ⅰ.

2. Logical2sql Design

We need to develop a new output plugin that can turn a ReorderBufferChange into SQL adapted to different target databases. We can then build a new tool that receives the SQL and forwards it to the other databases instead of writing it to a temporary file.

logical2sqlⅠ

3. Logical2sql Design Ⅱ

There is still a data transfer of the SQL from the new output plugin to the new tool. We can even abandon the new tool and implement the SQL forwarding directly in the plugin. In the logical2sql Ⅰ design we can see that the SQL is formed in the new output plugin and then sent to the new tool, which does the forwarding; passing the SQL from the plugin to the tool still costs some efficiency. If we finish the SQL forwarding inside the output plugin, right after the SQL is formed, no extra tool is needed at all. This is logical2sql Ⅱ.

logical2sql Ⅱ

I think it will be a good approach. Let's try it!

Mariabackup and BACKUP STAGE Commands


Feed: MariaDB Knowledge Base Article Feed.
Author: .

The BACKUP STAGE commands are a set of commands that make it possible to build an efficient external backup tool. How Mariabackup uses these commands depends on whether you are using the version that is bundled with MariaDB Community Server or the version that is bundled with MariaDB Enterprise Server.

Mariabackup and BACKUP STAGE Commands in MariaDB Community Server

The version of Mariabackup that is bundled with MariaDB Community Server does not yet use the BACKUP STAGE commands in an efficient way. In that version, instead of executing the FLUSH TABLES WITH READ LOCK command to lock the database, it executes the following BACKUP STAGE commands:

BACKUP STAGE START;
BACKUP STAGE BLOCK_COMMIT;

And then to unlock the database when the backup is complete, it executes the following:

BACKUP STAGE END;

Mariabackup and BACKUP STAGE Commands in MariaDB Enterprise Server

The following sections describe how the MariaDB Enterprise Backup version of Mariabackup that is bundled with MariaDB Enterprise Server uses each BACKUP STAGE command in an efficient way.
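For orientation, the stages themselves are ordinary SQL statements that an external backup tool issues in a fixed order, doing its copy work between them:

BACKUP STAGE START;
BACKUP STAGE FLUSH;
BACKUP STAGE BLOCK_DDL;
BACKUP STAGE BLOCK_COMMIT;
BACKUP STAGE END;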

BACKUP STAGE START in MariaDB Enterprise Server

Mariabackup could perform the following tasks in the START stage:

  • Copy all transactional tables.
    • InnoDB (i.e. ibdataN and file extensions .ibd and .isl)
    • Aria (i.e. aria_log_control and file extensions .MAD and .MAI)
  • Copy the tail of all transaction logs.
    • The tail of the InnoDB redo log (i.e. ib_logfileN files) will be copied for InnoDB tables.
    • The tail of the Aria redo log (i.e. aria_log.N files) will be copied for Aria tables.

BACKUP STAGE FLUSH in MariaDB Enterprise Server

Mariabackup could perform the following tasks in the FLUSH stage:

  • Copy all non-transactional tables that are not in use. The list of tables currently in use is found with SHOW OPEN TABLES (see the example below).
    • MyISAM (i.e. file extensions .MYD and .MYI)
    • MERGE (i.e. file extensions .MRG)
    • ARCHIVE (i.e. file extensions .ARM and .ARZ)
    • CSV (i.e. file extensions .CSM and .CSV)
  • Copy the tail of all transaction logs.
    • The tail of the InnoDB redo log (i.e. ib_logfileN files) will be copied for InnoDB tables.
    • The tail of the Aria redo log (i.e. aria_log.N files) will be copied for Aria tables.

At this point data for all old tables should have been copied (except for some system tables).
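For example, the non-transactional tables that are still in use (and therefore deferred to the BLOCK_DDL stage) can be listed with:

SHOW OPEN TABLES WHERE In_use > 0;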

BACKUP STAGE BLOCK_DDL in MariaDB Enterprise Server

Mariabackup could perform the following tasks in the BLOCK_DDL stage:

  • Copy other files.
    • i.e. file extensions .frm, .isl, .TRG, .TRN, .opt, .par
  • Copy the non-transactional tables that were in use during BACKUP STAGE FLUSH.
    • MyISAM (i.e. file extensions .MYD and .MYI)
    • MERGE (i.e. file extensions .MRG)
    • ARCHIVE (i.e. file extensions .ARM and .ARZ)
    • CSV (i.e. file extensions .CSM and .CSV)
  • Check ddl.log for DDL executed before the BLOCK DDL stage.
    • The file names of newly created tables can be read from ddl.log.
    • The file names of dropped tables can also be read from ddl.log.
    • The file names of renamed tables can also be read from ddl.log, so the files can be renamed instead of re-copying them.
  • Copy changes to system log tables.
  • Copy the tail of all transaction logs.
    • The tail of the InnoDB redo log (i.e. ib_logfileN files) will be copied for InnoDB tables.
    • The tail of the Aria redo log (i.e. aria_log.N files) will be copied for Aria tables.

BACKUP STAGE BLOCK_COMMIT in MariaDB Enterprise Server

Mariabackup could perform the following tasks in the BLOCK_COMMIT stage:

  • Create a MyRocks checkpoint using the rocksdb_create_checkpoint system variable (see the example after this list).
  • Copy changes to system log tables.
  • Copy changes to statistics tables.
  • Copy the tail of all transaction logs.
    • The tail of the InnoDB redo log (i.e. ib_logfileN files) will be copied for InnoDB tables.
    • The tail of the Aria redo log (i.e. aria_log.N files) will be copied for Aria tables.
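As a point of reference (the target directory below is illustrative), a MyRocks checkpoint is requested by assigning a directory path to the rocksdb_create_checkpoint system variable:

SET GLOBAL rocksdb_create_checkpoint = '/backups/myrocks_checkpoint';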

BACKUP STAGE END in MariaDB Enterprise Server

Mariabackup could perform the following tasks in the END stage:

  • Copy the MyRocks checkpoint into the backup.

Matillion Security Controls: Enable User Auditing (Part 2 of 3)


Feed: Cloud Data Transformation Software | Matillion.
Author: Arawan Gajajiva, Principal Architect


User Auditing Part 2 – Dimensional Modeling and Implementing Type 2 Dimensions

In Part 1 of this series on Matillion security controls and how to enable user auditing, we showed how to use Matillion ETL to integrate with a couple of Matillion v1 API endpoints as sources of User and Audit Logging data. In Part 2 of this series, we show how to use Matillion ETL to implement a common data design pattern, Dimensional Modeling, to get the most insight out of this User and Audit Logging data. If Dimensional Modeling is new to you, or you would like to learn more about this frequently used data modeling technique, Ralph Kimball’s book, The Data Warehouse Toolkit, is a great resource!

Other useful resources:

Userconfig API Profile RSD

Audit API Profile RSD

Access all of the jobs referenced in this series

Data Model

A traditional Fact/Dimension model is a good fit for storing user audit information for advanced reporting.  But, before we can start developing the Matillion ETL jobs to manage Fact and Dimension tables, we need to first design these tables. User audit events can be represented as a “fact” that occurred at a specific time, with related “dimensions” that describe aspects of the audit event.  Additionally, each “user” can have some access related attributes that can change over time. As such, a Type 2 Slowly Changing Dimension is a nice way to capture how a user’s attributes can change over time while preserving that history of change.

Matillion Security Controls Enable User Auditing: ERD screen shot

Managing DDL with Matillion ETL

The “Security DDL Setup” Orchestration Job creates all of the Fact and Dimension tables represented in the Entity Relationship Diagram (ERD) shown above.  Additionally, the Orchestration Job calls Transformation Jobs that help to initialize the Dimension tables with default data that can be used to represent an “Unknown” or failed lookup against the Dimension table.
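As a rough sketch of what this DDL amounts to for the User Dimension table (the exact column list and types are assumptions based on the fields discussed later in this article):

CREATE TABLE IF NOT EXISTS mtln_internal_user_dm (
    user_id   NUMBER,         -- surrogate key, assigned from a sequence
    user_sh   VARCHAR,        -- SHA-2 hash of the natural key ("name")
    name      VARCHAR,        -- Matillion user name, plus other user attributes
    is_active BOOLEAN,        -- TRUE for the currently active row per user
    start_ts  TIMESTAMP_NTZ,  -- row active from
    end_ts    TIMESTAMP_NTZ   -- row active until
);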

Matillion Audit Logging Screen 2

Main Orchestration

Matillion Security Controls Enable User Auditing: Main Orchestration screen

As is the case for all Matillion ETL jobs, there is one Main Orchestration Job that executes Components and other nested Orchestration and Transformation Jobs.  This is the main orchestrator of the workflow, defining dependencies and order of events. Dependencies are defined based on the output nubs of each component within the job. Notifications are also included in the job design, using an SNS Message component, to notify when any specific step in the Orchestration has failed.

SCD Type 2 – User Dimension (mtln_internal_user_dm)

The jobs that manage the User Dimension table demonstrate common Matillion ETL techniques and design patterns for managing Slowly Changing Dimensions (SCD). In this example, we demonstrate how to implement a Type 2 Slowly Changing Dimension workflow.

Table Metadata

Matillion Security Controls Enable User Auditing: Table Metadata Screen

Note the following SCD-specific fields added to this dimension table:

  • is_active – This boolean field is used to identify the currently active dimension row. This gives a view as to what the user and their related attributes look like currently.
  • start_ts – This timestamp field defines the point in time from which the dimension row is considered active. When initially loading users, we will populate this field with a default timestamp in the past.
  • end_ts – This timestamp field defines the point in time until which the dimension row is considered active. We will populate this field with a default timestamp in the future for currently active rows and will update this value when the dimension row is no longer active.

Job Design

Matillion Security Controls enable user auditing job design screen

The primary Orchestration job that manages the User Dimension table does the following (in order):

  1. Sets values for job variables
  2. Stages Matillion ETL User data into Snowflake
  3. Prepares User Dimension data
  4. Applies changes to the User Dimension table

Avoiding Race Conditions

An important aspect to understand about the demonstrated design pattern is the use of two Transformation Jobs to identify and execute data changes on a target table. The first Transformation Job evaluates the changes that need to be applied to the target table and stages the prepared data into an intermediary table.  The second Transformation Job applies the staged data changes to the target table. This design pattern addresses the race condition that would otherwise be encountered when trying to apply multiple DML operations (insert/update/delete) to a table that is also used as an input in the same Transformation Job. The race condition occurs when multiple DML operations are executed as separate statements and the first executed DML affects the data set used in the subsequent DML operations.

The nature of this Type 2 Slowly Changing Dimension is such that there may be more than one type of DML operation that needs to be applied to the table.  In this case, a set of data needs to be inserted (New or Changed records) and another set of data needs to be updated (Changed or Deleted records), both against the same Dimension table.  This Dimension table is also used as an Input in the logic of the Prepare job.  To avoid a race condition on applying changes back to the Dimension table, the changes to the Dimension table are first staged via the Prepare Transformation job and then the DML statements are executed in the Apply Transformation job.

Set Job Variables

Here we are using a Python Script component to set the value of Job Variables that are used throughout the User Dimension jobs.

Current Timestamp

First, we are getting a current timestamp that will be used throughout the rest of the job in a job variable named “curr_ts”.  The reason for setting a current timestamp in this manner (as opposed to using Snowflake’s CURRENT_TIMESTAMP function in the child Transformation Jobs) is so that we have a single consistent timestamp for all data affected by this run of the job.  Also note that there is a Job Variable named “tzone” to define the timezone that the current timestamp should be generated for. We are using UTC timezone in this example.

# 'tzone' is a Matillion job variable holding the timezone name (UTC here)
import pytz
from datetime import datetime

tz = pytz.timezone(tzone)
curr_time = datetime.now(tz)
# Format the timestamp and store it in the 'curr_ts' job variable so that
# every child job in this run works with the same point in time
str_time = curr_time.strftime('%Y-%m-%d %H:%M:%S')
context.updateVariable('curr_ts',str_time)

Initialize Dimension

Next, because we are using the Jython interpreter, we are able to directly query the User Dimension table to determine if this is an initial run of populating the dimension table. The “init_dm” job variable will let us know if this is an initialization of the dimension.  In a subsequent Transformation we will use this information to determine if the “start_ts” column for new dimension records should be set to a default timestamp in the past or if the current timestamp should be used. This is a good example of using a Python Script component to easily set variable values at runtime.

# Query the dimension table through the Matillion-provided database cursor.
# 'init_dm' is a job variable that flags an initial load (assumed to default
# to its "initial load" value); flip it to 0 if the dimension already has rows.
cursor = context.cursor()
cursor.execute('select count(*) from "mtln_internal_user_dm"')
rowcount = cursor.fetchone()[0]
if rowcount > 0:
  context.updateVariable('init_dm',0)

Stage User Data

In this step an API Query component is used to fetch all User data from Matillion and stage it into a table.  The API Profile used as the data source, userconfig, was detailed earlier.

Prepare SCD Type 2 Data

Matillion Security Controls Enable User Auditing: SCD Type 2 data screen

The general Matillion ETL design pattern for implementing slowly changing dimensions has previously been detailed in the article Type 6 Slowly Changing Dimensions Example. See that article for additional examples of implementing SCD design patterns.  In this Transformation Job, the incoming staged data is compared against the target Dimension table. Any new, changed or deleted records have some data transformations applied to them and the resulting data set is then staged into an intermediary table, to be used in a subsequent Transformation Job.

Prepare: user_sh

Matillion Security Controls Enable User Auditing: user_sh screen

A specific design decision was made in the dimension tables to add a column that is populated with a hashed string representing a unique entity in the dimension. That unique entity can be identified by a single or composite key, and the related fields are passed through a SHA-2 algorithm to return a hashed string. Having this hashed string makes it simpler, and more consistent, to design a common pattern for comparing incoming staged data against the target dimension table. In the case of the User Dimension, the “name” field is what uniquely identifies an entity in the dimension. The hashed “name” string is then stored in the “user_sh” field. Because this is a Type 2 SCD, a single user can have multiple rows in the User Dimension table (as the attributes of that user change over time). All occurrences of the same user in the User Dimension will therefore have the same “user_sh” value.
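In Snowflake terms this boils down to something like the following (a minimal sketch; the staging table name is illustrative):

SELECT SHA2(name, 256) AS user_sh,
       name
FROM   stg_matillion_users;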

Prepare: Detect Changes

Matillion Security Controls enable user auditing: detect changes screen

One of the most important Matillion ETL components in this design pattern is the Detect Changes component.  This component allows us to compare the incoming staged data against the target dimension table and add an indicator for every row of staged data to denote if it is New, Changed, Deleted or Identical. When comparing the incoming staged data against the data in the target dimension table, we only need to look at the currently active dimension row for each user.  And, for the purposes of populating the User Dimension table, we can filter out any Identical rows of data. Also note that when joining the data sets, the hashed field (user_sh) is used.

Prepare: Shape Data

Once we have filtered out Identical records from the incoming staged data, a Calculator component is used to add new fields or modify existing fields with specific business logic. This is what we call “shaping” the data. All data transformation business logic is encapsulated into this single component to keep the overall job design simpler to manage. The resulting filtered and transformed data is then staged into an intermediary table, which will be used in a subsequent Transformation Job. A couple of things to call attention to in this step are as follows.

Prepare: user_id

Matillion security controls enable user auditing: user_id screen

A “user_id” field has been added, which uniquely identifies a single record in the User Dimension table. In this instance, a Snowflake Sequence named “MTLN_INTERNAL_USER_DM_SEQ” was created and used to assign “user_id” values. As previously mentioned, because this dimension is Type 2, a single user can have multiple entries in this dimension table. The “user_id” field will be referenced in the related Audit Log Fact table, allowing an Audit Log event to be associated with a specific User record from a point in time.
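A hedged sketch of how such a sequence is created and then used to assign surrogate keys in Snowflake (all object names other than the sequence are illustrative):

CREATE SEQUENCE IF NOT EXISTS MTLN_INTERNAL_USER_DM_SEQ;

SELECT MTLN_INTERNAL_USER_DM_SEQ.NEXTVAL AS user_id,
       user_sh,
       name
FROM   stg_prepared_users;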

Prepare: start_ts

Matillion Security Controls enable audit logging: start_ts screen

The “start_ts” SCD field represents the point in time from which the Dimension row is active. The logic in the Calculator expression for this field accounts for whether the User Dimension is being initialized. If the dimension is being initialized, “start_ts” gets a default timestamp in the past; otherwise, it gets the value of the current timestamp.
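The Calculator expression for “start_ts” therefore looks roughly like this (Matillion ${variable} syntax; the specific “in the past” default is an assumption):

CASE WHEN ${init_dm} = 1
     THEN '1900-01-01 00:00:00'::TIMESTAMP
     ELSE '${curr_ts}'::TIMESTAMP
END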

Apply Dimension Changes

Matillion security controls enable user auditing: dimension changes screen

This Transformation Job takes the previously staged SCD Type 2 data and applies the appropriate data changes to the User Dimension table. Staged records will represent New, Changed or Deleted records. For New records, a row will be inserted into the User Dimension table. For Deleted records, the existing “active” row (is_active = TRUE) will be updated.  For Changed records, a new “active” record is inserted and the previously “active” record is updated.
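Conceptually, the two outputs of this Transformation Job translate into DML along these lines (a hedged sketch; the intermediary table name and change-indicator values are assumptions, not the article's exact objects):

-- Output 1: insert a new active row for every New or Changed record
INSERT INTO mtln_internal_user_dm
SELECT *
FROM   mtln_internal_user_dm_stage
WHERE  change_indicator IN ('N', 'C');

-- Output 2: expire the previously active row for every Changed or Deleted record
UPDATE mtln_internal_user_dm tgt
SET    is_active = FALSE,
       end_ts    = stg.start_ts
FROM   mtln_internal_user_dm_stage stg
WHERE  tgt.user_sh          = stg.user_sh
  AND  tgt.is_active        = TRUE
  AND  tgt.start_ts         < stg.start_ts
  AND  stg.change_indicator IN ('C', 'D');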

Transaction Control

Matillion Security Controls enable user auditing: transaction control screen

The Orchestration Job that calls the Transformation Job to Apply the SCD data changes includes Transaction control components (Begin, Commit, Rollback). When the Transformation Job executes, there are two separate SQL queries executed in the cloud data warehouse, one SQL query per output. The Transaction control logic ensures that when applying the SCD data changes, all operations must succeed for the data changes to be applied.  If there is a failure in one of the two outputs in the Transformation Job, nothing will get applied. This ensures that the data in the User Dimension table does not get into an undesirable state (i.e. more than one “active” record for a user).

Summary

In this article we defined a dimensional data model for storing Matillion ETL UserConfig and Audit Log data to make our data analytics ready. Then, we took a detailed look at a Matillion job that populates and manages Matillion User data in a Type 2 Slowly Changing Dimension.  By decomposing this job, we touched on techniques and points of consideration when implementing SCD design patterns in Matillion ETL. In the final part of our 3-part series on Matillion security controls and user auditing, we will provide a similar detailed look at using Matillion ETL Audit log data to populate Facts and Type 0 Dimensions.

To learn more about security controls and cloud security, download our ebook, The Data Leader’s Guide to Cloud Security and Data Architecture.
