Channel: DDL – Cloud Data Architect

DDL+ACID – You’ve heard of ‘transactional’, let’s talk ‘isolated’ (Tech Blog)


Author: mkysel.

In SQL you can perform a lot of DDL operations, such as creating or renaming tables, adding or removing columns, and more, and these DDL statements are most often used when creating or upgrading your application schema. However, there’s more to transactional DDL than just being able to apply large schema updates atomically. Transactional DDL also provides protection against failures: networks can fail, machines can fail, and NuoDB will still guarantee that all schema modifications are either applied in full or not applied at all. Essentially, this means that all operations in NuoDB aspire to be failure resistant. Learn how we’ve approached that when building our database.

In this article, we’ll focus on only one of the transactional aspects of ACID, namely Isolation. I will explain how it might impact your application code and how to avoid some of the caveats.

What Does Transactional DDL Mean?

What Is DDL?

DDL stands for Data Definition Language and is a group of SQL statements that modifies the metadata of the database. CREATE TABLE T1(I INT); is a DDL statement and so is ALTER TABLE T1 ADD COLUMN J INT;.

What is ‘Transactional’?

We are accustomed to talking about transactions in the context of DML (Data Manipulation Language), but the same principles apply when we talk about DDL transactions. A transaction must be ACID – Atomic, Consistent, Isolated, and Durable. In this case, that means that a transaction executes several statements, some of which are DDL, while treating them as a single operation that can either be rolled back or committed. 
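
To make that concrete, here is a minimal sketch of such a transaction (the table and column names are purely illustrative):

START TRANSACTION;
CREATE TABLE ACCOUNTS (ID INT, NAME STRING);           -- DDL
INSERT INTO ACCOUNTS (ID, NAME) VALUES (1, 'first');   -- DML in the same transaction
ALTER TABLE ACCOUNTS ADD COLUMN ACTIVE BOOLEAN;        -- more DDL
COMMIT;   -- everything above becomes visible at once; a ROLLBACK would undo all of it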

What is ‘Isolation’?

This article is focused on one aspect of transactionality: Isolation. A transaction is isolated from the rest of the system if its effects do not spill over to other transactions. You can pick various isolation levels that suit your needs the best. NuoDB supports Consistent Read and Read Committed. In this article, we’ll explore the various interactions of DML vs. DDL and isolation levels.

Does DDL have to be transactional?

There is no fundamental reason why DDL statements shouldn’t be transactional, but historically, databases haven’t provided this functionality. Even today, only a handful of databases provide truly transactional DDL, and it has limitations in most cases.

It is hard to say why the database vendors of the ’90s decided to exclude DDL from the ACIDity guarantees. The ISO/IEC 9075 (SQL) standard has never made any distinction between DML and DDL when it comes to transactions.

The lack of transactionality means that many application developers are accustomed to performing application upgrades during a database maintenance window. These windows are no longer tolerated in today’s always-up world.

Transactional DDL implies that DDL needs to happen in Isolation from other transactions that happen concurrently. This means that the metadata of the modified table needs to be versioned. MVCC is a natural precursor for transactional DDL. In a database that implements snapshot isolation, readers do not block writers and writers do not block readers. Using MVCC semantics for DDL means that DDL can happen under the covers while readers proceed with using the old snapshot of the metadata. Writers, on the other hand, are not allowed to proceed while DDL is happening.

If maintenance windows are not viable and thus DDL has to coexist with DML, proper Isolation becomes key. Without it, DML can observe the database in a semi-upgraded state that was never anticipated and hence never tested. This leads to unexpected or undesired behavior that can cause application outages and downtime.

One popular alternative in the databases that don’t support Transactional DDL is to commit every transaction before and after a DDL statement. This ensures that the DDL has the newest snapshot, and thanks to table locks, it removes the need for metadata versioning. It also shields the application developer from the convoluted case when DDL is executed in Consistent Read isolation level, which we will explore soon.

Autocommitting DDL is definitely viable and NuoDB offers a compatibility mode for those who are migrating from databases that use this. To enable auto-commit of DDL, execute the following statement:

set system property AUTOCOMMIT_DDL=true;

How to make DML isolated from DDL correctly

In this section, I use the terms update DML (insert, update, delete, select for update) and writer interchangeably. They both refer to a subset of all DML. Read DML (select) is not impacted by concurrent DDL.

If write DML were allowed to proceed while a DDL statement was modifying the same table, the database would have to give up its Isolation guarantees.

We are assuming that a table T1 (create table T1(i int);) has been created prior to both conflicting transactions.

You might notice that we never specify what the isolation level of the first transaction is. Isolation levels only impact the transaction that is “second,” as they allow the application developer to violate serializability and act on the results of transactions that have started concurrently and have been committed in the past. NuoDB does not support any isolation level lower than Read Committed, so it is impossible to observe uncommitted state.

Let me explain using a concrete conflict between update DML insert into T1(i) values(5) and DDL alter table T1 add column j int not null. Thanks to table locks, these two operations can never happen concurrently and order is enforced. Additionally, NuoDB supports two Isolation levels: Read Committed and Consistent Read, which further complicates the interaction.

DDL is first; Insert is Read Committed

If the insert is in Read Committed, it has to operate on a consistent snapshot of the database that is recalculated for every statement. This means that once the table lock is released by the DDL, the insert is allowed to reload the metadata and proceed. In this case, the snapshot of the insert will contain the new column and hence the insert will fail with Error 58000: illegal null in field J in table T1. This result might be surprising to application developers who did not anticipate a concurrent metadata modification.
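
A rough sketch of that interleaving, with the two sessions labeled purely for illustration:

-- Session A (DDL)
START TRANSACTION;
ALTER TABLE T1 ADD COLUMN J INT NOT NULL;   -- takes the table lock
-- Session B (Read Committed), concurrently
INSERT INTO T1 (I) VALUES (5);              -- blocks, waiting on the table lock
-- Session A
COMMIT;                                     -- releases the lock
-- Session B reloads the metadata, sees column J, and fails:
-- Error 58000: illegal null in field J in table T1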

DDL is first; Insert is Consistent Read

If the insert is in Consistent Read, it has to operate on a consistent snapshot of the database. NuoDB includes the metadata at the start of the transaction in the snapshot. This means that reads always return the same columns, even if concurrent metadata modifications are happening. For writes, enforcing a consistent snapshot leads to potential Consistency violations. If the insert happens after the DDL, it might insert NULL into a column that has been defined as NOT NULL. Clearly, this is a violation of the table constraints, hence the write needs to be aborted with Error 58000: Table USER.T1 has been changed.

Insert is first; DDL is Read Committed

After the insert completes, the table contains the value 5 in column i. There is no column J. If one attempted to add a column J, all rows would be implicitly expanded with the value of NULL. Since J is defined as NOT NULL in our scenario, the DDL will fail with Error 42000: field J doesn't allow NULL values, but a default value for the existing rows hasn't been specified. The application developer now has to rewrite the insert DML to include data for column J.
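
Alternatively, as the error message hints, the DDL itself can supply a default for the existing rows, assuming a default value makes sense for the application:

ALTER TABLE T1 ADD COLUMN J INT DEFAULT 0 NOT NULL;   -- existing rows get 0 instead of NULL, so the constraint holds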

Insert is first; DDL is Consistent Read

This scenario is probably the most confusing one. DDL, similarly to DML, contains a snapshot of the database. The snapshot contains all records that are visible at the start of the transaction. The newly inserted row is not visible to the consistent read DDL transaction because it happened after the transaction started. If DDL were allowed to operate on the row, it would be a violation of the Isolation level. On the other hand, if the DDL were allowed to proceed, it would have to ignore the row and not expand it with NULL. This would be a violation of the NOT NULL constraint and a violation of Consistency. To prevent this from happening, the DDL transaction needs to fail. This can be confusing if you have a workflow that looks like the following:

SQL> START TRANSACTION isolation level consistent READ;
SQL> SELECT * FROM T1;
SQL> ALTER TABLE T1 ADD COLUMN j INT NOT NULL;
Error 42000: FIELD J doesn't allow NULL values, but a default value for the existing rows hasn't been specified
SQL> SELECT * FROM T1;
SQL> ROLLBACK;

As you can see, the snapshot of the read DML does not contain the newly inserted record, yet the DDL is prevented. I believe that this is why some traditional databases decided to only allow transactional DDL in Read Committed transactions.

Why Is Transactional DDL Relevant?

So far we have talked about transactional DDL from the point of view of the database vendor. But the real use case comes from the point of view of the database administrator. When we apply ACIDity to DDL transactions, we realize that Atomicity is extremely useful. The definition states that:

“Each transaction is treated as a single “unit,” which either succeeds completely or fails completely.”

Transactional DDL gives database administrators the ability to perform multiple modifications (for example, several ALTER TABLE statements) in a single operation.
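
As an illustrative sketch (the table and index names are made up), such an upgrade might look like this:

START TRANSACTION;
ALTER TABLE ORDERS ADD COLUMN STATUS STRING;
ALTER TABLE ORDERS ADD COLUMN UPDATED_AT TIMESTAMP;
CREATE INDEX IDX_ORDERS_STATUS ON ORDERS (STATUS);
COMMIT;   -- all three changes appear together; on failure, none of them do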

For developers, the strong Isolation guarantees of transactional DDL make application development easier. An application can only observe the database in state A (before the upgrade) or in state B (after the upgrade) and will never see partial results. This reduces the required test matrix and increases confidence in the rolling upgrade procedure.

Stay tuned for future articles that explore the ability to roll back large schema modifications.

Summary

Implementing transactional DDL in a database requires a non-trivial amount of work and not all database vendors have decided to put in the effort. NuoDB supports transactional DDL and we believe in it as the base of our always-on promise. We’ll explore the competitive landscape and provide an overview of transactional DDL support in 2019 in the next article.

Applications change on a regular basis given the ever-changing business requirements. When writing an always-on application, you should expect concurrent/online metadata modifications by different actors. After reading our examples in this article, we recommend looking at your application code and asking yourself the question: “What will this query do if the metadata is changed concurrently?” 

Want to try NuoDB? Download our Community Edition today.

Dimitri Fontaine: The Art of PostgreSQL: The Transcript, part I


Feed: Planet PostgreSQL.

This article is a transcript of the talk I gave at Postgres Open 2019,
titled the same as the book: The Art of PostgreSQL. It’s available as a
video online at Youtube if you want to watch the slides and listen to it,
and it even has subtitles!



Some people still prefer to read the text, so here it is.




Hi everybody. So we’re going to talk about The Art of PostgreSQL. The idea
of this presentation is that it’s mostly oriented to Application Developers.

I’ve been a contributor for Postgres for a long time, started last century.
I work at Citus Data and we’ve been acquired by Microsoft. So nowadays it is
Azure Database for PostgreSQL HyperScale (Citus), or something like that.



One of the projects that I’m working on currently is named pg_auto_failover.
The idea is just exactly as the name says. In PostgreSQL we tend to like
boring names, so that you read the name and you know what it is about,
usually.

So it’s business continuity, it’s automating the failover. That’s all about
it. It’s on github, it’s completely Open Source, you can open issues, you
can send bug fixes if you want to, even new features if you fancy that. So
go have a look, it’s a project we did to simplify setting up HA for
PostgreSQL.



Another project I’ve been working on a lot in the past years is a migration
project for going from something else to PostgreSQL. The idea is that you
don’t have an excuse anymore to be using for example MySQL. Just use
PostgreSQL instead, it’s much better. But it’s not always easy to implement
that. So with pgloader in a single command line you give it a source
connection string, and a target connection string, the target should be
PostgreSQL. And then it’s going to read all the catalogs from the source
database, decide what the tables, the columns, the attributes, and the types are,
do the type mapping for you and load the data and then create the indexes in
parallel and etc etc. So it’s one command line and then your database is
running on PostgreSQL now. So no excuse, just do it. It supports MySQL,
SQLite, SQL Server, and some other input kinds.



Another project I’ve been working on is this book, The Art of PostgreSQL. We
have some copies left, maybe the last one or something like that, at the
booth. So if you’re interested show up there later. And the slides that
we’re going to go through are mostly extracted from the book. So it’s kind
of the same content.

So let’s get started now.

The first thing that for me is important as an application developer is why
are you using PostgreSQL? Often when you ask that question — and I used to
be a consultant before — and when you get around this question, most of them
developers they don’t really know why, you know, it’s in the stack, it’s
been deployed already, they have joined the project and they have to use it.

Some of them they’re like “Oh, I know why, it’s because it’s solving this
problem that is quite hard to solve for the application and we are using
PostgreSQL to do that.“ But often enough I heard that PostgreSQL is used to
solve storage. Which is surprising as an answer because it’s so wrong. It’s
not true. Storage in the 60s it was easy because at the time, with the
compute we had, if you would unplug it from the power socket, then anything
that was in memory would stay exactly the way it was. And you could re-plug
like a couple weeks later and it would be as it was two weeks before. And in
the 70s we switched to other technology where it was not true anymore, but
being able to serialize something that you had in memory to disk has never
been such a problem in computing science. It’s easy to do, everybody knows
how to do it, you don’t need PostgreSQL to do that.

If you are a Java shop, you can serialize your objects in XML and read them
back and that’s it. So if storage is the problem, you go use, for example in
the cloud, blob storage from Azure or maybe S3 from AWS or something else.
So that’s storage. PostgreSQL is not about storage.

PostgreSQL is about concurrency and isolation. The idea is: what happens
when you have more than one person trying to do the same thing, like two
updates concurrently? And the image is obviously the difference between
theory and practice: in theory it’s the same, but not in practice.

The main thing around concurrency and isolation within the context of the
database — Relational Database Management System, RDBMS — is that we provide
ACID guarantees. I guess everybody knows what ACID is?

Atomic basically means that if you have many things to do in the same
transaction and something goes wrong you can rollback. If you did two
inserts and one update and then you rollback then everything is cancelled.
You don’t have the situation where one of the inserts went through but not
the other one and now your database doesn’t make sense anymore. So that’s
pretty cool.
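
A tiny illustration of that, with a made-up table:

BEGIN;
INSERT INTO accounts (id, balance) VALUES (1, 100);
INSERT INTO accounts (id, balance) VALUES (2, 100);
UPDATE accounts SET balance = balance - 10 WHERE id = 1;
ROLLBACK;  -- none of the three changes above are kept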

Usually we don’t type in rollback. Sometimes we do when testing
interactively, but in the application, have you ever implemented a
transaction that would do a rollback in your transaction? Maybe not.

What happens is… file system is full. I know it’s 2019 but we still have
that problem in production sometimes. So file system is full, what’s next?
Well with an atomic system, the transaction is rolled back and never
happened. That’s it. So you’re safe.

Well PostgreSQL does something that almost no other system is able to do: it
supports transactions for DDL. So if you have an application script to
migrate from one version to the next version of the schema, you had a new
column, a new table, maybe a new index, something like that, then what
happens if the file system is full in the middle of the script?

If you’re not using PostgreSQL, and you had version 1 in production, the
script was to go from version 1 to version 2 and it failed in the middle. So
now you have a version that nobody has ever seen anywhere. No developer ever saw
that version which is now in production… if you don’t have transactions for
DDLs.

With PostgreSQL “file system is full” implies a rollback, you still have
version 1, don’t deploy the app yet, that’s it. Simple, done. So that’s one
of the reasons why you use PostgreSQL of course.
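
A hypothetical migration script, just to make the idea concrete:

BEGIN;
ALTER TABLE users ADD COLUMN last_login timestamptz;
CREATE TABLE audit_log (id bigserial PRIMARY KEY, event text);
CREATE INDEX ON users (last_login);
COMMIT;  -- if anything fails mid-script, the whole upgrade rolls back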

The C of ACID, it means consistency. Consistent means there are some
business rules that you know about and that you can share with PostgreSQL,
can explain to PostgreSQL, here’s what’s important for me to keep in mind
for the whole data set that we are going to manage; and you can have
PostgreSQL implement those guarantees for you.

So the first step for the consistency is the schema, and the data types.
Here we have a very simple table with two columns. Anything that goes into
those columns — here ID is an integer. If you have MySQL and you have an
integer column and you insert into it the string “banana”, then it will
happily take it and if you SELECT from it then it’s going to say zero. But
no errors whatsoever. It’s happy to work with that.

With PostgreSQL we don’t do that. So if you try to insert a “banana” into an
integer column, PostgreSQL will tell you “hey I don’t know what that is, but
it does not fit your model, please be careful”. And then we have constraints
like CHECK, NOT NULL, FOREIGN KEY, PRIMARY KEY… relations.
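
The “banana” case from a moment ago, as a quick sketch (the exact error
text depends on your PostgreSQL version):

CREATE TABLE foo (id integer, name text);
INSERT INTO foo (id, name) VALUES ('banana', 'some text');
-- ERROR:  invalid input syntax for type integer: "banana"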

We’ll get back to that; relations are the central concept of SQL, basically.
And some people think it’s because we have foreign keys but it’s not true.
A relation is just a mathematical concept where you have a set of elements
that all share the same properties. It’s called attribute domains in the
relational jargon and it means that it all looks the same. Here it’s an
integer and a text column, and anything that is in this table foo is
going to have an integer and a text, that’s it. All of them are the same.
That’s what a relation is.

So consistency is pretty important.

Then the I for ACID is isolation. It’s the other side of
atomicity. It’s a little bit more complex to understand sometimes.
Isolation means that while you are doing your queries, are you allowed to
see what is going on concurrently in the rest of the system?

So if you want to take a consistent backup for example, you need to make it
so that even if pg_dump is going to run for several hours because you have
terabytes of data, it needs to be a consistent snapshot of the production.
If during the backup someone else is doing inserts and updates and something
else, you don’t want those to be in the backup, because you want something
consistent. You want a snapshot that doesn’t move. You don’t want to see
everything that’s new. So pg_dump will typically use an isolation mode
where you don’t see the changes from the other transactions.

You can also do that in your application, and maybe it could be the default:
REPEATABLE READ. Or even SERIALIZABLE, but that one is different. REPEATABLE
READ might be what you expect from the database but it’s not the default.
The default is READ COMMITTED. So maybe you want to look into that.
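
Picking it is a per-transaction thing, for example (reusing the made-up
table from before):

BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT count(*) FROM foo;  -- this count stays stable for the whole transaction,
SELECT count(*) FROM foo;  -- even if other sessions commit new rows in between
COMMIT;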

Anyway, every transaction in Postgres can have a different isolation level.
pg_dump will be SERIALIZABLE while the rest of the system is REPEATABLE
READ or READ COMMITTED, depending. So that’s isolation. So you see that’s
very important, and that’s very hard to implement at the application level
and so maybe that’s why you’re using PostgreSQL actually.

And then of course it’s durable.

Do you know the little test to do with the power socket plug? Basically you
write a little client application that will only do INSERTs for example. And
you count how many times you got the COMMIT message back from Postgres.

Remember that when you say COMMIT, maybe the answer is going to be ROLLBACK.
Because there was a problem, Postgres was not in a position where it could
actually implement the COMMIT. “File system is full” is the easiest example
to have in mind. So you say COMMIT, maybe it’s ROLLBACK. So you count how
many times when you said COMMIT it was committed actually.

And then while the test is running you unplug the power socket from the
server. In the middle of the test. Then you plug again and you count what
you have on the server and what you have on the client. If it’s not the
same, there’s a bug somewhere. It’s not durable.

Durability means that anything that has been known to be committed by the
client should still be there when you do that. If it’s not, maybe the
hardware is faulty, maybe the BIOS configuration or maybe the kernel or OS
configuration is wrong. Maybe you did fsync = off in PostgreSQL, or maybe
you’re not using PostgreSQL. And then… yeah, don’t do that.

So that’s the basics around why you would use PostgreSQL. So to recap:
because you have transactions. And transactions are a short way to say you
are ACID-compliant. But be careful, because some systems are naming
themselves databases nowadays, the NoSQL systems in particular, and as
a developer, if you think about them as a database, you might be in trouble
because they are not ACID-compliant.

All of the NoSQL systems that you will find are going to implement some
trade-offs. The only one that is obvious is that they are not implementing SQL,
it’s No SQL. Okay. But also they don’t implement ACID usually. Take MongoDB
for example. It’s schemaless, that’s a feature. It means that you don’t have
consistency, so you lose the C of ACID. It doesn’t have transactions, so you
don’t have the A nor the I of ACID. No atomicity, no isolation. That leaves the
D of ACID, durability. It used to not implement that. Apparently they’ve
fixed it nowadays, but for a long time you wouldn’t have the D of ACID.

So maybe it’s fine to use it anyway in your application because it fits your
use-case. But as a developer if you think of a database as something that is
not ACID-compliant, because that’s how we are taught about databases
usually, and the system you use is actually not ACID-compliant, it means
that all those guarantees that you don’t have, either you don’t need them,
that’s cool, or if you need them, then you need to implement them yourself.

So that’s the main kicker of using PostgreSQL, is that you get everything
for free and it just works and it’s available and you can just, you know,
just care about the application.

And other good reasons to use it are written there and we’re going to see
about them. We’re going to see about why I say it’s object-oriented. We have
extensions in PostgreSQL, we’re going to see a couple examples. Rich
datatypes. You can do actually data processing in SQL and we’re going to see
what I mean with that. Etc etc.


That’s it for the first part of this presentation. We covered about 15 mins
of the 50 mins of this talk. I will publish the transcript for part II and
part III later next week, so stay tuned to this blog if you like this
content!

Also, as the content comes from my book anyway, you could also subscribe
below to get the free sample, or just go buy the book at the main home page
of this website: The Art of PostgreSQL.


How to Map MySQL’s TINYINT(1) to Boolean in jOOQ


Feed: Planet MySQL
Author: Lukas Eder

MySQL 8 does not yet support the BOOLEAN type as specified in the SQL standard. There is a DDL “type” called BOOL, which is just an alias for TINYINT:

create table t(b bool);

select 
  table_name, 
  column_name, 
  data_type, 
  column_type
from information_schema.columns
where table_name = 't';

The above produces:

TABLE_NAME|COLUMN_NAME|DATA_TYPE|COLUMN_TYPE|
----------|-----------|---------|-----------|
t         |b          |tinyint  |tinyint(1) |

Notice that BOOL translates to a specific “type” of TINYINT, a TINYINT(1), where we might be inclined to believe that the (1) corresponds to some sort of precision, as with NUMERIC types.

However, counterintuitively, that is not the case. It corresponds to the display width of the type, when fetching it, using some deprecated modes. Consider:

insert into t(b) values (0), (1), (10);
select * from t;

We’re getting:

b |
--|
 0|
 1|
10|

Notice also that MySQL can process non-boolean types as booleans. Running the following statement:

select * from t where b;

We’re getting:

b |
--|
 1|
10|

Using this column as a Boolean column in jOOQ

By default, jOOQ doesn’t recognise such TINYINT(1) columns as boolean columns, because it is totally possible that a user has created such a column without thinking of boolean types, as the above example has shown.

In previous versions of jOOQ, the data type rewriting feature could be used on arbitrary expressions that match the boolean column name, e.g. the below would treat all columns named "B" as BOOLEAN:


  
<forcedType>
  <name>BOOLEAN</name>
  <includeExpression>B</includeExpression>
</forcedType>

With jOOQ 3.12.0 (issue #7719), we can now match this display width as well for MySQL types. That way, you can write this single data type rewriting configuration to treat all integer types of display width 1 as booleans:


  
<forcedType>
  <name>BOOLEAN</name>
  <includeTypes>(?i:TINYINT\(1\))</includeTypes>
</forcedType>

Using this configuration in the code generator, the above query:

select * from t where b;

… can now be written as follows, in jOOQ

selectFrom(T).where(T.B).fetch();

Experimental Binary of Percona XtraDB Cluster 8.0


Feed: Planet MySQL
Author: MySQL Performance Blog

Percona is happy to announce the first experimental binary of Percona XtraDB Cluster 8.0 on October 1, 2019. This is a major step in tuning Percona XtraDB Cluster to be more cloud- and user-friendly. This release combines the updated and feature-rich Galera 4 with substantial improvements made by our development team.

Improvements and New Features

Galera 4, included in Percona XtraDB Cluster 8.0, has many new features. Here is a list of the most essential improvements:

  • Streaming replication supports large transactions
  • The synchronization functions allow action coordination (wsrep_last_seen_gtid, wsrep_last_written_gtid, wsrep_sync_wait_upto_gtid)
  • More granular and improved error logging. wsrep_debug is now a multi-valued variable to assist in controlling the logging, and logging messages have been significantly improved.
  • Some DML and DDL errors on a replicating node can either be ignored or suppressed. Use the wsrep_ignore_apply_errors variable to configure.
  • Multiple system tables help find out more about the state of the cluster.
  • The wsrep infrastructure of Galera 4 is more robust than that of Galera 3. It features a faster execution of code with better state handling, improved predictability, and error handling.

Percona XtraDB Cluster 8.0 has been reworked in order to improve security and reliability as well as to provide more information about your cluster:

  • There is no need to create a backup user or maintain the credentials in plain text (a security flaw). An internal SST user is created, with a random password for making a backup, and this user is discarded immediately once the backup is done.
  • Percona XtraDB Cluster 8.0 now automatically launches the upgrade as needed (even for minor releases). This avoids manual intervention and simplifies the operation in the cloud.
  • SST (State Snapshot Transfer) rolls back or fixes an unwanted action. It is no longer “a copy only block” but a smart operation that makes the best use of the copy phase.
  • Additional visibility statistics are introduced in order to obtain more information about Galera internal objects. This enables easy tracking of the state of execution and flow control.

Installation

You can only install this release from a tarball and it, therefore, cannot be installed through a package management system, such as apt or yum. Note that this release is not ready for use in any production environment.

Percona XtraDB Cluster 8.0 is based on the following:

Please be aware that this release will not be supported in the future, and as such, neither the upgrade to this release nor the downgrade from higher versions is supported.

This release is also packaged with Percona XtraBackup 8.0.5. All Percona software is open-source and free.

In order to experiment with Percona XtraDB Cluster 8.0 in your environment, download and unpack the tarball for your platform.

Note

Be sure to check your system and make sure that the packages which Percona XtraDB Cluster 8.0 depends on are installed.

For Debian or Ubuntu:

$ sudo apt-get install -y \
socat libdbd-mysql-perl \
rsync libaio1 libc6 libcurl3 libev4 libgcc1 libgcrypt20 \
libgpg-error0 libssl1.1 libstdc++6 zlib1g libatomic1

For Red Hat Enterprise Linux or CentOS:

$ sudo yum install -y openssl socat \
procps-ng chkconfig procps-ng coreutils shadow-utils \
grep libaio libev libcurl perl-DBD-MySQL perl-Digest-MD5 \
libgcc rsync libstdc++ libgcrypt libgpg-error zlib glibc openssl-libs

Help us improve our software quality by reporting any bugs you encounter using our bug tracking system. As always, thanks for your continued support of Percona!

Matillion Change Data Capture (CDC): How to Handle Source DDL Changes


Feed: Cloud Data Transformation Software | Matillion.
Author: Veronica Kupetz, Solution Architect


Change Data Capture (CDC) in Matillion is a great feature that will help keep your cloud data warehouse up to date with a source database (MySQL, PostgreSQL, Oracle, or Microsoft SQL Server). Setting it up is a fairly straightforward process. (This support article and our recent blog post about setting up CDC with an RDS MySQL source database are a few resources that can help.) But after you set up CDC, some questions may arise. A common question: How do we handle source data definition language (DDL) changes in CDC? Let’s talk about how to handle source DDL changes and apply them to the target. In addition, we can dive into modifying the CDC job to alert for failures.

DDL changes: A common cause of CDC failure

One reason that a Matillion CDC job or task can fail is because of source DDL changes. CDC in Matillion will not make those DDL changes from the source to the target. For example, if a new column is added to the source table, it will not be added to the target table in the cloud data warehouse. Once data gets loaded into that newly created column(s), the CDC task will fail.

The table(s) that failed can be queried in the mat_cdc_log table (select * from mat_cdc_log where cdc_table_status <> 0) that is created in the cloud data warehouse, or by viewing and expanding the Task History in the Matillion UI. As noted in the fourth step of “Update the CDC Orchestration Job” in the next section, “Alerting When CDC Fails”, the SNS Message Component can mention the table(s) that failed by referencing the “cdc_src_table_name” variable name in the message.

DMS Console – Reload Table Data

One option to handle DDL changes in the source table is to reload the table(s) that caused the failure within the DMS console in AWS.

  • In the AWS DMS console, select “Data Migration Tasks”. 
  • Select the replication (CDC) task that is running on the Matillion instance.
  • Under the “Table Statistics” menu, select the table you want to reload. Note: You can see what tables had DDL changes by seeing a non-zero value in the “DDLs” column.
  • Select “Reload table data”.

CDC DDL changes: reload table data

  • A confirmation screen will appear; select “Confirm”

CDC DDL changes: confirm reload table data

Restart CDC Task in Matillion UI

If there are other errors that arise due to a DDL change in one of the source tables, it may be best to “restart” the entire CDC task. This method may not be ideal. But at this point, restarting the task will ensure that data is current in the cloud data warehouse.

  • From the “Project” menu in Matillion, select “Manage Change Data Capture”
  • Select the CDC task to restart 
  • Select the “Play” icon

CDC DDL changes: select play icon

  • A “Run CDC Task” menu will open
  • Select “Restart” to restart the task

run CDC task restart

To avoid restarting a task for your entire replicated table set, you may want to break up the CDC task in Matillion into multiple tasks. For example, if there are tables that will probably not have DDL changes, they can be in the same task. Another task can house tables that will likely have DDL changes. Keep in mind that each task will create new queue(s) in addition to the queues for other tasks. When replicating several tables on an ongoing basis, it is best to have a dedicated Matillion instance for CDC.

Alerting When CDC Fails

When configuring CDC and creating a new CDC task, an Orchestration Job will be created that references the CDC shared job. This Orchestration Job will have the same name as the new CDC task. When you initially set up CDC, it creates a log table in your cloud data warehouse. You can query that table to check for any failures. You can also modify the CDC job so that it emails you if a failure occurs. Use the SNS Message Component to set up the email alert, following the steps below.

Create an SNS Topic in AWS

  • In the AWS console, select “Services” and search for “SNS” (Simple Notification Service) 
  • Select “Topics” from the menu in the upper right hand corner of the console.
  • Select “Create topic”
  • Name the topic appropriately. Other configuration settings can be applied based on your requirements; otherwise, leave the initial settings
  • Hit the “Create topic” button

create topic in AWS

Create an SNS Subscription in AWS

  • In the AWS SNS Console, select “Subscriptions”
  • Select “Create subscription”
  • Choose your newly created Topic from the “Topic ARN” dropdown menu
  • Designate “Email” as the “Protocol”
  • Enter the appropriate email address for the “Endpoint”
  • Hit the “Create subscription” button. Note that after the subscription is created, it must be confirmed. This can be done either from the email or through the AWS SNS Subscriptions menu by selecting “Confirm subscription”

create subscription

Update the CDC Orchestration Job

  • Open the CDC job (same name as the CDC task) 
  • Drag an “SNS Message” component to the canvas 
  • Attach the component to the “CDC Sync To Target” component. There are 3 options when attaching. Use the red failure option so it can send a message if there is a failure 
  • Select the newly created Topic and edit the “Subject” and “Message” properties with appropriate messaging. The parameter values for the failed table name can be referenced in the message. For example: ${cdc_src_db_database}.${cdc_src_table_name} failed to load
  • Note that this can be done similarly for a successful run (green success option), but the “SNS Message” component will need to be updated with the appropriate message

send failure alert

Learn more about automation

To learn more about CDC and other ways that Matillion helps you automate data transformation, download our ebook, Accelerate Time to Insight.

MySQL 8.0 Clone Plugin and its internal process.


Feed: Planet MySQL
Author: MyDBOPS

MySQL 8 has recently released the clone plugin, which makes a DBA’s task of rebuilding DB servers much easier.

  • Cloning is the process of creating an exact copy of the original. In technical terms, cloning is equivalent to (Backup + Recovery); MySQL database cloning has traditionally required a sequence of actions to be performed manually or in a scripted fashion, with or without tools involved.
  • Cloning is the first step when you want to configure a replication slave or join a new server to an InnoDB Cluster. There was no native support for auto provisioning earlier. Percona XtraDB Cluster (MySQL + Galera Cluster) does cloning using the xtrabackup tool by default when a new node joins the cluster.
  • Now MySQL has simplified this task. In this post, we will see how to clone a database using the clone plugin and how it works internally.

Clone Plugin :

  • The Clone Plugin was bundled with MySQL 8.0.17, which enables automatic node provisioning from an existing node (donor).
  • The clone plugin permits cloning data locally or from a remote MySQL server instance. The cloned data is a physical snapshot of data stored in InnoDB.

Types of cloning :

  1. Remote cloning
  2. Local cloning

Remote Cloning :

  • The remote cloning operation is initiated on the local server (recipient); cloned data is transferred over the network from the remote server (donor) to the recipient.
  • By default, a remote cloning operation removes the data in the recipient data directory and replaces it with the cloned data.
  • Optionally, you can clone data to a different directory on the recipient to avoid removing existing data.

Local Cloning :

  • The clone plugin permits cloning data locally. Cloned data is a physical snapshot of data stored in InnoDB that includes schemas, tables, tablespaces, and data dictionary.
  • The cloned data comprises a fully functional data directory, which permits using the clone plugin for MySQL server provisioning.

Plugin Installation :

  • To load the plugin at server startup, we need to add the following to the my.cnf file and restart the server for the new settings to take effect.
[mysqld]
plugin-load-add=mysql_clone.so
clone=FORCE_PLUS_PERMANENT

Runtime Plugin installation :

  • We can load the plugin at runtime using the statement below:
mysql> install plugin clone soname 'mysql_clone.so';
Query OK, 0 rows affected (0.27 sec)
  • INSTALL PLUGIN also registers the plugin in the mysql.plugin system table so that it is loaded automatically on subsequent server startups.
  • To check whether the plugin is loaded we can use information_schema.
mysql> select plugin_name,plugin_status from information_schema.plugins where plugin_name='clone';
+-------------+---------------+
| plugin_name | plugin_status |
+-------------+---------------+
| clone       | ACTIVE        |
+-------------+---------------+
1 row in set (0.00 sec)

Cloning Remote Data :

Remote Cloning Prerequisites

1) To perform a cloning operation, the clone plugin must be active on both the donor and recipient MySQL servers.

2) A MySQL user on the donor and recipient is required for executing the cloning operation. It’s called the “clone user”.

3) The donor and recipient must run the same MySQL server version, 8.0.17 or higher.

4) The donor and recipient MySQL server instances must run on the same operating system and platform.


Required Privileges :

1) The donor node clone user requires the “BACKUP_ADMIN” privilege for accessing and transferring data from the donor, and for blocking DDL during the cloning operation.

2) On the recipient, the clone user requires the “CLONE_ADMIN” privilege for replacing recipient data, blocking DDL during the cloning operation, and automatically restarting the server.

Step 1 :

  • Log in to the donor node and create a new clone user with the required privilege.
mysql> create user 'mydbops_clone_user'@'%' identified by 'Mydbops@8017';
Query OK, 0 rows affected (0.04 sec)

mysql> grant backup_admin on *.* to 'mydbops_clone_user'@'%';
Query OK, 0 rows affected (0.00 sec)

mysql> show grants for 'mydbops_clone_user'@'%';
+-------------------------------------------------------+
| Grants for mydbops_clone_user@%                       |
+-------------------------------------------------------+
| GRANT USAGE ON *.* TO `mydbops_clone_user`@`%`        |
| GRANT BACKUP_ADMIN ON *.* TO `mydbops_clone_user`@`%` |
+-------------------------------------------------------+
2 rows in set (0.01 sec)

Step 2 :

  • Log in to the recipient node and create a new clone user with the required privilege.
mysql> create user 'mydbops_clone_user'@'%' identified by 'Mydbops@8017';
Query OK, 0 rows affected (0.04 sec)

mysql> grant clone_admin on *.* to 'mydbops_clone_user'@'%';
Query OK, 0 rows affected (0.01 sec)

mysql> show grants for 'mydbops_clone_user'@'%';
+------------------------------------------------------+
| Grants for mydbops_clone_user@%                      |
+------------------------------------------------------+
| GRANT USAGE ON *.* TO `mydbops_clone_user`@`%`       |
| GRANT CLONE_ADMIN ON *.* TO `mydbops_clone_user`@`%` |
+------------------------------------------------------+
2 rows in set (0.00 sec)

Step 3 :

  • By default, a remote cloning operation removes the data in the recipient data directory and replaces it with the cloned data. By cloning to a named directory, you can avoid removing existing data from the recipient data directory.
  • Here I cloned the remote server data to a different location using the “DATA DIRECTORY” option.
mysql> clone instance from mydbops_clone_user@192.168.33.11:3306 identified by 'Mydbops@8017' data directory='/var/lib/mysql_backup/mysql';
Query OK, 0 rows affected (4.94 sec)
[root@mydbopslabs12 mysql]# pwd
/var/lib/mysql_backup/mysql
[root@mydbopslabs12 mysql]# ls -lrth
total 152M
drwxr-x---. 2 mysql mysql 6 Aug 25 09:12 mysql
drwxr-x---. 2 mysql mysql 28 Aug 25 09:12 sys
drwxr-x---. 2 mysql mysql 30 Aug 25 09:12 accounts
-rw-r-----. 1 mysql mysql 3.4K Aug 25 09:12 ib_buffer_pool
-rw-r-----. 1 mysql mysql 12M Aug 25 09:12 ibdata1
-rw-r-----. 1 mysql mysql 23M Aug 25 09:12 mysql.ibd
-rw-r-----. 1 mysql mysql 10M Aug 25 09:12 undo_002
-rw-r-----. 1 mysql mysql 10M Aug 25 09:12 undo_001
-rw-r-----. 1 mysql mysql 48M Aug 25 09:12 ib_logfile0
-rw-r-----. 1 mysql mysql 48M Aug 25 09:12 ib_logfile1
drwxr-x---. 2 mysql mysql 89 Aug 25 09:12 #clone

Local Cloning :

  • Cloning data from the local MySQL data directory to another directory on the same server where the MySQL server instance runs.

Step 1 :

mysql> grant BACKUP_ADMIN ON *.* TO 'mydbops_clone_user'@'%';
Query OK, 0 rows affected (0.10 sec)

Step 2 :

mysql> clone local data directory='/vagrant/clone_backup/mysql';
Query OK, 0 rows affected (3.94 sec)

Note :

  • The MySQL server must have the necessary write access to create the directory.
  • A local cloning operation does not support cloning of user-created tables or tablespaces that reside outside of the data directory.

How does the clone Plugin Works ?

  • I have two standalone servers with the same configuration.
1) 192.168.33.25 -
   * 2 core
   * 4GB RAM
   * 50 GB SSD

2) 192.168.33.26 -
   * 2 core
   * 4GB RAM
   * 50 GB SSD
  • I installed MySQL 8.0.17 and enabled the clone plugin on both servers, and created the above-mentioned users on the donor & recipient nodes.
Donor  192.168.33.25
Recipient 192.168.33.26

Step 1 :

  • I created the mydbops database with 3 tables, then loaded 2M records into each table on the donor node.

Step 2 :

  • I created another database called sysbench and started loading data into it.

Example :

[root@mydbopslabs25 sysbench]# sysbench oltp_insert.lua --table-size=2000000 --num-threads=2 --rand-type=uniform --db-driver=mysql --mysql-db=sysbench --tables=10 --mysql-user=test --mysql-password=Secret!@817 prepare
WARNING: --num-threads is deprecated, use --threads instead
sysbench 1.0.17 (using system LuaJIT 2.0.4)

Initializing worker threads...

Creating table 'sbtest1'...
Creating table 'sbtest2'...
Inserting 2000000 records into 'sbtest1'
Inserting 2000000 records into 'sbtest2'
Creating a secondary index on 'sbtest1'...
Creating a secondary index on 'sbtest2'...
.
.
.
Inserting 2000000 records into 'sbtest8'
.
.
Inserting 2000000 records into 'sbtest10'
Creating a secondary index on 'sbtest9'...
Creating a secondary index on 'sbtest10'...

Step 3 :

  • At the same time, I added the address of the donor MySQL server instance (with port) on the recipient node.

Example :

mysql> set global clone_valid_donor_list='192.168.33.25:3306';
Query OK, 0 rows affected (0.01 sec)

Step 4:

  • Initiated the cloning process on the recipient node.
mysql> clone instance from mydbops_clone_user@192.168.33.25:3306 identified by 'Mydbops@123';
Query OK, 0 rows affected (4 min 20.48 sec)
  • While the cloning process was running, I analysed the MySQL data directory to see how it clones the data and how it replaces the data directory files.
  • During this process it will not overwrite the existing undo & redo log files. It will create new files like this:
-rw-r-----. 1 mysql mysql 23M Aug 25 10:44 mysql.ibd.#clone
-rw-r-----. 1 mysql mysql 5.3K Aug 25 10:44 ib_buffer_pool.#clone
-rw-r-----. 1 mysql mysql 12M Aug 25 10:45 ibdata1.#clone
-rw-r-----. 1 mysql mysql 40M Aug 25 10:48 undo_001.#clone
-rw-r-----. 1 mysql mysql 40M Aug 25 10:48 undo_002.#clone
  • Inside the data directory it will create a #clone directory, where the following files are created:

1.#view_progress: persists performance_schema.clone_progress data

2.#view_status: persists performance_schema.clone_status data

3.#status_in_progress: Temporary file that exists when clone in progress

4.#status_error: Temporary file to indicate incomplete clone.

5.#status_recovery: Temporary file to hold recovery status information

6.#new_files: List of all files created during clone

7.#replace_files: List of all files to be replaced during recovery

  • Once the cloning process is completed, it will swap the files and restart the MySQL service.
  • During the cloning we are still able to access the data inside MySQL on the recipient node. It will close the connection while restarting the mysqld service (swapping the files).

Example :

[root@mydbopslabs26 vagrant]# mysql -e "select count(*) from mydbops.t1;select sleep(30);select count(*) from mydbops.t1;"
+----------+
| count(*) |
+----------+
| 2000000  |
+----------+
+-----------+
| sleep(30) |
+-----------+
| 0         |
+-----------+
ERROR 2006 (HY000) at line 1: MySQL server has gone away
  • After the cloning process completes, a few statistics are maintained in the #clone directory.
  • This directory is located inside the MySQL data directory.

1) #view_status

  • This file records the donor node host and MySQL port details.

2) #view_progress

  • This file records the progress of the cloning operation.

Example :

2 1 1568463987624627 1568463988356572 0 0 0
2 2 1568463988357856 1568464039117879 1630655790 1630655790 1630750990
2 2 1568464039120173 1568464039648790 0 0 197
  • Here "1568464039120173" is a Unix epoch timestamp in microseconds.

3) #status_recovery

  • This file contains the binlog coordinates.

Example :

./binlog.000011
190479203

Note :

  • These statistics are also available from the Performance Schema.

Page Tracking :

How are the active changes to the database tracked?

  • The pages modified during the cloning process are tracked either when mini-transactions (mtr) add them to the flush list or when they are flushed to disk by the I/O threads.

Consistency of this phase is defined as follows:

  • At the start, it guarantees to track all pages that are not yet flushed. All flushed pages would be included in “FILE COPY”.
  • At the end, it ensures that pages are tracked at least up to the checkpoint LSN. All modifications after the checkpoint would be included in “REDO COPY”.

Monitoring Cloning Operations:

Is it possible to monitor the cloning progress ?

  • Yes. A cloning operation may take a long or short time to complete, depending on the amount of data and other factors related to data transfer.
  • You can monitor the status and progress of a cloning operation using the Performance Schema.
  • MySQL 8.0.17 introduces new clone tables as well as clone stage instrumentation for this purpose.

Note :

  • The clone_status and clone_progress Performance Schema tables can be used to monitor a cloning operation on the recipient MySQL server instance only.
  • The clone_status table provides the state of the current or last executed cloning operation.
  • A clone operation has four possible states:
    • Not Started
    • In Progress
    • Completed
    • Failed.

Example :

mysql> select stage,state,begin_time as start_time,end_time,data,network from performance_schema.clone_progress;
+-----------+-------------+----------------------------+----------------------------+----------+----------+
| stage     | state       | start_time                 | end_time                   | data     | network  |
+-----------+-------------+----------------------------+----------------------------+----------+----------+
| DROP DATA | Completed   | 2019-08-25 09:27:53.725694 | 2019-08-25 09:27:53.922072 | 0        |  0       |
| FILE COPY | Completed   | 2019-08-25 09:27:53.922424 | 2019-08-25 09:27:54.651132 | 57904527 | 57915509 |
| PAGE COPY | Completed   | 2019-08-25 09:27:54.651463 | 2019-08-25 09:27:54.756606 | 0        | 99       |
| REDO COPY | Completed   | 2019-08-25 09:27:54.756926 | 2019-08-25 09:27:54.858837 | 2560     | 3031     |
| FILE SYNC | Completed   | 2019-08-25 09:27:54.859098 | 2019-08-25 09:27:55.273789 | 0        | 0        |
| RESTART   | Not Started | NULL                       | NULL                       | 0        | 0        |
| RECOVERY  | Not Started | NULL                       | NULL                       | 0        | 0        |
+-----------+-------------+----------------------------+----------------------------+----------+----------+
7 rows in set (0.00 sec)
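The overall state of the operation can be checked from the clone_status table on the recipient in the same way. A minimal sketch (column names as per the Performance Schema documentation):

mysql> select state,source,error_no,error_message from performance_schema.clone_status;

The state column reports one of the four states listed above, and source identifies the donor instance the data was cloned from.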

Snapshot Status :

INIT :

  • The clone object is initialized and is identified by a Donor.

FILE COPY :

  • The state changes from INIT to “FILE COPY” when the snapshot_copy interface is called.
  • Before making the state change we start “Page Tracking” at LSN “CLONE START LSN”.
  • In this state we copy all database files and send them to the recipient.

PAGE COPY :

  • The state changes from “FILE COPY” to “PAGE COPY” after all files are copied and sent.
  • Before making the state change we start “Redo Archiving” at LSN “CLONE FILE END LSN” and stop “Page Tracking”.
  • In this state, all modified pages as identified by Page IDs between “CLONE START LSN” and “CLONE FILE END LSN” are read from “buffer pool” and sent.
  • We would sort the pages by space ID, page ID to avoid random read(donor) and random write(recipient) as much as possible.

REDO COPY :

  • The state changes from “PAGE COPY” to “REDO COPY” after all modified pages are sent.
  • Before making the state change we stop “Redo Archiving” at LSN “CLONE LSN”.
  • This is the LSN of the cloned database. We would also need to capture the replication coordinates at this point in future.
  • It should be the replication coordinate of the last committed transaction up to the “CLONE LSN”.
  • We send the redo logs from archived files in this state from “CLONE FILE END LSN” to “CLONE LSN” before moving to “Done” state.

Done :

  • The clone object is kept in this state till destroyed by snapshot_end() call.

Performance Schema to Monitor Cloning:

  • There are three stages of events for monitoring progress of a cloning operation.
  • Each stage event reports WORK_COMPLETED and WORK_ESTIMATED values. Reported values are revised as the operation progresses.

1)stage/innodb/clone (file copy) :

    • Indicates progress of the file copy phase of the cloning operation.
    • The number of files to be transferred is known at the start of the file copy phase, and the number of chunks is estimated based on the number of files.

2) stage/innodb/clone (page copy) :

  • Indicates progress of the page copy phase of cloning operation.
  • Once the file copy phase is completed, the number of pages to be transferred is known, and WORK_ESTIMATED is set to this value.

3) stage/innodb/clone (redo copy) :

  • Indicates progress of the redo copy phase of cloning operation.
  • Once the page copy phase is completed, the number of redo chunks to be transferred is known, and WORK_ESTIMATED is set to this value.

Enabling Monitoring 

mysql> update performance_schema.setup_instruments set ENABLED='YES' where NAME LIKE 'stage/innodb/clone%';
Query OK, 0 rows affected (0.00 sec)
Rows matched: 3 Changed: 3 Warnings: 0
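With the instruments enabled (and the events_stages consumers enabled as well), the progress of the currently running stage can be followed from the stage event tables. A hedged example, using the standard Performance Schema stage-event query from the MySQL documentation rather than output captured on this test setup:

mysql> update performance_schema.setup_consumers set ENABLED='YES' where NAME like 'events_stages_%';

mysql> select event_name,work_completed,work_estimated from performance_schema.events_stages_current where event_name like 'stage/innodb/clone%';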

Replication Configuration :

  • The clone plugin supports replication. In addition to cloning data, a cloning operation extracts and transfers replication coordinates from the donor and applies them on the recipient.
  • The clone plugin for provisioning is considerably faster and more efficient than replicating a large number of transactions.
  • Both binary log position and GTID coordinates are extracted and transferred from the donor MySQL server instance.

Binlog and position :

  • The binlog file name and position are stored in the clone_status table. Check that this log file and position are still available on the donor node.
mysql> select binlog_file,binlog_position from performance_schema.clone_status;
+------------------+-----------------+
| binlog_file      | binlog_position |
+------------------+-----------------+
| mysql-bin.000479 | 483007997       |
+------------------+-----------------+
  • If you are using GTIDs, use the query below:
mysql> select @@global.gtid_executed;
  • Here I am using binlog coordinates for replication:
mysql> change master to master_host ='192.168.33.11', master_port =3306,master_log_file ='mysql-bin.000479',master_log_pos =483007997;

mysql> start slave user='repl' password='Repl@123';
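If the donor uses GTID-based replication instead, the cloned data already carries the donor's gtid_executed set, so the recipient can be attached with auto-positioning rather than explicit binlog coordinates. A minimal sketch, assuming the donor host used in this post and the same repl user as above (adjust host and credentials to your environment):

mysql> change master to master_host='192.168.33.25', master_port=3306, master_auto_position=1;

mysql> start slave user='repl' password='Repl@123';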

Limitations :

  • The clone plugin has some limitations:
  • DDL, including TRUNCATE TABLE, is not permitted during a cloning operation. Concurrent DML is permitted.
  • An instance cannot be cloned from a different MySQL server version. The donor and recipient must have the same MySQL server version.
  • The clone plugin does not support cloning of binary logs.
  • The clone plugin only clones data stored in InnoDB. Data in other storage engines, such as MyISAM and CSV tables, is not cloned.

Conclusion :

  • I believe creating replicas has become much easier with the help of the MySQL 8.0.17 clone plugin.
  • The clone plugin can be used not only to set up asynchronous replicas but also to provision Group Replication members.


How ironSource built a multi-purpose data lake with Upsolver, Amazon S3, and Amazon Athena

$
0
0

Feed: AWS Big Data Blog.

ironSource, in their own words, is the leading in-app monetization and video advertising platform, making free-to-play and free-to-use possible for over 1.5B people around the world. ironSource helps app developers take their apps to the next level, including the industry’s largest in-app video network. Over 80,000 apps use ironSource technologies to grow their businesses.

The massive scale in which ironSource operates across its various monetization platforms—including apps, video, and mediation—leads to millions of end-devices generating massive amounts of streaming data. They need to collect, store, and prepare data to support multiple use cases while minimizing infrastructure and engineering overheads.

This post discusses the following:

  • Why ironSource opted for a data lake architecture based on Amazon S3.
  • How ironSource built the data lake using Upsolver.
  • How to create outputs to analytic services such as Amazon Athena, Amazon ES, and Tableau.
  • The benefits of this solution.

Advantages of a data lake architecture

After working for several years in a database-focused approach, the rapid growth in ironSource’s data made their previous system unviable from a cost and maintenance perspective. Instead, they adopted a data lake architecture, storing raw event data on object storage, and creating customized output streams that power multiple applications and analytic flows.

Why ironSource chose an AWS data lake

A data lake was the right solution for ironSource for the following reasons:

  • Scale – ironSource processes 500K events per second and over 20 billion events daily. The ability to store near-infinite amounts of data in S3 without preprocessing the data is crucial.
  • Flexibility – ironSource uses data to support multiple business processes. Because they need to feed the same data into multiple services to provide for different use cases, the company needed to bypass the rigidity and schema limitations entailed by a database approach. Instead, they store all the original data on S3 and create ad-hoc outputs and transformations as needed.
  • Resilience – Because all historical data is on S3, recovery from failure is easier, and errors further down the pipeline are less likely to affect production environments.

Why ironSource chose Upsolver

Upsolver’s streaming data platform automates the coding-intensive processes associated with building and managing a cloud data lake. Upsolver enables ironSource to support a broad range of data consumers and minimize the time DevOps engineers spend on data plumbing by providing a GUI-based, self-service tool for ingesting data, preparing it for analysis, and outputting structured tables to various query services.

Key benefits include the following:

  • Self-sufficiency for data consumers – As a self-service platform, Upsolver allows BI developers, Ops, and software teams to transform data streams into tabular data without writing code.
  • Improved performance – Because Upsolver stores files in optimized Parquet storage on S3, ironSource benefits from high query performance without manual performance tuning.
  • Elastic scaling – ironSource is in hyper-growth, so needs elastic scaling to handle increases in inbound data volume and peaks throughout the week, reprocessing of events from S3, and isolation between different groups that use the data.
  • Data privacy – Because ironSource’s VPC deploys Upsolver with no access from outside, there is no risk to sensitive data.

This post shows how ironSource uses Upsolver to build, manage, and orchestrate its data lake with minimal coding and maintenance.

Solution Architecture

The following diagram shows the architecture ironSource uses:

Figure: architecture diagram. Apache Kafka streams into Upsolver (stream ingestion, schemaless data management, and stateful data processing). Upsolver writes both raw data and Parquet files to Amazon S3 and feeds query engines (Athena, Redshift, Elasticsearch), which serve use cases such as product analytics, campaign performance, and customer dashboards.

Streaming data from Kafka to Upsolver and storing on S3

Apache Kafka streams data from ironSource’s mobile SDK at a rate of up to 500K events per second. Upsolver pulls data from Kafka and stores it in S3 within a data lake architecture. It also keeps a copy of the raw event data while making sure to write each event exactly one time, and stores the same data as Parquet files that are optimized for consumption.

Building the input stream in Upsolver:

Using the Upsolver GUI, ironSource connects directly to the relevant Kafka topics and writes them to S3 precisely one time. See the following screenshot.

(Screenshot: the Upsolver UI for connecting to Kafka topics and creating the S3 input stream.)

After the data is stored in S3, ironSource can proceed to operationalize the data using a wide variety of databases and analytic tools. The next steps cover the most prominent tools.

Output to Athena

To understand production issues, developers and product teams need access to data. These teams can work with the data directly and answer their own questions by using Upsolver and Athena.

Upsolver simplifies and automates the process of preparing data for consumption in Athena, including compaction, compression, partitioning, and creating and managing tables in the AWS Glue Data Catalog. ironSource’s DevOps teams save hundreds of hours on pipeline engineering. Upsolver’s GUI creates each table one time, and from that point onwards, data consumers are entirely self-sufficient. To ensure queries in Athena run fast and at minimal cost, Upsolver also enforces performance-tuning best practices as data is ingested and stored on S3. For more information, see Top 10 Performance Tuning Tips for Amazon Athena.

Athena’s serverless architecture further complements this independence, which means there’s no infrastructure to manage and analysts don’t need DevOps to use Amazon Redshift or query clusters for each new question. Instead, analysts can indicate the data they need and get answers.

Sending tables to Athena in Upsolver

In Upsolver, you can declare tables with associated schema using SQL or the built-in GUI. You can expose these tables to Athena through the AWS Glue Data Catalog. Upsolver stores Parquet files in S3 and creates the appropriate table and partition information in the AWS Glue Data Catalog by using Create and Alter DDL statements. You can also edit these tables with Upsolver Output to add, remove, or change columns. Upsolver automates the process of recreating table data on S3 and altering the metadata in the AWS Glue Data Catalog.
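To illustrate what this looks like at the DDL level, the statements issued against the AWS Glue Data Catalog are conceptually ordinary Athena DDL. The following is a hypothetical sketch only (the table, columns, and S3 location are made up and are not the exact statements Upsolver generates):

CREATE EXTERNAL TABLE app_events (
  event_id   string,
  user_id    string,
  event_type string
)
PARTITIONED BY (event_date string)
STORED AS PARQUET
LOCATION 's3://example-data-lake/events/';

-- register a newly written partition
ALTER TABLE app_events ADD IF NOT EXISTS
  PARTITION (event_date = '2019-09-01')
  LOCATION 's3://example-data-lake/events/event_date=2019-09-01/';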

Creating the table

(Screenshot: the Upsolver UI for creating the table.)

Sending the table to Amazon Athena

(Screenshot: the Upsolver UI for sending the table to Amazon Athena.)

Editing the table option for Outputs

(Screenshot: the Upsolver UI showing the edit-table option for Outputs.)

Modifying an existing table in the Upsolver Output

(Screenshot: the Upsolver UI for modifying an existing table in the Upsolver Output.)

Output to BI platforms

ironSource’s BI analysts use Tableau to query and visualize data using SQL. However, performing this type of analysis on streaming data may require extensive ETL and data preparation, which can limit the scope of analysis and create reporting bottlenecks.

ironSource’s cloud data lake architecture enables BI teams to work with big data in Tableau. They use Upsolver to enrich and filter data and write it to Redshift to build reporting dashboards, or send tables to Athena for ad-hoc analytic queries. Tableau connects natively to both Redshift and Athena, so analysts can query the data using regular SQL and familiar tools, rather than relying on manual ETL processes.

Creating a reduced stream for Amazon ES

Engineering teams at ironSource use Amazon ES to monitor and analyze application logs. However, as with any database, storing raw data in Amazon ES is expensive and can lead to production issues.

Because a large part of these logs are duplicates, Upsolver deduplicates the data. This reduces Amazon ES costs and improves performance. Upsolver cuts down the size of the data stored in Amazon ES by 70% by aggregating identical records. This makes it viable and cost-effective despite generating a high volume of logs.

To do this, Upsolver adds a calculated field to the event stream, which indicates whether a particular log is a duplicate. If so, it filters the log out of the stream that it sends to Amazon ES.

Creating the calculated field

(Screenshot: the Upsolver UI for creating the calculated field.)

Filtering using the calculated field

(Screenshot: the Upsolver UI for filtering using the calculated field.)

Conclusion

Self-sufficiency is a big part of ironSource’s development ethos. In revamping its data infrastructure, the company sought to create a self-service environment for dev and BI teams to work with data, without becoming overly reliant on DevOps and data engineering. Data engineers can now focus on features rather than building and maintaining code-driven ETL flows.

ironSource successfully built an agile and versatile architecture with Upsolver and AWS data lake tools. This solution enables data consumers to work independently with data, while significantly improving data freshness, which helps power both the company’s internal decision-making and external reporting.

Some of the results in numbers include:

  • Thousands of engineering hours saved – ironSource’s DevOps and data engineers save thousands of hours that they would otherwise spend on infrastructure by replacing manual, coding-intensive processes with self-service tools and managed infrastructure.
  • Fees reduction – Factoring infrastructure, workforce, and licensing costs, Upsolver significantly reduces ironSource’s total infrastructure costs.
  • 15-minute latency from Kafka to end-user – Data consumers can respond and take action with near real-time data.
  • 9X increase in scale – Currently at 0.5M incoming events/sec and 3.5M outgoing events/sec.

“It’s important for every engineering project to generate tangible value for the business,” says Seva Feldman, Vice President of Research and Development at ironSource Mobile. “We want to minimize the time our engineering teams, including DevOps, spend on infrastructure and maximize the time spent developing features. Upsolver has saved thousands of engineering hours and significantly reduced total cost of ownership, which enables us to invest these resources in continuing our hypergrowth rather than data pipelines.”

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.


About the Authors


Seva Feldman is Vice President of R&D at ironSource Mobile.
With over two decades of experience in senior architecture, DevOps, and engineering roles, Seva is an expert in turning operational challenges into opportunities for improvement.

Eran Levy is the Director of Marketing at Upsolver.

Roy Hasson is the Global Business Development Lead of Analytics and Data Lakes at AWS. He works with customers around the globe to design solutions to meet their data processing, analytics, and business intelligence needs. Roy is a big Manchester United fan, cheering his team on and hanging out with his family.


A Guide to MySQL Galera Cluster Streaming Replication: Part One

$
0
0

Feed: Planet MySQL
;
Author: Severalnines
;

Streaming Replication is a new feature which was introduced with the 4.0 release of Galera Cluster. Galera uses replication synchronously across the entire cluster, but before this release write-sets greater than 2GB were not supported. Streaming Replication allows you to now replicate large write-sets, which is perfect for bulk inserts or loading data to your database.

In a previous blog we wrote about Handling Large Transactions with Streaming Replication and MariaDB 10.4, but as of writing this blog Codership had not yet released their version of the new Galera Cluster. Percona has, however, released their experimental binary version of Percona XtraDB Cluster 8.0 which highlights the following features…

  • Streaming Replication supporting large transactions

  • The synchronization functions allow action coordination (wsrep_last_seen_gtid, wsrep_last_written_gtid, wsrep_sync_wait_upto_gtid)

  • More granular and improved error logging. wsrep_debug is now a multi-valued variable to assist in controlling the logging, and logging messages have been significantly improved.

  • Some DML and DDL errors on a replicating node can either be ignored or suppressed. Use the wsrep_ignore_apply_errors variable to configure.

  • Multiple system tables help you find out more about the state of the cluster.

  • The wsrep infrastructure of Galera 4 is more robust than that of Galera 3. It features a faster execution of code with better state handling, improved predictability, and error handling.

What’s New With Galera Cluster 4.0?

The New Streaming Replication Feature

With Streaming Replication, transactions are replicated gradually in small fragments during transaction processing (i.e. before actual commit, we replicate a number of small size fragments). Replicated fragments are then applied in slave threads, preserving the transaction’s state in all cluster nodes. Fragments hold locks in all nodes and cannot be conflicted later.

Galera SystemTables 

Database Administrators and clients with access to the MySQL database may read these tables, but they cannot modify them as the database itself will make any modifications needed. If your server doesn’t have these tables, it may be that your server is using an older version of Galera Cluster.

#> show tables from mysql like 'wsrep%';
+--------------------------+
| Tables_in_mysql (wsrep%) |
+--------------------------+
| wsrep_cluster            |
| wsrep_cluster_members    |
| wsrep_streaming_log      |
+--------------------------+
3 rows in set (0.12 sec)

New Synchronization Functions 

This version introduces a series of SQL functions for use in wsrep synchronization operations. You can use them to obtain the Global Transaction ID which is based on either the last write or last seen transaction. You can also set the node to wait for a specific GTID to replicate and apply, before initiating the next transaction.
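As a rough illustration of how these functions could be combined (a sketch assuming the Galera 4 function names; the GTID value shown is made up):

-- on the node that performed the write: capture the GTID of the last write in this session
SELECT WSREP_LAST_WRITTEN_GTID();

-- on another node: block until that GTID has been replicated and applied before reading
SELECT WSREP_SYNC_WAIT_UPTO_GTID('d8398c48-8f07-11e9-b0b1-37b92e1ed5ad:1456');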

Intelligent Donor Selection

Some understated features that have been present since Galera 3.x include intelligent donor selection and cluster crash recovery. These were originally planned for Galera 4, but made it into earlier releases largely due to customer requirements. When it comes to donor node selection in Galera 3, the State Snapshot Transfer (SST) donor was selected at random. However with Galera 4, you get a much more intelligent choice when it comes to choosing a donor, as it will favour a donor that can provide an Incremental State Transfer (IST), or pick a donor in the same segment. As a Database Administrator, you can force this via setting wsrep_sst_donor.

Why Use MySQL Galera Cluster Streaming Replication?

Long-Running Transactions

Galera’s problems and limitations have always revolved around how it handles long-running transactions, which often cause the entire cluster to slow down due to large write-sets being replicated. Its flow control often kicks in, slowing down the writes or even terminating the process in order to revert the cluster back to its normal state. This is a pretty common issue with previous versions of Galera Cluster.

Codership advises to use Streaming Replication for your long-running transactions to mitigate these situations. Once the node replicates and certifies a fragment, it is no longer possible for other transactions to abort it.

Large Transactions

This is very helpful when loading data into your reporting or analytics database. Bulk inserts, deletes, updates, or using the LOAD DATA statement to load a large quantity of data all fall into this category, although it depends on how you manage your data for retrieval or storage. You must also take into account that Streaming Replication has its limitations, such as certification keys being generated from record locks.

Without Streaming Replication, updating a large number of records could result in a conflict and the whole transaction would have to be rolled back. Slaves that are applying large transactions are also subject to flow control: once the threshold is hit, the entire cluster slows down its writes and stops accepting incoming transactions until the write-set becomes manageable and replication can continue again. Check this external blog by Percona to help you understand more about flow control within Galera.

With Streaming Replication, the node begins to replicate the data with each transaction fragment, rather than waiting for the commit. Once the cluster has certified the write-set for a particular fragment, conflicting transactions running on the other nodes can no longer abort it. The cluster is free to apply and commit other concurrent transactions without blocking, and to process the large transaction with minimal impact on the cluster.

Hot Records/Hot Spots

Hot records or rows are those rows in your table that constantly get updated. They are often the most visited rows and attract the most traffic in your entire database (e.g. news feeds, or a counter such as the number of visits or logs). With Streaming Replication, you can force critical updates to the entire cluster.

As noted by the Galera Team at Codership:

“Running a transaction in this way effectively locks the hot record on all nodes, preventing other transactions from modifying the row. It also increases the chances that the transaction will commit successfully and that the client in turn will receive the desired outcome.”

This comes with limitations, as successful commits are still not guaranteed. Without Streaming Replication, you end up with a high chance of rollbacks, which adds overhead for the end user when the issue surfaces from the application’s perspective.

Things to Consider When Using Streaming Replication

  • Certification keys are generated from record locks, therefore they don’t cover gap locks or next-key locks. If the transaction takes a gap lock, it is possible that a transaction executed on another node will apply a write-set which encounters the gap lock and will abort the streaming transaction.
  • When enabling Streaming Replication, write-set logs are written to the wsrep_streaming_log table found in the mysql system database to preserve persistence in case a crash occurs, so this table is used during recovery. In case of excessive logging and elevated replication overhead, Streaming Replication will cause a degraded transaction throughput rate. This could be a performance bottleneck when a high peak load is reached. As such, it’s recommended that you only enable Streaming Replication at the session level, and then only for transactions that would not run correctly without it.
  • The best use case is to use Streaming Replication for cutting up large transactions
  • Set the fragment size to ~10K rows
  • Fragment variables are session variables and can be dynamically set (see the example below)
  • An intelligent application can turn Streaming Replication on and off on a per-need basis
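A minimal sketch of doing exactly that at the session level (variable names as documented for Galera 4; the 10K-row fragment size follows the recommendation above):

-- enable Streaming Replication for this session only
SET SESSION wsrep_trx_fragment_unit = 'rows';
SET SESSION wsrep_trx_fragment_size = 10000;

-- ... run the large transaction here ...

-- switch Streaming Replication back off for subsequent transactions
SET SESSION wsrep_trx_fragment_size = 0;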

Conclusion

Thanks for reading. In part two we will discuss how to enable Galera Cluster Streaming Replication and what the results could look like for your setup.

How to Enable Mainframe Data Analytics on AWS Using Model9

$
0
0

Feed: AWS Partner Network (APN) Blog.
Author: Phil de Valence.

By Gil Peleg, Founder and CEO at Model9

Data insight is critical for businesses to gain a competitive advantage. Mainframe proprietary storage solutions such as virtual tape libraries (VTLs) hold valuable data locked in a platform with complex tools. This can lead to higher compute and storage costs, and make it harder to retain existing employees or train new ones.

When mainframe data is stored in a cloud storage service, however, it can be accessed by a rich ecosystem of applications and analytics tools.

Model9 is an AWS Partner Network (APN) Advanced Technology Partner that enables mainframe customers to benefit from cloud technologies and economics with backup and archive directly to AWS cloud storage services, such as Amazon Glacier and Amazon Simple Storage Service (Amazon S3).

Model9 makes data delivered to Amazon Web Services (AWS) storage services readable and structured, enabling new analytics use cases. In this post, I will describe Model9’s new features and benefits with a step-by-step walkthrough and several customer use cases.

Backup and Archival with Model9 Cloud Data Manager for Mainframe

Model9’s patented technology lets mainframe customers take advantage of AWS storage services, from affordable long-term options like Amazon Glacier and Glacier Deep Archive, to highly durable, geographically dispersed, and flexible low-cost options such as Amazon S3 object storage.

Amazon Elastic Block Store (Amazon EBS) and Amazon Elastic File System (Amazon EFS) are also supported.

AWS-Model9-Mainframe-1.1

Figure 1 – Model9 Cloud Data Manager for Mainframe.

In the architecture above, you can see the two main components of the Model9 Cloud Data Manager for Mainframe software product—a lightweight agent running on z/OS providing secure data delivery and retrieval functions directly to Amazon S3, and a management server running on AWS.

Model9 Cloud Data Manager for Mainframe provides storage, backup, restore, archive/migrate, and automatic recall for all mainframe data sets, volume types and z/OS UNIX files, as well as space management, stand-alone restore, and disaster recovery.

Model9 can run side-by-side with existing data management solutions to provide cloud capabilities and cost reductions. To achieve dramatic cost reductions, Model9 provides a complete replacement for on-premises VTL and legacy data management tools.

Learn more about the Model9 backup and recovery features and benefits in this APN Blog post: How Cloud Backup for Mainframes Cuts Costs with Model9 and AWS.

Mainframe Data Analytics via Model9 Cloud Data Gateway for Mainframe

Model9 recently added new features for delivering and transforming mainframe data directly to AWS, enabling easy and secure integration with popular cloud analytics tools, data lakes, data warehouses, databases, and ETL solutions. The Model9 solution unifies data delivery for analytics with backup and space management processes.

Mainframe customers typically use virtual tapes as a secondary storage solution for three types of data:

  • Daily incremental data set backups.
  • Migrated/archived data sets as part of daily space management processing.
  • DB2 database image copies.

When data is stored on proprietary mainframe virtual tapes, no other tools can access it from outside the mainframe ecosystem. In order to process the data, it must be retrieved from tape to the mainframe host, transformed into readable format, delivered to another platform, and then loaded into analytics tools.

To avoid this data retrieval, the majority of customers also send database updates and data sets to other platforms on a daily basis using ETL tools and data transfer software such as FTP.

This double data movement—intended solely to overcome the locked-in nature of on-premises, mainframe proprietary secondary storage—incurs high costs and wasted CPU consumption.

Model9 offers a new paradigm of writing mainframe data directly to a storage platform where the data can be accessed and consumed by non-mainframe analytics tools, without requiring double data movement.

Data set backups and archives created by Model9 Cloud Data Manager in Amazon S3 can be transformed into readable textual or binary format and processed by analytics tools such as Amazon Athena. DB2 image copies delivered by Model9 directly to S3 can be transformed to CSV or JSON format so that tables can be easily loaded into modern databases or data warehouses such as Amazon Aurora and Amazon Redshift.

How Model9 Cloud Data Gateway for Mainframe Works

Model9 Cloud Data Gateway for Mainframe runs on zIIP engines and delivers data sets directly to Amazon S3 cloud storage. Amazon EFS and Amazon EBS are supported as well.

Compression and encryption can be optionally applied before data is sent over the network. On AWS, the Model9 Data Transformation Service transforms data sets and databases to standard file formats (e.g. CSV or JSON) that can be consumed by analytics services.

When used together with Model9 Cloud Data Manager for z/OS, data transformation in the cloud is automatically applied to backed up and archived data, leveraging existing storage management scheduling policies and life cycle management.

Because mainframe data is kept in the cloud in its original format, it can be transformed in multiple ways to support future needs as your application requirements change.

AWS-Model9-Mainframe-2

Figure 2 – Model9 Cloud Data Gateway for Mainframe.

For efficient mass data delivery, Model9 leverages storage replication technologies such as FlashCopy, Concurrent copy, and DFDSS to deliver data to Amazon S3 in its original format. Mainframe data is then organized, indexed, and tagged with metadata, in order to enable identification, fast retrieval, and transformation.

Data sets can be transformed specifically or extracted from full volume dumps. For example, sequential, partitioned, and VSAM data sets are transformed to JSON files. DB2 image copies are transformed to CSV files.

Currently supported mainframe data sources for transformation include*:

  • DB2 image copies
  • VSAM data sets
  • Sequential data sets
  • Partitioned data sets
  • Extended format data sets

* Data sources are updated regularly; please inquire with Model9 for latest support.

Customer Use Cases

In this section, I will describe common customer use cases for leveraging mainframe data for analytics on AWS.

Data Retention Compliance

For companies with regulatory requirements to keep data for long retention periods, Model9 can securely archive mainframe data to Amazon S3, Amazon Glacier, or Glacier Deep Archive for long-term retention periods at attractive costs. Data sets are always available for transparent automatic-recall by mainframe applications.

For additional protection and compliance with regulations (such as SEC 17a-4), Amazon S3 object lock may be applied to data delivered by Model9. Amazon S3 object lock prevents data from being deleted or overwritten by providing a Write-Once-Read-Many (WORM) protection model.

Some companies retain data even after their mainframe platform has been decommissioned. The Model9 Management Interface can be used to search for mainframe data sets stored in cloud storage, and then invoke the Model9 Data Transformation Service running on AWS to make data available for applications and analytics tools with no need for a mainframe or for retaining old equipment.

Data Warehouse and Data Lake

As data analytics requirements evolve and business needs change, having the data in its original format enables changing and updating data analytics processes on-demand. However, because mainframe data is stored on proprietary storage systems, it’s very complex to access and manipulate from outside the mainframe platform.

Mainframe ETL and data integration services transform the data on the mainframe before loading it into a target data store. If, in the future, the data is needed in a different format or structure, it has to be transformed again on the mainframe and loaded again to a target data store.

Model9 offers a new approach, delivering mainframe data to the cloud in its original format and enabling any transformation, both in the present and in the future, to run outside of the mainframe platform with no access to a mainframe. This approach is known as Extract, Load, Transform (ELT), in contrast to the traditional Extract, Transform, Load (ETL) used by legacy mainframe tools.

To keep data fresh, data delivery can be scheduled at the desired frequency; for example, every 30 minutes. Once loaded into Amazon Redshift or kept within an Amazon S3-based data lake, the mainframe data can be queried and analyzed just like any other data.

Business Intelligence

Mainframes generate and store valuable core business data. Customers gain deep business insights and improve business decisions by leveraging modern business intelligence tools to analyze mainframe data jointly with other data sources.

Today, it’s very complex and expensive to load mainframe data into cloud analytics services because data usually has to be transformed on the mainframe before being delivered to cloud services. This transformation consumes expensive mainframe CPU cycles and increases customer software monthly license charges.

With Model9, data is transformed by the Model9 Data Transformation Service, which runs on AWS and does not waste any mainframe CPU cycles at all. Data is then loaded directly into AWS analytics services such as Amazon Athena, or processed through Redshift or Aurora.

Operational Intelligence

As a core business platform where most business transactions run, mainframes generate vast amounts of machine data such as system logs, security records, and audit statistics. When this data is stored on proprietary mainframe storage systems on-premises, it’s hard to use it for DevOps, monitoring, and automation processes.

By using Model9, you can send mainframe system, security, and audit data directly to Amazon S3, where it can be transformed from binary machine data to structured data that can be loaded and parsed by operational intelligence services running on AWS.

For example, System Management Facility (SMF) records, which are regularly collected by mainframe customers into tape or generation data sets, can be sent via Model9 Cloud Data Gateway for Mainframe to Amazon Elasticsearch Service as part of the standard SMF collection process. This can be used together with machine data generated by other platforms to provide a complete monitoring picture.

Walkthrough

In this section, I will demonstrate how to deliver and load a DB2 table into Amazon Athena and Redshift. Once data is on AWS, it can be queried and analyzed by a variety of AWS services.

Task #1: Deliver DB2 image copy to Amazon S3 and transform into a CSV file

The following JCL job is used to deliver DB2 data, stored in a DB2 table image copy, directly to S3 and then transform it into a CSV file that can be queried by Amazon Athena or loaded into Redshift.

This JCL or similar job can be integrated into mainframe automation processes or scheduled by standard job schedulers to ensure ongoing mainframe data delivery to AWS.

AWS-Model9-Mainframe-3

Figure 3 – JCL job to copy data to Amazon S3 and transform to CSV format.

The first job step creates an image copy from the specified DB2 table. Image copies (ICs) are a common DB2 table backup format, created using standard DB2 utilities as part of the daily DB2 backup process. ICs can be incremental or may contain the full table data.

In the second job step, the DB2 image copy is delivered by Model9 to S3. For simplicity, this job uses a predefined data delivery policy, but many controls can be set such as the S3 region, bucket name, object name prefix, and AWS credentials. This step may also be performed from the Model9 graphical management interface.

The last job step invokes the Model9 service on AWS, to transform the image copy on S3 to a CSV file.

Task #2: Query data in Amazon S3 using Amazon Athena

The screenshot below shows how to define a table in Amazon Athena from the transformed DB2 table in S3. The pictured DDL query defines an external table from a file stored in S3 and defines the schema to access the file.

In this case, the file has been created by Model9 Cloud Data Manager backup of a DB2 table and transformed to CSV format by Model9 Cloud Data Gateway transformation service.

AWS-Model9-Mainframe-4.1

Figure 4 – Amazon Athena query to define table from DB2 CSV file.
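Since the screenshot itself is not reproduced here, a hypothetical sketch of DDL of that shape follows. The table, columns, and bucket name are made up and will differ from the pictured query:

CREATE EXTERNAL TABLE db2_accounts (
  record_id    int,
  account_no   string,
  amount       decimal(10,2),
  created_date string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://example-bucket/model9/db2_accounts/';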

After the table is defined, it can be queried just like any other table and there’s no difference between mainframe data and data that originated from other platforms.

The screenshot below shows the result of querying all columns in the table.

AWS-Model9-Mainframe-5

Figure 5 – Amazon Athena query result showing DB2 data.

Task #3: Load DB2 table into Amazon Redshift data warehouse

The following screenshot demonstrates how to load a table in CSV format from Amazon S3 to Redshift.

The pictured DDL query defines a table schema with multiple columns and copies the records from a file stored in S3 into the table based on the defined schema. The file was stored in AWS S3 using Model9 Cloud Data Manager backup of a DB2 table.

AWS-Model9-Mainframe-6

Figure 6 – Amazon Redshift table creation and DB2 CSV data load.
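As before, the screenshot is not reproduced here; conceptually, such a load might look like the following hypothetical sketch (table, columns, bucket, and IAM role are made up):

CREATE TABLE db2_accounts (
  record_id    INTEGER,
  account_no   VARCHAR(20),
  amount       DECIMAL(10,2),
  created_date DATE
);

COPY db2_accounts
FROM 's3://example-bucket/model9/db2_accounts.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole'
CSV;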

After the table is created, it can be queried just like any other table and there is no difference between mainframe data and data that originated from other platforms.

The screenshot below shows the result of querying all columns in the table.

AWS-Model9-Mainframe-7

Figure 7 – Amazon Redshift query result showing DB2 data.

After completing the tasks above, you will be able to use mainframe data for standard analytics processes on AWS, such as Amazon Athena and Amazon Redshift.

Summary

In this post, we discussed how to securely and efficiently deliver mainframe data directly to storage and analytics services on AWS using Model9 Cloud Data Manager and Model9 Cloud Data Gateway.

These software solutions help you avoid duplicate data movement for backup and data analytics, enabling you to leverage mainframe data to gain deep business insights.

I invite you to try Model9 for free and see that copying mainframe data sets to/from AWS can be as easy as running an IEBCOPY job. For more information, download our free Model9 cloud copy tool.

The content and opinions in this blog are those of the third party author and AWS is not responsible for the content or accuracy of this post.

.

Model9-APN-Blog-CTA-1


Model9 – APN Partner Spotlight

Model9 is an APN Advanced Technology Partner. Its software connects the mainframe directly over TCP/IP to cloud storage and allows customers to supplement or eliminate the need for VTLs, physical tapes, and existing data management products.

Contact Model9 | Solution Overview

*Already worked with Model9? Rate this Partner

*To review an APN Partner, you must be an AWS customer that has worked with them directly on a project.

Kaarel Moppel: Upgrading Postgres major versions using Logical Replication

$
0
0

Feed: Planet PostgreSQL.

Some weeks ago, in light of the PostgreSQL v12 release, I wrote a general overview of the various major version upgrade methods and the benefits of upgrading in general – so if upgrading is a new thing for you, I’d recommend reading that post first. But this time I’m concentrating on the newest (available since v10) and the most complex upgrade method – called “Logical Replication”, or LR for short. For demonstration purposes I’ll be migrating from v10 to the freshly released v12, as this is probably the most likely scenario. It should work the same way from v11 to v12 as well – do read on for details.

Benefits of LR upgrades

First a bit of recap from the previous post on why would you use LR for upgrading at all. Well, in short – because it’s the safest option with shortest possible downtime! With that last point I’m already sold…but here again the list of “pros” / “cons”:

PROS

  • Minimal downtime required

After the initial setup burden one just needs to wait (and verify) that the new instance has all the data from the old one…and then just shut down the old instance and point applications to the new instance. Couldn’t be easier!

Also, before the switchover one can make sure that statistics are up to date, to minimize the typical “degraded performance” period seen after “pg_upgrade” for more complex queries (on bigger databases). For high-load applications one could even be more careful here and pull the most popular relations into shared buffers by using the (relatively unknown) “pg_prewarm” contrib extension or by just running common SELECT-s in a loop, to counter the “cold cache” effect.

One can for example already make some changes on the target DB – add columns / indexes, change datatypes, leave out some old archive tables, etc. The general idea is that LR does not work on the binary, 1-to-1 level as “pg_upgrade” does, but rather JSON-like data objects are sent over to another master / primary instance, providing quite some freedom on the details.

Before the final switchover you can abort the process and re-try at any time if something seems fishy. The old instance’s data is not changed in any way, even after the final switchover! Meaning you can easily roll back (typically at the cost of some data loss though) to the old version if some unforeseen issues arise. One should only watch out for the replication slot on the source / publisher DB if the target server is just taken down suddenly.

CONS

  • Quite a few steps to take and possibly one needs to modify the schema a bit.
  • Always per DB.
  • Could take a long time for big databases.
  • Large objects, if in use (should be a thing of the past really), need to be exported / imported manually.

Preparing for LR

As LR has some prerequisites on the configuration and schema, you’d first need to see if it’s possible to start with the migration process at all or some changes are needed on the old master node, also called the “publisher” in LR context.

Action points:

1) Enable LR on the old master aka publisher aka source DB if not done already. This means setting “wal_level” to “logical” in postgresql.conf and making sure that “replication” connections are allowed in “pg_hba.conf” from the new host (also called the “subscriber” in LR context). FYI – changing “wal_level” needs a server restart! To enable any kind of streaming replication some other params are needed but they are actually already set accordingly out of the box as of v10, so it shouldn’t be a problem.

2) Check that all tables have a Primary Key (which is good database design anyways) or alternatively have REPLICA IDENTITY set. Primary Keys don’t need much explaining probably but what is this REPLICA IDENTITY thing? A bit simplified – basically it allows to say which columns formulate uniqueness within a table and PK-s are automatically counted as such.

3) If there’s no PK for a particular table, you should create one, if possible. If you can’t do that, set unique constraints / indexes to serve as REPLICA IDENTITY, if at all possible. If even that isn’t possible, you can set the whole row as REPLICA IDENTITY, a.k.a. REPLICA IDENTITY FULL, meaning all columns serve as PK’s in an LR context – with the price of very slow updates / deletes on the subscriber (new DB) side, meaning the whole process could take days or not even catch up, ever! It’s OK not to define a PK for a table, as long as it’s a write-only logging table that only gets inserts.

Sample code:

psql -c "ALTER SYSTEM SET wal_level TO logical;"
sudo systemctl restart postgresql   # adjust the unit name (e.g. postgresql@10-main) to match your installation

# find problematic tables (assuming we want to migrate everything "as is")
SELECT
    quote_ident(nspname) || '.' || quote_ident(relname) AS tbl
FROM
    pg_class c
    JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE
    relkind = 'r'
    AND NOT nspname LIKE ANY (ARRAY[E'pg\_%', 'information_schema'])
    AND NOT relhaspkey
    AND NOT EXISTS (SELECT * FROM pg_index WHERE indrelid = c.oid
            AND indisunique AND indisvalid AND indisready AND indislive)
ORDER BY
    1;

# set replica identities on tables highlighted by the previous query
ALTER TABLE some_bigger_table REPLICA IDENTITY USING INDEX unique_idx ;
ALTER TABLE some_table_with_no_updates_deletes REPLICA IDENTITY FULL ;

Fresh setup of the new “subscriber” DB

Second most important step is to set up a new totally independent instance with a newer Postgres version – or at least create a new database on an existing instance with the latest major version. And as a side note – same version LR migrations are also possible, but you’d be solving some other problem in that case.

This step is actually very simple – just a standard install of PostgreSQL, no special steps needed! With the important addition that to make sure everything works exactly the same way as before for applications – same encoding and collation should be used!

-- on old
SELECT pg_catalog.pg_encoding_to_char(d.encoding) AS "Encoding", d.datcollate as "Collate" FROM pg_database d WHERE datname = current_database();
-- on new
CREATE DATABASE appdb TEMPLATE template0 ENCODING 'UTF8' LC_COLLATE 'en_US.UTF-8';

NB! Before the final switchover it’s important that no normal users have access to the new DB – as they might alter table data or structures and thereby inadvertently produce replication conflicts that mostly mean starting from scratch (or a costly investigation / fix) as “replay” is a sequential process.

Schema / roles synchronization

Next we need to synchronize the old schema onto the new DB as Postgres does not take care of that automatically as of yet. The simplest way is to use the official PostgreSQL backup tool called “pg_dump”, but if you have your schema initialization scripts in Git or such and they’re up to date then this is fine also. For syncing roles “pg_dumpall” can be used.

NB! After this point it’s not recommended to introduce any changes to the schema or be at least very careful when doing it, e.g. creating new tables / columns first on the subscriber and refreshing the subscriptions when introducing new tables – otherwise data synchronization will break! Tip – a good way to disable unwanted schema changes is to use DDL triggers! An approximate example on that is here. Adding new tables only on the new DB is no issue though but during an upgrade not a good idea anyways – my recommendation is to first upgrade and then to evolve the schema.

pg_dumpall -h $old_instance --globals-only | psql -h $new_instance
pg_dump -h $old_instance --schema-only appdb | psql -h $new_instance appdb

Create a “publication” on the old DB

If preparations on the old DB has been finished (all tables having PK-s or replication identities) then this is a oneliner:

CREATE PUBLICATION upgrade FOR ALL TABLES;

Here we added all (current and those added in future) tables to a publication (a replication set) named “upgrade” but technically we could also leave out some or choose to only replicate some operations like UPDATE-s, but for a pure version upgrade you want typically all.

NB! As of this moment the replication identities become important – and you might run into trouble on the old master if the identities are not in place on all tables that get changes! In such case you might see errors like that:

UPDATE pgbench_history SET delta = delta WHERE aid = 1;
ERROR:  cannot update table "pgbench_history" because it does not have a replica identity and publishes updates
HINT:  To enable updating the table, set REPLICA IDENTITY using ALTER TABLE.

Create a “subscription” on the target DB

Next step – create a “subscription” on the new DB. This is also a one-liner that creates a logical replication slot on the old instance, pulls initial table snapshots, and then starts to stream and apply all table changes as they happen on the source, resulting eventually in a mirrored dataset! Note that currently superuser rights are needed for creating the subscription, and it also makes life a lot easier on the publisher side.

CREATE SUBSCRIPTION upgrade_sub CONNECTION 'port=5432 user=postgres dbname=appdb' PUBLICATION upgrade;
NOTICE:  created replication slot "upgrade_sub" on publisher
CREATE SUBSCRIPTION

WARNING! As of this step the 2 DB-s are “coupled” via a replication slot, carrying some dangers if the process is aborted abruptly and the old DB is not “notified” of that. If this sounds new please see the details from documentation.

Check replication progress

Depending on the amount of data it will take X minutes / days until everything is moved over and “live” synchronization is working.

Things to inspect for making sure there are no issues:

  • No errors in server logs on both sides
  • There’s an active “pg_replication_slots” entry on the master with the name that we used to create the “subscription” on the new DB
  • All tables are actively replicating on the subscriber side, i.e. “pg_subscription_rel.srsubstate” should be ‘r’ for all tables (ready – normal replication), as shown in the sample queries below
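The catalog views mentioned above can be checked with queries along these lines (a minimal sketch; the slot name matches the subscription created above):

-- on the old master (publisher)
SELECT slot_name, plugin, active FROM pg_replication_slots WHERE slot_name = 'upgrade_sub';

-- on the new DB (subscriber): 'r' means ready, i.e. normal ongoing replication
SELECT srrelid::regclass AS table_name, srsubstate FROM pg_subscription_rel ORDER BY 1;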

Basic data verification / switchover preparation

Although not a mandatory step, when it comes to data consistency / correctness, it always makes sense to go the extra mile and run some queries that validate that things (source – target) have the same data. For a running DB it’s of course a bit difficult as there’s always some replication lag but for “office hours” applications it should make a lot of sense. My sample script for comparing rowcounts (in a non-threaded way) is for example here but using some slightly more “costly” aggregation / hashing functions that really look at all the data would be even better there.

Also important to note if you’re using sequences (which you most probably are) – sequence state is not synchronized by LR and needs some manual work / scripting! The easiest option I think is that you leave the old DB ticking in read-only mode during switchover so that you can quickly access the last sequence values without touching the indexes for maximum ID-s on the subscriber side.

Switchover time!

We’re almost there with our little undertaking…with the sweaty part remaining – the actual switchover to start using the new DB! Needed steps are simple though and somewhat similar to switching over to a standard, “streaming replication” replica.

1) Re-check the system catalog views on replication status.
2) Stop the old instance. Make sure it’s a nice shutdown. The last logline should state “database system is shut down”, meaning all recent changes were delivered to connected replication clients, including our new DB. Start of downtime! PS Another alternative to make sure absolutely all data is received is to actually configure the new instance in “synchronous replication” mode! This has the usual synchronous replication implications of course so I’d avoid it for bigger / busier DBs.
3) Start the old DB in read-only mode by creating a recovery.conf file (from v12 this is achieved by declaring a “standby.signal” file)
4) Optionally make some more quick “health checks” if time constraints allow it – verify table sizes, row counts, your last transactions, etc. For “live” comparisons it makes sense to restart the old DB under a new, random port so that no-one else connects to it.
5) Synchronize the sequences. Given we’ll leave the old DB in read-only mode the easiest way is something like that:

psql -h $old_instance -XAtqc "SELECT $$select setval('$$ || quote_ident(schemaname)||$$.$$|| quote_ident(sequencename) || $$', $$ || last_value || $$); $$ AS sql FROM pg_sequences" appdb \
  | psql -h $new_instance appdb

6) Reconfigure your pg_hba.conf to allow access for all “mortal” users, then reconfigure your application, connection pooler, DNS or proxy to start using the new DB! If the two DB-s were on the same machine then it’s even easier – just change the ports and restart. End of downtime!
7) Basically we’re done here, but would be nice of course to clean up and remove the (no-more needed) subscription not to accumulate errors in server log.

DROP SUBSCRIPTION upgrade_sub;

Note that if you won’t keep the old “publisher” accessible in read-only or normal primary mode (dangerous!), some extra steps are needed here before dropping:

ALTER SUBSCRIPTION  upgrade_sub DISABLE ;
ALTER SUBSCRIPTION  upgrade_sub SET (slot_name = NONE);
DROP SUBSCRIPTION upgrade_sub;

8) Time for some bubbly drinks

Summary

Although there are quite a few steps and nuances involved, LR is worth adding to the standard upgrade toolbox for time-critical applications, as it’s basically the best way to do major version upgrades nowadays – minimal dangers, minimal downtime!

FYI – if you’re planning to migrate dozens of DB-s the LR upgrade process can be fully automated! Even starting from version 9.4 actually, with the help of the “pglogical” extension. So feel free to contact us if you might need something like that and don’t particularly enjoy the details. Thanks for reading!

Temporal Tables Part 3: Managing Historical Data Growth


Feed: Clustrix Blog.
Author: Alejandro Infanzon.

This is part 3 of a 5-part series. If you want to start from the beginning, see Temporal Tables Part 1: Introduction & Use Case Example.

Up until now, we haven't crisply defined what is meant by SYSTEM_TIME in the above examples. With the DDL statement above, the time that is recorded is the time that the change arrived at the database server. This suffices for many use cases, but in some cases, particularly when debugging the behavior of queries at specific points in time, it is more important to know when the change was committed to the database. Only at that point does the data become visible to other users of the database. MariaDB can record temporal information based on the commit time by using transaction-precise system versioning. Two extra columns, start_trxid and end_trxid, must be manually declared on the table:

CREATE TABLE purchaseOrderLines(
    purchaseOrderID              INTEGER NOT NULL
  , LineNum                      SMALLINT NOT NULL
  , status                       VARCHAR(20) NOT NULL
  , itemID                       INTEGER NOT NULL
  , supplierID                   INTEGER NOT NULL
  , purchaserID                  INTEGER NOT NULL
  , quantity                     SMALLINT NOT NULL
  , price                        DECIMAL (10,2) NOT NULL
  , discountPercent              DECIMAL (10,2) NOT NULL
  , amount                       DECIMAL (10,2) NOT NULL
  , orderDate                    DATETIME
  , promiseDate                  DATETIME
  , shipDate                     DATETIME
  , start_trxid                  BIGINT UNSIGNED GENERATED ALWAYS AS ROW START
  , end_trxid                    BIGINT UNSIGNED GENERATED ALWAYS AS ROW END
  , PERIOD FOR SYSTEM_TIME(start_trxid, end_trxid) 
  , PRIMARY KEY (purchaseOrderID, LineNum)
) WITH SYSTEM VERSIONING;

The rows now contain columns that represent the start and end transaction IDs for the change, as recorded in the TRANSACTION_REGISTRY table in the mysql system schema.

Temporal table: example 1

If you need to return the transaction commit time information from your temporal queries, you will need to join with this TRANSACTION_REGISTRY table, returning the commit_timestamp:

SELECT 
    commit_timestamp
  , begin_timestamp
  , purchaseOrderID
  , LineNum
  , status
  , itemID
  , supplierID
  , purchaserID
  , quantity
  , price
  , amount
FROM purchaseOrderLines, mysql.transaction_registry
WHERE start_trxid = transaction_id;

This will show when the change became visible to all sessions in the database (the most common scenario), or the begin_timestamp if you care about the beginning of the transaction that made the change.

Temporal table: example 2

Capturing the history of changes to a table does not come without some cost.  As we showed earlier, one insert with three subsequent updates results in 4 rows being stored in the database.

Temporal table: example 3

For smaller tables, or tables that have infrequent changes to their rows, this may not be a problem.  The storage and performance impact of additional rows might be insignificant compared to other activity.  However, high-volume tables with many changes to rows may want to consider techniques for managing the growth of the historical data.

The first option is to disable temporal tracking for specific columns where appropriate. This is accomplished by using the WITHOUT SYSTEM VERSIONING modifier on specific columns:

CREATE TABLE PurchaseOrderLines (
    purchaseOrderID         INTEGER NOT NULL
  , LineNum                 SMALLINT NOT NULL
  , status                  VARCHAR(20) NOT NULL
  , itemID                  INTEGER NOT NULL
  , supplierID              INTEGER NOT NULL
  , purchaserID             INTEGER NOT NULL
  , quantity                SMALLINT NOT NULL
  , price                   DECIMAL(10,2) NOT NULL
  , discountPercent         DECIMAL(10,2) NOT NULL
  , amount                  DECIMAL(10,2) NOT NULL
  , orderDate               DATETIME
  , promiseDate             DATETIME
  , shipDate                DATETIME
  , comments                VARCHAR(2000) WITHOUT SYSTEM VERSIONING
  , PRIMARY KEY (purchaseOrderID, LineNum)
) WITH SYSTEM VERSIONING;

Partitioning is another popular technique for managing the growth of historical data in temporal tables.  The CURRENT keyword is understood by the partitioning logic when used on temporal tables with system versioning.  Isolating the historical versions of the rows into their own partition is as simple as:

CREATE TABLE PurchaseOrderLines (
    purchaseOrderID         INTEGER NOT NULL
  , LineNum                 SMALLINT NOT NULL
  , status                  VARCHAR(20) NOT NULL
  , itemID                  INTEGER NOT NULL
  , supplierID              INTEGER NOT NULL
  , purchaserID             INTEGER NOT NULL
  , quantity                SMALLINT NOT NULL
  , price                   DECIMAL (10,2) NOT NULL
  , discountPercent         DECIMAL (10,2) NOT NULL
  , amount                  DECIMAL (10,2) NOT NULL
  , orderDate               DATETIME
  , promiseDate             DATETIME
  , shipDate                DATETIME
  , comments                VARCHAR(2000) WITHOUT SYSTEM VERSIONING
  , PRIMARY KEY (purchaseOrderID, LineNum)
) WITH SYSTEM VERSIONING
    PARTITION BY SYSTEM_TIME (
        PARTITION p_hist HISTORY
      , PARTITION p_cur CURRENT
);

This technique is especially powerful because partitions will be pruned when executing queries.  Queries that access the current information will quickly skip historical data and only interact with the smaller data and associated indexes on the current partition.

Partitioning becomes an even more powerful tool when combined with interval definitions, dividing historical data into buckets that can then be managed individually.

CREATE TABLE PurchaseOrderLines (
    purchaseOrderID          INTEGER NOT NULL
  , LineNum                  SMALLINT NOT NULL
  , status                   VARCHAR(20) NOT NULL
  , itemID                   INTEGER NOT NULL
  , supplierID               INTEGER NOT NULL
  , purchaserID              INTEGER NOT NULL
  , quantity                 SMALLINT NOT NULL
  , price                    DECIMAL (10,2) NOT NULL
  , discountPercent          DECIMAL (10,2) NOT NULL
  , amount                   DECIMAL (10,2) NOT NULL
  , orderDate                DATETIME
  , promiseDate              DATETIME
  , shipDate                 DATETIME
  , comments                 VARCHAR(2000) WITHOUT SYSTEM VERSIONING
  , PRIMARY KEY (purchaseOrderID, LineNum)
) WITH SYSTEM VERSIONING
    PARTITION BY SYSTEM_TIME INTERVAL 1 WEEK (
        PARTITION p0 HISTORY
      , PARTITION p1 HISTORY
      , PARTITION p2 HISTORY
      ...
      , PARTITION p_cur CURRENT
);

Once a temporal table is partitioned based on intervals, administrators can use the Transportable Tablespaces feature of the InnoDB storage engine and the EXCHANGE PARTITION command syntax to manage table growth. Copying, dropping, and restoring partitions become simple data definition language (DDL) commands and file system operations, avoiding the performance impact of changing individual rows.
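
As a rough, hypothetical sketch of that last point (PurchaseOrderLines_archive is a made-up name and is assumed to be an empty, non-partitioned table with a matching structure), swapping out the oldest history bucket could look something like this:

-- Swap the oldest history partition with an empty archive table of matching structure.
ALTER TABLE PurchaseOrderLines
    EXCHANGE PARTITION p0 WITH TABLE PurchaseOrderLines_archive;

-- The historical rows now live in the standalone archive table, whose tablespace can
-- be transported elsewhere or simply dropped, without touching individual rows.
DROP TABLE PurchaseOrderLines_archive;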

Continue to Temporal Tables Part 4: Application Time to learn more.

Temporal Tables Part 4: Application Time


Feed: Clustrix Blog.
Author: Alejandro Infanzon.

This is part 4 of a 5-part series. If you want to start from the beginning, see Temporal Tables Part 1: Introduction & Use Case Example.

The preceding parts of this series introduced system time and transaction-precise time, and showed how MariaDB can automatically capture temporal information based on those definitions. However, application-defined timing information is often more important. It is frequently less important to know when a change was recorded in the system than to know when that change is meant to take effect. Let's take a Human Resource Management System (HRMS) and a simplified view of how employee information is tracked.

CREATE TABLE Employees (
    empID               INTEGER
  , firstName           VARCHAR(100)
  , lastName            VARCHAR(100)
  , address_1           VARCHAR(100)
  , city                VARCHAR(100)
  , state               VARCHAR(50)
  , zip                 VARCHAR(20)
  , departmentName      VARCHAR(20) 
  , startDate           DATETIME NOT NULL
  , endDate             DATETIME NOT NULL
);

The startDate and endDate columns define the period during which a particular entry is valid.  Application logic then uses standard SQL to query for a specific point in time.

SELECT *
FROM Employees
WHERE startDate <= '2017-01-01 00:00:00'
  AND endDate   >  '2017-01-01 00:00:00';

The 10.4 release of MariaDB Server has added additional support for these application time definitions. The DDL statement can now declare that the startDate and endDate columns work together to define a time period.

CREATE TABLE Employees (
    empID INTEGER
  , firstName           VARCHAR(100)
  , lastName            VARCHAR(100)
  , address             VARCHAR(100)
  , city                VARCHAR(100)
  , state               VARCHAR(50)
  , zip                 VARCHAR(20)
  , departmentName      VARCHAR(20) 
  , startDate           DATETIME NOT NULL
  , endDate             DATETIME NOT NULL
  , PERIOD FOR appl_time (startDate, endDate)
);

The benefit of declaring the time period comes when performing deletes or updates that can take advantage of the FOR PORTION OF syntax to make changes that respect period definitions.

Let's take an example where we want to make some retroactive changes to an employee's department assignment. First, let's load a row for a single employee.

INSERT INTO Employees (empID, firstName, lastName, departmentName, startDate, endDate)
VALUES (1, 'John', 'Smith', 'Sales', '2015-01-01', '2018-12-31');

Imagine we want to record that John was assigned to the marketing department from 2016-01-01 to 2016-06-30. Without application time period support, this would involve individual SQL statements to update the endDate on one row, the startDate on another row, and a statement to insert a new row. With MariaDB application time support and the FOR PORTION OF syntax, this can be accomplished with a single statement.

UPDATE Employees
  FOR PORTION OF appl_time
FROM '2016-01-01' to '2016-06-30'
  SET departmentName = 'Marketing'
WHERE empId = 1;

The application can query the data using the below SELECT statement: 

SELECT *
FROM Employees
WHERE empId = 1
ORDER BY startDate;

This returns the different departments John has worked in over different periods of time:

Temporal table: Example 1

Similar syntax can be used to delete information.  In this case, let’s assume we have learned that John was not employed by the company for a portion of the time that we currently have him assigned to marketing.

DELETE FROM Employees
FOR PORTION OF appl_time
FROM '2016-02-01' TO '2016-04-30'
WHERE empId = 1;

As before, a simple SELECT statement is all you need:

SELECT *
FROM Employees
WHERE empId = 1
ORDER BY startDate;

Now we can see that the period of time John was not working for the company is reflected in the output:

Temporal table: Example 2

While somewhat simplified, the above examples illustrate how making changes to application time is much easier when using the new syntax available in MariaDB 10.4. Working with application time will get even easier in subsequent releases, as there are further enhancements planned. These include additional syntax for querying time periods (MDEV-16976) and enforcement of non-overlapping time periods via an integrity constraint (MDEV-16978).

Things get really interesting when you combine system time and application time. Continue to Temporal Tables Part 5: Bitemporal Tables and Queries to learn more.

Understanding and Testing Non-blocking Backup Locks with MariaDB Enterprise Server


Feed: Clustrix Blog.
Author: Ken Geiselhart.

Let's talk about database backups for a minute. Making sure your data is safe and recoverable is a key component of database administration, and keeping regular, reliable backups is essential to the success of any organization. When something unexpected happens, like a server crash, data corruption, or even data tampering, you need to be able to recover mission-critical applications quickly and reliably.

One of the new features in MariaDB Enterprise Server and MariaDB Enterprise Backup is the use of non-blocking backup locks to minimize workload impact during backup operations. It does this by performing the majority of the backup with a minimal number of server locks. This works really well on transactional tables like InnoDB and MyRocks; a short backup lock is only needed for new commits and for copying statistics and log tables. MariaDB Enterprise Backup builds on InnoDB's recovery technology: it copies InnoDB tablespaces, which are not consistent after a straight copy, and then performs a crash recovery to make the copied tables consistent using the redo logs, undo logs, the backup's new DDL log, and backup locks when needed. This non-blocking feature temporarily locks tables during the backup stages to give you a consistent point in time for your backups.

MariaDB Enterprise Backup works by observing and noting the log sequence numbers in the transaction log before it starts copying the tablespaces. Simultaneously, it starts another background process to keep track of the transaction log entries; later these logs are used to keep the copied tablespaces consistent. Backup locks block DDL operations on a table for a short time at the end of the backup so that the DDL can be made consistent with the data. This minimizes the impact and the duration that a table is locked for backups. For example, for an Aria table that stores data in three files with extensions .frm, .MAI and .MAD, normal read/write operations can continue as usual.

The non-blocking backup functionality, which is only available in MariaDB Enterprise Backup, differs from the historical backup functionality in that it includes optimizations in the backup stages, including DDL statement tracking, which reduces lock time during backups. Earlier releases of MariaDB (Community versions) use FLUSH TABLES WITH READ LOCK; this closes the open tables and only allows them to be reopened with a read lock during the backup stage. It also means the backup cannot run while there are long-running queries. The reason the FLUSH is needed is to ensure that all the tables are in a consistent state at the exact same point in time.

MariaDB Enterprise Backup works by using multiple backup stages, each of which performs certain functions in the backup process. The first stage is called BACKUP STAGE START. At this stage, non-transactional tables not in use are copied. It also starts the copy of tablespaces and other logs, blocks redo log purges for certain storage engines, and logs DDL commands into a ddl.log file.

BACKUP STAGE FLUSH blocks new write locks on non-transactional tables and flushes all changes for non-transactional tables, with the exception of statistics and log tables. It closes tables not in use and marks them as such. It does not block DDL statements in this stage, because table inconsistency will be addressed later when the transaction logs and redo logs are applied. It also does not block or wait on read-only transactions.

During BACKUP STAGE BLOCK_DDL, it waits for statements holding write locks on non-transactional tables to complete. It blocks CREATE, DROP, TRUNCATE and RENAME TABLE commands. ALTER TABLE is handled in different ways depending on which stage the ALTER TABLE is in: any new ALTER TABLE commands are blocked, as is the final RENAME step of a currently running ALTER TABLE, but currently running ALTER TABLE statements themselves are not blocked.

Next is BACKUP STAGE BLOCK_COMMIT. In this stage the binary log is locked, which also prevents commits and rollbacks. If there are active commits or data still to be copied to the binary log, these are allowed to finish. This stage doesn't lock temporary tables that are not used by replication; those will be blocked when it is time to write to the binary log. System log tables and statistics tables are locked and flushed at this time. When the BLOCK_COMMIT stage returns, that is our consistent 'backup time': everything that should be in the backup is committed, and what shouldn't be is rolled back.

The final stage of MariaDB Enterprise Backup is BACKUP STAGE END. This ends DDL logging and frees up resources used by MariaDB Enterprise Backup. For more detail please look at MariaDB’s Enterprise backup stage documentation. It can be found here.
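
For reference, the stages described above are exposed as SQL commands that a backup tool issues in a single session (the BACKUP STAGE syntax is available in MariaDB Server 10.4 and later; this sketch only illustrates the ordering that MariaDB Enterprise Backup follows internally):

BACKUP STAGE START;         -- copy idle non-transactional tables, start tablespace copy, begin DDL logging
BACKUP STAGE FLUSH;         -- flush non-transactional tables, block new write locks on them
BACKUP STAGE BLOCK_DDL;     -- wait for write locks, block CREATE/DROP/TRUNCATE/RENAME
BACKUP STAGE BLOCK_COMMIT;  -- lock the binary log; this point is the consistent 'backup time'
BACKUP STAGE END;           -- end DDL logging and release resources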

For a more interactive explanation, my fellow Enterprise Architect, Manjot Singh, has a video that describes these stages as well:

Practical Application

Now that we understand MariaDB Enterprise Backup locks, I will run through a quick test of the new non-blocking backup locks in MariaDB Enterprise Server 10.2.25-1. I create two separate database instances running MariaDB Enterprise Server 10.2.25-1: one instance for taking the backup and the other for the recovery process. We can then compare the two database instances for accuracy and show that they are identical. For further information about MariaDB Enterprise Backup, please click here.

Connect to the server, log on to the first database instance and create a database for testing.  This instance will be known as “bob”.

ssh @.com
MariaDB [(none)]> CREATE DATABASE bob;

We then create a backup user with the appropriate privileges for backing up and restoring the instance.

MariaDB [(none)]> CREATE USER 'mariadb_bob'@'localhost' IDENTIFIED BY 'mariadb_bob';
MariaDB [(none)]> GRANT RELOAD, PROCESS, LOCK TABLES, REPLICATION CLIENT ON *.* TO 'mariadb_bob'@'localhost';

Create some tables that will later be populated with a large volume of rows using the scripts below.

MariaDB [(none)]> use bob;
create table bigt1(x int primary key auto_increment);
create table bigt2(x int primary key auto_increment);
create table bigt3(x int primary key auto_increment);
create table bigt4(x int primary key auto_increment);
create table bigt5(x int primary key auto_increment);
create table bigt6(x int primary key auto_increment);
create table bigt7(x int primary key auto_increment);

Populate those tables with at least 5 GB worth of data. We do this by inserting simple values into the tables.

insert into bigt1 () values(),(),();
insert into bigt2 () values(),(),();
insert into bigt3 () values(),(),();
insert into bigt4 () values(),(),();
insert into bigt5 () values(),(),();
insert into bigt6 () values(),(),();
insert into bigt7 () values(),(),();

Then we insert some more rows with the commands below, doubling the table sizes each time they run. This is put into a script that is run repeatedly until we reach the table sizes we want to test against.

insert into bigt1 (x) select x + (select count(*) from bigt1) from bigt1;
insert into bigt2 (x) select x + (select count(*) from bigt2) from bigt2;
insert into bigt3 (x) select x + (select count(*) from bigt3) from bigt3;
insert into bigt4 (x) select x + (select count(*) from bigt4) from bigt4;
insert into bigt5 (x) select x + (select count(*) from bigt5) from bigt5;
insert into bigt6 (x) select x + (select count(*) from bigt6) from bigt6;
insert into bigt7 (x) select x + (select count(*) from bigt7) from bigt7;

To check the size of each of the tables as they grow, use the select statements below.

SELECT count(*) from bigt1;

             or

SELECT table_schema as `Database`, table_name AS `Table`, round(((data_length + index_length) / 1024 / 1024 ), 2) `Size in MB` FROM information_schema.TABLES  WHERE table_schema="bob" ORDER BY (data_length + index_length) DESC;

We now have a database instance with some volume of data. To show the value of the non-blocking lock feature available in MariaDB Enterprise Backup in MariaDB Enterprise Server 10.2.25-1, we will take different levels of mariabackup while running a workload against the database instance "bob", showing that the backups do not interfere with the running workload. We then restore the different levels of mariabackup to the second database instance and compare the two instances to ensure they are identical.

We need to create some backup directories to store the backup files. Each subdirectory will store a different level of mariabackup: L0 for the full backup (known as level 0), L1 for the first incremental mariabackup, and L2 for the second incremental mariabackup.

ssh @.com
cd /backup/backup_test
mkdir L0 L1 L2

Running the script below introduces a workload: it continuously runs an ALTER TABLE, INSERTs into multiple tables, a constant RENAME of a table, and a DROP TABLE against the "bob" instance.

 :/backup/backup_test] # ./jobs_10.2.25-1.sh

 With the workload running we start a level 0 backup with the following command:

date; /mariabackup --defaults-file=/mysql/test_3600/etc/my.cnf --no-version-check --user=mariadb_bob --password=mariadb_bob --throttle=50000 --parallel=1 --ftwrl-wait-timeout=300 --ftwrl-wait-threshold=86400 --ftwrl-wait-query-type=update --backup --target-dir=/backup/test_3600/L0 |& tee -a /backup/backup_test/L0_10.2.25-1-bkp.log; date

Check for backup completion. This can be found by looking at the xtrabackup_info file created from the mariabackup. 

# Example.
[00] 2019-09-10 07:54:42 Writing xtrabackup_info
[00] 2019-09-10 07:54:42      ...done
[00] 2019-09-10 07:54:42 Redo log (from LSN 620478195674 to 622323984403) was copied.
[00] 2019-09-10 07:54:42 completed OK!

 

Keeping the workload active against the instance "bob", we now start an incremental backup, calling it L1.

/mariabackup --defaults-file=/mysql/test_3600/etc/my.cnf --no-version-check --user=mariadb_bob --password=mariadb_bob --throttle=50000 --parallel=1 --ftwrl-wait-timeout=300 --ftwrl-wait-threshold=86400 --ftwrl-wait-query-type=update --backup --target-dir=/backup/test_3600/L1 --incremental-basedir=/backup/backup_test/L0 |& tee -a /backup/backup_test/L1_10.2.25_bkp.log; date

Again, check for backup completion.

# Example.
[00] 2019-09-10 07:59:34 Writing xtrabackup_info
[00] 2019-09-10 07:59:34      ...done
[00] 2019-09-10 07:59:34 Redo log (from LSN 622780501285 to 625194016167) was copied.
[00] 2019-09-10 07:59:34 completed OK!

Stop the jobs script that was simulating the workload against “bob”.  

ps -ef | grep .py
root   33966      1 0 09:53 pts/0 00:00:00 python ./Tab5_Insert.py 3626
root   34037      1 0 09:53 pts/0 00:00:00 python ./Tab6_Insert.py 3626
root   34178      1 0 09:53 pts/0 00:00:00 python ./global_Alter.py 3626 bigt1
root   34694      1 0 09:53 pts/0 00:00:00 python ./global_Alter.py 3626 bigt2
….

kill -9 

Start another incremental backup and call it L2.

/mariabackup --defaults-file=/mysql/test_3600/etc/my.cnf --no-version-check --user=mariadb_bob --password=mariadb_bob --throttle=50000 --parallel=1 --ftwrl-wait-timeout=300 --ftwrl-wait-threshold=86400 --ftwrl-wait-query-type=update --backup --target-dir=/backup/backup_test/L2 --incremental-basedir=/backup/backup_test/L1 |& tee -a /backup/backup_test/L2_10.2.25_bkp.log 

Once again checking for backup completion.

# Example.
[00] 2019-09-10 08:20:35 Writing xtrabackup_info
[00] 2019-09-10 08:20:35      ...done
[00] 2019-09-10 08:20:35 Redo log (from LSN 633112486665 to 633112486674) was copied.
[00] 2019-09-10 08:20:35 completed OK!

Recover this backup to another instance.

Stop the second database instance if it is running with the following command: 

service mysql stop

Remove the datafiles from the data_dir directory for database instance 3601.

cd /mysqld/test_3601/
rm -rf data/*
rm -rf log/*.log
rm -rf log/ib* 
rm -rf log/binlog/*

Prepare the 10.2.25-1 MariaDB Enterprise backup (the L0 full backup).

mariabackup --defaults-file=/mysql/test_3600/etc/my.cnf --prepare --target-dir=/backup/backup_test/L0

Check for completion.

# Example.
[00] 2019-09-10 08:24:20 Last binlog file /mysql/test_3600/log/mysql-binlog.000326, position 969970776
[00] 2019-09-10 08:24:20 completed OK! 

Prepare the L1 incremental backups to the base backup.

/mariabackup --defaults-file=/mysql/test_3600/etc/my.cnf --prepare --target-dir=/backup/backup_test/L0 --incremental-dir=/backup/backup_test/L1

Check for completion.

 # Example.
[00] 2019-09-10 08:40:03 Copying /backup/backup_test_3600/L1/xtrabackup_info to ./xtrabackup_info
[00] 2019-09-10 08:40:03      ...done
[00] 2019-09-10 08:40:03 completed OK!

 Prepare the L2 incremental backups to the base backup. 

mariabackup --defaults-file=/mysql/backup_test_3600/etc/my.cnf --prepare --target-dir=/backup/backup_test/L0 --incremental-dir=/backup/backup_test/L2

Check for completion.

# Example.
[00] 2019-09-10 08:44:05 Copying /backup/test_3600/L2/xtrabackup_info to ./xtrabackup_info
[00] 2019-09-10 08:44:05      ...done
[00] 2019-09-10 08:44:05 completed OK!

Copy the backup data back to the 3601 data_dir directory.

mariabackup --defaults-file=/mysql/backup_test_3600/etc/my.cnf --copy-back --target-dir=/mysql/backup_test_3601; date

Check and change file permissions.

chown -R mysql:mysql /mysql/backup_test_3601/*

Start the 10.2.25-1 Enterprise Server on the 3601 database instance.

service mysql start

To test, check both instances for the "bob" database and consistency within it. Use the following commands in each instance (3600 and 3601) to verify the results of the ALTER TABLE, INSERT INTO, RENAME and DROP TABLE scripts from the simulated workload.

Check the ALTER TABLE

SHOW CREATE TABLE bigt1;

Check the INSERT TABLE

SELECT count(*) FROM bigt1;

and

SELECT table_schema as `Database`, table_name AS `Table`, round(((data_length + index_length) / 1024 / 1024 ), 2) `Size in MB` FROM information_schema.TABLES  WHERE table_schema="bob" ORDER BY (data_length + index_length) DESC;

Checking the RENAME. 

SHOW TABLES;

If all was done correctly, you will have two identical database instances with the same tables, each with the correct number of rows. The table that was continuously renamed will be identical in each instance, the dropped and recreated table will be gone, and all altered tables will be consistent as well.

MariaDB Enterprise Backup plays a substantial role in achieving your Recovery Point Objective (RPO), which defines the maximum amount of data a business is willing to lose, and your Recovery Time Objective (RTO), which defines how quickly a business needs to restore service in the event of a fault. Using MariaDB Enterprise Backup's non-blocking feature reduces the amount of time your database is locked, improving RPO and RTO. Try it out yourself and see how the MariaDB Enterprise Backup locks feature can help your organization by downloading MariaDB Enterprise Server, part of MariaDB Platform, here.

Understanding and Using In-Order Parallel Replication


Feed: Clustrix Blog.
Author: Ken Geiselhart.

In the world of databases, having multiple copies of your database helps to serve your business in a multitude of ways. To achieve this, replication is used. Replication copies the contents of one server called a Primary to one or more servers called Replicas.

One advantage of replication is the ability to separate read requests from write requests and direct those reads to a Replica, offloading that portion of the workload from the Primary and improving its performance. The main difference in basic replication between a Primary and a Replica is that the Primary must handle all the writes to a database. (MariaDB Cluster is a multi-primary setup, which is a different architecture and is excluded here.) To keep the Primary and a Replica in sync, two replication threads are used.

Before MariaDB 10.0, changes were applied on a Replica by a single thread. The I/O thread copies each transaction from the Primary's binlog to the relay logs on a Replica, and the SQL thread then applies those changes, one right after the other. Being single threaded, and having to deal with network latency, system resources and other components, replication can sometimes lag. Replication lag makes it difficult to keep the Primary and a Replica in sync. Replication can be started and stopped, planned or unplanned, and it picks up where it left off. However, stopping replication for any period of time can and will cause replication lag, resulting in your data across the enterprise not being in sync, and it is up to a Replica to catch up with the Primary. This was a major problem for single-threaded replication.

Starting with MariaDB 10.0, parallel replication was introduced. Parallel replication creates multiple replication worker threads that work in parallel to apply transactions from the relay logs on a Replica. The SQL thread can now hand off transactions to be applied at the same time. This potentially increases replication speed on a Replica, thereby reducing or even eliminating replication lag. Let's see how this works.

First, set up basic replication with MariaDB 10.x; the example below uses MariaDB Server 10.2.21. Next, enable parallel replication on a Replica by setting the slave_parallel_threads variable. This variable is dynamic, meaning it can be changed without restarting mysqld (add it to the my.cnf configuration file to make it permanent), but replication must be stopped on a Replica before changing it. The default value is zero: no worker threads are created and the SQL thread handles this portion of the replication process, one transaction at a time. Setting it to a non-zero value specifies the number of worker threads a Replica will create; it can be set between 0 and 16383. These threads apply multiple transactions from the relay log in parallel. The number of parallel transactions that can be applied at the same time within a given replication domain is limited by the slave_domain_parallel_threads parameter on a Replica. This parameter is dynamic as well; remember, replication on a Replica must be stopped to change it.

Example: (Dynamically done on a Replica.)

MariaDB [(none)]> stop slave;
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> set global slave_parallel_threads=8;
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> set global slave_domain_parallel_threads=8;
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> start slave;
Query OK, 0 rows affected (0.01 sec)

Parallel replication is handled in two ways with MariaDB, In-Order and Out-of-Order replication. Out-of-Order replication deals with GTIDs and needs to be explicitly enabled by the application using domain ID. Out-of-Order replication will not be addressed here.

In-Order replication executes transactions from the relay logs on a Replica in the same order as they happened on the Primary. If it can be determined that those transactions have no conflicts, they are applied to a Replica in parallel. In-Order replication has five different modes: conservative, optimistic, aggressive, minimal and none. The slave_parallel_mode global variable controls the mode. This variable is dynamic as well; remember to stop replication on a Replica before changing it.

Example:

MariaDB [(none)]> stop slave;
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> set global slave_parallel_mode=conservative;
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> show global variables like 'slave_parallel_mode';
+---------------------+--------------+
| Variable_name | Value |
+---------------------+--------------+
| slave_parallel_mode | conservative |
+---------------------+--------------+

MariaDB [(none)]> start slave;
Query OK, 0 rows affected (0.01 sec)

Conservative mode is the default mode for parallel replication and was introduced with MariaDB version 10.0. This mode uses group commit on the Primary. It reads ahead in the relay logs to find potential transactions to apply in parallel on a Replica. If transactions are committed together on the Primary, their commit id will be identical in the binlogs. These transactions can then be committed on a Replica in parallel using different slave parallel worker threads, while still maintaining the same commit order as on the Primary. If two transactions were committed separately on the Primary and can potentially conflict, the second transaction will wait for the first transaction to commit before starting. These types of transactions are not committed in parallel.

Optimistic mode was introduced in MariaDB Server 10.1.3. It is just that: it optimistically assumes there will be few conflicts. It allows any transactional DML (INSERT/UPDATE/DELETE) to be applied in parallel on a Replica. If a conflict happens between two commits, the second commit is rolled back to allow the first commit to finish, and is then retried and committed, keeping the same order on a Replica as on the Primary. This takes valuable time, but the extra time is offset by the time gained through parallel replication. There are some heuristics in optimistic mode to prevent conflicts; one example is that if a transaction executed a row lock wait on the Primary, it will not run in parallel on a Replica.

DDL and non-transactional DML cannot be applied in parallel, simply because they cannot be rolled back. These types of transactions are applied individually, not in parallel. The different types of transactions and how they are applied can be seen in the binlogs. These logs are not readable as plain text; the mysqlbinlog utility is needed to read their contents.

Example:

# cd /binlog/directory
# mysqlbinlog mysql-binlog.00420 | more

Aggressive mode was introduced in MariaDB Server 10.1.3. It is very similar to optimistic mode. The main difference between aggressive mode and optimistic is that the heuristics are disabled in aggressive mode. This means more commits can be applied, but at the cost of greater amounts of conflicts.

Minimal mode was introduced in MariaDB Server 10.1.3. This mode only allows the commit step of a transaction to be applied in parallel.

To check the status of the worker threads on a Replica, use SHOW PROCESSLIST. The worker threads will appear under "system user". Their state will show the query they are currently working on, or one of the following messages:

Waiting for work from SQL threads –
This means that the worker thread is idle, no work is available for it at the moment.

Waiting for prior transaction to start commit before starting next transaction –
This means that the previous batch of transactions that committed together on the Primary has to complete first. This worker thread is waiting for that to happen before it can start working on the following batch.

Waiting for prior transaction to commit –
This means that the transaction has been executed by the worker thread. In order to ensure in-order commit, the worker thread is waiting to commit until the previous transaction is ready to commit before it.

For better performance with parallel replication on a Replica, it is best to have more group commits on the Primary when using the conservative mode of in-order replication. To calculate the number of transactions per binlog group commit, divide Binlog_commits (the total number of transactions committed to the binary log) by Binlog_group_commits (the total number of groups committed to the binary log), both from SHOW STATUS. To see how group commit is helping parallel replication, take two measurements some time apart. If your group commit ratio is close to 1, parallel replication threads may not help much, and it may pay off to change your group commit frequency. To calculate the group commit ratio between the two measurements, use the following formula:

group commit ratio = (2nd Binlog_commits - 1st Binlog_commits) / (2nd Binlog_group_commits - 1st Binlog_group_commits)

Example: 1st collection:

SHOW GLOBAL STATUS WHERE Variable_name IN('Binlog_commits', 'Binlog_group_commits');
+----------------------+----------+
| Variable_name | Value |
+----------------------+----------+
| Binlog_commits | 35275333 |
| Binlog_group_commits | 13818856 |
+----------------------+----------+

Commit some transactions (small workload).
2nd collection:

SHOW GLOBAL STATUS WHERE Variable_name IN('Binlog_commits', 'Binlog_group_commits');
+----------------------+----------+
| Variable_name | Value |
+----------------------+----------+
| Binlog_commits | 42787157 |
| Binlog_group_commits | 15244542 |
+----------------------+----------+

This gives the number of transactions in a binlog group commit.

group commit ratio = 7511824 / 1425686 = 5.26

Starting with MariaDB 10.0, group commit frequency can be changed by configuring the system variables binlog_commit_wait_count, the number of transactions to wait for before flushing them to the binary log as a group, and binlog_commit_wait_usec, the maximum number of microseconds a committed transaction will wait to be flushed to the binary log as part of a group.

binlog_commit_wait_count has a direct correlation to slave_parallel_threads on a Replica, which brings us back to the Replica-side variables that influence parallel replication. Another such variable is slave_parallel_max_queued. In parallel replication the SQL thread reads the relay logs on a Replica and, instead of applying those transactions itself one at a time, hands them off to the worker threads to apply. The slave_parallel_max_queued variable controls the maximum amount of memory the parallel slave queue can use. If it is set too high, the queue can fill up faster than the worker threads can commit and replication lag can build up. If it is set too low, there will not be enough queued events in memory to keep all the worker threads busy; you would see this in SHOW PROCESSLIST on a Replica as "Waiting for room in worker thread event queue".
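
As a small illustration (the 1 MiB value is just an example, not a recommendation), the variable is adjusted like the other parallel replication settings, with replication stopped:

STOP SLAVE;
-- value is in bytes per worker thread; total read-ahead memory is roughly
-- slave_parallel_max_queued * slave_parallel_threads
SET GLOBAL slave_parallel_max_queued = 1048576;
START SLAVE;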

Durability with group commit, the "D" in ACID, is ensured by setting both the sync_binlog variable and the innodb_flush_log_at_trx_commit variable to one. Setting both to one gives you the greatest level of fault tolerance, but it may come at a cost; parallel replication may help alleviate that cost.

Let’s see how parallel replication affects these variables. I set up two database instances on two separate servers, both running MariaDB 10.2.21. Port 3502 will be known as the Primary and port 3503 will be a Replica.

Connect to the first server, log on to the first database instance and create a database for testing. This instance will be known as “bob”.

ssh @.com
MariaDB [(none)]> CREATE DATABASE bob;

Create two users: one is a replication user and the other is for pt-stalk, a great tool for collecting system statistics. Give them the appropriate privileges.
On Primary:

MariaDB [(none)]> CREATE user 'repl_bob'@'localhost' identified by 'repl_bob';
MariaDB [(none)]> GRANT PROCESS, REPLICATION SLAVE, REPLICATION CLIENT on *.* to 'repl_bob'@'localhost';
MariaDB [(none)]> CREATE user 'tracker'@'localhost' identified by "tracker";
MariaDB [(none)]> GRANT ALL on *.* to 'tracker'@'localhost';

Then,

MariaDB [(none)]> SHOW MASTER STATUS;

On Replica:

MariaDB [(none)]> CHANGE MASTER TO MASTER_HOST=.com,
MASTER_USER='repl_bob', MASTER_PASSWORD='repl_bob', MASTER_PORT=3502, MASTER_LOG_FILE='mysql-binlog.0000010', MASTER_LOG_POS=123456789, MASTER_CONNECT_RETRY=10;

MariaDB [(none)]> START SLAVE;
MariaDB [(none)]> SHOW SLAVE STATUS\G

At this point, basic replication should be running. Back on the Primary, we create the tables that the workload will populate and that we will use to show replication lag.
On Primary:

MariaDB [(none)]>
CREATE TABLE Tab1(id VARCHAR(30)PRIMARY KEY, firstname VARCHAR(30) NOT NULL, lastname VARCHAR(30) NOT NULL, email VARCHAR(50), reg_date TIMESTAMP );
CREATE TABLE Tab2(id VARCHAR(30)PRIMARY KEY, firstname VARCHAR(30) NOT NULL, lastname VARCHAR(30) NOT NULL, email VARCHAR(50), reg_date TIMESTAMP );
CREATE TABLE Tab3(id VARCHAR(30)PRIMARY KEY, firstname VARCHAR(30) NOT NULL, lastname VARCHAR(30) NOT NULL, email VARCHAR(50), reg_date TIMESTAMP );
CREATE TABLE Tab4(id VARCHAR(30)PRIMARY KEY, firstname VARCHAR(30) NOT NULL, lastname VARCHAR(30) NOT NULL, email VARCHAR(50), reg_date TIMESTAMP );
CREATE TABLE Tab5(id VARCHAR(30)PRIMARY KEY, firstname VARCHAR(30) NOT NULL, lastname VARCHAR(30) NOT NULL, email VARCHAR(50), reg_date TIMESTAMP );

Let’s do some testing:

I run a replication workload on the Primary for fifteen minutes to get a baseline, take a timestamp of how far behind replication gets on a Replica, and measure how long it takes for the replication lag to catch up. Shown below are the replication variables changed for each test and the results of those tests.

The following procedures were performed for each test.

Start pt-stalk on the Primary before running the replication workload script. This helps gather valuable information about what the database is doing.

pt-stalk --no-stalk --iterations 200 --sleep 5 --dest ./pt-stalk --user=tracker --socket=/tmp/mysqld_test01_3502.sock --port=3502 --password=tracker &

On a Replica start pt-stalk to collect a Replica’s statistics.

pt-stalk --no-stalk --iterations 200 --sleep 5 --dest ./pt-stalk --user=tracker --socket=/tmp/mysqld_test01_3503.sock --port=3503 --password=tracker &

Start a workload on the Primary. The repl_wk_load.sh script contains one INSERT, one UPDATE and one DELETE statement for each table that continuously execute for fifteen minutes.

nohup ./repl_wk_load.sh nohuppy &

Manually monitoring the status of the replication workload script can be done with SHOW PROCESSLIST on the Primary. Again, this information is also being collected by pt-stalk.

MariaDB [(none)]> SHOW PROCESSLIST;

On a Replica, check the processes and replication lag with:

MariaDB [(none)]> SHOW PROCESSLIST;
MariaDB [(none)]> SHOW SLAVE STATUS\G

=============================================================================
Test 0 – Baseline
Primary

sync_binlog = 0
Innodb_flush_log_at_trx_commit = 0
binlog_format = Mixed
binlog_commit_wait_count = 0
binlog_commit_wait_usec = 100000
Replica - Defaults
slave_parallel_threads = 0
slave_parallel_max_queued = 131072
slave_compressed_protocol = 0

Test 1
Primary

sync_binlog = 1
Innodb_flush_log_at_trx_commit = 1
binlog_format = Mixed
binlog_commit_wait_count = 0
binlog_commit_wait_usec = 100000
Replica - no change
slave_parallel_threads = 0
slave_parallel_max_queued = 131072
slave_compressed_protocol = 0

Test 2
Primary

sync_binlog = 1
Innodb_flush_log_at_trx_commit = 1
binlog_format = Mixed
binlog_commit_wait_count = 5
binlog_commit_wait_usec = 100000
Replica -
slave_parallel_threads = 5
slave_parallel_max_queued = 131072
slave_compressed_protocol = 0

Test 3
Primary

sync_binlog = 1
Innodb_flush_log_at_trx_commit = 1
binlog_format = Mixed
binlog_commit_wait_count = 5
binlog_commit_wait_usec = 50000
Replica -
slave_parallel_threads = 5
slave_parallel_max_queued = 131072
slave_compressed_protocol = 0

Results:

Each workload ran for 15 minutes.

Test       Start Time   Endtime (workload stopped)   Sec Behind Master at Endtime   Max Sec Behind Master   Replication Catch-Up Time
Baseline   10:33:27     10:48:27                     673                            2892                    11:36:28 (48 minutes)
Test 1     12:14:41     12:29:41                     753                            6397                    14:06:33 (97 minutes)
Test 2     08:14:41     08:29:40                     491                            1401                    08:54:00 (24 minutes)
Test 3     09:04:15     09:19:15                     0                              0                       No lag

Below are the measured group commits on the Primary.

Baseline

MariaDB [sample5]> SHOW GLOBAL STATUS WHERE Variable_name IN('Binlog_commits', 'Binlog_group_commits');
+----------------------+----------+
| Variable_name    	| Value	|
+----------------------+----------+
| Binlog_commits   	| 42180283 |
| Binlog_group_commits | 32312303 |
+----------------------+----------+
+----------------------+----------+
| Variable_name    	| Value	|
+----------------------+----------+
| Binlog_commits   	| 54229555 |
| Binlog_group_commits | 43653895 |
+----------------------+----------+

(54229555-42180283)/(43653895-32312303) = 12049272/11341592 = 1.06xxx

Test 1

MariaDB [(none)]> SHOW GLOBAL STATUS WHERE Variable_name IN('Binlog_commits', 'Binlog_group_commits');
+----------------------+---------+
| Variable_name        | Value   |
+----------------------+---------+
| Binlog_commits       | 5335214 |
| Binlog_group_commits | 2632599 |
+----------------------+---------+
+----------------------+----------+
| Variable_name        | Value    |
+----------------------+----------+
| Binlog_commits       | 13733155 |
| Binlog_group_commits | 6773225  |
+----------------------+----------+

(13733155-5335214)/(6773225-2632599) = 8397941/4140626 = 2.02xxx

Test 2

MariaDB [(none)]> SHOW GLOBAL STATUS WHERE Variable_name IN('Binlog_commits', 'Binlog_group_commits');
+----------------------+----------+
| Variable_name    	| Value	|
+----------------------+----------+
| Binlog_commits   	| 23916530 |
| Binlog_group_commits | 9063206  |
+----------------------+----------+
+----------------------+----------+
| Variable_name    	| Value	|
+----------------------+----------+
| Binlog_commits   	| 33913182 |
| Binlog_group_commits | 13566537 |
+----------------------+----------+

(13566537-9063206)/(33913182-23916530) = 9996652/4503331 = 2.21

Test 3

MariaDB [(none)]> SHOW GLOBAL STATUS WHERE Variable_name IN('Binlog_commits', 'Binlog_group_commits');
+----------------------+----------+
| Variable_name    	| Value	|
+----------------------+----------+
| Binlog_commits   	| 35275333 |
| Binlog_group_commits | 13818856 |
+----------------------+----------+
+----------------------+----------+
| Variable_name    	| Value	|
+----------------------+----------+
| Binlog_commits   	| 42787157 |
| Binlog_group_commits | 15244542 |
+----------------------+----------+
 
(15244542-13818856)/(42787157-35275333) = 7511824/1425686 = 5.26

Drawing conclusions from the above data, you can see how using slave_parallel_threads with parallel replication can reduce or completely remove replication lag. We can also see what effect binlog_commit_wait_count and binlog_commit_wait_usec have on group commits. Each workload is different; just enabling parallel replication may not be enough, so test and evaluate which settings are right for your workload.


Connect Amazon Athena to your Apache Hive Metastore and use user-defined functions


Feed: AWS Big Data Blog.

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. This post details the two new preview features that you can start using today: connecting to Apache Hive Metastore and using user-defined functions.

Connecting Athena to your Apache Hive Metastore

Several customers use the Hive Metastore as a common metadata catalog for their big data environments. Such customers run Apache Spark, Presto, and Apache Hive on Amazon EC2 and Amazon EMR clusters with a self-hosted Hive Metastore as the common catalog. AWS also offers the AWS Glue Data Catalog, which is a fully managed catalog and drop-in replacement for the Hive Metastore. With the release as of this writing, you can now use the Hive Metastore in addition to the Data Catalog with Athena. Athena now allows you to connect to multiple Hive Metastores alongside the existing Data Catalog.

To connect to a self-hosted Hive Metastore, you need a metastore connector. You can download a reference implementation of this connector, which runs as a Lambda function in your account. The current version of the implementation supports only SELECT queries, and DDL support is limited to basic metadata syntax; for more information, see the Considerations and Limitations of this feature. You can also write your own Hive Metastore connector using the reference implementation as an example, deploy it as a Lambda function, and subsequently use it with Athena. For more information about the feature, see the Using Athena Data Connector for External Hive Metastore (Preview) documentation.

Using user-defined functions in Athena

Athena also offers preview support for scalar user-defined functions (UDFs). UDFs enable you to write functions and invoke them in SQL queries. A scalar UDF is applied one row at a time and returns a single column value. Athena invokes your scalar UDF with batches of rows to limit the performance impact associated with making a remote call for the UDF itself.

With the latest release as of this writing, you can use the Athena Query Federation SDK to define your functions and invoke them inline in SQL queries. You can now compress and decompress row values, scrub personally identifiable information (PII) from columns, transform dates to a different format, read image metadata, and execute proprietary custom code in your queries. You can also execute UDFs in both the SELECT and FILTER phase of the query and invoke multiple UDFs in the same query.
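
As a hedged sketch of what an inline invocation looks like (the UDF name, Lambda function name, and table are hypothetical), the preview syntax declares the function at the top of the query and then calls it like any scalar function:

USING EXTERNAL FUNCTION mask_email(email VARCHAR)
    RETURNS VARCHAR
    LAMBDA 'my-athena-udf-lambda'
SELECT order_id,
       mask_email(customer_email) AS masked_email
FROM sales_records
LIMIT 10;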

For more information about UDFs, see our documentation. For common UDF example implementations, see the GitHub repo. For more information about writing functions using the Athena Query Federation SDK, please visit this link.

Testing the preview features

All Athena queries originating from the workgroup AmazonAthenaPreviewFunctionality are considered Preview test queries.

Create a new workgroup AmazonAthenaPreviewFunctionality using Athena APIs or the Athena console. For more information, see Create a Workgroup.

The following considerations are important when using preview features.

Do not edit the workgroup name. You can edit other workgroup properties, such as enabling Amazon CloudWatch metrics and requester pays. You can use the Athena console, JDBC/ODBC drivers, or APIs to submit your test queries. Specify the workgroup AmazonAthenaPreviewFunctionality when you submit test queries.

Preview functionality is available only in the us-east-1 Region. If you use Athena in any other Region and submit queries using AmazonAthenaPreviewFunctionality, your query fails. Cross-Region calls are not supported in preview mode.

During the preview, you do not incur charges for the data scanned from federated data sources. However, you are charged standard Athena rates for data scanned from S3. Additionally, you are charged standard rates for the AWS services that you use with Athena, such as S3, AWS Lambda, AWS Glue, Amazon SageMaker, and AWS SAM. For example, you are charged S3 rates for storage, requests, and inter-Region data transfer. By default, query results are stored in an S3 bucket of your choice and are billed at standard S3 rates. If you use Lambda, you are charged based on the number of requests for your functions and the duration (the time it takes for your code to execute).

It is not recommended to onboard your production workload to AmazonAthenaPreviewFunctionality.

Query performance may vary between the preview workgroup and the other workgroups in your account. Additionally, new features and bug fixes may not be backwards compatible.

Summary

In summary, we introduced Athena's two new features that were released today in preview.

Customers who use the Apache Hive Metastore for metadata management, and were previously unable to use Athena, can now connect their Hive Metastore to Athena to run queries. Also, customers can now use Athena’s Query Federation SDK to define and invoke their own functions in their SQL queries in Athena.

Both these features are available in Preview in the AWS us-east-1 region. Begin your Preview now by following these steps in the Athena FAQ. We welcome your feedback at Athena-feedback@amazon.com


About the Author

Janak Agarwal is a product manager for Athena at AWS.

The dangers of replication filters in MySQL


Feed: Planet MySQL.
Author: Federico Razzoli.


MySQL supports what it calls replication filters. They allow you to decide which tables are replicated and which are not. They can be set on a slave to filter in tables and databases: the slave will not replicate everything from the master, only the specified databases and tables. Or they can filter out tables and databases: the specified databases and tables are ignored, and everything else is replicated. We can also use a combination of in-filters and out-filters, and we can filter specific databases, specific tables, or patterns for table names (for example, all the tables with a certain suffix). A sketch of how such filters are typically configured appears below.
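
As a minimal sketch (the database and table names are made up), filters are traditionally set as --replicate-* options in the slave's my.cnf; since MySQL 5.7 some of them can also be set dynamically while the replication SQL thread is stopped:

-- on the slave
STOP SLAVE SQL_THREAD;
CHANGE REPLICATION FILTER
    REPLICATE_DO_DB = (appdb),
    REPLICATE_IGNORE_TABLE = (appdb.debug_log),
    REPLICATE_WILD_IGNORE_TABLE = ('appdb.old%');
START SLAVE SQL_THREAD;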

A bit confusing but powerful, right? Unfortunately, this feature is also very dangerous, as it can easily cause inconsistencies (different data) between the master and the slave. This can happen because applications run queries in a wrong way, or because a DBA runs a migration without following certain rules.

Inconsistent data is per se bad news. But keep also in mind that, if MySQL detects an inconsistency, replication will break. MySQL detects an inconsistency, for example, when it tries to replicate a row deletion but the target row doesn't exist anymore.

Before listing the risks themselves, let me clarify a couple of concepts that some of my readers may not be familiar with.

Glasgow, seen from Queen's Park. Westminster's decisions are replicated with some filters by the national governments, but is devolution enough?

Some MySQL concepts

Let me explain a couple of MySQL concepts first, because they’re important to understand the risks of replication filters.

I want this article to be understandable for non-expert MySQL users. But if you know everything about binary log formats and you know what default database means, just skip to How replication filters can cause inconsistencies and Binary log filters, below.

Statement replication

All the risks only affect the replication of events from the master, in the STATEMENT format. Events are logged as STATEMENT if:

  • binlog_format = STATEMENT;
  • binlog_format = MIXED (for statements not considered unsafe, which are still logged in statement format);
  • If a user with the SUPER privilege changes the binlog_format, maybe just temporarily and at session level;
    • This could also happen in a stored procedure created by a superuser;
  • DDL statements (CREATE TABLE, ALTER TABLE, DROP TABLE…) are always logged as STATEMENT.

For more details, see Setting The Binary Log Format, in MySQL documentation.

Default database

A default database is the currently selected database. If there is none, you’ll need to write queries in this form:

SELECT * FROM my_database.my_table;

Using a default database means to write something like this:

USE my_database;
SELECT * FROM my_table;

When writing an application, one usually specifies a database when establishing the connection to MySQL, instead of running USE.

How replication filters can cause inconsistencies

  • When no default database is selected, database-level filters won’t work.
  • When a database is selected but we are writing to a table into a different database, database-level filters won’t work.
  • replicate_do_table doesn’t prevent changes made in stored functions. One could object that functions should not write to tables, but that is not always true: a function could treat table data as static variables. A sketch of such a function appears after this list.
  • Replication breaks when a single statement modifies both a filtered-in table and a filtered-out table. (multi-table DELETE, multi-table UPDATE).

The last risk is particularly tricky if you use patterns to exclude (or include) table names. For example, you may add a filter to ignore all tables starting with old_. But after some months/years a new table called old_index may be created, and you may want to replicate it. At that point, it’s very easy to make a mistake.
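
To illustrate the stored function risk, here is a minimal, hypothetical sketch (the table and function names are made up, page_views is assumed to exist, and with binary logging enabled you may also need log_bin_trust_function_creators=1 to create such a function). If the slave only filters in other tables, a statement against an unrelated table can still modify hit_counter through the function:

CREATE TABLE hit_counter (id INT PRIMARY KEY, hits BIGINT NOT NULL);
INSERT INTO hit_counter VALUES (1, 0);

DELIMITER //
CREATE FUNCTION next_hit() RETURNS BIGINT
    MODIFIES SQL DATA
BEGIN
    -- the table acts as a persistent "static variable"
    UPDATE hit_counter SET hits = hits + 1 WHERE id = 1;
    RETURN (SELECT hits FROM hit_counter WHERE id = 1);
END//
DELIMITER ;

-- A write to a completely different table updates hit_counter as a side effect:
INSERT INTO page_views (page, view_no) VALUES ('/home', next_hit());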

Binary log filters

Replication filters are set on the slave to decide which events – sent by the master – will be replicated. Binary log filters are set on the master to decide which events should be written in the binary log. Filtered-out events are not logged and not sent to any slave. They are twin options: identical, with a different prefix (replicate_ or binlog_).

Binary log filters are subject to the same limitations/risks as replication filters. But, while they sound like an optimisation (because events that we don’t want to replicate are not logged by the master and not seen by the slaves at all), they are inherently subject to two important problems:

  • Not logged events cannot be replicated by any slave;
  • Using the binary log as an incremental backup will not restore any non-logged change;
    • A way to undo a destructive change is to restore the latest backup and re-apply the changes from the binary log, except for the one we want to undo. But if the binary log doesn’t include all changes made to the data, this will not be possible.

Recommendations for a safe replication

My recommendations about the use of these features are:

  • Don’t use binary log filters at all.
  • Filtering out the mysql system database normally does no harm. This is desirable to avoid replicating the master's users and permissions to the slave.
    • The only case when this is dangerous is when we write into the mysql database manually, instead of using statements like CREATE USER.
  • The sys database can be different for every MySQL version, so it is a good idea to filter it out.
  • Filtering out information_schema or performance_schema has no effect; simply don't do it.
  • Sometimes we really want to avoid replicating some regular databases, and using replication filters is the only way. But at least, we should follow some rules strictly:
    • Use ROW binary log format.
    • Make sure that applications select a default database and never write to a non-default database.
    • If you do migrations manually, make sure to properly use USE.
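For example, a manual migration written so that database-level filters see the intended database might look like this (schema and table names are placeholders):

USE shop;
-- shop is now the default database, so database-level filters on the
-- slave evaluate this statement against shop, as intended
ALTER TABLE orders ADD COLUMN note VARCHAR(255);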

Conclusions

We discussed the risks of replication filters and binary log filters, and I offered my recommendations to make your replication safe.

Did you notice any mistakes? Do you have different opinions? Do you have ideas to share? Please comment!

As usual, I'll be happy to fix errors and discuss your ideas. I want to thank everyone who has contributed to this website with their comments, creating a valuable shared knowledge base.

Toodle pip,
Federico


It’s About Time: Getting More from Your Time-Series Data With MemSQL 7.0


Feed: MemSQL Blog.
Author: Eric Hanson.

MemSQL is uniquely suited to real-time analytics, where data is being ingested, updated, and queried concurrently with aggregate queries. Real-time analytics use cases often are based on event data, where each separate event has a timestamp. It’s natural to interpret such a sequence of events as a time series.

Prior to the 7.0 release, MemSQL delivered many capabilities that make it well-suited to time-series data management [Han19]. These include:

  • a scaled-out, shared-nothing architecture that supports transactional and analytical workloads with a standard SQL interface,
  • fast query execution via compilation and vectorization, combined with scale out,
  • ability to load data phenomenally fast using the Pipelines feature, which supports distributed, parallel ingestion,
  • non-blocking concurrency control so readers and writers never make each other wait,
  • window functions for ranking, moving averages, and so on,
  • a highly-compressed columnstore data format suitable for large historical data sets.

Hence, many of our customers are using MemSQL to manage time series data today.

For the MemSQL 7.0 release, we decided to build some special-purpose features to make it even easier to manage time-series data. These include FIRST(), LAST(), TIME_BUCKET(), and the ability to designate a table column as the SERIES TIMESTAMP [Mem19a-d]. Taken together, these allow specification of queries to summarize time series data with far fewer lines of code and fewer complex concepts. This makes expert SQL developers more productive, and opens up the ability to query time series data to less expert developers.

We were motivated to add special time series capability in MemSQL 7.0 for the following reasons:

  • Many customers were using MemSQL for time series data already, as described above.
  • Customers were asking for additional time series capability.
  • Bucketing by time, a common time series operation, was not trivial to do.
  • Use of window functions, while powerful for time-based operations, can be complex and verbose.
  • We’ve seen brief syntax for time bucketing in event-logging data management platforms like Splunk [Mil14] and Azure Data Explorer (Kusto) [Kus19] be enthusiastically used by developers.
  • We believe we can provide better overall data management support for customers who manage time series data than the time series-specific database vendors can. We offer time series-specific capability and also outstanding performance, scalability, reliability, SQL support, extensibility, rich data type support, and so much more.

Designating a Time Attribute in Metadata

To enable simple, brief SQL operations on time series data, we recognized that all our new time series functions would have a time argument. Normally, a table has a single, well-known time attribute. Why not make this attribute explicit in metadata, and an implicit argument of time-based functions, so you don’t have to reference it in every query expression related to time?

So, in MemSQL 7.0 we introduced a special column designation, SERIES TIMESTAMP, that indicates a default time column of a table. This column is then used as an implicit attribute in time series functions. For example, consider this table definition:

CREATE TABLE tick(
  ts datetime(6) series timestamp,
  symbol varchar(5),
  price numeric(18,4));

It defines a table, tick, containing hypothetical stock trade data. The ts column has been designated as the series timestamp. In examples to follow, we’ll show how you can use it to make queries shorter and easier to write.

The Old Way of Querying Time Series

Before we show the new way to write queries briefly using time series functions and the SERIES TIMESTAMP designation in 7.0, consider an example of how MemSQL could process time series data before 7.0. We’ll use the following data for examples:

INSERT INTO tick VALUES
 ('2020-02-18 10:55:36.179760', 'ABC', 100.00),
 ('2020-02-18 10:57:26.179761', 'ABC', 101.00),
 ('2020-02-18 10:59:16.178763', 'ABC', 102.50),
 ('2020-02-18 11:00:56.179769', 'ABC', 102.00),
 ('2020-02-18 11:01:37.179769', 'ABC', 103.00),
 ('2020-02-18 11:02:46.179769', 'ABC', 103.00),
 ('2020-02-18 11:02:59.179769', 'ABC', 102.60),
 ('2020-02-18 11:02:46.179769', 'XYZ', 103.00),
 ('2020-02-18 11:02:59.179769', 'XYZ', 102.60),
 ('2020-02-18 11:03:59.179769', 'XYZ', 102.50);

The following query works in MemSQL 6.8 and earlier. As output, it produces a separate row, for each stock, for each hour it was traded at least once. (So if a stock is traded ten or more times, in ten separate hours, ten rows are produced for that stock. A row will contain either a single trade, if only one trade occurred in that hour, or a summary of the trades – two or more – that occurred during the hour.) Each row shows the time bucket, stock symbol, and the high, low, open, and close for the bucket period. (If only one trade occurred in that hour, the high, low, open, and close will all be the same – the price the stock traded at in that hour.)

WITH ranked AS
(SELECT symbol,
    RANK() OVER w as r,
    MIN(price) OVER w as min_pr,
    MAX(price) OVER w as max_pr,
    FIRST_VALUE(price) OVER w as first,
    LAST_VALUE(price) OVER w as last,
    from_unixtime(unix_timestamp(ts) div (60*60) * (60*60)) as ts
    FROM tick
    WINDOW w AS (PARTITION BY symbol, 
               from_unixtime(unix_timestamp(ts) div (60*60) * (60*60)) 
               ORDER BY ts
               ROWS BETWEEN UNBOUNDED PRECEDING
               AND UNBOUNDED FOLLOWING))
 
SELECT ts, symbol, min_pr, max_pr, first, last
FROM ranked
WHERE r = 1
ORDER BY symbol, ts;

This query produces the following output, which can be used to render a candlestick chart [Inv19], a common type of stock chart.

+---------------------+--------+----------+----------+----------+----------+
| ts                  | symbol | min_pr   | max_pr   | first    | last     |
+---------------------+--------+----------+----------+----------+----------+
| 2020-02-18 10:00:00 | ABC    | 100.0000 | 102.5000 | 100.0000 | 102.5000 |
| 2020-02-18 11:00:00 | ABC    | 102.0000 | 103.0000 | 102.0000 | 102.6000 |
| 2020-02-18 11:00:00 | XYZ    | 102.5000 | 103.0000 | 103.0000 | 102.5000 |
+---------------------+--------+----------+----------+----------+----------+

The query text, while understandable, is challenging to write because it uses a common table expression (CTE), window functions with a non-trivial window definition, a subtle use of ranking to pick one row per group, and a non-obvious divide/multiply trick to group time to a 60*60 second bucket.
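For instance, the divide/multiply trick truncates a timestamp to the start of its hour (shown here with a literal value in place of ts, and assuming a session time zone with a whole-hour offset):

SELECT from_unixtime(unix_timestamp('2020-02-18 10:55:36') div (60*60) * (60*60));
-- returns 2020-02-18 10:00:00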

New Time-Series Functions in MemSQL 7.0

Here I’ll introduce the new time series functions, and then show an example where we write an equivalent query to the “candlestick” query above using the new functions. I think you’ll be impressed by how concise it is!

Also see the latest documentation for analyzing time series data and for the new time series functions.

FIRST()

The FIRST() function is an aggregate function that takes two arguments, as follows:

FIRST (value[, time]);

Given a set of input rows, it returns the value for the smallest associated time.

The second argument is optional. If it is not specified, it is implicitly the SERIES TIMESTAMP column of the table being queried. It’s an error if there is no SERIES TIMESTAMP available, or if there is more than one available in the context of the query where FIRST is used; in that case, you should specify the time explicitly.

For example, this query gives the symbol of the first stock traded among all stocks in the tick table:

SELECT first(symbol) FROM tick;

The result is ABC, which you can see is the first one traded at 10:55:36.179760 in the rows inserted above.
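If the time column is ambiguous in a given query, or you simply prefer to be explicit, you can pass it as the second argument; for the tick table this is equivalent to the query above:

SELECT first(symbol, ts) FROM tick;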

LAST()

LAST is just like FIRST except it gives the value associated with the latest time.
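For example, using the rows inserted above, this returns XYZ, the symbol of the trade with the latest timestamp (11:03:59.179769):

SELECT last(symbol) FROM tick;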

TIME_BUCKET()

TIME_BUCKET takes a time value and buckets it to a specified width. You can use very brief descriptions of bucket width, like ‘1d’ for one day, ‘5m’ for five minutes, and so on. The function takes these arguments:

TIME_BUCKET (bucket_width [, time [,origin]])

The only required argument is bucket_width. As with FIRST and LAST, the time argument is inferred to be the SERIES TIMESTAMP if it is not specified. The origin argument is used if you want your buckets to start at a non-standard boundary – say, if you want day buckets that begin at 8am every day.
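For instance, to count trades per symbol in five-minute buckets over the sample data, a sketch would be:

SELECT time_bucket('5m') AS bucket, symbol, COUNT(*) AS trades
FROM tick
GROUP BY 2, 1
ORDER BY 2, 1;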

Putting It All Together

Now that we’ve seen FIRST, LAST, TIME_BUCKET, and SERIES TIMESTAMP, let’s see how to use all of them to write the candlestick chart query from above. A new version of the same query is simply:

SELECT time_bucket('1h') as ts, symbol, min(price) as min_pr,
    max(price) as max_pr, first(price) as first, last(price) as last
FROM tick
group by 2, 1
order by 2, 1;

The new version of the query produces this output, which is essentially the same as the output of the original query.

+----------------------------+--------+----------+----------+----------+----------+
| ts                         | symbol | min_pr   | max_pr   | first    | last     |
+----------------------------+--------+----------+----------+----------+----------+
| 2020-02-18 10:00:00.000000 | ABC    | 100.0000 | 102.5000 | 100.0000 | 102.5000 |
| 2020-02-18 11:00:00.000000 | ABC    | 102.0000 | 103.0000 | 102.0000 | 102.6000 |
| 2020-02-18 11:00:00.000000 | XYZ    | 102.5000 | 103.0000 | 103.0000 | 102.5000 |
+----------------------------+--------+----------+----------+----------+----------+

Look how short this query is! It is 5 lines long vs. 18 lines for the previous version. Moreover, it doesn’t use window functions or CTEs, nor require the divide/multiply trick to bucket time. It just uses standard aggregate functions and scalar functions.

Conclusion

MemSQL 7.0 makes it much simpler to specify many time-series queries using special functions and the SERIES TIMESTAMP column designation. For a realistic example, we reduced lines of code by more than three-fold, and eliminated the need to use some more advanced SQL concepts.

Given the high performance, unlimited scalability, and full SQL support of MemSQL, it was a strong platform for time series data in earlier releases. Now, in MemSQL 7.0, we’ve taken that power and added greater simplicity with these new built-in capabilities. How can you apply MemSQL 7.0 to your time-oriented data?

References

[Han19] Eric Hanson, What MemSQL Can Do For Time Series Applications, https://www.memsql.com/blog/what-memsql-can-do-for-time-series-applications/, March 2019.

[Inv19] Understanding Basic Candlestick Charts, Investopedia, https://www.investopedia.com/trading/candlestick-charting-what-is-it/, 2019.

[Kus19] Summarize By Scalar Values, Azure Data Explorer Documentation, https://docs.microsoft.com/en-us/azure/kusto/query/tutorial#summarize-by-scalar-values, 2019.

[Mem19a] FIRST, MemSQL Documentation, https://docs.memsql.com/v7.0-beta/reference/sql-reference/time-series-functions/first/, 2019.

[Mem19b] LAST, MemSQL Documentation, https://docs.memsql.com/v7.0-beta/reference/sql-reference/time-series-functions/last/, 2019.

[Mem19c] TIME_BUCKET, MemSQL Documentation, https://docs.memsql.com/v7.0-beta/reference/sql-reference/time-series-functions/time_bucket/, 2019.

[Mem19d] CREATE TABLE Topic, SERIES TIMESTAMP, MemSQL Documentation, https://docs.memsql.com/v7.0-beta/reference/sql-reference/data-definition-language-ddl/create-table/, 2019.

[Mil14] James Miller, Splunk Bucketing, Mastering Splunk, O’Reilly, https://www.oreilly.com/library/view/mastering-splunk/9781782173830/ch03s02.html, 2014.

Historical – InnoDB SMP performance


Feed: Planet MySQL
Author: Mark Callaghan

This is a post about InnoDB SMP performance improvements in 200X written around 2008. It was shared at code.google.com which has since shut down. It describes work done by my team at Google. The big-3 problems for me back then were: lack of monitoring, replication and InnoDB on many-core.

Introduction


This describes performance improvements from changes in the v2 Google patch. While the changes improve performance in many cases, a lot of work remains to be done. The patch improves performance on SMP servers by:

  • disabling the InnoDB memory heap and associated mutex
  • replacing the InnoDB rw-mutex and mutex on x86 platforms
  • linking with tcmalloc

While tcmalloc makes some of the workloads much faster, we don’t recommend its use yet with MySQL as we are still investigating its behavior.

Database reload

This displays the time to reload a large database shard on a variety of servers (HW + SW). The test uses a real data set that produces a 100GB+ database. Unless otherwise stated, my.cnf was optimized for a fast (but unsafe) reload with the following values. Note that innodb_flush_method=nosync is only in the Google patch and is NOT crash safe (kind of like MyISAM).

  • innodb_log_file_size=1300M
  • innodb_flush_method=nosync
  • innodb_buffer_pool_size=8000M
  • innodb_read_io_threads=4
  • innodb_write_io_threads=2
  • innodb_thread_concurrency=20

The data to be reloaded was in one file per table on the db server. Each file was compressed and reloaded by a separate client. Each table was loaded by a separate connection except for the largest tables when there was no other work to be done. 8 concurrent connections were used.

The smpfix RPM is MySQL 5.0.37 plus the v1 Google patch and the SMP fixes that include:

  • InnoDB mutex uses atomic ops
  • InnoDB rw-mutex uses lock free methods to get and set internal lock state
  • tcmalloc is used in place of glibc malloc
  • the InnoDB malloc heap is disabled

The base RPM is MySQL 5.0.37 and the v1 Google patch. It does not have the SMP fixes.

The servers are:

  • 8core – the base RPM on an 8-core x86 server
  • 4core-128M – the base RPM on a 4-core x86 server with innodb_log_file_size=128M
  • 8core-tc4 – the base RPM on an 8-core x86 server with innodb_thread_concurrency=4
  • smpfix-128M – the smpfix RPM with innodb_log_file_size=128M
  • 4core – the base RPM on a 4-core x86 server
  • smpfix-4core – the smpfix RPM on a 4-core x86 server
  • smpfix-512M – the smpfix RPM on an 8-core x86 server with innodb_log_file_size=512M
  • smpfix – the smpfix RPM on an 8-core x86 server
  • onlymalloc – the base RPM on an 8-core x86 server with the InnoDB malloc heap disabled
  • smpfix-notcmalloc – the smpfix RPM on an 8-core x86 server without tcmalloc

The x-axis is time, so larger is slower.

Sysbench readwrite

Sysbench includes a transaction processing benchmark. The readwrite version of the sysbench OLTP test is measured here using 1, 2, 4, 8, 16, 32 and 64 threads.

The sysbench command line options are:

# N is 1, 2, 4, 8, 16, 32 and 64
--test=oltp --oltp-table-size=1000000 --max-time=600 --max-requests=0 --mysql-table-engine=innodb --db-ps-mode=disable --mysql-engine-trx=yes --num-threads=N

The my.cnf options are:

innodb_buffer_pool_size=8192M
innodb_log_file_size=1300M
innodb_read_io_threads = 4
innodb_write_io_threads = 4
innodb_file_per_table
innodb_flush_log_at_trx_commit=2
innodb_log_buffer_size = 200m
innodb_thread_concurrency=0
log_bin
key_buffer_size = 50m
max_heap_table_size=1000M
tmp_table_size=1000M
max_tmp_tables=100

The servers are:

  • base – MySQL 5.0.37 without the smp fix
  • tc4 – MySQL 5.0.37 without the smp fix, innodb_thread_concurrency=4
  • smpfix – MySQL 5.0.37 with the smp fix and tcmalloc
  • notcmalloc – MySQL 5.0.37 with the smp fix, not linked with tcmalloc
  • onlymalloc – MySQL 5.0.37 with the InnoDB malloc heap disabled
  • my4026 – unmodified MySQL 4.0.26
  • my4122 – unmodified MySQL 4.1.22
  • my5067 – unmodified MySQL 5.0.67
  • my5126 – unmodified MySQL 5.1.26
  • goog5037 – the same as base, MySQL 5.0.37 without the smp fix

Results for sysbench readonly

Sysbench includes a transaction processing benchmark. The readonly version of the sysbench OLTP test is measured here using 1, 2, 4, 8, 16, 32 and 64 threads.

The sysbench command line options are:

# N is 1, 2, 4, 8, 16, 32 and 64
--test=oltp --oltp-read-only --oltp-table-size=1000000 --max-time=600 --max-requests=0 --mysql-table-engine=innodb --db-ps-mode=disable --mysql-engine-trx=yes --num-threads=N

The my.cnf options are:

innodb_buffer_pool_size=8192M
innodb_log_file_size=1300M
innodb_read_io_threads = 4
innodb_write_io_threads = 4
innodb_file_per_table
innodb_flush_log_at_trx_commit=2
innodb_log_buffer_size = 200m
innodb_thread_concurrency=0
log_bin
key_buffer_size = 50m
max_heap_table_size=1000M
tmp_table_size=1000M
max_tmp_tables=100

The servers are:

  • base – MySQL 5.0.37 without the smp fix
  • tc4 – MySQL 5.0.37 without the smp fix, innodb_thread_concurrency=4
  • smpfix – MySQL 5.0.37 with the smp fix and tcmalloc
  • notcmalloc – MySQL 5.0.37 with the smp fix, not linked with tcmalloc
  • onlymalloc – MySQL 5.0.37 with the InnoDB malloc heap disabled
  • my4026 – unmodified MySQL 4.0.26
  • my4122 – unmodified MySQL 4.1.22
  • my5067 – unmodified MySQL 5.0.67
  • my5126 – unmodified MySQL 5.1.26
  • goog5037 – the same as base, MySQL 5.0.37 without the smp fix

Concurrent joins

This test runs a query with a join. It is run using concurrent sessions. The data fits in the InnoDB buffer cache. The query is:

select count(*) from T1, T2 where T1.j > 0 and T1.i = T2.i

The data for T1 and T2 matches that used for the sbtest table by sysbench. This query does a full scan of T1 and joins to T2 by primary key.

The servers are:

  • base – MySQL 5.0.37 without the smp fix
  • tc4 – MySQL 5.0.37 without the smp fix, innodb_thread_concurrency=4
  • smpfix – MySQL 5.0.37 with the smp fix and tcmalloc
  • notcmalloc – MySQL 5.0.37 with the smp fix, not linked with tcmalloc
  • onlymalloc – MySQL 5.0.37 with the InnoDB malloc heap disabled
  • my4026 – unmodified MySQL 4.0.26
  • my4122 – unmodified MySQL 4.1.22
  • my5067 – unmodified MySQL 5.0.67
  • my5126 – unmodified MySQL 5.1.26
  • goog5037 – the same as base, MySQL 5.0.37 without the smp fix

I only have results for 8 and 16 core servers here. Lower times are better.

(Charts: results for the 8-core and 16-core servers, each shown with and without data from the worst case.)

Concurrent inserts

This test reloads tables in parallel. Tests were run using 1, 2, 4, 8 and 16 concurrent sessions, with a separate table for each connection. The regression for 5.0.37 is in the parser and was fixed by 5.0.54. DDL for the tables is:

create table T$i (i int primary key, j int, index jx(j)) engine=innodb

Multi-row insert statements are used that insert 1000 rows per insert statement. Auto-commit is used. The insert statements look like:

INSERT INTO T1 VALUES (0, 0), (1, 1), (2, 2), …, (999,999);

The servers are:

  • base – MySQL 5.0.37 without the smp fix
  • tc4 – MySQL 5.0.37 without the smp fix, innodb_thread_concurrency=4
  • smpfix – MySQL 5.0.37 with the smp fix and tcmalloc
  • notcmalloc – MySQL 5.0.37 with the smp fix, not linked with tcmalloc
  • onlymalloc – MySQL 5.0.37 with the InnoDB malloc heap disabled
  • my4026 – unmodified MySQL 4.0.26
  • my4122 – unmodified MySQL 4.1.22
  • my5067 – unmodified MySQL 5.0.67
  • my5126 – unmodified MySQL 5.1.26
  • goog5037 – the same as base, MySQL 5.0.37 without the smp fix

MySQL 5.0.37 has a performance regression in the parser. This was fixed in 5.0.54. See bug 29921.

Note, lower values for Time are better.

(Charts: concurrent insert results, each shown with and without data from the worst case.)

Historical – Transactional Replication


Feed: Planet MySQL
Author: Mark Callaghan

This is a post about work done by Wei Li at Google to make MySQL replication state crash safe. Before this patch it was easy for a MySQL storage engine and replication state to disagree after a crash. Maybe it didn’t matter as much for people running MyISAM because that too wasn’t crash safe. But so many people had to wake up late at night to recover from this problem which would manifest as either a duplicate key error or silent corruption. The safe thing to do was to not restart a replica after a crash and instead restore a new replica from a backup.

Wei Li spent about 12 months fixing MySQL replication, adding crash safety and several other features. That was an amazing year from him. I did the code reviews. My reviews were weak. MySQL replication code was difficult back then.

I got to know Domas Mituzas after he extracted this feature from the Google patch to use for Wikipedia. I was amazed he did this and he continued to make my life with MySQL so much better at Google and then Facebook. When I moved to Google I took too long to port this patch for them. My excuse is that Domas wasn’t yelling enough — there were many problems and my priority list was frequently changing.

This post was first shared at code.google.com which has since shut down. This feature was in production around 2007, many years before something similar was provided by upstream. I can't imagine doing web-scale MySQL without it. The big-3 problems for me back then were: lack of monitoring, replication and InnoDB on many-core.

Introduction

Replication state on the slave is stored in two files: relay-log.info and master.info. The slave SQL thread commits transactions to a storage engine and then updates these files to indicate the next event from the relay log to be executed. When the slave mysqld process is stopped between the commit and the file update, replication state is inconsistent and the slave SQL thread will duplicate the last transaction when the slave mysqld process is restarted.

Details

This feature prevents that failure for the InnoDB storage engine by storing replication state in the InnoDB transaction log. On restart, this state is used to make the replication state files consistent with InnoDB.

The feature is enabled by the configuration parameter rpl_transaction_enabled=1. Normally, this is added to the mysqld section in /etc/my.cnf. The state stored in the InnoDB transaction log can be cleared by setting a parameter and then committing a transaction in InnoDB. For example:

set session innodb_clear_replication_status=1;
create table foo(i int) type=InnoDB;
insert into foo values (1);
commit;
drop table foo;

Replication state is updated in the InnoDB transaction log for every transaction that includes InnoDB. It is updated for some transactions that don’t include InnoDB. When the replication SQL thread stops, it stores its offset in InnoDB.

The Dream

We would love to be able to kill the slave (kill -9) and have it always recover correctly. We are not there yet for a few reasons:

  • We don't update the state in InnoDB for some transactions that do not use InnoDB.
  • DDL is not atomic in MySQL. For DROP TABLE and CREATE TABLE there are two steps: create or drop the table in the storage engine and create or drop the frm file that describes the table. A crash between these steps leaves the storage engine out of sync with the MySQL dictionary.
  • Other replication state is not updated atomically. When relay logs are purged, the files are removed and then the index file is updated. A crash before the index file update leaves references to files that don't exist. Replication cannot be started in that case. Also, the index file is updated in place rather than atomically (write temp file, sync, rename).
