
Yorvi Arias: Migrating from Oracle to PostgreSQL: Tips and Tricks


Feed: Planet PostgreSQL.

Migrating to PostgreSQL from Oracle is a topic that often comes up in discussions around PostgreSQL.  At Crunchy Data, we are of course not surprised that there is broad interest in moving to the world’s most advanced database.

There are a lot of great reasons to migrate to PostgreSQL, and if you haven’t looked at PostgreSQL lately, it would be a good time to do so again given the various improvements in PostgreSQL 12.

That said, migrating from one database to another inherently involves new technology and can raise a multitude of questions. To help ease the transition, we are providing a few frequently asked questions and answers from users transitioning from Oracle to PostgreSQL, based on real life scenarios, to serve as a helpful resource.

Common Questions, Differences and Solutions

How To Install Orafce

Orafce is a useful extension that allows you to implement some functions from Oracle in PostgreSQL. For example, if you are used to DATE functions in Oracle, this extension allows you to use those functions. For additional information about Orafce: https://github.com/orafce/orafce. 

Simply follow these steps to get Orafce up and running in a PostgreSQL 12 and RHEL 7 environment.

Typically, the process to build Orafce from source code is relatively user-friendly, but requires a number of dependencies. First, it is necessary to have the postgresql12-devel package installed, as it contains the binary for pg_config. Assuming postgresql12-devel is installed, you may proceed to the following steps to build Orafce and create the extension.

  1. Install the dependencies:
    sudo yum -y install flex bison readline-devel zlib-devel openssl-devel wget libicu-devel llvm5.0-devel llvm-toolset-7-clang gcc-c++
  2. Download the full Orafce source code, available on GitHub. If you are able to connect to GitHub directly, you may use the following command:
    git clone git@github.com:orafce/orafce.git
  3. Make sure you have pg_config in your path. You may use echo $PATH to check if /usr/pgsql-12/bin is present. If not, do the following: export PATH=$PATH:/usr/pgsql-12/bin/
  4. Build the source code. From within the orafce directory, run the following command:
    make all
  5. Install the source code. From within the orafce directory, run the following command:
    make install
  6. Create the orafce extension inside the database. Connect to the database as a user with extension creating privileges and use the following command:
    CREATE EXTENSION orafce;

You will also need to have rhel-7-server-devtools-rpms enabled in order to access the llvm-toolset-7-clang package. This repo can be enabled by running the following command as superuser: subscription-manager repos --enable=rhel-7-server-devtools-rpms.

Having performed all of these steps, you will have successfully created the orafce extension for your PostgreSQL database.
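
As a quick sanity check of the new extension, here is a hedged example of the kind of Oracle-style helpers Orafce provides (the exact set of functions depends on the Orafce version installed):

SELECT add_months(date '2020-01-31', 1);   -- Oracle-style month arithmetic
SELECT last_day(date '2020-02-10');        -- last day of the month
SELECT nvl(NULL::text, 'fallback');        -- Oracle-style NVL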

How To Disable and Enable Constraints

As you may know, Oracle allows you to disable and enable a constraint as many times as needed. This is not commonly done in PostgreSQL and generally isn’t recommended in any database. Even though Oracle allows users to disable and enable constraints, doing so can lead to data corruption if not handled with great care.

In PostgreSQL, instead of disabling constraints, one typically creates the constraints as deferrable and then uses the SET CONSTRAINTS command to defer them. If the constraint isn’t currently deferrable then it will need to be dropped and recreated as deferrable. When creating a constraint, the deferrable clause specifies the default time to check the constraint.

It may also be possible to alter the constraint and make it deferrable, avoiding the need to drop and recreate it. Note that all DDL in PostgreSQL is transactional, so if you wish to drop and recreate things without letting users enter potentially bad data, you can put all of the DDL in a transaction denoted by a BEGIN/COMMIT block. The tables will be locked for the duration of the transaction.
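
As a minimal sketch, assuming a hypothetical orders table with a foreign key named orders_customer_id_fkey, the constraint can be made deferrable in place (this form of ALTER CONSTRAINT works for foreign keys) and then deferred for a batch of changes:

-- Make the existing FK deferrable (hypothetical table and constraint names)
ALTER TABLE orders ALTER CONSTRAINT orders_customer_id_fkey DEFERRABLE INITIALLY IMMEDIATE;

-- Defer the check until commit for one transaction
BEGIN;
SET CONSTRAINTS orders_customer_id_fkey DEFERRED;
-- ... data changes here are only checked against the constraint at COMMIT ...
COMMIT;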

How To Disable ‘NOT NULL’ Constraint

Similar to the question above, we were asked how to disable a NOT NULL constraint in PostgreSQL. In Oracle, the DISABLE CONSTRAINT command can disable any constraint, including NOT NULL. As mentioned before, it is not recommended to disable and enable constraints, because doing so can allow bad data into the database table without warning or notice. If this happens, there would be no way to tell how long queries had been returning incomplete and/or incorrect results based on that bad data.

Fortunately, it is currently not possible to disable/enable NOT NULL in PostgreSQL. If you are required to do something equivalent, the approach is to drop and re-add the constraint. The command ALTER TABLE tablename ALTER COLUMN columnname DROP NOT NULL; is how you drop the NOT NULL constraint.

To re-add the NOT NULL constraint you will use the command ALTER TABLE tablename ALTER COLUMN columnname SET NOT NULL;. Re-adding the NOT NULL constraint will cause the constraint to be validated again, so this is not an instant operation. However, dropping and re-adding the NOT NULL constraints might be faster than having the constraint evaluated on every write during your process.
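
A short sketch with hypothetical table and column names, wrapped in a transaction; note that the ALTER TABLE statements lock the table for the rest of the transaction:

BEGIN;
ALTER TABLE customers ALTER COLUMN email DROP NOT NULL;
-- ... bulk load or clean-up that would otherwise be checked row by row ...
ALTER TABLE customers ALTER COLUMN email SET NOT NULL;  -- re-validates the whole column
COMMIT;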

The GRANT Command

The GRANT command in PostgreSQL operates very similarly to Oracle. There are two basic variants to the command. It can grant privileges on a database object and grant membership to a role. A common question is how to grant create procedure or trigger to a role in PostgreSQL.

In PostgreSQL, you can grant the TRIGGER privilege on a table, which gives the ability to create triggers on that table, not to use them. So, if trigger creation is all you are trying to grant, that is the only privilege you need to grant. You do not have to grant any special privileges to roles other than normal write privileges (INSERT, UPDATE, DELETE) in order to be able to then use the triggers on that table. As long as a role has normal write privileges to that table, the triggers will automatically fire as needed.

All triggers in PostgreSQL use FUNCTIONS as the underlying object that performs the trigger action, not PROCEDURES. PROCEDURES did not exist in PostgreSQL prior to version 11, and as of version 11 they are two distinct object types.

The command syntax for CREATE TRIGGER requires some consideration. Prior to version 11, the clause of the CREATE TRIGGER command used the phrase EXECUTE PROCEDURE to name the function that the trigger will fire. As of version 11, you may write either EXECUTE PROCEDURE or EXECUTE FUNCTION; however, a function is still the only object allowed to be given as the argument. As the current documentation for the command states:

“In the syntax of CREATE TRIGGER, the keywords FUNCTION and PROCEDURE are equivalent, but the referenced function must in any case be a function, not a procedure. The use of the keyword PROCEDURE here is historical and deprecated.”

Additional information regarding CREATE TRIGGER can be found here.

You can also do mass grants of specific privileges on existing objects, for example granting a given privilege on every table in a schema. So, to grant trigger creation privileges on all tables in a given schema (using placeholder schema and role names) you can do:
GRANT TRIGGER ON ALL TABLES IN SCHEMA schema_name TO role_name;

Note that this does not give the privilege to then drop triggers; only the owner of a table can drop them. Also, this command only affects existing objects, not any future objects that may be created.
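
If future tables also need to be covered, ALTER DEFAULT PRIVILEGES can complement the GRANT above; a hedged sketch using the same placeholder names (it only applies to objects later created by the role running the command, or by the role named in FOR ROLE):

ALTER DEFAULT PRIVILEGES IN SCHEMA schema_name GRANT TRIGGER ON TABLES TO role_name;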

Is it possible to drop database objects in PostgreSQL?

In PostgreSQL, only the owner of an object (or a superuser) is allowed to drop it.

As per the documentation: “The right to drop or alter an object is not treated as a grantable privilege; it is inherent in the owner. (However, a similar effect can be obtained by granting or revoking membership in the role that owns the object.) The owner implicitly has all grant options for the object, too.”

Please note: if your application or service depends on Oracle’s ability to grant object drops to other users, you might need to rewrite or reconfigure how this action is performed.

How To Check for NOT NULL

In Oracle, if you want to know which columns in a table are NOT NULL, you can look for constraints of the form CHECK (column_name IS NOT NULL). PostgreSQL does this a little differently. Here’s how to check for it.

There is a NOT NULL constraint column in the pg_attribute system catalog. The pg_attribute catalog stores information about table columns. As stated in the documentation (https://www.postgresql.org/docs/current/catalog-pg-attribute.html), “there will be exactly one pg_attribute row for every column in every table in the database.” attnotnull is the column in pg_attribute that represents NOT NULL constraints.

If you are wondering where other constraints are stored in the system catalog, you can look at the following documentation (https://www.postgresql.org/docs/11/catalog-pg-constraint.html).

For an example of how to link catalogs together to find NOT NULL constraint information, the query below shows all user tables in the database that have not-null columns, along with which columns they are.
If you also want to see these columns in the system catalogs, you can remove the WHERE condition that excludes the system schemas. Many links in the system catalogs are managed via “oid” values, and the tables they relate to are explicitly mentioned in the documentation for that system catalog (e.g. pg_attribute.attrelid relates to pg_class.oid).

SELECT n.nspname AS schemaname, c.relname AS tablename, a.attname AS columnname
FROM pg_class c
JOIN pg_namespace n ON c.relnamespace = n.oid
JOIN pg_attribute a ON c.oid = a.attrelid
WHERE a.attnotnull                 -- attnotnull is a boolean flag
AND NOT a.attisdropped             -- skip dropped columns
AND c.relkind IN ('r', 'p')        -- ordinary and partitioned tables
AND n.nspname NOT IN ('information_schema', 'pg_catalog', 'pg_toast')
AND a.attnum > 0                   -- skip system columns
ORDER BY 1,2,3;

ROWID, CTID and Identity columns

Oracle has a pseudocolumn called ROWID, which returns the address of the row in a table. PostgreSQL has something similar to this called CTID. The only problem is that a row’s CTID is not stable: it changes whenever the row is updated or physically moved, for example by VACUUM FULL. Fortunately, there is a good alternative: identity columns.

Since there is no ROWID in PostgreSQL we suggest using self-generated unique identifiers. This can be achieved in the form of identity columns. Identity columns will help you because they are generated when the row is created and will never change during the life of that row. It is important to know that the way IDENTITY is implemented means that you cannot pre-allocate values. IDENTITY also has additional logic controlling their generation/application, even though it is backed by a sequence. Use the following syntax to create an identity column:

column_name type GENERATED { ALWAYS | BY DEFAULT } AS IDENTITY [ ( sequence_options ) ]

In the syntax you can see the additional logic for how the identity value is generated. GENERATED ALWAYS instructs PostgreSQL to always generate a value for the identity column. If you try to insert or update a value in a GENERATED ALWAYS column, PostgreSQL will raise an error (unless you use the OVERRIDING SYSTEM VALUE clause). This is because the values are system-generated and GENERATED ALWAYS means the column can only hold system-generated values.

GENERATED BY DEFAULT also instructs PostgreSQL to generate a value for the identity column. However, with GENERATED BY DEFAULT you can insert or update a value into the column, and PostgreSQL will use that value for the identity column instead of using the system-generated value.
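
A small sketch with a hypothetical table shows the difference in behaviour between the two variants:

CREATE TABLE items (
    id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name text NOT NULL
);

INSERT INTO items (name) VALUES ('widget');                                  -- id generated by the system
INSERT INTO items (id, name) VALUES (10, 'gadget');                          -- fails: GENERATED ALWAYS
INSERT INTO items (id, name) OVERRIDING SYSTEM VALUE VALUES (10, 'gadget');  -- explicit override
-- With GENERATED BY DEFAULT, the second INSERT would simply use the supplied value 10.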

We hope that this overview of a few common issues in transitioning from Oracle to PostgreSQL eases the process and gives long time Oracle users greater comfort as they evaluate PostgreSQL as an alternative.

For those just getting started with the migration consideration, no blog post on Oracle to PostgreSQL migration would be complete without a mention of ora2pg – a great open source Oracle migration tool that can help evaluate the difficulty of your migration. Of course, Crunchy Data is always here and happy to assist as well!


Kaarel Moppel: Row change auditing options for PostgreSQL


Feed: Planet PostgreSQL.

Recently, I was asked for advice on how to reasonably implement a common task of table change tracking – meaning a so-called “audit trail” for all row changes over time was required. The keyword “compliance” might also ring a bell in this context, here for those who work in finance or for government institutions. But as always with technology, there are a couple of different approaches with different benefits / drawbacks to choose from; let’s lay it out for the Internet Citizens! There’s a summary of pros and cons down below if you’re in a hurry.

TL;DR: sadly, there are no real standard built-in tools for these types of tasks, but in real life it mostly boils down to creating some “shadow” tables and writing some triggers.

The log_statement configuration parameter

This server parameter named log_statement has 4 possible values – none, ddl, mod, all. By default, it’s configured conservatively (in this case, with ‘none’), as all PostgreSQL parameters generally are, in order to avoid accidentally wasting resources. That way, it doesn’t log normal SELECT or DML statements to the server log – only warnings and errors. But when set to ‘mod’ (or ‘all’ which is a superset of ‘mod’), all top level data-changing SQL statements (INSERT, UPDATE, DELETE, TRUNCATE) will be stored in the server log!

Make sure to note the “top level” part though – this means that if you have some business logic exercising triggers on your tables and a chain of changes is triggered, you’ll not see any traces of these subsequent changes in the server log! Also, log_statement doesn’t work well if you rely on stored procedures to change data – since they’re usually called via normal SELECT-s, which don’t get flagged as data-changing operations with the ‘mod’ log_statement level. Worse, if you need to do some dynamic SQL within your stored procedures – even the ‘all’ level won’t catch them!

In short – the use cases for the whole approach are somewhat limited to basic CRUD patterns, and log_statement is not necessarily suitable for enterprise requirements.

PS – also note that superusers can change this log_statement setting on the fly, thus bypassing the auditing and doing stuff secretly, without any traces left! To remind you: “by design” with PostgreSQL, it’s never really possible to guard the system against a malicious superuser; some methods just require more work from them. In short, be careful to whom you hand out superuser rights. Preferably, do not even allow remote superuser access, but that’s another topic – see here for more info, if you’re interested in security.
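
For reference, flipping the parameter needs no restart; a minimal sketch, run as a superuser:

ALTER SYSTEM SET log_statement = 'mod';   -- log all top-level data-modifying statements
SELECT pg_reload_conf();                  -- apply the change without a restart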

Pros

  • Absolutely the easiest setup ever – only one parameter to enable. This is even true during live operation, so it can also be used for ad-hoc troubleshooting.

Cons

  • Catches only the top level DML statement issued by the user.
  • Only explicitly defined column values will be stored to the server log, thus nothing usable from a statement such as: UPDATE tbl SET x = x + 1.
  • No row context for multi-row updates – i.e. you’ll have no indication how many rows were altered by some action.
  • No table / schema based filtering – all or nothing will be logged.
  • Information storage is purely text based – possibly need to deal with huge log files where information is all tangled up, and searching is difficult. Might need additional extraction / parsing / indexing for fast search.
  • Queries from failed transactions are also included.

The pgaudit extension

In short, pgaudit is a 3rd-party PostgreSQL extension that tries to improve on the quite limited default PostgreSQL text-based auditing / logging capabilities. It has been around for ages, so it’s stable enough to use, and there are even packages provided by PGDG repos for Debian / RH-based distros.

Its main drawback is the same as with the previous method, though – everything goes to the same server log together with normal status / error messages. There’s no clear separation of “concerns” – thus searching will be a bit of work, and for fast “trail” access you’ll probably need to parse the files and store them in some other system, properly indexed. It’s also the same story for the generated volume of logs: at default settings (when just enabling everything), it’s way more write-heavy than the log_statement-based approach. In short, be sure to tune the plentiful parameters to your needs. To warn users about that, the project README also nicely says: “… be sure to assess the performance impact while testing and allocate plenty of space on the log volume.”
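
A rough sketch of a minimal setup, assuming the package is installed and 'pgaudit' has been added to shared_preload_libraries (which does require a restart):

CREATE EXTENSION pgaudit;
ALTER SYSTEM SET pgaudit.log = 'write, ddl';  -- audit data-modifying and DDL statements
SELECT pg_reload_conf();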

Pros

  • Quite granular logging / auditing options. Configurable by change type or by some role’s access privileges to table, view, etc.
  • Internally / dynamically generated SQL-s can also be logged.

Cons

  • A 3rd party extension.
  • Possibly heavy disk / IO footprint, same as for the previous method.
  • No row context for multi-row updates.
  • Information storage is purely text-based – need to possibly deal with huge log files where information is all tangled up, and searching is difficult. Might need additional extraction / parsing / indexing for fast search.
  • Queries from failed transactions are also included.

Custom audit tables and triggers for all tables

Custom audit tables and triggers must be the most classic / common approach to auditing, and all those working with RDBMS systems for a longer period have surely seen or implemented something like these features. The main idea – create a “shadow” history tracking / audit table for all relevant tables, where all changes to the main table will be logged, via triggers. However, since triggers are considered black magic by a lot of developers these days, I’ve also seen implementations via application code…but this can’t be recommended, as only in a non-existent perfect world are there no ad-hoc manual data changes.

The setup process here would look something like what you see below for every target table X, where we want to track who changed what rows / columns and when:

  1. Create the “shadow” table for X, typically X_LOG with some typical audit columns like “change time”, “user doing the change”, “type of change” and then all or only important data columns from X.
  2. Create a trigger function FX which inserts the new or old values (it’s a matter of taste) into X_LOG with above declared audit and data columns filled.
  3. Declare a trigger on table X to call our tracking function FX for each altered row. The trigger would typically be an AFTER trigger, as we don’t want to alter anything but just protocol the change; however, when doing heavy multi-row transactions (thousands of rows per TX) it would make sense to test BEFORE triggers as well, as they should be more resource-friendly (given that rollbacks / exceptions are rare). A minimal sketch of these three steps is shown below.
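
A minimal sketch for a hypothetical table account(id int, balance numeric):

-- 1. Shadow table with audit columns plus the data columns we care about
CREATE TABLE account_log (
  mtime    timestamptz NOT NULL DEFAULT now(),
  username text NOT NULL DEFAULT session_user,
  action   char(1) NOT NULL CHECK (action IN ('I', 'U', 'D')),
  id       int,
  balance  numeric
);

-- 2. Trigger function protocolling the new (or old, for deletes) row version
CREATE OR REPLACE FUNCTION account_log_fx() RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
  IF TG_OP = 'DELETE' THEN
    INSERT INTO account_log (action, id, balance) VALUES ('D', OLD.id, OLD.balance);
  ELSE
    INSERT INTO account_log (action, id, balance) VALUES (left(TG_OP, 1), NEW.id, NEW.balance);
  END IF;
  RETURN NULL;
END;
$$;

-- 3. Attach it to the audited table
CREATE TRIGGER account_audit AFTER INSERT OR UPDATE OR DELETE
  ON account FOR EACH ROW EXECUTE FUNCTION account_log_fx();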

Pros

  • Explicit audit targeting, track only exactly what is important
  • Isolation of concerns – changes are in a nice table, ready for fast inspection
  • Fast search – normal tables can easily be indexed per need

Cons

  • Need to write stored procedures and manage trigger definitions. FYI – triggers can also be written in other supported PL languages like Python, if you happen to dislike the de-facto trigger language of PL/pgSQL.
  • Some schema duplication.
  • Database growth. Previously, changes were written into server logs that are usually rotated and recycled in any case, so it was not a big concern. The audit tables, however, may need explicit “care” (archiving or pruning) over time.

One generic audit trigger and table for all audited tables

On a high level, this method is very similar to the previous one; the only change being that instead of having custom audit tables / trigger code for all “business” tables, we create a generic trigger function that can be used on all target tables, and that also logs all changes into a single table! By doing that, we’ve minimized the amount of table / code duplication – which could be of real value for big and complex systems – remember, DRY!

And how, you may wonder, would be the best way to implement it? Well, the best way to achieve such generic behaviour is to utilize the superb JSON functions of PostgreSQL, preferably the JSONB data type (available since v9.4), due to its space savings and faster search capabilities. BTW, if you happen to be running some earlier version, you should really think of upgrading, as versions 9.3 and earlier are not officially supported any more… and soon (February 13, 2020) PostgreSQL 9.4 will stop receiving fixes.

Since this approach is relatively unknown to the wider public, a piece of sample code probably wouldn’t hurt; check below for a sample. Note, however, that fiddling with JSONB along with the fact that this is basically a NoSQL type of storage, is not exactly as effective as normal tables / columns. You’ll have to pay a small performance and storage premium for this “generic” convenience.

CREATE TABLE generic_log (
  mtime timestamptz not null default now(),
  action char not null check (action in ('I', 'U', 'D')),
  username text not null,
  table_name text not null,
  row_data jsonb not null
);

CREATE INDEX ON generic_log USING brin (mtime);
CREATE INDEX ON generic_log ((row_data->>'my_pk')) WHERE row_data->>'my_pk' IS NOT NULL;  -- note the extraction to text, as a JSONB value can't be indexed with B-tree
CREATE EXTENSION IF NOT EXISTS btree_gin;
CREATE INDEX ON generic_log USING gin (table_name);  -- GIN is better for lots of repeating values


CREATE OR REPLACE FUNCTION public.generic_log_trigger()
 RETURNS trigger LANGUAGE plpgsql
AS $function$
BEGIN
  IF TG_OP = 'DELETE' THEN
    INSERT INTO generic_log VALUES (now(), 'D', session_user, TG_TABLE_NAME, to_json(OLD));
  ELSE
    INSERT INTO generic_log VALUES (now(), TG_OP::char , session_user, TG_TABLE_NAME, to_json(NEW));
  END IF;
  RETURN NULL;
END;
$function$;

CREATE TRIGGER log_generic AFTER INSERT OR UPDATE OR DELETE ON some_table FOR EACH ROW EXECUTE FUNCTION generic_log_trigger();

Pros

  • Less code to manage.
  • Automatic attachment of audit trail triggers can be easily configured for new tables, e.g. via event triggers.

Cons

  • A bit more resources burned compared to custom per-table triggers
  • Some more exotic indexing (GiN) may be needed
  • SQL for searching may become a bit more complex

PS – note again, that with both of these trigger-based methods, superusers (or table owners) can temporarily disable the triggers and thus bypass our audit trail.
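
Regarding the automatic attachment mentioned in the pros above, a rough event-trigger sketch, reusing the generic_log_trigger() function from the previous listing (and skipping the audit table itself to avoid recursion):

CREATE OR REPLACE FUNCTION attach_generic_audit() RETURNS event_trigger LANGUAGE plpgsql AS $$
DECLARE
  obj record;
BEGIN
  FOR obj IN SELECT * FROM pg_event_trigger_ddl_commands()
             WHERE command_tag = 'CREATE TABLE'
               AND object_identity <> 'public.generic_log'
  LOOP
    EXECUTE format('CREATE TRIGGER log_generic AFTER INSERT OR UPDATE OR DELETE ON %s
                    FOR EACH ROW EXECUTE FUNCTION generic_log_trigger()', obj.object_identity);
  END LOOP;
END;
$$;

CREATE EVENT TRIGGER auto_attach_audit ON ddl_command_end
  WHEN TAG IN ('CREATE TABLE')
  EXECUTE FUNCTION attach_generic_audit();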

Logical replication

Logical replication, also known as “pub-sub” replication, is a relatively new thing in PostgreSQL (introduced in v10), and originally, not really an auditing feature but rather a near-zero-downtime upgrade mechanism (read this blog post with more details on that).

It can also be “abused” for auditing or CDC (Change Data Capture) purposes…and actually quite well! The biggest benefit – storage of any extra auditing data can be “outsourced” to an external system, and so-called “write amplification” can be avoided – meaning generally better performance, since extra writing to the disk happens somewhere else.

You need to choose between two implementation options, though – the PostgreSQL-native way or the custom application way.

Logical replication – PostgreSQL native

PostgreSQL native logical replication means that you build up a second PostgreSQL server structured similarly to the original one, re-adjust the schema a bit – dropping PK / UNIQUE constraints, creating some triggers that tweak or throw away uninteresting data or store it in “shadow” tables (just like with the normal trigger-based approaches) – and then configure data streaming with the CREATE PUBLICATION / CREATE SUBSCRIPTION commands.

As usual, some constraints still apply – you might need to alter the schema slightly to get going. Large objects (up to 4 TB blobs) are not supported, and with default settings you’d only be getting the primary key and changed column data, i.e. not the latest full “row image”. It’s also generally more hassle to set up and run – an extra node and monitoring is needed, since the publisher and subscriber will be sort of “physically coupled”, and there are operational risks for the publisher (source server): if the subscriber goes on a “vacation” for too long, the publisher might run out of disk space, as all data changes will be retained as transaction logs (WAL) until they’re fetched (or the replication slot is deleted). The latter also applies to the “custom application” approach. So you should definitely spend a minute in careful consideration before jumping into an implementation.

On a positive note from the security side – “PostgreSQL native” can actually be configured in such a way that it’s not even possible for superusers on the source system to disable / bypass the auditing process and change something secretly! (i.e. temporarily leaving out some tables from the replication stream so that the subscriber doesn’t notice!) However, this only works with the standard (for upgrades at least) FOR ALL TABLES setup.
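
To give a feel for the moving parts, a minimal sketch with hypothetical names (the publisher must run with wal_level = logical, and the subscriber needs matching table definitions in place):

-- On the source (publisher)
CREATE PUBLICATION audit_pub FOR ALL TABLES;

-- On the audit host (subscriber)
CREATE SUBSCRIPTION audit_sub
  CONNECTION 'host=source-db dbname=appdb user=replicator'
  PUBLICATION audit_pub;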

Logical replication – with custom applications

The “application way” means using some programming language (C, Python, Java, etc.) for which the PostgreSQL driver supports logical decoding. You’ll always be streaming the changes as they happen, and you can then inspect or stash away the data in your favourite format, or propagate it into some other database system altogether. See here for a sample PostgreSQL implementation that can also easily be tested on the command line. To simplify it a bit – you can live-stream JSON change-sets out of PostgreSQL and do whatever you like with the data.

Pros

  • Minimal extra resource usage penalty on the source system.
  • Can be well-configured on the table level – one could leave out some tables, or only stream certain changes like INSERTS for some tables.
  • Can be the safest option with regard to data integrity.
  • Subscribers can also be purely virtual, i.e. applications doing “something” with the changed data

Cons

  • Somewhat complex to set up and run.
  • Postgres-native way requires careful syncing of future schema changes.
  • Means coupling of 2 servers via replication slots, so monitoring is needed.

Summary of top pros and cons

log_statement = 'mod'
  • Pros: Simplest way for basic requirements – just flip the built-in switch, even during runtime.
  • Cons: Text-based – volume and search can be problematic. Captures only top-level statements issued by users. Does not capture bulk update details. No table-level configuration.

pgaudit extension
  • Pros: Options to configure processing according to operation type and object / role. Also logs dynamically generated statements.
  • Cons: Text-based – volume and search can be problematic. Does not capture bulk update details. A 3rd-party extension.

Explicit audit tables and triggers for all (relevant) tables
  • Pros: Less resources burnt than with the text-based approaches. Fast search. Easily customizable per table.
  • Cons: Write amplification. Lots of code to manage and some structure duplication introduced.

A single audit table and trigger for all (relevant) tables
  • Pros: Less resources burnt than with the text-based approaches. Still a fast search. Customizable per table.
  • Cons: Write amplification. Audit search queries might need some JSON skills.

Logical replication
  • Pros: Least amount of resources burnt on the source system. Highly customizable on object level. Can be well secured to guard data integrity. Fast search.
  • Cons: Complex setup. Needs extra hardware / custom application. Typically requires some schema changes and extra care when the schema evolves.

Hope you got some ideas for your next auditing project with PostgreSQL!

Google Summer of Code 2020


Feed: MariaDB Knowledge Base Article Feed.

This year, again, we are participating in the Google Summer of Code. The MariaDB Foundation believes we are making a better database that remains application compatible with MySQL. We also work on making LGPL connectors (currently C, ODBC, Java) and on MariaDB Galera Cluster, which allows you to scale your reads & writes. And we have MariaDB ColumnStore, which is a columnar storage engine, designed to process petabytes of data with real-time response to analytical queries.

To improve your chances of being accepted, it is a good idea to submit a pull request with a bug fix to the server.

Evaluate subquery predicates earlier or later depending on their SELECTIVITY

(Based on conversation with Igor)
There are a lot of subquery conditions out there that are inexpensive to
evaluate and have good selectivity.

If we just implement MDEV-83, we may get regressions. We need to take
subquery condition’s selectivity into account.

It is difficult to get a meaningful estimate for an arbitrary, correlated
subquery predicate.

One possible solution is to measure selectivity during execution and reattach
predicates on the fly.

We don’t want to change the query plan all the time; one way to dynamically move items between item trees is to wrap them inside Item_func_trig_cond so we can switch them on and off.


Add support for Indexes on Expressions

An index on expression means something like

CREATE TABLE t1 (a int, b int, INDEX (a/2+b));
...
SELECT * FROM t1 WHERE a/2+b=100;

in this case the optimizer should be able to use an index.

This task naturally splits in two steps:

  1. add expression matching into the optimizer, use it for generated columns. Like in CREATE TABLE t1 (a int, b int, v INT GENERATED ALWAYS AS (a/2+b), INDEX (v));
  2. support the syntax to create an index on expression directly, this will automatically create a hidden generated column under the hood

The original task description is visible in the history.


Histograms with equal-width bins in MariaDB

Histograms with equal-width bins are easy to construct using samples. For this it’s enough
to look through the given sample set and for each value from it to figure out what bin this value can be placed in. Each bin requires only one counter.
Let f be a column of a table with N rows, and let n be the number of sample rows from which the equal-width histogram of k bins for this column is constructed. Suppose that, after looking through all sample rows, the counters created for the histogram bins contain the numbers c[1], ..., c[k]. Then m[i] = c[i]/n * 100 is the percentage of rows whose values of f are expected to fall in the interval

 [ min(f) + (max(f)-min(f))/k * (i-1),  min(f) + (max(f)-min(f))/k * i ).

It means that if the sample rows have been chosen randomly, the expected number of rows with values of f from this interval can be approximated by m[i]/100 * N.

To collect such statistics it is suggested to use the following variant of the ANALYZE TABLE command:

ANALYZE FAST TABLE tbl [ WITH n ROWS ] [SAMPLING p PERCENTS ]
   PERSISTENT FOR COLUMNS (col1 [IN RANGE r] [WITH k INTERVALS],...)

Here:

  • ‘WITH n ROWS’ provides an estimate for the number of rows in the table in the case when this estimate cannot be obtained from statistical data.
  • ‘SAMPLING p PERCENTS’ provides the percentage of sample rows to collect statistics. If this is omitted the number is taken from the system variable samples_ratio.
  • ‘IN RANGE r’ sets the range of equal-width bins of the histogram built for the column col1. If this is omitted and the min and max values for the column can be read from the statistical data, then the histogram is built for the range [min(col1), max(col1)]. Otherwise the range [MIN_type(col1), MAX_type(col1)] is considered. The values beyond the given range, if any, are also taken into account in two additional bins.
  • WITH k INTERVALS says how many bins are included in the histogram. If it is omitted this value is taken from the system variable histogram_size.

Add FULL OUTER JOIN to MariaDB

Add support for FULL OUTER JOIN

https://www.w3schools.com/sql/sql_join_full.asp

One way to implement this is to rewrite the query

select t1.*, t2.* from t1 full outer join t2 on P(t1,t2) 

into the following union all:

select t1.*, t2.* from t1 left outer join t2 on P(t1,t2) 
union all 
select t1.*,t2.* from t2 left outer join t1 on P(t1,t2) where t1.a is null

Here t1.a is some non-nullable column of t1 (e.g. the column of single column primary key).


Recursive CTE support for UPDATE (and DELETE) statements

     CREATE TABLE tree (
       `Node` VARCHAR(3),
       `ParentNode` VARCHAR(3),
       `EmployeeID` INTEGER,
       `Depth` INTEGER,
       `Lineage` VARCHAR(16)
     );

     INSERT INTO tree
       (`Node`, `ParentNode`, `EmployeeID`, `Depth`, `Lineage`)
     VALUES
       ('100', NULL, '1001', 0, '/'),
       ('101', '100', '1002', NULL, NULL),
       ('102', '101', '1003', NULL, NULL),
       ('103', '102', '1004', NULL, NULL),
       ('104', '102', '1005', NULL, NULL),
       ('105', '102', '1006', NULL, NULL);
 
     WITH RECURSIVE prev AS (
     SELECT * FROM tree WHERE ParentNode IS NULL
     UNION
     SELECT t.Node,t.ParentNode,t.EmployeeID,p.Depth + 1 as Depth, CONCAT(p.Lineage, t.ParentNode, '/')
     FROM tree t JOIN prev p ON t.ParentNode = p.Node
     )
     SELECT * FROM prev;

     WITH RECURSIVE prev AS (
     SELECT * FROM tree WHERE ParentNode IS NULL
     UNION
     SELECT t.Node,t.ParentNode,t.EmployeeID,p.Depth + 1 as Depth, CONCAT(p.Lineage, t.ParentNode, '/')
     FROM prev p JOIN tree t ON t.ParentNode = p.Node
     )
     UPDATE tree t, prev p SET t.Depth=p.Depth, t.Lineage=p.Lineage WHERE t.Node=p.Node; 

You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'UPDATE tree t, prev p SET t.Depth=p.Depth, t.Lineage=p.Lineage WHERE t.Node=p.No' at line 7

This is supported in MySQL 8.0 and MSSQL.


Implement EXCEPT ALL and INTERSECT ALL operations

The SQL standard allows EXCEPT ALL and INTERSECT ALL as set operations.
Currently, MariaDB Server does not support them.

The goal of this task is to support EXCEPT ALL and INTERSECT ALL:
1. at the syntax level – allow the operators EXCEPT ALL and INTERSECT ALL in a query expression body;
2. at the execution level – implement these operations employing temporary tables
(the implementation could use an idea similar to that used for the existing implementation of the INTERSECT operation).
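
For illustration of the intended multiset semantics (PostgreSQL already accepts this syntax; MariaDB currently rejects it), the multiset {1, 1, 2} EXCEPT ALL {1} should yield {1, 2}: duplicates are subtracted one-for-one rather than collapsed.

SELECT 1 AS x UNION ALL SELECT 1 UNION ALL SELECT 2
EXCEPT ALL
SELECT 1;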


Implement UPDATE with result set

Add an UPDATE operation that returns a result set of the changed rows to the client.

UPDATE [LOW_PRIORITY] [IGNORE] tbl_name
    SET col_name1={expr1|DEFAULT} [, col_name2={expr2|DEFAULT}] ...
    [WHERE where_condition]
    [ORDER BY ...]
    [LIMIT row_count]
RETURNING select_expr [, select_expr ...]

I’m not exactly sure how the corresponding multiple-table syntax should look, or if it is possible at all. But having it for single-table updates would already be a nice feature.
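
For illustration, a single-table statement under the proposed syntax (mirroring what PostgreSQL already accepts, with hypothetical table and column names) could look like:

UPDATE accounts
   SET balance = balance - 100
 WHERE id = 42
RETURNING id, balance;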


Automatic provisioning of slave

Idea

The purpose of this task is to create an easy-to-use facility for setting up a
new MariaDB replication slave.

Setting up a new slave currently involves: 1) installing MariaDB with the initial database; 2) pointing the slave to the master with CHANGE MASTER TO; 3) copying the initial data from the master to the slave; and 4) starting the slave with START SLAVE. The idea is to automate step (3), which currently needs to be done manually.

The syntax could be something as simple as

LOAD DATA FROM MASTER

This would then connect to the master that is currently configured. It will
load a snapshot of all the data on the master, and leave the slave position at
the point of the snapshot, ready for START SLAVE to continue replication from
that point.

Implementation:

The idea is to do this non-blocking on the master, in a way that works for any
storage engine. It will rely on row-based replication to be used between the
master and the slave.

At the start of LOAD DATA FROM MASTER, the slave will enter a special
provisioning mode. It will start replicating events from the master at the
master’s current position.

The master dump thread will send binlog events to the slave as normal. But in
addition, it will interleave a dump of all the data on the master contained in
tables, views, or stored functions. Whenever the dump thread would normally go
to sleep waiting for more data to arrive in the binlog, the dump thread will
instead send another chunk of data in the binlog stream for the slave to apply.

A “chunk of data” can be:

  • A CREATE OR REPLACE TABLE / VIEW / PROCEDURE / FUNCTION
  • A range of N rows (N=100, for example). Each successive chunk will do a range scan on the primary key from the end position of the last chunk.

Sending data in small chunks avoids the need for long-lived table locks or
transactions that could adversely affect master performance.

The slave will connect in GTID mode. The master will send dumped chunks in a
separate domain id, allowing the slave to process chunks in parallel with
normal data.

During the provisioning, all normal replication events from the master will
arrive on the slave, and the slave will attempt to apply them locally. Some of
these events will fail to apply, since the affected table or row may not yet
have been loaded. In the provisioning mode, all such errors will be silently
ignored. Proper locking (isolation mode, eg.) must be used on the master when
fetching chunks, to ensure that updates for any row will always be applied
correctly on the slave, either in a chunk, or in a later row event.

In order to make the first version of this feature feasible to implement in a
reasonable amount of time, it should set a number of reasonable restrictions
(which could be relaxed in a later version of the feature):

  • Give up with an error if the slave is not configured for GTID mode (MASTER_USE_GTID != NO).
  • Give up with error if the slave receives any event in statement-based binlogging (so the master must be running in row-based replication mode, and no DDL must be done while the provisioning is running).
  • Give up with an error if the master has a table without primary key.
  • Secondary indexes will be enabled during the provisioning; this means that tables with large secondary indexes could be expensive to provision.

connection encryption plugin support

As a follow-on to MDEV-4691 we would like GSSAPI encryption (in addition to authentication) support in MariaDB. I am told that the current plan is to create a plugin interface and then we can build GSSAPI encryption on top of that, so here is a ticket for that.

From having written GSSAPI for the internal interface, there were a couple things I would like to see in the plugin encryption interface.

First, GSSAPI is weird in that it does authentication before encryption (TLS/SSL are the other way around, establishing an encrypted channel and then doing authentication over it). Of course support for this is needed, but more importantly, packets must be processed in a fully serialized fashion. This is because encrypted packets may be queued while one end of the connection is still finishing up processing the authentication handshake. One way to do this is registering “handle” callbacks with connection-specific state, but there are definitely others.

Additionally, for whatever conception there ends up being of authentication and encryption, it needs to be possible to share more data than just a socket between them. The same context will be used for authentication and encryption, much as an SSL context is (except of course we go from authentication to encryption and not the other way around).

This ties into an issue of dependency. If authentication plugins are separate entities from encryption plugins in the final architecture, it might make sense to do mix-and-match authentication with encryption. However, there are cases – and GSSAPI is one – where doing encryption requires a certain kind of authentication (or vice versa). You can’t do GSSAPI encryption without first doing GSSAPI authentication. (Whether or not it’s permitted to do GSSAPI auth->encryption all over a TLS channel, for instance, is not something I’m concerned about.)

Finally, encrypted messages are larger than their non-encrypted counterparts. The transport layer should cope with this so that plugins don’t have to think about reassembly, keeping in mind that there may not be a way to get the size of a message when encrypted without first encrypting it.

It’s unfortunately been a little while since I wrote that code, but I think those were the main things that we’ll need for GSSAPI. Thanks!


Add RETURNING to INSERT

Please add a RETURNING option to INSERT.

Example from PostgreSQL

postgres=# CREATE TABLE t1 (id SERIAL, name VARCHAR(100));
CREATE TABLE
postgres=# INSERT INTO t1(name) VALUES('test') RETURNING id;
 id 
----
  1
(1 row)

INSERT 0 1

Inspired by: https://evertpot.com/writing-sql-for-postgres-mysql-sqlite/

This could make it easier to write statements which work with both MariaDB and PostgreSQL. And this might improve compatibility with Oracle RDBMS.


Aggregate Window Functions

With a few exceptions, most native aggregate functions are supported as window functions.
https://mariadb.com/kb/en/library/aggregate-functions-as-window-functions/

In MDEV-7773, support for creating of custom aggregate functions was added.
This task proposes to extend that feature and allow custom aggregate functions to be used as window functions.

An example of creating a custom aggregate function is given below:

create aggregate function agg_sum(x INT) returns double
begin
  declare z double default 0;
  declare continue handler for not found return z;
  loop
    fetch group next row;
    set z = z + x;
  end loop;
end|

This function can be used in the following query:

create table balances (id int, amount int);
insert into balances values (1, 10), (2, 20), (3, 30);
 
select agg_sum(amount) from balances;

After this task is complete the following must also work:

select agg_sum(amount) over (order by id);

True ALTER LOCK=NONE on slave

Currently no true LOCK=NONE exists on the slave.
ALTER TABLE is first committed on the master, then it is replicated on the slaves.
The purpose of this task is to create a true LOCK=NONE.

Implementation Idea

The master will write a BEGIN_DDL_EVENT to the binlog after it hits ha_prepare_inplace_alter_table.
Then the master will write a QUERY_EVENT to the binlog with the actual ALTER query.
On commit/rollback the master will write a COMMIT_DDL_EVENT/ROLLBACK_DDL_EVENT.

On the slave there will be a pool of threads (configurable via a global variable) which will apply these DDLs. On receiving a BEGIN_DDL_EVENT, the slave thread will pass the QUERY_EVENT to one of the worker threads. The worker thread will execute up to ha_inplace_alter_table; the actual commit_inplace_alter will be called by the SQL thread. If the SQL thread receives some kind of rollback event, it will signal the worker thread to stop executing the alter. If none of the worker threads are available, the event will be enqueued: if a rollback event then arrives, we simply discard the event from the queue, and if a commit event arrives, the SQL thread will process the DDL event synchronously.


Improve mysqltest language

mysqltest has a lot of historical problems:

  • ad hoc parser, weird limitations
  • commands added as needed with no view over the total language structure
  • historical code issues (e.g. casts that became unnecessary 10 years ago), etc.

A lot can be done to improve it.
Ideas

  • control structures, else in if, break and continue in while, for (or foreach) loop
  • proper expression support in let, if, etc
  • rich enough expressions to make resorting to sql unnecessary in most cases
  • remove unused and redundant commands (e.g. system vs exec, query_vertical vs vertical_results ONCE)
  • remove complex commands that do many sql statements under the hood, if they can be scripted, e.g. sync_slave_with_master
  • remove over-verbose treatment of rpl test failures
  • scoped variables
  • parameters for the source command
  • remove dead code

Implement multiple-table UPDATE/DELETE returning a result set

A multiple-table UPDATE first performs join operations, then it updates the matching rows.
A multiple-table UPDATE returning a result set does the following:

  • first performs join operations
  • for each row of the result of the join it calculates some expressions over the columns of the join and forms from them a row of the returned result set
  • after this it updates the matching rows

A multiple-table DELETE first performs join operations, then it deletes the matching rows.
A multiple-table DELETE returning a result set does the following:

  • first performs join operations
  • for each row of the result of the join it calculates some expressions over the columns of the join and forms from them a row of the returned result set
  • after this it deletes the matching rows

sort out the compression library chaos

As MariaDB is getting more storage engines and as they’re getting more features, MariaDB can optionally use more and more compression libraries for various purposes.

InnoDB, TokuDB, RocksDB — they all can use different sets of compression libraries. Compiling them all in would result in a lot of run-time/rpm/deb dependencies, most of which will be never used by most of the users. Not compiling them in, would result in requests to compile them in. While most users don’t use all these libraries, many users use some of these libraries.

A solution could be to load these libraries on request, without creating a packaging dependency. There are different ways to do it:

  • hide all compression libraries behind a single unified compression API. Either develop our own or use something like Squash. This would require changing all engines to use this API
  • use the same approach as with server services — create a service per compression library; a service implementation will just return an error code for any function invocation if the corresponding library is not installed. This way — maybe — we could avoid modifying all affected storage engines.

Control over memory allocated for SP/PS

SP/PS (Stored Procedures / Prepared Statements) allocate memory until the PS cache or the SP is destroyed. There is no way to see how much memory is allocated and whether it grows with each execution (the first 2 executions can lead to new memory allocation, but later ones should not).

Task minimum:

Status variables which count the memory used/allocated for SP/PS by thread and/or for the server.

Other ideas:

  • Automatically stop allocation in the debugging version after the second execution and raise an exception on any further allocation attempt.
  • An Information Schema table listing threads and SP/PS with information about allocated and used memory.

Information can be collected in the MEM_ROOTs of the SP/PS. By storing info about the state of the mem_root before execution and then checking it afterwards, newly allocated memory can be found.

MEM_ROOT can be changed to have a debug mode which makes it read-only, and which can be switched on after the second execution.

Automated Refactoring from Mainframe to Serverless Functions and Containers with Blu Age


Feed: AWS Partner Network (APN) Blog.
Author: Phil de Valence.

By Alexis Henry, Chief Technology Officer at Blu Age
By Phil de Valence, Principal Solutions Architect for Mainframe Modernization at AWS

Mainframe workloads are often tightly-coupled legacy monoliths with millions of lines of code, and customers want to modernize them for business agility.

Manually rewriting a legacy application for a cloud-native architecture requires re-engineering use cases, functions, data models, test cases, and integrations. For a typical mainframe workload with millions of lines of code, this involves large teams over long periods of time, which can be risky and cost-prohibitive.

Fortunately, Blu Age Velocity accelerates the mainframe transformation to agile serverless functions or containers. It relies on automated refactoring and preserves the investment in business functions while expediting the reliable transition to newer languages, data stores, test practices, and cloud services.

Blu Age is an AWS Partner Network (APN) Select Technology Partner that helps organizations enter the digital era by modernizing legacy systems while substantially reducing modernization costs, shortening project duration, and mitigating the risk of failure.

In this post, we’ll describe how to transform a typical mainframe CICS application to Amazon Web Services (AWS) containers and AWS Lambda functions. We’ll show you how to increase mainframe workload agility with refactoring to serverless and containers.

Customer Drivers

There are two main drivers for mainframe modernization with AWS: cost reduction and agility. Agility has many facets related to the application, underlying infrastructure, and modernization itself.

On the infrastructure agility side, customers want to move away from rigid mainframe environments in order to benefit from the AWS Cloud’s elastic compute, managed containers, managed databases, and serverless functions on a pay-as-you-go model.

They want to leave the complexity of these tightly-coupled systems in order to increase speed and adopt cloud-native architectures, DevOps best practices, automation, continuous integration and continuous deployment (CI/CD), and infrastructure as code.

On the application agility side, customers want to stay competitive by breaking down slow mainframe monoliths into leaner services and microservices, while at the same time unleashing the mainframe data.

Customers also need to facilitate polyglot architectures where development teams decide on the most suitable programming language and stack for each service.

Some customers employ large teams of COBOL developers with functional knowledge that should be preserved. Others suffer from the mainframe retirement skills gap and have to switch to more popular programming languages quickly.

Customers also require agility in the transitions. They want to choose when and how fast they execute the various transformations, and whether they’re done simultaneously or independently.

For example, a transition from COBOL to Java is not only a technical project but also requires transitioning code development personnel to the newer language and tools. It can involve retraining and new hiring.

A transition from mainframe to AWS should go at a speed which reduces complexity and minimizes risks. A transition to containers or serverless functions should be up to each service owner to decide. A transition to microservices needs business domain analysis, and consequently peeling a monolith is done gradually over time.

This post shows how Blu Age automated refactoring accelerates the customer journey to reach a company’s desired agility with cloud-native architectures and microservices. Blu Age does this by going through incremental transitions at a customer’s own pace.

Sample Mainframe COBOL Application

Let’s look at a sample application of a typical mainframe workload that we will then transform onto AWS.

This application is a COBOL application that’s accessed by users via 3270 screens defined by CICS BMS maps. It stores data in a DB2 z/OS relational database and in VSAM indexed files, using CICS Temporary Storage (TS) queues.


Figure 1 – Sample COBOL CICS application showing file dependencies.

We use Blu Age Analyzer to visualize the application components such as programs, copybooks, queues, and data elements.

Figure 1 above shows the Analyzer display. Each arrow represents a program call or dependency. You can see the COBOL programs using BMS maps for data entry and accessing data in DB2 database tables or VSAM files.

You can also identify the programs which are data-independent and those which access the same data file. This information helps define independent groupings that facilitate the migration into smaller services or even microservices.

This Analyzer view allows customers to identify the approach, groupings, work packages, and transitions for the automated refactoring.

In the next sections, we describe how to do the groupings and the transformation for three different target architectures: compute with Amazon Elastic Compute Cloud (Amazon EC2), containers with Amazon Elastic Kubernetes Service (Amazon EKS), and serverless functions with AWS Lambda.

Automated Refactoring to Elastic Compute

First, we transform the mainframe application to be deployed on Amazon EC2. This provides infrastructure agility with a large choice of instance types, horizontal scalability, auto scaling, some managed services, infrastructure automation, and cloud speed.

Amazon EC2 also provides some application agility with DevOps best practices, CI/CD pipeline, modern accessible data stores, and service-enabled programs.


Figure 2 – Overview of automated refactoring from mainframe to Amazon EC2.

Figure 2 above shows the automated refactoring of the mainframe application to Amazon EC2.

The DB2 tables and VSAM files are refactored to Amazon Aurora relational database. Amazon ElastiCache is used for in-memory temporary storage or for performance acceleration, and Amazon MQ takes care of the messaging communications.

Once refactored, the application becomes stateless and elastic across many duplicate Amazon EC2 instances that benefit from Auto Scaling Groups and Elastic Load Balancing (ELB). The application code stays monolithic in this first transformation.

With such monolithic transformation, all programs and dependencies are kept together. That means we create only one grouping.

Figure 3 below shows the yellow grouping that includes all application elements. Using Blu Age Analyzer, we define groupings by assigning a common tag to multiple application elements.


Figure 3 – Blu Age Analyzer with optional groupings for work packages and libraries.

With larger applications, it’s very likely we’d break down the larger effort by defining incremental work packages. Each work package is associated with one grouping and one tag.

Similarly, some shared programs or copybooks can be externalized and shared using a library. Each library is associated with one grouping and one tag. For example, in Figure 3 one library is created based on two programs, as shown by the grey grouping.

Ultimately, once the project is complete, all programs and work packages are deployed together within the same Amazon EC2 instances.

For each tag, we then export the corresponding application elements to Git.


Figure 4 – Blu Age Analyzer export to Git.

Figure 4 shows the COBOL programs, copybooks, DB2 Data Definition Language (DDL), and BMS map being exported to Git.

As you can see in Figure 5 below, the COBOL application elements are available in the Integrated Development Environment (IDE) for maintenance, or for new development and compilation.

Blu Age toolset allows maintaining the migrated code in either in COBOL or in Java.


Figure 5 – Integrated Development Environment with COBOL application.

The code is recompiled and automatically packaged for the chosen target Amazon EC2 deployment.

During this packaging, the compute code is made stateless with any shared or persistent data externalized to data stores. This follows many of The Twelve-Factor App best practices that enable higher availability, scalability, and elasticity on the AWS Cloud.

In parallel, based on the code refactoring, the data from VSAM and DB2 z/OS is converted to the PostgreSQL-compatible edition of Amazon Aurora with corresponding data access queries conversions. Blu Age Velocity also generates the scripts for data conversion and migration.

Once deployed, the code and data go through unit, integration, and regression testing in order to validate functional equivalence. This is part of an automated CI/CD pipeline which also includes quality and security gates. The application is now ready for production on elastic compute.

Automated Refactoring to Containers

In this section, we increase agility by transforming the mainframe application to be deployed as different services in separate containers managed by Amazon EKS.

The application agility increases because the monolith is broken down into different services that can evolve and scale independently. Some services execute online transactions for users’ direct interactions. Some services execute batch processing. All services run in separate containers in Amazon EKS.

With such an approach, we can create microservices with both independent data stores and independent business functionalities. Read more about How to Peel Mainframe Monoliths for AWS Microservices with Blu Age.


Figure 6 – Overview of automated refactoring from mainframe to Amazon EKS.

Figure 6 shows the automated refactoring of the mainframe application to Amazon EKS. You could also use Amazon Elastic Container Service (Amazon ECS) and AWS Fargate.

The mainframe application monolith is broken down so that different containers handle the various online transactions and different containers handle the various batch jobs. Each service's DB2 tables and VSAM files are refactored to its own independent Amazon Aurora relational database.

AWS App Mesh facilitates internal application-level communication, while Amazon API Gateway and Amazon MQ focus more on the external integration.

With the Blu Age toolset, some services can still be maintained and developed in COBOL while others are maintained in Java, allowing a polyglot architecture.

For application code maintained in COBOL on AWS, Blu Age Serverless COBOL provides COBOL APIs with native integration for AWS services such as Amazon Aurora, Amazon Relational Database Service (Amazon RDS), Amazon DynamoDB, Amazon ElastiCache, and Amazon Kinesis, among others.

With such refactoring, programs and dependencies are grouped into separate services. This is called service decomposition and means we create multiple groupings in Blu Age Analyzer.


Figure 7 – Blu Age Analyzer with two services groupings and one library grouping.

Figure 7 shows one service grouping in green, another service grouping in rose, and a library grouping in blue. Groupings are formalized with one tag each.

For each tag, we export the corresponding application elements to Git and open them in the IDE for compilation. We can create one Git project per tag, providing independence and agility to individual service owners.


Figure 8 – COBOL program in IDE ready for compilation.

The Blu Age compiler for containers compiles the code and packages it into a Docker container image with all the necessary language runtime configuration for deployment and services communication.

The REST APIs for communication are automatically generated. The container images are automatically produced, versioned and stored into Amazon Elastic Container Registry (Amazon ECR), and the two container images are deployed onto Amazon EKS.


Figure 9 – AWS console showing the two container images created in Amazon ECR.

Figure 9 above shows the two new Docker container images referenced in Amazon ECR.

After going through data conversion and extensive testing similar to the previous section, the application is now ready for production on containers managed by Amazon EKS.

Automated Refactoring to Serverless Functions

Now, we can increase agility and cost efficiency further by targeting serverless functions in AWS Lambda.

Not only is the monolith broken down into separate services, but the services become smaller functions with no need to manage servers or containers. With Lambda, there’s no charge when the code is not running.

Not all programs are good use cases for Lambda. Its technical characteristics make Lambda better suited for short-lived, lightweight, stateless functions. For this reason, some services are deployed in Lambda while others are still deployed in containers or on elastic compute.

For example, long-running batch jobs cannot run in Lambda, but they can run in containers. Online transactions or short batch-specific functions, on the other hand, can run in Lambda.

With this approach, we can create granular microservices with independent data stores and business functions.


Figure 10 – Overview of automated refactoring from mainframe to AWS Lambda.

Figure 10 shows the automated refactoring of the mainframe application to Lambda and Amazon EKS. Short-lived stateless transactions and programs are deployed in Lambda, while long-running or unsuitable programs run in Docker containers within Amazon EKS.

Amazon Simple Queue Service (SQS) is used for service calls within or across Lambda and Amazon EKS. Such architecture is similar to a cloud-native application architecture that’s much better positioned in the Cloud-Native Maturity Model.

With this refactoring, programs and dependencies are grouped into more separate services in Blu Age Analyzer.


Figure 11 – Blu Age Analyzer with two AWS Lambda groupings, one container grouping, and one library grouping.

In Figure 11 above, the green grouping and yellow grouping are tagged for Lambda deployment. The rose grouping stays tagged for container deployment, while the blue grouping stays a library. Same as before, the code is exported tag after tag into Git, then opened within the IDE for compilation.

Compilation and deployment for Lambda do not create a container image; instead they produce compiled code ready to be deployed on the Blu Age Serverless COBOL layer for Lambda.

Here’s the Serverless COBOL layer added to the deployed functions.


Figure 12 – Blu Age Serverless COBOL layer added to AWS Lambda function.

Now, here are the two new Lambda functions created once the compiled code is deployed.


Figure 13 – AWS console showing the two AWS Lambda functions created.

After data conversion and thorough testing similar to the previous sections, the application is now ready for production on serverless functions and containers.

With business logic in Lambda functions, this logic can be invoked from many sources (REST APIs, messaging, object store, streams, databases) for innovations.

Incremental Transitions

Automated refactoring allows customers to accelerate modernization and minimize project risks on many dimensions.

On one side, the extensive automation of the full software stack conversion, including code, data formats, and dependencies, provides functional equivalence while preserving core business logic.

On the other side, the solution provides incremental transitions and accelerators tailored to the customer constraints and objectives:

  • Incremental transition from mainframe to AWS: As shown with Blu Age Analyzer, a large application migration is broken down into small work packages with coherent programs and data elements. The migration does not have to be a big bang, and it can be executed incrementally over time.
  • Incremental transition from COBOL to Java: The Blu Age compilers and toolset support maintaining the application code in either the original COBOL or Java. All the deployment options described previously can be maintained in COBOL or in Java and can co-exist. That means you can choose to keep developing in COBOL if appropriate and start developing in Java when convenient, facilitating knowledge transfer between developers.
  • Incremental transition from elastic compute, to containers, to functions: Some customers prefer starting with elastic compute, while others prefer jumping straight to containers or serverless functions. The Blu Age toolset has the flexibility to switch from one target to the other following the customer's specific needs.
  • Incremental transition from monolith to services and microservices: Peeling a large monolith is a long process, and the monolith can be kept and deployed on the various compute targets. When the time comes, services or microservices are identified in Blu Age Analyzer, then extracted and deployed on elastic compute, containers, or serverless functions.

From a timeline perspective, the incremental transition from mainframe to AWS is a short-term project with an achievable return on investment, as shown in Figure 14.


Figure 14 – Mainframe to AWS transition timeline.

We recommend starting with a hands-on Proof-of-Concept (PoC) with customers’ real code. It’s the only way to prove the technical viability and show the outcome quality within 6 weeks.

Then, you can define work packages and incrementally refactor the mainframe application to AWS targeting elastic compute, containers, or serverless functions.

The full refactoring of a mainframe workload onto AWS can be completed in a year. As soon as services are refactored and in production on AWS, new integrations and innovations become possible for analytics, mobile, voice, machine learning (ML), or Internet of Things (IoT) use cases.

Summary

Blu Age mainframe automated refactoring provides the speed and flexibility to meet the agility needs of customers. It leverages the AWS quality of service for high security, high availability, elasticity, and rich system management to meet or exceed the requirements of mainframe workloads.

While accelerating modernization, the Blu Age toolset allows incremental transitions that adapt to customer priorities, targeting elastic compute, containers, or serverless functions.

Blu Age also gives the option to keep developing in COBOL or transition smoothly to Java. It facilitates the identification and extraction of microservices.

For more details, visit the Serverless COBOL page and contact Blu Age to learn more.



Blu Age – APN Partner Spotlight

Blu Age is an APN Select Technology Partner that helps organizations enter the digital era by modernizing legacy systems while substantially reducing modernization costs, shortening project duration, and mitigating the risk of failure.

Contact Blu Age | Solution Overview | AWS Marketplace

*Already worked with Blu Age? Rate this Partner

*To review an APN Partner, you must be an AWS customer that has worked with them directly on a project.

Importing Data from SQL Server to MariaDB

$
0
0

Feed: MariaDB Knowledge Base Article Feed.
Author: .

There are several ways to move data from SQL Server to MariaDB. There are also some caveats.

Moving Data Definition from SQL Server to MariaDB

To copy SQL Server data structures to MariaDB, one has to:

  1. Generate a file containing the SQL commands to recreate the database.
  2. Modify the syntax so that it works in MariaDB.
  3. Run the file in MariaDB.

Variables That Affect DDL Statements

DDL statements are affected by some server system variables.

sql_mode determines the behavior of some SQL statements and expressions, including how strict error checking is and some details of the syntax. Objects like stored procedures, stored functions, triggers and views are always executed with the sql_mode that was in effect during their creation. sql_mode='MSSQL' can be used to have MariaDB behave as closely to SQL Server as possible.

innodb_strict_mode enables the so-called InnoDB strict mode. Normally some errors in the CREATE TABLE options are ignored. When InnoDB strict mode is enabled, the creation of InnoDB tables will fail with an error when certain mistakes are made.

updatable_views_with_limit determines whether view updates can be made with an UPDATE or DELETE statement with a LIMIT clause if the view does not contain all primary or not null unique key columns from the underlying table.
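As an illustration (a minimal sketch; the appropriate values depend entirely on the application being migrated), these variables can be inspected and adjusted at the session level before replaying a converted DDL script:

-- inspect the current settings
SELECT @@sql_mode, @@innodb_strict_mode, @@updatable_views_with_limit;

-- emulate SQL Server behavior and fail early on invalid table options
SET SESSION sql_mode = 'MSSQL';
SET SESSION innodb_strict_mode = ON;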

Dumps and sys.sql_modules

SQL Server Management Studio allows you to create an SQL script to recreate a database – something that MariaDB users refer to as a dump. Similarly, the sp_helptext() procedure returns the SQL statement to recreate a certain object, and those definitions are also present in the sql_modules table (definition column) in the sys schema. Therefore, it is easy to create a text file containing the SQL statements to recreate a database or part of it.

Remember, however, that MariaDB does not support schemas. An SQL Server schema is approximately equivalent to a MariaDB database.
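For example (a minimal sketch with hypothetical names), a table living in an SQL Server schema called sales would typically be recreated inside a MariaDB database of the same name and referenced with database.table qualifiers:

CREATE DATABASE IF NOT EXISTS sales;

CREATE TABLE sales.customer (
    id INT PRIMARY KEY,
    name VARCHAR(100)
);

SELECT name FROM sales.customer;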

To execute a dump, we can pass the file to mysql, the MariaDB command-line client.

Provided that a dump file contains syntax that is valid with MariaDB, it can be executed in this way:

mysql --show-warnings < dump.sql

--show-warnings tells MariaDB to output any warnings produced by the statements contained in the dump. Without this option, warnings will not appear on screen. Warnings don't stop the dump execution.

Errors will appear on screen. Errors will stop the dump execution, unless the --force option (or just -f) is specified.

For other mysql options, see mysql Command-line Client Options.

Another way to achieve the same purpose is to start the mysql client in interactive mode first, and then run the source command. For example:

root@d5a54a082d1b:/# mysql -uroot -psecret
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 22
Server version: 10.4.7-MariaDB-1:10.4.7+maria~bionic mariadb.org binary distribution

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> \W
Show warnings enabled.
MariaDB [(none)]> source dump.sql

In this case, to show warnings we used the \W command, where "W" is uppercase. To hide warnings (which is the default), we can use \w (lowercase).

For other mysql commands, see mysql Commands.

Syntax Notes

Moving Data from SQL Server to MariaDB

Importing a CSV file

CONNECT

Syntax Differences between MariaDB and SQL Server

$
0
0

Feed: MariaDB Knowledge Base Article Feed.
Author: .

This article is meant to show a non-exhaustive list of syntax differences between MariaDB and SQL Server, and it’s written for SQL Server users that are unfamiliar with MariaDB.

Compatibility Features

Some features are meant to improve syntax and semantics compatibility between MariaDB versions, between MariaDB and MySQL, and between MariaDB and other DBMSs. This section focuses on the compatibility between MariaDB and SQL Server.

sql_mode and old_mode

The SQL semantics and syntax in MariaDB are affected by the sql_mode variable. Its value is a comma-separated list of flags, and each flag, if specified, affects a different aspect of the SQL syntax and semantics.

A particularly important flag for users familiar with SQL Server is MSSQL.

sql_mode can be changed locally, in which case it only affects the current session; or globally, in which case it will affect all new connections (but not the connections already established). sql_mode must be assigned a comma-separated list of flags.

A usage example:

# check the current global and local sql_mode values
SELECT @@global.sql_mode;
SELECT @@session.sql_mode;
# empty sql_mode for all users
SET GLOBAL sql_mode = '';
# add MSSQL flag to the sql_mode for the current session
SET SESSION sql_mode = CONCAT(sql_mode, ',MSSQL');

old_mode is very similar to sql_mode, but its purpose is to provide compatibility with older MariaDB versions. Its flags shouldn’t affect the compatibility with SQL Server (though it is theoretically possible that some of them do, as a side effect).

MariaDB supports executable comments. These are designed for writing generic queries that are executed only by MariaDB, and optionally only by certain versions of it.

The following examples show how to insert SQL code that will be ignored by SQL Server but executed by MariaDB, or some of its versions.

  • Executed by MariaDB and MySQL:
SELECT * FROM tab /*! FORCE INDEX (idx_a) */ WHERE a = 1 OR b = 2;
  • Executed by MariaDB only:
SELECT * /*M! , @in_transaction */ FROM tab;
  • Executed by MariaDB starting from version 10.0.5:
DELETE FROM user WHERE id = 100 /*M!100005 RETURNING email */;

Data Definition Language

MariaDB supports some convenient syntax that SQL Server users may find useful.

IF EXISTS, IF NOT EXISTS, OR REPLACE

Most DDL statements, including ALTER TABLE, support the following syntax:

  • DROP IF EXISTS: A warning (not an error) is produced if the object does not exist.
  • OR REPLACE: If the object exists, it is dropped and recreated; otherwise it is created. This operation is atomic, so at no point in time does the object not exist.
  • CREATE IF NOT EXISTS: If the object already exists, a warning (not an error) is produced. The object will not be replaced.

These statements are functionally similar to (but less verbose than) SQL Server snippets like the following (a MariaDB equivalent is sketched after the snippet):

IF NOT EXISTS (SELECT name FROM sysobjects WHERE name = 'my_table' AND xtype = 'U')
    CREATE TABLE my_table (
        ...
    )
go
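For comparison, a minimal MariaDB sketch of the same idea (my_table is just an illustrative name):

CREATE TABLE IF NOT EXISTS my_table (
    id INT PRIMARY KEY
);

-- or, to atomically drop and recreate the table if it already exists:
CREATE OR REPLACE TABLE my_table (
    id INT PRIMARY KEY
);

DROP TABLE IF EXISTS my_table;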

SHOW CREATE

In general, for each CREATE statement MariaDB also supports a SHOW CREATE statement. For example there is a SHOW CREATE TABLE that returns the CREATE TABLE statement that can be used to recreate a table.

Though SQL Server has no way to show the DDL statement to recreate an object, SHOW CREATE statements are functionally similar to sp_helptext().
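As a quick illustration (reusing the hypothetical my_table from above), the returned statement can be copied and replayed on another MariaDB server:

SHOW CREATE TABLE my_table;

SHOW CREATE VIEW, SHOW CREATE PROCEDURE, and SHOW CREATE TRIGGER work the same way for the corresponding object types.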

See Also

How to Deploy and Use MySQL InnoDB ReplicaSet in Production?

$
0
0

Feed: Planet MySQL.
Author: Chandan Kumar.

How to Deploy MySQL InnoDB ReplicaSet in Production?

Before discussing the deployment process for MySQL InnoDB ReplicaSet, it is important to cover the following:
  • What is MySQL InnoDB ReplicaSet?
  • What are the prerequisites and limitations of using MySQL ReplicaSet?
  • In what kind of scenarios is MySQL ReplicaSet not recommended?
  • How to configure and deploy MySQL ReplicaSet (step-by-step guide).
  • How to use InnoDB ReplicaSet?
  • What if the primary goes down? Do SELECT queries get re-routed to another server?
  • What if a secondary goes down while executing SELECT queries?

I will answer all of these questions in this blog.

What is ReplicaSet?

MySQL InnoDB ReplicaSet is a quick and easy way to get MySQL replication (master-slave) up and running, making it well suited to scaling out reads. It provides manual failover capabilities for use cases that do not require the high availability offered by MySQL InnoDB Cluster.

Suppose you have one server running your workload and you have to bring high availability in place for an application. In MySQL, achieving high availability requires a minimum of two MySQL servers running on two different hosts.

Until recently, setting up the link between these two hosts meant preparing and qualifying each server to be part of the HA setup, which required knowing the basics of MySQL replication. From MySQL 8.0.19 onward, you no longer have to spend time on preparation, qualification, and configuration-level changes: MySQL InnoDB ReplicaSet automates the job.

MySQL ReplicaSet is a set of three components:

  • MySQL Shell
  • MySQL Router
  • A set of MySQL Servers (minimum of two)

It works only with a single primary and multiple secondary servers, replicating in asynchronous mode.

MySQL Shell includes AdminAPI, which enables you to easily configure, administer, and deploy a group of MySQL servers.

MySQL Router, which is part of the ReplicaSet, is lightweight middleware that provides transparent routing between your application and the back-end MySQL servers. Its purpose is to serve read/write requests to the primary instance through port 6446 and read-only requests to the secondary instances through port 6447.

It is always recommended to install MySQL Router on the application server, for two reasons:

  • The application is the one sending the requests (Application -> Router -> List of MySQL Servers).
  • It decreases network latency.

What are the prerequisites and limitations of using MySQL ReplicaSet?

  • Manual failover only.
  • No multi-primary topology.
  • All secondary members replicate from the primary.
  • GTID-based replication.
  • All MySQL Servers must be version 8.0.19 or higher.
  • Only row-based replication is supported.
  • Replication filters are not supported.
  • The ReplicaSet must be managed by MySQL Shell.
  • Prefer MySQL Clone over incremental recovery as the recovery method.

More limitations:

In what kind of scenarios is MySQL ReplicaSet recommended?

Below are the top features that make a DBA's life simpler:

  • Scaling out read workloads.
  • Manual failover in the event the primary node goes down.
  • Useful where RPO/RTO targets can be relaxed.
  • MySQL Shell automatically configures users and replication.
  • Easy to deploy without editing the my.cnf/my.ini file.
  • No time spent on backup and restore to provision a new node: the built-in MySQL CLONE feature saves a lot of time when bringing up another server for replication. More on cloning: https://mysqlserverteam.com/clone-create-mysql-instance-replica/
  • Integrated MySQL Router load balancing.
  • An easy way to get started with MySQL high availability for all tiers of applications.

How to configure and deploy MySQL ReplicaSet

Step-by-step guide to deploying MySQL ReplicaSet in production.

In this tutorial I will use two machines where MySQL is running:

Machine 01: 10.0.10.33

Machine 02: 10.0.10.38

Make sure the following software is installed:

  1. MySQL Server 8.0.19
  2. MySQL Shell
  3. MySQL Router (it can be installed on either the MySQL Server or, as recommended, the application server).

Step 1: Configure each machine to participate in the InnoDB ReplicaSet

##In Machine 01

mysqlsh

shell.connect("root@10.0.10.33:3306");

Creating a session to 'root@10.0.10.33:3306'
Please provide the password for 'root@10.0.10.33:3306': ********
Save password for 'root@10.0.10.33:3306'? [Y]es/[N]o/Ne[v]er (default No): Y
Fetching schema names for autocompletion... Press ^C to stop.
Your MySQL connection id is 13
Server version: 8.0.19-commercial MySQL Enterprise Server - Commercial
No default schema selected; type \use <schema> to set one.

 MySQL  10.0.10.33:3306 ssl  JS >

dba.configureReplicaSetInstance("root@10.0.10.33:3306", {clusterAdmin: "'rsadmin'@'10.0.10.33%'"});

Configuring local MySQL instance listening at port 3306 for use in an InnoDB ReplicaSet...

This instance reports its own address as Workshop-33:3306

Clients and other cluster members will communicate with it through this address by default. If this is not correct, the report_host MySQL system variable should be changed.

Password for new account: ********
Confirm password: ********

NOTE: Some configuration options need to be fixed:

+--------------------------+---------------+----------------+---------------------------------------------------+
| Variable                 | Current Value | Required Value | Note                                              |
+--------------------------+---------------+----------------+---------------------------------------------------+
| enforce_gtid_consistency | OFF           | ON             | Update read-only variable and restart the server  |
| gtid_mode                | OFF           | ON             | Update read-only variable and restart the server  |
| server_id                | 1             |                | Update read-only variable and restart the server  |
+--------------------------+---------------+----------------+---------------------------------------------------+

Some variables need to be changed, but cannot be done dynamically on the server.

Do you want to perform the required configuration changes? [y/n]: y
Do you want to restart the instance after configuring it? [y/n]: y

Cluster admin user 'rsadmin'@'10.0.10.33%' created.

Configuring instance...

The instance 'Workshop-33:3306' was configured to be used in an InnoDB ReplicaSet.

Restarting MySQL...

NOTE: MySQL server at Workshop-33:3306 was restarted.

##In Machine 02

mysqlsh

shell.connect("root@10.0.10.38:3306");

Creating a session to 'root@10.0.10.38:3306'
Please provide the password for 'root@10.0.10.38:3306': ********
Save password for 'root@10.0.10.38:3306'? [Y]es/[N]o/Ne[v]er (default No): Y
Fetching schema names for autocompletion... Press ^C to stop.
Your MySQL connection id is 10
Server version: 8.0.19-commercial MySQL Enterprise Server - Commercial
No default schema selected; type \use <schema> to set one.

dba.configureReplicaSetInstance("root@10.0.10.38:3306", {clusterAdmin: "'rsadmin'@'10.0.10.38%'"});

Configuring local MySQL instance listening at port 3306 for use in an InnoDB ReplicaSet...

This instance reports its own address as Workshop-38:3306

Clients and other cluster members will communicate with it through this address by default. If this is not correct, the report_host MySQL system variable should be changed.

Password for new account: ********
Confirm password: ********

NOTE: Some configuration options need to be fixed:

+--------------------------+---------------+----------------+---------------------------------------------------+
| Variable                 | Current Value | Required Value | Note                                              |
+--------------------------+---------------+----------------+---------------------------------------------------+
| enforce_gtid_consistency | OFF           | ON             | Update read-only variable and restart the server  |
| gtid_mode                | OFF           | ON             | Update read-only variable and restart the server  |
| server_id                | 1             |                | Update read-only variable and restart the server  |
+--------------------------+---------------+----------------+---------------------------------------------------+

Some variables need to be changed, but cannot be done dynamically on the server.

Do you want to perform the required configuration changes? [y/n]: y
Do you want to restart the instance after configuring it? [y/n]: y

Cluster admin user 'rsadmin'@'10.0.10.38%' created.

Configuring instance...

The instance 'Workshop-38:3306' was configured to be used in an InnoDB ReplicaSet.

Restarting MySQL...

NOTE: MySQL server at Workshop-38:3306 was restarted.

 MySQL  10.0.10.38:3306 ssl  JS >

Step 2: Create the ReplicaSet and add a database node to it.

##Connect to Machine 01:

mysqlsh

shell.connect("root@10.0.10.33:3306");

var rs = dba.createReplicaSet("MyReplicatSet")

A new replicaset with instance 'Workshop-33:3306' will be created.

* Checking MySQL instance at Workshop-33:3306

This instance reports its own address as Workshop-33:3306
Workshop-33:3306: Instance configuration is suitable.

* Updating metadata...

ReplicaSet object successfully created for Workshop-33:3306.

Use rs.addInstance() to add more asynchronously replicated instances to this replicaset and rs.status() to check its status.

 MySQL  10.0.10.33:3306 ssl  JS > rs.addInstance("10.0.10.38:3306");
Adding instance to the replicaset...

* Performing validation checks

This instance reports its own address as Workshop-38:3306
Workshop-38:3306: Instance configuration is suitable.

* Checking async replication topology...

* Checking transaction state of the instance...

The safest and most convenient way to provision a new instance is through automatic clone provisioning, which will completely overwrite the state of 'Workshop-38:3306' with a physical snapshot from an existing replicaset member. To use this method by default, set the 'recoveryMethod' option to 'clone'.

WARNING: It should be safe to rely on replication to incrementally recover the state of the new instance if you are sure all updates ever executed in the replicaset were done with GTIDs enabled, there are no purged transactions and the new instance contains the same GTID set as the replicaset or a subset of it. To use this method by default, set the 'recoveryMethod' option to 'incremental'.

Incremental state recovery was selected because it seems to be safely usable.

* Updating topology

** Configuring Workshop-38:3306 to replicate from Workshop-33:3306

** Waiting for new instance to synchronize with PRIMARY...

The instance 'Workshop-38:3306' was added to the replicaset and is replicating from Workshop-33:3306.

 MySQL  10.0.10.33:3306 ssl  JS >

rs.status();

{
    "replicaSet": {
        "name": "ReplicatSet",
        "primary": "Workshop-38:3306",
        "status": "AVAILABLE",
        "statusText": "All instances available.",
        "topology": {
            "10.0.10.39:3306": {
                "address": "10.0.10.39:3306",
                "instanceRole": "SECONDARY",
                "mode": "R/O",
                "replication": {
                    "applierStatus": "APPLIED_ALL",
                    "applierThreadState": "Slave has read all relay log; waiting for more updates",
                    "receiverStatus": "ON",
                    "receiverThreadState": "Waiting for master to send event",
                    "replicationLag": null
                },
                "status": "ONLINE"
            },
            "Workshop-38:3306": {
                "address": "Workshop-38:3306",
                "instanceRole": "PRIMARY",
                "mode": "R/W",
                "status": "ONLINE"
            }
        },
        "type": "ASYNC"
    }
}

Step 3: Configure Router to route traffic from the app to the ReplicaSet.

mysqlrouter --force --user=root --bootstrap root@10.0.10.38:3306 --directory myrouter

#In case the Router runs on a remote machine (cluster at 10.0.10.14):

mysqlrouter --bootstrap root@10.0.10.14:3310 --directory myrouter

Step 4: Start Router

myrouter/start.sh

Step 5: Using the ReplicaSet

mysqlsh

MySQL JS >

shell.connect("root@127.0.0.1:6446");

\sql

SQL> SELECT * FROM performance_schema.replication_group_members;

CREATE DATABASE sales; USE sales;

CREATE TABLE IF NOT EXISTS sales.employee (empid INT PRIMARY KEY AUTO_INCREMENT, empname VARCHAR(100), salary INT, deptid INT);

INSERT sales.employee(empname, salary, deptid) VALUES ('Ram', 1000, 10);

INSERT sales.employee(empname, salary, deptid) VALUES ('Raja', 2000, 10);

INSERT sales.employee(empname, salary, deptid) VALUES ('Sita', 3000, 20);

SELECT * FROM sales.employee;

Connect through the Router's other port to verify the changes:

mysqlsh

JS> shell.connect("root@127.0.0.1:6447");

\sql

SQL> SELECT * FROM sales.employee;

INSERT sales.employee VALUES (100, 'Ram', 1000, 10);

<This INSERT fails because this instance is not allowed to execute DML/DDL statements.>

##Create Disaster

#service mysqld stop

RS1 = dba.getReplicaSet()

RS1.status();

 MySQL  10.0.10.38:3306 ssl  JS > RS1.status()

ReplicaSet.status: Failed to execute query on Metadata server 10.0.10.38:3306: Lost connection to MySQL server during query (MySQL Error 2013)

 MySQL  10.0.10.38:3306 ssl  JS > RS1.status()

ReplicaSet.status: The Metadata is inaccessible (MetadataError)

 MySQL  10.0.10.38:3306 ssl  JS > RS1.status()

ReplicaSet.status: The Metadata is inaccessible (MetadataError)

 MySQL  10.0.10.38:3306 ssl  JS >

MySQL-JS> shell.connect("root@localhost:6446");

Creating a session to 'root@10.0.10.38:6446'
Please provide the password for 'root@10.0.10.38:6446': ********

Shell.connect: Can't connect to remote MySQL server for client connected to '0.0.0.0:6446' (MySQL Error 2003)

#service mysqld start

 MySQL  10.0.10.38:3306 ssl  JS > RS1.status()

ReplicaSet.status: The Metadata is inaccessible (MetadataError)

 MySQL  10.0.10.38:3306 ssl  JS > RS1 = dba.getReplicaSet()

You are connected to a member of replicaset 'ReplicatSet'.

RS1 = dba.getReplicaSet()

RS1.status()

##Again connect to the Router to send traffic

mysqlsh

shell.connect("root@localhost:6447");

\sql

SQL> SELECT * FROM sales.employee;

Scenario #1: Assume the primary goes down, and you run:

 MySQL  10.0.10.38:3306 ssl  JS > RS1.status()

Error: ReplicaSet.status: The Metadata is inaccessible (MetadataError)

 MySQL  10.0.10.38:3306 ssl  JS >

Now the primary machine comes back up, and you run:

 MySQL  10.0.10.38:3306 ssl  JS > RS1.status()

ReplicaSet.status: The Metadata is inaccessible (MetadataError)

>> It does not get refreshed.

Fix:

RS1 = dba.getReplicaSet()

RS1.status();

Scenario #02

Create disaster: what if the primary node fails while the application is executing the query below?

while [ 1 ]; do sleep 1; mysql -h127.0.0.1 -uroot -p123456 -P6446 -e "INSERT sales.employee(empname,salary,deptid) values('Ram',1000,10); select count(*) from sales.employee"; done

#Stop the primary MySQL instance

service mysqld stop

You can see that the INSERT query stops working and ends with an error. Now let's execute only the SELECT query and see what happens. Since the primary node is down, MySQL Router stops sending any queries to port 6446, BUT the router has another port open for SELECT-only traffic: it uses port 6447 to route read-only queries.

Let's re-execute the same loop with only the SELECT query, connecting to the R/O port 6447:

while [ 1 ]; do sleep 1; mysql -h127.0.0.1 -uroot -p123456 -P6447 -e "select count(*) from sales.employee"; done

You are able to access the other machine, which is the replica (10.0.10.38). Now, let's reconnect to the primary node (10.0.10.33); will it work or not? This shows that even if the primary node goes down and the secondary replicas are alive, SELECT queries will still work.

select @@hostname;  --> 10.0.10.38

Scenario #03

Create disaster: what if a secondary node fails?

#Stop the MySQL instance

10.0.10.38$ service mysqld stop

The primary still works even though the secondary node goes down; that's by design of MySQL replication.

Now that the secondary node is down, let's connect to port 6447 and send only SELECT queries:

while [ 1 ]; do sleep 1; mysql -h127.0.0.1 -uroot -p123456 -P6447 -e "select count(*) from sales.employee"; done

What will happen? Even though the secondary node is down, MySQL Router re-routes the connection to the primary server and returns results.

Re-confirm:

Can you make one important observation? Why is port 6447 executing R/W-capable queries? When we send R/W and R/O statements to port 6447, the Router routes them to the primary node (the one behind port 6446).

This is because, as per the documentation, when you use MySQL Router with a replica set:

  • The read-write port of MySQL Router directs client connections to the primary instance of the replica set.
  • The read-only port of MySQL Router directs client connections to a secondary instance of the replica set, although it could also direct them to the primary.

Please try this brand new feature to set up MySQL replication with the help of MySQL Shell.

Want to know more?

Fun with Bugs #94 – On MySQL Bug Reports I am Subscribed to, Part XXVIII

$
0
0

Feed: Planet MySQL.
Author: Valeriy Kravchuk.

I may get a chance to speak about proper bugs processing for open source projects later this year, so I have to keep reviewing recent MySQL bugs to be ready for that. In my previous post in this series I listed some interesting MySQL bug reports created in December, 2019. Time to move on to January, 2020! Belated Happy New Year of cool MySQL Bugs!

As usual, I mostly care about InnoDB, replication, and optimizer bugs, and I explicitly mention each bug reporter by name and give a link to their other active reports (if any). I also pick out examples of proper (or improper) attitudes from reporters and Oracle engineers. Here is the list:

  • Bug #98103 – "unexpected behavior while logging an aborted query in the slow query log". A query that was killed while waiting for a table metadata lock not only gets logged, but the lock wait time is also saved as the query execution time. I'd like to highlight how the bug reporter, Pranay Motupalli, used gdb to study what really happens in the code in this case. Perfect bug report!
  • Bug #98113 – "Crash possible when load & unload a connection handler". The (quite obvious) bug was verified based on code review, but only after the Oracle engineer spent some effort denying the problem and its importance. This bug was reported by Fangxin Flou.
  • Bug #98132 – "Analyze table leads to empty statistics during online rebuild DDL". Nice addition to my collections! This bug, with a nice and clear test case, was reported by Albert Hu, who also suggested a fix.
  • Bug #98139 – "Committing a XA transaction causes a wrong sequence of events in binlog". This bug reported by Dehao Wang was verified as a "documentation" one, but I doubt documenting the current behavior properly is an acceptable fix. The bug reporter suggested committing in the binary log first, for example. The current implementation, which allows users to commit/rollback an XA transaction using another connection if the former connection is closed or killed, is risky. A lot of arguing happened in the comments along the way, and my comment asking for a clear quote from the manual:
    Would you be so kind to share some text from this page you mentioned:

    https://dev.mysql.com/doc/refman/8.0/en/xa.html

    or any other fine MySQL 8 manual page stating that XA COMMIT is NOT supported when executed from session/connection/thread other than those prepared the XA transaction? I am doing something wrong probably, but I can not find such text anywhere.

    was hidden. Let’s see what happens to this bug report next.

  • Bug #98211 – "Auto increment value didn't reset correctly.". Not sure what this bug reported by Zhao Jianwei has to do with "Data Types"; IMHO it's more about DDL or the data dictionary. Again, some sarcastic comments from Community users were needed to put work on this bug back on track…
  • Bug #98220 – "with log_slow_extra=on Errno: info not getting updated correctly for error". This bug was reported by Lalit Choudhary from Percona.
  • Bug #98227 – "innodb_stats_method='nulls_ignored' and persistent stats get wrong cardinalities". I think the category is wrong for this bug. It's a bug in InnoDB's persistent statistics implementation, one of many. The bug was reported by Agustín G from Percona.
  • Bug #98231 – "show index from a partition table gets a wrong cardinality value". Yet another report by Albert Hu that ended up as a "documentation" bug for now, even though older MySQL versions provided better cardinality estimations than MySQL 8.0 in this case (so this is a regression of a kind). I hope the bug will be re-classified and properly processed later.
  • Bug #98238 – "I_S.KEY_COLUMN_USAGE is very slow". I am surprised to see such a bug in MySQL 8. According to the bug reporter, Manuel Mausz, this is also a kind of regression compared to older MySQL versions, where these queries used to run faster. Surely, no "regression" tag was added in this case.
  • Bug #98284 – "Low sysbench score in the case of a large number of connections". This notable performance regression of MySQL 8 vs 5.7 was reported by zanye zjy. perf profiling pointed towards ppoll(), where a lot of time is spent. There is a fix suggested by Fangxin Flou (to use poll() instead), but the bug is still "Open".
  • Bug #98287 – “Explanation of hash joins is inconsistent across EXPLAIN formats“. This bug was reported by Saverio M and ended up marked as a duplicate of Bug #97299 fixed in upcoming 8.0.20. Use EXPLAIN FORMAT=TREE in the meantime to see proper information about hash joins usage in the plan.
  • Bug #98288 – “xa commit crash lead mysql replication error“. This bug report from Phoenix Zhang (who also suggested a patch) was declared a duplicate of Bug #76233 – “XA prepare is logged ahead of engine prepare” (that I’ve already discussed among other XA transactions bugs here).
  • Bug #98324 – "Deadlocks more frequent since version 5.7.26". Nice regression bug report by Przemyslaw Malkowski from Percona, with an additional test provided later by Stephen Wei. Interestingly enough, test results shared by Umesh Shastry show that MySQL 8.0.19 is affected in the same way as 5.7.26+, but 8.0.19 is NOT listed as one of the versions affected. This is a mistake to fix, along with the missing regression tag.
  • Bug #98427 – "InnoDB FullText AUX Tables are broken in 8.0". Yet another regression in MySQL 8 was found by Satya Bodapati. The change in the default collation for the utf8mb4 character set seems to have caused this. InnoDB FULLTEXT search was far from perfect anyway…
There are clouds in the sky of MySQL bugs processing.

To summarize:

  1. Too much time and effort is still sometimes spent arguing with bug reporters instead of accepting and processing bugs properly. This is unfortunate.
  2. Sometimes bugs are wrongly classified when verified (documentation vs code bug, wrong category, wrong severity, not all affected versions are listed, ignoring regression etc). This is also unfortunate.
  3. Percona engineers still help to make MySQL better.
  4. There are some fixes in upcoming MySQL 8.0.20 that I am waiting for 🙂
  5. XA transactions in MySQL are badly broken (they are not atomic in storage engine + binary log) and hardly safe to use in reality.

Addressing the Drop Column Bug in Oracle 18c and 19c

$
0
0

Feed: Databasejournal.com – Feature Database Articles.
Author: .

The road of progress can be rough at times. Oracle versions 18 and 19 are no exception. Up until version 18.x Oracle had no issues with marking columns as unused and eventually dropping them. Given some interesting circumstances, the latest two Oracle versions can throw ORA-00600 errors when columns are set as unused and then dropped. The conditions that bring about this error may not be common but there are a large number of Oracle installations across the globe and it is very likely someone, somewhere, will encounter this bug.

The tale begins with two tables and a trigger:


create table trg_tst1 (c0 varchar2(30), c1 varchar2(30), c2 varchar2(30), c3 varchar2(30), c4 varchar2(30));
create table trg_tst2 (c_log varchar2(30));

create or replace trigger trg_tst1_cpy_val
after insert or update on trg_tst1
for each row
begin
        IF :new.c3 is not null then
                insert into trg_tst2 values (:new.c3);
        end if;
end;
/

Data is inserted into table TRG_TST1 and, provided the conditions are met, data is replicated to table TRG_TST2. Two rows are inserted into TRG_TST1 so that only one of the inserted rows will be copied to TRG_TST2. After each insert table TRG_TST2 is queried and the results displayed:


SMERBLE @ gwunkus > 
SMERBLE @ gwunkus > insert into trg_tst1(c3) values ('Inserting c3 - should log');

1 row created.

SMERBLE @ gwunkus > select * from trg_tst2;

C_LOG
------------------------------
Inserting c3 - should log

SMERBLE @ gwunkus > 
SMERBLE @ gwunkus > insert into trg_tst1(c4) values ('Inserting c4 - should not log');

1 row created.

SMERBLE @ gwunkus > select * from trg_tst2;

C_LOG
------------------------------
Inserting c3 - should log

SMERBLE @ gwunkus > 

Now the 'fun' begins: two columns in TRG_TST1 are marked as unused and then dropped, and table TRG_TST2 is truncated. The inserts into TRG_TST1 are executed again, but this time the dreaded ORA-00600 errors are produced. To see why these errors occur, the status of the trigger is reported from USER_OBJECTS:


SMERBLE @ gwunkus > 
SMERBLE @ gwunkus > --  ===================================
SMERBLE @ gwunkus > --  Drop some columns in two steps then
SMERBLE @ gwunkus > --  truncate trg_tst2 and repeat the test
SMERBLE @ gwunkus > --
SMERBLE @ gwunkus > --  ORA-00600 errors are raised
SMERBLE @ gwunkus > --
SMERBLE @ gwunkus > --  The trigger is not invalidated and
SMERBLE @ gwunkus > --  thus is not recompiled.
SMERBLE @ gwunkus > --  ===================================
SMERBLE @ gwunkus > 
SMERBLE @ gwunkus > alter table trg_tst1 set unused (c1, c2);

Table altered.

SMERBLE @ gwunkus > alter table trg_tst1 drop unused columns;

Table altered.

SMERBLE @ gwunkus > 
SMERBLE @ gwunkus > select object_name, status from user_objects where object_name in (select trigger_name from user_triggers);


OBJECT_NAME                         STATUS
----------------------------------- -------
TRG_TST1_CPY_VAL                    VALID

SMERBLE @ gwunkus > 
SMERBLE @ gwunkus > truncate table trg_tst2;

Table truncated.

SMERBLE @ gwunkus > 
SMERBLE @ gwunkus > insert into trg_tst1(c3) values ('Inserting c3 - should log');

insert into trg_tst1(c3) values ('Inserting c3 - should log')
            *
ERROR at line 1:
ORA-00600: internal error code, arguments: [insChkBuffering_1], [4], [4], [], [], [], [], [], [], [], [], []


SMERBLE @ gwunkus > select * from trg_tst2;

no rows selected

SMERBLE @ gwunkus > 
SMERBLE @ gwunkus > insert into trg_tst1(c4) values ('Inserting c4 - should not log');

insert into trg_tst1(c4) values ('Inserting c4 - should not log')
            *
ERROR at line 1:
ORA-00600: internal error code, arguments: [insChkBuffering_1], [4], [4], [], [], [], [], [], [], [], [], []


SMERBLE @ gwunkus > select * from trg_tst2;

no rows selected

SMERBLE @ gwunkus > 

The issue is that, in Oracle 18c and 19c, the 'drop unused columns' action does NOT invalidate the trigger, leaving it in a 'VALID' state and setting the next transactions up for failure. Since the trigger is not recompiled at the next call, the original compilation environment is still in effect, an environment that includes the now-dropped columns. Oracle can't find columns C1 and C2, but the trigger still expects them to exist, hence the ORA-00600 error. My Oracle Support reports this as a bug:


Bug 30404639 : TRIGGER DOES NOT WORK CORRECTLY AFTER ALTER TABLE DROP UNUSED COLUMN.

and reports that the cause is, in fact, the failure to invalidate the trigger with the deferred column drop.

So how to get around this issue? One way is to explicitly compile the trigger after the unused columns are dropped:


SMERBLE @ gwunkus > --
SMERBLE @ gwunkus > -- Compile the trigger after column drops
SMERBLE @ gwunkus > --
SMERBLE @ gwunkus > alter trigger trg_tst1_cpy_val compile;

Trigger altered.

SMERBLE @ gwunkus > 

With the trigger now using the current environment and table configuration the inserts function correctly and the trigger fires as expected:


SMERBLE @ gwunkus > --
SMERBLE @ gwunkus > -- Attempt inserts again
SMERBLE @ gwunkus > --
SMERBLE @ gwunkus > insert into trg_tst1(c3) values ('Inserting c3 - should log');

1 row created.

SMERBLE @ gwunkus > select * from trg_tst2;

C_LOG
------------------------------
Inserting c3 - should log

SMERBLE @ gwunkus > 
SMERBLE @ gwunkus > insert into trg_tst1(c4) values ('Inserting c4 - should not log');

1 row created.

SMERBLE @ gwunkus > select * from trg_tst2;

C_LOG
------------------------------
Inserting c3 - should log

SMERBLE @ gwunkus > 

Another way exists to get around this issue: don't mark the columns as unused, and simply drop them from the table. Dropping the original tables, recreating them, and executing this example with a straight column drop shows no sign of an ORA-00600, and the trigger status after the column drop proves that no such errors will be thrown:


SMERBLE @ gwunkus > 
SMERBLE @ gwunkus > drop table trg_tst1 purge;

Table dropped.

SMERBLE @ gwunkus > drop table trg_tst2 purge;

Table dropped.

SMERBLE @ gwunkus > 
SMERBLE @ gwunkus > --  ===================================
SMERBLE @ gwunkus > --  Re-run the example without marking
SMERBLE @ gwunkus > --  columns as 'unused'
SMERBLE @ gwunkus > --  ===================================
SMERBLE @ gwunkus > 
SMERBLE @ gwunkus > create table trg_tst1 (c0 varchar2(30), c1 varchar2(30), c2 varchar2(30), c3 varchar2(30), c4 varchar2(30));

Table created.

SMERBLE @ gwunkus > create table trg_tst2 (c_log varchar2(30));

Table created.

SMERBLE @ gwunkus > 
SMERBLE @ gwunkus > create or replace trigger trg_tst1_cpy_val
  2  after insert or update on trg_tst1
  3  for each row
  4  begin
  5  	     IF :new.c3 is not null then
  6  		     insert into trg_tst2 values (:new.c3);
  7  	     end if;
  8  end;
  9  /

Trigger created.

SMERBLE @ gwunkus > 
SMERBLE @ gwunkus > insert into trg_tst1(c3) values ('Inserting c3 - should log');

1 row created.

SMERBLE @ gwunkus > select * from trg_tst2;

C_LOG
------------------------------
Inserting c3 - should log

SMERBLE @ gwunkus > 
SMERBLE @ gwunkus > insert into trg_tst1(c4) values ('Inserting c4 - should not log');

1 row created.

SMERBLE @ gwunkus > select * from trg_tst2;

C_LOG
------------------------------
Inserting c3 - should log

SMERBLE @ gwunkus > 
SMERBLE @ gwunkus > --  ===================================
SMERBLE @ gwunkus > --  Drop some columns,
SMERBLE @ gwunkus > --  truncate trg_tst2 and repeat the test
SMERBLE @ gwunkus > --
SMERBLE @ gwunkus > --  No ORA-00600 errors are raised as
SMERBLE @ gwunkus > --  the trigger is invalidated by the
SMERBLE @ gwunkus > --  DDL.  Oracle then recompiles the
SMERBLE @ gwunkus > --  invalid trigger.
SMERBLE @ gwunkus > --  ===================================
SMERBLE @ gwunkus > 
SMERBLE @ gwunkus > alter table trg_tst1 drop (c1,c2);

Table altered.

SMERBLE @ gwunkus > 
SMERBLE @ gwunkus > select object_name, status from user_objects where object_name in (select trigger_name from user_triggers);

OBJECT_NAME                         STATUS
----------------------------------- -------
TRG_TST1_CPY_VAL                    INVALID

SMERBLE @ gwunkus > 
SMERBLE @ gwunkus > truncate table trg_tst2;

Table truncated.

SMERBLE @ gwunkus > 
SMERBLE @ gwunkus > insert into trg_tst1(c3) values ('Inserting c3 - should log');

1 row created.

SMERBLE @ gwunkus > select * from trg_tst2;

C_LOG
------------------------------
Inserting c3 - should log

SMERBLE @ gwunkus > 
SMERBLE @ gwunkus > insert into trg_tst1(c4) values ('Inserting c4 - should not log');

1 row created.

SMERBLE @ gwunkus > select * from trg_tst2;

C_LOG
------------------------------
Inserting c3 - should log

SMERBLE @ gwunkus > 

Oracle versions prior to 18c behave as expected, with the deferred column drop correctly setting the trigger status to ‘INVALID’:


SMARBLE @ gwankus > select banner from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
PL/SQL Release 12.1.0.2.0 - Production
CORE	12.1.0.2.0	Production
TNS for Linux: Version 12.1.0.2.0 - Production
NLSRTL Version 12.1.0.2.0 - Production

SMARBLE @ gwankus >
SMARBLE @ gwankus > alter table trg_tst1 set unused (c1, c2);

Table altered.

SMARBLE @ gwankus > alter table trg_tst1 drop unused columns;

Table altered.

SMARBLE @ gwankus >
SMARBLE @ gwankus > select object_name, status from user_objects where object_name in (select trigger_name from user_triggers);

OBJECT_NAME			    STATUS
----------------------------------- -------
TRG_TST1_CPY_VAL		    INVALID

SMARBLE @ gwankus >

How the columns are dropped in versions older than 18c makes no difference as any triggers on the affected table will be rendered invalid. The next call to any trigger on that table will result in an ‘automatic’ recompile, setting the execution environment properly (meaning the missing columns in the affected table won’t remain in the execution context).

It isn't likely that a production database will undergo column drops without such changes first being made in a DEV or TST database. Unfortunately, testing inserts after columns are dropped may not be among the tests executed after such changes are made and before the code is promoted to PRD. Having more than one person test the after-effects of dropping columns would seem to be an excellent idea since, as the old adage attests, 'two heads are better than one.' The more the merrier in a testing situation, so that many avenues of possible failure can be exercised. The extra time taken to test a change more thoroughly means a lower chance of unforeseen errors seriously affecting, or stopping, production.

# # #

See articles by David Fitzjarrell

The next package release into AWS Athena

$
0
0

Feed: R-bloggers.
Author: Dyfan Jones Brain Dump HQ.


RAthena 1.7.1 and noctua 1.5.1 have now been released to CRAN. They both bring several improvements to the connection to AWS Athena, notably in performance, plus several creature comforts.

These packages have been designed to reflect one another, even down to how they connect to AWS Athena. This means that all features going forward will exist in both packages. I will refer to these packages as one, as they basically work in the same way.

Initially the packages relied on AWS Athena SQL queries to achieve all the functional requirements of the DBI package framework. However, the package would always send a SQL query to AWS Athena, which in turn would have to lift a flat file from AWS S3 before returning the final result to R. This meant the performance of the packages was limited and fairly slow compared to other database backends.

The biggest change is the adoption of more functionality from the AWS SDKs (software development kits). The key component that has been adopted is AWS Glue. AWS Glue contains all of the AWS Athena table DDLs, which means this information can be fetched from AWS Glue instead of going to AWS Athena.

By utilising AWS Glue, the table metadata (column names, column types, schema hierarchy, etc.) can easily be retrieved in a fraction of the time it would have taken to query AWS Athena. Previously, the DBI function dbListTables would send a query to AWS Athena to retrieve all the tables listed in all schemas, which took over 3 seconds. Now, using AWS Glue to retrieve the same data takes less than 0.5 of a second.

dplyr

When AWS Glue is used to collect metadata about a table in AWS Athena, a performance improvement can be made in dplyr::tbl. I would like to say thanks to @OssiLehtinen for developing the initial implementation, as this improvement would otherwise have been overlooked.

dplyr::tbl has two key methods when creating the initial object. The first is called SQL identifiers, and this is the method that benefits from the new AWS Glue functionality. Using SQL identifiers is fairly straightforward.

library(DBI)
library(dplyr)
library(RAthena) #Or library(noctua)

con = dbConnect(athena())

dbWriteTable(con, "iris", iris)

ident_iris = tbl(con, "iris")

dplyr can identify the iris table within the connected schema. When a user uses the SQL identifier method in dplyr::tbl, AWS Glue is called to retrieve all the metadata for dplyr. This reduces the time taken from 3.66 to 0.29 seconds. The second method is called SQL sub query. This unfortunately won't benefit from the new feature and will run slower, at around 3.66 seconds.

subquery_iris = tbl(con, sql("select * from iris"))

Therefore I recommend using the SQL identifier method when working with dplyr's interface.

Due to user feature requests, the packages now return more metadata about each query sent to AWS Athena. The basic level of metadata returned is the amount of data scanned by AWS Athena, formatted into a readable form depending on the amount of data scanned.

library(DBI)
library(RAthena) #Or library(noctua)

con = dbConnect(athena())

dbWriteTable(con, "iris", iris)

dbGetQuery(con, "select * from iris")
Info: (Data scanned: 860 Bytes)
     sepal_length sepal_width petal_length petal_width   species
  1:          5.1         3.5          1.4         0.2    setosa
  2:          4.9         3.0          1.4         0.2    setosa
  3:          4.7         3.2          1.3         0.2    setosa
  4:          4.6         3.1          1.5         0.2    setosa
  5:          5.0         3.6          1.4         0.2    setosa
 ---                                                            
146:          6.7         3.0          5.2         2.3 virginica
147:          6.3         2.5          5.0         1.9 virginica
148:          6.5         3.0          5.2         2.0 virginica
149:          6.2         3.4          5.4         2.3 virginica
150:          5.9         3.0          5.1         1.8 virginica

However if you set the new parameter statistics to TRUE then all the metadata around that query is printed out like so:

dbGetQuery(con, "select * from iris", statistics = TRUE)
$EngineExecutionTimeInMillis
[1] 1568

$DataScannedInBytes
[1] 860

$DataManifestLocation
character(0)

$TotalExecutionTimeInMillis
[1] 1794

$QueryQueueTimeInMillis
[1] 209

$QueryPlanningTimeInMillis
[1] 877

$ServiceProcessingTimeInMillis
[1] 17

Info: (Data scanned: 860 Bytes)
     sepal_length sepal_width petal_length petal_width   species
  1:          5.1         3.5          1.4         0.2    setosa
  2:          4.9         3.0          1.4         0.2    setosa
  3:          4.7         3.2          1.3         0.2    setosa
  4:          4.6         3.1          1.5         0.2    setosa
  5:          5.0         3.6          1.4         0.2    setosa
 ---                                                            
146:          6.7         3.0          5.2         2.3 virginica
147:          6.3         2.5          5.0         1.9 virginica
148:          6.5         3.0          5.2         2.0 virginica
149:          6.2         3.4          5.4         2.3 virginica
150:          5.9         3.0          5.1         1.8 virginica

This can also be retrieved by using dbStatistics:

res = dbExecute(con, "select * from iris")

# return query statistic
query_stats = dbStatistics(res)

# return query results
dbFetch(res)

# Free all resources
dbClearResult(res)

RJDBC inspired function

I have to give full credit to the RJDBC package for inspiring me to create this function. DBI has a good function called dbListTables that will list all the tables in AWS Athena, but it won't return which schema each table belongs to. To overcome this, RJDBC has an excellent function called dbGetTables, which returns all the tables from AWS Athena as a data.frame. This has the advantage of detailing schema, table name, and table type. With the new AWS Glue integration, this can be returned quickly.

dbGetTables(con)
      Schema             TableName      TableType
 1:  default             df_bigint EXTERNAL_TABLE
 2:  default                  iris EXTERNAL_TABLE
 3:  default               mtcars2 EXTERNAL_TABLE
 4:  default         nyc_taxi_2018 EXTERNAL_TABLE

This just makes it a little bit easier when working in different IDEs, for example Jupyter.

Backend option changes

This is not really a creature comfort, but it is still interesting and useful. Both packages depend on data.table to read data into R, due to the amazing speed data.table offers when reading files. However, a new package with equally impressive read speeds has come onto the scene, called vroom. As vroom has been designed only to read data into R, similarly to readr, data.table is still used for all of the heavy lifting. However, if a user wishes to use vroom as the file parser, an *_options function has been created to enable this:

noctua_options(file_parser = c("data.table", "vroom"))

# Or 

RAthena_options(file_parser = c("data.table", "vroom"))

By setting file_parser to "vroom", the backend changes so that vroom's file parser is used instead of data.table.

If you aren’t sure whether to use vroom over data.table, I draw your attention to vroom boasting a whopping 1.40GB/sec throughput.

Statistics taken from vroom’s github readme

package     version  time (sec)  speed-up  throughput
vroom       1.1.0         1.14     58.44   1.40 GB/sec
data.table  1.12.8       11.88      5.62   134.13 MB/sec
readr       1.3.1        29.02      2.30   54.92 MB/sec
read.delim  3.6.2        66.74      1.00   23.88 MB/sec

RStudio Interface!

Due to the ability of AWS Glue to retrieve metadata for AWS Athena at speed, it has now been possible to add an interface to RStudio's Connections tab. When a connection is established:

library(DBI)
library(RAthena) #Or library(noctua)

con = dbConnect(athena())

The connection icon will appear as follows:

The AWS region you are connecting to will be reflected in the connection (highlighted above in the red square). This is to help users who connect to AWS Athena across multiple different regions.

Once you have connected to AWS Athena, the schema hierarchy will be displayed. In my example you can see some of the tables I created when testing these packages.

For more information around RStudio’s connection tab please check out RStudio preview connections.

To sum up, the latest versions of RAthena and noctua have been released to CRAN with all the new goodies they bring. As these packages are based on AWS SDKs, they are highly customisable, and features can easily be added to improve them when connecting to AWS Athena. So please raise any feature requests / bug issues at: https://github.com/DyfanJones/RAthena and https://github.com/DyfanJones/noctua




Webinar Recap #1 of 3: Migration Strategy for Moving Operational Databases to the Cloud


Feed: MemSQL Blog.
Author: Floyd Smith.

This webinar describes the benefits and risks of moving operational databases to the cloud. It's the first webinar in a three-part series focused on migrating operational databases to the cloud. Migrating to cloud-based operational data infrastructure unlocks a number of key benefits, but it's also not without risk or complexity. The first session uncovers the motivations and benefits of moving operational data to the cloud and describes the unique challenges of migrating operational databases to the cloud. (Visit here to view all three webinars and download slides.)

About This Webinar Series

Today starts the first in a series of three webinars:

  • In this webinar we'll discuss, in broad strokes, migration strategy and cloud migration, and how those strategies are influenced by a larger IT transformation or digital transformation strategy. 
  • In our next webinar, we’ll go into the next level of details in terms of database migration best practices, where we’ll cover processes and techniques of database migration across any sort of database, really. 
  • In the final webinar, we’ll get specific to the technical nuts and bolts of how we do this in migrating to Helios, which is MemSQL’s database as a service. 

In this webinar, we’ll cover the journey to the cloud, a little bit about the current state of enterprise IT landscapes, and some of the challenges and business considerations that go into making a plan, making an assessment, and choosing what kind of workloads to support. 

Next we’ll get into the different types of data migrations that are typically performed. And some of the questions you need to start asking if you’re at the beginning of this kind of journey. And finally, we’ll get into some specific types of workloads along the way. 

Any sort of change to a functioning system can invoke fear and dread, especially when it comes to operational databases, which of course process the critical transactions for the business. After all, they’re the lifeblood of the business. And so, we’ll start to peel the onion and break that down a little bit.

If you’re just starting your journey to the cloud, you’ve probably done some experimentation, and you’ve spun up some databases of different types in some of the popular cloud vendors. And these cloud providers give guidelines oriented towards the databases and database services that they support. There’s often case studies which relate to transformations or migrations from Web 2.0 companies, companies like Netflix, who famously have moved all of their infrastructure to AWS years ago.

But in the enterprise space, there’s a different starting point. That starting point is many years, perhaps decades of lots of different heterogeneous technologies. In regards to databases themselves, a variety of different databases and versions over the years. Some that are mainframe-resident, some from the client-server era, older versions of Oracle and Microsoft SQL, IBM DB2, et cetera.

And these databases perform various workloads and may have many application dependencies on them. So, unlike those Web 2.0 companies, most enterprises have to start with a really sober inventory analysis to look at what their applications are. They have to look at that application portfolio, understand the interconnections and dependencies among the systems. In the last 10 to 15 years especially, we see the uptake of new varieties of data stores, particularly NoSQL data stores such as Cassandra or key-value stores or in-memory data grids, streaming systems, and the like.

Note. See here for MemSQL’s very widely read take on NoSQL

Introduction

In companies that have just been started in the last 15, 20 years, you could completely run that business without your own data center. And in that case, your starting point often is a SaaS application for payroll, human resources, et cetera. In addition to new custom apps that you will build, and of course, those will be on some infrastructure or platform as a service (PaaS) provider.

So some of this is intentional, in that large enterprises may want to hedge their bets across different providers. And that's consistent with a traditional IT strategy in the pre-cloud era, where I might have an IBM Unix machine, and then an HP Unix machine, or more recently Red Hat Linux, Windows, and the applications on them.

Cloud Migration Webinar - Enterprise Cloud Strategy

But these days, it’s seen as the new platform where I want that choice is cloud platforms. Other parts of this are unintentional, like I said, with the lines of business, just adopting SaaS applications. And what you see here on the right, in the bar chart is that the hybrid cloud is growing. And to dig into that a little bit, to see just how much hybrid cloud has grown just from the year prior and 2018, it’s quite dramatic in the uptake of hybrid, and that speaks to the challenge that enterprise IT has, in that legacy systems don’t go away overnight.

Cloud Migration Webinar - State of the Cloud
It’s not surprising that cloud spend is the first thing that sort of bites businesses. And it does have an advantage for experimentation with new applications, new go to markets, especially customer facing applications.

Cloud Migration Webinar - Poll 1 Provider

Because it’s so easily scalable, you may not be able to predict how popular the mobile app may be, for instance, or your API, or your real-time visualization dashboard. So putting it in an elastic environment makes sense. But the cost may explode pretty quickly as other applications get there too. 

Cloud Migration Webinar - Database Migrations

And with governance and security, I think those are obvious in that when you're across a multi-cloud environment, you've got to either duplicate or integrate those security domains to ensure that you have the right control over your data and your users. There are regulatory things to be concerned about in terms of the privacy of the data, depending on the business: protection of data in the U.S. and California, or in Europe with the General Data Protection Regulation (GDPR).

We’re now at a point in the adoption of cloud, that it’s not just sort of SaaS applications and ancillary supporting services around them, but it’s also the core data itself, like the databases service, in particular relational databases. And this might be a surprise given the popularity of NoSQL in recent years, you’ll see that NoSQL databases service are growing, but to lesser extent than relational. And what’s happening across relational data warehousing or OLTP, traditional OLAP, and NoSQL databases, is that there’s been a proliferation of all of these different types. But the power of relational still is what is most useful in many applications.

Gartner’s view of this is that just in the next two years that 75% of all databases will be deployed or migrated to a cloud platform. So that’s a lot of growth. That number doesn’t necessarily mean the retirement of existing databases. I think it speaks to the growth of new databases going in the cloud, because launching those new systems is so convenient and so easy, and – for the right kinds of workload – affordable.

So at this point, let’s pause and let’s have a question to the audience. So, who is your primary cloud service provider? You see the popular ones listed there. You may have more than one cloud service provider. But what’s your predominant or standard one is what we’re asking here. And we’ll wait for a few moments while responses come in. 

Okay, this result matches what we’ve seen from other industry reports in terms of the popularity of AWS and then second Azure. Given the time and the market, this isn’t such a surprise. In a year from now, we might see a very different mix with what’s happening with the adoption, uptake of Google and Azure in the different services. So let’s move on.

So what are the challenges of database migrations? Within enterprise IT, the first thing that needs to be done is to understand what that application dependency is. 

And when it comes to a database, you need to understand particularly how the application is using the database. So, just some examples of those dependency points to look for: what are the data types that are going to be used there? Are there varchars, integers? What's the distribution of those stored procedures? 

Although there’s a common language on families of databases, often there are nuances to how what’s available in a stored procedure in terms of processing, so the migration of stored procedures takes some effort. Most traditional SQL databases will provide user-defined functions where a user can extend functions. 

And then there's the query language itself, in terms of the data manipulation language (DML) for queries: create, update, delete, et cetera. And, in terms of the definition of objects in the database, the Data Definition Language (DDL) concerning how tables are created, for instance, and the definition of triggers, stored procedures, and constraints. 
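
As a quick, generic illustration of that distinction (the table and column names here are made up for the example), DDL defines the objects while DML reads and writes the rows inside them:

-- DDL: defines objects such as tables and constraints
CREATE TABLE customer (
  customer_id INT PRIMARY KEY,
  full_name   VARCHAR(100) NOT NULL
);

-- DML: reads and modifies the data held in those objects
INSERT INTO customer (customer_id, full_name) VALUES (1, 'Ada Lovelace');
UPDATE customer SET full_name = 'Ada King' WHERE customer_id = 1;
SELECT full_name FROM customer WHERE customer_id = 1;
DELETE FROM customer WHERE customer_id = 1;

Both categories have to be inventoried during a migration, but they tend to be handled by different tools and at different stages.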

There’s also a hardware dependency to look at for depending on the age of the application, that software might be tied to your particular processor or machine type. And the application itself may only be available on that platform combination. 

In my own experience, I've seen this many times in airlines, where the systems for gate and boarding, the systems for check-in, and the systems for ground operations were written decades ago, provided typically by an industry-specific technology provider, and they suited the business processes of that airline for many years.

But the airline is looking to do more customer-experience interactions and to collect data about the customer's experience from existing touch points, like the check-in touch point, the kiosk, and the mobile app, and they want to enhance this data with operational data. Typically, a lot of these operational data systems, in logistics, freight providers, airlines, and other types of operations and manufacturing, don't lend themselves well to this.

So migrating these applications can be more difficult. Often it's going to be application modernization, where you're just moving off of that platform. Initially, you would integrate with these, and you may store the data that you event out in your target, the new database in the cloud. And finally, there is often a management mismatch of the application. In other words, the configuration of that application and database doesn't quite fit the infrastructure model of the cloud you're migrating to.

The assets aren’t easily divided parametrized and put into your DevOps process and your CI/CD pipeline. Often it’s not easy to containerize. So these are some of the challenges that make it more difficult in enterprise IT context to migrate the applications which of course drag along the databases for these applications.

Charlie Feld, a visionary in the area of IT transformation, has his Twelve Timeless Principles:

  1. No Blind Spot
  2. Outcomes: Business, Architecture, Productivity
  3. Zoom Out
  4. Progressive Elaboration & Decomposition
  5. Systems Thinking
  6. The WHO is Where All the Leverage Is
  7. 30 Game Changers
  8. Functional Excellence is Table Stakes
  9. Think Capabilities
  10. Architecture Matters
  11. Constant Modernization
  12. Beachhead First, Then Accelerate

So let’s talk about the phases of migration. So we’ll go into this more in the second webinar, where we talk about best practices, but I’ll summarize them here. 

Cloud Migration Webinar - Migration Phases

Assessing applications and workloads for cloud readiness allows organizations to determine:

  • What applications and data can – and cannot – be readily moved to a cloud environment 
  • What delivery models (public, private, or hybrid) can be supported
  • Which applications you do not want to move to the cloud 

You’ve got to classify these different workloads. So you can look at them in terms of what’s more amenable to the move? How many concurrent users do I expect? Where are they geographically distributed? Can I replicate data across more easily in the cloud to provide that service or without interrupting that service?

Cloud Migration Webinar - Steps

Do I have new applications and transactions coming online? Perhaps there are new sensors, IoT sensors, whose data I now need to bring to these applications. So you need to categorize these workloads in terms of the data size, the data frequency, and the shape and structure of the data, and look at what kind of compute resources you're going to need, because it's going to be a little bit different. Of course, this will require some testing by workload.

So at this point, I’d like to pause and ask Alicia have another polling question. So what types of workloads have you migrated to the cloud so far? Given the statistics we see from the surveys, most likely, most of you have done some sort of migration or you’re aware of one in your business and what you’ve done. And you might be embarking on new types of applications in terms of streaming IoT.

Cloud Migration Webinar - Poll2 Workloads

So roughly a third have not been involved in a migration so far. And for another third, it's been analytics and reporting. That result on analytics and reporting, I think, is insightful, because when you think about the risks and rewards of migrating workloads, the offline historical reporting infrastructure is the least risky. 

If you have a business scenario where you’re providing weekly operational reports on revenue or customer churn or marketing effectiveness, and those reports don’t get reviewed perhaps until Monday morning, then you can do the weekly reporting generation over the weekend. If it takes two hours or 10 hours to process the data, it’s not such a big deal. Nobody’s going to look at it until Monday.

So there’s a broader array of sort of fallbacks and safety measures. And it’s less time-critical. Those are sort of the easier ones. So 16% of you reported that transactional or operational databases you’re aware of, or you’ve been involved in moving this to the cloud. And that is really what’s happening right now, that we find at MemSQL as well, is that the first wave was this wave of analytical applications, and now recently, you see more of the operational transactions, which is the core part of the business.

Here are criteria to choose the right workloads for data migration: 

  • Complexity of the application 
  • Impact to the business
  • Transactional and application dependencies 
  • Benefits of ending support for legacy applications 
  • Presence or absence of sensitive data content 
  • Likelihood of taking advantage of the cloud’s elasticity

What are the most suitable candidates for cloud migration? Here are a few keys:

  • Applications and databases which already need to be modernized, enhanced, or improved to support new requirements or increased demands
  • Consider apps having highly variable throughput
  • Apps used by a broad base of consumers, where you do not know how many users will connect and when 
  • Apps that require rapid scaling of resources
  • Development, testing and prototyping of application changes

Cloud Migration Webinar - Workloads and Benefits

Q&A and Conclusion

How do I migrate from Oracle to MemSQL?

Well, we’ve done this for several customers. And we have a white paper available online that goes into quite a lot of detail on how to approach that, and have a plan for an Oracle to MemSQL migration. 

What makes MemSQL good for time series?

That’s a whole subject in itself. We’ve got webinars and blog articles available on that. But essentially, I’ll give a few of them here and that MemSQL allows you to first of all ingest that data without blocking for writes; you can do that in parallel often. So if you’re reading from Kafka, for instance, which itself is deployed with multiple brokers and multiple partitions, MemSQL is a distributed database, and you can ingest that time series data in real time and in parallel. So that’s the first point is ingestion.

Secondly, we provide time series-specific functions to query that data that allows it for easy convenience, so it’s not necessary to go to a separate, distinct, unique database. Again, MemSQL is a unified converged database that handles relational, analytical, key-value, document, time series, geospatial all in one place. And so it’s suitable to the new cloud native era, where you’re going to have these different data types and access patterns.
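
As a rough sketch of what that can look like in practice (the broker address, topic, table, and column names below are hypothetical), a pipeline ingests from Kafka in parallel, and time series helper functions such as TIME_BUCKET, FIRST, and LAST can then bucket that data directly in SQL:

-- Hypothetical target table for tick data
CREATE TABLE ticks (
  ts     DATETIME(6) NOT NULL,
  symbol VARCHAR(10) NOT NULL,
  price  DECIMAL(18,4) NOT NULL
);

-- Ingest from a Kafka topic in parallel (broker and topic are placeholders)
CREATE PIPELINE ticks_pipeline AS
  LOAD DATA KAFKA 'kafka-broker:9092/ticks'
  INTO TABLE ticks
  FIELDS TERMINATED BY ',';
START PIPELINE ticks_pipeline;

-- One-minute buckets with opening and closing prices per symbol
SELECT TIME_BUCKET('1m', ts) AS minute,
       symbol,
       FIRST(price, ts) AS open_price,
       LAST(price, ts)  AS close_price
FROM ticks
GROUP BY 1, 2
ORDER BY 1, 2;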

What is the difference between MemSQL and Amazon Aurora?

Yeah, so that question is probably coming because when you’re migrating to a cloud database, typically you’re looking at one of the major cloud providers, AWS or Google Cloud Platform or Microsoft Azure. And each of these providers provides various types of databases. 

Amazon Aurora is a database built on Postgres, and there's also a version for MySQL, or at least compatibility in that way. So it's worth a look. But what you'll find when you're building a high-performance application is that the system architecture of Aurora itself is the biggest Achilles' heel there: it's composed of single-node databases, MySQL or Postgres depending on the edition you've chosen, and it's basically sharding that across multiple instances and providing a sharding middleware layer above that.

And that has inefficiencies. It’s going to utilize more cloud resources. And in the beginning that might – at small volumes, that might not manifest into a problem. But when you’re doing this at scale across many applications, and on a bigger basis, those compute resources really add up in terms of the cost. 

So MemSQL is a much more efficient way, because it was written from the ground up; it's not built out of some other single-node, traditional SQL database like Aurora is. MemSQL is built from the storage layer all the way up to take advantage of current cloud hardware as well as modern hardware, in terms of AVX2 instruction sets and SIMD and, if it's available, non-volatile memory.

Secondly, I’d say that Aurora differs in a major way and that it’s oriented to just the transactions, OLTP type processing. Whereas MemSQL does that, but not just that it also has a rowstore with a columnstore, which is what our traditional analytical database like Amazon Redshift has. So, in a way, you could say that with Amazon, you would need two databases to do what MemSQL can do with a single database.

We invite you to learn more about MemSQL at memsql.com or get started with your trial of MemSQL Helios

Webinar Recap #3 of 3: Best Practices for Migrating Your Database to the Cloud


Feed: MemSQL Blog.
Author: Floyd Smith.

This webinar concludes our three-part series on cloud data migration. In this session, Domenic Ravita breaks down the steps of actually doing the migration, including all the key things you have to do to prepare and guard against problems. Domenic then demonstrates part of the actual data migration process, using MemSQL tools to move data into MemSQL Helios. 

About This Webinar Series

This is the third part of a three-part series. First, we had a session on migration strategies; broad-brush business considerations to think about, beyond just the technical lift and shift strategy or data migration. And the business decisions and business strategy to guide you as to picking what sorts of workloads you will migrate, as well as the different kinds of new application architectures you might take advantage of. And then, last week, we got down to the next layer, talked about ensuring a successful migration to the cloud of apps and databases in general. 

In today’s webinar we’ll talk about more of the actual migration process itself. We’ll go into a little bit of more detail in terms of what to consider with the data definition language, queries, DML, that sort of thing. And then I’ll cover one aspect of that, which is the change data capture (CDC) or replication from Oracle to MemSQL, and show you what that looks like. 

Database Migration Best Practices

I’ll talk about the process itself here in terms of what to look at, the basic steps that you are going to be performing, what are the key considerations in each of those steps. Then we’ll get into more specifics of what a migration to Helios looks like and then I’ll give some customer story examples to wrap up, and we’ll follow this with a Q and A. 

The actual process of cloud data migration, such as from Oracle to MemSQL Helios, requires planning and care.

So the process that we covered in the last session has mainly these key steps, that we’re going to set the goals based on that business strategy in terms of timelines, which applications and databases are going to move, considering what types of deployment you’re going to have, and what’s the target environment. 

We do all this because it’s not just the database that’s being moved, it’s the whole ecosystem around the database. So connections for integration processes, the ETL processes from your operational database to your data warehouse or analytic stores. As well as where you’ve got multiple applications sharing a database, understanding what that is and what the new environment is going to look like.

Whether that’s going to be a similar application use of the database or if you’re going to redefine or refactor or modernize that application, perhaps splitting a monolith application to microservices for instance. Then that’s going to have an effect on what your data management design is going to be. 

So just drilling into step three there, for migration, that’s the part we’ll cover in today’s session. 

Cloud data migration requires the transfer of schema, existing data, and new, incoming data.

Within that, you’re going to look at specific assessment of the workloads and you’re going to look, what sorts of datasets to return, what tables are hit, what’s the frequency, the concurrency of these? This will help in capacity sizing, it’ll also help in understanding which functions of the source database are being used in terms of features and capabilities, such as stored procedures.

And once you do that assessment, and there are automated tools to do this, you’ll look at planning that schema migration of step two. And the schema migration involves the table definitions but also all sorts of other objects in the database that you’ll be prepared to adapt. Some will be one-to-one – it depends on your type of migration – and then the initial load, and then continuing to that replication with a CDC process. 

For a successful cloud data migration, identify the applications that use a database and database table dependencies.

So, let’s take the first one here in terms of the assessment of the workloads, what you want to consider here. And when you think about best practices for this, you want to think about how are applications using the database? Are they using it in a shared manner? And then specifically, what are they using from the database? So for instance, you may need to determine what SQL queries, for instance, are executed by which application so that you can do the sequencing of the migration appropriately.

So finding that dependency, first of which applications use the database, and then more fine-grained, what are the objects, use of stored procedures, etc., and tables. And then finally, what are the specific queries? 

So one strategy or tactic, I would say, that's helpful in understanding that use by the application is to find a way to intercept the SQL queries that are being executed by those applications. So, if this is a Java application, if you wrap the connection object when the Java connection object is being created, and also the object for the dynamic SQL, then you can use this kind of wrapper to collect metrics and information and capture the query itself, so that you have specific data on how applications use the database, which tables, which queries, and how often they're fired. 

And you could do this in other languages as long as you’re… it’s for that client library, whether it’s an ODBC, JDBC, et cetera. This technique helps to build a data set as you assess the workload to get really precise information about what the queries are executed and what objects.

And then secondly, when you have that data, you’ll find that the next thing that you want to do is to look at the table dependencies. So again, if this is an operational database that you’re migrating, then it’s typical that you might have an ETL process that keys off of one-to-many tables to replicate that data into a historical store, a data warehouse, a data mart, etc. And so, understanding what those export and/or ETL processes are, on which tables they depend, is fairly key here. 

These are just two examples of the kinds of things that you want to look at for the workload. And of course with the first one, once you have the queries, and you can see what tables are involved, you can get runtime performance on that, you can have a baseline for what you want to see in the target system, once the application and the database and the supporting system pipelines have been migrated.
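
A complementary, server-side way to gather some of this query information, a sketch that assumes an Oracle source and that you have the privileges to read its dynamic performance views, is to sample the statements Oracle has already parsed and executed:

-- Sample recently executed statements and their execution counts
-- from Oracle's shared SQL area (requires SELECT privilege on V$SQL)
SELECT parsing_schema_name,
       executions,
       elapsed_time,
       sql_text
FROM   v$sql
WHERE  parsing_schema_name NOT IN ('SYS', 'SYSTEM')
ORDER  BY executions DESC;

This does not replace the client-side wrapper described above, since it only sees what is still cached in the shared SQL area, but it is a quick way to build an initial picture of which statements and tables are actually in use.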

Migrating schema is a key part of a successful cloud data migration, and may require creating new code to replace existing code.

So now let’s talk a little bit about schema migration. And there’s a lot involved in schema migration because we’re talking about all the different objects in the database, the tables, integrity constraints, indexes, et cetera. But we could sort of group this into a couple of broad areas, the first being the data definition language (DDL) and getting a mapping from your source database to your target. 

In previous sessions in this series we talked about the migration type or database type, whether it’s homogeneous or heterogeneous. Homogeneous is like for like, you’re migrating from a relational database source to the same version even, of a relational database, just in some other location – in a cloud environment or some other data center. 

That’s fairly straightforward and simple. Often the database itself provides out-of-the box tools for that sort of a migration and replication. When you’re moving from a relational database to another relational database, but from a different vendor, that’s when you’re going to have a more of an impedance mismatch of some of the implementations of DDL, for instance.

You’ll find many of the same constructs because they’re both relational databases. But despite decades of so-called standards, there’s going to be variation… there are going to be some specific things for each vendor database. So for instance, if you’re migrating from Oracle to MemSQL, as far as data types, you’ll find a pretty close match from an Oracle varchar2 to a MemSQL varchar, from a nvarchar2 to MemSQL varchar, from an Oracle FLOAT to MemSQL decimal. 

Those are just some examples, and we have a white paper that describes this in detail and gives you those mappings, such that you can use automated tools to do much of this schema migration as far as the data types and the table definitions, etc. 
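
As a minimal illustration of that kind of mapping (the table and column names are invented for the example, and the target types follow the mappings mentioned above rather than an authoritative conversion guide):

-- Hypothetical Oracle source definition:
--   CREATE TABLE bills (
--     bill_id    NUMBER(10)    NOT NULL PRIMARY KEY,
--     account_no VARCHAR2(64)  NOT NULL,
--     amount     FLOAT,
--     created_at DATE
--   );

-- One possible MemSQL equivalent
CREATE TABLE bills (
  bill_id    BIGINT        NOT NULL,
  account_no VARCHAR(64)   NOT NULL,
  amount     DECIMAL(18,4),
  created_at DATETIME,
  PRIMARY KEY (bill_id)
);

The precision you pick for numeric columns, and whether an Oracle DATE becomes a DATETIME or a DATE, depends on how the application actually uses those columns, which is exactly what the workload assessment is meant to surface.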

After the data types, the next thing that you would be looking at would be queries and the data manipulation language (DML). So, when you look at queries, you’ll be thinking, “Okay, what are the different sorts of query structures? What are the operators in the expression language of my source, and how do they map to the target?”

So, how can I rewrite the queries from the source to be valid in the target? Again, you're going to look at the particular syntax, for instance outer join syntax, or whether I have recursive queries. Again, just using Oracle as an example, there is a fairly clear correspondence of those capabilities from relational data stores like Oracle, PostgreSQL, and MySQL to MemSQL. 

If your source is a MySQL database, you'll find that the client libraries can be used directly with MemSQL, because MemSQL is MySQL wire-protocol compatible. So you can use basically any client driver, from the hundreds that are available across every programming language, with MemSQL, and that simplifies some of your testing in that particular case.

The third thing I’d point out here is that while you may be migrating from a relational database to another relational database, and you may still consider this, or you should consider this, a heterogeneous move, because the architecture of the source databases often, almost always these days, a legacy single-node type of database. Meaning that it’s built on a disk-first architecture, it’s meant to scale vertically, meaning a single machine to get more performance, you scale up, you get a bigger hardware with more CPU processors. 

And when you’re coming to MemSQL, you can run it as a single node, but the power of MemSQL is that it’s a scale-out distributed database, such that you can grow the database to the size of your growing dataset with simply adding nodes to the MemSQL cluster. MemSQL is distributed by default, or distributed native you might say, and that’s what also is one of the attributes that makes it a good cloud-native database with Helios, and that allows us to elastically scale that cluster for Helios, scale up and down, and I’ll come back to that in a moment.

But as part of that, when you think about the mapping, the structure of a source relational to a target like Helios, you’re mapping a single-node database to a distributed one. So there’s a few extra things to consider, like the sharding of data, or some people call this partitioning or distributing the data, across the cluster. 

The benefit of that is that you get resiliency, in the case of node failures you don’t lose data, but you also get to leverage the processing power of multiple machines in parallel. And this helps when you’re doing things like real-time raw data ingestion from Kafka pipelines and other sources like that. 
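
For instance, here's a small sketch with made-up names: the shard key is declared as part of the table definition, and MemSQL then distributes rows across the cluster by hashing that key.

-- Distribute payment rows across the cluster by account_no, so inserts
-- and scans run in parallel across the leaf nodes
CREATE TABLE payments (
  payment_id BIGINT        NOT NULL,
  account_no VARCHAR(64)   NOT NULL,
  paid_at    DATETIME      NOT NULL,
  amount     DECIMAL(18,4),
  SHARD KEY (account_no),
  KEY (paid_at)
);

Choosing a shard key that matches your most common join and filter columns keeps related rows on the same partition and avoids unnecessary data movement between nodes.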

This is described in more detail in our white paper, which I’ll point out in just a moment. So once you’ve got those things mapped, and you may be using an automated tool to do some of that schema mapping, you’ll have to think about the initial load.

To perform the initial data load in your cloud data migration, disable constraints temporarily.

And this, depending on your initial dataset, could take some amount of time, a significant amount of time, just when you consider the size of the source dataset, the network link across which data must move, what’s the bandwidth of that link? 

And so if you’re planning a migration cut-over, like over a weekend time, you’ll want to estimate based on those things. And what’s the initial load going to be, and by when will that initial data load complete, such that you can plan the new transactions start of the replicating of the new data. And also when you’re doing the load, what other sorts of data prep needs to happen in terms of ensuring that the integrity constraints and other things like that are working correctly. I’ll touch a little bit about how we address that through parallelism.

Your cloud data migration needs a process to replicate new data into the cloud, for instance by using CDC.

So finally, once that initial load is done, then you’re looking to see how you can keep up with new transactions that are written to the source database. So you’re replicating, you’ve got a snapshot for the initial load, and now you’re replicating from a point in time, doing what’s called change data capture (CDC). 

As the data is written you want the minimal latency possible to move and replicate – copy – that data to the target system. And there are various tools on the market to do this. Generally you want this to have certain capabilities such as, you should expect some sort of failure. And so you need sort of some checkpoint in here so you don’t have to start from the very beginning. 

Again, this could be tens, hundreds of terabytes in size if this is an operational database, or an analytic database, it’s going to have more data if it’s been used over time. Or, if it’s multiple databases, each may be a small amount of data, but together you have got a lot in process at the same time. So you want to have your replication such that it can be done in parallel and you have checkpointing to restart from the point of failure rather than the very beginning. 

Your cloud data migration should include data validation and repair processes.

And then finally, data validation and repair. With these different objects in a source and a target, there's room for error here, and you've got to have a way to automatically validate and run tests against the data; you want to think about automating that. As much as possible in testing, and when doing your initial load, you want to validate data there before starting to replicate; and as data is replicating, you're going to have a series of ongoing validations to ensure that nothing is mismatched and your logic is not incorrect.

Let’s go to our first polling question. You’re attending this webinar probably because you’ve got cloud database migration on your mind. Tell us when you are planning a database migration. Coming up in the next few weeks, in which case you would have done a lot of this planning and testing already. Or maybe in the next few months, and you’re trying to find the right target database. Or maybe it’s later in this year or next year, and maybe you’re still in the migration strategy, business planning effort. 

Most of our webinar attendees for cloud data migration are not currently planning a database migration process.

Okay. Most of you are not in the planning phase yet, so you’re looking to maybe see what’s possible. You might be looking to see what target databases are available, and what you might be able to do there. We hope you take a look at what you might do with Helios in the Cloud. 

We’ll talk about moving workloads to Helios. Helios is MemSQL’s database as a service. Helios is, at its essence, the same code base is as MemSQL, self-managed, as we call it, but it’s provided as a service such that you don’t have to do any of your own work on the infrastructure management. It takes all the non-differentiated heavy lifting away, such that you can focus just on the definition of your data.

Like MemSQL self-managed, MemSQL Helios provides a way to run multiple workloads, analytics together with transactions, simultaneously on the same database or on multiple databases in a Helios cluster. You can run Helios on the major cloud providers, AWS and GCP, with Azure coming soon, in multiple regions.

For a successful cloud data migration, assess the database type of the source and the target database.

Looking at some of the key considerations in moving to Helios … I mentioned before this identification of the database type of the source and the target. There are multiple permutations of what to consider, like-for-like with homogeneous migrations and so on. Although it may be a relational-to-relational migration, as in the examples I just provided with, say, Oracle and MemSQL, there are still some details to be aware of.

The white paper we provide gives a lot of that guidance on the mapping. There are things that you can take advantage of in Helios that are just not available or not as accessible in Oracle. Again, that’s things like the combination of large, transactional workloads simultaneously with the analytical workloads.

Next thing is the application architecture. I mentioned this earlier. Are you moving? Is your application architecture going to stay the same? Most likely it’s going to change in some way, because when these migrations are done for a business, typically they’re making selections in the application portfolio for new replacement apps, SaaS applications often, to replace on-prem applications.

A product life cycle management, a PLM system on prem, often is not carried on into the cloud environment. Some SaaS cloud provider is used, but you still have the integrations that need to be done. There could be analytical databases that need to pull from that PLM system, but now they’re going to be in the cloud environment.

Look at what the selections are in the application portfolio, or the application rationalization, as many people may think about it, and what that means for the database. Then, for any particular app, if it's going to be refactored from a monolith to microservices-based, what does that mean for the database?

Our view in terms of MemSQL for use in microservices architectures is that you can have a level of independence of the service from the database, yet keep the infrastructure as simple as possible. We live in an era where it's really convenient to spin up lots of different databases really easily, but even when they're in the cloud, those are more pieces of infrastructure whose life cycle you now have to manage.

As much as possible you should try to minimize the amount of cloud infrastructure that you have to manage. Not just in the number of instances of database, but also the variety of types. Our view of purpose-built databases and microservices is that you can have the best of purpose-built, which is support for different data structures and data access methods, such as having a document store, and geospatial data, full-text search with relational, with transactions, analytics, all living together without having to have the complexity of your application to communicate with different types of databases, different instances, to get that work done.

Previously in the industry, and part of the reason why purpose-built databases caught on, is that they provided a flexibility to start simply, such as document database, and then grow and expand quickly. Now we, as an industry, have gone to the extreme of that where there’s an explosion of different types of sources. To get a handle on that complexity, we’ve got to simplify and bring that back in.

MemSQL provides all of those functions I just described in a single service, and Helios does that as well. You can still choose to segment by different instances of databases in the Helios cluster, yet you have the same database type, and you can handle these different types of workloads. For a microservices-based architecture, it's giving you the best of both worlds: the best of the purpose-built, polyglot-persistence, NoSQL sorts of capabilities and scale-out, but with the benefits of robust ANSI SQL and relational joins.

Finally, the third point here is optimizing the migration. As I said, with huge datasets, the business needs continuity during that cutover time. You’ve got to maintain service availability during the cutover. The data needs to be consistent, and the time itself needs to be minimized on that cutover.

Advantages of migrating your cloud data into MemSQL Helios include scalability, predictable costs, reliability, and less operations work.

Let me give a run through some of the advantages of moving to Helios. As I said, it’s a fully managed cloud database as a service, and, as you would expect, you can elastically scale up a MemSQL cluster and scale it down.

Scaling down is also maybe even perhaps the more important thing, because if you have a cyclical or seasonal type of business like retail, then there’ll be a peak towards the end of the year, typically Thanksgiving, Christmas, holiday seasons. That infrastructure, you’ll want to be able to match to the demand without having to have full peak load provisioned for the whole year. Of course cloud computing, this is one of the major benefits of it. But, your database has to be able to take advantage of that.

Helios does that through, again, its distributed nature. If you’re interested in how this works exactly, go to the MemSQL YouTube channel. You’ll see quick tutorials on how to spin up a Helios cluster and resize it. The example there shows growing the cluster. Then once that’s done, it rebalances the data, but you can also size that cluster back down.

As I mentioned, it eliminates a lot of infrastructure and operations management. It gives you some predictability in costs. With Helios, without going into the full pricing of Helios, basically our pricing is structured around units or nodes. Those nodes, or resources, are described by computing resources in terms of how many CPUs, how much RAM. Eight virtual CPUs, 64 gigabytes of RAM. It’s based on your data growth and your data usage patterns. That’s the only thing you need to be concerned about in terms of cost. That makes doing financial forecasts for applications a lot simpler.

Again, since Helios is provided on multiple cloud providers like AWS, GCP, and soon Azure, in multiple regions, you can co-locate, or have network proximity between, your chosen Helios cluster and your target application environment, such that you can minimize any costs in terms of data ingress and egress.

When you bring data into Helios, you just get the one cost, so the Helios unit cost. From your own application, your cloud-hosted application or your datacenter-hosted application that’s bringing data into Amazon, or Azure, or GCP, you may incur some costs from those providers, but from us, it’s very simple. It’s just the per-unit, based on the node.

Helios is reliable out of the box in that it's a high availability (HA) deployment, such that if any one node fails, you're not losing data. Data gets replicated to another leaf node. Leaf nodes in Helios are data nodes that store data. On every leaf node there are one-to-many partitions, so you're guaranteed to have a copy of that data on another machine. Most of this would be fairly under the covers for you. You should not be experiencing any sort of slowdown in your queries, provided that your data is distributed.

Next, freedom without sprawl. What I’m talking about is, Helios allows you to, as I said earlier, combine multiple types of workloads, to do mixed workloads of transactions and analytics, and different types of data structures like a document. If you’re creating a product catalog and you’re querying that, or you have orders structured as documents, with Helios as well as MemSQL, you can store these documents, such as the order, in the raw JSON format, directly in MemSQL. We have an index into that such that you can query and make JSON queries part of your normal application logic.

In that way, MemSQL can act as a document or key-value store in the same way that MongoDB or AWS DocumentDB or other types of document databases do. But, we’re more than that, in that you’re not just limited to that one kind of use case. You can add relational queries. A typical use case here is storing the raw JSON but then selecting particular parts of the nested array to put into relational or table columns, because those can be queried as a columnstore in MemSQL. That has the advantage of compression.
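
A small sketch of that pattern (the table, document shape, and column names are hypothetical): keep the raw document in a JSON column, and promote the fields you query most often into persisted computed columns that can be indexed and scanned like ordinary columns.

CREATE TABLE orders (
  order_id  BIGINT NOT NULL,
  order_doc JSON   NOT NULL,
  -- promote frequently queried fields out of the raw document
  order_total    AS JSON_EXTRACT_DOUBLE(order_doc, 'total') PERSISTED DECIMAL(18,2),
  customer_email AS JSON_EXTRACT_STRING(order_doc, 'customer', 'email') PERSISTED VARCHAR(255),
  SHARD KEY (order_id)
);

-- Query the promoted columns alongside (or instead of) the raw JSON
SELECT order_id, order_total, customer_email
FROM   orders
WHERE  order_total > 100;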

There’s a lot of advantages in doing these together, relational with document, for instance, or relational with full-text search. Again, you can have this freedom of the different workloads, but without the sprawl of having to set up a document database and then separately a relational database to handle the same use case.

Then, finally, I would say a major advantage of Helios is that it provides a career path for existing DBAs of legacy single-node databases. There’s a lot of similarity in the basic fundamental aspects of a database management system, but what’s different is that, with MemSQL, you get a lot of what was previously confined to the NoSQL types of data stores, key-value stores, and document stores, for instance. But, you’d get those capabilities in a distributed nature right in MemSQL. It’s, in some ways, the ideal path from a traditional single-node relational database career and experience into a cloud-native operational and distributed database like MemSQL.

The demo shows how to take a snapshot of an Oracle database and migrate the data to the cloud, in the form of MemSQL Helios.

So what I would like to do at this point is show you a simple demonstration of how this works. I’d refer you to our whitepaper for more details about what I discussed earlier from migrating Oracle to Helios, and you’ll find it there by navigating from our homepage to resources, Whitepapers and Oracle and MemSQL migration. So with that, let me switch screens for just a moment.

And what I’m going to do. So I’ve got a Telnet session here or SSH session into a machine running an Amazon where I’ve got an Oracle database, and I’m going to run those two steps I just described. Basically the initial data load, and then I’m going to start the replication process. Once that replication process, with MemSQL Replication, is running, then I’ll start inserting data, new data into the source of Oracle database. And you’re going to see that written to my target. And I’ll show a dashboard to make this easy to visualize and the data that’s being written. 

So the data here is billing data for a utility billing system. I’ve got new bills and payments and clearance notifications that come through that source database. I’ll show you the schema in just a moment. So what I’ll do is I’ll start my initial snapshot. I’ve got one more procedure to run here.

Okay. So that’s complete and now I’ll start my application process. And so from my source system, Oracle, we’re writing to MemSQL Helios. And you see it’s written 100,000 rows to the BILLS table, 300,000 to CCB and 100,000 to PAYMENTS. 

So now let’s take a look at our view there, and we can take a look at the database. It’s writing to a database called UTILITY. And if I refresh this, I’ll see that I will have some data here in those three tables… it gave me the count there, but I can quickly count the rows, see what I got there.

MemSQL Studio shows progress in a demo of cloud data migration to MemSQL Helios.

So I also have a dashboard, which I'll show here, and that confirms that we're going against the same database that I just showed you the query for. So at this point I've got my snapshot of the initial data for the bills, payments, and clearance notices. 

Looker also shows progress in a demo of cloud data migration to MemSQL Helios.

So what I’ll do now is start another process that’s going to write data into this source Oracle database. And we’ll see how quickly this happens. Again, I’m running from a machine image in Amazon US East. I’ve got my Helios cluster also on Amazon US East. And so let’s run this to insert into these three tables here. 

And as that runs, you’ll see MemSQL Replicate, which you’re seeing here, it’s giving us how many bytes per second are being written, how many rows are being inserted into each of these tables, and what’s our total run time in terms of the elapsed and the initial load time for that first snapshot data. So here you’ll see my dashboard’s refreshing. You start to see this data being written here into Helios.

MemSQL Studio displays the results of the demo of cloud data migration to MemSQL Helios.

What we can do is use MemSQL Studio to view the data as it's being written. So let's first take a look at the dashboard, and you can see we're writing roughly anywhere from 4,000 to 10,000 rows per second against the database, which is a fairly small rate. We can get rates much higher than that, in the tens of thousands or hundreds of thousands of rows written per second, depending on the row size; if the rows are small, it can sometimes be millions of rows in certain situations.

And let’s take a look at the schema here. And you’ll see that this data is increasing in size. As I write that data and MemSQL gives you this studio view such that you can see diagnostics on the process on the table as it’s happening, as data’s being written. Now you may notice that these three tables are columnstore tables. Columnstores are used for analytic queries and they have really superior performance for analytic queries and they compress the data quite a lot. And our column stores use a combination of memory and disk. 

After some period of time, this data in memory will be persisted, written to disk, but even while it's in memory, you're guaranteed durability and resilience, because Helios provides high availability by default; you can see that redundancy in the partitions of the database in this view.

Case Studies

I’ll close out with a few example case studies. First… Well I think we’re running a little bit short on time, so I’m going to go directly to a few of these case studies here. 

Helios was first launched back in fall of last year and since then we’ve had several migrations. It’s been one of the fastest launches in terms of uptake of new customers that we’ve seen in the company’s history. 

This is a migration from an existing MemSQL self-managed environment for a company called SSIMWAVE who provides video compression and acceleration for all sorts of online businesses, and their use case is around interactive analytics, ad hoc queries. And they want to be able to look at what are the analytics around how to optimally scale their video serving and their video compression.

We have a case study of the move by SSIMWAVE to MemSQL Helios.

And so they are a real-time business, and they need operational analytics on this data. Just to draw an analogy: if you're watching Netflix and you have a jitter or a pause, or you're on Prime Video and you have a pause, for any of these online services it's an immediately customer-impacting, customer-facing kind of scenario. And so this is a great example of a business that depends on Helios and the cloud to provide this reliability, to deliver analytics for a customer-facing application; what we call analytics within an SLA. They've been on Helios now for several months, and you can see the quote here on why they moved and the advantage of the time savings with Helios.

A second example is Medaxion. Initially they moved to MemSQL from a MySQL instance, and then over to Helios. Their business is providing information to anesthesiologists, and for them, again, it's a customer-facing scenario for operational analytics. 

We also have a case study of the move by Medaxion to MemSQL Helios.

They’ve got to provide instantaneous analysis through Looker dashboards and ad hoc queries against this data. And Helios is able to perform in this environment for an online SAS application essentially where every second counts in terms of looking at what’s the status of the records that Medaxion handles. 

We also have a case study of the move by Thorn to MemSQL Helios.

And then finally, I’ll close with this Thorn. They are a nonprofit that focuses on helping law enforcement agencies around the world identify trafficked children faster. 

And if there’s any example that that shows that time criticality and the importance of it in the moment operational analytics, I think this is it, because most of this data that law enforcement needs exists in various silos or various systems or different agencies systems and what Thorn does is to unify and bring all of this together but do it a convenient searchable way.

So they’re taking the raw sources among which are posts online, which they feed through machine learning process to then land that processed data into Helios such that their Spotlight application can allow instant in the moment searches by law enforcement to identify based on image recognition and matching if a child has been involved… is in a dangerous situation and correlating these different law enforcement records. 

So those are three great examples of Helios in real-time operational analytics scenarios that we thought we’d share with you. And with that I’ll close, and we’ll move to questions.

Q&A and Conclusion

How do I learn more about migration with MemSQL?

On our resources page you'll find a couple of whitepapers on migration, one about Oracle specifically and one more generally about migrating. Just navigate to the home page, go to Resources, Whitepapers, and you'll find them. Also, there is a webinar we did last year or before, the five reasons to switch – you can catch that recording. Of course you can also contact us directly, and we'll provide an email address here to share with you.

Where do I find more about MemSQL Replicate?

So that’s part of 7.0, so you’ll find all of our product documentation is online and MemSQL Replicate is part of the core product, so if you go to docs.memsql.com, then you’ll find it there under the references.

Is there a charge for moving my data into Helios?

There’s no ingress charge that you incur using self-managed MemSQL or MemSQL Helios. Our pricing for Helios is purely based on the unit cost as we call it. And the unit again is the computing resources for a node and it’s just the leaf node, it’s just the data node. So eight vCPUs, 64GB of RAM, that is a Helios leaf node. All of that is just a unit. That’s the only charge. 

But you may incur data charges, depending on where your source is, from your application or other system, for the data leaving, or egressing from, that environment if it's a cloud environment. So not from us per se, but you may from your cloud provider.

An Overview of DDL Algorithms in MySQL (covers MySQL 8)


Feed: Planet MySQL.
Author: MyDBOPS.

Database schema changes are becoming more frequent than before; four out of five application updates (releases) require a corresponding database change. For a DBA, schema changes are often a repetitive task: a request from the application team to add or modify columns in a table, and many more cases.

MySQL has supported online DDL since 5.6, and the latest MySQL 8.0 supports instant column addition.

This blog post will look at the built-in online DDL algorithms that can be used to perform schema changes in MySQL.

The DDL algorithms supported by InnoDB are:

  • COPY
  • INPLACE
  • INSTANT ( from 8.0 versions)

INPLACE Algorithm:

The INPLACE algorithm performs operations in place on the original table and avoids the table copy and rebuild whenever possible.

If the INPLACE algorithm is specified with the ALGORITHM clause but the ALTER TABLE operation does not support it, the statement exits with an error suggesting an algorithm that can be used.

mysql> alter table sbtest1 add column h int(11) default null,algorithm=inplace;
Query OK, 0 rows affected (14.12 sec)
Records: 0  Duplicates: 0  Warnings: 0

The INPLACE algorithm depends on two important settings when it performs the table operation.

  • It writes sort files to the temporary directory (/tmp by default); if that directory is not large enough, a different location can be set explicitly with the innodb_tmpdir system variable.
  • It also uses a temporary log file to track data changes made by DML (INSERT, UPDATE, DELETE) executed against the table during the DDL operation. The maximum size of this log is controlled by the dynamic innodb_online_alter_log_max_size system variable (default 128MB).

The incoming writes during the alter are buffered in this log, up to the size defined by innodb_online_alter_log_max_size, and are applied at the end of the DDL operation while the table is locked for a few seconds, depending on the write rate.

If the incoming writes overflow the size defined by innodb_online_alter_log_max_size, the DDL operation fails and is rolled back.

Example:

mysql> alter table sbtest.sbtest5 add column l varchar(100),algorithm=inplace;
ERROR 1799 (HY000): Creating index 'PRIMARY' required more than 'innodb_online_alter_log_max_size' bytes of modification log. Please try again.
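
If you hit this error, one workaround is to raise the online alter log limit before retrying the ALTER. A minimal sketch, assuming a 512MB limit is acceptable on your server (the value itself is illustrative):

-- Check the current limit for the online alter log (default 128MB).
SELECT @@global.innodb_online_alter_log_max_size;

-- The variable is dynamic, so no restart is needed.
SET GLOBAL innodb_online_alter_log_max_size = 536870912; -- 512MB

Running the change during a low-write window achieves the same goal without touching the setting.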

Below is the internal flow of the INPLACE algorithm when an operation requires a table rebuild.

[Diagram: internal flow of the INPLACE algorithm]

The table below lists operations that use online DDL with the INPLACE algorithm and require a table rebuild.

[Table: online DDL operations that use the INPLACE algorithm with a table rebuild]

Note: InnoDB needs extra disk space to perform the operations listed above – roughly equal to the size of the original table in the datadir, or more in some cases.

Drawbacks of ‘INPLACE’ algorithm

  • Long-running online DDL operations can cause replication lag on slaves. An online DDL operation must finish running on the master before it runs on the slave, and DML that was processed concurrently on the master is only processed on the slave after the DDL operation on the slave completes.
  • A larger innodb_online_alter_log_max_size extends the period at the end of the DDL operation during which the table is locked to apply the buffered changes.
  • It can cause high I/O usage for larger tables on highly concurrent servers (aggressive in terms of resource consumption).

COPY Algorithm

The COPY algorithm alters the schema by creating a new temporary table with the altered definition; once the data has been migrated completely to the temporary table, it is swapped in and the old table is dropped.

Example:

mysql> alter table sbtest1 modify column h varchar(20) not null,algorithm=inplace;
ERROR 1846 (0A000): ALGORITHM=INPLACE is not supported. Reason: Cannot change column type INPLACE. Try ALGORITHM=COPY.

When the INPLACE algorithm is not supported, MySQL throws an error and suggests using the COPY algorithm.

mysql> alter table sbtest1 modify column h varchar(20) not null,algorithm=COPY;
Query OK, 1024578 rows affected (17.95 sec)
Records: 1024578  Duplicates: 0 Warnings: 0

ALTER TABLE with ALGORITHM=COPY is an expensive operation because it blocks concurrent DML (INSERT, UPDATE, DELETE), but it allows concurrent reads (SELECTs) when LOCK=SHARED is used.

If the lock mode LOCK=EXCLUSIVE is used, both reads and writes are blocked until the ALTER completes.
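
If you want the read availability to be explicit rather than implied, you can state the lock mode in the statement itself. A minimal sketch based on the earlier sbtest1 example (behavior as described in the MySQL documentation, not taken from the original post):

alter table sbtest1 modify column h varchar(20) not null, algorithm=copy, lock=shared;
-- Reads keep working during the copy; writes are still blocked.
-- Requesting lock=none together with algorithm=copy is rejected, because COPY needs at least a shared lock.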

Below is an internal flow of the COPY algorithm when it creates a copy of a table.

[Diagram: internal flow of the COPY algorithm]

Drawbacks of COPY Algorithm

  • There is no mechanism to pause a DDL operation or to throttle I/O or CPU usage during the operation.
  • Rollback of operation can be an expensive process.
  • Concurrent DML (INSERT, UPDATE, DELETE) is blocked for the duration of the ALTER TABLE
  • Causes replication lag

INSTANT Algorithm

As a further improvement to online DDL (for column addition), MySQL 8.0 introduces the INSTANT algorithm (a patch from Tencent). This feature makes column addition an instant, in-place change and allows concurrent DML, improving responsiveness and availability in busy production environments.

If ALGORITHM is not specified, the server first tries the INSTANT algorithm for column additions; if that cannot be used, it tries the INPLACE algorithm; and if that is not supported either, it finally falls back to the COPY algorithm.

The INSTANT algorithm performs only metadata changes in the data dictionary. It does not need to touch the data file of the table, so the schema change completes almost immediately.

Example:

mysql> alter table city add pincode int(11) default null, algorithm=INSTANT;
Query OK, 0 rows affected (0.04 sec)
Records: 0  Duplicates: 0  Warnings: 0

The INSTANT algorithm supports only a few operations, which are listed below.

[Table: operations supported by the INSTANT algorithm]

MySQL 8 has two new views, I_S.innodb_tables and I_S.innodb_columns, which expose the instant-column metadata.

Example :

Table structure before adding a column using INSTANT.

mysql> show create table sbtest.sbtest7\G
*************************** 1. row ***************************
       Table: sbtest7
Create Table: CREATE TABLE `sbtest7` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `k` int(11) NOT NULL DEFAULT '0',
  `c` char(120) NOT NULL DEFAULT '',
  `pad` char(60) NOT NULL DEFAULT '',
  `f` varchar(200) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `k_7` (`k`)
) ENGINE=InnoDB AUTO_INCREMENT=1000001 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
1 row in set (0.00 sec)

When we query the innodb_tables view, the instant_cols value is zero.

mysql> SELECT table_id, name, instant_cols FROM information_schema.innodb_tables where name like '%sbtest7%';
+----------+----------------+--------------+
| table_id | name           | instant_cols |
+----------+----------------+--------------+
|     1258 | sbtest/sbtest7 |            0 |
+----------+----------------+--------------+
1 row in set (0.00 sec)

Let's add a column using the INSTANT algorithm:

mysql> alter table sbtest7 add g varchar(100) not null default 'Mysql 8 is great', algorithm=instant;
Query OK, 0 rows affected (0.02 sec)
Records: 0  Duplicates: 0  Warnings: 0

After adding the column, instant_cols becomes 5.

mysql> SELECT table_id, name, instant_cols FROM information_schema.innodb_tables where name like '%sbtest7%';
+----------+----------------+--------------+
| table_id | name           | instant_cols |
+----------+----------------+--------------+
|     1258 | sbtest/sbtest7 |            5 |
+----------+----------------+--------------+
1 row in set (0.00 sec)

This means that instant_cols keeps track of the number of columns present in table sbtest7 before the first instant column addition.

And the default values of columns that are added by the INSTANT algorithm are stored in the I_S.innodb_columns table.

mysql> SELECT table_id, name, has_default, default_value FROM information_schema.innodb_columns WHERE table_id = 1258;
+----------+------+-------------+----------------------------------+
| table_id | name | has_default | default_value                    |
+----------+------+-------------+----------------------------------+
|     1258 | id   | 0           | NULL                             |
|     1258 | k    | 0           | NULL                             |
|     1258 | c    | 0           | NULL                             |
|     1258 | pad  | 0           | NULL                             |
|     1258 | f    | 0           | NULL                             |
|     1258 | g    | 1           | 4d7973716c2038206973206772656174 |
+----------+------+-------------+----------------------------------+
6 rows in set (0.23 sec)

Column g has has_default set to 1, and its default_value is stored in a hex-encoded binary format.
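
To see what that stored value represents, you can decode the hex string yourself (an illustrative check, not part of the original post):

SELECT CONVERT(UNHEX('4d7973716c2038206973206772656174') USING utf8mb4) AS decoded_default;
-- returns 'Mysql 8 is great', the default supplied in the ALTER above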

Drawbacks of INSTANT algorithm

  • A column can only be added as the last column of the table; adding a column at any other position among existing columns is not supported (see the sketch after this list).
  • Columns cannot be added to tables that use  ROW_FORMAT=COMPRESSED.
  • Columns cannot be added to tables that include a FULLTEXT index.
  • Columns cannot be added to temporary tables. Temporary tables only support ALGORITHM=COPY.
  • Columns cannot be added to tables that reside in the data dictionary tablespace(shared tablespace).
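
To illustrate the first limitation above, here is a sketch against the same sbtest7 table (the column name m is made up, and the behavior described applies to the 8.0 releases available when this post was written; the exact error text varies by version):

-- Rejected: INSTANT cannot place the new column among existing ones.
alter table sbtest7 add column m int after k, algorithm=instant;

-- Accepted: the column is appended as the last column.
alter table sbtest7 add column m int, algorithm=instant;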

Comparison of INPLACE, COPY AND INSTANT Algorithms.

To wrap up the blog, let us compare the time taken by all three algorithms on a table with 1M records.

INPLACE – 7.09 sec

mysql> alter table sbtest7 add g varchar(100) not null default 0, algorithm=inplace;
Query OK, 0 rows affected (7.09 sec)
Records: 0  Duplicates: 0  Warnings: 0

COPY – 14.34 sec

mysql> alter table sbtest7 add g varchar(100) not null default 0, algorithm=copy;
Query OK, 1000000 rows affected (14.34 sec)
Records: 1000000  Duplicates: 0 Warnings: 0

INSTANT – 0.03 sec

mysql> alter table sbtest7 add g varchar(100) not null default 0, algorithm=instant;
Query OK, 0 rows affected (0.03 sec)
Records: 0  Duplicates: 0  Warnings: 0

[Chart: execution-time comparison of the three algorithms]

As we can see, with INSTANT the column was added almost instantly (just 0.03 sec) compared to the other two algorithms.

Key Takeaways:

  • INSTANT is the right algorithm to choose when the MySQL version is 8.0 or later and the type of ALTER you are trying to perform supports it.
  • Using the COPY algorithm on larger tables, especially with many asynchronous slaves, is an expensive operation.
  • As alternatives, tools like pt-online-schema-change or gh-ost can be used for schema changes; they provide more options for throttling resource usage and more control.

I hope this blog gives you insight into online DDL and helps you choose the right algorithm for performing schema changes.

Featured image by Glenn Carstens-Peters on Unsplash

How to Connect Teradata Vantage to Azure Blob Storage to Query JSON Files

$
0
0

Feed: Teradata Blog.

Disclaimer: This guide includes content from both Microsoft and Teradata product documentation.
 

Teradata Vantage provides the Native Object Store (NOS) feature to query JSON files on Microsoft Azure Blob Storage, where you can keep vast amounts of semi-structured or unstructured data. If you already have JSON files on your blob storage account, you can skip to the Connect Teradata Vantage using Native Object Store to Azure Blob Storage section.

There are many ways to create and process JSON data. In this getting started guide, we will use a Raspberry Pi online simulator to send JSON strings to an Azure IoT Hub device, route the messages to an Azure Event Hub, use an Azure Stream Analytics job to process them and send them to our Azure Blob Storage container as JSON files. Finally, we will use the NOS feature in Vantage to connect to and query the JSON files.

This is a diagram of the workflow.

You are expected to be familiar with Azure services and Teradata Vantage with Native Object Store (NOS).

You will need the following accounts, objects, and systems. Links have been included with setup instructions.

We need an Azure Blob Storage account to store our JSON files.

1. Create a storage account. Provide your Subscription, Resource group, Storage account name and Location. Leave the remaining fields set to their default values. Click Review+create to validate and click Create.

We suggest that you use the same location for all services you create. This will avoid confusing errors for this example.

2. Once the deployment is completed, click on Go to resource to create a container for your JSON files.

3. Click on Containers icon and then click on +Container and provide a new container name. Leave Public access level as “Private” for this example.
4. Click on Access keys and copy Storage account name and key1, which we will use later in the Connect Teradata Vantage Native Object Store to Azure Blob Storage – Create Authorization Object section.

We need to create an Event Hub namespace and an event hub endpoint for the IoT Hub device to send messages to.

1. Using the portal, create an event hubs namespace and event hub. Provide a Namespace name, choose Standard tier, create a new or use an existing Resource Group and Location. Leave the remaining fields set to their default values. Click Create. You may have to wait a few minutes for the system to fully provision the resources.

Open the Resource Group to see your Event Hub namespace.
We do not cover the Throughput Units and Enable Auto-Inflate properties in this guide. Please see the Azure documentation for details.

Next, select your Event Hub namespace and create an event hub by clicking +Event Hub. Provide an Event Hub name, leave the remaining fields set to their default values, and click Create.
The Event Hub capture feature supports only the Avro format for Azure Blob Storage, so this guide uses Stream Analytics to move the data in JSON format.

After the event hub is created, you should see it in the list of event hubs below.

We want to move Event Hub messages to Azure Blob Storage, and we have chosen to process the data from our event hub using Stream Analytics.

1. Click on the created event hub (for example, eventhub1) and click on Process data.
2. Click on Explore to display the Query dialog. If there are messages in your Event Hub, you can process your messages by clicking Create.

3. Click Deploy query to create a Stream Analytic job to move Event Hub messages to Azure Blob Storage.
4. Enter a Job name in the New Stream Analytics job pane and leave the remaining fields set to their default values.

5. Click Create.

6. Click on the Outputs UI and click +Add to define Blob Storage as an output.
7. Provide the Output alias, Storage account and container information in the Blob Storage/Data Lake Storage Gen2 pane. The Path pattern (for example, /{date}/{time}) is optional.

8. Ensure that Event serialization format is set to JSON. Leave remaining fields set to their default values and click Save.

You can find more information on Path pattern property at Azure Stream Analytic custom blob output partitioning.
 
9. Exit the Output pane by clicking in the upper right corner.
 
10. Edit and save the query with the new Output alias (for example, output1).
 
11. Exit the Query pane by clicking in the upper right corner.
 
12. Click Start job with the Now option.

We need to create an IoT Hub instance, register an IoT Hub device, and set up a route/custom endpoint to get messages from our Raspberry Pi online simulator to the Azure IoT Hub and finally to the Event Hub.

1. In the portal, create an IoT Hub, register a new device and define message routing.

2. Click on Create a resource at the top left side and search for IoT Hub. Click Create.

3. In the Basics tab, create a new or use an existing Resource Group and provide an IoT Hub Name, then choose the B1: Basic tier in the Size and Scale tab. Leave the remaining fields set to their default values and click Review+create to validate. Click Create.

For more information, see choosing the right IoT Hub tier. Configuring the Number of IoT Hub units property, and performance considerations more generally, are not covered here.

The online simulator requires a device identity in the registry to connect to a hub.

4. Click Go to resource or in your IoT hub pane, open IoT Devices, then select +New to add a device in your IoT hub. Provide a Device ID name and click Save.

5. Click Refresh in the IoT devices dialog to display your device and click on Device. Copy the Primary Connection String which we will use later to connect our Raspberry Pi online simulator to our IoT hub device (for example, rasppi).

We need to configure a route and an endpoint for the IoT Hub device to send messages to the Event Hub.

6. In your IoT Hub pane, click Message routing under Messaging to define a route and custom endpoint.

Routes is the first tab on the Message Routing pane.

7. Click +Add to add a new route. You see the following screen. Enter a Name for your route and choose an endpoint. For the endpoint, you can select one from the dropdown list, or add a new one. In this example, click on +Add endpoint and choose Event Hub.
8. In the Add an event hub endpoint pane, provide an Endpoint name, an existing Event hub namespace, and an Event hub instance. Click Create.
9. Click Save to save routing rule. You should see the new routing rule displayed.
10. In the Custom endpoints tab, click Refresh to see your custom endpoint rule displayed under Event Hubs. Ensure that the Status is Healthy.

We will use the Raspberry Pi online simulator as our source of JSON strings. Data will be sent to the registered IoT Hub device that we created in the previous section.

This is an example of a JSON string that the simulator will generate. (Line wrapping was added for clarity).

{"messageId":2,
"deviceId":"Raspberry Pi Web Client",
"temperature":29.288325949023434,
"humidity":77.5147906}

We need to add the IoT hub device Primary Connection String for the online simulator to connect.

1. Click on START RASPBERRY PI SIMULATOR.

2. Edit line 15 and replace '[Your IoT hub device connection string]' with your Primary Connection String.

3. Run the online simulator for a few minutes. You should see JSON files in your blob storage container.

4. Click Stop to stop the simulator.

If desired, you can view your blob storage data.

A. Logon to Azure portal and click Resource groups.

B. Find and click your resource group with storage account.

C. Click container name.
D. Click on the JSON file and Edit to view data.
Alternatively, you can use Azure Storage Explorer, although configuring the Azure Storage Explorer is not discussed in this guide.

Native Object Store is a new capability included with Teradata Vantage 17.0 that makes it easy to explore data sets located on external object stores, like Azure Blob Storage, using standard SQL.

Once configured, an object on Azure Blob will look just like a table in Vantage.

Detailed information can be found in the Native Object Store – Teradata Vantage Advanced SQL Engine 17.0 (Orange Book).

Vantage needs an authorization object to gain access to an Azure Blob Storage account. Authorization objects provide more security than relying on Azure Blob Storage authorization alone.

The CREATE AUTHORIZATION DDL statement requires the storage account name to be specified in the USER field and the Azure access key we saved earlier in the PASSWORD field.

You must create the authorization object in the same database as the foreign table that will reference it.

1. Create an authorization object using syntax similar to the following.

CREATE AUTHORIZATION DefAuth_AZ
AS DEFINER TRUSTED
USER 'mystorageacctrs' /* storage account name */
PASSWORD '********';  /* storage account key */

If you delete and recreate your storage account with the same name, the access keys will change. Therefore, you must drop and recreate your authorization objects.

A foreign table allows external data to be easily referenced within the Vantage Advanced SQL Engine and makes the data available in a structured, relational format. Below is an example of the syntax to create a simple foreign table.

The storage LOCATION information includes three parts, separated by slashes: the AZ prefix, the storage account name including the “blob.core.windows.net” suffix, and the container name.

If external security authentication is used, then include the EXTERNAL SECURITY DEFINER TRUSTED DefAuth_AZ clause in the DDL statement.

2. Create a foreign table using syntax similar to the following.
 

CREATE MULTISET FOREIGN TABLE json ,FALLBACK ,EXTERNAL SECURITY DEFINER TRUSTED DefAuth_AZ,
     MAP = TD_MAP1
     (
      Location VARCHAR(2048) CHARACTER SET UNICODE CASESPECIFIC,
      Payload JSON(8388096) INLINE LENGTH 32000 CHARACTER SET UNICODE)
USING
(
      LOCATION  ('/AZ/mystorageacctrs.blob.core.windows.net/json')
      MANIFEST  ('FALSE')
      PATHPATTERN  ('$Var1/$Var2/$Var3/$Var4/$Var5/$Var6/$Var7/$Var8/$Var9/$Var10/$Var11/$Var12/$Var13/$Var14/$Var15/$Var16/$Var17/$Var18/$Var19/$Var20')
      ROWFORMAT  ('{"record_delimiter":"\n", "character_set":"UTF8"}')
      STOREDAS  ('TEXTFILE')
)
NO PRIMARY INDEX ;

Now that the foreign table is created, you can access the object on Azure Blob Storage.

JSON is essentially a list of keys and values. You can obtain a list of the keys (attributes) with the JSON_KEYS table operator, using the following statement.
 

SELECT DISTINCT * FROM JSON_KEYS (ON (SELECT payload FROM json)) AS j

You can obtain a list of the actual values by selecting a small number of rows without any filtering.
 

SELECT Payload.* FROM json

You can obtain the name-value pairs, including the fields and values, by omitting the asterisk from the Payload keyword.

SELECT Payload FROM json

It is best practice to create a view on top of a table to provide a layer between users and tools and the underlying table.

Below is an example of a view on the JSON foreign table.

CREATE VIEW json_perm_cols_v AS (
SELECT
       CAST(payload.messageId AS INT) MessageID,
       CAST(payload.deviceId AS VARCHAR(25)) DeviceID,
       CAST(payload.temperature AS FLOAT) Temperature,
       CAST(payload.humidity AS FLOAT) Humidity
FROM json
);

At this point, the view may be used by users and tools.
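
As a simple illustration (not part of the original guide), the view can now be queried like any relational table, for example to average the simulator readings per device:

SELECT DeviceID,
       AVG(Temperature) AS AvgTemperature,
       AVG(Humidity) AS AvgHumidity
FROM json_perm_cols_v
GROUP BY DeviceID;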

Native Object Store does not automatically make a persistent copy of the external object that it reads through the foreign table.

You can copy the data in the external object into a table in Vantage. Below are a few simple COPY options.

The CREATE TABLE AS…WITH DATA statement creates a No Primary Index (NoPI) relational table as the target table by default. If you do not take further action, this NoPI table will have just the two columns: Location and Payload.

CREATE TABLE json_perm AS (select * from json) WITH DATA

It is better to create a column for each payload attribute. The example below does this.
 

CREATE TABLE json_perm_cols AS
(SELECT
       CAST(payload.messageId AS INT) MessageID,
       CAST(payload.deviceId AS VARCHAR(25)) DeviceID,
       CAST(payload.temperature AS FLOAT) Temperature,
       CAST(payload.humidity AS FLOAT) Humidity
       FROM json
       )
WITH DATA
NO PRIMARY INDEX

Another option is using an INSERT…SELECT statement. This approach does require that the permanent table be created beforehand. The example below first creates the table and then performs the insert.
 

CREATE SET TABLE json_perm_empty ,FALLBACK ,
     NO BEFORE JOURNAL,
     NO AFTER JOURNAL,
     CHECKSUM = DEFAULT,
     DEFAULT MERGEBLOCKRATIO,
     MAP = TD_MAP1
     (
      MessageId INTEGER,
      DeviceId VARCHAR(25) CHARACTER SET LATIN NOT CASESPECIFIC,
      Temperature FLOAT,
      Humidity FLOAT)
PRIMARY INDEX ( MessageId );

 

INSERT INTO json_perm_empty
SELECT
    payload.messageId,
    payload.deviceId,
    payload.temperature,
    payload.humidity
FROM json
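
A quick way to confirm the load (an illustrative check, not part of the original guide):

SELECT COUNT(*) AS loaded_rows FROM json_perm_empty;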

 

 

Kirk Roybal: PostgreSQL is the worlds’ best database

$
0
0

Feed: Planet PostgreSQL.

The title is not clickbait or hyperbole. I intend to show that, by virtue of both design and implementation, PostgreSQL is objectively and measurably a better database than anything currently available, with or without money considerations.

How in the world can I claim and justify such a lofty statement? Read on, gentle nerd. I promise that your time will not be wasted.

Transparent Security

PostgreSQL has a security mailing list. The PostgreSQL project learns about the intrusion vectors at the same time that everybody else does. Nothing is hidden, and anything that is found is worked on at a rate that makes the commercial vendors’ heads spin. Don’t be fooled by shorter defect lists published by the same vendor that provides the software under scrutiny.

This means that all known attack vectors are handled as soon as they are made public. This kind of security responsiveness is not even contemplatable in the commercial market. For commercial vendors, secrecy until the problem can be addressed is vital to the remediation. PostgreSQL gets no such relief, and that’s fantastic for you.

Multi-version Concurrency Control is good for you

PostgreSQL picks a method of concurrency control that works best for high INSERT and SELECT workloads.

It is very easy to design for PostgreSQL, keeping in mind the tracking overhead limitations for UPDATE and DELETE. Mostly, if you respect your data, you should learn to love the data security that PostgreSQL affords you.

DDL participates in transactions in PostgreSQL. Migrations work all the way or not at all (the worst kind of not working is almost working). Testing harnesses are dead easy to build. Need to reset the testing harness? Just ROLLBACK.
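
A minimal sketch of what transactional DDL buys you (the table is invented for illustration):

BEGIN;
CREATE TABLE migration_test (id bigint PRIMARY KEY, note text);
ALTER TABLE migration_test ADD COLUMN created_at timestamptz DEFAULT now();
-- Something went wrong? Undo the whole migration, DDL included.
ROLLBACK;
-- migration_test no longer exists; the schema is exactly as it was before BEGIN.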

PostgreSQL supports standards-compliant forms of transaction isolation, including serializable, read committed, and repeatable read. These levels provide complete ACID compliance.

PostgreSQL does everything

So, you want NoSQL, Riak, REACT, Redis, Mongo, etc.? PostgreSQL does all that. Admittedly not with all the bells and whistles of all of the original products. For example, PostgreSQL doesn’t create new shards for you for any of those. That’s still a manual process. But then again, there’s always pg_partman. . .

You want a column data store? How about hstore? You don’t want to retrain your staff? Plug in the language of your choice and keep trucking. You want partial replication? Streaming Logical replication is for you.
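
For example, a quick hstore sketch (assuming the extension is installed; the key/value pairs are made up):

CREATE EXTENSION IF NOT EXISTS hstore;
SELECT 'color => blue, size => L'::hstore -> 'size' AS size_value;
-- returns 'L'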

I would have a hard time thinking of a feature that I want that PostgreSQL doesn’t have, or that there isn’t a well known extension to provide.

You want to extract data from other systems? PostgreSQL has the most vibrant collection of federation objects of any database. They call them foreign data wrappers, and you can hook PostgreSQL to an alligator with duct tape and zip ties. Treat anything like it’s your data.

Hook it to a map

The PostGIS community may arguably be bigger than the PostgreSQL community itself. The mapping capabilities of PostgreSQL put it in a class by itself, even compared to very expensive alternatives.

The PostGIS project picked PostgreSQL as a platform because of the ease of extensibility and the extensive data enrichment capabilities. These capabilities are directly exposed for any other project to take advantage of. They are also unanswered by any other vendor, commercial or open source.

Ultimately, you can hook it to anything.

PostgreSQL is growing and leading the way in open source

The PostgreSQL project is one of the most highly visible organizations in open source software. With a huge community and growing at an astronomical rate, any deficiencies that it has now will arguably be defeated in a time frame that other vendors can only dream about.

Additional enterprise quality features are announced literally every day, and the staff to maintain those features is self-selected from a pool of geniuses that every company is hoping to hire, and there just aren’t enough to go around.

PostgreSQL builds solutions that are stable forever

PostgreSQL has logical replication built in to the core. This allows for cross-version migrations. Read that again. You’re not locked in to a specific hardware or software version. The solution can be upgraded indefinitely.

Also, PostgreSQL is supported on many platforms, including the super-stable versions of Linux. Do you need a solution that outlives the typical 3~5 year ROI? PostgreSQL will last you forever, even if you never upgrade the hardware at all. And the fees for that are easy to calculate. $0.

Declarative is better than imperative

Database languages are generally declarative. That is, you write a query in the built in language of your choice, describing the results that you would like to see. The database tries to decode your intentions, and provide the appropriate results. This is the basis of all declarative programming models. In PostgreSQL, this comes down to a mapping of functions to keywords in the SQL language, sometimes with several algorithmic choices for exactly how to implement each declaration.

In the age-old argument about the imperative vs. the declarative programming models, it occurs to me that declarative programming is just imperative programming in a thin disguise. Each declarative token in a database query language ultimately maps to one or several algorithms which apply the declaration in imperative terms. Thus, the impedance mismatch defined by Henrietta is ultimately in the mind of the developer. That is, if the developer would think exactly like the database function programmer thinks, then there would be no mismatch.

So how could a declarative model ultimately be better than an imperative model, given that one is just a calling feature of the other? Glad you asked, because that brings me to my point.

The PostgreSQL developers are smarter than you. I don’t mean that to be facetious or coy. Literally thousands of contributors have made millions of contributions to the PostgreSQL project, many of them as improvements to the contributions of others. The chances that whatever you thought up on the top of your head is better than what has already been implemented is very low. And, even if your thoughts were better, you should contribute them to the PostgreSQL project for the benefit of all, thus raising the bar for everyone else.

So, what makes PostgreSQL so wonderful then? Worldwide mindshare without corporate considerations. Thousands of developers are working hundreds of thousands of hours to make better algorithmic choices. So your software gets better every release, most usually without having to do anything in particular on your part.

Isn’t that the nature of software in general, you say? Well, yes. But not to anywhere near the extent that it is when the entire world is involved in your project. PostgreSQL enjoys a very prominent place in the open source community. Commercial vendors will never be able to keep up with the rate of change that an open source project can provide at this level. The migrations to open source (and particularly PostgreSQL) are here to prove it.

The features keep rolling in. There are very few things left that commercial vendors can point to as a distinct advantage. Among those things are SMP support, bi-directional replication and external tools. Guess what the community is working on now, and will very likely release in the next few years?

Extend PostgreSQL any way you like

PostgreSQL has a vibrant community of authors that write ancillary software. This includes plugging in any language that you like, and using it to extend PostgreSQL in any way that seems helpful. Do you happen to like perl string handling? Ok, then use that. How about Python map support? Sure, just plug in python and go to town. Want to write web services using a PostgreSQL back end? That’s awesome, and PostgreSQL will help. JSON? Ok. XML? You bet. PostgreSQL has direct support for all of that and infinitely more. If you can think of a language that does it well, then plug that in to PostgreSQL and you can have it on the server side.

You can create your own functions, data types, operators, aggregates, window functions or pretty much anything else. Don’t see a feature you like? Plagiarize and customize it from the source code. You’re free to do that because of the license.
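
As a small taste, a server-side function takes one statement to define (the function and its purpose are invented for illustration):

CREATE OR REPLACE FUNCTION slugify(title text)
RETURNS text
LANGUAGE sql
IMMUTABLE
AS $$
    SELECT lower(regexp_replace(title, '[^a-zA-Z0-9]+', '-', 'g'));
$$;

SELECT slugify('PostgreSQL is the best database');
-- returns 'postgresql-is-the-best-database'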

PostgreSQL also provides some hook functions that allow you to extend the database without having to go to programming extremes.

This ability to assimilate any feature of any other language is unique to PostgreSQL. You can provide any feature using any standardized library in existence. You can follow the standards, keep up with changes, still update PostgreSQL while it’s alive, and you can do it all for free.

Go big and go wide

PostgreSQL has several features to make the most of the hardware platform that it's been given: partitioning, parallel execution, partial indexes, tablespaces, caching, and parallel non-blocking maintenance routines (almost everything in PostgreSQL is sprouting the CONCURRENTLY keyword lately).
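
A sketch that combines two of those features (the table and column names are assumptions):

-- Build a partial index without blocking writes to the table.
CREATE INDEX CONCURRENTLY idx_orders_open
    ON orders (created_at)
    WHERE status <> 'done';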

When that isn’t enough for you, then physical streaming replication will make a bunch of read slaves for you on the cheap. Sharding, memcache, queueing, load balancing and connection pooling all work with PostgreSQL. Still not enough? How about logical streaming replication? You want to geoshard the database all over the world, you say? Well, welcome to bi-directional replication.

And the price tag is still at $0.

Index all that

PostgreSQL supports such a huge list of indexes that it boggles the mind to figure out how to use them all. GiST, SP-GiST, KNN GiST, GIN, BRIN, and B-tree are all available. And there are more to be had through the extension system, like Bloom filters and others.

PostgreSQL can use these with function driven indexes, partial indexes, covering indexes and full text search. And these extended features are not mutually exclusive. You can use them all at the same time.

Roll it in, Roll it out

Several of the technologies already mentioned make PostgreSQL a fantastic data integration and distribution platform. Multiple forms of replication, combined with multiple forms of federation provide both push and pull technologies for nearly any kind of data system.

These can be combined in infinite configurations to bridge database storage solutions. All that without requiring any ETL/ELT processing package. PostgreSQL just does it. The fastest single source of truth database on the planet does it by not moving the data out of the source system at all. This means that the data is always current, and the response times can be managed.

If you can’t stand the unreliability of the source system or would like a bit better performance on the query side, you can also still cache the data periodically with materialized views, which can be updated while still being queried.
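
A minimal sketch of that caching pattern (all object names are invented):

CREATE MATERIALIZED VIEW daily_sales AS
    SELECT sale_date, sum(amount) AS total
    FROM remote_sales   -- e.g. a foreign table pointing at another system
    GROUP BY sale_date;

-- A unique index is required for concurrent refreshes.
CREATE UNIQUE INDEX ON daily_sales (sale_date);

-- Refresh periodically while readers keep querying the existing contents.
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_sales;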

The license is wide open

PostgreSQL has its own license that is largely based on the BSD license. This allows for even greater freedom of use and distribution.

The license applies to all of the code of the main project, major contribution extensions, client libraries, connection managers, and most of the associated tools.

It is highly permissive, written in plain English, and not available for purchase.

Fantastic documentation

The PostgreSQL project requires that any developer submitting code will provide documentation for the proposal. This proposal is used to create the documentation for the feature that is made available in several formats. This documentation is also used in the evaluation of the feature itself, and as a reference to develop future features.

All together, this means that PostgreSQL lives on documentation. There are many developers for PostgreSQL that have learned to code in C, how databases work, and how projects are managed by working with the PostgreSQL project. This documentation is second to none.

Test driven development

PostgreSQL is extensively tested. No, that's not saying it strongly enough. PostgreSQL is exhaustively tested. Every bug is met with a test to verify its existence, and code is written to satisfy the test. New features are written by the creation of tests (and documentation) first, then coded until the feature appears.

These tests are integrated into the build farm for regression, so bugs don’t (re)appear in future versions of PostgreSQL. That means that every test (that is still current) is run for every version of PostgreSQL for every build cycle. That’s a lot of testing, and it ensures that PostgreSQL remains the most stable database that is available.

PostgreSQL is only released when ALL of the regression tests pass. This provides for “0 known bug” releases.

Internationalization and Localization

The developers of PostgreSQL come from all over the world. They have been working in many native languages since the inception of PostgreSQL as a college graduate project. Internationalization and localization have been baked into PostgreSQL as a standard practice, not a bolt on as PostgreSQL began to attract a commercial global market.

While PostgreSQL delegates some of the internationalization to the operating system for compatibility reasons, much of the translation is embedded into the system, providing for a seamless language transition experience.

Cloud operation

PostgreSQL works in cloud architectures using ansible, kubernetes, and proprietary tools from multiple cloud vendors. There are several native cloud implementations to choose from to match your architecture.

If you want to treat servers like cattle, not pets, PostgreSQL has you covered in the cloud also.

Standards Compliance

PostgreSQL has been standards focused for the lifetime of the project. Since PostgreSQL originated in a University graduate program, it has been used as a reference implementation for many SQL standards.

PostgreSQL implements SQL/MED and ANSI SQL.

According to the fantastic documentation, “PostgreSQL supports most of the major features of SQL:2016. Out of 179 mandatory features required for full Core conformance, PostgreSQL conforms to at least 160.” This is more than almost any other database engine.

Language Features

PostgreSQL implements common table expressions (CTE), language control structures (if, for, case, etc.), structured error handling, and all of the goodies you would expect from a mature procedural language.

Are you convinced yet?

I could still talk about the fantastic community of user groups, IRC channels, web sites with solutions, blog articles and mentors. I could wax philosophical about the way that the database is cross-platform, cross-architecture, and cross-culture. There are hours and hours of presentations, videos and lectures.

Or you could just go download it, and see if it is bigger than your imagination. I think you’ll be very pleasantly surprised.


Help Drive the Future of Percona XtraDB Cluster

$
0
0

Feed: Planet MySQL
;
Author: MySQL Performance Blog
;

Percona is happy to announce the experimental release of Percona XtraDB Cluster 8.0. This is a major step in tuning Percona XtraDB Cluster to be more cloud- and user-friendly. This is the second experimental release, combining the updated and feature-rich Galera 4 with substantial improvements made by our development team.

Improvements and New Features in Percona XtraDB Cluster

Galera 4, included in Percona XtraDB Cluster 8.0, has many new features. Here is a list of the most essential improvements:

  • Streaming replication to support large transactions
  • The synchronization functions allow action coordination (wsrep_last_seen_gtid, wsrep_last_written_gtid, wsrep_sync_wait_upto_gtid)
  • More granular and improved error logging. wsrep_debug is now a multi-valued variable to assist in controlling the logging, and logging messages have been significantly improved.
  • Some DML and DDL errors on a replicating node can either be ignored or suppressed. Use the wsrep_ignore_apply_errors variable to configure.
  • Multiple system tables help you find out more about the state of the cluster.
  • The wsrep infrastructure of Galera 4 is more robust than that of Galera 3. It features a faster execution of code with better state handling, improved predictability, and error handling.

Percona XtraDB Cluster 8.0 has been reworked in order to improve security and reliability as well as to provide more information about your cluster:

  • There is no need to create a backup user or maintain the credentials in plain text (a security flaw). An internal SST user is created, with a random password for making a backup, and this user is discarded immediately once the backup is done.
  • Percona XtraDB Cluster 8.0 now automatically launches the upgrade as needed (even for minor releases). This avoids manual intervention and simplifies the operation in the cloud.
  • SST (State Snapshot Transfer) rolls back or fixes an unwanted action. It is no longer just a copy-only block, but a smart operation that makes the best use of the copy phase.
  • Additional visibility statistics were introduced in order to obtain more information about Galera internal objects. This enables easy tracking of the state of execution and flow control.

Installation

You can install this release by using the percona-release tool. Since this is an experimental release and not a GA release, you need to enable the experimental repository and then install Percona XtraDB Cluster 8.0 using your package management system, such as apt or yum. Note that this release is not ready for use in any production environment.

If you use the apt package manager, use the following steps:

If you use the yum package manager, install Percona XtraDB Cluster 8.0 as follows:

Percona XtraDB Cluster 8.0 is based on the following:

and also includes:

Please be aware that this release will not be supported in the future, and as such, neither the upgrade to this release nor the downgrade from higher versions is supported.

Known Issues

  • PXC-2994: 8.0 clone plugin doesn’t work for cloning remote data
  • PXC-2999: DML commands are not replicating to PXC-8.0 nodes through MySQL shell
  • PXC-3016: 8.0 set persist binlog_format=’STATEMENT’ should not succeed
  • PXC-3020: Cluster goes down if a node is restarted during load
  • PXC-3027: Global wsrep memory leaks
  • PXC-3028: Memory leaks inside wsrep_st
  • PXC-3039: No useful error messages if an SSL-disabled node tries to join SSL-enabled cluster
  • PXC-3047: hung if cannot connect to members PXC 8.0

Help us improve our software quality by reporting any bugs you encounter using our bug tracking system. As always, thank you for your input and for your continued support of Percona!

New Book: MySQL 8 Query Performance Tuning

$
0
0

Feed: Planet MySQL
;
Author: Jesper Krogh
;

I have over the last few years been fortunate to have two books published through Apress: Pro MySQL NDB Cluster, which I wrote together with Mikiya Okuno, and MySQL Connector/Python Revealed. With the release of MySQL 8 around a year ago, I started to think about how many changes there have been in the last few MySQL versions. Since MySQL 5.6 was released as GA in early 2013, some of the major features related to performance tuning include the Performance Schema (which was greatly changed in 5.6), histograms, EXPLAIN ANALYZE, hash joins, and visual explain. Some of these are even unique to MySQL 8.

So, I was thinking that it could be interesting to write a book that focuses on performance tuning in MySQL 8. In order to limit the scope somewhat (which, as you can see from the page count, I was not too successful with), I decided to mainly look at the topics related to query performance. I proposed this to my acquisition editor Jonathan Gennick, and he was very interested. Oracle, whom I worked for at the time, was also interested (thanks Adam Dixon, Victoria Reznichenko, Edwin DeSouza, and Rich Mason for supporting me and approving the project). Also thanks to the Apress editors and staff who have been involved, including but not limited to Jonathan Gennick, Jill Balzano, Laura Berendson, and Arockia Rajan Dhurai.

Now, around a year later, the final result is ready: MySQL 8 Query Performance Tuning. If you are interested, you can read more about the content and/or buy it at Apress, Amazon, and other book shops:

Book Structure

The book is divided into six parts with a total of 27 chapters. I have attempted to keep each chapter relatively self-contained with the aim that you can use the book as a reference book. The drawback of this choice is that there is some duplication of information from time to time. An example is Chapter 18 which describes the more theoretical side of locks and how to monitor locks, and Chapter 22 which provides practical examples of investigating lock contention. Chapter 22 naturally draws on the information in Chapter 18, so some of the information is repeated. This was a deliberate choice, and I hope it helps you reduce the amount of page flipping to find the information you need.

The six parts progressively move you through the topics starting with some basic background and finishing with more solution-oriented tasks. The first part starts out discussing the methodology, benchmarks, and test data. The second part focuses on the sources of information such as the Performance Schema. The third part covers the tools such as MySQL Shell used in this book. The fourth part provides the theoretical background used in the last two parts. The fifth part focuses on analyzing queries, transactions, and locks. Finally, the sixth part discusses how to improve performance through the configuration, query optimization, replication, and caching. There are cases where some content is a little out of place, like all replication information is contained in a single chapter.

Chapters

Part I: Getting Started

Part I introduces you to the concepts of MySQL query performance tuning. This includes some high-level considerations, of which some are not unique to MySQL (but are of course discussed in the context of MySQL). The four chapters are

  • Chapter 1 – MySQL Performance Tuning
    This introductory chapter covers some high-level concepts of MySQL performance tuning such as the importance of considering the whole stack and the lifecycle of a query.
  • Chapter 2 – Query Tuning Methodology
    It is important to work in an effective way to solve performance problems. This chapter introduces a methodology to work effectively and emphasizes the importance of working proactively rather than doing firefighting.
  • Chapter 3 – Benchmarking with Sysbench
    It is often necessary to use benchmarks to determine the effect of a change. This chapter introduces benchmarking in general and specifically discusses the Sysbench tool including how to create your own custom benchmarks.
  • Chapter 4 – Test Data
    The book mostly uses a few standard test databases which are introduced in this chapter.

Part II: Sources of Information

MySQL exposes information about the performance through a few sources. The Performance Schema, the sys schema, the Information Schema, and the SHOW statement are introduced in each their chapter. There are only relatively few examples of using these sources in this part; however, these four sources of information are used extensively in the remainder of the book. If you are not already familiar with them, you are strongly encouraged to read this part. Additionally, the slow query log is covered. The five chapters are

  • Chapter 5 – The Performance Schema
    The main source of performance related information in MySQL is – as the name suggests – the Performance Schema. This chapter introduces the terminology, the main concepts, the organization, and the configuration.
  • Chapter 6 – The sys Schema
    The sys schema provides reports through predefined views and utilities in stored functions and programs. This chapter provides an overview of what features are available.
  • Chapter 7 – The Information Schema
    If you need metadata about the MySQL and the databases, the Information Schema is the place to look. It also includes important information for performance tuning such as information about indexes, index statistics, and histograms. This chapter provides an overview of the views available in the Information Schema.
  • Chapter 8 – SHOW Statements
    The SHOW statements are the oldest way to obtain information ranging from which queries are executing to schema information. This chapter relates the SHOW statements to the Information Schema and Performance Schema and covers in somewhat more detail the SHOW statements without counterparts in the two schemas.
  • Chapter 9 – The Slow Query Log
    The traditional way to find slow queries is to log them to the slow query log. This chapter covers how to configure the slow query log, how to read the log events, and how to aggregate the events with the mysqldumpslow utility.

Part III: Tools

MySQL provides several tools that are useful when performing the daily work as well as specialized tasks. This part covers three tools ranging from monitoring to simple query execution. This book uses Oracle’s dedicated MySQL monitoring solution (requires commercial subscription but is also available as a trial) as an example of monitoring. Even if you are using other monitoring solutions, you are encouraged to study the examples as there will be a large overlap. These three tools are also used extensively in the remainder of the book. The three chapters in this part are

  • Chapter 10 – MySQL Enterprise Monitor
    Monitoring is one of the most important aspects of maintaining a stable and well-performing database. This chapter introduces MySQL Enterprise Monitor (MEM) and shows how you can install the trial and helps you navigate and use the graphical user interface (GUI).
  • Chapter 11 – MySQL Workbench
    MySQL provides a graphical user interface through the MySQL Workbench product. This chapter shows how you can install and use it. In this book, MySQL Workbench is particularly important for its ability to create diagrams – known as Visual Explain – representing the query execution plans.
  • Chapter 12 – MySQL Shell
    One of the newest tools from Oracle for MySQL is MySQL Shell, a second-generation command-line client with support for executing code in SQL, Python, and JavaScript. This chapter gets you up to speed with MySQL Shell and teaches you about its support for using external code modules, its reporting infrastructure, and how to create custom modules, reports, and plugins.

Part IV: Schema Considerations and the Query Optimizer

In Part IV, there is a change of pace, and the focus moves to the topics more directly related to performance tuning starting with topics related to the schema, the query optimizer, and locks. The six chapters are

  • Chapter 13 – Data Types
    In relational databases, each column has a data type. This data type defines which values can be stored, which rules apply when comparing two values, how the data is stored, and more. This chapter covers the data types available in MySQL and gives guidance on how to decide which data types to use.
  • Chapter 14 – Indexes
    An index is used to locate data, and a good indexing strategy can greatly improve the performance of your queries. This chapter covers the index concepts, considerations about indexes, index types, index features, and more. It also includes a discussion on how InnoDB uses indexes and how to come up with an indexing strategy.
  • Chapter 15 – Index Statistics
    When the optimizer needs to determine how useful an index is and how many rows match a condition on an indexed value, it needs information on the data in the index. This information is index statistics. This chapter covers how index statistics work in MySQL, how to configure them, monitoring, and updating the index statistics.
  • Chapter 16 – Histograms
    If you want the optimizer to know how frequent a value occurs for a given column, you need to create a histogram. This is a new feature in MySQL 8, and this chapter covers how histograms can be used, their internals, and how to query the histogram metadata and statistics.
  • Chapter 17 – The Query Optimizer
    When you execute a query, it is the query optimizer that determines how to execute it. This chapter covers the tasks performed by the optimizer, join algorithms, join optimizations, configuration of the optimizer, and resource groups.
  • Chapter 18 – Locking Theory and Monitoring
    One of the problems that can cause the most frustration is lock contention. The first part of this chapter explains why locks are needed, lock access levels, and lock types (granularities). The second part of the chapter goes into what happens when a lock cannot be obtained, how to reduce lock contention, and where to find information about locks.

Part V: Query Analysis

With the information from Part IV, you are now ready to start analyzing queries. This includes finding the queries for further analysis and then analyzing the query using EXPLAIN or the Performance Schema. You also need to consider how transactions work and investigate lock contention when you have two or more queries fighting for the same locks. The four chapters are

  • Chapter 19 – Finding Candidate Queries for Optimization
    Whether part of the daily maintenance or during an emergency, you need to find the queries that you need to analyze and potentially optimize. This chapter shows how you can use the Performance Schema, the sys schema, MySQL Workbench, your monitoring solution, and the slow query log to find the queries that are worth looking into.
  • Chapter 20 – Analyzing Queries
    Once you have a candidate query, you need to analyze why it is slow or impacts the system too much. The main tool is the EXPLAIN statement which provides information about the query plan chosen by the optimizer. How to generate and read – including examples – the query plans using EXPLAIN is the main focus of the chapter. You can also use the optimizer trace to get more information on how the optimizer arrived at the selected query plan. An alternative way to analyze queries is to use the Performance Schema and sys schema to break queries down into smaller parts.
  • Chapter 21 – Transactions
    InnoDB executes everything as a transaction, and transactions is an important concept. Proper use of transactions ensures atomicity, consistency, and isolation. However, transactions can also be the cause of severe performance and lock problems. This chapter discusses how transactions can become a problem and how to analyze them.
  • Chapter 22 – Diagnosing Lock Contention
    This chapter goes through four scenarios with lock contention (flush locks, metadata locks, record-level locks, and deadlocks) and discusses the symptoms, the cause, how to set up the scenario, the investigation, the solution, and how to prevent problems.

Part VI: Improving Queries

You have found your problem queries and analyzed them and their transactions to understand why they are underperforming. But how do you improve the queries? This part goes through the most important configuration options not covered elsewhere, how to change the query plan, schema changes and bulk loading, replication, and caching as means to improve performance. The five chapters are

  • Chapter 23 – Configuration
    MySQL requires resources when executing a query. This chapter covers the best practices for configuring these resources and the most important configuration options that are not covered in other discussions. There is also an overview of the data lifecycle in InnoDB as background for the discussion of configuring InnoDB.
  • Chapter 24 – Change the Query Plan
    While the optimizer usually does a good job at finding the optimal query execution plan, you will from time to time have to help it on its way. It may be that you end up with full table scans because no indexes exist or the existing indexes cannot be used. You may also wish to improve the index usage, or you may need to rewrite complex conditions or entire queries. This chapter covers these scenarios as well as shows how you can use the SKIP LOCKED clause to implement a queue system.
  • Chapter 25 – DDL and Bulk Data Load
    When you perform schema changes or load large data sets into the system, you ask MySQL to perform a large amount of work. This chapter discusses how you can improve the performance of such tasks including using the parallel data load feature of MySQL Shell. There is also a section on general data load considerations which also applies to data modifications in general and shows the difference between sequential and random order inserts. That discussion is followed by considerations on what this means for the choice of primary key.
  • Chapter 26 – Replication
    The ability to replicate between instances is a popular feature in MySQL. From a performance point of view, replication has two sides: you need to ensure replication performs well, and you can use replication to improve performance. This chapter discusses both sides of the coin including covering the Performance Schema tables that can be used to monitor replication.
  • Chapter 27 – Caching
    One way to improve the performance of queries is to not execute them at all, or at least avoid executing part of the query. This chapter discusses how you can use caching tables to reduce the complexity of queries and how you can use Memcached, the MySQL InnoDB Memcached plugin, and ProxySQL to avoid executing the queries altogether.

I hope you will enjoy the book.

Top 6 FAQs on Transitioning to the Cloud with Distributed SQL (Corporate Blog)


For decades organizations have relied on monolithic databases to run core business operations, but these traditional relational database management systems weren’t designed to support the requirements of modern application architectures.

Modern tools and technologies allow developers to design and build web-scale microservices and container-based applications, and non-scalable centralised databases just don’t work in this type of architecture. In fact, they become a real limitation, particularly for mission-critical applications moving to distributed or cloud environments. Recently I presented a webinar on how enterprises can move beyond these limitations with a distributed SQL database, and I got some great questions, so I wanted to take some time to share them and the answers. 

Q1: Since storage managers (SMs) and transaction engines (TEs) are separate, can they be scaled independently? Do they have to maintain a specific ratio?

You're looking for redundancy in a distributed model with no single point of failure. So, ideally you'd want at least two storage managers and two transaction engines to provide redundancy. You can scale out as many transaction engines as you need to support throughput, and you can have as many storage managers as you like to provide the level of redundancy your application needs. If you're running only one TE and one SM, there is no built-in redundancy, so you need more than one of each to ensure there is no single point of failure. The minimum configuration for resiliency is two of each; however, there's no specific ratio you need to maintain.

Q2: How do you guarantee ACID properties while maintaining performance?

Just to clarify, ACID stands for atomicity, consistency, isolation, and durability. How do we maintain those ACID guarantees while also maintaining performance? That's a deep subject to get into, and I'll be more than happy to share more about it. Because NuoDB is a distributed database, the same data can exist in more than one place in the cluster at the same time: a piece of data can be on disk in an SM, in memory, or in more than one TE. NuoDB is also a multiversion concurrency control database, and the way we store data allows multiple versions of a piece of data to exist at any particular time.

So we understand that elastic scale with transactional consistency is critical for any application migrating from a single-instance database to a distributed architecture. NuoDB builds on logical ordering and Multi-Version Concurrency Control (MVCC). Both mechanisms mediate update conflicts via a chosen leader and allow NuoDB to scale while maintaining a high level of performance.
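As a minimal, generic illustration of what MVCC means in practice (plain SQL against a hypothetical accounts table, not NuoDB-specific syntax or guarantees), a reader keeps seeing the row version that was current when its transaction started, even while another session commits a newer version:

-- Session A
START TRANSACTION;
SELECT balance FROM accounts WHERE id = 1;   -- reads the committed version, say 100

-- Session B, running concurrently
UPDATE accounts SET balance = 50 WHERE id = 1;
COMMIT;                                      -- a new row version is created

-- Session A, still inside its transaction
SELECT balance FROM accounts WHERE id = 1;   -- under snapshot isolation, still sees 100
COMMIT;                                      -- the reader was never blocked by the writer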

There's an excellent description of the NuoDB internal mechanisms for managing consistency and transactional isolation in the article by my colleague Martin Kysel: Quick Dive into NuoDB Architecture. Martin also wrote an informative piece about the ACID properties of transactional DDL.

Q3: Do you have a concept of virtual databases that is decoupled from the transaction layer, or is the database dedicated to the entire transaction and storage layer, or can you have separation between the layers?

NuoDB provides a number of ways to support multiple schemas or multiple databases that enable dedicated transactional and storage components. I’ve included the diagrams below to represent the various models of multi-tenancy (or co-existence) that can be supported within a single domain (numbers 1 to 4) together with a shared-nothing architecture (number 5).

You can see from these diagrams that, with the exception of scenario one, you have the opportunity to dedicate transaction and storage level components to individual databases.

If you’re thinking about a multi-tenancy situation where you’ve got more than one set of users in a single database, you’ll want to be able to ring-fence some of that computability so that different workloads and different users are not conflicting with each other. NuoDB has very flexible deployment options that allow you to do that. So you can do more than separate workloads, you can also separate applications, and ring-fence them and their physical resources. 

Q4: How do you integrate data mining technology on NuoDB databases?

NuoDB is optimised for OLTP transactions, and is ideal for systems of record and critical always-on applications. It is also a solution in many hybrid transaction/analytical processing (HTAP) use cases, where the ability to perform both online transaction processing and real-time operational intelligence processing simultaneously within the same database becomes a powerful combination. 

NuoDB is not aimed towards data warehousing or data lake management use cases. However, integration between NuoDB and other technologies is very straightforward via a comprehensive set of drivers, APIs, and certified development platforms. Spark and Kafka (among others) can be easily integrated with NuoDB in this way.

Q5: Is the NuoDB database tied to any specific cloud provider?

No, and that’s one of the beauties of our deployment model. We can happily run on Azure, on Google Cloud Platform (GCP), and on Amazon Web Services (AWS). You can even build a cluster where different components of your database are sitting on each of those three cloud providers. We’re totally cloud agnostic.

It's in this way that we preserve your ability to avoid vendor lock-in. If, for example, you are using one of the proprietary cloud databases and the provider announces a price increase, your choices are quite limited. With NuoDB, you can take your database with you. You can move off that provider, you can move from on-premise to cloud quite seamlessly, and you can move from one cloud to another, so it's very, very portable.

Q6: If we’re not ready for the cloud, can we start using your database on premise and move to the cloud? 

Yes, you can start out building your database on-premise, and when you're ready to move to the cloud you can stretch that cluster out to the cloud. Then you can start building capacity in one or more of the public clouds so that your cluster is spread over multiple clouds.

So you're not limited to on-premise deployments; you can also move into the public cloud as a hybrid architecture. So yes, you can definitely take your database with you when you're ready.

Ready to scale out your mission-critical applications? 

As forward-thinking enterprises move to the cloud, experts are seeking solutions to reduce technical debt while increasing business agility. Watch my webinar on demand to learn more, check out the deck on SlideShare, reach out in the comments, or send a note to sales@nuodb.com for more information. 

Moving data between SQL Server and MariaDB

Feed: MariaDB Knowledge Base Article Feed.

There are several ways to move data between SQL Server and MariaDB. Here we will discuss them and we will highlight some caveats.

Moving Data Definition from SQL Server to MariaDB

To copy SQL Server data structures to MariaDB, one has to:

  1. Generate an SQL file (dump) containing the SQL Server data structures.
  2. Modify the syntax so that it works in MariaDB.
  3. Run the file in MariaDB.

Variables That Affect DDL Statements

DDL statements are affected by some server system variables.

sql_mode determines the behavior of some SQL statements and expressions, including how strict error checking is and some details of the syntax. Objects like stored procedures, stored functions, triggers, and views are always executed with the sql_mode that was in effect during their creation. sql_mode='MSSQL' can be used to make MariaDB behave as closely to SQL Server as possible.

innodb_strict_mode enables the so-called InnoDB strict mode. Normally some errors in the CREATE TABLE options are ignored. When InnoDB strict mode is enabled, the creation of InnoDB tables will fail with an error when certain mistakes are made.

updatable_views_with_limit determines whether view updates can be made with an UPDATE or DELETE statement with a LIMIT clause if the view does not contain all primary or not null unique key columns from the underlying table.
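As an illustration only (not taken from the original article), a session could be configured along these lines before replaying a converted SQL Server dump; the values shown are one possible choice, not a recommendation:

-- Hypothetical session setup before running a converted SQL Server dump
SET SESSION sql_mode = 'MSSQL';                 -- emulate SQL Server behavior as closely as possible
SET SESSION innodb_strict_mode = ON;            -- fail on mistakes in CREATE TABLE options
SET SESSION updatable_views_with_limit = 'NO';  -- reject UPDATE/DELETE ... LIMIT on such views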

Dumps and sys.sql_modules

SQL Server Management Studio allows you to create a working SQL script to recreate a database – something that MariaDB users refer to as a dump. Several options allow you to fine-tune the generated syntax. It could be necessary to adjust some of these options to make the output compatible with MariaDB. It is possible to export the schema, the data, or both. One can create a single global file, or one file for each exported object. Normally, producing a single file is more practical.

Alternatively, the sp_helptext() procedure returns information about how to recreate a certain object. Similar information is also present in the sql_modules table (definition column) in the sys schema. Such information, however, is not a ready-to-use set of SQL statements.

Remember, however, that MariaDB does not support schemas. An SQL Server schema is approximately a MariaDB database.

To execute a dump, we can pass the file to mysql, the MariaDB command-line client.

Provided that a dump file contains syntax that is valid with MariaDB, it can be executed in this way:

mysql --show-warnings < dump.sql

--show-warnings tells MariaDB to output any warnings produced by the statements contained in the dump. Without this option, warnings will not appear on screen. Warnings don't stop the dump execution.

Errors will appear on screen. Errors will stop the dump execution, unless the --force option (or just -f) is specified.
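For example, assuming a hypothetical target database called mydb, a dump can be loaded while showing warnings and continuing past errors:

mysql --show-warnings --force mydb < dump.sql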

For other mysql options, see mysql Command-line Client Options.

Another way to achieve the same purpose is to start the mysql client in interactive mode first, and then run the source command. For example:

root@d5a54a082d1b:/# mysql -uroot -psecret
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 22
Server version: 10.4.7-MariaDB-1:10.4.7+maria~bionic mariadb.org binary distribution

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> \W
Show warnings enabled.
MariaDB [(none)]> source dump.sql

In this case, to show warnings we used the \W command, where "W" is uppercase. To hide warnings (which is the default), we can use \w (lowercase).

For other mysql commands, see mysql Commands.

CSV Data

If the table structures are already in MariaDB, we only need to import the table data. While this can still be done as explained above, it may be more practical to export CSV files from SQL Server and import them into MariaDB.

SQL Server Management Studio and several other Microsoft tools allow you to export CSV files.

MariaDB can import CSV files with the LOAD DATA INFILE statement, which is essentially the MariaDB equivalent of BULK INSERT.

It can happen that we don't want to import the whole data set, but some filtered or transformed version of it. In that case, we may prefer to use the CONNECT storage engine to access the CSV files and query them. The results of a query can be inserted into a table using INSERT ... SELECT.
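A minimal sketch of both approaches, assuming a hypothetical customers table and a /tmp/customers.csv file with a header line (the file must be readable by the MariaDB server, and secure_file_priv may restrict its location):

-- Import the whole file directly (the MariaDB counterpart of BULK INSERT)
LOAD DATA INFILE '/tmp/customers.csv'
  INTO TABLE customers
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
  LINES TERMINATED BY '\n'
  IGNORE 1 LINES;

-- Or expose the file as a CONNECT table and import only a filtered subset
CREATE TABLE customers_csv (
  id INT NOT NULL,
  name VARCHAR(100),
  country CHAR(2)
) ENGINE=CONNECT TABLE_TYPE=CSV FILE_NAME='/tmp/customers.csv' HEADER=1 SEP_CHAR=',';

INSERT INTO customers SELECT id, name, country FROM customers_csv WHERE country = 'US';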

Moving Data from MariaDB to SQL Server

There are several ways to move data from MariaDB to SQL Server:

  • If the tables don't exist at all in SQL Server, we need to generate a dump first. The dump can include data or not.
  • If the tables are already in SQL Server, we can use CSV files instead of dumps to move the rows. CSV files are the most concise format to move data between different technologies.
  • With the tables already in SQL Server, another way to move data is to insert the rows into CONNECT tables that "point" to remote SQL Server tables.

Using a Dump (Structure)

mysqldump can be used to generate dumps of all databases, a specified database, or a set of tables. It is even possible to dump only a set of rows by specifying a WHERE clause.

By specifying the --no-data option we can dump the table structures without data.

--compatible=mssql will produce an output that should be usable in SQL Server.
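For example, with a hypothetical database called mydb, a structure-only dump aimed at SQL Server could be produced like this:

mysqldump --no-data --compatible=mssql mydb > mydb_schema.sql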

Using a Dump (Data)

mysqldump by default produces an output with both data and structure.

--no-create-info can be used to skip the CREATE TABLE statements.

--compatible=mssql will produce an output that should be usable in SQL Server.

--single-transaction should be specified to select the source data in a single transaction, so that a consistent dump is produced.

--quick speeds up the dump process when dumping big tables.
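Putting these options together for a hypothetical database called mydb:

mysqldump --no-create-info --compatible=mssql --single-transaction --quick mydb > mydb_data.sql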

Using a CSV File

CSV files can also be used to export data to SQL Server. There are several ways to produce CSV files from MariaDB:
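One common approach, shown here as a sketch with hypothetical table and file names, is SELECT ... INTO OUTFILE, which writes the result of a query to a file on the MariaDB server host:

SELECT id, name, country
  FROM customers
  INTO OUTFILE '/tmp/customers_export.csv'
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
  LINES TERMINATED BY '\n';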

Using CONNECT Tables
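As a sketch of this approach (the DSN, credentials, and table names below are hypothetical), a CONNECT table of type ODBC can point at a remote SQL Server table, and local rows can then be pushed to it with INSERT ... SELECT:

-- CONNECT table mapped to a remote SQL Server table over ODBC
CREATE TABLE customers_mssql
ENGINE=CONNECT TABLE_TYPE=ODBC TABNAME='dbo.customers'
CONNECTION='DSN=mssql;UID=mariadb_user;PWD=secret';

-- Copy the local rows into SQL Server
INSERT INTO customers_mssql SELECT id, name, country FROM customers;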

Pstress: Database Concurrency and Crash Recovery Testing Tool

Feed: Planet MySQL
Author: MySQL Performance Blog

Databases are complicated software made to handle concurrent load while making specific guarantees about data consistency and availability. There are many scenarios that should be tested which can only happen under concurrent conditions.

Pstress is a probability-based open-source database testing tool designed to run in concurrency and to test if the database can recover when something goes wrong. It generates random transactions based on options provided by the user. With the right set of options, users can test features, regression, and crash recovery. It can create the concurrent load on a cluster or on a single server.

The tool is currently in beta, but it has already become very important within the testing pipeline for Percona. Pstress is widely used by Percona's QA team during each phase of testing. It has identified some critical bugs in Oracle MySQL Community Edition, Percona Server for MySQL, Percona XtraDB Cluster, and Percona XtraBackup. Pstress is developed on top of the Percona Pquery framework, a testing tool that picks random SQL statements from a file and executes them in multiple threads concurrently against the target environment. More information about Pquery can be found in the last section of this document.

Initially, the tool was named Pquery 3.0 but since its function and purpose diverged from that of Pquery we have renamed it to Pstress to more accurately convey its use case.

GitHub repository for Pstress

Key Features of Pstress

  • Generate different varieties of load based on the options provided by the user. For example, the tool can generate a single-threaded test with only inserts into one table, or it can generate huge transactions involving JSON, GIS, partitions, foreign keys, generated columns, et cetera in multiple threads.
  • Kill or shut down the running server or a node, let it recover, and check that recovery succeeded.
  • If any issue is identified, it creates a bug directory containing all relevant information.
  • It extracts important information from the bug directory and tries to compare it with existing bugs.
  • Provides an easy mechanism to reproduce the problem.

It can be used to test any database which supports a SQL interface. Currently, it is being developed to test MySQL and the plan is to extend the tool to test PostgreSQL as well.

How it Works

Pstress runs in multiple steps. In the first step, it creates metadata required to run the test based on the options provided by the user. Below you can find an example execution command line.
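A purely hypothetical invocation along these lines might look as follows; apart from --no-blob, --no-virtual, and --encryption=keyring, which are mentioned in the text below, every option name and value here is an assumption used only for illustration:

# Hypothetical pstress command line (binary name and most option names are assumptions)
pstress --tables 20 --columns 10 --indexes 20 --seed 2020 \
  --no-blob --no-virtual --encryption=keyring --threads 10 --seconds 300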

The above command creates 20 tables which have 10 columns and 20 indexes and other random attributes like compression, encryption, and tablespaces, based on seed 2020. If the seed is some other number, the table structures will differ. Options like --no-blob and --no-virtual mean the tables will not have blob or virtual columns, and with --encryption=keyring the tool uses keyring-based encryption.

It generates random SQL statements in proportion to the probability of each option, using either the command-line value or the default probability for each option. Currently, there are around 100 options for generating random SQL. Once there is support for JSON, GIS, and partitioned tables, there will be more options.
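Again as a purely hypothetical sketch (all option names below are assumptions), the probabilities described next might be passed like this:

# Hypothetical probability options (names are assumptions)
pstress --add-index 10 --drop-index 10 --insert-row 800 --update-with-cond 500 \
  --threads 10 --seconds 300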

The above command executes SQL statements in each session in the proportion of 10 add-index, 10 drop-index, 800 insert, and 500 update operations, with the default values for the other SQL types. This helps to test the scenario where add/drop index operations happen concurrently with DML.

After a certain time, it stops the step, saves the data directory, restarts the server with some server variables changed, and then continues the load.

While the test is in progress, the tool keeps checking the heartbeat of the server, the error logs, and other checksums to identify any bug. If Pstress finds an error, it creates a bug directory and saves stack traces, error logs, and other relevant information for you to analyze and reproduce the problem.

At the end of all of the steps, it tags each bug’s directory with known and unknown issues and creates a single report to be analyzed by the user.

Why We Have Multiple Steps

  • Pstress can be used for a crash recovery test since we kill the server or node multiple times when the server is under load and let it recover.
  • These multiple steps act as a snapshot of the server, and if any problem is identified, Pstress can rerun from the last known good point to reproduce the problem.

Different Use Cases for Pstress

It can be used as a regression, feature, or crash recovery testing tool.

Regression Testing

Each type of SQL statement has some default probability value which can be picked in a transaction. So when we combine all of these SQL statements we get a pretty good load covering the different features of the database in a concurrent mode. The tool has successfully found some issues that can only happen under concurrent conditions.

The plan is to have a configuration file containing combinations of different probabilities to test features like JSON, GIS, partitions, et cetera. Based on the seed value Pstress would pick these combinations to catch regression bugs.

Feature Testing

Pstress can be used to stress areas of a database that features can impact. Users have to pick the relevant SQL and set their probability high. 

Below you can find a simple example of generating load to test the "Instant Add Column" feature introduced in the latest release series of MySQL, MySQL 8.0.
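As a hypothetical sketch (only --only-cl-ddl is taken from the text; the remaining option names and values are assumptions):

# Hypothetical options to stress Instant Add Column with concurrent DML
pstress --only-cl-ddl --add-column 100 --drop-column 100 --threads 10 --seconds 300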

Within this set of options, --only-cl-ddl represents "only command line DDL".

The tool will execute concurrent add and drop column operations in multiple sessions along with DML. Also, if a column is added successfully in one session, then another session will start generating DML using that column.

Crash Recovery Testing

At the end of each step, Pstress kills the server when the database is under load and lets it recover. It also changes some of the server variables before restarting the server and does a sanity check before moving ahead.  If Pstress identifies an issue, it saves all the contents of the data directory and other relevant information in the bug’s directory.

Modules in Pstress

Pstress consists of a driver script and a workload. 

Driver Script

The driver script is written in BASH as a shell script. Below are the main features of the driver script.

  1. Start/shut down nodes of the database server.
  2. Execute the workload.
  3. Check the status of the server by verifying its heartbeat, the error logs it generates, etc.
  4. If Pstress identifies an issue, save the data directory and other relevant information in the bug's directory.
  5. Extract stack traces and other information from the bug's directory and try to tag them against known and unknown issues.
  6. Monitor the workload and kill it if it is stuck.
  7. At the end of the test, generate a report detailing the test, such as the number of errors that occurred, known issues found, and the percentage of transactions that were successful.

Workload

The workload is a multi-threaded program that executes transactions. Each thread runs its own loop in which a random transaction is executed. A transaction can be DML/DDL against a table or setting some global variables.

Below is the workflow for a workload

  1. Create metadata for tables based on the options provided by the user.
  2. Start workload in multiple threads.
  3. In the first step, it creates default tables and loads initial data into those tables.
  4. Picks some random table and issues transactions against it.
  5. Updates its own metadata after a DDL, so other threads can start using the new definition of the table. 
  6. At the end of the run, saves the metadata in a JSON file which can be used for the next step.
  7. If an error is found in any thread, that complete step is marked as a failure.

Types of Transactions

Pstress supports a huge variety of SQL statements and transactions which are randomly generated based on the seed value. 

  1. INSERT/REPLACE statement which can involve int, varchar, blob, GIS data, JSON, partitions, and foreign keys.
  2. SELECT statements with some columns, all columns, and a where clause, operators such as <, =, >, like, between, et cetera.
  3. UPDATE and DELETE statements with some columns, all columns, an IN clause, a WHERE clause, and operators such as <, =, >, like, between, et cetera.
  4. OPTIMIZE, ANALYZE, TRUNCATE or other table-level operations
  5. Add or drop columns, add or drop indexes, rename columns, rename indexes, and add JSON or GIS indexes.
  6. JOIN on multiple tables. 
  7. Executing special syntax SQL statements. The user provides a grammar file. Pstress will randomly fit some metadata and execute them as transactions. 
  8. Add/Drop/Create redo/undo at runtime.

The target is to add the majority of transactions and statements supported by MySQL and PostgreSQL.

Design

The driver script is written in BASH as a shell script, and the workload is written in C++. The workload has node, table, column, and index objects.

  • Nodes are instantiated when the database is running in a clustered mode.  
  • Tables have multiple derived objects such as temporary tables, partition tables, etc. 
  • Columns are also derived into GIS, JSON, generated, blob, varchar, int, etc. 
  • Indexes also have derived objects like virtual indexes, JSON indexes, GIS indexes, etc.
  • Tables can also have interrelationships via foreign keys.

There are also tablespace objects, undo objects, and redo objects.  Data structures are used to hold global variables and workload uses mutexes and locks to enable the modification of metadata in multiple threads.

At the end of each step, objects are written to a JSON file and then reconstructed from the file when the step is restarted.

Success Stories with MySQL

Within the development phase, Pstress has been successful in finding many bugs in Oracle MySQL Community Edition, Percona Server for MySQL, Percona XtraDB Cluster, and Percona XtraBackup.  As of the time of writing, there are more than 50 bugs filed against these products’ bug trackers. A few examples are linked below.

Bugs in Oracle MySQL Community Edition

https://bugs.mysql.com/bug.php?id=98564

https://bugs.mysql.com/bug.php?id=98530

Bugs in Percona Server for MySQL

https://jira.percona.com/browse/PS-6815

https://jira.percona.com/browse/PS-5924

Bugs in Percona XtraBackup

https://jira.percona.com/browse/PXB-1974

https://jira.percona.com/browse/PXB-1972

Bugs in Percona XtraDB Cluster

https://jira.percona.com/browse/PXC-2629

https://jira.percona.com/browse/PXC-2949

Comparison with Existing Tools

Some well-known existing tools in the MySQL community are RQG (Random Query Generator), SysBench, and Pquery.

RQG has the concept of a grammar file. It stresses the database by executing the grammar files, so you must combine lots of grammar files to create a good load, which is often not a straightforward process.

Also, when new database features are added, existing grammar files are not helpful because they do not account for the new features and must be modified. In essence, you must edit all grammar files each time a new database feature is added in order to have a good test case. This is not easy and can be a cumbersome process.

SysBench is more of a benchmark tool than a testing tool. It takes a lot of effort to combine tests to generate load which can be used to test database concurrency and test some specific database features.

Pstress has some cool features like crash recovery, bug tagging, and reports.  These features are part of the focus of testing database concurrency and its side-effects, which make it a better tool for this purpose.

Pquery and Pstress

Pstress is developed on the top of the Pquery testing tool. Pquery executes random SQL from a file. Pstress generates SQL based on the option provided by the user.

In Pstress, table metadata gets refreshed whenever a DDL happens which makes the probability of executing successful SQL statements much higher.  For example, if a new column is added in one session then other sessions start using that column to generate SQL.

Pquery picks random queries from a file and executes those queries against the server. Some SQL statements within a transaction must occur in a specific order to succeed; if they are executed out of order, they will fail. For example, attempting to execute an INSERT on a table before executing "CREATE TABLE".

Pquery by itself does not generate enough load to create stress on various modules within the database code to find stress-induced bugs in features.  Pstress is focused on concurrency and can be used to find and identify stress-induced bugs.

Limitations of Pstress

Pstress can’t be used to check the correctness of a database. Pstress doesn’t check that a query that was successful is returning correct data. There are a few checks, however, such as ensuring the parent-child relationship is not violated in foreign keys and that partition tables don’t have data from other partitions. These checks are very limited. Because of the concurrent nature of Pstress, there is no mechanism to check if it is returning correct query results because by the time we are able to perform a check other threads have updated something.  This presents an effective race condition if we were to attempt validating correctness. 

One possible way to solve this is to use multi-master clustering to check data correctness but it may not be a true validation because the additional master is another database instance.  We plan to try this method and see what the results are, as we already support working with clusters such as Percona XtraDB Cluster. 

Pstress primarily relies on database asserts, messages in error logs, checksum, and a few other things to find issues in the database.  This means that if bugs silently present themselves in the database, the tool will be unable to find and identify them. While the tool is good at finding interesting bugs, it’s still critically important that the database code includes proper logging and assertions.

Sometimes it can be difficult to reproduce the bugs found because of their highly concurrent nature. The concept of multiple steps helps because we save snapshots of the database and reproduce the problem in a small time-frame.  However, this is not always guaranteed. If a user has some hints about the source of the issue they can stress that part of the database and increase the probability of reproducing the bug.

Features in the Development Stage

Support for JSON, GIS, partitioning, and foreign keys is in the development stage.

Acknowledgments

Thanks to Percona for giving me enough time to work on this tool. Thanks to the Percona QA team for their contributions to Pstress.
