
Integrating MySQL tools with Systemd Service

Feed: Planet MySQL
Author: MyDBOPS

In my day-to-day work as a Database Engineer at Mydbops, we use multiple MySQL tools for a variety of use cases to ensure optimal performance and availability of the servers managed by our Remote DBA team.

A tool like pt-online-schema-change can be used for DDL changes (see the overview of DDL algorithms), and when a tool needs to run for a longer period we tend to schedule it with screen or cron.

Here are some of the problems we face when we daemonize a process or run it inside screen:

  • The daemon process is killed when the server reboots.
  • The screen session might be terminated accidentally while closing it.
  • There is no flexibility to start or stop the process when required.

These common problems can be overcome by running the tools as systemd services.

What is Systemd?

Systemd is a system and service manager for Linux operating systems. When running as the first process on boot (as PID 1), it acts as an init system that brings up and maintains userspace services.

Here are a few use cases that can be made simple with a systemd service:

  • Integrating Pt-heartbeat with Systemd Service
  • Integrating Auto kill using pt-kill with Systemd Service.
  • Integrating multi query killer with Systemd service

Integrating Pt-heartbeat with Systemd Service

We had a requirement to schedule pt-heartbeat to monitor replication lag for one of the clients under our managed database services. The problem statement: pt-heartbeat was running as a daemon process, and whenever the system was rebooted for maintenance, the pt-heartbeat process was killed, we started receiving replication lag alerts, and fixing it needed manual intervention.

Script for pt-heartbeat

/usr/bin/pt-heartbeat --daemonize -D mydbops --host=192.168.33.11 --master-server-id 1810 --user=pt-hbt --password=vagrant --table heartbeat --insert-heartbeat-row --update

Now let us integrate it with systemd


$ cd /etc/systemd/system/

$ vi pt-heartbeat.service
##pt-heartbeat systemd service file
[Unit]
Description="pt-heartbeat"
After=network-online.target syslog.target
Wants=network-online.target

[Install]
WantedBy=multi-user.target

[Service]
Type=forking

ExecStart=/usr/bin/pt-heartbeat --daemonize -D mydbops --host=192.168.33.11 --master-server-id 1810 --user=pt-hbt --password=vagrant --table heartbeat --insert-heartbeat-row --update

StandardOutput=syslog
StandardError=syslog
SyslogIdentifier=pt-heartbeat
Restart=always

ExecStart specifies the command to be executed when the service starts.

Restart=always tells systemd to restart the process automatically whenever it exits. To have the service start when the OS boots up, it also needs to be enabled so that the [Install] section takes effect.
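For example, after the unit file is created and the daemon reloaded (see below), the service can be enabled so that it starts at boot:

$ systemctl enable pt-heartbeat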

Once the new systemd unit file is in place, reload the systemd daemon and start the service:

$ systemctl daemon-reload
$ systemctl start pt-heartbeat
$ systemctl status pt-heartbeat -l
● pt-heartbeat.service - "pt-heartbeat"
Loaded: loaded (/etc/systemd/system/pt-heartbeat.service; disabled; vendor preset: enabled)
Active: active (running) since Mon 2020-06-20 13:20:24 IST; 10 days ago
Main PID: 25840 (perl)
Tasks: 1
Memory: 19.8M
CPU: 1h 1min 47.610s
CGroup: /system.slice/pt-heartbeat.service
└─25840 perl /usr/bin/pt-heartbeat --daemonize -D mydbops --host=192.168.33.11 --master-server-id 1810 --user=pt-hbt --password=vagrant --table heartbeat --insert-heartbeat-row --update

This service can be stopped just like any other systemd-managed process:

$ systemctl stop pt-heartbeat
● pt-heartbeat.service - "pt-heartbeat"
Loaded: loaded (/etc/systemd/system/pt-heartbeat.service; disabled; vendor preset: enabled)
Active: inactive (dead)
Jun 20 15:46:07 ip-192-168-33-11 systemd[1]: Stopping "pt-heartbeat"…
Jun 20 15:46:07 ip-192-168-33-11 systemd[1]: Stopped "pt-heartbeat".

Integrating Auto kill using pt-kill with Systemd Service

In production servers, long-running queries can spike system resource usage, degrade MySQL performance drastically, or even get the MySQL process killed by the OOM killer. To avoid these hiccups, we can schedule the Percona pt-kill process based on the defined use case.

Scheduling the pt-kill service:

$ cd /etc/systemd/system/

$ vi pt-kill.service
#pt-kill systemd service file

[Unit]
Description="pt-kill"
After=network-online.target syslog.target
Wants=network-online.target

[Install]
WantedBy=multi-user.target

[Service]
Type=forking

ExecStart=/usr/bin/pt-kill --user=mon_ro --host=192.168.33.11 --password=pt@123 --busy-time 200 --kill --match-command Query --match-info (select|SELECT|Select) --match-user cron_ae --interval 10 --print --daemonize

StandardOutput=syslog
StandardError=syslog
SyslogIdentifier=pt-kill
Restart=always
$ systemctl daemon-reload
$ systemctl start pt-kill
$ systemctl status pt-kill
pt-kill.service - "pt-kill"
Loaded: loaded (/etc/systemd/system/pt-kill.service; enabled)
Active: active (running) since Wed 2020-06-24 11:00:17 IST; 5 days ago
CGroup: name=systemd:/system/pt-kill.service
├─20974 perl /usr/bin/pt-kill --user=mon_ro --host=192.168.33.11 --password=pt@123 --busy-time 200 --kill --match-command Query --match-info (select|SELECT|Select) --match-user cron_ae --interval 10 --print --daemonize

Now we have configured a fail safe pt-kill process.

Integrating multi query killer with Systemd service

Question: Is it possible to integrate multiple kill statements for different hosts as a single process?

Answer: Yes! It is possible and quite simple too.

Just add the needed commands to a shell script file and make it executable. In the example below I have chosen three different servers: an RDS instance (more on AWS RDS and its myths) and a couple of virtual machines.

$ vi pt_kill.sh

/usr/bin/pt-kill --user=pt_kill --host=test.ap-northeast-1.rds.amazonaws.com --password=awkS --busy-time 1000 --rds --kill --match-command Query --match-info "(select|SELECT|Select)" --match-user "(mkt_ro|dash)" --interval 10 --print --daemonize >> /home/vagrant/pt_kill_slave1.log

/usr/bin/pt-kill --user=mon_ro --host=192.168.33.11 --password=pt@123 --busy-time 20 --kill --match-command Query --match-info "(select|SELECT|Select)" --match-user "(user_qa|cron_ae)" --interval 10 --print --daemonize >> /home/vagrant/pt_kill_slave2.log

/usr/bin/pt-kill --user=db_ro --host=192.168.33.12 --password=l9a40E --busy-time 200 --kill --match-command Query --match-info "(select|SELECT|Select)" --match-user sbtest_ro --interval 10 --print --daemonize >> /home/vagrant/pt_kill_slave3.log

Scheduling pt-kill.service for multiple hosts.

#pt-kill systemd service file

[Unit]
Description="pt-kill"
After=network-online.target syslog.target
Wants=network-online.target

[Install]
WantedBy=multi-user.target

[Service]
Type=forking

ExecStart=/bin/bash /home/vagrant/pt_kill.sh

StandardOutput=syslog
StandardError=syslog
SyslogIdentifier=pt-kill
Restart=always

Reload the daemon and start the service

$ systemctl daemon-reload
$ systemctl start pt-kill
$ systemctl status pt-kill
pt-kill.service - "pt-kill"
Loaded: loaded (/etc/systemd/system/pt-kill.service; enabled)
Active: active (running) since Wed 2020-06-24 11:00:17 IST; 5 days ago
CGroup: name=systemd:/system/pt-kill.service
├─20974 perl /usr/bin/pt-kill --user=pt_kill --host=test.ap-northeast-1.rds.amazonaws.com --password=awkS --busy-time 1000 --rds --kill --match-command Query --match-info "(select...
├─21082 perl  /usr/bin/pt-kill --user=mon_ro --host=192.168.33.11 --password=pt@123 --busy-time 20 --kill --match-command Query --match-info "(select...
├─21130 perl /usr/bin/pt-kill --user=db_ro --host=192.168.33.12 --password=l9a40E --busy-time 200 --kill --match-command Query --match-info "(select...

This makes systemd a useful and easy tool for scheduling MySQL tools in a database environment. There are many more features in systemd that can be used for scheduling scripts, bypassing the use of crontab.

Note: These are all sample scripts; make sure you test them well before using them in production.


The MySQL 8.0.21 Maintenance Release is Generally Available

Feed: Planet MySQL
Author: Geir Hoydalsvik

The MySQL Development team is very happy to announce that MySQL 8.0.21 is now available for download at dev.mysql.com. In addition to bug fixes, there are a few new features added in this release. Please download 8.0.21 from dev.mysql.com or from the MySQL Yum, APT, or SUSE repositories. The source code is available at GitHub. You can find the full list of changes and bug fixes in the 8.0.21 Release Notes. Here are the highlights. Enjoy!

InnoDB

Add config option to disable redo log globally (WL#13795) This work by Debarun Banerjee adds support for disabling and enabling the InnoDB redo log dynamically (ALTER INSTANCE ENABLE|DISABLE INNODB REDO_LOG;). Disabling the redo log (WAL) makes all write operations to the database faster. It also makes the server vulnerable to a crash, and the entire instance data could be lost. The primary use case for disabling the redo log is while loading initial data into a mysqld instance: first disable redo logging, then load the data, and finally enable redo logging again. In this model data loading is fast. However, if the node goes down during the load, then all data is lost and one needs to create a fresh instance and start all over again.
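A minimal sketch of that workflow (the load step is illustrative):

ALTER INSTANCE DISABLE INNODB REDO_LOG;
-- perform the initial bulk load, e.g. LOAD DATA INFILE or a large import script
ALTER INSTANCE ENABLE INNODB REDO_LOG;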

Make tablespace filename validation optional via --innodb-validate-tablespace-paths (WL#14008) This work by Niksa Skeledzija adds the option to disable the conservative approach of validating the tablespace files upon startup. The tablespace validation normally checks that all tablespaces listed in the MySQL data dictionary are found in the file system. This can be costly on low end systems with HDD and when there are many tablespace files to scan. In situations where we know that the user does not move files around we can reduce startup time by skipping the validation. The new variable --innodb-validate-tablespace-paths = (ON | OFF) defaults to ON to preserve existing behavior. Users can set it to OFF if they know that tablespace files are never moved around manually at the file system level. Users can still safely move files around using the ALTER TABLESPACE syntax, even when the option is set to OFF.

Lock-sys optimization: sharded lock_sys mutex (WL#10314) This work by Jakub Lopuszanski introduces a granular approach to latching in the InnoDB lock-sys module. The purpose is to improve the throughput for high concurrency scenarios such as with 128 – 1024 Sysbench OLTP RW clients. The lock-sys orchestrates access to tables and rows. Each table, and each row, can be thought of as a resource, and a transaction may request an access right for a resource. Two transactions operating on a single resource can lead to problems if the two operations conflict with each other, lock-sys therefore maintains lists of already GRANTED lock requests and checks new requests for conflicts. If there is a conflict they have to start WAITING for their turn. Lock-sys stores both GRANTED and WAITING lock requests in lists known as queues. To allow concurrent operations on these queues, there is a mechanism to latch these queues safely and quickly. In the past a single latch protected access to all of these queues. This scaled poorly, and the management of queues became a bottleneck, so we have introduced a more granular approach to latching.

Restrict all InnoDB tablespaces to known directories (WL#13065) This work by Kevin Lewis ensures that the placement of tablespace files are restricted to known directories (datadir, innodb_data_home_dir, innodb_directories, and innodb_undo_directory). The purpose is to allow the DBA to restrict where files can be created as well as to avoid surprises during recovery. Recovery is terminated by InnoDB if a file referenced in the redo log has not previously been discovered.

Support ACID Undo DDL  (WL#11819)   This work by Kevin Lewis improves the handling of undo tablespaces in terms of both reliability and performance. With this feature users can enable the automatic truncation of empty UNDO table spaces. The CREATE/TRUNCATE of undo tablespaces is now redo logged. The advantage of using REDO logging is to avoid the two hard checkpoints that were required during undo truncation in the previous solution, i.e. these hard checkpoints could lead to a stall on a busy system. This change also fixes several issues that impacted durability of UNDO, CREATE, DROP and TRUNCATE.

JSON

Add JSON_VALUE function (WL#12228) This work by Knut Anders Hatlen adds the JSON_VALUE function (SQL 2016, chapter 6.27). The function extracts the value at the specified path from the given JSON document and returns it as the specified type. For example: SELECT JSON_VALUE('{"name": "Evgen"}', '$.name') will return unquoted string ‘Evgen’ as VARCHAR(512) with JSON’s default collation, while SELECT JSON_VALUE('{"price": 123.45}', '$.price' RETURNING DECIMAL(5,2)) will return 123.45 as decimal(5,2). The main motivation is to ease index creation of JSON values.
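For example, a functional index over a JSON attribute could be created along these lines (the table and column names are illustrative):

CREATE TABLE products (
    doc JSON,
    INDEX price_idx ((JSON_VALUE(doc, '$.price' RETURNING DECIMAL(8,2))))
);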

SQL DDL

Make CREATE TABLE…SELECT atomic and crash-safe  (WL#13355) This work by Gopal Shankar makes the CREATE TABLE ... AS SELECT statement atomic. Historically, MySQL has handled this statement as two different statements (CREATE TABLE and SELECT INTO) executed in two different transactions (since DDL is auto committed in MySQL). The consequence is that there have been failure scenarios where  the CREATE TABLE has been committed and SELECT INTO rolled back. This problem goes away with WL#13355, the CREATE TABLE … AS SELECT will now either be committed or rolled back.  This work enables the CREATE TABLE … AS SELECT in group replication. This also fixes Bug#47899 reported by Sven Sandberg.

Optimizer

Introduce new optimizer switch to disable limit optimization (WL#13929)  This work by Chaithra Gopalareddy adds a new optimizer switch called “prefer_ordering_index” (ON by default). The new switch controls the optimization to switch from a non-ordering index to an ordering index for group by and order by when there is a limit clause. While this optimization is beneficial for some queries, it can produce undesirable results for others. This addition mitigates the problem reported by Jeremy Cole in Bug#97001.
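For example, the optimization can be switched off for a session as follows:

SET SESSION optimizer_switch = 'prefer_ordering_index=off';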

Make semijoin work with single-table UPDATE/DELETE (WL#6057) This work by Guilhem Bichot makes single-table UPDATE/DELETE statements, such as UPDATE t1 SET x=y WHERE z IN (SELECT * FROM t2); and DELETE FROM t1 WHERE z IN (SELECT * FROM t2); go through the query optimizer and executor like other queries. Earlier, a shortcut bypassed optimization and went straight for a hard coded execution plan, preventing these statements from benefiting from more advanced optimization, e.g., semijoin. The old trick of adding an irrelevant table to the UPDATE/DELETE statement in order to make it a multi-table statement that is passed through the optimizer is no longer necessary. In addition, EXPLAIN FORMAT=TREE and EXPLAIN ANALYZE can now be used on these statements. This work fixes Bug#35794 reported by Alejandro Cusumano and Bug#96423 reported by Meiji Kimura.
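For example, the chosen plan for such a statement can now be inspected directly:

EXPLAIN FORMAT=TREE DELETE FROM t1 WHERE z IN (SELECT * FROM t2);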

Inject CAST nodes for comparisons with STRING non-const expressions (WL#13456)  This work by Catalin Besleaga extends earlier work (WL#12108) around injecting CAST nodes in the query plan. The purpose is to make data type casting explicit and consistent within the generated query plan. WL#12108 created the infrastructure around injecting CAST nodes in comparisons between non-const expressions of different data types, but string-based non-const expressions were excluded. WL#13456 now implements the missing CAST nodes for the string-based non-const expressions.

Group Replication

Reduce minimum value of group_replication_message_cache_size (WL#13979) This work by Luís Soares lowers the minimum value of group_replication_message_cache_size from 1 GB to 128 MB. This allows the DBA to reduce the size of the XCom cache so that InnoDB Cluster can be deployed successfully on a host with a small amount of memory (like 16 GB) and good network connectivity.

Specify through which endpoints recovery traffic can flow (WL#13767) This work by Anibal Pinto implements a mechanism to specify which IPs and ports are used during distributed recovery for group replication. The purpose is to control the flow of the recovery traffic in the network infrastructure, e.g. for stability or security reasons.

Compile XCom as C++ (WL#13842) This work by Tiago Vale initiates the transition from C to C++ for the XCom component. The purpose is to accelerate the development of future XCom work due to C++'s higher-level features and richer standard library.

Classify important GR log messages as system messages  (WL#13769) This work by Nuno Carvalho reclassifies some Group Replication log messages as system messages. System messages are always logged, independently of the server log level. The purpose is to ensure that the DBA can observe the main events in the group.

START GROUP_REPLICATION to support credentials as parameters  (WL#13768)  This work by Jaideep Karandee extends the START GROUP_REPLICATION command to accept USER, PASSWORD, DEFAULT_AUTH and PLUGIN_DIR as optional parameters for the recovery channel. The purpose is to avoid storing credentials in a file which can be a security concern in some environments.
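A sketch of the new syntax (the user name and password are illustrative):

START GROUP_REPLICATION USER='rpl_recovery_user', PASSWORD='secret';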

Increase default group_replication_autorejoin_tries (WL#13706) This work by Parveez Baig increases the default group_replication_autorejoin_tries from 0 to 3. With group_replication_autorejoin_tries=0, a group replication network partition for longer than 5 seconds  causes the member to exit the group and not come back, which results in the need to do a manual operation to bring the member back. The goal is to provide ‘automatic network partition handling’, including recovering from network partitions, which is most effectively achieved by setting group_replication_autorejoin_tries > 0.

Increase default group_replication_member_expel_timeout (WL#13773) This work by Pedro A Ribeiro increases the default group_replication_member_expel_timeout from 0 to 5 seconds. The default group_replication_member_expel_timeout is set to 5 seconds to decrease the likelihood of unnecessary expulsions and primary failovers on slower networks or in the presence of transient network failures. The new value for the default implies that the member will be expelled 10 seconds after becoming unreachable: 5 seconds are spent waiting before creating the suspicion that the member has left the group, then a further 5 seconds are waited before expelling the member.

Support binary log checksums  (WL#9038) This work by Nuno  Carvalho implements support for binlog checksums in Group Replication. Before this change the Group Replication plugin required that binlog-checksum was disabled, this restriction is now lifted. The purpose of binlog checksums is to ensure data integrity by automatically computing and validating checksums of binary logs events.

X Plugin

Support multiple –mysqlx-bind-address  (WL#12715)  This work by Grzegorz Szwarc allows the user to configure X Plugin bind address with multiple IP address (interfaces) where the user can skip unwanted interfaces of the host machine. The MySQL Server introduced binding to multiple addresses in 8.0.13 (WL#11652).

Router

User configurable log filename (WL#13838) This work by Thomas Nielsen adds an option to write log to filenames other than the “mysqlrouter.log” and to redirect console messages to stdout instead of stderr.

Support hiding nodes from applications (WL#13787) This work by Andrzej Religa adds support for a per-instance metadata attribute indicating that a given instance is hidden and should not be used as a destination candidate. The MySQL Router supports distributing connections across the nodes of InnoDB cluster. In general, distributing the load to all nodes is a good default and expected, but the user may have reasons to exclude a node from receiving load. For example, the user might want to exclude a given server instance from application traffic, so that it can be maintained without disrupting the incoming traffic.

Other

CREATE/ALTER USER COMMENT ‘JSON’  (WL#13562)  This work by Kristofer Älvring adds a “metadata” JSON object to the User_attributes column in the mysql.user table. The “metadata” JSON object allows users to also store their user account metadata into the column, for example ALTER USER foo ATTRIBUTE '{ "free_text" : "This is a free form text" }'; will be stored as {"metadata": {"free_text": "This is a free form text"}}. The user metadata is exposed in the information schema tables for users.
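The stored metadata can then be inspected through the information schema, for example (a sketch, assuming the USER_ATTRIBUTES view added alongside this feature):

SELECT USER, HOST, ATTRIBUTE FROM INFORMATION_SCHEMA.USER_ATTRIBUTES WHERE USER = 'foo';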

Support separate set of TLS certificates for admin connection port (WL#13850)  This work by Harin Vadodaria makes it possible to have different TLS certificates for the user port and the admin port. MySQL supports a dedicated port for administrative connections. Previously, both the admin connection port and regular client-server connection port shared the same set of TLS certificates. In a managed hosted environment, this poses challenges because: 1. Customers may want to bring their own certificates 2. Policy for certificate rotation may be different for internal and external certificates. We have now introduced a separate set of TLS certificates and other related configuration and status parameters specifically for the admin connection port and use a separate SSL context for connections coming from these two different ports.

Compression protocol for async client (WL#13510)  This work by Rahul Sisondia adds support for protocol compression for asynchronous clients. Support for asynchronous clients was added by WL#11381 in 8.0.16. Support for protocol compression for synchronous clients was added by WL#12475 in 8.0.18. This final step ensures that protocol compression is also supported for asynchronous clients. The purpose is to reduce the network traffic across data-centers. This work is based on a contribution from Facebook (BUG#88567).

Client library safe LOAD DATA LOCAL INFILE paths/directories (WL#13168) This work by Georgi Kodinov fixes a usability issue with the LOAD DATA INFILE statement at the client side without compromising on the security aspects. The client side configuration will specify what is allowed and not allowed. Then when the server asks for a file, the client will check the specification and either accept or reject the request.

Deprecation and Removal

Deprecate support for prefix keys in partition functions (WL#13588) This work by Nischal Tonthanahal adds deprecation warnings in cases when a table includes a column having prefix key index in the PARTITION BY KEY clause. As of today the syntax is supported but has no effect on the partition calculation and in the future the syntax will give an error message.

Thank you for using MySQL!



Haroon: Generating and Managing PostgreSQL Database Migrations (Upgrades) with Spring Boot JPA

Feed: Planet PostgreSQL

If you are building a Spring Boot application for your next project, you will also be preparing and planning the PostgreSQL database that stores the data generated by your customers. In some previous posts on the topic of RESTful services in Spring Boot, we discussed how we can use JPA to automatically create and apply the schema to the PostgreSQL database.

In this post, we will see why letting Hibernate control the schema changes is not the best approach, and how to manage the schema, apply it to the PostgreSQL database, keep your Java objects and PostgreSQL database tuples in sync, and version-control your schema so your PostgreSQL database and web app stay aligned.

Spring Boot: Hibernate Schema Generation

In Spring Boot and JPA/Hibernate, we were able to control schema generation directly from our Entity objects and apply it to the PostgreSQL database automatically. This behavior can be configured to:

  1. Either apply the changes always on the startup.
  2. Or apply the changes when there is a schema difference (the update configuration)

Although this is good for rapid prototyping and development/testing environments, it is discouraged in production environments, for several reasons:

  1. Your PostgreSQL database schema gets destroyed and recreated from scratch each time you apply the changes.
  2. Your PostgreSQL database schema is controlled by JPA; this can be a good thing if you want to avoid managing the DBA side. If you have a DBA, it is best to include them in your design process.
  3. JPA offers very thin feature support when it comes to databases. JPA is able to manage your tables and their relationships, but it cannot handle triggers or advanced key constraints.

Apart from this, if you end up modifying the database schema outside your application, JPA will not be able to accommodate the changes in your Java models. This leads to problematic behaviors in Java apps and Hibernate suggests that you consider controlling the schema changes manually and let Hibernate handle the object mappings. 

For a fresh Java web application, you can start off the development with Java models and SQL schema side-by-side in a versioned environment. JPA (or Hibernate) will allow us to map the Java entities and their properties to the backend SQL query, but what we will change in the process is the way the schema is created. Then, as your schema changes, you write these changes in SQL and add them to your app’s version control history. These changes are known as migrations; or database migrations to be precise.

What are migrations? Database migrations, in the relational database world, are the changes to the schema of your database. They help DBAs and the operations team manage and maintain a version history for your database schema, and roll back if the latest schema poses potential performance or security problems.

To do this, we will use Flyway as the database migration tool and create the database schema manually. Instead of having Hibernate generate the schema, we write the SQL scripts and add them to our application resources. The Flyway plugin will read the migrations from the resources and apply them one by one to synchronize the PostgreSQL database schema and app code.

Verifying the SQL

Before we go ahead, it is better to always verify the SQL and the schema changes you have made in the testing environment. Once the schema changes are run on production, they are difficult to reverse, and reversing them might bring unwanted PostgreSQL database downtime.

Beyond this step, I assume that you have created a Spring Boot application; if you have not yet, you can always create a fresh one from Spring Initializr. Also note that this post discusses PostgreSQL database migrations rather than how RESTful services are configured; we have covered the RESTful bits in our Spring Boot and RESTful APIs with PostgreSQL series.

Whereas previously we started by defining the Java model, in the code-first migration approach we write the SQL schema and apply it to the PostgreSQL database first, then create the JPA repositories and Java entities around this SQL schema. Flyway requires the migrations to follow a naming convention:

V{number}__helpful_title_for_migration.sql

If your schema migration files do not meet this, Flyway will simply ignore them. So, following this, we create our first migration to ensure that we have the schema created in the PostgreSQL database:

CREATE SCHEMA IF NOT EXISTS migrations;

Now, we add another migration to create the people table in the PostgreSQL database:

CREATE TABLE IF NOT EXISTS migrations.people (
    id INT2,
    name TEXT NOT NULL,
    role TEXT NOT NULL,
    PRIMARY KEY (id)
);

Moving onwards, we create another table, comments:

CREATE TABLE IF NOT EXISTS comments (
    id INT4,
    comment TEXT NOT NULL,
    author TEXT NOT NULL,
    PRIMARY KEY (id)
);

Lastly, we insert some records in the PostgreSQL database for the people table:

-- INSERT DATA
INSERT INTO migrations.people (id, name, role)
VALUES
    (1, 'Person 1', 'Technical Writer'),
    (2, 'Person 2', 'Editor'),
    (3, 'Person 3', 'Reviewer'),
    (4, 'Person 4', 'Reader');

We create four schema files and name them accordingly, and they live in the application resources under the db/migration directory.
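For instance, the layout could look like this (the file names are illustrative; only the V{number}__ prefix and the .sql suffix matter to Flyway):

src/main/resources/db/migration/
    V1__create_migrations_schema.sql
    V2__create_people_table.sql
    V3__create_comments_table.sql
    V4__insert_people_data.sql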

Now that we have the schema set up, we need to configure the Flyway plugin before we can start the application for testing.

Note: Whether you should put each schema update/object in a separate file or compact all changes for a major release into a single file depends on the decision you make with your DBAs. The approach I have shown in this post is just a simple, quick-to-start one, not a recommended or prescribed one.

Update Flyway settings

Now, you will set up the Flyway plugin so it can apply migrations to the PostgreSQL database. The Flyway dependency can be added to your pom.xml file (for Maven projects; check here for Gradle projects).

<dependency>
    <groupId>org.flywaydb</groupId>
    <artifactId>flyway-core</artifactId>
</dependency>

Update your Maven system and let it download the dependencies for you. After this, you should update your application.properties file to include Flyway settings that Spring Boot will use to apply your schema changes:

# Flyway
spring.flyway.schemas=migrations
spring.flyway.locations=classpath:db/migration

The complete setting file now would be:

# DataSource
spring.datasource.url=jdbc:postgresql://172.17.0.2:5432/postgres
spring.datasource.username=postgres
spring.datasource.password=mysecretpassword
spring.datasource.hikari.schema=migrations

# Disable Hibernate schema generation/application
spring.jpa.hibernate.ddl-auto=none

# Flyway
spring.flyway.schemas=migrations
spring.flyway.locations=classpath:db/migration

At this point, you have configured the application to disable Hibernate schema management, and you have specified the location for the schema migrations, which is under the "resources" directory, in "db/migration" (see the migration file listing above).

Since we follow a code-first approach, we will “now” build the Java web app and its models to support the PostgreSQL database. We know that Hibernate will be able to send queries to the backend based on the Java model, so we need to configure the objects, their table configurations, their column details and if there are any relationships that are defined (though none are defined as of right now in our schema above!).

Updating Entity attribute

We create a new @Entity type in Java (the way we did in the previous posts) and expose public getter methods for the fields that are captured by Hibernate.

package com.demos.pgschemamigrations.models;

import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Table;

@Entity
@Table(name = "people")
public class Person {

    @Id
    private Long id;

    @Column(name = "name")
    private String name;

    @Column(name = "role")
    private String role;

    public Person() { }

    public Person(Long id, String name, String role) {
        this.id = id;
        this.name = name;
        this.role = role;
    }

    public Long getId() {
        return this.id;
    }

    public String getName() {
        return this.name;
    }

    public String getRole() {
        return this.role;
    }
}

You do not need to set a column name if the column names in the PostgreSQL database and the Java app are the same; Hibernate takes care of the mapping. It is, however, recommended to use different names for security.

Note: You can create Entity objects for the other tables as well; I will not explore this, since it is outside the scope of the migrations topic, and we already covered it in the RESTful APIs post shared above. After this step, you need to create a repository to fetch the objects from the PostgreSQL database. You can also use a RESTful controller to expose the output over HTTP, or just log it manually.
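A minimal repository sketch (the package and interface names are illustrative) could look like this:

package com.demos.pgschemamigrations.repositories;

import com.demos.pgschemamigrations.models.Person;
import org.springframework.data.jpa.repository.JpaRepository;

// Spring Data JPA derives the standard CRUD queries for the people table at runtime.
public interface PersonRepository extends JpaRepository<Person, Long> { }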

At this point, we have our PostgreSQL database schema ready and prepared, and we also have our Java web app ready to connect to and query the PostgreSQL database using Hibernate. When you start the application, Spring Boot will automatically apply the migrations on boot-up and verify the schema changes. Here is what happens when we launch our application on a fresh PostgreSQL database copy.

In the startup log, you will find that Flyway (o.f.core) applies the migrations to the PostgreSQL database. The migrations are applied in this sequence:

  1. Flyway connects to the PostgreSQL database (using the Spring Boot configuration / Flyway configuration; the Spring Boot configuration is searched first).
  2. Flyway checks whether the PostgreSQL database contains the required schema and table.
  3. Flyway then checks whether the PostgreSQL database contains a table with the Flyway history (this is part of what makes the code-first approach work).
  4. Flyway iterates over all the migration files in our migrations directory and applies them one by one to the PostgreSQL database.
  5. Lastly, Flyway shows a success message indicating that the schema has been generated.

We can verify this through the OmniDB tool and see the schema, the tables, and the Flyway history.

We can query flyway_schema_history (a table auto-generated by Flyway) to see the migration history and the order in which the migrations were applied. You can use the following SQL query:

SELECT * FROM migrations.flyway_schema_history;

The output lists each applied migration in order.

You can note that schema migrations are just SQL scripts executed on the PostgreSQL database, as shown in the 5th record, where I am inserting the records into the PostgreSQL database. Neither the Flyway tool nor the PostgreSQL database has any bias between DDL and DML when it comes to schema migrations.

The interesting bit is that if you run the application without any new migrations, Flyway will not make any changes to your PostgreSQL database; it is a constant-time operation that checks your PostgreSQL database for the latest migration and compares it with the migrations in the app resources. Rerunning the application confirms this in the log:

Flyway checks that no migrations are necessary, so it leaves the PostgreSQL database schema as it is. Now, let’s see how we can update the schema on a production database.

If we use the RESTful API and send a request to load data from the PostgreSQL database, we can see that the data is captured and returned without any issue.

So far, so good. Now we can go ahead and see how to change the schema in the code-first approach.

Warning: Make sure that you engage your DBAs once your PostgreSQL database is live in production.

We will make the changes in our Java entity first (remember, code-first!) and add the experience field. Now our Java object becomes:

package com.demos.pgschemamigrations.models;

import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Table;

@Entity
@Table(name = "people")
public class Person {

    @Id
    private Long id;

    @Column(name = "name")
    private String name;

    @Column(name = "role")
    private String role;

    @Column(name = "experience")
    private Integer experience;

    public Person() { }

    public Person(Long id, String name, String role, Integer experience) {
        this.id = id;
        this.name = name;
        this.role = role;
        this.experience = experience;
    }

    public Long getId() {
        return this.id;
    }

    public String getName() {
        return this.name;
    }

    public String getRole() {
        return this.role;
    }

    public Integer getExperience() {
        return this.experience;
    }
}

Now we create a new migration that changes the table schema to include the years of experience in the role. In SQL, the change to alter the table looks like this:

ALTER TABLE migrations.people ADD COLUMN experience INTEGER DEFAULT 0;

We now have 5 migrations of our own, and we restart the application. Flyway finds a new migration pending, and applies it to the PostgreSQL database:

Now, we can confirm the schema changes from OmniDB as well:

The table contains the experience field, as required by our application. As I mentioned earlier, you should consult a DBA to apply the changes as they should happen in the PostgreSQL database.

Apply via CLI

Flyway also offers an automation-friendly way of applying the migrations to a PostgreSQL database before you release to production, using the Flyway CLI (a sample invocation follows below). This approach can help you reduce or control the downtime for a PostgreSQL database. You can also use the Maven goals of the Flyway plugin to apply the migrations.
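For example, a pipeline step could apply the pending migrations with the Flyway command-line tool along these lines (the connection details mirror the application.properties above and are illustrative):

flyway -url=jdbc:postgresql://172.17.0.2:5432/postgres -user=postgres -password=mysecretpassword \
    -schemas=migrations -locations=filesystem:src/main/resources/db/migration migrate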

You can use your DevOps pipelines to prepare the PostgreSQL database for the upcoming release of the Java app.

Version control

It is a good approach to add the migrations to version control, so each version of the application contains the matching version of the PostgreSQL database schema. Version control systems such as Git handle this well, and they can even control when, and by whom, changes can be submitted to the PostgreSQL database schema folder.

This can provide a greater separation of concerns between DBAs and Java engineers, in that the DBAs can work on the database schema in isolation: your migrations do not need to live in the application and can be kept in a separate, independent project.

In this post, you learned how to synchronize your PostgreSQL database schema and Java web application models in a web-app-first (code-first) approach. You learned how to use Flyway to generate and apply migrations and how to keep track of them. You also saw how to apply changes as they are needed.

New MySQL 8.0.21 and Percona XtraBackup 8.0.13 Issues

Feed: Planet MySQL
Author: MySQL Performance Blog

On Monday, July 13, 2020, Oracle released MySQL 8.0.21. This release contained a few new changes that cause issues with Percona XtraBackup.

First, this release introduced the ability to temporarily disable InnoDB redo logging (see the work log and documentation).  If you love your data, this feature should ONLY be used to speed up an initial logical data import and should never be used under a production workload.

When used, this new feature creates some interesting complications for Percona XtraBackup that you need to be aware of.  The core requirement for XtraBackup to be able to make consistent hot backups is to have some form of redo or write ahead log to copy and read from.  Disabling the InnoDB redo log will prevent XtraBackup from making a consistent backup.  As of XtraBackup 8.0.13, the process of making a backup with the InnoDB redo log disabled will succeed, but the backup will be useless.  We are working towards adding some validation to a future release of XtraBackup that will cause it to detect and fail earlier in the process.
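Before starting a backup you can check whether redo logging is currently enabled, for example with the status variable that ships with this feature (a sketch):

SHOW GLOBAL STATUS LIKE 'Innodb_redo_log_enabled';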

XtraBackup has the --lock-ddl option that can help prevent a backup from failing due to a DDL operation occurring during the backup. This option can also help prevent the toggling of the InnoDB redo log during the backup process. Here is the current truth table that explains the potential outcome based on the various combinations of starting state and --lock-ddl (a sample invocation follows the table):

  • With --lock-ddl
    • With REDO enabled at start – good backup
    • With REDO disabled at start – bad backup, prepare may even succeed
    • Cannot disable or enable redo during backup
  • Without --lock-ddl
    • With REDO enabled – good backup
    • With REDO disabled – bad backup, prepare may even succeed
    • Can disable and enable redo during backup; if so, bad backup, prepare may even succeed
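For example, a full backup with DDL locking could be taken along these lines (the paths and credentials are illustrative):

$ xtrabackup --backup --lock-ddl --user=backup_user --password=... --target-dir=/data/backups/full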

Second, this worklog modified the handling of the ALTER UNDO TABLESPACE tablespace_name SET INACTIVE operation and will cause XtraBackup to hang during the backup phase if this operation is executed and exists within the InnoDB redo log.

Look out for the upcoming release of Percona Server 8.0.21 and Percona XtraBackup 8.0.14 which will have more advanced detection and locking to ensure that you are making good, clean, consistent backups.

New Logical Backup and Restore Utilities in the MySQL Shell

Feed: Planet MySQL
Author: Dave Stokes

The MySQL Shell, or mysqlsh, version 8.0.21 comes with three new utilities to perform logical backups and restores. They were designed to make it easier to move your 5.7 or 8.0 data to the new MySQL Database Service, but they also work well by themselves. They feature compression, the ability to show the progress of the activity, and can spread the work over multiple threads.

The util.dumpInstance() utility backs up the entire MySQL instance, util.dumpSchemas() lets you determine which schemas (or databases) to back up, and util.loadDump() is the restoration tool.

Backing Up Instances

util.dumpInstance("/tmp/instance", { "showProgress" : "true" })
<some output omitted for brevity>
1 thds dumping – 100% (52.82K rows / ~52.81K rows), 0.00 rows/s, 0.00 B/s uncompressed, 0.00 B/s compressed         
Duration: 00:00:00s                                                                                        
Schemas dumped: 4                                                                                          
Tables dumped: 26                                                                                          
Uncompressed data size: 3.36 MB                                                                            
Compressed data size: 601.45 KB                                                                            
Compression ratio: 5.6                                                                                     
Rows written: 52819                                                                                        
Bytes written: 601.45 KB                                                                                   
Average uncompressed throughput: 3.36 MB/s                                                                 
Average compressed throughput: 601.45 KB/s   

The above was performed on an old laptop with a spinning disk, limited RAM, and the latest Fedora. I have used these utilities on much bigger instances and have found the performance to be outstanding.

This utility and the others featured in this blog have a lot of options, and I suggest setting a 'pager' to read through the online help with \h util.dumpInstance.

Schema Backup

I created a quick test database named demo with a table named x (don't you love the creativity here) filled with about one million records of four INTEGER columns plus an INTEGER primary key. And remember, the output below is from a ten-plus-year-old laptop.

JS > util.dumpSchemas(['demo'], "/tmp/demo")
Acquiring global read lock
All transactions have been started
Locking instance for backup
Global read lock has been released
Writing global DDL files
Preparing data dump for table `demo`.`x`
Writing DDL for schema `demo`
Writing DDL for table `demo`.`x`
Data dump for table `demo`.`x` will be chunked using column `id`
Running data dump using 4 threads.
NOTE: Progress information uses estimated values and may not be accurate.
Data dump for table `demo`.`x` will be written to 1 file
1 thds dumping – 100% (1000.00K rows / ~997.97K rows), 577.69K rows/s, 19.89 MB/s uncompressed, 8.58 MB/s compressed
Duration: 00:00:01s                                                                                                 
Schemas dumped: 1                                                                                                   
Tables dumped: 1                                                                                                    
Uncompressed data size: 34.44 MB                                                                                    
Compressed data size: 14.85 MB                                                                                      
Compression ratio: 2.3                                                                                              
Rows written: 999999                                                                                                
Bytes written: 14.85 MB                                                                                             
Average uncompressed throughput: 20.11 MB/s                                                                         
Average compressed throughput: 8.67 MB/s                 

That is impressive performance. And yes, you can back up multiple schemas at one time by putting their names in the array passed as the first argument.

And Restoring

The best backups in the world are useless unless you can restore from them.  I did a quick rename of the table x to y and then restored the data.

Be sure to have local_infile set to "ON" on the target server before proceeding.
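For example (a sketch; this requires a suitably privileged account):

SET GLOBAL local_infile = ON;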

JS > util.loadDump("/tmp/demo")
Loading DDL and Data from '/tmp/demo' using 4 threads.
Target is MySQL 8.0.21. Dump was produced from MySQL 8.0.21
Checking for pre-existing objects…
Executing common preamble SQL
Executing DDL script for schema `demo`
Executing DDL script for `demo`.`x`
[Worker003] demo@x@@0.tsv.zst: Records: 999999  Deleted: 0  Skipped: 0  Warnings: 0
Executing common postamble SQL                        
                                     
1 chunks (1000.00K rows, 34.44 MB) for 1 tables in 1 schemas were loaded in 20 sec (avg throughput 1.72 MB/s)
0 warnings were reported during the load.

Summary

These three utilities are very fast and powerful tools for keeping your data safe. Maybe mysqldump has seen its day. And these three utilities, together with the clone plugin, are proof that you can quickly save, copy, and restore your data faster than ever.

Tatsuo Ishii: Snapshot Isolation Mode

Feed: Planet PostgreSQL

Pgpool-II 4.2 is under development

Pgpool-II developers have been working hard on the upcoming Pgpool-II 4.2 release, which is expected around this September or October.

Snapshot isolation mode is coming

For the new release I have just added a new clustering mode called "snapshot isolation mode", which should be interesting not only for Pgpool-II users but also for PostgreSQL users, so I decided to write about it in this blog.

Existing clustering mode

Until Pgpool-II 4.1, we have had the following clustering modes:

  • Streaming replication mode. This is the most widely used mode at this point. PostgreSQL's streaming replication handles the replication task in this mode.
  • Native replication mode. Until streaming replication became available, this was the most common setting. In this mode Pgpool-II issues the same data-modifying statements (DML/DDL) to all the PostgreSQL servers. As a result the same data is replicated on all PostgreSQL servers.
  • Slony mode. In this mode replication is done by Slony-I instead of streaming replication. This had been popular until streaming replication came up.
  • Raw mode. No replication is done in this mode. Rarely used.

The new “Snapshot isolation mode” is pretty close to the existing native replication mode. It does not use the streaming replication at all. Replication is done by sending DML/DDL statements to all the PostgreSQL servers.  If so, what’s the difference?

Atomic visibility

Before explaining that I would like to talk about the background a little bit.

PostgreSQL developers have been discussing technologies that can scale to handle large data volumes. One such technical direction is extending FDW (Foreign Data Wrapper), which sends queries to other PostgreSQL servers to replicate or shard data. The idea is similar to the native replication mode in Pgpool-II because it also sends DML/DDL to servers to replicate data.

As far as replicating data goes, both look good so far. However, there is a problem when multiple sessions try to access the data at the same time.

Suppose we have two sessions: session 1 continuously executes UPDATE t1 SET i = i + 1; session 2 continuously executes INSERT INTO log SELECT * FROM t1. Table t1 has only one row and its initial value is 0 (each UPDATE and INSERT is actually executed in an explicit transaction).
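In SQL terms, the two sessions repeatedly run something like this (a sketch of the workload described above):

-- Session 1, repeated in an explicit transaction
BEGIN;
UPDATE t1 SET i = i + 1;
COMMIT;

-- Session 2, repeated in an explicit transaction
BEGIN;
INSERT INTO log SELECT * FROM t1;
COMMIT;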

So after session 1 and session 2 end, the log table should contain exactly the same sequence of numbers, like "0, 1, 2, 3…", on both PostgreSQL servers. But consider the example session described below:

Session 1's UPDATE was executed at time T1 on server 1, while on server 2 it was executed at time T2. Session 2's INSERT was executed at time T3 and picked up the value 1 on server 1. Unfortunately, on server 2 the INSERT also ran at T3, before the UPDATE at T2 had happened, so the value used there was 0.

So after the sessions end, we will see different sequences of numbers in the log table on server 1 and server 2. Without atomic visibility this kind of data inconsistency can happen. Atomic visibility guarantees that the values seen on server 1 and server 2 are the same as long as they are read in the same session. In the example above, the data logged into the log table would be either 0 on both servers (t1 not yet updated/committed) or 1 on both servers (t1 already updated/committed).

Snapshot isolation mode guarantees atomic visibility

The upcoming snapshot isolation mode differs from the existing native replication mode in that it guarantees atomic visibility, so we do not need to worry about the kind of data inconsistency shown in the example above. In fact, we have a new regression test that exercises the snapshot isolation mode using essentially the same SQL as in the example. Please note that in order to use snapshot isolation mode, the transaction isolation level must be REPEATABLE READ. (postgres_fdw also runs foreign transactions in this mode.)
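For example, the backend PostgreSQL servers can be made to use this isolation level by default (a sketch; this is a standard PostgreSQL setting, not a Pgpool-II parameter):

ALTER SYSTEM SET default_transaction_isolation = 'repeatable read';
SELECT pg_reload_conf();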

How to achieve atomic visibility?

PostgreSQL developers already recognize the necessity of atomic visibility. Postgres-XC, a fork of PostgreSQL, invented a Global Transaction Manager to achieve atomic visibility. Some PostgreSQL developers are proposing a CSN (Commit Sequence Number) approach. However, these approaches need major surgery in the PostgreSQL code.

Pgpool-II uses a completely different approach, called "Pangea", proposed in an academic paper. With Pangea, no modification to PostgreSQL is necessary; you can even use older versions of PostgreSQL, such as 8.0. So you can enjoy Pgpool-II's snapshot isolation mode today with your favorite version of PostgreSQL.

In Pangea the idea is called global snapshot isolation, which extends the snapshot isolation of a local server across servers. That is why we call the new clustering mode "snapshot isolation mode".

How does Pangea work?

With Pangea, the first command of each transaction needs to wait if other transactions are trying to commit. If no commit is ongoing, the command acquires a snapshot, which defines the visibility of the transaction. Since the local transactions on each server acquire their snapshots while there are no committing transactions, they can be regarded as consistent. In the example above, the INSERT command in session 2 would be forced to wait until T4, by which time session 1 has committed; thus each INSERT command in session 2 will see "1" in table t1. The same applies to commits: a commit must wait if there is an ongoing snapshot acquisition.

Summary

Pgpool-II 4.2 will come with a new clustering mode called "snapshot isolation mode", which guarantees atomic visibility, something not provided by current standard PostgreSQL. Unlike other techniques for implementing atomic visibility, Pgpool-II does not require any modification to PostgreSQL, and it can be used even with older versions of PostgreSQL. Stay tuned.

Matillion ETL 1.47 Release notes

Feed: Matillion
Author: Julie Polito

Check out our release video by Ed Thompson, CTO, for a guided tour of Matillion version 1.47.

Want the very best Matillion ETL experience? Each new version of Matillion ETL is better than the last. Make sure you are on the latest version to take advantage of the new features, new components, and improvements. In this blog, find out about the new components, features, and improvements introduced in Matillion ETL v1.47. 

New and Updated Data Source Components 

Create Your Own Connector (New)

We understand that your business has a growing number of data sources that you need to connect to in order to centralize all your business-critical data in your cloud data warehouse. To ensure that you can easily access all your data, we are happy to announce the launch of our new Create Your Own Connector, which comprises an API Extract Wizard to guide you step by step. With this feature, you can easily create your own custom connectors, allowing near-limitless connections via any REST API. With Create Your Own Connector, you will never have to ask the question "Do you have a connector for X?" again. Read our Create Your Own Connector announcement to learn more.

We also want to hear from you about what sources you have been able to connect to – join the Create Your Own Connector discussion on the Matillion ETL Community.

Updated Enterprise Component Features 

Assert components (Update)

Matillion ETL would like to announce further improvements to the Assert components – Assert Table, Assert External Table, Assert Scalar Variable and Assert View – that were introduced in version 1.46. These components are now able to auto-populate when connected to a relevant component.

Learn more about Assert components in this blog and video.

New / Updated Features

Snowflake Enhancements

Matillion always aims to advance our products alongside cloud data warehouse improvements. To align with recent Snowflake product updates, we have introduced multiple enhancements to Matillion ETL for Snowflake, including:

  • File Pattern Matching for External Tables
  • Set the Default DDL Collation for a table
  • Load Semi-structured data directly into Target tables based upon column name matches
  • New Aggregate Function – Skew
  • Display Source Jobs under folders in Migration Wizard
  • Support for Unloading to Parquet File Format with LZO Compression
  • Converting Null Specifiers and Trimming White Space
  • New Window Calculation Function – Kurtosis
  • Loading CSV Data Replacing Invalid Characters
  • Support for Appended Rows in Streams

For a more detailed look at the latest Snowflake enhancements in Matillion ETL, check out this blog.  

Open ID Support

In this release we have added Open ID support for Google BigQuery and Azure Synapse users, making it available in all Matillion ETL products. This offers a number of different options for configuring Matillion user integration. Read more about Open ID here.

New in Matillion ETL for Azure Synapse 

Since we released Matillion ETL for Azure Synapse on May 12, we have been working to develop additional functionality. In 1.47, users now have: 

New Orchestration and Transformation Components

In order to match the functionality of other Matillion products and to help you create more advanced ETL workflows, we’d like to announce the support of the following components:

Improved Functionality

Matillion ETL for Azure Synapse is continuously improving. As such we have updated the following features in order to provide more coverage:

  • Rewrite Table Component supports Table Partitioning
  • Azure Blob Storage Load Component supports Parquet and ORC files
  • External Tables improvements
    • Tables listed in the Environment Tree for ease of selection
    • Table Input Component supports External Tables
    • Delete functionality added for External Tables

Full release notes are available on the Support Site.

You can also head over to the community to discuss 1.47 with other Matillion ETL users. 

Ready to upgrade to Matillion v1.47? 

For more information on how to upgrade, check out our blog on Best practices for updating your Matillion ETL instance.

Matillion ETL for Snowflake Improvements in Version 1.47

Feed: Matillion
Author: Ashley Lozito

At Matillion, we design purpose-built ETL products for each cloud data warehouse we support. Our method provides users with a seamless experience, allowing them to utilize cloud data warehouse functionality from within Matillion ETL. For this reason, we are always updating our products to reflect the advancements made in cloud data warehouse technology. In Matillion ETL for Snowflake version 1.47, we have introduced a number of improvements and enhancements for our Snowflake users.  

What’s new and improved in Matillion ETL for Snowflake?

File pattern matching for external tables

The CREATE [OR REPLACE] EXTERNAL TABLE statement within Snowflake has been enhanced to include a PATTERN parameter that allows users to specify files to be matched on the external stage using regular expression pattern syntax.

For example, using a pattern ‘.*flight.*[.]csv’ will load any files with the extension .csv that contain the text ‘flight’ anywhere within the name section of the filename.
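
As an illustration only (the stage name, file format and table name below are assumptions, not taken from the release notes), the pattern is supplied when defining the external table:

CREATE OR REPLACE EXTERNAL TABLE flights_ext
  WITH LOCATION = @my_ext_stage/flights/
  PATTERN = '.*flight.*[.]csv'
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);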

Set the default DDL collation for a table

The recently-released DEFAULT_DDL_COLLATION parameter can be used to set a default collation specification for new table columns that do not have an explicit collation specification.

Any currently supported collation specifiers can be provided as values for this parameter.
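
A minimal sketch, with assumed schema and table names, of how the parameter might be applied:

-- Default collation for new columns created in this schema
ALTER SCHEMA my_schema SET DEFAULT_DDL_COLLATION = 'en-ci';

CREATE TABLE my_schema.customers (
  first_name STRING,               -- picks up 'en-ci' implicitly
  last_name  STRING COLLATE 'de'   -- an explicit collation still takes precedence
);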

Load semi-structured data directly into target tables based on column name matches

This release introduces support for loading semi-structured data directly into columns in the target table that match columns represented in the input data file. This feature is implemented through a new MATCH_BY_COLUMN_NAME copy option. This copy option enables loading columns represented in semi-structured data into separate columns in a target table.
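
For example, a hedged sketch with assumed stage and table names:

COPY INTO my_schema.events
  FROM @my_ext_stage/events/
  FILE_FORMAT = (TYPE = PARQUET)
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;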

New aggregate function – skew

This newly added function returns the sample skewness of non-NULL records. In basic terms, this describes how asymmetric the underlying distribution is.
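
For instance (table and column names are illustrative):

SELECT SKEW(ad_revenue) AS revenue_skew FROM clickstream_data;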

Display source jobs under folders in Migration Wizard

In previous versions of Matillion ETL for Snowflake, if a project was set up to use folders to better organize components, the Migration Wizard did not make use of these folders when migrating jobs. All jobs were shown as one single list of job names.

With this release, we now show jobs listed under the appropriate folders so it’s easier to navigate, as well as select and deselect jobs of interest.

If the project is set up to use the following structure,

then the Migration Wizard will display the job selection as follows:

Support for unloading to Parquet file format with LZO compression

The location-based COPY INTO command now supports Parquet files with LZO (.lzo) compression in either an internal (i.e. Snowflake) or external location.

With the implementation of this feature, the parameter SNAPPY_COMPRESSION has been deprecated and replaced with the COMPRESSION parameter and associated values.

The original SNAPPY_COMPRESSION=TRUE is now superseded by the COMPRESSION=SNAPPY option (this is also implied by the selection of AUTO for the compression type).
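
A sketch of an unload using the new option, assuming the stage and table names:

COPY INTO @my_unload_stage/events/
  FROM my_schema.events
  FILE_FORMAT = (TYPE = PARQUET COMPRESSION = LZO);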

Converting null specifiers and trimming white space

The table-based COPY INTO command now offers support for the NULL_IF and TRIM_SPACE file format options when loading string values from semi-structured data into separate columns in relational tables. This feature was previously only supported for CSV format data but is now extended to support JSON, Avro, ORC and Parquet file types.

The data entries representing a null value are entered, one per line, through the NULL If parameter:

Note the blank entry as the second identifier. This will identify empty strings within the incoming data.

The TRIM_SPACE parameter is set either to True (remove trailing spaces) or False (retain trailing spaces).
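
In raw SQL terms, the equivalent COPY options look roughly like this (stage and table names are assumptions):

COPY INTO my_schema.customers
  FROM @my_ext_stage/customers/
  FILE_FORMAT = (TYPE = PARQUET NULL_IF = ('NULL', '') TRIM_SPACE = TRUE)
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;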

New window calculation function – Kurtosis

This newly added function returns the population excess Kurtosis of non-NULL records. The inclusion of this function allows a determination of how the data varies from a normal distribution.
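
For example (table and column names are illustrative):

SELECT country_code, KURTOSIS(duration) AS duration_kurtosis
FROM clickstream_data
GROUP BY country_code;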

Loading CSV data replacing invalid characters

The table-based COPY INTO command now offers support for loading CSV files even if the data contains invalid UTF-8 characters. To support this functionality we now provide an additional parameter, Replace Invalid Characters, that can be set to True (replace invalid characters with the Unicode replacement character) or False (generate an error when invalid UTF-8 characters are detected in the data).

Loading CSV data ignoring blank records

The table-based COPY INTO command now offers support for loading CSV files even if the data contains blank records. To support this functionality, we now provide an additional parameter, Skip Blank Lines, that can be set to True (ignore blank records) or False (generate an end of record error).
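
Expressed directly as COPY options, the two settings map roughly to the following sketch (stage and table names are assumptions):

COPY INTO my_schema.raw_events
  FROM @my_ext_stage/raw/
  FILE_FORMAT = (TYPE = CSV SKIP_BLANK_LINES = TRUE REPLACE_INVALID_CHARACTERS = TRUE);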

Support for appended rows in streams

The CREATE STREAM component now includes a parameter, Append Only, to specify the stream as append only. When set to True, the stream will track row inserts into a table but will ignore updates and deletes. This can provide a performance improvement over the standard behaviour of the stream with Append Only set to the default value of False.
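
A minimal sketch, with assumed names:

CREATE OR REPLACE STREAM orders_insert_stream
  ON TABLE my_schema.orders
  APPEND_ONLY = TRUE;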

Learn more about Matillion ETL v1.47

Check out what else is new in release 1.47, including our new Create Your Own Connector functionality. 

Help shape future versions of Matillion ETL in the Matillion Community

Do you have an idea for the Matillion ETL for Snowflake product roadmap? Head over to the Community and submit your feature request to the Ideas Portal


Replication Between Two Percona XtraDB Clusters, GTIDs and Schema Changes


Feed: Planet MySQL
;
Author: Sveta Smirnova
;

I got this question on the “How to Avoid Pitfalls in Schema Upgrade with Percona XtraDB Cluster (PXC)” webinar and wanted to answer it in a separate post.

Will RSU have an effect on GTID consistency if replication PXC cluster to another cluster?

Answer for this: yes and no.

Galera assigns its own GTID for the operations, replicated to all nodes of the cluster. Such operations include DML ( INSERT/UPDATE/DELETE ) on InnoDB tables and DDL commands, executed with default TOI method. You can find more details on how GTIDs work in the Percona XtraDB Cluster in this blog post.

However, DDL commands, executed with RSU method, are applied locally and have their own, individual, GTID.

Let’s set up a replication between two PXC clusters and see how it works.

First, let’s use the default wsrep_osu_method  TOI and create three tables on each node of the source cluster:
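
A minimal sketch of the kind of statements involved (table names are illustrative; TOI is already the default, so setting it explicitly is optional):

SET GLOBAL wsrep_OSU_method = 'TOI';   -- TOI is the default value
CREATE TABLE t1 (id INT PRIMARY KEY);  -- executed on node 1
CREATE TABLE t2 (id INT PRIMARY KEY);  -- executed on node 2
CREATE TABLE t3 (id INT PRIMARY KEY);  -- executed on node 3
SELECT @@global.gtid_executed;         -- the same UUID is reported on every node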

You can see that all GTIDs have the same UUID, 24f602ff-cb98-11ea-beb2-ba09d9a11266, and the transaction number increases no matter which node is the source of the change.

All changes successfully replicate. As a result, the replica has received and applied GTIDs as can be seen in the SHOW SLAVE STATUS  output:

With the RSU method we execute the DDL on each node while it is out of sync with the rest of the cluster. After the operation finishes, nothing is replicated to the other nodes by itself; instead, it relies on the DBA to perform the changes manually. Therefore the GTID for such an operation uses a local UUID:

As you can see, this same operation created GTIDs with three different UUIDs on the three nodes: 25394777-cb98-11ea-a23a-98af65266957, 322ff3eb-cb98-11ea-8a94-98af65266957 and 3ab8cf00-cb98-11ea-b433-98af65266957.

Replica cluster received GTID from the node which set up as a replication source:

So by default RSU does not generate any issue with GTID.

However, if you later need to perform a failover and set up any other node as the replication source, the replica will try to apply the local GTIDs and fail with an error:

The only solution here is to inject an empty transaction in place of the one created by the RSU operation:
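
A minimal sketch on the replica, with a placeholder standing in for the actual local GTID created by the RSU DDL:

STOP SLAVE;
SET gtid_next = 'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee:1';  -- the failing local GTID
BEGIN; COMMIT;                                             -- the empty transaction
SET gtid_next = 'AUTOMATIC';
START SLAVE;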

Conclusion

Operations in RSU mode create local GTIDs with a UUID different from the one used cluster-wide. They do not cause any error until you need to perform a failover and replace the current replication source with another node.

How SnapLogic eXtreme Helps Visualize Spark ETL Pipelines on Amazon EMR


Feed: AWS Partner Network (APN) Blog.
Author: Jobin George.

By Jobin George, Sr. Partner Solutions Architect at AWS
By Rich Dill, Enterprise Solutions Architect, Partner Engineering at SnapLogic
By Dhananjay Bapat, Sr. Technical Product Marketing Manager at SnapLogic

Half of all data lake projects fail because organizations don’t have enough skilled resources who can code in Java, Scala, or Python, and do processing on large amounts of data. Additionally, skilled data engineers need a lot of time and effort to operationalize workloads that prepare, cleanse, and aggregate data to cloud data warehouses for analytics and insights.

Fully managed cloud services are rapidly enabling global enterprises to focus on strategic differentiators versus maintaining infrastructure. They do this by creating data lakes and performing big data processing in the cloud.

SnapLogic, an AWS Partner Network (APN) Advanced Technology Partner with the AWS Data & Analytics Competency, further simplifies an organization’s move to fully-managed data architecture.

SnapLogic eXtreme enables citizen integrators, those who can’t code, and data integrators to efficiently support and augment data-integration use cases by performing complex transformations on large volumes of data.

In this post, we provide step-by-step instructions to show you how to set up SnapLogic eXtreme and use Amazon EMR to do Amazon Redshift extract, transform, and load (ETL).

Using AWS CloudFormation, we create all the Amazon Web Services (AWS) resources required to complete the exercise, and guide you through to configure and complete ETL process using SnapLogic eXtreme.

We’ll also walk you through loading data from Amazon Simple Storage Service (Amazon S3) as well as an Amazon Redshift Spectrum table, and then perform transformations and load the transformed data into an Amazon Redshift table.

About SnapLogic eXtreme

Many organizations need to support use cases where there’s a need to perform complex transformations on millions of rows of data, such as:

  • Claims processing for insurance companies.
  • Fraud detection and risk management for financial services companies.
  • Customer behavior analytics and personalization for retailers.
  • Reducing healthcare costs and improving patient experience for healthcare providers.

SnapLogic eXtreme extends the unified and easy-to-use SnapLogic Intelligent Integration Platform (IIP) to build and submit powerful Spark-based pipelines visually, without writing a single line of code, to managed big data as a service (BDaaS) providers, such as Amazon EMR.

The SnapLogic visual programming interface eliminates the need for error-prone manual coding procedures, leading to quicker time-to-value without the traditional dependence on complex IT or data-engineering teams.

Unlike other data integration solutions that require integrators to have detailed knowledge on how to build and submit Spark jobs, SnapLogic eXtreme allows business users with domain expertise to perform complex processing and transformations on extremely large volumes of data within the enterprise’s existing big data infrastructure.

SnapLogic-Amazon-EMR-1

Figure 1 – High-level overview of SnapLogic eXtreme platform.

As shown in Figure 1, the first step in the enterprise data journey is the Capture phase. The modern enterprise has data in many data silos, and each of these sources can provide valuable business insights even if the exact use case isn’t known ahead of time. Data is captured for both known and unknown use cases.

The Conform phase involves the processing of raw data in order to make it available to the business. Data is conformed to corporate standards ensuring governance, quality, consistency, regulatory compliance, and accuracy for downstream consumers.

The Refine phase is where the data is transformed with the eventual downstream application in mind, be that a business intelligence (BI) tool, specific application, or even a common data model for consumption by many downstream applications. Of course, this will involve aggregations, summary calculations, and more.

Finally, the Deliver phase is where the refined data is delivered to end systems, such as a cloud data warehouse or applications. SnapLogic eXtreme enhances SnapLogic’s Enterprise Integration Cloud platform to include big data processing use cases.

Security Controls and Capabilities

The SnapLogic platform complies with high security and compliance standards while providing customers with secure data integration. It supports an authentication and privilege model that allows you to grant, limit, or restrict access to components and pipelines, and it comes with easy data management and encryption capabilities.

Below are some of the key security features of SnapLogic eXtreme.

Enhanced Account Encryption: You can encrypt account credentials that access endpoints from SnapLogic using a private key/public key model that leverages AWS Key Management Service (KMS). The data is encrypted with a public key before it leaves the browser, and then decrypted with a private key on the Groundplex. The private key needs to be manually copied to each node in a Groundplex.

Cross-Account IAM Role: You can specify a cross-account identity and access management (IAM) role for an AWS account to spin up clusters on Amazon EMR without needing to share the secret and access keys of their AWS accounts.

An org admin can configure the password expiration policy, and can enforce the policy to all the users in the organization. Also, SnapLogic supports Single Sign-On (SSO) through the Security Assertion Markup Language (SAML) standard.

The SnapLogic cloud runs with a signed TLS (SSL) certificate. The client sending the HTTP request will validate the Certificate Authority (CA) certificate to verify its validity. For additional security, the org admin can configure a trusted IP address list to enforce access control for nodes in a Snaplex that connects to the SnapLogic control plane.

To further enhance the SnapLogic platform security, you can also disable external process (like popen) creation on Cloudplex via the Script Snap or a custom Snap. Additionally, you can also disable read/write access to files in the Snaplex installation folder while executing pipelines.

Using eXtreme to Visualize EMR Spark Pipelines for Redshift ETL

The diagram in Figure 2 illustrates the architecture of how SnapLogic eXtreme helps you create visual pipelines to transform and load data into Amazon Redshift using Apache Spark on Amazon EMR.

We have two data sources; one resides in Amazon S3 and the other in Spectrum table. The target is Amazon Redshift internal tables.

SnapLogic-Amazon-EMR-2

Figure 2 – Architectural overview of SnapLogic eXtreme on AWS.

Prerequisites

Here is the minimum required setup to complete the hands-on tutorial.

  • Before starting this tutorial, you need to have an AWS account. In this exercise, you configure the required AWS resources using AWS CloudFormation in the us-east-1 region. If you haven’t signed up, complete the following tasks:
  • Active SnapLogic account with eXtreme enabled. If you don’t have an account, sign up for a trial and activate eXtreme in your account.

Create Amazon Redshift Cluster and Cross-Account Role

We have used the following AWS CloudFormation template to create a single node Amazon Redshift instance running on dc2.large to be used as the target in the below exercise.

This will also create three Amazon S3 buckets required for our eXtreme pipelines. It will also create a few IAM roles, including cross-account trust IAM roles required for eXtreme to spin up resources in your AWS account.

Click Launch Stack to initiate CloudFormation stack creation in the us-east-1 region.

Launch Stack

On the Create Stack page, acknowledge the resource Capabilities and click Create Stack. Wait for the stack creation to complete, which can take up to 10 minutes to complete.

Once this is done, check the Stack Output tab and note down all of the items. We’ll require them for future steps.

SnapLogic-Amazon-EMR-3.1

You may verify the Amazon Redshift cluster, IAM roles, and S3 bucket location created as part of this CloudFormation deployment using the AWS console.

Create Spectrum Schema and Source Table

A single node Amazon Redshift cluster is created as part of the CloudFormation deployment. Connect to the cluster in order to create the source Spectrum table pointing towards a dataset stored in Amazon S3.

Connect to Redshift Query Editor from the AWS console to create the required Amazon Redshift table. Navigate to Services > Redshift > Query Editor and provide the following credentials to connect to it:

  • Cluster: <Select the cluster with end point noted from the CloudFormation output >
  • Database Name: snaplogic
  • Database User: snapuser
  • Database Password: Snapuser01

Once on the query editor page, execute the following DDL command to create an external schema named spectrum_schema. Use the value of RedshiftSpectrumRoleARN noted from the CloudFormation output within single quotes below:

create external schema spectrum_schema from data catalog
database 'spectrum_db'
iam_role '<RedshiftSpectrumRoleARN>'
create external database if not exists;

SnapLogic-Amazon-EMR-4

Once the schema is created, execute the following DDL command to create a Spectrum table pointing to data stored in S3 on a separate query editor tab:

create external table spectrum_schema.clickstream_data_external(
custKey int4, yearmonthKey int4, visitDate int4, adRevenue float, countryCode char(3), destURL varchar(100), duration int4, languageCode char(6), searchWord varchar(32), sourceIP varchar(116), userAgent varchar(256)) stored as parquet location 's3://redshift-spectrum-bigdata-blog-datasets/clickstream-parquet1/00fd0c2ead5ea05601f20f7e81032e741c10b2c03898737780f6c853ddabcb16/'
table properties ('numRows'='7203992');

SnapLogic-Amazon-EMR-5

Configure SnapLogic eXtremeplex on Amazon EMR

Once the previous section is completed, you can configure eXtremeplex now. All the prerequisites for eXtremeplex have been created as part of the CloudFormation deployment.

The following steps take you through how to establish connectivity. Visit the eXtreme documentation to learn more about what permissions and IAM roles were created.

Now, log in to the SnapLogic Control Plane User Interface using your account credentials.

Next, navigate to the SnapLogic Manager tab, and on the left panel towards the bottom you will find projects. Click on the downward arrow to display options and click Create Project. Let’s name our project eXtreme-APN-Blog.

SnapLogic-Amazon-EMR-6

Click on the project you just created and navigate to the Account tab. Click the ‘+’ sign to add an account, and from the list select Extreme > AWS IAM Role Account.

On the Create Account pop-up, provide a name for the account, such as AWS Cross Trust Account, and provide <SnapLogicCrossAccountARN> from the CloudFormation output page.

Click Validate and then Apply to add the account.

SnapLogic-Amazon-EMR-7

While on the Account tab, click the ‘+’ sign again to add another account. From the list, select Extreme Redshift > Redshift Database Account.

On the Create Account pop-up, provide the following details:

  • Label: Redshift Cluster
  • Hostname: <RedshiftEndpoint> from CloudFormation output page
  • Port Number: 5439
  • Database Name: snaplogic
  • Username: snapuser
  • Password: Snapuser01

Once updated, click Apply and save the details you entered.

Now, navigate to the Snaplexes tab, and click the ‘+’ sign to add a Snaplex of type eXtremeplex. In the pop-up window, provide these details:

  • Snaplex Type: eXtremeplex
  • Name: Redshift_ETL_On_EMR
  • Environment: Dev
  • Account Type: AWS IAM Role Account
  • Region: us-east-1 [select Account created above from Account Tab first]
  • Instance Type: m4.4xlarge
  • Market: On Demand
  • S3 Log Bucket: <LogsS3BucketName> from CloudFormation output page
  • S3 Artifact Bucket: <ArtifactS3BucketName> from CloudFormation output page
  • EC2 Instance Profile: SnapLogicEMREC2DefaultRole
  • EMR Role: SnapLogicEMRdefaultRole

Before saving, navigate to the Advanced tab to configure Auto Scaling Capabilities. Provide the following configurations and leave everything else to default:

  • Auto Scaling: <Checked>
  • Maximum Cluster Size: 5
  • Auto Scaling Role: SnapLogicEMRAutoScalingDefaultRole

Click Create to create the eXtremeplex. Now, you have all the configurations required for executing an eXtreme pipeline. Let’s build some!

Import and Configure eXtremeplex Pipeline into SnapLogic Designer

Once the eXtreme environment is configured, you can start creating pipelines, but here you may use an already created one. Here’s a link for you to download a pre-created pipeline, S3_to_Redshift.slp, to your local system.

The pipeline you just downloaded is designed to load raw files in parquet format from Amazon S3 to an Amazon Redshift table, as well as data already in an Amazon Redshift Spectrum table into an Amazon Redshift internal table.

This pipeline contains only six snaps, including a Redshift Select snap, a Transform snap, Redshift Insert snaps and a File Reader snap, which read data from S3 and Amazon Redshift Spectrum and insert it into Amazon Redshift internal tables.

Before you begin, navigate to the SnapLogic Designer tab, and on the top left side you will find the Import a Pipeline icon. Click it to import a new pipeline.

When prompted, select the saved file S3_to_Redshift.slp, and in the settings prompted make sure you choose eXtreme-APN-Blog which you created in the previous section.

Click Save to load it on the designer. Now, you can see the pipelines as below:

SnapLogic-Amazon-EMR-9

Click on the Redshift Select snap and make sure you update TempS3BucketName noted from the CloudFormation output page in the S3 Folder section.

Navigate to the Account tab and select the Redshift Cluster account you created in the previous section. Click Save to proceed. Leave other configurations as is.

SnapLogic-Amazon-EMR-10

Repeat the same steps for both Redshift Insert snaps as well. Make sure you update the TempS3BucketName noted from the CloudFormation output page in the S3 Folder section.

Navigate to the Account tab and select the Redshift Cluster account you created in the previous section. Click Save to proceed.

SnapLogic-Amazon-EMR-11

Next, let’s configure the S3 File Reader snap. Click on the snap, navigate to the Accounts tab, and click the Add Account button.

In the pop-up, select the location as projects/eXtreme-APN-Blog and choose account type to create as “AWS Account” and click OK.

When asked, provide a label and update the Access-Key ID and Secret Key noted from the CloudFormation output page. Click Apply.

Now, you are ready to execute your pipeline. Everything else required to read data from Amazon S3, Redshift Spectrum, and target table details for the Amazon Redshift table are configured in the pipeline already.

Click and open the Transform snap and look how visitDate field from an integer type in yyyyMMdd format is converted into date type in yyyy-MM-dd format.

Running an eXtremeplex Pipeline on Amazon EMR

Now that the pipeline is uploaded and configured, it’s time to run it and load data into Amazon Redshift.

Once the pipeline is configured, click on Execute Pipeline to start the pipeline execution. Make sure you have selected the Redshift_ETL_On_EMR snaplex you created in the previous section.

Once you start the pipeline, you may navigate to the Amazon EMR console to see the EMR spark cluster starting up.

SnapLogic-Amazon-EMR-12

It may take a few minutes to bootstrap the cluster. Once it’s started, click on the cluster and select the Application History tab. You will find the pipeline being executed.

Note that only the first execution takes time to spin up the cluster; based on your configuration, the Amazon EMR cluster will continue to run for a period of time and then shut down after a period of inactivity.

SnapLogic-Amazon-EMR-13

This application may take a few minutes to complete, as well. Once completed, navigate to Redshift Query Editor from the AWS console to verify the Amazon Redshift tables were created.

In the public schema, you’ll notice the new tables are created in the Data Object panel on the left. Right-click and select Preview Data and you’ll see the data being populated.

SnapLogic-Amazon-EMR-14

You may also simply run the below count query. You should see approximately ~7.2 million records loaded into both the Amazon Redshift tables.

Select count(*) from public.clickstream_internal;
select count(*) from public.data_from_s3;

Best Practices While Using SnapLogic eXtreme

To optimize the cost for running an Amazon EMR cluster, we recommend you terminate an idle cluster after 30 minutes. You can specify a timeout while configuring the cluster template.

Do not use colon ( : ) in Amazon S3 and Hadoop file or directory names, as the colon is interpreted as a URI separator, giving this error — Relative path in absolute URI: <filewithcolon>.

To optimize the cost and time in validating pipelines each time you modify it, ensure you disable the Auto Validate checkbox in the user Settings dialog.

For the Spark SQL 2.x CSV Parser Snap, the maximum number of columns is 20480 per row. However, system resource limitations may cause the pipeline to fail validation before the maximum number is reached.

Regarding performance optimization, executing eXtreme pipelines works most efficiently by setting the Spark parameter spark.sql.files.maxPartitionBytes to a value of 128 MB, which is the Spark default setting.

Because this setting value performs better for most of the use cases, we recommend you do not change it. In SnapLogic Designer, specify it in the Pipeline parameters as spark_sql_files_maxPartitionBytes.

Amazon EMR clusters give multiple configuration options. Based on our tests, for a 100GB dataset, the following cluster configuration performs better for most of the use cases: m4.4xlarge or r4.4xlarge with 8 nodes.

Accordingly, in terms of scaling for larger datasets, customers should scale the number of nodes in their cluster depending on the increase in dataset size. For example, if the dataset size is increased to 200GB, you should double the cluster size to 16 nodes.

When reading or writing your Spark jobs to the Snowflake data warehouse, we recommend you follow these best practices:

  • Use the X-Small size warehouse for regular purposes.
  • Use a Large/X-Large size warehouse for faster execution of pipelines that ingest data sets larger than 500GB.
  • Use one pipeline per warehouse at any given time to minimize the execution time.

Clean Up

When you finish, also remember to clean up all of the other AWS resources you created using AWS CloudFormation. Use the CloudFormation console or AWS Command Line Interface (CLI) to delete the stack named SnapLogic-eXtreme-Blog.

Note that when you delete the stack, created Amazon S3 buckets will be retained by default. You will have to manually empty and delete those S3 buckets.

Conclusion

In this post, you learned how to set up eXtremeplex to connect to your AWS environment and execute Amazon Redshift ETL workloads using Spark pipelines from the SnapLogic user interface, running them on a transient EMR Spark cluster.

We imported a SnapLogic eXtreme pipeline designed to use data sets residing in Amazon S3, as well as in a Redshift Spectrum table as source, while the target is a Redshift internal table. We also demonstrated how to do transformation before we load the data into Amazon Redshift.

SnapLogic eXtreme extends the unified and easy-to-use SnapLogic Intelligent Integration Platform (IIP) to build and submit powerful Spark-based pipelines visually, without writing a single line of code to managed big data as a service (BDaaS) providers, such as Amazon EMR.

To learn more about SnapLogic, get started with the AWS Quick Start: Data Lake with SnapLogic on AWS.



SnapLogic – APN Partner Spotlight

SnapLogic is an AWS Competency Partner. Through its visual, automated approach to integration, SnapLogic uniquely empowers business and IT users to accelerate integration needs for applications, data warehouse, big data, and analytics initiatives.

Contact SnapLogic | Solution Overview

*Already worked with SnapLogic? Rate this Partner

*To review an APN Partner, you must be an AWS customer that has worked with them directly on a project.

Michał Mackiewicz: Pitfalls and quirks of logical replication in Postgres 12


Feed: Planet PostgreSQL.

Logical replication, in a nutshell, is a process of “replaying” data changes in another database. The first attempts to implement replication in Postgres – Slony, Bucardo – were logical replication solutions. In contrast to binary (streaming) replication, it offers greater flexibility: data can be replicated selectively – only relevant databases and tables, the replica server remains a fully functional instance that can have its own users, security rules and non-replicated databases, and in some cases the performance is better.

However, the logical replication is not as foolproof as binary replication, and without proper care it can lead to a primary server crash. I’d like to share some thoughts after setting up a logical replication in a large-ish (one terabyte) production database.

Do not proceed without monitoring

Seriously, if you don’t have a robust monitoring solution that will warn you against system abnormalities – especially running out of disk space – set it up and test before any attempt to implement logical replication. This is because logical replication can break silently without any SQL exceptions, cause the WAL files to pile up, fill the entire disk and bring the primary server down. At least two things have to be monitored:

  • disk usage,
  • errors in Postgres log.

Only with such early warning system, you will be able to fix any issues before they cause an outage.

Replicate to the same server? Possible, but…

Logical replication can be used to replicate data between databases in a single Postgres cluster. It’s a perfectly valid setup, but it requires special treatment: you have to create a logical replication slot first, and with the slot already in place, create a subscription pointing to that slot. If you try to set up replication in a default way – with automatic slot creation – the CREATE SUBSCRIPTION command will hang. No errors or warnings – just a hung statement.
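
A minimal sketch of the order of operations, assuming the database, publication and subscription names:

-- In the source database: create the slot explicitly
SELECT pg_create_logical_replication_slot('my_sub', 'pgoutput');

-- In the destination database (same cluster): point the subscription at the existing slot
CREATE SUBSCRIPTION my_sub
  CONNECTION 'host=localhost port=5432 dbname=source_db user=repl'
  PUBLICATION my_pub
  WITH (create_slot = false, slot_name = 'my_sub');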

Be careful with PUBLICATION FOR ALL TABLES

Like all previous logical replication solutions, native replication doesn’t replicate data structure changes. If any DDL command is issued (CREATE, ALTER, DROP) it must be executed manually both on primary and replica(s). The FOR ALL TABLES modifier of a CREATE PUBLICATION statement doesn’t change this limitation. Instead, it will cause the replication to stop as soon as you (or your application, or a Postgres extension) issue a CREATE TABLE statement. Also, FOR ALL TABLES will automatically include any tables created by extensions (like spatial_ref_sys from PostGIS), and tables that don’t have a primary key or replica identity – both cases are problematic.

Spatial is special

The spatial_ref_sys table is a part of PostGIS extension, and it’s populated by CREATE EXTENSION postgis; statement. More often than not, it shouldn’t be replicated, as every PostGIS instance populates it itself. If you have to replicate it (for example, you work with coordinate systems that aren’t part of EPSG registry), remember to TRUNCATE the spatial_ref_sys table on replica before creating subscription.

Review the primary keys carefully

A table eligible for logical replication must have a primary key constraint or replica identity – that’s the rule. It’s nothing new as it was the same with previous logical replication solutions, but its enforcement in native Postgres replication is at least weird. You are allowed to add a table without a PK or replica identity to a publication; it won’t cause any error, but… it will block any write activity to it!

You will need to add a PK as soon as possible (or, if you can’t afford an exclusive lock for the duration of the unique index creation, REPLICA IDENTITY FULL will be just fine, albeit less performant) to unlock write access.
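
A minimal sketch, with an assumed table and column name:

-- Preferred: a proper primary key
ALTER TABLE my_table ADD PRIMARY KEY (id);

-- Fallback when the exclusive lock for the index build is not affordable right now
ALTER TABLE my_table REPLICA IDENTITY FULL;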

After adding a new table, refresh publication

It’s not enough to execute “ALTER PUBLICATION … ADD TABLE …” on primary server in order to add a new table to replication. You have to log in into a replica database and execute the following statement:

ALTER SUBSCRIPTION … REFRESH PUBLICATION;

Summary

Logical replication is great and has many use cases: separating transactional and analytical workloads, aggregating changes from multiple databases, and so on. However, it is not as simple as it looks at first glance. Follow the rules and use it with caution, and you should enjoy fast, flexible data replication.

Tatsuo Ishii: New statistics data in Pgpool-II


Feed: Planet PostgreSQL.

The upcoming Pgpool-II 4.2 will provide new statistics data, such as the number of INSERT/DELETE/UPDATE statements issued. In this blog I will show you what it will look like.

Existing statistics

Before jumping into the new feature, let’s take a look at the existing statistics data. PostgreSQL already gives you tuple access data, like the number of tuples inserted in a database (see the pg_stat_database manual). Pgpool-II 4.1 and before give the number of SELECTs sent to PostgreSQL in the “show pool_nodes” command.

How is like the new statistics?

A new command called “show pool_backend_stats” provides statistics data such as the number of INSERT, UPDATE and DELETE statements issued since Pgpool-II started up. So it’s not like PostgreSQL’s pg_stat_database, in that PostgreSQL provides the number of tuples accessed, while Pgpool-II provides the number of SQL commands issued. If you want tuple-based statistics you can always access PostgreSQL’s statistics data.

 Here is an example of the new statistics data:

test=# show pool_backend_stats;
 node_id | hostname | port  | status |  role   | select_cnt | insert_cnt | update_cnt | delete_cnt | ddl_cnt | other_cnt | panic_cnt | fatal_cnt | error_cnt 
---------+----------+-------+--------+---------+------------+------------+------------+------------+---------+-----------+-----------+-----------+-----------
 0       | /tmp     | 11002 | up     | primary | 12         | 10         | 30         | 0          | 2       | 30        | 0         | 0         | 1
 1       | /tmp     | 11003 | up     | standby | 12         | 0          | 0          | 0          | 0       | 23        | 0         | 0         | 1
(2 rows) 

select_cnt, insert_cnt, update_cnt, delete_cnt, ddl_cnt and other_cnt are the numbers of SQL commands issued since Pgpool-II started up. What is “other_cnt”? Well, it is a count of: CHECKPOINT/DEALLOCATE/DISCARD/EXECUTE/EXPLAIN/LISTEN/LOAD/LOCK/NOTIFY/PREPARE/SET/SHOW/Transaction commands/UNLISTEN. If an SQL command does not belong to SELECT/INSERT/UPDATE/DELETE/other, then it is counted as DDL.

Please note that these counts are incremented even if an SQL command fails or is rolled back.

panic_cnt, fatal_cnt and error_cnt are the numbers of errors returned from PostgreSQL. They are counts of PANIC, FATAL or ERROR. These error statistics are handy because PostgreSQL does not provide error statistics.

Summary

The “show pool_backend_stats” command gives you an at-a-glance view of whole-cluster statistics on the number of SQL commands issued. Moreover, the error counts, which do not exist in PostgreSQL, could be useful for those who are interested in database or application failures.

How to Migrate On-premises MySQL Enterprise Database to OCI Compute Instance MySQL Enterprise Database ?


Feed: Planet MySQL
;
Author: Chandan Kumar
;

Migrating On-premises MySQL Enterprise Database to OCI Compute Instance MySQL Enterprise Database ?

In this post, we will walk through the steps needed to migrate a particular database (for example, a Sales database) from a local instance (on-premises) to an Oracle Cloud Infrastructure compute instance.

We will use two new features introduced with the latest release of MySQL, 8.0.21:

  1. Dump Schema Utility: helps us take a backup of the on-premises database and export it to Oracle Cloud Object Storage.
  2. Load Dump Utility: helps us import the schema from Object Storage into the compute instance.

More info:- https://dev.mysql.com/doc/mysql-shell/8.0/en/mysql-shell-utilities-dump-instance-schema.html

How does the migration work?

Suppose you want to do a lift-and-shift of a database called “sales”. The Schema Dump utility will export the sales database from on-premises to OCI (Oracle Cloud Infrastructure) Object Storage. Then the Load Dump utility will import it directly into the MySQL instance running on the OCI compute instance.

I have made the diagram below for a clearer understanding…

What do we need handy?

  1. MySQL Shell 8.0.21.
  2. On-premises MySQL up and running.
  3. Cloud instance up and running.
  4. OCI CLI installed on the on-premises machine.
  5. OCI CLI installed on the OCI (Oracle Cloud) compute instance.
  6. The local_infile variable must be ON on the destination machine.

Additional Details

On-Premises Instance Details:

  • Database Name: Sales
  • IP: 192.168.1.10
  • User: root
  • Port: 3306

Oracle Cloud Compute Instance Details:

  • Public IP Address: 

Command to export the backup from on-premises to OCI Object Storage

MySQL  localhost:33060+ ssl  Py > util.dump_schemas(["sales"], "worlddump", {"osBucketName": "BootcampBucket", "osNamespace": "idazzjlcjqzj", "ocimds": "true", "ociConfigFile": "/root/.oci/config", "compatibility": ["strip_definers", "strip_restricted_grants"]})

Checking for compatibility with MySQL Database Service 8.0.21
NOTE: Database sales had unsupported ENCRYPTION option commented out
Compatibility issues with MySQL Database Service 8.0.21 were found and repaired. Please review the changes made before loading them.
Acquiring global read lock
All transactions have been started
Locking instance for backup
Global read lock has been released
Writing global DDL files
Preparing data dump for table `sales`.`employee`
Writing DDL for schema `sales`
Writing DDL for table `sales`.`employee`
Data dump for table `sales`.`employee` will be chunked using column `empid`
Running data dump using 4 threads.
NOTE: Progress information uses estimated values and may not be accurate.
Data dump for table `sales`.`employee` will be written to 1 file
1 thds dumping - 100% (2 rows / ~2 rows), 0.00 rows/s, 12.00 B/s uncompressed, 0.00 B/s compressed
Duration: 00:00:03s
Schemas dumped: 1
Tables dumped: 1
Uncompressed data size: 39 bytes
Compressed data size: 0 bytes
Compression ratio: 39.0
Rows written: 2
Bytes written: 0 bytes
Average uncompressed throughput: 10.61 B/s
Average compressed throughput: 0.00 B/s

MySQL  localhost:33060+ ssl  Py >

Import Dump file into Compute Instance from OCI Object Storage

util.loadDump("worlddump", {threads: 8, osBucketName: "BootcampBucket", osNamespace: "idazzjlcjqzj", ociConfigFile: "/root/.oci/config"})

Verify The results

On-Premises Database:-

Conclusion:-

Migration happened successfully!!!

MySQL Shell 8.0.21 makes MySQL easier to use, by providing an interactive MySQL client supporting SQL, Document Store, JavaScript & Python interfaces with support for writing custom extensions.

And with dumpInstance(), dumpSchemas(), importTable() and loadDump(), MySQL Shell now provides powerful logical dump and load functionality.

===========Rough Notes=========================================

So, what challenges came up while doing this migration? What errors occurred and, most importantly, how were they fixed?

Let’s have a look at the errors one by one…

Error#01: Traceback (most recent call last):
  File "<string>", line 1, in <module>
SystemError: RuntimeError: Util.dump_schemas: Cannot open file: /root/.oci/config.

Fix:

Install the OCI CLI on your local machine.

Installing the CLI (document available on D:ChandanCloudOCI_Notes)

https://docs.cloud.oracle.com/en-us/iaas/Content/API/SDKDocs/cliinstall.htm

 

Error#02:

MySQL  localhost:33060+ ssl  Py > util.dump_schemas(["sales"], "salesdump", {"osBucketName": "BootcampBucket", "osNamespace": "idazzjlcjqzj", "ocimds": "true", "compatibility": ["strip_definers", "strip_restricted_grants"]})

Traceback (most recent call last):
  File "<string>", line 1, in <module>
SystemError: RuntimeError: Util.dump_schemas: Failed to get object list using prefix 'salesdump/': The required information to complete authentication was not provided. (401)

Fix:

util.dump_schemas(["sales"], "worlddump", {"osBucketName": "dumpbucket-2", "osNamespace": "idazzjlcjqzj", "ocimds": "true", "compatibility": ["strip_definers", "strip_restricted_grants"]})

Note: don’t change the name of worlddump.

Error#03:

MySQL  localhost:33060+ ssl  Py > util.dump_schemas(["sales"], "worlddump", {"osBucketName": "BootcampBucket", "osNamespace": "idazzjlcjqzj", "ocimds": "true", "ociConfigFile": "/root/.oci/config", "compatibility": ["strip_definers", "strip_restricted_grants"]})

Traceback (most recent call last):
  File "<string>", line 1, in <module>
SystemError: RuntimeError: Util.dump_schemas: Failed to list multipart uploads: Either the bucket named 'BootcampBucket' does not exist in the namespace 'idazzjlcjqzj' or you are not authorized to access it (404)

Fix:

Make sure you have set the policies:

Allow group chandangroup to read buckets in compartment chandankumar-sandbox
Allow group chandangroup to manage objects in compartment chandankumar-sandbox where any {request.permission='OBJECT_CREATE', request.permission='OBJECT_INSPECT'}
Allow group chandangroup to read objects in compartment chandankumar-sandbox

 

Error#04:

MySQL  localhost:33060+ ssl  JS > util.dumpSchemas(["sales"], "worlddump", {"osBucketName": "BootcampBucket", "osNamespace": "idazzjlcjqzj", "ocimds": "true", "ociConfigFile": "/root/.oci/config", "compatibility": ["strip_definers", "strip_restricted_grants"]})

Checking for compatibility with MySQL Database Service 8.0.21
NOTE: Database sales had unsupported ENCRYPTION option commented out
Compatibility issues with MySQL Database Service 8.0.21 were found and repaired. Please review the changes made before loading them.
Acquiring global read lock
All transactions have been started
Locking instance for backup
Global read lock has been released
Writing global DDL files
Preparing data dump for table `sales`.`employee`
Writing DDL for schema `sales`
Writing DDL for table `sales`.`employee`
WARNING: Could not select a column to be used as an index for table `sales`.`employee`. Chunking has been disabled for this table, data will be dumped to a single file.
Running data dump using 4 threads.
NOTE: Progress information uses estimated values and may not be accurate.
Data dump for table `sales`.`employee` will be written to 1 file
ERROR: [Worker001]: Failed to rename object 'worlddump/sales@employee.tsv.zst.dumping' to 'worlddump/sales@employee.tsv.zst': Either the bucket named 'BootcampBucket' does not exist in the namespace 'idazzjlcjqzj' or you are not authorized to access it (404)
Util.dumpSchemas: Fatal error during dump (RuntimeError)

Fix:

Make sure you have given rename permission/policies on objects:

Allow group chandangroup to manage objects in compartment chandankumar-sandbox where any {request.permission='OBJECT_CREATE', request.permission='OBJECT_INSPECT', request.permission='OBJECT_OVERWRITE', request.permission='OBJECT_DELETE'}

Final policies:

Go to the OCI console -> Identity -> Policies -> add Policy

Thank you for using MySQL!!!

Please test and let us know your feedback…

Share your feedback on improvements on my blog, thank you!

MySQL Shell Python mode blog posts compilation


Feed: Planet MySQL
;
Author: Joshua Otwell
;

Over the last few months, I have written numerous blog posts on different features of the MySQL Shell ranging from basic CRUD to aggregate functions and DDL. As a part of the MySQL version 8 release, MySQL Shell is a powerful and alternative environment that you can manage and work with your data in using a choice of 3 languages: Python, Javascript, or SQL. So this blog post is a simple compilation of all the Python mode related posts, in one easy-to-access location…


Self-Promotion:

If you enjoy the content written here, by all means, share this blog and your favorite post(s) with others who may benefit from or like it as well. Since coffee is my favorite drink, you can even buy me one if you would like!


Python mode in the MySQL Shell

Below is the compilation list of posts. I hope you enjoy them and please share them along!


Like what you have read? See anything incorrect? Please comment below and thanks for reading!!!

A Call To Action!

Thank you for taking the time to read this post. I truly hope you discovered something interesting and enlightening. Please share your findings here, with someone else you know who would get the same value out of it as well.

Visit the Portfolio-Projects page to see blog post/technical writing I have completed for clients.

Have I mentioned how much I love a cup of coffee?!?!

To receive email notifications (Never Spam) from this blog (“Digital Owl’s Prose”) for the latest blog posts as they are published, please subscribe (of your own volition) by clicking the ‘Click To Subscribe!’ button in the sidebar on the homepage! (Feel free at any time to review the Digital Owl’s Prose Privacy Policy Page for any questions you may have about: email updates, opt-in, opt-out, contact forms, etc…)

Be sure and visit the “Best Of” page for a collection of my best blog posts.


Josh Otwell has a passion to study and grow as a SQL Developer and blogger. Other favorite activities find him with his nose buried in a good book, article, or the Linux command line. Among those, he shares a love of tabletop RPG games, reading fantasy novels, and spending time with his wife and two daughters.

Disclaimer: The examples presented in this post are hypothetical ideas of how to achieve similar types of results. They are not the utmost best solution(s). The majority, if not all, of the examples provided, is performed on a personal development/learning workstation-environment and should not be considered production quality or ready. Your particular goals and needs may vary. Use those practices that best benefit your needs and goals. Opinions are my own.

Migrate from on premise MySQL to MySQL Database Service


Feed: Planet MySQL
;
Author: Frederic Descamps
;

If you are running MySQL on premise, it’s maybe the right time to think about migrating your lovely MySQL database somewhere where the MySQL Team prepared a comfortable place for it to stay running and safe.

This awesome place is MySQL Database Service in OCI. For more information about what MDS is and what it provides, please check this blog from my colleague Airton Lastori.

One important word that should come to your mind when we talk about MDS is SECURITY !

Therefore, MDS endpoint can only be a private IP in OCI. This means you won’t be able to expose your MySQL database publicly on the Internet.

Now that we are aware of this, if we want to migrate an existing database to the MDS, we need to take care of that.

What is my case ?

When somebody needs to migrate their current MySQL database, the first question that needs to be answered is: can we afford a large downtime?

If the answer is yes, then the migration is easy:

  • you stop your application(s)
  • you dump MySQL
  • you start your MDS instance
  • you load your data into MDS

and that’s it !

In case the answer is no, things are of course more interesting and this is the scenario I will cover in this post.

Please note that the application is not covered in this post and of course, it’s also recommended to migrate it to the cloud, in a compute instance of OCI for example.

What’s the plan ?

To migrate successfully a MySQL database from on premise to MDS, these are the actions I recommend:

  1. create a VCN with two subnets, the public and the private one
  2. create a MDS instance
  3. create a VPN
  4. create an Object Storage Bucket
  5. dump the data to be loaded in MDS
  6. load the data in MDS
  7. create an in-bound replication channel in MDS

The architecture will look like this:

Virtual Cloud Network

The first thing to do when you have your OCI access is to create a VCN from the dashboard. If you have already created some compute instances, these steps are not required anymore:

 

You can use Start VCN Wizard, but I will cover the VPN later in this article. So let’s just use Create VCN:

We need a name and a CIDR Block, we use 10.0.0.0/16:

 

This is what it looks like:

 

Now we click on the name (lefred_vcn in my case) and we need to create two subnets:

 

We will create the public one on 10.0.0.0/24:

 

and the Private one on 10.0.1.0/24.

After these two steps, we have something like this:

 

MySQL Database Service Instance

We can create a MDS instance:

 

And we just follow the creation wizard that is very simple:

 

It’s very important to create an admin user (the name can be what you want) and don’t forget the password. We put our service in the private subnet we just created.

 

The last screen of the wizard is related to the automatic backups:

 

The MDS instance will be provisioned after a short time and you can see that in its detailed view:

 

VPN

OCI allows you to very easily create IPSEC VPNs with all the enterprise-level hardware used in the industry. Unfortunately I don’t have such an opportunity at home (and no need for it), so I will use another supported solution that is more appropriate for domestic usage: OpenVPN.

If you are able to deploy the IPSEC solution, I suggest you use it.

 

On that new page, you have a link to the Marketplace where you can deploy a compute instance to act as OpenVPN server:

 

You need to follow the wizard and make sure to use the vcn we created and the public subnet:

 

The compute instance will be launched by Terraform. When done, we will be able to reach the OpenVPN web interface on the public IP that was assigned to this compute instance, using the credentials we entered in the wizard:

 

In case you lost those logs, the IP is available on the Compute -> Instances page:

 

As soon as the OpenVPN instance is deployed, we can go on the web interface and setup OpenVPN:

 

As we want to be able to connect from our MDS instance to our on-premise MySQL server for replication, we will need to set up our VPN to use Routing instead of NAT:

 

We also specified two ranges as we really want to have a static IP for our on-premise MySQL Instance, otherwise, the IP might change the next time we connect to the VPN.

The next step is the creation of a user we will use to connect to the VPN:

 

The settings are very important:

 

Save the settings and click on the banner to restart OpenVPN.

Now, we connect using the user we created to download his profile:

That client.ovpn file needs to be copied to the on-premise MySQL Server.

If OpenVPN is not yet installed on the on-premise MySQL Server, it’s time to install it (yum install openvpn).

Now, we copy the client.ovpn in /etc/openvpn/client/ and we call it client.conf:

# cp client.ovpn /etc/openvpn/client/client.conf

We can start the VPN:

# systemctl start openvpn-client@client
Enter Auth Username: lefred
Enter Auth Password: ******

We can verify that the VPN connection is established:

# ifconfig tun0
tun0: flags=4305<UP,POINTOPOINT,RUNNING,NOARP,MULTICAST>  mtu 1500
inet 172.27.232.134 netmask 255.255.255.0 destination 172.27.232.134
inet6 fe80::9940:762c:ad22:5c62 prefixlen 64 scopeid 0x20<link>

unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 txqueuelen 100 (UNSPEC)
RX packets 1218 bytes 102396 (99.9 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 1287 bytes 187818 (183.4 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

systemctl status openvpn-client@client can also be called to see the status.

Object Storage

To transfer our data to the cloud, we will use Object Storage.

And we create a bucket:

 

Dump the Data

To dump the data of our on-premise MySQL server, we will use MySQL Shell that has the capability to Load & Dump large datasets in an optimized and compatible way for OCI since 8.0.21.

Please check those links for more details:

OCI Config

The first step is to create an OCI config file that will look like this:

[DEFAULT]
user=ocid1.user.oc1..xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
fingerprint=xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx
key_file=/home/lefred/oci_api_key.pem
tenancy=ocid1.tenancy.oc1..xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
compartment=ocid1.compartment.oc1..xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
region=us-ashburn-1

The user information and key can be found under the Identity section:

Please refer to this manual page to generate a PEM key.

Now that we have an oci config file (called oci.config in my case), we need to verify that our current MySQL server is using GTID:

on-premise mysql> select @@gtid_mode;
+-------------+
| @@gtid_mode |
+-------------+
| OFF         |
+-------------+
1 row in set (0.00 sec)

By default GTID mode is disabled and we need to enable it. To be able to perform this operation without restarting the MySQL instance, this is how to proceed:

on-premise mysql> SET PERSIST server_id=1;
on-premise mysql> SET PERSIST enforce_gtid_consistency=true;
on-premise mysql> SET PERSIST gtid_mode=off_permissive;
on-premise mysql> SET PERSIST gtid_mode=on_permissive;
on-premise mysql> SET PERSIST gtid_mode=on;
on-premise mysql> select @@gtid_mode;
+-------------+
| @@gtid_mode |
+-------------+
| ON          |
+-------------+

Routing & Security

We need to add some routing and firewall rules to our VCN to allow the traffic from and to the VPN.

 

Now that we dealt with routing and security, it’s time to dump the data to Object Store by connecting MySQL Shell to our on-premise server and use util.dumpInstance():

$ mysqlsh
MySQL JS > \c root@localhost
[…]
MySQL localhost:33060+ ssl JS > util.dumpInstance('onpremise', {ociConfigFile: "oci.config",
osBucketName: "lefred_bucket", osNamespace: "xxxxxxxxxxxx", threads: 4,
ocimds: true, compatibility: ["strip_restricted_grants", "strip_definers"]})

You can also find more info on this MDS manual page.

Load the data in MDS

The data is now already in the cloud and we need to load it in our MDS instance.

We first connect to our MDS instance using Shell. We could use a compute instance in the public subnet or the VPN we created. I will use the second option:

MySQL localhost:33060+ ssl JS > \c admin@10.0.1.11
Creating a session to 'admin@10.0.1.11'
Fetching schema names for autocompletion… Press ^C to stop.
Closing old connection…
Your MySQL connection id is 283 (X protocol)
Server version: 8.0.21-u1-cloud MySQL Enterprise – Cloud
No default schema selected; type use to set one.

It’s time to load the data from Object Storage to MDS:

MySQL 10.0.1.11:33060+ ssl JS > util.loadDump('onpremise', {ociConfigFile: "oci.config",
osBucketName: "lefred_bucket", osNamespace: "xxxxxxxxxxxx", threads: 4})
Loading DDL and Data from OCI ObjectStorage bucket=lefred_bucket, prefix='onpremise'
using 4 threads.
Target is MySQL 8.0.21-u1-cloud. Dump was produced from MySQL 8.0.21
Checking for pre-existing objects…
Executing common preamble SQL
Executing DDL script for schema employees
Executing DDL script for employees.departments
Executing DDL script for employees.salaries
Executing DDL script for employees.dept_manager
Executing DDL script for employees.dept_emp
Executing DDL script for employees.titles
Executing DDL script for employees.employees
Executing DDL script for employees.current_dept_emp
Executing DDL script for employees.dept_emp_latest_date
[Worker002] employees@dept_emp@@0.tsv.zst: Records: 331603 Deleted: 0 Skipped: 0 Warnings: 0
[Worker002] employees@dept_manager@@0.tsv.zst: Records: 24 Deleted: 0 Skipped: 0 Warnings: 0
[Worker003] employees@titles@@0.tsv.zst: Records: 443308 Deleted: 0 Skipped: 0 Warnings: 0
[Worker000] employees@employees@@0.tsv.zst: Records: 300024 Deleted: 0 Skipped: 0 Warnings: 0
[Worker002] employees@departments@@0.tsv.zst: Records: 9 Deleted: 0 Skipped: 0 Warnings: 0
[Worker001] employees@salaries@@0.tsv.zst: Records: 2844047 Deleted: 0 Skipped: 0 Warnings: 0
Executing common postamble SQL
6 chunks (3.92M rows, 141.50 MB) for 6 tables in 1 schemas were loaded in
5 min 28 sec (avg throughput 431.39 KB/s)
0 warnings were reported during the load.

We still need to set the GTID purged information from when the dump was taken.

In MDS, this operation can be achieved by calling a dedicated procedure, sys.set_gtid_purged().

Now let’s find the value we need to add there. The value of GTID executed from the dump is written in the file @.json. This file is located in Object Storage and we need to retrieve it:

When you have the value of gtidExecuted in that file you can set it in MDS:

MySQL 10.0.1.11:33060+ ssl SQL > call sys.set_gtid_purged("ae82914d-e096-11ea-8a7a-08002718d305:1")

In-bound Replication

Before stopping our production server running MySQL on premise, we need to resync the data. We also need to be sure we have moved everything we need to the cloud (applications, etc…) and certainly run some tests. This can take some time and during that time we want to keep the data up to date. We will then use replication from on-premise to MDS.

Replication user creation

On the production MySQL (the one still running on the OCI compute instance), we need to create a user dedicated to replication:

mysql> CREATE USER 'repl'@'10.0.1.%' IDENTIFIED BY 'C0mpl1c4t3d!Paddw0rd' REQUIRE SSL;
mysql> GRANT REPLICATION SLAVE ON *.* TO 'repl'@'10.0.1.%';

Creation of the replication channel

We go back on OCI’s dashboard and in our MDS instance’s details page, we click on Channels:

 

We now create a channel and follow the wizard:

 

We use the credentials we just created and as hostname we put the IP of our OpenVPN client: 172.27.232.134

After a little while, the channel will be created and in MySQL Shell when connected to your MDS instance, you can see that replication is running.

 

Wooohooo it works! \o/

Conclusion

As you can see, transferring the data and creating a replication channel from on-premise to MDS is easy. The most complicated part is the VPN and dealing with the network, but that is straightforward for a sysadmin. This is a task that you have to do only once, and it’s the price to pay for a more secure environment.


How to take MySQL Enterprise Edition backup using Instance Dump Features Introduced in MySQL 8.0.21?


Feed: Planet MySQL
;
Author: Chandan Kumar
;

Using the Instance Dump and Schema Dump features introduced in MySQL 8.0.21.

In this blog I will cover the following topics:

1. What are the Instance Dump/Schema Dump features all about?
2. What are the advantages?
3. What are the disadvantages of using Instance Dump?
4. Performance benchmarks
5. Conclusion

What are the Instance Dump/Schema Dump features all about?

MySQL Instance Dump is another logical backup option where the backup can be processed with multiple threads and compressed files, which helps improve the performance of the overall backup process and also saves disk space.

This feature was introduced in MySQL 8.0.21, which means that to use it you must use the MySQL Shell 8.0.21 client.

To perform an instance dump there is a utility called util.dumpInstance(outputUrl[, options]).

In short, util.dumpInstance() dumps an entire database instance, including users.

MySQL Schema Dump is a more customizable option where users can back up exactly what they choose; let's say you want to back up one or more databases, or a particular table, then we use the "Schema Dump Utility": util.dumpSchemas().

In short, util.dumpSchemas() dumps a set of schemas.

The dumps created by MySQL Shell's instance dump utility and schema dump utility comprise DDL files specifying the schema structure, and tab-separated .tsv files containing the data.

By default, the dump utilities chunk table data into multiple data files and compress the files.
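
If the defaults don't suit you, these behaviours can be tuned with dump options; a minimal sketch (option names as documented for the Shell dump utilities, the paths and values here are just examples):

// use gzip instead of the default zstd compression, and larger 128M chunks
util.dumpSchemas(["test"], "/opt/packages/test_dump", {compression: "gzip", bytesPerChunk: "128M"})

// disable chunking entirely (one data file per table)
util.dumpSchemas(["test"], "/opt/packages/test_dump_nochunk", {chunking: false})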

What can be done with Instance Dump:

1. Backing up to Oracle Cloud Infrastructure Object Storage.
2. Backing up to a local machine directory.
3. Provisioning a new instance, on-premises as well as in the cloud.

The following requirements apply to dumps using the instance dump utility and schema dump utility:

  • MySQL 5.7 or later is required for both the source MySQL instance and the destination MySQL instance.
  • Object names in the instance or schema must be in the latin1 or utf8 character set.
  • Data consistency is guaranteed only for tables that use the InnoDB storage engine.
  • The upload method used to transfer files to an Oracle Cloud Infrastructure Object Storage bucket has a file size limit of 1.2 TiB.

More info: https://dev.mysql.com/doc/mysql-shell/8.0/en/mysql-shell-utilities-dump-instance-schema.html

What are the advantages of using Instance Dump?

  • Parallel dumping with multiple threads.
  • Automatic file compression using the zstd algorithm. The alternatives are gzip compression (gzip) or no compression (none).
  • Improved performance compared to mysqldump.
  • Flexibility to choose what we want to back up.
  • Option to push the backup to Oracle Cloud Infrastructure Object Storage.
  • Option to spot problems and errors in advance with the dryRun feature.

What are the disadvantages of using Instance Dump?

  • It is a logical backup, so the usual challenges of mysqldump still apply.
  • While taking the backup it takes a lock on the schemas.
  • Debugging is not easy, as the files are divided into chunks.

MySQL  localhost:3306 ssl  JS > util.dumpInstance("opt/packages/worlddump", {dryRun: true, ocimds: true})

Error:

Checking for compatibility with MySQL Database Service 8.0.21

FIX:

util.dumpInstance("C:/Users/hanna/worlddump", {dryRun: true, ocimds: true, compatibility: ["strip_definers", "strip_restricted_grants"]})

Use Case #01: Dump the complete MySQL instance.

shell-js> util.dumpInstance("/opt/packages/onpremisesdump", {ocimds: true, compatibility: ["strip_definers", "strip_restricted_grants"]})

Use Case #02: Take a backup of only a single database using the Schema Dump Utility.

MySQL  localhost:33060+ ssl  JS > util.dumpSchemas(["customerDB"], "/opt/packages/customerDB123", {threads: 20})

 

 

O/P

4 thds dumping - 98% (10.57M rows / ~10.69M rows), 170.72K rows/s, 16.87 MB/s uncompressed, 1.64 MB/s
4 thds dumping - 98% (10.58M rows / ~10.69M rows), 170.72K rows/s, 16.87 MB/s uncompressed, 1.64 MB/s
4 thds dumping - 99% (10.58M rows / ~10.69M rows), 170.72K rows/s, 16.87 MB/s uncompressed, 1.64 MB/s
4 thds dumping - 99% (10.59M

Duration: 00:01:47s

Schemas dumped: 1

Tables dumped: 922

Uncompressed data size: 1.30 GB

Compressed data size: 125.10 MB

Compression ratio: 10.4

Rows written: 10844660

Bytes written: 125.10 MB

Average uncompressed throughput: 12.17 MB/s

Average compressed throughput: 1.17 MB/s

MySQL  localhost:33060+ ssl  JS >
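
For completeness, a dump taken this way is normally restored with the companion util.loadDump() utility; a minimal sketch (the path and thread count are just the ones from the example above, and local_infile must be enabled on the target server):

// load the schema dump produced above into the current instance
MySQL  localhost:33060+ ssl  JS > util.loadDump("/opt/packages/customerDB123", {threads: 20})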

Performance Benchmarks?

Conclusion

MySQL Shell makes MySQL easier to use by providing an interactive MySQL client supporting SQL, Document Store, JavaScript & Python interfaces, with support for writing custom extensions.

And with dumpInstance(), dumpSchemas() and loadDump(), Shell now provides powerful logical dump and load functionality.

Thank you for using MySQL!!!

Sadequl Hussain: How to Get the Best Out of PostgreSQL Logs


Feed: Planet PostgreSQL.

As a modern RDBMS, PostgreSQL comes with many parameters for fine-tuning. One of the areas to consider is how PostgreSQL should log its activities. Logging is often overlooked in Postgres database management, and if not ignored, usually wrongly set. This happens because most of the time, the purpose of logging is unclear. Of course, the fundamental reason for logging is well-known, but what is sometimes lacking is an understanding of how the logs will be used. 

Each organization’s logging requirements are unique, and therefore how PostgreSQL logging should be configured will be different as well. What a financial service company needs to capture within its database logs will be different from what a company dealing with critical health information needs to record. And in some cases, they can be similar too.

In this article, I will cover some fundamental practices to get the best out of PostgreSQL logs. This blog is not a hard and fast rule book; readers are more than welcome to share their thoughts in the comments section. To get the best value out of it though, I ask the reader to think about how they want to use their PostgreSQL database server logs:

  • Legal compliance reason where specific information needs to be captured
  • Security auditing where specific event details need to be present
  • Performance troubleshooting where queries and their parameters are to be recorded
  • Day-to-day operational support where  a set number of metrics are to be monitored

With that said, let’s start.

Don’t Make Manual Changes to postgresql.conf

Any changes in the postgresql.conf file should be made using a configuration management system like Puppet, Ansible, or Chef. This ensures changes are traceable and can be safely rolled back to a previous version if necessary. This holds true when you are making changes to the logging parameters.

DBAs often create multiple copies of the postgresql.conf file, each with slightly different parameters, each for a different purpose. Manually managing different configuration files is a cumbersome task if not prone to errors. On the other hand, a configuration management system can be made to rename and use different versions of the postgresql.conf file based on a parameter passed to it. This parameter will dictate the purpose of the current version. When the need is completed, the old config file can be put back by changing the same input parameter. 

For example, if you want to log all statements running on your PostgreSQL instance, a config file with the parameter value “log_statement=all” can be used. When there is no need to record all statements – perhaps after a troubleshooting exercise – the previous config file could be reinstated.

Use Separate Log Files for PostgreSQL 

I recommend enabling PostgreSQL’s native logging collector during normal operations. To enable PostgreSQL native logging, set the following parameter to on:

logging_collector = on

There are two reasons for it:

First of all, in busy systems, the operating system may not consistently record PostgreSQL messages in syslog (assuming a nix-based installation) and often drops messages. With native PostgreSQL logging, a separate daemon takes care of recording the events. When PostgreSQL is busy, this process defers writing to the log files to let query threads finish. This can block the whole system until the log event is written. It is therefore useful to record less verbose messages in the log (as we will see later) and use shortened log line prefixes.

Secondly – and as we will see later – logs should be collected, parsed, indexed, and analyzed with a Log Management utility. Having PostgreSQL record its events in syslog will mean creating an extra layer of filtering and pattern-matching in the Log Management part to filter out all the “noise messages”. Dedicated log files can be easily parsed and indexed for events by most tools.

Set Log Destination to stderr

Let’s consider the “log_destination” parameter. It can have four values:

log_destination = stderr | csvlog | syslog | eventlog

Unless there is a good reason to save log events in comma-separated format or event log in Windows, I recommend setting this parameter to stderr. This is because with a CSV file destination, a custom “log_line_prefix” parameter value will not have any effect, and yet, the prefix can be made to contain valuable information.

On the flip side though, a CSV log can be easily imported to a database table and later queried using standard SQL. Some PostgreSQL users find it more convenient than handling raw log files. As we will see later, modern Log Management solutions can natively parse PostgreSQL logs and automatically create meaningful insights from them. With CSV, the reporting and visualization has to be manually done by the user. 

Ultimately it comes down to your choice. If you are comfortable creating your own data ingestion pipeline to load the CSV logs into database tables, cleanse and transform the data, and create custom reports that suit your business needs, then make sure the “log_destination” is set to CSV.

Use Meaningful Log File Names

When PostgreSQL log files are saved locally, following a naming style may not seem necessary. The default file name style is “postgresql-%Y-%m-%d_%H%M%S.log” for non-CSV formatted logs, which is sufficient for most cases.

Naming becomes important when you are saving log files from multiple servers to a central location like a dedicated log server, a mounted NFS volume, or an S3 bucket. I recommend using two parameters in such case:

log_directory
log_filename

To store log files from multiple instances in one place, first, create a separate directory hierarchy for each instance. This can be something like the following:

/<Application_Name>/<Environment_Name>/<Instance_Name>

Each PostgreSQL instance’s “log_directory” can then be pointed to its designated folder.

Each instance can then use the same “log_filename” parameter value. The default value will create a file like 

postgresql_2020-08-25_2359.log

To use a more meaningful name, set the “log_filename” to something like this:

log_filename = "postgresql_%A-%d-%B_%H%M"

The log files will then be named like:

postgresql_Saturday-23-August_2230

Use Meaningful Log Line Prefix

PostgreSQL log line prefixes can contain the most valuable information besides the actual message itself. The Postgres documentation shows several escape characters for log event prefix configuration. These escape sequences are substituted with various status values at run time. Some applications like pgBadger expect a specific log line prefix. 

I recommend including the following information in the prefix:

  • The time of the event (without milliseconds): %t
  • Remote client name or IP address: %h
  • User name: %u
  • Database accessed: %d
  • Application name: %a
  • Process ID: %p
  • Terminating non-session process output: %q
  • The log line number for each session or process, starting at 1: %l

To understand what each field in the prefix is about, we can add a small literal string before the field. So, process ID value can be preceded by the literal “PID=”, database name can be prefixed with “db=” etc.  Here is an example:

log_line_prefix = 'time=%t, pid=%p %q db=%d, usr=%u, client=%h , app=%a, line=%l '
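
With that prefix, a logged statement from a client session would look roughly like this (an illustrative line only; the database, user, client and statement are made up):

time=2020-08-25 14:30:05 AEST, pid=12345  db=salesdb, usr=appuser, client=10.0.2.15 , app=psql, line=7 LOG:  statement: UPDATE orders SET status = 'shipped' WHERE order_id = 1001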

Depending on where the event is coming from, the log line prefix will show different values. Both background processes and user processes will record their messages in the log file. For system processes, I have specified %q, which will suppress any text after the process ID (%p). Any other session will show the database name, user name, client address, application name, and a numbered line for each event.

Also, I included a single space after the log line prefix. This space separates the log event prefix from the actual event message. It does not have to be a space character- anything like a double colon (::), hyphen (-), or another meaningful separator can be used.

Also, set the “log_hostname” parameter to 1:

log_hostname = 1

Without this, only the client IP address will be shown. In production systems, this will typically be the address of the proxy, load balancer, or the connection pooler. Unless you know the IP addresses of these systems by heart, it may be worthwhile to log their hostnames. However, the DNS lookup will also add extra time for the logging daemon to write to the file.

Another parameter that should be set along with the “log_line_prefix” is “log_timezone”. Setting this to the local timezone of the server will ensure logged events are easy to follow from their timestamp instead of converting to local time first. In the code snippet below, we are setting the log_timezone to the Australian Eastern Standard Timezone:

log_timezone = 'Australia/Sydney'

Log Connections Only

Two parameters control how PostgreSQL records client connections and disconnections. Both parameters are off by default. Based on your organization’s security requirements, you may want to set one of these to 1 and the other to 0 (unless you are using a tool like pgBadger – more on that later).

log_connections = 1
log_disconnections = 0

Setting log_connections to 1 will record all authorized connections as well as attempted connections. This is obviously good for security auditing: a brute force attack can be easily identified from the logs. However, with this setting enabled, a busy PostgreSQL environment with thousands, or even hundreds of short-lived valid connections could see the log file getting inundated. 

Nevertheless, it can also identify application issues that may not be obvious otherwise. A large number of connection attempts from many different valid client addresses may indicate the instance needs a load balancer or connection pooling service in front of it. A large number of connection attempts from a single client address may uncover an application with too many threads that need some type of throttling.

Log DDL and DML Operations Only

There is a lot of debate around what should be recorded in the Postgres log – i.e., what should be the value of the “log_statement” parameter. It can have four values:

log_statement = 'off' | 'ddl' | 'mod' | 'all'

It may be tempting to set this to “all” to capture every SQL statement running on the server, but this may not always be a good idea in reality.

Busy production systems mostly run SELECT statements, sometimes thousands of those per hour. The instance might be running perfectly well, without any performance issues. Setting this parameter to “all” in such cases would unnecessarily burden the logging daemon as it has to write all those statements to the file.

What you want to capture, though, is any data corruption, or changes in the database structure that caused some type of issue. Unwanted or unauthorized database changes cause more application issues than selecting data; that’s why I recommend setting this parameter to “mod”. With this setting, PostgreSQL will record all DDL and DML changes to the log file.

If your PostgreSQL instance is moderately busy (dozens of queries per hour), feel free to set this parameter to “all”. When you are troubleshooting slow-running SELECT queries or looking for unauthorized data access, you can also set this to “all” temporarily. Some applications like pgBadger also expect you to set this to “all”.

Log “Warning” Messages and Up

If the “log_statement” parameter decides what type of statements will be recorded, the following two parameters dictate how detailed the message will be:

log_min_messages
log_min_error_statement

Each PostgreSQL event has an associated message level. The message level can be anything from verbose DEBUG to terse PANIC. The lower the level, the more verbose the message is. The default value for the “log_min_messages” parameter is “WARNING”. I recommend keeping it to this level unless you want informational messages to be logged as well.

The “log_min_error_statement” parameter controls which SQL statements throwing an error will be logged. Like “log_min_messages”, any SQL statement having an error severity level equal to or above the value specified in “log_min_error_statement” will be recorded. The default value is “ERROR”, and I recommend keeping the default.

Keep Log Duration Parameters to Default

Then we have the following two parameters:

log_duration
log_min_duration_statement

The “log_duration” parameter takes a boolean value. When it is set to 1, the duration of every completed statement will be logged. If set to 0, statement duration will not be logged. This is the default value, and I recommend keeping it to 0 unless you are troubleshooting performance problems. Calculating and recording statement durations makes the database engine do extra work (no matter how small), and when it is extrapolated to hundreds or thousands of queries, the savings can be significant.

Lastly, we have the “log_min_duration_statement” parameter. When this parameter is set (without any units, it’s taken as milliseconds), the duration of any statement taking equal to or longer than the parameter value will be logged. Setting this parameter value to 0 will record the duration of all completed statements. Setting this to -1 will disable statement duration logging. This is the default value, and I recommend keeping it so.

The only time you want to set this parameter to 0 is when you want to create a performance baseline for frequently run queries. Bear in mind though, if the parameter “log_statement” is set, the statements that are logged will not be repeated in the log messages showing durations. So you will need to load the log files in a database table, then join the Process ID and Session ID values from the log line prefixes to identify related statements and their durations.

Whatever the means, once you have a baseline for each frequently run query, you can set the “log_min_duration_statement” parameter to the highest of the baseline values. Now, any query running longer than the highest baseline value will be a candidate for fine-tuning.
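
As a concrete illustration, capturing only statements that run for one second or longer would look like this in postgresql.conf (the threshold is just an example):

log_min_duration_statement = 1000       # value is in milliseconds when no unit is given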

Keep Error Message Verbosity to Default

The “log_error_verbosity” parameter can have three possible values:

log_error_verbosity = terse | standard | verbose

This parameter controls the amount of information PostgreSQL will record for each event recorded in the log file. Unless debugging a database application, this parameter is best kept at its default value of “standard”. The verbose mode will be useful when you need to capture the name of the file or function and the line number that generated the error. Setting this to “terse” will suppress logging the query, which may not be useful either.

Find a Log Rotation Policy That Works for Your Use Case

I recommend creating a log rotation policy based on either the size or age of the log file, but not both. Two PostgreSQL configuration parameters dictate how old logs are archived and new logs are created:

log_rotation_age = <number of minutes>
log_rotation_size = <number of kilobytes>

The default value for “log_rotation_age” is 24 hours, and the default value for “log_rotation_size” is 10 megabytes.

In most cases, having a size rotation policy does not always guarantee the last log message in the archived log file is completely contained in that file only.

If the “log_rotation_age” is kept to its default value of 24 hours, each file can be easily identified and individually examined, as each file will contain a day’s worth of events. However, this, too, does not guarantee that each file will be a self-contained unit of logs of the last 24 hours. Sometimes a slow-performing query can take more than 24 hours to finish; events could be happening when the old file is closed, and the new one is generated. This can be the case during a nightly batch job,  resulting in some parts of the queries recorded in one file and the rest in another.

Our recommendation is to find a log rotation period that works for your use case. Check the time difference between two consecutive periods of lowest activity (for example, between one Saturday to the next). You can then set the “log_rotation_age” value to that time difference, and during one of those periods, restart the PostgreSQL service. That way, PostgreSQL will rotate the current log file when the next lull period happens. However, if you need to restart the service between these periods, the log rotation will again be skewed. You will have to repeat this process then. But as like many other things in PostgreSQL, trial and error will dictate the best value. Also, if you are using a log management utility, the log rotation age or size will not matter because the log manager agent will ingest each event from the file as it is added.

Use Tools Like pgBadger

pgBadger is a powerful PostgreSQL log analyzer that can create very useful insights from Postgres log files. It is an open-source tool written in Perl with a very small footprint on the machine where it runs. The tool is run from the command line and comes with a large number of options. It will take one or more logs as its input and can produce an HTML report with detailed statistics on:

  • Most frequently waiting queries.
  • Queries generating most temporary files or the largest temporary files
  • Slowest running queries
  • Average query duration
  • Most frequently run queries
  • Most frequent errors in queries
  • Users and application who run queries
  • Checkpoints statistics.
  • Autovacuum and autoanalyze statistics.
  • Locking statistics
  • Error events (panic, fatal, error and warning).
  • Connection and session profiles (by database, user, application)
  • Session profiles
  • Query profiles (query types, queries by database/application)
  • I/O statistics
  • etc.

As I mentioned before, some of the log-related configuration parameters have to be enabled to capture all log events so pgBadger can effectively analyze those logs. Since this can produce large log files with many events, pgBadger should only be used to create benchmarks or troubleshoot performance issues. Once the detailed logs have been generated, the configuration parameters can be changed back to their original values. For continuous log analysis, it’s best to use a dedicated log management application.

If you are more comfortable doing things at the command prompt and making use of system views, you will want to use pg_stat_statements. In fact, this should be enabled in any production PostgreSQL installation.

pg_stat_statements is a PostgreSQL extension that ships with the default installation now. To enable it, the “shared_preload_libraries” configuration parameter should have pg_stat_statements as one of its values. It can then be installed like any other extension using the “CREATE EXTENSION” command. The extension creates the pg_stat_statements view, which provides valuable query-related information.
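
A minimal setup sketch (standard parameter and extension names; the sample query uses the pre-PostgreSQL 13 column names, where total_time was later renamed total_exec_time):

-- postgresql.conf (requires a server restart)
shared_preload_libraries = 'pg_stat_statements'

-- then, in the target database
CREATE EXTENSION pg_stat_statements;

-- example: the ten statements consuming the most total execution time
SELECT query, calls, total_time, mean_time
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;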

Use a Log Management Application to Gain Insight

There are many log management utilities in the market, and most organizations use one or more these days. Whatever tool is in place, I recommend making use of it to collect and manage PostgreSQL logs. 

There are a few reasons for it:

It is much easier to parse, analyze, and filter out noise from log files with an automated tool.  Sometimes, an event can span multiple log files based on the duration of the event, and the log rotation age or size. Having a log manager makes it much simpler to have this information presented as a whole.

Log management solutions today typically come with built-in ability to parse PostgreSQL logs. Some also come with dashboards that can show the most common metrics extracted from these logs.

Most modern log management applications also offer powerful search, filter, pattern matching, event correlation, and AI-enabled trend analysis features. What’s not visible to ordinary eyes can be easily made evident by these tools.

Finally, using a log manager to store PostgreSQL logs will also mean the events are saved for posterity, even if the original files are deleted accidentally or maliciously.

Although there are obvious advantages of using a log management application, many organizations have restrictions about where their logs can live. This is a typical case with SaaS-based solutions where logs are often saved outside an organization’s geographic boundary – something that may not be compliant with regulatory requirements.

In such cases, I recommend choosing a vendor with a local data center presence – if possible – or using a self-managed log manager hosted in the organization’s network, such as an ELK stack.

Final Words

PostgreSQL server logs can be a gold-mine of information when appropriately configured. The trick is to determine what to log and how much to log, and more importantly, test if the logs can deliver the right information when needed. It will be a matter of trial and error, but what I have discussed here today should give a pretty decent starting. As I said at the beginning, I would be more than happy to hear about your experience of configuring PostgreSQL logging for optimal results.

Amazon EMR supports Apache Hive ACID transactions


Feed: AWS Big Data Blog.

Apache Hive is an open-source data warehouse package that runs on top of an Apache Hadoop cluster. You can use Hive for batch processing and large-scale data analysis. Hive uses Hive Query Language (HiveQL), which is similar to SQL.

ACID (atomicity, consistency, isolation, and durability) properties make sure that the transactions in a database are atomic, consistent, isolated, and reliable.

Amazon EMR 6.1.0 adds support for Hive ACID transactions so it complies with the ACID properties of a database. With this feature, you can run INSERT, UPDATE, DELETE, and MERGE operations in Hive managed tables with data in Amazon Simple Storage Service (Amazon S3). This is a key feature for use cases like streaming ingestion, data restatement, bulk updates using MERGE, and slowly changing dimensions.

This post demonstrates how to enable Hive ACID transactions in Amazon EMR, how to create a Hive transactional table, how it can achieve atomic and isolated operations, and the concepts, best practices, and limitations of using Hive ACID in Amazon EMR.

Enabling Hive ACID in Amazon EMR

To enable Hive ACID as the default for all Hive managed tables in an EMR 6.1.0 cluster, use the following hive-site configuration:

[
   {
      "classification": "hive-site",
      "properties": {
         "hive.support.concurrency": "true",
         "hive.exec.dynamic.partition.mode": "nonstrict",
         "hive.txn.manager": "org.apache.hadoop.hive.ql.lockmgr.DbTxnManager"
      }
   }
]

For the complete list of configuration parameters related to Hive ACID and descriptions of the preceding parameters, see Hive Transactions.

Hive ACID use case

In this section, we explain the Hive ACID transactions with a straightforward use case in Amazon EMR.

Enter the following Hive command in the master node of an EMR cluster (6.1.0 release) and replace <s3-bucket-name> with the bucket name in your account:

hive --hivevar location=<s3-bucket-name> -f s3://aws-bigdata-blog/artifacts/hive-acid-blog/hive_acid_example.hql 

After Hive ACID is enabled on an Amazon EMR cluster, you can run the CREATE TABLE DDLs for Hive transaction tables.

To define a Hive table as transactional, set the table property transactional=true.

The following CREATE TABLE DDL is used in the script that creates a Hive transaction table acid_tbl:

CREATE TABLE acid_tbl (key INT, value STRING, action STRING)
PARTITIONED BY (trans_date DATE)
CLUSTERED BY (key) INTO 3 BUCKETS
STORED AS ORC
LOCATION 's3://${hivevar:location}/acid_tbl' 
TBLPROPERTIES ('transactional'='true');

This script generates three partitions in the provided Amazon S3 path. See the following screenshot.

The first partition, trans_date=2020-08-01, has the data generated as a result of sample INSERT, UPDATE, DELETE, and MERGE statements. We use the second and third partitions when explaining minor and major compactions later in this post.
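
For readers who want to reproduce this by hand, statements of that kind against the acid_tbl table defined above would look roughly like the following (a sketch; the keys and values are made up):

-- insert a row into the 2020-08-01 partition
INSERT INTO acid_tbl PARTITION (trans_date='2020-08-01') VALUES (1, 'val1', 'insert');
-- update a non-bucketing column of that row
UPDATE acid_tbl SET value = 'val1_updated' WHERE key = 1 AND trans_date = '2020-08-01';
-- delete a row from the same partition
DELETE FROM acid_tbl WHERE key = 2 AND trans_date = '2020-08-01';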

ACID is achieved in Apache Hive using three types of files: base, delta, and delete_delta. Edits are written in delta and delete_delta files.

The base file is created by the Insert Overwrite Table query or as the result of major compaction over a partition, where all the files are consolidated into a single base_<write id> file, where the write ID is allocated by the Hive transaction manager for every write. This helps achieve isolation of Hive write queries and enables them to run in parallel.

The INSERT operation creates a new delta_<write id>_<write id> directory.

The DELETE operation creates a new delete_delta_<write id>_<write id> directory.

To support deletes, a unique row__id is added to each row on writes. When a DELETE statement runs, the corresponding row__id gets added to the delete_delta_<write id>_<write id> directory, which should be ignored on reads. See the following screenshot.

The UPDATE operation creates a new delta_<write id>_<write id> directory and a delete_delta_<write id>_<write id> directory.

The following screenshot shows the second partition in Amazon S3, trans_date=2020-08-02.

A Hive transaction provides snapshot isolation for reads. When an application or query reads the transaction table, it opens all the files of a partition/bucket and returns the records from the last transaction committed.

Hive compactions

With the previously mentioned logic for Hive writes on a transactional table, many small delta and delete_delta files are created, which could adversely impact read performance over time because each read over a particular partition has to open all the files (including delete_delta) to eliminate the deleted rows.

This brings the need for a compaction logic for Hive transactions. In the following sections, we use the same use case to explain minor and major compactions in Hive.

Minor compaction

A minor compaction merges all the delta and delete_delta files within a partition or bucket to a single delta_<start write id>_<end write id> and delete_delta_<start write id>_<end write id> file.

We can trigger the minor compaction manually for the second partition (trans_date=2020-08-02) in Amazon S3 with the following code:

ALTER TABLE acid_tbl PARTITION (trans_date='2020-08-02') COMPACT 'minor';

If you check the same second partition in Amazon S3, after a minor compaction, it looks like the following screenshot.

You can see all the delta and delete_delta files from write ID 0000005–0000009 merged to single delta and delete_delta files, respectively.

Major compaction

A major compaction merges the base, delta, and delete_delta files within a partition or bucket to a single base_<latest write id>. Here the deleted data gets cleaned.

A major compaction is automatically triggered in the third partition (trans_date='2020-08-03') because the default Amazon EMR compaction threshold is met, as described in the next section. See the following screenshot.
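
For reference, a major compaction can also be requested manually, in the same way as the minor compaction shown earlier:

ALTER TABLE acid_tbl PARTITION (trans_date='2020-08-03') COMPACT 'major';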

To check the progress of compactions, enter the following command:

hive> show compactions;

The following screenshot shows the output.

Compaction in Amazon EMR

Compaction is enabled by default in Amazon EMR 6.1.0. The following property determines the number of concurrent compaction tasks:

  • hive.compactor.worker.threads – Number of worker threads to run in the instance. The default is 1 or vCores/8, whichever is greater.

Automatic compaction is triggered in Amazon EMR 6.1.0 based on the following configuration parameters:

  • hive.compactor.check.interval – Time period in seconds to check if any partition requires compaction. The default is 300 seconds.
  • hive.compactor.delta.num.threshold – Triggers minor compaction when the total number of delta files is greater than this value. The default is 10.
  • hive.compactor.delta.pct.threshold – Triggers major compaction when the total size of delta files is greater than this percentage size of base file. The default is 0.1, or 10%.

Best practices

The following are some best practices when using this feature:

  • Use an external Hive metastore for Hive ACID tables – Our customers use EMR clusters for compute purposes and Amazon S3 as storage for cost-optimization. With this architecture, you can stop the EMR cluster when the Hive jobs are complete. However, if you use a local Hive metastore, the metadata is lost upon stopping the cluster, and the corresponding data in Amazon S3 becomes unusable. To persist the metastore, we strongly recommend using an external Hive metastore like an Amazon RDS for MySQL instance or Amazon Aurora. Also, if you need multiple EMR clusters running ACID transactions (read or write) on the same Hive table, you need to use an external Hive metastore.
  • Use ORC format – Use ORC format to get full ACID support for INSERT, UPDATE, DELETE, and MERGE statements.
  • Partition your data – This technique helps improve performance for large datasets.
  • Enable an EMRFS consistent view if using Amazon S3 as storage – Because you have frequent movement of files in Amazon S3, we recommend using an EMRFS consistent view to mitigate the issues related to the eventual consistency nature of Amazon S3.
  • Use Hive authorization – Because Hive transactional tables are Hive managed tables, to prevent users from deleting data in Amazon S3, we suggest implementing Hive authorization with required privileges for each user.

Limitations

Keep in mind the following limitations of this feature:

  • The AWS Glue Data Catalog doesn’t support Hive ACID transactions.
  • Hive external tables don’t support Hive ACID transactions.
  • Bucketing is optional in Hive 3, but in Amazon EMR 6.1.0 (as of this writing), if the table is partitioned, it needs to be bucketed. You can mitigate this issue in Amazon EMR 6.1.0 using the following bootstrap action:
    --bootstrap-actions '[{"Path":"s3://aws-bigdata-blog/artifacts/hive-acid-blog/make_bucketing_optional_for_hive_acid_EMR_6_1.sh","Name":"Set bucketing as optional for Hive ACID"}]'

Conclusion

This post introduced the Hive ACID feature in EMR 6.1.0 clusters, explained how it works and its concepts with a straightforward use case, described the default behavior of Hive ACID on Amazon EMR, and offered some best practices. Stay tuned for additional updates on new features and further improvements in Apache Hive on Amazon EMR.


About the Authors

Suthan Phillips is a big data architect at AWS. He works with customers to provide them architectural guidance and helps them achieve performance enhancements for complex applications on Amazon EMR. In his spare time, he enjoys hiking and exploring the Pacific Northwest.

Chao Gao is a Software Development Engineer at Amazon EMR. He mainly works on Apache Hive project at EMR, and has some in-depth knowledge of distributed database and database internals. In his spare time, he enjoys making roadtrips, visiting all the national parks and traveling around the world.

MySQL DROP statement using phpMyAdmin


Feed: Planet MySQL
;
Author: Joshua Otwell
;

The MySQL DROP statement is one of many powerful DDL commands. Be it ALTER TABLE some_table DROP some_column or DROP some_table, this type of command can drastically change your data landscape because in executing MySQL DROP, you are completely removing objects from the database! If you are using the phpMyAdmin web interface, you can execute the MySQL DROP statement with just a few mouse clicks. Continue reading to see how…

Self-Promotion:

If you enjoy the content written here, by all means, share this blog and your favorite post(s) with others who may benefit from or like it as well. Since coffee is my favorite drink, you can even buy me one if you would like!


Note: The DROP statement is prevalent in most SQL dialects and is not limited in use to only MySQL.

Suppose we have a common ‘users’ table with 4 columns: ‘user_id’, ‘first_name’, ‘last_name’, and ‘country’:

Screenshot: users table description.

For whatever reason, we have decided we no longer need the ‘country’ column and want to remove it completely from the ‘users’ table. How can we do that using the MySQL DROP statement?
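
Under the hood, this comes down to a single DDL statement, which is effectively what phpMyAdmin builds and runs for you:

ALTER TABLE users DROP COLUMN country;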

Removing a column with MySQL DROP statement in phpMyAdmin

In phpMyAdmin, simply check the checkbox on the far left on the column row you want to remove (in our example the ‘country’ column). Once chosen, click the Drop word just beside the red warning-like icon to complete the action. The screenshot below provides a complete visual overview of the steps.

Screenshot: Dropping a table column in phpMyAdmin.

!!!WARNING. YOU ARE ABOUT TO COMPLETELY REMOVE A COLUMN!!!

Not only am I myself extremely cautious, but so is phpMyAdmin. In carrying out the above-shown action, phpMyAdmin displays this warning message:

Screenshot: Confirmation popup for dropping the ‘country’ column from table ‘users’.

After clicking the OK button, the command is executed. The screenshot below shows, the ‘users’ table no longer has the ‘country’ column as part of its definition:

Screenshot: Table ‘users’ description after dropping the ‘country’ column.

Removing a table with MySQL DROP statement in phpMyAdmin

Should you decide that you no longer need a table as part of your database, you can use the MySQL DROP statement and completely remove it. Suppose I want to remove the ‘users’ table. How can I do that using the MySQL DROP statement?
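
Again, the action is equivalent to one SQL statement:

DROP TABLE users;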

First, check the checkbox on the far left for the table row you wish to remove. Then, click the Drop action word next to the warning-like icon. Essentially, the exact steps taken to remove a column, we are now applying them on a database level to the actual table object.

The below screenshot shows a visual overview of the entire process:

Screenshot: phpMyAdmin steps to remove the ‘users’ table from the database.

I’ll save my personal warning message in this section because phpMyAdmin provides one just the same:

Screenshot: Confirmation popup for dropping the ‘users’ table in phpMyAdmin.

Clicking the OK button executes the MySQL DROP statement, removing the ‘users’ table from the database.


Recommended Reading and Informational Resources

Feel free to visit the below resources for more information on the MySQL DROP statement, along with other blog posts I have written on using the phpMyAdmin web-based interface:

Take good care when removing database objects like tables, columns, indexes, and the like using the MySQL DROP statement. Always be sure that you actually want them completely removed from the database schema because once they are gone, they are indeed gone!

Like what you have read? See anything incorrect? Please comment below and thanks for reading!!!

A Call To Action!

Thank you for taking the time to read this post. I truly hope you discovered something interesting and enlightening. Please share your findings here, with someone else you know who would get the same value out of it as well.

Visit the Portfolio-Projects page to see blog post/technical writing I have completed for clients.

To receive email notifications (Never Spam) from this blog (“Digital Owl’s Prose”) for the latest blog posts as they are published, please subscribe (of your own volition) by clicking the ‘Click To Subscribe!’ button in the sidebar on the homepage! (Feel free at any time to review the Digital Owl’s Prose Privacy Policy Page for any questions you may have about: email updates, opt-in, opt-out, contact forms, etc…)

Be sure and visit the “Best Of” page for a collection of my best blog posts.


Josh Otwell has a passion to study and grow as a SQL Developer and blogger. Other favorite activities find him with his nose buried in a good book, article, or the Linux command line. Among those, he shares a love of tabletop RPG games, reading fantasy novels, and spending time with his wife and two daughters.

Disclaimer: The examples presented in this post are hypothetical ideas of how to achieve similar types of results. They are not the utmost best solution(s). The majority, if not all, of the examples provided, is performed on a personal development/learning workstation-environment and should not be considered production quality or ready. Your particular goals and needs may vary. Use those practices that best benefit your needs and goals. Opinions are my own.

A Step by Step Guide to Take your MySQL Instance to the Cloud


Feed: Planet MySQL
;
Author: Keith Hollman
;

You have a MySQL instance? Great. You want to take it to a cloud? Nothing new. You want to do it fast, minimizing downtime / service outage? “I wish” I hear you say. Pull up a chair. Let’s have a chinwag.

Given the objective above, i.e. “I have a database server on premise and I want the data in the cloud to ‘serve’ my application”, we can go into details:

  • Export the data – hopefully make that export land in a cloud storage place ‘close’ to the destination (in my case, @OCI of course).
  • Create my MySQL cloud instance.
  • Import the data into the cloud instance.
  • Redirect the application to the cloud instance.

All this takes time. With a little preparation we can reduce the outage time down to be ‘just’ the sum of the export + import time. This means that once the export starts, we will have to set the application in “maintenance” mode, i.e. not allow more writes until we have our cloud environment available. 

Depending on each cloud solution, the ‘export’ part could mean “export the data locally and then upload the data to cloud storage” which might add to the duration. Then, once the data is there, the import might allow us to read from the cloud storage, or require adjustments before the import can be fully completed.

Do you want to know more? https://mysqlserverteam.com/mysql-shell-8-0-21-speeding-up-the-dump-process/

 Let’s get prepared then:

Main objective: keep application outage time down to minimum.

Preparation:

  • You have an OCI account, and the OCI CLI configuration is in place.
  • MySQL Shell 8.0.21 is installed on the on-premise environment.
  • We create an Object Storage bucket for the data upload.
  • Create our MySQL Database System.
  • We create our “Endpoint” Compute instance, and install MySQL Shell 8.0.21 & MySQL Router 8.0.21 here.
  • Test connectivity from PC to Object storage, from PC to Endpoint, and, in effect, from PC to MDS.

So, now for our OCI environment setup. What do I need?

Really, we just need some files to configure with the right info. Nothing has to be installed nor similar. But if we do have the OCI CLI installed on our PC or similar, then we’ll already have the configuration, so it’s even easier. (if you don’t have it installed, it does help avoid the web page console once we have learned a few commands so we can easily get things like the Public IP of our recently started Compute or we can easily start / stop these cloud environments.)

What we need is the config file from .oci, which contains the following info:
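
A typical config follows the standard OCI CLI layout; the values below are placeholders to replace with your own user/tenancy OCIDs, key fingerprint, API key path and region:

[DEFAULT]
user=ocid1.user.oc1..aaaa...
fingerprint=xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx
key_file=~/.oci/oci_api_key.pem
tenancy=ocid1.tenancy.oc1..aaaa...
region=eu-frankfurt-1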

You’ll need the API Key stuff as mentioned in the documentation “Required Keys and OCIDs”.

Remember, this is a one-off, and it really helps your OCI interaction in the future. Just do it.

The “config” file and the PEM key will allow us to send the data straight to the OCI Object Storage bucket.

MySQL Shell 8.0.21 install on-premise.

Make a bucket.

I did this via the OCI console.

This creates a Standard Private bucket.

Click on the bucket name that now appears in the list, to see the details.

You will need to note down the Name and Namespace.
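
If you prefer the command line and have the OCI CLI configured, the Object Storage namespace can also be retrieved with a single command (output trimmed to the relevant field):

$ oci os ns get
{
  "data": "idazzjlcjqzj"
}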

Create our MySQL Database System.

This is where the data will be uploaded to. This is also quite simple.

And hey presto. We have it.

Click on the name of the MDS system, and you’ll find that there’s an IP Address according to your VCN config. This isn’t a public IP address for security reasons.

On the left hand side, on the menu you’ll see “Endpoints”. Here we have the info that we will need for the next step.

For example, IP Address is 10.0.0.4.

Create our Endpoint Compute instance.

In order to access our MDS from outside the VCN, we’ll be using a simple Compute instance as a jump server.

Here we’ll install MySQL Router to be our proxy for external access.

And we’ll also install MySQL Shell to upload the data from our Object Storage bucket.

For example, https://gist.github.com/alastori/005ebce5d05897419026e58b9ab0701b.

First, go to the Security List of your OCI compartment, and add an ingress rule for the port you want to use in Router and allow access from the IP address you have for your application server or from the on-premise public IP address assigned.

Router & Shell install ‘n’ configure

Test connectivity.

Test MySQL Router as our proxy, via MySQL Shell:

$ mysqlsh root@kh01:3306 --sql -e 'show databases'

Now, we can test connectivity from our pc / application server / on-premise environment. Knowing the public IP address, let’s try:

$ mysqlsh root@<public-ip>:3306 --sql -e 'show databases'

If you get any issues here, check your ingress rules at your VCN level.

Also, double check your o.s. firewall rules on the freshly created compute instance too.

Preparation is done.

We can connect to our MDS instance from the Compute instance where MySQL Router is installed, kh01, and also from our own (on-premise) environment.

Let’s get the data streaming.

MySQL Shell Dump Utility

In effect, it’s here that we’ll be ‘streaming’ data.

This means that from our on-premise host we’ll export the data into the osBucket in OCI, and at the same time, read from that bucket from our Compute host kh01 that will import the data into MDS.

First of all, I want to check the commands with “dryRun: true”.

util.dumpSchemas dryRun

From our own environment / on-premise installation, we now want to dump / export the data:

$ mysqlsh root@OnPremiseHost:3306

You’ll want to see what options are available and how to use the util.dumpSchemas utility:

mysqlsh> help util.dumpSchemas

NAME

      dumpSchemas – Dumps the specified schemas to the files in the output

                    directory.

SYNTAX

      util.dumpSchemas(schemas, outputUrl[, options])

WHERE

      schemas: List of schemas to be dumped.

      outputUrl: Target directory to store the dump files.

      options: Dictionary with the dump options.

Here’s the command we’ll be using, but we want to activate the ‘dryRun’ mode, to make sure it’s all ok. So:

util.dumpSchemas(
  ["test"], "test",
  {dryRun: true, showProgress: true, threads: 8, ocimds: true, "osBucketName": "test-bucket", "osNamespace": "idazzjlcjqzj", ociConfigFile: "/home/os_user/.oci/config", "compatibility": ["strip_definers"]}
)

["test"]: I just want to dump the test schema. I could put a list of schemas here. Careful if you think you can export internal schemas, ‘cos you can’t.

"test": the outputUrl target directory. Watch the prefix of all the files being created in the bucket.

options:

dryRun: Quite obvious. Change it to false to run.

showProgress: I want to see the progress of the loading.

threads: Default is 4, but choose what you like here, according to the resources available.

ocimds: VERY IMPORTANT! This is to make sure that the environment is “MDS ready”, so when the data gets to the cloud, nothing breaks.

osBucketName: The name of the bucket we created.

osNamespace: The namespace of the bucket.

ociConfigFile: This is what we looked at, right at the beginning. This is what makes it easy.

compatibility: There is a list of options here that help remove customizations and/or simplify our data export ready for MDS.

Here I am looking at exporting / dumping just schemas. I could have dumped the whole instance via util.dumpInstance(). Have a try!

I tested a local dumpSchemas export without the OCIMDS readiness option, and I think it’s worth sharing: this is how I found out that I needed a primary key to be able to use chunking, and hence get a faster dump:

util.dumpSchemas(["test"], "/var/lib/mysql-files/test/test", {dryRun: true, showProgress: true})

Acquiring global read lock

All transactions have been started

Locking instance for backup

Global read lock has been released

Writing global DDL files

Preparing data dump for table `test`.`reviews`

Writing DDL for schema `test`

Writing DDL for table `test`.`reviews`

Data dump for table `test`.`reviews` will be chunked using column `review_id`

(I created the primary key on the review_id column and got rid of the following warning at the end:)

WARNING: Could not select a column to be used as an index for table `test`.`reviews`. Chunking has been disabled for this table, data will be dumped to a single file.

Anyway, I used dumpSchemas (instead of dumpInstance) with OCIMDS and then loaded with the following:

util.LoadDump dryRun

Now, we’re on the compute we created, with Shell 8.0.21 installed and ready to upload / import the data:

$ mysqlsh root@kh01:3306

util.loadDump("test", {dryRun: true, showProgress: true, threads: 8, osBucketName: "test-bucket", osNamespace: "idazzjlcjqzj", ociConfigFile: "/home/osuser/.oci/config"})

As imagined, I’ve copied my PEM key and OCI CLI config file to the compute, via scp, to a “$HOME/.oci” directory.

Loading DDL and Data from OCI ObjectStorage bucket=test-bucket, prefix=’test’ using 8 threads.

Util.loadDump: Failed opening object '@.json' in READ mode: Not Found (404) (RuntimeError)

This is due to the bucket being empty. You’ll see why it complains of the “@.json” in a second.

You want to do some “streaming”?

With our 2 session windows opened, 1 from the on-premise instance and the other from the OCI compute host, connected with mysqlsh:

On-premise:

dry run:

util.dumpSchemas(["test"], "test", {dryRun: true, showProgress: true, threads: 8, ocimds: true, "osBucketName": "test-bucket", "osNamespace": "idazzjlcjqzj", ociConfigFile: "/home/os_user/.oci/config", "compatibility": ["strip_definers"]})

real:

util.dumpSchemas(["test"], "test", {dryRun: false, showProgress: true, threads: 8, ocimds: true, "osBucketName": "test-bucket", "osNamespace": "idazzjlcjqzj", ociConfigFile: "/home/os_user/.oci/config", "compatibility": ["strip_definers"]})

OCI Compute host:

dry run:

util.loadDump("test", {dryRun: true, showProgress: true, threads: 8, osBucketName: "test-bucket", osNamespace: "idazzjlcjqzj", waitDumpTimeout: 180})

real:

util.loadDump("test", {dryRun: false, showProgress: true, threads: 8, osBucketName: "test-bucket", osNamespace: "idazzjlcjqzj", waitDumpTimeout: 180})

They do say a picture is worth a thousand words, here are some images of each window that was executed at the same time:

On-premise:

At the OCI compute host you can see the waitDumpTimeout take effect with:

NOTE: Dump is still ongoing, data will be loaded as it becomes available.

In the osBucket, we can now see content (which is what the loadDump is reading):

And once it’s all dumped ‘n’ uploaded we have the following output:

If you like logs, then check the .mysqlsh/mysqlsh.log that records all the output under the directory where you have executed MySQL Shell (on-premise & OCI compute)

Now the data is all in our MySQL Database System, and all we need to do is point the web server or the application server to the OCI compute system’s IP and port so that MySQL Router can route the connection to happiness!!!!

Conclusion
