Database Migration from MS SQL to PostgreSQL

This article presents tips and challenges of SQL Server to PostgreSQL database migration based on the experience of Intelligent Converters specialists. Regardless of which method or converter is chosen for the migration project, the responsible person or team must know the most significant bottlenecks and subjects of validation. Basically, database migration from MS SQL to PostgreSQL can be split into five fundamental phases:

  1. Investigate all MS SQL-specific elements of the source database and related techniques of migration to PostgreSQL.

  2. Migrate MS SQL table and constraint definitions. This phase requires validation of all differences between the syntax of table and constraint declarations in MS SQL and PostgreSQL.

  3. Choose the most suitable method of data migration to minimize downtime of MS SQL database. Run the data migration and validate that all the required transformations are made.

  4. Convert source code of stored procedures, functions and triggers into the PostgreSQL format.

  5. Validate the target PostgreSQL database using performance and functional tests, and fine-tune the performance.

Some of these phases and related points of attention are described in detail below.

Migration of Table Definitions

SQL Server and PostgreSQL have similar sets of data types, and many of them are identical in the source and target DBMS, for example: BIGINT, DATE, DECIMAL, INT, MONEY, NUMERIC, REAL, SMALLINT, TEXT, XML.

The MS SQL spatial types GEOGRAPHY and GEOMETRY require special attention, since PostgreSQL requires the installation of the PostGIS extension to support spatial data.

The table below illustrates safe mapping for the types that differ between SQL Server and PostgreSQL:

| MS SQL | PostgreSQL |
| --- | --- |
| BINARY(n) | BYTEA |
| BIT | BOOLEAN, BOOL |
| CHAR(n) where n > 8000 | TEXT |
| DATETIME | TIMESTAMP(3) |
| DATETIME2(n) | TIMESTAMP(n) |
| DATETIMEOFFSET(n) | TIMESTAMP(n) WITH TIME ZONE, TIMESTAMPTZ |
| FLOAT(n) | DOUBLE PRECISION, FLOAT8 |
| IMAGE | BYTEA |
| NCHAR(n) where n > 8000 | TEXT |
| NTEXT | TEXT |
| NVARCHAR(max) | TEXT |
| ROWVERSION | BYTEA |
| SMALLMONEY | MONEY |
| UNIQUEIDENTIFIER | CHAR(16), UUID |
| VARBINARY(max) | BYTEA |
| VARCHAR(max) | TEXT |

If an MS SQL table has an IDENTITY column with both seed and increment equal to 1, it must be converted into PostgreSQL SERIAL for INT or BIGSERIAL for BIGINT. If the IDENTITY declaration has a custom seed or increment, it must be converted into a PostgreSQL IDENTITY column (supported in version 10 and higher). Assume we have the MS SQL table declared as:

CREATE TABLE TBL1(
    Col1 INT NOT NULL IDENTITY(2,4) PRIMARY KEY, 
    Col2 VARCHAR(100)
);

It must be migrated into PostgreSQL as follows:

CREATE TABLE TBL1(
    Col1 INT NOT NULL GENERATED BY DEFAULT AS IDENTITY
        (START WITH 2 INCREMENT BY 4) PRIMARY KEY,
    Col2 VARCHAR(100)
);
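
For comparison, when the seed and increment are both 1, the mapping is simply SERIAL. A minimal sketch (the table name TBL2 is hypothetical):

-- MS SQL: Col1 INT NOT NULL IDENTITY(1,1) PRIMARY KEY
CREATE TABLE TBL2(
    Col1 SERIAL PRIMARY KEY, -- expands to INT NOT NULL backed by an implicit sequence
    Col2 VARCHAR(100)
);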

Challenges of Data Migration

In the previous section, we mentioned that BYTEA is the recommended data type for storing binary data in PostgreSQL. However, when dealing with large binary data, typically with an average field size of 10MB or more, BYTEA is not advisable. The reason is the way BYTEA data is read: it can only be extracted as a single fragment and does not support piecewise reading, which can cause significant RAM overhead.

To address this issue, PostgreSQL offers an alternative known as LARGE OBJECT, which provides stream-style access to the data. Values of the LARGE OBJECT type are stored in the system table pg_largeobject, which is available in every database. This table can accommodate up to 4 billion rows, and the maximum size of a LARGE OBJECT is 4TB. Furthermore, LARGE OBJECT supports piecewise reading, overcoming the limitation of BYTEA.
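
As an illustration, here is a minimal sketch of LARGE OBJECT usage based on the built-in server-side functions lo_from_bytea and lo_get; the docs table and its data are hypothetical:

-- lo_from_bytea(0, ...) creates a new large object and returns its OID
CREATE TABLE docs(id INT PRIMARY KEY, content OID);
INSERT INTO docs VALUES (1, lo_from_bytea(0, '\x0102030405'::bytea));

-- Piecewise reading: fetch 2 bytes starting at offset 1 (not possible with BYTEA)
SELECT lo_get(content, 1, 2) FROM docs WHERE id = 1;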

Spatial data must be migrated through the literal representation known as well-known text (WKT). Assume we have the following MS SQL table:

CREATE TABLE spatialtest(F1 INT NOT NULL, F2 geometry, F3 geography)

Then we can use the STAsText() method to extract geometry and geography data as text:

SELECT F1, F2.STAsText(), F3.STAsText() FROM spatialtest

After running the query, you will see something like this:

1 POLYGON ((5 5, 10 5, 10 10, 5 5)) LINESTRING (-122.36 47.656, -122.343 47.656)

In PostgreSQL, the same table must be declared as (PostGIS extension must be installed):

CREATE TABLE spatialtest(F1 INT NOT NULL, F2 geometry, F3 geography);

And the spatial data must be inserted in the table through the WKT representation mentioned above:

INSERT INTO spatialtest VALUES (1, 'POLYGON ((5 5, 10 5, 10 10, 5 5))',
    'LINESTRING (-122.36 47.656, -122.343 47.656)');
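
To verify the load, the values can be read back as WKT on the PostgreSQL side, a quick sanity check using the PostGIS ST_AsText function:

SELECT F1, ST_AsText(F2), ST_AsText(F3) FROM spatialtest;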

MS SQL provides a feature called 'external tables' that allows external data stored outside the database to be linked and treated as a regular table within the DBMS. PostgreSQL offers a comparable capability through the Foreign Data Wrapper (FDW) mechanism. For example, you can leverage the file_fdw extension to access and manipulate the contents of external CSV files as if they were native PostgreSQL tables.
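
A minimal sketch of the file_fdw approach, where the file path and the column list are illustrative assumptions:

CREATE EXTENSION file_fdw;
CREATE SERVER csv_files FOREIGN DATA WRAPPER file_fdw;

-- Expose a CSV file as a read-only foreign table
CREATE FOREIGN TABLE ext_customers(id INT, name TEXT)
    SERVER csv_files
    OPTIONS (filename '/data/customers.csv', format 'csv', header 'true');

SELECT * FROM ext_customers;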

Methods of Data Migration

When it comes to data migration, particularly for large databases, careful assessment of the strategy and implementation is crucial to avoid unacceptable system downtime and/or overhead. There are three common approaches to data migration:

  • Snapshot: This method migrates all the data in a single step. It requires substantial downtime of the source database during the entire period of data reading to prevent data loss or corruption.

  • Piecewise Snapshot: With this approach, data is migrated in chunks using parallel threads or processes. Some downtime is still required, but it is significantly shorter than with the snapshot method. For most migration projects, the Intelligent Converters team uses the piecewise snapshot method, especially for large tables that contain millions of rows (see the sketch after this list).

  • Change Data Replication (CDR): This technique continuously loads data by tracking incremental changes. It reduces system downtime to nearly zero, as only the changes made since the initial migration need to be synchronized.
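
Here is a sketch of the piecewise snapshot idea for a table with a numeric key such as TBL1 above: each key range is read by a separate worker, so no single query scans the whole table (the range bounds are illustrative):

-- Worker 1:
SELECT Col1, Col2 FROM TBL1 WHERE Col1 >= 0 AND Col1 < 100000;
-- Worker 2:
SELECT Col1, Col2 FROM TBL1 WHERE Col1 >= 100000 AND Col1 < 200000;
-- ...continue in equal ranges up to MAX(Col1)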

Now, let's consider the two major implementations of the CDR technique in detail. The first implementation creates insert, update, and delete triggers in the MS SQL database for each table being synchronized. These triggers track all changes by storing information about the events in a special 'history' table. Using this recorded data, the CDR tool replicates the changes to the target database. This approach may cause considerable overhead in the MS SQL database, since every data update runs extra transactions fired from the triggers.
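
As an illustration, here is a simplified T-SQL sketch of the trigger-based approach; the history table layout and the trigger body are assumptions, not the exact schema used by any particular CDR tool:

-- 'history' table recording which rows changed and how
CREATE TABLE history(
    table_name VARCHAR(128),
    operation  CHAR(1),          -- 'I' = insert, 'U' = update, 'D' = delete
    key_value  INT,
    changed_on DATETIME2 DEFAULT SYSUTCDATETIME()
);
GO

CREATE TRIGGER trg_tbl1_changes ON TBL1
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    -- rows in "inserted" cover INSERTs and the new side of UPDATEs
    INSERT INTO history(table_name, operation, key_value)
    SELECT 'TBL1',
           CASE WHEN EXISTS (SELECT 1 FROM deleted) THEN 'U' ELSE 'I' END,
           Col1
    FROM inserted;
    -- rows present only in "deleted" are DELETEs
    INSERT INTO history(table_name, operation, key_value)
    SELECT 'TBL1', 'D', Col1
    FROM deleted
    WHERE NOT EXISTS (SELECT 1 FROM inserted);
END;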

The second implementation of Change Data Replication is Transaction Log CDR. This technique replicates data changes from the MS SQL transaction logs into the target PostgreSQL database. Unlike trigger-based CDR, this method does not require modifications of the SQL Server database. However, it carries some risk of data loss or corruption due to the lack of control over the transaction log: for example, changes may be lost if the connection to the PostgreSQL database breaks while changes are being replicated from the log.

It is important to carefully evaluate the requirements of the particular migration project and select the approach that best balances downtime, overhead, and efficiency. Automation tools like the MS SQL to PostgreSQL database converter by Intelligent Converters can speed up and simplify the migration: the tool maps types intelligently, allows deep customization, and implements data migration via the piecewise snapshot method.

Migration of SQL Code

This section explores the issues of migrating MS SQL stored procedures, functions, and triggers to PostgreSQL.

Type casting. Unlike SQL Server, PostgreSQL requires strict type casting when invoking functions and operators, or when updating data with the results of expressions. The workaround is either to add explicit type casts in the SQL code where necessary or to use the PostgreSQL anyelement pseudo-type, which allows flexible handling of data types inside function calls and operations. For example:

create or replace function my_concat(str1 anyelement, str2 anyelement)
returns varchar 
language plpgsql 
as $$
begin
    return str1::varchar || str2::varchar;
end;
$$;

It is important to remember that when a stored procedure or function has several anyelement parameters, they must all resolve to the same actual data type at call time.
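
For example, with the my_concat function above:

SELECT my_concat(1, 2);                 -- returns '12'
SELECT my_concat('a'::text, 'b'::text); -- returns 'ab'
-- SELECT my_concat(1, 'a'::text);      -- fails: arguments resolve anyelement to different types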

Sequences. SQL Server and PostgreSQL have similar syntax for declaring sequences. For example:

CREATE SEQUENCE SEQ1 START WITH 4 INCREMENT BY 3 MAXVALUE 1000000;

However, the two DBMSs have different syntax for generating a sequence number. In MS SQL, the syntax is NEXT VALUE FOR sequence_name, whereas in PostgreSQL, it is nextval('sequence_name').
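
For example, for the sequence declared above:

-- MS SQL
SELECT NEXT VALUE FOR SEQ1;

-- PostgreSQL (unquoted identifiers are folded to lowercase)
SELECT nextval('seq1');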

Triggers. The main difference in creating triggers between the two DBMSs is that in MS SQL the source code is included directly in the CREATE TRIGGER statement, while in PostgreSQL the source code is arranged as a separate function which is then referenced from the CREATE TRIGGER statement:

CREATE OR REPLACE FUNCTION store_changes()
  RETURNS TRIGGER 
  LANGUAGE PLPGSQL
  AS
$$
BEGIN
    IF (NEW.first_name <> OLD.first_name) OR (NEW.last_name <> OLD.last_name)
       OR (NEW.email <> OLD.email) THEN
         INSERT INTO changes_log(id,changed_on)
         VALUES(OLD.id,now());
    END IF;
    RETURN NEW;
END;
$$;

CREATE TRIGGER make_changes
  BEFORE UPDATE
  ON employees
  FOR EACH ROW
  EXECUTE PROCEDURE store_changes();

Built-in Functions. SQL Server and PostgreSQL offer similar built-in functions, but some of them have no direct equivalent:

| MS SQL | PostgreSQL |
| --- | --- |
| CHARINDEX($substr, $str, $start_pos) | STRPOS($str, $substr) and SUBSTRING |
| CHAR(n) | CHR(n) |
| DATEADD($interval, $n_units, $date) | $date + $n_units * INTERVAL '1 second/minute/hour/day/month/year' |
| DATEDIFF($interval, $date1, $date2) | DATE_PART* |
| DATEPART | DATE_PART |
| GETDATE | NOW() |
| IIF($condition, $expr1, $expr2) | CASE WHEN $condition THEN $expr1 ELSE $expr2 END |
| ISNULL | COALESCE |
| LEN | LENGTH |
| REPLICATE | REPEAT |
| SPACE($n) | REPEAT(' ', $n) |

*The MS SQL interval DAY/DD/D is converted as date_part('day', $date2 - $date1)::int. The interval HOUR/HH is converted as 24 * date_part('day', $date2 - $date1)::int + date_part('hour', $date2 - $date1)::int. The same logic applies to the other intervals.
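
For instance, a sketch of the HOUR case with illustrative timestamps:

-- MS SQL: SELECT DATEDIFF(hour, @d1, @d2);
-- PostgreSQL equivalent:
SELECT 24 * date_part('day', d2 - d1)::int + date_part('hour', d2 - d1)::int
FROM (VALUES (TIMESTAMP '2024-01-01 10:00', TIMESTAMP '2024-01-03 15:30')) AS t(d1, d2);
-- returns 53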

Conclusion

Database migration from SQL Server to PostgreSQL is a complicated process that may require much time and effort for large databases. It consists of five stages, almost each of which has its own nuances. This article covers only a few aspects of MS SQL to PostgreSQL database migration, and every project is unique. The Intelligent Converters team is always ready to help with any kind of database migration and synchronization project.