Transpositions in SQL


Transposition algorithms are common in SQL: row-to-column transposition, column-to-row transposition, dynamic transposition, join-based transposition, and so on. This article analyzes the algorithm for each type of transposition and offers sample code. For the transpositions that are difficult to handle in SQL, there are convenient esProc solutions. See Transpositions in SQL for details.
A transposition rotates data from rows to columns, or from columns to rows, to change the layout so the data can be observed from a new perspective. Some transposition algorithms are simple, such as row-to-column, column-to-row and bidirectional transposition. Others are not, such as dynamic transposition, transposition with inter-row calculations and join-based transposition. All are commonly seen in data analysis and thus worth studying.

Basic transposition

Row-to-column transposition and column-to-row transposition are the simplest. Each is the inverse of the other.

1. Row to column: Below is the grouped sales table. Task: Transpose the values Q1-Q4 (rows) under the quarter field into new field names (columns), as shown below:

2. Column to row: Below is the sales cross table. Task: Transpose the fields Q1-Q4 into values Q1-Q4 under a new quarter field, as shown below:

The early SQL solutions:

In its early days, SQL didn't have a special PIVOT function (MySQL and HSQLDB still don't have one), so row-to-column transpositions were handled by coordinating multiple basic functions. There was often more than one way to solve a computing problem.

Method 1: case when subquery + grouping & aggregation
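A minimal sketch of Method 1, assuming a source table sales(year, quarter, amount) in which quarter holds the values 'Q1'-'Q4'; the table and column names are assumptions for illustration:

-- wrap CASE WHEN in a subquery, then group by year and aggregate
SELECT year, MAX(Q1) AS Q1, MAX(Q2) AS Q2, MAX(Q3) AS Q3, MAX(Q4) AS Q4
FROM (
    SELECT year,
           CASE WHEN quarter = 'Q1' THEN amount END AS Q1,
           CASE WHEN quarter = 'Q2' THEN amount END AS Q2,
           CASE WHEN quarter = 'Q3' THEN amount END AS Q3,
           CASE WHEN quarter = 'Q4' THEN amount END AS Q4
    FROM sales
) t
GROUP BY year
ORDER BY year;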

Method 2: sum if + grouping & aggregation
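A minimal sketch of Method 2 in MySQL-style syntax (MySQL provides IF()); the same assumed sales(year, quarter, amount) table is used:

-- IF() keeps the amount only for the matching quarter; other rows contribute NULL
SELECT year,
       SUM(IF(quarter = 'Q1', amount, NULL)) AS Q1,
       SUM(IF(quarter = 'Q2', amount, NULL)) AS Q2,
       SUM(IF(quarter = 'Q3', amount, NULL)) AS Q3,
       SUM(IF(quarter = 'Q4', amount, NULL)) AS Q4
FROM sales
GROUP BY year;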

Other methods include WITH ROLLUP + grouping & aggregation and UNION + grouping & aggregation, etc. They are essentially the same: group by year, generate the new columns Q1-Q4 through enumeration, and produce their values through aggregation.

The SQL code is lengthy even for the most basic transposition, because you need to enumerate each new column. The more new columns there are, the longer the code becomes. Imagine the code if the new columns were 12 months, or the states and provinces of a country.

As long as the new columns are known in advance, this tedious enumeration only affects the length of the code, not its complexity. But if the new columns cannot be known in advance, it is difficult to enumerate them. One example is converting a dynamic, row-based list of VIP customers into field names. This is hard to do in SQL alone; usually we turn to a stored procedure or a high-level language such as Java, which considerably increases code complexity and maintenance cost.

There is another problem with the above SQL program: the hard-to-understand aggregation. There is only one record per quarter per year, so no aggregation is actually needed. But since SQL forces an aggregation after each grouping, a pointless aggregate is performed over the single record in each group when calculating the year column. It is unnecessary and senseless; you can use any aggregate and get the same result, replacing MAX with SUM, for instance.

SQL's binding of an aggregate to every grouping action results from its incomplete set orientation. SQL can express a simple, small set consisting of multiple records, but it has no syntax or operator for a larger set made up of multiple smaller sets. That is why it has to aggregate each subgroup into a single record, turning the set of smaller sets back into a simple set.

The column-to-row transposition doesn't involve this hard-to-understand aggregation, so its early SQL solution is relatively simple: you just get the records under Q1-Q4 by column name and then union them.
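A minimal sketch, assuming the cross table is named salescross(year, Q1, Q2, Q3, Q4); names are assumptions for illustration:

-- one SELECT per source column, unioned into (year, quarter, amount) rows
SELECT year, 'Q1' AS quarter, Q1 AS amount FROM salescross
UNION ALL
SELECT year, 'Q2', Q2 FROM salescross
UNION ALL
SELECT year, 'Q3', Q3 FROM salescross
UNION ALL
SELECT year, 'Q4', Q4 FROM salescross
ORDER BY year, quarter;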

Though simple, the code is still long, because you need to enumerate the new rows in each group, which could be quarters, months or states. Fortunately, the new rows in each group are the column (field) names of the source table, which are fixed rather than dynamic, so the algorithm is not complicated.

PIVOT/UNPIVOT functions

To make transpositions convenient, database vendors released special functions to implement these algorithms.

The PIVOT function performs the row-to-column transposition.
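A minimal sketch in Oracle-style syntax, assuming the same sales(year, quarter, amount) table; names are assumptions:

-- each quoted quarter value becomes a new column; MAX is the forced aggregate
SELECT *
FROM sales
PIVOT (MAX(amount) FOR quarter IN ('Q1' AS Q1, 'Q2' AS Q2, 'Q3' AS Q3, 'Q4' AS Q4))
ORDER BY year;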

The PIVOT function shortens the code, but it doesn't address the real problems. SQL's weaknesses are still there.

It can't handle dynamic columns. A stored procedure or Java is still needed, and the code remains difficult to develop and maintain.

It can't fix the set-oriented problem either. All it can do is force an aggregation in every scenario, even when unnecessary. For beginners, that is a hard nut to crack and demands a lot of extra effort.

In certain cases the aggregation is necessary. For example, based on a grouped sales table that has multiple records per quarter per year, perform the row-to-column transposition and calculate the biggest amount per quarter per year:

Aggregation is reasonable and necessary in such a case, and we can use the same core PIVOT code as above, with MAX as the aggregate.

Now you can see that this is "grouping & aggregation + row-to-column transposition" rather than a pure transposition. Beginners may wonder why the same core code serves two different algorithms. If you have read the previous part carefully, you know that is due to SQL's incomplete set orientation.

The UNPIVOT function is easier to understand.
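A minimal sketch in Oracle-style syntax, assuming the cross table salescross(year, Q1, Q2, Q3, Q4); names are assumptions:

-- the four columns collapse into (quarter, amount) value pairs
SELECT year, quarter, amount
FROM salescross
UNPIVOT (amount FOR quarter IN (Q1 AS 'Q1', Q2 AS 'Q2', Q3 AS 'Q3', Q4 AS 'Q4'))
ORDER BY year, quarter;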

UNPIVOT produces shorter and easier-to-understand code. The code is simple because no aggregate operation is involved. Besides, it is rare for a column-to-row transposition to involve dynamic column names, so the algorithm won't be too complicated. In view of these points, UNPIVOT works well.

Bidirectional transposition

A bidirectional transposition, generally performed over a crosstab, swaps rows and columns in both directions: row values become column names while column names become row values.

3. Task: Transpose the Year-Quarter sales table into a Quarter-Year sales table. That is, convert the Year values into new column names year2018 and year2019, and at the same time transform the column names Q1-Q4 into values of a new quarter column.

The expected result is as follows:

As the name suggests, the bidirectional transposition first performs a column-to-row transposition over Q1-Q4 and then a row-to-column transposition over year2018 and year2019. In a small database without PIVOT/UNPIVOT, the code has to combine the two early solutions.
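A minimal sketch of that combination, assuming salescross(year, Q1, Q2, Q3, Q4) with year values 2018 and 2019; names are assumptions:

-- step 1: column to row over Q1-Q4; step 2: row to column over the year values
SELECT quarter,
       MAX(CASE WHEN year = 2018 THEN amount END) AS year2018,
       MAX(CASE WHEN year = 2019 THEN amount END) AS year2019
FROM (
    SELECT year, 'Q1' AS quarter, Q1 AS amount FROM salescross
    UNION ALL SELECT year, 'Q2', Q2 FROM salescross
    UNION ALL SELECT year, 'Q3', Q3 FROM salescross
    UNION ALL SELECT year, 'Q4', Q4 FROM salescross
) t
GROUP BY quarter
ORDER BY quarter;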

As the program contains both the row-to-column and column-to-row algorithms, it has the weaknesses of both: lengthy code, the dynamic column problem and the unintelligible aggregate operation. A procedural language like Java or C++ executes commands in order, so the relationship between code complexity and code length is roughly linear. SQL is different: it is hard to write a SQL program step by step or module by module, or to debug it with breakpoints, so code complexity grows exponentially as the code gets longer. All this makes the bidirectional transposition more difficult to implement than it appears.

It may appear that you could reverse the order: row to column first and then column to row. Actually that won't work well, because the union would turn one subquery into four, producing even longer code and lower performance. There would be no such problem in a database that supports the WITH clause, such as Oracle.

You can chain PIVOT and UNPIVOT if you use Oracle or MSSQL rather than a small database that has to rely on the WITH-clause workaround.
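A minimal sketch in Oracle-style syntax chaining UNPIVOT and PIVOT; the same assumed salescross table and column names apply:

SELECT *
FROM (
    SELECT *
    FROM salescross
    UNPIVOT (amount FOR quarter IN (Q1 AS 'Q1', Q2 AS 'Q2', Q3 AS 'Q3', Q4 AS 'Q4'))
)
PIVOT (MAX(amount) FOR year IN (2018 AS year2018, 2019 AS year2019))
ORDER BY quarter;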

The order of the column-to-row result is random, so you need ORDER BY to sort the quarter column by Q1-Q4. If you want to arrange it in a user-defined order (like 0, a, 1), you need to create a pseudo table and join with it, which greatly complicates the computation.

Another point about the PIVOT/UNPIVOT functions is that they are not in the ANSI standard. Vendors implement them in their own ways, so it is difficult to migrate the code between different databases.

Dynamic transposition

A dynamic transposition has unfixed, changeable values to be transposed, so the transposed rows or columns are indefinite and must be calculated dynamically.

4. Dynamic row-to-column transposition: There is a Dept-Area average salary table where the number of areas grows as the business expands. Task: Convert the values under the Area field (rows) into new field names (columns).

As shown in the figure below:

It seems we could get this done using PIVOT with a subquery in the IN clause to dynamically get the unique area values.
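Roughly what one might try to write, assuming a table salary(dept, area, avg_salary); the names are assumptions:

-- plain PIVOT does not actually accept a subquery in the IN list (see below)
SELECT *
FROM salary
PIVOT (MAX(avg_salary) FOR area IN (SELECT DISTINCT area FROM salary));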

Actually, PIVOT's IN clause is different: it does not accept a direct subquery.

To use a subquery directly, you need the unusual XML keyword.
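A minimal sketch of Oracle's PIVOT XML variant, which does accept a subquery (or the ANY keyword) in the IN list; the same assumed salary(dept, area, avg_salary) table is used:

-- the pivoted data comes back as a single XMLType column named after the pivot column
SELECT *
FROM salary
PIVOT XML (MAX(avg_salary) FOR area IN (SELECT DISTINCT area FROM salary));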

And you get a strange intermediate result set consisting of two fields, one of which is of XML type, as shown below:

Then you need to parse the XML dynamically to get the AREA nodes, generate a dynamic table structure, and populate it with data dynamically. It is impossible to implement such a dynamic algorithm in SQL alone. For the subsequent computation, you need to embed the SQL in Java or a stored procedure, and the code becomes extremely long.

5. Row-to-column transposition over intra-group records: In the income source table, Name is the logical grouping field; Source and Income are the intra-group fields. Each Name corresponds to multiple records in its group, and the number of records is indefinite. Task: transpose rows to columns for each group.

Below is the source data and the expected transposed data:

The logic is clear: generate the result table structure, insert data into it and then export data from it.

Yet the implementation is not simple at all. The code needs a lot of dynamic syntax, even inside nested loops, but SQL doesn't support dynamic syntax. To make up for this, SQL turns to another language, such as Java or a stored procedure, which is not good at structured computations; forcing it to do the job results in lengthy code. Below is the outline of the SQL solution:

1. Calculate the number of intra-group fields (colN) in the result table: group the source table by Name, count the records in each group, and take the largest count (see the sketch after step 6). In the above table, both David and Andrew have two records, which is the most, so colN is 2 and the dynamic column name list is colNames.

2. Dynamically generate the SQL string (cStr) that creates the result table. This requires looping colN times to generate a set of intra-group fields each time. The fields include one fixed column and 2*colN dynamic columns (as the above table shows).

3. Execute the generated SQL string dynamically to create a temporary table, using code like execute immediate cStr.

4. Get the list of key words (rowKeys) to be inserted into the result table by performing distinct over the source table (see the sketch after step 6). The key word list for the above table is rowKeys=["David", "Daniel", "Andrew", "Robert"].

5. Loop over rowKeys to dynamically generate an insertion SQL string (iStr) and execute it. To generate iStr, query the source table by the current Name to get the corresponding list of records, then loop over that list to compose iStr and execute it. Both the generation of iStr and its subsequent execution are dynamic. That is the end of one round of the loop.

6. Query the result table to return the data.
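A minimal sketch of the two grouping queries behind steps 1 and 4, assuming the income source table is named income(name, source, income); names are assumptions:

-- step 1: colN is the largest number of records in any Name group
SELECT MAX(cnt) AS colN
FROM (SELECT COUNT(*) AS cnt FROM income GROUP BY name) g;

-- step 4: rowKeys is the list of distinct Name values
SELECT DISTINCT name FROM income;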

The code would be much simpler if SQL supported dynamic syntax, or if Java/the stored procedure had a built-in structured computation library independent of SQL.

The algorithm of step 4 removes duplicates from the Name values, which is equivalent to getting the values of the grouping field after the data is grouped. Step 1 counts the records in each group after the same grouping. Since both involve the same grouping action, the grouping result could in theory be reused. But because an aggregate always follows a grouping action, due to SQL's incomplete set orientation, the grouping result cannot be reused. With a small amount of data, such reuse matters little if you don't care whether the code is graceful. But with a large amount of data, or an algorithm that requires frequent reuse, reusability affects performance.

6. Complex static row-to-column transposition: There are always 7 records per person per day in the attendance table. We want to transpose each set of records into 2 records. Values of the In, Out, Break and Return fields in the first record come from the Time values of the 1st, 7th, 2nd and 3rd records in the source table; values of the second record come from the Time values of the 1st, 7th, 5th and 6th records.

The source table:

The expected transposed table:

Since the number of columns after transposition is fixed, we can implement the algorithm in SQL.
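A minimal sketch, assuming a table attendance(name, workdate, time) that holds exactly 7 time records per person per day; table and column names are assumptions:

-- number the 7 records of each person-day, then pick records 1,7,2,3 and 1,7,5,6
WITH t AS (
    SELECT name, workdate, time,
           ROW_NUMBER() OVER (PARTITION BY name, workdate ORDER BY time) AS rn
    FROM attendance
)
SELECT name, workdate,
       MAX(CASE WHEN rn = 1 THEN time END) AS in_time,
       MAX(CASE WHEN rn = 7 THEN time END) AS out_time,
       MAX(CASE WHEN rn = 2 THEN time END) AS break_time,
       MAX(CASE WHEN rn = 3 THEN time END) AS return_time
FROM t
GROUP BY name, workdate
UNION ALL
SELECT name, workdate,
       MAX(CASE WHEN rn = 1 THEN time END),
       MAX(CASE WHEN rn = 7 THEN time END),
       MAX(CASE WHEN rn = 5 THEN time END),
       MAX(CASE WHEN rn = 6 THEN time END)
FROM t
GROUP BY name, workdate
ORDER BY name, workdate;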

SQL is based on unordered sets and doesn't support referencing records directly by sequence number. To make the data retrieval convenient, we have to create sequence numbers manually in the WITH clause. As explained earlier, the additional MAX aggregate is another display of SQL's incomplete set orientation.

7. Complex dynamic row-to-column transposition: The user table relates to the record table through user IDs. Each user has activity records on certain dates in 2018. Task: Find whether each user has an activity record in each week of 2018. User names will be transposed into new columns.

The source table structure:

The expected transposed table:

We need to implement the dynamic columns using a stored procedure or Java plus dynamic SQL. The code will be very long.

We need some preparation: join the user table and the record table; add a calculated column that computes which week (counted from 2018-01-01, so no greater than 53) each Date value falls in; find the maximum week number to get the key word list rowKeys for the target table; and perform distinct over the join result to get the new column names colNames for the target table.
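A minimal sketch of this preparation in SQL Server-style syntax, assuming users(user_id, user_name) and records(user_id, activity_date); the names and the DATEDIFF call are assumptions:

-- integer division of the day difference by 7 gives the week number, with week 1 starting 2018-01-01
SELECT u.user_name,
       DATEDIFF(day, '2018-01-01', r.activity_date) / 7 + 1 AS week_no
FROM users u
JOIN records r ON r.user_id = u.user_id;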

Then we implement the dynamic transposition algorithm: generate a dynamic SQL query according to colNames to create the target table and execute it; loop through rowKeys, get data from the join result, generate the INSERT SQL dynamically, and execute it.

All transpositions involving dynamic columns require generating the dynamic target table structure and then inserting data dynamically. The implementation is difficult, and we have to turn to Java or a stored procedure because SQL lacks the ability to express dynamic queries. I'll simply say "dynamic transposition" for similar scenarios in the later illustrations.

Transposition + inter-column calculation

A pure transposition exists mostly in exercise books. In real-world business, a transposition is often accompanied by other operations, such as inter-column calculations.

8. The Temp table stores the monthly payable amount for each customer in 2014, and the name field is the key. Now we want to transpose the months in the dates into new columns (Month 1-12) whose values are the monthly payable amounts. If the amount is null for a month, just use the amount of the previous month.

The source table:

The target transposed table:

We can handle this transposition in SQL since the columns after transposition are fixed. The algorithm is this: create a temporary table t1 made up of one field, month, whose values are 1-12; extract the month from the dates in the source table and name the field month; perform a left join between the two tables to create continuous payable records that include invalid data; then use PIVOT to do the row-to-column transposition and remove the invalid data through a MIN aggregate.

The code is not very long but it is difficult to understand, particularly the extra creation of invalid data. That is because SQL sets have no sequence numbers and the language isn't good at order-based calculations, especially inter-row calculations.

Table join + column to row transposition

9. Insert the sub table into the main table: The Order table is the main table and the OrderDetail table is its sub table. One order corresponds to at least one detail record. We want to insert the details into the orders, as shown below:

The relationship between the source tables:

The target transposed table:

We use a stored procedure/Java + dynamic SQL to implement the dynamic columns. The algorithm is this: join the two tables; group the join result (or the sub table) by ID, count the records in each group, and take the largest count to get colNames, the dynamic column list; perform distinct over the join result (or the main table) by ID to get the key word list rowKeys for the target table; then implement the dynamic transposition algorithm according to colNames and rowKeys.

10. Table join + column to row transposition: Both the Exam table and the Retest table are sub tables of the Students table. We want to convert the data in the sub tables into new columns of the main table and add a total_score column. The exam subjects may vary between students, not every student takes the retest, and the exam subjects always include the retest subject(s).

The source tables and their relationship:

The target transposed table:

If the exam subjects are fixed, we can do it in SQL: left join the Students table and the Exam table and perform PIVOT; left join the Retest table and the Exam table and perform PIVOT; then perform another left join between the two result tables.
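A minimal sketch of the fixed-subject case with two illustrative subjects, pivoting with CASE WHEN in place of PIVOT and then left-joining; the schemas Students(stu_id, stu_name), Exam(stu_id, subject, score) and Retest(stu_id, subject, score) are assumptions:

-- total_score here sums the exam scores; adjust if the intended total differs
SELECT s.stu_id, s.stu_name,
       e.math AS exam_math, e.english AS exam_english,
       r.math AS retest_math, r.english AS retest_english,
       COALESCE(e.math, 0) + COALESCE(e.english, 0) AS total_score
FROM Students s
LEFT JOIN (
    SELECT stu_id,
           MAX(CASE WHEN subject = 'math' THEN score END) AS math,
           MAX(CASE WHEN subject = 'english' THEN score END) AS english
    FROM Exam GROUP BY stu_id
) e ON e.stu_id = s.stu_id
LEFT JOIN (
    SELECT stu_id,
           MAX(CASE WHEN subject = 'math' THEN score END) AS math,
           MAX(CASE WHEN subject = 'english' THEN score END) AS english
    FROM Retest GROUP BY stu_id
) r ON r.stu_id = s.stu_id;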

But in this case the subjects are not fixed, so the target table has dynamic columns. It is the old trick again: stored procedure/Java + dynamic SQL. The algorithm is this: left join both sub tables to the Students table; group the join result by stu_id, count the records in each group, and take the largest count to get the dynamic column list (colNames) for the target table; perform distinct over the join result by stu_id to get the key word list (rowKeys); then implement the dynamic transposition algorithm by colNames and rowKeys.

Display data in column groups

11. The source table records the populations of certain cities on different continents. We want to get the European and African cities and their populations and display them in two column groups side by side. The target columns are fixed but the number of rows in the source table is dynamic:

We can implement a target table with a fixed structure in SQL. The algorithm is this: filter out the records of European cities and use rownum to add a sequence-number column; get the records of African cities in the same way; perform a full join between them and select the desired fields.

The SQL code:
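A minimal sketch in Oracle-style syntax, assuming a table cities(continent, city, population); names are assumptions:

-- number the European and African rows separately, then align them with a full join
WITH e AS (
    SELECT city, population, ROWNUM AS rn
    FROM (SELECT city, population FROM cities WHERE continent = 'Europe')
),
a AS (
    SELECT city, population, ROWNUM AS rn
    FROM (SELECT city, population FROM cities WHERE continent = 'Africa')
)
SELECT e.city AS eu_city, e.population AS eu_population,
       a.city AS af_city, a.population AS af_population
FROM e
FULL JOIN a ON e.rn = a.rn
ORDER BY COALESCE(e.rn, a.rn);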

Summary

After these detailed explanations, you can see that only the three types of simple transposition can be handled directly with SQL PIVOT and UNPIVOT in large databases, and even then you need to take care of XML parsing, the unordered result set and the migration problem.

For transposition algorithms that are not so simple, SQL can handle them if the columns are fixed, but the code is difficult to write. You need to be familiar with SQL's weaknesses and devise ingenious, unusual techniques to work around them. The defects include incomplete set orientation, lack of sequence numbers for set members, awkward order-based calculations, non-procedural computation and hard-to-debug code.

For algorithms involving dynamic columns, the code is difficult to write, and you have to turn to Java or a stored procedure, producing very complicated code. The lack of support for dynamic data structures is another SQL flaw.

These SQL headaches stem from the limitations of the era when the language was born. They don't exist in other languages such as VB, C++ or Java, or in stored procedures. On the other hand, those languages have weaker set-based computing ability and lack a class library for structured computations, so they need a lot of code to implement the transposition algorithms if no SQL is embedded in them.

Yet all these problems can be solved with esProc SPL. esProc is a professional data computing engine based on ordered sets. It provides all-round structured computation functions as SQL does and intrinsically supports stepwise coding and debugging as Java does, inheriting the merits of both. You can always use SPL instead of Java + SQL to handle the transposition tasks effortlessly.
