12/19/2005

Things I miss about Oracle 10g

Due to a recent job change, I have had to go from 10gR1 back to 8i. While it hasn't been a horrible transition, there are some things I wish I were still ignorant of. Here are some things I miss about Oracle 10gR1:

1) Dual row generation trick
I don't use this one much, but when I need it, it's one of the best ways to create rows on the fly.


SELECT ROWNUM rn
FROM DUAL
CONNECT BY ROWNUM <= :p_num_rows
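
The same trick is often spelled with LEVEL instead of ROWNUM; as far as I know either form works in 10g (untested sketch, like the rest of the examples here):

```sql
-- Generate N rows on the fly, equivalent to the ROWNUM form above
SELECT LEVEL rn
FROM DUAL
CONNECT BY LEVEL <= :p_num_rows
```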



2) With clause
I love the modularity of this syntax. It's great when you want to get a filtered list from a master table and then get the child records of those master records. Complex queries can be written in a cascading style instead of inside out, and they become better self-documented. It even opens a new path of materializing a reused query to temp for the sake of performance!


WITH info AS (
  SELECT o.ORDER_ID, o.CUSTOMER_NAME
  FROM ORDERS o
  WHERE o.ORDER_TOTAL > 1000
), details AS (
  SELECT od.ORDER_ID, od.PRODUCT, od.QUANTITY, od.LINE_TOTAL
  FROM ORDER_DETAILS od
  WHERE od.ORDER_ID IN (
    SELECT i.ORDER_ID
    FROM info i
  )
)
SELECT
  i.ORDER_ID,
  i.CUSTOMER_NAME,
  NULL PRODUCT,
  NULL QUANTITY,
  NULL LINE_TOTAL
FROM info i
UNION ALL
SELECT
  d.ORDER_ID,
  NULL CUSTOMER_NAME,
  d.PRODUCT,
  d.QUANTITY,
  d.LINE_TOTAL
FROM details d



3) SQL and PL/SQL having the same parser.
This is one of the most annoying changes. Write a query that works from TOAD/SQL*Plus and then, when you put it in your PL/SQL, it bombs. Why? Because the 8i PL/SQL parser doesn't recognize the syntax. It's easy to get around: just run it dynamically with "execute immediate" or "open {cursor} for" syntax, but I prefer to avoid as much hard parsing as possible.
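
A sketch of the workaround (untested, and TEST_TABLE is a made-up name). Analytic functions are the classic example: the 8i SQL engine runs them fine, but static PL/SQL won't compile them, so you hand the text to the SQL engine at runtime:

```sql
DECLARE
  TYPE t_cur IS REF CURSOR;
  v_cur t_cur;
  v_id  NUMBER;
  v_cnt NUMBER;
BEGIN
  -- This analytic would not compile as static SQL in 8i PL/SQL,
  -- but the SQL engine parses it without complaint at runtime
  OPEN v_cur FOR
    'SELECT ID, COUNT(*) OVER () FROM TEST_TABLE';
  LOOP
    FETCH v_cur INTO v_id, v_cnt;
    EXIT WHEN v_cur%NOTFOUND;
    NULL; -- process the row here
  END LOOP;
  CLOSE v_cur;
END;
```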


4) User defined aggregates (especially stragg).
When you want to return a list of records plus a comma separated list of some detail for each (e.g. a list of users and each user's interests), the fastest and best way I've found to return that is via stragg. Without it, I've had to resort to all sorts of tricks depending on the situation: a cursor column, a query for details for each record, a cartesian of sorts (duplicate the master record for each child), or hard coding a max number of details to display.
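
For reference, a typical call looks like the untested sketch below. It assumes you have Tom Kyte's stragg user defined aggregate installed, and the USERS/INTERESTS tables are hypothetical:

```sql
-- One row per user, with a comma separated list of their interests
SELECT
  u.USER_ID,
  u.USER_NAME,
  stragg( i.INTEREST ) INTERESTS
FROM USERS u, INTERESTS i
WHERE i.USER_ID = u.USER_ID
GROUP BY
  u.USER_ID,
  u.USER_NAME
```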


5) Flashback
This feature is just too cool. It's nice to be able to "see" what modifications were made to a table: just compare the table to a flashback at a known time and see if it did what you expected. It's like having access to any data in your db as far back as your undo space can hold.
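
The compare-to-a-known-time idea can be sketched with a flashback query (untested; T is a hypothetical table, and how far back you can go depends on your undo retention):

```sql
-- Rows in T now that were not there an hour ago
-- (i.e. inserted or changed since then)
SELECT * FROM T
MINUS
SELECT * FROM T AS OF TIMESTAMP ( SYSTIMESTAMP - INTERVAL '1' HOUR )
```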


6) PL/SQL tables indexed by VARCHAR2
Trying to figure out how to assign a number to each row/group you want to process when it has no uniquely identifying numeric attribute can be awfully painful. Usually it's a choice between a simple algorithm that will be slower, easier to read, and harder to screw up and a more complex "correct" algorithm that will be faster, but harder to read and easier to screw up. Indexing the PL/SQL table by VARCHAR2 removes the need for the numbering scheme altogether.
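
Here is an untested sketch of what I mean: keying working data by a natural string instead of an invented number (the region names and totals are made up):

```sql
DECLARE
  -- Associative array keyed by a string instead of a BINARY_INTEGER
  TYPE t_totals IS TABLE OF NUMBER INDEX BY VARCHAR2(100);
  v_totals t_totals;
  v_key    VARCHAR2(100);
BEGIN
  -- Accumulate by a natural string key; no numbering scheme needed
  v_totals( 'EAST' ) := 100;
  v_totals( 'WEST' ) := 250;
  v_totals( 'EAST' ) := v_totals( 'EAST' ) + 50;

  -- Walk the keys (FIRST/NEXT return them in sorted order)
  v_key := v_totals.FIRST;
  WHILE v_key IS NOT NULL LOOP
    DBMS_OUTPUT.PUT_LINE( v_key || ' = ' || v_totals( v_key ) );
    v_key := v_totals.NEXT( v_key );
  END LOOP;
END;
```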


7) AWR
Ok, it's available in 8i via Statspack, but AWR is much cleaner. Instead of having to keep the correct scripts in your Oracle home and then run the report script, you just use the DBMS_WORKLOAD_REPOSITORY package to take the snapshots AND produce the report. Very cool. One of the easier ways to get performance info without having to do any plate switching. Makes me wonder why they don't build a trace analyzing package into the database to make it easy to tkprof a session's work...
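
The package side looks roughly like this untested sketch (the bind variables for dbid, instance number, and snapshot ids are placeholders you'd fill in yourself):

```sql
-- Take a manual AWR snapshot before and after the work of interest
BEGIN
  DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT;
END;
/

-- Then pull the text report between two snapshot ids
SELECT output
FROM TABLE(
  DBMS_WORKLOAD_REPOSITORY.AWR_REPORT_TEXT(
    :dbid, :inst_num, :begin_snap, :end_snap
  )
)
```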


8) Transparent PL/SQL optimizations
While it may just feel like 10g is faster, I think it actually is faster thanks to some of the optimizations they did. One known example is the transparent 100 row bulk fetch for implicit cursor for loops.
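
In other words, a plain cursor for loop like the untested sketch below gets silently array-fetched about 100 rows at a time by the 10g compiler (as I understand it, subject to the PL/SQL optimization level), where 8i fetches row by row:

```sql
BEGIN
  -- In 10g this implicit cursor is bulk fetched behind the scenes;
  -- no explicit BULK COLLECT code required
  FOR rec IN ( SELECT EMP_ID, SALARY FROM EMP ) LOOP
    NULL; -- process rec.EMP_ID / rec.SALARY here
  END LOOP;
END;
```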


9) Returning clause
This clause allows you to insert a record and simultaneously query back what you just inserted. Why is it useful? The most common use for me is to insert a record keyed by a sequence and get the new id back in one statement.


INSERT INTO T (
  ID,
  NAME
) VALUES (
  T_SEQ.NEXTVAL,
  'John Doe'
)
RETURNING ID INTO v_id;



10) Inserting using a %ROWTYPE PL/SQL variable
This just looks cleaner to me: declare a %ROWTYPE variable, set its fields, then insert all the values at once. I don't use it when I need to get a sequence value back, as the separate sequence query plus insert performs worse at scale (the more inserts you do that way, the worse it gets).


DECLARE
  v_t T%ROWTYPE;
BEGIN
  v_t.NAME := 'Tom';
  INSERT INTO T VALUES v_t;
END;



11) Merge
Another tool that's perfect for what it was designed. Want to ensure that a record exists a certain way in a table, but don't know whether to insert or update it? Then merge it!


MERGE INTO T old_t
USING (
  SELECT 12 ID, 'John Doe' NAME
  FROM DUAL
) new_t
ON ( old_t.ID = new_t.ID )
WHEN MATCHED THEN UPDATE
  SET old_t.NAME = new_t.NAME
WHEN NOT MATCHED THEN INSERT (
  ID,
  NAME
) VALUES (
  new_t.ID,
  new_t.NAME
)


The examples are not tested and are merely meant to show the syntax. If I can come up with one more, then I may have to make a 12 days of XMas song out of it :-).

12/16/2005

Analytics: An Introduction

I think Tom Kyte is dead on when he says analytics are the best thing since "select" (i.e. since SQL was created). Analytics are tough to understand at first, but become hard to live without. What is an analytic? It is a way to query the results of your query.

Before playing, let's create some sample data to play with:

CREATE TABLE EMP AS
SELECT
  A.EMP_ID,
  A.DEPT_ID,
  ROUND(
    ( 800000 / COUNT(*) OVER ( PARTITION BY DEPT_ID ) ) +
    ( ( MOD( EMP_ID, 10 ) - 5 ) * 976.32 ),
    2
  ) SALARY
FROM (
  SELECT
    ROWNUM EMP_ID,
    MOD( ROWNUM, MOD( ROWNUM, 10 ) + 3 ) DEPT_ID
  FROM ALL_OBJECTS ao
  WHERE ROWNUM <= 100
) A

* Note: I would've used random data, but then the results would be different for different people.

Now we have 100 employees split unevenly among 12 departments with each department having a salary range.


Every query starts with a question, so what are some things we can ask of this data?

1) How many employees are there?
2) How many employees are there in each department?
3) What is the total salary of each department?
4) What is the salary range of each department?

Each of these questions can be answered with either a group by or with analytics. The difference is that group by eliminates the redundant detail rows, while analytics show the same group-level information repeated on every detail row of the group.


1) How many employees are there?

SELECT COUNT(*) cnt
FROM EMP e

CNT
---
100


Ok, no group by, sue me. But what if we wanted to get all the employees AND know how many there were?


SELECT COUNT(*) OVER () cnt, e.*
FROM EMP e

CNT EMP_ID DEPT_ID SALARY
100 14 0 113309.39
100 30 0 109404.11
100 60 0 109404.11
100 84 0 113309.39
100 36 0 115262.03
... results snipped ...


Ok, so does COUNT(*) OVER () tell me how many rows are in the table? No, not quite. Let's get all the employees in DEPT_ID = 9 to see what happens.


SELECT COUNT(*) OVER () cnt, e.*
FROM EMP e
WHERE e.DEPT_ID = 9

CNT EMP_ID DEPT_ID SALARY
2 9 9 403905.28
2 69 9 403905.28


The analytic COUNT(*) OVER () returns the number of rows in the result set. Analytics take the basic form {Aggregate (e.g. count)} OVER ({window for this row}). By specifying nothing in the window area, we told Oracle to use EVERYTHING as the window. If this is new to you, I suggest playing around using aggregates you already know with the OVER () to get a feel for what it's doing and how analytics perform. SUM, COUNT, AVG, MIN, and MAX are just a few aggregates you can use this way. Keep in mind that the analytic will be displayed for every row.


2) How many employees are there in each department?

SELECT e.DEPT_ID, COUNT(*) cnt
FROM EMP e
GROUP BY e.DEPT_ID

DEPT_ID CNT
0 7
1 18
2 16
3 17
4 4
... results snipped ...


Ok, so let's use our counting analytic and see what happens.

SELECT e.DEPT_ID, COUNT(*) OVER () cnt
FROM EMP e
GROUP BY e.DEPT_ID

DEPT_ID CNT
0 12
1 12
2 12
3 12
4 12
... results snipped ...


Darn, that's not at all what we wanted. Why is it 12 and for every row? Like I said earlier, analytics are a query against your result set. If you dropped the COUNT(*) OVER () portion, you'd get 12 rows. COUNT(*) OVER () yields the count of rows in the result set and is therefore 12. So how do we find out how many employees in each department? The partition clause.


SELECT COUNT(*) OVER ( PARTITION BY e.DEPT_ID ) cnt, e.DEPT_ID
FROM EMP e

CNT DEPT_ID
7 0
7 0
7 0
7 0
7 0
... results snipped ...


That's sort of what we wanted, but why all the duplicates? Well, in order for the analytics to work, all the rows we want to analyze have to be in the result set. To get back to the grouped shape, you'd have to either distinct the results or wrap the query in an outer query with a group by. I will show both methods below without result sets.

Using distinct:

SELECT DISTINCT COUNT(*) OVER ( PARTITION BY e.DEPT_ID ) cnt, e.DEPT_ID
FROM EMP e

Using query of a query w/group by:
SELECT A.CNT, A.DEPT_ID
FROM (
  SELECT COUNT(*) OVER ( PARTITION BY e.DEPT_ID ) cnt, e.DEPT_ID
  FROM EMP e
) A
GROUP BY A.CNT, A.DEPT_ID


At this point, I suggest playing around with using partition by and not using partition by. A good understanding of this can yield quick correct results and greatly reduce procedural programming. Group by 1, 2, 3, or more columns against different tables in your system. Put a few COUNT(*) OVER ( PARTITION BY {columns} ) in a single query to see how it performs. One tip for "seeing" what's going on is to order the query by the same columns as the partition clause so you can easily perform the same calculations by hand (if feasible).


3) What is the total salary of each department?
At this point, you should be able to answer this one on your own, but here are my "answers" anyway.


SELECT e.DEPT_ID, SUM( e.SALARY ) TOTAL_SALARY
FROM EMP e
GROUP BY e.DEPT_ID

DEPT_ID TOTAL_SALARY
0 787307.81
1 761923.44
2 758994.56
3 782426.18
4 801952.64
... results snipped ...

SELECT DISTINCT e.DEPT_ID, SUM( e.SALARY ) OVER ( PARTITION BY e.DEPT_ID ) TOTAL_SALARY
FROM EMP e

DEPT_ID TOTAL_SALARY
0 787307.81
1 761923.44
2 758994.56
3 782426.18
4 801952.64
... results snipped ...


4) What is the salary range of each department?
This one should be answerable with the above information, but here are my answers.


SELECT
  e.DEPT_ID,
  MIN( e.SALARY ) MIN_SALARY,
  MAX( e.SALARY ) MAX_SALARY
FROM EMP e
GROUP BY e.DEPT_ID

DEPT_ID MIN_SALARY MAX_SALARY
0 109404.11 117214.67
1 39562.84 48349.72
2 45118.4 52928.96
3 43153.54 50964.1
4 199023.68 202928.96
... results snipped ...

SELECT DISTINCT
  e.DEPT_ID,
  MIN( e.SALARY ) OVER ( PARTITION BY e.DEPT_ID ) MIN_SALARY,
  MAX( e.SALARY ) OVER ( PARTITION BY e.DEPT_ID ) MAX_SALARY
FROM EMP e

DEPT_ID MIN_SALARY MAX_SALARY
0 109404.11 117214.67
1 39562.84 48349.72
2 45118.4 52928.96
3 43153.54 50964.1
4 199023.68 202928.96
... results snipped ...


This introduction only covers some of the capabilities of analytics. Analytics can also pick off values from previous or future rows, keep running statistics (usually totals), do moving windows, and more. I'll go over these uses in a future post.
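
As a taste of those features, here is an untested sketch against the same EMP table: LAG picks off the previous row's value, and adding an ORDER BY inside the window turns SUM into a running total:

```sql
SELECT
  e.DEPT_ID,
  e.EMP_ID,
  e.SALARY,
  -- Previous employee's salary within the department (NULL for the first)
  LAG( e.SALARY ) OVER ( PARTITION BY e.DEPT_ID ORDER BY e.EMP_ID ) PREV_SALARY,
  -- Running total of salary within the department
  SUM( e.SALARY ) OVER ( PARTITION BY e.DEPT_ID ORDER BY e.EMP_ID ) RUNNING_TOTAL
FROM EMP e
ORDER BY e.DEPT_ID, e.EMP_ID
```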

A performance note about analytics
One performance "problem" I came across using analytics was that sometimes they were fast and other times I thought they were slow. I couldn't understand why until I started reducing the slow queries. Since analytics have to analyze the result set they are a part of, they often preclude the use of predicate filtering and indexes. For example:

Create a unique key on emp_id:

ALTER TABLE EMP ADD ( PRIMARY KEY( EMP_ID ) );
{Gather stats on emp with the cascade option}

Create a view of emp with an analytic:
CREATE OR REPLACE VIEW EMP_WITH_COUNT AS
SELECT COUNT(*) OVER () cnt, e.*
FROM EMP e;

Query1:
SELECT e.*
FROM EMP e
WHERE EMP_ID = 3

Query2:
SELECT ewc.*
FROM EMP_WITH_COUNT ewc
WHERE ewc.EMP_ID = 3


If you look at the explain plans, you'll see that query 1 does a unique index scan with table access by rowid. Query 2, on the other hand, does a full scan. Why? Because the analytic prevents the optimizer from pushing the EMP_ID = 3 predicate into the view. The view has to be run first and then EMP_ID = 3 can be looked up.

12/15/2005

SQL "Beautifying"

Read PL/SQL Beautifier at Tim Hall's blog. I like the idea of a built-in SQL and PL/SQL formatter for Oracle. If Oracle creates a built-in formatter, I'd like it to be extensible: have a default format and give the user (us) the ability to tweak it to our liking. Once the "beautifier" comes out and is integrated into 3rd party tools, you'll have one location to tweak your SQL and PL/SQL formatting style. That sounds good to me. While you're at it, throw in an asserter so we can just say DBMS_BEAUTIFIER.assert_is_formatted( v_sql_or_plsql_clob ).


However, I'm writing to discuss "good" and "bad" SQL formats. I've found that many of the suggested standards are either poor for readability or poor for maintainability.

First some bad:
1) Aligning the keywords

SELECT DUMMY
FROM DUAL
WHERE 1 = 1
AND 2 = 2


Why is it bad? On its own, it doesn't make a difference, but if you try to follow that format in PL/SQL, it becomes both hard to maintain and hard to read. Also, try writing fluent SQL while figuring out how many spaces you need to enter before starting your keyword. Formats should be as close to natural writing as possible.

2) Commas before columns

SELECT
DUMMY d1
, DUMMY d2
, DUMMY d3
FROM DUAL


This is annoying because the comma isn't the most important thing about each row: THE COLUMN IS. This format is used to "help" people remember commas after each non-terminal column. If a programmer can remember to put a semi-colon after every single line of code they write, they can remember to put a comma after non-terminal columns. The most important information should come first.


And some good:
1) Alias everything in the from list and use those aliases for EVERY column reference.
This adds both readability and maintainability:

  • Easier to determine from where columns are coming - readability

  • May be more self-documenting - readability

  • Easier to switch which tables or views are referenced - maintainability

  • Allows the addition of columns without breaking the query - maintainability




SELECT
  ORDER_ID,
  CUSTOMER,
  SALES_TOTAL,
  PRODUCT,
  COMMENTS
FROM ORDERS, ORDER_DETAILS
WHERE ORDER_ID = ORD_ORDER_ID


Where are sales_total and comments coming from?
What if I want to use my ORDERS_SUMMARY view instead of ORDERS?
What if I want to add a comments column to my ORDER_DETAILS table?

Aliasing everything answers all these questions clearly:

SELECT
  o.ORDER_ID,
  o.CUSTOMER,
  od.SALES_TOTAL,
  od.PRODUCT,
  od.COMMENTS
FROM ORDERS o, ORDER_DETAILS od
WHERE o.ORDER_ID = od.ORD_ORDER_ID


2) Put each item of the select or from list on its own line when it exceeds 2 elements
One of the most common changes for a query is adding information or changing from where information is coming. By putting each select and from element on its own line, you not only make it easier to read, but you make it easier to add, remove, and reorder select and from elements. It also clears up any confusion between columns used in a function call vs columns in the select list.

3) Put each parameter of a function call on its own line when it exceeds 2 elements
This one is more a rule of thumb (ROT) and should be broken as you see fit. For instance, with decodes, I generally put the item being decoded on its own line, then each pair of elements on its own line, then the default on its own line. For example:

DECODE(
  flag,
  'M', 'MONTHLY',
  'D', 'DAILY',
  'W', 'WEEKLY',
  'OTHER'
)

This ROT is especially useful when you begin nesting functions. For example:


SUBSTR( my_str, DECODE( INSTR( my_str, ',' ), 0, LENGTH( my_str ), INSTR( my_str, ',' ) ) )


vs


SUBSTR(
  my_str,
  DECODE(
    INSTR( my_str, ',' ),
    0, LENGTH( my_str ),
    INSTR( my_str, ',' )
  )
)