Analytics: An Introduction
I think Tom Kyte is dead on when he says analytics are the best thing since "select" (i.e. since SQL was created). Analytics are tough to understand at first, but become hard to live without. What is an analytic? It is a way to query the results of your query.
Before playing, let's create some sample data to play with:
* Note: I would've used random data, but then the results would be different for different people.
Now we have 100 employees split unevenly among 12 departments with each department having a salary range.
Every query starts with a question, so what are some things we can ask of this data?
1) How many employees are there?
2) How many employees are there in each department?
3) What is the total salary of each department?
4) What is the salary range of each department?
Each of these questions can be answered with either a group by or with analytics. The difference is that group by will eliminate redundant details. Analytics will show the same information, but for each detail of the group.
1) How many employees are there?
Ok, no group by, sue me. But what if we wanted to get all the employees AND know how many there were?
Ok, so COUNT(*) OVER () tell me how many rows are in the table? No, not quite. Lets get all the employees in DEPT_ID = 9 to see what happens.
The analytic COUNT(*) OVER () returns the number of rows in the result set. Analytics take the basic form {Aggregate (e.g. count)} OVER ({window for this row}). By specifying nothing in the window area, we told Oracle to use EVERYTHING as the window. If this is new to you, I suggest playing around using aggregates you already know with the OVER () to get a feel for what its doing and how analytics perform. SUM, COUNT, AVG, MIN, MAX, are just a few aggregates you can use this way. Keep in mind that the analytic will be displayed for every row.
2) What is the total salary of each department?
Ok, so lets use our counting analytic and see what happens.
Darn, that's not at all what we wanted. Why is it 12 and for every row? Like I said earlier, analytics are a query against your result set. If you dropped the COUNT(*) OVER () portion, you'd get 12 rows. COUNT(*) OVER () yields the count of rows in the result set and is therefore 12. So how do we find out how many employees in each department? The partition clause.
That's sort of what we wanted, but why all the duplicates? Well, in order for the analytics to work, all the things we want to analyze have to be in the result set. If you wanted to get the same result set, you'd have to either distinct the results or query this query with a group by. I will show both methods below without result sets.
Using distinct:
At this point, I suggest playing around with using partition by and not using partition by. A good understanding of this can yield quick correct results and reduce procedural programming by a lot. Group by 1, 2, 3, or more columns against different tables in your system. Put a few count(*) OVER ( PARTITION BY {columns} ) in a single query to see how it performs. One tip for "seeing" what's going on is to order the query by the same things as the partition clause so you can easily perform the same calculations (if pheasible).
3) What is the salary of each department?
At this point, you should be able to answer this one on your own, but here are my "answers" anyway.
4) What is the salary range of each department?
This one should be answerable with the above information, but here are my answers.
This introduction only covers some of the capabilities of analytics. Analytics can also pick off values from previous or future rows, keep running statistics (usually totals), do moving windows, and more. I'll go over these uses in a future post.
A performance note about analytics
One performance "problem" I came across using analytics was that sometimes they were fast and other times I thought they were slow. I couldn't understand why until I started reducing the slow queries. Since analytics have to analyze the result set they are a part of, they often preclude the use of predicate filtering and indexes. For example:
Create a unique key on emp_id:
If you look at the explain plans, you'll see that query 1 does a unique index scan with table access by rowid. Query 2, on the other hand, does a full scan. Why? Because the analytic prevents the optimizer from pushing the EMP_ID = 3 predicate into the view. The view has to be run first and then EMP_ID = 3 can be looked up.
Before playing, let's create some sample data to play with:
CREATE TABLE EMP AS
SELECT
A.EMP_ID,
A.DEPT_ID,
ROUND(
( 800000 / COUNT(*) OVER ( PARTITION BY DEPT_ID ) ) +
( ( MOD( EMP_ID, 10 ) - 5 ) * 976.32 ),
2
) SALARY
FROM (
SELECT
ROWNUM EMP_ID,
MOD( ROWNUM, MOD( ROWNUM, 10 ) + 3 ) DEPT_ID
FROM ALL_OBJECTS ao
WHERE ROWNUM <= 100
) A
* Note: I would've used random data, but then the results would be different for different people.
Now we have 100 employees split unevenly among 12 departments with each department having a salary range.
Every query starts with a question, so what are some things we can ask of this data?
1) How many employees are there?
2) How many employees are there in each department?
3) What is the total salary of each department?
4) What is the salary range of each department?
Each of these questions can be answered with either a group by or with analytics. The difference is that group by will eliminate redundant details. Analytics will show the same information, but for each detail of the group.
1) How many employees are there?
SELECT COUNT(*) cnt
FROM EMP e
CNT
---
100
Ok, no group by, sue me. But what if we wanted to get all the employees AND know how many there were?
SELECT COUNT(*) OVER () cnt, e.*
FROM EMP e
CNT EMP_ID DEPT_ID SALARY
100 14 0 113309.39
100 30 0 109404.11
100 60 0 109404.11
100 84 0 113309.39
100 36 0 115262.03
... results snipped ...
Ok, so COUNT(*) OVER () tell me how many rows are in the table? No, not quite. Lets get all the employees in DEPT_ID = 9 to see what happens.
SELECT COUNT(*) OVER () cnt, e.*
FROM EMP e
WHERE e.DEPT_ID = 9
CNT EMP_ID DEPT_ID SALARY
2 9 9 403905.28
2 69 9 403905.28
The analytic COUNT(*) OVER () returns the number of rows in the result set. Analytics take the basic form {Aggregate (e.g. count)} OVER ({window for this row}). By specifying nothing in the window area, we told Oracle to use EVERYTHING as the window. If this is new to you, I suggest playing around using aggregates you already know with the OVER () to get a feel for what its doing and how analytics perform. SUM, COUNT, AVG, MIN, MAX, are just a few aggregates you can use this way. Keep in mind that the analytic will be displayed for every row.
2) What is the total salary of each department?
SELECT e.DEPT_ID, COUNT(*) cnt
FROM EMP e
GROUP BY e.DEPT_ID
DEPT_ID CNT
0 7
1 18
2 16
3 17
4 4
... results snipped ...
Ok, so lets use our counting analytic and see what happens.
SELECT e.DEPT_ID, COUNT(*) OVER () cnt
FROM EMP e
GROUP BY e.DEPT_ID
DEPT_ID CNT
0 12
1 12
2 12
3 12
4 12
... results snipped ...
Darn, that's not at all what we wanted. Why is it 12 and for every row? Like I said earlier, analytics are a query against your result set. If you dropped the COUNT(*) OVER () portion, you'd get 12 rows. COUNT(*) OVER () yields the count of rows in the result set and is therefore 12. So how do we find out how many employees in each department? The partition clause.
SELECT COUNT(*) OVER ( PARTITION BY e.DEPT_ID ) cnt, e.DEPT_ID
FROM EMP e
CNT DEPT_ID
7 0
7 0
7 0
7 0
7 0
... results snipped ...
That's sort of what we wanted, but why all the duplicates? Well, in order for the analytics to work, all the things we want to analyze have to be in the result set. If you wanted to get the same result set, you'd have to either distinct the results or query this query with a group by. I will show both methods below without result sets.
Using distinct:
SELECT DISTINCT COUNT(*) OVER ( PARTITION BY e.DEPT_ID ) cnt, e.DEPT_ID
FROM EMP e
Using query of a query w/group by:
SELECT A.CNT, A.DEPT_ID
FROM (
SELECT COUNT(*) OVER ( PARTITION BY e.DEPT_ID ) cnt, e.DEPT_ID
FROM EMP e
) A
GROUP BY A.CNT, A.DEPT_ID
At this point, I suggest playing around with using partition by and not using partition by. A good understanding of this can yield quick correct results and reduce procedural programming by a lot. Group by 1, 2, 3, or more columns against different tables in your system. Put a few count(*) OVER ( PARTITION BY {columns} ) in a single query to see how it performs. One tip for "seeing" what's going on is to order the query by the same things as the partition clause so you can easily perform the same calculations (if pheasible).
3) What is the salary of each department?
At this point, you should be able to answer this one on your own, but here are my "answers" anyway.
SELECT e.DEPT_ID, SUM( e.SALARY ) TOTAL_SALARY
FROM EMP e
GROUP BY e.DEPT_ID
DEPT_ID TOTAL_SALARY
0 787307.81
1 761923.44
2 758994.56
3 782426.18
4 801952.64
... results snipped ...
SELECT DISTINCT e.DEPT_ID, SUM( e.SALARY ) OVER ( PARTITION BY e.DEPT_ID ) TOTAL_SALARY
FROM EMP e
DEPT_ID TOTAL_SALARY
0 787307.81
1 761923.44
2 758994.56
3 782426.18
4 801952.64
... results snipped ...
4) What is the salary range of each department?
This one should be answerable with the above information, but here are my answers.
SELECT
e.DEPT_ID,
MIN( e.SALARY ) MIN_SALARY,
MAX( e.SALARY ) MAX_SALARY
FROM EMP e
GROUP BY e.DEPT_ID
DEPT_ID MIN_SALARY MAX_SALARY
0 109404.11 117214.67
1 39562.84 48349.72
2 45118.4 52928.96
3 43153.54 50964.1
4 199023.68 202928.96
... results snipped ...
SELECT
DISTINCT
e.DEPT_ID,
MIN( e.SALARY ) OVER ( PARTITION BY e.DEPT_ID ) MIN_SALARY,
MAX( e.SALARY ) OVER ( PARTITION BY e.DEPT_ID ) MAX_SALARY
FROM EMP e
DEPT_ID MIN_SALARY MAX_SALARY
0 109404.11 117214.67
1 39562.84 48349.72
2 45118.4 52928.96
3 43153.54 50964.1
4 199023.68 202928.96
... results snipped ...
This introduction only covers some of the capabilities of analytics. Analytics can also pick off values from previous or future rows, keep running statistics (usually totals), do moving windows, and more. I'll go over these uses in a future post.
A performance note about analytics
One performance "problem" I came across using analytics was that sometimes they were fast and other times I thought they were slow. I couldn't understand why until I started reducing the slow queries. Since analytics have to analyze the result set they are a part of, they often preclude the use of predicate filtering and indexes. For example:
Create a unique key on emp_id:
ALTER TABLE EMP ADD ( PRIMARY KEY( EMP_ID ) );
{Gather stats on emp with the cascade option}
Create a view of emp with an analytic:
CREATE OR REPLACE VIEW EMP_WITH_COUNT AS
SELECT COUNT(*) OVER () cnt, e.*
FROM EMP e;
Query1:
SELECT e.*
FROM EMP e
WHERE EMP_ID = 3
Query2:
SELECT ewc.*
FROM EMP_WITH_COUNT ewc
WHERE ewc.EMP_ID = 3
If you look at the explain plans, you'll see that query 1 does a unique index scan with table access by rowid. Query 2, on the other hand, does a full scan. Why? Because the analytic prevents the optimizer from pushing the EMP_ID = 3 predicate into the view. The view has to be run first and then EMP_ID = 3 can be looked up.
1 Comments:
THANK you for this. I recently had to use an analytic function but had no idea of how it worked (just found an example of what I needed to do and copy-and-pasted).
The best part of this was:
"What is an analytic? It is a way to query the results of your query"
and
"Analytics take the basic form {Aggregate (e.g. count)} OVER ({window for this row}). "
those two simple lines explained virtually everything I needed to know (at least to ask an intelligent question)
Thanks Much!!
-joel
Post a Comment
<< Home