data warehouse, with many systems “
under-configured” and therefore delivering performance reduced by orders of magnitude.
When you remove I/O from the equation,
the game changes entirely. The processes
you had in place in an I/O-constrained
system do not work effectively in a system
without I/O constraints. Not that those processes were efficient in the I/O-constrained
system either! They were suboptimal there
as well, but given the poor I/O performance,
it was not as noticeable.
What does this mean to you? What do
you need to unlearn for Oracle Exadata? The
first thing to look at is your load and query
process. On Ask Tom I frequently see people
asking how to speed up their load, claiming
they cannot load 5,000,000 new records
in their multihour processing window.
When they say it takes hours to load millions of rows, I know something is seriously
wrong, because I can load millions of rows in
minutes—on my laptop. It is all in how you
approach the load.
Most people adopt what I’ll call a slow-by-slow (row-by-row) approach. They load and
validate a single row millions of times. What
they need to do is load and validate millions of rows once. There is a big difference
between those two approaches, because
they scale entirely differently.
Let’s look at the numbers: If you wanted to
load 5,000,000 rows and it took a slow-by-slow process 1 millisecond (1/1000 second)
to process each row during a validation/
transformation phase, you would be looking
at 5,000 seconds—or almost 1. 5 hours. Now,
1 millisecond is probably being very generous,
because there would be network round-trips
for each row, database CPU, client CPU, and
reading of input files to be loaded and other
steps involved in the process. You can easily
see how a row-by-row approach to loading
5,000,000 rows will quickly become your
bottleneck as the row counts go up and up.
When you take a big number (millions or billions) and multiply it by a little number (the
amount of processing time per row) you end
up with . . . still a big number!
If, on the other hand, you bulk-load
your data, using direct path operations and
parallel query to validate and transform the
data, you can do in seconds or minutes what
Code Listing 2: PL/SQL pipelined function solution
SQL> create or replace function foo( p_cursor in sys_refcursor,
p_threshold in number )
return my TableType
2 pipelined
3 as
4 type array is table of t%rowtype index by binary_integer;
5
6 l_data array;
7 l_running_total number := 0;
8 l_group number := 1;
9 n number := 100;
10 begin
11 loop
12 fetch p_cursor bulk collect into l_data limit N;
13 for i in 1 .. l_data.count
14 loop
15 l_running_total := l_running_total + l_data(i).cnt;
16 if ( l_running_total > p_threshold )
17 then
18 l_group := l_group + 1;
19 l_running_total := l_data(i).cnt;
20 end if;
21 pipe row( myScalarType( l_data(i).study_site,
l_data(i).cnt, l_group ));
22 end loop;
23 exit when p_cursor%notfound;
24 end loop;
25 close p_cursor;
26 return;
27 end;
28 /
would otherwise take hours. If you don’t
believe me, watch a series of videos available
on You Tube. They were developed by a team
including Graham Wood, father of Statspack
and Automatic Workload Repository, and
Andrew Holdsworth, senior director of
Oracle’s Real World Performance group,
and delivered by Holdsworth. They show
how you would approach loading 1 TB of
data ( 8 billion rows) into an Oracle Exadata
Database Machine, gathering statistics, and
validating all of it in about 20 minutes—
yes, 20 minutes. This can be achieved only
with parallel direct path operations. The
videos are parts 1, 2, and 3 of Migrate a 1 TB
Datawarehouse in 20 Minutes:
•;
bit.ly/vq2A1I
•;
bit.ly/vV1Qu T
•;
bit.ly/rUXY69
Once you’ve decided to use large, bulk,
direct path loads, choosing a tool is important
as well. Historically, many of us have used
SQLLDR as our tool of choice for performing
large loads. That worked well in the 1900s,
but it isn’t the right tool for the twenty-first
century. A new century demands new tools,
because the scale of our loading problems has
increased by many orders of magnitude.