Database Application Developer and DBA
ASK TOM BY TOM K Y TE
On History, Invisibility,
and Moving
Our technologist counts on histograms, proves
hash tables by example, and moves CLOBs.
ORACLE DATABASE
I have a question about using histograms with character data. I’m trying to improve
performance on a set of very large tables. My
table T1 has more than 120 million rows and
is not partitioned, and it is representative
of the other tables in my schema. The data
is such that now a couple of key varchar2
columns have very skewed data.
I’ve gathered histograms on this information and have been pleased with the results.
The optimizer is better able to gauge the estimated cardinalities during optimization and
has been coming up with better plans.
I’m concerned about the future, however.
I’ve heard that histograms gathered
against varchar2 columns have limitations.
Specifically, I’ve heard that only the first few
characters are considered when the histograms are being generated, so that as my var-
char2 data gets longer and longer, I might see
a decrease in the efficiency of my histograms.
Is this true, and what are the limits?
Before I address the question, let’s first
define what a histogram is. A histogram—in
the context of statistics—contains informa-
tion that gives the optimizer a clear picture
of the data in a column. In many cases, the
histogram on a column will tell the optimizer
precisely how many rows in the table have
a certain value, so that when the optimizer
parses a query in the form “where column_
with_histogram = ‘some value’,” the opti-
mizer will be able to very accurately estimate
how many rows that predicate will return.
When the column with a histogram generated
against it has fewer than 255 distinct values,
the histogram will contain a complete picture
of the values in the column. When the column
has more than 254 distinct values, the histo-
gram will become less accurate but will still
paint a good picture of the data in the column.
This 255-distinct-value limit is because the
database doesn’t store more than 254 rows of
information for a single histogram.
SQL> create table t
2 as
3 select
4 rpad( 'x', 30, 'x' ) ||
5 chr( ascii('A')+
6 case when rownum < 20000
7 then 0
8 else mod(rownum, 12)
9 end ) len30,
10 rpad( 'x', 31, 'x' ) ||
11 chr( ascii('A')+
12 case when rownum < 20000
13 then 0
14 else mod(rownum, 12)
15 end ) len31,
16 rpad( 'x', 32, 'x' ) ||
17 chr( ascii('A')+
18 case when rownum < 20000
19 then 0
20 else mod(rownum, 12)
21 end ) len32,
22 rpad( 'x', 33, 'x' ) ||
23 chr( ascii('A')+
24 case when rownum < 20000
25 then 0
26 else mod(rownum, 12)
27 end ) len33
28 from all_objects
29 /
Table created.
As I’ve generated the data, I’ve also
introduced some skew to it. If you look at
the function for assigning the last letter—
after all of the ‘x’s—you’ll see that the first
20,000 rows (out of about 72,000) will have
the letter ‘A’ assigned to them. After the first
20,000 rows, I’ll get an even distribution of
A–L values. That means that strings that end
in ‘A’ happen often and that strings that end
in B–L do not happen often.
Next I’ll gather statistics and ensure that
I have histograms that provide a perfect
picture of the data:
SQL> begin
2 dbms_stats.gather_table_stats
3 ( user, 'T',
4 method_opt =>
'for all columns size 12',
5 estimate_percent=> 100
6 );
7 end;
8 /
PL/SQL procedure successfully completed.