Access Path Selection in a Relational Database Management System

Translator's Foreword


Although the article whose Russian translation is offered here has five authors, all of them distinguished and respected people in the database world, it is strongly associated with the name of Patricia Selinger, because she devised and first implemented the approach it describes. The article was written at the closing stage of the System R project, an experimental relational DBMS at IBM, on the basis of which the well-known commercial product IBM SQL/DS and the far better known DBMS DB2 were later developed. On the site citforum.ru you can read an interview with Patricia Selinger, my review of the literature on System R, and also extremely interesting material about a reunion of the System R project participants 25 years after the project began. Among other things, all of these publications contain interesting facts about how this article came to be written.


Pat Selinger's article is most likely one of the most frequently cited in the entire boundless body of database literature. This is certainly not because of any outstanding literary merits of the authors. On the contrary, the article is written extremely modestly and simply. Its outstanding role in the modern history of database technology is that it laid the foundation of the approach to query processing in relational database systems that the developers of all modern DBMSs follow.


The motivation and essence of the approach are as follows. When queries to a database are formulated in a high-level declarative language such as SQL, the user tells the system what kind of data is of interest but says nothing about how the system should retrieve that data. The system must itself find access paths to the data, that is, formulate a procedural plan for executing the query whose execution yields the desired results. As a rule, many such plans can be constructed for a single SQL query, and the problem is to choose from this set the one plan that produces the results most efficiently. Such a choice clearly needs a criterion, and Patricia Selinger proposed using a numerical estimate of the cost of executing the query under a given plan. The article describes a precise method for obtaining such estimates and gives a number of the cost formulas used in computing the overall estimate.


Pat Selinger's approach rested on several simplifying assumptions that do not always hold for real databases. In dozens, perhaps hundreds, of articles that appeared in the following years, attempts (often successful) were made to relax and refine these assumptions. This research led, in particular, to a separate line of database research: the collection, processing, and approximation of data statistics. The efficient operation of modern relational DBMSs thus became possible through the use, development, and refinement of the ideas and methods of Patricia Selinger and her colleagues.


I am proud that this article appears in Russian for the first time in my translation. I hope this publication will please those who, like me, admired this work 25 years ago, and will be useful to people who are only beginning to get acquainted with database technology.

Abstract


In the high-level query and data manipulation language SQL, requests are stated non-procedurally, without reference to access paths. This article describes how System R chooses access paths for both simple (single relation) and complex queries (such as joins), given a user specification of the desired data as a boolean expression of predicates. System R is an experimental database management system developed to carry out research on the relational model of data. System R was designed and built by members of the IBM San Jose Research Laboratory.

1. Introduction


System R is an experimental database management system based on the relational model of data that has been under development at the IBM San Jose Research Laboratory since 1975 [1]. The software was developed as a vehicle for research in relational databases and is not available outside the IBM Research Division.


The reader is assumed to be familiar with the terminology of the relational data model as described by Codd [7] and Date [8]. The user interface of System R is the unified query, data definition, and data manipulation language SQL [5]. SQL statements can be issued both from an online interface oriented toward the casual user and from programming languages such as PL/I and COBOL.


In System R the user need not know how tuples are physically stored or what access paths are available (for example, which columns have indexes). SQL statements do not require the user to specify anything about the access path to be used for retrieving data. Nor does the user specify the order in which joins are to be performed. The System R optimizer chooses both the join order and an access path for each table mentioned in the SQL statement. Of the many possible choices, the optimizer chooses the one that minimizes the "total access cost" of performing the entire statement.


This article addresses the issues of access path selection for queries. Retrieval for data manipulation operations (UPDATE, DELETE) is treated similarly. Section 2 describes the place of the optimizer in the processing of an SQL statement, and section 3 describes the storage component access paths that are available on a single physically stored table. Section 4 introduces the optimizer cost formulas for single-table queries, and section 5 discusses joins of two or more tables and their corresponding costs. Section 6 covers nested queries (queries in predicates).

2. Processing of an SQL statement


An SQL statement is subjected to four phases of processing. Depending on the origin and contents of the statement, these phases may be separated by arbitrary intervals of time. In System R these arbitrary time intervals are transparent to the system components that process an SQL statement. These mechanisms, and the processing of SQL statements arriving both from programs and from terminals, are discussed in more detail in [2]. Here we give only an overview of those processing steps that are relevant to access path selection.


The four phases of statement processing are parsing, optimization, code generation, and execution. Each SQL statement is sent to the parser, which checks it for correct syntax. A query block is represented by a SELECT list, a FROM list, and a WHERE tree, containing, respectively, the list of items to be retrieved, the list of tables from which they are to be retrieved, and the boolean combination of simple predicates specified by the user. A single SQL statement may contain many query blocks, because one operand of a predicate may itself be a query.


If the parser detects no errors, the OPTIMIZER component is called. The OPTIMIZER accumulates the names of the tables and columns referenced in the query and looks them up in the System R catalogs to verify that they exist and to retrieve information about them.


The catalog lookup portion of the OPTIMIZER also obtains statistics about the referenced relations and the access paths available on each of them. These are used later in access path selection. After catalog lookup has obtained the data type and length of each column, the OPTIMIZER rescans the SELECT list and the WHERE tree to check for semantic errors and for type compatibility in both expressions and predicate comparisons.


Finally, the OPTIMIZER performs access path selection. It first determines the evaluation order among the query blocks in the statement. Then, for each query block, the relations in the FROM list are processed. If a block contains more than one relation, permutations of the join order and of the join methods are evaluated. The access plan that minimizes the total cost of evaluating the block is chosen from a tree of alternative paths. The result is an execution plan expressed in ASL (Access Specification Language) [10].


After a plan has been chosen for each query block and represented in the parse tree, the CODE GENERATOR is called. The CODE GENERATOR is a table-driven program that translates ASL trees into machine code that executes the plan chosen by the OPTIMIZER. To do this it uses a relatively small number of code templates, one for each type of join method (including no join). Query blocks for nested queries are treated as "subroutines" that return values to the predicates in which they occur. The CODE GENERATOR is described in more detail in [9].


During code generation, the parse tree is replaced by executable machine code and its associated data structures. Either control is transferred to this code immediately or the code is stored in the database for later execution, depending on the origin of the statement (program or terminal). In either case, when the code is ultimately executed, it calls upon the System R internal storage system (RSS) through the storage system interface (RSI) to scan each of the physically stored relations referenced in the query. These scans follow the access paths chosen by the OPTIMIZER. The RSI commands that may be used by generated code are described in the next section.

3. The Research Storage System

The Research Storage System (RSS) is the storage subsystem of System R. It is responsible for maintaining the physical storage of relations, the access paths on these relations, locking (in a multi-user environment), and logging and recovery facilities. The RSS presents a tuple-oriented interface (RSI) to its users. Although the RSS may be used independently of System R, we are concerned here with its use for executing the code generated by the processing of SQL statements in System R, as described in the previous section. For a complete description of the RSS, see [1].


Relations are stored in the RSS as collections of tuples whose columns are physically contiguous. These tuples are stored on 4 KB pages; each tuple fits entirely on one page. Pages are organized into logical units called segments. A segment may contain more than one relation, but each relation is contained entirely in one segment. Tuples from two or more relations may occur on the same page. Each tuple is tagged with the identifier of the relation to which it belongs.


The primary way of accessing the tuples of a relation is through an RSS scan. A scan returns one tuple per call along a given access path. The principal scan commands are OPEN, NEXT, and CLOSE.


Two types of scans are currently available to SQL statements. The first type is a segment scan, which finds all the tuples of a given relation. A series of NEXT commands on a segment scan simply examines all the pages of the segment that contain tuples of any relation and returns those tuples that belong to the given relation.


The second type of scan is an index scan. A System R user may create an index on one or more columns of a relation, and a relation may have any number of indexes (including zero). These indexes are stored on pages separate from those containing the relation's tuples. Indexes are implemented as B-trees [3] whose leaves are pages containing sets of (key, identifiers of the tuples that contain that key). A series of NEXT commands on an index scan therefore reads sequentially along the leaf pages of the index, obtaining the tuple identifiers that match a key and using them to find the data tuples and return them to the user in key value order. The index leaf pages are chained together so that NEXT commands need not reference any upper-level pages of the index.


In a segment scan, all the non-empty pages of the segment are touched, regardless of whether they contain tuples of the desired relation. However, each page is touched only once. When an entire relation is examined through an index scan, each page of the index is touched only once, but a data page may be examined more than once if it holds two tuples that are not "close" in the index ordering. If tuples are inserted into the segment pages in index order, and if this physical proximity corresponding to index key values is maintained, we say that the index is clustered. A clustered index has the property that not only each index page, but also each data page containing a tuple of the relation, is touched only once in a scan on that index.


An index scan need not cover the entire relation. Starting and stopping key values may be specified in order to scan only those tuples whose key lies in a range of index values. Both index and segment scans may optionally take a set of predicates, called search arguments (SARGS), which are applied to a tuple before it is returned to the RSI caller. If the tuple satisfies the predicates, it is returned; otherwise the scan continues until it either finds a tuple that satisfies the SARGS or exhausts the segment or the specified range of index values. This reduces cost by eliminating the overhead of RSI calls for tuples that can be efficiently rejected within the RSS. Not all predicates have a form that can become SARGS. A sargable predicate is one of the form (or one that can be put into the form) "column comparison-operator value". SARGS are expressed as boolean expressions of such predicates in disjunctive normal form.

4. Costs for single relation access paths
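The filtering of tuples by search arguments can be illustrated with a minimal sketch. It assumes that tuples are Python dictionaries and that SARGS are given as a list of conjunctions, each a list of (column, operator, value) triples; the names `OPS`, `satisfies`, and `scan` are hypothetical and are not part of the RSI.

```python
import operator

# Comparison operators allowed in a sargable predicate (illustrative set).
OPS = {
    "=": operator.eq, "<>": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
}


def satisfies(tup, sargs):
    """True if the tuple satisfies SARGS given in disjunctive normal form.

    sargs is a list of conjunctions; each conjunction is a list of
    (column, operator, value) triples.
    """
    return any(
        all(OPS[op](tup[col], value) for col, op, value in conjunction)
        for conjunction in sargs
    )


def scan(tuples, sargs):
    """Return only tuples passing the SARGS, as a scan would to its caller."""
    for tup in tuples:
        if satisfies(tup, sargs):
            yield tup


emp = [
    {"NAME": "SMITH", "SALARY": 20000},
    {"NAME": "JONES", "SALARY": 15000},
]
# (SALARY = 20000) OR (NAME = 'JONES' AND SALARY < 10000)
sargs = [
    [("SALARY", "=", 20000)],
    [("NAME", "=", "JONES"), ("SALARY", "<", 10000)],
]
selected = list(scan(emp, sargs))
```

Rejected tuples never reach the caller, which is the saving in RSI calls that the text describes.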

In the next several sections we describe the process of choosing a plan for evaluating a query. We first describe the simplest case, access to a single relation, and then show how this approach extends and generalizes to joins of two relations, joins of n relations, and finally queries with several query blocks (nested queries).


The OPTIMIZER examines both the predicates in the query and the access paths available on the relations referenced by the query, and formulates a cost prediction for each access plan using the following cost formula:

COST = PAGE FETCHES + W * (RSI CALLS)


This cost is a weighted measure of I/O (pages fetched) and CPU utilization (instructions executed). W is an adjustable weighting factor between I/O and CPU. RSI CALLS is the predicted number of tuples returned from the RSS. Since System R spends most of its CPU time in the RSS, the number of RSI calls is a good approximation of CPU utilization. Thus, in choosing the minimum-cost path for processing a query, the OPTIMIZER attempts to minimize the total resources required.
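As a small illustration of the formula, the sketch below compares two candidate access paths. The function name `plan_cost` and all of the numbers are hypothetical; they are chosen only to show how W trades page fetches against RSI calls.

```python
def plan_cost(page_fetches, rsi_calls, w):
    """COST = PAGE FETCHES + W * (RSI CALLS)."""
    return page_fetches + w * rsi_calls


# Hypothetical estimates for two access paths on the same relation.
W = 0.5
candidates = {
    "segment scan": plan_cost(page_fetches=500, rsi_calls=1000, w=W),
    "index scan": plan_cost(page_fetches=60, rsi_calls=1000, w=W),
}
cheapest = min(candidates, key=candidates.get)
```

Both paths return the same number of tuples here, so the CPU term is identical and the choice is decided by the I/O term alone.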


While checking type compatibility and semantic correctness of the query, the OPTIMIZER examines the WHERE tree of predicates of each query block. The WHERE tree is considered to be in conjunctive normal form, and every conjunct is called a boolean factor. Boolean factors are important because every tuple returned to the user must satisfy every boolean factor. An index is said to match a boolean factor if the boolean factor is a sargable predicate whose referenced column is the index key; for example, an index on SALARY matches the predicate 'SALARY = 20000'. More precisely, we say that a predicate or set of predicates matches an index access path when the predicates are sargable and the columns mentioned in the predicates are an initial substring of the set of columns of the index key. For example, an index on NAME, LOCATION matches NAME = 'SMITH' AND LOCATION = 'SAN JOSE'. If an index matches a boolean factor, access using that index is an efficient way to satisfy the boolean factor. Sargable boolean factors can also be satisfied efficiently if they are expressed as search arguments. Note that a boolean factor may be an entire tree of predicates connected by OR.
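The "initial substring" rule for matching predicates to an index can be sketched as a prefix test. This is an illustrative reading of the definition above, assuming the predicates have already been found sargable; the function name `index_matches` is hypothetical.

```python
def index_matches(index_key, predicate_columns):
    """True if the predicate columns form an initial substring of the index key.

    index_key:         ordered list of the columns of the index key
    predicate_columns: columns referenced by sargable predicates
    """
    columns = set(predicate_columns)
    if not columns:
        return False
    # The leading len(columns) key columns must be exactly the predicate columns.
    return set(index_key[:len(columns)]) == columns


name_location = ["NAME", "LOCATION"]
both = index_matches(name_location, ["NAME", "LOCATION"])   # matches
leading = index_matches(name_location, ["NAME"])            # matches
trailing = index_matches(name_location, ["LOCATION"])       # does not match
```

A predicate on LOCATION alone does not match the NAME, LOCATION index because LOCATION is not the leading key column.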


During catalog lookup, the OPTIMIZER retrieves statistics on the relations in the query and on the access paths available on each relation. The following statistics are kept.


For each relation T:


* NCARD(T) - the cardinality of relation T;

* TCARD(T) - the number of pages in the segment that hold tuples of relation T;

* P(T) - the fraction of data pages in the segment that hold tuples of relation T:

  P(T) = TCARD(T) / (number of non-empty pages in the segment).


For each index I on relation T:


* ICARD(I) - the number of distinct keys in index I;

* NINDX(I) - the number of pages in index I.


These statistics are maintained in the System R catalogs and come from several sources. The statistics are initialized when a relation is loaded and when an index is created. They are then updated periodically by the UPDATE STATISTICS command, which can be run by any user. System R does not update these statistics on every INSERT, DELETE, or UPDATE statement, because this would require extra database operations and would turn the system catalogs into a locking bottleneck. Dynamic updating of statistics would tend to serialize the operations that modify the contents of relations.


Using these statistics, the OPTIMIZER assigns a selectivity factor 'F' to each boolean factor in the predicate list. This selectivity factor corresponds very roughly to the expected fraction of tuples that will satisfy the predicate. Table 1 gives the selectivity factors for different kinds of predicates. We assume that a lack of statistics implies that the relation is small, so an arbitrary factor is chosen.


column = value

    F = 1 / ICARD(column index) if there is an index on column.
    This assumes an even distribution of tuples among the index key values.
    F = 1/10 otherwise.


column1 = column2

    F = 1 / MAX(ICARD(column1 index), ICARD(column2 index)) if there are indexes on both column1 and column2.
    This assumes that each key value in the index with the smaller cardinality has a matching value in the other index.
    F = 1 / ICARD(column-i index) if there is an index only on column-i.
    F = 1/10 otherwise.


column > value (or any other open-ended comparison)

    F = (high key value - value) / (high key value - low key value)
    F is obtained by linear interpolation of the value within the range of key values, if the column is of an arithmetic type and the value is known at access path selection time.
    F = 1/3 otherwise (for example, if the column is not of an arithmetic type).
    This last number has no special significance, apart from the fact that it should be less selective than the guess for equality predicates for which there are no indexes, and that it is less than 1/2. We hypothesize that few queries use predicates that are satisfied by more than half of the tuples.


column BETWEEN value1 AND value2

    F = (value2 - value1) / (high key value - low key value)
    The selectivity factor is the ratio of the size of the BETWEEN value range to the size of the entire key value range, if the column is of an arithmetic type and both value1 and value2 are known at access path selection time.
    F = 1/4 otherwise.
    Again, there is no significance to this choice except that it lies between the default selectivity factors for an equality predicate and for an open-ended comparison predicate.


column IN (list of values)

    F = (number of items in the list) * (selectivity factor for column = value)
    This value is allowed to be no more than 1/2.


columnA IN subquery

    F = (expected cardinality of the subquery result) / (product of the cardinalities of all the relations in the subquery's FROM list)
    The computation of query cardinality is discussed later.
    This formula is derived by the following argument.
    Consider the simplest case, where the subquery is of the form "SELECT columnB FROM relationC". Assume that the set of all columnB values in relationC contains the set of all columnA values. If the subquery selects all the tuples of relationC, the predicate is always true and F = 1. If the tuples of the subquery are restricted by a selectivity factor F', assume that the set of unique values in the subquery result that match columnA values is restricted in the same proportion, i.e. the selectivity factor of the predicate should be F'. F' is the product of all the selectivity factors of the subquery, namely (subquery cardinality) / (cardinality of all possible subquery answers). With a little optimism, we can extend this reasoning to subqueries that are joins and to subqueries in which columnB is replaced by an arithmetic expression involving column names. This leads to the formula given above.


(pred expression1) OR (pred expression2)

    F = F(pred1) + F(pred2) - F(pred1) * F(pred2)


(pred1) AND (pred2)

    F = F(pred1) * F(pred2)
    Note that this assumes that the column values are independent.


NOT pred

    F = 1 - F(pred)


Table 1. Selectivity factors
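Several rows of Table 1 can be written down directly as functions. The sketch below is illustrative only: the function names are hypothetical, a missing index is represented by passing `None` for the index cardinality, and the subquery row is omitted.

```python
def f_equal(icard=None):
    """column = value: 1/ICARD if the column has an index, else 1/10."""
    return 1.0 / icard if icard else 1.0 / 10


def f_greater(value, low_key, high_key, interpolable=True):
    """column > value: linear interpolation in the key range, else 1/3."""
    if interpolable and high_key > low_key:
        return (high_key - value) / (high_key - low_key)
    return 1.0 / 3


def f_between(value1, value2, low_key, high_key, interpolable=True):
    """column BETWEEN value1 AND value2: share of the key range, else 1/4."""
    if interpolable and high_key > low_key:
        return (value2 - value1) / (high_key - low_key)
    return 1.0 / 4


def f_in_list(n_items, icard=None):
    """column IN (list): n * F(column = value), allowed to be at most 1/2."""
    return min(n_items * f_equal(icard), 0.5)


def f_or(f1, f2):
    return f1 + f2 - f1 * f2


def f_and(f1, f2):
    return f1 * f2  # assumes the column values are independent


def f_not(f):
    return 1 - f
```

For example, with a key range from 0 to 100, `f_greater(75, 0, 100)` gives 0.25, and `f_in_list(10, icard=4)` is capped at 0.5.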


Query cardinality (QCARD) is the product of the cardinalities of all the relations in the query block's FROM list times the product of the selectivity factors of all the boolean factors of that query block. The number of expected RSI calls (RSICARD) is the product of the relation cardinalities times the selectivity factors of the sargable boolean factors, since the sargable boolean factors are placed in search arguments, which filter out tuples before they are returned across the RSS interface.
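The two quantities differ only in which boolean factors contribute. A minimal sketch, assuming each boolean factor is given as a (selectivity, is_sargable) pair; the function names and the numbers are hypothetical.

```python
from math import prod


def qcard(cardinalities, factors):
    """Product of relation cardinalities times ALL selectivity factors."""
    return prod(cardinalities) * prod(f for f, _ in factors)


def rsicard(cardinalities, factors):
    """Product of relation cardinalities times the SARGABLE factors only."""
    return prod(cardinalities) * prod(f for f, sargable in factors if sargable)


# One relation of 1000 tuples; one sargable and one non-sargable factor.
factors = [(0.5, True), (0.25, False)]
expected_result_size = qcard([1000], factors)
expected_rsi_calls = rsicard([1000], factors)
```

Here 500 tuples are expected to cross the RSI, and the non-sargable factor then reduces them to an expected 125 result tuples.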


Choosing an optimal access path for a single relation consists of using these selectivity factors in formulas together with the statistics on the available access paths. Before describing this process, we need a definition. Using an index access path or sorting tuples produces tuples ordered by the index key or the sort key. We say that a tuple order is an interesting order if it is one specified by the GROUP BY or ORDER BY clauses of the query block.


For single relations, the cheapest access path is found by evaluating the cost of each available access path (each index on the relation, plus a segment scan). The costs are described below. For each such access path, a predicted cost is computed along with the ordering of the tuples it produces. For example, scanning along the SALARY index in ascending order produces some cost C and a tuple order of SALARY (ascending). To find the cheapest access plan for a single-relation query, we need only examine the cheapest access path that produces tuples in each "interesting" order and the cheapest "unordered" access path. Note that an "unordered" access path may in fact produce tuples in some order, but that order is not "interesting". If the query has neither a GROUP BY nor an ORDER BY clause, there are no interesting orders, and the single cheapest access path is chosen. If there are GROUP BY or ORDER BY clauses, the cost of the access path that produces the interesting order is compared with the cost of the cheapest unordered path plus the cost of sorting QCARD tuples into the proper order. The cheapest of these alternatives is chosen as the plan for the query block.
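The comparison described above can be sketched as follows. The path costs and the sort cost are hypothetical inputs (the text gives no formula for the sort cost at this point), and the names `AccessPath` and `choose_plan` are illustrative only.

```python
from typing import NamedTuple, Optional


class AccessPath(NamedTuple):
    name: str
    cost: float
    order: Optional[str]  # column the output is ordered on, or None


def choose_plan(paths, interesting_order, sort_cost):
    """Pick the cheapest plan for a single-relation query block.

    interesting_order: column required by GROUP BY / ORDER BY, or None
    sort_cost:         assumed cost of sorting QCARD tuples into that order
    """
    cheapest = min(paths, key=lambda p: p.cost)
    if interesting_order is None:
        return cheapest.name, cheapest.cost
    # Alternative 1: cheapest path regardless of order, followed by a sort.
    options = [(cheapest.name + " + sort", cheapest.cost + sort_cost)]
    # Alternative 2: cheapest path that already delivers the interesting order.
    ordered = [p for p in paths if p.order == interesting_order]
    if ordered:
        best = min(ordered, key=lambda p: p.cost)
        options.append((best.name, best.cost))
    return min(options, key=lambda o: o[1])


paths = [
    AccessPath("segment scan", 100, None),
    AccessPath("SALARY index", 180, "SALARY"),
]
expensive_sort = choose_plan(paths, "SALARY", sort_cost=120)
cheap_sort = choose_plan(paths, "SALARY", sort_cost=50)
no_order = choose_plan(paths, None, sort_cost=50)
```

With an expensive sort the ordered index path wins even though it is not the cheapest scan; with a cheap sort the segment scan followed by a sort wins.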


The cost formulas for single-relation access paths are given in Table 2. These formulas give the number of index pages fetched plus the number of data pages fetched plus the weighting factor times the number of RSI calls made to retrieve tuples. W is the weighting factor between page fetches and RSI calls. In some situations several alternative formulas are given, depending on whether the entire set of retrieved tuples fits in the RSS buffer pool (or in the part of the pool allotted to the user). We assume that for clustered indexes a page remains in the buffer until every tuple has been retrieved from it. For non-clustered indexes, it is assumed that relations that do not fit in the buffer are large enough relative to the buffer size that every tuple retrieval requires a page fetch.



5. Access path selection for joins


In 1976, Blasgen and Eswaran [4] examined a number of methods for performing joins of two relations. The performance of each of these methods was analyzed for various relation cardinalities. Their evidence indicates that, for all but very small relations, one of two join methods is always optimal or near optimal. The System R optimizer chooses between these two methods. We first describe the two methods and then discuss how they are extended to joins of n relations. Finally, we specify how the join order (the order in which the relations are joined) is chosen. In a join of two relations, we distinguish the outer relation, from which a tuple is retrieved first, and the inner relation, from which tuples are retrieved, possibly depending on the values obtained in the outer relation tuple. A predicate that relates columns of the two relations being joined is called a join predicate. The columns referenced in a join predicate are called join columns.


The first join method, called the nested loops method, uses scans, in any order, on the outer and inner relations. The scan on the outer relation is opened and the first tuple is retrieved. For each outer relation tuple obtained, a scan is opened on the inner relation to retrieve, one per RSI call, all the tuples of the inner relation that satisfy the join predicate. The composite tuples formed by the outer-relation-tuple / inner-relation-tuple pairs make up the result of this join.
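The nested loops method can be sketched in a few lines. The sketch assumes relations are in-memory lists of Python tuples and that the join predicate is an arbitrary function of an outer and an inner tuple; real scans would go through the RSI instead, and the relation contents are invented for illustration.

```python
def nested_loops_join(outer, inner, join_predicate):
    """Yield composite tuples for every outer/inner pair that satisfies the predicate."""
    for outer_tuple in outer:          # scan on the outer relation
        for inner_tuple in inner:      # a new inner scan for each outer tuple
            if join_predicate(outer_tuple, inner_tuple):
                yield outer_tuple + inner_tuple


# Hypothetical relations: EMP(NAME, DNO) and DEPT(DNO, DNAME).
emp = [("SMITH", 10), ("JONES", 20)]
dept = [(10, "MFG"), (20, "SALES"), (30, "LEGAL")]
joined = list(nested_loops_join(emp, dept, lambda e, d: e[1] == d[0]))
```

Because the predicate is an arbitrary function, this method is not restricted to equality comparisons.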


The second join method, called merging scans, requires the outer and inner relations to be scanned in join column order. This implies that, along with the columns mentioned in ORDER BY and GROUP BY clauses, the columns of equi-join predicates (those of the form Table1.column1 = Table2.column2) also define "interesting" orders. If there is more than one join predicate, one of them is used as the join predicate and the others are treated as ordinary predicates. The merging scans method is applied only to equi-joins, although in principle it could be applied to other types of joins. If one or both of the relations to be joined have no index on the join column, they must be sorted into a temporary list ordered by the join column.


The merging scans method takes advantage of the ordering on the join column to avoid rescanning the entire inner relation (looking for a match) for each tuple of the outer relation. It does this by synchronizing the inner and outer scans by reference to matching join column values and by "remembering" where matching join groups are located. Further savings are possible if the inner relation is clustered on the join column (as it is, for example, when it is the output of a sort on the join column). "Clustering" on a column means that tuples with the same value in that column are physically stored close to each other, so that one page access retrieves several tuples.
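A minimal sketch of the merging scans idea follows, assuming both inputs are in-memory lists already sorted on the join column and that the join is an equi-join. The function name, the key-extractor arguments, and the sample data are hypothetical. The variable `group_start` plays the role of "remembering" where the current matching group begins.

```python
def merging_scans_join(outer, inner, outer_key, inner_key):
    """Equi-join two lists that are both sorted on their join column."""
    result = []
    group_start = 0  # remembered position of the current matching inner group
    for outer_tuple in outer:
        key = outer_key(outer_tuple)
        # Advance the inner scan past join values smaller than the outer one.
        while group_start < len(inner) and inner_key(inner[group_start]) < key:
            group_start += 1
        # Emit the matching group; group_start is kept, so the next outer
        # tuple with the same key rereads only this group, not the relation.
        pos = group_start
        while pos < len(inner) and inner_key(inner[pos]) == key:
            result.append(outer_tuple + inner[pos])
            pos += 1
    return result


outer = [("A", 1), ("B", 2), ("C", 2), ("D", 4)]
inner = [(1, "x"), (2, "y"), (2, "z"), (3, "w")]
merged = merging_scans_join(outer, inner, lambda o: o[1], lambda i: i[0])
```

Each inner tuple is passed over at most once while advancing, and is reread only when several outer tuples share its join value.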


A join of n relations can be viewed as a sequence of joins of two relations. In this view, two relations are joined, the resulting composite relation is joined with the third relation, and so on. At each step of the n-way join it is possible to identify the outer relation (which is in general composite) and the inner relation (the relation being added to the join). Thus the methods described above generalize easily to joins of n relations. It should be emphasized, however, that the first two-way join does not have to be completed before the second two-way join starts. As soon as we obtain a composite tuple from the first two-way join, it can be joined with tuples of the third relation, forming tuples to be joined with the fourth relation, and so on. Nested loops joins and merging scans joins may be mixed within one query; for example, in a join of three relations the first two relations may be joined by merging scans and the composite result may be joined with the third relation by nested loops. Intermediate composite relations are physically stored only when the next join step requires a sort. If no sort of the composite relation is required, it is materialized one tuple at a time, and each tuple immediately takes part in the next join.
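The tuple-at-a-time materialization of a composite relation can be imitated with Python generators. This is only an analogy under the assumption of in-memory relations; the function name `join` and the relation contents are hypothetical.

```python
def join(outer, inner, predicate):
    """Nested loops join written as a generator: yields one composite tuple at a time."""
    for outer_tuple in outer:
        for inner_tuple in inner:
            if predicate(outer_tuple, inner_tuple):
                yield outer_tuple + inner_tuple


r1 = [(1,), (2,)]
r2 = [(1, "a"), (2, "b")]
r3 = [("a", "X"), ("b", "Y")]

# The first two-way join is a generator: no composite relation is stored.
first = join(r1, r2, lambda o, i: o[0] == i[0])
# Each composite tuple is consumed by the second join as soon as it is produced.
pipeline = join(first, r3, lambda o, i: o[2] == i[0])
rows = list(pipeline)
```

If the second join needed its outer input sorted, the composite would have to be collected and sorted first, which corresponds to the case in which the intermediate relation is physically stored.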


We now consider the order in which relations are chosen to be joined. Note that although the cardinality of the result of joining n relations does not depend on the order in which the two-way joins are performed, the cost of joining in different orders can differ substantially. If the FROM list of a query block names n relations, there are n! different join orders. The search space can be reduced by observing that, once the first k relations have been joined, the method of joining the composite with the (k+1)-st relation does not depend on the order in which the first k were joined; that is, the applicable predicates are the same, the set of interesting orders is the same, the possible join methods are the same, and so on. Using this property, an efficient way to organize the search is to find the best join order for successively larger subsets of tables.
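The search over successively larger subsets can be sketched as a small dynamic program. The sketch is deliberately simplified and rests on assumptions that are not in the text: the cost of a join step is a toy measure (composite cardinality times inner cardinality), only one solution is kept per subset, and interesting orders and the choice of join method are ignored. All names and numbers are hypothetical.

```python
from itertools import combinations
from math import prod


def best_join_order(cards, join_sel):
    """Return (cost, order) of the cheapest join order under a toy cost model.

    cards:    {relation name: cardinality}
    join_sel: {frozenset({a, b}): selectivity of the join predicate on a and b}
    """
    def card(subset):
        # Estimated cardinality of the composite of the relations in `subset`.
        sel = prod(f for pair, f in join_sel.items() if pair <= subset)
        return prod(cards[r] for r in subset) * sel

    # Best solution for every subset, starting from the single relations.
    best = {frozenset([r]): (0.0, (r,)) for r in cards}
    names = sorted(cards)
    for size in range(2, len(names) + 1):
        for combo in combinations(names, size):
            subset = frozenset(combo)
            options = []
            for inner in combo:
                outer = subset - {inner}
                prev_cost, prev_order = best[outer]
                # Toy step cost (assumption): number of tuple pairs compared.
                step = card(outer) * cards[inner]
                options.append((prev_cost + step, prev_order + (inner,)))
            best[subset] = min(options)
    return best[frozenset(names)]


cards = {"T1": 1000, "T2": 10, "T3": 100}
join_sel = {frozenset({"T1", "T2"}): 0.01, frozenset({"T2", "T3"}): 0.1}
cost, order = best_join_order(cards, join_sel)
```

For n relations the table `best` holds one entry per non-empty subset, so at most 2^n - 1 solutions are kept instead of n! orders being enumerated.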


A heuristic is used to reduce the number of join order permutations that are considered. When possible, the search is restricted to join orders that have join predicates relating the inner relation to the other relations already participating in the join. This means that in joining relations t1, t2, ..., tn, only those orderings ti1, ti2, ..., tin are examined in which for all j (j = 2, ..., n) either (1) tij has at least one join predicate with some relation tik, where k < j, or (2) for all k > j, tik has no join predicate with ti1, ti2, ..., ti(j-1).


This means that all joins requiring a Cartesian product are performed as late in the join sequence as possible. For example, if t1, t2, t3 are the three relations in a query block's FROM list, and there are join predicates between t1 and t2 and between t2 and t3 on columns different from those of the t1-t2 join, then the following permutations are not considered:


t1-t3-t2
t3-t1-t2
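The heuristic can be checked on this example with a small sketch. It enumerates all permutations by brute force, which the optimizer itself avoids, and keeps those that satisfy conditions (1) and (2). The function name allowed_orders and the representation of join predicates as unordered pairs of relation names are assumptions made for the illustration; predicates implied by equality on a common column are not modeled.

```python
from itertools import permutations


def allowed_orders(relations, join_predicates):
    """Return the join orders accepted by the join order heuristic.

    join_predicates -- set of frozensets {a, b}, one per pair of relations
                       linked by at least one join predicate
    """
    def linked(a, b):
        return frozenset((a, b)) in join_predicates

    accepted = []
    for order in permutations(relations):
        ok = True
        for j in range(1, len(order)):
            earlier = order[:j]
            # Condition (1): the relation added now joins to an earlier one.
            cond1 = any(linked(order[j], e) for e in earlier)
            # Condition (2): no later relation joins to an earlier one,
            # so a Cartesian product cannot be postponed any further.
            cond2 = all(
                not linked(later, e) for later in order[j + 1:] for e in earlier
            )
            if not (cond1 or cond2):
                ok = False
                break
        if ok:
            accepted.append(order)
    return accepted


predicates = {frozenset(("t1", "t2")), frozenset(("t2", "t3"))}
orders = allowed_orders(["t1", "t2", "t3"], predicates)
```

For the three relations of the example, four of the six permutations remain, and the two rejected ones are exactly t1-t3-t2 and t3-t1-t2.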


To find the optimal plan for joining n relations, a tree of possible solutions is constructed. As discussed above, the search is performed by finding the best way to join subsets of the relations. For each set of relations joined, the cardinality of the composite relation is estimated and saved. In addition, for the unordered join and for each interesting order obtained by the join so far, the cheapest solution for achieving that order and the cost of that solution are saved. A solution consists of an ordered list of the relations to be joined, the join method used for each join, and a plan indicating how each relation is to be accessed. If either the outer composite relation or the inner relation needs to be sorted before the join, the sort is also included in the plan. As in the single-relation case, "interesting" orders are those specified in the GROUP BY or ORDER BY clause of the query block (if these clauses are present). In addition, every join column defines an "interesting" order. To minimize the number of different interesting orders, and hence the number of solutions in the tree, equivalence classes of interesting orders are computed, and only the best solution for each equivalence class is saved. For example, if there is a join predicate E.DNO = D.DNO and another join predicate D.DNO = F.DNO, then all three of these columns belong to the same order equivalence class.
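One way to compute such equivalence classes is a union-find structure over the join columns. The sketch below is an illustration under that assumption, not the method used in System R; the function name order_equivalence_classes and the extra predicate on a JOB column are hypothetical.

```python
def order_equivalence_classes(join_predicates):
    """Group join columns into order equivalence classes.

    join_predicates -- list of (left_column, right_column) equality predicates
    Returns a list of sets of column names.
    """
    parent = {}

    def find(col):
        # Register an unseen column as its own class, then follow parents.
        parent.setdefault(col, col)
        while parent[col] != col:
            parent[col] = parent[parent[col]]  # path halving
            col = parent[col]
        return col

    # Columns equated by a join predicate end up in the same class.
    for left, right in join_predicates:
        parent[find(left)] = find(right)

    classes = {}
    for col in list(parent):
        classes.setdefault(find(col), set()).add(col)
    return list(classes.values())


classes = order_equivalence_classes([("E.DNO", "D.DNO"), ("D.DNO", "F.DNO")])
```

With the two predicates of the example, the three DNO columns fall into a single class, so only one best solution has to be kept for an ordering on any of them.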


The search tree is constructed by iteration on the number of relations joined so far. First, the best way to access each single relation is found for each interesting tuple ordering and for the unordered case. Next, the best way of joining any relation to each of these is found, subject to the join order heuristic. This produces solutions for joining pairs of relations. Then the best way to join sets of three relations is found by considering all sets of two relations and joining to them each third relation permitted by the join order heuristic. For each plan to join a set of relations, the order of the composite result is kept in the tree. This allows consideration of a merging scans join that does not require sorting the composite. After all complete solutions (joins of all the relations) have been found, the optimizer chooses the cheapest solution that gives the required order, if one was specified. Note that if a solution with the correct order exists, no sort is performed for ORDER BY or GROUP BY, unless the ordered solution is more expensive than the cheapest unordered solution plus the cost of sorting into the required order.
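The iteration over successively larger subsets can be sketched as follows. This is a deliberately simplified illustration, not the System R implementation: it considers only nested loops joins, keeps one best plan per subset (interesting orders are ignored), and assumes that a single scan cost per relation is given. The function name best_join_order and all input numbers are hypothetical.

```python
from itertools import combinations


def best_join_order(card, scan_cost, selectivity):
    """Find a cheap left-deep join order by dynamic programming over subsets.

    card        -- {relation: cardinality}
    scan_cost   -- {relation: cost of one scan of the relation} (assumed given)
    selectivity -- {frozenset({a, b}): selectivity factor of the join predicate}
    Returns (cost, estimated cardinality, join order) for all relations.
    """
    rels = sorted(card)

    def connects(rel, group):
        # True if some join predicate links rel to a relation in group.
        return any(frozenset((rel, g)) in selectivity for g in group)

    # Step 1: best plan for each single relation.
    best = {frozenset([r]): (scan_cost[r], card[r], (r,)) for r in rels}

    # Step 2: extend to successively larger subsets.
    for size in range(2, len(rels) + 1):
        for combo in combinations(rels, size):
            subset = frozenset(combo)
            for inner in combo:
                outer = subset - {inner}
                if outer not in best:
                    continue  # the outer subset was pruned by the heuristic
                # Join order heuristic: postpone a Cartesian product while
                # some other relation still has a predicate with the outer.
                if not connects(inner, outer) and any(
                    connects(r, outer) for r in rels if r not in outer
                ):
                    continue
                outer_cost, outer_card, outer_order = best[outer]
                # Nested loops cost: c-outer + n * c-inner.
                cost = outer_cost + outer_card * scan_cost[inner]
                result_card = outer_card * card[inner]
                for o in outer:
                    result_card *= selectivity.get(frozenset((inner, o)), 1.0)
                if subset not in best or cost < best[subset][0]:
                    best[subset] = (cost, result_card, outer_order + (inner,))
    return best[frozenset(rels)]


cost, cardinality, order = best_join_order(
    card={"t1": 1000, "t2": 10, "t3": 100},
    scan_cost={"t1": 50, "t2": 2, "t3": 10},
    selectivity={frozenset(("t1", "t2")): 0.01, frozenset(("t2", "t3")): 0.1},
)
```

With these hypothetical inputs the subset {t1, t3} never receives a plan, because joining t1 and t3 first would require a Cartesian product while t2 can still be joined through a predicate. A fuller version would key the saved solutions by subset and by interesting order, as the text describes.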


The number of solutions that must be stored is at most 2 ** n (the number of subsets of n tables) times the number of interesting result orders. The computation time needed to generate the tree is approximately proportional to the same number. This number is often reduced substantially by the join order heuristic. In our experience, typical cases require only a few thousand bytes of storage and a few tenths of a second of 370/158 CPU time. Joins of eight tables have been optimized in a few seconds.

Computation of costs


The costs of joins are computed from the costs of the scans of each of the relations and from the cardinalities. The costs of the scans of each relation are computed using the cost formulas for single-relation access paths presented in section 4.


Let c-outer (path1) be the cost of scanning the outer relation via path1, and let n be the number of tuples of the outer relation that satisfy the applicable predicates. n is computed as follows:


n = (product of the cardinalities of all relations t of the join so far) * (product of the selectivity factors of all applicable predicates)


Let c-inner (path2) be the cost of scanning the inner relation, applying all applicable predicates. Note that in a merging scans join this means scanning the contiguous group of the inner relation that corresponds to one join column value in the outer relation. The cost of a nested loops join is then computed as follows:



c-nested-loop-join (path1, path2) = c-outer (path1) + n * c-inner (path2)
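As a worked illustration of the two formulas above, the following sketch computes n and the nested loops cost. The function names and all numbers are hypothetical and are not taken from the paper.

```python
def outer_cardinality(cardinalities, selectivity_factors):
    """n = product of cardinalities * product of selectivity factors."""
    n = 1.0
    for c in cardinalities:
        n *= c
    for f in selectivity_factors:
        n *= f
    return n


def nested_loop_join_cost(c_outer, n, c_inner):
    """c-nested-loop-join = c-outer + n * c-inner."""
    return c_outer + n * c_inner


# Hypothetical outer composite: two relations with 1000 and 50 tuples,
# one applicable predicate with selectivity factor 0.5.
n = outer_cardinality([1000, 50], [0.5])
# Hypothetical scan costs: 100 for the outer, 3 per scan of the inner.
total = nested_loop_join_cost(c_outer=100, n=n, c_inner=3)
```

The inner scan cost is multiplied by n because the inner relation is scanned once for every qualifying outer tuple, which is why a cheap inner access path matters most when the outer is large.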


The cost of a merging scans join can be broken up into the cost of actually performing the merge plus the cost of sorting the outer or inner relation, if required. The cost of performing the merge is computed as follows:



c-merge (path1, path2) = c-outer (path1) + n * c-inner (path2)


For the case where the inner relation is sorted into a temporary relation, none of the single-relation access path formulas of section 4 applies. In this case the inner scan is like a segment scan, except that the merging scans method makes use of the fact that the inner relation is sorted, so it is not necessary to scan the entire inner relation looking for a match. For this case we use the following formula for the cost of the inner scan:



c-inner (sorted list) = temppages/n + w*rsicard


Here temppages is the number of pages required to hold the inner relation. This formula assumes that during the merge each page of the inner relation is fetched only once. It is interesting to observe that the cost formula for nested loops joins and the cost formula for merging scans joins are essentially the same. The reason that merging scans is sometimes better than nested loops is that the cost of the inner scan may be much lower. After sorting, the inner relation is clustered on the join column, which tends to minimize the number of pages fetched, and it is not necessary to scan the entire inner relation (looking for a match) for each tuple of the outer relation.
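A short numeric sketch shows why this formula corresponds to fetching each page of the sorted inner relation once: multiplying the per-tuple inner cost by n gives a total page term of exactly temppages. The function names and the numbers are hypothetical; w and rsicard are simply passed in as parameters with the meaning they have in the formula above.

```python
def inner_scan_cost_sorted(temppages, n, w, rsicard):
    """c-inner (sorted list) = temppages/n + w*rsicard."""
    return temppages / n + w * rsicard


def merge_cost(c_outer, n, c_inner):
    """c-merge = c-outer + n * c-inner."""
    return c_outer + n * c_inner


# Hypothetical values: the sorted inner relation occupies 200 pages,
# 1000 outer tuples qualify, w = 0.02, rsicard = 5.
temppages, n, w, rsicard = 200, 1000, 0.02, 5
c_inner = inner_scan_cost_sorted(temppages, n, w, rsicard)
total = merge_cost(c_outer=100, n=n, c_inner=c_inner)

# Over the whole merge the page term sums to temppages:
# n * (temppages/n + w*rsicard) = temppages + n*w*rsicard.
expanded = 100 + temppages + n * w * rsicard
```

In a nested loops join over an unsorted inner relation the page term would instead be paid again for each of the n outer tuples, which is the difference the paragraph above points out.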


The cost of sorting a relation, c-sort (path), includes the cost of retrieving the data using the specified access path, sorting the data (which may involve several passes), and putting the results into a temporary list. Note that before the inner table is sorted, only the local predicates can be applied. In addition, if a composite result has to be sorted, the entire composite relation must be stored in a temporary relation before it can be sorted. The cost of inserting the composite tuples into the temporary relation before sorting is included in c-sort (path).