Berkeley DB Reference Guide: Equality Join

Berkeley DB Reference Guide:
Access Methods

Equality Join

Berkeley DB supports "equality" (also known as "natural"), joins on secondary indices. An equality join is a method of retrieving data from a primary database using criteria stored in a set of secondary indices. It requires the data be organized as a primary database which contains the primary key and primary data field, and a set of secondary indices. Each of the secondary indices is indexed by a different secondary key, and, for each key in a secondary index, there is a set of duplicate data items that match the primary keys in the primary database.

For example, let's assume the need for an application that will return the names of stores in which one can buy fruit of a given color. We would first construct a primary database that lists types of fruit as the key item, and the store where you can buy them as the data item:

Primary key: Primary data:

apple Convenience Store

blueberry Farmer's Market

peach Shopway

pear Farmer's Market

raspberry Shopway

strawberry Farmer's Market

Primary key:	Primary data:
apple	Convenience Store
blueberry	Farmer's Market
peach	Shopway
pear	Farmer's Market
raspberry	Shopway
strawberry	Farmer's Market

We would then create a secondary index with the key color, and, as the data items, the names of fruits of different colors.

Secondary key: Secondary data:

blue blueberry

red apple

red raspberry

red strawberry

yellow peach

yellow pear

Secondary key:	Secondary data:
blue	blueberry
red	apple
red	raspberry
red	strawberry
yellow	peach
yellow	pear

This secondary index would allow an application to look up a color, and then use the data items to look up the stores where the colored fruit could be purchased. For example, by first looking up blue, the data item blueberry could be used as the lookup key in the primary database, returning Farmer's Market.

Your data must be organized in the following manner in order to use the DB->join method:

The actual data should be stored in the database represented by the DB object used to invoke this method. Generally, this DB object is called the primary.
Secondary indices should be stored in separate databases, whose keys are the values of the secondary indices and whose data items are the primary keys corresponding to the records having the designated secondary key value. It is acceptable (and expected) that there may be duplicate entries in the secondary indices.
These duplicate entries should be sorted for performance reasons, although it is not required. For more information see the DB_DUPSORT flag to the DB->set_flags method.

What the DB->join method does is review a list of secondary keys, and, when it finds a data item that appears as a data item for all of the secondary keys, it uses that data item as a lookup into the primary database, and returns the associated data item.

If there were another secondary index that had as its key the cost of the fruit, a similar lookup could be done on stores where inexpensive fruit could be purchased:

Secondary key: Secondary data:

expensive blueberry

expensive peach

expensive pear

expensive strawberry

inexpensive apple

inexpensive pear

inexpensive raspberry

Secondary key:	Secondary data:
expensive	blueberry
expensive	peach
expensive	pear
expensive	strawberry
inexpensive	apple
inexpensive	pear
inexpensive	raspberry

The DB->join method provides equality join functionality. While not strictly cursor functionality, in that it is not a method off a cursor handle, it is more closely related to the cursor operations than to the standard DB operations.

It is also possible to do lookups based on multiple criteria in a single operation. For example, it is possible to look up fruits that are both red and expensive in a single operation. If the same fruit appeared as a data item in both the color and expense indices, then that fruit name would be used as the key for retrieval from the primary index, and would then return the store where expensive, red fruit could be purchased.

Example

Consider the following three databases:

personnel

key = SSN
data = record containing name, address, phone number, job title

lastname

key = lastname
data = SSN

jobs

key = job title
data = SSN

Consider the following query:

Return the personnel records of all people named smith with the job
title manager.

This query finds are all the records in the primary database (personnel) for whom the criteria lastname=smith and job title=manager is true.

Assume that all databases have been properly opened and have the handles: pers_db, name_db, job_db. We also assume that we have an active transaction to which the handle txn refers.

DBC *name_curs, *job_curs, *join_curs;
DBC *carray[3];
DBT key, data;
int ret, tret;

name_curs = NULL;
job_curs = NULL;
memset(&key, 0, sizeof(key));
memset(&data, 0, sizeof(data));

if ((ret =
    name_db->cursor(name_db, txn, &name_curs, 0)) != 0)
	goto err;
key.data = "smith";
key.size = sizeof("smith");
if ((ret =
    name_curs->c_get(name_curs, &key, &data, DB_SET)) != 0)
	goto err;

if ((ret = job_db->cursor(job_db, txn, &job_curs, 0)) != 0)
	goto err;
key.data = "manager";
key.size = sizeof("manager");
if ((ret =
    job_curs->c_get(job_curs, &key, &data, DB_SET)) != 0)
	goto err;

carray[0] = name_curs;
carray[1] = job_curs;
carray[2] = NULL;

if ((ret =
    pers_db->join(pers_db, carray, &join_curs, 0)) != 0)
	goto err;
while ((ret =
    join_curs->c_get(join_curs, &key, &data, 0)) == 0) {
	/* Process record returned in key/data. */
}

/*
 * If we exited the loop because we ran out of records,
 * then it has completed successfully.
 */
if (ret == DB_NOTFOUND)
	ret = 0;

err:
if (join_curs != NULL &&
    (tret = join_curs->c_close(join_curs)) != 0 && ret == 0)
	ret = tret;
if (name_curs != NULL &&
    (tret = name_curs->c_close(name_curs)) != 0 && ret == 0)
	ret = tret;
if (job_curs != NULL &&
    (tret = job_curs->c_close(job_curs)) != 0 && ret == 0)
	ret = tret;

return (ret);

The name cursor is positioned at the beginning of the duplicate list for smith and the job cursor is placed at the beginning of the duplicate list for manager. The join cursor is returned from the join method. This code then loops over the join cursor getting the personnel records of each one until there are no more.

Berkeley DB Reference Guide:Access Methods

Equality Join

Example

Berkeley DB Reference Guide:
Access Methods