Joins are used to combine the data from two tables, with the result being a new, temporary table.
The temporary table is created based on column(s) that the two tables share, which represent
meaningful column(s) of comparison. The goal is to extract meaningful data from the resulting
temporary table. Joins are performed based on something called a predicate, which specifies the
condition to use in order to perform a join. A join can be either an inner join or an outer join,
depending on how one wants the resulting table to look.
It is best to illustrate the differences between inner and outer joins by use of an example. Here we
have 2 tables that we will use for our example:
Employee
EmpID   EmpName
13      Jason
8       Alex
3       Ram
17      Babu
25      Johnson

Location
EmpID   EmpLoc
13      San Jose
8       Los Angeles
3       Pune, India
17      Chennai, India
39      Bangalore, India
It's important to note that the very last row in the Employee table does not exist in the
Location table. Also, the very last row in the Location table does not exist in the
Employee table. These facts will prove to be significant in the discussion that follows.
Outer Joins
Let's start the explanation with outer joins. Outer joins can be further divided into left outer
joins, right outer joins, and full outer joins. Here is what the SQL for a left outer join would look
like, using the tables above:
select * from employee left outer join location
on employee.empID = location.empID;
In this SQL we are joining on the condition that the employee IDs match in the two tables. So,
we will essentially be combining 2 tables into 1, based on the condition that the employee IDs
match. Note that we can get rid of the "outer" in left outer join, which will give us the SQL
below. This is equivalent to what we have above.
select * from employee left join location
on employee.empID = location.empID;
A left outer join retains all of the rows of the left table, regardless of whether there is a row that
matches on the right table. The SQL above will give us the result set shown below.
Employee.EmpID   Employee.EmpName   Location.EmpID   Location.EmpLoc
13               Jason              13               San Jose
8                Alex               8                Los Angeles
3                Ram                3                Pune, India
17               Babu               17               Chennai, India
25               Johnson            NULL             NULL
Earlier we had mentioned something called a join predicate. In the SQL above, the join predicate
is "on employee.empID = location.empID". This is the heart of any type of join, because it
determines what common column between the 2 tables will be used to "join" the 2 tables. As you
can see from the result set, all of the rows from the left table are returned when we do a left outer
join. The last row of the Employee table (which contains the "Johnson" entry) is displayed in the
results even though there is no matching row in the Location table. As you can see, the
non-matching columns in the last row are filled with a "NULL". So, we have "NULL" as the entry
wherever there is no match.
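The left outer join described above can be tried out with a small script. This is only a sketch: it uses Python's built-in sqlite3 module with an in-memory database, and the helper names build_tables and left_join_rows are made up for this illustration. The table and column names follow the join predicate quoted above.

```python
import sqlite3


def build_tables(conn):
    """Create the Employee and Location tables from the example."""
    conn.executescript("""
        CREATE TABLE employee (empID INTEGER, empName TEXT);
        CREATE TABLE location (empID INTEGER, empLoc TEXT);
        INSERT INTO employee VALUES
            (13, 'Jason'), (8, 'Alex'), (3, 'Ram'), (17, 'Babu'), (25, 'Johnson');
        INSERT INTO location VALUES
            (13, 'San Jose'), (8, 'Los Angeles'), (3, 'Pune, India'),
            (17, 'Chennai, India'), (39, 'Bangalore, India');
    """)


def left_join_rows():
    """Return every employee row, padded with NULLs (None) when no location matches."""
    conn = sqlite3.connect(":memory:")
    build_tables(conn)
    rows = conn.execute(
        "SELECT e.empID, e.empName, l.empID, l.empLoc "
        "FROM employee e LEFT OUTER JOIN location l ON e.empID = l.empID "
        "ORDER BY e.rowid"  # keep the insertion order of the left table
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in left_join_rows():
        print(row)
```

Python shows SQL NULL as None, so the Johnson row comes back with None in both Location columns.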
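A right outer join is pretty much the same thing as a left outer join, except that the rows that are
retained are from the right table. This is what the SQL looks like:
select * from employee right outer join location
on employee.empID = location.empID;
Using the tables presented above, we can show what the result set of a right outer join would
look like:
Employee.EmpID   Employee.EmpName   Location.EmpID   Location.EmpLoc
13               Jason              13               San Jose
8                Alex               8                Los Angeles
3                Ram                3                Pune, India
17               Babu               17               Chennai, India
NULL             NULL               39               Bangalore, India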
We can see that the last row returned in the result set contains the row that was in the Location
table, but not in the Employee table (the "Bangalore, India" entry). Because there is no matching
row in the Employee table that has an employee ID of "39", we have NULLs in the result set for
the Employee columns.
Inner Joins
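Now that we've gone over outer joins, we can contrast those with the inner join. The difference
between an inner join and an outer join is that an inner join will return only the rows that actually
match based on the join predicate. Once again, this is best illustrated via an example. Here's
what the SQL for an inner join will look like:
select * from employee inner join location
on employee.empID = location.empID;
Now, here is what the result of running that SQL would look like:
Employee.EmpID   Employee.EmpName   Location.EmpID   Location.EmpLoc
13               Jason              13               San Jose
8                Alex               8                Los Angeles
3                Ram                3                Pune, India
17               Babu               17               Chennai, India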
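To check the inner join behavior, here is a minimal sketch using Python's built-in sqlite3 module. The function name inner_join_rows is hypothetical; the data is the Employee and Location example from above. Only rows whose employee ID appears in both tables should come back.

```python
import sqlite3


def inner_join_rows():
    """Return only the rows whose empID exists in both tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE employee (empID INTEGER, empName TEXT);
        CREATE TABLE location (empID INTEGER, empLoc TEXT);
        INSERT INTO employee VALUES
            (13, 'Jason'), (8, 'Alex'), (3, 'Ram'), (17, 'Babu'), (25, 'Johnson');
        INSERT INTO location VALUES
            (13, 'San Jose'), (8, 'Los Angeles'), (3, 'Pune, India'),
            (17, 'Chennai, India'), (39, 'Bangalore, India');
    """)
    rows = conn.execute(
        "SELECT e.empID, e.empName, l.empLoc "
        "FROM employee e INNER JOIN location l ON e.empID = l.empID "
        "ORDER BY e.rowid"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in inner_join_rows():
        print(row)
```

Neither Johnson (ID 25) nor the Bangalore location (ID 39) appears, because neither has a match on the other side.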
With so many different types of keys (foreign, primary, unique, natural, super, etc.), it can be
quite hard to get a solid understanding of keys in SQL. And to make it even more
confusing, many people think that relational database theory (which deals with terms like tuples,
attributes, and relations), SQL (which deals with terms like tables, rows, and columns, and is our
main concern here), and file systems (which deal with terms like records and fields) are all the
same concept when in fact they are all completely different.
With that in mind, we want to give the proper definition of keys in SQL so that the foundation of
your understanding can be solid.
A key identifies rows, so its columns are generally not allowed to hold NULL values (the SQL
standard requires this for primary keys, although UNIQUE constraints can permit NULLs). Any key
that has more columns than necessary to uniquely identify each row in the table is called a
super-key (think of it as a super-set). But, if the key has the minimum number of columns necessary to
uniquely identify each row then it is called a minimal super-key. A minimal super-key is also
known as a candidate key, and there must be one or more candidate keys in a table.
The one thing that primary, unique, and foreign keys all have in common is the fact that each
type of key can consist of more than just one column from a given table. In other words,
foreign, primary, and unique keys are not restricted to a single column - each type of key can
cover multiple columns, which is something that many people in software are not aware of.
Of course, the database programmer is the one who will actually define which columns are
covered by a foreign, primary, or unique key. That is one similarity all those keys share, but there
are also some major differences that exist between primary, unique, and foreign keys. We will go
over those differences in this article. But first, we want to give a thorough explanation of why
foreign keys are necessary in some situations.
As we stated earlier, both unique and primary keys can be referenced by foreign keys.
Referential integrity is a relational database concept in which multiple tables share a relationship
based on the data stored in the tables, and that relationship must remain consistent.
The concept of referential integrity, and one way in which it's enforced, is best illustrated by an
example. Suppose company X has 2 tables, an Employee table, and an Employee Salary table. In
the Employee table we have 2 columns - the employee ID and the employee name. In the
Employee Salary table, we have 2 columns - the employee ID and the salary for the given ID.
By enforcing referential integrity, we can solve the problem of salary rows being left behind when
an employee is removed from the Employee table, so that we wouldn't have to
manually delete him from the Employee Salary table (or any others). Here's how: first we would
define the employee ID column in the Employee table to be our primary key. Then, we would
define the employee ID column in the Employee Salary table to be a foreign key that points to
the employee ID primary key in the Employee table. Once we define our foreign
to primary key relationship, we would need to add what's called a constraint to the Employee
Salary table. The constraint that we would add in particular is called a cascading delete - this
would mean that any time an employee is removed from the Employee table, any entries that
employee has in the Employee Salary table would also automatically be removed from the
Employee Salary table.
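As a rough sketch of the cascading delete described above, the following uses Python's built-in sqlite3 module. The table and column names (employee, employee_salary, emp_id), the sample rows, and the function name delete_with_cascade are assumptions made for illustration. Note that SQLite only enforces foreign keys after PRAGMA foreign_keys = ON; other databases have their own settings.

```python
import sqlite3


def delete_with_cascade():
    """Delete one employee and return the salary rows that remain."""
    conn = sqlite3.connect(":memory:")
    # SQLite ignores foreign key constraints unless this is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE employee (
            emp_id   INTEGER PRIMARY KEY,
            emp_name TEXT
        );
        CREATE TABLE employee_salary (
            emp_id INTEGER REFERENCES employee(emp_id) ON DELETE CASCADE,
            salary INTEGER
        );
        INSERT INTO employee VALUES (1, 'Joe'), (2, 'Ann');
        INSERT INTO employee_salary VALUES (1, 50000), (2, 60000);
    """)
    # Removing the employee also removes his salary row automatically.
    conn.execute("DELETE FROM employee WHERE emp_id = 1")
    remaining = conn.execute(
        "SELECT emp_id, salary FROM employee_salary"
    ).fetchall()
    conn.close()
    return remaining


if __name__ == "__main__":
    print(delete_with_cascade())
```

Only one DELETE statement is issued, against the Employee table; the database removes the matching salary row on its own.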
Note in the example given above that referential integrity is something that must be enforced,
and that we enforced only one rule of referential integrity (the cascading delete). There are
actually 3 rules that referential integrity enforces:
3. If the primary key for a record in the Employee table changes,
all corresponding records in the Employee Salary table must be
modified using what's called a cascading update.
It's worth noting that most RDBMSs - relational databases like Oracle, DB2, Teradata, etc. -
can automatically enforce referential integrity if the right settings are in place. But, a large part of
the burden of maintaining referential integrity is placed upon whoever designs the database
schema - basically whoever defined the tables and their corresponding structure/relationships in
the database that you are using. Referential integrity is an important concept and you simply
must know it for any programmer interview.
Consider a table called People. If we use the columns First_Name, Last_Name, and Address
together to form a key then that would be a natural key because those columns are
natural to people, and there is definitely a logical relationship between those columns
and any other columns that may exist in the table.
In other words, the surrogate key really has no business meaning - i.e., the data stored in a
surrogate key has no intrinsic meaning to it.
A surrogate key could be considered to be the artificial key that we mentioned earlier. In most
databases, surrogate keys are only used to act as a primary key. Surrogate keys are usually just
simple sequential numbers where each number uniquely identifies a row. For example, Sybase
and SQL Server both have what's called an identity column specifically meant to hold a unique
sequential number for each row. MySQL allows you to define a column with the
AUTO_INCREMENT attribute, which just means that every time you add a new row, the value
in that column is automatically set to be 1 greater than the value in the most recent row
added to the table. You can also set the increment value to be whatever you want it to be.
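Here is a small sketch of a surrogate key in action. It uses SQLite (through Python's built-in sqlite3 module) rather than MySQL, so the keyword is AUTOINCREMENT instead of AUTO_INCREMENT; the people table, its columns, and the function name are invented for the example.

```python
import sqlite3


def insert_with_surrogate_keys(names):
    """Insert names and let the database hand out sequential surrogate keys."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE people ("
        "  person_id INTEGER PRIMARY KEY AUTOINCREMENT,"  # surrogate key
        "  name TEXT)"
    )
    for name in names:
        # person_id is never supplied; the database generates it.
        conn.execute("INSERT INTO people (name) VALUES (?)", (name,))
    rows = conn.execute(
        "SELECT person_id, name FROM people ORDER BY person_id"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(insert_with_surrogate_keys(["Ann", "Bob", "Cy"]))
```

The generated numbers say nothing about the people themselves, which is exactly what makes the key a surrogate.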
You may have heard the term secondary key in Oracle, MySQL, SQL Server, or whatever other
DBMS you are dealing with. What exactly is a secondary key? Let's start with a definition, and
then a simple example that will help you understand further.
A given table may have more than just one choice for a primary key. Basically, there may be
another column (or combination of columns for a multi-column primary key) that qualifies as
a primary key. Any column or combination of columns that qualifies to be a primary key is known
as a candidate key, because it is considered a candidate for the primary key. And the
options that are not selected to be the primary key are known as secondary keys.
Let's go through an example of a secondary key. Consider a table called Managers that stores all
of the managers in a company. Each manager has a unique Manager ID Number, a physical
address, and an email address. Let's say that the Manager ID is chosen to be the primary key of
the Managers table. Both the physical address and email address could have been selected as the
primary key, because they are both unique fields for every manager row in the Managers table.
But, because the email address and physical address were not selected as the primary key, they
are considered to be secondary keys.
Provide a definition and example of a superkey in SQL.
In SQL, the definition of a superkey is a set of columns in a table for which there are no two
rows that will share the same combination of values. So, the superkey is unique for each and
every row in the table. A superkey can also be just a single column.
Example of a superkey
Suppose we have a table that holds all the managers in a company, and that table is called
Managers. The table has columns called ManagerID, Name, Title, and DepartmentID. Every
manager has his/her own ManagerID, so that value is always unique in each and every row.
This means that if we combine the ManagerID column value for any given row with any other
column value, then we will have a unique set of values. So, for the combinations of (ManagerID,
Name), (ManagerID, Title), (ManagerID, DepartmentID), (ManagerID, Name, DepartmentID),
etc. - there will be no two rows in the table that share the exact same combination of values,
because the ManagerID will always be unique and different for each row. This means that pairing
the ManagerID with any other column(s) will ensure that the combination will also be unique
across all rows in the table.
And that is exactly what defines a superkey - it's any combination of column(s) for which that
combination of values will be unique across all rows in a table. So, all of those combinations of
columns in the Managers table that we gave earlier would be considered to be superkeys. Even
the ManagerID column by itself is considered to be a superkey, although a special type of superkey as
you can read more about below.
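The uniqueness test behind a superkey can be sketched in a few lines of Python. The function is_superkey and the sample managers rows are made up for illustration. Keep in mind that this only checks one snapshot of data, whereas a real superkey must stay unique for every row the table could ever hold.

```python
def is_superkey(rows, columns):
    """Return True if no two rows share the same combination of values."""
    seen = set()
    for row in rows:
        key = tuple(row[column] for column in columns)
        if key in seen:
            return False  # two rows collide on this combination
        seen.add(key)
    return True


# Hypothetical sample data for the Managers table.
managers = [
    {"ManagerID": 1, "Name": "Ann", "Title": "Director", "DepartmentID": 10},
    {"ManagerID": 2, "Name": "Bob", "Title": "Director", "DepartmentID": 10},
    {"ManagerID": 3, "Name": "Ann", "Title": "VP", "DepartmentID": 20},
]

if __name__ == "__main__":
    print(is_superkey(managers, ["ManagerID"]))
    print(is_superkey(managers, ["ManagerID", "Name"]))
    print(is_superkey(managers, ["Title", "DepartmentID"]))
```

Any combination that includes ManagerID passes, while (Title, DepartmentID) fails because two managers share the same title and department.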
emp_bonus
Employee   Bonus
A          1000
B          2000
A          500
C          700
B          1250
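If we want to calculate the total bonus that each employee received, then we would write a SQL
statement like this:
select employee, sum(bonus) from emp_bonus
group by employee;
That statement returns one total per employee:
Employee   Sum(Bonus)
A          1500
B          3250
C          700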
Now, suppose we wanted to find the employees who received more than $1,000 in bonuses for
the year of 2007. You might think that we could write a query like this:
BAD SQL:
select employee, sum(bonus) from emp_bonus
group by employee where sum(bonus) > 1000;
GOOD SQL:
select employee, sum(bonus) from emp_bonus
group by employee having sum(bonus) > 1000;
Difference between having and where clause
So we can see that the difference between the having and where clause in SQL is that the where
clause cannot be used with aggregates, but the having clause can. One way to think of it is that
the where clause filters individual rows before they are grouped, and the having clause is an
additional filter that is applied to the groups afterwards.
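The difference can be checked with a short script. This is a sketch that uses Python's built-in sqlite3 module and the emp_bonus data from above; the function name bonus_totals_over is hypothetical. The "bad" query is wrapped in try/except because the database is expected to reject it.

```python
import sqlite3


def bonus_totals_over(threshold):
    """Return (was the WHERE version rejected?, rows from the HAVING version)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE emp_bonus (employee TEXT, bonus INTEGER)")
    conn.executemany(
        "INSERT INTO emp_bonus VALUES (?, ?)",
        [("A", 1000), ("B", 2000), ("A", 500), ("C", 700), ("B", 1250)],
    )
    try:
        # BAD SQL: an aggregate condition placed in a WHERE clause.
        conn.execute(
            "SELECT employee, SUM(bonus) FROM emp_bonus "
            "GROUP BY employee WHERE SUM(bonus) > 1000"
        )
        where_rejected = False
    except sqlite3.Error:
        where_rejected = True
    # GOOD SQL: the aggregate condition goes in HAVING.
    rows = conn.execute(
        "SELECT employee, SUM(bonus) FROM emp_bonus "
        "GROUP BY employee HAVING SUM(bonus) > ? ORDER BY employee",
        (threshold,),
    ).fetchall()
    conn.close()
    return where_rejected, rows


if __name__ == "__main__":
    print(bonus_totals_over(1000))
```

Employee C, whose total is only 700, is filtered out by the having clause.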
Now, let's say that we want to run a query to find all the details of any employees who are named
'Jesus'. So, we decide to run a simple query like this:
SELECT * FROM Employee WHERE Employee_Name = 'Jesus'
Once we run that query, what exactly goes on behind the scenes to find employees who are
named Jesus? Well, the database software would literally have to look at every single row in the
Employee table to see if the Employee_Name for that row is 'Jesus'. And, because we want
every row with the name 'Jesus' inside it, we cannot just stop looking once we find just one row
with the name 'Jesus', because there could be other rows with the name Jesus. So, every row up
until the last row must be searched - which means thousands of rows in this scenario will have to
be examined by the database to find the rows with the name 'Jesus'. This is what is called a full
table scan.
What is an index?
So, what is an index? Well, an index is a data structure (most commonly a B-tree) that stores the
values for a specific column in a table. An index is created on a column of a table. So, the key
points to remember are that an index consists of column values from one table, and that those
values are stored in a data structure. The index is a data structure - remember that.
Hash tables are another data structure that you may see being used as indexes - these indexes are
commonly referred to as hash indexes. The reason hash indexes are used is because hash tables
are extremely efficient when it comes to just looking up values. So, queries that compare for
equality to a string can retrieve values very fast if they use a hash index. For instance, the query
we discussed earlier (SELECT * FROM Employee WHERE Employee_Name = 'Jesus') could
benefit from a hash index created on the Employee_Name column. The way a hash index would
work is that the column value will be the key into the hash table and the actual value mapped to
that key would just be a pointer to the row data in the table. Since a hash table is basically an
associative array, a typical entry would look something like 'Jesus' => 0x28939, where 0x28939
is a reference to the table row where Jesus is stored in memory. Looking up a value like 'Jesus'
in a hash table index and getting back a reference to the row in memory is obviously a lot faster
than scanning the table to find all the rows with a value of 'Jesus' in the Employee_Name
column.
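The hash index idea can be sketched with a Python dictionary, which is itself a hash table. This is only an illustration and not how any particular database implements its indexes: the function build_hash_index and the sample employees are invented, and list positions stand in for the row pointers.

```python
from collections import defaultdict


def build_hash_index(rows, column):
    """Map each column value to the positions of the rows that contain it."""
    index = defaultdict(list)
    for position, row in enumerate(rows):
        index[row[column]].append(position)
    return index


# Hypothetical Employee table held as a list of rows.
employees = [
    {"Employee_Name": "Maria", "Employee_Age": 41},
    {"Employee_Name": "Jesus", "Employee_Age": 33},
    {"Employee_Name": "Ahmed", "Employee_Age": 29},
    {"Employee_Name": "Jesus", "Employee_Age": 52},
]

if __name__ == "__main__":
    name_index = build_hash_index(employees, "Employee_Name")
    # One hash lookup replaces a scan over every row.
    for position in name_index.get("Jesus", []):
        print(employees[position])
```

Building the index costs one pass over the table, but every equality lookup afterwards avoids a full scan.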
Another type of index is a bitmap index, which works well on columns that contain only a few
distinct values (for example Boolean values like true and false) but many instances of those
values - basically columns with low selectivity.
Because an index is basically a data structure that is used to store column values, looking up
those values becomes much faster. And, if an index is using the most commonly used data
structure type - a B-tree - then the data structure is also sorted. Having the column values be
sorted can be a major performance enhancement - read on to find out why.
Let's say that we create a B-tree index on the Employee_Name column. This means that when
we search for employees named 'Jesus' using the SQL we showed earlier, then the entire
Employee table does not have to be searched to find employees named 'Jesus'. Instead, the
database will use the index to find employees named Jesus, because the index will presumably be
sorted alphabetically by the employee's name. And, because it is sorted, it means searching for a
name is a lot faster because all names starting with a 'J' will be right next to each other in the
index! It's also important to note that the index also stores pointers to the table row so that other
column values can be retrieved - read on for more details on that.
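To show why a sorted index speeds up a search, here is a sketch that uses a sorted Python list and binary search from the standard bisect module. A real B-tree is more elaborate; this only illustrates the "sorted values plus row pointers" idea, and the function names and sample rows are assumptions.

```python
import bisect


def build_sorted_index(rows, column):
    """Return a sorted list of (column value, row position) pairs."""
    return sorted((row[column], position) for position, row in enumerate(rows))


def lookup(index, value):
    """Binary-search the index and collect the positions of matching rows."""
    # (value, -1) sorts just before every real entry for this value.
    i = bisect.bisect_left(index, (value, -1))
    positions = []
    while i < len(index) and index[i][0] == value:
        positions.append(index[i][1])
        i += 1
    return positions


# Hypothetical Employee table held as a list of rows.
employees = [
    {"Employee_Name": "Zed"},
    {"Employee_Name": "Jesus"},
    {"Employee_Name": "Amy"},
    {"Employee_Name": "Jesus"},
]

if __name__ == "__main__":
    index = build_sorted_index(employees, "Employee_Name")
    print(lookup(index, "Jesus"))
```

Because equal names sit next to each other in the sorted index, the search jumps to the first 'Jesus' and stops as soon as the name changes.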
As a general rule, an index should only be created on a table if the data in the indexed column
will be queried frequently.
In databases, what is a full table scan? Also, what are some of the causes of full
table scans?
A full table scan looks through all of the rows in a table - one by one - to find the data that a
query is looking for. Obviously, this can cause very slow SQL queries if you have a table with a
lot of rows - just imagine how performance-intensive a full table scan would be on a table with
millions of rows. Using an index can help prevent full table scans.
Let's go through some different scenarios which cause a full table scan:
If a query does not have a WHERE clause to filter out the rows which appear in the result set,
then a full table scan might be performed.
There are some scenarios in which a full table scan will still be performed even though an index
is present on that table. Let's go through some of those scenarios.
If a query does have a WHERE clause, but none of the columns in that WHERE clause match the
leading column of an index on the table, then a full table scan will be performed.
Even if a query does have a WHERE clause with a column that matches the first column of an
index, a full table scan can still occur. This situation arises when the comparison being used by
the WHERE clause prevents the use of an index. Here are some scenarios in which that could
happen:
What are the differences between a hash table and a binary search tree? Suppose
that you are trying to figure out which of those data structures to use when
designing the address book for a cell phone that has limited memory. Which data
structure would you use?
A hash table can insert and retrieve elements in O(1) on average. A binary
search tree can insert and retrieve elements in O(log(n)) when it is kept balanced, which is quite
a bit slower than the hash table's O(1).
A hash table is an unordered data structure
When designing a cell phone, you want to keep as much memory as possible available for data
storage. A hash table is an unordered data structure - which means that it does not keep its
elements in any particular order. So, if you use a hash table for a cell phone address book, then
you would need additional memory to sort the values because you would definitely need to
display the values in alphabetical order - it is an address book after all. So, by using a hash table
you have to set aside memory to sort elements that would have otherwise been used as storage
space.
Because a binary search tree is already sorted, there will be no need to waste memory or
processing time sorting records in a cell phone. As we mentioned earlier, doing a lookup or an
insert on a binary search tree is slower than doing it with a hash table, but a cell phone address
book will almost never have more than 5,000 entries. With such a small number of entries, a binary
search tree's O(log(n)) will definitely be fast enough. So, given all that information, a binary
search tree is the data structure that you should use in this scenario, since it is a better choice than
a hash table.
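The "already sorted" property can be seen in a small sketch of a binary search tree. This is a plain, unbalanced tree written for illustration only (a production address book would likely use a balanced variant); the Node class and function names are hypothetical.

```python
class Node:
    """One address book entry in the tree."""

    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Insert key into the tree rooted at root and return the root."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root


def in_order(root):
    """Walk left subtree, node, right subtree: the keys come out sorted."""
    if root is None:
        return []
    return in_order(root.left) + [root.key] + in_order(root.right)


if __name__ == "__main__":
    root = None
    for name in ["Maria", "Ahmed", "Zoe", "Jesus"]:
        root = insert(root, name)
    print(in_order(root))
```

No separate sorting step is needed to list the contacts alphabetically; an in-order walk of the tree produces them in order.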
How does Big-O Notation work, and can you provide an example?
First and foremost, do not even walk into a software interview without knowing what Big O
Analysis is all about - you will embarrass yourself. Big O Notation is simply something that you
must know if you expect to get a job in this industry. Here we present a tutorial on Big O
Notation, along with some simple examples to really help you understand it. You can consider
this article to be sort of a "Big O notation for dummies" tutorial, because we really try to make it
easy to understand.
This is where Big O analysis helps - it gives us some basis for measuring the efficiency of an
algorithm. A more detailed explanation and definition of Big O analysis would be this: it
measures the efficiency of an algorithm based on the time it takes for the algorithm to run as a
function of the input size. Think of the input simply as what goes into a function - whether it be
an array of numbers, a linked list, etc.
It's really not that bad at all - and it is something best illustrated by an example with actual code
samples.
Even if you already know what Big O Notation is, you can still check out the example algorithms
below and try to figure out the Big O Notation of each algorithm on your own without reading
our answers first. This will give you some good practice.
Now it's really time to pay attention - let's start our explanation of Big O Notation with an actual
problem. Here is the problem we are trying to solve:
Let's suppose that we want to create a function that, when given an array of integers greater
than 0, will return the integer that is the smallest in that array.
In order to best illustrate the way Big-O analysis works, we will come up with two different
solutions to this problem, each with a different Big-O efficiency.
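Here's our first function that will simply return the integer that is the smallest in the array. The
algorithm will just iterate through all of the values in the array and keep track of the smallest
integer in the array in the variable called curMin.
int CompareSmallestNumber(int array[]) {
    int x, curMin;
    /* set the smallest value to the first item in the array */
    curMin = array[0];
    /* iterate through the array, assuming there are only 10 elements */
    for (x = 1; x < 10; x++) {
        if (array[x] < curMin)
            curMin = array[x];
    }
    return curMin;
}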
Let's assume that the array being passed to our function contains 10 elements - this number is
something we arbitrarily chose. We could have said it contains 100, or 100000 elements - either
way it would have made no difference for our purposes here.
As promised, we want to show you another solution to the problem. In this solution, we will use
a different algorithm - we will soon compare the big O Notation of the two different solutions
below. What we do for our second solution to the problem is compare each value in the array to
all of the other numbers in the array, and if that value is less than or equal to all of the other
numbers in the array then we know that it is the smallest number in the array.
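int CompareToAllNumbers(int array[]) {
    int x, y, isMin;
    for (x = 0; x < 10; x++) {      /* assumes 10 elements */
        isMin = 1;
        for (y = 0; y < 10; y++)
            if (array[x] > array[y]) /* something smaller exists */
                isMin = 0;
        if (isMin)
            break;
    }
    return array[x];
}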
Now, you've seen 2 functions that solve the same problem - but each one uses a different
algorithm. We want to be able to say which algorithm is more efficient using mathematical
terms, and Big-O analysis allows us to do exactly that.
For our purposes, we assumed an input size of 10 for the array. But when doing Big O analysis,
we don't want to use specific numbers for the input size - so we say that the input is of size n.
Remember that Big-O analysis is used to measure the efficiency of an algorithm based on the
time it takes for the algorithm to run as a function of the input size.
When doing Big-O analysis, "input" can mean a lot of different things depending on the problem
being solved. In our examples above, the input is the array that is passed into the different
functions. But, input could also be the number of elements of a linked list, the nodes in a tree, or
whatever data structure you are dealing with.
Since input is of size n, and in our example the input is an array - we will say that the array is of
size n. We will use the 'n' to denote input size in our Big-O analysis.
So, the real question is how Big-O analysis measures efficiency. Basically, Big-O will want to
express how many times the 'n' input items are 'touched'. The word 'touched' can mean different
things in different algorithms - in some algorithms it may mean the number of times a constant is
multiplied by an input item, the number of times an input is added to a data structure, etc.
In the function CompareSmallestNumber, the n (we used 10 items, but let's just use the variable
'n' for now) input items are each 'touched' only once when each one is compared to the minimum
value. In Big O notation, this would be written as O(n) - which is also known as linear time.
Linear time means that the time taken to run the algorithm increases in direct proportion to the
number of input items. So, 80 items would take longer to run than 79 items or any quantity less
than 79. Another way to phrase this is to say that the algorithm being used in the
CompareSmallestNumber function has order of n time complexity.
You might also see that in the CompareSmallestNumber function, we initialize the curMin
variable to the first value of the input array. And that does count as 1 'touch' of the input. So, you
might think that our Big O notation should be O(n + 1). But actually, Big O is concerned with the
running time as the number of inputs - which is 'n' in this case - approaches infinity. And as 'n'
approaches infinity the constant '1' becomes very insignificant - so we actually drop the constant.
Thus, we can say that the CompareSmallestNumber function has O(n) and not O(n + 1).
Also, if we have n^3 + n, then as n approaches infinity it's clear that the "+ n" becomes very
insignificant - so we will drop the "+ n", and instead of having O(n^3 + n), we will have O(n^3), or
order of n^3 time complexity.
Now, let's do the Big O analysis of the CompareToAllNumbers function. The worst case of Big
O notation in our example basically means that we want to find the scenario which will take the
longest for the CompareToAllNumbers function to run. When does that scenario occur?
Well, let's think about what the worst case running time for the CompareToAllNumbers function
is and use that as the basis for the Big O notation. So, for this function, let's assume that the
smallest integer is in the very last element of the array - because that is the exact scenario which
will take the longest to run since it will have to get to the very last element to find the smallest
element. Since we are taking each element in the array and comparing it to every other element
in the array, that means we will be doing 100 comparisons - assuming, of course, that our input
size is 10 (10 * 10 = 100). Or, if we use a variable "n" to represent the input size, that will be n^2
'touches' of the input. Thus, this function uses an O(n^2) algorithm.
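The 'touch' counts discussed above can be measured directly. The sketch below re-creates the two approaches in Python and counts every comparison; the function names are Python-style stand-ins for CompareSmallestNumber and CompareToAllNumbers, and the counters are additions made purely for illustration.

```python
def compare_smallest_number(values):
    """Linear scan: track the smallest value seen so far."""
    touches = 0
    cur_min = values[0]
    for value in values[1:]:
        touches += 1  # one comparison per remaining element
        if value < cur_min:
            cur_min = value
    return cur_min, touches


def compare_to_all_numbers(values):
    """Quadratic scan: compare each value to every value in the array."""
    touches = 0
    for candidate in values:
        is_min = True
        for other in values:
            touches += 1  # one comparison per (candidate, other) pair
            if candidate > other:
                is_min = False
        if is_min:
            return candidate, touches
    raise ValueError("values must not be empty")


if __name__ == "__main__":
    worst_case = list(range(10, 0, -1))  # smallest value is in the last slot
    print(compare_smallest_number(worst_case))
    print(compare_to_all_numbers(worst_case))
```

With 10 items and the smallest value in the last position, the linear version makes 9 comparisons while the quadratic version makes 100, which matches the n versus n^2 growth described above.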
In an interview, you may be asked what the Big-O of an algorithm that you've come up with is.
And even if not directly asked, you should provide that information in order to show that you are
well aware of the need to come up with an efficient solution whenever possible.
Big O and Big Omega notations are not the same thing.
What is selectivity in SQL? How is selectivity calculated and how does it relate to a
database index?
The terms selectivity and cardinality are closely related - in fact, the formula used to calculate
selectivity uses the cardinality value. The term selectivity is used when talking about database
indexes. This is the formula to use to calculate the selectivity of an index - don't worry, we do
explain what it all means below:
Selectivity of index = (cardinality / number of rows) * 100%
So, you see the formula and you are thinking that's great, but what does this actually mean? Well,
let's say we have a table with a "Sex" column which has only two possible values of "Male" and
"Female". Then, that "Sex" column would have a cardinality of 2, because there are only two
unique values that could possibly appear in that column - Male and Female. If there are 10,000
rows in the table, then this means that the selectivity of an index on that particular column will
be 2/10,000 * 100%, which is 0.02%.
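The arithmetic above can be written as a tiny helper. This is just a sketch of the formula used in the worked example (cardinality divided by the number of rows, as a percentage); the function name selectivity_percent is made up.

```python
def selectivity_percent(cardinality, row_count):
    """Selectivity = (cardinality / number of rows) * 100%."""
    if row_count <= 0:
        raise ValueError("row_count must be positive")
    return 100 * cardinality / row_count


if __name__ == "__main__":
    # "Sex" column: 2 distinct values in a 10,000-row table.
    print(selectivity_percent(2, 10_000))
    # Primary key column: every one of the 10,000 values is distinct.
    print(selectivity_percent(10_000, 10_000))
```

The first call reproduces the 0.02% from the example, and the second shows the other extreme of 100% for a column whose values are all unique.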
The key with the selectivity value is that it basically measures how "selective" the values within
a given column are - in other words, how many different values are available in the given sample
set. A selectivity of 0.02% is considered to be really low, and means that given the number of
rows, there is a very small amount of variation in the actual values for that column. In our
example Sex column,
Why does the database actually care about the selectivity and how does it use it? Well, let's
consider what a low selectivity means. A low selectivity basically means there is not a lot of
variation in the values in a column - that there are not a lot of possibilities for the values of a
column. Suppose, using the example table that we discussed earlier, that we want to find the
names of all the females in the table.
Database query optimizers have to make a decision about whether it would actually make sense
to either use the index to find certain rows in a table or to not use the index. This is because there
are times when using the index is actually less efficient than just directly scanning the table
itself. This is something that you should remember: even if a column has an index created for it,
that does not mean the index will always be used, because scanning the table directly without
going through the index first could be a better, more efficient, option.
So, when exactly is it better to not use a database index? Well, when there is a low selectivity
value! Why does a low selectivity mean that using the index is not a good idea? Well, think about
it - let's say we want to run a query that will find the names of all the females in the table - we
are of course assuming that there is another column for "Name" in addition to the "Sex" column.
If we are searching for all the female rows in a table with 10,000 rows then there is a good
chance that 50% of the rows are females, because there really are just two possible values - male
and female. Assuming that 50% of the rows are indeed females, then this means that we would
have to access the index 5,000 times to find all the female rows. Accessing the index takes time,
and consumes resources. If we are accessing the index 5,000 times, it is actually faster to just
directly access the table and do a full table scan. So, you can see how the selectivity value is
used by the query optimizer to determine whether it is more efficient to use an index or just
read the table directly.
It's really hard to say exactly what selectivity value counts as too low, since that exact value
varies from one database to another.
Of course, a high selectivity value means that the index should definitely be used. For example,
if we are dealing with a column that has a selectivity of 100%, then all the values in that column
are unique. This means that if a query is searching for just one of those values then it makes
much more sense to use the index, because it will be far more efficient than risking a full table
scan - which is the worst case scenario if the table is searched directly without consulting the
index first.
Suppose we have three tables that are used by a company to store employee information: an
Employee table, an Employee_Salary table, and a Department table. The Department table will
have a one to many relationship with the Employee table, because every employee can belong to
only one department, but a department can consist of many employees. In other words, the
cardinality of the Department table in relationship to the employee table is one to many. The
cardinality of the Employee table in relationship to the Employee_Salary table will be one to
one, since an employee can only have one salary, and vice versa (yes, two employees can have
the same salary, but there will still be exactly one salary entry for each employee regardless of
whether or not someone else has the same salary).
The other definition of cardinality is probably the more commonly used version of the term.
In SQL, the cardinality of a column in a given table refers to the number of unique values that
appear in the table for that column. So, remember that the cardinality is a number. For example,
let's say we have a table with a Sex column which has only two possible values, Male and
Female. Then that Sex column would have a cardinality of 2, because there are only two
unique values that could possibly appear in that column: Male and Female.
Or, as another example, let's say that we have a primary key column on a table with 10,000 rows.
What do you think the cardinality of that column would be? Well, it is 10,000. Because it is a
primary key column, we know that all of the values in the column must be unique. And since
there are 10,000 rows, we know that there are 10,000 entries in the column, which translates to a
cardinality of 10,000 for that column. So, we can come up with the rule that the cardinality of a
primary key column will always be equal to the number of records in the same table.
If a column has a cardinality of zero, it means that the column has no unique values. This
could happen if the column contains nothing but NULLs, which means that the column was never
really used anyway.
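A small sketch of how column cardinality could be measured, including the all-NULL case. The staff table and its columns are hypothetical names chosen for illustration. It relies on the fact that COUNT(DISTINCT column) skips NULLs.

```sql
-- Hypothetical table, for illustration only.
CREATE TABLE staff (
    id       INT PRIMARY KEY,
    sex      VARCHAR(6),
    nickname VARCHAR(20)   -- never filled in, so every row holds NULL
);

INSERT INTO staff (id, sex, nickname) VALUES
    (1, 'Male', NULL), (2, 'Female', NULL), (3, 'Female', NULL);

-- COUNT(DISTINCT ...) ignores NULLs, so an all-NULL column reports 0.
SELECT COUNT(DISTINCT id)       AS id_cardinality,       -- equals the row count
       COUNT(DISTINCT sex)      AS sex_cardinality,
       COUNT(DISTINCT nickname) AS nickname_cardinality
FROM staff;
```

For these three rows the query reports 3 for id, 2 for sex, and 0 for nickname.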
Before anything else, let's state simply why we need database normalization. There are two
primary goals that database normalization looks to achieve:
The concept of normalization was introduced by a researcher at IBM by the name of E.F. Codd
in the 1970s. He was also the inventor of the relational model, which is of course used in
relational databases today.
So, the next logical question is: how do we actually apply normalization to a database? Well,
there is a set of rules called forms that must be followed to normalize a database. You may
have heard of 1st normal form and 2nd normal form; each of these forms defines a different
set of rules that must be followed.
You can read more about first normal form here: Example of first normal form.
Note that the forms must be followed in order. So, if we want to achieve 2nd normal form,
then we must first achieve 1st normal form.
If a database follows all the rules for 1st normal form then the database is said to be 1st normal
form compliant, or 1NF compliant. The same holds true for 2nd normal form.
Here are some key points of normalization that you should remember:
Let's illustrate the need for a self join with an example. Suppose we have the following table,
called employee. The employee table has 2 columns: one for the employee name (called
employee_name), and one for the employee location (called employee_location):
employee
employee_name employee_location
Joe New York
Sunil India
Alex Russia
Albert Canada
Jack New York
Now, suppose we want to find out which employees are from the same location as the employee
named Joe. In this example, that location would be New York. Let's assume for the sake of our
example that we cannot just directly search the table for people who live in New York with a
simple query like this (maybe because we don't want to hardcode the city name in the SQL
query):
SELECT employee_name
FROM employee
WHERE employee_location = 'New York'
So, instead of a query like that what we could do is write a nested SQL query (basically a query
within another query which is more commonly called a subquery) like this:
SELECT employee_name
FROM employee
WHERE employee_location IN
    ( SELECT employee_location
      FROM employee
      WHERE employee_name = 'Joe' )
A subquery is inefficient
Using a subquery for such a simple question can be inefficient. Is there a more efficient and
elegant solution to this problem?
It turns out that there is: we can use something called a self join. A self
join is basically when a table is joined to itself. The way you should visualize a self join for a
given table is by imagining that a join is performed between two identical copies of that table.
And that is exactly why it is called a self join: it's just the same table
being joined to another copy of itself rather than being joined with a different table.
Before we come up with a solution for this problem using a self join, we should go over some
concepts so that you can fully understand how a self join works. This will also make the SQL in
our self join tutorial a lot easier to understand, which you will see further below.
In a self join we are joining the same table to itself by essentially creating two copies of that
table. But how do we distinguish between the two copies, given that there is
only one table name? Well, when we do a self join, the table names absolutely must use
aliases, otherwise the column names would be ambiguous. In other words, without an alias for
each copy of the table we would not know which copy's columns are being referenced. If you
don't already know what an alias is, it's simply another name given to a
table (think of an alias as a nickname), and that nickname is then used in the SQL query to
reference the table. Because we need two copies of the employee table, we will just use the
aliases e1 and e2 for the employee table when we do a self join.
As with any join, there must be a condition upon which a self join is performed; we cannot just
arbitrarily say "do a self join" without specifying some condition. That condition will be our
join predicate. If you need a refresher on join predicates (or just joins in general), see the
discussion of inner vs. outer joins.
Now, let's come up with a solution to the original problem using a self join instead of a subquery.
This will help illustrate how exactly a self join works. The key question that we must ask
ourselves is: what should our join predicate be in this example? Well, we want to find all the
employees who have the same location as Joe.
Because we want to match between our two tables (both of which are the same employee table,
aliased as e1 and e2) on location, our join predicate should clearly be WHERE
e1.employee_location = e2.employee_location. But is that enough to give us what we want?
No, it's not, because we also want to filter the rows returned, since we only want people who are
from the same location as Joe.
So, how can we filter the rows returned so that only people from Joe's location are returned?
Well, what we can do is simply add a condition on one of the tables (e2 in our example) so that it
only returns the row where the name is Joe. Then the other table (e1) will match up all the
names that have the same location as that row in e2, because of our join predicate, which is WHERE
e1.employee_location = e2.employee_location. We will then select the names from e1, and
not e2, because e2 will only have Joe's name. If that's confusing, then keep reading to
understand more about how the query will work.
So, the self join query that we come up with looks like this:
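A minimal sketch of such a query, following the description above: the join predicate goes in the WHERE clause, and e2 is restricted to Joe's row. The CREATE TABLE and INSERT statements are only there to make the example runnable in a scratch database. The column sizes are arbitrary, and the rows for Joe and Jack are taken from the surrounding text.

```sql
CREATE TABLE employee (
    employee_name     VARCHAR(50),
    employee_location VARCHAR(50)
);

INSERT INTO employee (employee_name, employee_location) VALUES
    ('Joe',    'New York'),
    ('Sunil',  'India'),
    ('Alex',   'Russia'),
    ('Albert', 'Canada'),
    ('Jack',   'New York');

-- e2 is pinned to Joe's row; e1 supplies everyone who shares that location.
SELECT e1.employee_name
FROM employee e1, employee e2
WHERE e1.employee_location = e2.employee_location
  AND e2.employee_name = 'Joe';
```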
This query will return the names Joe and Jack since Jack is the only other person who lives in
New York like Joe.
Generally, queries that refer to the same table more than once can often be simplified by
rewriting them as self joins, and in many databases there is a performance benefit as well.
It helps tremendously to visualize the results of a self join internally.
Remember that a self join is just like any other join, where the two tables are merged into one
temporary table. First off, you should visualize that we have two separate copies of the employee
table, which are given aliases of e1 and e2. These copies would simply look like this (note that
we shortened the column names from employee_name and employee_location to just Name and
Location for convenience):
e1                     e2
Name    Location       Name    Location
Joe     New York       Joe     New York
Sunil   India          Sunil   India
Alex    Russia         Alex    Russia
Albert  Canada         Albert  Canada
Jack    New York       Jack    New York
Are self joins and inner joins the same?
You might be wondering if all self joins are also inner
joins. After all, in our example above our self join uses an inner join, because only the rows that
match based on the join predicate are returned; non-matching rows are not returned. Well, it
turns out that a self join and an inner join are completely different concepts. A self join could
just as well be an outer join or an inner join; it just depends on how the query is written. We could
easily change the query we used above to do a LEFT OUTER JOIN while the query still
remains a self join, but that wouldn't give us the results we want in our example. So, we use an
implied inner join instead because that gives us the correct results. Remember that a query is a
self join as long as the two tables being joined are exactly the same table, but whether it's an
inner join or outer join depends on what is specified in the SQL. Inner and outer joins are
separate concepts entirely from a self join.
The most commonly used example for self joins is the classic employee manager table. The table
is called Employee, but holds all employees, including their managers. Every employee has an
ID, and there is also a column for the manager ID. So, for example, let's say we have a table that
looks like this, and we call it Employee:
EmployeeID Name ManagerID
1 Sam 10
2 Harry 4
4 Manager NULL
10 AnotherManager NULL
Notice that in the table above there are two managers, conveniently named Manager and
AnotherManager. Those managers don't have managers of their own, as noted by the
NULL value in their ManagerID column.
Now, given the table above, how can we return results that show each employee's name and
his or her manager's name, nicely arranged with the employee in one column and the
manager's name in the other column? Well, it turns out we can use a self join to do this. Try to
come up with the SQL on your own before reading our answer.
In order to come up with a correct answer for this problem, our goal should be to perform a self
join that will have both the employee information and manager information in one row. First off,
since we are doing a self join, it helps to visualize the one table as two tables; let's give them
aliases of e1 and e2. Now, with that in mind, we want the employee's information on one side of
the joined table and the manager's information on the other side of the joined table. So, let's just
say that we want e1 to hold the employee information and e2 to hold the corresponding
manager's information. What should our join predicate be in that case?
Well, the join predicate should look like ON e1.ManagerID = e2.EmployeeID. This basically
says that we should join the two tables (a self join) based on the condition that the manager ID in
e1 is equal to the employee ID in e2. In other words, each employee's row in e1 should be paired
with the row in e2 that holds the manager's information. An illustration will help clarify this.
Suppose we use that predicate and just select everything after we join the tables. So, our SQL
would look like this:
SELECT *
FROM Employee e1
INNER JOIN Employee e2
ON e1.ManagerID = e2.EmployeeID
The results of running the query above would look like this:
EmployeeID Name ManagerID EmployeeID Name ManagerID
1 Sam 10 10 AnotherManager NULL
2 Harry 4 4 Manager NULL
Note that there are only 2 rows returned. This is because an inner join is performed, which
means that a row is returned only when a manager ID in e1 matches an employee ID in e2.
Since there are 2 people without managers (who have a manager ID of
NULL), they will not be returned as part of table e1, because no employee has a matching ID
of NULL.
Now, remember that we only want to return the names of the employee and corresponding
manager as a pair. So, we can fine-tune the SQL as follows:
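One way the fine-tuned query might look is sketched here. EmployeeID and ManagerID come from the join predicate above. The column name Name and the output aliases EmployeeName and ManagerName are assumptions made for illustration.

```sql
-- Assumes the Employee table has a Name column (hypothetical column name).
SELECT e1.Name AS EmployeeName,
       e2.Name AS ManagerName
FROM Employee e1
INNER JOIN Employee e2
    ON e1.ManagerID = e2.EmployeeID;
```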
Sam AnotherManager
Harry Manager
And that is the answer to the employee manager problem using a self join!
Joins are used to combine the data from two tables, with the result being a new, temporary table.
The temporary table is created based on column(s) that the two tables share, which represent
meaningful column(s) of comparison. The goal is to extract meaningful data from the resulting
temporary table. Joins are performed based on something called a predicate, which specifies the
condition to use in order to perform a join. A join can be either an inner join or an outer join,
depending on how one wants the resulting table to look.
It is best to illustrate the differences between inner and outer joins by use of an example. Here we
have 2 tables that we will use for our example:
Employee            Location
EmpID EmpName       EmpID EmpLoc
13    Jason         13    San Jose
8     Alex          8     Los Angeles
3     Ram           3     Pune, India
17    Babu          17    Chennai, India
25    Johnson       39    Bangalore, India
It's important to note that the very last row in the Employee table does not exist in the Employee
Location table. Also, the very last row in the Employee Location table does not exist in the
Employee table. These facts will prove to be significant in the discussion that follows.
Outer Joins
Let's start the explanation with outer joins. Outer joins can be further divided into left outer
joins, right outer joins, and full outer joins. Here is what the SQL for a left outer join would look
like, using the tables above:
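A minimal sketch of such a left outer join, assuming the two tables are named Employee and Location with the EmpID, EmpName and EmpLoc columns from the example. The setup statements and column types are only there so the query can be tried in an empty scratch database.

```sql
CREATE TABLE Employee (EmpID INT, EmpName VARCHAR(50));
CREATE TABLE Location (EmpID INT, EmpLoc  VARCHAR(50));

INSERT INTO Employee (EmpID, EmpName) VALUES
    (13, 'Jason'), (8, 'Alex'), (3, 'Ram'), (17, 'Babu'), (25, 'Johnson');
INSERT INTO Location (EmpID, EmpLoc) VALUES
    (13, 'San Jose'), (8, 'Los Angeles'), (3, 'Pune, India'),
    (17, 'Chennai, India'), (39, 'Bangalore, India');

-- Every Employee row is kept; Location columns are NULL where no EmpID matches.
-- LEFT JOIN is accepted as shorthand for LEFT OUTER JOIN.
SELECT *
FROM Employee
LEFT OUTER JOIN Location
    ON Employee.EmpID = Location.EmpID;
```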
In this SQL we are joining on the condition that the employee IDs match in the rows of the two
tables. So, we will essentially be combining 2 tables into 1, based on the condition that the
employee IDs match. Note that we can get rid of the "outer" in left outer join and write just
"left join". This is equivalent to what we have above.
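A left outer join retains all of the rows of the left table, regardless of whether there is a row that
matches on the right table. The left outer join will give us the result set shown below.
EmpID EmpName EmpID EmpLoc
13 Jason 13 San Jose
8 Alex 8 Los Angeles
3 Ram 3 Pune, India
17 Babu 17 Chennai, India
25 Johnson NULL NULL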
A right outer join is pretty much the same thing as a left outer join, except that the rows that are
retained are from the right table. This is what the SQL looks like:
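A sketch of the right outer join, assuming the same Employee and Location tables from the example, joined on their EmpID columns:

```sql
-- Every Location row is kept; Employee columns are NULL where no EmpID matches.
SELECT *
FROM Employee
RIGHT OUTER JOIN Location
    ON Employee.EmpID = Location.EmpID;
```

Note that some databases do not accept RIGHT OUTER JOIN (older SQLite versions, for example). Swapping the two table names and using a LEFT OUTER JOIN retains the same rows.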
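Using the tables presented above, we can show what the result set of a right outer join would
look like:
EmpID EmpName EmpID EmpLoc
13 Jason 13 San Jose
8 Alex 8 Los Angeles
3 Ram 3 Pune, India
17 Babu 17 Chennai, India
NULL NULL 39 Bangalore, India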
We can see that the last row returned in the result set contains the row that was in the Location
table, but not in the Employee table (the "Bangalore, India" entry). Because there is no matching
row in the Employee table that has an employee ID of "39", we have NULLs in the result set for
the Employee columns.
Inner Joins
Now that we've gone over outer joins, we can contrast those with the inner join. The difference
between an inner join and an outer join is that an inner join will return only the rows that actually
match based on the join predicate. Once again, this is best illustrated via an example. Here's
what the SQL for an inner join will look like:
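A sketch of the inner join, again assuming the Employee and Location tables from the example with a shared EmpID column:

```sql
-- Only rows whose EmpID appears in both tables are returned.
SELECT *
FROM Employee
INNER JOIN Location
    ON Employee.EmpID = Location.EmpID;
```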
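Now, here is what the result of running that SQL would look like:
EmpID EmpName EmpID EmpLoc
13 Jason 13 San Jose
8 Alex 8 Los Angeles
3 Ram 3 Pune, India
17 Babu 17 Chennai, India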
We can see that an inner join will only return rows in which there is a match based on the join
predicate. In this case, what that means is anytime the Employee and Location table share an
Employee ID, a row will be generated in the results to show the match. Looking at the original
tables, one can see that those Employee IDs that are shared by those tables are displayed in the
results. But, with a left or right outer join, the result set will retain all of the rows from either the
left or right table.
Suppose we have the employee table below, and we want to retrieve all of the locations
that the employees live in, but we don't want any duplicates. How can we do this in
SQL?
employee
employee_name employee_location
Joe New York
Sunil India
Alex Russia
Albert Canada
Alex Russia
In SQL, the DISTINCT keyword will allow us to do that. Here's what the simple SQL would look
like:
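A minimal sketch, assuming the employee table and employee_location column shown above:

```sql
-- Each location appears once, no matter how many employees share it.
SELECT DISTINCT employee_location
FROM employee;
```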
The results would look like this:
employee_location
New York
India
Russia
Canada
So, you can see that the duplicate value "Russia" is returned only once in the
results.
It's worth noting that the DISTINCT keyword can be used with more than one column. That
means that only the unique combinations of columns will be returned. Again, this is best
illustrated by an example.
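A sketch of DISTINCT applied to two columns, assuming the same employee table:

```sql
-- A row is dropped only if both the name and the location repeat.
SELECT DISTINCT employee_name, employee_location
FROM employee;
```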
employee_name employee_location
Joe New York
Sunil India
Alex Russia
Albert Canada
Note that the one extra entry for "Alex, Russia" is missing from the result set above. This is
because when we select a distinct combination of name and location, if there are 2 entries with
the exact same name and location then a DISTINCT query on both columns will only return one of
those entries.
In the table below, how would you retrieve the unique values for
employee_location without using the DISTINCT keyword?
employee
employee_name employee_location
Joe New York
Sunil India
Alex Russia
Albert Canada
Alex Russia
We can actually accomplish this with the GROUP BY clause. Here's what the SQL would look
like:
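A minimal sketch, assuming the employee table shown above:

```sql
-- Grouping by the column collapses repeated locations into one row each.
SELECT employee_location
FROM employee
GROUP BY employee_location;
```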
The results would look like this:
employee_location
New York
India
Russia
Canada
So, you can see that the duplicate value "Russia" is returned only once in the
results.
This is a valid alternative to using the DISTINCT keyword. If you need a refresher on the
GROUP BY clause, see the discussion of GROUP BY and HAVING. This question would
probably be asked just to see how good you are at coming up with alternative options for SQL
queries, although it probably doesn't prove much about your SQL skills.
There's no better way to improve your SQL skills than to practice with some real SQL interview
questions, and these SQL practice problems are a great way to do that. We
recommend first creating the simple tables presented below in the RDBMS software of
your choice (MySQL, Oracle, DB2, SQL Server, etc.), and then trying to figure out the
answer on your own if possible.
The following SQL practice exercises were actually taken from real interview tests with Google
and Amazon. Once again, we highly recommend that you try finding the answers to these SQL
practice exercises on your own before reading the given solutions. The practice problems are
based on the tables presented below.
Salesperson
ID Name Age Salary
5 Chris 34 40000
7 Dan 41 52000
8 Ken 57 115000
11 Joe 38 38000
Customer
ID Name City Industry Type
4 Samsonic pleasant J
Orders
10 8/2/96 4 2 540
20 1/30/99 4 8 1800
30 7/14/95 9 1 460
40 1/29/98 7 2 2400
50 2/3/98 6 7 600
60 3/2/98 6 7 720
70 5/6/98 9 7 150
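a. The names of all salespeople that have an order with Samsonic.
b. The names of all salespeople that do not have any order with Samsonic.
c. The names of salespeople that have 2 or more orders.
d. Write a SQL statement to insert rows into a table called highAchiever(Name, Age), where a
salesperson must have a salary of 100,000 or greater to be included in the table.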
Let's start by answering part a. It's obvious that we would need to do a SQL join, because the
data in one table will not be enough to answer this question. This is a good question to get some
practice with SQL joins, so see if you can come up with the solution.
Now, what tables should we use for the join? We know that the customer ID of Samsonic is 4, so
we can use that information and do a simple join with the Salesperson and Orders tables. The
SQL would look like this:
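One possible form of that join is sketched below. Salesperson.id, Salesperson.name and Orders.salesperson_id match the names used in the part C query later on. The column name cust_id for the customer ID in Orders is an assumption, since the Orders column headers are not shown.

```sql
-- cust_id is a hypothetical name for the Orders column holding the customer ID.
-- DISTINCT keeps a name from repeating if that salesperson has several such orders.
SELECT DISTINCT Salesperson.name
FROM Salesperson
INNER JOIN Orders
    ON Salesperson.id = Orders.salesperson_id
WHERE Orders.cust_id = 4;
```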
We can also use subqueries (a query within a query) to come up with another possible answer.
Here is an alternative, but potentially less efficient, solution using a subquery:
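A sketch of what a subquery version might look like. As before, cust_id is an assumed name for the customer ID column in Orders. Customer.id and Customer.name follow the ID and Name headers of the Customer table.

```sql
-- Innermost query finds Samsonic's ID, the middle one finds the matching
-- salesperson IDs, and the outer query turns those IDs into names.
SELECT name
FROM Salesperson
WHERE id IN (
    SELECT salesperson_id
    FROM Orders
    WHERE cust_id IN (
        SELECT id
        FROM Customer
        WHERE name = 'Samsonic'
    )
);
```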
Let's now work on answering parts B and C of the original question. We present the tables below
again for your convenience.
Here is part B: Find the names of all salespeople that do not have any orders with
Samsonic.
This is part C: Find the names of salespeople that have 2 or more orders.
Salesperson
ID Name Age Salary
5 Chris 34 40000
7 Dan 41 52000
8 Ken 57 115000
11 Joe 38 38000
Customer
ID Name City Industry Type
4 Samsonic pleasant J
Orders
10 8/2/96 4 2 540
20 1/30/99 4 8 1800
30 7/14/95 9 1 460
40 1/29/98 7 2 2400
50 2/3/98 6 7 600
60 3/2/98 6 7 720
70 5/6/98 9 7 150
Part B of the question asks for the names of the salespeople who do not have an order with
Samsonic. A good way to approach this problem is to break it down: if we can first find
all the salespeople who do have an order with Samsonic, then we can work with that
list to get all the salespeople who do not have an order with Samsonic.
So, let's start by just getting a list of all the salesperson IDs that have an order with Samsonic.
We can get this list by doing a join with a condition that the customer is Samsonic. We can use
both the Customer and Orders tables. The SQL for this will look like:
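A sketch of that first step, assuming the customer ID column in Orders is named cust_id (a hypothetical name) and that Customer has id and name columns:

```sql
-- IDs of salespeople with at least one order placed by Samsonic.
SELECT Orders.salesperson_id
FROM Orders
INNER JOIN Customer
    ON Orders.cust_id = Customer.id
WHERE Customer.name = 'Samsonic';
```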
This will give us a list of all the salespeople IDs that have an order with Samsonic. Now, we can
get a list of the names of all the salespeople who do NOT have an order with Samsonic. SQL has
a NOT operator that easily allows us to exclude elements of the result set. We can use this to
our advantage. Here is one possible answer to question B, and this is what the final SQL will
look like:
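One possible answer along those lines is sketched here, wrapping the list of IDs in a NOT IN condition. The column name cust_id is again an assumption.

```sql
-- Keep only salespeople whose ID is absent from the Samsonic list.
SELECT Salesperson.name
FROM Salesperson
WHERE Salesperson.id NOT IN (
    SELECT Orders.salesperson_id
    FROM Orders
    INNER JOIN Customer
        ON Orders.cust_id = Customer.id
    WHERE Customer.name = 'Samsonic'
);
```

One thing to watch with NOT IN: if the inner query can return a NULL salesperson_id, the outer query returns no rows at all, so it is worth filtering NULLs out of the inner query when that column is nullable.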
Now, let's work on answering part C. As always, it's best to break the problem down into more
manageable pieces. So, let's focus on one table: the Orders table. Looking at that table we can
find the IDs that belong to the salespeople who have 2 or more orders. This will require use of
the "group by" syntax in SQL, which allows us to group by whatever column we choose. In this
case, the column that we would be grouping by is the salesperson_id column, because for a given
salesperson ID we would like to find out how many orders were placed under that ID. With that
said, we can write this SQL:
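A minimal sketch of that grouping step, using the salesperson_id column of the Orders table:

```sql
-- One group per salesperson; keep only the groups with more than one order.
SELECT salesperson_id
FROM Orders
GROUP BY salesperson_id
HAVING COUNT(salesperson_id) > 1;
```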
Note how we used the HAVING clause instead of the WHERE clause, because we are filtering on the
COUNT aggregate. Now we have a SQL statement that gives us the IDs of the salespeople who
have more than 1 order. But what we really want is the names of the salespeople who have those
IDs. This is actually quite simple if we do a join on the Salesperson and Orders tables, and use
the SQL that we came up with earlier. It would look like this:
SELECT Salesperson.name
FROM Orders
INNER JOIN Salesperson
    ON Orders.salesperson_id = Salesperson.id
GROUP BY Salesperson.name, Orders.salesperson_id
HAVING COUNT(Orders.salesperson_id) > 1
Based on our tables, this SQL will return the names of Bob and Dan.
We've finally come to the last part of this question. Question D is presented below again for your
convenience.
Part D: Write a SQL statement to insert rows into a table called highAchiever(Name, Age),
where a salesperson must have a salary of 100,000 or greater to be included in the table.
Looking at part D, it's easy to come up with the SQL to specify the condition that the salary of
the salesperson must be greater than or equal to 100,000. It would look like this: "WHERE SALARY
>= 100000". The only slightly difficult part of this question is how we insert values into the
highAchiever table while selecting values from the Salesperson table. It turns out that the SQL for
this is:
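A sketch of an INSERT that takes its rows from a SELECT, assuming the Salesperson columns are named name, age and salary after the table headers:

```sql
-- The SELECT supplies the rows, so no VALUES clause is needed.
INSERT INTO highAchiever (Name, Age)
SELECT name, age
FROM Salesperson
WHERE salary >= 100000;
```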
Because we are inserting values into the highAchiever table based on what we select from
another table, we don't use the "VALUES" clause that we would normally use when inserting. This
is what a regular insertion would look like (note the use of the "VALUES" clause):
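For comparison, a regular single-row insertion might look like this. The name and age used here are made-up sample values, not rows from the tables above.

```sql
-- Literal values are listed directly instead of being selected from a table.
INSERT INTO highAchiever (Name, Age)
VALUES ('Jackson', 28);
```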
As you can see, the answer to this one is pretty simple.
This question was asked in a Google interview: Given the 2 tables below, User and
UserHistory:
User
user_id
name
phone_num
UserHistory
user_id
date
action
1. Write a SQL query that returns the name, phone number and most recent date for any user
that has logged in over the last 30 days (you can tell a user has logged in if the action field in
UserHistory is set to "logged_on").
Every time a user logs in a new row is inserted into the UserHistory table with user_id, current
date and action (where action = "logged_on").
2. Write a SQL query to determine which user_ids in the User table are not contained in the
UserHistory table (assume the UserHistory table has a subset of the user_ids in User table).
Do not use the SQL MINUS statement. Note: the UserHistory table can have multiple entries
for each user_id.
Note that your SQL should be compatible with MySQL 5.0, and avoid using subqueries.
Let's start with #1 by breaking down the problem into smaller, more manageable problems. Then
we can take the pieces and combine them to provide a solution to the overall problem.
Figuring out how to tell whether a user has logged on in the past 30 days seems like a good place
to start. We want to see how we can express this in MySQL. You can look online for some MySQL
functions that will help with this calculation. MySQL has a "date_sub" function, to which we can
pass the current date (as in today's date) and an interval of 30 days, and it will return the date
30 days before today. Once we have that date, we can compare it with the date in the
UserHistory table to see if it falls within the last 30 days. One question that remains is how we
will retrieve the current date. This is simple, because MySQL comes with a built-in function
called curdate() that will return the current date.
So, using the date_sub function, we can come up with this piece of SQL:
UserHistory.date >= date_sub(curdate(), interval 30 day)
This will check to see that the date in the UserHistory table falls within the last 30 days. Note
that we use the ">=" operator to compare dates; in this case, we are simply saying that the date
in the UserHistory table is greater than or equal to the date returned from the date_sub function.
A date is "greater" than another date when it occurs later than the other date. So,
2007-09-07 will be considered "greater" than 2006-08-19, because 2007-09-07 occurs later than
2006-08-19.
Now, that's only one piece of the overall problem, so let's continue. The problem asks us to
retrieve the name, phone number, and the most recent date for any user that's logged in over
the last 30 days. We have one table with the name and the phone number, but only the other
table contains the actual date. Clearly, we will have to do a join on the 2 tables in order to
combine the data into a form that will allow us to solve this problem. And since the 2 tables only
share one column (the user_id column), it's clear what common column we will use to join the
2 tables. Doing a join, selecting the required fields, and using the date condition will look like
this:
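A sketch of that intermediate query in MySQL syntax, using the User and UserHistory columns listed in the question. The check on the action column is included here on the assumption that UserHistory may also hold other kinds of actions.

```sql
-- One row per qualifying login, so a user can still appear several times.
SELECT User.name, User.phone_num, UserHistory.date
FROM User
INNER JOIN UserHistory
    ON User.user_id = UserHistory.user_id
WHERE UserHistory.action = 'logged_on'
  AND UserHistory.date >= date_sub(curdate(), interval 30 day);
```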
So far, we are selecting the name, phone number, and the date for any user that's logged in over
the last 30 days. But wait a minute: the problem specifically asks for "the most recent date for
any user that's logged in over the last 30 days." The problem with our query is that we could get
multiple entries for a user that logged on more than once in the last 30 days. That is not what we
want. We want to see only the most recent date that someone logged on in the last 30 days, which
will return a maximum of 1 entry per user.
Now, the question is how do we get the most recent date? This is quite simple again, as MySQL
provides a MAX aggregate function that we can use to find the most recent date. Given a group
of dates, the MAX function will return the "maximum" date which is basically just the most
recent date (the one furthest in the future). Because this is an aggregate function, we will have to
provide the GROUP BY clause in order to specify what column we would like to use as a
container of the group of dates. So, now our SQL looks like this:
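A sketch of how the query might look once MAX and GROUP BY are added, in MySQL syntax. The alias most_recent_login is a name chosen for illustration. Grouping by all three User columns is a choice made here so the query also runs when MySQL's ONLY_FULL_GROUP_BY mode is enabled.

```sql
-- One row per user, carrying only the latest qualifying login date.
SELECT User.name,
       User.phone_num,
       MAX(UserHistory.date) AS most_recent_login
FROM User
INNER JOIN UserHistory
    ON User.user_id = UserHistory.user_id
WHERE UserHistory.action = 'logged_on'
  AND UserHistory.date >= date_sub(curdate(), interval 30 day)
GROUP BY User.user_id, User.name, User.phone_num;
```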
Phew! We are finally done with question 1.
User
user_id
name
phone_num
UserHistory
user_id
date
action
Let's continue with the 2nd question, presented again below:
2. Given the tables above, write a SQL query to determine which user_ids in the User table are
not contained in the UserHistory table (assume the UserHistory table has a subset of the
user_ids in User table). Do not use the SQL MINUS statement. Note: the UserHistory table
can have multiple entries for each user_id.
Note that your SQL should be compatible with MySQL 5.0, and avoid using subqueries.
Basically, we want the user_ids that exist in the User table but not in the UserHistory table. If
we do a regular inner join on the user_id column, then that would just do a join on all the
rows in which the User and UserHistory tables share the same user_id values. But the
question specifically asks for just the user_ids that are in the User table, but are not in the
UserHistory table. So, using an inner join will not work.
What if, instead of an inner join, we use a left outer join on the user_id column? This will
allow us to retain all the user_id values from the User table (which will be our "left" table)
even when there is no matching user_id entry in the "right" table (in this case, the
UserHistory table). When there is no matching record in the "right" table, the entry will just
show up as NULL. This means that any NULL entries are user_id values that exist in the User
table but not in the UserHistory table. This is exactly what we need to answer the question. So,
here's what the SQL will look like:
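A sketch of that left outer join, using u and uh as aliases for the User and UserHistory tables:

```sql
-- Rows where uh.user_id is NULL are users with no UserHistory entry at all.
SELECT DISTINCT u.user_id
FROM User AS u
LEFT JOIN UserHistory AS uh
    ON u.user_id = uh.user_id
WHERE uh.user_id IS NULL;
```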
You may be confused by the "User as u" and the "UserHistory as uh" syntax. Those are
what are called aliases. Aliases allow us to assign a shorter name to a table, which makes for
cleaner and more compact SQL. In this solution, "u" is another name for
the "User" table and "uh" is another name for the "UserHistory" table.
We also use the DISTINCT keyword. This will ensure that each user_id is returned only once.
That concludes our series of practice SQL interview questions.
In SQL, what's the difference between the having clause and the group by statement?
In SQL, the having clause and the group by statement work together when using aggregate
functions like SUM, AVG, MAX, etc. This is best illustrated by an example. Suppose we have a
table called emp_bonus as shown below. Note that the table has multiple entries for employees
A and B which means that both employees A and B have received multiple bonuses.
emp_bonus
Employee Bonus
A 1000
B 2000
A 500
C 700
B 1250
If we want to calculate the total bonus amount that each employee has received, then we
would write a SQL statement like this:
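A minimal sketch of that statement, with setup statements added so it can be run as is. The column types are assumptions. The rows are the ones shown in the emp_bonus table above.

```sql
CREATE TABLE emp_bonus (
    employee VARCHAR(10),
    bonus    INT
);

INSERT INTO emp_bonus (employee, bonus) VALUES
    ('A', 1000), ('B', 2000), ('A', 500), ('C', 700), ('B', 1250);

-- One output row per employee, with that employee's bonuses added together.
SELECT employee, SUM(bonus)
FROM emp_bonus
GROUP BY employee;
```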
In the SQL statement above, you can see that we use the "group by" clause with the employee
column. The group by clause allows us to find the sum of the bonuses for each employee
because each employee is treated as his or her very own group. Using the group by in
combination with the sum(bonus) statement will give us the sum of all the bonuses for
employees A, B, and C.
Employee Sum(Bonus)
A 1500
B 3250
C 700
Now, suppose we wanted to find the employees who received more than $1,000 in bonuses for
the year of 2012 (this is assuming, of course, that the emp_bonus table contains bonuses only
for the year of 2012). This is when we need to use the HAVING clause to add the additional
check to see if the sum of bonuses is greater than $1,000, and this is what the SQL looks like:
GOOD SQL:
select employee, sum(bonus) from emp_bonus
group by employee having sum(bonus) > 1000;
Employee Sum(Bonus)
A 1500
B 3250
Difference between having clause and group by statement
So, from the example above, we can see that the group by clause is used to group column(s) so
that aggregates (like SUM, MAX, etc.) can be used to find the necessary information. The
having clause is used with the group by clause when comparisons need to be made against those
aggregate functions, such as checking whether the SUM is greater than 1,000 in our example above.
So, the having clause and group by statement are not really alternatives to each other;
they are used alongside one another!