
In SQL, what's the difference between an inner and outer join?

Joins are used to combine the data from two tables, with the result being a new, temporary table.
The temporary table is created based on column(s) that the two tables share, which represent
meaningful column(s) of comparison. The goal is to extract meaningful data from the resulting
temporary table. Joins are performed based on something called a predicate, which specifies the
condition to use in order to perform a join. A join can be either an inner join or an outer join,
depending on how one wants the resulting table to look.

It is best to illustrate the differences between inner and outer joins by use of an example. Here we
have 2 tables that we will use for our example:

Employee
EmpID  EmpName
13     Jason
8      Alex
3      Ram
17     Babu
25     Johnson

Location
EmpID  EmpLoc
13     San Jose
8      Los Angeles
3      Pune, India
17     Chennai, India
39     Bangalore, India

It's important to note that the very last row in the Employee table has no matching row in the
Location table. Also, the very last row in the Location table has no matching row in the
Employee table. These facts will prove to be significant in the discussion that follows.

Outer Joins
Let's start the explanation with outer joins. Outer joins can be further divided into left outer
joins, right outer joins, and full outer joins. Here is what the SQL for a left outer join would look
like, using the tables above:

select * from employee left outer join location
on employee.empID = location.empID;

In this SQL we are joining on the condition that the employee IDs match in the rows of the two
tables. So, we are essentially combining 2 tables into 1, based on the condition that the employee
IDs match. Note that we can drop the "outer" in left outer join, which gives us the SQL
below. This is equivalent to what we have above.

select * from employee left join location
on employee.empID = location.empID;

A left outer join retains all of the rows of the left table, regardless of whether there is a row that
matches on the right table. The SQL above will give us the result set shown below.

Employee.EmpID  Employee.EmpName  Location.EmpID  Location.EmpLoc
13              Jason             13              San Jose
8               Alex              8               Los Angeles
3               Ram               3               Pune, India
17              Babu              17              Chennai, India
25              Johnson           NULL            NULL

The Join Predicate: a geeky term you should know

Earlier we mentioned something called a join predicate. In the SQL above, the join predicate
is "on employee.empID = location.empID". This is the heart of any type of join, because it
determines which common column between the 2 tables will be used to "join" the 2 tables. As you
can see from the result set, all of the rows from the left table are returned when we do a left outer
join. The last row of the Employee table (which contains the "Johnson" entry) is displayed in the
results even though there is no matching row in the Location table. The columns that would have
come from the Location table are filled with NULL in that row. So, we have NULL as the entry
wherever there is no match.
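As a quick check, the left outer join above can be reproduced with a small script. This is only a sketch: it assumes Python's built-in sqlite3 module and an in-memory database, neither of which the article itself uses, and the column types are our own guess.

```python
import sqlite3

# In-memory database holding the article's two example tables
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table employee (empID integer, empName text);
    create table location (empID integer, empLoc text);
    insert into employee values
        (13, 'Jason'), (8, 'Alex'), (3, 'Ram'), (17, 'Babu'), (25, 'Johnson');
    insert into location values
        (13, 'San Jose'), (8, 'Los Angeles'), (3, 'Pune, India'),
        (17, 'Chennai, India'), (39, 'Bangalore, India');
""")

left_rows = conn.execute("""
    select employee.empID, employee.empName, location.empID, location.empLoc
    from employee left join location
    on employee.empID = location.empID
    order by employee.rowid
""").fetchall()

for row in left_rows:
    print(row)  # a SQL NULL is returned to Python as None
```

Every employee appears once, and the row for Johnson carries None in both Location columns.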

A right outer join is pretty much the same thing as a left outer join, except that the rows that are
retained are from the right table. This is what the SQL looks like:

select * from employee right outer join location
on employee.empID = location.empID;

-- taking out the "outer", this also works:

select * from employee right join location
on employee.empID = location.empID;

Using the tables presented above, we can show what the result set of a right outer join would
look like:

Employee.EmpID  Employee.EmpName  Location.EmpID  Location.EmpLoc
13              Jason             13              San Jose
8               Alex              8               Los Angeles
3               Ram               3               Pune, India
17              Babu              17              Chennai, India
NULL            NULL              39              Bangalore, India

We can see that the last row returned in the result set contains the row that was in the Location
table, but not in the Employee table (the "Bangalore, India" entry). Because there is no matching
row in the Employee table that has an employee ID of "39", we have NULLs in the result set for
the Employee columns.
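A right outer join is simply a left outer join with the two tables swapped, which is handy on engines that lack the right join keyword. The sketch below assumes Python's sqlite3 module (older SQLite versions do not accept right join), so it writes the query as a left join with location on the left and keeps the column order of the result set above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table employee (empID integer, empName text);
    create table location (empID integer, empLoc text);
    insert into employee values
        (13, 'Jason'), (8, 'Alex'), (3, 'Ram'), (17, 'Babu'), (25, 'Johnson');
    insert into location values
        (13, 'San Jose'), (8, 'Los Angeles'), (3, 'Pune, India'),
        (17, 'Chennai, India'), (39, 'Bangalore, India');
""")

# location is now the left table, so all of ITS rows are retained
right_rows = conn.execute("""
    select employee.empID, employee.empName, location.empID, location.empLoc
    from location left join employee
    on employee.empID = location.empID
    order by location.rowid
""").fetchall()

for row in right_rows:
    print(row)
```

The Bangalore row is kept with None in the Employee columns, and Johnson no longer appears.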

Inner Joins
Now that we've gone over outer joins, we can contrast those with the inner join. The difference
between an inner join and an outer join is that an inner join will return only the rows that actually
match based on the join predicate. Once again, this is best illustrated via an example. Here's
what the SQL for an inner join will look like:

select * from employee inner join location
on employee.empID = location.empID

This can also be written as:

select * from employee, location
where employee.empID = location.empID

Now, here is what the result of running that SQL would look like:

Employee.EmpID  Employee.EmpName  Location.EmpID  Location.EmpLoc
13              Jason             13              San Jose
8               Alex              8               Los Angeles
3               Ram               3               Pune, India
17              Babu              17              Chennai, India

Inner vs Outer Joins


We can see that an inner join will only return rows in which there is a match based on the join
predicate. In this case, what that means is anytime the Employee and Location table share an
Employee ID, a row will be generated in the results to show the match. Looking at the original
tables, one can see that those Employee IDs that are shared by those tables are displayed in the
results. But, with a left or right outer join, the result set will retain all of the rows from either the
left or right table.
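To confirm that the two ways of writing an inner join return the same rows, here is a small sketch using Python's sqlite3 module (our own choice of tooling, not something the article prescribes) on the same example tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table employee (empID integer, empName text);
    create table location (empID integer, empLoc text);
    insert into employee values
        (13, 'Jason'), (8, 'Alex'), (3, 'Ram'), (17, 'Babu'), (25, 'Johnson');
    insert into location values
        (13, 'San Jose'), (8, 'Los Angeles'), (3, 'Pune, India'),
        (17, 'Chennai, India'), (39, 'Bangalore, India');
""")

# Explicit inner join syntax
explicit_rows = conn.execute("""
    select * from employee inner join location
    on employee.empID = location.empID
    order by employee.empID
""").fetchall()

# Comma syntax with the predicate in the where clause
implicit_rows = conn.execute("""
    select * from employee, location
    where employee.empID = location.empID
    order by employee.empID
""").fetchall()

matched_ids = [row[0] for row in explicit_rows]
print(matched_ids)
```

Only the four shared employee IDs survive; neither Johnson (25) nor the Bangalore row (39) is returned.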

In SQL, what is the definition of a key?

With so many different types of keys (foreign, primary, unique, natural, super, etc.), it can really
get quite confusing to have a solid understanding of keys in SQL. And to make it even more
confusing, many people think that relational database theory (which deals with terms like tuples,
attributes, and relations), SQL (which deals with terms like tables, rows, and columns, and is our
main concern here), and file systems (which deal with terms like records and fields) are all the
same concept, when in fact they are all completely different.

With that in mind, we want to give the proper definition of keys in SQL so that the foundation of
your understanding can be solid.

The Definition of A Key in SQL


According to the SQL Standard, a key is a subset of columns in a table that allows a row to be
uniquely identified. So, a key can be more than just one column. And, every row in the table will
have a unique value for the key, or a unique combination of values if the key consists of more
than just one column.

Can a key have NULL values in SQL?

According to the SQL standard, a key is not allowed to contain NULL values. Any key
that has more columns than necessary to uniquely identify each row in the table is called a
super-key (think of it as a super-set). But, if the key has the minimum number of columns necessary
to uniquely identify each row, then it is called a minimal super-key. A minimal super-key is also
known as a candidate key, and there must be one or more candidate keys in a table.

Keys in actual RDBMS implementations


Even though the SQL standard says that a key cannot be NULL, in practice actual RDBMS
implementations (like SQL Server and Oracle) allow both foreign and unique keys to be
NULL. And there are plenty of times when that makes sense. However, a primary key
can never be NULL.

In SQL, what are the differences between primary, foreign, and unique keys?

The one thing that primary, unique, and foreign keys all have in common is the fact that each
type of key can consist of more than just one column from a given table. In other words,
foreign, primary, and unique keys are not restricted to a single column: each type of key can
cover multiple columns. This is something that many people in software are not aware of.

Of course, the database programmer is the one who will actually define which columns are
covered by a foreign, primary, or unique key. That is one similarity all those keys share, but there
are also some major differences that exist between primary, unique, and foreign keys. We will go
over those differences in this article. But first, we want to give a thorough explanation of why
foreign keys are necessary in some situations.

What is the point of having a foreign key?


Foreign keys are used to reference unique columns in another table. So, for example, a foreign
key can be defined on one table A, and it can reference some unique column(s) in another table
B. Why would you want a foreign key? Well, whenever it makes sense to have a relationship
between columns in two different tables.

An example of when a foreign key is necessary


Suppose that we have an Employee table and an Employee Salary table. Also assume that every
employee has a unique ID. The Employee table could be said to have the master list of all
Employee IDs in the company. But, if we want to store employees' salaries in another table, then
do we want to recreate the entire master list of employee IDs in the Employee Salary table as
well? No, we don't want to do that because it's inefficient. It would make a lot more sense to
just define a relationship between an Employee ID column in the Employee Salary table and the
master Employee ID column in the Employee table, one where the Employee Salary table can
just reference the employee ID in the Employee table. This way, whenever someone's employee
ID is updated in the Employee table, it will also automatically get updated in the Employee
Salary table. Sounds good, right? So now, nobody has to manually update the employee IDs in
the Employee Salary table every time the ID is updated in the master list inside the Employee
table. And, if an employee is removed from the Employee table, he/she will also automatically be
removed (by the RDBMS) from the Employee Salary table. Of course, all of this behavior has to
be defined by the database programmer, but hopefully you get the point.

Foreign keys and referential integrity


Foreign keys have a lot to do with the concept of referential integrity. What we discussed in the
previous paragraph are some of the principles behind referential integrity. You can and should
read a more in-depth article on that concept here: Referential integrity explained.

Can a table have multiple unique, foreign, and/or primary keys?
A table can have multiple unique and foreign keys. However, a table can have only one primary
key.

Can a unique key have NULL values? Can a primary key have NULL values?
Unique key columns are allowed to hold NULL values. The values in a primary key column,
however, can never be NULL.
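Here is a minimal sketch of the unique key behavior, assuming Python's sqlite3 module and a hypothetical managers table with a unique email column. How many NULLs a unique key may hold varies by RDBMS (SQL Server, for example, permits only one NULL under a unique constraint), so treat this as SQLite's behavior rather than a universal rule.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: managerID is the primary key, email is a unique key
conn.execute("""
    create table managers (
        managerID integer primary key,
        email     text unique
    )
""")

# NULLs are accepted in the unique column (SQLite even allows several)
conn.execute("insert into managers values (1, NULL)")
conn.execute("insert into managers values (2, NULL)")
conn.execute("insert into managers values (3, 'a@example.com')")

# A duplicate non-NULL value violates the unique key
try:
    conn.execute("insert into managers values (4, 'a@example.com')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

null_count = conn.execute(
    "select count(*) from managers where email is null"
).fetchone()[0]
print(duplicate_rejected, null_count)
```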

Can a foreign key reference a non-primary key?


Yes, a foreign key can actually reference a key that is not the primary key of a table. But, a
foreign key must reference a unique key.

Can a foreign key contain null values?


Yes, a foreign key can hold NULL values. A NULL in a foreign key column typically means that the
row does not reference any row in the other table. Foreign keys can also reference unique,
non-primary keys, which can themselves hold NULL values.
Some other differences between foreign, primary, and unique keys
While unique and primary keys both enforce uniqueness on the column(s) of one table, foreign
keys define a relationship between two tables. A foreign key identifies a column or group of
columns in one (referencing) table that refers to a column or group of columns in another
(referenced) table. In our example above, the Employee table is the referenced table and the
Employee Salary table is the referencing table.

As we stated earlier, both unique and primary keys can be referenced by foreign keys.

What's referential integrity?

Referential integrity is a relational database concept in which multiple tables share a relationship
based on the data stored in the tables, and that relationship must remain consistent.

The concept of referential integrity, and one way in which it's enforced, is best illustrated by an
example. Suppose company X has 2 tables, an Employee table and an Employee Salary table. In
the Employee table we have 2 columns: the employee ID and the employee name. In the
Employee Salary table, we have 2 columns: the employee ID and the salary for the given ID.

Now, suppose we wanted to remove an employee because he no longer works at company X.
Then, we would remove his entry in the Employee table. Because he also exists in the Employee
Salary table, we would have to manually remove him from there as well. Manually removing
the employee from the Employee Salary table can become quite a pain. And if there are other
tables in which Company X uses that employee, then he would have to be deleted from those
tables as well, which is an even bigger pain.

By enforcing referential integrity, we can solve that problem, so that we wouldn't have to
manually delete him from the Employee Salary table (or any others). Here's how: first we would
define the employee ID column in the Employee table to be our primary key. Then, we would
define the employee ID column in the Employee Salary table to be a foreign key that points to
the primary key, the employee ID column in the Employee table. Once we define our foreign key
to primary key relationship, we would need to add what's called a constraint to the Employee
Salary table. The constraint that we would add in particular is called a cascading delete. This
would mean that any time an employee is removed from the Employee table, any entries that
employee has in the Employee Salary table would also automatically be removed from the
Employee Salary table.

Note in the example given above that referential integrity is something that must be enforced,
and that we enforced only one rule of referential integrity (the cascading delete). There are
actually 3 rules that referential integrity enforces:

1. We may not add a record to the Employee Salary table unless the foreign key for that record
points to an existing employee in the Employee table.

2. If a record in the Employee table is deleted, all corresponding records in the Employee Salary
table must be deleted using a cascading delete. This was the example we had given earlier.

3. If the primary key for a record in the Employee table changes, all corresponding records in the
Employee Salary table must be modified using what's called a cascading update.
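The three rules above can be exercised with a short sketch. It assumes Python's sqlite3 module; the table names, column names, and salary figures are our own illustrations. Note that SQLite only enforces foreign keys after "pragma foreign_keys = on" has been issued, and other RDBMSs have their own settings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("pragma foreign_keys = on")  # SQLite does not enforce foreign keys by default

conn.executescript("""
    create table employee (
        empID   integer primary key,
        empName text
    );
    create table employee_salary (
        empID  integer references employee (empID)
                   on delete cascade
                   on update cascade,
        salary integer
    );
    insert into employee values (1, 'Jason'), (2, 'Alex');
    insert into employee_salary values (1, 50000), (2, 60000);
""")

# Rule 1: a salary row must point to an existing employee
try:
    conn.execute("insert into employee_salary values (99, 40000)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# Rule 2: deleting an employee also deletes that employee's salary rows
conn.execute("delete from employee where empID = 1")

# Rule 3: changing an employee's primary key updates the salary rows too
conn.execute("update employee set empID = 20 where empID = 2")

salary_rows = conn.execute("select empID, salary from employee_salary").fetchall()
print(orphan_rejected, salary_rows)
```

After these statements the only salary row left belongs to the renumbered employee 20.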

It's worth noting that most RDBMSs (relational databases like Oracle, DB2, Teradata, etc.)
can automatically enforce referential integrity if the right settings are in place. But, a large part of
the burden of maintaining referential integrity is placed upon whoever designs the database
schema, basically whoever defined the tables and their corresponding structure/relationships in
the database that you are using. Referential integrity is an important concept and you simply
must know it for any programmer interview.

Provide an example and definition of a natural key in SQL.


You have probably come across the term natural key within the context of SQL and data
warehouses. What exactly is a natural key? A natural key is a key composed of columns that
actually have a logical relationship to other columns within a table. What does that mean in plain
English? Well, let's go through an example of a natural key.

Natural Key Example

Consider a table called People. If we use the columns First_Name, Last_Name, and Address
together to form a key then that would be a natural key because those columns are something
that are natural to people, and there is definitely a logical relationship between those columns
and any other columns that may exist in the table.

Why is it called a natural key?


The reason it's called a natural key is because the columns that belong to the key are just
naturally a part of the table and have a relationship with other columns in the table. So, a natural
key already exists within a table, and columns do not need to be added just to create an
artificial key.

Natural keys versus business keys


Natural keys are often also called business keys, so both terms mean exactly the same thing.

Natural keys versus domain keys


Domain keys also mean the same thing as natural keys.

Natural keys versus surrogate keys


Natural keys are often compared to surrogate keys. What exactly is a surrogate key? Well, first
consider the fact that the word surrogate literally means substitute. The reason a surrogate key is
like a substitute is because it's unnatural, in the sense that the column used for the surrogate key
has no logical relationship to other columns in the table.

In other words, the surrogate key really has no business meaning, i.e., the data stored in a
surrogate key has no intrinsic meaning to it.

Why are surrogate keys used?

A surrogate key could be considered to be the artificial key that we mentioned earlier. In most
databases, surrogate keys are only used to act as a primary key. Surrogate keys are usually just
simple sequential numbers where each number uniquely identifies a row. For example, Sybase
and SQL Server both have what's called an identity column specifically meant to hold a unique
sequential number for each row. MySQL allows you to define a column with the
AUTO_INCREMENT attribute, which means that every time you add a new row, the value in
that column is automatically set to be 1 greater than the value in the most recent row added to
the table. You can also set the increment value to be whatever you want it to be.
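As a rough illustration of a surrogate key next to natural key columns, the sketch below uses Python's sqlite3 module. SQLite spells the feature "integer primary key autoincrement" rather than MySQL's AUTO_INCREMENT, and the people table with its sample rows is purely hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table people (
        person_id  integer primary key autoincrement,  -- surrogate key, no business meaning
        first_name text,
        last_name  text,
        address    text
    )
""")

# We never supply person_id: the database assigns the next number itself
new_rows = [
    ("Ann", "Lee", "1 Oak Ave"),
    ("Bob", "Ray", "2 Elm St"),
    ("Cy", "Fox", "3 Pine Rd"),
]
conn.executemany(
    "insert into people (first_name, last_name, address) values (?, ?, ?)",
    new_rows,
)

assigned_ids = [
    row[0] for row in conn.execute("select person_id from people order by person_id")
]
print(assigned_ids)
```

The (first_name, last_name, address) columns could still serve as a natural key; person_id exists only to identify rows.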

What is a simple key in a DBMS?


In a database table, a simple key is just a single attribute (which is just a column) that can
uniquely identify a row. So, any single column in a table that can uniquely identify a row is a
simple key. The reason it's called a simple key is that it's composed of just one column (as
opposed to multiple columns), and that's it.

Example of a simple key


Let's go through an example of a simple key. Consider a table called Employees. If every
employee has a unique ID and a column called EmployeeID, then the EmployeeID column
would be considered a simple key, because it's a single column that can uniquely identify every
row in the table (where each row is a separate employee). Simple, isn't it?

What is the definition of a secondary key?

You may have heard the term secondary key in Oracle, MySQL, SQL Server, or whatever other
DBMS you are dealing with. What exactly is a secondary key? Let's start with a definition, and
then a simple example that will help you understand further.

A given table may have more than just one choice for a primary key. Basically, there may be
another column (or combination of columns, for a multi-column primary key) that qualifies as
a primary key. All of the column(s) or combinations of columns that qualify to be a primary key
are known as candidate keys, because they are considered candidates for the primary key. And the
options that are not selected to be the primary key are known as secondary keys.

Example of a Secondary Key in SQL

Let's go through an example of a secondary key. Consider a table called Managers that stores all
of the managers in a company. Each manager has a unique Manager ID Number, a physical
address, and an email address. Let's say that the Manager ID is chosen to be the primary key of
the Managers table. Either the physical address or the email address could have been selected as
the primary key, because they are both unique fields for every manager row in the Managers table.
But, because the email address and physical address were not selected as the primary key, they
are considered to be secondary keys.
Provide a definition and example of a superkey in SQL.
In SQL, the definition of a superkey is a set of columns in a table for which there are no two
rows that will share the same combination of values. So, the superkey is unique for each and
every row in the table. A superkey can also be just a single column.

Example of a superkey
Suppose we have a table that holds all the managers in a company, and that table is called
Managers. The table has columns called ManagerID, Name, Title, and DepartmentID. Every
manager has his/her own ManagerID, so that value is always unique in each and every row.

This means that if we combine the ManagerID column value for any given row with any other
column value, then we will have a unique set of values. So, for the combinations of (ManagerID,
Name), (ManagerID, Title), (ManagerID, DepartmentID), (ManagerID, Name, DepartmentID),
etc., there will be no two rows in the table that share the exact same combination of values,
because the ManagerID will always be unique and different for each row. This means that pairing
the ManagerID with any other column(s) will ensure that the combination will also be unique
across all rows in the table.

And that is exactly what defines a superkey: it's any combination of column(s) for which that
combination of values will be unique across all rows in a table. So, all of those combinations of
columns in the Managers table that we gave earlier would be considered to be superkeys. Even
the ManagerID column by itself is considered to be a superkey, although a special type of
superkey, as you can read more about below.

What is a minimal superkey?


A minimal superkey is a superkey with the minimum number of columns needed to uniquely
identify a single row. In other words, it is the smallest set of columns which, when combined,
will give a unique value for every row in the table. Remember that we mentioned earlier that a
superkey can be just a single column. So, in our example above, the minimal superkey would be
the ManagerID, since it is unique for each and every row in the Managers table.

Can a table have multiple minimal superkeys?


Yes, a table can have multiple minimal superkeys. Let's use our example of the Managers table
again. Suppose we add another column for the Social Security Number (which, for our non-American
readers, is a unique 9 digit number assigned to every citizen of the USA) to the Managers table;
let's just call it SSN. Since that column will clearly have a unique value for every row in the
table, it will also be a minimal superkey, because it's only one column and it is also unique for
every row.

Can a minimal superkey have more than one column?


Absolutely. If there is no single column that is unique for every row in a given table, but there is
a combination of columns that produces a unique value for every row in a table, then that
combination of columns would be the minimal superkey. This is of course provided that the
combination is the smallest number of columns necessary to produce a unique value for each
row.
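The definitions of superkey and minimal superkey can be expressed as two small functions. This is only a sketch in plain Python with made-up sample rows for the Managers table; it checks uniqueness in the sample data only, whereas a real key is a design decision about every row the table could ever hold.

```python
from itertools import combinations

def is_superkey(rows, columns):
    """True if no two rows share the same values for the given columns."""
    seen = set()
    for row in rows:
        values = tuple(row[c] for c in columns)
        if values in seen:
            return False
        seen.add(values)
    return True

def minimal_superkeys(rows, all_columns):
    """Superkeys from which no column can be dropped without losing uniqueness."""
    found = []
    for size in range(1, len(all_columns) + 1):
        for combo in combinations(all_columns, size):
            # A superset of a key we already found cannot be minimal
            if any(set(key) <= set(combo) for key in found):
                continue
            if is_superkey(rows, combo):
                found.append(combo)
    return found

# Hypothetical sample data for the Managers table
managers = [
    {"ManagerID": 1, "Name": "Ann", "Title": "Director", "DepartmentID": 10},
    {"ManagerID": 2, "Name": "Bob", "Title": "Director", "DepartmentID": 20},
    {"ManagerID": 3, "Name": "Ann", "Title": "VP", "DepartmentID": 20},
]
columns = ["ManagerID", "Name", "Title", "DepartmentID"]

print(is_superkey(managers, ["ManagerID", "Name"]))  # a superkey, but not minimal
print(minimal_superkeys(managers, columns))
```

In this tiny sample some two-column combinations without ManagerID also happen to be unique, which shows both that a minimal superkey can span several columns and why sample data alone cannot tell you what the real keys are.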

Why is it called a superkey?


It's called a superkey because it comes from RDBMS theory, as in superset and subset. So,
superkeys are essentially all the superset combinations of keys, each of which will of course
uniquely identify a row in a table.

Superkey versus candidate key


We discussed minimal superkeys and defined exactly what they are. Candidate keys are actually
minimal superkeys, so both candidate keys and minimal superkeys mean exactly the same
thing.

In SQL, what's the difference between the having clause and the where clause?
The difference between the having and where clause is best illustrated by an example. Suppose
we have a table called emp_bonus as shown below. Note that the table has multiple entries for
employees A and B.

emp_bonus
Employee Bonus
A 1000
B 2000
A 500
C 700
B 1250
If we want to calculate the total bonus that each employee received, then we would write a SQL
statement like this:

select employee, sum(bonus) from emp_bonus group by employee;

The Group By Clause


In the SQL statement above, you can see that we use the "group by" clause with the employee
column. What the group by clause does is allow us to find the sum of the bonuses for each
employee. Using the group by in combination with the sum(bonus) statement will give us the
sum of all the bonuses for employees A, B, and C.

Running the SQL above would return this:

Employee Sum(Bonus)
A 1500
B 3250
C 700

Now, suppose we wanted to find the employees who received more than $1,000 in bonuses for
the year of 2007. You might think that we could write a query like this:

BAD SQL:
select employee, sum(bonus) from emp_bonus
group by employee where sum(bonus) > 1000;

The WHERE clause does not work with aggregates like SUM
The SQL above will not work, because the where clause doesn't work with aggregates like
sum, avg, max, etc. Instead, what we will need to use is the having clause. The having clause
was added to SQL just so we could compare aggregates to other values, just as the where
clause can be used with non-aggregates. Now, the correct SQL will look like this:

GOOD SQL:
select employee, sum(bonus) from emp_bonus
group by employee having sum(bonus) > 1000;
Difference between having and where clause
So we can see that the difference between the having and where clause in SQL is that the where
clause cannot be used with aggregates, but the having clause can. One way to think of it is that
the where clause filters individual rows before they are grouped, while the having clause is an
additional filter applied to the groups afterwards.
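The bad and good queries can be tried side by side. The sketch below assumes Python's sqlite3 module; the third query, which combines where and having, is our own addition to show the order in which the two filters apply.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table emp_bonus (employee text, bonus integer);
    insert into emp_bonus values
        ('A', 1000), ('B', 2000), ('A', 500), ('C', 700), ('B', 1250);
""")

# BAD SQL: an aggregate in a where clause (placed after group by) is rejected
try:
    conn.execute("""
        select employee, sum(bonus) from emp_bonus
        group by employee where sum(bonus) > 1000
    """)
    bad_sql_failed = False
except sqlite3.OperationalError:
    bad_sql_failed = True

# GOOD SQL: having filters on the aggregate
good_rows = conn.execute("""
    select employee, sum(bonus) from emp_bonus
    group by employee having sum(bonus) > 1000
    order by employee
""").fetchall()

# where drops individual rows first, then having filters the resulting groups
combined_rows = conn.execute("""
    select employee, sum(bonus) from emp_bonus
    where bonus >= 1000
    group by employee having sum(bonus) > 1000
    order by employee
""").fetchall()

print(bad_sql_failed, good_rows, combined_rows)
```

In the combined query, employee A's 500 bonus is removed before grouping, so A's total is exactly 1000 and no longer passes the having filter.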

How do database indexes work? And, how do indexes help? Provide a tutorial on database indexes.
Let's start out our tutorial and explanation of why you would need a database index by going
through a very simple example. Suppose that we have a database table called Employee with
three columns: Employee_Name, Employee_Age, and Employee_Address. Assume that the
Employee table has thousands of rows.

Now, let's say that we want to run a query to find all the details of any employees who are named
"Jesus". So, we decide to run a simple query like this:

SELECT * FROM Employee
WHERE Employee_Name = 'Jesus'

What would happen without an index on the table?

Once we run that query, what exactly goes on behind the scenes to find employees who are
named "Jesus"? Well, the database software would literally have to look at every single row in the
Employee table to see if the Employee_Name for that row is "Jesus". And, because we want
every row with the name "Jesus" inside it, we cannot just stop looking once we find one row
with the name "Jesus", because there could be other rows with the name "Jesus". So, every row up
until the last row must be searched, which means thousands of rows in this scenario will have to
be examined by the database to find the rows with the name "Jesus". This is what is called a full
table scan.

How a database index can help performance


You might be thinking that doing a full table scan sounds inefficient for something so simple.
Shouldn't software be smarter? It's almost like looking through the entire table with the human
eye: very slow and not at all sleek. But, as you probably guessed by the title of this article, this
is where indexes can help a great deal. The whole point of having an index is to speed up
search queries by essentially cutting down the number of records/rows in a table that need
to be examined.

What is an index?
So, what is an index? Well, an index is a data structure (most commonly a B-tree) that stores the
values for a specific column in a table. An index is created on a column of a table. So, the key
points to remember are that an index consists of column values from one table, and that those
values are stored in a data structure. The index is a data structure; remember that.


What kind of data structure is an index?


B-trees are the most commonly used data structures for indexes. The reason B-trees are the
most popular data structure for indexes is that they are time efficient: look-ups, deletions,
and insertions can all be done in logarithmic time. Another major reason B-trees are more
commonly used is that the data stored inside the B-tree can be sorted. The RDBMS typically
determines which data structure is actually used for an index. But, in some scenarios with
certain RDBMSs, you can actually specify which data structure you want your database to use
when you create the index itself.

How does a hash table index work?

Hash tables are another data structure that you may see being used as indexes; these indexes are
commonly referred to as hash indexes. The reason hash indexes are used is because hash tables
are extremely efficient when it comes to just looking up values. So, queries that compare for
equality to a string can retrieve values very fast if they use a hash index. For instance, the query
we discussed earlier (SELECT * FROM Employee WHERE Employee_Name = 'Jesus') could
benefit from a hash index created on the Employee_Name column. The way a hash index would
work is that the column value will be the key into the hash table and the actual value mapped to
that key would just be a pointer to the row data in the table. Since a hash table is basically an
associative array, a typical entry would look something like "Jesus => 0x28939", where 0x28939
is a reference to the table row where "Jesus" is stored in memory. Looking up a value like "Jesus"
in a hash table index and getting back a reference to the row in memory is obviously a lot faster
than scanning the table to find all the rows with a value of "Jesus" in the Employee_Name
column.
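The idea of a hash index can be sketched with an ordinary Python dictionary. This is a toy model under our own assumptions, not how any particular RDBMS implements it: the table is a list of tuples, and a row's position in that list stands in for the pointer to the row.

```python
from collections import defaultdict

# Toy table: each tuple is (Employee_Name, Employee_Age, Employee_Address)
employees = [
    ("Jesus", 33, "1 Oak Ave"),
    ("Maria", 41, "2 Elm St"),
    ("Jesus", 52, "3 Pine Rd"),
    ("Ahmed", 29, "4 Birch Ln"),
]

def build_hash_index(rows, column):
    """Map each value in the column to the positions of the rows containing it."""
    index = defaultdict(list)
    for position, row in enumerate(rows):
        index[row[column]].append(position)
    return index

name_index = build_hash_index(employees, column=0)

# Equality lookup: one hash probe, then follow the "pointers" to the rows
matches = [employees[position] for position in name_index.get("Jesus", [])]
print(matches)
```

The lookup never touches the rows for Maria or Ahmed, which is the saving a hash index offers for equality comparisons.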

The disadvantages of a hash index


Hash tables are not sorted data structures, and there are many types of queries which hash
indexes cannot even help with. For instance, suppose you want to find out all of the employees
who are less than 40 years old. How could you do that with a hash table index? Well, it's not
possible, because a hash table is only good for looking up key-value pairs, which means queries
that check for equality (like WHERE name = 'Jesus'). What is implied in the key-value
mapping in a hash table is the concept that the keys of a hash table are not sorted or stored in any
particular order. This is why hash indexes are usually not the default type of data structure used
by database indexes: they aren't as flexible as B-trees when used as the index data
structure.

What are some other types of indexes?


Indexes that use an R-tree data structure are commonly used to help with spatial problems. For
instance, a query like "Find all of the Starbucks within 2 kilometers of me" would be the type of
query that could show enhanced performance if the database table uses an R-tree index.

Another type of index is a bitmap index, which works well on columns that contain only a few
distinct values (like Boolean true and false) but many instances of those values: basically,
columns with low selectivity.

How does an index improve performance?

Because an index is basically a data structure that is used to store column values, looking up
those values becomes much faster. And, if an index is using the most commonly used data
structure type (a B-tree), then the data structure is also sorted. Having the column values be
sorted can be a major performance enhancement; read on to find out why.

Let's say that we create a B-tree index on the Employee_Name column. This means that when
we search for employees named "Jesus" using the SQL we showed earlier, the entire
Employee table does not have to be searched to find employees named "Jesus". Instead, the
database will use the index to find employees named "Jesus", because the index will presumably be
sorted alphabetically by the employee's name. And, because it is sorted, searching for a
name is a lot faster, because all names starting with a "J" will be right next to each other in the
index! It's also important to note that the index also stores pointers to the table row so that other
column values can be retrieved; read on for more details on that.
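One way to watch a database switch from a full table scan to an index is to ask it for its query plan. The sketch below assumes Python's sqlite3 module and SQLite's "explain query plan" statement; the index name idx_employee_name and the generated rows are hypothetical, and the exact wording of the plan text differs between SQLite versions and between database products.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table Employee (
        Employee_Name text, Employee_Age integer, Employee_Address text
    )
""")
# A few thousand filler rows plus the one we are looking for
rows = [(f"Name{i}", 20 + i % 40, f"{i} Main St") for i in range(5000)]
rows.append(("Jesus", 33, "1 Oak Ave"))
conn.executemany("insert into Employee values (?, ?, ?)", rows)

query = "select * from Employee where Employee_Name = 'Jesus'"

def plan(sql):
    """Join the human-readable detail column of each query plan row."""
    return " ".join(row[-1] for row in conn.execute("explain query plan " + sql))

plan_without_index = plan(query)  # reports a scan of the whole table
conn.execute("create index idx_employee_name on Employee (Employee_Name)")
plan_with_index = plan(query)     # reports a search that uses the index

print(plan_without_index)
print(plan_with_index)
result = conn.execute(query).fetchall()
```

The query returns the same row either way; only the amount of work needed to find it changes.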

What exactly is inside a database index?


So, now you know that a database index is created on a column in a table, and that the index
stores the values in that specific column. But, it is important to understand that a database index
does not store the values in the other columns of the same table. For example, if we create an
index on the Employee_Name column, this means that the Employee_Age and
Employee_Address column values are not also stored in the index. If we did store all the
other columns in the index, then it would be just like creating another copy of the entire table,
which would take up way too much space and would be very inefficient.

An index also stores a pointer to the table row


So, the question is: if the value that we are looking for is found in an index (like "Jesus"), how
does the database find the other values that are in the same row (like the address of Jesus and
his age)? Well, it's quite simple - database indexes also store pointers to the corresponding rows
in the table. A pointer is just a reference to the location on disk where the row data is stored. So,
in addition to the column value that is stored in the index, a pointer to the row in the table where
that value lives is also stored in the index. This means that one of the values (or nodes) in the
index for an Employee_Name could be something like ("Jesus", 0x82829), where 0x82829 is the
address on disk (the pointer) where the row data for "Jesus" is stored. Without that pointer, all
you would have is a single value, which would be meaningless because you would not be able to
retrieve the other values in the same row - like the address and the age of an employee.
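
To make the idea of "value plus pointer" concrete, here is a minimal Java sketch. It is only an
illustration under stated assumptions, not how any particular database implements its indexes:
the class name NameIndexSketch, the sample rows, the use of java.util.TreeMap (a red-black tree
standing in for a B-tree), and the integer row positions standing in for disk addresses are all
hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch: an "index" on Employee_Name kept as a sorted map.
// TreeMap (a red-black tree) stands in for a B-tree, and the Integer row
// positions stand in for on-disk row pointers.
class NameIndexSketch {
    // The "table": each row is {name, age, address} (made-up sample data).
    static final String[][] TABLE = {
        {"Maria", "41", "12 Oak St"},
        {"Jesus", "35", "7 Elm Ave"},
        {"Abdul", "29", "3 Pine Rd"},
        {"Jesus", "52", "88 Lake Dr"},
    };

    // Build the index: column value -> positions of the rows that hold it.
    // Only the name is stored in the index, not the age or the address.
    static TreeMap<String, List<Integer>> buildIndex() {
        TreeMap<String, List<Integer>> index = new TreeMap<>();
        for (int row = 0; row < TABLE.length; row++) {
            index.computeIfAbsent(TABLE[row][0], k -> new ArrayList<>()).add(row);
        }
        return index;
    }

    // Look the name up in the index, then follow each pointer to read the full row.
    static List<String> addressesFor(String name) {
        List<String> addresses = new ArrayList<>();
        List<Integer> pointers =
            buildIndex().getOrDefault(name, Collections.<Integer>emptyList());
        for (int row : pointers) {
            addresses.add(TABLE[row][2]);
        }
        return addresses;
    }

    public static void main(String[] args) {
        System.out.println(buildIndex().keySet());    // keys are kept in sorted order
        System.out.println(addressesFor("Jesus"));    // addresses found via the pointers
    }
}
```

The index itself never holds an address; it only says where the matching rows live, and the
table is consulted afterwards for the remaining columns.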

How does a database know when to use an index?


When a query like "SELECT * FROM Employee WHERE Employee_Name = 'Jesus'" is run,
the database will check to see if there is an index on the column(s) being queried. Assuming the
Employee_Name column does have an index created on it, the database will have to decide
whether it actually makes sense to use the index to find the values being searched - because
there are some scenarios where it is actually less efficient to use the database index, and more
efficient just to scan the entire table. Read this article to understand more about those scenarios:
Selectivity in SQL.

Can you force the database to use an index on a query?


Generally, you will not tell the database when to actually use an index - that decision will be
made by the database itself. It is worth noting, though, that many databases (like Oracle and
MySQL) let you specify that you want a particular index to be used.

How to create an index in SQL:


Here's what the actual SQL would look like to create an index on the Employee_Name column
from our example earlier:

CREATE INDEX name_index
ON Employee (Employee_Name);

How to create a multi-column index in SQL:


We could also create an index on two of the columns in the Employee table, as shown in this
SQL:

CREATE INDEX name_index
ON Employee (Employee_Name, Employee_Age);

What is a good analogy for a database index?


A very good analogy is to think of a database index as an index in a book. If you have a book
about dogs and you are looking for the section on Golden Retrievers, then why would you flip
through the entire book - which is the equivalent of a full table scan in database terminology -
when you can just go to the index at the back of the book, which will tell you the exact pages
where you can find information on Golden Retrievers? Similarly, as a book index contains a page
number, a database index contains a pointer to the row containing the value that you are
searching for in your SQL.

What is the cost of having a database index?


So, what are some of the disadvantages of having a database index? Well, for one thing it takes
up space - and the larger your table, the larger your index. Another performance hit with indexes
is the fact that whenever you add, delete, or update rows in the corresponding table, the same
operations will have to be done to your index. Remember that an index needs to contain the same
up-to-the-minute data as whatever is in the table column(s) that the index covers.

As a general rule, an index should only be created on a table if the data in the indexed column
will be queried frequently.

You should also read more about normalization, and also check out the example of first normal
form.

In databases, what is a full table scan? Also, what are some of the causes of full
table scans?

A full table scan looks through all of the rows in a table - one by one - to find the data that a
query is looking for. Obviously, this can cause very slow SQL queries if you have a table with a
lot of rows - just imagine how performance-intensive a full table scan would be on a table with
millions of rows. Using an index can help prevent full table scans.

Let's go through some different scenarios which cause a full table scan:

Full table scan if statistics haven't been updated


Normally, statistics are kept on tables and indexes. But, if for some reason table or index
statistics have not been updated, then this may result in a full table scan. This is because most
RDBMSs have query optimizers that use those statistics to figure out if using an index is
worthwhile. And if those statistics are not available, then the RDBMS may wrongly determine
that doing a full table scan is more efficient than using an index.

If a query does not have a WHERE clause to filter out the rows which appear in the result set,
then a full table scan might be performed.

Full table scan with an index

There are some scenarios in which a full table scan will still be performed even though an index
is present on that table. Let's go through some of those scenarios.

If a query does have a WHERE clause, but none of the columns in that WHERE clause match the
leading column of an index on the table, then a full table scan will be performed.

Even if a query does have a WHERE clause with a column that matches the first column of an
index, a full table scan can still occur. This situation arises when the comparison being used by
the WHERE clause prevents the use of an index. Here are some scenarios in which that could
happen:

o If the NOT EQUAL (the "<>") operator is used. An example is WHERE
NAME <> 'PROGRAMMERINTERVIEW'. This could still result in a full
table scan, because indexes are usually used to find what is inside a
table, but indexes (in general) cannot be used to find what is not inside
a table.

o If the NOT operator is used. An example is WHERE NOT NAME =
'PROGRAMMERINTERVIEW'.

o If the wildcard operator is used in the first position of a comparison
string. An example is WHERE NAME LIKE '%INTERVIEW%'.
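
The leading-wildcard case can be sketched in Java. This is an illustration under assumptions,
not a description of any real database engine: the class name WildcardSketch, the sample names,
and the use of a java.util.TreeSet as a stand-in for a sorted index are all hypothetical. The point
is that a sorted structure can jump straight to a known prefix, but it has no shortcut for text
that may appear anywhere inside a value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical sketch: a sorted set of NAME values stands in for an index.
class WildcardSketch {
    static final TreeSet<String> INDEX = new TreeSet<>(Arrays.asList(
        "INTERVIEW", "INTERVIEWER", "JAVA", "PROGRAMMERINTERVIEW", "SQL"));

    // Like NAME LIKE 'INTER%': matching names are contiguous in sorted
    // order, so we can jump to the first one and read a small range.
    static SortedSet<String> startsWith(String prefix) {
        return INDEX.subSet(prefix, prefix + Character.MAX_VALUE);
    }

    // Like NAME LIKE '%INTERVIEW%': matches can sit anywhere in the sort
    // order, so every entry has to be examined (the equivalent of a full scan).
    static List<String> contains(String fragment) {
        List<String> matches = new ArrayList<>();
        for (String name : INDEX) {
            if (name.contains(fragment)) {
                matches.add(name);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(startsWith("INTER"));
        System.out.println(contains("INTERVIEW"));
    }
}
```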

What are the differences between a hash table and a binary search tree? Suppose
that you are trying to figure out which of those data structures to use when
designing the address book for a cell phone that has limited memory. Which data
structure would you use?

A hash table can insert and retrieve elements in O(1) on average (for a big-O refresher, see the
Big-O notation section). A balanced binary search tree can insert and retrieve elements in
O(log(n)), which is slower than the hash table's O(1).

A hash table is an unordered data structure

When designing a cell phone, you want to keep as much memory as possible available for data
storage. A hash table is an unordered data structure - which means that it does not keep its
elements in any particular order. So, if you use a hash table for a cell phone address book, then
you would need additional memory to sort the values, because you would definitely need to
display the values in alphabetical order - it is an address book after all. So, by using a hash table
you have to set aside memory to sort elements that would otherwise have been used as storage
space.

A binary search tree is a sorted data structure

Because a binary search tree is already sorted, there will be no need to waste memory or
processing time sorting records in a cell phone. As we mentioned earlier, doing a lookup or an
insert on a binary tree is slower than doing it with a hash table, but a cell phone address book
will almost never have more than 5,000 entries. With such a small number of entries, a binary
search tree's O(log(n)) will definitely be fast enough. So, given all that information, a binary
search tree is the data structure that you should use in this scenario, since it is a better choice than
a hash table.
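
The trade-off can be seen in a small Java sketch. It assumes java.util.HashMap as the hash table
and java.util.TreeMap (a red-black tree, which is one kind of balanced binary search tree) as the
tree; the class name AddressBookSketch and the contacts are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical address book: name -> phone number.
class AddressBookSketch {
    // Fill any Map implementation with the same made-up contacts.
    static Map<String, String> fill(Map<String, String> book) {
        book.put("Sunil", "555-0103");
        book.put("Alex", "555-0101");
        book.put("Joe", "555-0102");
        book.put("Albert", "555-0100");
        return book;
    }

    // Tree-backed book: the keys are already in alphabetical order.
    static List<String> displayOrderFromTree() {
        return new ArrayList<>(fill(new TreeMap<String, String>()).keySet());
    }

    // Hash-backed book: the keys come back in no particular order, so an
    // extra sorting step (and the working memory for it) is needed.
    static List<String> displayOrderFromHash() {
        List<String> names = new ArrayList<>(fill(new HashMap<String, String>()).keySet());
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) {
        System.out.println(displayOrderFromTree());
        System.out.println(displayOrderFromHash());
    }
}
```

Both methods end up with the same alphabetical list; the difference is that only the hash-backed
version has to do a separate sort to get there.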

How does Big-O Notation work, and can you provide an example?

First and foremost, do not even walk into a software interview without knowing what Big O
Analysis is all about - you will embarrass yourself. Big O Notation is simply something that you
must know if you expect to get a job in this industry. Here we present a tutorial on Big O
Notation, along with some simple examples to really help you understand it. You can consider
this article to be sort of a "Big O notation for dummies" tutorial, because we really try to make it
easy to understand.

What is Big O Analysis in computer science - a tutorial:


When solving a computer science problem there will usually be more than just one solution.
These solutions will often be in the form of different algorithms, and you will generally want to
compare the algorithms to see which one is more efficient.

This is where Big O analysis helps - it gives us some basis for measuring the efficiency of an
algorithm. A more detailed explanation and definition of Big O analysis would be this: it
measures the efficiency of an algorithm based on the time it takes for the algorithm to run as a
function of the input size. Think of the input simply as what goes into a function - whether it be
an array of numbers, a linked list, etc.

Sounds quite boring, right?

It's really not that bad at all - and it is something best illustrated by an example with actual code
samples.

Big O Notation Practice Problems

Even if you already know what Big O Notation is, you can still check out the example algorithms
below and try to figure out the Big O Notation of each algorithm on your own, without reading
our answers first. This will give you some good practice.

Big O Notation Examples in Java

Now it's really time to pay attention - let's start our explanation of Big O Notation with an actual
problem. Here is the problem we are trying to solve:

Let's suppose that we want to create a function that, when given an array of integers greater
than 0, will return the integer that is the smallest in that array.

In order to best illustrate the way Big-O analysis works, we will come up with two different
solutions to this problem, each with a different Big-O efficiency.

Here's our first function that will simply return the integer that is the smallest in the array. The
algorithm will just iterate through all of the values in the array and keep track of the smallest
integer in the array in the variable called 'curMin'.

Let's assume that the array being passed to our function contains 10 elements - this number is
something we arbitrarily chose. We could have said it contains 100, or 100000 elements - either
way it would have made no difference for our purposes here.

The CompareSmallestNumber Java function


int CompareSmallestNumber(int[] array)
{
    int x, curMin;

    // set smallest value to first item in array
    curMin = array[0];

    /* iterate through array to find smallest value
       and also assume there are only 10 elements
    */
    for (x = 1; x < 10; x++)
    {
        if (array[x] < curMin) {
            curMin = array[x];
        }
    }

    // return smallest value in the array
    return curMin;
}

As promised, we want to show you another solution to the problem. In this solution, we will use
a different algorithm - we will soon compare the big O Notation of the two different solutions
below. What we do for our second solution to the problem is compare each value in the array to
all of the other numbers in the array, and if that value is less than or equal to all of the other
numbers in the array then we know that it is the smallest number in the array.

The CompareToAllNumbers Java function


int CompareToAllNumbers(int[] array)
{
    boolean isMin;
    int x, y;

    /* iterate through each element in array,
       assuming there are only 10 elements:
    */
    for (x = 0; x < 10; x++)
    {
        isMin = true;

        for (y = 0; y < 10; y++)
        {
            /* compare the value in array[x] to the other values
               if we find that array[x] is greater than any of the
               values in array[y] then we know that the value in
               array[x] is not the minimum
               remember that the 2 arrays are exactly the same, we
               are just taking out one value with index 'x' and
               comparing to the other values in the array with
               index 'y'
            */
            if (array[x] > array[y])
                isMin = false;
        }

        // array[x] was not greater than any other value, so it is the minimum
        if (isMin)
            break;
    }

    return array[x];
}

Now, you've seen 2 functions that solve the same problem - but each one uses a different
algorithm. We want to be able to say which algorithm is more efficient using mathematical
terms, and Big-O analysis allows us to do exactly that.

Big O analysis of algorithms

For our purposes, we assumed an input size of 10 for the array. But when doing Big O analysis,
we don't want to use specific numbers for the input size - so we say that the input is of size n.

Remember that Big-O analysis is used to measure the efficiency of an algorithm based on the
time it takes for the algorithm to run as a function of the input size.

When doing Big-O analysis, "input" can mean a lot of different things depending on the problem
being solved. In our examples above, the input is the array that is passed into the different
functions. But, input could also be the number of elements of a linked list, the nodes in a tree, or
whatever data structure you are dealing with.

Since input is of size n, and in our example the input is an array - we will say that the array is of
size n. We will use the 'n' to denote input size in our Big-O analysis.

So, the real question is how Big-O analysis measures efficiency. Basically, Big-O will want to
express how many times the 'n' input items are 'touched'. The word 'touched' can mean different
things in different algorithms - in some algorithms it may mean the number of times a constant is
multiplied by an input item, the number of times an input is added to a data structure, etc.

But in our functions CompareSmallestNumber and CompareToAllNumbers, it just means the
number of times an array value is compared to another value.

Big O notation time complexity

In the function CompareSmallestNumber, the n (we used 10 items, but let's just use the variable
'n' for now) input items are each 'touched' only once when each one is compared to the minimum
value. In Big O notation, this would be written as O(n) - which is also known as linear time.
Linear time means that the time taken to run the algorithm increases in direct proportion to the
number of input items. So, 80 items would take roughly twice as long to run as 40 items.
Another way to phrase this is to say that the algorithm being used in the
CompareSmallestNumber function has order of n time complexity.

You might also see that in the CompareSmallestNumber function, we initialize the curMin
variable to the first value of the input array. And that does count as 1 'touch' of the input. So, you
might think that our Big O notation should be O(n + 1). But actually, Big O is concerned with the
running time as the number of inputs - which is 'n' in this case - approaches infinity. And as 'n'
approaches infinity the constant '1' becomes very insignificant - so we actually drop the constant.
Thus, we can say that the CompareSmallestNumber function has O(n) and not O(n + 1).

Also, if we have n^3 + n, then as n approaches infinity it's clear that the "+ n" becomes very
insignificant - so we will drop the "+ n", and instead of having O(n^3 + n), we will have O(n^3), or
order of n^3 time complexity.

What is Big O notation worst case?

Now, let's do the Big O analysis of the CompareToAllNumbers function. The worst case of Big
O notation in our example basically means that we want to find the scenario which will take the
longest for the CompareToAllNumbers function to run. When does that scenario occur?

Well, let's think about what the worst case running time for the CompareToAllNumbers function
is and use that as the basis for the Big O notation. So, for this function, let's assume that the
smallest integer is in the very last element of the array - because that is the exact scenario which
will take the longest to run since it will have to get to the very last element to find the smallest
element. Since we are taking each element in the array and comparing it to every other element
in the array, that means we will be doing 100 comparisons - assuming, of course, that our input
size is 10 (10 * 10 = 100). Or, if we use a variable "n" to represent the input size, that will be n^2
'touches' of the input. Thus, this function uses an O(n^2) algorithm.

Big O analysis measures efficiency

Now, let's compare the 2 functions: CompareToAllNumbers is O(n^2) and
CompareSmallestNumber is O(n). So, let's say that we have 10,000 input elements, then
CompareSmallestNumber will 'touch' on the order of 10,000 elements, whereas
CompareToAllNumbers will 'touch' 10,000 squared or 100,000,000 elements. That's a huge
difference, and you can imagine how much faster CompareSmallestNumber must run when
compared to CompareToAllNumbers - especially when given a very large number of inputs.
Efficiency is something that can make a huge difference and it's important to be aware of how to
create efficient solutions.
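
One way to check these numbers is to count the comparisons directly. The Java sketch below is
an illustration under assumptions: the class name TouchCounter and its helper methods are
hypothetical, and the loops use array.length instead of the fixed 10 so that the input size can
vary. Note that the single-pass version makes n - 1 comparisons, which is still O(n) because the
constant is dropped.

```java
// Hypothetical sketch: count how many comparisons each approach makes.
class TouchCounter {
    // Comparisons made by the single-pass approach (as in CompareSmallestNumber).
    static long linearTouches(int[] array) {
        long touches = 0;
        int curMin = array[0];
        for (int x = 1; x < array.length; x++) {
            touches++;
            if (array[x] < curMin) {
                curMin = array[x];
            }
        }
        return touches;
    }

    // Comparisons made by the compare-to-all approach (as in CompareToAllNumbers).
    static long quadraticTouches(int[] array) {
        long touches = 0;
        for (int x = 0; x < array.length; x++) {
            boolean isMin = true;
            for (int y = 0; y < array.length; y++) {
                touches++;
                if (array[x] > array[y]) {
                    isMin = false;
                }
            }
            if (isMin) {
                break;
            }
        }
        return touches;
    }

    // Worst case for the second approach: the smallest value is the last element.
    static int[] descending(int n) {
        int[] a = new int[n];
        for (int i = 0; i < n; i++) {
            a[i] = n - i;
        }
        return a;
    }

    public static void main(String[] args) {
        for (int n : new int[] {10, 100, 1000}) {
            int[] input = descending(n);
            System.out.println(n + " items: " + linearTouches(input)
                + " vs " + quadraticTouches(input) + " comparisons");
        }
    }
}
```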

In an interview, you may be asked what the Big-O of an algorithm that you've come up with is.
And even if not directly asked, you should provide that information in order to show that you are
well aware of the need to come up with an efficient solution whenever possible.

What about Big Omega notation?

Big O and Big Omega notations are not the same thing. You can read about the differences here:
Big O versus Big Omega.

What is selectivity in SQL? How is selectivity calculated and how does it relate to a
database index?

The terms selectivity and cardinality are closely related - in fact, the formula used to calculate
selectivity uses the cardinality value. The term selectivity is used when talking about database
indexes. This is the formula to use to calculate the selectivity of an index - don't worry, we do
explain what it all means below:

How to calculate the selectivity of an index:


Selectivity of index = cardinality/(number of records) * 100%
Note that the number of records is equivalent to the number of rows in the table.

What does selectivity mean?

So, you see the formula and you are thinking "that's great, but what does this actually mean?" Well,
let's say we have a table with a "Sex" column which has only two possible values of "Male" and
"Female". Then, that "Sex" column would have a cardinality of 2, because there are only two
unique values that could possibly appear in that column - Male and Female. If there are 10,000
rows in the table, then this means that the selectivity of an index on that particular column will
be 2/10,000 * 100%, which is 0.02%.
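
The same arithmetic can be written as a short Java sketch. The class name SelectivitySketch, the
helper methods, and the alternating sample column are assumptions made purely for illustration;
a real database keeps these figures in its own statistics rather than recomputing them like this.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: compute cardinality and selectivity for one column.
class SelectivitySketch {
    // Cardinality: the number of distinct values that appear in the column.
    static int cardinality(String[] column) {
        Set<String> distinct = new HashSet<>();
        for (String value : column) {
            distinct.add(value);
        }
        return distinct.size();
    }

    // Selectivity of index = cardinality / (number of records) * 100%
    static double selectivityPercent(String[] column) {
        return 100.0 * cardinality(column) / column.length;
    }

    // Made-up "Sex" column in which the two values simply alternate.
    static String[] sexColumn(int rows) {
        String[] column = new String[rows];
        for (int i = 0; i < rows; i++) {
            column[i] = (i % 2 == 0) ? "Male" : "Female";
        }
        return column;
    }

    public static void main(String[] args) {
        String[] column = sexColumn(10000);
        System.out.println("cardinality = " + cardinality(column));
        System.out.println("selectivity = " + selectivityPercent(column) + "%");
    }
}
```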

The key with the selectivity value is that it basically measures how "selective" the values within
a given column are - in other words, how many different values are available in the given sample
set. A selectivity of 0.02% is considered to be really low, and means that given the number of
rows, there is a very small amount of variation in the actual values for that column, as is the
case for our example "Sex" column.

Why does the database actually care about the selectivity and how does it use it? Well, let's
consider what a low selectivity means. A low selectivity basically means there is not a lot of
variation in the values in a column - that there are not a lot of possibilities for the values of a
column. Suppose, using the example table that we discussed earlier, that we want to find the
names of all the females in the table.

How does selectivity affect usage of a database index

Database query optimizers have to make a decision about whether it would actually make sense
to either use the index to find certain rows in a table or to not use the index. This is because there
are times when using the index is actually less efficient than just directly scanning the table
itself. This is something that you should remember: even if a column has an index created for it,
that does not mean the index will always be used, because scanning the table directly without
going through the index first could be a better, more efficient, option.

When is it better to not use a database index?

So, when exactly is it better to not use a database index? Well, when there is a low selectivity
value! Why does a low selectivity mean that using the index is not a good idea? Well, think about
it - let's say we want to run a query that will find the names of all the females in the table - we
are of course assuming that there is another column for "Name" in addition to the "Sex" column.
If we are searching for all the female rows in a table with 10,000 rows then there is a good
chance that 50% of the rows are females, because there really are just two possible values - male
and female. Assuming that 50% of the rows are indeed females, then this means that we would
have to access the index 5,000 times to find all the female rows. Accessing the index takes time,
and consumes resources. If we are accessing the index 5,000 times, it is actually faster to just
directly access the table and do a full table scan. So, you can see that the selectivity value is
used by the query optimizer to determine whether it is more efficient to use an index or just
read the table directly.

What selectivity value determines if an index will be used or not?

It's really hard to say, since that exact value varies from one database to another.

Of course, a high selectivity value means that the index should definitely be used. For example,
if we are dealing with a column that has a selectivity of 100%, then all the values in that column
are unique. This means that if a query is searching for just one of those values then it makes
much more sense to use the index, because it will be far more efficient than risking a full table
scan - which is the worst case scenario if the table is searched directly without consulting the
index first.

What is cardinality in SQL?


The term cardinality actually has two different meanings depending on the context of its usage -
one meaning is in the context of data modeling and the other meaning is in the context of SQL
statements. Let's go through the first meaning - when the word cardinality is used in the context
of data modeling, it simply refers to the relationship that one table can have with another table.
These relationships include: many-to-many, many-to-one/one-to-many, or one-to-one -
whichever one of these characteristics a table has in relationship with another table is said to be
the cardinality of that table. An example will help clarify further.

An example of cardinality in data modeling

Suppose we have three tables that are used by a company to store employee information: an
Employee table, an Employee_Salary table, and a Department table. The Department table will
have a one to many relationship with the Employee table, because every employee can belong to
only one department, but a department can consist of many employees. In other words, the
cardinality of the Department table in relationship to the employee table is one to many. The
cardinality of the Employee table in relationship to the Employee_Salary table will be one to
one, since an employee can only have one salary, and vice versa (yes, two employees can have
the same salary, but there will still be exactly one salary entry for each employee regardless of
whether or not someone else has the same salary).

Example of Cardinality in SQL

The other definition of cardinality is probably the more commonly used version of the term.
In SQL, the cardinality of a column in a given table refers to the number of unique values that
appear in the table for that column. So, remember that the cardinality is a number. For example,
let's say we have a table with a "Sex" column which has only two possible values of "Male" and
"Female". Then, that "Sex" column would have a cardinality of 2, because there are only two
unique values that could possibly appear in that column - Male and Female.

Cardinality of a primary key

Or, as another example, let's say that we have a primary key column on a table with 10,000 rows.
What do you think the cardinality of that column would be? Well, it is 10,000. Because it is a
primary key column, we know that all of the values in the column must be unique. And since
there are 10,000 rows, we know that there are 10,000 entries in the column, which translates to a
cardinality of 10,000 for that column. So, we can come up with the rule that the cardinality of a
primary key column will always be equal to the number of records in the same table.

What does a cardinality of zero mean?

Well, if a column has a cardinality of zero, it means that the column has no distinct (non-NULL)
values at all. This could potentially happen if the column has all NULLs - which means that the
column was never really used anyways.

What is database normalization?

Before anything else, let's state simply why we need database normalization. There are two
primary goals that database normalization looks to achieve:

1. Reduce redundancy. Basically, we don't want to keep unnecessarily
repeating data in our tables.

2. Improve the integrity of the data.


History of database normalization

The concept of normalization was introduced by a researcher at IBM by the name of E.F. Codd
in the 1970s. He was also the inventor of the relational model, which is of course used in
relational databases today.

How is normalization achieved?

So, the next logical question is: how do we actually apply normalization to a database? Well,
there is a set of rules called "forms" that must be followed to normalize a database. You may
have heard of 1st normal form and 2nd normal form - each of these forms defines a different
set of rules.

You can read more about first normal form here: Example of first normal form.

Forms must be followed in order

Note that the forms must be followed in order. So, if we want to achieve 2nd normal form,
then we must first achieve 1st normal form.

If a database follows all the rules for 1st normal form, then the database is said to be "1st normal
form compliant", or "1NF compliant". The same holds true for 2nd normal form.

Key points of normalization

Here are some key points of normalization that you should remember:

- Normalization looks to do 2 things:
  1. Reduce redundancy
  2. Improve the data integrity
- To normalize, we use forms, which are sets of rules.

What is a self join? Explain it with an example and tutorial.

Let's illustrate the need for a self join with an example. Suppose we have the following table
that is called employee. The employee table has 2 columns - one for the employee name (called
employee_name), and one for the employee location (called employee_location):

employee

employee_name    employee_location
Joe              New York
Sunil            India
Alex             Russia
Albert           Canada
Jack             New York

Now, suppose we want to find out which employees are from the same location as the employee
named Joe. In this example, that location would be New York. Let's assume - for the sake of our
example - that we can not just directly search the table for people who live in New York with a
simple query like this (maybe because we don't want to hardcode the city name in the SQL
query):

SELECT employee_name
FROM employee
WHERE employee_location = 'New York';

So, instead of a query like that, what we could do is write a nested SQL query (basically a query
within another query - which is more commonly called a subquery) like this:

SELECT employee_name
FROM employee
WHERE employee_location IN
    ( SELECT employee_location
      FROM employee
      WHERE employee_name = 'Joe');

A subquery is inefficient

Using a subquery for such a simple question can be inefficient, depending on how the database
executes it. Is there a more efficient and elegant solution to this problem?

It turns out that there is - we can use something called a self join. A self
join is basically when a table is joined to itself. The way you should visualize a self join for a
given table is by imagining that a join is performed between two identical copies of that table.
And that is exactly why it is called a self join - because of the fact that it's just the same table
being joined to another copy of itself rather than being joined with a different table.

How does a self join work

Before we come up with a solution for this problem using a self join, we should go over some
concepts so that you can fully understand how a self join works. This will also make the SQL in
our self join tutorial a lot easier to understand, which you will see further below.

A self join must have aliases

In a self join we are joining the same table to itself by essentially creating two copies of that
table. But, how do we distinguish between the two different copies of the table - because there is
only one table name after all? Well, when we do a self join, the table names absolutely must use
aliases, otherwise the column names would be ambiguous. In other words, we would not know
which of the two copies of the table's columns is being referenced without using an alias for each
copy of the table. If you don't already know what an alias is, it's simply another name given to a
table (think of an alias as a nickname), and that nickname is then used in the SQL query to
reference the table. Because we need two copies of the employee table, we will just use the
aliases e1 and e2 for the employee table when we do a self join.

Self join predicate

As with any join, there must be a condition upon which a self join is performed - we can not just
arbitrarily say "do a self join" without specifying some condition. That condition will be our
join predicate. If you need a refresher on join predicates (or just joins in general) then check this
link out: Inner vs. Outer joins.

Now, let's come up with a solution to the original problem using a self join instead of a subquery.
This will help illustrate how exactly a self join works. The key question that we must ask
ourselves is: what should our join predicate be in this example? Well, we want to find all the
employees who have the same location as Joe.

Because we want to match between our two tables (both of which are the same table - employee -
aliased as e1 and e2) on location, our join predicate should clearly be "WHERE
e1.employee_location = e2.employee_location". But is that enough to give us what we want?
No, it's not, because we also want to filter the rows returned, since we only want people who are
from the same location as Joe.

So, how can we filter the rows returned so that only people from Joe's location are returned?
Well, what we can do is simply add a condition on one of the tables (e2 in our example) so that it
only returns the row where the name is Joe. Then, the other table (e1) will match up all the
names that have the same location in e2, because of our join predicate - which is "WHERE
e1.employee_location = e2.employee_location". We will then just select the names from e1, and
not e2, because e2 will only have Joe's name. If that's confusing then keep reading further to
understand more about how the query will work.

So, the self join query that we come up with looks like this:

Self Join SQL Example

SELECT e1.employee_name
FROM employee e1, employee e2
WHERE e1.employee_location = e2.employee_location
AND e2.employee_name = 'Joe';

This query will return the names Joe and Jack since Jack is the only other person who lives in
New York like Joe.
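
To see why the query returns exactly those two names, here is a minimal Java sketch that mimics
the logic of the self join with two nested loops over the same rows. It is a conceptual
illustration only: the class name SelfJoinSketch and the method sameLocationAs are hypothetical,
and a real database is free to execute the join in a completely different way.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: emulate the self join with two loops over one table.
class SelfJoinSketch {
    // Rows of the employee table: {employee_name, employee_location}.
    static final String[][] EMPLOYEE = {
        {"Joe", "New York"},
        {"Sunil", "India"},
        {"Alex", "Russia"},
        {"Albert", "Canada"},
        {"Jack", "New York"},
    };

    // Pair every row of "e1" with every row of "e2" (both are the same table)
    // and keep e1's name when the locations match and e2 is the given person.
    static List<String> sameLocationAs(String name) {
        List<String> result = new ArrayList<>();
        for (String[] e1 : EMPLOYEE) {
            for (String[] e2 : EMPLOYEE) {
                boolean joinPredicate = e1[1].equals(e2[1]); // locations match
                boolean filter = e2[0].equals(name);         // e2 is the person we asked about
                if (joinPredicate && filter) {
                    result.add(e1[0]);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(sameLocationAs("Joe"));
    }
}
```

The loop variables e1 and e2 play the same role as the table aliases: they are two independent
views of the same rows.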

Generally, queries that refer to the same table can be greatly simplified by re-writing the queries
as self joins. And, there is often a performance benefit for this as well.

What does a self join look like?

It will help tremendously to visualize what a self join actually produces internally.
Remember that a self join is just like any other join, where the two tables are merged into one
temporary table. First off, you should visualize that we have two separate copies of the employee
table, which are given aliases of e1 and e2. These copies would simply look like this (note that
we shortened the column names from employee_name and employee_location to just Name and
Location for convenience):

e1                      e2
Name    Location        Name    Location
Joe     New York        Joe     New York
Sunil   India           Sunil   India
Alex    Russia          Alex    Russia
Albert  Canada          Albert  Canada
Jack    New York        Jack    New York


And the final results of running the self join query above (the actual joined table) would look
like this:

e1.employee_name  e1.employee_location  e2.employee_name  e2.employee_location
Joe               New York              Joe               New York
Jack              New York              Joe               New York

Self joins versus inner joins

Are self joins and inner joins the same? You might be wondering if all self joins are also inner
joins. After all, in our example above our self join uses an inner join, because only the rows that
match based on the join predicate are returned; non-matching rows are not returned. Well, it
turns out that a self join and an inner join are completely different concepts. A self join could
just as well be an outer join or an inner join. It just depends on how the query is written. We
could easily change the query we used above to do a LEFT OUTER JOIN while the query still
remains a self join, but that wouldn't give us the results we want in our example. So, we use an
implied inner join instead, because that gives us the correct results. Remember that a query is a
self join as long as the two tables being joined are exactly the same table, but whether it's an
inner join or outer join depends on what is specified in the SQL. And, inner/outer joins are
separate concepts entirely from a self join.

Self joins manager employee example

The most commonly used example for self joins is the classic employee manager table. The table
is called Employee, but holds all employees, including their managers. Every employee has an
ID, and there is also a column for the manager ID. So, for example, let's say we have a table that
looks like this, and we call it Employee:

EmployeeID  Name            ManagerID
1           Sam             10
2           Harry           4
4           Manager         NULL
10          AnotherManager  NULL

Notice that in the table above there are two managers, conveniently named Manager and
AnotherManager. And, those managers don't have managers of their own, as noted by the
NULL value in their ManagerID column.

Now, given the table above, how can we return results that will show each employee's name and
his/her manager's name in nicely arranged results, with the employee in one column and his/her
manager's name in the other column? Well, it turns out we can use a self join to do this. Try to
come up with the SQL on your own before reading our answer.

Self join manager employee answer

In order to come up with a correct answer for this problem, our goal should be to perform a self
join that will have both the employee information and manager information in one row. First off,
since we are doing a self join, it helps to visualize the one table as two tables. Let's give them
aliases of e1 and e2. Now, with that in mind, we want the employee's information on one side of
the joined table and the manager's information on the other side of the joined table. So, let's just
say that we want e1 to hold the employee information and e2 to hold the corresponding
manager's information. What should our join predicate be in that case?

Well, the join predicate should look like ON e1.ManagerID = e2.EmployeeID. This basically
says that we should join the two tables (a self join) based on the condition that the manager ID in
e1 is equal to the employee ID in e2. In other words, for each employee row in e1, the matching
row in e2 holds that employee's manager. An illustration will help clarify this. Suppose we use
that predicate and just select everything after we join the tables. So, our SQL would look like this:

SELECT *
FROM Employee e1
INNER JOIN Employee e2
ON e1.ManagerID = e2.EmployeeID

The results of running the query above would look like this:

e1.EmployeeID  e1.Name  e1.ManagerID  e2.EmployeeID  e2.Name         e2.ManagerID
1              Sam      10            10             AnotherManager  NULL
2              Harry    4             4              Manager         NULL

Note that there are only 2 rows returned. This is because an inner join is performed, which
means that a row is returned only when an employee's manager ID matches some employee ID.
The 2 people without managers (who have a manager ID of NULL) do not show up on the e1
side of the results, because NULL never compares equal to any employee ID, so their rows
have nothing to match.

Now, remember that we only want to return the names of the employee and corresponding
manager as a pair. So, we can fine-tune the SQL as follows:

SELECT e1.Name, e2.Name
FROM Employee e1
INNER JOIN Employee e2
ON e1.ManagerID = e2.EmployeeID

Running the SQL above would return:

Sam AnotherManager
Harry Manager

And that is the answer to the employee manager problem using a self join!
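As a quick check of the manager example, and of the earlier point that a self join can be either an inner or an outer join, here is a small sketch using Python's sqlite3 module with an in-memory database. The use of SQLite and the ORDER BY clauses (added only to make the output order predictable) are our assumptions; the join predicate is the one from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (EmployeeID INTEGER, Name TEXT, ManagerID INTEGER)")
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?)",
    [(1, "Sam", 10), (2, "Harry", 4),
     (4, "Manager", None), (10, "AnotherManager", None)],  # None is stored as NULL
)

# Inner self join: only employees whose ManagerID matches an EmployeeID.
inner_pairs = conn.execute("""
    SELECT e1.Name, e2.Name
    FROM Employee e1
    INNER JOIN Employee e2 ON e1.ManagerID = e2.EmployeeID
    ORDER BY e1.EmployeeID
""").fetchall()

# Still a self join, but a left outer one: every row of e1 is retained,
# and the manager's name is NULL when there is no manager.
left_pairs = conn.execute("""
    SELECT e1.Name, e2.Name
    FROM Employee e1
    LEFT OUTER JOIN Employee e2 ON e1.ManagerID = e2.EmployeeID
    ORDER BY e1.EmployeeID
""").fetchall()

print(inner_pairs)
print(left_pairs)
```

The inner version gives the two employee/manager pairs, while the left outer version also lists the two managers paired with None, which is why the inner join is the better fit for this particular question.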

In SQL, what's the difference between an inner and outer join?

Joins are used to combine the data from two tables, with the result being a new, temporary table.
The temporary table is created based on column(s) that the two tables share, which represent
meaningful column(s) of comparison. The goal is to extract meaningful data from the resulting
temporary table. Joins are performed based on something called a predicate, which specifies the
condition to use in order to perform a join. A join can be either an inner join or an outer join,
depending on how one wants the resulting table to look.

It is best to illustrate the differences between inner and outer joins by use of an example. Here we
have 2 tables that we will use for our example:

Employee              Location
EmpID  EmpName        EmpID  EmpLoc
13     Jason          13     San Jose
8      Alex           8      Los Angeles
3      Ram            3      Pune, India
17     Babu           17     Chennai, India
25     Johnson        39     Bangalore, India

It's important to note that the very last row in the Employee table does not have a matching row
in the Location table. Also, the very last row in the Location table does not have a matching row
in the Employee table. These facts will prove to be significant in the discussion that follows.

Outer Joins

Let's start the explanation with outer joins. Outer joins can be further divided into left outer
joins, right outer joins, and full outer joins. Here is what the SQL for a left outer join would look
like, using the tables above:

select * from employee left outer join location
on employee.empID = location.empID;

In this SQL we are joining on the condition that the employee IDs match in the rows of the two
tables. So, we will essentially be combining 2 tables into 1, based on the condition that the
employee IDs match. Note that we can get rid of the "outer" in left outer join, which will give us
the SQL below. This is equivalent to what we have above.

select * from employee left join location
on employee.empID = location.empID;

A left outer join retains all of the rows of the left table, regardless of whether there is a row that
matches on the right table. The SQL above will give us the result set shown below.

Employee.EmpID Employee.EmpName Location.EmpID Location.EmpLoc

13 Jason 13 San Jose

8 Alex 8 Los Angeles

3 Ram 3 Pune, India

17 Babu 17 Chennai, India

25 Johnson NULL NULL

The Join Predicate: a geeky term you should know


Earlier we had mentioned something called a join predicate. In the SQL above, the join predicate
is "on employee.empID = location.empID". This is the heart of any type of join, because it
determines what common column between the 2 tables will be used to "join" the 2 tables. As you
can see from the result set, all of the rows from the left table are returned when we do a left outer
join. The last row of the Employee table (which contains the "Johnson" entry) is displayed in the
results even though there is no matching row in the Location table. As you can see, the non-
matching columns in the last row are filled with a "NULL". So, we have "NULL" as the entry
wherever there is no match.

A right outer join is pretty much the same thing as a left outer join, except that the rows that are
retained are from the right table. This is what the SQL looks like:

select * from employee right outer join location
on employee.empID = location.empID;

-- taking out the "outer", this also works:

select * from employee right join location
on employee.empID = location.empID;

Using the tables presented above, we can show what the result set of a right outer join would
look like:

Employee.EmpID Employee.EmpName Location.EmpID Location.EmpLoc

13 Jason 13 San Jose

8 Alex 8 Los Angeles

3 Ram 3 Pune, India

17 Babu 17 Chennai, India

NULL NULL 39 Bangalore, India

We can see that the last row returned in the result set contains the row that was in the Location
table, but not in the Employee table (the "Bangalore, India" entry). Because there is no matching
row in the Employee table that has an employee ID of "39", we have NULLs in the result set for
the Employee columns.

Inner Joins
Now that we've gone over outer joins, we can contrast those with the inner join. The difference
between an inner join and an outer join is that an inner join will return only the rows that actually
match based on the join predicate. Once again, this is best illustrated via an example. Here's
what the SQL for an inner join will look like:

select * from employee inner join location on
employee.empID = location.empID

This can also be written as:

select * from employee, location
where employee.empID = location.empID

Now, here is what the result of running that SQL would look like:

Employee.EmpID Employee.EmpName Location.EmpID Location.EmpLoc

13 Jason 13 San Jose

8 Alex 8 Los Angeles

3 Ram 3 Pune, India

17 Babu 17 Chennai, India

Inner vs Outer Joins

We can see that an inner join will only return rows in which there is a match based on the join
predicate. In this case, what that means is anytime the Employee and Location table share an
Employee ID, a row will be generated in the results to show the match. Looking at the original
tables, one can see that those Employee IDs that are shared by those tables are displayed in the
results. But, with a left or right outer join, the result set will retain all of the rows from either the
left or right table.
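The three result sets above can be reproduced with a short sketch using Python's sqlite3 module and an in-memory database. SQLite is our assumption here; because older SQLite versions do not support RIGHT OUTER JOIN, the right join is expressed by swapping the two tables and using a left join, which retains the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (empID INTEGER, empName TEXT);
    CREATE TABLE location (empID INTEGER, empLoc TEXT);
    INSERT INTO employee VALUES
        (13, 'Jason'), (8, 'Alex'), (3, 'Ram'), (17, 'Babu'), (25, 'Johnson');
    INSERT INTO location VALUES
        (13, 'San Jose'), (8, 'Los Angeles'), (3, 'Pune, India'),
        (17, 'Chennai, India'), (39, 'Bangalore, India');
""")

def run(sql):
    """Run a query and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()

# Inner join: only rows whose empID exists in both tables.
inner_rows = run("""
    SELECT employee.empName, location.empLoc
    FROM employee INNER JOIN location ON employee.empID = location.empID
""")
# Left outer join: every employee row is kept; missing locations become NULL.
left_rows = run("""
    SELECT employee.empName, location.empLoc
    FROM employee LEFT OUTER JOIN location ON employee.empID = location.empID
""")
# Same rows as "employee RIGHT OUTER JOIN location", written with the tables swapped.
right_rows = run("""
    SELECT employee.empName, location.empLoc
    FROM location LEFT OUTER JOIN employee ON employee.empID = location.empID
""")
print(len(inner_rows), len(left_rows), len(right_rows))
```

The inner join yields 4 rows, while each outer join yields 5: the extra row is Johnson with a NULL location for the left join, and Bangalore with a NULL employee for the right join.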

Suppose we have the Employee table below, and we want to retrieve all of the cities
that the employees live in, but we don't want any duplicates. How can we do this in
SQL?

employee

employee_name employee_location

Joe New York


Sunil India

Alex Russia

Albert Canada

Jack New York

Alex Russia

In SQL, the DISTINCT keyword will allow us to do that. Here's what the simple SQL would look
like:

SELECT DISTINCT employee_location from employee;

Running this query will return the following results:

employee_location

New York

India

Russia

Canada

So, you can see that the duplicate values for "New York" and "Russia" are not returned in the
results.

It's worth noting that the DISTINCT keyword can be used with more than one column. That
means that only the unique combinations of columns will be returned. Again, this is best
illustrated by an example.

Suppose we run the following SQL:

SELECT DISTINCT employee_name, employee_location from employee;

If we run the SQL above, it will return this:

employee_name employee_location
Joe New York

Sunil India

Alex Russia

Albert Canada

Jack New York

Note that the one extra entry for "Alex, Russia" is missing in the result set above. This is because
when we select a distinct combination of name and location, if there are 2 entries with the same
exact name and location then the SQL that we ran above will only return one of those entries.
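Both DISTINCT queries can be tried out with the following sketch, which assumes an in-memory SQLite database driven from Python's sqlite3 module and loads the six rows of the employee table shown above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (employee_name TEXT, employee_location TEXT)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [("Joe", "New York"), ("Sunil", "India"), ("Alex", "Russia"),
     ("Albert", "Canada"), ("Jack", "New York"), ("Alex", "Russia")],
)

# DISTINCT on a single column: each location appears once.
locations = sorted(
    row[0] for row in conn.execute("SELECT DISTINCT employee_location FROM employee")
)

# DISTINCT on two columns: each (name, location) combination appears once.
pairs = conn.execute(
    "SELECT DISTINCT employee_name, employee_location FROM employee"
).fetchall()

print(locations)
print(len(pairs))
```

Six rows go in, four distinct locations come out of the first query, and five distinct name/location combinations come out of the second, because the repeated "Alex, Russia" row collapses into one.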

In the table below, how would you retrieve the unique values for the
employee_location without using the DISTINCT keyword?

employee

employee_name employee_location

Joe New York

Sunil India

Alex Russia

Albert Canada

Jack New York

Alex Russia

We can actually accomplish this with the GROUP BY clause. Here's what the SQL would look
like:

SELECT employee_location from employee
GROUP BY employee_location

Running this query will return the following results:

employee_location

New York

India

Russia

Canada

So, you can see that the duplicate values for "New York" and "Russia" are not returned in the
results.

This is a valid alternative to using the DISTINCT keyword. If you need a refresher on the
GROUP BY clause, then check out the question on Group By and Having. This question would
probably be asked just to see how good you are at coming up with alternative options for SQL
queries, although it probably doesn't prove much about your SQL skills.
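To see that GROUP BY really does give the same values as DISTINCT here, the sketch below runs both queries side by side. It assumes an in-memory SQLite database accessed through Python's sqlite3 module; the COUNT(*) query at the end is our own addition, included to show what GROUP BY can do beyond removing duplicates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (employee_name TEXT, employee_location TEXT)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [("Joe", "New York"), ("Sunil", "India"), ("Alex", "Russia"),
     ("Albert", "Canada"), ("Jack", "New York"), ("Alex", "Russia")],
)

def column(sql):
    """Return the first column of a query result as a sorted list."""
    return sorted(row[0] for row in conn.execute(sql))

with_distinct = column("SELECT DISTINCT employee_location FROM employee")
with_group_by = column("SELECT employee_location FROM employee GROUP BY employee_location")

# GROUP BY can also report how many rows fell into each group.
counts = dict(conn.execute(
    "SELECT employee_location, COUNT(*) FROM employee GROUP BY employee_location"
))

print(with_distinct == with_group_by)
print(counts)
```

The two lists are identical, and the counts show that "New York" and "Russia" are the locations that appeared twice in the table.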

Practice SQL Interview questions and Answers

There's no better way to improve your SQL skills than to practice with some real SQL interview
questions, and these SQL practice problems are a great way to do that. We recommend first
creating the simple tables presented below in the RDBMS software of your choice (MySQL,
Oracle, DB2, SQL Server, etc.), and then actually trying to figure out the answer on your own
if possible.

The following SQL practice exercises were actually taken from real interview tests with Google
and Amazon. Once again, we highly recommend that you try finding the answers to these SQL
practice exercises on your own before reading the given solutions. The practice problems are
based on the tables presented below.

Salesperson
ID  Name   Age  Salary
1   Abe    61   140000
2   Bob    34   44000
5   Chris  34   40000
7   Dan    41   52000
8   Ken    57   115000
11  Joe    38   38000

Customer
ID  Name      City      Industry Type
4   Samsonic  pleasant  J
6   Panasung  oaktown   J
7   Samony    jackson   B
9   Orange    Jackson   B

Orders

Number order_date cust_id salesperson_id Amount

10 8/2/96 4 2 540

20 1/30/99 4 8 1800

30 7/14/95 9 1 460

40 1/29/98 7 2 2400

50 2/3/98 6 7 600

60 3/2/98 6 7 720

70 5/6/98 9 7 150

Given the tables above, find the following:

a. The names of all salespeople that have an order with Samsonic.

b. The names of all salespeople that do not have any order with Samsonic.

c. The names of salespeople that have 2 or more orders.


d. Write a SQL statement to insert rows into a table called highAchiever(Name, Age),
where a salesperson must have a salary of 100,000 or greater to be included in the table.

Let's start by answering part a. It's obvious that we would need to do a SQL join, because the
data in one table will not be enough to answer this question. This is a good question to get some
practice with SQL joins, so see if you can come up with the solution.

Now, what tables should we use for the join? We know that the customer ID of Samsonic is 4, so
we can use that information and do a simple join with the Salesperson and Orders tables. The
SQL would look like this:

select Salesperson.Name from Salesperson, Orders where
Salesperson.ID = Orders.salesperson_id and Orders.cust_id = 4;

We can also use subqueries (a query within a query) to come up with another possible answer.
Here is an alternative solution using a subquery, which is more verbose and, depending on the
database, may be less efficient. Note the use of IN rather than =, because the subquery can
return more than one salesperson ID:

select Salesperson.Name from Salesperson where
Salesperson.ID in (select Orders.salesperson_id from Orders,
Customer where Orders.cust_id = Customer.ID
and Customer.Name = 'Samsonic');
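Here is a sketch that builds the three practice tables and runs both answers to part a. It assumes an in-memory SQLite database used through Python's sqlite3 module; the column name IndustryType (without a space) and storing order_date as plain text are our simplifications.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Salesperson (ID INTEGER, Name TEXT, Age INTEGER, Salary INTEGER);
    CREATE TABLE Customer (ID INTEGER, Name TEXT, City TEXT, IndustryType TEXT);
    CREATE TABLE Orders (Number INTEGER, order_date TEXT, cust_id INTEGER,
                         salesperson_id INTEGER, Amount INTEGER);
    INSERT INTO Salesperson VALUES
        (1, 'Abe', 61, 140000), (2, 'Bob', 34, 44000), (5, 'Chris', 34, 40000),
        (7, 'Dan', 41, 52000), (8, 'Ken', 57, 115000), (11, 'Joe', 38, 38000);
    INSERT INTO Customer VALUES
        (4, 'Samsonic', 'pleasant', 'J'), (6, 'Panasung', 'oaktown', 'J'),
        (7, 'Samony', 'jackson', 'B'), (9, 'Orange', 'Jackson', 'B');
    INSERT INTO Orders VALUES
        (10, '8/2/96', 4, 2, 540), (20, '1/30/99', 4, 8, 1800),
        (30, '7/14/95', 9, 1, 460), (40, '1/29/98', 7, 2, 2400),
        (50, '2/3/98', 6, 7, 600), (60, '3/2/98', 6, 7, 720),
        (70, '5/6/98', 9, 7, 150);
""")

def names(sql):
    """Return the first column of a query result as a sorted list."""
    return sorted(row[0] for row in conn.execute(sql))

# Answer using a join on Salesperson and Orders (Samsonic has customer ID 4).
via_join = names("""
    SELECT Salesperson.Name FROM Salesperson, Orders
    WHERE Salesperson.ID = Orders.salesperson_id AND Orders.cust_id = 4
""")

# Answer using a subquery that looks Samsonic up by name.
via_subquery = names("""
    SELECT Salesperson.Name FROM Salesperson
    WHERE Salesperson.ID IN (
        SELECT Orders.salesperson_id FROM Orders, Customer
        WHERE Orders.cust_id = Customer.ID AND Customer.Name = 'Samsonic')
""")

print(via_join, via_subquery)
```

Both queries return Bob and Ken, the two salespeople attached to orders 10 and 20.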

Practice SQL Interview Questions

Let's now work on answering parts B and C of the original question. We present the tables below
again for your convenience.

Here is part B: Find the names of all salespeople that do not have any orders with
Samsonic.

This is part C: Find the names of salespeople that have 2 or more orders.

Salesperson
ID  Name   Age  Salary
1   Abe    61   140000
2   Bob    34   44000
5   Chris  34   40000
7   Dan    41   52000
8   Ken    57   115000
11  Joe    38   38000

Customer
ID  Name      City      Industry Type
4   Samsonic  pleasant  J
6   Panasung  oaktown   J
7   Samony    jackson   B
9   Orange    Jackson   B

Orders

Number order_date cust_id salesperson_id Amount

10 8/2/96 4 2 540

20 1/30/99 4 8 1800

30 7/14/95 9 1 460

40 1/29/98 7 2 2400

50 2/3/98 6 7 600

60 3/2/98 6 7 720

70 5/6/98 9 7 150

Part B of the question asks for the names of the salespeople who do not have an order with
Samsonic. A good way to approach this problem is to break it down: first, find all the
salespeople who do have an order with Samsonic. Then, we can work with that list to get all
the salespeople who do not have an order with Samsonic.

So, let's start by just getting a list of all the salespeople IDs that have an order with Samsonic.
We can get this list by doing a join with a condition that the customer is Samsonic. We can use
both the Customer and Orders tables. The SQL for this will look like:

select Orders.salesperson_id from Orders, Customer where
Orders.cust_id = Customer.ID and Customer.Name = 'Samsonic'

This will give us a list of all the salespeople IDs that have an order with Samsonic. Now, we can
get a list of the names of all the salespeople who do NOT have an order with Samsonic. SQL has
a NOT operator that easily allows us to exclude elements of the result set. We can use this to
our advantage. Here is one possible answer to question B, and this is what the final SQL will
look like:

select Salesperson.Name from Salesperson
where Salesperson.ID NOT IN(
select Orders.salesperson_id from Orders, Customer
where Orders.cust_id = Customer.ID
and Customer.Name = 'Samsonic')

Now, let's work on answering part C. As always, it's best to break the problem down into more
manageable pieces. So, let's focus on one table: the Orders table. Looking at that table we can
find the IDs that belong to the salespeople who have 2 or more orders. This will require use of
the "group by" syntax in SQL, which allows us to group by whatever column we choose. In this
case, the column that we would be grouping by is the salesperson_id column, because for a given
salesperson ID we would like to find out how many orders were placed under that ID. With that
said, we can write this SQL:

select salesperson_id from Orders group by
salesperson_id having count(salesperson_id) > 1

Note how we used the having clause instead of the where clause, because we are filtering on the
count aggregate. Well, now we have a SQL statement that gives us the IDs of the salespeople who
have more than 1 order. But what we really want is the names of the salespeople who have those
IDs. This is actually quite simple if we do a join on the Salesperson and Orders tables, and use
the SQL that we came up with earlier. It would look like this:

SELECT name
FROM Orders, Salesperson
WHERE Orders.salesperson_id = Salesperson.id
GROUP BY name, salesperson_id
HAVING COUNT( salesperson_id ) > 1

Based on our tables, this SQL will return the names of Bob and Dan.
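The answers to parts B and C can be checked with the sketch below. As an assumption on our part, it uses an in-memory SQLite database through Python's sqlite3 module, with the Industry Type column named IndustryType and order dates stored as text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Salesperson (ID INTEGER, Name TEXT, Age INTEGER, Salary INTEGER);
    CREATE TABLE Customer (ID INTEGER, Name TEXT, City TEXT, IndustryType TEXT);
    CREATE TABLE Orders (Number INTEGER, order_date TEXT, cust_id INTEGER,
                         salesperson_id INTEGER, Amount INTEGER);
    INSERT INTO Salesperson VALUES
        (1, 'Abe', 61, 140000), (2, 'Bob', 34, 44000), (5, 'Chris', 34, 40000),
        (7, 'Dan', 41, 52000), (8, 'Ken', 57, 115000), (11, 'Joe', 38, 38000);
    INSERT INTO Customer VALUES
        (4, 'Samsonic', 'pleasant', 'J'), (6, 'Panasung', 'oaktown', 'J'),
        (7, 'Samony', 'jackson', 'B'), (9, 'Orange', 'Jackson', 'B');
    INSERT INTO Orders VALUES
        (10, '8/2/96', 4, 2, 540), (20, '1/30/99', 4, 8, 1800),
        (30, '7/14/95', 9, 1, 460), (40, '1/29/98', 7, 2, 2400),
        (50, '2/3/98', 6, 7, 600), (60, '3/2/98', 6, 7, 720),
        (70, '5/6/98', 9, 7, 150);
""")

def names(sql):
    """Return the first column of a query result as a sorted list."""
    return sorted(row[0] for row in conn.execute(sql))

# Part B: salespeople with no Samsonic order (NOT IN over the Samsonic list).
part_b = names("""
    SELECT Salesperson.Name FROM Salesperson
    WHERE Salesperson.ID NOT IN (
        SELECT Orders.salesperson_id FROM Orders, Customer
        WHERE Orders.cust_id = Customer.ID AND Customer.Name = 'Samsonic')
""")

# Part C: salespeople with 2 or more orders (GROUP BY with HAVING on the count).
part_c = names("""
    SELECT Name FROM Orders, Salesperson
    WHERE Orders.salesperson_id = Salesperson.ID
    GROUP BY Name, salesperson_id
    HAVING COUNT(salesperson_id) > 1
""")

print(part_b)
print(part_c)
```

Part B returns everyone except Bob and Ken, including Chris and Joe, who have no orders at all. Part C returns Bob (2 orders) and Dan (3 orders).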

Practice SQL Interview Questions

We've finally come to the last part of this question. Question D is presented below again for your
convenience.

Part D: Write a SQL statement to insert rows into a table called highAchiever(Name, Age),
where a salesperson must have a salary of 100,000 or greater to be included in the table.

Looking at part D, it's easy to come up with the SQL to specify the condition that the salary of
the salesperson must be greater than or equal to 100,000. It would look like this: "WHERE SALARY
>= 100000". The only slightly difficult part of this question is how we insert values into the
highAchiever table while selecting values from the Salesperson table. It turns out that the SQL
for this is:

insert into highAchiever (name, age)
select name, age from salesperson where salary >= 100000;

Because we are inserting values into the highAchiever table based on what we select from
another table, we don't use the "values" clause that we would normally use when inserting. This
is what a regular insertion would look like (note the use of the "values" clause):

insert into highAchiever(name, age) values ('Jackson', 28)
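A small sketch of the insert-from-select approach for part D follows, again assuming an in-memory SQLite database and Python's sqlite3 module. The highAchiever table definition is our guess at a reasonable schema, since the question only gives its column names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Salesperson (ID INTEGER, Name TEXT, Age INTEGER, Salary INTEGER);
    CREATE TABLE highAchiever (Name TEXT, Age INTEGER);  -- assumed column types
    INSERT INTO Salesperson VALUES
        (1, 'Abe', 61, 140000), (2, 'Bob', 34, 44000), (5, 'Chris', 34, 40000),
        (7, 'Dan', 41, 52000), (8, 'Ken', 57, 115000), (11, 'Joe', 38, 38000);
""")

# INSERT ... SELECT: the rows to insert come from a query, so there is no VALUES clause.
conn.execute("""
    INSERT INTO highAchiever (Name, Age)
    SELECT Name, Age FROM Salesperson WHERE Salary >= 100000
""")

high_achievers = conn.execute(
    "SELECT Name, Age FROM highAchiever ORDER BY Name"
).fetchall()
print(high_achievers)
```

Only Abe and Ken have a salary of 100,000 or more, so those are the two rows that end up in highAchiever.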

As you can see, the answer to this one is pretty simple.

Practice SQL Interview Question #2

This question was asked in a Google interview: Given the 2 tables below, User and
UserHistory:

User
user_id
name
phone_num

UserHistory
user_id
date
action

1. Write a SQL query that returns the name, phone number and most recent date for any user
that has logged in over the last 30 days (you can tell a user has logged in if the action field in
UserHistory is set to "logged_on").

Every time a user logs in, a new row is inserted into the UserHistory table with the user_id,
the current date, and the action (where action = "logged_on").

2. Write a SQL query to determine which user_ids in the User table are not contained in the
UserHistory table (assume the UserHistory table has a subset of the user_ids in User table).
Do not use the SQL MINUS statement. Note: the UserHistory table can have multiple entries
for each user_id.

Note that your SQL should be compatible with MySQL 5.0, and avoid using subqueries.

Let's start with #1 by breaking down the problem into smaller, more manageable problems. Then
we can take the pieces and combine them to provide a solution to the overall problem.

Figuring out how to tell whether a user has logged on in the past 30 days seems like a good place
to start. We want to see how we can express this in MySQL. You can look online for some MySQL
functions that will help with this calculation. MySQL has a "date_sub" function, to which we can
pass the current date (as in today's date) and an interval of 30 days, and it will return the date
30 days before today. Once we have that date, we can compare it with the date in the
UserHistory table to see if it falls within the last 30 days. One question that remains is how we
will retrieve the current date. This is simple, because MySQL comes with a built-in function
called curdate() that will return the current date.

So, using the date_sub function, we can come up with this piece of SQL:

UserHistory.date >= date_sub(curdate(), interval 30 day)

This will check to see that the date in the UserHistory table falls within the last 30 days. Note
that we use the ">=" operator to compare dates. In this case, we are simply saying that the date
in the UserHistory table is greater than or equal to the date returned from the date_sub function.
A date is "greater" than another date when it occurs further in the future than the other date. So,
2007-09-07 will be considered "greater" than 2006-08-19, because 2007-09-07 occurs further in the
future than 2006-08-19.

Now, that's only one piece of the overall problem, so let's continue. The problem asks us to
retrieve the name, phone number, and the most recent date for any user that's logged in over
the last 30 days. We have one table with the user_id and the phone number, but only the other
table contains the actual date. Clearly, we will have to do a join on the 2 tables in order to
combine the data into a form that will allow us to solve this problem. And since the 2 tables only
share one column (the user_id column), it's clear what common column we will use to join the
2 tables. Doing a join, selecting the required fields, and using the date condition will look like
this:

select name, phone_num, date from User, UserHistory
where User.user_id=UserHistory.user_id
and UserHistory.date >= date_sub(curdate(), interval 30 day)

So far, we are selecting the name, phone number, and the date for any user that's logged in over
the last 30 days. But, wait a minute: the problem specifically asks for "the most recent date for
any user that's logged in over the last 30 days." The problem with our query is that we could get
multiple entries for a user that logged on more than once in the last 30 days. That is not what we
want. We want to see only the most recent date that someone logged on in the last 30 days,
which means a maximum of 1 entry per user.

Now, the question is: how do we get the most recent date? This is quite simple again, as MySQL
provides a MAX aggregate function that we can use to find the most recent date. Given a group
of dates, the MAX function will return the "maximum" date, which is basically just the most
recent date (the one furthest in the future). Because this is an aggregate function, we will have to
provide the GROUP BY clause in order to specify what column we would like to use as a
container of the group of dates. So, now our SQL looks like this:

select User.name, User.phone_num, max(UserHistory.date)
from User, UserHistory
where User.user_id = UserHistory.user_id and
UserHistory.date >= date_sub(curdate(), interval 30 day)
group by (User.user_id);

Now all we need is to add the condition that checks to see that the user's action equals
"logged_on". So, the final SQL, and the answer to the problem, looks like this:

select User.name, User.phone_num, max(UserHistory.date)
from User, UserHistory
where User.user_id = UserHistory.user_id
and UserHistory.action = 'logged_on'
and UserHistory.date >= date_sub(curdate(), interval 30 day)
group by (User.user_id);
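If you don't have MySQL handy, the logic of the final query can still be tried out with the sketch below, which uses an in-memory SQLite database from Python's sqlite3 module. This is an adaptation, not the MySQL answer itself: SQLite has no date_sub or curdate, so the cutoff is written as date('now', '-30 day') instead. The three users and their history rows are made-up sample data, and dates are stored as ISO text (YYYY-MM-DD) so that string comparison matches date order.

```python
import sqlite3
from datetime import date, timedelta

def days_ago(n):
    """ISO date string (YYYY-MM-DD) for n days before today."""
    return (date.today() - timedelta(days=n)).isoformat()

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE User (user_id INTEGER, name TEXT, phone_num TEXT);
    CREATE TABLE UserHistory (user_id INTEGER, date TEXT, action TEXT);
    INSERT INTO User VALUES
        (1, 'Ann', '555-0001'), (2, 'Raj', '555-0002'), (3, 'Mia', '555-0003');
""")
conn.executemany(
    "INSERT INTO UserHistory VALUES (?, ?, ?)",
    [(1, days_ago(10), "logged_on"),   # Ann logged on twice recently
     (1, days_ago(5), "logged_on"),
     (2, days_ago(60), "logged_on"),   # Raj's login is older than 30 days
     (3, days_ago(2), "logged_off")],  # Mia has a recent row, but not a login
)

# Same structure as the MySQL answer, with SQLite's date('now', '-30 day')
# standing in for date_sub(curdate(), interval 30 day).
recent_logins = conn.execute("""
    SELECT User.name, User.phone_num, MAX(UserHistory.date)
    FROM User, UserHistory
    WHERE User.user_id = UserHistory.user_id
      AND UserHistory.action = 'logged_on'
      AND UserHistory.date >= date('now', '-30 day')
    GROUP BY User.user_id
""").fetchall()
print(recent_logins)
```

Only Ann is returned, with a single row carrying her most recent login date, even though she has two qualifying rows in UserHistory.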

Phew! We are finally done with question 1.

Practice SQL Interview Question #2

Given the 2 tables below, User and UserHistory:

User
user_id
name
phone_num

UserHistory
user_id
date
action

Let's continue with the 2nd question, presented again below:

2. Given the tables above, write a SQL query to determine which user_ids in the User table are
not contained in the UserHistory table (assume the UserHistory table has a subset of the
user_ids in User table). Do not use the SQL MINUS statement. Note: the UserHistory table
can have multiple entries for each user_id.

Note that your SQL should be compatible with MySQL 5.0, and avoid using subqueries.

Basically, we want the user_ids that exist in the User table but not in the UserHistory table. If
we do a regular inner join on the user_id column, then that would just do a join on all the
rows in which the User and UserHistory tables share the same user_id values. But the
question specifically asks for just the user_ids that are in the User table, but are not in the
UserHistory table. So, using an inner join will not work.

What if, instead of an inner join, we use a left outer join on the user_id column? This will
allow us to retain all the user_id values from the User table (which will be our "left" table)
even when there is no matching user_id entry in the "right" table (in this case, the
UserHistory table). When there is no matching record in the "right" table, the entry will just
show up as NULL. This means that any NULL entries are user_id values that exist in the User
table but not in the UserHistory table. This is exactly what we need to answer the question. So,
here's what the SQL will look like:

select distinct u.user_id
from User as u
left join UserHistory as uh on u.user_id=uh.user_id
where uh.user_id is null

You may be confused by the "User as u" and the "UserHistory as uh" syntax. Those are
what are called aliases. Aliases allow us to assign a shorter name to a table, which makes for
cleaner and more compact SQL. In the example above, "u" will actually be another name for
the "User" table and "uh" will be another name for the "UserHistory" table.

We also use the distinct keyword. This will ensure that each user_id is returned only once.
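The left join plus "is null" pattern can be tried with the short sketch below, which assumes an in-memory SQLite database driven from Python's sqlite3 module. The user names and history rows are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE User (user_id INTEGER, name TEXT, phone_num TEXT);
    CREATE TABLE UserHistory (user_id INTEGER, date TEXT, action TEXT);
    INSERT INTO User VALUES
        (1, 'Ann', '555-0001'), (2, 'Raj', '555-0002'), (3, 'Mia', '555-0003');
    -- User 1 has two history rows, user 3 has one, and user 2 has none.
    INSERT INTO UserHistory VALUES
        (1, '2012-01-05', 'logged_on'), (1, '2012-01-09', 'logged_on'),
        (3, '2012-01-07', 'logged_on');
""")

# Left join keeps every User row; rows with no match have NULL in uh.user_id.
missing_ids = [row[0] for row in conn.execute("""
    SELECT DISTINCT u.user_id
    FROM User AS u
    LEFT JOIN UserHistory AS uh ON u.user_id = uh.user_id
    WHERE uh.user_id IS NULL
""")]
print(missing_ids)
```

User 2 is the only one without a UserHistory row, so that is the only ID returned, and no subquery or MINUS was needed.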

That concludes our series of practice SQL interview questions. If you are looking for some more
advanced and challenging SQL interview questions, then check out our other articles on advanced
SQL practice questions.

In SQL, what's the difference between the having clause and the group by statement?

In SQL, the having clause and the group by statement work together when using aggregate
functions like SUM, AVG, MAX, etc. This is best illustrated by an example. Suppose we have a
table called emp_bonus as shown below. Note that the table has multiple entries for employees
A and B which means that both employees A and B have received multiple bonuses.

emp_bonus
Employee Bonus
A 1000
B 2000
A 500
C 700
B 1250
If we want to calculate the total bonus amount that each employee has received, then we
would write a SQL statement like this:

select employee, sum(bonus) from emp_bonus group by employee;

The Group By Clause

In the SQL statement above, you can see that we use the "group by" clause with the employee
column. The group by clause allows us to find the sum of the bonuses for each employee
because each employee is treated as his or her very own group. Using the group by in
combination with the sum(bonus) statement will give us the sum of all the bonuses for
employees A, B, and C.

Running the SQL above would return this:

Employee Sum(Bonus)
A 1500
B 3250
C 700

Now, suppose we wanted to find the employees who received more than $1,000 in bonuses for
the year of 2012 (this is assuming, of course, that the emp_bonus table contains bonuses only
for the year of 2012). This is when we need to use the HAVING clause to add the additional
check to see if the sum of bonuses is greater than $1,000, and this is what the SQL looks like:

GOOD SQL:
select employee, sum(bonus) from emp_bonus
group by employee having sum(bonus) > 1000;

And the result of running the SQL above would be this:

Employee Sum(Bonus)
A 1500
B 3250

Difference between having clause and group by statement

So, from the example above, we can see that the group by clause is used to group column(s) so
that aggregates (like SUM, MAX, etc.) can be used to find the necessary information. The
having clause is used with the group by clause when comparisons need to be made with those
aggregate functions, like checking whether the SUM is greater than 1,000, as in our example
above. So, the having clause and the group by statement are not really alternatives to each
other. Instead, they are used alongside one another!
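To see group by and having working together, here is a sketch of the emp_bonus example. It assumes an in-memory SQLite database accessed from Python's sqlite3 module, and the ORDER BY clauses are our addition so that the rows print in a predictable order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp_bonus (employee TEXT, bonus INTEGER)")
conn.executemany(
    "INSERT INTO emp_bonus VALUES (?, ?)",
    [("A", 1000), ("B", 2000), ("A", 500), ("C", 700), ("B", 1250)],
)

# GROUP BY alone: one row per employee with the total of that employee's bonuses.
totals = conn.execute("""
    SELECT employee, SUM(bonus) FROM emp_bonus
    GROUP BY employee
    ORDER BY employee
""").fetchall()

# GROUP BY with HAVING: keep only the groups whose total exceeds 1000.
over_1000 = conn.execute("""
    SELECT employee, SUM(bonus) FROM emp_bonus
    GROUP BY employee
    HAVING SUM(bonus) > 1000
    ORDER BY employee
""").fetchall()

print(totals)
print(over_1000)
```

The first query lists all three employees with their totals, and the second drops employee C, whose total of 700 does not pass the having condition.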
