In this video, we are going to continue on from our previous video where we had brought
in several hundred thousand records from the federal election campaign of president Obama
in 2008. We will actually perform some queries against this data. We will look at how Access,
and relational databases in general, can be used to analyze large sums of data.
If I go to the Relationships view, I see my primary keys in each table and I see the one
to many relationship to the foreign key in the contribution table. If I go to the contribution
table, I see the primary key, the contribution amount, and the date of the contribution,
and I see the foreign key to the contribution table.
If you look at the number of records, I have 317,000 records in the contribution table
and I have 76,000 records in the contributor table. If I look at the contributor table,
I have the name, city, employer, occupation, and primary key for each contributor. I am
going to close both of my tables and I am going to create some queries. I am going to
go to the Create ribbon and click on Query Design.
I am going to add both of my tables. I can do this control-clicking on each table to
make sure it is included or I can also double click on a table and it will add it, and I
will click on close. Because my relationship has been setup by my analyze table function,
I can see the relationship that exists between these two tables. If I just wanted to dump
the information, like what I had in Excel, I could drag down the contributor name, city,
employer, occupation, the receipt amount, and the receipt date.
Notice that I am ignoring the keys, both the primary keys that exist in both tables, and
the foreign key that helps me build my relationship. When I click on the run button, I see all
the information as it appeared in my Excel spreadsheet.
If I wanted to, I could create a query that contained only the contributor table and just
show the contributor names.
If I wanted to see all of the contributors that were attorneys, I can right click on
my query, go to Design view, and where it says criteria, I can type in the word “attorney”
in quotes and then I can run my query. And I can see all of the attorneys that are in
my database. There were 6,424 that identified themselves as attorneys. In some cases, I
may see typos or duplicate records that were introduced by spelling errors when forms were
submitted or changes in employment. The data isn’t terribly accurate.
I want to some analysis with my data. For example, I will go back to the Create ribbon
and click on Query Design, and I am interested in seeing which occupations donated the most
money. So I am going to drag down the occupation field, and I am going to drag down the receipt
amount field, and then I am going to go to the Query Tools Design ribbon and click on
Totals, and I want to group by occupation while summing the contribution amount. And
also what I will do is I will sort by the contribution amount descending, which means
it is going to show me my highest summed contribution amount first and I will click on run.
Notice that attorneys are the highest contributors, followed by retirees, people who did not type
in their information. If I wanted to I could change my sort order here. We also see blank
data, see lawyers, physicians, consultants, etc.
I am going to create a new query. This time I want to see which city on average contributed
the most money. So I am going to pull down contributor city, and then I am going to pull
down contributor amount. I am going to click on Totals. I am going to group by contributor
city, but I am going to average contribution amount, and I am going to click on run. I
forgot to sort my data, so I can right click and say Sort Largest to Smallest and do it
outside of my query. Looking at this data, it strikes that the
information may not be terribly reliable because it is very likely that the max contribution
of $2,300 could be donated by one person and their city will then appear at the top of
the list. So I am going to go back to Query Design and this time I will remember my sort.
And I am going to add a criteria so I am going to pull down one of the contribution fields.
It doesn’t really matter which one I pull down because I am still grouping by the contribution
city and I am going to count the number of contributions in each city. And then I am
going to run my query to see if this affecting my results.
Now when I run my query, I can see that my suspicion was accurate: that cities with only
one contributor or two contributors skew the results. I want to ignore these, so let me
take a brief look at my data and it looks like I have a few cities where we have many
contributors. So I am going to pick a number, let’s say 10, and I am going to make that
the requirement for information to appear in my query. I am going to go to the criteria
field for the contribution count and I am going to put in a greater than sign (<) and
I am going to put in 9, so only contributions greater than 9 are going to appear in my query.
I am going to run my query, and here I see information that seems much more valid. So
Grafton is appearing as the top contributor city followed by Millneck, Purchase, Kings
Point, etc. If I did not want the count to appear in my query, but I still want to use
the count to process my query, I can click on the show box. The count will still affect
my query, but will not appear in my query. As I create queries like this, I am actually
creating Structured Query Language queries behind the scenes. If I Right Click on my
query tab and click on SQL view, it shows me the SQL code that is being generated.
I am going to create a new query. Suppose I am interested in finding cases of fraud.
Not implying that fraud was committed by the election committee, but websites such as this
that provide the ability to process donations via credit card have the potential of being
used for credit card fraud to test to see whether or not a credit card, a stolen credit
card, is valid. I am going to create another query and going
to the Create ribbon. I am going to click on query design. I am going to add both of
my tables. I am going to look for unusual contributions. Particularly, I am going to
look for contributions where there have been a high number of contributions. So I am going
to drag down the contributor name, city, employer, and occupation. And I am going to drag one
of the contribution fields. It does not matter which contribution field I pull down because
I am going to use these field to count. I am going to click on the Totals icon, grouping
by all of my contributor fields, but counting by contributions. I will need to sort by the
count descending and I will click on Run. And I see a very suspicious entry at the top
of the list. Doodad Pro in the city of the Nunda, NY with the employer of Loving and
the occupation of You contributed a total of 1,920 times.
If I create a new query and look at the actual contributions, adding both tables, I am going
to pull down my fields. For name, I am going to add a criteria and I will do my search.
And here I can see my contributions are negative numbers and positive numbers. As these fraudulent
donations were detected, the charges were reversed.
I am going to do a Google search for this information and see how it appears in the
news. So I will do a search for Doodad Pro and I immediately find some articles in Newsweek,
ABC News, Politico, that sort of thing. If I click on Newsweek, there is an article talking
about the campaign and they discuss this information. The person at the contributor’s address,
which isn’t listed in the public data, has never actually even heard of the person, so
this is a case of Internet fraud and using queries like in Access we can detect this
kind of fraud.