Sunteți pe pagina 1din 6

Preliminary Instructions: In order for my research task to be successfully completed the well documented R

code is to be written precisely clearly answering all questions.

The essential data for the project can be downloaded here:


https://www.dropbox.com/s/p41n7unv7y75sib/HW_3.zip?dl=0

Does Gender Matter?


In this problem we will try to understand how the gender a ffects the probability to
win a gold medal at the international mathematics olympiad (IMO);
1. Data on participants in international mathematics olympiads (IMO) -
folder, IMO_ALL, contains IMO results at the individual level from 1959
to 2018, obtained from https://www.imo-official.org/;

Part 1 - Getting Ready


1. Write a function which takes as input one of the files in the folder
IMO_ALL (i.e., it should take a string variable, with the full name of the
file, for example, “1959.html”) and returns a data.table with the fol-
lowing variables:
(a) Name and the surname of the participant - this variable should be
called name;
(b) Removes observations where the gender indicator is missing;
(c) Gender - this variable should be called gender;
(d) Country - this variable should be called country;
(e) Score of p1 - this variable should be called p_1;
(f) Score of p2 - this variable should be called p_2;
(g) Score of p3 - this variable should be called p_3;
(h) Score of p4 - this variable should be called p_4;
(i) Score of p5 - this variable should be called p_5;
(j) Score of p6 - this variable should be called p_6;
(k) Score of p7 (notice that p7 is present only in some years) - this
variable should be called p_7;
(l) Total score - this variable should be called total;
(m) Rank - this variable should be called rank;
(n) Award - this variable should be called medal;
(o) Year when the Olympiad took place - this variable should be called
year;
Remove from the data all participants whose name equals * or ?; Also
remove all participants who have ? (question mark) in their name (i.e.,
there is a question mark within a string);

(a) read_html from xml2 package;


(b) Then via xpath find all .//tr nodes;
(c) Then for each .//tr node find all .//td nodes.

1
2. Write a function, which takes as an input, an absolute path to folder
IMO_ALL, then applies to each .html file in this folder a function from
point 1, combines results together row wise and outputs a .csv file
imo_all.csv;
3. Clean data a little bit:
You should write a function, function_clean, which does the following:
(a) Takes as an input imo_all.csv (please use fread to read .csv file);
(b) Removes leading “\n” , gender indicator and \n at the end of the name
- for example \nBohuslav Diviš \n should now look like, Bohuslav
Diviš;
(c) For the variable name, trims white spaces at the beginning and the
end of the string;
(d) Creates new variables:
i. gold_ind =1 if a medal variable contains strings “Gold” or
“gold” and 0 otherwise;
ii. silv_ind=1 if a medal variable contains strings “Silver” or
“silver” and 0 otherwise.
iii. d_female=1 if a participant is a female and 0 otherwise.
iv. nr_order - the variable showing which time a given participant
is participating in an olympiad, participant here is defined by
the combination of a country and name;
(e) Replaces empty cells for p1 - p7 with NAs;
(f) Outputs the file “final_data_imo.csv”;

Part 2 Does Gender Matter?


Reminder: linear probability model is the same as OLS, except your
dependent variable now lies in f0; 1g set.

1. We try to understand how the probability to win a gold medal at IMO


varies between men and women. We do that using linear probability
model. In the .pdf file please report your estimated coe fficients and
pro-vide explanations. Please report only estimates of coe fficients you
are asked to report.

2
(a) Run the the following model:

3
and control linearly for nr_orderi. Report an estimate of , ,
is it statistically significant? Is this model more or less flexible
than the one you estimated in point d);
Conclude whether gender affects the probability to obtain a gold at IMO.

Detection of Suspicious Auctions


In this question we will try to detect anomalies in the bidding behavior. To do
that we will analyze two datasets:
notifications_final - contains information on notifications for three places
in Russia;
protocols_final - contains information on bids for three places in Russia.

Exactly the same comments about the way we will grade functions apply
as in the previous problem.

1. Write a function, function_fields_extract_not_zk, which takes as an


argument an absolute path to notificationzk.csv and does the following:
(a) Reads this file using fread, as a column separator please use “#”
symbol, the class of each column should be “character”;
(b) Keeps following fields in the data:
i. notificationnumber - notification number;
ii. versionnumber - version of the notification;
iii. publishdate - publication date;
iv. customerrequirement_maxprice - starting price in the auction;
v. notificationcommission_p1date - bid submission start date;
vi. notificationcommission_p2date - bid submission end date.
2. Write a function, function_combine_not_zk, which does the following:
(a) As input takes absolute path to the folder, notifications_final;
(b) Applies function_fields_extract_not_zk to all files with the name
notificationzk.csv within folder notifications_final (including subfold-
ers) and combines them together into a data.table;
(c) Orders this data.table by notificationnumber, versionnumber and
publishdate in the decreasing order;
(d) Keeps unique observations by notificationnumber;

(e) Keeps only those observations where field


customerrequirement_maxprice does not contain “&” symbol;
(f) Removes those observations whose notificationnumber starts with
“99”;
(g) Creates a new variable days which equals to the time di fference in
days between notificationcommission_p2date and notificationcom-
mission_p1date;
(h) Writes the created data.table to file, notification_zk_clean.csv - this
file should be written to the folder notifications_final;
(i) Please perform all these operations in the order which is specified
above.
3. Write a function, function_fields_extract_prot_zk, which as an argu-ment
takes an absolute path to the file protocolzk1.csv and does the fol-
lowing:
(a) Opens the file (please use fread to read .csv file, use “#” as a col-
umn separator, column classes for all variables should be set to be
characters);
(b) Extracts following variables:
i. protocoldate;
ii. notificationnumber - notification id;
iii. protocolprotocolapplications_application_journalnumber - this
variable provides an information on unique identifiers of bid-
ders (separated by &&&&), for example, for notification number
0348100035711000002, this variable equals 1&&&&3 - this tells
us that there were two bidders in this auction, first had an iden-
tifier 1 and the second had an identifier 3, notice that numeric
values do not have a meaning here.
iv. protocolprotocolapplications_application_price - this variable pro-
vides us with bids of all bidders;
v. protocolprotocolapplications_application_appdate - provides you
with the information on date and time when a particular bid was
submitted, “T” and “Z” here do not have any meaning and should
be replaced by empty characters.
4. Write a function, function_combine_prot_zk, which does the following:
(a) As an input takes absolute path to folder protocols_final;
(b) Applies the function_fields_extract_prot_zk to all files called pro-
tocolzk1.csv within folder protocols_final (including subfolders) and
combines them together into a data.table;
(c) Orders this data.table by notificationnumber, protocoldate in the de-
creasing order;

(d) Keeps unique observations by notificationnumber;


(e) Removes those observations numbers whose notificationnumber
starts with “99”;
(f) Writes the created data.table to file, protocols_zk_clean.csv - this
file should be written to the folder protocols_final;

5. Merge notification_zk_clean.csv with protocols_zk_clean.csv, as a


merg-ing variable you should use notificationnumber, additionally do
the follow-ing:
(a) Keep only those observations where you have at least 2 bidders;
(b) Keep only those observations where all submitted bids are di fferent;
(c) Keep only those observations where all submission times are different;
(d) For each row obtain min bid and second minimum bid, call these
variable min_bid, second_min_bid;
(e) Calculate whether the bidder who submitted the minimum bid, bided
the last (i.e., had the largest protocolprotocolapplications_application_appdate).
Please create a dummy variable, winner_last, which equals 1 if this
is the case and 0 otherwise.

S-ar putea să vă placă și