Research Short-Term Task

Preliminary Instructions: To complete this research task, write well-documented R
code that answers every question precisely and clearly.
The essential data for the project can be downloaded here:

https://www.dropbox.com/s/p41n7unv7y75sib/HW_3.zip?dl=0
Does Gender Matter?

In this problem we will try to understand how gender affects the probability of
winning a gold medal at the International Mathematical Olympiad (IMO).
1. Data on participants in the IMO: the folder IMO_ALL contains IMO results
at the individual level from 1959 to 2018, obtained from
https://www.imo-official.org/.
Part 1 - Getting Ready

1. Write a function which takes as input one of the files in the folder
IMO_ALL (i.e., it should take a string variable with the full name of the
file, for example, “1959.html”) and returns a data.table with the
following variables:
(a) Name and the surname of the participant - this variable should be
called name;
(b) Removes observations where the gender indicator is missing;
(c) Gender - this variable should be called gender;
(d) Country - this variable should be called country;
(e) Score of p1 - this variable should be called p_1;
(f) Score of p2 - this variable should be called p_2;
(g) Score of p3 - this variable should be called p_3;
(h) Score of p4 - this variable should be called p_4;
(i) Score of p5 - this variable should be called p_5;
(j) Score of p6 - this variable should be called p_6;
(k) Score of p7 (notice that p7 is present only in some years) - this
variable should be called p_7;
(l) Total score - this variable should be called total;
(m) Rank - this variable should be called rank;
(n) Award - this variable should be called medal;
(o) Year when the Olympiad took place - this variable should be called
year;
Remove from the data all participants whose name equals * or ?. Also
remove all participants who have a ? (question mark) anywhere in their name
(i.e., there is a question mark within the string).
Suggested approach:
(a) Read the file with read_html from the xml2 package;
(b) Then via XPath find all .//tr nodes;
(c) Then for each .//tr node find all .//td nodes.
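The three parsing steps can be sketched as follows. This is a minimal sketch, assuming the xml2 package is installed; the helper name extract_cells and the toy HTML file are invented for illustration, and the mapping from cell positions to the variables listed above is deliberately left out because it depends on the actual layout of the IMO files.

```r
library(xml2)

# Hypothetical helper: returns one character vector of cell texts per table row.
extract_cells <- function(file_path) {
  page  <- read_html(file_path)
  rows  <- xml_find_all(page, ".//tr")
  cells <- lapply(rows, function(row) xml_text(xml_find_all(row, ".//td")))
  # Drop rows that contain no <td> cells (for example, header rows built from <th>)
  cells[lengths(cells) > 0]
}

# Toy file, invented for illustration; the real files may be laid out differently
toy_html <- tempfile(fileext = ".html")
writeLines(c(
  "<html><body><table>",
  "<tr><th>Name</th><th>Country</th><th>Total</th></tr>",
  "<tr><td>Alice Example</td><td>AAA</td><td>40</td></tr>",
  "<tr><td>Bob Example</td><td>BBB</td><td>35</td></tr>",
  "</table></body></html>"
), toy_html)
toy_cells <- extract_cells(toy_html)

# The year can be taken from a file name such as "1959.html"
year_from_name <- as.integer(sub("\\.html$", "", basename("/some/path/1959.html")))
```

Each element of toy_cells can then be turned into one row of a data.table once the column positions in the real files have been checked.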
2. Write a function which takes as input an absolute path to the folder
IMO_ALL, applies the function from point 1 to each .html file in this
folder, combines the results row-wise and outputs a .csv file,
imo_all.csv;
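One possible shape for this step is sketched below, assuming the data.table package. The names combine_folder, parse_one and stand_in are hypothetical; parse_one stands for the function from point 1, and writing imo_all.csv into the input folder is an assumption, since the task does not say where the file should go.

```r
library(data.table)

# parse_one is a placeholder for the function from point 1
combine_folder <- function(folder_path, parse_one) {
  files <- list.files(folder_path, pattern = "\\.html$", full.names = TRUE)
  # fill = TRUE pads columns that exist only in some years (such as p_7) with NA
  combined <- rbindlist(lapply(files, parse_one), use.names = TRUE, fill = TRUE)
  # Assumption: the output is written next to the input files
  fwrite(combined, file.path(folder_path, "imo_all.csv"))
  invisible(combined)
}

# Toy folder and stand-in parser, invented for illustration
toy_dir <- file.path(tempdir(), "imo_toy")
dir.create(toy_dir, showWarnings = FALSE)
writeLines("<html></html>", file.path(toy_dir, "1959.html"))
writeLines("<html></html>", file.path(toy_dir, "1960.html"))
stand_in <- function(f) {
  data.table(year = as.integer(sub("\\.html$", "", basename(f))))
}
toy_all <- combine_folder(toy_dir, stand_in)
```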
3. Clean the data a little:
You should write a function, function_clean, which does the following:
(a) Takes as an input imo_all.csv (please use fread to read the .csv file);
(b) Removes the leading “\n”, the gender indicator and the “\n” at the end
of the name - for example, “\nBohuslav Diviš \n” should become “Bohuslav
Diviš”;
(c) For the variable name, trims white spaces at the beginning and the
end of the string;
(d) Creates new variables:
i. gold_ind = 1 if the medal variable contains the string “Gold” or
“gold” and 0 otherwise;
ii. silv_ind = 1 if the medal variable contains the string “Silver” or
“silver” and 0 otherwise;
iii. d_female = 1 if a participant is female and 0 otherwise;
iv. nr_order - a variable showing how many times a given participant
has taken part in an olympiad so far (1 for the first participation, 2
for the second, and so on); a participant here is defined by
the combination of country and name;
(e) Replaces empty cells for p1 - p7 with NAs;
(f) Outputs the file “final_data_imo.csv”;
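The cleaning steps can be illustrated on a toy table. This is a sketch under stated assumptions, not the required function_clean: the data are invented, gender is assumed to be coded as "F"/"M" (the real coding has to be checked in the files), nr_order is assumed to count participations in chronological order, and the removal of the gender indicator from the name is omitted because its form is not given here.

```r
library(data.table)

# Toy data, invented for illustration
toy <- data.table(
  name    = c("\nAlice Example \n", "Alice Example", "Bob Example"),
  country = c("AAA", "AAA", "BBB"),
  gender  = c("F", "F", "M"),
  medal   = c("Silver medal", "Gold medal", ""),
  year    = c(2001L, 2002L, 2001L),
  p_7     = c("", "", "3")
)

# Strip newline characters, then trim leading and trailing white space
toy[, name := trimws(gsub("\n", "", name, fixed = TRUE))]

# Indicator variables
toy[, gold_ind := as.integer(grepl("Gold|gold", medal))]
toy[, silv_ind := as.integer(grepl("Silver|silver", medal))]
toy[, d_female := as.integer(gender == "F")]  # assumption about the coding

# Participation counter within (country, name), assumed to follow the year
setorder(toy, country, name, year)
toy[, nr_order := seq_len(.N), by = .(country, name)]

# Empty cells become NA (shown for p_7; the same applies to p_1 to p_6)
toy[p_7 == "", p_7 := NA_character_]
```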
Part 2 - Does Gender Matter?

Reminder: a linear probability model is the same as OLS, except that the
dependent variable now lies in the set {0, 1}.
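As a reminder of the mechanics, a linear probability model can be fitted with lm on a 0/1 outcome. The sketch below uses invented toy data and a single regressor; it is not the specification requested in the questions that follow.

```r
# Toy data, invented for illustration; not the IMO data
toy_lpm <- data.frame(
  gold_ind = c(1, 0, 0, 1, 0, 1, 0, 0),
  d_female = c(0, 0, 1, 0, 1, 0, 1, 0)
)

# OLS with a binary dependent variable
lpm <- lm(gold_ind ~ d_female, data = toy_lpm)
lpm_coef <- coef(lpm)
summary(lpm)
```

With a single dummy regressor, the intercept equals the share of gold medals in the group with d_female = 0 and the slope equals the difference in shares between the two groups.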
1. We try to understand how the probability of winning a gold medal at the IMO
varies between men and women. We do that using a linear probability
model. In the .pdf file please report your estimated coefficients and
provide explanations. Please report only the estimates of the coefficients
you are asked to report.
(a) Run the following model:
and control linearly for nr_order_i. Report an estimate of , ,
is it statistically significant? Is this model more or less flexible
than the one you estimated in point (d)?
Conclude whether gender affects the probability to obtain a gold at IMO.
Detection of Suspicious Auctions

In this question we will try to detect anomalies in bidding behavior. To do
that we will analyze two datasets:
notifications_final - contains information on notifications for three places
in Russia;
protocols_final - contains information on bids for three places in Russia.
Exactly the same comments about the way we will grade functions apply
as in the previous problem.
1. Write a function, function_fields_extract_not_zk, which takes as an
argument an absolute path to notificationzk.csv and does the following:
(a) Reads this file using fread; as the column separator please use the “#”
symbol; the class of each column should be “character”;
(b) Keeps the following fields in the data:
i. notificationnumber - notification number;
ii. versionnumber - version of the notification;
iii. publishdate - publication date;
iv. customerrequirement_maxprice - starting price in the auction;
v. notificationcommission_p1date - bid submission start date;
vi. notificationcommission_p2date - bid submission end date.
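A minimal sketch of such a function is shown below, assuming the data.table package and assuming that the first line of each file holds the field names (hence header = TRUE). The toy file, its extra column and its date format are invented for illustration.

```r
library(data.table)

function_fields_extract_not_zk <- function(path) {
  keep <- c("notificationnumber", "versionnumber", "publishdate",
            "customerrequirement_maxprice",
            "notificationcommission_p1date",
            "notificationcommission_p2date")
  # Reading everything as character preserves leading zeros in identifiers
  dt <- fread(path, sep = "#", colClasses = "character", header = TRUE)
  dt[, keep, with = FALSE]
}

# Toy file, invented for illustration
toy_path <- file.path(tempdir(), "notificationzk.csv")
writeLines(c(
  paste("notificationnumber", "versionnumber", "publishdate",
        "customerrequirement_maxprice", "notificationcommission_p1date",
        "notificationcommission_p2date", "other", sep = "#"),
  paste("0348100035711000002", "1", "2011-01-10", "1000",
        "2011-01-11", "2011-01-18", "x", sep = "#")
), toy_path)
toy_not <- function_fields_extract_not_zk(toy_path)
```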
2. Write a function, function_combine_not_zk, which does the following:
(a) Takes as input an absolute path to the folder notifications_final;
(b) Applies function_fields_extract_not_zk to all files with the name
notificationzk.csv within the folder notifications_final (including
subfolders) and combines them together into a data.table;
(c) Orders this data.table by notificationnumber, versionnumber and
publishdate in decreasing order;
(d) Keeps unique observations by notificationnumber;
(e) Keeps only those observations where the field
customerrequirement_maxprice does not contain the “&” symbol;
(f) Removes those observations whose notificationnumber starts with
“99”;
(g) Creates a new variable, days, which equals the time difference in
days between notificationcommission_p2date and
notificationcommission_p1date;
(h) Writes the created data.table to file, notification_zk_clean.csv - this
file should be written to the folder notifications_final;
(i) Please perform all these operations in the order which is specified
above.
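The ordering, de-duplication and filtering steps can be sketched on a toy table, in the order given above. The data and the date format are invented; because every column is read as character, the ordering here is lexicographic (so "10" sorts before "2"), which is an assumption about what is intended. Collecting the files and writing the result are indicated in comments only.

```r
library(data.table)

# In the real function the table would come from something like:
# files <- list.files(folder, pattern = "^notificationzk\\.csv$",
#                     recursive = TRUE, full.names = TRUE)
# followed by rbindlist(lapply(files, function_fields_extract_not_zk))

# Toy data, invented for illustration
toy_n <- data.table(
  notificationnumber = c("0001", "0001", "9901", "0002"),
  versionnumber      = c("1", "2", "1", "1"),
  publishdate        = c("2011-01-10", "2011-01-12", "2011-01-10", "2011-01-11"),
  customerrequirement_maxprice  = c("1000", "1200", "500", "300&&&&400"),
  notificationcommission_p1date = c("2011-01-11", "2011-01-13", "2011-01-11", "2011-01-12"),
  notificationcommission_p2date = c("2011-01-18", "2011-01-23", "2011-01-15", "2011-01-20")
)

# (c) decreasing order on all three keys
setorder(toy_n, -notificationnumber, -versionnumber, -publishdate)
# (d) keep the first row per notification, i.e. the highest version
toy_n <- unique(toy_n, by = "notificationnumber")
# (e) drop rows whose starting price contains "&"
toy_n <- toy_n[!grepl("&", customerrequirement_maxprice, fixed = TRUE)]
# (f) drop notification numbers starting with "99"
toy_n <- toy_n[!startsWith(notificationnumber, "99")]
# (g) difference in days; the date format is an assumption
toy_n[, days := as.numeric(difftime(as.Date(notificationcommission_p2date),
                                    as.Date(notificationcommission_p1date),
                                    units = "days"))]
# (h) fwrite(toy_n, file.path(folder, "notification_zk_clean.csv"))
```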
3. Write a function, function_fields_extract_prot_zk, which takes as an argument
an absolute path to the file protocolzk1.csv and does the following:
(a) Opens the file (please use fread to read the .csv file, use “#” as the
column separator, and set the column classes of all variables to
character);
(b) Extracts the following variables:
i. protocoldate;
ii. notificationnumber - notification id;
iii. protocolprotocolapplications_application_journalnumber - this
variable provides information on the unique identifiers of bidders
(separated by &&&&). For example, for notification number
0348100035711000002 this variable equals 1&&&&3, which tells
us that there were two bidders in this auction: the first had
identifier 1 and the second had identifier 3. Notice that the numeric
values do not have a meaning here;
iv. protocolprotocolapplications_application_price - this variable
provides us with the bids of all bidders;
v. protocolprotocolapplications_application_appdate - provides you
with the date and time when a particular bid was submitted;
“T” and “Z” here do not have any meaning and should
be replaced by empty characters.
4. Write a function, function_combine_prot_zk, which does the following:
(a) Takes as input an absolute path to the folder protocols_final;
(b) Applies function_fields_extract_prot_zk to all files called
protocolzk1.csv within the folder protocols_final (including subfolders) and
combines them together into a data.table;
(c) Orders this data.table by notificationnumber and protocoldate in
decreasing order;
(d) Keeps unique observations by notificationnumber;
(e) Removes those observations whose notificationnumber
starts with “99”;
(f) Writes the created data.table to the file protocols_zk_clean.csv - this
file should be written to the folder protocols_final;
5. Merge notification_zk_clean.csv with protocols_zk_clean.csv; as the
merging variable you should use notificationnumber. Additionally, do
the following:
(a) Keep only those observations where you have at least 2 bidders;
(b) Keep only those observations where all submitted bids are different;
(c) Keep only those observations where all submission times are different;
(d) For each row obtain the minimum bid and the second minimum bid; call these
variables min_bid and second_min_bid;
(e) Determine whether the bidder who submitted the minimum bid bid
last (i.e., had the largest protocolprotocolapplications_application_appdate).
Please create a dummy variable, winner_last, which equals 1 if this
is the case and 0 otherwise.
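The row-wise bid logic can be sketched as follows. Several things here are assumptions made for illustration: the toy data are invented, the short column names price and appdate stand in for protocolprotocolapplications_application_price and protocolprotocolapplications_application_appdate, the timestamps are assumed to have the form "YYYY-MM-DD HH:MM:SS" after cleaning, and the k-th bid is assumed to belong to the same bidder as the k-th timestamp. The helper name summarise_bids is hypothetical.

```r
library(data.table)

# Toy tables, invented for illustration
toy_notif <- data.table(notificationnumber = c("0001", "0002", "0003"),
                        days = c(7, 5, 3))
toy_prot <- data.table(
  notificationnumber = c("0001", "0002", "0003"),
  price   = c("100&&&&90", "50&&&&50", "70"),
  appdate = c("2011-01-10 10:00:00&&&&2011-01-11 09:00:00",
              "2011-01-10 10:00:00&&&&2011-01-10 11:00:00",
              "2011-01-10 10:00:00")
)

merged <- merge(toy_notif, toy_prot, by = "notificationnumber")

# Hypothetical helper: summarises the bids stored in one row
summarise_bids <- function(price, appdate) {
  bids   <- as.numeric(strsplit(price, "&&&&", fixed = TRUE)[[1]])
  times  <- as.numeric(as.POSIXct(strsplit(appdate, "&&&&", fixed = TRUE)[[1]],
                                  tz = "UTC"))
  sorted <- sort(bids)
  list(
    n_bidders      = length(bids),
    all_bids_diff  = anyDuplicated(bids) == 0,
    all_times_diff = anyDuplicated(times) == 0,
    min_bid        = sorted[1],
    second_min_bid = sorted[2],  # NA when there is a single bid
    winner_last    = as.integer(which.min(bids) == which.max(times))
  )
}

# One row per notification, so grouping by it applies the helper row by row
bid_info <- merged[, summarise_bids(price, appdate), by = notificationnumber]
bid_info <- bid_info[n_bidders >= 2 & all_bids_diff & all_times_diff]
```

In this toy example only notification 0001 survives the three filters: 0002 has two equal bids and 0003 has a single bidder.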
