
1. Data Collection
Option 1:
a. The goal of this collection is to record the times in the
swimming events of the 2016 Rio Olympic Games and to derive, for
each lane, the differential between outbound and inbound split
times. The mode of this data collection is observation and
generation. The collection need is driven by questions such as how
likely it is that a circular current in the swimming pool
influenced swimmers in the low-numbered lanes.
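The outbound/inbound differential described above could be computed
roughly as follows. This is only a sketch: the function name, the
choice to treat odd-numbered laps as outbound, the option to skip the
first pair of laps (because the opening lap includes the dive start),
and the sample split values are all assumptions made for illustration,
not part of the plan.

```python
from statistics import mean


def direction_differential(splits, skip_first_pair=True):
    """Return mean inbound split minus mean outbound split, in seconds.

    Laps 1, 3, 5, ... are treated as outbound and laps 2, 4, 6, ... as
    inbound. Skipping the first pair is an assumption of this sketch,
    made because the opening lap includes the dive start.
    """
    start = 2 if skip_first_pair else 0
    outbound = splits[start::2]
    inbound = splits[start + 1::2]
    if not outbound or not inbound:
        raise ValueError("need at least one outbound and one inbound lap")
    return mean(inbound) - mean(outbound)


# Hypothetical 50 m split times (seconds) for one swimmer in a 400 m race.
example_splits = [27.0, 29.0, 29.5, 29.9, 29.6, 30.0, 29.4, 29.8]
differential = direction_differential(example_splits)
```

A positive value means the inbound laps were slower on average than
the outbound laps. Comparing this value across lanes is what the
question about the current would then look at.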
Option 2:
a. The goal of this data collection is to examine the possible
relationship between GDP and life expectancy. The mode of data
collection is observation. We collect these data to find out how
GDP affects life expectancy at birth, and so to get a clearer idea
of how people can be helped to live longer. The collection need is
therefore driven by questions.
Option 1, Option 2:
b.
Logical Collection:
For Option 1: We will collect all the data produced in the 2016 Rio
Olympic swimming competition. There are 32 events in total, 16
for men and 16 for women. The events are differentiated by
distance, swimming style and competition format. There are four
swimming styles: Freestyle, Backstroke, Breaststroke and
Butterfly. The distances for individual participants are 50m,
100m, 200m, 400m, 800m (women only), 1500m (men only) and
4X100m medley. For group participants, there are the 4X100m
and 4X200m Freestyle Relays and the 4X100m Medley Relay.
Each event also has several stages, usually heats, semifinals
and a final. There are 18 countries in total. For each event, we
will need to collect every participant's record for each stage.
We will take the accumulated time at every 50m, since the pool
is 50m long.
For Option 2: We will collect data for all the countries in the
world, 193 in total, excluding regions. For each country, we will
collect its GDP and life expectancy data from 1960 to 2014. The
data can be obtained from the World Bank website.

Physical Data Handling


For Option 1, we can get the data from the official Rio Olympic
website:
https://www.rio2016.com/en/swimming-featured-results-download
This site provides the raw data for each event with complete
records. For the data format, we will choose XML. We will use the
accumulated time at each distance mark to derive the time for
each 50m. Times will be precise to 1/100th of a second and
written in the form 3:12.45, where the part before the colon is
minutes and the part after the colon is seconds. For each record,
we will collect the participant's first and last name, the lane
number, the stage of the event, the rank, and the country.
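The step of turning accumulated times into 50m split times can be
sketched as below. The function names and the sample times are
hypothetical and only illustrate the 3:12.45 format described above.
Decimal is used so that hundredths of a second stay exact.

```python
from decimal import Decimal


def parse_time(text):
    """Convert 'M:SS.ss' or 'SS.ss' into seconds as an exact Decimal."""
    minutes, _, seconds = text.rpartition(":")
    return Decimal(minutes or "0") * 60 + Decimal(seconds)


def lap_splits(cumulative):
    """Turn accumulated times at each 50m mark into per-lap splits."""
    totals = [parse_time(t) for t in cumulative]
    previous = [Decimal("0")] + totals[:-1]
    return [now - before for now, before in zip(totals, previous)]


# Hypothetical accumulated times for a 200m race (illustrative only).
example = ["26.10", "54.32", "1:23.05", "1:52.45"]
splits = lap_splits(example)
```

Each split is the difference between two consecutive accumulated
times, with the first split equal to the first accumulated time.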
For Option 2, we can get the data from the World Bank website:
http://www.worldbank.org/. We will sort the raw data and classify
the countries by region.
For Option2,
Interoperability Support
The collected data can be used on all platforms because it uses
platform-neutral encodings. An API will also be provided for
third-party use.
Security Support
Public-use data file: all identifiers are removed; it can be
accessed from RPI with a valid RPI account and an authorized IP
address. Restricted-use data file: users must apply for access and
agree to the security access controls.
Data ownership
RPI retains ownership of the produced data.
Metadata Collection
For Option 1, the metadata will include the event title, the time
of the event, the event type, the swimming style, the stage of
the competition, the name of the swimming pool, the water
temperature, the length of the pool, the timing equipment and
the timing protocols used. The metadata will be recorded as
contextual information about the data in a text-based document.
For Option2,
Persistence
For Option 1, the data will be stored in a GitHub repository. When
data from other competitions are to be added, Guo Yu will update
the collection by adding the new data to the XML file. The
metadata document will be updated as well.
For Option2,
Knowledge and Information discovery
The data can be found through Google or GitHub search. Giving
the collection a unique name and tagging it with keywords should
make it easier to find.
Data dissemination and publication
The data producer should maintain a version control mechanism
for updates and changes to the data collection. Those who
subscribe to the data collection will be informed of updates and
changes by email or online system notifications.

For Questions 2, 3 and 4, the citations and sources of information are listed below:

Example Data Management Plan: NSF General, https://www.dataone.org/sites/all/documents/DMP_MaunaLoa_Formatted.pdf
Data Management Plans, https://library.stanford.edu/research/data-managementservices/data-management-plans
Sample Data Management Plan for Depositing Data with ICPSR, http://www.icpsr.umich.edu/icpsrweb/content/datamanagement/dmp/plan.html
Metadata and Provenance, from class 2 reading material
Write A Data Management Plan, https://libraries.mit.edu/data-management/plan/write/
Rio Olympic Swimming Game, https://www.rio2016.com/en/swimming-featured-results-download
World Bank, http://www.worldbank.org/

2. Survey of data storage/formats


Option 1: For the swimming records, the candidate data formats are self-describing formats, table-driven formats, databases and graphs. For this data collection,
a self-describing format is suitable for a single record. A table-driven format lets us easily retrieve a swimmer's record in one event, but
as we generate more data it will be inadequate for more complex calculations and comparisons. A
database offers more flexibility and usability across different
calculation methods. We will therefore store each record as a row in the
database, with the lane number, the competition round, the participant name,
the country, the event title, the accumulated time by distance, the split
time for each lap (50m), etc. stored as attributes. Time-related
data will be precise to 1/100th of a second and
take the form 3:12.45, standing for 3 minutes and 12.45 seconds.
Rank and lane numbers will take the values 1-8.
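A minimal sketch of the database option, using Python's built-in sqlite3 module, is given here. The table name, the column names, the choice of one row per 50m lap, the storage of times as whole hundredths of a second, and the sample rows are all assumptions made for illustration. The plan itself does not fix a schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database for the sketch
conn.execute(
    """
    CREATE TABLE split (
        event         TEXT    NOT NULL,
        stage         TEXT    NOT NULL,
        lane          INTEGER NOT NULL CHECK (lane BETWEEN 1 AND 8),
        swimmer       TEXT    NOT NULL,
        country       TEXT    NOT NULL,
        distance_m    INTEGER NOT NULL,  -- 50, 100, 150, ...
        cumulative_cs INTEGER NOT NULL,  -- accumulated time, 1/100 s
        split_cs      INTEGER NOT NULL,  -- time for this lap, 1/100 s
        PRIMARY KEY (event, stage, swimmer, distance_m)
    )
    """
)

# Hypothetical rows; names and times are placeholders, not real results.
rows = [
    ("Sample 100m Freestyle", "Final", 1, "Swimmer A", "AAA", 50, 2500, 2500),
    ("Sample 100m Freestyle", "Final", 1, "Swimmer A", "AAA", 100, 5200, 2700),
    ("Sample 100m Freestyle", "Final", 8, "Swimmer B", "BBB", 50, 2480, 2480),
    ("Sample 100m Freestyle", "Final", 8, "Swimmer B", "BBB", 100, 5100, 2620),
]
conn.executemany("INSERT INTO split VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)

# Average lap time per lane, converted back to seconds.
mean_split_by_lane = conn.execute(
    "SELECT lane, AVG(split_cs) / 100.0 FROM split GROUP BY lane ORDER BY lane"
).fetchall()
```

Storing one row per lap keeps per-lane comparisons to a single GROUP BY query, which is the kind of calculation the table-driven format makes awkward.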
3. Survey of metadata conventions, standards
Option 1: Some conventions already exist for this data collection:
ISO 8601 is the standard for times, and
ISO 3166-1 is the standard for country name abbreviations. The
swimming pool used at the Olympic Games also has its own standard length,
width and other specifications. Some swimming terms that
describe events and swimmers' performance will also serve as our metadata
standards, such as lap split time and WR (world record).
Some self-defined standards are also needed, such as how to describe the
influence of the circular current on a swimmer.
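As an illustration of applying ISO 8601 to the race times, the sketch below converts the 3:12.45 notation into an ISO 8601 duration string. The function name is hypothetical, and representing race times as ISO 8601 durations (rather than as clock times) is an assumption of this sketch.

```python
from decimal import Decimal


def to_iso8601_duration(text):
    """Convert 'M:SS.ss' or 'SS.ss' into an ISO 8601 duration string."""
    minutes, _, seconds = text.rpartition(":")
    total = Decimal(minutes or "0") * 60 + Decimal(seconds)
    whole_minutes, rest = divmod(total, 60)
    if whole_minutes:
        return f"PT{int(whole_minutes)}M{rest}S"
    return f"PT{rest}S"


duration = to_iso8601_duration("3:12.45")
```

For the time 3:12.45 this gives PT3M12.45S, that is, a period of 3 minutes and 12.45 seconds.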
Option2:
4. Provenance
Option 1: Guo Yu collects the data from the official Rio Olympic website,
downloads the datasets for the 32 swimming events, and
organizes them by event. The data are stored in a database and backed up
on GitHub. Guo Yu will use the accumulated times to derive the split time for
each lap of each participant in each event. Each record will contain the
lane number, participant name, country, competition stage, accumulated
time after each 50m, and split time for each lap (50m). The
metadata (event title, event start time, swimming style,
event type, swimming pool name, water temperature, etc.)
will be collected and organized in a contextual format. The related data
standards will be applied during collection to keep the data consistent.
Option2:
