
1.

You have been provided with large CSV files containing health expenditure
data about Australia. Unfortunately, the data is 'noisy': some people have
made data entry mistakes, or intentionally entered incorrect data. Your first
task as a programmer-analyst is to clean up the noisy data for later analysis.

There are a few particular errors in this data:


• Typos have occurred in the expenditure, resulting in some non-numeric
values.

• People have entered expenditure areas that are out of date and no
longer valid. The valid areas are listed in a variable
called VALID_AREAS, which is given to you.

• Some people have formatted the financial year incorrectly, using words
instead of digits (e.g. 'twenty-ten to eleven' instead of '2010-11') or
using too many or too few digits (e.g. '10-11'). Others have entered
years outside the range of the dataset. The data provided is for
financial years within the range 1997-98 to 2011-12.

Write a function clean_data(data) which takes one argument, a dictionary
of data in the format returned by read_data. This data has been read directly
from a CSV file, and is noisy! Your function should construct and return a new
data dictionary which is identical to the input dictionary, except that invalid
data values have been replaced with None. You should not modify the
argument dictionary, data.

For example, let’s look at the data contained in noisy_sample.csv:


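from header import read_data, VALID_AREAS
import copy
import re

# A well-formed financial year looks like 'XXXX-XX', e.g. '2010-11'.
FIN_YEAR_PATTERN = re.compile(r'[0-9]{4}-[0-9]{2}')

def is_int(s):
    """Return True if s can be converted to an int."""
    try:
        int(s)
        return True
    except (ValueError, TypeError):
        return False

def is_valid_fin_year(fin_year):
    """Return True for a financial year between 1997-98 and 2011-12."""
    if fin_year is None or not FIN_YEAR_PATTERN.fullmatch(fin_year):
        return False
    start = int(fin_year[:4])
    # The two-digit suffix must be the year after start (1999-00 wraps around).
    return 1997 <= start <= 2011 and int(fin_year[-2:]) == (start + 1) % 100

def clean_data(data):
    # Deep copy, so the nested dictionaries of the argument are left untouched
    # (data.copy() would share them with the input).
    cleaned = copy.deepcopy(data)
    valid_areas = set(VALID_AREAS)
    for record in cleaned.values():
        if not is_valid_fin_year(record['fin_year']):
            record['fin_year'] = None
        if record['area'] not in valid_areas:
            record['area'] = None
        if not is_int(record['expenditure']):
            record['expenditure'] = None
    return cleaned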
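The requirement not to modify the argument is easy to break when the data is a dictionary of dictionaries. The sketch below illustrates the difference between a shallow and a deep copy on a made-up two-record sample; the record keys follow the ones used in the code above, while the values (including the area name 'Hospitals') are invented for illustration.

```python
import copy

# Hypothetical sample in the nested-dictionary shape used above.
data = {
    '0': {'fin_year': '2010-11', 'area': 'Hospitals', 'expenditure': '120'},
    '1': {'fin_year': '10-11', 'area': 'Hospitals', 'expenditure': '12o'},
}

shallow = data.copy()            # new outer dict, but the SAME inner dicts
shallow['1']['fin_year'] = None
shallow_changed_input = data['1']['fin_year'] is None   # the input was modified

data['1']['fin_year'] = '10-11'  # restore the sample
deep = copy.deepcopy(data)       # the inner dicts are copied as well
deep['1']['fin_year'] = None
deep_changed_input = data['1']['fin_year'] is None      # the input is untouched

print(shallow_changed_input, deep_changed_input)  # True False
```

A shallow copy only duplicates the outer dictionary, so writing None into a record through it also changes the caller's data.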
2.
Write a function called avg_expenditure(data, start, end) which takes
three arguments, a dictionary of data in the format returned by read_data,
a start range of financial year in format XXXX-XX, and an end range in
same format. You can assume the financial year input start and end is valid.
The function calculates the average expenditure within the provided range of
financial years and returns the average rounded to the closest whole number.
If the start is greater than end the function should return -1.

You may assume the health expenditure data in data is ‘clean’, that is all
invalid values have been replaced by None. If a nested dictionary contains
a None value for the fin_year key or expenditure key, you should ignore it
in your calculation. (If the dictionary has None for a different key, e.g. area,
you should still include it in the calculation.)
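One possible way to approach this is sketched below; it is not a reference solution. It assumes that expenditure values are integers or integer strings, that financial years can be compared by their first four digits, and that an empty selection returns 0 (the task does not define that case). Note also that Python's built-in round() rounds exact halves to the nearest even number. The sample records and the area name 'Hospitals' are invented for illustration.

```python
def avg_expenditure(data, start, end):
    """Average expenditure for financial years from start to end, inclusive."""
    first, last = int(start[:4]), int(end[:4])
    if first > last:
        return -1
    amounts = []
    for record in data.values():
        fin_year, expenditure = record['fin_year'], record['expenditure']
        if fin_year is None or expenditure is None:
            continue  # skip records invalidated during cleaning
        if first <= int(fin_year[:4]) <= last:
            amounts.append(int(expenditure))
    if not amounts:
        return 0  # assumption: this case is not defined by the task
    return round(sum(amounts) / len(amounts))


# Hypothetical cleaned sample; a None area does not exclude a record.
sample = {
    '0': {'fin_year': '2009-10', 'area': None, 'expenditure': '100'},
    '1': {'fin_year': '2010-11', 'area': 'Hospitals', 'expenditure': '302'},
    '2': {'fin_year': None, 'area': 'Hospitals', 'expenditure': '900'},
    '3': {'fin_year': '2011-12', 'area': 'Hospitals', 'expenditure': None},
}
print(avg_expenditure(sample, '2009-10', '2010-11'))  # 201
print(avg_expenditure(sample, '2011-12', '2009-10'))  # -1
```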

Here are some examples of what your function should return for different
datasets and financial year brackets:
3.
Your employers are interested in the distribution of expenditure across
different areas of the health sector.
One way to analyse this is to divide the range of possible expenditures into a
number of equal-sized 'bins' -- where a bin is just a subset of the overall range
-- then count the number of expenditures falling into each bin (if you've ever
worked with histograms before, this should be very familiar).


For example, we could divide the total expenditure range [0, 5000] into ten
bins: [0, 499], [500, 999], [1000, 1499], and so on, up to [4500, 5000]. The
distribution of expenditures would then be summarised by 10 integers
corresponding to the ten bins. In general, we experiment with the number of
bins to find the number that gives the most informative distribution.

Write a function called funding_dist(data, n_bins, area), which calculates
the distribution of expenditures greater than or equal to the minimum
expenditure and less than or equal to the maximum expenditure for a given
area, by dividing that range into n_bins bins and counting the number of
expenditures that fall into each bin. The bin width should be an integer.
Your function should return a list of ints, with each integer representing
the number of expenditures falling in the corresponding bin.

Here is an example of how your function should behave:

If a nested dictionary in data contains a None value for
the expenditure or area key, you should ignore it in your calculation. (If the
dictionary has None for a different key, you should still include it in the
calculation.)

You may assume that n_bins is a positive integer. Notice that including the
maximum expenditure in the last bin may make the last bin slightly ‘wider’
than the others. For example, if

max_expend == 101, min_expend == -20, n_bins == 6,
and bin_width == (101 - (-20)) // 6 == 20,
the bins would be [-20, -1], [0, 19], [20, 39], [40, 59], [60, 79], and [80, 101].
Note that in this example, the last bin is wider than all the others.
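The binning rule can be sketched as follows, under a few stated assumptions: expenditure values are integers or integer strings, a dataset with no matching records yields all-zero counts, and if the range is too narrow for a non-zero integer bin width, every value is counted in the last bin. None of these edge cases is defined by the task, and the sample records (including the area names) are invented for illustration.

```python
def funding_dist(data, n_bins, area):
    """Count the expenditures for one area in n_bins integer-width bins."""
    amounts = [int(r['expenditure']) for r in data.values()
               if r['expenditure'] is not None and r['area'] == area]
    counts = [0] * n_bins
    if not amounts:
        return counts  # assumption: no matching records gives all zeros
    lowest, highest = min(amounts), max(amounts)
    bin_width = (highest - lowest) // n_bins
    for amount in amounts:
        if bin_width == 0:
            index = n_bins - 1  # assumption: range too narrow, use last bin
        else:
            # Values past the last regular boundary fall into the final,
            # slightly wider bin.
            index = min((amount - lowest) // bin_width, n_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical records reproducing the -20 to 101 example with six bins.
values = [-20, -1, 0, 19, 80, 100, 101]
sample = {str(i): {'fin_year': None, 'area': 'Hospitals', 'expenditure': str(v)}
          for i, v in enumerate(values)}
sample['x'] = {'fin_year': '2010-11', 'area': 'Dental', 'expenditure': '50'}
sample['y'] = {'fin_year': '2010-11', 'area': 'Hospitals', 'expenditure': None}
print(funding_dist(sample, 6, 'Hospitals'))  # [2, 2, 0, 0, 0, 3]
```

Clamping the index with min() is what places 100 and 101 in the last bin [80, 101] rather than in a non-existent seventh bin.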
4.
Write a function called area_expenditure_counts(data, lower_spent,
upper_spent) which creates a dictionary of the number of expenditures in
the given expenditure amount bracket for each area. That is, each key in the
dictionary should be an area name, and the value for that area should be
an int corresponding to the number of expenditures in the area that fall in
the expenditure amount bracket specified by lower_spent and upper_spent
(inclusive). Your dictionary should have a key for all areas in VALID_AREAS,
even the ones that have no expenditures in the expenditure amount bracket.
The function should return the top 5 areas by expenditure count as a list of
tuples [(area, expenditure_count), ...]. The top 5 areas should be listed in
descending order by count, and ties should be broken by alphabetical order.

In this question, you should ignore any nested dictionary with a None value for
the area key or the expenditure key. None values for other keys are
acceptable. You may assume that lower_spent and upper_spent are
positive. If lower_spent > upper_spent, your function should return a list
with a value of 0 for every expenditure.
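A minimal sketch of the counting and ranking step is given below. The VALID_AREAS list here is a placeholder with invented area names, since the real list is supplied with the assignment. Expenditure values are assumed to be integers or integer strings, and the case lower_spent > upper_spent is read as "every count is 0", so the function still returns five (area, 0) tuples; that reading is an assumption.

```python
# Placeholder list: the real VALID_AREAS is supplied with the assignment.
VALID_AREAS = ['Aids', 'Dental', 'Hospitals', 'Medications', 'Research',
               'Transport']


def area_expenditure_counts(data, lower_spent, upper_spent):
    """Return the top 5 (area, count) pairs for the given bracket."""
    counts = {area: 0 for area in VALID_AREAS}
    for record in data.values():
        area, expenditure = record['area'], record['expenditure']
        if area is None or expenditure is None:
            continue  # ignore records invalidated during cleaning
        if area in counts and lower_spent <= int(expenditure) <= upper_spent:
            counts[area] += 1
    # Sort by descending count, then alphabetically to break ties.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:5]


# Hypothetical cleaned sample.
rows = [('Hospitals', '100'), ('Hospitals', '200'), ('Dental', '150'),
        ('Research', '150'), ('Transport', '9000'), (None, '100'),
        ('Dental', None)]
sample = {str(i): {'fin_year': '2010-11', 'area': a, 'expenditure': e}
          for i, (a, e) in enumerate(rows)}
print(area_expenditure_counts(sample, 0, 800))
# [('Hospitals', 2), ('Dental', 1), ('Research', 1), ('Aids', 0), ('Medications', 0)]
```

Sorting on the key (-count, area) handles both ordering rules in one pass: the negated count gives descending order, and the area name breaks ties alphabetically.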

Here are some examples of how your function should behave:


5.
A prestigious Victorian university has asked the AIHW to produce a report on
health expenditure in Australia. They have asked you to help them generate
some of the data for this report.

Write a function called main(datafile) which takes a filename as an
argument, reads the health data contained in that file, cleans the data,
and uses the data to print out some facts about health expenditure. You
should assume that the data in datafile is noisy. Your function should
calculate and print out the following facts:

1. The average expenditure for each financial year between 1997-98 and
2011-12.
2. A list of the top 5 expenditure areas by count. The report is only
interested in expenditures between 0-800 million (inclusive).

Average expenditure should be listed in chronological order by financial
year with format XXXX-XX, and a year should only be listed if it has a
non-zero average. Next to the financial year, in brackets, print the average
expenditure for that year.

Top 5 areas should be listed in descending order by count, and only listed if
they have at least one expenditure. Ties should be broken by alphabetical
order. Next to the area name, in brackets, print the number of expenditures in
the area.
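The listing rules above can be sketched with two small helpers. Both helper names are hypothetical, and the 'label (value)' layout is only an assumption drawn from the description "in brackets"; the exact layout must be taken from the expected output. The averages and counts passed in would come from the functions written for the earlier questions.

```python
def fin_year_labels(first=1997, last=2011):
    """Return labels such as '1997-98' in chronological order."""
    return ['{}-{:02d}'.format(year, (year + 1) % 100)
            for year in range(first, last + 1)]


def report_lines(averages, top_areas):
    """Build report lines from a {fin_year: average} dict and (area, count) pairs."""
    lines = []
    for label in fin_year_labels():
        average = averages.get(label, 0)
        if average != 0:          # only years with a non-zero average
            lines.append('{} ({})'.format(label, average))
    for area, count in top_areas:
        if count > 0:             # only areas with at least one expenditure
            lines.append('{} ({})'.format(area, count))
    return lines


# Hypothetical inputs for illustration only.
averages = {'1999-00': 410, '1997-98': 250, '2005-06': 0}
top_areas = [('Hospitals', 2), ('Dental', 1), ('Aids', 0)]
for line in report_lines(averages, top_areas):
    print(line)
```

Generating the labels from integers, rather than sorting whatever strings appear in the data, keeps the years in chronological order and handles the 1999-00 wrap-around.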

Here is an example of what your function should print. Make sure your
function matches the format exactly.
