You have been provided with large CSV files containing health expenditure
data about Australia. Unfortunately, the data is 'noisy': some people have
made data entry mistakes, or intentionally entered incorrect data. Your first
task as a programmer-analyst is to clean up the noisy data for later analysis.
Some people have formatted the financial year incorrectly, either by using
words instead of digits (e.g. entering 'twenty-ten to eleven' instead of
'2010-11') or by using too many or too few digits (e.g. '10-11'). Others have
entered years outside the range of the dataset. The data provided is for
financial years in the range 1997-98 to 2011-12.
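Your function should construct and return a new data dictionary that is
identical to the input dictionary, except that invalid data values are
replaced with None. You should not modify the argument dictionary, data.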
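import copy
import re

# VALID_AREAS is assumed to be defined elsewhere in the assignment code.
FIN_YEAR = re.compile(r'[0-9]{4}-[0-9]{2}')

def is_int(s):
    """Return True if s can be converted to an integer."""
    try:
        int(s)
        return True
    except (ValueError, TypeError):
        return False

def clean_data(data):
    """Return a cleaned copy of data with invalid values replaced by None."""
    # Deep copy, so the nested dictionaries of the argument stay untouched.
    my_data = copy.deepcopy(data)
    for value in my_data.values():
        fin_year = value['fin_year']
        valid_year = False
        # fullmatch rejects extra characters around the pattern, e.g. '12010-11'.
        if isinstance(fin_year, str) and FIN_YEAR.fullmatch(fin_year):
            left = int(fin_year[:4])
            # Start year within 1997..2011 and end year equal to start + 1
            # (compared on the last two digits, so '1999-00' is accepted).
            valid_year = 1997 <= left <= 2011 and int(fin_year[-2:]) == (left + 1) % 100
        if not valid_year:
            value['fin_year'] = None
        if value['area'] not in VALID_AREAS:
            value['area'] = None
        if not is_int(value['expenditure']):
            value['expenditure'] = None
    return my_data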
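The requirement not to modify the argument is easy to break with nested dictionaries, because dict.copy() copies only the outer dictionary. The following short sketch illustrates the difference between a shallow and a deep copy; the record contents (including the area name 'Hospitals') are made up for illustration.

```python
import copy

# Hypothetical record in the nested-dictionary format described above.
original = {1: {'fin_year': '10-11', 'area': 'Hospitals', 'expenditure': '250'}}

shallow = original.copy()           # copies only the outer dictionary
shallow[1]['fin_year'] = None       # the inner dictionary is shared with `original`
shallow_leaked = original[1]['fin_year'] is None

original[1]['fin_year'] = '10-11'   # restore the value for the second experiment
deep = copy.deepcopy(original)      # copies the inner dictionaries as well
deep[1]['fin_year'] = None
deep_leaked = original[1]['fin_year'] is None

print(shallow_leaked, deep_leaked)  # True False
```

With the shallow copy, the change is visible through the original dictionary; with copy.deepcopy it is not.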
2.
Write a function called avg_expenditure(data, start, end) which takes
three arguments: a dictionary of data in the format returned by read_data,
the start of a financial-year range in the format XXXX-XX, and the end of
the range in the same format. You can assume that start and end are valid
financial years. The function calculates the average expenditure within the
provided range of financial years and returns the average rounded to the
closest whole number. If start is later than end, the function should return -1.
You may assume the health expenditure data in data is ‘clean’, that is all
invalid values have been replaced by None. If a nested dictionary contains
a None value for the fin_year key or expenditure key, you should ignore it
in your calculation. (If the dictionary has None for a different key, e.g. area,
you should still include it in the calculation.)
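One possible way to approach this is sketched below. It is not the reference solution, and it rests on several assumptions that the task text does not settle: the range is treated as inclusive at both ends, expenditure values may be stored as strings or integers, 0 is returned when no record falls in the range, and the helper start_year, the outer keys and the area name in the sample data are all hypothetical.

```python
def start_year(fin_year):
    """Return the starting calendar year of a 'YYYY-YY' financial year."""
    return int(fin_year[:4])


def avg_expenditure(data, start, end):
    first, last = start_year(start), start_year(end)
    if first > last:
        return -1
    values = []
    for record in data.values():
        # Records without a financial year or expenditure are ignored;
        # a None area does not matter for the average.
        if record['fin_year'] is None or record['expenditure'] is None:
            continue
        if first <= start_year(record['fin_year']) <= last:
            values.append(int(record['expenditure']))
    if not values:
        return 0  # assumption: the task does not say what to return here
    # Note: Python's round() rounds exact halves to the nearest even integer.
    return round(sum(values) / len(values))


# Hypothetical sample data in the nested-dictionary format.
sample = {
    1: {'fin_year': '1997-98', 'area': 'Hospitals', 'expenditure': 100},
    2: {'fin_year': '1998-99', 'area': None, 'expenditure': 200},
    3: {'fin_year': None, 'area': 'Hospitals', 'expenditure': 900},
    4: {'fin_year': '2005-06', 'area': 'Hospitals', 'expenditure': None},
    5: {'fin_year': '2010-11', 'area': 'Hospitals', 'expenditure': 400},
}

print(avg_expenditure(sample, '1997-98', '1998-99'))  # 150
print(avg_expenditure(sample, '1997-98', '2011-12'))  # 233
print(avg_expenditure(sample, '2011-12', '1997-98'))  # -1
```

Comparing the four-digit start years is enough here because every valid financial year spans exactly one year.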
Here are some examples of what your function should return for different
datasets and financial year brackets:
3.
Your employers are interested in the distribution of expenditure across
different areas of the health sector.
One way to analyse this is to divide the range of possible expenditures into a
number of equal-sized 'bins' -- where a bin is just a subset of the overall range
-- then count the number of expenditures falling into each bin (if you've ever
worked with histograms before, this should be very familiar).
For example, we could divide the total expenditure range [0, 5000] into ten
bins: [0, 499], [500, 999], [1000, 1499], and so on, up to [4500, 5000]. The
distribution of expenditures would then be summarised by 10 integers
corresponding to the ten bins. In general, we experiment with the number of
bins to find the number that gives the most informative distribution.
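The binning idea from the example can be sketched as follows. The function name bin_counts and its parameters are hypothetical, since the required signature is not shown here; the sketch assumes integer expenditures, a range starting at 0, and a bin width obtained by integer division, with the maximum value folded into the last bin.

```python
def bin_counts(expenditures, max_expenditure, n_bins):
    """Count how many expenditures fall into each of n_bins equal-width bins."""
    # Integer bin width; never smaller than 1 so the division below is safe.
    width = max(1, max_expenditure // n_bins)
    counts = [0] * n_bins
    for value in expenditures:
        # Skip missing values and values outside the overall range.
        if value is None or not 0 <= value <= max_expenditure:
            continue
        # The maximum value would land one past the end, so clamp it
        # into the last bin (which makes that bin slightly wider).
        index = min(value // width, n_bins - 1)
        counts[index] += 1
    return counts


print(bin_counts([0, 499, 500, 1499, 4500, 5000, None], 5000, 10))
# [2, 1, 1, 0, 0, 0, 0, 0, 0, 2]
```

Here 0 and 499 fall into [0, 499], while 4500 and 5000 both fall into the last bin, [4500, 5000].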
You may assume that n_bins is a positive integer. Notice that including the
maximum expenditure in the last bin may make the last bin slightly ‘wider’
than the others. For example, if
In this question, you should ignore any nested dictionary with a None value for
the area key or the expenditure key. None values for other keys are
acceptable. You may assume that lower_spent and upper_spent are
positive. If lower_spent > upper_spent, your function should return a list
with a value of 0 for every expenditure.
1. The average expenditure for each financial year between 1997-98 and
2011-12.
2. A list of the top 5 expenditure areas by count. The report is only
interested in expenditures between 0-800 million (inclusive).
Top 5 areas should be listed in descending order by count, and only listed if
they have at least one expenditure. Ties should be broken by alphabetical
order. Next to the area name, in brackets, print the number of expenditures in
the area.
Here is an example of what your function should print. Make sure your
function matches the format exactly.
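The ranking rules for the top 5 areas (descending count, ties broken alphabetically, only areas with at least one expenditure) can be sketched as below. The expected output is not reproduced here, so the printed format is only a guess; the function name top_areas, its parameters, the area names and the assumption that expenditures are stored in millions are all hypothetical.

```python
from collections import Counter


def top_areas(data, lower=0, upper=800, limit=5):
    """Return up to `limit` (area, count) pairs, most frequent first."""
    counts = Counter()
    for record in data.values():
        area, spent = record['area'], record['expenditure']
        if area is None or spent is None:
            continue
        # Inclusive range check; values are assumed to be in millions.
        if lower <= int(spent) <= upper:
            counts[area] += 1
    # Sort by count (descending), then by area name (ascending) for ties.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


# Hypothetical sample data in the nested-dictionary format.
sample = {
    1: {'fin_year': '2001-02', 'area': 'Dental', 'expenditure': 100},
    2: {'fin_year': '2002-03', 'area': 'Dental', 'expenditure': 200},
    3: {'fin_year': '2003-04', 'area': 'Hospitals', 'expenditure': 300},
    4: {'fin_year': '2004-05', 'area': 'Research', 'expenditure': 50},
    5: {'fin_year': '2005-06', 'area': 'Hospitals', 'expenditure': 900},
    6: {'fin_year': '2006-07', 'area': None, 'expenditure': 10},
    7: {'fin_year': '2007-08', 'area': 'Research', 'expenditure': None},
}

ranking = top_areas(sample)
for area, count in ranking:
    print(f"{area} ({count})")
```

Areas never enter the Counter unless at least one of their expenditures is in range, which takes care of the "at least one expenditure" rule.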