MATH 1530 Capstone Projects

Capstone data projects are designed to create a more transferable experience for students in a beginning statistics course. Following the Guidelines for Assessment and Instruction in Statistical Education (GAISE) recommendations, these data projects use real data and require the use of technology to fully explore statistical concepts. Each section below gives a brief description of the dataset and project, along with preliminary discussion questions that can be used with students at the beginning of the project.

Youth Smoking Crisis

Source: Adapted from data and article in “Journal of Statistics Education” by Michael Kahn, Wheaton College

Variables: Age (years), Forced Expiratory Volume (liters), Height (inches), Sex (Female 0), Smoking (Nonsmoker 0)
Number of observations: 654

About the dataset

In 1979 and 1983, two of the earliest studies in the US were conducted to determine the relationship between children’s lung (pulmonary) function and the absence or presence of cigarette smoke, whether passively or actively inhaled.

In particular, researchers from these two studies measured the forced expiratory volume of children aged 3-19. Forced expiratory volume measures how much air (in liters) a person can exhale during a forced breath. To perform pulmonary function tests such as FEV, the patient is asked to take the deepest breath they can, and then exhale into the sensor as hard as possible, for as long as possible, preferably at least 6 seconds. Sometimes, the test will be preceded by a period of quiet breathing in and out from the sensor (tidal volume).

The maneuver is highly dependent on patient cooperation and effort, and is normally repeated at least three times to ensure quality of results. Due to the effort required, pulmonary function tests can only be used on children old enough to comprehend and follow the instructions given. Other types of lung function tests are available for infants and unconscious persons.

Average values for FEV in healthy people depend on several factors, as we will examine during the capstone. Values between 80% and 120% of the average value are considered the normal range.
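The 80% to 120% rule can be illustrated with a short calculation. The following is a minimal Python sketch, not part of the original project materials; the function names and the average value of 3.0 liters are hypothetical and chosen only for illustration.

```python
def fev_normal_range(average_fev, lower=0.80, upper=1.20):
    """Return the (low, high) bounds of the normal FEV range in liters."""
    return (lower * average_fev, upper * average_fev)


def is_in_normal_range(measured_fev, average_fev):
    """Check whether a measured FEV falls within 80% to 120% of the average."""
    low, high = fev_normal_range(average_fev)
    return low <= measured_fev <= high


# Hypothetical average FEV of 3.0 liters (an assumption, not a value from the dataset)
low, high = fev_normal_range(3.0)
print(f"Normal range: {low:.2f} to {high:.2f} liters")
print(is_in_normal_range(2.5, 3.0))  # a measurement inside the range
print(is_in_normal_range(2.0, 3.0))  # a measurement below the range
```

Note that the appropriate average depends on the person being tested, so the same measured FEV can be normal for one child and low for another.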

Preliminary Discussion Questions for Students

What conjectures (educated guesses) could be made about the relationship between smoking and FEV?
What do you think is the difference between average and normal range?
Why do you think the researchers chose FEV to measure the effects of smoking on lung function?

Reverse Discrimination: the Ricci v DeStefano case

Source: Adapted from data and article in “Journal of Statistics Education” by Weiwen Miao, Haverford College.

Variables: Race (W, B, H), Position (Lieutenant, Captain), Oral exam score, Written exam score
Number of observations: 118

About the dataset

In 2003, the New Haven Fire Department had seven openings for Captain and eight openings for Lieutenant. The department gave civil service examinations to fill the open positions. The exams consisted of two parts: a written exam worth 60% and an oral exam worth 40%. A total score greater than or equal to 70% was considered a passing grade.
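The weighting described above can be sketched as a small calculation. This is a minimal Python illustration that assumes both exam parts are scored on a 0 to 100 scale; the function names and the sample scores are hypothetical and do not come from the dataset.

```python
WRITTEN_WEIGHT = 60  # written exam counts for 60% of the total
ORAL_WEIGHT = 40     # oral exam counts for 40% of the total
PASSING_SCORE = 70   # total of 70% or higher is a pass


def total_score(written, oral):
    """Weighted total, assuming both parts are scored from 0 to 100."""
    return (WRITTEN_WEIGHT * written + ORAL_WEIGHT * oral) / 100


def passed(written, oral):
    """True when the weighted total meets or exceeds the passing score."""
    return total_score(written, oral) >= PASSING_SCORE


# Hypothetical candidates, for illustration only
print(total_score(75, 65), passed(75, 65))
print(total_score(60, 80), passed(60, 80))
```

Because the written exam carries more weight, a candidate with a strong oral score can still fall short if the written score is low.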

After reviewing (and publishing) the test results, the city of New Haven decided the test was discriminatory against black candidates. Because no black firefighters were eligible for advancement, the city threw out the results.

Ricci and sixteen other white test takers, plus one Hispanic test taker, all of whom would have qualified for consideration for a promotion, sued the city of New Haven and its mayor, John DeStefano, Jr. Their suit claimed that, by discarding the test results, the city discriminated against the plaintiffs based on their race. The city officials defended their actions, arguing that if they had certified the results they could have faced liability for adopting a practice that had an adverse impact on the minority firefighters.

Adverse impact is defined as a substantially different rate of selection in hiring, promotion, or other employment decision which works to the disadvantage of members of a race, sex, or ethnic group. Title VII of the Civil Rights Act of 1964 prohibits employment discrimination on the basis of race, color, religion, sex, or national origin (these groups are referred to as protected classes).
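Since adverse impact is defined in terms of selection rates, it can help to see how such rates are compared. The following is a minimal Python sketch using made-up counts for two hypothetical groups; the function names are illustrative, and the code does not decide what counts as "substantially different," which is a legal question.

```python
def selection_rate(selected, applicants):
    """Fraction of applicants in a group who were selected."""
    if applicants <= 0:
        raise ValueError("applicants must be a positive number")
    return selected / applicants


def rate_ratio(rate_a, rate_b):
    """Ratio of the lower selection rate to the higher one (1.0 means equal)."""
    higher = max(rate_a, rate_b)
    if higher == 0:
        return 1.0  # nobody was selected in either group, so the rates are equal
    return min(rate_a, rate_b) / higher


# Hypothetical counts, for illustration only (not taken from the case data)
rate_group_a = selection_rate(20, 40)
rate_group_b = selection_rate(5, 25)
print(rate_group_a, rate_group_b, rate_ratio(rate_group_a, rate_group_b))
```

A ratio well below 1.0 means one group was selected at a much lower rate than the other, which is the kind of difference the definition above describes.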

Adverse impact is generally the first step in establishing evidence of discrimination under Title VII. The burden is on the plaintiff to show that an employment decision adversely impacted a protected class. The finding of adverse impact shifts the burden of proof to the defendant and would require the employing organization to defend the employment decision in question by providing evidence that the process used to make the decision was valid.

Preliminary Discussion Questions for Students

What are your expectations for fairness in hiring?
What does the law say about fairness in hiring decisions?
How does our society enforce the definition of fairness in hiring decisions?

This is Appalachia

Source: Appalachian Regional Commission, a federal-state economic development partnership.

Variables: State, County, Economic Ranking (Distressed, At-Risk, Transitional, Competitive, Attainment), Completion of High School, Completion of College Degrees, Poverty rate, Income, Mortality rate, Coal production (tons)
Number of observations: 420

About the dataset

According to the Appalachian Regional Commission, the Appalachian Region is a 205,000-square-mile region that follows the spine of the Appalachian Mountains from southern New York to northern Mississippi. It includes all of West Virginia and parts of 12 other states: Alabama, Georgia, Kentucky, Maryland, Mississippi, New York, North Carolina, Ohio, Pennsylvania, South Carolina, Tennessee, and Virginia. Forty-two percent of the Region's population is rural, compared with 20 percent of the national population.

The Appalachian Region's economy, once highly dependent on mining, forestry, agriculture, chemical industries, and heavy industry, has become more diversified in recent times, and now includes manufacturing and professional service industries. Appalachia has come a long way in the past five decades: its poverty rate, 31 percent in 1960, was 16.6 percent over the 2008–2012 period. The number of high-poverty counties in the Appalachian Region (those with poverty rates more than 1.5 times the U.S. average) declined from 295 in 1960 to 107 over the 2008–2012 period.
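The definition of a high-poverty county (a poverty rate more than 1.5 times the U.S. average) can be sketched in code. This is a minimal Python illustration; the county names, poverty rates, and the U.S. average of 15.0 percent are all hypothetical and are not taken from the dataset.

```python
def high_poverty_counties(county_rates, us_average, multiplier=1.5):
    """Return names of counties whose poverty rate exceeds multiplier * us_average."""
    threshold = multiplier * us_average
    return [name for name, rate in county_rates.items() if rate > threshold]


# Hypothetical poverty rates in percent (assumed values, for illustration only)
county_rates = {
    "County A": 25.0,
    "County B": 16.0,
    "County C": 22.5,
}

# With an assumed U.S. average of 15.0 percent, the threshold is 22.5 percent
print(high_poverty_counties(county_rates, us_average=15.0))
```

Because the definition says "more than," a county sitting exactly at the threshold is not counted as high-poverty in this sketch.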

These gains have transformed the Region from one of widespread poverty to one of economic contrasts: some communities have successfully diversified their economies, while others still require basic infrastructure such as roads and water and sewer systems. The contrasts are not surprising in light of the Region's size and diversity. The Region includes 420 counties in 13 states. It extends more than 1,000 miles, from southern New York to northeastern Mississippi, and is home to more than 25 million people.

The dataset we will be working with this semester is based on US Census data analyzed and collated by the Appalachian Regional Commission. The data is arranged by both state and county so that regional differences can be examined.

Preliminary Discussion Questions for Students

Describe what you think of when you hear the word Appalachia.
What factors do you believe are important to an analysis of the Appalachian region?
What is the definition of per capita income? What might be a flaw in using only the per capita income as a measure of an area's economic status?

US Police-Involved Deaths

Source: Created from data compiled by Fatal Encounters, a 501(c)(3) organization. Note: This data is NOT a compilation of police shooting incidents. It includes deaths that occurred during police actions of all types.

Variables: Age (years), Gender (male, female), Race (European American, Hispanic-Latino, African American, Native American, Asian, Middle Eastern, Pacific Islander, Unknown), State, Method of Death, Disposition of case, Date of death (mm/dd/yy), Year of death
Number of observations: 3369

About the dataset

Case 1 - According to the article written by Maria L. La Ganga and Tina Susman in the Los Angeles Times; Nov. 16, 2014

James Boyd, a 38-year-old mentally ill homeless man who suffered from delusions, was camping illegally in the Sandia Foothills on March 16 of this year when Albuquerque police officers tried to arrest him. During a standoff, Boyd waved two knives, and 41 officers from various agencies surrounded him. “Finally, when Mr. Boyd appeared to be surrendering, officers threw a flash bang at him, released a dog to take him down, and shot him with a taser rifle,” according to a wrongful death suit filed against the city. “As Mr. Boyd turned away from the officers, two officers shot three rounds each, hitting him three times, twice in his side and back and once on his arm.” Boyd’s last words on the recording were, “Please don’t hurt me,” and “I can’t move.” Boyd was taken by ambulance to the University of New Mexico Hospital. His right arm was amputated and his spleen and intestine were removed. He died at 2:55 a.m. on March 17.

Case 2 - According to the article written by Richard Fausset in the Los Angeles Times; June 2, 2002

A 4-year-old girl was killed Saturday morning when an auto-theft suspect being pursued by Los Angeles police ran a red light on a busy downtown street, causing a chain-reaction accident that knocked over a traffic light, crushing the girl, authorities said.

Perhaps surprisingly, both of these cases would be considered “police-involved” deaths. “Police-involved deaths” is a much broader category than the often-assumed “police shootings.” It includes any incident in which police were called and one or more deaths occurred. Understandably, concern about police-involved deaths has become a very emotional and divisive controversy in this country. Allegations of racial bias and targeting abound, as does the charge that police often default to lethal responses when less force would have been sufficient. In response, policing agencies remind us that while in hindsight a lower-level response may seem more reasonable, the choice of response “in the moment” is often informed by intangibles like perceived threat level, adrenaline surge, and many other factors which are hard to measure after the fact.

In an attempt to examine the magnitude of the problem nationwide, many people have begun to search for a database of police-involved deaths of all types. It may be surprising to know that no single, complete source of this information exists.

Recently, a nonprofit group called Fatal Encounters has taken on the challenge of collecting this data from all over the US and placing it into a single searchable database. The organization chose to consider only incidents occurring on or after January 1, 2000. It is entirely self-funded and run by volunteers; the only paid employees are data-entry assistants. In this Capstone, we will be examining this database and trying to extract meaning from it.

Preliminary Discussion Questions for Students

Based on the two cases given, you can see that police-involved deaths include much more than police shootings. What other situations do you think would fit into this database? Give at least three general scenarios.
Do you think that the number of police-involved deaths has increased, decreased, or remained about the same since 2000? Why?
When two things occur together, it is called a correlation. Does that also mean that one thing causes the other? Why or why not?

Guns, Gun Ownership, and Crime

Source: Based on data collected from NationMaster and the World Health Organization

Variables: Country name, Gun ownership (per 100 in population), Gun deaths (per 100000 in population), Murder rate (per 100000 in population), Crime rate (per 1000 in population), Suicide rate (per 100000 in population), Suicides by gun (percentage), Suicides by other methods (percentage)
Number of observations: 27

About the dataset

In the current political climate it is difficult for Americans to dispassionately consider the impact of guns in our society. The issue of easy access to guns and what collateral problems this fact may, or may not, create has become a focus for many groups. Discourse about this subject has become rather divisive as the level of gun violence has risen. The rhetoric of the debate implies that conclusions about the effects of high gun ownership can be easily determined through the application of simple logic. If so, why are there so many diverse conclusions about guns in our society?

In this capstone project we will attempt to analyze data regarding guns, gun ownership and crimes. The data has been collected from 26 “developed” countries which have highly developed economies and advanced technological infrastructure. This group of countries has many traits in common and none are engaged in an active war on their own territory. When considering the domestic ownership and use of guns, nations engaged in organized, armed conflicts on their own soil must be eliminated from the comparison.

The data was collected from NationMaster and from the World Health Organization (WHO). NationMaster is a global team of statistical analysts who mine data from many sources, such as the CIA World Factbook and the United Nations data collections. In general, the data has been adjusted to show “per capita” rates, which makes analysis more straightforward.
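To see why adjusted rates make comparison easier, consider how a raw count is converted to a rate per 100,000 people. The following is a minimal Python sketch; the function name, the country labels, and all counts and populations are hypothetical and are not taken from the dataset.

```python
def rate_per(count, population, per=100_000):
    """Convert a raw count into a rate per `per` people."""
    if population <= 0:
        raise ValueError("population must be a positive number")
    return count * per / population


# Hypothetical countries (assumed numbers, for illustration only)
country_x = rate_per(count=500, population=10_000_000)
country_y = rate_per(count=300, population=2_000_000)

# Country X has the larger raw count, but Country Y has the higher rate
print(country_x, country_y)
```

Passing a different `per` value gives the other scales used in the variable list, such as per 100 or per 1000 in population.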

Preliminary Discussion Questions for Students

What conjectures (educated guesses) could be made about the relationship between guns and crime?
Why would data which simply gave the “number” of guns or crimes be more difficult to compare across various nations?
Why does the data contain a column for gun deaths as well as a column for murders?

Global Health

Source: Based on data from the World Health Organization report: World Health Statistics 2017

Variables: Country, Infant mortality (per 1000 live births), Health expenditures per capita, Obesity rate, Average income per capita, Suicide rate (per 100000), Life expectancy, Universal health care (yes, no), Diabetes rate, Leading cause of death, Hospital beds (per 100000)
Number of observations: 47

About the dataset

The dataset for Global Health has been compiled from the massive dataset of the World Health Organization. The World Health Organization (WHO) is a specialized agency of the United Nations that is concerned with international public health. It was established on 7 April 1948, and is headquartered in Geneva, Switzerland.

Since its creation, WHO has played a leading role in the eradication of smallpox. Its current priorities include communicable diseases, in particular HIV/AIDS, Ebola, malaria and tuberculosis; the mitigation of the effects of non-communicable diseases; sexual and reproductive health, development, and aging; nutrition, food security and healthy eating; occupational health; substance abuse; and driving the development of reporting, publications, and networking.

The WHO is responsible for the World Health Report, the worldwide World Health Survey, and World Health Day. The Director-General of WHO is Tedros Adhanom, who started his five-year term on 1 July 2017.

The dataset includes a wide range of countries but does not attempt to include data from all countries. The selected variables range from economic indicators (average per capita income) to health care delivery, reflected in whether or not the country has universal health care.

Preliminary Discussion Questions for Students

What factors do you think are most important in examining global health issues?
Describe what you think are some significant differences in the health of citizens in the United States and Canada.
What is the definition of average per capita income? What might be a flaw of using only per capita income as a measure of a country’s economic status?

Global Terrorism

Source: Based on data collected from the Global Terrorism Database (GTD) maintained by the National Consortium for the Study of Terrorism and Responses to Terrorism at the University of Maryland

Variables: Country, Region, Target type, Attacker identity, Weapon type, Number killed, Number wounded
Number of observations: 16820

About the dataset

The Global Terrorism Database (GTD) is maintained by the National Consortium for the Study of Terrorism and Responses to Terrorism at the University of Maryland. Using many sources from governments and organizations worldwide, the GTD collects and vets incidents of terrorism for the database. To determine whether a violent incident should be included in the database, the GTD applies a definition of terrorism. The GTD defines a terrorist attack as the threatened or actual use of illegal force and violence by a non-state actor to attain a political, economic, religious, or social goal through fear, coercion, or intimidation. In practice, to be included as a terrorist event in the database, the incident must have all three of the following characteristics:

The incident must be intentional.
The incident must entail some level of violence or immediate threat of violence.
The perpetrators of the incidents must be sub-national actors.

In addition, the event must have at least two of the following three additional criteria:

The act must be aimed at attaining a political, economic, religious, or social goal.
There must be evidence of an intention to coerce, intimidate, or convey some other message to a larger audience than the immediate victims.
The action must be outside the context of legitimate warfare activities.
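The inclusion rule above (all three required characteristics, plus at least two of the three additional criteria) can be sketched as a simple check. This is a minimal Python illustration; the `Incident` class and its field names are hypothetical and are not the GTD's actual data format.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    # Required characteristics (all three must hold)
    intentional: bool
    violent_or_threatening: bool
    subnational_actors: bool
    # Additional criteria (at least two of three must hold)
    has_broader_goal: bool       # political, economic, religious, or social goal
    intends_to_coerce: bool      # message aimed at a larger audience
    outside_warfare: bool        # outside legitimate warfare activities


def meets_inclusion_rule(incident):
    """Apply the rule: all required characteristics and at least two additional criteria."""
    required = [
        incident.intentional,
        incident.violent_or_threatening,
        incident.subnational_actors,
    ]
    additional = [
        incident.has_broader_goal,
        incident.intends_to_coerce,
        incident.outside_warfare,
    ]
    return all(required) and sum(additional) >= 2


# Hypothetical incident: meets all required characteristics and two additional criteria
example = Incident(True, True, True, True, True, False)
print(meets_inclusion_rule(example))
```

An incident that fails any one of the required characteristics is excluded no matter how many of the additional criteria it meets.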

The dataset we will examine is extracted from the full GTD and includes only data from 2014. The variables that we will consider include global region as defined by the GTD, country where the incident occurred, date, type of attack (weapon), terrorist group responsible (if known), number of deaths, and number of wounded.

Preliminary Discussion Questions for Students

Do you believe that terrorism is a problem in the United States? Why?
Why do you think that having a formal definition of “terrorism” is needed when gathering this type of data?
One of the factors in the dataset is type of attack or weapon used. Do you believe that the number of deaths will be related to the type of attack perpetrated? Explain your reasoning.

For more information about these projects, including access to datasets and project tasks, contact SADobbyn 'at' pstcc.edu or BLMosby 'at' pstcc.edu.

