Module 1 Unit 4 - Framing a Data Science Question#
Framing a Data Science Question#
Now that we’re more familiar with the applications and potential risks of data science, let’s look at what it takes to get started.
Data availability#
To do data science, we’ll need some data to work with. If a question doesn’t use data that can be easily found online or collected as a class activity, it might be worth changing or revising it.
“Tasks” by XKCD is licensed under CC BY-NC 2.5
Data science professionals are often asked to work with data sets that already exist. Coming up with questions that make good use of a data set that was not originally collected with these questions in mind is a great example of creative thinking and problem solving.
There are many organizations that make real data they have collected as part of their operations publicly available online.
For example, the Government of Canada and many provincial, territorial, and municipal governments have open data portals where anyone can view and access the data sets they have created.
In most cases, we can easily find the websites that host these data sets by simply doing an online search for the country, region, or community name plus “open data.”
Many post-secondary and research institutions, scientific organizations, and specific governmental agencies also make data available for public use.
Examples of data sources
Sports data sets for baseball, hockey, and basketball
Wikipedia tables
We’ll explore sources of data in more detail in the next module.
A note about data ownership#
Just like a video, photograph, book, or other form of media, data is considered to be property. Organizations are increasingly recognizing data sets as business assets and are controlling who is permitted to access and use them.
📚 Read#
The website data.world provides a list of common license types for data sets and details about their restrictions.
Choosing great data science questions#
As we explored in earlier units, insights from data science can help us make better decisions than if we only had gut feelings or incomplete information to go by.
That said, our instincts and emotions can be a great resource for helping us design questions that are exciting and that relate to our interests or the interests of our students.
Many of us already do this when planning lessons or when teaching students about science experiments. When considering what questions could be explored using data science, try asking:
What is an observation, fact, or topic we feel curious about?
What kind of variables might impact that observation?
Are those variables measurable?
Are data sets related to this question or topic already available?
What kind of benefits might an analysis of this data have? Is there a way it could help us or help others make decisions?
Let’s look at a quick example.
Alyssa is curious about the factors that can strengthen or weaken a country’s economy, and wonders if the health of residents has a clear impact.
On the Gapminder website she finds data that compares income and life expectancy by country over time.
In the visualization generated by Gapminder, she notices that both measures have improved over the last 200 years, and that as income (per capita GDP) increases, the average life expectancy increases.
She concludes that income and life expectancy are correlated, which suggests that there may also be a relationship between initiatives known to affect life expectancy and economic output.
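A relationship like the one Alyssa notices can be summarized with a correlation coefficient. The sketch below is a minimal illustration, not the Gapminder analysis itself: the `pearson` function name and the five income and life expectancy values are hypothetical placeholders chosen only to show the calculation. Real values would come from the Gapminder data.

```python
import math


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations from the means (covariance numerator)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical placeholder values, NOT real Gapminder figures
income_per_capita = [1000, 2000, 4000, 8000, 16000]
life_expectancy = [50, 58, 65, 71, 76]

if __name__ == "__main__":
    r = pearson(income_per_capita, life_expectancy)
    print(f"Correlation coefficient: {r:.2f}")
```

A coefficient close to 1 means the two measures tend to rise together. On its own, it does not show that one causes the other.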
🎥 Watch#
In this 5-minute video, instructor David Hay explores additional Gapminder data using Jupyter notebooks and looks into the relationship between birth rates and child mortality.
from IPython.display import YouTubeVideo
YouTubeVideo('o8JEHJaDg4o')
Here are some other examples of data science questions we could explore with students as part of our regular curriculum:
Does weather affect voter turnout?
How many 10-digit prime numbers exist?
What are the dominant tree species in a neighbourhood?
In Shakespeare’s Hamlet, which characters speak the most?
Is a hockey player’s salary correlated with the number of points they earn?
What data from particle collisions support the existence of the Higgs boson?
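Some of these questions can be explored with a few lines of code. As one possible starting point for the prime number question, the sketch below counts primes with a given number of digits using the Sieve of Eratosthenes. The function name `count_primes_with_digits` is our own, and this simple approach is only practical for small digit counts: a sieve covering every 10-digit number would need roughly 10 GB of memory, so answering the full question calls for a more advanced method.

```python
import math


def count_primes_with_digits(n_digits):
    """Count the primes that have exactly n_digits decimal digits."""
    lower = 10 ** (n_digits - 1)   # smallest number with n_digits digits
    upper = 10 ** n_digits         # one past the largest such number

    # sieve[i] is 1 while i is still considered prime
    sieve = bytearray([1]) * upper
    sieve[0] = sieve[1] = 0
    for p in range(2, math.isqrt(upper) + 1):
        if sieve[p]:
            # Cross out every multiple of p, starting at p * p
            sieve[p * p::p] = bytearray(len(range(p * p, upper, p)))

    return sum(sieve[lower:upper])


if __name__ == "__main__":
    for digits in range(1, 5):
        print(digits, "digit primes:", count_primes_with_digits(digits))
    # 1 digit primes: 4
    # 2 digit primes: 21
    # 3 digit primes: 143
    # 4 digit primes: 1061
```

Students could extend the loop to larger digit counts and observe how quickly the running time and memory use grow.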
🏷️ Activity#
Come up with three possible questions that could be answered using data science. Can you find a data set that could help answer one of them?
Share your question and data set link in the discussion area below. Do you see any posts by other people with questions or data sets you could use in your classroom?
Example question:
“How does Canada compare to other countries in the world? By looking at the data, we find that the gap between Canada and other countries we consider less developed isn’t as big as we might think. The Gapminder Foundation has great data sources for exploring the world. For example, this data set explores the Gross Domestic Product of countries.”
🗣️ Discussion Board#
Conclusion#
In this module, we have introduced data science and how it relates to education. In the next module, we will start working with data and will be introduced to Jupyter notebooks.
References#
Reference materials
A web comic featuring math and programming humour - XKCD.
The importance of digital and data literacy - National Skills Coalition.
High school data science education, a student perspective - Harvard Data Science Review.
Key data science concepts: