Module 2 Unit 3 - Data Sources#
Data Sources#
As we previously mentioned, many organizations such as post-secondary or private research institutions, scientific organizations, and governmental agencies make data they have collected available for public use.
Data sets published under an open license can be used for all kinds of things, including as teaching tools or to conduct research unrelated to their original purpose.
Read#
One way to find data sets is through a simple internet search. Google has created a specialized search tool for this very purpose, the Google Dataset Search.
Let's take a closer look at some different types of data sources.
Private data sources#
Companies that provide online services, such as Alphabet (Google), Amazon, Apple, Twitter, and Meta, collect vast amounts of data about users and their online activities.
These companies use the data to improve their services and develop new products, but they also sell access to much of this data to other organizations for advertising, research, and marketing purposes.
Data from private companies is often not freely available for educational purposes; however, there are exceptions. The social media platform Twitter allows anyone to search and download posts made in the last week, and provides a variety of filters to help users tailor the results to their needs.
That said, this free access covers only a very limited subset of the data and requires the use of an application programming interface (API). It is still noteworthy, and it is used quite extensively by social science researchers, such as those at the Social Media Lab, a research laboratory at Ryerson University.
A variety of free tools can help people who want to use social media in their research access and download data from Twitter and other platforms.
Read#
Social media data in research: a review of the current landscape. This short 2019 article by Lily Davies, a Digital Humanities master's student at UCL, summarizes some of the tools used to scrape data from social media platforms.
Government data sources#
Governments are increasingly making an effort to provide public access to data they have collected.
Data that is freely available to be used, shared, and built on is referred to as open data. In many cases, this data is also structured to be machine readable and is accompanied by documentation about its format, along with metadata describing how the data was collected and how it is intended to be used.
The Government of Canada, Statistics Canada, the provincial and territorial governments, and even many municipalities have open data portals where anyone can find data sets created as part of government projects.
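To make "machine readable" a little more concrete, here is a minimal sketch of how a program might read a small open data file in CSV format. The column names and values are invented for illustration and do not come from any real portal; a real data set would be downloaded from a portal and described by its accompanying documentation.

```python
import csv
import io

# Hypothetical extract from an open data portal.
# The cities, years, and populations are made up for illustration.
CSV_TEXT = """city,year,population
Springfield,2020,1000
Springfield,2021,1100
Shelbyville,2020,500
Shelbyville,2021,450
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, converting numeric columns to int."""
    reader = csv.DictReader(io.StringIO(text))
    rows = []
    for row in reader:
        row["year"] = int(row["year"])
        row["population"] = int(row["population"])
        rows.append(row)
    return rows


def population_change(rows, city):
    """Return latest-year population minus earliest-year population for one city."""
    city_rows = sorted(
        (r for r in rows if r["city"] == city),
        key=lambda r: r["year"],
    )
    return city_rows[-1]["population"] - city_rows[0]["population"]


rows = load_rows(CSV_TEXT)
print(population_change(rows, "Springfield"))
print(population_change(rows, "Shelbyville"))
```

Because the file has a consistent structure (a header row followed by one record per line), the program can answer a question about the data without anyone retyping it by hand.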
Explore
Open Government Programs in Canada is an interactive map of the various open data portals around the country.
Explore
Major Smart Cities with Open Data is a list of cities around the world with open data portals.
Academic data sources#
Post-secondary institutions generate a lot of valuable research data and, thanks to the UNESCO Recommendation on Open Science, are increasingly making their data sets available to the public in formats that allow them to be explored, shared, and expanded upon. These efforts include:
OpenDOAR, a global directory of open access repositories.
Re3Data, an online registry of research data repositories.
Figshare, a repository where users can make all of their research outputs available in a citable, shareable, and discoverable manner.
Dryad, a community-owned and curated research data resource.
Non-profit data sources#
Rich data sets are also made available by other sources including non-profit organizations. Some of the non-profit organizations sharing open data sets include:
Overall, the challenging part is often finding the relevant data source, which is what makes data set aggregators like the Google Dataset Search so valuable.
Generating our own data sets#
As previously mentioned, reusing existing data sets can often be faster and easier than creating our own. However, some methods of data collection are reasonable for use in a classroom setting.
Web scraping involves using automated tools to gather information from webpages and convert it into a format that is convenient for data analysis.
For example, we could use this method to gather data related to NHL hockey teams and individual player performance records.
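As a rough illustration of the idea, the sketch below pulls the rows out of a small HTML table and turns them into a list of records. The HTML is a hard-coded, made-up stand-in for a downloaded webpage (the team names and numbers are not real NHL data), and the `TableParser` and `scrape_table` names are hypothetical helpers written for this example using only Python's standard library.

```python
from html.parser import HTMLParser

# Made-up table standing in for the contents of a webpage; not real NHL data.
SAMPLE_HTML = """
<table>
  <tr><th>Team</th><th>Wins</th><th>Losses</th></tr>
  <tr><td>Team A</td><td>10</td><td>4</td></tr>
  <tr><td>Team B</td><td>7</td><td>7</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []         # completed rows, each a list of cell strings
        self._row = None       # row currently being built, or None
        self._in_cell = False  # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    """Return one dict per data row, keyed by the header row's cell text."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


records = scrape_table(SAMPLE_HTML)
for record in records:
    print(record)
```

Every value comes out as text, so numeric columns such as `Wins` would still need to be converted before analysis. Real webpages are also far messier than this sample.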
Scraping live website data can be technically challenging, so we won't be exploring these methods in this course. However, for those teachers and students who are interested in learning how to do this on their own, the Codecademy article linked below provides more information.
Read (Optional)#
Activity#
What's your favourite hobby? Can you find a data set associated with it (preferably an open data set)?
Hint: Try Google Dataset Search
Or, based on your location, can you find the nearest government open data portal that's relevant to you? Within that data repository, can you find a data set that interests you?
Conclusion#
In this unit, we learned about some of the resources teachers and students can use to access data and different types of use licenses.
Open data portals let us explore real data that is relevant to our lives and is more interesting to explore than outdated or made-up examples.
However, there is so much out there that it can be hard to choose a data set for use in the classroom.
In the next unit, we'll dive deeper into what makes a data set good for classroom analysis and data science in general.