Module 2 Unit 5 - Getting Started with Data Science Tools
The case for code
Data sets used by data scientists tend to be big. Large data sets can mean more accurate insights, and a greater variety of questions can be answered. Even data sets that are too small and simple to count as big data are still much larger than what we have traditionally worked with in schools.
Remember the Census by Community from Open Calgary that we looked at in a previous unit? That data set is relatively small, with only 306 rows and 142 columns.
By contrast, the Community Crime Statistics from the same open data portal has 37,500 rows, 12 columns and contains several years of data.
Calgary is a relatively large Canadian city with approximately 1.2 million people. But what if we wanted to compare its crime data with that of another city, such as Toronto? At approximately 2.9 million people, Toronto is more than double the size of Calgary, and the Major Crime Indicators (MCI) 2014 to 2019 data set collected by the Toronto Police Service has 206,435 rows and 27 columns.
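To get a feel for these sizes, the row and column counts quoted above can be multiplied out in a few lines of Python. This is only a minimal sketch: the variable names and data set labels are ours, chosen for illustration, and the figures are simply the ones quoted in this unit.

```python
# Row and column counts as quoted in this unit: name -> (rows, columns).
data_sets = {
    "Calgary Census by Community": (306, 142),
    "Calgary Community Crime Statistics": (37_500, 12),
    "Toronto Major Crime Indicators": (206_435, 27),
}

# Number of individual values (cells) in each data set.
cell_counts = {
    name: rows * columns for name, (rows, columns) in data_sets.items()
}

for name, cells in cell_counts.items():
    print(f"{name}: {cells:,} values")

# Approximate populations in millions, as quoted above.
population_ratio = 2.9 / 1.2
print(f"Toronto is about {population_ratio:.1f} times the size of Calgary")
```

Even the smallest of the three holds tens of thousands of individual values, and the Toronto data set holds several million.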
If a data set is small, some analysis can be done inside a spreadsheet application like Microsoft Excel or Google Sheets, but it takes code to really unlock the potential of data science. Many fascinating data sets are simply too large for spreadsheet applications to efficiently work with, but beyond that, code gives us the ability to explore and visualize data in a much greater variety of ways.
Instead of relying on the pre-packaged features in spreadsheet software for performing analysis and creating charts, code lets us re-create nearly any form of exploration or analysis we have heard of, or even invent our own.
In terms of what you can do and create, it’s like the difference between having a microwave and having access to a fully stocked professional kitchen.
Most teachers and students will not become software developers or professional data scientists, but knowing our way around the basics gives us agency in an increasingly data-oriented world, and passing these skills to our students helps to break down digital inequity.
In this course, we’ll be using the Python programming language, along with Jupyter Notebooks, a web application for writing and running Python code.
This powerful and versatile combination is widely used by professional data scientists.
We’ll be working from the Callysto Hub, a free educational environment where we can store data, write code, experiment with visualizations, and show our results.
Callysto Hub
Jupyter Notebooks
A Jupyter notebook serves as both the text editor for writing code and the environment for running it and displaying the output.
For many people familiar with coding, being able to write and run code in the same document might be a new experience that takes some getting used to. However, this system is great for new learners, as they can work more quickly and don’t need to access as many different tools.
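As a rough illustration of what a notebook code cell might contain, here is a short Python snippet. The community names and counts are made up for this example and do not come from any real data set. In a Jupyter notebook, you would type this into a code cell and run it (for example with Shift+Enter), and the printed output would appear directly below the cell.

```python
# Made-up incident counts for three hypothetical communities.
incidents = {"Community A": 12, "Community B": 30, "Community C": 18}

# Add up all the counts.
total = sum(incidents.values())

# Find the community with the highest count.
busiest = max(incidents, key=incidents.get)

print(f"Total incidents: {total}")
print(f"Most incidents: {busiest}")
```

Changing a number and re-running the cell updates the output in place, which is what makes the write-and-run workflow quick to experiment with.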
An example of structured data is the Census by Community from the City of Calgary’s open data portal, Open Calgary.
Additional notes about Python and Jupyter notebooks
🏁 Activity
Conclusion
In this module, we learned more about what data is, why it’s useful, and where to find it. We also explored the qualities that make a data set useful for data science projects, and we familiarized ourselves with two tools for working with data sets using code: Python and Jupyter notebooks.
In the next module, we’ll roll up our sleeves and put what we’ve learned to work. We’ll learn how to add data to Jupyter notebooks and how to organize and transform it in ways that allow us to see it from new perspectives.