{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)\n", "\n", "\"Open" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Joining Datasets\n", "\n", "Unlike standard SQL, SODA does not (currently) support joining separate datasets. As you may recall, this is a fundamental part of working with data. Often there are different sets of data for the same entity, and it doesn't make sense to store them all in one dataset. Even if you wanted to, this would quickly increase the number of columns in a dataset.\n", "\n", "For example, the [School Locations](https://data.calgary.ca/Services-and-Amenities/School-Locations/fd9t-tdn2) dataset we have already looked at stores the geographical information of schools in Calgary. The [School Enrolment](https://data.calgary.ca/Demographics/School-Enrolment-Data/9qye-mibh) dataset stores the enrollment information for schools. These two datasets should remain separate: most often, you would not need enrollment numbers when looking up locations, and vice versa. If the datasets were permanently combined, you would likely have a lot of redundant columns. Moreover, the school enrollment dataset is *keyed* on the school year, so you would have to repeat the geographical data in every row, which would be a poor use of space.\n", "\n", "But what if we wanted to demonstrate additional features of `scatter_mapbox` by visualizing school enrollment? We need to obtain data from the two separate datasets and dynamically *join* them together. As mentioned, we can't do this in SODA, but we can do it programmatically using the pandas `merge` function."
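] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before working with the real datasets, here is a minimal sketch of what `merge` does, using two tiny made-up dataframes (the school names and values below are purely illustrative, not from the Calgary datasets):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "# two small example dataframes that share a 'name' key\n", "locations = pd.DataFrame({'name': ['A School', 'B School'], 'latitude': [51.04, 51.08]})\n", "enrollment = pd.DataFrame({'name': ['A School', 'B School'], 'total': [300, 450]})\n", "\n", "# merge pairs up rows whose 'name' values match\n", "merged = pd.merge(left=enrollment, right=locations, how='left', on='name')\n", "merged"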
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import requests as rq\n", "import pandas as pd\n", "import io\n", "\n", "domain = \"https://data.calgary.ca/resource/\"\n", "\n", "uuid_school_locations = \"fd9t-tdn2\"\n", "uuid_school_enrollment = \"9qye-mibh\"\n", "\n", "def run_query(domain, uuid, query):\n", "    # request the query results in CSV format and read them into a dataframe\n", "    session = rq.Session()\n", "    results = session.get(domain + uuid + \".csv?$query=\" + query)\n", "    dataframe = pd.read_csv(io.StringIO(results.content.decode('utf-8')))\n", "    return dataframe" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first dataset contains the latitude and longitude of each school." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "query = \"\"\"\n", "SELECT\n", "    *\n", "\"\"\"\n", "\n", "calgary_school_location = run_query(domain, uuid_school_locations, query)\n", "calgary_school_location" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The second dataset contains enrollment data. Note that it also contains a `school_year` column, which makes sense, as enrollment values change over time. However, our intent is to visualize enrollment at a specific point in time, so we use the WHERE clause to keep only the rows for that school year." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "query = \"\"\"\n", "SELECT\n", "    *\n", "WHERE\n", "    school_year = '2019-2020'\n", "\"\"\"\n", "\n", "calgary_school_enrollment = run_query(domain, uuid_school_enrollment, query)\n", "print(calgary_school_enrollment)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data from CSV files\n", "\n", "We could instead use pandas to import the data from CSV ([comma-separated values](https://en.wikipedia.org/wiki/Comma-separated_values)) files."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "calgary_school_location = pd.read_csv(\"https://data.calgary.ca/resource/64t2-ax4d.csv\")\n", "calgary_school_enrollment = pd.read_csv(\"https://data.calgary.ca/resource/9qye-mibh.csv?$where=School_Year%20=%20'2019-2020'\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "calgary_school_location" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Joining data sets\n", "\n", "Whichever method we used to import the data, we can now *join* the two datasets on the school name (`school_name` in `calgary_school_enrollment`, and `name` or `NAME` in `calgary_school_location`, depending on the import method), as the values in these fields match up. In proper SQL parlance, they share a *key*. The enrollment dataset has already been filtered to a single school year, so there should be no duplicate school names in it.\n", "\n", "Thus, we can do a simple left join with the enrollment data on the left and the locations on the right. That way, our resulting dataframe will only contain schools that have enrollment data, and we avoid NaN (i.e. null) values in the `total` field, which would cause problems later when we use that field to size the map markers."
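] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick sanity check on that key assumption (a sketch, assuming the enrollment data was loaded with the `run_query` function above), we can confirm there are no duplicate school names before merging — duplicate keys would produce extra rows in the join:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# True would indicate duplicate keys; we expect False\n", "calgary_school_enrollment['school_name'].duplicated().any()"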
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "calgary_school_location_enrollment = pd.merge(left=calgary_school_enrollment,\n", "                                              right=calgary_school_location,\n", "                                              how='left',\n", "                                              left_on='school_name',\n", "                                              right_on='NAME')\n", "                                              # right_on='name') # use this line instead if you imported the data using requests\n", "\n", "calgary_school_location_enrollment" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Visualizing the data\n", "\n", "We will now visualize the resulting dataset, using the `size` and `color` parameters to add two extra dimensions of data: the total enrollment and the school authority.\n", "\n", "##### Note\n", "\n", "- `showlegend=False` is passed to `update_layout` to hide the legend, but this is only to clean up the presentation!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import plotly.express as px\n", "\n", "figure1 = px.scatter_mapbox(calgary_school_location_enrollment,\n", "                            size=\"total\",\n", "                            color=\"school_authority_name\",\n", "                            color_continuous_scale=px.colors.cyclical.IceFire,\n", "                            size_max=45,\n", "                            lat=\"latitude\",\n", "                            lon=\"longitude\",\n", "                            hover_name=\"school_name\",\n", "                            hover_data=[\"TYPE\", \"GRADES\", \"ADDRESS_AB\"],\n", "                            #hover_data=[\"type\", \"grades\", \"address_ab\"], # use this line instead if you imported the data using requests\n", "                            zoom=10,\n", "                            height=600)\n", "\n", "figure1.update_layout(mapbox_style=\"open-street-map\")\n", "figure1.update_layout(showlegend=False)\n", "figure1.update_layout(margin={\"r\":0,\"t\":0,\"l\":0,\"b\":0})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n", "\n", "This notebook introduced joining datasets with the pandas `merge` function. The last notebook in this series is [filtering datasets](./4-filtering-datasets.ipynb)."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 2 }