{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)\n", "\n", "\"Open" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Socrata and Soda\n", "\n", "by Gustaaf J Wehnes, Calgary Board of Education\n", "\n", "We will be using Socrata open data for our forays into the rapidly emerging field of data science. Socrata is used by many governments and organisations, including our own City of Calgary. Take some time to familiarize yourself with the technology through their [home page](https://dev.socrata.com/data/)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The SODA API\n", "As programmers, we will want to know how to *programatically* interact with open data sources. will be using an API (Application Programming Interface) developed Thus, you will also want to bookmark the [Socrata Open Data API (aka SODA)](https://dev.socrata.com/). Pay particular attention to the [Getting Started](https://dev.socrata.com/consumers/getting-started.html) and the [API Docs](https://dev.socrata.com/docs/endpoints.html) pages. \n", "\n", "If you don't know what an API is, here is a [definition](https://www.webopedia.com/TERM/A/API.html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The City of Calgary datasets\n", "Also have a look at [The City of Calgary’s Open Data Portal](https://data.calgary.ca/). There are all kinds of cool data sets here, along with tools for visualizing this data. Hopefully you can already start seeing how open data provides a massive opportunity for app developers and researchers.\n", "\n", "In particular, have a look at three datasets we will be using in our journey: [River Levels and Flows](https://data.calgary.ca/Environment/River-Levels-and-Flows/5fdg-ifgr), [School Locations](https://data.calgary.ca/Services-and-Amenities/School-Locations/fd9t-tdn2), and [School Enrolment](https://data.calgary.ca/Demographics/School-Enrolment-Data/9qye-mibh)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "https://dev.socrata.com/docs/functions/#,\n", "\n", "https://data.calgary.ca/Environment/River-Levels-and-Flows/5fdg-ifgr" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Queries in SODA\n", "\n", "The Socrata interface allows the building of SQL-like queries: [query Parameter](https://dev.socrata.com/docs/queries/query.html). Theses queries are much more readable than the queries we were writing before using keywords in a URL" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To allow the potential running of multiple queries, let's take the time to create a function that does some of the dirty work for us. We use the requests library to obtain data from an internet request based on the domain, uuid, and query. The get method returns a byte array, which we then have to convert to a dataframe\n", "\n", "**function** *run_query*\n", "\n", "**parameter** *domain*: refers to the organization that supplies the data\n", "\n", "**parameter** *uuid*: universal unique ID; refers to the specific dataset.\n", "\n", "**parameter** *query*: SODA query\n", "\n", "**returns**: pandas dataframe\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Python tools\n", "\n", "The last few pieces that needs introducing are the Python libraries that we will use. Again, take some time to have a look. The first is an open source data analysis and manipulation tool called [pandas](https://pandas.pydata.org/). Two other libraries, which you only need a passing familiarity with, are [requests](https://requests.readthedocs.io/en/master/) and [Python io](https://docs.python.org/3/library/io.html).\n", "\n", "Don't get overwhelmed; there is a lot of functionality here, and just like me, you will just learn it as you go.\n", "\n", "The beauty of using Jupyter is that these tools are pre-installed! So, no messing with downloading and installing. We simply import them :-)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import requests as rq\n", "import pandas as pd\n", "import io as io" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Some terminology and a helper function\n", "\n", "To simplify the execution of multiple queries, let's take the time to create a function that does some of the dirty work for us. We use the requests library to obtain data from an internet request based on the [domain](https://www.webopedia.com/TERM/D/domain_name.html), [uuid](https://en.wikipedia.org/wiki/Universally_unique_identifier) of the dataset, and the *query* itself. A 'session' object is created with the requests library; the get method is used to return a byte array from an [http](https://www.webopedia.com/TERM/H/HTTP.html) source, and we then covert that array to a [pandas dataframe object](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Note that we also import the four libraries previously mentioned.\n", "\n", "**function** *run_query*\n", "\n", "**parameter** *domain*: refers to the organization that supplies the data\n", "\n", "**parameter** *uuid*: universal unique ID; refers to the specific dataset.\n", "\n", "**parameter** *query*: SODA query\n", "\n", "**returns**: pandas dataframe" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def run_query(domain, uuid, query):\n", " session = rq.Session()\n", " results = session.get(domain + uuid +\".csv?$query=\" + query)\n", " dataframe = pd.read_csv(io.StringIO(results.content.decode('utf-8')))\n", " return dataframe" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## READY? Lets query some data!\n", "\n", "You may already have come across the documentation for the SODA [query](https://dev.socrata.com/docs/queries/query.html). This will look quite familiar to the SQL statements that you have already learned about. Let's run through a pile of examples, where you can see how you can use the tools to obtain data in CSV (or JSON) format." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### simple query\n", "Retrieve the entire dataset.\n", "\n", "##### Note:\n", "- A maximum of 1000 records are returned. That is probably a good thing, as the dataset itself is *huge* (5.48 million records as of April 2020, and ever growing)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "domain = \"https://data.calgary.ca/resource/\"\n", "\n", "uuid_river_levels = \"5fdg-ifgr\"\n", "uuid_school_enrollment = \"9qye-mibh\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "query = \"\"\"\n", "SELECT *\n", "\"\"\"\n", "\n", "river_levels = run_query(domain, uuid_river_levels, query)\n", "river_levels" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### limiting the amount of data\n", "Retrieve only the first 100 record of the dataset.\n", "\n", "##### Note:\n", "- You can set any limit that you like, but be aware of how much data you are going to get back. Obviously, the larger the dataset (rows * columns) the longer the download time (and the data used)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "query = \"\"\"\n", "SELECT\n", " *\n", "LIMIT\n", " 100\n", "\"\"\"\n", "\n", "river_levels = run_query(domain, uuid_river_levels, query)\n", "river_levels" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### paging through data\n", "Retrieve records 101 to 200. You will again receive 100 record.\n", "\n", "##### Note:\n", "- This is known as [paging](https://dev.socrata.com/docs/paging.html). You could, if you want, retrieve the data in small chunks. We will not be doing this in this course, but it is good to know that it can be done." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "query = \"\"\"\n", "SELECT\n", " *\n", "LIMIT\n", " 100\n", "OFFSET\n", " 100\n", "\"\"\"\n", "\n", "river_levels = run_query(domain, uuid_river_levels, query)\n", "river_levels" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### select a limited number of fields\n", "This can dramatically reduce the amount of data returned (as the dataframe is like a table, so rows x columns)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "query = \"\"\"\n", "SELECT\n", " station_number,\n", " timestamp,\n", " level\n", "LIMIT\n", " 100 \n", "\"\"\"\n", "\n", "river_levels = run_query(domain, uuid_river_levels, query)\n", "river_levels" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### providing an alias for a field\n", "You can rename a field with a custom name. This will be useful later when we calculate our own fields; for now, you can see the syntax is simple. Observe that the dataframe has the desired field names." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "query = \"\"\"\n", "SELECT\n", " station_number as station,\n", " timestamp as time,\n", " level as river_level\n", "LIMIT\n", " 100 \n", "\"\"\"\n", "\n", "river_levels = run_query(domain, uuid_river_levels, query)\n", "river_levels" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### data transform functions\n", "If you look at the dataset 'meta' data on the City of Calgary's website, you will notice that all of the fields have a type. This is no different than in most programming languages: you have text, integers, and even date types.\n", "\n", "Unfortunately, fields are sometimes not of the desired type. This may be due to the dataset not being well designed, or for reasons of saving space. Fortunately, SODA provides a series of [Data Transform functions](https://dev.socrata.com/docs/transforms/) which allow you to transform (or 'cast') the field to a different type.\n", "\n", "##### Note:\n", "- Subclauses in the WHERE clause are separated by AND instead of commas, to reflect that these are boolean expressions\n", "- OR could be used as well\n", "- at this time, I am not certain why I cannot convert level to a number using the to_number function. Stay tuned...\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "query = \"\"\"\n", "SELECT\n", " station_name,\n", " timestamp,\n", " date_extract_y(to_floating_timestamp(timestamp,'UTC')) as year,\n", " date_extract_m(to_floating_timestamp(timestamp,'UTC')) as month,\n", " date_extract_d(to_floating_timestamp(timestamp,'UTC')) as day,\n", " date_extract_hh(to_floating_timestamp(timestamp,'UTC')) as hour,\n", " date_extract_mm(to_floating_timestamp(timestamp,'UTC')) as minute,\n", " date_extract_ss(to_floating_timestamp(timestamp,'UTC')) as second,\n", " level\n", "LIMIT\n", " 10\n", "\"\"\"\n", "\n", "river_levels = run_query(domain, uuid_river_levels, query)\n", "river_levels" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### WHERE clause\n", "Just like in SQL, we can filter the records returned using the WHERE clause. There are a lot of powerful operators available, as per the [documentation](https://dev.socrata.com/docs/queries/where.html). **Don't hesitate to experiment!**\n", "\n", "##### Note:\n", "- Subclauses in the WHERE clause are separated by AND instead of commas, to reflect that these are boolean expressions\n", "- OR could be used as well" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "query = \"\"\"\n", "SELECT\n", " station_name,\n", " timestamp,\n", " date_extract_y(to_floating_timestamp(timestamp,'UTC')) as year,\n", " date_extract_m(to_floating_timestamp(timestamp,'UTC')) as month,\n", " date_extract_d(to_floating_timestamp(timestamp,'UTC')) as day,\n", " date_extract_hh(to_floating_timestamp(timestamp,'UTC')) as hour,\n", " date_extract_mm(to_floating_timestamp(timestamp,'UTC')) as minute,\n", " date_extract_ss(to_floating_timestamp(timestamp,'UTC')) as second,\n", " level\n", "WHERE\n", " station_name = 'Bow River at Calgary' AND\n", " year = 2004 AND\n", " (month between 4 AND 5 OR month between 9 AND 10) AND\n", " day = 1 AND\n", " hour = 0 AND\n", " minute = 0\n", "LIMIT\n", " 10\n", "\"\"\"\n", "\n", "river_levels = run_query(domain, uuid_river_levels, query)\n", "river_levels" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ORDER BY clause\n", "Just like in SQL, we can order the records returned using the ORDER BY clause. You can specify multiple fields in order to create a secondary order." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "query = \"\"\"\n", "SELECT\n", " station_name,\n", " timestamp,\n", " date_extract_y(to_floating_timestamp(timestamp,'UTC')) as year,\n", " date_extract_m(to_floating_timestamp(timestamp,'UTC')) as month,\n", " date_extract_d(to_floating_timestamp(timestamp,'UTC')) as day,\n", " date_extract_hh(to_floating_timestamp(timestamp,'UTC')) as hour,\n", " date_extract_mm(to_floating_timestamp(timestamp,'UTC')) as minute,\n", " date_extract_ss(to_floating_timestamp(timestamp,'UTC')) as second,\n", " level\n", "WHERE\n", " station_name = 'Bow River at Calgary' AND\n", " year = 2004 AND\n", " (month between 4 AND 5 OR month between 9 AND 10) AND\n", " (day = 1 OR day = 15) AND\n", " hour = 0 AND\n", " minute = 0\n", "ORDER BY\n", " month DESC,\n", " day DESC\n", "LIMIT\n", " 10\n", "\"\"\"\n", "\n", "river_levels = run_query(domain, uuid_river_levels, query)\n", "river_levels" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### GROUP BY clause\n", "This is where things get interesting, as you would know if you've gone through the [w3schools'scourse](https://www.w3schools.com/sql/default.asp) on SQL. As this [group by section](https://www.w3schools.com/sql/sql_groupby.asp) explains, grouping is most often used along with aggregate functions (COUNT, MAX, MIN, SUM, AVG).\n", "\n", "For these next examples, we will use the school enrollment dataset. The first example below will check how many high schools existed for each school authority over three years. \n", "\n", "You can think of grouping as collapsing the data that would be returned if you ran the query without the GROUP BY clause. All of the records that share the same values for all of the fields are aggregated (collapsed) into a single row. The fields that you are *not* grouping by thus need to have an aggregate function of some sort in the WHERE clause: hence COUNT(level), which will do exactly that.\n", "\n", "##### Note:\n", "- Observe how the aggregated field is automatically named " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "query = \"\"\"\n", "SELECT\n", " school_authority_name,\n", " school_year,\n", " COUNT(school_name)\n", "WHERE\n", " (school_year = \"2018_2019\" OR school_year = \"2017_2018\" OR school_year = \"2016_2017\" OR school_year = \"2015_2016\") AND\n", " grade_12 > 0\n", "GROUP BY\n", " school_authority_name,\n", " school_year\n", "ORDER BY\n", " school_authority_name,\n", " school_year\n", "LIMIT\n", " 60\n", "\"\"\"\n", " \n", "school_enrollment = run_query(domain, uuid_school_enrollment, query)\n", "school_enrollment" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also try the MAX, MIN, AVG, and SUM functions. In this case, let's only run the query on schools that have Grade 12 students.\n", "\n", "##### Note:\n", "- We will provide slightly more descriptive field aliases.\n", "- You can re-use a field to get multiple aggregations." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "query = \"\"\"\n", "SELECT\n", " school_authority_name,\n", " COUNT(school_name) as total_schools,\n", " MAX(grade_12) as max_grade_12,\n", " AVG(grade_12) as avg_grade_12,\n", " MIN(grade_12) as min_grade_12,\n", " SUM(grade_12) as total_grade12\n", "WHERE\n", " school_year = \"2018_2019\" AND\n", " grade_12 > 0\n", "GROUP BY\n", " school_authority_name\n", "LIMIT\n", " 50\n", "\"\"\"\n", " \n", "school_enrollment = run_query(domain, uuid_school_enrollment, query)\n", "school_enrollment" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### AN IMPORTANT DISTINCTION\n", "The data above begs the question of what the largest Grade 12 class in the city is! However, you do not have to use the MAX function to find this. Rather, you could simply use the SORT BY and the LIMIT function. This gets around the difficulty of having to either group or aggregate on all fields. If you grouped the above query by school_name, you would receive a maximum for each school. If you were to use an aggregate function instead, you would not get the right school name!\n", "\n", "This also allows me to quickly discuss the [NaN](https://www.python-course.eu/dealing_with_NaN_in_python.php) value, which can be though of like a null. If you do not include the WHERE clause in the query below, you will get a record with NaN," ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "query = \"\"\"\n", "SELECT\n", " COUNT(school_name) as total_schools,\n", " MAX(grade_12) as max_grade_12\n", "WHERE\n", " school_year = \"2018_2019\" AND\n", " grade_12 > 0\n", "LIMIT\n", " 1\n", "\"\"\"\n", " \n", "school_enrollment = run_query(domain, uuid_school_enrollment, query)\n", "school_enrollment" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n", "\n", "This notebook introduced Socrata and the SODA API. Up next is an [Introduction to Plotly](./2-introduction-to-plotly.ipynb)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 2 }