{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)\n", "\n", "\"Open" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Plotly - introduction\n", "\n", "by Gustaaf J Wehnes, Calgary Board of Education\n", "\n", "There are two major parts to data science: querying the data, and then interpreting the data. For the latter, few people like to look at tables. A picture is worth a thousand words, so let's *visualize* the data! We will use a brilliant library of graphing tools called [plotly](https://plotly.com/graphing-libraries/). Many of the visualizations that you see on the internet will have been built using this library!\n", "\n", "Again, we can just import the library here. You would need to install it locally on your device to run it there, but the install is not difficult. Note that plotly actually runs in JavaScript, and can thus be accessed through many other languages, including Java." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import plotly.express as px" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will again need our function for obtaining data and our UUIDs. That is just the nature of a Jupyter notebook; they need to be self-contained." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import requests\n", "import pandas as pd\n", "import io\n", "\n", "def run_query(domain, uuid, query):\n", " session = requests.Session()\n", " results = session.get(domain + uuid +\".csv?$query=\" + query)\n", " dataframe = pd.read_csv(io.StringIO(results.content.decode('utf-8')))\n", " return dataframe" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The Data\n", "\n", "Say that we are interested in how and when the Bow River tends to peak for the spring run-off. This would be extremely useful to know for the City of Calgary. You will likely remember the floods of 2013! Can we perhaps find patterns and thereby predict if another flood is coming?\n", "\n", "Below is a somewhat complex query. However, you should be able to see that the query is finding the maximum daily river level of the Bow River between April and October for all years since 2004.\n", "\n", "The tricky part of the query is in the SELECT statement. That is because we want to plot the river levels for each day as a continuum. It would be simplest if we translate the *date* of each measurement to the *day of year*. That way, the data sorts itself nicely. The first two lines in the SELECT clause do exactly that through a bit of fancy calculation. Strangely enough, there is no '+' operation in SODA, so I am subtracting a negative instead. Programmers have to get creative to work around obstacles!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "domain = \"https://data.calgary.ca/resource/\"\n", "uuid = \"5fdg-ifgr\"\n", "\n", "query = \"\"\"\n", "SELECT\n", " station_name,\n", " ((date_extract_m(to_floating_timestamp(timestamp,'UTC')) - 1) * 30) as temp,\n", " (temp - -date_extract_d(to_floating_timestamp(timestamp,'UTC'))) as day_of_year,\n", " date_extract_y(to_floating_timestamp(timestamp,'UTC')) as year,\n", " date_extract_m(to_floating_timestamp(timestamp,'UTC')) as month,\n", " date_extract_d(to_floating_timestamp(timestamp,'UTC')) as day,\n", " MAX(level) as level\n", "WHERE\n", " station_name = 'Bow River at Calgary' AND\n", " year > 2004 AND\n", " month between 4 AND 10\n", "GROUP BY\n", " station_name,\n", " temp,\n", " day_of_year,\n", " year,\n", " month,\n", " day\n", "ORDER BY\n", " year ASC,\n", " day_of_year ASC\n", "LIMIT\n", " 10000\n", "OFFSET\n", " 0\n", "\n", "\"\"\"\n", "\n", "river_levels = run_query(domain, uuid, query)\n", "river_levels" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The Visualization\n", "\n", "The plotly library has the ability to quickly graph multiple data series in line graphs: [Line Charts in Python](https://plotly.com/python/line-charts/).\n", "\n", "Note that a new object called `fig` is created, and that it takes in the dataset as the first argument. The x and the y axis are similarily mapped. The year is a third *dimension* in our data, so let's have the graph use different color lines for each year.\n", "\n", "Ready for some magic?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import plotly.express as px\n", "\n", "fig = px.line(river_levels, x=\"day_of_year\", y=\"level\", color='year')\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That is a colorful graph! Perhaps there is a little too much data in there, but it does show you around what time of year the Bow River crests. It also clearly demonstrates how extra-ordinary the floods of 2013 was.\n", "\n", "Note: the gaps in the 2013 data are likely from the station not reporting on those days due to the power outages that occured." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plot schools in Calgary on an open street map\n", "\n", "For some further shock and awe, let's use a different type of visualization - [maps](https://plotly.com/python/maps/).\n", "\n", "In this case, the query for the data will be very simple. The City of Calgary provides a dataset of all schools within its boundary, and includes the geographical location - i.e. the latitude and the longitude." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "uuid_school_locations = \"fd9t-tdn2\"\n", "\n", "query = \"\"\"\n", "SELECT\n", " *\n", "\"\"\"\n", "\n", "school_locations = run_query(domain, uuid_school_locations, query)\n", "school_locations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will now use the Mapbox class in plotly to visualize the data. Note that a new `fig` object is created, and that it takes in the dataset as the first argument. The `lat` argument receives the name of the column in the dataset that contains latitude values (in our case \"latitude\"). The `lon` argument similarly is set to 'longitude'. The other arguments should be quite obvious when we run the remaining code" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "figure1 = px.scatter_mapbox(school_locations, lat=\"latitude\", lon=\"longitude\",\n", " color_discrete_sequence=[\"green\"],\n", " zoom=10, height=600,\n", " hover_name=\"name\",\n", " hover_data=[\"type\", \"grades\", \"address_ab\"],)\n", "\n", "figure1.update_layout(mapbox_style=\"open-street-map\")\n", "figure1.update_layout(margin={\"r\":0,\"t\":0,\"l\":0,\"b\":0})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n", "\n", "This notebook introduced visualizations of Socrata using Plotly. Up next is [joining datasets](./3-joining-datasets.ipynb)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 2 }