{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)\n", " \n", "\"Open" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Shakespeare and Statistics\n", "\n", "![Shakespeare and a chart](images/shakespeare-n-stats.png)\n", "*Image from https://en.wikipedia.org/wiki/Droeshout_portrait*\n", "\n", "Can art and science be combined? Natural language processing allows us to use the same statistical skills you might learn in a science class, such as counting up members of a population and looking at their distribution, to gain insight into the written word. Here's an example of what you can do. Let's consider the following question:\n", "\n", "## What are the top 20 phrases in Shakespeare's Macbeth?\n", "\n", "Normally when we study Shakespeare we critically read his plays and study the plot, characters, and themes. While this is definitely interesting and useful, we can gain very different insights by taking a multidisciplinary approach.\n", "\n", "This is something we would probably never want to do if we had to do it by hand. Imagine getting out your clipboard, writing down every different word or phrase you come across and then counting how many times that same word or phrase reappears. Check out how quickly it can be done using Python code in this Jupyter notebook.\n", "\n", "## Loading the text\n", "\n", "There are many public domain works available at [Project Gutenberg](http://www.gutenberg.org). You can [search](http://www.gutenberg.org/ebooks) or [browse](http://www.gutenberg.org/catalog) to find works that you are interested in analysing.\n", "\n", "We are going to search for the play *Macbeth* by William Shakespeare. On the **Download This eBook** page we'll copy the `Plain Text UTF-8` link, then use the `Requests` Python library to download it into a variable called `macbeth`. We can then refer to it by using `macbeth` at any point from here on" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "text_link = 'http://www.gutenberg.org/files/1533/1533-0.txt'\n", "\n", "import requests\n", "r = requests.get(text_link) # get the online book file\n", "r.encoding = 'utf-8' # specify the type of text encoding in the file\n", "macbeth = r.text.split('***')[2] # get the part after the header\n", "macbeth = macbeth.replace(\"’\",\"'\").replace(\"“\",'\"').replace(\"”\",'\"') # replace any 'smart quotes'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For example, we can just print out the first 1000 letters to see that we've grabbed the correct document." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\r\n", "\r\n", "\r\n", "\r\n", "This etext was prepared by the PG Shakespeare Team,\r\n", "a team of about twenty Project Gutenberg volunteers.\r\n", "\r\n", "\r\n", "\r\n", "cover \r\n", "\r\n", "\r\n", "\r\n", "MACBETH\r\n", "\r\n", "by William Shakespeare\r\n", "\r\n", "Contents\r\n", "\r\n", "ACT I\r\n", "Scene I. An open Place.\r\n", "Scene II. A Camp near Forres.\r\n", "Scene III. A heath.\r\n", "Scene IV. Forres. A Room in the Palace.\r\n", "Scene V. Inverness. A Room in Macbeth's Castle.\r\n", "Scene VI. The same. Before the Castle.\r\n", "Scene VII. The same. A Lobby in the Castle.\r\n", "\r\n", "\r\n", "ACT II\r\n", "Scene I. Inverness. Court within the Castle.\r\n", "Scene II. The same.\r\n", "Scene III. The same.\r\n", "Scene IV. The same. Without the Castle.\r\n", "\r\n", "\r\n", "ACT III\r\n", "Scene I. Forres. A Room in the Palace.\r\n", "Scene II. The same. Another Room in the Palace.\r\n", "Scene III. The same. A Park or Lawn, with a gate leading to the Palace.\r\n", "Scene IV. The same. A Room of state in the Palace.\r\n", "Scene V. The heath.\r\n", "Scene VI. Forres. A Room in the Palace.\r\n", "\r\n", "\r\n", "ACT IV\r\n", "Scene I. A dark Cave. In the middle, a Cauldron Boiling.\r\n", "Scene II. Fife. A Room in Macduff's Castle.\r\n", "Scene III. \n" ] } ], "source": [ "print(macbeth[0:1000])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looks good! But that's a lot of reading to do. And a lot of phrases to count and keep track of. Here's where some Python libraries come into play.\n", "\n", "## Crunching the text\n", "\n", "`noun_phrases` will grab groups of words that have been identified as phrases containing nouns. This isn't always 100% correct. English can be a challenging language even for machines, and sometimes the files on [Project Gutenberg](http://www.gutenberg.org) contain errors that make it even harder, but it can usually do a pretty good job.\n", "\n", "This code cell installs two Python libraries for natural language processing, [textblob](https://textblob.readthedocs.io/en/dev) and [nltk](https://www.nltk.org), then downloads a [corpora data file](http://www.nltk.org/nltk_data) that will allow us to process the text.\n", "\n", "This may take a while to run. On the left you will see `In [*]:` while it is running. Once it finishes you should see the output printed on the screen." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install textblob --user\n", "!pip install nltk --user\n", "import nltk\n", "try:\n", " nltk.data.find('tokenizers/punkt')\n", "except LookupError:\n", " nltk.download('punkt')\n", "try:\n", " nltk.data.find('tokenizers/brown')\n", "except LookupError:\n", " nltk.download('brown')\n", "from textblob import TextBlob\n", "macbeth_phrases = TextBlob(macbeth).noun_phrases\n", "print(macbeth_phrases)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What you're seeing is no longer raw text. It's now a list of strings. How long is the list? Let's find out. `len` is short for \"length\", and it will tell you how many items are in any list." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(macbeth_phrases)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looks like we have over 3000 noun phrases. We don't yet know how many of them are repeated.\n", "\n", "## Counting everything up\n", "\n", "Here's where this starts to look like a real science project! Let's count the unique phrases and create a table of how many times they occur. They'll be sorted from most to least frequent." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "unique_texts = list(set(macbeth_phrases))\n", "text_counts = {text: macbeth_phrases.count(text) for text in unique_texts}\n", "sorted_texts = sorted(text_counts.items(), key=lambda x: x[1], reverse=True)\n", "macbeth_counts = pd.DataFrame(data=sorted_texts, columns=['text', 'count'])\n", "macbeth_counts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are a lot of them, so we'll use `.head(20)` which means show the top twenty. In these lists, the first item is always number 0." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "macbeth_counts.head(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There we have it! The top 20 phrases in Macbeth! Let's put those in a plot.\n", "\n", "## Plotting the results\n", "\n", "You can do almost any kind of plot or other visual representation of observations like this you could think of in Callysto. We'll use the `cufflinks` library to produce a horizontal bar plot, ordered from most to least frequent word." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import cufflinks as cf\n", "cf.go_offline()\n", "macbeth_counts_sorted = macbeth_counts.head(20).sort_values(by='count', ascending=True)\n", "macbeth_counts_sorted.iplot(kind='barh', x='text', y='count', title='Phrase Frequencies in MacBeth', xTitle='Count')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Surprise, surprise. *Macbeth* is the top phrase in Macbeth. Our main character is mentioned more than twice the number of times as the next most frequent phrase, *Macduff*, and more than three times the frequency that *Lady Macbeth* is mentioned.\n", "\n", "## Thinking about the results\n", "\n", "One of the first things we might realize from this simple investigation is the importance of proper nouns. Phrases containing the main characters occur far more frequently than other phrases, and the main character of the play is mentioned far more times than any other characters.\n", "\n", "Are these observations particular to Macbeth? Or to Shakespeare's plays? Or are they more universal?\n", "\n", "Now that we've gone through Macbeth, how hard could it be to look at other texts?\n", "\n", "Let's define a function to to download an ebook from a text url, pull out all the noun phrases, count them up, and plot them. We can then use this for any ebook text that we would like to visualize." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def word_frequency_plot(text_url):\n", " r = requests.get(text_url)\n", " r.encoding = 'utf-8'\n", " if 'gutenberg' in text_url:\n", " text = r.text.split('***')[2]\n", " else:\n", " text = r.text\n", " text = text.replace(\"’\",\"'\").replace(\"“\",'\"').replace(\"”\",'\"')\n", " phrases = TextBlob(text).noun_phrases\n", " unique_texts = list(set(phrases))\n", " text_counts = {text: phrases.count(text) for text in unique_texts}\n", " sorted_texts = sorted(text_counts.items(), key=lambda x: x[1], reverse=True)\n", " counts = pd.DataFrame(data=sorted_texts, columns=['text', 'count'])\n", " counts_sorted = counts.head(20).sort_values(by='count', ascending=True)\n", " counts_sorted.iplot(kind='barh', x='text', y='count', title='Phrase Frequencies', xTitle='Count')\n", "print('Word frequency plot function defined.')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Looking at Hamlet\n", "\n", "**Hamlet** can be found on [Project Gutenberg](http://www.gutenberg.org) under [EBook-No. 1524](http://www.gutenberg.org/ebooks/1524).\n", "\n", "Run the following code to download **Hamlet**, pull out all the noun phrases, count them up, and plot them out." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "word_frequency_plot('http://www.gutenberg.org/files/1524/1524-0.txt')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n", "\n", "Now that we have a function to plot word frequency, you can try this with other ebooks from [Project Gutenberg](http://www.gutenberg.org). You could also try running this on news articles or any other text, although there may be some tweaks required.\n", "\n", "However this example is only scratching the surface of natural language processing. Libraries such as [spaCy](https://spacy.io) can be used to identify [parts of speech](https://spacy.io/api/annotation#pos-tagging), [vaderSentiment](https://github.com/cjhutto/vaderSentiment) can categorize text based on its tone, and [textstat](https://github.com/shivam5992/textstat) can check the readability and complexity of text. Examples can be found [here](https://github.com/callysto/interesting-problems/blob/main/notebooks/analysing-text-statistics.ipynb).\n", "\n", "You could use these libraries to compare different writers or different times in history, see if you can trends in sentiment or [word usage](https://books.google.com/ngrams), or investigate changes in language and style." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 2 }