{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Pandas\n", "\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Contents\n", "\n", "- [Pandas](#Pandas) \n", " - [Overview](#Overview) \n", " - [Series](#Series) \n", " - [DataFrames](#DataFrames) \n", " - [On-Line Data Sources](#On-Line-Data-Sources) \n", " - [Exercises](#Exercises) \n", " - [Solutions](#Solutions) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In addition to what’s in Anaconda, this lecture will need the following libraries:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide-output": false }, "outputs": [], "source": [ "!pip install --upgrade pandas-datareader" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview\n", "\n", "[Pandas](http://pandas.pydata.org/) is a package of fast, efficient data analysis tools for Python.\n", "\n", "Its popularity has surged in recent years, coincident with the rise\n", "of fields such as data science and machine learning.\n", "\n", "Here’s a popularity comparison over time against STATA, SAS, and [dplyr](https://dplyr.tidyverse.org/) courtesy of Stack Overflow Trends\n", "\n", "![https://python-programming.quantecon.org/_static/lecture_specific/pandas/pandas_vs_rest.png](https://python-programming.quantecon.org/_static/lecture_specific/pandas/pandas_vs_rest.png)\n", "\n", " \n", "Just as [NumPy](http://www.numpy.org/) provides the basic array data type plus core array operations, pandas\n", "\n", "1. defines fundamental structures for working with data and \n", "1. endows them with methods that facilitate operations such as \n", " - reading in data \n", " - adjusting indices \n", " - working with dates and time series \n", " - sorting, grouping, re-ordering and general data munging [1] \n", " - dealing with missing values, etc., etc. 
\n", "\n", "\n", "More sophisticated statistical functionality is left to other packages, such\n", "as [statsmodels](http://www.statsmodels.org/) and [scikit-learn](http://scikit-learn.org/), which work with pandas data structures.\n", "\n", "This lecture will provide a basic introduction to pandas.\n", "\n", "Throughout the lecture, we will assume that the following imports have taken\n", "place" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide-output": false }, "outputs": [], "source": [ "%matplotlib inline\n", "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "plt.rcParams[\"figure.figsize\"] = [10,8] # Set default figure size\n", "import requests" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Series\n", "\n", "\n", "\n", "Two important data types defined by pandas are `Series` and `DataFrame`.\n", "\n", "You can think of a `Series` as a “column” of data, such as a collection of observations on a single variable.\n", "\n", "A `DataFrame` is an object for storing related columns of data.\n", "\n", "Let’s start with Series" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide-output": false }, "outputs": [], "source": [ "s = pd.Series(np.random.randn(4), name='daily returns')\n", "s" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here you can imagine the indices `0, 1, 2, 3` as indexing four listed\n", "companies, and the values being daily returns on their shares.\n", "\n", "Pandas `Series` are built on top of NumPy arrays and support many similar\n", "operations" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide-output": false }, "outputs": [], "source": [ "s * 100" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide-output": false }, "outputs": [], "source": [ "np.abs(s)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But `Series` provide more than NumPy arrays.\n", "\n", "Not only do they have some 
additional (statistically oriented) methods" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide-output": false }, "outputs": [], "source": [ "s.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But their indices are more flexible" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide-output": false }, "outputs": [], "source": [ "s.index = ['AMZN', 'AAPL', 'MSFT', 'GOOG']\n", "s" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Viewed in this way, `Series` are like fast, efficient Python dictionaries\n", "(with the restriction that the items in the dictionary all have the same\n", "type—in this case, floats).\n", "\n", "In fact, you can use much of the same syntax as Python dictionaries" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide-output": false }, "outputs": [], "source": [ "s['AMZN']" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide-output": false }, "outputs": [], "source": [ "s['AMZN'] = 0\n", "s" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide-output": false }, "outputs": [], "source": [ "'AAPL' in s" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## DataFrames\n", "\n", "\n", "\n", "While a `Series` is a single column of data, a `DataFrame` is several columns, one for each variable.\n", "\n", "In essence, a `DataFrame` in pandas is analogous to a (highly optimized) Excel spreadsheet.\n", "\n", "Thus, it is a powerful tool for representing and analyzing data that are naturally organized into rows and columns, often with descriptive indexes for individual rows and individual columns.\n", "\n", "Here’s the content of `test_pwt.csv`" ] }, { "cell_type": "markdown", "metadata": { "hide-output": false }, "source": [ "```text\n", "\"country\",\"country isocode\",\"year\",\"POP\",\"XRAT\",\"tcgdp\",\"cc\",\"cg\"\n", 
"\"Argentina\",\"ARG\",\"2000\",\"37335.653\",\"0.9995\",\"295072.21869\",\"75.716805379\",\"5.5788042896\"\n", "\"Australia\",\"AUS\",\"2000\",\"19053.186\",\"1.72483\",\"541804.6521\",\"67.759025993\",\"6.7200975332\"\n", "\"India\",\"IND\",\"2000\",\"1006300.297\",\"44.9416\",\"1728144.3748\",\"64.575551328\",\"14.072205773\"\n", "\"Israel\",\"ISR\",\"2000\",\"6114.57\",\"4.07733\",\"129253.89423\",\"64.436450847\",\"10.266688415\"\n", "\"Malawi\",\"MWI\",\"2000\",\"11801.505\",\"59.543808333\",\"5026.2217836\",\"74.707624181\",\"11.658954494\"\n", "\"South Africa\",\"ZAF\",\"2000\",\"45064.098\",\"6.93983\",\"227242.36949\",\"72.718710427\",\"5.7265463933\"\n", "\"United States\",\"USA\",\"2000\",\"282171.957\",\"1\",\"9898700\",\"72.347054303\",\"6.0324539789\"\n", "\"Uruguay\",\"URY\",\"2000\",\"3219.793\",\"12.099591667\",\"25255.961693\",\"78.978740282\",\"5.108067988\"\n", "```\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Supposing you have this data saved as `test_pwt.csv` in the present working directory (type `%pwd` in Jupyter to see what this is), it can be read in as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide-output": false }, "outputs": [], "source": [ "df = pd.read_csv('https://raw.githubusercontent.com/QuantEcon/lecture-python-programming/master/source/_static/lecture_specific/pandas/data/test_pwt.csv')\n", "type(df)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide-output": false }, "outputs": [], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can select particular rows using standard Python array slicing notation" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide-output": false }, "outputs": [], "source": [ "df[2:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To select columns, we can pass a list containing the names of the desired columns represented as strings" ] }, { "cell_type": "code", 
"execution_count": null, "metadata": { "hide-output": false }, "outputs": [], "source": [ "df[['country', 'tcgdp']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To select both rows and columns using integers, the `iloc` attribute should be used with the format `.iloc[rows, columns]`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide-output": false }, "outputs": [], "source": [ "df.iloc[2:5, 0:4]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To select rows and columns using a mixture of integers and labels, the `loc` attribute can be used in a similar way" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide-output": false }, "outputs": [], "source": [ "df.loc[df.index[2:5], ['country', 'tcgdp']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let’s imagine that we’re only interested in population (`POP`) and total GDP (`tcgdp`).\n", "\n", "One way to strip the data frame `df` down to only these variables is to overwrite the dataframe using the selection method described above" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide-output": false }, "outputs": [], "source": [ "df = df[['country', 'POP', 'tcgdp']]\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here the index `0, 1,..., 7` is redundant because we can use the country names as an index.\n", "\n", "To do this, we set the index to be the `country` variable in the dataframe" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide-output": false }, "outputs": [], "source": [ "df = df.set_index('country')\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let’s give the columns slightly better names" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide-output": false }, "outputs": [], "source": [ "df.columns = 'population', 'total GDP'\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Population is in 
thousands, let’s revert to single units" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide-output": false }, "outputs": [], "source": [ "df['population'] = df['population'] * 1e3\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we’re going to add a column showing real GDP per capita, multiplying by 1,000,000 as we go because total GDP is in millions" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide-output": false }, "outputs": [], "source": [ "df['GDP percap'] = df['total GDP'] * 1e6 / df['population']\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One of the nice things about pandas `DataFrame` and `Series` objects is that they have methods for plotting and visualization that work through Matplotlib.\n", "\n", "For example, we can easily generate a bar plot of GDP per capita" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide-output": false }, "outputs": [], "source": [ "ax = df['GDP percap'].plot(kind='bar')\n", "ax.set_xlabel('country', fontsize=12)\n", "ax.set_ylabel('GDP per capita', fontsize=12)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At the moment the data frame is ordered alphabetically on the countries—let’s change it to GDP per capita" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide-output": false }, "outputs": [], "source": [ "df = df.sort_values(by='GDP percap', ascending=False)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plotting as before now yields" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide-output": false }, "outputs": [], "source": [ "ax = df['GDP percap'].plot(kind='bar')\n", "ax.set_xlabel('country', fontsize=12)\n", "ax.set_ylabel('GDP per capita', fontsize=12)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## On-Line Data Sources\n", "\n", "\n", "\n", "Python makes it 
straightforward to query online databases programmatically.\n", "\n", "An important database for economists is [FRED](https://research.stlouisfed.org/fred2/) — a vast collection of time series data maintained by the St. Louis Fed.\n", "\n", "For example, suppose that we are interested in the [unemployment rate](https://research.stlouisfed.org/fred2/series/UNRATE).\n", "\n", "Via FRED, the entire series for the US civilian unemployment rate can be downloaded directly by entering\n", "this URL into your browser (note that this requires an internet connection)" ] }, { "cell_type": "markdown", "metadata": { "hide-output": false }, "source": [ "```text\n", "https://research.stlouisfed.org/fred2/series/UNRATE/downloaddata/UNRATE.csv\n", "```\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(Equivalently, click here: [https://research.stlouisfed.org/fred2/series/UNRATE/downloaddata/UNRATE.csv](https://research.stlouisfed.org/fred2/series/UNRATE/downloaddata/UNRATE.csv))\n", "\n", "This request returns a CSV file, which will be handled by your default application for this class of files.\n", "\n", "Alternatively, we can access the CSV file from within a Python program.\n", "\n", "This can be done with a variety of methods.\n", "\n", "We start with a relatively low-level method and then return to pandas." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Accessing Data with requests\n", "\n", "\n", "\n", "One option is to use [requests](https://requests.readthedocs.io/en/master/), a widely used third-party Python library for requesting data over the Internet.\n", "\n", "To begin, try the following code on your computer" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide-output": false }, "outputs": [], "source": [ "r = requests.get('http://research.stlouisfed.org/fred2/series/UNRATE/downloaddata/UNRATE.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If there’s no error message, then the request went through (you can confirm that the server returned the data by checking that `r.status_code` is `200`).\n", "\n", "If you do get an error, then there are two likely causes\n", "\n", "1. You are not connected to the Internet (hopefully, this isn’t the case). \n", "1. Your machine is accessing the Internet through a proxy server, and Python isn’t aware of this. \n", "\n", "\n", "In the second case, you can either\n", "\n", "- switch to another machine \n", "- solve your proxy problem by reading [the documentation](https://requests.readthedocs.io/en/master/) \n", "\n", "\n", "Assuming that all is working, you can now repeat the request, decode the response body and split it into a list of lines called `source`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide-output": false }, "outputs": [], "source": [ "url = 'http://research.stlouisfed.org/fred2/series/UNRATE/downloaddata/UNRATE.csv'\n", "source = requests.get(url).content.decode().split(\"\\n\")\n", "source[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide-output": false }, "outputs": [], "source": [ "source[1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide-output": false }, "outputs": [], "source": [ "source[2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We could now write some additional code to parse this text and store it 
as an array.\n", "\n", "But this is unnecessary, since pandas’ `read_csv` function can handle the task for us.\n", "\n", "We use `index_col=0` together with `parse_dates=True` so that pandas uses the first column as the index and parses it as dates, allowing for simple date filtering" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide-output": false }, "outputs": [], "source": [ "data = pd.read_csv(url, index_col=0, parse_dates=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data has been read into a pandas DataFrame called `data` that we can now manipulate in the usual way" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide-output": false }, "outputs": [], "source": [ "type(data)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide-output": false }, "outputs": [], "source": [ "data.head() # A useful method to get a quick look at a data frame" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide-output": false }, "outputs": [], "source": [ "pd.set_option('display.precision', 1) # Show one decimal place in displayed output\n", "data.describe() # Your output might differ slightly" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also plot the unemployment rate from 2006 to 2012 as follows" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide-output": false }, "outputs": [], "source": [ "ax = data['2006':'2012'].plot(title='US Unemployment Rate', legend=False)\n", "ax.set_xlabel('year', fontsize=12)\n", "ax.set_ylabel('%', fontsize=12)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that pandas can read many other file types.\n", "\n", "Pandas has [a wide variety](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) of top-level functions that we can use to read Excel, JSON and Parquet files, or to query a database server directly." 
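] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a minimal sketch of two of these alternatives, the next cell round-trips a small frame through JSON and through an in-memory SQLite database.\n", "\n", "The frame `small` and the table name `pwt` are illustrative assumptions; the two population figures are copied from `test_pwt.csv` above." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide-output": false }, "outputs": [], "source": [ "import io\n", "import sqlite3\n", "import pandas as pd\n", "\n", "# Illustrative frame; the values are taken from test_pwt.csv\n", "small = pd.DataFrame({'country': ['Argentina', 'Australia'],\n", "                      'POP': [37335.653, 19053.186]})\n", "\n", "# JSON: write to a string, then read it back from an in-memory buffer\n", "json_text = small.to_json(orient='records')\n", "from_json = pd.read_json(io.StringIO(json_text))\n", "\n", "# SQL: write to an in-memory SQLite database, then query it\n", "conn = sqlite3.connect(':memory:')\n", "small.to_sql('pwt', conn, index=False)\n", "from_sql = pd.read_sql('SELECT * FROM pwt', conn)\n", "conn.close()\n", "\n", "from_sql" 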
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using pandas_datareader to Access Data\n", "\n", "\n", "\n", "A related library called pandas_datareader (installed at the top of this lecture) gives programmatic access to many data sources straight from the Jupyter notebook.\n", "\n", "While some sources require an access key, many of the most important (e.g., FRED, [OECD](https://data.oecd.org/), [EUROSTAT](https://ec.europa.eu/eurostat/data/database) and the World Bank) are free to use.\n", "\n", "For now let’s work through one example of downloading and plotting data, this\n", "time from the World Bank.\n", "\n", "The World Bank [collects and organizes data](http://data.worldbank.org/indicator) on a huge range of indicators.\n", "\n", "For example, [here’s](http://data.worldbank.org/indicator/GC.DOD.TOTL.GD.ZS/countries) some data on government debt as a percentage of GDP.\n", "\n", "The next code example fetches the data for you and plots time series for the US and Australia" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide-output": false }, "outputs": [], "source": [ "from pandas_datareader import wb\n", "\n", "# Download the indicator, then reshape so that each country becomes a column\n", "govt_debt = wb.download(indicator='GC.DOD.TOTL.GD.ZS', country=['US', 'AU'], start=2005, end=2016).stack().unstack(0)\n", "# Drop the last index level so that only the year labels the rows\n", "ind = govt_debt.index.droplevel(-1)\n", "govt_debt.index = ind\n", "ax = govt_debt.plot(lw=2)\n", "ax.set_xlabel('year', fontsize=12)\n", "plt.title(\"Government Debt to GDP (%)\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The [documentation](https://pandas-datareader.readthedocs.io/en/latest/index.html) provides more details on how to access various data sources." 
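] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The reshaping chain in the World Bank example, `.stack().unstack(0)` followed by `droplevel(-1)`, can be tried offline on a small hand-made frame.\n", "\n", "This is only a sketch: it assumes the download is indexed by (country, year) with one column per indicator, and the numbers below are made up for illustration." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide-output": false }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "# Made-up values, laid out the way the download is assumed to be:\n", "# a (country, year) MultiIndex and one column per indicator\n", "idx = pd.MultiIndex.from_product([['Australia', 'United States'], ['2015', '2016']],\n", "                                 names=['country', 'year'])\n", "raw = pd.DataFrame({'GC.DOD.TOTL.GD.ZS': [38.0, 40.0, 97.0, 99.0]}, index=idx)\n", "\n", "# stack() moves the indicator name into the index; unstack(0) moves country to the columns\n", "wide = raw.stack().unstack(0)\n", "\n", "# Drop the indicator level so that only the year labels the rows\n", "wide.index = wide.index.droplevel(-1)\n", "wide" 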
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercises\n", "\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 1\n", "\n", "With these imports:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide-output": false }, "outputs": [], "source": [ "import datetime as dt\n", "from pandas_datareader import data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Write a program to calculate the percentage price change over 2019 for the following shares:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide-output": false }, "outputs": [], "source": [ "ticker_list = {'INTC': 'Intel',\n", "               'MSFT': 'Microsoft',\n", "               'IBM': 'IBM',\n", "               'BHP': 'BHP',\n", "               'TM': 'Toyota',\n", "               'AAPL': 'Apple',\n", "               'AMZN': 'Amazon',\n", "               'BA': 'Boeing',\n", "               'QCOM': 'Qualcomm',\n", "               'KO': 'Coca-Cola',\n", "               'GOOG': 'Google',\n", "               'SNE': 'Sony',\n", "               'PTR': 'PetroChina'}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here’s the first part of the program" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide-output": false }, "outputs": [], "source": [ "def read_data(ticker_list,\n", "              start=dt.datetime(2019, 1, 2),\n", "              end=dt.datetime(2019, 12, 31)):\n", "    \"\"\"\n", "    This function reads in closing price data from Yahoo\n", "    for each tick in the ticker_list.\n", "    \"\"\"\n", "    ticker = pd.DataFrame()\n", "\n", "    for tick in ticker_list:\n", "        prices = data.DataReader(tick, 'yahoo', start, end)\n", "        closing_prices = prices['Close']\n", "        ticker[tick] = closing_prices\n", "\n", "    return ticker\n", "\n", "ticker = read_data(ticker_list)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Complete the program to plot the result as a bar graph like this one:\n", "\n", 
"![https://python-programming.quantecon.org/_static/lecture_specific/pandas/pandas_share_prices.png](https://python-programming.quantecon.org/_static/lecture_specific/pandas/pandas_share_prices.png)\n", "\n", " \n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 2\n", "\n", "Using the method `read_data` introduced in [Exercise 1](#pd-ex1), write a program to obtain year-on-year percentage change for the following indices:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide-output": false }, "outputs": [], "source": [ "indices_list = {'^GSPC': 'S&P 500',\n", " '^IXIC': 'NASDAQ',\n", " '^DJI': 'Dow Jones',\n", " '^N225': 'Nikkei'}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Complete the program to show summary statistics and plot the result as a time series graph like this one:\n", "\n", "![https://python-programming.quantecon.org/_static/lecture_specific/pandas/pandas_indices_pctchange.png](https://python-programming.quantecon.org/_static/lecture_specific/pandas/pandas_indices_pctchange.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Solutions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 1\n", "\n", "There are a few ways to approach this problem using Pandas to calculate\n", "the percentage change.\n", "\n", "First, you can extract the data and perform the calculation such as:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide-output": false }, "outputs": [], "source": [ "p1 = ticker.iloc[0] #Get the first set of prices as a Series\n", "p2 = ticker.iloc[-1] #Get the last set of prices as a Series\n", "price_change = (p2 - p1) / p1 * 100\n", "price_change" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively you can use an inbuilt method `pct_change` and configure it to\n", "perform the correct calculation using `periods` argument." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide-output": false }, "outputs": [], "source": [ "change = ticker.pct_change(periods=len(ticker)-1, axis='rows')*100\n", "price_change = change.iloc[-1]\n", "price_change" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then to plot the chart" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide-output": false }, "outputs": [], "source": [ "price_change = price_change.sort_values()\n", "price_change = price_change.rename(index=ticker_list)\n", "fig, ax = plt.subplots(figsize=(10,8))\n", "ax.set_xlabel('stock', fontsize=12)\n", "ax.set_ylabel('percentage change in price', fontsize=12)\n", "price_change.plot(kind='bar', ax=ax)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 2\n", "\n", "Following the work you did in [Exercise 1](#pd-ex1), you can query the data using `read_data` by updating the start and end dates accordingly." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide-output": false }, "outputs": [], "source": [ "indices_data = read_data(\n", "    indices_list,\n", "    start=dt.datetime(1928, 1, 2),\n", "    end=dt.datetime(2020, 12, 31)\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, extract the first and last prices of each year as `Series` and calculate the yearly returns:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide-output": false }, "outputs": [], "source": [ "yearly_returns = pd.DataFrame()\n", "\n", "for index, name in indices_list.items():\n", "    p1 = indices_data.groupby(indices_data.index.year)[index].first() # First price of each year (a Series)\n", "    p2 = indices_data.groupby(indices_data.index.year)[index].last() # Last price of each year (a Series)\n", "    returns = (p2 - p1) / p1\n", "    yearly_returns[name] = returns\n", "\n", "yearly_returns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, 
you can obtain summary statistics by using the method `describe`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide-output": false }, "outputs": [], "source": [ "yearly_returns.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, to plot the chart" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide-output": false }, "outputs": [], "source": [ "fig, axes = plt.subplots(2, 2, figsize=(10, 8))\n", "\n", "for iter_, ax in enumerate(axes.flatten()): # Flatten 2-D array to 1-D array\n", " index_name = yearly_returns.columns[iter_] # Get index name per iteration\n", " ax.plot(yearly_returns[index_name]) # Plot pct change of yearly returns per index\n", " ax.set_ylabel(\"percent change\", fontsize = 12)\n", " ax.set_title(index_name)\n", "\n", "plt.tight_layout()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

[1] Wikipedia defines munging as cleaning data from one raw form into a structured, purged one." ] } ], "metadata": { "date": 1614096163.7640731, "filename": "pandas.md", "kernelspec": { "display_name": "Python", "language": "python3", "name": "python3" }, "title": "Pandas" }, "nbformat": 4, "nbformat_minor": 4 }