Basic charts

Documentation: https://plotly.com/python/basic-charts/

This section highlights some of the basic charts that can be created with Plotly: scatter plots, bar charts, line charts, and histograms. Each type of chart is demonstrated with a simple example here, and discussed in more detail in later sections.

import plotly.express as px

Scatter plot

Use the gapminder dataset that ships with Plotly Express, keeping only the rows for 2002.

gap_data = px.data.gapminder()

just_2002 = gap_data.query('year == 2002')

just_2002
                 country  continent  year  lifeExp       pop    gdpPercap  iso_alpha  iso_num
10           Afghanistan       Asia  2002   42.129  25268405   726.734055        AFG        4
22               Albania     Europe  2002   75.651   3508512  4604.211737        ALB        8
34               Algeria     Africa  2002   70.994  31287142  5288.040382        DZA       12
46                Angola     Africa  2002   41.003  10866106  2773.287312        AGO       24
58             Argentina   Americas  2002   74.340  38331121  8797.640716        ARG       32
...                  ...        ...   ...      ...       ...          ...        ...      ...
1654             Vietnam       Asia  2002   73.017  80908147  1764.456677        VNM      704
1666  West Bank and Gaza       Asia  2002   72.370   3389578  4515.487575        PSE      275
1678         Yemen, Rep.       Asia  2002   60.308  18701257  2234.820827        YEM      887
1690              Zambia     Africa  2002   39.193  10595811  1071.613938        ZMB      894
1702            Zimbabwe     Africa  2002   39.989  11926563   672.038623        ZWE      716

142 rows × 8 columns

Suppose we want to look at the relationship between GDP per capita and life expectancy in 2002, using a scatter plot. We call the scatter function and specify the data frame and the columns to use for the x and y axes.

scatter_2002 = px.scatter(
	data_frame=just_2002,
	x='gdpPercap',
	y='lifeExp',
)
scatter_2002

Bar chart

Using the same 2002 dataset, make a population bar chart for five countries:

  1. United States

  2. Canada

  3. Italy

  4. Sweden

  5. Taiwan

As with the scatter plot, we call the bar function and specify the data frame and the columns to use for the x and y axes. First, though, we filter the 2002 data down to the five countries.

country_list = [ 'United States', 'Canada', 'Italy', 'Sweden', 'Taiwan' ]

bar_data = just_2002.query('country in @country_list')

bar_data
            country  continent  year  lifeExp        pop    gdpPercap  iso_alpha  iso_num
250          Canada   Americas  2002    79.77   31902268  33328.96507        CAN      124
778           Italy     Europe  2002    80.24   57926999  27968.09817        ITA      380
1474         Sweden     Europe  2002    80.04    8954175  29341.63093        SWE      752
1510         Taiwan       Asia  2002    76.99   22454239  23235.42329        TWN      158
1618  United States   Americas  2002    77.31  287675526  39097.09955        USA      840

bar_fig = px.bar(
	data_frame=bar_data,
	x='country',
	y='pop',
)
bar_fig
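
The @country_list syntax inside query refers to a Python variable in the surrounding scope. As a rough sketch, assuming a small hand-built frame with a few rows copied from the tables above (sample, via_query and via_isin are hypothetical names), the same filter can also be written with isin:

```python
import pandas as pd

# A few 2002 rows copied from the tables above (hypothetical helper frame).
sample = pd.DataFrame({
    'country': ['Afghanistan', 'Canada', 'Italy', 'Sweden', 'Albania'],
    'pop': [25268405, 31902268, 57926999, 8954175, 3508512],
})

country_list = ['United States', 'Canada', 'Italy', 'Sweden', 'Taiwan']

# '@' tells query() to look up country_list as a Python variable.
via_query = sample.query('country in @country_list')

# Equivalent boolean-mask filter using Series.isin.
via_isin = sample[sample['country'].isin(country_list)]
```

Both forms keep only the rows whose country appears in the list. Names in the list that are absent from the data are simply ignored.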

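Line chart

Plot the stock prices of several companies in a line chart. In this dataset every series starts at 1.0 on the first date, so the values are prices relative to that starting point rather than prices in dollars.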

df_stocks = px.data.stocks()
df_stocks
           date      GOOG      AAPL      AMZN        FB      NFLX      MSFT
0    2018-01-01  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000
1    2018-01-08  1.018172  1.011943  1.061881  0.959968  1.053526  1.015988
2    2018-01-15  1.032008  1.019771  1.053240  0.970243  1.049860  1.020524
3    2018-01-22  1.066783  0.980057  1.140676  1.016858  1.307681  1.066561
4    2018-01-29  1.008773  0.917143  1.163374  1.018357  1.273537  1.040708
...         ...       ...       ...       ...       ...       ...       ...
100  2019-12-02  1.216280  1.546914  1.425061  1.075997  1.463641  1.720717
101  2019-12-09  1.222821  1.572286  1.432660  1.038855  1.421496  1.752239
102  2019-12-16  1.224418  1.596800  1.453455  1.104094  1.604362  1.784896
103  2019-12-23  1.226504  1.656000  1.521226  1.113728  1.567170  1.802472
104  2019-12-30  1.213014  1.678000  1.503360  1.098475  1.540883  1.788185

105 rows × 7 columns

This example differs slightly from the previous ones: we want two lines on the same chart. We get them by passing a list of column names for the y axis, one column per line.

stock_fig = px.line(
	data_frame=df_stocks,
	x='date',
	y=['GOOG', 'AAPL']
)
stock_fig

Histogram

Plot the distribution of tips using a histogram. Here we specify only the x axis: the y axis is computed for us as the number of tips that fall in each bin.

tips_df = px.data.tips()
tips_df
     total_bill   tip     sex  smoker   day    time  size
0         16.99  1.01  Female      No   Sun  Dinner     2
1         10.34  1.66    Male      No   Sun  Dinner     3
2         21.01  3.50    Male      No   Sun  Dinner     3
3         23.68  3.31    Male      No   Sun  Dinner     2
4         24.59  3.61  Female      No   Sun  Dinner     4
...         ...   ...     ...     ...   ...     ...   ...
239       29.03  5.92    Male      No   Sat  Dinner     3
240       27.18  2.00  Female     Yes   Sat  Dinner     2
241       22.67  2.00    Male     Yes   Sat  Dinner     2
242       17.82  1.75    Male      No   Sat  Dinner     2
243       18.78  3.00  Female      No  Thur  Dinner     2

244 rows × 7 columns

tips_fig = px.histogram(
	data_frame=tips_df,
	x='tip',
)
tips_fig

Histograms vs bar charts

On the surface, bar charts and histograms look like similar representations of data, but they serve different purposes and are used in different contexts. Understanding the difference is important for communicating data insights effectively.

Bar charts are used to compare discrete categories or groups. In a bar chart, each category is represented by a bar, and the length or height of the bar corresponds to the value it represents. The bars are separated by spaces to emphasize that the categories are distinct and not related to each other in a quantitative manner. Bar charts are versatile and can represent counts, percentages, or other metrics associated with each category. They work both when the categories have no natural order and when they are ranked.

In contrast, histograms are used to display the distribution of a continuous variable over a set of intervals, known as bins. Unlike bar charts, the bars in a histogram touch each other to convey the continuous nature of the data. Histograms are valuable for showing the shape of the data distribution, such as whether it is skewed to the left or right and whether it has a single peak (unimodal) or multiple peaks (bimodal or multimodal), and for identifying outliers or unusual gaps in the data. Each bar represents the frequency or count of data points within a particular range. Bins are usually given equal widths for simplicity; when the widths differ, it is the area of each bar, not its height, that should represent the frequency.

The key differences between bar charts and histograms thus lie in the type of data they represent and the way they are constructed. Bar charts are suited to categorical data and emphasize comparison between categories, while histograms are designed for continuous data and show how a variable is distributed across intervals.
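
The distinction can be sketched in plain Python by computing what each chart would draw. The sketch below uses the ten day and tip values shown in the tips table above; the bin width of 1.0 is an assumption chosen for illustration, and all variable names are hypothetical.

```python
import math
from collections import Counter

# Values copied from the ten rows of the tips table shown above.
days = ['Sun', 'Sun', 'Sun', 'Sun', 'Sun', 'Sat', 'Sat', 'Sat', 'Sat', 'Thur']
tips = [1.01, 1.66, 3.50, 3.31, 3.61, 5.92, 2.00, 2.00, 1.75, 3.00]

# Bar chart input: one count per distinct category.
day_counts = Counter(days)

# Histogram input: one count per interval of a continuous variable.
bin_width = 1.0  # assumed bin width, chosen for illustration
first_bin = math.floor(min(tips))
last_bin = math.floor(max(tips))

# Create every bin in the range up front so that empty bins are kept.
bin_counts = {
    (float(lo), lo + bin_width): 0
    for lo in range(first_bin, last_bin + 1)
}
for tip in tips:
    lo = float(math.floor(tip))
    bin_counts[(lo, lo + bin_width)] += 1

print(dict(day_counts))  # {'Sun': 5, 'Sat': 4, 'Thur': 1}
print(bin_counts)
# {(1.0, 2.0): 3, (2.0, 3.0): 2, (3.0, 4.0): 4, (4.0, 5.0): 0, (5.0, 6.0): 1}
```

The empty bin from 4.0 to 5.0 is kept with a count of zero and would appear as a gap in the histogram. In the category counts, by contrast, a day that never occurs has no entry at all and gets no bar.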