Creating Scatter Plots: Visualizing Data with Python

Drawing a scatter chart in Python

Want to see the relationship between multiple pieces of data at a glance? Scatter plots are perfect for that, especially when you add different colors for different categories and trend lines to make even complex data easier to understand.

In this article, we'll show you step by step how to draw a scatter plot using the Python libraries matplotlib and numpy. We've also included practical tips for adding trend lines, applying styles, and more, so don't miss out!

What is a scatter plot?

Insights from scatter charts

A scatter plot is a powerful tool for visually representing the relationship between two variables. One variable goes on the x-axis and the other on the y-axis, so the position of each point represents a single observation.

With scatter plots, you can gain insights such as:

  • The correlation between variables (positive, negative, or no correlation).
  • Patterns and clusters in your data.
  • Outlier detection.

Drawing Scatter Plots with Python

The code below draws a scatter plot segmented by categorical data and adds a trend line as well.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import font_manager, rc

# Set the Korean font
rc('font', family='HCR Dotum')

# Generate random data
np.random.seed(42)
x = np.random.normal(0, 1, 100)
y = x * 0.5 + np.random.normal(0, 0.5, 100)
categories = np.random.choice(['A', 'B', 'C'], 100)

# Set the graph style
plt.style.use('ggplot')  # change to ggplot style
plt.figure(figsize=(10, 6))

# Create a scatterplot with different colors for each category
for category in np.unique(categories):
    mask = categories == category
    plt.scatter(x[mask], y[mask],
                label=category,
                alpha=0.6,
                s=100)

# Add a trendline
z = np.polyfit(x, y, 1)
p = np.poly1d(z)
plt.plot(x, p(x), "r--", alpha=0.8)

# Decorate the graph
plt.title("Scatter Plot with Trend Line", pad=20)
plt.xlabel("X Variable")
plt.ylabel("Y Variable")

# Add a grid
plt.grid(True, linestyle='--', alpha=0.7)

# Show the legend
plt.legend(title="Categories", loc='upper left')

# Set the axis range
plt.xlim(min(x)-0.5, max(x)+0.5)
plt.ylim(min(y)-0.5, max(y)+0.5)

# Save and display the graph
plt.savefig('scatter_plot.png', dpi=300, bbox_inches='tight')
plt.show()

Code commentary

  1. Importing libraries:
    • numpy: Generates the data and handles the numerical computation.
    • matplotlib.pyplot: The main tool for creating scatter plots.
  2. Korean font settings: rc('font', family='HCR Dotum')
    • Sets the font so that chart titles and labels can be displayed in Korean.
  3. Generate data:
        np.random.seed(42)
        x = np.random.normal(0, 1, 100)
        y = x * 0.5 + np.random.normal(0, 0.5, 100)
        categories = np.random.choice(['A', 'B', 'C'], 100)
    • x: 100 values drawn from a normal distribution with mean 0 and standard deviation 1.
    • y: x multiplied by 0.5, plus normally distributed noise with standard deviation 0.5.
    • categories: Randomly assigns each point to one of the categories A, B, or C.
  4. Create a scatter plot by category:
        for category in np.unique(categories):
            mask = categories == category
            plt.scatter(x[mask], y[mask], label=category, alpha=0.6, s=100)
    • Draws the points of each category (A, B, C) in a separate plt.scatter call, so each category gets its own color.
  5. Add a trendline:
        z = np.polyfit(x, y, 1)
        p = np.poly1d(z)
        plt.plot(x, p(x), "r--", alpha=0.8)
    • np.polyfit: Fits a degree-1 polynomial (a least-squares straight line) to x and y and returns its coefficients.
    • np.poly1d: Wraps those coefficients in a callable polynomial, which is then drawn as the trend line.
  6. Setting up styles and visual elements:
    • Title and X/Y axis labels:
        plt.title("Scatter Plot with Trend Line", pad=20)
        plt.xlabel("X Variable")
        plt.ylabel("Y Variable")
    • The legend (plt.legend) shows which color corresponds to which category.
  7. Saving and displaying:
        plt.savefig('scatter_plot.png', dpi=300, bbox_inches='tight')
        plt.show()
    • Saves the chart as a high-resolution image (scatter_plot.png, 300 dpi) and displays it on screen.
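
To make the trend line step more concrete, here is a minimal sketch that regenerates the same synthetic data and prints the coefficients that np.polyfit returns. The names slope and intercept are introduced here for illustration and do not appear in the script above.

```python
import numpy as np

# Same synthetic data as in the article's example
np.random.seed(42)
x = np.random.normal(0, 1, 100)
y = x * 0.5 + np.random.normal(0, 0.5, 100)

# With degree 1, polyfit returns [slope, intercept] (highest power first)
slope, intercept = np.polyfit(x, y, 1)

# poly1d turns the coefficients into a function we can evaluate at any x
p = np.poly1d([slope, intercept])

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
print(f"fitted value at x = 1: {p(1.0):.3f}")
```

Because y was generated as 0.5 times x plus noise, the fitted slope should land near 0.5 and the intercept near 0, which is a quick sanity check that the trend line reflects the data.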

Utilizing Scatter Plots

1. Analyze correlations between variables

  • Positive correlation: A pattern in which Y increases as X increases.
    • Example: When an increase in ad spend (X) results in an increase in revenue (Y).
  • Negative correlation: A pattern in which Y decreases as X increases.
    • Example: When an increase in price (X) results in a decrease in sales (Y).
  • No correlation: X and Y show no consistent relationship.
    • Example: Weather in a specific region and internet traffic consumed in that region.

Scatter plots give you an intuitive sense of correlations, which can be used to perform deeper data analysis.
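
One common way to put a number on the patterns described above is the Pearson correlation coefficient, which np.corrcoef computes. The sketch below uses made-up synthetic data; the 0.8 slope, the noise level, and the helper pearson_r are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is reproducible
x = rng.normal(0, 1, 500)
noise = rng.normal(0, 0.5, 500)

y_pos = 0.8 * x + noise          # Y tends to rise with X
y_neg = -0.8 * x + noise         # Y tends to fall as X rises
y_none = rng.normal(0, 1, 500)   # Y generated independently of X


def pearson_r(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]


r_pos = pearson_r(x, y_pos)
r_neg = pearson_r(x, y_neg)
r_none = pearson_r(x, y_none)

print(f"positive: {r_pos:.2f}, negative: {r_neg:.2f}, none: {r_none:.2f}")
```

Values close to +1 or -1 indicate a strong linear relationship, while values near 0 indicate little or no linear relationship. The coefficient only measures linear association, so it is worth looking at the scatter plot alongside it.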

2. Data distribution and cluster detection

  • Identify data clusters
    Scatter plots are great for seeing if your data is concentrated (clustered) in certain areas.
    • Example: When you plot customer age (X) against purchase frequency (Y), you might notice that customers in a certain age range buy more often.
  • Detect outliers
    Outliers deviate markedly from the pattern of the rest of the data. They are easy to detect visually because they appear as isolated points far from the main cloud in a scatter plot.
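
Visual inspection can be backed up with a simple numeric check. The sketch below is one possible approach, not the only one: it measures how far each point lies from a fitted trend line and flags points more than 3 standard deviations away. The data, the three injected outliers, and the threshold of 3 are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0, 1, 200)
y = 0.5 * x + rng.normal(0, 0.3, 200)

# Inject three obvious outliers (hypothetical values for illustration)
y[:3] += np.array([4.0, -4.0, 5.0])

# Vertical distance of each point from the fitted trend line
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Flag points whose residual is more than 3 standard deviations from the mean
z_scores = (residuals - residuals.mean()) / residuals.std()
outliers = np.abs(z_scores) > 3

print("outlier indices:", np.flatnonzero(outliers))
```

The boolean array outliers can then be used as a mask, for example plt.scatter(x[outliers], y[outliers], color='black'), to highlight the flagged points on top of the regular scatter plot.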

3. Visualize data classifications and categories

  • If your data is divided into multiple categories, a scatter plot is an effective way to represent it.
    • Example: A scatter plot of revenue (X) against customer satisfaction (Y), color-coded by product category (A, B, C), shows the characteristics of each category at a glance.
  • You can analyze data patterns by category to see if certain categories are performing better than others.
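
To compare categories numerically as well as visually, the same boolean-mask technique used for coloring the points can produce a small per-category summary. This is a minimal sketch using the article's synthetic data; the summary dictionary and its keys are hypothetical names chosen for this example.

```python
import numpy as np

# Same synthetic data as in the article's example
np.random.seed(42)
x = np.random.normal(0, 1, 100)
y = x * 0.5 + np.random.normal(0, 0.5, 100)
categories = np.random.choice(['A', 'B', 'C'], 100)

summary = {}
for category in np.unique(categories):
    mask = categories == category  # True for points in this category
    summary[str(category)] = {
        "count": int(mask.sum()),
        "mean_x": float(x[mask].mean()),
        "mean_y": float(y[mask].mean()),
    }

for category, stats in summary.items():
    print(category, stats)
```

Since the categories here are assigned at random, the per-category means should be similar; with real data, a clear gap between them suggests that one category behaves differently from the others.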

4. Regression analysis and predictive modeling

  • Add trend lines: Add trend lines to your scatterplots to better analyze the relationships between your data.
    • Example: Plot a scatter plot of residential neighborhoods (X) and home prices (Y), then add a trend line to visualize changes in home prices.
  • These trends are an important reference point for designing predictive models or making data-driven decisions.
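
As a rough illustration of how a trend line can feed into prediction, the sketch below fits a straight line to synthetic data, predicts Y for a few new X values, and computes R² to show how much of the variation the line explains. The data-generating formula (slope 2.0, intercept 1.0) and the x_new values are assumptions made up for this example.

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, 80)
# Hypothetical linear relationship plus noise
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, 80)

# Fit a straight line and wrap it in a callable polynomial
trend = np.poly1d(np.polyfit(x, y, 1))

# Predict Y for X values that were not in the data
x_new = np.array([2.5, 5.0, 7.5])
y_pred = trend(x_new)

# R^2: share of the variance in y explained by the fitted line
ss_res = np.sum((y - trend(x)) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print("predictions:", np.round(y_pred, 2))
print(f"R^2 = {r_squared:.3f}")
```

A straight-line fit is only a reasonable predictor when the scatter plot itself looks roughly linear, and predictions far outside the observed X range should be treated with caution.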

5. Marketing and performance analytics

  • Customer behavior analytics
    • Plot customer age (X) against number of purchases per month (Y) to see which age groups are most active.
  • Track performance metrics
    • By plotting the click-through rate (X) against the conversion rate (Y) of each ad campaign, you can show with data which ads are performing better.

Summary

In this post, we implemented scatter plotting in Python and learned how to visualize relationships and patterns in our data. Trend lines and categorization allowed us to analyze our data more clearly.

You too can use the code above to visualize a variety of data! Why not take your first steps into data analysis with Python? To take your Python visualization skills a step further, check out the post "How to draw and nest doughnut charts with Python Visualization"!
