Decoding "Vinho Verde"

An Exploratory Data Analysis of Red Wine Quality

1. The Objective

In this project, I explored the physicochemical properties of Portuguese "Vinho Verde" red wine. The goal was to understand which chemical attributes—like alcohol content, acidity, or pH—actually drive the sensory quality scores (rated 0–10) assigned by human experts.

2. Initial Data Inspection & "The Messy Reality"

Before diving into charts, I had to understand the raw structure. Using df.info() and df.describe(), I identified:

Dataset Size: 1,599 initial observations.
Target Variable: quality (discrete integers).
Missing Values: df.isnull().sum() confirmed there were no missing entries, which is rare but great!
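The inspection steps above can be sketched roughly as follows. The small DataFrame here is a hand-made stand-in so the snippet runs on its own: the column names follow the dataset, but the values are illustrative only. On the real data you would load the CSV instead.

```python
import pandas as pd

# Hand-made stand-in for the wine data (illustrative values, not real rows).
df = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2],
    "pH": [3.51, 3.20, 3.26, 3.16],
    "quality": [5, 5, 6, 7],
})

df.info()                    # prints column dtypes and non-null counts
summary = df.describe()      # count, mean, std, min, quartiles, max per column
missing = df.isnull().sum()  # number of missing values in each column

print(summary)
print(missing)
```

A column whose entry in `missing` is above zero would need imputation or row removal before any further analysis.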

3. Data Cleaning: Ensuring Integrity

A critical step in any data project is removing duplicates to prevent bias in our correlations.

Observation: I used df.duplicated() to identify redundant rows and applied df.drop_duplicates(inplace=True) to ensure every data point in my analysis was unique. This is vital for maintaining the statistical validity of the findings.

# Count the duplicated rows (every repeat after the first occurrence), then drop them
n_duplicates = df.duplicated().sum()
print(f"Duplicate rows found: {n_duplicates}")
df.drop_duplicates(inplace=True)

4. Deep Dive: Key Insights from EDA

A. The Quality Distribution (Imbalance)

Using a bar plot of the score counts (df.quality.value_counts().plot(kind='bar')), I noticed the data is heavily imbalanced. Most wines fall into categories 5 and 6, while "Excellent" (8) or "Poor" (3) wines are quite rare.

Why it matters: Real-world data is rarely evenly distributed. With so few wines at the extremes, any pattern involving scores 3 or 8 rests on a handful of samples and should be read with caution.

Bar plot showing quality scores distribution
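To put a number on the imbalance rather than judging it by eye, the counts can be turned into shares. This is a minimal sketch: the `quality` Series below is made up to stand in for df["quality"] and does not reproduce the real counts.

```python
import pandas as pd

# Made-up quality scores standing in for df["quality"] (not the real counts).
quality = pd.Series([3, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 8], name="quality")

# Rows per score and fraction per score, both ordered by score.
counts = quality.value_counts().sort_index()
shares = quality.value_counts(normalize=True).sort_index()

# Combined share of the two middle categories.
middle_share = shares.loc[[5, 6]].sum()

print(counts)
print(f"Share of wines rated 5 or 6: {middle_share:.0%}")
```

Calling `sort_index()` before plotting also keeps the bars in score order rather than frequency order.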

B. The Chemical "DNA" (Correlation Analysis)

I generated a Heatmap to see how variables interact.

Insight: Alcohol showed a clear positive correlation with Quality.
Insight: Volatile Acidity showed a negative correlation with Quality, consistent with the chemical intuition that more acetic acid (a vinegar taste) leads to lower quality scores.

Technical Note: Using sns.pairplot(df) helped me check for multicollinearity, which occurs when two input features are strongly correlated with each other and therefore carry largely redundant information.
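A pair plot is a visual check. A numeric complement is to list the feature pairs whose correlation exceeds a threshold. The sketch below is only an illustration: the helper `highly_correlated_pairs` and the 0.8 cutoff are my own assumptions, and the data is randomly generated with a deliberately strong alcohol/density link, so the numbers say nothing about the real wines.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(features: pd.DataFrame, threshold: float = 0.8):
    """Return (col_a, col_b, r) for every column pair with |Pearson r| >= threshold."""
    corr = features.corr()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skips self-pairs
            r = float(corr.iloc[i, j])
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(r, 3)))
    return pairs


# Synthetic stand-in data: density is constructed to fall as alcohol rises.
rng = np.random.default_rng(0)
n = 300
alcohol = rng.normal(10.4, 1.0, n)
toy = pd.DataFrame({
    "alcohol": alcohol,
    "density": 0.9968 - 0.002 * (alcohol - 10.4) + rng.normal(0, 0.0005, n),
    "pH": rng.normal(3.3, 0.15, n),  # generated independently of the others
})

pairs = highly_correlated_pairs(toy, threshold=0.8)
print(pairs)
```

Any pair this returns is a candidate for dropping or combining one of the two features before modeling.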

# Generating a heatmap to find feature correlations
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.show()
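The heatmap shows every pairwise correlation at once. When the question is only "what moves with quality?", it can help to pull out that single column and sort it. A minimal sketch on synthetic data follows: the quality column is built by hand from alcohol (positive) and volatile acidity (negative) plus noise, so the ranking demonstrates the technique and is not a result from the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data, constructed purely for illustration.
rng = np.random.default_rng(0)
n = 500
alcohol = rng.normal(10.4, 1.0, n)
volatile_acidity = rng.normal(0.53, 0.18, n)
ph = rng.normal(3.3, 0.15, n)
score = (5.6 + 0.4 * (alcohol - 10.4)
         - 2.0 * (volatile_acidity - 0.53)
         + rng.normal(0, 0.5, n))

toy = pd.DataFrame({
    "alcohol": alcohol,
    "volatile acidity": volatile_acidity,
    "pH": ph,
    "quality": np.rint(score).astype(int),  # round to integer scores
})

# Correlation of each feature with quality, strongest positive first.
ranking = toy.corr()["quality"].drop("quality").sort_values(ascending=False)
print(ranking)
```

On the real DataFrame the same one-liner, `df.corr()["quality"].drop("quality").sort_values(ascending=False)`, gives the ranking behind the insights above.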

C. Alcohol vs. Quality (Box Plots)

To visualize the "Alcohol effect," I used a categorical box plot (sns.catplot with kind="box").

The Trend: As the quality score increases, the median alcohol percentage trends upward. This suggests that higher-alcohol wines, often described as having more "body" and "strength," tend to be preferred by the expert raters.
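The median line inside each box can also be computed directly, which makes the trend easy to verify. This is a sketch on hand-built values that rise by construction, so it shows the mechanics of the check and not the real medians.

```python
import pandas as pd

# Hand-built stand-in: three wines per score, with alcohol rising by construction.
rows = []
for q in range(3, 9):
    center = 9.0 + 0.5 * (q - 3)
    for offset in (-0.3, 0.0, 0.3):
        rows.append({"quality": q, "alcohol": center + offset})
toy = pd.DataFrame(rows)

# Median alcohol within each quality score (the center line of each box).
median_alcohol = toy.groupby("quality")["alcohol"].median()

print(median_alcohol)
print("Medians rise with quality:", median_alcohol.is_monotonic_increasing)
```

The matching plot call would be along the lines of `sns.catplot(data=df, x="quality", y="alcohol", kind="box")`.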
                

D. The pH vs. Alcohol Relationship

I used a Scatter Plot colored by quality to see if there was a "sweet spot" in the combination of pH and alcohol. This multidimensional view helps identify whether certain clusters of chemical properties result in higher-rated wines.

Scatter plot of pH and alcohol by quality
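One way to back up a visual "sweet spot" with numbers is to bin both axes and average the quality score in each cell. The sketch below is an assumption on my part, not something taken from the original notebook: the bin edges, band labels, and the eight rows are all made up for illustration.

```python
import pandas as pd

# Made-up rows standing in for the real pH, alcohol, and quality columns.
wines = pd.DataFrame({
    "pH":      [3.1, 3.2, 3.5, 3.6, 3.1, 3.2, 3.5, 3.6],
    "alcohol": [9.0, 9.5, 9.2, 9.8, 12.0, 12.5, 12.2, 12.8],
    "quality": [5, 5, 5, 6, 7, 7, 6, 6],
})

# Cut each axis into two bands; the bin edges are arbitrary illustrative choices.
wines["pH_band"] = pd.cut(wines["pH"], bins=[2.7, 3.3, 4.1],
                          labels=["low pH", "high pH"])
wines["alcohol_band"] = pd.cut(wines["alcohol"], bins=[8.0, 10.5, 15.0],
                               labels=["low alcohol", "high alcohol"])

# Mean quality in each pH band x alcohol band cell.
mean_quality = wines.pivot_table(values="quality", index="pH_band",
                                 columns="alcohol_band", aggfunc="mean",
                                 observed=False)
print(mean_quality)
```

A cell that stands out in this table marks the region of the scatter plot worth a closer look, provided it contains enough wines to be meaningful.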

5. Conclusion

Through this EDA, I've concluded that while wine quality is subjective, it is closely associated with specific chemical markers. Alcohol content and Sulphates appear to be the strongest positive correlates of quality, while Volatile Acidity is the main negative one. These are correlations, so they point to likely drivers without proving cause.

This analysis serves as a foundation for feature selection in a future classification model to predict wine quality automatically.