vivek8849/Python_data_science

Overview

Welcome to my analysis of the data job market, focusing on data scientist roles. This project was created out of a desire to navigate and understand the job market more effectively. It delves into the top-paying and in-demand skills to help find optimal job opportunities for data scientists.

The Questions

Below are the questions I want to answer in my project:

  1. What are the skills most in demand for the top 3 most popular data roles?
  2. How are in-demand skills trending for Data Scientist?
  3. How well do jobs and skills pay for Data Scientist?
  4. What are the optimal skills for data scientists to learn? (High Demand AND High Paying)

Tools I Used

For my deep dive into the data scientist job market, I harnessed the power of several key tools:

  • Python: The backbone of my analysis, allowing me to analyze the data and find critical insights. I also used the following Python libraries:
    • Pandas Library: This was used to analyze the data.
    • Matplotlib Library: This was used to visualize the data.
    • Seaborn Library: Helped me create more advanced visuals.
  • Jupyter Notebooks: The tool I used to run my Python scripts which let me easily include my notes and analysis.
  • Visual Studio Code: My go-to for executing my Python scripts.
  • Git & GitHub: Essential for version control and sharing my Python code and analysis, ensuring collaboration and project tracking.

Data Preparation and Cleanup

This section outlines the steps taken to prepare the data for analysis, ensuring accuracy and usability.

Import & Clean Up Data

I start by importing necessary libraries and loading the dataset, followed by initial data cleaning tasks to ensure data quality.

# Importing Libraries
import ast
import pandas as pd
import seaborn as sns
from datasets import load_dataset
import matplotlib.pyplot as plt  

# Loading Data
dataset = load_dataset('lukebarousse/data_jobs')
df = dataset['train'].to_pandas()

# Data Cleanup
df['job_posted_date'] = pd.to_datetime(df['job_posted_date'])
df['job_skills'] = df['job_skills'].apply(lambda x: ast.literal_eval(x) if pd.notna(x) else x)
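
As a small illustration of why the `job_skills` cleanup matters, the sketch below applies the same conversion to a toy Series. The sample values are made up for illustration and are not taken from the dataset; the assumption is that the raw column stores each skill list as a string.

```python
import ast
import pandas as pd

# Toy stand-in for the raw job_skills column (values are made up for illustration).
raw = pd.Series(["['python', 'sql']", None, "['r']"], dtype=object, name="job_skills")

# Same conversion as in the cleanup step: parse the string into a list,
# and leave missing values untouched.
parsed = raw.apply(lambda x: ast.literal_eval(x) if pd.notna(x) else x)

# Before: each entry is one string. After: each entry is a real Python list.
print(type(raw[0]).__name__, "->", type(parsed[0]).__name__)
print(parsed.tolist())
```

Once the entries are real lists, pandas methods such as `explode` can split them into one row per skill.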

Filter Germany Jobs

To focus my analysis on the German job market, I filter the dataset down to roles based in Germany.

df_DE = df[df['job_country'] == 'Germany']

The Analysis

Each Jupyter notebook for this project investigates a specific aspect of the data job market. Here's how I approached each question:

1. What are the most demanded skills for the top 3 most popular data roles?

To find the most demanded skills for the top 3 most popular data roles, I first identified the three job titles with the most postings and then took the top 5 skills for each of them. The result highlights the most popular job titles and their top skills, showing which skills I should pay attention to depending on the role I'm targeting.

View my notebook with detailed steps here: 2_Skill_Demand.
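
The plotting code in this section relies on a `df_skills_perc` table with `skill_count` and `skill_percent` columns. The notebook has the actual steps; the sketch below shows one way such a table could be built, using a toy DataFrame with made-up postings. The intermediate names (`df_toy`, `df_exploded`, `df_skills_count`, `df_totals`, `jobs_total`) are assumptions for illustration.

```python
import pandas as pd

# Toy postings; titles and skills are made up for illustration only.
df_toy = pd.DataFrame({
    "job_title_short": ["Data Analyst", "Data Analyst", "Data Scientist", "Data Scientist"],
    "job_skills": [["sql", "excel"], ["sql"], ["python", "sql"], ["python"]],
})

# One row per (posting, skill) pair.
df_exploded = df_toy.explode("job_skills")

# How often each skill appears for each job title.
df_skills_count = (
    df_exploded.groupby(["job_title_short", "job_skills"])
    .size()
    .reset_index(name="skill_count")
)

# Total postings per job title, used as the denominator.
df_totals = df_toy.groupby("job_title_short").size().reset_index(name="jobs_total")

# Share of postings for a title that mention each skill, in percent.
df_skills_perc = df_skills_count.merge(df_totals, on="job_title_short")
df_skills_perc["skill_percent"] = (
    100 * df_skills_perc["skill_count"] / df_skills_perc["jobs_total"]
)
df_skills_perc = df_skills_perc.sort_values(
    ["job_title_short", "skill_count"], ascending=[True, False]
)

print(df_skills_perc)
```

Dividing by the number of postings per title, rather than by the number of skill mentions, is what lets the result be read as "share of job postings that request this skill".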

Visualize Data

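# One subplot per job title, stacked vertically
fig, ax = plt.subplots(len(job_titles), 1)

for i, job_title in enumerate(job_titles):
    # Top 5 skills for this job title, in reversed row order
    df_plot = df_skills_perc[df_skills_perc['job_title_short'] == job_title].head(5)[::-1]
    # Bar length is the share of postings; bar color encodes the raw count
    sns.barplot(data=df_plot, x='skill_percent', y='job_skills', ax=ax[i], hue='skill_count', palette='dark:b_r')

plt.show()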

Results

Likelihood of Skills Requested in Germany Job Postings

Bar graph visualizing how often the top 5 skills are requested in job postings for each of the top 3 data roles.

Insights:

  • SQL is the most requested skill for Data Analysts and Data Scientists, with it in over half the job postings for both roles. For Data Engineers, Python is the most sought-after skill, appearing in 53% of job postings.
  • Data Engineers require more specialized technical skills (AWS, Azure, Spark) compared to Data Analysts and Data Scientists who are expected to be proficient in more general data management and analysis tools (Excel, Tableau).
  • Python is a versatile skill, highly demanded across all three roles, but most prominently for Data Scientists (62%) and Data Engineers (53%).

2. How are in-demand skills trending for Data Scientist?

To find how skills are trending in 2023 for Data Scientist, I filtered data scientist positions and grouped the skills by the month of the job postings. This got me the top 5 skills of data scientists by month, showing how popular skills were throughout 2023.

View my notebook with detailed steps here: 3_Skills_Trend.
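
The monthly grouping described above can be sketched as follows. This is a minimal example on made-up postings, not the notebook's code; names such as `df_toy`, `df_counts`, `monthly_totals`, and `df_percent` are assumptions for illustration.

```python
import pandas as pd

# Toy Data Scientist postings (dates and skills are made up for illustration).
df_toy = pd.DataFrame({
    "job_posted_date": pd.to_datetime(
        ["2023-01-05", "2023-01-20", "2023-01-25", "2023-02-10", "2023-02-15"]
    ),
    "job_skills": [["python", "sql"], ["python"], ["python"], ["sql"], ["python", "sql"]],
})
df_toy["month"] = df_toy["job_posted_date"].dt.month

# Count skill mentions per month: rows are months, columns are skills.
df_exploded = df_toy.explode("job_skills")
df_counts = df_exploded.groupby(["month", "job_skills"]).size().unstack(fill_value=0)

# Divide by the number of postings in each month to get a percentage.
monthly_totals = df_toy.groupby("month").size()
df_percent = df_counts.div(monthly_totals, axis=0) * 100

# Order columns from most to least mentioned overall,
# so that slicing the first columns picks the top skills.
skill_order = df_counts.sum().sort_values(ascending=False).index
df_percent = df_percent[skill_order]

print(df_percent.round(1))
```

With the columns ordered by overall count, a slice like `.iloc[:, :5]` selects the five most mentioned skills.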

Visualize Data

from matplotlib.ticker import PercentFormatter

df_plot = df_DS_DE_percent.iloc[:, :5]
sns.lineplot(data=df_plot, dashes=False, legend='full', palette='tab10')

plt.gca().yaxis.set_major_formatter(PercentFormatter(decimals=0))

plt.show()

Results

Trending Top Skills for Data Scientist in Germany
Line graph visualizing the trending top skills for data scientists in Germany in 2023.

Insights:

  • Python dominates demand throughout the year despite a late-summer dip, confirming it as the most critical skill for Data Scientists in Germany.

  • SQL shows stable, consistent demand across all months, reinforcing its role as a foundational data skill.

  • R, Azure, and Pandas exhibit moderate and fluctuating demand, indicating their value as complementary skills rather than core requirements.

3. How well do jobs and skills pay for Data Scientist?

To identify the highest-paying roles and skills, I kept only jobs in Germany and looked at their median salary. First, though, I looked at the salary distributions of common data jobs such as Data Scientist, Data Engineer, and Data Analyst to get an idea of which jobs pay the most.

View my notebook with detailed steps here: 4_Salary_Analysis.
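
The box plot in this section needs two inputs: the postings for the most common job titles and a `job_order` list. A possible way to derive both is sketched below on made-up salaries. It keeps the top 3 titles instead of 6 to stay short, and `df_toy`, `df_salary`, `top_titles`, and `df_top` are hypothetical names.

```python
import pandas as pd

# Toy postings; titles and salaries are made up for illustration only.
df_toy = pd.DataFrame({
    "job_title_short": (
        ["Data Analyst"] * 3 + ["Data Scientist"] * 3
        + ["Data Engineer"] * 2 + ["Cloud Engineer"]
    ),
    "salary_year_avg": [60000, 70000, None, 90000, 110000, 100000, 80000, 90000, 120000],
})

# Only postings that report a salary can be used for the distribution.
df_salary = df_toy.dropna(subset=["salary_year_avg"])

# Keep the job titles with the most postings (3 here, 6 in the analysis).
n_titles = 3
top_titles = df_salary["job_title_short"].value_counts().index[:n_titles].tolist()
df_top = df_salary[df_salary["job_title_short"].isin(top_titles)]

# Order titles by median salary, highest first, for the plot's y-axis.
job_order = (
    df_top.groupby("job_title_short")["salary_year_avg"]
    .median()
    .sort_values(ascending=False)
    .index.tolist()
)

print(job_order)
```

Sorting by median rather than mean keeps a few very high salaries from reshuffling the order of the boxes.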

Visualize Data

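# Box plot of yearly salaries per job title, in the order given by job_order
sns.boxplot(data=df_US_top6, x='salary_year_avg', y='job_title_short', order=job_order)

# Show x-axis ticks in thousands, e.g. $100K
ticks_x = plt.FuncFormatter(lambda x, pos: f'${int(x/1000)}K')
plt.gca().xaxis.set_major_formatter(ticks_x)
plt.show()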

Results

Salary Distributions of Data Jobs in Germany
Box plot visualizing the salary distributions for the top 6 data job titles.

Insights:

  • There is a clear variation in salary ranges across data roles in Germany. Senior Data Scientist and Machine Learning Engineer roles show the highest salary potential, highlighting the strong market value of advanced expertise and specialization.

  • Senior Data Scientist and Senior Data Engineer roles display several high-end outliers, indicating that exceptional experience, niche skills, or specific industries can significantly increase compensation. In contrast, Data Analyst roles show tighter salary distributions with fewer outliers.

  • Median salaries increase with seniority and technical specialization. Senior-level roles not only earn higher median pay but also exhibit greater salary dispersion, reflecting increased responsibility and performance-based compensation.

Highest Paid & Most Demanded Skills for Data Scientist

Next, I narrowed my analysis and focused only on data scientist roles. I looked at the highest-paid skills and the most in-demand skills. I used two bar charts to showcase these.
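
The two tables behind these charts can be sketched as below. This is an illustrative version on made-up postings, assuming each skill is summarized by its posting count and median salary; `df_toy`, `df_exploded`, and `df_skill_stats` are hypothetical names, and the example keeps the top 2 rather than the top 10.

```python
import pandas as pd

# Toy Data Scientist postings (skills and salaries are made up for illustration).
df_toy = pd.DataFrame({
    "job_skills": [["python", "sql"], ["python"], ["sql", "terraform"], ["python", "sql"]],
    "salary_year_avg": [100000, 90000, 130000, 80000],
})

# One row per (posting, skill), then count and median salary per skill.
df_exploded = df_toy.dropna(subset=["salary_year_avg"]).explode("job_skills")
df_skill_stats = df_exploded.groupby("job_skills")["salary_year_avg"].agg(["count", "median"])

top_n = 2  # the analysis uses 10

# Highest-paid skills: rank by median salary.
df_DS_top_pay = df_skill_stats.sort_values("median", ascending=False).head(top_n)

# Most in-demand skills: rank by count, then sort by median for plotting.
df_DS_skills = (
    df_skill_stats.sort_values("count", ascending=False)
    .head(top_n)
    .sort_values("median", ascending=False)
)

print(df_DS_top_pay)
print(df_DS_skills)
```

Both tables keep the skill name as the index and a `median` column, which matches how the plotting code passes `x='median'` and `y=<table>.index`.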

Visualize Data

fig, ax = plt.subplots(2, 1)  

# Top 10 Highest Paid Skills for Data Scientist
sns.barplot(data=df_DS_top_pay, x='median', y=df_DS_top_pay.index, hue='median', ax=ax[0], palette='dark:b_r')

# Top 10 Most In-Demand Skills for Data Scientist
sns.barplot(data=df_DS_skills, x='median', y=df_DS_skills.index, hue='median', ax=ax[1], palette='light:b')

plt.show()

Results

Here's the breakdown of the highest-paid & most in-demand skills for data scientists in Germany:

The Highest Paid & Most In-Demand Skills for Data Scientist in Germany
Two separate bar graphs visualizing the highest paid skills and most in-demand skills for data scientists in Germany.

Insights:

  • The top graph shows that specialized engineering and infrastructure skills such as PySpark, Terraform, FastAPI, and Jenkins are associated with the highest median salaries, reaching close to $200K. This indicates that advanced, production-level data science and MLOps skills significantly boost earning potential in Germany.

  • The bottom graph highlights that foundational and widely used skills like SQL, Python, Spark, and Hadoop are the most in-demand, even though they do not always correspond to the highest salaries. This emphasizes their importance for employability and day-to-day data science work.

  • There is a clear distinction between high-paying specialized skills and high-demand core skills. Data scientists aiming to maximize career growth should build a strong foundation in essential tools while strategically adding specialized technologies to increase long-term earning potential.

4. What are the most optimal skills to learn for Data Scientist?

To identify the most optimal skills to learn (the ones that are both the highest paid and the highest in demand), I calculated the percentage of job postings that request each skill and the median salary associated with it. Plotting the two against each other makes the optimal skills easy to spot.

View my notebook with detailed steps here: 5_Optimal_Skills.
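
A minimal sketch of that calculation, on made-up postings, is shown below. The `skill_limit` threshold and its value are assumptions for illustration, as are `df_toy` and `ds_job_count`; the notebook may use a different cutoff.

```python
import pandas as pd

# Toy Data Scientist postings (skills and salaries are made up for illustration).
df_toy = pd.DataFrame({
    "job_skills": [["python", "sql"], ["python"], ["sql", "terraform"], ["python", "sql"]],
    "salary_year_avg": [100000, 90000, 130000, 80000],
})
df_toy = df_toy.dropna(subset=["salary_year_avg"])
ds_job_count = len(df_toy)  # number of postings with a salary

# Count and median salary per skill.
df_DS_skills = (
    df_toy.explode("job_skills")
    .groupby("job_skills")["salary_year_avg"]
    .agg(["count", "median"])
    .rename(columns={"count": "skill_count", "median": "median_salary"})
)

# Share of postings that mention each skill, in percent.
df_DS_skills["skill_percent"] = df_DS_skills["skill_count"] / ds_job_count * 100

# Keep only skills above a demand threshold (hypothetical value).
skill_limit = 50
df_DS_skills_high_demand = df_DS_skills[df_DS_skills["skill_percent"] > skill_limit]

print(df_DS_skills_high_demand)
```

Filtering on demand first keeps rarely requested skills, whose median salary rests on very few postings, out of the scatter plot.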

Visualize Data

from adjustText import adjust_text
import matplotlib.pyplot as plt

plt.scatter(df_DS_skills_high_demand['skill_percent'], df_DS_skills_high_demand['median_salary'])
plt.show()

Results

Most Optimal Skills for Data Scientist in Germany
A scatter plot visualizing the most optimal skills (high paying & high demand) for data scientists in Germany.

Insights:

  • Python stands out as the most optimal skill, combining the highest demand with a strong median salary, making it the most valuable core skill for Data Scientists in Germany.

  • Big data and cloud technologies such as Spark SQL, Redshift, Kubernetes, AWS, and Azure offer high median salaries despite lower demand, indicating strong market value for specialized, infrastructure-focused expertise.

  • Visualization tools like Tableau and Power BI show lower salaries and demand compared to engineering and cloud skills, suggesting they are better suited as complementary skills rather than primary drivers of compensation in data science roles.

Visualizing Different Technologies

Let's visualize the different technologies as well in the graph. We'll add color labels based on the technology (e.g., {Programming: Python})
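
One way to attach a technology label to each skill is sketched below. It assumes the source data has a column of dict-like strings mapping a technology category to its skills; the toy values, the statistics, and the helper names (`raw_types`, `technology_dict`, `df_technology`, `df_stats`) are all made up for illustration.

```python
import ast
import pandas as pd

# Toy stand-in for a column of dict-like strings: technology -> list of skills.
raw_types = pd.Series([
    "{'programming': ['python', 'sql'], 'cloud': ['aws']}",
    "{'programming': ['python'], 'analyst_tools': ['tableau']}",
    None,
], dtype=object)

# Merge all rows into one dictionary, collecting each technology's skills in a set.
technology_dict = {}
for row in raw_types.dropna():
    for tech, skills in ast.literal_eval(row).items():
        technology_dict.setdefault(tech, set()).update(skills)

# Flatten to one row per (technology, skill) pair.
df_technology = pd.DataFrame(
    [(tech, skill) for tech, skills in technology_dict.items() for skill in sorted(skills)],
    columns=["technology", "skills"],
)

# Toy per-skill statistics (made-up numbers), joined with the technology labels.
df_stats = pd.DataFrame({
    "skills": ["python", "aws", "tableau"],
    "skill_percent": [75.0, 20.0, 15.0],
    "median_salary": [100000, 110000, 85000],
})
df_DS_skills_tech_high_demand = df_stats.merge(df_technology, on="skills")

print(df_DS_skills_tech_high_demand)
```

The merged table carries a `technology` column, which is what the `hue='technology'` argument in the scatter plot expects.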

Visualize Data

from matplotlib.ticker import PercentFormatter

# Create a scatter plot
scatter = sns.scatterplot(
    data=df_DS_skills_tech_high_demand,
    x='skill_percent',
    y='median_salary',
    hue='technology',  # Color by technology
    palette='bright',  # Use a bright palette for distinct colors
    legend='full'  # Ensure the legend is shown
)
plt.show()

Results

Most Optimal Skills for Data Scientist in Germany with Coloring by Technology
A scatter plot visualizing the most optimal skills (high paying & high demand) for data scientists in Germany with color labels for technology.

Insights:

  • The scatter plot shows that most programming skills (blue), such as Python, R, and Go, cluster at higher salary levels, indicating that strong programming expertise offers the greatest compensation benefits for Data Scientists in Germany.

  • Cloud and big data technologies (red and purple), including AWS, Azure, Kubernetes, Redshift, and Hadoop, are associated with high median salaries despite lower demand, highlighting their value as specialized, high-impact skills.

  • Analyst tools (green), such as Tableau and Power BI, show lower salaries and demand compared to programming and cloud skills, suggesting they are best positioned as supporting skills rather than primary drivers of compensation in data science roles.

What I Learned

Throughout this project, I deepened my understanding of the data scientist job market and enhanced my technical skills in Python, especially in data manipulation and visualization. Here are a few specific things I learned:

  • Advanced Python Usage: Utilizing libraries such as Pandas for data manipulation, Seaborn and Matplotlib for data visualization, and other libraries helped me perform complex data analysis tasks more efficiently.
  • Data Cleaning Importance: I learned that thorough data cleaning and preparation are crucial before any analysis can be conducted, ensuring the accuracy of insights derived from the data.
  • Strategic Skill Analysis: The project emphasized the importance of aligning one's skills with market demand. Understanding the relationship between skill demand, salary, and job availability allows for more strategic career planning in the tech industry.

Insights

This project provided several general insights into the job market for data scientists:

  • Skill Demand and Salary Correlation: There is a clear correlation between the demand for specific skills and the salaries these skills command. Advanced and specialized skills like Python and Redshift often lead to higher salaries.
  • Economic Value of Skills: Understanding which skills are both in-demand and well-compensated can guide data scientists in prioritizing learning to maximize their economic returns.

Challenges I Faced

This project was not without its challenges, but it provided good learning opportunities:

  • Data Inconsistencies: Handling missing or inconsistent data entries requires careful consideration and thorough data-cleaning techniques to ensure the integrity of the analysis.
  • Complex Data Visualization: Designing effective visual representations of complex datasets was challenging but critical for conveying insights clearly and compellingly.
  • Balancing Breadth and Depth: Deciding how deeply to dive into each analysis while maintaining a broad overview of the data landscape required constant balancing to ensure comprehensive coverage without getting lost in details.

Conclusion

This exploration into the data scientist job market has been incredibly informative, highlighting the critical skills and trends that shape this evolving field. The insights I gained enhance my understanding and provide actionable guidance for anyone looking to advance their career in data science. As the market continues to change, ongoing analysis will be essential to stay ahead in data science. This project is a good foundation for future explorations and underscores the importance of continuous learning and adaptation in the data field.
