Exploring with Data Analysis: How to Become a Data Scientist? (Series, Part 2)

I. What resources can help you go deeper into data science?

Q12. Who/what are your favorite media sources that report on data science topics? (Select all that apply) — Selected Choice

Q13. On which platforms have you begun or completed data science courses? (Select all that apply) — Selected Choice

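1. Preprocessing the favorite media sources data

import pandas as pd

# Select the Q12 columns (multiple_choice is the survey DataFrame, assumed to be loaded already)
lst = []
for i in multiple_choice.columns:
    if i[:3] == 'Q12':
        lst.append(i)
# Row 0 holds the question text, so the answers start at row 1
source = multiple_choice[lst].iloc[1:,]
# Extract the option labels of the Q12 multiple-choice question
choose = multiple_choice[lst].iloc[0].apply(lambda x: ''.join(x.split('-')[2:]).replace(' ',''))
# Count how many respondents selected each option
source_df = pd.DataFrame()
source_df['choice'] = list(choose.values)
source_df['value'] = list(map(lambda i: len(source.iloc[:,i].dropna()), list(range(source.shape[1]))))
source_df
Figure 1. Preprocessed favorite media sources data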
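
The counting step above can also be written with pandas' built-in `count()`, which returns the number of non-missing values per column. The following is only a minimal sketch on a toy table: the column names and answers are hypothetical stand-ins for the survey layout, where each option has its own column that is empty when the option was not selected.

```python
import pandas as pd

# Hypothetical stand-in for the survey layout: one column per option,
# holding the option text when selected and a missing value otherwise.
toy = pd.DataFrame({
    'Q12_Part_1': ['Twitter', None, 'Twitter'],
    'Q12_Part_2': [None, 'Kaggle', 'Kaggle'],
    'Q12_Part_3': ['Blogs', 'Blogs', 'Blogs'],
})

# Keep only the Q12 columns, then count non-missing answers per column.
q12_cols = [c for c in toy.columns if c.startswith('Q12')]
counts = toy[q12_cols].count()

# Same two-column shape as source_df, with column names used as labels here.
toy_df = pd.DataFrame({'choice': counts.index, 'value': counts.values})
print(toy_df)
```

In the real data the labels would still come from the question-text row, as in the code above; only the counting changes.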

2. Preprocessing the favorite courses data

# Select the Q13 columns
lst = []
for i in multiple_choice.columns:
    if i[:3] == 'Q13':
        lst.append(i)
# Row 0 holds the question text, so the answers start at row 1
course = multiple_choice[lst].iloc[1:,]
# Extract the option labels of the Q13 multiple-choice question
choose = multiple_choice[lst].iloc[0].apply(lambda x: ''.join(x.split('-')[2:]).replace(' ',''))
# Count how many respondents selected each option
course_df = pd.DataFrame()
course_df['choice'] = list(choose.values)
course_df['value'] = list(map(lambda i: len(course.iloc[:,i].dropna()), list(range(course.shape[1]))))
# Shorten two long option labels
course_df['choice'] = course_df['choice'].replace('KaggleCourses(i.e.KaggleLearn)','KaggleCourses')
course_df['choice'] = course_df['choice'].replace('UniversityCourses(resultinginauniversitydegree)','UniversityCourses')
course_df
Figure 2. Preprocessed favorite courses data

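3. Visualizing favorite media sources

# Draw the chart (the last three rows of source_df are left out)
import plotly.express as px
fig = px.pie(source_df.iloc[:-3,], values='value', names='choice',
             title='Media Sources Selected',
             color_discrete_sequence=px.colors.sequential.Sunsetdark)
fig.update_traces(textposition='inside')
fig.show()
# Save automatically (pyplt is not defined in this part; it is assumed to be plotly.offline.plot)
pyplt(fig, filename='Media Sources Selected.html')
Figure 3. Pie chart of favorite media sources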

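4. Visualizing favorite courses

# Draw the chart (the last three rows of course_df are left out)
import plotly.express as px
fig = px.pie(course_df.iloc[:-3,], values='value', names='choice',
             title='Data Science Courses Selected',
             color_discrete_sequence=px.colors.sequential.Mint)
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()
# Save automatically (pyplt is not defined in this part; it is assumed to be plotly.offline.plot)
pyplt(fig, filename='Data Science Courses Selected.html')
Figure 4. Pie chart of favorite courses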

5. Analysis

  • Kaggle (forums, blog, social media, etc.)
  • Blogs (TowardsDataScience, Medium, AnalyticsVidhya, KDnuggets, etc.)
  • YouTube (CloudAIAdventures, SirajRaval, etc.)

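Let's wrap up with a summary:

  • What does a data scientist mainly do?
    The main tasks are "analyzing and understanding how data affects product or business decisions, exploring areas where machine learning can be applied, and improving existing machine learning models." Put simply, with solid skills in three areas (mathematics, computer science, and domain knowledge), the work of a data scientist is well within your reach.
  • Which programming languages do data scientists use?
    Python leads in almost every job field, so beginners are advised to start with Python. Each language has its own strengths, so choose the tool that best fits your data processing needs. R is also highly recommended: it is a great help for statistical analysis and visualization in data analysis.
  • What resources can help you go deeper into data science?
    First, I suggest starting with Udemy's Machine Learning A-Z course. Once you understand the basics of Python, I recommend the machine learning course by Professor Hung-yi Lee of National Taiwan University. You can also pick the course that best suits your needs or interests on Coursera, Kaggle Courses, or Udemy.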

Additional notes on the Plotly package

  1. Dynamic & Interactive
  2. Open-source plotting library
  3. Supports over 40 unique chart types
  4. Covering a wide range of statistical, financial, geographic, scientific, and 3-dimensional use-cases
  5. Support for Python and R

1. Data preprocessing

# Select Q3, Q5, and Q10, then drop missing values and strip symbols
world_salary = multiple_choice[['Q3','Q5','Q10']].iloc[1:,:]
world_salary = world_salary.dropna()
world_salary['Q10'] = world_salary['Q10'].str.replace('$','', regex=False)
world_salary['Q10'] = world_salary['Q10'].str.replace(',','', regex=False)
world_salary
Figure 5. Extracting the required data
# Lower and upper bounds of each salary range
world_salary['salary_lower_bound'] = pd.to_numeric(world_salary['Q10'].str.split('-', expand=True)[0], errors='coerce')
world_salary['salary_upper_bound'] = pd.to_numeric(world_salary['Q10'].str.split('-', expand=True)[1], errors='coerce') + 1
# Take the midpoint of the lower and upper bounds
world_salary['salary'] = (world_salary['salary_upper_bound'] + world_salary['salary_lower_bound'])/2
# Rename the columns
world_salary.columns = ['country', 'job', 'salary_range', 'salary_lower_bound', 'salary_upper_bound', 'salary']
world_salary
Figure 6. Midpoint of each salary range
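
To make the midpoint arithmetic concrete, here is a small plain-Python sketch of the same idea applied to single strings. The function name `salary_midpoint` and the sample strings are illustrative assumptions, not part of the original code; the sample strings are only meant to resemble the survey's range labels.

```python
from typing import Optional


def salary_midpoint(salary_range: str) -> Optional[float]:
    """Return the midpoint of a range such as '$0-999' or '1,000-1,999'.

    The upper bound is increased by 1 (999 becomes 1000), as in the
    pandas code. Answers without a '-' separator return None.
    """
    cleaned = salary_range.replace('$', '').replace(',', '').strip()
    parts = cleaned.split('-')
    if len(parts) != 2:
        return None
    try:
        lower = float(parts[0])
        upper = float(parts[1]) + 1
    except ValueError:
        return None
    return (lower + upper) / 2


print(salary_midpoint('$0-999'))       # 500.0
print(salary_midpoint('1,000-1,999'))  # 1500.0
print(salary_midpoint('> $500,000'))   # None
```

An open-ended answer has no upper bound to average with, so it yields no midpoint here, much as `errors='coerce'` turns unparsable values into missing values in the pandas version.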
# Keep only the data scientists
world_salary_ds = world_salary[world_salary['job'] == 'Data Scientist']
# Compute the average salary per country
country_salary_ds = pd.DataFrame(world_salary_ds.groupby('country')['salary'].mean().astype('int'))
country_salary_ds = country_salary_ds.reset_index()
country_salary_ds = country_salary_ds.sort_values(by = 'salary', ascending = False).reset_index(drop = True)
# Rename countries to match the names used in the gapminder data
# (regex=True where the pattern uses '.*', regex=False for literal text)
country_salary_ds['country'] = country_salary_ds['country'].str.replace('United States of America','United States', regex=False)
country_salary_ds['country'] = country_salary_ds['country'].str.replace('United Kingdom of Great Britain.*','United Kingdom', regex=True)
country_salary_ds['country'] = country_salary_ds['country'].str.replace('Iran, Islamic Republic.*','Iran', regex=True)
country_salary_ds['country'] = country_salary_ds['country'].str.replace('South Korea','Korea, Rep.', regex=False)
# Note: 'Republic of Korea' is the official name of South Korea, whereas 'Korea, Dem. Rep.'
# denotes North Korea, so check this mapping before relying on that row
country_salary_ds['country'] = country_salary_ds['country'].str.replace('Republic of Korea','Korea, Dem. Rep.', regex=False)
country_salary_ds['country'] = country_salary_ds['country'].str.replace('Hong Kong (S.A.R.)','Hong Kong, China', regex=False)
country_salary_ds['country'] = country_salary_ds['country'].str.replace('Viet Nam','Vietnam', regex=False)
country_salary_ds.head(10)
Figure 7. Average data scientist salary in each country
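
The renaming step mixes two kinds of replacement: patterns that end in `.*` and plain text that contains parentheses and dots. The sketch below, on a two-item toy Series, shows how the `regex` argument of `Series.str.replace` changes the result. The sample names are only illustrative.

```python
import pandas as pd

names = pd.Series([
    'United Kingdom of Great Britain and Northern Ireland',
    'Hong Kong (S.A.R.)',
])

# Pattern match: '.*' swallows the rest of the long official name.
fixed = names.str.replace('United Kingdom of Great Britain.*', 'United Kingdom', regex=True)
# Literal match: parentheses and dots are treated as plain characters.
fixed = fixed.str.replace('Hong Kong (S.A.R.)', 'Hong Kong, China', regex=False)
print(fixed.tolist())

# Treated as a regular expression, the parentheses form a group instead of
# matching '(' and ')', so the name is left unchanged.
unchanged = names.str.replace('Hong Kong (S.A.R.)', 'Hong Kong, China', regex=True)
print(unchanged.tolist()[1])
```

Passing `regex` explicitly avoids depending on the library default, which differs between pandas versions.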
# Add country codes
import plotly.express as px
df = px.data.gapminder().query("year == 2007")[['country','continent','iso_alpha']]
country_salary_merge_ds = pd.merge(country_salary_ds, df, on = 'country')
# Add the missing countries (pd.concat is used because DataFrame.append was removed in pandas 2.0)
missing = pd.DataFrame([
    {'country':'Russia','salary':28000,'continent':'Europe','iso_alpha':'RUS'},
    {'country':'Ukraine','salary':29102,'continent':'Europe','iso_alpha':'UKR'},
    {'country':'Belarus','salary':27568,'continent':'Europe','iso_alpha':'BLR'},
])
country_salary_merge_ds = pd.concat([country_salary_merge_ds, missing], ignore_index=True)
country_salary_merge_ds.head(10)
Figure 8. Adding country codes
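
The default inner merge silently drops any country whose name is missing from the lookup table, which is presumably why a few countries are added by hand. One way to see which names failed to match is a left merge with `indicator=True`. The sketch below uses two tiny toy tables in place of the survey results and the gapminder lookup; apart from Russia's figure, the salary values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for country_salary_ds and the gapminder lookup table.
salaries = pd.DataFrame({
    'country': ['United States', 'Russia', 'Vietnam'],
    'salary': [100000, 28000, 20000],  # illustrative values
})
codes = pd.DataFrame({
    'country': ['United States', 'Vietnam'],
    'iso_alpha': ['USA', 'VNM'],
})

# A left merge keeps every salary row; '_merge' records where each row matched.
merged = pd.merge(salaries, codes, on='country', how='left', indicator=True)
unmatched = merged.loc[merged['_merge'] == 'left_only', 'country'].tolist()
print(unmatched)  # ['Russia']
```

Countries listed in `unmatched` either need a rename before the merge or a manually supplied code afterwards.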

2. Data visualization

# Draw the chart
import plotly.express as px

fig = px.scatter_geo(country_salary_merge_ds, locations="iso_alpha",
                     size="salary", color="salary", hover_name="country",
                     projection="natural earth",
                     title='Average Data Scientists Salary For Every Countries')
fig.show()
# Save automatically (pyplt is not defined in this part; it is assumed to be plotly.offline.plot)
pyplt(fig, filename='Average Data Scientists Salary For Every Countries.html')
Video 1. Average data scientist salary


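Marketing data science. Taiwan's first marketing data science (MDS) knowledge community. This fan page explores the basic concepts, trends, new tools, and hands-on practice of marketing data science, helping followers understand how data science is applied in marketing and giving them an opportunity to build strong data analysis skills. Fan page: https://www.facebook.com/MarketingDataScienceTMR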
