[Big Data] Identifying the Features That Most Influence Kickstarter Projects (2. Collecting all features and extracting specific features)

2020. 8. 31. 11:28

Previous post: [Big Data] Identifying the Features That Most Influence Kickstarter Projects (1. Collecting detail-page links)

https://beaver-sohyun.tistory.com/53?category=864438

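5. Collecting features from the collected links

From each project's detail page, I collected every feature that might have affected whether the project succeeded: the project link, the amount actually raised, the goal amount, the creator's name, and so on.

- Of these, I narrowed the set down to five features whose effect can be measured numerically, and trained the model on them.

- This post explains how all of the features were collected.

1) Collected features: project link, amount actually raised, goal amount, creator name, creator location, project category, list of reward-tier amounts, highest and lowest reward-tier amounts, number of backers per reward tier

2) The five selected features: total number of backers, number of projects the creator has opened, total number of reward tiers, number of backers at the highest reward tier, number of backers at the lowest reward tier (see the third post)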
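
As a rough sketch of that narrowing step, the five features could be pulled out of the saved table with pandas. This assumes the column names used when the CSV is written in section 5-1-2; `FIVE_FEATURES` and `select_five_features` are hypothetical names, and the sample rows are made-up placeholders, not scraped data.

```python
import pandas as pd

# Column names follow the CSV written in section 5-1-2 of this post.
# The list name itself is a hypothetical helper for this sketch.
FIVE_FEATURES = [
    "backers",
    "Number of projects opened",
    "Total number of support steps",
    "Maximum amount of sponsors",
    "Minimum amount of sponsors",
]


def select_five_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return only the five selected features, converted to numbers."""
    selected = df[FIVE_FEATURES].copy()
    for column in FIVE_FEATURES:
        # Scraped values may be strings such as "1,234"; strip commas first.
        cleaned = selected[column].astype(str).str.replace(",", "", regex=False)
        # Values that still cannot be parsed become NaN instead of raising.
        selected[column] = pd.to_numeric(cleaned, errors="coerce")
    return selected


# Placeholder rows for illustration only (not real Kickstarter data).
sample = pd.DataFrame({
    "money": ["5000", "120"],
    "backers": ["1,234", "7"],
    "Number of projects opened": [2, 0],
    "Total number of support steps": [5, 3],
    "Maximum amount of sponsors": ["3", "0"],
    "Minimum amount of sponsors": ["40", "2"],
})
five = select_five_features(sample)
print(five)
```

The actual selection used for training is described in the third post; this only shows one plausible way to get from the full table to the five numeric columns.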

5-1. Collecting features from the detail pages of successful projects

5-1-1. Collecting all features

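# Imports used by the scraping code below.
# `driver` (a Selenium WebDriver) and `success_link_list` are assumed to exist from the previous post.
import re

import numpy as np
from bs4 import BeautifulSoup

# Create one list per feature to hold the collected values

money_list = []
pledged_money_list = []
creator_list = []
backers_list = []
final_created_list = []
city_list = []
category_list = []
support_bankroll_list = []
support_max_money_list = []
support_min_money_list = []
support_max_backers_list = []
support_min_backers_list = []
level_num_list = []
support_backers_list = []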

for url in success_link_list:
    driver.get(url)
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    print("Link", url)

    # Amount actually raised
    money = soup.select('span.money')[0].text
    money = re.findall(r'\d+', money)
    money = "".join(money)
    money_list.append(money)
    print("Amount raised", money)

    # Goal amount (the variable is named pledged_money, but it holds the goal)
    pledged_money = soup.select('span.money')[2].text
    pledged_money = re.findall(r'\d+', pledged_money)
    pledged_money = "".join(pledged_money)
    pledged_money_list.append(pledged_money)
    print("Goal amount", pledged_money)

    # Creator
    creator = soup.select('a.hero__link')[1].text
    creator = creator.strip()
    creator_list.append(creator)
    print("Creator", creator)

    # Total number of backers
    backers = soup.select('h3.mb0')[1].text
    backers = backers.strip()
    backers_list.append(backers)
    print("Total backers", backers)

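    # Number of projects the creator has opened (read from the creator's page)
    a_tags = soup.find_all("a", {'class': "hero__link remote_modal_dialog js-update-text-color"})

    for created in a_tags:
        print(created["href"])  # href of each matching <a> tag
    # As in the original, the last matching link is used; this assumes at least one match
    creator_url = "https://www.kickstarter.com" + a_tags[-1]["href"]
    driver.get(creator_url)
    html_s = driver.page_source
    soup_s = BeautifulSoup(html_s, 'html.parser')

    # Read the element from the parsed page; the original used Selenium's
    # find_element_by_class_name, which Selenium 4 no longer provides
    created_dummy = soup_s.select_one(".created-projects.py2.f5.mb3")
    # Convert every number in the text to int without reusing the outer loop variable
    final_created = [int(n) for n in re.findall(r'\d+', created_dummy.text)]

    # A single number is treated as zero previously opened projects, as in the original logic
    if len(final_created) == 1:
        created_num = 0
    else:
        created_num = final_created[0]
    final_created_list.append(created_num)

    print("Projects opened", created_num)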

    # Creator location
    city = soup.select('a.grey-dark')[4].text
    city = city.strip()
    city_list.append(city)
    print("Creator location", city)

    # Project category
    category = soup.select('a.grey-dark')[5].text
    category = category.strip()
    category_list.append(category)
    print("Category", category)

    # Reward-tier amounts
    support_money = soup.select('span.money')
    support_bankroll = []
    for line in support_money:
        line = line.get_text()
        line = re.findall(r'\d+', line)
        line = "".join(line)
        support_bankroll.append(int(line))
    # Drop the first three values, which come before the reward tiers
    # (indices 0 and 2 are the raised and goal amounts read above)
    support_bankroll = support_bankroll[3:]

    support_bankroll_list.append(support_bankroll)
    print("Reward-tier amounts", support_bankroll)

    # Highest / lowest reward-tier amount
    # (np.max/np.min replace the manual loop, whose variables max and min shadowed Python's built-ins)
    support_max_money = int(np.max(support_bankroll))
    support_min_money = int(np.min(support_bankroll))
    print("Highest amount", support_max_money)
    print("Lowest amount", support_min_money)
    support_max_money_list.append(support_max_money)
    support_min_money_list.append(support_min_money)

    # Number of reward tiers
    level_num = len(support_bankroll)
    level_num_list.append(level_num)
    print("Reward tiers", level_num)

    # Number of backers per reward tier
    support_backer = soup.select(' div > div > div > div > div > div > div > ol > li > div > div > span')
    support_backers = []
    for line in support_backer:
        line = line.get_text()
        # Skip spans that are labels rather than backer counts
        if line == 'Includes:':
            continue
        line = line.replace('\n', '')
        if line in ('Limited', 'Reward no longer available'):
            continue
        line = line.replace(' backers', '').replace(' backer', '')
        support_backers.append(line)
    support_backers_list.append(support_backers)
    print("Backers per reward tier", support_backers)

    # Backers at the highest / lowest reward tier
    # (assumes support_backers lines up index-for-index with support_bankroll)
    max_money_index = np.argmax(support_bankroll)
    min_money_index = np.argmin(support_bankroll)
    max_money_backers = support_backers[max_money_index]
    min_money_backers = support_backers[min_money_index]
    support_max_backers_list.append(max_money_backers)
    support_min_backers_list.append(min_money_backers)
    print("Backers at highest tier", max_money_backers)
    print("Backers at lowest tier", min_money_backers)
    print("\n")

Output format of the collected features
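
The loop above repeats the same two text-cleaning steps several times: joining the digit runs of an amount, and stripping the "backers" suffix while skipping label spans. A minimal sketch of how those steps might be factored into helpers follows. The names `parse_amount`, `clean_backer_counts`, and `SKIP_LABELS` are hypothetical, and unlike the original code this version converts backer counts to `int` and drops entries without digits, which is an assumption about what is wanted.

```python
import re

# Label texts that the original loop skips because they are not backer counts.
SKIP_LABELS = {"Includes:", "Limited", "Reward no longer available"}


def parse_amount(text):
    """Join every digit run in `text` into one int; return None if there are no digits."""
    digits = "".join(re.findall(r"\d+", text))
    return int(digits) if digits else None


def clean_backer_counts(span_texts):
    """Keep only backer-count strings and convert them to ints."""
    counts = []
    for text in span_texts:
        text = text.replace("\n", "").strip()
        if text in SKIP_LABELS:
            continue
        value = parse_amount(text.replace(" backers", "").replace(" backer", ""))
        if value is not None:
            counts.append(value)
    return counts


# Placeholder strings for illustration only (not scraped from a real page).
print(parse_amount("US$ 1,234"))
print(clean_backer_counts(["Includes:", "\n12 backers\n", "Limited", "1 backer"]))
```

Like the original regex approach, `parse_amount` simply concatenates digits, so an amount with a decimal part would be misread; that case would need separate handling.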

5-1-2. Saving all collected features to a CSV file

import pandas as pd

columns = ['money','pledged_money','creator','backers','Number of projects opened','Producer area','genre','Support amount list','The highest amount of support','The minimum amount of support','Total number of support steps','List of supporters','Maximum amount of sponsors','Minimum amount of sponsors']

# zip() turns the per-feature lists into one row per project.
# Building the rows this way avoids loop variables that overwrite the feature lists.
result = [list(row) for row in zip(money_list,pledged_money_list,creator_list,backers_list,final_created_list,city_list,category_list,support_bankroll_list,support_max_money_list,support_min_money_list,level_num_list,support_backers_list,support_max_backers_list,support_min_backers_list)]

df = pd.DataFrame(result, columns=columns)
df.to_csv('success_link_feature.csv', encoding='utf-8-sig')
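
One thing to keep in mind when reading this file back: the two list-valued columns are written to CSV as text such as "[10, 50, 100]". A minimal sketch of restoring them with `ast.literal_eval` is shown below. The helper name `load_feature_csv` and the demo values are hypothetical, and the demo uses an in-memory buffer instead of the real file.

```python
import ast
import io

import pandas as pd

# The two columns that hold Python lists in the DataFrame saved above.
LIST_COLUMNS = ["Support amount list", "List of supporters"]


def load_feature_csv(source):
    """Read a feature CSV and turn stringified lists back into lists."""
    # index_col=0 because to_csv() also wrote the DataFrame index.
    df = pd.read_csv(source, index_col=0)
    for column in LIST_COLUMNS:
        df[column] = df[column].apply(ast.literal_eval)
    return df


# Round-trip demo with placeholder values (not real Kickstarter data).
demo = pd.DataFrame({
    "money": ["5000"],
    "Support amount list": [[10, 50, 100]],
    "List of supporters": [["3", "12", "1"]],
})
buffer = io.StringIO()
demo.to_csv(buffer)
buffer.seek(0)
restored = load_feature_csv(buffer)
print(restored["Support amount list"][0])
```

For the real data, the path to the saved CSV would be passed in place of the buffer.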

5-2. Collecting and narrowing down features from the detail pages of failed projects

5-2-1. Collecting all features

# Create one list per feature to hold the collected values
import time  # used for the waits in the loop below

money_list = []
pledged_money_list = []
creator_list = []
backer_list = []
created_list = []
city_list = []
category_list = []
support_bankroll_list = []
support_backers_list = []
level_num_list = []
support_max_money_list = []
support_min_money_list = []
support_max_backers_list = []
support_min_backers_list = []

for url in fail_link_list:
    driver.get(url)
    # Wait for the page to render before reading its source
    # (the original slept after page_source had already been captured)
    time.sleep(8)
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    print('Link', url)

    # Amount actually raised
    money = soup.select('span.soft-black')[1].text
    money = re.findall(r'\d+', money)
    money = "".join(money)
    money_list.append(money)
    print('Amount raised', money)
    time.sleep(0.2)

    # Goal amount (the variable is named pledged_money, but it holds the goal)
    pledged_money = soup.find('span', {'class': 'inline-block hide-sm'}).text
    pledged_money = re.findall(r'\d+', pledged_money)
    pledged_money = "".join(pledged_money)
    pledged_money_list.append(pledged_money)
    print('Goal amount', pledged_money)
    time.sleep(0.2)

    # Creator name
    creator = soup.select('div.text-left')[1].text
    creator_list.append(creator)
    print('Creator', creator)

    # Total number of backers
    time.sleep(0.2)
    backers = soup.select(' div > div > div > div > div.flex.flex-column-lg.mb4.mb5-sm > div.ml5.ml0-lg.mb4-lg > div > span')[0].text
    backer_list.append(backers)
    print('Total backers', backers)
    time.sleep(0.2)
    
    # Number of projects the creator has opened
    created = soup.select('div.text-left')[2].text
    # Convert every number in the text to int without reusing the outer loop variable
    final_created = [int(n) for n in re.findall(r'\d+', created)]
    # A single number is treated as zero previously opened projects, as in the original logic
    if len(final_created) == 1:
        created_num = 0
    else:
        created_num = final_created[0]
    created_list.append(created_num)
    print('Projects opened', created_num)
    #time.sleep(0.2)

    # Creator location
    city = soup.select('span.ml1')[1].text
    city_list.append(city)
    print('Creator location', city)
    time.sleep(0.2)

    # Project category
    category = soup.select('span.ml1')[0].text
    category_list.append(category)
    print('Category', category)
    time.sleep(0.2)

    # Reward-tier amounts
    support_money = soup.select('div.NS_projects__content > section.js-project-content.js-project-description-content.project-content > div > div > div > div.col.col-4.max-w62.sticky-rewards.z10 > div > div.mobile-hide > div > ol > li > div.pledge__info > h2 > span.money')
    support_bankroll = []
    for line in support_money:
        line = line.get_text()
        line = re.findall(r'\d+', line)
        line = "".join(line)
        support_bankroll.append(int(line))

    support_bankroll_list.append(support_bankroll)
    print('Reward-tier amounts', support_bankroll)
    time.sleep(0.2)

    # Number of backers per reward tier
    support_backer = soup.select('div.NS_projects__content > section.js-project-content.js-project-description-content.project-content > div > div > div > div.col.col-4.max-w62.sticky-rewards.z10 > div > div.mobile-hide > div > ol > li > div.pledge__info > div.pledge__backer-stats > span')
    support_backers = []
    for line in support_backer:
        line = line.get_text()
        # Skip spans that are labels rather than backer counts
        if line == 'Includes:':
            continue
        line = line.replace('\n', '')
        if line in ('Limited', 'Reward no longer available'):
            continue
        line = line.replace(' backers', '').replace(' backer', '')
        support_backers.append(line)
    support_backers_list.append(support_backers)
    print("Backers per reward tier", support_backers)
    time.sleep(0.2)
    
    # Number of reward tiers
    level_num = len(support_bankroll)
    level_num_list.append(level_num)
    print('Reward tiers', level_num)
    time.sleep(0.2)

    # Highest / lowest reward-tier amount
    # (np.max/np.min replace the manual loop, whose variables max and min shadowed Python's built-ins)
    support_max_money = int(np.max(support_bankroll))
    support_min_money = int(np.min(support_bankroll))
    print("Highest amount", support_max_money)
    print("Lowest amount", support_min_money)
    support_max_money_list.append(support_max_money)
    support_min_money_list.append(support_min_money)
    time.sleep(0.2)

    # Backers at the highest / lowest reward tier
    # (assumes support_backers lines up index-for-index with support_bankroll)
    max_money_index = np.argmax(support_bankroll)
    min_money_index = np.argmin(support_bankroll)
    max_money_backers = support_backers[max_money_index]
    min_money_backers = support_backers[min_money_index]
    support_max_backers_list.append(max_money_backers)
    support_min_backers_list.append(min_money_backers)
    print("Backers at highest tier", max_money_backers)
    print("Backers at lowest tier", min_money_backers)
    print('\n')
    time.sleep(0.2)

Output format of the collected features
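
Both loops look up the backer count of the highest and lowest tier by reusing the position found in `support_bankroll` as an index into `support_backers`. That only works when the two lists have the same length and order. A minimal sketch of a guarded version of this lookup is below; `extreme_tier_backers` is a hypothetical helper, and returning `None` for unusable input is an assumption about how such pages should be handled.

```python
def extreme_tier_backers(amounts, backers):
    """Return (max_amount, min_amount, backers_at_max, backers_at_min).

    Returns None when there are no tiers or when the two lists differ in
    length, because the index pairing would then be unreliable.
    """
    if not amounts or len(amounts) != len(backers):
        return None
    # With ties, the first occurrence is used, matching np.argmax / np.argmin.
    max_idx = max(range(len(amounts)), key=amounts.__getitem__)
    min_idx = min(range(len(amounts)), key=amounts.__getitem__)
    return amounts[max_idx], amounts[min_idx], backers[max_idx], backers[min_idx]


# Placeholder values for illustration only (not scraped data).
print(extreme_tier_backers([10, 500, 50], ["30", "2", "11"]))
print(extreme_tier_backers([10, 500], ["30"]))
```

A project for which the helper returns `None` could be skipped or logged, rather than stopping the whole crawl with an `IndexError`.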

5-2-2. Saving all collected features to a CSV file

import pandas as pd

columns = ['money','pledged_money','creator','backers','Number of projects opened','Producer area','genre','Support amount list','The highest amount of support','The minimum amount of support','Total number of support steps','List of supporters','Maximum amount of sponsors','Minimum amount of sponsors']

# zip() turns the per-feature lists into one row per project.
# Building the rows this way avoids loop variables that overwrite the feature lists.
result_fail = [list(row) for row in zip(money_list,pledged_money_list,creator_list,backer_list,created_list,city_list,category_list,support_bankroll_list,support_max_money_list,support_min_money_list,level_num_list,support_backers_list,support_max_backers_list,support_min_backers_list)]

df_fail = pd.DataFrame(result_fail, columns=columns)
df_fail.to_csv('fail_link_feature.csv', encoding='utf-8-sig')
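
Since the two tables share the same columns, they can be stacked into one table before analysis. The sketch below assumes a binary label column is wanted for the later decision-tree step; the column name `success` and the helper `combine_with_label` are hypothetical, and the demo frames hold placeholder values only.

```python
import pandas as pd


def combine_with_label(df_success, df_fail):
    """Stack the two tables and add a 1/0 label column named `success`."""
    labeled_success = df_success.assign(success=1)
    labeled_fail = df_fail.assign(success=0)
    # ignore_index=True renumbers the rows 0..n-1 in the combined table.
    return pd.concat([labeled_success, labeled_fail], ignore_index=True)


# Placeholder frames for illustration only (not real Kickstarter data).
df_success_demo = pd.DataFrame({"backers": [120, 45], "Total number of support steps": [5, 3]})
df_fail_demo = pd.DataFrame({"backers": [2], "Total number of support steps": [4]})
combined = combine_with_label(df_success_demo, df_fail_demo)
print(combined)
```

With the real data, `df` and `df_fail` from the sections above would take the place of the demo frames.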

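In this post, I used the browser developer tools and Python to extract every feature from the collected links.

If you are reading this, you probably came here because you needed this information, right?

If you try it yourself, you might even find it a little fun!

Next time, I will analyze the collected data using a decision tree.

Thank you for reading this long post.

I hope you all have a happy day!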

Next post: [Big Data] Identifying the Features That Most Influence Kickstarter Projects (3. Data analysis using a decision tree)

https://beaver-sohyun.tistory.com/54?category=864438
