230215_DB Study Notes

보랏 2023. 2. 15. 23:01

Hello. I learned a lot again today, so I am writing it up right away.

Let's go straight into reviewing what was covered.

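1. Tuples

A tuple is similar to a list, but unlike a list or a dictionary its values cannot be changed once it has been created. Tuples have come up less often than lists in what we have covered so far, but enumerate, which we learned yesterday, yields each item as an (index, value) tuple.
Writing two loop variables such as idx, x does not stop enumerate from producing tuples. It unpacks each tuple, so the index and the value can be used separately without indexing into the tuple.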

for x in enumerate(["a", "b", "c"], start=1):
    print(x)  # (1, 'a') (2, 'b') (3, 'c'), one tuple per line

# Unpacking with idx
for idx, x in enumerate(["a", "b", "c"], start=1):
    print(idx, x)  # 1 a, 2 b, 3 c, one pair per line
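
Both points can be checked directly. The snippet below is a minimal sketch (the variable names are only for illustration): assigning to a tuple element raises TypeError, and enumerate yields tuples either way, with idx, x simply unpacking them.

```python
# Tuples are immutable: item assignment raises TypeError.
point = (1, "a")
try:
    point[0] = 99
except TypeError as err:
    error_message = str(err)

# enumerate always yields (index, value) tuples.
pairs = list(enumerate(["a", "b", "c"], start=1))

# Writing `idx, x` unpacks each tuple into two names.
unpacked = [f"{idx}:{x}" for idx, x in enumerate(["a", "b", "c"], start=1)]

print(pairs)     # [(1, 'a'), (2, 'b'), (3, 'c')]
print(unpacked)  # ['1:a', '2:b', '3:c']
```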

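2. Server basics

https : HTTP over TLS; the s means the traffic between the browser and the server is encrypted
http : the same protocol without encryption
remote address : the server's IP address. An IPv4 address is 32 bits, written as four numbers from 0 to 255 separated by dots, ex) 192.168.0.1. The port is not part of the address; it follows a colon, ex) 192.168.0.1:80
well-known port : port numbers 0 to 1023, each conventionally assigned to a standard service, ex) 80 for HTTP, 443 for HTTPS (see: https://ko.wikipedia.org/wiki/TCP/UDP%EC%9D%98_%ED%8F%AC%ED%8A%B8_%EB%AA%A9%EB%A1%9D)
DNS(Domain Name System) : the system that translates a domain name into an IP address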
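
The address and port rules can be tried out with the standard library alone. This is a small sketch, not part of the lesson: the address 192.168.0.1 and the explicit :443 in the URL are illustrative assumptions.

```python
import ipaddress
from urllib.parse import urlsplit

# An IPv4 address is a 32-bit number written as four 0-255 values.
addr = ipaddress.IPv4Address("192.168.0.1")
as_int = int(addr)               # the same address as a single integer
bit_length = addr.max_prefixlen  # 32 for IPv4

# A value above 255 in any position is rejected.
try:
    ipaddress.IPv4Address("192.168.0.256")
    valid = True
except ipaddress.AddressValueError:
    valid = False

# The port is separate from the address and follows a colon in a URL.
parts = urlsplit("https://www.starbucks.co.kr:443/store/getStore.do")

print(as_int, bit_length, valid)
print(parts.scheme, parts.hostname, parts.port)
```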

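3. Crawling with POST

Today we continued studying how to crawl the Starbucks website. On the Starbucks site, press F12 and open Network → Payload → view source to see the data sent with the request.
To crawl every store in the country, we used two request bodies: the store-information payload (payload) and the sido-list payload ({'rndCod': 'VH3P0RO6DH'}).
The sido (province/city) codes run from 01 to 17. The idea is to loop over the list in r2.json(), put each sido code (data['sido_cd']) into payload['p_sido_cd'], request the stores for that sido, and add the returned store information (r.json()['list']) to the total list.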

# Payload data from the Starbucks website
payload = {"in_biz_cds":"0",
"in_scodes":"0",
"ins_lat":"37.4866131",
"ins_lng":"127.0205989",
"search_text":"",
"p_sido_cd":"01",
"p_gugun_cd":"",
"in_distance":"0",
"in_biz_cd":"",
"isError":"true",
"searchType":"C",
"set_date":"",
"all_store":"0",
"T03":"0",
"T01":"0",
"T27":"0",
"T12":"0",
"T09":"0",
"T30":"0",
"T05":"0",
"T22":"0",
"T21":"0",
"T10":"0",
"T36":"0",
"T43":"0",
"T48":"0",
"P10":"0",
"P50":"0",
"P20":"0",
"P60":"0",
"P30":"0",
"P70":"0",
"P40":"0",
"P80":"0",
"whcroad_yn":"0",
"P90":"0",
"new_bool":"0",
"iend":"1000",
"rndCod":"N8OLKY2KL8",}

import requests

# Starbucks store list (for the p_sido_cd currently set in payload)
url = "https://www.starbucks.co.kr/store/getStore.do?r=PO47G07Y8Y"
r = requests.post(url, data=payload)
r.json()['list']

# Starbucks sido (province/city) code list
url_sido = "https://www.starbucks.co.kr/store/getSidoList.do"
r2 = requests.post(url_sido, data={'rndCod': 'VH3P0RO6DH'})
r2.json()['list']

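# Collect the stores of every sido into total
total = []
for data in r2.json()['list']:
    payload['p_sido_cd'] = data['sido_cd']
    r = requests.post(url, data=payload)  # request again with the new sido code
    total += r.json()['list']
    print(data['sido_cd'], data['sido_nm'])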
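
The same loop can be wrapped in a function so it can be tried without hitting the site. This is only a sketch under assumptions: collect_stores, FakeResponse, and fake_post are hypothetical names, and the fake responses contain made-up data that merely mimics the {'list': [...]} shape used above.

```python
def collect_stores(post, url, url_sido, payload, sido_payload):
    """Request the sido list, then the store list for each sido code."""
    sido_list = post(url_sido, data=sido_payload).json()["list"]
    total = []
    for sido in sido_list:
        # Copy the payload so the caller's dict is not modified.
        params = dict(payload, p_sido_cd=sido["sido_cd"])
        total += post(url, data=params).json()["list"]
    return total


# Offline stand-in for requests.post, with made-up data for illustration only.
class FakeResponse:
    def __init__(self, body):
        self._body = body

    def json(self):
        return self._body


def fake_post(url, data):
    if url.endswith("getSidoList.do"):
        return FakeResponse({"list": [{"sido_cd": "01"}, {"sido_cd": "02"}]})
    return FakeResponse({"list": [{"s_name": "store-" + data["p_sido_cd"]}]})


stores = collect_stores(
    fake_post,
    "https://www.starbucks.co.kr/store/getStore.do",
    "https://www.starbucks.co.kr/store/getSidoList.do",
    {"p_sido_cd": "", "iend": "1000"},
    {"rndCod": "VH3P0RO6DH"},
)
print([s["s_name"] for s in stores])  # ['store-01', 'store-02']
```

Passing requests.post as the first argument would run the same logic against the real site, provided the endpoints still respond in this format.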

From the resulting total list, x['s_name'] and x['open_dt'] give each store's name and opening date, which we save in a dictionary.
The final code sorts that dictionary by value in descending order.

# Check store names and opening dates
for x in total:
    print(x['s_name'], x['open_dt'])

# Save as a dictionary: {store name: opening date}
store = {x['s_name']: x['open_dt'] for x in total}

# Sort by value in descending order
sorted(store.items(), key=lambda x: x[1], reverse=True)
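
To see what the sort returns, here is a small sketch with made-up rows. The store names are hypothetical, and it assumes open_dt is a 'YYYYMMDD' string, in which case string order matches date order.

```python
# Made-up sample rows in the same shape as the entries in total.
sample_total = [
    {"s_name": "Store A", "open_dt": "20100315"},
    {"s_name": "Store B", "open_dt": "20221201"},
    {"s_name": "Store C", "open_dt": "19990727"},
]

store = {x["s_name"]: x["open_dt"] for x in sample_total}

# sorted() returns a new list of (name, date) tuples; store itself is unchanged.
newest_first = sorted(store.items(), key=lambda item: item[1], reverse=True)

print(newest_first)
# [('Store B', '20221201'), ('Store A', '20100315'), ('Store C', '19990727')]
```

One thing to keep in mind: because the dictionary is keyed by store name, two stores with the same name would collapse into one entry.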

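4. pickle and the with statement

The with statement handles opening and closing a file, like the open(), f.read(), f.close() sequence we learned last time. The difference is that with closes the file automatically when the block ends, so you only name the file object after as and write the code inside the block.
The pickle library saves Python objects with their type and structure preserved. Pickled data is binary, so it cannot be read as text, and the file must be opened in a binary mode ("wb" or "rb").
pickle.dump : save an object to a file
pickle.load : load an object from a file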

with open("file name", "mode") as f:
    ...  # code block, e.g. f.write(text) to write a text file, f.read() to read one
    
import pickle

# a is the object to save
with open("./workspace.pkl", "wb") as f:
    pickle.dump(a, f)

# Load the saved object back into star
with open("./workspace.pkl", "rb") as f:
    star = pickle.load(f)
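
A self-contained round trip makes the two claims concrete: the file is closed as soon as the with block ends, and the loaded object keeps its type and contents. The sample object and the temporary file path are assumptions made for this sketch.

```python
import os
import pickle
import tempfile

data = {"stores": ["Store A", "Store B"], "count": 2}  # made-up sample object

path = os.path.join(tempfile.mkdtemp(), "sample.pkl")

with open(path, "wb") as f:   # "wb": write binary
    pickle.dump(data, f)
# The file is already closed here; no f.close() is needed.
file_closed = f.closed

with open(path, "rb") as f:   # "rb": read binary
    restored = pickle.load(f)

print(file_closed, restored == data, type(restored))
```

Only load pickle files you created or trust, since unpickling can execute arbitrary code.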

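5. Finding Starbucks stores in Gangnam-gu

Using the nationwide store data (total) built above and the object restored with pickle in the previous section, this code goes through the store data in star, keeps the entries whose 'gugun_name' is set and equal to the argument, and appends their store names to a total list.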

def search_store(store_gu):
    total = []
    for data in star[0]:  # star[0] is iterated as the list of store dictionaries
        if data['gugun_name'] is not None and store_gu == data['gugun_name']:
            total.append(data['s_name'])
    return total

# Same function as a list comprehension
def search_store(store_gu):
    return [data['s_name'] for data in star[0]
            if data['gugun_name'] is not None and store_gu == data['gugun_name']]
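
A variation that takes the store list as a parameter is easier to try on its own. This is a sketch with hypothetical sample rows, not the data from the pickle file; it uses dict.get, which returns None for a missing key, so a separate None check is not needed when store_gu is a string.

```python
def search_store(stores, store_gu):
    """Return the names of stores whose gugun_name equals store_gu."""
    return [s["s_name"] for s in stores if s.get("gugun_name") == store_gu]


# Made-up rows in the same shape as the crawled store dictionaries.
sample = [
    {"s_name": "Store A", "gugun_name": "강남구"},
    {"s_name": "Store B", "gugun_name": "서초구"},
    {"s_name": "Store C", "gugun_name": None},
    {"s_name": "Store D", "gugun_name": "강남구"},
]

print(search_store(sample, "강남구"))  # ['Store A', 'Store D']
```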

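6. HTML structure and crawling with GET

This section covers crawling a page fetched with GET using BeautifulSoup. Store the URL in a variable, fetch the page, pass its HTML to BeautifulSoup, and then use find and find_all to pull out the parts of the structure you need.
find : returns the first tag that matches the criteria
find_all : returns every tag that matches the criteria, as a list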

<!DOCTYPE html>
<html>
    <head>
        Title
    </head>
    <body>
        Data collection practice
    </body>
</html>

html_doc =  """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>"""

from bs4 import BeautifulSoup

bs = BeautifulSoup(html_doc, "html.parser")  # name the parser explicitly
bs.find("p")  # <p class="title"><b>The Dormouse's story</b></p>
bs.find_all("a")  # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
                  #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
                  #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
bs.find("a", id="link3")  # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
bs.find("a", class_="sister")  # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
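
Once a tag has been found, its attributes and text are usually what you want to keep. The following is a minimal sketch on a shortened copy of the sample HTML; it assumes the beautifulsoup4 package is installed and uses the built-in "html.parser".

```python
from bs4 import BeautifulSoup

html = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns every match; each tag exposes its attributes and text.
links = [(a["href"], a.get_text()) for a in soup.find_all("a", class_="sister")]

# find returns None when nothing matches, so check before using the result.
missing = soup.find("a", id="link3")

print(links)
print(missing)
```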

That wraps up the review of what I learned today.

I am more tired than usual today, so I will get some rest early...

Thank you.