Parsing the page list from a sitemap

Taedi

2021. 2. 13. 02:00

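Since starting this blog, nearly all of my attention has shifted to it. While registering the site with Google and Naver, I learned about sitemap.xml and found that it lists the pages on the site along with their last-modified dates. That made me wonder: could I parse it to get the page list and then visit each page to check for missing alt attributes? And could I compare the last-modified dates to run that check periodically? So I set out to parse sitemap.xml.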

The failed attempt

My vague idea was that, since the file has an .xml extension, 'ElementTree' should do the job. I wrote the code below, but it did not work properly. This is the version I think came closest.

Code

import requests
from xml.etree import ElementTree as ET

url = "https://tae-di.tistory.com/sitemap.xml"  # path to the sitemap

res = requests.get(url)  # fetch the sitemap with requests

root = ET.fromstring(res.text)

for i in root:
    print(i)

Result

<Element '{http://www.sitemaps.org/schemas/sitemap/0.9}url' at 0x0000022A672EAD60>
<Element '{http://www.sitemaps.org/schemas/sitemap/0.9}url' at 0x0000022A699464A0>
<Element '{http://www.sitemaps.org/schemas/sitemap/0.9}url' at 0x0000022A699469F0>
<Element '{http://www.sitemaps.org/schemas/sitemap/0.9}url' at 0x0000022A69946A90>
<Element '{http://www.sitemaps.org/schemas/sitemap/0.9}url' at 0x0000022A69946B30>
...
<Element '{http://www.sitemaps.org/schemas/sitemap/0.9}url' at 0x0000022A6998AEF0>

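Values clearly come out, but I failed to get the URLs I wanted. I looked through the official documentation, but without the background knowledge I could not tell which method to use, iter or findall...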
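For context, the loop prints Element objects because each child of the root is a <url> element rather than a string; the URL itself sits one level deeper, in the text of the <loc> child. Here is a minimal sketch of that structure. SAMPLE_XML is a made-up inline sitemap (not the real one) so the snippet runs without network access.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample standing in for the real sitemap (assumption, not real data).
SAMPLE_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2021-02-13T02:00:00+09:00</lastmod>
  </url>
  <url>
    <loc>https://example.com/1</loc>
    <lastmod>2021-02-12T10:00:00+09:00</lastmod>
  </url>
</urlset>"""

root = ET.fromstring(SAMPLE_XML)

pairs = []
for url_el in root:          # each child of <urlset> is a <url> element
    for child in url_el:     # its children are <loc>, <lastmod>, ...
        # child.tag carries the namespace in braces; child.text is the value
        pairs.append((child.tag, child.text))
        print(child.tag, child.text)
```

The tag names come back prefixed with the namespace in braces, which is why a plain search for 'loc' would find nothing.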

In the end, I found relevant answers on Stack Overflow and solved it.

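Method 1 - Using regular expressions (re)

The first approach uses regular expressions. I had wanted to avoid it when trying things on my own, but the nice part is that, as long as the pattern is well built, you can get the result you want even without knowing how to use an XML package such as ElementTree properly.

Code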

import re, requests

url = "https://tae-di.tistory.com/sitemap.xml"

res = requests.get(url)

# Raw strings keep the backslash in \s intact; the letter range is a-zA-Z (not A-z)
pattern = r'(?<=<loc>)[a-zA-Z]+://[^\s]*(?=</loc>)'
# pattern = r'<loc>(.*?)</loc>'
url_list = re.findall(pattern, res.text)

print(url_list)

Result

['https://tae-di.tistory.com', 'https://tae-di.tistory.com/category', 'https://tae-di.tistory.com/category/Study', 'https://tae-di.tistory.com/category/Etc', 'https://tae-di.tistory.com/category/Study/Mac', 'https://tae-di.tistory.com/category/Study/AutoHotkey', 'https://tae-di.tistory.com/category/Study/Git', 'https://tae-di.tistory.com/category/Study/Python', 'https://tae-di.tistory.com/category/Study/Excel', 'https://tae-di.tistory.com/category/Study/Blog', 'https://tae-di.tistory.com/category/Study/Win', 'https://tae-di.tistory.com/category/Tip', 'https://tae-di.tistory.com/category/Tip/Review', 'https://tae-di.tistory.com/category/Tip/Site', 'https://tae-di.tistory.com/category/Tip/Application', 'https://tae-di.tistory.com/tag', 'https://tae-di.tistory.com/guestbook', 'https://tae-di.tistory.com/19', 'https://tae-di.tistory.com/18', 'https://tae-di.tistory.com/20', 'https://tae-di.tistory.com/16', 'https://tae-di.tistory.com/15', 'https://tae-di.tistory.com/14', 'https://tae-di.tistory.com/13', 'https://tae-di.tistory.com/12', 'https://tae-di.tistory.com/11', 'https://tae-di.tistory.com/10', 'https://tae-di.tistory.com/9', 'https://tae-di.tistory.com/8', 'https://tae-di.tistory.com/7', 'https://tae-di.tistory.com/6', 'https://tae-di.tistory.com/5', 'https://tae-di.tistory.com/4', 'https://tae-di.tistory.com/3', 'https://tae-di.tistory.com/2']

Reference: https://stackoverflow.com/questions/55174600/whats-the-most-efficient-way-to-parse-this-xml-sitemap-with-python/

There are two patterns, one in the code and one in the comment; on a well-formed sitemap both behave the same. To practice regular expressions, I worked through both of them.
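As a rough check of that claim, the sketch below runs both patterns on a made-up snippet (WELL_FORMED and PADDED are hypothetical strings, not taken from the real sitemap). It also shows one case where they diverge: when the value inside <loc> is padded with spaces, the lookaround pattern finds nothing, while the group pattern returns the value with its padding.

```python
import re

# Hypothetical inputs for illustration only.
WELL_FORMED = (
    "<url><loc>https://example.com/</loc></url>\n"
    "<url><loc>https://example.com/1</loc></url>"
)
PADDED = "<url><loc> https://example.com/2 </loc></url>"

lookaround = r'(?<=<loc>)[a-zA-Z]+://[^\s]*(?=</loc>)'
group = r'<loc>(.*?)</loc>'

same_a = re.findall(lookaround, WELL_FORMED)
same_b = re.findall(group, WELL_FORMED)
print(same_a == same_b)

# The lookaround pattern requires the scheme to start right after <loc>.
padded_a = re.findall(lookaround, PADDED)
# The group pattern accepts anything between the tags, spaces included.
padded_b = re.findall(group, PADDED)
print(padded_a, padded_b)
```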

Regex breakdown

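(?<=<loc>)[a-zA-Z]+://[^\s]*(?=</loc>)

- (?<=<loc>) : positive lookbehind (?<=RegEx); the match must come immediately after <loc>

- [a-zA-Z]+ : set of characters []; any upper- or lowercase letter, one or more times (the metacharacter + means one or more repetitions)

- [^\s]* : set of characters []; any non-whitespace character, zero or more times (the metacharacter * means zero or more repetitions); equivalent to \S*

- (?=</loc>) : positive lookahead (?=RegEx); the match must come immediately before </loc>

Reading: returns the string of the form "letters://(characters)" found after <loc> and before </loc>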
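A small sketch of what "lookbehind" and "lookahead" mean in practice: they check the surrounding text but do not become part of the match. TEXT is a made-up one-line input, and \S is used here as the shorthand for [^\s].

```python
import re

TEXT = "<loc>https://example.com/a</loc>"  # hypothetical input

m = re.search(r'(?<=<loc>)[a-zA-Z]+://\S*(?=</loc>)', TEXT)

# The matched text excludes the tags: lookarounds are zero-width.
print(m.group(0))
# The span starts after "<loc>" and ends before "</loc>".
print(m.span())

# [^\s] and \S describe the same set of characters.
same = re.findall(r'[^\s]+', "a b\tc") == re.findall(r'\S+', "a b\tc")
print(same)
```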

<loc>(.*?)</loc>

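- .*? : the metacharacter . (dot) matches any character except a newline; *? is the non-greedy form, because * on its own tries to match as much as possible, and *? limits the match to as little as possible

- (RegEx) : group; when the pattern contains a group, re.findall returns only the grouped part of each match (<loc> and </loc> are not returned); for a non-capturing group, use (?:RegEx)

Reading: from a string of the form "<loc>(characters)</loc>", returns only the characters part (without <loc> and </loc>)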
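The difference between * and *?, and between a capturing and a non-capturing group, is easiest to see when two <loc> elements sit on the same line. The following is a sketch on a made-up string (LINE is hypothetical), not on the real sitemap.

```python
import re

# Hypothetical input with two <loc> elements on one line.
LINE = "<loc>https://example.com/a</loc><loc>https://example.com/b</loc>"

# Greedy: .* runs to the last </loc>, swallowing the tags in between.
greedy = re.findall(r'<loc>(.*)</loc>', LINE)

# Non-greedy: .*? stops at the first </loc> it can.
lazy = re.findall(r'<loc>(.*?)</loc>', LINE)

# Non-capturing group: findall falls back to returning the whole match.
whole = re.findall(r'<loc>(?:.*?)</loc>', LINE)

print(greedy)
print(lazy)
print(whole)
```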

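My knowledge is limited, so this explanation ended up hard to follow. I relied on the official Python documentation, and I would recommend reading the English version where possible; the Korean translation read unnaturally and was hard to understand.

Python re official documentation: https://docs.python.org/3/library/re.html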

Method 2 - Using ElementTree

Later I also found a way to do this with ElementTree. It produces the correct result, but I still do not fully understand how it works.

Code

import requests
from xml.etree import ElementTree as ET

url = "https://tae-di.tistory.com/sitemap.xml"

res = requests.get(url)

root = ET.fromstring(res.text)

loc_element = root.iter('{http://www.sitemaps.org/schemas/sitemap/0.9}loc')

for loc in loc_element:
    print(loc.text)

Result

https://tae-di.tistory.com
https://tae-di.tistory.com/category
https://tae-di.tistory.com/category/Study
...
https://tae-di.tistory.com/2

Reference: https://stackoverflow.com/questions/52990128/how-can-i-get-some-value-from-xml-int-python
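As far as I can tell, the reason the tag has to be written as '{http://www.sitemaps.org/schemas/sitemap/0.9}loc' is that the sitemap root declares a default namespace (xmlns=...), and ElementTree stores every tag name together with that namespace in braces. The sketch below is one possible way to make this more readable: it maps the namespace to a short prefix and reads <lastmod> as well. SAMPLE_XML, the prefix "sm", and the names NS and pages are all made up for illustration.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample standing in for the real sitemap (assumption, not real data).
SAMPLE_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2021-02-13T02:00:00+09:00</lastmod>
  </url>
  <url>
    <loc>https://example.com/1</loc>
  </url>
</urlset>"""

# The prefix "sm" is arbitrary; only the URI has to match the document.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

root = ET.fromstring(SAMPLE_XML)

pages = []
for url_el in root.findall("sm:url", NS):
    loc = url_el.findtext("sm:loc", namespaces=NS)
    # findtext returns None when the element is absent (here: no <lastmod>)
    lastmod = url_el.findtext("sm:lastmod", namespaces=NS)
    pages.append((loc, lastmod))

# Same URLs as the iter() call with the full '{namespace}loc' tag.
locs = [el.text for el in root.iter("{%s}loc" % NS["sm"])]

print(pages)
print(locs)
```

Keeping loc and lastmod together in one tuple would make it straightforward to compare modification dates later, which was the original motivation.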

License: Attribution-NonCommercial-NoDerivatives

Taedi's Log