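1. Terms to go over first

Repository: Called a "repo" for short. It is where a project (the program files you have worked on) is stored.

pyqt-font-dialog, pyqt-dark.., and the others in the picture above are repositories.

Branch: A branch is like a workspace for the project. A single repository can have multiple branches. There must always be at least one branch, because that is where the work gets done.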

How to save a Wikipedia document to a local text file with Python

Jung Gyu Yoon Jun 30, 2022

Python is a programming language used in many fields such as web crawling.

Getting the content of a Wikipedia document is also simple in Python.

All you need is the bs4, urllib3 and certifi packages.

bs4 is a dummy package that installs beautifulsoup4, a library that makes it easy to scrape information from web pages. It is often used for web crawling.

urllib3 sends the HTTP requests. To scrape information from a web page you first have to request that page, and urllib3 handles this.

certifi provides a bundle of trusted root certificates (a CA bundle), which is used to validate the server's SSL certificate and verify the identity of TLS hosts. Simply put, it lets your code check that the site it connects to can be trusted.

If any of the three packages above is not installed, install it first. Here's how:

python -m pip install certifi

python -m pip install urllib3

python -m pip install bs4
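
If you are not sure which of the three packages are already installed, a quick check can save some time. The following is a minimal sketch that uses only the standard library; the helper name `missing_packages` is made up for this example and is not part of any of the packages above.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be imported in this environment."""
    # find_spec returns None when a top-level module cannot be found
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == '__main__':
    # These are the import names used by the code in this post
    for name in missing_packages(['certifi', 'urllib3', 'bs4']):
        print(f'python -m pip install {name}')
```

Running it prints an install command for each package that is still missing, and nothing if all three are available.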

Write the following code.

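import os
import re

import certifi
import urllib3
from bs4 import BeautifulSoup


def convertWikiToText(url):
    # Only accept URLs that contain a Wikipedia domain
    if not re.search(r'\.wikipedia\.org', url):
        raise ValueError('This is not a Wikipedia URL')

    # Verify the server certificate against certifi's CA bundle
    http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())
    # Send the request; the User-Agent value is a placeholder, replace it with your own
    resp = http.request('GET', url, headers={'User-Agent': 'wiki-to-text-example/0.1'})
    # 200 means success; anything else (404, 403, ...) is treated as an error
    if resp.status != 200:
        raise RuntimeError(f'Request failed with HTTP status {resp.status}')

    # Decode the raw bytes of the response body into a string
    resp_data = resp.data.decode('utf-8')
    # html.parser ships with Python; features='lxml' would also need "pip install lxml"
    soup = BeautifulSoup(resp_data, features='html.parser')
    # The <title> looks like "Chocolatey - Wikipedia"; drop the trailing site name
    title = soup.find('title').text.rsplit(' - ', 1)[0]
    # The tag with id="bodyContent" holds the body of the document
    content_tag = soup.find(id='bodyContent')
    content = content_tag.text
    # Replace characters that are not allowed in file names on Windows
    filename = re.sub(r'[\\/:*?"<>|]', '_', title) + '.txt'
    # Write the document text to the file
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(content)
    # Open the file right away (os.startfile exists only on Windows)
    if hasattr(os, 'startfile'):
        os.startfile(filename)


convertWikiToText('https://en.m.wikipedia.org/wiki/Chocolatey')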

The script saves the Wikipedia document as a text file, named after the document's title, in the current working directory.
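
One thing to be aware of: the check in `convertWikiToText` looks for `.wikipedia.org` anywhere in the URL, so a URL that merely mentions that string in its path or query would also pass. If a stricter check is wanted, one option is to compare the host name instead. This is a minimal sketch with the standard library; `is_wikipedia_url` is a hypothetical helper for illustration, not something from the original code.

```python
from urllib.parse import urlparse


def is_wikipedia_url(url):
    """Return True only when the URL's host is wikipedia.org or one of its subdomains."""
    # hostname is None for strings that are not absolute URLs
    host = urlparse(url).hostname or ''
    return host == 'wikipedia.org' or host.endswith('.wikipedia.org')


if __name__ == '__main__':
    print(is_wikipedia_url('https://en.m.wikipedia.org/wiki/Chocolatey'))
    print(is_wikipedia_url('https://example.com/?next=.wikipedia.org'))
```

The helper could replace the `re.search` call at the top of the function if you decide the stricter behavior is what you want.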

If you want to see more about this, check my GitHub repo “wiki-offline”.
