スクレイピング - k2-ornata（個人的「やってみたこと」ブログ）

スクレイピング：指定した文字と文字の間のhtmlデータを全て取り出す

タイトルの通り、何かのhtml タグではなく、ブラウザに表示される文字列のスタート部分とエンド部分を指定して html データを抜き出してみました。

１．Pythonプログラム

# -*- coding: utf-8 -*-
from datetime import datetime, timedelta
import requests
import re
from bs4 import BeautifulSoup

# Set the day before yesterday's date
beforeyesterday = datetime.now() - timedelta(days=17)
beforeyesterday_str = beforeyesterday.strftime("%Y/%m/%d")

# URL of the top page
url = "https://www.xxx.jp/"・・・スクレイピングサイト"/"付き
domain ="https://www.xxx.jp"・・・スクレイピングサイト"/"無し

# start_string と end_string の間のストリングを取得する
start_string = "○○の一例:"
end_string = "○○○した場合は、○○○までご連絡ください。"

# Get the HTML content of the top page
response = requests.get(url)
html = response.content.decode("utf-8")

# Use BeautifulSoup to parse the HTML content
soup = BeautifulSoup(html, "html.parser")

# Find the parent element with the "○○○情報" title
security_info = soup.find('h1', string="○○○情報").parent
#print(security_info)

# Find all <h3> tags within the parent element
h3_tags = security_info.find_all('h3')

filtered_links = [tag.a['href'] for tag in h3_tags if tag.a is not None and beforeyesterday_str in tag.a.text]

for link in filtered_links:
    if link.startswith('http'):
        print(link)
    else:
        link = domain + link
        print(link)

    # トップページから取得した指定日の記事を読み込む
    response = requests.get(link)
    html = response.content.decode("utf-8")

    # BeautifulSoupを使ってHTMLを解析
    # soup = BeautifulSoup(html, 'html.parser')
    # article = soup.find('article', class_="nc-content-list")
    # print(article)

    # texts_p = [c.get_text() for c in article.find_all('p')]
    # print(texts_p)

    # Split the target text on the start and end strings and take the middle part
    target_text = html.split(start_string)[1].split(end_string)[0].split("</p>")[0]

    print(target_text)

    # 改行コードを空文字列に置換して一つのテキストにする
    target_text = target_text.replace('\n', '')

    # <br />タグを区切り文字として順番に配列に入れる
    result_array = [text for text in target_text.split('<br />') if text]

    # 結果の出力
    print(result_array)

２．プログラムのポイント

ChatGPTと相談しながらつくったのではっきりわかってないところはありますが＾＾、大体以下の意味合いかと思います。

タグとストリングで検索し親タグを含めたデータを取得

# この方法<p>タグなどには使えないかもしれません。

27行目：security_info = soup.find(‘h1’, string=”○○○”).parent
これは”○○○”というストリングが含まれる<h1>タグを見つけて、その１つ上位層のタグまでのデータをsecurity_infoに渡してくれているようです。

31行目：h3_tags = security_info.find_all(‘h3’)
21行目のsecurity_infoから<h3>タグの部分だけ抜き出してh3_tagsに入れてくれているようです。

33行目：filtered_links = [tag.a[‘href’] for tag in h3_tags if tag.a is not None and beforeyesterday_str in tag.a.text]
ここやたら長いですが、結果的には
h3_tagsからbeforeyesterday_strが含まれるhrefのリンクデータのみ抜き出し、filtered_linksに入れてくれているようです。

スタートとエンドの文字列を指定して抜き出し、お尻の不要なデータを削除

57行目：target_text = html.split(start_string)[1].split(end_string)[0].strip().split(“</p>”)[0]
ここも長いですが、
htmlに読み込まれているWebページのデータについて、start_string から end_stringまでのデータを抜き出してくれているようです。
また、split(“</p>”)[0]をつけることで、お尻にくっついていた</p>タグ以降を削除しています。

３．実行例

以下、実際にプログラムを実行した際の出力結果です。

https://web.xxx.jp/faqs/xxxx
○○○のサイズがクォータ制限に達しました<br />○○○のストレージ容量が少ない<br />～ 受信メールエラー<br />○○○のストレージがいっぱいです、○○○が一時停止されています
['○○○のサイズがクォータ制限に達しました', '○○○のストレージ容量が少ない', '～ 受信メールエラー', '○○○のストレージがいっぱいです、○○○が一時停止されています']

スクレイピング：Webページの指定したH5タグの配下にあるデータを抽出する

WebページのHTMLの指定した<h5>タグの配下の、<div>タグの中のデータを取得するプログラムを作成しましたので、記録として残しておきます。

<h5 class="alert_h5">○○○の件名</h5>
<div class="dit_a">
<p>【重要】（○○○）<br>
【重要・緊急】○○○のお知らせ<br>
【○○○】○○○等の確認について<br>
<br>
※○○○<br></font></p>
</div>

１．作成したPythonプログラム

以下作成したサンプルプログラムです。

# -*- coding: utf-8 -*-
from datetime import datetime, timedelta
import requests
import re
from bs4 import BeautifulSoup

# Set the day before yesterday's date
beforeyesterday = datetime.now() - timedelta(days=15)
beforeyesterday_str = beforeyesterday.strftime("%Y%m%d")
mail_line_pattern = "From: \"[a-zA-Z0-9_.+-]+.+[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+\""
mail_pattern = "^[0-9a-zA-Z_.+-]+@[0-9a-zA-Z-]+\.[0-9a-zA-Z-.]+$"
env_mail_pattern = "<+[0-9a-zA-Z_.+-]+@[0-9a-zA-Z-]+\.[0-9a-zA-Z-.]+>"
subject_line_pattern = "Subject:"

# Initialize an empty list to store the articles
articles_beforeyesterday = []
text_beforeyesterday = []
link_beforeyesterday = []
mail_list = []
email_list = []
env_email_list = []
subject_list = []
title_list = []

# URL of the top page
url = "https://www.xxx.jp/news/"
domain ="http://www.xxx.jp"

# Get the HTML content of the top page
response = requests.get(url)
html = response.content.decode("utf-8")

# Use BeautifulSoup to parse the HTML content
soup = BeautifulSoup(html, "html.parser")

# Find all <a href> elements in the HTML
for a in soup.find_all("a",href=re.compile(beforeyesterday_str)):
    if a in articles_beforeyesterday:
        print("duplicated")
    else:
        text=a.getText()
        link=a.get("href")
        articles_beforeyesterday.append(a)
        text_beforeyesterday.append(text)
        link_beforeyesterday.append(link)

print(link_beforeyesterday)

for link in link_beforeyesterday:
    # Get the HTML content of the top page

    if link.startswith('http'):
        print(link)
    else:
        link = domain + link
        print(link)

    response = requests.get(link)
    html = response.content.decode("utf-8")

    # Use BeautifulSoup to parse the HTML content
    soup = BeautifulSoup(html, "html.parser")

    h5_tags = soup.find_all('h5', class_='alert_h5')

    for tag in h5_tags:
        if tag.get_text() == '○○○の件名':
            print(tag.find_next('p').get_text())

２．プログラムの解説

今回のポイントは２つあります。

データが指定した文字で始まっているか確認

１つ目は53 から57行目になります。

「if link.startswith(‘http’):」でlinkに入っているデータが’http’で始まっている場合、printしているだけですが、始まっていない場合、”http://www.xxx.jp”というデータを頭にくっつけています。
　これは今回対象となったサイトのリンクがいきなりディレクトリで始まっていた為です。

指定したデータが含まれるタグ配下のデータを取得

２つ目のポイントは最後の４行になります。

　「○○○の件名」が含まれる “h5″タグを見つけて、その配下の “p”タグで囲まれた文字列をピックアップしています。

３．実行結果

このプログラムの実行結果（出力）例は以下の通りです。

duplicated
['/news/xxxxxx_20230728.html', 'http://www.xxx.jp/news/xxxxx_20230728.html']
http://www.xxx.jp/news/xxxxxx_20230728.html
【重要】（○○○）
【重要・緊急】○○○のお知らせ
【○○○】○○○等の確認について

※○○○
http://www.xxx.jp/news/xxxxxx_20230728.html
【重要】（○○○）
【重要・緊急】○○○のお知らせ
【○○○】○○○等の確認について

※○○○

スクレイピング：Webページからhrefの中に指定した文字が含まれる記事を抜き出し、その記事から必要なデータを抽出する

指定したサイトから一昨日の日付を<a href>のリンクに持つ記事をピックアップし、その記事の中からメールアドレスとメール件名を抽出する pythonをChatGPT code interpreterの手も借りながら作成しましたので記録として残しておきます。

１．作成したpythonコード

最終的に作成したpythonコードは以下のとおりです。

# -*- coding: utf-8 -*-
from datetime import datetime, timedelta
import requests
import re
from bs4 import BeautifulSoup

# Set the day before yesterday's date
beforeyesterday = datetime.now() - timedelta(days=2)
beforeyesterday_str = beforeyesterday.strftime("%Y%m%d")
# mail_line_pattern = "From: \"[a-zA-Z0-9_.+-]+.+[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+\""
mail_line_pattern = "From:"
mail_pattern = "^[0-9a-zA-Z_.+-]+@[0-9a-zA-Z-]+\.[0-9a-zA-Z-.]+$"
env_mail_pattern = "<+[0-9a-zA-Z_.+-]+@[0-9a-zA-Z-]+\.[0-9a-zA-Z-.]+>"
subject_line_pattern = "Subject:"

# Initialize an empty list to store the articles
articles_beforeyesterday = []
text_beforeyesterday = []
link_beforeyesterday = []
mail_list = []
email_list = []
env_email_list = []
subject_list = []
title_list = []

# URL of the top page
url = "https://www.xxx.jp/news/"・・・スクレイピングするサイトのURL

# Get the HTML content of the top page
response = requests.get(url)
html = response.content.decode("utf-8")

# Use BeautifulSoup to parse the HTML content
soup = BeautifulSoup(html, "html.parser")

# Find all <a href> elements in the HTML
for a in soup.find_all("a",href=re.compile(beforeyesterday_str)):
    if a in articles_beforeyesterday:
        print("duplicated")
    else:
        text=a.getText()
        link=a.get("href")
        articles_beforeyesterday.append(a)
        text_beforeyesterday.append(text)
        link_beforeyesterday.append(link)

print(articles_beforeyesterday)

for link in link_beforeyesterday:
    # Get the HTML content of the top page
    response = requests.get(link)
    html = response.content.decode("utf-8")

    # Use BeautifulSoup to parse the HTML content
    soup = BeautifulSoup(html, "html.parser")
    print(link)
    mail_list=soup.find_all(string=re.compile(mail_line_pattern))
    
    for mail in mail_list:
        # email = re.findall(mail_line_pattern, mail.replace('\n', ''))
        # ヘッダーメールアドレスが日本語の場合もあるので、以下に修正
        # lstrip()・・・左端の文字（\n）を削除
　　　　# strip(mail_line_pattern)・・・"From:"を削除
        # split('<')[0]・・・"<"以降を削除
        # replace('"', '')・・・ダブルクォーテーションを削除
        email = mail.lstrip().strip(mail_line_pattern).split('<')[0].replace('"', '')
        env_email = re.findall(env_mail_pattern, mail.replace('\n', ''))
        email_list.append(email)
        env_email_list.append(env_email)
    print(email_list)
    print(env_email_list)

    subject_list=soup.find_all(string=re.compile(subject_line_pattern))
    for title in subject_list:
        email_title_line = title.replace('\n', '')
        email_title = email_title_line.replace('Subject: ', '')
        title_list.append(email_title)
    print(title_list)

    # 配列の初期化
    mail_list = []
    email_list = []
    env_email_list = []
    subject_list = []
    title_list = []

２．pythonコード解説

指定した文字列が含まれるhref のリンクをピックアップ

38行目：for a in soup.find_all(“a”,href=re.compile(beforeyesterday_str)):
ここで、reライブラリを利用し一昨日の日付が href のリンクの中に含まれているものを抜き出しています。

指定した文字列で始まる行をピックアップ

59行目：mail_list=soup.find_all(string=re.compile(mail_line_pattern))
では、タグを指定せず、soup に取り込まれた HTMLページからmail_line_patternの正規表現に合致するものを抜き出しています。

文字列からいろいろ削除する

67行目：email = mail.lstrip().strip(mail_line_pattern).split(‘<‘)[0].replace(‘”‘, ”)
コメントも書いていますが、元々の文字列からいろいろな方法でデータを削除し、必要なものだけにしています。

３．実行結果

なお、この時の実行結果は以下の通りです。

% python3 scraping.py
python3 info-tech-center.py
duplicated
[<a href="https://www.xxx.jp/news/20230621xxxxxx.html">○○メールに関する注意喚起a</a>]
https://www.xxx.jp/news/20230621xxxxxx.html
[['From: "xx.xxx.ac.jp"'], ['From: "xx.xxx.ac.jp"'], ['From: "Postmaster@xx.xxx.ac.jp"']]
[['<xxxx-xxxxxxxx@xxx.xxxxxxx.ne.jp>'], ['<xxx@xxxxxxxxxxx.co.jp>'], ['<xxxx@xx.uec.ac.jp>']]
['xxxx@xx.uec.ac.jp 電子メール通知', 'メール認証 xxxx@xx.uec.ac.jp', 'メールボックスのストレージがいっぱいです。アカウントは停止されています。確認が要求されました']

＜参考サイト＞

・PythonとBeautiful Soupでスクレイピング（Qiita）（https://qiita.com/itkr/items/513318a9b5b92bd56185）
・Python配列のループ処理（Qiita）（https://qiita.com/motoki1990/items/d06fc7559546a8471392）
・図解！Python BeautifulSoupの使い方を徹底解説！(select、find、find_all、インストール、スクレイピングなど)（https://ai-inter1.com/beautifulsoup_1/#st-toc-h-19）
・Pythonで文字列を抽出（位置・文字数、正規表現）（note.nkmk.me）（https://note.nkmk.me/python-str-extract/#_12）
・サルにも分かる正規表現入門（https://userweb.mnet.ne.jp/nakama/）
・Pythonで特定の文字以降を削除する（インストラクターのネタ帳）（https://www.relief.jp/docs/python-get-string-before-specific-character.html）
・【Python】文字列を「〜で始まる/終わる/〜が含まれる」で抽出する方法（てっくらぶ）（https://www.tecrab.com/articles/python-starts-ends-with-in.html）
・[Python コピペ] 文字列内のダブルクォーテーション（”）、シングルクォーテーション（’）の削除方法（資産運用を考えるブログ）（https://kabu-publisher.net/index.php/2021/10/31/16/python_double_single1/）