[Dev Story #032] Things I Want to Try - Auto-Voting Program: Writing a User Post Collector [postingcuration]

talkit (68) in #kr • 4 days ago

Hello, this is 가야태자 (@talkit).

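Related posts

[Dev Story #028] Things I Want to Try - Auto-Voting Program: SQLite vs DuckDB [postingcuration]

[Dev Story #029] Things I Want to Try - Auto-Voting Program: Creating the User and Post Tables [postingcuration]

[Dev Story #031] Things I Want to Try - Auto-Voting Program: Writing the User Registration Program [postingcuration]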

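Collector concept

The collector I have in mind gathers the 100 most recent posts from each user stored in the users table created in the earlier posts.

The users to collect from are already registered in that table, so let's go ahead and collect.

Collection is meant to run once every hour.

The hourly schedule will be handled later with a scheduler. In this post the program simply collects once, with no scheduling.

Each post is identified by its URL, and since these posts are collected only for voting, a post whose URL is already stored is skipped and its content is not updated.

If I collect for some other purpose later, I can change the code to update a post when its content has changed ^^

Finally, the program prints a summary of what was collected.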
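The "update when the content has changed" idea could be sketched roughly as below. This is only an illustration and not part of the collector in this post: `decide_action` is a hypothetical helper, and it assumes the timestamps arrive as ISO-8601 strings such as `2024-10-08T14:10:15`.

```python
from datetime import datetime
from typing import Optional


def decide_action(stored_modified_at: Optional[str], incoming_modified_at: str) -> str:
    """Return 'insert', 'update', or 'skip' for one fetched post.

    stored_modified_at is the modified_at value already saved for this URL,
    or None when the URL is not in the table yet (hypothetical lookup result).
    """
    if stored_modified_at is None:
        return "insert"  # URL not stored yet
    stored = datetime.fromisoformat(stored_modified_at)
    incoming = datetime.fromisoformat(incoming_modified_at)
    if incoming > stored:
        return "update"  # the post was edited after it was stored
    return "skip"  # unchanged, so leave the stored row alone
```

The collector in this post behaves as if every already-stored URL returned "skip".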

Collector code

import duckdb
from steem import Steem
import pandas as pd
from datetime import datetime
import traceback
import json  # needed to parse json_metadata

# Connect to Steemit (set the API node if needed)
steem = Steem(nodes=['https://api.steemit.com'])  # explicit node for a stable connection

# Connect to DuckDB (file-based database)
conn = duckdb.connect('steemit_auto_posting.db')

# Logging helper: prints the message with the current time
def log(message):
    print(f"{datetime.now()} - {message}")

# Error logging helper: prints the full stack trace
def log_error(e):
    error_message = ''.join(traceback.format_exception(None, e, e.__traceback__))
    print(f"{datetime.now()} - ERROR: {error_message}")

# Check whether a given URL already exists in the postings table
def url_exists(url):
    try:
        result = conn.execute("SELECT COUNT(1) FROM postings WHERE post_id = ?", (url,)).fetchone()[0]
        return result > 0
    except Exception as e:
        log_error(e)
        return False

# Collect the recent posts of one user
def fetch_recent_posts(username, limit=100):
    try:
        log(f"Starting to fetch posts for user '{username}' with limit {limit}")

        # Fetch the user's most recent blog entries
        posts = steem.get_discussions_by_blog({'tag': username, 'limit': limit})
        post_list = []

        #log(f"Fetched {len(posts)} posts for user '{username}'")

        for post in posts:
            try:
                # Keep only the user's own posts (skip resteemed posts)
                if post['author'] != username:
                    #log(f"Skipping resteemed post: {post['permlink']}")
                    continue

                # Extract the post's tag list
                json_metadata = post.get('json_metadata', {})

                # Parse json_metadata when it arrives as a string
                # (an empty string is treated as "no metadata")
                if isinstance(json_metadata, str):
                    json_metadata = json.loads(json_metadata) if json_metadata else {}

                tags_list = json_metadata.get('tags', [])
                # Join the tags into a comma-separated string
                tags_str = ','.join(tags_list)

                # Choose the main tag
                if 'postingcuration' in tags_list:
                    main_tag = 'postingcuration'
                else:
                    main_tag = tags_list[0] if tags_list else ''  # empty string when there are no tags

                # Build the post URL
                url = f"https://steemit.com/{post['category']}/@{post['author']}/{post['permlink']}"

                # Skip posts whose URL is already stored
                if url_exists(url):
                    #log(f"Post with URL '{url}' already exists. Skipping.")
                    continue

                post_details = {
                    'post_id': url,  # the URL is stored as the post_id
                    'user_id': post.get('author'),
                    'title': post.get('title'),
                    'body': post.get('body'),
                    'tags': tags_str,
                    'main_tag': main_tag,
                    'voting_status': False,
                    'posting_date': post.get('created'),
                    'created_at': post.get('created'),
                    'modified_at': post.get('last_update')
                }

                #log(f"Processed post: {post_details}")
                post_list.append(post_details)

            except json.JSONDecodeError:
                log(f"Failed to parse json_metadata for post '{post.get('id')}'")
            except Exception as e:
                log_error(e)

        return post_list
    except Exception as e:
        log_error(e)
        return []

# Load the list of users to collect from
def get_users():
    try:
        log("Fetching active users from database")
        # Read only the users marked as active
        df_users = conn.execute("SELECT user_id FROM users WHERE is_active = 'Y'").df()
        log(f"Found {len(df_users)} active users")
        return df_users['user_id'].tolist()
    except Exception as e:
        log_error(e)
        return []

def main():
    # Load all active users
    usernames = get_users()

    # Dictionary holding the collection summary
    summary = {}

    # Collect posts for each user
    for username in usernames:
        try:
            log(f"Fetching posts for user '{username}'")
            posts = fetch_recent_posts(username, limit=100)

            # Convert the collected posts to a DataFrame, if there are any
            if posts:
                df_posts = pd.DataFrame(posts)
                log(f"Collected {len(df_posts)} posts for user '{username}'")

                # Print the collected post data (other steps can be added here as needed)
                log(f"Post data for user '{username}':\n{df_posts}")

                # Insert the collected posts into the database
                conn.register('temp_posts', df_posts)
                conn.execute("""
                    INSERT INTO postings
                    SELECT * FROM temp_posts
                """)
                log(f"Inserted {len(df_posts)} posts for user '{username}' into the database.")

                # Record the number of collected posts in the summary
                summary[username] = len(df_posts)

            else:
                log(f"No posts collected for user '{username}'")
                summary[username] = 0
        except Exception as e:
            log_error(e)
            summary[username] = 0

    # Print the summary
    log("Summary of collected posts:")
    for user, count in summary.items():
        log(f"User '{user}': {count} posts collected.")

if __name__ == "__main__":
    main()

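How the Steemit post collector works

This program collects the recent posts of the active users registered in the users table (the postingcuration participants) and stores them in DuckDB. Each part is described below.

1. Importing the required modules and libraries

duckdb: connects to and works with the DuckDB database.
steem: connects to the Steemit blockchain and fetches data through its API.
pandas: converts the collected data into a DataFrame for processing.
datetime: provides the current time for log lines.
traceback: prints a detailed error log when an exception occurs.
json: parses the json_metadata of Steemit posts.

2. Setting up the Steemit and DuckDB connections

A Steem() object is created to connect to the Steemit API, and a connection is opened to the DuckDB file.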

3. Defining the logging functions

General log function: log() prints a message together with the current time.
Error log function: log_error() prints the detailed stack trace of the error that occurred.
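As an alternative to the two print-based helpers, the same output could probably be produced with Python's standard logging module. The sketch below is not what the collector uses: the logger name "collector" and the `make_logger` helper are made up for illustration, and the format string only approximates the lines printed by log() and log_error().

```python
import io
import logging


def make_logger(stream):
    """Build a logger that writes 'time - LEVEL - message' lines to stream."""
    logger = logging.getLogger("collector")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(asctime)s - %(levelname)s - %(message)s"))
    logger.addHandler(handler)
    logger.propagate = False
    return logger


buffer = io.StringIO()  # stands in for the console so the output can be inspected
logger = make_logger(buffer)

logger.info("Fetching active users from database")
try:
    1 / 0
except ZeroDivisionError:
    # logger.exception logs at ERROR level and appends the stack trace
    logger.exception("Collection failed")

output = buffer.getvalue()
```

Passing sys.stdout instead of the StringIO buffer would send the same lines to the console.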

4. Checking for an existing URL

Checks whether the given url already exists in the postings table.
Returns True if it exists and False if it does not.
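url_exists() runs one query per post. If that ever becomes slow, one possible variation is to load the stored URLs once and filter in memory. The sketch below assumes the stored post_id values have already been read into a collection, for example with a single SELECT on postings; `filter_new_posts` is a hypothetical helper and not part of the collector.

```python
def filter_new_posts(posts, existing_ids):
    """Return only the posts whose post_id is not stored yet.

    posts: list of dicts with a 'post_id' key (same shape as post_details).
    existing_ids: iterable of post_id values already in the postings table.
    """
    seen = set(existing_ids)
    new_posts = []
    for post in posts:
        if post["post_id"] in seen:
            continue  # already stored, or repeated within this batch
        seen.add(post["post_id"])
        new_posts.append(post)
    return new_posts
```

Because each URL is added to the set as it is accepted, a URL that appears twice in one batch is also kept only once.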

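5. Collecting recent posts

Collects up to 100 of each user's most recent posts.
Resteemed posts are excluded, and each post's metadata is parsed to extract the required information.
If the postingcuration tag is present it becomes the main_tag; otherwise the first tag is used as the main tag.
If the URL is already in the postings table, the post is skipped.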
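The tag handling and URL construction described above can be pulled out into small pure functions, which makes them easy to try without a network connection. This is a sketch under assumptions: the helper names `parse_tags`, `pick_main_tag`, and `build_url` are invented here, and treating empty or malformed metadata as "no tags" is a choice made for this example.

```python
import json


def parse_tags(json_metadata):
    """Return the tag list from json_metadata, which may be a dict or a JSON string."""
    if isinstance(json_metadata, str):
        # Treat an empty string as "no metadata" instead of failing to parse it
        json_metadata = json.loads(json_metadata) if json_metadata.strip() else {}
    if not isinstance(json_metadata, dict):
        return []
    tags = json_metadata.get("tags", [])
    if not isinstance(tags, list):
        return []
    return [str(tag) for tag in tags]


def pick_main_tag(tags, preferred="postingcuration"):
    """Prefer the postingcuration tag; otherwise fall back to the first tag."""
    if preferred in tags:
        return preferred
    return tags[0] if tags else ""


def build_url(category, author, permlink):
    """Build the steemit.com URL that the collector stores as post_id."""
    return f"https://steemit.com/{category}/@{author}/{permlink}"
```

For example, `pick_main_tag(parse_tags('{"tags": ["kr", "postingcuration"]}'))` gives "postingcuration", while a post tagged only "kr" and "python" gets "kr".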

6. Loading the user list

Selects only the users whose is_active is Y from the users table and returns them as a list.

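7. Main function: collecting posts and saving them to the database

Loads the list of active users and collects each user's posts.
The collected posts are converted to a DataFrame and then saved to the postings table in DuckDB.
The number of posts collected for each user is stored in summary and printed to the log.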
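The INSERT in main() uses SELECT *, so it relies on the DataFrame columns being in the same order as the table columns. A more defensive variation is to name the columns explicitly. The sketch below only builds the SQL text. It assumes the postings table uses the same column names as the keys of post_details (the table definition is in an earlier post and is not shown here), and `build_insert_sql` is a hypothetical helper.

```python
# Column names taken from the post_details dict; assumed to match the postings table
POSTING_COLUMNS = [
    "post_id", "user_id", "title", "body", "tags",
    "main_tag", "voting_status", "posting_date", "created_at", "modified_at",
]


def build_insert_sql(table, columns, source):
    """Build an INSERT ... SELECT statement with an explicit column list."""
    for name in [table, source, *columns]:
        # Only plain identifiers are allowed, since names are placed directly in the SQL
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    column_list = ", ".join(columns)
    return f"INSERT INTO {table} ({column_list}) SELECT {column_list} FROM {source}"
```

In the collector this would be used as `conn.execute(build_insert_sql("postings", POSTING_COLUMNS, "temp_posts"))` right after the conn.register() call.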

8. Printing the results

When the program finishes, it prints a summary of the number of posts collected for each user.

9. Entry point

The `__name__ == "__main__"` check calls main() when the program is run directly.

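How to run

Saving the program

collect_steem_postings.py

Save the code under the file name above.

Installing the required packages

pip install steem numpy pandas

steem should already be installed from the earlier posts, so this mainly adds the pandas package. The script also imports duckdb, so install that as well if it is not already present.

Running the program

python collect_steem_postings.py

Run

Let me actually run the program myself.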

2024-10-08 23:10:15.820727 - Summary of collected posts:
2024-10-08 23:10:15.820727 - User 'talkit': 0 posts collected.
2024-10-08 23:10:15.820727 - User 'newiz': 0 posts collected.
2024-10-08 23:10:15.820727 - User 'banguri': 0 posts collected.
2024-10-08 23:10:15.820727 - User 'dozam': 0 posts collected.
2024-10-08 23:10:15.820727 - User 'epitt925': 0 posts collected.
2024-10-08 23:10:15.820727 - User 'etainclub': 0 posts collected.
2024-10-08 23:10:15.821704 - User 'happycoachmate': 1 posts collected.
2024-10-08 23:10:15.821704 - User 'jungjunghoon': 0 posts collected.
2024-10-08 23:10:15.821704 - User 'kimyg18': 0 posts collected.
2024-10-08 23:10:15.821704 - User 'maikuraki': 0 posts collected.
2024-10-08 23:10:15.821704 - User 'parisfoodhunter': 0 posts collected.
2024-10-08 23:10:15.821704 - User 'parkname': 0 posts collected.
2024-10-08 23:10:15.821704 - User 'peterpa': 0 posts collected.
2024-10-08 23:10:15.821704 - User 'powerego': 1 posts collected.
2024-10-08 23:10:15.821704 - User 'shrah011': 0 posts collected.
2024-10-08 23:10:15.821704 - User 'tsf-leejgn': 0 posts collected.
2024-10-08 23:10:15.821704 - User 'yoghurty': 1 posts collected.
2024-10-08 23:10:15.821704 - User 'ssglanders': 0 posts collected.
2024-10-08 23:10:15.821704 - User 'june0620': 0 posts collected.
Finished running collect_postingcuration.py

In this collection run ^^

@yoghurty, @powerego, and @happycoachmate have each written a post ^^

Thank you.

Posted through the ECblog app (https://blog.etain.club)

#kr-dev #postingcuration #talkit #python #steemit #ecblog


bigbear34 (71) 3 days ago

So the collection feature is finally done, haha. I think it will make it easier to find posts you wrote in the past, haha.


talkit (68) 3 days ago

Yes, thank you.
Using this collector, the next post or the one after it will cover analysis or voting. ^^
