Migrating WordPress Blog To Django   

I just finished migrating EssayBoard.com from WordPress.com to the Django framework.  What is Django?  It's a Python-based web framework that lets me build web apps with less work, because it provides a robust, secure foundation in which many common tasks take only a few lines of code.  This is especially true when you use the class-based views that Django provides.

Migrating from WordPress to Django is quite tedious because you need to create the Django blog app first, then write a script to migrate all the blog posts from WordPress to Django.  The migration wasn't smooth either, because some blog posts didn't migrate correctly.  For example, line breaks in the WordPress blog format were not translated into line breaks in Django.  Also, YouTube links in my original WordPress blog were automatically displayed as YouTube videos, but in the Django blog web app that I coded these YouTube links were just links.  I had to implement Django Embed Video to allow these YouTube links to be displayed as YouTube videos in each blog post.  Another obstacle with YouTube links is that, strangely, the script I coded could only grab the first YouTube link in a blog post but not the rest, so only one YouTube video gets shown and the rest stay as links.  Nonetheless, I'm too lazy to dig back into the script and figure out why, since I have already migrated more than 3,000 WordPress blog posts to Django.  I decided this is a good opportunity to review and update old blog posts one by one when I have time, so if a blog post has more than one YouTube link, I will turn these links into YouTube videos manually during this process.
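To illustrate the line-break problem, here is a minimal sketch of one way to turn WordPress-style plain line breaks into HTML paragraphs before a post is saved.  The function name wp_linebreaks_to_html is hypothetical and is not part of the migration script below.  The sketch assumes that paragraphs in the exported content are separated by blank lines and that single newlines should become <br> tags; it is naive about content that already contains block-level HTML.

```python
import re


def wp_linebreaks_to_html(text):
    """Wrap blank-line-separated blocks in <p> tags and turn single newlines into <br>."""
    # Normalize Windows and old Mac line endings first.
    text = text.replace('\r\n', '\n').replace('\r', '\n')
    # Two or more newlines in a row mark a paragraph boundary.
    blocks = re.split(r'\n{2,}', text.strip())
    paragraphs = []
    for block in blocks:
        if block:
            paragraphs.append('<p>' + block.replace('\n', '<br>') + '</p>')
    return '\n'.join(paragraphs)


print(wp_linebreaks_to_html('First line\nsecond line\n\nNew paragraph'))
```

Django's built-in linebreaks template filter does a similar job at display time, so converting during the import is only one of the options.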

I did add a few Django admin actions to allow bulk publishing and bulk unpublishing (reverting to draft) of blog posts.  This is how I was able to bulk unpublish the 3,000-plus migrated WordPress blog posts.  Right now, I only show a few recent blog posts, ordered by published date, on EssayBoard.com because I have already reviewed and updated these blog posts.
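For anyone curious what such admin actions can look like, here is a rough sketch.  A Django admin action is just a function that receives the model admin, the request, and the queryset of selected rows.  The status values DRAFT and PUBLISHED and the function names are assumptions made for illustration; they are not taken from the actual EssayBoard models.

```python
# Hypothetical status values; use whatever your own Post model defines.
DRAFT = 1
PUBLISHED = 2


def publish_posts(modeladmin, request, queryset):
    """Admin action: mark every selected post as published."""
    queryset.update(status=PUBLISHED)


def unpublish_posts(modeladmin, request, queryset):
    """Admin action: send every selected post back to draft."""
    queryset.update(status=DRAFT)


# Django shows these labels in the admin's "Action" dropdown.
publish_posts.short_description = 'Publish selected posts'
unpublish_posts.short_description = 'Unpublish selected posts (draft)'

# In admin.py, the functions would be listed on the ModelAdmin, for example:
#
#     class PostAdmin(admin.ModelAdmin):
#         actions = [publish_posts, unpublish_posts]
```

Because queryset.update() runs a single UPDATE statement, this kind of action stays fast even when thousands of posts are selected at once.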

For what it is worth, the script that I coded to migrate WordPress blog posts to Django did help me get all the titles, published dates, and other metadata into my Django blog app.  Furthermore, if I ever have to use this script again, I can go back into it and fix the bug that prevents it from translating all YouTube links into YouTube videos.  At present, I'm not in the mood to fix it.  Regardless, if you want to use it for your own project, you can copy the script below.  By the way, for this script to work, you must install Django Extensions.

# This script imports WordPress posts from WordPress' exported XML files.
# It only imports an image for a post if the image is of the wp_attachment_url type (as found in the XML files).
# The wp_attachment_url image is then set as the post's feature image in the database.

import os
import random
import string
import time
from datetime import datetime
import csv

import feedparser
import boto3
from post.models import Post, YouTubeLink
from account.models import Account
from category.models import Category
from slugify import slugify
import pandas as pd
import requests
from pathlib import Path
import re
from essayboard.settings import AWS_S3_CUSTOM_DOMAIN

STORAGE_PATH = 'media/images'
XML_BASE_DIR = 'xml_dir/'
WP_EXPORTED_FILES = []
YOUTUBE_REGEX = (
    r'(https?://)?(www\.)?'
    r'(youtube|youtu|youtube-nocookie)\.(com|be)/'
    r'(watch\?v=|embed/|v/|.+\?v=)?([^&=%\?]{11})')


def run():
    s3_bucket_folder_path = os.getenv('S3_BUCKET_FOLDER_PATH')
    # Make sure the local download folder exists before any image is fetched.
    Path(STORAGE_PATH).mkdir(parents=True, exist_ok=True)

    for filename in os.listdir(XML_BASE_DIR):
        if filename.endswith('.xml'):
            WP_EXPORTED_FILES.append(str(XML_BASE_DIR + filename))
        else:
            continue

    s3client = boto3.client('s3',
                            aws_access_key_id=os.getenv('AWS_ACCESS_KEY_ID'),
                            aws_secret_access_key=os.getenv('AWS_SECRET_ACCESS_KEY'),
                            region_name=os.getenv('REGION_NAME')
                            )

    def get_random_string(length):
        # choose from all lowercase letter
        letters = string.ascii_lowercase
        result_str = ''.join(random.choice(letters) for i in range(length))
        # print("Random string of length", length, "is:", result_str)
        return result_str

    random_string = get_random_string(9)

    def get_things_done(file=None):
        i = file
        data = feedparser.parse(i)
        entries = data['entries']
        return entries

    def download_image(url, storage_path, dry_run):
        ext = url.split('/')[-1].split('.')[-1]
        file_name = url.split('/')[-1].split('.')[0]
        if dry_run is True:
            # Download image from wp_attachment_url
            print(f'Downloading image from the original post.')
            with open(
                    f'scripts/dry_run_importing_wp_posts_{datetime.now().strftime("%m_%d_%Y")}_{random_string}.txt',
                    'a',
                    encoding='utf-8') as f:
                f.write(f'Downloading image from the original post.\n')
            return f'{storage_path}/{file_name}.{ext}'
        else:
            # Download image from wp_attachment_url
            try:
                response = requests.get(url, stream=True, timeout=30)
            except requests.exceptions.RequestException as e:
                # Nothing was downloaded, so let the caller skip this entry.
                print(f'Could not download {url}: {e}')
                raise
            if not response.ok:
                print(response)
            with open(f'{storage_path}/{file_name}.{ext}', 'wb') as handle:
                for block in response.iter_content(1024):
                    if not block:
                        break
                    handle.write(block)
            return f'{storage_path}/{file_name}.{ext}'

    def upload_to_s3(file_name, bucket, dry_run=False, object_name=None, args=None):
        if object_name is None:
            object_name = file_name
        if dry_run is True:
            print(
                f'Uploading to Amazon S3 using filename: {file_name}, bucket: {bucket}, object_name: {object_name}, ExtraArgs: {args}')
            with open(
                    f'scripts/dry_run_importing_wp_posts_{datetime.now().strftime("%m_%d_%Y")}_{random_string}.txt',
                    'a',
                    encoding='utf-8') as f:
                f.write(
                    f'Uploading to Amazon S3 using filename: {file_name}, bucket: {bucket}, object_name: {object_name}, ExtraArgs: {args}\n')
        else:
            s3client.upload_file(file_name, bucket, object_name, ExtraArgs=args)

    def get_images_n_process(image_list: list, dry_run=False):
        dry_run = dry_run
        for i in image_list:
            try:
                # First download image to local folder on local machine
                image_url = download_image(url=i, storage_path=STORAGE_PATH, dry_run=dry_run)

                # Upload this image to S3.
                upload_to_s3(image_url, os.getenv('BUCKET_NAME'), dry_run,
                             s3_bucket_folder_path + image_url.split('/')[-1])

                # Feature image link is now available.  Get only the first image from post_images list.
                if image_list.index(i) == 0:
                    feature_image_link_on_s3 = 'https://%s/%s/' % (
                        AWS_S3_CUSTOM_DOMAIN, s3_bucket_folder_path + image_url.split('/')[-1])
                    return feature_image_link_on_s3
                else:
                    return None
            except Exception as e:
                raise e

    def process_entry(entry, dry_run=False):
        post_images = []
        tag_list = []
        feature_image_link_on_s3 = None
        category = None
        try:
            post_title = entry['title']
        except Exception:
            post_title = ''
        published_date = entry['published']
        published_date = pd.to_datetime(published_date).strftime('%Y-%m-%d')
        post_content = entry['content'][0]['value']
        try:
            post_images.append(entry['wp_attachment_url'])
        except Exception:
            post_images = None
        try:
            tags = entry['tags']
        except Exception:
            tags = ''

        for j in tags:
            tag_list.append(j['term'])
            if j['scheme'] == 'category':
                category = j['term']

        # Look the category up by name.  On a real run, create it from the
        # WordPress exported file if it is not in the db yet; a dry run never writes.
        if category:
            try:
                existing_category = Category.objects.filter(name=category).first()
                if existing_category is not None:
                    category = existing_category.name
                elif dry_run is not True:
                    Category.objects.create(name=category,
                                            slug=slugify(category, entities=True,
                                                         decimal=True,
                                                         hexadecimal=True, separator='-', lowercase=True))
            except Exception:
                pass

        # If there are images in entry['wp_attachment_url'], then we
        # will download and upload them to S3 and return a feature image for post.
        if post_images:
            feature_image_link_on_s3 = get_images_n_process(image_list=post_images, dry_run=dry_run)
            try:
                feature_image_link_on_s3 = feature_image_link_on_s3.strip('/')
            except Exception:
                pass

        slug = slugify(post_title, entities=True, decimal=True, hexadecimal=True,
                       separator='-',
                       lowercase=True)
        # Get YouTube link from each post
        # Known limitation: re.search() returns only the first match, so at most
        # one link per post is captured, and only posts that contain a youtu.be
        # short link are checked at all.
        youtube_link_group = []
        content = entry['content'][0]['value']
        if 'https://youtu.be' in content or 'http://youtu.be' in content:
            youtube_link = re.search(YOUTUBE_REGEX, content)
            if youtube_link:
                youtube_link_group.append(youtube_link.group())

        return_list = {'post_title': post_title, 'slug': slug, 'post_content': post_content,
                       'feature_image_link_on_s3': feature_image_link_on_s3, 'published_date': published_date,
                       'category': category, 'tag_list': tag_list, 'youtube_link_group': youtube_link_group}
        return return_list

    def create_post(dry_run, post_title, slug, post_content, feature_image_link_on_s3, published_date, category,
                    tag_list, youtube_link_group):
        if dry_run is not True:
            try:
                post = Post(
                    title=post_title,
                    slug=slug,
                    author=Account.objects.get(pk=1),
                    content=post_content,
                    feature_image=feature_image_link_on_s3,
                    publish_date=published_date,
                    status=2,
                    categories=Category.objects.filter(name=category).first(),
                )
                post.save()
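            except Exception as e:
                # Re-raise so that main() skips this entry and does not record
                # its wp_post_id as imported.
                print(f'Could not save post "{post_title}": {e}')
                raise
            for tag in tag_list:
                post.tags.add(tag)
            for link in youtube_link_group:
                # Attach each video to the post that was just saved.
                YouTubeLink.objects.create(
                    post=post,
                    video=link
                )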
        else:
            with open(
                    f'scripts/dry_run_importing_wp_posts_{datetime.now().strftime("%m_%d_%Y")}_{random_string}.txt',
                    'a',
                    encoding='utf-8') as f:
                f.writelines(
                    f"Creating post title: {post_title}.\n"
                    f"Creating slug: {slug}.\n"
                    f"Assigning author: {Account.objects.get(pk=1)} to post: {post_title}.\n"
                    f"Creating post content: {post_content}.\n"
                    f"Feature image link is: {feature_image_link_on_s3}.\n"
                    f"Publish date is: {published_date}.\n"
                    f"Post status is: 2.\n"
                    f"Post\'s category name: {category}.\n"
                    f"Tags are: {', '.join(tag_list)}.\n")
            with open(
                    f'scripts/dry_run_importing_wp_posts_{datetime.now().strftime("%m_%d_%Y")}_{random_string}.txt',
                    'a',
                    encoding='utf-8') as f:
                f.write(f'YouTube videos are: {", ".join(youtube_link_group)}\n\n\n\n\n')

    def main(dry_run):
        wp_post_id_df = None
        count = 0
        file_counts = len(WP_EXPORTED_FILES)
        # field = []
        #
        # # If the file is not yet existed, we will open it and write a header 'wp_post_id'
        # with open(f'scripts/wp_post_id.csv', 'a', encoding='utf-8') as f:
        #     field.append('wp_post_id')
        #     csvwriter = csv.writer(f)
        #     csvwriter.writerow(field)

        # Here we begin to capture user input to run the imported WordPress XML files.
        while count < file_counts:
            if WP_EXPORTED_FILES[count]:
                user_input = input(
                    f'Do you want to run {WP_EXPORTED_FILES[count]} file?  Type "y" for yes "n" for no.  ')
                if user_input.lower() == 'n':
                    break
                else:

                    # Try to parse the XML files for dictionary results
                    imported_posts = get_things_done(file=WP_EXPORTED_FILES[count])

                    # Import certain amount of posts
                    # for i in imported_posts[-15:]:
                    #     wp_post_id_df_list = []

                    # Import all posts
                    for i in imported_posts:
                        wp_post_id_df_list = []

                        # Try to read in wp_post_id.csv file to get unique id of each imported post,
                        # if the file is not available, just ignore and continue
                        try:
                            wp_post_id_df = pd.read_csv('scripts/wp_post_id.csv')
                            # wp_post_id_df.drop_duplicates('wp_post_id')

                            # Convert the ID column to plain Python ints so no NumPy warning appears.
                            wp_post_id_df_list = wp_post_id_df.iloc[:, 0].astype(int).tolist()
                        except Exception:
                            pass

                        # Get wp_post_id value from parsed results
                        wp_post_id = i['wp_post_id']

                        # Get wp_post status
                        wp_status = i['wp_status']

                        try:
                            # Import only published posts that are not recorded in wp_post_id.csv yet.
                            # (The list is empty when the csv could not be read.)
                            if int(wp_post_id) not in wp_post_id_df_list and wp_status == 'publish':
                                results = process_entry(entry=i, dry_run=dry_run)
                                create_post(dry_run=dry_run, post_title=results['post_title'], slug=results['slug'],
                                            post_content=results['post_content'],
                                            feature_image_link_on_s3=results['feature_image_link_on_s3'],
                                            published_date=results['published_date'], category=results['category'],
                                            tag_list=results['tag_list'],
                                            youtube_link_group=results['youtube_link_group'])

                                # We will append the wp_post_id to the csv file (working backward kind of way),
                                # earlier we read - but if this is the first run of the script - the read will
                                # yield no result.  Now we write the wp_post_id - by the second run - the script
                                # will be able to read in wp_post_id - this means we can use this id to tell
                                # the script to not import the same post again if it got this id.
                                try:
                                    with open(f'scripts/wp_post_id.csv', 'a', encoding='utf-8') as f:
                                        row = [wp_post_id]
                                        csvwriter = csv.writer(f)
                                        csvwriter.writerow(row)
                                except Exception as e:
                                    raise e
                                time.sleep(2)
                        except Exception:
                            pass
                    count += 1
            else:
                pass

    def run_everything():
        start_time = time.time()
        dry_run = True
        user_input = input('Do you want to run a dry run?  Type "y" for yes "n" for no.  ')
        if user_input.lower() == 'n':
            dry_run = False

        # If the file is not yet existed, we will open it and write a header 'wp_post_id'
        # This file will be responsible for script to resume importing if failed and not have to
        # import the same post with same id again.
        if not Path('scripts/wp_post_id.csv').exists():
            with open('scripts/wp_post_id.csv', 'a', encoding='utf-8', newline='') as f:
                csv.writer(f).writerow(['wp_post_id'])

        main(dry_run)
        print(f'---time--- {time.time() - start_time}')
        print('All WordPress posts are now imported.')

    run_everything()

You must customize this script somewhat to your needs, because you have to supply the environment variables for the Boto3/S3 parts.  This script also relies on FeedParser to parse the exported WordPress XML files, which means you must install this module as well.  Don't forget to install the Embed Video module for Django so that the YouTube links the script stores can be displayed as YouTube videos.
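Because the script reads its S3 settings with os.getenv, a missing variable only shows up as a confusing error partway through a run.  A small check like the sketch below could be run first.  The variable names are the ones the script reads; the helper name missing_env_vars is made up for this example.

```python
import os

# Environment variables that the migration script reads with os.getenv().
REQUIRED_ENV_VARS = [
    'S3_BUCKET_FOLDER_PATH',
    'AWS_ACCESS_KEY_ID',
    'AWS_SECRET_ACCESS_KEY',
    'REGION_NAME',
    'BUCKET_NAME',
]


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment."""
    if environ is None:
        environ = os.environ
    return [name for name in names if not environ.get(name)]


missing = missing_env_vars(REQUIRED_ENV_VARS)
if missing:
    print('Set these environment variables before running the script: ' + ', '.join(missing))
```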

Update:  In the part of the script that I commented as # Get YouTube link from each post, I forgot to implement a loop to convert multiple YouTube links into YouTube videos for each blog post.  I guess you could update the script with a while loop of sorts, or create a for loop with a predetermined number of YouTube links you want to convert for each blog post.  Doing this will allow the script to convert multiple YouTube links into YouTube videos for each blog post.
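One possible way to do this, offered as a sketch rather than a tested change to the script, is to let Python's re.finditer walk through every match instead of stopping at the first one as re.search does.  The function name find_youtube_links and the sample text are made up for illustration.  The pattern is a slightly narrowed variant of the script's YOUTUBE_REGEX: the original .+\?v= branch is greedy and can swallow the text between two links that sit on the same line, so this sketch uses \S+?\?v= and also keeps whitespace out of the 11-character video ID.

```python
import re

# Narrowed variant of the script's YOUTUBE_REGEX (an assumption of this sketch).
YOUTUBE_REGEX = (
    r'(https?://)?(www\.)?'
    r'(youtube|youtu|youtube-nocookie)\.(com|be)/'
    r'(watch\?v=|embed/|v/|\S+?\?v=)?([^&=%\?\s]{11})')


def find_youtube_links(content):
    """Return every YouTube link found in content, in order, without duplicates."""
    links = []
    for match in re.finditer(YOUTUBE_REGEX, content):
        link = match.group()
        if link not in links:
            links.append(link)
    return links


# Made-up links with 11-character placeholder video IDs.
first = 'https://youtu.be/' + 'a' * 11
second = 'https://www.youtube.com/watch?v=' + 'b' * 11
sample = f'Intro {first} then some text and {second} at the end.'
print(find_youtube_links(sample))
```

In process_entry, youtube_link_group could then be set to the list this function returns, since create_post already loops over every link in that list.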

In summary, if you have already coded a blog application based on the Django framework and want to import WordPress blog posts into this Django app, you could use the Python script I posted above.  Be sure to install Django Extensions so you can use its runscript command to run this script!  Without this module, you won't be able to run the script above.  I recommend you limit each migration to around 200 blog posts or fewer.  You can do this in the section that I commented as # Import all posts.  Each time you migrate, the script appends each post's unique WordPress wp_post_id to a file named wp_post_id.csv.  This speeds up later migrations of new WordPress blog posts, because any blog post that has already been migrated is skipped.  To reset the whole process, you must delete this file.
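To make the batching and resume idea concrete, here is a small sketch that is separate from the script above.  It picks the next batch of entries whose wp_post_id has not been recorded yet, assuming only published posts should be migrated.  The function name next_batch, the batch_size parameter, and the sample entries are hypothetical; only the wp_post_id and wp_status keys come from the script.

```python
def next_batch(entries, imported_ids, batch_size=200):
    """Return up to batch_size published entries that were not imported yet."""
    batch = []
    for entry in entries:
        if len(batch) >= batch_size:
            break
        if entry['wp_status'] != 'publish':
            continue
        if int(entry['wp_post_id']) in imported_ids:
            continue
        batch.append(entry)
    return batch


entries = [
    {'wp_post_id': '1', 'wp_status': 'publish'},
    {'wp_post_id': '2', 'wp_status': 'draft'},
    {'wp_post_id': '3', 'wp_status': 'publish'},
    {'wp_post_id': '4', 'wp_status': 'publish'},
]
already_imported = {1}  # IDs that would be read from wp_post_id.csv

for entry in next_batch(entries, already_imported, batch_size=2):
    print(entry['wp_post_id'])
```

Keeping the imported IDs in a set makes each membership check cheap, which matters more as wp_post_id.csv grows into the thousands.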

profile image of Vinh Nguyen
