Migrating WordPress Blog To Django
I just finished migrating EssayBoard.com from WordPress.com to the Django framework. What is Django? It's a Python-based web framework that lets me build a web app with less work, because it provides a robust, secure foundation in which many common tasks take only a few lines of code, especially when you use Django's class-based views.
Migrating from WordPress to Django is quite tedious because you need to build the Django blog app first, then write a script to migrate all the blog posts from WordPress to Django. The migration wasn't smooth either, because some blog posts didn't migrate correctly. For example, linebreaks in the WordPress post format were not translated into linebreaks in Django.
Also, YouTube links in my original WordPress blog were automatically displayed as YouTube videos, but in the Django blog app that I coded these links were just links. I had to add Django Embed Video so that YouTube links could be displayed as videos in each blog post. Another obstacle with YouTube links is that, strangely, the script I coded could only grab the first YouTube link in a blog post and not the rest, so only one video was shown and the other links stayed as plain links.
Nonetheless, I'm too lazy to dig back into the script and figure out why, since I have already migrated more than 3,000 WordPress blog posts to Django. I decided this is a good opportunity to review and update old blog posts one by one when I have time. If a blog post has more than one YouTube link, I will turn the remaining links into videos manually during this review.
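For the linebreak problem, a small helper along these lines could turn plain newlines into HTML before the content is saved. This is only a sketch under assumptions: wp_text_to_html is a hypothetical name that is not part of the script below, and it assumes the exported post content separates paragraphs with blank lines. Django's built-in linebreaks template filter does a similar conversion at render time, so that may be the simpler route.

```python
import re


def wp_text_to_html(text: str) -> str:
    """Wrap blank-line-separated blocks in <p> tags and turn single newlines into <br>."""
    # Normalize Windows and old Mac line endings first.
    text = text.replace('\r\n', '\n').replace('\r', '\n')
    # Two or more consecutive newlines mark a paragraph boundary.
    paragraphs = re.split(r'\n{2,}', text.strip())
    rendered = []
    for paragraph in paragraphs:
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # A single newline inside a paragraph becomes a visible line break.
        rendered.append('<p>' + paragraph.replace('\n', '<br>') + '</p>')
    return '\n'.join(rendered)


if __name__ == '__main__':
    print(wp_text_to_html('First line\nsecond line\n\nNew paragraph'))
```

The helper does not escape anything, because exported WordPress content usually already contains HTML tags that should be kept as they are.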
I did add a few Django admin actions for bulk publishing and bulk unpublishing (reverting to draft) blog posts. This is how I was able to bulk unpublish the 3,000-plus migrated WordPress blog posts. Right now, EssayBoard.com shows only a few recent blog posts, ordered by published date, because those are the ones I have already reviewed and updated.
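Admin actions like these are plain functions that take the model admin, the request, and the selected queryset. Here is a rough sketch of what such a pair could look like. The function names are hypothetical, PUBLISHED = 2 is taken from the status the import script saves, and DRAFT = 1 is purely an assumption about the Post model.

```python
PUBLISHED = 2  # the import script saves posts with status=2
DRAFT = 1      # assumption: the value the Post model uses for drafts


def publish_posts(modeladmin, request, queryset):
    """Admin action: mark every selected post as published."""
    queryset.update(status=PUBLISHED)


publish_posts.short_description = 'Publish selected posts'


def unpublish_posts(modeladmin, request, queryset):
    """Admin action: send every selected post back to draft."""
    queryset.update(status=DRAFT)


unpublish_posts.short_description = 'Unpublish selected posts (set to draft)'

# Registration in admin.py would look roughly like this (not executed here):
#
# from django.contrib import admin
# from post.models import Post
#
# @admin.register(Post)
# class PostAdmin(admin.ModelAdmin):
#     actions = [publish_posts, unpublish_posts]
```

Because queryset.update() issues a single SQL UPDATE, unpublishing thousands of posts at once stays fast, although it skips each model's save() method and signals.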
For what it's worth, the script I coded to migrate WordPress blog posts to Django did help me get all the titles, published dates, and other metadata into my Django blog app. If I ever have to use this script again, I can go back and fix the bug that stops it from turning every YouTube link into a YouTube video. At present, I'm not in the mood to fix it. Regardless, if you want to use it for your own project, you can copy the script below. By the way, for this script to work, you must install Django Extensions.
# This script is now fully functioning in importing WordPress posts from WordPress' exported XML files.
# It only imports an image for a post if the image is of the wp_attachment_url type (as found in the XML files).
# Furthermore, the wp_attachment_url image is set as the feature image in the database after it is imported.
import os
import random
import string
import time
from datetime import datetime
import csv
import feedparser
import boto3
from post.models import Post, YouTubeLink
from account.models import Account
from category.models import Category
from slugify import slugify
import pandas as pd
import requests
from pathlib import Path
import re
from essayboard.settings import AWS_S3_CUSTOM_DOMAIN
STORAGE_PATH = 'media/images'
XML_BASE_DIR = 'xml_dir/'
WP_EXPORTED_FILES = []
# Every piece is a raw string so that the backslash escapes reach the regex engine unchanged.
YOUTUBE_REGEX = (
    r'(https?://)?(www\.)?'
    r'(youtube|youtu|youtube-nocookie)\.(com|be)/'
    r'(watch\?v=|embed/|v/|.+\?v=)?([^&=%\?]{11})')
def run():
    # Django Extensions' runscript calls run(); everything below is nested inside it.
    s3_bucket_folder_path = os.getenv('S3_BUCKET_FOLDER_PATH')
    # Make sure the local folder for downloaded images exists.
    Path(STORAGE_PATH).mkdir(parents=True, exist_ok=True)
    # Collect every exported WordPress XML file found in XML_BASE_DIR.
    for filename in os.listdir(XML_BASE_DIR):
        if filename.endswith('.xml'):
            WP_EXPORTED_FILES.append(XML_BASE_DIR + filename)
    s3client = boto3.client('s3',
                            aws_access_key_id=os.getenv('AWS_ACCESS_KEY_ID'),
                            aws_secret_access_key=os.getenv('AWS_SECRET_ACCESS_KEY'),
                            region_name=os.getenv('REGION_NAME'))
    def get_random_string(length):
        # Choose from all lowercase letters.
        letters = string.ascii_lowercase
        return ''.join(random.choice(letters) for _ in range(length))
    # Suffix shared by every dry-run log file written during this run.
    random_string = get_random_string(9)
    def get_things_done(file=None):
        # Parse one exported WordPress XML file and return its entries.
        data = feedparser.parse(file)
        return data['entries']
    def download_image(url, storage_path, dry_run):
        # Split the last URL segment into a base name and an extension.
        ext = url.split('/')[-1].split('.')[-1]
        file_name = url.split('/')[-1].split('.')[0]
        local_path = f'{storage_path}/{file_name}.{ext}'
        if dry_run is True:
            # Dry run: only log what would be downloaded.
            print('Downloading image from the original post.')
            with open(
                    f'scripts/dry_run_importing_wp_posts_{datetime.now().strftime("%m_%d_%Y")}_{random_string}.txt',
                    'a',
                    encoding='utf-8') as f:
                f.write('Downloading image from the original post.\n')
            return local_path
        # Download image from wp_attachment_url.
        try:
            # The 30-second timeout is an arbitrary choice; without a timeout,
            # requests can wait indefinitely.
            response = requests.get(url, stream=True, timeout=30)
        except requests.exceptions.Timeout:
            # Re-raise so the code below never runs with an undefined response.
            print('Timeout occurred')
            raise
        if not response.ok:
            print(response)
        with open(local_path, 'wb') as handle:
            for block in response.iter_content(1024):
                if not block:
                    break
                handle.write(block)
        return local_path
    def upload_to_s3(file_name, bucket, dry_run=False, object_name=None, args=None):
        # Default the S3 object name to the local file name.
        if object_name is None:
            object_name = file_name
        if dry_run is True:
            # Dry run: only log what would be uploaded.
            message = (f'Uploading to Amazon S3 using filename: {file_name}, bucket: {bucket}, '
                       f'object_name: {object_name}, ExtraArgs: {args}')
            print(message)
            with open(
                    f'scripts/dry_run_importing_wp_posts_{datetime.now().strftime("%m_%d_%Y")}_{random_string}.txt',
                    'a',
                    encoding='utf-8') as f:
                f.write(message + '\n')
        else:
            s3client.upload_file(file_name, bucket, object_name, ExtraArgs=args)
    def get_images_n_process(image_list: list, dry_run=False):
        for index, i in enumerate(image_list):
            # First download the image to the local folder on the local machine.
            image_url = download_image(url=i, storage_path=STORAGE_PATH, dry_run=dry_run)
            # Upload this image to S3.
            upload_to_s3(image_url, os.getenv('BUCKET_NAME'), dry_run,
                         s3_bucket_folder_path + image_url.split('/')[-1])
            # The feature image link is now available. Only the first image in the
            # list is used, so the function returns during the first iteration.
            if index == 0:
                feature_image_link_on_s3 = 'https://%s/%s/' % (
                    AWS_S3_CUSTOM_DOMAIN, s3_bucket_folder_path + image_url.split('/')[-1])
                return feature_image_link_on_s3
            return None
    def process_entry(entry, dry_run=False):
        post_images = []
        tag_list = []
        feature_image_link_on_s3 = None
        category = None
        post_title = entry.get('title', '')
        published_date = pd.to_datetime(entry['published']).strftime('%Y-%m-%d')
        post_content = entry['content'][0]['value']
        try:
            post_images.append(entry['wp_attachment_url'])
        except Exception:
            post_images = None
        # Every term becomes a tag; a term whose scheme is 'category' also sets the category.
        for j in entry.get('tags', []):
            tag_list.append(j['term'])
            if j['scheme'] == 'category':
                category = j['term']
        # On a real run, create the category if it is not in the database yet,
        # using the category name from WordPress' exported file.
        # A dry run leaves the database untouched.
        if category and not dry_run:
            try:
                Category.objects.get_or_create(
                    name=category,
                    defaults={'slug': slugify(category, entities=True, decimal=True,
                                              hexadecimal=True, separator='-', lowercase=True)})
            except Exception as e:
                print(f'Could not create category {category}: {e}')
        # If there are images in entry['wp_attachment_url'], then we
        # will download and upload them to S3 and return a feature image for the post.
        if post_images:
            feature_image_link_on_s3 = get_images_n_process(image_list=post_images, dry_run=dry_run)
            if feature_image_link_on_s3:
                feature_image_link_on_s3 = feature_image_link_on_s3.strip('/')
        slug = slugify(post_title, entities=True, decimal=True, hexadecimal=True,
                       separator='-',
                       lowercase=True)
        # Get YouTube link from each post
        # Note: re.search() returns only the first match, so at most one link
        # per post is collected here.
        youtube_link_group = []
        if 'https://youtu.be' in post_content or 'http://youtu.be' in post_content:
            youtube_link = re.search(YOUTUBE_REGEX, post_content)
            if youtube_link:
                youtube_link_group.append(youtube_link.group())
        return_list = {'post_title': post_title, 'slug': slug, 'post_content': post_content,
                       'feature_image_link_on_s3': feature_image_link_on_s3, 'published_date': published_date,
                       'category': category, 'tag_list': tag_list, 'youtube_link_group': youtube_link_group}
        return return_list
    def create_post(dry_run, post_title, slug, post_content, feature_image_link_on_s3, published_date, category,
                    tag_list, youtube_link_group):
        if dry_run is not True:
            try:
                post = Post(
                    title=post_title,
                    slug=slug,
                    author=Account.objects.get(pk=1),
                    content=post_content,
                    feature_image=feature_image_link_on_s3,
                    publish_date=published_date,
                    status=2,
                    categories=Category.objects.filter(name=category).first(),
                )
                post.save()
            except Exception as e:
                # Re-raise so the caller skips this post instead of failing later on a missing post.
                print(f'Could not save post "{post_title}": {e}')
                raise
            for tag in tag_list:
                post.tags.add(tag)
            for link in youtube_link_group:
                # Attach each link to the post that was just saved.
                YouTubeLink.objects.create(
                    post=post,
                    video=link
                )
        else:
            # Dry run: write what would be created to the log file instead.
            with open(
                    f'scripts/dry_run_importing_wp_posts_{datetime.now().strftime("%m_%d_%Y")}_{random_string}.txt',
                    'a',
                    encoding='utf-8') as f:
                f.write(
                    f"Creating post title: {post_title}.\n"
                    f"Creating slug: {slug}.\n"
                    f"Assigning author: {Account.objects.get(pk=1)} to post: {post_title}.\n"
                    f"Creating post content: {post_content}.\n"
                    f"Feature image link is: {feature_image_link_on_s3}.\n"
                    f"Publish date is: {published_date}.\n"
                    f"Post status is: 2.\n"
                    f"Post's category name: {category}.\n"
                    f"Tags are: {', '.join(tag_list)}.\n"
                    f'YouTube videos are: {", ".join(youtube_link_group)}\n\n\n\n\n')
    def main(dry_run):
        count = 0
        file_counts = len(WP_EXPORTED_FILES)
        # Here we begin to capture user input to run the imported WordPress XML files.
        while count < file_counts:
            user_input = input(
                f'Do you want to run {WP_EXPORTED_FILES[count]} file? Type "y" for yes "n" for no. ')
            if user_input.lower() == 'n':
                break
            # Parse the XML file into a list of entries.
            imported_posts = get_things_done(file=WP_EXPORTED_FILES[count])
            # Import certain amount of posts
            # for i in imported_posts[-15:]:
            # Import all posts
            for i in imported_posts:
                wp_post_id_df_list = []
                # Read wp_post_id.csv to get the IDs of posts that were already imported.
                # If the file is not available, just ignore and continue.
                try:
                    wp_post_id_df = pd.read_csv('scripts/wp_post_id.csv')
                    wp_post_id_df_list = [int(j) for j in wp_post_id_df['wp_post_id']]
                except Exception:
                    pass
                # Get the wp_post_id value and the post status from the parsed results.
                wp_post_id = i['wp_post_id']
                wp_status = i['wp_status']
                try:
                    # Import only published posts whose ID is not recorded yet. The original
                    # condition mixed `or` and `and` without parentheses, which let
                    # unpublished entries through.
                    if wp_status == 'publish' and int(wp_post_id) not in wp_post_id_df_list:
                        results = process_entry(entry=i, dry_run=dry_run)
                        create_post(dry_run=dry_run, post_title=results['post_title'], slug=results['slug'],
                                    post_content=results['post_content'],
                                    feature_image_link_on_s3=results['feature_image_link_on_s3'],
                                    published_date=results['published_date'], category=results['category'],
                                    tag_list=results['tag_list'],
                                    youtube_link_group=results['youtube_link_group'])
                        # Append the wp_post_id to the CSV file. On the first run the read above
                        # yields nothing; on later runs the recorded IDs tell the script
                        # not to import the same post again.
                        with open('scripts/wp_post_id.csv', 'a', newline='', encoding='utf-8') as f:
                            csv.writer(f).writerow([wp_post_id])
                        time.sleep(2)
                except Exception as e:
                    # Report the failure and move on to the next post.
                    print(f'Skipping post {wp_post_id}: {e}')
            count += 1
    def run_everything():
        start_time = time.time()
        dry_run = True
        user_input = input('Do you want to run a dry run? Type "y" for yes "n" for no. ')
        if user_input.lower() == 'n':
            dry_run = False
        # If wp_post_id.csv does not exist yet, create it with the header 'wp_post_id'.
        # This file lets the script resume after a failure without importing
        # a post with the same ID again.
        if not Path('scripts/wp_post_id.csv').exists():
            with open('scripts/wp_post_id.csv', 'a', newline='', encoding='utf-8') as f:
                csv.writer(f).writerow(['wp_post_id'])
        main(dry_run)
        print(f'---time--- {time.time() - start_time}')
        print('All WordPress posts are now imported.')
    run_everything()
You must customize this script somewhat to your needs, because it reads environment variables for the Boto3/S3 settings. The script also relies on feedparser to parse the exported WordPress XML files, so you must install that module too, along with the other packages imported at the top of the script (boto3, pandas, requests, and slugify). Don't forget to install the Embed Video module for Django so the YouTube links the script saves can be displayed as YouTube videos.
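Since the script reads its S3 settings with os.getenv, a missing variable only shows up later as a confusing error. A quick check like the sketch below could be run first. The variable names are the ones the script reads; missing_env_vars is a hypothetical helper that is not part of the script.

```python
import os

# The environment variables the migration script reads with os.getenv().
REQUIRED_ENV_VARS = (
    'AWS_ACCESS_KEY_ID',
    'AWS_SECRET_ACCESS_KEY',
    'REGION_NAME',
    'BUCKET_NAME',
    'S3_BUCKET_FOLDER_PATH',
)


def missing_env_vars(environ=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_ENV_VARS if not environ.get(name)]


if __name__ == '__main__':
    missing = missing_env_vars()
    if missing:
        print('Set these environment variables first: ' + ', '.join(missing))
    else:
        print('All S3 settings are present.')
```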
Update: In the part of the script commented as # Get YouTube link from each post, I forgot to implement a loop that collects multiple YouTube links for each blog post. I guess you could update the script with a while loop of sorts, or a for loop that handles a predetermined number of YouTube links per post. Doing this would allow the script to convert multiple YouTube links into YouTube videos for each blog post.
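One way to write that loop is with re.finditer, which yields every match, where re.search stops at the first. The sketch below is an assumption about how the fix could look, not the script's actual code. find_youtube_links is a hypothetical name, and the pattern is a tightened variant of YOUTUBE_REGEX: it drops the greedy .+\?v= branch, which can swallow two links on the same line into one match, and it assumes video IDs are 11 characters made of letters, digits, underscores, and hyphens.

```python
import re

# Tightened variant of the script's YOUTUBE_REGEX (an assumption, see the note above).
YOUTUBE_LINK = re.compile(
    r'(?:https?://)?(?:www\.)?'
    r'(?:youtube|youtu|youtube-nocookie)\.(?:com|be)/'
    r'(?:watch\?v=|embed/|v/)?'
    r'([\w-]{11})'
)


def find_youtube_links(content: str) -> list:
    """Return every distinct YouTube link in content, in order of appearance."""
    links = []
    for match in YOUTUBE_LINK.finditer(content):
        link = match.group(0)
        # Skip repeats so the same video is not embedded twice.
        if link not in links:
            links.append(link)
    return links


if __name__ == '__main__':
    sample = ('Intro https://youtu.be/dQw4w9WgXcQ middle '
              'https://www.youtube.com/watch?v=abcdefghijk end')
    for link in find_youtube_links(sample):
        print(link)
```

In the script, the list returned by a helper like this could be assigned to youtube_link_group, and create_post would then save one YouTubeLink per entry.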
In summary, if you have already coded a blog application with Django and want to import WordPress blog posts into it, you can use the Python script I posted above. Be sure to install Django Extensions so you can run the script with its runscript command (python manage.py runscript followed by the script's file name without the .py extension). Without this module, you won't be able to run the script above. I recommend limiting each migration run to around 200 posts or fewer. You can do this in the section I commented as # Import all posts. Each time you migrate, the script appends the unique WordPress wp_post_id of every imported post to a file named wp_post_id.csv. This speeds up later migrations, because blog posts that have already been migrated are skipped. A dry run appends IDs to this file too. To reset the whole process, delete this file.
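The batching and resume idea can be sketched on its own with the standard library. In the sketch below, load_imported_ids and next_batch are hypothetical helpers, and the batch size of 200 simply mirrors the recommendation above. It assumes wp_post_id.csv has the wp_post_id header that the script writes.

```python
import csv
from pathlib import Path


def load_imported_ids(csv_path):
    """Read the IDs of already-imported posts; a missing file means nothing was imported."""
    path = Path(csv_path)
    if not path.exists():
        return set()
    with path.open(newline='', encoding='utf-8') as f:
        reader = csv.DictReader(f)
        return {int(row['wp_post_id']) for row in reader if row.get('wp_post_id')}


def next_batch(entries, imported_ids, batch_size=200):
    """Return up to batch_size entries whose wp_post_id has not been imported yet."""
    pending = [e for e in entries if int(e['wp_post_id']) not in imported_ids]
    return pending[:batch_size]


if __name__ == '__main__':
    # Entries are stand-ins for the dictionaries feedparser returns.
    entries = [{'wp_post_id': str(n)} for n in range(1, 6)]
    batch = next_batch(entries, imported_ids={1, 2}, batch_size=2)
    print([e['wp_post_id'] for e in batch])
```

Loading the IDs into a set once per run avoids re-reading the CSV file for every post, which matters more as the file grows.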