A client of mine runs WooCommerce on a WordPress site. They have a lot of orders going through this site. One thing I bumped my head against for a long time was how to create GDPR-safe backups that I can use in my local development environment.
The challenge with WordPress
The hard part with WordPress is that many things resolve back to a post: a post is a post (duh), but pages, products, and even navigation menu items are posts too. And WooCommerce orders are also posts. This means that all data of these types is saved in two tables: wp_posts and wp_postmeta. This makes for a pretty flexible programming experience: adding extra data to any of these types is easy. But getting this data safely out of the database is something else entirely.
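You can see this in practice by grouping wp_posts by post_type. This is just an illustrative query; it assumes shell access to the database and the default wp_ table prefix:

```bash
# List every post_type and how many rows it has; assumes the default "wp_"
# table prefix and the database credentials from your wp-config.php.
mysql -u wordpress -p wordpress -e \
  "SELECT post_type, COUNT(*) AS total FROM wp_posts GROUP BY post_type ORDER BY total DESC;"
# WooCommerce orders typically show up here as 'shop_order', right next to
# 'post', 'page' and 'product' -- all in the same table.
```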
Smile GdprDump
After investigating some tools, I started building my own tool to do this. But while inspecting how other tools handled the database dumping part, I figured out I could use an existing tool for this purpose: Smile GdprDump.
This tool allows you to configure how dumps are created. There are several options: only dump part of a table, use Faker to change data, or dump only the table structure without the data.
I removed the data from the dump as I don't need the orders on my local system.
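As an aside: if you do need some of this data locally, GdprDump can also rewrite values instead of dropping them. A minimal sketch of what that could look like, based on the converters described in the GdprDump documentation (the converter names and columns below are assumptions, so check them against the docs before using them):

```yaml
# Hypothetical example: anonymize user data instead of filtering it out.
tables:
    wp_users:
        converters:
            user_email:
                converter: randomizeEmail   # converter names: verify against the GdprDump docs
            display_name:
                converter: anonymizeText
```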
Installation is simple: install it through Composer:
```bash
composer require --dev smile/gdpr-dump
```
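After that, the binary lives in vendor/bin. Assuming a standard Composer setup, a quick way to confirm it is available:

```bash
# Print the usage information of the freshly installed tool
# (standard Symfony Console flags apply).
vendor/bin/gdpr-dump --help
```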
Configuration
I ended up with this configuration:
```yaml
---
database:
    host: '%env(DB_HOST)%'
    user: '%env(DB_USER)%'
    password: '%env(DB_PASSWORD)%'
    name: '%env(DB_NAME)%'

dump:
    output: 'wordpress-development-dump.sql.gz'
    compress: 'gzip'

tables:
    wp_comments:
        truncate: true
    wp_wc_customer_lookup:
        truncate: true
    wp_woocommerce_order_itemmeta:
        truncate: true
    wp_posts:
        filters:
            - ['post_type', 'in', ['nav_menu_item', 'page', 'post', 'product', 'product_variation']]
    wp_postmeta:
        filters:
            - ['post_id', 'in', 'expr: select ID from wp_posts where post_type IN ("nav_menu_item", "page", "post", "product", "product_variation")']
    wp_users:
        filters:
            - ['user_email', 'in', 'expr: select user_email from wp_users where user_email like "%@controlaltdelete.nl"']
    wp_usermeta:
        filters:
            - ['user_id', 'in', 'expr: select ID from wp_users where user_email like "%@controlaltdelete.nl"']
```
This does a few things:
- It creates a gzipped file named wordpress-development-dump.sql.gz.
- It truncates wp_comments, wp_wc_customer_lookup, and wp_woocommerce_order_itemmeta. This means the table structure is added to the dump, but not the data.
- For wp_posts and wp_postmeta it only dumps rows of type nav_menu_item, page, post, product, or product_variation. All other types are skipped, which includes things like orders.
- For wp_users and wp_usermeta it only dumps data for users whose email address ends in @controlaltdelete.nl. All other users are skipped.
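The resulting dump can be imported into a local development database like any other gzipped SQL file. For example (assuming a local database named wordpress and root credentials):

```bash
# Unpack the dump on the fly and pipe it straight into the local database.
gunzip -c wordpress-development-dump.sql.gz | mysql -u root -p wordpress
```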
It is always a good idea to double-check the project you are working on; there may be more privacy-sensitive data in other tables. I like to search for @gmail.com in the dump to find extra tables that need changing.
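Since the dump is gzipped, zgrep works on it directly. A quick sanity check could look like this (the pattern is just an example; use whatever domains your customers actually use):

```bash
# Count how many lines in the dump still contain a customer-looking address;
# any hit means another table needs a truncate rule or a converter.
zgrep -c "@gmail.com" wordpress-development-dump.sql.gz
```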
Running the GDPR dumper
I have created a little bash script that runs this tool for me. This is the content of that file:
```bash
export DB_HOST=db
export DB_USER=wordpress
export DB_PASSWORD=wordpress
export DB_NAME=wordpress

/var/www/html/vendor/bin/gdpr-dump /var/www/html/gdpr-dump/wordpress.yml
```
I'm running all my projects in Docker, so the path is always the same. Calling it looks like this:
```bash
sh create-gdpr-safe-backup.sh
```
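In practice that call is wrapped in docker exec so it runs inside the project's container. The container name and script location below are assumptions; adjust them to your own setup:

```bash
# Run the backup script inside the running WordPress container
# (replace "wordpress-app" and the path with your actual values).
docker exec wordpress-app sh /var/www/html/create-gdpr-safe-backup.sh
```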
Automating the GDPR dump
Now that we have this working, we can take it a step further. I've created a GitHub Action that performs these steps:
1. Download a backup from Amazon S3. This is an automated backup that is created daily.
2. Start a MySQL server (within GitHub Actions) and import the dump into it.
3. Run the GDPR dump program.
4. Upload the result back to S3, in a separate folder.
The file looks like this:
```yaml
name: Create development dump

on:
  workflow_dispatch:
  schedule:
    - cron: '33 9 * * 1'

jobs:
  create-development-dump:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Set up AWS credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: eu-central-1

      - name: Download latest backup
        run: |
          BUCKET="s3://supersecret.com-backup/"
          OBJECT="$(aws s3 ls $BUCKET --recursive | grep ".zip" | sort | tail -n 1 | awk '{print $4}')"
          aws s3 cp s3://supersecret.com-backup/$OBJECT gdpr-dump/ --no-progress
          unzip gdpr-dump/$OBJECT supersecret.com.sql -d gdpr-dump/backup/

      - name: Start mysql & import database
        run: |
          sudo /etc/init.d/mysql start
          mysql -e 'CREATE DATABASE wordpress;' -uroot -proot
          mysql -e "ALTER USER 'root'@'localhost' IDENTIFIED WITH mysql_native_password BY 'root';" -uroot -proot
          cat gdpr-dump/backup/supersecret.com.sql | mysql --binary-mode -uroot -proot wordpress

      - name: Setup PHP
        uses: shivammathur/setup-php@v2
        with:
          php-version: '8.0'

      - name: Run Composer Install
        run: |
          composer install --no-interaction --no-progress

      - name: Run GDPR dump
        run: |
          DB_HOST=localhost DB_NAME=wordpress DB_USER=root DB_PASSWORD=root vendor/bin/gdpr-dump gdpr-dump/wordpress.yml

      - name: Upload to S3
        run: aws s3 cp ./wordpress-development-dump.sql.gz s3://supersecret.com-backup/development-dump/
```
For safety purposes, I have created an IAM user that has read-only access to the full backups and read/write access to the development-dump folder.
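A rough sketch of what such a policy could look like, attached with the AWS CLI. The user name is made up and the statement is illustrative rather than the exact policy I use:

```bash
# Attach an inline policy to the dump user: read everything in the bucket,
# but only write inside the development-dump/ prefix.
aws iam put-user-policy \
  --user-name gdpr-dump-ci \
  --policy-name gdpr-dump-access \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": ["s3:ListBucket", "s3:GetObject"],
        "Resource": [
          "arn:aws:s3:::supersecret.com-backup",
          "arn:aws:s3:::supersecret.com-backup/*"
        ]
      },
      {
        "Effect": "Allow",
        "Action": ["s3:PutObject"],
        "Resource": "arn:aws:s3:::supersecret.com-backup/development-dump/*"
      }
    ]
  }'
```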
This script will run once a week, as that's more than enough for me.
What's next?
I don't need this automated for development purposes: I don't have to reset my local database that often. But my end goal is to automate the WordPress (plugin) updates. For this, I need end-to-end tests that can run in the GitHub Actions pipeline. This backup is a big part of that, as it will get imported into the temporary environment before running the tests.