Friday, November 6, 2009

PBMS backup and S3 storage

Hi,

So I had to make some changes to the way backup works with PBMS so that it behaves as expected when the BLOB data is stored remotely on an S3 server.

I will go over how the PBMS repository is backed up first and then explain how it works with S3 storage.

PBMS provides a pbms_backup table in its 'pbms' database that records all the backups that have been performed on any of the PBMS repositories on the server. The create statement for this table looks like this:

CREATE TABLE pbms.pbms_backup (
  id INT NOT NULL AUTO_INCREMENT,
  Database_Name VARCHAR(64) NOT NULL,
  Started TIMESTAMP,
  Completed TIMESTAMP,
  IsRunning BOOL,
  IsDump BOOL,
  Location VARCHAR(1024),
  Cloud_Ref INT,
  Cloud_backup_no INT,
  PRIMARY KEY (id)
);

There are two ways to back up a PBMS repository:

  • MySQLDump: To back up the repository using 'mysqldump' you simply dump the database's 'pbms_dump' table. This is a one-column table of type 'long blob'. Any select from this table is treated as a repository backup, and any insert into the table is treated as a repository recovery. Selecting the data from the 'pbms_dump' table results in a new record being added to the 'pbms.pbms_backup' table. To recover the repository, the dumped data is inserted back into the 'pbms_dump' table. During the recovery it is important to set the PBMS variable "Restoring-Dump" in the database's 'pbms_variable' table to "TRUE". This tells the PBMS engine that the database is being recovered and that inserting BLOB references should not increment the BLOB reference count (see the sketch just after this list).
  • Engine-level backup: To perform an engine-level backup all you do is insert a record into the 'pbms.pbms_backup' table, providing the name of the database to be backed up and the location in which the backup should be placed. This starts an asynchronous backup operation. To recover the repository from the backup, all you do is drag and drop the backup into your recovered database. Both methods are sketched below.
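
Roughly, the two methods look something like this from the SQL side. This is only a sketch: the database name 'mydb', the dump file name and the backup location are placeholders of my own, not anything PBMS requires.

-- Method 1: back up the repository with mysqldump by dumping the
-- one-column 'pbms_dump' table:
--   shell> mysqldump mydb pbms_dump > mydb_repository.sql

-- Method 2: engine-level backup; inserting this record starts an
-- asynchronous backup of the 'mydb' repository:
INSERT INTO pbms.pbms_backup (Database_Name, Location)
VALUES ('mydb', '/backups/mydb-blobs');

-- The progress of an engine-level backup can be watched in the same table:
SELECT id, Started, Completed, IsRunning
FROM pbms.pbms_backup
WHERE Database_Name = 'mydb';
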
So how does this work when the actual BLOB data is stored on an S3 server?

What happens is that the backup process makes a copy of the BLOB data on the S3 server. In the case of an engine-level backup the user can specify the S3 server to which the BLOBs should be backed up, which may be a different server entirely. BLOB data is stored on the S3 server using generated names matching the pattern:

<database_id>/<backup_no>/<cloud_ref>.<time_stamp>.<sequence_count>
Example: 1257393091/0/87.1257393139.626

  • database_id is the id of the database as provided by the PBMS engine.
  • backup_no is the backup number of the BLOB. For BLOBs that are not backup copies this number is always zero. The backup number is just a sequence counter that ensures that the backup BLOBs have a unique name. All backup BLOBs from the same backup will have the same backup number. This backup number is stored in the 'Cloud_backup_no' column of the 'pbms.pbms_backup' table.
  • cloud_ref is the reference into the 'pbms.pbms_cloud' table that refers to the S3 server and bucket in which the BLOB is stored.
  • time_stamp is the time in seconds at which the BLOB was created.
  • sequence_count is just a counter to ensure that no two BLOBs get the same name.
When a backup is started, a check is done to find an unused backup number. Then, as each BLOB repository record is backed up, the S3 server is sent a request to copy the BLOB to its new name with the new backup number, which may also place it in a different bucket or on a different server.

The first step of the database recovery is to delete any BLOBs from the original database. This is done by performing the S3 equivalent of 'rm -r <database_id>/0/*'.

Recovery from a mysqldump backup is just the reverse of the backup: as each BLOB repository record is written back to the repository, the BLOB is copied back to its original name and location with the backup number set to zero. The 'cloud_ref' number is used to look up the S3 server location from which the original BLOB came.
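
As a rough sketch, the recovery looks something like this from the user's side. The 'Name' and 'Value' column names of the 'pbms_variable' table are my assumption here, and 'mydb' and the dump file name are just placeholders.

-- Tell PBMS that the database is being recovered so that inserting the
-- dumped BLOB references does not increment the BLOB reference counts
-- (the Name/Value column names are assumed):
UPDATE mydb.pbms_variable SET Value = 'TRUE' WHERE Name = 'Restoring-Dump';

-- Reload the dumped repository; inserting into 'pbms_dump' is treated as
-- a repository recovery, and each BLOB is copied back to its original S3
-- name and location as the records are written:
--   shell> mysql mydb < mydb_repository.sql

-- Presumably "Restoring-Dump" is set back to 'FALSE' once the recovery
-- has completed.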

Recovery from an engine level backup is a bit different because the repository recovery is just a drag and drop operation. The first time the database is accessed, the PBMS engine sees that the repository has just been recovered and starts the S3 BLOB recovery process. To recover the S3 BLOBs, a list of all the BLOBs in the backup is made; then, using the 'cloud_ref' number from each BLOB name, the original location of the BLOB is found and the backed-up BLOB is copied back to it.

The nice thing about using S3 storage for the BLOBs is that the database's BLOB repository can then be backed up quite nicely using just mysqldump.

When deleting old backups it is important to remember that there may be BLOBs on an S3 server that also need to be cleaned up. This is where you can make use of the 'pbms.pbms_backup' table to find the location of the backed-up BLOBs and delete them using a tool such as the S3 Firefox Organizer. After that, the backup record can be deleted from the 'pbms.pbms_backup' table. I could have had the PBMS engine delete all the backed-up BLOBs for a backup when its record was deleted from the 'pbms.pbms_backup' table, but I thought that could lead to lost data if the user did not realize the side effects of deleting a backup record.
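
For example, something along these lines can be used to locate an old backup's BLOBs before removing its record (the backup id of 3 is just an example):

-- Find the S3 location and backup number of an old backup's BLOBs;
-- they live under <database_id>/<Cloud_backup_no>/ in the referenced bucket:
SELECT Database_Name, Location, Cloud_Ref, Cloud_backup_no
FROM pbms.pbms_backup
WHERE id = 3;

-- Once those BLOBs have been deleted from the S3 bucket (with the
-- S3 Firefox Organizer or a similar tool), remove the record itself:
DELETE FROM pbms.pbms_backup WHERE id = 3;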

Barry
