Friday, November 6, 2009

PBMS backup and S3 storage

Hi,

So I had to make some changes to the way backup works in PBMS so that it behaves as expected when the BLOB data is stored remotely on an S3 server.

I will go over how the PBMS repository is backed up first and then explain how it works with S3 storage.

PBMS provides a pbms_backup table in its 'pbms' database that records all the backups that have been performed on any of the PBMS repositories on the server. The create statement for this table looks like this:

CREATE TABLE pbms.pbms_backup (
id INT NOT NULL AUTO_INCREMENT,
Database_Name VARCHAR(64) NOT NULL,
Started TIMESTAMP,
Completed TIMESTAMP,
IsRunning BOOL,
IsDump BOOL,
Location VARCHAR(1024),
Cloud_Ref INT,
Cloud_backup_no INT,
PRIMARY KEY (id)
)

There are two ways to back up a PBMS repository:

  • MySQLDump: To back up the repository using 'mysqldump' you simply dump the database's 'pbms_dump' table. This is a one-column table of type 'long blob'. Any select from this table is assumed to be a repository backup, and any insert into it is assumed to be a repository recovery. Selecting the data from the 'pbms_dump' table results in a new record being added to the 'pbms.pbms_backup' table. To recover the repository, the dumped data is inserted back into the 'pbms_dump' table. During the recovery it is important to set the PBMS variable "Restoring-Dump" in the database's 'pbms_variable' table to "TRUE". This tells the PBMS engine that the database is being recovered and that insertion of BLOB references should not increment the BLOB reference counts. (See the sketch after this list.)
  • Engine level backup: To perform an engine level backup you simply insert a record into the 'pbms.pbms_backup' table providing the name of the database to be backed up and the location into which the backup should be placed. This starts an asynchronous backup operation. To recover the repository from the backup, all you do is drag and drop the backup into your recovered database. (Again, see the sketch below.)
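
Here is a rough SQL sketch of both methods (the database name 'myDB' and the backup location are made up):

-- Method 1: dump based backup. Selecting from pbms_dump streams the
-- repository out as BLOB data; this is what happens when you run
-- something like 'mysqldump myDB pbms_dump'.
SELECT * FROM myDB.pbms_dump;

-- To recover, flag the restore first so that the inserted BLOB
-- references do not increment the BLOB reference counts, then insert
-- the dumped rows back (and, presumably, clear the flag afterwards):
UPDATE myDB.pbms_variable set value = "TRUE" where name = "Restoring-Dump";
-- ... the dumped INSERTs into myDB.pbms_dump go here ...
UPDATE myDB.pbms_variable set value = "FALSE" where name = "Restoring-Dump";

-- Method 2: engine level backup. Inserting this record starts an
-- asynchronous backup of myDB's repository into the given location:
INSERT INTO pbms.pbms_backup (Database_Name, Location) VALUES ("myDB", "/backups/myDB");

-- Because the backup is asynchronous, its progress can be watched in
-- the backup table itself:
SELECT id, IsRunning, Started, Completed FROM pbms.pbms_backup;
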
So how does this work when the actual BLOB data is stored on an S3 server?

What happens is that the backup process makes a copy of the BLOB data on the S3 server. In the case of an engine level backup the user can specify the S3 server that the BLOBs should be backed up to, which may be a different server entirely. BLOB data is stored on the S3 server under generated names matching the pattern:

<database_id>/<backup_no>/<cloud_ref>.<time_stamp>.<sequence_count>
Example: 1257393091/0/87.1257393139.626

  • database_id is the id of the database as provided by the PBMS engine.
  • backup_no is the backup number of the BLOB. For BLOBs that are not backup copies this number is always zero. The backup number is just a sequence counter that ensures that the backup BLOBs have a unique name. All backup BLOBs from the same backup will have the same backup number. This backup number is stored in the 'Cloud_backup_no' column of the 'pbms.pbms_backup' table.
  • cloud_ref is the reference into the 'pbms.pbms_cloud' table that refers to the S3 server and bucket in which the BLOB is stored.
  • time_stamp is the time in seconds at which the BLOB was created.
  • sequence_count is just a counter to ensure that no two BLOBs get the same name.
When a backup is started, a check is done to find an unused backup number. Then, as each BLOB repository record is backed up, the S3 server is sent a request to copy the BLOB to its new name with the new backup number, possibly in a different bucket or on a different server.

The first step of the database recovery is to delete any BLOBs from the original database. This is done by performing the S3 equivalent of 'rm -r <database_id>/0/*'.

Recovery from a mysqldump backup is just the reverse of the backup: as each BLOB repository record is written back to the repository, the BLOB is copied back to its original name and location with the backup number set to zero. The 'cloud_ref' number is used to look up the S3 server location from which the original BLOB came.

Recovery from an engine level backup is a bit different because the repository recovery is just a drag and drop operation. The first time the database is accessed, the PBMS engine sees that the repository has just been recovered and starts the S3 BLOB recovery process. To recover the S3 BLOBs, a list of all the BLOBs in the backup is made; then, using the 'cloud_ref' number from each BLOB's name, the original location of the BLOB is found and the backed-up BLOB is copied back to it.

The nice thing about using S3 storage for the BLOBs is that the database's BLOB repository can then be backed up simply by using mysqldump.

When deleting old backups it is important to remember that there may be BLOBs on an S3 server that also need to be cleaned up. This is where you can make use of the 'pbms.pbms_backup' table to find the location of the backed-up BLOBs and delete them using a tool such as the "S3 FireFox Organizer". After that the backup record can be deleted from the 'pbms.pbms_backup' table, as sketched below. I could have had the PBMS engine delete all the backed-up BLOBs for a backup when its record was deleted from the 'pbms.pbms_backup' table, but I thought that could lead to lost data if the user did not realize the side effects of deleting a backup record.
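
For example, something like this (a sketch; the backup id 3 is made up):

-- Find the S3 server, bucket and backup number for the old backup:
SELECT b.Cloud_backup_no, c.Server, c.Bucket
FROM pbms.pbms_backup b
JOIN pbms.pbms_cloud c ON c.ID = b.Cloud_Ref
WHERE b.id = 3;

-- After removing the '<database_id>/<backup_no>/...' objects from
-- that bucket with your S3 tool of choice, drop the record itself:
DELETE FROM pbms.pbms_backup WHERE id = 3;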

Barry

Thursday, November 5, 2009

PBMS Cloud storage is back!

Hi,

Support for S3 BLOB storage has now been fully integrated into the PBMS engine. It works in much the same way as I described in an earlier post, but with some important changes, so I will explain it all again here.

When using S3 BLOB storage with PBMS, the BLOB reference tracking and metadata are handled the same as before, in that they are stored in the BLOB's record in the repository, but the actual BLOB data is stored on an S3 server.

To set up S3 storage you need to add an S3 cloud reference record to the pbms.pbms_cloud table provided by PBMS. For example:

INSERT INTO pbms.pbms_cloud (ID, Server, Bucket, PublicKey, PrivateKey)
VALUES (16, "S3.amazonaws.com", "PBMS-Test", "abc123", "amjr15vWq");
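
For reference, the pbms_cloud table itself looks something like this (a reconstruction from the columns used above; the exact types are my assumptions):

CREATE TABLE pbms.pbms_cloud (
ID INT NOT NULL,
Server VARCHAR(1024) NOT NULL,
Bucket VARCHAR(128) NOT NULL,
PublicKey VARCHAR(128) NOT NULL,
PrivateKey VARCHAR(128) NOT NULL,
PRIMARY KEY (ID)
)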

Then you need to tell PBMS which databases should use S3 cloud storage for their BLOBs. This is done by updating a couple of records in the pbms_variable table that PBMS provides for each user database. For example, to set up the database "myDB" for S3 cloud storage you would do the following:

UPDATE myDB.pbms_variable set value = "16" where name = "S3-Cloud-Ref";
UPDATE myDB.pbms_variable set value = "CLOUD" where name = "Storage-type";

The database "myDB" is now set up for cloud storage. All BLOB data will now be stored in the bucket "PBMS-Test" on the S3 server "S3.amazonaws.com".
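
If you want to double-check the settings you can simply read the two variables back (a sketch, assuming pbms_variable can be queried like any other table):

SELECT name, value FROM myDB.pbms_variable WHERE name IN ("S3-Cloud-Ref", "Storage-type");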

This diagram shows the steps taken when the PBMS client library uploads a BLOB to the PBMS repository using S3 cloud storage. All of these steps are performed by one call to the PBMS client library, and the client application knows nothing about the type of BLOB storage being used:

[Diagram: uploading a BLOB to the PBMS repository with S3 cloud storage]

  • Step 1: The BLOB metadata is sent to the PBMS engine.
  • Step 2: A repository record is created containing the BLOB metadata.
  • Step 3: A reply is sent back to the client containing the BLOB reference, which is passed back up to the client application to be inserted into the user's table in place of the BLOB. An S3 authorization signature is also returned to the client. The authorization signature is generated by the PBMS engine, using the public/private keys for the S3 server, to sign the BLOB upload request.
  • Step 4: The PBMS client library uses the authorization signature to upload the BLOB to the S3 server.
Alternatively, the BLOB can be inserted directly into the user's table, in which case the PBMS engine will upload it to the S3 server itself. This is not the preferred way of doing things because it forces the MySQL server to handle the BLOB data, which is what we are trying to avoid by using the PBMS engine.
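
As a sketch, such a direct insert is just an ordinary INSERT; the table 'photos' and its columns are hypothetical:

-- LOAD_FILE() is just one way to get the BLOB data into the statement:
INSERT INTO myDB.photos (id, image) VALUES (1, LOAD_FILE("/tmp/photo.jpg"));

Here the BLOB data passes through the MySQL server, and the PBMS engine then uploads it to the S3 server on the application's behalf.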


Here is how the BLOB is accessed. Keep in mind that, as before, all the client application does is provide the PBMS library with a PBMS BLOB reference and receive the BLOB in return; it knows nothing about the cloud.

[Diagram: retrieving a BLOB from the cloud via an HTTP redirect]

  • Step 1: A BLOB request containing a PBMS BLOB reference is sent to the PBMS engine’s HTTP server.
  • Step 2: The BLOB’s repository record is read from local storage.
  • Step 3: An HTTP redirect reply is sent back to the client redirecting the request to the BLOB stored in the cloud. The metadata associated with the BLOB is returned to the client in the reply’s HTTP headers. The redirect URL is an authenticated query string that gives the client time limited access to the BLOB data. Use of an authenticated query string allows the data in the cloud to have access protection without requiring the client applications to know the private key normally required to get access.
  • Step 4: The redirect is followed to the cloud and the BLOB data is retrieved.
The beauty of this system is that the client applications need never know how or where the actual BLOB data is stored, and since the BLOB transfer takes place directly between the client and the S3 server, the BLOB data doesn't need to be handled at all by the machine on which the MySQL server is running.

Because the BLOB repository records refer back to the S3 server indirectly, via the pbms_cloud table, the administrator is free to move the BLOB data between S3 servers and/or buckets without having to do anything more than update the record in the pbms_cloud table. For example, given the following setup:

[Diagram: BLOB repository records referring to the Amazon S3 server through pbms_cloud record 17]

all that would be required to relocate the BLOB data from the Amazon S3 server to the Sun S3 server would be to:

  • Copy the BLOB data from the Amazon server to the Sun server.
  • Update the pbms_cloud table: UPDATE pbms.pbms_cloud set Server = "S3.sunaws.com", Bucket = "B3", PublicKey = "zft123", PrivateKey = "abc123" where ID = 17;
  • Delete the BLOBs on the Amazon server.
resulting in:

[Diagram: the same records now referring to the Sun S3 server through the updated pbms_cloud record 17]

I will explain how the new BLOB repository backup system works with cloud storage in another blog entry.

Barry