Wednesday, April 1, 2009

BLOB storage in the cloud with PBMS

I am pleased to announce a cloud storage version of the PBMS engine.

What I have created is a version of PBMS that stores the BLOB data in a cloud. When a request for the data comes in, the client is sent a redirect to get the BLOB directly from the cloud. BLOB reference tracking and metadata are handled the same as before, in that they are stored in the BLOB record in the repository, but the actual BLOB is stored somewhere else.

This has several advantages over storing the BLOB in the repository record:
  1. It reduces the disk storage requirement of the database server’s machine.
  2. It reduces the bandwidth requirement of the database server’s machine.
The beauty of it is that the client application doesn’t need to know anything about the cloud because it is all handled by the PBMS engine and PBMS lib.

Here is a diagram showing how an insert works:

  • Step 1: The BLOB is sent to the PBMS engine. Ideally this is done by using the PBMS lib to send it to the PBMS engine’s HTTP server and then inserting the returned BLOB reference into the table in place of the BLOB. Optionally, the BLOB can be inserted directly with an ‘insert’ statement and the PBMS engine will replace it with a BLOB reference internally. The first method is preferable, though, since it has less impact on server performance and reduces the server’s memory usage: the BLOB is streamed into the repository file and is never held in memory.
  • Step 2: A repository record is created containing the BLOB data.
(Steps 1 and 2 are the same for both cloud and non-cloud versions of PBMS.)
  • Step 3: The BLOB is scheduled for upload to the cloud. A separate thread, called the ‘cloud’ thread, which operates in the background, performs the upload.
  • Step 4: The local BLOB data is scheduled for deletion. This action can be delayed for a time to ensure that the data is actually accessible by the clients before the local copy of the BLOB is deleted. The repository record remains but the space used by the actual BLOB is added to the repository’s garbage count and the space will be reclaimed later by the PBMS compactor thread.
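
The four insert steps above can be sketched roughly as follows. This is only an illustrative model in Python, not PBMS code: the `Repository` and `RepoRecord` classes, the `blob/<id>` key scheme, and the use of a plain dict as the "cloud" are all assumptions made for the example. It shows a background 'cloud' thread draining an upload queue, and the local copy being dropped (and counted as garbage) only after the upload has completed.

```python
import queue
import threading
from dataclasses import dataclass
from typing import Optional


@dataclass
class RepoRecord:
    """Hypothetical repository record: metadata plus the optional local BLOB."""
    blob_id: int
    metadata: dict
    local_data: Optional[bytes]
    cloud_key: Optional[str] = None


class Repository:
    """Toy model of the insert path; a dict stands in for the cloud store."""

    def __init__(self, cloud: dict):
        self.cloud = cloud
        self.records = {}
        self.garbage_bytes = 0  # space a compactor would reclaim later
        self.upload_queue = queue.Queue()
        self._next_id = 1
        # Step 3 is carried out by a background 'cloud' thread.
        worker = threading.Thread(target=self._cloud_thread, daemon=True)
        worker.start()

    def insert(self, data: bytes, metadata: dict) -> int:
        # Steps 1-2: store the BLOB in a repository record.
        record = RepoRecord(self._next_id, metadata, data)
        self._next_id += 1
        self.records[record.blob_id] = record
        # Step 3: schedule the upload; the caller does not wait for it.
        self.upload_queue.put(record.blob_id)
        return record.blob_id

    def _cloud_thread(self) -> None:
        while True:
            blob_id = self.upload_queue.get()
            record = self.records[blob_id]
            key = f"blob/{blob_id}"  # hypothetical key scheme
            self.cloud[key] = record.local_data
            record.cloud_key = key
            self.upload_queue.task_done()

    def delete_uploaded_local_copies(self) -> None:
        # Step 4: called after some delay; the record stays, the data goes.
        for record in self.records.values():
            if record.cloud_key is not None and record.local_data is not None:
                self.garbage_bytes += len(record.local_data)
                record.local_data = None
```

In this sketch the delay before step 4 is left to the caller, who decides when to invoke `delete_uploaded_local_copies()`; a real implementation would presumably schedule it on a timer.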

Here is how the BLOB is accessed. Keep in mind that all the client application does is provide the PBMS lib with a PBMS BLOB reference, as it currently does, and receive the BLOB in return. It knows nothing about the cloud.

  • Step 1: A BLOB request containing a PBMS BLOB reference is sent to the PBMS engine’s HTTP server.
  • Step 2: The BLOB’s repository record is read from local storage.
  • Step 3: An HTTP redirect reply is sent back to the client, redirecting the request to the BLOB stored in the cloud. The metadata associated with the BLOB is returned to the client in the reply’s HTTP headers. The redirect URL contains an authenticated query string that gives the client time-limited access to the BLOB data. Use of an authenticated query string allows the data in the cloud to have access protection without requiring the client applications to know the private key normally required to get access.
  • Step 4: The redirect is followed to the cloud and the BLOB data is retrieved.
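
To make step 3 more concrete, here is a minimal sketch of how such a redirect could be built. It is written in Python for illustration and is not the PBMS implementation. The signing follows the legacy S3 query string authentication scheme (HMAC-SHA1 over the verb, expiry time and resource); the function names, the 307 status code, and the `X-PBMS-` header prefix are assumptions for this example, and object keys are assumed to be simple ASCII paths.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def sign_blob_url(bucket, key, access_key, secret_key, expires):
    """Build a time-limited GET URL; `expires` is a Unix timestamp."""
    path = quote(key)
    # Verb, empty Content-MD5, empty Content-Type, expiry, then the resource.
    string_to_sign = f"GET\n\n\n{expires}\n/{bucket}/{path}"
    digest = hmac.new(
        secret_key.encode(), string_to_sign.encode(), hashlib.sha1
    ).digest()
    signature = quote(base64.b64encode(digest).decode(), safe="")
    return (
        f"https://{bucket}.s3.amazonaws.com/{path}"
        f"?AWSAccessKeyId={access_key}&Expires={expires}&Signature={signature}"
    )


def redirect_reply(metadata, url):
    """Return (status, headers) for the redirect sent back to the client."""
    headers = {"Location": url}
    for name, value in metadata.items():
        # Hypothetical header naming; the post does not say how PBMS names them.
        headers[f"X-PBMS-{name}"] = value
    return 307, headers
```

Only the server holds `secret_key`. The client just follows the `Location` header, and the URL stops working once the `Expires` time has passed.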

Note the absence of lines passing through the MySQL server. All of this is done outside of the database server, freeing the database server to process database queries rather than serve up BLOB data.

The current version I am working on uses Amazon S3 storage, but the cloud access is done via a fairly simple class interface, which should be easy to implement for any Amazon S3-like storage service.
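
As a rough idea of what such a class interface might look like, here is a sketch in Python. The class and method names (`CloudStore`, `put`, `signed_url`, `delete`) are made up for this example and are not the actual PBMS interface. The in-memory implementation exists only to show that a second backend needs nothing more than these three methods.

```python
from abc import ABC, abstractmethod


class CloudStore(ABC):
    """Hypothetical minimal interface a cloud backend would implement."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None:
        """Upload a BLOB under the given key."""

    @abstractmethod
    def signed_url(self, key: str, expires: int) -> str:
        """Return a time-limited URL the client can be redirected to."""

    @abstractmethod
    def delete(self, key: str) -> None:
        """Remove a BLOB that is no longer referenced."""


class InMemoryStore(CloudStore):
    """Stand-in backend for testing; keeps BLOBs in a dict."""

    def __init__(self):
        self.blobs = {}

    def put(self, key: str, data: bytes) -> None:
        self.blobs[key] = data

    def signed_url(self, key: str, expires: int) -> str:
        # Not a real signature; only shows the shape of the return value.
        return f"memory://{key}?Expires={expires}"

    def delete(self, key: str) -> None:
        self.blobs.pop(key, None)
```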

If you want to find out more, be sure to attend my talk at the MySQL conference, “BLOB Streaming: Efficient Reliable BLOB Handling for all Storage Engines”.