Tuesday, April 5, 2011

Why use PBMS?

I have explained to people often enough why they should use PBMS to handle BLOB data, so I was surprised when someone asked me where they could find this information and I discovered I had never actually written it down anywhere. So here it is.

If you are unfamiliar with PBMS, PBMS stands for PrimeBase Media Streaming. For details please have a look at the home page for BLOB Streaming.

 
Neither MySQL nor Drizzle is designed to handle BLOB data efficiently. This is not a storage engine problem: most storage engines can store BLOB data reasonably efficiently. The problem is in the server architecture itself, because BLOB data is transferred to and from the server as part of the regular result set. To do this, both the server and the client must allocate a buffer large enough to hold the entire BLOB. DBMSs that are designed to handle BLOBs, such as Oracle, Sybase, and PrimeBase, all pass the BLOB data outside of the regular result set, which avoids having to buffer the entire BLOB. APIs such as ODBC understand this and provide functions such as SQLGetData() that can be called multiple times to retrieve data in chunks, so that the client doesn’t have to buffer whole BLOBs unless it wants to.
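
To make the buffering difference concrete, here is a minimal Python sketch of the two access patterns. It is only an illustration: it uses in-memory streams as a stand-in for a database connection, it is not the ODBC or PBMS API, and the names `read_whole`, `read_in_chunks`, and `CHUNK_SIZE` are hypothetical.

```python
import io

CHUNK_SIZE = 64 * 1024  # hypothetical chunk size, chosen only for illustration


def read_whole(source):
    """Buffer the entire BLOB at once, as a regular result set requires."""
    data = source.read()
    return len(data), len(data)  # (bytes transferred, peak buffer size)


def read_in_chunks(source, sink, chunk_size=CHUNK_SIZE):
    """Copy a BLOB from source to sink one chunk at a time."""
    total = 0
    peak = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        sink.write(chunk)
        total += len(chunk)
        peak = max(peak, len(chunk))
    return total, peak  # (bytes transferred, largest buffer held at once)


blob = bytes(1024 * 1024)  # a 1 MB stand-in for a BLOB
whole_total, whole_peak = read_whole(io.BytesIO(blob))

sink = io.BytesIO()
chunk_total, chunk_peak = read_in_chunks(io.BytesIO(blob), sink)

print("buffered:", whole_total, "bytes moved, peak buffer", whole_peak)
print("chunked: ", chunk_total, "bytes moved, peak buffer", chunk_peak)
```

In the chunked version the peak buffer is bounded by the chunk size rather than by the size of the BLOB, which is the property the chunked ODBC calls are meant to give the client.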

PBMS is designed to address this problem and provide MySQL and Drizzle with a means to efficiently handle BLOB data by allowing BLOB data to be transferred outside of the regular result set.

There are currently two approaches to handling BLOBs when using MySQL or Drizzle. One is to store the BLOB data in the database in a BLOB column, which I will call the “BLOBs in database” approach. The other is to store the BLOB in a file somewhere and then store the path to the file in the database, which I will call the “BLOBs in files” approach. I will compare these two methods of handling BLOB data with the “PBMS” approach, which is to use the PBMS daemon.


The BLOBs in database approach:

Advantages:
  • Simple to implement, BLOB data is treated no differently from any other data.
  • Flexible, DBMS independent applications can be written using ODBC or JDBC to access the data.
  • The referential integrity of the BLOB data is ensured by the database.
  • Standard database maintenance ensures the security of the data.

Disadvantages:
  • The BLOB data is buffered on both the server and client side, so a 100 MB BLOB will require a 100 MB buffer on the server and another on the client to receive the BLOB into. If the server is busy handling 100 such requests, it will need 10 GB of buffer space.
  • Database replication becomes impractical because of the size of the logs when the BLOB data is written to them.
  • The use of mysqldump or similar tools to back up databases results in huge backup files, because the BLOBs must be converted to hex strings in order to write them to the backup log, which doubles their size.
  • The MySQL cluster server cannot be used with databases containing BLOBs.
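
The two size figures in the list above are simple arithmetic, and a short Python sketch can check them. This is only an illustration under my own assumptions: it uses decimal megabytes and gigabytes to match the round numbers in the text, the helper name `server_buffer_bytes` is hypothetical, and `binascii.hexlify` stands in for whatever hex encoding a dump tool applies.

```python
import binascii

MB = 10**6  # decimal units, to match the round numbers in the text
GB = 10**9


def server_buffer_bytes(blob_size, concurrent_requests):
    """Memory needed when every in-flight request buffers its whole BLOB."""
    return blob_size * concurrent_requests


total = server_buffer_bytes(100 * MB, 100)
print(total / GB, "GB of server buffer space")

raw = bytes(range(256)) * 4      # 1024 bytes of sample binary data
as_hex = binascii.hexlify(raw)   # two hex characters per input byte
print(len(raw), "raw bytes become", len(as_hex), "bytes as hex")
```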


The BLOBs in files approach:

Advantages:
  • The BLOB data is not part of the result set so large buffers are not required.
  • The BLOB data can be stored in a location that is remote from the database server.
  • Standard replication will work (but the BLOB data will not be replicated).
  • Standard backup procedures can be used with the database (but the BLOB data will not be backed up).

Disadvantages:
  • A separate backup solution must be found for the BLOB data while keeping it consistent with the database backups.
  • A separate solution is required to replicate BLOB data.
  • Requires a custom designed system including client software.
  • The client software needs to know how the BLOBs are stored and needs to be provided with a method of accessing the BLOBs. If the BLOBs are not located locally to the client then additional software may be required.
  • The referential integrity, making sure that the BLOB files being stored on the file system are consistent with the BLOB references stored in the database, is no longer controlled by the database server.
  • Doesn’t scale well because most file systems perform poorly when the number of files starts to exceed a couple of million.
  • Installation and maintenance is more complex because specialized knowledge is required.
  • The client application is responsible for handling the effects of transaction rollbacks to ensure referential integrity.
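
The rollback problem in the last point is easy to underestimate, so here is a small Python sketch of it. It makes several assumptions that are mine and not part of any real system: SQLite stands in for MySQL or Drizzle, and the `docs` table, the `store_blob` helper, and the simulated failure are all hypothetical.

```python
import os
import sqlite3
import tempfile


def store_blob(conn, blob_dir, name, data, fail=False):
    """Write the BLOB to a file, then record its path in the database.

    If the transaction fails, the client has to delete the file itself,
    because the database rollback knows nothing about the file system.
    """
    path = os.path.join(blob_dir, name)
    with open(path, "wb") as f:
        f.write(data)
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "INSERT INTO docs (name, blob_path) VALUES (?, ?)", (name, path)
            )
            if fail:
                raise RuntimeError("simulated failure later in the transaction")
    except RuntimeError:
        os.remove(path)  # manual cleanup; without it the file is orphaned
        return None
    return path


blob_dir = tempfile.mkdtemp()
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (name TEXT PRIMARY KEY, blob_path TEXT)")

kept = store_blob(conn, blob_dir, "a.bin", b"hello")
dropped = store_blob(conn, blob_dir, "b.bin", b"world", fail=True)

rows = conn.execute("SELECT name FROM docs").fetchall()
files = sorted(os.listdir(blob_dir))
print("rows:", rows, "files:", files)
```

If the `os.remove` call is left out, the database and the file system disagree after the rollback, and nothing in the database server will ever notice.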


The PBMS approach:

Advantages:
  • Simple to implement, all data storage and access is handled by the database server and PBMS engine.
  • Flexible, DBMS independent applications can be written using JDBC to access the data.
  • The referential integrity of the BLOB data is ensured by the database.
  • Standard database maintenance ensures the security of the data. No special knowledge is required.
  • Replication of the BLOB data is possible.
  • The BLOB data can be streamed in and out of the database so that buffer sizes are independent of the size of the BLOBs.
  • Better performance, tests show that inserts and selects are significantly faster when BLOBs of 50 K or more are handled using PBMS.
  • The solution scales well because BLOBs are packed into files, so the number of files in the file system is much lower than you would get with a one-BLOB-per-file system.
  • The maximum size of a BLOB is only limited by the maximum file size of the host machine.
  • The BLOB data can be stored in a location remote from the database server, such as on a different machine or in S3 cloud storage. This reduces the load on the database server host and its network bandwidth use.
  • The PBMS daemon ensures that BLOB inserts and deletes are handled properly in the event of a transaction rollback. Transaction check points are also supported.
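
To illustrate the general idea behind packing BLOBs into files and streaming them back out, here is a minimal Python sketch. It is not the PBMS repository format or API: the `PackedBlobStore` class, its in-memory index, and the chunk size are all hypothetical and chosen only to show the technique.

```python
import io


class PackedBlobStore:
    """Append-only store that packs many BLOBs into one file-like object."""

    def __init__(self, backing):
        self._backing = backing  # any seekable binary file object
        self._index = {}         # blob id -> (offset, length)
        self._next_id = 1

    def put(self, data):
        """Append a BLOB to the backing file and return its id."""
        self._backing.seek(0, io.SEEK_END)
        offset = self._backing.tell()
        self._backing.write(data)
        blob_id = self._next_id
        self._next_id += 1
        self._index[blob_id] = (offset, len(data))
        return blob_id

    def stream(self, blob_id, chunk_size=4096):
        """Yield one BLOB in chunks, never holding more than chunk_size bytes."""
        offset, remaining = self._index[blob_id]
        while remaining > 0:
            self._backing.seek(offset)
            chunk = self._backing.read(min(chunk_size, remaining))
            if not chunk:
                break
            yield chunk
            offset += len(chunk)
            remaining -= len(chunk)


store = PackedBlobStore(io.BytesIO())
first = store.put(b"a" * 10000)
second = store.put(b"b" * 5)

chunks = list(store.stream(first))
print(len(chunks), "chunks, largest is", max(len(c) for c in chunks), "bytes")
```

Both BLOBs live in a single backing file, so the number of files does not grow with the number of BLOBs, and the reader's buffer is bounded by the chunk size rather than by the BLOB size.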

Disadvantages:
  • PBMS is not shipped with any MySQL distribution, but it is shipped with Drizzle and can be downloaded from http://www.blobstreaming.org.
  • The MySQL server does not directly support PBMS, but Drizzle does provide direct support.

Although MySQL doesn’t support PBMS directly, it is not difficult to add support to the InnoDB engine, and anyone interested can contact me and I will happily assist them with it. I use InnoDB for most of my testing.


Conclusions:

PBMS provides efficient BLOB handling that is missing from MySQL and Drizzle. This would be enhanced greatly by integrating PBMS support more directly into the MySQL server.

I currently have plans to increase the support for PBMS in Drizzle by adding a new column type for storing PBMS BLOB references. This will make the use of PBMS part of the database schema design and simplify support in client libraries. The client library will then be able to recognize that it is getting a BLOB reference and can make calls to the PBMS daemon to stream the data back to the client. Ideally, this is the type of support MySQL should also provide for PBMS.

1 comment:

Unknown said...

Excellent information. It would be cool if you could prep an article explaining how to use PMBS in MySQL, once you have the time.