Wednesday, March 23, 2011

New PBMS version

A new version of PBMS for drizzle has been pushed up to launchpad:

drizzle_pbmsV2

I have rewritten PBMS and changed the way that BLOBs are referenced in order to make PBMS more flexible and to fix some of it's limitations. I have also removed some of the more confusing parts of the code and reorganized it in an attempt to make it easier for people to find there way around it.

So apart form some cosmetic changes what is different?

Maybe the best answer would be to say what hasn't changed: the user and engine API  and the way in which the actual data is stored on the disk remains pretty much unchanged, but everything else has changed.

The best place to start is with the BLOB URL, the old URL looked like this:
"~*1261157929~5-128-6147b252-0-0-37"
the new URL looks like this:
"pbmsAdaVAQCAAAAAAAAAANaVAQCAAAAAAAAAAG30qzsGAAAAAAAAAAEAAACAAAAAAAAAAAEAAAAAAAAA"
which is obviously a lot more intuitive.  :)

OK maybe it is bigger and uglier but it contains a lot more information. It is actually a base64 URL encoding of a data  structure containing information about the BLOB that makes it universally locatable across different PBMS daemons running locally or remotely.

How is this done?

When a BLOB is uploaded to a PBMS daemon the URL generated for it contains, among other things, the PBMS daemon's server id as well as the database ID and the BLOB's repository index value. These 3 values remain with the BLOB for it's life regardless of what server or database it may eventually end up in.  This allows you to insert a BLOB URL from one database into another database, possibly on a different server, and the PBMS engine will be able to use the URL to look up the blob, if it cannot find it in the database's repository then PBMS will automatically fetch the BLOB from the source server or database.

When fetching the BLOB the current server id, database id and index, which are also stored in the URL, are used.

This means that the following will work:

insert into foo.blob_table1 select * from  bar.ablob_table;
insert into foo.blob_table2 select * from  bar.ablob_table;

The first insert will copy the BLOBs from the BLOB repository for database 'bar' into the repository for database 'foo'.

The second insert will will recognize that the BLOBs already exists in foo's BLOB repository and just add references to them.

The same would hold true if database 'foo' was on a different server on the other side of the world.

BLOBs and replication:

A practical use for this is with replication, the replication process replicates the BLOB URLS to the slave server and PBMS pulls the BLOB across automatically. You can try this out using drizzle replication and my drizzle_pbmsV2 branch.

The only thing you need to do is tell the slave server how to map the PBMS server ID to the actual server. You do that by inserting the information into the 'pbms_server' table in the slave machine's pbms database. The master server's PBMS server ID can be found by doing the following select on the master server:
select * from pbms.pbms_server;
Resulting in something like this:
+-------+-----------+------+--------------+
| Id    | Address   | Port | Description  |
+-------+-----------+------+--------------+
| 38358 | localhost | 8080 | This server. |
+-------+-----------+------+--------------+

Then on the slave server do the following insert:
 insert into pbms.pbms_server values(38358, "master_host", 8080, "The master replication server");
where "master_host" will be the same IP address as you have for "master-host" in the drizzle slave config file.

Note: The PBMS server ID is not the same as the drizzle server id.

What else:

With the new design of the PBMS daemon it would not be very difficult to create a stand alone BLOB repository server that could be used as a backup for BLOBs or a a central repository for a cluster of servers. 

The next step though is to update the PBMS documentation and build a version  for MySQL.

Friday, March 11, 2011

PBMS Performance

I have been doing some performance testing with PBMS and found a few things that were kind of interesting. The main finding was that you start to see performance improvements when data sizes start to reach the 20K level. This was seen when replacing a 20 K varchar field with a longblob column in a PBMS enabled table.

The following graph shows the performance differences for 'select' and 'insert' statements using a PBMS enabled version of InnoDB on an 8 core machine.


The test compares the insert and select performance of LongBlob columns with PBMS support against that of varchar and longtext columns when using InnoDB.

The test shows that depending on if your application is more heavily weighted towards Inserts or selects it may be beneficial to replace columns containing more than 10K of data with longblob columns with PBMS support. In all cases the performance of both 'selects' and 'inserts' was improved for columns containing more than 20K when using longblob columns and PBMS. As the data size in the column increased the performance gain by using PBMS also increased, at 10 M there was %580 performance improvement for selects.

The testing showed some performance irregularities with PBMS which when fixed should result in even better performance.

If anyone would like to try this themselves please contact me and I can give you a copy of the test tool I used as well as a PBMS enabled version of InnoDB.

Barry