Author Topic: Design for public API/URL identifiers (Read 3140 times)

hans · « **on:** February 02, 2009, 03:18:30 PM »

My company is transitioning some of it's internal core services to publicly available web services.

The reasons for doing so are two fold:
1. Make it easier to develop custom applications utilizing the same core platform services in order to reduce development and maintenance time (vs packaged modules).
2. Allow external customers the ability to develop their own applications using the Vocabra platform.

The services will be designed to be multi-tenant with the intention that most customers will be hosted in a shared database instance. The same architecture could be used for any customer that needs more data isolation by simply creating a single tenant in a separate database instance (which would likely be accompanied by separate application instances as well; at significant additional cost due to deployment and maintenance issues).

A new surrogate key design is necessary to provide security and data migration ability with minimal effort since it is entirely likely that over a short period of time, customers may be merged/split into other database instances and the core services may be versioned quickly (until a more complete version is reached). From a security perspective, it is useful to provide an external customer with a non-sequential nor easily guessable record identifier to prevent any mild form of hacking the URL's. (internally, the queries will also provide mechanisms for preventing unauthorized viewing/editing of records)

The proposed solution

Each table (for publicly available records) will have the following base columns:
id - auto increment default integer field used for internal joining of records (considered to be private id)
uid - unique integer field that will be displayed in the url and provided as a reference id externally (considered to be public id)

Generating the 'uid' field
The 'uid' field will be generated through the following algorithm:
1. Generate a random UUID using java.util.UUID.randomUUID().
2. Hash the random UUID using SHA-256
java.security.MessageDigest md = java.security.MessageDigest.getInstance("SHA-256");
md.update(uuidStr.getBytes("UTF-8"));
3. Take the first 6 bytes of the resulting hash and create a hex value (will be used for display of key; 12 characters).
4. Store the hex value as a long value for database querying

What this solution gets us:
1. Non-sequential URL identifiers (useful to avoid simple 'plus one' URL hacking) that aren't terribly ugly or long.
2. Auto increment keys still exist for joining records in the backend, which should be easier on the database
3. 'Public key' stored as integer value for faster searching rather than trying to compare against a character value (hex form) but is easily switchable from format to format
4. Minimal migration headaches since each record already has a unique identifier and can simply be inserted into the new database without having to worry about the auto increment field. (associations are still 'work' though)
5. Values up to 281,474,976,710,655 (big enough to not worry about, will require sharding or something at that point)

Drawbacks:
1. Insert times will increase, as will CPU IO due to a more complicated key generation routine. (currently not concerned with that since we'll plan on load balancing if it becomes an issue)
2. Requires modification of URL creation to use alternate id field for URL generation (modify Grails templates)
3. Overhead in converting URL keys from hex to integer in order to find records in database (again adding to additional IO)
4. Database querying more difficult since the hex value doesn't exist to search by, requires conversion to integer value or additional scripting
5. Values up to 281,474,976,710,655 (not a large concern, reasoning above)

Outstanding questions:
1. Is taking the first 6 bytes of the SHA hash going to have any problems with collision, it's understood the randomness should be OK.
2. Overkill anywhere?
3. Is it worth the bother to keep the auto increment field around? If we're storing the external key as an integer value what are the performance hits for not having them be sequential (ignoring storage requirements, we'll get a bigger storage device)
4. Is there any benefit to storing the original UUID? Perhaps in case we'd like to regenerate keys at some point in the future (ignoring the user frustration level)
5. Is 'uid' an appropriate name for the alternate key? What are alternative names?

Whatcha guys think?

Steve · « **Reply #1 on:** February 02, 2009, 03:24:10 PM »

I think you are much smarter then i am. It's amazing to me that i have as many skills and abilities as i do, yet cant understand the technology behind alot of them....

hans · « **Reply #2 on:** February 02, 2009, 03:36:20 PM »

Quote from: RoD on February 02, 2009, 03:24:10 PM

It's amazing to me that i have as many skills and abilities as i do, yet cant understand the technology behind alot of them....

I think that's part of the game. Too much stuff out there to know everything. I've taken the know enough about a lot of stuff approach; expert in none.

Steve · « **Reply #3 on:** February 02, 2009, 03:44:36 PM »

Yea. Theres systems and things i can gain illegal access too fairly easily, but if someone who knew the science tried to talk to me about it they would think i was clueless haha.

Perspective · « **Reply #4 on:** February 02, 2009, 03:45:43 PM »

Quote from: tgm on February 02, 2009, 03:18:30 PM

Outstanding questions:
1. Is taking the first 6 bytes of the SHA hash going to have any problems with collision, it's understood the randomness should be OK.

Collisions are definitely possible. You never know how much variation will be in the first 6 bytes.

Quote from: tgm on February 02, 2009, 03:18:30 PM

2. Overkill anywhere?

I think the uid generation may be overkill since the point of the SHA hash is to get uniform(ish)ly distributed key values with low collisions... but you're likely losing that property by only taking the first 6 bytes anyway. A data structure style of hash function will be more efficient to compute and can still give you nice distribution among the 12 digit space.

Either that or use the entire SHA hash string (but that's uglier as you say).

Quote from: tgm on February 02, 2009, 03:18:30 PM

3. Is it worth the bother to keep the auto increment field around? If we're storing the external key as an integer value what are the performance hits for not having them be sequential (ignoring storage requirements, we'll get a bigger storage device)

There are no performance hits so long as you've got an index on the column. The only advantage with sequential is if it's a primary key column the db can have a special index (basically an array). But in this case the uid "keys" are actually the values in the table, so they would likely be B+ tree indexed even if they were sequential.

As for keeping the auto-increment field, I thought that's how you are mapping everything back to your internal data? This may not be a bad idea to help make SQL injection attacks against your private data more difficult (attackers don't get keys from URLs).

Quote from: tgm on February 02, 2009, 03:18:30 PM

4. Is there any benefit to storing the original UUID? Perhaps in case we'd like to regenerate keys at some point in the future (ignoring the user frustration level)

I don't see a point. They are random to begin with and if you ever want to regenerate new keys you can just start fresh.

Quote from: tgm on February 02, 2009, 03:18:30 PM

5. Is 'uid' an appropriate name for the alternate key? What are alternative names?

So long as you actually enforce it to be unique I guess it's an ok name

hans · « **Reply #5 on:** February 02, 2009, 04:30:25 PM »

The way I understand the sha algorithm is that it's fairly uniform over the length of however far I decide to truncate. The reason I picked 6 bytes is because I wanted to keep the URL ID short and my original choice of 8 bytes would be too big for Java's long forcing me to store the value as a character string. I've always read comparing strings is very bad in DB's and I liked 12 better than 14.
And if I went without the auto inc field all joins would be done on the uid field. But I like your SQL injecting comment so I think we'll keep the auto inc.

hans · « **Reply #6 on:** February 02, 2009, 04:37:27 PM »

Oh, just though of something. Should I be pulling the last x bytes or won't that matter?

Perspective · « **Reply #7 on:** February 02, 2009, 05:26:34 PM »

>The way I understand the sha algorithm is that it's fairly uniform over the length of however far I decide to truncate.

Well if that's the case it should work fine, you still need to check and handle collisions of course.

>Should I be pulling the last x bytes or won't that matter?

no idea

I don't know enough about SHA.

hans · « **Reply #8 on:** February 06, 2009, 08:47:05 PM »

OK. Made some good progress in things I think. After lots of research and a bit of testing I think the SHA is not going to be necessary. The only reason I wanted to use SHA is because the UUID was too long and ugly. But as I was playing around with things I started reading more about other hashing algorithms and finally just came to reading about UUID's in general after failing to find a hash algorithm that could be truncated and still have a uniform distribution.
So, thanks to Wikipedia, I learned there are multiple types of UUID's and Java's implementation (that I'm using) is a type 4 which is just a random set of 122 bits. So I grabbed the code for it (since Java is now open source) and saw all it was was a function that used Java's SecureRandom class (which I'd been playing with) that flipped some bits to set the version and variants in the UUID.
The new plan is now just to use the same code but allow a variable set of random bits, (which we'll likely start with at 64 giving us a hex string of 16 characters).

And after finding a math equation (also on Wikipedia) supposedly to calculate the collision probability of n+1 records with the previous n records I think we have a winner.

Plugging in the values we're planning on using:
1 - (e^-( (100,000,000^2) / (2*2^64) ) )
gives us ~0.0003 (thank you google calcuator)

So after 100 million records, we'll have a 3 in 10,000 chance of getting a collision on the next record.

At a more "reasonable" 10 million we've got a 3 in a million probability.

And I think we're going to be storing the values as hex so before we reach that level of records I can increase the number of bits and start all over again since there won't be overlap, which would be a problem if I stored the values as integers in the database.

The only think I'm still trying to figure out if there is a large performance issue by using a unique constraint and storing the values as characters instead of integer (or bigint rather). Since it's a btree thought I think the fact that it's unique should keep the performance acceptable (at the cost of larger storage needs).

Perspective · « **Reply #9 on:** February 06, 2009, 10:14:55 PM »

Be sure to declare the column as UNIQUE in your schema, this is actually more efficient than non-unique since leaf-nodes always correspond to at most one record. I don't think you'll see any significant performance hits having character strings. You could always test by generating two big tables and running a few queries.

EntropySink

News:

Author Topic: Design for public API/URL identifiers (Read 3140 times)

hans

Design for public API/URL identifiers

Steve

Re: Design for public API/URL identifiers

hans

Re: Design for public API/URL identifiers

Steve

Re: Design for public API/URL identifiers

Perspective

Re: Design for public API/URL identifiers

hans

Re: Design for public API/URL identifiers

hans

Re: Design for public API/URL identifiers

Perspective

Re: Design for public API/URL identifiers

hans

Re: Design for public API/URL identifiers

Perspective

Re: Design for public API/URL identifiers