Thursday, October 15, 2009

Differences between varchar and nvarchar in SQL Server

The broad range of data types in SQL Server can sometimes throw people for a loop, especially when the data types seem to be highly interchangeable. Two in particular that constantly spark questions are VARCHAR and NVARCHAR: what's the difference between the two, and how important is the difference?

VARCHAR is an abbreviation for variable-length character string. It's a string of text characters whose size is limited by the page size of the database table holding the column in question. A table page is 8,192 bytes, and no single row in a table can be more than 8,060 bytes. This in turn limits the maximum declared size of a VARCHAR to 8,000 bytes.
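
If you want to see that limit for yourself, here's a minimal sketch (the table and column names are placeholders I made up):

    -- VARCHAR(8000) is the largest fixed size you can declare
    CREATE TABLE dbo.LimitDemo
    (
        Notes VARCHAR(8000)       -- fine
        -- Notes2 VARCHAR(9000)   -- fails: 8000 is the maximum
    );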

The "N" in NVARCHAR means uNicode. Essentially, NVARCHAR is nothing more than a VARCHAR that supports two-byte characters. The most common use for this sort of thing is to store character data that is a mixture of English and non-English symbols — in my case, English and Japanese.

The key difference between the two data types is how they're stored. VARCHAR is stored as single-byte data, using the code page of the column's collation. NVARCHAR strings are stored in the database as Unicode, two bytes per character, all the time; conversion to a single-byte code page only happens when the data is assigned to a VARCHAR column or read by a non-Unicode client. That said, NVARCHAR values are bound by the same 8,000-byte limit as their VARCHAR cousins. However, since NVARCHAR uses two bytes for each character, a given NVARCHAR can only hold 4,000 characters (not bytes) maximum. So, the amount of storage needed for NVARCHAR entities is going to be twice whatever you'd allocate for a plain old VARCHAR.
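
You can watch the two-bytes-per-character rule in action with DATALENGTH, which counts bytes, versus LEN, which counts characters; a small sketch:

    DECLARE @v VARCHAR(20)  = 'hello';
    DECLARE @n NVARCHAR(20) = N'hello';

    SELECT DATALENGTH(@v) AS VarcharBytes,   -- 5
           DATALENGTH(@n) AS NvarcharBytes,  -- 10
           LEN(@v)        AS VarcharChars,   -- 5
           LEN(@n)        AS NvarcharChars;  -- 5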

Because of this, some people prefer not to use NVARCHAR universally, and instead fall back on VARCHAR, which takes up less space per row, whenever possible.
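
In practice that usually means reserving NVARCHAR for the columns that genuinely need it. A hypothetical sketch of that kind of mixed schema:

    CREATE TABLE dbo.Customers
    (
        CustomerCode VARCHAR(20),    -- machine-generated, ASCII-only
        DisplayName  NVARCHAR(100),  -- may contain Japanese or other non-English text
        CountryCode  CHAR(2)         -- fixed-width ASCII
    );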