A customer found that if they passed Unicode strings (which in Windows means strings encoded as UTF-16LE using the two-byte data type wchar_t as code units) that are not at even addresses, then some, but not all, functions fail to accept those strings. Why isn’t this documented?
This is one of the ground rules for programming: Pointers must be properly aligned unless explicitly permitted otherwise.
In the C and C++ languages, forming an unaligned pointer is explicitly specified to return no useful value.
In C:
(6.3.2.3 Pointers) If the resulting pointer is not correctly aligned for the referenced type, the behavior is undefined.
In C++:
[expr.static.cast](13) If the original pointer value represents the address A of a byte in memory and A does not satisfy the alignment requirement of T, then the resulting pointer value is unspecified.
Therefore, simply creating a misaligned pointer already takes you outside the world of allowable (in C) or at least meaningful (in C++) operations, so you shouldn’t be surprised that using misaligned pointers results in nonsense.
As for why certain functions get more upset than others, it’s all a matter of how those functions use the pointers and who detects the misaligned pointer.
If you are using a processor that is alignment-sensitive, you will probably get a failure when the code tries to read the data from that pointer. If the access is made in user mode, you will get an access violation exception, and the process will probably crash. If the access is made in kernel mode, the kernel mode parameter validator will probably return an invalid parameter error. (Kernel mode must protect itself from user mode.)
If you are using a processor that forgives misaligned data accesses, then you may get away with it for a while, until the code does something with the data that requires alignment. For example, atomic operations typically require aligned data, even on processors that are normally forgiving of misalignment.
And even though x86-64 is generally alignment-forgiving, there are still places where it is alignment-sensitive. For example, some instructions involving SIMD registers require alignment. SIMD registers are often used for copying blocks of memory around, and since wchar_t has 2-byte alignment, the switch statement for performing block copies has only 8 legal starting points out of 16, since all the odd addresses are invalid. If you pass an odd address, you might well fall through the switch statement and perform garbage copies.
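If a UTF-16 string might arrive at an odd address, one defensive approach is to test the address and, if necessary, copy the bytes into aligned storage before handing the string to a function that expects an aligned pointer. The following is a minimal portable C++ sketch, not anything from the Windows SDK: the helper names is_aligned_for and copy_utf16_to_aligned are hypothetical, and char16_t stands in for the two-byte Windows wchar_t so that the code behaves the same on other platforms.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical helper: reports whether p satisfies the alignment
// requirement of T. The pointer is inspected as an integer and is
// never dereferenced as a T.
template <typename T>
bool is_aligned_for(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % alignof(T) == 0;
}

// Hypothetical helper: copies UTF-16 code units out of a byte buffer
// that may start at any address. The returned string owns storage
// that is correctly aligned for char16_t.
std::u16string copy_utf16_to_aligned(const unsigned char* bytes,
                                     std::size_t byte_length) {
    // A UTF-16 string always occupies a whole number of code units.
    assert(byte_length % sizeof(char16_t) == 0);
    std::u16string result(byte_length / sizeof(char16_t), u'\0');
    if (byte_length != 0) {
        // memcpy works on bytes, so it has no alignment requirement.
        std::memcpy(&result[0], bytes, byte_length);
    }
    return result;
}
```

A caller could test is_aligned_for<char16_t>(p) first and pay for the copy only when the test fails.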
The Microsoft C++ compiler has a special nonstandard keyword __unaligned for declaring that a pointer may be unaligned, and this tells the compiler that any accesses to the data behind that pointer must use instructions that are alignment-forgiving. For some processors, this can be quite expensive.
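Where __unaligned is not available, a commonly used portable alternative is to move the bytes with memcpy instead of dereferencing a misaligned pointer. Since memcpy is defined in terms of bytes, it is valid for any address, and compilers can typically lower it to whatever load or store sequence the target permits. A minimal sketch, with hypothetical helper names load_unaligned and store_unaligned:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical helper: reads a T from an address that need not be
// aligned for T. No misaligned T* is ever formed or dereferenced.
template <typename T>
T load_unaligned(const void* p) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "memcpy-based access requires a trivially copyable type");
    assert(p != nullptr);
    T value;
    std::memcpy(&value, p, sizeof(T));
    return value;
}

// Hypothetical helper: writes a T to an address that need not be
// aligned for T.
template <typename T>
void store_unaligned(void* p, const T& value) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "memcpy-based access requires a trivially copyable type");
    assert(p != nullptr);
    std::memcpy(p, &value, sizeof(T));
}
```

For example, load_unaligned<std::uint64_t>(buffer + 1) reads eight bytes starting at an odd offset without ever creating a misaligned std::uint64_t pointer.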
Limit your use of misaligned pointers to places where misaligned pointers are expressly permitted. You can tell where those places are by looking for the Windows SDK macro UNALIGNED. For example:
LWSTDAPI_(int)
SHFormatDateTimeA(
_In_ const FILETIME UNALIGNED * pft,
_Inout_opt_ DWORD * pdwFlags,
_Out_writes_(cchBuf) LPSTR pszBuf,
UINT cchBuf);
Hmm, passing unicode strings to API functions.
I wish I could pass std::wstring_view (-ish object, or reference to) to them, and save myself copy and/or allocation, when most of them internally convert it again to UNICODE_STRING anyway.
I request a follow-up, to tell us why SHFormatDateTimeA() needed to accept unaligned FILETIME structs .. sounds like there’s a good story there?
FILETIME is misaligned in some on-disk structures, so it could not be fixed even when we made the jump to 64 bit Windows. Not taking misaligned FILETIMEs is asking for trouble.
Some binary data files were written with 32-bit windows, but need to be read by 64-bit code? Sounds plausible — I’m just surprised this doesn’t come up more often.
I’m working with a very old (late 90s) codebase that reads and writes a ton of little binary file formats, like that. I wonder how much we’re getting away with x64 being “alignment-forgiving”. We are porting to arm64, so I guess we’ll find out. :/
Every time I read something about alignment issues, I ask myself: What really did we start building byte addressable machines for?
If you cannot use byte addresses freely, but must restrict yourself to using word addresses, what are byte addresses for? Even old architectures from the 1950s-60s, such as the Univac 1100 series (and dozens of others), did have facilities for handling character strings. Maybe the original IBM 360 design made it simpler ... until those super-performance architectures arrived, telling: Don't make use of that new-won freedom you got! Do as you did before IBM 360 - stick to word addresses!
I...
But you can freely address bytes. You get into trouble only when you are pointing at larger elements…
@ketil albertsen
Seems to be because most modern frameworks are based on Javascript, Java, or C#. But in fact we did a lot of internal string representation in UTF-8 in prior decades. PHP used an internal string representation of UTF-8.
I'm looking at an oddball case; if we had a complete string processing library in UTF-8 available to me in .NET; including .cshtml files we would almost certainly go through our innards with a rototill to replace the builtin string with a UTF-8 string everywhere, and push remaining UTF-16 calls down into the database adapter. The performance boost would be worth...
"... only when you are pointing at larger elements ..."
Such as UTF-16. Independent 8-bit entities are really a special case nowadays. The very most of data elements are part of larger structures. A class member is part of the structure of instance value members. If an instance has single local byte variable, it cannot, on an alignment sensitive architecture, be allocated at an arbitrary address, but must honor alignment requirements.
Cole Turbin mentions byte addressed UTF-8, which is certainly the best choice for external data representation. I refuse to believe that it is "the most common code page" for internal working...
Exactly. UTF-8, the most common codepage (and the only one you should ever use), is entirely byte-addressed!
The only one I would have actually expected to work is MultiByteToWideChar:
MultiByteToWideChar(1200, 0, unaligned_ptr, byte_length, aligned_wide_char_ptr, aligned_length);
MultiByteToWideChar(1201, 0, unaligned_ptr, byte_length, aligned_wide_char_ptr, aligned_length);
But it doesn’t because 1200 and 1201 aren’t implemented. So native programmers get to write compile-time checks for alignment.
s/alignment/endianness/