Optimizing Character Replacement in C#

For some time, I’ve had a utility function that cleans strings of Unicode characters that can cause issues for the users of the applications I manage.

It runs through a series of regular expression statements that replace problematic Unicode characters with replacement strings. In the examples below, I’m showing only the replacements related to single-quote characters. In the full code (which has 10 separate RegEx calls), some characters are simply stripped out, while others are replaced with two-character strings. Here’s how the old code worked:

public static string FixBadUnicode1(string s) {
  return Regex.Replace(s,
    "[\u2018\u2019\u201A\u201B\u0091\u0092\u2032\u2035\u05FE]",
    "'");
}

This performs well, but due to the number of times it is called, the total execution time is non-trivial, so I decided to test it against a character loop.

To perform my tests, I ran the replacements over a corpus of 38,000 representative strings. Most *don’t* have problematic characters, so the overhead of *looking* for the characters outweighs the cost of the actual replacements. For timing, I used the native QueryPerformanceCounter function (via a safe P/Invoke wrapper), which provides a higher-resolution counter than the timers in the .NET library.
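
For reference, here’s a minimal sketch of the kind of timing harness involved; the DllImport declaration is the standard kernel32 signature, while TimeFixes, the corpus parameter, and the call to FixBadUnicode1 are stand-ins for my actual test rig:

// Requires: using System.Collections.Generic; using System.Runtime.InteropServices;
[DllImport("kernel32.dll")]
private static extern bool QueryPerformanceCounter(out long count);

public static long TimeFixes(IEnumerable<string> corpus) {
    long start, end;
    QueryPerformanceCounter(out start);
    foreach(var s in corpus) {
        FixBadUnicode1(s);
    }
    QueryPerformanceCounter(out end);
    return end - start;   // elapsed counter cycles
}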

The baseline (RegEx) code ran consistently in around 20 million counter cycles, or about 9 seconds.

My recent experience has been that Dictionary usually beats switch{} when there are more than a handful of cases, so I decided to create a static Dictionary<char, string> to look up the problem characters. Here’s the first iteration:

public static string FixUnicodeChars(string s) {
    var mappers = UnicodeMapper;
    var sb = new StringBuilder(s.Length);
    string fix = null;
    foreach(var c in s) {
        if(mappers.TryGetValue(c, out fix)) {
            if(fix!=null) sb.Append(fix);
        } else {
            sb.Append(c);
        }
    }
    return sb.ToString();
}

If the “fix” string is null (i.e., I just want to strip the character), it skips the StringBuilder.Append() call.

The dictionary UnicodeMapper looks like this (again, only the single-quote portion):

    private static Dictionary<char, string> _unicodeMapper;
    private static readonly object _unicodeMapperLock = new object();
    private static Dictionary<char, string> UnicodeMapper {
        get {
            if(_unicodeMapper==null) {
                lock(_unicodeMapperLock) {
                    if(_unicodeMapper==null) {
                        var map = new Dictionary<char,string>();
                        // Fix all single quotes    [\u2018\u2019\u201A\u201B\u0091\u0092\u2032\u2035\u05FE]
                        map.Add('\u2018', "'");
                        map.Add('\u2019', "'");
                        map.Add('\u201A', "'");
                        map.Add('\u201B', "'");
                        map.Add('\u0091', "'");
                        map.Add('\u0092', "'");
                        map.Add('\u2032', "'");
                        map.Add('\u2035', "'");
                        map.Add('\u05FE', "'");
                        // Assign only after the map is fully populated, so other
                        // threads never see a partially-built dictionary.
                        _unicodeMapper = map;
                    }
                }
            }
            return _unicodeMapper;
        }
    }

This performs in just around 600,000 cycles, a more than 30x improvement over RegEx. This surprised me quite a bit — I expected RegEx to use unsafe pointers and other tricks to loop through the strings more quickly than I could enumerate the characters. Part of this performance is due to the fact that Char.GetHashCode() is far more efficient than String.GetHashCode(), so the Dictionary lookups are very fast.

On reviewing the code, I determined that since most strings don’t have any replacements, it made no sense to call sb.ToString() unless a replacement actually occurred. I added a flag, with this result:

public static string FixUnicodeChars2(string s) {
    var mappers = UnicodeMapper;
    var sb = new StringBuilder(s.Length);
    string fix = null;
    var hadChanges = false;
    foreach(var c in s) {
        if(mappers.TryGetValue(c, out fix)) {
            if(fix!=null) sb.Append(fix);
            hadChanges = true;
        } else {
            sb.Append(c);
        }
    }
    return hadChanges ? sb.ToString() : s;
}

This provided an additional 12% improvement… not as much as the first change, but worth the effort.

Next, I noted that until the first replacement occurs, I don’t need to instantiate the StringBuilder at all — I can just create it when the first replacement happens and populate it with the substring before the replacement:

public static string FixUnicodeChars3(string s) {
    var mappers = UnicodeMapper;
    StringBuilder sb = null;
    string fix = null;
    var hadChanges = false;
    int pos = 0;
    foreach(var c in s) {
        if(mappers.TryGetValue(c, out fix)) {
            if(sb == null) {
                sb = new StringBuilder(s.Length);
                if(pos > 0) sb.Append(s.Substring(0, pos));
                hadChanges = true;
            }
            if(fix!=null) sb.Append(fix);
        } else if(hadChanges) {
            sb.Append(c);
        } else {
            pos++;
        }
    }
    return hadChanges ? sb.ToString() : s;
}

This final version requires us to keep track of the current position in the string until the first replacement (if any) is found, at which point we create and start using the StringBuilder. We’re also able to avoid assigning hadChanges more than once. This provides an additional 7% improvement over version 2.

As an aside, I tested using a for{} loop (since it would give us the position for “free”), but it was actually about 10% slower than using foreach{}. I also tried removing the null check before the StringBuilder.Append() call, since Append() already performs a null check. However, the overhead of calling Append() unnecessarily resulted in a 2% increase in execution time.

Another tweak would be to unroll the foreach{} loop into two loops — one until the first match is found, the other for subsequent tests and appends. That would remove the need for the “sb==null” and “hadChanges” checks during each character iteration. But it would also require either using for{} or creating a substring for foreach{} to work on, and I’m reasonably confident that these would eat up any savings.
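
For what it’s worth, here’s a rough, untested sketch of what that unrolled version might look like (it reuses the UnicodeMapper property above, and the first matched character ends up being looked up twice):

public static string FixUnicodeCharsUnrolled(string s) {
    var mappers = UnicodeMapper;
    // Loop 1: scan until the first problematic character (the common case is "none").
    int i = 0;
    for(; i < s.Length; i++) {
        if(mappers.ContainsKey(s[i])) break;
    }
    if(i == s.Length) return s;
    // Loop 2: copy the clean prefix, then test-and-append the rest.
    var sb = new StringBuilder(s.Length);
    sb.Append(s, 0, i);
    string fix = null;
    for(; i < s.Length; i++) {
        var c = s[i];
        if(mappers.TryGetValue(c, out fix)) {
            if(fix != null) sb.Append(fix);
        } else {
            sb.Append(c);
        }
    }
    return sb.ToString();
}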

The final potential improvement I can think of would be to use char[] instead of StringBuilder. Unfortunately, since some replacements result in a longer string and I can’t predict the replacements that will occur, there would be some complexity in bounds-checking and reallocating the array that could eat up all of the savings. It’s tempting to try, but maybe some other day. For now, I’m quite satisfied with a 42x performance improvement. :)
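
For anyone who wants to try it, a rough, untested sketch of the char[] variant might look something like the following. It assumes no replacement is longer than two characters, so a buffer of twice the source length (with a defensive doubling resize) is always enough:

public static string FixUnicodeChars4(string s) {
    var mappers = UnicodeMapper;
    char[] buffer = null;
    int length = 0;   // characters written to the buffer so far
    int pos = 0;      // characters scanned before the first replacement
    string fix = null;
    foreach(var c in s) {
        if(mappers.TryGetValue(c, out fix)) {
            if(buffer == null) {
                // Worst case is every character expanding to two, so 2x the source length.
                buffer = new char[s.Length * 2];
                s.CopyTo(0, buffer, 0, pos);
                length = pos;
            }
            if(fix != null) {
                // Defensive resize, only needed if a replacement ever exceeds two characters.
                if(length + fix.Length > buffer.Length) Array.Resize(ref buffer, buffer.Length * 2);
                fix.CopyTo(0, buffer, length, fix.Length);
                length += fix.Length;
            }
        } else if(buffer != null) {
            if(length == buffer.Length) Array.Resize(ref buffer, buffer.Length * 2);
            buffer[length++] = c;
        } else {
            pos++;
        }
    }
    return buffer == null ? s : new string(buffer, 0, length);
}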

Oops… Lost my backup…

I temporarily misplaced the database driving this blog during an upgrade, but I’m back in business! *sigh*

I’m in the final stages of replacing my main web site (www.tallent.us), so that should be up soon. I’ve been hosting it with Zenfolio for a number of years, but since I don’t do commercial work anymore, it didn’t make sense to pay someone else for a fancier web site than I actually need. Plus, it gave me a chance to play with some JavaScript and CSS features that I can’t use at work.

I’d love to know if anyone out there still even uses RSS readers, and if so, if you’re subscribing to this site. I could dig through my Apache log files, but I’d rather know *who* actually reads this, not just a count of visitors, and whether *you* have a blog that I should be following.

If you have a second, please leave a comment with your name and, if applicable, blog URL…

SQL Server needs dynamic enums

I manage a number of databases at work that have varchar/nvarchar columns with very restricted values — usually fewer than a dozen valid choices.

The problem is that storing and indexing these values in Microsoft SQL Server is highly inefficient.

Enterprise Edition users can work around this by partitioning their tables on such a column, but (a) that only works for one column and (b) partitioning a table isn’t necessarily good for performance overall unless the partition function aligns well with the queries you tend to run on the table.

Anyone who has studied database theory knows the pat answer — create another relation (table) that has two columns: one with a meaningless unique int value, the other with the string value (also unique). Then in the main table, refer to each allowed value by its number, not its actual value, and join the tables for output. However, there are a few problems with this approach:

  1. It’s a pain in the ass — extra joins and aliasing for SELECT, translation on INSERT/UPDATE, etc.
  2. Adding new “valid” values over time requires inserting into the lookup table. That may work for, say, a list of countries, but is less useful for, say, a list of cities that is discrete but grows over time.
  3. If you have a number of columns like this, you’ll end up adding a ton of these list-lookup tables, which is messy. And if you decide to be clever and use an EAV approach of one lookup table with the ID number, list name, and list value, you’ll run into problems setting up standard referential integrity constraints when a table has multiple “pick-list” columns.

MySQL has an alternative — ENUM columns. These store a numeric value, but act like varchar() columns, and the list of valid choices is part of the column definition. While it looks tempting at first, the evils of this approach are many.

Instead, I would recommend that MSSQL add a new keyword for varchar/nvarchar column definitions called LOW_CARDINALITY*. This would tell MSSQL to do the following in the background (transparently):

  • Maintain a dictionary of “used” values in the column;
  • Assign each unique value a surrogate integer value;
  • In the table or indexes covering the column, store the integer value, not the string.

This would result in highly efficient storage of columns that contain a limited number of choices, without needing to rely on relational joins, “magic” numbers (usually managed by the business logic), or other clumsy workarounds. It could make a huge difference for table and index storage, memory usage, and query performance. And since it is transparent to all queries, it’s the best kind of new feature — one that can be added to an existing database without having to make additional changes to views, stored procedures, or code that connects to the database.

I’m sure the devil is in the details, particularly when it comes to things like replication, but the benefits would far outweigh the effort.

* I’m sure someone at Microsoft could come up with a better keyword than LOW_CARDINALITY. Something like LIST might work but seems a bit colloquial. DISCRETE or SET are other options, though SET is already a reserved word.

A replacement for evil outs

I was reading a post by Jon Skeet and he mentioned the evil of using “out” parameters.

Anyone with a functional programming background understands what he’s referring to here — pure functions should have no side effects, they should only return a value.

The problem is, the world isn’t pure, and it’s quite common to need to return multiple results.

Probably the most common example in C# is the “TryGetValue()” pattern:


public bool TryGetValue(
    TKey key,
    out TValue value
)

C# 6’s proposed declaration expressions allow us to call this in a way that declares the “out” variable at the same time as the TryGetValue call (where before, we would need to declare the output variable first):


if(dict.TryGetValue("mykey", out var myoutput)) {
    Response.Write(myoutput);
}

Jon makes the valid point that this will only encourage “out” parameters, which should really only be the exception to the rule. A more functional approach would be:


public struct TryGetValueResult<T>(bool success, T value) {
    public bool Success { get; } = success;
    public T Value { get; } = value;
}

var result = dict.TryGetValue("mykey");
if(result.Success) {
    Response.Write(result.Value);
}

Much cleaner from a functional standpoint, but still has that annoying declaration of a variable I’ll probably only use within the “if” block. (I could use “using,” but the line count is the same and I’m retreating one more brace level).
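
The dict.TryGetValue("mykey") call above isn’t the built-in Dictionary method; it assumes something like this hypothetical extension method:

public static class DictionaryExtensions {
    // Hypothetical wrapper, not part of the framework: adapts the built-in
    // TryGetValue(key, out value) into the functional result-struct style.
    public static TryGetValueResult<TValue> TryGetValue<TKey, TValue>(
            this Dictionary<TKey, TValue> dict, TKey key) {
        TValue value;
        var success = dict.TryGetValue(key, out value);
        return new TryGetValueResult<TValue>(success, value);
    }
}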

But what if we could do the following:


if((var result = dict.TryGetValue("mykey")).Success) {
    Response.Write(result.Value);
}

Essentially, allow variable declarations to create a variable scoped to the “if/else” block, assign a value, *and* return the value.

I don’t know if C# 6 would support this type of use, but if it would, it might be the best of both worlds.

Two Code Smells to Learn from Apple’s SSL Security Bug

I was reading an excellent ACM article on the recent Apple security bug, and it struck me that the author skipped completely over one of the *true* root causes of this bug.

Here’s the buggy code:


if ((err = SSLFreeBuffer(&hashCtx)) != 0)
    goto fail;
if ((err = ReadyHash(&SSLHashSHA1, &hashCtx)) != 0)
    goto fail;
if ((err = SSLHashSHA1.update(&hashCtx, &clientRandom)) != 0)
    goto fail;
if ((err = SSLHashSHA1.update(&hashCtx, &serverRandom)) != 0)
    goto fail;
if ((err = SSLHashSHA1.update(&hashCtx, &signedParams)) != 0)
    goto fail;
    goto fail;
if ((err = SSLHashSHA1.final(&hashCtx, &hashOut)) != 0)
    goto fail;

Of course, the issue is in the duplicated “goto fail,” which, while indented, is not controlled by the preceding ‘if’ statement.

The ACM author blames the lack of unit tests. He has a point, but I think there are two other root causes: action repetition and poor formatting.

DRY – Don’t Repeat Yourself

The goto in itself isn’t “bad,” but the duplication of the goto is what leads to the problem.

When two conditions lead to the same result, you should favor code that “measures twice and cuts once.” You may have to engage a few more brain cells to make this code style perform as well as spaghetti branches, but it’s worth the investment.

Consider the following replacement code:

err = SSLFreeBuffer(&hashCtx)
   || ReadyHash(&SSLHashSHA1, &hashCtx)
   || SSLHashSHA1.update(&hashCtx, &clientRandom)
   || SSLHashSHA1.update(&hashCtx, &serverRandom)
   || SSLHashSHA1.update(&hashCtx, &signedParams)
   || SSLHashSHA1.final(&hashCtx, &hashOut);
if (err != 0) goto fail;

Logical OR (||) short-circuits on the first non-zero value, so in terms of performance, this is equivalent to the goto-fail mess above, and also equivalent to what most programmers would suggest as an alternative (adding a clutter of braces to the “if” statements so the extra “goto” sticks out better). The one caveat is that err ends up holding 1 rather than the specific error code, so if the fail handler returns err to the caller, you’d want to capture each call’s result individually or accept the loss of detail.

But unlike the original code, the example above calculates the condition of the existence of an error *once* and then performs *one* action (the goto) if an error is found. IMHO, the code is more clear as well, since it doesn’t rely on goofy right-hand evaluation of an assignment (which in itself is a shortcut I was never fond of in C).

Sentences should not
have arbitrary line breaks and neither
should code

Sure, braces would have helped here, but IMHO, simple if statements should simply include their action on the same line, just like they did back in the ol’ line-numbered BASIC days. And, ideally, similar logic should align their actions. Consider this code:


if ((err = SSLFreeBuffer(&hashCtx)) != 0) goto fail;
if ((err = ReadyHash(&SSLHashSHA1, &hashCtx)) != 0) goto fail;
if ((err = SSLHashSHA1.update(&hashCtx, &clientRandom)) != 0) goto fail;
if ((err = SSLHashSHA1.update(&hashCtx, &serverRandom)) != 0) goto fail;
if ((err = SSLHashSHA1.update(&hashCtx, &signedParams)) != 0) goto fail;
if ((err = SSLHashSHA1.final(&hashCtx, &hashOut)) != 0) goto fail;

(Yes, I noticed this actually does soft-wrap in my default WordPress theme. Paste it into Notepad and behold its awesomeness. :) )

While this still exhibits the same overuse of evaluating assignments and duplicating actions, at least it is abundantly clear what is going on, and an extra “goto” will stick out like a sore thumb.

I realize this is an unorthodox way to format C code, but it’s a favorite of mine. Maybe my mind is just crippled by the semester of COBOL I took in college, but I like aligned code. Like a good grid vs. a CSV text file, the alignment makes it far easier to spot inconsistencies. I use the same approach in CSS and T-SQL, where it makes sense to do so.

How Pandora can be profitable

I’m a fan of streaming music services. Pandora clued me in on a number of bands I would’ve never heard of otherwise. But services like these currently have unsustainable business models. They pay exorbitant license fees for the music, and they also have to pay for bandwidth. And with the end of network neutrality around the corner, their bandwidth costs will likely skyrocket, putting the last nail in the coffin.

I have a solution.

As a music fan, I already own a good portion of the music that pops up on my Pandora stations. I either bought it because I was already following the band, or because I found it on Pandora and wanted to be able to play it on demand.

I suspect many Pandora users are like me. They own a lot of music, but still use Pandora because it’s better at matching their mood than iTunes, and because they like hearing something fresh occasionally.

So, here’s my solution: simply have the app catalogue my device’s music library, and any time it would normally choose to play a song I already own, play it locally rather than streaming it.

Some benefits of doing this:

  • Pandora doesn’t have to pay streaming or bandwidth fees. This could cut their overhead drastically without degrading the user experience.
  • I hear the songs I own in high quality, not compressed for streaming.
  • I can use the app even if my connection is poor or if I’m off the grid.
  • I use less bandwidth against my mobile plan cap.
  • I still get to hear music mixed in that I don’t currently own.
  • Because of all the above, I have an incentive to buy songs I really like, and Pandora could even sell them to me so they become part of my normal music library (competing directly with iTunes and Amazon). They make money, I get a better music collection.
  • The above gives Pandora strong incentive to continue to improve their prediction algorithms.
  • When connectivity is slow, to avoid skips, Pandora could buffer a streamed track in the background while playing my own tracks.
  • Music companies could pay Pandora (or forego streaming fees) to get new music out there. It’s like the old “payola” system, but can be engineered to only play for users who might actually like the music, and only puts it in the queue a few times so it doesn’t annoy them. (I would encourage Pandora to visibly display that the track was sponsored, and respect the user’s decision if they thumbs-down the suggestion.)
  • It’s more sustainable. It makes no sense to waste so much Internet bandwidth streaming songs to people who already own them.
  • Pandora could allow the user to select the mix of local vs. streamed music. That way the user also controls how many ads they hear (which are only needed to pay for streamed songs), or the user could set a monthly bandwidth cap to avoid going over their mobile plan limit.

Seems like a rather obvious solution to me; I don’t understand why it isn’t also painfully obvious to Pandora’s 740 employees.

C# 6.0 — Wish List

I was reading this blog, and had a few thoughts of my own about what should come next in C#.

1. Monadic null checking. This was #7 on the aforementioned blog. Used correctly, this could be a good time-saver.
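
For example, something along these lines, where customer, Address, and City are hypothetical names:

// Hypothetical example: city is null if customer or customer.Address is null,
// instead of throwing a NullReferenceException.
var city = customer?.Address?.City;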

2. VS support for columnar-aligned code. This is really a Visual Studio thing, but I really wish the IDE wouldn’t try to stop me from using tabs to align my code. I’ve been doing this ever since I learned COBOL, and probably before that. Code like this makes me happy:

switch(fruit) {
   case "apple":     DoAppleStuff();          break;
   case "banana":    DoBananaStuff();         break;
   case "orange":    DoOrangeStuff();         break;
   default:          DoGenericFruitStuff();   break;
}

See how neat and tidy that looks? But unless I turn off code formatting entirely, VS constantly reformats code like this, so I’m forever hitting Ctrl-Z to undo it. I don’t want to give up code formatting overall, just have more control over what it deems to be a format issue.

3. Support for strongly-typed RegEx and XML literal values. These data types shouldn’t have to be birthed from strings, there should be first-class scalar support for them, including Intellisense. Example:

var re = @This is a re[g]ular (expression)\. You have to escape \@ as you would " with a string. Don't use / like Perl, it is used too often and escaping it is a pain.@;
var xml = <tag><child /><child attribute="foo">XML support too, with one parent node.</child></tag>;

4. Pure functions. Borrowed from this blog.

5. Intrinsic support for code contracts. Strong typing is good, but contracts are better.

6. Switch…Case support for lists. I.e., “case 1, 2, 3:”.
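
For comparison, the closest the language offers today is stacking the labels, which works but is noisier (the names below are placeholders):

switch(n) {
    case 1:
    case 2:
    case 3:  DoSmallNumberStuff();  break;
    default: DoOtherStuff();        break;
}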

That’s all I can think of at the moment, but I may come back later and add a few more…

Welfare Cliffs are Amusing, but Useless

In a recent post, Philip Greenspun tries to make the case that the ACA is just one more situation where people will suck on the government teat rather than doing something productive, because they actually do better on assistance programs.

While such “cliff” math is interesting as academic trivia, it is practically useless.

The vast majority of families making $50k a year or more (73%) have private health insurance (mostly employer-paid), and that goes up to 90% for families making around $100k/year. (Source: Kaiser Family Foundation study released in October).

The number of New Hampshire families making $100k a year and NOT being eligible for an employer-provided plan would be a statistical anomaly (if you are *eligible* for such a plan, it would be almost impossible to obtain any sort of subsidy).

As for the AEI’s ridiculous class-baiting chart and accompanying article, most single mothers with two children don’t get “job opportunities” that jump $30k a year in wages (in the same city, at least). This hypothesized Sophie’s Choice of gub-ment help versus a doubling of take-home pay is ludicrous on its face, and such an extreme edge case should not be used to determine overall economic policy.

The real situation for the average single mother in America is much bleaker than the AEI’s fantasy charts. It is just silly to think that people are “gaming” the government by refusing a $30k job promotion because they’ll lose CHIP or Medicaid benefits. And even in such a ridiculous scenario, again, it is highly likely that such a job will come with employer-provided healthcare, which isn’t shown on their chart as income but handily replaces (and greatly improves on) the key government benefit they would lose with the new job.

Sure, we could use tapering formulas instead of cut-offs when determining eligibility for various entitlements and assistance, and thus avoid these statistical cliffs.

But let’s not kid ourselves by imagining that there’s some horde of single moms who are refusing lotto-winning-level job raises because they’ll lose some much-needed help that they now receive. These fictions are not based in reality and are intended to serve one purpose: to demonize the poor in our country by creating bugaboo stereotypes of “welfare-queens” who would rather benefit from government largess than having the dignity of supporting their families on their own.