Thursday, 1 December 2016

Potential Security Hole In RMS with File Classification Infrastructure

Brace yourself for a long post on what to expect when embarking on protecting files with RMS and Windows Server File Classification Infrastructure (FCI). It isn't pretty, no matter how I look at it.

The Short Story

A client had a need to protect files and information against data leak. Being an Office 365 tenant also, the obvious choice was Azure RMS.

I implemented Azure RMS in conjunction with File Classification Infrastructure. It came together nicely, however during testing it became evident that Microsoft still has a long way to go to deliver a dependable solution.

I found that RMS in combination with FCI is unreliable and exposes unprotected information to the risk of being leaked.

The Flaw

A deadly combination of PowerShell scripts and uncorrelated tasks leads to failures due to file sharing violations, and ultimately to unprotected files and potential data leak.



While my implementation used Azure RMS, I dare say that a full-blown on-premise AD RMS solution would be plagued by the same flaw.

The Nitty-Gritty

The official Microsoft implementation document is available at https://docs.microsoft.com/en-us/information-protection/rms-client/configure-fci. Says the article (highlights from me):


I did just that. Then I started testing. I set the schedule for the task (remember, "continuous operation" was also enabled).

The first thing that I noticed was that some of my test files remained unprotected after the task ran. In the Application event log I noticed the following error:



Tried the easy way out, checking the HTML report:

Error: The command returned a non-zero exit code.

Useless as a windshield wiper on a submarine.


Every time the scheduled RMS script ran, I got random files being listed as failing. Most weird...

NOTE: Before anyone starts to agitate, I want to note that I dropped a copy of the RMS-Protect-FCI.ps1 file in the test folder. This is NOT the copy that's being run to protect files, but just a copy of it for testing generic protection.

I checked the log mentioned in the event. The result: same as above (useless):


More digging reveals that additional logs are created in the same folder as the one above. I noticed that the log in the screenshot (see further below) correlates to the one above in that it has been created roughly 5 minutes after the above log - consistently approx. 5 minutes delay during my tests (your tests may deem different delays depending on server performance but it should be consistent between test runs).

In the log I was getting the following:

Item Message="The properties in the file were modified during this classification session so they will not be updated. Another attempt to update the properties will be made during the next classification session."

and

Rule="Check File State Before Saving Properties"


These are different files, but nevertheless I had them in the same test folder and they correlate with previous events. We are getting somewhere but not quite there yet. The information is insufficient to draw any conclusions.

For the record, log locations are configured as follows:


A gut feeling prompted me to look at the Event Viewer. I turned on the Command Line field to see what and how is launched, and started watching what was chewing CPU power and how it was launched. I was in for a rude shock: the PowerShell task configured on the Action tab of the File Management Task properties was being re-launched as soon as it exited, in an endless loop. It was obvious that it was my RMS script. The following screenshot cannot convey the dynamics, but if you watch Task Manager for a while you'll see how PowerShell is (re)protecting all files in the folder, in a never ending loop:


Then I tried to manually protect a file. It produced a file corruption error:

Protect-RMSFile : Error protecting Shorter.doc with error: The file or directory is corrupted and unreadable. HRESULT: 0x80070570

I instantly kicked off CHKDSK, but it came back clean. I tried to open the file in question - successful. Apparently there was no corruption, I could open the file, yet RMS failed due to file or folder corruption. Go figure.


Then I wanted to see what the file protection status was in general. I ran the following command as documented at https://docs.microsoft.com/en-us/information-protection/rms-client/configure-fci:

foreach ($file in (Get-ChildItem -Path C:\RMSTest -Force | where {!$_.PSIsContainer})) {Get-RMSFileStatus -f $file.PSPath}

To my surprise, I got a file access conflict error:

Get-RMSFileStatus : The process cannot access the file because it is being used by another process. HRESULT: 0x80070020

I ran this command numerous times during my tests, but I've never seen this error before. A subsequent run of the same command came back clean. Is it backup? Is it the antivirus? It was in the middle of the night so no-one was accessing it except my script. I was puzzled...


A bit later, I was watching Task Manager while the scheduled task was in progress, with "continuous operation" also enabled. I noticed that I had two RMS PowerShell tasks running concurrently:


One of them disappeared as soon as the scheduled task completed, while the one spawned by what I thinks is "continuous operation", remained. This may explain the random sharing violations and errors in the logs above.

I also noticed that while new files that I created via Word or Excel were protected almost instantly as a result of having continuous operation enabled, files that I copied from elsewhere have not been protected neither by the "continuous operation" task nor by the scheduled task.

I also want to point out that the PowerShell script can be scheduled no faster than once a day. Therefore if it fails to protect a file, then there is at least a 24-hour window until it runs again, during which unprotected files can be leaked.

As I mentioned it before, I've been watching Task Manager. I noticed that the RMS PowerShell script took up significant processing power. The official Microsoft documentation at RMS protection with Windows Server File Classification Infrastructure (FCI) (also linked above) states that the script is run against every file, every time:


Knowing the effects an iterative file scan has on disk performance, I kicked off a Perfmon session that shows CPU utilization and % Disk Time, just to check. Then, while watching it, I turned off "continuous operation" in File Server Resource Manager:



The effect was staggering: the CPU almost instantly took a holiday and the disk too was much happier:



Putting the puzzle together:
  • The File Server Resource Manager runs a scheduled task and a continuous task concurrently.
  • I am getting random files failing to be protected (see SRMREPORTS error event 960 and associated logs above)
  • I am getting random sharing violation errors.
  • I had one corruption report, although I cannot be sure it was the result of my setup.
  • I was expecting that the "continuous operation" mode is event-driven (it is triggered only when a new file is dropped in the folder). What Task Manager indicates though is that it is an iterative task in an endless loop that constantly scans the disk. This seems to be corroborated by Perfmon data.
  • The "continuous operation" PowerShell task is a resource hog.

Choices, Choices...

It looks like having "continuous operation" enabled for instant protection of new files works okay. Therefore you would want it turned on to ensure that chances of leaking sensitive data are minimised. However, as I have observed, it conflicts with the scheduled file protection task, leaving random files unprotected.

On the other hand, I also observed that if I turn off continuous operation, then, besides the fact that the system breathes a lot easier, the scheduled task protects files more reliably. On the downside, it leaves a significant time window during which files aren't protected and thus sensitive information can be compromised.


In Conclusion

My observations lead me to the conclusion that when using PowerShell scripts to RMS-protect files, the File Server Resource Manager fails to coordinate scheduled and continuous operation tasks and scripts, and thus it results in random sharing violations and ultimately unprotected files. As a result, the primary purpose of RMS, that of protecting files, falls flat on its face, giving the unsuspecting or malicious user plenty of time to leak potentially sensitive data.

Additionally, the way PowerShell scans the file system and (re)protects files, results in significant performance degradation even on lightly utilised systems.

While RMS is a great piece of technology that mitigates the issue of leaking sensitive data, it seems that it has a long way ahead to become a dependable tool when it comes to integrating it with File Classification Infrastructure.

PS: All this is based on observation and trial and error, and hence it may be incorrect in some respects. Information is scarce out there. I welcome any comments, suggestions and pointers to additional information.

Friday, 18 November 2016

Integration Is Not That Great Between Batch And Individual Mailbox Migration Tools

I was migrating a small batch of some 30 mailboxes from on-premise to Exchange Online in a hybrid configuration. I chose the option to suspend the migration, then complete it manually later.

While most mailboxes reached the AutoSuspended status, some have not even started to migrate (PercentComplete = 0). These few mailboxes appeared to be infinitely looping through statuses like TransientFailureSource, LoadingMessages, InitializingMove and the like:


I said to myself, I couldn't let these "transient" failures and apparent delays hold up the rest of the pack. Experience shows that re-migrating them from scratch usually gets them through.

I was thinking that at the end of the day a batch migration is just a bunch of  mailbox moves treated as one unit, and I should be able to manage individual move requests within the batch via *-MoveRequest commands. I went ahead and finalised the AutoSuspended moves with Resume-MoveRequest. Little did I know that I was about to upset the batch migration logic to the point where it could not recover, resulting in all mailbox moves being reported as "Failed".

To recap:
1. Most mailboxes in the batch finished initial sync and entered the AutoSuspended state.
2. Some mailboxes failed to sync, quoting transient failures and looping indefinitely through various states, never reaching the AutoSuspended state.
3. Using Resume-MoveRequest in remote PowerShell, I completed the AutoSuspended moves.
4. Using PowerShell, confirmed that all resumed migrations have reached the Completed status:


At this point I stopped the batch, removed the failing mailboxes, and restarted the batch. Then I chose "Complete the migration" in the action pane.

I was expecting that the batch will realise that the mailbox moved have been completed and will finish successfully. Well, good luck with that... All mailboxes that have completes successfully following Resume-MoveRequest, were reported as Failed in the GUI. Yes, Failed.

I checked individual reports. They confirmed that the mailboxes were indeed successfully migrated - and please note, these reports were returned by the GUI's very own Download the report for this user link:


The Migration Batch GUI report however showed them all as Failed:


Get-MigrationBatch seemed to be supporting its GUI sibling, returning a status of CompletedWithErrors:


An educated guess indicates that the batch migration wizard is retrieving the report from individual Get-MoveRequestStatistics -IncludeReport commands.

So should I be worried? In my experience, definitely not. Confirmed with users that the above discrepancy didn't affect their ability to connect to and use their migrated mailbox.

Takeaway: Integration between *-MigrationBatch and *-MoveRequest commands (and their siblings) isn't that great. Expect inconsistent reports when mixing them. If you started moving mailboxes using one tool then use the same tool until the migration is complete - don't mix them, no matter how tempting it may be. The inconsistent reports can be misleading but data integrity is maintained. The reports to rely on are those created by *-MoveRequest and *-MoveRequestStatistics. If these commands say that the move has been completed successfully then they have been completed successfully, regardless what other reports say.

Hope that one day Microsoft will improve reporting and integrate batch and individual mailbox migration tools better.

Happy reporting :-)

Saturday, 12 November 2016

Removing Exchange 2010? Disable Performance Logs & Alerts!

In a recent upgrade (a.k.a. "transition") from Exchange 2010 to Exchange 2016 project I got to the point where I had to uninstall Exchange 2010. To my surprise it failed to uninstall, despite the fact that all services that I was aware of, such as antivirus and backup, were disabled or otherwise not affecting Exchange components.

The uninstall wizard displayed a long list of rundll32 processes that had open files and this prevented the removal of my Exchange 2010 server:


In order to see which process holds files open, I had to know what was loaded with rundll32. At this point, however, the only clue I had was the process ID, a.k.a. PID. Not very useful.

It is not all gloom and doom though. Task Manager is a very useful troubleshooting beast, with lots of information hidden from the default view.

In my case I had to see what was loaded with rundll32. Let's look specifically at PID 7172, the first in the pre-flight check error report.

In Task Manager, I enabled the Command Line column...


... which then showed me the path to the DLL that was loaded in the process:


Doing a quick Google search on pla.dll, it looked like the files were open by the Performance Logs & Alerts service. Indeed, this service was started:


I disabled _and_ stopped the service, ...


... retried the uninstall wizard, and voilĂ , the pre-flight check succeeded:


IMPORTANT: Stating the obvious, it is not sufficient to just _disable_ the service. You must also _stop_ it to make sure it releases all file locks and resources. Disabling it ensures that the service doesn't start and lock files while in the middle of uninstalling Exchange, while stopping it actually releases the file locks.

In a similar way, we can identify any process that may hold files open when uninstalling not only Exchange 2010 but other software too.

Takeaway: Add the Performance Logs & Alerts service to the list of services to disable when uninstalling Exchange 2010, or perhaps any other version of Exchange server as a matter of fact.



Saturday, 5 November 2016

StalledDueToTarget_MdbCapacityExceeded

Intro

The Mailbox Migration Performance Analysis blog posted by the Exchange team back in 2014 provides a great deal of details about monitoring and measuring migration performance using the various states and the time a job spent in each state.

What it doesn't touch on, however, is what appears to be a rare condition: sometimes the Exchange Online mailbox database designated to store a particular on-boarded mailbox apparently runs out of space. At least that's what an educated guess made me to conclude.

Some Context

In one of my recent Office 365 migration projects one particular mailbox was stuck for hours in the StalledDueToTarget_MdbCapacityExceeded state.

Get-MoveRequest <Identity> showed that the request has been queued. Nice detail but in this case it wasn't very informative.

Displaying some more details, including the status report, returned no useful information either. In fact the report may be a bit misleading in that it may suggest that the mailbox is too big and it doesn't fit within the quota. Well, the figures indicated that both the mailbox and the archive were well below the 50Gb mailbox and unlimited archive quota.

Get-MoveRequest <Identity> | Get-MoveRequestStatistics -IncludeReport | fl DisplayName, StatusDetail, TotalMailboxSize, TotalArchiveSize,Report


I have never come across this particular condition and I turned to my omniscient technical advisor, GG (Google the Great). No luck there either. Not one hit. Am I the only one in the universe who's seen this so far? I started to speculate.

The error suggests that the mailbox in question has been designated to be stored on a specific mailbox database, however the database had no capacity to accommodate the mailbox.

Assuming that I am correct, I have no way of knowing how this happened. I can think of two possible scenarios:

1. I can imagine that under some rare conditions the database selection algorithm may over-provision mailbox databases. The system detects it down the line, but it fails to re-allocate the mailbox to another eligible database. and the mailbox move gets stuck and never completes. I suspect this is the most likely cause.

2. The maximum size of the target mailbox database has been administratively reduced *after* the move request has been queued. Unlikely but not impossible.

I may never find out, but it is irrelevant anyway.

After the troubled mailbox has been sitting in this state for over 5 hours, I said to myself that I have to do something about it. Maybe removing the mailbox from the batch and re-migrating it separately will put the mailbox into a roomier database. And it did!

The Fix

This worked for me:

1. Stop the migration batch. Mailboxes cannot be removed from a batch that is still running.

Stop-MigrationBatch "Batch Name" -Confirm:$false


2. Give it a couple of minutes and confirm that the batch has stopped. If it is still "stopping" then give it some more time.

Get-MigrationBatch "Batch-Name"


3. Find the user's primary SMTP address by running one of the following commands on the on-premise system:

((Get-Mailbox -Identity UserName).PrimarySmtpAddress).ToString()

or

Get-Mailbox -Identity UserName | fl PrimarySmtpAddress

4. Remove the mailbox from the batch. Substitute Primary_SMTP_Address with the one obtained above.

Remove-MigrationUser <Primary_SMTP_Address>


5. Restart the migration batch.

Start-MigrationBatch "Batch Name"


6. Now that the user has been removed from the migration batch, create a new batch and add the mailbox. This time Exchange Online will place the mailbox into a database with sufficient space and the mailbox will be moved successfully - or at least that's what happened in my case.

That's it. Enjoy your move!

Thursday, 6 October 2016

Are All Your Exchange Services Running?

How any Exchange troubleshooting should start:

  1. Scratch head.
  2. Confirm all Exchange server services are running.
  3. Plot out the next steps.
I have recently been involved in troubleshooting an issue where resource automatic booking stopped working. Additionally OOF replies were not sent either.

All sorts of elaborate reasons flew through my mind as to why this may happen, from misconfiguration right down to mailbox corruption. Went through all troubleshooting steps imaginable and things that worked elsewhere with similar settings didn't work for this organization.

Then I came across Terence Luk's post here. And then I had a D'OH! moment: are the services running? Sure enough, the Microsoft Exchange Mailbox Assistants service stopped.


Started the service, tested it again, and Bingo!

As the screenshot indicates, there may be other services stopped, affecting different functionality. Services may stop randomly for random reasons. The point is that I have come across very few organizations that implement Exchange server service level monitoring and alerting.

The morale: Make it a habit to check services first before engaging in more complex troubleshooting.

Thursday, 29 September 2016

Don't Forget to Bounce the Autodiscover Application Pool

This has been widely documented but I feel the need to share my experience.

The Scenario

Mailboxes are moved from Exchange 2010 to Exchange 2013 or 2016.
Users are prompted to restart Outlook.
Outlook fails to reconfigure the profile and it doesn't connect to the mailbox on the new server.
Test E-mail AutoConfiguration fails:



The Gist

When you finish moving mailboxes from Exchange 2010 to 2013 or 2016, don't forget to restart the Autodiscover application pool on your new 2013/2016 CAS servers:

Restart-WebAppPool MSExchangeAutodiscoverAppPool

Exchange 2016 is an all-in-one server which combines all roles, but it shows as Mailbox server when doing a Get-ExchangeServer command, so you will have to bounce the pool on every Exchange 2016 server.



The Reason

It doesn't happen too often, but when it does then it can be inconvenient. I just finished moving some mailboxes from Exchange 2010 to Exchange 2016, only to be called because users' Outlook failed to connect to their mailbox.

My initial thoughts were that Outlook is not up to date and it fails to connect via MAPI-HTTP (enabled by default in Exchange 2016), or there may be a proxy (mis)configuration. It even crossed my mind that there may be a dodgy HOSTS file entry that overrides DNS settings.

None of these proved to be the case. Instead, the issue lies in the fact that the Autodiscover application on the Exchange 2013/2016 server caches configuration information. After a mailbox is moved, the application still thinks that the mailbox is on the old server. Therefore it will proxy the connection to the old Exchange 2010 server, which then bounces it back to Exchange 2016. They will be ping-ponging Outlook profile reconfiguration autodiscover requests until the Autodiscover application cache expires and the information is refreshed. The workaround is documented at https://support.microsoft.com/en-au/kb/3097392 and it seems to suggest that the timeout can be up to 12 hours. Recycling the application pool will refresh the cache and Outlook will connect successfully.

Therefore do yourself a favor and recycle the app pool when the moves are complete. It will improve your sleep (think supporting users in different time zones).

Even better, before you start moving mailboxes, configure the pool to recycle on a schedule, say every 5 or 10 minutes, for then you don't have to watch the progress. The steps are documented at https://technet.microsoft.com/en-us/library/cc733120(v=ws.10).aspx and it boils down to the following:


Happy moving!

Friday, 9 September 2016

Think Load Balancing Before You Ditch Your Exchange DAG Administrative Access Point

With Exchange 2013 SP1, Microsoft introduced an alternative to the traditional failover cluster on which DAGs are based: no virtual IP, no Cluster Name Object, no worries. For
details see https://technet.microsoft.com/en-AU/library/dd979799(v=exchg.150).aspx.

Backup is the most quoted reason why you may still want to have an administrative access point. Backup software needs an always-on resource to connect to in order to back up
Exchange. And that is, conveniently, the cluster administrative access point. However backup software vendors are catching up with the times and they don't really need a
cluster resource anymore.

So why would you still want to have one? Well, think of inbound SMTP load balancing, and, implicitly, High Availability (HA).

Have you ever tried to load-balance SMTP? If so then you've probably encountered some or all of the following:
  • In the Exchange transport logs, all SMTP traffic appears to be coming from a single IP address - that of the load balancer. How do you tell the "real" source without looking at the e-mail headers?
  • You have a number of internal devices that need to send unauthenticated e-mail to any recipient, including external ones. Think open relay with restricted access. But if all traffic seems to come from the load balancer, then the only way to allow your multifunction device to scan a document to an external mailbox is to authorise the load balancer as approved source of unauthenticated SMTP traffic. Or you need to send e-mail alerts to an external party as par of your support agreement. Suddenly you have a security problem and a career challenge (you'll cause your organization's servers to be blacklisted in no time and you'll need to explain that to your manager).
  • Enable source routing on the load balancer. Bingo, your Exchange server sees the real source. Open relay issue solved. But hey, that will result in asymmetrical routing. Not ideal and usually difficult to diagnose if issues arise. So by solving one problem you introduced another one...

Most of these, and some other aspects, are documented in Paul Cunningham's blog at http://exchangeserverpro.com/issues-with-load-balancing-smtp-traffic/.

The alternative: hang on to your cluster administrative point. The benefits:
  • You can send SMTP traffic from internal sources straight to the DAG administrative access point, bypassing the load balancer. No open relay issues, no transport log analysis issues.
  • No asymmetrical routing issues.
  • Multi-role servers are becoming the norm. In fact, with Exchange 2016, it is the only option. Therefore as long as you have at least one online DAG member, you'll also have implicitly a Hub Transport server too to accept SMTP traffic which will be reachable via the - guess what - thine good olde DAG administrative access point. Therefore service is maintained even if individual DAG members fail.
The solution has one obvious drawback: all inbound SMTP traffic will be processed without any load balancing, by a single Exchange server, the one which holds the name and IP address of the DAG
administrative access point.

As at the time of this post I am yet to see an Exchange infrastructure whith such a heavy inbound SMTP traffic that load balancing (not to be confused with HA) would be a necessity. Therefore the only drawback that I can spot is so insignificant that it can be safely discounted in most environments.

Having said that, I learned to "never say never". If you know of any circumstances where this workaround would not work then please drop me a comment.

Cheers,
Zoltan