Friday, 10 February 2017

Email and UPN do not belong to the same namespace

Hi There,

Just recently I helped out in a case where hybrid Exchange users with mailboxes in the cloud were failing to retrieve autodiscover configuration data.

It was an environment with multiple federated domains and mailboxes split between on-premises and O365. On-premises mailboxes worked fine; only O365 mailboxes were failing and, more bizarrely, one of the domains worked while the others didn't.

As we know, mail clients for onboarded hybrid mailboxes go through a double autodiscover process:

  1. The first iteration discovers the on-premises mail user, which then redirects the client to the cloud mailbox.
  2. The client then goes through a second autodiscover process, this time against the cloud mailbox.
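
As an aside, the redirect in step 1 is driven by the on-premises mail user's remote routing address (the targetAddress attribute). A quick way to eyeball it - a sketch, where user@domain.tld is a placeholder:

# On-premises Exchange Management Shell: the RemoteRoutingAddress is what
# autodiscover hands back to send the client to the cloud mailbox
Get-RemoteMailbox -Identity user@domain.tld | Format-List UserPrincipalName, PrimarySmtpAddress, RemoteRoutingAddress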

Looking at the Microsoft Remote Connectivity Analyzer output, I noticed that the first iteration succeeded. The issue was with the second pass. The error:

X-AutoDiscovery-Error: LiveIdBasicAuth:LiveServerUnreachable:<X-forwarded-for:40.85.91.8><ADFS-Business-105ms><RST2-Business-654ms-871ms-0ms-ppserver=PPV: 30 H: CO1IDOALGN268 V: 0-puid=>LiveIdSTS logon failure '<S:Fault xmlns:S="http://www.w3.org/2003/05/soap-envelope"><S:Code><S:Value>S:Sender</S:Value><S:Subcode><S:Value>wst:InvalidRequest</S:Value></S:Subcode></S:Code><S:Reason><S:Text xml:lang="en-US">Invalid Request</S:Text></S:Reason><S:Detail><psf:error xmlns:psf="http://schemas.microsoft.com/Passport/SoapServices/SOAPFault"><psf:value>0x800488fc</psf:value><psf:internalerror><psf:code>0x8004786c</psf:code><psf:text>Email and UPN do not belong to the same namespace.%0d%0a</psf:text></psf:internalerror></psf:error></S:Detail></S:Fault>'<FEDERATED><UserType:Federated>Logon failed "User@domain.tld".;

Needless to say, I checked on-premises and cloud user and mailbox properties, UPNs, addresses, connector address spaces, proxy addresses, the lot, to no avail. It was all configured correctly.
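
For what it's worth, a one-liner like this can hunt for UPN/primary SMTP namespace mismatches across the tenant (a sketch run in Exchange Online PowerShell, not my verbatim session):

# List cloud mailboxes whose UPN domain differs from their primary SMTP domain
Get-Mailbox -ResultSize Unlimited | Where-Object { $_.UserPrincipalName.Split('@')[1] -ne $_.PrimarySmtpAddress.ToString().Split('@')[1] } | Select-Object DisplayName, UserPrincipalName, PrimarySmtpAddress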

Then I checked ADFS. There I found a couple of errors, so I turned on debug trace. Nothing obvious there either, so I turned it off.

I tested ADFS logon via https://sts.domain.tld/adfs/ls/IdpInitiatedSignon.aspx: Successful.

ADFS was working well per se. Somehow, it was getting incorrect details from O365.
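
One way to compare what the two sides believe about the trust is Get-MsolFederationProperty from the MSOnline module, which reads the federation settings from both ADFS and O365 side by side. A sketch - sts.domain.tld and domain.tld are placeholders:

Connect-MsolService
Set-MsolADFSContext -Computer sts.domain.tld
# Shows the trust properties as seen by ADFS and by Office 365
Get-MsolFederationProperty -DomainName domain.tld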

I also suspected that the ADFS proxy was somehow breaking the SSL stream - I had dealt with a similar situation before. However, this idea was dropped when the original (3rd-party) proxy was replaced with Microsoft's native Web Application Proxy and the issue remained.

To recap:

  • Users had matching e-mail and UPN
  • ADFS itself was working
  • Multiple federated domains
  • ADFS Proxy (WAP) ruled out


Then I decided to re-federate the domains. Since I had very little (a.k.a. none whatsoever) information on how it was set up initially, and no details on any deployment history, tabula rasa seemed in order.

So I converted the domains to Managed, then re-federated them. Since there are multiple federated domains, I used the Convert-MsolDomainToFederated cmdlet with the SupportMultipleDomain switch.
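
For the record, the re-federation boils down to something like this (a hedged sketch, run with the MSOnline module against the ADFS server; domain names and the password file path are placeholders):

Connect-MsolService
Set-MsolADFSContext -Computer sts.domain.tld
# Convert the domain back to Managed; temporary passwords for converted users are written to the file
Convert-MsolDomainToStandard -DomainName domain.tld -SkipUserConversion $true -PasswordFile C:\Temp\pwd.txt
# Re-federate; SupportMultipleDomain is a must when more than one domain is federated
Convert-MsolDomainToFederated -DomainName domain.tld -SupportMultipleDomain

Repeat for each federated domain.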

Coffee time, to let time pass and do its thing (it's a distributed environment and changes don't propagate instantly).

<suspense>Then came the test</suspense>... Lo and behold: everything started to work! Every federated domain, every service.

In summary: The hybrid environment and user accounts were configured correctly, yet the wrong details were passed by O365 to ADFS. Either the trusts were incorrectly configured, or the federation metadata got corrupted. Whatever it was, re-federating the domains fixed it.

Sunday, 29 January 2017

The Curious Case of GroupPolicy Error 1053

Recently I was involved in a case where a user was losing his Taskbar icons and email signatures. He had all possible folders redirected to a remote network share over a WAN link. The icons are stored in the AppData (Roaming) structure, and I wanted to bring that folder back to the local computer (a.k.a. cancel redirection for the AppData folder) without affecting other folders.

I knew I had to fiddle with some GPOs, and this is where the fun started.

I created a new GPO, applied security filtering to scope it to this one user only, and ran GPUPDATE. It bombed out:


The Details tab of the same event suggests that a user account is disabled:

ErrorCode: 1331
ErrorDescription: This user can't sign in because this account is currently disabled.


I knew network connectivity wasn't an issue. I checked DNS anyway, even though computer settings were applying successfully. That uncovered a whole heap of inconsistencies, which I corrected; however, it didn't fix my GPOs. I even turned on USERENV logging - nothing useful in there either, nor in the GroupPolicy log.

I knew the account wasn't disabled because I was logged on with that very account.

I was banging my head against the wall. Then I came across a forum post where someone suggested clearing the credentials vault.

I checked the vault and, surprise, surprise, found the credentials of another user - and yes, that user's account was DISABLED! I enabled the account, and GPUPDATE then reported that the credentials were invalid. Indeed, I noticed Security-Kerberos Warning 14 events in the user's System log, to which I hadn't paid much attention before:


I was getting somewhere.

So apparently, once upon a time, my user had to access another user's resources for whatever reason. In time, however, the other user moved on, his password was changed and the account was disabled. The credentials in my user's Credential Manager vault were never refreshed or removed.

I could find no trace of the vaulted account in any log: not in the USERENV log, not in the domain controller's Security log - nowhere. Not one hint.

Clearing the obsolete account's credentials from my user's Credential Manager fixed it: the GPOs are now applying happily.
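
If you prefer the command line over the Control Panel applet, cmdkey can do both the inspection and the cleanup. A sketch - the target name below is hypothetical, copy the exact one from the /list output:

cmdkey /list
cmdkey /delete:Domain:target=SERVER01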



Takeaway point: GroupPolicy Error 1053 events in the System log with ErrorCode 1331 are likely caused by another user's stale credentials sitting in Credential Manager. Go check your vault and clear or update anything suspicious.

Thursday, 1 December 2016

Potential Security Hole In RMS with File Classification Infrastructure

Brace yourself for a long post on what to expect when embarking on protecting files with RMS and Windows Server File Classification Infrastructure (FCI). It isn't pretty, no matter how I look at it.

The Short Story

A client had a need to protect files and information against data leak. Being an Office 365 tenant also, the obvious choice was Azure RMS.

I implemented Azure RMS in conjunction with File Classification Infrastructure. It came together nicely, however during testing it became evident that Microsoft still has a long way to go to deliver a dependable solution.

I found that RMS in combination with FCI is unreliable and exposes unprotected information to the risk of being leaked.

The Flaw

A deadly combination of PowerShell scripts and uncorrelated tasks leads to failures due to file sharing violations, and ultimately to unprotected files and potential data leak.



While my implementation used Azure RMS, I dare say that a full-blown on-premises AD RMS solution would be plagued by the same flaw.

The Nitty-Gritty

The official Microsoft implementation document is available at https://docs.microsoft.com/en-us/information-protection/rms-client/configure-fci. The article says (highlighting mine):


I did just that. Then I started testing. I set the schedule for the task (remember, "continuous operation" was also enabled).

The first thing that I noticed was that some of my test files remained unprotected after the task ran. In the Application event log I noticed the following error:



Tried the easy way out, checking the HTML report:

Error: The command returned a non-zero exit code.

Useless as a windshield wiper on a submarine.


Every time the scheduled RMS script ran, I got random files being listed as failing. Most weird...

NOTE: Before anyone gets agitated, I want to point out that I dropped a copy of the RMS-Protect-FCI.ps1 file into the test folder. This is NOT the copy that is run to protect files, just a copy of it for testing generic protection.

I checked the log mentioned in the event. The result: same as above (useless):


More digging revealed that additional logs are created in the same folder as the one above. The log in the screenshot (see further below) correlates with the one above in that it was created roughly 5 minutes later - a consistent delay of approximately 5 minutes throughout my tests (your delays may differ depending on server performance, but they should be consistent between test runs).

In the log I was getting the following:

Item Message="The properties in the file were modified during this classification session so they will not be updated. Another attempt to update the properties will be made during the next classification session."

and

Rule="Check File State Before Saving Properties"


These are different files, but nevertheless I had them in the same test folder and they correlate with previous events. We are getting somewhere but not quite there yet. The information is insufficient to draw any conclusions.

For the record, log locations are configured as follows:


A gut feeling prompted me to look at Task Manager. I turned on the Command Line column to see what was being launched and how, and started watching what was chewing CPU power. I was in for a rude shock: the PowerShell task configured on the Action tab of the File Management Task properties was being re-launched as soon as it exited, in an endless loop. It was obviously my RMS script. The following screenshot cannot convey the dynamics, but if you watch Task Manager for a while you'll see PowerShell (re)protecting all files in the folder, in a never-ending loop:


Then I tried to manually protect a file. It produced a file corruption error:

Protect-RMSFile : Error protecting Shorter.doc with error: The file or directory is corrupted and unreadable. HRESULT: 0x80070570

I instantly kicked off CHKDSK, but it came back clean. I tried to open the file in question - success. Apparently there was no corruption - I could open the file just fine - yet RMS failed with a file or directory corruption error. Go figure.
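
For context, the manual attempt was along these lines (a sketch; the template GUID is a placeholder - Get-RMSTemplate returns the real ones available to your tenant):

Get-RMSTemplate
Protect-RMSFile -File C:\RMSTest\Shorter.doc -InPlace -TemplateID <template-GUID>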


Then I wanted to see what the file protection status was in general. I ran the following command as documented at https://docs.microsoft.com/en-us/information-protection/rms-client/configure-fci:

foreach ($file in (Get-ChildItem -Path C:\RMSTest -Force | where {!$_.PSIsContainer})) {Get-RMSFileStatus -f $file.PSPath}

To my surprise, I got a file access conflict error:

Get-RMSFileStatus : The process cannot access the file because it is being used by another process. HRESULT: 0x80070020

I had run this command numerous times during my tests and had never seen this error before. A subsequent run of the same command came back clean. Is it the backup? Is it the antivirus? It was the middle of the night, so no-one was accessing the file except my script. I was puzzled...


A bit later, I was watching Task Manager while the scheduled task was in progress, with "continuous operation" also enabled. I noticed that I had two RMS PowerShell tasks running concurrently:


One of them disappeared as soon as the scheduled task completed, while the one spawned by what I think is "continuous operation" remained. This may explain the random sharing violations and errors in the logs above.

I also noticed that while new files I created via Word or Excel were protected almost instantly thanks to continuous operation being enabled, files that I copied in from elsewhere were not protected by either the "continuous operation" task or the scheduled task.

I also want to point out that the PowerShell script can be scheduled no faster than once a day. Therefore if it fails to protect a file, then there is at least a 24-hour window until it runs again, during which unprotected files can be leaked.

As I mentioned before, I had been watching Task Manager, and I noticed that the RMS PowerShell script took up significant processing power. The official Microsoft documentation at RMS protection with Windows Server File Classification Infrastructure (FCI) (also linked above) states that the script is run against every file, every time:


Knowing the effects an iterative file scan has on disk performance, I kicked off a Perfmon session that shows CPU utilization and % Disk Time, just to check. Then, while watching it, I turned off "continuous operation" in File Server Resource Manager:
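
If you would rather not build a Perfmon data collector set, Get-Counter can sample the same two counters from PowerShell (a quick sketch):

# Sample CPU and disk time every 5 seconds until interrupted with Ctrl+C
Get-Counter -Counter '\Processor(_Total)\% Processor Time','\LogicalDisk(_Total)\% Disk Time' -SampleInterval 5 -Continuous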



The effect was staggering: the CPU almost instantly took a holiday and the disk too was much happier:



Putting the puzzle together:
  • The File Server Resource Manager runs a scheduled task and a continuous task concurrently.
  • I am getting random files failing to be protected (see SRMREPORTS error event 960 and associated logs above)
  • I am getting random sharing violation errors.
  • I had one corruption report, although I cannot be sure it was the result of my setup.
  • I was expecting the "continuous operation" mode to be event-driven (triggered only when a new file is dropped into the folder). What Task Manager indicates, though, is that it is an iterative task in an endless loop that constantly scans the disk. This seems to be corroborated by the Perfmon data.
  • The "continuous operation" PowerShell task is a resource hog.

Choices, Choices...

It looks like having "continuous operation" enabled for instant protection of new files works okay. Therefore you would want it turned on to ensure that chances of leaking sensitive data are minimised. However, as I have observed, it conflicts with the scheduled file protection task, leaving random files unprotected.

On the other hand, I also observed that if I turn off continuous operation, then, besides the fact that the system breathes a lot easier, the scheduled task protects files more reliably. On the downside, it leaves a significant time window during which files aren't protected and thus sensitive information can be compromised.


In Conclusion

My observations lead me to conclude that when PowerShell scripts are used to RMS-protect files, File Server Resource Manager fails to coordinate the scheduled and continuous operation tasks and scripts, which results in random sharing violations and, ultimately, unprotected files. As a result, the primary purpose of RMS - protecting files - falls flat on its face, giving the unsuspecting or malicious user plenty of time to leak potentially sensitive data.

Additionally, the way PowerShell scans the file system and (re)protects files results in significant performance degradation even on lightly utilised systems.

While RMS is a great piece of technology that mitigates the risk of leaking sensitive data, it seems it still has a long way to go to become a dependable tool when integrated with File Classification Infrastructure.

PS: All this is based on observation and trial and error, and hence it may be incorrect in some respects. Information is scarce out there. I welcome any comments, suggestions and pointers to additional information.

Friday, 18 November 2016

Integration Is Not That Great Between Batch And Individual Mailbox Migration Tools

I was migrating a small batch of some 30 mailboxes from on-premises to Exchange Online in a hybrid configuration. I chose the option to suspend the migration, then complete it manually later.

While most mailboxes reached the AutoSuspended status, some had not even started to migrate (PercentComplete = 0). These few mailboxes appeared to be looping indefinitely through statuses like TransientFailureSource, LoadingMessages, InitializingMove and the like:


I said to myself, I couldn't let these "transient" failures and apparent delays hold up the rest of the pack. Experience shows that re-migrating them from scratch usually gets them through.

I was thinking that, at the end of the day, a batch migration is just a bunch of mailbox moves treated as one unit, and that I should be able to manage individual move requests within the batch via the *-MoveRequest commands. I went ahead and finalised the AutoSuspended moves with Resume-MoveRequest. Little did I know that I was about to upset the batch migration logic to the point where it could not recover, resulting in all mailbox moves being reported as "Failed".

To recap:
1. Most mailboxes in the batch finished initial sync and entered the AutoSuspended state.
2. Some mailboxes failed to sync, quoting transient failures and looping indefinitely through various states, never reaching the AutoSuspended state.
3. Using Resume-MoveRequest in remote PowerShell, I completed the AutoSuspended moves.
4. Using PowerShell, I confirmed that all resumed migrations had reached the Completed status:
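
In PowerShell terms, steps 3 and 4 boiled down to something like this (a sketch, not my verbatim session):

# Step 3: complete every move that reached AutoSuspended
Get-MoveRequest -MoveStatus AutoSuspended | Resume-MoveRequest
# Step 4: confirm that the resumed moves completed
Get-MoveRequest | Get-MoveRequestStatistics | Select-Object DisplayName, Status, PercentComplete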


At this point I stopped the batch, removed the failing mailboxes, and restarted the batch. Then I chose "Complete the migration" in the action pane.

I was expecting the batch to realise that the mailbox moves had been completed and to finish successfully. Well, good luck with that... All mailboxes that had completed successfully following Resume-MoveRequest were reported as Failed in the GUI. Yes, Failed.

I checked individual reports. They confirmed that the mailboxes were indeed successfully migrated - and please note, these reports were returned by the GUI's very own Download the report for this user link:


The Migration Batch GUI report however showed them all as Failed:


Get-MigrationBatch seemed to be supporting its GUI sibling, returning a status of CompletedWithErrors:


An educated guess indicates that the batch migration wizard is retrieving the report from individual Get-MoveRequestStatistics -IncludeReport commands.

So should I be worried? In my experience, definitely not. I confirmed with users that the above discrepancy didn't affect their ability to connect to and use their migrated mailboxes.

Takeaway: Integration between the *-MigrationBatch and *-MoveRequest commands (and their siblings) isn't that great. Expect inconsistent reports when mixing them. If you started moving mailboxes with one tool, use the same tool until the migration is complete - don't mix them, no matter how tempting it may be. The inconsistent reports can be misleading, but data integrity is maintained. The reports to rely on are those created by *-MoveRequest and *-MoveRequestStatistics: if these commands say the moves completed successfully, then they completed successfully, regardless of what other reports say.

Hope that one day Microsoft will improve reporting and integrate batch and individual mailbox migration tools better.

Happy reporting :-)

Saturday, 12 November 2016

Removing Exchange 2010? Disable Performance Logs & Alerts!

In a recent Exchange 2010 to Exchange 2016 upgrade (a.k.a. "transition") project, I got to the point where I had to uninstall Exchange 2010. To my surprise, it failed to uninstall, despite the fact that all services I was aware of, such as antivirus and backup, were disabled or otherwise not affecting Exchange components.

The uninstall wizard displayed a long list of rundll32 processes that had files open, which prevented the removal of my Exchange 2010 server:


To see which process was holding files open, I had to know what was loaded into rundll32. At this point, however, the only clue I had was the process ID, a.k.a. PID. Not very useful.

It is not all gloom and doom though. Task Manager is a very useful troubleshooting beast, with lots of information hidden from the default view.

In my case I had to see what was loaded with rundll32. Let's look specifically at PID 7172, the first in the pre-flight check error report.

In Task Manager, I enabled the Command Line column...


... which then showed me the path to the DLL that was loaded in the process:
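
The same detail can also be pulled without the GUI - a sketch using CIM and the PID from the pre-flight report:

Get-CimInstance Win32_Process -Filter "ProcessId = 7172" | Select-Object ProcessId, CommandLine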


A quick Google search on pla.dll suggested that the files were being held open by the Performance Logs & Alerts service. Indeed, this service was running:


I disabled _and_ stopped the service, ...


... retried the uninstall wizard, and voilà, the pre-flight check succeeded:


IMPORTANT: Stating the obvious, it is not sufficient to just _disable_ the service. You must also _stop_ it to make sure it releases all file locks and resources. Disabling it ensures that the service doesn't start and lock files while in the middle of uninstalling Exchange, while stopping it actually releases the file locks.
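
From an elevated PowerShell prompt, both steps look something like this (a sketch; pla is the short name of the Performance Logs & Alerts service):

# Keep the service from starting again mid-uninstall...
Set-Service -Name pla -StartupType Disabled
# ...and release the file locks right now
Stop-Service -Name pla -Force
# Verify: Status should be Stopped
Get-Service -Name pla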

In a similar way, we can identify any process that may hold files open when uninstalling not only Exchange 2010 but other software too.

Takeaway: Add the Performance Logs & Alerts service to the list of services to disable when uninstalling Exchange 2010 - or any other version of Exchange, for that matter.



Saturday, 5 November 2016

StalledDueToTarget_MdbCapacityExceeded

Intro

The Mailbox Migration Performance Analysis post published by the Exchange team back in 2014 provides a great deal of detail about monitoring and measuring migration performance using the various states and the time a job spends in each state.

What it doesn't touch on, however, is what appears to be a rare condition: sometimes the Exchange Online mailbox database designated to store a particular on-boarded mailbox apparently runs out of space. At least that's the conclusion an educated guess led me to.

Some Context

In one of my recent Office 365 migration projects one particular mailbox was stuck for hours in the StalledDueToTarget_MdbCapacityExceeded state.

Get-MoveRequest <Identity> showed that the request had been queued. A nice detail, but in this case not very informative.

Displaying some more detail, including the status report, returned no useful information either. In fact, the report may be a bit misleading in that it may suggest the mailbox is too big to fit within the quota. Well, the figures indicated that both the mailbox and the archive were well below the 50 GB mailbox quota and the unlimited archive quota.

Get-MoveRequest <Identity> | Get-MoveRequestStatistics -IncludeReport | fl DisplayName, StatusDetail, TotalMailboxSize, TotalArchiveSize, Report


I had never come across this particular condition, so I turned to my omniscient technical advisor, GG (Google the Great). No luck there either. Not one hit. Am I the only one in the universe who has seen this so far? I started to speculate.

The error suggests that the mailbox in question was designated to be stored on a specific mailbox database; however, that database had no capacity to accommodate the mailbox.

Assuming that I am correct, I have no way of knowing how this happened. I can think of two possible scenarios:

1. I can imagine that, under some rare conditions, the database selection algorithm over-provisions mailbox databases. The system detects it down the line but fails to re-allocate the mailbox to another eligible database, and the mailbox move gets stuck and never completes. I suspect this is the most likely cause.

2. The maximum size of the target mailbox database was administratively reduced *after* the move request was queued. Unlikely, but not impossible.

I may never find out, but it is irrelevant anyway.

After the troubled mailbox had been sitting in this state for over 5 hours, I said to myself that I had to do something about it. Maybe removing the mailbox from the batch and re-migrating it separately would put it into a roomier database. And it did!

The Fix

This worked for me:

1. Stop the migration batch. Mailboxes cannot be removed from a batch that is still running.

Stop-MigrationBatch "Batch Name" -Confirm:$false


2. Give it a couple of minutes and confirm that the batch has stopped. If it is still "stopping" then give it some more time.

Get-MigrationBatch "Batch Name"


3. Find the user's primary SMTP address by running one of the following commands on the on-premises system:

((Get-Mailbox -Identity UserName).PrimarySmtpAddress).ToString()

or

Get-Mailbox -Identity UserName | fl PrimarySmtpAddress

4. Remove the mailbox from the batch. Substitute Primary_SMTP_Address with the one obtained above.

Remove-MigrationUser <Primary_SMTP_Address>


5. Restart the migration batch.

Start-MigrationBatch "Batch Name"


6. Now that the user has been removed from the migration batch, create a new batch and add the mailbox. This time Exchange Online will place the mailbox into a database with sufficient space and the mailbox will be moved successfully - or at least that's what happened in my case.
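
A sketch of what that new single-mailbox batch might look like in PowerShell - the endpoint name, delivery domain and CSV path are placeholders, and the CSV holds an EmailAddress header line plus the user's primary SMTP address:

New-MigrationBatch -Name "Retry single mailbox" -SourceEndpoint "Hybrid Endpoint" -TargetDeliveryDomain "tenant.mail.onmicrosoft.com" -CSVData ([System.IO.File]::ReadAllBytes("C:\Temp\user.csv")) -AutoStart -AutoComplete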

That's it. Enjoy your move!

Thursday, 6 October 2016

Are All Your Exchange Services Running?

How any Exchange troubleshooting should start:

  1. Scratch head.
  2. Confirm all Exchange server services are running.
  3. Plot out the next steps.

I was recently involved in troubleshooting an issue where automatic resource booking stopped working. Additionally, OOF replies were not being sent either.

All sorts of elaborate reasons flew through my mind as to why this might happen, from misconfiguration right down to mailbox corruption. I went through every troubleshooting step imaginable, and things that had worked elsewhere with similar settings didn't work for this organization.

Then I came across Terence Luk's post here. And then I had a D'OH! moment: are the services running? Sure enough, the Microsoft Exchange Mailbox Assistants service was stopped.


Started the service, tested it again, and Bingo!

As the screenshot indicates, other services may be stopped too, affecting different functionality. Services may stop at any time, for any number of reasons. The point is that I have come across very few organizations that implement service-level monitoring and alerting for their Exchange servers.

The moral: Make it a habit to check services first, before engaging in more complex troubleshooting.
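
On Exchange, that first check is a one-liner: Test-ServiceHealth flags any required service that isn't running (a quick sketch from the Exchange Management Shell):

# Lists, per server role, the required services and any that are not running
Test-ServiceHealth | Format-List Role, RequiredServicesRunning, ServicesNotRunning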