Notes from the field: 2015

Tuesday 24 November 2015

Error Sending Email to Subdomains from O365

Suppose you have on-premises Exchange and SharePoint servers. You configure SharePoint to accept content by e-mail (see here). Your main domain is yourdomain.com and the SharePoint domain is sp.yourdomain.com. A sample SharePoint address may look like documents@sp.yourdomain.com.

You set up a dedicated Send connector and scope it for the sp.yourdomain.com domain, using the SharePoint server as smart host.

It works just fine.

Then you enable Hybrid Exchange, move a couple of mailboxes into the cloud, and you switch your MX record(s) to Exchange Online Protection.

In this scenario users with mailboxes in the cloud may fail to submit documents to SharePoint via e-mail and they'll get the following error:

LED=550 5.4.101 Proxy session setup failed on Frontend with '554 5.4.4 SMTPSEND.DNS.Non-ExistentDomain; nonexistent domain'

Here is what fixed it for me, and hopefully it will fix it for you too:

1. In the O365 Exchange Admin Center, under Mail flow / Accepted domains:

Change your parent domain (yourdomain.com) from Authoritative to Internal Relay and enable Accept mail for all subdomains (see here - a tad old, due for an update).

2. Under Mail Flow / Connectors:

Edit the connector which routes mail from O365 to your organization and click Next until the Edit Connector page is displayed.

Add the *.yourdomain.com domain to the list.

Click Next and work your way through the wizard. Don't change anything else. Save the settings. Along the way you'll need to provide a valid e-mail address to validate the connection. Provide one in the yourdomain.com domain.

3. On the on-premises Exchange server configure the following:

Create a new accepted domain of type "External Relay" for sp.yourdomain.com.

There should already be a dedicated Send connector. If not, create a new dedicated Send connector for the sp.yourdomain.com address space.

On the Send connector, ensure that the SharePoint server is listed as smart host. Add it if it isn't.
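If you prefer the shell to the admin center, the three steps above map roughly onto the cmdlets below. Treat this as a sketch rather than a tested script. The connector names ("O365 to On-Premises", "To SharePoint"), the accepted domain name and the smart host sharepoint.yourdomain.com are placeholders I made up. The first two commands assume an Exchange Online PowerShell session, the last two the on-premises Exchange Management Shell. Unlike the wizard, Set-OutboundConnector does not validate the connector for you.

```powershell
# --- Exchange Online PowerShell session ---

# Step 1: make the parent domain Internal Relay and accept mail for all subdomains.
Set-AcceptedDomain -Identity "yourdomain.com" -DomainType InternalRelay -MatchSubDomains $true

# Step 2: add the wildcard domain to the connector that routes mail from O365
# to your organization. "O365 to On-Premises" is a placeholder name;
# run Get-OutboundConnector to find the real one.
Set-OutboundConnector -Identity "O365 to On-Premises" -RecipientDomains @{ Add = "*.yourdomain.com" }

# --- On-premises Exchange Management Shell ---

# Step 3a: accepted domain of type External Relay for the SharePoint subdomain.
New-AcceptedDomain -Name "SharePoint inbound" -DomainName "sp.yourdomain.com" -DomainType ExternalRelay

# Step 3b: only needed if the dedicated Send connector does not exist yet.
# The smart host name is an assumption; use your SharePoint server's FQDN or IP.
New-SendConnector -Name "To SharePoint" -Custom `
    -AddressSpaces "sp.yourdomain.com" `
    -SmartHosts "sharepoint.yourdomain.com" `
    -DNSRoutingEnabled $false
```

If the Send connector already exists, Get-SendConnector | Format-List Name, AddressSpaces, SmartHosts shows whether the address space and smart host are what you expect.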

E-mail will now start flowing as expected.

Your environment might be slightly different, but the principle still applies.

Wednesday 29 July 2015

SIDHistory and Exchange Resource Forests

The Environment

I was recently involved in a business consolidation project where two businesses joined forces and formed a new entity. Inevitably, it had a strong migration component.

My client stipulated that assuming the new identity in the early stages was critical and that it had to be reflected in communications. Therefore e-mail was the first component to be migrated. The plan was to let users log on and access resources such as file, print and applications as normal, with their LEGACY\oldlogon account, while sending and receiving e-mail under their NEW\newlogon identity.

To achieve this first milestone, the new AD forest was created and Exchange 2013 CU8 was installed and configured.

The timeline was unclear as to when users would start to log on with their NEW\newlogon accounts and, related to that, whether they would start logging on to the new forest *before* or *after* other resources had been migrated.

For this reason user accounts were migrated in a “disabled” state along with the SIDHistory attribute, and their mailboxes were moved cross-forest as Linked Mailboxes. Effectively an Account/Resource forest topology was built. For more on Account/Resource forests see https://technet.microsoft.com/en-us/library/aa998031(v=exchg.150).aspx.

The Issue

All worked well, except delegation. Delegation was flaky at best: users regularly lost access, and admins tried unsuccessfully to re-establish permissions. While Get-MailboxPermission initially displayed the correct LEGACY\oldlogon accounts, over time they were replaced, automagically, with NEW\newlogon. Inconsistency was the only constant.

I searched the Internet far and wide and asked on professional forums, but my efforts yielded no useful results.

I ended up opening a case with Microsoft.

The Workaround

For each SID in the shared mailbox’s ACL, Exchange tries to find the corresponding AD account, and it searches its own forest first. It finds the SID of the LEGACY\oldlogon account in the SIDHistory attribute of the migrated NEW\newlogon account, and it says “BINGO! This is the account!” It stops searching there and does NOT follow the chain of trusted forests to identify the real owner of the SID. As a result it incorrectly assigns permissions to NEW\newlogon in the resource forest instead of LEGACY\oldlogon in the account forest.

If the SID on the shared mailbox’s ACL isn’t found in the Exchange server’s home forest, then, and only then, Exchange will move on to find the account in one of the trusted forests.
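To see this behaviour for yourself, it helps to put the mailbox ACL and the sIDHistory values side by side. The sketch below is illustrative only: "SharedMailbox" is a placeholder identity, and it assumes the ActiveDirectory module is available in the same Exchange Management Shell session in the resource forest.

```powershell
Import-Module ActiveDirectory

# Who does Exchange currently resolve the explicit (non-inherited) ACL entries to?
Get-MailboxPermission -Identity "SharedMailbox" |
    Where-Object { -not $_.IsInherited } |
    Select-Object User, AccessRights

# Which accounts in the resource forest carry SIDs from the legacy forest?
Get-ADUser -LDAPFilter "(sIDHistory=*)" -Properties sIDHistory |
    Select-Object SamAccountName,
        @{ Name = "SidHistory"
           Expression = { ($_.sIDHistory | ForEach-Object { $_.Value }) -join ", " } }
```

If a delegate shows up as NEW\newlogon in the first list and that same account appears in the second list, you are most likely looking at the lookup order described above.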

To date the only way to get delegated access to work again is to clear the sIDHistory attribute of the NEW\newlogon account. There are a number of ways to clear the SIDHistory. My preferred tool is Ashley McGlone’s PowerShell module.

WARNING: Do NOT clear the SIDHistory unless you fully understand the effects AND you are sure they will not hurt the business! Otherwise consider a career change.

While clearing the SIDHistory solves the immediate issue, it introduces new challenges later on when the merger/migration is refocused onto accounts and resources. Further steps must be carefully planned and coordinated, and user accounts potentially re-migrated to re-populate the sIDHistory attribute. Otherwise users risk losing access (and the admin his/her job).

In my opinion the logic by which Exchange finds delegated accounts is flawed, and therefore I see the removal of SIDHistory not as a “fix” but rather as a quick and dirty “workaround”, lacking a strategic view of the bigger picture. Hope Microsoft will fix it one day.
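For reference, this is roughly what clearing the attribute looks like with the stock ActiveDirectory module. It is not Ashley McGlone's module, just a minimal sketch of the same idea. The account name "newlogon" and the output file name are placeholders, and -WhatIf is left on deliberately so that nothing changes until you remove it.

```powershell
Import-Module ActiveDirectory

# "newlogon" is a placeholder for the migrated account in the NEW forest.
$user = Get-ADUser -Identity "newlogon" -Properties sIDHistory

# Keep a record of what is about to be removed.
$user.sIDHistory | ForEach-Object { $_.Value } |
    Set-Content -Path ".\newlogon-sidhistory.txt"

# Remove each value. -WhatIf only reports what would happen;
# drop it once the warning above has been taken seriously.
foreach ($sid in $user.sIDHistory) {
    Set-ADUser -Identity $user -Remove @{ sIDHistory = $sid.Value } -WhatIf
}
```

The saved text file is only a record. Getting the values back into the attribute means re-migrating the account with a tool that can write sIDHistory.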

Happy migration!

Tuesday 2 June 2015

Confused QMM Exchange Native Move Jobs

Setting the Scene

In a recent migration engagement I've come across an error, the cause of which, to date, is unknown even to Dell, yet it seems to be associated with target Exchange server availability.

The target Exchange server crashed due to some underlying storage issues while a couple of Native Move mailbox migration jobs were in progress.

I confirmed that from an Exchange server perspective the mailbox moves survived the crash (Get-MoveRequest | Get-MoveRequestStatistics), because the databases failed over to the surviving DAG member. Once the server was recovered and the databases moved back to the original server, I clicked Retry Move in the QMM console. I was expecting that QMM would update itself by querying Exchange and getting fresh information. Instead, QMM stuck its tongue out at me with the following error:

"Parameter set cannot be resolved using the specified named parameters."

The Issue

When in Native Move migration mode, QMM is nothing more than an easy-to-drive GUI which, behind the scenes, builds New-MoveRequest PowerShell commands and submits them to the target server.

QMM keeps track of the migration progress in its own SQL database. In a way this is good because it provides access to quick status and configuration information. However, in the case of Native Move jobs, where QMM is no longer in control of data movement between servers, QMM still appears to look at its SQL database as THE source of truth, as opposed to what the Exchange server is reporting. As a result, things get out of sync quite easily with nasty consequences in terms of time and cost of the migration project.

In case of an Exchange server failure QMM appears to pollute its SQL database with wrong data. Common sense dictates that if a move request has been successfully submitted and has been in progress for some time, the tool should query the Exchange server for fresh progress information and update its records accordingly. QMM, however, appears to look exclusively at its own database with no regard for what Exchange reports. When I click "Retry Move", QMM fails to detect that a move request is already in progress, even though its very own database contains a record of a successful command submission, which in itself is a hint that QMM should go and ask Exchange for an update. Instead it attempts to build a new move request command using polluted data from its SQL database.

Specifically, this is what happened in my case:

Fact: QMM submits New-MoveRequest commands on the target server. Data is "pulled" by the target server from the source server. Therefore parameters are interpreted from the target server's perspective.
Considering the point above, the correct parameter to use would be -TargetDatabase. Instead QMM uses -RemoteTargetDatabase, as if data were being "pushed".
When the target Exchange server tries to execute the command, it encounters the -RemoteTargetDatabase parameter pointing to its very own, local database.
The target Exchange server scratches its head: "I own this database, why on earth is the command telling me that it's remote???"
Quite rightly, when QMM submits the wrongly built command, Exchange gets confused and throws the error.
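To illustrate the difference, here is a hand-written pull move as it would be run on the target server. This is not the command QMM generates (I never saw the full command line), just a sketch of the parameter set that matches the "pull" direction. The host name, delivery domain and mailbox are placeholders, and it assumes the source side is reachable through the MRS Proxy endpoint.

```powershell
# Credentials of an account with rights in the SOURCE forest.
$cred = Get-Credential

# Pull move, run on the TARGET server. The database is local to this forest,
# so it is named with -TargetDatabase.
New-MoveRequest -Identity "Migrated_User" -Remote `
    -RemoteHostName "mail.source.domain" `
    -RemoteCredential $cred `
    -TargetDeliveryDomain "target.domain" `
    -TargetDatabase "Target_Database"

# -RemoteTargetDatabase belongs to the outbound ("push") parameter sets,
# which are meant to be run from the source side together with -Outbound.
# Mixing it into a pull request is the kind of combination that makes
# PowerShell report that the parameter set cannot be resolved.
```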

The Fix

Dell was unable to put a finger on the cause. However, they did come up with the following workaround:

Open SQL Server Management Studio. In the QMM Exchange database, locate and open the MAILBOX table.

Find the failing mailbox in the MAILBOX table and take note of the value in the ID field.

From the MAILBOX_PROCESSING_PROPERTY table, delete the entry matching the ID identified above.

If there is a mailbox move in progress (check in the Exchange Management Shell), then cancel it. Then, in the QMM console, retry the mailbox move. This time the correct command is assembled and successfully submitted.
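Checking for and cancelling an existing move from the Exchange Management Shell on the target server can look like the sketch below. "Migrated_User" is a placeholder identity.

```powershell
# Is there a move request for this mailbox, and how far along is it?
Get-MoveRequest -Identity "Migrated_User" |
    Get-MoveRequestStatistics |
    Select-Object DisplayName, Status, PercentComplete

# Cancel it so that QMM can submit a clean request on retry.
Remove-MoveRequest -Identity "Migrated_User" -Confirm:$false
```

Removing the request discards the progress made so far, so the retried move starts from scratch.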

Other Errors

Some other errors I've come across with Native Move jobs:

Error: Mailbox 'Migrated_User' is already being moved to 'Target_Database'.

Reason: QMM spawns multiple New-MoveRequest commands in rapid succession, a couple of seconds apart, before Exchange gets a chance to process the first one. QMM appears to be a tad too impatient: it does not allow Exchange enough time to process the first command, concludes that its request went nowhere, and issues a second instance of the same command. By then Exchange has already processed the first command, so the second attempt results in the above error.
Workaround: Highlight the mailbox with this error and click Retry Move. This forces QMM to query Exchange and update the status of the job.
Note: Dell neither confirmed nor denied that this is a bug. The case was closed without a resolution.

Error: Failed to retrieve RootDSE at 'Source.Domain' under 'DOMAIN\account'
Reason: The domain controller configured in QMM was down for maintenance or otherwise unavailable (e.g. network down).
Workaround: Highlight the mailbox with this error and click Retry Move. This forces QMM to retry the connection to the domain controller.
Note: Dell neither confirmed nor denied that this is a bug. The case was closed without a resolution.

These errors occurred for about 25-30% of the entire migrated user base. Cutover schedules (or mailbox switches) were dead in the water, given that a human had to push a button to retry the jobs.

Since one cannot rely on the schedule to switch over mailboxes, you'll have to pay someone to watch it overnight and push the button as needed, or inconvenience users during the day.

Final Thoughts

QMM for Exchange is a great tool and it can streamline much of the work. Its Native Move engine, however, is riddled with bugs and has virtually no smarts to recover from Exchange server, AD server or network failures. Its logic for keeping track of where the Exchange servers are up to also seems flawed. Its scheduling options are great, provided that the domain controllers and Exchange servers stay available during the migration.

A word of caution: QMM doesn't work well with DAGs. Failing over databases to other DAG members doesn't go down well with QMM. Regardless of what others say, this was my experience. It may change in the future as Dell improves the product.

At the time of this writing, however, if you will only use it in Native Move mode, then carefully consider the cost of the software versus that of a consultant who would build the scripts which could be used in conjunction with QMM-AD.

You would have to use Native Move if:

You have lots of remote users with poor links and large mailboxes, and thus you want to preserve the OST file.
The target server is Exchange 2013.
You migrate to a resource forest (users will keep logging in to their legacy domain with their legacy credentials).

If all three conditions are true then the only option that preserves the OST file is Native Move. In this instance you might find that using QMM-AD with custom Exchange PowerShell scripts will be more cost effective.

Hope this will save some headache for someone out there.

Wednesday 8 April 2015

Tame Your Exchange 2013 Logzilla

First Things First

The term Logzilla, as used in this post, has nothing to do with the syslog analysis solution named LogZilla.

Big things are sometimes called Whateverzilla. Oversized Exchange 2013 Managed Availability logs beg for the name. Let me know if it is upsetting to anyone.

Now that IP has been taken care of, skip to The Solution section below if you know all about Logzilla and just want to tame it, or read on for some background information.

Short Primer

Managed Availability is a great feature of Exchange 2013 in large deployments where automation is important. However it can be a hindrance for the small business with limited resources and little experience.

In a nutshell, Managed Availability constantly monitors every aspect of an Exchange 2013 environment, and where appropriate, it automatically takes proactive action to ensure that service levels aren’t affected. Fix first, analyse later. When we consider that Microsoft has hundreds of thousands of servers in Office 365 servicing millions of users, you can immediately see the benefits.

Also, Microsoft’s stance is that it shortens service call resolution time if the events that caused an outage are caught as soon as they happen. We no longer have to turn on logging, reproduce the issue, then turn logging off, because Managed Availability logs will have captured the original event already. It makes perfect sense in the case of the odd issue that’s difficult to reproduce.

To meet these goals, Managed Availability continuously gathers system performance data. As we can imagine, detailed logs need quite some space to be stored. This is why Microsoft recommends “at least” (a.k.a. minimum) 30 GB of free space on the disk where Exchange is installed. As we know from past experience, when Microsoft says “minimum”, in practice it means that the software will install but will be unable to do anything useful.

The Issue

Exchange can be installed in a variety of storage configurations. It is not uncommon to see a 50-60 GB C: drive where Exchange sits alongside the OS. Traditionally, if logging needed to be turned on, the log location could easily be redirected to another drive with sufficient free space so that C: didn't fill up.

With Exchange 2013 you will quickly find that C: will fill up steadily and you can do nothing about it. At least not easily. And you haven't even turned on any logging!

The cause is Managed Availability: logs just keep (re)growing like cancer. By default. Unstoppable (well, MA can be disabled but we don't want to).

At this point you have probably crawled the Internet for solutions. You’ve probably come across articles which tell you to reconfigure the log location in Perfmon or in the Registry. You even tried it, only to find that the logs are back on C: in a few days. In your desperation you want to disable Managed Availability altogether, but that's not a good thing either.

The Future

I recently attended the O365 Summit in Sydney where I had the chance to ask, straight from the source, how to relocate Logzilla.

The bad:

There is no supported way to move Managed Availability logs.
There are no plans to enable administrators to configure log paths in any future Exchange 2013 patch.

The good:

The Exchange team has heard the community and is considering granting administrators the option to move the logs. However it will only be available in Exchange v.Next.

The Solution

If you are installing a new Exchange 2013 server then install it on a drive other than C:. A 100 GB drive should be sufficient, and then you shouldn’t have to worry about moving logs, queues etc. Just leave them in their default location. Optionally you may still want to move or logrotate IIS logs, as by default they are stored on C: regardless of where Exchange is installed.
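A poor man's logrotate for IIS logs can be as small as the sketch below. The path is the IIS default location and the 30-day retention is an arbitrary number I picked, so adjust both to your environment. -WhatIf is left on so the first run only lists what would be deleted.

```powershell
# Default IIS log location; change it if your logs live elsewhere.
$logRoot = "C:\inetpub\logs\LogFiles"

# Retention period is an assumption; pick what suits your environment.
$cutoff = (Get-Date).AddDays(-30)

# Find IIS log files not written to since the cutoff and remove them.
# Remove -WhatIf once the list looks right.
Get-ChildItem -Path $logRoot -Recurse -File -Filter *.log |
    Where-Object { $_.LastWriteTime -lt $cutoff } |
    Remove-Item -WhatIf
```

Run from a scheduled task, this keeps the IIS log folder at a roughly constant size.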

However if you already have an existing deployment which struggles with free space, and you cannot move your e-mail to a new Exchange server, then here is what you can do to address the immediate issue:

Download Sysinternals’ Junction tool from here.
Identify the folders which hold the largest logs, and those which grow the fastest on your C: drive. You can use a tool like WinDirStat.
Add a new volume with sufficient space to the Exchange server (100 GB will do).
Create a folder structure where you will store the logs. I prefer something that resembles the original structure.
Restart the server in Safe Mode. This is needed because some services run in the System context (PID 4) and keep the files locked, which would stop you from implementing the changes.
Delete the original log folder, for instance “ETLTraces” in “C:\Program Files\Microsoft\Exchange Server\V15\Bin\Search\Ceres\Diagnostics\”. If you want to preserve the log files then move them first to the new location.
Create a junction point in the original location, pointing it to the new target. For example:

JUNCTION.EXE "C:\Program Files\Microsoft\Exchange Server\V15\Bin\Search\Ceres\Diagnostics\ETLTraces" "L:\Logs\Microsoft\Exchange Server\V15\Bin\Search\Ceres\Diagnostics\ETLTraces"

NOTE: Just to make it very clear, to successfully replace the folder with a junction, the original folder must be deleted first, and the new target folder must already exist.

Restart the server.
Optional but recommended: migrate to a server with a properly designed storage as soon as practical and decommission the trouble server.
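If you would rather not install WinDirStat on the server for the "identify the largest folders" step, a rough PowerShell equivalent is sketched below. The root path assumes a default Exchange 2013 install location. Sizes are cumulative (each folder includes its subfolders), and the script is slow on large trees because it walks every subfolder separately.

```powershell
# Default Exchange 2013 install path; adjust if yours differs.
$root = "C:\Program Files\Microsoft\Exchange Server\V15"

Get-ChildItem -LiteralPath $root -Recurse -Directory -ErrorAction SilentlyContinue |
    ForEach-Object {
        # Total size of all files under this folder, subfolders included.
        $size = (Get-ChildItem -LiteralPath $_.FullName -Recurse -File -ErrorAction SilentlyContinue |
                 Measure-Object -Property Length -Sum).Sum
        [pscustomobject]@{
            Folder = $_.FullName
            SizeGB = [math]::Round($size / 1GB, 2)
        }
    } |
    Sort-Object SizeGB -Descending |
    Select-Object -First 15
```

Run it a few days apart and compare the output to spot the folders that grow the fastest.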

WARNING:

While I found this solution to work and I could identify no ill effects, it is NOT supported by Microsoft. Implement it at your own risk and test it in a lab first. I accept no responsibility for any damage, downtime or loss of any kind it may cause.
Read point 1 again.

There you have it. Hope it saves you some headache.