4. RSS from blogs.it.ox.ac.uk

  • <xptr url="http://blogs.it.ox.ac.uk/networks/feed/"
    		 type="transclude" rend="rss"/>
    FroDo IOS upgrade

    I’d like to announce a staged upgrade of IOS on all FroDos. This blog post aims to answer some of the questions this work will raise. Feel free to contact the Networks team with any questions at networks@it.ox.ac.uk.

    Why?

    We currently run 19 different versions of IOS across the FroDos. Some of the switches haven’t been upgraded since their original installation (the longest-running FroDo had an uptime of over seven years). While it may seem advantageous to stick with a version that works fine on a given switch, we decided to roll out updates to all FroDo switches in production. There are three main reasons for the mass upgrade:
    - bug fixes
    - unification of versions and consistency
    - new features

    Our intention is to run a single IOS version per platform (3750[G], 3750-X, 3560[CG], 3850, 4900M, 4948E). I’m sure the question will spring to mind – why commit to this work when TONE is under way? Despite work progressing on the new backbone, it is still quite a long time away, and regardless of the fine details of its delivery we will retain the concept of a Point-of-Presence in the future design, thus keeping existing switches in production for a considerable length of time. It therefore makes sense to consolidate the IOS versions at this point.

    Timescale

    We plan to upgrade on a per C-router basis. The schedule we devised is to upgrade and reload roughly 10 FroDos every Tuesday, Wednesday and Thursday until all switches are up to date. The following table details the process:

    Date Device VLANs affected Notes
    8 April Frodo-110 (acland)
    Frodo-113 (edstud)
    Frodo-116 (38-40-woodstock-rd)
    Frodo-120 (maison-francaise)
    Frodo-149 (physics-dwb)
    Frodo-150 (eng-ieb)
    Frodo-151 (maths)
    Frodo-152 (wolfson-building)
    Frodo-154 (lady-margaret-hall)
    Frodo-155 (mdx-eng)
    102, 104, 113, 118, 120, 125, 150, 151, 182, 183, 187, 189, 190, 191, 199, 397, 598, 691, 720, 994 Affects ResNet
    9 April Frodo-156 (materials-hume-rothery)
    Frodo-157 (e-science)
    Frodo-161 (eng-thom)
    Frodo-162 (eng-jenkin)
    Frodo-163 (eng-holder)
    Frodo-164 (eng-etb)
    Frodo-165 (14-15-parks-rd)
    Frodo-167 (radcliffe-infirmary)
    Frodo-168 (new-maths)
    Frodo-169 (wolfson)
    101, 102, 105, 106, 109, 111, 115, 121, 127, 151, 156, 163, 167, 186, 189, 193, 195, 196, 199, 288, 397, 398, 517, 694, 787, 788, 792, 904, 954, 967, 985 Affects Engineering WLC
    10 April Frodo-202 (careers)
    Frodo-204 (voltaire)
    Frodo-208 (12-bevington)
    Frodo-212 (belsyre-court)
    Frodo-217 (nissan-institute)
    Frodo-219 (wolsey-hall)
    Frodo-249 (begbroke)
    Frodo-250 (kellogg)
    Frodo-251 (ewert-house)
    Frodo-282 (williams)
    Frodo-293 (summertown-house)
    Frodo-296 (st-annes-robert-saunders)
    Frodo-297 (merrifield)
    202, 204, 208, 220, 222, 249, 252, 282, 283, 285, 286, 289, 290, 292, 296, 297, 298, 299, 397, 675, 678, 717, 720, 722, 794, 977, 989
    15 April Frodo-253 (mdx-sthughs)
    Frodo-255 (begbroke-iat)
    Frodo-257 (st-hughs)
    Frodo-258 (st-antonys)
    Frodo-260 (univstavertonrd)
    Frodo-262 (st-annes-frodo)
    Frodo-263 (green-college)
    Frodo-264 (wuhmo)
    Frodo-203 (13-bradmore-road)
    Frodo-281 (vc101br)
    Frodo-283 (areastud)
    Frodo-292 (trinity-staverton-rd)
    Frodo-569 (saville-house)
    Frodo-662 (new-college)
    121, 187, 188, 196, 203, 205, 206, 209, 214, 257, 279, 280, 281, 284, 284, 293, 295, 295, 296, 297, 329, 608, 673, 677, 679, 680, 681, 681, 682, 720, 796, 856, 989
    16 April Frodo-306 (safety)
    Frodo-308 (rh)
    Frodo-309 (linc-mus-rd)
    Frodo-310 (security-services)
    Frodo-313 (rai)
    Frodo-316 (physics-aopp)
    Frodo-324 (dlo)
    Frodo-351 (rex-richards)
    Frodo-352 (rodney-porter)
    Frodo-353 (dyson-perrins)
    Frodo-354 (stats)
    Frodo-355 (ocgf)
    112, 202, 305, 306, 308, 309, 310, 314, 319, 320, 351, 355, 372, 377, 388, 391, 397, 398, 399, 526, 595, 717
    17 April Frodo-356 (mdx-mus)
    Frodo-358 (chem-physical)
    Frodo-359 (beach)
    Frodo-360 (rsl)
    Frodo-361 (mansfield)
    Frodo-362 (bioch)
    Frodo-363 (physiology)
    Frodo-366 (inorganic-chemistry)
    Frodo-367 (keble)
    Frodo-368 (earth-sciences)
    Frodo-369 (9-parks-rd)
    Frodo-370 (museum)
    Frodo-625 (exam-schools)
    191, 301, 314, 315, 320, 323, 328, 329, 351, 361, 367, 368, 369, 370, 373, 375, 378, 379, 389, 391, 393, 394, 395, 396, 397, 398, 595, 625, 902, 906, 968, 970, 972, 997 Affects Museum Lodge WLC
    22 April Frodo-513 (stx-bnc-annexe)
    Frodo-515 (merton-annexe)
    Frodo-517 (english)
    Frodo-518 (law-library)
    Frodo-523 (zoo)
    Frodo-524 (mrc)
    Frodo-527 (mstc)
    Frodo-531 (club)
    Frodo-549 (balliol-holywell)
    Frodo-550 (mdx-zoo)
    Frodo-552 (social-sciences)
    Frodo-553 (stcatz)
    397, 510, 514, 515, 516, 517, 518, 523, 524, 527, 531, 552, 589, 594, 596, 597, 598, 687, 797, 977, 997
    23 April Frodo-554 (qeh)
    Frodo-555 (plants)
    Frodo-559 (chemistry-research-laboratory)
    Frodo-561 (path)
    Frodo-562 (tinsley)
    Frodo-563 (islamic-studies)
    Frodo-564 (mdx-ompi)
    Frodo-566 (pharm)
    Frodo-568 (psy)
    74, 182, 183, 214, 288, 301, 351, 360, 378, 388, 389, 391, 397, 398, 501, 507, 522, 553, 559, 561, 562, 580, 588, 590, 591, 592, 593, 595, 596, 597, 599, 678, 683, 694, 719, 727, 810, 860, 893, 893, 902, 948, 955, 956, 968, 976, 977
    24 April Frodo-602 (bod-old)
    Frodo-604 (music)
    Frodo-606 (sheldonian)
    Frodo-607 (bod-camera)
    Frodo-609 (ruskin-sch)
    Frodo-615 (bod-clarendon)
    Frodo-619 (all-souls)
    Frodo-627 (mhs)
    Frodo-628 (jesus)
    360, 397, 602, 604, 607, 609, 611, 615, 617, 619, 672, 682, 683, 683, 686, 697, 782, 997
    29 April Frodo-629 (exeter)
    Frodo-630 (queens)
    Frodo-631 (st-edmund-hall)
    Frodo-632 (10-merton-street)
    Frodo-634 (pembroke-college)
    Frodo-635 (chch)
    Frodo-639 (albion)
    Frodo-640 (hmc)
    Frodo-641 (old-indian-institute)
    Frodo-645 (campion)
    553, 610, 612, 620, 621, 631, 634, 640, 645, 662, 680, 684, 686, 688, 695, 919, 962
    30 April Frodo-649 (oii)
    Frodo-650 (trinity)
    Frodo-651 (sers)
    Frodo-652 (magd)
    Frodo-653 (littlegate)
    Frodo-654 (oriel)
    Frodo-655 (balliol)
    Frodo-656 (blue-boar-st)
    Frodo-657 (mdx-ind)
    Frodo-660 (mdx-chch)
    Frodo-689 (botanic-garden)
    Frodo-692 (stanford-house)
    Frodo-698 (chaplaincy)
    Frodo-699 (shop)
    15, 197, 378, 389, 397, 398, 601, 603, 614, 626, 627, 638, 639, 650, 654, 656, 676, 677, 678, 689, 690, 692, 694, 696, 698, 699, 722, 749, 787, 902, 905, 967, 981, 989, 997 Affects Indian Institute WLC
    1 May Frodo-661 (mdx-daubeny)
    Frodo-663 (axis-point)
    Frodo-664 (corpus-christi)
    Frodo-665 (pembroke)
    Frodo-666 (merton)
    Frodo-667 (univcoll)
    Frodo-669 (hertford)
    Frodo-671 (wadham)
    Frodo-76 (harkness)
    Frodo-77 (gibson)
    199, 214, 285, 297, 397, 398, 515, 605, 613, 634, 662, 663, 664, 669, 671, 673, 691, 792, 794
    6 May Frodo-702 (taylorian)
    Frodo-703 (old-boys-high-school)
    Frodo-707 (9-stjohnsst)
    Frodo-708 (bnc-frewin)
    Frodo-711 (arch)
    Frodo-713 (classics)
    Frodo-716 (clarendon-press)
    Frodo-717 (survey)
    Frodo-721 (barnett-house)
    Frodo-725 (some)
    397, 687, 702, 703, 707, 711, 713, 717, 721, 725, 749, 781, 787, 788, 796, 799, 954, 959, 977, 985, 997
    7 May Frodo-726 (25-wellington-square)
    Frodo-728 (sbs)
    Frodo-729 (sackler)
    Frodo-730 (lincoln-clarendon-st)
    Frodo-732 (oxford-union)
    Frodo-734 (castle-mill)
    Frodo-749 (orient)
    Frodo-750 (worcester-st)
    Frodo-751 (dartington)
    Frodo-754 (mdx-ash)
    284, 309, 397, 398, 675, 716, 720, 728, 729, 732, 749, 761, 783, 789, 790, 797, 906, 959, 975, 977, 997 Affects Ashmolean WLC and ResNet
    8 May Frodo-755 (mdx-socstud)
    Frodo-756 (ashmolean)
    Frodo-757 (stx)
    Frodo-759 (regents-park)
    Frodo-761 (rewley-house)
    Frodo-762 (sjc)
    Frodo-764 (st-peters-frodo)
    Frodo-765 (castle-mill-2)
    Frodo-766 (worcester)
    Frodo-767 (nuffield)
    Frodo-792 (worcester-street)
    Frodo-794 (hayes-house)
    320, 330, 370, 374, 375, 397, 398, 611, 675, 680, 691, 697, 701, 705, 709, 710, 715, 718, 720, 722, 733, 734, 756, 757, 781, 782, 784, 786, 793, 794, 795, 797, 977, 989
    13 May Frodo-809 (ocdem)
    Frodo-821 (fmrib)
    Frodo-851 (sports-distributor)
    Frodo-855 (well)
    Frodo-862 (mdx-ihs)
    Frodo-863 (iffley-rd)
    Frodo-864 (st-hildas)
    Frodo-865 (ndm)
    Frodo-867 (kennedy)
    Frodo-869 (ccmp)
    Frodo-890 (ssho)
    Frodo-899 (imm)
    Frodo-881 (alan-bullock)
    15, 214, 395, 397, 398, 398, 515, 682, 684, 691, 695, 698, 720, 805, 806, 807, 808, 809, 812, 851, 852, 854, 855, 856, 864, 880, 881, 882, 883, 887, 890, 892, 893, 894, 902, 962, 968, 975 Affects IHS WLC

    To find out the number of your backbone VLAN and annexe connections, use Looking Glass.

    If your FroDo isn’t listed above, it has most likely been upgraded already. The following switches already run a current IOS as a result of other maintenance work:
    Frodo-101 (physics-theory); Frodo-102 (materials-21-banbury); Frodo-104 (materials-12-13-parks-rd); Frodo-159 (mdx-edstud); Frodo-207 (43-banbury-rd); Frodo-213 (anthropology-58a-br); Frodo-215 (anthropology-64-br); Frodo-218 (anthropology-51-br); Frodo-220 (anthropology-61-br); Frodo-301 (physics-clarendon); Frodo-323 (robert-hooke); Frodo-349 (prm); Frodo-357 (mdx-plants); Frodo-551 (life-sciences); Frodo-557 (medawar); Frodo-560 (pathology); Frodo-567 (linacre); Frodo-623 (linc); Frodo-633 (sbs-phase-2); Frodo-648 (mdx-ind2); Frodo-658 (mdx-all-souls); Frodo-659 (mdx-merton); Frodo-670 (brasenose); Frodo-712 (eng-osney); Frodo-752 (beaver-house); Frodo-801 (botnar); Frodo-802 (psych); Frodo-849 (jr2); Frodo-853 (rob); Frodo-856 (richard-doll); Frodo-857 (psych-meg); Frodo-858 (rosemary-rue); Frodo-859 (orcrb); Frodo-905 (16-wellington-square); Frodo-908 (phonetics); Frodo-909 (theology-34a-st-giles); Frodo-910 (counselling); Frodo-914 (new-barnet-house); Frodo-916 (37a-st-giles); Frodo-962 (egrove); Frodo-963 (offices); Frodo-964 (ertegun); Frodo-969 (mdx-oucs); Frodo-972 (oucs)

    Impact

    Depending on the hardware platform, the expected downtime is about 8 to 30 minutes. The Catalyst 3750 – the dominant platform – takes only a few minutes to reload into the new IOS, but other platforms may also need a microcode upgrade, which takes up to half an hour. We intend to upgrade and reload the switches in the early morning (7:30–9am) to minimise the impact on backbone connections. In the event of a hardware failure, a replacement FroDo will be installed. When reading the above table and assessing disruption to your connectivity, keep annexe connections in mind.

    I just received a spam email from my own address

    Our team was asked to answer some queries about how it’s possible to receive mail forged as being from your own email address. This article slightly overlaps with a previous article from 2011 that covered similar ground. Please note that the target audience for this article is end users, not technical support staff, so some of the technical descriptions (and especially the diagrams) are simplified in order to explain the overall theory or process.

    Someone is sending mail as being from my address, how is that possible?

    It’s best to think of emails as postcards. Anyone can write a false sender on a postcard – anyone could send you a postcard ‘from’ you, and the postman would still deliver it.
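    To make the postcard analogy concrete, here is a hypothetical SMTP delivery session (all host names and addresses are invented). Nothing in the protocol stops the sending machine from claiming any sender address it likes:

```
220 mail.example.org ESMTP
HELO sending-host.example.net
250 mail.example.org
MAIL FROM:<someone@ox.ac.uk>
250 2.1.0 Ok
RCPT TO:<victim@example.org>
250 2.1.5 Ok
DATA
354 End data with <CR><LF>.<CR><LF>
From: someone@ox.ac.uk
To: victim@example.org
Subject: this sender address is forged
...
.
250 2.0.0 Ok: queued
```

    The receiving server at example.org simply accepts what it is told, just as the postman delivers whatever is written on the postcard.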

    How can I stop someone outside the university receiving an email pretending to be from me?

    One of the most reliable ways to establish that a mail is from you is to install, set up and use PGP/GnuPG mail signing in your mail client, and have the receiver of your mail always check that the signature is valid. This can be complicated at first and it’s best to involve your local IT support.

    This does not perfectly address the question, however. People on the internet will still be able to send email using your sender address, and a recipient outside the university may or may not check the signature. To explain why the university is unable to affect this, here’s a diagram showing a mail being delivered from an Internet Service Provider (ISP, like BT or Virgin Media) to a destination site with the sender address forged:

    I’ve simplified the communications involved, but you’ll notice that there’s no involvement of the university systems in the above diagram. The university will have no logs or any other interaction in the above example. This is one reason why we ask that all legitimate mail for the ox.ac.uk domains is sent through the university systems. Consider this scenario:

    When someone sends mail via a 3rd party mail submission server, we don’t have any involvement. Imagine you gave a physical letter to a coworker to hand-deliver; it didn’t arrive, and you then complained to the postman – it’s a similar scenario.

    I’ve heard that SPF is the answer to this.

    In an ideal world (or for a small company), SPF would be of immediate use but the University of Oxford mail environment does not currently match what SPF wants to describe. We can use it for increasing the spam score of inbound mail but we can’t reject on it nor currently publish a restrictive SPF record designating exactly which mail servers can send mail for ox.ac.uk domains. I’ll explain further.

    With SPF we essentially state in a public DNS record: “the following servers can send mail for the ox.ac.uk domain”. The idea is that the receiving server checks whether the mail server that sent it the mail matches the list of authorised sending mail servers. The following diagram shows the basic process in action:

    So in this example the ISP SMTP server contacts a 3rd party site and attempts to deliver a message that’s from an address at ox.ac.uk. The receiving site looks up our SPF records, sees that the SMTP server trying to deliver to it is not listed as a valid server for our domain, and so rejects the mail. Sounds perfect? Sadly, there are a number of problems with this:

    • Firstly, even if there were no other problems, there is no way we can force a 3rd party receiving site to check SPF records on the mail it receives from other 3rd party servers.
    • Secondly, we hit a problem with the list of ‘authorised servers’: even if the 20 or so separate units with SMTP exemptions to the internet are included in the list, we then have to include any NHS mail servers, any gmail.com mail servers and a selection of other sources from which users currently legitimately send as their university addresses but via a 3rd party. Each time we open up one of these online services, the SPF rules become less useful, since anyone on gmail or NHS servers could then send as any ox.ac.uk address and pass the SPF test.
    • Thirdly, we need the receiving sites not to break (refuse messages) when messages are forwarded and we have strict SPF records in place.
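    For illustration, an SPF policy is published as a DNS TXT record on the domain. A hypothetical restrictive record (the netblock and include domain here are invented examples, not our real configuration) might look like:

```
ox.ac.uk.  IN  TXT  "v=spf1 mx ip4:192.0.2.0/24 include:_spf.example.com -all"
```

    The ‘-all’ at the end asks receiving sites to reject mail from any server not listed; a softer ‘~all’ marks such mail as suspicious instead. It is exactly that trailing directive we can’t currently make strict, for the reasons above.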

    A solution to the latter problem would be a university-wide decree that mail sent from ox.ac.uk must go via the university mail servers. That’s not likely to be a popular idea, but I list it for completeness; I’ll discuss this further in the conclusion.

    You could still check SPF inbound to the university in general though?

    Yes, we’ve done some work in this area. It’s not a boolean solution to anything, however, as some spammers have perfect SPF records and some legitimate sites have broken SPF records. We could increment the spam score based on the result, but a knee-jerk decree of ‘block all mail that fails SPF’ would be quite interesting in terms of support calls, and perhaps short-lived as a result.
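    As a sketch of what ‘increment the spam score’ can look like in practice, a SpamAssassin-style local configuration (the values are invented for illustration, not our production settings) might raise the score on SPF failure rather than rejecting outright:

```
# Nudge the spam score up on SPF failure instead of blocking outright
# (values are illustrative only).
score SPF_FAIL     2.0
score SPF_SOFTFAIL 1.0
```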

    Just order the remote sites to fix their configuration!

    We do talk to remote sites about delivery issues. The problem comes when the remote site says ‘no’ either because they don’t understand the issue or because they don’t agree. There comes a point at which no matter what technical argument we make, the remote site will refuse to accept an issue exists. We have no authority to force them into any course of action.

    As an example of this, most mail-sending ‘rules’, as defined by documents called RFCs, have been in place for decades (the first one came out in 1982). There are still, however, lots of mail administrators who do not adhere to the basics and will aggressively argue against any such prodding. This includes small hosting companies, massive telecommunications providers and even some mail administrators within the university. Example problems include getting senders to use a valid HELO/EHLO (this one simple test rejects about 95% of inbound connections – spam – with a false positive rate of about one or two incidents a year). There are also other issues, like persuading the remote sender to send mail from a DNS domain that actually exists, and to have valid DNS records for the sending server.

    Since we can’t get the internet to agree on rules that have been established for mail servers for decades, it’s not likely that we’ll be able to force a 3rd party site to perform SPF checking.

    Well what about DKIM?

    We like DKIM as a technology, but in our environment we would hit similar issues to those described for SPF. Before any technical contacts fill up the comments section, I’d like to make it clear that DKIM and SPF are not identical in what they do, but for the purposes of the problem being addressed in this article, and for describing this aspect of their operation to end users, they can be considered roughly similar. Here’s a very simplified diagram of DKIM in operation:

    In an ultra-simplified form, the difference is that DKIM adds a digital signature to each outbound message (more accurately, a line in the header which cryptographically signs the message’s delivery information), which the receiving server checks (using cryptographic information we publish in the DNS), rather than checking a list of valid source IPs. This would work well in a politically simpler environment with all sites on the internet joining in. It wouldn’t end spam (an attacker could still compromise a user’s account and so send mail that would then be legitimately received), but it would make spamming more constrained (such as to new short-lived domains purchased with stolen credit cards and similar, which is a different issue), and by doing so other anti-spam techniques can be used more effectively.

    • Again, the problem is that for a 3rd party site delivering to a 3rd party site, we cannot force the receiving site to have implemented DKIM.
    • If we state that all legitimate mail from ox.ac.uk is DKIM-signed, then mail sent from gmail or NHS mail servers as ox.ac.uk addresses will be considered invalid by sites that do check the DKIM information for inbound mail.
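    For the curious, a DKIM signature travels as a message header roughly like the one below (the selector, hashes and signature values are abbreviated and invented). The receiving server uses the d= and s= tags to look up the public key in the DNS (at selector1._domainkey.ox.ac.uk in this example) and verifies the signature in b= against the signed headers and the body hash in bh=:

```
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ox.ac.uk;
        s=selector1; h=from:to:subject:date;
        bh=9MPvYJWjcvcXv...=; b=KlFz0mwLqcB1Xr...=
```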

    In our team we’ve done some trials on scoring inbound mail based on DKIM, and sadly there are a number of misconfigured sites out there sending what appears to be legitimate mail but that, according to the DKIM information for the domain, is invalid. As with SPF, we could increment the spam score slightly for invalid DKIM results to improve the efficiency of inbound mail scoring.

    DKIM signing for outbound mail is a little trickier, as we’d have to share the private signing key with the 20 or so other units that are SMTP-exempted and get them all to implement DKIM. From my experience talking to internal postmasters while reducing the number of exempted mail servers from 120 down to about 20, I would say that getting those sites to implement DKIM is near impossible.

    Another solution would be to force all outbound mail connections from the remaining SMTP-exempted mail servers to go via the oxmail relay cluster and sign at that one point. There are two problems with this. Firstly [please note that this is my personal subjective opinion], it isn’t a service with a dedicated administrative post, so any political emergency in any other service leaves the mail relay undeveloped and under-administered. By itself this isn’t normally a massive problem – the service is kept alive, the hardware renewed, the operating systems updated, and there is some degree of damage limitation in a crisis. What is needed if the relay becomes the single point of failure for the entire organisation is permanent, active, daily development – for example to proactively stop the mail relay from ever being blacklisted. Otherwise a disaster occurs, the units that were forced to use the mail relay demand political allowance to connect to the internet directly (because they want to get on with their work, which is a legitimate need), and then DKIM has to be ripped out in order for those exemptions to work.

    This leads on to the second problem: forcing anyone to do anything needs a lot of political support, will be highly unpopular (some mail administrators have been independent for decades and have a setup similar to oxmail – a cluster, ClamAV and SpamAssassin), and people resent political upsets for a long time (as an example, a staff dispute from 25 years earlier caused problems for an IT support call I worked on when I was previously employed in one of the sub-units of the university).

    Isn’t it simple? Just stop delivery attempts coming in to the university from outside that state the mail is ‘from’ an ox.ac.uk address?

    This would currently block a lot of legitimate mail (users sending via gmail, NHS users and so on). I anticipate that within a short time of being ordered to implement such a rule, we would be ordered to withdraw it due to the negative impact on legitimate mail.

    So, in summary, what are you telling me?

    We can never totally stop a 3rd party site from accepting mail from another 3rd party site where the sender is pretending to be an ox.ac.uk sender address. There will always be receiving sites that will not implement the technologies that could assist in that scenario, and that cannot be influenced or argued with.

    If you want to send a mail to a 3rd party and have them know beyond (almost) all reasonable doubt that the mail is from you, then you need PGP or GnuPG to digitally sign each mail you send. Provided you become familiar with the process and don’t get tricked into sending your private signing key to other people, an attacker would have to compromise your workstation to obtain your private signing key and sign mails as you – a large step up in complexity from simply sending spam.

    We could make improvements to the inbound spam scoring to reduce spam coming into the university in general, but this takes time: we have to find a balance between the amount of spam correctly identified and the amount of legitimate mail from misconfigured sites left unaffected. A factor in this is that there are currently only two systems administrators for all of the networks services, so human resources are an issue (this is not the only service with political demands for changes).

    If there were a university-wide policy that all mail from ox.ac.uk addresses must be sent from inside the university, then we could implement SPF and (perhaps in time) DKIM, which could help reduce the problem of forged mail to external 3rd parties pretending to be from ox.ac.uk senders. In my opinion, however, the university should fund a full-time post dedicated to the mail relay if it wishes to do this, since it’s not a simple task in terms of planning and political/administrative overhead.

    And lastly, we know that spam is frustrating – spam costs the university in human time but also in dedicated hardware; there is a real financial cost to the university for spam. Why don’t we just stop it? There are lots of anti-spam techniques we actively use that I haven’t covered in this article, and we do consider and test various improvements, but despite decades of the problem there is still no perfect anti-spam system in existence worldwide. The university will therefore not have a perfect anti-spam system until such time as one is devised. You may receive less spam when using another organisation’s server; that doesn’t mean you were sent less spam.

    I hope this article has been of some use. Please also check out the article from 2011 that was previously mentioned.

    Migrations

    In December and January we completed some service migrations, audited some services, and welcomed some new staff members to our team, which makes this a good time to clarify what it means to have a migration completed. Although we migrate roughly 15–20 servers per year, it isn’t so much the number of servers that is significant as the number of services on each server. More servers sometimes make things a lot easier – in my experience an old host with multiple services on it can be much harder to untangle and migrate than four servers hosting one clearly defined service each. Especially with virtualisation (and our existing configuration management system), our team appears to be moving towards a model of one service per host for reduced complexity. As older systems are replaced it gets easier with time, as our documentation and internal policies/processes mature.

    Our team has a handful of public/end-user-facing services, but these are just the tip of the iceberg – we provide a lot of inter-team and unit-level IT support services, plus the fully team-internal services that in turn support those. As a result, a typical migration task is to move a background or inter-team service that has run for five or six years onto new hardware and software, with fairly little political involvement. Because these are background supporting services you will see little end-user consultation in the checklists below, and financial funding and similar are left out as things that would be settled before getting to this stage.

    So this post is aimed at IT support staff performing a similar migration, to give some extra ideas as to the questions and checklists to run through. If you think you spot something that’s been missed off, please do mention it in the comments.

    Pre-Migration

    Audit the existing team documentation for the service

    For a complex service, auditing the existing internal team documentation – going through it, fact-checking it and updating it – helps ensure nothing is missed when planning the migration.

    The existing documentation should cover, or be modified to cover:

    • Requests for change (discussion and links to related support tickets)
    • Known defects / common issues experienced and their solutions
    • Troubleshooting steps for support queries
    • Notes about data feeds, web interfaces and other interactions with other teams for this service
    • Notes about the physical deployment
    • Notes about the network deployment
    • A clear test table for service verification
    • Links to any documentation we provide to the public/end-users for this service

    If this hasn’t been done, the symptom (aside from inaccurate documentation) is that despite the migration being declared complete, small issues crop up over the next month due to missed or misunderstood sub-parts of the service.

    For service verification tests I like to keep to a simple table with something similar to:

    • What the test is
    • Command to type (and from where)
    • Expected result

    So, for example, if I were writing some tests for the DNS system, I might test name resolution for an external domain name; I’m also interested in ensuring the authoritative name servers for ox.ac.uk don’t give a result, as that would be outside their design behaviour and indicate something was wrong. So one test might look like:

    Test: External site query from internal host
    Command: (from a university host) dig www.bbc.co.uk @$dns_ipv4 +tcp
             (from a university host) dig www.bbc.co.uk @$dns_ipv4 +notcp
    Expected result (resolver): DNS record
    Expected result (auth): negative response

    This example isn’t perfect. The person performing the test has to know to substitute $dns_ipv4 for the DNS server’s IPv4 service interface, and I haven’t fully described what a ‘negative response’ or ‘DNS record’ will look like in their terminal, but it’s a good starting point. It would be one of many tests (test from an external host, test a record from our own domain, test a record that should be invalid…), and as you improve them, the tests you define for service verification typically end up being a good basis for commands to automate for service monitoring, such as via Zabbix or Nagios.
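    As a sketch of how such a test table can grow into automation, the rows can be held as data and run by a small harness. This is an illustrative Python sketch rather than our actual tooling – the command and expected string below are placeholders (a real table would hold the dig commands above and their expected output):

```python
import subprocess

# Each row of the verification table: name, command to run, expected output.
# A real table would contain e.g. dig commands against the DNS service.
TESTS = [
    {"name": "placeholder test",
     "command": ["echo", "service ok"],
     "expect": "service ok"},
]


def run_tests(tests):
    """Run each test command and report pass/fail per test name."""
    results = {}
    for test in tests:
        proc = subprocess.run(test["command"], capture_output=True, text=True)
        results[test["name"]] = test["expect"] in proc.stdout
    return results


for name, passed in run_tests(TESTS).items():
    print(("PASS" if passed else "FAIL"), name)
```

    The same table then doubles as documentation for a human performing the checks by hand.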

    For our own test tables, the tests include checking that when you log into the server, the Message Of The Day tells you what the server is used for, and whether it’s safe to reboot the host for kernel updates or special consideration is needed. They might also include tests to check that data feeds are coming in correctly (and not just the same data file, never updating), or that permissions are correctly reset on web files if altered (guarding against minor mistakes by team members).

    Audit the public documentation

    Our team may have a clear idea of what we believe the service to be, but does the public documentation match that? We may not have written the documentation, or the person who did may have left, and we want to ensure we don’t overlook some subtle implied sub-service or service behaviour that would otherwise not be noticed.

    For example, if the public documentation mentions DNS names or IP addresses, then we should avoid changing these wherever possible, so that IT officers and end users aren’t inconvenienced into having to reconfigure their clients. If the documentation mentions that we keep logs for 90 days, then we should have 90 days of logs – not less (because we won’t be able to troubleshoot issues up to the stated retention length) and not more (because this is users’ confidential data that we shouldn’t keep longer than we promised; in the wrong hands it might represent account compromise, financial loss, embarrassment or similar).

    Are there open change requests for this service?

    If we’re migrating a service, now might be a good time to implement any open change requests that we can accommodate.

    Sometimes we can’t change one aspect without altering other parts of the service, but when re-deploying/migrating the service we have an opportunity to alter the architecture and perhaps still provide the same end user facing service, but with improvements that have been requested.

    If we can’t implement the change in this cycle (for cost or lack-of-human-resource reasons), let’s keep the change request in our pile, but document why, so that we know when asked.

    If we won’t implement the change (for political reasons, or technical sanity), again let’s keep the change request but document the official statement on why it won’t be implemented, so that we can give a quick, consistent response to queries instead of laboriously explaining each time it’s raised.

    Using our knowledge, what can we improve with regards to how the service is delivered?

    Requests for change aside, perhaps we can see ways from our experience and skill set to improve the quality of the service, the usability or the maintainability.

    If end users have to configure software to use our service, can we alter the service to reduce the configuration?
    If we previously had restrictions in place due to service load, can these now be lifted on the newer hardware?

    If historical scripts import the data or are used to rebuild configuration files, do those scripts pass basic modern coding sanity checks?

    Checklist (record the result of each test):
    • The code isn’t doing something that’s fundamentally no longer needed
    • The code is documented (e.g. perldoc POD format)
    • Any configuration or static/hardcoded variables are declared near the start (we might separate them out into a configuration file later)
    • The code passes basic static code analysis (perlcritic -5)
    • The code makes use of common team modules for common tasks (Template Toolkit, Config::Any, Net::MAC etc.)
    • The code meets basic team formatting requirements (run through perltidy)
    • The basic task the code is doing is documented in our team docs as part of the service
    • An automated test script exists to help regression test the code after changes
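    Two of those checks are tool-based; they might be run along these lines (the script name is hypothetical):

    ```
    # Static analysis at the gentlest severity level, as in the checklist:
    perlcritic -5 import_data.pl

    # Check formatting against perltidy's output
    # (-st writes to stdout, leaving the file untouched):
    perltidy -st import_data.pl | diff import_data.pl -
    ```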

    During the migration

    This is usually service specific but generic planning features might be

    • Can we eliminate downtime during the migration? (for instance, migrate one node of a cluster at a time, with no effect on service)
    • If not, can we minimise the downtime by careful planning? (research all the commands in advance, document them as a migration process and test the process)
    • If we must have downtime, can we schedule it in a low-usage period? (out of hours, such as 7am or similar)

    With the last point, remember to check that if the worst (or supposedly impossible) happens, you can physically get into the building where the hardware (switch/router/server) is. The only thing worse than a 7am walk of shame to physically turn on/reconnect a device after cutting off your remote access during planned maintenance work is doing so only to discover that the building doesn’t open until 9am, turning a ten-minute early-morning service outage into a two-hour outage that everyone notices and that runs into business hours.

    Post migration

    Decommissioning the old hosts

    Checklist (mark each as completed):
    • Required metadata (such as mail relay summary data used to make annual stats reports) has been copied from the host
    • The host has no outstanding running processes related to its function (e.g. a mail relay has no mail remaining in its queue)
    • If we search the team documentation system, all references to the old host have been updated
    • The previous hosts have been marked decommissioned in the inventory system
    • The previous hosts have been deracked and all rack cables untangled/removed
    • The previous hosts have had their disks wiped (DBAN) and been marked for disposal
    • In our configuration management system, references to the previous, now decommissioned hosts have been removed
    • The host has been removed from DHCP if present
    • The host has been removed from DNS
    • The host service principal has been removed from Kerberos

    New hosts

    Checklist (mark each as completed):
    • Are all hosts involved documented in the team documentation system?
    • Are all hosts involved documented in the team inventory system?
    • Are all hosts involved now monitored in the team monitoring system?
    • In the rack, are all cables labelled at both ends, and is the server labelled?
    • Is the service address/name itself being monitored by the team’s monitoring system?
    • Is the host reporting errors into our daemons’ queue?

    Service Verification

    No doubt you’ll have plenty of service-specific migration checks to perform, but add these:

    • Ask another team member, not involved in the migration, to read through the documentation. In my experience this works especially well if you can offer a prize, such as a sweet per unique mistake found (think: Roses, Quality Street). I’m not joking here: people have their own tasks and will generally get bored of reading your documentation within a short space of time, no matter how well structured, which means it’s poorly tested. Offering a group of people an incentive costs very little and sparks interest, and you’ll have problems found that you hadn’t thought of. Even if you don’t agree with their criticism, give out the reward for each unique issue raised. In my opinion, if you correct as they check it’ll motivate them more, as it’s obvious you’re taking action based on their feedback.
    • Ask someone more junior in your team, or skilled in a different service area, to run through your service verification tasks (without you standing over them). If it’s not clear to them where to run the check from, or how to run it, then do not criticise their skills but instead make your test documentation clearer. When the key specialist[s] for the service are on holiday and the service appears to break, someone from senior management may well be standing over whoever is left, demanding an explanation. At that point you want the service verification tasks to be as clear and comprehensive as possible, so that there’s little opportunity to misunderstand them, and so that running them successfully leaves no doubt that your team’s service is not at fault (or, if it is at fault, the issue is clearly cornered/defined by the tests and easier to fix).

    Perhaps the most important concluding point in all of the above is to have the self-discipline not to declare to anyone that the migration is complete until all service documentation has been tested, any migration support tickets/defects have been successfully addressed and all traces of the previous service have been tidied away.

    Chris Cooper (pod)

    Chris Cooper (nicknamed ‘pod’, with deliberate lower case) joined our team in the past year on secondment from the Systems Development team, where his main work for the department had been on systems such as the site-wide Single Sign On system and the Kerberos infrastructure. He had a strong knowledge of LDAP, Kerberos and system administration in general, so his skills expanded the team’s knowledge, and a number of long-standing issues were cleared up in a short space of time thanks to his involvement.

    Sadly pod developed cancer, and after an initial operation to deal with it via chemotherapy and the removal of the majority of his stomach, the cancer came back and spread, leaving it inoperable. pod passed away on the 28th December 2012. This post is not intended as an official summary – there have been more formal commemorative provisions that we’ve assisted with – but is just a note from our team on his passing.

    pod was quite a logical thinker, and I think he had time for anyone no matter what the previous history, as long as they thought things through in what they were discussing. I found this made him refreshingly easy to deal with in a political/professional environment, and a good second opinion or sanity check to run technical ideas past – even if they weren’t his area of technical experience. From a workplace perspective I think his legacy or challenge is for remaining staff to understand and think through issues and service migrations to the depth that pod would have – that is, I mean to say his attention to detail and meticulousness is something to live up to.

    Socially I think he took effort to analyse his own reactions and behaviour and this probably contributed to his large group of friends, and no enemies that I was ever aware of. These and his other qualities also made him a good personal friend to share a drink with.

    Everyone is going to miss pod.

    The Business Case for Single Sign on

    The intended audience for this document is appliance and software product vendors. The background is we’d like appliance vendors to support Single Sign On mechanisms natively.

    SSO? Yes, we already support LDAP and Active Directory against which to authenticate logins to our appliance.

    This is shared sign on, not true single sign on. Users visiting shared sign on protected sites enter the same credentials at each site to access each facility in turn. Although this is better than having many passwords to remember, the more you convince your users it’s OK to type their credentials into multiple web interfaces, the more exposed they are to two threats:

    • They are more likely to eventually be successfully phished by a request for them to enter their credentials in a site.
    • A single compromised site/appliance or site admin can harvest login credentials and use them elsewhere in your organisation.

    Those sound like rhetorical issues. What are you proposing?

    The user visits your site, your site redirects to an (external to your appliance) authentication portal, the user successfully authenticates and your site then receives the user plus a cryptographic token. If the user visits any other SSO enabled site, then that token already exists, so no login is needed, they seamlessly access the next site without any login credentials using the token.
    The appliance/site never sees the user login credentials themselves and the authentication portal is always the same site.

    A truly SSO site would have the user log in in the morning; their mail client, web browser and other applications then don’t need a password entered, as they all use the token from that single SSO authentication.

    Yeah, that sounds complicated to implement? Maybe we should talk about managing expectations…

    It’s not any more complicated than your existing modules. As an appliance vendor, where you have your existing LDAP and Active Directory authentication/authorisation modules, you’d add a third; the packages for common platforms are prebuilt and there’s a little configuration – it’s not a big deal. You could use WebAuth with LDAP, or you could use Shibboleth.

    As an example, if your product is using Apache under the hood, you could install the WebAuth authentication module alongside your existing authentication modules, and with a minor amount of system configuration the $REMOTE_USER value will be available to your application as normal once the user authenticates. Then use LDAP to get group/authorisation details.
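    As a sketch, protecting an application path with mod_webauth can be as small as the following (the path is a placeholder; server-wide keyring/keytab/WebKDC settings are assumed to be configured already):

    ```
    # Protect /app with WebAuth; the application never sees the password,
    # only the authenticated username in REMOTE_USER.
    <Location "/app">
        AuthType WebAuth
        Require valid-user
    </Location>
    ```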

    If you use the Shibboleth-based SSO method, then you don’t need LDAP for group/authorisation information, as the details will be appended as user attributes in the information provided to the application by the authentication module. Shibboleth uses SAML; it’s fairly straightforward if you’ve already built the ability into your product to use LDAP and Active Directory.

    OK. You’re the only site that’s asked for this though so it sounds pretty site specific?

    Firstly, this is technology used by multiple other universities; it’s not unique to our site, and no doubt more sites would use it if appliance vendors supported the technologies involved (see also IPv6, which until recently took a long time for vendors to take seriously).

    Secondly, to some degree feature requests are self-selecting – potential customers look up vendors’ products, see that they don’t support the technologies they need, and then go off to create a solution or workaround without contacting the vendor.

    Our existing sites that have implemented our product don’t seem too interested

    Mentally put yourself in the shoes of your customer
    They have just implemented your product; it’s deployed and working. They are not likely to want to suddenly change the authorisation and authentication. For a complex environment this would typically only come about from a service review and replacement. In layperson’s terms – if it isn’t broken (already deployed and working), customers won’t typically attempt to fix it, especially something as fundamental as the authentication/authorisation mechanism.

    None of our competitors offer this either so we don’t see that we have to match them

    If you are the first and only vendor to support Single sign on, then over time word will spread and you will be the known appliance vendor in your niche that Single Sign On capable sites go to. They will overlook minor flaws because you support this key feature.

    So this is something you want to implement? Sounds a bit like pie in the sky. Has anyone at all got it working?

    This has already been implemented and working site-wide for many years. The only exceptions come when we’re forced to use a vendor’s product that has no facility for Apache authentication modules, nor built-in support for either WebAuth or Shibboleth.

    Other sites using this technology include any site using Stanford WebAuth or Shibboleth.

    What’s the bottom line?

    If you support this feature, you will gain more customers and so earn more money in the long term. Customers (new and existing) will be happier because they have the option of deploying a new or integrating an existing Single Sign On site-wide system that includes your product.

    NTP service changes Nov 2012

    Over the next month we’ll be doing some work to consolidate our NTP stratum 2 and 3 services into what will hopefully (subject to antenna installation) be a four system stratum 1 service. All historical IP addresses and DNS names will continue to function but keen IT officers in local units monitoring the central service may spot individual NTP nodes disappearing and reappearing one at a time as the transition takes place.

    The intended audience of this post is IT support staff inside the university (the university has a federated support model); however, it is public in the hope that it’s of interest to others.

    If you aren’t sure what NTP is: it provides a method of network time synchronisation between computers. This is important for log correlation in troubleshooting and security analysis, but it’s also essential that the time be within a given synchronisation threshold for some types of encrypted communication and authentication to take place. The traditional arrangement was to have a few servers in your organisation querying external accurate sources, a tier (stratum) of servers below this querying those servers, and all your hordes of client machines querying that lower tier.

    Why?

    You might perceive NTP services as fairly maintenance free. This is true, but the main reason for the work is to separate out the NTP service from other services – currently each NTP node is served by a machine that’s also supplying another more critical service. The main mail relay nodes provide the current stratum 2 and various assorted servers provide stratum 3 (a database server, a webserver and so on).

    Normally this isn’t a problem, but it can cause issues/complications when there’s work to be done on one service/host, because it affects the operation of the other service that’s also resident. Some of these services/hosts need replacing or other maintenance work, and separating out NTP is a fairly small task that makes that maintenance easier.

    The full set of objectives is:

    • To consolidate the stratum 2 and 3 services (make the service simpler to understand)
    • To move the public NTP service to hosts dedicated to only that role
    • To add non-network time sources (GPS and radio)
    • To improve the user-facing documentation
    • To ensure the service is geographically spread out

    On this last point it’s worth noting that we always try to spread services out; however, in this case we made an error. We very carefully and methodically audited and moved our main mail relay nodes to different physical sites, one at a time, so as to make the mail relay service fault-tolerant of an issue at any single physical site. The mail relay had a lot of nodes at the time, and as part of this work four of them (which in hindsight happened to be the four that jointly host all the NTP stratum 2 service interfaces) ended up at one remote site – a situation which Murphy spotted and took advantage of with a power cut at that site. In the aftermath we received a number of very polite suggestions that we should try to spread our NTP service out geographically so as to avoid single points of failure based on physical location, which we had to politely acknowledge was indeed true.

    The Solution

    We already had an NTP appliance, which due to human resource constraints (NTP is not a politically squeaky wheel) hadn’t been deployed into a production role. Some testing on this revealed it could at least run as a normal network synchronised stratum 2 device, with successful GPS and radio antenna installations able to set it running as a stratum 1 source. It could listen on multiple interfaces, could have custom NTP configuration added and could also be secured for network duties on a public IP address.

    The plan was hence to purchase three more of these, making four appliances in total. The historical NTP stratum 2 and 3 service IP addresses currently in use by many devices university wide would be served by the appliances (one address from each stratum by each), and each appliance would be placed at a different physical location. The user documentation would be updated and with approval of the owners of various buildings we should be able to install antennas to elevate the service to Stratum 1 on all four appliances.

    So this solution would separate the NTP service out onto dedicated hardware and so the NTP service would not be affected by alterations or work on other services (within reason: a loss of the backbone network obviously wouldn’t be survivable without service connectivity disruption for instance).

    It’s unlikely that we’ll lose connectivity to the joint academic network for any length of time, but just in case, the stratum 1 independent time sources would prevent the time service drifting or shutting down, which in turn will prevent time-related issues with Kerberos authentication and similar in a suitably apocalyptic disaster scenario. The GPS/radio antennas are also fairly cheap and shouldn’t need replacing.

    The Cost

    The total cost of all three extra appliances including GPS/radio antennas and a 5 year hardware warranty was less than the cost of a single typical mail node.

    We spent a little more money on one of the appliances (in the region of £100 more) to make it a more powerful model, with the idea that once our service deployment is complete we’d like to offer this node as a time source back to the UK NTP pool. I think this is ethical behaviour, to contribute back to the community.

    The human resource time, including physical deployment, antenna mounting, documentation and so forth, is perhaps in the region of 4-5 person-days – the majority of which will be the political and physical work involved in having holes drilled in buildings for antennas to be installed. Configuration and testing is only two days, including initial setup, this blog post, updating user-facing documentation and IPv6 testing.

    What’s the status?

    The status of this is that the hardware has arrived, has been labelled and base configured and is working on live IPv4 testing addresses. I’m performing the IPv6 testing today and preparing revised service documentation (essentially better instructions on service usage). One of the four sites has an antenna installation request open, I’ll be creating requests for the other three sites today.

    We should be able to start moving stratum 3 nodes to the new service today, but this will be done one at a time, verifying the service after each move.

    Stratum 2 is more complicated, because the historical IPv4 service address is also in use by another internal service. I need to work on that related service to separate the addresses (which in practice means migrating the other service to a new host). That may take around 4 weeks – not because of the volume of work itself, but because we’ll likely use each week’s at-risk period to move one node at a time.

    General queries people might have

    • “I think your time is probably of low quality, I think it’s 5 minutes out! I’m going to use the UK NTP pool instead!”

    Some years ago, access to our stratum 2 nodes was by registration only (but stratum 3 was unrestricted). People who didn’t notice this restriction would sometimes point their servers at stratum 2, watch the time drift out on their server, and then complain that our service must have an incorrect time (out by the amount their device had drifted) and that they’d have to use an external source instead. The external source replied to them and corrected their time, the symptoms reinforcing their belief that our time was minutes out. We have since removed the restrictions, as NTP load was not an issue for modern servers and the restriction was causing unnecessary user confusion and wasted effort.

    The above is an example of why it’s important to drill down to testable evidence wherever possible, rather than guesses based on symptoms. So if you’re unfamiliar with NTP and want to see the exact accuracy of our service, log in to a Linux machine and use ntpq -p:

    ntpq -p ntp1.oucs.ox.ac.uk
         remote           refid      st t  when poll reach   delay  offset  jitter
    ==============================================================================
    +badajoz.oucs.ox 193.62.22.82     2 u   408 1024  377    0.357  -0.918   0.132
    *corunna.oucs.ox 193.62.22.74     2 u   551 1024  377    0.317  -1.496   0.413
    +vimiera.oucs.ox 131.188.3.221    2 u   395 1024  377    1.250  -1.482   0.221
    -salamanca.oucs. 131.188.3.222    2 u   888 1024  377    0.887  -0.760   0.546
    -2001:630:306:10 158.43.192.66    2 u   544 1024  377    8.685  -0.061   0.184
    -ntp0.cis.strath 192.93.2.20      2 u   601 1024  377   10.182   0.806   0.058
     LOCAL(0)        .LOCL.          13 l    12   64  377    0.000   0.000   0.001

    Some of the formatting will appear better on the terminal, but essentially you can see exactly what a node is synchronised with. Note that offset and jitter are in milliseconds. There are probably similar commands for Windows and Mac, which I leave as an exercise for the reader to find. It’s fair to say that the NTP results are of good quality.
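    In the spirit of repeatable tests, that output is easy to check programmatically. A small sketch (in Python, using two sample peer lines from the output above; the 5 ms threshold is an arbitrary illustration, not a service guarantee):

    ```python
    # Parse `ntpq -p` peer lines and pull out the offset column (milliseconds).
    sample = """\
    +badajoz.oucs.ox 193.62.22.82     2 u   408 1024  377    0.357  -0.918   0.132
    *corunna.oucs.ox 193.62.22.74     2 u   551 1024  377    0.317  -1.496   0.413
    """

    def peer_offsets_ms(ntpq_output):
        """Return the offset (ms) for each peer line in `ntpq -p` output."""
        offsets = []
        for line in ntpq_output.splitlines():
            fields = line.split()
            # Peer lines have 10 columns and start with a tally code (+ * - # o x .)
            if len(fields) == 10 and fields[0][0] in "+*-#ox.":
                offsets.append(float(fields[8]))  # column 9 is offset in ms
        return offsets

    offsets = peer_offsets_ms(sample)
    print(max(abs(o) for o in offsets))  # prints 1.496 - comfortably sub-5ms
    ```
    
    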

    So feel free to use the UK public NTP pool if you wish, but please use repeatable tests, not guesses when making technical decisions.

    • That command doesn’t work outside the university, it just times out. I can however query the time with ntpdate or ntpd.

    Sources inside the university can query the full state of our NTP servers, sources externally can just retrieve the time.

    NTP uses UDP, a connectionless network protocol, which in layperson’s terms has the side effect that it’s easier to forge the sender IP address. There have been fears that NTP servers can be used as an amplification attack vector: essentially someone says “Hi, I’m www.example.com, tell me all about your current status”, and our NTP server replies with a lot of information, but the destination we send to was not actually the originator. An attacker would send such a request to many NTP sites at once, with the aim of making the forged sender receive massive amounts of traffic that would make their normal business operations unable to function.

    By restricting status queries we reduce the potential usefulness of our service for malicious use, whilst still serving the core service (time readings). It is regrettable not to be able to offer the server status externally, but we may have a better solution in the longer term.
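    In ntp.conf terms, the sort of policy described might look like this (a sketch, not our exact configuration):

    ```
    # Serve time to anyone, but refuse status (mode 6/7) queries and
    # configuration changes from non-local sources.
    restrict default kod nomodify notrap nopeer noquery
    restrict 127.0.0.1
    restrict ::1
    ```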

    • Can I use the NTP service outside the university?

    If you’ve a laptop set to use one of our NTP servers, it will be able to retrieve time from our service inside the university or out. If your device only accepts one name/address you could use the round-robin DNS record, specifically ntp.ox.ac.uk or ntp.oucs.ox.ac.uk, but the user-facing documentation will be updated shortly with more details and ntp.conf examples for Linux system administrators and similar.

    If you are not a member of the university, the short version is that non-university sources should use the UK NTP pool. In reality, if you point your home desktop at our NTP service we wouldn’t notice, but in terms of configuration it’s better for you to use the UK NTP pool, which we hope to contribute to once the setup is finished. So use the name uk.pool.ntp.org in your configuration if you are an external, UK-based, non-university member.

    On this subject, commercial entities are another matter and can cause issues; we’ll be updating the official documentation with a suitable legal disclaimer. Note that with regard to the ntp.org pool, vendors get specific instructions on what to do.

    • What if one node suffers some sort of issue and the time drifts out?

    If you define multiple nodes in your configuration, your NTP server/client will automatically mark as bad any server that drifts out significantly compared to your other time sources, and will ignore it.
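    A client ntp.conf along these lines gives ntpd enough sources to out-vote a bad one (ntp1.oucs.ox.ac.uk is from the example above; the other server names are illustrative – consult the user-facing documentation for the recommended list):

    ```
    # Several sources: with three or more servers, ntpd can detect and
    # ignore a "falseticker" whose time disagrees with the majority.
    server ntp1.oucs.ox.ac.uk iburst
    server ntp2.oucs.ox.ac.uk iburst
    server ntp3.oucs.ox.ac.uk iburst
    driftfile /var/lib/ntp/ntp.drift
    ```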

    • Further questions

    If you are a member of university IT support, do please email in to networks at the usual address with any concerns, corrections or queries. External persons might prefer to reply on this blog post.

    Using Microsoft Active Directory as the Authentication server for an SSL VPN on a Cisco ASA.

    Background

    We wanted to be able to run an SSL VPN for a second team (Team B) on one of our ASA pairs. It was important to give each team a different VPN pool for security reasons. The first team (Team A) ran their own TACACS+ server for authentication, and we had leveraged that as the VPN authentication system with no issues. Team B already had an Active Directory (AD) deployment, so the challenge was to get this working with the ASA and their new SSL VPN pool.

    ASA config

    We needed two pieces of information from Team B.

    1. The IPs of their AD Domain Controllers (DCs).
    2. The AD realm

    With this data we could create the following config.

    aaa-server TEAMB_AD protocol kerberos
    aaa-server TEAMB_AD (outside_interface) host 192.0.2.1
     kerberos-realm TEAMB.DOMAIN
    aaa-server TEAMB_AD (outside_interface) host 192.0.2.2
     kerberos-realm TEAMB.DOMAIN
    !
    tunnel-group TEAMB_GROUP type remote-access
    tunnel-group TEAMB_GROUP general-attributes
     address-pool TEAMB_VPN_POOL
     authentication-server-group TEAMB_AD
     default-group-policy TEAMB_POLICY
     no strip-realm
     strip-group
    tunnel-group TEAMB_GROUP webvpn-attributes
     group-alias teamb enable
    !
    group-policy TEAMB_POLICY internal
    group-policy TEAMB_POLICY attributes
     dns-server value 8.8.8.8
     vpn-tunnel-protocol ssl-client
     password-storage enable
     split-tunnel-policy tunnelspecified
     split-tunnel-network-list value TEAMB_SPLIT_TUNNEL
     webvpn
      anyconnect keep-installer installed
      always-on-vpn profile-setting

    Since Team B have a group alias of ‘teamb’ at login which won’t be understood by AD, we strip that out. We don’t want to strip the realm though as that is needed by the AD server.

    The VPN exists to allow Team B to manage some of their equipment, so the TEAMB_SPLIT_TUNNEL ACL simply defines the networks to which we wish to encapsulate traffic. NTP was also enabled and running on the ASA, which is a prerequisite for working Kerberized services. Finally, we asked Team B to open up UDP port 88 inbound from our ASAs to their AD DCs, and asked Team B users to log in with username@TEAMB.DOMAIN.
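    For completeness, the TEAMB_SPLIT_TUNNEL ACL referenced above is just a standard ACL listing the managed networks (the addresses here are illustrative documentation/private ranges, not the real ones):

    ```
    ! Only traffic to these networks is sent down the VPN tunnel;
    ! everything else leaves the client directly.
    access-list TEAMB_SPLIT_TUNNEL standard permit 192.0.2.0 255.255.255.0
    access-list TEAMB_SPLIT_TUNNEL standard permit 10.20.0.0 255.255.0.0
    ```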
    The second part of this post is going to be written by Jemima Spare, the Windows Administrator of Team B.

    AD Settings

    No real changes needed to be made on the domain. The Cisco documentation mentions the following settings that can be made on the domain:
    • Using Active Directory to Force the User to Change Password at Next Logon
    • Using Active Directory to Specify Maximum Password Age
    • Using Active Directory to Override an Account Disabled AAA Indicator
    • Using Active Directory to Enforce Password Complexity
    These all seem to be there to mirror settings that you might want to make on the ASA – for example, to make sure that the AD settings are not more or less restrictive than the ASA settings. As the password complexity and maximum password age settings were adequate, no changes were made.

    Team A requested the IP addresses of the AD servers and the AD realm. The IP addresses were straightforward, and the AD realm could be checked by running set USERDNSDOMAIN on the command line on a domain controller. In this case, it was the same as the Fully Qualified Domain Name (FQDN).

    The firewalls in front of the domain controllers had to be opened up to allow UDP 88.

    Having done all of the above, we tried to connect and failed. Part of the troubleshooting involved checking the logs on the domain controllers’ firewall, and this was where we were able to see that the ASA was using TCP port 88 and not UDP port 88. The change was made to the firewall and voilà – the VPN connected.

    Disabling 802.11b

    We have been pondering the idea of disabling 802.11b for some time. Research into the subject has shown that it is feasible.

    What’s the difference?
    802.11b was the first widely adopted wireless networking standard, ratified by the IEEE in 1999. It was a game changer and led to the ubiquity of mobile devices. As happens in the technology industry, it became obsolete before long and was complemented by its successor, 802.11g. Apart from the increase in speed, the two standards differ in how data, management and control traffic are implemented. 802.11b uses the Direct-Sequence Spread Spectrum (DSSS) modulation technique, whereas its successor (along with 802.11a) uses Orthogonal Frequency-Division Multiplexing (OFDM) for encoding digital data. Despite both standards operating in the same 2.4GHz frequency band, different modulation schemes are used. Backwards compatibility with the older standard was achieved in 802.11g by using additional steps when talking to “b” clients on a “g” network. The mechanism used to facilitate this compatibility is called RTS/CTS (Request to Send/Clear to Send) and is responsible for reducing frame collisions. Such “protection” for legacy clients has a drawback in the form of reduced throughput (as it involves more control plane frames).
    The other notable difference lies in security. 802.11b devices don’t support AES encryption and often have driver-related issues with support for enterprise security (802.1X).

    802.11g and 802.11b control plane comparison

    What’s the plan?
    We plan to disable 802.11b compatibility on the centrally managed wireless service (OWLv2) on July 31. This will hopefully give you and your customers plenty of time to prepare for the change.

    What’s the impact?
    We have monitored the client protocol distribution over the past weeks, and the number of clients connecting with the old standard was marginal – we recorded an average of 3 devices out of roughly 3,700 connecting over 802.11b. This makes us believe that the benefit of disabling the ‘b’ standard outweighs the need for legacy support.

    Client Protocol Distribution

    Eduroam capping

    There has been a lot of discussion recently about capping eduroam on ITSS-D. I’d like to take the opportunity to present the state of the centrally managed wireless network, but also to provide some rationale behind this decision, which was taken back in 2009. I hope this will provide some context.

    The OWL 2 project started in 2008 with the goal of providing a centrally managed wireless service to cover public areas of the University. Since its inception, the network has grown considerably in size. At the moment we run four Cisco 5508 controllers and manage 858 access points covering most of the public areas of the University. A fifth controller has been purchased and will be put into production shortly. In peak periods of the year we have about 4,000 simultaneous clients; at the time of writing, about 3,600 clients in total are connected through eduroam, OWL and a number of local SSIDs.

    Fortnightly client count

    Since 2008, when the first devices were deployed, traffic patterns have changed significantly. The popularity of video streaming is on the increase, and thanks to the ubiquity of mobile devices, demand for wireless access has been growing fast.

    In 2009 we introduced an application firewall to tackle p2p activity on the wireless service. At the same time we imposed a throughput cap to provide a fair service for all users. It was agreed at the time to provide a service equivalent to a home ADSL line: 2Mbps downlink and 512Kbps uplink. We appreciate concerns from some heavy users that this may be insufficient by today’s standards; however, the rationale behind the decision hasn’t changed – we don’t aim to provide a cutting-edge network to compete with the wired network, but simply a convenient way to access the Internet for local and roaming users across various departments.

    Hardware considerations
    Devices deployed in the initial phase of the project were not 802.11n capable, so the benefits of MIMO and higher throughput do not apply across the entire network. The 802.11n standard was only published a year into the project. The Cisco LAP-1142N, currently our dominant platform for new provisions, accounts for just under half of the WAPs deployed (48%). This state of play is a hurdle to relaxing throughput restrictions, as our priority is clear – we aim to provide a reliable network. If we were to double or treble the current cap, units whose wireless estate mainly consists of 802.11g devices would be at risk compared to those running the latest standard. To ensure a reliable service we are compelled to use the lowest common denominator.

    Access Points by model

    Local network
    Another reason why our approach is somewhat pragmatic is that some units’ LANs have many more access points than others. We have at least a dozen units with over 20 access points, and one of the largest has over 50 devices. While some departments or colleges may have a Gigabit connection to the backbone and use more than one FroDo to connect annexe sites, others have only a 100Mbit feed on a single FroDo. A quick calculation shows that uncapped wireless traffic alone could saturate the “slower” backbone links.
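    That quick calculation can be sketched as a back-of-the-envelope estimate. The AP count and the 100Mbit feed come from the text above; the per-AP throughput figure is an assumption (a rough real-world 802.11g number), not a measured value:

    ```python
    # Back-of-the-envelope: can uncapped wireless saturate a 100 Mbit/s feed?
    aps = 20               # a mid-sized unit, per the figures above
    per_ap_mbps = 20       # rough achievable 802.11g throughput per AP (assumption)
    backbone_mbps = 100    # a single 100 Mbit FroDo feed

    aggregate = aps * per_ap_mbps   # potential uncapped wireless traffic
    print(aggregate)                # 400 (Mbit/s)
    print(aggregate > backbone_mbps)  # True - the uplink would be saturated
    ```

    Even with generous allowances for idle clients, the headroom simply isn’t there on a 100Mbit feed.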

    Events
    We had a number of units contacting us to say that their clients reported slow wireless connections and complained about dropouts. Upon investigation, it turned out that a unit was hosting a conference. As a result, the number of clients on each access point doubled or even quadrupled, which in turn put a heavy strain on the wireless service (also used to provide network access for the Visitor Network). This is another reason why we are rather modest in our approach – it’s a constant balancing of priorities to keep as many customers happy as possible. We find similar dilemmas in other services, e.g. disk space, inbox size, etc.
    Visitors Network numbers graph

    Wireless phones
    We host a physically separate network to connect wired VoIP phones and security appliances. Wireless phones are different: they use the eduroam SSID to reach their Call Manager. I assume everyone realises the importance and sensitivity of voice traffic on the network.

    To summarise: it’s not our whim or a determination to inconvenience you and your customers, but rather a challenging battle to provide a reliable service while balancing many, often conflicting, constraints. There is room for improvement and we review our policies, but each decision has to be carefully considered with the bigger picture in mind. I trust I have given you a better understanding of these concerns and the necessary compromises. We welcome your opinions and suggestions, so please get in touch with the networks team on networks@oucs.ox.ac.uk if you have questions or doubts.

    Edit: As of 7 May 2013, the throughput cap is set to 8Mbit/s symmetric.

     

    ASA 5505 Transparent Mode DHCP and Memory fun

    We have a customer who uses a Cisco ASA 5505 in transparent mode to protect a small LAN. They did the right thing and took out SmartNet cover, but the reseller botched something and the TAC wouldn’t play with them when they had problems. They gave me a call and the results were interesting enough to prompt this blog.

    Problem

    After reading the latest Cisco Advisory (worth doing), they had upgraded the software on the ASA from 8.2 to 8.4. However, after doing this, DHCP no longer worked on their subnet. The ASA rules needed to make DHCP work were in place. More detail on the DHCP side of things is at the bottom of this post.

    Cause

    When the customer upgraded, they didn’t note the memory requirements for version 8.4. They had 256 MB instead of the required 512 MB. It is a Very Good Idea to check this when upgrading the image on any Cisco device; details are near the bottom of this post. As we found here, sometimes the device will accept and run code that it shouldn’t. You do get a warning message on boot telling you the device doesn’t have enough memory; in this case, the engineer performing the upgrade didn’t know to look for it.
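    A quick way to check the installed memory before (and after) an upgrade is `show version`; on an ASA the RAM figure appears on the hardware line (output abbreviated and illustrative – the exact wording varies by platform and release):

    ```
    ciscoasa# show version | include RAM
    Hardware:   ASA5505, 256 MB RAM, CPU Geode 500 MHz
    ```

    The release notes for the target version then tell you whether that figure is enough.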

    Impact

    Any clients with a static IP could access the Internet fine, but no DHCP requests made it through the firewall. The counters on the ACL didn’t even increment. What I find interesting is that the device booted up and sort of ran; before seeing this I would have assumed a more catastrophic failure. I wonder if a less subtle failure would have been easier to deal with. Since there isn’t always enough flash to store multiple images, refusing to boot at all may not be the best behaviour. Perhaps booting, not passing any client traffic and filling the logs with memory grumbles is the answer.

    Solution

    The customer downgraded the image on their ASA and DHCP sprang back into life. They are going to order some more memory before repeating the upgrade. This was a good reminder that an engineer should always check the release notes when upgrading software.

    More on memory

    Since you may be reading this long after 8.4 is current, and since cisco.com is a complicated beast, I would suggest going to http://www.cisco.com/go/asa (or go/6500 or go/MYDEVICE) and then clicking on ‘Release and General Information’ if something like that still exists. You should then be able to find the release notes for the version of code you wish to install. Any memory requirements are in there.

    ASA Memory Requirements

    Additional DHCP mutterings

    Although not strictly relevant here, DHCP through a transparent mode ASA is a bit of a pain as you have to explicitly let everything through. I was sidetracked by this at first due to the symptoms the customer experienced. Their ASA was configured correctly as I said. What follows is a run through of their config and the general idea.

    The customer uses our central DHCP servers rather than the ASA’s daemon. The gateway for their network is an SVI on a Cisco 6500 with an ip helper-address configured for each DHCP server. A simplified version of what should happen follows:

    1. The clients broadcast for a DHCP server
    2. The firewall allows this through
    3. The gateway proxies the broadcast to the DHCP server
    4. The DHCP server replies to the gateway
    5. The gateway sends the reply to the client
    6. The firewall allows this reply through

    There are further messages involved, have a look at RFC 2131 if curious. Since the ASA is in transparent mode, inbound and outbound access-list rules are required for steps 2 and 5 to work. The Cisco config guide doesn’t include example access-lists so I will below.

    # Inbound access-list
    access-list outside_access_in remark Allow DHCP offer
    access-list outside_access_in extended permit udp host <ip of default gateway> any eq bootpc
    # Outbound access-list
    access-list inside_access_in remark Allow DHCP discovery / request
    access-list inside_access_in extended permit udp host 0.0.0.0 host 255.255.255.255 eq bootps
    access-list inside_access_in remark Allow DHCP
    access-list inside_access_in extended permit udp any object-group <group with all dhcp servers> eq bootps
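    The intent of those rules can be modelled as a small simulation (the helper and addresses below are invented for illustration, not ASA code): each rule is a predicate over (source, destination, destination port), and the DHCP discover (step 2) and the proxied offer (step 5) must each match a rule on the interface they cross.

    ```python
    # Toy model of the two ASA rules: a packet passes an interface if any
    # rule for that interface matches (src, dst, dport). Names are invented.
    GATEWAY = "192.0.2.1"        # hypothetical default gateway (the 6500 SVI)
    BOOTPS, BOOTPC = 67, 68      # DHCP server / client UDP ports

    rules = {
        "inside": [   # client -> network: discover is a broadcast to port 67
            lambda s, d, p: s == "0.0.0.0" and d == "255.255.255.255" and p == BOOTPS,
        ],
        "outside": [  # gateway -> client: the proxied offer arrives on port 68
            lambda s, d, p: s == GATEWAY and p == BOOTPC,
        ],
    }

    def permitted(iface, src, dst, dport):
        return any(rule(src, dst, dport) for rule in rules[iface])

    # Step 2: discover from an unconfigured client
    print(permitted("inside", "0.0.0.0", "255.255.255.255", BOOTPS))   # True
    # Step 5: offer relayed back via the gateway
    print(permitted("outside", GATEWAY, "192.0.2.50", BOOTPC))         # True
    # Unrelated inbound UDP still has no matching rule
    print(permitted("outside", "203.0.113.9", "192.0.2.50", 53))       # False
    ```

    The key point the model captures is that in transparent mode nothing is implicit: both directions of the DHCP exchange need an explicit permit.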
  • <xptr url="http://blogs.it.ox.ac.uk/networks/feed/"
    		 type="transclude" rend="rss rsssummary rsslimit-2"/>
    FroDo IOS upgrade: I’d like to announce a staged upgrade of IOS on all FroDos. This blog post aims to answer some of the questions this work will raise. Feel free to contact the Networks team with any questions at networks@it.ox.ac.uk. Why? We currently … <a href="http://blogs.it.ox.ac.uk/networks/2014/04/04/frodo-ios-upgrade/">Continue reading <span class="meta-nav">→</span></a>
    I just received a spam email from my own address: Our team was asked to answer some queries about how it’s possible to receive mail that has been forged as being from your email address. This article slightly overlaps with a previous article in 2011 that covered similar ground. Please note … <a href="http://blogs.it.ox.ac.uk/networks/2013/03/08/i-just-received-a-spam-email-from-my-own-address/">Continue reading <span class="meta-nav">→</span></a>
  • <xptr url="http://blogs.it.ox.ac.uk/networks/feed/"
    		 type="transclude" rend="rss rsslimit-2"/>
    FroDo IOS upgrade

    I’d like to announce a staged upgrade of IOS on all FroDos. This blog post aims to answer some of the questions this work will raise. Feel free to contact the Networks team with any questions at networks@it.ox.ac.uk.

    Why?

    We currently run 19 different versions of IOS across FroDos. Some of the switches haven’t been upgraded since the original installation (the longest running FroDo had an uptime of over 7 years). Whereas it may be advantageous to stick to a version that works fine on the switch, we decided to roll out updates on all FroDo switches in production. There are 3 main reasons for the mass-upgrade:
    - bug fixes
    - unification of versions and consistency
    - new features

    Our intention is to run a single IOS version per platform (3750[G], 3750-X, 3560[CG], 3850, 4900M, 4948E). I’m sure the question will spring to mind – why commit to this work when TONE is under way? Despite work progressing on the new backbone, it’s still quite a long time away and regardless of the fine details of its delivery, we will retain the concept of Point-of-Presence in the future design and thus keep existing switches in production for a considerable length of time. It therefore makes sense to consolidate the IOS versions at this point.

    Timescale

    We plan to upgrade on a per C-router basis. The schedule we devised is to upgrade and reload roughly 10 FroDos every Tuesday, Wednesday and Thursday until all switches are up to date. The following table details the process:

    Date Device VLANs affected Notes
    8 April Frodo-110 (acland)
    Frodo-113 (edstud)
    Frodo-116 (38-40-woodstock-rd)
    Frodo-120 (maison-francaise)
    Frodo-149 (physics-dwb)
    Frodo-150 (eng-ieb)
    Frodo-151 (maths)
    Frodo-152 (wolfson-building)
    Frodo-154 (lady-margaret-hall)
    Frodo-155 (mdx-eng)
    102, 104, 113, 118, 120, 125, 150, 151, 182, 183, 187, 189, 190, 191, 199, 397, 598, 691, 720, 994 Affects ResNet
    9 April Frodo-156 (materials-hume-rothery)
    Frodo-157 (e-science)
    Frodo-161 (eng-thom)
    Frodo-162 (eng-jenkin)
    Frodo-163 (eng-holder)
    Frodo-164 (eng-etb)
    Frodo-165 (14-15-parks-rd)
    Frodo-167 (radcliffe-infirmary)
    Frodo-168 (new-maths)
    Frodo-169 (wolfson)
    101, 102, 105, 106, 109, 111, 115, 121, 127, 151, 156, 163, 167, 186, 189, 193, 195, 196, 199, 288, 397, 398, 517, 694, 787, 788, 792, 904, 954, 967, 985 Affects Engineering WLC
    10 April Frodo-202 (careers)
    Frodo-204 (voltaire)
    Frodo-208 (12-bevington)
    Frodo-212 (belsyre-court)
    Frodo-217 (nissan-institute)
    Frodo-219 (wolsey-hall)
    Frodo-249 (begbroke)
    Frodo-250 (kellogg)
    Frodo-251 (ewert-house)
    Frodo-282 (williams)
    Frodo-293 (summertown-house)
    Frodo-296 (st-annes-robert-saunders)
    Frodo-297 (merrifield)
    202, 204, 208, 220, 222, 249, 252, 282, 283, 285, 286, 289, 290, 292, 296, 297, 298, 299, 397, 675, 678, 717, 720, 722, 794, 977, 989
    15 April Frodo-253 (mdx-sthughs)
    Frodo-255 (begbroke-iat)
    Frodo-257 (st-hughs)
    Frodo-258 (st-antonys)
    Frodo-260 (univstavertonrd)
    Frodo-262 (st-annes-frodo)
    Frodo-263 (green-college)
    Frodo-264 (wuhmo)
    Frodo-203 (13-bradmore-road)
    Frodo-281 (vc101br)
    Frodo-283 (areastud)
    Frodo-292 (trinity-staverton-rd)
    Frodo-569 (saville-house)
    Frodo-662 (new-college)
    121, 187, 188, 196, 203, 205, 206, 209, 214, 257, 279, 280, 281, 284, 284, 293, 295, 295, 296, 297, 329, 608, 673, 677, 679, 680, 681, 681, 682, 720, 796, 856, 989
    16 April Frodo-306 (safety)
    Frodo-308 (rh)
    Frodo-309 (linc-mus-rd)
    Frodo-310 (security-services)
    Frodo-313 (rai)
    Frodo-316 (physics-aopp)
    Frodo-324 (dlo)
    Frodo-351 (rex-richards)
    Frodo-352 (rodney-porter)
    Frodo-353 (dyson-perrins)
    Frodo-354 (stats)
    Frodo-355 (ocgf)
    112, 202, 305, 306, 308, 309, 310, 314, 319, 320, 351, 355, 372, 377, 388, 391, 397, 398, 399, 526, 595, 717
    17 April Frodo-356 (mdx-mus)
    Frodo-358 (chem-physical)
    Frodo-359 (beach)
    Frodo-360 (rsl)
    Frodo-361 (mansfield)
    Frodo-362 (bioch)
    Frodo-363 (physiology)
    Frodo-366 (inorganic-chemistry)
    Frodo-367 (keble)
    Frodo-368 (earth-sciences)
    Frodo-369 (9-parks-rd)
    Frodo-370 (museum)
    Frodo-625 (exam-schools)
    191, 301, 314, 315, 320, 323, 328, 329, 351, 361, 367, 368, 369, 370, 373, 375, 378, 379, 389, 391, 393, 394, 395, 396, 397, 398, 595, 625, 902, 906, 968, 970, 972, 997 Affects Museum Lodge WLC
    22 April Frodo-513 (stx-bnc-annexe)
    Frodo-515 (merton-annexe)
    Frodo-517 (english)
    Frodo-518 (law-library)
    Frodo-523 (zoo)
    Frodo-524 (mrc)
    Frodo-527 (mstc)
    Frodo-531 (club)
    Frodo-549 (balliol-holywell)
    Frodo-550 (mdx-zoo)
    Frodo-552 (social-sciences)
    Frodo-553 (stcatz)
    397, 510, 514, 515, 516, 517, 518, 523, 524, 527, 531, 552, 589, 594, 596, 597, 598, 687, 797, 977, 997
    23 April Frodo-554 (qeh)
    Frodo-555 (plants)
    Frodo-559 (chemistry-research-laboratory)
    Frodo-561 (path)
    Frodo-562 (tinsley)
    Frodo-563 (islamic-studies)
    Frodo-564 (mdx-ompi)
    Frodo-566 (pharm)
    Frodo-568 (psy)
    74, 182, 183, 214, 288, 301, 351, 360, 378, 388, 389, 391, 397, 398, 501, 507, 522, 553, 559, 561, 562, 580, 588, 590, 591, 592, 593, 595, 596, 597, 599, 678, 683, 694, 719, 727, 810, 860, 893, 893, 902, 948, 955, 956, 968, 976, 977
    24 April Frodo-602 (bod-old)
    Frodo-604 (music)
    Frodo-606 (sheldonian)
    Frodo-607 (bod-camera)
    Frodo-609 (ruskin-sch)
    Frodo-615 (bod-clarendon)
    Frodo-619 (all-souls)
    Frodo-627 (mhs)
    Frodo-628 (jesus)
    360, 397, 602, 604, 607, 609, 611, 615, 617, 619, 672, 682, 683, 683, 686, 697, 782, 997
    29 April Frodo-629 (exeter)
    Frodo-630 (queens)
    Frodo-631 (st-edmund-hall)
    Frodo-632 (10-merton-street)
    Frodo-634 (pembroke-college)
    Frodo-635 (chch)
    Frodo-639 (albion)
    Frodo-640 (hmc)
    Frodo-641 (old-indian-institute)
    Frodo-645 (campion)
    553, 610, 612, 620, 621, 631, 634, 640, 645, 662, 680, 684, 686, 688, 695, 919, 962
    30 April Frodo-649 (oii)
    Frodo-650 (trinity)
    Frodo-651 (sers)
    Frodo-652 (magd)
    Frodo-653 (littlegate)
    Frodo-654 (oriel)
    Frodo-655 (balliol)
    Frodo-656 (blue-boar-st)
    Frodo-657 (mdx-ind)
    Frodo-660 (mdx-chch)
    Frodo-689 (botanic-garden)
    Frodo-692 (stanford-house)
    Frodo-698 (chaplaincy)
    Frodo-699 (shop)
    15, 197, 378, 389, 397, 398, 601, 603, 614, 626, 627, 638, 639, 650, 654, 656, 676, 677, 678, 689, 690, 692, 694, 696, 698, 699, 722, 749, 787, 902, 905, 967, 981, 989, 997 Affects Indian Institute WLC
    1 May Frodo-661 (mdx-daubeny)
    Frodo-663 (axis-point)
    Frodo-664 (corpus-christi)
    Frodo-665 (pembroke)
    Frodo-666 (merton)
    Frodo-667 (univcoll)
    Frodo-669 (hertford)
    Frodo-671 (wadham)
    Frodo-76 (harkness)
    Frodo-77 (gibson)
    199, 214, 285, 297, 397, 398, 515, 605, 613, 634, 662, 663, 664, 669, 671, 673, 691, 792, 794
    6 May Frodo-702 (taylorian)
    Frodo-703 (old-boys-high-school)
    Frodo-707 (9-stjohnsst)
    Frodo-708 (bnc-frewin)
    Frodo-711 (arch)
    Frodo-713 (classics)
    Frodo-716 (clarendon-press)
    Frodo-717 (survey)
    Frodo-721 (barnett-house)
    Frodo-725 (some)
    397, 687, 702, 703, 707, 711, 713, 717, 721, 725, 749, 781, 787, 788, 796, 799, 954, 959, 977, 985, 997
    7 May Frodo-726 (25-wellington-square)
    Frodo-728 (sbs)
    Frodo-729 (sackler)
    Frodo-730 (lincoln-clarendon-st)
    Frodo-732 (oxford-union)
    Frodo-734 (castle-mill)
    Frodo-749 (orient)
    Frodo-750 (worcester-st)
    Frodo-751 (dartington)
    Frodo-754 (mdx-ash)
    284, 309, 397, 398, 675, 716, 720, 728, 729, 732, 749, 761, 783, 789, 790, 797, 906, 959, 975, 977, 997 Affects Ashmolean WLC and ResNet
    8 May Frodo-755 (mdx-socstud)
    Frodo-756 (ashmolean)
    Frodo-757 (stx)
    Frodo-759 (regents-park)
    Frodo-761 (rewley-house)
    Frodo-762 (sjc)
    Frodo-764 (st-peters-frodo)
    Frodo-765 (castle-mill-2)
    Frodo-766 (worcester)
    Frodo-767 (nuffield)
    Frodo-792 (worcester-street)
    Frodo-794 (hayes-house)
    320, 330, 370, 374, 375, 397, 398, 611, 675, 680, 691, 697, 701, 705, 709, 710, 715, 718, 720, 722, 733, 734, 756, 757, 781, 782, 784, 786, 793, 794, 795, 797, 977, 989
    13 May Frodo-809 (ocdem)
    Frodo-821 (fmrib)
    Frodo-851 (sports-distributor)
    Frodo-855 (well)
    Frodo-862 (mdx-ihs)
    Frodo-863 (iffley-rd)
    Frodo-864 (st-hildas)
    Frodo-865 (ndm)
    Frodo-867 (kennedy)
    Frodo-869 (ccmp)
    Frodo-890 (ssho)
    Frodo-899 (imm)
    Frodo-881 (alan-bullock)
    15, 214, 395, 397, 398, 398, 515, 682, 684, 691, 695, 698, 720, 805, 806, 807, 808, 809, 812, 851, 852, 854, 855, 856, 864, 880, 881, 882, 883, 887, 890, 892, 893, 894, 902, 962, 968, 975 Affects IHS WLC

    To find out the number of your backbone VLAN and annexe connections, use Looking Glass.

    If your FroDo isn’t listed above, it most likely has been upgraded already. The following switches run current IOS as a result of other maintenance work:
    Frodo-101 (physics-theory); Frodo-102 (materials-21-banbury); Frodo-104 (materials-12-13-parks-rd); Frodo-159 (mdx-edstud); Frodo-207 (43-banbury-rd); Frodo-213 (anthropology-58a-br); Frodo-215 (anthropology-64-br); Frodo-218 (anthropology-51-br); Frodo-220 (anthropology-61-br); Frodo-301 (physics-clarendon); Frodo-323 (robert-hooke); Frodo-349 (prm); Frodo-357 (mdx-plants); Frodo-551 (life-sciences); Frodo-557 (medawar); Frodo-560 (pathology); Frodo-567 (linacre); Frodo-623 (linc); Frodo-633 (sbs-phase-2); Frodo-648 (mdx-ind2); Frodo-658 (mdx-all-souls); Frodo-659 (mdx-merton); Frodo-670 (brasenose); Frodo-712 (eng-osney); Frodo-752 (beaver-house); Frodo-801 (botnar); Frodo-802 (psych); Frodo-849 (jr2); Frodo-853 (rob); Frodo-856 (richard-doll); Frodo-857 (psych-meg); Frodo-858 (rosemary-rue); Frodo-859 (orcrb); Frodo-905 (16-wellington-square); Frodo-908 (phonetics); Frodo-909 (theology-34a-st-giles); Frodo-910 (counselling); Frodo-914 (new-barnet-house); Frodo-916 (37a-st-giles); Frodo-962 (egrove); Frodo-963 (offices); Frodo-964 (ertegun); Frodo-969 (mdx-oucs); Frodo-972 (oucs)

    Impact

    Depending on the hardware platform, the expected downtime is about 8 to 30 minutes. The Catalyst 3750 – the dominant platform – takes only a few minutes to reload into the new IOS, but other platforms may include a microcode upgrade, which takes up to half an hour. We intend to upgrade and reload the switches in the early morning (7:30–9am) to minimise impact on backbone connections. In the event of a hardware failure, a replacement FroDo will be installed. When reading the above table and assessing disruption to your connectivity, keep annexe connections in mind.

    I just received a spam email from my own address

    Our team was asked to answer some queries about how it’s possible to receive mail that has been forged as being from your email address. This article slightly overlaps with a previous article in 2011 that covered similar ground. Please note that the target audience for this article is end users, not technical support staff and so some of the technical descriptions (and especially the diagrams) are simplified in order to explain the overall theory or process.

    Someone is sending mail as being from my address, how is that possible?

    It’s best to think of emails as postcards. Anyone can write on the postcard a false sender – anyone could send you a postcard ‘from’ you and the postman would still deliver it.
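    The postcard analogy can be seen directly in code: the From header of a message is just text the sender writes, with nothing validating it at composition time. A minimal sketch using Python’s standard email library (the addresses are placeholders):

    ```python
    from email.message import EmailMessage

    # The From header is just text, exactly like the sender line on a
    # postcard - the author of the message controls it completely.
    msg = EmailMessage()
    msg["From"] = "you@example.ac.uk"   # "forged": any address can go here
    msg["To"] = "victim@example.org"
    msg["Subject"] = "A postcard 'from' you"
    msg.set_content("The postman delivers it regardless of the sender written on it.")

    print(msg["From"])   # you@example.ac.uk
    ```

    Whether anything downstream questions that header is entirely up to the receiving systems, which is what the rest of this article is about.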

    How can I stop someone outside the university receiving an email pretending to be from me?

    One of the most reliable ways to establish that a mail is from you is to install, set up and use PGP/GnuPG mail signing in your mail client and have the receiver of your mail always check that the signature is valid. This can be complicated at first and it’s best to involve your local IT support.

    This does not perfectly address the question, however. People on the internet will still be able to send email with your sender address, and a recipient outside the university may or may not check the signature. To explain why the university cannot affect this, here’s a diagram showing a mail being delivered from an Internet Service Provider (ISP, like BT or Virgin Media) to a destination site with the sender address forged:

    I’ve simplified the communications involved, but you’ll notice that there’s no involvement of the university systems in the above diagram. The university will have no logs or any other interaction in this example. This is one reason why we ask that all legitimate mail for the ox.ac.uk domains is sent through the university systems. Consider this scenario:

    When someone sends mail via a 3rd party mail submission server, we have no involvement. Imagine you gave a physical letter to a coworker to hand-deliver, it didn’t arrive, and you then complained to the postman – it’s a similar scenario.

    I’ve heard that SPF is the answer to this.

    In an ideal world (or for a small company), SPF would be of immediate use, but the University of Oxford mail environment does not currently match what SPF wants to describe. We can use it to increase the spam score of inbound mail, but we can’t reject on it, nor can we currently publish a restrictive SPF record designating exactly which mail servers can send mail for ox.ac.uk domains. I’ll explain further.

    With SPF we essentially state in a public DNS record: “the following servers can send mail for the ox.ac.uk domain”. The idea is that the receiving server checks whether the mail server that sent it the mail is in the list of authorised sending servers. The following diagram shows the basic process in action:

    So in this example, the ISP SMTP server contacts a 3rd party site and attempts to deliver a message from an address at ox.ac.uk. The receiving site looks up our SPF record, sees that the SMTP server trying to deliver is not listed as a valid server for our domain, and rejects the mail. Sounds perfect? Sadly there are a number of problems with this:

    • Firstly, even if there were no other problems, there is no way we can force a 3rd party receiving site to check SPF records for mail it receives from other 3rd party servers.
    • Secondly, we hit a problem with the list of ‘authorised servers’: even if the 20 or so separate units with SMTP exemptions to the internet are included in the list, we then have to include any NHS mail servers, any gmail.com mail servers and a selection of other sources where users currently legitimately send as their university addresses but from a 3rd party. Each time we open up one of these online services, the SPF record becomes less useful, since anyone on gmail or NHS servers could then send as any ox.ac.uk address and pass the SPF test.
    • Thirdly, we need receiving sites not to break (refuse messages) when messages are forwarded while we have strict SPF records in place.
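    The basic check shown in the diagram can be sketched as follows. This is a deliberately simplified stand-in (the domain, record and IPs are invented); real SPF evaluation, per RFC 7208, involves live DNS lookups and mechanisms such as include and ~all:

    ```python
    # Simplified SPF-style check: is the connecting server's IP in the
    # sender domain's published list? Records and IPs here are invented.
    published_spf = {
        "example.ac.uk": {"198.51.100.10", "198.51.100.11"},  # authorised relays
    }

    def spf_check(mail_from_domain: str, connecting_ip: str) -> str:
        authorised = published_spf.get(mail_from_domain)
        if authorised is None:
            return "none"                     # domain publishes no record
        return "pass" if connecting_ip in authorised else "fail"

    print(spf_check("example.ac.uk", "198.51.100.10"))  # pass (our relay)
    print(spf_check("example.ac.uk", "203.0.113.7"))    # fail (3rd-party server)
    ```

    The second bullet above corresponds to that set of authorised relays growing to include gmail and NHS servers, at which point a “pass” stops meaning very much.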

    A solution to the latter problem would be a university-wide decree that mail sent from ox.ac.uk addresses must go via the university mail servers. That’s not likely to be a popular idea, but I list it for completeness; I’ll discuss this further in the conclusion.

    You could still check SPF inbound to the university in general though?

    Yes, we’ve done some work in this area. It’s not a boolean solution to anything, however, as some spammers have perfect SPF records and some legitimate sites have broken ones. We could increment the spam score based on the result, but a knee-jerk decree of ‘block all mail that fails SPF’ would be quite interesting in terms of support calls, and perhaps short-lived as a result.

    Just order the remote sites to fix their configuration!

    We do talk to remote sites about delivery issues. The problem comes when the remote site says ‘no’ either because they don’t understand the issue or because they don’t agree. There comes a point at which no matter what technical argument we make, the remote site will refuse to accept an issue exists. We have no authority to force them into any course of action.

    As an example, most mail-sending ‘rules’, defined in documents called RFCs, have been in place for decades (the first came out in 1982). There are still lots of mail administrators who do not adhere to the basics and will aggressively argue against any such prodding. This includes small hosting companies, massive telecommunications providers and even some mail administrators in the university. Example problems include not sending a valid helo/ehlo (this one simple test rejects about 95% of inbound connections – spam – for a false positive rate of about one or two incidents a year), persuading the remote sender to send mail from a DNS domain that actually exists, and having valid DNS records for the sending server.

    Since we can’t get the internet to agree on rules for mail servers that have been established for decades, it’s not likely that we’ll be able to force a 3rd party site to perform SPF checking.

    Well what about DKIM?

    We like DKIM as a technology, but in our environment we would hit similar issues to those described for SPF. Before any technical contacts fill up the comments section, I’d like to make it clear that DKIM and SPF are not identical in what they do, but for the purposes of the problem addressed in this article, and for describing this aspect of their operation to end users, they can be considered roughly similar. Here’s a very simplified diagram of DKIM in operation:

    In an ultra-simplified form, the difference is that DKIM adds a digital signature to each outbound message (more accurately, a header line which cryptographically signs the message’s delivery information), which the receiving server checks using cryptographic information we publish in the DNS, rather than checking a list of valid source IPs. This would work well in a politically simpler environment with all sites on the internet joining in. It wouldn’t end spam (an attacker could still compromise a user’s account and send mail that would then be legitimately signed), but it would make spamming more constrained (pushing it towards short-lived domains purchased with stolen credit cards and similar, which is a different issue), and by doing so other anti-spam techniques could be used more effectively.

    • Again, the problem is that for a 3rd party site delivering to another 3rd party site, we cannot force the receiving site to have implemented DKIM.
    • If we state that all legitimate mail from ox.ac.uk is DKIM-signed, then mail sent from gmail or NHS mail servers as ox.ac.uk addresses will be considered invalid by sites that do check DKIM information for inbound mail.
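    The sign-then-verify flow can be sketched as a toy (an HMAC with a shared key stands in for the real key pair purely to show the flow; actual DKIM, per RFC 6376, signs selected headers and a body hash with a private key, and verifiers fetch the public key from DNS):

    ```python
    import hashlib
    import hmac

    # Toy DKIM-like flow: the relay signs headers plus a body hash; the
    # receiver recomputes and compares. Key and addresses are invented.
    SIGNING_KEY = b"hypothetical-private-key"

    def sign(headers: str, body: str) -> str:
        payload = (headers + "\r\n" + hashlib.sha256(body.encode()).hexdigest()).encode()
        return hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()

    def verify(headers: str, body: str, signature: str) -> bool:
        return hmac.compare_digest(sign(headers, body), signature)

    headers = "From: someone@example.ac.uk\r\nTo: friend@example.org"
    body = "Hello!"
    sig = sign(headers, body)            # added by the outbound relay

    print(verify(headers, body, sig))                     # True: message untouched
    print(verify(headers, body + " PS: buy pills", sig))  # False: body was altered
    ```

    The flow makes the second bullet concrete: a message sent as an ox.ac.uk address from a server that doesn’t hold the signing key simply has no valid signature to verify.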

    In our team we’ve done some trials on scoring inbound mail based on DKIM, and sadly there are a number of misconfigured sites out there sending what appears to be legitimate mail that is nonetheless invalid according to the DKIM information for their domain. As with SPF, we could increment the spam score slightly for invalid DKIM results to improve the efficiency of inbound mail scoring.

    DKIM signing for outbound mail is a little trickier, as we’d have to share the private signing key with the 20 other units that are SMTP-exempted and get them to implement DKIM. From my experience of talking to internal postmasters while reducing the number of exempted mail servers from 120 down to about 20, getting those sites to implement DKIM is near impossible.

    Another solution would be to force all outbound mail connections for the remaining SMTP-exempted mail servers to go via the oxmail relay cluster and sign at that one point. There are two problems with this. Firstly [please note that this is my personal, subjective opinion], it isn’t a service with a dedicated administrative post, so any political emergency in any other service leaves the mail relay undeveloped and unadministered. By itself this isn’t normally a massive problem – the service is kept alive, the hardware renewed, the operating systems updated, and there is some degree of damage limitation in a crisis. What is needed if the relay becomes the single point of failure for the entire organisation is permanent, active, daily development – for example, to proactively stop the mail relay from ever being blacklisted. Otherwise a disaster occurs, the units that were forced to use the mail relay demand political allowance to connect to the internet directly (because they want to get on with their work, which is a legitimate need), and then DKIM has to be ripped out for those exemptions to work.

    This leads on to the second problem: forcing anyone to do anything needs a lot of political support, will be highly unpopular (some mail administrators have been independent for decades and run a setup similar to oxmail – a cluster, ClamAV and SpamAssassin), and people resent political upsets for a long time (as an example, a staff dispute from 25 years earlier caused problems for an IT support call I worked on when I was previously employed in one of the university’s sub-units).

    Isn’t it simple? Just stop delivery attempts coming in to the university from outside that state the mail is ‘from’ an ox.ac.uk address?

    This would currently block a lot of legitimate mail (users sending via gmail, NHS users, etc.). I anticipate that within a short time of being ordered to implement such a rule, we would be ordered to withdraw it due to the negative impact on legitimate mail.

    So, in summary, what are you telling me?

    We can never totally stop a 3rd party site from accepting mail from another 3rd party site, where the sender is pretending to be an ox.ac.uk sender address. There will always be receiving sites that will not implement the technologies that can assist in that scenario and cannot be influenced or argued with.

    If you want to send a mail to a 3rd party and have them know beyond (almost) all reasonable doubt that the mail is from you, then you need PGP or GnuPG to digitally sign each mail you send. Provided you become familiar with the process and aren’t tricked into sending your private signing key to other people, an attacker would have to compromise your workstation to obtain your private signing key and sign mails as you – a large step up in complexity from simply sending spam.

    We could improve the inbound spam scoring to reduce spam coming into the university in general, but this takes time: we have to find a balance between the amount of spam correctly identified and the amount of legitimate mail from misconfigured sites left unaffected. A factor in this is that there are currently only two systems administrators for all of the network services, so human resources are an issue (this is not the only service with political demands for changes).

    If there were a university-wide policy that all mail from ox.ac.uk addresses had to be sent from inside the university, then we could implement SPF and (perhaps in time) DKIM, which would help reduce the problem of forged mail from/to external 3rd parties pretending to be from ox.ac.uk senders. In my opinion, the university should fund a full-time post dedicated to the mail relay if it wishes to do this, however, since it is not a simple task in terms of planning and political/administrative overhead.
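    For illustration, publishing an SPF policy amounts to a single DNS TXT record listing the hosts permitted to send mail for the domain. The values below are hypothetical, not Oxford's actual policy:

    ```
    ; DNS zone fragment (hypothetical values)
    ox.ac.uk.   IN   TXT   "v=spf1 ip4:192.0.2.0/24 mx -all"
    ```

    Here `-all` asks receiving sites to reject mail from any other source claiming an ox.ac.uk sender. The record is trivial to publish; the hard parts are the policy decision behind it (no one may send from outside) and the fact that it only helps when the receiving site actually checks SPF.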

    And lastly, we know that spam is frustrating – spam costs the university in terms of human time but also dedicated hardware, so there is a real financial cost. Why don’t we just stop it? There are lots of anti-spam techniques we actively use that I haven’t covered in this article, and we regularly consider and test improvements, but despite decades of worldwide effort there is no perfect anti-spam system in existence. The university will therefore not have a perfect anti-spam system until such time as one is devised. You may receive less spam using another organisation’s server; that doesn’t mean you were sent less spam.

    I hope this article has been of some use. Please also check out the article from 2011 that was previously mentioned.

  • <xptr url="http://blogs.it.ox.ac.uk/networks/feed/"
    		 type="transclude" rend="rssbrief"/>
    FroDo IOS upgrade
    I just received a spam email from my own address
    Migrations
    Chris Cooper (pod)
    The Business Case for Single Sign on
    NTP service changes Nov 2012
    Using Microsoft Active Directory as the Authentication server for an SSL VPN on a Cisco ASA.
    Disabling 802.11b
    Eduroam capping
    ASA 5505 Transparent Mode DHCP and Memory fun
  • <xptr url="http://blogs.it.ox.ac.uk/dcut/feed/"
    		 type="transclude" rend="rss"/>
    Blog posting on Harvard Plagiarism issue

    Melissa Highton, Director of Academic IT Services (Learning and Teaching) has put up this thought-provoking blog piece: http://blogs.it.ox.ac.uk/melissa/2012/10/06/scandal/

    ArtsWeeks at OUCS

    OUCS is running ArtsWeek again displaying art by IT staff at Oxford University. Opens tomorrow and runs all next week – all welcome!

    OUCS’s Great War Archive needs your vote

    The GWA has entered the EngageU: The European Competition for Best Innovation in University Outreach and Public Engagement. Please vote for it and ask others to do so too (as it is the best entry after all!)… http://engageawards.com/entry/81

    Oxford’s WW1 user generated content project steams on

    Today Luxembourg, in two weeks Dublin, then Preston, Slovenia, Denmark, etc. Oxford’s Great War Archive project rolls on …

    http://www.europeana1914-1918.eu/en

    Any good ideas on what we should be providing online for students?

    The student digital experience project is looking for your ideas on what Oxford should be providing!

    “What’s on your wish-list for your digital experience at Oxford: more mobile, more wireless, more WebLearn…? Tell us at dige@oucs.ox.ac.uk”

    Free Training at OUCS on the “Extra” Day

    I like this. We get an extra day in our lives (OK so it’s in February but we can’t have everything) and OUCS offers free training to celebrate:

    “On February 29th OUCS are offering all courses for FREE. Use your extra day this year to develop a skill, grow your knowledge and understanding, or explore the world of free e-books. http://bit.ly/yVRU8I

    Or tweet friendly:

    On Feb 29th all courses @OUCS are FREE. Learn something new on your extra day this year! http://bit.ly/A1dddk”

    Want to get more out of WebLearn?

    Then check out this new site of online guides: https://weblearn.ox.ac.uk/portal/hierarchy/info

    Eduroam app for iPad iPhone

    If you use Eduroam (and let’s face it, who doesn’t?) then you may be interested in these: https://www.ja.net/janetnews/2012/01/05/eduroam-companion-app-for-iphone-and-ipad/

    Plain guide to FOI

    Freedom of Information, one of the most abused bits of legislation ever, now has a new ‘plain English’ guide:

    http://www.ico.gov.uk/for_organisations/freedom_of_information/guide.aspx

    Oxford gets JISC grant to explore Open Educational Resources re WW1

    Another success for Oxford’s LTG in terms of developing its work in open educational resources: http://jiscww1.jiscinvolve.org/wp/jisc-ww1-oer-project-2/

Up: Contents Previous: 3. Tables Next: 5. Reading multiple RSS feeds