By Robert L. Scheier
March 12, 2001
Need to give your Web site the reliability of a mainframe?
Then standardize and automate your change-management processes.
Steve Etzell saw for himself how quickly a minor unauthorized
change can foul up a Web site. Etzell, director of Web technology
at Select Comfort Corp. in Minneapolis, was on vacation when
he got a call telling him the bed maker and retailer's Web
site performance had gone "into the tank." The reason:
A developer had let a business group user "twist his
arm" into dynamically generating user-specific price
quotes on a Web page that showed an entire category of Select
Comfort's products. The site had previously sent users to
a cached page that showed the same prices to everyone.
That change "seemed fairly innocuous," Etzell recalls.
But the page "is accessed potentially 100,000 times per
day . . . and you'll bring the server to its knees" by
forcing it to dynamically create the page for each visitor,
he adds.
"About an hour later, they realized what they had done
and turned the caching back on" for the category page,
says Etzell. The long-term answer was to remove prices from
the categories page and instead put them on Web pages that
describe specific products. Since those are accessed far less
often than the categories page, the site can deliver customized
pricing without taking a huge performance
hit.
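The trade-off Etzell describes, serving one cached copy of a busy page instead of rebuilding it for every visitor, can be illustrated with a minimal sketch. This is not Select Comfort's code: the names PageCache and render_category_page, and the five-minute time-to-live, are assumptions made up for the illustration.

```python
import time


class PageCache:
    """Hypothetical time-based cache for rendered pages (illustration only)."""

    def __init__(self, render, ttl_seconds=300.0, clock=time.monotonic):
        self._render = render      # function that builds the page (the expensive step)
        self._ttl = ttl_seconds    # how long a rendered copy stays valid
        self._clock = clock        # injectable clock, which makes expiry testable
        self._entries = {}         # page key -> (expires_at, html)
        self.render_calls = 0      # counts how often the expensive step actually ran

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]        # still fresh: serve the stored copy
        html = self._render(key)   # missing or expired: rebuild once and store
        self.render_calls += 1
        self._entries[key] = (now + self._ttl, html)
        return html


def render_category_page(category):
    # Stand-in for dynamic page generation (assumed; real pages would hit a database).
    return f"<html><body>Products in {category}</body></html>"


cache = PageCache(render_category_page, ttl_seconds=300.0)
for _ in range(1000):              # 1,000 visitors request the same category page
    cache.get("beds")
print("renders:", cache.render_calls)
```

In this sketch the 1,000 requests trigger a single render. Per-visitor pricing would make every request a distinct page and defeat the cache, which is consistent with moving custom prices to the less-visited product pages.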
Balancing Act
It's that kind of unplanned, untested change that Web site
managers hate and users love. Managing change on the Web is
a "balancing act" between the need to keep your
very public Web site up and running and the need to update
it often enough to keep it attractive to visitors, says Etzell.
The more important your site, the more reliable it needs
to be. The more transactions you do, the more it needs the
kind of rock-solid stability once associated only with mainframes
in a data center. Keeping embarrassing and costly outages
to a minimum requires IT managers to create standard change-management
policies, automate them as much as possible and outsource
them if they must. Repeatable, consistent procedures, performed
either by skilled support staff or automated tools, are the
best way to cope with the pressures of a public-facing Web
site.
The Web environment is unique because users demand changes
within hours, not weeks. Changes to content aren't done by
database administrators who first check the validity of the
data and its effect on site performance, but by marketing
managers. There's no single mainframe vendor to release updates
or patches on a regular schedule, but rather a half-dozen
or more suppliers that find and fix flaws in their products
on their own schedules.
Then there's security, which can require major changes to
sites as hackers discover new ways to bring them down. "There's
a lot more changes going on in these Web-facing systems, with
most of those relating to security," says Jason Lochhead,
co-founder and chief technology officer at Data Return Corp.,
a Dallas-based managed hosting company. "You didn't have
to worry so much on legacy systems because they're isolated
from public traffic." Microsoft Corp. acknowledged in
late January, for example, that its defenses had been inadequate
after it was hit by denial-of-service attacks two days in
a row. In response, Microsoft planned changes to its network
architecture, including a backup set of domain name servers
(DNS).
Even routine, planned changes can crash a site if they're
done incorrectly. Just days before the hackers hit, Microsoft
suffered a 22-hour outage that left many of its Web sites
unavailable. The company blamed the problem on a faulty configuration
change to the routers on its DNS network.
When Don Ursem compares the reliability of his Web site with
that of the telephone system, he isn't kidding. Ursem is vice
president of network operations at VocalPoint Inc., a San
Francisco-based application service provider that lets consumers
access Web sites via phone by converting HTML into voice responses.
VocalPoint sells the service to telephone companies and in
vertical markets such as health care. For the end user, "it's
a telephone application," not a computer application,
and "you expect your telephone to work all of the time,"
says Ursem.
But that's easier said than done. First, there's the volume:
VocalPoint leases two T3 data lines, each of which can handle
644 simultaneous incoming calls and needs 135 servers to process
them. Then there's growth: As VocalPoint adds T3 lines, Ursem
expects that he'll be managing about 650 servers across three
sites by June.
VocalPoint rolls out a new release of its voice Web-browsing
software every three months and is converting about 30 Windows
NT servers to Linux to support a new text-to-speech engine.
Then there are routine upgrades and patches to the databases,
operating systems, network switches and EMC Corp. Symmetrix
storage-area networks. Each must be tested for its effect
on the system, rolled out in a coordinated way and tracked
so that if any updates backfire, the offending change can
be pulled out of production. And such caution is warranted.
According to a survey conducted last year by Framingham, Mass.-based
IDC, 46% of IT managers said software updates gone wrong played
a role in their site outages.
Ursem, a former mainframe data center manager, ended up outsourcing
to Intira Corp., a managed service provider in Pleasanton,
Calif. The selection came after a grueling examination of
seven San Francisco Bay area outsourcers to see how they matched
up with his goals of outsourcing and automating change management.
Ursem wanted a service-level agreement that covered not only
the servers and network, but also the incoming T3 lines and
their links to the servers. He insisted on choosing his server
hardware and software, which ruled out many outsourcers that
require customers to use standard offerings.
He also insisted that the outsourcer's staff follow written
procedures and that he have access to an online monitoring
tool to ensure that those procedures were being followed.
(For security reasons, Intira won't let Ursem into the data
center running his applications.) Ursem demanded and got contractual
commitments "that there would be no changes made to my
environment without my prior approval," including updates
to network switches, storage environments or software drivers.
Intira monitors the operation of its systems with Hewlett-Packard
Co.'s OpenView, which would have been bogged down if Ursem
had also used it to do continuous, real-time monitoring for
any changes in every server.
Using StatePoint Plus, a change-management tool developed
by Monroeville, Pa.-based Westinghouse Electric Co. for its
own use and now sold to other companies, "I have the
ability, from San Francisco, to link into the Intira data
center and compare any set of servers against a reference
server" to find and investigate any unexpected changes,
Ursem says.
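The idea of checking servers against a reference server can be sketched in a few lines. This is only an illustration of the general technique, not how StatePoint Plus works; the function find_drift, the server names and the settings shown are all hypothetical.

```python
def find_drift(reference, servers):
    """Compare each server's settings with a reference configuration.

    Returns {server_name: {setting: (expected, actual)}} for servers that
    differ. A value of None means the setting is absent on that side.
    """
    drift = {}
    for name, config in servers.items():
        diffs = {}
        # Check every setting present on either the reference or the server.
        for key in sorted(set(reference) | set(config)):
            expected = reference.get(key)
            actual = config.get(key)
            if expected != actual:
                diffs[key] = (expected, actual)
        if diffs:
            drift[name] = diffs
    return drift


# Hypothetical inventory data, invented for this example.
reference = {"os_patch": "SP6", "tts_engine": "2.0"}
servers = {
    "web01": {"os_patch": "SP6", "tts_engine": "2.0"},
    "web02": {"os_patch": "SP5", "tts_engine": "2.0", "debug": "on"},
}

for server, diffs in find_drift(reference, servers).items():
    print(server, diffs)
```

Here only web02 is reported, because its patch level differs from the reference and it carries a setting the reference lacks. A real tool would also need a way to collect each server's settings, which this sketch assumes has already been done.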
"I don't want things done manually by gangs of people,"
says Ursem. "Then you would suffer from human inconsistencies.
I'm looking to reduce that. Anything I can automate, I will.
Anything I can outsource, I will."
Old Rules, New Game
Select Comfort has built a multitiered process for making
changes to its site, which can get as many as 8,000 unique
visitors per day.
It created a content-management application that five people
in marketing can use for live updates of information such
as product descriptions and availability. But "we really
try to keep the control tight," says Etzell.
Select Comfort follows a mix of written and unwritten rules,
such as "don't change things at peak use time if you
don't have to." This select group of users can make changes
either live on the site immediately or to a staging server,
where changes can be reviewed before going live. The company
also does weekly batch updates of changes, as well as a "major
monthly push" in which more complicated functional changes
(compared with content-based changes) are put into place,
says Etzell.
Like Ursem, Etzell has taken pains to document the change-management
procedures for his environment, which includes Windows NT
4.0 servers and SQL Server 7.0 databases, as well as Austin,
Texas-based Vignette Corp.'s StoryServer 5. He says he also
tries to make sure everyone on staff knows who is responsible
for which parts of the infrastructure so they can be notified
of changes that might affect them.
The strongest change-management processes, says Etzell, were
adapted from those already used by the technical services
group responsible for Select Comfort's backbone enterprise
resource planning, financial and other systems. These processes
cover changes to infrastructure hardware and software, with
written test plans before an update is put into service. But
even then, "some arm-twisting goes on, and we'll change
something on the fly," Etzell says.
Keeping those exceptions to a minimum is part of the art
of change management. It's when you try to "short-circuit"
your own procedures, Etzell says, that you get into trouble, which
can mean a nasty wake-up call for the entire business.
Scheier is a freelance writer in Boylston, Mass.