9. Development of the Scientific Computing Center at Vanderbilt University
Lawrence Fu
Background
When Jason Moore1 came to Vanderbilt University in 1999 as a professor in the Department of Molecular Physiology and Biophysics, he knew that he needed a parallel computer (a computer with more than one central processing unit, used for parallel processing) to conduct his research. His research involved the statistical analysis of genetics, specifically the study of gene-gene interactions and the implications for disease risk. The work he wanted to do would require computational power that could be provided only with high-performance computing (HPC).* The first step he took toward this goal was to apply to the Vanderbilt University Medical Center for a Vanderbilt University discovery grant.2 This program was a mechanism to stimulate the development of new ideas and allow investigators to develop them for future external federal funding. He received $50,000 to build a parallel computer.
Instead of simply starting work on building a system, he decided to find out if any other researchers at Vanderbilt were working on developing a parallel computer. After talking to other researchers from all over the campus, he discovered that Paul Sheldon,3 a professor in the Department of Physics and Astronomy, had done more work than anyone else in this area. Paul’s area of research was elementary particle physics and the study of the physics of heavy quarks. He had worked on the development of a workstation farm called Vanderbilt University physics analysis cluster (VUPAC).4 A workstation farm is a cluster of workstations loosely coupled to provide a very coarse parallel computing environment. Initial support for VUPAC was provided by a National Science Foundation (NSF) academic research infrastructure grant with matching funds from Vanderbilt University. Additional funding by the NSF and the Department of Energy later facilitated upgrades, administration, and maintenance.
Jason and Paul decided to work together and develop a shared resource, which they called the Vanderbilt multiprocessor integrated research engine (VAMPIRE).5 Paul remembers: “Jason and I quickly realized that we pretty much wanted to do the same things. We had similar goals and similar amounts of money to do it. Basically, it was a meeting of the minds, and we realized [that working together] was the right way to do it. It was an interesting thing to try.” Additional funding for the project was provided by a second Vanderbilt University discovery grant and from the startup funds of another physics investigator. This Vanderbilt University discovery grant came approximately a year after Jason’s initial grant, but this time it came from the university side rather than the medical center. Altogether, the group had secured about $150,000 to accomplish the project.
*This type of computing requires scientific workstations, supercomputer systems, high-speed networks, a new generation of large-scale parallel systems, and application and systems software, with all components well integrated and linked over a high-speed network.
Copyright 2005. Springer.
All rights reserved. May not be reproduced in any form without permission from the publisher, except fair uses permitted under U.S. or applicable copyright law.
EBSCO Publishing : eBook Collection (EBSCOhost) – printed on 6/14/2023 5:58 PM via WALDEN UNIVERSITYAN: 145750 ; Nancy M. Lorenzi, Joan S. Ash, Jonathan Einbinder, Wendy McPhee, Laura Einbinder.; Transforming Health Care Through Information Account: s6527200.main.eds
Developing VAMPIRE
Since the group had limited funds, financial costs played a major role in hardware and software decisions. All hardware, including the central processing units (CPUs), hard drives, and networking cards, was purchased on the Internet at the lowest prices available. From the beginning, they knew that they wanted to use Linux for the operating system, but deciding on the specific build and distribution took some time and effort. Information technology services (ITS), the campus agency responsible for overseeing the information infrastructure of the university as a whole, provided much helpful assistance by supplying personnel support for this decision and other technical details. Another software issue was providing a mechanism to share the resource effectively with many users. Two different packages, MAUI6 and OpenPBS,7 were used. Both of these are freely available HPC cluster resource management and scheduling systems. Unfortunately, these options did not provide all the functionality that was needed and were not well supported by their developers. However, the fact that the software was free outweighed the shortcomings.
While Jason and Paul were the leaders in making these types of decisions, Alan Tackett8 played a critical role in the technical development of VAMPIRE. Alan’s research background is computational physics, and he came to Vanderbilt in 1998 as a postdoctoral research fellow in physics. In 1999, he heard about the VAMPIRE effort getting under way and became involved. With previous parallel computing experience, he ultimately took the lead on technical details. He was instrumental in providing technical expertise and input for key hardware and software decisions. Another contribution Alan made was leading the outreach efforts to attract new investigators. He routinely met with research groups, learned about their work, and explained to them how a parallel computer could aid them in their research.
Finding physical space for the VAMPIRE system was not a difficult task. ITS, besides providing helpful input, volunteered space in one of its raised-floor air-conditioned rooms within the Hill Center, where ITS was located. Jason notes that “ITS was instrumental throughout the whole process. Having a group on campus that was willing to support us with space and resources was key. ITS was incredibly helpful. If ITS hadn’t been involved, space would have been a bigger issue.”
In the spring of 2000, the group, with the help of graduate students and postdoctoral research fellows, assembled VAMPIRE. A 2-day pizza party coincided with the activities. It took the group about 48 hours to assemble the parallel computer, with its fifty-five dual-processor nodes, by hand. Since that time, VAMPIRE has been operational 24 hours a day. There have been some hardware failures, such as the loss of a few CPUs, memory sticks, and hard drives, but these types of issues are expected for a system of this size.
In the beginning, only a few other investigators were involved. They made contributions to the system in exchange for access to VAMPIRE. One such person was Walter Chazin,9 professor of biochemistry and director of the center of structural biology. Another group that was involved early on was the nuclear physics group. The number of investigators started at five in 2000, grew to ten in 2001, and continued to grow to sixteen in 2002. The popularity of VAMPIRE grew as others heard about its usefulness. There was no formal mechanism for attracting other researchers, but word of mouth was particularly effective. Initially, Paul and Jason knew of a few people with whom they wanted to talk, but others simply approached them after hearing about the effort from others. One person who helped publicize VAMPIRE and brought people together was Chip Cox, director of the Vanderbilt Internet 2 project. After the initial setup, additional funding resulted from the participation of new investigators. Two engineering professors contributed a large sum of money. One provided $250,000 as part of his startup funds, and another $250,000 came from a U.S. Navy grant. At this time, Ron Schrimpf10 joined the effort and would play a large role in the further maturation of VAMPIRE into a larger system. Ron, a professor from the Department of Electrical Engineering, contributed a large number of nodes for VAMPIRE through one of his department’s research programs. His research deals with the interface of physics and the semiconductor aspects of electrical engineering. He requires the use of heavy-duty computing for simulations, and his role represents the perspective of a major user of the system.
Growing VAMPIRE into the Scientific Computing Center
The success of VAMPIRE alleviated many initial concerns about its viability. There were questions about whether different research cultures would clash, whether they could all agree on hardware and software decisions, whether it was possible to create a fair sharing mechanism for all users, and whether there would be synergy among the users. VAMPIRE proved that all these concerns could be handled. Jason believes that VAMPIRE was key in making the idea of an even larger computing facility seem feasible: “VAMPIRE was critical because it showed that an interdisciplinary team of investigators from across the entire university could come together and work on a project. It brought the School of Medicine, School of Arts and Sciences, and School of Engineering together on a single project. It got us talking to one another. That in and of itself is a tremendous achievement for the university. . . . VAMPIRE provided a focal point for bringing together investigators. It was a successful pilot project that showed that we could all work together towards a common goal.” Building on the achievements of VAMPIRE, Paul, Ron, and Jason developed the idea for a scientific computing center (SCC). It would not merely be a larger system accommodating more users but would also entail educational outreach efforts to introduce inexperienced users to the world of HPC.
Paul agrees that VAMPIRE was essential to the development of a more comprehensive computing center for the university: “In our minds, we were going to see how this [VAMPIRE] went. This was a test case to see if we could work together. Always in the back of my mind, I knew that I was going to need significantly more computing. There was never any question in my mind that I was going to have to find some way to get it. Exactly how much wasn’t clear. Once we got things together and working and moving forward, we realized that we could work together, and it was a great idea. It always seemed to us that we were going to grow. There was talk very early on of a large system. It wasn’t the SCC, but there was talk of a large facility. The SCC and its idea developed and grew over time.”
Another important lesson learned from VAMPIRE was that education outreach and the recruitment of new users were possible. Paul emphasized this point: “We realized that the whole education outreach efforts and low barriers to participation were possible. For example, Alan interacted with other research groups on campus, talking with them, and working with them. He also taught a class11 with Greg Walker12 from engineering about methods of parallelizing applications. . . . We realized that there was a lot of interest on campus. There weren’t just going to be a few dedicated computer nerds using it. There were a lot of people on campus who could benefit from this with a little bit of help.” Ultimately, the SCC would not merely cater to a few users but would aim to serve the university community as a whole. In Jason’s words, “We wanted to set up a center that will span the entire university and reaches out to all people doing computational work in every department. We eventually hope to get people from music, law, and business using the system.”
Obtaining Funding
Once the concept of the SCC was developed, the next step toward making it a reality was to secure funding. However, initial attempts to find funding were unsuccessful. Two requests were made to the NSF through its major research instrumentation13 (MRI) program. This program aims to increase the scientific and engineering equipment for research by supporting large-scale instrumentation investments. Awards typically range between $70,000 and $140,000. Both applications for the SCC asked for $1.5 million but barely missed approval. In addition, Jason in 2001 submitted an application to the high-end instrumentation program14 of the National Institutes of Health (NIH). It received good scores and good reviews, but it did not get approval for the $1.5 million amount that he requested.
Besides external federal funding, internal funding through the university was possible. The university’s Academic Venture Capital Fund15 (AVCF) was established to launch major new transinstitutional initiatives in order to advance Vanderbilt to the front rank of American research universities. The application process required submission to at least one of two strategic academic planning groups (SAPGs), one for the medical center and one for the university central. In the event that a proposal involved both the medical center and the university, both SAPGs would consider it simultaneously, and this was the case with the SCC proposal. If SAPG approval was given, proposals were forwarded to the integrated financial planning (IFP) council for further consideration, and the final step for approval was a recommendation to the university chancellor for funding. One of the central requirements for a successful proposal was for it to satisfy a set of ten prespecified selection criteria, including the following:
1. The proposed effort is in accord with the Vanderbilt University chancellor’s five basic goals for academic excellence and strategic growth:
• We must renew our commitment to the undergraduate experience at Vanderbilt.
• We must reinvent graduate education at Vanderbilt.
• We must reintegrate professional education with the intellectual life of the university.
• We must reexamine and restructure economic models for the university.
• We must renew Vanderbilt’s covenant with the community.
2. The proposed effort will help advance Vanderbilt to the front rank of American universities. To offer only two examples, this could be accomplished by bringing together existing institutional strengths in a new and distinctive way, or by proposing a creative way to strengthen a critical area that limits Vanderbilt’s ability to move forward.
3. The proposed effort enhances the learning environment and opportunities for undergraduate, professional, and graduate students and recognizes the need to recruit and retain an intellectually, racially, and culturally diverse campus community.
4. The proposed effort will require a significant investment in graduate education, and, if successful, will improve the national ranking of one or more graduate programs.
5. The proposed effort involves a broad range of faculty rather than a few individuals and will foster greater collaboration among the schools.
6. The proposed effort will strengthen disciplinary integrity and expand the interdisciplinary range of departments.
7. The faculty leadership is already in place.
8. The proposed investment will strengthen the core disciplines.
9. The proposed effort is bold, requiring significant intellectual and financial investment, with anticipated gains commensurate with the magnitude of the investment.
10. The proposed effort shows clear promise for generating the funding needed to sustain itself after the initial period of AVCF support (of no more than 5 years).
In 2002, the first proposal was submitted to the AVCF but did not receive approval. It was an administration-driven effort led by the director of ITS at the time. Then in mid-2003, a second proposal spearheaded by Jason, Paul, and Ron was submitted to the AVCF and received approximately $8.2 million in funding for the SCC. One important distinction to note between the different sources of funding is that federal funding would have provided means solely for building the computer. It would not have covered any other aspects of the SCC. On the other hand, the internal AVCF funding provided capital for data storage, data archiving, data visualization, and personnel related to outreach and support efforts.
Details About the SCC
The approved AVCF proposal explicitly laid out the administrative and organizational structure for the SCC. Jason, Paul, and Ron are the principal investigators and make up the steering committee, while Alan serves as project administrator. The steering committee is responsible for all major decisions but will seek input from other committees. There are four other committees:
1. The investigators committee consists of all faculty members who are using or will be using the system. It currently contains approximately fifty investigators from the university. This internal advisory committee is chaired by Walter Chazin, Peter Cummings16 (chemical engineering), Mark Magnuson17 (molecular physiology and biophysics, assistant vice chancellor for research), and Nancy Lorenzi18 (biomedical informatics, assistant vice chancellor for health affairs), and it will provide a diverse array of opinions.
2. The external advisory committee consists of three or four individuals from outside Vanderbilt in order to provide an objective perspective.
3. The technical advisory committee, chaired by Jarrod Smith19 from the Department of Biochemistry, will make hardware and software recommendations.
4. The users committee will communicate the needs of the daily users such as graduate students. It is chaired by Greg Walker, a professor from the Department of Mechanical Engineering.
There is an organized reporting structure in place to facilitate communication between committees. Alan, the project administrator, submits quarterly reports to the steering committee. The technical advisory committee provides an evaluation of current operations as well as recommendations for future infrastructure through quarterly reports to the steering committee. The external advisory committee provides biannual reviews to the steering and investigators committees. The steering committee submits annual reports to the investigators committee for approval, and it also provides the annual report to Dennis Hall,20 associate provost for research, and Lee Limbird,21 associate vice chancellor for research.
Within the proposal, three types of targeted users are enumerated:
1. Experienced investigators who use parallel computing regularly will be able to immediately take advantage of the center.
2. Users who regularly do computing may have never had the resources to do parallel computing. These users know about parallel computing but never have had the opportunity to take advantage of it.
3. People who do not know about parallel computing and are not aware that it can help them in their research are still able to use the resources.
In order to aid the second and third types of users, the SCC will employ an education and outreach staff. Informational and tutorial sessions will provide assistance to researchers on how to take advantage of HPC. In Ron’s opinion, “the educational activities in a way are more important [than the computer]. We’re going to have hardware, and we need hardware. If it sits there by itself without anyone helping new people use it, it’s not going to have a big impact on the culture of the campus. What will really be transformative about the center is the other side of it [education outreach], which will help people get involved.” In addition to the outreach staff, there will also be an operations staff, which will maintain the hardware and software resources, and a scientific staff, which will include visiting scholars and center fellows.
By identifying the three types of potential users, the SCC emphasizes catering to the needs of researchers. One of the core philosophies of the center is that it is an investigator-driven resource. Jason believes that this idea is central to the ultimate success of the SCC: “From the start, this has been a grassroots effort. This has been an investigator initiated project. We said we needed this resource, and we’re going to put together the funds to get it started. This was our project. We started it, we organized it, we put it together, and we made it work. Our philosophy is that nobody knows better what we need for our research than us. It’s going to be a center run by the investigators for the investigators.” Paul echoes this sentiment: “I don’t think it makes sense any other way. Investigators are the ones with the stake in it and motivation to make it work. . . . I think the day it stops being that is the day it starts falling apart.”
In order to allow simultaneous use by many people, the SCC follows a relatively straightforward sharing mechanism. Investigators gain access by contributing resources, such as CPUs, to the center. The use of these resources is guaranteed to them whenever they want them. However, people do not use their resources all the time. Consequently, the pool of idle resources can be split among all other users. So far, this arrangement has worked smoothly. The beauty of this simple agreement can be summarized in Paul’s words, “You can buy thirty machines and get access to a thousand machines.”
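The sharing rule described above can be illustrated with a short sketch. It is not the scheduler configuration the SCC actually uses; the function name `allocate`, the dictionary inputs, and in particular the round-robin split of the idle pool are assumptions made for illustration, since the chapter says only that idle resources are split among the other users.

```python
def allocate(contributed, requested):
    """Return the number of CPUs granted to each contributing investigator.

    contributed: dict mapping investigator -> CPUs they added to the center
    requested:   dict mapping investigator -> CPUs they want right now
    Only contributors are considered, since access is gained by contributing.
    """
    # Step 1: each investigator is guaranteed up to what they contributed.
    granted = {name: min(requested.get(name, 0), cpus)
               for name, cpus in contributed.items()}

    # Step 2: CPUs their owners are not using form the shared pool.
    pool = sum(contributed.values()) - sum(granted.values())

    # Step 3 (assumed policy): hand out the pool one CPU at a time,
    # round-robin, to investigators who still want more.
    unmet = {name: requested.get(name, 0) - got for name, got in granted.items()}
    while pool > 0:
        waiting = sorted(name for name, need in unmet.items() if need > 0)
        if not waiting:
            break
        for name in waiting:
            if pool == 0:
                break
            granted[name] += 1
            unmet[name] -= 1
            pool -= 1
    return granted


if __name__ == "__main__":
    # An investigator who bought 30 machines can use the whole cluster
    # while the other 970 machines sit idle.
    print(allocate({"small_group": 30, "everyone_else": 970},
                   {"small_group": 1000, "everyone_else": 0}))
```

Under this sketch, a contributor's guaranteed share is never reduced by other users' demand; only machines left idle by their owners are redistributed.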
Current State
At this point, the SCC has not yet grown to its full size. It currently contains 400 processors and ranks as number 199 among the top national HPC clusters.22 Eventually, the SCC will possess 2,000 processors, that is, 1,000 dual-processor nodes. The first major hardware purchases will occur in January or early 2004, and there is a rolling schedule for hardware purchases. Each year, one third of the processors will be added, so the system will not reach full capacity for 3 years. Afterward, the oldest third of the nodes will be replaced each year because the processors typically have a 3-year life cycle before becoming obsolete. While VAMPIRE originally consisted of commodity-priced parts assembled by the group, the SCC will purchase hardware from a third-party vendor who will assemble and test the system.
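The rolling purchase schedule can be worked through as a small calculation. The sketch below is illustrative only: the function name `rolling_schedule`, the way the remainder of 1,000 divided by 3 is assigned to the earliest cohort, and the decision to ignore the 400 processors already installed are all assumptions, not details from the proposal.

```python
def rolling_schedule(total_nodes=1000, cycle_years=3, horizon=6):
    """Return a list of (year, nodes_bought, nodes_retired, nodes_installed).

    One cohort of roughly total_nodes / cycle_years nodes is bought each
    year; once cycle_years cohorts exist, the oldest is retired each year.
    """
    base, extra = divmod(total_nodes, cycle_years)
    # Assumption: the first `extra` cohorts absorb the remainder, one node each.
    sizes = [base + (1 if i < extra else 0) for i in range(cycle_years)]

    cohorts = []  # node counts of active cohorts, oldest first
    plan = []
    for year in range(1, horizon + 1):
        retired = 0
        if len(cohorts) == cycle_years:
            retired = cohorts.pop(0)  # replace the oldest third
        bought = sizes[(year - 1) % cycle_years]
        cohorts.append(bought)
        plan.append((year, bought, retired, sum(cohorts)))
    return plan


if __name__ == "__main__":
    for year, bought, retired, installed in rolling_schedule():
        print(f"year {year}: buy {bought}, retire {retired}, installed {installed}")
```

With these assumptions the cluster reaches 1,000 nodes at the end of the third year and stays at that size afterward, with no node older than 3 years.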
Besides the processors, supporting infrastructure was another consideration of the proposed budget. The groundwork has been laid for a large tape archive facility with a $75,000 tape library purchase. A disk storage system has been chosen that can sufficiently handle the large amounts of data that will be generated. It will be flexible enough to handle growing user needs. Furthermore, the budget allocated funds for specialized visualization hardware that will enable real-time analysis of large, complex data sets with immersive display technologies.
The SCC’s budget calls for an initial large investment in equipment. In subsequent years, the funds will shift a greater percentage to personnel and will reach a steady state of personnel and equipment costs. After 5 years, the center hopes to be able to sustain itself financially because the AVCF provides funding for a maximum of 5 years. To reach this end, the steering committee plans to hire a financial director in the beginning of 2004. The director’s responsibilities will include overseeing the finances as well as driving the outreach efforts. The ideal candidate will have management, financial, and accounting experience. In Jason’s opinion, the financial independence of the SCC will be the biggest challenge to the center. He realizes that this will require much effort but is optimistic: “I think it will work since there are so many people at the university who will use the center. There will be a lot of funding coming into the center, and we should be able to recover most of the costs to keep it going.”
Another major consideration for the future is how to accommodate the needs of so many users. When VAMPIRE was in its beginning stages, involving only a few investigators, Paul recalls that the small group had good communication and a loose organization. They saw eye to eye on most issues. However, as the SCC grows larger, decisions become more complicated: “When everything was small and friendly, it was easy. Now, it has to be big and professional. We have to work for a lot of people. In some cases, we have competing needs and issues. What do we do first [in ramping up the system]? What do we emphasize? Should we spend the personnel and resources this way or another way?” One current example of varying needs of users is the following: One individual requires each CPU that he utilizes to have 4 gigabytes of random access memory (RAM) instead of the customary 1 gigabyte. According to Jason, the steering committee, with input from the technical advisory committee, must answer such questions as, “Do we want to have every node have 4 gigabytes? Is it cost-effective to do that? If it’s too expensive, can we have 10 percent of the nodes have 4 gigabytes? How feasible is it to have one part of the system have an increased amount of RAM?”
So far, the SCC has been able to bring together a diverse community. An increased rate of scientific discovery by university researchers should be possible because previously prohibitive computational work is now feasible. Other anticipated benefits include enhanced education for students and a recruitment tool for new faculty. Those who played a major role in its development undoubtedly have learned many lessons along the way. Paul admits that “there are a lot of little things that I would have liked to have done differently. I wish I’d understood better that tape systems are such a headache, but it wouldn’t have mattered since there would have been other technical issues. I wish that we had all understood better how best to get this project going. . . . This was the first time I ever took on a project of this magnitude. You learn things about management along the way . . . how to handle the people involved in the project and the people who will benefit from the project.”
However, future unanticipated obstacles may arise because the SCC is still maturing and has yet to achieve its envisioned size. The steering committee is well aware of the fact that what works on a 55-node cluster will not necessarily work on a 1,000-node cluster. Paul notes that “If you have 100 different groups, each may be able to contribute in only a special way since NSF [or whichever funding agency] says that they can only spend the money a certain way. We have this infrastructure we have to pay for, and we have to somehow find a way to allocate it back to the users. We’ve been successful so far, and everybody’s been happy.”
Questions
1. The initial phases of development with VAMPIRE went relatively smoothly. What factors contributed to this success?
2. What were key factors in making an interdisciplinary project of this size work?
3. Were there any decisions that you would have made differently?
4. Are there any potential issues you believe that the steering committee may not have considered?
5. If another university were planning to set up a similar computing center, what are the most important lessons that they should learn from the Vanderbilt University example?
References
1. http://phg.mc.vanderbilt.edu/jason.shtml.
2. http://medschool.mc.vanderbilt.edu/oor/pd/index.php?PD=4.
3. http://www.hep.vanderbilt.edu/~sheldon/.
4. http://www.vupac.vanderbilt.edu/.
5. http://vampire.vanderbilt.edu/.
6. http://mauischeduler.sourceforge.net/.
7. http://www.supercluster.org/projects/pbs/.
8. http://vampire.vanderbilt.edu/staff.php#tacketar.
9. http://structbio.vanderbilt.edu/chazin/.
10. http://www.vuse.vanderbilt.edu/~schrimpf/persinfo.html.
11. http://tplab.vuse.vanderbilt.edu/~walkerdg/hpc.html.
12. http://frontweb.vuse.vanderbilt.edu/vuse_web/directory/facultybio.asp?FacultyID=341.
13. http://www.eng.nsf.gov/mri/.
14. http://grants1.nih.gov/grants/guide/rfa-files/RFA-RR-03-009.html.
15. http://medschool.mc.vanderbilt.edu/oor/pd/doc/AVCF_Guidelines_2002_03.doc.
16. http://www.vuse.vanderbilt.edu/~cheinfo/cummings1.htm.
17. http://www.mc.vanderbilt.edu/vumcdept/mpb/magnuson/.
18. http://www.mc.vanderbilt.edu/dbmi/people/faculty/lorenzi_nancy/index.html.
19. http://structbio.vanderbilt.edu/~jsmith/home.html.
20. http://www.physics.vanderbilt.edu/cv/dghall.html.
21. http://medschool.mc.vanderbilt.edu/limbirdlab/.
22. http://www.top500.org/.
100 Section III. Implementation