ETR-RT15

IEEE Communications Society (ComSoc)
Technical Committee on Communications Quality & Reliability (CQR)

Emerging Technology Reliability Roundtable 2015
(ETR-RT 2015)
Co-located with IEEE CQR 2015 International Workshop
Monday, May 11, 2015
Francis Marion Hotel
387 King Street
Charleston, South Carolina 29403, USA
http://www.francismarionhotel.com/

Scope of the Roundtable

Discuss and identify the RAS (Reliability, Availability and Serviceability) challenges, requirements and methodologies in emerging technology areas such as Cloud Computing, Wireless/Mobility, NFV (Network Functions Virtualization), SDN (Software Defined Networking), and similar large-scale distributed and virtualized systems.
Discuss the RAS requirements and technologies for mission-critical industries (e.g., airborne systems, railway communication systems, and banking and financial communication systems), with the goal of promoting the inter-industry sharing of related ideas and experiences.
Identify potential directions for resolving identified issues and propose possible solutions.

Introduction by Chi-Ming Chen and Spilios Makris

David Lu – Network Transformation and Its Impacts on Reliability
Eric Bauer – NFV Quality Management Framework Proposal
Steve Hunter – A Scalable Networking Architecture for Improving Performability Within and Across Data Centers
Sean Gong – Design for the RAS challenges of NFV
Bruce Wong – Chaos - Addressing the challenges of Complex Distributed Systems at Scale
John Govert – Test and Monitoring in Virtual Networks
Spilios Makris – Overview, Issues & Next Steps for the SRPSDVE Study Group (presented from Baku, Azerbaijan)
Marcus Schöller – Introduction to ETSI's NFV Reliability and Availability WG (presented from Germany)
Mike Tortorella – The Service Reliability Ecosystem (presented from New Jersey, USA)
Warren Volk – Operations Aspects of Reliability: Measurements and Network Management

Summary by Kelly Krick and David Lu (to be posted)

Special thanks to

Prof. Carol Davids, Illinois Institute of Technology for making the remote presentations possible, and
Hwey Chang, AT&T Labs for coordinating the Roundtable event and on-site management.

David Lu, Vice President, Business Solutions Development, AT&T, USA

David Lu, Vice President – Business Solutions Development, is responsible for Business Sales and Contracting, Global Service Assurance, Managed Services Platforms, Field Operation Dispatching, and Business Billing Solutions at AT&T. He leads an organization with more than 3,000 people across the globe.

David is a well-respected leader in software architecture and engineering, network performance and traffic management, business solutions, large data DB implementation/mining/analytics, software reliability and quality, and network operations process engineering.

Since joining AT&T Bell Labs in 1987, he has served in various leadership positions at AT&T. He holds 26 patents and has frequently appeared as a guest speaker at technical and leadership seminars and conferences throughout the world.

Topic: Network Transformation and Its Impacts on Reliability

Abstract:

This talk will cover the following topics: Network Virtualization and Its Impact to Network Reliability; SDN and Future Business Model and Operation Model; Reliability Implications by Open Source and New Software Ecosystem; Real Time Data Analytics & Adaptive Network; 5G Broadband and Next Generation Killer Apps; Cyber Security, PSI Protection, and Service Reliability; and Digital User Interaction.

Eric Bauer, Reliability Engineering Manager, IP Platforms Group, Alcatel-Lucent, USA

Eric Bauer is reliability engineering manager in the IP Platforms Group of Alcatel-Lucent. He has worked on reliability of Alcatel-Lucent’s platforms, applications and solutions for more than a decade. Before focusing on reliability engineering topics, Mr. Bauer spent two decades designing and developing embedded firmware, networked operating systems, IP PBXs, internet platforms, and optical transmission systems. He has been awarded more than two dozen US patents, authored “Service Quality of Cloud-Based Applications,” “Reliability and Availability of Cloud Computing,” “Beyond Redundancy: How Geographic Redundancy can Improve Service Availability and Reliability For Computer-Based Systems,” “Design for Reliability: Information and Computer-Based Systems,” and “Practical System Reliability” (all published by Wiley-IEEE Press) and has published several papers in the Bell Labs Technical Journal. Mr. Bauer holds a BS in Electrical Engineering from Cornell University, Ithaca, New York, and an MS in Electrical Engineering from Purdue University, West Lafayette, Indiana. He lives in Freehold, New Jersey.

Topic: NFV Quality Measurement

Abstract:

Network function virtualization explicitly decouples application software from underlying physical resources to create a far more flexible and dynamic operational environment. Objective, quantitative measurement of key quality characteristics is essential to rapidly localize errors, faults and failures, drive root cause analysis and decide on corrective actions in this shared, decoupled, flexible, dynamic and multi-vendor environment.

Dr. Steve Hunter, IBM Fellow and Adjunct Professor, NC State University (NCSU), USA

Steve Hunter started his career in the IBM Networking Division in 1984, where he worked on multiple networking products, as well as network system architecture and technologies in general. In 1997, Steve joined the IBM Systems and Technology Group (STG), where he was systems architect for System x and BladeCenter products, and Chief Technology Officer (CTO) of the BladeCenter system family.

Steve is an IBM Fellow and Chief Architect for Next Generation Computing Systems in IBM Research and an Adjunct Professor in the NCSU Electrical & Computer Engineering and Computer Science Departments focusing primarily on computer and networking architecture and technology.

Steve's areas of interest include parallel and distributed system architectures and technologies associated with computing, networking, clustering, high availability, and power-efficiency.

Steve received a BS in Electrical and Computer Engineering from Auburn University, Auburn, AL; an MS in Electrical and Computer Engineering from NC State University, Raleigh, NC; and a PhD in Electrical and Computer Engineering from Duke University, Durham, NC.

Topic: A Scalable Networking Architecture for Improving Performability Within and Across Data Centers

Abstract:

As data centers scale in size and across multiple geographic locations, goals include improving performance, high availability, data locality, and disaster recovery. A scalable networking architecture is proposed as a hybrid of the Spine/Leaf architecture and optical circuit switching (OCS) to provide dynamic low-latency communication. Advancements in OCS technology, combined with the control plane of Software Defined Networking (SDN) technology, provide a more dynamic approach than has been available in the past. Stochastic models are starting to be explored as an evaluation method, with initial interest in the composite Performability metric.

Xuewen (Sean) Gong, Director of Corporate Reliability Technical Committee, Chief Expert of NFV/SDN RAS Design, Huawei Technologies, China

Sean has worked at Huawei for 18 years. As one of the key founders of Huawei’s reliability engineering, he made important contributions to setting up the company's entire reliability process and technology system. He now focuses on the new RAS (Reliability, Availability and Serviceability) challenges in the telecom and IT industry and leads Huawei's NFV/SDN RAS Research & Development project.

Topic: NFV RAS Design: New Way to Meet the New Challenges

Abstract:

NFV brings profound changes to telecom networks and equipment: COTS hardware, virtualization, decoupling, flexibility, openness, etc. All of these will bring dramatic benefits to telecom operators, but they also introduce new RAS (Reliability, Availability, Serviceability) challenges, which will be critical for NFV’s successful deployment. This presentation will discuss the new RAS challenges of NFV and the new technologies we are working on.

Bruce Wong, Technology Leader, ex-Netflix, USA

Bruce Wong is a technology leader at Netflix. He is passionate about identifying needs, building high impact engineering teams, tackling hard problems and building compelling solutions. Bruce's current team is Chaos Engineering, which injects failure into live production systems to ensure they are resilient.

Topic: Injecting Chaos - Addressing the Challenges of Complex Distributed Cloud Systems at Scale

Abstract:

There are many challenges in building and running systems that handle Internet-scale traffic. Chaos, or failure injection, is one philosophy that has proven to provide reliability value in these systems. This presentation will look at the challenges, why conventional testing isn't enough, and why Chaos helps to make a robust, fault-tolerant system.

John Govert, Office of the CTO, JDSU’s Network and Service Enablement Group

John Govert brings nearly 25 years of test product marketing, engineering and General Management experience to his position of Chief Technologist in JDSU’s Network and Service Enablement Business Segment. In this capacity, Govert is responsible for leading and developing effective strategies that enable JDSU to provide its customers with innovative solutions to assist in deploying new services, lowering operational costs, lowering churn, and increasing productivity within communication businesses. His current technical focal areas include NFV/SDN and evolving 5G needs.

Topic: Evolving Service Test and Assurance Needs in Software-Defined/Virtual Networks

Abstract:

Software Defined Networks and Network Function Virtualization are revolutionizing the way communication networks and services are architected, deployed, and maintained. Network and Cloud Service Providers will need new test and assurance strategies to support the emerging needs created by the network cloud.

Chair: Spilios Makris, PhD, Director, Palindrome Technologies, USA and Chairman of the IEEE SRPSDVE Study Group

Spilios Makris is currently the Director of Network Resilience and Business Continuity Management (BCM) at Palindrome Technologies. Spilios has extensive experience in BCM and network resilience, having served as Director and Senior Consultant at Telcordia Technologies (formerly Bellcore) for over 28 years, conducting studies and developing methodologies along with industry Best Practices for over 50 Tier 1&2 telecom companies, telecom vendors, and Telecom Regulatory Authorities (TRAs) worldwide. Spilios has served as Chair, Vice-Chair, and Lead Contributor of the Standards T1A1.2 WG on “Network Survivability Performance” (later renamed the PRQC Reliability Task Force) for 20 years. He successfully managed the development and regular update of Telcordia Generic Reliability Requirements documents, establishing them as the “de facto” industry standards (e.g., SR-332 on Reliability Prediction Procedure for Electronic Equipment).

Spilios is currently serving as the Chair of the IEEE Study Group for Security, Reliability, and Performance for Software Defined and Virtualized Ecosystems (e.g., SDN and NFV) (http://grouper.ieee.org/groups/srpsdv/meeting_information.html).

Spilios received his PhD in Industrial Engineering & Operations Research from the University of Massachusetts at Amherst, Mass., MS in Engineering Management from Northeastern University, Boston, Mass., and Diploma (equiv. to MS) in Electrical & Mechanical Engineering from the National Technical University of Athens, Greece.

He is a Certified Business Continuity Professional (CBCP) by the Disaster Recovery Institute International (DRII) and a Senior Member of IEEE.

Topic: Overview, Issues, and Next Steps for the IEEE SRPSDVE Study Group

Abstract:

This talk will give an overview of the activities of the IEEE Study Group for the Security, Reliability and Performance of Software Defined and Virtualized Ecosystems (SRPSDVE). It will briefly discuss the challenges and hot issues debated in the Group and the decisions that need to be made in the coming weeks (e.g., proposing the formation of IEEE SDN/NFV Working Groups along with possible areas for further study and standardization). Specific examples will be given of possible options and approaches for a future IEEE Working Group that will address SDN/NFV reliability issues.

Dr. Marcus Schöller, Chairman of the ETSI NFV REL Working Group and Professor at Reutlingen University, Germany

Marcus joined the Computer Science Department of Reutlingen University in September 2014 as a full professor for Cloud Computing. Before that, he was a Senior Researcher at NEC Laboratories Europe, Germany, and held a postdoc position at Lancaster University, UK, on self-organizing and resilient networked systems. His interests include security and dependability of networks, systems, and cloud environments, as well as privacy aspects of ICT.

In February 2015, Marcus was elected as Chairman of the "Reliability, Availability, and Assurance (REL)" Working Group within ETSI's ISG on Network Function Virtualization (NFV).

Marcus received his Diploma in Computer Science in 2001 and his Doctorate in Engineering in 2006 from the University of Karlsruhe, Germany (Topic: Robustness and Stability of Programmable Networks). He has been working on multiple European Union (EU) and national projects (e.g., ANA, ResumeNet, SECCRIT) with a focus on network function resilience and critical services on the cloud. He has published his research results in multiple journal and conference papers, books, and patent applications.

Topic: Introduction to ETSI's NFV Reliability and Availability WG

Abstract:

This talk will summarise the main achievements of NFV REL phase 1 before presenting the currently active work items at the REL Working Group covering topics like: (i) scalable architectures for reliability management, (ii) models and features for E2E reliability, and (iii) active monitoring for reliability and availability testing.

Michael Tortorella, PhD, Managing Director, Assured Networks

Dr. Tortorella is a leading communications industry expert in reliability management, engineering, modeling, and life data analysis. Over a 26-year career at Bell Laboratories he was responsible for research and implementations in fundamental system, network, and service reliability engineering methodologies as well as for management of reliability in such critical projects as the SL-280 undersea cable system, the world's first application of fiber-optic technology in an intercontinental, undersea system. He played a major role in many AT&T and Lucent product and service reliability studies, culminating in the creation of CADRE, a reliability modeling system for circuit packs that encompasses circuit simulation, thermal analysis, and uncertainty modeling in a single package fully integrated with computer-aided design systems used for circuit pack creation.

Formerly technical manager and a Distinguished Member of Technical Staff in the Design for Reliability Processes and Technologies Group in Bell Laboratories, Dr. Tortorella is now a research professor of industrial and systems engineering at Rutgers University. In addition to teaching courses in operations research and statistics, he maintains a robust research program that has direct impact on the concerns of the CQR. This program includes investigations into how the stochastic flows in an IP network determine the performance and reliability of services carried on those networks, design for network resiliency, developing modeling frameworks for control of IP networks under stressed conditions, and foundational issues in queueing theory. Additional current research interests include stochastic flows, network performance, management, and control, stochastic processes and their applications to reliability, life data analysis, and next-generation networks, as well as design for reliability methods and technologies. Dr. Tortorella has published extensively in these areas. He received his Ph.D. degree in mathematics from Purdue University in 1973. He is Advisory Editor for Quality Technology and Quantitative Management, where he has worked to increase the number of publications pertaining to the communications industry. His recently written book, Reliability, Maintainability, and Supportability: Best Practices for Systems Engineers, has just been published by John Wiley and Sons.

Topic: The Service Reliability Ecosystem

Abstract:

Telecommunications services have developed widely and rapidly away from the simple, ubiquitous POTS of decades past. The current environment is characterized by a vast diversity of new services, many service providers, and compelling trends in new infrastructures for service delivery. We aim to provide clarity for service users and service providers in the vital area of service reliability so that the telecommunications industry may effectively and profitably deliver services meeting service customer requirements for reliability. Our key points are that (1) service reliability is different from product reliability, (2) it is necessary to abstract the requirements for service reliability from the reliability requirements for the service delivery infrastructure (SDI) and its elements, and (3) the SDI network reliability requirements must be arranged so that the end-user (service purchaser) service reliability requirements will be met.

Warren Volk, Senior-Network Support, Global Network Operations Center (GNOC), AT&T, USA

Warren Volk has spent 24 years working at AT&T, with over 16 years focusing on the function of Network Traffic & Performance Management. He has worked as a Network Management user, and Operations and OSS Planner to deliver tooling with changing technologies. Warren holds a BS in Computer Information System, and an MBA in Management & Strategy.

Topic: Operations Aspects of Reliability: Measurements and Network Management - Applying Legacy Principles to New & Emerging Technology

Abstract:

This discussion will focus on Network Management Principles and data collection which have been utilized by the TDM Switching Network Management community for decades. Examples of Network data measurements and standards for providing this data in vendor generic Operations Support Systems (OSS) will be discussed, along with a comparison to emerging technologies traffic flow. Although new technology offers a significant shift from the architecture of existing TDM networks, the Principles of Network Management and existing data measurements should still be the driving factor to jumpstart Network Management capabilities for protecting the Network and ensuring the best possible traffic completions.

Vice Chair: Tzyh-Jong (TJ) Wang, PhD, AT&T, USA

TJ Wang has been with AT&T since 2008. He is a system engineer for mobility operations support systems focusing on mobility network end-to-end performance and reliability. Prior to joining AT&T, TJ was with DEC, Bellcore, Lucent Technologies and UTStarcom between 1987 and 2008.

He received his Ph.D. in Industrial Engineering from the University of Wisconsin-Madison in 1987; and B.S. in Industrial Engineering from Tsing Hua University, Taiwan, in 1978.

Advisor: Chi-Ming Chen, PhD, AT&T Labs, USA

Chi-Ming Chen joined AT&T in 1995. His current responsibility is the operations support system (OSS) architecture. Prior to joining AT&T, Chi-Ming was with Bell Communications Research (Bellcore) from 1985 to 1995. He was a faculty member at Tsing Hua University, Hsinchu, Taiwan from 1975 to 1979.

He received his Ph.D. in Computer and Information Science from the University of Pennsylvania in 1985; M.S. in Computer Science from the Pennsylvania State University in 1981; M.S. and B.S. in Physics from Tsing Hua University, Taiwan, in 1973 and 1971 respectively.

Chi-Ming Chen is a Life Senior Member of IEEE and Senior Member of ACM. He is an Advisory Board Member of IEEE Communications Society (ComSoc) Technical Committee on Communications Quality & Reliability (CQR), a member of the IEEE GLOBECOM & ICC Management & Strategy (GIMS) Standing Committee, and a member of the Industry Content and Exhibits Committee (ICEC). He has chaired several GLOBECOM and ICC Industry Forums and served as an IF&E (Industry Forum & Exhibits) Advisor for GLOBECOM 2014 and ICC 2015.


Last updated on Sunday, December 20, 2015