WHAT YOU DO AT AMD CHANGES EVERYTHING
At AMD, our mission is to build great products that accelerate next-generation computing experiences—from AI and data centers, to PCs, gaming and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity and a shared passion to create something extraordinary. When you join AMD, you’ll discover the real differentiator is our culture. We push the limits of innovation to solve the world’s most important challenges—striving for execution excellence, while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond. Together, we advance your career.
THE ROLE:
The Quality Engineering team is looking for an experienced Failure Analysis Engineer focused on Power and Thermal, with strong expertise in power behavior, thermal analysis, liquid-cooling performance, failure isolation, and rail bring-up. This individual will support customer and factory failure investigations for GPU accelerators, with primary ownership of PCB triage and board-level fault isolation for power- and thermal-related issues. They will review schematics and layouts to develop targeted debug strategies, set up scope measurements and diagnostics, run functional test DOEs to reproduce and isolate failures, and work closely with design, validation, FW, and manufacturing teams to accelerate root cause analysis and corrective actions. Their contributions will directly impact product quality, reliability, and customer satisfaction.
THE PERSON:
The ideal candidate is a hands-on engineer with a strong hardware foundation and deep experience in power- and thermal-related failure analysis, debug, and board bring-up. They bring a strong analytical mindset and are skilled at triaging complex PCB failures by narrowing issues to the board, component, rail, thermal condition, cooling behavior, or system interaction level. They are comfortable reviewing schematics, setting up scope captures, running diagnostics, and designing functional test DOEs to reproduce and isolate hard-to-find failures, while working effectively across design, validation, manufacturing, and repair teams. A strong understanding of liquid-cooling fundamentals (including flow rates, heat dissipation, and thermal transfer behavior) is important for this role. Their communication and documentation skills enable clear reporting and collaboration, and their curiosity and persistence help drive timely, high-quality root cause analysis and corrective actions.
KEY RESPONSIBILITIES:
Support internal and external requests to troubleshoot AMD GPU product failures with primary focus on Power and Thermal failure analysis, PCB triage, and board-level failure isolation for continuous yield, quality, and customer support improvements.
Develop and execute diagnostics, scope-based measurements, and functional test DOEs to reproduce, characterize, and isolate difficult board-, power-, and thermal-related failures.
Develop automation and tools to run tests and analyze results and logs.
Perform structured PCB triage by narrowing failures to the board, component, power rail, layout interaction, or system integration level, and work with the contract manufacturer and internal AMD teams to reproduce failures, isolate root cause, and determine the most effective next steps for debug and corrective action.
Use schematics, layout data, lab measurements, and power/thermal behavior knowledge to understand system behavior, trace likely fault paths, form debug hypotheses, and build targeted validation plans that drive efficient fault isolation and high-quality failure analysis.
Analyze liquid-cooling performance and thermal dissipation behavior, including flow-related conditions, heat transfer effectiveness, and cooling-path anomalies, to support efficient root cause analysis of Power and Thermal failures.
Document all findings in the FA database and, as needed, create a complete failure analysis report for customers.
Present findings to key stakeholders, including senior management.
Continuously improve failure analysis processes and techniques, and document them as step-by-step procedures.
Oversee the set-up of new products and test stations for Failure Analysis operations.
PREFERRED EXPERIENCE:
Deep expertise in hardware debug fundamentals, Power and Thermal failure analysis, diagnostics, and structured failure reproduction, including functional test development.
Skilled in using lab equipment (oscilloscopes, logic analyzers, power analyzers, and custom test tools) to capture scope shots, validate rails, characterize thermal behavior, and support hardware debug.
Working knowledge of liquid-cooling systems and thermal management concepts, including flow rates, heat dissipation, thermal transfer, and cooling loop behavior in high-power server environments.
Strong background in PCB triage, board-level failure analysis, schematic review, diagnostics, and failure isolation techniques, with added experience debugging power and thermal issues from NPI through production.
Proficient in Python, shell scripting, and working across Windows and Linux environments.
Solid understanding of firmware, drivers, and hardware interactions, with the ability to tune firmware as needed.
Extensive experience in hardware verification and system integration.
Hands-on experience assembling, installing, and configuring computer systems and servers.
Strong communication, documentation, collaboration, and presentation skills.
Able to read schematics, interpret datasheets, identify components, and perform soldering/rework to support efficient hardware debug and failure isolation.
Knowledge of high-speed digital design, power delivery networks, voltage regulator behavior, memory interfaces (HBM, GDDR), PCIe, and display outputs (DP, HDMI).
Experience with GPU data center infrastructure and AI/ML technologies is a plus.
ACADEMIC CREDENTIALS:
Bachelor’s degree in Electrical Engineering, Computer Engineering, or a related field.
LOCATION:
Secaucus, NJ
This role is not eligible for visa sponsorship.
Benefits offered are described in AMD benefits at a glance.
AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants’ needs under the respective laws throughout all stages of the recruitment and selection process.
AMD may use Artificial Intelligence to help screen, assess or select applicants for this position. AMD’s “Responsible AI Policy” is available here.
This posting is for an existing vacancy.
