
List of works by Mateo Valero Cortés

A Complexity-Effective Simultaneous Multithreading Architecture

scholarly article

A Comprehensive Analysis of Indirect Branch Prediction

A Cost-Effective Architecture for Vectorizable Numerical and Multimedia Applications

A DRAM/SRAM memory scheme for fast packet buffers

A Decoupled KILO-Instruction Processor

A Quantitative Analysis of OS Noise

A Simulation Framework to Automatically Analyze the Communication-Computation Overlap in Scientific Applications

A Two-Level Load/Store Queue Based on Execution Locality

A Vector-µSIMD-VLIW Architecture for Multimedia Applications

A block algorithm and optimal fixed-size systolic array processor for the algebraic path problem

scholarly article by Fernando J. Núñez & Mateo Valero Cortés published October 1989 in Journal of Signal Processing Systems

A case for resource-conscious out-of-order processors

A data cache with multiple caching strategies tuned to different types of locality

A distributed processor state management architecture for large-window processors

scholarly article published November 2008

A dynamic scheduler for balancing HPC applications

A method for implementation of one-dimensional systolic algorithms with data contraflow using pipelined functional units

scholarly article by Miguel Valero-García et al. published February 1992 in Journal of Signal Processing Systems

A new pointer-based instruction queue design and its power-performance evaluation

A performance characterization of high definition digital video decoding using H.264/AVC

A simple speculative load control mechanism for energy saving

A survey of dual data cache systems


AMMC: Advanced Multi-Core Memory Controller


scholarly article

Adapting cache partitioning algorithms to pseudo-LRU replacement policies

Advanced Pattern based Memory Controller for FPGA based HPC applications

Align and distribute-based linear loop transformations

Alya: Multiphysics engineering simulation toward exascale

An Abstraction Methodology for the Evaluation of Multi-core Multi-threaded Architectures

scholarly article published July 2011

An Analyzable Memory Controller for Hard Real-Time CMPs

An ISA Comparison Between Superscalar and Vector Processors

An MPEG-4 performance study for non-SIMD, general purpose architectures

An overview of selected hybrid and reconfigurable architectures

Analyzing reference patterns in automatic data distribution tools

Analyzing the Efficiency of L1 Caches for Reliable Hybrid-Voltage Operation Using EDC Codes

Architectural Support for Fair Reader-Writer Locking

scholarly article published December 2010

Architecture Performance Prediction Using Evolutionary Artificial Neural Networks

scholarly article by P. A. Castillo et al. published 2008 in Lecture Notes in Computer Science

Assessing Accelerator-Based HPC Reverse Time Migration

Asynchronous and Exact Forward Recovery for Detected Errors in Iterative Solvers

scholarly article by Luc Jaulmes et al. published 1 September 2018 in IEEE Transactions on Parallel and Distributed Systems

Atomic quake

Author retrospective for software trace cache

Automatic Exploration of Potential Parallelism in Sequential Applications

BSC Vision Towards Exascale

BSLD threshold driven power management policy for HPC centers

Balanced loop partitioning using GTS

Balancing HPC applications through smart allocation of resources in MT processors

Bandwidth of Crossbar and Multiple-Bus Connections for Multiprocessors

Better Branch Prediction Through Prophet/Critic Hybrids

Big Data Processing: Data Flow vs Control Flow (New Benchmarking Methodology)

Branch classification to control instruction fetch in simultaneous multithreaded architectures

Branch predictor guided instruction decoding

Breaking the bandwidth wall in chip multiprocessors

CATA: Criticality Aware Task Acceleration for Multicore Processors

CODOMs: Protecting software with Code-centric memory Domains

CPU Accounting for Multicore Processors

scholarly article published in 2012

CPU Accounting in CMP Processors

Carotid-radial pulse wave velocity as a discriminator of intrinsic wall alterations during evaluation of endothelial function by flow-mediated dilatation

scholarly article published in 2011

Chained In-Order/Out-of-Order DoubleCore Architecture

scholarly article

Chairmen's introduction

Characterizing Power and Temperature Behavior of POWER6-Based System

Characterizing the Communication Demands of the Graph500 Benchmark on a Commodity Cluster

Characterizing the resource-sharing levels in the UltraSPARC T2 processor

Circuit design of a dual-versioning L1 data cache

Circuit design of a dual-versioning L1 data cache for optimistic concurrency

Clock gate on abort: Towards energy-efficient hardware Transactional Memory

Code Semantic-Aware Runahead Threads

Code layout optimizations for transaction processing workloads

Coherence protocol for transparent management of scratchpad memories in shared memory manycore architectures

Comparing last-level cache designs for CMP architectures

article published in 2010

Conflict-free access for streams in multimodule memories

Conflict-free access to streams in multiprocessor systems

Contention-Based Nonminimal Adaptive Routing in High-Radix Networks


Control-Flow Independence Reuse via Dynamic Vectorization

Cost effective memory disambiguation for multimedia codes

Cost-conscious strategies to increase performance of numerical programs on aggressive VLIW architectures

Cost-effective compiler directed memory prefetching and bypassing

Criticality-Aware Dynamic Task Scheduling for Heterogeneous Architectures


DReAM: Per-Task DRAM Energy Metering in Multicore Systems

DTM, DVS, optimal control

DeTrans: Deterministic and Parallel execution of Transactions

article published in 2014

Debugging programs that use atomic blocks and transactional memory


Design and implementation of high-performance memory systems for future packet buffers

Designing OS for HPC Applications: Scheduling

Determinism at Standard-Library Level in TM-Based Applications

Discovering and understanding performance bottlenecks in transactional applications

Dynamic Cache Partitioning Based on the MLP of Cache Misses

Dynamic Memory Instruction Bypassing

Dynamic transaction coalescing

Dynamic-vector execution on a general purpose EDGE chip multiprocessor

Dynamically Filtering Thread-Local Variables in Lazy-Lazy Hardware Transactional Memory

EVX: Vector execution on low power EDGE cores

Early 21st Century processors


EcoTM: Conflict-aware Economical Unbounded Hardware Transactional Memory

Effective Instruction Prefetching via Fetch Prestaging

Effective communication and computation overlap with hybrid MPI/SMPSs

scholarly article published 2010


Efficient Routing Mechanisms for Dragonfly Networks

Efficient Sorting on the Tilera Manycore Architecture

Efficient runahead threads

Eliminating cache conflict misses through XOR-based placement functions

Emergent Behaviors in the Internet of Things: The Ultimate Ultra-Large-Scale System

Enabling preemptive multiprogramming on GPUs

Energy-Aware Accounting and Billing in Large-Scale Computing Facilities

Enhancing the performance of assisted execution runtime systems through hardware/software techniques

Evaluating kilo-instruction multiprocessors

Evaluating the Impact of OpenMP 4.0 Extensions on Relevant Parallel Workloads

Evaluating the Impact of TLB Misses on Future HPC Systems

Evaluation of vectorization potential of Graph500 on Intel's Xeon Phi

Evolutionary system for prediction and optimization of hardware architecture performance

Explaining Dynamic Cache Partitioning Speed Ups

Exploiting Inactive Rename Slots for Detecting Soft Errors

Exploiting asynchrony from exact forward recovery for DUE in iterative solvers

Exploiting instruction- and data-level parallelism

Exploring pattern-aware routing in generalized fat tree networks


FIMSIM: A fault injection infrastructure for microarchitectural simulators

Fair CPU time accounting in CMP+SMT processors

Fast Speculative Address Generation and Way Caching for Reducing L1 Data Cache Energy

Fetching instruction streams

Fine- and Coarse-Grain Reconfigurable Computing



From Plasma to BeeFarm: Design Experience of an FPGA-Based Multicore Prototype

Future Vector Microprocessor Extensions for Data Aggregations

Fuzzy Memoization for Floating-Point Multimedia Applications

General Purpose Task-Dependence Management Hardware for Task-Based Dataflow Programming Models

Global misrouting policies in two-level hierarchical networks

Guest Editors' Introduction Cache Memory And Related Problems: Enhancing And Exploiting The Locality

HD-VideoBench. A Benchmark for Evaluating High Definition Digital Video Applications

HPC System Software for Regular and Irregular Parallel Applications

scholarly article published May 2013

Hardware Round-Robin Scheduler for Single-ISA Asymmetric Multi-core

Hardware Transactional Memory with Operating System Support, HTMOS

Hardware schemes for early register release

Hardware support for WCET analysis of hard real-time multicore systems

Hardware support for accurate per-task energy metering in multicore systems

Hardware transactional memory with software-defined conflicts

Hierarchical clustered register file organization for VLIW processors

Hybrid Cache Designs for Reliable Hybrid High and Ultra-Low Voltage Operation

Hybrid Transactional Memory with Pessimistic Concurrency Control

Hybrid high-performance low-power and ultra-low energy reliable caches

IA^3: An Interference Aware Allocation Algorithm for Multicore Hard Real-Time Systems

ITCA: Inter-task Conflict-Aware CPU Accounting for CMPs

Identifying Critical Code Sections in Dataflow Programming Models

Implementing kilo-instruction multiprocessors

Implicit vs. explicit resource allocation in SMT processors

article published in 2004

Imposing coarse-grained reconfiguration to general purpose processors

Improving Accuracy and Speeding Up Document Image Classification Through Parallel Systems

publication published on 15 June 2020

Improving Cache Management Policies Using Dynamic Reuse Distances

Increasing multicore system efficiency through intelligent bandwidth shifting

Initial Results on Fuzzy Floating Point Computation for Multimedia Processors

Instruction fetch architectures and code layout optimizations

article published in 2001

Interconnection Networks in Petascale Computer Systems

Joint Circuit-System Design Space Exploration of Multiplier Unit Structure for Energy-Efficient Vector Processors

Kernel-to-User-Mode Transition-Aware Hardware Scheduling

Kilo-instruction processors, runahead and prefetching

LPA: A First Approach to the Loop Processor Architecture

scholarly article

Late allocation and early release of physical registers

Latency tolerant branch predictors

Levels and rates of change in carotid-radial pulse wave velocity associated with reactive hyperaemia: Analysis of the dependence on transient ischemia length

scholarly article published in 2010

Lifetime-sensitive modulo scheduling in a production environment

Linear programming based parallel job scheduling for power constrained systems

scholarly article published July 2011

Load balancing using dynamic cache allocation

Long DNA Sequence Comparison on Multicore Architectures

Loop parallelization: revisiting framework of unimodular transformations

M-users B-servers arbiter for multiple-busses multiprocessors

MAPC: Memory access pattern based controller

MFLUSH: Handling Long-Latency Loads in SMT On-Chip Multiprocessors

scholarly article published September 2008

MLP-Aware Dynamic Cache Partitioning

MUSA: A Multi-level Simulation Approach for Next-Generation HPC Machines

Measuring Operating System Overhead on CMT Processors

Message from General Chairs

Message from the Program Chair

Message from the Program Co-Chairs

Microarchitectural Support for Speculative Register Renaming

Modulo scheduling with reduced register pressure

Moving from petaflops to petadata

MultiLayer processing - an execution model for parallel stateful packet processing


Multicore Resource Management

Multicore: The View from Europe

Multithreaded software transactional memory and OpenMP

Nebelung: Execution Environment for Transactional OpenMP

Network unfairness in dragonfly topologies

New Benchmarking Methodology and Programming Model for Big Data Processing


Novel SRAM bias control circuits for a low power L1 data cache


OFAR-CM: Efficient Dragonfly Networks with Simple Congestion Management


Oblivious routing schemes in extended generalized Fat Tree networks


On the Problem of Evaluating the Performance of Multiprogrammed Workloads

On the Problem of Minimizing Workload Execution Time in SMT Processors

On the Scalability of 1- and 2-Dimensional SIMD Extensions for Multimedia Applications

On the maturity of parallel applications for asymmetric multi-core processors

On the selection of adder unit in energy efficient vector processing

On the simulation of large-scale architectures using multiple application abstraction levels

article published in 2012

On-the-Fly Adaptive Routing in High-Radix Hierarchical Networks

scholarly article published September 2012

On-the-fly adaptive routing for dragonfly interconnection networks


Online Prediction of Applications Cache Utility

Optimal task assignment in multithreaded processors

Optimizing job performance under a given power constraint in HPC centers

Overlapping communication and computation by using a hybrid MPI/SMPSs approach

scholarly article published 2010

PAMS: Pattern Aware Memory System for embedded systems




PVMC: Programmable Vector Memory Controller

Parallel job scheduling for power constrained HPC systems

article by M. Etinski et al. published December 2012 in Parallel Computing

Parallel processing in biological sequence comparison using general purpose processors

Partitioning: An Essential Step in Mapping Algorithms Into Systolic Array Processors

Per-task Energy Accounting in Computing Systems

Performance Analysis of Sequence Alignment Applications

Performance Analysis of a New Packet Trace Compressor based on TCP Flow Clustering

Performance Impact of Unaligned Memory Operations in SIMD Extensions for Video Codec Applications

Performance analysis of a hardware accelerator of dependence management for task-based dataflow programming models

Performance and Energy Efficient Hardware-Based Scheduler for Symmetric/Asymmetric CMPs

Performance, Power Efficiency and Scalability of Asymmetric Cluster Chip Multiprocessors

Physical vs. Physically-Aware Estimation Flow: Case Study of Design Space Exploration of Adders

Picos, A Hardware Task-Dependence Manager for Task-Based Dataflow Programming Models

Picos: A hardware runtime architecture support for OmpSs

Power and performance aware reconfigurable cache for CMPs

Power and thermal characterization of POWER6 system

Power-aware load balancing of large scale MPI applications

Power-efficient VLIW design using clustering and widening

Predictable performance in SMT processors: synergy between the OS and SMTs

Preliminary Analysis of the Cell BE Processor Limitations for Sequence Alignment Applications

Profile-guided transaction coalescing—lowering transactional overheads by merging transactions

Profiling and Optimizing Transactional Memory Applications

Programmability and portability for exascale: Top down programming methodology and tools with StarSs

QoS for high-performance SMT processors in embedded systems


Quantifying the Potential Task-Based Dataflow Parallelism in MPI Applications

Quantitative analysis of sequence alignment applications on multiprocessor architectures

scholarly article published 2009



scholarly article published in 2011

RVC-based time-predictable faulty caches for safety-critical systems

Rapid Development of Error-Free Architectural Simulators Using Dynamic Runtime Testing


Reducing Cache Coherence Traffic with Hierarchical Directory Cache and NUMA-Aware Runtime Scheduling

Reducing fetch architecture complexity using procedure inlining

Reduction of Connections for Multibus Organization

article published in 1983

Register constrained modulo scheduling

article by J. Zalamea et al. published May 2004 in IEEE Transactions on Parallel and Distributed Systems

Registers Size Influence on Vector Architectures

Reimagining Heterogeneous Computing: A Functional Instruction-Set Architecture Computing Model

Reimagining Heterogeneous Computing: a Functional Instruction Set Architecture (F-ISA) Computing Model

Resource-bounded multicore emulation using Beefarm

Runahead Threads to improve SMT performance

Runtime-Aware Architectures

Runtime-Guided Management of Scratchpad Memories in Multicore Architectures

Runtime-Guided Mitigation of Manufacturing Variability in Power-Constrained Multi-Socket NUMA Nodes

scholarly article published 2016

SMT Malleability in IBM POWER5 and POWER6 Processors

STM2: A Parallel STM for High Performance Simultaneous Multithreading Systems

Scalability Analysis of Progressive Alignment on a Multicore

Scalability of Macroblock-level Parallelism for H.264 Decoding

Scalable multicore architectures for long DNA sequence comparison

Scaling Irregular Applications through Data Aggregation and Software Multithreading

Selection of the Register File Size and the Resource Allocation Policy on SMT Processors

Sensible Energy Accounting with Abstract Metering for Multicore Systems

Simulating Whole Supercomputer Applications

Simulation environment for studying overlap of communication and computation

Soft Real-Time Scheduling on SMT Processors with Explicit Resource Allocation

Software Trace Cache

Software Trace Cache for Commercial Applications

scholarly article

Software and Hardware Techniques to Optimize Register File Utilization in VLIW Architectures

Software trace cache

Software-Controlled Priority Characterization of POWER5 Processor

Spark deployment and performance evaluation on the MareNostrum supercomputer

Speculative early register release

Speculative execution for hiding memory latency

Stand-Alone Memory Controller for Graphics System

Studying New Ways for Improving Adaptive History Length Branch Predictors

Supercomputing for the Future, Supercomputing from the Past (Keynote)

Supercomputing with commodity CPUs

SymptomTM: Symptom-Based Error Detection and Recovery Using Hardware Transactional Memory

article published in 2011

TERAFLUX: Harnessing dataflow in next generation teradevices

TM-dietlibc: A TM-aware Real-World System Library

TMbox: A Flexible and Reconfigurable 16-Core Hybrid Transactional Memory System

Taking the heat off transactions: Dynamic selection of pessimistic concurrency control

Task Superscalar: An Out-of-Order Task Pipeline

The Impact of Application's Micro-Imbalance on the Communication-Computation Overlap

The International Exascale Software Project roadmap

The International Exascale Software Project: a Call To Cooperative Action By the Global High-Performance Community

article published in 2009

The Mont-Blanc Prototype: An Alternative Approach for HPC Systems

The Network Adapter: The Missing Link between MPI Applications and Network Performance


The Problem of Evaluating CPU-GPU Systems with 3D Visualization Applications

scholarly article by Javier Verde et al. published November 2012 in IEEE Micro

The TERAFLUX Project: Exploiting the DataFlow Paradigm in Next Generation Teradevices

The impact of traffic aggregation on the memory performance of networking applications

The limits of software transactional memory (STM)

Thread Assignment in Multicore/Multithreaded Processors: A Statistical Approach

Thread Assignment of Multithreaded Network Applications in Multicore/Multithreaded Processors

Thread Lock Section-Aware Scheduling on Asymmetric Single-ISA Multi-Core

Thread to Core Assignment in SMT On-Chip Multiprocessors

Thread to strand binding of parallel network applications in massive multi-threaded systems


Three-dimensional memory vectorization for high bandwidth media memory systems

Throughput Unfairness in Dragonfly Networks under Realistic Traffic Patterns

Trace filtering of multithreaded applications for CMP memory simulation

Trace-driven simulation of multithreaded applications

Transactional Memory and OpenMP

Transactional Memory: An Overview

Trends and techniques for energy efficient architectures

scholarly article published September 2010

Turbocharging boosted transactions or

Understanding the future of energy-performance trade-off via DVFS in HPC environments

Using Dynamic Runtime Testing for Rapid Development of Architectural Simulators


Using a Reconfigurable L1 Data Cache for Efficient Version Management in Hardware Transactional Memory

Utilization driven power-aware parallel job scheduling

VSR sort: A novel vectorised sorting algorithm & architecture extensions for future microprocessors

Vector Extensions for Decision Support DBMS Acceleration

article published in 2012

Vectorized AES Core for High-throughput Secure Environments

Vitruvius+: An Area-Efficient RISC-V Decoupled Vector Coprocessor for High Performance Computing Applications

scholarly article published in 2022

Workshop 20 introduction: Workshop on multithreaded architectures and applications - MTAAP’08


A latency-conscious SMT branch prediction architecture

scholarly article published in 2004

A low-complexity fetch architecture for high-performance superscalar processors

scholarly article published in 2004

DIA: A Complexity-Effective Decoding Architecture

scholarly article published in 2009

Enlarging Instruction Streams

scholarly article published in 2007

Introducing Kilo-instruction Multiprocessors


Kilo-Instruction Processors: Overcoming the Memory Wall

scholarly article published in 2005

Selecting Where to simulate SPEC2000 Using Stream Analysis


Stream Predictor Guided Instruction Decoding


Toward kilo-instruction processors

scholarly article published in 2004