Requirements
- Target platform: OpenClaw
- Install method: Manual import
- Extraction: Extract archive
- Prerequisites: OpenClaw
- Primary doc: SKILL.md
Manage Hadoop clusters with HDFS operations, YARN job tuning, and distributed processing diagnostics.
Rather than working through the steps yourself, hand the extracted package to your coding agent with a concrete install brief.
For a fresh install:

> I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete.

For an upgrade:

> I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run.
If ~/hadoop/ doesn't exist or is empty, read setup.md and start the conversation naturally.
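A minimal sketch of that first-run check, assuming a POSIX shell; the path and file name come straight from this skill's layout:

```bash
#!/usr/bin/env bash
# First-run check: if the memory directory is missing or empty,
# fall back to the setup flow described in setup.md.
MEMORY_DIR="$HOME/hadoop"

if [ ! -d "$MEMORY_DIR" ] || [ -z "$(ls -A "$MEMORY_DIR" 2>/dev/null)" ]; then
  echo "No memory found at $MEMORY_DIR -- run the setup flow in setup.md"
else
  echo "Memory present at $MEMORY_DIR"
fi
```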
User works with Hadoop ecosystem (HDFS, YARN, MapReduce, Hive). Agent handles cluster diagnostics, job optimization, storage management, and troubleshooting distributed processing failures.
Memory lives in ~/hadoop/. See memory-template.md for structure.

```
~/hadoop/
├── memory.md        # Cluster configs, common issues, preferences
├── clusters/        # Per-cluster notes and configs
│   └── {name}.md    # Specific cluster context
└── scripts/         # Custom diagnostic scripts
```
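To bootstrap that layout by hand, a sketch along these lines works; the directory names mirror the tree above, but the seeded headings are placeholders (the real structure lives in memory-template.md):

```bash
#!/usr/bin/env bash
# Create the skill's memory layout under ~/hadoop/.
mkdir -p "$HOME/hadoop/clusters" "$HOME/hadoop/scripts"

# Seed memory.md only if it does not exist yet; headings are placeholders.
[ -f "$HOME/hadoop/memory.md" ] || cat > "$HOME/hadoop/memory.md" <<'EOF'
# Hadoop memory
## Cluster configs
## Common issues
## Preferences
EOF
```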
| Topic | File |
| --- | --- |
| Setup process | setup.md |
| Memory template | memory-template.md |
| HDFS operations | hdfs.md |
| YARN tuning | yarn.md |
| Troubleshooting | troubleshooting.md |
Before any operation, check cluster health:

```bash
hdfs dfsadmin -report
yarn node -list
```

Never assume the cluster is healthy. A single dead DataNode changes everything.
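A pre-flight sketch that combines both checks. The grep patterns assume the report format of recent Hadoop releases, so verify them against your version before trusting the result:

```bash
#!/usr/bin/env bash
# Pre-flight: flag dead DataNodes and unhealthy/lost NodeManagers.
# Output parsing is version-dependent; adjust patterns for your release.
dead_dn=$(hdfs dfsadmin -report | grep -c '^Dead datanodes' || true)
bad_nm=$(yarn node -list -all | grep -Ec 'UNHEALTHY|LOST' || true)

if [ "$dead_dn" -gt 0 ] || [ "$bad_nm" -gt 0 ]; then
  echo "Cluster not healthy: investigate before running jobs" >&2
  exit 1
fi
echo "Basic health checks passed"
```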
HDFS issues cascade into job failures. Always check:

```bash
hdfs dfs -df -h              # Capacity
hdfs fsck / -files -blocks   # Block health
```

A job failing with "No space left" is a storage problem, not a code problem.
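A small sketch that turns the capacity check into an alarm. The 85% threshold is arbitrary, and the Use% column position is an assumption about `hdfs dfs -df` output on your version:

```bash
#!/usr/bin/env bash
# Warn when HDFS utilization crosses a threshold.
# Assumed column layout: Filesystem  Size  Used  Available  Use%
THRESHOLD=85

usage=$(hdfs dfs -df | awk 'NR==2 {gsub(/%/,"",$5); print $5}')
if [ "${usage:-0}" -ge "$THRESHOLD" ]; then
  echo "HDFS usage at ${usage}% -- expect 'No space left' job failures" >&2
fi
```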
YARN allocates based on the configured scheduler. Know which is active:

```bash
yarn rmadmin -getServiceState rm1               # HA: which RM is active
grep scheduler /etc/hadoop/conf/yarn-site.xml   # Which scheduler is configured
```

The default (Capacity) and Fair schedulers behave very differently.
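A sketch that reports which scheduler class is configured. It assumes the standard config path and uses naive grep-based XML parsing; on vanilla Apache Hadoop an unset value means the Capacity scheduler:

```bash
#!/usr/bin/env bash
# Report the configured scheduler class from yarn-site.xml.
CONF=/etc/hadoop/conf/yarn-site.xml

sched=$(grep -A1 'yarn.resourcemanager.scheduler.class' "$CONF" 2>/dev/null \
        | grep -o '<value>[^<]*</value>' | sed 's/<[^>]*>//g')

case "$sched" in
  *CapacityScheduler*) echo "Capacity scheduler active" ;;
  *FairScheduler*)     echo "Fair scheduler active" ;;
  *) echo "Not set explicitly; Apache default is the Capacity scheduler" ;;
esac
```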
Default replication is 3. For temp data, suggest 1-2 to save space:

```bash
hdfs dfs -setrep -w 1 /tmp/scratch/
```

For critical data, verify replication is honored:

```bash
hdfs fsck /data/critical -files -blocks -replicaDetails
```
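To catch files that have drifted below their target replication cluster-wide, a sketch that leans on the fsck summary (the "Under-replicated blocks" label is what recent fsck versions print; treat it as an assumption):

```bash
#!/usr/bin/env bash
# Summarize under-replicated blocks cluster-wide. Run during quiet
# hours: fsck on a busy cluster has a real performance cost.
hdfs fsck / 2>/dev/null | grep -i 'under-replicated' \
  || echo "No under-replicated summary found; check your fsck output format"
```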
Hadoop logs scatter across machines. Key locations:

| Component | Log Path |
| --- | --- |
| NameNode | /var/log/hadoop-hdfs/hadoop-hdfs-namenode-*.log |
| DataNode | /var/log/hadoop-hdfs/hadoop-hdfs-datanode-*.log |
| ResourceManager | /var/log/hadoop-yarn/yarn-yarn-resourcemanager-*.log |
| NodeManager | /var/log/hadoop-yarn/yarn-yarn-nodemanager-*.log |
| Application | yarn logs -applicationId <app_id> |
The NameNode enters safe mode on startup or when too few blocks have been reported:

```bash
hdfs dfsadmin -safemode get     # Check status
hdfs dfsadmin -safemode leave   # Exit (if blocks OK)
```

Never force-leave if blocks are actually missing.
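A guarded-exit sketch: leave safe mode only when fsck reports zero missing blocks. The "Missing blocks" summary line is standard fsck output, but confirm the exact wording on your version:

```bash
#!/usr/bin/env bash
# Leave safe mode only if no blocks are missing.
missing=$(hdfs fsck / 2>/dev/null \
  | awk '/Missing blocks:/ {gsub(/[^0-9]/,"",$NF); print $NF; exit}')

if [ "${missing:-0}" -eq 0 ]; then
  hdfs dfsadmin -safemode leave
else
  echo "Refusing to leave safe mode: $missing missing blocks" >&2
  exit 1
fi
```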
90% of "job killed" issues are memory:

```
# Container settings
yarn.nodemanager.resource.memory-mb    # Total per node
yarn.scheduler.minimum-allocation-mb   # Min container
mapreduce.map.memory.mb                # Map task
mapreduce.reduce.memory.mb             # Reduce task
```

Check these before assuming the code is wrong.
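A quick sketch that pulls the current values out of the config files. The grep/sed XML handling is naive and the config path is an assumption; adjust for your distribution:

```bash
#!/usr/bin/env bash
# Print current memory-related settings from the Hadoop config dir.
CONF_DIR=/etc/hadoop/conf

for key in yarn.nodemanager.resource.memory-mb \
           yarn.scheduler.minimum-allocation-mb \
           mapreduce.map.memory.mb \
           mapreduce.reduce.memory.mb; do
  val=$(grep -rA1 "<name>$key</name>" "$CONF_DIR" 2>/dev/null \
        | grep -o '<value>[^<]*</value>' | sed 's/<[^>]*>//g' | head -1)
  printf '%-42s %s\n' "$key" "${val:-<not set, default applies>}"
done
```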
```bash
# Navigation
hdfs dfs -ls /path
hdfs dfs -du -h /path              # Size with human units
hdfs dfs -count -q /path           # Quota info

# Data movement
hdfs dfs -put local.txt /hdfs/     # Upload
hdfs dfs -get /hdfs/file.txt .     # Download
hdfs dfs -cp /src /dst             # Copy within HDFS
hdfs dfs -mv /src /dst             # Move within HDFS

# Maintenance
hdfs dfs -rm -r /path              # Delete (trash)
hdfs dfs -rm -r -skipTrash /path   # Delete (permanent)
hdfs dfs -expunge                  # Empty trash
```
```bash
# Find corrupt blocks
hdfs fsck / -list-corruptfileblocks

# Delete corrupt file (after confirming unrecoverable)
hdfs fsck /path/file -delete

# Force replication
hdfs dfs -setrep -w 3 /important/data/
```
```bash
# List applications
yarn application -list                   # Running
yarn application -list -appStates ALL   # All states

# Application details
yarn application -status <app_id>

# Kill stuck application
yarn application -kill <app_id>

# Get logs (after completion)
yarn logs -applicationId <app_id>
yarn logs -applicationId <app_id> -containerId <container_id>
```
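One way to surface kill candidates: flag RUNNING applications stuck at low progress. Column positions in `yarn application -list` output vary by version (Progress is assumed to be field 8, and application names containing spaces will break the split), so confirm with `-status` before killing anything:

```bash
#!/usr/bin/env bash
# Flag RUNNING applications below 10% progress. Low progress can mean
# a job that just started, not only a stuck one -- verify before -kill.
yarn application -list -appStates RUNNING 2>/dev/null \
  | awk 'NR>2 {gsub(/%/,"",$8); if ($8+0 < 10) print $1, "progress:", $8"%"}'
```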
```bash
# List queues
yarn queue -list

# Queue status
yarn queue -status <queue_name>

# Move application between queues
yarn application -movetoqueue <app_id> -queue <target_queue>
```
- Deleting without -skipTrash on a full cluster → Trash still uses space, cluster stays full
- Setting container memory below JVM heap → Instant container kill, confusing errors
- Ignoring speculative execution on slow jobs → Wastes resources on duplicated tasks
- Running fsck on a busy cluster → Performance impact, run during maintenance
- Assuming HDFS = POSIX semantics → No append-in-place, no random writes
- Forgetting timezone in scheduling → Oozie/Airflow jobs fire at wrong times
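The first mistake is cheap to check for. A sketch that reports the current user's trash usage, assuming the default trash location under the HDFS home directory:

```bash
#!/usr/bin/env bash
# Report how much space the current user's HDFS trash is holding.
# Default location is <hdfs-home>/.Trash; adjust if your site differs.
hdfs dfs -du -s -h "/user/$USER/.Trash" 2>/dev/null \
  || echo "No trash directory found for $USER"

# To reclaim the space immediately (skips the retention interval):
# hdfs dfs -expunge
```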
Data that stays local:
- Cluster notes saved in ~/hadoop/clusters/
- Preferences and environment context

What commands access:
- hdfs/yarn commands connect to your Hadoop cluster
- Some commands read system paths (/var/log, /etc/hadoop/conf)
- Destructive commands require explicit user confirmation

This skill does NOT:
- Store credentials (use kinit/keytab separately)
- Make external API calls beyond your cluster
- Run destructive commands without asking first
Install with `clawhub install <slug>` if the user confirms:
- linux → system administration
- docker → containerized deployments
- bash → shell scripting
If this skill is useful: `clawhub star hadoop`
Stay updated: `clawhub sync`